Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: How to detect outliers
From: Steve Samuels <[email protected]>
To: [email protected]
Subject: Re: st: How to detect outliers
Date: Mon, 11 Feb 2013 17:51:12 -0500
Identifying outliers on the basis of a least squares fit is a very bad
idea, however popular (Hampel et al., 1986). A far superior approach in
Stata is the robust regression package -mmregress- by Verardi and Croux
(-findit-). In providing a resistant fit, -mmregress- also identifies
outliers and high leverage points.
Verardi, V., and C. Croux. 2009. Robust regression in Stata. Stata
Journal 9, no. 3: 439-453.
Hampel, Frank, Elvezio Ronchetti, Peter Rousseeuw, and Werner Stahel.
1986. Robust Statistics: The Approach Based on Influence Functions
(Wiley Series in Probability and Mathematical Statistics). New York:
John Wiley and Sons.
Steve
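
A minimal sketch of trying -mmregress- on one period of the data discussed in this thread. The variable names (Y, X1-X4, Period) come from the original question further down. The -outlier- option is written as I recall it from the Verardi and Croux article and should be treated as an assumption, so check -help mmregress- after installing before relying on it.

```stata
* -mmregress- is user-written, not part of official Stata: locate and install it
findit mmregress

* After installation, confirm the syntax and options in the help file
help mmregress

* Robust MM fit for a single period (Period == 2 is an arbitrary choice).
* ASSUMPTION: an -outlier- option that saves outlier diagnostics; verify in the help.
* The original regressions used -noconstant-; check whether -mmregress- supports it.
mmregress Y X1 X2 X3 X4 if Period == 2, outlier
```

Any flags the command produces are a starting point for inspection, not an automatic deletion rule.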
On Feb 11, 2013, at 2:37 PM, Xixi Lin wrote:
Hi Nick,
You are absolutely right! I used the wrong number of observations; it
should be the number of observations in each period. After I fixed
that, the results from the two methods are pretty close.
Thanks again. You are so helpful! ^_^
Best,
Xixi Lin
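
The correction described in this message, a cutoff based on the number of observations in each period's regression, might look roughly like the sketch below. It is not the poster's actual revised code. Y, X1-X4, and Period are the names used in the thread; cook_d, cutoff, and the temporary variable cd are hypothetical names introduced here.

```stata
* Sketch only: per-period Cook's distance with a per-regression 4/n cutoff
gen double cook_d = .
gen double cutoff = .
foreach z of numlist 2/120 {
    capture regress Y X1 X2 X3 X4 if Period == `z', noconstant
    if !_rc {
        tempvar cd
        predict double `cd' if e(sample), cooksd
        replace cook_d = `cd' if e(sample)
        * e(N) is the number of observations used in this regression
        replace cutoff = 4 / e(N) if e(sample)
        drop `cd'
    }
}
* Stata treats missing values as larger than any number,
* so exclude them explicitly when counting
count if cook_d > cutoff & !missing(cook_d)
```

The same caution about missing values applies to a count based on abs() of a residual variable: observations with a missing residual would otherwise be counted as exceeding the threshold.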
On Mon, Feb 11, 2013 at 2:24 PM, Nick Cox <[email protected]> wrote:
> I wouldn't regard any kind of large residual as indicating outliers
> unequivocally. On the contrary, a really marked outlier is likely to
> pull the regression towards it, with the result of a small residual.
>
> Your criterion here for Cook is 4/n, but evidently you are fitting
> regressions separately for each period. The total dataset size of
> 165779 is not pertinent to regressions fitted individually. The
> relevant criterion is the number of observations used in each
> regression.
>
> I think you'd learn more from residual vs fitted plots, even all 119 of them.
>
> Whether you would be better off with a different model depends on your
> research problem.
>
> Nick
>
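
One way to produce the residual versus fitted plots suggested above, one per period, is sketched here. It reuses the regression from the original question; the file names rvf_2, rvf_3, and so on are hypothetical.

```stata
* Sketch only: one residual-versus-fitted plot per period, saved to disk
foreach z of numlist 2/120 {
    capture regress Y X1 X2 X3 X4 if Period == `z', noconstant
    if !_rc {
        * -rvfplot- is the official postestimation command after -regress-
        rvfplot, yline(0) title("Period `z'") saving(rvf_`z', replace)
    }
}
```

Saving each graph to a .gph file keeps all of them available for later inspection with -graph use- instead of relying on what is currently held in memory.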
> On Mon, Feb 11, 2013 at 6:50 PM, Xixi Lin <[email protected]> wrote:
>> Hi,
>> I tried two ways to detect outliers: one is to flag observations with
>> Cook’s distance greater than 4/n as outliers; the other is to flag
>> those with standardized residuals greater than 2 in magnitude. Here
>> is my code:
>>
>> // studentized residuals (rstu), one regression per period
>> gen residual=.
>> tempvar temp
>> foreach z of numlist 2/120 {
>>     capture reg Y X1 X2 X3 X4 if Period==`z', noconstant
>>     if !_rc {
>>         predict `temp', rstu
>>         replace residual=`temp' if Period==`z'
>>         drop `temp'
>>     }
>> }
>>
>> // Cook's distance
>> gen di_bench=4/165979
>> gen distance=.
>> tempvar temp1
>> foreach z of numlist 2/120 {
>>     capture reg Y X1 X2 X3 X4 if Period==`z', noconstant
>>     if !_rc {
>>         predict `temp1', cook
>>         replace distance=`temp1' if Period==`z'
>>         drop `temp1'
>>     }
>> }
>> //outlier numbers
>> count if abs(residual) > 2 // 7922
>> count if distance > di_bench //111879
>>
>> My question is: did I mess up the code? Why are the two results so
>> different? One shows 7922 outliers, the other shows 111879 outliers.
>> If I compare Cook's distance with 1, then the number of outliers is 133.
>>
>> Can anyone tell me which method I should choose? Or is there a
>> better way to detect outliers? Thanks a lot.
>>
>> Best,
>> Xixi Lin
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/