Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Steve Samuels <sjsamuels@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: using -drop if- with weights |
Date | Mon, 6 Sep 2010 06:30:35 -0400 |
Luis must mean "standard deviation", not "standard error", and the SD is the statistic that Maarten used. Standard errors are functions of sample size and can be very small, so that almost all observations would be dropped.

But even with this correction, the process is a very bad idea in my opinion. (See section 1.4, p. 56 of FR Hampel et al., Robust Statistics: The Approach Based on Influence Functions, Wiley, 1986.) The standard deviation will be distorted by outliers, making detection more difficult, and multiple outliers will mask one another. Repeating the process can find and reject new "outliers" at each stage, leading to a very unrepresentative sample.

Better to use a program like the user-written -mcd- or the 20-year-old -iqr- to detect outliers (-findit-), even though neither accepts weights.

*****************************
set more off
sysuse auto, clear
sum mpg
list if abs(mpg-r(mean))>3*r(sd)
replace mpg = 50 in 1/5   //5 new outliers
sum mpg
list if abs(mpg-r(mean))>3*r(sd)   //gone!
capture which mcd
if _rc net install st0173_1.pkg
mcd mpg, e(20) gen(outlier rdist) setseed(5000)
list mpg if outlier   //found!
********************************

Steve

Steven J. Samuels
sjsamuels@gmail.com
18 Cantine's Island
Saugerties NY 12477
USA
Voice: 845-246-0774
Fax: 206-202-4783

On Mon, Sep 6, 2010 at 5:15 AM, Maarten buis <maartenbuis@yahoo.co.uk> wrote:
> --- Luis Armando Galvis writes:
>> I have a question I am stuck with. I need to drop
>> observations that are beyond 3 standard errors from the mean
>> of one of the variables. The problem is that using -drop if-
>> will eliminate observations without taking into account the
>> weights and will eliminate more observations than needed. I
>> cannot expand the dataset to 8 million records because of
>> memory issues. My question is if there is a way to do this
>> procedure in a more manageable way.
>
> The command -drop- doesn't know about weights, nor does it allow them.
> It doesn't know the mean or standard deviation either, so the
> problem is not with -drop- but with what you typed before.
> Since you did not tell us what you typed before, it is hard for
> us to comment. Also, you did not tell us why you think that your
> command drops too many observations. This can be crucial
> information, as the rules of thumb about how many observations
> should be dropped with such a rule are often based on the normal
> distribution, but if your variable is severely skewed or has a
> spike then all bets are off when it comes to predicting how many
> observations will be dropped with such a rule.
>
> On a more fundamental note: such automatic deletion of observations
> is almost always very very very wrong. Almost always it is the
> exceptions that contain the most information, so we do not want
> to throw them away. Think about it from a policy point of view: it
> is usually the exceptions that we want to attain or prevent. We
> want the population to live long and healthy and be rich, and want
> to prevent early deaths, illness, and poverty. It is the extremes
> that contain information on these events, not the "normal"
> observations.
>
> However, technically this is how you can do it:
>
> sum var [fw=w]
> drop if var < r(mean) - 3*r(sd) | var > r(mean) + 3*r(sd)
>
> (assuming that your variable is called var and your weight
> is called w)
>
> Hope this helps,
> Maarten
>
> --------------------------
> Maarten L. Buis
> Institut fuer Soziologie
> Universitaet Tuebingen
> Wilhelmstrasse 36
> 72074 Tuebingen
> Germany
>
> http://www.maartenbuis.nl
> --------------------------
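
A footnote on the weights problem raised in this thread: since neither -mcd- nor -iqr- accepts weights, one possible workaround is to flag (not drop) observations using quartile-based fences, because -summarize, detail- does accept frequency weights and the quartiles are far less distorted by outliers than the SD. The sketch below is only an illustration under stated assumptions, not anything proposed by the posters: the weight variable w is made up for the example, the 1.5*IQR multiplier is the conventional Tukey choice rather than a recommendation, and the five planted values reuse the ones from the example above.

```stata
* A minimal sketch, assuming frequency weights; w is hypothetical.
sysuse auto, clear
generate int w = 1 + mod(_n, 3)     // made-up frequency weight, illustration only
replace mpg = 50 in 1/5             // the same five planted outliers as above

* Weighted quartiles: -summarize, detail- allows fweights (and aweights)
summarize mpg [fw=w], detail
scalar q1 = r(p25)
scalar q3 = r(p75)

* Tukey-style fences at 1.5 times the weighted interquartile range
scalar fence_lo = scalar(q1) - 1.5*(scalar(q3) - scalar(q1))
scalar fence_hi = scalar(q3) + 1.5*(scalar(q3) - scalar(q1))

* Flag rather than drop, so the flagged cases can be inspected first
generate byte flag = (mpg < scalar(fence_lo) | mpg > scalar(fence_hi)) if !missing(mpg)
list mpg w if flag == 1
```

Flagging instead of dropping keeps the data intact, in line with the warnings in the thread about automatic deletion. The `if !missing(mpg)` qualifier matters because Stata treats missing values as larger than any number, so they would otherwise be flagged as high outliers.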