Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Elimination of outliers
From: Austin Nichols <[email protected]>
To: [email protected]
Subject: Re: st: Elimination of outliers
Date: Mon, 6 Jun 2011 10:50:18 -0400
All--
Looking at that again, the outlier flag based on Mahalanobis distance should use only the upper tail:
g byte ex2=(d>r(r98))
since a distance of 0 is not an outlier; far from it.
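A minimal sketch of the multivariate version with only large values of d flagged, so that -if ex2<1- keeps the bulk of the data. It assumes the same mheart5 example and variable names as in the quoted message below. Note that d here is a sum of squared z-scores, which matches a squared Mahalanobis distance only if age and bmi are uncorrelated.

```stata
* Sketch only: one-sided multivariate trimming on the mheart5 example data
webuse mheart5, clear
local X age bmi
foreach v of local X {
    quietly summarize `v'
    * squared standardized deviation from the sample mean
    generate t_`v' = ((`v' - r(mean)) / r(sd))^2
}
* sum of squared z-scores; rowtotal() treats a missing component as 0
egen d = rowtotal(t_*)
_pctile d, nq(100)
* flag only the upper tail: d near 0 means the case sits close to the means
generate byte ex2 = (d > r(r98))
logit attack smokes age bmi hsgrad female if ex2 < 1
```

The 98th percentile is used so that roughly the same share of cases is dropped as with the two-sided 1st/99th rule.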
On Mon, Jun 6, 2011 at 10:45 AM, Austin Nichols <[email protected]> wrote:
> Nick--
> I think the advisability of trimming outliers depends on what is
> meant. Restricting a regression to a range of X (explanatory
> variables) more plausibly free of measurement error, by dropping cases
> with extreme values, can improve estimates, both by reducing bias due
> to measurement error and by providing much more accurate SEs. But doing
> the same to the outcome y will typically introduce bias even where
> there was none before; in general, selecting on the outcome variable is
> demonstrably a terrible idea.
>
> If you want to restrict X, you can go variable by variable:
>
> webuse mheart5, clear
> loc X age bmi
> foreach v of loc X {
> _pctile `v', nq(100)
> g byte lo_`v'=(`v'<r(r1)|`v'>r(r99))
> }
> egen ex=rowtotal(lo_*)
> logit attack smokes age bmi hsgrad female if ex<1
> sc bmi age || sc bmi age if ex<1, leg(lab(1 "Outlier"))
>
> or you could do some kind of multivariate outlier detection:
>
> webuse mheart5, clear
> loc X age bmi
> foreach v of loc X {
> qui su `v'
> g t_`v'=((`v'-r(mean))/r(sd))^2
> }
> egen d=rowtotal(t_*)
> _pctile d, nq(100)
> g byte ex=(d<r(r1)|d>r(r99))
> logit attack smokes age bmi hsgrad female if ex<1
> sc bmi age || sc bmi age if ex<1, leg(lab(1 "Outlier"))
>
> Whatever path Achmed Aldai is pursuing, though, a simulation should be
> done as a proof of concept IMHO. There are no guarantees when it
> comes to small sample properties using whatever strange-looking data
> is at hand.
>
> On Mon, Jun 6, 2011 at 10:17 AM, Nick Cox <[email protected]> wrote:
>> 1. Transformation means using a transformed scale (e.g. logarithms) for one or more of your variables.
>>
>> 2. A non-identity link function in a generalized linear model means what it says: the help for -glm- is the place to start and points to other documentation.
>>
>> Otherwise, I assert that elimination of outliers is a very bad idea _unless_ you know from independent evidence that they arise from serious and irremediable problems of measurement, in which case chopping the tails of the distribution is _not_ the way to do it. In most fields I know, the outliers that stick out are genuine and important (the Amazon in hydrology, the USA or China, wherever it is, in economics, and so on) and leaving them out is in my view lousy science and lousy statistics.
>>
>> If you disagree, well, we disagree, but I am not going to tell you how to do this in Stata.
>>
>> Nick
>> [email protected]
>>
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Achmed Aldai
>> Sent: 06 June 2011 15:07
>> To: [email protected]
>> Subject: Re: st: Elimination of outliers
>>
>> Hi
>>
>> Sorry, I cannot really understand why it is a bad idea. I want to eliminate the outliers because I think they cause a bias in my results.
>>
>> How can I transform my predictors and what do you mean by that?
>>
>> What is a non-identity link function?
>>
>> Thank you
>>
>> Felix
>