Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Extreme data points
From: Jorge Eduardo Pérez Pérez <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: Extreme data points
Date: Wed, 8 Jun 2011 11:47:09 -0400
You might also want to take a look at multivariate outlier detection
methods in Stata: -hadimvo- and -bacon-
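
As a rough sketch of how these two commands might be applied to the same nlsw88 example quoted below: the variable names hadi_out and bacon_out are made up for illustration, and -bacon- is user-written, so it is assumed to have been installed from SSC first. Check -help hadimvo- and -help bacon- for the exact options and defaults before relying on this.

```stata
sysuse nlsw88, clear
* Hadi's multivariate outlier flag (official Stata command)
hadimvo wage hours, generate(hadi_out)
* BACON algorithm (user-written; assumes: ssc install bacon)
* -bacon- may depend on other SSC packages; see its help file
bacon wage hours, generate(bacon_out)
* Cross-tabulate the two flags to see where the methods agree
tabulate hadi_out bacon_out, missing
```

Comparing either flag with the ex and ex2 variables in the quoted code would show how much the exclusion sets differ across methods.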
_______________________
Jorge Eduardo Pérez Pérez
On Wed, Jun 8, 2011 at 10:23 AM, Austin Nichols <[email protected]> wrote:
> Achmed Aldai <[email protected]>:
> While that multivariate code works in the example with 2 vars, it
> falls down when more than 2 variables are put in X; this is better
> (bearing in mind that I cannot recommend the multivariate outlier
> detection approach for use in any real data, given the simulation
> evidence presented already):
>
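> clear all
> sysuse nlsw88
> loc X wage hours
> * Univariate rule: flag values below the 0.5th or above the 99.5th percentile
> foreach v of loc X {
> _pctile `v', nq(200)
> g byte lo_`v'=(`v'<r(r1)|`v'>r(r199))
> }
> egen ex=rowtotal(lo_*)
> replace ex=1 if ex>1
> la var ex "Excluded values of X (univariate)"
> * Multivariate rule: standardize the first variable, then replace each later
> * variable by its standardized residual from a regression on the earlier ones
> loc i 1
> * Clear local 1, which accumulates the list of orthogonalized variables
> loc 1
> qui foreach v of loc X {
> if `i'==1 {
> su `v'
> g double t_`v'=(`v'-r(mean))/r(sd)
> }
> else {
> reg `v' `1'
> predict double t_`v', res
> su t_`v'
> replace t_`v'=(t_`v'-r(mean))/r(sd)
> }
> loc 1 `1' t_`v'
> loc i=`i'+1
> }
> * Sum the squared orthogonalized scores to get the squared distance
> qui foreach v of loc X {
> replace t_`v'=(t_`v')^2
> }
> egen double dm=rowtotal(t_*)
> * Flag the top one percent of distances
> _pctile dm, nq(100)
> g byte ex2=(dm>r(r99))
> la var ex2 "Excluded values of X (multivariate)"
> * Compare estimates and scatterplots under the two exclusion rules
> logit married wage hours if ex<1
> logit married wage hours if ex2<1
> sc wage hours||sc wage hours if ex<1,leg(lab(1 "Outlier (univar)")) name(u)
> sc wage hours||sc wage hours if ex2<1,leg(lab(1 "Outlier (multivar)"))
> ta ex ex2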
>
> *but see also mahapick on SSC for a canned solution to calculating Mah. distance
>
> On Wed, Jun 8, 2011 at 6:47 AM, Austin Nichols <[email protected]> wrote:
>> Achmed Aldai <[email protected]>:
>> See
>> http://www.stata.com/statalist/archive/2011-06/msg00240.html
>> and the rest of that thread; or read on for an improved multivariate
>> exclusion algorithm (the prior iteration ignored possible correlations
>> in X).
>>
>> The advisability of dropping extreme data points depends on what is
>> meant and why this is wanted; restricting a regression to a range of X
>> (explanatory variables) more plausibly free of measurement error by
>> dropping cases with extreme values can improve estimates; but dropping
>> extreme values of the outcome y will typically introduce bias even
>> where there was none before. In general, selecting on the outcome
>> variable is a terrible idea.
>>
>> If you want to restrict X, you can go variable by variable and drop
>> the top half a percent, or look at all X together as an ellipsoidal
>> cloud (i.e. using Mahalanobis distance and excluding those obs with
>> distance in the top one percent); probably the variable by variable
>> approach is better (especially when data is not multivariate normal)
>> but here is an example of both:
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/