Very good. This could perhaps be added like this:
0. "Outliers are sample values that cause surprise
in relation to the majority of the sample" (W.N.
Venables and B.D. Ripley. 2002. Modern applied
statistics with S. New York: Springer, p.119).
However, surprise is in the mind of the beholder
and is dependent on some tacit or explicit model
of the data. There may be another model under which
the outlier is not surprising at all, for example if the
data really are lognormal or gamma rather than normal.
Rich's use of the word "surprise" reminded me of
the quotation from Venables and Ripley.
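
To make the lognormal point concrete, here is a minimal sketch in
Python (standard library only). The sample is made up, and the
function name iqr_outliers and the 1.5 * IQR fences are assumptions
chosen for illustration, not anything taken from -winsor- or from
Venables and Ripley. It applies the same boxplot-style rule on the
raw scale and on the log scale.

```python
import math
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # "inclusive" treats the sample as containing its own min and max
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up, right-skewed sample (illustrative only)
sample = [1, 2, 2, 3, 4, 5, 8, 12, 20, 60]

# Same rule, two implicit models: symmetric on the raw scale,
# symmetric on the log scale
raw_flags = iqr_outliers(sample)
log_flags = iqr_outliers([math.log(v) for v in sample])

print("flagged on raw scale:", raw_flags)
print("flagged on log scale:", log_flags)
```

With this sample the value 60 is flagged on the raw scale and nothing
is flagged on the log scale, so the "surprise" depends on the scale
(the tacit model) and not on the value alone.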
Nick
[email protected]
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Richard
> Goldstein
> Sent: 07 June 2007 20:29
> To: [email protected]
> Subject: Re: st: RE: RE: Re: RE: Re: RE: RE: IQR
>
>
> I would add one point to Nick's laundry list -- an outlier
> is a surprising result and it is often surprising because
> we have used a particular model -- thinking about why
> we obtained the surprise can sometimes lead to a different
> model without any outliers.
>
> Rich
>
> Nick Cox wrote:
> > Sure, there is a -winsor- ado which I wrote on SSC
> > and, according to Kit Baum's reports, it is quite heavily
> > used. I have never used it myself, bar in development.
> >
> > I cannot recall the details, but perhaps someone
> > wrote into Statalist reporting that it seemed that
> > Stata did not support Winsorizing and that was a black
> > mark against Stata. To which the best reply was a
> > program, being concrete evidence that you can easily do
> > Winsorizing in Stata and here is one way to do it.
> >
> > But let us look at the wider picture. There is no
> > one way to deal with outliers. There are many ways
> > to deal with outliers, including
> >
> > 1. Going out "into the field" and doing the measurement
> > again.
> >
> > 2. Testing whether they are genuine. Most of the
> > tests look pretty contrived to me, but you might find one
> > that you can believe fits your situation. Irrational
> > faith that a test is appropriate is always needed
> > to apply a test that is then presented as quintessentially
> > rational.
> >
> > 3. Throwing them out as a matter of judgement, i.e.
> > in Stata terms -drop-ping them from the data.
> >
> > 4. Throwing them out using some more-or-less
> > automated (usually not "objective") rule.
> >
> > 5. Ignoring them, along the lines of either 3 or 4.
> > This could be formal (e.g. trimming) or just leaving
> > them in the dataset, but omitting them from analyses
> > as too hot to handle.
> >
> > 6. Pulling them in using some kind of adjustment,
> > e.g. Winsorizing.
> >
> > 7. Downplaying them by using some other robust estimation
> > method.
> >
> > 8. Downplaying them by working on a transformed
> > scale.
> >
> > 9. Downplaying them by using a non-identity link
> > function.
> >
> > 10. Accommodating them by fitting some appropriate
> > fat-, long-, or heavy-tailed distribution, without
> > or with predictors.
> >
> > 11. Sidestepping the issue by using some non-parametric
> > (e.g. rank-based) procedure.
> >
> > 12. Getting a handle on the implied uncertainty
> > using bootstrapping, jackknifing or permutation-based
> > procedures.
> >
> > 13. Editing to replace an outlier with some more
> > likely value, based on deterministic logic. "An 18-
> > year-old grandmother is unlikely, but the person
> > in question was born in 1926, so presumably is
> > really 81."
> >
> > 14. Editing to replace an impossible or implausible
> > outlier using some imputation method that is currently
> > acceptable not-quite-white magic.
> >
> > 15. Analysing with and without, and seeing how much
> > difference the outlier(s) make(s), statistically,
> > scientifically or practically.
> >
> > 16. Something Bayesian. My prior ignorance of quite
> > what forbids me from giving any details.
> >
> > Naturally, these categories intergrade in some
> > cases, and I can believe I have forgotten
> > or am not aware of yet other approaches.
> >
> > What is quite striking to me -- as with many
> > areas of statistical science -- is how much
> > preferred solutions vary between investigator
> > and discipline, despite the broad similarity
> > of the problems that outliers pose.
> >
> > Nick
> > [email protected]
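
A rough sketch of the difference between trimming (under 5 in the
list above) and Winsorizing (6), again in Python with the standard
library only. The function names, the parameter h (the number of
values handled in each tail) and the sample are assumptions made up
for illustration. This is not the code of the -winsor- ado.

```python
import statistics


def trim(values, h):
    """Drop the h smallest and h largest values."""
    if h < 0 or 2 * h >= len(values):
        raise ValueError("h must satisfy 0 <= 2*h < number of values")
    ordered = sorted(values)
    return ordered[h:len(ordered) - h]


def winsorize(values, h):
    """Replace the h smallest and h largest values by their nearest
    retained neighbours, keeping the sample size unchanged."""
    core = trim(values, h)
    return [core[0]] * h + core + [core[-1]] * h


# Made-up sample with one wild value (illustrative only)
sample = [1, 2, 2, 3, 4, 5, 8, 12, 20, 60]

mean_raw = statistics.mean(sample)
mean_trimmed = statistics.mean(trim(sample, 1))
mean_winsorized = statistics.mean(winsorize(sample, 1))

print("raw mean:       ", mean_raw)
print("trimmed mean:   ", mean_trimmed)
print("Winsorized mean:", mean_winsorized)
```

Printing the three means side by side is also a small instance of 15,
analysing with and without and seeing how much difference the
outlier makes.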