There are several implementations of box plots,
but Stata's follows the definition that
outliers are at least (3/2) iqr from the nearer
quartile. This rule of thumb comes from John
W. Tukey, who named the box plot (but did not,
contrary to many reports, really invent it).
It's well documented that Tukey -- despite
having been involved with computing since the
1940s and having invented the terms "software" and
"bit", not to mention smaller points like the
FFT -- developed his rule of thumb out of
experience in drawing box plots by hand. (N.B.!)
The datasets he was dealing with in that way
were, it seems, typically much smaller than 1000
in size, so in a way the rule is circular: it is
tied to the number of outliers you might want to
plot separately, and think about.
An equally simple point, but one worth underlining
briefly,
is that Tukey made very heavy use of transformations
to approximate symmetry, especially logarithms.
Those not in the habit of transforming first,
or of transforming at all, would on the whole
see more outliers flagged than he would have done.
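
As a rough illustration of both points, here is a minimal sketch in Python (not Stata) of the 1.5 IQR rule applied before and after taking logarithms. The data are simulated and purely hypothetical, the function names are my own, and the quartile method is only one of several in use, so counts need not match exactly what Stata would plot.

```python
import math
import random
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles.

    Assumption: quartiles come from statistics.quantiles with the
    "inclusive" method; other quartile definitions shift the fences slightly.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outside(values, k=1.5):
    """Count values lying strictly outside the fences."""
    lower, upper = tukey_fences(values, k)
    return sum(1 for v in values if v < lower or v > upper)


# Hypothetical right-skewed sample (lognormal), with a fixed seed.
rng = random.Random(12345)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]
logged = [math.log(v) for v in raw]

n_raw = count_outside(raw)
n_logged = count_outside(logged)
print("flagged on raw scale:", n_raw)
print("flagged on log scale:", n_logged)
```

With skewed data like these, the raw scale should flag many more points than the log scale; the exact counts depend on the seed and on the quartile method.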
Nick
Michael Blasnik wrote:
> I too have thought that the standard box plot fences flag too many
> values as outliers. Maybe it's because I often work with fairly
> large N, or because I work with messy real world data, but I find so
> many values outside the fences that the criterion has no meaning.
> Based on the standard definition, you should expect about 22
> "outliers" in a sample of 1,000 when the sample is perfectly
> Gaussian. In my experience, 5%-10% outliers are even more common
> with real data.
>
> When I want to investigate outliers, in addition to using graphs and
> model diagnostics (e.g., df-betas), I often define "fences" at 3 iqr
> above and below the median. That threshold, which should result in
> 0.3 outliers per 1,000 Gaussian observations, tends to give me a
> more manageable list of "severe" outliers to investigate.