Re: st: xtreg check for outliers
From: Richard Goldstein <[email protected]>
To: [email protected]
Subject: Re: st: xtreg check for outliers
Date: Thu, 09 Aug 2012 10:01:30 -0400
My view is a little different.
An outlier is a surprising value; it is surprising because one is
comparing it, sometimes implicitly, to a model. Once you determine
that the value is not an error, you need to consider whether you are
using the "right" model: changing the model will often change which
values, if any, are "outliers".
Rich
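
The point that "outlier" status depends on the model can be illustrated with a small sketch. It is written in Python rather than Stata only to keep it self-contained. Everything in it is an assumption made for illustration: the data are made up (a doubling sequence), the function name `surprising` is hypothetical, and the rule used (more than 3 standard deviations from the mean of the other values) is just one crude way to define "surprising". The two "models" compared are a constant mean on the raw scale and a constant mean on the log scale.

```python
import math
from statistics import mean, stdev

def surprising(values, threshold=3.0):
    """Indices of values more than `threshold` SDs from the mean of the others."""
    flagged = []
    for i, v in enumerate(values):
        others = values[:i] + values[i + 1:]   # leave the candidate out
        z = (v - mean(others)) / stdev(others)
        if abs(z) > threshold:
            flagged.append(i)
    return flagged

sizes = [2 ** k for k in range(10)]            # made-up data: 1, 2, 4, ..., 512

raw_flags = surprising(sizes)                              # raw-scale model
log_flags = surprising([math.log(v) for v in sizes])       # log-scale model

print(raw_flags)  # [9]  (the value 512 looks surprising on the raw scale)
print(log_flags)  # []   (nothing looks surprising on the log scale)
```

The same ten numbers contain one "outlier" or none depending only on the scale on which they are modelled.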
On 8/9/12 9:48 AM, Nick Cox wrote:
> Somewhat in the spirit of David's comment, but from a broader perspective:
>
> I think it's important to recognise that the one word "outliers"
> covers some quite different situations, some of which are not even
> problems. Indeed, one definition of outliers is that they surprise the
> researcher, so being an outlier is as much psychology as ontology.
>
> A complete taxonomy is necessarily elusive, and there is at least one
> lengthy monograph on outliers. But minimally we should distinguish
>
> 1. Outliers that are essentially mistakes, as they represent
> impossible or at least implausible values. These can arise from
> equipment malfunction, contamination of samples, human
> misunderstanding, lies, careless recording of data, clashes in
> convention, inconsistencies in measurement units, etc. Thus -999 for
> age is evidently a missing data code, if not a joke by data entry
> people. If people are still on the lookout for such outliers when
> doing their modelling, it is a sign that they don't know enough or are
> not zealous enough about data management, including data quality
> checking. Sometimes there is scope for re-measurement, sometimes a
> rough value can be estimated in other ways, but often such values just
> have to be excluded from the data being analysed.
>
> 2. Outliers that are genuine, require care in handling but can be
> accommodated by using an appropriate transformed scale for analysis.
> As a geographer the canonical example to me is the Amazon, which on
> most river measures really is big! Perhaps I am lucky but it has been
> my experience that most such outliers can be accommodated by either
> transformation or using a suitable link function, either explicitly
> (e.g. -glm-) or tacitly (e.g. -poisson-). Logarithms are your friend.
>
> 3. Outliers that are genuine but seem to be awkward for or destructive
> to any model fit tried and which the analyst is tempted to exclude
> from the data, or model ad hoc. A weak or inexperienced analyst yields
> to the temptation; a strong analyst knows several ways of including
> the outlier with various tricks, including devising new models. To me,
> the best rationale for exclusion is a substantive or scientific
> argument making it clear why the outlier really doesn't belong (it's a
> goat that doesn't belong with these sheep) and excluding outliers just
> because they make life statistically difficult is less convincing.
>
> Naturally, much more could be said. A purely personal aside is that I
> don't think that nonparametric statistics or robust statistics are
> quite as helpful in practice in dealing with outliers as their most
> energetic advocates would have you believe.
>
> Nick
>
> On Wed, Aug 8, 2012 at 1:37 PM, David Hoaglin <[email protected]> wrote:
>> Dalhia,
>>
>> In multiple regression, "outliers" can take a variety of forms.
>>
>> An observation may have an unusual combination of values of the
>> predictor variables. Such points have high leverage and may be
>> influential. If the model fits well there, the corresponding value
>> of y may not be an outlier.
>> Cook's distance, DFFITS, and DFBETAS help to diagnose various aspects
>> of influence.
>>
>> Studentized residuals can show whether the model fits poorly at an
>> individual observation (in effect, whether that value of y is an
>> outlier, relative to the model).
>>
>> The variety of possibilities can make diagnosis of "outliers" challenging.
>
> On Wed, Aug 8, 2012 at 7:03 AM, Dalhia <[email protected]> wrote:
>
>>> How do I check for outliers when using xtreg, fe? One
>>> solution I thought of was to demean each variable for each panel, and
>>> then rerun using regress, and then use Cook's distance, dfits, avplot etc.
>>> to identify outliers. Is this a reasonable solution? Is there a
>>> different/better way to do this?
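
The approach proposed in the original question (demean each variable within each panel, refit by ordinary least squares, then compute influence diagnostics) can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended procedure and not Stata code: it is plain Python, the data are made up, there is a single predictor, the function names `within_transform` and `within_cooks_distance` are hypothetical, and the residual variance uses the fixed-effects degrees of freedom (observations minus panels minus one), which is a choice made here.

```python
from collections import defaultdict

def within_transform(panel_ids, values):
    """Subtract each panel's own mean from its observations."""
    totals, counts = defaultdict(float), defaultdict(int)
    for pid, v in zip(panel_ids, values):
        totals[pid] += v
        counts[pid] += 1
    return [v - totals[pid] / counts[pid] for pid, v in zip(panel_ids, values)]

def within_cooks_distance(panel_ids, x, y):
    """Slope and Cook's distances from regressing demeaned y on demeaned x."""
    xd = within_transform(panel_ids, x)
    yd = within_transform(panel_ids, y)
    sxx = sum(a * a for a in xd)
    slope = sum(a * b for a, b in zip(xd, yd)) / sxx
    resid = [b - slope * a for a, b in zip(xd, yd)]
    # Assumption: degrees of freedom = observations - panels - 1 slope
    dof = len(x) - len(set(panel_ids)) - 1
    s2 = sum(e * e for e in resid) / dof
    # Leverage in the demeaned, single-predictor regression
    leverage = [a * a / sxx for a in xd]
    # Cook's distance with one estimated slope (p = 1)
    cooks = [e * e / s2 * h / (1 - h) ** 2 for e, h in zip(resid, leverage)]
    return slope, cooks

# Made-up panel data: three panels, y = panel effect + 2 * x
panel = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
x = [1, 2, 3, 4] * 3
effects = {"A": 10, "B": 20, "C": 30}
y = [effects[p] + 2 * xi for p, xi in zip(panel, x)]
y[7] += 6  # perturb one observation (panel B, x = 4)

slope, cooks = within_cooks_distance(panel, x, y)
worst = max(range(len(cooks)), key=lambda i: cooks[i])
print(round(slope, 3), worst)  # 2.6 7
```

One caveat on this sketch: the leverages above come from the demeaned regression only. In the equivalent regression with a dummy variable for each panel, each observation's leverage also includes the reciprocal of its panel's number of observations, so diagnostics from the two fits need not agree exactly.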