Re: st: Elimination of outliers

From:    Austin Nichols <[email protected]>
To:      [email protected]
Subject: Re: st: Elimination of outliers
Date:    Mon, 6 Jun 2011 17:11:27 -0400
Nick--
I don't pretend to know much about environmental data, but a quick
introspection, followed by a quick Google search on air quality
sensors, turned up
Engel-Cox, J.A., Holloman, C.H., Coutant, B.W., and Hoff, R.M. 2004.
"Qualitative and quantitative evaluation of MODIS satellite sensor
data for regional and urban scale air quality." Atmospheric
Environment 38(16): 2495-2509,
which seems to indicate considerable concern about extreme readings on
air quality. If one were regressing some y on air quality and other
explanatory variables, it would seem reasonable to drop the extreme
measurements. It is better to get auxiliary measurements and run IV to
address the measurement error, and best to get error-free measurements;
but if one must proceed with error-prone data, dropping extreme values
of the explanatory variables can often be a reasonable step.
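A minimal sketch of both options in Stata (all names hypothetical: y
is the outcome, airq an error-prone air quality reading, x1 and x2
other controls, and z an auxiliary independent reading of air quality
usable as an instrument):

* trim extreme values of the explanatory variable only
summarize airq, detail
drop if airq > r(p99) & !missing(airq)   // illustrative cutoff, not a rule

* IV, treating airq as mismeasured and z as the instrument
ivregress 2sls y x1 x2 (airq = z)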
On Mon, Jun 6, 2011 at 4:59 PM, Nick Cox <[email protected]> wrote:
> Thanks for the clarification.
>
> On your last question, I think that usually makes no physical sense
> for environmental data, where I have the most experience. I am
> straining to imagine that it is anything other than horribly ad hoc
> in any application.
>
> On dummies for outliers: better than dropping them; good if there is
> some independent rationale.
>
> One definition of an outlier is that it surprises the analyst, and the
> best outcome is to think of a model in which the surprise disappears.
> Working on a logarithmic scale is, so far as I can see, the best
> trick, if not the oldest. (Thucydides recorded the use of the mode as
> a robust estimator, although not quite in those words, about 2400
> years ago.)
>
> Nick
>
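A minimal sketch of the log-scale trick in Stata, with simulated data
(all names hypothetical): a value that looks extreme in levels can be
unremarkable once the outcome is logged.

* lognormal outcome: heavy right tail in levels, ordinary in logs
clear
set seed 12345
set obs 100
gen x = rnormal()
gen y = exp(1 + 0.5*x + rnormal())
summarize y, detail      // long right tail: apparent outliers
gen lny = ln(y)
summarize lny, detail    // roughly symmetric: the surprise disappears
regress lny x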
> On Mon, Jun 6, 2011 at 9:35 PM, Austin Nichols <[email protected]> wrote:
>> Nick--
>> The simulation is contrived to illustrate one and only one point:
>> trimming data based on values of X that are suspect is fine, but
>> trimming data based on values of y that are suspect is dangerous at
>> best and nearly always ill-advised. This is a point I have made many
>> times on the list, sometimes in the context of replying to folks who
>> want to take the log of zero. Note that I have made no mention of model
>> residuals; that is a different kind of outlier detection, with its own
>> issues. The poster asked about trimming data based on the variables'
>> values alone, and my point was that this is not a bad idea a priori as
>> long as you only do it to RHS (explanatory) variables and not LHS
>> (outcome) variables. I think Jeff and Richard are thinking in terms
>> of model outliers, perhaps in terms of leverage or the like. Your Amazon
>> example could fall into any of these categories, but including an Amazon
>> dummy is no different in practice from dropping the Amazon data point,
>> right? Or did you have in mind allowing for nonlinearities? It makes
>> sense in many cases to fit a best linear approximation to a subset of
>> the data and then to look at the outlying data with a less linear
>> model, no?
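A minimal sketch of the kind of contrived simulation meant here (my
reconstruction, with hypothetical names; the true slope is 1):
trimming on suspect x leaves the slope roughly unbiased, while
trimming on y attenuates it, and an indicator for one observation
reproduces dropping that observation.

* true model: y = x + e, slope 1
clear
set seed 2011
set obs 1000
gen x = rnormal()
gen y = x + rnormal()
regress y x              // full sample: slope near 1
regress y x if x < 2     // trimming on suspect x: slope still near 1
regress y x if y < 2     // trimming on suspect y: slope biased toward 0

* a dummy for one observation gives the same remaining coefficients
* as dropping that observation
gen flag = _n == 1
regress y x flag
regress y x if _n > 1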
>>
>> On Mon, Jun 6, 2011 at 4:24 PM, Nick Cox <[email protected]> wrote:
>>> I don't think what happens in contrived simulations hits the main
>>> methodological issue at all. As a geographer, some of the time, an
>>> outlier to me is something like the Amazon which is big and different
>>> and something that needs to be accommodated in the model. That can be
>>> done in many ways other than by discarding outliers. Once throwing
>>> away awkward data is regarded as legitimate, when do you stop?
>>> (Independent evidence that an outlier is untrustworthy, as in lab
>>> records of experiments, is a different thing, although even there,
>>> there are well-known stories of data being discarded out of prior
>>> prejudice.)
>>>
>>> To make the question as stark as possible, and to suppress large areas
>>> of grey (gray): There are people who fit the data to the model and
>>> people who fit models to the data. It may sound like the same thing,
>>> but the attitude that you are so confident the model is right that
>>> you are happy to discard the most inconvenient data is not at all the
>>> same as the attitude that the data can tell you something about the
>>> inadequacies of the current model.
>>>
>>> Nick