Re: st: Elimination of outliers
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Elimination of outliers
Date: Mon, 6 Jun 2011 21:59:00 +0100
Thanks for the clarification.
On your last question, I think that usually makes no physical sense
for environmental data where I have most experience. I am straining to
imagine that it is anything other than horribly ad hoc in any
application.
On dummies for outliers: better than dropping them; good if there is
some independent rationale.
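
As a numerical aside on dummies versus dropping: in ordinary least squares, a dummy for a single observation leaves the other coefficients exactly where they would be if that observation were dropped, and the dummy's own coefficient reports how far the observation sits from the fit to the rest. The sketch below illustrates this in Python with NumPy. The data, the seed, the size of the shift and the helper name ols are all made up for illustration; nothing here comes from the thread.

```python
import numpy as np

# Hypothetical data: a linear relation plus one observation shifted far away.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 25.0  # the single "big and different" observation (assumed)


def ols(X, y):
    """Least-squares coefficients for design matrix X (illustrative helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# (a) Drop the outlying observation and fit intercept + slope.
b_drop = ols(np.column_stack([const[1:], x[1:]]), y[1:])

# (b) Keep every observation, add a dummy that is 1 for observation 0 only.
d = np.zeros(n)
d[0] = 1.0
b_dummy = ols(np.column_stack([const, x, d]), y)

# Deviation of the outlier from the line fitted to the other observations.
gap = y[0] - (b_drop[0] + b_drop[1] * x[0])

print("intercept, slope after dropping:", b_drop)
print("intercept, slope with dummy:    ", b_dummy[:2])
print("dummy coefficient:", b_dummy[2], " outlier's deviation:", gap)
```

The intercept and slope agree between (a) and (b). What the dummy adds is bookkeeping: the observation stays in the data set and its deviation is reported explicitly rather than silently discarded.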
One definition of an outlier is that it surprises the analyst, and the
best outcome is to think of a model in which the surprise disappears.
Working on a logarithmic scale is, so far as I can see, the best trick,
if not the oldest. (Thucydides recorded the use of the mode as a
robust estimator, although not quite in those words, about 2400 years
ago.)
Nick
On Mon, Jun 6, 2011 at 9:35 PM, Austin Nichols <[email protected]> wrote:
> Nick--
> The simulation is contrived to illustrate one and only one point:
> trimming data based on values of X that are suspect is fine, but
> trimming data based on values of y that are suspect is dangerous at
> best and nearly always ill-advised. This is a point I have made many
> times on the list, sometimes in the context of replying to folks who
> want to take the log of zero. Note I have made no mention of model
> residuals; that is a different kind of outlier detection with its own
> issues. The poster asked about trimming data based on the variables'
> values alone, and my point was that this is not a bad idea a priori as
> long as you only do it to RHS (explanatory) variables and not LHS
> (outcome) variables. I think Jeff and Richard are thinking in terms
> of model outliers, perhaps in terms of leverage or such. Your Amazon
> example could fall in any of these categories, but including an Amazon
> dummy is no different in practice from dropping the Amazon data point,
> right? Or did you have in mind allowing for nonlinearities? It makes
> sense in many cases to fit a best linear approximation to a subset of
> the data and then to look at the outlying data with a less linear
> model, no?
>
> On Mon, Jun 6, 2011 at 4:24 PM, Nick Cox <[email protected]> wrote:
>> I don't think what happens in contrived simulations hits the main
>> methodological issue at all. As a geographer, some of the time, an
>> outlier to me is something like the Amazon which is big and different
>> and something that needs to be accommodated in the model. That can be
>> done in many ways other than by discarding outliers. Once throwing
>> away awkward data is regarded as legitimate, when do you stop?
>> (Independent evidence that an outlier is untrustworthy, as in lab
>> records of experiments, is a different thing, although even there
>> there are well-known stories of discarding as a matter of prior
>> prejudice.)
>>
>> To make the question as stark as possible, and to suppress large areas
>> of grey (gray): There are people who fit the data to the model and
>> people who fit models to the data. It may sound like the same thing,
>> but the attitude that one is so confident that the model is right that
>> you are happy to discard the most inconvenient data is not at all the
>> same as the attitude that the data can tell you something about the
>> inadequacies of the current model.
>>
>> Nick
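
The point quoted above, that trimming on values of X is fine while trimming on values of y is dangerous, can be illustrated with a small simulation. This is not the simulation discussed in the thread, which is not reproduced here; it is a minimal Python sketch under assumed settings (a linear model with true slope 2, normal errors, and trimming at the 5th and 95th percentiles). The helper names slope and keep_middle are hypothetical.

```python
import numpy as np

# Assumed data-generating process: y = 1 + 2*x + e, with standard normal x and e.
rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)


def slope(x, y):
    """OLS slope of y on x (with intercept); illustrative helper."""
    return np.polyfit(x, y, 1)[0]


def keep_middle(v, lo=5, hi=95):
    """Boolean mask keeping values between the lo-th and hi-th percentiles."""
    a, b = np.percentile(v, [lo, hi])
    return (v >= a) & (v <= b)


keep_x = keep_middle(x)  # trim on the explanatory variable
keep_y = keep_middle(y)  # trim on the outcome variable

slope_full = slope(x, y)
slope_trim_x = slope(x[keep_x], y[keep_x])
slope_trim_y = slope(x[keep_y], y[keep_y])

print("full sample:  ", slope_full)
print("trimmed on x: ", slope_trim_x)
print("trimmed on y: ", slope_trim_y)
```

Under these assumptions, selecting on x leaves the slope estimate close to 2 (at some cost in precision), whereas selecting on y pulls it toward zero, because observations are kept or discarded partly on the basis of their errors.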