Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Allan Reese (Cefas)" <allan.reese@cefas.co.uk> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | st: Treatment of outliers |
Date | Tue, 7 Jun 2011 11:00:06 +0100 |
The exchanges prompted by a request to *trim* variables (technically distinct from identifying and removing outliers) prompt me to post a comment I bottled up at the time Peter Diggle's paper was read at the RSS. As it's geostatistics, Nick may have a view.

http://www.math.ntnu.no/~hrue/r-inla.org/case-studies/Diggle09/DiggleSept09.pdf (an odd ref, but one that Google found and it works today) has the title "Geostatistical inference under preferential sampling". Since the premise was that data were collected with prejudice, and the point of the data and the modelling was to identify locations with high Pb contamination, it seemed to me very odd that the paper includes a throwaway comment: "The measured lead concentrations included two gross outliers in 2000, each of which we replaced by the average of the remaining values from that year's survey."

In principle, I agree with Nick (gosh, that's a phrase gone out of fashion) that outliers in real data need very careful consideration. One of the major problems in the use of statistical methods is that people apply textbook methods without checking the assumptions they make about how the data were generated. (So, doctor, can we assume all your patients are independent, identical and exchangeable from a single normal distribution?)

A simple test of the robustness of a model is to compare the fit with and without the suspected outliers. If the fit is substantially the same, you can use the results. If including the outliers substantially changes the model, you are forced to make a judgment (non-probabilistic) on the source of the data.

I also note the original posting mentioned, "I have 150000 observations and out of these observations I want to delete 25 observations from the upper and lower boundaries."
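The with/without comparison suggested above can be sketched in a few lines. This is only an illustration in Python, not anything from the thread: the data are made up, the names `fit_line` and `compare_fits` are hypothetical, and a straight-line least-squares fit stands in for whatever model is actually in use. Which points count as "suspected" is supplied by the analyst, and whether the two fits differ "substantially" remains a judgment the code does not make.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def compare_fits(xs, ys, suspects):
    """Fit once with all points, once without the suspected outliers.

    `suspects` is a set of row indices flagged by the analyst (an assumption
    of this sketch; nothing here detects outliers automatically).
    """
    keep = [i for i in range(len(xs)) if i not in suspects]
    full = fit_line(xs, ys)
    reduced = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
    return full, reduced

# Made-up data: y = 2x + 1 exactly, except one gross value at index 5.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 3, 5, 7, 9, 60, 13, 15, 17, 19]

full, reduced = compare_fits(xs, ys, suspects={5})
print("all points:       intercept=%.3f slope=%.3f" % full)
print("without suspects: intercept=%.3f slope=%.3f" % reduced)
```

If the two sets of coefficients are close, the suspected points are not driving the result. If they differ, as they do with this contrived example, the code has done its job and the decision about the source of those points is back with the analyst.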
Allan R

Allan Reese
Senior statistician, Cefas
The Nothe, Weymouth DT4 8UB
Tel: +44 (0)1305 20 6614 -direct
Fax: +44 (0)1305 20 6601
www.cefas.defra.gov.uk