Re: st: RE: Data consistency heuristics
I agree with Nick Cox that there is probably no system for creating
such rules that is independent of the domain/topic, though I too
would be happy to find out otherwise.
Two minor ideas that may or may not help are 1) for time series
data, looking for large changes between consecutive time periods and
2) for interval-level data, looking for outliers. Both checks will
still produce some false positives, i.e. genuine values that look
like errors (e.g. Zimbabwe's current annual inflation rate is
reportedly 24,000,000%), as well as false negatives, but that is true
of any such rule.
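
As a rough illustration of those two checks, here is a minimal sketch
in Python using only the standard library. The function names, the 50%
change threshold, the cutoff of 3 scaled median absolute deviations,
and the sample values are all assumptions made up for the example, not
recommendations; sensible thresholds depend on the kind of data.

```python
from statistics import median


def flag_large_changes(values, max_rel_change=0.5):
    """Return indices whose value differs from the previous period's
    value by more than max_rel_change (a fraction of the previous value).
    The 0.5 default is an arbitrary, hypothetical threshold."""
    flagged = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev == 0:
            # Relative change is undefined; flag any move away from zero.
            if cur != 0:
                flagged.append(i)
            continue
        if abs(cur - prev) / abs(prev) > max_rel_change:
            flagged.append(i)
    return flagged


def flag_outliers(values, k=3.0):
    """Return indices lying more than k scaled median absolute
    deviations (MAD) from the median. The cutoff k=3 is an assumption."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Spread cannot be estimated this way; flag nothing.
        return []
    scaled_mad = 1.4826 * mad  # makes MAD comparable to a std. deviation
    return [i for i, v in enumerate(values) if abs(v - med) / scaled_mad > k]


# Made-up yearly figures with one suspicious entry at index 3.
series = [2.1, 2.4, 2.2, 24.0, 2.3]
print(flag_large_changes(series))  # jump into and back out of index 3
print(flag_outliers(series))       # index 3 only
```

Note that the first check flags both the jump up and the drop back
down, so one bad value shows up as two flagged periods. Either check
only produces a list of candidates for inspection; whether a flagged
value is actually an error still has to be settled by someone who
knows the data.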
With respect to his anecdote about his colleague discovering numerous
errors in someone else's dataset, I fear this is probably far more
common (at least in social science) than is generally recognized. The
article I recently cited in
http://www.stata.com/statalist/archive/2008-09/msg01363.html
describes a study where the researchers replicated a year's worth of
articles from an economics journal and found them rife with such
errors.
The recommended solution, if I may repeat myself, is to carefully and
deliberately document one's research (do-files and otherwise) so that
the procedures and conclusions can be easily replicated by someone
else. This approach not only allows errors to be more easily detected
but also tends to prevent them in the first place.
David
At 3:32 PM +0100 10/8/08, Nick Cox wrote:
I too have wanted to find a theory of data cleaning, but in practice
it's mightily elusive. I think this is the most bottom-up part of
statistical science in which at best you have rules that work most of
the time for your kind of data.
A colleague worked with records on glaciers which supposedly had been
reviewed very carefully. He found many things that the quality control
had missed, including glaciers that were just in the wrong places, as
shown by a scatter of latitude and longitude; glaciers reported twice,
by different countries; and many Russian glaciers reported to face East
when they faced West and vice versa. (Apparently, Sergiy, that was a
transliteration/translation problem.) He found these things by slow
scrutiny and started building up ad hoc a list of things that could be
wrong.
--
David Radwin // [email protected]
Office of Student Research, University of California, Berkeley