I too have wanted to find a theory of data cleaning, but in practice
it's mightily elusive. I think this is the most bottom-up part of
statistical science, in which at best you have rules that work most of
the time for your kind of data.
A colleague worked with records on glaciers which supposedly had been
reviewed very carefully. He found many things that the quality control
had missed, including glaciers that were just in the wrong places, as
shown by a scatter of latitude and longitude; glaciers reported twice,
by different countries; and many Russian glaciers reported to face East
when they faced West and vice versa. (Apparently, Sergiy, that was a
transliteration/translation problem.) He found these things by slow
scrutiny and started building up ad hoc a list of things that could be
wrong.
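Two of those checks (positions that fall outside the expected region, and
the same glacier reported by more than one country) can be sketched
roughly as below. This is only an illustration in Python, not what my
colleague actually ran: the records, the field layout, the bounding box
and the rounding tolerance are all made up for the example.

```python
from collections import defaultdict

# Hypothetical records: (id, reporting country, latitude, longitude).
glaciers = [
    ("G1", "CH", 46.5, 8.0),
    ("G2", "CH", 4.65, 8.0),   # latitude probably mistyped
    ("G3", "IT", 45.9, 7.7),
    ("G4", "FR", 45.9, 7.7),   # same position as G3, different country
]

# Assumed plausible region for this batch (illustrative values only).
LAT_RANGE = (43.0, 49.0)
LON_RANGE = (5.0, 17.0)


def out_of_region(records, lat_range=LAT_RANGE, lon_range=LON_RANGE):
    """Return ids whose coordinates fall outside the expected box."""
    return [
        gid
        for gid, _, lat, lon in records
        if not (
            lat_range[0] <= lat <= lat_range[1]
            and lon_range[0] <= lon <= lon_range[1]
        )
    ]


def possible_duplicates(records, ndigits=2):
    """Group ids that share rounded coordinates; keep groups of 2 or more."""
    by_position = defaultdict(list)
    for gid, _, lat, lon in records:
        by_position[(round(lat, ndigits), round(lon, ndigits))].append(gid)
    return [ids for ids in by_position.values() if len(ids) > 1]


suspect_positions = out_of_region(glaciers)
suspect_duplicates = possible_duplicates(glaciers)
```

Anything flagged this way is a candidate for a closer look, not a
confirmed error; a scatter plot of latitude against longitude remains the
quickest way to see what such a filter would miss.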
As for gender having two known categories and one unknown category,
there are plenty of datasets in which that classification misses really
important distinctions.
Nick
Sergiy Radyakin wrote:
This is a more or less general question, not related to Stata itself
but to data processing. I wonder if anyone could point me to a good
source of heuristics/rules for checking data for
consistency/plausibility. I am looking for something like:
* age of a person must be within the range 0-120
* gender must have no more than 2 unique values
* person younger than NNN years may not be a mother
* if a person is reporting not working, wage must be missing/zero
* if a person is attending primary school, occupation may not be
  "manager"
* if a person is attending university, [s]he may not report being
  illiterate
etc.
Note that these are more or less flexible rules and there might be
exceptions. But if a rule is valid for 99% of cases, it's what I am
looking for.
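To make the idea concrete, here is a minimal sketch in Python of rules
treated as flags rather than hard constraints. The records, the field
names ("working", "wage", "school", "occupation") and the choice of rules
are hypothetical and only illustrate the examples listed above; the share
of records breaking each rule is reported so that a rule that fails far
more often than 1% of the time can itself be questioned.

```python
# Hypothetical survey records; field names are illustrative only.
people = [
    {"id": 1, "age": 34, "working": True, "wage": 1200,
     "school": None, "occupation": "manager"},
    {"id": 2, "age": 130, "working": False, "wage": None,
     "school": None, "occupation": None},
    {"id": 3, "age": 9, "working": False, "wage": 50,
     "school": "primary", "occupation": "manager"},
    {"id": 4, "age": 20, "working": False, "wage": 0,
     "school": "university", "occupation": None},
]

# Each rule returns True when the record looks plausible.
RULES = {
    "age in 0-120": lambda p: 0 <= p["age"] <= 120,
    "no wage if not working": lambda p: p["working"] or p["wage"] in (None, 0),
    "primary pupil not manager": lambda p: not (
        p["school"] == "primary" and p["occupation"] == "manager"
    ),
}


def flag_violations(records, rules=RULES):
    """Map each rule name to the ids of the records that break it."""
    return {
        name: [r["id"] for r in records if not is_ok(r)]
        for name, is_ok in rules.items()
    }


def violation_rates(records, rules=RULES):
    """Share of records breaking each rule (0.0 to 1.0)."""
    flags = flag_violations(records, rules)
    return {name: len(ids) / len(records) for name, ids in flags.items()}


flags = flag_violations(people)
rates = violation_rates(people)
```

Flagged records are listed for review rather than dropped, since any of
these rules may have legitimate exceptions.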
The context topics include economics
(employment/earnings/wages/sector/hours of work, etc.), education (years
of education/enrollment/completion), family structure and composition,
and other related topics commonly found in family, household, or labor
force surveys.
I believe a significant number of such checks are done by data
collectors before releasing the data to the public, and I wouldn't want
to reinvent the wheel here.