Thank you, Peter and Maarten.
I probably need to be more specific: I know how to implement e.g. a
range check, but I am looking for the rules themselves. E.g. what do
people check before they conclude that the value reported for e.g.
age is plausible/possible? For example, the DHS study considers birth
intervals inconsistent if they are less than 7 months: "For key events
in the respondent's life, dates have been imputed when the full date
of the event was not provided by the respondent or in some cases if
dates are inconsistent (e.g. less than 7 months between births)."
(quote from http://www.measuredhs.com/pubs/pdf/DHSG4/Recode1DHS.pdf)
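
To make the DHS-style rule concrete, here is a minimal sketch of such a check in Python. Only the 7-month threshold comes from the quote; the function names, the use of whole calendar months, and the exemption for identical birth dates (treated as twins) are my own assumptions, not the DHS procedure.

```python
from datetime import date

MIN_BIRTH_INTERVAL_MONTHS = 7  # threshold taken from the DHS quote


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (days are ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def short_birth_intervals(births):
    """Return consecutive pairs of births closer than the threshold.

    Identical dates are skipped as a multiple birth (an assumption).
    """
    ordered = sorted(births)
    flagged = []
    for previous, current in zip(ordered, ordered[1:]):
        if previous == current:
            continue  # assumed twins, not an inconsistency
        if months_between(previous, current) < MIN_BIRTH_INTERVAL_MONTHS:
            flagged.append((previous, current))
    return flagged
```

The function only reports the suspicious pairs; whether to impute, set to missing, or leave the dates alone is a separate decision.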
What is the preferred ranking of the rules? E.g. the data show that a
person is 5 years old and enrolled in a university. I have a feeling
that this is inconsistent. I'd probably trust age more than education
in this case: keep age=5, but set univ_attend to missing.
However, if the person answered a dozen other questions related to
the university curriculum, subjects and courses, then the age is
probably the incorrect value. In any case I will give the user an
opportunity to review all the detected questionable cases, so some
false positives are fine, but missed errors are not OK. In some cases
it's hard to say if there is an error at all: e.g. if incomes are all
reported in multiples of 100, I'd still like to notify the user that
there is some bunching going on.
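
A rough sketch of what I have in mind for these two cases, in Python. Everything here is hypothetical: the field names ("age", "univ_attend", "univ_answers"), the minimum age of 15 and the cutoff of 3 follow-up answers are placeholders for illustration, not established rules. Nothing is corrected automatically; the functions only produce material for the user to review.

```python
from typing import Optional


def flag_child_at_university(person: dict, min_age: int = 15,
                             strong_evidence: int = 3) -> Optional[dict]:
    """Flag a too-young university student and guess the suspect field.

    `person` is assumed to hold "age", "univ_attend" and "univ_answers"
    (the number of answered university follow-up questions).
    The record itself is never modified.
    """
    age = person.get("age")
    if age is None or not person.get("univ_attend") or age >= min_age:
        return None
    # Many detailed university answers make the age the likelier error;
    # otherwise the attendance answer is the one to doubt.
    answers = person.get("univ_answers", 0)
    suspect = "age" if answers >= strong_evidence else "univ_attend"
    return {"rule": "child_at_university", "suspect": suspect}


def heaping_share(values, base: int = 100) -> float:
    """Share of non-missing values that are exact multiples of `base`."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(1 for v in present if v % base == 0) / len(present)
```

A heaping share close to 1.0 would trigger a notice to the user rather than a correction, since rounded incomes are not necessarily wrong.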
I know this all really depends on the particular case and there is no
silver bullet. But just in case someone knows a good cookbook of such
checks, please let me know.
Does S. Juul really have this explained in his book? Or is it an
explanation of how replace/recode/assert/confirm work?
Thank you, Sergiy Radyakin
On Tue, Oct 7, 2008 at 5:46 PM, Lachenbruch, Peter
<[email protected]> wrote:
> A simple do file should work.
>
> * ages outside 0-120; age<. skips missing ages (may want to print those)
> list caseno age if age<0 | (age>120 & age<.)
> * a and b are the unique values (quote them if gender is a string)
> list caseno gender if gender~=a & gender~=b
> * mother is an indicator; NNN is the age threshold
> list caseno if age<NNN & mother==1
> etc.
>
> An interesting question is whether you want to correct these - e.g.
> convert them to missing or an error code (I first typed coed - but
> that's NOT what I meant!)
> In a study earlier this summer I did just this. Initially I printed all
> the missing value cases, but the data came from medical records and
> about half of 2000 cases were missing, so I simply didn't print, but
> gave a count for each variable.
> Some of the variables had many possible legal values (e.g., which of 30
> drugs were being taken), so the checking became very complicated -
> especially when the dosage and schedule were being checked.
>
> Svend Juul has a nice chapter on this in his book.
>
> Tony
>
> Peter A. Lachenbruch
> Department of Public Health
> Oregon State University
> Corvallis, OR 97330
> Phone: 541-737-3832
> FAX: 541-737-4001
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Sergiy
> Radyakin
> Sent: Tuesday, October 07, 2008 2:02 PM
> To: [email protected]
> Subject: st: Data consistency heuristics
>
> Hello All,
>
> this is a more or less general question, not related to Stata itself,
> but to data processing. I wonder if anyone could point me to a good
> source of heuristics/rules on checking the data for
> consistency/plausibility. I am looking for something like:
>
> * age of a person must be within the range 0-120
> * gender must have no more than 2 unique values
> * person younger than NNN years may not be a mother
> * if a person is reporting not working, wage must be missing/zero
> * if a person is attending primary school, occupation may not be
> "manager"
> * if a person is attending university, [s]he may not report being
> illiterate
> etc
>
> Note that these are more or less flexible rules and there might be
> exceptions. But if it is valid for 99% of cases - it's what I am
> looking for.
>
> The context topics include economics
> (employment/earnings/wages/sector/hours of work etc), education(years
> of educ/enrollment/completion), family structure and composition, and
> other related topics commonly found in family, household or labor
> force surveys.
>
> I believe a significant number of such checks are done by data
> collectors before releasing the data to the public, and I wouldn't
> want to reinvent the wheel here.
>
> Thank you, Sergiy Radyakin
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/