Re: st: complex data cleaning issue (well, complex for me)
Stephen,
The following assumes that you have already dealt manually with the
cases where a correction is spread over multiple records.
2008/4/29 Stephen Cox <[email protected]>:
> EXAMPLE A.
>
> employee# startdate startday enddate endday hours days
> 109123 07Aug07 Monday 09Aug07 Wednesday 21 3
> 109123 07Aug07 Monday 09Aug07 Wednesday -21 -3
...
> EXAMPLE B.
>
> employee# startdate startday enddate endday hours days
> 109123 07Aug07 Monday 09Aug07 Wednesday 21 3
> 109123 07Aug07 Monday 09Aug07 Wednesday -21 -3
> 109123 07Aug07 Monday 09Aug07 Wednesday 21 3
****
gen correction = (sign(days) == -1)
replace correction = 1 if sign(hours) == -1
replace hours = abs(hours)
replace days = abs(days)
duplicates tag employee startdate startday enddate endday hours days, gen(tag)
****
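The first four lines record which entries have negative values for
hours or days, and then replace those values by their absolute values,
so that a correction and the entry it corrects become exact duplicates.
The variable tag holds, for each observation, the number of other
observations that duplicate it. If you only have example A and B cases
left, this variable should take on the values 0, 1 and 2. You could
then do something like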
***
tab tag
drop if tag==1 /* this should get rid of all example A cases */
drop if tag==2 & correction == 0 /* this leaves you with one
observation if there are three identical entries, as in example B */
***
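To see what the two steps above do, here is a minimal sketch on toy data
built from examples A and B. Some things in it are my assumptions rather
than facts about your file: the variable is called employee (employee#
is not a legal Stata name), the dates are stored as strings, and the
employee number 109124 is made up so that the example B records do not
collide with the example A records.

```stata
* Toy data: 109123 is an example A pair, 109124 (hypothetical) an example B triple
clear
input long employee str7 startdate str9 startday str7 enddate str9 endday hours days
109123 "07Aug07" "Monday" "09Aug07" "Wednesday"  21  3
109123 "07Aug07" "Monday" "09Aug07" "Wednesday" -21 -3
109124 "07Aug07" "Monday" "09Aug07" "Wednesday"  21  3
109124 "07Aug07" "Monday" "09Aug07" "Wednesday" -21 -3
109124 "07Aug07" "Monday" "09Aug07" "Wednesday"  21  3
end

* Flag corrections, then strip the sign so corrections match their originals
gen correction = (sign(days) == -1)
replace correction = 1 if sign(hours) == -1
replace hours = abs(hours)
replace days = abs(days)

* tag = number of other observations identical on all listed variables
duplicates tag employee startdate startday enddate endday hours days, gen(tag)
list employee hours days correction tag, sepby(employee)

* Example A pairs have tag == 1; example B triples have tag == 2
drop if tag == 1
drop if tag == 2 & correction == 0
list employee hours days correction tag
```

On this toy data the example A pair disappears completely and one
record is left for the example B triple. Note that the surviving record
is the one flagged as a correction, with hours and days now positive.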
Note that this solution cannot be applied until you have dealt with
the cases where a correction is spread over multiple entries, since
these would show up as duplicates that are unrelated to the original
entry. Maybe someone who is more familiar with this kind of data can
suggest how to find and eliminate those.
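In the meantime, one rough way to at least list the groups that fit
neither example A nor example B is to count the correction records
within each group of identical entries. This is only a sketch for
manual review, not a fix. The toy data, the variable names ncorr and
review, and the omission of the date variables (for brevity) are my
assumptions; with real data the date variables belong in both varlists.

```stata
* Hypothetical toy data: employee 1 is an example A pair,
* employee 2 has one original entry and two correction records
clear
input long employee hours days
1  21  3
1 -21 -3
2  21  3
2 -21 -3
2 -21 -3
end

gen byte correction = (days < 0 | hours < 0)
replace hours = abs(hours)
replace days = abs(days)
duplicates tag employee hours days, gen(tag)

* Number of correction records within each group of identical entries
bysort employee hours days: egen ncorr = total(correction)

* Fine: unique non-corrections, or groups of 2 or 3 with exactly one correction
gen byte review = !((tag == 0 & ncorr == 0) | (inlist(tag, 1, 2) & ncorr == 1))
list if review, sepby(employee)
```

Here only the three records of employee 2 are listed. The same flag
also catches identical entries with no correction among them, which
drop if tag==1 would otherwise delete silently.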
HTH,
Eva