Folks - I have a data set I have to clean up (it was supposed to completely cleaned prior to coming to me, but that's just an opportunity for me to whinge). Data management in these ways is not my strength. Any help would be greatly appreciated. I have tried to communicate the issues as clearly as possible, but let me know if I need to add more detail.
The data reflects employee absenteeism.
Data structure.
The data are in long form.
For each employee, I have employee number (e.g., 109123) the start date (e.g., 07Aug07), start day (e.g., Monday) , end date (e.g., 09Aug07), end day (e.g., Wednesday), number of hours (e.g., 21), and number of days (e.g., 3). There is other info as well. Each case of absenteeism is on one line, with some employees being represented on one line only (only one recorded absenteeism entry), and other employees with multiple records of absenteeism, for example being absent on several different days across the year.
Unfortunately, for many records, there are then 'corrections'. (A correction could occur for several reasons, such as that entry no longer being regarded as absenteeism but sick leave). A correction shows up as a re-entry of that data with negative values for hours and days. Below is an example of the original entry, plus a correction.
EXAMPLE A.
employee# startdate startday enddate endday hours days
109123 07Aug07 Monday 09Aug07 Wednesday 21 3
109123 07Aug07 Monday 09Aug07 Wednesday -21 -3
What I want to do is find those matching pairs of records and remove them from my file.
But it gets a little more difficult also. Inspection indicates that quite a few of the corrections are are of the form above. In some cases however, there is a correction, and then a re-entry again! So it might look like this, below:
EXAMPLE B.
employee# startdate startday enddate endday hours days
109123 07Aug07 Monday 09Aug07 Wednesday 21 3
109123 07Aug07 Monday 09Aug07 Wednesday -21 -3
109123 07Aug07 Monday 09Aug07 Wednesday 21 3
For this reason, just finding the duplicate dates (using -duplicate-) and deleting both of them wont necessary work: sometimes I want to remove all the records (as in EXAMPLE A), sometimes I do not (EXAMPLE B, where I want to keep one record).
Problem extension 2.
Sometimes the original entry will be corrected over several records. For example, EXAMPLE A might be corrected as follows:
employee# startdate startday enddate endday hours days
109123 07Aug07 Monday 09Aug07 Wednesday 21 3
109123 07Aug07 Monday 07Aug07 Monday -7 -1
109123 08Aug07 Monday 08Aug07 Monday -7 -1
109123 09Aug07 Monday 09Aug07 Monday -7 -1
I think I might be able to deal with these by hand if necessary. I can't deal with all of it by hand (12000 records).
I will then need to collapse hours and days into the months in which they occurred (I think -reshape- and then -generate- new variables is the way to go there).... but removing the invalid data will be a start.
Cheers,
Stephen
______________________________________
Stephen Cox |
Faculty of Business | Queensland University of Technology | <http://www.bus.qut.com/> www.bus.qut.edu.au <http://www.bus.qut.edu.au/>
phone: +61 7 3138 1776 | Office: Z758 | email: [email protected] | CRICOS No. 00213J
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/