Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: Re: Cleaning messy data
From
Martyn Sherriff <[email protected]>
To
[email protected]
Subject
st: Re: Cleaning messy data
Date
Mon, 28 Nov 2011 16:53:20 +0000
I have a large dataset in which the dates have been entered as
freetext with random spaces, spellings and characters:
. list dpr dpr4 in 1/10
. list dpr dpr4 in 1/10
+-------------------------------------+
| dpr dpr4 |
|-------------------------------------|
1. | 2 yrs 5months 26 days 2y 5m 26d |
2. | 3 yrs 2 months 3y 2m |
3. | 1yr 9 months 1y 9m |
4. | 1 yr 8 months 1y 8m |
5. | 1 yr 11 months 28 days 1y 11m 28d |
|-------------------------------------|
6. | 1 yr 12 days 1y 12d |
7. | 3 yrs 3 months12 days 3y 3m 12d |
8. | 3yrs 4 months 26 days 3y 4m 26d |
9. | 1 yr 9mnths 8 days 1y 9m 8d |
10. | |
+-------------------------------------+
I have managed to convert dpr to dpr4 in which I have separated the
years months and days in a very pedestrian way:
gen dpr1= subinstr(dpr," y","y",.)
replace dpr1= subinstr(dpr1,"yrs","y",.)
replace dpr1= subinstr(dpr1,"yr","y",.)
replace dpr1= subinstr(dpr1,"year","y",.)
gen dpr2= subinstr(dpr1,"months","m",.)
replace dpr2= subinstr(dpr2,"month","m",.)
replace dpr2= subinstr(dpr2,"mnths","m",.)
replace dpr2= subinstr(dpr2," m","m",.)
gen dpr3= subinstr(dpr2," days","d",.)
replace dpr3= subinstr(dpr3,"days","d",.)
replace dpr3= subinstr(dpr3," d","d",.)
gen dpr4= subinstr(dpr3,"+","",.)
replace dpr4= subinstr(dpr4,"y","y ",.)
replace dpr4= subinstr(dpr4,"m","m ",.)
replace dpr4= subinstr(dpr4," "," ",.)
replace dpr4= subinstr(dpr4," "," ",.)
I now need to convert these values in dpr4 into a total number of
days. The problem I have is that in some cases either the year, month
or day is missing. I have to assume that a month is 28 days.
I was thinking of using 'split' and then processing the results based
on the character.
Is this a reasonable approach or is there a more efficient process?
I would be grateful for some advice on this step. I am using Stata 12
Thank you.
Martyn
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/