Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Re: Cleaning messy data
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Re: Cleaning messy data
Date
Mon, 28 Nov 2011 19:49:03 +0000
Interesting problem.
Having sometimes sung the song "regex functions are good, but more
basic string functions can be as good or better", I am now going to
reverse that.
This is a good test problem for -moss- (SSC)!
I started something like this, with the idea that the pattern was a
number, possibly a space and then a word:
. moss dpr, match("([0-9]+ ?[a-z]+)") regex
. drop _pos*
. l
+-----------------------------------------------------------------+
| dpr _count _match1 _match2 _match3 |
|-----------------------------------------------------------------|
1. | 2 yrs 5months 26 days 3 2 yrs 5months 26 days |
2. | 3 yrs 2 months 2 3 yrs 2 months |
3. | 1yr 9 months 2 1yr 9 months |
4. | 1 yr 8 months 2 1 yr 8 months |
5. | 1 yr 11 months 28 days 3 1 yr 11 months 28 days |
|-----------------------------------------------------------------|
6. | 1 yr 12 days 2 1 yr 12 days |
7. | 3 yrs 3 months12 days 3 3 yrs 3 months 12 days |
8. | 3yrs 4 months 26 days 3 3yrs 4 months 26 days |
9. | 1 yr 9mnths 8 days 3 1 yr 9mnths 8 days |
+-----------------------------------------------------------------+
That is a fair start, but there is still some cleaning up to do. An
alternative, which I prefer, extracts each element separately.
. keep dpr
. moss dpr , match("([0-9]+) ?y") regex prefix(y)
. moss dpr , match("([0-9]+) ?m") regex prefix(m)
. moss dpr , match("([0-9]+) ?d") regex prefix(d)
. drop *count *pos*
. l
+------------------------------------------------------+
| dpr ymatch1 mmatch1 dmatch1 |
|------------------------------------------------------|
1. | 2 yrs 5months 26 days 2 5 26 |
2. | 3 yrs 2 months 3 2 |
3. | 1yr 9 months 1 9 |
4. | 1 yr 8 months 1 8 |
5. | 1 yr 11 months 28 days 1 11 28 |
|------------------------------------------------------|
6. | 1 yr 12 days 1 12 |
7. | 3 yrs 3 months12 days 3 3 12 |
8. | 3yrs 4 months 26 days 3 4 26 |
9. | 1 yr 9mnths 8 days 1 9 8 |
+------------------------------------------------------+
The assumption of "at most one space" between number and word works
for the example, but might be too strong for the whole dataset.
Nick
On Mon, Nov 28, 2011 at 6:12 PM, daniel klein
<[email protected]> wrote:
> Martyn,
>
> based on what you have now, something like the following could work.
>
> replace dpr4 = subinstr(dpr4, "y", "*365", .)
> replace dpr4 = subinstr(dpr4, "m", "*28", .)
> replace dpr4 = subinstr(dpr4, "d", "", .)
> replace dpr4 = subinstr(dpr4, " ", "+")
> // note the missing spaces between "*" and "#"
>
> encode dpr4 ,g(new)
> qui su new ,mean
> forv j = 1/`r(max)' {
> qui replace new = `: lab new `j'' if new == `j'
> }
>
> This is a little ad-hoc and might neither be the most elegant nor the
> fastest solution.
>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/