Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Re: Cleaning messy data
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Re: Cleaning messy data
Date
Tue, 29 Nov 2011 08:13:15 +0000
For the record,
moss dpr, match("([0-9]+) *y") regex
is an example weakening the assumption about spaces to zero or more.
Nick
On Mon, Nov 28, 2011 at 7:49 PM, Nick Cox <[email protected]> wrote:
> Interesting problem.
>
> Having sometimes sung the song "regex functions are good, but more
> basic string functions can be as good or better", I am now going to
> reverse that.
>
> This is a good test problem for -moss- (SSC)!
>
> I started something like this, with the idea that the pattern was a
> number, possibly a space and then a word:
>
> . moss dpr, match("([0-9]+ ?[a-z]+)") regex
>
> . drop _pos*
>
> . l
>
> +-----------------------------------------------------------------+
> | dpr _count _match1 _match2 _match3 |
> |-----------------------------------------------------------------|
> 1. | 2 yrs 5months 26 days 3 2 yrs 5months 26 days |
> 2. | 3 yrs 2 months 2 3 yrs 2 months |
> 3. | 1yr 9 months 2 1yr 9 months |
> 4. | 1 yr 8 months 2 1 yr 8 months |
> 5. | 1 yr 11 months 28 days 3 1 yr 11 months 28 days |
> |-----------------------------------------------------------------|
> 6. | 1 yr 12 days 2 1 yr 12 days |
> 7. | 3 yrs 3 months12 days 3 3 yrs 3 months 12 days |
> 8. | 3yrs 4 months 26 days 3 3yrs 4 months 26 days |
> 9. | 1 yr 9mnths 8 days 3 1 yr 9mnths 8 days |
> +-----------------------------------------------------------------+
>
> That is a fair start, but there is still some cleaning up to do. An
> alternative, which I prefer, extracts each element separately.
>
> . keep dpr
>
> . moss dpr , match("([0-9]+) ?y") regex prefix(y)
>
> . moss dpr , match("([0-9]+) ?m") regex prefix(m)
>
> . moss dpr , match("([0-9]+) ?d") regex prefix(d)
>
> . drop *count *pos*
>
> . l
>
> +------------------------------------------------------+
> | dpr ymatch1 mmatch1 dmatch1 |
> |------------------------------------------------------|
> 1. | 2 yrs 5months 26 days 2 5 26 |
> 2. | 3 yrs 2 months 3 2 |
> 3. | 1yr 9 months 1 9 |
> 4. | 1 yr 8 months 1 8 |
> 5. | 1 yr 11 months 28 days 1 11 28 |
> |------------------------------------------------------|
> 6. | 1 yr 12 days 1 12 |
> 7. | 3 yrs 3 months12 days 3 3 12 |
> 8. | 3yrs 4 months 26 days 3 4 26 |
> 9. | 1 yr 9mnths 8 days 1 9 8 |
> +------------------------------------------------------+
>
> The assumption of "at most one space" between number and word works
> for the example, but might be too strong for the whole dataset.
>
> Nick
>
> On Mon, Nov 28, 2011 at 6:12 PM, daniel klein
> <[email protected]> wrote:
>> Martyn,
>>
>> based on what you have now, something like the following could work.
>>
>> replace dpr4 = subinstr(dpr4, "y", "*365", .)
>> replace dpr4 = subinstr(dpr4, "m", "*28", .)
>> replace dpr4 = subinstr(dpr4, "d", "", .)
>> replace dpr4 = subinstr(dpr4, " ", "+")
>> // note the missing spaces between "*" and "#"
>>
>> encode dpr4 ,g(new)
>> qui su new ,mean
>> forv j = 1/`r(max)' {
>> qui replace new = `: lab new `j'' if new == `j'
>> }
>>
>> This is a little ad-hoc and might neither be the most elegant nor the
>> fastest solution.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/