Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Fwd: Fastest way to identify values that start and end with a 9?
From: Sergiy Radyakin <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: Fwd: Fastest way to identify values that start and end with a 9?
Date: Thu, 3 Oct 2013 11:16:58 -0400
Paul (or Evan) is using a dataset where missing values (DK and REF)
are coded as values of the type 9...9 and 9...8. This is similar to
the convention used in the DHS datasets; see e.g. page 3 of:
http://www.measuredhs.com/pubs/pdf/DHSG4/Recode6_DHS_22March2013_DHSG4.pdf
Paul (Evan) must check with the data provider whether the other
convention also holds: that each missing-value code is at least one
digit wider than the widest possible valid value of the variable.
Otherwise, e.g. if 999 is the missing code for age, the ages 9 and 99
will also be caught by the recoding schemes he is using, which are
based on the proposed regular expressions.
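To illustrate the problem, here is a minimal sketch on made-up data. The variable names (age, age_s, all9, miss9) and the choice of 999 as the missing code for age are assumptions for illustration only, not taken from Paul's dataset.

```stata
* Hypothetical toy data: age, with 999 assumed to be the missing code
clear
input int age
9
35
99
999
end

* String copy of the variable, as in the tostring-based approach
generate str3 age_s = string(age)

* Pattern-only rule: flags every value that consists only of 9s
generate byte all9 = regexm(age_s, "^9+$")

* Same rule, restricted to the assumed width of the missing code (3 digits)
generate byte miss9 = regexm(age_s, "^9+$") & strlen(age_s) == 3

list, noobs
```

With the pattern-only rule the valid ages 9 and 99 are flagged together with 999; adding the width condition leaves only 999. The width itself still has to come from the codebook or the data provider.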
In general, I don't see how he will be able to determine which 9...9
patterns really correspond to missing values without prior knowledge
of the variable contents, instructions from the data provider, or
careful inspection of the individual values of each variable to
determine the range of widths of its values. Incomes of 99 USD or
998 USD might in the end be actual data, etc.
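As an aid to that kind of inspection, the following sketch lists, for each numeric variable, the digit width of its largest value and how often the 9...9 and 9...8 codes of that width occur. It works on the numeric variables directly, without tostring. The toy data and all names are hypothetical, and it assumes non-negative integer values; the output only suggests candidates and does not decide which of them are truly missing.

```stata
* Hypothetical toy data: income 998 is a real value, 99998/99999 are codes
clear
input int age long income
35 1200
99 998
999 99999
42 99998
end

ds, has(type numeric)
local numvars `r(varlist)'

foreach v of local numvars {
    quietly summarize `v'
    * skip variables that are empty or too small to hold a 9-type code
    if r(N) == 0 | r(max) < 9 {
        continue
    }
    * number of digits in the largest observed value
    local w = strlen(string(r(max), "%18.0f"))
    local dk  = 10^`w' - 1        // candidate DK code, e.g. 999 for width 3
    local ref = `dk' - 1          // candidate REF code, e.g. 998
    quietly count if `v' == `dk'
    local n_dk = r(N)
    quietly count if `v' == `ref'
    local n_ref = r(N)
    display "`v': width `w', `dk' occurs `n_dk' times, `ref' occurs `n_ref' times"
}
```

In this toy example income 998 is not counted as a candidate, because only the codes as wide as the widest observed value are checked. Whether that widest width is really reserved for missing codes is exactly what needs to be confirmed with the data provider.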
Best, Sergiy Radyakin
On Thu, Oct 3, 2013 at 5:08 AM, Evan DeFilippis <[email protected]> wrote:
> Values in my data set contain different numerical representations for
> "Don't Know" and "Refusal"
>
> A "Don't Know" will always start and end with a '9', but there can be
> as many '9's in between as possible, up to the maximum length of a
> string (244).
>
> A "Refusal" will always start with a '9' and end with an '8', and
> there can be as many '9's in between as possible, up to the maximum
> length of a string (244).
>
> The data set contains strings, integers, bytes, etc.
>
> I want to be able to convert the numerical representations of "Don't
> Know" and "Refusal" into DK and REF, respectively.
>
> My current strategy for doing this looks like so:
>
> quietly tostring _all, replace
> ds, has(type string)
> di "`r(varlist)'"
> unab string_vars : `r(varlist)'
> foreach j in `string_vars' {
> quietly replace `j' = regexr(`j', "^9+$", "DK")
> quietly replace `j' = regexr(`j', "^9+8$", "REF")
> }
>
> However, this is slow because it converts the entire data set into
> strings, which takes about 5 minutes, and then it has to do has(type
> string) in order to get r(varlist) to iterate over all those strings
> which takes about 4 minutes.
>
> Is there a faster way to do this that perhaps does not involve
> converting everything to strings?
>
> Paul
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/