Yesterday Jeph Herrin asked a question that he, Martin Weiss and I found
an interesting challenge. Subtracting various minor hiccups and
histrionics, and adding a few glosses, a "rational reconstruction" of
the thread follows that is to be considered a joint product from us all.
We hope that it may prove interesting or useful to others.
Nick
[email protected]
Problem
=======
I have a dataset in which many variables are structured in a way that is
difficult to handle. If a question has multiple checkboxes as possible
answers, the response is stored as a string, with a number indicating
each box checked and numbers separated by colons.
Thus in one simple example
myvar
1:2:3:5:6:7:8:9
1:2:3:6
1:2:3:4:5:7:8:9
1:2:3:5:7:9
1:2:3:5:7:8:9
2:3:4:6:9
1:2:3:5:6:7:8:9
1:2:7:8:9
7:9
This variable takes 9 values. I want to map the information to 9
different indicator variables, myvar_1-myvar_9, each indicating whether
that number was selected.
-split- does not help much, because of the differing number of values
per string. That is, it produces myvar_1 which equals "7" for the last
observation.
So I am looking for a way to check whether a given string contains a
given integer. A loop using -strpos()- would work with this example
forv i=1/9 {
gen byte myvar_`i'= strpos(myvar, "`i'") > 0
}
but the code above would fall over with strings with two or more digits,
as, for example, searching for "1" returns true even if there is only
"11".
Solutions
=========
1. Use regular expressions.
forv i=1/9 {
gen byte myvar_`i'=
regexm(myvar,"^`i'$|^`i':|:`i':|:`i'$")
}
In words, this looks for a specified string on its own, OR at the start
of the value followed by a colon, OR such a string between colons, OR
such a string at the end of the value preceded by a colon. 9 here could
be 29 or 39 or whatever is the highest indicator variable you will need.
2. You could use -strpos()-, but you need to work backwards and ensure
That, once seen, longer strings are zapped to avoid being found later
and acting as false positives.
clonevar work = myvar
qui forval i = 29(-1)1 {
gen myvar_`i' = strpos(work, "`i'") > 0
replace work = subinstr(work, "`i'", "", .)
}
We work with a copy of -myvar- because we are going to change the data.
A
test of this method is that at the end -work- should contain only
colons. As above, 29 is just an example, as whatever highest indicator
variable is needed.
3. You could use -split-, but you need some processing afterwards.
split myvar, generate(comp) parse(:) destring
egen rowmax = rowmax(comp*)
su rowmax, meanonly
forv i=1/`r(max)'{
egen byte my`i' = anymatch(comp*), values(`i')
}
A merit of this solution is that it automatically finds the highest
number specified (as the column maximum of the row maxima). Naturally,
if there were 32 boxes but no respondent picked 30, 31, 32, that
would lie beyond the reach of this code. But if we knew in advance that
there were 32 boxes that respondents could possibly check, we could cut
the code down to
split myvar, generate(comp) parse(:) destring
forv i=1/32 {
egen byte my`i' = anymatch(comp*), values(`i')
}
Morals
======
There's more than one way to do it.
Know loops, functions, -egen- functions and data management commands
such as -split- and -clonevar-.
-regexm()- and friends are there when all else fails.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/