Re: st: Data management: code to be able to..
On Jun 3, 2007, at 3:09 PM, S J wrote:
I have a string identifier variable of the form:
id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"
The variable id thus has both the name of the locality in question
(say, HOUSTON), and an identifying code (say, 078).
How can I generate a new variable, idcode, that captures only the
numeric component of id, so that for the above 4 cases I get the
values below:
idcode
1
60
78
112
If you can assume that in all cases the ID code is preceded by a
space *and* has no embedded spaces itself (that is, there are no cases
like "CHICAGO 112 09", where the ID code is "112 09"), then the
following will work:
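gen idcode = real(substr(trim(id),-strpos(reverse(trim(id))," ")+1,.))
Note that this allows for the possibility that (1) the ID code is
variable in length, and (2) the entire string has trailing spaces.
The call to -real()- converts the extracted text (e.g., "078") to the
numeric value you asked for (78); drop it if you would rather keep
idcode as a string with its leading zeros.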
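To see why the expression works, it may help to trace it one step at a time on the four example values. The following is only a sketch: the helper variables s, pos, and code_s are hypothetical names introduced for illustration, and the data are retyped from the question.

```stata
* Sketch only: rebuild the four example values from the question.
clear
input str30 id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"
end

* Strip leading and trailing spaces.
gen s = trim(id)

* Position of the first space in the reversed string, i.e. how far
* the last space sits from the end ("870 NOTSUOH" gives 4).
gen pos = strpos(reverse(s), " ")

* A negative start counts from the end of the string, so -pos+1
* picks up everything after the last space ("078").
gen code_s = substr(s, -pos + 1, .)

* Convert the extracted text to a number (78).
gen idcode = real(code_s)

list id code_s idcode, clean
```

If an id contains no space at all, -strpos()- returns 0, the whole string is extracted, and -real()- returns missing unless that string happens to be a number, so such cases are worth checking with -count if missing(idcode)-.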
BTW, this is a situation where it would be nice to have a function
that returned the nth subexpression from a regular expression match
variable-wise (as opposed to the function -regexs()-, which, I
believe, can only handle one observation at a time). For example, if
there were such a function (call it -foo()-) and if we could assume
that the ID code were always an integer, we could write something
like the following:
gen idcode = foo(id,"^.+[ ]+([0-9]+)[ ]*",1)
where we are setting idcode equal to the first subexpression of the
match. Although the first approach based on -substr()- is probably
adequate in this case, the latter approach is -- for those used to
working with regular expressions -- much more readable.
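That said, something close to this can be written with the existing functions: -regexs()- returns a subexpression from the most recent -regexm()- match, and when -regexm()- sits in the -if- qualifier of -generate-, the pair is evaluated observation by observation. A minimal sketch, assuming the ID code is always an integer at the end of the string (the closing $ anchor is my addition to the pattern above):

```stata
* Sketch only: same four example values as in the question.
clear
input str30 id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"
end

* regexm() tests each observation; regexs(1) returns the first
* parenthesized subexpression of that match, and real() makes it numeric.
gen idcode = real(regexs(1)) if regexm(id, "^.+[ ]+([0-9]+)[ ]*$")

list id idcode, clean
```

Observations that do not match the pattern are left missing, which makes unexpected id formats easy to spot.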
-- Phil