Re: st: Data management: code to be able to..

From	Phil Schumm <[email protected]>
To	[email protected]
Subject	Re: st: Data management: code to be able to..
Date	Sun, 3 Jun 2007 21:08:48 -0500

On Jun 3, 2007, at 3:09 PM, S J wrote:

I have a string identifier variable of the form:

id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"

The variable id thus has both the name of the locality in question (say, HOUSTON), and an identifying code (say, 078).

How can I generate a new variable, idcode, that captures only the numeric component of id, so that I get, for the four cases above, the values below:

idcode
1
60
78
112

If you can assume that in all cases the ID code is preceded by a space *and* contains no embedded spaces itself (which rules out, e.g., "CHICAGO 112 09" where the ID code is "112 09"), then the following will work:

gen idcode = real(substr(trim(id),-strpos(reverse(trim(id))," ")+1,.))

Note that this allows for the possibility that (1) the ID code is variable in length, and (2) the entire string has trailing spaces.
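
For a quick check, the expression can be built up in steps on the four IDs from the question. This is only a sketch: the intermediate variables pos and idstr are names made up for illustration, and the final -real()- call assumes that a numeric idcode is wanted, as the desired output in the question suggests.

```stata
* Test data: the four example IDs from the question
clear
input str30 id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"
end

* Position of the last space, counted from the end of the trimmed string
* (pos is a hypothetical helper variable)
gen pos = strpos(reverse(trim(id)), " ")

* Everything after that space, still a string such as "078"
* (idstr is a hypothetical helper variable)
gen idstr = substr(trim(id), -pos + 1, .)

* Convert the extracted text to a number, so that "078" becomes 78
gen idcode = real(idstr)

list id idstr idcode
```

An ID with no space at all gives pos equal to 0, in which case -substr()- returns the whole string, so that case may need separate handling.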

BTW, this is a situation where it would be nice to have a single function that returned the nth subexpression from a regular expression match (as opposed to the function -regexs()-, which only returns a subexpression from the most recent -regexm()- match and so has to be paired with a call to -regexm()-). For example, if there were such a function (call it -foo()-) and if we could assume that the ID code were always an integer, we could write something like the following:

gen idcode = foo(id,"^.+[ ]+([0-9]+)[ ]*",1)

where we are setting idcode equal to the first subexpression of the match. Although the first approach based on -substr()- is probably adequate in this case, the latter approach is -- for those used to working with regular expressions -- much more readable.
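
For comparison, a similar result can be sketched with the existing -regexm()- and -regexs()- functions, since -generate- evaluates the -if- qualifier and the expression observation by observation. The trailing $ anchor and the -real()- conversion are additions to the pattern above, and the data are simply the four IDs from the question.

```stata
* Test data: the four example IDs from the question
clear
input str30 id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"
end

* regexm() tests the pattern for each observation; regexs(1) then returns
* the first parenthesized subexpression from that match. The $ anchor
* (an addition) requires the digits to end the string, apart from
* trailing spaces.
gen idcode = real(regexs(1)) if regexm(id, "^.+[ ]+([0-9]+)[ ]*$")

list id idcode
```

Observations that do not match the pattern are left with a missing idcode, which makes unexpected IDs easy to spot.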

-- Phil
