Statalist The Stata Listserver



RE: st: Data management: code to be able to..


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: Data management: code to be able to..
Date   Mon, 4 Jun 2007 09:31:30 +0100

I use regular expressions all the time in my favourite text
editor, but in Stata I find that the provision of
string functions is rich enough that I rarely 
need to reach for my regex. 

I take it that the numeric identifier is the 
last "word" of -id-. Set aside whatever you learnt 
about English or any other non-computer language
you speak. Words in Stata are separated by
spaces, except that double quotes (and compound
double quotes) within strings bind more tightly than spaces
separate. Let us focus just on parsing on spaces, 
until we hear that we need to worry about quotes 
within strings. 

My solution is 

gen numid = word(id, -1) 

This is equivalent to Phil's

... = substr(trim(id),-strpos(reverse(trim(id))," ")+1,.)

and returns the last word of -id- as a string, using
the convention that a negative word number counts
from the end. If SJ wants that as a number, stripping
the leading zero characters, then 

gen numid = real(word(id, -1)) 

will do it. Note that any trailing spaces are 
ignored automatically, which is presumably
what is wanted, and there are no hidden 
assumptions about the length of the numeric id. 
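
As a check, here is a minimal sketch that types in the four 
example values from the original question and applies both 
forms. The -input- block, the str30 width and the variable 
names -strid- and -numid2- are my assumptions for illustration 
only; the -assert- line checks the claimed agreement with 
Phil's -substr()- expression on these four values. 

```stata
clear
input str30 id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"
end

* last word of id, kept as a string (leading zeros preserved)
gen strid = word(id, -1)

* same thing converted to a number (leading zeros dropped)
gen numid2 = real(word(id, -1))

* check agreement with the substr() expression on these data
assert strid == substr(trim(id), -strpos(reverse(trim(id)), " ") + 1, .)

list id strid numid2
```

Note that -real()- returns missing for anything it cannot read 
as a number, so a quick -count if missing(numid2)- afterwards 
is a cheap way to spot identifiers that do not end in a number. 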

Nick 
[email protected] 

Phil Schumm
 
> If you can assume that in all cases the ID code is preceded by a
> space *and* has no embedded spaces itself (a counterexample would be
> "CHICAGO 112 09", where the ID code is "112 09"), then the following
> will work:
> 
> 
> gen idcode = substr(trim(id),-strpos(reverse(trim(id))," ")+1,.)
> 
> 
> Note that this allows for the possibility that (1) the ID code is  
> variable in length, and (2) the entire string has trailing spaces.
> 
> BTW, this is a situation where it would be nice to have a function
> that returned the nth subexpression from a regular expression match
> variable-wise (as opposed to the function -regexs()-, which, I
> believe, can only handle one observation at a time).  For example, if
> there were such a function (call it -foo()-) and if we could assume
> that the ID code were always an integer, we could write something
> like the following:
> 
> gen idcode = foo(id,"^.+[ ]+([0-9]+)[ ]*",1)
> 
> where we are setting idcode equal to the first subexpression of the  
> match.  Although the first approach based on -substr()- is probably  
> adequate in this case, the latter approach is -- for those used to  
> working with regular expressions -- much more readable.
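
For those who do prefer regular expressions, -regexm()- and 
-regexs()- can be combined in a single -generate-, because both 
are evaluated observation by observation. What follows is only a 
sketch, assuming that the ID code is a run of digits at the end 
of -id-, possibly followed by spaces. The pattern, the -input- 
block and the name -idcode- are illustrative and are not taken 
from either posting above. 

```stata
clear
input str30 id
"LOCALITY_NAMED_ABCD 001"
"LOCALITY_NAMED_F 060"
"HOUSTON 078"
"SAN ANTONIO 112"
end

* regexm() matches a final run of digits; regexs(1) returns the
* first parenthesised subexpression of that match, per observation
gen idcode = real(regexs(1)) if regexm(id, "([0-9]+)[ ]*$")

list id idcode
```

Observations that do not match the pattern are left missing 
by the -if- qualifier. 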

S J 

> > I have a string identifier variable of the form:
> >
> > id
> > "LOCALITY_NAMED_ABCD 001"
> > "LOCALITY_NAMED_F 060"
> > "HOUSTON 078"
> > "SAN ANTONIO 112"
> >
> > The variable id thus has both the name of the locality in question  
> > (say, HOUSTON), and an identifying code (say, 078).
> >
> > How can I generate a new variable, idcode, that only captures the
> > numeric component of id, so that I get, for the above 4 cases, the
> > values below:
> >
> > idcode
> > 1
> > 60
> > 78
> > 112
> 



