The Stata listserver

st: RE: Re: Splitting a string variable


From   "Frank de Libero" <[email protected]>
To   <[email protected]>
Subject   st: RE: Re: Splitting a string variable
Date   Tue, 6 Sep 2005 13:57:27 -0700

The loop isn't necessary. Stata evaluates an -if- qualifier observation by
observation, so a single -replace- works:

replace id = regexs(2) if regexm(id,"^(0)+(.+)")

or, making the distinction between [] and () in regular expressions,

replace id = regexs(1) if regexm(id,"^[0]+(.+)")
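
To check the one-line version against the ids from the original question, something like the following can be run. It is only a sketch: it assumes Stata 9 or later (for the regular expression functions), and typing the data in with -input- is just for illustration.

```stata
* Rebuild the four example ids from the original question
clear
input str5 id
"0008"
"0020"
"016A"
"0160C"
end

* Strip leading zeros in one pass; observations that do not match are left alone
replace id = regexs(1) if regexm(id, "^[0]+(.+)")

list, clean noobs
```

The listing should show 8, 20, 16A and 160C, which is the result asked for, with the trailing zeros intact.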

BTW, Kevin developed the Stata implementation of the three regular
expression functions in version 9 and did a really nice job.

..Frank

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Kevin Turner
Sent: Tuesday, September 06, 2005 11:31 AM
To: [email protected]
Subject: st: Re: Splitting a string variable

Raphael Fraser ([email protected]) writes:

>I have a string variable of the type listed below:
>
>id
>0008
>0020
>016A
>0160C
>
>How do I remove the leading zeros from this variable? I tried using
>the -split- command, but it removed both leading and trailing zeros.
>The end result should look like this:
>
>id
>8
>20
>16A
>160C

The presence of sporadic letters and trailing zeros causes problems, but
Stata's new regular expression functions handle this easily. The solution
is a loop over the observations: regexm() tests each value for a match and,
if it matches, regexs() pulls out the subexpression that matches the
non-leading-zero portion of the string.

local obs = _N
forvalues x = 1(1)`obs' {
	if (regexm(id[`x'], "^[0]+(.+)")) {
		replace id = regexs(1) in `x'	/* grab first subexpression */
	}
}

A few comments on regular expression syntax:

1) The string "^[0]+(.+)" matches one or more leading zeros, and then one
   or more characters until the end.
2) ^ represents beginning of string
3) [] denotes a set of characters to match, in this case just zeros 
4) + denotes a 'one or more' match of the previous expression
5) () denote a subexpression
6) . will match any character
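
How these pieces combine is easiest to see on a few edge cases. The sketch below is illustrative only: the extra values ("0000", "0", "A001") and the variable name stripped are made up for the example and are not from the original question.

```stata
* Hypothetical test values chosen to exercise each part of the pattern
clear
input str5 id
"0160C"
"0000"
"0"
"A001"
end

* Keep the original and write the result to a new variable for comparison
generate str5 stripped = id
replace stripped = regexs(1) if regexm(id, "^[0]+(.+)")

list, clean noobs
```

Here "0160C" becomes "160C". "0000" becomes "0", because (.+) must keep at least one character, so the zeros pattern gives one back. A lone "0" does not match at all and stays as it is, and "A001" is untouched because ^ requires the zeros to start the string.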

We also had to construct a loop over the observations because we needed a
pair of function calls to operate on each individual observation.

Hope this helps!
--Kevin 
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
