From: Phil Schumm <[email protected]>
To: [email protected]
Subject: Re: st: problem with regexm leading to "regexp: unterminated ()" error for all observations
Date: Fri, 3 Jun 2011 12:10:53 -0500

On Jun 3, 2011, at 7:35 AM, Jamie Fagg wrote:
> I've a problem with the function -regexm-. I get the following
> message:
> regexp: unterminated ()
<snip>
> #delimit ;
> //regular expression to define whether postcode is syntactically
> correct
> ge postcodevalid = 1 if regexm(postcode,"(GIR 0AA)|(((A[BL]|
> B[ABDHLNRSTX]
> ?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]
> |I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]
> |R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]
> |((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9]
> [0-9])|EC[1-9][0-9]) [0-9][ABD-HJLNP-UW-Z]{2})")==1;

I'm not sure why Stata chokes on this, though I would suspect it might
have something to do with the length. As Nick and Steven have already
noted, the repeat qualifier {n} is not supported by Stata's regular
expression syntax, so you'll need to replace
[ABD-HJLNP-UW-Z]{2}
with the equivalent
[ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
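As a minimal sketch of that substitution, the snippet below writes the two-letter ending with the character class repeated twice. The toy data, the variable name inward and the macro name unit are assumptions for illustration, not part of the original code.

```stata
* Toy data (hypothetical): three candidate inward codes
clear
input str3 inward
"9AA"
"9CA"
"9A"
end

* {2} is not available, so the class is simply written out twice
loc unit "[ABD-HJLNP-UW-Z]"
gen byte ok = regexm(inward, "^[0-9]`unit'`unit'$")

* "9AA" should be flagged; "9CA" (C is excluded) and "9A" (too short) should not
list inward ok
```

Holding the class in a macro keeps the repeated pattern readable and means the letter set only has to be edited in one place.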
Now, Nick suggested breaking the expression up, so let's do that.
Your expression is equal to
(p1)|(((p2a1a|p2a1b|p2a1c)p2a1d|p2a2|p2a3|p2a4)p2b)
where the individual parts (as assigned to Stata macros) are
loc p1 "GIR 0AA"
loc p2a1a "A[BL]|B[ABDHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?"
loc p2a1b "H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]"
loc p2a1c "P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE"
loc p2a1d "[1-9]?[0-9]"
loc p2a2 "((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]"
loc p2a3 "(SW|W)([2-9]|[1-9][0-9])"
loc p2a4 "EC[1-9][0-9]"
loc p2b " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
This may then be easily broken up as follows:
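* each alternation is wrapped in parentheses so that the district (p2a1d)
* and inward-code (p2b) suffixes apply to every alternative, not just the last
gen byte valid = regexm(postcode,"`p1'")
replace valid = 1 if regexm(postcode,"(`p2a1a')`p2a1d'`p2b'")
replace valid = 1 if regexm(postcode,"(`p2a1b')`p2a1d'`p2b'")
replace valid = 1 if regexm(postcode,"(`p2a1c')`p2a1d'`p2b'")
replace valid = 1 if regexm(postcode,"(`p2a2'|`p2a3'|`p2a4')`p2b'")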
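One point worth checking when fragments containing | are spliced into a longer pattern: alternation has the lowest precedence, so a fragment generally needs its own parentheses. Separately, regexm() matches anywhere in the string unless the pattern is anchored with ^ and $. The sketch below uses shortened, hypothetical fragments (the macros area and dist) rather than the full postcode lists.

```stata
* Toy fragments (hypothetical; far shorter than the real postcode lists)
loc area "A[BL]|ZE"
loc dist "[0-9]"

* Ungrouped: the pattern reads as  A[BL]  OR  ZE[0-9],
* so "AB" matches even though it has no digit
display regexm("AB", "`area'`dist'")            // 1

* Grouped: the digit is required after either area code
display regexm("AB", "(`area')`dist'")          // 0
display regexm("AB1", "(`area')`dist'")         // 1

* Unanchored patterns match a substring; ^ and $ force a whole-string match
display regexm("XAB1X", "(`area')`dist'")       // 1
display regexm("XAB1X", "^(`area')`dist'$")     // 0
```

Whether to anchor depends on the data: if postcode may carry leading or trailing text, the unanchored form flags any value that merely contains a valid postcode.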
-- Phil