Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: regexs and regexm
From: Simon Falck <[email protected]>
To: "[email protected]" <[email protected]>
Subject: st: regexs and regexm
Date: Thu, 03 Oct 2013 14:22:23 +0200
Dear Statalist,
Using Stata 11.2, I want to extract portions of a string variable using
the regular expression functions -regexs- and -regexm-.
This job is a bit tricky because the string variable contains several
different types of expressions, lengths, and sometimes spaces, with
information that looks something like this,
string variable
UK/FI/EI
PMSE NO(20)
PMSE NO(20),EI(5),GE(35),CN(20)
PMSE2004 NO(50),EI(10),GE(30),UK(30),SW(30)
POLARLIS FR(220)
LIDAR_GPS NI(20),NO(20)
IASK SE(60),NO(20),UK(20)
What I want is to decompose the string variable and extract its parts
into new columns, such as,
var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12
UK FI EI
PM SE NO 20
PM SE NO 20 EI 5 GE 30
UK 30 SE 30
As I understand it, one way of doing this is to use Stata's regular
expression functions -regexs- and -regexm-, e.g.:
gen x1 = regexs(1)+ regexs(2) if regexm(expnamn, "([a-zA-Z])([a-zA-Z]+)")
gen x2 = regexs(1)+ regexs(2) if regexm(expnamn, "([0-9]+)*([0-9]+)")
..and so on..
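One possible direction, as a minimal sketch rather than a complete solution: peel off one CODE(number) pair at a time in a loop, removing each pair from a working copy of the string with -regexr- once it has been extracted. The variable name expnamn is taken from the attempt above; the names rest, code1-code5 and num1-num5 are made up for illustration, and the upper limit of 5 pairs is an assumption based on the longest example line. It does not handle the slash-separated form (UK/FI/EI) or the leading name (PMSE, IASK, ...).

```stata
* Sketch only: two example lines from the sample data above
clear
input str60 expnamn
"PMSE NO(20),EI(5),GE(35),CN(20)"
"IASK SE(60),NO(20),UK(20)"
end

* Working copy from which matched pairs are removed one by one
generate str60 rest = expnamn

* Assumption: at most 5 CODE(number) pairs per observation
forvalues i = 1/5 {
    * Two capital letters followed by a number in parentheses
    generate str2 code`i' = regexs(1) if regexm(rest, "([A-Z][A-Z])\(([0-9]+)\)")
    generate num`i' = real(regexs(2)) if regexm(rest, "([A-Z][A-Z])\(([0-9]+)\)")
    * Drop the pair just extracted so the next pass finds the following one
    replace rest = regexr(rest, "[A-Z][A-Z]\([0-9]+\)", "")
}

list code1 num1 code2 num2 code3 num3
```

Observations with fewer pairs than the loop limit are simply left missing in the later code/num variables.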
However, since the contents of the string variable vary so much in
form, this task appears far more complex than I first thought,
and I am unable to construct a proper script to decompose the string
variable in an efficient way.
Any suggestions?
Thanks in advance,
Simon