Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: regexs and regexm
From: Robert Picard <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: regexs and regexm
Date: Thu, 3 Oct 2013 09:16:50 -0400
You can use -moss- (from SSC) to split your string variable using a
regex pattern. Here are two ways of splitting your string:
Robert
* ----------------- begin example ---------------
clear
input str80 s
"UK/FI/EI"
"PMSE NO(20)"
"PMSE NO(20),EI(5),GE(35),CN(20)"
"PMSE2004 NO(50),EI(10),GE(30),UK(30),SW(30)"
"POLARLIS FR(220)"
"LIDAR_GPS NI(20),NO(20)"
"IASK SE(60),NO(20),UK(20)"
end
* match any pair of uppercase letters or pair of digits
moss s, match("([A-Z][A-Z]|[0-9][0-9])") regex
* match anything that is not a delimiter
moss s, match("([^ \(\),/]+)") regex pre(v_)
* ----------------- end example -----------------
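If installing -moss- from SSC is not an option, roughly the same token-by-token split can be sketched with the built-in -regexm()- and -regexs()- functions alone. This is only a minimal sketch under stated assumptions, not a description of how -moss- is implemented: the variable names rest and tok1, tok2, ... are made up for the illustration, and the pattern is the second one above with a trailing (.*) added to capture the remainder of the string.

```stata
clear
input str80 s
"UK/FI/EI"
"PMSE NO(20),EI(5),GE(35),CN(20)"
end

* working copy that shrinks as tokens are peeled off (hypothetical name)
gen str80 rest = s
local i 0
local more 1
while `more' {
    local ++i
    * the first run of non-delimiter characters becomes token i
    gen str80 tok`i' = regexs(1) if regexm(rest, "([^ \(\),/]+)(.*)")
    * keep only what follows that token
    replace rest = regexs(2) if regexm(rest, "([^ \(\),/]+)(.*)")
    * stop once no observation has a token left
    quietly count if regexm(rest, "([^ \(\),/]+)(.*)")
    local more = r(N) > 0
}
drop rest
list
```

As with the second -moss- call, a name such as PMSE stays in one piece; splitting it into two-character pairs would need a pattern like the first one instead.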
On Thu, Oct 3, 2013 at 8:22 AM, Simon Falck <[email protected]> wrote:
> Dear Statalist,
>
> Using Stata 11.2, I want to extract a portion of a string variable using
> regular expressions, i.e. -regexs- and -regexm-
>
> This job is a bit tricky because the string variable contains several
> different types of expressions, lengths, and sometimes spaces, with
> information that looks something like this,
>
> string variable
> UK/FI/EI
> PMSE NO(20)
> PMSE NO(20),EI(5),GE(35),CN(20)
> PMSE2004 NO(50),EI(10),GE(30),UK(30),SW(30)
> POLARLIS FR(220)
> LIDAR_GPS NI(20),NO(20)
> IASK SE(60),NO(20),UK(20)
>
> What I want is to extract (decomposed) information from the string variable
> into new columns, such as,
>
> var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12
> UK   FI   EI
> PM   SE   NO   20
> PM   SE   NO   20   EI   5    GE   30   UK   30    SE    30
>
> As I understand, one way of doing this is to use Stata's regular
> expressions: -regexs- and -regexm-, i.e.:
>
> gen x1 = regexs(1)+ regexs(2) if regexm(expnamn, "([a-zA-Z])([a-zA-Z]+)")
> gen x2 = regexs(1)+ regexs(2) if regexm(expnamn, "([0-9]+)*([0-9]+)")
> ..and so on..
>
> However, since the string variable varies so much in format, this task
> appears far more complex than I first thought, and I am unable to
> construct a proper script to decompose the string variable in an
> efficient way.
>
> Any suggestions?
>
> Thanks in advance,
> Simon
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/