Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: regexs and regexm
From
Simon Falck <[email protected]>
To
[email protected]
Subject
Re: st: regexs and regexm
Date
Thu, 03 Oct 2013 15:42:07 +0200
Robert,
Thank you for this excellent suggestion. I tried -moss- and it does the job.
All the best,
Simon
On 2013-10-03 15:16, Robert Picard wrote:
You can use -moss- (from SSC) to split your string variable using a
regex pattern. Here are two ways of splitting your string:
Robert
* ----------------- begin example ---------------
clear
input str80 s
"UK/FI/EI"
"PMSE NO(20)"
"PMSE NO(20),EI(5),GE(35),CN(20)"
"PMSE2004 NO(50),EI(10),GE(30),UK(30),SW(30)"
"POLARLIS FR(220)"
"LIDAR_GPS NI(20),NO(20)"
"IASK SE(60),NO(20),UK(20)"
end
* match any pair of uppercase letters or any pair of digits
moss s, match("([A-Z][A-Z]|[0-9][0-9])") regex
* match anything that is not a delimiter
moss s, match("([^ \(\),/]+)") regex pre(v_)
* ----------------- end example -----------------
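A minimal sketch of running the second pattern and inspecting the result, assuming -moss- is not yet installed. The variable names v_count and v_match1, v_match2, ... are an assumption about how -moss- applies prefix(); check with -describe- if they differ on your installation.

```stata
* one-time installation of the user-written command from SSC
ssc install moss

clear
input str80 s
"UK/FI/EI"
"PMSE NO(20)"
end

* every run of characters that is not a space, parenthesis, comma or slash
moss s, match("([^ \(\),/]+)") regex prefix(v_)

* assumption: results are stored as v_count, v_match1, v_pos1, v_match2, ...
describe
list s v_count v_match*, noobs
```

Under this pattern "PMSE" is kept as one token, whereas the two-character pattern in the example above would split it into "PM" and "SE".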
On Thu, Oct 3, 2013 at 8:22 AM, Simon Falck <[email protected]> wrote:
Dear Statalist,
Using Stata 11.2, I want to extract portions of a string variable using
regular expressions, namely -regexs- and -regexm-.
This job is a bit tricky because the string variable contains several
different types of expressions, lengths, and sometimes spaces, with
information that looks something like this,
string variable
UK/FI/EI
PMSE NO(20)
PMSE NO(20),EI(5),GE(35),CN(20)
PMSE2004 NO(50),EI(10),GE(30),UK(30),SW(30)
POLARLIS FR(220)
LIDAR_GPS NI(20),NO(20)
IASK SE(60),NO(20),UK(20)
What I want is to extract (decomposed) information from the string variable
into new columns, such as,
var1  var2  var3  var4  var5  var6  var7  var8  var9  var10  var11  var12
UK    FI    EI
PM    SE    NO    20
PM    SE    NO    20    EI    5     GE    30    UK    30     SE     30
As I understand, one way of doing this is to use Stata's regular
expression functions -regexs- and -regexm-, e.g.:
gen x1 = regexs(1)+ regexs(2) if regexm(expnamn, "([a-zA-Z])([a-zA-Z]+)")
gen x2 = regexs(1)+ regexs(2) if regexm(expnamn, "([0-9]+)*([0-9]+)")
..and so on..
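A single -regexm- call reports only the first match in each observation, so one way to get every token without extra packages is to loop: take the first match, blank it out of a working copy, and repeat until nothing matches. The following is a rough sketch under stated assumptions, not a definitive solution: the token rule (two uppercase letters, or a run of digits), the working variable work, and the output names var1, var2, ... are all chosen here for illustration.

```stata
clear
input str80 expnamn
"UK/FI/EI"
"PMSE NO(20)"
"PMSE NO(20),EI(5)"
end

* assumed token rule: two uppercase letters, or a run of digits
local pattern "([A-Z][A-Z]|[0-9]+)"

gen str244 work = expnamn      // working copy that is consumed token by token
local i = 0
local more = 1
while `more' {
    quietly count if regexm(work, "`pattern'")
    if r(N) == 0 {
        local more = 0         // no observation has a token left
    }
    else {
        local ++i
        * first remaining token in each observation (empty if none)
        quietly gen var`i' = regexs(1) if regexm(work, "`pattern'")
        * overwrite that token with a space so the next pass finds the following one
        quietly replace work = subinstr(work, var`i', " ", 1) if var`i' != ""
    }
}
drop work
list, noobs
```

Replacing the token with a space rather than an empty string keeps the characters on either side of it from joining into a new, spurious match. Note that under this rule a name such as "PMSE2004" would yield "PM", "SE" and "2004"; adjust the pattern if that is not what is wanted.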
However, since the contents of the string variable vary so much in form,
this task appears far more complex than I first thought, and I am
unable to construct a script that decomposes the string variable
efficiently.
Any suggestions?
Thanks in advance,
Simon
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/