Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: regexs and regexm

From	Robert Picard <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: regexs and regexm
Date	Thu, 3 Oct 2013 09:16:50 -0400

You can use -moss- (from SSC; install it with -ssc install moss-) to
split your string variable using a regex pattern. Here are two ways of
splitting your string:

Robert

* ----------------- begin example ---------------
clear
input str80 s
"UK/FI/EI"
"PMSE    NO(20)"
"PMSE    NO(20),EI(5),GE(35),CN(20)"
"PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)"
"POLARLIS    FR(220)"
"LIDAR_GPS    NI(20),NO(20)"
"IASK    SE(60),NO(20),UK(20)"
end

* match any pair of uppercase letters, or any run of digits
* ([0-9]+ rather than [0-9][0-9], so that "5" and "220" are kept whole)
moss s, match("([A-Z][A-Z]|[0-9]+)") regex

* match anything that is not a delimiter
moss s, match("([^ \(\),/]+)") regex pre(v_)
* ----------------- end example -----------------
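
For readers without access to SSC, the second approach (take every run of non-delimiter characters) can also be sketched with only the built-in -regexm()- and -regexs()- functions asked about in the original question. This is a rough sketch under assumptions, not part of Robert's post: the variable names rest and tok1, tok2, ... are made up for illustration, the delimiter set (space, parentheses, comma, slash) is copied from the -moss- pattern above, and the code assumes the example data in s is already in memory.

```stata
* Sketch only: peel tokens off the front of a working copy of s, one per pass.
* rest and tok1, tok2, ... are hypothetical names, not from the thread.
local pat "([^ (),/]+)(.*)"    // group 1 = next token, group 2 = remainder

gen str80 rest = s
local i = 0
quietly count if regexm(rest, "`pat'")
local n = r(N)

while `n' > 0 {
    local ++i
    * the first run of non-delimiter characters becomes token i
    gen str80 tok`i' = regexs(1) if regexm(rest, "`pat'")
    * keep only the text after that token for the next pass
    quietly replace rest = regexs(2) if regexm(rest, "`pat'")
    * stop once no observation has a token left
    quietly count if regexm(rest, "`pat'")
    local n = r(N)
}
drop rest
list tok*, noobs
```

Each pass creates one new variable, so the loop runs as many times as the longest observation has tokens. Like the second -moss- call, this keeps "PMSE" as a single token; splitting it further into "PM" and "SE" would need a different pattern.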


On Thu, Oct 3, 2013 at 8:22 AM, Simon Falck <[email protected]> wrote:
> Dear Statalist,
>
> Using Stata 11.2, I want to extract a portion of a string variable using
> regular expressions, i.e. -regexs- and -regexm-
>
> This job is a bit tricky because the string variable contains several
> different types of expressions, lengths, and sometimes spaces, with
> information that looks something like this,
>
> string variable
> UK/FI/EI
> PMSE    NO(20)
> PMSE    NO(20),EI(5),GE(35),CN(20)
> PMSE2004    NO(50),EI(10),GE(30),UK(30),SW(30)
> POLARLIS    FR(220)
> LIDAR_GPS    NI(20),NO(20)
> IASK    SE(60),NO(20),UK(20)
>
> What I want is to extract (decomposed) information from the string variable
> into new columns, such as,
>
> var1  var2  var3  var4  var5  var6  var7  var8  var9  var10  var11  var12
> UK    FI    EI
> PM    SE    NO    20
> PM    SE    NO    20    EI    5     GE    30    UK    30     SE     30
>
> As I understand it, one way of doing this is to use Stata's regular
> expression functions -regexs- and -regexm-, e.g.:
>
> gen x1 = regexs(1)+ regexs(2) if regexm(expnamn, "([a-zA-Z])([a-zA-Z]+)")
> gen x2 = regexs(1)+ regexs(2) if regexm(expnamn, "([0-9]+)*([0-9]+)")
> ..and so on..
>
> However, since the contents of the string variable vary so much, this
> task appears far more complex than I first thought, and I am unable to
> construct a proper script to decompose the string variable in an
> efficient way.
>
> Any suggestions?
>
> Thanks in advance,
> Simon
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

