Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: regexm
From: KOTa <[email protected]>
To: [email protected]
Subject: Re: st: regexm
Date: Sat, 27 Aug 2011 15:52:47 +0200

Yes, I am working with -split- now; I just thought it would be better with regex.
Anyway, is there a way to find out how many expressions regexm() finds?
1. What I mean is that I can access the 1st, 2nd, etc., up to the 9th with regexs(),
but if I don't know how many there are, I don't know which one is
last.
2. What if more than 9 expressions are found? According to the manual, regexs()
can only take arguments 0-9.
Thanks
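
A note on the question above: regexs(1) to regexs(9) return the parenthesized subexpressions of a single match, not successive matches, and regexm() reports only the first match in the string. One way to count the matches and keep the last one is to consume the string match by match with regexr(). The following is only a minimal sketch under assumptions: the variable names work, n_pct and last are made up for illustration, the two example strings are taken from this thread, and the pattern may need adjusting for real data. It also avoids the "$" anchor, which cannot match a percentage when the string ends in " th_aft.".

```stata
* Sketch: count the percentages in each string and keep the last one.
* Variable names (work, n_pct, last) are hypothetical.
clear
set obs 2
gen str70 fee_str = ""
* char(36) is the dollar sign; this avoids macro expansion in a do-file
replace fee_str = "0.15%-" + char(36) + "1(B) 0.14%-" + char(36) + ///
    "2(B) 0.12%-" + char(36) + "2(B) 0.10% th_aft." in 1
replace fee_str = "1.0%-" + char(36) + "109(M) 0.1% th_aft." in 2

gen work  = fee_str     // working copy, consumed one match at a time
gen n_pct = 0           // number of percentages found so far
gen last  = ""          // most recent percentage found

local pat "[0-9]+\.?[0-9]*%"
local more = 1
while `more' {
    * regexs(0) refers to the match made by regexm() in the -if- condition
    replace last  = regexs(0) if regexm(work, "`pat'")
    replace n_pct = n_pct + 1 if regexm(work, "`pat'")
    * regexr() replaces the first match only, so the next pass sees the next one
    replace work  = regexr(work, "`pat'", "")
    * stop once no observation has a match left
    count if regexm(work, "`pat'")
    local more = r(N) > 0
}
drop work
list fee_str n_pct last
```

Because the loop runs until no observation has a match left, it is not limited to 9 items, and n_pct tells you which one was last.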
2011/8/27 Nick Cox <[email protected]>:
> Well, you did say: it always ends by "% th_aft".
>
> I will continue as I started.
>
> If you first blank out stuff you don't need then you can just use
> -split- to separate out elements. If you parse on spaces then it is
> immaterial when you have 2 or 3 digits before, you retrieve the number
> either way.
>
> No need for regex demonstrated.
>
> Nick
>
> On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
>> Thanks, Eric and Nick. I used your advice and am almost finished,
>>
>> but I encountered one small problem on the way.
>>
>> I have the same type of string, "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B)
>> 0.10% th_aft.", where the number of digits after the decimal point can
>> be 2 or 3; it is not constant.
>>
>> I am trying to extract the last % (i.e. 0.10% in this case) using
>> "$", like this:
>>
>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")
>>
>> or
>>
>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")
>>
>> and it fails in both cases: the result is empty.
>>
>> It does extract the first one (0.15%) if I don't use "$".
>>
>> What is wrong?
>>
>> thanks
>>
>> P.S. Nick, th_aft is not a terminator; it's not always there.
>>
>>
>> 2011/8/27 Nick Cox <[email protected]>:
>>> It is not obvious to me that you need -regexm()- at all.
>>>
>>> The text " th_aft" appears to be just a terminator that you don't care
>>> about, so remove it.
>>>
>>> replace j = subinstr(j, " th_aft", "", .)
>>>
>>> The last element can be separated off and then removed.
>>>
>>> gen last = word(j, -1)
>>>
>>> replace j = reverse(j)
>>> replace j = subinstr(j, word(j,1) , "", 1)
>>> replace j = reverse(j)
>>>
>>> We reverse it in order to avoid removing any identical substring.
>>>
>>> Those three lines could be telescoped into one.
>>>
>>> Then it looks like an exercise in -subinstr()- and -split-.
>>>
>>> Nick
>>>
>>> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:
>>>> <>
>>>>
>>>> Here's an example...note that I messed with the formatting of the %'s and $'s in my example data a bit to show how flexible the -regex- is in the latter part of the code; however, you'll need to check that there aren't other patterns/symbols in your string that could break my code.
>>>> There are other ways to approach this, but I think the logic here is easy to follow:
>>>>
>>>> *************! watch for wrapping:
>>>>
>>>> **example data:
>>>> clear
>>>> inp str70(j)
>>>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>>>> "A: 0.25%-$198(M) 0.12%-$398(M) 0.99%-$300(M) 0.00% th_aft."
>>>> "A: 1.0%-$109(M) 0.1% th_aft."
>>>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>>>> end
>>>>
>>>>
>>>>
>>>> **regexm example == easier to use -split- initially
>>>> g example = regexs(0) ///
>>>> if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>>>> l
>>>> drop example
>>>>
>>>>
>>>> **split:
>>>> replace j = subinstr(j, "A: ", "", 1)
>>>> split j, p("(M) ")
>>>>
>>>> **first, find x10 :
>>>> g x10 = ""
>>>>
>>>> tempvar flag
>>>> g `flag' = ""
>>>> foreach var of varlist j? {
>>>>     replace `flag' = "`var'" if ///
>>>>         strpos(`var', "th_aft")>0
>>>>     replace x10 = subinstr(`var', "th_aft.", "", .) ///
>>>>         if `flag' == "`var'"
>>>>     replace `var' = "" if strpos(`var', "th_aft")>0
>>>> }
>>>>
>>>>
>>>> **now, create x1-x9 and y1-y9
>>>> forval num = 1/9 {
>>>>     g x`num' = ""
>>>>     g y`num' = ""
>>>>     cap replace x`num' = regexs(0) if ///
>>>>         regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>>>         & !mi(j`num') & mi(x`num') //probably overkill
>>>>     cap replace y`num' = regexs(0) if ///
>>>>         regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>>>         & !mi(j`num') & mi(y`num')
>>>> }
>>>> **finally, create y10 == y2:
>>>> g y10 = y2
>>>>
>>>>
>>>> ****list:
>>>> l *1
>>>> l *2
>>>> l *3
>>>>
>>>> *************!
>>>> - Eric
>>>>
>>>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>>>
>>>>> I am trying to extract some data from a text variable and, being new to
>>>>> Stata programming, am struggling to find the right format.
>>>>>
>>>>> My problem is as follows:
>>>>>
>>>>> For example, I have a string variable like this: "A: 0.35%-$100(M)
>>>>> 0.30%-$300(M) 0.27% th_aft."
>>>>>
>>>>> The number of pairs "% - (M)" can be from 1 to 9, and it always ends by "% th_aft"
>>>>>
>>>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>>>
>>>>> my goal is to extract all pairs from the string variable and split
>>>>> them into my separate variables.
>>>>>
>>>>> in this case the result should be:
>>>>>
>>>>> X1 = 0.35%
>>>>> Y1 = $100
>>>>>
>>>>> X2 = 0.30%
>>>>> Y2 = $300
>>>>>
>>>>> X3-X9 = Y3-Y9 = 0
>>>>>
>>>>> X10 = 0.27%
>>>>> Y10 = Y2 (i.e. last Y extracted from string)
>>>>>
>>>>> I am trying to use regexm, but without success. Any suggestions?
>>>>>
>>>
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/statalist/faq
>>> * http://www.ats.ucla.edu/stat/stata/
>>>