Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: regexm

From	KOTa <[email protected]>
To	[email protected]
Subject	Re: st: regexm
Date	Sat, 27 Aug 2011 17:14:51 +0200

thanks

2011/8/27 Robert Picard <[email protected]>:
> I second looking at -moss- from SSC. Try:
>
> moss svar, match("([0-9\.]+)") regex
>
> Robert
>
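
For anyone trying Robert's suggestion, a minimal sketch follows. It assumes -moss- installs from SSC in the usual way; the variable name svar comes from his command, and the two example rows are adapted from Eric's data elsewhere in this thread. -describe- is used at the end simply to show whichever variables -moss- added, rather than guessing at their names.

```stata
* install once from SSC (user-written package, needs internet access)
ssc install moss, replace

clear
input str70 svar
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
end

* find every run of digits and dots, as in Robert's command
moss svar, match("([0-9\.]+)") regex

* show the variables that -moss- added to the dataset
describe
```

Note that this pattern picks up the dollar amounts as well as the percentages, and a lone period such as the one ending "th_aft." also satisfies it, so the results may need filtering.
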
> On Sat, Aug 27, 2011 at 10:33 AM, Nick Cox <[email protected]> wrote:
>> Strings longer than 244 characters cannot be read into variables. You could
>> read them into Mata.
>>
>> As said, do look at -moss-.
>>
>> Nick
>>
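
A minimal sketch of the Mata route Nick mentions, assuming the long strings sit in a plain-text file with one string per line (an Excel sheet would first have to be saved as text). The file name longlines.txt is hypothetical, and the first block only writes a small test file so that the sketch runs on its own.

```stata
mata:
// write a test file containing one 300-character line (illustration only)
unlink("longlines.txt")
fh = fopen("longlines.txt", "w")
fput(fh, 300 * "a")
fclose(fh)

// read the file back line by line; fget() returns J(0,0,"") at end of file
fh = fopen("longlines.txt", "r")
while ((line = fget(fh)) != J(0, 0, "")) {
    printf("read a line of %g characters\n", strlen(line))
}
fclose(fh)
end
```

Inside the loop, each line is an ordinary Mata string scalar, so it can be parsed or shortened there before anything is stored in a Stata variable.
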
>> On 27 Aug 2011, at 15:22, KOTa <[email protected]> wrote:
>>
>>> Simpler in a logistical sense, i.e. I tried to do the whole thing without
>>> creating the additional variables (that -split- creates) in the middle.
>>>
>>> Another question, if you know, also about strings: when I import a file
>>> into Stata (from Excel, for example) I have some very long strings that
>>> Stata cuts to 244 chars.
>>>
>>> Is there any trick to get around it, other than making them shorter before
>>> importing? :)
>>>
>>> thank you
>>>
>>> 2011/8/27 Nick Cox <[email protected]>:
>>>>
>>>> Better in what sense? Quicker to get a solution? Simpler? Other criteria?
>>>>
>>>> I don't know a way of counting more than 9 matches directly. I think
>>>> you would need, if you continue to follow that path, to loop over a
>>>> string repeatedly finding new instances and counting.
>>>>
>>>> See also -moss- from SSC.
>>>>
>>>> Nick
>>>>
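
The loop Nick describes could be sketched as follows. This is only one way to do it, under the assumption that the items to count are percentages; the names n_pct, work, pat and left are illustrative, and the first data row is the string quoted elsewhere in this thread.

```stata
clear
input str80 j
"0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft."
"1.0%-$109(M) 0.1% th_aft."
end

* pattern: a number, optionally with a decimal part, followed by a percent sign
local pat "[0-9]+\.?[0-9]*%"

* n_pct is the running count; work is a copy of j that gets consumed
gen n_pct = 0
gen work = j

count if regexm(work, "`pat'")
local left = r(N)
while `left' > 0 {
    replace n_pct = n_pct + 1 if regexm(work, "`pat'")
    * keep only the text that follows the match just found
    replace work = substr(work, strpos(work, regexs(0)) + length(regexs(0)), .) ///
        if regexm(work, "`pat'")
    count if regexm(work, "`pat'")
    local left = r(N)
}

drop work
list j n_pct
```

Because each pass discards the text up to and including the latest match, the count is not limited to nine, and the loop stops once no observation has a match left.
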
>>>> On Sat, Aug 27, 2011 at 2:52 PM, KOTa <[email protected]> wrote:
>>>>>
>>>>> Yes, I am working with -split- now; I just thought regex would be
>>>>> better.
>>>>>
>>>>> Anyway, is there a way to find out how many expressions regexm finds?
>>>>> 1. What I mean is that I can access the 1st, 2nd, etc. up to the 9th with
>>>>> regexs, but if I don't know how many there are, I don't know which one
>>>>> is last.
>>>>> 2. What if more than 9 expressions are found? According to the manual,
>>>>> regexs can only take arguments 0-9.
>>>>>
>>>>>
>>>>> thanks
>>>>>
>>>>> 2011/8/27 Nick Cox <[email protected]>:
>>>>>>
>>>>>> Well, you did say it always ends by "% th_aft".
>>>>>>
>>>>>> I will continue as I started.
>>>>>>
>>>>>> If you first blank out stuff you don't need then you can just use
>>>>>> -split- to separate out elements. If you parse on spaces then it is
>>>>>> immaterial when you have 2 or 3 digits before, you retrieve the number
>>>>>> either way.
>>>>>>
>>>>>> No need for regex demonstrated.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
>>>>>>>
>>>>>>> Thanks Eric and Nick, I used your advice and have almost finished,
>>>>>>>
>>>>>>> but I encountered one small problem on the way.
>>>>>>>
>>>>>>> I have the same type of string - "0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B)
>>>>>>> 0.10% th_aft." - where the number of digits after the dot can be 2 or 3;
>>>>>>> it's not constant.
>>>>>>>
>>>>>>> I am trying to extract the last % (i.e. 0.10% in this case) using
>>>>>>> "$" like this:
>>>>>>>
>>>>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")
>>>>>>>
>>>>>>> and it fails in both cases: the result is empty.
>>>>>>>
>>>>>>> It does extract the first one (0.15%) if I don't use "$".
>>>>>>>
>>>>>>> What is wrong?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>>> P.S. Nick, th_aft is not a terminator; it's not always there.
>>>>>>>
>>>>>>>
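
One likely reason the anchored pattern returns nothing: in the example string the last percentage is followed by " th_aft.", so "%" is never the final character and "$" cannot match there. A minimal sketch, assuming the only possible trailing text is "th_aft" (with or without a period) plus stray blanks; fee_str is the variable name from the post, while clean and last_pct are illustrative names.

```stata
clear
input str80 fee_str
"0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft."
"0.25%-$5(B) 0.125%"
end

* remove the optional tail, then any leading or trailing blanks
gen clean = subinstr(fee_str, "th_aft.", "", .)
replace clean = trim(subinstr(clean, "th_aft", "", .))

* with "%" now the last character, the end-of-string anchor can match
gen last_pct = regexs(0) if regexm(clean, "[0-9]+\.?[0-9]*%$")

list fee_str last_pct
```

If other text can follow the last percentage, it would need to be stripped in the same way (or allowed for in the pattern) before the anchor is of any use.
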
>>>>>>> 2011/8/27 Nick Cox <[email protected]>:
>>>>>>>>
>>>>>>>> It is not obvious to me that you need -regexm()- at all.
>>>>>>>>
>>>>>>>> The text " th_aft" appears to be just a terminator that you don't
>>>>>>>> care about, so remove it.
>>>>>>>>
>>>>>>>> replace j = subinstr(j, " th_aft", "", .)
>>>>>>>>
>>>>>>>> The last element can be separated off and then removed.
>>>>>>>>
>>>>>>>> gen last = word(j, -1)
>>>>>>>>
>>>>>>>> replace j = reverse(j)
>>>>>>>> replace j = subinstr(j, word(j,1) , "", 1)
>>>>>>>> replace j = reverse(j)
>>>>>>>>
>>>>>>>> We reverse it in order to avoid removing any identical substring.
>>>>>>>>
>>>>>>>> Those three lines could be telescoped into one.
>>>>>>>>
>>>>>>>> Then it looks like an exercise in -subinstr()- and -split-.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
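
The three -reverse()- lines can be telescoped into one, as Nick notes. A minimal sketch, assuming the terminator may appear with or without its trailing period; the example row is adapted from Eric's data, and the added -trim()- only tidies the blank left behind.

```stata
clear
input str80 j
"0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
end

* drop the terminator, whether or not it carries the final period
replace j = subinstr(j, " th_aft.", "", .)
replace j = subinstr(j, " th_aft", "", .)

* separate off the last element
gen last = word(j, -1)

* remove that element from the end of j in a single statement;
* working on the reversed string means only the final occurrence goes
replace j = trim(reverse(subinstr(reverse(j), word(reverse(j), 1), "", 1)))

list j last
```
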
>>>>>>>> On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> <>
>>>>>>>>>
>>>>>>>>> Here's an example. Note that I messed with the formatting of the %'s
>>>>>>>>> and $'s in my example data a bit to show how flexible the -regex- is
>>>>>>>>> in the latter part of the code; however, you'll need to check that
>>>>>>>>> there aren't other patterns/symbols in your string that could break
>>>>>>>>> my code. There are other ways to approach this, but I think the logic
>>>>>>>>> here is easy to follow:
>>>>>>>>>
>>>>>>>>> *************! watch for wrapping:
>>>>>>>>>
>>>>>>>>> **example data:
>>>>>>>>> clear
>>>>>>>>> inp str70(j)
>>>>>>>>> "A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
>>>>>>>>> "A: 0.25%-$198(M) 0.12%-$398(M)  0.99%-$300(M) 0.00% th_aft."
>>>>>>>>> "A: 1.0%-$109(M) 0.1% th_aft."
>>>>>>>>> "A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
>>>>>>>>> end
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> **regexm example == easier to use -split- initially
>>>>>>>>> g example = regexs(0) ///
>>>>>>>>>  if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
>>>>>>>>> l
>>>>>>>>> drop example
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> **split:
>>>>>>>>> replace j = subinstr(j, "A: ", "", 1)
>>>>>>>>> split j, p("(M) ")
>>>>>>>>>
>>>>>>>>> **first, find x10 :
>>>>>>>>> g x10 = ""
>>>>>>>>>
>>>>>>>>> tempvar flag
>>>>>>>>> g `flag' = ""
>>>>>>>>> foreach var of varlist j? {
>>>>>>>>> replace `flag' = "`var'" if ///
>>>>>>>>>       strpos(`var', "th_aft")>0
>>>>>>>>> replace x10  = subinstr(`var', "th_aft.", "", .) ///
>>>>>>>>>        if `flag' == "`var'"
>>>>>>>>> replace `var' = "" if strpos(`var', "th_aft")>0
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> **now, create x1-x9 and y1-y9
>>>>>>>>> forval num = 1/9 {
>>>>>>>>>  g x`num' = ""
>>>>>>>>>  g y`num' = ""
>>>>>>>>>  cap replace x`num' = regexs(0) if ///
>>>>>>>>>       regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
>>>>>>>>>       & !mi(j`num') & mi(x`num') //probably overkill
>>>>>>>>>  cap replace y`num' = regexs(0) if ///
>>>>>>>>>       regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
>>>>>>>>>       & !mi(j`num') & mi(y`num')
>>>>>>>>>       }
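>>>>>>>>> **finally, create y10 == last nonmissing y (y2 in the first example row):
>>>>>>>>>  g y10 = ""
>>>>>>>>>  forval num = 1/9 {
>>>>>>>>>       replace y10 = y`num' if !mi(y`num')
>>>>>>>>>       }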
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ****list:
>>>>>>>>> l *1
>>>>>>>>> l *2
>>>>>>>>> l *3
>>>>>>>>>
>>>>>>>>> *************!
>>>>>>>>> - Eric
>>>>>>>>>
>>>>>>>>> On Aug 26, 2011, at 6:59 PM, KOTa wrote:
>>>>>>>>
>>>>>>>>>> I am trying to extract some data from a text variable and, being new
>>>>>>>>>> to Stata programming, am struggling to find the right format.
>>>>>>>>>>
>>>>>>>>>> My problem is as follows:
>>>>>>>>>>
>>>>>>>>>> For example, I have a string variable like this: "A: 0.35%-$100(M)
>>>>>>>>>> 0.30%-$300(M) 0.27% th_aft."
>>>>>>>>>>
>>>>>>>>>> The number of pairs "% - (M)" can be from 1 to 9, and it always ends by
>>>>>>>>>> "% th_aft"
>>>>>>>>>>
>>>>>>>>>> I have 10 pairs of variables X1 Y1 .... X10 Y10
>>>>>>>>>>
>>>>>>>>>> my goal is to extract all pairs from the string variable and split
>>>>>>>>>> them into my separate variables.
>>>>>>>>>>
>>>>>>>>>> in this case the result should be:
>>>>>>>>>>
>>>>>>>>>> X1  = 0.35%
>>>>>>>>>> Y1 = $100
>>>>>>>>>>
>>>>>>>>>> X2 = 0.30%
>>>>>>>>>> Y2 = $300
>>>>>>>>>>
>>>>>>>>>> X3-X9 = Y3-Y9 = 0
>>>>>>>>>>
>>>>>>>>>> X10 = 0.27%
>>>>>>>>>> Y10 = Y2 (i.e. the last Y extracted from the string)
>>>>>>>>>>
>>>>>>>>>> I am trying to use regexm, but without success. Any suggestions?
>>>>>>>>>>
>>>>>>>>
>>>>>>>> *
>>>>>>>> *   For searches and help try:
>>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>>> *   http://www.stata.com/support/statalist/faq
>>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>>>>
