Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: regexm
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: regexm
Date: Sat, 27 Aug 2011 15:33:44 +0100
Strings longer than 244 characters cannot be read into variables. You
could read them into Mata.
As said, do look at -moss-.
Nick
On 27 Aug 2011, at 15:22, KOTa <[email protected]> wrote:
simpler in a logistical sense, i.e. I tried to do the whole thing without
creating the additional variables (that split creates) in the middle.
another question, if you know, also about strings: when I import a file
into Stata (from Excel, for example) I have some very long strings that
Stata cuts to 244 chars.
is there any trick to get around it, other than making them shorter before
importing? :)
thank you
2011/8/27 Nick Cox <[email protected]>:
Better in what sense? Quicker to get a solution? Simpler? Other
criteria?
I don't know a way of counting more than 9 matches directly. I think
you would need, if you continue to follow that path, to loop over a
string repeatedly finding new instances and counting.
See also -moss- from SSC.
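A minimal sketch of the looping approach described above, using only built-in functions: match once, count the hit, strip the matched text, and repeat until nothing matches in any observation. The example strings and the variable names s, work, hit and n_pct are illustrative assumptions, not taken from the original data.

```stata
* Sketch only: the data and variable names here are hypothetical.
clear
set obs 3
gen str80 s = ""
* the backslash stops Stata from reading the dollar sign as a global macro
replace s = "0.15%-\$1(B) 0.14%-\$2(B) 0.12%-\$2(B) 0.10% th_aft." in 1
replace s = "1.0%-\$109(M) 0.1% th_aft." in 2
replace s = "no fees listed" in 3

gen work = s
gen n_pct = 0
local more = 1
while `more' {
    * flag observations that still contain a percentage
    gen byte hit = regexm(work, "[0-9]+\.?[0-9]*%")
    replace n_pct = n_pct + 1 if hit
    * remove the first occurrence of the text just matched
    replace work = subinstr(work, regexs(0), "", 1) if regexm(work, "[0-9]+\.?[0-9]*%")
    * keep looping while at least one observation matched
    count if hit
    local more = r(N) > 0
    drop hit
}
drop work
list s n_pct
```

Because the count comes from the number of passes rather than from numbered subexpressions, it is not bound by the 0-9 limit of -regexs()-.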
Nick
On Sat, Aug 27, 2011 at 2:52 PM, KOTa <[email protected]> wrote:
yes, I am working with split now; I just thought it would be better
with regex.
anyway, is there a way to find out how many expressions regexm
finds?
1. what I mean is I can access the 1st, 2nd, etc. up to the 9th with
regexs, but if I don't know how many there are, I don't know which one
is last.
2. what if more than 9 expressions are found? according to the manual,
regexs can only take parameters 0-9.
thanks
2011/8/27 Nick Cox <[email protected]>:
Well, you did say "it always ends by "% th_aft".
I will continue as I started.
If you first blank out stuff you don't need then you can just use
-split- to separate out elements. If you parse on spaces then it is
immaterial when you have 2 or 3 digits before, you retrieve the number
either way.
No need for regex demonstrated.
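A minimal sketch of that blank-out-then-split idea, assuming strings shaped like the ones posted in this thread. The variable names work and part are illustrative; the original string is copied first so it is left untouched.

```stata
* Sketch only: the data and variable names here are hypothetical.
clear
set obs 2
gen str80 j = ""
* the backslash stops Stata from reading the dollar sign as a global macro
replace j = "0.35%-\$100(M) 0.30%-\$300(M) 0.27% th_aft." in 1
replace j = "1.0%-\$109(M) 0.125% th_aft." in 2

* blank out everything that is not a value, then separate values with spaces
gen work = j
replace work = subinstr(work, " th_aft.", "", .)
replace work = subinstr(work, "(M)", "", .)
replace work = subinstr(work, "-", " ", .)

* -split- parses on spaces by default and creates part1, part2, ...
split work, gen(part)
list part*
```

Odd-numbered parts hold the percentages and even-numbered parts the dollar amounts, except that the last nonempty part in each row is the final percentage.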
Nick
On Sat, Aug 27, 2011 at 2:16 PM, KOTa <[email protected]> wrote:
thanks Eric, Nick. I used your advice and have almost finished,
but ran into one small problem on the way.
I have the same type of string:
"0.15%-$1(B) 0.14%-$2(B) 0.12%-$2(B) 0.10% th_aft."
The number of digits after the dot can be 2 or 3; it's not constant.
I am trying to extract the last % (i.e. 0.10% in this case) using "$",
like this:
g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]$")
or
g example = regexs(0) if regexm( fee_str, "[0-9]+\.[0-9]*[%]+$")
and it fails in both cases: the result is empty.
It does extract the first one (0.15%) if I don't use "$".
What is wrong?
thanks
p.s. Nick, th_aft is not a terminator, its not always there
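In the string shown, the last "%" is followed by " th_aft.", so a pattern that requires "%" at the very end of the string has nothing to match. One possible workaround, sketched here under the assumption that the trailer, when present, is exactly "th_aft.", is to strip it first and then anchor the match. The variable names trimmed and last_pct are illustrative.

```stata
* Sketch only: the data and variable names here are hypothetical.
clear
set obs 2
gen str80 fee_str = ""
* the backslash stops Stata from reading the dollar sign as a global macro
replace fee_str = "0.15%-\$1(B) 0.14%-\$2(B) 0.12%-\$2(B) 0.10% th_aft." in 1
replace fee_str = "0.25%-\$5(B) 0.125%" in 2

* drop the optional trailer, then trim so the string ends with the last percent sign
gen trimmed = trim(subinstr(fee_str, "th_aft.", "", .))

* with the trailer gone, the end-of-string anchor can match the final percentage
gen last_pct = regexs(0) if regexm(trimmed, "[0-9]+\.[0-9]*%$")
list fee_str last_pct
```

The second observation has no trailer, so the same two lines cover both cases.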
2011/8/27 Nick Cox <[email protected]>:
It is not obvious to me that you need -regexm()- at all.
The text " th_aft." (including the final period in your example) appears
to be just a terminator that you don't care about, so remove it.
replace j = subinstr(j, " th_aft.", "", .)
The last element can be separated off and then removed.
gen last = word(j, -1)
replace j = reverse(j)
replace j = subinstr(j, word(j,1) , "", 1)
replace j = reverse(j)
We reverse it in order to avoid removing any identical substring.
Those three lines could be telescoped into one.
Then it looks like an exercise in -subinstr()- and -split-.
Nick
On Sat, Aug 27, 2011 at 2:28 AM, Eric Booth <[email protected]> wrote:
<>
Here's an example...note that I messed with the formatting of
the %'s and $'s in my example data a bit to show how flexible
the -regex- is in the latter part of the code; however, you'll
need to check that there aren't other patterns/symbols in your
string that could break my code.
There are other ways to approach this, but I think the logic
here is easy to follow:
*************! watch for wrapping:
**example data:
clear
inp str70(j)
"A: 0.35%-$197(M) 0.30%-$397(M) 0.27% th_aft."
"A: 0.25%-$198(M) 0.12%-$398(M) 0.99%-$300(M) 0.00% th_aft."
"A: 1.0%-$109(M) 0.1% th_aft."
"A: 0%-$199(M) 0.30%-$366(M) 1.99% th_aft."
end
**regexm example == easier to use -split- initially
g example = regexs(0) ///
if regexm(j, "(([0-9]+\.[0-9]*[%-]+)([\$][0-9]*))")
l
drop example
**split:
replace j = subinstr(j, "A: ", "", 1)
split j, p("(M) ")
**first, find x10 :
g x10 = ""
tempvar flag
g `flag' = ""
foreach var of varlist j? {
replace `flag' = "`var'" if ///
strpos(`var', "th_aft")>0
replace x10 = subinstr(`var', "th_aft.", "", .) ///
if `flag' == "`var'"
replace `var' = "" if strpos(`var', "th_aft")>0
}
**now, create x1-x9 and y1-y9
forval num = 1/9 {
g x`num' = ""
g y`num' = ""
cap replace x`num' = regexs(0) if ///
regexm(j`num', "([0-9]+\.?[0-9]*[%]+)") ///
& !mi(j`num') & mi(x`num') //probably overkill
cap replace y`num' = regexs(0) if ///
regexm(j`num', "([\$][0-9]*\.?[0-9]*)") ///
& !mi(j`num') & mi(y`num')
}
**finally, create y10 == last nonmissing y (y2 in the first row):
g y10 = ""
forval num = 1/9 {
replace y10 = y`num' if !mi(y`num')
}
****list:
l *1
l *2
l *3
*************!
- Eric
On Aug 26, 2011, at 6:59 PM, KOTa wrote:
I am trying to extract some data from a text variable and, being new to
Stata programming, am struggling to find the right format.
My problem is as follows:
for example, I have a string variable like this:
"A: 0.35%-$100(M) 0.30%-$300(M) 0.27% th_aft."
The number of pairs "% - (M)" can be from 1 to 9, and it always
ends with "% th_aft".
I have 10 pairs of variables X1 Y1 .... X10 Y10.
My goal is to extract all pairs from the string variable and split
them into my separate variables.
In this case the result should be:
X1 = 0.35%
Y1 = $100
X2 = 0.30%
Y2 = $300
X3-X9 = Y3-Y9 = 0
X10 = 0.27%
Y10 = Y2 (i.e. last Y extracted from string)
I am trying to use regexm, but unsuccessfully. Any suggestions?
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/