Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: AW: RE: Splitting string variables "advanced"
From: "Seliger Florian" <[email protected]>
To: "'[email protected]'" <[email protected]>
Subject: st: AW: RE: Splitting string variables "advanced"
Date: Thu, 19 Jan 2012 09:12:10 +0000
Thank you, Nick. That helped a lot.
Best,
__________
Florian Seliger
ETH Zurich
KOF Swiss Economic Institute
WEH C 5
Weinbergstrasse 35
8092 Zurich, Switzerland
[email protected]
www.kof.ethz.ch
-----Original Message-----
From: [email protected] [mailto:[email protected]] On behalf of Nick Cox
Sent: Wednesday, 18 January 2012 16:10
To: '[email protected]'
Subject: st: RE: Splitting string variables "advanced"
This is a bit of a kludge but the technique may help. (I tried regex approaches including -moss- (SSC) without success, but there may well be a better solution that way.)
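gen copy = itrim(myvar)
gen isnum = .
local todo 1
* loop until no ";" is left in any observation
quietly while `todo' {
    * is the character 4 positions after the first ";" a digit?
    replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
    * a ";" before a number becomes "@"; any other ";" becomes ","
    replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
    count if strpos(copy, ";")
    local todo = r(N)
}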
The logic is:
1. -itrim()- first. It shouldn't make anything more difficult, and it might help.
2. "Number" for you evidently means something beginning with a code such as "US2" or "EP1". So I look for a numeric character in a fixed position: the fourth character after the semicolon.
3. Depending on what is found, I replace ";" by "@" or ",".
4. Later I would -split- on "@". Clearly you should use a character not otherwise present, which you can check with -count if strpos(myvar, "@")-.
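For readers who want to see the four steps end to end, here is a minimal sketch run on the example value from the original question. The variable names myvar and copy follow the code above; the stub name part and the final -trim()- pass are assumptions added for illustration, not part of Nick's post.

```stata
* one observation holding the example string from the question
clear
set obs 1
gen myvar = "EP1763200-A1 -- EP1530342-A2 ; US2004199663-A1 HORVITZ E J (HORV-Individual); APACIBLE J T (APAC-Individual) HORVITZ E J, APACIBLE J T; US2004254998-A1 MICROSOFT CORP (MICT) HORVITZ E J"

* step 4 precondition: the placeholder "@" must not occur in the data
count if strpos(myvar, "@")
assert r(N) == 0

* steps 1 to 3: the loop from the post, unchanged
gen copy = itrim(myvar)
gen isnum = .
local todo 1
quietly while `todo' {
    replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
    replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
    count if strpos(copy, ";")
    local todo = r(N)
}

* step 4: split on the placeholder (the stub "part" is a hypothetical name)
split copy, parse("@") generate(part)

* remove the blanks left around the former separators
foreach v of varlist part* {
    replace `v' = trim(`v')
}

* the first piece should match Var1 in the question
assert part1 == "EP1763200-A1 -- EP1530342-A2"
list part*, noobs
```

Note that with this approach the semicolon inside the second piece is replaced by a comma, so part2 differs from the requested Var2 in that one character.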
Nick
[email protected]
Seliger Florian wrote:
I want to split string variables with values such as:
EP1763200-A1 -- EP1530342-A2 ; US2004199663-A1 HORVITZ E J (HORV-Individual); APACIBLE J T (APAC-Individual) HORVITZ E J, APACIBLE J T; US2004254998-A1 MICROSOFT CORP (MICT) HORVITZ E J
At the end, there should be several variables and their values should look as follows:
Var1
EP1763200-A1 -- EP1530342-A2
Var2
US2004199663-A1 HORVITZ E J (HORV-Individual); APACIBLE J T (APAC-Individual) HORVITZ E J, APACIBLE J T
Var3
US2004254998-A1 MICROSOFT CORP (MICT) HORVITZ E J
My problem is the following: I used
split cp, p(" ; " "; ")
but in this case, Stata will also split Var2 because of the semicolon.
I'm looking for a way to tell Stata to keep the value of Var2 in one variable when the semicolon is followed by a name.
Stata should split the variable only if a number follows the semicolon.
Alternatively, I would like to delete the confusing semicolon in a first step and then split the variable with split cp, p(" ; " "; ").
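The rule "split only if a number follows the semicolon" could also be sketched with a regular-expression lookahead. This is not from the thread: it assumes Stata 14 or later, where the Unicode regex function -ustrregexra()- is available, and it assumes that every "number" starts with two capital letters followed by a digit (as in "US2" or "EP1"). The names cp2 and the placeholder "@" are hypothetical choices for illustration.

```stata
* one observation holding the example string from the question
clear
set obs 1
gen cp = "EP1763200-A1 -- EP1530342-A2 ; US2004199663-A1 HORVITZ E J (HORV-Individual); APACIBLE J T (APAC-Individual) HORVITZ E J, APACIBLE J T; US2004254998-A1 MICROSOFT CORP (MICT) HORVITZ E J"

* the placeholder "@" must not already occur in the data
count if strpos(cp, "@")
assert r(N) == 0

* replace a semicolon (and surrounding blanks) with "@" only when it is
* followed by two capital letters and a digit; the lookahead (?=...)
* checks the following text without consuming it
gen cp2 = ustrregexra(cp, " *; *(?=[A-Z][A-Z][0-9])", "@")

* split on the placeholder; generate(Var) yields Var1, Var2, ...
split cp2, parse("@") generate(Var)
list Var*, noobs
```

Unlike replacing the unwanted semicolon with a comma, this leaves the semicolon before a name untouched, so the pieces keep their original punctuation.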
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*