You can make progress on this by looking
at the help for string functions. There
is no royal road to geometry, or to this
kind of thing. The solutions tend to be
pedestrian and literal. What follows could
be refined, but it should help a bit.
"PAGES" or "COST" can be the last word,
so let's pick that off when it occurs.
. rename string s
. gen pages_or_cost = word(s,-1) if inlist(word(s,-1), "PAGES", "COST")
Zapping that word will simplify things a bit. I create a copy, s2, to be safe.
. gen s2 = subinstr(s, pages_or_cost,"",1) if pages_or_cost == word(s,-1)
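To see what these two steps do, here is a minimal sketch on three lines taken from the example data. The variable name s and the str60 width are just assumptions for the toy dataset, and the trim() wrapper is an addition of mine, not part of the commands above; it removes the blank left behind once the last word is gone.

```stata
* Toy data: three lines from the example in the question
clear
input str60 s
"ABBOTT DIA 40410 CHLAMYDIA TSPK PAGES"
"COST"
"COMPANY TOTAL PAGES"
end

* word(s, -1) is the last word of s; keep it only if it is PAGES or COST
gen pages_or_cost = word(s, -1) if inlist(word(s, -1), "PAGES", "COST")

* remove the first occurrence of that word, then trim the leftover blank
* (trim() is an extra step, assumed here for tidiness)
gen s2 = trim(subinstr(s, pages_or_cost, "", 1)) if pages_or_cost == word(s, -1)

list, clean
```

Note that subinstr(..., 1) removes the first occurrence, so a line in which "PAGES" or "COST" also appeared earlier in the text would need more care.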
Now let's look for the drug number. We want the position of the
first numeric digit. Some people would do this with regexps but, as I
warned, my solution is pedestrian. First I find where the first "0"
is by
. gen index = index(s2,"0") if index(s2, "0")
but I make sure that I get missing, not 0, if there is no occurrence.
Then I see if any other numeric digit 1 ... 9 occurs earlier.
Again, I need to be careful to ignore 0 results for -index()-, which
mean "not found".
. qui forval i = 1/9 {
      replace index = min(index, index(s2, "`i'")) if index(s2, "`i'")
  }
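The search for the first digit can be tried out on a toy dataset, as in the sketch below. Two things here are my assumptions rather than part of the commands above: the result goes into a variable called first_digit (a hypothetical name, to avoid confusion with the function), and I use strpos(), which is the name under which newer Stata releases document index(). The loop relies on min() ignoring missing arguments, so a variable that starts out missing is simply replaced by the first position found.

```stata
* Toy data: three values of s2, with PAGES or COST already removed
clear
input str40 s2
"ABBOTT DIA 40410 CHLAMYDIA TSPK"
"78920 INSTITUTIONAL"
"COMPANY TOTAL"
end

* position of the first "0", or missing if there is none
gen first_digit = strpos(s2, "0") if strpos(s2, "0")

* min() ignores missing arguments, so any digit found earlier wins
quietly forvalues i = 1/9 {
    replace first_digit = min(first_digit, strpos(s2, "`i'")) if strpos(s2, "`i'")
}

list, clean
```

On these three lines first_digit should be 12, 1 and missing respectively.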
Then like Caesar and ancient Gaul I attempt a division into three parts.
. gen company = substr(s2,1,index - 1)
. gen drug_number = substr(s2,index,5)
. gen after = substr(s2,index + 5,.)
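For those who prefer regexps, the same three-way split could be sketched as follows. This is an alternative, not what I did above, and it assumes a Stata version with regexm() and regexs(). The variable name drug_name, the str40 widths and the local macro pat are all hypothetical choices for this toy example. The pattern spells out five digit classes rather than using a repeat count, which I would not assume every version supports.

```stata
* Toy data: three values of s2, with PAGES or COST already removed
clear
input str40 s2
"ABBOTT DIA 40410 CHLAMYDIA TSPK"
"78920 INSTITUTIONAL"
"COMPANY TOTAL"
end

* non-digits from the start, then five digits, then whatever is left
local pat "^([^0-9]*)([0-9][0-9][0-9][0-9][0-9])(.*)"

gen str40 company     = trim(regexs(1)) if regexm(s2, "`pat'")
gen str5  drug_number = regexs(2)       if regexm(s2, "`pat'")
gen str40 drug_name   = trim(regexs(3)) if regexm(s2, "`pat'")

* lines with no five-digit number are put entirely in company
replace company = trim(s2) if !regexm(s2, "`pat'")

list, clean
```

The trim() calls remove the blanks next to the number, which the substr() approach leaves in place.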
Nick
Terra Curtis
> I am dealing with a string variable called 'string' like the example
> below (this is copied from the data browser):
>
> string
> ABBOTT DIA 40410 CHLAMYDIA TSPK PAGES
> COST
> 40410 CHLAMYDIAZYME PAGES
> COST
> 78920 INSTITUTIONAL PAGES
> COST
> 80000 VISION BL ANALYSER PAGES
> COST
> COMPANY TOTAL PAGES
> COST
> ABBOTT HPD 04200 AMIDATE PAGES
> COST
> 60700 AMINOSYN PAGES
> COST
> 53192 AMINOSYN II PAGES
> COST
> 76340 CALCIJEX PAGES
> COST
> 78920 INSTITUTIONAL PAGES
> COST
> 78920 MULTIPLE PRODUCTS PAGES
> COST
> COMPANY TOTAL PAGES
> COST
>
> I want to split this up a certain way. In some of the observations, a
> company name comes first, always the words directly before any number
> in the string. So first I want to split the string just at the company
> name (and words before any numbers). Then, I want to split it after the
> 5 numbers. Lastly, I want to split it after the 5 numbers and before
> the word "PAGES."
> When I am done, I want to have -- new variables, one with company
> name, one with drug number (the 5 numbers), one with drug name (words
> following the numbers, except "PAGES"), and one with either "PAGES" or
> "COST" according to what is the last word in 'string.' I guess this is
> a lot of questions in one, but does anyone see an easy way to do this?
> I'm new to working with string variables.