Hi Todd,
I use Stata 8 and I cannot touch strings with more than 80 chars. If
I'm not mistaken this is not the case for newer releases... so I use a
dummy example with less than 80 chars below... I presume this would
work in Stata 9/10 - if not, don't shoot me.
e.g., use
1 Str A|Str B|Str C|Str D
2 Str A|Str B|Str D
3 Str D
A. change the spaces to underscores in a clone variable, and then the
"|" to spaces. The -wordcount- and -word- functions of Stata use
spaces to parse (if someone knows how to use a different separator in
these functions, this step is superfluous.)
. gen newString = subinstr(originalString," ", "_",.)
. replace newString = subinstr(newString,"|", " ",.)
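B. get the maximum number of "words" per record and loop up to it
(howMany is a variable, so the loop limit has to come from r(max),
which -summarize- leaves behind)
. gen howMany = wordcount(newString)
. summ howMany
. local max = r(max)
. forval i=1/`max' {
. gen des_`i' = word(newString,`i')
. }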
This gives you
des_1 des_2 des_3 des_4
Str_A Str_B Str_C Str_D
Str_A Str_B Str_D
Str_D
C. You can easily restore the spaces in the strings in des_1 to des_4
by changing the underscores back to spaces.
D. However, are you sure you want this as a final step? If you want to
have e.g. 4 dummies (one for Str_A, one for Str_B etc.):
str_a str_b str_c str_d
1 1 1 1
1 1 0 1
0 0 0 1
you would have to continue with reshaping long per record and then
back again reshaping wide per content of the string variable... Some
-encode-ing will probably also be necessary in the meantime...
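For comparison, the dummy layout in D can be sketched outside Stata in a few lines of Python. This is only a rough illustration: the function name indicator_rows is made up here, and it assumes every distinct "|"-separated label should get its own 0/1 column.

```python
def indicator_rows(records, sep="|"):
    """Return (labels, rows) with one 0/1 column per distinct label.

    Hypothetical helper for illustration; labels are sorted alphabetically.
    """
    # Split each record on the separator and drop empty pieces.
    split = [[p.strip() for p in r.split(sep) if p.strip()] for r in records]
    # Collect the distinct labels across all records.
    labels = sorted({p for parts in split for p in parts})
    # 1 if the record contains the label, 0 otherwise.
    rows = [[1 if lab in parts else 0 for lab in labels] for parts in split]
    return labels, rows


records = ["Str A|Str B|Str C|Str D", "Str A|Str B|Str D", "Str D"]
labels, rows = indicator_rows(records)
print(labels)
for row in rows:
    print(row)
```

With the three dummy records from above this reproduces the 1/0 table shown in D.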
That being said I'd do the "silly way" (using python or vim or sed) to
manipulate the strings outside Stata...
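A minimal sketch of that outside-Stata route in Python, for what it's worth. The helper names split_records and to_csv and the column prefix des_ are my own assumptions. It splits on "|", pads short records with empty strings, and writes comma-separated text that could then be read back into Stata (e.g. with -insheet-) and merged on the id. Note that some of Todd's records contain commas themselves ("Double Blind (Subject, Caregiver, ...)"), so simply changing "|" to "," would split those in the wrong places; the csv module quotes such fields.

```python
import csv
import io


def split_records(records, sep="|"):
    """Split each record on sep and pad to a common width with ""."""
    parts = [r.split(sep) for r in records]
    width = max((len(p) for p in parts), default=0)
    return [p + [""] * (width - len(p)) for p in parts]


def to_csv(rows, prefix="des_"):
    """Render rows as CSV text with an id column and des_1..des_k headers."""
    width = len(rows[0]) if rows else 0
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id"] + [f"{prefix}{i}" for i in range(1, width + 1)])
    for n, row in enumerate(rows, start=1):
        # id is just the record's position; swap in the real id variable.
        writer.writerow([n] + row)
    return buf.getvalue()


records = [
    "Treatment|Randomized|Single Blind (Investigator)|Placebo Control|Parallel Assignment",
    "Randomized|Single Blind|Active Control|Parallel Assignment",
]
print(to_csv(split_records(records)))
```

The row number stands in for the id only to keep the sketch short; in practice you would export your own id variable next to the string and carry it through.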
hth
tom
Todd Wagner wrote:
> Hi,
>
> I have data from a publicly available database
> (clinicaltrials.gov). This database has a number of text variables
> that I want to break into individual variables and I could use some help.
>
> For example, one of the variables is called study designs. Here are
> some data from the study designs variable
>
> Treatment|Randomized|Double-Blind|Placebo Control|Parallel
> Assignment|Safety/Efficacy Study
> Prevention|Randomized|Open Label|Active Control|Parallel
> Assignment|Bio-equivalence Study
> Prevention|Randomized|Double Blind (Subject, Caregiver, Investigator,
> Outcomes Assessor)|Crossover Assignment
> Randomized|Single Blind|Active Control|Parallel Assignment
> Natural History|Cross-Sectional|Case Control|Prospective Study
> Treatment|Randomized|Open Label|Active Control|Parallel
> Assignment|Efficacy Study
> Treatment|Randomized|Double-Blind|Placebo Control|Single Group
> Assignment|Safety/Efficacy Study
> Treatment|Randomized|Open Label|Placebo Control|Parallel
> Assignment|Safety/Efficacy Study
> Treatment|Randomized|Double-Blind|Active Control|Parallel
> Assignment|Safety/Efficacy Study
> Prevention|Randomized|Double-Blind|Placebo Control|Parallel
> Assignment|Safety/Efficacy Study
> Treatment|Randomized|Single Blind (Investigator)|Placebo
> Control|Parallel Assignment
> Treatment|Randomized|Open Label|Active Control|Parallel
> Assignment|Efficacy Study
>
> What I want to do is parse this text using the "|" into individual variables
>
> So the first case would be
> des1 des2 des3 des4 des5 des6
> Treatment Randomized Double-Blind Placebo Control Parallel
> Assignment Safety/Efficacy Study
>
> I can think of a brute force way where I save this variable and my id
> variable, change | to a comma, output as text, read the text into
> stata as a comma separated file, and then merge it back into my
> data. Sounds silly, but perhaps it is the easiest. Any other ideas?
>
> Thanks,
>
> Todd
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>