Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Extracting substrings from variable and combining variables.
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Extracting substrings from variable and combining variables.
Date
Mon, 4 Jun 2012 11:35:34 +0100
This helps clarify what you want. But as already shown in this thread
your data show that some people are both "637" and "642", so you can't
get a variable like this. A string variable can't be both "637" and
"642". At most you can take the composite string variable and edit it.
I already explained the double counting at
http://www.stata.com/statalist/archive/2012-06/msg00010.html so that's
not an issue.
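
A minimal sketch of the distinction, assuming -preght1- and -preght2- from
your earlier post and the indicators -has637- and -has642- from mine already
exist. Stacking the code variables gives the combined frequencies, but it
counts diagnoses, not people; the cross-tabulation shows how many people
carry both codes. Run it from a do-file and treat it as illustrative only.

```stata
* Sketch only: assumes preght1, preght2, has637 and has642 already exist
preserve
stack preght1 preght2, into(preghtX) clear   // one observation per recorded code
tabulate preghtX                              // frequencies of codes, not of people
restore

tabulate has637 has642                        // people with both codes fall in the (1, 1) cell
count if has637 == 1 & has642 == 1
```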
Nick
On Mon, Jun 4, 2012 at 11:09 AM, Amal Khanolkar <[email protected]> wrote:
> Hi Nick,
>
> Sorry for the confusion: I missed your request for a better explanation of what I mean by combining:
>
> If I have the following variables (preght1 and preght2 shown here):
>
> preght1 | Freq. Percent Cum.
> ------------+-----------------------------------
> 637 | 8,314 20.76 20.76
> 642 | 21,268 53.11 73.88
> O1 | 10,461 26.12 100.00
> ------------+-----------------------------------
> Total | 40,043 100.00
>
> . tab preght2
>
> preght2 | Freq. Percent Cum.
> ------------+-----------------------------------
> 637 | 11,202 33.51 33.51
> 642 | 15,191 45.44 78.95
> O1 | 7,036 21.05 100.00
> ------------+-----------------------------------
> Total | 33,429 100.00
>
> I'd like to generate preghtX, where I combine the above 3 categories from both preght1 and preght2 as below:
>
>
> preghtX | Freq. Percent Cum.
> ------------+-----------------------------------
> 637 | 19516 26.56 26.56
> 642 | 36459 49.62 76.19
> O1 | 17497 23.81 100.00
> ------------+-----------------------------------
> Total | 73472 100.00
>
>
> I did try something very similar to what you suggested below:
>
>
> forval j = 1/8 {
> 2. replace hasO1 = 1 if hasO1 == 0 & substr(mdiag`j', 1, 2) == "O1"
> 3. replace has637 = 1 if has637 == 0 & substr(mdiag`j', 1, 3) == "637"
> 4. replace has642 = 1 if has642 == 0 & substr(mdiag`j', 1, 3) == "642"
> 5. }
> (10461 real changes made)
> (8314 real changes made)
> (21268 real changes made)
> (6753 real changes made)
> (11007 real changes made)
> (14844 real changes made)
> (3637 real changes made)
> (2092 real changes made)
> (5152 real changes made)
> (1718 real changes made)
> (579 real changes made)
> (1602 real changes made)
> (480 real changes made)
> (0 real changes made)
> (0 real changes made)
> (202 real changes made)
> (0 real changes made)
> (0 real changes made)
> (74 real changes made)
> (0 real changes made)
> (0 real changes made)
> (36 real changes made)
> (0 real changes made)
> (0 real changes made)
>
> .
> end of do-file
>
> . sum hasO1 has637 has642
>
> Variable | Obs Mean Std. Dev. Min Max
> -------------+--------------------------------------------------------
> hasO1 | 2991456 .0078092 .0880242 0 1
> has637 | 2991456 .0073516 .0854258 0 1
> has642 | 2991456 .0143295 .1188451 0 1
>
> . tab hasO1
>
> hasO1 | Freq. Percent Cum.
> ------------+-----------------------------------
> 0 | 2,968,095 99.22 99.22
> 1 | 23,361 0.78 100.00
> ------------+-----------------------------------
> Total | 2,991,456 100.00
>
> . tab has637
>
> has637 | Freq. Percent Cum.
> ------------+-----------------------------------
> 0 | 2,969,464 99.26 99.26
> 1 | 21,992 0.74 100.00
> ------------+-----------------------------------
> Total | 2,991,456 100.00
>
> . tab has642
>
> has642 | Freq. Percent Cum.
> ------------+-----------------------------------
> 0 | 2,948,590 98.57 98.57
> 1 | 42,866 1.43 100.00
> ------------+-----------------------------------
> Total | 2,991,456 100.00
>
> The reason I was a bit unsure of the above method is that the subjects coded as '1' above total 88219 and not 90930 as they should. I wasn't able to figure out how I was losing the 2711 additional subjects: whether Stata treated them as duplicates or something else.
>
> But thanks for your help! Just wanted to clear up why I didn't use the above method discussed last week.
>
> Best regards,
>
> /Amal.
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 04 June 2012 11:20
> To: [email protected]
> Subject: Re: st: Extracting substrings from variable and combining variables.
>
> Previously I wrote
>
> " I don't know exactly what you want, so that rules out further
> suggestions from me for the time being. You would get better help by
> giving examples of what the variables you want would look like."
>
> You've not done this. All that I can pick up here is that you want to
> combine variables. I don't know what that "combining" means. So, this
> is another (but final) attempt from me to help.
>
> Note that -regexm()- and -regexs()- are functions, not commands. This
> is not just a piece of pedantry: (1) referring to functions as
> commands may confuse at least some readers and clarifies nothing, and
> (2) thinking of these, always, as functions helps remind everyone that
> they are defined and documented distinctly.
>
> It seems that you have variables -mdiag1-mdiag8- and wish to extract
> diagnoses "O1", "637", "642". You expect those diagnoses to be leading
> substrings. You can create a new composite variable this way.
>
> gen anydiag = ""
>
> foreach diag in O1 637 642 {
>     local len = length("`diag'")
>     forval j = 1/8 {
>         replace anydiag = anydiag + "`diag'" if substr(mdiag`j', 1, `len') == "`diag'"
>     }
> }
>
> But we've already gone over similar ideas in this thread. I don't
> think you ever said why you can't work from that resulting composite
> variable.
>
> You can create new indicator variables this way
>
> gen hasO1 = 0
> gen has637 = 0
> gen has642 = 0
>
> forval j = 1/8 {
> replace hasO1 = 1 if hasO1 == 0 & substr(mdiag`j', 1, 2) == "O1"
> replace has637 = 1 if has637 == 0 & substr(mdiag`j', 1, 3) == "637"
> replace has642 = 1 if has642 == 0 & substr(mdiag`j', 1, 3) == "642"
> }
>
> This can be done with regex machinery too as a matter of taste.
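>
> A minimal sketch of the regex version, assuming the three indicators
> have already been generated as 0 as above and that the codes are
> expected only as leading substrings (hence the ^ anchor):
>
> ```stata
> * Sketch only: same logic as the substr() loop, using regexm()
> forval j = 1/8 {
>     replace hasO1 = 1 if regexm(mdiag`j', "^O1")
>     replace has637 = 1 if regexm(mdiag`j', "^637")
>     replace has642 = 1 if regexm(mdiag`j', "^642")
> }
> ```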
>
> Nick
>
> On Mon, Jun 4, 2012 at 9:42 AM, Amal Khanolkar <[email protected]> wrote:
>
>> Originally, I started using the 'regex' command to extract ICD codes from my variables of interest shown below (mdiag1, mdiag2, mdiag3, mdiag4 etc....). I'm extracting the same ICD codes from all the mdiag variables starting with the numbers/letters: 637, 642 and O1. Initially I extracted the ICD codes from each mdiag variable separately with the idea of combining them at the end. But that seems a bit more complicated now. Maybe, one solution could be to extract all ICD codes from all mdiag variables at the same time. There are 12 such mdiag variables.
>>
>> gen preght1 = regexs(0) if regexm(mdiag1, "^(637|642|O1)")
>> tab preght1
>>
>> gen preght2 = regexs(0) if regexm(mdiag2, "^(637|642|O1)")
>> tab preght2
>>
>> gen preght3 = regexs(0) if regexm(mdiag3, "^(637|642|O1)")
>> tab preght3
>>
>> gen preght4 = regexs(0) if regexm(mdiag4, "^(637|642|O1)")
>> tab preght4
>>
>> gen preght5 = regexs(0) if regexm(mdiag5, "^(637|642|O1)")
>> tab preght5
>>
>> gen preght6 = regexs(0) if regexm(mdiag6, "^(637|642|O1)")
>> tab preght6
>>
>> gen preght7 = regexs(0) if regexm(mdiag7, "^(637|642|O1)")
>> tab preght7
>>
>> gen preght8 = regexs(0) if regexm(mdiag8, "^(637|642|O1)")
>> tab preght8
>>
>> The above generates 8 preght variables and works great.
>>
>> Initially I tried to combine the (mdiagX, "^(637|642|O1) for each mdiag variable by enclosing them in separate brackets one after another. But it doesn't work. How do I modify the regexs/regexm commands to be able to tell Stata to pluck out the ICD codes for several variables in the same command line?