Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: Working with complex strings
From: Nick Cox <[email protected]>
To: "'[email protected]'" <[email protected]>
Subject: RE: st: Working with complex strings
Date: Thu, 1 Dec 2011 11:46:42 +0000
With either of the approaches I proposed, a medication whose name began with a numeric character would result in a missing value for the variable created. That should be easy enough to detect and handle.
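A minimal sketch of that detection step, assuming the string variable is called medication as in the example. The flag name startsnum is illustrative only, and the two names beginning with a digit are the FDA entries quoted further down the thread:

```stata
clear
input str40 medication
"metoprolol 100 mg qday"
"8-HOUR BAYER (ASPIRIN)"
"8-MOP (METHOXSALEN)"
"PEG interferon"
end

* flag observations whose medication string begins with a digit
* (startsnum is a hypothetical name, not from the thread)
gen byte startsnum = inrange(substr(trim(medication), 1, 1), "0", "9")

* these are the observations to inspect and handle by hand
list medication if startsnum
```

Flagged observations can then be fixed individually, for example by copying the name across manually.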
Sergiy raises good extra questions about the form of the string variable. Only a report from Anwar can answer them.
Nick
[email protected]
Sergiy Radyakin wrote:
On Wed, Nov 30, 2011 at 4:08 AM, Nick Cox <[email protected]> wrote:
> Parsing on spaces can be more helpful than stated here. We just need
> to reject "words" once we have found the first "word" that starts with
> a numeric digit. That can be done in a loop. It also copes with the
> possibility that numeric characters might be found within medication
> names, but _not_ with the possibility that medication names start with
> numeric characters.
The FDA database lists some medications whose names begin with a number:
8-HOUR BAYER (ASPIRIN)
8-MOP (METHOXSALEN)
Based on the example given, searching for "mg" is another option for
determining the dose:
1. Find "mg" as a separate word following a number.
2. If it is not found, the whole string is the medication name.
3. If it is found, move leftwards from the number to the first character
that is neither a space nor a digit; call its position n.
4. The character found in step 3 is the last character of the medication
name. Split the string: the first n characters are the medication name,
the rest is the dose.
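The "mg" steps might be sketched in Stata roughly as follows. This is only an illustration under some assumptions of mine: the regular expression is one possible way to express the rule, it also accepts "mg" written directly after the number (as in "150mg"), it matches lowercase "mg" only, and the names hasmg, medname and dose are hypothetical.

```stata
clear
input str40 medication
"metoprolol 100 mg qday"
"metoprolol tatrate 150mg bid"
"atenelol 150 mg qday"
"hctz 25 mg qday"
"PEG interferon"
"cimzia 50 mg qday"
end

* steps 1 and 3: look for a number followed by "mg" and a space, and
* capture everything up to the last character before that number that is
* neither a space nor a digit (a trailing space is appended so that "mg"
* at the very end of the string is also found)
gen byte hasmg = regexm(medication + " ", "^(.*[^0-9 ]) *[0-9]+ *mg ")

* step 2: by default the whole string is the medication name
gen medname = trim(medication)

* step 4: where "mg" was found, keep only the captured name
replace medname = regexs(1) if regexm(medication + " ", "^(.*[^0-9 ]) *[0-9]+ *mg ")

* the dose is whatever follows the name
gen dose = trim(substr(medication, length(medname) + 1, .)) if hasmg

list medication medname dose, noobs
```

A decimal dose such as "2.5 mg" would be cut after the decimal point by this rule, so such cases would need separate handling.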
I am not sure whether each patient is assigned only one medication. If
not, and the string lists a complicated prescription, what do we know
about the rules then? Are all doses specified in mg of something, or can
it be g or something else? Perhaps use several different methods, check
whether they all produce the same results, and let a human operator
resolve the conflicts.
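A rough sketch of such a cross-check, assuming two of the methods discussed in this thread: cutting at the first digit with -split-, and cutting just before a number followed by "mg". The variable names bydigit1, bymg and conflict are hypothetical, and the "mg" pattern is only one possible formulation.

```stata
clear
input str40 medication
"metoprolol 100 mg qday"
"metoprolol tatrate 150mg bid"
"atenelol 150 mg qday"
"hctz 25 mg qday"
"PEG interferon"
"cimzia 50 mg qday"
end

* method 1: cut at the first digit (-split- on numeric characters, 0 included)
split medication, parse(0 1 2 3 4 5 6 7 8 9) limit(1) generate(bydigit)
replace bydigit1 = trim(bydigit1)

* method 2: cut just before a number followed by "mg"
gen bymg = trim(medication)
replace bymg = regexs(1) if regexm(medication + " ", "^(.*[^0-9 ]) *[0-9]+ *mg ")

* flag disagreements so that a human operator can resolve them
gen byte conflict = bydigit1 != bymg
count if conflict
list medication bydigit1 bymg if conflict, noobs
```

On the six example strings the two methods agree, so nothing is listed; names containing digits or doses in other units are where conflicts would be expected.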
Best, Sergiy
>
> . split medication
> variables created as string:
> medication1 medication2 medication3 medication4
>
> . gen found = 0
>
> 4 here is empirical for this example. See how many variables -split- creates.
>
> . qui forval j = 1/4 {
> 2. replace found = 1 if inrange(substr(medication`j', 1, 1), "0", "9")
> 3. replace medication`j' = "" if found
> 4. }
>
> . l
>
>      +--------------------------------------------------------------------------------------+
>      |                   medication   medicati~1   medicati~2   medica~3   medica~4   found |
>      |--------------------------------------------------------------------------------------|
>   1. |       metoprolol 100 mg qday   metoprolol                                          1 |
>   2. | metoprolol tatrate 150mg bid   metoprolol      tatrate                             1 |
>   3. |         atenelol 150 mg qday     atenelol                                          1 |
>   4. |              hctz 25 mg qday         hctz                                          1 |
>   5. |               PEG interferon          PEG   interferon                             0 |
>      |--------------------------------------------------------------------------------------|
>   6. |            cimzia 50 mg qday       cimzia                                          1 |
>      +--------------------------------------------------------------------------------------+
>
>
> Then we put the words back together again:
>
> . egen medname = concat(medication?), p(" ")
>
> . l medication medname
>
>      +---------------------------------------------------+
>      |                   medication              medname |
>      |---------------------------------------------------|
>   1. |       metoprolol 100 mg qday           metoprolol |
>   2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>   3. |         atenelol 150 mg qday             atenelol |
>   4. |              hctz 25 mg qday                 hctz |
>   5. |               PEG interferon       PEG interferon |
>      |---------------------------------------------------|
>   6. |            cimzia 50 mg qday               cimzia |
>      +---------------------------------------------------+
>
>
> On Wed, Nov 30, 2011 at 8:36 AM, Nick Cox <[email protected]> wrote:
>> -split- by default parses on spaces, which clearly is no good here
>> given that medications can have compound names and dosages will not be
>> discarded. Steve was evidently pointing to the -parse()- option, not
>> suggesting that parsing on spaces was the answer.
>>
>> If we assume that (a) dose always starts with a number and (b) dose
>> when specified always follows name of medication and (c) names never
>> have numeric characters, then -split- can be used to parse on numeric
>> characters. Here I used 1-9 but 0 should be added if it's ever the
>> first numeric digit:
>>
>> . split medication, parse(1 2 3 4 5 6 7 8 9) limit(1)
>> variable created as string:
>> medication1
>>
>> . replace medication1 = trim(medication1)
>> (5 real changes made)
>>
>> . l
>>
>>      +---------------------------------------------------+
>>      |                   medication          medication1 |
>>      |---------------------------------------------------|
>>   1. |       metoprolol 100 mg qday           metoprolol |
>>   2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>>   3. |         atenelol 150 mg qday             atenelol |
>>   4. |              hctz 25 mg qday                 hctz |
>>   5. |               PEG interferon       PEG interferon |
>>      |---------------------------------------------------|
>>   6. |            cimzia 50 mg qday               cimzia |
>>      +---------------------------------------------------+
>>
>> Another approach is to use -moss- (SSC):
>>
>> . moss medication, match("(.+) [1-9]+") regex
>>
>> . drop _count _pos1
>>
>> . rename _match1 medication2
>>
>> With this regular expression, -moss- misses names without dosages,
>> which can just be copied across.
>>
>> . replace medication2 = medication if missing(medication2)
>> (1 real change made)
>>
>> . l
>>
>>      +------------------------------------------------------------------------+
>>      |                   medication          medication1          medication2 |
>>      |------------------------------------------------------------------------|
>>   1. |       metoprolol 100 mg qday           metoprolol           metoprolol |
>>   2. | metoprolol tatrate 150mg bid   metoprolol tatrate   metoprolol tatrate |
>>   3. |         atenelol 150 mg qday             atenelol             atenelol |
>>   4. |              hctz 25 mg qday                 hctz                 hctz |
>>   5. |               PEG interferon       PEG interferon       PEG interferon |
>>      |------------------------------------------------------------------------|
>>   6. |            cimzia 50 mg qday               cimzia               cimzia |
>>      +------------------------------------------------------------------------+
>>
>> Nick
>>
>> On Wed, Nov 30, 2011 at 5:43 AM, Dudekula, Anwar <[email protected]> wrote:
>>> Thank you very much
>>>
>>> I will work on it. Would the parse() option split "metoprolol tatrate 150mg bid" as
>>>
>>> metoprolol tatrate and 150mg bid
>>>
>>> Or
>>>
>>> metoprolol & tatrate & 150mg & bid
>>>
>>> Thank you
>>> Anwar
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Steve Nakoneshny
>>> Sent: Wednesday, November 30, 2011 12:38 AM
>>> To: [email protected]
>>> Subject: Re: st: Working with complex strings
>>>
>>> - help split - would have answered this question.
>>>
>>> - split medication, parse( ) -
>>>
>>> should do what you want.
>>
>>
>> On Nov 29, 2011, at 9:54 PM, "Dudekula, Anwar" <[email protected]> wrote:
>>
>>>> I am working with a de-identified hospital database with patient names (as a string variable) and medications (as a string variable), as follows:
>>>>
>>>> Patients_name   medication
>>>> ------------------------------------------
>>>> Patient-1       metoprolol 100 mg qday
>>>> Patient-1       metoprolol tatrate 150mg bid
>>>> Patient-1       atenelol 150 mg qday
>>>> Patient-2       hctz 25 mg qday
>>>> Patient-2       PEG interferon
>>>> Patient-3       cimzia 50 mg qday
>>>>
>>>> Question: I am interested in the names of the medications only, not their dosages. Is it possible to split the medication string after the name, i.e.,
>>>>
>>>> 1) split metoprolol tatrate 150mg bid into metoprolol tatrate & 150mg bid
>>>> 2) split metoprolol 100 mg qday into metoprolol & 100 mg qday