
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: Working with complex strings

From	Nick Cox <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	RE: st: Working with complex strings
Date	Thu, 1 Dec 2011 11:46:42 +0000
With either of the approaches I proposed, a medication whose name began with a numeric character would imply a missing value for the variable created. That should be easy enough to detect and handle.
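
A minimal sketch of how such cases might be detected. The toy data, the variable name review, and the fallback in the last line are assumptions for illustration; medname stands for whichever name variable was created, and the 8-MOP dose is made up.

```stata
* Toy data: medname is empty where the name started with a digit
clear
input str30 medication str20 medname
"metoprolol 100 mg qday" "metoprolol"
"8-MOP 10 mg"            ""
end

* Flag rows where a non-empty string produced an empty name
gen byte review = trim(medname) == "" & trim(medication) != ""
list medication if review

* One possible fallback (an assumption, not a rule from this thread):
* keep the whole string so that it can be checked by hand
replace medname = medication if review
```

The flagged rows still need a human decision; the fallback only avoids losing them silently.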

Sergiy raises good extra questions about the form of the string variable. Only a report from Anwar can answer them. 

Nick 
[email protected] 

Sergiy Radyakin

On Wed, Nov 30, 2011 at 4:08 AM, Nick Cox <[email protected]> wrote:
> Parsing on spaces can be more helpful than stated here. We just need
> to reject "words" once we have found the first "word" that starts with
> a numeric digit. That can be done in a loop. It also copes with the
> possibility that numeric characters might be found within medication
> names, but _not_ with the possibility that medication names start with
> numeric characters.


The FDA database lists some medications whose names begin with numbers:
	

    8-HOUR BAYER  (ASPIRIN)
    8-MOP  (METHOXSALEN)


Based on the example given, searching for "mg" is another option for
determining the dose:
1. Find "mg" as a separate word trailing a number.
2. If not found, the whole string is the medication name.
3. If found, proceed leftwards to the first non-space, non-numeric
character; call its position n.
4. The character in #3 is the last character of the medication name.
Split, taking the first n characters as the medication name and the
rest as the dose.
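
The steps above might be sketched with -regexm()- as follows. This is only a sketch under assumptions: doses are whole numbers, "mg" may or may not be separated from the number by a space (the example data contain "150mg"), the variable names medname_mg and dose_mg are made up, and the 8-MOP row is an invented dose added to show a name starting with a digit.

```stata
* Toy data: three rows from the thread plus one made-up 8-MOP row
clear
input str30 medication
"metoprolol 100 mg qday"
"metoprolol tatrate 150mg bid"
"PEG interferon"
"8-MOP 10 mg qday"
end

* Steps 1 and 3: a number, optional spaces, then "mg" ending a word;
* group 1 ends at the last non-space, non-digit character before the number
gen medname_mg = medication
replace medname_mg = regexs(1) if regexm(medication, "^(.*[^0-9 ]) *[0-9]+ *mg( |$)")

* Step 4: whatever follows the name is the dose (empty under step 2)
gen dose_mg = trim(substr(medication, length(medname_mg) + 1, .))
list
```

A decimal dose such as "2.5 mg" would be split in the wrong place by this pattern, so it is worth checking the results against another method.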

I am not sure whether each patient is assigned one medication only. If
not, and the string lists a complicated prescription, what do we know
about the rules then? Are all doses specified in mg of something, or
can it be g, or something else? Perhaps use different methods and check
whether they all produce the same results, and let a human operator
resolve the conflicts.

Best, Sergiy

>
> . split medication
> variables created as string:
> medication1  medication2  medication3  medication4
>
> . gen found = 0
>
> The 4 here is empirical for this example; check how many variables -split- creates.
>
> . qui forval j = 1/4 {
>  2. replace found = 1 if inrange(substr(medication`j', 1, 1), "0", "9")
>  3. replace medication`j' = "" if found
>  4. }
>
> . l
>
>     +--------------------------------------------------------------------------------------+
>     |                   medication   medicati~1   medicati~2   medica~3   medica~4   found |
>     |--------------------------------------------------------------------------------------|
>  1. |       metoprolol 100 mg qday   metoprolol                                          1 |
>  2. | metoprolol tatrate 150mg bid   metoprolol      tatrate                             1 |
>  3. |         atenelol 150 mg qday     atenelol                                          1 |
>  4. |              hctz 25 mg qday         hctz                                          1 |
>  5. |               PEG interferon          PEG   interferon                             0 |
>     |--------------------------------------------------------------------------------------|
>  6. |            cimzia 50 mg qday       cimzia                                          1 |
>     +--------------------------------------------------------------------------------------+
>
>
> Then we put the words back together again:
>
> . egen medname = concat(medication?), p(" ")
>
> . l medication medname
>
>     +---------------------------------------------------+
>     |                   medication              medname |
>     |---------------------------------------------------|
>  1. |       metoprolol 100 mg qday           metoprolol |
>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>  3. |         atenelol 150 mg qday             atenelol |
>  4. |              hctz 25 mg qday                 hctz |
>  5. |               PEG interferon       PEG interferon |
>     |---------------------------------------------------|
>  6. |            cimzia 50 mg qday               cimzia |
>     +---------------------------------------------------+
>
>
> On Wed, Nov 30, 2011 at 8:36 AM, Nick Cox <[email protected]> wrote:
>> -split- by default parses on spaces, which clearly is no good here
>> given that medications can have compound names and dosages will not be
>> discarded. Steve was evidently pointing to the -parse()- option, not
>> suggesting that parsing on spaces was the answer.
>>
>> If we assume that (a) dose always starts with a number and (b) dose
>> when specified always follows name of medication and (c) names never
>> have numeric characters, then -split- can be used to parse on numeric
>> characters. Here I used 1-9 but 0 should be added if it's ever the
>> first numeric digit:
>>
>> . split medication, parse(1 2 3 4 5 6 7 8 9) limit(1)
>> variable created as string:
>> medication1
>>
>> . replace medication1 = trim(medication1)
>> (5 real changes made)
>>
>> . l
>>
>>     +---------------------------------------------------+
>>     |                   medication          medication1 |
>>     |---------------------------------------------------|
>>  1. |       metoprolol 100 mg qday           metoprolol |
>>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>>  3. |         atenelol 150 mg qday             atenelol |
>>  4. |              hctz 25 mg qday                 hctz |
>>  5. |               PEG interferon       PEG interferon |
>>     |---------------------------------------------------|
>>  6. |            cimzia 50 mg qday               cimzia |
>>     +---------------------------------------------------+
>>
>> Another approach is to use -moss- (SSC):
>>
>> . moss medication, match("(.+) [1-9]+") regex
>>
>> . drop _count _pos1
>>
>> . rename _match1 medication2
>>
>> With this regular expression, -moss- misses names without dosages,
>> which can just be copied across.
>>
>> . replace medication2 = medication if missing(medication2)
>> (1 real change made)
>>
>> . l
>>
>>     +------------------------------------------------------------------------+
>>     |                   medication          medication1          medication2 |
>>     |------------------------------------------------------------------------|
>>  1. |       metoprolol 100 mg qday           metoprolol           metoprolol |
>>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate   metoprolol tatrate |
>>  3. |         atenelol 150 mg qday             atenelol             atenelol |
>>  4. |              hctz 25 mg qday                 hctz                 hctz |
>>  5. |               PEG interferon       PEG interferon       PEG interferon |
>>     |------------------------------------------------------------------------|
>>  6. |            cimzia 50 mg qday               cimzia               cimzia |
>>     +------------------------------------------------------------------------+
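
If installing -moss- from SSC is not convenient, something similar might be done with the official functions -regexm()- and -regexs()-. This is a sketch under assumptions (a) to (c) above, not a replacement for the -moss- results shown; the variable name medication3 is made up.

```stata
* Toy data: three rows from the thread
clear
input str30 medication
"metoprolol tatrate 150mg bid"
"hctz 25 mg qday"
"PEG interferon"
end

* Start from the full string so names without a dose are kept as they are
gen medication3 = medication

* Group 1: the run of non-digit characters before the first space-plus-digit
replace medication3 = regexs(1) if regexm(medication, "^([^0-9]+) +[0-9]")
replace medication3 = trim(medication3)
list
```

Because the pattern needs at least one non-digit character at the start, a name beginning with a digit is left unchanged rather than set to missing.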
>>
>> Nick
>>
>> On Wed, Nov 30, 2011 at 5:43 AM, Dudekula, Anwar <[email protected]> wrote:
>>> Thank you very much
>>>
>>> I will work on it. Would the parse() option split metoprolol tatrate 150mg bid as
>>>
>>> metoprolol tatrate and 150mg bid
>>>
>>> Or
>>>
>>> metoprolol & tatrate & 150mg &  bid
>>>
>>> Thank you
>>> Anwar
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Steve Nakoneshny
>>> Sent: Wednesday, November 30, 2011 12:38 AM
>>> To: [email protected]
>>> Subject: Re: st: Working with complex strings
>>>
>>> - help split - would have answered this question.
>>>
>>> - split medication, parse( ) -
>>>
>>> should do what you want.
>>
>>
>>  On Nov 29, 2011, at 9:54 PM, "Dudekula, Anwar" <[email protected]> wrote:
>>
>>>> I am working with a deidentified hospital database with patient names (as a string variable) and medications (as a string variable) as follows
>>>>
>>>> Patients_name        medication
>>>> ------------------------------------
>>>> Patient-1            metoprolol 100 mg qday
>>>> Patient-1            metoprolol tatrate 150mg bid
>>>> Patient-1            atenelol 150 mg qday
>>>> Patient-2            hctz 25 mg qday
>>>> Patient-2            PEG interferon
>>>> Patient-3            cimzia 50 mg qday
>>>>
>>>> Question: I am interested in the name of the medication only, not the dosage. Is it possible to split the medication string after the name, i.e.,
>>>>
>>>> 1) split  metoprolol tatrate 150mg bid into  metoprolol tatrate  &  150mg bid
>>>> 2) split  metoprolol 100 mg qday into   metoprolol   &   100 mg qday
