Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: extracting portions of a string variable using observations from another variable
From
Daniel Henriksen <[email protected]>
To
[email protected]
Subject
Re: st: extracting portions of a string variable using observations from another variable
Date
Wed, 26 Jan 2011 21:52:30 +0100
I had some technical problems with my mail, sorry.
Thank you very much, Eric, for your new solution. I need to sit down and
read it through carefully. My hope is that I can use this example more
generally (and it looks like I can), because I'd like to match the
drug names as well (cefuroxim, metronidazol, etc.). I have about 3000
combinations of drug names (some consist of one word, others of two or
three words) and ways of administering them.
Again, thank you very much! There are a lot of nice people around here!
Cheers
Daniel
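
A note on the drug-name question above: with roughly 3000 names of one to three words, a word-by-word -merge- gets awkward. One alternative is to pair every record with every name using -cross- and keep the pairs where the name is found by strpos(). The following is only a minimal sketch under stated assumptions: the lookup variable 'drugname', its five values, the 'id' variable and the shortened records are all made up for illustration, and the name is assumed to start the record. -cross- creates (records x names) rows, so it may need to be run in batches on large data, and records without any match are dropped by the -keep-.

```stata
* Hypothetical lookup list of drug names (one to three words each)
clear
input str40 drugname
"Cefuroxim Stragen"
"Metronidazol"
"Metronidazol Actavis"
"Metronidazol B. Braun"
"Nexium"
end
tempfile drugs
save `drugs'

* Shortened versions of the four example records
clear
input str244 record
"Cefuroxim Stragen pulv.t.inj.væske,opl 7,5 mg/ml intravenøst 100 ml"
"Metronidazol Actavis filmovertrukne tabl. 500 mg peroralt 1 tablet"
"Metronidazol B. Braun inf.væske, opløsning 5 mg/ml intravenøst 100 ml"
"Nexium pul.t.inj.+inf.,opl. 0,4 mg/ml intravenøst 100 ml"
end
generate long id = _n

* Pair every record with every drug name
cross using `drugs'

* Keep pairs where the name sits at the very start of the record
* (assumption: the drug name always comes first)
keep if strpos(record, drugname) == 1

* If several names match one record, keep the longest one
generate len = strlen(drugname)
bysort id (len): keep if _n == _N
drop len

list id drugname
```

Keeping the longest match is a design choice: it stops a short name such as "Metronidazol" from winning over "Metronidazol Actavis" when both are in the list.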
2011/1/26 Eric Booth <[email protected]>:
> <>
>
>
> Daniel asked about matching more than one word in the first example using -merge- to match the data.
> One way would be to create a dataset for each word of the split 'Dispenseringsform' and merge each one in during the loop. Below, I've modified the first example I provided to do what he asks (edits are marked with *comments*):
>
> **********************************! Begin Example
> ** Note: Watch for Wrapping **
>
> //DATASET OF WORDS TO BE EXTRACTED FROM RECORDS DATA-->
> clear
> inp str30 Dispenseringsform
> "filmovertrukne tabl."
> "oral opløsning"
> "pulv.t.konc.t.inf.v."
> "inj.-/inf.væske"
> "enterotabletter"
> "tabletter"
> "pulv.t.inj.væske,opl"
> "inf.væske, opløsning"
> "pul.t.inj.+inf.,opl."
> end
> levelsof Dispenseringsform, loc(alt)
> di `"`alt'"'
> split Dispenseringsform
> l Dispenseringsform1
>
> **new**
> preserve
> keep Dispenseringsform1
> duplicates drop //--so that you can m:1 merge later
> sa dispense_Dispenseringsform1.dta, replace
> restore
> preserve
> keep Dispenseringsform2
> duplicates drop
> sa dispense_Dispenseringsform2.dta, replace
> restore
>
>
> //1. EXTRACT WORDS IN DISPENSE.DTA USING MERGE -->
> clear
> inp str244 record
> "Cefuroxim Stragen pulv.t.inj.væske,opl 7,5 mg/ml intravenøst 100 ml kl 00:00 + 100 ml kl 08:00 + 100 ml kl 16:00 ;(xxx yyy (Overlæge) aaa12 09-09-2010 00:35)"
> "Metronidazol Actavis filmovertrukne tabl. 500 mg peroralt 1 tablet 3 gang(e) Daglig ;(xxx yyy-zzz (Stud. med.) aaa1bb 19-08-2010 01:20)"
> "Metronidazol B. Braun inf.væske, opløsning 5 mg/ml intraveøst 100 ml 3 gang(e) Daglig ;(xxx yyy (Reservelæge) aaa2bb 29-09-2010 01:21)"
> "Nexium pul.t.inj.+inf.,opl. 0,4 mg/ml intravenøst 100 ml 1 gang(e) Daglig ;(xxx yyy (Overlæge) aaa12 27-10-2010 01:37)"
> end
> sa records.dta, replace
>
> split record
> l record2-record6
>
> /*
> extract words in 'record' that
> match dispense.dta list: (
> pulv.t.inj.væske,opl,
> filmovertrukne tabl. inf.væske,
> opløsning and pul.t.inj.+inf.,opl.)
> */
>
> g str30 newvar = ""
>
>
> **updated**
> forval n = 2/5 {
> **Adding the -foreach- below allows you to merge over more
> ****than one word split from 'Dispenseringsform'
> foreach new in Dispenseringsform1 Dispenseringsform2 {
>
> rename record`n' `new'
> merge m:1 `new' using "dispense_`new'.dta"
> drop if _m==2 //--keep matched and master records only
> replace newvar = `new' if _m==3 & mi(newvar)
> rename `new' record`n' // --*I reordered the drop/rename lines
> drop _merge
> cap drop Dispenseringsfor*
> }
> }
> order newvar
> drop record? record??
> l
> **********************************! End Example
> Also, keep in mind that you can match on all the words using the second example I provided.
>
> - Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> [email protected]
>
> On Jan 26, 2011, at 9:27 AM, Steven Samuels wrote:
>
>> Daniel, for the edification of all users (including Eric) who might not remember your original question and his response, please include edited versions in follow-up questions. (FAQ 3.4 "Edit Previous Posting").
>>
>> Steve
>> [email protected]
>>
>>
>>
>> On Jan 26, 2011, at 9:33 AM, Daniel Henriksen wrote:
>>
>> Dear Eric
>>
>> Thank you so much for your suggestions! I will dig further into them asap.
>>
>> Regarding your first suggestion, is it possible to match two or three
>> words and not just the one parsed? Excuse my ignorance; I am still a
>> beginner when it comes to Stata.
>>
>> cheers
>> daniel
>
>
>
>> Eric A. Booth wrote:
>>
>> <>
>>
>>
>> Here are 2 approaches:
>>
>> The first one is less reliable (i.e., it might require careful examination and tweaking) but might be more useful if you are bringing over more variables from the 'Dispenseringsform'/using dataset to the 'records'/master dataset. Keep in mind that it matches on the first word (parsed by a space character) in 'Dispenseringsform', so it matches "filmovertrukne tabl." by the "filmovertrukne" part.
>>
>> The second approach is more straightforward if you are working with a list of 'Dispenseringsform' values that is short enough to fit into a macro (see help limits) and you don't need to bring in any extra variables from the 'Dispenseringsform' dataset. It simply collects all the forms into a local macro (`alt') and then uses a string position function (see help string_functions) to find matches.
>>
>> The results of both approaches are stored in the variable 'newvar':
>>
>> <snip>
>
>
>> On Jan 24, 2011, at 3:21 PM, Daniel Henriksen wrote:
>>
>>> Hello statalist
>>>
>>> Hope you can help me. Is it possible for Stata to extract specific
>>> words within a string using observations from another variable?
>>> I have a dataset with a list of different ways of dispensing the drug
>>> (which form it is). Here's an example:
>>>
>>> Dispenseringsform
>>> filmovertrukne tabl.
>>> oral opløsning
>>> pulv.t.konc.t.inf.v.
>>> inj.-/inf.væske
>>> enterotabletter
>>> tabletter
>>> pulv.t.inj.væske,opl
>>> inf.væske, opløsning
>>> pul.t.inj.+inf.,opl.
>>> (I have 270 rows of these (different forms and different ways of spelling it))
>>>
>>> Then I have another dataset (only one variable but many observations)
>>> containing information on what drug, way of dispensing, dose and time
>>> the drug is to be administered to the patient:
>>>
>>> Cefuroxim Stragen pulv.t.inj.væske,opl 7,5 mg/ml intravenøst 100 ml kl
>>> 00:00 + 100 ml kl 08:00 + 100 ml kl 16:00 ;(xxx yyy (Overlæge) aaa12
>>> 09-09-2010 00:35)
>>> Metronidazol Actavis filmovertrukne tabl. 500 mg peroralt 1 tablet 3
>>> gang(e) Daglig ;(xxx yyy-zzz (Stud. med.) aaa1bb 19-08-2010 01:20)
>>> Metronidazol B. Braun inf.væske, opløsning 5 mg/ml intraveøst 100 ml
>>> 3 gang(e) Daglig ;(xxx yyy (Reservelæge) aaa2bb 29-09-2010 01:21)
>>> Nexium pul.t.inj.+inf.,opl. 0,4 mg/ml intravenøst 100 ml 1 gang(e)
>>> Daglig ;(xxx yyy (Overlæge) aaa12 27-10-2010 01:37)
>>>
>>> So I would like to extract "pulv.t.inj.væske,opl", "filmovertrukne
>>> tabl.", "inf.væske, opløsning" and "pul.t.inj.+inf.,opl." from these four
>>> observations and place them in a new variable without having to go
>>> through all of the information manually.
>>> I hope my question is clear.
>>>
>>> Thank you for your time
>>> Daniel
>>>
>
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
--
Daniel Henriksen
Ph.d. studerende, læge
Infektionsmedicinsk afd Q / Akut Modtage Afdelingen
Odense Universitetshospital
Bygning 2, 1. sal
Sdr. Boulevard 29
5000 Odense C
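
For readers who reach this message without the snipped code: the second approach described in the quoted thread (collect the forms in a local macro, then test each one with a string position function) can be sketched roughly as below. This is not the code originally posted; it is a minimal illustration that assumes the list fits in a macro (see help limits), uses a shortened form list and shortened records, and adds a longest-match rule that the thread does not mention.

```stata
* Form list (a subset of the forms in the thread)
clear
input str30 Dispenseringsform
"filmovertrukne tabl."
"tabletter"
"enterotabletter"
"pulv.t.inj.væske,opl"
"inf.væske, opløsning"
"pul.t.inj.+inf.,opl."
end

* Collect all distinct forms into the local macro `alt'
levelsof Dispenseringsform, local(alt)

* Shortened versions of the four example records
clear
input str244 record
"Cefuroxim Stragen pulv.t.inj.væske,opl 7,5 mg/ml intravenøst 100 ml"
"Metronidazol Actavis filmovertrukne tabl. 500 mg peroralt 1 tablet"
"Metronidazol B. Braun inf.væske, opløsning 5 mg/ml intravenøst 100 ml"
"Nexium pul.t.inj.+inf.,opl. 0,4 mg/ml intravenøst 100 ml"
end

generate str30 newvar = ""
foreach form of local alt {
    * strpos() returns 0 when the form does not occur in record.
    * Assumption added here: prefer the longest matching form, so that
    * "tabletter" cannot shadow "enterotabletter".
    replace newvar = `"`form'"' if strpos(record, `"`form'"') > 0 ///
        & strlen(`"`form'"') > strlen(newvar)
}
list newvar record
```

Unlike the -merge- example, this matches whole multi-word forms in one pass, but it brings over no extra variables from the form dataset.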