Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
From
Claude Beaty <[email protected]>
To
"[email protected]" <[email protected]>
Subject
RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Date
Wed, 13 Jun 2012 18:50:14 +0000
Thanks
Claude A. Beaty Jr., M.D.
Halsted Surgical Resident
Cardiac Surgery Research Fellow
The Johns Hopkins Hospital
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Sarah Edgington
Sent: Wednesday, June 13, 2012 1:54 PM
To: [email protected]
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Claude,
If the comorbidity measures are constant across visits than the solution is simply to keep only 1 visit. If they are not constant you'll have to create a rule for whether an individual patient has or does not have a given comorbidity (or does or does not have some other factor) no matter what form the data is in. Implementing whatever rules you need to create measures that are constant for an individual is probably going to be easier in a long data set than it would be in a wide one for exactly the reasons my number of visits example suggests. Even if you don't need number of visits, thinking through that example (and how you would calculate that sort of variable in a wide dataset) is likely to help you figure out how to construct other measures.
I've already addressed the issue of how to deal with multiple lines of data per person making it impossible to calculate prevalence/incidence/means/standard deviations/whatever other descriptive statistic you want. Just because you have multiple lines per person doesn't mean you need to use all them in every command.
-Sarah
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Claude Beaty
Sent: Wednesday, June 13, 2012 10:28 AM
To: [email protected]
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Sarah,
I am attempting to investigate a clinical outcome based on patient parameters prior to an intervention and then subsequent to that intervention. I would like to incorporate and adjust for factors such as existing comorbidities and the prevalence/incidence of these factors is necessary for thorough analysis. One database has patient parameters prior to the intervention while the other has patient outcomes subsequent to the intervention (master). The master data set is in the long form by return visit ID, and does not include visit dates. The problem is that when looking at the data in the long form, individual patients are repeated multiple times depending on the number of visits post-intervention, which is not consistent. This will artificially affect both the prevalence and incidence of all comorbidities. Likewise, an outcome that may be noticed in the first post-intervention visit is subsequently re-noted in every other follow-up visit, artificially affecting this measure !
as well. If I could look at the data in wide form, then each comorbidity and subsequent outcome would only be counted once, instead of multiple times. Incidentally, you may be asking why I made the database this way. It is in fact a database that was coded and given to me by a national organization and I had no input on its coding style.
I appreciate your examples regarding the potential difficulty in managing the data in wide format and I admit, I am not proficient with the -by-command. However, I am not interested in how many follow-up visits a patient had. I am merely concerned with whether or not a certain outcome was reached and by how many people. I think the purest way (that I know of) to assess this is to arrange the data so that each person is counted only once.
However, if there are more efficient methods to accomplish these same goals in the current layout, I am more than happy to use them. With my current track record, I am skeptical that a wide reshaping is even possible in this data set.
Claude A. Beaty Jr., M.D.
Halsted Surgical Resident
Cardiac Surgery Research Fellow
The Johns Hopkins Hospital
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Sarah Edgington
Sent: Wednesday, June 13, 2012 12:51 PM
To: [email protected]
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Claude,
What is your actual motivation for wanting the data in wide form? Your previous messages seem to suggest that you've been able to successfully do the merge of the two datasets without reshaping. Is there any reason, then, to translate the data into wide form? It seems to me like it's going to be much harder to work with wide data, particularly if your number of visits per person is not constant.
When your data is in long form as it is now, doing calculations for individuals will require understanding how to use -by- effectively. Once you get that logic, though, I think you'll find many calculations fairly straightforward. However, if you make the data wide I predict that you'll find yourself often having to loop through variables for each follow-up visit and figure out how to deal with missing values in a way that's tedious and confusing.
Take as an example something as simple as merely counting how many follow-up visits each patient has. If each record is a single follow-up visit then in long form this is simply: - by trr_id_code: gen nvisits=_N - Trying to do the same thing in wide form requires somewhat more complicated maneuvering involving checking to see whether there are missing values for certain variables to determine whether a visit happened or not (without knowing more about the data I can't say exactly how you'd construct a number of visits variable in wide form, but probably it would involve counting non-missing dates or something similar). And it's just going to get more complicated from there.
Now you may be thinking to yourself that the nvisits variable I just suggested is useless because you can't, for instance, - sum nvisits - and get the mean number of visits per person because you have multiple records per person. However, this too is easily handled with long data. If you construct measures that are constant within observations (as the nvisits variable I just suggested would be) you can simply create a flag for the first record per person to use for generating descriptive statistics. For instance -by trr_id_code: gen indexrec=(_n==1) - would create an indicator equal to one for the first observation for a particular patient and zero for all others.
Of course what makes most sense depends entirely on what sort of analysis you're doing. However, even if you ultimately want to end up with only one observation per patient, it really does often make more sense to start in long form. I know it may seem counter-intuitive but it may be easiest to create your analytic variables that summarize individual experiences while the data is long and then -keep- only a single observation per person. I think this is particularly likely to be true if there's a lot of variation in visits per individual. If you're in wide form not only do you risk running out of room to create new variables, you also have to think carefully through how to construct measures differently for a patient who has 2 visits and one who has 10.
-Sarah
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Claude Beaty
Sent: Wednesday, June 13, 2012 8:33 AM
To: [email protected]
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Steve,
As suggested, I am including more information about my master dataset. My individual ID variable is " trr_id_code". My follow up visit ID variable is "trr_fol_id_code". Both variables are string. As previously mentioned, I have about 50,000 "trr_id_code" observations and over 350,000 "trr_fol_id_code" observations. Currently, the dataset is in the long form by the "trr_fol_id_code" variable. I would like this dataset to be in the long form by the "trr_id_code" variable instead (the wide form of the "trr_fol_id_code" variable), as I currently have another dataset which is organized in this way and would like to merge the two files. I am using the following code to accomplish this task:
sort trr_id_code
unab vlist:_all
reshape wide `vlist', i(trr_id_code) j(trr_fol_id_code) string
When this code is applied to the master dataset (approximately 70 variables in the variable list), I receive the error code "too many macros". I have attempted to -reshape- after merging by " trr_id_code" and paring down the database to approximately 13,000 "trr_id_code" observations and 30,000 "trr_fol_id_code" observations, but the increased number of variables in my second dataset (460) results in the same error message. Is my code incorrect, or have I reached the limit of Stata's capabilities by having so many variables and/or observations? Any thoughts would be appreciated.
Claude A. Beaty Jr., M.D.
Halsted Surgical Resident
Cardiac Surgery Research Fellow
The Johns Hopkins Hospital
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/