Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: RE: Unique Clients with Multiple IDs that need to be integrated
From
"Sarah Edgington" <[email protected]>
To
<[email protected]>
Subject
st: RE: Unique Clients with Multiple IDs that need to be integrated
Date
Thu, 11 Oct 2012 17:05:20 -0700
Emily,
There are a few ways of going about this, I think. Which one makes the most
sense depends a bit on what pieces of information you have. I'm not sure I
completely understand what form your data is in. So this is my best guess
about something that might work based on what I think you have.
Your example looks like it contains multiple records per person and that the
cid1 variable is the id that was assigned to the patient at the time of that
encounter not the unique client id number that you need to match them to
other data sources. It seems like some substantial effort has been made to
figure out which IDs go together, but that doesn't really help you figure
out which is right cid1 to use for all the encounters. Do you happen to
have a master list of data that contains the right cid and then the list of
alt ids? If you have a list of the final chosen client ids already
associated with their alt ids, there may be other strategies that will be
more efficient than what I'm going to suggest below.
Otherwise, it looks to me like what you have for each encounter is a
client's id at the time of that encounter and a list of IDs that client used
during other encounters. You know that one of those IDs is the correct
final id. But when looking at a single encounter you have no way of
determining which one is the correct one without some other source of
information. The first important thing is to be sure you know exactly how
to identify which is the real final id. (I think, from your description,
the rule is that the first cid1 to appear in your data is the real one,
right?).
The first thing I'd do is create an index variable so you know for sure you
can recreate the original sort order of the data:
-gen index=_n-
Next I'd create a new variable that contains the lowest of the ids from the
list of all the possible ids for that client. I'm assuming your id
variables are numeric. If they aren't this will require some more thought
to deal with strings, but logic is just that you want to have a rule that
allows you to identify a single constant id from the list for all the
encounters. Don't worry if that isn't the "right" id. We'll fix it in a
moment.
So now you have cid1, alt1-alt10 and newid for each encounter. Remember
newid is just the minimum of the non-missing id fields.
If you sort by newid you should group all the records for a given client
together, no matter what order the original id choices appeared in.
Next step is to identify which of the records for the client appeared first
in your data set. You'll use the index variable you created above. Once
you identify the first record for a given client (based on the original sort
order) you'll know that clients "real" id number.
-sort newid index-
-by newid: gen realid=cid1 if _n=1
Now you just need to copy that realid value to the rest of the records for
that client. (There are probably other more elegant ways to do this; this
is just my default)
-by newid: egen finalid=sum(realid)
The finalid variable should now contain the client id number that matches to
your other data source.
Hope that helps.
-Sarah
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Emily
Putnam-Hornstein
Sent: Thursday, October 11, 2012 3:20 PM
To: [email protected]
Subject: st: Unique Clients with Multiple IDs that need to be integrated
I am working with a large substance abuse treatment dataset with roughly
900,000 observations for approximately 200,000 unique clients.
In working with these data (these records were being linked to a second
dataset), it was discovered that a number of clients had been inadvertently
assigned a new id when returning for a follow-up treatment service. In other
words, the agency thought it was linking/working with dataset of unique
clients, only to discover that many records were clearly for the same
individual.
To complete the data linkages the decision was made to maintain one id/one
record for each client, but to create up to 10 new variables with the
alternative ids listed so that after the linkages were completed, data for
these other ids could be pulled to construct a longitudinal record of
service encounters for a given individual. They were attached to a client?s
record as depicted in the data below. So, for example, the first client
(cid1=1) was not found to have any duplicate ids, so fields altid1=cid1 and
altid2-altid10 are all empty (I have only shown alt1-altid3 for simplicity).
But for the second client (cid1=2) it was discovered that this client has
information under two other ids ?65958? and ?62781?
To be able to construct a longitudinal record of service encounters for each
client I need to recode the records currently appearing as cid1=65958 and
cid1=62781 to cid1=2 (and I need to do this for up to 10 ids for some
clients). Additionally, the correct client id is listed as an alternative id
for those records I want to recode. So for cid1=65958, altid1=2 and
altid2=62781 (I have also shown this below, under ?Later Data?).
Finally, I mentioned the data linkages earlier because it is important that
I maintain the first (as sorted) cid1 as that is the id that links to other
data source ? although it is not necessarily the id associated with a
client?s first admission.
I will not burden the listserv with the completely inelegant code I have
been working on for the last two days (consisting of attempts to reshape,
merge subsets of records, etc. and has gotten me nowhere). Nor will I
attempt to justify decisions that were made in creating this dataset that
was given to me?but would greatly appreciate any and all guidance! Thank you
and apologies in advance if I have explained this poorly. Emily
+-----------------------------------------------------------------------
+----
-----------------+
| CLIENT modality_type adm1 dis1 cid1
altid1 altid2 altid3 |
|-----------------------------------------------------------------------
|----
-----------------|
1. | 1 Admission 12 Dec 06 . 1
1 . . |
2. | 1 Discharge 12 Dec 06 31mar2007 1
1 . . |
|-----------------------------------------------------------------------
|----
-----------------|
3. | 2 Admission 08 Mar 07 . 2
2 65958 62781 |
4. | 2 Discharge 08 Mar 07 17mar2007 2
2 65958 62781 |
5. | 2 Admission 14 Jun 07 . 2
2 65958 62781 |
6. | 2 Discharge 14 Jun 07 30jun2007 2
2 65958 62781 |
|-----------------------------------------------------------------------
|----
-----------------|
7. | 3 Admission 14 Jul 06 . 3
123171 3 . |
8. | 3 Discharge 14 Jul 06 03aug2006 3
123171 3 . |
9. | 3 Admission 02 Oct 06 . 3
123171 3 . |
10. | 3 Discharge 02 Oct 06 22oct2006 3
123171 3 . |
11. | 3 Admission 30 Apr 07 . 3
123171 3 . |
12. | 3 Discharge 30 Apr 07 20may2007 3
123171 3 . |
|-----------------------------------------------------------------------
|----
-----------------|
13. | 4 Admission 07 Mar 07 . 11
11 15262202 . |
14. | 4 Discharge 07 Mar 07 27mar2007 11
11 15262202 . |
|-----------------------------------------------------------------------
|----
-----------------|
15. | 5 Admission 20 Dec 06 . 13
66234 13 172805 |
16. | 5 Discharge 20 Dec 06 07jan2007 13
66234 13 172805 |
17. | 5 Admission 01 Mar 07 . 13
66234 13 172805 |
18. | 5 Discharge 01 Mar 07 12mar2007 13
66234 13 172805 |
|-----------------------------------------------------------------------
|----
-----------------|
19. | 6 Admission 13 Feb 07 . 31
14997478 31 190690 |
20. | 6 Discharge 13 Feb 07 06mar2007 31
14997478 31 190690 |
21. | 6 Admission 22 Jun 10 . 31
14997478 31 190690 |
22. | 6 Discharge 22 Jun 10 23sep2010 31
14997478 31 190690 |
(LATER DATA)
+-----------------------------------------------------------------------
+----
----------+
| CLIENT mo~y_type adm1 dis1 cid1 altid1
altid2 altid3 |
|-----------------------------------------------------------------------
|----
----------|
67821. | 14922 Admission 19 Dec 06 . 65958 2
65958 62781 |
67822. | 14922 Discharge 19 Dec 06 30jan2007 65958 2
65958 62781 |
67823. | 14922 Admission 08 Feb 07 . 65958 2
65958 62781 |
67824. | 14922 Discharge 08 Feb 07 12mar2007 65958 2
65958 62781 |
+-----------------------------------------------------------------------
+----
----------+
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/