Re: st: Matching samples in Stata
From: David Kantor <[email protected]>
To: [email protected]
Subject: Re: st: Matching samples in Stata
Date: Thu, 11 Oct 2012 16:53:01 -0400
Hi Paula,
At 01:40 PM 10/11/2012, you wrote:
> Hi David,
> I finally got round to matching my sample. I match the two samples
> on family education level and gender
>     mahapick ed_level_fam sex, idvar( "ID") genfile(D:\matched)
>     nummatches(4) full treated(course)
> where course is 1 for medicine and 0 for other - as in my analyses I
> want to compare medicine students vs. the others. I created a file
> 'matched' as I intend to import the relevant variables into it so
> that I can just run the analyses for this.
> Ideally I want to only keep the first match.
> However, when I check for duplicates using
>     duplicates list ID
> I find that many of the matched respondents are the same for
> different medicine students.
> Can you suggest what I am doing wrong and any way around this pls?
> [...]
You are doing nothing wrong. Mahapick did what it was designed to do;
it got the best 4 matches for each treated case -- with no regard for
what is matched to other treated cases (similar to sampling with
replacement). There is no guarantee that there will be uniqueness.
Incidentally, if you are checking for duplicates, you might want to try
duplicates ... if _matchnum==1
which will look at the best match for each treated case. That might
be a better measure, as without filtering for _matchnum==1, you are
comparing all matches; a given case might be, e.g., the first choice
for one treated case, and the second (third, fourth) choice for
another. (To clarify: _matchnum==1 gives the best choice for each
treated case; _matchnum==2 is the second best choice; etc.)
There are two approaches to proceeding.
The first may seem like no approach at all, but we have done it in
one of our studies using matched cases. Just use what you have, with
the duplicates. But try to measure the duplication and report it
along with your results.
For example, you might find that your matched sample is 85% unique.
And that may be good enough.
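
One way to put a number on the duplication is sketched below in Python. This is only an illustration, not part of mahapick: it assumes you have exported the best match (_matchnum==1) for each treated case into a mapping from treated ID to control ID, and it takes "percent unique" to mean distinct controls divided by the number of treated cases. The function name and the sample IDs are made up for the example.

```python
def uniqueness_rate(best_match):
    """Share of best matches that are distinct controls.

    best_match: dict mapping each treated ID to the ID of its closest
    control (hypothetical layout; one entry per treated case).
    """
    controls = list(best_match.values())
    return len(set(controls)) / len(controls)


# Made-up example: t1 and t2 share control c1.
best_match = {"t1": "c1", "t2": "c1", "t3": "c2", "t4": "c3"}
rate = uniqueness_rate(best_match)
print(f"{rate:.0%} of best matches are distinct controls")
```

Other definitions are possible (for instance, the share of treated cases whose match is shared with nobody), so whichever one you report, say which it is.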
The second is to do some kind of unique selection.
I did this once, and if I can find the code, I will send it to you.
The idea is:
  1. Randomly choose a treated case.
  2. Select its closest remaining match.
  3. Remove both the treated case and its match from the pool.
  4. Repeat on the remaining cases until all are matched.
The particular set you get will depend on a randomization of the
selection order. That is, with Stata's random number generator, it
will depend on the seed.
Note that this procedure will get one match per treated case. If you
want more, say 3, then you repeat the whole process again and again.
(It will help to have nummatches significantly larger than the number
of desired final matches per treated case. In your example, you want
1 match per treated case, so nummatches(4) is probably okay, but you
might as well make it a bit higher.)
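
The random-order selection described above can be sketched as follows. This is a Python illustration under stated assumptions, not the code I referred to: it assumes each treated case comes with a list of candidate controls ordered best-first (as _matchnum orders them), and the function name, argument names, and sample IDs are all hypothetical. A treated case whose candidates have all been taken is left unmatched (None), which is why a larger nummatches helps.

```python
import random


def greedy_unique_match(candidates, seed):
    """Assign each treated case its closest control not already used.

    candidates: dict mapping treated ID -> list of control IDs,
                ordered from best to worst match (hypothetical layout).
    seed:       seed for the random processing order.
    Returns a dict treated ID -> control ID, or None if every
    candidate for that treated case was already taken.
    """
    rng = random.Random(seed)
    order = sorted(candidates)  # fixed start, so the result depends only on the seed
    rng.shuffle(order)          # random order in which treated cases get to choose

    used = set()
    result = {}
    for treated in order:
        # Closest candidate still in the pool, if any.
        pick = next((c for c in candidates[treated] if c not in used), None)
        result[treated] = pick
        if pick is not None:
            used.add(pick)
    return result


# Made-up example with three treated cases and short candidate lists.
candidates = {
    "t1": ["c1", "c2"],
    "t2": ["c1", "c3"],
    "t3": ["c2", "c1"],
}
matched = greedy_unique_match(candidates, seed=12345)
print(matched)
```

Running it with several seeds, e.g. `[greedy_unique_match(candidates, seed=s) for s in (1, 2, 3)]`, gives several alternative matched sets to compare.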
This randomization process is a pragmatic way to go. But there may be
a more ideal goal, such as to minimize the total distance measures.
Doing that is very complicated for large numbers of cases; it's a
subject that's open for research, much like the travelling
salesperson problem. I can find some references and some R code that
purports to do it. One reference mentions the use of network flow
algorithms -- not that I know about that. But let me know if you want
those references.
Just off the top of my head, one possibility is to start by taking
all matches that are unique -- the ones that are not under contention
to be matched to different treated cases. This may be a large portion
of the cases. Then you only have to worry about the remaining smaller
set. (That is, unless it works out that reassigning a non-contended
match can result in a better overall result -- analogous to the case
where, in the travelling salesperson problem, close cities are not
visited in sequence.)
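
A rough Python sketch of that first step, separating the uncontended best matches from the contended ones, might look like this. It is an illustration only; the mapping of treated ID to best control and all names in it are assumptions for the example.

```python
from collections import defaultdict


def split_by_contention(best_match):
    """Separate best matches claimed by one treated case from shared ones.

    best_match: dict mapping treated ID -> ID of its closest control
                (hypothetical layout).
    Returns (settled, contended):
      settled   - dict treated ID -> control ID, for controls claimed once
      contended - dict control ID -> list of treated IDs competing for it
    """
    claimants = defaultdict(list)
    for treated, control in best_match.items():
        claimants[control].append(treated)

    settled = {ts[0]: c for c, ts in claimants.items() if len(ts) == 1}
    contended = {c: ts for c, ts in claimants.items() if len(ts) > 1}
    return settled, contended


# Made-up example: c1 is claimed by both t1 and t2.
best_match = {"t1": "c1", "t2": "c1", "t3": "c2", "t4": "c3"}
settled, contended = split_by_contention(best_match)
print(settled)
print(contended)
```

Only the treated cases listed under `contended` would then need the random or optimized selection, subject to the caveat above that a settled control might still be worth reassigning.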
On one occasion, we wanted an optimized unique matching. We tried a
process where we started with a given matching, and then iteratively
swapped matches so as to minimize the total distance measure -- done
outside of Stata. (Stata seemed awkward for the task; possibly Mata
would do fine, but it wasn't available at that time.) Though we got
an optimized set, the analytical results were no better than the
original set. That is, after a lot of trouble and expense, the result
was no better.
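
For what it is worth, the general idea of swapping can be sketched in a few lines of Python. This is not the program we used; it is a minimal pairwise-swap local search, assuming a hypothetical table `dist[treated][control]` of distance measures that covers every treated case and every control in the starting assignment. It stops when no single swap helps, which does not guarantee the global minimum.

```python
def total_distance(assign, dist):
    """Sum of distances over all treated-control pairs in the assignment."""
    return sum(dist[t][c] for t, c in assign.items())


def improve_by_swaps(assign, dist):
    """Swap controls between pairs of treated cases while that lowers the total.

    assign: dict treated ID -> control ID (a unique starting matching)
    dist:   dict treated ID -> dict control ID -> distance (hypothetical layout)
    Returns a new assignment; the input is not modified.
    """
    assign = dict(assign)
    treated = list(assign)
    improved = True
    while improved:  # every swap strictly lowers the total, so this terminates
        improved = False
        for i, a in enumerate(treated):
            for b in treated[i + 1:]:
                ca, cb = assign[a], assign[b]
                if dist[a][cb] + dist[b][ca] < dist[a][ca] + dist[b][cb]:
                    assign[a], assign[b] = cb, ca
                    improved = True
    return assign


# Made-up distances: the starting matching pairs t1 with a poor control.
dist = {
    "t1": {"c1": 1.0, "c2": 5.0},
    "t2": {"c1": 0.5, "c2": 0.6},
}
start = {"t1": "c2", "t2": "c1"}
better = improve_by_swaps(start, dist)
print(better, total_distance(better, dist))
```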
I hope this is useful. I will look for the random-selection code.
--David
P.S., With the randomization process, you can, say, do it three
times, and run your analysis on the three matches.