--- On Tue, 26/1/10, raoul reulen wrote:
> I need to select up to 20 controls for each of 10,000
> subjects from a dataset of around half-a-million
> subjects. The controls need to satisfy certain criteria
> (e.g., same age). How can I do this without having to
> loop over observations?
What about this?
*-------------------- begin example ----------------------------
// prepare some toy data
// we want to find 2 controls per treated observation
// with the same patterns on x1 and x2 (if 2 such controls exist)
clear
set obs 1000
gen byte treat = _n <= 100
gen x1 = floor(runiform()*6)
gen x2 = floor(runiform()*6)
// find the number of treated observations per pattern in x1 and x2
bys x1 x2: gen long count = sum(treat)
by x1 x2: gen long Ntreat = count[_N]
drop count
// touse is an indicator variable marking the
// observations that are to be used
// we want to include all treated observations
gen byte touse = treat
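// For each treated observation we find a maximum
// of 2 controls with the same pattern in x.
// Ntreat is constant within each x1 x2 pattern, so among the
// controls of a pattern we keep the first 2*Ntreat observations
// (or all of them if there are fewer). The condition is written
// out as a plain expression so that it is evaluated separately
// within each by-group.
bys x1 x2 treat: replace touse = 1 if treat == 0 & _n <= 2*Ntreat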
*-------------------------- end example ----------------------------
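If it matters which controls get picked, a variant along the following lines might be worth a try. It is only a sketch on the same toy data: the controls are put in random order within each pattern before the first few are kept, and the last lines check the result. The names k, u, nsel and ncontrol, and the seed, are made up for this illustration. For the original question one would presumably set k to 20 and replace x1 and x2 by the real matching variables (e.g. age).
*-------------------- begin example ----------------------------
// prepare the same kind of toy data
clear
set seed 12345
set obs 1000
gen byte treat = _n <= 100
gen x1 = floor(runiform()*6)
gen x2 = floor(runiform()*6)
// k is the number of controls wanted per treated observation
local k = 2
// number of treated observations per pattern in x1 and x2
bys x1 x2: egen long Ntreat = total(treat)
// u puts the observations in random order within each
// combination of pattern and treatment status
gen double u = runiform()
// keep all treated observations and, per pattern, the first
// k*Ntreat controls in that random order
bys x1 x2 treat (u): gen byte touse = treat == 1 | _n <= `k'*Ntreat
// check: per pattern, the number of selected controls equals
// k*Ntreat, or all available controls if there are fewer
bys x1 x2: egen long nsel = total(touse & !treat)
by x1 x2: egen long ncontrol = total(!treat)
assert nsel == min(`k'*Ntreat, ncontrol)
*-------------------------- end example ----------------------------
Note that this gives up to k controls per treated observation within a pattern, it does not tie a specific control to a specific treated observation.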
( For more on how to use examples I sent to statalist see:
http://www.maartenbuis.nl/stata/exampleFAQ.html )
Hope this helps,
Maarten
--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany
http://www.maartenbuis.nl
--------------------------