Title | Sampling clusters, not individuals
Authors | Nicholas J. Cox, Durham University, UK; Scott Merryman, Risk Management Agency/USDA
Often you need to sample clusters, not individuals. Suppose you have a dataset with individual people from several households, but you wish to sample households randomly, not individuals. Here are two ways to do so. Much of what we do here is also feasible through sample2 (Weesie 1997).
In each selection, clusters are chosen using random numbers produced by runiform(). If you want your results to be reproducible, you will need to set the random-number seed (see set seed) before generating the random numbers.
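As a minimal sketch of why the seed matters, the lines below draw two sets of random numbers from the same seed and check that they agree. The seed value, the five-observation dataset, and the variable names u1 and u2 are hypothetical, chosen only for illustration.

```stata
clear
set obs 5                        // tiny hypothetical dataset
set seed 20240517                // hypothetical seed; record whichever value you use
generate double u1 = runiform()
set seed 20240517                // resetting the seed restarts the same sequence
generate double u2 = runiform()
assert u1 == u2                  // identical draws, so the selection can be replicated
```

Setting the seed once, near the top of a do-file, is usually enough.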
One way to accomplish our goal would be to keep one observation from each household, randomly sample from the remaining observations, and then merge back to the original dataset.
For example, assume a household identifier hhid:
sort hhid
preserve
tempfile tmp
bysort hhid: keep if _n == 1
sample 10
sort hhid
save `tmp'
restore
merge m:1 hhid using `tmp'
keep if _merge == 3
drop _merge
We preserve the dataset before keeping just one observation from each household, and then use sample to select approximately 10% of the households. We save this sample to a temporary file. We then restore the original dataset and merge it with the saved dataset. The observations that we want are those with _merge == 3. The explicit sort commands are a precaution: older merge syntax required both datasets to be sorted on the key variable, whereas merge m:1 sorts the data itself as needed.
If you want a sample with an exact number of households rather than a particular percentage, use sample with the count option; for example, sample 10, count keeps exactly 10 households.
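The first method with an exact household count might look like the sketch below. The toy dataset (100 households of three members each) and the seed are hypothetical, and the keep(match) and nogenerate options of merge are used here as a shorthand for keeping _merge == 3 and dropping _merge.

```stata
clear
set seed 2718                        // hypothetical seed
set obs 100                          // 100 hypothetical households
generate long hhid = _n
expand 3                             // three members per household
sort hhid

preserve
tempfile tmp
bysort hhid: keep if _n == 1         // one observation per household
sample 10, count                     // exactly 10 households
keep hhid                            // only the identifier is needed for the merge
save `tmp'
restore
merge m:1 hhid using `tmp', keep(match) nogenerate
```

With this toy dataset, the result holds the 30 observations belonging to the 10 selected households.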
We now show another solution, all in place with no pas de deux of dancing files and without using sample. Knowing how to do it using basic principles may appeal to you.
First, keep a copy of the sort order:
gen long order = _n
Then select one observation from each household:
egen select = tag(hhid)
Now produce some random numbers and sort:
gen rnd = runiform()
sort select rnd
One observation per household has now been sorted to the end, and those observations have been shuffled on the fly, courtesy of the random numbers. Suppose you want 10 of 100 households:
replace select = _n > (_N - 10)
The indicator select is now 1 for the last 10 observations and 0 otherwise. Now we spread the word of being selected among the household members:
bysort hhid (select): replace select = select[_N]
Finally, go back to the original sort order, and clean up:
sort order
drop order rnd
This variation keeps both the selected sample, for which select == 1, and the other observations, for which select == 0. If you want only the sampled observations, then type drop if !select or, equivalently, keep if select.
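If the number of households is not known in advance, the same steps can be adapted to select a fraction of households by counting the tagged observations first. The following is a sketch under stated assumptions: the toy dataset, the seed, the 10% fraction, and the local macro k are all hypothetical.

```stata
clear
set seed 314159                      // hypothetical seed
set obs 100                          // 100 hypothetical households
generate long hhid = _n
expand 2 + mod(hhid, 3)              // 2 to 4 members per household

generate long order = _n             // keep a copy of the sort order
egen select = tag(hhid)              // tag one observation per household
count if select                      // r(N) is the number of households
local k = round(0.10 * r(N))         // number of households to select
generate double rnd = runiform()
sort select rnd                      // tagged observations go last, shuffled
replace select = _n > (_N - `k')     // mark the last k observations
bysort hhid (select): replace select = select[_N]
sort order
drop order rnd
```

Afterward, select is 1 for every member of a sampled household and 0 for everyone else.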
In addition to the usual online help or manual entries, see FAQ: "How can I take random samples from an existing dataset?" for a discussion of sampling individuals.