This implements a rather pedestrian look-up
technique. Proofs of incorrectness are solicited.
*! NJC 1.0.0 3 Oct 2005
* sampleproptosize #, size(size_variable) generate(in_sample)
program sampleproptosize, sort
    version 8
    * first token is the requested sample size
    gettoken n 0 : 0, parse(" ,")
    confirm integer num `n'
    if `n' <= 0 {
        di as err "`n' must be positive"
        exit 198
    }
    syntax [if] [in] , size(varname) Generate(str)
    marksample touse
    markout `touse' `size'
    su `size' if `touse', meanonly
    if r(min) < 0 {
        di as err "negative values in `size'"
        exit 411
    }
    if r(N) < `n' {
        di as err ///
        "sample `n' requested, but only " r(N) " observations"
        exit 198
    }
    quietly {
        tempvar target id
        tempname rnd
        * negate so that observations to be used sort first
        replace `touse' = -`touse'
        * cumulative share of total size, using r(sum) from su above
        bysort `touse': gen `target' = sum(`size' / r(sum))
        local g "`generate'"
        gen byte `g' = 0
        gen long `id' = _n
        count if `g'
        * draw until n distinct observations have been selected
        while r(N) < `n' {
            scalar `rnd' = uniform()
            su `id' if `touse' ///
            & inrange(`rnd',`target'[_n-1],`target') ///
            & !`g', meanonly
            if r(max) < . replace `g' = 1 in `r(max)'
            count if `g'
        }
    }
end
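A minimal usage sketch, assuming the program above has been saved as sampleproptosize.ado somewhere on the ado-path. The auto dataset, the choice of weight as the size variable, the sample size of 10 and the new variable name picked are illustrative assumptions only, not part of the original problem.

```stata
* illustrative data shipped with Stata; weight stands in for company size
sysuse auto, clear
set seed 2005

* ask for exactly 10 observations, chosen with weight as the size measure
sampleproptosize 10, size(weight) generate(picked)

* picked is 1 for selected observations and 0 otherwise
count if picked
tabulate picked
```

Because the loop only stops once the requested number of distinct observations has been flagged, the count should equal the number requested. Note that the loop would not terminate if fewer than that many observations had positive size.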
Nick
Willard van Ooij wrote:
> Hm, interesting!
>
> This is indeed not the kind of selection I wanted. I went for your
> suggestion, Nick, which was (for a sample size of 100):
>
> (1) Calculate for each company the probability of inclusion. This is
> (sample size) * (size of company / total of company sizes). So
> assuming a sample size of 100:
>
> . sum size
> . gen prob = 100 * ( size / r(sum) )
>
> (2) Then select the sample based on these probabilities
>
> . gen u = uniform()
>
> . gen insamp = u < prob
>
> Since the sample size didn't have to be that precise, but had to be
> substantially lower than 100, it sufficed for me to tweak the number
> 100 a little until I had about the right sample size. I remain
> interested in a solution which leads to a precise sample size.
>
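The quoted steps include each company independently with its own probability, so the realized sample size is random and only its expectation is fixed: it equals the sum of the inclusion probabilities. A minimal sketch of that point, assuming the auto dataset shipped with Stata, with weight standing in for company size and a hypothetical target of 10. The cap at 1 is an addition that is not in the quoted commands; it does not change who is selected, but it keeps prob interpretable as a probability.

```stata
* illustrative data; weight stands in for company size
sysuse auto, clear
set seed 2005
local n 10

* inclusion probability proportional to size, capped at 1
summarize weight, meanonly
generate double prob = min(1, `n' * weight / r(sum))

* independent draw for each observation
generate double u = uniform()
generate byte insamp = u < prob

* expected sample size is the sum of the inclusion probabilities
summarize prob, meanonly
display "expected sample size = " r(sum)

* realized sample size varies from run to run
count if insamp
display "realized sample size = " r(N)
```

When no probability hits the cap, the expected size equals the target, but any single run can land above or below it, which is why tweaking the multiplier was needed to get close to a particular size.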