My understanding is that the cluster option makes -bootstrap- resample
whole clusters, not observations within clusters. With only 2 clusters,
each replication draws 2 clusters with replacement, so there are only 3
possible samples: each cluster once (should happen half the time), all of
cluster 1 twice (a quarter of the time), or all of cluster 2 twice (also a
quarter of the time). The 95th percentile can therefore take on just 3
values, one for each of these samples. Your results appear to confirm this
interpretation: 899.03 (52.2%) matches the pooled sample, whose 95th
percentile should be near 900, while 95.48 (22.4%) and 950.77 (25.4%) match
group 1 and group 2 each drawn twice.
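
As a rough check of this reasoning outside Stata, here is a minimal Python
sketch. It assumes the cluster bootstrap draws as many clusters as the data
contain, with replacement, and keeps every observation of each drawn
cluster. The names p95 and cluster_bootstrap_p95 are made up for
illustration, and the nearest-rank percentile rule is a simplification, not
necessarily the rule -summarize, detail- uses.

```python
import random
from collections import Counter


def p95(values):
    """95th percentile by a simple nearest-rank rule (an assumption;
    Stata's own percentile definition may differ slightly)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def cluster_bootstrap_p95(clusters, reps, rng):
    """Resample whole clusters with replacement and tally the p95 of
    each bootstrap sample, keyed by which clusters were drawn."""
    labels = sorted(clusters)
    tally = Counter()
    for _ in range(reps):
        # Draw as many clusters as exist; the same one may repeat.
        drawn = [rng.choice(labels) for _ in labels]
        # Each drawn cluster contributes all of its observations.
        sample = [x for g in drawn for x in clusters[g]]
        tally[(tuple(sorted(drawn)), p95(sample))] += 1
    return tally


rng = random.Random(2004)  # arbitrary seed, for reproducibility only
clusters = {
    1: [rng.uniform(0, 100) for _ in range(1000)],   # group 1: U(0,100)
    2: [rng.uniform(0, 1000) for _ in range(1000)],  # group 2: U(0,1000)
}

results = cluster_bootstrap_p95(clusters, reps=500, rng=rng)
for (drawn, value), count in sorted(results.items()):
    print(drawn, round(value, 2), count)
```

Only three rows should print, for the draws (1, 1), (1, 2) and (2, 2), with
counts near a quarter, a half and a quarter of the 500 replications. To get
a continuous distribution for each group, the observations themselves would
have to be resampled within one group at a time.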
Michael Blasnik
----- Original Message -----
From: <[email protected]>
To: <[email protected]>
Sent: Wednesday, July 14, 2004 10:14 AM
Subject: st: bootstrap command -- cluster and strata options
>
> Dear Statalisters:
>
> I am trying to understand what the "cluster" and "strata" options do on
> -bootstrap-. I may be misinterpreting the manual with respect to what
> these options do, because when I gin up a dataset for which I think I
> know what the result should be, the Stata answer doesn't seem to be what
> I expected.
>
> Basically, I set up a data set which is drawn from two distributions --
> 1000 observations from a uniform distribution from 0 to 100 and 1000
> observations from a uniform distribution from 0 to 1000. "Score" is the
> value, group is a "1" or "2" indicating whether it was drawn from the
> U(0,100) or U(0,1000) distribution, and id is a unique identifier.
> The final data set description and summary are as follows:
>
<snip>
> I am interested in sampling by "group" so tried both the -cluster- and
> -strata- options (only the cluster option shown below -- but both
> produce results I did not expect). Specifically, I would like Stata,
> when it samples, to repeatedly sample from only group 1 or group 2
> (i.e., not mix a group 1 value with a group 2 value). I am interested
> in the 95th percentile values that result from the exercise. I would
> expect the -saving(bsout)- output from this command to contain a value
> close to 95 half of the time and close to 950 the remainder of the
> time. This would be true if Stata were consistently sampling from the
> U(0,100) half of the time and the U(0,1000) the remaining half. I used
> the following command (output follows) :
>
>
> . bootstrap "summarize score, detail" r(p95), reps(500) saving(bsout)
> cluster(group) replace
>
> command: summarize score , detail
> statistic: _bs_1 = r(p95)
>
<snip>
>
> . tabulate _bs_1
>
>      r(p95) |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>       95.48 |        112       22.40       22.40
>      899.03 |        261       52.20       74.60
>      950.77 |        127       25.40      100.00
> ------------+-----------------------------------
>       Total |        500      100.00
>
>
> Again, not at all what I expected (it's discrete and tri-valued and I
> thought it would be continuous). I thought the appropriate command
> would (for the expected continuous distribution) be -histogram _bs_1-
> and I would have seen a bimodal distribution centered on 95 and 950.
> What I would like to see is a distribution which results from either
> repeated sampling from group 1 (ca. half the time) OR repeated sampling
> from group 2 (the remainder of the time). My reading and understanding
> of the -cluster- and -strata- options under -bootstrap- must be
> faulty. Can anyone let me know what I am missing here? Or what I might
> do to obtain what I am looking for?
>
> I am sure that the problem lies with my (mis)understanding, but for
> reference I am using Stata 8.2.
>
> David Miller
> Health Effects Division
> Office of Pesticide Programs