Dear Statalisters:
I am trying to understand what the "cluster" and "strata" options do in
-bootstrap-. I may be misinterpreting the manual with respect to what
these options do, because when I gin up a dataset for which I think I know
what the result should be, the Stata answer is not what I
expected.
Basically, I set up a data set drawn from two distributions --
1000 observations from a uniform distribution from 0 to 100 and 1000
observations from a uniform distribution from 0 to 1000. "Score" is the
value, group is a "1" or "2" indicating whether it was drawn from the
U(0,100) or U(0,1000) distribution, and id is a unique identifier.
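For anyone who wants to reproduce something similar, a data set of this shape could be built roughly as follows. This is only a sketch: the seed is arbitrary, so it will not reproduce my exact values, and it does not attempt to round the scores the way my listed values appear to be rounded.

```stata
* Sketch only: the seed is an arbitrary choice, not the one behind my data
clear
set obs 2000
set seed 12345

generate id = _n                              // unique identifier
generate byte group = cond(_n <= 1000, 1, 2)  // 1 = U(0,100), 2 = U(0,1000)

* uniform() returns a draw on [0,1); scale it by the group's upper bound
generate score = cond(group == 1, 100*uniform(), 1000*uniform())

sort group
```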
The final data set description and summary are as follows:
Contains data from D:\scorestrata.dta
  obs:         2,000
 vars:             3                          14 Jul 2004 07:17
 size:        26,000 (97.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g
group           byte   %9.0g
score           float  %9.0g
-------------------------------------------------------------------------------
Sorted by:  group
     Note:  dataset has changed since last saved
. summarize score, detail

                            score
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1.52            .01
 5%        8.695            .08
10%        19.22             .1       Obs                2000
25%        46.35            .12       Sum of Wgt.        2000

50%       91.035                      Mean           273.2597
                        Largest       Std. Dev.      302.6721
75%       476.29         997.14
90%       806.22         997.52       Variance        91610.4
95%       899.03         999.48       Skewness       1.018213
99%       976.93          999.9       Kurtosis       2.590768
I am interested in sampling by "group", so I tried both the -cluster- and
-strata- options (only the -cluster- option is shown below, but both
produce results I did not expect). Specifically, I would like Stata,
in each replication, to sample repeatedly from only group 1 or only
group 2 (i.e., not mix a group 1 value with a group 2 value). I am
interested in the 95th percentile values that result from the exercise.
I would expect the -saving(bsout)- output from this command to contain a
value close to 95 half of the time and close to 950 the remainder of the
time. This would be true if Stata were sampling only from U(0,100) in
half of the replications and only from U(0,1000) in the other half. I
used the following command (output follows):
. bootstrap "summarize score, detail" r(p95), reps(500) saving(bsout) cluster(group) replace
command: summarize score , detail
statistic: _bs_1 = r(p95)
Warning:  Since summarize is not an estimation command or does not set e(sample),
          bootstrap has no way to determine which observations are used in calculating
          the statistics and so assumes that all observations are used.  This means no
          observations will be excluded from the resampling due to missing values or
          other reasons.

          If the assumption is not true, press Break, save the data, and drop the
          observations that are to be excluded.  Be sure the dataset in memory contains
          only the relevant data.
Bootstrap statistics                              Number of obs    =      2000
                                                  N of clusters    =         2
                                                  Replications     =       500

------------------------------------------------------------------------------
Variable     |  Reps  Observed      Bias  Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   500    899.03 -166.8533  343.0897   224.9517   1573.108   (N)
             |                                          95.48     950.77   (P)
             |                                         899.03     950.77  (BC)
------------------------------------------------------------------------------
Note:  N   = normal
       P   = percentile
       BC  = bias-corrected
I then took a look at the file "bsout", which contains the 95th
percentile value from each of the 500 replications. Here are the first 10
values:
. list in 1/10, sep(0)

     +--------+
     |  _bs_1 |
     |--------|
  1. |  95.48 |
  2. | 899.03 |
  3. | 899.03 |
  4. | 899.03 |
  5. |  95.48 |
  6. |  95.48 |
  7. | 950.77 |
  8. | 950.77 |
  9. | 950.77 |
 10. | 950.77 |
     +--------+
Two things: (1) I see the values close to 95 and 950 which I expected,
but I also see 899.03, which I don't expect if Stata is consistently
drawing from either the U(0,100) or the U(0,1000) distribution for any
given replication; and (2) the same exact values for the 95th percentile
recur from one replication to the next, whereas I would expect them to
vary somewhat.
Here is a summary of the bsout file (tabulated):
. tabulate _bs_1

     r(p95) |      Freq.     Percent        Cum.
------------+-----------------------------------
      95.48 |        112       22.40       22.40
     899.03 |        261       52.20       74.60
     950.77 |        127       25.40      100.00
------------+-----------------------------------
      Total |        500      100.00
Again, this is not at all what I expected: the result is discrete and
takes only three values, whereas I thought it would be continuous. I
thought the appropriate command for the expected continuous distribution
would be -histogram _bs_1-, and that I would see a bimodal distribution
centered on 95 and 950. What I would like to see is the distribution that
results from either repeated sampling from group 1 (ca. half the time) OR
repeated sampling from group 2 (the remainder of the time). My reading
and understanding of the -cluster- and -strata- options under -bootstrap-
must be faulty. Can anyone let me know what I am missing here? Or what
I might do to obtain what I am looking for?
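To make the scheme I have in mind concrete (pick one group at random for each replication, resample only within that group, and record the 95th percentile), it could be sketched by hand roughly as follows. This only illustrates the intended procedure and says nothing about what -bootstrap- does with -cluster()- or -strata()-. The seed, the results file name bsout_manual, and the variable name p95 are made up for the example, and it assumes the data set described above is in memory.

```stata
* Sketch of the intended scheme; file and variable names are hypothetical
set seed 2004
tempname results
postfile `results' p95 using bsout_manual, replace

forvalues i = 1/500 {
    preserve
    * choose group 1 or group 2 with equal probability
    local g = 1 + (uniform() > 0.5)
    quietly keep if group == `g'
    * resample the remaining 1000 observations with replacement
    bsample
    quietly summarize score, detail
    post `results' (r(p95))
    restore
}

postclose `results'
```

Afterwards, -use bsout_manual, clear- followed by -histogram p95- would show the resulting distribution (note that this replaces the data in memory).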
I am sure that the problem lies with my (mis)understanding, but in case
it matters, I am using Stata 8.2:
. about
Intercooled Stata 8.2 for Windows
Born 1 July 2004
Copyright (C) 1985-2004
David Miller
Health Effects Division
Office of Pesticide Programs
visit: http://www.epa.gov/pesticides/