> What I want to do is as follows. I have a sample of size N and a
> matching sample of size N. Say the variable I care is x. I want to test
> the significance of the difference of the mean and the difference of
> the median between the two samples. But because the distribution of x is
> skewed, the conventional t-test or z-test is not good here. I am
> following literature, trying to use bootstrap to do it.
There was a paper by Norman Johnson in about 1978 on the effect of
skewness in the original distribution on t-tests. He came up with fairly
simple corrections involving the third moment that yield reasonable test
sizes at sample sizes as low as 12 and for a distribution as skewed as
the exponential.
> Basically, the procedure is first to pool the two samples to get a size
> 2N sample.
If you suspect a difference in means/locations in your data, you might
want to center each sample before pooling. Then a natural question is: OK,
you say your data are heterogeneous; are you sure they only differ in
shift? What about the variance? etc.
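Removing each sample's own mean before pooling, and checking whether the
spreads differ, could be sketched roughly as follows. The names -sample-
(a 0/1 indicator of which sample an observation came from), -xbar- and
-xc- are made up for illustration; only -x- comes from the question.

    * mean of x within each of the two samples
    bysort sample: egen double xbar = mean(x)
    * centered copy of x: both samples now have mean zero
    gen double xc = x - xbar
    * do the samples differ in spread as well as in location?
    sdtest x, by(sample)
    * Levene-type tests, less sensitive to skewness than -sdtest-
    robvar x, by(sample)

Resampling -xc- rather than -x- imposes the null of equal means on the
pooled data; it does nothing about unequal variances.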
> Then randomly draw with replacement a sample of size N, get the mean
> and median. Then draw another one. Then calculate the difference. And
> repeat the above drawing and calculation 1000 times. And I will get the
> distribution of the 1000 difference and can do the inference with them.
>
> My question is how to write the Stata program to let it draw 2 random
> samples in each bootstrap replication.
You can draw one sample of size 2N, and then generate a dummy variable
for the two groups in a manner like
gen byte group = (_n > _N/2)
and then
ttest x, by(group)
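-ttest- only covers the difference in means. For the median part of the
question, one possible sketch, assuming the same -group- split as above,
takes the difference of the medians that -summarize, detail- leaves in
r(p50). The scalar names -med0- and -meddiff- are made up for
illustration.

    * median of x in the first half of the resample
    quietly summarize x if group == 0, detail
    scalar med0 = r(p50)
    * median of x in the second half, and the difference of the two
    quietly summarize x if group == 1, detail
    scalar meddiff = r(p50) - med0
    display "difference in medians: " meddiff
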
Since -bootstrap- does not allow sample sizes greater than the original
data set (for good statistical reasons; I am just as convinced, if not
more so, by bootstrap samples of 10% of the original data, if that still
gives a few hundred observations), you would need either to sacrifice
your sample size (and draw a sample of 20% of your data, breaking it later
into two pieces), or, if you insist on the full data set, to write a
special program to handle that.
The core of your program will be like this:
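program define MyBS, rclass
    * `1' is the variable to test; the data in memory are one bootstrap draw
    tempvar group
    * first half of the draw is group 0, second half is group 1
    g byte `group' = (_n > _N/2)
    ttest `1', by(`group')
    * pass everything -ttest- left in r() back to the caller
    return add
end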
(make sure you understand every step here)
with a later call
bootstrap "MyBS x" , size(200) reps(1000) and other bootstrap options
if you had an original data set of 1000 observations and were ready to go
with a 10%+10% subsample, where -x- is your variable of interest in the
original data. (In Stata 9 and later the same call uses the prefix
syntax: -bootstrap t=r(t), size(200) reps(1000): MyBS x-.) If you want the
original size... let me see. Let's do it at a low level with the -post-
commands:
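tempname topost
tempvar group
tempfile group1
* results file mybs.dta will hold one t statistic per replication
postfile `topost' t using mybs, every(10) replace
keep x
preserve
qui forvalues k=1/1000 {
    * first resample: start from the original data, draw _N obs with replacement
    restore, preserve
    bsample
    gen byte `group' = 1
    save `group1', replace
    * second resample, drawn independently from the same original data
    restore, preserve
    bsample
    gen byte `group' = 2
    append using `group1'
    ttest x, by(`group')
    post `topost' ( r(t) )
    noi di "." _c
}
* close the results file before reading it, and drop the preserved copy
postclose `topost'
restore, not
use mybs, clear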
Again, see if you understand every line. I checked it on the -sysuse auto-
data with the variable -price-: the actual test statistic of -3.4 did not
show up in any of the 160 bootstrap samples. The distribution does show
substantial kurtosis though, so the normal approximation is not quite
appropriate.
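Once the replications are saved, a bootstrap p-value can be read off the
file directly instead of leaning on the normal approximation. A minimal
sketch, assuming the results are in mybs.dta with the statistic stored in
-t- as above; the scalar names -t_obs- and -B- are made up for
illustration, and -3.4 is the observed statistic from the check just
described.

    use mybs, clear
    * observed test statistic from the original data
    scalar t_obs = -3.4
    * number of usable replications
    quietly count if !missing(t)
    scalar B = r(N)
    * replications at least as extreme as the observed value
    quietly count if abs(t) >= abs(t_obs) & !missing(t)
    display "two-sided bootstrap p-value: " (r(N) + 1)/(B + 1)
    * skewness and kurtosis of the bootstrap distribution
    summarize t, detail

The +1 in the numerator and denominator is one common convention that
keeps the estimated p-value away from exactly zero.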
Keep in mind that -bootstrap- for dependent data is not a very good idea,
and is not very straightforward to implement properly. The dependent data
here include time series, clustered samples, and panel data. Also, don't
forget -set seed- if you are serious about your bootstrap procedure.
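For instance, a reproducible resample that respects clustering might be
sketched like this. The seed value and the names -id- (the cluster
identifier) and -newid- are placeholders, not anything from the original
data.

    * fix the random-number seed so the replications can be reproduced
    set seed 20031017
    * resample whole clusters rather than single observations;
    * newid relabels repeated clusters so they stay distinct
    bsample, cluster(id) idcluster(newid)
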
--- Stas Kolenikov
-- Ph.D. student in Statistics at UNC-Chapel Hill
- http://www.komkon.org/~tacik/ -- [email protected]