Re: st: how to bootstrap the difference of two sample means

From	Stas Kolenikov <[email protected]>
To	[email protected]
Subject	Re: st: how to bootstrap the difference of two sample means
Date	Tue, 13 Jul 2004 12:47:17 -0400 (EDT)

> What I want to do is as follows. I have a sample of size N and a
> matching sample of size N. Say the variable I care about is x. I want to
> test the significance of the difference of the mean and the difference of
> the median between the two samples. But because the distribution of x is
> skewed, the conventional t-test or z-test is not good here. I am
> following the literature, trying to use the bootstrap to do it.

Norman Johnson published a paper around 1978 on how skewness of the
underlying distribution affects t-tests. He came up with reasonably
simple corrections involving third moments that give acceptable test
sizes for samples as small as 12 and for distributions as skewed as the
exponential.

> Basically, the procedure is first to pool the two samples to get a size
> 2N sample.

If you suspect a difference in means/locations in your data, you might
want to center each sample (subtract its own mean) before pooling. Then a
natural question is: OK, you say your data are heterogeneous; are you sure
they differ only by a shift? What about the variance, etc.?
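
The centering step and a quick look at the variances could be sketched as
follows. This is only an illustration: it uses the -sysuse auto- data with
-price- as x and -foreign- as a stand-in for the indicator of the two
samples, and the names xbar_g and x_c are made up for the example.

```stata
sysuse auto, clear
* group mean of x within each sample (foreign is a stand-in indicator)
bysort foreign: egen double xbar_g = mean(price)
* centered variable: both samples now have mean zero before pooling
gen double x_c = price - xbar_g
* do the samples differ in spread as well as location?
sdtest price, by(foreign)
* Levene-type tests, less sensitive to non-normality than -sdtest-
robvar price, by(foreign)
```

If the variances clearly differ, pooling the centered data still mixes two
different distributions, so the pooled bootstrap should be read with care.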

> Then randomly draw with replacement a sample of size N, get the mean
> and median. Then draw another one. Then calculate the difference. And
> repeat the above drawing and calculation 1000 times. And I will get the
> distribution of the 1000 differences and can do the inference with them.
>
> My question is how to write the Stata program to let it draw 2 random
> samples in each bootstrap replication.

You can draw a single sample of size 2N, and then generate a dummy
variable for the two groups along the lines of

gen byte group = (_n > _N/2)

and then

ttest x, by(group)

Since -bootstrap- does not allow a bootstrap sample larger than the
original data set (for good statistical reasons; I would be just as
convinced, if not more so, by bootstrap samples of 10% of the original
data, provided that still leaves a few hundred observations), you would
either have to sacrifice sample size (draw a sample of 20% of your data
and then split it into two pieces) or, if you insist on the full data
set, write a special program to handle that.

The core of your program will be like this:

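program define MyBS, rclass
  * `1' is the first argument passed to the program: the variable to test
  tempvar group
  * split the current (bootstrap) sample into two halves by position
  g byte `group' = (_n > _N/2)
  ttest `1', by(`group')
  * pass everything -ttest- left in r() back to the caller
  return add
end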

(make sure you understand every step here)

with a later call

bootstrap "MyBS x" r(t), size(200) reps(1000)

(plus whatever other bootstrap options you need) if your original data
set has 1000 observations and you are happy with a 10%+10% subsample;
-x- is your variable of interest in the original data. If you want the
original size... let me see. Let's do it at a low level with the -post-
commands:

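tempname topost
tempvar group
tempfile group1
* mybs.dta will hold one t-statistic per replication
postfile `topost' t using mybs, every(10) replace
keep x
preserve
qui forvalues k=1/1000 {
   * first resample, labelled group 1
   restore, preserve
   bsample
   gen byte `group' = 1
   save `group1', replace
   * second, independent resample, labelled group 2
   restore, preserve
   bsample
   gen byte `group' = 2
   append using `group1'
   ttest x, by(`group')
   post `topost' ( r(t) )
   noi di "." _c
}
* close the results file before reading it back
postclose `topost'
use mybs, clear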
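
One way to see what -return add- and r(t) refer to is to run -ttest- once
by hand and list what it leaves behind in r(). A minimal sketch, again on
the -sysuse auto- data with -foreign- standing in as the grouping variable
(an assumption made only so that the example runs):

```stata
sysuse auto, clear
quietly ttest price, by(foreign)
* everything listed here is what -return add- copies,
* and r(t) is the one result posted to the results file
return list
display "t = " r(t) "   df = " r(df_t)
```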

Again, see if you understand every line. I checked it on the -sysuse auto-
data with the variable -price-: the actual test statistic of -3.4 did not
show up in any of the 160 bootstrap samples. The distribution does show
substantial kurtosis, though, so the normal approximation is not quite
appropriate.
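
To turn the bootstrap distribution into a p-value, one can count how often
the resampled statistic is at least as extreme as the observed one. The
sketch below is one possible end-to-end version under stated assumptions:
-foreign- stands in for the sample indicator (it is not necessarily the
split behind the -3.4 above), each sample is centered at its own mean
before pooling, the resamples keep the original group sizes, only 200
replications are run to keep it short, and the names mybs_demo, x, m,
t_obs and p are made up for the example.

```stata
sysuse auto, clear
set seed 20040713
* observed statistic and group sizes
quietly ttest price, by(foreign)
scalar t_obs = r(t)
local n1 = r(N_1)
local n2 = r(N_2)
* center each sample so that the pooled data satisfy the null
bysort foreign: egen double m = mean(price)
gen double x = price - m
keep x
tempname topost
tempvar group
tempfile group1
postfile `topost' t using mybs_demo, replace
preserve
quietly forvalues k = 1/200 {
    restore, preserve
    bsample `n1'
    gen byte `group' = 1
    save `group1', replace
    restore, preserve
    bsample `n2'
    gen byte `group' = 2
    append using `group1'
    ttest x, by(`group')
    post `topost' (r(t))
}
postclose `topost'
restore
use mybs_demo, clear
* share of replications at least as extreme as the observed statistic
quietly count if abs(t) >= abs(scalar(t_obs))
scalar p = (r(N) + 1)/(_N + 1)
display "two-sided bootstrap p-value: " scalar(p)
```

The +1 in numerator and denominator is a common convention that keeps the
estimated p-value away from exactly zero; dropping it gives the raw share.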

Keep in mind that a naive -bootstrap- of dependent data is not a very good
idea, and doing it properly is not straightforward. Dependent data here
include time series, clustered samples, and panel data. Also, don't
forget -set seed- if you are serious about your bootstrap procedure.
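
As a small illustration of both points, the sketch below resamples whole
clusters with -bsample, cluster()- and shows that resetting the seed
reproduces the same resample. The cluster identifier cl is invented for
the example (blocks of five consecutive cars in the auto data), so this
shows the mechanics only, not a recommendation for any particular design.

```stata
sysuse auto, clear
* hypothetical cluster id, for illustration only
gen int cl = ceil(_n/5)
preserve
set seed 12345
* draw whole clusters with replacement; newcl relabels the drawn clusters
bsample, cluster(cl) idcluster(newcl)
quietly summarize price
scalar m1 = r(mean)
restore, preserve
* same seed, so the same clusters are drawn again
set seed 12345
bsample, cluster(cl) idcluster(newcl)
quietly summarize price
scalar m2 = r(mean)
restore
scalar same = (m1 == m2)
display "identical resample with identical seed: " scalar(same)
```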

 ---                                    Stas Kolenikov
 --       Ph.D. student in Statistics at UNC-Chapel Hill
 - http://www.komkon.org/~tacik/  -- [email protected]

* This e-mail and all attachments to it are not intended to provide any
* reasonable point of view and was transmitted to you in error. It
* should be immediately deleted by all recipients unless they really
* enjoy communicating with the author :). Other restrictions apply.
