[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Construct Null Datasets through Bootstrap Resampling

From	Erik Ingelsson <Erik.Ingelsson@pubcare.uu.se>
To	statalist@hsphsun2.harvard.edu
Subject	Re: st: Construct Null Datasets through Bootstrap Resampling
Date	Thu, 30 Nov 2006 23:37:24 -0500

Dear Austin,

Well, actually Michael Blasniks suggestion was a bit closer to what I first intended. But when looking at your suggestion and reading some literate on multiple testing in genetic analyses, I realized that I actually perhaps should not scramble all variables, but rather keep them together in two groups (e.g. keeping var 1-4 togeter as in the orignal observation, but randomly pairing them with var 5-8). For this, your solution worked perfectly. The next step is to meet with our senior statistician and show him what I got. He can tell which way to go, I hope.

Thanks a lot to both of you! I am impressed by the fast answers!
Erik Ingelsson

Quoting Austin Nichols <austinnichols@gmail.com>:

Erik Ingelsson --
I'm not sure I understand what you are trying to do, so I will propose
a solution to a question that may be similar to yours:  how do I take
independent bootstrap samples of two variables that should be treated
as unpaired data.  As a concrete example, take the auto data, and
regress price on mpg:
sysuse auto, clear
reg pr mpg

Now pick price and mpg randomly:
sysuse auto, clear
keep pr
bsample
tempname p
save `p'
sysuse auto
keep mpg
bsample
merge using `p'
reg p mpg

Not surprisingly, the randomly selected obs on price and mpg no longer
appear so correlated.  If you want to do this 10000 times, wrap the
calls to -bsample- in a program, call -simulate-, and then use -bstat-
to calculate bootstrap statistics.

cap drop pr eachone
pr eachone, rclass
drop _all
sysuse auto
keep pr
bsample
tempname p
save `p'
sysuse auto
keep mpg
bsample
merge using `p'
qui reg p mpg
return scalar b=_b[mpg]
end
sysuse auto, clear
qui reg pr mpg
local n=e(N)
local est=_b[mpg]
simulate b=r(b), reps(10) seed(12345): eachone
bstat, stat(est) n(`n')

HTH--I might have left out some code, or totally misinterpreted the
question (I know next to nothing of genotypes and phenotypes).

On 11/30/06, Erik Ingelsson <Erik.Ingelsson@pubcare.uu.se> wrote:

Dear Statalist users,

I am trying to construct null data sets through bootstrap resampling,
to be able to account for multiple testing in genetic analyses. I
would like to sample my genotypes and phenotypes randomly with
replacement (without keeping linked them together as the original
observations in my dataset), and then run regressions on these samples
to evaluate a distribution of minimum probability values. Thereby, I
will obtain empirical p-values by comparing the nominal p-values with
the distribution of probability from the null data sets. I have seen
this implemented in SAS, but I hope that it could be done also in STATA.

In my ?trail-and-error? approach, I have come so far, that I have
learned how to use bootstrap sampling to get a new dataset with
p-values from a set of regressions. However, these simulations are
still using my original observations (although creating new samples),
while I would like the observations to be randomly created from the
available variables (not keeping them together as in the original
dataset). Below is the code for what I have done so far. Variables
linj001-linj004 are the genotypes, thus the important independent
variables; the other variables are covariates to adjust for; stset and
all definitions, etc are done above. In reality, I will have much more
regressions to include in the simulation, but this is just for
learning how to do it.

---
*Program with the commands to be run in all bootstrap samples*
capture program drop myboot
program myboot, rclass
    stcox linj001 whoht70 adadiab70 ami70 vit70 z972 z290 z085 zekg_lvh
    return scalar p1 = 2*(1-normal(abs(b[linj001]/_se[linj001])))
    stcox linj002 whoht70 adadiab70 ami70 vit70 z972 z290 z085 zekg_lvh
    return scalar p2 = 2*(1-normal(abs(b[linj002]/_se[linj002])))
    stcox linj003 whoht70 adadiab70 ami70 vit70 z972 z290 z085 zekg_lvh
    return scalar p3 = 2*(1-normal(abs(b[linj003]/_se[linj003])))
    stcox linj004 whoht70 adadiab70 ami70 vit70 z972 z290 z085 zekg_lvh
    return scalar p4 = 2*(1-normal(abs(b[linj004]/_se[linj004])))
end

*Run the program in the original sample*
myboot
ret list

*Bootstrapping in 10000 samples*
bootstrap "myboot" p1=r(p1) p2=r(p2) p3=r(p3) p4=r(p4), reps(10000)
saving(C:\bootstrapsample) replace
---

This leaves me with a dataset (C:\bootstrapsample) which consists of
the p-values from the 4 regressions derived from 10000 simulations.
However, this is not exactly what I need, since the variables are
still ?connected? in the original observations (and then randomly
chosen for my simulated sets). I would like to get simulations with
all variables scrambled, so that new observations with all variables
scrambled are created in a number of bootstrap simulations, and then
used for regressions. The present macro can give me 10000 simulated
p-values for the regressions, based on samples with replacement, but
these simulations are reusing the actual 2000 observations from the
original dataset. Now I would like to create a ?null dataset?, in
which I instead of sampling from the observations in the real dataset,
I would like Stata to randomly ?make up? observations from the
existing variables and values, so that I have 2000 fake observations
(with random selection of all variables) to base the regressions on,
in 10000 simulations.

I have read the help files, manual, Statalist, searched at Internet,
even asked Technical Support (which helped me to come this far, but
not the last part). I am using Stata 8.2 for Windows. Is there a way
to do this? Did I explain what I want to do properly? Is there anyone
who can help me with this?

Thanks a lot in advance,
Erik Ingelsson

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


---
Erik Ingelsson, MD, PhD

Current affiliation (until June 30, 2007):
Framingham Heart Study
73 Mt. Wayte Avenue, Suite 2
Framingham, MA 01702-5827
Phone: 508-935-3453
Fax: 508-626-1262
Cell: 508-202-8493

Permanent affiliation:
Uppsala University, Department of Public Health and Caring Sciences, Uppsala
Science Park, SE-751 85 Uppsala, SWEDEN.
Fax: +46-18-611 79 76
E-mail: erik.ingelsson@pubcare.uu.se
---


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: Construct Null Datasets through Bootstrap Resampling
  - From: "Michael Blasnik" <michael.blasnik@verizon.net>

References:
- st: Construct Null Datasets through Bootstrap Resampling
  - From: Erik Ingelsson <Erik.Ingelsson@pubcare.uu.se>
- Re: st: Construct Null Datasets through Bootstrap Resampling
  - From: "Austin Nichols" <austinnichols@gmail.com>

Prev by Date: Re: st: Re: Construct Null Datasets through Bootstrap Resampling
Next by Date: Re: st: Construct Null Datasets through Bootstrap Resampling
Previous by thread: Re: st: Construct Null Datasets through Bootstrap Resampling
Next by thread: Re: st: Construct Null Datasets through Bootstrap Resampling
Index(es):
- Date
- Thread