st: Idea for a faster bootstrap
I'm interested in ways to do resampling quickly. -bootstrap- can be
excruciatingly slow, especially when the data set is large. While I
much appreciate all the built-in features of -bootstrap-, I've
thought that there might be approaches or algorithms for a DIY
bootstrap that would be faster (presumably at the expense of not
being so general purpose, etc.). I didn't find anything on this in
the archives.
What I came up with is an implementation of an algorithm popularized
by a contributor to the SPSS list many years ago, in which a file of
Reps * Sample Size observations with random pointers into the
original data has the original data merged onto it. I'll take the
liberty of posting it below, since it is not much longer than
reasonable pseudocode for it would be. For a larger data set (auto,
expanded by 100 to 7400 observations), it took less than 4% as much
time as -bootstrap-. (This was on a simple problem, e.g. -summarize price-.)
Obviously, for some problems, the following algorithm would eat up
too much memory, but Stata seems to run it happily with, e.g., 10,000
samples of N = 100 on a data set of 100,000 observations.
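For anyone who wants a baseline to compare against, a minimal timing
sketch for -bootstrap- on the same expanded data might look like the
following. The seed, the number of replications, and the timer slot are
arbitrary placeholder choices, not the settings behind the figure quoted
above.

```stata
* Baseline timing for -bootstrap- on the expanded auto data (sketch only)
clear all
sysuse auto
expand 100                // 74 * 100 = 7400 observations
set seed 12345            // arbitrary seed, for repeatability only
timer clear 1
timer on 1
* reps() and size() are placeholder values; adjust to match your comparison
bootstrap mean = r(mean), reps(1000) size(50) nodots: summarize price
timer off 1
timer list 1              // elapsed seconds for timer 1
```

Wrapping the DIY code in the same -timer on- / -timer off- pair, with the
same number of replications and sample size, should give a like-for-like
comparison.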
I'm suspecting a free lunch here. Obviously, for -bootstrap- problems
for which the statistical calculation itself is slow, the overhead of
-bootstrap- won't matter, so improving -bootstrap- might be irrelevant.
Anyway, I'd appreciate any thoughts on the following as a possible
-bootstrap- alternative.
*Example data
sysuse auto
expand 100 // make it bigger for demonstration
*Algorithm starts
*---------------
local reps = 10000 // choose
local sampsize = 50 // choose
local popsize = _N
gen long ident = _n
sort ident
tempfile temp
save `temp', replace
clear
*
* Create a file to hold a resampled data set
local bigsize = `reps' * `sampsize'
set obs `bigsize'
gen long repnum = _n if _n <=`reps'
replace repnum = repnum[_n - `reps'] if _n > `reps'
* Create a pointer to the population for each resample element
gen long ident = 1 + int(`popsize' * uniform())
sort ident
merge ident using `temp', uniqusing
keep if _merge ==3
drop _merge ident
sort repnum
*
* One row per replicate, holding the mean of price in that resample
statsby mean = r(mean), by(repnum) clear : summarize price
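After -statsby- the data in memory hold one observation per replicate,
so the usual bootstrap summaries can be read off directly. A minimal
sketch, assuming the variable is named "mean" as above and that a simple
percentile interval is wanted (that choice is an assumption, not
something -bootstrap- would necessarily report by default):

```stata
* Assumes the -statsby- results are in memory:
* one observation per replicate, with the variable "mean"
summarize mean
display "bootstrap SE of the mean = " r(sd)

* Percentile interval from the 2.5th and 97.5th centiles of the replicates
_pctile mean, percentiles(2.5 97.5)
display "95% percentile interval: " r(r1) " to " r(r2)
```

Note that the standard deviation of the replicate means reflects
resamples of size `sampsize', not the full data set, just as it would
with the size() option of -bootstrap-.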
Regards,
=-=-=-=-=-=-=-=-=-=-=-=-=
Mike Lacy
Fort Collins CO USA
(970) 491-6721 office