Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: fastsample: Sampling 100x faster than Stata's sample and bsample
From
Andrew Maurer <[email protected]>
To
"[email protected]" <[email protected]>
Subject
st: fastsample: Sampling 100x faster than Stata's sample and bsample
Date
Thu, 27 Mar 2014 15:09:19 +0000
Hi Statalist,
Results:
fastbsample: 0.2 seconds vs bsample: 55.6 seconds
fastsample: 1.6 seconds vs sample 58.9 seconds
(note: relative performance increases with population size)
This is partly a follow up to yesterday's thread "st: Sampling problem". Daniela was asking how to efficiently sample with replacement. I took a look at Stata's sample.ado and bsample.ado files and realized that the algorithms are painfully inefficient (at least in the simple case with no "by", no weights, etc). Both programs essentially sort on a newly generated runiform() variate and, I'm simplifying, keep the first N observations, for sample size N.
Sorting large datasets is a very intensive process. For bsample with sample size 3 on a population of 1million, the computer must 1) create 2 million random variates (see bsample.ado), 2) compound sort those million into ascending order, and 3) keep the subset of size 3. Why not just make a list of 3 random integers between 1 and 1 million and select those observations? Ie: obstokeep = ceil(10^6*runiform(3,1)) (using mata's runiform()).
Sampling without replacement is slightly harder. If we follow the above idea, it's possible that of sample size 3, we selected 2 of the same integer. Eg randomly sampled observation numbers = {14, 66, 66}. In this case, we can calculate the difference between the number of distinct values in the sampled set (2) and the requested sample size (3), and then iterate over the set {1..1mil} - {14,66,66} until we have the size of the sampled set equal to the requested sample size.
See below for the code. I'd be very interested to hear if anyone has any thoughts on its efficiency. One thing that I don't fully understand is why the line -R = selectindex(!I)- is so fast. Even with !I on the order of 10^7 or 10^8, selectindex(!I) is almost instant (assuming the required memory is already allocated to Stata). I don't understand why it's so much faster than a for loop, looping through the rows of !I and putting the indices into R.
************** Define mata functions **********************
mata
void fastbsample(real scalar n)
// faster alternative to stata's bsample.
// .2s vs 55.6s in one test
{
real scalar origN
real vector allnum, allstr
// check for errors
if ((n-ceil(n)!=0) | (n <= 0)) _error(9,"n must be a positive integer")
if (st_nobs() == 0) _error(9,"cannot sample empty dataset")
// declare objects
allnum = J(1,0,.)
allstr = J(1,0,.)
v = st_nvar()
// separate string and numeric variables
for (i=1;i<=v;i++) if (st_isnumvar(i)==1) allnum = allnum, i; else allstr = allstr, i;
st_view(Nvars=.,.,allnum)
st_sview(Svars=.,.,allstr)
origN = rows(Nvars)
// manually add extra obs if origN < n
if (origN < n) {
st_addobs(n - origN,1)
st_view(Nvars=.,.,allnum)
st_sview(Svars=.,.,allstr)
}
/*
Below: slightly less efficient to store vector obstokeep than direct
subscripting. We only have to if Nvars and Svars are both nonempty
*/
if (cols(allstr) == 0) Nvars[|1,.\n,.|] = Nvars[ceil(origN*runiform(n,1)),.]
else if (cols(allnum) == 0) Svars[|1,.\n,.|] = Svars[ceil(origN*runiform(n,1)),.]
else {
real vector obstokeep
obstokeep = ceil(origN*runiform(n,1))
Nvars[|1,.\n,.|] = Nvars[obstokeep,.]
Svars[|1,.\n,.|] = Svars[obstokeep,.]
}
st_keepobsin((1,n))
}
void fastsample(real scalar N)
// faster alternative to stata's sample.
// 1.6s vs 58.9s in one test
{
real scalar origN, L, n, i
real vector allnum, allstr, obstokeep
// check for errors
L = st_nobs()
if ((N-ceil(N)!=0) | (N <= 0)) _error(9,"N must be a positive integer")
if (L < N) _error(9,"cannot sample more observations than entire dataset")
// initialize index
I = J(L,1,0) // index of rows to keep
i = 0
// first try - there could be collisions (the same row being called twice)
obstokeep = ceil(L*runiform(N,1))
I[obstokeep] = J(N,1,1)
// iterate until we have a sample of N index values of L
while (n!=0) {
R = selectindex(!I) // remaining indices that may be chosen
l = length(R) // total subindices
n = N - (L-l) // remaining obs to get
obstokeep = R[ceil(l*runiform(n,1))]
I[obstokeep] = J(n,1,1)
i++
}
printf("total iterations: %f\n", i)
obstokeep = selectindex(I)
st_keepobsin(obstokeep)
}
end
************** End function definitions *******************
*********** Benchmark Stata's bsample vs fastbsample ********
local reps 5
tempfile temp
set obs 10000000
gen r=runiform()
save `temp', replace
forval i = 1/`reps' {
use `temp', clear
timer on 1
bsample 5000
//sample 5000, count
timer off 1
}
forval i = 1/`reps' {
use `temp', clear
timer on 2
mata: fastbsample(5000)
//mata: fastsample(5000)
timer off 2
}
timer list
*********** End benchmark ***********************************
Thank you,
Andrew Maurer
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/