st: Splitting a dataset efficiently/run regression repeatedly in subsets
From: "Trelle Sven" <[email protected]>
To: <[email protected]>
Subject: st: Splitting a dataset efficiently/run regression repeatedly in subsets
Date: Mon, 15 Nov 2010 16:16:26 +0100
Dear all,
I have a large simulated dataset with 400,000 observations (50,000
simulations, each producing 8 observations). I need to run a separate
linear regression for each simulation. I have noticed the following:
1) Keeping all observations in memory and looping over the simulations
is very inefficient: it takes several hours to run. For example:
* first example starts; run is an ID for simulation
gen regcoeff = .
forval s=1/50000 {
	regress x y if run==`s'
	replace regcoeff = _b[y] if _n==`s'
}
* first example ends
2) Using preserve and restore inside the loop is even more time-consuming.
3) I thought of using a loop as before, but loading the data at the
start of each iteration and then keeping only the observations for the
current simulation. However, this means the data are loaded 50,000 times
(the file comes from a server with a slow connection, so this is not
ideal either), and it also makes storing the results somewhat awkward:
* second example starts
gen regcoeff = .
save sim.dta, replace
local coeff = 0 // dummy for first run of loop
local p = 1 // dummy for first run of loop
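forval s=1/50000 {
	use sim.dta, clear // reload the full dataset
	replace regcoeff = `coeff' if _n==`p' // store the previous coefficient in row p
	save sim.dta, replace // write the updated results back to disk
	keep if run==`s' // keep only the current simulation
	regress x y
	local coeff = _b[y] // remember the slope for the next iteration
	local p=`s'
}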
use sim.dta, clear
replace regcoeff = `coeff' if _n==`p'
save sim.dta, replace
* second example ends
I am sure there is a better way of doing this. 
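One direction that might be worth trying, given here only as a rough
sketch: because each regression has a single regressor, the slope from
-regress x y- equals the within-run covariance of x and y divided by the
within-run variance of y, so it could be computed from by-group sums
without calling -regress- 50,000 times. The names mean_x, mean_y, dxy,
dyy, sxy, syy and slope are hypothetical; x, y and run are as in the
examples above. The sketch assumes no run has constant y.

```stata
* sketch starts; assumes variables x, y and run are in memory
* within-run means of the outcome (x) and the regressor (y)
bysort run: egen double mean_x = mean(x)
bysort run: egen double mean_y = mean(y)

* cross-products of the deviations from the run means
gen double dxy = (x - mean_x) * (y - mean_y)
gen double dyy = (y - mean_y)^2

* within-run sums of the cross-products
bysort run: egen double sxy = total(dxy)
bysort run: egen double syy = total(dyy)

* OLS slope of x on y, repeated on every observation of the run
gen double slope = sxy / syy
* sketch ends
```

Unlike the loops above, this stores the slope on every observation of a
run rather than in row number run. If standard errors or other results
are needed, the built-in -statsby- prefix (for example
-statsby _b, by(run) clear: regress x y-) may be an alternative, though I
have not timed it on data of this size.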
If anybody has a better idea, I would appreciate any suggestions or
help.
All the best
Sven