st: Splitting a dataset efficiently/run regression repeatedly in subsets

From	"Trelle Sven" <[email protected]>
To	<[email protected]>
Subject	st: Splitting a dataset efficiently/run regression repeatedly in subsets
Date	Mon, 15 Nov 2010 16:16:26 +0100

Dear all,
I have a large simulated dataset with 400,000 observations (50,000
simulations, each producing 8 observations). I need to run a linear
regression separately for each simulation. I have noticed the
following:

1) Keeping all observations in the dataset and looping over the
simulations is very inefficient; it takes several hours to run. For
example:
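* first example starts; run is the simulation ID
gen regcoeff = .
forval s=1/50000 {
	* the if condition is checked against all 400,000 observations each time
	regress x y if run==`s'
	* store the slope of simulation s in observation s
	replace regcoeff = _b[y] if _n==`s'
}
* first example ends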
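
A possible variation on the first example, as a minimal sketch only: if
the data are sorted by run and every simulation has exactly 8
observations, each regression can address its rows with an in range, so
Stata does not have to check an if condition against the whole dataset
on every pass. The toy data below (100 runs, the seed, a true slope of
2) are assumptions for illustration and are not the original
simulation.

```stata
* Toy data for illustration only (assumption: 100 runs of 8 observations)
clear
set seed 12345
set obs 800
gen run = ceil(_n / 8)
gen double y = rnormal()
gen double x = 2 * y + rnormal()

* Assumption: sorted by run, exactly 8 observations per run, run = 1, 2, ...
sort run
gen double regcoeff = .
forvalues s = 1/100 {
	local first = (`s' - 1) * 8 + 1
	local last  = `s' * 8
	* fit the regression on the 8 rows belonging to simulation s
	quietly regress x y in `first'/`last'
	* store the slope of simulation s in observation s, as in the loop above
	quietly replace regcoeff = _b[y] in `s'
}
```

With the real data the loop bound would be 50000. The quietly prefix
also suppresses 50,000 regression tables, which are costly to display.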

2) Preserving and restoring the data inside the loop is even more
time-consuming.

3) I thought of using a loop as before, but loading the data at the
beginning of each iteration and then keeping only the observations for
that simulation. However, this means the data are loaded 50,000 times
(which is also not ideal, because the file comes from a server with a
slow connection), and it makes storing the results somewhat awkward:
* second example starts
gen regcoeff = .
save sim.dta, replace
local coeff = 0 // dummy for first run of loop
local p = 1 // dummy for first run of loop
forval s=1/50000 {
	use sim.dta, clear
	* write the coefficient from the previous iteration to the full file
	replace regcoeff = `coeff' if _n==`p'
	save sim.dta, replace
	* keep only the current simulation and fit the regression
	keep if run==`s'
	regress x y
	local coeff = _b[y]
	local p=`s'
}
* store the coefficient from the last iteration
use sim.dta, clear
replace regcoeff = `coeff' if _n==`p'
save sim.dta, replace
* second example ends
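
Because each regression here has a single regressor plus a constant,
the slope could also be obtained without calling regress at all, as the
within-run sum of cross-products divided by the within-run sum of
squares of the regressor. The following is a sketch under that
assumption; the helper variable names (mx, my, sxy, syy) are
hypothetical and the toy data are made up for illustration.

```stata
* Toy data for illustration only (assumption: 100 runs of 8 observations)
clear
set seed 2010
set obs 800
gen run = ceil(_n / 8)
gen double y = rnormal()
gen double x = 2 * y + rnormal()

* Within-run means of the outcome x and the regressor y
bysort run: egen double mx = mean(x)
bysort run: egen double my = mean(y)

* Within-run sums of cross-products and of squared deviations of y
bysort run: egen double sxy = total((x - mx) * (y - my))
bysort run: egen double syy = total((y - my)^2)

* OLS slope of x on y for each run, repeated on all 8 rows of that run
gen double regcoeff = sxy / syy
```

Unlike the loops above, this leaves the coefficient on every row of its
run rather than in observation s, and it applies only to the
one-regressor case. A spot check such as regress x y if run==1 followed
by display _b[y], regcoeff[1] can be used to compare the two results.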

I am sure there is a better way of doing this. If anybody has a better
idea, I would appreciate any suggestions.

All the best
Sven

