Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Trelle Sven" <strelle@ctu.unibe.ch> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | st: Splitting a dataset efficiently/run regression repeatedly in subsets |
Date | Mon, 15 Nov 2010 16:16:26 +0100 |
Dear all,

I have a large (simulated) dataset with 400,000 observations (50,000 simulations, each creating 8 observations). I need to perform a linear regression for each simulation separately. I noticed the following:

1) Keeping all observations in the dataset and looping through the simulations is very inefficient, i.e. it takes several hours to run, e.g.

* first example starts; run is an ID for simulation
gen regcoeff = .
forval s=1/50000 {
    regress x y if run==`s'
    replace regcoeff = _b[y] if _n==`s'
}
* first example ends

2) Preserving and restoring is even more time-consuming.

3) I thought of creating a loop as before, but loading the data at the beginning of each iteration and then keeping only the data for the particular simulation. However, this implies that the data is loaded 50,000 times (because it comes from a server with suboptimal connection speed, this is also not optimal), and it would also make storage of the results a little difficult.

* second example starts
gen regcoeff = .
save sim.dta, replace
local coeff = 0 // dummy for first run of loop
local p = 1 // dummy for first run of loop
forval s=1/50000 {
    use sim.dta, clear
    replace regcoeff = `coeff' if _n==`p'
    save sim.dta, replace
    keep if run==`s'
    regress x y
    local coeff = _b[y]
    local p=`s'
}
use sim.dta, clear
replace regcoeff = `coeff' if _n==`p'
save sim.dta, replace
* second example ends

I am sure there is a better way of doing this. If anybody has better ideas, I would appreciate any suggestions/help.

All the best
Sven
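For reference, here are two possible ways to avoid the explicit loop over 50,000 simulations. This is a sketch only, not a tested or timed solution. It assumes the variables run, x and y from the examples in the post. All other variable names (xbar, ybar, sxy, syy, sum_sxy, sum_syy, b_closed, b_y) are hypothetical and chosen for illustration. The first part relies on the fact that, with a single regressor, the OLS slope of x on y equals sum((x - xbar)*(y - ybar)) / sum((y - ybar)^2), so it can be computed per group without calling regress. The second part uses statsby, which still runs regress once per group but handles the subsetting and result collection itself.

```stata
* Sketch only: assumes variables run, x and y as in the post.
* All other variable names below are hypothetical.

* Part 1: closed-form slope of "regress x y" within each run,
* computed with by-group means and sums (data stay in memory).
bysort run: egen double xbar = mean(x)
bysort run: egen double ybar = mean(y)
generate double sxy = (x - xbar) * (y - ybar)
generate double syy = (y - ybar)^2
bysort run: egen double sum_sxy = total(sxy)
bysort run: egen double sum_syy = total(syy)
* The slope is repeated on every observation of a run.
generate double b_closed = sum_sxy / sum_syy

* Part 2: statsby runs the regression for each value of run and,
* because of the clear option, replaces the data in memory with
* one row per simulation holding run and the coefficient b_y.
statsby b_y=_b[y], by(run) clear: regress x y
```

Part 1 only yields the slope. If standard errors or other stored results are needed, the statsby route (or saving results to a separate file inside the loop) is likely the more convenient choice. Note that statsby with clear discards the original data in memory, so those should be saved beforehand if they are still needed.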