Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Splitting a dataset efficiently/run regression repeatedly in subsets
From: Sergiy Radyakin <[email protected]>
To: [email protected]
Subject: Re: st: Splitting a dataset efficiently/run regression repeatedly in subsets
Date: Mon, 15 Nov 2010 10:53:43 -0500
Dear Sven,
50,000 regressions on an 8-observation dataset of two variables should
take about 30 seconds (see below).
So don't generate the large dataset; instead, run the regressions
right away as you generate your simulated data.
You don't need to save the 50,000 x 8 observations you generated:
since [presumably] you also simulate them with Stata, they will be the
same the next time you run your do-file
(don't forget to set the random-number seed).
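For instance, a minimal sketch of that approach, collecting one coefficient per replication with -postfile- (the data-generating process, the seed, and the file name coeffs are placeholders, not Sven's actual setup):

* sketch starts
set seed 12345
tempname results
postfile `results' sim b using coeffs, replace
forvalues s = 1/50000 {
    drop _all
    set obs 8
    gen x = rnormal()              // placeholder simulated data
    gen y = 2*x + rnormal()
    quietly regress y x
    post `results' (`s') (_b[x])   // keep only the coefficient
}
postclose `results'
use coeffs, clear                  // 50,000 obs: sim, b
* sketch ends

This way only the 50,000 coefficients ever touch the disk, not the 400,000 simulated observations.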
On the other hand, since you need only one coefficient from this
trivial regression, you may ask yourself whether the full -regress-
artillery is really necessary here, or whether a trivial formula, such as the one here:
http://en.wikipedia.org/wiki/Regression_analysis
would suffice (and be faster).
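To illustrate with the auto data used in the log below: the slope of price on weight is just corr(price,weight)*sd(price)/sd(weight), which -correlate- and -summarize- deliver without any of -regress-'s overhead (a sketch, not a drop-in replacement for Sven's variables):

* sketch starts
quietly correlate price weight
local rho = r(rho)
quietly summarize price
local sdy = r(sd)
quietly summarize weight
local sdx = r(sd)
display "slope = " `rho'*`sdy'/`sdx'   // same as _b[weight] from -regress price weight-
* sketch ends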
In any case, don't forget to specify -quietly-. I am almost sure you
have no intention of reviewing the output of 50,000 regressions, and
suppressing it speeds up the program a lot.
Best,
Sergiy Radyakin.
PS: I am strongly convinced you don't need more than 1 GB of memory
for the task of running univariate regressions on 8-observation
datasets.
. do "R:\TEMP\STD04000000.tmp"
. set rmsg on
r; t=0.00 10:42:16
. sysuse auto, clear
(1978 Automobile Data)
r; t=0.00 10:42:16
. keep in 1/8
(66 observations deleted)
r; t=0.00 10:42:16
.
. forvalues i=1/50000 {
2. qui regress price weight
3. }
r; t=26.53 10:42:42
.
end of do-file
r; t=26.53 10:42:42
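PPS: If you do keep the pooled 400,000-observation dataset, -statsby- will collect one coefficient per group in a single command (assuming a group variable run and variables x and y as in your example below; it still runs one regression per group internally, so it is convenient rather than dramatically faster):

* sketch starts
statsby b=_b[y], by(run) clear: regress x y
* leaves one observation per simulation: run and b
* sketch ends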
On Mon, Nov 15, 2010 at 10:16 AM, Trelle Sven <[email protected]> wrote:
> Dear all,
> I have a large (simulated) dataset with 400,000 observations (from
> overall 50,000 simulations each creating 8 observations). I need to
> perform a linear regression for each simulation separately. I noticed
> the following:
>
> 1) keeping all observations in the dataset and looping through the
> simulations is very inefficient i.e. it takes several hours to run e.g.
> * first example starts; run is an ID for simulation
> gen regcoeff = .
> forval s=1/50000 {
> regress x y if run==`s'
> replace regcoeff = _b[y] if _n==`s'
> }
> * first example ends
>
> 2) preserving and restoring is even more time-consuming
>
> 3) I thought of creating a loop as before but load the data at the
> beginning and then keeping only the data for the particular simulation.
> However, it implies that the data is loaded 50,000 times (because it
> comes from a server with suboptimal connection speed, this is also not
> optimal), and it would make storage of the results a little bit
> difficult
> * second example starts
> gen regcoeff = .
> save sim.dta, replace
> local coeff = 0 // dummy for first run of loop
> local p = 1 // dummy for first run of loop
> forval s=1/50000 {
> use sim.dta, clear
> replace regcoeff = `coeff' if _n==`p'
> save sim.dta, replace
> keep if run==`s'
> regress x y
> local coeff = _b[y]
> local p=`s'
> }
> use sim.dta, clear
> replace regcoeff = `coeff' if _n==`p'
> save sim.dta, replace
> * second example ends
>
> I am sure there is a better way of doing this.
> If there is anybody who has better ideas I would appreciate any
> suggestions/help.
>
> All the best
> Sven
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>