Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: Splitting a dataset efficiently/run regression repeatedly in subsets
From: "Trelle Sven" <[email protected]>
To: <[email protected]>
Subject: RE: st: Splitting a dataset efficiently/run regression repeatedly in subsets
Date: Mon, 15 Nov 2010 17:28:33 +0100
> Maarten Buis
> Sent: Monday, November 15, 2010 4:39 PM
> > I have a large (simulated) dataset with 400,000 observations (from
> > overall 50,000 simulations each creating
> > 8 observations). I need to perform a linear regression for each
> > simulation separately. I noticed the following:
> >
> > 1) keeping all observations in the dataset and looping through the
> > simulations is very inefficient i.e. it takes several hours to run
> > e.g.
> > * first example starts; run is an ID for simulation
> > gen regcoeff = .
> > forval s=1/50000 {
> > regress x y if run==`s'
> > replace regcoeff = _b[y] if _n==`s'
> > }
> > * first example ends
>
> An -in- condition is often quicker than an -if- condition.
> You need to do more work to make sure that the -in- condition
> is appropriate, but that is the price to pay.
I will try this. Thanks.
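For reference, a minimal sketch of what the -in- version of the loop could look like. It assumes that run is numbered 1 to 50000 with no gaps and that every simulation has exactly 8 observations, so that after sorting, simulation s occupies rows 8*(s-1)+1 through 8*s. If either assumption fails, the row ranges will be wrong.

```stata
* Sketch only: assumes run = 1..50000 and exactly 8 observations per run
sort run
gen regcoeff = .
forvalues s = 1/50000 {
    * rows belonging to simulation `s' after sorting by run
    local first = 8*(`s' - 1) + 1
    local last  = 8*`s'
    quietly regress x y in `first'/`last'
    * store the slope in row `s', as in the original loop
    quietly replace regcoeff = _b[y] in `s'
}
```

The -quietly- prefix only suppresses output for each of the 50,000 regressions; it does not change the results.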
> Anyhow, before doing all this I would start with -statsby-,
> see: -help statsby-.
As always, I wasn't 100% precise ...
The statsby command is actually much quicker (and thanks for the advice). However, I also need to predict after each regression and apparently this is not possible with statsby.
Consequently, I will try
1) using an -in- condition instead of -if-;
2) using -statsby- and computing the predictions by hand from the saved coefficients.
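A rough sketch of option 2, assuming the same variables as in the loop above (x regressed on y, with run as the simulation ID). The file name coefs and the variable names b_y, b_cons and xhat are made up for illustration, and -merge m:1- needs Stata 11 or later.

```stata
* Sketch only: coefs, b_y, b_cons and xhat are hypothetical names
* one regression per simulation; slope and intercept are saved to coefs.dta
statsby b_y=_b[y] b_cons=_b[_cons], by(run) saving(coefs, replace): regress x y

* attach each simulation's coefficients to its 8 observations
merge m:1 run using coefs, nogenerate

* linear prediction by hand, the analogue of -predict, xb- after -regress-
gen xhat = b_cons + b_y*y
```

Residuals would then follow as x - xhat. Quantities that need the variance-covariance matrix, such as the standard error of the prediction, are not recoverable from the two saved coefficients alone.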
Sven