Maybe you didn't see my follow-up email with specific code for selecting
groups with -in- rather than -if-, or maybe you are wondering why it works.
If you issue a command like:
regress y x if group==`i'
then Stata must evaluate the -if- condition for every observation in the
dataset to identify the sample for the command. Often, this type of statement
in a loop is followed by statements that copy the coefficients and/or
standard errors into variables, again using the same -if- condition. That
adds up to many passes through the entire dataset to select the same small
subset of observations. This is not much of a problem when you have just a
few or even a few dozen groups, but with 1000 or 100,000 groups you may be
making many thousands of passes through the dataset to evaluate the -if-
condition for each group. For example, if you have 1000 groups with 10 obs
each, then each -if- condition requires 10,000 evaluations, one per
observation. If your loop has just 3 -if- conditions, that's 30,000,000
evaluations to run your whole loop (3 * 10,000 * 1000).
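To make the pattern concrete, here is a minimal sketch of the kind of -if- loop described above. The simulated data and the names y, x, group, b_x and se_x are assumptions for illustration only, and rnormal() assumes a reasonably recent Stata. Each of the three -if- conditions inside the loop is checked against all 10,000 observations on every one of the 1000 iterations.

```stata
* Hypothetical data: 1000 groups of 10 observations each (names are assumed)
clear
set seed 12345
set obs 10000
gen long group = ceil(_n/10)
gen double x = rnormal()
gen double y = 1 + 2*x + rnormal()

* Variables to hold each group's slope and its standard error
gen double b_x  = .
gen double se_x = .

forvalues i = 1/1000 {
    * -if- condition 1: every observation is tested to find group `i'
    quietly regress y x if group == `i'
    * -if- conditions 2 and 3: the same full pass is repeated twice more
    quietly replace b_x  = _b[x]  if group == `i'
    quietly replace se_x = _se[x] if group == `i'
}
```

Nothing is wrong with the results here; the cost is only that the same small subset is located by a full pass over the data three times per group.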
In contrast, if you can identify each group using an -in- range, Stata can
work directly on the set of observations you want: -in- acts as a direct
pointer to the selected observations. (This requires the data to be sorted by
group, so that each group occupies a contiguous block of observations.) In
terms of speed, for my example with 1000 groups the -in- approach is
typically about 10x-15x faster. There is a little overhead in setting up the
-in- approach, but my prior email shows a fairly quick way to do it: generate
a variable that holds the count for each group, then use a -while- loop that
jumps from group to group by observation number.
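For anyone without the earlier message to hand, the following is a rough sketch of that idea, not necessarily the code from that email. The simulated data and the names y, x, group, n_in_group, b_x, se_x, first and last are assumptions for illustration, and rnormal() assumes a reasonably recent Stata.

```stata
* Hypothetical data: 1000 groups of 10 observations each (names are assumed)
clear
set seed 12345
set obs 10000
gen long group = ceil(_n/10)
gen double x = rnormal()
gen double y = 1 + 2*x + rnormal()

* Make each group a contiguous block and record its size
sort group
by group: gen long n_in_group = _N

gen double b_x  = .
gen double se_x = .

* Jump from block to block using observation numbers
local first = 1
while `first' <= _N {
    * last observation of the block that starts at `first'
    local last = `first' + n_in_group[`first'] - 1
    quietly regress y x in `first'/`last'
    quietly replace b_x  = _b[x]  in `first'/`last'
    quietly replace se_x = _se[x] in `first'/`last'
    local first = `last' + 1
}
```

The sketch assumes every group has enough observations for the regression to run; with very small groups you would probably want -capture- in front of -regress- and a check on the return code before storing anything.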
Michael Blasnik
----- Original Message -----
From: "Apostolos Ballas"
Sent: Saturday, June 26, 2004 1:58 PM
Subject: st: RE: RE: RE: Re: RE: Multiple commands under "By varlist"?
> It is probably that I am dim, but since I have a very similar problem (ie,
> many simulations which take hours) can someone please explain how the
> following example works.
>
> Thanks a lot for the help.
>
> Apostolos
>
> -----Original Message-----
> From: Nick Cox
> Sent: Saturday, June 26, 2004 5:26 PM
> Subject: st: RE: RE: Re: RE: Multiple commands under "By varlist"?
>
>
> In this I referred to Michael Blasnik.
> 14 seconds later he posted a similar point.
>
> Clearly this should be written up in supermarket
> trash newspapers as an Amazing Coincidence.
>
> Nick
>
> Nick Cox
> >
> > 2. The way -if- is implemented. The
> > command
> >
> > regress returns factor if `i' == month
> >
> > is implemented by testing every observation
> > to see whether it should be included in
> > the regression. In your case 99.9% of
> > the observations are irrelevant to each
> > regression, but Stata takes no special
> > action to avoid that. You should be
> > able to substitute -if- by -in-:
> >
> > gen long obsno = _n
> > sort month port
> > forval i = 1/1000 {
> > local min = ...
> > local max = ...
> > regress returns factor in `min'/`max'
> > ...
> > }
> >
> > and by Blasnik's Law this should be much faster.