Tim Sass <[email protected]> wrote,
> I may be crazy, but I am trying to estimate a fixed effects model with a
> handful of explanatory variables plus 3,100 explicit dummies on a two-year
> panel data set containing 1.7 million observations (about 850,000 fixed
> effects). I am using a Sun workstation with 8GB of RAM.
Even with 8GB, Tim ran out of memory:
------------------------------------------------------------------------------
. xi: areg nrtrgain nschools chgschl
> t2001 tgrde_04 tgrde_06 tgrde_07 tgrde_08 tgrde_09 tgrde_10
> rpeat_04 rpeat_05 rpeat_06 rpeat_07 rpeat_08 rpeat_09 rpeat_10
> i.instid,
> absorb(student) robust;
i.instid _Iinstid_1-36994 (naturally coded; _Iinstid_1 omitted)
no room to add more variables due to width
------------------------------------------------------------------------------
Tim concludes,
> I still have about 1.3 GB of memory free [...] though I guess that
> is not enough to do the matrix inversions necessary to compute the
> regression estimates. Is there a way to figure out the memory that would
> be required to solve this problem? I have some money budgeted to buy a
> 64-bit machine with 16GB RAM in the future, though I'm not sure even such a
> machine could do this.
Tim guessed that Stata ran out of memory doing the matrix inversion, but from
the error message (no room to add more variables *DUE TO WIDTH*), I know that
is not true. Stata ran out of memory while running areg.ado and before it
ever got to running the regression. In the part of the code that -areg- was
running, it was forming new copies of the lhs and rhs variables, copies with
their within-group means removed.
In his model, Tim has approximately 3,118 variables. -areg- generates each of
the new variables as a double (!), so the memory requirement is
3118*8*(1.7*10^6) = 42 gigabytes. Therein lies the problem. Even if -areg-
were modified to use floats rather than doubles, it would still need 21
gigabytes just to store the data from which the regression coefficients could be
calculated.
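Checking that arithmetic in Stata -- number of variables, times 8 bytes per
double, times number of observations --

. display 3118*8*1.7e6/1e9
42.4048

which is the 42 gigabytes above.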
However, all is not lost. I think we can estimate the model, and do it within
Tim's 8GB of RAM.
Step 1: Understanding areg
---------------------------
Before we start with Tim's problem, let's understand how we could do -areg-
(or equivalently, -xtreg, fe-) by hand. Let us consider the model
. areg y x1 x2, absorb(group)                                    (1)
We can obtain the coefficients and standard errors by typing:

. drop if y==. | x1==. | x2==.
. sort group
. by group: gen double mu_y = sum(y)/_N
. by group: replace mu_y = mu_y[_N]
. gen double ydev = y - mu_y
. by group: gen double mu_x1 = sum(x1)/_N
. by group: replace mu_x1 = mu_x1[_N]
. gen double x1dev = x1 - mu_x1
. by group: gen double mu_x2 = sum(x2)/_N
. by group: replace mu_x2 = mu_x2[_N]
. gen double x2dev = x2 - mu_x2
. regress ydev x1dev x2dev, nocons                               (2)

(sum() produces a running sum, so within each group it reaches the group
total only at the last observation; the replace copies that last value,
which after division by _N is the group mean, to every observation in the
group.)
Compare the output of (1) with (2) and you will find that the coefficients are
the same but the standard errors are different. That is because we must make a
degrees-of-freedom adjustment to the standard errors of (2) to account for the
fact that we estimated all those within-group means. That is easy.
In an example I ran, I had 50 groups and 2 observations on each, so I typed

. display _se[x1dev]*sqrt((100-2)/(100-52))

where 100 is the total number of observations, 2 counts the regressors x1 and
x2, and 52 counts x1, x2, and the 50 within-group means.
I recommend Tim try this example.
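If you want the example to be completely self-contained, something along these
lines will do. The data and names here are my own invention:

. clear
. set obs 100
. set seed 12345
. gen group = int((_n+1)/2)
. gen x1 = invnorm(uniform())
. gen x2 = invnorm(uniform())
. gen y = x1 + 2*x2 + mod(group,7)/10 + invnorm(uniform())
. areg y x1 x2, absorb(group)

That gives 50 groups of 2 observations each and fits (1). Now run the
mean-differencing commands and regression (2) from above and verify that the
coefficients match and that

. display _se[x1dev]*sqrt((100-2)/(100-52))
. display _se[x2dev]*sqrt((100-2)/(100-52))

reproduce the standard errors -areg- reported.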
Step 2: Using a different encoding for a variable
--------------------------------------------------
Repeat the above, but this time, rather than typing

. by group: gen double mu_x1 = sum(x1)/_N
. by group: replace mu_x1 = mu_x1[_N]
. gen double x1dev = x1 - mu_x1

type

. by group: gen double mu_x1 = sum(x1)/_N
. by group: replace mu_x1 = mu_x1[_N]
. gen double x1dev = (x1 - mu_x1)*2        <- multiply by 2
Do that for x1dev only. The result will be that you will estimate the
same coefficient and standard error for x2dev. The coefficient for the
multiplied-by-2 x1dev will be half of that estimated previously, as will
its standard error.
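You can also see the rescaling effect without recomputing any means.
Continuing my made-up example from step 1:

. gen double x1dev2 = 2*x1dev
. regress ydev x1dev2 x2dev, nocons

The coefficient and standard error on x1dev2 are exactly half of those on
x1dev in the earlier regression, while the x2dev results are unchanged.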
Step 3: Saving memory for 3,100 dummies
-----------------------------------------
Now let's do the above problem, but add 3,100 dummies. In addition
to x1 and x2, we will assume we have d1, d2, ..., d3100. Using the
approach of step 1, we will type statements like
. by group: gen double mu_d1 = sum(d1)/_N
. by group: replace mu_d1 = mu_d1[_N]
. gen double d1dev = d1 - mu_d1
and another 3,099 like that. Along the way, we will run out of memory.
Think about the values that can appear in d1dev. Because Tim has only two
observations per group (a two-year panel), the values of d1 within a group are
(0,0), (1,1), (1,0), or (0,1). The group mean is thus 0, 1, .5, or .5, and
d1dev is -.5, 0, or .5.
Interestingly, 2*d1dev takes on the values -1, 0, or 1. All of those are
integers, and we can hold them in bytes! Probably, d1 was already a byte
variable, so let's just type
. by group: gen double mu_d1 = sum(d1)/_N
. by group: replace mu_d1 = mu_d1[_N]
. replace d1 = 2*(d1 - mu_d1)
. drop mu_d1
And we can get our estimates!
Remember, Tim, once you get results, you will need to multiply the dummy
coefficients and their standard errors by 2. I recommend you do all of this in
a do-file and do the mean-differencing of the dummies in a loop; a sketch
follows. With a do-file, you can also run on a subsample of the dataset -- one
on which -areg- would work -- and verify that your code is correct.
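Here is what the dummy loop in that do-file might look like. It is only a
sketch: I am assuming the dummies really are named d1, ..., d3100.

sort group
forvalues i = 1/3100 {
    * at the last obs of each group, mu equals the group mean of d`i'
    by group: gen double mu = sum(d`i')/_N
    * spread that mean to every obs in the group
    by group: replace mu = mu[_N]
    * values become -1, 0, or 1, so d`i' can remain a byte
    replace d`i' = 2*(d`i' - mu)
    drop mu
}

Only one temporary double exists at any moment, so the loop costs the memory
of a single extra variable rather than 3,100 of them. After also
mean-differencing y and the handful of x's as in step 1, the estimation is
then a -regress ..., nocons- on the deviated variables.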
-- Bill                                  -- Vince
[email protected]                          [email protected]