Dear Stata Users,
Suppose you have K = 5,000 regressors and N = 10 million observations. This
might come from a linked employer-employee data set in which you explicitly
include all firm dummies (and algebraically sweep out the person effects
through the within transformation) in order to compute person and firm
effects.
As I understand it, Stata/SE can use up to 11,000 variables. But in this
case the data matrix would take about 50 GB (5,000 x 10 million bytes), even
assuming that each regressor can be stored as a 1-byte variable.
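
For reference, the storage arithmetic can be sketched as follows. This is a
minimal Python illustration, not Stata code. The function name and the
4-byte case are assumptions added for comparison: within-transformed dummies
are no longer 0/1, so they may not fit into a 1-byte type.

```python
def matrix_size_gb(n_obs, n_regressors, bytes_per_value):
    """Size of a dense n_obs x n_regressors data matrix in GB (10**9 bytes)."""
    return n_obs * n_regressors * bytes_per_value / 1e9

# 1-byte storage, as assumed in the post
size_byte = matrix_size_gb(10_000_000, 5_000, 1)
# Assumption for comparison: demeaned dummies stored as 4-byte floats
size_float = matrix_size_gb(10_000_000, 5_000, 4)

print(size_byte, size_float)
```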
What do you think of the following solutions, which will necessarily require
reading part of the data from the hard disk rather than from RAM:
1) Store the data in several files, two of which can be loaded into memory
at a time. Compute the blocks of X'X and X'y by successively loading every
pairwise combination of the files into memory and multiplying the data
matrices. Store the results and use them to assemble X'y and X'X; the latter
is then a 5,000 x 5,000 matrix that fits into memory and can be inverted by
Stata. As I might with some luck have 16 GB of RAM available, this would
mean about 7 sub-datasets of about 7 GB each (i.e. about 700 regressors and
10 million observations per sub-dataset).
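
The block scheme in 1) can be sketched on a toy example. This is only an
illustration in plain Python, not Stata or Mata code. All function names are
hypothetical, the column blocks are kept in memory as a stand-in for separate
files, and the tiny matrix X is made-up demo data.

```python
from itertools import combinations

def transpose_times(a, b):
    """Return A'B for two row-lists a and b with the same number of rows."""
    ka, kb = len(a[0]), len(b[0])
    out = [[0.0] * kb for _ in range(ka)]
    for row_a, row_b in zip(a, b):
        for i, va in enumerate(row_a):
            if va != 0:  # skip zeros; raw dummies are mostly zero
                for j, vb in enumerate(row_b):
                    out[i][j] += va * vb
    return out

def split_columns(x, width):
    """Cut X into column blocks of at most `width` columns each."""
    k = len(x[0])
    return [[row[s:s + width] for row in x] for s in range(0, k, width)]

def assemble_cross_products(blocks, y):
    """Build X'X and X'y while multiplying at most two blocks at a time."""
    widths = [len(b[0]) for b in blocks]
    offsets = [sum(widths[:i]) for i in range(len(blocks))]
    k = sum(widths)
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    y_col = [[v] for v in y]

    def place(i, j, prod):
        # Copy the product of blocks i and j into its position in X'X.
        for r, prod_row in enumerate(prod):
            for c, v in enumerate(prod_row):
                xtx[offsets[i] + r][offsets[j] + c] = v

    # Diagonal blocks and the matching slice of X'y need one block only.
    for i, b in enumerate(blocks):
        place(i, i, transpose_times(b, b))
        for r, v in enumerate(transpose_times(b, y_col)):
            xty[offsets[i] + r] = v[0]

    # Off-diagonal blocks: one pair of blocks per step, mirrored by symmetry.
    for i, j in combinations(range(len(blocks)), 2):
        prod = transpose_times(blocks[i], blocks[j])
        place(i, j, prod)
        place(j, i, [list(col) for col in zip(*prod)])
    return xtx, xty

# Toy data: 4 observations, 3 regressors, blocks of at most 2 columns.
X = [[1, 0, 2], [0, 1, 1], [1, 1, 0], [2, 0, 1]]
y = [1, 2, 3, 4]
blocks = split_columns(X, 2)
xtx, xty = assemble_cross_products(blocks, y)
n_pairs_for_7_blocks = len(list(combinations(range(7), 2)))
```

With 7 blocks this scheme needs 21 pairwise loads for the off-diagonal
blocks, and because X'X is symmetric each pair has to be multiplied only
once.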
2) I have read that SAS can handle datasets that do not fit into RAM. Does
SAS do something like what I described under 1)? Is there a good discussion
forum about SAS that might give advice on whether the software can handle
such a large dataset? (I am sorry to ask this on Statalist...)
3) I have the feeling that with 1) and 2) I trade the space restriction for
a time restriction. Either solution, 1) or 2), might be so time-consuming
that it is simply illusory to compute least squares estimates from such a
huge dataset. Before computing X'X I need to "time-demean" the firm dummies
(within transformation), which might also be very time-consuming when there
are 5,000 regressors and 2 million persons observed over 5 time periods.
Presumably the time restriction is real, which is why several authors
propose alternative estimation methods for person and firm effects in linked
employer-employee data. Do you have any suggestions about how to estimate
the time needed?
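
One rough way to approach the timing question is to count arithmetic
operations and bytes read, then divide by assumed machine rates. The sketch
below is plain Python with hypothetical names. It treats X as dense, which is
likely an upper bound for dummy regressors, and the two rates (10**10
floating-point operations per second, 10**8 bytes per second from disk) are
pure assumptions to be replaced by measured values.

```python
def rough_cost(n_obs, n_regressors, n_blocks,
               flops_per_second, disk_bytes_per_second, bytes_per_value=1):
    """Back-of-the-envelope operation and I/O counts for the scheme in 1)."""
    flops = {
        # one pass to accumulate person sums, one pass to subtract the means
        "demean": 2 * n_obs * n_regressors,
        # K(K+1)/2 distinct entries of X'X, each N multiplies and N adds
        "cross_products": n_obs * n_regressors * (n_regressors + 1),
        # K entries of X'y, each N multiplies and N adds
        "xty": 2 * n_obs * n_regressors,
        # Cholesky factorization of the K x K matrix, about K^3 / 3
        "solve": n_regressors ** 3 // 3,
    }
    # Naive pairwise loading: each of the n_blocks * (n_blocks - 1) / 2 pairs
    # reads two blocks of size N * K * bytes_per_value / n_blocks from disk.
    bytes_read = (n_blocks - 1) * n_obs * n_regressors * bytes_per_value
    compute_seconds = sum(flops.values()) / flops_per_second
    io_seconds = bytes_read / disk_bytes_per_second
    return flops, bytes_read, compute_seconds, io_seconds

# Dimensions from the post; both rates are assumptions, not measurements.
flops, bytes_read, compute_seconds, io_seconds = rough_cost(
    n_obs=10_000_000, n_regressors=5_000, n_blocks=7,
    flops_per_second=1e10, disk_bytes_per_second=1e8,
)
print(flops, bytes_read, compute_seconds / 3600, io_seconds / 3600)
```

Under these assumptions the cross-product term dominates the operation
count by a wide margin, so timing the multiplication of one pair of blocks
and scaling it up would probably give a more reliable estimate than any
assumed rate.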
Has anybody had a similar problem or thought this through? I would be glad
for any comments.
Regards,
Thomas