Thomas Cornelissen <[email protected]> writes,
> I want to implement a regression routine in Mata in order to use a large
> data set. I have set up the system of moment equations A*beta = B.
> I need to solve for beta and I need the covariance matrix of beta given by
> Cov(beta)=A^(-1)*(u'u)/(n-k). I want to get its diagonal elements and I am
> not interested in the off-diagonal elements.
>
> Because I am concerned by memory and time restrictions I am looking for the
> most efficient way to solve for beta and for the standard errors. I have
> considered:
>
> Way 1: Solve for beta -cholsolve(A,B)- and then obtain A^(-1) by
> -cholinv(A)-
>
> I am troubled by the fact that I am so-to-speak inverting A twice
> (i.e. using the solver twice). Isn't that inefficient?
Accurracy wise, what Thomas just suggested is best, and as a matter of fact,
is very nearly what -regress- does:
1. -regress- forms A as deviations from the mean.
2. -regress- uses function -invsym()- to invert A rather than
-cholinv()- because -invym()- can handle collinearities.
3. If collinearities are found in step 2, then the appropropriate
rows and columns from A are removed.
4. -regress- solves for b.
5. -regress- now does the work to estimate the intercept and adjust
that VCE obtained in step 2.
That is, -regress- "so-to-speak inverts A twice". The extra computational
cost of using both a solver and an inverter is not much becuase the the
inversion cost is a function of k (the number of variables in the model), not
N (the number of observations), and the accuracy gain in the resulting
estimated coefficients can be large.
-regress- does two more things Thomas does not propose to do, and Thomas may
want to consider adding one of them. In addition to "inverting twice",
-regress- handles scale (steps 1 and 5) and it handles collinearity (uses
-invsym() rather than -cholinv()- and step 3).
The handling of collinearity can be safely skipped. Thomas' proposed code
will simply fail when given a collinear model, and that is certainliy
reasonable behavior.
Concerning scale, my usual reply would be that taking deviations from the mean
does imporove accuracy, but as long as the function is used with varaibles
already normalized so that the means are not too widly different, that will
not matter.
Thomas said, however, that he is writting this function for use with
a large dataset, and I suspect he means N is very large. The importance
of mean removal grows with N, and somewhere around 500,000 or 1,000,000
observations, it becomes important even when the means are already roughly
equal. Consider the sum of x, or the sum of x^2. Assume x has mean m.
The contribution of the mean to the sum of x is N*m, and the contribution
to the sum of x^2 is (N*m)^2 + Sum_i 2*m*(x_i-m). In both formulas,
the important part is N*m. As N grows, the amount of precision required
just to hold that part grows, and so leaves less and less precision for
handling the wiggle part (the deviation from the mean). Yet, in the
regression calculation, it is the wiggle that matters.
Ergo, for N large, I would either remove the means in the function
(Mata's accumulation routines can do that automatically), or I would
only use my function on variables already normalized to have mean 0.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/