Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Re: Comparison of results from Stata
From
Matthew J Baker <[email protected]>
To
[email protected]
Subject
Re: st: Re: Comparison of results from Stata
Date
Mon, 7 Feb 2011 18:36:25 -0500 (EST)
A very insightful answer...thanks very much!
Matt Baker
Dr. Matthew J. Baker
Department of Economics
Hunter College and the Graduate Center, CUNY
---- Original message ----
>Date: Mon, 07 Feb 2011 10:27:57 -0600
>From: [email protected] (on behalf of [email protected] (William Gould, StataCorp LP))
>Subject: Re: st: Re: Comparison of results from Stata
>To: [email protected]
>
>In a previous exchange, Matthew discovered differences between
>-regress- and using Mata invsym(X'X)*X'y, and I explained those
>differences in terms of finite precision and, in particular, traced
>certain discrepancies down to Matthew's inadvertant use of -float-
>rather than -double-. Along the way, we discovered that -regress-
>was more accurate than invsym(X'X)*X'y.
>
>Matthew Baker <[email protected]> writes,
>
>> this begs the question: what is "reg" doing
>> differently? While running your code makes the errors much
>> closer, they still aren't identical - why is that?
>
>Matthew is imagining there is a single correct answer. Well, there is
>a single correct answer mathematically. Computers, however, perform
>calculations with a limited number of digits, aka finite precision, and
>the results they produce are close to, but not identical to, the single
>mathematical answer. As a result, two different formulas for calculating
>a result, formulas that would produce mathematically identical results,
>can and do produce different results when performed on a finite-precision
>computer. An important part of the scientific programming StataCorp does
>is evaluating those differences and finding fomrulas which, when performed
>using finite precision arithmetic, produces results closer to the true
>mathematical result.
>
>In Mata, Matthew obtained regression coefficients using the classic
>mathematical formula (X'X)^(-1)X'y, which he coded in Mata as
>invsym(X'X)*X'y.
>
>Stata's -regress- uses different formulas to obtain b. Performed with
>infinite precision, the formulas -regress- uses would produce the same,
>mathematically true result as (X'X)^(-1)X'y. Performed in finite precision,
>the fomulas -regress- uses produce results closer to the matematical answer.
>
>So what does -regress- do?
>
>First, rather than calculating (an approximiation to) (X'X)^(-1), which
>can then be used in the formula (X'X)^(-1)X'y directly to obtain b, Stata uses
>a solver, and solves the equation
>
> (X'X)b = X'y
>
>for b.
>
>Premultiply both sides of the above equation by (X'X)^(-1) and you will
>obtain b = (X'X)^(-1)*X'y, but that is not how a solver works. The
>particular solver Mata uses is based on LR decomposition.
>Mata provides this capability -- see -help mata lusolve()- -- so Matthew
>could have used it, but -lusolve()- is more trouble than using
>syminv(X'X)*X'y, and really, there's no reason he should have used it.
>syminv(X'X)*X'y produces an excellent answer in this case. EVen so,
>lusolve() would have produced a better answer.
>
>If Matthew wanted to implement using lusolve() in all cases, I should warn
>him that LU decomposition assumes that matrix is full rank. Matthew will
>need to think about how he will detect collinearity and reduce the problem
>before giving it over to the LU solver.
>
>Before Matthew does that, however, I need to mention that use of a
>solver rather than an inverter is just the icing on the cake. Before
>he even gets to solving (X'X)b=X'y for b, Matthew needs to think
>carefully about the calculation of X'X.
>
>Finite-precision calculation of X'X is a disaster in the making. The
>problem is that computers don't add and subtract numbers on different
>scales as accurately as you might expect. Just forming the sum of
>x1+x2+x3+...+xN to obtain a mean requires caution. The obvious way
>to proceed is to first add x+x2, then add x3 to that, and so on.
>If all the x's are greater than 0, the sequence (x1, x1+x2, x1+x2+x3, ...)
>grows, and the scale of x1+x2 is roughly twice than of x3,
>x1+x3+x3 is roughly three times that of x4, and so on. The result is
>that each subsequent addition is performed less and less accurately on
>a finite-precision machine.
>
>I could write a lot about that problem, but let me skip all that and
>just say that one thing that well help dramatically is to remove the
>mean from each of x1, x2, x3, ..., and so calculate ((x1-mean),
>(x1-mean)+(x2-mean), ...). It will not eliminate the problem, but it
>will be a big help.
>
>Before Matthew can adopt that approach, however, he will need to think
>about how he is going to obtain an accurate apporiximation of the
>mean, which is itself based on x1+x2+...+xN. I also note, Matthew
>should think about how much that will matter, which is to say, how
>error in the calculation of the mean will filter through the rest of
>the calculation. Again, I'll skip a lengthy discussion and merely say
>that there are ways of calculating a sum that are more accurate than
>summing the elements. Better is to calculate a running approximation
>of the mean, and update that. To wit, first I approximate the mean as
>m1 = x1, then I approximate it as m2 = (x1+x2)/2, then I approximate
>it as m3 = (m2*(2/3)+x3/3), and so on. Other things Matthew might
>want to think about to obain more precise result is putting x is
>ascending order and/or making the calculation in quad rather than
>double precision.
>
>Mata will do all of that, too; see cross(), crossdev(), and quadcross(),
>and quadcrossdev().
>
>Now, if Matthew removes the mean, he will be able to obtain b1 and b2
>for the linear regression,
>
> y = b0 + b1*x1 + b2*x2
>
>but he will need to develop an accurate calculation formula for b0.
>There's an obvious one staring you in the fact, assuming you have
>accurate renditions of the means of y, x1, and x2. Obtaining the
>standard errors for the coefficients is a entirely new problem.
>
>And that's what we do at StataCorp.
>
>The few extra digits we of accuracy we get are not the point of the
>exercise. We go through all of these macinations so that, when you
>give us a problem that is poorly scaled, or with lots of collinearity,
>problems where syminv(X'X)*X'y would produce inaccurate results,
>we produce results that are still close to the mathematically true
>answer. The few extra digits of accuracy that you observe in most
>problems come for free.
>
>-- Bill
>[email protected]
>*
>* For searches and help try:
>* http://www.stata.com/help.cgi?search
>* http://www.stata.com/support/statalist/faq
>* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/