Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Re: Comparison of results from Stata

From	Matthew J Baker <[email protected]>
To	[email protected]
Subject	Re: st: Re: Comparison of results from Stata
Date	Mon, 7 Feb 2011 18:36:25 -0500 (EST)

A very insightful answer...thanks very much!

Matt Baker

Dr. Matthew J. Baker
Department of Economics
Hunter College and the Graduate Center, CUNY


---- Original message ----
>Date: Mon, 07 Feb 2011 10:27:57 -0600
>From: [email protected] (on behalf of [email protected] (William Gould, StataCorp LP))
>Subject: Re: st: Re: Comparison of results from Stata  
>To: [email protected]
>
>In a previous exchange, Matthew discovered differences between 
>-regress- and using Mata invsym(X'X)*X'y, and I explained those 
>differences in terms of finite precision and, in particular, traced 
>certain discrepancies down to Matthew's inadvertant use of -float- 
>rather than -double-.  Along the way, we discovered that -regress- 
>was more accurate than invsym(X'X)*X'y.
>
>Matthew Baker <[email protected]> writes,
>
>> this begs the question: what is "reg" doing 
>> differently? While running your code makes the errors much 
>> closer, they still aren't identical - why is that?
>
>Matthew is imagining there is a single correct answer.  Well, there is 
>a single correct answer mathematically.  Computers, however, perform 
>calculations with a limited number of digits, aka finite precision, and 
>the results they produce are close to, but not identical to, the single
>mathematical answer.   As a result, two different formulas for calculating 
>a result, formulas that would produce mathematically identical results, 
>can and do produce different results when performed on a finite-precision 
>computer.  An important part of the scientific programming StataCorp does 
>is evaluating those differences and finding fomrulas which, when performed
>using finite precision arithmetic, produces results closer to the true 
>mathematical result.
>
>In Mata, Matthew obtained regression coefficients using the classic 
>mathematical formula (X'X)^(-1)X'y, which he coded in Mata as 
>invsym(X'X)*X'y.  
>
>Stata's -regress- uses different formulas to obtain b.  Performed with 
>infinite precision, the formulas -regress- uses would produce the same,
>mathematically true result as (X'X)^(-1)X'y.  Performed in finite precision,
>the fomulas -regress- uses produce results closer to the matematical answer.
>
>So what does -regress- do?
>
>First, rather than calculating (an approximiation to) (X'X)^(-1), which 
>can then be used in the formula (X'X)^(-1)X'y directly to obtain b, Stata uses
>a solver, and solves the equation 
>
>	(X'X)b = X'y
>
>for b. 
>
>Premultiply both sides of the above equation by (X'X)^(-1) and you will 
>obtain b = (X'X)^(-1)*X'y, but that is not how a solver works.  The 
>particular solver Mata uses is based on LR decomposition.  
>Mata provides this capability -- see -help mata lusolve()- -- so Matthew 
>could have used it, but -lusolve()- is more trouble than using
>syminv(X'X)*X'y, and really, there's no reason he should have used it.
>syminv(X'X)*X'y produces an excellent answer in this case.  EVen so, 
>lusolve() would have produced a better answer.
>
>If Matthew wanted to implement using lusolve() in all cases, I should warn 
>him that LU decomposition assumes that matrix is full rank.  Matthew will 
>need to think about how he will detect collinearity and reduce the problem 
>before giving it over to the LU solver. 
>
>Before Matthew does that, however, I need to mention that use of a
>solver rather than an inverter is just the icing on the cake.  Before
>he even gets to solving (X'X)b=X'y for b, Matthew needs to think
>carefully about the calculation of X'X.
>
>Finite-precision calculation of X'X is a disaster in the making.  The 
>problem is that computers don't add and subtract numbers on different 
>scales as accurately as you might expect.  Just forming the sum of 
>x1+x2+x3+...+xN to obtain a mean requires caution.  The obvious way 
>to proceed is to first add x+x2, then add x3 to that, and so on.  
>If all the x's are greater than 0, the sequence (x1, x1+x2, x1+x2+x3, ...)
>grows, and the scale of x1+x2 is roughly twice than of x3, 
>x1+x3+x3 is roughly three times that of x4, and so on.  The result is 
>that each subsequent addition is performed less and less accurately on 
>a finite-precision machine.
>
>I could write a lot about that problem, but let me skip all that and
>just say that one thing that well help dramatically is to remove the
>mean from each of x1, x2, x3, ..., and so calculate ((x1-mean),
>(x1-mean)+(x2-mean), ...).  It will not eliminate the problem, but it
>will be a big help.
>
>Before Matthew can adopt that approach, however, he will need to think
>about how he is going to obtain an accurate apporiximation of the
>mean, which is itself based on x1+x2+...+xN.  I also note, Matthew
>should think about how much that will matter, which is to say, how
>error in the calculation of the mean will filter through the rest of
>the calculation.  Again, I'll skip a lengthy discussion and merely say
>that there are ways of calculating a sum that are more accurate than
>summing the elements.  Better is to calculate a running approximation
>of the mean, and update that.  To wit, first I approximate the mean as
>m1 = x1, then I approximate it as m2 = (x1+x2)/2, then I approximate
>it as m3 = (m2*(2/3)+x3/3), and so on.  Other things Matthew might
>want to think about to obain more precise result is putting x is
>ascending order and/or making the calculation in quad rather than
>double precision.
>
>Mata will do all of that, too; see cross(), crossdev(), and quadcross(), 
>and quadcrossdev().
>
>Now, if Matthew removes the mean, he will be able to obtain b1 and b2
>for the linear regression, 
>
>        y = b0 + b1*x1 + b2*x2
>
>but he will need to develop an accurate calculation formula for b0.
>There's an obvious one staring you in the fact, assuming you have
>accurate renditions of the means of y, x1, and x2.  Obtaining the
>standard errors for the coefficients is a entirely new problem.
>
>And that's what we do at StataCorp.
>
>The few extra digits we of accuracy we get are not the point of the 
>exercise.  We go through all of these macinations so that, when you 
>give us a problem that is poorly scaled, or with lots of collinearity, 
>problems where syminv(X'X)*X'y would produce inaccurate results, 
>we produce results that are still close to the mathematically true 
>answer.  The few extra digits of accuracy that you observe in most 
>problems come for free.
>
>-- Bill
>[email protected]
>*
>*   For searches and help try:
>*   http://www.stata.com/help.cgi?search
>*   http://www.stata.com/support/statalist/faq
>*   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
- Re: st: Re: Comparison of results from Stata
  - From: [email protected] (William Gould, StataCorp LP)

Prev by Date: st: epiconf: counfounding
Next by Date: st: Heckprob problems
Previous by thread: Re: st: Re: Comparison of results from Stata
Next by thread: st: changes of user-input by -bootstrap- and -permute-
Index(es):
- Date
- Thread