In response to Radu's <[email protected]> query on why -areg- was not dropping
collinear variables, Scott <[email protected]> made an interesting
demonstration on how machine precision can affect whether -_rmcoll- drops
variables:
> You replicate this with the grunfeld dataset and the mvalue variable.
>
> sysuse grunfeld,clear
> egen double t1 = total(mva), by(com)
> egen double t2 = total(kstock), by(com)
>
> //t1 not dropped
> areg invest kstock t1, ab(com)
>
> //t2 dropped
> areg invest mvalu t2, ab(com)
>
> replace t1 = t1/100
> //Now, t1 is dropped
> areg invest kstock t1, ab(com)
>
> replace t2 = t2*10000
> //Now, t2 is not dropped
> areg invest mvalu t2, ab(com)
I would like to add to this demonstration where I show the
dangers of having single precision variables (float) and not
promoting them to double precision with caution.
Note that single precision will have only about 7 digits
of precision while double has about 16 corresponding to
24 and 53 bits for the mantissia:
. di c(epsfloat)
1.192e-07
. di c(epsdouble)
2.220e-16
. di 2^(1-24)
1.192e-07
. di 2^(1-53)
2.220e-16
machine epsilon is 2^(1-p) where p is the number of bits
reserved for the mantissia of floating point numbers.
Now take a look at the storage type for each of the variables
in the grunfeld dataset.
. webuse grunfeld,clear
. describe
Contains data from http://www.stata-press.com/data/r9/grunfeld.dta
obs: 200
vars: 6 3 Mar 2005 20:27
size: 5,600 (99.5% of memory free)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
company float %9.0g
year float %9.0g
invest float %9.0g
mvalue float %9.0g
kstock float %9.0g
time float %9.0g
. list in 1/20
+--------------------------------------------------+
| company year invest mvalue kstock time |
|--------------------------------------------------|
1. | 1 1935 317.6 3078.5 2.8 1 |
2. | 1 1936 391.8 4661.7 52.6 2 |
3. | 1 1937 410.6 5387.1 156.9 3 |
4. | 1 1938 257.7 2792.2 209.2 4 |
5. | 1 1939 330.8 4313.2 203.4 5 |
|--------------------------------------------------|
6. | 1 1940 461.2 4643.9 207.2 6 |
7. | 1 1941 512 4551.2 255.2 7 |
8. | 1 1942 448 3244.1 303.7 8 |
9. | 1 1943 499.6 4053.7 264.1 9 |
10. | 1 1944 547.5 4379.3 201.6 10 |
|--------------------------------------------------|
11. | 1 1945 561.2 4840.9 265 11 |
12. | 1 1946 688.1 4900.9 402.2 12 |
13. | 1 1947 568.9 3526.5 761.5 13 |
14. | 1 1948 529.2 3254.7 922.4 14 |
15. | 1 1949 555.1 3700.2 1020.1 15 |
|--------------------------------------------------|
16. | 1 1950 642.9 3755.6 1099 16 |
17. | 1 1951 755.9 4833 1207.7 17 |
18. | 1 1952 891.2 4924.9 1430.5 18 |
19. | 1 1953 1304.4 6241.7 1777.3 19 |
20. | 1 1954 1486.7 5593.6 2226.3 20 |
+--------------------------------------------------+
We can extend the display format of mvalue and kstock and you will see
what you are going to get beyond the 7-th digit when you promote these
variables to double precision.
. format %20.15g kstock mvalue
. gen double fvalue = mvalue
. gen double fstock = kstock
. format %20.15g fvalue fstock
. list fvalue mvalue fstock kstock in 1/20
+---------------------------------------------------------------------------+
| fvalue mvalue fstock kstock | |---------------------------------------------------------------------------|
1.| 3078.5 3078.5 2.79999995231628 2.79999995231628 |
2.| 4661.7001953125 4661.7001953125 52.5999984741211 52.5999984741211 |
3.| 5387.10009765625 5387.10009765625 156.899993896484 156.899993896484 |
4.| 2792.19995117188 2792.19995117188 209.199996948242 209.199996948242 |
5.| 4313.2001953125 4313.2001953125 203.399993896484 203.399993896484 |
|---------------------------------------------------------------------------|
6.| 4643.89990234375 4643.89990234375 207.199996948242 207.199996948242 |
7.| 4551.2001953125 4551.2001953125 255.199996948242 255.199996948242 |
8.| 3244.10009765625 3244.10009765625 303.700012207031 303.700012207031 |
9.| 4053.69995117188 4053.69995117188 264.100006103516 264.100006103516 |
10.| 4379.2998046875 4379.2998046875 201.600006103516 201.600006103516 |
|---------------------------------------------------------------------------|
11 | 4840.89990234375 4840.89990234375 265 265 |
12.| 4900.89990234375 4900.89990234375 402.200012207031 402.200012207031 |
13.| 3526.5 3526.5 761.5 761.5 |
14.| 3254.69995117188 3254.69995117188 922.400024414063 922.400024414063 |
15.| 3700.19995117188 3700.19995117188 1020.09997558594 1020.09997558594 |
|---------------------------------------------------------------------------|
16.| 3755.60009765625 3755.60009765625 1099 1099 |
17.| 4833 4833 1207.69995117188 1207.69995117188 |
18.| 4924.89990234375 4924.89990234375 1430.5 1430.5 |
19.| 6241.7001953125 6241.7001953125 1777.30004882813 1777.30004882813 |
20.| 5593.60009765625 5593.60009765625 2226.30004882813 2226.30004882813 |
+---------------------------------------------------------------------------+
. egen double t1 = total(mva), by(com)
. egen double t2 = total(kstock), by(com)
. format %20.15g t1 t2
. list t1 t2 in 1/20
+------------------------------------+
| t1 t2 |
|------------------------------------|
1. | 86676.900390625 12968.7000625134 |
2. | 86676.900390625 12968.7000625134 |
3. | 86676.900390625 12968.7000625134 |
4. | 86676.900390625 12968.7000625134 |
5. | 86676.900390625 12968.7000625134 |
|------------------------------------|
6. | 86676.900390625 12968.7000625134 |
7. | 86676.900390625 12968.7000625134 |
8. | 86676.900390625 12968.7000625134 |
9. | 86676.900390625 12968.7000625134 |
10. | 86676.900390625 12968.7000625134 |
|------------------------------------|
11. | 86676.900390625 12968.7000625134 |
12. | 86676.900390625 12968.7000625134 |
13. | 86676.900390625 12968.7000625134 |
14. | 86676.900390625 12968.7000625134 |
15. | 86676.900390625 12968.7000625134 |
|------------------------------------|
16. | 86676.900390625 12968.7000625134 |
17. | 86676.900390625 12968.7000625134 |
18. | 86676.900390625 12968.7000625134 |
19. | 86676.900390625 12968.7000625134 |
20. | 86676.900390625 12968.7000625134 |
+------------------------------------+
You can see here that even though we have made the variables
t1 and t2 double precision, we really only have done single
precision numerics (actually a little better than single
precision) and the variables have only 9 digits that
are any good. We gained some precsion, but we did not
do as well as we would have if we started with full double
precision. This garbage beyond the 9-th digit can be
enough for -_rmcoll- to not detect variable collinearity.
Now if we promote mvalue and kstock to double carefully
we can get full precision.
. scalar digits = round(log10(abs(c(epsfloat))),1)+1
. gen double dvalue = round(mvalue,10^(digits+ceil(log10(abs(mvalue)))))
. gen double dstock = round(kstock,10^(digits+ceil(log10(abs(kstock)))))
. format %20.15g dvalue dstock
. list dvalue mvalue dstock kstock in 1/20
+-------------------------------------------------------+
| dvalue mvalue dstock kstock |
|-------------------------------------------------------|
1. | 3078.5 3078.5 2.8 2.79999995231628 |
2. | 4661.7 4661.7001953125 52.6 52.5999984741211 |
3. | 5387.1 5387.10009765625 156.9 156.899993896484 |
4. | 2792.2 2792.19995117188 209.2 209.199996948242 |
5. | 4313.2 4313.2001953125 203.4 203.399993896484 |
|-------------------------------------------------------|
6. | 4643.9 4643.89990234375 207.2 207.199996948242 |
7. | 4551.2 4551.2001953125 255.2 255.199996948242 |
8. | 3244.1 3244.10009765625 303.7 303.700012207031 |
9. | 4053.7 4053.69995117188 264.1 264.100006103516 |
10. | 4379.3 4379.2998046875 201.6 201.600006103516 |
|-------------------------------------------------------|
11. | 4840.9 4840.89990234375 265 265 |
12. | 4900.9 4900.89990234375 402.2 402.200012207031 |
13. | 3526.5 3526.5 761.5 761.5 |
14. | 3254.7 3254.69995117188 922.4 922.400024414063 |
15. | 3700.2 3700.19995117188 1020.1 1020.09997558594 |
|-------------------------------------------------------|
16. | 3755.6 3755.60009765625 1099 1099 |
17. | 4833 4833 1207.7 1207.69995117188 |
18. | 4924.9 4924.89990234375 1430.5 1430.5 |
19. | 6241.7 6241.7001953125 1777.3 1777.30004882813 |
20. | 5593.6 5593.60009765625 2226.3 2226.30004882813 |
+-------------------------------------------------------+
. egen double d1 = total(dvalue), by(com)
. egen double d2 = total(dstock), by(com)
. format %20.15g d1 d2
. list d1 t1 d2 t2 in 1/20
+--------------------------------------------------------+
| d1 t1 d2 t2 |
|--------------------------------------------------------|
1. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
2. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
3. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
4. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
5. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
|--------------------------------------------------------|
6. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
7. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
8. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
9. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
10. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
|--------------------------------------------------------|
11. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
12. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
13. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
14. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
15. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
|--------------------------------------------------------|
16. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
17. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
18. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
19. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
20. | 86676.9 86676.900390625 12968.7 12968.7000625134 |
+--------------------------------------------------------+
Now we do Scott's experiment in double precision:
. gen double dinvest = round(invest,10^(digits+ceil(log10(abs(invest)))))
. recast long company
. areg dinvest dstock d1, ab(com)
Linear regression, absorbing indicators Number of obs = 200
F( 1, 189) = 366.45
Prob > F = 0.0000
R-squared = 0.9184
Adj R-squared = 0.9141
Root MSE = 63.566
------------------------------------------------------------------------------
dinvest | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dstock | .3707496 .0193676 19.14 0.000 .3325452 .4089541
d1 | (dropped)
_cons | 43.62499 6.984315 6.25 0.000 29.84777 57.40222
-------------+----------------------------------------------------------------
company | F(9, 189) = 36.975 0.000 (10 categories)
. areg dinvest dvalue d2, ab(com)
Linear regression, absorbing indicators Number of obs = 200
F( 1, 189) = 111.35
Prob > F = 0.0000
R-squared = 0.8491
Adj R-squared = 0.8411
Root MSE = 86.444
------------------------------------------------------------------------------
dinvest | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dvalue | .1898776 .0179944 10.55 0.000 .1543819 .2253733
d2 | (dropped)
_cons | -59.42872 20.40144 -2.91 0.004 -99.6725 -19.18494
-------------+----------------------------------------------------------------
company | F(9, 189) = 15.973 0.000 (10 categories)
So my advice is to be very wary of single precision floating point variable.
-Rich
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/