Title:  The accuracy of the float data type
Author: William Gould, StataCorp
float is a storage format used by Stata, not a computation format. When you have a number stored as a float and you make a calculation, such as
. gen newvar = sqrt(oldvar)/sqrt(2)
oldvar is retrieved and promoted to a double. The entire computation is then performed in double precision, and the result is rounded back to a float because newvar, by default, is created as a float.
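You can see the rounding step for yourself with a minimal sketch (the variable names here are made up for illustration); the float variable y and the double variable yd receive the same double-precision result, but y's copy is rounded to float:

. clear
. set obs 1
. gen double x = 3
. gen y = sqrt(x)/sqrt(2)            // y is created as a float by default
. gen double yd = sqrt(x)/sqrt(2)    // yd keeps the full double result
. display yd - y                     // nonzero: y's rounding error, on the order of 1e-08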
Floats have 7.22 decimal digits of precision, although there is an argument for saying 7.5 digits; it all depends on how you count partial digits.
The way computers store floating-point numbers (not to be confused with float; double is also an example of floating point) is

    z = a * 2^p,    -2 < a < 2
Here are some examples of how numbers are stored:
          z       a       p
    ------------------------
          1       1       0
        1.5     1.5       0
          2       1       1
          3     1.5       1    (i.e., 1.5*2^1 = 3)
    ------------------------
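If you would like to see a and p directly, Stata's %21x display format shows the mantissa in hexadecimal and the power-of-2 exponent (hexadecimal 1.8 is decimal 1.5, so the last line below reads 1.5*2^1 = 3):

. display %21x 1
+1.0000000000000X+000

. display %21x 1.5
+1.8000000000000X+000

. display %21x 3
+1.8000000000000X+001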
In float, 24 bits are allocated for a. Thus the largest integer that can be exactly stored is 2^0 + 2^1 + ... + 2^23 = (2^24)−1 = 16,777,215. Well, actually, 2^24 = 16,777,216 is also stored precisely because it is even, but 16,777,217 cannot be stored precisely. We can demonstrate this using Stata's float() function, which rounds to float precision:
. display float(16777216)
16777216

. display float(16777217)
16777216
Good; Stata works just as theory would suggest.
Now how accurate is float? Well, for numbers like 16,777,217, the absolute error is 1, so the relative error is
1/16,777,217 = 5.960e-08
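You can verify in Stata that this relative error is just 2^(−24):

. display %10.3e 1/16777217
 5.960e-08

. display %10.3e 2^(-24)
 5.960e-08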
Generally, when you store a number z as float, what is stored is z', and you can be assured that
z * (1 - 5.960e-08) <= z' <= z * (1 + 5.960e-08)
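Here is a quick check of the bound, using 16,777,217 from above as the test value:

. display %10.3e (float(16777217) - 16777217)/16777217
-5.960e-08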
How many digits of accuracy is that? I can tell you exactly in binary: 24 binary digits. But how do you express 24 binary digits in base 10? (By the way, thinking in binary is not difficult here: 24 binary digits means the smallest relative difference is 2^(−24) = 5.960e−08, which is the same relative accuracy we obtained above.)
Returning to decimal, you might start by observing that 16,777,216 has 8 digits, but not every 8-digit number can be stored exactly, so we don't want to claim 8.
One way to get a base-10 count would be to take log10(16,777,216) = 7.2247199. That is how most numerical analysts convert digit accuracy between bases, so we could claim 7.22 decimal digits of accuracy.
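Stata will do that arithmetic for you:

. display log10(16777216)
7.2247199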
The .22 part of 7.22 is subject to misinterpretation because what we just called .22 would, by some, be called one-half. Consider 16,777,216 and the numbers just after it:
     true number     stored if float
    ----------------------------------------
      16,777,216       16,777,216
      16,777,217       16,777,216
      16,777,218       16,777,218
      16,777,219       16,777,220   [sic]
      16,777,220       16,777,220
      16,777,221       16,777,220
      16,777,222       16,777,222
      16,777,223       16,777,224   [sic]
      16,777,224       16,777,224
      16,777,225       16,777,224
    ----------------------------------------
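If you want to reproduce the right-hand column, a short do-file loop over float() will do it (a sketch; the range is just the ten values in the table):

forvalues z = 16777216/16777225 {
        display %12.0f `z' "  ->  " %12.0f float(`z')
}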
Basically, odd numbers are being rounded to even numbers. Many people would call this the loss of half a digit in the last place, and we could devise a different formula that labels the accuracy 7.5 digits. Sometimes in computer documentation, you will see the statement that float has 7.5 digits of accuracy. Authors say 7.5 rather than 7.22 because they worry that readers might misinterpret what 7.22 means.
Label the difference how you wish; there are 24 binary digits, and the relative accuracy is +/− 2^(−24) = 5.960e−08.
Note: The [sic]s in the above have to do with how numbers ending in exactly “half” (5 in decimal) are rounded; this is the same problem as rounding 1.5 and 2.5 to one digit in decimal.