Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Counter-intuitive overflow behavior

From	"Roger B. Newson" <[email protected]>
To	[email protected]
Subject	Re: st: Counter-intuitive overflow behavior
Date	Tue, 28 Feb 2012 18:53:57 +0000

Thanks to Bill for his very enlightening explanation of the issues ofundocumented missing values and asymmetric limits to the pseudo-realline. And for the advance warning about the change to come, in a futureStata version, in the definition of -c(mindouble)-.

I discovered these issues while working on the next version of the-cendif- module of the -somersd- package, which routinely calculatesinfinite upper and/or lower confidence limits for percentile differenceswhen there is a good reason to plead ignorance, as when the populationvalue really could be anywhere and we cannot be 95% confident about agreat deal. I will bear the impending redefinition, and the otherissues, when planning future updates to the -somersd- package.


Best wishes

Roger


Roger B Newson BSc MSc DPhil
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton Campus
Room 33, Emmanuel Kaye Building
1B Manresa Road
London SW3 6LR
UNITED KINGDOM
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: [email protected]
Web page: http://www.imperial.ac.uk/nhli/r.newson/
Departmental Web page:
http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/

Opinions expressed are those of the author, not of the institution.

On 28/02/2012 17:14, William Gould, StataCorp LP wrote:

Roger Newson<[email protected]>  wrote,

I have a query about -c(mindouble)- and -c(maxdouble)- (see -help
creturn- for more about these).  [...]  I have noticed some very
counter-intuitive behavior when I attempt to assign variables of type
-double- to values outside the range from -c(mindouble)- to
-c(maxdouble)-, which (according to -help creturn-) are "the largest
negative number that can be stored in the 8-byte double storage type"
and "the largest positive number that can be stored in a double",
respectively.


Roger shows some exmaples and concludes by writing

Is this counterintuitive behavior a bug or a feature? I have always
assumed that -c(mindouble)- and -c(maxdouble)- are as advertized, and
that we can assume that non-missing -double- values can be expected to
be in the closed interval between these 2 limits. This doesn't seem to
be the case.


The short answer is,

     1.  Roger is correct about the counterintuitive behavior.

     2.  It is a feature.

     3.  Although we advertize the range of doubles as being
         [c(mindouble), c(maxdouble]), in fact the range is
         [2*c(mindouble), c(maxdouble)].  We should correct
         our documentation.

For those who do not want to immerse themselves in the details,
understand that there is not bug (except for the documentation), and
the counterintuitive behavior will lead to no errors in subsequent
calculation and, in fact, might even make make some calculations come
out as nonmissing (and correct) when, without the counterintuitive
behavior, they would result in missing.  That's not the reason why we
allowed the counterintuitive behavior, however; it was allowed for
performance reasons.

That only leaves the question of why the manual is incorrect.
It was, I hate to admit, intentional in a misguided attempt to avoid
giving this long explanation of how it is that that the range of
negative numbers is greater than the range of positive numbers.  The
attempt worked for years, but ultimately failed.

We will fix the manual.


What Roger found
----------------

     1.  Roger discovered that he could store numbers below c(mindouble)
         and that they work just fine.

     2.  Roger then worked exceedingly hard to create numbers above
         c(maxdouble) in, in some cases, he succeeded!  These numbers
         sometimes worked, but mostly, coverted themselves back to
         missing, or showed that they contained a mssing value that is not
         documented in the manual such as .z_, when the official missing
         values are ., .a, .b, ..., .z.

Concerning (1), we will fix the manual.

Concerning (2), a longer explanation is required.

We need to step back to the time we at StataCorp designed how missing
values would work.



IEEE 754-2008 Foating-point (double) numbers
--------------------------------------------

Floating point numbers are stored as

             a * 2^b

Think of a as normalized to be in the range -2<  a<  2, although I
oversimplify in unimportant ways.

The expontent b, a mathematical integer, is allowed to be in the range
[-1023, +1023].

The numnber line looks like this:


  -2^1024                          0                             2^1024
     |                             |                               |
      +-----------------------------------------------------------+


By the way, this is a rather odd number line in that it has a finite
number of points on it because a is recorded to a predetermined, fixed
number of digits.  Moreover, the number of points on the number line
for each power of 2 is the same as for any other power of two.

IEEE 754-2008 also provides some additional, off the number line
values, the most well known of which is NaN, meaning Not a Number.  It
would seem natural that missing values in Stata would be implemented
in terms of NaN but, for historical reasons, they are not.  Stata
understands NaNs, and uses NaNs internally, but for storage purposes,
NaNs are not used in the encoding of missing values.

This is because IEEE standard 754 has been under revision since 2000;
IEEE 754-2008 is the most recent revision, made in 2008.  IEEE 754 was
first adopted in 1985, the year Stata was first released.  In fact,
Stata was completed in 1984 and, even in 1985, it was not clear that
IEEE 754 would catch on.  There was a competing IBM floating-point
standard and, for example, Microsoft's then popular Basic used
the IBM standard rather than IEEE 754.

Stata uses IEEE 754 today, at least internally.  Nevertheless, Stata
has its own definition of missing values, and not just for historical
reasons.  Missing values are stored in datasets, datasets get old, and
users expect Stata to be able to pick up and use any dataset from the
past.  Thus, we at StataCorp control how Stata stores missing values.
Said differently, NaNs are fleeting things that arise during
calculations and that the software must deal with.  Missing values are
permanent things that need a stable definition and encoding.


How Stata stores missing values
-------------------------------

All Stata users know that missing values are larger than any
nonmissing value, which is to say, x<. if x is not missing.  Stata
also provides other missing values, namely .a, .b, ..., .z, and they
are ordered .a<  .b<  ...<  .z.

We arranged for all that to be true by taking an entire power of 2 for
missing values, namely 2^1023.  The number line now looks like this:


  -2^1024                          0                             2^1024 IEEE
     |                             |                               |
      +---------------------------------------------------+-------+
     |                             |                      |       |
  -2^1024                          0                    2^1023   2^1024 Stata
                                                          |-------|
                                                           missing
                                                           values

Although we advertise only 27 missing values, there are lots
more, some 2^52 of them!  The standard missing values we advertise are

               .  =  1           * 2^1023
               .a =  1 +  1/4096 * 2^1023
               .b =  1 +  2/4096 * 2^1023
                  .
                  .
               .y = 1/ + 25/4096 * 2^1023
               .z = 1/ + 26/4096 * 2^1023

The missing valuies between . and .a are known collectively as ._,
between .a and .b as .a_, ..., and above .z and .z_.

Roger ran into that when he showed the example,

       . tabulate doc, missing

                doc |      Freq.     Percent        Cum.
       ------------+-----------------------------------
                .z_ |         74      100.00      100.00
       ------------+-----------------------------------
              Total |         74      100.00


which, Roger wrote, suggested "that the variable -doc- contains a missing
value that I haven't seen mentioned in -help missing-".

Roger also discovered that he could not use .z_ as an input value.
That is because there is no single value associated with .z_.  .z_ is just
an output notation to refer to all the missing values above .z.


A difficult decision at StataCorp
---------------------------------

We at StataCorp, having decided how to encode missing values, now faced
the nubmer line


  -2^1024                          0                             2^1024 IEEE
     |                             |                               |
      +---------------------------------------------------+-------+
     |                             |                      |       |
  -2^1024                          0                    2^1023   2^1024 Stata
                                                          |-------|
                                                           missing
                                                           values

The range of values in Stata is (-2^1024, 2^1023).  There are more
negative than positive values.  We then considered the idea of
cutting the number line off at the bottom, too:


  -2^1024                          0                             2^1024 IEEE
     |                             |                               |
      +---------------------------------------------------+-------+
     |        |                    |                      |       |
  -2^1024  -2^1023                 0                    2^1023   2^1024 Stata
     |  CUT   |                                           |-------|
     |        |                                           |missing
     |        |                                           |values
               +-----------------------------------------+
                   We considered making this the range


We considered cutting the number line at -2^1023 just so that the range
of negative numbers would be the same as the range for positive numbers.

While there is a pleasing symmetry in the range (-2^1023, 2^1023),
there was simply no reason to throw away those numbers (which means to
map them to missing).  There might someday be a calculation that
visited that part of the number line.  The CPU (actually FPU) was
perfectly capable of making accurate calculations in that range.
Cutting the line off would only result in the odd calculation
resulting in missing when it when it could perfectly well produce a
result.

So we left the range as (-2^1024, 2^1023).

And then we documented the range as (-2^1023, 2^1023) just so I wouldn't
have to write this long explanation.

And I got to write it anyway.



Resolution
-----------

We will change c(mindouble) to be the true smallest double, but we will
not do that immediately.  The change will need to be made under version
control.  We will fix the manuals, but not immediately.  Because we have
gone years without anyone noticing, waiting a little longer should not
matter.  We will wait until the next release.  And then we'll have to
turn this Statalist posting into a FAQ so that we will have something
to refer to.


-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
- Re: st: Counter-intuitive overflow behavior
  - From: "William Gould, StataCorp LP" <[email protected]>

Prev by Date: Re: st: RE: New package -wridit- on SSC
Next by Date: st: SE erroes for xtmixed-predict fitted
Previous by thread: Re: st: Counter-intuitive overflow behavior
Next by thread: st: Moderated mediation with multilevel data
Index(es):
- Date
- Thread