Hello Statalist members:
I am a little puzzled by the way Stata stores numerical data in float-type
variables (I have read relevant portions of the manuals and the FAQ page). I
have a few questions arising from my inability to understand the point of
the float storage type and would appreciate it if anyone could show me what
I am missing. Here are those questions:
1) Why doesn't Stata change (or "promote" in Stata parlance) the storage
type of a variable from float to double when the variable is created by
commands such as "generate", even when the "promotion" is logical and
desirable? Here
is an example to illustrate my point. Suppose there is a data set with one
observation with two variables. The first variable (VAR_1) is created by the
user manually typing in the value 111222 into the data editor. The storage
type of VAR_1 is "long." The second variable (VAR_2) is created by the user
issuing the command, generate VAR_2=111222. The storage type of VAR_2 is
"float" because that is the default option. Then suppose I enter the data
editor and manually change the value of VAR_1 from 111222 to 111222333444.
Stata will change the storage type of VAR_1 from "long" to "double", and
the value that appears in the cell in the data editor is 111222333444; I say
111222333444 to Stata, and it understands me. However if I issue the
command, replace VAR_2=111222333444, Stata won't change the storage type and
the precision of the data gets lost; the value that appears in the cell in
the data editor is 111222333440 rather than 111222333444. I say 111222333444
to Stata, but it says back to me 111222333440. I know that Stata does
actually understand me because Stata will say 1 if I issue the command,
count if VAR_2==float(111222333444). But if I issue the command, list,
Stata will list 111222333444 and 111222333440, when those two values should
be identical in the log file. Like most users I work with log files and
don't always have a lot of free time. When I say 111222333444, I would like
to see Stata show 111222333444 without my having to take extra steps (such as
using the float() function or "recast"). I fail to understand why Stata won't change
the storage type in this situation to produce an accurate result; after all
Stata is providing inaccurate information for VAR_2 here. I have used SPSS
and SAS in the past and both programs understand me when I say
111222333444..., which, I think, isn't too much to ask. (Below I append a
log file documenting steps I took to illustrate my point).
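The behavior in question 1 can also be reproduced without the data editor. The following is only a minimal sketch, assuming an empty session; the variable names var_f and var_d are made up for illustration and do not appear in my log.

```stata
* Sketch only: reproduce the float rounding described in question 1.
clear
set obs 1

* float is what -generate- uses by default; it is spelled out here so the
* sketch behaves the same even if the default type has been changed.
generate float  var_f = 111222333444
* Requesting double explicitly keeps all twelve digits.
generate double var_d = 111222333444

format var_f var_d %20.0g
list

* The float variable holds the nearest value a float can represent,
* which is why it matches float(111222333444) but not 111222333444.
assert var_f == float(111222333444)
assert var_f != 111222333444
assert var_d == 111222333444
```

A float carries a 24-bit significand, so near 1.1e11 adjacent representable values are 8192 apart; 111222333440 is the closest one to 111222333444.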
2) Given the situation described above, why isn't the "double" storage type
the default option for numerical variables? I would think having "double" as
the default option would eliminate many sources of inaccuracies in Stata
data sets. For example, a program like Stat/Transfer can produce Stata data
sets with many inaccurate values if the user fails to turn on the "Use
Doubles" option (needless to say, that careless user is me, and I now
realize that I have to run file conversion jobs again). But if the "double"
type were the default option, there wouldn't be such problems. I don't see
the point of the "float" type being the default option for numerical
variables.
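As far as I can tell, the default can at least be changed for a session with "set type". The following is a minimal sketch under that assumption, not a recommendation; it starts from an empty session and resets the default at the end.

```stata
* Sketch only: make double the default storage type for new variables.
clear
set obs 1
set type double

* No explicit type is given, so var2 takes the current default (double).
generate var2 = 111222333444
confirm double variable var2
assert var2 == 111222333444

* Put the default back to float for the rest of the session.
set type float
```

This changes only how new variables are created; existing float variables keep their type and their rounded values until they are recast and re-entered.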
Thank you in advance. ---Hiroshi Maeda
----------------------------------------------------------------------------
log: C:\Data\DEMO.log
log type: text
opened on: 24 Feb 2004, 18:33:30
. *I am going to produce a variable by typing in the value 111 in the data editor;
. edit
- preserve
. desc
Contains data
obs: 1
vars: 1
size: 5 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
. list
var1
1. 111
. count if var1==111
1
. *I am going to add an observation by typing in the value 111222;
. edit
- preserve
- set obs 2
- replace var1 = 111222 in 2
- preserve
. desc
Contains data
obs: 2
vars: 1
size: 16 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            long   %12.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
. list
var1
1. 111
2. 111222
. count if var1==111222
1
. *I am going to add an observation by typing in the value 111222333;
. edit
- preserve
- set obs 3
- replace var1 = 111222333 in 3
- preserve
. desc
Contains data
obs: 3
vars: 1
size: 24 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            long   %12.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
. list
var1
1. 111
2. 111222
3. 111222333
. count if var1==111222333
1
. *I am going to add an observation by typing in the value 111222333444;
. edit
- preserve
- set obs 4
- replace var1 = 111222333444 in 4
- format var1 %20.0g
- preserve
. desc
Contains data
obs: 4
vars: 1
size: 48 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            double %20.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
. list
var1
1. 111
2. 111222
3. 111222333
4. 111222333444
. count if var1==111222333444
1
. *I am going to produce another variable by issuing the generate command;
. generate var2=.
(4 missing values generated)
. *I am going to assign the value 111 to var2 in case 1;
. replace var2=111 in 1
(1 real change made)
. desc
Contains data
obs: 4
vars: 2
size: 64 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            double %20.0g
var2            float  %9.0g
-------------------------------------------------------------------------------
---
Sorted by:
Note: dataset has changed since last saved
. list
var1 var2
1. 111 111
2. 111222 .
3. 111222333 .
4. 111222333444 .
. count if var2==111
1
. *I am going to assign the value 111222 to var2 in case 2;
. replace var2=111222 in 2
(1 real change made)
. desc
Contains data
obs: 4
vars: 2
size: 64 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            double %20.0g
var2            float  %9.0g
-------------------------------------------------------------------------------
---
Sorted by:
Note: dataset has changed since last saved
. list
var1 var2
1. 111 111
2. 111222 111222
3. 111222333 .
4. 111222333444 .
. count if var2==111222
1
. *I am going to assign the value 111222333 to var2 in case 3;
. replace var2=111222333 in 3
(1 real change made)
. desc
Contains data
obs: 4
vars: 2
size: 64 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            double %20.0g
var2            float  %9.0g
-------------------------------------------------------------------------------
---
Sorted by:
Note: dataset has changed since last saved
. format var2 %20.0g
. list
var1 var2
1. 111 111
2. 111222 111222
3. 111222333 111222336
4. 111222333444 .
. count if var2==111222333
0
. count if var2==float(111222333)
1
. *I am going to assign the value 111222333444 to var2 in case 4;
. replace var2=111222333444 in 4
(1 real change made)
. desc
Contains data
obs: 4
vars: 2
size: 64 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            double %20.0g
var2            float  %20.0g
-------------------------------------------------------------------------------
---
Sorted by:
Note: dataset has changed since last saved
. list
var1 var2
1. 111 111
2. 111222 111222
3. 111222333 111222336
4. 111222333444 111222333440
. count if var2==111222333444
0
. count if var2==float(111222333444)
1
. *I am going to manually replace the value of var2 in cases 3 to 4;
. edit
- preserve
- replace var2 = 111222333 in 3
- replace var2 = 111222333444 in 4
- preserve
. count if var2==111222333
0
. count if var2==float(111222333)
1
. count if var2==111222333444
0
. count if var2==float(111222333444)
1
. *I am going to change the storage type of var2 from float to double;
. recast double var2
. desc
Contains data
obs: 4
vars: 2
size: 80 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            double %20.0g
var2            double %20.0g
-------------------------------------------------------------------------------
---
Sorted by:
Note: dataset has changed since last saved
. list
var1 var2
1. 111 111
2. 111222 111222
3. 111222333 111222336
4. 111222333444 111222333440
. *I am going to replace the existing values of var2 in cases 3 to 4;
. replace var2=111222333 in 3
(1 real change made)
. replace var2=111222333444 in 4
(1 real change made)
. desc
Contains data
obs: 4
vars: 2
size: 80 (100.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
var1            double %20.0g
var2            double %20.0g
-------------------------------------------------------------------------------
---
Sorted by:
Note: dataset has changed since last saved
. list
var1 var2
1. 111 111
2. 111222 111222
3. 111222333 111222333
4. 111222333444 111222333444
. count if var2==111222333
1
. count if var2==111222333444
1
. log close
log: C:\Data\DEMO.log
log type: text
closed on: 24 Feb 2004, 18:45:11
----------------------------------------------------------------------------
Hiroshi Maeda
University of Illinois at Chicago
[email protected]