Here's one take.
What you see in Stata depends on a mix of history and design
principles. There is occasionally some tension between the two. In
fact, you've identified an inconsistency in Stata's behaviour
under different circumstances. Where I differ is that I don't
agree that Stata should change anything, except possibly its
documentation.
History
=======
-float- as the default type is a compromise, given that memory
was, historically, often a major limit on what users could do. This has
always been a consideration for more than just filespace, given
how Stata does its calculations. Even if fewer users are
constrained by this than was true in the early days of Stata,
there are always some users constrained by memory on their
existing machines. And for all users, a default type of -double-
would still be wasteful. (If anything, I detect a shift towards
categorical data in the total pattern of Stata use, for which a
-byte- or -int- is often fine.) If that default of -double-
obtained, then for every user bitten by the problem you document,
there would be many more bitten by performance problems. (And the
advice to use -compress-, and what to do if that didn't work,
would recur constantly on Statalist.)
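To put rough numbers on the memory argument, here is a minimal sketch in Python (not Stata). The widths are Stata's storage sizes per value: 1 byte for -byte-, 2 for -int-, 4 for -long- and -float-, 8 for -double-. The dataset dimensions and the helper name dataset_bytes are made up for illustration, and the calculation ignores strings, labels and other overhead.

```python
# Bytes per value for Stata's numeric storage types.
STORAGE_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def dataset_bytes(n_obs, n_vars, storage_type):
    """Approximate size of the data area: observations x variables x width."""
    return n_obs * n_vars * STORAGE_BYTES[storage_type]


# Hypothetical dataset: 1,000,000 observations on 100 numeric variables.
for storage_type in ("byte", "float", "double"):
    size_mb = dataset_bytes(1_000_000, 100, storage_type) / 1e6
    print(f"{storage_type:>6}: {size_mb:,.0f} MB")
```

On these assumed dimensions a default of -double- doubles the footprint relative to -float-, and costs eight times what -byte- would for categorical variables.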
Design
======
Software design hinges on how far users are assumed to be smart
and how far users, although inherently smart, should be protected
from the stupid mistakes that they may make (by accident, of
course, or because they were distracted, or the documentation is
lousy, or whatever).
Statistical software in general, and Stata especially among other
top-end programs, mostly assumes that the user is very smart, or
at least capable of accepting the consequences of what they asked
for. Variable types are a case in point. The differences between
types are explained clearly and prominently in the documentation,
and you are supposed to understand them. And, if need be, you are
supposed to know how to change them -- as Hiroshi does understand.
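For concreteness, the rounding that Hiroshi reports can be reproduced outside Stata. The sketch below is Python, not Stata, and assumes only that a -float- holds an IEEE 754 single-precision value; the helper name to_float32 is mine.

```python
import struct


def to_float32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


small = to_float32(111222)        # fits exactly: below 2**24
big = to_float32(111222333444)    # does not fit: neighbours are 8192 apart

print(small)                              # 111222.0
print(big)                                # 111222333440.0
print(big == 111222333444)                # False: compared at double precision
print(big == to_float32(111222333444))    # True: both sides rounded to float
```

The last two lines mirror why a comparison against float(111222333444) succeeds while the listed value differs from what was typed: the stored number is the nearest single-precision value, not the original.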
However, you've touched on an interesting issue, and one rarely
discussed so far as I know. Broadly speaking, the editor is a
rather late addition to Stata, and (I suspect) its introduction
was driven partly by what the marketing people were telling the
technical people. (Rightly so, in this case.) And it's a more
protective or supportive environment than the command line. Right
from the first value that you type in the first cell of a new
variable, Stata is making smart guesses at what the variable type
should be, and it's prepared to change its mind. Suppose you type
1
in a new column. Stata guesses that you want a -byte-. You then
type
345
and it changes its mind: no, you want an -int-. (Or, if you like,
you _need_ an -int-.) You then type
3.14159
and, behold, it's a -float-. This is what you've noticed. -edit-
is more flexible in this respect than the command line.
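The editor's guessing can be sketched roughly as follows. This is Python, not Stata, and only a guess at the logic: the integer ranges are Stata's documented limits for -byte-, -int- and -long-, the choice of -float- for a non-integer follows the 3.14159 example above, and the choice of -double- for an integer too large for -long- follows Hiroshi's account. The function name smallest_type is hypothetical, and how the editor combines types across the cells of one column is not modelled.

```python
# Documented ranges of Stata's integer storage types.
INTEGER_RANGES = [
    ("byte", -127, 100),
    ("int", -32_767, 32_740),
    ("long", -2_147_483_647, 2_147_483_620),
]


def smallest_type(value):
    """Guess the narrowest storage type that holds a single typed value."""
    if float(value).is_integer():
        for name, low, high in INTEGER_RANGES:
            if low <= value <= high:
                return name
        return "double"  # integer too large for -long- (Hiroshi's example)
    return "float"       # non-integer, as in the 3.14159 example


for typed in (1, 345, 3.14159, 111222, 111222333444):
    print(typed, "->", smallest_type(typed))
```

By contrast, -generate- from the command line applies the default type unless one is named explicitly.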
Digression: -edit- and numeric and string variables
===================================================
There is a limit on this. Once the Stata editor has decided you
are on the numeric side, or on the string side, it won't change
its mind. For crossing that divide, surgery is needed, in the
shape of -destring- or -tostring-, for example.
As it happens, the little seed from which -destring- grew was
this. Given the advent of Stata's data editor, it became possible
to teach data entry using that directly. Faced with what looked
like a spreadsheet, many students relaxed and said to themselves,
"Oh, this is OK. I am used to this kind of interface." And they
did spreadsheet-like things like write a line or two of header
information. Stata was not fazed (or even phased) by this. "The
user wants a string variable here", it said to itself. And no
amount of numeric characters in later cells of that variable
caused it any anguish. Some time later, the failure of Stata to do
numerical calculations with that variable was traced back to the
fact that the user all along had a string variable. Remedy: zap
the header lines, and -destring-.
Summary
=======
-float- as the default has a stronger justification than you
imply.
The difference in behaviour between what is done from the command
line and -edit- comes down to designers' ideas of what is
reasonable in different environments. Both are regarded as
well-thought-out features, but the inconsistency can indeed bite on
occasion.
Nick
Hiroshi Maeda
>
> I am a little puzzled by the way Stata stores numerical data
> in float-type
> variables (I have read relevant portions of the manuals and
> the FAQ page). I
> have a few questions arising from my inability to understand
> the point of
> the float storage type and would appreciate it if anyone
> could show me what
> I am missing. Here are those questions:
>
> 1) Why doesn't Stata change (or "promote" in Stata parlance)
> the storage
> type of a variable from float to double when it is created by
> commands such
> as "generate" and even when the "promotion" is logical and
> desirable? Here
> is an example to illustrate my point. Suppose there is a data
> set with one
> observation with two variables. The first variable (VAR_1) is
> created by the
> user manually typing in the value 111222 into the data
> editor. The storage
> type of VAR_1 is "long." The second variable (VAR_2) is
> created by the user
> issuing the command, generate VAR_2=111222. The storage type
> of VAR_2 is
> "float" because that is the default option. Then suppose I
> enter the data
> editor and manually change the value of VAR_1 from 111222 to
> 111222333444.
> Stata will change the storage type of VAR_1 from "long" to
> "double", and
> the value that appears in the cell in the data editor is
> 111222333444; I say
> 111222333444 to Stata, and it understands me. However if I issue the
> command, replace VAR_2=111222333444, Stata won't change the
> storage type and
> the precision of the data gets lost; the value that appears
> in the cell in
> the data editor is 111222333440 rather than 111222333444. I
> say 111222333444
> to Stata, but it says back to me 111222333440. I know that Stata does
> actually understand me because Stata will say 1 if I issue
> the command,
> count if VAR_2==float(111222333444). But if I issue the
> command, list,
> Stata will list 111222333444 and 111222333440, when those two
> values should
> be identical in the log file. Like most users I work with log
> files and
> don't always have a lot of free time. When I say
> 111222333444, I would like
> to see Stata show 111222333444 without having me take extra
> steps (such as
> "float function" or "recast"). I fail to understand why Stata
> won't change
> the storage type in this situation to produce an accurate
> result; after all
> Stata is providing inaccurate information for VAR_2 here. I
> have used SPSS
> and SAS in the past and both programs understand me when I say
> 111222333444..., which, I think, isn't too much to ask.
> (Below I append a
> log file documenting steps I took to illustrate my point).
>
> 2) Given the situation described above, why isn't the
> "double" storage type
> the default option for numerical variables? I would think
> having "double" as
> the default option would eliminate many sources of
> inaccuracies in Stata
> data sets. For example, a program like Stat/Transfer can
> produce Stata data
> sets with many inaccurate values if the user fails to turn on the "Use
> Doubles" option (needless to say, that careless user is me, and I now
> realize that I have to run file conversion jobs again). But
> if the "double"
> type were the default option, there wouldn't be such
> problems. I don't see
> the point of the "float" type being the default option for numerical
> variables.
< for rest, see archives >