I am with StataCorp on this, not that they need my
support.
More importantly, exasperation on this score is not
a substitute for an explanation of what you think
Stata should be doing instead and a demonstration
of how that in turn would be unproblematic.
I do agree that long-standing users as well as new users can from time
to time forget the implications of what they write and
get bitten by this: me too, but that's not the main issue.
The main issue is the insinuation that Stata is being
illogical here. I can't see any basis for that in Allan's
or Tom's posts, however colourfully or strongly they write.
On the contrary, the objections to Stata's practice arise
because _logic_ is biting when users in fact _mean_ something else.
There is tension between syntax and semantics. But, simply yet
crucially, the problem is that what you _mean_ is in your head,
and Stata can only work on the basis of what you say.
The first step in Stata's position is that missing numeric
values must themselves be assigned a numeric value. This
follows from the fact that when -sort-ed on a numeric
value, observations with missings must go somewhere. Anywhere "in
the middle" of a numeric range would be absurd, so there are two
solutions:
missings could be treated as arbitrarily high, or as arbitrarily low.
Stata chose the first, as we know. If they had chosen
the second, that would have been equally justifiable: we
would just be fielding questions arising from the use of <, not >.
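
The first step can be checked directly. What follows is a minimal sketch with a made-up four-observation dataset (the variable name -x- and its values are purely illustrative):

```stata
* Toy data: three non-missing values and one missing value
clear
input x
50
.
10
42
end

* Missings are treated as arbitrarily high, so after sorting
* the observation with missing x is listed last
sort x
list, noobs
```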
The second step then concerns what is to be done with missings given
an inequality > or <. (The same issues arise with >= or <=,
but I won't spell that out below, as making the inequality weak
rather than strong is, I guess, an issue for no-one.)
It follows from Stata's first step that missings are
greater than any non-missing value, and so, for example, are
included in inequalities like
if x > 42
but not included in inequalities like
if x < 42
Now that's an awkward asymmetry, but logic doesn't feature in charm
school.
The consequences of a set of rules can be surprising or even unpleasant,
but there we go.
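
A small sketch of that asymmetry, again with made-up data (the variable -x- and its values are illustrative only):

```stata
* Toy data: 50, missing, 10, 42
clear
input x
50
.
10
42
end

* Counts 2 observations: the value 50 and the missing value
count if x > 42

* Counts 1 observation: the value 10 only
count if x < 42
```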
Now in practice it's often (but certainly not always) true that when
users write
if x > 42
they don't have in mind the missings. If so, then they just need to say
so.
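
Saying so takes only a few more characters. Here is a sketch of two common ways to do it, using the same kind of toy data as before; the -missing()- form assumes a version of Stata that provides that function:

```stata
* Toy data: 50, missing, 10, 42
clear
input x
50
.
10
42
end

* Exclude missings explicitly: counts 1 observation (the value 50)
count if x > 42 & !missing(x)

* Equivalent here: every missing value is >= . so x < . keeps
* only non-missing values
count if x > 42 & x < .
```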
But, seriously, what are the alternatives? Allan and Tom
seem to want "if x > 42" to ignore missings on -x-. If that
were so, then it would solve one problem only to replace
it with at least four others, on quite different levels:
1. Stata is now inconsistent. Missings are assigned precise
numeric values for some purposes (e.g. -sort-ing) but not others.
2. What, under that proposal, would be the truth value of
an expression, say
. > 42
We need to know that for all sorts of reasons, quite apart from the
selection of observations (which seems to be by far the most common
source of complaints under this heading). If that expression is to be
considered either true or false, then either decision implies
inconsistency with other parts of Stata. See also 1. (Another
alternative is some three-way logic.
I won't discuss that here, but StataCorp have, in my view rightly,
considered that closely and decided it's not the solution.)
3. Designing a language according to what users are supposed to mean,
rather than what they say, is, in my experience, a very long, very
slippery slope to perdition. If you are in charge of a program, you can
design it exactly the way you want. If you are offering the program to
others, the only way that will work well is if there is a mutually
understood logic.
(This mailer and a word processor I am obliged to use sometimes try to
guess what I mean, and both are confounded nuisances.)
4. Declaring this behaviour now to be a bug, or at least a misfeature,
would be a major change in Stata. Goodness knows how many scripts,
programs and understandings would be broken by such a change, even under
version control.
(Allan does flag this, but it's worth underscoring.)
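
On point 2, the truth values that Stata assigns as things stand can be inspected with -display-. This is only a sketch of current behaviour, not of any proposed alternative:

```stata
* Missing is treated as larger than any non-missing number
* displays 1 (true)
display . > 42

* displays 0 (false)
display . < 42

* An ordinary comparison for reference: displays 0 (false)
display 41 > 42
```

Any alternative rule would have to say what each of those expressions returns instead.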
I could go on, but this is long enough. There is no disputing the
irritation this aspect of Stata's language can cause. But I am at a
complete loss to know how it could be fixed without affecting -if-, or
the interpretation of >, or the treatment of missings in ways that would
be immensely more awkward than the problem complained of here.
Nick
Steichen, Thomas J.
===================
I'm with Allan on this. Implementing "if x>y" to evaluate
as True when x is missing is a logical flaw and should be
corrected. After 10+ years with Stata, I still occasionally
fall into this trap. I figure that if I can deal with -index()-
being changed to... now what was it?... I can deal with a
real flaw being fixed!
Allan Reese
===========
Others have pointed out that "if x>y" in Stata evaluates as True when x
is missing "."
I've raised this before and had to accept as a feature of Stata that "."
is a big number and "computers do what you tell them, not what you
want." Nevertheless, I remain of the opinion that it is
counter-intuitive, logically incorrect, and undoubtedly leads to
computer-assisted errors. Changing the operation of Stata now would
inconvenience most current users, but it would not be inconsistent if
the kernel were adapted to output a warning after such calculations
"Missing values included - check your results".
It's indeed a strange world where the priests of IT can claim "user
error" when you fall into a trap they set. Software will at some time
come under the remit of health and safety legislation - IT's doin' me
'ed in!