I think Allan still misses a real issue here, which arises
from two basic principles followed all the way through.
1. any self-respecting statistical program must allow
representations of missing values;
2. programmers and users alike find two-way logic
much easier to manage, in total, than three-way logic.
I guess no-one at all has any trouble with 1.
The issue is 2. The nuance to note -- crucial here --
is "in total".
As a matter of history, Stata decided that numeric missing
should have a non-zero representation (well, that's
essential); that it is regarded as very large positive;
and that, being non-zero, it is regarded as true. (To regard
missings as _either_ large positive _or_ large negative
is essential for at least one purpose, sorting, as
every observation must go somewhere.)
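A minimal sketch of those rules in action may help. The four-observation
dataset is invented purely for illustration, and -myvar- is just a
placeholder name.

```stata
* toy data, invented for illustration only
clear
input myvar
1
.
-1
0
end

* numeric missing is larger than any nonmissing number,
* so this counts 2 observations: the 1 and the missing
count if myvar > 0

* missing is non-zero, hence true: this counts 3 (1, . and -1)
count if myvar

* excluding missings explicitly leaves just the 1
count if myvar > 0 & myvar < .

* sorting puts the missing last, after all nonmissing values
sort myvar
list
```

The extra clause "& myvar < ." is the usual guard. It works because
every nonmissing value is less than missing.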
You can say that "missing => true" was
a bad design decision; "missing => false" is
presumably no better; so consider the alternative
of a three-way logic.
In examples like

    . list if myvar > 0

what the user wants, almost all the time, is
to see positive values. When that user gets
missings as well, the result is somewhere
between an irrelevant extra and, in other
contexts, not what was wanted at all, and
so a bug (although people seem to want
to blame Stata for the bug, not their own
not-quite-careful-enough programming). Let's
all agree: even very experienced users can get bitten
by this, as we temporarily forget the principles
which Stata is following rigidly and rigorously.
Me too.
Examples like this are indeed persuasive. One
is tempted to say that Stata should be smart
enough to divide values of -myvar- into
true, false and irrelevant (because missing)
and show only those which are true.
However, examples like this are not the point
at all. The point is to consider the
consequences of following such three-way logic
all the way through; or to decide on when
Stata should use three-way logic and when
it should use two-way logic, and how in turn
you explain that distinction.
Let's suppose Stata could do this. It
would ignore false _and_ "irrelevant"
and show only true, given that -list-
command.
Now suppose you want two conditions, in
some combination,

    . list if myvar > 0 & yourvar > 0
    . list if myvar > 0 | yourvar > 0

Now please, for this only slightly
more complicated situation,
1. fill in truth tables

        &          | true    false    irrelevant
        -----------+----------------------------
        true       |
        false      |
        irrelevant |

        |          | true    false    irrelevant
        -----------+----------------------------
        true       |
        false      |
        irrelevant |

2. imagine working with such combinations
for the rest of your Stata life.
3. imagine explaining this, repeatedly,
to other users, given that you are
probably the local expert.
No thanks!
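For comparison, here is a sketch of how the same two conditions behave
under the existing two-way logic, and how one extra clause excludes the
missings. The data are invented for illustration, and -myvar- and
-yourvar- are placeholder names only.

```stata
* toy data, invented for illustration only
clear
input myvar yourvar
 1  2
 1  .
 .  .
-1  3
end

* two-way logic: any missing counts as > 0
* counts 3: every observation except (-1, 3)
count if myvar > 0 & yourvar > 0
* counts 4: every observation
count if myvar > 0 | yourvar > 0

* the same conditions, restricted to observations
* with no missing in either variable
* counts 1: only (1, 2)
count if (myvar > 0 & yourvar > 0) & myvar < . & yourvar < .
* counts 2: (1, 2) and (-1, 3)
count if (myvar > 0 | yourvar > 0) & myvar < . & yourvar < .
```

Note that the guard here drops any observation with a missing in either
variable. Whether that is the right treatment for the "|" case, where
one variable may be known to be positive, is a judgment the user still
has to make explicitly.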
Incidentally, I don't think Allan's political excursus
explains what was supposedly a problem with
string variables.
Nick
Allan Reese
>
> On Wed, 7 Jan 2004, Bill Rising wrote [with RAR's inserts]:
> > ..., it would make Stata code easier to read and less prone to
> > error if people could code the [potentially RAR] incorrect
> >
> >     regress foo bar if snafu
> >
> > instead of the [intended? RAR] correct
> >
> >     regress foo bar if snafu & snafu < .
> >
> > for snafu being some sort of indicator which could be missing.
> >
> > I've used Stata long enough that the latter comes natural to me.
> > Still, I'd hate to see how many analyses have been found invalid
> > because of folks forgetting the extra 'less than missing' clause.
>
> My point exactly. It *is* documented, thus making it a feature, but
> who reads documentation? It is *known* to all members of this list?
> to all Stata users? I doubt it. It raises anomalies, as here:
>
> . gen m = var1>0
> . gen l = var1<0
> . list var1 m l
> | var1 m l |
> 1. | 1 1 0 |
> 2. | 2 1 0 |
> 3. | 3 1 0 |
> 4. | . 1 0 |
> 5. | -1 0 1 |
> 6. | 0 0 0 |
>
> Within the Stata language, "missing" is a positive number, but that
> is not a natural treatment of missing data. In the same way that
> "replace" by default reports "n values changed", I suggest it would
> be more sporting to report "missing values used in calculation -
> check answers".
>
> Since Nick insists I spell out the joke (?), we were told that the
> basis for invading Iraq was that wmd was definitely TRUE. It
> subsequently turns out that the data were incomplete or
> inconclusive. But if wmd>0 computes as TRUE for missing data, they
> can justify any political or management decision.
>
> I have had similar exchanges on the discussion list devoted to
> spreadsheet use. The techies say, "It's a documented feature, so
> everyone knows", and the managers say, "We got the answer from the
> computer, so it must be correct." There is a wonderful area of
> computer science devoted to *proving* programs are correct; I've
> never seen evidence of an automated procedure that is capable of
> checking that the correct variable was named in an expression or
> that the correct operator was used.
>
> History demonstrates that it is only after a sequence of disasters
> that "management" accept that systems should be self-checking and
> error avoiding. Relying on people to "do the right thing" in all
> circumstances is a proven recipe for disasters. WRT software, why
> can't we abridge the historic process?
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/