st: RE: RE: RE: RE: RE: Re: RE: may not use time-series operators on string variables

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	st: RE: RE: RE: RE: RE: Re: RE: may not use time-series operators on string variables
Date	Thu, 4 Jul 2002 17:12:09 +0100
Hakon Finne

> Stata has a number of data types (byte, int, long, float, double, str) but
> no explicit syntactical elements for variable types (nominal, ordinal,
> interval, ratio). For computational convenience, most of Stata's
> statistical and graphing procedures only work if the variable is stored as
> a non-string data type, even if some statistical concepts in themselves do
> not require numerical values. But some procedures (e.g. -tabulate-) work on
> strings as well and you might never suspect there could be a problem.
>
> The example from the current thread: Time series data on the state of a
> unit could be any of the four variable types. Events could then be
> calculated as a change in state from one time to another, but if you want
> to do this with time-series operators in Stata, the variable has to be
> stored as a numerical data type.
>
> As long as there is no syntactical way to distinguish the four variable
> types, there are other means. Value labels help translate numbers to text
> for the reader as a compensation for having to convert textual information
> to numerical form. The data management tools for performing these
> conversions in Stata abound but perhaps someone could think about drafting
> an FAQ or a tutorial on how to use them in the context of variable types
> (e.g., "What do I do with my categorical/nominal variables to make them
> work and display properly in Stata?"). (There are some already, e.g. on pie
> charts.)


Hakon's broadening of the question is very interesting. However,
if he wants Stata to encapsulate, or even to show a little respect for,
the nominal ... ratio scheme, then I have reservations.

To focus on what I believe to be the main point, the fourfold distinction
nominal/ordinal/interval/ratio (NOIR is a useful mnemonic for those who
see a black side to all this) was first proposed by the psychologist
S.S. Stevens in 1946, and revised intermittently in various small details
before and, in terms of publications, after his death in 1973.

http://www.nap.edu/books/0309022452/html/425.html

But despite being based, supposedly, on mathematical criteria,
it serves badly as a basis for modern data analysis.

This has often been pointed out, for example, in discussions started
by Velleman and Wilkinson (American Statistician, 1993-1995).

What is frequently problematic, in my view, is that this scheme, which
on one level is just classificatory terminology, is often associated with
a set of dogmas (dogmata?) on what are supposedly valid methods to use with
each data type (strictly, measurement scale). This matter seems highly
tribal: many texts and courses in (e.g.) psychology or sociology make a
great deal of it, but there are also equally numerate disciplines in which
it appears to be little known and little used. In fact, it seems to feature
less in the statistical literature (strict sense) than in literature in
several disciplines applying statistics.

Most mathematical statistics books make immensely more of the distinction
between discrete and continuous variables, which cuts across this scheme.
Even that distinction need not dictate all analyses. Population sizes may
jump up or down in steps of 1, but in itself this is no inhibition to
fitting curves based on differential calculus. Conversely, atmospheric
temperature sounds like an obviously continuous variable, until you notice
that in practice there is a resolution level with human-recorded values of
0.1 deg and that many observers prefer to write down even last digits
(0(2)8) rather than odd (1(2)9), which seems somewhat at odds with the
physics of heat.

There are in my view several things wrong or incomplete with the NOIR
scheme. This is not a full list.

1. The distinction interval/ratio is only rarely of importance. (However, I
recently saw writers agonise in print over negative coefficients of
variation for temperatures when the mean was below zero Celsius. This
served as a reminder that "rarely" does not mean "never".)

2. The category ordinal covers a wide range of possibilities. Pure ranks
(with no ties) have a very rich mathematical structure, while what might be
called grades (e.g. "excellent" "good" ... "execrable") are very different
in terms of what is usually appropriate either descriptively or in terms of
modelling. Lumping those together as ordinal is not very helpful. Also, the
principle that, when you and I grade (say) our favourite Stata commands, my
"excellent" is distinct from my "good" is separate from the principle that
my "good" is equal to your "good", which appears fundamental to proper
ordinal analysis, yet is (a) probably dubious and (b) perhaps untestable.

3. The scheme predates most modern categorical data analysis. Many
explanations miss the elementary but also fundamental flexibility which
we have in _representing_ categorical data in different but equivalent
ways. Thus while {"male", "female"} looks nominal, (sex == "female")
yielding 1 or 0 is something we can quite happily take averages of or
include in regression or other models, as is frequency of females. That
example should be widely familiar, but the principle is more general. To
put it another way, naive accounts suppose that variables are necessarily
or inherently of a variable type, but this conflicts with much of what we
know about scientific and statistical practice.

4. Many kinds of variables do not fit into the scheme easily, if at
all. Variables measured on the circle or sphere as outcome space
are one example. Perhaps even more widely used are scores based on
the sum of many separate test items, as seen in education, medicine,
psychology, etc., etc. Purists often doubt whether such scales are even
ordinal, while universities and medics often act as if they were interval
or even ratio scales.

5. Percents and proportions have special properties which lie
outside the scheme.
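
As a minimal sketch of point 3 above, assuming made-up data (the five
observations and the variable name female are invented here purely for
illustration), a string variable that looks nominal can be turned into a
0/1 indicator whose mean is a proportion:

```stata
* Hypothetical toy data: a string variable that looks purely nominal
clear
input str6 sex
"male"
"female"
"female"
"male"
"female"
end

* The logical expression evaluates to 1 where true and 0 where false.
* The qualifier leaves the indicator missing wherever sex is empty
* (no such observations in this toy data).
generate byte female = (sex == "female") if sex != ""

* The mean of the indicator is the proportion of females (3 of 5 here)
summarize female
```

The same indicator could equally be used as a predictor in a model, which
is the sense in which the representation, rather than the variable itself,
determines what can be done with it.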

As Hakon points out, Stata's distinctions between different types are based
on how values are stored, a computing issue which may be of
little or no direct concern to most people using statistical methods.
Some statistical languages go much further than Stata in having
variable types (or the equivalent in their terminology) such as
factors and ordered factors, distinctions clearly based on statistical
meaning.

I don't know why Stata does not do this, and what exists now clearly
does not rule out future features _permitting_ users to declare that their
variables are of particular statistical type, but I can speculate:

1. It didn't seem a very interesting or important project, or there
wasn't a consensus on the matter.

2. It was too difficult to implement without raising or causing more
problems than it solved.

3. There is a Stata tradition of assuming that users know what
they are doing, which might include breaking or stretching
somebody else's rules. There is always of course a downside, or
an arguable side. As a teacher, I have often wished that a variable
produced by -encode- could never be used _as is_ as a variable in
regression or correlation, and that Stata would send a "howler", Harry
Potter style, to anyone who tried to do this. But it is also easy to
imagine situations in which this could be very reasonable, as when grades
"A" to "E" are mapped to 1/5. (Flipping to 5/1 is a matter of 6 - grade.)
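
The grades example might be sketched as follows. This is only an
illustration under stated assumptions: the data, the label name gradelbl and
the variable names gradenum and gradeflip are all invented here. Defining
the value label first makes the mapping explicit instead of relying on
-encode- assigning codes in alphabetical order.

```stata
* Hypothetical toy data: grades held as a string variable
clear
input str1 grade
"A"
"C"
"E"
"B"
"D"
end

* State the mapping explicitly (assumption: "A" is 1, ..., "E" is 5)
label define gradelbl 1 "A" 2 "B" 3 "C" 4 "D" 5 "E"

* -encode- uses the existing label to assign the numeric codes
encode grade, generate(gradenum) label(gradelbl)

* Flip the scale so that "A" scores 5 and "E" scores 1
generate byte gradeflip = 6 - gradenum

* Show the underlying numbers rather than the value labels
list grade gradenum gradeflip, nolabel
```

Whether either numeric version deserves to be treated as more than ordinal
in a regression remains the user's call, which is the point at issue.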

As for tutorials, I am currently writing something for the Stata
Journal on numeric and string variables.
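
On the problem that started this thread, a rough sketch of one possible
route, assuming a made-up yearly series (the data and the variable names
state, statenum and event are hypothetical): hold the state as a numeric
variable with value labels, after which time-series operators can be
applied.

```stata
* Hypothetical toy data: the state of one unit, recorded yearly as a string
clear
input year str8 state
1998 "employed"
1999 "employed"
2000 "retired"
2001 "retired"
end

tsset year

* Time-series operators such as L. cannot be applied to the string
* variable state, so first create a labelled numeric copy
encode state, generate(statenum)

* An event is a change of state since the previous year;
* the first year has no previous value, so event is left missing there
generate byte event = (statenum != L.statenum) if L.statenum < .

list year state statenum event, nolabel
```

Here only the fact of a change is used, so the arbitrary numeric codes
assigned by -encode- do no harm.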

On the specific matter of graphics, categorical data held as string
variables might feature in graphs in two fundamentally different ways:

a. They define classes, and the main concern is to show the
associated frequencies.

b. They define identifiers, and the main concern is to show
these in a legend.

Arguably, official Stata has neglected both kinds of graph.
For example, neither -graph, bar- nor -graph, pie- is about showing
frequencies directly, although each can be persuaded into
doing that. The FAQ which Hakon alluded to at

http://www.stata.com/support/faqs/graphics/piechart.html

refers to various approaches, and more could be said. But what are
programmers not providing? (No requests for three-dimensional pie charts
will be entertained by the undersigned.)

Nick
[email protected]
