Something I have seen locally seems to deserve a wider note.
Users may fire up a box plot in Stata, realise that a logarithmic scale
would be better for their variable, and then ask for that by
-ysc(log)- or -xsc(log)-.
Arguably,
1. Stata will let you do this, but in a sense it should not. Almost
always, the result will be not quite what you want, or what you would
want if you were concentrating fully.
2. For private exploration, the difference may be of little
consequence, but even then it is possible to be puzzled or even misled.
3. Doing it properly, especially for public reports, is possible, and
not too difficult.
The point at issue is probably too much of a confusing complication
for the elementary books, and too obvious or too trivial for smarter
writers to bother about, so it can fall between those two stools. If
anyone knows of a discussion in print, I would appreciate a reference.
(Incidentally, I have noticed an astonishing trend, that increasingly no
introductory statistics book is considered complete without colour
photos of smiling people of different kinds, even if completely
irrelevant to the material discussed!)
In what follows I assume that your variable is all positive, because
otherwise a logarithmic scale is not defined. As you may have
noticed, if -graph- is asked to -?sc(log)- when zero or negative values
are present, it just gives you a ridiculous graph, rather like the kind
of teacher who will not say "That was a stupid comment", but just give
you a funny look which clearly means, "Do think about that a bit more".
The main issue is one of division of labour. -ysc(log)- and -xsc(log)-
just take the graph you would have got otherwise and warp it
logarithmically. However, what neither does is to re-calculate summaries
on the log scale. In a sense, your punishment here is that you got
what you asked for.
Recipes for box plots differ from book to book and program to program.
Back in 1989 Frigge and friends catalogued several variants in _The
American Statistician_, and no doubt a careful trawl would reveal
some they missed and others that have arisen since. Stata follows what
John Tukey settled on after trying various possibilities, most
importantly here that a data point is plotted separately if it lies more
than 1.5 times the interquartile range away from the nearer quartile. If
you re-do this on a logarithmic scale, you will almost always get a
different answer whenever such points exist, and sometimes even if they
do not. Some high values plotted separately may jump back inside the
main box-and-whiskers cluster. Some low values may even jump out of that
cluster and now be plotted separately. The re-classifications reflect
the fact that the interquartile range of the logarithms is not in
general the logarithm of the interquartile range.
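A rough way to see the re-classification for yourself, assuming the -auto- data shipped with Stata. The variable log10price and the scalar names are just illustrative, and the quartiles from -summarize, detail- are used here as a stand-in for those drawn by -graph box-:

```stata
sysuse auto, clear
generate log10price = log10(price)

* fences at 1.5 IQR beyond the quartiles, calculated on the raw scale
quietly summarize price, detail
scalar lo_raw = r(p25) - 1.5 * (r(p75) - r(p25))
scalar hi_raw = r(p75) + 1.5 * (r(p75) - r(p25))
count if (price < lo_raw | price > hi_raw) & !missing(price)

* the same recipe, calculated on the log scale
quietly summarize log10price, detail
scalar lo_log = r(p25) - 1.5 * (r(p75) - r(p25))
scalar hi_log = r(p75) + 1.5 * (r(p75) - r(p25))
count if (log10price < lo_log | log10price > hi_log) & !missing(log10price)
```

If the two counts differ, some points have changed sides between the two scales.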
The same issue can affect, although usually to a lesser extent, the
calculated median and quartiles. Each can be based on interpolation
between data points, and so it is not always true that (for example)
the median of the logarithms is _exactly_ the same as the logarithm of
the median. Admittedly, unless your data set is very strange and very
small, you would not usually be troubled by the difference, but only for
the minimum and maximum is there absolutely no problem.
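A deliberately strange and small example makes the interpolation point concrete. The four values below are made up for illustration; with an even number of observations the median is the average of the middle two:

```stata
clear
input x
1
10
100
1000
end

generate log10x = log10(x)

* median of x is (10 + 100) / 2 = 55; then take its logarithm
quietly summarize x, detail
display "log10 of the median  = " log10(r(p50))

* median of the logarithms is (1 + 2) / 2 = 1.5
quietly summarize log10x, detail
display "median of the log10s = " r(p50)
```

The first result is about 1.74 and the second is 1.5, whereas the minimum and maximum agree exactly on either route.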
The way to do it properly is thus to take logarithms first, say into a
new variable log10price, and then to draw the box plot of that.
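A minimal sketch of taking logarithms first, assuming the -auto- data shipped with Stata and using the variable name log10price that appears in the label example just below:

```stata
sysuse auto, clear

* transform first, so that quartiles, whiskers and outside values
* are all calculated on the log scale
generate log10price = log10(price)
graph box log10price
```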
Here -log10()- has the marginal advantage over -log()- (or -ln()-) that
many users can do the inverse in their heads, thinking 4 means 10^4 =
10000, or whatever, so that you can add stuff like
. graph box log10price, yla(4 "10000")
but not many of us can remember more than the integer powers here.
A cute trick is to force Stata to do the calculation on the fly, by
writing an inline expression of the form `=exp' wherever a label
position is needed, but if I show my colleagues that, they seem to
regard it as a bit of a joke: it is admittedly tedious, even if you can
remember and understand the syntax. (The relevant help here is at
-help macro-.)
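As a sketch of what on-the-fly calculation can look like, assuming the -auto- data and label values chosen only for illustration (the original example may well have used others):

```stata
sysuse auto, clear
generate log10price = log10(price)

* each inline expression is evaluated before -graph- sees the command,
* so the label positions are worked out by Stata rather than by hand
graph box log10price, yla(`=log10(5000)' "5000" 4 "10000" `=log10(15000)' "15000")
```

Every extra label needs its own expression and its own quoted text, which is where the tedium comes from.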
To get the best of all worlds, you will want several "nice" axis labels
on the original scale, with Stata doing all the calculations. One way of
getting those is through the program -mylabels- on SSC. With -mylabels-
you just say what you want shown and what is the scale in use, and the
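program then does the harder bit. For example, if I say which prices I
want shown and that the scale in use is their logarithm, -mylabels-
works out the label specification. You should not retype that, or even
copy and paste, because it is tucked safe inside a local macro.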
. graph box logprice, yla(`labels', ang(h))
I often find it takes a few goes to get it right, but all that
is needed is to reissue -mylabels- until you do. Use the same
local macro name. Stata is happy to overwrite it, as local macros
are totally expendable.
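Putting the pieces together, a sketch of the whole sequence might run as follows. It assumes the -auto- data, that logprice holds base 10 logarithms (the base is my assumption here), and label values picked only for illustration:

```stata
* one-off installation from SSC
ssc install mylabels, replace

sysuse auto, clear
generate logprice = log10(price)

* say what should be shown (dollars) and what scale is in use;
* @ stands for each value in turn
mylabels 4000 6000 10000 15000, myscale(log10(@)) local(labels)

graph box logprice, yla(`labels', ang(h))
```

If the labels look too crowded or too sparse, change the list of values, rerun -mylabels- with the same local(labels) and redraw.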
The same issue with box plots arises with any nonlinear
transformation, although logarithms are the most common in
practice and the most tempting in Stata given -?sc(log)-.
Thus psychologists and some others work with times taken by
rats or students to complete a task. The distributions are
often highly skew and those who do not complete a task should be
assigned missing values. The reciprocal of time is a speed: note that
missing times can be recoded as zero speeds. Here again you would need
to do the transformation yourself and quite possibly fix the
axis labels too.
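A small sketch of the reciprocal case, with made-up times and a hypothetical variable name time, in which one subject did not complete the task:

```stata
clear
input time
12
15
20
30
60
.
end

* reciprocal of time is a speed; a missing time (task not completed)
* is recoded as a speed of zero
generate speed = cond(missing(time), 0, 1 / time)

graph box speed, ytitle("speed (1 / time)")
```

Labels in terms of the original times could then be added in the same way as for logarithms, for example through -mylabels- with myscale(1/@).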