Phil Ryan mused generally in the light of a question
from Daniel Sabath:
> As I think Nick Cox has pointed out recently, Stata's tabulation
> facilities are somewhat scattered and it can be difficult to find
> exactly what you want among the myriad of official and unofficial
> commands. My own opinion is that, usually, user-written add-ons are
> a *very* good thing and add immeasurably to Stata's functionality.
> But tabulation is such a basic and important tool that a more
> unified system is needed. Many of us have written front-ends to
> _tabdisp for particular functions that -table- does not support, but
> (i) _tabdisp itself is limited and (ii) there is no unified
> <command/subcommand/option> construct to allow a reasonable choice
> of presentations of tabular material. (One has in mind the v8
> graphics subsystem - complex, admittedly, but now allows a deal of
> control over the end-product). In Dan's example below, what we have
> is essentially a collection of Rx2 subtables appended, that is, we
> have a sex X smoker table then an age group X smoker table and then
> perhaps other subtables. This is often the format given as "Table
> 1" of a published paper wherein the baseline characteristics of two
> or more groups are displayed. Stata can produce the subtables, but
> (I think) not the end-product, because Stata's tables are all about
> complete cross-classifications, whereas the display we want here has
> cross-classifications within a subtable but not between subtables.
>
> In summary I can imagine a tabulation subsystem in Stata that
> supports a user-defined output - contents and layout - for
> presentation. Imagination is, of course, cheap.
Imagination is where ideas come from!
I agree, as would be expected, with the general diagnosis here.
I also agree that at least for certain tabulation tasks the
needs go beyond what amateurs can do with Stata's own language,
so that we need a major input from Stata Corp.
However, in the spirit of Phil's later comments, let's talk
specifics. Here is a first PARtial list of a miserable seven Problems,
what can be done with Available material and what seems
Required. Join in with your own additions (or subtractions).
Problem 1: awareness
====================
I think one of the major problems users face is just to be aware of
what is possible, given the multiplicity of commands.
Available solutions: At some point, there is no substitute
for reading the manual and playing with the existing
commands, e.g. so that you know the strengths and weaknesses
of -tabulate-, -table-, -tabstat-, -tabdisp- etc. (and
-list- etc.). Some articles in the Stata Journal aim
to provide comparative material.
Required solutions: More documentation of various kinds!
More FAQs please. Anyone who was willing to write a book
on Stata tabulation tasks and tricks would not make the conceptual
breakthrough which Deans and Chairs expect, but they would
be able to start financing their retirement home.
Problem 2: combining tables
===========================
As Phil has clearly highlighted, one common need is to put
together what are in effect sub-tables into combined tables.
It could be argued that Stata should not interfere between
you and your word or text processor; anyway, at first sight
it offers next to no tools for doing this.
Available solutions: ... except that, in a sense, there
is a bunch of commands for joining tables so long as they
are (expressible as) Stata matrices. This line of attack
is probably under-appreciated; at the same time, it
falls short of what I guess people often need here.
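As a rough sketch of the matrix route, assuming the nlsw88 dataset shipped with Stata stands in for Dan's data (union plays the part of smoker; the matrix names A, B and T and the row names are my own choices):

```stata
* Assumption: nlsw88 (shipped with Stata) stands in for the real data;
* union plays the role of the two-group column variable.
sysuse nlsw88, clear

* Save the cell frequencies of each R x 2 subtable as a matrix.
quietly tabulate married union, matcell(A)
quietly tabulate collgrad union, matcell(B)

* Append the subtables with the row-join operator.
matrix T = A \ B

* Label rows (as equation:name) and columns, then display.
matrix rownames T = married:single married:married collgrad:nongrad collgrad:grad
matrix colnames T = nonunion union
matrix list T
```

This carries frequencies only; percents and nicer layout need further work, which is where this approach starts to fall short.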
Required solutions: a whole mini-language for combining
tables. In effect tables could be seen as objects
and there would be a set of operations for combining
them, with tunable control of output form: e.g.
join along rows; join along columns; layer. Each
combination would take care of alignment, and so be
more than what anybody could do as a cut/copy/paste
exercise. I guess that this would be a substantial
project for Stata Corp. -graph combine- is a partial
analogue.
(But there's more, such as elementwise addition,
subtraction, multiplication, division of tables...)
Problem 3: multiple variables
=============================
Stata does not offer much support for tabulating
frequency / proportion / percent results from
several variables simultaneously. Suppose (e.g.) I have
variables on trips to theatre, cinema, opera house,
funfair, etc. and I want a single table for all
variables so I can compare frequency distributions.
Available solutions: Some user efforts. Much can
be done once you see that a different data structure
is often the key (-stack-, -reshape- etc.), but
most users understandably prefer getting results on the fly
to mapping to a different data structure. (Even seeing
that you need a different structure can depend on
a lot of experience. Doing the restructuring can be
tricky too.)
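A minimal sketch of the restructuring idea, assuming four 0/1 indicators in the nlsw88 dataset shipped with Stata stand in for the theatre, cinema, etc. variables (the names value and source are arbitrary):

```stata
* Assumption: four 0/1 indicators in nlsw88 stand in for the
* theatre/cinema/opera/funfair variables of the example.
sysuse nlsw88, clear

* stack piles the four variables into one (value) and records
* which variable each block came from in _stack (1 to 4).
stack married collgrad union south, into(value) clear

label define source 1 "married" 2 "collgrad" 3 "union" 4 "south"
label values _stack source

* One table: one row per original variable, with row percents.
tabulate _stack value, row
```

-stack- replaces the data in memory, so in practice you would probably wrap this in -preserve- and -restore-.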
Required solutions: Stata Corp to take this seriously!
Problem 4: sorting
==================
Sorting on the margins is often of limited analytical use.
To see patterns, rather than to provide easy look-up
(what is the population of Texas? Look under "Texas"...),
you often need to sort tables on their contents (i.e.
cell entries).
Available solutions: -tabulate, sort-. Some user
efforts. In general, this is not provided very
widely.
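For illustration, using the auto data: the first command is the official route for one-way frequencies; the second part is only a sketch of one way to sort a table of means on its contents, by reducing to one observation per group first (the variable name n is my own).

```stata
sysuse auto, clear

* One-way frequencies, listed in descending order of frequency.
tabulate rep78, sort

* A table of means sorted on its contents rather than its margin:
* reduce to one observation per group, sort, then list.
collapse (mean) mpg (count) n = mpg if !missing(rep78), by(rep78)
gsort -mpg
list rep78 n mpg, noobs
```

Note that -collapse- replaces the data in memory.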
Required solutions: Stata Corp to take this seriously!
Problem 5: cell composites
==========================
What I call cell composites are cells containing
values from two or more variables, whether variables
in your dataset or temporary variables constructed by
the command running. In Daniel Sabath's
example which started this thread, he wanted cells
with concatenated strings
<cell freq> (<row percent>)
This is quite distinct cosmetically from what
might be called cell stacks
<cell freq>
<row percent>
In general, Stata directly supports cell stacks, but
not output like the first form. Cell stacks can
be more space-consuming and difficult to read in
some circumstances, although it is also easy to
run out of space with the first form.
Available solutions: Much is possible once
you see that setting tabulation up as a display
of string variables is the key. However, this
requires some prior manipulations and indeed
moderate fluency with some Stata basics. Canned
solutions, whether official commands or
user-written programs, appear lacking.
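A sketch of the string variable approach to cell composites, assuming the nlsw88 dataset shipped with Stata as a stand-in (race by union rather than Dan's variables; rowtotal and cell are names I made up):

```stata
sysuse nlsw88, clear

* One observation per cell, with the cell frequency in _freq.
contract race union if !missing(union)

* Row totals, then the composite "<cell freq> (<row percent>)" as a string.
egen rowtotal = total(_freq), by(race)
generate str20 cell = string(_freq) + " (" + string(100 * _freq / rowtotal, "%4.1f") + ")"

* tabdisp shows whatever is in the cell variable; widen the cells
* so that the longer strings are not cut short.
tabdisp race union, cellvar(cell) cellwidth(14)
```

The manipulations are not difficult, but you do have to know that -contract-, -egen- and string() exist.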
Required solutions: Support for output specifications,
i.e. if I want a table to show
<cell freq> (<row percent>)
something like
"#1 (#2)"
would specify "the first number followed
by a space followed by an opening parenthesis followed
by the second number followed by a closing parenthesis".
(Naturally there is a danger of reinventing e.g.
TeX's tabulation syntax.)
Problem 6: cell text
====================
Think of the number of ways in which you
might specify substantive missings as one
example. Depending on the boss's whims, the
house rules, the journal's prescribed
style, your own tastes, you could want
NA
or
--
or
(no data)
etc., etc. This is an example of how, even in a
numeric table, you often want extra text. Or think of
cell entries which are footnoted.
Available solutions: As with Problem 5,
much is possible once you see that setting tabulation
up as a display of string variables is the key. However, this
requires some prior manipulations and indeed
moderate fluency with some Stata basics. Canned
solutions, whether official commands or
user-written programs, appear lacking.
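A sketch of one way to get text into otherwise numeric cells, using the auto data, where foreign cars happen to have no rep78 values of 1 or 2 (the variable name show and the marker "--" are arbitrary choices):

```stata
sysuse auto, clear

* One observation per cell holding the mean of mpg.
collapse (mean) mpg if !missing(rep78), by(rep78 foreign)

* Add the empty cells as observations with missing mpg.
fillin rep78 foreign

* Text for display: the formatted mean, or a marker where there are no data.
generate str10 show = cond(missing(mpg), "--", string(mpg, "%4.1f"))

tabdisp rep78 foreign, cellvar(show)
```

Swapping "--" for "NA" or "(no data)" is then a one-word change, and a footnote marker could be appended to a cell in the same way.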
Required solutions: Stata Corp to take this seriously!
Problem 7: table design
=======================
In fact, we can easily extend this. This last problem
is really a rag-bag of all sorts of small and large
design issues, such as
    support for different fonts and bold, italic, etc.
    different kinds of divider and separator
    control of titles, subtitles, notes, etc.
    control of margin layout
    multiple formats
A very simple example of the last is with -tabstat-.
If I go
. tabstat mpg, by(rep78) s(n mean sd)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |         N      mean        sd
---------+------------------------------
       1 |         2        21  4.242641
       2 |         8    19.125  3.758324
       3 |        30  19.43333  4.141325
       4 |        18  21.66667   4.93487
       5 |        11  27.36364  8.732385
---------+------------------------------
   Total |        69  21.28986  5.866408
----------------------------------------

then it's clear that the number of decimal
places is silly for mean and sd. Specifying
one d.p. is easy
. tabstat mpg, by(rep78) s(n mean sd) format(%2.1f)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |         N      mean        sd
---------+------------------------------
       1 |       2.0      21.0       4.2
       2 |       8.0      19.1       3.8
       3 |      30.0      19.4       4.1
       4 |      18.0      21.7       4.9
       5 |      11.0      27.4       8.7
---------+------------------------------
   Total |      69.0      21.3       5.9
----------------------------------------

but now the format of N is ill-chosen. And it is common to want
yet other formats for other cells:
. tabstat mpg, by(rep78) s(n mean sd skew kurt) format(%2.1f)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |         N      mean        sd  skewness  kurtosis
---------+--------------------------------------------------
       1 |       2.0      21.0       4.2       0.0       1.0
       2 |       8.0      19.1       3.8       0.2       1.6
       3 |      30.0      19.4       4.1       0.4       3.1
       4 |      18.0      21.7       4.9      -0.1       2.0
       5 |      11.0      27.4       8.7      -0.0       1.6
---------+--------------------------------------------------
   Total |      69.0      21.3       5.9       1.0       4.0
------------------------------------------------------------

Here one might want 2 d.p. for skew and kurt, at least
cosmetically.
Available solutions: There is a territorial issue here,
as with Problem 2, on how far Stata should get into terrain
which normally you would negotiate with (or in some cases
without) the assistance of your word or text processing software.
A lot can be done with SMCL, but, whether for one-off tasks or
for repetitive tasks, that often requires Stata programming or at
least considerable Stata expertise. Multiple formats are
fairly easy to implement; one example can be seen in -makematrix-
from SSC.
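Short of installing anything, here is a sketch of one do-it-yourself route to multiple formats with the auto data: hold each statistic in its own variable and give each its own display format. (This is not how -makematrix- works; the variable names are my own, and there is no Total row.)

```stata
sysuse auto, clear

* One observation per group, one variable per statistic.
collapse (count) mpg_n = mpg (mean) mpg_mean = mpg (sd) mpg_sd = mpg ///
    if !missing(rep78), by(rep78)

* Each statistic gets its own display format.
format mpg_n %3.0f
format mpg_mean mpg_sd %5.1f

* list honours the display format of each variable.
list rep78 mpg_n mpg_mean mpg_sd, noobs
```

Note the line continuation with ///, which works in a do-file but not when typing interactively.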
Required solutions: Mostly, the finger points at Stata Corp,
again. But user-programmers can do more here than is
sometimes appreciated.
Nick