As signalled on Tuesday in response to
a question from Don Spady, I have rewritten
-vplplot-, which has been on SSC for some years,
as -pairplot-, which uses the new graphics
in Stata 8. -vplplot- remains in place for any
remaining users of Stata 6 or Stata 7 who might
want to use it.
Thanks to Kit Baum, -pairplot- is now available
from SSC. A discussion of its rationale follows
my signature.
Nick
-pairplot- is a simple utility for comparing paired observations
graphically, especially when the interest lies in assessing
agreement or disagreement between measurements on the same scale.
-pairplot- is a reworking for Stata 8's new graphics of a program
called -vplplot-, which has been in existence for some years. (In
its present and indeed last form on SSC it requires Stata 6.0.)
The main stimulus for writing what is now -pairplot- was the
excellent paper by Don McNeil in The American Statistician in 1992.
In explaining the idea, consider graphs not just as statements
    The data are ... .
but as answers to questions
    How far are the data ... ?
Given two responses, say, y1 and y2, the scatter plot
    y1 vs y2
preserves the information on pairing (in contrast to, say, qqplots or
side-by-side dotplots, box plots, histograms, etc., which lose the
information on pairing). As is well known, the scatter plot can be
used to answer many questions. One that is emphasised greatly is
clearly
    y = a + bx ?
or more generally
    y = f(x) ?
and I will focus on this more discussed case -- which I will call
the regression question -- before returning to the agreement
question, which at its simplest is
    y1 = y2 ?
or sometimes
    y1 = y2 + c ?
or sometimes
    y1 = k * y2 ?
A point often emphasised is that for the regression question a
scatter plot is in some ways inefficient. If we rephrase the
question as (e.g.)
    y - (a + bx) = residual = 0 ?
then a plot of (e.g.)
    residual vs (a + bx)
is in many ways more direct as an answer to the question. Three
points are of particular interest about this residual vs fitted plot
-- which remarkably seems to go no further back than the early
1960s.
1. Generally, the residual vs fitted plot does quite well in
serving two broad goals -- allowing both general patterns and
particular details to be evident, and working well at a range of
sample sizes.
2. The quantities of most relevance for answering the question are
the residuals, which are shown directly on the vertical axis.
3. There is a horizontal reference line for comparison. The eye and
brain are good at detecting departures from reference lines, and
especially good at detecting departures from the horizontal. (The
tilted regression line has this limitation: even statistical people
who understand the theory sometimes forget when interpreting a
scatter plot that departures from a regression line must be assessed
vertically, not horizontally or orthogonally. I will not digress
here to discuss other summary lines.)
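As a minimal sketch of the residual vs fitted plot described in points
1 to 3, using only official Stata commands: the choice of the auto data,
and of mpg and weight as response and predictor, is an assumption made
purely for illustration, as are the variable names fitted and residual.

```stata
* Illustrative only: mpg and weight are arbitrary choices from the auto data
sysuse auto, clear
regress mpg weight

* Residual vs fitted plot with a horizontal reference line at zero
rvfplot, yline(0)

* The same plot built by hand, to show what is on each axis
predict fitted, xb
predict residual, residuals
scatter residual fitted, yline(0)
```

The hand-built version makes point 2 explicit: the residuals themselves
are on the vertical axis, and yline(0) supplies the horizontal reference
line of point 3.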
Returning now to the scatter plot as an answer to questions like
    y1 = y2 ?
    y1 = y2 + c ?
    y1 = k * y2 ?
my assertion is that it is an indirect answer to these questions. We
could try training ourselves to decode the vertical distances
    y1 - y2
    y1 - (y2 + c)
    y1 - (k * y2)
    log y1 - (log k + log y2) (given log scales)
but I suggest that it would be hard work. The task, when looking
at a scatter plot, is not just to read off the distance for any
individual data point, but also to see the whole pattern of these
distances, which are the quantities of most relevance for answering
the particular agreement question. This points up the value of
showing these distances explicitly on a plot. -pairplot- supports
plots with
    y1 and y2, linked vertically, on the y axis
  or
    (y1 - y2), shown vertically, on the y axis
  or
    (y1 / y2), shown vertically, on the y axis
and
    order of observations (_n) on the x axis
  or
    a specified variable on the x axis
  or
    sort order on some varlist (ascending/descending) on the x axis
  or
    mean (y1 + y2) / 2 on the x axis
  or
    geometric mean sqrt(y1 * y2) on the x axis, provided that
    y1, y2 > 0
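To see which quantities are involved, two of these combinations can be
sketched by hand with official commands. This only illustrates what
goes on each axis; it is not a statement of how -pairplot- is
implemented. The bpwide data (with bp_before and bp_after) are an
assumed example of paired measurements on the same scale, and the
variable names diff, avg, ratio and gmean are invented here.

```stata
* Assumed example data: paired blood pressure readings shipped with Stata
sysuse bpwide, clear

* Difference vs mean, with a horizontal reference line at zero
generate diff = bp_before - bp_after
generate avg  = (bp_before + bp_after) / 2
scatter diff avg, yline(0)

* Ratio vs geometric mean; both variables must be positive
assert bp_before > 0 & bp_after > 0
generate ratio = bp_before / bp_after
generate gmean = sqrt(bp_before * bp_after)
scatter ratio gmean, yline(1)
```

The reference lines at 0 and 1 correspond to the simplest agreement
question, y1 = y2, on the difference and ratio scales respectively.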
Some of these graphs are well known, at least in various branches of
the literature. The plot of difference vs mean has often been
recommended (especially by Bland and Altman) in medical statistics.
The idea goes back at least as far as John Tukey, around 1965.
These plots arguably all satisfy point 1 and point 2 above and many
satisfy point 3 above.
One example which may be of interest comes from the auto data. I
looked at the relationship between length and turn, first putting
them in the same units (turn is in feet, length in inches):
. replace length = length / 12
. label var length "Length (ft.)"
. pairplot turn length, ratio
shows that most values of the ratio are near 2.5, with one car very
much lower. This is made more obvious by
. pairplot turn length, ratio base(2.5)
Adding an extra option
. pairplot turn length, ratio base(2.5) mlabel(make)
made it clear that the Chevrolet Malibu has a very low ratio of
turn to length; experts may be able to comment.
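One hedged way to check the low end of the ratio without any graph is
to recompute it by hand and list the smallest values. The sketch below
uses only official commands; the variable name ratio is invented here,
and it assumes a fresh copy of the auto data in which length is still
in inches.

```stata
* Start from a fresh copy of the auto data (length still in inches)
sysuse auto, clear
replace length = length / 12
label var length "Length (ft.)"

* Ratio of turn to length, both now in feet
generate ratio = turn / length
summarize ratio, detail

* List the cars with the smallest ratios
sort ratio
list make turn length ratio in 1/3
```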