An initial question on histograms led to a (to me, surprisingly)
vigorous thread. Here is a personal, partial summary, with some
new material.
The principle behind histograms is that the area of each
bar represents the fraction of a frequency (probability)
distribution within each interval (bin, class). This is
standard. It is not part of the definition that all
intervals have the same length. Yet in practice most
histograms produced or published do have equal length
bins. Users of official Stata in particular have only been offered
options to tune the number of bins (Stata <= 7) and/or the (constant)
width of bins (Stata 8). In short, official Stata does not,
apparently, allow bins of unequal length.
Why? Various arguments for and against this may be identified.
1. Consistency argument. The choice of bin width
is often a little arbitrary. In an important special
case, the variable is discrete, in which case
1 is often the obvious and natural choice. Even then
discrete variables may require some choice of interval.
If the variable is number of lifetime sexual partners,
then the tail (apparently) stretches into very large
numbers and some grouping may be desired. But in the
case of continuous variables especially, there is
certainly arbitrariness. Many statistically-minded
people are most reluctant to compound this by varying
the length of the intervals. Varying them, it may be said,
complicates the interpretation of the histogram,
because the bars are no longer all produced in the same way.
Or to put it another way, equal widths are relatively
simple and any kind of complexity beyond them needs
to be justified.
2. Structure of the data argument. On the other hand,
sometimes the data come grouped into irregular intervals
and the researcher has little or no choice. The raw data
may be difficult or impossible to access. The Stata
user then wants a histogram (correctly drawn, naturally).
How can this be done?
3. Sampling variation argument. If we regard the
histogram as a crude estimator of the density function,
then it might make sense to vary bin width to
match the structure of variation, in effect varying
how we average probability density locally. (Of
course, another answer may be that you should try
another kind of graph or a transformation.)
4. Equal probability argument. There is at least one
other way to build a histogram in a simple, systematic
way: use quantiles equally spaced on a probability
scale. That way, each bar represents the same area.
Unless our data come from a uniform distribution,
the bin lengths will inevitably be unequal.
Where do we stand in terms of what we can do in
Stata? Working backwards,
4. Thanks to Kit Baum -- and to Vince Wiggins
and to Marcello Pagano for comments of various
kinds -- a -eqprhistogram- for Stata 8 is now
downloadable from SSC. Please junk any and
all versions you may have copied from Statalist,
and install afresh:
. ssc inst eqprhistogram
This is not a full-blown program offering
all the handles which might be desired, but
more a demonstration that the thing is possible.
When I last discussed this, I alluded to a quirk in
the implementation of the undocumented option
-bartype(spanning)-. This quirk turned out to be
a figment of my imagination. Vince Wiggins put me
right on what I had overlooked.
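For readers who want to see what an equal-probability histogram
involves, here is a rough by-hand sketch for four bins. It is not
the code of -eqprhistogram-; it assumes the auto data shipped
with Stata 8, the variable mpg as an arbitrary example, and the
undocumented -bartype(spanning)- option mentioned above. The names
limit and density and the locals are invented for illustration.
. sysuse auto, clear
. * extremes give the outer class limits
. summarize mpg, meanonly
. local min = r(min)
. local max = r(max)
. * quartiles give the inner class limits
. _pctile mpg, nq(4)
. local q1 = r(r1)
. local q2 = r(r2)
. local q3 = r(r3)
. * new dataset of five class limits
. clear
. set obs 5
. gen double limit = `min' in 1
. replace limit = `q1' in 2
. replace limit = `q2' in 3
. replace limit = `q3' in 4
. replace limit = `max' in 5
. * each bin is taken to hold probability 1/4
. gen double density = 0.25 / (limit[_n+1] - limit)
. twoway bar density limit, bartype(spanning) yscale(range(0))
With tied values a bin may not hold exactly a quarter of the data,
so 0.25 is then only approximate, and two coincident quantiles
would give a bin of zero width. The sketch ignores both problems.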
3 and 2. If you can work out your class limits you
can draw a histogram in Stata 6 or Stata 7
using -barplot- or -hist3- from SSC. -hist3-
is more general in that it will count for you.
In lieu of a port to Stata 8, the following
shows what can be done once someone has told
you the right undocumented feature.
The data come from Snedecor and Cochran 1989 p.19
(reference in manual) and are frequencies of US cities
by population, in thousands, in 1970. We enter the _lower_
class limits and the frequencies and _one_ final upper
limit as data, or -- in other cases -- somehow reduce
the data to this form.
. list
     +---------------------+
     | popula~n   freque~y |
     |---------------------|
  1. |      100         38 |
  2. |      125         27 |
  3. |      150         15 |
  4. |      175         11 |
  5. |      200         16 |
     |---------------------|
  6. |      300         16 |
  7. |      400          7 |
  8. |      500          8 |
  9. |      600         10 |
 10. |      800          2 |
     |---------------------|
 11. |     1000          . |
     +---------------------+
There are 150 cities, so we calculate the densities
. gen density = freq / (150 * (population[_n+1] - population))
(1 missing value generated)
and we can then draw the graph directly:
. twoway bar density population, bartype(spanning)
In practice you might want to add (e.g.)
bstyle(histogram)
and you might need to add
yscale(range(0))
-- the last was the detail I overlooked in my last
posting on this.
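A quick check on the densities calculated above, offered as a
sketch: the first bin has 38 cities over a width of 25, so its
density is 38 / (150 * 25), about 0.0101; the last has 2 cities
over a width of 200, so 2 / (150 * 200), about 0.0000667. Since
the frequencies sum to 150, the bar areas should sum to 1. The
variable name area is invented for illustration.
. * area of each bar = density times bin width
. gen area = density * (population[_n+1] - population)
. summarize area, meanonly
. display r(sum)
The displayed total should be 1, up to rounding in float storage.
As with density, the last observation gets a missing value, because
it holds only the final upper limit.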
Nick