Kimberley Tran
>
> To build kernel density graphs in Stata, I created a
> Do-file for purpose of
> generating a variable within which density measures are
> taken. This variable
> contained 100 points. From my understanding, the distance
> between each point
> is the bandwidth. I ran this Do-file prior to using the
> kdensity command.
> In the resulting kernel density graphs, there are points
> on the y-axis which
> are greater than 1. How should the y-axis of the resulting
> kernel density
> graphs be interpreted? Is it the frequency of the distribution?
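First off, grid mesh is not the same as bandwidth.
The grid is just the set of points at which the
density is evaluated; the bandwidth controls how
much the kernel smooths.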
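A minimal sketch of the difference, assuming a recent Stata in which
-sysuse- loads the auto data and the -kdensity- options are spelled
n(), generate() and nograph (older releases may spell some of these
differently). The names x25, d25, x50 and d50 are made up for
illustration.

```stata
sysuse auto, clear
gen gpm = 1 / mpg

* 25 grid points, default bandwidth
kdensity gpm, n(25) generate(x25 d25) nograph
display r(bwidth)

* 50 grid points: a finer grid, but the same default bandwidth
kdensity gpm, n(50) generate(x50 d50) nograph
display r(bwidth)
```

The two bandwidths displayed should be identical, because the default
bandwidth is calculated from the data, not from the grid. To change
the amount of smoothing, specify the bandwidth option directly.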
-kdensity- produces a smoothed estimate
of the probability density function. The
units of probability density are the reciprocal
of the units of the variable whose distribution
you are examining. If that variable is measured
in metres, the units are 1 / m; if in years, the
units are 1 / yr. The density cannot be negative;
the only other constraint is that the area under
the probability density function must be 1.
It is perfectly possible for individual ordinates
to exceed 1.
For example,
. use auto
. gen gpm = 1 / mpg
. kdensity gpm
I see a density estimate which averages about 15
over a range of about 0.09 - 0.02 = 0.07. Roughly,
15 * 0.07 is about 1, and I am confident that
a more careful calculation would come closer to 1. (There is
usually some small loss in the extreme tails
with default choices.)
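That rough check can be made more carefully by saving the estimate
and integrating it numerically. A sketch only, assuming -sysuse- and
the generate() and nograph options of current Stata; x and d are
hypothetical variable names.

```stata
sysuse auto, clear
gen gpm = 1 / mpg

* save the grid points in x and the density estimates in d
kdensity gpm, generate(x d) nograph

* trapezoidal-rule area under the estimated density
integ d x, trapezoid
```

The integral reported should be close to 1, falling short only by
whatever is lost in the tails.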
The units of the density are
    1 / (gallons per mile), or miles per gallon,
and the units of the variable are, by construction,
    gallons per mile.
Area under the curve has no units, as can be
seen by cancelling down
     miles       gallons
    -------  *  -------
    gallons      miles
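The same logic shows up if the variable is rescaled. A sketch,
assuming -sysuse- is available; gp100m (gallons per 100 miles) is a
hypothetical variable introduced only for illustration.

```stata
sysuse auto, clear
gen gpm = 1 / mpg
gen gp100m = 100 * gpm

* units of this density are 1 / (gallons per 100 miles)
kdensity gp100m
```

The variable is now 100 times larger, so the ordinates should be
about 100 times smaller, around 0.15 rather than 15, and the area
under the curve is still 1.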
There is a note on this at [R] p.227.
David Finney wrote a very nice paper on "Dimensions
in statistics" in Applied Statistics 25, 285-289 (1977).
Nick