Re: st: Ksmirnov discrete data (again)
Ah, now I see. Many thanks for the clarifications, Kristin.
In the case of truly discrete data, however, the formula for D does
not make much sense to me since F(x) is a step function (what would
x-e be?). Am I right that for discrete data one would use
D = max | S(x) - F(x) |
to compute the Kolmogorov-Smirnov statistic (as in, e.g., Horn 1977 or
Wood/Altavela 1978)?
ben
Horn, Susan Dadakis (1977). Goodness-of-Fit Tests for Discrete Data: A
Review and an Application to a Health Impairment Scale. Biometrics
33(1): 237-247.
Wood, Constance L., and Michele M. Altavela (1978). Large-Sample
Results for Kolmogorov-Smirnov Statistics for Discrete Distributions.
Biometrika 65(1): 235-239.
On 6/16/07, Kristin MacDonald, StataCorp <[email protected]> wrote:
Robert Ostling asked about using -ksmirnov- with discrete data when performing
a two-sample Kolmogorov-Smirnov test. Ben Jann <[email protected]> also
commented on performing the one-sample Kolmogorov-Smirnov test with discrete
data.
The methodologies used by -ksmirnov- for both the one and two-sample tests
were derived for data from continuous distributions.
Ben referenced two articles that discuss a way to perform a one-sample
Kolmogorov-Smirnov test when you are interested in comparing data to a
discrete theoretical distribution. When making a comparison of this type, the
test statistic should be computed using the method Ben describes as opposed to
the method that -ksmirnov- uses. Currently, there is not a command that
implements this test, although this is something we are looking into adding.
There has also been some discussion regarding the use of the -ksmirnov-
command when ties exist in the data. Theoretically, no ties should exist when
data is sampled from a continuous distribution, but, in practice, this is not
necessarily true. The test statistic that is produced by -ksmirnov- is still
correct when ties exist in a dataset that we wish to compare to a continuous
theoretical distribution. However, if there are a large number of ties, the
approximate p-value that is reported may not be appropriate. In the latest
update, a note was added to -ksmirnov- to inform the user of the number of
ties that exist in the dataset.
Gibbons and Chakraborti (2003, 121) give the following formula for the test
statistic D for the one-sample Kolmogorov-Smirnov test
D = sup_x |S(x) - F(x)| = max_x [|S(x) - F(x)|, |S(x-e) - F(x)|]
where e is a small positive number. They also mention that it applies even in
the case when ties are present.
Using the example that Ben gave, this would be as follows
x    S(x)    F(x)    S(x)-F(x)    S(x-e)-F(x)
1     .1      .2        -.1           -.2
2     .2      .4        -.2           -.3
3     .3      .6        -.3           -.4
4     .9      .8         .1           -.5
4     .9      .8         .1           -.5
4     .9      .8         .1           -.5
4     .9      .8         .1           -.5
4     .9      .8         .1           -.5
4     .9      .8         .1           -.5
5     1       1          0            -.1
Therefore, D = .5. This is equivalent to the result that is reported by
-ksmirnov-. However, Ben's data was intended to be compared to a discrete
distribution, so a test for discrete data would be more suitable.
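For readers who want to check the arithmetic in the table, here is a minimal Python sketch (not Stata code, and not how -ksmirnov- is implemented). It assumes F(x) = x/5, which is simply read off the F(x) column above, and the function name ks_statistics is hypothetical. It returns both the statistic from the Gibbons and Chakraborti formula and the max |S(x) - F(x)| version that Ben asks about.

```python
from collections import Counter
from fractions import Fraction


def ks_statistics(data, cdf):
    """Return (D with the S(x-e) term, D = max|S(x) - F(x)|) for one sample."""
    n = len(data)
    counts = Counter(data)
    d_at = Fraction(0)      # running max of |S(x) - F(x)|
    d_before = Fraction(0)  # running max of |S(x-e) - F(x)|
    cum = 0
    for x in sorted(counts):
        s_before = Fraction(cum, n)  # S(x-e): empirical CDF just below x
        cum += counts[x]
        s_at = Fraction(cum, n)      # S(x): empirical CDF at x
        f = cdf(x)
        d_at = max(d_at, abs(s_at - f))
        d_before = max(d_before, abs(s_before - f))
    return max(d_at, d_before), d_at


# Ben's example data; F(x) = x/5 is an assumption taken from the table.
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 5]
d_cont, d_disc = ks_statistics(data, lambda x: Fraction(x, 5))
print(float(d_cont), float(d_disc))  # prints: 0.5 0.3
```

The first value matches the D = .5 shown above. The second, .3, is what the max |S(x) - F(x)| definition gives for the same data. Exact fractions are used so that the comparison is not affected by floating-point rounding.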
Gibbons, J. D., and S. Chakraborti. 2003. Nonparametric Statistical
Inference. 4th ed. New York: Marcel Dekker, Inc.
--Kristin
[email protected]