Marcello Pagano <[email protected]> asks:
> I hate to think I have discovered a bug because it usually is something I
> have done wrong, but not this time, I don't think. After many hours with
> this command it struck me that the reported standard error is much too
> small; for one, the difference between four times the standard error and the
> width of the 95% confidence intervals is too big, even with big sample
> sizes.
> Has anyone experienced this? My suspicion is that there is an error in the
> formula on page 106 of the [ST] manual. The formula includes the
> square-root of a g-hat, identified as the standard error from Major
> Greenwood's formula. Either that should be the variance from Major
> Greenwood, or the square-root is out of place. I suspect that one too many
> square-roots exist.
In short, the formula for the standard error as stated on page 106 of [ST]
is incorrect -- it is not the formula used by -stci-. However, the standard
errors produced by -stci- (using another formula) are valid.
Page 106 of [ST] states the formula for the standard error of an estimated
percentile (t_p) as
se(t_p) = p*sqrt(g)
------------------- (1)
sqrt{S(t_p)}*f(t_p)
where
p = probability in question (e.g. 0.35 for the 35th percentile)
g = variance estimate of S(t_p) (erroneously called the standard
error of S(t_p) in the manual)
S(t_p) = estimate of the survival function at t_p
f(t_p) = estimate of the density function at t_p
The reference for this formula is page 114 of Klein & Moeschberger (1997).
The above is NOT, however, the formula used by the -stci- code, but instead
the formula used is
se(t_p) = sqrt(g)
------- (2)
f(t_p)
as given on page 35 of Collett (2003) and page 122 of Klein & Moeschberger
(2003). The fact that (1) is not used by -stci- is fortunate since, as
Marcello points out, it is likely incorrect as evidenced by the fact that it
was revised to (2) in the 2nd edition of Klein & Moeschberger (2003).
We will change the manual to show that (2) is indeed the formula used.
In order to verify that (2) (and hence -stci-) produce valid standard errors,
I ran a series of simulations using Weibull survival times. As one example,
consider the following simulation code
----------------------------begin doasim.ado---------------------------------
program doasim, rclass
version 8
syntax [, Nobs(int 1000) Shape(real 1.0) p(int 50)]
drop _all
set obs `nobs'
/* gen Weibulls with shape parameter `shape' */
gen t = (-ln(1-uniform()))^(1/`shape')
stset t
stci, p(`p')
return scalar p = r(p`p')
return scalar se = r(se)
end
---------------------------end doasim.ado--------------------------------------
which generates `nobs' Weibull survival times with shape parameter `shape',
and then estimates the `p'th percentile and standard error using -stci-.
Simulating this process 1000 times, each time estimating the 35th percentile
of survival time for samples of size 100, I then obtained the following
. simulate "doasim, n(100) p(35) shape(1.5)" p=r(p) se=r(se), reps(1000)
command: doasim , n(100) p(35) shape(1.5)
statistics: p = r(p)
se = r(se)
. sum
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
p | 1000 .5685117 .0645623 .3234471 .8108463
se | 1000 .0644948 .0201078 .0201328 .1776743
As you can see, the standard error estimates (variable -se-) pass the usual
test of validity -- their mean (0.0644948) is a good approximation of the
sample standard deviation (0.0645623) of the point estimates themselves,
after all, that is what a "standard error" is measuring. Note also that
the above was for a moderate sample size (100).
I ran simulations for various shapes, percentiles, and sample sizes and the
results of all indicate that -stci- using formula (2) is producing good
standard errors. The simulations were limited to the Weibull distribution,
but I am confident that the results would hold generally.
As to why Marcello is obtaining lousy standard errors on his data (even with
a large sample), I can provide one theory. The standard error estimate in
(2) requires an estimate of the density f(t_p), which is crudely estimated
(in the literature and in the code) as
f(t_p) = S(t_p - b) - S(t_p + b)
-----------------------
2b
for b = "some small number". The accuracy of the above estimate hinges on the
fact that your survival times follow a continuous spectrum, regardless of the
sample size. That is, I can have lots of observations of survival times, but
if I am only observing integers, say, then the above estimate will be lousy by
comparison. For example, t_35 could equal 3.4, but you'd be hard pressed to
accurately estimate f(3.4) if you are only observing 3's and 4's.
As such, the asymptotic theory for this estimate also includes an assumption
that as N gets large the infimum of the distance between your t's most go to
zero, or something like that.
If Marcello still has concerns about -stci-, he can email me privately or to
the list and we can look into this further.
--Bobby
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/