--- Andrzej Niemierko <[email protected]> wrote:
> Is there a test for overfitting when using Cox proportional hazards
> model? Specifically, I have a small data set with only 52 cases (39
> events) and 12 independent variables. Six of those independent
> variables are jointly significant (all at p<0.001) in the Cox model.
> Although in my case it makes perfect sense that those six variables
> are associated with the outcome I have a feeling that I am
> overfitting the data at hand.
The problem with overfitting is that you are fitting such a flexible
model that what you are picking up are the random fluctions due to
sampling and not the systematic part of the fluctuations. This is
problematic because the findings will now be uninformative about the
population or the next sample.
One way to get a feel for this is to 1) estimate your model in your
data, 2) create a bootstrap sample, and 3) see how the model with the
estimates from the real data would fit in the bootstrap sample. If you
are overfitting, than your estimates in the real data would have little
relevance for the bootstrap data and the fit would be bad. If you
haven't overfitted the data than the fit would cluster around the fit
in your real data.
Usually I am not such a fan of fit statistics, so I can't remember an
appropriate one for the Cox model, so I improvised by looking at the
mean deviance residual. You might have to change that. Otherwise, the
example below should do what you want: If you are overfitting that the
center of the histogram should be to the right of the vertical line I
added with the -xline()- option (assuming that "more is bad" in your
choice of fit statistic).
*--------------- begin example -------------------
sysuse cancer, clear
stset studytim, failure(died)
xi: stcox age i.drug, mgale(mgale)
tempname memhold
tempfile results
postfile `memhold' msd using `results'
forvalues i = 1/1000 {
preserve
bsample
predict dev, deviance
gen dev2 = dev^2
sum dev2, meanonly
post `memhold' (`r(mean)')
restore
}
postclose `memhold'
predict dev, deviance
gen dev2 = dev^2
sum dev2, meanonly
local observed = r(mean)
use `results', clear
hist msd, xline(`observed') /*
*/ xtitle("mean squared deviance residual")
*---------------------- end example -------------------------
(For more on how to use examples I sent to the Statalist, see
http://home.fsw.vu.nl/m.buis/stata/exampleFAQ.html )
Hope this helps,
Maarten
-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands
visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434
+31 20 5986715
http://home.fsw.vu.nl/m.buis/
-----------------------------------------
___________________________________________________________
Want ideas for reducing your carbon footprint? Visit Yahoo! For Good http://uk.promotions.yahoo.com/forgood/environment.html
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/