Dear Statalisters,
Last August, I wrote a somewhat primitive, yet working do-file to perform the enhanced bootstrap procedure to correct for over-optimism in predictive logistic models. I asked Patrick Royston's help to turn that do-file into something more useful for the Stata community. He kindly helped me out and his code works (see below). However, two issues remain: there is no help file and there is a question about what is the best way of reporting the Standard Deviations (SD) over the bootstrap samples (see below). Below I copied our correspondence on this topic. I feel that neither my statistical knowledge, nor my programming abilities are firm enough to write a good help file and decide on and program the correct SD. I hope that someone on the list might volunteer to (slightly) edit Patrick Royston's code, if needed, and perhaps write a help file to the program to ensure that the Stata community can fully benefit from what to me seems a useful procedure. In particular, I think that P
atrick's suggestion might be correct that "It may be preferable to report the SD over the bootstrap samples, this being the bootstrap-based SE of the optimism" (see below). That part of the code would then need changing. It would be wonderful of course if Patrick's suggestions of generalization to other commands than just logit could be realized.
Sent: 28 August 2008 10:01
To: Patrick Royston
Subject: enhanced bootstrap
Dear professor Royston, dear Patrick,
I have recently programmed a do file performing (I hope) Tibshirani's 'enhanced bootstrap' to estimate overoptimism in predictive models. Unfortunately, my programing abilities are not as they should be. I wondered whether you think that the Stata community might benefit from a more professional ado version of this programme (I could not find one on the internet). If so, would you have time and interest to help me modify this do file into an ado file for general use?
A (non-mathematical) description of the enhanced bootstrap may be found on:
http://books.google.nl/books?id=kfHrF-bVcvQC&pg=PA94&lpg=PA94&dq=%22enhanced+bootstrap%22&source=web&ots=33G--3dhD3&sig=rGjRgs0N4-n8ND5CEpiHl42ZPB0&hl=nl&sa=X&oi=book_result&resnum=5&ct=result
Many thanks for your consideration.
Kind regards,
Gerben ter Riet
On August 31 Patrick Royston replied:
Dear Gerben,
Here is a stab at an ado-file for your bootstrap. I am sending two versions. The first (bstrap_enhanced) uses bsample and preserve/restore, whereas the second (bstrap_enhanced2) uses the bsample,weight() approach and is more than twice as fast, since the data are not disturbed.
Both of these programs could be enhanced and generalized, e.g. to use commands other than logit such stcox, but one would of course need to consider performance measures alternative to AUROC.
The syntax is bstrap_enhanced[2] yvar xvars [if] [in] [ , Reps(#) Seed(#) ]
where reps defaults to 200, and seed to 0, ie. no random number seed is set.
The output includes the mean optimism and its SE over the bootstrap samples. It may be preferable to report the SD over the bootstrap samples, this being the bootstrap-based SE of the optimism; you might take a look at what the literature suggests. Either way it is a simple matter of modifying the line "return scalar opt_se = sqrt( (`ssopt' - `sopt'^2 / `reps') / (`reps' - 1) / `reps' )".
Feel free to do what you like with the programs. It is fine if you wish to distribute them on Statalist or SSC archive, but if you do the latter you would need to write a help file.
Hope this helps
Best wishes
Patrick Royston
Here is the code that Patrick provided for bootstrap_enhanced (capitals were mine):
*THIS FILE COMPUTES AUCs TO CALCULATE THE EXTENT OF OVEROPTIMISM INVOLVED
*IN A PREDICTION MODEL USING TIBSHIRANI'S ENHANCED BOOTSTRAP
*AS DESCRIBED IN HARRELL'S BOOK ON MODELLING ON PAGE 94.
program define bstrap_enhanced, rclass
version 10
syntax varlist(numeric min=2) [if] [in] [, Reps(int 200) Seed(int 0) ]
gettoken yvar xvars : varlist
marksample touse
tempname opt sopt ssopt area0 area
quietly {
if `seed' > 0 set seed `seed'
scalar `sopt' = 0
scalar `ssopt' = 0
// Compute AUC on original data
logit `yvar' `xvars' if `touse'
lroc, nograph
scalar `area0' = r(area)
forvalues i = 1 / `reps' {
preserve
bsample
logit `yvar' `xvars' if `touse'
lroc, nograph
scalar `area' = r(area)
// apply bsample's coefficients to original population
restore
lroc, nograph
scalar `opt' = `area' - r(area)
scalar `sopt' = `sopt' + `opt'
scalar `ssopt' = `ssopt' + `opt' ^ 2
}
}
return scalar auc = `area0'
return scalar opt = `sopt' / `reps'
return scalar opt_se = sqrt( (`ssopt' - `sopt'^2 / `reps') / (`reps' - 1) / `reps' )
di as txt _n "AUC = " as res %6.4f return(auc) ///
as txt " optimism = " %6.4f as res return(opt) ///
as txt " SE = " %6.4f as res return(opt_se)
end
Here is the code that Patrick provided for bootstrap_enhanced2:
*! v 1.0.0 PR 31aug2008
program define bstrap_enhanced2, rclass
version 10
syntax varlist(numeric min=2) [if] [in] [, Reps(int 200) Seed(int 0) ]
gettoken yvar xvars : varlist
marksample touse
tempname opt sopt ssopt area0 area
tempvar w xb
quietly {
if `seed' > 0 set seed `seed'
scalar `sopt' = 0
scalar `ssopt' = 0
// Compute AUC on original data
logit `yvar' `xvars' if `touse'
lroc, nograph
scalar `area0' = r(area)
gen long `w' = .
forvalues i = 1 / `reps' {
bsample, weight(`w')
logit `yvar' `xvars' if `touse' [fweight = `w']
lroc, nograph
scalar `area' = r(area)
// apply bsample's coefficients to original population
predict `xb' if `touse'
logit `yvar' `xb'
lroc, nograph
scalar `opt' = `area' - r(area)
scalar `sopt' = `sopt' + `opt'
scalar `ssopt' = `ssopt' + `opt' ^ 2
drop `xb'
}
}
return scalar auc = `area0'
return scalar opt = `sopt' / `reps'
return scalar opt_se = sqrt( (`ssopt' - `sopt'^2 / `reps') / (`reps' - 1) / `reps' )
di as txt _n "AUC = " as res %6.4f return(auc) ///
as txt " optimism = " %6.4f as res return(opt) ///
as txt " SE = " %6.4f as res return(opt_se)
end
To be complete, my original do file looked like this and uses the auto.dta file to illustrate the analysis:
capture program drop bstrap_enhanced
program define bstrap_enhanced, rclass
version 10
tempname auc
tempvar area kans area2
sysuse auto,clear
g price_d1=(price>5000)
postfile `auc' cons weight mpg trunk area area2 using roc1, replace
qui {
forval i = 1/200 {
bsample
logit price_d1 weight mpg trunk
local cons = _b[_cons]
local weight = _b[weight]
local trunk = _b[trunk]
local mpg = _b[mpg]
lroc, nogr
local area = r(area)
* apply bsample's coefficients to original population
sysuse auto, clear
g price_d1=(price>5000)
g kans = 1/(1+(exp(-(`cons'+`weight'*weight+`mpg'*mpg+`trunk'*trunk))))
roctab price_d1 kans, hanley
local area2 = r(area)
drop kans
post `auc' (`cons') (`weight') (`mpg') (`trunk') (`area') (`area2')
}
}
postclose `auc'
use roc1,clear
g optimism = area - area2
su cons weight mpg trunk area area2 optimism
end
*NOW RUN THE MODEL ON THE ORIGINAL DATA SET, CALCULATE AUC AND SUBTRACT MEAN OPTIMISM FROM THAT LATTER AUC
Best wishes,
Gerben ter Riet, epidemiologist, Dept of General Practice, AMC, Amsterdam , Netherlands
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/