Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Tibshirani's enhanced bootstrap (ado file)


From   "G. ter Riet" <[email protected]>
To   "[email protected]" <[email protected]>
Subject   st: Tibshirani's enhanced bootstrap (ado file)
Date   Thu, 23 Oct 2008 15:35:44 +0200

Dear Statalisters,
Last August, I wrote a somewhat primitive, yet working do-file to perform the enhanced bootstrap procedure to correct for over-optimism in predictive logistic models. I asked Patrick Royston's help to turn that do-file into something more useful for the Stata community. He kindly helped me out and his code works (see below). However, two issues remain: there is no help file and there is a question about what is the best way of reporting the Standard Deviations (SD) over the bootstrap samples (see below). Below I copied our correspondence on this topic. I feel that neither my statistical knowledge, nor my programming abilities are firm enough to write a good help file and decide on and program the correct SD. I hope that someone on the list might volunteer to (slightly) edit Patrick Royston's code, if needed, and perhaps write a help file to the program to ensure that the Stata community can fully benefit from what to me seems a useful procedure. In particular, I think that P
 
atrick's suggestion might be correct that "It may be preferable to report the SD over the bootstrap samples, this being the bootstrap-based SE of the optimism" (see below). That part of the code would then need changing. It would be wonderful of course if Patrick's suggestions of generalization to other commands than just logit could be realized.

Sent: 28 August 2008 10:01
To: Patrick Royston
Subject: enhanced bootstrap

Dear professor Royston, dear Patrick,
I have recently programmed a do file performing (I hope) Tibshirani's 'enhanced bootstrap' to estimate overoptimism in predictive models. Unfortunately, my programing abilities are not as they should be. I wondered whether you think that the Stata community might benefit from a more professional ado version of this programme (I could not find one on the internet). If so, would you have time and interest to help me modify this do file into an ado file for general use?

A (non-mathematical) description of the enhanced bootstrap may be found on:

http://books.google.nl/books?id=kfHrF-bVcvQC&pg=PA94&lpg=PA94&dq=%22enhanced+bootstrap%22&source=web&ots=33G--3dhD3&sig=rGjRgs0N4-n8ND5CEpiHl42ZPB0&hl=nl&sa=X&oi=book_result&resnum=5&ct=result

Many thanks for your consideration.

Kind regards,
Gerben ter Riet

On August 31 Patrick Royston replied:
Dear Gerben,

Here is a stab at an ado-file for your bootstrap. I am sending two versions. The first (bstrap_enhanced) uses bsample and preserve/restore, whereas the second (bstrap_enhanced2) uses the bsample,weight() approach and is more than twice as fast, since the data are not disturbed.

Both of these programs could be enhanced and generalized, e.g. to use commands other than logit such stcox, but one would of course need to consider performance measures alternative to AUROC.

The syntax is bstrap_enhanced[2] yvar xvars [if] [in] [ , Reps(#) Seed(#) ]

where reps defaults to 200, and seed to 0, ie. no random number seed is set.

The output includes the mean optimism and its SE over the bootstrap samples. It may be preferable to report the SD over the bootstrap samples, this being the bootstrap-based SE of the optimism; you might take a look at what the literature suggests. Either way it is a simple matter of modifying the line "return scalar opt_se = sqrt( (`ssopt' - `sopt'^2 / `reps') / (`reps' - 1) / `reps' )".

Feel free to do what you like with the programs. It is fine if you wish to distribute them on Statalist or SSC archive, but if you do the latter you would need to write a help file.

Hope this helps

Best wishes
Patrick Royston

Here is the code that Patrick provided for bootstrap_enhanced (capitals were mine):
*THIS FILE COMPUTES AUCs TO CALCULATE THE EXTENT OF OVEROPTIMISM INVOLVED
*IN A PREDICTION MODEL USING TIBSHIRANI'S ENHANCED BOOTSTRAP
*AS DESCRIBED IN HARRELL'S BOOK ON MODELLING ON PAGE 94.

program define bstrap_enhanced, rclass
	version 10
	syntax varlist(numeric min=2) [if] [in] [, Reps(int 200) Seed(int 0) ]
	gettoken yvar xvars : varlist
	marksample touse
	tempname opt sopt ssopt area0 area
	quietly {
		if `seed' > 0 set seed `seed'
		scalar `sopt' = 0
		scalar `ssopt' = 0
		// Compute AUC on original data
		logit `yvar' `xvars' if `touse'
		lroc, nograph
		scalar `area0' = r(area)
		forvalues i = 1 / `reps' {
			preserve
			bsample
			logit `yvar' `xvars' if `touse'
			lroc, nograph
			scalar `area' = r(area)
			// apply bsample's coefficients to original population
			restore
			lroc, nograph
			scalar `opt' = `area' - r(area)
			scalar `sopt' = `sopt' + `opt'
			scalar `ssopt' = `ssopt' + `opt' ^ 2
		}
	}
	return scalar auc = `area0'
	return scalar opt = `sopt' / `reps'
	return scalar opt_se = sqrt( (`ssopt' - `sopt'^2 / `reps') / (`reps' - 1) / `reps' )
	di as txt _n "AUC = " as res %6.4f return(auc) ///
	 as txt " optimism = " %6.4f as res return(opt) ///
	 as txt " SE = " %6.4f as res return(opt_se)
end


Here is the code that Patrick provided for bootstrap_enhanced2:
*! v 1.0.0 PR 31aug2008
program define bstrap_enhanced2, rclass
	version 10
	syntax varlist(numeric min=2) [if] [in] [, Reps(int 200) Seed(int 0) ]
	gettoken yvar xvars : varlist
	marksample touse
	tempname opt sopt ssopt area0 area
	tempvar w xb
	quietly {
		if `seed' > 0 set seed `seed'
		scalar `sopt' = 0
		scalar `ssopt' = 0
		// Compute AUC on original data
		logit `yvar' `xvars' if `touse'
		lroc, nograph
		scalar `area0' = r(area)
		gen long `w' = .
		forvalues i = 1 / `reps' {
			bsample, weight(`w')
			logit `yvar' `xvars' if `touse' [fweight = `w']
			lroc, nograph
			scalar `area' = r(area)
			// apply bsample's coefficients to original population
			predict `xb' if `touse'
			logit `yvar' `xb'
			lroc, nograph
			scalar `opt' = `area' - r(area)
			scalar `sopt' = `sopt' + `opt'
			scalar `ssopt' = `ssopt' + `opt' ^ 2
			drop `xb'
		}
	}
	return scalar auc = `area0'
	return scalar opt = `sopt' / `reps'
	return scalar opt_se = sqrt( (`ssopt' - `sopt'^2 / `reps') / (`reps' - 1) / `reps' )
	di as txt _n "AUC = " as res %6.4f return(auc) ///
	 as txt " optimism = " %6.4f as res return(opt) ///
	 as txt " SE = " %6.4f as res return(opt_se)
end


To be complete, my original do file looked like this and uses the auto.dta file to illustrate the analysis:

capture program drop bstrap_enhanced
program define bstrap_enhanced, rclass
	version 10
	tempname auc
	tempvar area kans area2
	sysuse auto,clear
	g price_d1=(price>5000)
	postfile `auc' cons weight mpg trunk area area2 using roc1, replace
	qui   {
		forval i = 1/200 {
		bsample
		logit price_d1 weight mpg trunk
		local cons = _b[_cons]
		local weight = _b[weight]
		local trunk = _b[trunk]
		local mpg = _b[mpg]
		lroc, nogr
		local area = r(area)
		* apply bsample's coefficients to original population
		sysuse auto, clear
		g price_d1=(price>5000)
		g kans = 1/(1+(exp(-(`cons'+`weight'*weight+`mpg'*mpg+`trunk'*trunk))))
		roctab price_d1 kans, hanley
		local area2 = r(area)
		drop kans
		post `auc' (`cons') (`weight') (`mpg') (`trunk') (`area') (`area2')
		}
	}
	postclose `auc'
	

use roc1,clear
g optimism = area - area2
su cons weight mpg trunk area area2 optimism
end

*NOW RUN THE MODEL ON THE ORIGINAL DATA SET, CALCULATE AUC AND SUBTRACT MEAN OPTIMISM FROM THAT LATTER AUC

Best wishes,
Gerben ter Riet, epidemiologist, Dept of General Practice, AMC, Amsterdam , Netherlands


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2025 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index