Genuine semiparametric Cox proportional hazards modeling
Left-censoring, right-censoring, interval-censoring
Single-record and multiple-record data format
Support for time-varying covariates
Stratified estimation
Four methods to estimate standard errors, including robust and cluster–robust standard errors
Testing proportional-hazards assumption
Predictions of baseline functions, Martingale-like residuals, Cox–Snell-like residuals, and time-varying prediction
Graphs of survivor, cumulative hazard, and hazard functions
Graphical checks of proportional-hazards assumption
See more survival analysis features
What is the exact time of cancer recurrence? Or time of COVID-19 infection? We don't really know exactly. All we know is that the cancer recurred sometime between two follow-up examinations and that the infection occurred sometime before the onset of symptoms or a positive COVID test. Data like these are called interval-censored time-to-event data. By time-to-event or event-time data, we also mean failure-time data, duration data, and so on. They can be stored in two different formats: single-record-per-subject format and multiple-record-per-subject format.
Interval-censored event-time data arise in many areas, including medical, epidemiological, economic, financial, and sociological studies. Ignoring interval-censoring will often lead to biased estimates.
A semiparametric Cox proportional hazards regression model is used routinely to analyze uncensored and right-censored event-time data. In Stata, you can use the estimation command stintcox to fit the Cox model to interval-censored event-time data, for both single-record and multiple-record formats.
Just as with right-censored data, a Cox model is appealing for interval-censored data because it uses semiparametric estimation and thus does not require any parametric assumptions about the baseline hazard function. Also for low-event rates, the exponentiated regression parameters approximate the log relative-risks.
Semiparametric estimation of interval-censored event-time data is challenging because none of the event times are observed exactly. Thus, "semiparametric" modeling of these data often resorted to using spline methods or piecewise-exponential models for the baseline hazard function. Genuine semiparametric modeling of interval-censored event-time data was not available until recent methodological advances, which are implemented in the stintcox command.
Fitting a model for single-record-per-subject interval-censored data
Fitting a model for multiple-record-per-subject interval-censored data
We use data from Zeng et al. (2016) that studies time to HIV infection in a cohort study of injecting drug users in Thailand. The dataset contains 1,124 subjects. Those subjects initially tested negative for the HIV virus. They were then followed and assessed for HIV-1 seropositivity through blood tests approximately every four months. Because the subjects were tested periodically, the exact time of HIV infection was not observed, but the times are known to fall in intervals between blood tests with the lower and upper endpoints recorded in variables ltime and rtime.
The dataset also records several baseline factors, such as centered age at recruitment (age_mean), sex (male), history of needle sharing (needle), history of drug injection (inject), and whether a subject has been in jail at the time of recruitment (jail). The dataset contains one record per subject with the event-time information recorded as interval data and is called single-record-per-subject data. Here is a subset of the dataset for subjects 271–274:
. use https://www.stata-press.com/data/r18/idu (Modified Bangkok IDU Preparatory Study) . format ltime rtime age_mean %6.2f . list ltime rtime age_mean male needle inject jail if _n >=271 & _n<=274
ltime rtime age_mean male needle inject jail | |||
271. | 22.00 . -6.46 Yes Yes No No | ||
272. | 3.80 9.41 8.54 No No No Yes | ||
273. | 20.66 . -11.46 Yes Yes No No | ||
274. | 0.00 3.87 -4.46 Yes Yes Yes Yes | ||
We can fit a Cox proportional hazards model in which time to HIV infection depends on above factors of interest. In a single-record-per-subject format, we must specify option interval() with stintcox.
. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime) note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -597.65634 Iteration 200: Log likelihood = -597.57555 Iteration 295: Log likelihood = -597.56443 Computing standard errors: ............................ done Interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Left-censored = 41 Right-censored = 991 Interval-cens. = 92 Wald chi2(5) = 17.10 Log likelihood = -597.56443 Prob > chi2 = 0.0043
OPG | ||
Haz. ratio std. err. z P>|z| [95% conf. interval] | ||
age_mean | .9684341 .0126552 -2.45 0.014 .9439452 .9935582 | |
male | ||
Yes | .6846949 .1855907 -1.40 0.162 .4025073 1.164717 | |
needle | ||
Yes | 1.275912 .2279038 1.36 0.173 .8990401 1.810768 | |
inject | ||
Yes | 1.250154 .2414221 1.16 0.248 .8562184 1.825334 | |
jail | ||
Yes | 1.567244 .3473972 2.03 0.043 1.014982 2.419998 | |
We find that age at the recruitment is significantly associated with lower risk of HIV infection and being in jail at enrollment is associated with higher risk of HIV infection. Other factors are not statistically significant.
If we assume that the baseline hazard function for female is different from that for male, we can fit a stratified Cox proportional hazards model by using the strata() option:
. stintcox age_mean i.needle i.inject i.jail, interval(ltime rtime) strata(male) note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1087.0536 Iteration 100: Log likelihood = -585.59848 Iteration 200: Log likelihood = -585.53143 Iteration 282: Log likelihood = -585.5222 Computing standard errors: ........................... done Stratified interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Strata variable: male Left-censored = 41 Right-censored = 991 Event-time interval: Interval-cens. = 92 Lower endpoint: ltime Upper endpoint: rtime Wald chi2(4) = 14.84 Log likelihood = -585.5222 Prob > chi2 = 0.0051
OPG | ||
Haz. ratio std. err. z P>|z| [95% conf. interval] | ||
age_mean | .9682508 .0126326 -2.47 0.013 .9438052 .9933295 | |
needle | ||
Yes | 1.276222 .2270302 1.37 0.170 .9005422 1.808625 | |
inject | ||
Yes | 1.245357 .2393768 1.14 0.254 .8544367 1.815131 | |
jail | ||
Yes | 1.57314 .3490687 2.04 0.041 1.018337 2.430205 | |
Our conclusion here is similar to the case without stratification.
It is important to evaluate the validity of the underlying assumption for the Cox proportional hazards model, which is that the hazard ratio is constant over time. One way of testing the proportional-hazards assumption for a covariate is to test whether the coefficient associated with that covariate is time invariant. This can be accomplished by including an interaction between this covariate and a function of time in the model and testing whether the corresponding coefficient equals zero.
For example, to check the proportional-hazards assumption for inject, we can include an interaction term between inject and time using the tvc() option:
. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime) tvc(i.inject) nohr note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -597.58166 Iteration 200: Log likelihood = -597.50007 Iteration 295: Log likelihood = -597.48895 Computing standard errors: ......................................... done Interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Left-censored = 41 Event-time interval: Right-censored = 991 Lower endpoint: ltime Interval-cens. = 92 Upper endpoint: rtime Wald chi2(6) = 17.36 Log likelihood = -597.48895 Prob > chi2 = 0.0080
OPG | ||
Coefficient std. err. z P>|z| [95% conf. interval] | ||
main | ||
age_mean | -.0321889 .0131026 -2.46 0.014 -.0578696 -.0065082 | |
male | ||
Yes | -.3817124 .2714232 -1.41 0.160 -.9136921 .1502674 | |
needle | ||
Yes | .2434583 .1791214 1.36 0.174 -.1076131 .5945298 | |
inject | ||
Yes | .3194498 .3209978 1.00 0.320 -.3096943 .9485939 | |
jail | ||
Yes | .4498814 .2230002 2.02 0.044 .012809 .8869538 | |
tvc | ||
inject | ||
Yes | -.0080255 .0208446 -0.39 0.700 -.0488801 .0328292 | |
With the nohr option, the first equation, main, reports the coefficients for the covariates that do not vary over time; the second equation, tvc, reports the estimated coefficients for covariates interacted with time. There is no evidence that the coefficient of inject interacted with time is different from 0. Hence the proprtional-hazards assumption appears to hold for covariate inject.
For single-record interval-censored data, Stata also provides two graphical commands to visually assess the proportional-hazards assumption. These two commands can be used without running stintcox first.
For a single categorical covariate in a Cox model, you can use stintphplot, which plots -ln{-ln(survival)} curves for each category versus ln(analysis time). If the plots are parallel, the proportional-hazards assumption has not been violated. Alternatively, you can use stintcoxnp to plot the nonparametric maximum-likelihood estimation survival curve versus the Cox predicted survival curve for each category. When the two curves are close together, the proportional-hazards assumption is valid.
When a Cox model contains multiple covariates, as mentioned in the above example, only stintphplot is appropriate for graphically checking the proportional-hazards assumption. In that case, we need to use the adjustfor() option.
For example, to graphically check the proportional-hazards assumption for inject, we include all covariates except inject in the adjustfor() option:
. stintphplot, interval(ltime rtime) by(inject) adjustfor(age_mean i.male i.needle i.jail) Fitting Cox model with covariates from option adjustfor() for inject = No ... Fitting Cox model with covariates from option adjustfor() for inject = Yes ...
A separate Cox model, which contains all covariates from the adjustfor() option, is fit for each level of inject. And the two plots are almost parallel, which indicates that the proportional-hazards assumption has not been violated for the categorical variable inject.
Interval-censored data can also be recorded as multiple-record-per-subject format. In this format, each subject may contain multiple records with multiple examination times, the event status at each examination time, and potential time-varying covariates at each examination time. Here we use an extended version of previous single-record-per-subject dataset, which contains all the baseline covariates as well as a time-varying imprisonment indicator variable (jail_vary) that indicates whether the subject has been imprisioned since the last clinic visit. It also records a subject identifier (id), the examination time of the blood test (time), and whether the blood test is positive at the examination time (is_seropos). Here is a subset of this dataset for subject 271-274:
. use https://www.stata-press.com/data/r18/idu2 (Modified Bangkok IDU Preparatory Study with time-varying variable jail_vary) . format time age_mean %6.2f . list id time is_seropos age_mean male needle inject jail_vary if id >=271 & id <=274, sepby(id) noobs abbreviate(10)
id time is_seropos age_mean male needle inject jail_vary | ||
271 4.89 No -6.46 Yes Yes No No | ||
271 9.31 No -6.46 Yes Yes No No | ||
271 13.38 No -6.46 Yes Yes No Yes | ||
271 17.97 No -6.46 Yes Yes No Yes | ||
271 22.00 No -6.46 Yes Yes No No | ||
272 3.80 No 8.54 No No No Yes | ||
272 9.41 Yes 8.54 No No No No | ||
273 3.93 No -11.46 Yes Yes No No | ||
273 8.00 No -11.46 Yes Yes No No | ||
273 12.07 No -11.46 Yes Yes No Yes | ||
273 15.97 No -11.46 Yes Yes No Yes | ||
273 20.66 No -11.46 Yes Yes No Yes | ||
274 3.87 Yes -4.46 Yes Yes Yes Yes | ||
We use stintcox to fit a Cox proportional-hazards model in which the time to HIV infection depends on baseline covariates age_mean, male, needle, inject, and time-varying covariate jail_vary. In a multiple-record-per-subject format, we must specify options id(), time(), and status() with stintcox.
. stintcox age_mean i.male i.needle i.inject i.jail_vary, id(id) time(time) status(is_seropos) note: time-varying covariates detected in the data; using method nearleft to impute their values between examination times. note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -598.45375 Iteration 200: Log likelihood = -598.35872 Iteration 285: Log likelihood = -598.34887 Computing standard errors: ........................................ done Interval-censored Cox regression Number of obs = 6,453 Baseline hazard: Reduced intervals Number of subjects = 1,124 Uncensored = 0 ID variable: id Left-censored = 41 Examination time: time Right-censored = 991 Status indicator: is_seropos Interval-cens. = 92 Wald chi2(5) = 17.03 Log likelihood = -598.34887 Prob > chi2 = 0.0044
OPG | ||
time | Haz. ratio std. err. z P>|z| [95% conf. interval] | |
age_mean | .9714605 .012757 -2.20 0.027 .9467762 .9967884 | |
male | ||
Yes | .6678044 .1816576 -1.48 0.138 .3918353 1.138138 | |
needle | ||
Yes | 1.271409 .2275426 1.34 0.180 .8952546 1.805609 | |
inject | ||
Yes | 1.370672 .2575405 1.68 0.093 .9484142 1.980928 | |
jail_vary | ||
Yes | 1.440966 .2916178 1.81 0.071 .9691488 2.142481 | |
Compared with the previous example, after we account for time-varying imprisonment, the hazard ratio for inject increases from 1.25 to 1.37, but the effect of imprisonment decreases from 1.57 for baseline jail to 1.44 for time-varying jail_vary.
After fitting the model, we can use stcurve to plot the estimated survivor, failure, hazard, or cumulative hazard function. By default, stcurve evaluates the functions at the overall means of covariates.
. stcurve, survival note: function evaluated at overall means of covariates.
With multiple-record data with time-varying covariates, we may want to incorporate the time-varying nature of those covariates when plotting the functions. In this case, we can use the attmeans option to evaluate the function at time-specific means.
. stcurve, survival attmeans note: function evaluated at time-specific means of covariates.
We can also use the atframe() option to specify our own time-varying covariate values to be used to evaluate the survivor function. Suppose we want to plot the survivor curve for an individual with the same covariate pattern as subject 1 in our dataset. We can create a new frame called id1 and use frame put to copy the relevant information for subject 1 in this new frame. Then we list the data we just saved in frame id1.
. frame put time age_mean male needle inject jail_vary if id==1, into(id1) . frame id1: list
time age_mean male needle inject jail_v~y | |||
1. | 4.13 5.54 Yes No Yes Yes | ||
2. | 8.26 5.54 Yes No Yes No | ||
3. | 12.33 5.54 Yes No Yes No | ||
4. | 16.10 5.54 Yes No Yes No | ||
5. | 20.10 5.54 Yes No Yes No | ||
6. | 24.23 5.54 Yes No Yes No | ||
7. | 28.62 5.54 Yes No Yes No | ||
8. | 32.59 5.54 Yes No Yes No | ||
9. | 36.20 5.54 Yes No Yes No | ||
10. | 40.56 5.54 Yes No Yes Yes | ||
We can graph the survivor curve for this particular profile by typing
. stcurve, survival atframe(id1) note: function evaluated at specified values of selected covariates and overall means of other covariates (if any). note: covariate values from frame id1 used to evaluate function.
Zeng, D., L. Mao, and D. Lin. 2016. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103: 253–271.S
Zeng, D., F. Gao, and D. Lin. 2017. Maximum likelihood estimation for semiparametric regression models with multivariate interval-censored data. Biometrika 104: 505–525.
Turnbull, B. W. 1976. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B 38: 290–295.
Farrington, C. P. 2000. Residuals for proportional hazards models with interval-censored survival data. Biometrics 56: 473–482.
To learn more about how to fit Cox models to interval-censored data using the stintcox command, see [ST] stintcox.
Also see the Parametric estimation of interval-censored event-time data command.