<- See Stata 18's new features
Highlights
Single- and multiple-record data formats
Time-varying covariates (TVCs):
Created automatically as deterministic functions of time in both formats
Supplied directly in a multiple-record data format
Use to test the proportional-hazards assumption
Predictions with TVCs
Plots of functions with TVCs: survivor, cumulative hazard, and more
See more survival analysis features
Stata 17 introduced the stintcox command to fit genuine semiparametric Cox models to interval-censored event-time data. Stata 18 adds support for time-varying covariates (TVCs), commonly used in practice to record characteristics that change during follow-up. So, whether you have a biomarker variable or a policy variable that changes with time, you can now include them to model events of interest known only to lie within some time interval. For instance, the event may be a recurrence of cancer or a change in employment status recorded during one of the follow-ups. Both are examples of interval-censored event-time data, in which the event time is not observed exactly.
Use stintcox to create TVCs automatically by interacting existing covariates with specified deterministic functions of time or specify your own TVCs in the new multiple-record-per-subject data format. Use TVCs to test the proportional-hazards assumption with the Wald test or likelihood-ratio test. And incorporate TVCs in your predictions and plots of survivor and other functions.
-> Testing the proportional-hazards assumption using TVCs
-> Fitting an interval-censored Cox model using TVCs in multiple-record data
-> Graphing survivor functions
Here we demonstrate how to use TVCs to test the proportional-hazards assumption of a Cox model. Following Zeng et al. (2016), we use an interval-censored Cox regression to model time to HIV infection of injecting drug users. All 1,124 subjects initially tested negative for HIV. They were followed and assessed for HIV-1 seropositivity through blood tests approximately every four months. Because the subjects were tested periodically, the exact time of HIV infection was not observed, but the times were known to fall in intervals between blood tests. The corresponding lower and upper time endpoints are recorded in variables ltime and rtime.
The baseline factors we use to model the HIV-1 seropositivity are centered age at recruitment (age_mean), sex (male), history of needle sharing (needle), history of drug injection (inject), and whether a subject has been in jail at the time of recruitment (jail). Here is a subset of the dataset for subjects 271 through 274 with one record per subject:
. use https://www.stata-press.com/data/r18/idu (Modified Bangkok IDU Preparatory Study) . format ltime rtime age_mean %6.2f . list ltime rtime age_mean male needle inject jail if _n >= 271 & _n <= 274
ltime rtime age_mean male needle inject jail | |||
271. | 22.00 . -6.46 Yes Yes No No | ||
272. | 3.80 9.41 8.54 No No No Yes | ||
273. | 20.66 . -11.46 Yes Yes No No | ||
274. | 0.00 3.87 -4.46 Yes Yes Yes Yes | ||
We fit a Cox proportional hazards model in which time to HIV infection depends on the above baseline factors. In a single-record-per-subject format, we specify the lower and upper event-time intervals in stintcox's interval() option.
. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime) note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -597.65634 Iteration 200: Log likelihood = -597.57555 Iteration 295: Log likelihood = -597.56443 Computing standard errors: ............................ done Interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Left-censored = 41 Right-censored = 991 Interval-cens. = 92 Wald chi2(5) = 17.10 Log likelihood = -597.56443 Prob > chi2 = 0.0043
OPG | ||
Haz. ratio std. err. z P>|z| [95% conf. interval] | ||
age_mean | .9684341 .0126552 -2.45 0.014 .9439452 .9935582 | |
male | ||
Yes | .6846949 .1855907 -1.40 0.162 .4025073 1.164717 | |
needle | ||
Yes | 1.275912 .2279038 1.36 0.173 .8990401 1.810768 | |
inject | ||
Yes | 1.250154 .2414221 1.16 0.248 .8562184 1.825334 | |
jail | ||
Yes | 1.567244 .3473972 2.03 0.043 1.014982 2.419998 | |
We find that age and being in jail at enrollment appear to be associated with, respectively, lower and higher risks of HIV infection.
One way of testing the proportional-hazards assumption for a covariate is to test whether the coefficient associated with that covariate is time invariant. This can be accomplished by including an interaction between this covariate and a function of time in the model and testing whether the corresponding coefficient equals zero.
To test the proportional-hazards assumption in our example, we include all covariates in option tvc() to additionally include their interactions with the analysis time _t, the default, in the model. In this analysis, we suppress the default reporting of hazard ratios with option nohr:
. stintcox age_mean i.male i.needle i.inject i.jail, interval(ltime rtime) tvc(age_mean i.male i.needle i.inject i.jail) nohr note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -590.53655 Iteration 200: Log likelihood = -590.45163 Iteration 300: Log likelihood = -590.43665 Iteration 340: Log likelihood = -590.43386 Computing standard errors: .............................................. done Interval-censored Cox regression Number of obs = 1,124 Baseline hazard: Reduced intervals Uncensored = 0 Left-censored = 41 Event-time interval: Right-censored = 991 Lower endpoint: ltime Interval-cens. = 92 Upper endpoint: rtime Wald chi2(10) = 31.99 Log likelihood = -590.43386 Prob > chi2 = 0.0004
OPG | ||
Haz. ratio std. err. z P>|z| [95% conf. interval] | ||
main | ||
age_mean | -.0310177 .0233817 -1.33 0.185 -.076845 .0148097 | |
male | ||
Yes | -1.271583 .4604788 -2.76 0.006 -2.174105 -.3690615 | |
needle | ||
Yes | -.1819587 .3297493 -0.55 0.581 -.8282554 .464338 | |
inject | ||
Yes | .6852961 .3431924 2.00 0.046 .0126513 1.357941 | |
jail | ||
Yes | -.529615 .4021087 -1.32 0.188 -1.317734 .2585036 | |
tvc | ||
age_mean | -.000129 .0017099 -0.08 0.940 -.0034804 .0032224 | |
male | ||
Yes | .0884102 .042994 2.06 0.040 .0041434 .1726769 | |
needle | ||
Yes | .0358545 .0238562 1.50 0.133 -.0109027 .0826118 | |
inject | ||
Yes | -.0361192 .0228754 -1.58 0.114 -.0809541 .0087157 | |
jail | ||
Yes | .0916036 .0348915 2.63 0.009 .0232176 .1599896 | |
The first equation, main, reports the coefficients for the covariates that do not vary over time; the second equation, tvc, reports the results for covariates interacted with time. And the proprtional-hazards assumption appears to be violated for covariates male and jail according to their p-values reported in the tvc equation. The Wald test at the bottom of the table also indicates that the proportional-hazards assumption doesn't hold globally.
In the previous example, we used stintcox's tvc() option to create TVCs to test the proportional-hazards assumption. In some applications, TVCs already exist in the dataset. Continuing with the data on injecting drug users, being in jail is a TVC. TVCs can be recorded only in multiple-record-per-subject data format. In this format, each subject may contain multiple records with multiple examination times, the event status at each examination time, and potential time-varying covariates at each examination time.
Let's use an extended version, idu2.dta, of the previous idu.dta dataset, which contains all the baseline covariates as well as a time-varying imprisonment indicator variable (jail_vary). jail_vary indicates whether the subject has been imprisioned since the last clinic visit. The dataset also records a subject identifier (id), the examination time of the blood test (time), and whether the blood test is positive at the examination time (is_seropos). Here is a subset of this dataset for subjects 271 through 274 with multiple records per subject:
. use https://www.stata-press.com/data/r18/idu2 (Modified Bangkok IDU Preparatory Study with time-varying variable jail_vary) . format time age_mean %6.2f . list id time is_seropos age_mean male needle inject jail_vary if id >= 271 & id <= 274, sepby(id) noobs abbreviate(10)
id time is_seropos age_mean male needle inject jail_vary | ||
271 4.89 No -6.46 Yes Yes No No | ||
271 9.31 No -6.46 Yes Yes No No | ||
271 13.38 No -6.46 Yes Yes No Yes | ||
271 17.97 No -6.46 Yes Yes No Yes | ||
271 22.00 No -6.46 Yes Yes No No | ||
272 3.80 No 8.54 No No No Yes | ||
272 9.41 Yes 8.54 No No No No | ||
273 3.93 No -11.46 Yes Yes No No | ||
273 8.00 No -11.46 Yes Yes No No | ||
273 12.07 No -11.46 Yes Yes No Yes | ||
273 15.97 No -11.46 Yes Yes No Yes | ||
273 20.66 No -11.46 Yes Yes No Yes | ||
274 3.87 Yes -4.46 Yes Yes Yes Yes | ||
We refit the previous Cox model, but we now use the time-varying variable jail_vary instead of the baseline variable jail. In a multiple-record-per-subject format, we specify the subject identifier in option id(), the examination time in option time(), and the event-indicator status in option status().
. stintcox age_mean i.male i.needle i.inject i.jail_vary, id(id) time(time) status(is_seropos) note: time-varying covariates detected in the data; using method nearleft to impute their values between examination times. note: using adaptive step size to compute derivatives. Performing EM optimization (showing every 100 iterations): Iteration 0: Log likelihood = -1086.2564 Iteration 100: Log likelihood = -598.45375 Iteration 200: Log likelihood = -598.35872 Iteration 285: Log likelihood = -598.34887 Computing standard errors: ........................................ done Interval-censored Cox regression Number of obs = 6,453 Baseline hazard: Reduced intervals Number of subjects = 1,124 Uncensored = 0 ID variable: id Left-censored = 41 Examination time: time Right-censored = 991 Status indicator: is_seropos Interval-cens. = 92 Wald chi2(5) = 17.03 Log likelihood = -598.34887 Prob > chi2 = 0.0044
OPG | ||
time | Haz. ratio std. err. z P>|z| [95% conf. interval] | |
age_mean | .9714605 .012757 -2.20 0.027 .9467762 .9967884 | |
male | ||
Yes | .6678044 .1816576 -1.48 0.138 .3918353 1.138138 | |
needle | ||
Yes | 1.271409 .2275426 1.34 0.180 .8952546 1.805609 | |
inject | ||
Yes | 1.370672 .2575405 1.68 0.093 .9484142 1.980928 | |
jail_vary | ||
Yes | 1.440966 .2916178 1.81 0.071 .9691488 2.142481 | |
Compared with the previous example, after we account for time-varying imprisonment, the hazard ratio for inject increases from 1.25 to 1.37, but the effect of imprisonment decreases from 1.57 for baseline jail to 1.44 for time-varying jail_vary.
After fitting the model, we can use stcurve to plot the estimated survivor, failure, hazard, or cumulative hazard function. By default, stcurve evaluates the functions at the overall means of covariates.
. stcurve, survival note: function evaluated at overall means of covariates.
With multiple-record data with TVCs, we may want to incorporate the time-varying nature of those covariates when plotting the functions. In this case, we can use option attmeans to evaluate the function at time-specific means.
. stcurve, survival attmeans note: function evaluated at time-specific means of covariates.
We can also use option atframe() to specify our own TVC values to be used to evaluate the survivor function. Suppose we want to plot the survivor curve for an individual with the same covariate pattern as subject 1 in our dataset. We create a new frame called id1 and use frame put to copy the relevant information for subject 1 to this new frame. We list the data we just saved in frame id1.
. frame put time age_mean male needle inject jail_vary if id==1, into(id1) . frame id1: list
time age_mean male needle inject jail_v~y | |||
1. | 4.13 5.54 Yes No Yes Yes | ||
2. | 8.26 5.54 Yes No Yes No | ||
3. | 12.33 5.54 Yes No Yes No | ||
4. | 16.10 5.54 Yes No Yes No | ||
5. | 20.10 5.54 Yes No Yes No | ||
6. | 24.23 5.54 Yes No Yes No | ||
7. | 28.62 5.54 Yes No Yes No | ||
8. | 32.59 5.54 Yes No Yes No | ||
9. | 36.20 5.54 Yes No Yes No | ||
10. | 40.56 5.54 Yes No Yes Yes | ||
Then we can graph the survivor curve for this particular profile by typing
. stcurve, survival atframe(id1) note: function evaluated at specified values of selected covariates and overall means of other covariates (if any). note: covariate values from frame id1 used to evaluate function.
Zeng, D., L. Mao, and D. Lin. 2016. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103: 253—271.
Learn more about other new features in survival analysis.
Read more in the Stata Survival Analysis Reference Manual; see [ST] stintcox.
View all the new features in Stata 18.