Title | Stata 6: Estimating a Cox model with a continuously time-varying parameter | |
Author | William Gould, StataCorp |
Below is the response written by Bill Gould, which we quote in full.
Here are a little bit of data in which we want to investigate a continuously time varying Cox-regression.
. describe Contains data obs: 26 vars: 5 size: 624 (99.9% of memory free) ----------------------------------------------------------------------------- 1. patient float %9.0g 2. time float %9.0g survival time (days) 3. dead float %9.0g dead 4. treat float %9.0g 1=single 2=combined 5. age float %9.0g age ----------------------------------------------------------------------------- Sorted by: Note: data has changed since last save . list in 1/5 patient time dead treat age 1. 1 156 1 1 66 2. 2 1040 0 1 38 3. 3 59 1 1 72 4. 4 421 0 2 53 5. 5 329 1 1 43
and here is what we are to do:
Finally, the book reports
The problem is that this regression includes the (continously varying) time-varying regressor c*time.
This is indeed a tricky problem for Stata. Stata will estimate time-varying models, but Stata estimates models in which the time-varying regressors are assumed to be constant within intervals. Thus the formal answer to our question is that Stata cannot estimate the model. I can, however, come arbitrarily close. I can certainly reproduce these results.
Let’s back up, and let me show you how to do that.
In Stata, when you want to estimate a regression with time-varying covariates, there are to be multiple observations in the dataset per patient. Let us consider the first patient.
patient time dead treat age 1. 1 156 1 1 66
As the dataset is right now, this single observation records all the information on the patient. It says that the patient died on day 156. Now pretend the data on patient 1 looked like this,
patient time dead treat age 1. 1 1 0 1 66 2. 1 156 1 1 66
There are now two observations on this patient. The two observations summarize the experiences of the patient over the time intervals (0,1] and (1,156]. Note that I changed dead=0 in the first observation, so the patient did not die at time 1. As I have written it, these two observations for patient 1 record exactly the same information as the single observation did earlier. So do these three observations record the same information,
patient time dead treat age 1. 1 1 0 1 66 2. 1 2 0 1 66 3. 1 156 1 1 66
and so do these 156 observations,
patient time dead treat age 1. 1 1 0 1 66 2. 1 2 0 1 66 3. 1 3 0 1 66 ... 155. 1 155 0 1 66 156. 1 156 1 1 66
Note that in the last form I have one observation per day. I could add a new variable to this dataset equal to c*(time-470). In fact, I have done this in Stata. Here is my do-file:
clear infile using cancer label var time "survival time (days)" label var dead "dead" label var age "age" label var treat "1=single 2=combined" gen c = treat-1 compress expand time sort patient qui by patient: gen t = _n gen outcome = 0 quietly by patient: replace outcome = dead if _n==_N stset t, failure(outcome) id(patient) gen ctime = c*(t-470) stcox age c ctime, nohr
Here is an explanation of what I am doing:
clear infile using cancer label var time "survival time (days)" label var dead "dead" label var age "age" label var treat "1=single 2=combined"
I am just reading and labelling the data I showed you earlier. For those who want to duplicate my results, the dictionary cancer.dct can be found below my signature.
gen c = treat-1 compress
I create the c variable as instructed and toss in a compress command to make the dataset occupy as little memory as possible. The compress plays no substantive role, and I could omit it.
expand time
This is the important statement in the do-file. expand time means, out of each observation, make time observations (such as 156 observations) from it. Said differently, I have asked Stata to make time-1 copies of each observation. Thus, after this statement, my first observation (which had time=156) is now 156 identical observations:
patient time dead treat age 1 1 0 1 66 1 1 0 1 66 ... 1 1 0 1 66
The observations are not one after the other, however. All the new observations have been added after the original 26 observations.
sort patient
Now the observations are in patient order.
quietly by patient: gen t = _n
I create a new time variable called t equal to 1, 2, 3, ..., within patient.
gen outcome = 0 quietly by patient: replace outcome = dead if _n==_N
I create a new outcome variable equal to 0 and then, within patient, set its last observation to the value of the original dead variable.
My data is now just how I want it, and I’m ready to analyze it.
stset t, failure(outcome) id(patient)
I declare my data to be st data with multiple observations per patient.
gen ctime = c*(t-470)
I create the c*time variable. I could have done this before the stset. The order does not matter.
stcox age c ctime, nohr
I estimate the regression. The nohr option means no hazard ratios. I prefer to see results as hazard ratios, but the results to which we are comparing are reported as coefficients, so I've asked Stata to display them that way.
Here is the result of running my do-file:
. do cancer . clear . infile using cancer dictionary { patient time dead treat age } (26 observations read) . label var time "survival time (days)" . label var dead "dead" . label var age "age" . label var treat "1=single 2=combined" . . gen c = treat-1 . compress patient was float now byte time was float now int dead was float now byte treat was float now byte age was float now byte c was float now byte . . expand time (15562 observations created) . sort patient . quietly by patient: gen t = _n . gen outcome = 0 . qui by patient: replace outcome = dead if _n==_N . . stset t, failure(outcome) id(patient) !-- note: making entry-time variable t0 (within patient, t0 will be 0 for the 1st observation and the lagged value of exit time t thereafter) id: patient entry time: t0 exit time: t failure/censor: outcome --> (output omitted) . . gen ctime = c*(t-470) . stcox age c ctime, nohr failure time: t entry time: t0 failure/censor: outcome id: patient Iteration 0: log likelihood = -34.98494 Iteration 1: log likelihood =-27.377424 Iteration 2: log likelihood =-26.829243 Iteration 3: log likelihood =-26.819411 Iteration 4: log likelihood =-26.819386 Refining estimates: Iteration 0: log likelihood =-26.819386 Cox regression -- Breslow method for ties No. of subjects = 26 Number of obs = 403 No. of failures = 12 Time at risk = 15588 LR chi2(3) = 16.33 Log likelihood = -26.819386 Prob > chi2 = 0.0010 ----------------------------------------------------------------------------- t | outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+------------------------------------------------------------------- age | .136387 .0463484 2.943 0.003 .0455457 .2272282 c | -.5319678 .7687557 -0.692 0.489 -2.038701 .9747657 ctime | .0034117 .00513 0.665 0.506 -.0066429 .0134663 -----------------------------------------------------------------------------
This is the result Collett reports.
Understand what I did. Collett wanted the continuous variable c*(time-470), and I substituted a step function for it:
c*(time-470), c=1 | / | / | <-- my step function | / | | /----+ | /| | / | | / | | / | | /----+ | /| | / | | / | |/ | +-------------------------------- time 1 2 3
That is what I meant when I said I could come arbitrarily close to the Collett’s result—it is just a matter of how finely I mesh my steps.
— Bill
[email protected]
Here is the dictionary containing the data used above:
dictionary { patient time dead treat age } 1 156 1 1 66 2 1040 0 1 38 3 59 1 1 72 4 421 0 2 53 5 329 1 1 43 6 769 0 2 59 7 365 1 2 64 8 770 0 2 57 9 1227 0 2 59 10 268 1 1 74 11 475 1 2 59 12 1129 0 2 53 13 464 1 2 56 14 1206 0 2 44 15 638 1 1 56 16 563 1 2 55 17 1106 0 1 44 18 431 1 1 50 19 855 0 1 43 20 803 0 1 39 21 115 1 1 74 22 744 0 2 50 23 477 0 1 64 24 448 0 1 56 25 353 1 2 63 26 377 0 2 58