Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: the id() option in -stset- and "gap-time" conditional risk models
From: Thomas Pepinsky <[email protected]>
To: "[email protected]" <[email protected]>
Subject: st: the id() option in -stset- and "gap-time" conditional risk models
Date: Mon, 8 Mar 2010 20:36:20 -0500
A colleague and I are trying to figure out how to estimate a "gap-time" conditional risk set model in Stata. We are having trouble reconciling several Stata recommendations that seem contradictory to us. Specifically, we are not sure whether we need to specify the id() option when we -stset- the data.
We are using replication data from Box-Steffensmeier and Zorn (2002). Their paper is available here: http://bit.ly/aDLVdZ. The replication data are available here: http://bit.ly/9jAIo4, using the file ag_pwp.dta. We are using Stata 10.
We wish to estimate the effect of democracy on international disputes between pairs of countries. Each subject is a "dyad," which is a pair of countries. Democracy is a time-varying covariate. Disputes are the failure events. Dyads experience multiple failures (i.e. multiple disputes).
Here is a snapshot of the data structure:
dyadid  start  stop  starta  stopa  futime  dispute  sumdisp  democ
  2020      0     1       0      1      35        0        0      1
  2020      1     2       1      2      35        0        0      1
  2020      2     3       2      3      35        0        0      1
  2020      3     4       3      4      35        0        0      1
   ...
  2020     21    22      21     22      35        0        0      1
  2020     22    23      22     23      35        0        0      1
  2020     23    24      23     24      35        1        1      1
  2020      0     1      24     25      35        0        1      1
  2020      1     2      25     26      35        0        1      1
  2020      2     3      26     27      35        0        1      1
  2020      3     4      27     28      35        1        2      1
  2020      0     1      28     29      35        0        2      1
   ...
  2041      0     1       0      1      25        0        0    -.8
  2041      1     2       1      2      25        0        0    -.9
  2041      2     3       2      3      25        1        1    -.9
  2041      0     1       3      4      25        0        1    -.9
  2041      1     2       4      5      25        0        1    -.9
  2041      2     3       5      6      25        0        1    -.9
  2041      3     4       6      7      25        0        1    -.9
dyadid indexes subjects. stop and stopa are the analysis-time variables; they differ in whether we count from entry into the pool, for an "elapsed time" model (stopa), or from the last failure, for the "gap" model (stop). dispute marks a dispute between the two states, which is the failure event. start and starta mark when the subject comes under observation, and differ in the same way as stop and stopa. futime marks the latest time at which the subject is both under observation and at risk; we need it because we have multiple-failure data. sumdisp is the running total of disputes that have occurred. democ is democracy, our time-varying covariate, defined as the average of the levels of democracy in the two countries in the dyad.
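A rough way to check this description of the two clocks is to rebuild the gap-time variables from the elapsed-time ones. The sketch below uses only the seven records for dyad 2041 shown in the snapshot; prior, gapstart, and gapstop are hypothetical helper variables, and it assumes that sumdisp already counts the dispute on the record where dispute equals 1, as it appears to in the snapshot.

```stata
clear
input dyadid start stop starta stopa futime dispute sumdisp democ
2041 0 1 0 1 25 0 0 -.8
2041 1 2 1 2 25 0 0 -.9
2041 2 3 2 3 25 1 1 -.9
2041 0 1 3 4 25 0 1 -.9
2041 1 2 4 5 25 0 1 -.9
2041 2 3 5 6 25 0 1 -.9
2041 3 4 6 7 25 0 1 -.9
end

* Number of disputes strictly before the current record (hypothetical helper)
generate prior = sumdisp - dispute

* Gap clock: elapsed time minus the elapsed time at which the current spell began
bysort dyadid prior (starta): generate gapstart = starta - starta[1]
generate gapstop = gapstart + (stopa - starta)

* Both counts should be 0 if the reconstruction matches start and stop
count if gapstart != start
count if gapstop != stop
```

If both counts are also zero on the full data, start and stop are simply starta and stopa re-zeroed after each dispute.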
Our confusion arises from what we believe are two contradictory pieces of advice on how to set up the data for analysis using -stset-.
On one hand, the -stset- help file (http://www.stata.com/help.cgi?stset) states that "Specifying id() never hurts", which we interpret to mean that we should always specify the id() option when -stset-ting our data. If we do, we get the following output:
. stset stop, fail(dispute) exit(futime) enter(start) id(dyadid)

                id:  dyadid
     failure event:  dispute != 0 & dispute < .
obs. time interval:  (stop[_n-1], stop]
 enter on or after:  time start
 exit on or before:  time futime

------------------------------------------------------------------------------
    20448  total obs.
     2621  multiple records at same instant                     PROBABLE ERROR
           (stop[_n-1]==stop)
------------------------------------------------------------------------------
    17827  obs. remaining, representing
      816  subjects
      111  failures in multiple failure-per-subject data
    18471  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        35
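The flagged records seem to arise because, on the gap-time clock, the same value of stop recurs within a dyad each time the clock resets after a dispute. A minimal sketch of how one might confirm this, using only the dyad 2041 records from the snapshot (samestop is a hypothetical variable name):

```stata
clear
input dyadid start stop starta stopa dispute sumdisp
2041 0 1 0 1 0 0
2041 1 2 1 2 0 0
2041 2 3 2 3 1 1
2041 0 1 3 4 0 1
2041 1 2 4 5 0 1
2041 2 3 5 6 0 1
2041 3 4 6 7 0 1
end

* Records sharing the same dyad and the same gap-time stop value
duplicates report dyadid stop

* The elapsed-time clock never resets, so no (dyadid, stopa) pair repeats here
duplicates report dyadid stopa

* samestop = number of other records with the same dyadid and stop
duplicates tag dyadid stop, generate(samestop)
sort dyadid stop starta
list dyadid start stop starta stopa dispute sumdisp if samestop > 0, sepby(stop)
```

Running the same -duplicates- commands on the full ag_pwp.dta would show whether the repeats account for the records that -stset- excludes.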
On the other hand, the FAQ on multiple failure-time data (http://www.stata.com/support/faqs/stat/stmfail.html) does NOT include the id() option in its example of how to estimate the conditional gap-time model. If we follow the -stset- procedure outlined there, we get very different output:
. stset stop, fail(dispute) exit(futime) enter(start)

     failure event:  dispute != 0 & dispute < .
obs. time interval:  (0, stop]
 enter on or after:  time start
 exit on or before:  time futime

------------------------------------------------------------------------------
    20448  total obs.
        0  exclusions
------------------------------------------------------------------------------
    20448  obs. remaining, representing
      405  failures in single record/single failure data
    20448  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        35
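To see what each declaration does record by record, one can list the variables that -stset- creates (_t0, _t, _d, and _st). The following is only a sketch on the dyad 2041 records from the snapshot, reusing the two -stset- commands above unchanged; what it shows for seven records need not carry over to the full data.

```stata
clear
input dyadid start stop starta stopa futime dispute sumdisp
2041 0 1 0 1 25 0 0
2041 1 2 1 2 25 0 0
2041 2 3 2 3 25 1 1
2041 0 1 3 4 25 0 1
2041 1 2 4 5 25 0 1
2041 2 3 5 6 25 0 1
2041 3 4 6 7 25 0 1
end

* With id(): the observation interval is (stop[_n-1], stop] within dyadid
stset stop, fail(dispute) exit(futime) enter(start) id(dyadid)
sort dyadid starta
list dyadid start stop starta _t0 _t _d _st, sepby(sumdisp)

* Without id(): each record is treated as its own subject
stset, clear
stset stop, fail(dispute) exit(futime) enter(start)
sort dyadid starta
list dyadid start stop starta _t0 _t _d _st, sepby(sumdisp)
```

Comparing _t0 and _st across the two listings shows which records each declaration keeps and over which interval each one is at risk.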
Without the id() option, Stata considers the data to be single record/single failure data; with the id() option, Stata considers the data to be multiple failure-per-subject data. And these distinctions matter for our substantive conclusions when estimating the regression model. Compare the following two outputs, estimated using the recommended syntax for gap-time conditional risk set models from the FAQ:
. stset stop, fail(dispute) exit(futime) enter(start) id(dyadid)
***output omitted***
. stcox democ, nohr robust cluster(dyadid) strata(sumdisp) efron

         failure _d:  dispute
   analysis time _t:  stop
  enter on or after:  time start
  exit on or before:  time futime
                 id:  dyadid

Iteration 0:   log pseudolikelihood =   -304.8235
Iteration 1:   log pseudolikelihood =  -304.41482
Iteration 2:   log pseudolikelihood =  -304.40997
Iteration 3:   log pseudolikelihood =  -304.40997
Refining estimates:
Iteration 0:   log pseudolikelihood =  -304.40997

Stratified Cox regr. -- Efron method for ties

No. of subjects      =          816                Number of obs   =     17827
No. of failures      =          111
Time at risk         =        18471
                                                   Wald chi2(1)    =      0.51
Log pseudolikelihood =   -304.40997                Prob > chi2     =    0.4770

                               (Std. Err. adjusted for 816 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       democ |   .1889203   .2656566     0.71   0.477    -.3317571    .7095977
------------------------------------------------------------------------------
                                                         Stratified by sumdisp
. stset stop, fail(dispute) exit(futime) enter(start)
***output omitted***
. stcox democ, nohr robust cluster(dyadid) strata(sumdisp) efron

         failure _d:  dispute
   analysis time _t:  stop
  enter on or after:  time start
  exit on or before:  time futime

Iteration 0:   log pseudolikelihood =  -1567.2597
Iteration 1:   log pseudolikelihood =  -1567.2407
Iteration 2:   log pseudolikelihood =  -1567.2407
Refining estimates:
Iteration 0:   log pseudolikelihood =  -1567.2407

Stratified Cox regr. -- Efron method for ties

No. of subjects      =        20448                Number of obs   =     20448
No. of failures      =          405
Time at risk         =        20448
                                                   Wald chi2(1)    =      0.07
Log pseudolikelihood =   -1567.2407                Prob > chi2     =    0.7926

                               (Std. Err. adjusted for 827 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       democ |   .0199128   .0757423     0.26   0.793    -.1285394    .1683651
------------------------------------------------------------------------------
                                                         Stratified by sumdisp
We note that the approach in the Stata FAQ (without the id() option) is the one we have seen in other applications, but two issues give us pause. First, when we -stset- the data without the id() option, Stata treats the data as single-record/single-failure data, which is not the case for us: we have time-varying covariates and repeated disputes, so each dyad contributes multiple records and can fail more than once. Second, this contradicts the advice that "Specifying id() never hurts". In this case, it is clear that specifying id() might actually hurt!
Any advice on how to proceed would be most appreciated.
TP