Title | stsetting spell-type data
Author | Mario Cleves, StataCorp
It is strongly recommended that before reading this FAQ you become familiar with the terms and definitions presented in the stset entry of the Stata manual.
Spell or duration data arise frequently from studies in econometrics and other disciplines. In a typical spell dataset there are multiple observations for each subject, each covering a span of time (a spell) during which the subject is in a given state, such as employed or unemployed. The main difference between this type of data and that required to perform survival analysis is that the latter expects the event of interest, commonly known as the failure event, to occur at the end of the time spanned by the record. Survival analysis is concerned only with the time at which the transition from one state to another occurs.
An often overlooked issue is that the failure event must be clearly defined. If we have employment history data recording spells during which an individual is either employed or unemployed, then we need to clearly define the event of interest (the failure event) as either entering employment or entering unemployment. This need brings up another crucial point. If we define our event as the transition from unemployment to employment, then a subject is at “risk” of the transition only during those times when the subject is unemployed. Consequently, during time spans when the subject is employed, the subject is not at risk of “failure”, and this in fact becomes a time gap in the data. Even though you may not have time gaps in your spell dataset, the resulting survival dataset will probably contain gaps.
In survival data, therefore, each observation must cover a span of time at the end of which the event of interest either occurs or does not. This model requires the subject be at risk of the event (transition from unemployment to employment) during the time span.
Assume we have employment history data recording spells during which an individual is either employed or unemployed. Further, assume we are interested in the transition from unemployed to employed. That is, the “failure event” is becoming employed. Who is at risk of making the transition? Of course, only unemployed individuals. Our spell data look like this:
     ID   Spelltyp        Begin   End
    101   Employed            1    72
    102   School-unemp       10    20
    102   Employed           20    35
    102   Unemployed         35    40
    103   School-unemp        0    20
    103   Welfare-unem       20    30
    103   Employed           30    60
We have data on three individuals identified by the ID variable (ID=101, ID=102 and ID=103). We will use these data to create the corresponding survival dataset and then stset it.
The first person (ID=101) is already employed at entry and, consequently, not at risk of entering employment. Thus he should either be excluded from the study or be included as
     ID   Begin   End   Employed
    101       0     1          1
meaning he was unemployed from time 0 to time 1 and entered employment at time 1. We will assume this inference is correct.
For ID=102, we are given three records
     ID   Spelltyp        Begin   End
    102   School-unemp       10    20
    102   Employed           20    35
    102   Unemployed         35    40
This subject was not under observation from time 0 to time 10, which is what we refer to as left truncation or delayed entry. The first observation indicates that the person was unemployed from time 10 to time 20 and entered employment at time 20; in fact, he was in school during this time. He was employed from time 20 to time 35, when he became unemployed. During that time, he was not at risk of the transition because he was already employed; therefore, this record is noninformative and should be left out. Leaving it out produces what Stata calls a time gap. He then entered unemployment at time 35 and remained unemployed until time 40. This last observation is censored because he was still unemployed at the end of this time span. Thus the corresponding survival data are
     ID   Begin   End   Employed
    102      10    20          1
    102      35    40          0
For ID=103, we are also given three records:
     ID   Spelltyp        Begin   End
    103   School-unemp        0    20
    103   Welfare-unem       20    30
    103   Employed           30    60
This individual was unemployed from time 0 to time 30 when he became employed. He then remained employed until the end of the follow-up period. Although he was in school from time 0 to time 20, and on welfare from time 20 to time 30, there was only one transition from unemployment to employment. Consequently, for this individual there is only one important record.
     ID   Begin   End   Employed
    103       0    30          1
The person was unemployed from time 0 to time 30 and entered employment at time 30.
We could also create two records for this subject. We may need to do this if we have other covariates that are time varying. We will assume we do have time-varying covariates and adopt the following setup:
     ID   Begin   End   Employed
    103       0    20          0
    103      20    30          1
Combining all the above observations gives the following survival dataset, which we then stset:
     ID   Begin   End   Employed
    101       0     1          1
    102      10    20          1
    102      35    40          0
    103       0    20          0
    103      20    30          1
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)

                id:  ID
     failure event:  Employed != 0 & Employed != .
obs. time interval:  (Begin, End]
 exit on or before:  time .

------------------------------------------------------------------------------
          5  total obs.
          0  exclusions
------------------------------------------------------------------------------
          5  obs. remaining, representing
          3  subjects
          3  failures in single failure-per-subject data
          1  subject remains remain at risk after failure
         46  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        40
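For readers who want to reproduce this step, the following is a minimal sketch that types in the five survival records with input and then issues the same stset command. The list command at the end is an addition for inspection only; _t0, _t, _d, and _st are the variables that stset creates.

```stata
* Enter the five survival records by hand (values copied from the table above)
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

* Declare the data to be survival-time data, exactly as in the FAQ
stset End, failure(Employed) time0(Begin) id(ID) exit(time .)

* Inspect the entry time, exit time, failure indicator, and inclusion flag
list ID Begin End Employed _t0 _t _d _st, sepby(ID)
```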
Although here our data have only one failure per subject, real data will most likely contain multiple failures per subject because individuals move in and out of employment. If this is the case, you will probably benefit from reading the FAQ Analysis of multiple failure-time data or the article of the same name (Cleves 1999). Whether you have single- or multiple-failure-per-subject data, the logic used to create the survival dataset is as described. The only difference arises when you stset and then analyze the data.
Continuing with our example, we can now describe our data:
. stdes

         failure _d:  Employed
   analysis time _t:  End
  exit on or before:  time .
                 id:  ID

                                   |-------------- per subject --------------|
Category                   total        mean         min     median        max
------------------------------------------------------------------------------
no. of subjects                3
no. of records                 5    1.666667           1          2          2

(first) entry time                  3.333333           0          0         10
(final) exit time                   23.66667           1         30         40

subjects with gap              1
time on gap if gap            15          15          15         15         15
time at risk                  46    15.33333           1         15         30

failures                       3           1           1          1          1
------------------------------------------------------------------------------
stdescribe correctly reports that we have five observations for three subjects, that one subject has a gap lasting 15 time units (ID=102 from 20 to 35), and that there are three failures in the data (i.e., three transitions from unemployment to employment). Although the original dataset did not contain time gaps, the survival dataset does because of time spans during which the subjects are not at risk of the transition. This is not unusual when transforming spell data into survival-time data. Having verified our data, we are now ready to continue our data analysis using other st commands.
We can also set up a survival dataset corresponding to transitions from employment to unemployment by following a similar strategy.
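The record-by-record conversion described earlier can also be scripted. The sketch below is one possible approach, not the only one. It assumes that every spell not labeled Employed counts as unemployment, it marks a failure when the subject's next spell is an employment spell, and it then drops the employment spells. Unlike the hand-built dataset, it simply excludes ID=101 rather than inferring an unemployment record from time 0 to time 1.

```stata
* Original spell data, as shown at the start of the example
clear
input ID str12 Spelltyp Begin End
101 "Employed"      1 72
102 "School-unemp" 10 20
102 "Employed"     20 35
102 "Unemployed"   35 40
103 "School-unemp"  0 20
103 "Welfare-unem" 20 30
103 "Employed"     30 60
end

sort ID Begin

* Failure occurs at the end of a spell if the same subject's next spell is
* an employment spell; for a subject's last spell, Spelltyp[_n+1] is empty,
* so the indicator is 0 (censored)
by ID: generate byte Employed = Spelltyp[_n+1] == "Employed"

* Subjects are not at risk while employed, so those spells are left out
drop if Spelltyp == "Employed"

list ID Begin End Employed, sepby(ID)
```

For the reverse transition, from employment to unemployment, the same pattern would apply with the roles of the two states swapped.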
There are several ways to stset our data. The above dataset was stset in one of these possible ways. The proper stset syntax for the data, however, depends on the study design and assumptions. In what follows we provide guidance for selecting the appropriate stset command syntax. This is only a guide, and idiosyncrasies in your particular data may require more modifications or options.
There are two main questions that need to be answered to stset our data.
Question 1: When does the clock begin ticking?
If you want the “clock” to begin at time zero, then what we did above is correct. For calendar data, t=0 at 1/1/1960, but for the above data, t=0 at 0. The command we used was
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
If we want the “clock” to start ticking for each individual when the subject first enters unemployment, 10 for ID==102 and 0 for the others, then we need to specify origin().
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)
Stata will use as the time origin the earliest entry time per subject.
When origin() is not specified, Stata automatically sets the origin to zero and treats records with entry times greater than zero as left-truncated or delayed-entry observations. That is what we obtained with our original syntax.
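To see what origin() changes, one option is to stset the example data both ways and list the resulting _t0 and _t values. The sketch below writes the option in the explicit origin(time exp) form documented for stset. The clonevar copies (t0_noorigin and t_noorigin) are hypothetical helper names introduced here for the comparison.

```stata
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

* Clock starts at time zero; later entries are treated as delayed entry
stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
clonevar t0_noorigin = _t0
clonevar t_noorigin  = _t

* Clock starts at the origin given by Begin
stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(time Begin)

* Compare analysis times under the two declarations
list ID Begin End t0_noorigin t_noorigin _t0 _t _d, sepby(ID)
```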
Question 2: How do we want to handle each subject’s second, third, etc., observations?
If we want the clock to continue ticking for each individual from the first observation forward, then we can use the syntax we used in our example
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
or, depending on the answer to question 1,
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)
If, on the other hand, we want to reset the clock to zero or to the origin() for every observation, then we stset the data without specifying id(). The ID variable can be used later in the analysis to cluster the data and to produce robust standard errors.
. stset End, failure(Employed) time0(Begin) exit(time .)
or depending on the answer to question 1,
. stset End, failure(Employed) time0(Begin) exit(time .) origin(Begin)
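The effect of omitting id() can be checked by running stdescribe after each declaration. The sketch below does this for the example data. The stcox line is left as a comment because x is a hypothetical covariate that these data do not contain.

```stata
clear
input ID Begin End Employed
101  0  1 1
102 10 20 1
102 35 40 0
103  0 20 0
103 20 30 1
end

* With id(): records sharing an ID are treated as one subject
stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
stdescribe

* Remove the st settings before declaring the data again
stset, clear

* Without id(): every record is treated as its own subject
stset End, failure(Employed) time0(Begin) exit(time .)
stdescribe

* A later estimation command could then cluster on ID, for example
* (x is a hypothetical covariate, not part of these data):
*   stcox x, vce(cluster ID)
```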
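To summarize,
Clock starts at time zero and keeps ticking across each subject's observations:
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .)
Clock is reset to zero for every observation:
. stset End, failure(Employed) time0(Begin) exit(time .)
Clock starts at each subject's origin() and keeps ticking across the subject's observations:
. stset End, failure(Employed) time0(Begin) id(ID) exit(time .) origin(Begin)
Clock is reset to the origin() for every observation:
. stset End, failure(Employed) time0(Begin) exit(time .) origin(Begin)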