------------------------------
Date: Wed, 2 Sep 2009 11:20:59 +0100
From: "V. Martini" <[email protected]>
Subject: st: left censoring, survival analysis
Dear Stata Users,
I'm currently working on a sample of employed workers and I would like
to know how to deal with left censoring in my data.
My sample contains employed workers in 1998, therefore I observe one
spell per worker.
The end of each spell is identified using subsequent waves.
The beginning of each employment spell can be found because workers
have been asked when they started the current job.
But unfortunately, this information is not known for workers that
started their spell before 1979.
My interpretation is that these workers should be treated as left
censored, as the date when their spell begins is not known. I only
know that their spell started before 1979.
My questions are:
1) Can I deal with these data in Stata or should I remove the left
censored observations?
In my sample, left censoring and spell duration appear to be
positively correlated, so deleting these observations is likely to
affect inference.
2) Is the use of the module INTCENS (by Jamie Griffin) appropriate for
these data?
3) The guide to Survival Analysis by Cleves, Gould and Gutierrez
suggests that, even if possibly different in nature, left-censored data
can be treated mathematically as interval censored. In my case, I
would observe one interval for each worker (the intervals are very long
and transitions occur at the end of the interval).
Therefore, can I estimate my model using a probit / logit / cloglog
model?
4) Finally, would -stset-ting my data with only right censoring
indicated invalidate nonparametric analysis (specifically, KM and NA
estimates)?
Thanks,
Vinicio
=========================
Question 1
----------
If you don't know what the start date of a spell is ('left
censoring'), then you can't figure out elapsed duration. But most
survival analysis models model the hazard rate (or log duration) as a
function of elapsed duration. You're stuck. How does one cope with
the lost information? I can think of 5 approaches:
(1) somehow try and get the lost information (typically infeasible).
Or use assumptions to substitute for data:
(2) drop the left-censored spells -- the typical practice, at least in
social science. This is usually tempered with the worry that these
spells are typically relatively long, and so dropping them will lead
to a form of selection bias in estimates. (See e.g. paper by John
Iceland at http://www.psc.isr.umich.edu/pubs/pdf/rr97-378.pdf.)
(3) assume that the hazard rate is constant (exponential hazard model
for continuous time data; geometric for discrete time). In this case,
the process doesn't depend on elapsed duration, and you can use the
observed duration (it's a case in which left censoring turns into left
truncation) -- so problem solved. But of course the assumption of
constant hazard rate is likely to be unpalatable.
(4) suppose that the hazard is constant at all elapsed durations
greater than some threshold value T* (e.g. T* = 5 years), where T* is
chosen such that all left-censored spells are longer than T*. (You
have to decide for yourself whether this is feasible in your
situation.) Have a look at the article by Ann Huff Stevens ("Climbing
Out of Poverty, Falling Back In: Measuring the Persistence of Poverty
over Multiple Spells." Journal of Human Resources, Summer 1999.) She
compares this strategy with strategy #2. She has a discrete time model
and allows the baseline hazard to vary non-parametrically up to the
threshold T*, holding it constant thereafter. Beware that I
have not investigated this method in gory detail myself. (E.g. I
haven't checked how she ensures that she has the correct number of
person-months at risk of the event in her data set, nor how or
whether the method can allow for frailty.)
(5) integrate out over all possible start dates -- a very technically
demanding approach (to me, anyway) -- and hence not done too often.
For an example, see Gottschalk, Peter, and Robert A. Moffitt
(1994), 'Welfare dependence: concepts, measures, and trends', American
Economic Review, 84 (2), 38-42. Moffitt and Rendall have a paper that
does similar things, I recall. The idea is, crudely, that you write
down the probability (likelihood) of the spell conditional on the
spell starting at some specific date (t0, say), and then 'integrate'
out over all possible t0. This needs some assumptions about the
distribution of the t0, and modelling this usually uses auxiliary
information (another reason why it's not often done). See also, on
related matters, Steve Nickell's paper on unemployment duration in
Econometrica 1979.
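As a rough illustration of approaches (2) and (3) in Stata, here is a
minimal sketch. All variable names (id, leftcens, elapsed, obsdur,
exit, x1, x2) are hypothetical. I am assuming one record per worker,
durations measured in years, and a stock sample taken at the 1998
interview, so that spells with a known start date enter the risk set
late (at -elapsed-). The Weibull in the first model is only a
placeholder for whatever duration-dependent specification you prefer.

```stata
* Hypothetical variables (not from the original post):
*   id        worker identifier
*   leftcens  1 if the job started before 1979 (start date unknown)
*   elapsed   years in the job at the 1998 interview (missing if leftcens==1)
*   obsdur    years from the 1998 interview to job exit or last wave
*   exit      1 if the job was seen to end, 0 if right-censored
*   x1 x2     covariates

* Approach (2): drop left-censored spells; analysis time is total
* duration, with delayed entry at the duration elapsed by 1998.
preserve
drop if leftcens == 1
generate double totdur = elapsed + obsdur
stset totdur, failure(exit) enter(time elapsed) id(id)
streg x1 x2, distribution(weibull)
restore

* Approach (3): constant hazard, so only the time observed from 1998
* matters and the left-censored spells stay in the sample.
stset obsdur, failure(exit) id(id)
streg x1 x2, distribution(exponential)
```

Comparing the two sets of estimates gives a crude feel for how much
the treatment of the left-censored spells matters, though neither is
assumption-free.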
Question 2
----------
-intcens- is a nice module for estimating parametric survival analysis
models using interval-censored data (though it does not allow
time-varying covariates). For other approaches to interval-censored
data, see my Survival Analysis MS and Lessons at my website (URL
below).
Question 3
----------
I don't have the Cleves et al. reference to hand. But my response is,
in effect, already given in my response to Q2. (In short, yes, there
are approaches to modelling interval-censored data that utilize
-logit- and -cloglog- etc.)
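The usual route to -logit- or -cloglog- is to reorganise the data so
that there is one record per worker per interval at risk. A minimal
sketch, assuming hypothetical variables id, dur (total spell length in
whole years, at least 1), elapsed (whole years already completed at the
1998 interview), exit (1 if the job ended, 0 if right-censored) and
covariates x1 x2. It needs a known start date, so it does not by itself
rescue the left-censored spells:

```stata
* One record per worker per year of the spell (dur must be non-missing)
expand dur
bysort id: generate int j = _n

* Stock sample: keep only the years from the 1998 interview onwards
drop if j <= elapsed

* Event indicator: 1 only in the final year of a spell that ended
generate byte y = (exit == 1 & j == dur)

* Baseline hazard as a function of log(time); other shapes are possible
generate double lnj = ln(j)

cloglog y x1 x2 lnj
logit   y x1 x2 lnj
```

-cloglog- corresponds to a proportional hazards model for grouped
continuous-time data; -logit- gives a proportional odds specification.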
Question 4
----------
See response to Q1. Again, the issue is whether omission of
left-censored observations leads to bias or not.
Good luck
Stephen
-------------------------------------------------------------
Professor Stephen P. Jenkins <[email protected]>
Institute for Social and Economic Research
University of Essex, Colchester CO4 3SQ, U.K.
Tel: +44 1206 873374. Fax: +44 1206 873151.
http://www.iser.essex.ac.uk
Survival Analysis using Stata:
http://www.iser.essex.ac.uk/iser/teaching/module-ec968
Downloadable papers and software: http://ideas.repec.org/e/pje7.html