Hello Stata-listers.
First of all,
to Stephen Jenkins: you are great!
Thank you so much for your very helpful and very informative reply.
You pointed out issues I wasn't even aware of! Also, many thanks
for the references, and especially for making all that material
available on your website. Your lecture notes are excellent!
To Stephen and anybody else who wishes to help me out:
Stephen P. Jenkins wrote:
> On Wed, 27 Nov 2002 23:37:46 -0800 (PST) Enrica Croda
> <[email protected]> wrote:
>
> > Dear Stata-experts:
> >
> > I am a newbie at survival analysis, and I would appreciate
> > your help with properly setting up the dataset for the analysis
> > with Stata 7.
> >
> > I have annual data for 15 years from a panel household survey
> > on living arrangements of elderly households.
> >
> > My original data come in the form of an _unbalanced_ dataset,
> > where the data are organized by ID and year (iis ID, tis year
> > in xt-language).
> >
> > The data set covers the period 1984 through 1998 (15 years in
> > total), but elderly households are in my sample only if the head
> > of the household is older than 65. If they die or drop out of the
> > survey, I have no records for them after the death or drop-out.
>
> <snip>
>
> Ignoring repeated spell issues for the moment, ...
>
> The date at which people first become at risk of not living
> independently need not coincide with the first date at which they were
> observed in your panel. The first date is the one at which the survival
> time clock starts ticking (t=0 in expressions for S(t)), whereas the
> second date is relevant for "delayed entry" adjustments. In essence,
> when modelling the time to transition to dependent living using your
> type of data, you have to condition on 'survival' (remaining
> independent) between t=0 and the date at which first surveyed
> (assuming that living independently then).
>
> You might ease the problem by making the simplifying assumption that
> everyone lives independently up to age 65 at least (which might be
> reasonable for the majority of the population), and set t=0 for that
> age.
I am willing to make this assumption, for now at least. However, in the
actual data this may not always be true.
> Things are more complicated if not everyone is living
> independently at the start of the panel (though all the examples you
> showed us had indep==1 in first wave observed) -- the reason being that
> those already living dependently may be a non-random sample (a
> 'selection bias' sort of issue). [This may be relevant to your survey
> because, if it is the survey I think it is, then the spouses of
> household heads need not be 65 -- they may be younger or older.]
I think the survey is indeed the one you think it is (GSOEP) :)
As for the problem of the spouses' different ages, I treat the older
of the two spouses as the 'head/reference person'.
> An additional complication is differential mortality and attrition,
> which are presumably 'competing risks' with the hazard of not living
> independently. Again some initial progress can be made by assuming the
> competing risks are independent, and applying standard techniques.
Thanks, this is what I will do.
<snip>
> Finally, do you have the exact dates at which transitions are made, or
> just the survey year?
I only have the survey year.
> If it is the latter (as appears from your output), you have grouped
> duration data ('interval censoring'), in which case discrete time
> models may be a more appropriate way to proceed.
Thanks a lot for pointing this out!
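To make the discrete-time suggestion concrete, here is a minimal sketch of how such a model might be fit on records with one row per household per year. It is an illustration under stated assumptions, not a worked-out analysis: the toy data are made up, the names t, event, nprev and lnt are hypothetical, the log-time baseline hazard is only one possible specification, and delayed entry, covariates and repeated spells are all ignored.

```stata
* Toy data (made up): indep = 1 while the household lives independently
clear
input ID year indep
1 1984 1
1 1985 1
1 1986 0
1 1987 0
2 1984 1
2 1985 0
3 1984 1
3 1985 1
3 1986 1
3 1987 1
4 1984 1
4 1985 0
5 1984 1
5 1985 1
5 1986 1
end

sort ID year
* Interval counter within household (assumes no skipped waves)
by ID: generate t = _n
* Event indicator: 1 in a year observed not living independently
generate byte event = (indep == 0)
* Number of events before the current record; keep records up to and
* including the first event only (single-spell simplification)
by ID: generate nprev = sum(event) - event
drop if nprev > 0

* Duration dependence through log(t): an assumed specification
generate lnt = ln(t)
* Complementary log-log hazard model; -logit event lnt- is the usual alternative
cloglog event lnt
```

Each retained row is one household-year at risk, so the binary outcome model is estimating the probability of leaving independent living in year t given survival up to that year.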
<snip>
So, to recap, I now believe my data are grouped duration data...
I understand that in this case I need to organize my data in the so-called
"person-period" form.
I would appreciate getting feedback on the following:
My data are already organized by ID and year in "long" panel data
form (iis ID, tis year) with year = 1984, 1985,...1998.
A. Do I need to -expand- the data set? Am I correct in thinking
that I do not? I am thinking I just need to generate the analysis time
variable, with something like:
(A1) by ID: generate TIME = _n;
please see also question B, below.
B. How do I deal with delayed entry?
Assuming people first become at risk of not living independently at age 65,
which may not be the age at which they are first observed in my data,
how do I incorporate this information in my analysis?
I was thinking of defining as "analysis time" the variable NEWTIME,
generated as:
(B1) by ID: generate NEWTIME = _n + (age[1] - 65);
rather than (A1). Is this the correct way to proceed?
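To show what (A1) and (B1) produce, here is a small sketch on made-up records (written without the ; delimiter). The toy values and the name NEWTIME2 are hypothetical. NEWTIME2 is a variant I am only guessing may be needed: since the panel is unbalanced, _n counts records rather than years, so a skipped wave would not advance the clock, whereas a version based on the calendar year would.

```stata
* Toy data (made up): household 1 first seen at age 65; household 2
* first seen at age 70 and missing the 1986 wave
clear
input ID year age
1 1984 65
1 1985 66
1 1986 67
2 1984 70
2 1985 71
2 1987 73
end

sort ID year
* (A1): counts records, so the clock starts at 1 in the first wave observed
by ID: generate TIME = _n
* (B1): shifts the clock so that t = 1 corresponds to age 65
by ID: generate NEWTIME = _n + (age[1] - 65)
* Variant based on calendar year, so a skipped wave still advances the clock
by ID: generate NEWTIME2 = (year - year[1] + 1) + (age[1] - 65)

list ID year age TIME NEWTIME NEWTIME2
```

For household 2 in 1987, (B1) gives 8 while the year-based variant gives 9, which is the same as age - 64. The two versions agree whenever a household has no missing waves.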
C. Would the solution to question B be different if I plan to control for
age in the 'regression' analysis?
D. Do I still need to stset the variables?
Thank you very much in advance for any help.
Enrica
_________________________________________________________________________
Enrica Croda e-mail [email protected]
Department of Economics
UCLA
Box 951477 fax + (310) 825 9528
Los Angeles, CA 90095-1477 phone + (310) 267 5168