On Mon, 2 Dec 2002, Stephen P. Jenkins wrote:
> On Sun, 1 Dec 2002 03:07:48 -0800 (PST) Enrica Croda
> <[email protected]> wrote:
>
> <snip>
>
> > So, to recap, I now believe my data are grouped duration data...
> > I understand that in this case I need to organize my data in the so-called
> > "person-period" form.
> > I would appreciate getting feedback on the following:
> > My data are already organized by ID and year in "long" panel data
> > form (iis ID, tis year) with year = 1984, 1985,...1998.
> > A. Do I need to -expand- the data set?
> > I am thinking I just need to generate the analysis time
> > variable, with something like:
> > (A1) by ID: generate TIME = _n;
> > please see also question B, below.
> > B. How do I deal with delayed entry?
> > Assuming people first become at risk of not living independently at age 65,
> > which may not be the age at which they are first observed in my data,
> > how do I incorporate this information in my analysis?
>
> Suppose first that there is no delayed entry -- in which case you would
> need a row in the data set corresponding to each year that each person
> was /at risk of experiencing the event of interest/. If you were to
> assume the first year at risk corresponds to age 65, you need rows for
> each person for each year corresponding to age 65+. As the first survey
> year (1984 in GSOEP) is after age 65 for most persons, you
> would need to create new rows in the data corresponding to those ages
> before the beginning of the survey. The TIME variable starts with 1 for
> age 65, then 2 for age 66, and so on. [You would also need to 'spread'
> values for explanatory variables back onto these new person-year obs.]
> -expand- could probably be used to create the required data structure,
> making use of the -if- qualifier to ensure that the correct number of
> new person-year observations gets generated for each person. (As the
> respondents were of different ages in 1984, the number of new data rows
> will differ from person to person.)
>
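
The -expand- step described above could be sketched roughly as follows. This is only an illustration on toy data, not a tested recipe for GSOEP: ID, year and age follow the thread, while first, nback, ncopies, added and k are hypothetical helper names, and everyone is assumed to be at least 65 when first observed.

```stata
* Sketch only: toy data; helper variable names are hypothetical
clear
input ID year age
1101 84 78
1101 85 79
201  91 65
201  92 66
end

sort ID year
by ID: generate byte first = (_n == 1)

* years at risk before the first interview (0 if first seen at age 65)
generate nback = cond(first, age - 65, 0)

* -expand- keeps each original row and adds ncopies - 1 copies of it;
* the copies inherit the first-interview values of every variable
generate ncopies = nback + 1
expand ncopies, generate(added)

* step age and year back by one year per added copy
bysort ID added: generate k = _n if added
replace age  = age  - k if added
replace year = year - k if added

* analysis time: 1 at age 65, 2 at age 66, ...
generate TIME = age - 64
sort ID year
list ID year age TIME added, sepby(ID)
```

In this sketch the copies simply carry the first-interview values backwards, which is itself an assumption for anything that changes over time. Dropping the added rows again (drop if added) would leave only the observed person-years, with TIME no longer starting at 1 for people first seen after age 65.
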
Ideally, I would like to use some time-varying variables (e.g. income)
in the analysis. What would be the appropriate way to handle these
variables when I 'spread' them back onto the new person-years?

> Now, to control for the delayed entry aspect and get the likelihood
> correct, all you need do is create the data structure as just stated,
> but throw away the person-years corresponding to pre-1984 (first survey
> year). (Note that the duration counter TIME does not start from 1 in
> most cases in the delayed-entry version of the data set.)
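
One way to see why discarding the pre-survey person-years is enough is to write out the usual discrete-time likelihood contribution for one person. The notation here is assumed for illustration and is not taken from the thread: h_{ik} is person i's hazard in year k of analysis time, e_i is the first value of TIME actually observed, t_i is the last, and c_i = 1 if the event is observed in year t_i (0 if censored).

```latex
L_i \;=\; \frac{S_i(t_i)}{S_i(e_i - 1)}
      \left( \frac{h_{i t_i}}{1 - h_{i t_i}} \right)^{c_i}
    \;=\; \left( \frac{h_{i t_i}}{1 - h_{i t_i}} \right)^{c_i}
      \prod_{k = e_i}^{t_i} \left( 1 - h_{i k} \right),
\qquad
S_i(t) = \prod_{k = 1}^{t} \left( 1 - h_{i k} \right).
```

Conditioning on survival up to the year before entry cancels the terms for k < e_i, so only the rows from e_i onwards contribute; with e_i = 1 this reduces to the case without delayed entry.
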
I am afraid I am still missing something. Please forgive me if this is a
silly question. If I understand correctly, the only variable I really
need is the appropriate 'analysis time' counter. I will throw away all the
records generated through -expand-. Correct?
If this is correct, could I accomplish the same goal by not expanding at
all, and using NEWTIME rather than TIME as 'analysis time', where NEWTIME
is generated as follows:

    #delimit ;
    by ID: generate newtime = _n + (age[1] - 66);
    label variable newtime "analysis time";
    by ID: generate agediff = (age[1] - 65) if year==84;
    label variable agediff "age-65 in 1984";
    by ID: generate ageflag = agediff[1] if (agediff[1]~=.);
    label variable ageflag "auxiliary var";
    by ID: replace newtime = _n if ageflag==.;

Here is a listing of what I get with this code:

       ID   year   age   newtime
      201     91    65         1
      201     92    66         2
      201     93    67         3
      201     94    68         4
      201     95    69         5
      201     96    70         6
      201     97    71         7
      201     98    72         8

     1101     84    78        13
     1101     85    79        14
     1101     86    80        15
     1101     87    81        16
     1101     88    82        17
     1101     89    83        18
     1101     90    84        19
     1101     91    85        20
     1101     92    86        21
     1101     93    87        22
     1101     94    88        23
     1101     95    89        24
     1101     96    90        25
     1101     97    91        26
     1101     98    92        27

    20302     87    65         1
    20302     88    66         2
    20302     89    67         3
    20302     90    68         4
    20302     91    69         5
    20302     94    72         6
    20302     95    73         7
    20302     96    74         8
    20302     97    75         9
    20302     98    76        10
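
A possible cross-check on this listing, offered as a sketch rather than a definitive fix: if analysis time is meant to equal 1 at age 65, as in the quoted reply, the counter could be taken straight from age instead of from _n. This assumes age is recorded in whole years and rises by one per survey year; the rows are copied from the listing, and newtime2 is a hypothetical name.

```stata
* Sketch only: four rows copied from the listing; newtime2 is hypothetical
clear
input ID year age newtime
201   91 65  1
1101  84 78 13
20302 91 69  5
20302 94 72  6
end

* analysis time tied to age: 1 at age 65, 2 at age 66, ...
generate newtime2 = age - 64

list ID year age newtime newtime2, sepby(ID)
```

Computed this way, ID 1101 starts at 14 rather than 13 in 1984, and ID 20302 goes from 5 in 1991 to 8 in 1994, so the two survey years with no record (1992 and 1993) still count towards analysis time. Whether that is the desired treatment of gaps is a separate modelling question.
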
> All this is
> discussed in those lecture notes you cited, together with regression
> models that you could apply once the data have been created.
>
Thanks! Your lecture notes are indeed extremely helpful (I also got your
1995 article in the Oxford Bulletin of Economics and Statistics), and I
think I understand what to do for the estimation part of the project.
It is the preparation of the data set for the analysis that I still find
complicated. (This is the first time I have done duration analysis.)
> > C. Would the solution to question B be different if I plan to control for
> > age in the 'regression' analysis?
>
> Given the way you have defined your time-at-risk variable (in terms of
> age), wouldn't "age" as an explanatory variable be perfectly correlated
> with TIME?
>
Yes, it would! Thanks for pointing it out!
<snip>
Thank you very much for all your help!
Enrica