Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: -svy:cloglog-

From	"Muhuri, Pradip (SAMHSA/CBHSQ)" <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: -svy:cloglog-
Date	Sun, 18 Aug 2013 11:44:38 +0000

Dear Statalist,

This is in response to Steve's recommendation to use -stpm2- (SSC) by Patrick Royston for my analysis of the complex survey designed-based National Health Interview Survey Linked Mortality file (NHIS-LMF) data 
(http://www.cdc.gov/nchs/data_access/data_linkage/mortality/nhis_linkage.htm), which include the "quarter and year" for the time of interview and the time of death (e.g. quarter 1, 2001) and the "exact date" for censored cases. 

I appreciated receiving Steve's comments and wanted to thank him for his time.

Below are my specific responses and observations.

1) I am not sure whether the -svy- prefix accepts the -stmp2- command. If it does not and  given that the event date is expressed in terms of quarter /year in the NHIS-LMF data, the complementary log log (CLL) regression analysis (-cloglog-) on "spell" data using Professor's Jenkins' resources https://www.iser.essex.ac.uk/files/teaching/stephenj/ec968/pdfs/ec968st6.pdf
), which Steve initially recommended (http://www.stata.com/statalist/archive/2013-07/msg01158.html), could still be a better alternative to the proportional hazards model approach.  

2)  From the NHIS-LMF citation list (http://www.cdc.gov/nchs/data/datalinkage/nhis_mort_cit_list.pdf), I find several published articles that employed Cox proportional hazards models although the time scale included in the data files is not continuous (the event date is the quarter/year scale).

3) I agree with Steve: measurement errors associated with "attained age" in the NHIS-LMF data are real. However, whether I use "attained age" or "age at interview", the death rates by 10-year age group in the -svy mean- analysis or the hazard rates from the -margins- command after -cloglog- for the same 10-category age variable in the CLL model differ little by the age construct used. Results have not been shown here.

4) Some comparison: Model-free death rates (-svy means-) and CLL model-based death rates (-margins-)

	a) The death rates by 10-year attained-age group, calculated from the 1997-2006 NHIS-LMF 	person-year "spells" data  (shown below), seem to be close to the death rates published 	for 2011 (Table 1, Page 8 http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_06.pdf , also 	shown below).

	b) From the 1997-2006 NHIS-LMF data, the CLL model-based death rates are about the same as 	the rates from -svy- means presented below.

5) Steve's idea of using  the "split validation sample approach" is a good one.

6) Constructing the date of NHIS interview from the assignment week could be very different from the date constructed from quarter and year of interview.  Some interviews assigned in the last week of a given quarter might have actually taken place in the following quarter.

Any further comments and thoughts are welcome.

Thanks,

Pradip K. Muhuri

*************************  LOG BEGINS - Details for 4a *******************************

. svyset psu [pweight=wt8], strata (stratum) vce(linearized) singleunit(missing)

      pweight: wt8
          VCE: linearized
  Single unit: missing
     Strata 1: stratum
         SU 1: psu
        FPC 1: <zero>

.      svy: mean dead100k, over (xa_age_grp)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =     339      Number of obs    =     1337154
Number of PSUs   =     678      Population size  =  1088280599
                                Design df        =         339

    _subpop_1: xa_age_grp = 25-34 Yrs
    _subpop_2: xa_age_grp = 35-44 Yrs
    _subpop_3: xa_age_grp = 45-54 Yrs
    _subpop_4: xa_age_grp = 55-64 Yrs
    _subpop_5: xa_age_grp = 65-74 Yrs
    _subpop_6: xa_age_grp = 75-84 Yrs
//This line is my addition to the output: 1997-2006 (NHIS-LMF Data)
--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
dead100k     |
   _subpop_1 |   78.66841   6.446016      65.98919    91.34764
   _subpop_2 |    163.239   7.328281      148.8244    177.6536
   _subpop_3 |   383.6766   12.99239      358.1208    409.2325
   _subpop_4 |     844.95   24.33145      797.0903    892.8096
   _subpop_5 |   2056.149   41.04622      1975.411    2136.886
   _subpop_6 |   4644.522   71.03923      4504.788    4784.255

NCHS, Death Rates per 100,000 for 2011 (url cited above)

25-34 years    	104.4 
35-44 years    	171.7
45-54 years   	409.2
55-64 years   	848.7
65-74 years     1,845.0
75-84 years     4,750.3

********************** LOG CONTINUES - Details for 4b *******************************

. capture noisily svy, subpop(if a_age>=25 & a_age<=84): cloglog dead i.xa_age_grp , eform nolog

margins, at(xa_age_grp= (1 2 3 4 5 6))

Adjusted predictions                              Number of obs   =    1337154
Model VCE    : Linearized

Expression   : Pr(dead), predict()

1._at        : xa_age_grp      =           1
2._at        : xa_age_grp      =           2
3._at        : xa_age_grp      =           3
4._at        : xa_age_grp      =           4
5._at        : xa_age_grp      =           5
6._at        : xa_age_grp      =           6

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   .0007867   .0000645    12.20   0.000     .0006603     .000913
          2  |   .0016324   .0000733    22.28   0.000     .0014888     .001776
          3  |   .0038368   .0001299    29.53   0.000     .0035821    .0040914
          4  |   .0084495   .0002433    34.73   0.000     .0079726    .0089264
          5  |   .0205615   .0004105    50.09   0.000      .019757     .021366
          6  |   .0464452   .0007104    65.38   0.000     .0450529    .0478376

************************** LOG ENDS HERE ******************************************




-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Steve Samuels
Sent: Saturday, August 17, 2013 4:50 PM
To: [email protected]
Subject: Re: st: -svy:cloglog-

Pradip:

I've now concluded that the grouped data approach has serious flaws for your data and that interval-censored estimation is better. 
Here are my best recommendations:

* Use a computer with more memory. 

* Abandon "attained age" as the time variable. Use time from interview instead. Because you don't know birth day,you introduce serious measurement error. You really do not know how to assign age to each interval after followup. Someone who is 54 at interview could turn 55 the next day, or 55 in 364 days. The absolute average error in assigning attained age is > 6 months, which is about 4 x the average error for the proposal below. Use age at interview as a baseline covariate.

* Use a split validation sample approach: develop a model on a test sample of 50%, and evaluate it in the "validation" sample. 

* Use -stpm- (SSC) by Patrick Royston. It can accommodate interval-censored, weighted, clustered data. It is documented in Royston, Patrick. 2001. Flexible parametric alternatives to the Cox model, and more. Stata Journal 1, no. 1: 1-28, which is available for free download at: http://www.stata-journal.com/article.html?article=st0001

* Before you post, reread the Statalist FAQ Section 3. If you have a question, try to find the answer yourself (-help-, manual, -search-,
-google-) before posting.

Now for the details:

1. Define the earliest possible date of interview from the assignment week variable in your data, as I suggested in http://www.stata.com/statalist/archive/2013-08/msg00253.html.
Call it idate

2. For deaths, define the earliest possible date of death as the first day of the quarter of death. Define latest possible date of death as the last day of the quarter. Call these ddate1 and ddate2 respectively.

You'll have to create Stata "quarterly dates" first, then get the first and last day of each. If you can't figure out how to do this from the Manual, ask the list.

3. Create the earliest and latest possible date of death, as: 
t1 = ddate1 -idate, t2 = ddtate2 - idate

4. For uncensored observations, define t2 = last day of the quarter.

This puts the units for your outcome in days.

Here is an example of -stpm- in action. Note that every failure is interval-censored. In your own use, you might have to try different knot, degree of freedom, or technique() options


Steve

/*********CODE BEGINS**********************/ set more off sysuse auto, clear gen t1 = price //lower point of censoring interval set seed 0314 gen t2 = t1 + 1 + ceil(20*runiform()) //upper end point stset t2 [pw = rep78], failure(foreign)

gen str2 mkr = substr(make,1,2)
egen psu = group(mkr)
stpm  price ,scale(hazard) cluster(psu) ///
  left(t1) df(3)
/***********CODE ENDS*********************/ 


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- st: Interval-censored analysis vs grouped analysis was: -cloglog
  - From: Steve Samuels <[email protected]>

References:
- st: Predicting y(i)--margins or predict?
  - From: [email protected]
- Re: st: Predicting y(i)--margins or predict?
  - From: Maarten Buis <[email protected]>
- st: -svy:cloglog- and margins
  - From: "Muhuri, Pradip (SAMHSA/CBHSQ)" <[email protected]>
- Re: st: -svy:cloglog-
  - From: Steve Samuels <[email protected]>

Prev by Date: Re: st: Formula for standard error of predicted probability in logit model
Next by Date: st: RE: xtabond2 estimation and observations used
Previous by thread: Re: st: -svy:cloglog-
Next by thread: st: Interval-censored analysis vs grouped analysis was: -cloglog
Index(es):
- Date
- Thread