RE: st: -svy:cloglog-
From
"Muhuri, Pradip (SAMHSA/CBHSQ)" <[email protected]>
To
"[email protected]" <[email protected]>
Subject
RE: st: -svy:cloglog-
Date
Sun, 18 Aug 2013 11:44:38 +0000
Dear Statalist,
This is in response to Steve's recommendation to use -stpm- (SSC) by Patrick Royston for my design-based analysis of the complex-survey National Health Interview Survey Linked Mortality File (NHIS-LMF) data (http://www.cdc.gov/nchs/data_access/data_linkage/mortality/nhis_linkage.htm). The data record the "quarter and year" of interview and of death (e.g., quarter 1, 2001) and the "exact date" for censored cases.
I appreciated receiving Steve's comments and wanted to thank him for his time.
Below are my specific responses and observations.
1) I am not sure whether the -svy- prefix accepts the -stpm- command. If it does not, and given that the event date in the NHIS-LMF data is expressed as quarter/year, complementary log-log (CLL) regression (-cloglog-) on "spell" data, following Professor Jenkins' resources (https://www.iser.essex.ac.uk/files/teaching/stephenj/ec968/pdfs/ec968st6.pdf), which Steve initially recommended (http://www.stata.com/statalist/archive/2013-07/msg01158.html), could still be a better alternative to the proportional hazards model approach.
2) In the NHIS-LMF citation list (http://www.cdc.gov/nchs/data/datalinkage/nhis_mort_cit_list.pdf), I find several published articles that used Cox proportional hazards models even though the time scale in the data files is not continuous (the event date is recorded by quarter and year).
3) I agree with Steve: measurement errors associated with "attained age" in the NHIS-LMF data are real. However, the death rates by 10-year age group from -svy: mean-, and the hazard rates from -margins- after -cloglog- with the same grouped age variable in the CLL model, differ little whether I use "attained age" or "age at interview". These results are not shown here.
4) Some comparisons: model-free death rates (-svy: mean-) and CLL model-based death rates (-margins-)
a) The death rates by 10-year attained-age group, calculated from the 1997-2006 NHIS-LMF person-year "spells" data (shown below), seem to be close to the death rates published for 2011 (Table 1, page 8, http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_06.pdf; also shown below).
b) From the 1997-2006 NHIS-LMF data, the CLL model-based death rates are about the same as the rates from -svy: mean- presented below.
5) Steve's idea of using the "split validation sample approach" is a good one.
6) The NHIS interview date constructed from the assignment week could differ substantially from the date constructed from the quarter and year of interview. Some interviews assigned in the last week of a given quarter might actually have taken place in the following quarter.
Any further comments and thoughts are welcome.
Thanks,
Pradip K. Muhuri
************************* LOG BEGINS - Details for 4a *******************************
. svyset psu [pweight=wt8], strata (stratum) vce(linearized) singleunit(missing)
pweight: wt8
VCE: linearized
Single unit: missing
Strata 1: stratum
SU 1: psu
FPC 1: <zero>
. svy: mean dead100k, over (xa_age_grp)
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 339 Number of obs = 1337154
Number of PSUs = 678 Population size = 1088280599
Design df = 339
_subpop_1: xa_age_grp = 25-34 Yrs
_subpop_2: xa_age_grp = 35-44 Yrs
_subpop_3: xa_age_grp = 45-54 Yrs
_subpop_4: xa_age_grp = 55-64 Yrs
_subpop_5: xa_age_grp = 65-74 Yrs
_subpop_6: xa_age_grp = 75-84 Yrs
//This line is my addition to the output: 1997-2006 (NHIS-LMF Data)
--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
dead100k     |
   _subpop_1 |   78.66841   6.446016      65.98919    91.34764
   _subpop_2 |    163.239   7.328281      148.8244    177.6536
   _subpop_3 |   383.6766   12.99239      358.1208    409.2325
   _subpop_4 |     844.95   24.33145      797.0903    892.8096
   _subpop_5 |   2056.149   41.04622      1975.411    2136.886
   _subpop_6 |   4644.522   71.03923      4504.788    4784.255
--------------------------------------------------------------
NCHS, Death Rates per 100,000 for 2011 (url cited above)
25-34 years      104.4
35-44 years      171.7
45-54 years      409.2
55-64 years      848.7
65-74 years    1,845.0
75-84 years    4,750.3
********************** LOG CONTINUES - Details for 4b *******************************
. capture noisily svy, subpop(if a_age>=25 & a_age<=84): cloglog dead i.xa_age_grp , eform nolog
. margins, at(xa_age_grp=(1 2 3 4 5 6))
Adjusted predictions Number of obs = 1337154
Model VCE : Linearized
Expression : Pr(dead), predict()
1._at : xa_age_grp = 1
2._at : xa_age_grp = 2
3._at : xa_age_grp = 3
4._at : xa_age_grp = 4
5._at : xa_age_grp = 5
6._at : xa_age_grp = 6
------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   .0007867   .0000645    12.20   0.000     .0006603     .000913
          2  |   .0016324   .0000733    22.28   0.000     .0014888     .001776
          3  |   .0038368   .0001299    29.53   0.000     .0035821    .0040914
          4  |   .0084495   .0002433    34.73   0.000     .0079726    .0089264
          5  |   .0205615   .0004105    50.09   0.000      .019757     .021366
          6  |   .0464452   .0007104    65.38   0.000     .0450529    .0478376
------------------------------------------------------------------------------
************************** LOG ENDS HERE ******************************************
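A note on reading 4a against 4b: the -margins- results are predicted probabilities of death per person-year spell, so multiplying by 100,000 puts them on the scale of dead100k (for the first age group, .0007867 x 100,000 = 78.67, against 78.66841 from -svy: mean-). Because the CLL model here contains only the age-group indicators, it is saturated in xa_age_grp, and its fitted probabilities should reproduce the weighted group means, so the close agreement is expected by construction. A minimal sketch of the rescaling, assuming the -svy: cloglog- fit above is still the active estimation result:

```stata
* Assumes the -svy: cloglog- model from the log above is still in memory.
* Rescale the predicted Pr(dead) to deaths per 100,000 person-year spells.
margins, at(xa_age_grp=(1 2 3 4 5 6)) expression(100000*predict(pr))

* Hand check for the first age group, using the margin reported above.
display 100000*.0007867
```

The agreement would no longer be automatic once other covariates enter the model.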
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Steve Samuels
Sent: Saturday, August 17, 2013 4:50 PM
To: [email protected]
Subject: Re: st: -svy:cloglog-
Pradip:
I've now concluded that the grouped data approach has serious flaws for your data and that interval-censored estimation is better.
Here are my best recommendations:
* Use a computer with more memory.
* Abandon "attained age" as the time variable. Use time from interview instead. Because you don't know the birth date, attained age introduces serious measurement error: you really do not know how to assign age to each interval after follow-up. Someone who is 54 at interview could turn 55 the next day or in 364 days. The average absolute error in assigning attained age is > 6 months, which is about 4 x the average error for the proposal below. Use age at interview as a baseline covariate.
* Use a split validation sample approach: develop a model on a test sample of 50%, and evaluate it in the "validation" sample.
* Use -stpm- (SSC) by Patrick Royston. It can accommodate interval-censored, weighted, clustered data. It is documented in Royston, Patrick. 2001. Flexible parametric alternatives to the Cox model, and more. Stata Journal 1, no. 1: 1-28, which is available for free download at: http://www.stata-journal.com/article.html?article=st0001
* Before you post, reread the Statalist FAQ Section 3. If you have a question, try to find the answer yourself (-help-, manual, -search-, -google-) before posting.
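The split-sample recommendation above could be set up along these lines. This is only a sketch on toy data: the person identifier publicid, the flag devsample, and the seed are hypothetical names and values, not taken from the NHIS-LMF files. The split is made at the person level so that all spells for one person fall in the same half.

```stata
* Toy spell data: 3 persons with 2 person-year spells each (hypothetical).
clear
set obs 6
gen long publicid = ceil(_n/2)

set seed 20130817                        // arbitrary seed, for reproducibility
egen byte first = tag(publicid)          // flag one spell per person
gen double u = runiform() if first       // one random draw per person
bysort publicid (u): replace u = u[1]    // copy the draw to all of the person's spells
gen byte devsample = u < 0.5             // 1 = development half, 0 = validation half
```

With -svy-, each half would typically be analyzed through subpop(), for example subpop(if devsample == 1), rather than by dropping observations, so that the design information stays intact.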
Now for the details:
1. Define the earliest possible date of interview from the assignment week variable in your data, as I suggested in http://www.stata.com/statalist/archive/2013-08/msg00253.html. Call it idate.
2. For deaths, define the earliest possible date of death as the first day of the quarter of death. Define latest possible date of death as the last day of the quarter. Call these ddate1 and ddate2 respectively.
You'll have to create Stata "quarterly dates" first, then get the first and last day of each. If you can't figure out how to do this from the Manual, ask the list.
3. Create the earliest and latest possible times from interview to death, as:
t1 = ddate1 - idate, t2 = ddate2 - idate
4. For uncensored observations, define t2 = last day of the quarter.
This puts the units for your outcome in days.
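Steps 2 and 3 for deaths can be sketched as follows. The variable names dodyear and dodqtr (year and quarter of death) and the toy values are hypothetical; idate is assumed to be a Stata daily date built as in step 1. The sketch covers deaths only and does not address how censored cases should be handled.

```stata
* Toy data: year and quarter of death plus an interview date (all hypothetical).
clear
input dodyear dodqtr str9 idate_s
2001 1 "15jun1999"
2003 4 "02feb2000"
end
gen idate = date(idate_s, "DMY")     // daily date of interview

gen dq     = yq(dodyear, dodqtr)     // Stata quarterly date of death
gen ddate1 = dofq(dq)                // first day of the quarter of death
gen ddate2 = dofq(dq + 1) - 1        // last day: day before the next quarter starts
format dq %tq
format idate ddate1 ddate2 %td

gen t1 = ddate1 - idate              // earliest possible time to death, in days
gen t2 = ddate2 - idate              // latest possible time to death, in days
list
```

Here yq() builds the quarterly date and dofq() returns the daily date of the first day of a quarter, so the last day is obtained from the start of the following quarter.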
Here is an example of -stpm- in action. Note that every failure is interval-censored. In your own use, you might have to try different knot, degrees-of-freedom, or technique() options.
Steve
/*********CODE BEGINS**********************/
set more off
sysuse auto, clear
gen t1 = price //lower point of censoring interval
set seed 0314
gen t2 = t1 + 1 + ceil(20*runiform()) //upper end point
stset t2 [pw = rep78], failure(foreign)
gen str2 mkr = substr(make,1,2)
egen psu = group(mkr)
stpm price ,scale(hazard) cluster(psu) ///
left(t1) df(3)
/***********CODE ENDS*********************/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*