I see nothing wrong with the data generation steps you performed,
so the question is whether this model makes sense.
First, I will speculate that you have brand-specific prices at
the time of each wave. Since cigarette prices tend to rise
fairly uniformly across brands over time, whether from
manufacturer increases that track inflation or from government
tax increases, there is almost certainly a meaningful correlation
between wave and price. Thus, having both a "price" variable and
one or more "wave" variables makes the individual coefficients
hard to interpret.
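One quick way to see how strongly price and wave move together is
sketched below. It assumes the expanded person-wave data from your
setup code are in memory, with rPSPPPi, wave, wave2 and wave3
defined as in your post; it ignores the survey weights, so treat it
as a rough check only.

```stata
* Sketch only: assumes the expanded person-wave data are in memory
* and that rPSPPPi, wave, wave2 and wave3 exist as in the setup code.

* Distribution of the price variable within each wave
tabstat rPSPPPi, by(wave) statistics(mean sd min max)

* Unweighted correlation of price with the wave indicators
correlate rPSPPPi wave2 wave3
```

If the wave means differ by much more than the within-wave standard
deviations, most of the variation in price is between waves rather
than between brands.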
In this model, the "wave2" variable can be thought of as estimating
the average quit rate differential from the omitted wave (wave 1),
and this includes an average price differential effect. Likewise,
"wave3" estimates the average quit rate differential of wave 3 from
wave 1.
So what does "price" itself estimate in this model? I'd speculate
it really only estimates how specific brands affect quitting.
In your logit model, I'd guess that it indicates that subjects
who smoke higher-than-average-priced brands quit at a lower rate.
Said differently, those who smoke low-priced brands are more likely
to quit due to a price increase. However, without knowing exactly
what your variables represent, I can't go beyond speculation.
I'm less clear on why the price coefficient remains negative when
you take the wave variables out. If the effect is real, it implies
that the price differential (if it truly has a positive effect on
quitting) wasn't great enough to overcome other, competing but
correlated factors (not captured by any other variable in the
model) that caused smokers to continue smoking during this time
period. If so, price acts as a proxy for the increase in ALL of
these factors, and the ones favoring continued smoking dominated
the result.
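To put the two price coefficients side by side, something like the
following could be run from a do-file. It is a sketch only: it
assumes the data have already been -svyset-, and since I don't know
exactly how you specified the model without the wave variables, the
second command (with a constant) is my assumption. The names
with_waves and no_waves are just labels I made up.

```stata
* Sketch only: assumes -svyset- has been run and the variables
* are named as in the post. Run from a do-file (uses a local macro).
local xvars male age married white mdrt_educ high_educ incm_mdrt incm_high canada

* Model as posted: price plus wave indicators, no constant
svy: logit qtsmok `xvars' rPSPPPi wave2 wave3, noconstant
estimates store with_waves

* Assumed alternative: wave indicators dropped, constant included
svy: logit qtsmok `xvars' rPSPPPi
estimates store no_waves

* Price and wave coefficients from both fits, with standard errors
estimates table with_waves no_waves, keep(rPSPPPi wave2 wave3) b(%9.4f) se(%9.4f)
```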
On a different issue, using or not using the svy: prefix can be
expected to change the estimated coefficients, so no particular
importance should be placed on the fact that a coefficient changed
sign between the two fits. Without the prefix, you are estimating
what happened for the specific group of subjects surveyed in this
study. When you add the weighting via the svy: prefix, you change
the importance of those individual subjects based on their sampling
weights.
For example, you may have surveyed specific subjects who quit
but represent only a very, very small part of the overall population.
If you don't use the survey weights, their behavior may have
a large effect on the sample results but little effect on the
population results, even to the point of sign reversal.
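The weighted and unweighted fits can be compared directly. The
sketch below assumes the data have already been -svyset- and that
the variables are named as in your post; "unweighted" and
"weighted" are just labels I chose for the stored estimates.

```stata
* Sketch only: assumes -svyset- has been run and the variables
* are named as in the post. Run from a do-file (uses a local macro).
local xvars male age married white mdrt_educ high_educ incm_mdrt incm_high canada

* Sample-only fit: every surveyed subject counts equally
logit qtsmok `xvars' rPSPPPi wave2 wave3, noconstant
estimates store unweighted

* Design-based fit: subjects count according to their sampling weights
svy: logit qtsmok `xvars' rPSPPPi wave2 wave3, noconstant
estimates store weighted

* Coefficients and standard errors side by side
estimates table unweighted weighted, b(%9.4f) se(%9.4f)
```

Coefficients that move a lot between the two columns point to
subjects whose weights are far from average.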
On yet another issue, marking pattern SQS as a successful "quit"
seems possibly misleading. If price continued to rise over the
period between waves (which seems likely to me), prices were
higher in wave 3 than in wave 2, yet these individuals started
smoking again. This suggests that price was not the most important
motivating factor for quitting in wave 2 (or restarting in wave 3).
One can argue that you should code these subjects as at risk for
all three waves and as failing to quit.
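That alternative coding could be set up roughly as follows. This is
only a sketch of the idea, not a replacement for your full setup:
it assumes it is run on the original one-record-per-subject data
(before any -expand-), that none of the new variables exist yet,
and that the default command delimiter is in effect. Covariates
that vary by wave, such as price, would still have to be attached
the way you do it now.

```stata
* Sketch only: run on the one-record-per-subject data, before -expand-.
* smok_stat: 1 = SSS, 2 = SSQ, 3 = SQS, 4 = SQQ (as defined in the post).

* SSS, SSQ and SQS are at risk in all three waves;
* SQQ leaves the risk set after quitting in wave 2.
gen smk_time = 3 if inlist(smok_stat, 1, 2, 3)
replace smk_time = 2 if smok_stat == 4

expand smk_time
bysort uniqid: gen wave = _n

* Only SSQ and SQQ count as quits, and only on the subject's last record;
* SQS subjects now contribute three records with qtsmok equal to 0.
bysort uniqid: gen qtsmok = inlist(smok_stat, 2, 4) & _n == _N

gen wave2 = wave == 2
gen wave3 = wave == 3

* Check the number of quits recorded in each wave
tabulate wave qtsmok
```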
Tom
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Lili Yan
Sent: Thursday, October 18, 2007 2:25 PM
To: [email protected]
Subject: Re: st: help needed on discrete-time hazard model
Hi Thomas,
Thank you very much for helping out!
I know little about this model, so I thought the two zeros indicated
something wrong in the data. I am sure that e(N) is correct.
Here is some of the code for setting up the data. I should explain
first that smok_stat = 1 for SSS, 2 for SSQ, 3 for SQS and 4 for SQQ.
................codes start here................
gen smk_time=3 if smok_stat==1 | smok_stat==2;
replace smk_time=2 if smok_stat==3 | smok_stat==4;
gen cessyear=2004 if smok_stat==1;
replace cessyear=2004 if smok_stat==2;
replace cessyear=2003 if (smok_stat==3 | smok_stat==4);
expand smk_time;
bysort uniqid: gen seqvar=_n;
bysort uniqid: gen qtsmok=smok_stat>1 & _n==_N;
bysort uniqid: gen evntyear=cessyear;
replace evntyear=2002 if seqvar==1;
replace evntyear=2003 if seqvar==2;
drop cessyear;
rename evntyear cessyear;
gen wave=1 if cessyear==2002;
replace wave=2 if cessyear==2003;
replace wave=3 if cessyear==2004;
gen wave1=wave==1;
gen wave2=wave==2;
gen wave3=wave==3;
svy: logit qtsmok male age married white mdrt_educ high_educ incm_mdrt
incm_high canada rPSPPPi wave2 wave3, noconstant;
...............codes end here..........
Here is the output:
..............output starts here................
Survey: Logistic regression
Number of strata   =        26                 Number of obs      =       5642
Number of PSUs     =      5642                 Population size    =  5773.9291
                                               Design df          =       5616
                                               F(  12,   5605)    =     166.35
                                               Prob > F           =     0.0000
------------------------------------------------------------------------------
             |             Linearized
      qtsmok |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.1715913   .1273081    -1.35   0.178    -.4211643    .0779817
         age |  -.0326805   .0053098    -6.15   0.000    -.0430898   -.0222713
     married |   .0156776   .1427494     0.11   0.913    -.2641663    .2955215
       white |  -.5607068   .1443603    -3.88   0.000    -.8437088   -.2777048
   mdrt_educ |  -.0291425   .1441877    -0.20   0.840    -.3118061    .2535212
   high_educ |   .5113156   .1800797     2.84   0.005     .1582899    .8643414
   incm_mdrt |  -.0339146   .1557743    -0.22   0.828    -.3392925    .2714632
   incm_high |   .1405313   .1766122     0.80   0.426    -.2056968    .4867595
      canada |   1.802811   .2552666     7.06   0.000      1.30239    2.303233
     rPSPPPi |  -.0083975    .000842    -9.97   0.000    -.0100481   -.0067468
       wave2 |   2.111112   .1326945    15.91   0.000     1.850979    2.371244
       wave3 |   2.411039   .1389374    17.35   0.000     2.138668     2.68341
------------------------------------------------------------------------------
....................output ends here..............
rPSPPPi is our price variable. We have more price variables, but
the logit results with them are similar to what is reported here.
Thank you very much!
Lili
On 10/18/07, Steichen, Thomas J. <[email protected]> wrote:
> Why do you consider this an indication of something wrong?
>
> Having zero completely determined successes e(N_cds) and failures
> e(N_cdf) is what you prefer.
>
> Is your overall # of records e(N) wrong?
>
> Show us some sample commands and output so we can see what you are doing.
>
>
> -----Original Message-----
>
> I checked the data just now. After running the logit model with our
> dependent variable, the stored results show:
>
> e(N) = 5463
> e(N_cds) = 0
> e(N_cdf) = 0
>
> So it seems there is something wrong in the data setup. Could anyone
> please give me some help?
>
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>