Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)
From
Ángel Rodríguez Laso <[email protected]>
To
[email protected]
Subject
Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)
Date
Wed, 24 Jul 2013 15:40:27 +0200
Thank you, Steve, but the different behavior is not due to the use of
-quietly-. When I used quietly I was using -svylogitgof- instead of
-estat gof-.
Best regards,
Angel Rodriguez-Laso
2013/7/23 Steve Samuels <[email protected]>:
> I suggest that you notify Stata support of the different behavior of -estat gof- and subpop() with and without "quietly"
>
> Steve
> On Jul 23, 2013, at 9:32 AM, Ángel Rodríguez Laso wrote:
>
> Dear Stas and Steve,
>
> Yes, I'm aware of the limitations of using if instead of subpop. And I
> think that there shouldn't be big differences in estat gof in my case
> between svy and non svy commands because standard errors are not that
> different in both cases (there was a mistake in my previous email
> becuause I didn't include [pweight=pond_nacional] in the logistic
> version). Now:
>
> . svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 &
> proxy==2 & edad_c>=60): logistic discAVD edad_c i.sexo i.
>> estud4 i.difinmes3
> (running logistic on estimation sample)
>
> Survey: Logistic regression
>
> Number of strata = 41 Number of obs = 1727
> Number of PSUs = 234 Population size = 1347,0862
> Subpop. no. of obs = 710
> Subpop. size = 563,75
> Design df = 193
> F( 7, 187) = 8,32
> Prob > F = 0,0000
>
> ------------------------------------------------------------------------------
> | Linearized
> discAVD | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
> -------------+----------------------------------------------------------------
> edad_c | 1,10 0,02 4,42 0,000 1,05 1,15
> |
> sexo |
> 1 | 1,00 (base)
> 2 | 2,60 0,82 3,02 0,003 1,39 4,84
> |
> estud4 |
> 0 | 1,00 (base)
> 1 | 0,87 0,32 -0,38 0,704 0,43 1,78
> 2 | 0,90 0,40 -0,24 0,807 0,37 2,16
> 3 | 0,60 0,27 -1,14 0,257 0,24 1,47
> |
> difinmes3 |
> 0 | 1,00 (base)
> 1 | 1,59 0,57 1,31 0,190 0,79 3,21
> 2 | 3,33 1,20 3,35 0,001 1,64 6,77
> |
> _cons | 0,00 0,00 -5,88 0,000 0,00 0,00
> ------------------------------------------------------------------------------
>
>
> . logistic discAVD edad_c i.sexo i.estud4 i.difinmes3
> [pweight=pond_nacional] if disdesjub==1 & disdestr==1 & trab==1 &
> dismy5
>> 0==1 & proxy==2 & edad_c>=60
>
> Logistic regression Number of obs = 710
> Wald chi2(7) = 56,60
> Prob > chi2 = 0,0000
> Log pseudolikelihood = -171,06907 Pseudo R2 = 0,1384
>
> ------------------------------------------------------------------------------
> | Robust
> discAVD | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
> -------------+----------------------------------------------------------------
> edad_c | 1,10 0,02 4,58 0,000 1,06 1,15
> |
> sexo |
> 1 | 1,00 (base)
> 2 | 2,60 0,88 2,82 0,005 1,34 5,04
> |
> estud4 |
> 0 | 1,00 (base)
> 1 | 0,87 0,33 -0,37 0,715 0,42 1,82
> 2 | 0,90 0,45 -0,22 0,826 0,34 2,38
> 3 | 0,60 0,29 -1,06 0,290 0,23 1,56
> |
> difinmes3 |
> 0 | 1,00 (base)
> 1 | 1,59 0,54 1,37 0,169 0,82 3,10
> 2 | 3,33 1,26 3,20 0,001 1,59 6,98
> |
> _cons | 0,00 0,00 -6,04 0,000 0,00 0,00
> ------------------------------------------------------------------------------
>
>
>
>
> Unfortunately, esta gof does not work with pweights.
>
>
> Interestingly enough, when I run:
>
> . quietly svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 &
> dismy50==1 & proxy==2 & edad_c>=60): logistic discAVD edad_c i
>> .sexo i.estud4 i.difinmes3
> .
> end of do-file
>
> . svylogitgof
> Number of observations = 1727
> F-adjusted test statistic = F(9,185) = 0,800
> Prob > F = 0,617
>
> -svylogitgof- seems to work with subpopolations. I remember having
> asked Kelly Archer about svylogitgof and she told me that it had been
> superseded by estat gof. I'll write her again and tell about these
> problems with estat gof.
>
> Best regards,
>
> Angel Rodriguez Laso
>
> 2013/7/22 Stas Kolenikov <[email protected]>:
>> The defaults of -estat gof- seem to be working in principally
>> different ways with and without -svy-. Without the -svy- prefix,
>> -estat gof- goes on to predict the probabilities in every covariate
>> pattern cell. If you have a continuous variable in your regression,
>> this may lead to a disaster, and it probably did in your case: I would
>> have zero trust in a chi-square with 342 degrees of freedom based on
>> 710 observations. Really? Do you know how many zero cells you have
>> there, at least???
>>
>> After -svy-, Archer-Lemeshow (note that this is a different test now,
>> documented in Stata Journal) performs essentially what -estat gof,
>> group()- does: it produces quantile groups of predicted probabilities,
>> and looks into the Pearson test based on these groups (9 in your case;
>> OK for your design with # of PSUs - # of strata = 234 - 41 = 193
>> effective degrees of freedom). Basically, Archer-Lemeshow test knows
>> that a lot of practical designs may have a low dozen degrees of
>> freedom, and it conserves the degrees of freedom and does not try to
>> go after all covariate combinations. In your example, doing so leads
>> to 342 combinations, which is simply not sustainable with 193
>> available degrees of freedom. -estat gof- is supported after -svy- in
>> Stata 12.
>>
>> I don't know if the reasons -estat gof- refuses to work after -svy,
>> subpop()- are conceptual (subsetting the sample may produce low
>> weighted count or zero cells; the denominators in the sum of cell
>> contributions in the Pearson test have random subpopulation size,
>> which needs to be properly accounted for) or programmatic (Stata Corp
>> did not expect such use, and made -estat gof- overly protective of
>> what it can work with). I can imagine that once e(V) is posted, which
>> incorporated uncertainty due to random subpopulation size, everything
>> else should follow -- which is what you do with an inappropriate -if-
>> statement. (I hope you are aware of the dangers of using -if- with
>> survey data, see
>> http://www.stata-journal.com/article.html?article=st0153.)
>>
>> -- Stas Kolenikov, PhD, PStat (ASA, SSC)
>> -- Senior Survey Statistician, Abt SRBI
>> -- Opinions stated in this email are mine only, and do not reflect the
>> position of my employer
>> -- http://stas.kolenikov.name
>>
>>
>>
>> On Wed, Jul 17, 2013 at 5:23 AM, Ángel Rodríguez Laso
>> <[email protected]> wrote:
>>> Dear Statalisters,
>>>
>>> Working with Stata 12.1.
>>>
>>>
>>> If I carry out the following logistic regression in a survey setting
>>> and then type estat gof I get:
>>>
>>>
>>> . svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 &
>>> proxy==2 & edad_c>=60): logistic discAVD edad_c i.sexo i. estud4
>>> i.difinmes3
>>> (running logistic on estimation sample)
>>>
>>> Survey: Logistic regression
>>>
>>> Number of strata = 41 Number of obs = 1727
>>> Number of PSUs = 234 Population size = 1347,0862
>>> Subpop. no. of obs = 710
>>> Subpop. size = 563,75
>>> Design df = 193
>>> F( 7, 187) = 8,32
>>> Prob > F = 0,0000
>>>
>>> ------------------------------------------------------------------------------
>>> | Linearized
>>> discAVD | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
>>> -------------+----------------------------------------------------------------
>>> edad_c | 1,10 0,02 4,42 0,000 1,05 1,15
>>> |
>>> sexo |
>>> 1 | 1,00 (base)
>>> 2 | 2,60 0,82 3,02 0,003 1,39 4,84
>>> |
>>> estud4 |
>>> 0 | 1,00 (base)
>>> 1 | 0,87 0,32 -0,38 0,704 0,43 1,78
>>> 2 | 0,90 0,40 -0,24 0,807 0,37 2,16
>>> 3 | 0,60 0,27 -1,14 0,257 0,24 1,47
>>> |
>>> difinmes3 |
>>> 0 | 1,00 (base)
>>> 1 | 1,59 0,57 1,31 0,190 0,79 3,21
>>> 2 | 3,33 1,20 3,35 0,001 1,64 6,77
>>> |
>>> _cons | 0,00 0,00 -5,88 0,000 0,00 0,00
>>> ------------------------------------------------------------------------------
>>>
>>> .
>>> end of do-file
>>>
>>> . estat gof
>>> estat gof is not allowed after subpopulation estimations
>>> r(198);
>>>
>>>
>>>
>>> Then I change if statements for my subpopulation especifications:
>>>
>>>
>>> . svy: logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if
>>> disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 & proxy==2 &
>>> edad_c>=60
>>> (running logistic on estimation sample)
>>>
>>> Survey: Logistic regression
>>>
>>> Number of strata = 41 Number of obs = 710
>>> Number of PSUs = 193 Population size = 563,75
>>> Design df = 152
>>> F( 7, 146) = 8,35
>>> Prob > F = 0,0000
>>>
>>> ------------------------------------------------------------------------------
>>> | Linearized
>>> discAVD | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
>>> -------------+----------------------------------------------------------------
>>> edad_c | 1,10 0,02 4,41 0,000 1,05 1,15
>>> |
>>> sexo |
>>> 1 | 1,00 (base)
>>> 2 | 2,60 0,82 3,02 0,003 1,39 4,85
>>> |
>>> estud4 |
>>> 0 | 1,00 (base)
>>> 1 | 0,87 0,32 -0,38 0,707 0,42 1,79
>>> 2 | 0,90 0,40 -0,25 0,807 0,37 2,16
>>> 3 | 0,60 0,27 -1,15 0,254 0,24 1,46
>>> |
>>> difinmes3 |
>>> 0 | 1,00 (base)
>>> 1 | 1,59 0,56 1,32 0,189 0,79 3,21
>>> 2 | 3,33 1,18 3,39 0,001 1,65 6,72
>>> |
>>> _cons | 0,00 0,00 -5,88 0,000 0,00 0,00
>>> ------------------------------------------------------------------------------
>>>
>>> . estat gof
>>>
>>> Logistic model for discAVD, goodness-of-fit test
>>>
>>> F(9,144) = 110,29
>>> Prob > F = 0,0000
>>>
>>>
>>>
>>> But if I get rid of the survey especifications, I get:
>>>
>>> . logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if disdesjub==1
>>> & disdestr==1 & trab==1 & dismy50==1 & proxy==2 & edad_c>=60
>>>
>>> Logistic regression Number of obs = 710
>>> LR chi2(7) = 65,87
>>> Prob > chi2 = 0,0000
>>> Log likelihood = -210,78135 Pseudo R2 = 0,1351
>>>
>>> ------------------------------------------------------------------------------
>>> discAVD | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
>>> -------------+----------------------------------------------------------------
>>> edad_c | 1,10 0,02 5,28 0,000 1,06 1,14
>>> |
>>> sexo |
>>> 1 | 1,00 (base)
>>> 2 | 1,96 0,56 2,36 0,018 1,12 3,44
>>> |
>>> estud4 |
>>> 0 | 1,00 (base)
>>> 1 | 0,87 0,29 -0,42 0,676 0,45 1,69
>>> 2 | 0,88 0,40 -0,28 0,781 0,36 2,14
>>> 3 | 0,52 0,25 -1,37 0,170 0,21 1,32
>>> |
>>> difinmes3 |
>>> 0 | 1,00 (base)
>>> 1 | 1,89 0,61 1,97 0,049 1,00 3,57
>>> 2 | 3,84 1,39 3,70 0,000 1,88 7,83
>>> |
>>> _cons | 0,00 0,00 -7,01 0,000 0,00 0,00
>>> ------------------------------------------------------------------------------
>>>
>>> . estat gof
>>>
>>> Logistic model for discAVD, goodness-of-fit test
>>>
>>> number of observations = 710
>>> number of covariate patterns = 350
>>> Pearson chi2(342) = 328,89
>>> Prob > chi2 = 0,6852
>>>
>>>
>>> The last two models don't look terribly different, so what is the
>>> reason for a such a large change in the Hosmer&Lemeshow result? Which
>>> one should I trust?
>>>
>>> Thank you for your time and attention.
>>>
>>> Angel Rodriguez-Laso
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>> * http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/