Concerning my posting on logistic regression with dummies vs. conditional
logistic regression yesterday, Ricardo Ovaldia <[email protected]>
first quotes me,
WG> The difference is that the logistic regression
WG> estimates are inconsistent
WG> and bad.
and then writes,
RO> I must have missed something. From Bill's argument I
RO> see how this is true if you have few observation
RO> within clinic (for example 4), but I am not sure if
RO> the same holds if you have 4 clinics each with 200+
RO> observations which is the most ususal scenario.
First, the theoretical statement:
Standard logistic regression on four clinics each with 200+ observations
will generate good results. One would usually expect to get even better
results as the number of observations increases. In this case, however,
if the number of observations increases by increasing the number of
clinics while holding the number of observations within clinics constant,
I cannot prove that results will get better.
As a practical matter, Ricardo might be right. I just did a simulation where
I assumed
Case 1:
lnodds(outcome_j) = 0 + 1*sex_j
Case 2:
lnodds(outcome_ij) = a_i + 1*sex_ij
200 observations in each clinic i.
I did these simulations for overall sample sizes of 200, 400,
800, 1600, 3200, 6400, 12800, 25600, and 25600. Here are the results
I got:
Case 1 Case 2
_b[sex] _se[sex] _b[sex] _se[sex]
-------------------------------------------------
800 1.1090 .15 1.0474 .17
1600 .8713 .10 1.0494 .11
3200 .9475 .07 1.0474 .08
6400 .9898 .05 1.0803 .06
12800 .9886 .04 1.0849 .04
25600 1.0027 .03 1.0315 .03
-------------------------------------------------
Theoretically, I know that Case 1 converges. Theoretically, I cannot prove
that Case 2 will converge. In this example, however, note that the standard
errors on the sex coefficient are both heading down and are, in fact, roughly
comparable. In this case, the constant clinic size model seems to be
converging.
If it is convering, I have a suspicion as to why: The model I simulated I
could have fitted using a linear probablity model and that model would have
fit the data about as well. All the RHS variables are dummies and even from
linear regression, no prediction would have been below 0 or above 1. I also
know that the linear regression model does converge in both cases 1 and 2.
Thus, it would not surprise me that the logistic regression estimator works
fine in this case.
On the other hand, my simulation is inadequate because I have not verified
that the coverage is right in case 2: I report the standard error for one
regression in each row, and that reported standard error seems to be convering.
I should do a Monte Carlo on each row and check that the reported standard
error has the correct coverage.
I could do all that. I should do all that. But I must ask: Asuming
one is not auditing individual clinics, why use -logit- or -logistic- in this
case when -clogit- is just as easy to use, is less computationally intensive,
and theoretically known to be right? If one is auditing clinics, then
we need to change the focus and start looking at how well -logit- and
-logistic- estimate the individual clinic effects.
-- Bill
[email protected]
P.S. For anyone interested in pursuing this, here are the two programs
I used to produce the table above.
Do-file sim1.do makes columns 1 and 2 ov the table above:
----------------------------------------------------- sim1.do ----------------
clear all
program onecl
args N
quietly {
drop _all
set obs `N'
gen sex = cond(_n<=_N/2, 0, 1)
gen lnodds = 1*sex
gen p = exp(lnodds)/(1+exp(lnodds))
gen int outcome = (uniform()<=p)
logit outcome sex
}
di `N' " " _b[sex] " " _se[sex]
end
set seed 1939998
onecl 200
onecl 400
onecl 800
onecl 1600
onecl 3200
onecl 6400
onecl 12800
onecl 25600
onecl 51200
----------------------------------------------------- sim1.do ----------------
Do-file sim2.do makes columns 3 and 4:
----------------------------------------------------- sim2.do ----------------
clear all
program twocl
args N
local m = 200
local n = `N'/`m'
quietly {
drop _all
set obs `n'
gen clinic = _n
gen u = 2.5*(uniform()-.5)
expand `m'
sort clinic
gen sex = cond(mod(_n,2)==0,0,1)
gen lnodds = 1*sex + u
gen p = exp(lnodds)/(1+exp(lnodds))
gen int outcome = (uniform()<=p)
xi: logit outcome sex i.clinic
}
di `N' " " _b[sex] " " _se[sex]
end
set seed 1939998
twocl 800
twocl 1600
twocl 3200
twocl 6400
twocl 12800
twocl 25600
twocl 51200
twocl 102400
----------------------------------------------------- sim2.do ----------------
<end>
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/