Ricardo Ovaldia asks,
> What is the difference between conditional logistic
> regression grouping on clinic and unconditional
> logistic regression including clinic as a dummy
> (indicator) variable? That is, what is the difference
> in model assumptions and parameter estimates?
The difference is that the unconditional logistic regression estimates are
inconsistent and bad.
Let's deal with inconsistent first. Think of what happens as the number of
observations goes to infinity. Let's denote the number of clinics as n and,
just to make things easy, let's assume the number of observations within
clinic is the same for each clinic, and is m. Then the total number of
observations is N = n*m.
What happens as N->infinity? Presumably, the number of clinics increases.
In this thought experiment, you are presumably imagining a replication
of the world as we observe it, with clinics serving roughly the same
number of patients, so as the number of patients grows, so does the number
of clinics. Said in our notation, we are imagining n going to infinity and
m remaining constant. In standard logistic regression, that means we are
estimating n-1 coefficients for the clinics. The number of coefficients
is increasing at the same rate as the number of observations, with the
result that the estimates do not converge and have none of the usual
statistical properties you are used to estimators having.
This may sound arcane, but it isn't, as you can show via simulation. Even
easier, however, is to think about a simpler problem. Consider standard
logistic regression with a standard problem -- no clinics, nothing odd. We'll
assume one RHS variable, say sex. It will not surprise you to hear that with
just 4 observations, the estimates produced by the standard logistic
regression estimator are bad. The estimates would turn good if we added
more observations, but it turns out that with just 4, the asymptotics have not
yet kicked in and the estimates produced by the standard logistic regression
estimator are bad, not merely poor. By poor, I mean noisy. By bad, I mean
biased, wrong, and having no good properties.
Now let's consider the clinic. Let's pretend we have 1,000 clinics and
4 observations per clinic. What running
. xi: logistic outcome sex i.clinic
amounts to is running separate logistic regressions for each clinic, but with
the constraint that the coefficient on sex is the same across them. I just
told you that with 4 observations, standard logistic is bad. Combining 1,000
bad results does not improve them; they are still bad. If the results were
merely poor -- noisy -- then combining them would help, but that's not our
case.
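
The claim above can be checked by simulation. What follows is only a
sketch under assumptions that are mine, not part of the original argument:
clinic intercepts drawn from N(0,1), sex assigned by a fair coin, a true
coefficient of 1, and hypothetical function names. It fits the
dummy-variable logit by profiling out each clinic intercept, which gives
the same coefficient on sex as estimating all the dummies jointly.
For clinics of size m=2, the estimate is known to converge to twice the
true coefficient, which is a convenient check on the code.

```python
import math
import random
from collections import Counter

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n_clinics, m, beta, rng):
    """Tally clinics by (s, k, t): s = patients with sex=1, k = patients with
    outcome=1, t = patients with both. These counts are all the fit needs."""
    counts = Counter()
    for _ in range(n_clinics):
        alpha = rng.gauss(0.0, 1.0)  # assumption: clinic intercepts ~ N(0, 1)
        s = k = t = 0
        for _ in range(m):
            sex = 1 if rng.random() < 0.5 else 0
            y = 1 if rng.random() < expit(alpha + beta * sex) else 0
            s, k, t = s + sex, k + y, t + sex * y
        counts[(s, k, t)] += 1
    return counts

def fitted_intercept(beta, s, k, m):
    """ML clinic intercept for a given beta: solves
    s*expit(a + beta) + (m - s)*expit(a) = k by bisection (needs 0 < k < m)."""
    lo, hi = -30.0, 30.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if s * expit(mid + beta) + (m - s) * expit(mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def profile_score(beta, counts, m):
    """Score for beta with every clinic intercept set to its ML value."""
    total = 0.0
    for (s, k, t), c in counts.items():
        if 0 < k < m:  # all-0 or all-1 clinics are fit perfectly, add nothing
            a = fitted_intercept(beta, s, k, m)
            total += c * (t - s * expit(a + beta))
    return total

def dummy_logit_estimate(counts, m):
    """Coefficient on sex from logit with one dummy per clinic, found by
    bisection on the profile score, which is decreasing in beta."""
    lo, hi = -10.0, 10.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if profile_score(mid, counts, m) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    rng = random.Random(2024)
    for n in (1_000, 10_000, 100_000):
        est = dummy_logit_estimate(simulate(n, 4, 1.0, rng), 4)
        print(f"clinics={n:>7}  true beta=1.0  dummy-variable logit={est:.3f}")
```

The thing to watch in the output is whether the estimate settles on the
true value of 1 as the number of clinics grows with m held at 4.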
On the other hand, if by N = n*m -> infinity we meant holding n constant and
letting m->infinity, we would get good results. With m going to infinity, you
have a world in which the number of clinics remains fixed but the number of
observations within clinic increases. Under those circumstances, each
logistic regression would turn good once m got large enough, and combining
the results would make them even better.
So does it matter which thought experiment is in your mind? No. Whether you
imagine n->infinity or m->infinity, if you have m=4, you have insufficient
observations for the standard logistic regression estimator, and results will be
bad. If you have m=20, then in most circumstances you do have sufficient
observations for the logistic estimator to work. But if you were to get more
data and the first thought experiment is the correct one, meaning the number
of clinics increases, the estimates will not get better, and that should
disturb you. More data usually means better estimates.
Due to mathematical trickery, the conditional logistic estimator does not
estimate the individual coefficients for each clinic and so avoids the problem
of the number of estimated parameters increasing at the same rate as the
number of observations. That holds regardless of how the increase in N is
split between n and m.
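
The trickery is to condition on the number of positive outcomes in each
clinic (this is the model Stata's clogit command fits with the group()
option). As a sketch, in notation that is mine rather than the original
post's: write alpha_i for the intercept of clinic i, beta for the
coefficient on sex, and x_ij, y_ij for the sex and outcome of patient j in
clinic i. The clinic intercept appears in both numerator and denominator
of the conditional probability and cancels:

```latex
\Pr(y_{ij}=1 \mid x_{ij})
  = \frac{\exp(\alpha_i + \beta x_{ij})}{1+\exp(\alpha_i + \beta x_{ij})}

\Pr\Bigl(y_{i1},\dots,y_{im} \Bigm| \textstyle\sum_{j} y_{ij} = k_i\Bigr)
  = \frac{\exp\bigl(\beta \sum_{j} x_{ij}\, y_{ij}\bigr)}
         {\sum_{d \in S_i} \exp\bigl(\beta \sum_{j} x_{ij}\, d_j\bigr)},
\qquad
S_i = \Bigl\{ d \in \{0,1\}^m : \textstyle\sum_{j} d_j = k_i \Bigr\}
```

Only beta remains to be estimated, so the number of parameters stays fixed
no matter how many clinics are added.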
I told you that, with just 4 observations, standard logistic regression is
bad. So would be the conditional logistic regression with just one clinic.
But unlike the standard logistic estimator, if you hold the size of clinics
constant and increase the number of them, results get better and better.
Give me a dataset with 20 clinics, and in most cases, I'm in asymptopia.
Results are trustworthy and, given more data, they just get better and better.
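
Here is a small simulation sketch of that behavior, again under assumptions
of my own choosing (intercepts from N(0,1), sex assigned by a fair coin,
true coefficient 1, 4 patients per clinic, hypothetical function names).
With a single binary covariate, the conditional likelihood for a clinic
depends only on how many patients have sex=1 (s), how many have the
outcome (k), and how many have both (t), so the code tallies clinics by
(s, k, t) and solves the conditional score equation by bisection.

```python
import math
import random
from collections import Counter

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n_clinics, m, beta, rng):
    """Tally clinics by (s, k, t): s = patients with sex=1, k = patients with
    outcome=1, t = patients with both."""
    counts = Counter()
    for _ in range(n_clinics):
        alpha = rng.gauss(0.0, 1.0)  # clinic intercept; never estimated below
        s = k = t = 0
        for _ in range(m):
            sex = 1 if rng.random() < 0.5 else 0
            y = 1 if rng.random() < expit(alpha + beta * sex) else 0
            s, k, t = s + sex, k + y, t + sex * y
        counts[(s, k, t)] += 1
    return counts

def expected_t(beta, s, k, m):
    """E[t | s, k] under the conditional likelihood. The weights count the
    ways to place k outcomes so that u of them fall on sex=1 patients."""
    lo, hi = max(0, k - (m - s)), min(s, k)
    us = range(lo, hi + 1)
    w = [math.comb(s, u) * math.comb(m - s, k - u) * math.exp(beta * u)
         for u in us]
    return sum(u * wu for u, wu in zip(us, w)) / sum(w)

def conditional_score(beta, counts, m):
    """Derivative of the conditional log likelihood with respect to beta."""
    return sum(c * (t - expected_t(beta, s, k, m))
               for (s, k, t), c in counts.items())

def clogit_estimate(counts, m):
    """Conditional ML estimate of beta by bisection; the score is
    decreasing in beta because the log likelihood is concave."""
    lo, hi = -10.0, 10.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if conditional_score(mid, counts, m) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    rng = random.Random(2024)
    for n in (250, 2_500, 25_000):
        est = clogit_estimate(simulate(n, 4, 1.0, rng), 4)
        print(f"clinics={n:>6}  true beta=1.0  conditional logit={est:.3f}")
```

Clinics in which everyone or no one has the outcome, or in which all
patients share the same sex, contribute nothing to the score, which matches
how such groups carry no information in conditional logistic regression.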
-- Bill
P.S. Let me add a footnote to the argument above. The footnote is
unimportant for the argument made, but is important in linear
regression problems.
The gist of the problem in the standard logistic regression estimator
is that the number of estimated parameters increases at the same
rate as the number of observations. The same could be said of
the linear regression estimator and yet there is no problem because
of it. Why? Because in the linear-regression estimator, the problem of
estimating the clinic intercepts can be separated from the problem of
estimating the sex coefficient. It just turns out that way because of the
linear nature of the linear-regression estimator. The same is not
true of logistic.
The logic, "if the number of estimates increases at the same rate as
number of observations, there will be problems" is generally true,
the exception being cases where there is a particular kind of
separability, of which the linear case is the standard example.
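
The separability in the linear case can be seen in a short sketch. The
coefficient on sex from a linear regression with one dummy per clinic is
the same number you get by subtracting clinic means from both variables and
regressing one on the other, so the intercepts never need to be estimated
well. The setup below is illustrative only and rests on my own assumptions:
intercepts and errors drawn from N(0,1), sex assigned by a fair coin, a
true coefficient of 1, and a hypothetical function name.

```python
import random

def within_slope(n_clinics, m, beta, rng):
    """Slope from regressing clinic-demeaned y on clinic-demeaned x."""
    sxy = sxx = 0.0
    for _ in range(n_clinics):
        alpha = rng.gauss(0.0, 1.0)  # clinic intercept, removed by demeaning
        x = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(m)]
        y = [alpha + beta * xj + rng.gauss(0.0, 1.0) for xj in x]
        xbar, ybar = sum(x) / m, sum(y) / m
        sxy += sum((xj - xbar) * (yj - ybar) for xj, yj in zip(x, y))
        sxx += sum((xj - xbar) ** 2 for xj in x)
    return sxy / sxx

if __name__ == "__main__":
    rng = random.Random(2024)
    for n in (250, 2_500, 25_000):
        est = within_slope(n, 4, 1.0, rng)
        print(f"clinics={n:>6}  true beta=1.0  within estimate={est:.3f}")
```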