Title: Advantages of the robust variance estimator
Author: Bill Sribney, StataCorp
I once overheard a famous statistician say the robust variance estimator for (unclustered) logistic regression is stupid. His reason was that if the outcome variable is binary then it’s got to be a Bernoulli distribution. It’s not like linear regression with data that stand a good chance of being nonnormal and heteroskedastic.
Well, it’s not as simple as this; there are a bunch of subtle nuances. Let me lay them out here. I’m sure the famous statistician is aware of them, but they don’t necessarily lead to his conclusion.
It’s true that it has got to be a Bernoulli distribution. That is, if \( Y_i \) is a random variable for the outcome of the i-th unit, then
\[ P(Y_i = 1) = p_i \]
or, equivalently,
\[ E(Y_i) = p_i \]
This has to be true. Note how I indexed the RHS by \( i \). The term \( p_i \) is dependent on \( i \). It’s certainly not true in general that \( P(Y_i=1) = p \), where \( p \) is a constant independent of \( i \).
In logistic regression, we model \( p_i \) with a likelihood that assumes
\[ \text{logit}(p_i) = x_i b \]
So these are our assumptions:

1. logit is the right link function (and not, say, probit);
2. \( \text{logit}(p_i) \) really is linear in \( x_i \); that is, \( x_i b \) is the correct specification, with the right predictors and functional form;
3. the \( Y_i \) are independent.
The robust variance estimator is robust to assumptions (1) and (2). It does require (3), but you can specify clusters and just assume independence of the clusters if you wish.
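As a quick illustration of these options, here is a minimal sketch using Python’s statsmodels rather than Stata (the simulated data, variable names, and the choice of the plain HC0 sandwich variant are my own illustrative assumptions; in Stata the analogous choices are the vce(robust) and vce(cluster clustvar) options on logit):

```python
# A minimal sketch: model-based, robust, and cluster-robust variance estimates
# for a logistic regression. The data and cluster ids are simulated purely for
# illustration; nothing here comes from the discussion above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))             # true Bernoulli probabilities
y = rng.binomial(1, p)
groups = rng.integers(0, 50, size=n)                # hypothetical cluster ids

X = sm.add_constant(x)

ml_fit = sm.Logit(y, X).fit(disp=0)                             # ML (model-based) SEs
robust_fit = sm.Logit(y, X).fit(disp=0, cov_type="HC0")         # robust (sandwich) SEs
cluster_fit = sm.Logit(y, X).fit(disp=0, cov_type="cluster",
                                 cov_kwds={"groups": groups})   # independence assumed only across clusters

print("ML SEs:     ", ml_fit.bse)
print("robust SEs: ", robust_fit.bse)
print("cluster SEs:", cluster_fit.bse)
```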
The MLE is also quite robust to (1) being wrong. If the link function is really probit and you estimate a logit, everything’s almost always fine.
Now if (2) is wrong, strictly speaking, you are in trouble with the interpretation of the point estimates of your model, never mind the variance estimates. Imagine \( \text{logit}(p_i) \) is truly quadratic in \( x_i \), but you fit it linear in \( x_i \). What’s the interpretation of the coefficient? It’s the best fit of a straight line to something that’s not straight! It’s got some meaning, but it’s somewhat problematic.
Well, the robust variance estimator will do a good job of giving you variance estimates and confidence intervals for this problematic case of a misspecified model. That is, if one imagines resampling the data and each time fitting the same misspecified model, then you get good coverage probabilities with respect to the “true” population parameters of the misspecified model, i.e., the best-fitting straight line in the population to something that’s not straight.
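Here is a small sketch of that scenario (my own construction, not part of the original argument): the population logit is quadratic in \( x \), the fitted model is linear in \( x \), and we compare the model-based and robust standard errors for the same misspecified fit.

```python
# Sketch: the true logit is quadratic in x, but we fit a model that is linear
# in x. The point estimates are the "best-fitting straight line"; the ML and
# robust standard errors for those estimates differ. Numbers are illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-2, 2, size=n)
true_logit = -0.5 + 0.8 * x + 0.6 * x**2           # truth: quadratic in x
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X_linear = sm.add_constant(x)                       # misspecified: linear in x only

ml_fit = sm.Logit(y, X_linear).fit(disp=0)                    # model-based SEs
rb_fit = sm.Logit(y, X_linear).fit(disp=0, cov_type="HC0")    # robust SEs

print("coefficients:", rb_fit.params)               # same point estimates either way
print("ML SEs:      ", ml_fit.bse)
print("robust SEs:  ", rb_fit.bse)
```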
On the other hand, if you have confidence that your model is not misspecified, then the ML variance estimator is theoretically more efficient.
In summary, my personal advice (and I have respect for conflicting opinions) is

1. if you are confident that the model is correctly specified, use the conventional ML variance estimator;
2. if you have real doubts about the model specification, use the robust variance estimator.
And, obviously, I’d use the robust variance estimator if I had clustered data.
This recommendation is in contrast to the advice I’d give for linear regression for which I’d say always use the robust variance estimator.
There are also other theoretical reasons to be keener on the robust variance estimator for linear regression than for general ML models. The robust variance estimator uses a one-term Taylor series approximation. In linear regression, the coefficient estimates, \( b \), are a linear function of \( y \); namely,
\[ b = (X'X)^{-1} \; X'y \]
Thus the one-term Taylor series is exact and not an approximation.
For logistic regression and other MLEs, the ignored higher-order terms in the Taylor series are nonzero. So it’s truly an approximation in these cases.
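To show what the sandwich actually computes in the linear regression case, here is a short numpy sketch of the plain robust variance, checked against statsmodels’ HC0 result (the heteroskedastic error structure and the choice of the HC0 variant, which has no small-sample correction, are my own assumptions):

```python
# Sketch: the robust ("sandwich") variance for OLS computed directly,
#   V(b) = (X'X)^{-1} (sum_i e_i^2 x_i x_i') (X'X)^{-1},
# and checked against statsmodels' HC0 option. Data are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + np.abs(x), size=n)   # heteroskedastic errors

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

e = fit.resid                                   # least-squares residuals
bread = np.linalg.inv(X.T @ X)                  # (X'X)^{-1}
meat = X.T @ np.diag(e ** 2) @ X                # sum_i e_i^2 x_i x_i'
V_robust = bread @ meat @ bread                 # the sandwich

print(np.allclose(V_robust, fit.cov_HC0))       # True: matches statsmodels' HC0
```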
What are the “true” population parameters for which the robust variance estimator gives good coverage properties?
First let me assume a linear regression model, and later I’ll discuss MLEs.
Consider the entire population from which the sample is drawn. Let \( X \) be the matrix of independent variables and \( Y \) the vector of dependent variables for the entire population. Then consider
\[ B = (X'X)^{-1} \; X'Y \]
The parameter \( B \) is the coefficient vector for the linear model for the entire population. \( Y \) may be linear in \( X \) or it may not. \( B \) is simply the best least-squares coefficients for the entire population. \( B \) is what I was referring to when I said “the ‘true’ population parameters” in my above explanation.
The parameter \( B \) is what the robust variance estimator considers you to be estimating. The sample estimate is
\[ b = (x'x)^{-1} \; x'y \]
where \( x \) is the matrix of the sample values of the independent variables and \( y \) is the vector of sample dependent variables. If there were sampling weights, the above equation would have weights added in the appropriate places.
Anyhow, \( b \) is an estimate of \( B \). The robust variance estimator estimates \( V(b) \) such that nominal \( (1 - \alpha) \) confidence intervals constructed from it have \( B \) in the interval about \( (1 - \alpha) \) of the time if one were to repeatedly resample from this population.
If \( Y \) is not linear in \( X \) because of incorrect functional form or missing predictors, then the interpretation of \( B \) is problematic. \( B \) can be considered to be the best least-squares linear fit for this set of predictors. \( b \) and \( V(b) \) are “robust to misspecification” in that \( b \) estimates \( B \) and that \( V(b) \) is a valid estimate of the variance of \( b \) even though misspecification is present.
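This coverage claim can be checked by simulation. Below is a sketch (the population, sample size, and coefficient values are arbitrary choices of mine): \( Y \) is quadratic in \( X \) in a finite population, each sample is fit with a straight line, and we count how often nominal 95% robust confidence intervals cover the population slope in \( B \).

```python
# Sketch: coverage of robust confidence intervals for the population
# best-least-squares line B when the fitted straight-line model is misspecified.
# All settings here are arbitrary illustrative choices.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# A finite "population" in which Y is quadratic in X
N = 100_000
X_pop = rng.uniform(0, 2, size=N)
Y_pop = 1.0 + 0.5 * X_pop + 0.7 * X_pop ** 2 + rng.normal(size=N)

# B = (X'X)^{-1} X'Y: the best-fitting straight line in the population
Xmat = sm.add_constant(X_pop)
B = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y_pop)

reps, n, covered = 1000, 200, 0
for _ in range(reps):
    idx = rng.choice(N, size=n, replace=True)            # resample from the population
    fit = sm.OLS(Y_pop[idx], sm.add_constant(X_pop[idx])).fit(cov_type="HC1")  # robust variance
    lo, hi = fit.conf_int()[1]                           # 95% CI for the slope
    covered += (lo <= B[1] <= hi)

print(covered / reps)    # should be close to the nominal 0.95
```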
For ML models, consider \( L(B; Y, X) \), an arbitrary likelihood function with data \( Y \), \( X \) for the entire population. Let \( B^* \) be the value that maximizes \( L(B; Y, X) \). The sample estimate \( b^* \) is the value that maximizes \( L(b; y, x) \). If there are weights, we add weights to the likelihood function so that \( L(b; y, x) \) estimates \( L(B; Y, X) \). Because \( L(b; y, x) \) estimates \( L(B; Y, X) \), \( b^* \) estimates \( B^* \). The robust variance estimator produces correct variance estimates \( V(b^*) \) for \( b^* \) in the same sense discussed above: nominal \( (1-\alpha) \) confidence intervals constructed from it have \( B^* \) in the interval about \( (1-\alpha) \) of the time if one were to repeatedly resample from this population.
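In symbols (the weight notation \( w_i \) and the per-observation contributions \( f \) below are my own, and I have written things in log form for clarity), the weighted likelihood is just a survey-style estimator of the population log likelihood:
\[ \log L(b;\, y, x) \;=\; \sum_{i \,\in\, \text{sample}} w_i \log f(y_i \mid x_i;\, b) \quad\text{estimates}\quad \log L(B;\, Y, X) \;=\; \sum_{j \,\in\, \text{population}} \log f(Y_j \mid X_j;\, B), \]
where \( w_i \) is the sampling weight (the number of population units the \( i \)-th sampled unit represents). \( B^* \) maximizes the right-hand sum, and \( b^* \) maximizes the left-hand sum.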
\( L(B; Y, X) \) is not necessarily the true likelihood for the population; i.e., it is not necessarily the correct distribution of \( Y|X \). The theory doesn’t require it; it can be any function.
Standard MLE theory, on the other hand, requires \( L(b; y, x) \) to be the true distribution function for the sample.
When there are sampling weights or clustering, \( L(b; y, x) \) is in no sense a valid likelihood; it’s clearly not the distribution of the sample when there are weights or cluster sampling. \( L(b; y, x) \) merely has to estimate the arbitrary \( L(B; Y, X) \) for our theory to hold. This is why the survey theorists call \( L(b; y, x) \) a pseudolikelihood, and it’s also why you can’t do standard likelihood ratio tests with it.
However, if \( L(B; Y, X) \) is not close to the true distribution, its interpretation is problematic, just as in the case of a misspecified linear regression.