Title | Convergence of ml
Author | William Sribney, StataCorp |
There are only rare cases in which ml can converge to an answer when it shouldn’t.
If you see either of the messages “nonconcave function encountered” or “unproductive step attempted” from ml at the last iteration, then you should be suspicious of the results.
There is no cause for concern if either of these messages appears anywhere before the last iteration. For some likelihoods, these messages are the norm in early iterations, but they typically disappear for the final 4 or 5 iterations.
When the message “nonconcave function encountered” appears at the last iteration, there will also be one or more missing standard errors, so it is clear that something is not right. In this case, the problem is usually not due to ml but is inherent in the model (see the discussion below).
Thus only the message “unproductive step attempted” at the last iteration can appear with answers that look fine but are not.
Stata’s ml command uses a modified Newton–Raphson algorithm. Classical Newton–Raphson bases its steps on the Taylor-series approximation
f'(b) = f'(b0) + H(b0) * (b - b0)
where f() is the log-likelihood, b is the parameter vector, f'() is the vector of first derivatives, and H() is the matrix of second derivatives (the Hessian).
Since f'(b) = 0 at the maximum of f(), Newton–Raphson chooses b based on b0 such that
0 = f'(b0) + H(b0) * (b - b0)
Solving for b gives
b = b0 - H(b0)^(-1) * f'(b0)
and this process is repeated until convergence is achieved.
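To make the update rule concrete, here is a minimal sketch of the classical (unmodified) Newton–Raphson iteration in Python, applied to a logistic-regression log likelihood with analytic gradient and Hessian. It is purely illustrative; the data, function names, and tolerances are invented for the example, and this is not ml's code.

import numpy as np

def logit_loglik(b, X, y):
    # Log likelihood f(b), gradient f'(b), and Hessian H(b) for logistic regression
    xb = X @ b
    p = 1.0 / (1.0 + np.exp(-xb))
    f = np.sum(y * xb - np.log1p(np.exp(xb)))
    g = X.T @ (y - p)
    H = -(X * (p * (1 - p))[:, None]).T @ X
    return f, g, H

def newton_raphson(b0, X, y, tol=1e-8, maxiter=50):
    # Classical Newton-Raphson: b = b0 - H(b0)^(-1) * f'(b0), repeated to convergence
    b = b0.copy()
    for it in range(maxiter):
        f, g, H = logit_loglik(b, X, y)
        b_new = b - np.linalg.solve(H, g)      # the update rule above
        print(f"Iteration {it}:  Log Likelihood = {f:.5f}")
        if np.max(np.abs(b_new - b)) < tol:
            return b_new
        b = b_new
    return b

# Simulated data just to exercise the iteration
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(0.5 + X[:, 1])))).astype(float)
print(newton_raphson(np.zeros(2), X, y))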
There are three situations that can cause this algorithm to converge to an answer that is not the unique global maximum. They are

1. the log likelihood has multiple local maxima,
2. the Hessian H(b0) is not invertible (the “nonconcave function encountered” message), and
3. a step leads to a worse point than the current one (the “unproductive step attempted” message).

I’ll discuss these problems one at a time.
If your log-likelihood function has multiple local maxima, then the modified Newton–Raphson algorithm will converge to one of them, but not necessarily the global one. Unless you can prove mathematically that your log-likelihood function is globally concave, you will never know whether you have converged to the global maximum.
If you cannot analytically prove global concavity, there’s little you can do in practice to show that you have converged to a unique global maximum.
About the only practical steps you can take are

1. to do a grid search over the parameter space, looking for other maxima, or
2. to try different starting values and see whether you converge to the same point.
Since (1) is usually impractical and (2) usually lands you back at the same answer, basically one hopes there is only one maximum when you can’t prove it analytically. This is a reasonable assumption — a statistically well-posed likelihood should not have multiple local maxima.
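In the spirit of item (2), here is a rough sketch, in Python with scipy rather than ml, of trying several starting values and checking whether they all lead to the same answer. The two-parameter objective is a made-up stand-in for whatever negative log likelihood you are really maximizing.

import numpy as np
from scipy.optimize import minimize

def neg_loglik(b):
    # Toy stand-in for -log L(b); it may have more than one local optimum
    return 0.5 * (b[0] - 1.0) ** 2 + 0.5 * (b[1] + 2.0) ** 2 + 0.5 * np.sin(3.0 * b[0])

rng = np.random.default_rng(42)
solutions = []
for trial in range(10):
    b0 = rng.normal(scale=5.0, size=2)          # a different starting value each time
    res = minimize(neg_loglik, b0, method="BFGS")
    solutions.append((round(res.fun, 6), np.round(res.x, 4)))

# If every trial lands on (numerically) the same point, that is reassuring,
# though it still does not prove the maximum is global.
for fun, x in sorted(solutions, key=lambda s: s[0]):
    print(fun, x)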
Sometimes there are maxima at nonsensical values of ancillary parameters. For example, for a variance parameter sigma, there may be one maximum for sigma > 0 and another for sigma < 0, with the latter being nonsensical. To avoid this, transform the parameter so it is always >0.
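Here is a small sketch of that transformation, again in Python rather than ml syntax: estimate theta = ln(sigma) and recover sigma = exp(theta), so the search can never wander into sigma <= 0. The normal likelihood and the data are invented for the example.

import numpy as np
from scipy.optimize import minimize

def neg_loglik_normal(params, y):
    # Negative normal log likelihood parameterized with theta = ln(sigma)
    mu, theta = params
    sigma = np.exp(theta)                       # sigma > 0 for any real theta
    return len(y) * np.log(sigma) + 0.5 * np.sum(((y - mu) / sigma) ** 2)

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=500)

res = minimize(neg_loglik_normal, x0=np.zeros(2), args=(y,), method="BFGS")
mu_hat, theta_hat = res.x
print("mu =", mu_hat, "sigma =", np.exp(theta_hat))   # back-transform ln(sigma)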
A problem with the classical Newton–Raphson algorithm arises when H(b0) is not invertible. That's where the "modified" part of the "modified Newton–Raphson algorithm" comes in. If H(b0) is not invertible, the matrix is fiddled with until it is changed into something that is invertible, and using this inverse, one hopes that one steps in a direction that is an improvement.
When one sees the message “nonconcave function encountered” in Stata’s iteration log
Iteration 0:   Log Likelihood = -18072.001
(nonconcave function encountered)
Iteration 1:   Log Likelihood = -4064.9319
(nonconcave function encountered)
Iteration 2:   Log Likelihood = -1373.5496
(nonconcave function encountered)
this is what is happening. H is not invertible, ml is doing its "modified" fix-up, and it is taking steps that result in a bigger log-likelihood.
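I won't try to reproduce ml's exact fix-up, but the following Python sketch shows one common textbook version of the idea: when -H is singular or not positive definite, inflate its diagonal until it is, which yields an invertible matrix and a step that is still an ascent direction. Treat it as a cartoon of the fix-up, not as ml's code.

import numpy as np

def fixed_up_step(g, H, start=1e-4, factor=10.0, maxtries=25):
    # Return a Newton-like step even when H is singular or not negative definite
    negH = -H
    lam = 0.0
    for _ in range(maxtries):
        try:
            # Cholesky fails unless negH + lam*I is positive definite
            L = np.linalg.cholesky(negH + lam * np.eye(len(g)))
            # Solve (negH + lam*I) * step = g using the Cholesky factor
            return np.linalg.solve(L.T, np.linalg.solve(L, g))   # new b is b0 + step
        except np.linalg.LinAlgError:
            lam = start if lam == 0.0 else lam * factor
    raise RuntimeError("could not repair the Hessian")

# A singular Hessian, as with perfectly collinear regressors
H = np.array([[-2.0, -2.0],
              [-2.0, -2.0]])
g = np.array([1.0, 0.5])
print(fixed_up_step(g, H))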
The "nonconcave function encountered" message is not a concern when it occurs in iterations before the last one:
Iteration 38:  Log Likelihood = -205.20113
(nonconcave function encountered)
Iteration 39:  Log Likelihood = -196.88076
Iteration 40:  Log Likelihood = -194.27732
Iteration 41:  Log Likelihood = -194.18311
Iteration 42:  Log Likelihood = -194.18306
Iteration 43:  Log Likelihood = -194.18306
In the above example, ml eventually got to a concave region of the function (after 39 iterations) and then quickly found the maximum (in 4 iterations). This is an example of the modified Newton–Raphson algorithm working perfectly.
The bottom line here is that the “nonconcave function encountered” message is only a concern if it appears at the very last step. (Note that the message applies to the iteration number below the message.)
Here’s an example of when it does indicate a problem:
  ...
Iteration 105:  Log Likelihood = -195.38874
(nonconcave function encountered)
Iteration 106:  Log Likelihood = -195.38874

Interval Regression                               Number of obs =         74
                                                  Model chi2(2) =      78.01
Log Likelihood = -195.38874                       Prob > chi2   =     0.0000

------------------------------------------------------------------------------
         |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
  weight |  -.0030025   .0005111    -5.874   0.000      -.0040043   -.0020007
  weight |  -.0030025          .         .       .              .           .
   _cons |   39.42903    1.59296    24.752   0.000       36.30689    42.55118
------------------------------------------------------------------------------
  _sigma |   3.394052   .2792304    12.155   0.000       2.846771    3.941334
------------------------------------------------------------------------------
When you converge at a point where the Hessian is singular, one or more standard errors will be missing. The standard errors are derived from the inverse of the Hessian, and, since the Hessian was not invertible, a generalized inverse (produced by dropping one or more rows/columns) was used instead for the variance–covariance matrix.
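The sketch below shows where the missing standard errors come from: the variance–covariance matrix is based on the inverse of the negative Hessian, and when the Hessian is singular, a generalized inverse formed by dropping rows/columns leaves some variances, and hence standard errors, missing. This is a cartoon of the idea in Python, not Stata's routine.

import numpy as np

def std_errors(H):
    # Standard errors from the inverse of -H, with NaN where -H is singular.
    # Mimics a generalized inverse formed by dropping rows/columns (a sketch only).
    negH = -np.asarray(H, dtype=float)
    k = negH.shape[0]
    keep = []
    for j in range(k):
        trial = keep + [j]
        # keep parameter j only if the retained submatrix is still full rank
        if np.linalg.matrix_rank(negH[np.ix_(trial, trial)]) == len(trial):
            keep.append(j)
    V = np.full((k, k), np.nan)
    sub = np.linalg.inv(negH[np.ix_(keep, keep)])
    for a, i in enumerate(keep):
        for c, j in enumerate(keep):
            V[i, j] = sub[a, c]
    return np.sqrt(np.diag(V))

# Hessian from two perfectly collinear regressors: rows/columns 0 and 1 are identical
H = np.array([[-4.0, -4.0, -1.0],
              [-4.0, -4.0, -1.0],
              [-1.0, -1.0, -3.0]])
print(std_errors(H))    # the second entry is NaN, like the missing Std. Err. above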
Thus this situation is pretty easy to identify. It clearly doesn't look like a sound answer.
Note, however, that in the above example (of perfect collinearity) the algorithm did fine. It found a global maximum; that maximum just isn't unique, since it lies on a ridge. So we shouldn't attribute the problem to the algorithm; it is a problem with the model.
Besides collinearity, “nonconcave function encountered” at the last step can happen when the true maximum of the likelihood is at infinity for a parameter. In these cases, ml may declare convergence at a value of a parameter that is effectively infinity. The likelihood is flat at infinity, so again the Hessian is not invertible. In this case, ml does find the maximum (to the specified tolerance).
Numeric problems in the computation of the likelihood (like scale problems) can also give rise to "nonconcave function encountered" at the last step.
So really, ml and the modified Newton–Raphson algorithm do pretty well when the Hessian is not invertible. The problem in these cases is usually inherent in the model (e.g., collinearity or convergence at infinity) or is a numeric problem in the computation of the likelihood.
The message “unproductive step attempted” appears from ml when the modified Newton–Raphson algorithm takes a step to a new place that is worse (lower likelihood) than the last place. The Hessian is invertible, but it steps to a place that is not an improvement.
This message is typically seen at intermediate steps for likelihoods with narrow ridges and gently sloping backbones. It is easy to slip off the backbone of the ridge to a lower spot as you try to move along it toward the maximum.
The algorithm is stepping in a good direction but is stepping too far. The solution is simply to back up a little. That is what happens when you see the message “unproductive step attempted”: ml takes a step that is too optimistic and then backs up until it finds a smaller step that leads to a greater value of the likelihood.
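Here is a sketch of the backing-up idea in Python. Step halving is one standard way to do it; ml's exact backing-up rule may differ, and the toy ridge-shaped function is invented for the example.

import numpy as np

def backed_up_step(f, b0, direction, maxhalvings=20):
    # Shrink a too-optimistic step until the log likelihood actually improves.
    # f is the log likelihood; direction is the full step suggested by
    # -H(b0)^(-1) * f'(b0).
    f0 = f(b0)
    step = 1.0
    for _ in range(maxhalvings):
        b_try = b0 + step * direction
        if f(b_try) > f0:            # productive: a real improvement
            return b_try
        step /= 2.0                  # unproductive step attempted: back up
    # Ending up (numerically) back at b0 and declaring convergence here would be
    # exactly the questionable behavior described in the text below.
    return b0

# Toy ridge: the full step overshoots in the steep direction
f = lambda b: -(b[0] ** 2 + 100.0 * b[1] ** 2)
print(backed_up_step(f, np.array([1.0, 0.3]), np.array([-0.5, -0.9])))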
Here’s a typical example (linear regression as an MLE) of this message when it’s not a problem:
Iteration 0:   Log Likelihood = -18072.001
(nonconcave function encountered)
Iteration 1:   Log Likelihood = -4064.9319
  ...
(nonconcave function encountered)
Iteration 14:  Log Likelihood = -217.50138
(unproductive step attempted)
Iteration 15:  Log Likelihood = -216.84126
  ...
(unproductive step attempted)
Iteration 41:  Log Likelihood = -197.68186
Iteration 42:  Log Likelihood = -194.35966
Iteration 43:  Log Likelihood = -194.18325
Iteration 44:  Log Likelihood = -194.18306
Iteration 45:  Log Likelihood = -194.18306

<table of estimates appears>
Iterations 1-14 are “nonconcave function encountered”, and iterations 15-41 are “unproductive step attempted”. Iterations 42-45 are “productive” steps (i.e., initial step tried was an improvement) with invertible Hessians. When you are near the answer, you expect Newton–Raphson steps to work out fine.
When there is a problem, you’ll see something like
Iteration 0:   Log Likelihood = -18072.001
(nonconcave function encountered)
Iteration 1:   Log Likelihood = -4064.9319
  ...
(nonconcave function encountered)
Iteration 14:  Log Likelihood = -217.50138
(unproductive step attempted)
Iteration 15:  Log Likelihood = -216.84126
  ...
(unproductive step attempted)
Iteration 41:  Log Likelihood = -197.68186
(unproductive step attempted)
Iteration 42:  Log Likelihood = -194.35966
(unproductive step attempted)
Iteration 43:  Log Likelihood = -194.35966

<table of estimates appears>
At iteration 43, it backed up. The question is, how far did it back up? Did it back up to a point that is (numerically) the same as the point at iteration 42? It may have. It may have backed up to the same point and then declared convergence.
Thus you cannot trust the estimates when there is an “unproductive step attempted” at the last iteration. (Again, if it appears at any iteration before the last iteration, there is no problem.)
This is a flaw in Stata’s ml routine. It should not back up to the same point and declare convergence. It should print out a message like “convergence not achieved”, or it should try to do something intelligent like randomly search in the vicinity for a step that is an improvement. For the next release of Stata, we will fix this.
Note this is a case where the modified Newton–Raphson algorithm truly breaks down. What is happening is the Taylor-series approximation
f'(b) = f'(b0) + H(b0) * (b - b0)
is no good — even for b very close to b0! Theoretically, this implies that the function is not analytic at b0, but this is usually not the case. Rather, what usually is happening is that f(), f'(), or H() are being computed poorly in the vicinity of b0 because of numeric problems in the likelihood computation routines.
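Here is a small, generic demonstration, unrelated to ml's internals, of how tiny numeric errors in computing f() can leave the function values looking fine while corrupting a finite-difference estimate of the second derivative (and hence the Taylor approximation built from it).

import numpy as np

rng = np.random.default_rng(3)

def f_clean(x):
    return -0.5 * x ** 2                       # a smooth "log likelihood"

def f_noisy(x):
    # the same function with a tiny numeric error, such as roundoff
    # accumulated inside a likelihood routine
    return f_clean(x) + 1e-9 * rng.standard_normal()

def second_derivative(f, x, h):
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

x0, h = 1.0, 1e-5
print("clean H :", second_derivative(f_clean, x0, h))   # close to the true value, -1
print("noisy H :", second_derivative(f_noisy, x0, h))   # can be wildly wrong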
So this is truly the one case in which the modified Newton–Raphson algorithm breaks down: when the functions being computed are not numerically smooth.
If you read this far, you must be interested in the details of ml!
If you want to learn all sorts of details about ml, I highly recommend Stata’s NetCourse 200 which is devoted to ml programming. The flavor of the course is, in part, similar to what I’ve written above (since I'm one of the course instructors!). The majority of the course covers practical Stata programming issues, but we discuss theoretical ML and general numerical issues as well.