Title | Stata 5: Obtaining predicted probabilities after probit | |
Author | William Sribney, StataCorp |
The programming techniques used in this answer are very simple in the beginning and very advanced at the end. The simple techniques will work fine, so don't think you must master the advanced ones.
Consider the following probit:
. probit y gender age value, nolog Probit Estimates Number of obs = 16 chi2(3) = 2.23 Prob > chi2 = 0.5253 Log Likelihood = -9.4680749 Pseudo R2 = 0.1055 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- gender | .7505686 1.118497 0.671 0.502 -1.441645 2.942782 age | -.0440944 .0937161 -0.471 0.638 -.2277746 .1395857 value | 2.787841 2.10917 1.322 0.186 -1.346057 6.921739 _cons | -8.506711 5.827864 -1.460 0.144 -19.92912 2.915692 ------------------------------------------------------------------------------
The predicted probabilities are given by the formula
pi = F(xi'*beta)
where F is the cumulative normal distribution, xi is the data vector for the i-th observation, and beta is the vector of coefficient estimates.
For this example,
xi = (gender[i], age[i], value[i], 1)
and
beta = (_b[gender], _b[age], _b[value], _b[_cons]).
You can use the _b[gender] syntax to access the coefficients:
. display _b[gender] .75056856
(See [U] 20.5 Accessing coefficients and standard errors and [R] matrix get, for details on accessing coefficients after an estimation command.)
The predicted probabilities can be computed by
. gen phat1 = normprob(_b[gender]*gender + _b[age]*age + _b[value]*value > + _b[_cons])
but doing it this way is unnecessary. The predict command will do it for you:
. predict phat2 . list phat1 phat2 phat1 phat2 1. .1077763 .1077763 2. .1343292 .1343292 3. .1866688 .1866688 4. .198736 .198736 5. .2112619 .2112619 6. .2376565 .2376565 7. .3687274 .3687274 8. .3837354 .3837354 9. .4152546 .4152546 10. .4171898 .4171898 11. .484975 .484975 12. .5157192 .5157192 13. .5828102 .5828102 14. .5854243 .5854243 15. .6044933 .6044933 16. .619428 .619428
Often one wants to evaluate predicted probabilities at the mean of x:
mean of x = (mean of gender, mean of age, mean of value)
This can be done by
. su gender age value Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- gender | 16 .125 .341565 0 1 age | 16 24.125 4.964877 17 35 value | 16 3.2725 .2139315 3.05 3.7 . display normprob(_b[gender]*0.125 + _b[age]*24.125 + _b[value]*3.2725 > + _b[_cons]) .36187213
This is very cumbersome! There is an easier way: use predict to compute the predicted index, take its mean, and take the normprob() of it:
. predict ihat1, index . su ihat1 Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- ihat1 | 16 -.3534592 .514132 -1.238441 .3039789 . display _result(3) -.35345925 . display normprob(_result(3)) .36187209
This is what we got before (to within float precision).
Explanation: The index for the i-th observation is xi'*beta. Thus
. gen ihat2 = _b[gender]*gender + _b[age]*age + _b[value]*value + _b[_cons]
also gives the index.
. list ihat1 ihat2 ihat1 ihat2 1. -1.238441 -1.238441 2. -1.106158 -1.106158 3. -.8902391 -.8902391 4. -.8461447 -.8461447 5. -.8020502 -.8020502 6. -.7138613 -.7138613 7. -.3352256 -.3352256 8. -.2956849 -.2956849 9. -.2140487 -.2140487 10. -.2090879 -.2090879 11. -.0376709 -.0376709 12. .0394123 .0394123 13. .2090879 .2090879 14. .2157901 .2157901 15. .2649949 .2649949 16. .3039789 .3039789
The key to our shortcut is to note that
mean of ihat1 = _b[gender]*(mean of gender) + _b[age]*(mean of age) + _b[value]*(mean of value) + _b[_cons]
One can do what we did initially
. display normprob(_b[gender]*0.125 + _b[age]*24.125 + _b[value]*3.2725 > + _b[_cons])
and just substitute in different values for x = (gender, age, value).
This would be a big pain for a model with lots of independent variables.
In this case, it would be easier to use Stata’s matrix language:
. gen cons = 1 . matrix vecaccum mean = cons gender age value [iw=1/_N] . matrix list mean mean[1,4] gender age value _cons cons .125 24.125 3.2725 1CAUTION: Make sure the order of the variables is the same here as it is in the probit output.
. su gender age value Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- gender | 16 .125 .341565 0 1 age | 16 24.125 4.964877 17 35 value | 16 3.2725 .2139315 3.05 3.7
. matrix b = get(_b) . matrix list b b[1,4] gender age value _cons y1 .75056856 -.04409444 2.787841 -8.5067114
. matrix index = mean*b' . matrix list index symmetric index[1,1] y1 cons -.35345924 . display normprob(index[1,1]) .36187209
. matrix x = mean . matrix x[1,1] = 0 . matrix list x x[1,4] gender age value _cons cons 0 24.125 3.2725 1 . matrix index = x*b' . display normprob(index[1,1]) .32733634
Here’s a fancier example:
. su gender age value if gender==0 Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- gender | 14 0 0 0 0 age | 14 23.57143 5.079586 17 35 value | 14 3.279286 .2270366 3.05 3.7 . matrix vecaccum x = cons gender age value [iw=1/14] if gender==0 . matrix list x x[1,4] gender age value _cons cons 0 23.571429 3.2792857 1 . matrix index = x*b' . display normprob(index[1,1]) .34312349
See the following sections of the Stata 5.0 Documentation:
[U] 20.5 Accessing coefficients and standard errors, [R] logit, [R] matrix accum, [R] matrix define, [R] matrix get, [R] predict, [R] summarize