Re: st: Factor variable notation vs. hand made dummy vars
From: Jeff Pitblado, StataCorp LP
Date: Mon, 06 Feb 2012 10:11:58 -0600
Ulrich Kohler is comparing results from -logit- between two
different specifications of what seem to be the same model, but is getting
different results:
> I cannot replicate the model
> . sysuse auto, clear
> . tab rep78, gen(d)
> . logit for mpg d2-d5
> with factor variable notation. I tried
> . logit for mpg ib1.rep78
> but results differ. Can anybody explain why?
> (Note as an aside that
> . logit for mpg d1-d5
> reproduces the factor variables solution, but normally I would not
> specify the model this way)
Here is the output from Uli's first model:
***** BEGIN:
. logit for mpg d2-d5
note: d2 != 0 predicts failure perfectly
d2 dropped and 8 obs not used
Iteration 0: log likelihood = -39.273156
Iteration 1: log likelihood = -26.016988
Iteration 2: log likelihood = -25.527683
Iteration 3: log likelihood = -25.487362
Iteration 4: log likelihood = -25.480362
Iteration 5: log likelihood = -25.478768
Iteration 6: log likelihood = -25.478391
Iteration 7: log likelihood = -25.478309
Iteration 8: log likelihood = -25.478292
Iteration 9: log likelihood = -25.478288
Iteration 10: log likelihood = -25.478287
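Logistic regression                               Number of obs   =         61
                                                  LR chi2(4)      =      27.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3513

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
          d2 |          0  (omitted)
          d3 |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
          d4 |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
          d5 |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
       _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
------------------------------------------------------------------------------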
***** END:
Technically, this model should not have converged. The coefficients on the
binary predictors are far too large, and the standard errors are not
reasonable either.

The problem here is that 'd1' and 'd2' are both perfect predictors of
'foreign', but Uli dropped 'd1' from the list of predictors. Dropping one of
the indicators of a factor variable is normally a natural thing to do: one
of the levels is going to be omitted because of collinearity anyway, so by
dropping one yourself you control which level serves as the base level for the
fitted coefficients of the factor variable. But 'd1' is a perfect predictor,
so -logit- would have dropped it along with 'd2' (and the observations they
indicate) for that reason, and then found that it still needed to drop one of
the other 'd#' variables because of collinearity. However, by not including
'd1' in the list of predictors, the observations that 'd1' indicates are left
in the estimation sample, and -logit- is unable to identify that it has a
collinearity problem.
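One way to see which levels of 'rep78' are perfect predictors is to
cross-tabulate it against 'foreign' before fitting anything. This is only a
sketch of such a check, not one of Uli's original commands; any level with an
empty cell in one of the 'foreign' columns is a level that -logit- will flag.

```stata
sysuse auto, clear

* A rep78 level with a zero count in one of the foreign columns
* predicts the outcome perfectly for the cars at that level.
* The missing option also shows the cars with no repair record.
tabulate rep78 foreign, missing
```

The notes that -logit- prints in this thread indicate that levels 1 and 2
(2 and 8 observations, respectively) contain domestic cars only.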
We can prevent this by adding 'd1' back into the list of predictors:
***** BEGIN:
. logit for mpg d1-d5
note: d1 != 0 predicts failure perfectly
d1 dropped and 2 obs not used
note: d2 != 0 predicts failure perfectly
d2 dropped and 8 obs not used
note: d5 omitted because of collinearity
Iteration 0: log likelihood = -38.411464
Iteration 1: log likelihood = -25.814503
Iteration 2: log likelihood = -25.480135
Iteration 3: log likelihood = -25.478287
Iteration 4: log likelihood = -25.478287
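Logistic regression                               Number of obs   =         59
                                                  LR chi2(3)      =      25.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3367

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310946    .070733     1.85   0.064    -.0075396    .2697287
          d1 |          0  (omitted)
          d2 |          0  (omitted)
          d3 |  -3.136422   1.044601    -3.00   0.003    -5.183803    -1.08904
          d4 |  -1.119903   .9741478    -1.15   0.250    -3.029198    .7893916
          d5 |          0  (omitted)
       _cons |  -1.723275   1.776453    -0.97   0.332    -5.205059    1.758509
------------------------------------------------------------------------------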
***** END:
Uli already mentioned that this specification reproduces the results from the
one using factor variables.
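For reference, here is a sketch of a factor-variable specification that makes
both choices explicit: the estimation sample is restricted to the levels of
'rep78' that are not perfect predictors, and level 5 is set as the base to
mirror the fit above, where 'd5' was omitted. The -if- restriction and the
choice of base level are assumptions made for illustration, not something Uli
specified.

```stata
sysuse auto, clear

* Keep only rep78 levels 3 through 5; inrange() is false when rep78
* is missing, so those cars are excluded as well.
* ib5. makes level 5 the base, matching the omitted d5 above.
logit foreign mpg ib5.rep78 if inrange(rep78, 3, 5)
```

This should use the same 59 observations as the model above, without relying
on -logit- to detect and drop the perfect predictors.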
We do not recommend this, but Uli can reproduce the first model specification
using factor variables notation by explicitly specifying the levels of 'rep78'
to use:
***** BEGIN:
. logit for mpg i(2/5).rep78
note: 2.rep78 != 0 predicts failure perfectly
2.rep78 dropped and 8 obs not used
Iteration 0: log likelihood = -39.273156
Iteration 1: log likelihood = -26.016988
Iteration 2: log likelihood = -25.527683
Iteration 3: log likelihood = -25.487362
Iteration 4: log likelihood = -25.480362
Iteration 5: log likelihood = -25.478768
Iteration 6: log likelihood = -25.478391
Iteration 7: log likelihood = -25.478309
Iteration 8: log likelihood = -25.478292
Iteration 9: log likelihood = -25.478288
Iteration 10: log likelihood = -25.478287
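Logistic regression                               Number of obs   =         61
                                                  LR chi2(4)      =      27.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3513

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
             |
       rep78 |
          2  |          0  (empty)
          3  |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
          4  |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
          5  |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
             |
       _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
------------------------------------------------------------------------------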
***** END:
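The outputs above share a log likelihood (-25.478287) while differing in
sample size (61 versus 59). A sketch of how the two hand-made dummy
specifications could be put side by side with -estimates store- and
-estimates table- follows; the stored names 'no_d1' and 'all_d' are arbitrary
labels chosen here, not anything from the original posts.

```stata
sysuse auto, clear
quietly tabulate rep78, generate(d)

* First specification: d1 left out of the predictor list.
quietly logit foreign mpg d2-d5
estimates store no_d1

* Second specification: all five indicators supplied.
quietly logit foreign mpg d1-d5
estimates store all_d

* Coefficients side by side, with sample size and log likelihood.
estimates table no_d1 all_d, stats(N ll)

* Constant plus the d5 coefficient from the first fit, using the
* values printed in the output above.
display -19.14137 + 17.41793
```

The last line gives about -1.7234, essentially the constant of the second fit
(-1.723275). The two fits therefore appear to describe the same predicted
probabilities for levels 3 through 5; the first one simply pushes the constant
toward a very large negative value to accommodate the two cars with
rep78 == 1.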