What is the difference between “endogeneity” and “sample selection bias”?
Title: Endogeneity versus sample selection bias
Author: Daniel Millimet, Southern Methodist University
Question:
Many individuals have posted questions using sample selection bias and
endogeneity interchangeably or incorrectly. I do not intend to single out
one individual, but consider the case of the effect on wages of workers of
being in a trade union. Using a dummy variable to pick up this effect in a
pooled sample of union and nonunion workers is inappropriate because workers
may self-select into unions, so union membership may not be random.
One approach I have read is to use a probit model to estimate the
probability of being in a union (1 being union worker and 0 being nonunion
worker). Then from the probit equation, obtain predicted probabilities of
being a union worker for the entire sample of union and nonunion workers.
Then use these predicted probabilities in place of a union dummy variable to
estimate the effect of being in a union. This approach is supposed to
control for sample selection bias.
I am trying to relate this procedure to the standard Heckman two-stage
procedure that uses the inverse Mills’ ratio. Any help will
be much appreciated.
Answer:
Sample selection bias and endogeneity bias refer to two distinct concepts,
each entailing a distinct solution. In general, sample selection bias refers
to problems where the dependent variable is observed only for a restricted,
nonrandom sample. Using the example above, one observes an
individual’s wage within a union only if the individual has joined a
union. Conversely, one observes an individual’s nonunion wage only if
the individual does not belong to a union. Endogeneity refers to the fact
that an independent variable included in the model is potentially a choice
variable, correlated with unobservables relegated to the error term. The
dependent variable, however, is observed for all observations in the data.
Here union status may be endogenous if the decision to join or not join a
union is correlated with unobservables that affect wages. For instance, if
less able workers, who receive lower wages ceteris paribus, are more likely
to join a union, then failure to control for this correlation will yield an
estimated union effect on wages that is biased downward.
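The downward bias described above can be illustrated with a small simulation. This is only a sketch under assumed values: the variable names, the true union effect of 0.10, and all other coefficients are hypothetical choices, not taken from any data set. Wages are observed for everyone, so there is no sample selection here; the only problem is that union status is correlated with unobserved ability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.10  # assumed true union wage effect (hypothetical value)

ability = rng.normal(size=n)  # unobserved by the researcher
educ = rng.normal(size=n)     # observed regressor
# Assumption: less able workers are more likely to join a union.
union = (-0.8 * ability + rng.normal(size=n) > 0).astype(float)

# Wages are observed for every worker; ability ends up in the error term.
log_wage = (1.0 + 0.08 * educ + true_effect * union
            + 0.5 * ability + rng.normal(scale=0.3, size=n))

# Pooled OLS with a union dummy, ignoring the endogeneity of union status.
X = np.column_stack([np.ones(n), educ, union])
beta_ols, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
union_ols = beta_ols[2]

print(f"true union effect: {true_effect:.3f}")
print(f"OLS union coefficient: {union_ols:.3f}")
```

Under these assumptions the OLS coefficient on the union dummy falls well below the true effect, because the dummy also picks up the lower average ability of union members.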
The problem with unions and wages, and a host of other problems, can be
treated either as a sample selection problem or as an endogeneity problem.
The “appropriate” model depends on how one believes unions affect wages.
Model I. Endogeneity
If one believes union status has merely an intercept effect on wages
(i.e., it results in a parallel shift up or down of the wage profile),
then the appropriate model includes union status as a right-hand-side
variable and pools the entire sample of union and nonunion workers. Because
the entire sample is used, union status raises no sample-selection issue
(although selection may still arise to the extent that wages are observed
only for employed workers; typically this is a cause for concern only in
estimating wage equations for women). If union status is exogenous, one can
simply estimate a typical wage regression via OLS. If one believes union
status is endogenous because workers self-select into union or nonunion
jobs, then one should instrument for union status. One can use either
two-step methods, as outlined in the question above, or the Stata command
treatreg (renamed etregress as of Stata 13). After fitting the model, the
union status coefficient answers the following question: “Conditional on
the Xs, what is the average effect on wages of belonging to a union?” Under
this
estimation technique, the betas (the coefficients on the Xs) are restricted
to be the same for union and nonunion workers. For example, the return to
education is restricted to be the same regardless of whether one is in a
union.
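In equation form, Model I can be sketched as follows. The notation is an assumption introduced here for illustration: w_i is the wage, U_i the union dummy, X_i the wage regressors, and Z_i the variables explaining union status (including any instruments).

```latex
\begin{aligned}
w_i &= X_i\beta + \delta U_i + \varepsilon_i, \\
U_i &= \mathbf{1}\{Z_i\gamma + v_i > 0\}, \qquad
\operatorname{Cov}(\varepsilon_i, v_i) \neq 0 .
\end{aligned}
```

Here \delta is the single union effect, \beta is common to union and nonunion workers, and the nonzero covariance between the two errors is what makes U_i endogenous.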
Model II. Sample selection
If one believes that union status has not only an intercept effect but also
a slope effect (i.e., the betas differ according to union status as well),
then a sample selection model is called for. To proceed, split the sample
into union and nonunion workers and then estimate a wage equation for each
subsample. If union status is the only potentially endogenous variable in
the model, the two separate wage equations may be estimated via OLS,
accounting for the fact that each sample is a nonrandom sample of all
workers. This is accomplished via Heckman’s selection correction model
(using either ML estimation, or two-step estimation where in the first stage
a probit model is used to predict the probability of union status and in the
second stage, the inverse Mills’ ratio [IMR] is included as a
regressor). According to this type of model, the union effect does not show
up as a dummy variable but rather in the fact that the constant term and
betas may differ from the union to the nonunion sample. The difference in
the constants yields the difference in average wages between a union and a
nonunion worker who both have X=0. The difference in the betas tells one how the returns to
different observable attributes vary by union status. Essentially this
model allows a full set of interaction terms between union status and the
Xs. A Chow test could be used to test whether the betas differ by union
status. If they do not, Model I is more efficient. This type of model is
also known as an endogenous switching regime model.
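The two-step version of Model II can be sketched in a few lines. This is a minimal illustration on simulated data, not a substitute for a packaged estimator: every coefficient, the variable `z_excl` (a hypothetical variable that affects union status but not wages), and the hand-written probit routine are assumptions made for the example. Standard errors are omitted; a real application would need to correct them for the estimated IMR.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(x):
    return np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

def probit_fit(Z, d, iters=25):
    """Probit coefficients by Fisher scoring (illustrative, no convergence check)."""
    g = np.zeros(Z.shape[1])
    for _ in range(iters):
        xb = Z @ g
        p = np.clip(norm_cdf(xb), 1e-10, 1 - 1e-10)
        f = norm_pdf(xb)
        score = Z.T @ (f * (d - p) / (p * (1 - p)))
        info = (Z * (f**2 / (p * (1 - p)))[:, None]).T @ Z
        g = g + np.linalg.solve(info, score)
    return g

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulated data (all values hypothetical).
rng = np.random.default_rng(1)
n = 20_000
educ = rng.normal(size=n)
z_excl = rng.normal(size=n)  # affects union status only
v = rng.normal(size=n)       # selection-equation error
union = 0.3 * educ + 1.0 * z_excl + v > 0
e1 = 0.5 * v + rng.normal(scale=0.5, size=n)   # union-regime wage error
e0 = -0.3 * v + rng.normal(scale=0.5, size=n)  # nonunion-regime wage error
wage = np.where(union, 1.2 + 0.05 * educ + e1, 1.0 + 0.10 * educ + e0)

# Step 1: probit for union status on the full sample.
Z = np.column_stack([np.ones(n), educ, z_excl])
gamma = probit_fit(Z, union.astype(float))
index = Z @ gamma

# Step 2: a separate wage equation per regime, each with its own IMR term.
imr_union = norm_pdf(index) / norm_cdf(index)
imr_nonunion = -norm_pdf(index) / (1.0 - norm_cdf(index))
X1 = np.column_stack([np.ones(n), educ, imr_union])[union]
X0 = np.column_stack([np.ones(n), educ, imr_nonunion])[~union]
b_union = ols(X1, wage[union])        # constant, educ, IMR
b_nonunion = ols(X0, wage[~union])    # constant, educ, IMR
b_naive = ols(X1[:, :2], wage[union])  # union sample, no correction

print("union    (const, educ, IMR):", np.round(b_union, 3))
print("nonunion (const, educ, IMR):", np.round(b_nonunion, 3))
print("naive union (const, educ): ", np.round(b_naive, 3))
```

In this setup the union effect appears as the difference between the two sets of constants and slopes, and the coefficient on each IMR term estimates the covariance between that regime's wage error and the selection error. Dropping the IMR term in the union subsample shifts the estimated constant upward, which is the selection bias the correction is meant to remove.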
Other references: Main and Reilly (1993) estimate a sample-selection model
similar to Model II, where they split the sample depending on the size of
the firm where the individual works. Thus their first stage involves
estimating an ordered probit for three classes of firm size (small, medium,
or large), and then estimating three wage equations, each including the
appropriate IMR term. Millimet (2000, SMU working paper) estimates the
effect of household size on schooling using a similar modeling technique.
Maddala (1983) also gives a good introduction to these issues.
Model III. Endogeneity and sample selection
One may also confront both types of biases in the same model. For example,
say one wants to estimate the effect of union status on wages for women
only. One may choose to include union status as a right-hand-side
variable (Model I) or to split the sample (Model II). If one opts
for Model I, one still has to confront the fact that wages for women are
observed only selectively, namely for those women who choose to participate
in the labor force. To fit this model, one would start by estimating a probit
model explaining the decision of women to work or not. One would then
generate the IMR and include the IMR and the union dummy in a second-stage
wage regression, where one would instrument for union status if it were
thought to be endogenous. Finally, if Model II were desired, then one would be
confronted with a double-selection model. I believe one would estimate a
probit for labor force participation first. After generating the IMR term,
one would include it in a second probit equation explaining union status.
The appropriate IMR term from this equation would then be included in the
two final wage equations. (This topic is covered in Amemiya 1985.)
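The first case (Model I combined with selection into the labor force) can be written out as a sketch. The notation is assumed for illustration only: P_i indicates labor force participation, Z_i holds the participation regressors, and \hat\lambda_i is the IMR computed from the first-stage probit.

```latex
\begin{aligned}
P_i &= \mathbf{1}\{Z_i\gamma + v_i > 0\}
  && \text{(participation probit, full sample of women)} \\
\hat\lambda_i &= \frac{\phi(Z_i\hat\gamma)}{\Phi(Z_i\hat\gamma)}
  && \text{(IMR for participants)} \\
w_i &= X_i\beta + \delta U_i + \theta\hat\lambda_i + \eta_i
  && \text{(estimated on } P_i = 1\text{, instrumenting } U_i \text{ if endogenous)}
\end{aligned}
```

The IMR term handles the selective observation of wages, while the instrument handles the endogeneity of U_i; the two corrections address different problems and neither replaces the other.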
As in any model, one must be aware of where identification comes from. While
it is well known that instrumental variables estimation requires a
variable that is correlated with the endogenous variable, uncorrelated with
the error term, and without effect on the outcome of interest conditional on
the included regressors, identification in sample selection models is often
not as well grounded. Because the IMR is a nonlinear function of the
variables included in the first-stage probit model (call these Z), the
second-stage equation is identified by this nonlinearity alone, even if
Z=X. However, the nonlinearity of the IMR arises
from the assumption of normality in the probit model. Since most researchers
do not test or justify the use of the normality assumption, it is highly
questionable whether this assumption should be used as the sole source of
identification. Thus, it is advisable, in my opinion, to have a variable in
Z that is not also included in X. This step makes the source of
identification clear (and debatable). For the double-selection model
discussed above in Model III, two exclusion restrictions would be needed
(one for the labor force probit, one for the union probit).
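One way to see why relying on nonlinearity alone is fragile is to check how close the IMR is to a straight line over a moderate range of the probit index. The sketch below uses an assumed range of -1 to 1 for the index; the range and grid size are arbitrary choices made for illustration.

```python
import math
import numpy as np

def inverse_mills(z):
    """IMR lambda(z) = phi(z) / Phi(z) for the standard normal."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return pdf / cdf

# Assumed range of the first-stage index Z*gamma.
z = np.linspace(-1.0, 1.0, 201)
imr = np.array([inverse_mills(v) for v in z])

# Best linear approximation of the IMR over that range.
A = np.column_stack([np.ones_like(z), z])
coef, *_ = np.linalg.lstsq(A, imr, rcond=None)
resid = imr - A @ coef
dev = imr - imr.mean()
r_squared = 1.0 - (resid @ resid) / (dev @ dev)

print(f"R-squared of a linear fit to the IMR: {r_squared:.4f}")
```

Over this range a linear function of the index explains almost all of the variation in the IMR. If Z=X, the IMR is then nearly collinear with the regressors already in the wage equation, so the selection term is identified only by a small amount of curvature that depends entirely on the normality assumption.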
References
- Amemiya, T. 1985. Advanced Econometrics. Cambridge, MA: Harvard University Press.
- Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
- Main, B. and B. Reilly. 1993. The employer size-wage gap: evidence for Britain. Economica 60: 125–142.