xtheckman

Highlights

Random-effects panel-data modeling with endogenous selection
Two-level multilevel models with endogenous selection
Advanced inferences

Inference statistics

Expected means and probabilities
Marginal effects and contrasts
Average structural functions (ASFs)

Conditional analysis—specify values of all covariates
Test whether selection matters
Population-averaged—values of specified covariates
Inferences and plots over groups

Heckman selection models adjust for bias when some outcomes are missing not at random. Imagine modeling income. The problem is that income is observed only for those who work. Missingness is not random.

Stata fits Heckman selection models and can fit them with panel (two-level) data.

You want to fit the model

\( y_{it} = x_{it}\beta + \alpha_{i} + \varepsilon_{it} \)

where \(y_{it}\) is sometimes missing. The equation that determines which \(y_{it}\) are not missing is

\( S_{it} = 1(z_{it}\gamma + v_{i} + u_{it} > 0) \)

In these equations, \(\alpha_{i}\), \(\varepsilon_{it}\), \(v_{i}\), and \(u_{it}\) will not be estimated. Their correlations with each other, however, will be estimated along with \(\beta\) and \(\gamma\).
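The correlation structure can be written out as follows. This is a sketch that assumes the usual bivariate normal setup for a selection model of this kind; the symbols \(\sigma\), \(\rho\), and \(\Sigma\) are introduced here for illustration and are not defined elsewhere on this page.

\( (\alpha_{i}, v_{i}) \sim N(0, \Sigma_{re}), \quad \Sigma_{re} = \begin{pmatrix} \sigma_{\alpha}^{2} & \rho_{re}\,\sigma_{\alpha}\sigma_{v} \\ \rho_{re}\,\sigma_{\alpha}\sigma_{v} & \sigma_{v}^{2} \end{pmatrix} \)

\( (\varepsilon_{it}, u_{it}) \sim N(0, \Sigma_{e}), \quad \Sigma_{e} = \begin{pmatrix} \sigma_{\varepsilon}^{2} & \rho\,\sigma_{\varepsilon} \\ \rho\,\sigma_{\varepsilon} & 1 \end{pmatrix} \)

Under this sketch, the variance of \(u_{it}\) is fixed at 1, as is standard for a probit-type selection equation. That leaves three variances (\(\sigma_{\alpha}^{2}\), \(\sigma_{v}^{2}\), \(\sigma_{\varepsilon}^{2}\)) and two correlations (\(\rho_{re}\), \(\rho\)) to be estimated.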

The above model can be fit even though income is not observed for everyone and even if their employment status changes over time.

Why fit a selection model? Because it is possible that people who work and whose income is therefore observed systematically differ from those who do not, and those differences are for unobserved reasons.

For instance, if more productive people are more likely to work, the incomes we observe will be higher than the incomes that those who do not work would have earned. Or, if the income of the less productive is lower, they might need to work more. A selection model accommodates either of these alternatives, and others too. After estimation, we can test whether selection matters.

Let's see it work

We have fictional data on 1,000 individuals observed each year from 2011 to 2018, for 8,000 observations in all. Among the variables is income, which is observed only for those who work. We worry that unobservables might lead to biased results.

To fit the selection model, we must model income and the probability of working. We model the probability of working as a function of experience, age, region of the country, and whether the person has college or technical college training.

We fit the model

. xtheckman income c.age##c.age i.training#(c.exp##c.exp),
     select(working = age exp i.region i.training)

If you are new to Stata, c.age##c.age means to include both age and age squared in the model. The "c." prefix marks a variable as continuous. The "i." prefix in i.training and i.region marks a variable as categorical and includes indicators for its categories in the model.
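To make the factor-variable shorthand concrete, here is a sketch of the same specification with every term written out. It assumes the data must first be declared as a panel with xtset; id is the group variable named in the output below, while year is an assumed name for the time variable.

. xtset id year

. xtheckman income age c.age#c.age i.training#c.exp i.training#c.exp#c.exp,
     select(working = age exp i.region i.training)

Here c.age#c.age is the age-squared term, and i.training#c.exp and i.training#c.exp#c.exp give each training group its own linear and squared experience terms. This is the same model as above, not a different one; only the notation is expanded.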

The results are

Random-effects regression with selection           Number of obs    =    8,000
                                                           Selected =    7,235
                                                        Nonselected =      765

Group variable: id                                 Number of groups =    1,000

                                                   Obs per group:
                                                                min =        8
                                                                avg =      8.0
                                                                max =        8

Integration method: mvaghermite                    Integration pts. =        7

                                                   Wald chi2(6)     = 13011.86
Log likelihood = -28748.805                        Prob > chi2      =   0.0000



               Coefficient  Std. err.      z    P>|z|     [95% conf. interval] 

wage          
         age     .0841345     .05193     1.62   0.105    -.0176465    .1859154
              
 c.age#c.age    -.0006552   .0006167    -1.06   0.288    -.0018638    .0005534
              
    training#  
       c.exp  
          0      .1497442   .1897593     0.79   0.430    -.2221772    .5216656
          1      3.198649   .2042991    15.66   0.000     2.798231    3.599068
              
       _cons     7.744222   1.223347     6.33   0.000     5.346506    10.14194

working       
         age     .0083258    .001507     5.52   0.000      .005372    .0112795
         exp      .069833    .007423     9.41   0.000     .0552843    .0843818
              
      region  
          2      .1683876   .0653623     2.58   0.010     .0402798    .2964953
          3      .0286791   .0630488     0.45   0.649    -.0948944    .1522525
          4      .0476718   .0639092     0.75   0.456    -.0775879    .1729315
          5      .0054477   .0621657     0.09   0.930    -.1163948    .1272901
              
  1.training     .8223611   .0596781    13.78   0.000     .7053942     .939328
       _cons     .3367662   .0905615     3.72   0.000     .1592688    .5142635

  var(e.wage)     81.07829   2.142513                      76.98594    85.38819

corr(e.wor~g,  
      e.wage)    -.5812249   .0615897    -9.44   0.000    -.6892935    -.447854

var(wage[id])     19.93373   1.405619                      17.36067    22.88815
        var(  
 working[id])     .0989163   .0262816                      .0587635    .1665052

       corr(  
 working[id],  
    wage[id])      .258161   .1038653     2.49   0.013      .045996    .4480403

The first panel in the results, labeled wage, reports the income equation.

The second panel reports the working (selection) equation.

After that are reported three variances and two correlations. The correlations are of interest.


     
     correlation                     estimate      SE

     corr(e.working, e.income)         -0.58    0.06
     corr(working[id], income[id])      0.26    0.10

The first correlation is the correlation of the residuals in the income and working (selection) equations, that is, the correlation of \(\varepsilon_{it}\) and \(u_{it}\).

The second is the correlation of the random effects, the unobservables that do not change over time, that is, the correlation of \(\alpha_{i}\) and \(v_{i}\).

Selection is an issue if either of these correlations is significantly different from zero. Both are (p < 0.001 and p = 0.013).
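The z and P>|z| columns for the two correlations can be recomputed by hand from the reported estimates and standard errors. This is a minimal sketch, assuming the usual Wald z statistic (estimate divided by standard error) and a two-sided normal p-value:

. display -.5812249/.0615897

. display 2*normal(-abs(-.5812249/.0615897))

. display .258161/.1038653

. display 2*normal(-abs(.258161/.1038653))

The results should agree, up to rounding, with the z values of -9.44 and 2.49 and the p-values of 0.000 and 0.013 shown in the output.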

Tell me more

Read more about Heckman selection models for panel data in the Stata Longitudinal-Data/Panel-Data Reference Manual; see [XT] xtheckman.


