Bayesian sample-selection models

Highlights

Simply prefix your sample-selection command with bayes:
Linear, binary, and ordinal outcomes
Default and custom prior distributions
Full support for Bayesian postestimation features
See more Bayesian analysis features

Sample selection arises when the sampled data are not representative of the population of interest. A classic example is women's work participation. Suppose that we want to model the wages of women. If we consider only the women who chose to work, we may end up with a sample in which wages are too high, because women who would have earned low wages may have chosen not to work. If the decision to work were unrelated to wages, there would be no problem with using only the sample of women who work, but that is not a realistic assumption here. To obtain valid inference, we must model both the outcome (the wages) and the decision to work. We will refer to the two models as the outcome model and the participation model.
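The effect of selection on the observed wages can be illustrated with a small simulation. This is only a sketch on made-up data, not the dataset analyzed on this page: the variable names (u1, u2, wage_all, works), the error correlation of 0.7, the wage scale, and the participation threshold are all assumptions chosen for illustration.

```stata
* Hypothetical simulation; none of these values come from the data on this page
clear
set seed 12345
set obs 10000

* Correlated errors: u1 drives wages, u2 drives the decision to work
matrix C = (1, .7 \ .7, 1)
drawnorm u1 u2, corr(C)

* Potential wage for every woman and an indicator for choosing to work
generate wage_all = 20 + 6*u1
generate byte works = (u2 > -.5)

* Mean wage in the full population versus among those who work
summarize wage_all
summarize wage_all if works
```

Because the two errors are positively correlated, the mean among those who work should come out above the population mean of 20, which is the bias a sample-selection model is meant to correct.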

In Stata, you can use heckman to fit a Heckman selection model to continuous outcomes, heckprobit to fit a probit sample-selection model to binary outcomes, and heckoprobit to fit an ordered probit model with sample selection to ordinal outcomes. You can simply prefix these commands with bayes: to fit the corresponding Bayesian sample-selection models.
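For reference, the model behind heckman can be sketched as follows. The notation is an assumption for this sketch rather than something taken from this page: y_j is the wage, x_j the outcome covariates, and z_j the participation covariates.

```latex
\begin{aligned}
\text{outcome model:}\quad & y_j = \mathbf{x}_j\boldsymbol{\beta} + u_{1j}, \\
\text{participation model:}\quad & y_j \text{ is observed if } \mathbf{z}_j\boldsymbol{\gamma} + u_{2j} > 0, \\
& u_{1j} \sim N(0,\sigma^2),\quad u_{2j} \sim N(0,1),\quad
  \operatorname{corr}(u_{1j},u_{2j}) = \rho .
\end{aligned}
```

The parameters rho and sigma in this sketch correspond to the rows labeled rho and sigma in the heckman output.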

Let's see it work

Continuing with our example of women's work participation, we first fit the classical Heckman sample-selection model. Below we model both the wages and the decision to work based on the level of education and age. For the decision to work, we additionally include marital status and number of children.

. heckman wage educ age, select(married children educ age)

Iteration 0:   log likelihood = -5178.7009
Iteration 1:   log likelihood = -5178.3049
Iteration 2:   log likelihood = -5178.3045


Heckman selection model                         Number of obs     =      2,000
(regression model with sample selection)              Selected    =      1,343
                                                      Nonselected =        657

                                                Wald chi2(2)      =     508.44
Log likelihood = -5178.304                      Prob > chi2       =     0.0000



       wage   Coefficient  Std. err.      z    P>|z|     [95% conf. interval]

wage           
   education     .9899537   .0532565    18.59   0.000     .8855729    1.094334
         age     .2131294   .0206031    10.34   0.000     .1727481    .2535108
       _cons     .4857752   1.077037     0.45   0.652    -1.625179     2.59673

select         
     married     .4451721   .0673954     6.61   0.000     .3130794    .5772647
    children     .4387068   .0277828    15.79   0.000     .3842534    .4931601
   education     .0557318   .0107349     5.19   0.000     .0346917    .0767718
         age     .0365098   .0041533     8.79   0.000     .0283694    .0446502
       _cons    -2.491015   .1893402   -13.16   0.000    -2.862115   -2.119915

     /athrho     .8742086   .1014225     8.62   0.000     .6754241    1.072993
    /lnsigma     1.792559    .027598    64.95   0.000     1.738468     1.84665

         rho     .7035061   .0512264                      .5885365    .7905862
       sigma     6.004797   .1657202                       5.68862    6.338548
      lambda     4.224412   .3992265                      3.441942    5.006881

LR test of indep. eqns. (rho = 0):   chi2(1) =    61.20   Prob > chi2 = 0.0000

To fit its Bayesian analog, we use bayes: heckman.

. bayes: heckman wage educ age, select(married children educ age)


Burn-in ...
Simulation ...

Model summary

 
Likelihood:
  wage ~ heckman(xb_wage,xb_select,{athrho} {lnsigma})

Priors:
                     {wage:education age _cons} ~ normal(0,10000)          (1)
  {select:married children education age _cons} ~ normal(0,10000)          (2)
                               {athrho lnsigma} ~ normal(0,10000)             
 
(1) Parameters are elements of the linear form xb_wage.
(2) Parameters are elements of the linear form xb_select.

Bayesian Heckman selection model                MCMC iterations   =     12,500
Random-walk Metropolis-Hastings sampling        Burn-in           =      2,500
                                                MCMC sample size  =     10,000
                                                Number of obs     =      2,000
                                                      Selected    =      1,343
                                                      Nonselected =        657
                                                Acceptance rate   =      .3484
                                                Efficiency:   min =     .02314
                                                              avg =     .03657
Log marginal-likelihood = -5260.2024                          max =     .05013



                                                              Equal-tailed    
                    Mean   Std. dev.     MCSE     Median  [95% cred. interval]

wage           
   education    .9919131    .051865   .002609   .9931531   .8884407   1.090137
         age    .2131372   .0209631   .001071   .2132548   .1720535   .2550835
       _cons    .4696264   1.089225     .0716   .4406188  -1.612032    2.65116

select         
     married    .4461775   .0681721   .003045   .4456493   .3178532   .5785857
    children    .4401305   .0255465   .001156   .4402145   .3911135   .4903804
   education    .0559983   .0104231   .000484   .0556755   .0360289    .076662
         age    .0364752   .0042497   .000248   .0362858   .0280584   .0449843
       _cons   -2.494424     .18976   .011327  -2.498414  -2.861266  -2.114334

      athrho     .868392    .099374   .005961   .8699977   .6785641   1.062718
     lnsigma    1.793428   .0269513   .001457   1.793226   1.740569   1.846779

Note: Default priors are used for model parameters.

Unlike heckman, bayes: heckman reports the ancillary parameters only in the estimation metric. We can use bayesstats summary to obtain the parameters in the original metric.

. bayesstats summary (rho:1-2/(exp(2*{athrho})+1)) (sigma:exp({lnsigma}))

Posterior summary statistics                      MCMC sample size =    10,000

         rho : 1-2/(exp(2*{athrho})+1)
       sigma : exp({lnsigma})



                                                              Equal-tailed    
                    Mean   Std. dev.     MCSE     Median  [95% cred. interval]

         rho    .6970522   .0510145   .003071    .701373   .5905851   .7867018
       sigma    6.012205   .1621422   .008761   6.008807   5.700587   6.339366
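The expressions passed to bayesstats summary invert the transformations used in the estimation metric. A short derivation, assuming the usual definitions of athrho as the inverse hyperbolic tangent of rho and lnsigma as the natural log of sigma:

```latex
\begin{aligned}
\texttt{athrho} &= \operatorname{atanh}(\rho) = \tfrac{1}{2}\ln\frac{1+\rho}{1-\rho}
\;\Longrightarrow\;
\rho = \tanh(\texttt{athrho})
     = \frac{e^{2\,\texttt{athrho}}-1}{e^{2\,\texttt{athrho}}+1}
     = 1-\frac{2}{e^{2\,\texttt{athrho}}+1}, \\
\texttt{lnsigma} &= \ln\sigma
\;\Longrightarrow\;
\sigma = e^{\texttt{lnsigma}} .
\end{aligned}
```

Because the transformation is nonlinear, the posterior mean of rho (.6970522) is not the same as the transformed posterior mean of athrho, tanh(.868392), which is approximately .7006. Summarizing the transformed MCMC draws, as bayesstats summary does, is the appropriate way to obtain posterior summaries in the original metric.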

Parameter rho is a correlation coefficient that measures the dependence between the outcome and participation models. If rho is zero, the two models are independent and can be analyzed separately. In other words, there is no sample selection, and we can model the wages using only the sample of women who work without introducing any bias in our results. In our example, rho is estimated to be between 0.59 and 0.79 with a probability of 0.95, so the decision to work is related to the wages in this example.
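One way to quantify this statement is to estimate the posterior probability that rho is positive. The sketch below uses bayestest interval and relies on the fact that rho is a monotone transformation of {athrho}, so rho > 0 exactly when {athrho} > 0. It assumes that the bayes: heckman results are still the current estimation results.

```stata
* Assumes the bayes: heckman results are the current estimation results
* rho = tanh(athrho) is increasing in athrho, so rho > 0 if and only if athrho > 0
bayestest interval {athrho}, lower(0)
```

A posterior probability close to 1 would agree with the credible interval for rho reported above, which lies well away from zero.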

We can test for sample selection formally by using, for example, Bayes factors. A Bayes factor of two models is simply the ratio of their marginal likelihoods. The larger the value of the marginal likelihood, the better the model fits the data. To test for sample selection, we can compare the marginal likelihoods of the current model and of the model with rho equal to zero.
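In symbols, for two models M_1 and M_2 fit to the same data y, the Bayes factor and its logarithm can be written as below. The notation (m_k for the marginal likelihood, theta_k for the parameters of model k) is an assumption of this sketch.

```latex
\mathrm{BF}_{12} = \frac{m_1(\mathbf{y})}{m_2(\mathbf{y})},
\qquad
\log \mathrm{BF}_{12} = \log m_1(\mathbf{y}) - \log m_2(\mathbf{y}),
\qquad
m_k(\mathbf{y}) = \int p(\mathbf{y}\mid\boldsymbol{\theta}_k, M_k)\,
                       p(\boldsymbol{\theta}_k\mid M_k)\,d\boldsymbol{\theta}_k .
```

A positive log Bayes factor favors M_1, and a negative one favors M_2.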

First, we store the current Bayesian estimation results from the sample-selection model.

. bayes, saving(heckman_mcmc)

. estimates store heckman

Next, we fit a model that assumes no sample selection. When rho equals zero, {athrho} also equals zero. So we specify a strong prior concentrated at zero for parameter {athrho}: a normal prior with mean 0 and variance 1e-4, that is, a standard deviation of 0.01.

. bayes, prior({athrho}, normal(0,1e-4)) saving(nosel_mcmc):
  heckman wage educ age, select(married children educ age)

Model summary

 
Likelihood:
  wage ~ heckman(xb_wage,xb_select,{athrho} {lnsigma})

Priors:
                     {wage:education age _cons} ~ normal(0,10000)          (1)
  {select:married children education age _cons} ~ normal(0,10000)          (2)
                                       {athrho} ~ normal(0,1e-4)
                                      {lnsigma} ~ normal(0,10000)
 
(1) Parameters are elements of the linear form xb_wage.
(2) Parameters are elements of the linear form xb_select.

Bayesian Heckman selection model                MCMC iterations   =     12,500
Random-walk Metropolis-Hastings sampling        Burn-in           =      2,500
                                                MCMC sample size  =     10,000
                                                Number of obs     =      2,000
                                                      Selected    =      1,343
                                                      Nonselected =        657
                                                Acceptance rate   =      .3065
                                                Efficiency:   min =     .03943
                                                              avg =     .09498
Log marginal-likelihood = -5283.0246                          max =      .2432



                                                              Equal-tailed    
                    Mean   Std. dev.     MCSE     Median  [95% cred. interval]

wage           
   education    .8981219   .0509913   .001578   .8973616   .8013416   1.000497
         age    .1477784     .01854    .00066   .1477496   .1115628   .1850257
       _cons    5.994764    .890318   .030657   6.014622   4.150738   7.658942

select         
     married    .4351031   .0748102   .003577   .4377313   .2821176   .5752786
    children    .4501657   .0285028   .001045   .4492015   .3937091   .5048498
   education    .0584037   .0110582   .000524   .0579573   .0370387   .0814287
         age     .034779   .0043677    .00022   .0348894   .0259916    .043139
       _cons    -2.47607   .1962162   .009818  -2.467739  -2.862694   -2.10733

      athrho    .0062804    .010209    .00023   .0062746   -.014139   .0261746
     lnsigma     1.69586    .019056   .000386   1.695649    1.65948   1.734115

Note: Default priors are used for some model parameters.

. estimates store nosel

We now use bayesstats ic to obtain the Bayes factor of the two models.

. bayesstats ic heckman nosel

Bayesian information criteria



                     DIC    log(ML)    log(BF)

     heckman    10376.05  -5260.202          .
       nosel    10435.29  -5283.025  -22.82221

Note: Marginal likelihood (ML) is computed
using Laplace-Metropolis approximation.

The log Bayes factor of about -23 for nosel relative to heckman indicates a very strong preference for the sample-selection model heckman and thus for the presence of sample selection in these data.
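The reported log(BF) can be reproduced from the two log marginal likelihoods, and the comparison can also be expressed as posterior model probabilities. The sketch below assumes that both sets of results were stored as heckman and nosel with their MCMC samples saved, as done above, and that equal prior model probabilities (the default of bayestest model) are acceptable.

```stata
* Log Bayes factor of nosel relative to heckman, from the reported log(ML) values
display -5283.0246 - (-5260.2024)

* The same comparison on the ratio scale, in favor of heckman
display exp(22.82221)

* Posterior model probabilities under equal prior model probabilities
bayestest model heckman nosel
```

The first display should agree with the log(BF) column of bayesstats ic after rounding. The second shows that the marginal likelihood of the sample-selection model is larger by a factor on the order of 10^9.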

Tell me more

Learn more about the general features of the bayes prefix.

Learn more about Stata's Bayesian analysis features.

Read more about the bayes prefix and Bayesian analysis in the Stata Bayesian Analysis Reference Manual.


