High-dimensional fixed effects (HDFE)


Highlights

Absorb multiple high-dimensional categorical variables in the following:

Linear models with areg, absorb()
Fixed-effects linear models with xtreg, fe absorb()
Two-stage least-squares regression with ivregress 2sls, absorb()

Choose an alternating projection algorithm:

Halperin
Cimmino

Gain speed by using option absorb()

Absorb not just one but multiple high-dimensional categorical variables in your linear, fixed-effects linear, and instrumental-variables linear models by using option absorb() with the commands areg; xtreg, fe; and ivregress 2sls. This provides remarkable speed gains over the traditional approach of directly including indicators for the categories of these variables in your models. Choose between the Halperin and Cimmino alternating projection algorithms.
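
As a rough sketch of what the two algorithm names refer to (these are the standard textbook forms, not necessarily Stata's exact implementation): let \(M_{j}\) be the operator that subtracts from a variable \(v\) its mean within each category of the \(j\)th absorbed variable, for \(j = 1, \dots, m\). The notation \(M_{j}\), \(v\), and \(m\) is introduced here for illustration. Halperin's scheme applies these operators one after another in a cycle, while Cimmino's scheme applies all of them to the current iterate and averages the results:

\[
\text{Halperin:}\quad v^{(k+1)} = M_{m} \cdots M_{2} M_{1}\, v^{(k)}, \qquad
\text{Cimmino:}\quad v^{(k+1)} = \frac{1}{m} \sum_{j=1}^{m} M_{j}\, v^{(k)}
\]

Both sequences converge to the same limit, namely \(v\) with all the categorical effects partialed out. Regressing the transformed outcome on the transformed regressors then yields the coefficients of interest without estimating a separate coefficient for each category.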

Let's see it work

-> Linear models with high-dimensional categorical variables

-> How much time are we saving?

-> More speed in fixed-effects linear models

Linear models with high-dimensional categorical variables

We often include categorical variables in our models as controls. These controls are necessary for model specification, but they are not the focus of our analysis. For instance, we may want to study the effect of import tariffs (imports) on yearly trade volume and include year, country, and industry as controls.
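
Written out, and assuming one observation per country, industry, and year (the indices \(c\), \(k\), and \(t\) and the Greek symbols are notation introduced here for illustration), the model is

\[
\text{trade}_{ckt} = \beta\,\text{imports}_{ckt} + \gamma_{c} + \delta_{k} + \lambda_{t} + \varepsilon_{ckt}
\]

where \(\beta\) is the parameter of interest and \(\gamma_{c}\), \(\delta_{k}\), and \(\lambda_{t}\) are the country, industry, and year effects that serve only as controls.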

We could fit a linear regression with indicator variables for the three controls:

. regress trade imports i.year i.country i.industry

If we have 40 years of data, 160 countries, and 1,000 industry codes, we would be estimating about 1,200 parameters. This is time-consuming, and only one of them, the coefficient on imports, is of interest to our research question.

We can now fit the same model in a fraction of the time by typing

. areg trade imports, absorb(year country industry)

Variables year, country, and industry are absorbed. areg already had the ability to absorb one variable, but now we can absorb as many categorical variables as we want.
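
One way to convince yourself that absorbing the controls leaves the coefficient of interest unchanged is to fit the model both ways and compare. The following is a minimal sketch, assuming the hypothetical trade dataset described above is in memory; the stored-estimate names dummies and absorbed are arbitrary labels chosen for this illustration.

. quietly regress trade imports i.year i.country i.industry
. estimates store dummies
. quietly areg trade imports, absorb(year country industry)
. estimates store absorbed
. estimates table dummies absorbed, keep(imports) b se

The coefficient on imports should agree across the two columns up to numerical tolerance, because absorbing a variable partials out its indicators rather than dropping them from the model.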

And if we want to fit a model with, say, industry fixed effects, we would type

. xtset industry

. xtreg trade imports, fe absorb(year country)

Suppose that we believe imports are endogenous and we would like to instrument them using a measure of productivity. Again, we do not care about the coefficients on year, country, or industry. We could fit the model:

. ivregress 2sls trade (imports = productivity), absorb(year country industry)

How much time are we saving?

Below is a toy example with one million observations, one variable of interest (x), two categorical variables (a1 and a2) with one thousand categories, and the id variable with one hundred thousand categories.

Previously, we could absorb only one variable and would have typed

. webuse hdfe
. quietly areg y x i.a1 i.a2, absorb(id)

specifying the variable with the largest number of categories, id, in the absorb() option. This takes about five minutes in Stata/MP and six minutes in Stata/SE.

Now we can absorb all three categorical variables by typing

. quietly areg y x, absorb(id a1 a2)

which takes roughly 1 second in Stata/MP and 1.3 seconds in Stata/SE. (The times may vary slightly across computers, but the time gains will be similar.)

The time gains are remarkable: going from five or six minutes to about one second is roughly a 300-fold speedup.
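
To reproduce the comparison on your own machine, one option is Stata's timer command. This is a sketch rather than the procedure used for the timings quoted above; the timer numbers 1 and 2 are arbitrary, and the first model may take several minutes to run.

. webuse hdfe, clear
. timer clear
. timer on 1
. quietly areg y x i.a1 i.a2, absorb(id)
. timer off 1
. timer on 2
. quietly areg y x, absorb(id a1 a2)
. timer off 2
. timer list

timer list reports the elapsed time in seconds for each timer, so the ratio of timer 1 to timer 2 gives the speedup on your hardware.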

More speed in fixed-effects linear models

areg is the fastest command for models with high-dimensional categorical variables. But if you want to fit a fixed-effects model, xtreg, fe may be more appropriate.

Previously, to control for categorical variables with xtreg, fe, you had to specify them as indicator variables in the model. Now you can specify them in the new absorb() option, just as you do with areg; this will make xtreg, fe run much faster.

Continuing with the previous example, suppose you want to fit a linear model with id fixed effects. You would type

. xtset id
. xtreg y x, fe absorb(a1 a2) vce(cluster id)

and get

(header output omitted)



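------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |  -.5017659   .0010547  -475.73   0.000    -.5038331   -.4996986
       _cons |   1.509526   .0031635   477.16   0.000     1.503326    1.515727
-------------+----------------------------------------------------------------
     sigma_u |  1.4160503
     sigma_e |  3.1659943
         rho |  .16670092   (fraction of variance due to u_i)
------------------------------------------------------------------------------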

xtreg, fe is slower than areg because it does more work. In particular, it computes panel-level statistics that are used, for instance, to estimate the standard deviation of the panel-level effects (sigma_u). If you are not interested in sigma_u, you can save execution time by specifying the nosigmau option.

. xtreg y x, fe absorb(a1 a2) vce(cluster id) nosigmau
(header output omitted)



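------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |  -.5017659   .0010547  -475.73   0.000    -.5038331   -.4996986
       _cons |   1.509526   .0031635   477.16   0.000     1.503326    1.515727
-------------+----------------------------------------------------------------
     sigma_e |  3.1659943
------------------------------------------------------------------------------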

Without the vce(cluster id) option, xtreg, fe reports a test that all panel effects, the \(u_{i}\)'s, are zero. In this case, specifying the nouitest option will suppress both the test and the estimation of sigma_u to save even more execution time.
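
Putting this together, a call that skips both the test and sigma_u might look like the following sketch. It assumes, as described above, that nouitest is specified without vce(cluster id) and that the data are still xtset on id.

. xtreg y x, fe absorb(a1 a2) nouitest

The standard errors in this call are the default ones, not the cluster-robust ones reported in the earlier examples.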

Tell me more

Read more about how to handle high-dimensional categorical predictors in linear models in [R] areg, in instrumental-variables regression in [R] ivregress 2sls, and in fixed-effects linear models in [XT] xtreg.
