RE: st: Interaction terms interpretation when one variable is omitted

From	"Mirnezami, Oliver" <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: Interaction terms interpretation when one variable is omitted
Date	Fri, 12 Apr 2013 13:54:16 +0000
Dear David

Thank you so much for your help. 

Following your advice, I've created a new categorical variable, treat_status, which equals 0 for the control group (anyone who is employed in the period), 1 if treat_emp == 1, 2 if treat_unemp == 1, 3 if treat_ret == 1, 4 if not in the labour force, and 5 if disabled.

treat_status |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     29,869       97.66       97.66
          1 |        436        1.43       99.08
          2 |         87        0.28       99.37
          3 |        123        0.40       99.77
          4 |         66        0.22       99.98
          5 |          5        0.02      100.00
------------+-----------------------------------
      Total |     30,586      100.00
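
For reference, a categorical variable like this could be built along the following lines. This is only a minimal sketch on a six-row toy dataset: treat_nilf and treat_dis are assumed names for the "not in labour force" and "disabled" indicators (they are not named in this thread), and the value labels are illustrative.

```stata
* Toy data: one row per category. Not the real panel.
* treat_nilf and treat_dis are hypothetical variable names.
clear
input treat treat_emp treat_unemp treat_ret treat_nilf treat_dis
0 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
end

* Start from the control group, then fill in each post-job-loss status.
gen byte treat_status = 0 if treat == 0
replace treat_status = 1 if treat_emp   == 1
replace treat_status = 2 if treat_unemp == 1
replace treat_status = 3 if treat_ret   == 1
replace treat_status = 4 if treat_nilf  == 1
replace treat_status = 5 if treat_dis   == 1

* Illustrative value labels so that output is easier to read.
label define status_lbl 0 "control" 1 "re-employed" 2 "unemployed" ///
    3 "retired" 4 "not in labour force" 5 "disabled"
label values treat_status status_lbl

* Every observation should fall into exactly one category.
assert !missing(treat_status)
tabulate treat_status
```

Because each indicator is compared with 1, observations where an indicator is missing rather than 0 are simply left for one of the other replace statements to pick up.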

I then ran the following regression using the factor variable notation in Stata (I've included a few explanatory variables and also time dummies)

xtreg health i.treat_status age i.married ln(income) `yeareffects1994to2010', fe vce(cluster id)

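            health |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
      treat_status |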
                1  |  -.0196492   .0380383    -0.52   0.605    -.0942113    .0549129
                2  |  -.0938826   .1191151    -0.79   0.431    -.3273705    .1396053
                3  |  -.0601347   .1000886    -0.60   0.548    -.2563271    .1360578
                4  |  -.0004453   .1684459    -0.00   0.998     -.330631    .3297403
                5  |  -1.043159    .355558    -2.93   0.003    -1.740119   -.3461987
                   |
             _cons |   5.382643   .1899976    28.33   0.000     5.010212    5.755074

Can I then simply compare these coefficients and say, for example, that people who are unemployed following job loss (category 2) have worse health than people who regain employment following job loss (category 1), i.e. compare -0.094 with -0.020? And can I say that all of these post-job-loss labour force statuses result in worse health on average than in my control group (category 0), who have not experienced job loss, since all the coefficients are negative relative to the reference group? Does the constant just refer to the value for the control group?
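
A comparison between two non-reference categories can be tested formally rather than read off by eye. The sketch below uses simulated data (not the panel described in this message; all values and the seed are made up) to show how lincom and testparm can be applied after a factor-variable regression of this form.

```stata
* Simulated panel purely for illustration: 500 ids, 8 biennial waves.
clear
set seed 12345
set obs 500
gen id = _n
expand 8
bysort id: gen year = 1994 + 2*_n

* About 10% of observations get a post-job-loss status 1..5; the rest are controls.
gen byte treat_status = 0
replace treat_status = 1 + floor(5*runiform()) if runiform() < 0.1
gen health = 5 - 0.1*treat_status + rnormal()

xtset id year
xtreg health i.treat_status i.year, fe vce(cluster id)

* Difference between category 2 and category 1, with its own
* standard error and confidence interval.
lincom 2.treat_status - 1.treat_status

* Joint test that all treat_status coefficients are zero.
testparm i.treat_status
```

Each coefficient on treat_status is a difference from category 0, so the difference between two of them (as reported by lincom) is the direct comparison of those two categories, and its confidence interval shows whether the gap is distinguishable from zero.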

One thing that I found confusing was that when I re-ran the regression using the original binary treatment variable (i.e. 0 = control group, 1 = job loss and any labour force status), the constant was slightly different than above when using the categorical variable (5.37 vs 5.38). Why are the constants not the same when both refer to the same control group?

             treat |   -.032478   .0365249    -0.89   0.374    -.1040735    .0391176
             _cons |   5.371252   .1897432    28.31   0.000      4.99932    5.743184

To show you the construction of this variable (0 = the same control group as in the categorical variable; 1 = the sum of all the labour force status categories):

     treatj |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     29,869       97.66       97.66
          1 |        717        2.34      100.00
------------+-----------------------------------
      Total |     30,586      100.00


My other query concerns your point about the constant term and the definitions of the predictor variables. You said that 'when the model includes treat_emp, but not treat_unemp or treat_ret, the individuals whose values on treat_unemp or treat_ret [equal 1] are accounted for by the constant term, and the coefficient of treat_emp would be interpreted as a comparison between the individuals for whom treat_emp = 1 and the aggregate of all other individuals.'

However, when I originally ran the series of separate regressions, each one included only individuals that were:
1) either in the control group or treat_emp. The individuals in treat_unemp or treat_ret etc. were not present in the regression.
2) either in the control group or treat_unemp. The individuals in treat_emp or treat_ret etc. were not present in the regression.
3) either in the control group or treat_ret. The individuals in treat_emp or treat_unemp etc. were not present in the regression.

So I thought that it would be ok because the reference point (i.e. the control group) was always the same each time. I checked this though and the constant term was different in each regression which confused me. 

I think I will stick with the categorical factor variable approach you suggested as this seems to work ok - I would be grateful if you could confirm that my interpretation when using this approach is correct and would appreciate any additional clarity on my other queries, particularly regarding the constant term. 

Thank you again. I really appreciate all your help. 

Kind regards

Oliver 

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of David Hoaglin
Sent: 12 April 2013 01:47
To: [email protected]
Subject: Re: st: Interaction terms interpretation when one variable is omitted

Hi, Oliver.

If the rest of your data behave in the same way as the data for id 001 that you listed, then _cons - Employed - Treatment + (interaction term) = 0 (exactly).
That is the collinearity that caused Stata to omit the interaction term.
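
The identity can be checked directly on the eight observations listed for id 001 in the original message. A minimal sketch, assuming the variables are named employed and treat as in that message:

```stata
* The eight rows listed for id 001 in the original message.
clear
input id year employed treat
1 1996 1 0
1 1998 1 0
1 2000 1 0
1 2002 0 1
1 2004 1 0
1 2006 1 0
1 2008 1 1
1 2010 1 0
end

gen treat_employed = treat * employed

* 1 - employed - treat + employed*treat = (1 - employed)*(1 - treat),
* which is zero whenever every untreated observation is employed.
gen check = 1 - employed - treat + treat_employed
assert check == 0
```

If the assert passes on the full dataset as well, the interaction is an exact linear combination of the constant and the two main effects, and one of the terms has to be dropped.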

I suspect that I do not know enough about the detailed structure of your data and your models, but it appears that the alternative approach is not satisfactory.  The definition of each regression coefficient includes the list of other predictors in the model.  When you use the separate models, you need to understand what happens to the constant term.  For example, when the model includes treat_emp, but not treat_unemp or treat_ret, the individuals whose values on treat_unemp or treat_ret equal 1 are accounted for by the constant term, and the coefficient of treat_emp would be interpreted as a comparison between the individuals for whom treat_emp = 1 and the aggregate of all other individuals.  It appears that treat_emp, treat_unemp, and treat_ret are indicators for separate categories of a categorical variable.  In such situations, all the categories except one should be included in the model together.  (Omitting one category avoids a collinearity.)  You may need to re-examine the definitions of your predictor variables and make sure that they capture the intended effects.

David Hoaglin

On Thu, Apr 11, 2013 at 7:14 AM, Mirnezami, Oliver <[email protected]> wrote:
> Hello
>
> I have a query regarding the interpretation of an interaction term when Stata automatically omits a  variable from the regression due to collinearity.
>
> I am looking at how job loss affects health and wish to extend my model to see whether, when an individual loses their job, re-employment moderates the negative effect on their health.
>
> To do this, I have interacted my treatment variable (1 for individuals that have reported job loss in current wave, 0 for individuals employed in current wave) with an individual's labour force status.
>
> For example:
>
> gen treat_employed = treat * employed
> gen treat_unemployed = treat * unemployed
> gen treat_retired = treat * retired
>
> In the first case, my regression is then (n.b. other controls are left out here for simplicity):
>
> xtreg health treat employed treat_employed, fe
>
> However, the interaction term treat_employed gets omitted. I then tried running the following regressions separately (with just 2 of 3 variables) and found that the coefficient and standard error on employed is the same as those of treat_employed (the interaction term):
>
>
>               |               Robust
>        health |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
> --------------+----------------------------------------------------------------
>         treat |  -.0353416   .0370996    -0.95   0.341    -.1080636    .0373803
>      employed |   .1540951   .0679695     2.27   0.023     .0208624    .2873278
>         _cons |     3.4245   .0677945    50.51   0.000     3.291611     3.55739
>
>                |               Robust
>     sr_health1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
> ---------------+----------------------------------------------------------------
>          treat |  -.1894367   .0585036    -3.24   0.001    -.3041146   -.0747589
> treat_employed |   .1540951   .0679695     2.27   0.023     .0208624    .2873278
>          _cons |   3.578596   .0007682  4658.40   0.000      3.57709    3.580101
>
> An example of my data is as follows:
>
> Id      Year    Employed    Treatment    Interaction term (employed * treatment)
> 001     1996    1           0            0
> 001     1998    1           0            0
> 001     2000    1           0            0
> 001     2002    0           1            0
> 001     2004    1           0            0
> 001     2006    1           0            0
> 001     2008    1           1            1
> 001     2010    1           0            0
>
> I think the problem arises because employment and treatment are not independent of each other: by construction, employed always equals 1 when treatment equals 0 (as my control group is people with a job). When treatment equals 1 (i.e. an individual reports job loss in this wave), however, the individual can be employed or unemployed (or in fact have any labour force status), because the job loss occurred at some point between this wave and the previous interview wave and they may already have found a new job. I wish to see whether health is affected differently depending on which labour force status an individual has following job loss.
>
> I thought of an alternative approach to the problem and would be grateful for your feedback. Originally, my treatment variable could equal 1 for any labour force status of the individual. My new method involves making separate treatment variables whose control groups are always the same: treat_emp equals 1 only when the individual is employed in the period in which job loss is reported, and treat_unemp or treat_ret equals 1 only when the individual is unemployed or retired in the interview in which they report job loss. My new method:
>
> local stubs "emp unemp ret"
> foreach stub of local stubs {
>     gen treat_`stub' = .
>     by id: replace treat_`stub' = 0 if (treat == 0)
>     by id: replace treat_`stub' = 1 if (treat == 1 & `stub' == 1)
> }
>
> I then run a series of separate regressions and analyse the coefficient of each treatment variable separately. I found, for example, that the coefficient on treat_unemp is twice as large as that on treat_emp, which makes intuitive sense to me. Can I make comparisons across regressions in this way when the regressions are exactly the same apart from the treatment variable included in each? My thinking is that the original treatment variable is, in a sense, some kind of average of the separate treatment variables, whereas now I am examining each case separately to see how they differ across regressions.
>
> xtreg health treat_emp, fe
> xtreg health treat_unemp, fe
> xtreg health treat_ret, fe
>
> Is this alternate method acceptable to use? I'm just concerned because previously I have always been taught to use interaction terms.
>
> Incidentally, I found a query on interaction terms raised a few days ago by Nahla Betelmal very helpful as a starting point. David Hoaglin and Richard Williams generated a lot of discussion which was interesting to read although my query is specifically regarding when one of the variables is omitted which I don't think was covered specifically and whether my alternate approach is acceptable or should be disregarded?
>
> I would really appreciate any advice that you can offer. Apologies for the longwinded explanation.
>
> Kind regards
>
> Oliver

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
