Re: st: Predict in version 11

From	[email protected] (Jeff Pitblado, StataCorp LP)
To	[email protected]
Subject	Re: st: Predict in version 11
Date	Wed, 08 Dec 2010 13:02:01 -0600
Marnix Zoutenbier <[email protected]> is using -predict- after -anova-
and noticed that Stata 11 will now produce a non-missing value in
out-of-sample observations where a factor variable takes on values not
observed within the estimation sample:

> I see a difference in the way predict works between Stata10 and 11.
> 
> Consider the following example
> x1	testset 	y
> 1	1	12
> 2	1	13
> 3	1	14
> 4	2	.
> 
> And the commands
> anova y x1 if testset==1
> predict yhat
> 
> The following is the result in version 11
> x1	testset 	y	yhat
> 1	1	12	12
> 2	1	13	13
> 3	1	14	14
> 4	2	.	12
> 
> While in version 10 the following dataset results
> x1	testset 	y	yhat
> 1	1	12	12
> 2	1	13	13
> 3	1	14	14
> 4	2	.	.
> 
> I prefer the version 10 way-of-working, because it gives me the opportunity
> to identify observations that are in the testset (testset==2) and not in
> the trainingset (testset==1).
> 
> Is it possible to obtain the same result in version 11 as in version 10,
> other than switching with the version command before and after predict?
> 
> Thank you for your consideration,

Short reply:

Apart from running the commands under version control, as Marnix notes above,
-predict- has no option that makes it behave as it did in Stata 10.  As with
out-of-sample predictions involving continuous predictors, Stata 11 relies on
the data analyst to judge which predictions are meaningful or even valid.
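
A minimal sketch of the version-control route, assuming that running both the
fit and the prediction under the prefix form of -version- reproduces the
Stata 10 behavior; 'yhat10' is a hypothetical variable name:

***** BEGIN:
. * fit and predict with the Stata 10 interpretation of -anova-
. version 10: anova y x1 if testset==1

. version 10: predict yhat10
***** END: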

Both Neil Shephard <[email protected]> and Nick Cox <[email protected]>
point out that -predict- allows -if- and -in- restrictions, which give the data
analyst control over which observations receive predictions.
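
One way to use an -if- restriction to recover Marnix's preferred result is
sketched below: predictions are computed only for observations whose level of
'x1' occurred in the estimation sample.  The variable names 'insamp', 'seen',
and 'yhat' are illustrative, and note that -bysort- re-sorts the data.

***** BEGIN:
. anova y x1 if testset==1

. * mark the estimation sample
. gen byte insamp = e(sample)

. * seen==1 when at least one observation with this level of x1 was in the fit
. bysort x1 (insamp): gen byte seen = insamp[_N]

. * predict only for levels of x1 that the model has seen
. predict yhat if seen
***** END:

With Marnix's data, 'x1'=4 appears only in the test set, so 'yhat' should be
missing for that observation.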

Longer reply:

Prior to Stata 11, -anova- and -manova- were the only estimation commands that
possessed logic to handle categorical variables, but even they had some
limitations we intended to address with the new factor variables notation.
For example, -anova- and -manova- did not allow you to control the base level
or restrict the levels of a factor without generating modified copies of the
factor variables.

The new factor variables notation also replaced and expanded on the features
of the -xi- prefix, which produced indicator variables for categorical
variables and some two-way interactions.
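
As a rough illustration (not part of Marnix's example), the following
contrasts the -xi- prefix with factor variables notation.  The 'ib3.' operator
requests level 3 as the base; the choice of level 3 is arbitrary.

***** BEGIN:
. * Stata 10 and earlier: -xi- generates indicator variables named _Ix1_*
. xi: regress y i.x1

. * Stata 11: the indicators are implied, not generated
. regress y i.x1

. * Stata 11: choose the base level directly
. regress y ib3.x1
***** END: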

One of our goals for the new factor variables notation was to get all of
Stata's official estimation commands to support categorical variables and
their interactions consistently.  Thus -anova- and -manova- were updated to
have the same features as their linear-model counterparts, -regress- and
-mvreg-.

The new factor variables notation allows you to specify which levels to
include in a model fit.  Using Marnix's data, let's fit an ANOVA model where
we only care about the effect of x1=1 compared to all the other levels.  In
Stata 11 we simply type

***** BEGIN:
. anova y 1.x1

                           Number of obs =       3     R-squared     =  0.7500
                           Root MSE      = .707107     Adj R-squared =  0.5000

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |         1.5     1         1.5       3.00     0.3333
                         |
                      x1 |         1.5     1         1.5       3.00     0.3333
                         |
                Residual |          .5     1          .5   
              -----------+----------------------------------------------------
                   Total |           2     2           1   

. mat li e(b)

e(b)[1,2]
        1.       
       x1  _cons
y1   -1.5   13.5
***** END:

We see that -anova- used all observations where 'x1' and 'y' were not missing,
fitting an intercept '_cons' and a coefficient on '1.x1'.

	'1.x1' is factor variables notation for an implied variable that
	indicates when 'x1' is equal to 1.

Here are the linear predictions:

***** BEGIN:
. predict yhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +---------------------------+
     | x1   testset    y   yhat1 |
     |---------------------------|
  1. |  1         1   12      12 |
  2. |  2         1   13    13.5 |
  3. |  3         1   14    13.5 |
  4. |  4         2    .       . |
     +---------------------------+
***** END:

Notice that -predict- treated levels 2 and 3 the same, so we get their average
response back as the linear prediction.  This is in accordance with a linear
regression model with a single indicator variable that identifies when 'x1' is
equal to 1.
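
The fitted values can be checked by hand from the coefficients in e(b).  A
quick sketch, assuming the -anova- fit above is still the active estimation
result:

***** BEGIN:
. * prediction when x1==1: intercept plus the coefficient on 1.x1
. display _b[_cons] + _b[1.x1]

. * prediction for every other level: intercept only
. display _b[_cons]
***** END:

Given the coefficients listed above, these work out to 13.5 - 1.5 = 12 and
13.5 = (13 + 14)/2, matching 'yhat1'.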

Here are the commands to reproduce the above using -regress-, but without
factor variables notation:

***** BEGIN:
. gen x1is1 = x1==1

. regress y x1is1

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(  1,     1) =    3.00
       Model |         1.5     1         1.5           Prob > F      =  0.3333
    Residual |          .5     1          .5           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.5000
       Total |           2     2           1           Root MSE      =  .70711

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       x1is1 |       -1.5   .8660254    -1.73   0.333     -12.5039    9.503896
       _cons |       13.5         .5    27.00   0.024     7.146898     19.8531
------------------------------------------------------------------------------

. predict ryhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +--------------------------------------------+
     | x1   testset    y   yhat1   x1is1   ryhat1 |
     |--------------------------------------------|
  1. |  1         1   12      12       1       12 |
  2. |  2         1   13    13.5       0     13.5 |
  3. |  3         1   14    13.5       0     13.5 |
  4. |  4         2    .       .       0        . |
     +--------------------------------------------+
***** END:

Since we did not use factor variables notation, we can reproduce the result in
Stata 10 or Stata 11; we can even use -anova- instead of -regress-.
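
For instance, a sketch of the -anova- version, where 'ayhat1' is a
hypothetical variable name; -anova- treats 'x1is1' as categorical in either
release, so the predictions should agree with 'ryhat1':

***** BEGIN:
. anova y x1is1

. predict ayhat1 if e(sample)
***** END: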

--Jeff					--Ken
[email protected]			[email protected]