There have been a few interrelated questions recently about the new factor
variables in Stata 11.
David Airey <[email protected]> asks,
> Will factor variables support different coding schemes like
> indicator coding and effect coding?
Joseph Coveney <[email protected]> echoes David's question,
> A question that I have, too. The coding schemes, at least those
> that interest me most, such as reverse Helmert, can easily be done
> from conventional dummy indicators with -lincom-, afterward. [...]
> Maybe the new -margins- can do this sort of thing, too [...]
The short answer is not yet.
We gave thorough consideration to codings when designing factor variables. I
can't say when we will undertake codings in a serious way, but both
the syntax and internal workings of factor variables are compatible with a
future implementation of codings.
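In the meantime, the -lincom- route Joseph mentions works directly on the
factor-variable coefficients. Here is a minimal sketch, assuming the auto
dataset that ships with Stata; the particular contrasts are only
illustrations, not a recommended coding.
. sysuse auto, clear
. regress mpg i.rep78
. * level 5 versus level 4
. lincom 5.rep78 - 4.rep78
. * level 3 versus the mean of levels 1 and 2 (reverse-Helmert style);
. * level 1 is the base, so its coefficient is 0 and drops out
. lincom _b[3.rep78] - _b[2.rep78]/2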
Roger Newson <[email protected]> writes,
> I too am keen to know the answer to David's query. I routinely use
> the -noomit- option of -xi- in Stata 10 to fit multi-intercept, and have not
> found any mention of a corresponding option [...]
Roger is referring to the fact that typically one category (level) of a factor
variable is omitted when creating the indicators for each level. We must omit
one level because if we include indicators for all levels then the indicators
will be collinear with the constant in our regressions. In Stata 11, there are
two ways to control what level is used as the base and whether a base is used
at all.
In a variable list, the ib. operator designates the base and can be
abbreviated b. The default base is the lowest level. So if we type,
. regress mpg i.rep78
the base level is 1 because 1 is the smallest of the levels of rep78 -- 1, 2,
3, 4, and 5.
If, instead, we type
. regress mpg b3.rep78
the base level is 3, and our regression will not include an indicator for
rep78==3.
The operator bn. specifies that we do not want a base. That is to say, we
want indicators for all levels of a variable. Typing,
. regress mpg bn.rep78, noconstant
runs a regression with indicators for all 5 levels of rep78.
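To see the collinearity for ourselves, here is a minimal sketch, assuming the
auto dataset that ships with Stata. If we ask for all the indicators but
leave the constant in, Stata should drop one term and say so in a note.
Without the constant, each coefficient is simply the mean of mpg at that
level, which -tabstat- lets us check.
. sysuse auto, clear
. * all 5 indicators plus a constant: one term is omitted as collinear
. regress mpg bn.rep78
. * all 5 indicators, no constant: coefficients are the level means of mpg
. regress mpg bn.rep78, noconstant
. tabstat mpg if !missing(rep78), by(rep78) statistics(mean)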
We could also type -b(last).rep78- to make the last level (5) of rep78 the
base. Typing -b(freq).rep78- makes the most frequent level (3) the base.
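A quick sketch of both forms, again assuming the auto dataset; -tabulate-
is there only so we can confirm which level is the most frequent.
. sysuse auto, clear
. * check the frequency of each level of rep78
. tabulate rep78
. * largest level as the base
. regress mpg b(last).rep78
. * most frequent level as the base
. regress mpg b(freq).rep78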
The b. operator can also be used on interactions.
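As a sketch of one way this can look, assuming the auto dataset and assuming
that the operator is attached to the factor variable inside the interaction
term, the following interacts rep78 (base 3) with continuous weight:
. sysuse auto, clear
. * main effects and interaction, with level 3 of rep78 as the base
. regress mpg ib3.rep78##c.weight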
You can also set the base on your variables permanently. Typing,
. fvset base none rep78 foreign
sets -rep78- and -foreign- to have no base.
. fvset base 3 rep78
sets the base of -rep78- to 3.
Typing,
. fvset base last _all
tells Stata that, whenever any variable is used as a factor variable, the
variable's largest value in the sample should be used as the base. If you
save your dataset, the -fvset-ings are saved too.
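A short sketch of the round trip, assuming the auto dataset: set a base,
list the current settings, fit a model that picks the setting up, and then
clear it again.
. sysuse auto, clear
. fvset base 3 rep78
. * list the factor-variable settings currently stored in the data
. fvset report
. * i.rep78 now uses 3 as the base without our saying so
. regress mpg i.rep78
. * restore the default behavior for rep78
. fvset clear rep78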
There are other aspects of factor variables that we haven't discussed. For
example, if we type,
. regress y x1 x2 5.country
then a single indicator for country==5 will be added to the model.
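To make that concrete with data we all have, here is a sketch using the auto
dataset in place of y, x1, x2, and country. The hand-made indicator rep5 is
a hypothetical name used only for comparison; the two regressions should
report the same coefficients.
. sysuse auto, clear
. * single indicator for rep78==5, created on the fly
. regress mpg weight 5.rep78
. * the same model with an indicator built by hand
. generate byte rep5 = (rep78 == 5) if !missing(rep78)
. regress mpg weight rep5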
Indicator variables for each level of a factor variable can be thought of as
virtual variables that always exist in our data. That means they can also be
used in expressions like -if 1.foreign-.
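For instance, assuming the auto dataset, the two counts below should agree,
and the -summarize- is restricted to foreign cars:
. sysuse auto, clear
. * virtual indicator used as a condition
. count if 1.foreign
. * the conventional way to write the same condition
. count if foreign == 1
. summarize mpg if 1.foreign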
-- Vince
[email protected]