Note: This FAQ is not relevant for Stata 9, since nlogit in Stata 9 runs on datasets with unbalanced panels.
| Title | Nested logit models |
| Author | Gustavo Sanchez, StataCorp |
Let's say that my tree looks like this:
              .
             / \
            /   \
      __l__/     \__r__
       /|\         / \
      / | \       /   \
     d  e  f     g     h
Suppose the group variable is named id and the response y. I could create a
new variable, calling it mid, that labels the mid-level nodes using the
nlogitgen command:
. nlogitgen mid = leaf(l:d|e|f, r:g|h)
Thus the first two observations of the data might look something like this (see note 1 at the end about the leaf column):
    rec   id   mid   leaf   y   x1    x2     x3
      1    1     l      d   0    1   2.3   -1.0
      2    1     l      e   0    1   3.3   -1.1
      3    1     l      f   1    0   4.5      .
      4    1     r      g   0    0   1.3   -2.3
      5    1     r      h   0    0   5.5   -1.7
      6    2     l      d   0    1   1.2   -2.0
      7    2     l      e   0    1   4.0   -0.7
      8    2     l      f   0    1   2.0   -1.0
      9    2     r      g   1    .   5.1   -0.9
     10    2     r      h   0    0   6.1   -0.8
      .    .     .
Here each observation consumes 5 records since there are 5 leaf nodes.
If I have three covariates, x1-x3, the call to nlogit might look like this:
. nlogit y (leaf = x1 x2) (mid = x3), group(id)
The most frequent cause of the unbalanced-data error is missing values in your covariates, as demonstrated in the data listing above. Variables x3 and x1 have missing values for records 3 and 9, respectively. nlogit drops those records from the analysis, leaving groups 1 and 2 with only four of their five records and thereby making the data unbalanced.
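One way to spot this before calling nlogit is to count, for each group, how many records are complete in the model variables. The lines below are a minimal sketch that assumes the variable names from the listing above (id, leaf, y, x1-x3); nmiss and ncomplete are names introduced here purely for illustration.

```stata
* Count missing values across the model variables for each record
egen nmiss = rowmiss(y x1 x2 x3)

* Count the complete records within each group
bysort id: egen ncomplete = total(nmiss == 0)

* A balanced group has one complete record per leaf (5 in this tree);
* list the records of any group that falls short
list id leaf y x1 x2 x3 if ncomplete < 5, sepby(id)
```

With the listing above, both id 1 and id 2 would be flagged, since each has only four complete records.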
The following examples use the dataset "restaurant" from the StataCorp website. The examples refer to the tree structure below, which implies a first-level choice of having dinner at a fast food restaurant, at a family restaurant, or at a fancy restaurant. Then, once the type of restaurant is selected, the bottom level corresponds to the final decision about the specific restaurant chosen.
                       Dining
                     /    |    \
                 /        |        \
             /            |            \
      Fast Food        Family          Fancy
         / \            / | \           / \
        /   \          /  |  \         /   \
       /     \        /   |   \       /     \
      M       F      W    L    C     C       M
      P       B      M    N    E             C
The code below reproduces the example in [CM] nlogit. The
middle variable and a set of explanatory variables are generated, and then the
nested logit model is estimated:
clear
webuse restaurant
nlogitgen type=restaurant(Fast:Freebirds|MamasPizza, ///
Family:CafeEccell|LosNortenos|WingsNmore, ///
Fancy: Christophers|MadCows)
gen incFast =(type==1)*income
gen incFancy =(type==3)*income
gen kidFast =(type==1)*kids
gen kidFancy =(type==3)*kids
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
top --> bottom
type restaurant
--------------------------
Fast Freebirds
MamasPizza
Family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
Nested logit estimates
Levels = 2 Number of obs = 2100
Dependent variable = chosen LR chi2(10) = 199.6293
Log likelihood = -483.9584 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant |
cost | -.0944352 .03402 -2.78 0.006 -.1611131 -.0277572
rating | .1793759 .126895 1.41 0.157 -.0693338 .4280855
distance | -.1745797 .0433352 -4.03 0.000 -.2595152 -.0896443
-------------+----------------------------------------------------------------
type |
incFast | -.0287502 .0116242 -2.47 0.013 -.0515332 -.0059672
incFancy | .0458373 .0089109 5.14 0.000 .0283722 .0633024
kidFast | -.0704164 .1394359 -0.51 0.614 -.3437058 .2028729
kidFancy | -.3626381 .1171277 -3.10 0.002 -.5922041 -.1330721
-------------+----------------------------------------------------------------
(incl. value |
parameters) |
type |
/fast | 5.715758 2.332871 2.45 0.014 1.143415 10.2881
/family | 1.721222 1.152002 1.49 0.135 -.5366608 3.979105
/Fancy | 1.466588 .4169075 3.52 0.000 .6494642 2.283711
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(3)= 9.90 Prob > chi2 = 0.0194
------------------------------------------------------------------------------
Using this nlogit model as the base for comparison, let's
modify the data and check whether a problem arises in the
estimation.
First, let's erase the information on the explanatory variable rating for four families:
replace rating=. if family_id==65 | family_id==146 | ///
family_id==220 | family_id==285
Then, we will estimate the same nested logit model that we estimated above:
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
tree structure specified for the nested logit model
top --> bottom
type restaurant
--------------------------
fast Freebirds
MamasPizza
family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
Nested logit estimates
Levels = 2 Number of obs = 2072
Dependent variable = chosen LR chi2(10) = 199.7439
Log likelihood = -476.11744 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant |
cost | -.089669 .031975 -2.80 0.005 -.1523389 -.0269991
rating | .1585362 .1275017 1.24 0.214 -.0913624 .4084349
distance | -.1712953 .0425201 -4.03 0.000 -.2546332 -.0879575
-------------+----------------------------------------------------------------
type |
incFast | -.0305267 .0120644 -2.53 0.011 -.0541724 -.006881
incFancy | .0452898 .0088836 5.10 0.000 .0278783 .0627014
kidFast | -.0750763 .1422322 -0.53 0.598 -.3538463 .2036938
kidFancy | -.3617426 .116892 -3.09 0.002 -.5908467 -.1326386
-------------+----------------------------------------------------------------
(incl. value |
parameters) |
type |
/fast | 6.228289 2.541234 2.45 0.014 1.247562 11.20902
/family | 1.759751 1.185935 1.48 0.138 -.5646376 4.08414
/Fancy | 1.479319 .4055198 3.65 0.000 .6845151 2.274124
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(3)= 10.44 Prob > chi2 = 0.0151
------------------------------------------------------------------------------
The sample size is now lower by 28 observations (2,100 versus 2,072): each of
the four modified families contributes seven records, one per restaurant, and
all seven now have missing values for rating. nlogit therefore drops those
four families entirely. A similar situation occurs if we eliminate the
information on the dependent variable for a group of families; nlogit drops
those families from the estimation sample.
However, if we erase rating for some families only for part of the bottom level (here, the restaurants under the Fancy branch), we get the unbalanced-data error. We are effectively changing the design by stating that those families will not reach some of the bottom-level alternatives, while the remaining families still can. Look at the code below:
replace rating=. if family_id==25 & type==3 | ///
family_id==50 & type==3 | ///
family_id==75 & type==3
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
tree structure specified for the nested logit model
top --> bottom
type restaurant
--------------------------
Fast Freebirds
MamasPizza
Family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
unbalanced data
r(459);
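If the missing values cannot be recovered, one possible workaround is to drop the affected families entirely, which mirrors the earlier example in which families with rating missing on all seven records were excluded. This is only a sketch under that assumption, not something nlogit does for you; touse and nuse are names introduced here for illustration, and whether discarding those families is appropriate depends on why the values are missing.

```stata
* Mark records that are complete in every model variable
mark touse
markout touse chosen cost rating distance incFast incFancy kidFast kidFancy

* Count usable records per family; a balanced family has all 7
bysort family_id: egen nuse = total(touse)

* Drop families with an incomplete set of alternatives
drop if nuse < 7
```

After this step every remaining family has a full set of seven records, so the same nlogit call can be tried again.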
For the final example, we erase rating for every record under one branch of
the tree (all Fancy restaurants, for all families). In this case nlogit
performs the estimation since the dataset corresponds to a new design without
that branch. See the example below:
replace rating=. if type==3
This implies that the correct design is now
               Dining
               /    \
             /        \
           /            \
      Fast Food        Family
         / \            / | \
        /   \          /  |  \
       /     \        /   |   \
      M       F      W    L    C
      P       B      M    N    E
In this case, using nlogit is valid again.
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
tree structure specified for the nested logit model
top --> bottom
type restaurant
--------------------------
Fast Freebirds
MamasPizza
Family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
note: 51 groups (255 obs) dropped due to no positive outcome
or multiple positive outcomes per group
note: incFancy omitted due to no within-group variance
note: kidFancy omitted due to no within-group variance
Nested logit estimates
Levels = 2 Number of obs = 1245
Dependent variable = chosen LR chi2(7) = 125.731
Log likelihood = -337.88453 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant |
cost | -.105763 .0469251 -2.25 0.024 -.1977345 -.0137915
rating | .1706296 .1425147 1.20 0.231 -.108694 .4499533
distance | -.1556858 .0606158 -2.57 0.010 -.2744905 -.0368811
-------------+----------------------------------------------------------------
type |
incFast | -.0289775 .012192 -2.38 0.017 -.0528734 -.0050816
kidFast | -.0774806 .1462401 -0.53 0.596 -.3641059 .2091447
-------------+----------------------------------------------------------------
(incl. value |
parameters) |
type |
/Fast | 5.702476 2.985886 1.91 0.056 -.1497534 11.55471
/Family | 1.958308 2.057901 0.95 0.341 -2.075104 5.99172
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(2)= 6.86 Prob > chi2 = 0.0324
------------------------------------------------------------------------------
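The estimation sample falls from 2,100 to 1,245 observations. The 600 Fancy
records (2 restaurants for each of the 300 families) are excluded because
rating is missing, and a further 255 observations (51 families with 5
remaining records each) are dropped because those families have no positive
outcome among the remaining alternatives. The estimation is still performed
since no information is missing for the other two branches.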
1 Notice that the labels of the leaf variable are listed here. The values of the leaf variable would be 1 2 3 4 5 1 2 3 4 5. Thus you need to define the label and assign it to the leaf variable:
label define leaf_lbl 1 "d" 2 "e" 3 "f" 4 "g" 5 "h"
label values leaf leaf_lbl