This FAQ is not relevant for Stata 9, since nlogit in Stata 9 runs on datasets with unbalanced panels.
Title  | Nested logit models
Author | Gustavo Sanchez, StataCorp
Let's say that my tree looks like this:
                     .
                    / \
                   /   \
             _l__ /     \__r_
             / | \       /  \
            /  |  \     /    \
           /   |   \   /      \
          d    e    f g        h

Suppose the group variable is named id and the response y. I could create a new variable, calling it mid, that labels the mid-level nodes using the nlogitgen command:
    . nlogitgen mid = leaf(l:d|e|f, r:g|h)

Thus the first two observations of the data might look something like this:
    rec  id  mid  leaf[1]  y  x1   x2    x3
      1   1   l      d     0   1  2.3  -1.0
      2   1   l      e     0   1  3.3  -1.1
      3   1   l      f     1   0  4.5     .
      4   1   r      g     0   0  1.3  -2.3
      5   1   r      h     0   0  5.5  -1.7
      6   2   l      d     0   1  1.2  -2.0
      7   2   l      e     0   1  4.0  -0.7
      8   2   l      f     0   1  2.0  -1.0
      9   2   r      g     1   .  5.1  -0.9
     10   2   r      h     0   0  6.1  -0.8
      .
      .
      .

Here each observation consumes 5 records since there are 5 leaf nodes.
If I have three covariates, x1-x3, the call to nlogit might look like this:
    . nlogit y (leaf = x1 x2) (mid = x3), group(id)

The most frequent cause of the unbalanced-data error is missing values in your covariates, as demonstrated in the data listing above. Variable x3 is missing in record 3, and x1 is missing in record 9. nlogit drops those records from the analysis, thereby making the data incomplete: each of the two observations is left with only four of its five records.
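The check described above can be run before calling nlogit. The following is a minimal sketch under stated assumptions: it re-enters the ten records from the listing with input, stores mid and leaf as strings purely for display (nlogit itself needs a numeric, labeled leaf variable), and introduces two hypothetical helper variables, incomplete and nusable.

```stata
clear
* Toy data copied from the listing above (mid and leaf kept as strings here)
input rec id str1 mid str1 leaf y x1 x2 x3
 1 1 l d 0 1 2.3 -1.0
 2 1 l e 0 1 3.3 -1.1
 3 1 l f 1 0 4.5  .
 4 1 r g 0 0 1.3 -2.3
 5 1 r h 0 0 5.5 -1.7
 6 2 l d 0 1 1.2 -2.0
 7 2 l e 0 1 4.0 -0.7
 8 2 l f 0 1 2.0 -1.0
 9 2 r g 1 . 5.1 -0.9
10 2 r h 0 0 6.1 -0.8
end

* Flag records with a missing value in any covariate used by the model
generate byte incomplete = missing(x1, x2, x3)

* Count the usable records left in each group (5 means the group is complete)
bysort id (rec): egen nusable = total(!incomplete)

* Show the offending records and the number of usable records per group
list rec id leaf x1 x2 x3 if incomplete
tabulate id nusable
```

Any group with nusable between 1 and 4 is one that would trigger the unbalanced-data error in this setup; here both groups have 4 usable records.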
The following examples use the dataset "restaurant" from the StataCorp website. The examples refer to the tree structure below, which implies a first-level choice of having dinner at a fast food restaurant, at a family restaurant, or at a fancy restaurant. Then, once the type of restaurant is selected, the bottom level corresponds to the final decision about the specific restaurant chosen.
                     Dining
                      /   \
                    /   |   \
                  /     |     \
                /       |       \
              /         |         \
            /           |           \
       Fast Food     Family        Fancy
          / \          /|\          / \
         /   \        / | \        /   \
        /     \      /  |  \      /     \
       M       F    W   L   C    C       M
       P       B    M   N   E            C

The code below reproduces the example in [CM] nlogit. The middle variable and a set of explanatory variables are generated, and then the nested logit model is estimated:
    clear
    webuse restaurant
    nlogitgen type=restaurant(Fast:Freebirds|MamasPizza,           ///
                              Family:CafeEccell|LosNortenos|WingsNmore, ///
                              Fancy: Christophers|MadCows)
    gen incFast  =(type==1)*income
    gen incFancy =(type==3)*income
    gen kidFast  =(type==1)*kids
    gen kidFancy =(type==3)*kids
    nlogit chosen (restaurant=cost rating distance)    ///
           (type= incFast incFancy kidFast kidFancy),  ///
           group(family_id) nolog

     top --> bottom

            type    restaurant
    --------------------------
            Fast     Freebirds
                    MamasPizza
          Family    CafeEccell
                    LosNorte~s
                    WingsNmore
           Fancy    Christop~s
                       MadCows

    Nested logit estimates
    Levels             =          2                 Number of obs      =      2100
    Dependent variable =     chosen                 LR chi2(10)        =  199.6293
    Log likelihood     =  -483.9584                 Prob > chi2        =    0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    restaurant   |
            cost |  -.0944352     .03402    -2.78   0.006    -.1611131   -.0277572
          rating |   .1793759    .126895     1.41   0.157    -.0693338    .4280855
        distance |  -.1745797   .0433352    -4.03   0.000    -.2595152   -.0896443
    -------------+----------------------------------------------------------------
    type         |
         incFast |  -.0287502   .0116242    -2.47   0.013    -.0515332   -.0059672
        incFancy |   .0458373   .0089109     5.14   0.000     .0283722    .0633024
         kidFast |  -.0704164   .1394359    -0.51   0.614    -.3437058    .2028729
        kidFancy |  -.3626381   .1171277    -3.10   0.002    -.5922041   -.1330721
    -------------+----------------------------------------------------------------
    (incl. value |
     parameters) |
    type         |
           /fast |   5.715758   2.332871     2.45   0.014     1.143415     10.2881
         /family |   1.721222   1.152002     1.49   0.135    -.5366608    3.979105
          /Fancy |   1.466588   .4169075     3.52   0.000     .6494642    2.283711
    ------------------------------------------------------------------------------
    LR test of homoskedasticity (iv = 1): chi2(3)=    9.90    Prob > chi2 = 0.0194
    ------------------------------------------------------------------------------

Using this nlogit model as the base for comparison, let's modify the data and check whether a problem arises in the estimation.
First, let's erase the information on the explanatory variable rating for four families:
    replace rating=. if family_id==65  | family_id==146 |  ///
                        family_id==220 | family_id==285

Then, we will estimate the same nested logit model that we estimated above:
    nlogit chosen (restaurant=cost rating distance)    ///
           (type= incFast incFancy kidFast kidFancy),  ///
           group(family_id) nolog

    tree structure specified for the nested logit model

     top --> bottom

            type    restaurant
    --------------------------
            fast     Freebirds
                    MamasPizza
          family    CafeEccell
                    LosNorte~s
                    WingsNmore
           Fancy    Christop~s
                       MadCows

    Nested logit estimates
    Levels             =          2                 Number of obs      =      2072
    Dependent variable =     chosen                 LR chi2(10)        =  199.7439
    Log likelihood     = -476.11744                 Prob > chi2        =    0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    restaurant   |
            cost |   -.089669    .031975    -2.80   0.005    -.1523389   -.0269991
          rating |   .1585362   .1275017     1.24   0.214    -.0913624    .4084349
        distance |  -.1712953   .0425201    -4.03   0.000    -.2546332   -.0879575
    -------------+----------------------------------------------------------------
    type         |
         incFast |  -.0305267   .0120644    -2.53   0.011    -.0541724    -.006881
        incFancy |   .0452898   .0088836     5.10   0.000     .0278783    .0627014
         kidFast |  -.0750763   .1422322    -0.53   0.598    -.3538463    .2036938
        kidFancy |  -.3617426    .116892    -3.09   0.002    -.5908467   -.1326386
    -------------+----------------------------------------------------------------
    (incl. value |
     parameters) |
    type         |
           /fast |   6.228289   2.541234     2.45   0.014     1.247562    11.20902
         /family |   1.759751   1.185935     1.48   0.138    -.5646376     4.08414
          /Fancy |   1.479319   .4055198     3.65   0.000     .6845151    2.274124
    ------------------------------------------------------------------------------
    LR test of homoskedasticity (iv = 1): chi2(3)=   10.44    Prob > chi2 = 0.0151
    ------------------------------------------------------------------------------

We see that the sample size is now lower by 28 observations (2,100 versus 2,072): each of the four modified families contributes seven records, and all of them now have missing values for rating.
A similar situation occurs if we eliminate the information on the dependent variable for a group of families; nlogit drops those families from the estimation sample.
However, if we erase the information on rating for some families, but this time only for the alternatives under one branch (here the Fancy restaurants, type==3), we get the unbalanced-data error because we are effectively changing the design by stating that some individuals will not reach the bottom level. Look at the code below:
    replace rating=. if family_id==25 & type==3 |  ///
                        family_id==50 & type==3 |  ///
                        family_id==75 & type==3
    nlogit chosen (restaurant=cost rating distance)    ///
           (type= incFast incFancy kidFast kidFancy),  ///
           group(family_id) nolog

    tree structure specified for the nested logit model

     top --> bottom

            type    restaurant
    --------------------------
            Fast     Freebirds
                    MamasPizza
          Family    CafeEccell
                    LosNorte~s
                    WingsNmore
           Fancy    Christop~s
                       MadCows

    unbalanced data
    r(459);

For the final example, we erase rating for the full set of observations in one branch (type==3, the Fancy restaurants). In this case nlogit performs the estimation since the dataset will correspond to a new design without the corresponding branch. See the example below:
    replace rating=. if type==3

This implies that the correct design is now
               Dining
                /  \
               /    \
              /      \
             /        \
            /          \
           /            \
       Fast Food     Family
          / \          /|\
         /   \        / | \
        /     \      /  |  \
       M       F    W   L   C
       P       B    M   N   E

In this case, using nlogit is valid again.
    nlogit chosen (restaurant=cost rating distance)    ///
           (type= incFast incFancy kidFast kidFancy),  ///
           group(family_id) nolog

    tree structure specified for the nested logit model

     top --> bottom

            type    restaurant
    --------------------------
            Fast     Freebirds
                    MamasPizza
          Family    CafeEccell
                    LosNorte~s
                    WingsNmore
           Fancy    Christop~s
                       MadCows

    note: 51 groups (255 obs) dropped due to no positive outcome
          or multiple positive outcomes per group
    note: incFancy omitted due to no within-group variance
    note: kidFancy omitted due to no within-group variance

    Nested logit estimates
    Levels             =          2                 Number of obs      =      1245
    Dependent variable =     chosen                 LR chi2(7)         =   125.731
    Log likelihood     = -337.88453                 Prob > chi2        =    0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    restaurant   |
            cost |   -.105763   .0469251    -2.25   0.024    -.1977345   -.0137915
          rating |   .1706296   .1425147     1.20   0.231     -.108694    .4499533
        distance |  -.1556858   .0606158    -2.57   0.010    -.2744905   -.0368811
    -------------+----------------------------------------------------------------
    type         |
         incFast |  -.0289775    .012192    -2.38   0.017    -.0528734   -.0050816
         kidFast |  -.0774806   .1462401    -0.53   0.596    -.3641059    .2091447
    -------------+----------------------------------------------------------------
    (incl. value |
     parameters) |
    type         |
           /Fast |   5.702476   2.985886     1.91   0.056    -.1497534    11.55471
         /Family |   1.958308   2.057901     0.95   0.341    -2.075104     5.99172
    ------------------------------------------------------------------------------
    LR test of homoskedasticity (iv = 1): chi2(2)=    6.86    Prob > chi2 = 0.0324
    ------------------------------------------------------------------------------

Besides the records with missing rating in the Fancy branch, nlogit now drops 51 families (255 observations) that no longer have a positive outcome, presumably because the restaurant they chose was a Fancy one. The estimation is still performed since no information is missing for the other two branches.
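The three situations shown in the examples (complete groups, groups whose records are all missing, and groups with only some records missing) can be told apart before estimation. The following is a minimal sketch on made-up toy data, not on the restaurant dataset: the variables alt, usable, nvalid, and status are hypothetical names, and the three families are constructed so that each one falls into a different case.

```stata
clear
* Hypothetical toy panel: 3 families with 7 alternatives each
set obs 21
generate family_id = ceil(_n/7)
bysort family_id: generate alt = _n
generate rating = mod(_n, 5) + 1                  // arbitrary toy values

* Family 2 loses rating for two alternatives; family 3 loses it everywhere
replace rating = . if family_id == 2 & alt >= 6
replace rating = . if family_id == 3

* Count the usable records per family
generate byte usable = !missing(rating)
bysort family_id: egen nvalid = total(usable)

* Classify each family by how much of its data survives
generate byte status = cond(nvalid == 7, 1, cond(nvalid == 0, 2, 3))
label define status_lbl 1 "complete" 2 "all missing" 3 "partly missing"
label values status status_lbl
tabulate family_id status
```

In terms of the examples above, "complete" families are used as they are, "all missing" families are simply dropped from the estimation sample, and "partly missing" families are the ones that lead to the unbalanced-data error unless the same alternatives are missing for every family.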
1 Notice that the labels of the leaf variable are listed here. The values of the leaf variable would be 1 2 3 4 5 1 2 3 4 5. Thus you need to define the label and assign it to the leaf variable:
    label define leaf_lbl 1 "d" 2 "e" 3 "f" 4 "g" 5 "h"
    label values leaf leaf_lbl