Statalist



Re: st: Sparse Data Problem


From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: Sparse Data Problem
Date   Sun, 8 Mar 2009 10:54:48 -0400



John, I was wrong in saying that you cannot fit s315t = 0 into a logistic model. You can, but it is a model with only a constant term. Below I show how to put everything into one model.

As Richard Williams said in a related post today:
"Also, my experience is that the classification table, which I never use all that much anyway, is especially worthless when you have such an extreme split. You may wish to check into Gary King's -relogit-. See http://gking.harvard.edu/stats.shtml"

Steps:
For simplicity I refer to your outcome 'clstr' as y; your variable 's315t' as z; and your other predictors as x1 and x2.

1. Create z1 = z*x1 and z2 = z*x2

2. Run your model as:

logistic y z z1 z2

Note what this will do:

When z = 0:
logit(Pr(y = 1)) = constant
When z = 1:
logit(Pr(y = 1)) = constant + _b[z] + _b[z1]*x1 + _b[z2]*x2

Interpretation:
constant = log odds of outcome when z = 0.
All hypotheses about x1 and x2 apply only when z = 1

constant + _b[z] = log odds of event when z = 1 and x1 = x2 = 0;
_b[z] = log odds ratio for event for z = 1 vs z = 0 (at x1 = x2 = 0).
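
A minimal Stata sketch of steps 1 and 2, assuming a single extra predictor. It rebuilds a frequency-weighted dataset from the two stratified tables quoted further down this thread (the 103 observations with non-missing east), taking y = clstr, z = s315t and x1 = east. The count variable n and the product name z1 are illustrative, and the empty cell (east = 0, s315t = 0, clstr = 1) is left out. A second predictor x2 would be handled the same way.

```stata
clear
input east s315t clstr n
1 0 0  6
1 0 1  1
1 1 0 19
1 1 1 24
0 0 0 12
0 1 0 33
0 1 1  8
end

* Step 1: the predictor enters only through its product with s315t
generate byte z1 = s315t*east

* Step 2: one model for all observations; -logit- shows coefficients,
* -logistic- would report the same fit as odds ratios
logit clstr s315t z1 [fweight = n]

* Log odds of clstr = 1 when s315t = 0 (the constant alone)
display _b[_cons]

* Log odds of clstr = 1 when s315t = 1 and east = 0
lincom _cons + s315t

* Log odds ratio for east, which applies only when s315t = 1
display _b[z1]
```

Because this toy model has three parameters for three covariate groups, the fitted log odds should reproduce the observed group proportions, for example ln(1/18) for the constant.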


-Steve

On Mar 7, 2009, at 6:27 PM, Steven Samuels wrote:


Here's the first table you presented:
          |         clstr
  s315t |         0          1 |     Total
-----------+----------------------+----------
        0 |        22          1 |        23
        1 |        58         32 |        90
-----------+----------------------+----------
    Total |        80         33 |       113


You don't need (and won't be able to fit) a logistic model for the first row, but one might help for the second. Think of a classification and regression tree (CART) approach, where s315t= 0 defines a terminal node. By the way, missing values in the predictors are leading to differing n's in your results: 113, 103, 100.
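
As a small worked check on this table, the sketch below re-enters the four cell counts with -tabi- and computes the row-wise log odds and the crude odds ratio by hand. It uses only the counts shown above; nothing else is assumed about the data.

```stata
* Cell counts from the table above: rows s315t = 0, 1; columns clstr = 0, 1
tabi 22 1 \ 58 32, exact

* Log odds of clstr = 1 within each row
display "s315t = 0: " ln(1/22)
display "s315t = 1: " ln(32/58)

* Crude odds ratio for s315t = 1 vs s315t = 0
display "odds ratio: " (32/58)/(1/22)
```

The first row reduces to a single log odds, which is why it behaves like a terminal node: there is one event there and nothing left for further predictors to explain.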

-Steve


On Mar 7, 2009, at 4:50 PM, john metcalfe wrote:

Thanks to Dave and Steve.
Dave, I am not sure how to apply -xtmelogit- to this data set, or if
this would be a correct thing to do. I haven't worked with this before
but will look into it.
Steve, thanks for your helpful comments. I am not quite sure what is
meant by the two part prediction equation. I think you mean getting
predicted probabilities from a logit model with s315t==1, but am not
sure about 's315t negative: predict clstr = 1'? Can you make this more
explicit?
Thanks much,
John

On Sat, Mar 7, 2009 at 11:43 AM, Steven Samuels
<[email protected]> wrote:

John, your model is probably incorrect. It assumes that, when s315t is 0, the other factors make a difference implied by the model form. They don't.
Correspondingly, the stratified two-way tables indicate a possible
interaction between s315t and 'east'.

I suggest a two part prediction equation.

s315t negative: predict clstr = 1
s315t positive: predict with other factors in a logistic model.
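
One way the two-part idea could be sketched in Stata, using a frequency-weighted dataset rebuilt from the stratified tables in the original post (103 observations; the count variable n and the names phat and p1 are illustrative). In those tables only 1 case with s315t = 0 has clstr = 1, so the sketch assigns that group its low observed event proportion rather than predicting clstr = 1; that reading of the first part is an assumption.

```stata
clear
input east s315t clstr n
1 0 0  6
1 0 1  1
1 1 0 19
1 1 1 24
0 0 0 12
0 1 0 33
0 1 1  8
end

* Part 1: s315t = 0 is a terminal node; use its observed event proportion
summarize clstr [fweight = n] if s315t == 0, meanonly
generate double phat = r(mean) if s315t == 0

* Part 2: logistic model fitted only where s315t = 1
logit clstr east [fweight = n] if s315t == 1
predict double p1 if s315t == 1, pr
replace phat = p1 if s315t == 1

* One row per covariate pattern with its predicted probability
list east s315t phat if clstr == 0, noobs
```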


I'm not very familiar with exact logistic regression, but if the usual rules of thumb apply, the 32-33 events (clstr =1) entitle you to about three
predictors altogether.
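
The arithmetic behind "about three" can be written out as below, assuming the rule of thumb meant here is the common one of roughly 10 outcome events per candidate predictor. The value 10 is an assumption, not something stated in this thread.

```stata
* Assumed rule of thumb: about 10 outcome events per candidate predictor
local events = 33
local epv    = 10
display "approximate predictor budget: " floor(`events'/`epv')
```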


-Steve

On Mar 6, 2009, at 10:12 PM, john metcalfe wrote:

Dear Statalist,
I am analyzing a small data set with the outcome of interest 'clstr'. The primary goal of the analysis is to determine whether the variables 's315t' and 'east' have independent associations with the outcome. However,
s315t is highly deterministic for the outcome clstr, as below. I am
concerned that exact logistic regression is not fully accounting for
the small cell bias. I would like to employ a hierarchical logistic
regression, but it seems that the Stata command 'hireg' is only for
linear regressions?
It may be that I simply am unable to make any valid inferences with
this dataset, but I just want to make sure I have explored the
appropriate possible remedies.
Thanks,
John

John Metcalfe, M.D., M.P.H.
University of California, San Francisco


. tab s315 clstr,e

          |         clstr
    s315t |         0          1 |     Total
-----------+----------------------+----------
        0 |        22          1 |        23
        1 |        58         32 |        90
-----------+----------------------+----------
    Total |        80         33 |       113

          Fisher's exact =                 0.002
  1-sided Fisher's exact =                 0.002




. logit clstr ageat s315t east emb sm num,or

Iteration 0:   log likelihood = -62.686946
Iteration 1:   log likelihood = -51.860098
Iteration 2:   log likelihood = -50.754342
Iteration 3:   log likelihood = -50.661741
Iteration 4:   log likelihood = -50.660257
Iteration 5:   log likelihood = -50.660256

Logistic regression                               Number of obs   =        100
                                                  LR chi2(6)      =      24.05
                                                  Prob > chi2     =     0.0005
Log likelihood = -50.660256                       Pseudo R2       =     0.1919

------------------------------------------------------------------------------
       clstr | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   ageatrept |   .9908837   .0139884    -0.65   0.517     .9638428    1.018683
       s315t |   9.238959   10.28939     2.00   0.046     1.041462    81.96011
  east_asian |   4.219755   2.215279     2.74   0.006     1.508083    11.80727
         emb |   .9964845   .6599534    -0.01   0.996     .2721043    3.649268
          sm |   2.138175   1.696319     0.96   0.338      .451589    10.12379
  num_resist |   1.064089   .2385192     0.28   0.782     .6857694    1.651116
------------------------------------------------------------------------------



Strategy 1: Two-way contingency tables

. tab clstr s315t if east==1,e

          |         s315t
    clstr |         0          1 |     Total
-----------+----------------------+----------
        0 |         6         19 |        25
        1 |         1         24 |        25
-----------+----------------------+----------
    Total |         7         43 |        50

          Fisher's exact =                 0.098
  1-sided Fisher's exact =                 0.049

. tab clstr s315t if east==0,e

          |         s315t
    clstr |         0          1 |     Total
-----------+----------------------+----------
        0 |        12         33 |        45
        1 |         0          8 |         8
-----------+----------------------+----------
    Total |        12         41 |        53

          Fisher's exact =                 0.175
  1-sided Fisher's exact =                 0.108



Strategy 2: Exact Logistic Regression

observation 102: enumerations =       1128
observation 103: enumerations =        574

Exact logistic regression                        Number of obs =        103
                                                 Model score   =   19.78112
                                                 Pr >= score   =     0.0000
---------------------------------------------------------------------------
       clstr | Odds Ratio       Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
       s315t |   10.44218          32      0.0135      1.391627    474.4786
  east_asian |   5.414021          25      0.0006      1.933718    16.65417
---------------------------------------------------------------------------




(output omitted)
observation 103: enumerations =        574

Exact logistic regression                        Number of obs =        103
                                                 Model score   =   19.78112
                                                 Pr >= score   =     0.0000
---------------------------------------------------------------------------
       clstr |      Coef.       Score    Pr>=Score     [95% Conf. Interval]
-------------+-------------------------------------------------------------
       s315t |   2.345854    6.763266      0.0129      .3304732    6.162216
  east_asian |   1.688992    12.98631      0.0004      .6594448    2.812661
---------------------------------------------------------------------------


Strategy 3: Hierarchical Regression

. hireg clstr (s315t) (east)(ageat emb sm)

Model 1:
  Variables in Model:
  Adding            : s315t

      Source |       SS       df       MS              Number of obs =     113
-------------+------------------------------           F(  1,   111) =    9.18
       Model |   1.7840879     1   1.7840879           Prob > F      =  0.0030
    Residual |   21.578744   111  .194403099           R-squared     =  0.0764
-------------+------------------------------           Adj R-squared =  0.0680
       Total |  23.3628319   112  .208596713           Root MSE      =  .44091

------------------------------------------------------------------------------
       clstr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       s315t |   .3120773   .1030162     3.03   0.003     .1079438    .5162108
       _cons |   .0434783   .0919364     0.47   0.637    -.1386999    .2256565
------------------------------------------------------------------------------

Model 2:
  Variables in Model: s315t
  Adding            : east

      Source |       SS       df       MS              Number of obs =     103
-------------+------------------------------           F(  2,   100) =   12.03
       Model |  4.34936038     2  2.17468019           Prob > F      =  0.0000
    Residual |  18.0778241   100  .180778241           R-squared     =  0.1939
-------------+------------------------------           Adj R-squared =  0.1778
       Total |  22.4271845   102  .219874358           Root MSE      =  .42518

------------------------------------------------------------------------------
       clstr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       s315t |   .2817301   .1086887     2.59   0.011     .0660947    .4973654
  east_asian |   .3247109   .0843486     3.85   0.000     .1573656    .4920561
       _cons |  -.0669987   .1023736    -0.65   0.514     -.270105    .1361075
------------------------------------------------------------------------------
R-Square Diff. Model 2 - Model 1 = 0.118 F(1,100) = 14.190 p = 0.000

Model 3:
  Variables in Model: s315t  east
  Adding            : ageat emb sm

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  5,    94) =    4.72
       Model |  4.36538233     5  .873076466           Prob > F      =  0.0007
    Residual |  17.3946177    94  .185049124           R-squared     =  0.2006
-------------+------------------------------           Adj R-squared =  0.1581
       Total |       21.76    99   .21979798           Root MSE      =  .43017

------------------------------------------------------------------------------
       clstr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       s315t |   .2335983   .1163422     2.01   0.048     .0025981    .4645984
  east_asian |   .2694912   .0945411     2.85   0.005     .0817777    .4572048
   ageatrept |  -.0012444   .0024199    -0.51   0.608    -.0060491    .0035603
         emb |   .0396897   .0989203     0.40   0.689    -.1567189    .2360984
          sm |   .1063985   .1087626     0.98   0.330    -.1095522    .3223492
       _cons |  -.0454117   .1512602    -0.30   0.765    -.3457423     .254919
------------------------------------------------------------------------------
R-Square Diff. Model 3 - Model 2 = 0.007 F(3,94) = 0.029 p = 0.993


Model  R2      F(df)              p         R2 change  F(df) change      p
  1:  0.076   9.177(1,111)       0.003
  2:  0.194  12.030(2,100)       0.000     0.118     14.190(1,100)     0.000
  3:  0.201   4.718(5,94)        0.001     0.007      0.029(3,94)      0.993
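
The -hireg- output above comes from linear regressions on the 0/1 outcome. A possible logistic analogue of entering predictors in blocks is to fit nested -logit- models and compare them with a likelihood-ratio test. The sketch below does this for the first two blocks only, on a frequency-weighted dataset rebuilt from the stratified tables earlier in this post (103 observations; the count variable n and the stored names m1 and m2 are illustrative). It is an asymptotic test, so with cells this sparse it is a rough guide at best.

```stata
clear
input east s315t clstr n
1 0 0  6
1 0 1  1
1 1 0 19
1 1 1 24
0 0 0 12
0 1 0 33
0 1 1  8
end

* Block 1: s315t only
logit clstr s315t [fweight = n]
estimates store m1

* Block 2: add east
logit clstr s315t east [fweight = n]
estimates store m2

* Likelihood-ratio test for the added block
* (plays the role of the R-squared change test in -hireg-)
lrtest m1 m2
```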
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


