Re: st: RE: transformation of a continuous variable for a logistic regression model
From: Suzy <[email protected]>
To: [email protected]
Subject: Re: st: RE: transformation of a continuous variable for a logistic regression model
Date: Wed, 19 Apr 2006 23:17:02 -0400
I'm sorry for the attachment - that was accidental - it must have been
generated with the citation number (as a link) in the paragraph I copied
and pasted.
According to Royston (see [R] fracpoly in the Stata manual), the definitions
below clear up the terminology issue (for me at least), which is exactly what
you said in #1 below.
A conventional polynomial of degree m, with powers p = (1, ..., m), is defined as:
p(m) = b1*x^1 + b2*x^2 + ... + bm*x^m
A fractional polynomial of degree m, with powers p = (p1, ..., pm), is defined as:
fp(m) = b1*x^p1 + b2*x^p2 + ... + bm*x^pm
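To make the definition concrete, here is a minimal sketch in Python (not Stata, and not part of the original exchange) that builds the terms of a fractional polynomial for a single positive value of x. The helper name fp_terms is hypothetical. It assumes the usual conventions that power 0 stands for ln(x) and that a repeated power multiplies the previous term by ln(x), which matches the x^3 and x^3*ln(x) pair that fracgen reports for powers 3 3. It also assumes the powers are given in sorted order so that repeats are adjacent.

```python
import math

def fp_terms(x, powers):
    """Return the fractional polynomial terms for one positive value x.

    Hypothetical helper for illustration only; it is not a Stata function.
    Assumed conventions:
      - power 0 stands for ln(x)
      - a repeated power multiplies the previous term by ln(x)
      - powers are sorted, so repeated powers sit next to each other
    """
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    terms = []
    previous_power = None
    for p in powers:
        if p == previous_power:
            # Repeated power: previous term times ln(x).
            term = terms[-1] * math.log(x)
        else:
            term = math.log(x) if p == 0 else x ** p
        terms.append(term)
        previous_power = p
    return terms

# Degree is the number of powers, not the highest power.
print(fp_terms(2.0, (1, 2)))     # conventional quadratic: x, x^2
print(fp_terms(2.0, (3, 3)))     # degree 2 with powers 3 3: x^3, x^3*ln(x)
print(fp_terms(2.0, (-2, 0.5)))  # degree 2 with no power equal to 2
```

All three calls return two terms, so all three are "degree 2" in the fracpoly sense, yet only the first is a quadratic in the conventional sense.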
As far as number 2 goes, based on my meager and naive experience with
this method, I would agree with you:
"although quite different polynomials may give
similar overall fits the individual terms in
those polynomials may be not at all comparable."
If you have any recommendations for further reading on the underlying issue you describe here, please let me know:
"The basic underlying issue is very likely that both these kinds of polynomials are not orthogonal."
As always - I appreciate your help!
Suzy
Nick Cox wrote:
It's good that you consider this makes biological
sense. My main concern was that you were focusing
on the statistical results alone. In curve fitting
there is often a tendency to over-fit and ignore
substantive or scientific considerations.
I have only two detailed comments to add:
1. Terminology. You call "quadratic" what
[R] fracpoly (and presumably Patrick Royston
and co-authors) would call "degree 2" and what the paper
cited here appears to call "second-order". This may
sound like a parade of synonyms, but my strong guess
is that it is not. With fractional polynomials
the degree is the number of powers, and _not_ the
highest power. In your case, your term "quadratic"
appears quite wrong therefore, especially for
polynomials in which none of the individual powers
is 2. I was reacting to your term and not looking
carefully at the documentation which explains this
terminology.
2. I have not tried to understand what you are doing
with -boxtid- (which is a user-written command).
But in very general terms my understanding is that
although quite different polynomials may give
similar overall fits the individual terms in
those polynomials may be not at all comparable.
The basic underlying issue is very likely that both these kinds
of polynomials are not orthogonal.
Note that attachments should not be sent to Statalist.
This is explicit in the FAQ.
Nick
[email protected]
Suzy wrote:
Nick - Not to beat a dead horse, but I just thought I'd share this with
you, from:
Vincenzo Bagnardi, Antonella Zambon, Piero Quatto and Giovanni Corrao.
Flexible Meta-Regression Functions for Modeling Aggregate Dose-Response
Data, with an Application to Alcohol and Mortality. Am J Epidemiol 2004;
159:1077-1086.
"Although it is rather simple, the family of second-order fractional
polynomial models offers considerably flexibility. In particular, by
choosing p1 and p2 from a predefined set P = {–2, –1, –0.5, 0, 0.5, 1,
2, 3}, a very rich set of possible functions, including some so-called
U-shaped and J-shaped relations, may be accommodated. The powers are
expressed according to the Box-Tidwell transformation (12),
in which x^(pi) denotes x^pi if pi != 0 and log x if pi = 0. When p1 = p2 = p,
the model becomes log(RR|x) = β1 x^p + β2 (x^p log x)."
I thought that a second order polynomial = "degree of 2" (M=2) =
quadratic, as shown in my output from fracpoly below (M=2). I had also
e-mailed the fracplot to show the quadratic curve, but for some reason
it was deleted in transit. In any case, the age variable
transformations (age_1 and age_2) from the fracgen command were
calculated using the formula above: β1 age^3 + β2 (age^3 log age).
Thus, I still respectfully do not understand why the fracpoly and boxtid
results are not consistent for this variable. As far as a theoretical
justification of the functional form relating age to the response
variable goes, it does make sense for these data.
Nick Cox wrote:
Sorry, but this to me is just a restatement of
your previous posting, and addresses none of
the points I raised.
That aside,
I don't understand how a quadratic function can
have powers 3 3. Cubics in my experience are never
appropriate for global fits unless there are clear
dimensional grounds for using them, which seems unlikely
here.
Nick
[email protected]
Suzy wrote:
Thanks for your response Nick. In a nutshell, age is not linear in the
logit. I'm using the fracpoly command to identify the best functional
form for age in the full model. The result returned from fracpoly was a
quadratic function with powers 3 3 (which also looks good with
fracplot). However, when I further assessed the model using the boxtid
command, the results with the new age transformation were
not favorable (the Ho was rejected). When I transformed another
continuous variable in the same full logistic model (quadratic with
powers 1 2 by fracpoly), the boxtid results were favorable, all graphs
looked very good, and the diagnostics were good (linktest, etc...). I'm
trying to understand why my results (fracpoly and boxtid) aren't
consistent for the age variable, but are for all other continuous variables.
Nick Cox wrote:
I am not clear what you think Statalist members know
that can help you here. For example, the field
in which you are working, what the response variable
-dmcat- means, and what other predictors there may be are all
hidden from view, so the chance of giving opinions
drawing on substantive expertise is zero. Otherwise
put, you appear to be assuming that the choices
here can all be made on purely statistical criteria,
an attitude which always worries me greatly.
What I have observed, as a kind of anthropologist of
statistical science, is that age plays very different
roles in different fields. Economists often seem
to find that a quadratic in age does very nicely,
whereas biostatisticians often seem to need
more complicated representations, which seems
perfectly plausible given the complexities of
childhood, adolescence, etc.
Either way, -fracpoly- like other programs has
no inbuilt sensor (or censor) selecting theoretically or
scientifically sensible functional forms. So,
I suggest that you plot the curve implied against
age and think about it as something that needs justification
or interpretation independently from the data.
Nick
[email protected]
Suzy wrote:
I am trying to transform one final continuous independent variable (age)
in a logistic regression model. I've tried what I know that's available
via Stata. For example, I used the fracpoly command and the best
transformation was a second order polynomial with powers 3 3.
Fractional polynomial model comparisons:
---------------------------------------------------------------
age df Deviance Gain P(term) Powers
---------------------------------------------------------------
Not in model 0 2098.129 -- --
Linear 1 1834.224 0.000 0.000 1
m = 1 2 1805.957 28.267 0.000 -1
m = 2 4 1791.327 42.897 0.001 3 3
m = 3 6 1790.526 43.699 0.670 -2 3 3
m = 4 8 1788.431 45.793 0.351 -2 -2 3 3
---------------------------------------------------------------
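The columns of this table can be checked by hand. The sketch below (Python, standard library only, not part of the original post) assumes that Gain is the drop in deviance relative to the linear model and that P(term) compares each model with the one on the row above it using a chi-square test on the difference in df. Those are assumptions about how the table was built, but the numbers they produce agree with the table to within rounding.

```python
import math

# Deviances copied from the fracpoly comparison table above.
deviance = {
    "linear": 1834.224,
    "m=1": 1805.957,
    "m=2": 1791.327,
    "m=3": 1790.526,
    "m=4": 1788.431,
}

def chi2_upper_tail(x, df):
    """Upper-tail chi-square probability for df = 1 or df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")

order = ["m=1", "m=2", "m=3", "m=4"]

# Assumed definition of Gain: deviance of the linear model minus model deviance.
gain = {m: deviance["linear"] - deviance[m] for m in order}

# Assumed definition of P(term): each model against the next lower degree.
# Linear -> m=1 adds 1 df; every later step adds 2 df (see the df column).
p_term = {"m=1": chi2_upper_tail(deviance["linear"] - deviance["m=1"], 1)}
for lower, higher in zip(order, order[1:]):
    p_term[higher] = chi2_upper_tail(deviance[lower] - deviance[higher], 2)

for m in order:
    print(m, "gain = %.3f" % gain[m], "P(term) = %.3f" % p_term[m])
```

Read this way, moving from m = 1 to m = 2 is a clear improvement, while the further steps to m = 3 and m = 4 are not, which is why the powers 3 3 model was selected.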
I then used fracgen to generate the new age variables - age_1 and age_2.
fracgen age 3 3
-> gen double age_1 = X^3
-> gen double age_2 = X^3*ln(X)
(where: X = (age+1)/10)
The coefficients for age_1 and age_2 from the full logistic regression
model:
------------------------------------------------------------------------------
       Y var | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       age_1 |   1.087994   .0093302     9.83   0.000      1.06986    1.106436
       age_2 |   .9644247   .0037538    -9.31   0.000     .9570955    .9718101
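One way to look at the curve these two terms imply for age is sketched below in Python (standard library only, not part of the original post). It assumes the reported odds ratios are the exponentiated logit coefficients, uses the fracgen scaling X = (age+1)/10, and holds all other covariates fixed. The age grid of 20 to 90 is hypothetical, since the age range of the data is not given in the thread.

```python
import math

# Odds ratios reported above; their logs are the logit-scale coefficients
# (assuming the table shows exponentiated coefficients).
b1 = math.log(1.087994)    # age_1 = X^3
b2 = math.log(0.9644247)   # age_2 = X^3 * ln(X)

def scaled(age):
    """fracgen scaling reported in the output: X = (age + 1) / 10."""
    return (age + 1) / 10.0

def age_logit(age):
    """Contribution of age to the logit, other covariates held fixed."""
    x = scaled(age)
    return b1 * x ** 3 + b2 * x ** 3 * math.log(x)

# d/dX [b1*X^3 + b2*X^3*ln(X)] = X^2 * (3*b1 + b2 + 3*b2*ln(X)),
# which is zero when ln(X) = -(3*b1 + b2) / (3*b2).
x_turn = math.exp(-(3 * b1 + b2) / (3 * b2))
age_turn = 10 * x_turn - 1

# Hypothetical age grid for illustration.
for age in range(20, 91, 10):
    print(age, round(age_logit(age), 3))
print("turning point at age", round(age_turn, 1))
```

Because b2 is negative, the turning point is a maximum: the age contribution to the logit rises up to that age and declines afterwards. Whether that shape is plausible is a substantive question, not something the fit itself can settle.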
However the boxtid command rejected the null for both age_1 and age_2....
       age_1 |   .0100805   .0007172   14.055   Nonlin. dev. 24.646 (P = 0.000)
          p1 |   .0535714   .2122906    0.252
------------------------------------------------------------------------------
       age_2 |  -.0021756   .0004885   -4.453   Nonlin. dev.  7.894 (P = 0.005)
          p1 |   3.864227   2.133377    1.811
In all other respects, the preliminary diagnostics look good...
Linktest:
------------------------------------------------------------------------------
       dmcat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .8900851   .1153855     7.71   0.000     .6639337    1.116236
      _hatsq |  -.0319886   .0307101    -1.04   0.298    -.0921793    .0282022
       _cons |  -.0450195   .1069617    -0.42   0.674    -.2546606    .1646215
------------------------------------------------------------------------------
lroc
Logistic model for dmcat
number of observations = 3354
area under ROC curve = 0.8647
etc...etc...etc...
My question is: should I be concerned with the results of the boxtid
command? Is there something I've done incorrectly, or something else I
can do/should do?
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/