Re: st: logistic regression predictors
From: Steven Samuels <[email protected]>
To: [email protected]
Subject: Re: st: logistic regression predictors
Date: Mon, 19 Jul 2010 11:19:14 -0400
--
In fact, if you feed the two-time data to -cart-, as I suggested, the
log-rank test in -cart- (which is -stcox- with the breslow option for
ties) will be equivalent to the stratified Mantel-Haenszel test for
binary data. Thus -cart- will provide a defensible split for binary
data. This splitting algorithm is not equivalent to that in the
original CART method; also, -cart- does not prune its trees and so
risks over-splitting. It continues to split as long as there are
enough events; there is a split with enough observations on each
side; and the p-value, adjusted for multiple comparisons, is small
enough. The minimum required numbers of events and observations are
set by the minfail() and minsize() options; the default values are 10.
The default p-value threshold is 0.05 and is also settable.
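For concreteness, a minimal sketch of that setup, where -died- is a
hypothetical 0/1 death indicator and x1-x3 are hypothetical
predictors; check -help cart- for the exact syntax and for the name of
the p-value option.

* Two-time trick: give deaths a shorter "survival" time than survivors
gen byte time = cond(died, 1, 2)
stset time, failure(died)
* Grow the tree; minfail() and minsize() are described above
cart x1 x2 x3, minfail(10) minsize(10)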
To calculate the error rate, you have to identify the observations in
each final node; classify each observation according to whether the
proportion of events in its node is above or below 0.5; then compute
the percent of correct classifications overall (also for each node if
you wish, but the per-node estimates will not be very precise).
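A sketch of that calculation, assuming each observation's final node
has been recorded in a (hypothetical) variable -node-:

* Majority-rule classification within each final node
bysort node: egen p_node = mean(died)    // event proportion in node
gen byte predicted = (p_node > .5)
gen byte correct = (predicted == died)
* Percent correctly classified, overall and by node
summarize correct
tabstat correct, by(node) statistics(mean n)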
Steve
On Jul 18, 2010, at 11:59 AM, Steve Samuels wrote:
I was wrong about the utility of -cart- and -boost- for your data.
-boost- is not useful when the predictors are indicator variables, as
yours seem to be. (You haven't given many details.) -cart- is intended
for failure-time data, not binary data.
With a small number of predictors, you might be able to build a
classification tree "by hand". -cart- might guide you to a possible
tree: simply set up two times, a shorter one for deaths and a longer
one for survivors. -cart- will show the numbers of cases and failures
at each terminal node. The error rate will be optimistic, because
it is measured on the same data used to form the tree. To get a more
accurate error rate, you could also do a cross-validation manually.
Most simply, randomly split your data into "training" and "test"
sets. Develop your tree on the training set, and estimate its
accuracy (percent correctly predicted) on the test set. This can be
improved by k-fold cross-validation: randomly divide your data into k
(say 10) sets; omit one at a time; run -cart- on the remainder; and
test the resulting prediction on the omitted set. Your estimate of
prediction error is the average over the 10 folds.
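A sketch of the 10-fold mechanics follows. The tree fit itself is not
shown: classifying a held-out fold means applying the fitted tree's
splitting rules to new observations, which you would code yourself;
a -logistic- fit stands in below as a placeholder classifier.

* Assign each observation to one of 10 random folds
set seed 12345
gen double u = runiform()
xtile fold = u, nq(10)
gen byte cvpred = .
forvalues i = 1/10 {
    * Fit on the nine training folds (substitute your tree rules here)
    quietly logistic died x1 x2 x3 if fold != `i'
    tempvar p
    quietly predict double `p' if fold == `i', pr
    quietly replace cvpred = (`p' > .5) if fold == `i'
}
* The mean of -cvcorrect- is the cross-validated accuracy
gen byte cvcorrect = (cvpred == died)
summarize cvcorrect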
I also suggest that you look at the counts, deaths, and rates for
all combinations of your predictors. See -crp- by Nick Cox,
downloadable from SSC.
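If -crp- is not to hand, the built-in -collapse- gives a similar
summary (again with hypothetical variable names):

* Counts, deaths, and death rates for every predictor combination
preserve
collapse (count) n=died (sum) deaths=died (mean) rate=died, by(x1 x2 x3)
list, sepby(x1) noobs
restore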
Steve
On Sun, Jul 18, 2010 at 10:24 AM, Steve Samuels <[email protected]>
wrote:
With such a strong independently predictive group, logistic regression
will give poor predictions, because it assumes that all variables are
needed to predict for each individual. The solution is a tree-based
approach. The original reference is Breiman, L., J. H. Friedman, R. A.
Olshen, and C. J. Stone. 1984. Classification and Regression Trees.
New
York: Chapman & Hall/CRC. Apparent Stata solutions are -boost-
("findit boost") and -cart- (from SSC). I say "apparent" because I've
not closely read the documentation for either. Non-commercial
solutions can be found in R and at
http://www.stat.wisc.edu/~loh/guide.html.
Steve
--
Steven Samuels
[email protected]
18 Cantine's Island
Saugerties NY 12477
USA
Voice: 845-246-0774
Fax: 206-202-4783
On Sun, Jul 18, 2010 at 1:57 AM, lilian tesmann
<[email protected]> wrote:
Dear All,
I am trying to predict mortality rates in a specific population of
clients.
I encountered two problems and would be really grateful for any
insights or suggestions.
(1) We have one predictor, a health condition, which is present in
only 5% of the population, but over 70% of people with that condition
die. Not surprisingly, the OR is very large (from 25 to 50). The
purpose of the analysis is to obtain individual predictions, but they
are hugely influenced by this health condition. Could anyone suggest
how to deal with this problem?
(2) Another problem is that, in this very specific clinical
population, two other health conditions, which are usually very
significant predictors of death, have ORs of 0.3-0.5. As a result,
according to my model, sicker people have a lower risk of dying. It
looks to me like a collinearity issue between the predictors and the
inclusion/exclusion criteria that created this population. What do I
do in this situation? We cannot change the inclusion criteria, and we
have only a small number of predictors, three of them involving
‘behavior problems’.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*