Re: st: logistic regression predictors
From: Steven Samuels <[email protected]>
To: [email protected]
Subject: Re: st: logistic regression predictors
Date: Mon, 19 Jul 2010 11:19:14 -0400
--
In fact, if you feed the two-time data to -cart-, as I suggested, the
log-rank test in -cart- (which is -stcox- with the breslow option for
ties) will be equivalent to the stratified Mantel-Haenszel test for
binary data. Thus -cart- will provide a defensible split for binary
data. This splitting algorithm is not equivalent to that in the
original CART method; also, -cart- does not prune its trees and so
risks over-splitting. It continues to split as long as there are
enough events; there is a split with enough observations on each
side; and the p-value, adjusted for multiple comparisons, is small
enough. The minimum required numbers of events and observations are
set by the minfail() and minsize() options; the default values are 10.
The default p-value threshold is 0.05 and is also settable.
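For concreteness, a minimal sketch of that setup, where -died- is a
hypothetical 0/1 death indicator and x1-x3 are hypothetical
predictors; check -help cart- for the exact syntax and for the name of
the p-value option.

* Two-time trick: give deaths a shorter "survival" time than survivors
gen byte time = cond(died, 1, 2)
stset time, failure(died)
* Grow the tree; minfail() and minsize() are described above
cart x1 x2 x3, minfail(10) minsize(10)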
To calculate the error rate, you have to identify the observations in
each final node; classify each observation according to whether the
proportion of events in its node is above or below 0.5; then compute
the percent of correct classifications overall (also for each node if
you wish, but the per-node estimates will not be very precise).
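A sketch of that calculation, assuming each observation's final node
has been recorded in a (hypothetical) variable -node-:

* Majority-rule classification within each final node
bysort node: egen p_node = mean(died)    // event proportion in node
gen byte predicted = (p_node > .5)
gen byte correct = (predicted == died)
* Percent correctly classified, overall and by node
summarize correct
tabstat correct, by(node) statistics(mean n)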
Steve
On Jul 18, 2010, at 11:59 AM, Steve Samuels wrote:
I was wrong about the utility of -cart- and -boost- for your data.
-boost- is not useful when the predictors are indicator variables, as
yours seem to be. (You haven't given many details.) -cart- is intended
for failure-time data, not binary data.
With a small number of predictors, you might be able to build a
classification tree "by hand". -cart- might guide you to a possible
tree: simply set up two times, a shorter one for deaths and a longer
one for survivors. -cart- will show the numbers of cases and failures
at each terminal node. The error rate will be optimistic, because
it is measured on the same data used to form the tree. To get a more
accurate error rate, you could also do a cross-validation manually.
Most simply, randomly split your data into "training" and "test"
sets. Develop your tree on the training set, and estimate its
accuracy (percent correctly predicted) on the test set. This can be
improved by k-fold cross-validation: randomly divide your data into k
(say 10) sets; omit one at a time; run -cart- on the remainder; and
test the resulting prediction on the omitted set. Your estimate of
prediction error is the average over the 10 folds.
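A sketch of the 10-fold mechanics follows. The tree fit itself is not
shown: classifying a held-out fold means applying the fitted tree's
splitting rules to new observations, which you would code yourself;
a -logistic- fit stands in below as a placeholder classifier.

* Assign each observation to one of 10 random folds
set seed 12345
gen double u = runiform()
xtile fold = u, nq(10)
gen byte cvpred = .
forvalues i = 1/10 {
    * Fit on the nine training folds (substitute your tree rules here)
    quietly logistic died x1 x2 x3 if fold != `i'
    tempvar p
    quietly predict double `p' if fold == `i', pr
    quietly replace cvpred = (`p' > .5) if fold == `i'
}
* The mean of -cvcorrect- is the cross-validated accuracy
gen byte cvcorrect = (cvpred == died)
summarize cvcorrect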
I also suggest that you look at the counts, deaths, and rates for
all combinations of your predictors. See -crp- by Nick Cox,
downloadable from SSC.
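If -crp- is not to hand, the built-in -collapse- gives a similar
summary (again with hypothetical variable names):

* Counts, deaths, and death rates for every predictor combination
preserve
collapse (count) n=died (sum) deaths=died (mean) rate=died, by(x1 x2 x3)
list, sepby(x1) noobs
restore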
Steve
On Sun, Jul 18, 2010 at 10:24 AM, Steve Samuels <[email protected]>
wrote:
With such a strong independently predictive group, logistic regression
will give poor predictions, because it assumes that all variables are
needed to predict for each individual. The solution is a tree-based
approach. The original reference is Breiman, L., J. H. Friedman, R. A.
Olshen, and C. J. Stone. 1984. Classification and Regression Trees.
New
York: Chapman & Hall/CRC. Apparent Stata solutions are -boost-
("findit boost") and -cart- (from SSC). I say "apparent" because I've
not closely read the documentation for either. Non-commercial
solutions can be found in R and at
http://www.stat.wisc.edu/~loh/guide.html.
Steve
--
Steven Samuels
[email protected]
18 Cantine's Island
Saugerties NY 12477
USA
Voice: 845-246-0774
Fax: 206-202-4783
On Sun, Jul 18, 2010 at 1:57 AM, lilian tesmann
<[email protected]> wrote:
Dear All,
I am trying to predict mortality rates in a specific population of
clients.
I encountered two problems and would be really grateful for any
insights or suggestions.
(1) We have one predictor, a health condition, which is present in
only 5% of the population, but over 70% of people with that condition
die. Not surprisingly, the OR is very large (from 25 to 50). The
purpose of the analysis is to obtain individual predictions, but they
are hugely influenced by this health condition. Could anyone suggest
how to deal with this problem?
(2) Another problem is that, in this very specific clinical
population, two other health conditions, which are usually very
significant predictors of death, have ORs of 0.3-0.5. As a result,
according to my model, sicker people have a lower risk of dying. It
looks to me like a collinearity issue between the predictors and the
inclusion/exclusion criteria that created this population. What do I
do in this situation? We cannot change the inclusion criteria, and we
have only a small number of predictors, three of them involving
‘behavior problems’.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*