Re: st: stepwise
At 04:30 AM 9/4/2006, [email protected] wrote:
> stepwise regression is needed. Say we have n = 200, and a potential pool
> of predictors = 50, and say that each of these 50 predictors has 1 or 2
> missing values, not necessarily at random. Using the Stata stepwise
> procedure, we may well end up with a final model with some 5 variables,
> but this model was derived only using around 75% of the sample, and most
> likely not a random sample. Would it not be wiser to use all available
> observations at each try? Intuitively I feel that this final model might
> be less biased because it does not involve throwing as much information
> away (1% vs 25%), although I believe mathematically this would be quite
> difficult to prove.

One of the concerns with stepwise is that a different sample could
easily lead to different variables being selected. That concern
would seem to be even greater with a small sample, where the
estimates are going to be less precise, i.e. two different samples of
200 could easily lead to two different sets of variables being
selected, especially if a lot of variables are close to each other in
their correlations.
Your suggested procedure might make this even worse. Suppose X1 and
X2 both have 20% missing data and it is a different 20% for each
variable. X1 barely edges out X2 in step 1. The sample for
subsequent steps will be quite different than it would be if X2 had
barely edged out X1.
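
To make the X1/X2 point concrete, here is a minimal sketch in Python (not Stata). The observation numbers are invented purely for illustration: X1 is assumed missing for cases 0-39 and X2 for cases 40-79, so each variable loses a different 20% of the 200 cases.

```python
# Hypothetical setup: 200 observations, X1 and X2 each missing a different 20%.
n = 200
all_obs = set(range(n))
missing_x1 = set(range(0, 40))    # assumed: X1 is missing for cases 0-39
missing_x2 = set(range(40, 80))   # assumed: X2 is missing for cases 40-79

# Sample carried into step 2 under an "all available observations" rule,
# depending on which variable wins step 1.
sample_if_x1_wins = all_obs - missing_x1
sample_if_x2_wins = all_obs - missing_x2

shared = sample_if_x1_wins & sample_if_x2_wins       # cases in both samples
not_shared = sample_if_x1_wins ^ sample_if_x2_wins   # cases in exactly one

print(len(sample_if_x1_wins), len(sample_if_x2_wins))  # size of each sample
print(len(shared), len(not_shared))
```

Under these assumptions each path keeps 160 cases, but only 120 of them are common to both. A near-tie in step 1 therefore decides which 40 cases every later step gets to see.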
Anyway, I would say that, if you are really concerned about the
effects of missing data, then try to do something about it. If you
don't want to get too fancy about it, perhaps even simple mean
substitution (which has its own problems) would be better than
nothing at all. See the -impute- command.
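
As a rough illustration of simple mean substitution and of one of its problems, here is a small Python sketch. It is not the -impute- command and makes no claim about how that command works; the data values and the helper name mean_substitute are made up for the example.

```python
from statistics import mean, pvariance

def mean_substitute(values):
    """Replace None (missing) entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Made-up variable with two missing values.
x = [2.0, 4.0, None, 6.0, None, 8.0]
x_observed = [v for v in x if v is not None]
x_filled = mean_substitute(x)

print(x_filled)
# Variance of the observed values versus the variance after filling in.
print(pvariance(x_observed), pvariance(x_filled))
```

Every case is kept, but the filled-in variable has a smaller variance than the observed values did, because all the substituted values sit exactly at the mean. That is one of the problems alluded to above.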
I assume it would be possible to write a stepwise procedure that
behaved the way you would like, but it might take a long time to
run. Regular stepwise can be done just by working off a correlation
matrix. With your approach, the correlation matrix would be a moving
target as the sample changed, so I imagine you'd have to do a lot
more calculations.
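
For what "working off a correlation matrix" means, here is a sketch of the two-predictor case in Python. It uses the textbook formulas for standardized regression coefficients; the correlations passed in are invented, and the function name is hypothetical. It says nothing about how Stata implements stepwise internally.

```python
def standardized_betas(r1y, r2y, r12):
    """Standardized OLS coefficients for y on x1 and x2, from correlations only.

    r1y, r2y: correlations of each predictor with y
    r12:      correlation between the two predictors
    """
    denom = 1.0 - r12 ** 2
    b1 = (r1y - r12 * r2y) / denom
    b2 = (r2y - r12 * r1y) / denom
    r_squared = b1 * r1y + b2 * r2y   # R-squared of the two-predictor model
    return b1, b2, r_squared

# Invented correlations, as if computed once on a fixed sample.
b1, b2, r_squared = standardized_betas(r1y=0.5, r2y=0.4, r12=0.5)
print(round(b1, 3), round(b2, 3), round(r_squared, 3))
```

With a fixed sample the three correlations are computed once and every candidate model can be evaluated from them. If the sample changes whenever a variable enters or leaves, the correlations themselves have to be recomputed from the raw data at each step.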

> Richard, re-running step-wise on the selected model does not produce the
> same results, does it?

Hopefully it would, but there is no guarantee. If there are big
differences, this may underscore the problems you have with missing
data (MD) or the problems with using a stepwise procedure where small
differences in variable correlations could produce very different
models. For example, in the full sample, X1 might barely edge out X2,
but in a sample where the cases with MD have been eliminated, X2 might
barely edge out X1.
-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
FAX: (574)288-4373
HOME: (574)289-5227
EMAIL: [email protected]
WWW (personal): http://www.nd.edu/~rwilliam
WWW (department): http://www.nd.edu/~soc