Re: model building survey data. WAS st: adjusted R2 in survey regression
From: Steven Samuels <[email protected]>
To: [email protected]
Subject: Re: model building survey data. WAS st: adjusted R2 in survey regression
Date: Thu, 16 Oct 2008 10:30:11 -0400
Aca:
1. The -linktest- command is an excellent test of fit if some
predictors are continuous, and it can assist in model building. See:
http://www.ats.ucla.edu/stat/Stata/webbooks/logistic/chapter3/statalog3.htm
and
http://www.michiganscienceonline.org/article.aspx?ID=8669
If the link test is significant, something must be changed:
add new predictors, including polynomial terms and interactions;
transform predictors; or transform the outcome. You will have to figure
out the solution yourself. Note that a model that passes the link test
is not necessarily a "good" model or one that predicts well.
Conversely, a model that predicts well may also display a lack of
fit. You may encounter a situation where adding a statistically
significant variable turns a non-significant link test (no evidence
of lack of fit) into a significant one (evidence that the model does
not fit). I have also seen (once) a situation in which no model we
could think up passed the link test.
Unfortunately, -linktest- is not survey-aware, and will give an
incorrect p-value if run after -svy: reg-. Here is a way of doing it
yourself (be sure to zap text gremlins first).
**************************CODE BEGINS**************************
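sysuse auto, clear
gen psu = mod(_n, 10)       // artificial cluster
svyset psu [pweight=rep78]  // rep78 stands in for a sampling weight
svy: reg mpg weight         // fit the survey-weighted model of interest
predict yhat                // linear prediction (xb)
gen yhat2 = yhat*yhat
svy: reg mpg yhat yhat2     // significance of yhat2 is the link test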
***************************CODE ENDS***************************
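As one illustration of the "something must be changed" step, the sketch below adds a squared term for weight and repeats the hand-made link test on the revised model. It reuses the artificial design from the example above. The names weight2, yhat_q and yhat_q2 are made up for illustration, and the quadratic is only one possible remedy, not a recommendation for these data.

```stata
sysuse auto, clear
gen psu = mod(_n, 10)            // artificial cluster, as above
svyset psu [pweight=rep78]       // rep78 stands in for a sampling weight
gen weight2 = weight^2           // hypothetical added polynomial term
svy: reg mpg weight weight2      // revised model
predict yhat_q                   // linear prediction from the revised model
gen yhat_q2 = yhat_q^2
svy: reg mpg yhat_q yhat_q2      // link test: look at the p-value for yhat_q2
```

If yhat_q2 is still significant, the revised model has not removed the lack of fit and a different change is worth trying.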
2. However, the link test does not compare models with different sets
of covariates. For that you will need -test- (-help test-).
**************************CODE BEGINS**************************
svy: reg mpg weight trunk length
test trunk length //tests for significance of adding trunk and length
***************************CODE ENDS***************************
3. Aids to model building: Several commands will suggest
transformations of predictors: Stata's -fracpoly- and the commands
-mfracpol- and -boxtid- by Patrick Royston (-search mfracpol, all- and
-search boxtid, all-). They are not -svy- aware, but they do accept
pweights and clustering options. Search Google for "fractional
polynomials" and "multivariable fractional polynomials" to learn more
about them.
**************************CODE BEGINS**************************
fracpoly reg mpg weight [pweight=rep78], vce(cluster psu) // rep78 and psu as set up in point 1
***************************CODE ENDS***************************
4. If you try to build models by finding significant covariates, the
"final" model is unlikely to hold up in new data. You can avoid this
by using theory-based models, as Maarten suggested. Otherwise, regard
your models as exploratory. At a minimum, set aside part of your
data (say some of the strata), build the model on the rest, and test
the model on the set-aside part.
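A minimal sketch of the set-aside idea in point 4, again using the artificial design from point 1. The indicator holdout, the choice of three clusters, and the squared-error summary are assumptions for illustration only; with real data you would set aside whole strata or PSUs according to your own design.

```stata
sysuse auto, clear
gen psu = mod(_n, 10)                      // artificial cluster
svyset psu [pweight=rep78]                 // rep78 stands in for a sampling weight
gen byte holdout = inlist(psu, 0, 1, 2)    // hypothetical set-aside clusters
svy, subpop(if holdout == 0): reg mpg weight   // build the model on the rest
predict yhat                               // predictions for all observations
gen sqerr = (mpg - yhat)^2                 // squared prediction error
svy, subpop(holdout): mean sqerr           // mean squared error in the set-aside part
```

Comparing this with the same mean computed for holdout == 0 gives a rough idea of how much the fit degrades in data the model has not seen.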
On Oct 15, 2008, at 8:51 PM, Aca N.T. wrote:
Steve had shown how -dlist- can sort my problem out anyway. In this
case, however, I was wondering if -linktest- can be used as a
substitute for adjusted R2 (or should it be more of a complementary
test?). I mean, does -linktest- act like -lrtest-, which compares the
likelihood of one model with another when running a simple logistic
regression, so we can see how a model is improved?
Aca.
On Thu, Oct 16, 2008 at 5:01 AM, Stas Kolenikov
<[email protected]> wrote:
On Wed, Oct 15, 2008 at 2:23 AM, Aca N.T. <[email protected]> wrote:
I'm puzzled with model building using -svy: reg- for there is no
adjusted R squared produced.
Is there an alternative test for this?
Uhm... alternative test for what?
If Stata does not produce something really obvious, like R2 or
adjusted R2, then it means they looked into this and decided it had
dubious statistical properties. R2 is an iid data concept: each
residual is a random variable that has a certain variance, and that
variance is the same for all observations. The complex survey setting
does not really have that concept: the explanatory and response
variables are in fact fixed, and the randomness comes from sampling
procedure only. The regression formulas may look the same (in the end,
there are just this many ways to minimize a sum of squares...) but
interpretation of a few things is different. So one can probably talk
about population variance of residuals, as a relatively meaningful
quantity, but there is no analogue of the concept of the variance of
each individual residual -- that's a fixed quantity. If there is no
population analogue of R2, it should not be reported to the user, and
that makes perfect sense.
--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/