Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: Regression diagnostics after -xtreg-
From: "Tobias Pfaff" <[email protected]>
To: <[email protected]>
Subject: st: Regression diagnostics after -xtreg-
Date: Wed, 7 Nov 2012 11:53:18 +0100
Dear all,
Some time ago I had posted a question on how to do regression diagnostics
after -xtreg-
(http://www.stata.com/statalist/archive/2011-08/msg01063.html). I never
received an answer and tried to come up with solutions myself.
In the meantime I have received a few private messages asking how I
eventually proceeded. I post the code that I used below.
Beware that I cannot guarantee that the code is correct! I just want to post
it for discussion in case it helps anybody.
Cheers,
Tobias
Code:
/*
Notes:
Without verifying that your data have met the assumptions underlying
OLS regression, your results may be misleading.
In particular, we will consider the following assumptions.
> Linearity - the relationships between the predictors and the
outcome variable should be linear
> Normality - the errors should be normally distributed.
Technically, normality is necessary only for hypothesis tests to be
valid; estimation of the coefficients only requires that the errors
be identically and independently distributed.
> Homogeneity of variance (homoscedasticity) - the error
variance should be constant
> Independence - the errors associated with one observation are
not correlated with the errors of any other observation (e.g.,
autocorrelation)
> Errors in variables - predictor variables are measured without
error (we will cover this in Chapter 4)
> Model specification - the model should be properly specified
(including all relevant variables, and excluding irrelevant variables)
Additionally, there are issues that can arise during the analysis
that, while strictly speaking not assumptions of regression, are
nonetheless of great concern to data analysts.
> Influence - individual observations that exert undue influence
on the coefficients
> Collinearity - predictors that are highly collinear, i.e.,
linearly related, can cause problems in estimating the regression
coefficients.
[http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm]
*/
***********************************************
** 1. Unusual and influential data
/*
Notes:
A single observation that is substantially different from all other
observations can make a large difference in the results
of your regression analysis. If a single observation (or small group
of observations) substantially changes your results,
you would want to know about this and investigate further. There are
three ways that an observation can be unusual.
Outliers: In linear regression, an outlier is an observation with a
large residual. In other words, it is an observation whose
dependent-variable value is unusual given its values on the predictor
variables. An outlier may indicate a sample peculiarity
or may indicate a data entry error or other problem.
Leverage: An observation with an extreme value on a predictor variable
is called a point with high leverage. Leverage is a
measure of how far an observation deviates from the mean of that
variable. These leverage points can have an effect on the
estimate of regression coefficients.
Influence: An observation is said to be influential if removing the
observation substantially changes the estimate of coefficients.
Influence can be thought of as the product of leverage and
outlierness.
*/
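// Before the added-variable plots, a rough pooled check of leverage and
// influence may help. Note that -predict-'s leverage and Cook's distance
// options are only available after -regress-, not after -xtreg-/-areg-, so
// this sketch ignores the panel structure and is only an approximation;
// $depvar, $indepvars, and pid are assumed to be set as in the code below.

```stata
// Leverage (hat values) and Cook's distance from a pooled OLS fit
quietly regress $depvar $indepvars
predict lev if e(sample), leverage    // distance from the predictor means
predict cooksd if e(sample), cooksd   // overall influence of each observation
// Common rule of thumb: flag observations with Cook's distance > 4/N
list pid lev cooksd if cooksd > 4/e(N) & !missing(cooksd)
drop lev cooksd
```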
// Added-variable plots
display _n in yellow "Unusual and influential data: Added-variable plots"
foreach var in $indepvars {
    // Don't create plots for categorical variables
    if (substr("`var'",1,2) == "i.") continue
    quietly {
        local RHS = subinword("$indepvars", "`var'", "", 1)
        di _n in yellow "var: `var'"
        di _n in yellow "Before: $indepvars"
        di _n in yellow "After: `RHS'"
        di _n in yellow "depvar: $depvar"
        // Step 1: Regression of y on all independent variables without `var'
        areg $depvar `RHS' [aw = phrf], absorb(pid) vce(cluster region)
        // Generate residuals (`part' is assumed to be defined earlier in the full script)
        predict res_1_`part' if e(sample), residuals
        // Step 2: Regression of `var' on all other independent variables
        areg `var' `RHS' [aw = phrf], absorb(pid) vce(cluster region)
        // Generate residuals
        predict res_2_`part' if e(sample), residuals
        // Regress both residuals on each other to obtain the statistics
        // displayed in the added-variable plot
        regress res_1_`part' res_2_`part'
        local coef = round(_b[res_2_`part'], .0001)
        local se = round(_se[res_2_`part'], .0001)
        local t = round(_b[res_2_`part']/_se[res_2_`part'], .01)
    }
    twoway scatter res_1_`part' res_2_`part', name(g1, replace) msize(tiny) legend(off) ///
        ytitle("e(life_sat | X)") xtitle("e(`var' | X)") ///
        note("coef = `coef', se = `se', t = `t'") || ///
        lfit res_1_`part' res_2_`part' //mlabel(pid)
    // Conversion of globals to locals since -graph export- cannot read globals
    local depvar = "$depvar"
    local model = "$model"
    local path = "$path"
    graph export Graphs/`path'/avplot_m_`model'_`part'_`depvar'_`var'.png, replace
    drop res_*
    graph drop _all
}
/*
Notes:
How added-variable plots should be interpreted:
The horizontal variable in the added-variable plot is the residual
from the regression of the respective variable on all other
independent variables (X). Thus, data points far to the left/right
mean that the value of the respective variable is unusually low/high
given all other independent variables (X).
Unusual data points can be marked by adding -mlabel(pid)- to the
graph.
[http://socserv.socsci.mcmaster.ca/jfox/Courses/Brazil-2009/slides-handout.pdf]
*/
**************************************************
** 2. Checking normality of residuals
/*
Notes:
Many researchers believe that multiple regression requires normality.
This is not the case. Normality of residuals is
only required for valid hypothesis testing, that is, the normality
assumption assures that the p-values for the t-tests
and F-test will be valid. Normality is not required in order to obtain
unbiased estimates of the regression coefficients.
OLS regression merely requires that the residuals (errors) be
identically and independently distributed. Furthermore, there
is no assumption or requirement that the predictor variables be
normally distributed. If this were the case, then we would not be
able to use dummy-coded variables in our models.
See also:
http://www.bwl.uni-kiel.de/bwlinstitute/grad-kolleg/new/typo3conf/ext/naw_securedl/secure.php?u=0&file=/fileadmin/publications/pdf/2010_Methodik_der_empirischen_Forschung_-_Normalverteilungsannahme__Arne_Schmidt_.pdf&t=1272745396&hash=047d8763f4331cf740a478480fd48464
*/
// Graphical analysis
kdensity residuals, normal name(g1, replace)
graph box residuals, name(g2, replace)
symplot residuals, name(g3, replace)
qnorm residuals, name(g4, replace)
pnorm residuals, name(g5, replace)
/*
Notes:
The pnorm command graphs a standardized normal probability (P-P) plot
while qnorm plots the quantiles of a variable
against the quantiles of a normal distribution. pnorm is sensitive to
non-normality in the middle range of data and
qnorm is sensitive to non-normality near the tails.
*/
graph combine g1 g2 g3 g4 g5, name(gcomb2, replace)
// Conversion of globals to locals since -graph export- cannot read globals
local depvar = "$depvar"
local model = "$model"
local path = "$path"
graph export Graphs/`path'/normality_residuals_m_`model'_`part'_`depvar'.png, replace
graph drop _all
// Formal tests
iqr residuals
jb residuals
sktest residuals // -swilk- can only be used with fewer than 2,000 obs. and -sfrancia- with fewer than 5,000 obs.
/*
Notes:
iqr stands for inter-quartile range and assumes the symmetry of the
distribution. Severe outliers consist of
those points that are either 3 inter-quartile-ranges below the first
quartile or 3 inter-quartile-ranges
above the third quartile. The presence of any severe outliers should
be sufficient evidence to reject normality
at a 5% significance level. Mild outliers are common in samples of any
size.
The p-value of the sktest is based on the assumption that the
distribution is normal. For example, a p-value of .15
indicates that we cannot reject that the residuals are normally
distributed at the 5 percent level.
*/
// Test for normality with pantest2
//pantest2 svyyear
********************************************
** 3. Checking homoscedasticity
// residual vs. fitted plot
display _n in yellow "Homoscedasticity: Residual-vs-fitted-plot"
scatter residuals yhat, name(g1) msize(tiny) || mband residuals yhat, bands(50) clp(solid)
local path = "$path"
graph export Graphs/`path'/rvfplot_m_`model'_`part'_`depvar'.png, replace
graph drop _all
/*
Notes:
Since we use cluster-robust standard errors with -vce(cluster)-, we
don't need to worry about heteroscedasticity ;)
*/
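// For the pooled model, a formal test is still possible. A minimal sketch
// using official Stata's Breusch-Pagan/Cook-Weisberg test, which is only
// available after -regress- (not after -areg-/-xtreg-), so it ignores the
// fixed effects and clustering:

```stata
// Breusch-Pagan / Cook-Weisberg test for heteroscedasticity (pooled OLS)
quietly regress $depvar $indepvars
estat hettest    // H0: constant error variance
```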
*************************************************
** 4. Checking for multicollinearity
display _n in yellow "Multicollinearity (m_$model, $depvar): VIF"
// Delete factor variables and time-series operators since they don't work with -collin-
global indepvars_new = "$indepvars"
foreach var in $indepvars {
    //display in yellow "Before: $indepvars_new"
    if (substr("`var'",1,2) == "i.") global indepvars_new = subinstr("$indepvars_new","`var'","",1)
    //display in yellow "After: $indepvars_new"
}
collin $indepvars_new
/*
Notes:
"As I understand it, multicollinearity on the right-hand side
is much the same irrespective of what is on the left-hand
side or what link function is contemplated. So I don't
see that you are obliged to do it retrospectively."
[http://www.stata.com/statalist/archive/2003-12/msg00335.html]
The term collinearity implies that two variables are near perfect
linear combinations of one another. When more than two variables are
involved it is often called multicollinearity, although the two terms
are often used interchangeably. The primary concern is that as the
degree of multicollinearity increases, the regression model estimates
of the coefficients become unstable and the standard errors for the
coefficients can get wildly inflated.
We can use the vif command after the regression to check for
multicollinearity. vif stands for variance
inflation factor. As a rule of thumb, a variable whose VIF values are
greater than 10 may merit further
investigation. Tolerance, defined as 1/VIF, is used by many
researchers to check on the degree of collinearity.
A tolerance value lower than 0.1 is comparable to a VIF of 10. It
means that the variable could be considered
as a linear combination of other independent variables.
*/
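// If -collin- (a user-written command) is not installed, official Stata's
// -estat vif- after a pooled -regress- reports the same VIF diagnostics
// described above; a sketch, assuming $indepvars_new has been built by the
// loop above:

```stata
// Variance inflation factors via official Stata (pooled OLS)
quietly regress $depvar $indepvars_new
estat vif    // rule of thumb: VIF > 10 (tolerance < 0.1) merits a closer look
```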
*************************************
** 5. Checking linearity
/*
Notes:
The most straightforward thing to do is to plot the standardized
residuals against each of the predictor
variables in the regression model. If there is a clear nonlinear
pattern, there is a problem of nonlinearity.
acprplot graphs an augmented component-plus-residual plot, a.k.a.
augmented partial residual plot.
It can be used to identify nonlinearities in the data.
Often poorly chosen functional forms show up as patterns in a plot of
residuals versus predicted values. Conversely, a satisfactory
functional form shows up, after some informal or formal smoothing, as
a flat trace on such plots. Naturally, you should check other
diagnostics too, e.g. variance functions.
I see no reason why this should not apply in broad terms to panel
data too. A disadvantage of this approach is that it does not produce
a p-value or a sharp test decision, as you evidently prefer. An
advantage is that it is flexible and allows you to apply scientific
as well as statistical judgement.
[http://www.stata.com/statalist/archive/2009-10/msg00731.html]
If you want more formal tests, then it really helps if you have an
alternative, for instance a quadratic relationship or a spline. If
the former is the case, I would first use -orthpoly- to create
orthogonal polynomials; that way the significance of the quadratic
term can be interpreted as a test of whether the quadratic term adds
anything. If your alternative is a spline, I would use the marginal
option; that way the second term measures how much the effect changes
before and after the knot, and the significance of that term tells
you whether that change is significant. Finally, you could look at
-estat ovtest-.
However, I think graphs are the best way to find non-linearities and
assess whether they are big enough for you to worry about. I would
only use tests if I had a specific hypothesis about that
non-linearity.
[http://www.stata.com/statalist/archive/2006-11/msg00448.html]
*/
/*
// component-plus-residual plot (partial residual plot)
display _n in yellow "Linearity: Component-plus-residual-plot"
foreach var in $indepvars {
    // Don't create plots for categorical variables
    if (substr("`var'",1,2) == "i.") continue
    gen residuals_plus = residuals + _b[`var']*`var' // see Kohler & Kreuter (2008, p. 214)
    scatter residuals_plus `var', name(g1) msize(tiny) || mband residuals_plus `var', bands(50)
    // Conversion of globals to locals since -graph export- cannot read globals
    local depvar = "$depvar"
    local model = "$model"
    local path = "$path"
    graph export Graphs/`path'/cprplot_m_`model'_`part'_`depvar'_`var'.png, replace
    //scatter residuals_plus `var', name(g2) msize(tiny) || lowess residuals_plus `var', sort yvarlabel("Lowess smooth") // Computation of the lowess smooth takes ages?
    drop residuals_plus
    graph drop _all
}
*/
// augmented component-plus-residual plot
display _n in yellow "Linearity: Augmented-component-plus-residual-plot"
foreach var in $indepvars {
    // Don't create plots for categorical variables
    if (substr("`var'",1,2) == "i.") continue
    gen sq = `var'^2
    areg $depvar $indepvars sq, absorb(pid) vce(cluster region)
    gen residuals_augm = residuals + _b[`var']*`var' + _b[sq]*sq
    scatter residuals_augm `var', name(g1) msize(tiny) || mband residuals_augm `var', bands(50) || lfit residuals_augm `var', lstyle(refline) legend(off)
    // Conversion of globals to locals since -graph export- cannot read globals
    local depvar = "$depvar"
    local model = "$model"
    local path = "$path"
    graph export Graphs/`path'/acprplot_m_`model'_`part'_`depvar'_`var'.png, replace
    drop sq residuals_augm
    graph drop _all
}
***********************************************
** 6. Checking model specification
/*
Notes:
A model specification error can occur when one or more relevant
variables are omitted from the model
or one or more irrelevant variables are included in the model. If
relevant variables are omitted from
the model, the common variance they share with included variables may
be wrongly attributed to those
variables, and the error term is inflated. On the other hand, if
irrelevant variables are included in
the model, the common variance they share with included variables may
be wrongly attributed to them.
Model specification errors can substantially affect the estimate of
regression coefficients.
The linktest command performs a model specification link test for
single-equation models.
linktest is based on the idea that if a regression is properly
specified, one should not be able to
find any additional independent variables that are significant except
by chance. linktest creates two
new variables, the variable of prediction, _hat, and the variable of
squared prediction, _hatsq. The
model is then refit using these two variables as predictors. _hat
should be significant since it is
the predicted value. On the other hand, _hatsq shouldn't, because if
our model is specified correctly,
the squared predictions should not have much explanatory power. That
is, we wouldn't expect _hatsq to be a significant predictor if our
model is specified correctly. So we will be looking at the p-value
for _hatsq.
Results:
We find evidence that our model is misspecified.
The coefficient for _hatsq after -linktest- is significant. And the
RESET(4) also rejects the H0.
*/
display _n in yellow "Model specification: linktest"
gen yhat_sq = yhat^2
areg $depvar yhat yhat_sq [aw = phrf], absorb(pid) vce(cluster region)
local coeff_yhat_sq = _b[yhat_sq]
display _n in yellow "p-value for yhat_sq should be insignificant!"
drop yhat_sq
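// The notes above also mention a RESET(4) test. For the pooled model this
// can be obtained with official Stata's -estat ovtest-, which refits the
// regression with powers of the fitted values; a sketch (pooled OLS only,
// ignores the fixed effects):

```stata
// Ramsey RESET test using powers of the fitted values
quietly regress $depvar $indepvars
estat ovtest    // H0: model has no omitted higher-order terms
```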
**************************************************
** 7. Checking issues of independence
/*
Notes:
The statement of this assumption, that the errors associated with one
observation are not correlated with the errors of any other
observation, covers several different situations. Consider the case
of collecting data from students in eight different elementary
schools. It is likely that the students within each school will tend
to be more like one another than students from different schools,
that is, their errors
are not independent. We will deal with this type of situation (...)
when we demonstrate the (...) cluster option.
Another way in which the assumption of independence can be broken is
when data are collected on the same
variables over time. Let's say that we collect truancy data every
semester for 12 years. In this situation
it is likely that the errors for observation between adjacent
semesters will be more highly correlated than
for observations more separated in time. This is known as
autocorrelation.
Results:
Concerning autocorrelation, -xtserial- shows evidence for first-order
autocorrelation.
Note that if the panel identifier (e.g. individuals, firms, or
countries) is the cluster() variable,
then Rogers standard errors are heteroscedasticity and autocorrelation
consistent.
=> Do the estimations again with vce(cluster pid) and check if the
results change.
*/
xi: xtserial $depvar $indepvars
// Estimate with -xtscc- (Driscoll-Kraay standard errors); also consider -xtregar- and report differences
xi: xtscc $depvar $indepvars, fe
/*
Further commands for treating autocorrelation:
xtabond
newey2 -> not with fixed effects
xtgls
*/
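// Since -xtregar- is mentioned above, a minimal sketch of a fixed-effects
// estimation with AR(1) disturbances (official Stata). Note that -xtregar-
// does not allow analytic weights or vce(cluster), so the specification
// necessarily differs from the -areg- models above; pid and svyyear are
// assumed to be the panel and time identifiers.

```stata
// Fixed-effects model with first-order autoregressive error term
xtset pid svyyear
xtregar $depvar $indepvars, fe
```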
*********************************************
** 8. Checking measurement error
// ??
***************************************************
** 9. Checking exogeneity of variables
// Hausman tests!?
// http://www.stata.com/support/faqs/stat/endogeneity.html
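// A standard fixed-vs-random-effects Hausman comparison could look like the
// sketch below. Caveat: -hausman- is not valid with cluster-robust VCE, so
// the comparison is run with conventional standard errors.

```stata
// Hausman test: H0 = random-effects estimator is consistent;
// rejection favours the fixed-effects specification
quietly xtreg $depvar $indepvars, fe
estimates store fe
quietly xtreg $depvar $indepvars, re
estimates store re
hausman fe re
```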
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/