RE: st: correcting skewness of an indep variables
From: "Mihes, Dimitrie" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: RE: st: correcting skewness of an indep variables
Date: Sun, 21 Jul 2013 17:22:14 +0000
David,
With regard to your question, time is not a predictor in my model, as natural disasters are triggered naturally and at random. The unit of analysis is, to be more precise, each natural disaster to which the US contributed aid between 1992 and 2004.
Going back to the issue of a linear relationship between the predictor and the outcome: I regressed the amount of aid (logged) on the number of articles on each event (a count) and then ran -cprplot no_of_articles, lowess lsopts(bwidth(1))-, both with and without the zero values. The relationship seemed non-linear, as confirmed by -ovtest- (p = 0.0083). Even so, the bivariate relationship between aid and the number of articles was significant at p < 0.001. However, after I removed some of the outliers in the predictor and reran the same tests, with and without the zero values, the relationship became linear, as confirmed by the graph and by -ovtest- (p = 0.9669).
Nevertheless, my primary concern was that the skewness would affect the validity of the p-values in the full regression model, as "no of articles" is almost always significant (p < 0.001), even when I cluster or use robust standard errors, or remove the outliers and the zero values.
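A minimal sketch of the checks described above, run on simulated stand-in data rather than the actual disaster data. The names ln_aid and region and all data-generating values are assumptions made for illustration; only no_of_articles and the -cprplot- options come from this message. -estat ovtest- is the current name of -ovtest-.

```stata
* Simulated stand-in data (hypothetical; NOT the disaster data from this thread)
clear
set seed 2013
set obs 300
* Hypothetical cluster identifier taking values 1 to 10
generate region = ceil(10*runiform())
* Count predictor with an added spike at zero
generate no_of_articles = rpoisson(4)
replace  no_of_articles = 0 if runiform() < 0.3
* Logged outcome with a mildly non-linear dependence on the count
generate ln_aid = 10 + 0.5*sqrt(no_of_articles) + rnormal()

* Fit the model, then inspect linearity in the predictor
regress ln_aid no_of_articles
* Component-plus-residual plot with a lowess smooth
cprplot no_of_articles, lowess lsopts(bwidth(1))
* Ramsey RESET test for omitted powers of the fitted values
estat ovtest

* Same model with robust and with cluster-robust standard errors
regress ln_aid no_of_articles, vce(robust)
regress ln_aid no_of_articles, vce(cluster region)
```

To repeat the checks without the zeros, the -regress- line can be restricted with -if no_of_articles > 0- before rerunning -cprplot- and -estat ovtest-.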
________________________________________
From: [email protected] [[email protected]] on behalf of David Hoaglin [[email protected]]
Sent: 21 July 2013 14:24
To: [email protected]
Subject: Re: st: correcting skewness of an indep variables
Dimitrie,
The skewness of a predictor variable is not necessarily a problem, and
neither is a spike at 0. The first step should be to examine whether
the relation between the dependent variable and each of the predictors
(in the full regression model) departs systematically from being
linear. Various plots of residuals can help you do this.
If the data on the dependent variable when the predictor is 0 behave
differently from the data on the dependent variable when the predictor
is > 0, you may need to model the two parts separately (as a sort of
mixture). You can try omitting all the observations in which the
predictor is 0 and fitting a separate regression to the remaining
data.
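
One way to act on this suggestion, sketched on simulated stand-in
data. The name ln_aid, the data-generating values, and the shift
applied to the zero group are hypothetical, not taken from the thread.

```stata
* Simulated stand-in data (hypothetical)
clear
set seed 2013
set obs 300
generate no_of_articles = rpoisson(4)
replace  no_of_articles = 0 if runiform() < 0.3
generate ln_aid = 10 + 0.3*no_of_articles + rnormal()
* Hypothetical shift: let the zero group sit at a different level
replace  ln_aid = ln_aid - 1 if no_of_articles == 0

* Regression restricted to observations with a positive predictor
regress ln_aid no_of_articles if no_of_articles > 0

* Outcome among the observations in which the predictor is 0
summarize ln_aid if no_of_articles == 0
```

Comparing the mean of ln_aid among the zeros with the intercept of the
restricted fit (the fitted line extrapolated to 0) gives a rough sense
of whether the two parts behave differently.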
In a response on Cross Validated, you mentioned that your data came
from natural disasters over a span of years. Should time be a
predictor in your models?
Without more information on your data, I can only offer general suggestions.
David Hoaglin
On Sun, Jul 21, 2013 at 7:31 AM, Mihes, Dimitrie
<[email protected]> wrote:
> Apologies, allow me to correct myself. The issue I've mentioned has also been addressed in
> http://stats.stackexchange.com/questions/64714/count-data-as-an-independent-variable-in-ols-using-a-dummy-variable-the-variab?noredirect=1#comment124994_64714
>
> However, the proposed solution seems to be in contrast with that proposed in this thread (which I had mistakenly not mentioned)
> http://www.stata.com/statalist/archive/2010-03/msg01034.html
>
> From my understanding, the former suggests using a dummy variable to account for a spike at 0 (for a predictor based on count data) only when zero means unobserved or truncated data, whereas the latter suggests either looking for a non-linear relationship between the variables (in which case a log transformation is proposed) or adding a dummy variable plus the skewed variable linearly, even when the zeros represent the true value.
> I am conflicted between the two, as the former suggests that the dummy variable is useless when zeros are the observed values, while the latter, which advocates this technique when 0 is the true value, lacks a fuller explanation of how to interpret the dummy alongside the linear variable and of the process through which the dummy variable controls for the spike at 0.
>
> Moreover, using a log-transformation renders the 0 values as "missing values".
>
> Thanks for your consideration.
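
Two points in the quoted message can be illustrated with a short sketch on simulated stand-in data (ln_aid, zero_articles, ln_articles and the data-generating values are hypothetical). In a model with a zero indicator plus the linear count, the indicator's coefficient is the gap between the mean outcome at 0 and the value that the line fitted to the positive counts would predict at 0. Separately, ln(0) evaluates to missing in Stata, so -regress- drops those observations.

```stata
* Simulated stand-in data (hypothetical)
clear
set seed 2013
set obs 300
generate no_of_articles = rpoisson(4)
replace  no_of_articles = 0 if runiform() < 0.3
generate ln_aid = 10 + 0.3*no_of_articles + rnormal()

* Indicator for the spike at zero, entered alongside the linear term
generate byte zero_articles = (no_of_articles == 0)
regress ln_aid zero_articles no_of_articles

* ln(0) is missing, so the zeros fall out of the estimation sample
generate ln_articles = ln(no_of_articles)
count if missing(ln_articles)
regress ln_aid ln_articles
```

In the first regression the slope and intercept match those from a fit restricted to no_of_articles > 0, because the indicator absorbs the zero group entirely.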
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/