--- Krista Jacobs wrote:
> I am estimating a model where y is vitamin A consumption, and among
> the x's is participation in a nutritional intervention. The vitamin A
> consumption variable was highly skewed, so I ran
> svyreg lny x. (The distribution of lny is sufficiently close to normal.)
It is a common misconception that the distribution of the dependent
variable needs to be normal. This is not the case: the assumption is
that the distribution of the dependent variable conditional on the
explanatory variables is normal; in other words, the residuals need
to be normal. The unconditional distribution of the dependent
variable can look highly non-normal even if the residuals are normal.
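A small simulation makes the point. This is only a sketch with made-up
data: x, y and e are hypothetical variables, the seed and the effect
size of 10 are arbitrary, and runiform() and rnormal() assume a
reasonably recent Stata.

```stata
* Simulated data: normal errors, yet a clearly non-normal marginal y
clear
set seed 12345
set obs 1000
generate x = runiform() < 0.5     // hypothetical binary regressor
generate y = 10*x + rnormal()     // errors are standard normal
histogram y                       // two separate humps, nothing like a normal
regress y x
predict e, residuals
qnorm e                           // residuals should lie close to the reference line
```

The histogram of y is bimodal because the two groups have very
different means, while the residuals from the regression are normal
by construction.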
> Unfortunately, the results were a little too high for me to really
> believe, so I also ran svyreg y x which yielded something a bit more
> reasonable. It was suggested that I might be seeing the
> retransformation problem at work.
These two models estimate slightly different things: if you
back-transform (exponentiate) the coefficients of a linear regression
with a log-transformed dependent variable, you get for dummy variables
the ratio of geometric means (Newson 2003), while the coefficient of a
dummy variable in a regression with a non-transformed dependent
variable is the difference in arithmetic means.
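To see the distinction concretely, here is a minimal sketch that uses
the auto data shipped with Stata instead of your survey data, so it
ignores survey weights and design. The variable lnprice is created
just for this illustration.

```stata
sysuse auto, clear
generate lnprice = ln(price)

* log outcome: the exponentiated coefficient on foreign is a ratio of geometric means
regress lnprice foreign, eform("GM ratio")

* geometric means per group, for comparison (see the "Geometric" row)
ameans price if foreign == 0
ameans price if foreign == 1

* untransformed outcome: the coefficient on foreign is a difference in arithmetic means
regress price foreign
```

Dividing the two geometric means reported by -ameans- should reproduce
the exponentiated coefficient from the first regression.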
> The homoskedasticity of the error terms from svyreg lny x is
> rejected.
Tests of assumptions are pretty useless when it comes to model
building: they may tell you there is a problem, but they do
not tell you what the problem is or how to solve it. They also
mess with your inference: the p-values are now all conditional
on the prior tests, which is probably not what you want.
What you want to do is look at various graphs involving the
residuals. They will give you a lot more information about
where the heteroscedasticity comes from and what to do about it.
For a clear overview of this topic see Fox (1991).
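As a rough sketch of the kind of graphs I mean, again on the auto data
and with plain -regress- rather than your survey estimator (the model
and the variable names lnprice and e are only illustrative):

```stata
sysuse auto, clear
generate lnprice = ln(price)
regress lnprice weight foreign

rvfplot, yline(0)     // residuals versus fitted values
rvpplot weight        // residuals versus a single predictor

predict e, residuals
qnorm e               // normal quantile plot of the residuals
```

A fan or funnel shape in the first plot suggests that the residual
variance changes with the fitted values; the second plot helps to see
whether a particular explanatory variable is involved.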
> I started to work with glm and a log link, but I have a few basic
> (sorry) questions.
>
> First, in the glm estimation should y be the dependent variable or
> lny? That is, do I want to write " glm y x, link(log)" or "glm lny x,
> link(log)." I think it's the first, but I'm not positive.
The first: -glm y x, link(log)-. The log link models the log of the
mean of y, so y itself stays untransformed.
> Second, I've been using the default Gaussian for the family. Is there
> a reason to use a different distribution like gamma or Poisson?
The Poisson is a discrete distribution, so you may not want to use it
(or, if you do, use the -robust- option).
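For reference, the alternatives could be specified roughly as follows.
This is a sketch on the auto data with an illustrative model, not a
recommendation for your data; vce(robust) is the current spelling of
the -robust- option.

```stata
sysuse auto, clear

glm price foreign, link(log)                              // default: family(gaussian)
glm price foreign, link(log) family(gamma)                // continuous, positive, skewed outcome
glm price foreign, link(log) family(poisson) vce(robust)  // Poisson with robust standard errors
```

All three model the log of the mean of price; they differ in what they
assume about how the variance relates to the mean.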
> Third, for simplicity, say x is a dummy variable. After I run "glm y
> x, link(log)" I ask Stata to exponentiate with eform. Are the results
> it gives after eform
>
> Exp(xB) where x=1
> -----------------------------
> Exp(xB) where x=0
>
> evaluated at the mean? If not, what are they?
-eform- gives you exp(b) (as it says at the top of the coefficient
table in the output).
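With a log link, Exp(xB) where x=1 divided by Exp(xB) where x=0 is
exactly exp(b) whatever values the other covariates take, so nothing
is evaluated at the mean. A minimal check for the simplest case of one
dummy and no other covariates, on the auto data (m1 and m0 are
scalars created just for this illustration):

```stata
sysuse auto, clear
glm price foreign, link(log) eform
display exp(_b[foreign])          // the number shown in the eform table

* ratio of the arithmetic means of the two groups
summarize price if foreign == 1, meanonly
scalar m1 = r(mean)
summarize price if foreign == 0, meanonly
scalar m0 = r(mean)
display m1/m0
```

The two displayed numbers should agree up to the convergence tolerance
of -glm-.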
Hope this helps,
Maarten
Fox, John (1991), "Regression Diagnostics", Thousand Oaks: Sage.
Newson, R. (2003), "Stata tip 1: The eform() option of regress". The
Stata Journal, 3(4): 445.
-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands
visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434
+31 20 5986715
http://home.fsw.vu.nl/m.buis/
-----------------------------------------