Just to add to Maarten's sage advice that how, or indeed whether, to
take logarithms of zero in some roundabout way is a frequent question on
this list. See for example a thread last month:
<http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0907/Author/article-678.html>
Nick
Maarten Buis
--- On Tue, 11/8/09, Fardad Zand wrote:
> In my econometric specification, I'm using the total number
> of employees (EMP) and the share of highly educated
> employees (EDU) as two important explanatory variables of
> my analysis. These are coming directly from two separate
> survey questions asking about the number of employees and
> the share of employees with a university degree. I'm now
> encountering the following three problems:
>
> 1- Adding EMP and EDU in the model may lead to some sort of
> systematic negative correlation between the two variables
> as EDU in essence equals: # of highly educated employees
> /EMP.
The fact that explanatory variables are correlated is not a
problem, except when the correlation becomes perfect, in
which case we can't distinguish between the variables and we
then obviously can't compute separate effects for each of
these variables. In fact this correlation between explanatory
variables is the very reason why we do a regression with
multiple explanatory variables: it is this correlation that
makes a variable a confounding variable which needs to be
controlled for.
> However, EDU in my survey is not calculated but directly
> asked. Thus, that will reduce the problem compared to a
> situation of artificial correlation by construction. Yet,
> do you still find it problematic to add EMP and EDU at
> the same time?
The reduction in the correlation is due to extra measurement
error, which is not a solution but a problem. However, the
trick to get research done is to worry about one problem at
a time, so I recommend you forget this problem (for now).
> 2- As a solution, I can rely on logarithmic transformation
> and add ln_EMP and ln_EDU into regression; this way, the
> inherent correlation manifests itself in the corresponding
> estimated coefficients of these two variables. The log
> transformation is indeed a good solution for another
> reason as well. These two variables are highly skewed and
> log can reduce the effect of outliers (and I can see that
> by obtaining totally different results when I use log).
> However, the main problem is the very high number of zeros
> for EDU variable; this way, taking log will drop these
> observations out of the analysis, which will bias the
> sample and results (about 25% of the sample is discarded
> this way). A solution is to impute zero values of EDU with
> a very small number, say 1 e-06. Is this a scientifically
> valid approach? What are the alternatives?
If you are worried about outliers then adding such a small
number is an absolutely horrible approach as now you are
adding an outlier to the left side of your distribution. The
reason for transforming your explanatory variable should be
because the effect of that variable is non-linear, so your
first port of call should be a scatter plot of EMP on the
x-axis and your dependent variable on the y-axis and another
scatter plot of EDU on the x-axis and your dependent variable
on the y-axis. Then just look at what kind of functional form
of the relationship would make sense for these variables.
Usually a log transform of size makes sense, but I am not so
sure if the same thing is true for a proportion variable.
However, that is an empirical question, so just take a look.
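
A rough sketch of these checks, assuming the dependent variable is
stored under the placeholder name y (EMP and EDU are the names from
the question; y is not):

```stata
* y is a placeholder name for the dependent variable (an assumption)

* ln(1e-06) is about -13.8, so zeros replaced by 1e-06 end up far to
* the left of the log of any share that is actually observed
display ln(1e-06)

* scatter plots with a lowess smooth to suggest a functional form
twoway (scatter y EMP) (lowess y EMP)
twoway (scatter y EDU) (lowess y EDU)

* the same plot for size, with the x-axis drawn on a log scale
scatter y EMP, xscale(log)
```

If the smooth for EMP looks roughly linear only on the log scale, that
supports logging EMP; the plot for EDU may well suggest something else.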
One thing you could look at is whether the zero proportions
are qualitatively different from the rest, which you could
represent with a linear effect of EDU combined with a dummy
for EDU==0. You can always represent the relationship in a
flexible non-linear way using for example restricted cubic
splines (see: -help mkspline- and
http://ideas.repec.org/p/boc/dsug09/04.html) or fractional
polynomials (see: -help fracpoly-).
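
These two options might be sketched as follows. The names y, EDU_zero
and EDU_rcs are placeholders introduced here for illustration, and the
choice of 4 knots is only an assumption, not a recommendation:

```stata
* y is a placeholder name for the dependent variable (an assumption)

* create ln_EMP; capture skips this line if ln_EMP already exists
capture generate ln_EMP = ln(EMP)

* indicator for firms reporting no highly educated employees
generate byte EDU_zero = (EDU == 0) if !missing(EDU)

* linear effect of EDU plus a separate shift for EDU == 0
regress y ln_EMP EDU EDU_zero

* alternative: restricted cubic spline in EDU with 4 knots,
* which creates EDU_rcs1, EDU_rcs2 and EDU_rcs3
mkspline EDU_rcs = EDU, cubic nknots(4)
regress y ln_EMP EDU_rcs*
```

With about a quarter of the observations at zero, the default knot
locations may need checking (see the knots() option of -mkspline-).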
> 3- Finally, we might come to the conclusion that EDU is
> better to be converted manually to the # of highly
> educated employees (by multiplying EDU by EMP) and then
> apply log on EMP only (to avoid the problem of many
> zeros for EDU) and then add ln_EMP and # of highly
> educated employees into the regression. Scientifically
> speaking, is it wise to add EMP in log form but #
> high-educated employees in absolute level?
There is no reason why you cannot do that. The choice
between functional forms should depend on how the
explanatory variable influences the explained variable, as
was discussed above.
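
A minimal sketch of that specification, assuming EDU is recorded as a
proportion between 0 and 1 (divide by 100 first if it is a percentage)
and using the placeholder names y and EDU_count:

```stata
* y is a placeholder name for the dependent variable (an assumption)

* create ln_EMP; capture skips this line if ln_EMP already exists
capture generate ln_EMP = ln(EMP)

* number of highly educated employees, assuming EDU is a proportion
generate EDU_count = EDU * EMP

* size in logs, number of highly educated employees in levels
regress y ln_EMP EDU_count
```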