Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: About taking log on zero values
From: Jeph Herrin <[email protected]>
To: [email protected]
Subject: Re: st: About taking log on zero values
Date: Wed, 19 Feb 2014 15:42:49 -0500
To Nick's typically excellent advice I'll add two comments:
> 4. Doing something about 0s is only necessary with logarithmic
> transformation. If you have 0s in the response, you can leave them and
> use a logarithmic link. That won't necessarily be a good model, but
> using a logarithmic link doesn't require positive values in the
> response, only that the mean function be always positive. (This
> doesn't apply in your case as the variable in question is a
> predictor.)
This is reasonable if you think the 0s are simply the bottom end of the
continuum, but if there are many 0s, you may be looking at a
zero-inflated Poisson (ZIP) or hurdle model - that is, a dependent variable
where one process generates the 0s and another the positive values.
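A minimal sketch of the two options just contrasted, on simulated data. The variable names, the 30% share of structural zeros, and the coefficients are all made up for illustration; nothing here comes from the original poster's data.

```stata
* Hypothetical count response with extra zeros
clear
set obs 500
set seed 2014
generate x1 = rnormal()
generate x2 = rnormal()
generate byte always0 = runiform() < 0.3       // assumed 30% structural zeros
generate y = cond(always0, 0, rpoisson(exp(0.5 + 0.3*x1)))

* Log link: zeros in y are allowed; only the fitted mean must be positive
glm y x1 x2, family(poisson) link(log) vce(robust)

* Zero-inflated Poisson: a separate (here constant-only) process for the 0s
zip y x1 x2, inflate(_cons)
```

Which of the two is more sensible depends on whether the 0s plausibly come from a separate process, which is a substantive question rather than a software one.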
> 5. There are usually alternatives, such as transformations other than
> logarithms.
In particular, if this is a predictor, a very good transformation is
that provided by categorization. You will lose some power, but also make
fewer assumptions. So, for instance, instead of x or log(x), use e.g.
gen newx = irecode(x,10,100,1000,.)
to get logarithm categories.
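As a small illustration of what that line produces, here is a sketch on a handful of made-up values (the data are hypothetical; only the irecode() call is from the message above):

```stata
* Toy values spanning several orders of magnitude, including 0
clear
input x
0
5
10
50
100
500
1000
5000
end

generate newx = irecode(x, 10, 100, 1000, .)
list x newx, noobs
* newx is 0 for x <= 10, 1 for 10 < x <= 100,
* 2 for 100 < x <= 1000, and 3 above 1000
```

Note that x = 0 simply falls into the lowest category, so no special treatment of zeros is needed. In a model the categories would typically enter as a factor variable, e.g. i.newx.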
cheers,
Jeph
On 2/19/2014 3:11 PM, Nick Cox wrote:
Stata would ignore numeric missings in anything like a regression calculation.
That applies also to missings that result from calculating log(0).
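This can be checked directly. The sketch below uses simulated data with the variable names from the original question; the data-generating values are assumptions for illustration only.

```stata
* Made-up data: 100 firms, 10 of them with zero sales
clear
set obs 100
set seed 12345
generate x1 = rnormal()
generate x2 = rnormal()
generate sales = cond(_n <= 10, 0, exp(rnormal(5, 1)))
generate y = 1 + x1 + x2 + rnormal()

generate logsales = log(sales)     // log(0) is returned as missing (.)
count if missing(logsales)         // the 10 zero-sales firms
regress y x1 x2 logsales
display e(N)                       // 90: observations with missing logsales are omitted
```

So the regression runs without complaint, but on a smaller sample than the dataset holds.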
Changing values of 0 to values of 1 so that you can take logarithms is
not something I would call "usual practice". It is, I suspect,
regarded differently by different people on a spectrum from unethical
and incorrect to an acceptable fudge, depending partly on the rest of
the data and what you are doing with them.
An incomplete list of things to think about:
0. If values of 1 occur otherwise, you have created an inconsistency.
If values between 0 and 1 occur otherwise, you have created a bigger
one. Applying log(x + 1) consistently solves this problem only by
creating another. Applying log(x + 1) and pretending that it is really
applying log(x) is not widely accepted.
1. If 0 really means what it says, changing it to 1 is a
falsification. Whether you can put a spin on it as an acceptable or
necessary falsification is an open question.
2. If 0 really means "small but not detected", changing it to e.g.
half the smallest observable value is sometimes an accepted or acceptable
modification.
3. Replacing log(0) with log(1) is not, necessarily, even a small and
conservative modification. If, apart from the values of 0, values range
from e^3 to e^6, then after logging you have 0 and otherwise a range of 3
to 6. You may have _created_ a bundle of outliers that will dominate
analyses.
4. Doing something about 0s is only necessary with logarithmic
transformation. If you have 0s in the response, you can leave them and
use a logarithmic link. That won't necessarily be a good model, but
using a logarithmic link doesn't require positive values in the
response, only that the mean function be always positive. (This
doesn't apply in your case as the variable in question is a
predictor.)
5. There are usually alternatives, such as transformations other than
logarithms.
6. I wouldn't do anything without considering some kind of sensitivity
analysis, i.e. a consideration of how much difference an arbitrary
treatment of zeros makes.
7. There is often an argument that implies that the observations with
zeros don't belong anyway.
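A rough sketch of the kind of sensitivity analysis meant in point 6, comparing several treatments of zeros mentioned in this list. Everything here is simulated and hypothetical: the variable names, the share of zeros, and the choice of the cube root as the non-logarithmic alternative are assumptions for illustration.

```stata
* Made-up data: 200 firms, 20 of them with zero sales
clear
set obs 200
set seed 42
generate x1 = rnormal()
generate sales = cond(_n <= 20, 0, exp(rnormal(5, 1)))
generate y = 1 + 0.5*x1 + 0.2*log(sales + 1) + rnormal()

generate log_drop  = log(sales)                 // zeros become missing and drop out
generate log_plus1 = log(sales + 1)             // zeros map to 0, far below the rest
summarize sales if sales > 0, meanonly
local half = r(min)/2                           // half the smallest positive value
generate log_half  = log(cond(sales == 0, `half', sales))
generate root3     = sales^(1/3)                // cube root, defined at 0

* Same model under each treatment: compare sample size and coefficient
foreach v of varlist log_drop log_plus1 log_half root3 {
    quietly regress y x1 `v'
    display "`v'" _col(14) "N = " e(N) _col(26) "b = " %7.3f _b[`v']
}
```

If the coefficient of interest moves a lot across these rows, the conclusions are being driven by an arbitrary choice about the zeros.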
(I have generalised your question, but suspect that zero values for
sales usually mean exactly what they say.)
Nick
[email protected]
On 19 February 2014 19:44, Sebastian Say
<[email protected]> wrote [edited]
My question is about how Stata treats a log-transformed variable
that draws upon an original variable that contains zero.
In my dataset, I have firm sales data but some of them have values of zero. I
created a logsales variable and noticed that those with zeros are
indicated as a "."
I plan to run a regression, e.g.
reg y x1 x2 logsales
My question is, how would Stata treat these "." if I do not remove them?
Technically the "." should be undefined.
I've read some papers and they usually put a 1 for those sales data
with zeros in them. Is this a usual practice?
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/