Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: About taking log on zero values
From: Maarten Buis <[email protected]>
To: [email protected]
Subject: Re: st: About taking log on zero values
Date: Thu, 20 Feb 2014 10:57:14 +0100
One option you could also consider is to treat the value 0 as a
special case that needs its own effect. Whether that makes sense
depends on whether 0 means "literally nothing" or "so small that it
could not be detected". In the former case you would often want to
treat the value 0 as qualitatively different, while in the latter case
adding a small, but not too small, number to the 0 values could be
justified.
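For the "could not be detected" case, a minimal sketch of one such adjustment follows. It assumes that half the smallest observed positive value is a defensible stand-in for the zeros. The factor of one half and the variable name logsales_adj are illustrative assumptions, not part of the original advice. If a detection limit is actually known, that would be a better basis.

```stata
* Smallest strictly positive value of sales (missings are ignored)
summarize sales if sales > 0, meanonly
local minpos = r(min)

* Log of sales, with zeros replaced by half that minimum;
* observations with missing sales stay missing
generate logsales_adj = ln(cond(sales == 0, `minpos'/2, sales)) if sales < .
```

Whatever constant is chosen, it is worth rerunning the analysis with a few alternatives to see how much the results depend on it.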
If you want to treat the value 0 as qualitatively different, then I
would do something like this:
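* indicator for zero sales; missing if sales is missing
gen byte nosales = (sales == 0) if sales < .
* ln(0) is missing, so logsales is missing wherever sales == 0
gen logsales = ln(sales)
* r(min) holds the smallest observed log of sales
sum logsales, meanonly
* fill in the zeros with that minimum
replace logsales = r(min) if nosales == 1
reg y x1 x2 logsales nosales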
In that case the coefficient for logsales can be interpreted as
before, but refers only to sales > 0. The coefficient for nosales
represents the difference in expected value of y between those units
with no sales at all and those units with the smallest non-zero sales.
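The interpretation above can be checked on simulated data. The sketch below is only an illustration: the data-generating numbers, the seed, and the names logsales_a and logsales_b are made up. It fits the model twice, once filling the zeros with the smallest observed log and once filling them with 0. Because nosales is in the model, both versions span the same set of predictors, so the slope on the log variable is identical. Only the nosales coefficient changes, because its reference point changes.

```stata
clear
set seed 12345
set obs 500

* Simulated data (all numbers hypothetical): roughly 20% of units have zero sales
generate sales = cond(runiform() < 0.2, 0, exp(rnormal(5, 1)))
generate y = 1 + 0.5*cond(sales > 0, ln(sales), 0) - 2*(sales == 0) + rnormal()

generate byte nosales = (sales == 0) if sales < .
generate logsales = ln(sales)
summarize logsales, meanonly
local minlog = r(min)

* Version A: zeros filled with the smallest observed log, as in the post
generate logsales_a = cond(nosales == 1, `minlog', logsales)
regress y logsales_a nosales

* Version B: zeros filled with 0 instead; the slope on the log variable
* is unchanged, while the nosales coefficient now compares zero-sales
* units with units whose log sales equal 0 (sales of 1)
generate logsales_b = cond(nosales == 1, 0, logsales)
regress y logsales_b nosales
```

Filling with the minimum is therefore a choice about what the nosales coefficient should mean. It does not affect the estimated effect of log sales among units with positive sales.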
Hope this helps,
Maarten
On Wed, Feb 19, 2014 at 9:11 PM, Nick Cox <[email protected]> wrote:
> Stata would ignore numeric missings in anything like a regression calculation.
>
> That applies also to missings that result from calculating log(0).
>
> Changing values of 0 to values of 1 so that you can take logarithms is
> not something I would call "usual practice". It is, I suspect,
> regarded differently by different people on a spectrum from unethical
> and incorrect to an acceptable fudge, depending partly on the rest of
> the data and what you are doing with them.
>
> An incomplete list of things to think about:
>
> 0. If values of 1 occur otherwise, you have created an inconsistency.
> If values between 0 and 1 occur otherwise, you have created a bigger
> one. Applying log(x + 1) consistently solves this problem only by
> creating another. Applying log(x + 1) and pretending that it is really
> applying log(x) is not widely accepted.
>
> 1. If 0 really means what it says, changing it to 1 is a
> falsification. Whether you can put a spin on it as an acceptable or
> necessary falsification is an open question.
>
> 2. If 0 really means "small but not detected", changing it to e.g.
> half the smallest observable value is sometimes an accepted or
> acceptable modification.
>
> 3. Replacing log(0) with log(1) is not, necessarily, even a small and
> conservative modification. If apart from the values of 0 values range
> from e3 to e6 then after logging you have 0 and otherwise a range of 3
> to 6. You may have _created_ a bundle of outliers that will dominate
> analyses.
>
> 4. Doing something about 0s is only necessary with logarithmic
> transformation. If you have 0s in the response, you can leave them and
> use a logarithmic link. That won't necessarily be a good model, but
> using a logarithmic link doesn't require positive values in the
> response, only that the mean function be always positive. (This
> doesn't apply in your case as the variable in question is a
> predictor.)
>
> 5. There are usually alternatives, such as transformations other than
> logarithms.
>
> 6. I wouldn't do anything without considering some kind of sensitivity
> analysis, i.e. a consideration of how much difference an arbitrary
> treatment of zeros makes.
>
> 7. There is often an argument that implies that the observations with
> zeros don't belong anyway.
>
> (I have generalised your question, but suspect that zero values for
> sales usually mean exactly what they say.)
>
> Nick
> [email protected]
>
> On 19 February 2014 19:44, Sebastian Say
> <[email protected]> wrote [edited]
>
>> My question is about how Stata treats a log-transformed variable
>> that draws upon an original variable that contains zero.
>>
>> In my dataset, I have firm sales data but some of them have values of zero. I
>> created a logsales variable and noticed that those with zeros are
>> indicated as a "."
>>
>> I plan to run a regression, e.g.
>>
>> reg y x1 x2 logsales
>>
>> My question is, how would Stata treat these "." if I do not remove them?
>>
>> Technically the "." should be undefined.
>>
>> I've read some papers and they usually put a 1 for those sales data
>> with zeros in them. Is this a usual practice?
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
--
---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany
http://www.maartenbuis.nl
---------------------------------