Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: About taking log on zero values
From
Austin Nichols <[email protected]>
To
"[email protected]" <[email protected]>
Subject
Re: st: About taking log on zero values
Date
Thu, 20 Feb 2014 07:16:50 -0500
Whether sales=0 means
"literally nothing" or "so small that it could not be detected"
you can't do any of the things suggested without introducing bias.
In the former case, you must run separate models for cases with and
without sales (or fully interact by a dummy variable nosales) while in
the latter case you must multiply impute sales using a sensible model,
not simply add a constant.
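
For the "literally nothing" case, the separate-models route might look like the sketch below. It runs on simulated data, so every number in the data-generating lines is an assumption made purely for illustration; only the variable names y, x1, x2, sales, logsales and nosales come from this thread. It does not cover the fully interacted variant or the multiple-imputation case.

```stata
* Toy data: the data-generating values below are illustrative assumptions
clear
set seed 20140220
set obs 400
gen double x1 = rnormal()
gen double x2 = rnormal()
gen double sales = cond(runiform() < 0.2, 0, exp(rnormal(5, 1)))
gen double y = 1 + 0.5*x1 - 0.3*x2 + cond(sales > 0, 0.2*ln(sales), -1) + rnormal()

gen byte nosales = (sales == 0) if sales < .
gen double logsales = ln(sales)     // missing wherever sales == 0

* Model for units with positive sales
regress y x1 x2 logsales if nosales == 0
scalar n_pos = e(N)

* Separate model for units with zero sales; logsales is undefined
* for these units, so it is left out
regress y x1 x2 if nosales == 1
scalar n_zero = e(N)

display "positive-sales obs: " n_pos "   zero-sales obs: " n_zero
```

Each regression then has its own intercept and its own coefficients on x1 and x2, which is what distinguishes this from adding a single nosales dummy to one pooled model.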
On Thu, Feb 20, 2014 at 4:57 AM, Maarten Buis <[email protected]> wrote:
> One option you could also consider is that you treat the value 0 as
> special which needs its own effect. This depends on whether 0 means
> "literally nothing" or "so small that it could not be detected". In the
> former case you would often want to treat the value 0 as qualitatively
> different, while in the latter case adding a small but not too small
> number to the 0 values could be justified.
>
> If you want to treat the value 0 as qualitatively different, then I
> would do something like this:
>
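> * flag zero sales (left missing when sales is missing)
> gen byte nosales = (sales == 0) if sales < .
> * ln(0) is missing, so logsales is missing for the zero-sales units
> gen logsales = ln(sales)
> sum logsales, meanonly
> * set zero-sales units to the smallest observed log value
> replace logsales = r(min) if nosales == 1
> reg y x1 x2 logsales nosales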
>
> In that case the coefficient for logsales can be interpreted as
> before, but refers only to sales > 0. The coefficient for nosales
> represents the difference in expected value of y between those units
> with no sales at all and those units with the smallest non-zero sales.
>
> Hope this helps,
> Maarten
>
>
> On Wed, Feb 19, 2014 at 9:11 PM, Nick Cox <[email protected]> wrote:
>> Stata would ignore numeric missings in anything like a regression calculation.
>>
>> That applies also to missings that result from calculating log(0).
>>
>> Changing values of 0 to values of 1 so that you can take logarithms is
>> not something I would call "usual practice". It is, I suspect,
>> regarded differently by different people on a spectrum from unethical
>> and incorrect to an acceptable fudge, depending partly on the rest of
>> the data and what you are doing with them.
>>
>> An incomplete list of things to think about:
>>
>> 0. If values of 1 occur otherwise, you have created an inconsistency.
>> If values between 0 and 1 occur otherwise, you have created a bigger
>> one. Applying log(x + 1) consistently solves this problem only by
>> creating another. Applying log(x + 1) and pretending that it is really
>> applying log(x) is not widely accepted.
>>
>> 1. If 0 really means what it says, changing it to 1 is a
>> falsification. Whether you can put a spin on it as an acceptable or
>> necessary falsification is an open question.
>>
>> 2. If 0 really means "small but not detected", changing it to e.g.
>> half the smallest observable value is sometimes an accepted or
>> acceptable modification.
>>
>> 3. Replacing log(0) with log(1) is not necessarily even a small and
>> conservative modification. If, apart from the values of 0, values range
>> from e^3 to e^6, then after logging you have 0 and otherwise a range of 3
>> to 6. You may have _created_ a bundle of outliers that will dominate
>> analyses.
>>
>> 4. Doing something about 0s is only necessary with logarithmic
>> transformation. If you have 0s in the response, you can leave them and
>> use a logarithmic link. That won't necessarily be a good model, but
>> using a logarithmic link doesn't require positive values in the
>> response, only that the mean function be always positive. (This
>> doesn't apply in your case as the variable in question is a
>> predictor.)
>>
>> 5. There are usually alternatives, such as transformations other than
>> logarithms.
>>
>> 6. I wouldn't do anything without considering some kind of sensitivity
>> analysis, i.e. a consideration of how much difference an arbitrary
>> treatment of zeros makes.
>>
>> 7. There is often an argument that implies that the observations with
>> zeros don't belong anyway.
>>
>> (I have generalised your question, but suspect that zero values for
>> sales usually mean exactly what they say.)
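
The remark about missings and the sensitivity analysis in point 6 can be tried out on simulated data. What follows is a minimal sketch, not a recommended procedure: the data are made up, and the names log1p_sales and logsales_min as well as all parameter values are assumptions for illustration. It shows that ln(0) produces missing values that regress drops, and it compares the coefficient on log sales under three treatments of the zeros.

```stata
* Toy data: all parameter values below are illustrative assumptions
clear
set seed 2014
set obs 500
gen double x1 = rnormal()
gen double x2 = rnormal()
gen double sales = cond(runiform() < 0.15, 0, exp(rnormal(4.5, 0.75)))
gen double y = 2 + 0.5*x1 - 0.25*x2 + cond(sales > 0, 0.3*ln(sales), 0) + rnormal()

* (a) Plain log: ln(0) is missing, so regress drops those observations
gen double logsales = ln(sales)
count if missing(logsales)
scalar n_zero = r(N)
regress y x1 x2 logsales
scalar n_used = e(N)
scalar b_drop = _b[logsales]

* (b) log(sales + 1), applied to every observation
gen double log1p_sales = ln(sales + 1)
regress y x1 x2 log1p_sales
scalar b_plus1 = _b[log1p_sales]

* (c) Indicator for zero sales, zeros set to the smallest observed log
gen byte nosales = (sales == 0) if sales < .
summarize logsales, meanonly
gen double logsales_min = cond(nosales == 1, r(min), logsales)
regress y x1 x2 logsales_min nosales
scalar b_dummy = _b[logsales_min]

display "obs used in (a): " n_used "   zeros dropped: " n_zero
display "log sales coefficient: drop = " b_drop ", +1 = " b_plus1 ", dummy = " b_dummy
```

How far the three coefficients differ depends on the share of zeros and on how far the smallest positive values lie from 1, so the comparison needs to be rerun on the real data rather than read off this toy example.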
>>
>> Nick
>> [email protected]
>>
>> On 19 February 2014 19:44, Sebastian Say
>> <[email protected]> wrote [edited]
>>
>>> My question is about how Stata treats a log-transformed variable
>>> that draws upon an original variable that contains zero.
>>>
>>> In my dataset, I have firm sales data but some of them have values of zero. I
>>> created a logsales variable and noticed that those with zeros are
>>> indicated as a "."
>>>
>>> I plan to run a regression, e.g.
>>>
>>> reg y x1 x2 logsales
>>>
>>> My question is, how would Stata treat these "." if I do not remove them?
>>>
>>> Technically the "." should be undefined.
>>>
>>> I've read some papers and they usually put a 1 for those sales data
>>> with zeros in them. Is this a usual practice?
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> ---------------------------------
> Maarten L. Buis
> WZB
> Reichpietschufer 50
> 10785 Berlin
> Germany
>
> http://www.maartenbuis.nl
> ---------------------------------