Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Wrong results for Wilcoxon signed ranks test when data have decimal places (even using double)
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Wrong results for Wilcoxon signed ranks test when data have decimal places (even using double)
Date
Thu, 14 Feb 2013 18:35:29 +0000
On the contrary, I think your argument was mostly very clear; the only
detail that was unclear to me was what rounding you were suggesting
but the later post made that explicit.
It's up to StataCorp to respond. I agree with you that you have
exposed a problem. My own suggestion is that the problem should be
documented via an FAQ, but it's manifestly not my decision.
StataCorp certainly place a high priority on reproducing textbook
examples and results from other software, and on working out why
results differ when they do.
Nick
On Thu, Feb 14, 2013 at 4:58 PM, Marta García-Granero
<[email protected]> wrote:
> Since English is not my native tongue, I did not express myself very well.
> When I talked about rounding, I did not talk about rounding the original
> data, but applying the round() function to the absolute differences before
> ranking them.
>
> Some time ago, while preparing some slides with Excel (just for classes, I
> NEVER use Excel for serious research), I found the same problem: some
> differences that should have been the same were in fact different (below the
> 15th decimal place) and got a wrong rank assigned. I discovered that ranking
> "round(absdiff,1e-15)" eliminated the problem, since the data were compared
> only up to the 15th decimal place and declared equal or different correctly.
> In another message, sent shortly before this one, I suggested that
> applying the same method in signrank.ado fixes the problem with wrong
> ranking (I tested it myself before posting).
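
A minimal sketch of the rounding idea described above, written for Stata rather than Excel. It uses three values taken from the dataset in the original post and the hypothesized median 0.6; the variable names rank_raw and rank_rounded are illustrative only, and this is not a reproduction of the actual change proposed for signrank.ado.

```stata
* Sketch: rank absolute differences with and without rounding to 1e-15.
* 0.55 and 0.65 are both 0.05 away from 0.6 in decimal arithmetic,
* so they should share the tied rank 1.5.
clear
input double copper
0.55
0.65
0.70
end

generate double absdiff = abs(copper - 0.6)

* Ranks of the raw differences: the tie is missed
egen double rank_raw = rank(absdiff)

* Ranks after rounding the differences to the 15th decimal place
egen double rank_rounded = rank(round(absdiff, 1e-15))

list copper absdiff rank_raw rank_rounded

* Observations 1 and 2 get different raw ranks but equal rounded ranks
assert rank_raw[1] != rank_raw[2]
assert rank_rounded[1] == rank_rounded[2]
```

The choice of 1e-15 follows the post; whether that tolerance is appropriate for data on other scales is an open question that the sketch does not settle.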
>
> Concerning SPSS, since their code is compiled and hidden, and more protected
> than Coke's formula, I can only guess from the Acrobat documentation and my
> hand calculations that they have somehow circumvented the problem of those
> nasty little differences below the 15th decimal place.
>
> Maybe I was a bit too bold (having used Stata for just 2 months) in
> suggesting the modification of signrank.ado, but I am checking it with
> different datasets (from statistics books), and the results obtained with
> Stata, SPSS, and the ones shown in those books agree.
>
> Regards,
> MGG
>
> El 14/02/2013 17:32, Nick Cox escribió:
>>
>> Surprising though it may seem in the face of this carefully presented
>> evidence, I wouldn't call this a bug, at least not one that is
>> fixable.
>>
>> It's an anomaly and it's awkward, but it's not a bug.
>>
>> First off, a look at the code for -signrank- suggests that Stata uses
>> -double- precision where possible, and that's as far as ado code goes.
>>
>> It's an anomaly and it's awkward, but if it were a bug there would be
>> a solution and Marta's suggestion that there be "some rounding",
>> whatever that means precisely, does not sound like a good solution,
>> because how is StataCorp supposed to justify what rounding it does,
>> and how does that fit in with anybody else's idea of what the correct
>> procedure is, exactly and reproducibly? For example, which
>> authoritative accounts say you should apply some rounding first to get
>> reproducible results?
>>
>> Also, Marta has a solid argument that when you have a rank procedure,
>> and data all presented to 2 decimal places, you should get exactly
>> the same result when the data are multiplied by 100 and become
>> integers. That's totally sound logic: the results of ranking are
>> invariant under multiplication of the originals by a positive
>> constant. But that's not the only consideration. The other
>> consideration is that people reasonably expect this test to be
>> applicable to non-integer data and so Stata's code has to work within
>> the constraints that implies.
>>
>> The underlying fact, often rehearsed on this list, is that Stata does
>> not do, and does not claim to do, exact decimal arithmetic unless
>> there is an exact binary equivalent of that decimal calculation. So
>> the heart of the matter is that Stata will very occasionally give what
>> look like wrong answers to decimal problems, as in the case of
>>
>> . di %21x 0.70 - 0.65
>> +1.9999999999990X-005
>>
>> . di %21x 0.65 - 0.6
>> +1.99999999999a0X-005
>>
>> Every smart child knows that the answers to these problems should be
>> the same, but they aren't when mapped to the nearest equivalent
>> problems in binary.
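
A minimal sketch of the point above, using only the two subtractions already shown. The wider display formats are an illustrative choice, and the final `assert` relies on the hexadecimal output quoted above, which shows that the two stored results differ in their last digits.

```stata
* Sketch: both differences are 0.05 in decimal arithmetic, but their
* nearest binary (double) equivalents are not the same number.
display %21x 0.70 - 0.65
display %21x 0.65 - 0.6

* The default display format hides the discrepancy ...
display 0.70 - 0.65
display 0.65 - 0.6

* ... while a format with 18 decimal places exposes it
display %20.18f 0.70 - 0.65
display %20.18f 0.65 - 0.6

* The stored values differ, so a tie between them goes undetected
assert (0.70 - 0.65) != (0.65 - 0.6)
```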
>>
>> I can't comment on exactly what SPSS does; that's clearly pertinent too.
>>
>> Nick
>>
>> On Thu, Feb 14, 2013 at 4:02 PM, Marta García-Granero
>> <[email protected]> wrote:
>>>
>>> Apologies for sending this twice, but yesterday I tried to piggyback onto
>>> another thread ("Rounding Errors Stata 12"), closely related to this
>>> question, and I think my question got lost. Besides, I'm going to
>>> explain the problem a bit more (and better).
>>>
>>> I'm converting some class notes (basic statistics) from SPSS to Stata, and I
>>> have found that the way Stata handles the ranking of tied data in the
>>> Wilcoxon test can sometimes be wrong when data have decimal places, even
>>> using -double- everywhere.
>>>
>>> The sample dataset comes from the on-line e-book Statistics at Square One
>>> (exercise at the end of chapter 1). I am using Stata 12.1 64 bits (last
>>> update installed) on W7, but I found the same problem with Stata 12.1 32
>>> bits on Windows XP. The results I get using Stata don't match the ones I
>>> got either with my hand calculations or with SPSS.
>>>
>>> set type double
>>> input copper
>>> 0.70
>>> 0.45
>>> 0.72
>>> 0.30
>>> 1.16
>>> 0.69
>>> 0.83
>>> 0.74
>>> 1.24
>>> 0.77
>>> 0.65
>>> 0.76
>>> 0.42
>>> 0.94
>>> 0.36
>>> 0.98
>>> 0.64
>>> 0.90
>>> 0.63
>>> 0.55
>>> 0.78
>>> 0.10
>>> 0.52
>>> 0.42
>>> 0.58
>>> 0.62
>>> 1.12
>>> 0.86
>>> 0.74
>>> 1.04
>>> 0.65
>>> 0.66
>>> 0.81
>>> 0.48
>>> 0.85
>>> 0.75
>>> 0.73
>>> 0.50
>>> 0.34
>>> 0.88
>>> end
>>>
>>> * One sample Wilcoxon's test (against population median = 0.6)
>>>
>>> signrank copper = 0.6
>>>
>>> * Multiply data by 100 to get rid of decimal places and run the test again
>>> * (pop. median = 60); this time all the output (positive & negative sum of
>>> * ranks, Z stat & p value) is correct
>>>
>>> generate copper100 = round(copper*100)
>>> signrank copper100 = 60
>>>
>>> * Generating the ranks for absolute differences between copper & pop median
>>> * for both variables (copper & copper100)
>>> * Ranks should have been the same in both cases, but they are not
>>> * Notice the difference for cases 5/6/7, 18/19, 22/23/24, 29/30, 32/33
>>> * "ranks2" is correct (recognizes all tied data), and leads to the right
>>> * Wilcoxon's p-value
>>>
>>> egen double ranks1 = rank(abs(copper-0.6))
>>> egen double ranks2 = rank(abs(copper100-60))
>>> generate absdiff = abs(copper-0.6)
>>> sort absdiff
>>> list absdiff ranks1 ranks2
>>>
>>> I would label that as a Stata bug. Tied absolute differences are not
>>> recognized as such because there is a difference at the 15th decimal place.
>>> Maybe some rounding should be performed before assigning ranks.
>>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
>>
>