RE: st: Shapiro Wilk: data interpretation

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	RE: st: Shapiro Wilk: data interpretation
Date	Wed, 28 Feb 2007 23:51:14 -0000
I'd echo Svend's advice. You must look at the data too. 

A salutary example is near to hand. Note that the sample sizes 
here are similar to Vanessa's, so a key issue of what happens 
with different sample sizes is set on one side. 

. sysuse auto, clear 

. swilk price-foreign

                   Shapiro-Wilk W test for normal data
    Variable |    Obs        W          V          z     Prob>z
-------------+-------------------------------------------------
       price |     74    0.76696     15.008      5.909  0.00000
         mpg |     74    0.94821      3.335      2.627  0.00430
       rep78 |     69    0.98191      1.100      0.208  0.41760
    headroom |     74    0.98104      1.221      0.436  0.33137
       trunk |     74    0.97921      1.339      0.637  0.26215
      weight |     74    0.96110      2.505      2.003  0.02258
      length |     74    0.97165      1.825      1.313  0.09461
        turn |     74    0.97113      1.859      1.353  0.08803
displacement |     74    0.92542      4.803      3.423  0.00031
  gear_ratio |     74    0.95814      2.696      2.163  0.01525
     foreign |     74    0.96928      1.978      1.488  0.06838

Let's sort that on W, lowest first, so the structure is easier to see.

       price |     74    0.76696     15.008      5.909  0.00000
displacement |     74    0.92542      4.803      3.423  0.00031
         mpg |     74    0.94821      3.335      2.627  0.00430
  gear_ratio |     74    0.95814      2.696      2.163  0.01525
      weight |     74    0.96110      2.505      2.003  0.02258
     foreign |     74    0.96928      1.978      1.488  0.06838
        turn |     74    0.97113      1.859      1.353  0.08803
      length |     74    0.97165      1.825      1.313  0.09461
       trunk |     74    0.97921      1.339      0.637  0.26215
    headroom |     74    0.98104      1.221      0.436  0.33137
       rep78 |     69    0.98191      1.100      0.208  0.41760
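
For anyone wanting to reproduce that sorted listing rather than rearrange output by hand, here is one possible sketch. It assumes that -swilk- leaves r(N), r(W), r(V), r(z) and r(p) behind for the variable just tested, and it uses -postfile- to collect them. The names memhold, results and varname are mine, chosen for illustration only.

```stata
sysuse auto, clear

* temporary handle and file to collect one row of results per variable
tempname memhold
tempfile results
postfile `memhold' str32 varname float(N W V z p) using `results'

* run the test one variable at a time and save what it returns
foreach v of varlist price-foreign {
    quietly swilk `v'
    post `memhold' ("`v'") (r(N)) (r(W)) (r(V)) (r(z)) (r(p))
}
postclose `memhold'

* load the collected results and list them, lowest W first
use `results', clear
sort W
list, noobs
```

Run this as a do-file in one go, as the temporary names disappear once it finishes.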

Stepping back, what is non-normality and why should we care 
about it? (For normal, read "Gaussian" or "central" if you prefer.
The second was suggested by the physicist Edwin Jaynes.) 

Crudely, non-normality could include overall skewness, overall
tail weight differing from normal, granularity, individual 
outliers, and whatever else I've forgotten. Shapiro-Wilk collapses
all that onto one dimension by quantifying the straightness of
a normal probability plot. But, crucially, you lose much information
by any such numerical reduction. 
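
To see what is being collapsed, plot it. A minimal sketch, assuming the auto data as above; the graph names are arbitrary labels of mine. A straight normal quantile plot corresponds to W near 1, but the plot also shows where and how any departure arises, which W alone cannot.

```stata
sysuse auto, clear

* normal quantile plots: the closer to a straight line, the closer to normal
qnorm price, name(q_price, replace)
qnorm gear_ratio, name(q_gear, replace)

* histograms with a normal density superimposed for comparison
histogram price, normal name(h_price, replace)
histogram gear_ratio, normal name(h_gear, replace)
```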

How far is any column here an indicator of non-normality that 
you might care about (or normality that you might desire)? 

For example, -rep78- is at one extreme of the ranking, but -rep78- is an 
ordered categorical variable and in one sense is possibly not
even appropriate for the test. It looks good because it happens to be 
unimodal, fairly symmetric and free of outliers. Even -foreign- passes muster, 
if you use P < 0.05 as a cutoff, even though it's a binary variable. 
But why is -foreign- assessed as more nearly normal than 
-gear_ratio-? It's, I guess, because it waggles less in the tails
than -gear_ratio-. Yet I really can't imagine -gear_ratio- causing
any problems as either response or predictor, even if there were
some assumption of normality anywhere. On the other hand, -foreign- 
really should not be analysed as if it were normal! 
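
A quick check of what these two variables actually are may help here. This is only a sketch with the same data; nothing in it depends on the test results.

```stata
sysuse auto, clear

* -foreign- takes just two values; -rep78- takes a handful of ordered values
tabulate foreign
tabulate rep78

* the granularity is obvious in graphs, whatever the test reports
histogram rep78, discrete frequency name(h_rep78, replace)
qnorm foreign, name(q_foreign, replace)
```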

Naturally, some of the results here make perfect sense. On -swilk-
(and for that matter on moment- and L-moment-based shape measures)
-price- sticks out as distinctly skew and fat-tailed and probably 
best analysed on (say) a logarithmic scale. 
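
One way to look at that, as a sketch only: log_price is a variable name I have made up, and the natural logarithm is just one candidate scale.

```stata
sysuse auto, clear

* moment-based skewness and kurtosis on the original scale
summarize price, detail

* the same on a logarithmic scale
generate log_price = ln(price)
summarize log_price, detail

* compare the two scales, numerically and graphically
swilk price log_price
qnorm log_price, name(q_logprice, replace)
```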

But the total picture is this. You can boost Shapiro-Wilk 
as much as you like as an omnibus or portmanteau statistic, but
you can't guarantee that it will match what is acceptable to 
you or unacceptable to you. Practically, it can send a very 
misleading message. 

(I haven't touched on another issue. Tests for marginal normality
are often not directly relevant for how a predictor or response behaves
within some larger model.) 
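
To make that parenthesis concrete, here is a sketch that looks at residuals from a model rather than at the marginal distribution of the response. The choice of predictors is arbitrary, purely for illustration, and res is a variable name of mine.

```stata
sysuse auto, clear

* an arbitrary illustrative model, not a recommendation
regress price weight mpg

* look at the residuals, not at -price- on its own
predict res, residuals
qnorm res, name(q_res, replace)
rvfplot, yline(0) name(rvf, replace)
```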

Nick 
[email protected] 

Svend Juul wrote:
 
> Vanessa Mahlperg wrote:
> 
> I've got a question concerning the interpretation of the Shapiro-Wilk
> test results.
> I don't know the correct meaning of V, z and Prob>z in German. Could
> anybody tell me how to identify the normal distribution in the following
> case:
> 
> swilk c_ws6m c_ws2j c_stelle if zugeh==2
> Shapiro-Wilk W test for normal data
> 
> Variable  Obs        W      V       z   Prob>z
> c_ws6m     87  0.88729  8.290   4.656  0.00000
> c_ws2j     87  0.99142  0.631  -1.015  0.84484
> c_stelle   87  0.98980  0.750  -0.632  0.73638
> 
> ------------------------------------------------------------------------
> 
> W and V are specific to the Shapiro-Wilk test; if you need to know more
> (I don't), Google will point to explanations.
> 
> z is the z-statistic; you will find it in any statistical textbook and
> in the output from numerous commands. Essentially it is an estimate
> divided by its standard error.
> 
> Prob>z is one of Stata's strange shorthand habits; you will find it in
> the output from numerous commands. It does NOT mean that a probability
> is larger than z, but (in the line for c_ws6m) that - if the
> null-hypothesis of normality is true - the probability of z being 4.656
> or more extreme, is < 0.00001.
> 
> The above results would lead most of us to reject the null-hypothesis of
> normality for c_ws6m, and accept it for c_ws2j and c_stelle. But some of
> us prefer to assess normality visually rather than by statistical
> testing; -histogram- and -qnorm- are useful commands for that.
