I'd echo Svend's advice. You must look at the data too.
A salutary example is near to hand. Note that the sample sizes
here are similar to Vanessa's, so a key issue of what happens
with different sample sizes is set on one side.
. sysuse auto, clear
. swilk price-foreign
Shapiro-Wilk W test for normal data

    Variable |    Obs         W         V         z    Prob>z
-------------+-----------------------------------------------
       price |     74   0.76696    15.008     5.909   0.00000
         mpg |     74   0.94821     3.335     2.627   0.00430
       rep78 |     69   0.98191     1.100     0.208   0.41760
    headroom |     74   0.98104     1.221     0.436   0.33137
       trunk |     74   0.97921     1.339     0.637   0.26215
      weight |     74   0.96110     2.505     2.003   0.02258
      length |     74   0.97165     1.825     1.313   0.09461
        turn |     74   0.97113     1.859     1.353   0.08803
displacement |     74   0.92542     4.803     3.423   0.00031
  gear_ratio |     74   0.95814     2.696     2.163   0.01525
     foreign |     74   0.96928     1.978     1.488   0.06838
Let's sort that so the structure is easier to see.
       price |     74   0.76696    15.008     5.909   0.00000
displacement |     74   0.92542     4.803     3.423   0.00031
         mpg |     74   0.94821     3.335     2.627   0.00430
  gear_ratio |     74   0.95814     2.696     2.163   0.01525
      weight |     74   0.96110     2.505     2.003   0.02258
     foreign |     74   0.96928     1.978     1.488   0.06838
        turn |     74   0.97113     1.859     1.353   0.08803
      length |     74   0.97165     1.825     1.313   0.09461
       trunk |     74   0.97921     1.339     0.637   0.26215
    headroom |     74   0.98104     1.221     0.436   0.33137
       rep78 |     69   0.98191     1.100     0.208   0.41760
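
If you would rather not sort such a table by hand, here is a minimal sketch of one way to do it. It assumes that -swilk- leaves its results in r(N), r(W), r(V), r(z) and r(p), and it runs one variable at a time because the r() results refer only to the last variable tested. The names `memhold', `results', varname, n, W, V, z and p are my own choices for illustration, not anything built in. Note that -use- replaces the data in memory.

```stata
sysuse auto, clear

* Post one row of Shapiro-Wilk results per variable to a temporary file
tempname memhold
tempfile results
postfile `memhold' str32 varname n W V z p using `results'

foreach v of varlist price-foreign {
    quietly swilk `v'
    post `memhold' ("`v'") (r(N)) (r(W)) (r(V)) (r(z)) (r(p))
}
postclose `memhold'

* Load the collected results and sort from least to most "normal" on W
use `results', clear
sort W
format W p %7.5f
format V z %7.3f
list, noobs
```

Sorting on W is just one choice; sorting on p or z gives the same order here.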
Stepping back, what is non-normality and why should we care
about it? (For normal, read "Gaussian" or "central" if you prefer.
The second was suggested by the physicist Edwin Jaynes.)
Crudely, non-normality could include overall skewness, overall
tail weight differing from normal, granularity, individual
outliers, and whatever else I've forgotten. Shapiro-Wilk collapses
all that onto one dimension by quantifying the straightness of
a normal probability plot. But, crucially, you lose much information
by any such numerical reduction.
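
To see what the single number is summarizing, look at the plot itself. The following is only a sketch using -price- as the example; the graph names qn_price and h_price are arbitrary labels of mine.

```stata
sysuse auto, clear

* Normal quantile plot: Shapiro-Wilk quantifies how straight this is
qnorm price, name(qn_price, replace)

* Histogram with a normal density superimposed
histogram price, normal name(h_price, replace)

* Moment-based skewness and kurtosis as separate descriptors
summarize price, detail
display "skewness = " r(skewness) "   kurtosis = " r(kurtosis)
```

Skewness, tail weight, granularity and outliers each show up differently on the plot, which is exactly the information lost in the reduction to W.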
How far is any column here an indicator of non-normality that
you might care about (or normality that you might desire)?
For example, -rep78- is at one extreme of the ranking, but -rep78- is an
ordered categorical variable and in one sense is possibly not
even appropriate for the test. It looks good because it happens to be
unimodal, fairly symmetric and free of outliers. Even -foreign- passes muster,
if you use P < 0.05 as a cutoff, even though it's a binary variable.
But why is -foreign- assessed as more nearly normal than
-gear_ratio-? I guess it is because it waggles less in the tails
than -gear_ratio-. Yet I really can't imagine -gear_ratio- causing
any problems as either response or predictor, even if there were
some assumption of normality anywhere. On the other hand, -foreign-
really should not be analysed as if it were normal!
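
A quick way to check what kind of variables these are, sketched with standard commands only (the graph names are arbitrary labels of mine):

```stata
sysuse auto, clear

* rep78 is an ordered categorical variable; foreign is binary
tabulate rep78
tabulate foreign

* Compare the quantile plots that lie behind the two test results
qnorm gear_ratio, name(qn_gear, replace)
qnorm foreign, name(qn_foreign, replace)
```

The tabulations make the discreteness obvious in a way that no P-value does.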
Naturally, some of the results here make perfect sense. On -swilk-
(and for that matter on moment- and L-moment-based shape measures)
-price- sticks out as distinctly skew and fat-tailed and probably
best analysed on (say) a logarithmic scale.
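
As a sketch of that suggestion, assuming a natural logarithm is the scale you want (log_price is just my name for the new variable):

```stata
sysuse auto, clear

* Natural log of price as a candidate analysis scale
generate double log_price = ln(price)
label variable log_price "log price"

* Compare the test on both scales, then look at the plot
swilk price log_price
qnorm log_price, name(qn_logprice, replace)
```

Whether the logarithm goes far enough is, again, something to judge from the plot rather than from the test alone.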
But the total picture is this. You can boost Shapiro-Wilk
as much as you like as an omnibus or portmanteau statistic, but
you can't guarantee that it will match what is acceptable to
you or unacceptable to you. Practically, it can send a very
misleading message.
(I haven't touched on another issue. Tests for marginal normality
are often not directly relevant for how a predictor or response behaves
within some larger model.)
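
To make that last point concrete, here is a sketch in which the model is purely illustrative (I am not recommending it) and resid is a name I made up. The idea is to look at the residuals from the model rather than at each variable's marginal distribution.

```stata
sysuse auto, clear

* An illustrative model only; the choice of predictors is arbitrary
regress price weight foreign

* Residuals are what any normality assumption would apply to
predict double resid, residuals
qnorm resid, name(qn_resid, replace)

* Residual versus fitted plot for structure a test would not show
rvfplot, yline(0) name(rvf, replace)
```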
Nick
Svend Juul wrote:
> Vanessa Mahlperg wrote:
>
> I've got a question concerning the interpretation of the Shapiro-Wilk
> test results.
> I don't know the correct meaning of V, z and Prob>z in German. Could
> anybody tell me how to identify the normal distribution in the
> following case:
>
> swilk c_ws6m c_ws2j c_stelle if zugeh==2
> Shapiro-Wilk W test for normal data
>
>     Variable    Obs         W         V         z    Prob>z
>       c_ws6m     87   0.88729     8.290     4.656   0.00000
>       c_ws2j     87   0.99142     0.631    -1.015   0.84484
>     c_stelle     87   0.98980     0.750    -0.632   0.73638
>
> ----------------------------------------------------------------------
>
> W and V are specific to the Shapiro-Wilk test; if you need to know
> more (I don't), Google will point to explanations.
>
> z is the z-statistic; you will find it in any statistical textbook and
> in the output from numerous commands. Essentially it is an estimate
> divided by its standard error.
>
> Prob>z is one of Stata's strange shorthand habits; you will find it in
> the output from numerous commands. It does NOT mean that a probability
> is larger than z, but (in the line for c_ws6m) that - if the
> null-hypothesis of normality is true - the probability of z being
> 4.656 or more extreme, is < 0.00001.
>
> The above results would lead most of us to reject the null-hypothesis
> of normality for c_ws6m, and accept it for c_ws2j and c_stelle. But
> some of us prefer to assess normality visually rather than by
> statistical testing; -histogram- and -qnorm- are useful commands for
> that.