Thanks to Steve and Nick for their additional thoughts and improvements on calculating skew and kurtosis for a row of observations, and for highlighting the broader issues.
I am sure jeheyman has good reasons for doing these calculations on a row of observations, and I wanted to give an answer that helped them do this. My preference, however, would be to follow Nick's advice and -reshape- the data, perform the calculations, then (if necessary) -reshape- it back again. Working with 3, 5 or even 10 variables in a row is probably okay, but it seems to me that the row-wise code becomes cumbersome as more variables are added. A major advantage of reshaping the data is that you can also easily graph the distributions and quickly get a feel for the issues you are trying to summarise with the skew and kurtosis statistics.
-- Matt
-----Original Message-----
From: [email protected]
[mailto:[email protected]]On Behalf Of Nick Cox
Sent: Wednesday, 15 October 2008 3:48 AM
To: [email protected]
Subject: RE: st: RE: rowskew?
This thread raises questions on several different levels.
First, I've spent some time looking at the literature, partly in pursuit
of a longer-term project, and I'd underline that there are various
different formulae for moment measures of skewness and kurtosis. The
situation is not like that for s.d. and variance where in essence there
are only two defensible formulae. There are in this case several
co-existing, and I don't count algebraical equivalents as separate. I
don't think there is a clear-cut case for saying that one is right
rather than another.
Two advantages of the formulae used by Matt are that they are the
simplest and that they are those used by [R] summarize, so that it can
more easily be checked whether results square with those of official
Stata.
Second, I'd recommend using -double-s. In examples I've checked there is
no substantial difference, but using doubles is at least a nod to
numerical issues.
Third, Matt's toy example threw up a kurtosis curiosum:
With the formula used by Stata, the kurtosis of 3 values not all equal
to each other is always 1.5.
Presumably the result is a direct consequence of the formula and obvious
when looked at the right way, but it was a surprise to me. (If all
values tie, then the variance is clearly zero, so that case is
indeterminate. But two tied values is not a problem.)
This is not a proof, naturally, but an example:
sysuse auto
gen group = ceil(_n/3)
egen kurt = kurt(mpg) in 1/72 , by(group)
(8 missing values generated)
su kurt
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        kurt |        66         1.5           0        1.5        1.5
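The constant 1.5 can also be shown algebraically. What follows is a sketch, assuming the moment formula in Matt's code (moments divided by n, kurtosis = m4/m2^2) and writing d_1, d_2, d_3 for the deviations from the mean of the three values; the notation is introduced here only for illustration.

```latex
% Deviations from the mean of three values sum to zero.
\[
d_1 + d_2 + d_3 = 0
\;\Longrightarrow\;
d_1 d_2 + d_1 d_3 + d_2 d_3 = -\tfrac{1}{2}\sum_{i} d_i^2 .
\]
% Squaring the pairwise sum: the cross term carries the factor (d_1+d_2+d_3) and vanishes.
\[
\Bigl(\sum_{i<j} d_i d_j\Bigr)^{2}
= \sum_{i<j} d_i^2 d_j^2 + 2\, d_1 d_2 d_3\,(d_1 + d_2 + d_3)
= \sum_{i<j} d_i^2 d_j^2
\;\Longrightarrow\;
\sum_{i<j} d_i^2 d_j^2 = \tfrac{1}{4}\Bigl(\sum_i d_i^2\Bigr)^{2} .
\]
% Expand the square of the sum of squares and substitute.
\[
\sum_i d_i^4
= \Bigl(\sum_i d_i^2\Bigr)^{2} - 2\sum_{i<j} d_i^2 d_j^2
= \tfrac{1}{2}\Bigl(\sum_i d_i^2\Bigr)^{2} .
\]
% Kurtosis with moments m_r = (1/3) \sum_i d_i^r.
\[
\frac{m_4}{m_2^{2}}
= \frac{\tfrac{1}{3}\sum_i d_i^4}{\bigl(\tfrac{1}{3}\sum_i d_i^2\bigr)^{2}}
= \frac{3\sum_i d_i^4}{\bigl(\sum_i d_i^2\bigr)^{2}}
= \frac{3}{2},
\qquad \text{provided } \sum_i d_i^2 > 0 .
\]
```

The proviso is the same one noted above: if all three values tie, the variance is zero and the ratio is indeterminate.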
In practice, you shouldn't want to work with kurtosis of groups of 3, of
course. And, to be clear, I am sure Matt wasn't implying that at all.
Fourth, an alternative solution is via a -reshape-, -egen- and -reshape-
back again. (This is likely to be much more practical than the -xpose-
suggested by Martin Weiss.)
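
The -reshape-, -egen-, -reshape- route might look like the sketch below. It assumes the same four auto variables that Steve's code uses; the names id, which, and the stub x are made up for illustration, and the loop-based renaming is just one way to give the variables a common stub.

```stata
sysuse auto, clear
local vlist "mpg rep78 trunk turn"

* reshape needs a row identifier and a common stub (x1, x2, ...)
gen long id = _n
local j = 0
foreach v of local vlist {
    local ++j
    rename `v' x`j'
}

* one observation per (row, variable) pair
reshape long x, i(id) j(which)

* egen's skew() and kurt() ignore missing values within each id
egen double rowskew = skew(x), by(id)
egen double rowkurt = kurt(x), by(id)

* back to the original layout, restoring the original names
reshape wide x, i(id) j(which)
local j = 0
foreach v of local vlist {
    local ++j
    rename x`j' `v'
}

list `vlist' rowskew rowkurt in 1/5
```

While the data are in long form it is also straightforward to graph the distribution of x for selected ids before reshaping back.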
Fifth, and perhaps most important, who says that moment-based skewness
and kurtosis are the best measures, or indeed of much practical use?
Why not (mean - median) / sd as a measure of skewness, which is based on
easy ingredients, bounded by [-1, 1] and in several ways easier to work
with? Why not L-moments (-findit lmoments-), known for almost 20 years
to behave much better?
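
For a row of variables, (mean - median) / sd could be computed along these lines. This is only a sketch: it assumes a Stata version whose -egen- provides rowmedian(), it uses the same four auto variables as the other examples in this thread, and the new variable names are invented for illustration.

```stata
sysuse auto, clear
local vlist "mpg rep78 trunk turn"

* row-wise ingredients; each function ignores missing values in the row
egen double rmean   = rowmean(`vlist')
egen double rmedian = rowmedian(`vlist')
egen double rsd     = rowsd(`vlist')

* (mean - median) / sd; missing when the row sd is zero or missing
gen double rowskew2 = (rmean - rmedian) / rsd

list `vlist' rowskew2 in 1/5
```

Note that rowsd() divides by the number of nonmissing values minus 1, so the result is slightly smaller in magnitude than with an n divisor and still lies within [-1, 1].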
Nick
Steven Samuels
This version adds a -qui- prefix to the replace statements and
removes an unneeded -di- statement. This will remove uninformative
output lines. Note that statistics are not computed for observations
with a missing value for any row variable. To override this
behavior, add "if !missing(`v')" to the end of the first -replace-
statement.
> Here is Matt's code with a couple of macros to reduce typing. I've
> subtracted 1 from the row N in the divisor and also subtracted 3 from
> the kurtosis formula Matt used. See:
> http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
>
**************************CODE BEGINS**************************
sysuse auto, clear
****************************************************
* Create local macro vlist with variables to analyze
****************************************************
local vlist "mpg rep78 trunk turn"
egen rowmean = rowmean(`vlist')
egen rowN = rownonmiss(`vlist')
forvalues i = 2/4 {
    gen m`i' = 0
    foreach v of local vlist {
        qui replace m`i' = m`i' + (`v' - rowmean)^`i'
    }
    qui replace m`i' = m`i'/(rowN - 1)
}
gen rowskew = m3*m2^(-3/2)
gen rowkurt = m4*m2^(-2) - 3
list `vlist' row* in 1/5
***************************CODE ENDS***************************
On Oct 13, 2008, at 11:21 PM, Matt Spittal wrote:
> You can use the moments about the mean to calculate skew and
> kurtosis for a row of variables. Imagine that you want to do this
> for the variables weight, length and price from the auto dataset.
>
> sysuse auto, clear
>
> // get mean and N
> egen rowmean = rowmean(weight length price)
> egen rowN = rownonmiss(weight length price)
>
> // calculate 2nd, 3rd, and 4th moments about the mean
> gen m2 = 1/rowN * ((weight - rowmean)^2 + (length - rowmean)^2 + ///
>     (price - rowmean)^2)
> gen m3 = 1/rowN * ((weight - rowmean)^3 + (length - rowmean)^3 + ///
>     (price - rowmean)^3)
> gen m4 = 1/rowN * ((weight - rowmean)^4 + (length - rowmean)^4 + ///
>     (price - rowmean)^4)
>
> // calculate skew and kurtosis
> gen rowskew = m3*m2^(-3/2)
> gen rowkurt = m4*m2^(-2)
>
> list weight length price rowskew rowkurt in 1/10
jeheyman
> Is it possible to calculate essentially a rowskew and rowkurtosis in
> the same way that egen calculates rowmean?
>
> For each observation I have 18 variables and I need, obviously, the
> three distribution measures. Mean is trivial but the other two are
> proving elusive.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/