Nick Cox wrote:
>There are various unstated assumptions and criteria that need to be
>spelled out for a fruitful discussion.
>1. Continuous versus discrete. I don't know any reason why PCA might not
be as helpful, or as useless, on discrete data (e.g. counts) as compared
with continuous data.
Agreed. The main thing is that discrete variables tend to be quite skewed, and skewed variables have strongly attenuated correlations. Much of the dimensionality you find is an artifact of that attenuation. The temptation is to assume that
dimension = substantively interesting variation,
but sadly this is often wrong. All you can really say is that
dimension = systematic variation,
which is far from the same thing.
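A quick simulation makes the attenuation point concrete. This is only a sketch under assumptions of my own choosing: two latent normal variables correlated at 0.8, each turned into a rare-event 0/1 indicator by cutting at 1.5 standard deviations. The function names, the cutoff, the sample size and the seed are all illustrative, not anything from the original discussion.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def attenuation_demo(n=20000, rho=0.8, cut=1.5, seed=12345):
    """Compare correlations before and after skewed dichotomization.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        # v has correlation rho with u by construction
        v = rho * u + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        x.append(u)
        y.append(v)

    # Rare "1"s: both indicators skewed in the same direction
    x_hi = [1 if u > cut else 0 for u in x]
    y_hi = [1 if v > cut else 0 for v in y]
    # Rare "0"s: indicator skewed in the opposite direction
    y_lo = [1 if v > -cut else 0 for v in y]

    return {
        "continuous": pearson(x, y),
        "same_skew": pearson(x_hi, y_hi),
        "opposite_skew": pearson(x_hi, y_lo),
    }


if __name__ == "__main__":
    for label, r in attenuation_demo().items():
        print(f"{label:>14}: r = {r:.3f}")
```

The same underlying association shows up as three quite different correlations, depending only on where the cutpoints fall. A PCA of the indicators would read those differences as extra dimensions, even though there is only one source of systematic variation in the latent variables.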
>I wouldn't think it useful for categorical
variables, which I take to be a quite different issue. <
Well, correspondence analysis is essentially principal components for categorical variables: CA depends on the singular value decomposition of the indicator matrix for categorical data in much the same way that PCA (or biplotting) uses the SVD of the data matrix for continuous variables. There's a large literature on it and, indeed, Stata has some nice procedures for it already built in. See -mca- and then expect to do some reading.
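To spell out the parallel, here is the usual textbook setup as a sketch. The notation (Z, P, r, c, S) is my own choice rather than Stata's, and -mca- may normalize or adjust things differently from this plain indicator-matrix version.

```latex
% PCA: SVD of the column-centred n x p data matrix X
X_c = X - \mathbf{1}\bar{x}^{\top} = U \Sigma V^{\top}

% CA: Z is the n x J indicator matrix built from Q categorical variables,
% so every row of Z sums to Q and the grand total is nQ
P = \frac{1}{nQ} Z, \qquad r = P\mathbf{1}, \qquad c = P^{\top}\mathbf{1}

% SVD of the standardized residuals, with D_r = diag(r), D_c = diag(c)
S = D_r^{-1/2} \left( P - r c^{\top} \right) D_c^{-1/2} = U \Sigma V^{\top}
```

In both cases the dimensions come from the singular vectors, and the squared singular values measure how much each dimension accounts for: variance in PCA, inertia in CA.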
>2. Skewed versus symmetric. In principle, PCA might work very well even
if some of the variables were highly skewed. In practice, skewness quite
often goes together with nonlinearities, and a transformation might help
in either case. <
Yup.
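A small sketch of why a transformation can help with skewness and nonlinearity at once. The setup is assumed purely for illustration: y is an exponential function of x with multiplicative noise, so y is right-skewed and curved in x, and taking logs fixes both. The function names, noise level and seed are arbitrary.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def transform_demo(n=20000, noise=0.3, seed=2024):
    """Return (r_raw, r_log): correlation of x with y and with log(y).

    The data-generating model is an illustrative assumption.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # y is strictly positive, right-skewed, and nonlinear in x
    y = [math.exp(u + noise * rng.gauss(0.0, 1.0)) for u in x]
    r_raw = pearson(x, y)
    r_log = pearson(x, [math.log(v) for v in y])
    return r_raw, r_log


if __name__ == "__main__":
    r_raw, r_log = transform_demo()
    print(f"raw scale: r = {r_raw:.3f}")
    print(f"log scale: r = {r_log:.3f}")
```

On the raw scale the linear correlation understates the relationship, because a straight line is the wrong shape for it. On the log scale the relationship is linear and the skewness is gone, so a PCA of the transformed variables has less spurious structure to pick up.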
JV