Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: polychoric for huge data sets
From: Stas Kolenikov <[email protected]>
To: [email protected]
Subject: Re: st: polychoric for huge data sets
Date: Wed, 5 Sep 2012 09:05:26 -0500

Obviously, -polychoric- computing time is quadratic in the number of
variables, but linear (or maybe even faster) in the number of
observations. There's also the curse of large sample sizes: most of
the time, the underlying bivariate normality will be considered
violated by -polychoric-, and that may create computational
difficulties, such as flat regions, ridges, and multiple local optima.
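
To put rough numbers on the quadratic part: a correlation matrix for p
variables has p(p-1)/2 distinct off-diagonal entries, so the work grows
with the number of variable pairs. A minimal sketch, assuming
-polychoric- is installed and using hypothetical variable names
v1-v40; timing a small subset first gives a basis for extrapolating to
the full set:

```stata
* Number of distinct variable pairs for p variables: p(p-1)/2
display comb(10, 2)    // 45 pairs for 10 variables
display comb(40, 2)    // 780 pairs for 40 variables

* Time a 10-variable subset (v1-v10 are hypothetical names)
timer clear 1
timer on 1
polychoric v1-v10
timer off 1
timer list 1
```

If the time per pair is roughly constant, the 40-variable run should
take on the order of 780/45, or about 17, times as long as the
10-variable run.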
On Wed, Sep 5, 2012 at 8:54 AM, Nick Cox <[email protected]> wrote:
> Experiment supports intuition in suggesting that the number of
> variables is a bigger deal for -polychoric- than the number of
> observations, and also that you can get results for 8000 obs and 40
> variables in several minutes on a mundane computer. That's tedious
> interactively but doesn't support the claim that Timea made. Best
> just to write a do-file and let it run while you are doing something
> else.
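
A do-file along the following lines is one way to do that. It is only
a sketch: the file names polyrun.log and mydata and the variable names
v1-v40 are hypothetical, and it assumes that -polychoric- leaves the
estimated matrix in r(R).

```stata
* Hypothetical do-file; mydata and v1-v40 are placeholder names
log using polyrun.log, replace text
use mydata, clear

* Estimate the polychoric correlation matrix (this is the slow step)
polychoric v1-v40

* Copy the result out of r() so that later commands cannot overwrite it
matrix R = r(R)
matrix list R

log close
```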
>
> Nick
>
> On Wed, Sep 5, 2012 at 9:59 AM, Nick Cox <[email protected]> wrote:
>> Stas Kolenikov's -polychoric- package promises only principal
>> component analysis. Depending on how you were brought up, that is
>> distinct from factor analysis, or a limiting case of factor analysis,
>> or a subset of factor analysis.
>>
>> The problem you report as "just can't handle it" with no details
>> appears to be one of speed, rather than refusal or inability to
>> perform.
>>
>> That aside, what is "appropriate" is difficult to answer. A recent
>> thread indicated that many on this list are queasy about means or
>> t-tests for ordinal data, so that would presumably put factor analysis
>> or PCA of ordinal data beyond the pale. Nevertheless it remains
>> popular.
>>
>> You presumably have the option of taking a random sample from your
>> data and subjecting that to both (a) PCA of _ranked_ data (which is
>> equivalent to PCA based on Spearman correlation) and (b) polychoric
>> PCA. Then it would be good news for you if the substantive or
>> scientific conclusions were the same, and a difference you need to
>> think about otherwise. Here the random sample should be large enough
>> to be substantial, but small enough to get results in reasonable time.
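
The comparison described above could be set up roughly as follows.
This is only a sketch: mydata, v1-v40, the seed, and the sample size of
2000 are placeholders; it assumes there are no missing values, so that
n(2000) is the right sample size for -pcamat-; and it assumes that
-polychoric- leaves its matrix in r(R).

```stata
* Placeholders: mydata, v1-v40, the seed, and the sample size
use mydata, clear
set seed 20120905
sample 2000, count        // keep a random sample of 2000 observations

* (a) PCA based on the Spearman (rank) correlation matrix
spearman v1-v40
matrix S = r(Rho)
pcamat S, n(2000)

* (b) PCA based on the polychoric correlation matrix
polychoric v1-v40
matrix P = r(R)
pcamat P, n(2000)
```

Comparing the eigenvalues and loadings from (a) and (b) then shows
whether the two approaches lead to the same substantive conclusions.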
>>
>> Alternatively, you could be ruthless about which of your variables are
>> most interesting or important. A preliminary correlation analysis
>> would show which variables could be excluded because they are poorly
>> correlated with anything else, and which could be excluded because
>> they are very highly correlated with some other variable. Even if you can
>> get it, a PCA based on 40+ variables is often unwieldy to handle and
>> even more difficult to interpret than one based on say 10 or so
>> variables.
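
One way to run such a preliminary screen, sketched here with Spearman
correlations because they are quick to compute. The variable names
v1-v40 are hypothetical, and what counts as "poorly" or "very highly"
correlated is left to judgment:

```stata
* Spearman correlations as a quick screen (v1-v40 are hypothetical names)
spearman v1-v40
matrix S = r(Rho)

* For each variable, show its largest absolute correlation with any other
local names : colnames S
local p = colsof(S)
forvalues i = 1/`p' {
    local maxabs = 0
    forvalues j = 1/`p' {
        if `i' != `j' {
            local maxabs = max(`maxabs', abs(S[`i', `j']))
        }
    }
    display %-12s "`: word `i' of `names''" %6.3f `maxabs'
}
```

Variables whose largest absolute correlation is near 0 are candidates
for exclusion as unrelated to the rest; those with a value near 1 are
candidates for exclusion as redundant.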
>>
>> Nick
>>
>> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos
>> <[email protected]> wrote:
>>
>>> I need to run a factor analysis on ordinal data. My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>>
>>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>>
>>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead. Would this be appropriate?
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
--
-- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
-- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at
srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer