Re: st: polychoric for huge data sets
From: Stas Kolenikov <[email protected]>
To: [email protected]
Subject: Re: st: polychoric for huge data sets
Date: Wed, 5 Sep 2012 12:19:06 -0500

Well, clearly there's some overhead that hardly depends on the number
of variables (parsing, populating the matrices, etc.), but that should
be much faster than the iterative optimization. It may well be that
with some setups the timing comes out somewhat faster than quadratic,
but I'd be surprised if it were as fast as linear: -polychoric-
literally computes the correlations one by one, so I thought quadratic
growth was unavoidable.
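
One way to check is to time -polychoric- on a growing varlist,
something like the sketch below (v1-v40 are placeholder names for the
ordinal variables, and the step sizes are arbitrary):

* time -polychoric- as the varlist grows
forvalues k = 10(10)40 {
    timer clear 1
    timer on 1
    quietly polychoric v1-v`k'
    timer off 1
    quietly timer list 1
    display "`k' variables: " %8.1f r(t1) " seconds"
}
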
--
-- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
-- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at
srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer
On Wed, Sep 5, 2012 at 9:15 AM, Nick Cox <[email protected]> wrote:
> I don't know what is obvious to anyone else, but clearly as author you
> know your code, which is based on calculating correlations one at a
> time. Nevertheless my very limited experiments show less than
> quadratic dependence on the number of variables.
>
> Nick
>
> On Wed, Sep 5, 2012 at 3:05 PM, Stas Kolenikov <[email protected]> wrote:
>> Obviously, -polychoric- computing time is quadratic in the number of
>> variables, but linear (or maybe even faster) in the number of
>> observations. There's also the curse of large sample sizes: most of
>> the time, the underlying bivariate normality will be considered
>> violated by -polychoric-, and that may create computational
>> difficulties, such as flat regions, ridges, and multiple local optima.
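>>
>> To see the dependence on the number of observations, one could time
>> a fixed varlist on random subsamples of increasing size, roughly
>> like this (v1-v10 and the sample sizes are placeholders):
>>
>> * time -polychoric- on random subsamples of increasing size
>> forvalues n = 1000(1000)4000 {
>>     preserve
>>     quietly sample `n', count
>>     timer clear 2
>>     timer on 2
>>     quietly polychoric v1-v10
>>     timer off 2
>>     quietly timer list 2
>>     display "N = `n': " %8.1f r(t2) " seconds"
>>     restore
>> }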
>>
>> On Wed, Sep 5, 2012 at 8:54 AM, Nick Cox <[email protected]> wrote:
>>> Experiment supports intuition in suggesting that the number of
>>> variables is a bigger deal for -polychoric- than the number of
>>> observations, and also that you can get results for 8000 obs and 40
>>> variables in several minutes on a mundane computer. That's tedious
>>> interactively but doesn't support the claim that Timea made. Best
>>> just to write a do-file and let it run while you are doing something
>>> else.
>>>
>>> Nick
>>>
>>> On Wed, Sep 5, 2012 at 9:59 AM, Nick Cox <[email protected]> wrote:
>>>> Stas Kolenikov's -polychoric- package promises only principal
>>>> component analysis. Depending on how you were brought up, that is
>>>> distinct from factor analysis, or a limiting case of factor analysis,
>>>> or a subset of factor analysis.
>>>>
>>>> The problem you report as "just can't handle it" with no details
>>>> appears to be one of speed, rather than refusal or inability to
>>>> perform.
>>>>
>>>> That aside, what is "appropriate" is difficult to answer. A recent
>>>> thread indicated that many on this list are queasy about means or
>>>> t-tests for ordinal data, so that would presumably put factor analysis
>>>> or PCA of ordinal data beyond the pale. Nevertheless it remains
>>>> popular.
>>>>
>>>> You presumably have the option of taking a random sample from your
>>>> data and subjecting it to both (a) PCA of _ranked_ data (which is
>>>> equivalent to PCA based on Spearman correlation) and (b) polychoric
>>>> PCA. It would be good news for you if the substantive or scientific
>>>> conclusions were the same, and otherwise a difference you would
>>>> need to think about. Here the random sample should be large enough
>>>> to be substantial, but small enough to give results in reasonable
>>>> time.
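>>>>
>>>> In outline, the comparison might run like this (v1-v40 and the
>>>> 1000-observation subsample are placeholders; -spearman- leaves the
>>>> rank correlation matrix in r(Rho), and -polychoricpca- is from
>>>> Stas's package):
>>>>
>>>> preserve
>>>> sample 1000, count
>>>> * (a) PCA on the Spearman rank correlation matrix
>>>> quietly spearman v1-v40
>>>> matrix R = r(Rho)
>>>> pcamat R, n(1000) components(5)
>>>> * (b) polychoric PCA on the same subsample
>>>> polychoricpca v1-v40
>>>> restore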
>>>>
>>>> Alternatively, you could be ruthless about which of your variables
>>>> are most interesting or important. A preliminary correlation
>>>> analysis would show which variables could be excluded because they
>>>> are poorly correlated with everything else, and which could be
>>>> excluded because they are very highly correlated with something
>>>> else. Even if you can get it, a PCA based on 40+ variables is often
>>>> unwieldy to handle and even more difficult to interpret than one
>>>> based on, say, 10 or so variables.
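>>>>
>>>> A rough screen along those lines, again via -spearman- (v1-v40 and
>>>> the 0.2/0.9 cutoffs are placeholders):
>>>>
>>>> * flag variables whose largest |rho| with any other variable is
>>>> * very low (little shared structure) or very high (near-redundant)
>>>> quietly spearman v1-v40
>>>> matrix R = r(Rho)
>>>> local names : rownames R
>>>> forvalues i = 1/`=rowsof(R)' {
>>>>     local best = 0
>>>>     forvalues j = 1/`=rowsof(R)' {
>>>>         if `i' != `j' {
>>>>             local best = max(`best', abs(R[`i',`j']))
>>>>         }
>>>>     }
>>>>     if `best' < 0.2 | `best' > 0.9 {
>>>>         local v : word `i' of `names'
>>>>         display "`v': max |rho| with others = " %5.3f `best'
>>>>     }
>>>> }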
>>>>
>>>> Nick
>>>>
>>>> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos
>>>> <[email protected]> wrote:
>>>>
>>>>> I need to run a factor analysis on ordinal data. My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>>>>
>>>>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>>>>
>>>>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead. Would this be appropriate?
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/