Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: wealth score using principal component analysis (PCA)
From
汪哲仁 <[email protected]>
To
[email protected]
Subject
Re: st: wealth score using principal component analysis (PCA)
Date
Thu, 27 Sep 2012 11:06:02 +0400
Dear Nick and Stat,
May I ask a question? In which circumstances is PCA a better
choice than SEM?
with kind regards,
Charles Wang
汪哲仁
2012/9/27 Nick Cox <[email protected]>
>
> You are confusing two different questions. Throughout I focus on the
> case you are looking at where PCA is based on the correlation matrix.
>
> If the aim is to use the most important PC, then that is labelled 1,
> but even if it weren't we could identify it by its having the largest
> eigenvalue attached and no extra considerations arise.
>
> If the aim is to identify which PCs are "important" or "worthy of use"
> (typically one or more) and should be used in later analyses, then
> this is necessarily a looser, more open question and the best art is a
> darker matter. There can't be an answer independent of what you are
> trying to do. Some people do stress a rule of thumb such as
> eigenvalues > 1 and some people look for a break in the eigenvalues
> using a scree plot. In some projects PCs that are used later are good
> if interpretable as having high correlations with particular
> variables; in other projects the PCs are just composite variables with
> the properties assigned to them and interpretability is less material.
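
The eigenvalue rule of thumb and the scree plot mentioned above can be sketched as follows. This is only an illustration, not a recommendation: the auto dataset shipped with Stata and the four variables chosen are stand-ins for the poster's data, which we cannot see.

```stata
* Illustrative sketch: dataset and variables are assumptions, not the poster's data
sysuse auto, clear

* PCA on the correlation matrix (the default)
pca price mpg weight length

* Eigenvalues are stored in e(Ev); for a correlation-matrix PCA
* they sum to the number of variables (here 4)
matrix list e(Ev)

* Scree plot of the eigenvalues, with a reference line at 1
screeplot, yline(1)

* Refit, retaining only components with eigenvalues larger than 1
pca price mpg weight length, mineigen(1)
```

Whether either rule picks out components worth using still depends on the purpose of the analysis.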
>
> Every book I know on PCA stresses this open aspect of the method. The
> books by Jolliffe and Jackson referenced in the -pca- documentation
> certainly do.
>
> It's not clear exactly why you feel committed in advance to using PCA
> like this. I sympathise with the advice given earlier by Stas
> Kolenikov to consider something more like an SEM.
>
> Nick
>
> On Wed, Sep 26, 2012 at 9:33 PM, Shikha Sinha <[email protected]> wrote:
> > OK, I understand now that if I want to use one score, then PC1 is the
> > most relevant one, and that to distinguish further between financial
> > and social autonomy, we need to look at the loadings on PC2, PC3, and
> > so on, to figure out whether PC2 is better than PC1 when the focus is
> > on social or financial autonomy.
> >
> > But then I am struggling to understand the point of selecting components
> > based on eigenvalues. What is the use of selecting PCs based on either
> > eigenvalues or a scree plot if we are always (or most of the time) going
> > to use the first component? An example of the importance of eigenvalues
> > in selecting components would be very helpful (or any reference).
> >
> > Thanks,
> > Shikha
> >
> > On Wed, Sep 26, 2012 at 6:39 AM, Stas Kolenikov <[email protected]> wrote:
> >> Often, the 1st PC works as a measure of "overall size", while the
> >> subsequent components, as measures of "structure". So the 1st
> >> component might be the degree of overall autonomy, while the 2nd
> >> component might distinguish say between financial autonomy and social
> >> interactions autonomy.
> >>
> >> --
> >> -- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
> >> -- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at
> >> srbi dot com
> >> -- Opinions stated in this email are mine only, and do not reflect the
> >> position of my employer
> >>
> >>
> >>
> >> On Tue, Sep 25, 2012 at 6:34 PM, Nick Cox <[email protected]> wrote:
> >>> If you want just one index, you can't improve on the first PC if you
> >>> are using the criteria of PCA. That's a central idea of PCA.
> >>>
> >>> Nick
> >>>
> >>> On Wed, Sep 26, 2012 at 12:22 AM, Shikha Sinha
> >>> <[email protected]> wrote:
> >>>> Thanks for your response Nick and stat!
> >>>>
> >>>> I think I am struggling with how to create one score from two
> >>>> components. Let me pose my question again.
> >>>>
> >>>> Suppose I want to create one index out of six variables. For example,
> >>>> I want to create a "women's autonomy index". The index would be one
> >>>> number for every household. The Demographic and Health Survey (DHS)
> >>>> asks 10 different questions related to women's autonomy and, instead of
> >>>> using the information in all 10 questions, I just want to use an
> >>>> index that summarizes all 10 questions/variables. I can use -pca- to
> >>>> create the index. Once I run -pca x1-x10-, I can choose the number of
> >>>> principal components (PCs) to retain based on eigenvalues or a scree
> >>>> plot. Let us assume that there are three PCs that have eigenvalues > 1
> >>>> and I want to retain all these components, though the first component
> >>>> explains the most variation.
> >>>>
> >>>> Now, I want to create a "women's autonomy index" based on these three
> >>>> PCs. How can I do that? If I use -predict p1 p2 p3, score-, it gives
> >>>> three different scores, all unrelated. However, I want just one index.
> >>>> Kindly suggest how to do this.
> >>>>
> >>>> Thanks,
> >>>> Shikha
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Sep 25, 2012 at 9:05 AM, Stas Kolenikov <[email protected]> wrote:
> >>>>> Regarding (c), you would be best off with a structural equations model
> >>>>> (-sem- module), and forgo the PCA altogether.
> >>>>>
> >>>>> --
> >>>>> -- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
> >>>>> -- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at
> >>>>> srbi dot com
> >>>>> -- Opinions stated in this email are mine only, and do not reflect the
> >>>>> position of my employer
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Sep 24, 2012 at 7:07 PM, Nick Cox <[email protected]> wrote:
> >>>>>> You seem to be misunderstanding both PCA and the syntax of -predict-
> >>>>>> after -pca-.
> >>>>>>
> >>>>>> To take the second first, -predict- just gives you as many components
> >>>>>> as you ask for. Ask for one by giving one variable name and you get
> >>>>>> scores for the first PC, regardless of what name you give. Stata's
> >>>>>> indifferent to what name you give (so long as it is new and legal) and
> >>>>>> indeed
> >>>>>>
> >>>>>> predict p3
> >>>>>> predict p777
> >>>>>>
> >>>>>> would give you further identical copies of the first PC.
> >>>>>>
> >>>>>> predict P1 P2
> >>>>>>
> >>>>>> would give you scores for the first two PCs.
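
The behaviour of -predict- described above can be checked with a short sketch. The auto dataset and the variable names below are illustrative assumptions only, standing in for x1-x4 in the original question.

```stata
* Illustrative sketch: dataset and variable names are assumptions
sysuse auto, clear
pca price mpg weight length

* One new variable name: scores for the first PC, whatever the name
predict p777, score
predict pc_one, score

* Two new variable names: scores for the first and second PCs
predict P1 P2, score

* p777, pc_one, and P1 are copies of the same first-PC scores,
* and each is uncorrelated with P2
correlate p777 pc_one P1 P2
```

The number of names supplied, not the names themselves, determines which components are scored.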
> >>>>>>
> >>>>>> As for PCA, there are potentially as many PCs as variables. Although
> >>>>>> the -components()- option puts a self-defined limit on how many you
> >>>>>> can calculate, the main purpose of this option appears to be to let
> >>>>>> -pca- behave more like -factor-.
> >>>>>>
> >>>>>> Even if your purpose is to use just one PC, it usually makes sense to
> >>>>>> look at several and the relationships of those PCs to your original
> >>>>>> variables. Sometimes the second, third, ... PC pick up important parts
> >>>>>> of the variation and it is a good idea to look at those too to see
> >>>>>> what the first PC is missing. In the case of wealth variables it might
> >>>>>> be a good idea to think about using PCA on logarithmic transformations
> >>>>>> of the variables too (assuming all values are strictly positive).
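
A minimal sketch of the logarithmic-transformation suggestion, again using the auto dataset purely as a stand-in (its variables are not wealth measures, and the new variable names are hypothetical):

```stata
* Illustrative sketch: dataset and variable names are assumptions
sysuse auto, clear

* Logs are defined only for strictly positive values, so check first
foreach v of varlist price weight length {
    assert `v' > 0
    generate ln_`v' = ln(`v')
}

* PCA on the log-transformed variables, then scores for the first PC
pca ln_price ln_weight ln_length
predict size_index, score
```

Comparing the loadings from this fit with those from the untransformed variables is one way to judge whether the transformation matters for a given dataset.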
> >>>>>>
> >>>>>> Note that the audience of Statalist is very international and
> >>>>>> interdisciplinary, so that assuming that "DHS" is self-evident is
> >>>>>> likely to be wrong in many cases.
> >>>>>>
> >>>>>> Your last question (c) is unanswerable. Many people do it, but how far
> >>>>>> it is "OK" in your project depends on your goals and your data, which
> >>>>>> we can't see.
> >>>>>>
> >>>>>> Nick
> >>>>>>
> >>>>>> On Mon, Sep 24, 2012 at 9:20 PM, Shikha Sinha <[email protected]> wrote:
> >>>>>>
> >>>>>>> I am trying to create a wealth score using the ownership of different
> >>>>>>> assets in the DHS survey. I am using -pca- but I am not sure how to
> >>>>>>> estimate the score, as I want to use the wealth score as one of the
> >>>>>>> independent variables.
> >>>>>>>
> >>>>>>> pca x1-x4
> >>>>>>> predict p1,score
> >>>>>>>
> >>>>>>> but -predict- only generates scores from the first component.
> >>>>>>>
> >>>>>>> I also tried the following:
> >>>>>>>
> >>>>>>> pca x1-x4, components(2)
> >>>>>>> predict p2, score
> >>>>>>>
> >>>>>>> However, p1 and p2 are the same.
> >>>>>>>
> >>>>>>> My questions are: (a) Why is there no difference between p1 and p2?
> >>>>>>> (b) How can I generate a score using the first 2 components only?
> >>>>>>> (c) Is it OK to use a continuous PCA score as an independent variable?
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/