st: PCA: Principal Components as weighted sums of standardized variables (error in MV Ref. Manual?)
From
Partho Sarkar <[email protected]>
To
[email protected]
Subject
st: PCA: Principal Components as weighted sums of standardized variables (error in MV Ref. Manual?)
Date
Fri, 17 May 2013 17:54:31 +0530
I have just started using Stata for PCA, and am puzzled by a seeming
error in the Multivariate Statistics Reference Manual.
In the chapter "Postestimation tools for pca and pcamat" [Stata
Multivariate Statistics Reference Manual, Release 11, p. 580], after
having worked through the example audiometric data
(use http://www.stata-press.com/data/r11/audiometric, "Audiometric
measures") and calculated the principal components, the manual says
(long quote begins):
[BEGIN QUOTE, with comments in square brackets] "Predicting the
component scores
After deciding on the number of components..., you may want to
estimate the component scores for all respondents. To estimate only
the first component scores, which here is called pc1:
[enter command]
predict pc1
[output]
    ------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3     Comp4
    -------------+----------------------------------------
          lft500 |   0.4011   -0.3170    0.1582   -0.3278
         lft1000 |   0.4210   -0.2255   -0.0520   -0.4816
         lft2000 |   0.3664    0.2386   -0.4703   -0.2824
         lft4000 |   0.2809    0.4742    0.4295   -0.1611
         rght500 |   0.3433   -0.3860    0.2593    0.4876
        rght1000 |   0.4114   -0.2318   -0.0289    0.3723
        rght2000 |   0.3115    0.3171   -0.5629    0.3914
        rght4000 |   0.2542    0.5135    0.4262    0.1591
    ------------------------------------------------------
[This is just the Principal components (eigenvectors) matrix in the PC
computations]
The table is informing you that pc1 could be obtained as a weighted
sum of standardized variables,
. egen std_lft500 = std(lft500)
. egen std_lft1000 = std(lft1000)
. egen std_rght4000 = std(rght4000)
[etc. etc.]
. gen pc1 = 0.4011*std_lft500 + 0.4210*std_lft500 [TYPO] + ... +
0.2542*std_rght4000
[END QUOTE]
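For reference, the standardization step that the quoted passage abbreviates with "etc. etc." could be sketched as a loop over all eight variables. This is only an illustrative sketch: the loop itself is an assumption, while the std_ prefix and the variable names are taken from the manual's table and commands quoted above.

```stata
* Illustrative sketch: create a standardized copy (mean 0, sd 1)
* of each of the eight audiometric variables, named std_<variable>.
use http://www.stata-press.com/data/r11/audiometric, clear

foreach v of varlist lft500 lft1000 lft2000 lft4000 ///
                     rght500 rght1000 rght2000 rght4000 {
    egen std_`v' = std(`v')
}
```

The /// line continuation works in a do-file; when typing interactively, the foreach line would have to be entered on a single line.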
Accordingly, after standardizing all the variables, I tried this
corrected version of the equation above:
gen pc1try = 0.4011*std_lft500 + 0.4210*std_lft1000 ///
    + 0.3664*std_lft2000 + 0.2809*lft4000           ///
    + 0.3433*rght500 + 0.4114*rght1000              ///
    + 0.3115*rght2000 + 0.2542*std_rght4000
But
assert pc1==pc1try
produces:
    100 contradictions in 100 observations
    assertion is false
    r(9);
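A side note on the check itself: assert with == demands exact equality, and the coefficients in the table are rounded to four decimal places, so even a correct weighted sum would be unlikely to match the output of predict exactly. A looser comparison might look like the sketch below, where the variable name diff and the tolerance 0.01 are arbitrary choices made for illustration.

```stata
* Hypothetical tolerance check: compare the two variables up to
* a small absolute difference instead of requiring exact equality.
gen double diff = abs(pc1 - pc1try)
summarize diff

* Fails if any observation differs by 0.01 or more
* (a missing diff also counts as a failure).
assert diff < 0.01
```

A check like this only allows for rounding error; it would not excuse differences of several units.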
And sure enough, here is what the first few lines of data look like:
list pc1 pc1try in 1/4
+-----------------------+
| pc1 pc1try |
|-----------------------|
1. | 1.180442 8.493489 |
2. | -.2950325 3.019228 |
3. | .7345378 7.701978 |
4. | -2.132017 -12.07502 |
+-----------------------+
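One detail worth checking in the gen pc1try command above: the terms for lft4000, rght500, rght1000, and rght2000 are written without the std_ prefix, so those four enter the sum unstandardized. A variant that uses the standardized version of every variable, as the manual's formula describes, could be sketched as follows. The name pc1std is hypothetical, and the std_ variables are assumed to exist already.

```stata
* Hypothetical variant: every term uses the std_ version of the variable.
gen double pc1std = 0.4011*std_lft500  + 0.4210*std_lft1000  ///
                  + 0.3664*std_lft2000 + 0.2809*std_lft4000  ///
                  + 0.3433*std_rght500 + 0.4114*std_rght1000 ///
                  + 0.3115*std_rght2000 + 0.2542*std_rght4000

* Compare with the scores from predict for the first few observations.
list pc1 pc1std in 1/4
```

Because the coefficients are rounded to four decimals, any agreement with pc1 would be approximate rather than exact.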
What is going on here? I noticed, of course, that the formula given
above (gen pc1, gen pc1try, ...) is not a weighted sum properly speaking,
since the weights do not sum to 1. I tried a modification, dividing by
the sum of the weights, but this too does not give the correct pc1:
g pc2try=pc1try/2.7898
. assert pc1==pc2try
100 contradictions in 100 observations
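On the question of the weights, the arithmetic can be checked directly from the Comp1 column. The sketch below uses only the eight rounded coefficients from the table. Their plain sum is the 2.7898 used above, while their sum of squares comes out at approximately 1, which suggests that the column is scaled to unit length rather than to sum to 1.

```stata
* Sum of the Comp1 coefficients (the divisor 2.7898 used above)
display 0.4011 + 0.4210 + 0.3664 + 0.2809 + 0.3433 + 0.4114 + 0.3115 + 0.2542

* Sum of squared Comp1 coefficients (approximately 1, up to rounding)
display 0.4011^2 + 0.4210^2 + 0.3664^2 + 0.2809^2 + 0.3433^2 + 0.4114^2 + 0.3115^2 + 0.2542^2
```

If that reading is right, dividing by the sum of the weights would not be expected to reproduce pc1.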
I am sorry if this is too obvious, or misunderstood on my part.
Thanks and regards,
Partho Sarkar