Francesco Burchi wrote:
>>@ Jay
The theoretical reason for this aggregation is that the different variables
indicate different types of health knowledge.<<
OK, then it makes good sense to generate a sum score from these items.
>>The following are the results of tetrachoric correlation:
               Var1         Var2         Var3         Var4
Var1              1
Var2       .1819233            1
Var3       .3699331    .25242738            1
Var4      .18371493    .27407531    .40299934            1<<
Thanks. Eyeballing this, you have a positive manifold (all correlations are positive), with some differences between the items. A one-factor model is likely to be appropriate.
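The thread never shows how the matrix R was created, so here is a minimal sketch of one way to enter the posted correlations by hand and fit the one-factor model to them. The names v1 to v4 follow the names() option used in the ML run further down, and n(6926) is taken from the quoted factormat call. The `///` continuations assume this is run from a do-file rather than typed interactively.

```stata
* Tetrachoric correlations as posted, entered as a full symmetric matrix
matrix R = (1,         .1819233,  .3699331,  .18371493 \ ///
            .1819233,  1,         .25242738, .27407531 \ ///
            .3699331,  .25242738, 1,         .40299934 \ ///
            .18371493, .27407531, .40299934, 1)
matrix rownames R = v1 v2 v3 v4
matrix colnames R = v1 v2 v3 v4

* One-factor iterated principal factors on the correlation matrix
factormat R, n(6926) ipf factors(1)
```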
>>I was specifically asked whether I could justify my choice of one single
factor on the basis of the variance explained. Following your reasoning, I
could argue that with more than 1 factor it would be unidentified. Just to
be sure about the procedure I am following, I have tried to get results
keeping the 4 factors:
factormat R, n(6926) ipf   factor(4)
Factor analysis/correlation                    Number of obs    =     6926
Method: iterated principal factors             Retained factors =        3
Rotation: (unrotated)                          Number of params =        6
--------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
-------------+------------------------------------------------------------
        Factor1  |      1.28200      1.06199            0.8049       0.8049
        Factor2  |      0.22001      0.12912            0.1381       0.9431
        Factor3  |      0.09089      0.09108            0.0571       1.0001
        Factor4  |     -0.00019            .           -0.0001       1.0000
--------------------------------------------------------------------------
Could I state that the first factor explains 80% of the common variance?<<
Yes, it's pretty clearly one-dimensional, with the rest being the junk that tends to show up in item-level factor analysis. The uniquenesses associated with the loadings are totally in line with that. I also ran the ML factor analysis using:
. factormat R, n(6296) ml  factors(1) names(v1 v2 v3 v4)
(obs=6296)
Iteration 0:   log likelihood = -216.46349
Iteration 1:   log likelihood = -65.941751
Iteration 2:   log likelihood = -63.980616
Iteration 3:   log likelihood = -63.905495
Iteration 4:   log likelihood =  -63.90257
Iteration 5:   log likelihood = -63.902458
Factor analysis/correlation                        Number of obs    =     6296
    Method: maximum likelihood                     Retained factors =        1
    Rotation: (unrotated)                          Number of params =        4
                                                   Schwarz's BIC    =  162.796
    Log likelihood = -63.90246                     (Akaike's) AIC   =  135.805
    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      1.20010            .            1.0000       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(6)  = 2727.68 Prob>chi2 = 0.0000
    LR test:    1 factor vs. saturated:  chi2(2)  =  127.75 Prob>chi2 = 0.0000
Factor loadings (pattern matrix) and unique variances
    ---------------------------------------
        Variable |  Factor1 |   Uniqueness 
    -------------+----------+--------------
              v1 |   0.4583 |      0.7900  
              v2 |   0.3732 |      0.8607  
              v3 |   0.7583 |      0.4250  
              v4 |   0.5252 |      0.7242  
    ---------------------------------------
The chi-square tests are rather silly at this sample size; ignore them. The loadings and uniquenesses are almost the same as for IPF (interestingly enough, since that's not always true). It won't run anything higher-dimensional, but from looking at that tetrachoric correlation matrix I doubt you'd find anything.
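The numbers in the two outputs above can be cross-checked with a little arithmetic. This is only a sketch using the printed (rounded) values, and the degrees-of-freedom formula ((p - k)^2 - (p + k)) / 2 is the usual textbook count for p items and k factors, not something taken from the Stata output itself.

```stata
* Proportion of common variance for Factor1 in the IPF run:
* first eigenvalue divided by the sum of all four eigenvalues
display 1.28200 / (1.28200 + 0.22001 + 0.09089 - 0.00019)

* With one factor, uniqueness = 1 - loading^2 (ML loadings for v1-v4)
foreach l of numlist 0.4583 0.3732 0.7583 0.5252 {
    display %6.4f 1 - `l'^2
}

* Degrees of freedom: ((p - k)^2 - (p + k)) / 2 with p = 4 items
display ((4 - 1)^2 - (4 + 1)) / 2    // k = 1 factor
display ((4 - 2)^2 - (4 + 2)) / 2    // k = 2 factors
```

The first line comes out at about 0.805, matching the Proportion column, and the loop reproduces the Uniqueness column to four decimals. The last two lines give 2 and -1: one factor leaves 2 degrees of freedom (the chi2(2) in the LR test), while two factors would need more parameters than four items provide, which is presumably why nothing higher-dimensional will run.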
>>
Finally, I have tried to add one or two further indicators to improve the
analysis. However, I had some theoretical doubts on the inclusion of these
variables, and the factor analysis with tetrachoric correlations gave me
loadings for these variables much lower than 0.1, thus I was convinced to
use only 4 variables.<<
Are the tetrachoric correlations for the other two variables markedly lower or still meaningful? You might have an oblique two-factor solution. 
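If you want to check this directly, something along the following lines would do it. This is a sketch only: v5 and v6 are hypothetical names for the two extra indicators, and it assumes all six binary items are in memory under the names v1 to v6.

```stata
* v5 and v6 are hypothetical names for the two additional indicators
tetrachoric v1 v2 v3 v4 v5 v6
local N = r(N)            // number of observations used
matrix R6 = r(Rho)        // 6 x 6 tetrachoric correlation matrix

* Two-factor iterated principal factors, then an oblique rotation
factormat R6, n(`N') ipf factors(2)
rotate, promax

* Correlation between the two rotated factors
estat common
```

With six items a two-factor model is identified, so the degrees-of-freedom problem of the four-item case does not arise. If the extra items load on a second factor that correlates with the first, that would point to the oblique two-factor structure.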
Jay