The following material is based on a question and answer that appeared on Statalist.
Title:   Stata 6: Simulating multivariate normal observations
Author:  William Gould, StataCorp
Pretend one had random variates C = (c1, c2, ..., ck) that are all orthogonal. One rotates this space via
    Y = AC,    where    A = ( a11  a12  ...  a1k )
                            ( a21  a22  ...  a2k )
                            (  .    .          . )
                            ( ak1  ak2  ...  akk )
where A is an arbitrary but full-rank matrix. Thus we could
. gen c1 = invnorm(uniform())
. gen c2 = invnorm(uniform())
  etc.
and then
. gen y1 = a11*c1 + a12*c2 + ... + a1k*ck
. gen y2 = a21*c1 + a22*c2 + ... + a2k*ck
  etc.
if only we knew the values of A. We want to find values of A such that
    cov(y1,y2) = as desired
    cov(y1,y3) = as desired
    etc.
Let P be the desired covariance matrix. The value of A we seek is the square root (Cholesky decomposition) of P because we want
    P = A*Var(C)*A' = A*A'

since Var(C) = I when the c's are independent standard normal variates.
P can just as well be the desired correlation matrix: a correlation matrix is simply a covariance matrix for variables with unit variances, so it has 1s on the diagonal.
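For example, write r for the desired correlation between two variables. In the 2-by-2 case,

    P = ( 1  r )      A = cholesky(P) = ( 1        0        )
        ( r  1 )                        ( r   sqrt(1 - r^2) )

and

    A*A' = ( 1            r      )  =  P
           ( r   r^2 + (1 - r^2) )

With r = .5, sqrt(1 - r^2) = sqrt(.75) = .8660254, which is where the second coefficient in the example below comes from.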
Thus I could generate 2 variables with correlation .5:
. clear

. matrix P = (1, .5 \ .5, 1)

. mat A = cholesky(P)

. mat list A

A[2,2]
           c1         c2
r1          1          0
r2         .5   .8660254

. set obs 4000
obs was 0, now 4000

. gen c1 = invnorm(uniform())

. gen c2 = invnorm(uniform())

. gen y1 = c1

. gen y2 = .5*c1 + .8660254*c2

. corr y1 y2
(obs=4000)

        |       y1       y2
--------+------------------
     y1 |   1.0000
     y2 |   0.5079   1.0000
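If desired, one can verify within Stata that this A reproduces P up to roundoff; the matrix name check below is arbitrary:

. * check holds A*A', which should match P
. matrix check = A*A'
. mat list check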
Now the names c1, c2, ..., ck for the orthogonal random variates were not chosen at random. Because that is the way Stata, by default, labels the columns of matrices, we could have generated y1 and y2 using the matrix score command:
. matrix a1 = A[1,.]

. matrix score y1 = a1

. matrix a2 = A[2,.]

. matrix score y2 = a2
This is handy because (1) we do not have to type the coefficients, and (2) how much we have to type on a line is not a function of the number of correlated variables we are generating. In fact, given a matrix P, we could make a crude program to generate y1, y2, ...:
program define mkcorr
        version 4.0
        local k = rowsof(P)
        matrix A = cholesky(P)
        local i 1
        quietly {
                /* generate k independent standard normal working variables */
                while `i' <= `k' {
                        gen c`i' = invnorm(uniform())
                        local i = `i' + 1
                }
                /* form each output variable as row i of A times (c1, ..., ck) */
                local i 1
                while `i' <= `k' {
                        matrix row = A[`i',.]
                        matrix score y`i' = row
                        local i = `i' + 1
                }
                /* drop the working variables */
                local i 1
                while `i' <= `k' {
                        drop c`i'
                        local i = `i' + 1
                }
        }
end
This program is crude: the name of the covariance (correlation) matrix is hardcoded (it must be P); the names of the output variables cannot be changed (they are y1, y2, ...); and the names of the working variables are fixed (they are c1, c2, ...).
It does, however, work.
. clear

. matrix P = (1, .25, .5 \ .25, 1, .75 \ .5, .75, 1)

. set obs 4000
obs was 0, now 4000

. mkcorr

. summ

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
      y1 |    4000    .0095433   .9973219  -3.053706   4.167702
      y2 |    4000   -.0059072   .9962081  -3.341381   3.419445
      y3 |    4000   -.0086885   1.009847  -3.448567   4.026851

. corr
(obs=4000)

        |       y1       y2       y3
--------+---------------------------
     y1 |   1.0000
     y2 |   0.2664   1.0000
     y3 |   0.5206   0.7441   1.0000
Obviously, someone could make mkcorr fancier.
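One possible direction is sketched below: a variant (here called mkcorr2; the name and calling convention are just one choice) that takes the matrix name and a stub for the output variable names as arguments. It still assumes, as mkcorr does, that the matrix was created with Stata's default column names c1, c2, ..., since matrix score matches coefficients to variables by column name, and it still uses c1, c2, ... as the working variable names.

program define mkcorr2
        version 6.0
        /* usage:  mkcorr2 matname stub
           matname:  desired covariance (correlation) matrix, assumed to
                     carry Stata's default column names c1, c2, ...
           stub:     prefix for the output variables stub1, stub2, ...   */
        args P stub
        local k = rowsof(`P')
        tempname A row
        matrix `A' = cholesky(`P')
        local i 1
        quietly {
                while `i' <= `k' {
                        gen c`i' = invnorm(uniform())
                        local i = `i' + 1
                }
                local i 1
                while `i' <= `k' {
                        matrix `row' = `A'[`i',.]
                        matrix score `stub'`i' = `row'
                        local i = `i' + 1
                }
                local i 1
                while `i' <= `k' {
                        drop c`i'
                        local i = `i' + 1
                }
        }
end

With the 3x3 matrix P defined above, typing mkcorr2 P z would then create z1, z2, and z3.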