Austin Nichols <[email protected]>:
As for the sparse matrix problem in (A), you can generate a new
variable with all distinct concatenations of rowvar and colvar, then
cycle over the values of that, thereby ignoring the empty cells.
On Tue, May 13, 2008 at 10:18 AM, Sergiy Radyakin
<[email protected]> wrote:
Thank you to all who responded to my request about obtaining a
matrix of means. Besides the answers posted in this thread, I have
received a couple of suggestions privately. To summarize and close the
thread, the suggestions fall roughly into two groups:
A. Obtaining all possible levels of the by-variables, then cycling
through these values and computing the mean for each subgroup. This can
be quite slow, especially for "sparse" matrices, where only a
few non-empty cells exist (for a 50x50 matrix, -summarize- must be
called 2500 times).
B. Using other Stata commands that can produce a matrix of means as a
by-product. Unfortunately, none of them is fast enough either. In
particular, Joseph Coveney suggested using xi to automatically create
all combinations of values and then estimating a univariate
regression. Although the code is very short, it is perhaps the
slowest and demands large amounts of memory.
--------------------------------------------------------------------------------
Sergiy, it'll help us to help you better if you're more specific about the
scope of your problem up front. Austin's original reply suggesting -tabmat-
seemed ideal to me given what you gave the list to go on, and my suggestion
works well for the example that you gave in your post. I took that example
to be illustrative of the scope of the individual summarization that you
want to repeat many times, and for which you therefore want to avoid
-preserve-s, etc.
Austin's point above about concatenating applies to sparse matrix problems
in (B), too: see below for timing of a (B)-approach compared to -table ,
contents(mean )-, which is the benchmark you give in your original post.
Note that -anova , noconstant category()- is used in lieu of -xi: regress ,
noconstant-, because it's more efficient here.
Joseph Coveney
clear *
set matsize 800 // Nothing extraordinary
set memory 10M // Nothing extraordinary
set obs 250000 // I don't know how many you have--is this in the ballpark?
/* A 50 X 50 matrix */
generate byte a = mod(_n, 50)
sort a
generate byte b = mod(_n, 50)
generate float c = uniform()
/* Make that sparse */
foreach var of varlist a b {
    replace `var' = 0 if !inrange(`var', 20, 30)
}
*
timer clear 1
quietly forvalues i = 1/10 {
    timer on 1
    table a b, contents(mean c)
    timer off 1
}
timer clear 2
quietly forvalues i = 1/10 {
    timer on 2
    generate int ab = 100 * a + b // Concatenation
    anova c ab, noconstant category(ab)
    timer off 2
    drop ab
}
timer list
exit
Results:
. timer list
1: 24.29 / 10 = 2.4295
2: 7.62 / 10 = 0.7621
. exit
end of do-file
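
The concatenation idea can be applied to approach (A) as well, so that -summarize- is called once per non-empty cell rather than once per possible cell. What follows is only a minimal sketch under stated assumptions, not code posted by anyone in this thread: the toy data and the names cell, ncells and M are hypothetical, -egen, group()- stands in for the "100 * a + b" concatenation, and it has not been timed against the approaches above.

```stata
* Sketch only: approach (A) restricted to non-empty cells.
* Toy data; the names cell, ncells and M are hypothetical.
clear
set obs 1000
generate byte a = mod(_n, 5)
generate byte b = cond(a < 3, a, 0)   // only 5 of the 25 possible cells are non-empty
generate float c = uniform()

* One integer id per distinct (a, b) combination present in the data
egen long cell = group(a b)
quietly summarize cell, meanonly
local ncells = r(max)

* One row per non-empty cell: value of a, value of b, mean of c
matrix M = J(`ncells', 3, .)
matrix colnames M = a b mean_c
forvalues k = 1/`ncells' {
    quietly summarize a if cell == `k', meanonly
    matrix M[`k', 1] = r(min)
    quietly summarize b if cell == `k', meanonly
    matrix M[`k', 2] = r(min)
    quietly summarize c if cell == `k', meanonly
    matrix M[`k', 3] = r(mean)
}
matrix list M
```

Each pass still scans the whole dataset because of the -if- qualifier, so this removes the empty-cell iterations but not the per-cell cost.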