Thanks again,
The code in the original letter worked OK, but as David points out there
was a problem naming and identifying all the combinations and it didn't
quite solve my problems. What I finally came up with follows.
As I am not comfortable with foreach and forvalues, I did it
the long way, using a sample from the full file.
There were ~9000 persons identified by donorid, and the total number of
different diagnoses was 17.
gen whatever=1
collapse (sum) whatever, by(donorid)
drop whatever
save temp1.dta
save temp2.dta
append using temp1.dta /* 17 times */
sort donorid
egen count=seq(), by(donorid)
save temp2.dta, replace
use original.dta
egen count=seq(), by(donorid)
joinby donorid count using temp2.dta, unm(b)
drop _merge
reshape wide dgn, i(donorid) j(count)
egen combination=concat( dgn_1 ... dgn_17), punct(", ")
egen combination2=ends(combination), punct(", ,") head
drop combination
tab combination2
The resulting frequency table is a starting point for a health state
valuation workgroup, which has to select the most common diseases and
combinations for its work. The full dataset has ~1800 different diag
values, so it's hard to imagine the final frequency table.
Best regards,
Taavi
David Kantor wrote:
This is a different problem. How many possible combinations of k
potential diagnoses are there?
The answer is 2^k, and there is a natural (one-to-one) mapping from
these combinations to the integers from 0 to 2^k -1. But if k is
large, then how do you name all the combinations? You may be stuck
with just using the resulting integer.
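As a small worked illustration of that mapping (the numbers are made up
for the example, not taken from the data): with k = 3 diagnoses numbered
0, 1 and 2, a person who has diagnoses 0 and 2 gets the integer below.

```stata
* Illustration only: k = 3 diagnoses, numbered 0, 1, 2.
* A person with diagnoses 0 and 2 (binary 101):
display 2^0 + 2^2
* All three diagnoses present (binary 111), the largest code, 2^k - 1:
display 2^3 - 1
```

The first line displays 5 and the second 7; a person with no diagnoses
at all maps to 0.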
First, you need to have your diag values to be in a small range of
non-negative integers, such as 0-k (with minimal gaps in this range).
If they already are in such a form, okay. Else, you need to map them
(one-to-one) into such a set of integers. (If your diag values are
string, you can -encode- them and use the encoded values.)
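A minimal sketch of such a mapping, assuming the raw diagnosis is held
in a string variable called diagstr or a numeric variable called diagraw
(both names are hypothetical, as is diagcode):

```stata
* diagstr, diagraw and diagcode are hypothetical variable names.

* Case 1: string diagnosis codes.
encode diagstr, generate(diagcode)   // codes 1, 2, ... in sorted order
gen long diag = diagcode - 1         // shift so the codes start at 0

* Case 2: numeric codes with large gaps (use instead of Case 1).
* egen diag = group(diagraw)         // consecutive codes 1, 2, ...
* replace diag = diag - 1            // shift so the codes start at 0
```

Starting at 0 rather than 1 is optional; it only avoids leaving bit 0
unused.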
Next, suppose that diag is that variable (or one derived from it by an
appropriate mapping).
summ diag
local diagmax = r(max) // the maximal value (corresponds to k in the above)
assert r(min) >= 0 // we really don't want any negatives
sort personid
forvalues n = 0 / `diagmax' {
    egen byte hasdiag`n' = max(diag==`n'), by(personid)
}
/* That is like what I wrote in the previous reply -- but compacted
under a -forvalues-. */
/* At this point you can condense to one observation per person; this
is optional. */
bysort personid: keep if _n==1
/* Generate the identifier of all combinations. */
gen long combination = 0
forvalues n = 0 / `diagmax' {
    replace combination = combination + 2^`n' if hasdiag`n'
}
----
There may be other (better) ways to express that computation.
Also, be warned, I have not tested this.
And, if `diagmax' is large, you may need double rather than long --
for the type of combination.
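As a rough guide, worth checking against Stata's data types
documentation: a long holds integers up to 2,147,483,620, which covers
every possible combination only while `diagmax' is at most 29, and a
double represents integers exactly up to 2^53, which covers `diagmax' up
to 52. Beyond that, a single numeric variable cannot hold the code. The
following sketch could replace the -gen long combination = 0- line; the
local macro ctype is a hypothetical name.

```stata
* Sketch: pick the storage type of combination from `diagmax'.
* Assumes `diagmax' has been set from -summ diag- as above.
if `diagmax' <= 29 {
    local ctype long
}
else if `diagmax' <= 52 {
    local ctype double
}
else {
    display as error "too many diag values for one numeric code"
    exit 198
}
gen `ctype' combination = 0
format combination %16.0f   // show large codes in full, not rounded
```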
If this has been done correctly, then each value of combination should
uniquely correspond to a distinct combination of diag values. The
correspondence is that for each diag value of n, that diag value is
present if and only if there is a 1 in the nth bit of the binary
representation of combination (counting from the right, starting with
0) -- but only when represented as an integer (not float or double).
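A hedged sketch of reading the bits back out, assuming the hasdiag
variables are still in the dataset. The names check and combolabel are
hypothetical, and like the code above this has not been tested.

```stata
* Sketch: confirm that bit n of combination equals hasdiag`n', and
* build a readable label listing the diag values that are present.
gen str1 combolabel = ""
forvalues n = 0 / `diagmax' {
    * bit n of combination, counting from the right, starting with 0
    gen byte check`n' = mod(floor(combination / 2^`n'), 2)
    assert check`n' == hasdiag`n'
    drop check`n'
    * append "n" to the label, comma-separated, when the bit is set
    replace combolabel = combolabel + cond(combolabel == "", "", ", ") + "`n'" if hasdiag`n'
}
tab combolabel
```

The label gives each integer code a name of the form "0, 2, 5", which
is one way around the naming problem when k is not too large.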
----
Again, I hope this helps.