Thank you very much, indeed.
Am Dienstag, den 11.03.2008, 11:46 -0500 schrieb [email protected]:
> Ulrich Kohler <[email protected]> sent me the SQdist1 and SQdist2 matrices
> from the query below.
>
> >> I have two "hand made" distance matrizes, SQdist1 and SQdist2. Both
> >> distance matrizes are essentially identical, with the exception that
> >> they are differently ordered.
> >>
> >> If I perform a cluster analysis using singlelinkage for the two distance
> >> matrizes, I get identical results:
> >>
> >> <cut>
> >>
> >> (The same is true for median-linkage and centroid linkage.)
> >>
> >> However, if I use wards-linkage I get different results for the two
> >> distance matrizes:
> >>
> >> . clustermat wards SQdist1, name(cluster1) add
> >> . clustermat wards SQdist2, name(cluster2) add
> >> . sum *_hgt
> >>
> >> Variable | Obs Mean Std. Dev. Min Max
> >> -------------+--------------------------------------------------------
> >> cluster1_hgt | 53 .7051013 .861406 .1666667 4.414418
> >> cluster2_hgt | 53 .7051013 .8751653 .1666667 4.645984
> >>
> >> Although the difference doesn't seem large, it have led to quite
> >> different groupings in a practical application. Unfortunately, I am not
> >> an expert with cluster analysis. So, please, can anybody explain me why
> >> this happens? If the order of distance matrix matter for
> >> cluster-analysis, what is the "correct" order of the distance matrix,
> >> then?
> >
> > The hierarchical cluster analysis methods start with N groups
> > (each observation is a group). At each step in the process the 2
> > closest groups are merged and this is continued until all
> > observations are in one group. This can be viewed as a
> > dendrogram (cluster tree).
> >
> > My guess is that there are ties in determining the closest 2
> > groups at one or more steps in the process and the order that the
> > data is presented changes which of these ties gets selected for
> > merging together at that step.
> >
> > If Uli would like me to explore this further, he can send me the
> > SQdist1 and SQdist2 matrices and I will report back what I find.
> >
> > Ken Higbee [email protected]
> > StataCorp 1-800-STATAPC
>
> My guess was correct. The difference is due to ties in the
> dissimilarities.
>
> The matrices are 54 x 54 and have lots of ties in the
> dissimilarities. There are 54*53/2 = 1,431 elements in the
> strictly lower triangle of the matrix. Of those 1,431
> dissimilarites, there are only 9 distinct values
>
> value count
> ------------------
> .1666667 48
> .3333333 138
> .5 268
> .6666667 385
> .8333333 295
> 1 205
> 1.166667 62
> 1.333333 23
> 1.5 7
>
> At the first step of the hierarchical clustering there are 48
> ties for smallest dissimilarity. One of these is picked for
> combining 2 of the groups into 1, and the resulting 53 x 53
> dissimilarity matrix is then created from the original 54 x 54
> matrix using the Lance and Williams' recurrence formula. See
> pages 86-87 of "[MV] cluster" in the Version 10 [MV] manual. The
> process is then repeated.
>
> Given the number of ties in the original 54 x 54 dissimilarity
> matrix, I expect that ties for smallest dissimilarity happen
> often during the steps of the algorithm.
>
> Changing the order of the original dissimilarity matrix in this
> example will usually result in different tied pairs being
> combined at different stages of the algorithm.
>
> Why did Uli notice the difference with Ward's linkage and not
> some of the other linkages? Look at the table on page 87 of the
> manual for the recurrence formula. You will see that alpha_i,
> alpha_j, and beta involve the group sizes for group i and group j
> (the groups being combined) and group k. Some of the other
> linkages have simpler forms. I believe this is why the
> differences due to ties is more apparent in his Ward's linkage
> results.
>
> Uli only showed the summary of the _hgt variables. Even for
> those linkages where his _hgt variable had the same summary (same
> mean, min, max, ...), he will probably see that the _hgt and _ord
> variables are different (not the summary, but the actual values)
> from one run to the other if the order has changed and there are
> ties involved. The _ord variable indicates the order the
> clusters are joined in the hierarchy. This will change depending
> on which tie is picked.
>
> Uli also wonders which ordering is "correct". I don't think that
> question has an answer. It is not a matter of one solution being
> correct and the others being incorrect.
>
> The hierarchical clustering methods were not designed to provide
> an "optimal" clustering solution. For that you would have to try
> all possibilities (which is not feasible for most size problems)
> and if you did try all possiblities, you would most often not end
> up with a hierarchical solution (e.g., the optimal 8 group
> solution may not nest the optimal 7 group solution).
>
> Ken Higbee [email protected]
> StataCorp 1-800-STATAPC
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/