I have two "hand made" distance matrices, SQdist1 and SQdist2. The two
distance matrices are essentially identical, except that their rows and
columns are ordered differently.
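One way the claim above could be checked is sketched here in Mata. It is only a sketch under assumptions not stated in the post: that SQdist1 and SQdist2 are symmetric Stata matrices of the same dimension, and that comparing their sorted lower-triangle entries is an adequate first check. Equal sorted entries are necessary, but not sufficient, for one matrix to be a reordering of the other.

```stata
mata:
// Load the two Stata matrices (names taken from the post)
A = st_matrix("SQdist1")
B = st_matrix("SQdist2")

// Stack the lower triangle (diagonal included) of each matrix into a
// column vector and sort it, so the row/column order no longer matters
a = sort(vech(A), 1)
b = sort(vech(B), 1)

// Largest relative difference between the sorted entries;
// a value at or near 0 means both matrices hold the same set of distances
mreldif(a, b)
end
```

A result far from zero would suggest that the matrices differ in more than their ordering.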
If I perform a cluster analysis using single linkage on the two distance
matrices, I get identical results:
. clustermat single SQdist1, name(cluster1) add
. clustermat single SQdist2, name(cluster2) add
. sum *_hgt
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
cluster1_hgt |        53    .2232704     .108128   .1666667   .6666667
cluster2_hgt |        53    .2232704     .108128   .1666667   .6666667
(The same is true for median linkage and centroid linkage.)
However, if I use Ward's linkage, I get different results for the two
distance matrices:
. clustermat wards SQdist1, name(cluster1) add
. clustermat wards SQdist2, name(cluster2) add
. sum *_hgt
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
cluster1_hgt |        53    .7051013     .861406   .1666667   4.414418
cluster2_hgt |        53    .7051013    .8751653   .1666667   4.645984
Although the difference doesn't seem large, it has led to quite
different groupings in a practical application. Unfortunately, I am not
an expert in cluster analysis. So, please, can anybody explain to me why
this happens? If the order of the distance matrix matters for
cluster analysis, what is the "correct" order of the distance matrix,
then?
Many regards
Uli