Ulrich Kohler <[email protected]> asks:
> I have two "hand made" distance matrices, SQdist1 and SQdist2. Both
> distance matrices are essentially identical, with the exception that
> they are differently ordered.
>
> If I perform a cluster analysis using single linkage for the two distance
> matrices, I get identical results:
>
> <cut>
>
> (The same is true for median-linkage and centroid linkage.)
>
> However, if I use Ward's linkage I get different results for the two
> distance matrices:
>
> . clustermat wards SQdist1, name(cluster1) add
> . clustermat wards SQdist2, name(cluster2) add
> . sum *_hgt
>
>     Variable |       Obs        Mean    Std. Dev.       Min        Max
> -------------+--------------------------------------------------------
> cluster1_hgt |        53    .7051013     .861406   .1666667   4.414418
> cluster2_hgt |        53    .7051013    .8751653   .1666667   4.645984
>
> Although the difference doesn't seem large, it has led to quite
> different groupings in a practical application. Unfortunately, I am not
> an expert in cluster analysis. So, please, can anybody explain to me why
> this happens? If the order of the distance matrix matters for
> cluster analysis, what is the "correct" order of the distance matrix,
> then?
The hierarchical cluster analysis methods start with N groups
(each observation is its own group). At each step in the process the
two closest groups are merged, and this continues until all
observations are in one group. The result can be viewed as a
dendrogram (cluster tree).

My guess is that there are ties in determining the two closest
groups at one or more steps in the process, and that the order in
which the data are presented changes which of the tied pairs is
selected for merging at that step.
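
To make the guess concrete, here is a minimal sketch in Python (not
Stata). It assumes a naive agglomerative loop that uses the textbook
Lance-Williams update for Ward's method on squared distances and
breaks ties by taking the first minimum it encounters. The function
names and the four toy points are hypothetical, and nothing here is
claimed to be how clustermat is implemented.

```python
def sq_dist_matrix(points):
    """Squared Euclidean distances between points on a line."""
    return [[(p - q) ** 2 for q in points] for p in points]


def ward_heights(sq_dist):
    """Merge heights from a naive Ward's-linkage pass.

    Ties for the closest pair are broken by scan order (first found),
    which is the assumption that makes the result order dependent.
    """
    n = len(sq_dist)
    d = [row[:] for row in sq_dist]   # working copy of the distances
    size = [1] * n                    # current size of each group
    active = list(range(n))           # indices of groups still alive
    heights = []
    while len(active) > 1:
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                i, j = active[a], active[b]
                # Strict "<" keeps the first of any tied pairs.
                if best is None or d[i][j] < best[0]:
                    best = (d[i][j], i, j)
        h, i, j = best
        heights.append(h)
        # Lance-Williams update for Ward's method: distance from the
        # merged group (stored in slot i) to every other group k.
        for k in active:
            if k in (i, j):
                continue
            total = size[i] + size[j] + size[k]
            new = ((size[i] + size[k]) * d[i][k]
                   + (size[j] + size[k]) * d[j][k]
                   - size[k] * h) / total
            d[i][k] = d[k][i] = new
        size[i] += size[j]
        active.remove(j)
    return heights


if __name__ == "__main__":
    # Same four points, two orderings; 0-1 and 1-2 are tied at 1.
    h1 = ward_heights(sq_dist_matrix([0.0, 1.0, 2.0, 3.5]))
    h2 = ward_heights(sq_dist_matrix([1.0, 2.0, 0.0, 3.5]))
    print([round(h, 6) for h in h1])  # [1.0, 2.25, 10.125]
    print([round(h, 6) for h in h2])  # [1.0, 3.0, 9.375]
```

In this toy example the first merge is a tie between the pairs (0, 1)
and (1, 2). The ordering decides which pair is taken, and the later
heights differ as a result, even though the minimum height and the
sum of the heights are the same in both runs.
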
If Uli would like me to explore this further, he can send me the
SQdist1 and SQdist2 matrices and I will report back what I find.
Ken Higbee [email protected]
StataCorp 1-800-STATAPC