Re: st: "testing" a cluster analysis
Hi, Ronán. Your point is well-taken, and a conventional hypothesis test might
not be the best tool. I have already analyzed the data more formally with
OLS, but one of my advisors suggested I see how the observations cluster
with respect to these seven binary indicators. So, I started playing around
with different techniques for clustering observations. Now, I am trying to
decide--"scientistically," I realize--just how well-defined/tight/distinct
the clusters would be from one another if I clustered the data into 5
clusters. (Then, I might do the same thing with fewer or more clusters.) I
am not truly testing a hypothesis; I am looking for some basis on which to
decide just how many clusters there may be in the dataset...
Does that make the original question any more valid, and if so, is there a
way to do what I'm thinking...either by examining means, as I suggested, or
some better way? adam
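
One possible basis for choosing the number of clusters, offered here only as a hedged sketch and not as anything proposed in this thread, is a stopping rule such as the Calinski-Harabasz pseudo-F, which Stata reports through -cluster stop-. The example assumes the seven indicators are named var1 through var7 (only var1 is named in the thread), uses k-means with its default Euclidean measure as a stand-in for whichever clustering method was actually run, and picks the analysis names km2 ... km8 and the random seed arbitrarily.

```stata
* Sketch only: var1-var7, the names km2..km8, and the seed are assumptions.
* Fit k-means solutions for k = 2, ..., 8 and report the
* Calinski-Harabasz pseudo-F for each one.
forvalues k = 2/8 {
    * krandom(2007) fixes the random starting centers so runs are repeatable
    cluster kmeans var1-var7, k(`k') start(krandom(2007)) name(km`k')
    * rule(calinski) is the default stopping rule; stated here for clarity
    cluster stop km`k', rule(calinski)
}
```

Larger pseudo-F values point to more distinct groupings, so the values can be compared across k. The index is descriptive rather than a hypothesis test, and rerunning the loop requires dropping the stored analyses first (for example with -cluster drop-), because the names would already exist.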
From: Ronán Conroy <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: st: "testing" a cluster analysis
Date: Wed, 7 Feb 2007 10:34:58 +0000
On 7 Feb 2007, at 00:06, Adam Seth Litwin wrote:
> Hello. I just ran a cluster analysis, not a technique I use frequently.
> I have seven binary variables forming, at the moment, five clusters. I
> thought a useful exercise would be the following:
>
> For each of the seven variables, examine its mean in all five clusters.
> Then, run an F-test to show that the means are not equal across all five
> clusters.
>
> So, for example, I type
> - tabstat var1, by(CLUSTER) stat(n mean)
> But, I'm not sure how to run the F-test.

Careful. An analysis of variance is a hypothesis test. The model is
specified in advance and the anova calculates the values of the model
parameters.
In your case, the model was generated from the data. The usual
interpretation of the F ratio does not apply.
Cluster analysis is an exploratory technique. You need to think about
validating the clustering by showing that the clusters differ on variables
which were not used in the clustering but which are theoretically related
to the cluster process.
For example, if you use clustering to define five clusters of people based
on the type and frequency of their social interactions, then you would
expect that the clusters would differ on things like loneliness and
perceived social support, and you would hope that they differed in
dimensions like mood or (headline from this month's Archives of General
Psychiatry) risk of Alzheimer's disease.
So I'd forget the F-test and start validating the clusters. Your
hypothesis is that the clusters are different from each other in some
respect other than the variables you clustered on.
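
A minimal sketch of what that validation might look like in Stata follows. It assumes CLUSTER holds the cluster membership, as in the -tabstat- command quoted above; lonely and support are hypothetical variable names standing in for a continuous and a categorical measure that were not used to form the clusters.

```stata
* Sketch only: lonely and support are hypothetical external variables,
* i.e. measures that played no part in forming the clusters.

* Continuous external variable: cluster means, then a one-way ANOVA
tabstat lonely, by(CLUSTER) stat(n mean sd)
oneway lonely CLUSTER, tabulate

* Categorical external variable: cross-tabulation with a chi-squared test
tabulate CLUSTER support, chi2 row
```

Because these variables were not used in the clustering, the usual interpretation of the F ratio and the chi-squared statistic applies to them in a way that it does not for the seven clustering indicators.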
=========
Ronán Conroy
Royal College of Surgeons in Ireland
[email protected]
+353 (0) 1 402 2431
+353 (0) 87 799 97 95
http://www.flickr.com/photos/ronanconroy
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/