Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: cluster analysis: differences between clusters
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: cluster analysis: differences between clusters
Date: Fri, 7 Mar 2014 11:47:45 +0000
I don't think this would mean very much. Once clusters have been
identified from the data, you are no longer in a sound position to feed
them back into a significance test on those same data.
Here's an analogue. Suppose I split -mpg- from the auto data at the
mean and then compare means for higher and lower values. (This isn't
an especially good clustering method, but let that pass.)
. sysuse auto
(1978 Automobile Data)
. su mpg, meanonly
. gen highlow = mpg > r(mean)
. ttest mpg, by(highlow)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      43     17.4186    .3768385    2.471095    16.65811     18.1791
       1 |      31    26.67742    .8313574    4.628802    24.97956    28.37528
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -9.258815    .8326686               -10.91871    -7.59892
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -11.1194
Ho: diff = 0                                     degrees of freedom =       72

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
Researchers are usually happy with a t statistic and P-value like that,
but here they are worthless. I just showed that "higher mpg" cars are
systematically different from "lower mpg" cars, but that's inevitable:
the groups were defined by the very variable being tested. I am just
seeing the consequences of what I identified on purpose. I can do that
with random numbers too.
. set seed 2803
. gen y = runiform()
. gen high = y > 0.5
. ttest y, by(high)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      39     .229432     .022727    .1419298    .1834237    .2754404
       1 |      35    .7746163    .0253076    .1497215    .7231851    .8260474
---------+--------------------------------------------------------------------
combined |      74    .4872894    .0360238    .3098884    .4154941    .5590848
---------+--------------------------------------------------------------------
    diff |           -.5451842    .0339151               -.6127928   -.4775757
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -16.0750
Ho: diff = 0                                     degrees of freedom =       72

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
Now this is not exactly what you have in mind, but I'd like to hear how
your procedure would be justified any differently. Naturally, I am not
saying that clusters from a cluster analysis might not be interesting or
useful, just that inference is not best done this way.
Nor am I saying that there is no way of identifying how far clusters
can be trusted, but I think you need some simulations from a plausible
stochastic process to provide a benchmark. As shown, you can chop
random noise into clusters and the clusters will be different.
Nick
[email protected]
On 7 March 2014 11:32, Andrea Jaberg <[email protected]> wrote:
> Dear statalist-users
>
> I performed optimal matching for all sequences in the dataset against
> all others using -sqom- and the option -full-. Afterwards I grouped
> them using cluster analysis. Now I'd like to test whether the clusters
> are reliably different. My first thought was using ANOVA. However,
> this seems not possible since I compared all sequences against each
> other which results in a distance matrix.
> What do you suggest in order to test whether the differences between
> clusters are significant?
>
> Thank you for your help
> Andrea
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/