Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: cluster analysis: differences between clusters

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: cluster analysis: differences between clusters
Date	Fri, 7 Mar 2014 11:47:45 +0000

I don't think this would mean very much. Once clusters have been
identified from the data, you are not really in a sound position to
take them back into a significance test on those same data.

Here's an analogue. Suppose I split -mpg- from the auto data at the
mean and then compare means for higher and lower values. (This isn't
an especially good clustering method, but let that pass.)

. sysuse auto
(1978 Automobile Data)

. su mpg, meanonly

. gen highlow = mpg > r(mean)

. ttest mpg, by(highlow)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      43     17.4186    .3768385    2.471095    16.65811     18.1791
       1 |      31    26.67742    .8313574    4.628802    24.97956    28.37528
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -9.258815    .8326686               -10.91871    -7.59892
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -11.1194
Ho: diff = 0                                     degrees of freedom =       72

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Researchers are usually happy with that kind of t-test and P-value,
but it's worthless. I just showed that "higher mpg" cars are
systematically different from "lower mpg" cars, but that's inevitable:
I am just seeing the consequences of a split that I made on purpose. I
can do that with random numbers too.

. set seed 2803

. gen y = runiform()

. gen high = y > 0.5

. ttest y, by(high)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      39     .229432     .022727    .1419298    .1834237    .2754404
       1 |      35    .7746163    .0253076    .1497215    .7231851    .8260474
---------+--------------------------------------------------------------------
combined |      74    .4872894    .0360238    .3098884    .4154941    .5590848
---------+--------------------------------------------------------------------
    diff |           -.5451842    .0339151               -.6127928   -.4775757
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -16.0750
Ho: diff = 0                                     degrees of freedom =       72

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Now this is not what you're imagining, but I'd like to hear how your
procedure is justified otherwise. Naturally, I am not saying that
clusters from a cluster analysis might not be interesting or useful,
just that inference is not best done this way.

Nor am I saying that there is no way of identifying how far clusters
can be trusted, but I think you need some simulations from a plausible
stochastic process to provide a benchmark. As shown, you can chop
random noise into clusters and the clusters will be different.
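
A minimal sketch of what such a benchmark could look like, under
assumptions that are not in the original post: the "plausible
stochastic process" is taken to be plain Gaussian noise, the program
name -splitnoise- is hypothetical, and 1000 replications is an
arbitrary choice. Each replication draws 74 noise values, splits them
at their mean and records the resulting t statistic, so that the
collected values show what t looks like when there is no structure at
all.

```stata
* Hypothetical helper (not from the original post):
* draw pure noise, split it at its mean, return the t statistic
capture program drop splitnoise
program define splitnoise, rclass
    drop _all
    set obs 74                          // same sample size as the auto data
    generate y = rnormal()              // assumed null process: Gaussian noise
    summarize y, meanonly
    generate byte high = y > r(mean)    // "clusters" defined from the data
    quietly ttest y, by(high)
    return scalar t = r(t)
end

set seed 2803
* collect the t statistic from 1000 replications (arbitrary number)
simulate t = r(t), reps(1000) nodots: splitnoise

* reference distribution of t under pure noise
summarize t, detail
```

Every simulated t is negative by construction, because group 0 holds
the values at or below the mean. An observed t would therefore need to
be judged against this simulated distribution, not against the usual
t distribution centred on zero. For a real cluster analysis the
program body would have to be replaced by the actual data-generating
assumptions and clustering steps.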

Nick
[email protected]


On 7 March 2014 11:32, Andrea Jaberg <[email protected]> wrote:
> Dear statalist-users
>
> I performed optimal matching for all sequences in the dataset against
> all others using -sqom- and the option -full-. Afterwards I grouped
> them using cluster analysis. Now I'd like to test whether the clusters
> are reliably different. My first thought was using ANOVA. However,
> this seems not possible since I compared all sequences against each
> other which results in a distance matrix.
> What do you suggest in order to test whether the differences between
> clusters are significant?
>
> Thank you for your help
> Andrea
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
