
Re: st: cluster analysis: differences between clusters


From   Andrea Jaberg <[email protected]>
To   [email protected]
Subject   Re: st: cluster analysis: differences between clusters
Date   Fri, 7 Mar 2014 13:08:12 +0100

Thanks for your help, Nick.

2014-03-07 13:02 GMT+01:00 Nick Cox <[email protected]>:
> I should also remind you of the request to explain the provenance of
> user-written commands you refer to, in this case -sqom-.
>
> Nick
> [email protected]
>
>
> On 7 March 2014 11:47, Nick Cox <[email protected]> wrote:
>> I don't think this would mean very much. Once clusters have been
>> identified from the data, you are not really in a sound position to
>> take them back into a significance test.
>>
>> Here's an analogue. Suppose I split -mpg- from the auto data at the
>> mean and then compare means for higher and lower values. (This isn't
>> an especially good clustering method, but let that pass.)
>>
>> . sysuse auto
>> (1978 Automobile Data)
>>
>> . su mpg, meanonly
>>
>> . gen highlow = mpg > r(mean)
>>
>> . ttest mpg, by(highlow)
>>
>> Two-sample t test with equal variances
>> ------------------------------------------------------------------------------
>>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
>> ---------+--------------------------------------------------------------------
>>        0 |      43     17.4186    .3768385    2.471095    16.65811     18.1791
>>        1 |      31    26.67742    .8313574    4.628802    24.97956    28.37528
>> ---------+--------------------------------------------------------------------
>> combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
>> ---------+--------------------------------------------------------------------
>>     diff |           -9.258815    .8326686               -10.91871    -7.59892
>> ------------------------------------------------------------------------------
>>     diff = mean(0) - mean(1)                                      t = -11.1194
>> Ho: diff = 0                                     degrees of freedom =       72
>>
>>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>>
>> Researchers are usually happy with that kind of t-test and P-value,
>> but here it is worthless. I just showed that "higher mpg" cars are
>> systematically different from "lower mpg" cars, but that is inevitable:
>> I am only seeing the consequences of a split I imposed on purpose. I
>> can do the same with random numbers.
>>
>> . set seed 2803
>>
>> . gen y = runiform()
>>
>> . gen high = y > 0.5
>>
>> . ttest y, by(high)
>>
>> Two-sample t test with equal variances
>> ------------------------------------------------------------------------------
>>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
>> ---------+--------------------------------------------------------------------
>>        0 |      39     .229432     .022727    .1419298    .1834237    .2754404
>>        1 |      35    .7746163    .0253076    .1497215    .7231851    .8260474
>> ---------+--------------------------------------------------------------------
>> combined |      74    .4872894    .0360238    .3098884    .4154941    .5590848
>> ---------+--------------------------------------------------------------------
>>     diff |           -.5451842    .0339151               -.6127928   -.4775757
>> ------------------------------------------------------------------------------
>>     diff = mean(0) - mean(1)                                      t = -16.0750
>> Ho: diff = 0                                     degrees of freedom =       72
>>
>>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>>
>> Now this is not exactly what you are proposing, but I'd like to hear
>> how your procedure is justified if it differs. Naturally, I am not
>> saying that clusters from a cluster analysis cannot be interesting or
>> useful, just that inference is not best done this way.
>>
>> Nor am I saying that there is no way of assessing how far clusters
>> can be trusted, but I think you need some simulations from a plausible
>> stochastic process to provide a benchmark. As shown, you can chop
>> random noise into clusters and the clusters will be different.
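>>
>> A minimal sketch of that benchmark idea, purely for illustration (it
>> assumes a single clustering variable, k-means with 3 groups, and 200
>> noise replications; it simply displays the Calinski-Harabasz pseudo-F
>> from -cluster stop- for the real data and for each noise run, so the
>> two can be compared):
>>
>> sysuse auto, clear
>> cluster kmeans mpg, k(3) name(obs)
>> cluster stop obs               // pseudo-F for the observed clusters
>>
>> set seed 2803
>> forvalues i = 1/200 {
>>     gen z = rnormal()          // pure noise, no cluster structure
>>     cluster kmeans z, k(3) name(sim)
>>     cluster stop sim           // pseudo-F for clusters cut from noise
>>     cluster drop sim
>>     capture drop sim           // remove the grouping variable if it remains
>>     drop z
>> }
>>
>> If the observed pseudo-F sits comfortably inside the range produced by
>> the noise runs, the clusters have not earned much trust.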
>>
>> Nick
>> [email protected]
>>
>>
>> On 7 March 2014 11:32, Andrea Jaberg <[email protected]> wrote:
>>> Dear statalist-users
>>>
>>> I performed optimal matching of every sequence in the dataset against
>>> every other sequence using -sqom- with the -full- option. I then
>>> grouped the sequences with a cluster analysis (a rough sketch of that
>>> step is below). Now I'd like to test whether the clusters are reliably
>>> different. My first thought was ANOVA, but that does not seem
>>> possible, since comparing all sequences against one another yields a
>>> distance matrix rather than an outcome variable.
>>> What do you suggest for testing whether the differences between the
>>> clusters are significant?
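>>>
>>> A rough sketch of the clustering step, with placeholder names (D here
>>> stands for the square matrix of pairwise OM distances; Ward linkage
>>> cut at four groups is only an example):
>>>
>>> clustermat wardslinkage D, shape(full) add name(om)
>>> cluster generate grp4 = groups(4), name(om)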
>>>
>>> Thank you for your help
>>> Andrea
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

