Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: cluster analysis: differences between clusters
From
Andrea Jaberg <[email protected]>
To
[email protected]
Subject
Re: st: cluster analysis: differences between clusters
Date
Fri, 7 Mar 2014 13:08:12 +0100
Thanks for your help Nick.
2014-03-07 13:02 GMT+01:00 Nick Cox <[email protected]>:
> I should also remind you of the request to explain the provenance of
> user-written commands you refer to, in this case -sqom-.
>
> Nick
> [email protected]
>
>
> On 7 March 2014 11:47, Nick Cox <[email protected]> wrote:
>> I don't think this would mean very much. Once clusters are identified
>> from the data, you are not really in a sound position to take them
>> back into a significance test.
>>
>> Here's an analogue. Suppose I split -mpg- from the auto data at the
>> mean and then compare means for higher and lower values. (This isn't
>> an especially good clustering method, but let that pass.)
>>
>> . sysuse auto
>> (1978 Automobile Data)
>>
>> . su mpg, meanonly
>>
>> . gen highlow = mpg > r(mean)
>>
>> . ttest mpg, by(highlow)
>>
>> Two-sample t test with equal variances
>> ------------------------------------------------------------------------------
>>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
>> ---------+--------------------------------------------------------------------
>>        0 |      43     17.4186    .3768385    2.471095    16.65811     18.1791
>>        1 |      31    26.67742    .8313574    4.628802    24.97956    28.37528
>> ---------+--------------------------------------------------------------------
>> combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
>> ---------+--------------------------------------------------------------------
>>     diff |           -9.258815    .8326686               -10.91871    -7.59892
>> ------------------------------------------------------------------------------
>>     diff = mean(0) - mean(1)                                      t = -11.1194
>> Ho: diff = 0                                     degrees of freedom =       72
>>
>>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>>
>> Researchers are usually happy with that kind of t-test and P-value,
>> but it's worthless. I just showed that "higher mpg" cars are
>> systematically different from "lower mpg" cars, but that's inevitable.
>> I am just seeing the consequences of what I identified on purpose. I
>> can do that with random numbers too.
>>
>> . set seed 2803
>>
>> . gen y = runiform()
>>
>> . gen high = y > 0.5
>>
>> . ttest y, by(high)
>>
>> Two-sample t test with equal variances
>> ------------------------------------------------------------------------------
>>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
>> ---------+--------------------------------------------------------------------
>>        0 |      39     .229432     .022727    .1419298    .1834237    .2754404
>>        1 |      35    .7746163    .0253076    .1497215    .7231851    .8260474
>> ---------+--------------------------------------------------------------------
>> combined |      74    .4872894    .0360238    .3098884    .4154941    .5590848
>> ---------+--------------------------------------------------------------------
>>     diff |           -.5451842    .0339151               -.6127928   -.4775757
>> ------------------------------------------------------------------------------
>>     diff = mean(0) - mean(1)                                      t = -16.0750
>> Ho: diff = 0                                     degrees of freedom =       72
>>
>>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>>
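The size of the t statistic in the random-number example can be anticipated without looking at any data. The following is a rough back-of-envelope check, assuming y is exactly Uniform(0,1) and taking the group sizes 39 and 35 from the output above; the scalar names are illustrative only. Each half of a uniform split at 0.5 has theoretical mean 0.25 or 0.75 and standard deviation 0.5/sqrt(12).

```stata
* Back-of-envelope check, assuming y ~ Uniform(0,1) split at 0.5
* (scalar names are illustrative, not from the original post)
scalar sd_half  = 0.5/sqrt(12)                // SD of a uniform on an interval of width 0.5
scalar se_diff  = sd_half*sqrt(1/39 + 1/35)   // SE of the difference, group sizes 39 and 35
scalar t_approx = (0.25 - 0.75)/se_diff       // theoretical half means are 0.25 and 0.75
display "SD within each half: " %6.4f sd_half
display "approximate t:       " %6.2f t_approx
```

The approximate t comes out close to -15, in the same range as the observed -16.0750, so a statistic of that size follows from the way the groups were defined rather than from anything in the data.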
>> Now this is not exactly what you have in mind, but I'd like to hear how your
>> procedure would otherwise be justified. Naturally, I am not saying that
>> clusters from a cluster analysis might not be interesting or useful,
>> just that inference is not best done this way.
>>
>> Nor am I saying that there is no way of identifying how far clusters
>> can be trusted, but I think you need some simulations from a plausible
>> stochastic process to provide a benchmark. As shown, you can chop
>> random noise into clusters and the clusters will be different.
>>
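A minimal sketch of what such a benchmark might look like for the split-at-the-mean analogue, not for the -sqom- clustering itself. It assumes standard normal noise as the stochastic process and 74 observations to match the auto data; both are placeholders, and the program name splitnoise is hypothetical. For real clusters the same idea would require simulating data from a process thought plausible for the problem and rerunning the whole clustering step in each replication.

```stata
* Hypothetical helper: draw pure noise, split it at its own mean,
* and return the two-sample t statistic
capture program drop splitnoise
program define splitnoise, rclass
    version 12
    syntax [, n(integer 74)]
    drop _all
    set obs `n'
    generate double y = rnormal()        // assumed noise model: standard normal
    summarize y, meanonly
    generate byte high = y > r(mean)     // "cluster" defined from the data themselves
    ttest y, by(high)
    return scalar t = r(t)
end

set seed 2803
simulate t = r(t), reps(1000) nodots: splitnoise, n(74)

* Distribution of t when there is nothing but noise to find
summarize t, detail
```

An observed statistic would then be judged against the percentiles of this simulated distribution, not against the usual Student t reference distribution, which takes no account of the groups having been formed from the same variable.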
>> Nick
>> [email protected]
>>
>>
>> On 7 March 2014 11:32, Andrea Jaberg <[email protected]> wrote:
>>> Dear statalist-users
>>>
>>> I performed optimal matching for all sequences in the dataset against
>>> all others using -sqom- and the option -full-. Afterwards I grouped
>>> them using cluster analysis. Now I'd like to test whether the clusters
>>> are reliably different. My first thought was using ANOVA. However,
>>> this seems not possible since I compared all sequences against each
>>> other which results in a distance matrix.
>>> What do you suggest in order to test whether the differences between
>>> clusters are significant?
>>>
>>> Thank you for your help
>>> Andrea
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>> * http://www.ats.ucla.edu/stat/stata/