Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: cluster analysis: differences between clusters
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: cluster analysis: differences between clusters
Date: Fri, 7 Mar 2014 12:02:40 +0000

I should also remind you of the request to explain the provenance of
user-written commands you refer to, in this case -sqom-.
Nick
[email protected]
On 7 March 2014 11:47, Nick Cox <[email protected]> wrote:
> I don't think this would mean very much. Once clusters have been
> identified from the data, you are not really in a sound position to
> take them back to the same data for a significance test.
>
> Here's an analogue. Suppose I split -mpg- from the auto data at the
> mean and then compare means for higher and lower values. (This isn't
> an especially good clustering method, but let that pass.)
>
> . sysuse auto
> (1978 Automobile Data)
>
> . su mpg, meanonly
>
> . gen highlow = mpg > r(mean)
>
> . ttest mpg, by(highlow)
>
> Two-sample t test with equal variances
> ------------------------------------------------------------------------------
>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
> ---------+--------------------------------------------------------------------
>        0 |      43     17.4186    .3768385    2.471095    16.65811     18.1791
>        1 |      31    26.67742    .8313574    4.628802    24.97956    28.37528
> ---------+--------------------------------------------------------------------
> combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
> ---------+--------------------------------------------------------------------
>     diff |           -9.258815    .8326686               -10.91871    -7.59892
> ------------------------------------------------------------------------------
>     diff = mean(0) - mean(1)                                      t = -11.1194
> Ho: diff = 0                                     degrees of freedom =       72
>
>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>
> Researchers are usually happy with that kind of t-test and P-value,
> but it's worthless. I just showed that "higher mpg" cars are
> systematically different from "lower mpg" cars, but that's inevitable.
> I am just seeing the consequences of what I identified on purpose. I
> can do that with random numbers too.
>
> . set seed 2803
>
> . gen y = runiform()
>
> . gen high = y > 0.5
>
> . ttest y, by(high)
>
> Two-sample t test with equal variances
> ------------------------------------------------------------------------------
>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
> ---------+--------------------------------------------------------------------
>        0 |      39     .229432     .022727    .1419298    .1834237    .2754404
>        1 |      35    .7746163    .0253076    .1497215    .7231851    .8260474
> ---------+--------------------------------------------------------------------
> combined |      74    .4872894    .0360238    .3098884    .4154941    .5590848
> ---------+--------------------------------------------------------------------
>     diff |           -.5451842    .0339151               -.6127928   -.4775757
> ------------------------------------------------------------------------------
>     diff = mean(0) - mean(1)                                      t = -16.0750
> Ho: diff = 0                                     degrees of freedom =       72
>
>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>
> This is not exactly what you are proposing, but I'd like to hear how
> your procedure would be justified otherwise. Naturally, I am not saying
> that clusters from a cluster analysis might not be interesting or useful,
> just that inference is not best done this way.
>
> Nor am I saying that there is no way of identifying how far clusters
> can be trusted, but I think you need some simulations from a plausible
> stochastic process to provide a benchmark. As shown, you can chop
> random noise into clusters and the clusters will be different.
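
A minimal sketch of such a benchmark, extending the random-number example above: repeat the "split noise at its mean, then t-test" procedure many times and look at the distribution of the resulting statistics. The program name -splitnoise-, the 1000 replications, the sample size of 74 and the choice of uniform noise are all assumptions made for illustration; for a real cluster analysis you would substitute a stochastic process that is plausible for your data and the clustering step you actually used.

```stata
* Sketch only: null distribution of the "split at the mean, then t-test"
* statistic when the data are pure noise with no group structure.
clear all
set seed 2803

* Hypothetical helper: one replication on 74 uniform random numbers
program define splitnoise, rclass
    drop _all
    set obs 74
    generate y = runiform()
    summarize y, meanonly
    * groups are defined from the data themselves, as in the example above
    generate byte high = y > r(mean)
    ttest y, by(high)
    return scalar t = r(t)
    return scalar p = r(p)
end

* Collect the t statistic and two-sided P-value over 1000 replications
simulate t = r(t) p = r(p), reps(1000) nodots: splitnoise

summarize t p
count if p < 0.05
```

If the reasoning above is right, one would expect nearly every replication to show a "significant" difference even though the data are noise, so the nominal 5% level tells you nothing here. The simulated distribution of t, not the textbook t distribution, is the yardstick against which an observed statistic would have to be judged.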
>
> Nick
> [email protected]
>
>
> On 7 March 2014 11:32, Andrea Jaberg <[email protected]> wrote:
>> Dear statalist-users
>>
>> I performed optimal matching for all sequences in the dataset against
>> all others using -sqom- and the option -full-. Afterwards I grouped
>> them using cluster analysis. Now I'd like to test whether the clusters
>> are reliably different. My first thought was to use ANOVA. However,
>> this does not seem possible, since I compared all sequences against
>> each other, which results in a distance matrix.
>> What do you suggest in order to test whether the differences between
>> clusters are significant?
>>
>> Thank you for your help
>> Andrea
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/