The fact that you get the same results with -group1d- and a k-means
approach is good fortune, as k-means methods don't guarantee that an
optimum will be found.
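A minimal sketch of that point: run -cluster kmeans- from two different sets of random starting centres and compare the within-group sums of squares. The data here are simulated stand-ins for var1, and the seeds 101 and 202 are arbitrary choices. Whether the two solutions actually differ will depend on the data.

```stata
* simulated data, NOT the 49 observations from the original post
clear
set seed 2007
set obs 49
generate double var1 = rnormal(50, 10)

* same k, same measure, different random starting centres
cluster kmeans var1, k(4) generate(g1) start(krandom(101))
cluster kmeans var1, k(4) generate(g2) start(krandom(202))
tabulate g1 g2

* within-group sum of squares for each solution
foreach g in g1 g2 {
    bysort `g': egen double m_`g' = mean(var1)
    egen double ss_`g' = total((var1 - m_`g')^2)
}
summarize ss_g1 ss_g2
```

If the two sums of squares differ, at least one of the runs has stopped at a solution that is not the best possible.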
The main point of -group1d- is that it produces classes that are
contiguous intervals in one dimension. In contrast, -cluster- has no
notion of contiguity.
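Whether a given grouping does consist of contiguous intervals can be checked directly. The sketch below uses simulated data and a hypothetical grouping variable g. It sorts on the variable and counts how often the group changes from one observation to the next: classes that are contiguous intervals change exactly (number of groups - 1) times, and any more changes means the classes interleave.

```stata
* simulated data and a hypothetical grouping variable g
clear
set seed 2007
set obs 49
generate double var1 = rnormal(50, 10)
cluster kmeans var1, k(4) generate(g) start(krandom(101))

* count group changes along the sorted variable
sort var1 g
generate byte change = g != g[_n-1] if _n > 1
quietly count if change == 1
local nchanges = r(N)
quietly tabulate g
local ngroups = r(r)
display "group changes along var1: `nchanges'; groups: `ngroups'"
display "contiguous intervals if changes = groups - 1"
```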
Your main question is about -cluster- and is best left to Ken Higbee, I
suspect.
Nick
Ada Ma
Thanks to Nick for introducing me to this wonderful command -group1d-.
It's exactly what I was looking for.
I have some further questions, which I hope someone can help me
understand. I was also playing around with the -cluster kmeans-
command and found that -group1d- generates the same groupings as
-cluster kmeans- with the option -measure(L2squared)- applied.
I then compared the results of -cluster kmeans- with and without the
-measure(L2squared)- option specified. The resulting groupings are
different. I don't really understand why this should be the case for
univariate clustering, because when I typed:
help measure_option
(note the underscore between the words measure and option; without
the underscore a different help file will show up)
It is explained that the default option calculates the grouping by
minimising the following measure, described in the help file as:
    requests the Euclidean distance / Minkowski distance metric
    with argument 2
        sqrt(sum((x_ia - x_ja)^2))
But when the option -measure(L2squared)- is specified, the
grouping is assigned by minimising the square of the Euclidean
distance / Minkowski distance metric with argument 2:
        sum((x_ia - x_ja)^2)
Here is some output generated using the same 49 observations:
. cluster kmeans var1, k(4) generate(euclid)
cluster name: _clus_5
. cluster kmeans var1, k(4) generate(euclidsq) measure(L2squared)
cluster name: _clus_1
. tab euclid euclidsq
           |                  euclidsq
    euclid |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         1 |        10          0          0          0 |        10
         2 |         0          0         12          0 |        12
         3 |         0          4          0          6 |        10
         4 |         9          0          0          8 |        17
-----------+--------------------------------------------+----------
     Total |        19          4         12         14 |        49
. bys euclid: egen m_euclid=mean(var1)
. bys euclidsq: egen m_euclidsq=mean(var1)
. egen tot1euclid=total((var1-m_euclid)^2)
. egen tot1euclidsq=total((var1-m_euclidsq)^2)
. sum tot*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
  tot1euclid |        49    712.2434            0  712.2434   712.2434
tot1euclidsq |        49    524.9169            0  524.9169   524.9169
. di sqrt(712.2434 )
26.687889
. di sqrt( 524.9169 )
22.911065
The groupings generated with the option -measure(L2squared)- are
superior to those generated without it. This shouldn't be the case
for univariate clustering, or should it? Have I missed something
important?
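One thing that may be worth ruling out is the starting centres. As far as the -cluster kmeans- documentation indicates, k observations are chosen at random as starting centres unless start() is specified, so two runs can differ for that reason alone. A minimal sketch on simulated data (not the 49 observations above) fixes the start with start(firstk), so that only the measure differs between the two runs:

```stata
* simulated data, NOT the 49 observations from the post
clear
set seed 2007
set obs 49
generate double var1 = rnormal(50, 10)

* identical starting centres (the first k observations) for both runs
cluster kmeans var1, k(4) generate(euclid) start(firstk)
cluster kmeans var1, k(4) generate(euclidsq) measure(L2squared) start(firstk)
tabulate euclid euclidsq
```

Squaring preserves the ordering of non-negative distances, so the nearest group mean is the same under either measure. With identical starting centres the two groupings would therefore be expected to agree, which would point to the starting centres rather than the measure as the source of the difference.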