Thanks to Nick for introducing me to this wonderful command -group1d-.
It's exactly what I was looking for.
I have some further questions that I hope someone can help me to
understand. I was also playing around with the -cluster kmeans-
command and found that -group1d- generates the same groupings as
-cluster kmeans- with the option -measure(L2squared)- applied.
I then compared the results of -cluster kmeans- with and without the
-measure(L2squared)- option. The resulting groupings are different. I
don't understand why this should be the case for univariate
clustering. When I typed

help measure_option

(note the underscore between the words measure and option; without
the underscore a different help file shows up), the help explains that
the default measure is the Euclidean distance / Minkowski distance
metric with argument 2:

sqrt(sum((x_ia - x_ja)^2))

When the option -measure(L2squared)- is specified, the measure is
the square of the Euclidean distance / Minkowski distance metric with
argument 2:

sum((x_ia - x_ja)^2)
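
As a side check on the two formulas, here is a minimal sketch in Stata for a single variable. The four values of x and the two centres c1 and c2 are made up for illustration and are not taken from the data. The only point is that, in one dimension, a distance and its square should pick the same nearest centre, because squaring does not change the ordering of non-negative numbers.

```stata
clear
input double x
92.2
107.5
116.9
133.6
end

* two hypothetical centres, chosen only for illustration
scalar c1 = 100
scalar c2 = 125

* Euclidean (L2) distance to each centre
generate double d1 = sqrt((x - c1)^2)
generate double d2 = sqrt((x - c2)^2)

* squared Euclidean (L2squared) distance to each centre
generate double d1sq = (x - c1)^2
generate double d2sq = (x - c2)^2

* nearest centre under each measure
generate byte near_l2   = cond(d1 <= d2, 1, 2)
generate byte near_l2sq = cond(d1sq <= d2sq, 1, 2)

* stops with an error if the two measures ever disagree
assert near_l2 == near_l2sq
list, clean
```

If that holds in general, the choice between the two measures would not by itself be expected to change which centre each observation is assigned to when there is only one variable.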
Here is some output generated using the same 49 observations:
. cluster kmeans var1, k(4) generate(euclid)
cluster name: _clus_5
. cluster kmeans var1, k(4) generate(euclidsq) measure(L2squared)
cluster name: _clus_1
. tab euclid euclidsq

           |                  euclidsq
    euclid |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         1 |        10          0          0          0 |        10
         2 |         0          0         12          0 |        12
         3 |         0          4          0          6 |        10
         4 |         9          0          0          8 |        17
-----------+--------------------------------------------+----------
     Total |        19          4         12         14 |        49

. bys euclid: egen m_euclid=mean(var1)
. bys euclidsq: egen m_euclidsq=mean(var1)
. egen tot1euclid=total((var1-m_euclid)^2)
. egen tot1euclidsq=total((var1-m_euclidsq)^2)
. sum tot*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
  tot1euclid |        49    712.2434           0   712.2434   712.2434
tot1euclidsq |        49    524.9169           0   524.9169   524.9169
. di sqrt(712.2434 )
26.687889
. di sqrt( 524.9169 )
22.911065
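
For comparison, the same within-group sum of squares can be computed directly from a fixed partition. The sketch below is only a check under stated assumptions, not anything taken from -group1d- itself: it types in the 49 values listed at the bottom of this message, sorts them, and cuts after ranks 12, 31 and 45, which are the breaks shown in the 4-group -group1d- listing quoted below. The variable names grp, m_grp and ss are invented here.

```stata
clear
input double var1
108.9702
111.1337
112.5217
112.6697
112.9962
114.0323
114.6699
116.8646
119.0719
124.5645
124.691
126.4943
126.5528
133.5675
134.9519
140.7979
144.228
102.8566
103.9373
104.7436
107.5397
109.4443
109.7089
110.395
112.1248
113.6032
115.6405
117.1919
120.0395
121.0714
121.7119
110.1116
112.0128
117.6563
118.2418
126.0027
127.8855
92.21352
92.45715
92.953
93.01508
94.05335
94.27259
94.38242
94.72507
94.83315
95.25914
95.37813
95.52933
end

* order the values, then cut after ranks 12, 31 and 45
sort var1
generate byte grp = 1 + (_n > 12) + (_n > 31) + (_n > 45)

* group means, then the total of squared deviations from them
bysort grp: egen double m_grp = mean(var1)
egen double ss = total((var1 - m_grp)^2)
display "within-group SS = " %9.4f ss[1]
```

The displayed total is meant to be set beside the 524.9169 in the -summarize- output above. The sum of squares is already the quantity being compared, so taking square roots does not change which grouping comes out better.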
The groupings generated with the option -measure(L2squared)- are
superior to those generated without it, in that the total
within-group sum of squares is smaller (524.92 against 712.24). This
shouldn't be the case for univariate clustering, or should it? Have I
missed something important?
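
One thing that could be checked, offered here as a hedged sketch rather than an answer: by default -cluster kmeans- starts from k observations chosen at random (start(krandom)), so two separate runs need not begin from the same centres. The code below holds the start fixed with start(krandom(#)) and varies only the measure. The data are simulated stand-ins, not the 49 values from this thread, and the seeds 2009 and 4711 and the names x, g_l2, g_l2sq, km_l2 and km_l2sq are arbitrary choices made for illustration.

```stata
clear
set seed 2009
set obs 49

* simulated stand-in for var1; not the data from the post
generate double x = rnormal(112, 13)

* same random-start seed in both runs, so only the measure differs
cluster kmeans x, k(4) start(krandom(4711)) generate(g_l2) name(km_l2)
cluster kmeans x, k(4) measure(L2squared) start(krandom(4711)) generate(g_l2sq) name(km_l2sq)

* compare the two sets of group labels
tabulate g_l2 g_l2sq
```

If the two groupings line up one-to-one once the start is held fixed, the disagreement in the table above would more plausibly come from the random starting centres than from the measure itself.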
Thank you once again!!
Ada
On Wed, Mar 11, 2009 at 11:21 AM, Nick Cox wrote:
> <>
>
> . findit group1d
>
> points to a program in this territory on SSC.
>
> Thanks to Kyle Hood for reminding me that a version of this problem arises in choosing classes or bins for choropleth or patch maps.
>
> Long-term Stata user Ian S. Evans wrote a review of that territory that is still useful:
>
> Ian S. Evans. 1977.
> The Selection of Class Intervals.
> Transactions of the Institute of British Geographers 2: 98-124.
>
> It will be accessible to many readers (but not all) through JSTOR.
>
> I must look again at how Jenks implemented his own least-squares criterion, but independently of this work in cartography, the problem has arisen in mainstream statistics. I suspect that Jenks' method would have had fewer users if he had used a more candid term such as "fortuitous breaks".
>
> The help for -group1d- gives a rather detailed discussion with documentation showing that the problem goes back to 1958 at least, but I won't repeat that here.
>
> -group1d- has a habit of picking out moderate outliers as singleton groups, but then that is hardly surprising given its least-squares criterion. I've been echoing Hartigan's 1975 comment intermittently over the last 30 years that least first powers (L_1 norm) is an alternative without ever implementing it.
>
> Although Ada supplies her series in some jumbled order I am presuming she wants breaks in the distribution, i.e. to group the ordered values.
>
> I see 49 values in her example. Reading those in
>
> . sort var1
>
> . group1d var1, max(7)
>
> Partitions of 49 data up to 7 groups
>
> 1 group: sum of squares 8604.04
> Group  Size           First            Last     Mean      SD
>     1    49     1   92.2135    49   144.228   112.04   13.25
>
> 2 groups: sum of squares 3059.49
> Group  Size           First            Last     Mean      SD
>     2    33    17    108.97    49   144.228   119.44    9.03
>     1    16     1   92.2135    16    107.54    96.76    4.80
>
> 3 groups: sum of squares 1073.41
> Group  Size           First            Last     Mean      SD
>     3    10    40   124.565    49   144.228   130.97    6.70
>     2    24    16    107.54    39   121.712   114.14    3.97
>     1    15     1   92.2135    15   104.744    96.04    4.04
>
> 4 groups: sum of squares 524.92
> Group  Size           First            Last     Mean      SD
>     4     4    46   133.568    49   144.228   138.39    4.33
>     3    14    32   116.865    45   127.885   122.00    3.79
>     2    19    13   102.857    31   115.641   110.48    3.51
>     1    12     1   92.2135    12   95.5293    94.09    1.11
>
> 5 groups: sum of squares 309.08
> Group  Size           First            Last     Mean      SD
>     5     4    46   133.568    49   144.228   138.39    4.33
>     4     9    37    120.04    45   127.885   124.33    2.61
>     3    14    23   112.013    36   119.072   114.95    2.37
>     2    10    13   102.857    22   111.134   107.88    2.82
>     1    12     1   92.2135    12   95.5293    94.09    1.11
>
> 6 groups: sum of squares 185.01
> Group  Size           First            Last     Mean      SD
>     6     4    46   133.568    49   144.228   138.39    4.33
>     5     6    40   124.565    45   127.885   126.03    1.15
>     4     9    31   115.641    39   121.712   118.61    1.91
>     3    14    17    108.97    30    114.67   111.74    1.74
>     2     4    13   102.857    16    107.54   104.77    1.73
>     1    12     1   92.2135    12   95.5293    94.09    1.11
>
> 7 groups: sum of squares 116.89
> Group  Size           First            Last     Mean      SD
>     7     2    48   140.798    49   144.228   142.51    1.72
>     6     2    46   133.568    47   134.952   134.26    0.69
>     5     6    40   124.565    45   127.885   126.03    1.15
>     4     9    31   115.641    39   121.712   118.61    1.91
>     3    14    17    108.97    30    114.67   111.74    1.74
>     2     4    13   102.857    16    107.54   104.77    1.73
>     1    12     1   92.2135    12   95.5293    94.09    1.11
>
> Groups   Sums of squares
>      1           8604.04
>      2           3059.49
>      3           1073.41
>      4            524.92
>      5            309.08
>      6            185.01
>      7            116.89
>
> It is vital to check graphically that the groups (breaks) make sense. -qplot- from SJ is especially useful here.
>
> . qplot var1, rank xli(12.5 22.5 36.5 45.5)
>
> The graph shows that in this case two of the groups are fairly distinct, but the other subdivisions seem less convincing.
>
> Nick
>
> Ada Ma
>
> Thank you to both Partha Deb and Kyle Hood for providing me with some
> very promising looking leads to attempt.
>
> On Wed, Mar 11, 2009 at 7:11 AM, Kyle K. Hood wrote:
>
>> In mapping, univariate classification schemes are used to group features
>> together. An example is Jenks' natural breaks, which simply defines k-1
>> cutoffs to minimize within-group sums of square deviations from group means.
>> Unfortunately,
>>
>> . findit jenks
>>
>> produces nothing. However, there is information on the web regarding how to
>> compute these cutoffs (just google it). I'm not sure how closely this
>> method relates to cluster analysis and finite mixture models.
>
> Partha Deb wrote:
>
>>> Although one can never be sure what's in someone else's mind, I suspect
>>> you are looking for cluster analysis. -help cluster- . Finite mixture
>>> models may also be of interest. -findit fmm- . See
>>> http://users.ox.ac.uk/~polf0050/ISS%20Lecture%208.pdf for a set of slides by
>>> Stephen Fisher that has an introduction to Cluster analysis and finite
>>> mixture models.
>
> Ada Ma wrote:
>
>>>> Let's say I have 50 packets of crisps of various weights and I would
>>>> like to separate these 50 packets of crisps into five groups based on
>>>> their weights in grams, as follows:
>>>>
>>>> 108.9702
>>>> 111.1337
>>>> 112.5217
>>>> 112.6697
>>>> 112.9962
>>>> 114.0323
>>>> 114.6699
>>>> 116.8646
>>>> 119.0719
>>>> 124.5645
>>>> 124.691
>>>> 126.4943
>>>> 126.5528
>>>> 133.5675
>>>> 134.9519
>>>> 140.7979
>>>> 144.228
>>>> 102.8566
>>>> 103.9373
>>>> 104.7436
>>>> 107.5397
>>>> 109.4443
>>>> 109.7089
>>>> 110.395
>>>> 112.1248
>>>> 113.6032
>>>> 115.6405
>>>> 117.1919
>>>> 120.0395
>>>> 121.0714
>>>> 121.7119
>>>> 110.1116
>>>> 112.0128
>>>> 117.6563
>>>> 118.2418
>>>> 126.0027
>>>> 127.8855
>>>> 92.21352
>>>> 92.45715
>>>> 92.953
>>>> 93.01508
>>>> 94.05335
>>>> 94.27259
>>>> 94.38242
>>>> 94.72507
>>>> 94.83315
>>>> 95.25914
>>>> 95.37813
>>>> 95.52933
>>>>
>>>> I don't want to separate them into five equally sized groups. I want
>>>> to separate the packets into groups so that the group members are most
>>>> similar to one another. I am looking for a method (or methods?) to
>>>> achieve this end but I don't know where to start. If you can think of
>>>> any suggestion please fire away and I'd be most grateful!
>>>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
--
Ada Ma
Research Fellow
Health Economics Research Unit
University of Aberdeen, UK.
http://www.abdn.ac.uk/heru/
Tel: +44 (0) 1224 553863
Fax: +44 (0) 1224 550926