. findit group1d
points to a program in this territory on SSC.
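It can be installed directly with

. ssc install group1d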
Thanks to Kyle Hood for reminding me that a version of this problem arises in choosing classes or bins for choropleth or patch maps.
Long-term Stata user Ian S. Evans wrote a still-useful review of that territory:
Ian S. Evans. 1977. The Selection of Class Intervals. Transactions of the Institute of British Geographers 2: 98-124.
It will be accessible to many readers (but not all) through JSTOR.
I must look again at how Jenks implemented his own least-squares criterion, but independently of this work in cartography the problem has arisen in mainstream statistics. I suspect that Jenks' method would have had fewer users if he had chosen a more candid term than "natural breaks", such as "fortuitous breaks".
The help for -group1d- gives a rather detailed discussion, with references showing that the problem goes back at least to 1958, but I won't repeat that here.
-group1d- has a habit of picking out moderate outliers as singleton groups, but that is hardly surprising given its least-squares criterion. For some 30 years I have been intermittently echoing Hartigan's (1975) comment that least first powers (the L_1 norm) would be an alternative, without ever implementing it.
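For concreteness, here is a minimal sketch in Mata of the kind of exact dynamic program that least-squares grouping calls for: given sorted data, choose the cuts so that the within-group sum of squares is minimized. This is not -group1d-'s own code, and the function name lsgroup() is invented here; replacing wss() with sums of absolute deviations about group medians would give the L_1 variant just mentioned.

mata:
// within-group sum of squares for x[i..j], via cumulative sums
real scalar wss(real colvector cs, real colvector css,
                real scalar i, real scalar j)
{
    real scalar s, s2, m
    m  = j - i + 1
    s  = cs[j]  - (i > 1 ? cs[i-1]  : 0)
    s2 = css[j] - (i > 1 ? css[i-1] : 0)
    return(s2 - s^2 / m)
}

// exact least-squares grouping of sorted x into k groups;
// returns the first observation number of each group
real rowvector lsgroup(real colvector x, real scalar k)
{
    real scalar n, g, j, i, c
    real matrix D, B
    real colvector cs, css
    real rowvector first

    n   = rows(x)              // x assumed sorted ascending
    cs  = runningsum(x)
    css = runningsum(x:^2)
    D   = J(k, n, .)           // D[g,j]: min SS for x[1..j] in g groups
    B   = J(k, n, 1)           // B[g,j]: start of last group at the optimum
    for (j = 1; j <= n; j++) D[1, j] = wss(cs, css, 1, j)
    for (g = 2; g <= k; g++) {
        for (j = g; j <= n; j++) {
            for (i = g; i <= j; i++) {
                c = D[g - 1, i - 1] + wss(cs, css, i, j)
                if (c < D[g, j]) {    // missing acts as +infinity
                    D[g, j] = c
                    B[g, j] = i
                }
            }
        }
    }
    first = J(1, k, .)
    j = n
    for (g = k; g >= 1; g--) {        // backtrack through B
        first[g] = B[g, j]
        j = first[g] - 1
    }
    return(first)
}
end

With Ada's data sorted in var1,

. mata: lsgroup(st_data(., "var1"), 5)

should return 1, 13, 23, 37, 46, the first observations of the 5-group solution shown below.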
Although Ada supplies her series in some jumbled order, I am presuming she wants breaks in the distribution, i.e. to group the ordered values.
I see 49 values in her example. Reading those in, I typed
. sort var1
. group1d var1, max(7)
Partitions of 49 data up to 7 groups

1 group: sum of squares 8604.04

Group   Size   First           Last              Mean      SD
    1     49     1   92.2135    49   144.228   112.04   13.25

2 groups: sum of squares 3059.49

Group   Size   First           Last              Mean      SD
    2     33    17    108.97    49   144.228   119.44    9.03
    1     16     1   92.2135    16    107.54    96.76    4.80

3 groups: sum of squares 1073.41

Group   Size   First           Last              Mean      SD
    3     10    40   124.565    49   144.228   130.97    6.70
    2     24    16    107.54    39   121.712   114.14    3.97
    1     15     1   92.2135    15   104.744    96.04    4.04

4 groups: sum of squares 524.92

Group   Size   First           Last              Mean      SD
    4      4    46   133.568    49   144.228   138.39    4.33
    3     14    32   116.865    45   127.885   122.00    3.79
    2     19    13   102.857    31   115.641   110.48    3.51
    1     12     1   92.2135    12   95.5293    94.09    1.11

5 groups: sum of squares 309.08

Group   Size   First           Last              Mean      SD
    5      4    46   133.568    49   144.228   138.39    4.33
    4      9    37    120.04    45   127.885   124.33    2.61
    3     14    23   112.013    36   119.072   114.95    2.37
    2     10    13   102.857    22   111.134   107.88    2.82
    1     12     1   92.2135    12   95.5293    94.09    1.11

6 groups: sum of squares 185.01

Group   Size   First           Last              Mean      SD
    6      4    46   133.568    49   144.228   138.39    4.33
    5      6    40   124.565    45   127.885   126.03    1.15
    4      9    31   115.641    39   121.712   118.61    1.91
    3     14    17    108.97    30    114.67   111.74    1.74
    2      4    13   102.857    16    107.54   104.77    1.73
    1     12     1   92.2135    12   95.5293    94.09    1.11

7 groups: sum of squares 116.89

Group   Size   First           Last              Mean      SD
    7      2    48   140.798    49   144.228   142.51    1.72
    6      2    46   133.568    47   134.952   134.26    0.69
    5      6    40   124.565    45   127.885   126.03    1.15
    4      9    31   115.641    39   121.712   118.61    1.91
    3     14    17    108.97    30    114.67   111.74    1.74
    2      4    13   102.857    16    107.54   104.77    1.73
    1     12     1   92.2135    12   95.5293    94.09    1.11
Groups   Sums of squares
     1           8604.04
     2           3059.49
     3           1073.41
     4            524.92
     5            309.08
     6            185.01
     7            116.89
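A small extra, not part of -group1d-: typing the tabled sums of squares back in and plotting them against the number of groups gives a quick scree or elbow plot. (Note that -clear- wipes the data in memory, so do this after saving or in a fresh session.)

. clear
. input groups ss
1 8604.04
2 3059.49
3 1073.41
4 524.92
5 309.08
6 185.01
7 116.89
end
. twoway connected ss groups, xtitle(Groups) ytitle(Sum of squares)

The sum of squares necessarily falls as groups are added, so the question is where the gains start to look unimportant.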
It is vital to check graphically that the groups (breaks) make sense. -qplot- from the Stata Journal is especially useful here.
. qplot var1, rank xli(12.5 22.5 36.5 45.5)
The graph shows that in this case two of the groups are fairly distinct, but the other subdivisions seem less convincing.
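Should the 5-group solution be wanted as a variable, the group starts (1, 13, 23, 37, 46 above) translate directly, assuming the data remain sorted; the -tabstat- call is just a cross-check against the table above.

. sort var1
. gen group5 = 1 + (_n > 12) + (_n > 22) + (_n > 36) + (_n > 45)
. tabstat var1, by(group5) statistics(n mean sd min max)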
Nick
[email protected]
Ada Ma
Thank you to both Partha Deb and Kyle Hood for providing me with some
very promising-looking leads to pursue.
On Wed, Mar 11, 2009 at 7:11 AM, Kyle K. Hood <[email protected]> wrote:
> In mapping, univariate classification schemes are used to group features
> together. An example is Jenks' natural breaks, which simply defines k-1
> cutoffs to minimize within-group sums of squared deviations from group means.
> Unfortunately,
>
> . findit jenks
>
> produces nothing. However, there is information on the web regarding how to
> compute these cutoffs (just google it). I'm not sure how closely this
> method relates to cluster analysis and finite mixture models.
Partha Deb wrote:
>> Although one can never be sure what's in someone else's mind, I suspect
>> you are looking for cluster analysis: -help cluster-. Finite mixture
>> models may also be of interest: -findit fmm-. See
>> http://users.ox.ac.uk/~polf0050/ISS%20Lecture%208.pdf for a set of slides
>> by Stephen Fisher that gives an introduction to cluster analysis and
>> finite mixture models.
Ada Ma wrote:
>>> Let's say I have 50 packets of crisps of various weights and I would
>>> like to separate these 50 packets of crisps into five groups based on
>>> their weights in grams, as follows:
>>>
>>> 108.9702
>>> 111.1337
>>> 112.5217
>>> 112.6697
>>> 112.9962
>>> 114.0323
>>> 114.6699
>>> 116.8646
>>> 119.0719
>>> 124.5645
>>> 124.691
>>> 126.4943
>>> 126.5528
>>> 133.5675
>>> 134.9519
>>> 140.7979
>>> 144.228
>>> 102.8566
>>> 103.9373
>>> 104.7436
>>> 107.5397
>>> 109.4443
>>> 109.7089
>>> 110.395
>>> 112.1248
>>> 113.6032
>>> 115.6405
>>> 117.1919
>>> 120.0395
>>> 121.0714
>>> 121.7119
>>> 110.1116
>>> 112.0128
>>> 117.6563
>>> 118.2418
>>> 126.0027
>>> 127.8855
>>> 92.21352
>>> 92.45715
>>> 92.953
>>> 93.01508
>>> 94.05335
>>> 94.27259
>>> 94.38242
>>> 94.72507
>>> 94.83315
>>> 95.25914
>>> 95.37813
>>> 95.52933
>>>
>>> I don't want to separate them into five equally sized groups. I want
>>> to separate the packets into groups so that the group members are most
>>> similar to one another. I am looking for a method (or methods?) to
>>> achieve this end, but I don't know where to start. If you can think of
>>> any suggestions, please fire away and I'd be most grateful!
>>>