Justin asks:
> I am trying to do a kmeans cluster analysis, and I have a couple
> issues that keep coming up. First, I use the default random
> start option and get certain results. Then I used the segments
> start option, and the results are quite different. Is there any
> explanation for this?

This is often an indication that your data cannot be separated
very well into k groups (for whatever k you asked for). Try the
following experiment to get a feel for why this might be true.
Generate random data (with no real groups in the data). Run
-cluster kmeans- with various starting values. Do some
cross-tabulations of the resulting grouping variables to get a
feel for how much they disagree. Also run -cluster stop- after
each run and see how the pseudo-F value changes.

For instance, I did

    clear
    set seed 412389
    set obs 500
    gen x1 = uniform()
    gen x2 = uniform()
    cluster kmeans x1 x2, k(5) name(try1) start(krandom(28392))
    cluster kmeans x1 x2, k(5) name(try2) start(krandom(11833))
    cluster kmeans x1 x2, k(5) name(try3) start(krandom(3216))
    cluster stop try1
    cluster stop try2
    cluster stop try3
    tab try1 try2
    tab try1 try3
    tab try2 try3

Now try a similar experiment on data that has a reasonable chance
of having some natural groupings within the data.

    sysuse auto, clear
    cluster kmeans price length, k(5) name(a1) start(kr(4484))
    cluster kmeans price length, k(5) name(a2) start(kr(33232))
    cluster kmeans price length, k(5) name(a3) start(kr(678213))
    ...
    cluster stop a1
    ...
    tab a1 a2
    ...

You will find more agreement between these results than with the
2-dimensional random uniform data. You will still find some of
the runs going to different solutions, because this particular
case does not naturally break into 5 groups, but it does so
better than the totally random data.

The usual strategy is to make several (maybe many) runs with
different starting values and take the solution that gives the
largest pseudo-F value reported by -cluster stop-. If you see
many different solutions while doing this, it is an indication
that you are trying to force the data into groups that are not
distinct.
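
The strategy above can be sketched as a loop. This is only a
minimal sketch under some assumptions: the seeds and the names
run1, run2, ... are made up for illustration, and it assumes that
-cluster stop- with its default Calinski/Harabasz rule leaves the
pseudo-F for 5 groups in r(calinski_5). Check -return list- after
-cluster stop- on your own installation before relying on that
name.

```stata
sysuse auto, clear

* Hypothetical seeds; any set of distinct values would do
local seeds 4484 33232 678213 9021 57110
local best_F = .
local best_name ""

local i = 0
foreach s of local seeds {
    local ++i
    cluster kmeans price length, k(5) name(run`i') start(krandom(`s'))
    quietly cluster stop run`i'
    * Assumption: the pseudo-F for 5 groups is stored in r(calinski_5)
    local F = r(calinski_5)
    display "run`i' (seed `s'): pseudo-F = " %9.2f `F'
    * Keep the run with the largest nonmissing pseudo-F seen so far
    if !missing(`F') & (missing(`best_F') | `F' > `best_F') {
        local best_F = `F'
        local best_name run`i'
    }
}
display "largest pseudo-F: `best_name' (" %9.2f `best_F' ")"
```

If the displayed pseudo-F values differ a lot from run to run, a
cross-tabulation such as -tab run1 run2- shows how far the
groupings themselves disagree.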

> Also, I tried to use group(varname) as a start option, but when I
> run this, I keep getting an error message that the variable I
> chose "does not define k (in my case, 5) groups". How can I fix
> this?

Does your variable "varname" have 5 and only 5 levels? When I try
something like

    sysuse auto
    cluster kmeans price length, k(5) start(groups(rep78))

it works fine for me. rep78 takes on 5 possible values, and I
correspondingly asked for "k(5)" with -cluster kmeans-.
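
One way to check the number of levels before running -cluster
kmeans- is sketched below. It uses rep78 from the auto data as a
stand-in for your own variable, and it assumes that counting the
distinct nonmissing values is the relevant check for the error
message you are seeing. If you restrict the analysis with -if- or
-in-, the same restriction would need to be applied to the count.

```stata
sysuse auto, clear

* Count the distinct nonmissing values of the candidate start variable
quietly tabulate rep78
display "rep78 defines " r(r) " groups"

* Observations with a missing grouping value are worth knowing about
count if missing(rep78)
```

If the number displayed is not the k you are asking for, either
change k() to match or build a grouping variable that has exactly
k levels.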

Ken Higbee
StataCorp 1-800-STATAPC