st: three questions about cluster analysis ties(), cutv() and _hgt
Hello,
I have a massive data set of individuals, with demographic data for
each person's census block: things like the median income, age,
number of people with a given schooling level, number of people
within various age brackets, etc.
The job is to look for any kind of clusters based on a variety of
criteria that these variables might suggest. I am at a very early
exploratory stage here, and I have a very rudimentary understanding of
how the cluster set of commands works. I use Stata 10.
My setup is this: I took a stratified random sample of the original
data set to get a manageable number of observations reasonably
well scattered across all of the subsets of interest. I then ran
hierarchical clustering over this sample to get an initial idea of
k, the number of clusters I will want to request next, when I try
k-means clustering over the entire data set.
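To make that workflow concrete, here is a minimal sketch, not my actual code. It uses Stata's shipped auto.dta as a stand-in for the census data. The stratum variable (foreign), the clustering variables, the 50% sampling rate, the seed, the names hc and km3, and k(3) are all placeholders chosen for illustration.

```stata
* Sketch only: auto.dta stands in for the census-block data
sysuse auto, clear
set seed 12345                      // arbitrary seed, for a repeatable sample

preserve
    * stratified random sample: keep 50% of observations within each stratum
    sample 50, by(foreign)
    * hierarchical clustering on the sample, then look at the tree
    cluster wardslinkage price mpg weight, name(hc)
    cluster dendrogram hc
restore

* k-means on the full data, with k picked after inspecting the dendrogram
cluster kmeans price mpg weight, k(3) name(km3)
```

The preserve/restore pair is just one way to get back to the full data after sampling; the hierarchical cluster object hc is discarded along with the sample.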
This is a simplification. I have several data sets, and I want to try
several types of linkage. So I wrote a wrapper where I can set these
options, but the core of it does this:
    cluster `linkage' `groupby', name(_`linkage'_groupby)
    cluster tree _`linkage'_groupby, cutn(`howmany')
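For anyone who wants to run those two lines, a self-contained sketch follows. The local values are assumptions for illustration only: auto.dta stands in for my sample, the variable list is arbitrary, and `tag' is a made-up label used here so that name() stays a single valid name.

```stata
* Sketch only: the values below are assumptions, not my wrapper's settings
sysuse auto, clear
local linkage  completelinkage      // or wardslinkage, averagelinkage, ...
local groupby  price mpg weight     // hypothetical clustering variables
local tag      demo                 // hypothetical label used in the name
local howmany  10

* hierarchical clustering; creates <name>_id, <name>_ord and <name>_hgt
cluster `linkage' `groupby', name(_`linkage'_`tag')

* dendrogram cut to the top `howmany' branches
cluster tree _`linkage'_`tag', cutnumber(`howmany')
```

Whether this toy example reproduces the ties error depends on the data, so it shows only the shape of the calls.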
My three questions:
1. This sometimes produces the error message "cannot cut exactly
`howmany' groups due to ties in the dendrogram". I tried varying
`howmany' (50, 30 and 10) with no luck. I also tried varying
`linkage', but complete and ward both produced the same error message.
I am not sure how to fix it. The ties() option is not available for
cluster tree; it is only available for cluster gen. So, how do you go
about resolving ties in the cluster tree command?
2. Is there a way to back out the dissimilarity coefficient value
that corresponds to a given number of stems? Say I want to use cutv()
instead of cutn(), and set the value for the dissimilarity coefficient
that corresponds to about 10 stems. How do I go about it?
3. The command cluster `linkage' `groupby' produces three new
variables, with names starting with _`linkage'_`groupby' and ending in
_id, _ord and _hgt. Is _`linkage'_`groupby'_hgt equal to the
dissimilarity coefficient value computed by this command, and shown on
the y-axis of the dendrogram? How about _ord?
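In case it helps frame the three questions, this sketch shows the pieces I am referring to. It assumes auto.dta as a stand-in data set and hc as a hypothetical cluster name. The ties() line is the cluster generate form mentioned in question 1. The sorted listing assumes that _hgt holds the merge heights, which is precisely what question 3 asks, so treat that part as a guess and not an answer.

```stata
* Sketch only: auto.dta and the name hc are placeholders
sysuse auto, clear
cluster completelinkage price mpg weight, name(hc)

* Question 1: ties() is accepted here, on cluster generate
cluster generate g10 = groups(10), name(hc) ties(more)
tabulate g10

* Question 3: the variables attached to the cluster object
cluster list hc
list hc_id hc_ord hc_hgt in 1/10

* Question 2 (a guess): if hc_hgt holds merge heights, the largest
* values suggest where a cutvalue() for about 10 groups might fall
preserve
    gsort -hc_hgt
    list hc_hgt in 1/12
restore
```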
Thank you,
Gabi