A couple of days ago Ping Zheng asked:
>We are conducting inward FDI locational determinants by using a panel
>data set with 28 home countries. we'd like to try a country cluster
>analysis to cluster the home countries into different groups naturally
>by Stata. What are the commands for this and what are the commands for
>regressing the different groups obtained from clustering by using OLS
>and Random Effects GLS?
And Rose Medeiros gave some good advice concerning the many
things you have to consider when doing a cluster analysis.
I would like to add a cautionary note. It sounds like you are
going to use the groups produced by cluster analysis as groupings
in later statistical analyses. More often than not, this is a
statistically dangerous thing to do. Let me illustrate.
First I generate 300 random uniform observations (over the range
-0.5 to 0.5) for each of 6 variables:
clear
set obs 300
set seed 11313
forvalues i = 1/6 {
    gen x`i' = uniform() - 0.5
}
And I create a grouping variable by blindly dividing the data
into thirds.
gen rand3 = 1 in 1/100
replace rand3 = 2 in 101/200
replace rand3 = 3 in 201/300
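As an aside, the same blind split can be written in one line. This
is only a sketch, assuming the 300 observations are still in the
order created above; rand3b is a hypothetical name chosen so that
it does not clash with rand3.

```stata
* Observations 1-100 get 1, 101-200 get 2, 201-300 get 3
gen rand3b = ceil(_n/100)

* Check that the shortcut reproduces the grouping built by hand
assert rand3b == rand3
```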
I look at the means of the 6 variables over these 3 random groups
tabstat x* , by(rand3)
and do a -manova- to see if these 3 groups are significantly
different.
manova x1 x2 x3 x4 x5 x6 = rand3
They are not, which is what we would all expect.
Now, what happens if I use a cluster analysis routine to go
searching for 3 groups in the data? Here I picked K-means
clustering, but the concept is the same for the other clustering
methods.
cluster kmeans x* , k(3) name(g3)
What do the data show for these 3 groups?
tab g3
tabstat x* , by(g3)
Compare the output of -tabstat- here with that from the random
groupings.
What does -manova- say about the grouping created by -cluster-?
manova x1 x2 x3 x4 x5 x6 = g3
We have found that the groups are statistically different. But
if you tried to publish results like these, a knowledgeable
journal reviewer would reject your paper. The result is
significant only because the cluster analysis went searching (as
hard as it could) for groupings that best separate the data. Even
in random data there are bound to be groupings that separate
the data well enough for follow-on statistical tests to show
significance.
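To see that this is not a fluke of one seed, the experiment can be
repeated many times. What follows is a rough sketch, not a
definitive recipe. The program name grpsim, the number of
replications, and the choice of a one-way test on x1 alone (rather
than the full -manova-) are my own assumptions, made so that a
p-value is easy to pick up from -testparm-. It also assumes a Stata
version with factor-variable notation and runiform().

```stata
capture program drop grpsim
program define grpsim, rclass
    * Fresh random data on every replication
    drop _all
    set obs 300
    forvalues i = 1/6 {
        generate x`i' = runiform() - 0.5
    }

    * Blind grouping into thirds, and K-means grouping
    generate rand3 = ceil(_n/100)
    cluster kmeans x*, k(3) name(g3)

    * Test of equal x1 means across the random groups
    regress x1 i.rand3
    testparm i.rand3
    return scalar p_rand = r(p)

    * Same test across the groups found by -cluster-
    regress x1 i.g3
    testparm i.g3
    return scalar p_km = r(p)
end

set seed 11313
simulate p_rand = r(p_rand) p_km = r(p_km), reps(200) nodots: grpsim

* Share of replications that reject at the 5% level
generate byte rej_rand = p_rand < 0.05
generate byte rej_km   = p_km   < 0.05
summarize rej_rand rej_km
```

The mean of rej_rand should sit near the nominal 0.05, while I
would expect the mean of rej_km to be far larger, even though
every dataset is pure noise.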
By the way, you could also run
cluster stop
and get a Pseudo-F statistic indicating how well the 3 groups
split the data. Notice the word "Pseudo" and that no p-values
are provided in the output of -cluster stop-. You can read more
about cluster stopping rules in [MV] cluster stop. In particular
read the first Technical Note on page 186 concerning why the
stopping rule statistics have the word "Pseudo" in them.
Also read the Technical Note on page 74 of [MV] cluster. It
warns against using the groups produced by the -cluster- command
in the -cluster()- option of an estimation command.
Ken Higbee
StataCorp 1-800-STATAPC