How Does Cluster Analysis Work?

Cluster analysis is a set of statistical procedures for assigning items to groups (clusters). These items can be people, organizations, countries, animals, experimental observations, and so on. Similar objects are placed in the same group, while dissimilar objects are placed in different groups. This sounds simple and interesting enough, but there are several uncomfortable facts about cluster analysis.

More Than One Type of Procedure

There are several ways to cluster items, and no single way is best.

Hierarchical clustering is an agglomerative method. If there are N items, start with N clusters, each item forming its own cluster. Find the two closest (most similar) clusters and merge them, leaving N-1 clusters. Again, find the two closest clusters and merge them, leaving N-2 clusters. Continue this process until you have the number of clusters you desire. The advantage of this procedure is that it always finds a solution via a non-random method.

K-Means clustering is an iterative method. Desiring K clusters, randomly choose K of the N items to form the initial cluster centers. Assign each of the remaining N-K items to the cluster center closest to it. The centroid of all items in a cluster then becomes the new cluster center. Given these new centroids, reassign each of the N items to the centroid closest to it. Repeat (iterate) this process until none of the N items changes cluster membership. The advantage of this method is its flexibility: items can fluidly change cluster assignments as the algorithm searches for the best solution.
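Both procedures can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the 2-D points, Euclidean distance, and tie-breaking behavior are all illustrative choices.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def centroid(cluster):
    """Mean point of a cluster of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def hierarchical(points, k):
    """Agglomerative clustering: start with one cluster per item, then
    repeatedly merge the two clusters whose centroids are closest,
    until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

def k_means(points, k, seed=0):
    """K-Means: pick k random items as initial centers, assign every
    item to its nearest center, recompute centroids, and repeat until
    no item changes cluster membership."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = None
    while True:
        new_assignment = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:  # no item moved: converged
            return centers, assignment
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
```

Note the structural difference: `hierarchical` never revisits a merge, while `k_means` reassigns every item on every pass.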

Each Procedure Has Faults

Hierarchical clustering may be defective because items can never change their cluster memberships. As a cluster accumulates items, its centroid may drift away from some of the earlier cluster members, so that by the end of the agglomeration process those items are actually closer to the centroid of another cluster. K-Means clustering may be defective because there is no guarantee that the iterations will end. There may be no final solution; instead, the process may cycle through several semi-final solutions, as a handful of fickle items shift cluster membership with each iteration.

Never Know the Correct Number of Clusters

No procedure can tell you how many clusters “truly” exist. This is essentially a human judgement. Using their own criteria, analysts must decide, beforehand, the number of clusters to be generated.

Never Can Be Certain of the Absolute Best Solution

Perhaps the ugliest fact about cluster analysis is that, unless the number of items is small, one can never be certain that the absolute best solution has been found. All clustering procedures start from an arbitrary initial condition, which usually leads to a good solution, but probably not the best one. Most initial conditions settle into a “locally” good solution, the best found in their statistical “region”. The only guaranteed way to find the best solution is to start from every possible initial condition (there could be quadrillions) and work each through to the end. The best of those quadrillions of local solutions would then be the best overall clustering solution.
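A common compromise, sketched below, is to restart K-Means from many random initial conditions and keep the solution with the lowest total within-cluster error (often called inertia). The restart count and scoring here are illustrative assumptions; the result is still only the best of the solutions sampled, not a guaranteed global optimum.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def k_means_run(points, k, rng):
    """One K-Means run from one random initial condition.
    Returns (inertia, assignment), where inertia is the total squared
    distance of each item to its final cluster center."""
    centers = rng.sample(points, k)
    assignment = None
    while True:
        new = [min(range(k), key=lambda c: dist(p, centers[c]))
               for p in points]
        if new == assignment:  # no item moved: this run has settled
            break
        assignment = new
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    inertia = sum(dist(p, centers[a]) ** 2
                  for p, a in zip(points, assignment))
    return inertia, assignment

def best_of_restarts(points, k, restarts=20, seed=0):
    """Try many random initial conditions and keep the locally best
    solution found -- a practical stand-in for the infeasible
    exhaustive search over every possible starting condition."""
    rng = random.Random(seed)
    return min(k_means_run(points, k, rng) for _ in range(restarts))
```

Each restart can only find the local solution of its own statistical region; taking the minimum over many restarts simply samples more regions, which is why libraries typically expose a restart count as a tuning parameter.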