Clustering Quality

Largely due to Francesca Chiaro.

One of the key ideas in statistical learning is clustering, which rests on the notion of a metric. One of the major problems with clustering is deciding how many clusters to use. Once you have chosen a number of clusters k, K-means clustering does a nice job, but how do you decide on k?

1. Intrinsic measures: perform a clustering for some given k. For every point p, compute the distance from p to the centroid of p's cluster. Square these distances and sum them; call the result W(k). (See the sketch after this list.)

Note: W(k) decreases monotonically as the number of clusters increases, so we need a stopping rule. One paper suggests stopping when W(k)/W(k+1) < 4. There are more complicated criteria, but this is admittedly heuristic.

2. Another technique is to use the silhouette: compare a point's average distance to the members of its own cluster against its average distance to the members of the closest other cluster. If that score is high (choose your cutoff), we are in good shape. Sometimes people plot the silhouette value as k increases (also shown in the sketch below).
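A minimal sketch of both criteria in Python, using scikit-learn on synthetic data (the dataset, the range of k, and the printing format are illustrative assumptions, not part of these notes). KMeans exposes W(k) directly as its inertia_ attribute, and silhouette_score computes the standard silhouette:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Toy data: three well-separated Gaussian blobs.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    prev_w = None
    for k in range(1, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        w = km.inertia_  # W(k): summed squared distances to each point's own centroid
        # Heuristic stopping rule: stop increasing k once this ratio drops below 4.
        ratio = prev_w / w if prev_w is not None else float("nan")
        # The silhouette is undefined for a single cluster.
        sil = silhouette_score(X, km.labels_) if k > 1 else float("nan")
        print(f"k={k}  W(k)={w:10.1f}  W(k-1)/W(k)={ratio:5.2f}  silhouette={sil:5.2f}")
        prev_w = w

On well-separated data like this, both the ratio rule and the silhouette tend to point at k = 3; on real data the two criteria can disagree, which is part of why choosing k remains heuristic.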

Minimax diameter clustering

This is a form of partitional clustering: we want to minimize the maximum diameter of any cluster over some set of points P. Farthest-point clustering works for any metric space [T. Gonzalez, Clustering to minimize the maximum intercluster distance, Theoretical Computer Science, 38:293-306, 1985]: start with an arbitrary point s1. Pick a point s2 that is as far from s1 as possible. In general, pick si to maximize the distance to the nearest of the centroids picked so far, that is, to maximize min{dist(si,s1), dist(si,s2), ..., dist(si,s(i-1))}. After all k representatives are chosen, we can define the partition of P: cluster Pj consists of all points closer to sj than to any other representative.
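Here is a minimal sketch of the procedure in Python for Euclidean points, assuming a NumPy array of shape (n, d); the function name and the choice of seed point are illustrative, and the same logic works in any metric space if you swap the distance computation:

    import numpy as np

    def farthest_point_clustering(points, k, seed_index=0):
        """Gonzalez's farthest-point heuristic: choose k representatives, each
        maximizing its distance to the nearest previously chosen one."""
        centers = [seed_index]  # s1: an arbitrary starting point
        # dist_to_nearest[i] = distance from point i to its nearest chosen center.
        dist_to_nearest = np.linalg.norm(points - points[seed_index], axis=1)
        for _ in range(k - 1):
            # si maximizes min{dist(si, s1), ..., dist(si, s(i-1))}.
            next_center = int(np.argmax(dist_to_nearest))
            centers.append(next_center)
            d_new = np.linalg.norm(points - points[next_center], axis=1)
            dist_to_nearest = np.minimum(dist_to_nearest, d_new)
        # Cluster j consists of all points closer to sj than to any other representative.
        dists = np.linalg.norm(points[:, None, :] - points[centers][None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        return centers, labels

Each round is a single pass over the points, so the whole procedure takes O(nk) distance computations.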

Theorem: Let D be the maximum diameter of the k-clustering C_Gon produced this way. Then even in the optimal k-clustering C_opt, the maximum diameter must be at least D/2. Here optimal means minimizing the maximum diameter of any cluster.

Proof: C_Gon has centroids s1, ..., sk. Pick a point q in P that is a non-centroid of C_Gon and whose minimum distance E to the centroids is greatest among all non-centroids of P. (So q would be the next centroid chosen by farthest-point clustering.) By construction, the centroids of C_Gon are pairwise at least E apart: when each centroid was chosen, q was still available with minimum distance at least E, so the chosen centroid's minimum distance to the earlier centroids was at least E as well. Since q is also at least E from every centroid, the k+1 points s1, ..., sk, q are pairwise at least E apart. Now consider any k-clustering of P, in particular the optimal one: by pigeonhole, some cluster X must contain two of these k+1 points, and those two points are at least E apart, so X has diameter at least E. Hence the optimal maximum diameter is at least E. On the other hand, in C_Gon every point is within distance E of its own centroid, by the choice of q as the farthest non-centroid; so every cluster of C_Gon has radius at most E and hence diameter at most 2E. Therefore D <= 2E <= 2 times the optimal maximum diameter, i.e., C_Gon is at most twice as bad as optimal.
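As a small sanity check of the theorem, we can reuse farthest_point_clustering from the sketch above and, on tiny random instances (sizes chosen arbitrarily here), enumerate every assignment of the points into at most k clusters by brute force, confirming that the heuristic's maximum diameter never exceeds twice the optimum:

    from itertools import product
    import numpy as np

    def max_diameter(points, labels, k):
        """Largest pairwise distance within any single cluster."""
        worst = 0.0
        for j in range(k):
            cluster = points[labels == j]
            if len(cluster) >= 2:
                pairwise = np.linalg.norm(cluster[:, None] - cluster[None, :], axis=2)
                worst = max(worst, float(pairwise.max()))
        return worst

    rng = np.random.default_rng(1)
    for trial in range(20):
        X = rng.uniform(size=(8, 2))
        k = 2
        _, labels = farthest_point_clustering(X, k)
        d_gon = max_diameter(X, labels, k)
        # Brute force over all 2^8 assignments of the points to at most k clusters.
        d_opt = min(max_diameter(X, np.asarray(a), k)
                    for a in product(range(k), repeat=len(X)))
        assert d_gon <= 2 * d_opt + 1e-9

(The pigeonhole argument above works for any partition into at most k clusters, so comparing against this slightly larger search space is still a fair test.)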

Remarkably, there is a further theorem saying that it is NP-hard to do better than a factor of 1.962 (this result is only for Euclidean spaces).