Clustering Algorithms

Any decompositional clustering algorithm can be made hierarchical by recursive application.

K-means algorithm

K-means-cluster (in S : set of vectors; k : integer)
{  let C[1] ... C[k] be a random partition of S into k parts;
   repeat {
            for i := 1 to k {
               X[i] := centroid of C[i];
               C[i] := empty 
              } 
             for j := 1 to N {    /* N = |S| */
                let q be such that X[q] is the closest of X[1] ... X[k] to S[j];
                add S[j] to C[q]
              } }
    until the change to C (or the change to X) is small enough.
}
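
A minimal runnable sketch of this loop in Python (an illustration, not a
specification): it initializes from k random points rather than a random
partition, uses squared Euclidean distance, and stops when the centroids
stop moving. The helper names dist2, mean, and kmeans are mine.

import random

def dist2(a, b):
    # squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    # componentwise mean of a nonempty list of vectors
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)      # random initial centroids
    for _ in range(max_iters):
        # assignment step: each point joins its closest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            q = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[q].append(p)
        # update step: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid)
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if all(dist2(a, b) < 1e-12 for a, b in zip(new, centroids)):
            break                             # change to X is small enough
        centroids = new
    return clusters, centroids

# e.g. kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)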

Example: (Note: I have started here with random points rather than a random partition.)

Features:
- Simple and fast: each pass takes O(kN) distance computations, and in practice only a few passes are needed.
- Produces a flat partition into exactly k clusters.

Problems:
- k must be fixed in advance.
- The outcome depends on the initial partition; the algorithm can converge to a poor local optimum.
- It favors compact, roughly spherical clusters of similar size, and is sensitive to outliers.

Variant 1: Update the centroids incrementally, as each point is assigned, rather than once at the end of each pass. This is claimed to give better results.
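
One way this might be realized (a sketch; the running-mean update, and the
assumption that each centroid starts out "carrying" one point, are mine,
since the notes do not fix the exact scheme). It reuses dist2 from the
K-means sketch above.

def kmeans_incremental_pass(points, centroids):
    # one pass: nudge a centroid toward each point as soon as the point
    # is assigned to it, instead of recomputing all centroids afterwards
    cents = [list(c) for c in centroids]
    counts = [1] * len(cents)        # assumption: each centroid carries 1 point
    for p in points:
        q = min(range(len(cents)), key=lambda i: dist2(p, cents[i]))
        counts[q] += 1
        for t in range(len(p)):      # running mean: c := c + (p - c)/n
            cents[q][t] += (p[t] - cents[q][t]) / counts[q]
    return [tuple(c) for c in cents]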

Variant 2: Bisecting K-means algorithm:

for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do
        split C into two subclusters C1 and C2, using 2-means;
    choose the best of these ITER trial splits and make it permanent;
}
This generates a binary clustering hierarchy, and is claimed to work better than plain K-means.
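
A sketch of this variant, reusing the kmeans function above with k = 2.
The split policy (always split the largest leaf) and the scoring of trial
splits by within-cluster squared distance are assumptions; only the leaf
clusters are returned here, though recording each split would give the
binary hierarchy.

def bisecting_kmeans(points, K, ITER=5):
    leaves = [list(points)]                  # start with one leaf: all points
    while len(leaves) < K:
        c = max(leaves, key=len)             # assumed policy: split largest leaf
        leaves.remove(c)
        best = None
        for _ in range(ITER):                # ITER trial 2-means splits
            (c1, c2), cents = kmeans(c, 2)
            score = (sum(dist2(p, cents[0]) for p in c1) +
                     sum(dist2(p, cents[1]) for p in c2))
            if best is None or score < best[0]:
                best = (score, c1, c2)       # keep the best split so far
        leaves += [best[1], best[2]]
    return leaves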

Variant 3: Allow overlapping clusters: if the distances from P to C1 and to C2 are close enough, put P in both clusters.
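
For instance, the assignment step of the K-means sketch could be replaced
by something like the following; eps is an assumed closeness margin, and
dist2 (squared distance) comes from the sketch above.

def overlapping_assignment(p, centroids, eps=0.1):
    # return the indices of every cluster that should contain p: all
    # centroids within a (1 + eps) factor of the distance to the closest
    # one (dist2 is squared, hence the (1 + eps)**2)
    d = [dist2(p, c) for c in centroids]
    dmin = min(d)
    return [i for i, di in enumerate(d) if di <= (1 + eps) ** 2 * dmin]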

For document clustering, the full mean vector would get very long (one component for each distinct word occurring anywhere in the documents). Solution: truncate the centroid to its M highest-weight terms (typically M = 50 or 100).
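
A sketch, assuming each document is represented as a sparse word -> weight
map:

def truncated_centroid(docs, M=50):
    # mean of sparse document vectors, keeping only the M highest-weight
    # components (docs is a list of dicts mapping word -> weight)
    totals = {}
    for d in docs:
        for word, wt in d.items():
            totals[word] = totals.get(word, 0.0) + wt
    n = len(docs)
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:M]
    return {word: s / n for word, s in top}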

Agglomerative Hierarchical Clustering Technique

{ put every point in a cluster by itself;
  for I := 1 to N-1 do {
     let C1, C2 be the most mergeable pair of clusters;
     create C, the parent of C1 and C2, and replace C1 and C2 by C;
  } }
Various measures of mergeability are used, e.g. distance between centroids, single link (closest pair of points), or complete link (farthest pair of points).
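
A sketch using closest centroids as the merge criterion (one of the
measures above), reusing dist2 and mean from the K-means sketch. It
returns the hierarchy as nested pairs, with the original points as leaves;
this naive version rescans all cluster pairs at every merge, where real
implementations use a priority queue.

def agglomerate(points):
    members = [[p] for p in points]      # the points in each live cluster
    trees = list(points)                 # the tree built so far for each
    while len(members) > 1:
        # find the most mergeable pair: the two closest centroids
        best = None
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                d = dist2(mean(members[i]), mean(members[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        members[i] += members[j]         # merge C1 and C2 ...
        trees[i] = (trees[i], trees[j])  # ... under a new parent C
        del members[j], trees[j]
    return trees[0]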

Characteristics:
- Deterministic: there is no random initialization, so repeated runs give the same hierarchy.
- Produces a complete cluster hierarchy, not just a flat partition.
- Expensive on large collections: the naive algorithm is O(N^2) in space and worse in time, since every merge step compares pairs of clusters.

Conflicting claims about quality of agglomerative vs. K-means.

One-pass Clustering

pick a starting point D in S;
CC := { { D } };   /* set of clusters: initially one cluster containing D */
for each Di in S - { D } do {
    C := the cluster in CC "closest" to Di;
    if similarity(C, Di) > threshold
      then add Di to C;
      else add { Di } to CC;
}
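
A sketch, again reusing dist2 and mean from the K-means sketch; since
dist2 is a distance rather than a similarity, the test is "less than
threshold" rather than "greater than threshold".

def one_pass_cluster(points, threshold):
    # single scan: each point joins the cluster with the closest centroid
    # if it is close enough, and otherwise starts a new cluster
    CC = [[points[0]]]                  # initially one cluster
    for p in points[1:]:
        q = min(range(len(CC)), key=lambda i: dist2(p, mean(CC[i])))
        if dist2(p, mean(CC[q])) < threshold:
            CC[q].append(p)
        else:
            CC.append([p])
    return CC
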
Features:
- Requires only a single pass over the data, so it is fast and suited to streaming input.
- The result depends on the order in which the points are processed.
- The threshold, rather than a preset number k, determines how many clusters emerge.