Clustering Algorithms

Any decompositional clustering algorithm can be made hierarchical by recursive application.

K-means algorithm

K-means-cluster (in S : set of vectors; k : integer)
{  let C[1] ... C[k] be a random partition of S into k parts;
   repeat {
            for i := 1 to k {
               X[i] := centroid of C[i];
               C[i] := empty 
              } 
             for j := 1 to N {    /* N = |S| */
                let q be such that X[q] is the closest of X[1] ... X[k] to S[j];
                add S[j] to C[q]
              } }
    until the change to C (or the change to X) is small enough.
}
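
A minimal runnable sketch of this loop in Python (an illustration, not a
specification): it initializes from k random points rather than a random
partition, uses squared Euclidean distance, and stops when the centroids
stop moving. The helper names dist2, mean, and kmeans are mine.

import random

def dist2(a, b):
    # squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    # componentwise mean of a nonempty list of vectors
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)      # random initial centroids
    for _ in range(max_iters):
        # assignment step: each point joins its closest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            q = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[q].append(p)
        # update step: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid)
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if all(dist2(a, b) < 1e-12 for a, b in zip(new, centroids)):
            break                             # change to X is small enough
        centroids = new
    return clusters, centroids

# e.g. kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)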

Example: (Note: I have started here with random points rather than a random partition.)

Features:
- Simple and fast: each pass takes O(kN) distance computations, and in practice only a few passes are needed.
- Produces a flat partition into exactly k clusters.

Problems:
- k must be fixed in advance.
- The outcome depends on the initial partition; the algorithm can converge to a poor local optimum.
- It favors compact, roughly spherical clusters of similar size, and is sensitive to outliers.

Variant 1: Update the centroids incrementally, as each point is assigned, rather than once at the end of each pass. This is claimed to give better results.
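
One way this might be realized (a sketch; the running-mean update, and the
assumption that each centroid starts out "carrying" one point, are mine,
since the notes do not fix the exact scheme). It reuses dist2 from the
K-means sketch above.

def kmeans_incremental_pass(points, centroids):
    # one pass: nudge a centroid toward each point as soon as the point
    # is assigned to it, instead of recomputing all centroids afterwards
    cents = [list(c) for c in centroids]
    counts = [1] * len(cents)        # assumption: each centroid carries 1 point
    for p in points:
        q = min(range(len(cents)), key=lambda i: dist2(p, cents[i]))
        counts[q] += 1
        for t in range(len(p)):      # running mean: c := c + (p - c)/n
            cents[q][t] += (p[t] - cents[q][t]) / counts[q]
    return [tuple(c) for c in cents]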

Variant 2: Bisecting K-means algorithm:

for I := 1 to K-1 do {
    pick a leaf cluster C to split;
    for J := 1 to ITER do
        split C into two subclusters C1 and C2, using 2-means;
    choose the best of these ITER trial splits and make it permanent;
}
This generates a binary clustering hierarchy, and is claimed to work better than plain K-means.
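
A sketch of this variant, reusing the kmeans function above with k = 2.
The split policy (always split the largest leaf) and the scoring of trial
splits by within-cluster squared distance are assumptions; only the leaf
clusters are returned here, though recording each split would give the
binary hierarchy.

def bisecting_kmeans(points, K, ITER=5):
    leaves = [list(points)]                  # start with one leaf: all points
    while len(leaves) < K:
        c = max(leaves, key=len)             # assumed policy: split largest leaf
        leaves.remove(c)
        best = None
        for _ in range(ITER):                # ITER trial 2-means splits
            (c1, c2), cents = kmeans(c, 2)
            score = (sum(dist2(p, cents[0]) for p in c1) +
                     sum(dist2(p, cents[1]) for p in c2))
            if best is None or score < best[0]:
                best = (score, c1, c2)       # keep the best split so far
        leaves += [best[1], best[2]]
    return leaves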

Variant 3: Allow overlapping clusters: if the distances from P to C1 and to C2 are close enough, put P in both clusters.
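
For instance, the assignment step of the K-means sketch could be replaced
by something like the following; eps is an assumed closeness margin, and
dist2 (squared distance) comes from the sketch above.

def overlapping_assignment(p, centroids, eps=0.1):
    # return the indices of every cluster that should contain p: all
    # centroids within a (1 + eps) factor of the distance to the closest
    # one (dist2 is squared, hence the (1 + eps)**2)
    d = [dist2(p, c) for c in centroids]
    dmin = min(d)
    return [i for i, di in enumerate(d) if di <= (1 + eps) ** 2 * dmin]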

For document clustering, the full mean vector would get very long (one component for each distinct word occurring anywhere in the documents). Solution: truncate the centroid to its M highest-weight terms (typically M = 50 or 100).
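
A sketch, assuming each document is represented as a sparse word -> weight
map:

def truncated_centroid(docs, M=50):
    # mean of sparse document vectors, keeping only the M highest-weight
    # components (docs is a list of dicts mapping word -> weight)
    totals = {}
    for d in docs:
        for word, wt in d.items():
            totals[word] = totals.get(word, 0.0) + wt
    n = len(docs)
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:M]
    return {word: s / n for word, s in top}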

Agglomerative Hierarchical Clustering Technique

{ put every point in a cluster by itself;
  for I := 1 to N-1 do {
     let C1, C2 be the most mergeable pair of clusters;
     create C, the parent of C1 and C2, and replace C1 and C2 by C;
  } }
Various measures of mergeability are used, e.g. distance between centroids, single link (closest pair of points), or complete link (farthest pair of points).
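
A sketch using closest centroids as the merge criterion (one of the
measures above), reusing dist2 and mean from the K-means sketch. It
returns the hierarchy as nested pairs, with the original points as leaves;
this naive version rescans all cluster pairs at every merge, where real
implementations use a priority queue.

def agglomerate(points):
    members = [[p] for p in points]      # the points in each live cluster
    trees = list(points)                 # the tree built so far for each
    while len(members) > 1:
        # find the most mergeable pair: the two closest centroids
        best = None
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                d = dist2(mean(members[i]), mean(members[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        members[i] += members[j]         # merge C1 and C2 ...
        trees[i] = (trees[i], trees[j])  # ... under a new parent C
        del members[j], trees[j]
    return trees[0]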

Characteristics:
- Deterministic: there is no random initialization, so repeated runs give the same hierarchy.
- Produces a complete cluster hierarchy, not just a flat partition.
- Expensive on large collections: the naive algorithm is O(N^2) in space and worse in time, since every merge step compares pairs of clusters.

Conflicting claims about quality of agglomerative vs. K-means.

One-pass Clustering

pick a starting point D in S;
CC := { { D } };   /* set of clusters: initially one cluster containing D */
for each Di in S - { D } do {
    C := the cluster in CC "closest" to Di;
    if similarity(C, Di) > threshold
      then add Di to C;
      else add { Di } to CC;
}
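
A sketch, again reusing dist2 and mean from the K-means sketch; since
dist2 is a distance rather than a similarity, the test is "less than
threshold" rather than "greater than threshold".

def one_pass_cluster(points, threshold):
    # single scan: each point joins the cluster with the closest centroid
    # if it is close enough, and otherwise starts a new cluster
    CC = [[points[0]]]                  # initially one cluster
    for p in points[1:]:
        q = min(range(len(CC)), key=lambda i: dist2(p, mean(CC[i])))
        if dist2(p, mean(CC[q])) < threshold:
            CC[q].append(p)
        else:
            CC.append([p])
    return CC
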
Features:
- Requires only a single pass over the data, so it is fast and suited to streaming input.
- The result depends on the order in which the points are processed.
- The threshold, rather than a preset number k, determines how many clusters emerge.