\section{Hierarchical Algorithms}
\label{sec:hierarchical}

Hierarchical clustering is a common technique used in clustering
algorithms. Hierarchical clustering algorithms, as the name implies,
treat each cluster as a hierarchy of subclusters. They can be further
divided into \emph{agglomerative} hierarchical algorithms, which start
with each point in a singleton cluster and repeatedly combine
clusters, and \emph{divisive} hierarchical algorithms, which begin
with every point in the same cluster and repeatedly divide clusters
into smaller ones. In general, these algorithms use some sort of
\emph{linkage metric} that measures the dissimilarity between two
clusters, 
%\emph{goodness metric} that evaluates the quality of the current
%clustering,
and select the clusters to combine or divide that maximize
this metric.

As we shall see, these algorithms are quite flexible with regards to
different measures of similarity. This makes them applicable to a
number of different situations, including cases where the data points
are categorial rather than numeric. Another advantage is that the
construction of clusters proceeds incrementally. Thus, if the desired
number of clusters $k$ is unknown, as it often is, the algorithm can
use a different termination criterion to determine when to stop
joining or dividing clusters.

Hierarchical algorithms are, however, limited by the fact that the
technique is fundamentally greedy: if a cluster is divided into two
subclusters, the algorithms will never reconsider the assignment of
points into each subcluster. It might, for example, prove optimal to
first split a cluster into two subclusters $A$ and $B$, but then
create a third subcluster containing the points from both $A$ and $B$
that are closest to the boundary. Hierarchical algorithms are unable
to recognize these types of cases.


\subsection{Single-link Hierarchical Clustering: SLINK}
\label{sec:hierarchical:slink}

The single link technique is the classic example of a hierarchical
clustering algorithms. It is an agglomerative hierarchical algorithm
based on the \emph{single-link} linkage metric.  That is, it defines
the similarity between two clusters as the distance between the two
closest points in each cluster:
\[ linkage(A, B) = \min_{\substack{x \in A \\ y \in B}} \; d(x,y) \]

A straightforward algorithm that implements this technique is as
follows \cite{olson-parallel-algorithms}: we begin by placing each
element in its own cluster. Throughout the algorithm, we will maintain
a (symmetric) matrix $L$ where $L_{i,j} = linkage(i,j)$, and an array
$N$ where $N_i$ is the cluster most closely linked to cluster $i$,
i.e. the $j$ that minimizes $linkage(i,j)$. Since the set of clusters
is initially the same as the set of data points, we can initialize the
arrays by setting $L_{i,j} = d(i,j)$ for each $i$ and $j$ in the data
set, and iterating over each row $i$ in $L$ to find the minimum
element, which gives the value of $N_i$.

Then, until the termination criterion is reached, we repeat the
following procedure: we find a cluster $i$ that minimizes the linkage
between it and its closest neighbor (which is $L_{i,N_i}$). Let $j =
N_i$. Then we merge cluster $j$ into $i$, removing cluster $j$ and
assigning its elements to $i$, and update the arrays. Now for every
$x$, the new values of $L_{x,i}$ and $L_{x,j}$ are $\min(L_{x,i},
L_{x,i})$, since the closest point to $x$ in the newly-merged cluster
is the closest point in either of its subclusters. Any $x$ that
previously had $N_x = j$ is updated to $N_x = i$.

Note that this algorithm requires $\bigO{n^2}$ space to store the
matrix. It also requires $\bigO{n^2}$ time, since each of the
$\bigO{n}$ merge steps requires $\bigO{n}$ time to update two rows of
$L$ and the array $N$.

The optimal implementation of this technique is the SLINK
algorithm, introduced by Sibson in 1973 \cite{slink}. It is a
refinement of the above procedure that requires only $\bigO{n}$ space
(but still $\bigO{n^2}$ time).

In addition to their high runtime, single-link clustering algorithms
suffer from a ``chaining effect'': if two clusters are
connected by a chain of points, they will be merged, since the
linkage between two clusters is computed by their closest points. This
happens even if the clusters are large and there are only a few points
connecting them --- since outlier points are common in large data
sets, this problem can occur frequently. \cite{cure}

% FIXME: figure here?

This problem can be minimized by defining the linkage metric as the
average distance between points in the two clusters (known as
\emph{all-points} clustering). This limits the effects of outliers,
but introduces a new problem: the clusters it creates are effectively
spheres around their mean, which becomes problematic if
the data set is not composed of spherical clusters but irregular
shapes.

% FIXME: another figure here? And a citation?

\subsection{Representative Points and Sampling: CURE}
\label{sec:hierarchical:cure}

The CURE algorithm \cite{cure} is another hierarchical agglomerative
algorithm that addresses the two major shortcomings of single-link
clustering. It increases scalability for large data sets by using
random sampling, and it provides improved handling for
irregularly-shaped clusters and outliers with a new linkage metric.
Rather than compute the linkage between two clusters as the distance
between their closest points, it computes for each cluster a set of
$c$ ``well-scattered'' representative points, and then scaling them
towards the mean of the cluster by a fixed factor; the linkage between
two clusters is then the distance between their closest representative
points.

A set of well-scattered representative points for a cluster can be
chosen by first choosing the point furthest from the mean, then
repeatedly choosing points furthest from the closest one of the
previously chosen points. This requires scanning over the set of
points $c$ times, giving a runtime of $\bigO{n}$, which is
undesirable since the set of representative points must be recomputed
each time two clusters are merged. An alternative is to compute the
$c$ representative points for a merged cluster from the $2c$
representative points from the two clusters it was formed from. This
sacrifices quality of the representative points to give a $\bigO{1}$
runtime, but not by much since the new points are chosen from points
that were well-scattered over the previous clusters.

The effect of choosing representative points is to capture the shape
of a cluster, while scaling them towards the mean limits the effects
of outliers. This avoids the problem that single-link algorithms have
with respect to chains of outliers connecting two otherwise disparate
groups. By taking into account the shape of the cluster, it is
possible to identify non-spherical shapes (such as elongated regions),
which would not be possible if only the mean was considered. However,
it still encounters problems if the shapes of the clusters are not
convex.

Efficiently computing a clustering using this linkage metric uses a
procedures similar to that described above for the single-link metric.
However, instead of storing a matrix of distances and array of closest
clusters, we now maintain a heap of clusters sorted by each cluster's
distance to the closest neighboring cluster, and a $kd$-tree of all
the representative points. With these data structures, identifying and
merging a pair of clusters requires extracting the minimum element
from the heap, then merging that cluster with its closest neighbor and
generating the new set of representative elements. To update the data
structures, the representative points for the pre-merge clusters are
removed from and the new representative points added to the $kd$-tree.
We then scan over all other clusters to determine which is closest to
the new merged cluster, and to recompute the closest cluster (using
the $kd$-tree) for all clusters that were previously closest to one of
the merged clusters; we update the heap accordingly. Since each data
structure operation requires $\bigO{\log n}$ time, each merge
operation requires $\bigO{n \log n}$ time, and the algorithm's total
runtime is $\bigO{n^2 \log n}$. Note that this is slower than the
SLINK algorithm, though for low-dimensional data sets CURE can be
shown to run in the same $\bigO{n^2}$ time bound.

CURE also employs the standard streaming technique of random sampling
to handle large data sets. This introduces a tradeoff between speed
and accuracy, but it is typically justifiable in light of the fact
that most clusters contain enough points that, with high probability,
every cluster will be sampled with a sufficiently large sample
size. An argument based on Chernoff bounds makes this reasoning
precise. Sampling also has the useful side-effect of reducing the
number of outliers, as they are less likely to be sampled than points
in clusters.

The results, as shown by an experimental comparison of CURE with other
hierarchical algorithms including single-link and all-pairs
\cite{cure}, are that single-link clustering is best for handling
non-spherically shaped clusters, but the data set must have
clearly-defined clusters without outliers. CURE trades off some
ability to detect arbitrarily-shaped clusters for robustness in the
face of outliers, so is more suitable for general data sets. The
results also suggest that sampling is a feasible technique for
increasing scalability without overly degrading result quality.


\subsection{Non-numeric and Categorical Data: ROCK}
\label{sec:hierarchical:rock}

The algorithms described above are all intended to find clusters in
sets of numeric data in metric spaces. Unfortunately, many data mining
problems involve data that is boolean or categorical. The classic
example for this is the \emph{market basket problem}: examining a
large database of supermarket transactions to determine clusters of
items that are frequently purchased together. In this case, each
purchase is a data point consisting of many Boolean variables, each
indicating whether a certain item was included in that purchase.

Of course, this problem can still be expressed as a numerical
clustering problem which has one dimension per item, where the data
points take value 1 in a certain dimension if the corresponding
transaction contains that item and zero otherwise. However, such data
sets generally have other characteristics that make standard
clustering algorithms produce poor results. For one, each transaction
includes only a small subset of the (perhaps thousands) of total
items. Thus, it is reasonable for a cluster to consist of a larger
class of items that are frequently associated, whereas each individual
transaction in the cluster contains only a small number of items in
that class. The effect of this is that two items that should be in the
same cluster may actually have no items in common, but rather be
``connected'' via other transactions that share many of the same
items.

None of the single-link, all-pairs, or CURE algorithms work well for
these types of data sets. The data set size is generally large enough
that there will be occasional transactions that contain items that appear in
different clusters, bridging the clusters and causing them to be
merged by a single-link algorithm \cite{rock}. The all-pairs and (to a
lesser extent) CURE algorithms use the cluster mean as a factor in
determining the linkage of clusters. This leads to a ``ripple effect''
\cite{rocktr}: as a cluster becomes larger, it contains increasingly
more points with more different items, and so the averaging process
leads to a mean that contains a smaller fraction of a greater number
of items. Then linkage metrics based on the mean essentially lose
their information about what distinguishes the items in the cluster,
and cannot tell the difference between data points that are largely
similar but differ on one or two items and other averaged data points
that differ on every item, but in small amounts.

The ROCK algorithm \cite{rock} is designed to handle data sets made up
of Boolean attributes; it also handles categorical data by treating
each categorical attribute as a set of Boolean attributes for each
possible value it can take on. It achieves this goal by defining a
different sort of linkage metric. As before, it assumes the existence
of a distance function $d(p,q)$ --- for the market basket example,
often this is the Jaccard coefficient $1 - \frac{\abs{I_p \cap
    I_q}}{\abs{I_p \cup I_q}}$, i.e. the fraction of items in the two
transactions that are not shared between them. It defines two data
points to be \emph{neighbors} if their distance is less than a
user-specified parameter $\theta$, and defines $link(p,q)$ to be the
number of common neighbors shared between two points $p$ and $q$. Then
the linkage function between two sets is based on the number of
cross-links between points in each set:
\[ linkage(A,B) = \frac
{(\abs{A} + \abs{B})^{1 + 2f(\theta)} - \abs{A}^{1 +
    2f(\theta)} - \abs{B}^{1 + 2f(\theta)}}
{\sum_{\substack{p \in A\\q \in B}}
  link(p,q)}\]
The denominator of this
linkage function is the number of cross-links, a fairly intuitive
measure of the linkage of two sets; the numerator requires a bit
more explanation. Using the number of cross-links alone is not
sufficient because a large cluster contains more points and thus more
possibilities for ``outlier'' cross-links to other clusters; large
clusters will then have a tendency to inappropriately consume smaller
clusters. The numerator of the linkage function attempts to minimize
this effect. The user-supplied function $f(\theta)$ is chosen so that
each point in cluster $C$ is expected to have approximately
$\abs{C}^{f(\theta)}$ links to other points in the cluster. Thus the
expected number of links between points in a cluster $C$ is
$\abs{C}^{1+2f(\theta)}$. So the numerator is the expected number of
cross-links between the two sets: the expected number of links between
the two sets together minus their expected numbers of links
individually. The linkage function is lowest, and so two clusters are
most closely linked, when they have many more cross-links than would
be expected from their size alone.

Whereas the previous hierarchical algorithms make decisions
``locally'', considering only the distance between points in a
cluster, ROCK uses a somewhat more global approach, taking into
account information about their neighbors in the form of the link
count. This provides improved result quality, but necessarily
increases the runtime cost. A necessary first step is to count the
number of links between each pair of data points, which can be
accomplished by assembling every point's neighbors into an adjacency
matrix then squaring it ($\bigO{n^{2.37}}$ time using a Strassen-like
algorithm). Alternatively, for the common case where each point has
few neighbors, the simple approach of calculating for each point its
set of neighbors then incrementing the link count for every pair of
neighbors gives the same result with a runtime of $\bigO{n m_a m_m}$,
where $m_a$ and $m_m$ are the average and maximum number of neighbors
per point.

Once the link counts are computed, an agglomerative algorithm largely
similar to that described in Section~\ref{sec:hierarchical:slink}. It
requires a local heap for each cluster containing all other clusters
it has links to, ordered by the linkage metric, and also a global heap
containing all clusters ordered by their lowest linkage metric. Until
the desired number of clusters has been reached, the minimum is
extracted from the global heap, giving the two clusters to be
merged. They are then merged and the link counts and heaps are
updated. Note that the number of links between any cluster and the
newly merged cluster is simply the sum of the number of links between
it and each of the two pre-merge clusters, so updating the link count
simply requires summing $\bigO{n}$ entries. The local heap must be
rebuilt and the global heap updated; this requires $\bigO{n \log n}$
time, so the algorithm's total runtime is $\bigO{n^2 \log n + n m_a
  m_m}$.

As with CURE, sampling is used to reduce the runtime of the algorithm
while still giving a result of acceptable quality.

The ROCK algorithm provides much better handling for categorical data
sets, a common class for which traditional clustering algorithms give
poor results. Besides the increased runtime, a disadvantage is the
increased number of input parameters. The user must specify not only
the desired number of clusters $k$ but also the threshold $\theta$ for
classifying a point as a neighbor and a function $f(\theta)$
that gives an estimate for the number of neighbors for a point. In
practice, however, the authors claim that $f(\theta)$ needs not be a
very good estimate \cite{rocktr}.
%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "paper"
%%% End: 

% LocalWords:  subcluster subclusters