\section{Overview}
\label{sec:overview}

As large databases of information become increasingly common, it is
frequently useful to be able to identify patterns in the data. This is
known as \emph{data mining}, and is a topic of considerable interest
for many companies. For example, a supermarket may wish to examine its
database of sales transactions in order to identify products that are
often purchased together or by the same customers, or to discover
relationships between consumer demographics and their sales
patterns. Network administrators may wish to monitor network traffic
and discover patterns in order to create a model of normal
behavior. When events occur that do not correspond to these normal
behavior patterns, it may signal an intrusion, and this can be easily
detected. Similarly, the government may wish to monitor association
between individuals; changes in communication patterns may indicate
terrorist activity.

A common technique for performing data mining is \emph{clustering}.
Clustering entails identifying groups of related data points from a
large data set. For clustering purposes, the data set is often treated
as a set of points in a metric space, with a distance function
$d(p,q)$ reflecting the dissimilarity between two points. For data
consisting of numeric values, it is often reasonable to express the
data set as points in a Euclidean space and use the Euclidean distance
metric to measure dissimilarity, though in some cases a different
distance function may be preferable.

There is no single algorithm that is best for clustering. For a given
data set, it is not clear what the optimal clustering assignment
should look like, let alone how to determine it efficiently. Hence,
there are a plethora of different clustering algorithms that give
different results and have different runtime characteristics. The
choice of which algorithm to use depends heavily on the nature of the
data set in question.


\subsection{Desirable Characteristics}
\label{sec:overview:desirable-characteristics}

Before we begin to consider the various classes of clustering
algorithms currently available, it is necessary to establish some of
the criteria for evaluating clustering algorithms. The following are
some of the characteristics that an ideal clustering algorithm would
have; clustering algorithms typically trade off some of these
capabilities in order to make others possible.

\begin{itemize}
\item \textbf{Scalability} \hspace{1em}
  The data sets used in clustering algorithms are generally very
  large, so an algorithm should deal well with them. Often this is
  accomplished by using the technique of \emph{sampling}: selecting a
  representative (usually random) subset of the data points and
  applying the algorithm to this subset rather than the full set.

\item \textbf{Robustness to outliers} \hspace{1em}
  The data sets are often noisy, containing outliers that do not fall
  into any clusters. A clustering should not be affected significantly
  by these outliers. The algorithm should also be able to handle
  a small number of outliers that lie between two well-defined
  clusters, connecting them with a ``bridge'' of points, without
  merging the two clusters.

\item \textbf{Ability to discover unusually-shaped clusters}
  \hspace{1em}
  Many algorithms assume that clusters will be spherical, consisting
  of all points whose distance from the cluster center is less than
  some value. Many data sets, however, have clusters that are
  non-spherical: they may be elongated, asymmetrical, or of some other
  unusual shape. Ideally, an algorithm should be able to discover
  these.
  
\item \textbf{Minimal input parameters} \hspace{1em}
  An algorithm should be able to function as independently as
  possible, requiring little or no parameters from the user. For
  example, many algorithms take as a parameter $k$, the number of
  clusters to be identified. This must be chosen with care, because
  identifying too few clusters will fail to capture the useful
  information in the data set, while identifying too many clusters
  will lead to overfitting. Some algorithms, particularly
  density-based and grid-based ones
  (Sections~\ref{sec:density-based}~and~\ref{sec:grid}), require
  further parameters.
  
\item \textbf{Flexibility towards non-numerical data} \hspace{1em}
  Some data sets have \emph{Boolean} or \emph{categorical} data
  instead of numerical data, where the values are limited to a small
  set of possible options (two, in the case of Boolean data). Note
  that the use of an appropriately-chosen distance function allows
  these to be treated in largely the same way as numerical
  data. However, as we will see in
  Section~\ref{sec:hierarchical:rock}, categorical data sets have
  other characteristics that make many algorithms not explicitly
  designed for them perform poorly.
\end{itemize}


\subsection{Outline}
\label{sec:overview:outline}

Having established the nature of the clustering problem and the types
of characteristics we would like to see in an ideal algorithm, we will
now proceed to take a tour of the design space for clustering
algorithms. Though there are certainly far too many clustering
algorithms to consider them all here, we describe four of the most
important classes of clustering algorithms, and examine two or three
representative algorithms for each one. In particular, we focus not
only on how each algorithm works, but on the tradeoffs it makes, and
which types of data sets it is well-suited for.

Section~\ref{sec:hierarchical} describes hierarchical algorithms,
which incrementally build the desired number of clusters either by
combining smaller clusters or dividing larger ones. By contrast,
partitioning algorithms (Section~\ref{sec:partitioning}) examine the
data space in order to identify the clusters directly. Density-based
algorithms (Section~\ref{sec:density-based}) are a special type of
partitioning algorithms that use the density of regions rather than
individual data points. Finally, grid-based algorithms
(Section~\ref{sec:grid}) divide the input space using a grid and process
grid cells rather than data points.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "paper"
%%% End: