
\section{Partitioning Algorithms}
\label{sec:partitioning}

Whereas hierarchical algorithms incrementally build clusters either by
agglomerating smaller clusters or dividing larger ones, partitioning
clustering algorithms attempt to directly identify clusters by
partitioning the entire data space.


\subsection{$k$-means}
\label{sec:partitioning:k-means}

We are obligated to begin with a mention of the $k$-means algorithm,
since it finds wide use in statistical and scientific applications. As
we shall see, however, it suffers from many of the problems previously
described. While popular in practice, from an algorithmic standpoint
it falls short.

The $k$-means approach is to represent each of the $k$ clusters by the
mean (centroid) of the points contained within it. Then the goal is to
minimize the sum of the distances from each point to the mean of the
cluster that contains it. This has an intuitive geometric meaning,
which is pleasing for numeric data, though it is not particularly
useful for categorical data. However, it uses the mean, which makes
the algorithm very sensitive to outliers since points that do not fall
into any cluster can still skew the centroid of the nearest cluster.
It also identifies only spherical clusters, which means it does not
work will with data sets containing clusters of non-uniform shapes.

The classic algorithm for solving this problem is Forgy's algorithm
\cite{forgy}, which begins by selecting the initial $k$ cluster
centroid assignments at random. It then iteratively improves the
clusters by assigning each point to the nearest centroid, then
recomputing the centroid of each cluster; this process is repeated
until no changes occur.

This is a hill-climbing algorithm, and suffers from the standard
problem with such algorithms: it will find a local minimum, not a
global minimum --- clearly a major problem from a correctness
standpoint. It is difficult to make any guarantees on the runtime,
since the number of iterative improvements required is unknown; in
fact, in some cases, the algorithm fails to converge entirely.


\subsection{$k$-median: Approximations}
\label{sec:partitioning:k-median}

A more useful variant on the $k$-means problem is the $k$-median
problem. As before, the goal is to choose $k$ cluster centers so that
the total distance from each data point to its nearest center is
minimized. Now, however, the centers are required to be actual data
points, rather than the mean of several data points. Note that this
differs from the related $k$-centers problem in that both involve
choosing $k$ points as centers, but $k$-median requires minimizing the
total (or, equivalently, average) distance between each point and its
center, whereas $k$-centers requires minimizing the \emph{maximum}
distance. 

This is an improvement over $k$-means because it provides improved
handling of outliers, in much the same way as the median of a set of
numbers is preferable to the mean. It does, however, still assume that
the clusters are spherical. Moreover, solving the problem is
non-trivial since it is $\NP$-complete. Thus we are forced to consider
approximation algorithms.

Charikar et al. \cite{k-median-approx} were the first to identify a
constant-factor approximation for the metric $k$-median problem.  The
approximation algorithm expresses $k$-median as an integer linear
program, with one set of Boolean variables indicating for all $i$
whether point $i$ is a center and another set indicating for all pairs
$\tup<i,j>$ whether point $j$ is assigned to the cluster centered at
point $i$. The relaxation of this integer linear program allows
``fractional centers,'' where the values of fractional centers still
sums to $k$. They then round the solution to this linear program first
into a $\set{\nicefrac{1}{2},1}$-integral solution (which can have
only full and half centers), and then into an integral solution,
giving a $6\; \nicefrac{2}{3}$ approximation of the optimal
solution. This algorithm, however, is of theoretical rather than
practical importance.

An algorithm commonly used in practice for computing an approximation
of the $k$-median clustering is PAM \cite{pam}. It is an iterative
refinement algorithm similar to Forgy's algorithm for $k$-means: it
selects $k$ cluster centers at random and assigns each point to the
closest center. It then repeatedly considers replacing each cluster
center with a different point in the cluster, and performs this
reassignment if it improves the sum-of-distances objective function.
As with Forgy's algorithm, this method has the major drawback that it
may find a local minimum instead of the global minimum. The iterative
refinement process also leads to a high runtime. CLARA \cite{pam} is
a related algorithm that performs better on larger sets by taking
several relatively small samples of the data set, applying PAM to each
of them, and taking the best result.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "paper"
%%% End: 

% LocalWords:  medioid
