
\section{Grid-based Algorithms}
\label{sec:grid}

While density-based algorithms are so called because of the criteria
they use for deciding if a point should be included in a cluster,
grid-based algorithms are named for the way in which they split up the
data to be processed.  These algorithms use a grid to divide the space
that the data points are in, and then perform calculations on
individual cells of the grid.  There are generally fewer cells in the
grid than there are data points in the input, and calculations are
performed on the cells rather than on the individual points.  This
decreases the amount of objects on which calculations need to be
performed, making the algorithm more efficient.

Both of the grid-based algorithms we will discuss here are also
density-based to some extent.  That is, after the algorithm applies
the grid to the space in question, the density of points in a given
cell determines whether it is in a cluster or not.  To explain this
concept further, we will first consider the CLIQUE algorithm.

\subsection{CLIQUE}
\label{sec:grid:clique}

CLIQUE \cite{clique} is both density and grid based.  It first
partitions the space into a grid of non-overlapping rectangular cells
by partitioning each dimension at equal length intervals.  For each
cell, the algorithm calculates the number of points in that cell, and
if that number is greater than a given fraction of the total number of
points, then that cell is \emph{dense}, and thus considered to be part
of a cluster.  The threshold fraction is a parameter to the algorithm,
as is the number of intervals that a dimension gets divided into.

CLIQUE first finds and
sorts all subspaces that are above the threshold by their densities.
Adjacent subspaces are combined until a whole cluster is contained in
the union of those spaces, and then parts of that subspace that are not
dense are pruned.  When the pruning is done, that subspace is defined
to be a cluster, and is expressed in the output as a disjunctive
normal form (DNF) expression.  The DNF output form allows the
algorithm to describe clusters that exist in multiple subspaces.  That
is, there are multiple clusters, each representing one relationship,
and they overlap to form a multi-dimensional cluster.  The DNF output
can be interpreted, showing to some extent what relationships were
involved in the formation of the cluster.

CLIQUE's run time is exponential in the highest dimensionality of any
cluster in the data set, but empirically it scales linearly with
input, and scales well with added dimensions.  It does particularly
well with sparsely populated, high-dimensional data.  The algorithm's
output is independent of the order in which data was input.  The
accuracy of CLIQUE's output can sometimes be compromised by the
simplicity of the algorithm.  Also, the threshold density value is a
fixed number, so CLIQUE can sometimes have trouble with varied
densities across clusters.  There have been some tweaks made on the
original algorithm to try to improve it, such as the MAFIA algorithm,
but in the interests of space, these will not be discussed here.


\subsection{WaveCluster}
\label{sec:grid:wavecluster}

While \proc{WaveCluster} \cite{wavecluster} is another grid-based and
density-based clustering algorithm, its workings are, in fact, quite
different from those of CLIQUE.  In particular, \proc{WaveCluster} is
unique from anything discussed thus far in its use of wavelets for
calculation.  The algorithms parameters are the wavelet to be used,
the number of transformations $k$ to be performed, and the number of
grid cells that each dimension will be divided into.  As with CLIQUE,
the space is first divided into equal-volume non-overlapping cells.
Then the given wavelet transform is applied to the cells, producing a
new space, also divided into cells.  In this new space, we find
densely populated cells, and any maximal group of densely populated
cells is a proposed cluster.  This is repeated $k$ times.

Wavelets are primarily used in signal processing.  They contain a
low-pass filter, which helps to remove outliers.  Though a full
discussion of the theory of wavelets is of course beyond the scope
this paper, the nature of wavelets is to suppress weak data and
enhance stronger data, so applying a wavelet to the input space will
make the natural clusters become more apparent.  Wavelet
transformation is fast, which allows the total run time to remain
linear in the input size.  Also, the multiple applications of the
wavelet allow the algorithm to find clusters of varying densities, as
each transformation will make increasingly fine clusters stand out.

\proc{WaveCluster}'s output is independent of the order in which the
data points were input.  It is not sensitive to noise, and can detect
clusters of arbitrary shape. It performs best in lower dimensions,
where there are more data points than there are cells in the
grid. This gives a linear runtime bound. However, since the number of
cells in the grid is exponential in the number of dimensions, for
higher dimensions the size of the input might not be larger, and the
linear bound no longer holds.



%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "paper"
%%% End: 
