
\section{Density-based Algorithms}
\label{sec:density-based}

Density based cluster algorithms are a special type of partitioning
algorithms, so named because of the criteria they use to determine if
a given point should be added to a cluster.  While other algorithms
typically just use the distance between two points to make such
decisions, these algorithms use the density of data points around the
point in question in making the choice to include.

\subsection{DBScan}
\label{sec:density-based:dbscan}

\proc{DBScan} is one of the best-known density-based clustering
algorithms.  It takes two parameters, $\epsilon$ and $\id{MinPts}$.  A
point is part of a cluster if a circle of radius $\epsilon$ centered
at that point contains at least \id{MinPts} data points.  If this is
not the case, then that point is instead treated as an outlier, and
discarded.  Two points $p_1$ and $p_2$ are \emph{density-reachable}
if the distance between them is less than $\epsilon$.  Points $p_0$
and $p_n$ are \emph{density-connected} if there exists a sequence of
density-reachable points connecting the two.  In this algorithm, a
cluster is defined as a set of points that are all density-connected
\cite{dbscan-tutorial}. When \proc{DBScan} runs, it expands outward
from existing clusters, testing points close to each cluster to
determine if they too are included in the cluster.  Then, it examines
other points to see if they are part of a new cluster, or if they are
simply outliers.

\proc{DBScan} has some major advantages over some of the clustering
algorithms that preceded it.  It is more effective at identifying
arbitrarily sized and shaped clusters than most other algorithms, and
is very successful in spatial clustering \cite{zaiane,spatial-clustering}. This is, in
% FIXME: what's spatial clustering?
fact, a general property of density-based algorithms.  Unfortunately,
for the same reason that it does well with more varied input shapes,
it also has problems with chaining effects, much like SLINK
(Section~\ref{sec:hierarchical:slink}).  If a bridge of outlier points
connects two clusters, \proc{DBScan} may be unable to distinguish the
two clusters, and will sometimes produce an output in which they are
combined into one large cluster. It is still more robust than
single-link algorithms, however, since $\proc{DBScan}$ will eliminate
outliers that are not surrounded by at least \id{MinPts} other points
within distance $\epsilon$. \cite{cure}

While \proc{DBScan} in its simplest form works best with spatial
clustering, it can be applied to other types of data sets as well
\cite{gdbscan}.  In the ordinary form of \proc{DBScan}, the condition
used to determine if two points are neighbors is simply the inequality
$d(p_1, p_2)< \epsilon$.  For other types of data sets, we can use
some other binary, symmetric, and reflexive function to judge whether
two data points are neighbors or not.  The rest of the algorithm
remains the same.  The result of this change is that now \proc{DBScan}
can be applied to higher-dimension data, or even to non-numerical data
\cite{gdbscan}.

As with many of the other parameterized algorithms, choosing
appropriate values for $\epsilon$ and \id{MinPts} can be difficult,
and having good results depends heavily on having those appropriate
values.  \proc{DBScan} will have particular trouble finding good
values for its parameters if the different clusters vary greatly in
density.  The choice of parameters can have fairly dramatic effects on
the clustering that is produced, ranging from no clusters at all to
one large cluster containing all the points in the data set
\cite{zaiane}.

\subsection{OPTICS}
\label{sec:density-based:optics}

OPTICS is a hierarchical variation of \proc{DBScan} which attempts to
alleviate some of its predecessor's weaknesses \cite{optics}.  In
particular, \proc{DBScan} has trouble dealing with data in which
different clusters have very different densities, because it is hard
to select a value for $\epsilon$ that will work for all of them.
OPTICS avoids this issue by using density-based cluster ordering.  It
orders partial clusters by their densities, beginning with the most
dense.  If a particular set of data points is a cluster when
$\epsilon$ is very small, that cluster has a high density.  If
$\epsilon$ is increased, the size of the cluster will likely also
increase, but it will still contain the original smaller cluster.

OPTICS gathers and stores data that allows the algorithm to recognize
clusters of varying densities when the final consolidation scan
occurs.  These modifications allow OPTICS to produce more accurate
outputs than \proc{DBScan} can.  It can also be used to analyze the
intrinsic clustering structure of the data input to it.  However, due
to the overhead from storage and extra computation OPTICS is actually
less efficient than \proc{DBScan}.  The long runtimes associated with
it mean that OPTICS cannot be used for high-dimensional data sets, or
for databases that are extremely large.  There have been some attempts
to improve on OPTICS' run time, but as is to be expected, such
attempts require that some accuracy be sacrificed.

\nocite{incremental-optics}
\nocite{codbu}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "paper"
%%% End: 
