\documentclass{article}
\usepackage{scribe}
\usepackage{amsmath}

\newcommand{\ceil}[1]{\ensuremath{\left\lceil #1 \right\rceil}}
\newcommand{\floor}[1]{\ensuremath{\left\lfloor #1 \right\rfloor}}
\newcommand{\bigO}[1]{\ensuremath{O\left(#1\right)}}
\newcommand{\littleO}[1]{\ensuremath{o\left(#1\right)}}
\newcommand{\bigTheta}[1]{\ensuremath{\Theta\left(#1\right)}}
\newcommand{\bigOmega}[1]{\ensuremath{\Omega\left(#1\right)}}
\newcommand{\littleOmega}[1]{\ensuremath{\omega\left(#1\right)}}

\begin{document}
\lecture{21}{11/15/2004}{Erik Demaine}{Dan Ports}{}

\begin{center}
  \Large{\textbf{External-Memory Algorithms}}
\end{center}


\section{The External-Memory Model}
\label{sec:external-memory-model}

Previously we have considered the costs of algorithms in the context
of the \emph{random-access machine} (RAM) model, where each elementary
operation and each memory access has a constant cost. While sufficient
for many purposes, this model does not always predict how algorithms
will behave on real memory. For example, if the entire data set does
not fit into main memory, segments will periodically have to be paged
in from disk; disk access times are so much larger that they dominate
the runtime. CPUs typically have one or more layers of cache that
are even faster than main memory. In general, modern systems have a
multi-layered memory structure, where the memory closest to the CPU is
fastest but smallest, and the memory farthest from the CPU is largest
but slowest.

The \emph{external-memory model}, introduced by Aggarwal and Vitter in
1988, is an alternative to the RAM model that takes into account disk
access time. It uses a two-layered memory structure, as shown in
Figure~\ref{fig:external-memory-structure}. The CPU has a fast
connection to the \emph{cache}, which has size $M$. The \emph{disk} is
much larger and much slower to access\footnote{For clarity, we
  exclusively use ``cache'' and ``disk'' to refer to the two types of
  memory. Some papers instead use ``cache'' and ``memory'' and others
  use ``memory'' and ``disk''; we use the term ``memory'' only to
  refer to memory in general, without regard for its location in the
  hierarchy.}. Both the cache and disk are divided into \emph{blocks}
of size $B$; the cache thus holds $\frac{M}{B}$ blocks, while the disk
can hold many more. A \emph{memory transfer} operation reads a block
from disk to cache, or writes a block from cache to disk. These are
necessary since the CPU can only operate directly on the contents of
the cache; blocks on disk must first be transferred to the cache
before they can be processed.

\begin{figure}[htbp]
  \centering
  FIXME
  \caption{The External-Memory Structure}
  \label{fig:external-memory-structure}
\end{figure}

We assume there is no cost to access the cache and perform operations
on it. Instead, the cost of an algorithm in the external-memory model
is the number of memory transfers required. Because disk seek times
are so much higher than memory access times, this cost captures the
runtime more realistically than the runtime given by the RAM model.

Note that all transfers take place in blocks of size $B$. Thus,
accessing one element in memory requires fetching the entire block
that contains it. The goal is to take advantage of locality: if nearby
memory locations are accessed together, then the block containing them
will only need to be transferred once. Any algorithm that has runtime
$T(N)$ in the RAM model clearly makes no more than $T(N)$ memory
transfers in the external-memory model. Our goal is to reduce this
amount. Ideally, we will be able to reduce it to $\frac{T(N)}{B}$.


\section{Scanning Algorithms}
\label{sec:scanning}

Scanning is a simple standard technique that takes advantage of
locality. A scanning algorithm processes the input by visiting each
element in order. For example, to find the smallest element in an
array of integers, we can scan over the array, keeping track of the
smallest element seen so far. Since memory locations are accessed
sequentially, adjacent memory locations are accessed together, and
each block needs to be transferred from disk to cache only once. Only
$\ceil{\frac{N}{B}} + 1$ memory transfers are required to visit $N$
elements. This is optimal.

A slightly more complicated example is the problem of reversing the
order of an array. Observe that this can be done by swapping pairs of
elements: the $i$th element and the $N-i$th element are
interchanged. We can implement this using blocks: we first read the
first and last blocks, reverse their elements and swap their
positions, then proceed to the second and penultimate blocks,
etc. This can be viewed as performing two scanning sequences in
parallel, one from the beginning of the array and one from the
end. The number of required memory transfers is $\bigO{\frac{N}{B}}$.

Note that the array reversal algorithm requires the cache to be large
enough to hold two blocks, i.e. $M \ge 2B$. This is not a problem. In
general, we can assume that the cache size is
$\bigOmega{N^{1+\epsilon}}$. This is known as the \emph{tall-cache
assumption} and is standard.


\section{B-Trees}
\label{sec:b-trees}

We now turn to the problem of building a comparison-based search tree.
B-trees are a data structure well-suited for use in the
external-memory model. Recall that a B-tree is a balanced search tree
where each node contains $b$ values and has a branching factor of
$b+1$. Thus, the tree has height $\bigTheta{\log_{b+1} N}$, and performs
operations in $\bigTheta{\log_{b+1} N}$ time. Why is this well-suited for
the external-memory model? We can choose $b$ to be a constant factor
less than $B$ such that the representation of each node fits in a block.
Then any operation requires $\bigTheta{\log_{B+1} N}$ memory
transfers.

Is this a desirable runtime? The following theorem from information
theory shows that in fact it is optimal:

\begin{theorem}
  Searching for an element $x$ in a sorted list of $N$ elements
  requires $\bigOmega{\log_{B+1} N}$ memory transfers.
\end{theorem}

\begin{proof}
  Suppose that we have an sorted array of numbers $1 \ldots N$ and we
  wish to discover where $x$ is located in this array. The most
  succinct encoding of the answer an index into the array, which has
  $\lg N + 1$ bits. Initially the cache is empty and we have no
  information. Each time a transfer from disk to cache is performed, we
  read at most $B$ elements. We therefore determine where $x$ fits
  into these $x$ elements; since they are sorted, the smallest
  encoding of the information determined is the $\lg B + 1$ bit index
  into the array. So we learn no more than $\lg B + 1$ new bits of
  information per memory transfer. Since we need to determine $\lg N +
  1$ bits of information, at least $\frac{\lg N + 1}{\lg B + 1} =
  \lg_{B+1} N$ memory transfers are required.
\end{proof}

\begin{corollary}
  Finding an element in a search tree requires $\bigOmega{\log_{B+1}
    N}$ memory transfers.
\end{corollary}

\begin{proof}
  The problem of searching for an element in a sorted list is
  reducible to the problem of finding an element in a search tree, so
  the same lower bound applies.
\end{proof}


\section{Sorting}
\label{sec:sorting}

Consider now the problem of comparison-based sorting. In the RAM
model, we can sort $n$ elements by inserting each of them into a
B-tree, then repeatedly extracting the minimum element until the tree
is empty. This requires $\bigTheta{n \log n}$ time, which we know is
optimal.

We can use the same approach in the external-memory model. Again there
are $n$ insert and extract-min operations, but now they each require
$\bigTheta{\log_{B+1} N}$ memory transfers, so the total cost of the
algorithm is $\bigTheta{N \log_{B+1} N}$. This is \emph{not} optimal
in the external-memory model.


\subsection{Mergesort}
\label{sec:sorting:mergesort}

The standard mergesort algorithm has desirable runtime characteristics
in the external-memory model because merging can be performed via a
scanning algorithm. We simply scan through each of the two input
arrays in parallel, adding the smallest element from either of the two
arrays to the output array and fetching new blocks as needed. This
requires $\bigO{\frac{N}{B}}$ time. Recall that mergesort divides a
problem of size $N$ into two sorting subproblems of size $\frac{N}{2}$
and a merge of size $N$. This gives the following recurrence:
\begin{equation}
  \label{eq:mergesort-recurrence}
  T(N) = 2T\left(\frac{N}{2}\right) + \bigO{\frac{N}{B}}
\end{equation}

\begin{figure}[htbp]
  \centering
  FIXME: mergesort recursion tree
  \caption{Recursion tree for mergesort}
  \label{fig:mergesort-recursion-tree}
\end{figure}

The recursion tree for this recurrence is shown in
Figure~\ref{fig:mergesort-recursion-tree}. Note also that once the
problem has been reduced to sorting an array of size $M$, we can solve
it in constant time, because all the work is done in cache. Thus,
there are $\frac{N}{M}$ elements on the last level of the recursion
tree, and so the tree has height $\lg \frac{N}{M}$. At each level, the
total work done in the merging steps sums to $\frac{N}{B}$ memory
transfers. So the solution to the recurrence and the total cost of the
algorithm is
\begin{equation}
  \label{eq:mergesort-recurrence-solution}
  T(N) = \bigO{\frac{N}{B} \log \frac{N}{M}}
\end{equation}

This is an improvement over B-tree sorting, but it is still not
optimal. We can make a small modification to the mergesort procedure
to improve it:


\subsection{$\frac{M}{B}$-way Mergesort}
\label{sec:sorting:m-b-way-mergesort}

Previously we considered a \emph{binary} mergesort algorithm: it
divides each problem of size $N$ into two subproblems of size
$\frac{N}{2}$. But we are not limited to a two-way division. We can
divide the problem into any number of subproblems. A cache of size $M$
holds $\frac{M}{B}$ blocks, so $\frac{M}{B}$-way division is ideal.
This is the largest choice that makes the scanning merge algorithm
possible, since one block from each sorted subproblem array can fit
into cache. (To be precise, we actually need a cache of size
$M+\bigO{1}$ since we must store the output block in cache too, but
this detail can be safely ignored.)  The recursion is now:
\begin{equation}
  \label{eq:m-b-way-mergesort-recurrence}
  T(N) = \frac{M}{B} T\left(\frac{N}{\frac{M}{B}}\right) +
  \bigO{\frac{N}{B}}
\end{equation}
\begin{figure}[htbp]
  \centering
  FIXME: M/B-way mergesort recursion tree
  \caption{$\frac{M}{B}$-way mergesort recursion tree}
  \label{fig:m-b-way-mergesort-recursion-tree}
\end{figure}
which leads to the recursion tree in
Figure~\ref{fig:m-b-way-mergesort-recursion-tree}. As before, once the
problem has been reduced to sorting an array of size $M$, we can solve
it in constant time. Each level of the recursion tree requires
$\bigO{\frac{N}{B}}$ work, and the height of the tree is
$\log_{\frac{M}{B}} \frac{N}{M}$. So the solution to the recurrence is
\begin{align}
  T(N) &= \frac{N}{B}\left(1 + \log_{\frac{M}{B}} \frac{N}{M}\right)
  = \frac{N}{B}\left(2 + \log_{\frac{M}{B}} \frac{N}{B}\right) \\
  &= \bigO{\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}}
\end{align}
This is the number of memory transfers required by the algorithm.


\subsection{Optimality}
\label{sec:sorting:optimality}

We have seen that $\frac{M}{B}$-way mergesort sorts $N$ elements using
$\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}$ memory transfers. Is this
optimal? As with B-trees, information theory shows that it is.

\begin{theorem}
  To sort $N$ elements, $\bigOmega{\frac{N}{B} \log_{\frac{M}{B}}
  \frac{N}{B}}$ memory transfers are required.
\end{theorem}

\begin{proof}
  We need to identify which of the permutations of the $N$ input
  elements is the sorted order. This is $\lg N!$ bits of
  information. But we actually need only learn $\lg \frac{N!}{B!}$
  bits of information, since once we read a block into cache, we can
  sort it instantly and its sorted ordering therefore comes for free.

  We can always keep the cache sorted, since operations on the cache
  incur no cost. Every time we read a block of $B$ elements, we
  determine where it fits in among the elements currently in
  cache. That is, we learn which of the interleavings of $B$ elements
  into $M$ elements ($M+B$ total) is the sorted order. There are
  $\binom{M+B}{B}$ such interleavings, so at most $\lg \binom{M+B}{B}$
  bits of information are obtained from each memory
  transfer. Therefore, the number of memory transfers we need to
  determine our requisite $\lg \frac{N!}{B!}$ bits of information is
  \begin{align}
    \frac{\lg \frac{N!}{B!}}{\lg \binom{M+B}{B}} &= \bigOmega{\frac{\lg
    \frac{N!}{B!}}{\lg \frac{(M+B)!}{M!B!}}} \\
    &= \bigOmega{\frac{N \lg \frac{N}{B}}{(M+B) \lg (M+B) - M \lg M - B \lg
B}} \\
    &= \bigOmega{\frac{N \lg \frac{N}{B}}{M \lg M + B \lg M - M \lg M - B
    \lg B}} = \bigOmega{\frac{N \lg \frac{N}{B}}{B \lg \frac{M}{B}}} \\
    &= \bigOmega{\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}}
  \end{align}
  So $\bigTheta{\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}}$ is optimal.
\end{proof}


\section{Buffer Trees}
\label{sec:buffer-trees}


\end{document}
