\documentclass{article}
%\input{6854-preamble}
\usepackage{scribe}

\begin{document}

\lecture{1}{11/15/2004}{Erik Demaine}{Dan Ports}

Aggarwal \& Vitter 1988

Alternative to RAM model. RAM model doesn't always predict how
algorithms will work on real memory.

CPU has fast connection to \emph{cache}. Cache size is M.
$\frac{M}{B}$ rows of $B$ columns. Main memory is much larger and much
slower to access. $B$ columns, but more rows.

No cost to access cache memory and perform operations on it. Goal is
to minimize number of memory transfers. This isn't exactly the
runtime, since runtime includes cache operations too.

Transfers take place in blocks of size $B$, so we want to take
advantage of locality. Accessing an element requires transferring the
entire block.An algorithm that has runtime $T$ can trivially be made to
use $\le T$ transfers, but we'd like to use less. $\frac{T}{B}$ is
ideal.

Scanning
Visit each element in order. 
%$\ceil{\frac{N}{B}} + 1$ memory transfers to visit $N$ elements. This
is optimal.

Reversing an array. Swap first and last elements. Can be performed
using blocks. Two parallel scans: one from the front, one from the
back. Can be done in $O(\frac{N}{B})$ time. Requires $M \ge 2B$ cache
size. This isn't a problem.

Often we'll use $M = \Omega(B^{1+\epsilon})$: tall cache assumption.

Search trees.
B-trees. nodes containing $b$ values, branching factor
$b+1$. $\Theta(\log_b N)$ height. Why is this good for external memory
algorithms? Choose $b$ to be (a constant factor less than) $B$, so we
can fit a node and all of its child pointers into a block. This gives
us $O(\log_{B+1} N)$ access time. Is this optimal?

Search problem: given $N$ elements, do some preprocessing, then find
the closest element to a given query. Normally $O(\log N)$ binary
trees is optimal; here we claim $O(\log_B N)$ is optimal.

Claim (information theoretic): searching for $x$ among a sorted list
of $N$ elements requires $\Omega(\log_{B+1} N)$ memory transfers.

Proof: write down sorted array 1 .. n
most succinct encoding of the answer is the index in the
array. Goal is therefore to learn $lg N+1$ bits. Initially cache is empty,
so we don't have any information. Each memory read reveals
at most $B$ elements that we haven't seen before. So we learn where
$x$ fits into these elements. Since they were already sorted, we learn
$\le \lg B$ bits. So we need $\frac{\lg N+1}{\lg B+1} = \log_{B+1}N$. 

So B-trees are optimal in terms of memory transfer. They can also be
made optimal in time by storing each node as a small balanced binary
search tree.


Sorting (comparison-based).

Bound for sorting is $\Theta(\frac{N}{B} \log_{\frac{M}{B}}
\frac{N}{B})$

In the RAM model, we can sort optimally by building a B-tree and
extracting the minimum element repeatedly. $O(n \log n)$. But this
isn't good in the external memory model: it gets us $O(N \log_{B+1}
N)$, which isn't quite good enough.

Merge sort.

Can divide a problem of size $N$ into two sorting subproblems of size
$\frac{N}{2}$, plus a merge. The merging can be done in
$O(\frac{N}{B})$ time using parallel scanning.
\[ T(N) = 2T(\frac{N}{2}) + O(\frac{N}{B} \]
Recursion tree starts at N/B, goes to 1/2 N/B x 2 + 1/4 N/B x 4 ...
up to depth M. N/B work on each level, log N/M levels.
\[T(N) = \frac{N}{B} \log \frac{N}{M}\]
Almost good, but we want a log base of M/B. How do we get this?

M/B-way mergesort (as opposed to binary mergesort).

Split a subproblem of size $N$ into $M/B$ subproblems of size
$\frac{N}{M/B}$. This is the correct size because it allows us to fit
the array in memory. (FIXME: explain better)
Recursion is
\[ T(N) = \frac{M}{B} T\left(\frac{N}{\frac{M}{B}}\right) +
O\left(\frac{N}{B}\right) \]
FIXME: recursion solution by recursion tree
\[ T(N) = \frac{N}{B}\left(1 + \log_{\frac{M}{B}} \frac{N}{M}\right)
\]
\[ T(N) = \frac{N}{B}\left(2 + \log_{\frac{M}{B}} \frac{N}{B}\right)
\]

Is this optimal? (in comparison model)

Need to learn $\lg N!$ bits. Actually just $\lg \frac{N!}{B!}$ since
we can sort a block instantly. FIXME: make sure this carries through
in the analysis.

Can always keep cache sorted. (It's free, might as well)
By reading $B$ elements, we learn where it fits in among the cached
elements: block read learns which of the $\binom{M+B}{B}$ possible
interleavings of the $B$ elements we have. Gives us $\lg
\binom{M+B}{B}$ bits. Need $\lg N!$ bits, so we need
\[ \frac{\lg N!}{\lg \binom{M+B}{B}}\]
accesses. $\lg N!$ is $N \lg N$ to a constant factor.
\[\lg \frac{(M+B)!}{M!B!} \approx (M+B) \lg (M+B) - M \lg M - B \lg
B\]
\[ \approx M \lg M + B \lg M - M \lg M - B \lg B \]
\[ = B \lg \frac{M}{B} \]
\[\frac{ N \lg \frac{N}{B}}{B \lg \frac{M}{B}} = \frac{N}{B}
\lg_\frac{M}{B}} \frac{N}{B} \]


Buffer tree: Arge 1995

Would like a data structure that's better for sorting.

$O\left(\frac{1}{B} \log_{\frac{M}{B}} \frac{N}{B}\right)$ amortized
memory transfers per insert, delete, batched search, and delete-min
operations. This is a rather impressive bound since it's usually less
than 1 (which is why it's an amortized bound).

What's a batched search? Make a request for a search, don't get an
immediate answer.

This is also a priority queue, but more general.


Looks like a B-tree. Branching factor is $M/B$. Leaves will have
$\Theta(B)$ elements and no buffer. For each internal node, also have a buffer of
size $M$. Requests are stored in the buffer before being pushed lazily
into children (standard data structures technique).

\begin{verbatim}
Insert(x):
  - Append x to the root buffer, with ``append'' insert classification
   and timestamp.
  - If root buffer is full,
      - sort the buffer. $O(M/B \lg_{M/B} M/B) = O(M/B)$.
      - distribute items in buffer to buffers of appropriate
        children %(possibly triggering overflows there)
        Dealing with $M$ elements, and $M/B$ children = $M/B$ buffers
        we might be touching. Need to pay O(M/B) to fetch buffers from
        memory and start writing. But we have M elements to charge it
        to. $O(M/B)$ as well.
      - check for children overflows

Overflow:
  - sort $\le M$ old elements in buffer
  - merge with newly added elements (which are in sorted order) just
    inserted from parent - $O(\frac{M}{B})$ time.
  - distribute to children
  - check for overflowing children

Overflow-leaf:
  - just split leaf like a M/B-tree.
  Happens rarely.
\end{verbatim}

Analysis:
  O(M/B) transfers per overflow involving at least M elements.
  How many times do we overflow per inserting $N$ elements?
  Each element is involved in log_{M/B} N/B overflows, since this is
  the height of the tree and it can only move down.
  Charge transfers per overflow to $M$ elements involved: 1/B
  per. Each element gets charged once per overflow it's involved in,
  so amortized cost per element is 1/b log_{M/B} N/B. This is the
  sorting bound divided by n

Flush: clear all buffers
  Can't use typical amortized analysis because buffers might not be
  full, but it can't take more time than it would if they did. time is
  O(N/B log_{M/B} N/B).

Delete: insert timestamped delete element into the root buffer and
propagate it down as before. When an insert and delete for the same
element (with timestamps ok) meet in some buffer, they cancel each
other out. In a leaf, do a B-tree delete rather than a B-tree delete.

Offline searches (batched searches):
 Insert searches into the buffer. If a search hits an identical
 insertion, that's the exact answer. Otherwise, when the search hits a
 leaf, do a b-tree search and return the closest.

Range-search(x,y): find all elements between x and y
O(1/B log_{M/B} N/B + \frac{\abs{output}}{B})
Insert x and y endpoints, and accumulate answers ``off to the side''

Delete-min:
 Extra buffer of $M$ min elements - always in cache (being sloppy with
 constant factors here).
 Delete-min: delete-min of buffer of min. OK until buffer
 empties. Then we need the next $M$ smallest elements. This is a
 search tree, so the minimum elements will be along the left
 spine. Walk down the left spine, flushing. Refills buffer. Also need
 to check for new mins on insert, but doesn't change runtime since all
 in cache.
\end{document}