CPU -> registers -> L1 -> L2 -> more caches -> main memory -> disk -> ...

The RAM and cell-probe models only count how many different cells we
access. In a real hierarchy, larger levels are slower, but they work
with larger blocks (better parallelism). Exploit locality in order to
minimize the number of blocks that need to be transferred.

External memory model (I/O model, disk access model - DAM) [Aggarwal &
Vitter 1988]
Captures two levels of hierarchy. 

Cache of M/B blocks, each holding B words (total size M), connected to
CPU via fat pipe - instantaneous transfer.

Also connected to disk, arranged in blocks of size B. Can read/write
blocks. Slow to transfer between disk and cache.

Goal: design algorithms that minimize the number of memory transfers.
If we've got a T(N) algorithm in the RAM model, we can trivially do it
in T(N) memory transfers (at most one per operation). We want to get
it down: T(N)/B is the best we can hope for, but usually hard to
achieve.
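
As a small illustration of the T(N)/B ideal, the sketch below counts the
blocks touched by a sequential scan. The function name and the `start`
offset parameter are assumptions made for this example; they are not
part of the model's definition.

```python
def scan_transfers(n, b, start=0):
    """Blocks of size b touched when reading n consecutive elements
    beginning at array offset `start` (hypothetical helper)."""
    if n == 0:
        return 0
    first_block = start // b
    last_block = (start + n - 1) // b
    return last_block - first_block + 1


# An aligned scan touches ceil(n / b) blocks; a misaligned one at most one more.
aligned = scan_transfers(1000, 64)
misaligned = scan_transfers(1000, 64, start=63)
```

A scan therefore costs about N/B transfers, whereas N accesses to
unrelated locations can cost up to N transfers.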

Searching
B-trees give us O(lg_{B+1} N)
Lower bound of Omega(lg_{B+1} N) for searching (comparison model).
- Information theory: lg(N+1) bits to discover, lg(B+1) per transfer.
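
Written out, the counting argument reads roughly as follows, assuming
each transferred block lets us compare the query against at most B
keys, i.e. distinguish at most B+1 outcomes:

```latex
\[
  \#\text{transfers}
  \;\ge\; \frac{\lg(N+1)}{\lg(B+1)}
  \;=\; \log_{B+1}(N+1)
  \;=\; \Omega(\log_{B+1} N).
\]
```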

Sorting
O(N/B lg_{M/B} N/B) - M/B-way mergesort.
matching lower bound in comparison model [A&V]
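
A minimal sketch of M/B-way mergesort, simulated in memory with an
idealized transfer count (every pass reads and writes each block once).
The function name, the parameters m and b, and the choice of fan-in
M/B - 1 (one cache block reserved for output) are assumptions for
illustration only.

```python
import heapq
import math


def external_mergesort(data, m, b):
    """Return (sorted_list, transfers) for cache size m and block size b."""
    n = len(data)
    blocks = math.ceil(n / b)
    transfers = 0
    # Run formation: load m elements at a time, sort in cache, write back.
    runs = [sorted(data[i:i + m]) for i in range(0, n, m)]
    transfers += 2 * blocks
    # Merge passes: one input block per run plus one output block in cache.
    fan_in = max(2, m // b - 1)
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        transfers += 2 * blocks
    return (runs[0] if runs else []), transfers
```

The number of merge passes is about lg_{M/B} (N/M), and each pass costs
O(N/B) transfers, which gives the stated bound.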

Permutation:
Rearrange N elements into some new order
Theta(min(N, N/B lg_{M/B} N/B))
N: pick up each element and move it to its new position
sorting bound: sort elements, sort permutation, undo permutation
Omega holds in the indivisible model (can't divide elements, but other
numbers can be split up etc)
Open problem: can we do better in a weaker model?

Sorting data structures:
Search trees can't be used for optimal sorting: N inserts into a
B-tree give O(N lg_{B+1} N) instead of O(N/B lg_(M/B) N/B)
Buffer trees give O(1/B lg_(M/B) N/B) amortized per operation:
insert, delete, delete-min, delayed search/range search


Cache-oblivious model [Frigo, Leiserson, Prokop, Ramachandran 1999]
Just like external memory, except algorithm doesn't know B or M
Doesn't explicitly manage memory -- is a RAM algorithm. Memory is
managed via automatic block transfers triggered by element access,
using the offline optimal block replacement policy. (In practice:
FIFO/LRU/... are 2-competitive given a cache of double the size.)
Will assume M >= cB for some sufficiently large constant c (but we
usually don't require it to be too big, so that's OK)
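
To make "automatic block transfers triggered by element access"
concrete, here is a small simulator. It uses LRU rather than the
offline optimal policy (a substitution made for simplicity), and the
class and function names are hypothetical. The algorithm being measured
only issues element indices and never sees m or b.

```python
from collections import OrderedDict


class LRUCache:
    """Counts block transfers for a stream of element accesses."""

    def __init__(self, m, b):
        self.capacity = m // b      # number of blocks that fit in cache
        self.b = b
        self.blocks = OrderedDict()
        self.transfers = 0

    def access(self, index):
        block = index // self.b
        if block in self.blocks:
            self.blocks.move_to_end(block)    # mark as most recently used
            return
        self.transfers += 1                   # miss: fetch the block
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used


def count_transfers(accesses, m, b):
    cache = LRUCache(m, b)
    for i in accesses:
        cache.access(i)
    return cache.transfers
```

For example, scanning 64 elements twice with b = 8 costs 8 transfers
when m = 64, but 16 when m = 32, because LRU evicts each block just
before it is needed again.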

Why cache-oblivious?
Nice clean model
Allows RAM algorithms to be used directly
Multilevel memory hierarchies captured. (can't do this cleanly in the
external memory model)

Results
B-tree: insert/delete/search in O(lg_{B+1} N) transfers [Bender,
 Demaine, Farach-Colton 2000], later simplified
Sorting in O(N/B lg_(M/B) N/B) [Frigo et al]
 requires tall-cache assumption: M=Omega(B^(1+\epsilon))
 tall-cache necessary in cache-oblivious (but not external-memory)
  [Brodal & Fagerberg 2003]
Priority queue - insert/delete/delete-min O(1/B lg_(M/B) N/B) [Arge,
  Bender, et al 2002]


Static search tree [Prokop 1999]
 Store N elts in order in a complete binary tree
 Cut tree at middle level of edges
  - \sqrt{N}+1 subtrees of \sqrt{N} elts
 Recurse on subtrees
 Concatenate - van Emde Boas layout
Claim: search uses O(lg_(B+1) N) mem transfers
Pf: the recursion continues, but we can stop the analysis when we
 reach a tree that fits in a block
 Look at level of detail (recursion) that straddles B: each subtree
 has size <= B, but the subtrees one level coarser have size > B.
  Each subtree is stored contiguously, so it spans <= 2 blocks:
  cost <= 2 to access.
  How many to access? Each one has height >= 1/2 lg B, so a
  root-to-leaf path visits at most (lg N)/(1/2 lg B) = 2 lg_B N.
  Total cost <= 2 * 2 lg_B N = 4 lg_B N = O(lg_(B+1) N)
Works for arbitrary height (not just 2^k)
Works for constant-degree (not 1) trees (not just binary)
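
The layout recursion can be sketched as below. Nodes are named by their
heap (BFS) index, with root 1 and children 2i and 2i+1; the function
names and the choice to round the top half's height down are
assumptions of this sketch, not something the notes fix.

```python
def veb_order(root, height):
    """Heap indices of the complete subtree of the given height rooted
    at `root`, listed in van Emde Boas order."""
    if height == 1:
        return [root]
    top_h = height // 2           # height of the top subtree (rounded down)
    bot_h = height - top_h        # height of each bottom subtree
    order = veb_order(root, top_h)
    # Bottom subtree roots: descendants of `root` exactly top_h levels down.
    first = root << top_h
    for r in range(first, first + (1 << top_h)):
        order.extend(veb_order(r, bot_h))
    return order


def path_blocks(height, b, leaf):
    """Distinct size-b blocks touched on the path from `leaf` to the root."""
    pos = {node: i for i, node in enumerate(veb_order(1, height))}
    blocks = set()
    node = leaf
    while node >= 1:
        blocks.add(pos[node] // b)
        node //= 2
    return len(blocks)
```

For a tree of height 4, veb_order(1, 4) gives
[1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]: the top triangle
first, then each bottom triangle stored contiguously.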

 
Ordered file maintenance
Problem: store N elements in order in an array of size O(N) (O(1)-size
gaps)
 Insert an element between 2 given elements, preserving order
 Delete an element
Black box: can do this by rearranging O(lg^2 N) consecutive elts,
amortized
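
The notes treat this structure as a black box. For intuition only, here
is a simplified insert-only sketch of the usual idea: find the smallest
aligned window that is not too dense and spread its elements evenly.
The class name, the leaf size of 4, and the density thresholds (1.0 at
the leaves, falling linearly to 0.5 for the whole array) are all
assumptions of this sketch, and it makes no claim to match the
black-box bound exactly.

```python
class OrderedFile:
    """Simplified insert-only ordered file: sorted array with gaps."""
    LEAF = 4  # smallest window size (hypothetical choice)

    def __init__(self):
        self.slots = [None] * (2 * self.LEAF)

    def items(self):
        return [x for x in self.slots if x is not None]

    def _max_density(self, width):
        # 1.0 for a leaf window, falling linearly to 0.5 for the whole array.
        level = (width // self.LEAF).bit_length() - 1
        top = (len(self.slots) // self.LEAF).bit_length() - 1
        return 1.0 - 0.5 * level / top

    def _spread(self, lo, width, values):
        # Rewrite slots[lo:lo+width] with `values` spaced evenly.
        self.slots[lo:lo + width] = [None] * width
        for j, v in enumerate(values):
            self.slots[lo + j * width // len(values)] = v

    def insert(self, value):
        """Insert value; return how many consecutive slots were rewritten."""
        n = len(self.slots)
        # Slot of the first stored element greater than value (linear scan
        # for brevity; a search structure would normally find this).
        p = next((i for i, x in enumerate(self.slots)
                  if x is not None and x > value), n - 1)
        width = self.LEAF
        while True:
            lo = (p // width) * width
            window = [x for x in self.slots[lo:lo + width] if x is not None]
            if len(window) + 1 <= self._max_density(width) * width:
                self._spread(lo, width, sorted(window + [value]))
                return width
            if width == n:            # whole array too dense: double it
                self.slots = [None] * (2 * n)
                self._spread(0, 2 * n, sorted(window + [value]))
                return 2 * n
            width *= 2
```

Each insert touches one contiguous window, which is what makes the
update cost a scan of that window.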


Dynamic search tree: [Bender, Duan, Iacono, Wu 2002]

Build vEB tree with each leaf corresponding to an array slot in the
ofm structure
Each internal node stores max of its children (ignoring empty slots)
Search in O(lg_(B+1) N) - at each node, look at the max stored in the
left child to decide whether to go left or right
Insert(x):
 - search(x) -> pred or succ = where to insert in ofm
 - insert into ofm - changes O(lg^2 N) cells
 - update corresponding leaves and propagate maxima up in post-order
   traversal of changed leaves and their ancestors
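
A minimal sketch of the max-tree over array slots, with search via the
left child's max and a post-order update of a changed range. For
brevity the nodes are kept in heap order rather than in the vEB layout,
and the number of slots is assumed to be a power of two; the class and
method names are hypothetical.

```python
class MaxTree:
    """Each node stores the max of the non-empty slots below it."""

    def __init__(self, slots):
        self.n = len(slots)                        # assumed power of two
        self.node = [None] * self.n + list(slots)  # root at index 1
        for i in range(self.n - 1, 0, -1):
            self.node[i] = self._max(self.node[2 * i], self.node[2 * i + 1])

    @staticmethod
    def _max(a, b):
        if a is None:
            return b
        if b is None:
            return a
        return max(a, b)

    def successor_slot(self, x):
        """Index of the slot holding the smallest key >= x, or None."""
        if self.node[1] is None or self.node[1] < x:
            return None
        i = 1
        while i < self.n:
            left = self.node[2 * i]
            # Go left exactly when the left subtree contains a key >= x.
            i = 2 * i if left is not None and left >= x else 2 * i + 1
        return i - self.n

    def update_range(self, slots, lo, hi):
        """Slots lo..hi-1 changed: refresh leaves and ancestors in post-order."""
        self._update(slots, lo, hi, 1, 0, self.n)

    def _update(self, slots, lo, hi, i, start, width):
        if hi <= start or start + width <= lo:
            return                                 # subtree untouched
        if width == 1:
            self.node[i] = slots[start]
            return
        half = width // 2
        self._update(slots, lo, hi, 2 * i, start, half)
        self._update(slots, lo, hi, 2 * i + 1, start + half, half)
        self.node[i] = self._max(self.node[2 * i], self.node[2 * i + 1])
```

Usage: build the tree once from the ofm array, then after the ofm
rewrites slots lo..hi-1 call update_range(slots, lo, hi).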
 
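Top part (from the least common ancestor of the changed leaves up to
the root) costs O(lg_B N)

Claim: if k cells change, cost is O(lg_B N + k/B)
Pf: Consider level of detail straddling B again. Look at the bottom
two levels of subtrees of size <= B.
Within these two levels the update can be done by scanning: only need
to store the current ofm block, the current bottom block of the tree,
and the next-from-bottom block of the tree.
A two-level piece has size J > B, so scanning it costs
O(J/B + 1) = O(J/B); summed over the changed pieces this is O(k/B) for
the bottom two levels.
After the bottom two levels the cost is also O(k/B), since each piece
of size J > B has been reduced to 1 node.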

Now have B-tree with insert (and delete, equivalently) in O(lg_(B+1) N
+ (lg^2 N) / B), search in O(lg_(B+1) N)

Can get rid of (lg^2 N)/B via indirection:
cluster elements into groups of size Theta(lg N)
store min of each group in the ofm-based structure above
rewriting a group costs O((lg N)/B) <= O(lg_B N)
a group may need to be split after lg N inserts, but this can be
amortized away

Now O(lg_(B+1) N + (lg N)/B) = O(lg_(B+1) N) amortized updates.
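
The last equality can be checked directly, using only the fact that
lg(B+1) <= B for every B >= 1:

```latex
\[
  \frac{\lg N}{B}
  \;\le\; \frac{\lg N}{\lg(B+1)}
  \;=\; \log_{B+1} N ,
  \qquad\text{so}\qquad
  \log_{B+1} N + \frac{\lg N}{B} \;\le\; 2\log_{B+1} N .
\]
```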