\documentclass{article}
\input{6830-preamble}
\psetheadertitle{Lab}
\usepackage{fancyvrb}
\DefineShortVerb{\<}
\begin{document}
\psetnum{2}
\date{2007/10/16}


\begin{pset}
  \section{Design Decisions}
  \label{sec:design}

  In the interest of minimizing duplication and maximizing parsimony
  of code, I took advantage of the fact that both the data pages and
  the directory of a <ExHashFile< can be viewed as an sequence of
  tuples, much like a <HeapFile<. Accordingly, I generalized
  <HeapFile< to allow it to be used as such. Specifically, I augmented
  it to support not only the standard <HeapFile< interfaces
  (<addTuple<, <deleteTuple<, <iterator<, etc.) but also
  \emph{random-access} operations, making it possible to read or set a
  particular tuple in a particular page, by number.

  What this change is actually trying to represent is the
  decomposition of a <HeapFile< into a more generic structure
  representing a file containing a sequence of tuples. Had I more
  time, fewer conference deadlines, and a willingness to make sweeping
  interface changes while still maintaining compatibility with the
  test classes\footnote{In other words, if I were Austin...}, I would
  have divided this into two parts, one representing the details of
  serializing pages of tuples and providing the random-access
  operations listed above, and one implementing the semantics of a
  <HeapFile< (\emph{i.e.\ }finding an empty space and inserting a new
  tuple there).

  With this expanded <HeapFile<, implementing the <ExHashFile< is
  fairly straightforward. The hash file is represented as two separate
  heap files, one storing the data tuples, and one storing the
  directory. The directory is represented as a set of two-element
  tuples. The first tuple stores the global level of the hash file,
  and the rest of the tuples store, for each hash bucket (from 0 to
  $2^\text{globalLevel} - 1$), the per-page level and the
  corresponding page in the data file where the tuples for that bucket
  are actually stored.

  The advantage of this use of <HeapFile<s to implement <ExHashFile<s
  is that many of the hash file operations can be passed down to the
  underlying <HeapFile<. For example, the <ExHashFile< does not have
  to deal with page data formats. Tuple deletions and sequential-scan
  iterator requests are passed directly to the underlying data
  <HeapFile<. Adding a tuple requires hashing the tuple's key and
  using the directory to look up the appropriate data page, then
  passing the tuple to that data page's <addTuple< method. Looking up
  a tuple by key follows largely the same pattern, finding the
  appropriate data page via the directory, and obtaining an iterator
  for it; the iterator is wrapped to filter out other elements that
  may have different keys but hash to the same bucket. Rehashing and
  incrementing the global level are implemented in the standard ways.

  \section{Benchmark}
  \label{sec:benchmark}

  To evaluate the effectiveness of the indexing, the benchmark
  performed a join of the provided outer ($115 \times 7000$) and inner
  ($2 \times 7653$) tables using a nested loops join (with both tables
  as heap files) and an index nested loops join (with the inner table
  as a hash file and the outer as a heap file). The number of page
  I/Os and the elapsed wall-clock time were measured. The table below
  shows the results for several sizes of the buffer pool (number of
  pages that can be loaded at once).
  
  \begin{tabular}{|r|r|r|r|r|}
    \hline
    ~ & \multicolumn{2}{c|}{\textbf{NL Join}} &
    \multicolumn{2}{c|}{\textbf{Index NL Join}}\\
    \textbf{Buffer Pool Size (pages)} & \textbf{IOs (pages)} &
  \textbf{Time (s)} & \textbf{IOs (pages)} &
  \textbf{Time (s)} \\
  \hline
  8 & 105875 & 41.621 & 5410 & 2.266 \\
  16 & 7445 & 6.016 & 2996 & 1.439 \\
  32 & 905 & 3.863 & 978 & 0.805 \\
  64 & 905 & 3.736 & 911 & 0.694 \\
  \hline
  \end{tabular}
\paragraph{}
These results show that the index nested loops join consistently
outperforms the nested loops join, not too surprisingly. The nested
loops join has terrible performance for a very small buffer cache, in
which the inner table does not fit into memory (8 or 16 pages) and
constantly has to be reloaded, but performs more reasonably once it
does fit into memory. The index join's performance is also worse when
the inner table does not fit into memory, but not nearly to the same
degree. Interestingly, for larger buffer caches (32 or 64 pages), the
nested loops join requires fewer IOs than the index join, but still
requires more time, presumably because of the CPU cost of repeatedly
scanning the entire index, even if it's in memory.


  \section{Hours}
  \label{sec:hours}

  This lab was completed during the span of a DFW-PDX flight, plus
  several of the less-exciting SOSP presentations. Probably about 8-10
  hours total.
\end{pset}
\end{document}
