\documentclass{article}
\input{../6830-preamble}
\def\bptree{B$^+$-tree}
\collab{Collaborators: $\set{\texttt{amdragon}, \mathtt{iyzhang}}$}
\usepackage{fancyvrb}
\DefineShortVerb{\|}
\begin{document}
\psetnum{4}
\date{2007/11/29}

\begin{pset}
  \begin{problem}
    This query can best be performed using a hybrid hash join over the
    two tables, after applying the appropriate selection predicates,
    since the tables do not fit into memory.
  \end{problem}

  \begin{problem}
    Performing the filtering requires reading 2000 MB of data. Since
    the selectivity is 0.5, the size of each resulting table is 500
    MB. The hash join requires an additional reads and a write for
    each filtered table, minus the 400 MB of available memory, so
    $\frac{(2000\text{ MB} + 2*((1000 - 400) \text{ MB}))}
    {10 \text{MB/s}} = 320$ sec.
  \end{problem}

  \begin{problem}
    Each node should filter |R| and |S| on their respective
    predicates, and repartition the result, transmitting the resulting
    tuples to the node found by computing a hash (using the same
    function) over |R.c| and |S.d|. Those nodes should then perform
    the join (which now fits in memory) and send the results to the
    client.
  \end{problem}

  \begin{problem}
    We'll assume a uniform hash function, such that tuples are
    uniformly distributed among the nodes in the system. So each node
    has 200 MB of each table. Performing the filtering requires 2 * 200 MB
    / 10 MB/s = 40 seconds. Assuming that the repartitioning is also
    uniform, each node will need to ship $\nicefrac{4}{5} \cdot 200
    = 160$ MB, which takes 16 seconds. Since there are only 200 MB to
    join at each node, everything fits in RAM and the join happens
    instantly. Finally, the 200 MB of joined tables must be sent back
    to the client, which takes 20 seconds, so the total time is $40 +
    16 + 20 = 76$ seconds.
  \end{problem}

  \begin{problem}
    Again, each node will filter |R| and |S| on their respective
    predicates. However, repartitioning on |R.c| and |S.d| is not
    necessary since the tuples are already hashed on these fields. So
    each node will then simply perform the join (in memory) and send
    the results to the client.
  \end{problem}

  \begin{problem}
    As before, performing the filtering requires reading the tables,
    which takes 40 seconds. Now there is no cost to performing the
    actual join (repartitioning is not necessary, and each node's
    portion still fits in memory). Shipping the 200 MB of results at
    each node back to the client requires 20 seconds. So the total
    time is $40 + 20 = 60$ seconds.
  \end{problem}

  \begin{problem}
    We'll assume that the network can be unreliable, so messages can
    be lost or delivered out of order. We know that transaction |T2|
    has committed, since |COMMIT| records for it appear in the logs of
    the coordinator and worker 2. However, we cannot tell whether |T1|
    or |T3| are going to commit. Since the coordinator has written a
    |COLLECTING| record and the worker nodes have written |PREPARE|
    records, we know the coordinator sent out and the workers received
    |PREPARE| requests. However, the worker nodes might have responded
    either with a commit or an abort, or might not have responded
    yet. Since none of them have written either |COMMIT| or |ABORT|
    messages, there is no way to tell whether the transactions have
    committed or aborted.
  \end{problem}

  \begin{problem}
    \textbf{False}. If a table is very small, say containing only one
    row (but multiple columns), performing a query on all columns
    would require reading only one page in a row-oriented database,
    but would require reading one page per column in a column-oriented
    database.
  \end{problem}

  \begin{problem}
    \textbf{False}. If the table contains only distinct values, then
    run-length encoding will increase its size.
  \end{problem}

  \begin{problem}
    \textbf{False}. C-Store uses snapshot isolation for read-only
    transactions instead of locks.
  \end{problem}

  \begin{problem}
    \textbf{False}. C-Store supports a standard SQL interface, meaning
    that the query language is supports physical independence, so
    unmodified queries can be executed. If they are read-only
    transactions, particularly ones that use only a few columns, they
    are likely to be faster.
  \end{problem}
\end{pset}
\end{document}
