% -*- TeX-master: "report.tex"; TeX-PDF-mode: t -*-

The client-server protocol divides into three separate, but
interacting sub-protocols: A protocol for filling and maintaining
client-side caches, a protocol for committing read/write transactions,
and a protocol for assigning read-only transaction timestamps.

\subsubsection{Cache Coherence}
\label{sec:algorithm:protocol:cache}

% Key points
% * The server maintains a multiversion block store, which each client
%   maintains a local cache of
% * Because of its multiversion nature, the unit of storage---a
%   version---is essentially immutable, except that the latest version
%   of a block, which has an unknown upper bound in time, can be
%   replaced, and thus gain a known upper bound.  Asynchronous
%   deprecation messages are used to keep client caches up to date
%   about the upper bounds of versions
% * Updates always occur in monotonically increasing order, so client
%   caches keep track of a \emph{global least upper bound} (GLUB), a
%   timestamp before which no further updates can occur.  (Moved to
%   read-only transactions, since that's really what the GLUB affects,
%   even though it's maintained by the cache)

\begin{figure*}[ht]
  \centering
  \includegraphics{protocol.pdf}
  \caption{Protocol example with two clients.  (1)~Initially, the server
    is storing two blocks spanning times~1 to~3.  Both blocks have a
    single historical version and no clients are in either block's
    holder sets.  (2)~Client~1 requests A at time~2.  The server
    replies with the historical contents of A and the time range of
    that version.  (3)~Client~2 requests the latest version of B.  The
    server replies with the current contents and time range of B.
    Client~2's cache will now contain an unbounded copy of B, so the
    server adds client~2 to the B's holder set.  (4)~Client~1 requests
    B at time~2.  This is the current version of B, so the server adds
    client~1 to B's holder set.  (5)~Client~2 performs a commit,
    indicating which version of B it read and proposing a
    new version of B.  After validating the commit request, the server
    replies with the time assigned to the transaction.  Because the
    latest version of B was replaced, the server removes all of the
    clients from the holder set except the committing client and sends
    deprecations to these clients.}
  \label{fig:protocol}
\end{figure*}

The server maintains a multiversion block store, such as the one
depicted in step~(1) of Figure~\ref{fig:protocol}.  Each block in the
block store consists of a set of \emph{versions}, each of which is
valid over some non-overlapping range of time.  The current version of
a block is \emph{unbounded}, meaning that its upper bound is unknown
until a write to that block installs a new version at some later time.

Each client maintains a local cache with a structure similar to that
of the server's block store.  This cache provides two operations for
reading versions:
\begin{itemize}
\item \func{get}{\id{blockid}, \id{timestamp}}, which retrieves the
  contents of the block version whose range includes the given
  timestamp.  This may be the current version or a historical version
  of the block.
\item \func{getLatest}{\id{blockid}, \id{transaction}}, which
  retrieves the current version of a block.
\end{itemize}

If the requested version is not present in the client-side cache, it
forwards the request to the server, which responds with the contents
and time range of the appropriate version.  A typical historical
request can be seen in step~(2) of Figure~\ref{fig:protocol}.  If the
requested version turns out to be unbounded, then the server will add
the client that issued the request to \emph{holder set} for that
block.  In the example in Figure~\ref{fig:protocol}, this happens for
block B in steps~(3) and~(4).

Because of the block store's multiversion nature, a given version is
essentially immutable with the one proviso that an \emph{unbounded}
version can become a \emph{bounded} version when its upper bound
becomes fixed.  The holder set of a block tracks exactly which clients
have an unbounded version of a block in their cache so that when that
version becomes bounded, the server can immediately notify these
clients of this change via a \emph{deprecation} message.  For example,
step~(5) of Figure~\ref{fig:protocol} shows the deprecation message
sent to client~1 when a new version of B is installed at timestamp~4.

\subsubsection{Read/Write Transactions}
\label{sec:algorithm:protocol:rw}

% Key points
% * Read/write transactions use an optimistic write protocol with
%   validate-on-commit.
% * Asynchronous deprecations allow for active aborts, which not only
%   saves CPU time over regular optimistic concurrency, but admits a
%   simpler commit protocol.
% * Asynchronous deprecations make it practical for read/write
%   transactions to optimistically read from the local cache.
%
% I think these need to be emphasized better, or maybe briefly
% introduced somewhere

Read/write transactions use a protocol based on a mix of standard
optimistic concurrency
control~\cite{kung81:_optim_method_for_concur_contr} and multiversion
concurrency
control~\cite{bernstein81:_concur_contr_in_distr_datab_system}.  Block
reads are satisfied optimistically from the latest, unbounded version
of blocks in the client's cache whenever possible.  Block writes go to
a per-transaction side-store.  Upon commit, the read and write sets of
the transaction are sent to the server, which
\begin{enumerate}
\item Validates that all of the block versions read by the transaction
  are still unbounded.  This ensures that the transaction can be
  placed next in the serial ordering because it is consistent with the
  latest version of the block store.
\item Assigns the transaction a timestamp $\alpha$ one greater than
  that of previously committed transaction.  The timestamp indicates
  that the transaction read from the block store at time $\alpha$ and
  created the contents of the block store at time $\alpha+1$.
\item Installs the blocks from the transaction's write set at time
  $\alpha+1$.  If this causes existing block versions to become
  bounded, then the server will send out deprecations, as described in
  the previous section.
\end{enumerate}
This write protocol ensures global serializability of read/write
transactions.

When creating new blocks, the client generates a \emph{random} block
ID, which it transmits to the server with the commit request.  If the
block ID collides with an existing block ID, then the server will
abort the request.  Otherwise, the block write proceeds as usual.
Block ID's are considered sparse enough that this should rarely occur
in practice.  Our implementation, for example, uses 64-bit integers,
making the probability of a conflict negligible.

Our algorithm goes beyond the standard validate-on-commit approach of
optimistic concurrency control by employing \emph{active aborts}.
Because the cache is actively kept coherent by asynchronous
deprecation messages, the validation condition given above can be
checked continuously on the client side.  Specifically, if a
deprecation arrives for a block in an active transaction's read set,
then it is impossible for that transaction to pass validation, so it
can be aborted as soon
as possible.  Validation on commit is still necessary, however, as a
critical deprecation message and a commit message may have crossed
paths on the network; however, once a transaction requests a commit,
it is very likely that the commit will succeed simply because the
window during which such a conflict could arise is very small.

The standard optimistic approach of waiting until commit time to
detect conflicts has a number of drawbacks.  First, by waiting to
abort a transaction, work is wasted in the event of a
conflict~\cite{liskov99:_provid_persis_objec_in_distr_system}.
Second, in a distributed setting where the commit data does not reside
on the node performing validation, transactions must either send
committed data to the server in the commit request, or they must
engage in a multistage protocol to first validate the commit before
sending the data.  The first approach wastes network resources if the
commit fails, while the second approach incurs latency and may stall
other transactions if the commit succeeds.  Active aborts, on the
other hand, make it reasonable to use a single-stage protocol that
sends committed data along with the commit request because of the low
probability of a conflict.

Furthermore, asynchronous deprecations also permit better optimism
and cache usage when retrieving blocks.  A pessimistic alternative to
the read protocol given above is to employ a cache update protocol
that always contacts the server when requesting the latest version of
a block instead of assuming the cached copy is up-to-date.  This would
ensure that a read/write transaction always operates on the latest
copy of the data.  However, doing so incurs a full network round-trip
for every block read.  With asynchronous deprecations, the transaction
will know that it was overly optimistic and should therefore abort
within half the time of a network
round-trip of retrieving the block from the local cache.  Thus, the
only disadvantage of optimistically reading cached data is that events
during the half round-trip window following the read could force the
transaction to abort where the pessimistic read protocol would
simply have blocked.  However, we assume that write conflicts are
sufficiently rare in the workload that the trade-off is worthwhile.

\subsubsection{Read-Only Transactions}
\label{sec:algorithm:protocol:ro}

% Key points
% * Read-only transactions choose a timestamp and then operate at that
%   timestamp
% * The timestamp is chosen based on locally available information in
%   order to ensure local causality and conflict freeness
% * Network communication is sometimes, though not always, necessary
%   to ensure freshness.

Read-only transactions operate in a model similar to snapshot
isolation~\cite{berenson95:_critiq_of_ansi_sql_isolat_level}.  A
read-only transaction is assigned a timestamp as soon as it starts and
always reads versions from that timestamp.  Choosing a particular
timestamp to operate at ensures the global serializability of
read-only transactions with respect to read/write transactions.  Since
read-only transactions do not have observable effects, they are always
serializable with respect to each other.

Choosing an appropriate timestamp is critical both to ensuring the
properties laid out in Section~\ref{sec:algorithm:properties} and for
the efficiency of read-only transactions.  The more flexibility
afforded in the assignment of a timestamp, the better locally cached
data can be taken advantage of.

In order to guarantee local causality, the timestamp must be greater
than the largest timestamp assigned to any committed local read/write
transaction.  Because the read/write protocol assigns timestamps in
increasing order, it is sufficient to simply use a timestamp greater
than that of the last locally committed read/write transaction.  Note
that, because read-only transactions do not have effects, multiple
read-only transactions can be assigned the same timestamp.

The write protocol presented in the previous section ensures that
updates to the server's block store always occur with monotonically
increasing timestamps.  This means that each client can keep track of
a \emph{global least upper bound} (GLUB)---a timestamp before which no
further updates can occur---by keeping track of the timestamp of the
last deprecation message or local commit.  Because monotonicity
guarantees that read/write transactions cannot be assigned a timestamp
less than the GLUB, assigning a read-only transaction a timestamp less
than or equal to the GLUB is sufficient to ensure conflict-freeness
between read-only transactions and read/write transactions.

Furthermore, because local writes update the GLUB, the window of
allowable timestamps always contains at least one timestamp, read-only
transactions never have to block until another transaction widens the
window.

\paragraph{Freshness.}

While assigning timestamps less than or equal to the GLUB is
\emph{sufficient} to guarantee conflict-freeness, it is not
\emph{necessary}.  In particular, if the last GLUB update occurred
more than $\epsilon$ seconds ago, this may be too restrictive to
ensure freshness.  Thus, each client keeps track of the
\emph{wall-clock} time of the last update to its GLUB.  When a
transaction with $\epsilon$ freshness is started, if the GLUB is more
than $\epsilon$ seconds old, the client requests a GLUB update from
the server before assigning the transaction a timestamp.  On a
relatively active client with a large cache, such updates may not be
necessary very often because of frequent GLUB updates from other
communication with the server.

Assuming the network latency is less than $\epsilon$, assigning the
transaction a timestamp of the GLUB received from the server is
sufficient to guarantee freshness. If the network latency is greater
than $\epsilon$, then it is impossible to simultaneously provide
freshness and conflict-freeness in any system.

% This is interesting, but I don't have time to think about it:
%
% There's a distinction between begin causality and commit causality.
% RW transactions are always globally commit causal, and RO
% transaction can be globally begin causal by reducing epsilon to 0,
% but they can't be globally commit causal without resorting to making
% them RW transactions

% LocalWords: LocalWords timestamp multiversion serializability
% LocalWords: freeness serializable acausality GLUB