\chapter{Consistency Protocol}
\label{cha:consistency-protocol}

This chapter describes the protocol that our system uses to ensure
transactional consistency. The protocol is based on a notion of
\emph{validity intervals}, which indicate the range of times at which
a version of an object was current. These reflect the fact that
cached objects are valid not just at the instant they were computed
but at a range of times, allowing the system to combine cached objects
that were computed at different times as long as they are consistent
with each other.

To ensure transactional consistency when using cached objects,
\txcache uses a versioned cache indexed by validity interval. We
ensure that the storage layer can compute an associated validity
interval for each query result, describing the range of time over
which its result was valid. The \txcache library tracks the queries
that a cached value depends on, and uses them to tag the cache entry
with a validity interval. Then, the library provides consistency by
ensuring that, within each transaction, it only retrieves values from
the cache and database that were valid at the same time.


\section{Validity Intervals}
\label{sec:consistency-protocol:validity-intervals}

The basic observation that makes this protocol possible is that every
object is valid not just at a single point in time -- the time at
which it was generated -- but at a \emph{range} of times, representing
the time range during which it was the most recent version. This
observation makes it possible to combine two cached objects that were
computed at different times as long as the data that they rely on has
not changed.

We consider each read/write transaction to have a \textbf{timestamp}
that reflects the ordering in which that transaction's changes became
visible to other transactions. These timestamps must have the property
that, if two transactions have timestamps $t_1$ and $t_2$, with $t_1 <
t_2$, then any transaction that sees the effects of transaction $t_2$
must also see the effects of transaction $t_1$. For example, the
wall-clock time at which a transaction commits can be used as a
transaction timestamp, as can a logical timestamp that simply
indicates the order in which transactions commit.%\footnote{This
%  assumes the storage layer uses a \emph{cascadeless} concurrency
%  control protocol -- that is, one that does not allow transactions to
%  see uncommitted changes. Such protocols are standard and ubiquitous.}

The \textbf{validity interval} of an object (more precisely, of a
\emph{version} of an object) is a pair of timestamps defining the
range of time at which that version was current. The lower bound of
this interval is the time at which that object became valid, \ie the
commit time of the transaction that created it. The upper bound is a
time after which the value may no longer be current, \ie
the commit
time of the first subsequent transaction to change the result.

\begin{figure}[tbp]
  \centering
  \includegraphics[width=0.9\textwidth]{figures/validity-intervals}
  \caption{Example of validity intervals for two objects}
  \label{fig:consistency-protocol:validity-intervals:example}
\end{figure}

For example,
\autoref{fig:consistency-protocol:validity-intervals:example}
shows two objects and their validity intervals. Object A was created
by the transaction that executed at timestamp 10 and deleted by the
transaction that committed at timestamp 14. We write its validity
interval as $[10,14)$. Object B has two versions, with validity
intervals $[11, 13)$ and $[13, 16)$. The transaction at timestamp 13
updates B's value, creating a second version.

Every object's validity interval has a lower bound -- the time at
which that object became valid. For some objects, the upper bound --
the time at which the object is no longer valid -- is known. However,
for other objects the upper bound may lie in the future; this is the
case for objects that are currently valid. We refer to validity
intervals with a known upper bound as \emph{bounded} and those where
the upper bound is unknown as \emph{unbounded}.

To simplify the explanation of the protocol, we assume in this chapter
that all validity intervals are bounded. That is, we assume the system
has perfect foresight and knows the timestamp at which an object
becomes invalid, even when that timestamp lies in the future.  In
\autoref{cha:invalidations}, we show how to correctly account for
objects that are still valid by using \emph{invalidations}:
notifications from the database that certain values are no longer
current.

\subsection{Properties}
\label{sec:consistency-protocol:validity-intervals:properties}

Validity intervals have the following properties, which our protocol
makes use of.

\begin{itemize}
\item \textbf{intersection}: if a new result is computed by reading
  two or more objects, the validity interval of the
  resulting object is the intersection of their validity
  intervals. We use this property to determine the validity interval
  of cached objects that depend on multiple data objects from the
  storage layer.
\item \textbf{subset}: if an object is valid over an interval $[a,b)$,
  it is also valid over any subinterval, \ie any interval $[c, d)$
  where $c \ge a$ and $d \le b$. This seemingly-trivial property is
  useful because it means that it is acceptable to use
  conservative estimates of validity
  intervals when they cannot be accurately computed.
\item \textbf{uniqueness}: there is exactly one version of any object
  at any given time, \ie if two versions of the object have different
  values, they must have non-intersecting validity intervals. The
  converse is not true; it is acceptable to have multiple versions,
  valid at different times, that have the same value.
\end{itemize}

Most importantly, validity intervals are useful because they allow us
to ensure that the system provides transactional
consistency. In particular, they allow us to ensure the following
property:
\begin{itemize}
\item \textbf{consistent reads}: if all objects read by a read-only
  transaction have validity intervals that contain timestamp $t$, then
  the transaction sees the same data it would if its reads were
  executed on the database at time $t$.
\end{itemize}
The \txcache library uses this property to ensure that all objects
read from either the cache or the database reflect a consistent view
of the system state.  As we discuss in \autoref{cha:serializability-appendix},
this property is sufficient to ensure that the cache provides
serializable behavior when used with a database that provides
serializability, or snapshot isolation when used with a database that
provides that.

A generalization of this property is that if the intersection of the
validity intervals of all objects read by a transaction has a
non-empty interval, then that transaction is consistent with the data
at \emph{some} point.


% \section{Consistency Approach}
% \label{sec:consistency-protocol:approach}

% \begin{outline}
% \item The basic approach:
% \item get the storage layer to report validity intervals for any data
%   we access
% \item have the cache library track these, intersect them, and attach
%   the corresponding interval to everything in the cache
% \item then assign a timestamp to each transaction and make sure it
%   only reads objects from the cache or DB whose validity interval
%   contains that timestamp
% \end{outline}

\section{A Versioned Cache}
\label{sec:consistency-protocol:cache}

\txcache stores the results of application computations in its cache,
distributed across the set of cache
servers. \autoref{sec:programming-model:implementation} described the
basic architecture of the cache, which presents a hash table interface
to the \txcache library. In order to provide support for transactions,
we extend our cache to be \emph{versioned}.  In addition to its key,
each entry in the cache is tagged with its validity interval, as shown
in \autoref{fig:cache-intervals}. The cache can store multiple cache
entries with the same key; they will have disjoint validity intervals
because only one is valid at any time. The cache stores this
information in a hash table, indexed by key, that points to a 
tree of versions indexed by their validity
interval.

\begin{figure}[tp]
  \centering
  \includegraphics[width=0.9\textwidth,keepaspectratio]{figures/cache-intervals.pdf}
  \caption[An example of versioned data in the cache.]{An example of versioned data in the cache. Each rectangle
    is a version of a data item. For example, the data for key 1
    became valid with commit 51 and invalid with commit 53, and the
    data for key 2 became valid with commit 48 and is still valid as
    of commit 55. There are two versions of the data with key 3; the
    gap between their validity intervals suggests that there is at
    least one more version that is not in the cache.}
  \label{fig:cache-intervals}
\end{figure}

\begin{figure}
  % XXX TABLE
  \begin{itemize} 
 \addtolength{\leftmargini}{1.5em}
 \itemindent=-1.5em 
  \item \command{store}(\cmdarg{key, value, validity-interval, basis}) :
    Add a new entry to the cache, indexed by the specified key and
    validity interval. The \cmdarg{basis} argument is used to track
    data dependencies for the
    invalidation mechanism in \autoref{cha:invalidations}.
  \item \command{lookup}(\cmdarg{key, timestamp}) $\to$ \cmdarg{value,
    interval} :
    Returns the value corresponding to \cmdarg{key} whose validity
    interval contains the specified timestamp, if it is resident in
    the cache.
  \item \command{lookup}(\cmdarg{key, interval}) $\to$ \cmdarg{value, interval} :
    Returns the value corresponding to \cmdarg{key} whose validity
    interval intersects the specified interval, if it is resident in
    the cache.
  % \item \command{invalidate}(\cmdarg{timestamp, tags}) :
  %   Update the validity interval for any items with the specified
  %   invalidation tags. This process is discussed in
  %   \autoref{cha:invalidations}.
  %   {\color{red} \bf XXX This is listed for completeness only; maybe
  %     leave this out here?}
  \end{itemize}

  \caption[Cache interface used by the \txcache library to interact
  with cache servers]{Cache interface. This API is used by the
    \txcache library to interact with cache servers.}
  \label{fig:cache-interface}
\end{figure}

\autoref{fig:cache-interface} shows the interface that the cache
provides to the \txcache library. The \txcache library can insert
objects into the cache using the \command{store} command. Now, in
addition to providing the key-to-value mapping it is adding, it must
also provide the associated validity interval. This creates a new
entry in the cache. If the cache already contains an existing entry
for the same key, it must either have a disjoint validity interval
(\ie the new entry is a distinct version) or the data of both versions
must match (\ie the new entry is a duplicate of the existing entry).
Our cache server rejects any request to store a new version of an
existing object with different data but an overlapping validity
interval, it rejects the request and issues a warning; this indicates
an error, typically caused by the application attempting to cache a
non-deterministic function. (This warning exposed a bug in the \rubis
benchmark, as we discuss in \autoref{sec:eval:rubis}.)

To look up a result in the cache, the \txcache library sends a
\command{lookup} request with the key it is interested in and either a
timestamp or range of acceptable timestamps. The cache server returns
a value consistent with the library's request, \ie the result of a
\command{store} request whose validity
interval contained the specified timestamp (or intersects the given
range of acceptable timestamps), if any such entry exists. The server
also returns the value's associated validity interval. If multiple
such values exist, as could happen if the \txcache library specifies a
range of timestamps, the cache server returns the most recent one.

Note, however, that cache \command{lookup} operations may fail even if
the data was previously stored in the cache. The cache is always free
to discard entries if it runs out of space, or during failures. It is
not a problem if data is missing: the cache contents are soft state,
and can always be recomputed. In our
versioned cache, a cache eviction policy can take into account both the
time since an entry was accessed, and its staleness. Our cache server
uses a least-recently-used replacement policy, but also eagerly
removes any data too stale to be useful.


\section{Storage Requirements}
\label{sec:consistency-protocol:storage-requirements}


Validity intervals give us a way to manage versioned data in the
cache. But where do these validity intervals come from? Cached objects
are derived from data obtained from the storage layer (\eg the
database), so their
validity intervals must also ultimately derive from information
provided from the storage layer. Furthermore, the \txcache library
must be able to integrate with the storage layer's concurrency
mechanism in order to ensure that cached data is consistent with any
information accessed directly from the storage layer. \txcache,
therefore, imposes two requirements on the storage layer:

\begin{itemize}
\item whenever the storage layer responds to a query, it must also
  indicate the result's validity interval
\item the storage layer must allow the \txcache library to control
  which timestamp is used to run read-only transactions
\end{itemize}

The first requirement is needed so that the library can compute the
validity intervals of cached objects. The second requirement is needed
because a transaction might initially access some cached data, but
then miss in the cache and need to obtain data from the database. To
ensure that the data obtained from the database is consistent with the
cached data previously read, the \txcache library must control which
snapshot is used on the database. It does not need this control for
read/write transactions, however, as these transactions always bypass
the cache and operate on the latest state of the database.

This section specifies the interface that that \txcache library
expects from the storage layer.  \autoref{cha:storage-layer} shows how
to modify an existing system to support these requirements.

\begin{figure}[tbp]
 \addtolength{\leftmargini}{1.5em}
 \itemindent=-1.5em 
  \begin{itemize} 
  \item \command{begin-ro}(\cmdarg{timestamp}) :
    Start a new read-only transaction running at the specified timestamp.
  \item \command{begin-rw}() :
    Start a new read/write transaction. This does not require a
    timestamp; all read/write transactions operate on the latest
    state of the data.
  \item \command{commit}() $\to$ \cmdarg{timestamp}:
    End the current transaction
  \item \command{abort}() :
    End the current transaction
  \end{itemize}
  \caption{Concurrency control API requirements for the storage layer}
  \label{fig:storage-requirements-ideal}
\end{figure}

\autoref{fig:storage-requirements-ideal} shows the concurrency control
API that the \txcache library requires from the storage layer. It
requires the usual functions for grouping operations into transactions
-- \command{begin}, \command{commit}, and \command{abort} functions --
with a few additions. When starting a transaction, the library will
indicate whether that transaction is read-only or read/write. For
read-only transactions, the storage layer must allow the library to
specify the timestamp at which the transaction runs. \txcache doesn't
specify a particular interface for how the application queries the
storage layer (for example, a block store and a relational database
have very different interfaces) but does require that every query also
return an accompanying validity interval.

The storage layer API requirements are similar to \txcache's cache
server API (\autoref{fig:cache-interface}): both expose to the
\txcache library control over query timestamps, and return validity
intervals on queries. However, there are some notable
asymmetries. First, the cache takes a timestamp as an argument to its
\command{lookup} function, whereas the storage layer has an explicit
\command{begin-ro} function that starts a read-only transaction. The
active transaction is tracked as connection state, and subsequent
queries on the same connection are therefore performed relative to the
specified timestamp.  We use this interface because it is the standard
transaction interface for relational databases; it also allows us to
avoid modifying the syntax for query operations, which could be
complex.  The second difference is that cache \command{lookup}
operations can specify a range of acceptable timestamps, but here
transactions are restricted to run on a specific timestamp. Again,
this is partially a pragmatic choice: it matches the way databases are
implemented. But there is also a more fundamental reason: some data
may not be available in the cache at a given timestamp, so it is
useful to specify a range of acceptable timestamps; the storage layer,
however, always has access to \emph{all} data for a given timestamp.

\subsection{Pinned Snapshots}

% \begin{figure}[tbp]
%   \begin{itemize} 
%   \item \command{begin-ro}(\cmdarg{timestamp}) :
%     Start a new read-only transaction running at the specified timestamp.
%   \item \command{begin-rw}() :
%     Start a new read/write transaction. This does not require a
%     timestamp; all read/write transactions operate on the latest
%     state of the data.
%   \item \command{commit}() $\to$ \cmdarg{timestamp}:
%     End the current transaction
%   \item \command{abort}() :
%     End the current transaction
%   \item \command{pin}() $\to$ \cmdarg{timestamp} :
%     Pin the current snapshot, allowing future read-only transactions
%     to use it. Return the timestamp
%     that identifies this snapshot.
%   \item \command{unpin}(\cmdarg{timestamp}) :
%     Unpin the specified snapshot.
%   \end{itemize}
%   \caption{Concurrency control API requirements for the storage layer,
%   pinned snapshot version.}
%   \label{fig:storage-requirements-pinned}
% \end{figure}

Ideally, the storage layer would provide the ability to run a
read-only transaction at any arbitrary timestamp that the \txcache
library specified in the \command{begin-ro} call. This requirement can
be met by some multiversion storage systems; for example, we describe
the design of a block store that provides this interface in
\autoref{sec:storage-layer:blockstore}.

In practice, however, it isn't always possible to modify an existing
storage system, such as a relational database, to support this
interface. For reasons we discuss in
\autoref{sec:storage:mvcc:pinned-snapshots}, providing access to a
particular version may require the database to record and retain
additional metadata; the associated cost can render it infeasible to
provide access to \emph{all} previous versions. As a concession to
practicality, therefore, we allow the storage layer to provide a more
restricted interface that can be easier to implement.

In this more restricted interface, the storage layer allows the
\txcache library to start transactions only on certain timestamps in
the past, not arbitrary ones. We refer to these timestamps as
\emph{pinned snapshots}. Only these timestamps can be used as
arguments to \command{begin-ro}. If the storage layer provides this
interface, it must provide an interface that creates a pinned snapshot
from the current database state. \txcache includes a module (the
``pincushion'', described in \autoref{sec:storage:pincushion}) that
creates pinned snapshots on a regular basis (\eg every second) and
ensures that all application servers know the timestamps of all pinned
snapshots. The \txcache library on an application server can also
request that a new pinned snapshot be created at the latest timestamp,
reflecting the current database state.

\section[Maintaining Consistency with Validity Intervals]{Maintaining Consistency with\\Validity Intervals}
\label{sec:consistency-protocol:consistency}

Validity intervals and timestamps make it possible for the system to
ensure serializability. The consistent reads property of validity
intervals
(\autoref{sec:consistency-protocol:validity-intervals:properties})
indicates that if all objects read by the application during a
transaction have validity intervals that contain timestamp $t$, then
the transaction is serializable at time $t$. The \txcache library uses
this property to ensure transactional consistency of all data accessed
from the cache or the database. It does so using the following
protocol. For clarity, we begin with a simplified version where
timestamps are chosen when a transaction begins
(\autoref{sec:consistency-protocol:basic}). In
\autoref{sec:consistency-protocol:lazy}, we describe a technique for
choosing timestamps lazily to take better advantage of cached data.

\subsection{Basic Operation}
\label{sec:consistency-protocol:basic}

When a transaction is started, the application specifies whether it is
read/write or read-only, and, if read-only, the staleness limit.  For
a read/write transaction, the \txcache library simply starts a
transaction on the database server, and passes all queries directly to
it; it does not use the cache for the duration of this transaction.
At the beginning of a read-only transaction, the library selects a
timestamp to run the transaction at. If the storage layer uses the
pinned snapshot interface, the selected timestamp must correspond to a
pinned snapshot. If necessary (\ie if no sufficiently recent snapshots
exist), the library may request a new snapshot on the database and
select its timestamp.

The library can delay beginning a read-only transaction on
the database (\ie sending a \command{begin-ro} operation) until it
actually needs to issue a query.  Thus, transactions whose requests
are all satisfied from the cache do not need to connect to the
database at all. When it does start a transaction on the database, the
\txcache library specifies the timestamp it selected, ensuring that
any results obtained from the database are valid at that timestamp.

When the application invokes a cacheable function, the library checks
whether its result is in the cache. To do so, it identifies the
responsible cache server using consistent hashing, and sends it a
\command{lookup} request that includes the transaction's timestamp,
which any returned value must satisfy.  If the cache returns a
matching result, the library returns it directly to the application.

If the requested data is not in the cache -- or if it is available in
the cache but not for the specified timestamp -- then the application
must recompute the cached value. Doing so requires making an upcall to
the application to execute the cacheable function's implementation.
As the cacheable function issues queries to the database, the \txcache
library accumulates the validity intervals returned by these queries.
The final result of the cacheable function is valid at all times in
the intersection of the accumulated validity intervals.  When the
cacheable function returns, the library marshals its result and
inserts it into the cache, tagged with the accumulated validity
interval.% and any invalidation
%tags.


\subsection{Lazy Timestamp Selection}
\label{sec:consistency-protocol:lazy}

\begin{figure}[tbp]
  \centering
  \includegraphics[width=0.9\textwidth]{figures/intervals-staleness}
  \caption[An illustration of the importance of timestamp
    selection]{An illustration of the importance of timestamp
    selection. All four objects in the cache are available to
    transactions running at timestamp 51, but a transaction running at
    timestamp 54 can use only one of them, and a transaction running at
    timestamp 47 can use none.}
  \label{fig:consistency-protocol:timestamp-selection}
\end{figure}

Which timestamp is selected for a read-only transaction can have a
significant impact on
performance. \autoref{fig:consistency-protocol:timestamp-selection}
illustrates: if a transaction is run at the latest available timestamp
54, one of the four objects in the cache can be used. If, however,
the transaction were run at the earlier timestamp 51 -- which might
still be consistent with the application's freshness requirement --
all four of the objects depicted could be used. A poor choice of
timestamp could also be even more limiting: if timestamp 47 were
chosen, no cached objects could be used. 

But how should the \txcache library choose the timestamp for a
read-only transaction? Above, we assumed that the library chooses the
transaction's timestamp when the transaction starts. Although
conceptually straightforward, this approach falls short because the
library does not have enough information to make a good decision. It
does not know what data is in the cache or what its validity intervals
are, as this information is available only on the cache
nodes. Furthermore, because our programming model support arbitrary
computations, the library also does not even know \emph{a priori} what
data the transaction will access: the requests the application makes
from the cache may depend on the results of previous requests.
Lacking this knowledge, it is not at all clear what policy would
provide the best hit rate.


\begin{figure}[tbp]
  \centering
  \subfigure[\small Initially, the application is willing to accept any data
  with timestamps that satisfy the application's freshness
  requirement. Here, that is the interval $[47, 55)$, as indicated by
  the green region.]{
    \includegraphics[keepaspectratio,width=0.7\textwidth]{figures/lazy1}
  }
  
  \subfigure[\small After accessing an object in the cache (key 4), the
  transaction is then constrained to access only objects that were
  valid at the same time as that object.]{
    \includegraphics[keepaspectratio,width=0.7\textwidth]{figures/lazy2}
  }
  
  \subfigure[\small As the transaction accesses other objects, the set of
  acceptable timestamps progressively narrows.]{
    \includegraphics[keepaspectratio,width=0.7\textwidth]{figures/lazy3}
  }    
  \caption{Example of lazy timestamp selection}
  \label{fig:consistency-protocol:lazy}
\end{figure}


However, the timestamp does not need to be chosen
immediately. Instead, it can be chosen lazily based on which cached
results are available. This takes advantage of the fact that each
cached value is valid over a range of timestamps: its validity
interval.  For example, consider a transaction that has observed a
single cached result $x$.  This transaction can still be serialized at
\emph{any} timestamp in $x$'s validity interval.  When the transaction
next accesses the cache, any cached value whose validity interval
overlaps $x$'s can be chosen, as this still ensures there is at least
one timestamp at which the transaction can be serialized.  As the
transaction proceeds, the set of possible serialization points narrows
each time the transaction reads a cached value or a database query
result. \autoref{fig:consistency-protocol:lazy} illustrates this process.

\subsubsection{Algorithm}

Specifically, the algorithm proceeds as follows. When a transaction
begins, the \txcache library identifies the set of all timestamps that
satisfy the application's freshness requirement. It stores this set as
the \emph{pin set}. If the storage layer can run transactions on
arbitrary timestamps, this set is simply the set of all timestamps
between the application's staleness limit and the current time. If the
storage layer uses the more restricted pinned snapshot interface,
then this is the set of pinned snapshots in that range. If there are
no pinned snapshots in this range, the library requests the creation
of a new one and initializes the pin set with it.

% list of all pinned snapshot
% IDs that satisfy the application's freshness requirement. It stores
% this set as its \emph{pin set}. The pin set represents the set of
% timestamps at which the current transaction can be serialized; it will
% be updated as the cache and the database are accessed.
% The pin set
% also initially contains a special ID, denoted \pinstar, which
% indicates that the transaction can also be run in the present, on some
% newly pinned snapshot. The pin set only contains \pinstar until the
% first cacheable function in the transaction executes.

When the application invokes a cacheable function, the library sends a
\command{lookup} request for the appropriate key, but instead of
indicating a single timestamp, it indicates the \emph{bounds} of the
pin set (the lowest and highest timestamp it contains).  The
transaction can use any cached value whose validity interval overlaps
these bounds and still remain serializable at one or more timestamps.
If the cache server has multiple values for the specified key whose
intervals overlap the specified bounds, it returns the most recent
value.
%The cache node returns the most up-to-date value that overlaps the
%given bounds, though other policies are possible.
The library then reduces the transaction's pin set by eliminating all
timestamps that do not lie in the returned value's validity interval,
since observing a cached value means the transaction can no longer be
serialized outside its validity interval.
% This includes removing
% \pinstar from the pinset because once the transaction has used cached
% data, it cannot be run on a new, possibly inconsistent snapshot.

When the cache does not contain any entries that match both the key
and the requested interval, a cache miss occurs. In this case, the
library calls the cacheable function's implementation, as before.
When the transaction makes its first database query, the library is
forced to select a specific timestamp from the pin set and
\command{begin} a read-only transaction on the database at the chosen
timestamp. This ensures that the application's database queries are
performed with respect to that timestamp. Our implementation chooses
the latest timestamp in the pin set for this purpose, biasing
transactions toward running recent data, though other policies are
possible.

The \txcache library may run multiple database queries within the same
transaction at different timestamps. This can be necessary if the
transaction's pin set changes as a result of cached data it later
reads. In this case, the library instructs the database to end its
current transaction and start a new one with the newly-chosen
timestamp. Thus, from the perspective of the database, the queries are
in separate transactions; however, the \txcache library, by managing
validity intervals, ensures that the query results are still
consistent with each other.

% If a non-\pinstar timestamp is chosen, the transaction runs
% on that timestamp's saved snapshot. If \pinstar is chosen, the library
% pins a new snapshot on the database and uses it to start a new
% transaction.  The pin set is then \emph{reified}: \pinstar is replaced
% with the newly-created snapshot's timestamp, replacing the abstract
% concept of ``the present time'' with a concrete timestamp.

% The library needs a policy to choose which pinned snapshot from the
% pin set it should run at. Simply choosing \pinstar if available, or
% the most recent timestamp otherwise, biases transactions towards
% running on recent data, but results in a very large number of pinned
% snapshots, which can ultimately slow the system down. To avoid the
% overhead of creating many snapshots, we used the following policy: if
% the most recent timestamp in the pin set is older than five seconds
% and \pinstar is available, then the library chooses \pinstar in order
% to produce a new pinned snapshot; otherwise it chooses the most recent
% timestamp.

% This has
% the cumulative effect of producing a new pinned snapshot every five
% seconds, which avoids the overhead of large numbers of snapshots while
% giving other transactions a variety of timestamps to use when
% retrieving values from the cache.

% There are many imaginable policies for choosing which pinned snapshot
% to run at.  Simply choosing the latest pin that isn't \pinstar (or
% \pinstar if nothing else is available) results in the creation of a
% new pin every time some requested freshness requirement can't be
% satisfied.  This results in very few pins (for example, one every 30
% seconds if the freshness requirement is 30 seconds), but can cause
% thrashing every time a new pin is created.  Choosing \pinstar in
% preference to the latest pin produces a large variety of pins and
% evenly distributes the load on the cache and the database over time.
% However, this policy can result in \emph{too many} pinned snapshots,
% introducing large processing overheads.  Ultimately, we settled on a
% compromise between these policies: the latest pin is used, unless it
% is older than some small age (say, 5 seconds), in which case \pinstar
% is used.  This results in a new pin every 5 seconds, spreading out the
% load on the cache without introducing large processing overheads.

During the execution of a cacheable function, the validity intervals
of the queries that the function makes are accumulated, and their
intersection defines the validity interval of the cacheable result,
just as before. In addition, just like when a transaction observes
values from the cache, each time it observes query results from the
database, the transaction's pin set is reduced by eliminating all
timestamps outside the result's validity interval, as the transaction
can no longer be serialized at these points.
% If the transaction's pin
% set still contains \pinstar, \pinstar is removed.

The validity interval of the cacheable function and pin set of the
transaction are two distinct but related notions: the function's
validity interval is the set of timestamps at which its result is
valid, and the pin set is the set of timestamps at which the enclosing
transaction can be serialized. The pin set always lies within the
validity interval, but the two may differ when a transaction calls
multiple cacheable functions in sequence, or performs
database queries outside a cacheable function.

\subsubsection{Correctness}
\label{sec:library:lazy:correctness}

Lazy selection of timestamps is a complex algorithm, and its
correctness is not self-evident. The following two properties show
that it provides transactional consistency.

\begin{invariant}
  \label{inv:pin-serializability}
  All data seen by the application during a read-only transaction are
  consistent with the database state at every timestamp
  in the pin set, \ie the transaction can be serialized at any
    timestamp in the pin set.
\end{invariant}

Invariant~\ref{inv:pin-serializability} holds because any timestamps
\emph{in}consistent with data the application has seen are removed
from the pin set. The application sees two types of data: cached
values and database query results. Each is tagged with its validity
interval. The library removes from the pin set all timestamps that lie
outside either of these intervals.

\begin{invariant}
  \label{inv:pin-non-empty}
  The pin set is never empty, \ie there is always at least one timestamp
  at which the transaction can be serialized.
\end{invariant}

The pin set is initially non-empty. It contains the timestamps of all
sufficiently-fresh pinned snapshots; if necessary, the library creates
a new pinned snapshot. So we must ensure that at least one timestamp
remains every time the pin set shrinks, \ie when a result is obtained
from the cache or database.

When a value is fetched from the cache, its validity interval is
guaranteed to intersect the transaction's pin set at at least one
timestamp. The cache will only return an entry with a non-empty
intersection between its validity interval and the bounds of the
transaction's pin set. This intersection contains the timestamp of at
least one pinned snapshot: if the result's validity interval lies
partially within and partially outside the bounds of the client's pin
set, then either the earliest or latest timestamp in the pin set lies
in the intersection. If the result's validity interval lies entirely
within the bounds of the transaction's pin set, then the pin set
contains at least the timestamp of the
%\drkp{call this out more}
pinned snapshot from which the cached result was originally
generated. Thus, Invariant~\ref{inv:pin-non-empty} continues to hold
even after removing from the pin set any timestamps that do not lie
within the cached result's validity interval.

It is easier to see that when the database returns a query result, the
validity interval intersects the pin set at at least one
timestamp. The validity interval of the query result must contain the
timestamp of the pinned snapshot at which it was executed, by
definition. That pinned snapshot was chosen by the \txcache library
from the transaction's pin set. Thus, at least that one timestamp will
remain in the pin set after intersecting it with the query's validity
interval.

\subsubsection{Policy Choices}

In two cases, our lazy timestamp selection protocol makes a seemingly
arbitrary choice between multiple possibilities that could affect the
transaction timestamp. First, when retrieving values from the cache
server, if multiple values for the same key are valid within the
requested timestamp range, the most recent value is returned. Second,
when starting a transaction on the database, the library needs to
specify a specific timestamp; it chooses the timestamp of the most
recent pinned snapshot.

In both cases, we used a policy of choosing the most recent data. We
opted for this policy for several reasons. First, even though
applications may be able to tolerate slightly or significantly stale
data, when all else is equal it is surely preferable to provide the
most recent data possible. Second, biasing transactions to run at
higher timestamps causes the cache to be populated with more recent
data, which will remain useful to other transactions for a longer
time. In particular, it makes it more likely that values computed on a
cache miss will be currently valid, in which case their validity
intervals will extend into the future (via our invalidation
mechanism).

This greedy, prefer-most-recent policy is not the only possible
policy, nor is it guaranteed to be the optimal one. it may cause cache
misses that would not occur with a different policy. For example, a
transaction might first request a cached value for which there are two
acceptable values in the cache, as shown in \XXX. Our library would
choose the one with validity interval $[50, 54)$, as this is the more
recent one. However, when it makes its next request to the cache,
there is only one available value with validity interval $[46, 48)$,
so a cache miss occurs. If the library had instead chosen the other
value for its first request, this cache miss could have been avoided.
\drkp{this needs a figure}

We experimented with different policy choices (our implementation of
the \txcache library allows policies to be selected at runtime). For
example, one policy we considered was choosing the cached value that
has the largest overlap with the transaction's pin set, under the
theory that this would make cache misses like the one in \XXX less
likely. However, we did not observe a significant difference in
performance between these policies. Our experiments in
\autoref{sec:eval:consistency} provide insight into this result.
There, we allow applications to access \emph{any} sufficiently fresh
value in the cache, regardless of whether its validity interval
overlaps with those of previously accessed objects, and find that
doing so improves performance on the application we studied by less
than 5\%. The benefit of even an optimal policy cannot be more than
this figure, suggesting that exploring this further has limited value.

\subsection{Handling Nested Calls}
\label{sec:library:nested}

Cacheable functions can be nested, \ie they can call other cacheable
functions. This allows applications to cache data at multiple
granularities. Supporting nested calls does not require any
fundamental changes to the approach above. However, we must keep track
of a separate cumulative validity interval for each cacheable function
in the call stack. When a cached value or database query result is
accessed, its validity interval is intersected with that of
\emph{each} function currently on the call stack. As a result, a
nested call to a cacheable function may have a wider validity interval
than the function that calls it, but not vice versa. This makes sense, as
the outer function might have seen more data than the functions it
calls (\eg if it calls more than one cacheable function).

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "main.tex"
%%% End: 
%  LocalWords:  transactional serializable timestamp timestamps API
%  LocalWords:  subrange versioned scalable serializability cacheable
%  LocalWords:  invalidations lookup multiversion metadata infeasible
%  LocalWords:  granularities upcall