\svnInfo $Id$  

\newcommand{\pinstar}[0]{\ensuremath{\star}\xspace}
\newtheorem{invariant}{Invariant}

Applications interact with \txcache through its client library, which
automatically looks up values in the cache whenever a cacheable
function is called. Much of the complexity of the system lies in the
library, which is responsible for assigning timestamps to read-only
transactions, selecting values to retrieve from the cache, and
computing the validity interval associated with everything it puts in
the cache.

In this section, we describe the implementation of the client
library. For clarity, we initially describe a simplified version in
which timestamps are chosen when a transaction begins. In
Section~\ref{sec:library:lazy}, we describe a technique for choosing
timestamps lazily to take better advantage of cached data.

\subsection{Basic Functionality}
\label{sec:library:basic}

The client library is divided into a language-independent library that
implements the core functionality, and a set of bindings that
implement language-specific interfaces. Currently, we have only
implemented bindings for PHP, but adding support for other languages
should be relatively straightforward.

The library's interface is simple: it provides the standard
transaction commands (\command{begin}, \command{commit}, and
\command{abort}) and a command for making database queries
(\command{query}). There is also a way to designate a function as
cacheable, though the details depend on the language being used. In a
language that support higher-order functions, this would be a
\command{wrap} function that takes a function and returns a wrapped
function that first checks for available cached values. In other
languages, such as PHP, the function needs to be called through a
\command{cacheable-call} procedure.

When a transaction is started, the programmer must specify whether the
transaction is read/write or read-only, and the freshness requirement,
if read-only. For a read/write transaction, the \txcache library
simply starts a transaction on the database server, and passes all
queries directly to it. In a read-only transaction, the transaction
needs to be assigned a timestamp. The library contacts the pincushion
to request the list of pinned snapshots that are more recent than its
freshness requirement, then chooses one. If no sufficiently recent
snapshots exist, the library starts a new transaction on the database,
pins its snapshot, and notifies the pincushion.

When a cached function is called, the library checks whether its
result is in the cache. It does so by serializing the function's name
and arguments to generate a key, then uses consistent hashing to
identify the cache server responsible for storing it. The library
sends the cache a \command{lookup} request for that key at the
transaction's timestamp. If the cache finds a matching result, it can
be returned directly to the program.

In the event of a cache miss, the library calls the cacheable
function's implementation, then inserts the result into the
cache. While the function executes, the library computes the validity
interval with which it will tag the result in the cache. Every time
the application makes a query, the library forwards it to the
database, recording the validity interval returned. The validity
interval for the cacheable function's result is the intersection of
the intervals for all the queries it made.

Note that the library does not need to start a transaction on the
database (\ie{send a \command{begin} SQL statement}) immediately when
the client begins a read-only transaction. Instead, it can wait until
the first query is issued. This can be a useful optimization, because
a transaction whose requests are all satisfied from the cache does not
need to contact the database at all.

\subsection{Choosing Timestamps Lazily}
\label{sec:library:lazy}

Above, we assumed that the client library chooses a read-only
transaction's timestamp when the transaction starts. Although
straightforward, this approach may not be able to take advantage of
the most cached data. The client must decide on a timestamp without
any knowledge of what data is in the cache. Indeed, lacking this
knowledge, it is not at all clear what policy would provide the most
benefit.

However, the timestamp needs not be chosen at transaction-start
time. Instead, it can be chosen lazily based on which cached results
are available. This takes advantage of the fact that cached results
are valid at multiple timestamps (over their validity interval), so a
transaction that has seen a single cached result $x$ can be serialized
at any timestamp in $x$'s validity interval. On the transaction's next
call to a cacheable function, any cached value whose validity interval
intersects $x$'s can be chosen, as this still ensures that the
transaction can be serialized at at least one timestamp.

More specifically, the algorithm proceed as follows. When a
transaction begins, the client library requests from the pincushion
all pinned snapshot IDs that satisfy its freshness requirement. It
stores this set as its \emph{pin set}. The pin set represents the set
of timestamps at which the current transaction can be serialized; it
will be updated as cached results and the database are accessed.  The
pin set also initially contains a special ID, denoted \pinstar, which
is ordered after all other entries. This entry reflects the fact that
the transaction could also obtain and pin a new snapshot.

When the application invokes a cacheable function, the library sends a
\command{lookup} request to the appropriate cache server with the
generated key, and an interval ranging from the lowest to highest
timestamps in the pin set, excluding \pinstar. The cache returns the
most recent value for that key whose validity interval intersects the
one the client sent. Afterwards, the client eliminates from its pin
set all timestamps that do not lie in the validity interval of the
returned value.

A cache miss occurs when the cache does not contain any entries that
match both the key and the client's requested validity interval. In
this case, the library calls the cacheable function's implementation,
as before. Now, however, before it does so, it must select a timestamp
and \command{begin} a read-only transaction on the database. The
library chooses the most recent timestamp in the pin set, which will
be \pinstar if it is in the set. If a non-\pinstar timestamp is
chosen, the transaction is started using that timestamp's saved
snapshot. If \pinstar is chosen, the library starts a new transaction
that runs in the present, pinning its snapshot and reporting the pin
to the pincushion. The pin set is then \emph{reified}: \pinstar is
replaced with the timestamp of the newly-created snapshot, replacing
the abstract concept of ``the present time'' with a concrete
timestamp.

During the execution of the cacheable function, the validity intervals
of the queries it makes are recorded, and their intersection defines
the validity interval of the cacheable result, just as before. In
addition, each time a query is made, all timestamps that are not
contained within the query's validity interval are removed from the
pin set, as the application has now seen data that was not consistent
with the database's state at that timestamp. The validity interval of
the cacheable function and pin set of the transaction are two distinct
but related notions: the function's validity interval is the set of
timestamps at which its result is valid, and the transaction's pin set
is the set of timestamps at which it can be serialized. The two may
differ if a transaction calls multiple cacheable functions, or
performs ``bare'' database queries outside a cacheable function.

\subsubsection{Correctness}
\label{sec:library:lazy:correctness}

Lazy selection of timestamps is a somewhat complex algorithm, and its
correctness is not self-evident. The following two properties show
that it provides transactional consistency.

\begin{invariant}
  \label{inv:pin-serializability}
  All results seen by the application during a read-only transaction
  are consistent with the state of the database at all of the
  timestamps in the pin set, \ie{the transaction can be serialized at
    any timestamp in the pin set}.
\end{invariant}

Invariant~\ref{inv:pin-serializability} holds because any timestamps
inconsistent with the data the application has seen are removed from
the pin set. The application sees two types of data: values obtained
from the cache, and query results from the database. Cached values are
tagged with their validity interval, and the database reports validity
intervals for query results. The library removes from the pin set all
timestamps that lie outside one of these intervals.

\begin{invariant}
  \label{inv:pin-non-empty}
  The pin set is never empty, \ie{the transaction can always be
    serialized at some timestamp}.
\end{invariant}

The pin set is initially non-empty: it contains the timestamps of all
sufficiently-fresh pinned snapshots, if any, and always \pinstar. So
we must ensure that at least one timestamp remains every time the pin
set shrinks, \ie{when a result is obtained from the cache or
  database}.

When a value is fetched from the cache, its validity interval is
guaranteed to intersect the pin set at at least one timestamp. The
cache can only return an element that has a non-empty intersection
between its validity interval and the bounds of the client's pin
set. This intersection contains the timestamp of at least one pinned
snapshot: if the result's validity interval lies partially within and
partially outside the bounds of the client's pin set, then either the
earliest or latest timestamp in the pin set lies in the
intersection. If the result's validity interval lies entirely within
the bounds of the client's pin set, then the client's pin set must
contain the timestamp of the pinned snapshot from which the cached
result was originally generated. Thus,
Invariant~\ref{inv:pin-non-empty} continues to hold even after
removing from the pin set any timestamps that do not lie within the
cached result's validity interval.

It is easier to see that when the database returns a query result, its
validity interval intersects the pin set at at least one
timestamp. The validity interval of the query result must contain the
timestamp of the pinned snapshot at which it was executed, by
definition. That pinned snapshot was chosen by the client from the pin
set (or it chose \pinstar, obtained a new snapshot, and added it to
the pin set). Thus, at least that one timestamp remains in the pin set
after intersecting it with the query's validity interval.

\subsection{Handling Nested Calls}
\label{sec:library:nested}

In the preceding sections, we assumed for simplicity that cacheable
functions never call other cacheable functions. However, it is useful
to be able to nest calls to cacheable functions. For example, a user's
home page at an auction site might contain a list of items the user
recently bid on. We might want to cache the description and current
price for each item as a function of the item ID (because they might
appear on other user's pages) in addition to the complete content of
the user's page (because he might access it again).

Handling nested calls does not require any fundamental changes to our
approach. However, we do need to keep track of a separate validity
interval for each cacheable function in the call stack. Each time a
result is fetched from the cache or a query is performed on the
database, its validity interval is intersected with that of each
function on the call stack. As a result, these validity intervals have
an interesting property: a nested call to a cacheable function may
have a wider validity interval than its enclosing function, but not
vice versa. This makes sense, because the outer function might have
seen more data than the functions it calls (\eg{if it calls more than
  one function}).


%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "paper.tex"
%%% End: 

% LocalWords:  cacheable PHP timestamp lookup timestamp's reified
