\svnInfo $Id$  

\txcache stores cached data in RAM on a number of cache servers. The
cache presents a hash table interface: it maps keys to associated
values. Applications do not interact with the cache directly; the
\txcache library translates the name and arguments of a function call
into a hash key, and checks and updates the cache itself.


Data is partitioned among cache nodes using a consistent hashing
approach~\cite{karger97:_consis_hashin_and_random_trees}, as in
peer-to-peer distributed hash
tables~\cite{rowstron01:_pastr,stoica03:_chord}. Unlike DHTs, we
assume that the system is small enough that every application node can
maintain a complete list of cache servers, allowing it to immediately
map a key to the responsible node. This list could be maintained by
hand in small systems, or using a group membership
service~\cite{cowling09:_census} in larger or more dynamic
environments.

% The interface and partitioning of our cache are very similar to
% peer-to-peer distributed hash tables, such as
% Chord~\cite{stoica03:_chord}, Pastry~\cite{rowstron01:_pastr}, and
% countless others. However, our system lacks many of their more
% demanding requirements, and so avoids most of their complexity. In
% particular, multi-hop routing is unnecessary at the scale of perhaps
% hundreds of nodes in a data center. Data replication is also
% unnecessary; if a cache node fails, attempts to access it can simply
% be treated as cache misses until the node is repaired or removed from
% the membership list.

\subsection{Versioning}
\label{sec:cache:versioning}

Unlike a simple hash table, our cache is \emph{versioned}. In addition
to its key, each entry in the cache is tagged with its \emph{validity
  interval}, as shown in Figure~\ref{fig:cache-intervals}. This
interval is the range of time at which the cached value was
current. Its lower bound is the commit time of the transaction that
caused it to become valid, and its upper bound is the commit time of
the first subsequent transaction to change the result, making the
cache entry invalid. The cache can store multiple cache entries with
the same key; they will have disjoint validity intervals because only
one is valid at any time.  Whenever the \txcache library puts the
result of a cacheable function call into the cache, it includes the
validity interval of that result (derived using information obtained
from the database).

\begin{figure}[tp]
  \centering
  \includegraphics[scale=0.38]{cache-intervals.pdf}
  \caption{An example of versioned data in the cache at one point in
    time. Each rectangle is a version of a data item. For example, the
    data for key 1 became valid with commit 51 and invalid with commit
    53, and the data for key 2 became valid with commit 46 and is
    still valid.}
  \label{fig:cache-intervals}
\end{figure}

To look up a result in the cache, the \txcache library sends both the
key it is interested in and a timestamp or range of acceptable
timestamps. The cache server returns a value consistent with the
library's request, \ie one whose validity interval intersects the
given range of acceptable timestamps, if any exists. The server
also returns the value's associated validity interval. If multiple
such values exist, the cache server returns the most recent one.
% The reason the library might want to specify a range of
% values is that, rather than assign a transaction's timestamp
% immediately when it begins, it can choose the timestamp lazily based on
% what data is in the cache; Section~\ref{sec:library} explains this
% process in detail.

When a cache node runs out of memory, it evicts old cached values to
free up space for new ones. Cache entries are never pinned and can
always be discarded; if one is later needed, it is simply a cache
miss. A cache eviction policy can take into account both the time
since an entry was accessed, and its staleness. Our cache server uses
a least-recently-used replacement policy, but also eagerly removes any
data too stale to be useful.

\subsection{Invalidation Tags and Streams}
\label{sec:cache:invalidations}

When an object is inserted into the cache, it can be flagged as
\emph{still-valid} if it reflects the latest state of the database,
like Key 2 in Figure~\ref{fig:cache-intervals}. For such objects, the
database provides \emph{invalidation} notifications when they change.

Every still-valid object has an associated set of \emph{invalidation
  tags} that describe which parts of the database it depends on. Each
invalidation tag has two parts: a table name and an optional index key
description. The database identifies the invalidation tags for a query
based on the access methods used to access the database. A query that
uses an index equality lookup receives a two-part tag, \eg{a search
  for users with name Alice would receive tag
  \invtag{users:name=alice}}. A query that performs a sequential scan
or index range scan has a wildcard for the second part of the tag,
\eg{\invtag{users:$\star$}}. Wildcard invalidations are expected to be
very rare because applications typically try to perform only index
lookups; they exist primarily for completeness. Queries that access
multiple tables or multiple keys in a table receive multiple tags. The
object's final tag set will have one or more tags for each query that
the object depends on.

The database distributes invalidations to the cache as an
\emph{invalidation stream}. This is an ordered sequence of messages,
one for each update transaction, containing the transaction's
timestamp and all invalidation tags that it affected. Each message is
delivered to all cache nodes by a reliable application-level multicast
mechanism~\cite{cowling09:_census}, or by link-level broadcast if
possible. The cache servers process the messages in order, truncating
the validity interval for any affected object at the transaction's
timestamp.

Using the same transaction timestamps to order cache entries and
invalidations eliminates race conditions that could occur if an
invalidation reaches the cache server before an item is inserted with
the old value. These race conditions are a real concern: \mediawiki
does not cache failed article lookups, because a negative result might
never be removed from the cache if the report of failure is stale but
arrived after its corresponding invalidation.

For cache lookup purposes, items that are still valid are treated as
though they have an upper validity bound equal to the timestamp of the
last invalidation received prior to the lookup. This ensures that
there is no race condition between an item being changed on the
database and invalidated in the cache, and that multiple items
modified by the same transaction are invalidated atomically.


%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "paper.tex"
%%% End: 

% LocalWords:  cacheable multi versioned invalidations timestamp
