\documentclass[10pt,finalversion]{usetex-v1}
\usepackage{graphicx}   % postscript figures
\usepackage{url}        % \url{} command with good linebreaks
\usepackage{microtype}  % I am a font nerd.
\usepackage{subfigure}
\usepackage{float}
\usepackage{nicefrac}
\usepackage{balance}
\usepackage{xspace}
\usepackage{amssymb,amsfonts,amsmath,amsthm,mathrsfs}
\usepackage{lastpage}
%\usepackage[small,compact]{titlesec}

% The following line needs to appear in every source file in
% order to track the last commit time and svn revision ID. (Why is
% this so difficult?)
%
% Also, need to 'svn propset svn:keywords Id file.tex'
\svnInfo $Id$  

\usepackage[cm]{fullpage}
\usepackage{savetrees}
\begin{document}
\title{Transactional~Caching of Application~Data using Recent~Snapshots}

\newcommand{\txcache}[0]{TxCache\xspace}
\newcommand{\memcached}[0]{memcached\xspace}
\newcommand{\rubis}[0]{RUBiS\xspace}
\newcommand{\eg}[1]{\emph{e.g.},~#1}
\newcommand{\ie}[1]{\emph{i.e.},~#1}
\newcommand{\XXX}[0]{\edatnote{XXX}{XXX}}
\newcommand{\command}[1]{\textsc{#1}}
\newcommand{\drkp}[1]{\edatnote{DRKP}{#1}}

% document status: submitted to foo, published in bar, etc.
\docstatus{In submission to SOSP '09}

\newcommand{\authemail}[1]{\authurl{\mailto{#1}}}  

% authors.  separate groupings with \hspace{2em}.
\author{\large
Dan R. K. Ports\thanks{student; will present \hspace{1em} $^\dag$ student}
\hspace{1.5em}
Austin T. Clements$^\dag$%\thanks{student}
\hspace{1.5em}
Irene Zhang$^\dag$
\hspace{1.5em}%\\
Samuel Madden
\hspace{1.5em}
Barbara Liskov
\\
MIT CSAIL
} % end author
\date{}
\maketitle

\section{Overview}
\label{sec:overview}

% Today's web applications are used by millions of users and demand
% implementations that scale accordingly.  Frequently, the database
% layer becomes the bottleneck to scalability, because increasing
% database capacity is typically a difficult and costly proposition,
% requiring careful partitioning or complex distributed database
% systems.

Many of today's well-known websites use application data caches to
reduce the bottleneck load on the database, as well as the computational
load on the application servers.  Distributed in-memory shared caches,
exemplified by \emph{\memcached{}}, are one popular approach. These
caches typically provide a get/put interface, akin to a distributed
hash table; the application chooses what data to keep in the cache and
keeps it up to date. By storing the cache entirely in memory and
horizontally partitioning among nodes, in-memory caches provide quick
response times and ease of scaling.

However, existing caches have no notion of transactional consistency:
there is no way to ensure that two accesses to the cache
%(or one access to the cache and one to the database)
reflect a view of the database at the same point in time. While the
backing database goes to great lengths to ensure this property
(serializable isolation), the caching layer violates these
guarantees. The resulting inconsistencies can have unpleasant
consequences if exposed to the user (\eg{attributing the latest bid to
  the wrong user on an auction site}), or add complexity to
application code by forcing it to cope with temporarily violated
invariants.

We argue that transactional semantics are not incompatible with cache
performance and scalability. We introduce a transactional cache,
\emph{\txcache{}}, which guarantees that all values retrieved from the
cache or database during a transaction reflect a consistent snapshot
of the database.

\txcache also strives to simplify application design by helping manage
the cache. Instead of requiring applications to manually insert and
check for values in the cache, \txcache provides a library with which
programmers simply designate functions as cacheable, and the library
checks the cache for previous calls with the same arguments. In
particular, and unlike \memcached, \txcache does not require
applications to explicitly invalidate cached values; correctly
identifying the values to invalidate is difficult because it requires
global reasoning about the application.

\section{Running Transactions in the Past}
\label{sec:past}

\txcache provides a different consistency guarantee than most
caches. Typical caches strive to guarantee \emph{freshness}: the cache
always reflects the latest state of what it is caching. \txcache
relaxes its freshness guarantee slightly to provide
\emph{transactional consistency}: within a transaction block, the
application sees a view consistent with the state of the database at a
particular timestamp.

Providing both freshness and consistency requires
conflicting transactions to block or abort, which is impractical in a
high-throughput system.  Snapshot isolation ensures that a transaction
sees a consistent snapshot of the database taken when the transaction
begins. \txcache takes this approach further, allowing read-only
transactions to be run on a slightly earlier snapshot (up to a
user-specified limit). In addition to avoiding conflicts, this
increases the hit rate by allowing the use of slightly stale data.
% It also allows \txcache to operate without
% invalidations: values in the cache are known to be valid at specific
% times prior to their insertion into the cache, so \txcache can assign
% the transaction a timestamp for which cached values are available.

Using stale but consistent cached data is safe because of the inherent
asynchrony in distributed systems. Even without a cache, query results
are already not guaranteed to be fresh unless leases or locks are
used, because concurrent updates might make them stale while in
flight. Applications can use read/write transactions when freshness is
required. \txcache does not use cached data for read/write
transactions, so it introduces no new anomalies.

\section{\txcache Design}
\label{sec:design}

\begin{figure}[tb]
  \centering
  \includegraphics[scale=0.85]{arch4}
  \caption{Anatomy of a \txcache deployment.}
  \label{fig:architecture}
\end{figure}

\txcache is designed for systems where several application servers
(\eg{web servers}) interact with a database server. \txcache
introduces two new components, shown in Figure~\ref{fig:architecture}:
a cache and an application-side cache interface library. It also
requires minor modifications to the database server.

\paragraph{Cache servers.} \txcache stores data in RAM using a simple
key-value distributed hash table to partition data across many cache
server nodes. Applications do not access the DHT directly:
\txcache's application-side library assigns keys to, retrieves, and
updates cached values as cacheable functions are called.

Cached data is \emph{versioned}. In addition to its key, each entry in
the cache is tagged with its \emph{validity interval}, the range of
time at which the cached value was current. It is bounded below by the
commit time of the transaction that made it valid, and above by that
of the first subsequent transaction to change the result.

\paragraph{Database server.} \txcache uses a standard relational
database with a few modifications. Specifically, it requires the
ability to run queries on slightly stale snapshots, and to generate
validity intervals for each query result returned. These validity
intervals are used to tag the cache entries. We leverage existing
concurrency control mechanisms to add the necessary support to an
existing DBMS, PostgreSQL, with minimal modifications (under 1,000
lines of code changed) and no observable performance impact.

\paragraph{Library.} Applications interact with \txcache
through its application-side library. The library assigns a timestamp
to the transaction and retrieves from the cache only data valid at
that timestamp; the timestamp is assigned lazily based on what data is
available in the cache. The library provides language bindings that
wrap a cacheable function; when called, it checks for available cached
values. If none are available, it invokes the function's
implementation, collects the validity intervals from any database
queries it makes, and stores the result in the cache.

\section{Results}

Preliminary results using the \rubis auction website benchmark are
encouraging. Even restricting stale data to at most 30 seconds old,
\txcache is able to increase peak throughput by a factor of over
$2.5\times$ without sacrificing transactional consistency.

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Rubber"
%%% TeX-PDF-mode: t
%%% End: 
% LocalWords:  serializable timestamp cacheable versioned DHTs asynchrony


\end{document}
