\chapter{Introduction}

This thesis addresses the problem of maintaining consistency in a
cache of application-computed objects.

Application-level caches are a popular solution for improving the
scalability of complex web applications: they are widely deployed by
many well-known websites.  They are appealing because they can be
implemented with a simple, scalable design, and their flexibility
allows them to address many bottlenecks, as we discuss below.

However, existing caches place a substantial burden on application
developers. They do not preserve the isolation guarantees of the
underlying storage layer, introducing potential concurrency
bugs. They also place the responsibility for locating and updating
data in the cache entirely in the hands of the application, a common
source of bugs. Consider, for example, the following report of a major
outage at one of the world's most-visited web sites:

\begin{quotation}
  Early today Facebook was down or unreachable for many of you for
  approximately 2.5 hours. This is the worst outage we’ve had in over
  four years, and we wanted to first of all apologize for it. We also
  wanted to provide much more technical detail on what happened and
  share one big lesson learned.

  The key flaw that caused this outage to be so severe was an
  unfortunate handling of an error condition. An automated system for
  verifying configuration values ended up causing much more damage
  than it fixed.

  The intent of the automated system is to check for configuration
  values that are invalid in the cache and replace them with updated
  values from the persistent store. This works well for a transient
  problem with the cache, but it doesn’t work when the persistent
  store is invalid.

  Today we made a change to the persistent copy of a configuration
  value that was interpreted as invalid. This meant that every single
  client saw the invalid value and attempted to fix it. Because the
  fix involves making a query to a cluster of databases, that cluster
  was quickly overwhelmed by hundreds of thousands of queries a
  second.

  \dots

  The way to stop the feedback cycle was quite painful -- we had to
  stop all traffic to this database cluster, which meant turning off
  the site.
  \attrib{Robert Johnson, Facebook~\cite{johnson10:_more_detail_today_outag}}
\end{quotation}

This incident highlights both the importance of application-level
caches -- not a mere optimization, they are a critical part of the
site's infrastructure -- and the challenges of cache management.

This thesis presents techniques that address these challenges while
preserving the flexibility and scalability benefits of existing
application-level caches.

\section{The Case for Application-Level Caching}

Today's web applications are used by millions of users and demand
implementations that scale accordingly. A typical system includes
application logic (often implemented in web servers) and an underlying
storage layer (often a relational database) for managing persistent
state. Either layer can become a
bottleneck~\cite{amza02:_bottl_charac_of_dynam_web_site_bench}.
Either type of bottleneck can be addressed, but with difficulty.
Increasing database capacity is typically a difficult and costly
proposition, requiring careful partitioning or complex distributed
databases. Application server bottlenecks can be easier to address --
simply adding more nodes is usually an option -- but no less costly,
as these nodes don't come for free.

Not surprisingly, many types of caches have been used to address these
problems, ranging from page-level web
caches~\cite{candan01:_enabl_dynam_conten_cachin_for,
  challenger99:_scalab_system_for_consis_cachin,
  yu99:_scalab_web_cache_consis_archit,
  zhu01:_class_based_cache_manag_for} to database replication
systems~\cite{kemme00:_dont_be_lazy_be_consis, amza03:_distr,
  petersen97:_flexib_updat_propag_for_weakl_consis_replic} and
query/mid-tier caches~\cite{timesten, mtcache,
  garrod08:_scalab_query_resul_cachin_for_web_applic}.  But
increasingly complex application logic and more personalized web
content has made it more useful to cache the result of
\emph{application computations}. As shown in
\autoref{fig:intro:memcache-arch}, the cache does not lie in front of
the application servers (like a page-level web cache) or between the
application server and database (like a query cache).  Rather, it
allows application to cache arbitrary objects. These objects are
typically generated from the results of one or more database queries
along with some computation in the application layer.

\begin{figure}[tbp]
  \centering
  \includegraphics[scale=0.75]{figures/memcache-arch}
  \caption{Architecture of a system using an application-level cache}
  \label{fig:intro:memcache-arch}
\end{figure}

This flexibility allows an application-level cache to replace existing
caches: it can act as database query cache, or it can act as a web
cache and cache entire web pages. But application-level caching is
more powerful because it can cache \emph{intermediate computations},
which can be even more useful. For example, many web sites have
highly-personalized content, rendering whole-page web caches largely
useless; application-level caches can separate common content from
customized content and cache the common content separately so it can
be shared between users. Database-level query caches are also less
useful in an environment where processing is increasingly done within
the application. Unlike database-level solutions, application-level
caches can also address application server load; they can avert costly
post-processing of database records, such as converting them to an
internal representation, or generating partial HTML output. For
example, MediaWiki uses \memcached to store items ranging from
translations of interface messages to parse trees of wiki pages to the
generated HTML for the site's sidebar.


\section{Existing Caches Are Difficult to Use}

Existing application-level caches typically present a hash table
interface to the application, allowing it to \command{get} and
\command{put} arbitrary objects identified by a key. This interface
offers two benefits:

\begin{itemize}
\item the interface is flexible, in that it allows the application to
  use the cache to store many different types of objects. As discussed
  above, it can be used to hold database query results, entire
  generated web pages, or anything in between.
\item the interface lends itself to a simple, scalable
  implementation. A typical example is \memcached~\cite{memcached},
  which stores objects on a cluster of nodes using a consistent
  hashing algorithm~\cite{karger97:_consis_hashin_and_random_trees} --
  making the cache a simple implementation of a distributed hash
  table~\cite{stoica03:_chord,
    ratnasamy01:_scalab_conten_addres_networ, rowstron01:_pastr,
    zhao04:_tapes}. Importantly, the cache is stored entirely in
  memory, and does not attempt to do any processing other than
  returning the object identified by the requested key. This
  architecture scales well; the largest \memcached deployment has
  thousands of nodes with hundreds of terabytes of in-memory
  storage~\cite{kwiatkowski10:_memcache}.
\end{itemize}

But the \command{get}/\command{put} interface has a serious drawback:
it leaves responsibility for managing the cache entirely with the
application. This presents two challenges for developers, which we
address in this thesis.

\paragraph{Challenge 1: Transactional Consistency.}

First, existing caches do not ensure \emph{transactional consistency}
with the rest of the system state.  That is, there is no way to ensure
that accesses to the cache and the underlying storage layer return
values that reflect a view of the entire system at a single point in
time. While the backing database goes to great length to ensure that
all queries performed in a transaction reflect a consistent view of
the database, \ie it can ensure serializable isolation, these
consistency guarantees are violated when data is accessed from the cache..

% it is nearly
% impossible to maintain these consistency guarantees while using a
% cache that operates on application objects and has no notion of
% database transactions.

The anomalies caused by inconsistencies between cached objects can
cause incorrect information to be exposed to the user. Consider, for
example, an eBay-like auction site. Such an application might wish to
cache basic metadata about an auction (its title and current price)
separately from more detailed information like the history of
bidders. This division would be useful because the basic item metadata
is used to construct indexes and search results, whereas both objects
are needed when a user views the detailed auction information. Placing
a new bid should update both objects. If a subsequent request sees
inconsistent versions of the objects, it might, for example, display
the latest auction price but not the updated bid
history, leaving a user confused about whether his bid has been
registered.

Accordingly, \memcached has seen use primarily for applications for
which the consequences of incorrect data are low (\eg social networking
websites). Applications with stricter isolation requirements between
transactions (\eg online banking or e-commerce) could also benefit
from application-level caching: it would improve their performance,
reducing their operating costs and allowing them to scale further.
However, adding a cache would require them to sacrifice the isolation
guarantees provided by the database, a price that many are unwilling
to pay.

Even in cases where exposing inconsistent data to users is not a
serious problem, transaction support can still be useful to simplify
application development. Serializable isolation between transactions
allows developers to consider the behavior of transactions in
isolation, without worrying about the race conditions that can occur
between concurrent transactions. Compensating for a cache that
violates these guarantees requires complex application logic because
the application must be able to cope with temporarily-violated
invariants. For example, transactions allow the system to maintain
referential integrity by removing an object and any references to it
atomically. With a cache that doesn't support transactions, the
application might see (and attempt to follow) a reference to an object
that has been deleted.

\paragraph{Challenge 2: Cache Management.}

The second challenge existing caches present for application
developers is that they must explicitly manage the cache. That is, the
application is responsible for assigning names to cached values,
performing lookups, and keeping the cache up to date. This explicit
management places a substantial burden on application developers, and
introduces the potential for errors. Indeed, as we discuss in
\autoref{cha:programming-model}, cache management has been a common
source of programming errors in applications that use \memcached.

A particular challenge for applications is that they must explicitly
\emph{invalidate} data in the cache when updating the backing storage.
That is, the cache requires developers to be able to identify every
cached application computation whose value might have been affected by
a database modification. This analysis is difficult to do,
particularly in a complex and evolving application, because it
requires global reasoning about the entire application to determine
which values may be cached, violating the principles of modularity.

% \section{Application-Level Caching Presents New Challenges}
% \label{sec:intro:hard}

% \begin{outline}
% \item previous section argued that the problem is important. But is it
%   *hard*? or can we just expect to solve it by applying the usual
%   replication techniques for ensuring consistency?
% \item hard because it involves two separate layers: the storage layer
%   and the application
% \item want to support general operations and computations, which means
%   that we can't make any assumptions about the data format/schema or
%   what the application might do with it. Not going to assume
%   application is running some kind of object-relational framework or
%   get into parsing SQL and PHP to figure out what it's doing.
% \item those things make naming/invalidations challenging
% \item consistency is hard too
% \item the replication systems you might think about applying
%   to this are really for replicating *data*. Can assume
% \item we are dealing with caching *computations*. Can't exactly
%   compute all possible computations the app might ask for and keep
%   them up to date all the time.
% \item Might want to use data that is out of date.  Need a new
%   consistency protocol that can integrate cached objects that were
%   computed at different times as long as they're consistent.
% \end{outline}

\section{Contributions}

This thesis addresses both challenges described above in the context
of a new transactional, application-level cache which we call
\txcache. \txcache aims to preserve the flexibility of existing
application-level caches while minimizing the challenges it poses for
application developers: adding caching to a new or existing
application should be as simple as indicating which data should be
cached.

We describe how to build an application-level cache that guarantees
\emph{transactional consistency} across the entire system. That is,
within a transaction, all data seen by the application reflects a
consistent snapshot of the database, regardless of whether the data
obtained from the cache or computed directly from database
queries. This preserves the isolation guarantees of the underlying
storage layer. Typically, the entire system can provide serializable
isolation (assuming that the database itself provides
serializability). \txcache can also provide snapshot isolation, when
used with a database that provides this weaker isolation guarantee.

\txcache provides a consistency model where the system allows
read-only transactions to access stale -- but still consistent --
snapshots. Applications can indicate a freshness requirement,
specifying the age of stale data that they are willing to tolerate We
argue that this model matches well with the requirements of web
applications, and show that it improves cache utilization by allowing
cache entries to remain useful for a longer period.

% Although our cache guarantees transactional consistency, it does not
% guarantee \emph{freshness}: the data seen by applications may not
% reflect the latest state of the database. We show that guaranteeing
% freshness is impossible without introducing undesirable
% blocking. Instead, \txcache intentionally \emph{relaxes} freshness,
% requiring applications to indicate the extent to which they can
% tolerate stale data. It then provides applications with access to
% slightly stale but nevertheless consistent snapshots. This improves
% cache utilization by allowing cache entries to remain useful for a
% longer period.

Furthermore, we propose a simple programming model based on
\emph{cacheable functions}.  In this model, applications do not need
to interact explicitly with the cache, avoiding the challenges of
cache management described above. Instead, developers simply designate
certain existing application functions as cacheable. A cache library
then handles inserting the result of the function into the cache and
retrieving that result the next time the function is called with the
same arguments. Rather than requiring applications to invalidate cache
data, we integrate the database, cache, and library to track data
dependencies and automatically invalidate cache data.

Towards these ends, this thesis makes the following technical
contributions:

\begin{itemize}
\item a protocol based on \emph{validity intervals} 
  for ensuring that transactions see only consistent
  data, even though that data consists of cached objects that were
  computed at different times
  
\item techniques for modifying an existing multiversion database to
  generate validity information which can be attached to cache objects

\item a \emph{lazy timestamp selection} algorithm that selects which
  snapshot to use for a read-only transaction based on the
  application's freshness requirement and the
  availability of cached data.
\item an automatic invalidation system, based on \emph{invalidation
    tags} for tracking the data dependencies of cache objects, and
  \emph{invalidation locks}, which use a variant on existing
  concurrency control mechanisms to detect data dependencies and
  generate notifications when they are modified.
\end{itemize}

We have implemented the \txcache system. We evaluated the
effectiveness of its programming model in two ways. We ported the
\rubis web application benchmark, which emulates an auction site, to
use \txcache. We also examined \mediawiki, a popular web application
that uses \memcached, and analyzed some bugs related to its use of
caching. Both experiences suggest that our programming model makes it
easier to integrate caching into an existing application, compared to
existing caches.

We evaluated cache performance using the \rubis
benchmark~\cite{amza02:_specif_and_implem_of_dynam}. Compared to a
system without caching, our cache improved peak system throughput by
$1.5$ -- $5.2\times$, depending on the system configuration and cache
size. Moreover, we show that the system
performance and cache hit rate is only slightly (less than five percent) below
that of a non-transactional cache, showing that consistency does not
have to come at the expense of performance.


\section{Outline}

This thesis is organized as follows:

We begin with a high-level overview of the system architecture and how
it integrates with existing components in a web application, and state
our assumptions (\autoref{cha:arch}). We then describe \txcache's
cacheable function programming model, explain how to implement a
simplified version of the system that provides this programming model
but does not support transactions, and discuss how the programming
model can avoid common sources of caching-related application bugs
(\autoref{cha:programming-model}).

We then turn to the question of how to provide support for
whole-system transactional
consistency. \autoref{cha:consistency-model} describes the semantics
that the system provides for
transactions. \autoref{cha:consistency-protocol} explains how the
system ensures that all data accessed by the application within a
transaction reflects a consistent view of the storage state; it
introduces the concept of validity intervals and explains how the
cache and library use them. The consistency protocol requires some
support from the storage layer; \autoref{cha:storage-layer} explains
how to provide this support. We describe both how to design a new
storage system (here, a simple block store) to provide the necessary
support, and how to retrofit it into an existing relational database.
\autoref{cha:invalidations} discusses how the system tracks data
dependencies and uses invalidations to notify the cache of database
changes that affect cached objects.

\autoref{cha:evaluation} evaluates the performance of \txcache using
the \rubis benchmark. \autoref{cha:related-work}
surveys the related work, and \autoref{cha:conclusion} concludes.

\autoref{cha:rw} describes an extension to the system that allows
read/write transactions to safely use cached
data. \autoref{cha:serializability-appendix} provides an explanation
about how \txcache's consistent reads property allows the system to
ensure either serializability or snapshot isolation.

% LocalWords:  memcached DRKP lookups versioned timestamp online invalidations
% LocalWords:  serializable Facebook cacheable scalability invariants

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "main.tex"
%%% End: 
%  LocalWords:  transactional modularity serializability multiversion
%  LocalWords:  metadata