\chapter{Cacheable Functions: A~Simple~Programming~Model}
\label{cha:programming-model}

%\epigraphhead[50]{
  \epigraph{\it There are only two hard problems in Computer Science:
  cache invalidation and naming things.}{--- Phil Karlton}
%}

This thesis aims to improve the usability of application-level caching
in two ways. The first, which is addressed in this chapter, is to
introduce a new programming model that makes it easy to seamlessly
integrate caching into existing systems and frees application
developers from the responsibilities of cache management. (The second,
providing support for transactions, is the subject of
\Cref{cha:consistency-model,cha:consistency-protocol,cha:storage-layer,cha:invalidations}.)

The goal of our programming model is to make the process of adding
caching to an existing system as simple as deciding what should be
cached. In our programming model, developers simply designate certain
functions as \emph{cacheable functions}, which automatically and
transparently cache their results. These functions behave identically
regardless of whether the cache is used, so making a function
cacheable is a purely local change: callers of these functions do not
need to be modified. Unlike existing systems, our programming model
does not require applications to explicitly manage the
cache. Applications are not responsible for naming and locating values
in the cache, or for invalidating cached values when the database is
changed. As we demonstrate, these tasks are challenging -- and a
frequent source of bugs -- because they require global reasoning about
the application.

This chapter presents the basic programming model, ignoring issues
related to transaction support and concurrency
control. \autoref{sec:programming-model:model} explains the interface
presented to application
developers. \autoref{sec:programming-model:discussion} examines how this
interface prevents common problems that currently plague developers
who use application-level
caching. \autoref{sec:programming-model:implementation} then describes
how to build a cache that implements this model.

%\drkp{I don't feel particularly strongly about the order of
%  \Cref{sec:programming-model:discussion,sec:programming-model:implementation};
%  could also put them in the other order.}

\section{The Cacheable Function Model}
\label{sec:programming-model:model}

The \txcache programming model is organized around \emph{cacheable
  functions}. These are actual functions in the program's code,
annotated to indicate that their results can be cached. Cacheable
functions are essentially
memoized~\cite{michie68:_memo_funct_machin_learn}: \txcache's library
transforms them so that, when called, they first check whether the
cache contains the results of a previous call to the same function. If
so, that result is returned directly, bypassing the need to execute
the function. If not, the function is executed, and its result stored
in the cache for future use.

Cacheable functions can make requests to the system's underlying
storage layer, \eg they can perform database queries. Cacheable
functions can also be nested, \ie they can make calls to other
cacheable functions. Nested cacheable functions allow applications to
cache data at different granularities, which is an important benefit
of application-level caching. For example, a web application might
choose to make the outermost function that generates a webpage
cacheable, allowing the entire page to be cached, but can also
cache the functions generating individual elements of that page, which
might also be used on different pages.

\subsection{Restrictions on Cacheable Functions}
There are some restrictions on which functions can be cacheable.  To
be suitable for caching, functions must be \emph{pure}, \ie they must
be deterministic, not have side effects, and depend only on their
arguments and the storage layer state. (This definition of ``pure''
differs slightly from the classic one, in that we allow the function
to depend on the storage state, treating it as an implicit argument to
the function.)

These requirements are straightforward and correspond to an intuitive
notion of what types of computations ought to be cached. It would not
make sense to cache a function with side-effects, \eg one that
modifies a file, as this side-effect would then take place only on a
cache miss. Similarly, it would not make sense to cache a function
that is nondeterministic, such as one that returns a random number or
the current time.

Another aspect of this requirement is that if a cacheable function
invokes storage layer operations, they must also be
deterministic. This is not always the case for SQL database
queries; Vandiver discusses some techniques for finding and
eliminating nondeterminism in SQL queries in the context of a database
replication
system~\cite{vandiver08:_detec_and_toler_byzan_fault,
  vandiver07:_toler_byzan_fault_in_datab}. 

\txcache currently relies upon programmers to ensure that they only
cache suitable functions. However, this requirement could be enforced
using static analysis to identify pure
functions~\cite{salcianu05:_purit_side_effec_analy_java_progr}. It
could also be enforced by using dynamic instrumentation to detect and
refuse to cache functions that modify state or invoke
non-deterministic
functions~\cite{guo10:_towar_pract_increm_recom_scien}. As a sanity
check, our implementation can detect certain non-deterministic
functions at runtime and issue a warning; this identified a bug in the
\rubis benchmark we discuss in \autoref{sec:eval:rubis}, but the check is
not intended as a comprehensive solution.

\subsection{Making Functions Cacheable}

\begin{figure}[t]
  \begin{itemize}
  \itemindent=-1.5em   
  \item \command{make-cacheable}(\cmdarg{fn}) $\to$ \cmdarg{cached-fn} :
    Makes a function cacheable. \cmdarg{cached-fn} is a new function
    that first
    checks the cache for the result of another call with the same
    arguments. If not found, it executes \cmdarg{fn} and stores
    its result in the cache.
  \end{itemize}
  
  \caption{\txcache API for designating cacheable functions}
  \label{fig:cacheable-api}
\end{figure}

\autoref{fig:cacheable-api} shows the API that \txcache presents to
applications. It is presented here in an abstract form using
higher-order procedures; this is similar to the API that we provide in
Python. In other languages, the details may differ slightly; for
example, our PHP implementation takes slightly different form because
PHP does not support higher-order procedures.

\begin{figure}[tbp]
  \centering

  \begin{minipage}{0.35\linewidth}
    \scriptsize
    \begin{lstlisting}[escapechar={@},language={c++},frame=single,columns=fullflexible]
int myFunction(int arg1, string arg2) {
   @\llangle\textit{computation \& DB queries} \rrangle@
   return @\llangle\textit{result expression} \rrangle@
}
    \end{lstlisting}      
  \end{minipage}
  $\;\xrightarrow{\command{\footnotesize make-cacheable}}\;$
  \begin{minipage}{0.38\linewidth}
    \scriptsize
    \begin{lstlisting}[escapechar={@},language={c++},frame=single,columns=fullflexible]
int myFunction(int arg1, string arg2) {
   key = concat("myFunction", arg1, arg2)
   result = cache[key]
   if (result != null) {
     return result
   } else {
     // cache miss!    
     @\llangle\textit{computation \& DB queries} \rrangle@
     result =  @\llangle\textit{result expression} \rrangle@

     cache[key] = result
     return result
   }
}
    \end{lstlisting}      
  \end{minipage}
  
  \caption{The \command{make-cacheable} operation transforms a normal
function into a cached function}
  \label{fig:make-cacheable-example}
\end{figure}

As the figure shows, the API is minimal: it consists of a single
operation, \command{make-cacheable}. (This is only a slight
simplification; the transactional version of the system adds a few
mostly-standard transaction management functions like \command{commit}
and \command{abort}, but no more.) The \command{make-cacheable}
operation transforms a function into a cacheable function, as shown in 
\autoref{fig:make-cacheable-example}.

Assuming that the functions provided to \command{make-cacheable} are
indeed pure functions, the resulting cacheable functions behave
identically to the original functions. From the perspective of a user
of the function, the only difference between the case where the result
is available in the cache and the case where it has to be computed
anew is the speed of execution. This unchanged behavior is important
for modular software development: marking a function as cacheable
doesn't require making changes to other parts of the system that
depend on it.

\subsection{Automatic Cache Management}

What is mainly noteworthy about the API in \autoref{fig:cacheable-api}
is what it \emph{does not} include. Application developers are only
required to indicate what functions they would like to
cache. Everything else is handled by the \txcache system; applications
are not required to access the cache directly. As a result,
applications are not required to track which data is in the cache,
where it is stored, or whether it is up to date. In particular,
applications are not required to assign names to cached objects, and
are not required to notify the cache when cached values
change. \txcache instead provides an automatic invalidation mechanism,
described in \autoref{cha:invalidations}, which tracks data
dependencies and notifies the cache when the underlying
data in the database changes.

\section{Discussion}
\label{sec:programming-model:discussion}

\txcache's cacheable-function programming model offers significant
usability advantages over standard application-level caches.  The
standard interfaces for these caches is a hash table interface:
applications \command{get} and \command{put} arbitrary objects in the
cache, identified by application-defined keys. This interface is used
by many well-known application-level caches, including
\memcached~\cite{memcached}, Redis~\cite{redis}, JBoss
Cache~\cite{jboss},
AppFabric~\cite{sampathkumar09:_introd_cachin_window_server_appfab_beta},
and others.

The hash table interface provided by other caches presents two
specific challenges to application developers: they are responsible
for naming and locating items in the cache, and they are responsible
for invalidating cached objects when an update to the database affects
them. Both are challenging in large systems and are a common source of
caching errors. In \txcache's programming model, these are no longer
the responsibility of the application, eliminating this source of
bugs.

We examine the problems with naming and invalidations in existing
caches, and how \txcache avoids them, in the context of the \mediawiki
application~\cite{mediawiki}. MediaWiki is an application for serving
collaboratively-editable ``wiki'' sites. It is used to power the
Wikipedia encyclopedia, one of today's ten most popular websites, as
well as several other high-traffic websites. To support these heavy
workloads -- while providing a rich set of features and dynamic
content -- \mediawiki uses \memcached extensively for caching. Among
the objects it caches are the rendered text of articles, the parse
trees from which they are generated, metadata about articles and users,
and translations of user interface messages. Our analysis of
\mediawiki consisted of examining \memcached-related bug reports from
the \mediawiki bug database, and adapting \mediawiki to use \txcache
to cache some of the computations that it currently uses \memcached
for.

\subsection{Naming}
\label{sec:programming-model:discussion:naming}

In a key-value cache interface like the one \memcached provides,
applications need to explicitly store and look up data in the
cache. To do so, application developers must identify the data they
wish to access using a cache key. These cache keys must uniquely
identify objects in the cache: the same key must be used everywhere to
refer to a single object, and no two distinct objects can share a
cache key. This requirement seems obvious, but this is nevertheless a
source of errors because selecting keys requires reasoning about the
entire application and how the application might evolve.

Examining \mediawiki bug reports, we found that several
\memcached-related \mediawiki bugs stemmed from choosing
insufficiently descriptive keys, causing two different objects to
overwrite each other~\cite{mediawiki-keys}. One representative example
(\mediawiki bug \#7541), concerned caching of a user's watchlist,
which displays recent changes to pages the user has flagged as being
of interest. Each user's watchlist was cached using a cache key
derived from the user's ID. However, \mediawiki also allows the user
to specify the number of changes to display in the watchlist, and this
was not reflected in the cache key. As a result, the same results were
returned even when the user requested to display a different number of
days worth of changes.

It is easy to see how these types of errors might arise in a large,
complex application: the code that implements this feature underwent
numerous changes in the years since caching was added to it. Any
modifications that parameterize the output must also modify the cache
key, but this requirement is easy to overlook. In particular, the
changes to this module were made by several different developers, all
of whom would need to be aware of how caching is being used to avoid
errors.

An error like this could not occur with \txcache. \txcache uses
cacheable functions as its unit of caching, and uses the function and
its arguments to identify the cached value. Because cacheable
functions are pure, the arguments uniquely identify the result.
Adding a new customization to the watchlist page would require adding
an additional argument to the cacheable function, and that argument
would be used to distinguish between cached values.

\subsection{Invalidations}
\label{sec:programming-model:discussion:invalidations}

An even more challenging problem with existing caches is that they
require applications to explicitly update or invalidate cached results
when modifying the database. These explicit invalidations require
global reasoning about the application, violating modularity. To add
caching for an object, an application developer must be able to
identify every place in the application that might change the value of
the object, and add an invalidation. Similarly, an application
developer who adds code that modifies data in the database must know
which cached values might depend on it, and invalidate them. The
cached values and the updates that must invalidate them may be located
in different modules and may be written by different developers,
making it difficult to keep track of when invalidations are necessary.

As a result, applications that use \memcached have cache invalidation
code scattered through many functions in many modules. Developers
typically rely on ad hoc coordination methods: \mediawiki maintains a
text file (\texttt{memcached.txt}) that lists the various objects
stored in the cache, the cache keys used to identify them, and the
situations in which they need to be invalidated. But this hardly
guarantees that the resulting code will be correct; indeed, the text
file itself contains a disclaimer that it is ``incomplete, out of
date''.

Several \mediawiki bugs resulted from missing or incorrect
invalidations~\cite{mediawiki-invalidations}. Consider, for example,
the process of editing an article on Wikipedia: what cached objects
must be invalidated once the update is committed to the database? Some
are easy to identify: the cached objects that contain the article text
are clearly no longer valid. However, other objects have less obvious
dependencies on this change, and must also be invalidated. One example
is the \textsc{user} object corresponding to the user who made the
change, which includes various metadata including the user's edit
count (which is used to restrict access to certain features to
experienced users only). Once MediaWiki began storing each user's edit
count in their cached \textsc{user} object, it became necessary to
invalidate this object after an edit. This was initially forgotten
(\mediawiki bug \#8391), indicating that identifying all cached
objects needing invalidation is not straightforward, especially in
applications so complex that no single developer is aware of the whole
of the application.

Such an error could not happen with \txcache, as it does not require
applications to explicitly invalidate cached objects. Rather, it
incorporates a dependency tracking mechanism which would have
identified that editing a page modified some data that the
\textsc{user} object depended on, and invalidated that object.


\subsection{Limitations}
\label{sec:programming-model:discussion:limitations}

By organizing cached data around cacheable functions rather than
providing a hash table interface, \txcache imposes more structure on
cached data. This structure allows it to simplify cache management by
avoiding naming and invalidation issues. However, it does provide a
small decrease in flexibility: there are some use cases that can be
supported by \memcached and other caches, but not by \txcache.

\subsubsection{Soft State}

\txcache makes the assumption that all cached objects are derived from
persistent state in the database (or other storage system). This
corresponds to the most common use case of application-level
caching. However, some users of \memcached also use it as a simple
storage system for soft state that is never stored in the backing
store. For example, it is sometimes used to track the number of
requests from a particular IP address to implement rate limiting as
protection against denial of service attacks. This information is
needed only in the short term, and does not require durability or
consistency, so storing it in the cache is a viable option.

We do not attempt to support this use case (short-term unreliable
storage); we target only caching data derived from a persistent
backing store. Of course, applications can still use \memcached or
another existing cache for this purpose.

\subsubsection{Invalidations vs. Updates}

\txcache includes an automatic invalidation system that notifies the
cache when a cached object is no longer valid because of a database
change. The next time the application requests the invalidated object,
it must recompute it.  In most existing systems, applications are
responsible for keeping the cache up to date. When the application
makes a change to the database, it must identify the cached objects
that are affected, which imposes a significant burden on the
programmer. However, once they have done so, they can choose either to
invalidate these objects, or \emph{update} them in place: they can
determine the new value and replace the cache entry directly, making
it immediately accessible for future lookups. \txcache does not
support this mode of operation.

Gupta showed that using in-place updates instead of invalidations can
improve the performance of a \memcached-using web application by
0--25\%~\cite{gupta11:_trigg_based_middl_cache_orms,
  gupta10:_provid_cachin_abstr_web_applic}. This result is in the
context of an object-relational mapping system that only caches
objects corresponding to fixed set of well-defined query patterns. In
such a model, it is immediately apparent which changes need to be made
to update cached objects.  In contrast, \txcache's programming model
supports caching of functions including arbitrary computation, so
updating a cached object requires recomputing it. In particular,
\txcache supports caching complex objects (such as the entire content
of web pages that involve many database queries) which would be
considerably more expensive to recompute, making the benefit of
updates over invalidations significantly smaller.

\section{Implementing the Programming Model}
\label{sec:programming-model:implementation}

The \txcache system implements the programming model described
above. This section describes a basic implementation of the system
that does not provide support for transactions. In subsequent
chapters, we extend this system with support for transactions
(\Cref{cha:consistency-model,cha:consistency-protocol,cha:storage-layer})
and for an automatic invalidation system (\autoref{cha:invalidations}).

Recall the system architecture from
\autoref{fig:arch:arch:arch}. End users interact with application
servers (\ie web servers) which process their requests. In doing so,
they interact with a storage layer (\eg a database). \txcache
introduces two new components: a cache, comprised of multiple cache
servers, and a library that runs on each application server and
translates cacheable function calls to cache accesses.

\subsection{Cache Servers}
\label{sec:programming-model:implementation:cache}

\txcache stores the results of application computations in its cache,
which is distributed across a set of cache servers.  The cache
presents a hash table interface: it maps keys to associated values.
Only the \txcache library uses this interface; applications do not
interact with the cache directly.

The cache is implemented as a simple distributed hash table. It
partitions data by key among multiple hosts using consistent
hashing~\cite{karger97:_consis_hashin_and_random_trees}. We assume
that a membership service provides every node in the system with a
consistent view of the system membership (we discuss how membership
changes are handled later). Therefore, any participant can immediately
determine which cache node is responsible for storing a particular
hash key. The cache nodes, therefore, can operate completely
independently; they do not need to communicate with each other. Each
cache node simply accepts \command{lookup} and \command{store}
requests for the keys it is responsible for, using a simple RPC
protocol.

The interface and partitioning of our cache are similar to
peer-to-peer distributed hash tables, such as
Chord~\cite{stoica03:_chord}, Pastry~\cite{rowstron01:_pastr}, and
many others. However, our system lacks many of their more
demanding requirements, and so avoids most of their complexity. In
particular, multi-hop routing is unnecessary at the scale of perhaps
hundreds of nodes in a data center; at this scale, it is feasible for
each node to maintain the complete system membership.

Each cache node's data is stored entirely in memory, ensuring that it
can always give an immediate response (without blocking on disk I/O,
for example). This allows the cache server to be implemented as a
single-threaded application which processes one request at a time,
avoiding the need for locking within the cache sever. Multiprocessor
systems can simply run multiple independent instances of the cache
server, relying on consistent hashing to achieve load balancing among them.

The cache stores data in a hash table indexed by key.
% which points to a tree of versions indexed by their validity
% interval.
When a cache node runs out of memory, it evicts old cached
values to free up space for new ones. Cache entries are never pinned
and can always be discarded; if one is later needed, it is simply a
cache miss. The cache evicts entries using a least-recently-used
replacement policy.


\subsubsection{Reconfiguration: Cache Server Failures and
  Membership Changes}

The set of active cache servers can change over time. Cache nodes may
fail, or may be removed from the system. New nodes can be added to the
system, perhaps to increase the overall capacity of the cache. For the
most part, the system handles these membership changes acceptably
without any special protocols: we do not use replication to ensure
availability. Because the cache can evict entries arbitrarily, the
system must already be prepared to handle the absence of data in the
cache. If a cache node fails, attempts to access it can simply be
treated as cache misses until the node is repaired or removed from the
membership list. Similarly, when a new node is added, it does not have
any state; it simply indicates a cache miss on all requests until it
receives the appropriate state.

As an optimization, newly joining cache nodes can transfer state from
the node that previously stored the cached values it will be
responsible for. Furthermore, replication can optionally be integrated
into the system if losing a cache node's worth of data during a
failure is a serious performance concern. However, replication has a
clear downside: it reduces the effective storage capacity of the cache
(or increases the cost of the cache hardware) by a factor of at least
two. A useful property for implementing either a state transfer or
replication protocol is that the \txcache server (though not the
simplified version shown here) stores versioned data. Because each
version is immutable, it is possible to use DHT-like synchronization
protocols~\cite{cates03:_robus_and_effic_data_manag,
  chun06:_effic_replic_maint_for_distr_storag_system,
  sit06:_proac_replic_data_durab} rather than needing more expensive
state machine replication
protocols~\cite{schneider90:_implem_fault_toler_servic_using,
  oki88:_views_replic} to ensure consistency.

\subsection{\txcache Library}
\label{sec:programming-model:implementation:library}

The \txcache library runs on each application server in the system. It
is responsible for implementing the cacheable-function programming
model: as the application invokes cacheable functions, the library
checks for their values in the cache, and stores computed values in
the cache as necessary.

The library's \command{make-cacheable} function takes a function as
input and produces a wrapper function. This wrapper function is a
memoized version of the original
function. \autoref{fig:funcall-process} shows what happens when the
application invokes one of these wrapper functions. The function first
constructs a cache key from the function name and the arguments to the
function, serializing them in the appropriate (language-specific)
way.\footnote{Using only the function name and arguments as the cache
  key assumes that the function will not change. In environments with
  a dynamic code base, one might want to also incorporate a version
  number or a hash of the code itself.} The library then hashes the
cache key to determine the cache server responsible for storing it
(using consistent hashing), and sends a \command{lookup} request.

\begin{figure}[tbp]
  \centering
  \includegraphics{figures/funcall-process}
  \vspace{1.5em}
  
  \begin{tabular}{rrl}
    & \bf \liningnumerals 1 & application calls wrapper function\\
    &\bf \liningnumerals 2 & \txcache library sends \command{lookup} request to cache\\
    &\bf \liningnumerals 3 & cache returns value or indicates cache miss\\
    (cache miss only) &\bf \liningnumerals 4 & \txcache library makes upcall to application's\\&&implementation
    function\\
    (cache miss only)&\bf \liningnumerals 5 & application executes function, making \\&&queries to database as
    necessary\\
    (cache miss only)&\bf \liningnumerals 6 & application returns result to \txcache library\\
    (cache miss only)    &\bf \liningnumerals 7 & \txcache library adds value to cache\\
    &\bf \liningnumerals 8 & \txcache library returns value to caller
  \end{tabular}
  
  \caption{Control flow during a call to a cacheable function}
  \label{fig:funcall-process}
\end{figure}

If the cache server has the appropriate value available, it returns it
to the \txcache library. The \txcache library then returns that directly
to the application, avoiding the need to execute the function.

If the \txcache library is unable to obtain the cached result from the
cache server, it must recompute it. This might happen if the result
has been invalidated, evicted from the cache, or was never computed,
or if the cache node is inaccessible. The \txcache library then makes
an upcall to the application, requesting that it invoke the cacheable
function's implementation (the function originally provided to
\command{make-cacheable}). As that function executes, it may make
queries to the storage layer; the \txcache library monitors those
queries to track dependencies for invalidation purposes, as we
describe in \autoref{cha:invalidations}. When the execution of the
function completes, it returns the result to the wrapper function
created by the \txcache library. Before passing the return value back
to the application, the \txcache library's wrapper function serializes
the value and sends a \command{store} request to the appropriate cache
server, inserting the value into the cache so that it can be used for later calls.
  

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "main.tex"
%%% End: 

%  LocalWords:  Cacheable cacheable memoized granularities webpage
%  LocalWords:  subcomponents nondeterministic nondeterminism runtime
%  LocalWords:  Vandiver SQL API transactional invalidations lookups
