\svnInfo $Id$  

% \begin{figure}[tb]
% \begin{center}
% %\subfigure[Throughput] {
% \includegraphics[width=0.9\columnwidth,keepaspectratio]{data/tput.pdf}
% %\label{fig:eval:main:throughput}
% %}
% %\subfigure[Latency] {
% %\includegraphics[width=0.9\columnwidth,keepaspectratio]{data/latency.pdf}
% %\label{fig:eval:main:latency}
% %}
% \caption{Effects on peak throughput
%   %/latency
%   of adding \txcache to
% \label{fig:eval:main}
%   \rubis workload.}
% \end{center}
% \end{figure}

We implemented all the components of \txcache, including the cache
server, database modifications to PostgreSQL, and the cache library
with PHP language bindings. Our current implementation has one
limitation; it does not yet build invalidation tags by extracting
dependency information from the database query plan, so functions need
to be annotated with the tables and indexes they read. We expect to
remove this limitation soon.

To test the performance of \txcache, we used the \rubis
benchmark~\cite{amza02:_specif_and_implem_of_dynam}.  \rubis
implements an auction website modeled after eBay where users can
register items for sale, browse listings, and place bids on items. A
client emulator simulates many concurrent user sessions: there are 26
possible user interactions (\eg{browsing items by category, viewing an
  item, or placing a bid}), each of which corresponds to a
transaction. The emulator repeatedly selects the next interaction for
each client by following a transition matrix and each client's
interactions are separated by a randomly-chosen ``think time''. We
used the standard \rubis ``bidding'' workload, a mix of 85\% read-only
interactions (browsing) and 15\% read/write interactions (placing
bids)~with a think time with negative exponential distribution and
7-second mean.

We ran our experiments on a cluster of 10 servers, each a Dell
PowerEdge SC1420 with two 3.20 GHz Intel Xeon CPUs and 2 GB RAM. The
servers were connected via a gigabit Ethernet switch, with 0.1 ms
round-trip latency. One server was dedicated to the database; it ran
PostgreSQL 8.2.11 with our modifications. Seven acted as
front-end web servers, running Apache 2.2.12 with PHP 5.2.10. The
remaining two servers were dedicated cache nodes. Four other machines,
also connected via the same switch, served as client emulators. Except
when otherwise noted, database server load was the bottleneck.

We initialized the database with
about 35,000 active auctions, 50,000 completed auctions, and 160,000
registered users. The total size of the database is about 850 MB,
so its working set easily fits in the database server's
buffer cache. This is representative of current web applications that
strive to fit their working set into the buffer cache to achieve
reasonable performance.
\begin{figure*}[t]
  \centering
  \subfigure[Peak throughput]{
    \includegraphics[keepaspectratio]{data/size.pdf}
    \label{fig:eval:size}
  }
  \subfigure[Hit rate]{
    \includegraphics[keepaspectratio]{data/hitrate.pdf}
    \label{fig:eval:size-hitrate}
  }
  \caption{Effect of cache size on throughput and hit rate}
\end{figure*}


\subsection{Porting \rubis}
\label{sec:exp:porting-rubis}

% One of \txcache's goals is to make it easier to add caching to a new
% or existing application. The \txcache library makes it straightforward
% to designate a function as cacheable. However, ensuring that the
% program has functions suitable for caching still requires some
% effort. In this section, we describe our experiences modifying the
% \rubis benchmark to take advantage of caching.

% The first modification was straightforward: code that makes database
% queries and begins and ends transactions needed to be modified to do
% so using \txcache's library calls instead of the PHP SQL frontend
% library directly. A minor complication was that the \rubis
% implementation originally used a MySQL database, so we also needed to
% port it to use Postgres instead; this was simply a matter of fixing
% a few non-standard SQL constructs.

Because of \txcache's programming model, porting \rubis to use it was
straightforward. Besides changing code that begins and ends
transactions to use \txcache's interfaces, this was largely a process
of identifying and designating cacheable functions.  The PHP
implementation of \rubis was largely written in a sequential style
with few functions, so we had to begin by dividing it into functions;
this was not difficult and would be unnecessary in a more modular
implementation.

We used two main granularities of cacheable functions. First, we
cached large portions of the generated HTML output (except some
headers and footers) for each page. This meant that if two clients
viewed the same page with the same arguments, the previous result
could be reused. Second, we cached common functions such
as authenticating a user's login, or looking up information about a
user or item by ID. Even these fine-grained functions were often more
complicated than an individual query; for example, looking up an item
requires examining both the active items table and the old items
table.  These fine-grained cached values can be shared between
different pages; for example, if two search results contain the same
item, then the database only needs to be queried once for the
description and price of that item.

\label{sec:eval:main}

% Cacheable functions must be deterministic and depend only on database
% state. Generally, identifying such functions is easy; indeed, nearly
% every read-only function in the system has this property. However,
% while designating functions as cacheable, we discovered that one of
% \rubis's actions made a nondeterministic SQL query. Namely, one of the
% search results pages does not enforce an ordering on the items it
% returns (it uses a \command{select} \ldots \command{limit} 20 query
% without an \command{order by} clause).  This turned out to be a known
% bug in the PHP implementation, which was not written by the original
% authors of \rubis. The bug, which means that the application does not
% properly divide search results into pages, became clear when we
% observed the validity intervals being inserted into the cache.

We made one modification to the \rubis code unrelated to caching. One
interaction, finding all the auctions in a particular category whose
seller lives in a given region, required performing a sequential scan
over all active auctions, and joining it against the users table. This
severely impacted the performance of the benchmark without
\txcache. It also benefited little from caching: with dependencies on
both the users and items tables, it was quickly invalidated. We
addressed this with a new table and index containing each item's
category and region IDs.

\subsection{Cache Sizes and Performance}


We evaluated \rubis's performance in terms of the peak throughput
achieved (in requests handled per second) as we varied the number of
emulated client sessions. Our baseline measurement evaluates
\rubis running directly on the Postgres database, with \txcache
disabled. This configuration achieved a peak throughput of 928 req/s.

We also performed the same experiment with our modified version of
Postgres, implementing the modifications described in
Section~\ref{sec:database}. We found no observable difference between
the two cases, suggesting our modifications have negligible
performance impact. Because the system already maintains multiple
versions to implement snapshot isolation, keeping a few more versions
around adds little cost, and tracking validity intervals is simply an
additional bookkeeping step during query execution.


We then ran the same experiment with \txcache enabled, using a 30
second staleness limit and various cache sizes. For consistency, each
experiment ran on an identical copy of the database, and the cache was
warmed prior to the benchmark by restoring the contents of the cache
from a snapshot taken after an hour of continuous processing. The
resulting peak throughput levels are shown in
Figure~\ref{fig:eval:size}. Depending on the cache size, the speedup
achieved ranged from $2.2 \times$ to $5.2\times$.

Examining statistics sheds some light on how \txcache improves system
scalability. Figure~\ref{fig:eval:size-hitrate} shows that the cache
hit rate ranged from $27\%$ to $90\%$. As expected, the cache hit rate
increases linearly until the working set size is reached, and then
grows slowly. The cache hit rate directly translates into a
performance improvement because each cache hit represents load (often
many queries) removed from the database. Even before the database
reaches capacity, performing a query takes an average of 5.5~ms,
whereas a cache lookup takes 0.7~ms. The cache miss penalty averaged
16~ms.

Each cache server's load remained under 60\% utilization of a single
CPU core. Most of this CPU overhead was in kernel time, suggesting
inefficiencies in the kernel's TCP stack were the culprits. Switching
to a UDP-based protocol might alleviate some of this
overhead~\cite{saab08:_scalin_memcac_at_faceb}.

\subsection{Varying Staleness Limits}
\label{sec:eval:vary-freshn-requ}

The staleness limit is an important parameter. By raising this value,
applications may be exposed to increasingly stale data, but are able
to take advantage of more cached data. An invalidated cache entry
remains useful for the duration of the staleness limit, which is
valuable for values that change (and are invalidated) frequently.

Figure~\ref{fig:eval:freshness} compares the peak throughput obtained
by running transactions with staleness limits from 1 to 120 seconds.
Even a staleness limit of 1 second gives a throughput improvement of
$1.5\times$, but a staleness limit of 5 seconds gives over twice the
benefit.
% As the staleness limit increases, the speedup grows as data
% in the cache remains useful for longer periods of time.
The benefit
diminishes at around 30 seconds, suggesting that the bulk of the data
either changes infrequently (such as information about inactive users
or auctions), or is accessed at least once every 30 seconds (such as
category index pages).


\begin{figure}[t]
  \centering
  \includegraphics{data/freshness.pdf}
  \caption{Impact of staleness limit on peak throughput}
  \label{fig:eval:freshness}
\end{figure}

\subsection{Costs of Consistency}
\label{sec:eval:costs-of-consistency}

A natural question is how \txcache's guarantee of transactional
consistency affects its performance. We explore this question by
examining cache statistics and comparing against other approaches.

\newpage
We identified four distinct types of cache misses:
\begin{itemize}
\item \emph{compulsory miss}: the object was never in the cache
\item \emph{staleness miss}: the object has been invalidated, and its
  staleness limit has been exceeded
\item \emph{capacity miss}: the object was previously evicted
\item \emph{consistency miss}: some sufficiently fresh version of
  the object was available, but it was inconsistent with previous data
  read by the transaction
\end{itemize}
Figure~\ref{fig:eval:cache-miss-types} shows the breakdown of misses
by type for three different workloads. Our cache server is
unfortunately unable to distinguish between staleness and capacity
misses. However, we see that consistency misses are the least common
by a large margin. Consistency misses are rare because items in the
cache are likely to have overlapping validity intervals, either
because they change rarely or the cache contains multiple versions of
them. Workloads with higher staleness limits experience more
consistency misses because they have more stale data that needs to be
matched to other items valid at the same time. The 64 MB-sized cache's
workload is dominated by capacity misses, because it is smaller than
the working set.

\begin{figure}[tb]
  \centering\small{
  \begin{tabular}{r|ccc}
    & \bf 512 MB & \bf 512 MB  & \bf 64 MB \\
    & \bf 30 s stale & \bf 15 s stale & \bf 30 s stale \\
    \hline
    \input{data/missstats.tex}
  \end{tabular}}
  \caption{Breakdown of cache misses by type. Figures are percentage
    of total misses.}
  \label{fig:eval:cache-miss-types}
\end{figure}

The low fraction of consistency misses suggests that providing
consistency has little performance cost. We verified this
experimentally. Ideally, we would compare against \memcached or some
other cache that does not provide consistency. However, our different
programming model makes a fair comparison difficult. Certain queries
are not easy to invalidate explicitly, without our invalidation
tags. \memcached users would probably use timeouts instead of
invalidations for these. To account for both possibilities, we
modified our cache to continue to use our invalidation mechanism, but
to read any data that was valid within the last 30 seconds, blithely
ignoring consistency. The results of this experiment are shown as the
``No consistency'' line in Figure~\ref{fig:eval:size}. As predicted,
the benefit it provides over consistency is small.

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "paper.tex"
%%% End: 

% LocalWords:  PowerEdge TPC gigabit PostgreSQL CPUs Xeon GHz php cacheable SQL
% LocalWords:  MySQL Postgres granularities nondeterministic invalidations
% LocalWords:  lookup
