\svnInfo $Id$  

\begin{figure*}[t]
  \centering
  \subfigure[In-memory database]{
    \includegraphics[keepaspectratio]{data/size.pdf}
    \label{fig:eval:size:small}
  }
  \subfigure[Disk-bound database]{
    \includegraphics[keepaspectratio]{data/size-big.pdf}
    \label{fig:eval:size:big}
  }
  \caption{Effect of cache size on peak throughput (30 second
    staleness limit)}
  \label{fig:eval:size}
\end{figure*}

\begin{figure*}[t]
  \centering
  \subfigure[In-memory database]{
    \includegraphics[keepaspectratio]{data/hitrate.pdf}
    \label{fig:eval:hitrate:small}
  }
  \subfigure[Disk-bound database]{
    \includegraphics[keepaspectratio]{data/hitrate-big.pdf}
    \label{fig:eval:hitrate:big}
  }
  \caption{Effect of cache size on cache hit rate (30 second staleness
    limit)}
  \label{fig:eval:hitrate}
\end{figure*}

We used \rubis as a benchmark to explore the performance benefits of
caching. In addition to the PHP auction site implementation described
above, \rubis provides a client emulator that simulates many
concurrent user sessions: there are 26 possible user interactions
(\eg{browsing items by category, viewing an item, or placing a bid}),
each of which corresponds to a transaction.
% The emulator repeatedly
% selects the next interaction for each client by following a transition
% matrix and each client's interactions are separated by a
% randomly-chosen ``think time''.
We used the standard \rubis
``bidding'' workload, a mix of 85\% read-only interactions (browsing)
and 15\% read/write interactions (placing bids)~with a think time with
negative exponential distribution and 7-second mean.

We ran our experiments on a cluster of 10 servers, each a Dell
PowerEdge SC1420 with two 3.20 GHz Intel Xeon CPUs, 2 GB RAM, and a
Seagate ST31500341AS 7200 RPM hard drive. The servers were connected
via a gigabit Ethernet switch, with 0.1 ms round-trip latency. One
server was dedicated to the database; it ran PostgreSQL 8.2.11 with
our modifications. The others acted as front-end web servers running
Apache 2.2.12 with PHP 5.2.10, or as cache nodes. Four other machines,
connected via the same switch, served as client emulators. Except
as otherwise noted, database server load was the bottleneck.

We used two different database configurations. One configuration was
chosen so that the dataset would fit easily in the server's buffer
cache, representative of applications that strive to fit their working
set into the buffer cache for performance. This configuration had
about 35,000 active auctions, 50,000 completed auctions, and 160,000
registered users, for a total database size about 850 MB. The larger
configuration was disk-bound; it had 225,000 active auctions, 1
million completed auctions, and 1.35 million users, for a total
database size of 6 GB.

For repeatability, each test ran on an identical copy of the
database. We ensured the cache was warm by restoring its contents from
a snapshot taken after one hour of continuous processing for the
in-memory configuration and one day for the disk-bound configuration.

For the in-memory configuration, we used seven hosts as web servers,
and two as dedicated cache nodes. For the larger configuration, eight
hosts ran both a web server and a cache server, in order to make a
larger cache available. 


\subsection{Cache Sizes and Performance}


We evaluated \rubis's performance in terms of the peak throughput
achieved (requests handled per second) as we varied the number of
emulated clients. Our baseline measurement evaluates \rubis
running directly on the Postgres database, with \txcache
disabled. This achieved a peak throughput of 928 req/s with the
in-memory configuration and 136 req/s with the disk-bound configuration.

We performed this experiment with both a stock copy of Postgres, and
our modified version. We found no observable difference between the
two cases, suggesting our modifications have negligible performance
impact. Because the system already maintains multiple versions to
implement snapshot isolation, keeping a few more versions around adds
little cost, and tracking validity intervals and invalidation tags
simply adds an additional bookkeeping step during query execution.

We then ran the same experiment with \txcache enabled, using a 30
second staleness limit and various cache sizes. The resulting peak
throughput levels are shown in Figure~\ref{fig:eval:size}. Depending
on the cache size, the speedup achieved ranged from $2.2\times$ to
$5.2\times$ for the in-memory configuration and from $1.8\times$ to
$3.2\times$ for the disk-bound configuration. The \rubis PHP benchmark
does not perform significant application-level computation; even so,
we see a 15\% reduction in total web server CPU usage. Cache server
load is low, with most CPU overhead in kernel time, suggesting
inefficiencies in the kernel's TCP stack as the cause. Switching to a
UDP protocol might alleviate some of this
overhead~\cite{saab08:_scalin_memcac_at_faceb}.

Figure~\ref{fig:eval:hitrate:small} shows that for the in-memory
configuration, the cache hit rate ranged from $27\%$ to $90\%$, increasing
linearly until the working set size is reached, and then growing
slowly. Here, the cache hit rate directly translates into a
performance improvement because each cache hit represents load (often
many queries) removed from the database. Interestingly, we always see
a high hit rate on the disk-bound database
(Figure~\ref{fig:eval:hitrate:big}) but it does not always translate
into a large performance improvement. This workload exhibits some very
frequent queries (\eg{looking up a user's nickname by ID}) that can be
stored in even a small cache, but are also likely to be in the
database's buffer cache. It also has a large number of data items that
are each accessed rarely (\eg{the full bid history for each item}).
The latter queries collectively make up the bottleneck, and the
speedup is determined by how much of this data is in
the cache.


% Even before the database
% reaches capacity, performing a query takes an average of 5.5~ms,
% whereas a cache lookup takes 0.7~ms. The cache miss penalty averaged
% 16~ms.


% Each cache server's load remained under 60\% utilization of a single
% CPU core. Most of this CPU overhead was in kernel time, suggesting
% inefficiencies in the kernel's TCP stack were the culprits. Switching
% to a UDP-based protocol might alleviate some of this
% overhead~\cite{saab08:_scalin_memcac_at_faceb}.

\subsection{Varying Staleness Limits}
\label{sec:eval:vary-freshn-requ}

The staleness limit is an important parameter. By raising this value,
applications may be exposed to increasingly stale data, but are able
to take advantage of more cached data. An invalidated cache entry
remains useful for the duration of the staleness limit, which is
valuable for values that change (and are invalidated) frequently.

Figure~\ref{fig:eval:freshness} compares the peak throughput obtained
by running transactions with staleness limits from 1 to 120 seconds.
Even a small staleness limit of 5-10 seconds provides a significant
benefit.  \rubis has some objects that are expensive to compute and
have many data dependencies (indexes of all items in particular
regions with their current prices). These objects are invalidated
frequently, but the staleness limit permits them to be used.  The
benefit diminishes at around 30 seconds, suggesting that the bulk of
the data either changes infrequently (such as information about
inactive users or auctions), or is accessed multiple times every 30
seconds (such as the aforementioned index pages).


\begin{figure}[t]
  \centering
  \includegraphics{data/freshness.pdf}
  \caption{Impact of staleness limit on peak throughput}
  \label{fig:eval:freshness}
\end{figure}

\subsection{Costs of Consistency}
\label{sec:eval:costs-of-consistency}

A natural question is how \txcache's guarantee of transactional
consistency affects its performance. We explore this question by
examining cache statistics and comparing against other approaches.

We classified cache misses into four types, inspired by the common
classification for CPU cache misses:
\begin{itemize}
\item \emph{compulsory miss}: the object was never in the cache
\item \emph{staleness miss}: the object has been invalidated, and its
  staleness limit has been exceeded
\item \emph{capacity miss}: the object was previously evicted
\item \emph{consistency miss}: some sufficiently fresh version of
  the object was available, but it was inconsistent with previous data
  read by the transaction
\end{itemize}
Figure~\ref{fig:eval:cache-miss-types} shows the breakdown of misses
by type for four different configurations. Our cache server
unfortunately cannot distinguish staleness and capacity misses. We see
that consistency misses are the least common by a large
margin. Consistency misses are rare, as items in the cache are likely
to have overlapping validity intervals, either because they change
rarely or the cache contains multiple versions. Workloads with higher
staleness limits experience more consistency misses (but fewer overall
misses) because they have more stale data that must be matched to
other items valid at the same time. The 64~MB-sized cache's workload
is dominated by capacity misses, because the cache is smaller than the
working set. The disk-bound experiment sees more compulsory misses
because it has a larger dataset with limited locality, and few
consistency misses because the update rate is slower.

\begin{figure}[tb]
  \centering\footnotesize{
  \begin{tabular}{r|cccc}
    & \multicolumn{3}{c}{\bf in-memory DB} & \bf disk-bound\\
    & \bf 512 MB & \bf 512 MB  & \bf 64 MB & \bf 9 GB \\
    & \bf 30 s stale & \bf 15 s stale & \bf 30 s stale & \bf 30 s stale\\
    \hline
    \input{data/missstats.tex}
  \end{tabular}}
  \caption{Breakdown of cache misses by type. Figures are percentage
    of total misses.}
  \label{fig:eval:cache-miss-types}
\end{figure}

The low fraction of consistency misses suggests that providing
consistency has little performance cost. We verified this
experimentally by modifying our cache to continue to use our
invalidation mechanism, but to read any data that was valid within the
last 30 seconds, blithely ignoring consistency. The results of this
experiment are shown as the ``No consistency'' line in
Figure~\ref{fig:eval:size:small}. As predicted, the benefit it
provides over consistency is small. On the disk-bound configuration, the
results could not be distinguished within experimental error.

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "paper.tex"
%%% End: 

% LocalWords:  PowerEdge TPC gigabit PostgreSQL CPUs Xeon GHz php cacheable SQL
% LocalWords:  MySQL Postgres granularities nondeterministic invalidations
% LocalWords:  lookup