\svnInfo $Id$  

To ease deployment of our system, our goal is to keep the database
completely separate from the cache, and to use largely unmodified
database software. Nevertheless, supporting a transactional cache
imposes some requirements on the database server that are not met by
standard database systems. In this section, we describe how we
modified an existing DBMS, PostgreSQL~\cite{postgresql} to provide the
necessary support. The modifications are not extensive; our
implementation required changing less than a thousand \XXX{} lines of
code.

\subsection{Exposing Multiversion Concurrency}

Because our cache allows read-only transactions to run slightly in the
past, the database must be able to perform queries against a past
snapshot of a database. This situation arises when a read-only
transaction begins, and is assigned a timestamp in the past after it
accesses some cached data. However, if a later operation in the
transaction results in a cache miss, the application must query the
database. The database query must return results consistent with the
previously-returned cached values, so the query must execute at the
same timestamp in the past.

One solution is to use a \emph{temporal database}, which tracks the
history of its data and allows ``time travel'' --- running queries
against previous database state. Although such databases exist, they
are not commonly used, because retaining historical data introduces
substantial storage and indexing cost\footnote{The POSTGRES research
  database was one such system. However, time travel support was
  removed early in transition to an open-source production
  database, precisely because it was considered too
  expensive.}. Therefore, we do not consider this a practical
solution.

Instead, we note that supporting caching requires a much more limited
form of historical query support than general time-travel queries. We
simply require the ability to run a transaction on a stale, but recent
snapshot, not the ability to perform complex queries over the entire
history of the database. In addition, the historical data does not
need to be recoverable after a crash, since we assume the cache is
cleared after the database restarts. Our insight is that these
requirements are essentially identical to those required for
supporting multiversion concurrency control techniques such as
snapshot isolation~\cite{berenson95:_critiq_of_ansi_sql_isolat_level},
so many databases already have the necessary infrastructure to support
them.

We modified Postgres to support running transactions on recent
snapshots, exposing the multiversion storage it uses internally to
implement snapshot isolation. We added a \command{pin} command, which
can be called from a read-only transaction; it assigns an identifier
to the transaction's snapshot. When starting a new transaction, if the
ID is specified using the new \command{snapshotid} argument to the
\command{begin} command, the new transaction will see the same view of
the database. The database state as of that timestamp is guaranteed to
be retained until it is released with the \command{unpin}
command. Pinned snapshots are identified by the transaction ID of the
last committed transaction visible to it. (Here, we assume that
transaction IDs are ordered by their commit time; Postgres actually
uses a different representation internally, so this required setting
up a mapping.)
\begin{ednote}{DRKP}
  Actually, it's the last committed transaction's ID + 1, but I think
  it's OK to lie here.
\end{ednote}

Postgres is especially well-suited to this modification because of its
``no-overwrite'' storage
manager~\cite{stonebraker87:_desig_of_postg_storag_system}, which
already retains recent versions of data. Because stale data is only
removed periodically by an asynchronous ``vacuum cleaner'' process,
the fact that we keep data around slightly longer has little impact on
performance. However, our technique is not Postgres-specific; any
database that implements snapshot isolation must have a way to keep a
similar history of recent database states.

\subsection{Tracking Result Validity}
\label{sec:database:validity}

When a read-only transaction executes a query, the returned result is
guaranteed valid at the timestamp associated with that
transaction. Therefore, any values the application computes from that
query result can be stored in the cache, and used by other
transactions assigned to the same timestamp. However, this is overly
restrictive, as it requires other transactions to have exactly the
same timestamp in order to use the shared data. In fact, a query
result is valid for a \emph{range} of time, because other transactions
may not have affected the query result. We call this range the
result's \emph{validity interval}. In this section, we describe
how to efficiently compute the validity interval for a query result.

A query's validity interval is a range of timestamps. The lower bound
is the commit time of the transaction that caused the current result
to become valid, and the upper bound is the commit time of the first
subsequent transaction to change the result, making it invalid, or
infinity if the result is still current. \edatnote{DRKP}{Maybe this
  belongs earlier.} The validity interval is computed as the
intersection of two ranges, the \emph{result tuple validity} and the
\emph{invalidity mask}, which we track separately.

\begin{figure}[tp]
  \centering
  \includegraphics[scale=0.4]{db-intervals.pdf}
  \caption{An example of how validity intervals are tracked for a
    read-only query executed at timestamp 46. The first 4 lines show
    the validity of 4 tuples over time. For example, tuple 1 was
    created with commit 43 and deleted with commit 47. Tuples 1 and 2
    get returned for the query, giving a result tuple validity of
    44-47. Tuple 3 is not returned because it was deleted before the
    query's timestamp, with commit 45. Likewise, tuple 4 is not
    returned because it was created after the query's timestamp, with
    commit 48. Thus, 44 must be subtracted from the result validity
    mask to give the query a final validity interval of 45-47.}
  \label{fig:db-intervals}
\end{figure}

The result tuple validity is the intersection of the validity times of
the tuples returned by the query. This reflects the fact that the
results of the query would have been different before any of the
returned tuples had been created, or after any had been
deleted. Computing this interval is relatively straightforward,
because multiversion concurrency requires that each tuple in the
database be tagged with the ID of its creating transaction and
deleting transaction (if any). We simply propagate these tags
throughout query execution. For any operator that combines multiple
tuples to produce a single result, such as a join or aggregate, the
validity interval of the output tuple is the intersection of the
validity intervals of all inputs. Finally, once the entire result set
has been produced, we take the intersection of the tuples' validity
intervals.

The result tuple validity, however, does not completely capture the
validity of a query. The tuples that did \emph{not} appear in the
result, but might have at a different timestamp, also have an
impact. For example, a matching tuple that was deleted just before the
querying transaction's timestamp will not appear in its results, but
means that the result is not valid very far in the past. We capture
this effect with the invalidity mask, which is the union of the
validity times for all tuples that were discarded as invisible to the
transaction's snapshot. Throughout query execution, whenever a tuple
is encountered that was deleted before or created after the
transaction's snapshot time, that tuple's validity interval is added
to the invalidity mask.

The invalidity mask is an imperfect but conservative metric. It tracks
the validity time of all tuples discarded because they were not
visible at the query's timestamp, even though some of these tuples
might not have matched the query conditions either. A conservative
estimate preserves correctness of cached results, but might reduce the
cache hit rate by unnecessarily constraining the validity intervals of
cached items. Its accuracy depends on at which stage of query
processing the visibility check is performed: if timestamps were only
checked after the query executed to completion, the validity interval
would be completely accurate, but tuples that could never appear in
the result would be subjected to potentially expensive query
processing. As in standard Postgres, we perform the visibility check
as early as possible for efficiency; the only difference is that in a
sequential we check the scan predicate before the timestamps. This
provides good results for most queries with minimal cost.

The final validity interval for a query is computed by subtracting the
invalidity mask from the result tuple validity. If the query result is
still valid when the query returns (\ie only currently valid tuples
are returned), the database truncates the validity interval at the
last committed transaction before the query started. \edatnote{IZ}{(Is
  it important to say why we do this here?)} This interval is reported
to the client, piggybacked on the result of all \command{select}
queries. Section~\ref{sec:library} describes how the client library
combines these intervals to obtain validity intervals for all items it
stores in the cache.

% \begin{ednote}{DRKP}
%   Should say something here about either invalidations, or how
%   unbounded intervals are bounded (for the no-invalidations design),
%   or both --- haven't decided which is best yet.
% \end{ednote}

\subsection{Pincushion}
The pincushion is responsible for tracking currently pinned snapshots
and eventually unpinning unused snapshots. Although the pincushion
could be running on the same host as the database, we decided not the
include the functionality in the database itself to reduce the number
of modifications and the overall load on the database.

At the beginning of each read-only transaction, clients request the
list of pinned snapshots from the pincushion that are within the
client's freshness requirement. The pincushion will ensure that none
of the returned snapshots are unpinned until the transaction commits.

If the pincushion does not have any pinned snapshots that satisfy a
client's freshness requirement, the client must run a read-only
transaction in the present and pin a new snapshot. The client gives
the snapshot's timestamp and ID to the pincushion for tracking.

% LocalWords:  multiversion Postgres timestamp tuple tuples DRKP tuple's

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "paper.tex"
%%% End: