\svnInfo $Id$  

The validity intervals that \txcache uses in its cache are derived
from validity information generated by the database. To make this
possible, \txcache uses a modified DBMS that has similar versioning
properties to the cache. Specifically, it can run queries on slightly
stale snapshots, and it computes validity intervals for each query
result it returns. It also assigns invalidation tags to queries, and
produces the invalidation stream described in
Section~\ref{sec:cache:invalidations}.

Though standard databases do not provide these features, they can be
implemented by reusing the same mechanisms that are used to implement
multiversion concurrency control techniques like snapshot
isolation. In this section, we describe how we modified an existing
DBMS, PostgreSQL~\cite{postgresql}, to provide the necessary
support. The modifications are not extensive (under 2000 lines of code
in our implementation). Moreover, they are not Postgres-specific; the
approach can be applied to other databases that use multiversion
concurrency.

\subsection{Exposing Multiversion Concurrency}

Because our cache allows read-only transactions to run slightly in the
past, the database must be able to perform queries against a past
snapshot of a database. This situation arises when a read-only
transaction is assigned a timestamp in the past and reads some cached
data, and then a later operation in the same transaction results in a
cache miss, requiring the application to query the database. The
database query must return results consistent with the cached values
already seen, so the query must execute at the same timestamp in the
past.

Temporal databases, which track the history of their data and allow
``time travel'' are a possible but impractical solution, because they
impose substantial storage and indexing cost for historical data. What
we require is much simpler: we only need to run a transaction on a
stale but recent snapshot, not perform complex queries over the entire
history of the database. Our insight is that these requirements are
essentially identical to those for supporting snapshot
isolation~\cite{berenson95:_critiq_of_ansi_sql_isolat_level}, so many
databases already have the infrastructure to support them.

We modified Postgres to % support running transactions on recent
%snapshots by
expose the multiversion storage it uses internally to
provide snapshot isolation. We added a \command{pin} command that
assigns an ID to a read-only transaction's snapshot. When
starting a new transaction, the \txcache library can specify this ID
using the new \command{begin snapshotid} syntax, creating a new
transaction that sees the same view of the database as the erstwhile
read-only transaction. The database state for that snapshot will be
retained at least until it is released by the \command{unpin}
command. A pinned snapshot is identified by the commit time of the
last committed transaction visible to it, allowing it to be easily
ordered with respect to update transactions and other snapshots.

Postgres is especially well-suited to this modification because of its
``no-overwrite'' storage
manager~\cite{stonebraker87:_desig_of_postg_storag_system}, which
already retains recent versions of data. Because stale data is only
removed periodically by an asynchronous ``vacuum cleaner'' process,
the fact that we keep data around slightly longer has little impact on
performance. However, our technique is not Postgres-specific; any
database that implements snapshot isolation must have a way to keep a
similar history of recent database states, such as Oracle's rollback
segments.

\subsection{Tracking Result Validity}
\label{sec:database:validity}

\txcache needs the database server to provide the validity interval
for every query result in order to ensure transactional consistency of
cached objects. Recall that this is defined as the range of timestamps
for which the query would give the same results. Its lower bound is
the commit time of the most recent transaction that added, deleted, or
modified any tuple in the result set. It may have an upper bound if a
subsequent transaction changed the result, or it may be unbounded if
the result is still current. % This section describes
% how to efficiently compute the validity interval of a query result.

% When a read-only transaction executes a query, the returned result is
% guaranteed valid at the timestamp associated with that
% transaction. Therefore, any values the application computes from that
% query result can be stored in the cache, and used by other
% transactions assigned to the same timestamp. However, this is overly
% restrictive, as it requires other transactions to have exactly the
% same timestamp in order to use the shared data. In fact, a query
% result is valid for a \emph{range} of time, because other transactions
% may not have affected the query result. We call this range the
% result's \emph{validity interval}.

% A query result's validity interval is a range of timestamps. The lower bound
% is the  commit time of the most recent transaction that caused any tuple in the result
% to become valid, and the upper bound is the commit time of the most recent subsequent transaction that 
% changed any tuple in the result, making it invalid, or
% infinity if the result is still current. \edatnote{DRKP}{Maybe this
%   belongs earlier.}

The validity interval is computed as the
intersection of two ranges, the \emph{result tuple validity} and the
\emph{invalidity mask}, which we track separately.

% XXX Even though this figure isn't referenced for a while, it's good
% if it appears near the beginning of the section so people can have
% it in their minds.
\begin{figure}[tp]
  \centering
  \includegraphics[scale=0.4]{db-intervals.pdf}
  \caption{Example of tracking the validity interval for a
    read-only query.  All four tuples match the query
    predicate. Tuples 1 and 2 match the timestamp, so their intervals
    intersect to form the result validity. Tuples 3 and 4 fail the
    visibility test, so their intervals join to form the invalidity
    mask. The final validity interval is the difference between the
    result validity and the invalidity mask.}
  \label{fig:db-intervals}
\end{figure}

The result tuple validity is the intersection of the validity times of
the tuples returned by the query. For example, tuple 1 in
Figure~\ref{fig:db-intervals} was deleted at time 47, and tuple 2 was
created at time 44; the result would be different before time 44 or
after time 47. This interval is easy to compute because multiversion
concurrency requires that each tuple in the database be tagged with
the ID of its creating transaction and deleting transaction (if
any). We simply propagate these tags throughout query execution. If an
operator, such as a join, combines multiple tuples to produce a single
result, the validity interval of the output tuple is the intersection
of its inputs.
%Finally, once
%the entire result set has been produced, we take the intersection of
%the tuples' validity intervals.

% This reflects the fact that the results of the query would have been
% different before any of the returned tuples had been created, or after
% any had been deleted. Computing this interval is relatively
% straightforward, because multiversion concurrency requires that each
% tuple in the database be tagged with the ID of its creating
% transaction and deleting transaction (if any). We simply propagate
% these tags throughout query execution. For any operator that combines
% multiple tuples to produce a single result, such as a join or
% aggregate, the validity interval of the output tuple is the
% intersection of the validity intervals of all inputs. Finally, once
% the entire result set has been produced, we take the intersection of
% the tuples' validity intervals.

The result tuple validity, however, does not completely capture the
validity of a query, because of \emph{phantoms}. These are tuples that
did \emph{not} appear in the result, but would have if the query were
run at a different timestamp. For example, tuple 3 in
Figure~\ref{fig:db-intervals} will not appear in the results because
it was deleted before the query timestamp, but the results would be
different if the query were run before it was deleted. Similarly,
tuple 4 is not visible because it was created afterwards. We capture
this effect with the invalidity mask, which is the union of the
validity times for all tuples that failed the \emph{visibility check},
causing them to be discarded because their timestamps made them
invisible to the transaction's snapshot. Throughout query execution,
whenever such a tuple is encountered, its validity interval 
is added to the invalidity mask.

The invalidity mask is conservative because visibility checks are
performed as early as possible in the query plan to avoid processing
unnecessary tuples. Some of these tuples might have been discarded
anyway if they failed the query conditions later in the query plan
(perhaps after joining with another table). While being conservative
preserves the correctness of the cached results, it might
unnecessarily constrain the validity intervals of cached items,
reducing the hit rate. To ameloriate this problem, we continue to
perform the visibility check as early as possible, but during
sequential scans and index lookups, we evaluate the predicate before
the visibility check. This differs from regular Postgres with respect
to sequential scans, where it evaluates the cheaper visibility check
first. Delaying the visibility checks improves the quality of the
invalidity mask, and incurs little overhead for simple predicates,
which are most common.

Finally, the invalidity mask is subtracted from the result tuple
validity to give the query's final validity interval.  This interval
is reported to the \txcache library, piggybacked on each
\command{select} query result; the library combines these intervals to
obtain validity intervals for objects it stores in the cache.

% If a result returned by a query is still valid in the present (\ie
% running the query on the latest snapshot would return the same
% result), then the upper bound of the query's validity interval is
% unknown.  The database can either indicate that the upper bound is
% presently unknown, or it can assign the validity interval some
% conservative upper bound.  Because the former route would require an
% invalidation scheme in order to assign upper bounds to cached values
% once they become known, \txcache takes the latter approach, upper
% bounding any returned validity intervals at the latest timestamp known
% to the database when the query begins executing.

% As a concrete example, Figure~\ref{fig:db-intervals} shows a simple
% select query running at timestamp 46 that considers four tuples, all
% of which match the query predicate, but only two of which satisfy the
% visibility check.  For example, tuple 1 was created with commit 43 and
% deleted with commit 47, so it appears in the result of the query,
% while tuple 3 was deleted by commit 45, before the query's timestamp,
% so it does not appear in the result of the query.  The result tuple
% validity is the intersection of the validity intervals of the tuples
% returned by the query; in this case, tuples 1 and 2 yield a result
% tuple validity of 44-47.  Tuples that match the query predicate but
% are discarded by the visibility check contribute to the invalidity
% mask; in this case, tuples 3 and 4 would have affected the query
% result had the query been run sufficiently earlier or sufficiently
% later.  The final validity interval of 45-47 is simply the difference
% of the result tuple validity and the invalidity mask.  Together, these
% guarantee that if the query were run at any other timestamp in the
% final validity interval, it would have returned the same result.

% \begin{ednote}{DRKP}
%   Should say something here about either invalidations, or how
%   unbounded intervals are bounded (for the no-invalidations design),
%   or both --- haven't decided which is best yet.
% \end{ednote}

\subsection{Automating Invalidations}
\label{sec:database:invalidations}

When the database executes a query and reports that its validity
interval is unbounded, \ie{the query result is still valid}, it
assumes responsibility for providing an invalidation when the result
may have changed. At query time, it must assign invalidation tags to
indicate the query's dependencies, and at update time, it must notify
the cache of invalidation tags for objects that might have changed.

When a query is performed, the database examines the query plan it
generates. At the lowest level of the tree are the access methods that
obtain the data, \eg{a sequential scan of a heap file, or a B-tree
  index lookup}. For index equality lookups, the database assigns an
invalidation tag of the form \invtag{table:key}. For other types, it
assigns a wildcard tag \invtag{table:$\star$}. Each query may have
multiple tags; the complete set is returned along with the
\command{select} query results.

When a read/write transaction modifies some tuples, the database
identifies the set of invalidation tags affected. Each tuple added,
deleted, or modified yields one invalidation tag for each index it is
listed in. If a transaction modifies many tuples, the database can
aggregate multiple tags into a single wildcard tag on
\invtag{table:$\star$}. Generated invalidation tags are queued until a
transaction commits. When it does, the database server passes the set
of tags, along with the transaction's timestamp, to the multicast
service for distribution to the cache nodes, ensuring that the
invalidation stream is properly ordered. 

\subsection{Pincushion}

\txcache needs to keep track of which snapshots are pinned on the
database, and which of those are within a read-only transaction's
staleness limit. It also must eventually unpin old snapshots, provided
that they are not used by running transactions. The DBMS itself could
be responsible for tracking this information. However, to simplify
implementation, and to reduce the overall load on the database, we
placed this functionality instead in a lightweight daemon known as the
\emph{pincushion} (so named because it holds the pinned snapshot
IDs). It can be run on the database host, on a cache server, or
elsewhere.

The pincushion maintains a table of currently pinned snapshots,
containing the snapshot's ID, the corresponding wall-clock timestamp,
and the number of running transactions that might be using it. When
the \txcache library running on an application node begins a read-only
transaction, it requests from the pincushion all sufficiently fresh
pinned snapshots, \eg{those pinned in the last 30 seconds}.  The
pincushion flags these snapshots as possibly in use, for the duration
of the transaction.  If there are no sufficiently fresh pinned
snapshots, the \txcache library starts a read-only transaction on the
database, running on the latest snapshot, and pins that snapshot.  It
then registers the snapshot's ID and the wall-clock time (as reported
by the database) with the pincushion.  The pincushion also
periodically scans its list of pinned snapshots, removing any unused
snapshots older than a threshold, by sending an \command{unpin} command
to the database.

Though the pincushion is accessed on every transaction, it performs
little computation and is unlikely to form a bottleneck. In all of
our experiments, nearly all pincushion requests received a response in
under 0.2 ms, approximately the network round-trip time.  We have
also developed a protocol for replicating the pincushion to increase
its throughput, but it has yet to become necessary.

% LocalWords:  multiversion Postgres timestamp tuple tuples DRKP tuple's

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "paper.tex"
%%% End: 
% LocalWords:  unsynchronized versioning