\svnInfo $Id$  

The validity intervals that \txcache uses in its cache are derived
from validity information generated by the database. To make this
possible, \txcache uses a modified DBMS that has similar versioning
properties to the cache. Specifically, it can run queries on slightly
stale snapshots, and it computes validity intervals for each query
result it returns.

Though standard databases do not provide these features, they can be
implemented without much difficulty by taking advantage of the same
mechanisms that are used to implement multiversion concurrency control
techniques like snapshot isolation. In this section, we describe how
we modified an existing DBMS, PostgreSQL~\cite{postgresql}, to provide
the necessary support. The modifications are not extensive; our
implementation required changing less than a thousand lines of
code. Moreover, they are not Postgres-specific; the approach is
applicable to other databases that use multiversion concurrency.

\subsection{Exposing Multiversion Concurrency}

Because our cache allows read-only transactions to run slightly in the
past, the database must be able to perform queries against a past
snapshot of a database. This situation arises when a read-only
transaction is assigned a timestamp in the past as a result of using
some cached data and then a later operation in the same transaction
results in a cache miss, requiring the application to query the
database. The database query must return results consistent with the
previously-returned cached values, so the query must execute at the
same timestamp in the past.

One solution is to use a \emph{temporal database}, which tracks the
history of its data and allows ``time travel'' --- running queries
against previous database state. Although such databases exist, they
are not commonly used, because retaining historical data introduces
substantial storage and indexing cost.\footnote{The POSTGRES research
  database was one such system. However, time travel support was
  removed early in transition to an open-source production
  database, precisely because it was considered too
  expensive.} Therefore, we do not consider this a practical
solution.

Instead, we note that the form of history query support required by
our cache is much more limited than general time-travel queries. We
simply require the ability to run a transaction on a stale, but recent
snapshot, not the ability to perform complex queries over the entire
history of the database. In addition, the historical data does not
need to be recoverable after a crash, since we assume the cache is
cleared after the database restarts. Our insight is that these
requirements are essentially identical to those required for
supporting multiversion concurrency control techniques such as
snapshot isolation~\cite{berenson95:_critiq_of_ansi_sql_isolat_level},
so many databases already have the necessary infrastructure to support
them.

We modified Postgres to support running transactions on recent
snapshots by exposing the multiversion storage it uses internally to
implement snapshot isolation. We added a \command{pin} command, which
can be called from a read-only transaction; it assigns an identifier
to the transaction's snapshot. When starting a new transaction, if
this ID is specified using the new \command{snapshotid} argument to
the \command{begin} command, the new transaction will see the same
view of the database as the erstwhile read-only transaction. The
database state as of that snapshot is guaranteed to be retained until
it is released with the \command{unpin} command. A pinned snapshots is
identified by the transaction ID of the last committed transaction
visible to it. (Here, we assume that transaction IDs are ordered by
their commit time; Postgres actually uses a different representation
internally, so this requires a simple mapping.)

Postgres is especially well-suited to this modification because of its
``no-overwrite'' storage
manager~\cite{stonebraker87:_desig_of_postg_storag_system}, which
already retains recent versions of data. Because stale data is only
removed periodically by an asynchronous ``vacuum cleaner'' process,
the fact that we keep data around slightly longer has little impact on
performance. However, our technique is not Postgres-specific; any
database that implements snapshot isolation must have a way to keep a
similar history of recent database states.

\subsection{Tracking Result Validity}
\label{sec:database:validity}

When a read-only transaction executes a query, the returned result is
guaranteed valid at the timestamp associated with that
transaction. Therefore, any values the application computes from that
query result can be stored in the cache, and used by other
transactions assigned to the same timestamp. However, this is overly
restrictive, as it requires other transactions to have exactly the
same timestamp in order to use the shared data. In fact, a query
result is valid for a \emph{range} of time, because other transactions
may not have affected the query result. We call this range the
result's \emph{validity interval}. In this section, we describe
how to efficiently compute the validity interval for a query result.

A query result's validity interval is a range of timestamps. The lower bound
is the  commit time of the most recent transaction that caused any tuple in the result
to become valid, and the upper bound is the commit time of the most recent subsequent transaction that 
changed any tuple in the result, making it invalid, or
infinity if the result is still current. \edatnote{DRKP}{Maybe this
  belongs earlier.} The validity interval is computed as the
intersection of two ranges, the \emph{result tuple validity} and the
\emph{invalidity mask}, which we track separately.

% XXX Even though this figure isn't referenced for a while, it's good
% if it appears near the beginning of the section so people can have
% it in their minds.
\begin{figure}[tp]
  \centering
  \includegraphics[scale=0.4]{db-intervals.pdf}
  \caption{An example of how validity intervals are tracked for
    read-only queries.  All four tuples match the query
    predicate. Tuples 1 and 2 match the timestamp, so their intervals
    intersect to form the result validity. Tuples 3 and 4 fail the
    visibility test, so their intervals join to form the invalidity
    mask. The final validity interval is the difference between the
    result validity and the invalidity mask.}
  \label{fig:db-intervals}
\end{figure}

The result tuple validity is the intersection of the validity times of
the tuples returned by the query. This reflects the fact that the
results of the query would have been different before any of the
returned tuples had been created, or after any had been
deleted. Computing this interval is relatively straightforward,
because multiversion concurrency requires that each tuple in the
database be tagged with the ID of its creating transaction and
deleting transaction (if any). We simply propagate these tags
throughout query execution. For any operator that combines multiple
tuples to produce a single result, such as a join or aggregate, the
validity interval of the output tuple is the intersection of the
validity intervals of all inputs. Finally, once the entire result set
has been produced, we take the intersection of the tuples' validity
intervals.

The result tuple validity, however, does not completely capture the
validity of a query. The tuples that did \emph{not} appear in the
result, but would have if the query were run at a different timestamp,
also have an
impact. For example, a matching tuple that was deleted just before the
querying transaction's timestamp will not appear in its results, but
does mean the results would have been different if computed slightly
earlier in the past. We capture
this effect with the invalidity mask, which is the union of the
validity times for all tuples that failed the \emph{visibility check},
causing them to be discarded because their timestamps made them
invisible to the
transaction's snapshot. Throughout query execution, whenever a tuple
is encountered that was deleted before or created after the
transaction's snapshot time, that tuple's validity interval is added
to the invalidity mask.

The invalidity mask is a conservative estimate because visibility
checks are performed as early as possible in the query plan to avoid
processing unnecessary tuples.  Some of these tuples might have been
discarded anyways if they failed the query conditions later in the
query plan.  If timestamps were only
checked after the query executed to completion, the validity interval
would be completely accurate, but tuples that could never appear in
the result would be subjected to potentially expensive query
processing.  While this preserves the correctness of the cached
results, it might reduce the
cache hit rate by unnecessarily constraining the validity intervals of
cached items.
Though the visibility check is generally performed as early as
possible, in the case of sequential scans and index lookups, we
evaluate the predicate before the visibility check.  This differs from
regular Postgres with respect to sequential scans: unmodified Postgres
evaluates the visibility check first because it is marginally cheaper.
Delaying visibility checks in these two cases improves the quality of
the invalidity mask while incurring minimal overhead.

The final validity interval for a query is computed by subtracting the
invalidity mask from the result tuple validity. This interval is
reported
to the client, piggybacked on the result of all \command{select}
queries. Section~\ref{sec:library} describes how the \txcache library
combines these intervals to obtain validity intervals for all items it
stores in the cache.

If a result returned by a query is still valid in the present (\ie
running the query on the latest snapshot would return the same
result), then the upper bound of the query's validity interval is
unknown.  The database can either indicate that the upper bound is
presently unknown, or it can assign the validity interval some
conservative upper bound.  Because the former route would require an
invalidation scheme in order to assign upper bounds to cached values
once they become known, \txcache takes the latter approach, upper
bounding any returned validity intervals at the latest timestamp known
to the database when the query begins executing.

As a concrete example, figure~\ref{fig:db-intervals} shows a simple
select query running at timestamp 46 that considers four tuples, all
of which match the query predicate, but only two of which satisfy the
visibility check.  For example, tuple 1 was created with commit 43 and
deleted with commit 47, so it appears in the result of the query,
while tuple 3 was deleted by commit 45, before the query's timestamp,
so it does not appear in the result of the query.  The result tuple
validity is the intersection of the validity intervals of the tuples
returned by the query; in this case, tuples 1 and 2 yield a result
tuple validity of 44-47.  Tuples that match the query predicate but
are discarded by the visibility check contribute to the invalidity
mask; in this case, tuples 3 and 4 would have affected the query
result had the query been run sufficiently earlier or sufficiently
later.  The final validity interval of 45-47 is simply the difference
of the result tuple validity and the invalidity mask.  Together, these
guarantee that if the query were run at any other timestamp in the
final validity interval, it would have returned the same result.

% \begin{ednote}{DRKP}
%   Should say something here about either invalidations, or how
%   unbounded intervals are bounded (for the no-invalidations design),
%   or both --- haven't decided which is best yet.
% \end{ednote}

\subsection{Pincushion}

\txcache needs to be able to identify currently pinned snapshots
satisfying a read-only transaction's freshness requirement, and to
eventually unpin
unused snapshots. To do this, it must keep track of what snapshots are
currently pinned on the database, and which pinned snapshots are
currently being used by running transactions.

In principle, the DBMS itself could be responsible for tracking this
information. However, to reduce the overall load on the database, we
chose to place this functionality instead in a separate daemon known
as the \emph{pincushion} (so named because it holds the pinned
snapshot IDs). The pincushion is lightweight, so can be placed
anywhere, \eg{on the same host as the database, or on one of the
  cache servers}.

The pincushion maintains a table of currently pinned snapshots. For
each snapshot, it records the snapshot's ID, the corresponding
wall-clock timestamp, and the number of running transactions
that might be using that snapshot. When beginning a read-only
transaction, the \txcache library running on some application node
requests from the pincushion all sufficiently
fresh pinned snapshots, \eg{any pinned in the last 30 seconds}. The
pincushion returns a list of matching snapshots and records that any
of these pinned snapshots might be in use for the duration of the
transaction.

If the pincushion does not have any pinned snapshots that satisfy the
requested freshness requirement, the \txcache library must create a new
snapshot. It does so by starting a new read-only transaction that runs
in the present, and pinning its snapshot.  Whenever the library does so,
it registers the snapshot's ID and wall-clock time with the pincushion
for tracking. To ensure correct ordering despite network latency and
unsynchronized clocks, we always use the snapshot's wall-clock time as
reported by the database.

Finally, the pincushion is responsible for unpinning snapshots on the
database when they are no longer needed. Periodically, it scans the
list of pinned snapshots for snapshots older than a maximum staleness,
and which no transaction may be using. Any such snapshots are
removed by sending an \command{unpin} command to the database.

% LocalWords:  multiversion Postgres timestamp tuple tuples DRKP tuple's

%%% Local Variables: 
%%% mode: latex
%%% TeX-command-default: "Make"
%%% TeX-PDF-mode: t
%%% TeX-master: "paper.tex"
%%% End: 
% LocalWords:  unsynchronized versioning