% Transactional consistency is pretty expensive, especially in
% distributed systems. This leads otherwise reasonable people to do
% crazy things like relax consistency. Relaxed consistency is bad for
% obvious reasons, and it's hard to reason about when it's OK.

% We offer a practical alternative that retains transactional
% consistency while significantly optimizing the performance of
% read-only transactions. It does this by relaxing \emph{causality}; it
% allows read-only transactions to run slightly in the past, but
% guarantees that they see a transactionally-consistent view.

% We argue that this property is useful. Running slightly in the past is
% a totally reasonable thing to do: might miss a recent commit, but
% unless you know otherwise, that commit could just as easily have
% happened slightly later. Local causality, so entirely correct if no
% communication; similarly so if causal interactions can be modeled,
% e.g. w/ Lamport clocks.

% The tradeoff is that more space is needed to keep around old versions,
% but this can be kept pretty small, space is cheap, and you might want
% to have them around anyway.

% We think this will be useful for a variety of applications: database
% systems, transactional file systems, maybe persistent object stores if
% people still care about those. We've implemented it and it (hopefully)
% shows all kinds of wonderful performance improvements.

% Outline of the paper.

Many distributed systems, including distributed databases and
transactional file systems, need to provide support for distributed
transactions. Unfortunately, achieving full serializable isolation for
distributed transactions is an expensive proposition, generally
requiring locking, optimistic concurrency, and/or two-phase
commit. The high cost of these protocols leads many implementors to
consider alternatives that trade off consistency for performance.

Achieving full consistency is expensive because it encompasses two
requirements, \emph{isolation} and \emph{freshness}. To ensure that
the results of concurrent transactions are equivalent to a serial
ordering, each transaction must operate on a transactionally
consistent state; this requires either the use of two-phase locking or
a multiversion scheme. Ensuring that transactions operate on fresh
data limits the amount of cached information that can be used.

Existing work on weak consistency has focused on weakening isolation
to levels less than full
serializability~\cite{gray77:_granul_of_locks_and_degrees,berenson95:_critiq_of_ansi_sql_isolat_level,adya00:_gener_isolat_level_defin}. Though
these approaches are common in practice, allowing shorter-duration
locks or fewer aborted transactions, it is difficult to reason about
when they are
correct~\cite{fekete05:_makin_snaps_isolat_serial}. Following a
suggestion by Liskov and
Rodrigues~\cite{liskov04:_trans_file_system_can_be_fast}, we take an
alternative approach, weakening \emph{causality} instead of
serializability.

Specifically, our approach allows read-only transactions to run on
slightly stale data, but guarantees that they see a
transactionally-consistent state. Read/write transactions are executed
using a standard optimistic concurrency protocol, ensuring that they
operate on the latest data. Allowing read-only transactions to use
stale data not only makes it possible to avoid conflicts involving
read-only transactions, but makes it possible to take advantage of
data that is already in the cache. To ensure that the data seen by
read-only transactions is reasonable, we require that it reflect the
results of all transactions that committed on the local node, and all
other transactions that committed more than $\epsilon$ seconds ago;
these constraints are specified formally in
Section~\ref{sec:algorithm:properties}.

We argue this is a reasonable model for users of the system, since
many transactions do not need to be run on the latest version of data
as long as they still see a consistent state. As a result of our
relaxed freshness requirements, the anomalies that can occur are ones
where a read-only transaction $A$ executed on one node fails to see
the results of a transaction $B$ that recently committed on a
different node. However, although $B$ might have happened before $A$,
the commit point could just as well have happened slightly later,
after $A$. Changing the ordering has no visible effect as long as
there has been no communication between the two nodes involved to
synchronize them; the two transactions can be said to happen
concurrently and thus can be reordered.\footnote{This is the same as
  the distinction between Lamport's \emph{happened-before} and
  \emph{happens-before}
  relations~\cite{lamport78:_time_clock_and_order_of}.} We assume that
the staleness interval $\epsilon$ is small enough that a synchronizing
communication between nodes is unlikely to happen in the interval, but
a more complete solution would use Lamport
clocks~\cite{lamport78:_time_clock_and_order_of} to explicitly model
synchronization and concurrency in the system.

Our system optimizes the performance of read-only transactions. Such
transactions are quite common in file systems, where the vast majority
(over 80\%) of operations are
read-only~\cite{vogels99:_file_system_usage_in_window_nt,unix91:_john_k};
our experiments also confirmed this property. Moreover, it also
eliminates potential conflicts between read-only and read/write
transactions, and reduces concurrency overhead for read-only
transactions. It is necessary for transactions to specify whether they
are read-only or read/write when they begin, but this requirement is
not onerous; a simple static analysis can usually determine whether it
is read-only.

Besides weakened causality, the principal cost incurred by our scheme
is additional storage space. However, this cost should be small,
because only a few extra revisions will need to be kept. Moreover,
storage capacity is rarely the limiting factor for scalability in such
systems. Indeed, a number of systems, including temporal
databases~\cite{stonebraker87:_desig_of_postg_storag_system,oezsoyoglu95:_tempor_and_real_time_datab},
persistent object stores~\cite{moh04:_timel}, and versioned file
systems~\cite{santry99:_decin_when_to_forget_in,hitz94:_file_system_desig_for_nfs}
already maintain a history of versions in order to enable queries on
previous system state, so such systems would incur no additional
storage cost for our protocol.

The structure of this paper is as follows: Section~\ref{sec:overview}
provides an overview of the system, ending with this paragraph giving
an outline of the paper. Section~\ref{sec:model} describes the
architecture of the system and the interface provided to
clients. Section~\ref{sec:algorithm} explains our concurrency control
protocol in detail, explaining what properties it guarantees and how
it achieves them. Section~\ref{sec:implementation} describes our
implementation, and Section~\ref{sec:evaluation} analyzes its
performance using file system traces. We review related work in
Section~\ref{sec:related-work}, and describe our plans for future work
in Section~\ref{sec:future-work}. Finally,
Section~\ref{sec:conclusion} concludes.


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "report"
%%% TeX-PDF-mode: t
%%% End: 

% LocalWords:  serializable