This is a second attempt to try to sort out what's going on in our
system and how it differs fundamentally from database replication. I'm
trying to get a different perspective by rederiving the system to see
why all the features we provide are really needed.


1. Database replication

Suppose for now that we're just interested in replicating a
database. If you're like me (or probably Barbara), your first idea
might be to use synchronous replication with a master and group of
replicas: every modification locks the relevant data items and
triggers a two-phase commit.

This is a great approach in the sense that it provides both
(transactional) consistency and freshness: any request can be sent to
any replica or the master with the same result. But it is horrible for
performance because of all the blocking! So I doubt anyone really does
this -- although it might be reasonable if updates are really rare.

So let's make the replication asynchronous: eliminate the two-phase
commit and just push updates to the replicas on a best-effort
basis. What does this mean?
  - we've given up a guarantee of freshness, although this probably
    isn't a big deal in practice because the replication lag should be
    low
  - every replica will still have some valid snapshot of the entire
    database, as long as the master pushes the updates in order.
  - but: no guarantees that they'll have the same snapshot

(Note that we still need some concurrency control mechanism on the
replicas, but it is standard -- either we block updates during a
transaction, or we use snapshot isolation.)

Having replicas with different snapshots of the database (i.e. at
different timestamps) isn't a big deal for full-database replication,
because we can send each transaction to a single replica -- there's no
real reason to do otherwise. As we'll see, however, partitioning the
replicas changes that.

Note, first, though that systems like FAS and Ganymed (the ones our
VLDB reviewers pointed out) are basically following this asynchronous
replication approach. They just increase the replication lag by being
more lazy, to take advantage of batching and write absorption. They
also keep track of timestamps to make sure nobody sees anything more
stale than they can tolerate. But that's about it.


2. Partitioning & database replication

Now what about partitioning? What if we wanted each slave to only
store part of the database? (It's not entirely clear why you'd need to
do this, given that the master still has to be able to store the
entire database -- but maybe the slaves are wimpier machines, or maybe
you just want to increase utilization by not having all changes
applied everywhere.) 

Now we have a problem because no one replica (besides the master) has
all the data in the system, so transactions may need to access more
than one. But the replicas may be at different snapshots -- and I
think this is unavoidable in an asynchronous system. So how
do we combine data from replicas with different timestamps?

The problem here arises because we might access some data from a
replica with snapshot at time t1 and later in the transaction need
some data from another replica that's at time t2. We can deal with
this by making each replica a multiversion store, i.e. asking the
second replica to start a transaction w/ timestamp t1. If t1 <= t2,
this is just a matter of keeping some old revisions around.

On the other hand, if t1 > t2, we might need to block: partition 1 is
newer than partition 2, so we need to wait for partition 2 to come up
to date. A second familiar technique can help us here: namely, keeping
track of the past validity of each data item and choosing timestamps
lazily. That is, we'd realize that the data we've seen from partition
1 might well be consistent with timestamp t2 also, so we can just
declare the transaction to be at timestamp t2 and use data from
partition 2.


3. Application-level caches

Let's switch gears back to application-level caching. What's different
about this world? 

For one thing, the space of application-level objects is so large that
it's not feasible to keep the complete set in the cache. (See my
previous note for why -- or consider that you'd have to keep around
every possible webpage someone might request, even if no one ever
wants it.)

So the cache is *incomplete*: there will be requests that it just
can't satisfy. Those have to go to the database. And the data from the
database will have to be consistent with data seen from the cache,
even though the database probably has more current information. As
with the partitioned database replicas, we're going to need the
database to be a multiversioned backing store in order to cope with
the fact that cache nodes are at different timestamps (without
blocking).

The second, and I think most important, difference between
application-level and data-level caches is that we have to use a
passive-caching approach where we keep around recently-used objects,
rather than an active-replication approach where we push the complete
stream of all changes. (Think about the problems with identifying and
recomputing all changed webpages each time the database is
updated, even if no one requests them.) In other words, our cache is
actually a cache.

This means that our objects are computed at different times. Thus, in
order to combine them, we need to know that they're valid at not just
an instant in time but at (hopefully overlapping) ranges of time. We
can do this in two ways: extending it into the past via database
modifications to compute the lower bound, or extending the validity
forward into the future using invalidations. We currently do the
former, and will do the latter. Either gets us a workable system, but
both are needed for a truly effective one.

Does it matter that the cache is partitioned over many nodes rather
than being a single node? (As we saw above, this made a big difference
for database replication.) I think it's less important here because
the cache is incomplete in either case, so in a sense the system is
already partitioned between the cache and the database.


4. Going further into the past

Note that I haven't mentioned running on stale data as yet, which may
seem a surprising omission. In fact, the caching schemes I've
discussed (at least with invalidations) don't require it. Given that
we want to avoid blocking, they relax freshness slightly in the usual
(e.g. snapshot isolation) sense: even if a transaction sees data that
was current at transaction-start time may become obsolete while the
transaction is running. But we don't have to run on an even older
snapshot.

Why might we want to deal in staleness, then? My first answer is
because we can: applications basically already *are*, to some degree:
 - in an asynchronous network, messages might be delayed, causing data
   to be out of date by the time it arrives
 - in a snapshot-isolation database, snapshots may no longer be
   current by the time the transaction executes
 - if we want to assign a transaction the current timestamp... in a
   partitioned-cache system, what is "the current timestamp", anyway?
   Is it the timestamp on the database? (but we don't want to ask the
   database if we can avoid it!) Is it the timestamp on one of the
   cache nodes? (which one? they could be different!)

Furthermore, we know some applications can explicitly tolerate stale
data, sometimes significantly so. So we might as well expose this
control.

We get a few benefits from increased staleness:
 - the effective cache lifetime of an entry is increased: now it is
   the time until it becomes invalid plus the time until it becomes
   too stale to use. This increases the hit rate.
 - we can make use of data that becomes invalid very quickly: our
   current paper shows that we still get a non-trivial benefit even if
   we assume all data becomes invalid immediately. This also lets us
   cope with invalidation mechanisms with high false-positive rates.
 - we avoid the problem mentioned above where we might have to stall
   because one cache node is lagging behind, as long as it's still
   within the freshness limit. This could be especially useful with
   application-level caching because it could take a while for the
   application to finish computing a value and insert it into the
   cache.

Finally, dealing with this range of possible timestamps requires us to
use lazy timestamp selection to make an intelligent decision about
which timestamp to assign, in the absence of any information up front
about what data will be accessed or is present in the cache.