This is a second attempt to try to sort out what's going on in our system and how it differs fundamentally from database replication. I'm trying to get a different perspective by rederiving the system to see why all the features we provide are really needed. 1. Database replication Suppose for now that we're just interested in replicating a database. If you're like me (or probably Barbara), your first idea might be to use synchronous replication with a master and group of replicas: every modification locks the relevant data items and triggers a two-phase commit. This is a great approach in the sense that it provides both (transactional) consistency and freshness: any request can be sent to any replica or the master with the same result. But it is horrible for performance because of all the blocking! So I doubt anyone really does this -- although it might be reasonable if updates are really rare. So let's make the replication asynchronous: eliminate the two-phase commit and just push updates to the replicas on a best-effort basis. What does this mean? - we've given up a guarantee of freshness, although this probably isn't a big deal in practice because the replication lag should be low - every replica will still have some valid snapshot of the entire database, as long as the master pushes the updates in order. - but: no guarantees that they'll have the same snapshot (Note that we still need some concurrency control mechanism on the replicas, but it is standard -- either we block updates during a transaction, or we use snapshot isolation.) Having replicas with different snapshots of the database (i.e. at different timestamps) isn't a big deal for full-database replication, because we can send each transaction to a single replica -- there's no real reason to do otherwise. As we'll see, however, partitioning the replicas changes that. Note, first, though that systems like FAS and Ganymed (the ones our VLDB reviewers pointed out) are basically following this asynchronous replication approach. They just increase the replication lag by being more lazy, to take advantage of batching and write absorption. They also keep track of timestamps to make sure nobody sees anything more stale than they can tolerate. But that's about it. 2. Partitioning & database replication Now what about partitioning? What if we wanted each slave to only store part of the database? (It's not entirely clear why you'd need to do this, given that the master still has to be able to store the entire database -- but maybe the slaves are wimpier machines, or maybe you just want to increase utilization by not having all changes applied everywhere.) Now we have a problem because no one replica (besides the master) has all the data in the system, so transactions may need to access more than one. But the replicas may be at different snapshots -- and I think this is unavoidable in an asynchronous system. So how do we combine data from replicas with different timestamps? The problem here arises because we might access some data from a replica with snapshot at time t1 and later in the transaction need some data from another replica that's at time t2. We can deal with this by making each replica a multiversion store, i.e. asking the second replica to start a transaction w/ timestamp t1. If t1 <= t2, this is just a matter of keeping some old revisions around. On the other hand, if t1 > t2, we might need to block: partition 1 is newer than partition 2, so we need to wait for partition 2 to come up to date. A second familiar technique can help us here: namely, keeping track of the past validity of each data item and choosing timestamps lazily. That is, we'd realize that the data we've seen from partition 1 might well be consistent with timestamp t2 also, so we can just declare the transaction to be at timestamp t2 and use data from partition 2. 3. Application-level caches Let's switch gears back to application-level caching. What's different about this world? For one thing, the space of application-level objects is so large that it's not feasible to keep the complete set in the cache. (See my previous note for why -- or consider that you'd have to keep around every possible webpage someone might request, even if no one ever wants it.) So the cache is *incomplete*: there will be requests that it just can't satisfy. Those have to go to the database. And the data from the database will have to be consistent with data seen from the cache, even though the database probably has more current information. As with the partitioned database replicas, we're going to need the database to be a multiversioned backing store in order to cope with the fact that cache nodes are at different timestamps (without blocking). The second, and I think most important, difference between application-level and data-level caches is that we have to use a passive-caching approach where we keep around recently-used objects, rather than an active-replication approach where we push the complete stream of all changes. (Think about the problems with identifying and recomputing all changed webpages each time the database is updated, even if no one requests them.) In other words, our cache is actually a cache. This means that our objects are computed at different times. Thus, in order to combine them, we need to know that they're valid at not just an instant in time but at (hopefully overlapping) ranges of time. We can do this in two ways: extending it into the past via database modifications to compute the lower bound, or extending the validity forward into the future using invalidations. We currently do the former, and will do the latter. Either gets us a workable system, but both are needed for a truly effective one. Does it matter that the cache is partitioned over many nodes rather than being a single node? (As we saw above, this made a big difference for database replication.) I think it's less important here because the cache is incomplete in either case, so in a sense the system is already partitioned between the cache and the database. 4. Going further into the past Note that I haven't mentioned running on stale data as yet, which may seem a surprising omission. In fact, the caching schemes I've discussed (at least with invalidations) don't require it. Given that we want to avoid blocking, they relax freshness slightly in the usual (e.g. snapshot isolation) sense: even if a transaction sees data that was current at transaction-start time may become obsolete while the transaction is running. But we don't have to run on an even older snapshot. Why might we want to deal in staleness, then? My first answer is because we can: applications basically already *are*, to some degree: - in an asynchronous network, messages might be delayed, causing data to be out of date by the time it arrives - in a snapshot-isolation database, snapshots may no longer be current by the time the transaction executes - if we want to assign a transaction the current timestamp... in a partitioned-cache system, what is "the current timestamp", anyway? Is it the timestamp on the database? (but we don't want to ask the database if we can avoid it!) Is it the timestamp on one of the cache nodes? (which one? they could be different!) Furthermore, we know some applications can explicitly tolerate stale data, sometimes significantly so. So we might as well expose this control. We get a few benefits from increased staleness: - the effective cache lifetime of an entry is increased: now it is the time until it becomes invalid plus the time until it becomes too stale to use. This increases the hit rate. - we can make use of data that becomes invalid very quickly: our current paper shows that we still get a non-trivial benefit even if we assume all data becomes invalid immediately. This also lets us cope with invalidation mechanisms with high false-positive rates. - we avoid the problem mentioned above where we might have to stall because one cache node is lagging behind, as long as it's still within the freshness limit. This could be especially useful with application-level caching because it could take a while for the application to finish computing a value and insert it into the cache. Finally, dealing with this range of possible timestamps requires us to use lazy timestamp selection to make an intelligent decision about which timestamp to assign, in the absence of any information up front about what data will be accessed or is present in the cache.