Austin Clements (amdragon@mit.edu) and I are interested in working on
the problem of data structures for persistent file systems.

Motivation:

We want to build a file system that's persistent, in the
data-structures sense --- allowing the user to go back in time and
access the previous states of the file system. This has lots of
obvious applications: recovering from accidental deletions and other
errors, automatic (albeit rudimentary) version control, etc.

We're actually working on implementing such a system, called PersiFS,
for a 6.824 (Distributed Systems) project, along with two other
people. If you're interested in more details about why this is a good
idea, or how to implement it (from a systems perspective), see [XXX
PersiFS paper URL].

Our initial design calls for some fairly simple data structures that
will provide most of the necessary functionality but leave something
to be desired in terms of efficiency, in addition to having little
theoretical value. We would like to identify data structures for this
purpose --- perhaps devising new ones or variations on existing ones
as necessary --- and implement them in order to compare their
performance against our original data structures.


Details:

Our data structure needs to be able to support arbitrary modifications
to the content and metadata of files on a filesystem, in a persistent
way. Only partial persistence is necessary; we need to be able to view
previous revisions, but only the latest revision should be
modifiable.

Performance in this case would have to be measured in terms of disk
accesses, so we're looking at data structures in the external memory
model. Cache-obliviousness is probably not required.

Storing every version of every file on disk could require inordinately
large amounts of disk space, so space-efficiency is critical. We can
only afford to store changes between revisions.

It would also be nice to support efficiently erasing a revision from
the data structure, e.g. to purge old revisions to save space. This is
non-critical, however, since our current design doesn't support this
either.


Design Ideas:

WRITEME

- Chunk indirection/LBFS to minimize storage costs.

- Concatenation trees/ropes for file -> chunk mapping?

- B-trees are always good
-- Sleator-Tarjan persistifying them
-- That Arge paper
-- That other paper