\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{url}

\begin{document}

\title{Persistent File System Data Structures\\ \Large{6.897 Project
    Proposal}}
\author{Dan Ports and Austin Clements\\
  \texttt{\{drkp,amdragon\}@mit.edu}}
\date{April 22, 2005}
\maketitle

We propose to investigate the problem of data structures for
persistent file systems.

\section{Motivation}
\label{sec:motivation}

We want to build a file system that's partially persistent, in the
data-structures sense --- allowing the user to go back in time and
access the previous states of the file system. This has lots of
obvious applications: recovering from accidental deletions and other
errors, automatic (albeit rudimentary) version control, etc.

We're actually working on implementing such a system, called
\emph{PersiFS}, for a 6.824 (Distributed Systems) project, along with
two other people. If you're interested in more details about why this
is a good idea, or how to implement it (from a systems perspective),
see the preliminary draft of our report~\cite{persifs}.

Our initial design calls for some fairly simple data structures that
will provide most of the necessary functionality but leave something
to be desired in terms of efficiency, in addition to having little
theoretical value. We would like to identify data structures for this
purpose --- devising new ones or variations on existing ones as
necessary --- and implement them in order to compare their performance
against our original, simple data structures.


\section{Details}
\label{sec:details}

Our data structure needs to be able to support arbitrary modifications
to the content and metadata of files on a filesystem, in a persistent
way. Only partial persistence is necessary; we need to be able to view
previous revisions, but only the latest revision should be
modifiable.

Performance in this case would be measured in terms of disk accesses,
so we're looking at data structures in the external memory model.
Cache-obliviousness is probably not required.

Storing every version of every file on disk could require inordinately
large amounts of disk space, so space-efficiency is critical. We can
only afford to store changes between revisions.

It would be preferable to support efficiently erasing a revision from
the data structure, e.g. to purge old revisions to save space. This is
non-critical, however, since our current design doesn't support this
either.

The interface should be a standard file system interface: given a file
ID (inumber), read or change some substring of the file data, or read
a substring of the file data as it appeared at some point in the
past. Directories and other file system structures can be handled at
a higher level, using this abstraction.
  
\section{Design Ideas}
\label{sec:design-ideas}

We don't yet have a design in mind, but we do have a few ideas that
we'll likely incorporate:

\begin{itemize}
% - Chunk indirection/LBFS to minimize storage costs.
\item To minimize storage costs by taking advantage of similarity
  between content unchanged in revisions, we propose to apply
  indirection. We can represent each file as a sequence of
  \emph{chunks} of data, identified by their hash value. Then each
  revision of a file can simply be stored as a short list of the
  chunks that comprise it, and only changed chunks need to be stored.
  We can use an algorithm from the
  LBFS~\cite{muthitacharoen01lowbandwidth} file system to create
  variable-length chunks with boundaries drawn based on content, in
  order to support space-efficient insertion and deletion of file
  content.
  
% - Concatenation trees/ropes for file -> chunk mapping?
\item To further increase the efficiency of this technique, we might
  consider using something like concatenation trees or
  ropes~\cite{boehm95ropes} in conjunction with indirection to store
  file contents.
  
% - B-trees are always good
% -- Sleator-Tarjan persistifying them
% -- That Arge paper
% -- That other paper
  
\item The use of the external memory model suggests using
  B$^{+}$-trees or some related data structure for storing index data.
  Of course, we will need to augment the data structure in order to
  make it persistent. This could probably be done with the standard
  Sleator-Tarjan persistence technique~\cite{driscoll89peristent}.
  Arge et al.~\cite{arge-btree} designed a persistent B-tree for the
  external memory model in roughly this way. Their paper describes an
  implementation for a GIS algorithm; we may wind up adapting it and
  analyzing its performance for our application. Other related work
  \cite{lanka91} exists on persistent B-trees, but we have yet to look
  into it extensively.
\end{itemize}


\bibliographystyle{abbrv}
\bibliography{persifs}

\end{document}