\documentclass[twocolumn,11pt]{article}
\usepackage{fullpage}
\usepackage{url}

\usepackage{graphicx}


\title{PersiFS: \LARGE{A Continuously Versioned Network File System}}

\author{Austin T. Clements, Dan R. K. Ports, Ben A. Schmeckpeper, Hector Yuen \\
\small{\texttt{\{aclements, drkp, bschmeck, hyz\}@mit.edu}}}

\newcommand{\PersiFS}{\emph{PersiFS}}

\begin{document}

\maketitle

% First draft report. This should include a draft of your report's
% abstract, introduction, related work, and design sections. These
% sections should be in good shape, close to what they would look like
% in the final report. Be sure that the draft clearly states what your
% project's goals are, why those goals are worthwhile, and how you're
% going to achieve those goals.

\begin{abstract}
  {\small\it Most file systems are \emph{ephemeral}, meaning that once
    a change has been made, there is no way to recall the previous
    contents of the file system.  Backups, version control systems,
    and user interface improvements such as ``trash cans'' attempt to
    alleviate this problem; however, these are all rough
    approximations of \emph{persistent} file system structures, giving
    users restricted access to a restricted set of past states of the
    file system.  \PersiFS\ is a \emph{fully persistent} or
    \emph{continuously versioned} file system, allowing access to
    \emph{any} past state of the entire file system.  Furthermore,
    \PersiFS\ exposes the file system and its full set of past states
    via a conventional file system interface, allowing unmodified
    applications to access the past states of the file system.}
\end{abstract}

\section{Introduction}
\label{sec:intro}

% Overall description
%% Meaning of ``persistence''
%% Versioning is done at the scope of the entire file system, not just
%% to individual files
%% Ask for a snapshot of the entire file system at a particular point
%% in time

\PersiFS\ is a network file system which remembers all changes made to
every file and allows the user to recall any and all modifications to
any file, including deletions.  Time stamps are used to select a view
of the file system's past state.  A user can specify a time in the
past at which the file was correct and \PersiFS\ will provide
read-only access to that past version of the file.  \PersiFS's time
stamp interface provides users with a natural way to express accesses
to past versions of files.

A \emph{version-controlled file system} lets the user access not just
the current state of the file system but previous states as well.  For
example, backups are one typical, restricted form of a
version-controlled file system, allowing high-latency access to a
highly restricted set of past copies of a file system.

\PersiFS\ is a \emph{continuously} versioned file system.  A
continuously version-controlled file system allows access to the
complete state of the file system at \emph{any} point in the past.
Incorporating continuous versioning into the file system means that a
file can never be lost and mistaken modifications can always be
undone.  In \PersiFS, any change to a file forces the file system to
archive a copy of that file, indexed by a time stamp.

\subsection{Motivation}
\label{sec:intro:motivation}

% Motivations
%% Versioning is good
%% See the wiki
%%% File systems don't provide a trash can, even though UI's do.  This
%%% provides a file system-supported trash can that applications don't
%%% have to be aware of

% Make explicit mention of our goal of replacing periodic snapshot
% systems typically used by multi-user system administrators with our
% continuously versioned network file system.  Something like this
% should go somewhere.
%
% By providing users with access to the state of their files at any
% point in time instead of a limited few, recent, policy-based points
% in time, users can use the system without worrying about recent
% changes or having to resort to version management systems for their
% personal files.
%
% or this
%
% Snapshot systems typically provide users with access to a limited
% few, recent, policy-based points in the state of their files.  While
% this  provides protection against mid-term mistakes, it does not
% capture short-term mistakes that occur between snapshots, nor does
% it capture long-term mistakes that are not caught until the mistake
% has already passed out of the snapshot range.

Users frequently wish to undo changes they have made, whether
intentionally or inadvertently, to the file system --- for example,
inadvertent deletions, restoring files corrupted by application bugs,
or simply reverting to an earlier revision of a document. Regular file
systems do not support this operation natively, which results in a
number of tools that address parts of this problem in different ways.
Some operating systems provide a ``trash can'' interface to help users
avoid mistaken deletions of files.  However, this is only a partial
solution because it only addresses the problem of deletions, and even
then only until the trash can is emptied and the files are permanently
removed. The proliferation of ``undelete'' tools for administrators
suggests that this solution is inadequate. \PersiFS\ makes these tools
unnecessary, since files are never truly deleted from the file system.
Fears of accidental overwrite, rename or deletion are unnecessary.

Versioning is a natural and desirable aspect of file systems. Often,
critical files are versioned through the use of version control
systems. However, these must be explicitly configured, maintained, and
interacted with. Providing such support at the file system level gives
the ability to easily track changes to \emph{any} file over time. To
eliminate the need for user interaction and the possibility of user
error, systems exist which automatically create periodic snapshots
(either of just critical system files, or of entire file systems).
These, however, suffer from problems of time aliasing, failing to
capture multiple changes made between snapshots and possibly creating
inconsistent snapshots at inopportune times.  By capturing every
revision, \PersiFS\ cannot suffer from time aliasing.

Because \PersiFS\ is a network-accessible file system, it is ideal for
multi-user systems.  Such systems typically use some form of
snapshotting to provide all users of the system with security for
their files and to avoid administrative nightmares with recovering
from user errors.  By using a continuously versioned file system like
\PersiFS, system administrators can can provide users with an easy way
to recover from a much wider range of problems.

\subsection{Features}
\label{sec:intro:features}

% Key features
%% Continuous versioning
%% File system interface
%% High performance for a versioned file system
%% Actually stored on disk
%%% Using a regular file system to actually store our data
%%%% For reliability and ease of implementation
%%%% There's no direct correlation between the exported file
%%%% system and the underlying file system
%%%% And it wouldn't be difficult to translate over into a real disk
%%% (This detail should probably go in the internal representation
%%% section instead, though the fact that we store the FS on disk
%%% should be mentioned in the intro)
%% Network accessible
% Key contributions
%% File system interface capable of exposing all past states
%% Unmodified user applications and no special libraries

% Backups -> Snapshotting -> Continuous versioning

\PersiFS's main contribution is a novel file system interface that
allows users and applications to access the file system and its past
states without any modifications to applications or the need for any
special libraries. As a networked file system, \PersiFS\ allows remote
access to both current and archived files for multiple users.
\PersiFS\ is a \emph{continuously versioned} file system, so it is
capable of exposing \emph{all} past states, unlike \emph{snapshotting}
file systems that only save states at periodic intervals. Users can
access the entire file system tree as it existed at a particular time
simply by accessing the automatically mounted directory
\texttt{/persifs/servername@timestamp}.

\PersiFS\ provides durable storage by storing all data on disk on a
central server. For reliability and ease of implementation, the
\PersiFS\ data structures are stored on a normal file system, though
with minor modifications it could operate on a physical disk as
well. Each client system accesses the server via an NFS automount in
the \texttt{/persifs} directory hierarchy.
  
\section{Related Work}
\label{sec:related-work}

Many other systems provide some form of persistent versioned storage.
\PersiFS\ differs from these systems in that it provides access to all
previous versions of the file system's state via a standard file system
interface.

\subsection{Version Control Systems}
\label{sec:related-work:vc}

Version control systems such as CVS~\cite{cvs} are the standard
mechanism for tracking revision histories in large projects. While
\PersiFS\ provides similar revision history operations, it does not
provide the higher-level semantics for synchronizing the work of
multiple users, such as file locking as in RCS~\cite{tichy85rcs} or
merging as in CVS. \PersiFS\ provides only revision history support;
it does not provide the higher-level functionality because the
appropriate semantics are dependent on the type of file and its mode
of use (e.g. text vs. binary files), and so should not be applied at
the file system level.
% XXX Mention that CVS-filesystem 6.824 project here if I can ever
% find it?

CVS organizes revision histories by assigning a version number to each
revision of a particular file. This is a natural interface for the
history of a particular file, but does not generalize well to tracking
the history of the entire file system. Many of the weaknesses of CVS as
a version control system are due to this interface: it does not
capture the notion of changes that occur simultaneously to multiple
files (changesets), and it does not effectively handle changes to the
directory structure, such as moving or renaming a file. The latter is
a major problem for a file system, since the directory structure can
be highly dynamic.

Subversion~\cite{subversion} addresses many of the shortcomings of CVS
by using a single global revision number that identifies a particular
state of the entire repository. This is similar to \PersiFS 's internal
representation; however, the user interface uses date/timestamps
instead because global revision numbers do not generally correspond to
a useful identifier from the user's perspective.

\subsection{Snapshots vs. Continuous Versioning}
\label{sec:related-work:snapshots}
Previous states of the file system are commonly stored by a backup
system that periodically archives the state of the filesystem.
\emph{Snapshot}-based version-controlled file systems are the natural
extension of this idea: they periodically take a snapshot of the state
of the filesystem, and make it readily available. The Plan 9 file
server~\cite{pike95plan}, for example, creates daily snapshots of the
filesystem and stores them on a write-once mass-storage system such as
a WORM jukebox or the Venti archival server~\cite{quinlan02venti}.
WAFL~\cite{hitz94file} provides similar snapshotting functionality
using copy-on-write disk blocks. AFS provides access to the most
recent snapshot through the \texttt{OldFiles} mechanism.
% XXX Find and add AFS cite

While a great improvement over a standard ephemeral file system,
snapshotting filesystems have the obvious disadvantage that they
cannot track changes that occur between revisions. \PersiFS\ is a
\emph{continuously} version-controlled file system: each change to the
file system is stored as a new revision.
Elephant~\cite{santry99deciding} and CVFS~\cite{soules03metadata} also
use this technique. However, \PersiFS\ provides a much more convenient
interface for users.

\subsection{Interfaces}
\label{sec:related-work:interfaces}

Some versioned file systems require special tools to access previous
versions of the file system. For example, CVFS is designed for
performing post-intrusion forensic analysis, so it does not provide an
interface for users to easily recover old versions of their
files. A convenient way to provide access to old versions is via a
filesystem interface. Plan 9 creates a directory hierarchy of
snapshots; Pike et al.~\cite{pike95plan} report that providing access
to snapshots via this interface is very convenient for users.

Ideally, it should be possible to interact with the versioned file
system exclusively through the file system interface, such that
unmodified standard UNIX utilities (\texttt{ls}, \texttt{cp}, etc.)
suffice for revision control. This is not possible in Elephant, which
adds new system calls. VersionFS~\cite{muniswamyreddy03-versionfs}
allows unmodified utilities to be used, but only via a library wrapper
and the \texttt{LD\_PRELOAD} mechanism. \PersiFS\ uses a purely
file-system-based interface, using an automounter based on that of
SFS~\cite{sfs:mazieres-phd}, allowing standard utilities to be used
without any modification.
% Need to check Elephant claim. Express more disgust at VersionFS?


\section{Design}
\label{sec:design}

%* An NFS server provides the external network interface via the SFS
%  automounter.
%* Internal use of directory and file abstractions that provide
%  essentially normal file and directory interfaces to the NFS
%  interface but are backed by the server's internal representation
%  instead of the OS's file system.
%* Directories are represented as regular files with special flags and,
%  as such, inherit all of the features of the file abstraction by
%  building entirely on top of it.


\subsection{User Interface}
\label{sec:design:ui}

\PersiFS\ provides access to both current and previous versions of the
file system state via a standard file system interface. Volumes are
accessed across the network using an automounter daemon in
essentially the same way as SFS~\cite{sfs:mazieres-phd}: a
\texttt{/persifs} directory is created, and the current version of the
file system on server \texttt{foo} is accessed via
\texttt{/persifs/foo}.

The automounter also provides access to previous versions of the file
system, by combining the server name with a timestamp, such as
\texttt{/persifs/foo@20050413120000}. This gives a read-only snapshot of
the file system on \texttt{foo} as it appeared at noon on April
13\textsuperscript{th}, 2005.

The user interface may also include a number of small tools that
improve the usability of \PersiFS. For example,
\begin{itemize}
\item \texttt{persifs pwd} \textit{datespec} \\
  Print the path to the current directory at the time specified by
  \emph{datespec}. Datespec should allow a wide range of time stamp
  formats, including relative ones as supported by CVS.
\item \texttt{persifs log} \emph{file}\\
  Display information about when \emph{file} was modified. This will
  require a special interface to the server to operate efficiently.
\end{itemize}

The administrator of a \PersiFS\ server should be able to set
retention policy, deleting data when it becomes older than a
pre-specified time, or merging together revisions that occur within a
certain time interval. This may be relegated to the domain of future
work depending on time and sanity constraints.

% File system interface
%% Harnessing the automounter
%%% Otherwise, this would either require an infinite number of
%%% directories to be explicitly exported, or there would have to be
%%% some mechanism for explicitly creating references to old versions
%% Read-only version at /persifs/foo@timestamp
%%% Dynamically mounted
%% Read-write current snapshot /persifs/foo
%% Fundamentally, path contains something to identify it as a
%% persifs-managed path (/persifs/), the address of the server (foo),
%% and optionally a timestamp (@timestamp)

% User tools


\subsection{Client-Server Interface}
\label{sec:design:client-server}
% The Client-Server Interface is going to be compatible with NFS,
% minimizing the need to change the behavior of the OS or the
% applications. The server is in charge of containing the PersiFS
% structure, taking care of the versioning.

% The client in the individual systems are going to run in the user
% level, interfacing the applications with the file system, like SFS,
% exposing PersiFS under \verb[/persifs/[. It is going to take care of
% the different versions, requesting the server to mount each one
% whenever it is necessary.

All data in \PersiFS\ is stored on a server, which contains the file
system structure and manages the versioning. Clients interact with the
server using a RPC-based protocol similar in spirit but somewhat
simpler than that of NFS~\cite{sandberg85design}, with support for versioned
operations.

Each client runs a user-level \PersiFS client that exposes \PersiFS\
under the \texttt{/persifs} directory. It is implemented using the SFS
user-level file system toolkit~\cite{sfs-toolkit}, minimizing the need
to change the behavior of the OS or the applications. The client acts
as an automounter, translating timestamps into revision numbers and
mounting directories under \texttt{/persifs} as necessary. It then
forwards requests to the appropriate \PersiFS\ server.

% Network file system interface
%% Server understands the file system structure and versioning
%% The client is responsible for exposing the /persifs/ interface on
%% individual computers and communicating with the appropriate
%% servers obtain and update file content
%% Interface to the server is on the same level as the NFS protocol,
%% though, of course, differs from the NFS protocol in its relative
%% simplicity and expression of versioned operations
%% (ie, the intelligence is mostly in the server, unlike block
%% server-based systems like Frangipani [ref] or Sun's NBD [ref?])


\subsection{Internal Representation}

\newcommand{\tuple}[1]{$\langle$#1$\rangle$}

The file system's internal representation divides into two main
components: the \emph{superblob} and the \emph{inode log}.  To improve
system performance, these are further augmented by the \emph{blob
  index} and the \emph{inode map}.  All of these structures except the
inode map have efficient on-disk representations and the inode map is
treated exclusively as soft state kept in memory.

Like traditional Unix file systems, \PersiFS\ uses inodes identified
with unique inumbers to store file and directory metadata.  Unlike
most file systems, however, inumbers are never recycled.  While not
strictly necessary, this eases inode generation and makes sense in a
system in which inodes are never ultimately removed.  Furthermore,
this eases exportation over NFS by eliminating the need for generation
numbers in order to uniquify file handles.  Also unlike most file
systems, \PersiFS\ inumbers have no correlation to disk addresses.

Files in \PersiFS\ consist of an inode and a sequence of
\emph{chunks}, which, concatenated, form the contents of the file.  As
in most Unix file systems, directories are simply files with a special
directory flag and a particular, internal format, and, as such, are
stored in the same manner as files.

The superblob is an append-only vector of all file content chunks.
Whenever a file is written to, the modified chunks are appended to the
superblob and the file's inode is updated to point to these chunks.
In order to optimize the space used by the superblob, we apply two
additional tricks: \emph{chunk fusion} and \emph{content-sensitive
  chunking}.  Chunk fusion allows us to coalesce multiple chunks with
the same content into one chunk, thus avoiding redundant storage
costs.  This is combined with content-sensitive chunking, which places
chunk boundaries based on the contents of files such that
modifications, including insertions and deletions of content, only
affect chunks local to those modifications.

Every chunk has a fingerprint, which is simply the 160-bit SHA1 of its
contents.  The blob index maps chunk fingerprints to the locations of
those chunks in the superblob.  Whenever a chunk is added to the
superblob, the blob index is first checked for that chunk's
fingerprint.  If the fingerprint is found, the chunk being inserted
can be fused with the existing chunk in the superblob, requiring no
additional space.  Otherwise, it is appended to the superblob and
indexed.  This is similar to the approach employed by
Venti~\cite{quinlan02venti} for archival block storage.  While the
blob index could be recomputed by reading the entire superblob, this
would lead to prohibitively long recovery, so the blob index is
maintained on disk.

The effectiveness of chunk fusion is further improved by
content-sensitive chunking of file contents.  This technique places
chunk boundaries based on file contents, instead of at regular
intervals, such that local modifications to file contents, including
insertions and deletions, only affect chunks in that region of the
file.  We use the same Rabin fingerprint-based algorithm as
LBFS~\cite{muthitacharoen01lowbandwidth} uses for content-sensitive
chunking.  Combined with content fusion, this allows us to efficiently
store local modifications to files, sharing most file contents between
versions of that file.  Furthermore, identical content in multiple
files (for instance, due to file copying) will consume very little
space beyond the single copy.

While the superblob stores the contents of the file system, the
structure of the file system is maintained by the inode log.  Every
time an inode is modified or created, the inode is stored in the
superblob and an update entry for the inode is appended to the inode
log.  Specifically, the inode log contains entries of the form
\tuple{timestamp, inumber, blob address of inode}.  When an inode is
no longer needed, a deletion entry for that inode is appended to the
log.
% XXX Oops.  We do need garbage collection of inodes, or the inode map
% is going to grow monotonically and be full of unreachable inodes.
% Our module specifications need to be fixed up for this.

The mapping of inumbers to blob addresses at a particular point in
time is an inode map.  There exists an inode map for every point in
the file system's history.  The inode map for the current state of the
file system is always kept in memory and inode maps for past versions
may be cached.  Because the inode log is append-only, the inode map
can be reconstructed for any point in the file system's history by
simply replaying the log up to the appropriate timestamp; however,
since the time required to do this may be prohibitive, we periodically
store a snapshot of the entire inode map in the log.  Thus, in order
to reconstruct the inode map for any version of the file system, the
last snapshot preceding that point in time can be loaded and the log
replayed up to the desired point.


\section{Implementation}
\label{sec:implementation}

The main aspects of the implementation of \PersiFS\ are described in
this section. A logical subdivision of the modules involved in
\PersiFS\ is described in Figure~\ref{fig:modules}.
%% The ability
%% to separate into several independent modules lets us subdivide the
%% original problem and detect isolated implementation issues.
In particular, the lowest levels of the implementation are modules
corresponding to the information structures represented on disk, the
\emph{inode log}, \emph{superblob}, and \emph{blob index}. Layered
atop this are \emph{file} and \emph{directory} modules, which
provide an abstraction based on the standard file system
concepts. Finally, interaction with users is handled by a NFS
interface, including the automounter. 

\begin{figure}[tbhp]
\centering
\includegraphics{modules.eps}
\caption{Implementation model}
\label{fig:modules}
\end{figure}


\subsection{Representation Management}
\label{sec:implementation:representation}

Each of the three information structures is stored on disk, and has a
corresponding module that provides an interface to the higher levels
of the \PersiFS\ implementation. For ease of implementation, our
initial implementation stores each structure as a separate file on a
standard file system, though it would be reasonably straightforward to
adapt it for direct storage on a physical disk.

The Superblob module implements content-addressable storage, providing
a standard \texttt{get}/\texttt{put-block} interface that uses the
hash fingerprint of the block to identify it. It makes use of the Blob
Index module, which uses an on-disk B$^+$-tree to map block
fingerprints to indexes into the superblob structure on disk.

The Inode Log module stores variable-length inodes in a time-indexed
log. Each inode contains the inode number, the file metadata, and the
list of chunks that make up the file content (which is the marshalled
form of the chunkable string containing the file content, as described
in Section~\ref{sec:implementation:fs}). The inode log interface
allows new or modified inodes to be stored in the log, and supports
scanning the log to obtain the inode contents at any point in the past
as well as its current state. It also provides the interface for
generating new unique inode numbers.


\subsection{File System Abstraction}
\label{sec:implementation:fs}

To simplify the implementation, a file system abstraction consisting
of \emph{file} and \emph{directory} representations is built atop the
persistent data structures. Using these abstractions also makes it
feasible to experiment with different on-disk representations of the
file system data structures.

The File module bundles together a file's inode metadata and its
content, stored as chunks chunks.  The File module API defines
meaningful file operations, such as \texttt{create}, \texttt{read},
\texttt{write} and \texttt{getAttr}.

The contents of a file are represented by a mutable \emph{chunkable
  string} backed by the superblob and aware of the LBFS-style
content-sensitive chunking. The Chunkable module provides a
substring-read and substring-write interface. As file contents are
large, the full contents of the string are not read from the superblob
until necessary. Multiple batched changes to the chunkable string can
be made without being committed to disk; when a \texttt{flush}
operation is performed, the chunk boundaries are recalculated and new
chunks are added to the superblob as necessary. The chunkable string
can be converted to a marshalled representation suitable for storing
in inodes that lists all of the chunks contained in a file and their
offsets in the superblob.
  
The Directory module provides the standard directory operations of
accessing or modifying the list of files in the directory.  Each
directory is stored as a file whose content happens to be directory
entries, distinguished from other files through special flags.

Ideally, multiple related changes to the file would be aggregated and
committed as a whole only when the file is closed, in order to avoid
inconsistent states. Often one write from the user is in
fact broken into a series of writes, and forcing the filesystem to
create versions for each 'sub-write' would result in unnecessary and
logically incorrect versions. Since NFS v3 does
not provide file closure notification to the server, the File module
attempts to approximate these semantics by grouping together
successive writes (in a span of about a second) to a single file.

%% The file abstractions employed by \PersiFS\ provide run-of-the-mill
%% NFS interfaces, yet are backed by the server's internal representation
%% of the file system.  This has allowed us to develop \PersiFS\
%% independent of the file system representation of any particular
%% operating system.

%Am I correctly interpreting what this description from the email
%means? 
%Internal use of directory and file abstractions that provide
%essentially normal file and directory interfaces to the NFS
%interface but are backed by the server's internal representation
%instead of the OS's file system.


\subsection{Network Interface}
\label{sec:implementation:network}
An NFS server provides the external network interface. Clients access
files on the \PersiFS\ server using standard NFS operations by
mounting a volume whose name contains an encoding of the desired
revision information. NFS requests are handled by a server similar in
structure to that of CCFS \cite{ccfs}, which translates them to
operations on the versioned file and directory abstractions above.

Revisions are accessed via the modified SFS automounter on the client
side. Whenever an archived version of a file is referenced, the
archived directory is automatically mounted and the file read using
typical NFS protocols.  Once the automounter has executed, the
archived file system behaves exactly like an ordinary file system.
% I don't know much about automounting, someone else ought to address
% the implementation.
% Don't worry, nobody else understands it either!


\bibliographystyle{abbrv}
\bibliography{report}

\end{document}

% LocalWords:  superblob inode inodes timestamp inumber inumbers chunking