\ifdraft
\svnInfo $Id$
\fi

\section{Introduction}
\label{sec:introduction}

Peer-to-peer file sharing systems, in which users locate and obtain
files shared by other users rather than downloading from a central
server, have become immensely popular. Indeed, in recent years,
peer-to-peer file sharing traffic has amounted to over 60\% of
aggregate internet traffic~\cite{cachelogic05:_peer_to_peer_in}. Such
systems are useful, as well as interesting to researchers, because
they operate at extremely large scale, without a central
infrastructure, and can tolerate many kinds of failures.

Users of peer-to-peer file sharing systems use a client that upon
startup connects to other peers in a network, and becomes a part of
their distributed system and begins sharing files that the user has
made available. Users perform \emph{searches} for files, usually
specifying part of a file name or other criteria, and the peers in the
system work to cooperatively identify matching files that other users
are sharing. The user can then request to download the file, and the
client will retrieve it, often obtaining pieces from multiple users
who are sharing the same file. We refer to these two tasks as the
problems of \emph{searching} and \emph{content distribution}.

Performing searching in a peer-to-peer network is a challenging
problem. Many current file sharing networks are implemented using
techniques such as centralized indexes, which lack the scalability and
resilience of a distributed system, or query flooding, which is
notoriously inefficient. Often, they trade-off scalability for
correctness, resulting in either systems that scale well but sacrifice
completeness of search results or vice-versa. In this thesis, we
present \emph{Arpeggio}, a novel file sharing system based on the
Chord distributed hash table~\cite{stoica03:_chord}. Arpeggio includes
a system for building a distributed index of file metadata that is
fully decentralized but able to efficiently answer search queries with
near-perfect recall. In addition, Arpeggio also includes a separate
content distribution subsystem for identifying peers that have a
particular file available.


\section{Goals}
\label{sec:overview:goals}

The principal problem that Arpeggio addresses is that of searching for
files by metadata. For each file, a set of \emph{metadata} is
associated with it. The metadata is represented as a set of key-value
pairs containing tagged descriptive information about the file, such
the file name, its size and format, etc. Ideally, the file would also
have more descriptive metadata categorizing its content: an academic
paper could be tagged with its title, authors, venue, publisher, etc;
a song could be tagged with its artist, album name, song title, genre,
length, etc. For some types of data, such as text documents, metadata
can be extracted manually or algorithmically. Some types of files have
metadata built-in; for example, MP3 music files have ID3 tags listing
the song title, artist, etc. We refer to the file's metadata as a
\emph{metadata block}. An example is shown in
Figure~\ref{fig:overview:goals:example-metadata}.

\begin{figure}[tbp]
  \centering
  \large
    \begin{tabular}{|rl|}
    \hline
    \multicolumn{2}{|l|}{(debian, disk1, iso) \hfill } \\
    \hline
    name: & Debian Disk1.iso \\
    file hash: & cdb79ca3db1f39b1940ed5... % 8aa98da2e7
    \\
    size: & 586MB \\
    type: & application/x-iso9660-image \\
    \multicolumn{1}{|c}{$\vdots$} & \\
    \hline
  \end{tabular}
  \caption{Example file metadata}
  \label{fig:overview:goals:example-metadata}
\end{figure}

Upon joining the system, peers register the files that they have
available, sending ``\proc{Insert}'' requests with their files'
metadata blocks. They also perform queries, which specify a list of
keywords; the system should be able to respond to a \proc{Query}
request with a list of file metadata blocks that contain all of the
requested keywords.

\paragraph{}

Several of the key design goals of Arpeggio are listed below:

\paragraph{Full decentralization.}

The system must be fully decentralized; it must be able to withstand
the failure of any individual node.

\paragraph{Perfect recall.}

If a file is being shared in the system, it should always be returned
in response to a search request for its metadata. Many systems, such
as Gnutella, introduce scalability optimizations that sacrifice this
property. They focus primarily on finding results for popular files,
for which many copies exist, and may fail to locate rare files even if
a small number of peers has them available. 

\paragraph{Scalability and load balancing.}

Because peer-to-peer systems are designed at a large scale --- the
largest networks have millions of users --- it is necessary for the
algorithms used to handle these numbers of users. In particular,
because the data involved (the lists of files available in the
network) are large, they must be stored efficiently. However, some of
our techniques trade off scalability for improved load balancing
properties. For example, we are willing to suffer an increase in
overall index size in order to ensure that the load imbalance is
decreased, so that there does not exist a small number of peers doing
most of the work and likely becoming overloaded as a result.

\subsection{Non-goals}
\label{sec:overview:goals:non-goals}

There are several related problems that we explicitly do not address
because they are outside the scope of Arpeggio; they are
separate problem that we are not attempting to solve, or have been
solved elsewhere and can be integrated with Arpeggio. Here, we list a
few of these and provide the rationale.  

\paragraph{Full-text document search.}

Arpeggio's search techniques are intended for indexing the metadata of
files, and rely on the assumption that the searchable metadata content
of each item is small. As a result, they do not scale to support
full-text indexing of documents; Arpeggio would not be practical for
indexing the web, for example.

\paragraph{File transfer.}

File sharing clients often include elaborate mechanisms for
downloading files as quickly as possible, selecting multiple sources
for files, and deciding which peers to download from and upload to. We
only address file transfer to the extent of identifying peers that
share a particular file; a system such as
BitTorrent~\cite{cohen03:_incen_build_robus_in_bittor} can be used to
actually download the file.

\paragraph{Heterogeneity.}

Nodes in peer-to-peer networks vary greatly in their bandwidth,
storage capacity, and stability. Many peer-to-peer networks exploit
this, allowing the more capable nodes (such as those on high-speed
connections) to bear more load proportional to their resources, and
allowing nodes with low resources (such as those on modem links) to
act only as clients, not participating in the operation of the
network. We do not discuss this in our design, but do assume that some
mechanism is in place. Within a distributed hash table, this can be
achieved by using the standard technique of having nodes operate
multiple virtual nodes, with the number proportional to their
capacity~\cite{castro05:_debun_some_myths_about_struc}.

\paragraph{Bootstrapping.}

In order to join a peer-to-peer network, a new node must be able to
contact one of the existing nodes in the network. Discovering such a
node presents a bootstrapping problem if it is to be done without
centralized infrastructure. In Gnutella, for example, this is achieved
with a system of many web caches that track which nodes are
available~\cite{gwebcache}. We assume that a new user is somehow
provided with the address of a well-known node from which they can
bootstrap their entry into the network.

\paragraph{Security.}

There are many security issues inherent in a peer-to-peer
system~\cite{sit02:_secur_concer_for_peer_to}. In general, it is
difficult to make any guarantees about security in a network that
depends on cooperation from many untrusted nodes, particularly when
one attacker may operate many nodes in the
network~\cite{douceur02:_sybil_attac}. We do not address security
concerns in this thesis.


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "thesis"
%%% TeX-command-default: "rubber"
%%% End: 

% LocalWords:  metadata startup versa