\documentclass[11pt]{article}
%\usepackage{fullpage}
\usepackage{clrscode}
\usepackage{url}
\usepackage{setspace}
%\twocolumn
\onehalfspacing

\newcommand{\bigO}[1]{\ensuremath{O\left(#1\right)}}
\newcommand{\bigOt}[1]{\ensuremath{\tilde{O}\left(#1\right)}}
\newcommand{\littleO}[1]{\ensuremath{o\left(#1\right)}}
\newcommand{\bigTheta}[1]{\ensuremath{\Theta\left(#1\right)}}
\newcommand{\bigOmega}[1]{\ensuremath{\Omega\left(#1\right)}}
\newcommand{\littleOmega}[1]{\ensuremath{\omega\left(#1\right)}}


\begin{document}
\title{Efficient Metadata Searching and Content Distribution in
  Structured Peer-to-Peer Networks\\\-\\
\large{M.Eng. Thesis Proposal}}
\author{Dan R. K. Ports\\\texttt{drkp@mit.edu}}
\maketitle

%\onehalfspacing
\section{Background and Related Work}
\label{sec:background}

The peer-to-peer model has is a relatively new architecture for
internet services such as file-sharing networks. In recent years, the
popularity of peer-to-peer file-sharing networks has increased to the
point that they make up the majority of internet traffic. Rather than
store data on a centralized server or group of servers, data is stored
on \emph{every} node in the network, and nodes communicate directly
with each other to request data. The advantages are many. Such
networks operate more efficiently than the traditional client-server
model by utilizing peers' upload bandwidth. Traditionally, a content
publisher's server must handle all requests for their
content, which can easily become a resource-expensive proposition if
the content size is large or it is very popular. Peer-to-peer networks
encourage downloaders to simultaneously upload, providing the content
they already have to those who are requesting it. This means that
the full load of the download requests is no longer directed at the
content publisher, but shared among all the requestors, which has the
desirable property that the capacity scales with the number of
requesting clients. Moreover, peer-to-peer networks can be implemented
entirely without a central server. This decentralization eliminates a
potential performance bottleneck and single point of failure, and can
make it easier for end-users to publish their own content.

The challenge, however, is structuring the network in order to find
the desired data. Peer-to-peer applications need to perform many types
of searches. File sharing applications, for example, need to be able
to perform searches for files that match a criteria, such as title
keywords or file format. Moreover, once a particular file has been
selected by the user, in order to actually obtain it, the network need
a mechanism for identifying the source peers that are sharing it. A
distributed instant messaging client would need to be able to map user
identifiers to the associated clients, and perhaps also to search for
users.

A number of different approaches have been used to implement lookup
and search operations. The earliest peer-to-peer networks, such as
Napster~\cite{napster}, employed a centralized index server that
tracked all the nodes in the network. This approach makes many types
of queries possible, but the dependence on the central server limits
its scalability and provides a central point of failure. The
diametrically opposite approach is to construct an unstructured
overlay network, in which nodes store only local information and
perform queries by flooding them to their neighbors, as done by
Gnutella~\cite{gnutella}, or by performing random walks over the
network~\cite{gia}. However, this scheme leads to poor scalability,
inefficient use of bandwidth, and incomplete query results.

Structured peer-to-peer overlay networks based on distributed hash
tables \cite{chord,kademlia,pastry,tapestry,can,koorde,accordion} have
become a standard for addressing the routing and storage problems
inherent in peer-to-peer systems. The underlying layer of these
systems is a \emph{distributed lookup service} which provides a
\proc{Lookup} operation that efficiently identifies the node
responsible for a certain piece of data. Thus, it handles the task of
maintaining appropriate connections between the nodes in order to
enable routing, ideally with minimal state at each node, and
resilience to the churn imposed by nodes constantly joining and
leaving the network. The distributed lookup primitive we will focus on
in this thesis, Chord~\cite{chord}, provides a \proc{Lookup} operation
for a network of $n$ nodes in which each node is connected to
$\bigO{\log n}$ of its neighbors, and queries can be performed using
$\bigO{\log n}$ communications. Building on this primitive,
DHash~\cite{dhash} and other \emph{distributed hash tables} (DHTs)
provide a standard \proc{Get-Block}/\proc{Put-Block} hash table
abstraction; this makes it possible to store data in the network in a
distributed fashion.

Distributed hash tables are a natural fit for developing peer-to-peer
systems because they are efficient and decentralized. However, it is
not always clear how to use them to implement certain types of
functionality that are often necessary in peer-to-peer systems. One
major problem is that it is not clear how to efficiently implement
keyword searches, in which the exact identifier of the data being
requested is not known: the \emph{lookup by name} DHT semantics are
not immediately sufficient to perform the necessary complex
\emph{search by content} queries of the data stored in the
network. Addressing this challenge is one of the central goals of this
thesis.

In particular, the contribution of this thesis will be to develop and
implement a file-sharing system called \emph{Arpeggio}, described in
detail below. In doing so, we will construct mechanisms based on Chord
for performing keyword searches of file metadata, and tracking and
distributing content within a network with high levels of churn. These
techniques should be generalizable to related applications.

\section{Arpeggio: the design}
\label{sec:arpeggio}

\emph{Arpeggio} \cite{arpeggio} is a system we have designed to
support sharing files and distributing content efficiently. Recall
that a content-sharing system must support two major operations: to
translate user's requests into a list of files based on some metadata
criterion, and to translate a file identifier into a list of peers
from which the file can be obtained.

The Arpeggio content-sharing system is based on the Chord lookup
primitive.  It includes two subsystems concerned with searching and
with transferring content.  By building and querying distributed
keyword-set indexes, Arpeggio extends the standard DHT interface to
support not only lookup by key but complex search queries.
Keyword-set indexing and extensive network-side processing in the form
of index-side filtering, index gateways, and expiration are used to
address the scalability problems inherent in distributed document
indexing. To identify sources for a file, Arpeggio uses a
content-distribution system based on indirect storage via subrings
that employs segmentation to leverage file similarity, and thereby
optimize availability and transfer speed.


\subsection{Searching}
\label{sec:arpeggio:searching}

In general, the problem of performing keyword searches in a
distributed fashion is quite difficult. Analysis based on required
communications costs suggests that peer-to-peer keyword indexing of
the Web is infeasible because of the size of the data set
\cite{feasibility}. however, we consider systems, such as file
sharing, that search only over a relatively small amount of
\emph{metadata} associated with each file. Peer-to-peer indexing for
metadata remains feasible, because the size of metadata is expected to
be only a few keywords, much smaller than the full text of an average
Web page.

Arpeggio addresses the problem of searching for files by using an
indexing scheme based on the Keyword-Set Search System (KSS)
introduced by Gnawali \cite{kss}. This is a variant of the standard
distributed inverted indexing technique, which simply constructs a
list of files containing each keyword, and partitions this index by
keyword, making each node responsible for the lists of files
containing a certain subset of the keywords in the systems. This idea
is the basis of the indexing scheme used by several other peer-to-peer
file-sharing networks, such as Overnet \cite{overnet}, which uses the
Kademlia DHT \cite{kademlia}.

Arpeggio refines this idea by
introducing \emph{index-side filtering}, storing the full set of
metadata (which is relatively small) for each file in the index, and
thereby enabling index nodes to reduce their bandwidth usage by only
returning results relevant to the full query. This is a simple idea,
but noteworthy because it breaks the DHT's \proc{Get}/\proc{Put} hash
table abstraction, transforming the network from a simple storage
mechanism to something capable of network-side processing, and
demonstrating the utility of direct use of the underlying
\proc{Lookup} primitive.

Arpeggio's use of KSS differs from other other systems in that it
constructs inverted indexes not only on keywords but also on keyword
\emph{sets}. Essentially, this scheme allows us to precompute the
full-index answer to all queries of up to a certain size. This creates
more indexes, each of which is shorter, better distributing the load
around the network. Searches for multiple-keyword queries will contact
smaller and more distributed indexes whenever possible, thereby
alleviating unfair query load caused by queries of more than one
keyword. Since the majority of searches contain multiple keywords
\cite{keyword-searching}, large indexes are no longer critical to
result quality as most queries will be handled by smaller, more
specific indexes.

As an optimization, we also introduce \emph{index gateways}, which
aggregates index insertion in order to alleviate the problem in which
a new node joining the network suffers a tremendous load to register
the files it has available because it must contact the index nodes
responsible for every subset of the keywords of each of its files.
With gateways, rather than directly inserting a file's metadata into
each of these indexes, each peer sends a single copy of the metadata
for each file to the single appropriate metadata gateway.  The gateway
then inserts the metadata block into all of the appropriate indexes,
but \emph{only} if necessary.  If the block already exists in the
network and is not scheduled to expire soon, then there is no need to
re-insert it into the network. A gateway only needs to refresh
metadata blocks when the blocks in the network are due to expire soon,
but the copy of the block held by the gateway has been more recently
refreshed. Gateways dramatically decrease the total cost for multiple
nodes to insert the same file into the index, a significant
enhancement in the common case of a file being shared by many nodes.


\subsection{Content Distribution}
\label{sec:arpeggio:content-distribution}


The indexing system we describe above simply provides the ability to
search for files that match certain criteria. It would of course not
have much use if there were not also a means for obtaining the files
found in the search results, but the indexing system is independent of
the file transfer mechanism. Thus, it is possible to use an existing
content distribution network in conjunction with \emph{Arpeggio}. A
simple implementation might simply store a HTTP URL for the file in
the metadata blocks, or a pointer into a content distribution network
such as Coral~\cite{coral}.  A DHT can be used for direct storage of
file contents, as in distributed storage systems like CFS~\cite{cfs};
with the right replication levels, this makes it possible to guarantee
the availability of all content even in spite of a fixed percentage of
node failures.  However, for a file sharing network, direct storage is
impractical because the amount of churn \cite{measurement} and the
content size create high maintenance costs --- it is impractical to
move large files from where they are stored to the multiple locations
that direct storage requires them to be kept, independent of where
they are already available.

Instead, Arpeggio uses \emph{indirect storage}: it keeps files in
their original locations on the network, on peers that already have
them, and constructs an index that maintains pointers to each peer
that contains a certain file.  Using these pointers, a peer can
identify other peers that are sharing content it wishes to obtain.
Because these pointers are small, they can easily be maintained by the
network, even under high churn, while the large file content remains
on its originating nodes.  This indirection retains the distributed
lookup abilities of direct storage, while still accommodating a highly
dynamic network topology, but may sacrifice content availability.

Generally a central tracker is used to identify sources for each file,
as in BitTorrent~\cite{bittorrent}, but this introduces a centralized
component to the network, which is undesirable. We instead use
\emph{subrings} to distribe the query load throughout the network. The
Diminished Chord protocol~\cite{subrings} allows any subset of the
nodes to form a named ``subring'' and allows \proc{Lookup} operations
that find nodes in that subring in $\bigO{\log n}$ time, with constant
storage overhead per node in the subring. We create a subring for each
file, where the subring is identified by the file ID and consists of
the nodes that are sharing that file. To obtain a file, a node
performs a \proc{Lookup} for a random Chord ID in the subring to
discover the address of one of the sources.  It then contacts that
node and requests the file. If the contacted node is unavailable or
overloaded, the requesting node may perform another \proc{Lookup} to
find a different source. When a node has finished downloading a file,
it becomes a source and can join the subring.  Content-sharing
subrings offer a general mechanism for managing data that may be
prohibitive to manage with regular DHTs.

Arpeggio also segments each file into a sequence of \emph{chunks} for
purposes of content distribution. This is implemented by storing in
the DHT a \emph{file block} for each file, which contains a list of
\emph{chunk IDs}, which can be used to locate the sources of that
chunk, using the procedure described above. The rationale for this
design is twofold. First, peers that do not have an entire file are
able to share the chunks they do have: a peer that is downloading part
of a file can at the same time upload other parts to different peers.
This makes efficient use of otherwise unused upload bandwidth.

File availability is enhanced by using a segmentation algorithm
inspired by LBFS \cite{lbfs} that allows content to be shared between
similar files. The algorithm places chunk boundaries based on content;
it is therefore robust with respect to insertions and deletions, which
would cause the remainder of the document to be out of frame and
negate the advantages of standard fixed-length chunking. In addition,
there is a novel technique we call \emph{postfetching} that increases
availability of rare but demanded files by caching file chunks on
nodes that would not otherwise be sharing the chunks. This is achieved
by tracking files for a period of time after they become unavailable,
and recording requests for unavailable files; when the file then
becomes available again by a new source joining that network, the
source can be directed to replicate its content in the caches of other
nodes.


\section{Thesis project plan}
\label{sec:plan}

This M.Eng.\ thesis work is the continuation of a UROP project began
during the summer of 2004. Current work has focused on creating the
design of the Arpeggio system. Our preliminary paper about the design
\cite{arpeggio} has been accepted for publication and is currently in
press. I recently presented the paper at the Fourth International
Workshop on Peer-to-Peer Systems in Ithaca, NY during February 2005.

The next phase of the work, which will be the contribution of this
thesis, will involve implementing and evaluating the system, as
described below. The insights obtained from implementation and
evaluation will be used to further refine the design and improve its
performance.

\subsection{Implementation}
\label{sec:implementation}

Much of the work in this thesis project will consist of producing an
implementation of Arpeggio.  The goal is to produce a working
file-sharing application suitable for public release and widespread
use. Because Arpeggio relies on a number of algorithms that have never
before been implemented, or have highly experimental implementations,
this will involve extensive implementation work.

The file sharing application in question will be similar in spirit to
popular programs such as Gnutella or BitTorrent, and will have a
similar interface designed for use by the public (i.e.\ not just the
research community). It is anticipated that this will have many
practical uses, for example in the distribution of large software or
media files. Ideally, it should be implementable on the major
operating systems, or at least designed in such a way as to be easily
portable.


\paragraph{Outline}
The following major tasks will be involved in the implementation of
Arpeggio, listed in approximately (though not necessarily strictly)
the order they will be performed.

\begin{enumerate}
\item Infrastructure work related to implementation of the Chord
  algorithm, and distributed hash table services. These components
  have already been implemented (see \cite{chord}), but the current
  implementation relies on libraries that are both difficult to work
  with for developers and difficult to install for users (and will
  likely not run under Windows), so some work will be required here.
  Also, the current Chord implementation does not expose precisely the
  interface Arpeggio requires, but some work has been done on this.
  Ideally, the underlying layers should have some mechanism to make
  simulations of the Arpeggio algorithms possible as well as actual
  operation.
\item Implement index servers and distributed indexing, including
  index-side filtering and keyword-set indexing.
\item Initial implementations of the content-distribution system
  (before the content-sharing subrings system is developed) can
  use some simple mechanism for encoding the locations of files,
  perhaps using a HTTP URL, a centralized tracker, etc.
\item Implement some sort of user interface
\item Implement metadata gateways
\item Implement subrings via Diminished Chord, then implement
  content-sharing subrings
\item Implement postfetching
\end{enumerate}

\subsection{Evaluation}
\label{sec:evaluation}

Once the Arpeggio system is implemented, we will perform measurements
in order to evaluate its performance and investigate and refine the
effectiveness of various design components. A major goal will be to
determine whether KSS with index-side filtering works well for
indexing the data found in peer-to-peer file-sharing networks, and
exactly what sizes of metadata per file it can support without
suffering from extreme complexity. This will also involve
experimenting with different values of tunable parameters, for example
the maximum KSS keyword-subset size parameter $K$, to determine what
their optimal values are under various index and query workloads.

The implementation of Arpeggio is of particular interest because it
will be one of the first large-scale, ``real world'' applications of
Chord and of DHTs in general. Evaluating a deployed Arpeggio system
may be able to provide insights on the scalability of Chord with
regards to the resources available in most file-sharing networks --- a
large collection of nodes with heterogeneous bandwidth resources and
high degrees of churn --- and may suggest some refinements.

We would also like to compare Arpeggio against other peer-to-peer file
sharing systems, in particular its indexing effectiveness. We
anticipate that it will obtain better results than Gnutella-like
unstructured peer-to-peer overlays. A more interesting comparison
worthy of extensive research is against the Overnet~\cite{overnet}
network, which is based on the Kademlia~\cite{kademlia} DHT. This is
an existing deployed indexing system using a DHT, and has not been
widely studied; it is not even clear what precisely Overnet's indexing
approach is (though it appears to be a distributed index with some
sort of index-side filtering), and the performance of Kademlia DHT has
not been measured extensively, despite its having perhaps the largest
deployed system. This suggests it is a prime comparison with Arpeggio
and Chord.

\nocite{rooter}

\bibliographystyle{abbrv}
\bibliography{design-paper}
\end{document}