\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{clrscode}
\usepackage{url}
%\twocolumn

\begin{document}
\title{Arpeggio: Metadata Searching and Content Sharing with Chord\\\-\\
\large{6.UAP Project Proposal}}
\author{Dan R. K. Ports\\\texttt{drkp@mit.edu}}
\maketitle
\section{Background and Related Work}
\label{sec:background}

The peer-to-peer model has is a relatively new architecture for
internet services such as file-sharing networks. In recent years, the
popularity of peer-to-peer file-sharing networks has increased to the
point that they make up the majority of internet traffic. Rather than
store data on a centralized server or group of servers, data is stored
on \emph{every} node in the network, and nodes communicate directly
with each other to request data. The advantages are many: such
networks operate more efficiently than the traditional client-server
model by utilizing peers' upload bandwidth, and can be implemented
without a central server. The challenge, however, is structuring the
network in order to find the desired data.

Effectively sharing content requires a peer-to-peer network to support
two fundamental operations: searching for files that match a certain
criteria, and identifying source peers that are sharing a particular
file. A natural approach to supporting these operations is to maintain
a centralized index of the files in the network, as done by the
Napster \cite{napster} file sharing network. However, this incurs the
scalability and reliability issues inherent in using a centralized
server. Unstructured overlay networks such as Gnutella \cite{gnutella}
avoid centralization but sacrifice scalability and query performance
and recall.

Structured overlay networks based on distributed hash tables show
promise for simultaneously achieving the recall advantages of a
centralized index and the scalability and resiliency attributes of
decentralization. Distributed hash location services such as Chord
\cite{chord} provide an efficient \proc{Lookup} primitive that maps a
key to the node responsible for its value. Chord uses at most $O(\log
n)$ messages per lookup in an $n$-machine network, and minimal
overhead for routing table maintenance. Building on this primitive,
DHash~\cite{dhash} and other distributed hash tables provide a
standard \proc{Get-Block}/\proc{Put-Block} hash table abstraction.
This routing substrate provides most of the core functionality
required to implement a content-sharing system. However, it is not
immediately sufficient: for one, the \emph{lookup by name} DHT
semantics are not immediately sufficient to perform the necessary
complex \emph{search by content} queries of the data stored in the
network.


\section{Arpeggio: the design}
\label{sec:arpeggio}

\emph{Arpeggio} \cite{arpeggio} is a content-sharing system built upon
the Chord lookup primitive.  It includes two subsystems concerned with
searching and with transferring content.  By building and querying
distributed keyword-set indexes, Arpeggio extends the standard DHT
interface to support not only lookup by key but complex search
queries.  Keyword-set indexing and extensive network-side processing
in the form of index-side filtering, index gateways, and expiration
are used to address the scalability problems inherent in distributed
document indexing. To identify sources for a file, Arpeggio uses a
content-distribution system based on indirect storage via subrings
that employs segmentation to leverage file similarity, and thereby
optimize availability and transfer speed.

Arpeggio addresses the problem of searching for files by using an
indexing scheme based on the Keyword-Set System introduced by Gnawali
\cite{kss}. This is a variant of the standard distributed inverted
indexing technique, where a keyword is mapped to a list of files
containing that keyword. Similar techniques are used by other
peer-to-peer file-sharing network, such as Overnet \cite{overnet},
which uses the Kademlia DHT \cite{kademlia}. Arpeggio's use of KSS
differs from these other systems in that it constructs inverted
indexes not only on keywords but also on keyword \emph{sets}.
Index-side filtering improves the efficiency of performing search
queries by storing the full metadata block for each matching file in
each keyword-set index, thereby reducing network traffic utilization
by allowing the index node to perform filtering of the index entries,
returning only relevant results.

Of course, an indexing system would not be much use if there were not
also a means for obtaining the files found in the search results. For
this, Arpeggio uses indirect storage (as opposed to the direct
storage used by systems such as CFS \cite{cfs}): it maintains pointers
to each peer that contains a certain file. This is accomplished by
using the Diminished Chord protocol \cite{subrings} to build
distributed anycast groups for identifying file sources, essentially
providing the same functionality as a centralized tracker node without
the centralized bottleneck and single point of failure. File
availability is enhanced by using a segmentation algorithm inspired by
LBFS \cite{lbfs} that allows content to be shared between similar
files, and with postfetching, which uses cache space on other peers to
replicate rare but demanded files.


\section{Project plan}
\label{sec:plan}

This M.Eng.\ thesis project is the continuation of a UROP project
began during the summer of 2004. Current work has focused on creating
the design of the system. A preliminary paper about the design
\cite{arpeggio} has been accepted for publication and will appear in
print shortly. I recently presented the paper at the Fourth
International Workshop on Peer-to-Peer Systems in Ithaca, NY during
February 2005.

The work I propose will involve constructing and evaluating the
system, and revising the design in response to the issues that will
undoubtedly be discovered during the implementation phase. The goal is
to produce a working file-sharing application suitable for public
release and widespread use. Because Arpeggio relies on a number of
algorithms that have never before been implemented, or have highly
experimental implementations, this will involve extensive
implementation work.

Once implementation has been completed, extensive measurements will be
performed. The goal is to evaluate Arpeggio's performance relative to
other peer-to-peer content sharing networks with respect to several
criteria: query bandwidth usage and maintenance cost as a function of
network size, query result recall and relevance, etc. These
measurements are of particular interest since this will be a
large-scale application based on Chord, and these results may have
relevance to Chord's scalability and performance in practice.

\bibliographystyle{abbrv}
\bibliography{design-paper}

\end{document}