\ifdraft
\svnInfo $Id$
\fi

Many peer-to-peer file sharing systems currently exist. Like Arpeggio,
these systems typically comprise both an \emph{indexing} subsystem,
which allows users to search for files matching some query, and a
\emph{content distribution} subsystem, which allows them to download
the file contents from peers. We begin with an overview of the
indexing techniques used by common peer-to-peer systems.


\section{Indexing Techniques}
\label{sec:background:indexing}


\subsection{Centralized Indexes}


The earliest peer-to-peer systems, such as Napster~\cite{napst}
performed their indexing via centralized servers that contained
indexes of the files being shared by each user in the network. This
type of indexing is relatively straightforward to implement, but is
not truly ``peer-to-peer'' in the sense that the index server
comprises a required, centralized infrastructure. It achieves many of
the benefits of peer-to-peer systems: the data is still transferred
peer-to-peer, with clients downloading file content from each other,
and the index server needs to store only pointers to the file data.
Nevertheless, the dependence on the centralized server limits the
scalability of the system, as the server must hold information about
all the users in the system, as well as limiting its reliability
because the server is a central point of failure.

In spite of their disadvantages, certain types of centralized index
systems have remained popular. The popular
BitTorrent~\cite{bittorrent} protocol defines only a mechanism for
transferring files, rather than searching for files, so it is often
used in conjunction with an index website that tracks available
torrents and provides a search engine. Moreover, the file transfer
protocol itself traditionally relies upon a centralized tracker for
each file that keeps track of which peers are sharing or downloading
the file and which pieces of the file they have available. An
extension to the protocol adds support for distributed trackers based
on distributed hash tables, but searching for files can only be done
via the aforementioned centralized index websites. Recently, legal
action has forced several of these sites to be shut
down~\cite{vance05:_mpaa_closes_loki,cullen04:_suprn}, and perhaps as
a result the popularity of BitTorrent has decreased in favor of
decentralized networks such as
eDonkey~\cite{cachelogic05:_peer_to_peer_in}.

\subsection{Unstructured Networks}

At the opposite end of the design spectrum exist purely unstructured
networks, such as Gnutella~\cite{gnutel_protoc_specif}. Responding to
the problems inherent to centralized index systems, these systems are
completely decentralized, and hence eliminate central points of
failure. In Gnutella, the nodes form a \emph{unstructured overlay
  network}, in which each node connects to several randomly-chosen
nodes in the network. The result is a randomly connected graph. Each
node stores only local information about which files it has available,
which keeps maintenance cost to a minimum. Queries are performed by
flooding: a node wishing to perform a query sends a description of the
query to each of its neighboring nodes, and each neighbor forwards it
on to its neighbors, etc. Every node receiving the query sends a
response back to the original requester if it has a matching file.
This query-flooding search mechanism is quite expensive, so these
systems also usually scale poorly: not only do heavy query workloads
rapidly overwhelm the system, increasing the number of nodes in the
system serves to exacerbate the problem rather than ameliorate it
because every query must be forwarded to so many nodes.

A number of optimizations have been proposed to improve the
scalability of Gnutella-like systems. Both
Gia~\cite{chawathe03:_makin_gnutel_like_p2p_system_scalab} and newer
versions of the Gnutella protocol perform queries using random walks
instead of flooding: each node forwards a query request to a
randomly-chosen neighbor instead of all neighbors. Hence, queries no
longer reach every system in the network. As a result, the system
sacrifices \emph{perfect recall}: queries for rare data may return no
results even if the data is present in the network. Proponents of such
designs argue that most queries are for common files for which many
replicas exist in the network, and so a random walk is likely to
return results, particularly if one-hop replication, in which each
node also indexes the contents of its immediate neighbors, is used. In
contrast, we have designed our system, Arpeggio, to provide perfect
recall.

A common refinement that proves quite effective involves exploiting
the heterogeneity of client resources: the connectivity of nodes in
peer-to-peer systems ranges from low-bandwidth, high-latency dialup
links to high-bandwidth backbone connections. Likewise, the node
lifetime ranges from connections lasting under a minute (presumably
users performing a single query then disconnecting) to connections
lasting for many
hours~\cite{gummadi03:_measur_model_and_analy_of}.\marnote{Should we
  mention lifetime variance here? It isn't so relevant to unstructured
  nets. Perhaps there should be a general note about heterogeneity
  somewhere?}  Hence, allocating greater responsibility to the more
powerful nodes is effective for increasing performance. A simple
technique for accomplishing this goal is to designate the fastest and
most reliable nodes as \emph{supernodes} or \emph{ultrapeers}, which
are the only nodes that participate in the unstructured overlay
network. The remainder of the nodes act as clients of one or more
supernodes.%, which maintain the
% index and perform queries on their behalf. This approach is
In the FastTrack~\cite{hargreaves03:_fastt_protoc} network (used by
the KaZaA and Morpheus applications, among others), the supernodes
maintain a list of files that their files are sharing and answer
queries directly; in newer versions of the Gnutella
protocol~\cite{rohrs02:_query_routin_for_gnutel_networ}, clients
provide their ultrapeers with a Bloom
filter~\cite{bloom70:_space_time_trade_offs_in} of their files, which
allows the ultrapeer to forward only some queries to the client,
filtering out queries for which the client is known not to have a
match.  Gia uses a more sophisticated technique that can exploit
finer-grained differences in client resources: the degree of each node
in the graph is proportional to its resources, and search requests are
biased towards nodes with higher degree.

\subsection{Structured Overlay Networks}

Structured peer-to-peer overlay networks based on \emph{distributed
  hash tables} (\emph{DHTs})~\cite{stoica03:_chord, rowstron01:_pastr,
  ratnasamy01:_scalab_conten_addres_networ, maymounkov02:_kadem,
  kaashoek03:_koord, li05:_bandw_effic_manag_of_dht_routin_tables}
strike a balance between the centralized index and unstructured
overlay designs. While the system remains fully decentralized, a
distributed algorithm assigns responsibility for certain subsets of
the data to particular nodes, and provides a means for efficiently
routing messages between nodes while maintaining minimal state. As a
result, they provide efficiency closer to that of a centralized index
while retaining the fault-tolerance properties of a decentralized
system.

Though the precise details differ between DHT designs, the lowest
layer is fundamentally a \emph{distributed lookup service} which
provides a \proc{Lookup} operation that efficiently identifies the
node responsible for a certain piece of data. In the Chord
algorithm~\cite{stoica03:_chord}, which is described in detail in
Section~\ref{sec:chord}, this \proc{Lookup} operation requires
$\bigO{\log n}$ communications, and each node must be connected to
$\bigO{\log n}$ other nodes, where $n$ is the total number of nodes in
the network. Other DHTs make different tradeoffs between \proc{Lookup}
cost and node degree.

Building on this primitive,
DHash~\cite{dabek04:_desig_dht_for_low_laten} and other
\emph{distributed hash tables} (DHTs) implement a storage layer based
on a standard \proc{Get-Block}/\proc{Put-Block} hash table
abstraction. The storage layer makes it possible to store data in the
network by handling the details of replication and synchronization.

Distributed hash tables are a natural fit for developing peer-to-peer
systems because they are efficient and decentralized. However, their
limited interface does not immediately meet the needs of many
applications, such as file sharing networks. Though DHTs provide the
ability to find the data associated with a certain key or file name,
users often wish to perform keyword searches, where they know only
part of the file name or metadata keywords. In other words, the
\emph{lookup by name} DHT semantics are not immediately sufficient to
perform complex \emph{search by content} queries of the data stored in
the network. 


\section{The Chord Protocol}
\label{sec:chord}

Arpeggio uses the Chord lookup protocol as a basis for its indexing
system. Though Arpeggio mainly uses Chord as a ``black box'', we
provide a brief overview of how the lookup protocol works.

Chord uses \emph{consistent
  hashing}~\cite{karger97:_consis_hashin_and_random_trees} to map keys
(such as file names, or in our system, keywords) to the nodes
responsible for storing the associated data. Consistent hashing uses a
single, standardized hash function such as
SHA-1~\cite{nist02:_secur_hash_stand} to map both nodes and keys into
the same circular identifier space. Each node is assigned a 160-bit
identifier by taking the SHA-1 hash of its IP address. Likewise, each
key is assigned the 160-bit identifier that results from taking its
SHA-1 hash. We then assign the responsibility for storing the data
associated with a particular key to the first node that follows it in
the identifier space, as shown in
Figure~\ref{fig:chord:consistent-hashing}.

\begin{figure}[btp]
  \centering
  \includegraphics[scale=0.5]{figures/chash2.1}
  
  \caption[Mapping of keys to nodes using consistent hashing]{Mapping
    of keys to nodes using consistent hashing. Key \#192 is mapped to
    node \#205 because it is the lowest node number greater than \#192.}
  \label{fig:chord:consistent-hashing}
\end{figure}

Because the hash function behaves like a random function, load is
distributed approximately uniformly among the nodes in the system. If
there are $n$ nodes in the system, each node is responsible for
$\nicefrac{1}{n}$ of the keys in expectation, and no more than
$\bigO{\frac{\log n}{n}}$ with high probability; the latter figure is
reduced to $\bigO{\frac{1}{n}}$ if each node operates $\bigO{\log n}$
virtual nodes~\cite{karger97:_consis_hashin_and_random_trees}. It also
deals well with dynamic membership changes in the set of nodes: it is
easy to see that if a node joins or leaves the system, the only keys
that will need to be transferred are among those stored on that node
and its neighbors.

As a result, if each node knows the identifiers of all other nodes of
the system, it can determine which node is responsible for a certain
key simply by calculating a hash function, without communicating with
a central index or any other nodes. However, in large peer-to-peer
systems, it is impractical for each node to keep a list of all other
nodes, both because the list would be very large, and because it
changes frequently and keeping it up to date would require large
amounts of maintenance traffic.

Hence, it is necessary to have a system to allow nodes to look up the
responsible node for a certain key, while maintaining minimal state on
each node. The Chord distributed lookup protocol fills this need.

As an initial step, suppose that each node in the network contains a
pointer to its \emph{successor}: the node with the next larger ID, as
shown in Figure~\ref{fig:chord:successors}.  This ensures that it is
always possible to locate the node responsible for a given key by
following the chain of successor pointers. It is reasonably
straightforward to keep the successor pointers up to date: each node
can maintain the addresses of several successors and predecessors, and
periodically poll them to determine if they have failed or if a new
successor or predecessor has come online.


\begin{figure}[btp]
  \centering
  \includegraphics[scale=0.5]{figures/successors.1}
  
  \caption{Successor pointers in a Chord ring}
  \label{fig:chord:successors}
\end{figure}

Though maintaining correct successor pointers guarantees correctness,
performing a lookup still requires $\bigO{n}$ hops in the average
case, giving rather unsatisfying performance. To improve performance,
each node also maintains $\bigO{\log n}$ finger pointers: a node with
id $n$ knows the successor of keys $\left(n + 2^i \pmod
  {2^{160}}\right)$ for all values of $i$, as shown for key $\#8$ in
Figure~\ref{fig:chord:fingers}. This corresponds to knowing the
location of nodes halfway around the ring, $\nicefrac14$th across the
ring, $\nicefrac18$th across, etc. As a result, \proc{Lookup}
operations can be performed using only $\bigO{\log n}$ messages;
intuitively, this is because each hop reduces the distance to the
destination (in terms of keyspace) by at least half. Though proving
this in a dynamic network is far more involved, a maintenance
protocol can ensure that the system remains sufficiently stable to
allow logarithmic lookups even as nodes constantly join and leave the
system~\cite{liben-nowell02:_analy_of_evolut_of_peer}.

\begin{figure}[btp]
  \centering
  \includegraphics[scale=0.5]{figures/fingers1.1}
  
  \caption{Finger pointers for one node in a Chord ring}
  \label{fig:chord:fingers}
\end{figure}

Finally, the DHash storage layer lies atop the Chord lookup
service. This layer stores the actual data, using a hash table
\proc{Get}/\proc{Put} interface, where keys are associated with
data. Since nodes can fail or leave the system without notice, in
order to ensure durability the data must be replicated on multiple
nodes. This is achieved by storing the data not only on the immediate
successor of the key, but on the next $k$ successors; nodes
periodically poll their neighbors to ensure that they are still
functioning and enough copies of the data exist. The value of $k$ can
be chosen to provide the desired balance between minimizing the
probability of failure and keeping the required maintenance traffic
low~\cite{chun06:_effic_replic_maint_for_distr_storag_system}.

Many optimizations are possible in the storage layer, such as using
erasure codes such as IDA~\cite{rabin89:_effic_disper_of_infor_for}
instead of replication to improve data durability and reduce
maintenance costs~\cite{weatherspoon02:_erasur_codin_vs}, or using a
protocol based on Merkle
trees~\cite{merkle79:_secrec_authen_and_public_key_system} to more
efficiently synchronize
replicas~\cite{cates03:_robus_and_effic_data_manag}.

\paragraph{Generality of Arpeggio.} Though we present our Arpeggio
indexing algorithms in terms of Chord in Chapter~\ref{chap:indexing}
and our implementation described in Chapter~\ref{chap:implementation}
uses Chord as its basis, we emphasize that this is not a requirement
of the system. Chord can be replaced as the underlying substrate of
the system with any other protocol that provides the ability to route
to the nodes responsible for a particular key. In particular, it could
be used in conjunction with the Kademlia
DHT~\cite{maymounkov02:_kadem} already deployed in the trackerless
BitTorrent
system~\cite{forum:_bittor_track_dht_protoc_specif_v1}. Note, however,
that because Arpeggio requires a more complex interface than the
\proc{Get}/\proc{Put} interface that distributed hash tables provide,
the nodes in the system must be modified to support the extended query
processing interface. This is discussed in
Section~\ref{sec:implementation:deployability}.

%%% Local Variables:
%%% TeX-master: "thesis"
%%% TeX-command-default: "rubber"
%%% End:
% LocalWords:  DHTs SHA IP metadata DHash