\documentclass[twocolumn,10pt]{article}
\usepackage{fullpage}
\usepackage[compact]{titlesec}
\usepackage{xspace}
\usepackage{graphicx}
\usepackage{paralist}
\newcommand{\longname}{Data-oriented Chord\xspace}
\newcommand{\name}{DOC\xspace}
\title{A Distributed Directory Service using the Chord DHT}
\author{Irene Zhang, Raluca Ada Popa, Mihir Kedia, Hari Balakrishnan,
  Barbara Liskov\\ \textit{MIT Computer Science and
    Artificial Intelligence Laboratory}} 
\date{}
\newcommand{\bigO}[1]{\ensuremath{O\left(#1\right)}}


\begin{document}

\maketitle

\begin{abstract}
  
\end{abstract}

\section{Introduction}
% people often use DHTs for directory services. They use the DHT to
% find some meta-data about the object, like where it is stored. Using
% a DHT directly is difficult because you do not get any control over
% data placement. Unfortunately this introduces a level of
% indirection, where the meta-data is stored on a different node than
% the data.

%fault tolerance
%fate sharing
%load balancing
%scalability

%maintain the simplicity of Chord but better suited to a lot of systems
%handles failures much better. nodes can just leave
%scales naturally to handle hot spots


%two benefits over DHTs in general, scales well for flash crowds,
%although if only serving pointers, maybe not any different

%CAN:

% hotspot handling: caching and replication, use a ttl, i guess we
% can't handle changes in objects either nodes can just leave without
% transferring data first

%"realities" like virtual nodes, one node is a member of multiple
%realities and for each reality the hash table is replicated, so with
%3 realities, the data would be replicated 3 times

%path length
%neighbor state

%TODO: look at plaxton algorithm

%PASTRY:
%takes into account network locality


Numerous distributed systems use distributed hash tables for locating
data. Distributed hash tables offer many benefits: decentralized
operation, a simple interface, scalability, availability and fault
tolerance. Unfortunately, DHTs not give applications any control over
where data is placed in the system. Distributed systems frequently
need control over data placement because objects are large, frequently
modified or cannot be stored on untrusted peers. Many distributed
systems avoid this problem by storing meta-data in the DHT that points
to where the actual data is located. For example, the BitTorrent
implementation uses the DHT to locate a tracker that keeps the actual
list of servers for the file.

By using the DHT to store meta-data instead of data directly, systems
like the BitTorrent implementation reduce the benefits of using a
DHT. For each piece of data, there is now a single point of failure,
drastically reducing the robustness that DHTs provide. For example in
the Pastry DHT~\cite{rowstron01:_pastr}, typically 8-16 nodes with
adjacent node ids must fail simultaneously before a key in the system
cannot be found at all. In contrast, if the Pastry DHT is used to
store meta-data, data can be lost by just the failure of the node
storing the meta-data.

Once a node fails, recovering the meta-data is very difficult and
time-consuming. In contrast, DHTs generally have efficient algorithms
for quickly rebuilding routing tables. In addition, the meta-data
scheme lacks fate-sharing; if the node holding the meta-data fails,
the data can no longer be located, although the data itself is still
in the system. This problem limits the scalability of the system
somewhat because the amount of data lost grows linearly assuming a
constant rate of failures.

Load-balancing and hot spots can also pose a problem in the meta-data
scheme. If a node is responsible for the meta-data of an especially
popular file, the node will see much more traffic than other nodes and
may become overloaded. The failures and load balancing problems can be
somewhat offset by replicating meta-data across multiple nodes and
placing multiple virtual nodes on each physical node, but this
solution introduces other issues such as consistency.

We introduce {\longname}, a modification of the Chord
DHT~\cite{stoica01:_chord} that gives applications control over key
placement while preserving the simple design and other benefits of
Chord. Like the Chord protocol, {\longname} supports just one
operation: given a key, its maps the key onto a node. Unlike Chord,
{\name} leaves the key assigned to the client/physical node that
inserted the key. The basic idea behind {\name} is very simple:
instead of using virtual nodes for load balancing, {\name} uses
virtual nodes to represent keys held by physical nodes. Every time a
physical node acquires a new key, the node inserts a new virtual node
for that key into the {\name} ring, essentially registering as a server
for that key. Since keys remain mapped to the node that inserted the
key, applications now have control over key placement, and therefore
data placement.

The use of virtual nodes to represent data objects or keys can be
applied to any DHT that can virtualize a physical node in several
locations in the identifier space. For example, the
Content-Addressable
Network~\cite{ratnasamy01:_scalab_conten_addres_networ} can map a
single key onto several points in the coordinate space by using a
different hash function for each point. By using a different hash
function for each data object, every object on a physical node could
be represented by a different point in the coordinate space. This idea
could also be applied to Pastry~\cite{rowstron01:_pastr},
Kademlia~\cite{maymounkov02:_kadem} and many other DHTs.

The rest of the paper gives a short overview of the original Chord
protocol, explains the {\longname} protocol in depth, describes an
example application and provides some performance results based on
simulations.

\section{The Chord protocol}
Chord is a distributed lookup protocol with just one operation: given
a key, it maps the key to a node. Nodes are arranged ordered in a ring
using the node id and keys are assigned to nodes using consistent
hashing. Each node keeps pointers to its successor and predecessor in
the ring, and $\log{n}$ other nodes. The other $\log{n}$ pointers,
called fingers, are used to optimize lookups. Fingers are placed at
powers of two; the $i$-th finger is a node that is at least $2^{i}$
away in the id space.

Lookups hop from node to node around the Chord ring until the node
that the key is assigned to is found. Only maintaining the successor
pointer is necessary for correctness because key lookups can progress
around the ring until the node holding the key is found. Using only
successor pointers, the number of hops needed for each lookup will
increase linearly with respect to the number of nodes in the
system. With the addition of $\log{n}$ fingers at every node, the
distance to the destination can be halved at each hop, so lookups take
$\log{n}$ hops on average.


%talk about virtual nodes

%\subsection{System Model}

\section{{\longname} Protocol}
%We modified the Chord protocol to allow the user to control placement
%of the data

The basic idea behind {\longname} is a simple modification of the
original Chord protocol: using virtual nodes to represent keys held by
physical nodes. First, we will present the base protocol and then we
will address some of its limitations with optimizations.

\subsection{Consistent Hashing}
The consistent hash function in {\longname} assigns each virtual node
an $(m+n)$-bit identifier. Since more than one physical node may hold
a particular key, one key must map to multiple virtual nodes. So the
identifier for a virtual node is chosen as follows: the higher
$m$-bits of the id are chosen by hashing the key using a base hash
function like the SHA-1, and the lower $n$-bits are chosen randomly
when the virtual node is inserted. Ids are ordered in a ring modulo
$2^{m+n}$, divided into ranges of size $2^n$ for each key. Within each
key range, the virtual nodes divide the id space randomly.

For each key lookup, the consistent hash function creates a
$(m+n)$-bit identifier where the higher $m$-bits are a hash of the key
and the lower $n$-bits are chosen randomly. The virtual node that is
the successor in the key range of the lookup identifier is the one
that the key maps to for that lookup. This randomization in the lookup
protocol means that a different virtual node gets returned for each
lookup, balancing the load across all of the physical nodes that hold
a particular key.

\subsection{Key Lookup}
Each physical node in {\longname} maintains routing information for
each of its virtual nodes, meaning that the amount of routing
information maintained by a physical node scales with the number of
keys on the node. As in the Chord protocol, only successor pointers
for each virtual node are necessary for correctness. Figure
\ref{fig:Successors} shows an example ring with only the successor
pointers. To guarantee logarithmic performance on average, physical
nodes in {\name} also maintain finger tables for each virtual node.

\begin{figure}[tp]
  \centering
  \includegraphics[scale=0.3]{figures/successors.ps}
  \caption{\small \textbf{Data-oriented Chord ring.} Each node in the Chord
    ring represents a virtual node and the virtual nodes of the same
    color reside on the same physical node. Even with only successor
    pointers, each physical node already has quite a bit of
    information about the other nodes. For example, the red physical
    node knows the blue physical node has key 129 and the green
    physical node has keys 61 and 251. Unfortunately, even with
    multiple successor pointers on each physical node, searches are
    still linear because there are no guarantees about where the
    successor pointers will be located.}
  \label{fig:Successors}
\end{figure}

Using the lookup identifier given by the consistent hash function,
lookups proceed by first finding the predecessor following the Chord
protocol. As a practical optimization, the lookup protocol considers
the fingers of all virtual nodes on the physical node at each hop to
ensure that a physical node never needs to be contacted twice. Once
the predecessor has been found, the first $m$-bits of the successor's
id are checked to see if the successor is in the key range. If the
successor is in the key range, the successor is returned. Otherwise,
the predecessor's id is checked and the predecessor is returned if it
is in the key range. If both the predecessor and successor are outside
the key range, no nodes in the system have been mapped to that key
yet. This check is necessary because the virtual nodes that serve the
key are randomly arranged in the key range, so the successor to the
last few ids in the key range and the predecessor to the first few ids
in the key range might be a virtual node in another key range.

\subsection{Node Joins}
Physical nodes do not join the {\longname} ring as in the original
Chord protocol. A node only joins the ring when the node has a key,
and therefore a virtual node, to insert.

To join the system, physical nodes can contact some well-known node
and use the well-known node to run lookups until the node acquires a
key. In the base protocol, nodes always register as a server for any
key they have looked up, so the well-known node should only be used
for one lookup. The well-known node should cut off the new node if the
node uses it for more than a few lookup in case the new node is
malicious.

When a physical node wants to register as a server for a key, the
physical node ``joins'' the {\name} ring using the Chord join protocol
and an virtual node identifier given by the consistent hash
function. Like the optimization used in the Chord protocol, the
virtual node can start with an initial finger table copied from
another virtual node and adjust the fingers later using stabilize. If
the physical node acquired the key by a lookup, the node can fetch
initial finger tables along with the other data associated with the
key from node that the key mapped to. Otherwise, the physical node can
find the virtual node's successor using lookup and fetch an initial
finger table for the virtual node from the successor.

To leave the system, the physical node can just exit without
contacting any other nodes. The node will remove keys and data that
belong to itself from the system and the stabilize protocol will
eventually fix any pointers that are broken. This ability to leave
without contacting other nodes is a huge benefit because failures no
longer have any impact.

\subsection{Stabilization}

Each node runs stabilize according to the Chord protocol to keep
successor pointers and finger tables up to date. To reduce the amount
of bandwidth spent running the stabilize protocol, physical nodes
batch stabilize messages to other physical nodes. 


% \section{A Cooperative Caching Application}
% The {\longname} protocol was originally developed for InHome, a
% local-area peer-to-peer cooperative caching system. InHome is
% motivated by the observation that many people inside an organization
% access common data, allowing one to trade expensive wide-area
% bandwidth for cheap local area bandwidth by fetching commonly accessed
% data from local peers. InHome requires a decentralized search protocol
% that can scale to tens of thousands of nodes. Wide-area latencies are
% already fairly low due to optimizations like content distribution
% networks, so the hit rate and search latency for InHome is extremely
% important. {\longname} optimizes the hit rate for a caching
% application like InHome because no data in the system is ever
% unreachable due to failures.

\section{Evaluation}

\subsection{Optimizations}
%finger tables initialized from the node where the data was fetched,
%so another search doesn't have to be run
%finger tables are updated by asking the  

%admit that when there are a lot of files on each node, a meta-data
%scheme might perform better, but still has all of the problems in the
%intro

%point out that every node does not need to serve every file it
%has. There is probably some way of figuring out how many files each
%node should serve, we can leave this for later work


\section{Related Work} 

{\longname} is aimed at the same applications as other distributed
lookup services like Chord, CAN, Pastry and Tapestry. Unlike other
DHTs, {\name} always maps the key onto the node where the key was
inserted. {\name} is similar to Tapestry in that applications have
control over where data placement, but {\name} does not optimize for
locality like Tapestry, making {\name} much simpler.

{\longname} is most similar to the content-sharing subrings used in
the Arpeggio file-sharing network~\cite{clements05:_arpeg}. Arpeggio
uses the Diminished Chord protocol~\cite{karger04:_dimin_chord} to
create a subring for each distinct data object in the system. Nodes
that hold a particular object all join the subring for that object.
For each file with $k$ servers, Diminished Chord only uses a linear
amount of space for each subring, whereas {\name} uses
$\bigO{n\log{n}}$ space, but {\longname} maintains the simplicity of
the original Chord protocol.


\section{Conclusion} We presented {\longname}, a distributed lookup
protocol that offers all of the advantages of DHTs while allowing
applications to control data placement.

{\longname} offers several benefits over storing meta-data in a DHT:
\begin{compactitem}
\item Fault tolerance with fate sharing: When a nodes fails, only data
  on that node is lost. 
\item Load balancing: Each node 
\item Scales based on the popularity of the file.
\end{compactitem}


\begin{scriptsize}
\bibliographystyle{acm} 
\bibliography{paper}
\end{scriptsize}

\end{document}