% -*- TeX-master: "report.tex" -*-

% Implementation

This section discusses the search algorithms our system uses to
quickly locate cached data. In order to adapt the system to different
membership sizes, we explored two different search algorithms. Both
algorithms use consistent hashing, a commonly used algorithm for
locating data in distributed systems~\cite{sherman05:_acms}. For our
project, we looked at consistent hashing with full membership and a
variant of consistent hashing with partial membership based on Chord.



\subsection{Consistent Hashing with Full Membership}
Consistent hashing is a distributed algorithm that maps keys to data
by assigning responsibility for some of the mappings to each
node~\cite{karger97:_consis_hashin_and_random_trees}. Each node and
key in the system is hashed to a common ID space. Nodes are arranged
in a ring ordered by ID. Each node is responsible for all of the keys
that fall between its ID and the ID of its predecessor. 

For the InHome cache, the node that is responsible for a key stores
the list of nodes that have the cached data associated with that key.
To insert a key, the node notifies the responsible node. The
responsible node will add the node to the list of nodes that have the
data for that key. To look up a key, the node just finds the successor
of the key. Since the nodes have global knowledge, finding the node
responsible for a key takes constant time.

To maintain full membership information, a node notifies all of the
other nodes in the system when it joins and leaves.  If a node fails,
the next lookup for a key that the failed node is responsible for will
time out and the node that is performing the lookup will notify the
other nodes of the failed node.

Although consistant hashing with full membership guarantees fast lookup, 
the disadvantage is that the overhead of maintaining full membership
information grows linearly with the number of nodes in the system. Thus, as 
the number of nodes in the system grows, the overhead of maintaining membership
may become unacceptably high.

The trade-off between search performance and bandwidth usage must be
carefully considered. The latency of searches in our system impacts
the number of cache hits because if the lookup takes too long, it will 
time out and the application will fall back to the origin server. The hit
rate determines how much bandwidth is saved in the end, so using a higher
latency search protocol will hurt the performance of the system. But as the
membership size grows, the overhead used by consistent hashing to maintain
full membership information may overwhelm the local area network.

For situations where the membership size is guaranteed to remain small,
consistent hashing with full membership will be the best choice. But
for a system that scales with a large number of nodes, we explore
consistent hashing with partial membership information using
data-oriented Chord.

\subsection{Data-oriented Chord with Partial Membership}
The goal of partial-knowledge consistent hashing is to reduce the
number of other nodes a node must know about to perform lookups. At a
minimum, each node must know about at least one other node. If each
node just maintained a correct successor pointer, a lookup could
traverse the entire ring until it finds the successor for the key.

Data-oriented Chord is based on the Chord lookup
protocol~\cite{stoica01:_chord}. Chord is a variation of
partial-knowledge consistent hashing where each node knows about its
successor and $\log{n}$ other nodes. The other $\log{n}$ nodes, called
fingers, are used to optimize lookups. Fingers are placed at powers of
two; the $i$-th finger is a node that is at least $2^{i}$ away in the
ID space. With $\log{n}$ fingers, Chord can perform a lookup in
$O(\log{n})$ time because the distance to the successor of the key can
be at least halved by each hop.

Data-oriented Chord is a variation on Chort where clients store data
by inserting \emph{virtual nodes} for each new mapping. Each client
(hereafter referred to as physical nodes) stores one virtual node for
each piece of data it contains, and all virtual nodes form a giant
Chord ring. We'll talk about how we construct the virtual node IDs in
a bit.

To join the system, physical nodes only need to contact a well-known
node. A physical node does not join the Chord ring until the new node
has a piece of data to insert. Until the new node inserts its first
piece of data, the node can use the well-known node to run
lookups. Since our system is used for caching, it is expected that the
node re-inserts every piece of data it receives, so the new node
should not need to use the well-known node for lookups for very
long\footnote{Unless the node is malicious, in which case, the well
known node should cut off the malicious node after some number of
lookups}. 

\begin{figure}[tp]
  \centering
  \includegraphics[scale=0.3]{figures/successors.ps}
  \caption{\textbf{Data-oriented Chord ring.} Each node in the Chord ring
    represents a virtual node and the virtual nodes of the same color
    reside on the same physical node. Even with only successor
    pointers, each physical node already has quite a bit of
    information about the other nodes. For example, the red physical
    node knows the blue physical node had an object associated with
    key 129 and the green physical node has objects associated with
    keys 61 and 251.}
  \label{fig:Successors}
\end{figure}

Lookups function similarily to Chord except that each physical node
considers the fingers of \emph{all} of its virtual nodes. This
optimization reduces the runtime of the algorithm from $\log{m}$ where
$m$ is the number of \emph{virtual} nodes down to $\log{n}$ where $n$
is the number of \emph{physical} nodes in the system. If the successor
ID matches the key, the object exists in the system. If the successor
is not the key, the object is not in the cache.

To insert a piece of data, the physical node has the virtual node
representing that piece of data ``join'' the Chord ring using the
Chord join protocol. The physical node will find the virtual node's
successor using lookup and fetch an initial finger table for the
virtual node from the successor. Figure \ref{fig:Successors} shows a
Chord ring with virtual nodes and successor pointers.

Since each virtual node represents a piece of data, a problem arises
when two physical nodes have one piece of data. This problem occurs
frequently because when a node fetches a piece of data from another
node in the system, the node should insert its own copy of the data
into the ring. Furthermore, we want to load balance all requests for 
a piece of data amongst all nodes that contain that data.

We solve this problem as follows. To insert a piece of data, the node first tries to insert a virtual node with an ID where the lower order bits are all set to 1. If this insert fails (i.e. another node has already inserted the data, the node picks a random number for the lower bits of the ID and re-inserts until it succeeds. Any time a node requests the data, it chooses a random number for the lower order bits (Chord lookup will return the first node with a greater ID number than the request.)

Since the first node to insert the data selected the all-1 vector for its lower order bits, any data request is guaranteed to return the correct data\footnote{In the corner case where that node unexpectedly drops and no other node has tried to insert the data, our lookup protocol queries the predecessor pointer of the node to find a node containing the data.} All requests choose a random set of lower-order bits, so on average all nodes containing the data will satisfy an equal number of requests.

To leave the system, the node can just exit without contacting any
other nodes. If the node has a lot of virtual nodes, the node might
end up contacting every other physical node in the system, defeating
the purpose of minimizing the bandwidth overhead of membership
maintenance. 

Each node runs stabilize according to the Chord protocol to keep
successor pointers and finger tables up to date. To reduce the amount
of bandwidth spent running the stabilize protocol, each physical node
cycles through its virtual nodes and only runs stabilize for one
virtual node at a time. Thus, the amount outgoing bandwidth used for
stabilize at each physical node be the same as in Chord, but the
incoming bandwidth used at each node will be proportional to how much
data the node has with respect to the other nodes in the system.

