\ifdraft
\svnInfo $Id$
\fi

The central problem for a content-sharing system is searching: it must
be able to translate a search query from a user into a list of files
that fit the description and a method for obtaining them. In Arpeggio,
we concern ourselves with file \emph{metadata} only.  As discussed
previously, each file shared on the network has an associated set of
metadata. We only allow searches to be performed on the small amount
of metadata associated with the file, rather than the contents, which
are potentially much larger.

Limiting searches to file metadata is an important restriction that
enables us to use a number of algorithms that would not be suitable
for full-text search of documents. Indeed,
Li~et~al.~\cite{li03:_feasib_of_peer_to_peer} investigated the
communications costs that would be required to perform full-text
search of the web using a distributed index, and found that the size
of the data set made the problem infeasible using current
techniques. Our work does not contradict these findings; rather, it
addresses a different problem. Peer-to-peer indexing for metadata
remains feasible, because the datasets we consider have a small number
of metadata keywords per file (on the order of 5--10), which is much
smaller than the full text of an average Web page.

\section{Design Evolution}
\label{sec:indexing:design-evolution}

Our indexing techniques are designed to ensure that the system scales
with the addition of new nodes, and to ensure that the indexing cost
is shared equally among the nodes rather than causing one node to bear
a disproportionate amount of the load. To justify our design
decisions, we present the evolution of our indexing system as a series
of possible designs, exploring the problems with each and how to
address them.


\subsection{Inverted Indexes}
\label{sec:indexing:design-evolution:inverted-indexes}

The standard building block for performing searches is, of course, the
\emph{inverted index}: a map from each search term to a list of the
documents that contain it. Such indexes can be searched by scanning
over all entries in an index to identify documents that match the full
query, or finding the intersection of the indexes corresponding to
each search term.

There are two main approaches to distributing an inverted index. In
\emph{partition-by-document} indexing, each node in the system stores
all of the inverted index entries corresponding to some subset of the
indexed documents. To perform a query, the system must contact a set
of nodes that collectively hold index entries for every document; if
index entries are not replicated, this means sending the query to
every node in the system. In \emph{partition-by-keyword} indexing,
each node stores the inverted index entries corresponding to some
range of \emph{keywords}.

Both approaches have been employed in a number of
systems. Partition-by-document indexing is used for web searches by
Google~\cite{barroso03:_web_searc_for_planet}, and for indexing
scholarly publications in OverCite~\cite{stribling06:_overc}. Li et
al.~\cite{li03:_feasib_of_peer_to_peer} observed that naive
implementations of partition-by-document indexing require three orders
of magnitude less bandwidth to perform a search than naive
implementations of partition-by-keyword indexing. Though this appears
to be a compelling argument for a partition-by-document index, we
instead base our system on partition-by-keyword indexing. We note that
there are a number of optimizations that can be employed to reduce the
bandwidth cost, which we explore in
Sections~\ref{sec:indexing:design-evolution:partition-by-keyword}~and~\ref{sec:indexing:design-evolution:index-side-filtering}. Moreover,
partition-by-document indexing requires contacting nodes from each
index partition. In the two partition-by-document systems described
above, this is reasonable: Google's index servers are likely to be in
the same cluster, and thus enjoy low latency; OverCite has a
relatively small number of nodes, each of which holds as much as
\nicefrac{1}{2} of the whole index. In a large-scale peer-to-peer
system, it is not practical to contact so many nodes or store so much
index data on each node. Partition-by-keyword indexing, on the other
hand, maps nicely to the interface provided by consistent hashing and
distributed hash tables, allowing us to perform queries while
contacting a minimal number of nodes.

\subsection{Partition-by-Keyword Distributed Indexing}
\label{sec:indexing:design-evolution:partition-by-keyword}

A partition-by-keyword inverted index can be implemented in a
relatively straightforward way on top of a distributed hash table:
each keyword maps to the server responsible for storing its inverted
index. To perform a query, the client can fetch each index list from
the DHT, and calculate their intersection, as shown in
Figure~\ref{fig:indexing:design-evolution:partition-by-keyword:pseudocode}. No
modifications to the standard DHT \proc{Get}/\proc{Put} interface are
required.

\begin{figure}[tbp]
  \centering
  \begin{boxedminipage}{3.5in}
  \begin{codebox}
    \Procname{$\proc{Query}(criteria)$}
      \li $indexes \gets \emptyset$ 
      \li \For $keyword,\; value$ \textbf{in} $criteria$
      \li \Do $indexes$.\proc{Append}(\proc{Get}($keyword$))
      \End
      \li \Return $\proc{Intersection}(indexes)$
  \end{codebox}
  
  \end{boxedminipage}
  \caption{Pseudocode for distributed inverted index}
  \label{fig:indexing:design-evolution:partition-by-keyword:pseudocode}
\end{figure}

This naive approach scales poorly to large numbers of documents. Each
keyword index list must be transferred in its entirety; these keyword
index lists can become prohibitively long, particularly for very
popular keywords, so retrieving the entire list may generate
tremendous network traffic.

There is clearly room for optimization, as the desired intersection of
the indexes is typically much smaller than any one of the index lists,
yet in the scheme above the full index lists are transferred. Bloom
filters~\cite{bloom70:_space_time_trade_offs_in} use hashes to encode
a summary of the contents of a set in a small data structure; set
inclusion queries are probabilistic, with no false negatives but some
chance of a false positive. Reynolds and
Vahdat~\cite{reynolds03:_effic_peer_to_peer_keywor_searc} proposed a
protocol in which one index server transmits a Bloom filter of its
results to another, allowing the second to determine a conservative
superset of the intersection to send to the client. Li et
al.~\cite{li03:_feasib_of_peer_to_peer} proposed several further
optimizations, including compression of the index lists and adaptive
intersection
algorithms~\cite{demaine00:_adapt_set_inter_union_and_differ}. These
optimizations can substantially reduce the amount of bandwidth
required to find the index intersection; however, we do not employ
them in Arpeggio because our restrictions on the type of data being
indexed enables the following more effective optimization.

\subsection{Index-Side Filtering}
\label{sec:indexing:design-evolution:index-side-filtering}

Performance of a keyword-based distributed inverted index can be
improved by performing \emph{index-side filtering} instead of
performing an index intersection at the querying node or a distributed
index join. Because our application postulates that metadata is small,
the entire contents of each item's metadata can be kept in the index
as a \emph{metadata block}, along with information on how to obtain
the file contents. To perform a query involving a keyword, we send the
full query to the corresponding index node, and it performs the
filtering and returns only relevant results. This dramatically reduces
network traffic at query time, since only one index needs to be
contacted and only results relevant to the full query are
transmitted. This appears to be similar to the search algorithm used
by the Overnet network~\cite{overnet},
which uses the Kademlia DHT, though details of its operation are
difficult to find. The same technique is also used in
eSearch~\cite{tang04:_hybrid_global_local_index_for}.

\begin{figure}[tbp]
  \centering
  \begin{boxedminipage}{3.5in}
  \begin{codebox}
    \Procname{$\proc{Query}(criteria)$}
  \li $index \gets \proc{Random-Keyword}(criteria)$
    \li \Return $\proc{Filtered-Get}(index, criteria)$
  \end{codebox}
  
  \end{boxedminipage}
  \caption{Pseudocode for distributed inverted index}
  \label{fig:indexing:design-evolution:index-side-filtering:pseudocode}
\end{figure}


Though this seems like a trivial change, it represents an important
change in the role of the DHT from passive storage element to active
processing component. The procedure of transmitting the query to the
index server for execution is reminiscent of the use of searchlets in
Diamond~\cite{huston04:_diamon}, though far simpler. It requires
extending the DHT interface beyond the standard
\proc{Get-Block}/\proc{Put-Block} hash table abstraction; our
implementation accesses the underlying Chord \proc{Lookup} function to
build more powerful functions involving network-side processing. As a
result, our algorithm cannot be deployed atop an existing DHT service,
such as OpenDHT~\cite{rhea05:_opend} without modification. This
presents a deployment problem as several BitTorrent
clients~\cite{forum:_bittor_track_dht_protoc_specif_v1,azureus}
already operate their own (mutually incompatible) DHTs; we cannot use
them to store index data without modifying all existing clients. We
discuss this in Section~\ref{sec:implementation:deployability}.

\subsection{Keyword-Set Indexing}
\label{sec:keyword-set-indexing}

While index-side filtering reduces network usage, query load may still
be unfairly distributed, overloading the nodes that are responsible
for indexing popular keywords. To overcome this problem, we propose to
build inverted indexes not only on individual keywords but also on
keyword \emph{sets}: pairs, triples, etc.\ up to some maximum size. As
before, each unique file has a corresponding metadata block that holds
all of its metadata. Now, however, an identical copy of this metadata
block is stored in an index corresponding to each subset of at most
$K$ metadata terms. The maximum set size $K$ is a parameter of the
network. This is the Keyword-Set Search system (KSS) introduced by
Gnawali~\cite{gnawali02:_keywor_set_searc_system_for}.

Essentially, this scheme allows us to precompute the index
intersection for all queries of up to $K$ keywords.  It is not a
problem if a particular query has more than $K$ keywords: the query
can be sent to the index server for a randomly chosen $K$-keyword
subset of the query, and the index server can filter the results to
only send back responses that match the full query.  This approach has
the effect of querying smaller and more distributed indexes whenever
possible, thus alleviating unfair query load caused by queries of more
than one keyword.

The majority of searches contain multiple keywords, as shown in
Reynolds and Vahdat's analysis of web
queries~\cite{reynolds03:_effic_peer_to_peer_keywor_searc} and our
analysis of peer-to-peer query traffic in
Section~\ref{sec:evaluation:index-properties}. As a result, most
queries can be satisfied by the smaller, more specific
multiple-keyword indexes, so the larger single-keyword indexes are no
longer critical to result quality.  To reduce storage requirements,
maximum index size can be limited, preferentially retaining entries
that exist in fewest other indexes, i.e.\ those with fewest total
keywords.

% Network-side filtering. Explain keyword/filterable metadata.
% unnecessary network traffic is eliminated by filtering inside of the
% network instead of on the originator of the query
In Arpeggio, we combine KSS indexing with index-side filtering, as
described above: indexes are built for keyword sets and results are
filtered on the index nodes. We make a distinction between
\emph{keyword metadata}, which is easily enumerable and excludes
stopwords, and therefore can be used to partition indexes with KSS,
and \emph{filterable metadata}, which is not used to partition indexes
but can further constrain a search. Index-side filtering allows for
more complex searches than KSS alone. A user wish to restrict his or
her keyword query to files of size greater than 1 MB, files in
\texttt{tar.gz} format, or MP3 files with a bitrate greater than 128
Kbps, for example. It is not practical to encode this information in
keyword indexes, but the index obtained via a KSS query can easily be
filtered by these criteria. The combination of KSS indexing and
index-side filtering increases both query efficiency and precision.

Nevertheless, the index structure imposes limitations on the type of
queries supported. Because keywords or keyword sets must be used to
identify the index to be searched, the system cannot support general
substring queries (though it is possible to use index-side filtering
to restrict the output of a keyword search to those also matching a
substring or regular expression query). We argue that this is not a
major limitation, as it is shared by many widely-used systems that
exist today. In addition to being a limitation inherent to systems
based on partition-by-keyword distributed indexes, the use of Bloom
filters in Gnutella's query routing
protocol~\cite{rohrs02:_query_routin_for_gnutel_networ} also limits
the user to keyword queries.


\section{Indexing Cost}
\label{sec:indexing:indexing-cost}

Techniques such as KSS improve the distribution of indexing load,
reducing the number of very large indexes --- but they do so by
creating more index entries. In order to show that this solution is
feasible, we argue that the increase in total indexing cost is
reasonable.

Using keyword set indexes rather than keyword indexes increases the
number of index entries for a file with $m$ metadata keywords from
$m$ to $I(m)$, where
\begin{displaymath}
  I(m) = \sum_{i = 1}^K \binom{m}{i} =
  \begin{cases}
    2^m - 1 & \text{if $m \le K$} \\
    O(m^K) & \text{if $m > K$}
  \end{cases}
\end{displaymath}
For files with many metadata keywords, $I(m)$ is polynomial in $m$.
If $m$ is small compared to $K$ (as for files with few
keywords), then $I(m)$ is exponential in $m$ --- but this is no worse,
since $m$ is so small anyway. The graph in
Figure~\ref{fig:I-growth} shows that $I(m)$ grows polynomially with
respect to $m$, and its degree is determined by $K$.
%This trade-off is desirable because the number of \proc{Insert} requests
%on a network is smaller than the number of \proc{Query} requests.
% Cite Gummadi or some other measurement paper for this statistic.
As discussed in~Section~\ref{sec:evaluation:choosing-k}, for many
applications the desired value of $K$ will be small (around 3 or 4),
and so $I(m)$ will be a polynomial of low degree in $m$.
% A major advantage of keyword-set indexing is that the query traffic to
% nodes responsible for popular keywords will be abated by queries going
% to more specific indexes.

\begin{figure}[tb]
  \centering
  \input{figures/egraph2d.tex}
  \caption{Growth of $I(m)$ for various values of $K$}
  \label{fig:I-growth}
\end{figure}

\section{Index Maintenance}
\label{sec:indexing:index-maintenance}

Peers are constantly joining and leaving the network. In addition to
causing index servers to become unavailable, this churn also causes
new files to become available and old files to disappear as the nodes
containing them join and leave the network. Thus, the search index
must respond dynamically to the shifting availability of the data it
is indexing and the nodes on which the index resides.  Furthermore,
certain changes in the network, such as nodes leaving without
notification, may go unnoticed, and polling for these changing
conditions is too costly, so the index must be maintained by passive
means.

\subsection{Metadata Expiration}
\label{sec:indexing:index-maintenance:metadata-expiration}

Instead of actively polling for the departure of source nodes, or
expecting source nodes to send a notification before leaving, the
index servers expire file metadata entries on a regular basis so that
long-absent files will not be returned by a search.  Nevertheless,
during the expiration delay indexes may contain out-of-date references
to files that are no longer accessible. Thus, a requesting peer must
be able to gracefully handle failure to contact source peers. To
counteract expiration, source peers periodically \emph{refresh}
metadata that is still valid by re-inserting the metadata blocks for
their files, thereby resetting its expiration counter.

The choice of expiration time balances index freshness
against maintenance cost: a short expiration time means that the index
will not contain many invalid entries, but that peers must constantly
send refresh messages for all their files. Though choosing a short
expiration time for index freshness seems desirable at first glance,
we argue in Section~\ref{sec:content-distribution:postfetching} that
there can be value in long expiration times for metadata, as it not
only allows for low refresh rates, but for tracking of attempts to
access missing files in order to artificially replicate them to
improve availability.


\subsection{Index Gateways}
\label{sec:indexing:index-maintenance:index-gateways}

If each node directly maintains its own files' metadata in the
distributed index, and multiple nodes are sharing the same file, the
metadata block for each file will be inserted repeatedly. Consider a
file $F$ that has $m$ metadata keywords and is shared by $s$
nodes. Then each of the $s$ nodes will attempt to insert the file's
metadata block into the $I(m)$ indexes in which it belongs, as in
Figure~\ref{fig:gateway-diagram:without}. The total cost for inserting
the file is therefore $\bigTheta{s \cdot I(m)}$ messages. Since
metadata blocks simply contain the keywords of a file, not information
about which peers are sharing the file, each node will be inserting
the \emph{same} metadata block repeatedly. This is both expensive and
redundant. Moreover, the cost is further increased by each node
repeatedly renewing its insertions to prevent their expiration.

\begin{figure}[tb]
  \centering
  \subfloat[Without gateway]{
    \includegraphics{figures/gateway-without}
    \label{fig:gateway-diagram:without}}
  \subfloat[With gateway]{
    \includegraphics{figures/gateway-with}
    \label{fig:gateway-diagram:with}}
  \caption{Two source nodes $S_{1,2}$, inserting file metadata block
    $M_F$ with keywords $\{a,b\}$ to three index nodes $I_{1,2,3}$,
    with and without a gateway node $G$}
  \label{fig:gateway-diagram}
\end{figure}

To minimize this redundancy, we introduce an \emph{index gateway} node
that aggregates index insertion requests. Index gateways are not
required for correct index operation, but they increase the efficiency
of index insertion using the standard technique of introducing an
intermediary. With gateways, rather than directly inserting a file's
metadata blocks into the index, each peer sends a single copy of the
block to the gateway responsible for the block (found via a Chord
\proc{Lookup} of the block's hash), as in
Figure~\ref{fig:gateway-diagram:with}. The gateway then inserts the
metadata block into all of the appropriate indexes, but only if
necessary.  If the block already exists in the index and is not
scheduled to expire soon, then there is no need to re-insert it into
the index. A gateway only needs to refresh metadata blocks when the
blocks in the network are due to expire soon, but the copy of the
block held by the gateway has been more recently refreshed.

Gateways dramatically decrease the total cost for multiple nodes to
insert the same file into the index. Using gateways, each source node
sends only one metadata block to the gateway, which is no more costly
than inserting into a centralized index. The index gateway only
contacts the $I(m)$ index nodes once, thereby reducing the total cost
from $\bigTheta{s \cdot I(m)}$ to $\bigTheta{s + I(m)}$. For files
that are shared by only a single source node, using the gateway only
negligibly increases the insertion cost, from $I(m)$ to $I(m) +
1$. Another benefit is that it spreads the cost of insertion among the
nodes in the network. A common scenario is that a node joining the
network will want to register all of the files it has available with
the index. Without index gateways, it would have to contact every
index server; with index gateways, it only needs to contact the
gateways, and the cost of contacting the index servers is spread
amongst the gateways --- which, thanks to consistent hashing, are
likely to be distributed equally around the network.


\subsection{Index Replication}
\label{sec:indexing:index-maintenance:index-replication}

In order to maintain the index despite node failure, index replication
is also necessary. As described in Section~\ref{sec:chord}, DHTs
typically accomplish this by storing multiple copies of data on
replicas determined via the lookup algorithm, or storing erasure-coded
fragments of the data in the same way. The choice between erasure
coding and replication affects not only the probability of data loss
during node failures, but also performance: replication can provide
lower read latency because only a single replica needs to be
contacted~\cite{dabek04:_desig_dht_for_low_laten}. Because metadata
blocks are small and our application requires reading from indexes to
have low latency, we opt for replication over erasure
coding. Moreover, it is not straightforward to implement server-side
processing elements such as index-side filtering in an erasure-coded
system.

Furthermore, because replicated indexes are independent, any node in
the index group can handle any request pertaining to the index (such
as a query or insertion) without interacting with any other
nodes. Arpeggio requires only \emph{weak consistency} of indexes, so
index insertions can be propagated periodically and in large batches
as part of index replication.  Expiration can be performed
independently by each index server.


% References to dig up

% Klemm Lindeman Vernon Waldhorst Characterizing the query behavior in
% p2p file sharing systems, IMC 04

% Rasti Stutzbach Rejaie On the long-term evolution of the two-tier
% gnutella overlay IEEE Global Internet Symposium 06

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "thesis"
%%% TeX-command-default: "rubber"
%%% End: 

% LocalWords:  metadata DHTs DHT lookup