\ifdraft
\svnInfo $Id$
\fi

\graphicspath{{figures/}}

\section{Corpora}
\label{sec:evaluation:corpora}

In order to evaluate the effectiveness of Arpeggio for various
distributed indexing applications, we use data from several sources.

\subsection{FreeDB CD Database}
\label{sec:evaluation:corpora:freedb}

FreeDB~\cite{freedb} is a database of audio CD metadata, based on the
old CDDB service. It contains the artist, album title, and track
titles for many CDs, using user-submitted data. It is commonly used in
audio players to identify a CD and display title information. Though
DHTs may be useful for this application~\cite{rhea05:_opend}, we do
not propose Arpeggio as a replacement for FreeDB, we are merely using
its database as an example corpus of the type of information that
could be indexed by Arpeggio. An Arpeggio index would allow searches
for songs whose title, artist, or album include various keywords.

This data is well-suited for Arpeggio's indexing because it consists
of many files which have large (audio) content and only a few metadata
keywords. For our evaluation, we obtained the complete FreeDB
database, and extracted the keyword information. The database contains
over 2 million discs, with a total of over 28 million songs. Each
song has an average of 6.43 metadata keywords. For comparison, the
actual data corresponding to these songs would make up approximately
214 years of audio, requiring about slightly over a petabyte of
storage in CD format.

\subsection{BitTorrent Search}
\label{sec:evaluation:corpora:bittorrent}

A natural application for Arpeggio is in indexing shared files
(torrents) in BitTorrent. Since BitTorrent itself does not include an
indexing mechanism, the task of locating shared files falls on various
index web sites, which presents scalability and reliability
challenges. The DHT-based trackerless system addresses the related
problem of finding sources for a particular file, but not searching
for files by keyword. Arpeggio can be used for the latter purpose. 

To evaluate Arpeggio's feasibility for indexing torrents, we obtained
metadata for all torrents that were added to several torrent index
sites over one week. This was done using an automated client that
downloaded lists of torrents from the sites listed in
Table~\ref{tab:evaluation:corpora:bittorrent:sites}, and extracted the
file name and other metadata from each entry. The sites indexed
included most of the top 10 most popular torrent sites (some did not
publish a list of latest torrents, so were not suitable for
indexing). Over a period of 7 days, the sites were checked for new
additions at least once every three hours, in order to obtain all
torrents that were added during the sample period.

\begin{table}[tbp]
  \centering
  \caption{BitTorrent search sites indexed}
  \label{tab:evaluation:corpora:bittorrent:sites}

  \begin{tabular}{|l|l|}
    \hline
    \textbf{Site} & \textbf{Content indexed} \\
    \hline
    \url{http://isohunt.com/} & 25 latest submitted torrents\\
    \url{http://isohunt.com/} & 25 latest indexed torrents\\
    \url{http://www.bittorrent.com/} & 10 most popular torrents\\
    \url{http://www.mininova.org/} & All torrents added today\\
    \url{http://thepiratebay.org/} & 30 latest added torrents\\
    \url{http://thepiratebay.org/} & 100 most popular torrents\\
    \url{http://www.bitenova.nl/} & 50 latest added torrents\\
    \url{http://www.torrentspy.com/} & All torrents added today\\
    \url{http://www.torrentz.com/} & 50 latest added torrents\\
    \url{http://www.btjunkie.org/} & 240 latest added torrents\\
    \url{http://www.btjunkie.org/} & 30 most popular torrents\\
    \url{http://www.torrentreactor.net/} & All torrents added today\\
    \url{http://www.torrentreactor.net/} & 20 most popular torrents\\
    \url{http://www.meganova.org/} & All torrents added today\\
    \url{http://www.torrentportal.com/} & All torrents added today\\
    \hline
  \end{tabular}
\end{table}

\subsection{Gnutella}
\label{sec:evaluation:corpora:gnutella}

In terms of intended use, Arpeggio is quite similar to Gnutella, since
both can be used for fully-decentralized peer-to-peer file sharing
networks. Hence, we evaluated Arpeggio using a Gnutella-based
workload.

Gnutella workloads have been measured and studied
extensively~\cite{saroiu02:_measur_study_of_peer_to,
  zhao06:_charac_files_in_moder_gnutel_networ,
  stutzbach04:_charac_unstr_overl_topol_in,
  gummadi03:_measur_model_and_analy_of,
  klemm04:_charac_query_behav_in_peer}, and, indeed, we use results
from these studies to determine our user model in our
analysis. However, the effectiveness of Arpeggio's indexing algorithms
depends heavily on the nature of the metadata of the files, in
particular the number of keywords per file and per query. Since no
prior studies report on these properties, we conducted our own
measurement study of Gnutella to obtain a sample of query strings and
file names.

We conducted a measurement study using an instrumented version of the
popular LimeWire Gnutella client~\cite{limewire}\footnote{Though one
  would not expect the particular Gnutella client used to be
  particular important, incompatibilities between clients made it
  important to use a popular, well-supported client. In particular,
  our earlier attempt to index Gnutella using an instrumented version
  of Mutella~\cite{mutella} was foiled because LimeWire clients, which
  make up much of the network, were unable to connect to it.}. The
purpose of the study was solely to obtain information about file names
and queries; we did not attempt to measure other properties, such as
node lifetime, user behavior, etc. In order to do so, we modified the
client to record query requests and results that passed through it,
and added a programmatic interface for our measurement scripts to
inject queries. The client was also modified to always act as an
ultrapeer (supernode), which meant that it was able to observe both
queries received from its client leaves as well as those routed from
other ultrapeers in the network. Monitoring these queries gave a
sample of the queries that users were performing. Our client reissued
these queries (at a rate of approximately one query per second),
causing it to receive lists of all hosts with matching files. Whenever
a query hit was received from a previously-unseen host, the indexer
contacted that host directly with a ``\proc{Browse-Host}'' request,
prompting it to return all files it was sharing. These files were, in
turn, used to generate additional queries: approximately once per
second, a file was randomly selected and its keywords were set as a
query, in order to discover and browse other hosts that had that file
available.

Performing this sampling process for 24 hours, our client observed
2,661,219 queries\footnote{Most of which cannot be reprinted for
  reasons of decency.} and 10,869,055 query results (including replies
to both its queries and its browse requests). Because the queries and
hosts are selected uniformly at random, we expect the results to be
representative of Gnutella file names and queries as a whole. It is
not, however, without bias: older clients that do not support the
\proc{Browse-Host} protocol are excluded from consideration, as are
newer clients whose users have specifically disabled the
feature. However, these clients should be rare, as the most two common
Gnutella clients, LimeWire and BearShare, which comprise over 90\% of
the network~\cite{stutzbach04:_charac_unstr_overl_topol_in}, both
support this feature and enable it by default. Note that hosts behind
firewalls and NATs do not pose a problem, because our indexer uses the
standard Gnutella hole-punching mechanisms in order to send its
requests.

\section{Index Properties}
\label{sec:evaluation:index-properties}

Since Arpeggio's algorithms rely upon the assumption that the amount
of metadata associated with each file is small, it is important to
investigate the number of metadata keywords associated with each file
in the three
corpora. Figure~\ref{fig:evaluation:index-properties:file-keyword-dist}
shows the number of files in each corpus that have a certain number of
keywords. Notably, the average number of keywords is higher for
FreeDB: 8.40 for FreeDB vs.\ 6.72 for torrents and 5.33 for
Gnutella. This difference occurs because each song in FreeDB has an
artist, album title, and song title associated with it, whereas the
other corpora include only metadata taken from file names. Also, each
corpus includes a small number of files with many metadata keywords;
these can have a disproportionately large effect on the total index
size, particularly for larger values of the keyword-set size parameter
$K$.

In order to reduce the number of keywords per file, we can exclude
\emph{stopwords}: common words that are not included in the index
because they occur so frequently that they are not useful in
searches. This technique is used frequently in search systems. The
obvious downside of excluding stopwords is that queries containing
only stopwords (such as ``the'') cannot return any replies; a more
subtle point is that removing stopwords from queries may decrease the
effectiveness of keyword-set indexing if, for example, only one
keyword remains. For our purposes, we use a list of stopwords
containing the standard common English words, words of two or fewer
characters, and common file type extensions\footnote{Users may wish to
  search for files of a given file type. Even with stopwords, Arpeggio
  still provides this capability by using index-side filtering to
  restrict the returned search results to files of the correct
  type.}. We present many of the following results both with and
without stopwords excluded.

\begin{figure}[tbp]
  \centering
  \subfloat[No stopwords]{
    \input{figures/keyword-dist-no-stopwords}
    \label{fig:evaluation:index-properties:file-keyword-dist:no-stopwords}}
  \subfloat[Excluding stopwords]{
    \input{figures/keyword-dist-stopwords}
    \label{fig:evaluation:index-properties:file-keyword-dist:stopwords}}
  \caption{Distribution of number of keywords per file, with and
    without stopwords}
  \label{fig:evaluation:index-properties:file-keyword-dist}
\end{figure}

Figure~\ref{fig:evaluation:index-properties:query-keyword-dist} shows
the distribution of number of keywords per Gnutella query (query
information is unfortunately not available for the other two proposed
applications). Most Gnutella queries include many keywords; the
average query contained 5.33 keywords, and less than 1\% contained
only one. Even after discarding stopwords, the average query length
remained over 3.72 keywords, with only about 20\% contained only one
keyword. In comparison, Reynolds and Vahdat observed that for web
search engines, the average query length was 2.53 keywords, with 28\%
containing only one keyword, making them considerably less specific
than our Gnutella
queries~\cite{reynolds03:_effic_peer_to_peer_keywor_searc}.

\begin{figure}[tbp]
  \centering
  \input{figures/gnutella-query-dist}
  \caption{Distribution of number of keywords per Gnutella query}
  \label{fig:evaluation:index-properties:query-keyword-dist}
\end{figure}

\subsection{Choosing $K$}
\label{sec:evaluation:choosing-k}

The effectiveness and feasibility of Arpeggio's indexing system depend
heavily on the chosen value of the maximum subset size parameter
$K$. Recall that $K$ is the upper bound on keyword subset size that
will be indexed; i.e. setting $K = 1$ creates a simple inverted index,
$K = 2$ also includes keyword pairs, $K = 3$ also includes triplets,
etc. If $K$ is too small, then the KSS technique will not be as
effective: there will not be enough multiple-keyword indexes to handle
most queries, making long indexes necessary for result quality.  If
$K$ is too large, then the number of index entries required grows
exponentially, resulting in infeasible storage requirements and
maintenance bandwidth requirements. Most of these index entries will
be in many-keyword indexes that will be used only rarely, if at all,
so little benefit is achieved.

The optimum value for the parameter $K$ depends on the application,
since both the number of metadata keywords for each object and the
number of search terms per query vary. The Gnutella query trace showed
an average of 3-5 keywords per query, depending on whether stopwords
are included or excluded, which suggests that value can be obtained
from increasing $K$ to these amounts, though our experiments below
show that the incremental value of increasing $K$ above 3 is small. 

In the following sections, we analyze the costs required to build an
Arpeggio index for varying values of $K$, as well as the benefits
achieved in terms of improved load distribution.

\section{Indexing Cost}
\label{sec:evaluation:indexing-cost}

We now consider the effects of keyword-set indexing on cost. There are
two factors that we must consider: the increase in total storage costs
that result from creating keyword set indexes, and the improvement in
query load distribution that results from being able to send queries
to more specific indexes.

\subsection{Index Size}
\label{sec:evaluation:indexing-cost:index-size}

In order to accurately determine the storage costs imposed by
Arpeggio's use of keyword-set indexes, we analytically computed the
storage that would be required to index each of our three
corpora. This was done by computing the number of keywords for each
file, optionally excluding stopwords, and computing the number of
index entries required for that file using the formula for $I(m,K)$ in
Section~\ref{sec:indexing:indexing-cost}.


\begin{figure}[tb]
  \centering
  \subfloat[No stopwords]{
    \input{figures/index-size-vs-k-no-stopwords}}
  \subfloat[Excluding stopwords]{
    \input{figures/index-size-vs-k-stopwords}}
  \caption{Index size increase vs. $K=1$, with and without stopwords}
  \label{fig:evaluation:indexing-cost:index-size:vs-k}
\end{figure}

The results are shown in
Figure~\ref{fig:evaluation:indexing-cost:index-size:vs-k}. The graph
shows the total number of index entries relative to the $K=1$ case, on
a logarithmic scale. Recall that the $K=1$ case corresponds to the
standard inverted index case, so the graph shows the extra cost
required to maintain keyword-set indexes. From this, we see that our
proposed value of $K=3$ or $K=4$ is feasible, as it results in only an
increase of one order of magnitude in index size. Note also that, even
though the overall index size has increased, its distribution is
improved because there are a larger number of keyword indexes that can
be stored on different nodes.


% \begin{figure}[tbp]
%   \centering
%   \input{figures/index-load-distribution}
%   \caption{Relative index size distribution for various $K$}
%   \label{fig:evaluation:indexing-cost:index-size:dist}
% \end{figure}

% \begin{figure}[tbp]
%   \centering
%   \input{figures/index-load-99-percentile}
%   \caption{99\% percentile relative index size for various $K$}
%   \label{fig:evaluation:indexing-cost:index-size:99}
% \end{figure}


\subsection{Query Load}
\label{sec:evaluation:indexing-cost:query-load}

Next, we consider the benefits obtained from the use of keyword-set
indexing. We will evaluate how the distribution of query load
improves: whether the number of nodes that receive a
disproportionately high number of queries is reduced. To do this, we
simulate the Gnutella workload. We begin by creating 1000 Chord nodes;
each operates 8 virtual nodes with randomly-assigned hash IDs to
improve load balancing. We then consider every query in our
trace of 2.6 million Gnutella queries, compute its keywords, and
select an appropriate keyword set index. The query load is computed in
terms of number of queries.

\begin{figure}[tb]
  \centering
  \input{figures/query-load-distribution}
  \caption{Query load distribution for various $K$}
  \label{fig:evaluation:indexing-cost:query-load:dist}
\end{figure}

\begin{figure}[tb]
  \centering
  \input{figures/query-load-99-percentile}
  \caption{99th-percentile query load vs. K}
  \label{fig:evaluation:indexing-cost:query-load:99}
\end{figure}

The resulting distribution is shown as a CDF in
Figure~\ref{fig:evaluation:indexing-cost:query-load:dist}. The query
load on the horizontal axis is normalized such that a value of 1
represents the average load (i.e. $\nicefrac{\text{total
    queries}}{\text{\# nodes}}$). An ideal distribution would be a
steep curve around 1. We can see that the $K=1$ case is significantly
less balanced than the others, the $K=2$ case improves somewhat, and
the $K \ge 3$ cases are not easily distinguishable from each
other. Perhaps more comprehensibly,
Figure~\ref{fig:evaluation:indexing-cost:query-load:99} shows that the
99th-percentile query load decreases as a result of increasing the
value of $K$.

\section{Maintenance Cost}
\label{sec:evaluation:indexing-cost:maintenance}

Though index size is an important consideration for evaluating the
feasibility of Arpeggio, it is an incomplete measure without also
considering the costs of maintaining the index. Indeed, the principal
bottleneck likely to be encountered is not storage space, but
\emph{bandwidth} requirements. These include the costs of satisfying
$\proc{Insert}$ and $\proc{Query}$ requests, replicating the index,
and running the Chord protocol.
To this end, we performed a packet-level simulation of the Arpeggio
protocol under load, and evaluated the bandwidth usage.

\subsection{Simulation Model}
\label{sec:evaluation:indexing-cost:maintenance:method}

Our simulation was performed using the \texttt{p2psim} discrete-event
simulator for peer-to-peer protocols~\cite{p2psim}. We produced a
simulator implementation of Arpeggio for \texttt{p2psim} that shares
all the major features of the full-fledged implementation with the
exception of a simplified metadata model; the simulator version
handles only metadata derived from file names, and does not support
additional fields for rich metadata. It includes KSS, index-side
filtering, index gateways, and replication. The lookup layer is
simulated using \texttt{p2psim}'s implementation of Chord.

Because nodes in peer-to-peer networks are highly heterogeneous with
respect to their stability and capacity, our model takes this
heterogeneity into account. We use a two-level structure inspired by
Gnutella's ultrapeers: the core of the network is made up of a stable
set of nodes that form a Chord ring and operate Arpeggio index
servers, and a set of leaf nodes send requests to the core nodes but
do not maintain indexes themselves. We assume that each core node
supports 10 leaf nodes. This is consistent with Gnutella, where the
fraction of ultrapeer nodes is in the 10--15\%
range~\cite{stutzbach04:_charac_unstr_overl_topol_in}. The most common
Gnutella client connects to three ultrapeers, each of which allow 30
leaf node connections; we neglect this leaf multihoming for
simplicity. We use an exponentially-distributed core node lifetime of
106 minutes, following Loo et al\.'s observation that this is the
average lifetime of Gnutella nodes after excluding those with
lifetimes less than 10
minutes~\cite{loo04:_case_for_hybrid_p2p_searc_infras}. For
connectivity between core nodes, we use \texttt{p2psim}'s King network
topology, which consists of an all-pairs matrix of latencies between
DNS servers measured using the King tool~\cite{gummadi02:_king}.

Each of the ten leaf nodes connected to each leaf node is simulated as
follows, using the model of
\cite{ge03:_model_peer_peer_file_sharin_system}: the leaf is randomly
assigned a class $i$ (with probability $p_i$, which determines its
behavior. Once connected, the node begins sending queries with an
exponentially-distributed query inter-arrival time with mean
$\lambda^{(i)}_{\text{idle}}$. Queries are randomly selected from our
  Gnutella query trace. After each query, the client has probability
  $p^{(i)}_{\text{leave}}$ of going offline. If so, the time until a new
client arrives is exponentially-distributed with mean
$\lambda_\text{join}$. After remaining online for five minutes, a
client will begin to register its files' metadata with the index at a
rate of one file per second.

\begin{table}[tbp]
  \centering
  \begin{tabular}{|rr|c|c|}
    \hline
    \multicolumn{2}{|c}{\textbf{Parameter}} &
    \multicolumn{2}{|c|}{\textbf{Client Class}} \\
    & & \multicolumn{1}{c}{\textbf{1}} & \multicolumn{1}{c|}{\textbf{2}} \\
    \hline
    class probability & $p_i$ & 0.25 & 0.75 \\
    query idle time & $\lambda^{(i)}_{\text{idle}}$ & 30 s & 300 s \\
    departure probability & $p^{(i)}_{\text{leave}}$ & 0.1 & 0.2 \\
    shares files? & & no & yes \\
    \hline
  \end{tabular}
  \caption{Client model parameters}
  \label{tab:evaluation:indexing-cost:maintenance:model:parameters}
\end{table}

Our model uses two client classes. Based on Gnutella
measurements~\cite{saroiu02:_measur_study_of_peer_to}, 25\% of clients
are freeloaders (only downloading files, not sharing), and 75\% share
files. The parameters for these two classes are shown in
Table~\ref{tab:evaluation:indexing-cost:maintenance:model:parameters}. The
mean time for new client arrival $\lambda_\text{join}$ was set to 300
seconds. For non-freeloader peers, the number of files was chosen
using a logarithmic distribution similar to that described
in~\cite{saroiu02:_measur_study_of_peer_to}. The files were selected
using a Zipf popularity distribution, with keywords for each file
randomly selected from our Gnutella trace.

Arpeggio was configured with two replicas per index, with $K=3$, and
performed replica synchronization every five minutes.

\subsection{Results}
\label{sec:evaluation:indexing-cost:maintenance:results}

We ran the simulator for varying numbers of nodes, and simulated for
720 seconds each time. The graph of per-node bandwidth usage is shown
in
Figure~\ref{fig:evaluation:indexing-cost:maintenance:results:bw}. Note
that the simulator is operating in a two-tier configuration with 1
core node per 10 leaf nodes; the number of clients shown is the number
of leaf nodes, and the number of core nodes is
$\nicefrac{1}{10}^{\text{th}}$ that amount. Only bandwidth between
core nodes is counted; we are neglecting the bandwidth between leaf
nodes and their core node for this experiment. The subset of the
bandwidth that is used for Chord lookups and routing table maintenance
is shown separately to distinguish it from the costs of Arpeggio's
indexing. The data is noisy due to the degree of randomness inherent
in the client behavior in this experiment, but shows that the
maintenance costs scale well, remaining approximately the same even as
the number of index nodes and clients are increased by a factor of
more than 10.

\begin{figure}[tbp]
  \centering
  \input{figures/sim-bandwidth-720}
  \caption{Simulation results: average per-node bandwidth}
  \label{fig:evaluation:indexing-cost:maintenance:results:bw}
\end{figure}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "thesis"
%%% TeX-command-default: "rubber"
%%% End: 

% LocalWords:  CDDB CDs metadata FreeDB Gnutella BitTorrent DHTs LimeWire
% LocalWords:  ultrapeer ultrapeers supernode stopwords online
