%
% These are an early iteration of Dan's notes from IPTPS'05. No guarantees
% that this will be at all useful for any application.
%


\documentclass[12pt]{article}
\usepackage{fullpage}
\begin{document}

\section{Overview}

\subsection{Arpeggio}

\begin{itemize}
\item Design for a content-sharing system based on Chord
\item Joint work with Austin Clements and David Karger
\end{itemize}

\subsection{The Problem}

\begin{itemize}
\item First, let's look at the problem we're considering.
\item We want to be able to perform queries for files.
\item and, once someone's decided to obtain a file, which peers are
  sharing it.
\item Want to perform these two tasks in a fully decentralized manner
  to eliminate the scalability costs of a centralized server, and
  avoid single points of failure.
\end{itemize}


\subsection{DHTs Almost to The Rescue}

\begin{itemize}
\item DHTs provide almost the abstraction we need, but aren't quite
  adequate for our purposes
\item DHTs can give us lookup-by-name, but not search-by partial name
  or content.
\item So we need to extend the standard DHT interface, but we use the
  Lookup primitive 
\end{itemize}


\section{Searching}

\begin{itemize}
\item Begin by addressing the first of our two problems: how to
  find files matching a search query.
\item Build an index to answer these queries
\end{itemize}


\subsection{Index entries}

\begin{itemize}
\item First let's look at what we're searching
\item Example: operating system install discs
\item Metadata block -- contains information about the file
\item File ID -- used by content distribution system, but opaque to
  the file. Hash, but could be different.
\item keywords at top -- basis for search. From filename, but could be
  from file
\end{itemize}

\subsection{Distributed inverted index}

\begin{itemize}
\item For each keyword, index all files containing that keyword
\item Return to Debian disk 1 -- put it in an index for each of its
  three keywords
\item More files. Indexes can get long.
\item Now suppose a node wants to perform a query for ``debian disk1''
\item Needs to fetch both indexes and intersect them. Requires lots of
  bandwidth if indexes are long or many keywords
\end{itemize}


\subsection{Index-side filtering}

\begin{itemize}
\item Solution: store full metadata in index, add network-side
  processing.
\item This is ok since metadata is small
\item Query: send to any index.
\item Example
\item Problem: some keywords are popular and will receive many
  queries, long indexes
\end{itemize}


\subsection{Keyword-set indexing}

\begin{itemize}
\item Create an index for all keyword \emph{subsets}
\item Limit to size $K$ (network parameter) since power sets are big
\item Creates more indexes, most of which are more specific and hence
  can be shorter
\item Most queries can be handled by multi-keyword indexes
\item Still a problem with index cost for single-keyword queries, but
  can truncate these -- not as important, likely that maximum will
  provide all results the user wants.
\item Now to perform a search, send query to any (largest possible)
  subset that performs filtering
\item Example
\item Now you're thinking that we've created so many indexes that the
  entire internet is going to catch on fire. Time for me to try to
  convince you otherwise.
\end{itemize}


\subsection{Index cost}

\begin{itemize}
\item $K$-degree polynomial for large $m$
\item Exponential for small $m$, but this is only better since $m$ is
  so small
\item Graph: log scale. $K = \infty$ requires full power set,
  exponential cost, but for finite degree $K$ is polynomial.
\item And fairly nice degree for small $K$
\end{itemize}

\subsection{FreeDB}
\begin{itemize}
\item Now the theoreticians are pleased with the polynomial, but the
  systems people are looking warily at the big-O. So let's look at
  some real numbers.
\item Source: FreeDB. Approximately every CD in the known world. 1.5
  million CDs, 21 million songs, almost a petabyte, 160 years of
  audio.
\item Total keywords about 134 million. This is the number of index
  entries for $K=1$ -- the cost of an inverted index. 6.3 per song.
\item Notice that increasing to $K=3$, which is the type of value
  we're likely to consider, gives only an order of magnitude increase.
\end{itemize}

\subsection{Choosing K}

\begin{itemize}
\item I claimed K would be around 3. Let's see why.
\item Larger $K$ creates more specific indexes, which better
  distributes query load but increases indexing/storage cost.
\item Log scale graph of increase over $K=1$ inverted index.
\item Web search queries average about 2.5 keywords, so $K=3$ or $4$
  likely to be good choice.
\item Only an order of magnitude increase for $K=3/4$.
\end{itemize}


\section{Index Gateways}

\begin{itemize}
\item Hopefully now you're convinced that keyword-set indexing
  improves query load distribution without making index maintenance
  entirely infeasible.
\item Next I'm going to present a technique for improving the cost due
  to insertions.
\item Suppose a node inserts a file. It has to send the metadata block
  to $I(m)$ indexes.
\item Suppose another node inserts the same file. It'll send the
  metadata block off to the same indexes. But the metadata block is
  the same -- it doesn't contain info about how to obtain the file,
  just the file id.
\item And another one. Each of the $s$ sources will have to send
  $I(m)$ messages. This is unnecessary.
\item In fact it's actually worse due to expiration.
\item Our solution is to introduce index gateways. Each file has an
  index gateway, identified by a Lookup on the file ID. Sources send
  their metadata blocks to the index gateway, which sends them to the
  indexes only when necessary.
\item When the second node sends the same metadata block to the
  gateway, it knows that sending it to the index entries is
  unnecessary.
\item Now each source needs to contact the index gateway once, and the
  index gateway needs to contact each index node once. Reduces the
  cost to $s + I(m)$ --- reduces multiplicative cost to additive.
\end{itemize}


\section{Content Distribution}

\begin{itemize}
\item All this indexing won't do us much good if we can't actually get
  the content associated with a file.
\end{itemize}


\subsection{Indirect storage}

\begin{itemize}
\item We might want to try storing the data in a DHT. This would give
  us nice availability guarantees. But...
\item We assume the file data might be really big.
\item There are also a lot of nodes joining and leaving the
  network. Some P2P networks have average session times under five
  minutes!
\item DHT storage is impractical. We can't afford to push around large
  files all the time to people who don't necessarily want them.
\item Solution: add indirection. Now the network holds only pointers
  to the nodes that are sharing content. But how are these pointers
  stored?
\end{itemize}


\subsection{Subrings}

\begin{itemize}
\item We could use a tracker. But this would mean that one node has to
  track all the sources for a file. If the file is popular, it could
  receive unfair load. Also creates a single point of failure.
\item Instead, use subrings!
\item Recall that Diminished Chord lets us create a subring consisting
  of a subset of the nodes in the ring, using space linear in the
  number of nodes in the subring, and answer queries with logarithmic
  cost just like Chord.
\item Create a subring per file, indexed by file ID.
\item When a node joins the network with a file, or retrieves a file,
  it joins the subring.
\item All nodes in the subring are sharing the file, so find a source,
  perform a lookup for any random ID in the subring.
\item Spreads tracker costs around network.
\end{itemize}


\section{Conclusion}
\begin{itemize}
\item Let's review the novelties of this design
\item Make search possible using a distributed keyword-set
  index. Improves distribution of load.
\item Requires going beyond standard DHT Get/Put-block abstraction,
  adding network side processing in the form of index-side filtering
  and index gateways to reduce costs.
\item Distributed index of file sources via subrings.
\item The future is coming!
\end{itemize}


\end{document}
