\documentclass[twocolumn,11pt]{article}
\usepackage{fullpage}
\usepackage{url}
\usepackage{graphicx}
\usepackage{times}
\usepackage[small,compact]{titlesec}

% Use clrscode for \Proc. Probably a bit of overkill.
\usepackage{clrscode}

% Scruntch lists and bibliography
\let \olditemize = \itemize
\renewcommand{\itemize}{\olditemize{}\itemsep=-\parskip}
\let \oldenumerate = \enumerate
\renewcommand{\enumerate}{\oldenumerate{}\itemsep=-\parskip}
\let \oldlist = \list
\renewcommand{\list}[2]{\oldlist{#1}{#2}\itemsep=-\parskip}

% Draftification
\newif\ifdraft
%\drafttrue
\draftfalse

\AtBeginDocument{
\ifdraft
\typeout{WARNING: margin notes visible}
\fi
}

% Margin notes
\newcommand{\marnote}[1]{%
\ifdraft
\setlength{\marginparwidth}{\oddsidemargin}%
\addtolength{\marginparwidth}{1in}%
\addtolength{\marginparwidth}{-\marginparsep}%
\marginpar{\raggedright \tiny \em #1}
\fi}

%\raggedbottom

\title{Arpeggio:\\\LARGE{Efficient Metadata-based Searching and File Transfer with DHTs}}
\author{Austin Clements, Dan Ports, David Karger \\
  MIT Computer Science and Artificial Intelligence Laboratory \\
\small{\texttt{\{aclements, drkp, karger\}@mit.edu}}}
% FIXME: What is the appropriate date?

\begin{document}

\maketitle

% Overall things
%   We're freely mixing ``file'' and ``data'' and ``content''
%   We should probably also check consistency of ``peer'' and ``node''
%   We should italicize our terminology (I think we mostly do)

\begin{abstract}
  \small{\textit{\emph{Arpeggio} is a peer-to-peer file-sharing network
    based on the Chord distributed hash table. Queries for files whose
    metadata matches a certain criterion are performed efficiently by
    using a distributed keyword-set index, augmented with index-side
    filtering.  We introduce \emph{metadata gateways}, a technique for
    minimizing index maintenance overhead. \emph{Arpeggio} also uses
    the DHT for indirect storage of file contents, maintaining
    pointers from content to the live peers that provide it. Finally,
    we introduce \emph{postfetching}, a technique that uses
    information in the index to improve the availability of rare
    files.  The result is a system that provides efficient query
    operations with the scalability and reliability advantages of full
    decentralization, and a content distribution system tuned to the
    requirements of a peer-to-peer file-sharing network.}}
\end{abstract}

\section{Overview and Related Work}
\label{sec:overview}

Recent years have seen a dramatic increase in the popularity of
peer-to-peer file sharing systems. These systems, which let users
locate and obtain files shared by other users, have many advantages:
they operate more efficiently than the traditional client-server model
by utilizing peers' upload bandwidth, and can be implemented without a
central server. However, many current file sharing systems trade-off
scalability for correctness, resulting in either systems that scale
well but sacrifice completeness of search results or vice-versa.
% FIXME: do a better job of explaining the problem with other networks
% (maybe talk about Overnet/Kad here?)

% Peer to peer file sharing networks are good.

% Except that they suck.

% DHTs are good.
Distributed hash tables have become a standard for constructing
peer-to-peer networks because they overcome the difficulties of
quickly and correctly locating peers.  However, the \emph{lookup by
name} primitives provided by DHTs are not immediately sufficient to
perform complex \emph{search by content} queries of the data stored
in the network.  It is not clear how to construct search interfaces
without sacrificing scalability or query completeness.  Indeed, when
it comes to systems that search for documents, Li et
al.~\cite{feasibility} have shown that the obvious approaches to
distributed document search do not scale well.
% Kademlia sucks less.

% Design goals:
%  be scalable
%  perfect recall
%  deal with churn
%  [what am I missing?]

In this paper, however, we consider systems, such as file sharing,
that search only over a relatively small amount of \emph{metadata}
associated with each file.  The relative sparsity of per-document
information in such systems allows for techniques that do not apply in
general document search.  We present the design for \emph{Arpeggio},
which uses the lookup primitives of the Chord distributed hash
table~\cite{chord} to support metadata search and file distribution.
This design retains many of the advantages of a central index, such as
completeness and speed of queries, while providing the scalability and
other benefits of full decentralization.  Arpeggio uses only a
constant amount of space for each distinct file being indexed, and
resolves any query with a constant number of Chord lookups.  The
system can consistently locate even rare files scattered throughout
the network, thereby achieving perfect recall.

In addition to the search process, we consider the process of
transferring found files to those who want them. We use the DHT to
optimize content distribution. Rather than using it for direct
storage, we use a layer of indirection: the DHT tracks which peers are
sharing which content. As in traditional file-sharing networks, files
may only be intermittently available. We propose an architecture for
resolving this problem by recording in the DHT requests for
temporarily unavailable files, then actively increasing their future
availability.

Like most file-sharing systems, \emph{Arpeggio} includes two
subsystems concerned with searching and with transferring files.
Section~\ref{sec:searching} examines the problem of building and
querying distributed search indexes.
Section~\ref{sec:index-maintenance} examines how the indexes are
maintained once they have been built.
Section~\ref{sec:content-distribution} turns to how the DHT can be
leveraged to improve the transfer and availability of files. Finally,
Section~\ref{sec:conclusion} reviews the novel features of this
design.
% FIXME: Section needs more ``related work''

\section{Searching}
\label{sec:searching}
% The searching problem -- identify an object based on partial
% metadata. Explain metadata.

A file-sharing system must be able to translate a search query from a
user into a list of files that fit the description and a method for
obtaining them. Each file shared on the network has an associated set
of metadata: the file name, its format, etc. Specific types of files
can have more descriptive metadata; for example, the song title and
artist name can be extracted from MP3 music files via the ID3 tags.

Analysis based on required communications costs suggests that
peer-to-peer keyword indexing of the Web is infeasible because of the
size of the data set \cite{feasibility}. However, peer-to-peer
indexing for file sharing networks remains feasible, because searches are
performed using only file metadata, not contents. The size of file
metadata is expected to be on the order of a few keywords, much
smaller than the text of an average Web page.

\subsection{Background}
\label{sec:searching:background}

% Searching is hard. Central inverted index, Gnutella/flooding,
A natural approach is to maintain a centralized index of the files in
the network, as done by the Napster \cite{napster} file sharing
network. Structured overlay networks based on distributed hash tables
show promise for simultaneously achieving the recall advantages of a
centralized index and the scalability and resiliency attributes of
decentralization. Distributed hash location services such as Chord
\cite{chord} provide a \proc{Lookup} primitive that maps a key to the
node responsible for its value. It does so efficiently, with at most
$O(\log n)$ messages per lookup in an $n$-machine network, and minimal
overhead for routing table maintenance. Building on this primitive,
DHash \cite{dhash} and other distributed hash tables provide a
standard \proc{Get-Block}/\proc{Put-Block} hash table abstraction.
However, this interface alone does not allow search based on metadata
keywords.

\subsection{Distributed Indexing}
\label{sec:searching:distributed-indexing}

% Solution progression: Distributed inverted index w/ querier join (or
% distributed join) -> keyword-set index w/ querier filter ->
% keyword-set index w/ network-side filter

A reasonable starting point is a \emph{distributed inverted index}. In
this scheme, the DHT maps each keyword to a list of all files whose
metadata contains that keyword, thus distributing the index throughout
the network. To execute a query, a node performs a \proc{Get-Block}
operation for each of the query keywords and intersects the resulting
lists.  The principal disadvantage is that the keyword index lists can
become prohibitively long, particularly for very popular keywords
(e.g.  ``the''), so retrieving the entire list will generate
tremendous network traffic.

Performance of a keyword-based distributed inverted index can be
improved by performing \emph{index-side filtering} instead of joining
at the querying node. Because our application postulates that metadata
is small (unlike in the general case of indexing entire documents),
the entire contents of the metadata can be kept in the index with each
item. To perform a query involving a metadata term, we can send the
full query to the corresponding index node, and it can perform the
filtering and return only relevant results.  This dramatically reduces
network traffic at query time, since only results relevant to the full
query need to be transmitted.  This is similar to the search algorithm
used by the Overnet network \cite{overnet}, which uses the Kademlia
DHT \cite{kademlia}. Note that index-side filtering extends the DHT
abstraction beyond the simple \proc{Get-Block} request: the DHT is no
longer a mere data structure; it now supports network-side processing.
% Overnet, distributed join, etc.

% KSS!
Although index-side filtering reduces query bandwidth, machines
responsible for the large inverted indexes of common search terms will
have disproportionately large storage and query loads.  As an initial
fix for this problem, we \emph{truncate} large inverted indices,
keeping only those items that best-match the inverted index term.
This limits storage load, but sacrifices correctness: multi-term
queries directed to this index may have matches in the full list that
are not in the truncated list, hiding some correct results.

To overcome this problem, we propose to build inverted indexes not
only on keywords but also on keyword \emph{sets}. For each file, an
index entry is created for each subset of at most $k$ metadata terms;
the maximum set size $k$ is a parameter of the network. This is the
Keyword-Set Search system (KSS) introduced by Gnawali \cite{kss}.
Essentially, this scheme allows us to precompute the full-index answer
to all queries of up to $k$ keywords.  For queries of more than $k$
keywords, truncation may still be necessary.  However, it seems likely
that in most cases there will be a ``rare'' $k$-word subset of the
query with a small inverted index. This index will suffer little or no
truncation, and will thus be more likely to hold all relevant results.

Using keyword set indexes rather than keyword indexes increases the
number of index entries for a file with $m$ metadata keywords from
$O(m)$ to $O\left(\min(m^k, 2^m)\right)$.  However, it reduces the
importance of single-keyword indexes: they will be used only to answer
single-word queries, and the negative effects of truncation on these
queries are minimized. Reynolds and Vahdat observed that less than
29\% of web search queries included only one keyword
\cite{keyword-searching}.
%This trade-off is desirable because the number of \proc{Insert} requests
%on a network is smaller than the number of \proc{Query} requests.
% Cite Gummadi or some other measurement paper for this statistic.
Further analysis will be necessary to identify the optimum value for
the parameter $k$; however, the average number of terms for web
searches is approximately $2.53$, so $k$ is likely to be around $3$
\cite{keyword-searching}.  A major advantage of keyword-set indexing
is that the query traffic to nodes responsible for popular keywords
will be abated by queries going to more specific indexes.

% Network-side filtering. Explain keyword/filterable metadata.
% unnecessary network traffic is eliminated by filtering inside of the
% network instead of on the originator of the query
In \emph{Arpeggio}, we combine KSS indexing with index-side filtering,
as described above: indexes are built for keyword sets and results are
filtered on the index nodes. We make a distinction between
\emph{keyword metadata}, which is easily enumerable, and therefore can
be used to partition indexes with KSS, and \emph{filterable metadata},
which can further constrain a search. Index-side filtering allows for
more complex searches than KSS alone. A user may only be interested in
files of size greater than 1 MB, files in \texttt{tar.gz} format, or
MP3 files with a bitrate greater than 128 Kbps, for example. It is not
practical to encode this information in keyword indexes, but the index
obtained via a KSS query can easily be filtered by these criteria.

\subsection{Index Structure and Algorithms}
\label{sec:searching:index-structure-and-algorithms}

We now describe in detail the index representation used by
\emph{Arpeggio}.  All index data is stored in a distributed hash table
similar to DHash \cite{dhash}.  However, in addition to the standard
\proc{Get-Block} and \proc{Put-Block} hash table abstractions, it must
also support a \proc{Filtered-Get} operation that returns all list
entries that match certain criteria.  Everything \emph{Arpeggio}
stores in the DHT is in the form of a block with a tag indicating the
type of block.

A \emph{metadata block} contains the keyword and filterable metadata
for a file as well as information for obtaining the file contents.
Each unique file has a corresponding metadata block that holds its
metadata. When a node registers a file with the index (for instance,
when it comes online), it inserts a copy of the file's metadata block
under each of the file's keyword sets.  The location information can
take multiple forms, since our indexing system is independent of the
content distribution system. For the content distribution system we
describe in Section~\ref{sec:content-distribution}, the location is
the block ID of the file block.

To perform a query, a node first selects some largest possible subset
$K$ (up to size $k$) of the keyword metadata in the query. It then
sends a \proc{Filtered-Get} request with the query as parameter to the
node responsible for the metadata blocks associated with the
keyword-set $K$. The contacted node searches its index to find the
entries that match the full query, and returns them to the querying
node.


\section{Index Maintenance}
\label{sec:index-maintenance}

Peers are constantly joining and leaving the network.  Thus, the
search index must respond dynamically to the shifting availability of
the data it is indexing and the nodes on which the index resides.
Furthermore, certain changes in the network (such as nodes leaving
without notification) may go unnoticed, and polling for these changing
conditions is too costly, so the index must be maintained by passive
means under such circumstance. 

\subsection{Metadata Expiration}
\label{sec:index-maintenance:expiration}
% Expiration needed to deal with changing content
Instead of polling for departures, or expecting nodes to notify us of
them, we expire metadata on a regular basis.  This guarantees that
long-absent files will never be returned by a search.  Blocks may
contain out-of-date references to files that are no longer accessible.
Thus, a requesting peer must be able to gracefully handle failure to
contact source peers.  Pointers to file blocks, on the other hand, are
guaranteed to remain valid in the face of network topology changes
because the DHT handles replication of blocks.  To counteract
expiration, we \emph{refresh} metadata that is still valid, thereby
periodically resetting its expiration counter.

While one might think that metadata should be expired frequently so
that searches are up to date, we argue in
Section~\ref{sec:content-distribution:postfetching} that there is
actually great value in keeping old metadata around to match queries,
even if the corresponding files are no longer available.  In
particular, we track attempts to access those missing files and
\emph{postfetch} them to other nodes in the DHT if they become
available after the access attempt.  This lets us identify files that
are in demand but sporadically available, and replicate them to
increase their availability.

% Exploiting heterogeneity
%   partial presence and multi-presence
%     dealing with churn
%   compared against ultra-peer systems
% [Is this the right section for this?]

\subsection{Metadata Gateways}
\label{sec:index-maintenance:gateways}
% Meta-data gateways (supermetablockination)
%    Method for aggregation of redundant network traffic
%    Interaction with meta-data block expiration
%      Shortish expiration time for block providers list
%      Long expiration time for meta-data block
%        Used for active caching
%        [Should this be explained in transferring section?]
%        [ ^-- Yes, probably. Mention longer expiration time and
%        cross-reference?]
%    Hysteresis of initial block insertion
%      [Is this useful for anything more than making sure the right
%       person suffers initial insertion?]
%    Peers bump gateway, gateway bumps/updates meta-data blocks

Having each node directly maintain its own files' metadata in the
network will result in large numbers of redundant inserts and
refreshes of metadata blocks representing very common files.
Insertion to many distinct keyword search sets is relatively
expensive, so this will also consume a large amount of bandwidth.  The
problem can be solved by introducing a \emph{metadata gateway} node to
the process of insertion.

\begin{figure}[htbp]
  \centering
  \includegraphics{figures/gateway}
  \caption{Two file source nodes inserting the same file $F_1$ via the
    $\langle$a b$, F_1\rangle$ metadata gateway}
  \label{fig:gateway-diagram}
\end{figure}

A metadata gateway aggregates updates to metadata blocks. Gateways
leverage the fact that individual nodes insert many copies of the same
block and, for common files, many different nodes will also insert the
same block.  To use a metadata gateway, instead of directly inserting
metadata blocks into the network, a peer sends a single copy of the
block to the gateway responsible for the block (found via the block's
hash). The gateway then inserts the metadata block under all of the
appropriate keys in the network, but \emph{only} if necessary.  If the
block already exists in the network and is not scheduled to expire
soon, then there is no need to re-insert it into the network. A
gateway only needs to refresh metadata blocks if the blocks in the
network are going to expire soon, but the copy of the block held by
the gateway has been more recently refreshed.

Using gateways, insertion by content-holding nodes is no more costly
than for a centralized index system, though the gateways themselves do
consume bandwidth to insert into the index.  Furthermore, gateways
reduce the total network bandwidth incurred by managing metadata
blocks from being proportional to the total number of files to being
proportional to the number of unique files.

%% While gateways can greatly reduce traffic for common metadata blocks,
%% they can increase bandwidth requirements for uncommon metadata blocks:
%% these metadata blocks must be stored on the gateway even though they
%% will rarely be reused.  To solve this problem, gateways reject any new
%% block the first time it encounters it, and the originating peer is
%% responsible for inserting the copies of the block.  The gateway
%% maintains a sliding window of these rejected blocks so it can
%% recognize the second insertion of a block and take responsibility for
%% future refreshing of that block.  By limiting the size of this window,
%% the gateway will not use unnecessary storage for uncommon metadata
%% blocks, thereby effectively restricting the gateway to handling blocks
%% for which aggregation is useful.

\subsection{Index Replication}
\label{sec:index-maintenance:replication}
% Lazy index replication
%   and replication in general
In order to maintain the index despite node failure, metadata block
replication is also necessary.  Because metadata blocks are small and
reading from them must be low-latency, replication is used instead of
erasure coding \cite{dhash}.  Furthermore, because replicated indexes
are independent, any node in the index group can handle any request
pertaining to the index (such as a query or insertion) without
directly interacting with any other nodes. \emph{Arpeggio} requires
only \emph{weak consistency} of indexes, so index insertions can be
propagated periodically and in large batches as part of index
replication.  Expiration can be performed independently because
metadata blocks will expire on all nodes simultaneously.

\section{Content Distribution}
\label{sec:content-distribution}
% Searching can support *any* content distribution system
The indexing system we describe above simply provides the ability to
search for files that match certain criteria. It is independent of the
file transfer mechanism. Thus, it is possible to use an existing
content distribution network in conjunction with \emph{Arpeggio}. A
simple implementation might simply store a HTTP URL for the file in
the metadata blocks.  Alternatively, a content distribution network
such as Coral \cite{coral} could be used. 
% For example, can just store files in the DHT (as with CFS, etc)
% This doesn't work well because of churn, popular files, etc.
A DHT can be used for direct storage of file contents, such as in
distributed storage systems like CFS \cite{cfs}. However, because of
the large file sizes and amount of churn in a typical file sharing
network, \emph{Arpeggio} uses \emph{indirect storage}: the DHT stores
pointers to each peer that contains a certain file.

% But we can use the DHT and get some powerful advantages. Enter
% Progression.
A peer can query this index to identify all other peers that are
sharing a file it wishes to obtain. The downloader may begin
transferring the file from multiple sources in order to distribute the
load and maximize transfer speed. Our algorithm is designed to
maximize this effect.


\subsection{Segmentation}
\label{sec:content-distribution:segmentation}

% Deal with chunks rather than files

For purposes of content distribution, we segment all files into a
sequence of \emph{chunks}. Rather than tracking which peers are
sharing a certain file, \emph{Arpeggio} tracks which chunks comprise
each file, and which peers are currently sharing each chunk. This is
implemented by storing in the DHT a \emph{file block} for each file,
which contains a list of pointers to \emph{chunk blocks}, which in
turn contain a list of the peers sharing each chunk, as in
Figure~\ref{fig:file-chunk-diagram}.  File and chunk block IDs are
derived from the hash of the file or chunk contents to ensure that
file integrity can be verified.

\begin{figure}[htbp]
  \centering
  \includegraphics{figures/chunks}
  \caption{Division of files into chunks}
  \label{fig:file-chunk-diagram}
\end{figure}

%  The confusing diagram
%  Can share chunks between similar files
%  Partially downloaded files
The rationale for this design is twofold. First, peers that do not
have an entire file are able to share the chunks they do have: a peer
that is downloading part of a file can at the same time upload other
parts to different peers. This makes efficient use of otherwise unused
upload bandwidth.  Second, multiple files may contain the same chunk,
as Figure~\ref{fig:file-chunk-diagram} demonstrates. A peer can obtain
part of a file from peers that do not have an exactly identical file,
but merely a \emph{similar} file.

Though it seems unlikely that multiple files would share the same
chunks, file sharing networks frequently contain multiple versions of
the same file with largely similar content. Users often have MP3 files
with the same audio content but different ID3 metadata tags. Dividing
the file into chunks allows the bulk of the data to be downloaded from
any peer that shares it, rather than only the one with the same ID3
tag.

%  The variable-length chunking algorithm (LBFS)
However, it is not sufficient to use a segmentation scheme that draws
the boundaries between chunks at regular intervals. In the case of MP3
files, since ID3 tags are stored in a variable-length region of
the file, a change in metadata may affect all of the chunks because the
remainder of the file will now be ``out of frame'' with the original.
More generally, insertion or deletion of file content may negate the
advantages of fixed-length chunking.

To solve this problem, we choose variable length chunks based on
content, using a chunking algorithm derived from the LBFS file system
\cite{lbfs}. Due to the way chunk boundaries are chosen, even if
content is added or removed in the middle of the file, the remainder
of the file will still be largely comprised of identical chunks.
% FIXME: mention approximate chunk size?

%  Are there any papers on other P2P networks that chunk we can cite?

% Algorithm
%% When using this content distribution system, the index entry blocks of
%% the search index (described in Section
%% \ref{sec:searching:index-structure-and-algorithms}) contain the block
%% ID of the file block for the file to which they refer. The file block
%% itself simply contains a list of the block IDs of all the associated
%% chunk blocks. File and chunk blocks are inserted into the DHT using
%% block IDs derived from the hash of the file or chunk contents. This
%% ensures that the integrity of downloaded content can be easily
%% verified by computing its hash value and comparing to the block ID.
%% A chunk block contains a list of IP addresses of peers from which the
%% file can be obtained. The entries in this list will expire after a
%% fixed time, as it is possible for nodes to fail or leave the network
%% unexpectedly. However, since the list is provided on a best-effort
%% basis, clients should be robust in the event of being unable to
%% contact one of these peers.

\subsection{Postfetching}
\label{sec:content-distribution:postfetching}

Additional novel properties can be garnered from \emph{Arpeggio} by
introducing caching into the network. Caching file chunks on nodes
that would not usually be sharing the chunks provides new
opportunities for increasing availability of files. Cached chunks are
inserted into the index just like regular chunks, and therefore do not
share the disadvantages of direct storage with regards to having to
maintain the chunks despite topology changes.  Furthermore, this
insertion symmetry makes caching transparent to the search system.
Also, unlike in direct storage systems, caching is non-essential to
the functioning of the network, and therefore each peer can place a
reasonable upper bound on its cache storage size.

% Passive caching
% Active caching
%  [I don't really like these names. Can we come up with better ones?]

% I think it just dawned on me why, in practice, queue lengths don't
% balance in most networks: Most networks maintain one big queue for
% all files that a peer has, instead of a queue per file.  Bittorrent
% doesn't have this issue because it works on a per-file basis (and
% uses something much more dynamic than a queue).  Would this be
% possible to prove somehow?
%
% In fact, this now seems really silly.  The idea was to for the
% downloader of a block to simultaneously push it to some cache node,
% who would then begin sharing it, but why not just make that cache
% node someone who wants that block?

\emph{Postfetching} provides a mechanism by which caching can increase
the supply of rare files in response to demand.  \emph{Request blocks}
are introduced to the network to capture requests for unavailable
files.  Due to the long expiration time of metadata blocks, peers can
find files whose sources are temporarily unavailable.  The peer can
then insert a request block into the network for a particular
unavailable file.  When a source of that file rejoins the network it
will find the request block and actively increase the supply of the
requested file by sending the contents of the file chunks to the
caches of other nodes. These in turn register as sources for those
chunks, thus increasing their availability. Thus, the future supply of
rare files is actively balanced out to meet their demand.

\section{Conclusion}
\label{sec:conclusion}

We have presented the key features of the \emph{Arpeggio} file sharing
system. \emph{Arpeggio} differs from previous peer-to-peer file
sharing systems in that it implements both a metadata indexing system
and a content distribution system using distributed hash tables.  We
extend the standard DHT interface to support not only lookup by key
but complex search queries.  Keyword-set indexing and extensive
network-side processing in the form of index-side filtering, metadata
gateways, and expiration are used to ameliorate the scalability
problems inherent in distributed document indexing. We introduce a
content-distribution system based on indirect DHT storage that uses
chunking to leverage file similarity, and thereby optimize
availability and transfer speed.  Availability is further enhanced
with postfetching, which uses cache space on other peers to replicate
rare but demanded files.

% Novelties
%   Lots of network-side processing
%   Partial presence (if we talk about this)
%   Insertion gateways
%   Weak consistency in index replication
%   Sharing of chunks between files
%   Proactive caching
%   Combining direct and indirect storage

\small{
\bibliographystyle{abbrv}
\bibliography{design-paper}
}
\ifdraft \vfill \hfill \verb|$Rev$| \fi

\end{document}

% LocalWords:  metadata DHT Kademlia KSS keyspace Proactive proactive chunking
% LocalWords:  postfetching LocalWords