% -*- TeX-master: "report.tex" -*-

%Case study

We studied a distributed browser cache as a concrete example of an
application using InHome. Rather than always going to the origin
server to retrieve web content, the browser first checks whether any
local peer has already retrieved the content and stored it in its
cache.

This application is implemented as a Mozilla Firefox plugin. The
plugin intercepts HTTP requests and dispatches a query to InHome for
each. If a response is not received within 200 milliseconds (or if the
page is not available), the plugin retrieves the data from the origin
server. Pages are identified by their URLs, so two identical web
objects in different locations are treated as separate objects. The
200-millisecond timeout was chosen based on empirical measurements of
local-area latency, balanced against a desire to avoid degrading the
browsing experience.
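
The timeout-and-fallback behavior described above can be sketched as
follows. This is a minimal Python illustration, not the plugin's
actual code: the names \texttt{fetch}, \texttt{query\_inhome}, and
\texttt{fetch\_origin} are hypothetical, and the peer query is assumed
to return nothing when the page is not available.

\begin{verbatim}
```python
import concurrent.futures
from typing import Callable, Optional

INHOME_TIMEOUT_S = 0.2  # the 200 ms budget described in the text

# Hypothetical worker pool for peer lookups; not part of InHome itself.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def fetch(url: str,
          query_inhome: Callable[[str], Optional[bytes]],
          fetch_origin: Callable[[str], bytes],
          timeout_s: float = INHOME_TIMEOUT_S) -> bytes:
    """Return the peers' copy if it arrives in time, else the origin's copy."""
    future = _POOL.submit(query_inhome, url)
    try:
        data = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        data = None  # peers were too slow; a late answer is ignored
    if data is None:  # timeout, or page not available from any peer
        return fetch_origin(url)
    return data
```
\end{verbatim}

Running the peer query on a worker thread keeps the caller from
waiting longer than the timeout, which is the property the 200
millisecond bound is meant to guarantee.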

In addition, every time a new object is added to the cache, the plugin
inserts the new object into the InHome system, allowing the computer
to serve the new object to other peers. The plugin also removes pages
with expired TTLs, ensuring that clients always fetch the latest
version of dynamic pages.
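
A minimal sketch of this bookkeeping is given below, assuming each
object carries a TTL in seconds. The class \texttt{SharedCache} and
its methods are hypothetical names chosen for illustration; they are
not part of InHome or of the plugin.

\begin{verbatim}
```python
import time
from typing import Callable, Dict, Optional, Tuple


class SharedCache:
    """Toy model of the objects this client currently offers to peers."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # url -> (body, expiry time on the clock above)
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def insert(self, url: str, body: bytes, ttl_s: float) -> None:
        """Offer a newly cached object to peers until its TTL runs out."""
        self._entries[url] = (body, self._clock() + ttl_s)

    def purge_expired(self) -> None:
        """Withdraw every object whose TTL has passed."""
        now = self._clock()
        expired = [u for u, (_, exp) in self._entries.items() if exp <= now]
        for url in expired:
            del self._entries[url]

    def lookup(self, url: str) -> Optional[bytes]:
        """Answer a peer query; expired objects are never served."""
        self.purge_expired()
        entry = self._entries.get(url)
        return entry[0] if entry is not None else None
```
\end{verbatim}

Purging on every lookup is the simplest way to guarantee that an
expired object is never handed to a peer; a periodic sweep would work
as well.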

\subsection{Security and Caching Policy}

Security is an exceptionally important and challenging problem for a
web cache. Critical business is executed over the Internet, and
retrieving pages from local clients opens up many new avenues of fraud
and phishing attacks. Guaranteeing security is difficult, however,
because web content is neither static nor self-certifying.

If we were allowed to make radical changes to the status quo, we could
have servers cryptographically sign their content. At present,
though, this is not an option. Instead, we ensure that no sensitive
pages are cached by excluding pages served over SSL and pages that
contain password fields.

Our final caching policy can be written as follows:

\begin{itemize}
\item If the URL uses SSL encryption (i.e., the HTTPS protocol), fall
  back to the origin server, so that sensitive pages are never
  requested from peers.
\item Attempt to retrieve the data from InHome. Time out after 200
  milliseconds.
\item Fall back to the origin server if the query was unsuccessful.
\item If the data contains a password field, fall back to the origin
  server (this prevents malicious peers from attempting to steal
  passwords).
\end{itemize}
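
The four rules above can be sketched as a single decision function.
This is an illustrative Python fragment under stated assumptions, not
the plugin's implementation: \texttt{resolve},
\texttt{query\_inhome}, and \texttt{fetch\_origin} are hypothetical
names, the peer query is assumed to return nothing on a miss or
timeout, and a password field is taken to mean an HTML
\texttt{input} element whose type is \texttt{password}.

\begin{verbatim}
```python
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urlsplit


class _PasswordFieldFinder(HTMLParser):
    """Sets `found` when the page contains <input type="password">."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        field_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and field_type == "password":
            self.found = True


def has_password_field(body: bytes) -> bool:
    finder = _PasswordFieldFinder()
    finder.feed(body.decode("utf-8", errors="replace"))
    finder.close()
    return finder.found


def resolve(url: str,
            query_inhome: Callable[[str], Optional[bytes]],
            fetch_origin: Callable[[str], bytes]) -> bytes:
    # 1. Never ask peers for SSL-protected pages.
    if urlsplit(url).scheme == "https":
        return fetch_origin(url)
    # 2. Ask InHome (assumed to give up on its own after the timeout).
    body = query_inhome(url)
    # 3. Fall back to the origin server if the query was unsuccessful.
    if body is None:
        return fetch_origin(url)
    # 4. Do not trust a peer's copy of a page that asks for a password.
    if has_password_field(body):
        return fetch_origin(url)
    return body
```
\end{verbatim}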

%Selfish Nodes
%
%Without modifying the protocol, there is literally no way to deal with this. To get a partial solution, we'd have to allow nodes to eavesdrop on

\subsection{Bandwidth Savings}

The efficacy of this system depends on how similar users' browsing
habits are. The Internet is vast; if different users usually visit
different sites, this scheme is useless. Fortunately, this is not the
case. We analyzed multiple sets of user traces and constructed a
statistical model of multiuser web browsing behavior.

\subsubsection{University of California, Berkeley}

The first trace we studied was a four-hour trace collected at
U.C.~Berkeley in November
1996~\cite{uc_berkel_home_ip_web_traces_days}. A cross-section of
students, staff, and faculty agreed to install Internet tracking
software on their home computers, and the results were anonymized and
made public. Since these traces were taken on the client computers,
it is possible to identify which requests were served by the local
browser cache and exclude them from the measurements.

Our analysis of the trace shows that 24.3\% of web requests are for
web objects previously requested by a different client. Once web
object sizes are factored in, applying InHome could save 27.6\% of
total bandwidth.
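
The two figures above correspond to a request-level and a byte-level
hit rate. The Python sketch below shows one way such numbers could be
computed from a trace; it is an illustration under stated assumptions
rather than the analysis script actually used. The record layout
(client, URL, size) and the function name
\texttt{shared\_cache\_savings} are hypothetical, requests served by
the local browser cache are assumed to have been removed already, and
cached objects are assumed never to expire during the trace.

\begin{verbatim}
```python
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, int]  # (client id, URL, object size in bytes)


def shared_cache_savings(trace: Iterable[Record]) -> Tuple[float, float]:
    """Return (fraction of requests, fraction of bytes) servable by a peer."""
    seen_by: Dict[str, Set[str]] = {}  # URL -> clients that already fetched it
    requests = hits = total_bytes = hit_bytes = 0
    for client, url, size in trace:
        earlier = seen_by.setdefault(url, set())
        requests += 1
        total_bytes += size
        if earlier - {client}:  # some *other* client fetched it first
            hits += 1
            hit_bytes += size
        earlier.add(client)
    if requests == 0 or total_bytes == 0:
        return 0.0, 0.0
    return hits / requests, hit_bytes / total_bytes
```
\end{verbatim}

The two fractions differ whenever shared objects are larger or smaller
than average, which is why the bandwidth figure is reported separately
from the request figure.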

\subsubsection{IRCache}

The second set of traces is from IRCache~\cite{ircac}, a public
caching server that provides its data for research purposes. Unlike
the Berkeley traces or the typical use case for this system, IRCache
data is gathered from a set of users scattered throughout an entire
city. However, this trace can still serve as a good lower bound for
hit rate and bandwidth savings, since browsing habits should only
become more similar when users are confined to a smaller geographical
area.

Since this trace covers an entire day of requests, it is more
representative of the cache duration that our solution
provides. Although the records do not clearly identify distinct
clients, we know that requests served by the local browser cache do
not appear in the trace output.

Our analysis of the trace shows that 37.6\% of web requests are for
web objects previously requested by a different client. Once web
object sizes are factored in, applying InHome could save 41.5\% of
total bandwidth.

\subsubsection{Statistical Model}

Prior work has shown that the Zipf model accurately represents user
browsing behavior~\cite{breslan99:_zipf}. Our model uses a set of 1000
independent Zipf distributions to model clients, with each
distribution using $N = 50000$. In addition, each client randomizes
the order of the top 100 sites to vary user behavior. Since these top
100 sites represent over 50\% of all requests, we determined that
this change produces enough variance to model the behavior of
multiple users.
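
For reference, a Zipf distribution over sites ranked $1, \dots, N$ is
commonly written as shown below. We read $N$ as the number of sites,
and the exponent $\alpha$ is an assumption introduced here for
illustration, since its value is not stated above.

\[
  P(i) = \frac{i^{-\alpha}}{H_{N,\alpha}},
  \qquad
  H_{N,\alpha} = \sum_{k=1}^{N} k^{-\alpha},
  \qquad
  i = 1, \dots, N .
\]

Under this form, the fraction of a client's requests that go to its
top 100 sites is $H_{100,\alpha} / H_{N,\alpha}$.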

When we run the statistical model, 43.2\% of web requests are for web
objects previously requested by a different client. Once web object
sizes are factored in, applying InHome could save 45.7\% of total
bandwidth.
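
A simulation in the spirit of this model can be sketched in a few
lines of Python. The sketch is not the code that produced the numbers
above, and several details are assumptions made for illustration: the
Zipf exponent \texttt{alpha}, the number of requests per client, the
round-robin interleaving of clients, and the choice to skip a client's
repeat requests on the grounds that its local browser cache would
serve them. Object sizes are not modeled, so only the request-level
fraction is computed.

\begin{verbatim}
```python
import itertools
import random
from typing import Dict, List, Set


def simulate_hit_rate(clients: int, sites: int, requests_per_client: int,
                      top: int = 100, alpha: float = 1.0,
                      seed: int = 0) -> float:
    """Fraction of requests for a site that another client requested earlier."""
    rng = random.Random(seed)
    # Cumulative Zipf weights over ranks 1..sites, shared by all clients.
    cum = list(itertools.accumulate(r ** -alpha for r in range(1, sites + 1)))
    # Each client gets its own random ordering of the `top` most popular sites.
    heads: List[List[int]] = []
    for _ in range(clients):
        head = list(range(min(top, sites)))
        rng.shuffle(head)
        heads.append(head)

    seen: Dict[int, Set[int]] = {}  # site -> clients that have requested it
    total = hits = 0
    for _ in range(requests_per_client):
        for c in range(clients):
            rank = rng.choices(range(sites), cum_weights=cum)[0]
            site = heads[c][rank] if rank < len(heads[c]) else rank
            who = seen.setdefault(site, set())
            if c in who:
                continue  # assumed to be served by the local browser cache
            total += 1
            if who:  # at least one other client already has this site
                hits += 1
            who.add(c)
    return hits / total if total else 0.0
```
\end{verbatim}

With the parameters given in the text, the call would take the form
\texttt{simulate\_hit\_rate(1000, 50000, n)} for some number of
requests per client \texttt{n}; the result depends on \texttt{n} and
on \texttt{alpha}.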