Introduction Wide-area bandwidth is becoming an increasingly scarce and expensive resource. Pages are becoming increasingly populated with rich media, and bandwidth-hungry applications like Youtube are growing at an explosive rate. As a result, ISPs are being forced to install increasingly expensive infrastructure, passing the cost onto business and academic customers. Local-area bandwidth, on the other hand, is becoming increasingly plentiful and cheap. Most computers nowadays have 100-Mbit or Gigabit Ethernet cards, despite the fact that network transfer rates rarely approach such levels. Moreover, laying local cable is cheap; most offices and campuses tend to have strong interconnectivity. Since the majority of network traffic is to external servers, local area bandwidth tends to remain mostly unused. Given these conditions, it would be nice if we could find a way to trade wide-area bandwidth for local-area bandwidth. The key to doing this is the observation that most internet traffic is redundant -- 25 to 40 percent of all browser requests are for web objects already stored in the cache of another local client. If web browsers were smart enough to retrieve the data from a local computer rather than the origin server, global bandwidth usage would be significantly decreased (not to mention the cost savings for organizations paying hefty ISP bills.) InHome is a peer-to-peer system that operates like a distributed cache. Internet applications with appropriately-designed plugins can query InHome for data requests that might be satisfiable by local computers, falling back to the origin server if the request times out or cannot be fulfilled. InHome was designed to ensure an ideal set of properties: \begin{itemize} \item The system does not require additional hardware or maintenance. \item Queries execute quickly enough to ensure that the user browsing experience is not worsened. \item Users are not required to store unfamiliar data. \end{itemize} This paper is organized as follows. In Section 2, we provide more detailed description of the problem. In Section 3, we provide a desciption of InHome. In Section 4, we elaborate in more detail on the searching algorithms that allow InHome to guarantee low latency. Section 5 does contains evaluations of InHome's performance and resource costs. Section 6 contains a more detailed look at how InHome could be used as a distributed browser cache. Section 7 covers related work in the literature. Finally, Section 8 contains future improvements and research areas. Related Work To date, there haven't been any InHome-style distributed cache systems. Companies interested in saving bandwidth tend to use a centralized caching proxy server -- effective, albeit costly and requiring maintenance. Content Distribution Networks like Codeen are also a distributed caching system. Unlike InHome, however, these networks are focused on relieving server pressure -- content providers replicate their data on these networks to reduce the number of data requests they process. There has been similar work in reducing external Bittorrent traffic. Ono is a project from Northwestern that piggybacks off existing CDNs to locate local peers for Bittorrent downloads. Although there are no concrete figures on how much bandwidth is saved, Ono is effective at discovering local peers. A study from Stanford has shown that local peer selection can reduce the number of redundant copies of data downloaded by an ISP from 50 to 4. Our system is more general than these -- with the appropriate plugin, a Bittorrent client could discover whether a copy of the data was located on the local network and download it if possible. Remember that Bittorrent was designed to download from a large number of low-throughput connections; for local area networks, its easier to directly download the file from a single peer. Case Study: Web Caching As a concrete example of an application using InHome, we consider a distributed web cache. Rather than always going to the origin server to retrieve web content, we first see whether any local peers have already retrived and stored it in their cache. This application is implemented as a Mozilla Firefox plugin. The plugin intercepts HTTP requests and dispatches a query to InHome for each. If a response is not received within 200 milliseconds (or if the page is not available), it contacts the origin server. Pages are identified via their URL, so two identical web objects in different locations would be treated as separate objects. 200 millseconds was chosen based on empirical measurements of local area latency balanced with a desire to avoid worsening the browser experience. In addition, every time a new object is added to the cache, the plugin binds its URL to its location in the InHome client, allowing the computer to serve pages as well. The plugin also removes pages with expired TTLs, ensuring that clients always fetch the latest version of dynamic pages. {\bf Security and Caching Policy} Security is an exceptionally important and challenging problem for a web cache. Critical business is executed over the internet, and retrieving pages from local clients opens up many new avenues of phishing attacks. Guaranteeing security is exceptionally difficult, however, because content is not static or self-certifying. If we were allowed to make radical changes to the status quo, we could have servers cryptographically sign their content. At the moment, though, this is impossible. Instead, we make sure that no compromising pages are cached by excluding pages with SSL encryption or pages with password fields. Our final caching policy looks something like the following: \begin{itemize} \item If the URL is SSL-encrypted (e.g. HTTPS protocol), fall back to the origin server (for obvious reasons.) \item If the URL has GET or POST parameters, fall back to the origin server (prevents personalized pages from being added to the cache.) \item Attempt to retrieve data from InHome. Timeout after 200 milliseconds. \item Fall back to the origin server if the query was unsucessful. \item If the data contains a password field, fall back to the origin server (this prevents malicious peers from attempting to steal passwords.) \end{itemize} %Selfish Nodes % %Without modifying the protocol, there is literally no way to deal with this. To get a partial solution, we'd have to allow nodes to evesdrop on {\bf Bandwidth Savings} The efficacy of this system is dependant on how similar user browsing habits are. The internet is vast -- if different users tend to visit different sites, this scheme is useless. Fortunately, this is not the case. We analyzed multiple sets of users traces and constructed a statistical model to analyze multiuser web browsing behavior. {\bf University of California, Berkeley} This is a four-hour trace from November, 1996. A cross-section of student, staff, and faculty agreed to install internet tracking software on their home computers, and the results were anonymized and made public. Since these traces were from the client computers, it is possible to figure out which requests were served by the local browser cache and exclude them from the measurements. Analyzing the trace, 24.3\% of web requests are for web objects previously requested by a different client. Factoring web object sizes in, applying InHome could save 27.6\% of total bandwidth. {\bf IRCache} The second set of traces is from a public caching server. While it serves a much more geographically diverse set of clients, we argue that this doesn't significantly affect our statistics (or at worst, it makes them less than normal) -- there's no reason locality should affect multiuser browsing habits. This trace is from an entire day of requests, so its more representative of the cache duration that our solution provides. Although this set of records doesn't clearly identify differing clients, we know that requests served by the local browser cache will not appear in the trace output. Analyzing the trace, 37.6\% of web requests are for web objects previously requested by a different client. Factoring web object sizes in, applying InHome could save 41.5\% of total bandwidth. {\bf Statistical Model} Papers in the literature all agree that the Zipf Model accurately represents user browsing behavior. Extending this to multiple users is a bit more difficult, however. Based on our experiments, we found that using $N = 50000$ and randomizing the order of the top 100 sites most accurately measures the behavior of multiple users. Running our statistical tests, 43.2\% of web requests are for web objects previously requested by a different client. Factoring web object sizes in, applying InHome could save 45.7\% of total bandwidth. System Overview As mentioned earlier, InHome is designed as a peer-to-peer system. Users run a background daemon on their computer -- this daemon is responsible for keeping an index of locally available data, dispatching data requests to other clients, and serving requests from other nodes on the network. In designing InHome, there was a set of essential properties we wanted to guarantee. {\bf The system does not require additional hardware or maintenance.} Organizations have been aware of this external bandwidth problem in the past, and the most common solution is a centralized caching proxy server. {\bf Users should not be required to store unfamiliar data.} It might be possible to use replication or a true distributed file system to optimize performance, but it's important to guarantee that clients are not required to store unfamiliar data. Consider what would happen if other clients were storing data that violated copyrights or were infected with viruses. {\bf Queries execute quickly enough to ensure that the user browsing experience is not worsened.} Individual users don't have a concrete reason to use our system -- this is a decision that should be made at the organization level. As such, we need to ensure that the user experience is not adversely affected -- if it were, this would slow adoption of InHome. {\bf } ... Random Stuff ============ Why not use Bittorrent? Our methodology draws far more from older P2P systems like Gnutella, , etc. -- many of which have now been overtaken by Bittorrent. While Bittorrent has made it much easier to download large files, it has a couple of disadvantages that make it unsuitable for our goal. - Centralized Trackers. In its original form, Bittorrent requires centralized tracking servers to keep track of which clients possess specific data. Thus, infrastructure support would be required by any organization deploying a Bittorrent-based solution. In addition, recent history has shown that legal organizations target such file trackers to stem copyright violations. - High Latency. Bittorrent downloads typically have a setup period on the order of many seconds to a minute -- acceptable for large file downloads but ill-suited for large sequences of small caching transactions. - Bandwidth. Bittorrent is especially suited for situations where many people might possess parts of a file, but connections to each have low throughput. Because our system is specifically designed for local area networks, it makes just as much sense to download the file from one person than to try coordinating many downloads. We note that there have been recent systems and analyses covering the benefits of local peer selection for Bittorrent. These systems are described in more detail in previous work. Previous Work and Alternatives: Local area schemes for reducing intra-ISP traffic is hardly a new idea. Here are some alternative solutions that have been proposed or implemented in the past. Centralized Web Cache: A simple alternative would be to maintain a centralized web cache on a proxy server; indeed, this is a solution already employed by many companies. Our solution is far less expensive and doesn't require maintenance, however. Distributed File System: Much work has been done in created distributed file systems over local networks. A solution of this nature would ask every computer on the local area network to donate some amount of storage space (e.g. 100 MB each) to maintain copies of the most commonly requested data. Our solution has two advantages over these types of systems: (1) Clients in our system only contain data that they're interested in (2) No implicit trust is assumed to exist between clients in our system. Feasibility: Clearly, the the benefits of this system are limited by how similar the data that clients contain is... We analyzed two sets of traces in order to answer this question. The first set was collected from students and faculty computers at UC Berkeley in November, 1996 through the Home IP service. This dataset contained complete records of internet usage, anonymizing client and server IPs via MD5. Analyzing 95,768 requests from 916 unique clients over a four hour period, we found that 17,394 of the requests were for data already stored on another node in the network. Note, however, that these traces were collected from client computers, and thus include requests served by the local browser cache. Eliminating those, we find that this figure is 24.3% of all data requests. The second set of traces requested was from the IRCache project. While this represents a reasonable approximation We've also constructed a statistical model for trace generation and analysis. Prior work has shown that a Zipf model accurately models user web behavior. We use a baseline of $N = 50000$ sites. Note that