Horizon size estimation on the Gnutella network The horizon size estimation protocol (HSEP) Version: 0.2 Created: November 21, 2003 Last change: July 6, 2004 Author: Thomas Schürger (thomas@schuerger.com) Document URL: http://www.menden.org/gnutella/hsep.html Note that this is still a preliminary RFC, not a ready-to-use specification. The suggested header and payload type have not been registered or checked for collisions yet. Vendors are free to implement and test the suggested protocol, though. Comments and suggestions are welcome! Abstract The term "horizon size estimation" refers to the estimation of the number of reachable resources within the Gnutella network, e.g. the number of reachable Gnutella nodes, shared files and shared kibibytes (KiB). A good scheme for horizon size estimation based on dynamic programming has been proposed in the appendix of a document by Christopher Rohrs describing the pong-caching approach ([1]). Yet no proposal has been made so far on how to actually implement horizon size estimation into the Gnutella protocol. Here, I will give some additional comments about horizon size estimation and a specification on how to extend the Gnutella protocol to support this. Change log * 2004-07-06: Small minor changes in various positions. * 2004-06-25: The recommended value for n_max is now 7 instead of 10. * 2004-06-15: Added links to a HSEP_reference_implementation. * 2004-06-14: Added information about how to include horizon data from non-HSEP nodes. This will make horizon statistics more accurate, at least when HSEP is not in widespread use yet. The log is updated whenever changes/suggestions relevant to the protocol are added. Overview The proposed protocol has the following properties: * exact for tree-networks (but limited by a given hop radius), good enough for networks with cycles * preservation of global anonymity (no per-node data extractable except for direct peers) * good scalability (no broadcasting required) * robust against transmission errors * data from different connections is automatically accumulated regardless of the arrival rate * very low bandwidth, memory and CPU time consumption * 64-bit integer based (only addition/subtraction/comparison operations required) * easy to implement Changes to the Gnutella protocol: * introduction of a new "X-Feature" during Gnet handshaking to announce HSEP support * introduction of a new Gnet message payload type for HSEP messages Introduction The original proposal only mentions how to estimate the total number of nodes reachable within a certain distance (in terms of Gnet hops), but it can easily be extended to also estimate the number of shared files and kibibytes reachable within a given distance. Note that Gnutella's original ping-pong scheme allows some very inefficient and inaccurate sort of horizon size estimation, because pong messages contain the number of shared files and kibibytes of each node sending a pong. But due to optimization techniques like pong-caching, pong- based horizon size estimation cannot be used reliably nowadays (in fact, pong- based horizon size estimation was never really reliable at all). Horizon size estimation has an important advantage: besides providing interesting statistical information, it can help to increase the connectivity of a servent, because the quality of a connection from a servent to one of its peer servents becomes a reliable quantitative measure. For example, as an ultrapeer or leaf node you could connect to a couple of ultrapeers and drop the connections to some of those having the worst figures concerning reachable nodes, shared files and/or shared kibibytes. This avoids connecting to servents that are not very well connected or are connected in a way such that few files are reachable (e.g. in a degraded region of the network). As a result, it may very well be the case that fewer connections to other servents are required for the same connectivity, thus saving bandwidth on the network. A drawback to the proposed approach is that it only works perfectly in networks that form a tree, which the Gnutella network obviously does not. Existing cycles in the network will make it possible to take nodes into account more than once (think of it as a sort of hop-limited breadth-first search that cannot check whether it has already seen a node or not). Therefore, the reliability of the horizon size estimation information will drop with increasing number of considered hops, since the probability for the existence of cycles increases with the distance from the originating servent. Consequently, the number of considered hops (n_max) should be limited to a reasonable value. The suggested approach cannot detect network cycles due to the implicit anonymity of the data. The anonymity is lost only for peer servents: a servent knows the number of shared files and kibibytes of each of its peer servents and to how many other servents peer servents are connected to. For horizon size estimation to work properly it is obvious that this will work best if all servents support HSEP and if all servents use the same value for n_max. A quick analysis shows that using 32-bit integers to store the the accumulated number of nodes, shared files (both max. ~4.3 billion) and kibibytes (max. 4 tebibytes) will not suffice. The number of nodes will be okay for a while, but the other two numbers definitely won't. Therefore, unsigned 64-bit integers will be used for all three numbers. Recall that the original proposal defined H[A,n,B] to be the number of other nodes that are reachable from node A through node B (which has a direct connection to A) using at most n hops. That is, H[A,n,B] is the number of other nodes reachable from node A within at most n hops with the first hop being B. H'[A,n,B] is defined to be the number of other nodes reachable from node A within at most n hops where the first hop is not B (i.e. through all directly connected nodes except B). Define H[A,n] to be the sum of H[A,n,B] for all B with direct connection to A. The key observations for the directly connected nodes N_1, ..., N_k are: H[A,n,N_i] = H'[N_i,n-1,A] + 1 (for n >= 1) H'[A,0,N_i] = H[A,0,N_i] = 0 The following is obvious: H'[A,n,N_i] = H[A,n,N_1] + ... + H[A,n,N_(i-1)] + H[A,n,N_(i + 1)] + ... + H [A,n,N_k] = H[A,n] - H[A,n,N_i] Note that the "+ 1" in the first equation arises from the fact that connection N_i serves as a node itself, so it must be considered as being reachable from A. For counting files and kibibytes, this has to be replaced by the number of files and kibibytes, respectively, that N_i shares. To make the algorithm work, directly connected nodes effectively exchange their H' tables and update their H tables accordingly. The proposed protocol is completely self-correcting. If a servent sends faulty HSEP data for some time (e.g. data that is wrong but passes the sanity checks, see below), this data will propagate through the network. But if the servent starts sending correct data again or quits from the network, the correct values will gradually replace the faulty data, i.e. data errors will not accumulate (which would cause the faulty data to circulate in the network forever). This is a very desired property of such a protocol. Throughout the document, "kibibytes" (or "KiB") refers to 1024 bytes (as defined by IEC 60027-2, see here). Required datastructures The client needs a fixed-size global array H[0..n_max], where each element is initialized to zero. Additionally, for each connection N_i the client needs a fixed-size array H_(N_i)[0..n_max], initialized as specified below. Note that each array contains n_max + 1 elements, not n_max. Each array element is a triple (3-tuple) containing (nodes, files, kibibytes), where each triple element is an unsigned 64-bit integer (uint64_t in C 99, or long in Java). In the following, addition/subtraction of such triples means usual component-wise addition/subtraction, like for vectors. The servent also has an (artificial) triple OWN := (1,own_files,own_kibibytes) which contains the resources that the servent itself contributes to the network (a 1 for the number of nodes and the current number of shared files and kibibytes of the servent). OWN must be set to (1,0,0) if the servent currently does not share any files (or doesn't want to announce that it shares any). The triple must be updated whenever these figures change (e.g. when shared files are re-scanned or when the servent decides to start or stop sharing). The array triples H[0] and H_(N_i)[0] (for all i) will remain (0,0,0) throughout the protocol (because they are never changed), which simplifies the update algorithm a little. Handshaking In the proposed Gnutella extension a connecting servent must announce its capability of supporting HSEP at connection time and must check the HSEP capability of the servent it connects to. This only refers to Gnet connections, not to download/upload connections. The HSEP-capability is indicated by making use of the "X-Features" header in the Gnet handshaking phase. The feature is called "HSEP" with version "0.2", e.g. a header could look like this: X-Features: HSEP/0.2, ... After learning the HSEP capability of the peer servent by inspecting the "X- Features" header of the peer's connection request or reply (depending on which side initiated the connection), the servent can continue and use the protocol version that both servents understand (which currently means that the versions must be equal) or choose to either continue without using HSEP or to terminate the connection. After establishing a HSEP-capable connection, both servents have to initialize their H_(N_k) array (with N_k being the new connection) to ((0,0,0), (1,0,0), ... ,(1,0,0)), with n_max + 1 triples altogether. This means that all triples are (1,0,0) except the first, which is (0,0,0). This vector of triples also must be added to the global H array. So in total, the following needs to be done for connection initialization: H_(N_k)[0] := (0,0,0) H_(N_k)[n] := (1,0,0) for all 1 <= n <= n_max H[n] := H[n] + (1,0,0) for all 1 <= n <= n_max Note that different versions of HSEP will not be compatible to each other. A servent must therefore deny using HSEP with a servent that announces supporting a different version of HSEP. This is crucial to allow interoperability with old servents when new versions of the protocol become available. Normal operation for ultrapeers Every now and then (every 30 seconds or so) a servent needs to send out the horizon size estimation data to all of its current HSEP-capable connections. For a connection N_i the servent sends a HSEP message to connection N_i along with the HSEP data suitable for that connection as specified below. The payload type of HSEP messages is 0xcd (205 in decimal), TTL is 1 and hops is 0. At first, the servent should collect all available data from all non-HSEP neighbors. This is the number of non-HSEP neighbors, the sum of the number of files they share and the sum of the number of KiB they share. The first value is trivial to obtain, the other ones can be extracted from received PONG replies from these neighbors (due to limitations in the shared library size data sent in a PONG, this is not very accurate, but good enough). This data can be packed into a triple (nodes, files, kibibytes), which we'll call OTHER. If a servent does not watch PONG replies this way, it can use OTHER := (n, 0, 0), with n being the number of non-HSEP neighbors. However, vendors are urged to support PONG watching to make horizon estimation more accurate. The payload looks as follows. It consists of n_max triples (nodes, files, kibibytes), where each value of the triples is an unsigned 64-bit integer sent in little-endian_byte-order. The n_max payload triples to send are OWN + H[0] - H_(N_i)[0], OWN + OTHER + H[1] - H_(N_i)[1], ..., OWN + OTHER + H[n_max - 1] - H_(N_i)[n_max - 1] (let's call these triples V[0], ..., V[n_max - 1]). Note that a servent never uses H[n_max] or H_(N_i)[n_max] for sending. Note also that OTHER is used in all triples except the first one. When a servent receives such message from connection N_i with payload triples V [0], ..., V[n_max - 1], the following sanity-check must be performed (where "*" stands for an arbitrary value). The provided triple relation must be true for all triple components, i.e. (a,b,c) rel (d,e,f) :<=> (a rel d) AND (b rel e) AND (c rel f): V[0] = (1,*,*) V[n] >= V[n-1] for all 1 <= n <= n_max - 1 (monotony check) If this check fails, the servent must ignore the message or terminate the connection. If the check succeeds, it uses the following update algorithm. For each 0 <= n <= n_max - 1 it sets (in this order!) H[n + 1] := H[n + 1] + V[n] - H_(N_i)[n + 1] H_(N_i)[n + 1] := V[n] So the servent adds the difference of each new and each previous triple to the global H entries (shifted by an offset of 1) and sets all its H_(N_i) triples for the connection to the triples sent by the peer servent (also shifted by an offset of 1). Different servents might have different values of n_max. A servent receiving a HSEP message can deduce the n_max value of the peer servent (called other_n_max) by taking the payload length of the message and by dividing it by 24. If n_max <= other_n_max, nothing special needs to be done except for truncation: only the first n_max triples are processed. But if n_max > other_n_max, simply use V[n] := V[other_n_max - 1] for all other_n_max <= n <= n_max - 1 This means that the message is treated as if it actually contained n_max triples instead of just other_n_max triples, with the last one of the sent triples being repeated until there is a total of n_max triples. As a side effect, this enables a simple but effective optimization on the sending side: if there is an n_last such that V[n] = V[n_last] for all n_last < n <= n_max - 1, only send V[0], ..., V[n_last] in the HSEP message. In other words: if there is a suffix of equal triples, only send the triples V[0], ..., upto the first occurence of the repeated triple, inclusive. This can be exploited for leaf nodes, see below. An ultrapeer should ignore a HSEP message from a leaf node that sends more than 1 triple (it may also choose to terminate the connection in that case). Note that a servent need not send a HSEP message to a peer servent if none of the triples to send has changed since the previous message sent to the peer, because this will have no effect on the peer's side (the per-connection data would be replaced by the same data and the global data would also remain unchanged). The probability that nothing has changed within the last 30 seconds may be small (because other HSEP messages of peers may arrive), but a vendor must include this optimization anyway (especially for leaf nodes, see below). Note also that HSEP does not require sending the HSEP messages to all connections at the same time. It is also possible (and suggested) to send the messages out randomly (with a 30 second average) or in a round-robin manner, scattering the messages over the 30 second interval. The first HSEP message should be sent as soon as possible, i.e. as soon as a connection has been established. As this refers to both sides (the connecting servent and the servent being connected to), the two servents send their triples for the connection to each other. This lets HSEP data propagate quickly through new connections. Luckily, HSEP data exchange between two connected servents is independend from each other: it doesn't matter which of the two sends a HSEP message first (the messages could even be sent simultaneously). At any time, the servent can update its displayed horizon data in its GUI by displaying H[n] to the user, which contains the number of reachable nodes, files and kibibytes within a distance of n hops (with 1 <= n <= n_max). A servent might also display the per-connection data H_(N_i)[n] to the user for each connection, if desired. To each of the displayed triples the servent should add the current OTHER triple to reflect what all currently connected non-HSEP nodes share. Normal operation for leaf nodes A leaf node can operate in the same way that an ultrapeer does, but a little care must be taken here if a leaf node is connected to more than one ultrapeer. Since a leaf node must not propagate any HSEP data between the ultrapeers it is connected to (in fact, it must not forward any Gnet messages at all), the leaf has no need for the 30-second interval for sending: it only needs to send a HSEP message to its ultrapeers once when the connection has been established and after that only when its number of shared files or kibibytes has changed (which it should not do more often than every 30 seconds). The correct HSEP message would normally consist of the leaf's OWN triple repeated n_max times (i.e. HSEP data from other ultrapeers is ignored), but this can be optimized to just sending the OWN triple once: on the ultrapeer's side, this triple is expanded as needed. Therefore, such a message has a payload of only 24 bytes. Even if the leaf node would send out its HSEP messages every 30 seconds, this would only result in a traffic of 0.8 bytes per second per ultrapeer, without any compression applied. Note that receiving HSEP messages from ultrapeers does not require special treatment on the leaf node's side. Incoming HSEP messages can be processed like for ultrapeers, and GUI updates work in the same way, too. Disconnecting When a servent disconnects from one of its HSEP-capable peers (no matter which side closes the connection) the global H array needs to be updated. For the closed connection N_i and each 1 <= n <= n_max (n = 0 can be skipped) simply set H[n] := H[n] - H_(N_i)[n] and free the allocated HSEP memory of the connection, if needed. Also make sure that the connection is not considered for any sanity checks anymore while or after being closed. Invariants At any time (outside of update operations) the following conditions must hold (where "*" stands for an arbitrary value): * the OWN triple must be (1,*,*) * all per-connection triples for 0 hops must be (0,0,0) * all per-connection triples for 1 hop must be (1,*,*) * for each connection its triples must be monotonically increasing for increasing number of hops (the same test as for incoming triples, see above) * the sum of all k-th per-connection triples must match the k-th global triple for all 1 <= k <= n_max, i.e. sum(H_(N_i)[k] for all i) = H[k] for all k in [1,n_max] If the last two conditions are true, the global HSEP triples are automatically also increasing monotonically. The global triple for 0 hops is checked for being equal to (0,0,0) implicitly. A servent should check these invariants every now and then in order to detect internal implementation errors. Most implementation errors will lead to violation of these invariants. For debugging, the invariants should be checked after each update of any HSEP table. At runtime, the invariants could for example be checked at HSEP connection initialization time. Bandwidth requirements I propose n_max to be 7, which results in a payload of 7*3*8 = 168 bytes for a full HSEP message. Most leading bits of each 64-bit value will be zero, therefore good connection-stream compression is possible (e.g. by using the deflate algorithm). The value of 7 has been chosen to reflect the standard TTL used on the Gnutella network. Larger values don't make sense because of cycles in the network. When high out-degree support is implemented by most ultrapeers, the value should still be a good choice. For an ultrapeer maintaining 50 HSEP-capable connections (40 of them being leaf nodes sending in 3 minute and receiving in 30 second intervals, the rest being ultrapeers sending and receiving in 30 second intervals), this will result in an uncompressed bandwidth requirement of 61 bytes in and 280 bytes out per second, which should be no problem for an ultrapeer. An expected 30% of that, 18 bytes in and 84 out per second, will be remaining after compression. For a leaf node connected to 4 ultrapeers this will be 22 uncompressed bytes in and 0.5 bytes out per second (which can be compressed to 7 bytes in and 0.15 bytes out per second), so the bandwidth requirements for HSEP are minimal. If a leaf node only sends a message whenever the number of shared files/KiB changes, bandwidth requirements drop even further. Note that this is the raw payload bandwidth requirement; each message also has a header, which is smaller than a single triple and therefore does not influence the above figures much. Automatic connectivity checks For checking the connectivity gain of a connection N_i, one should not simply look at the connection's last triple H_(N_i)[n_max]. Remember that because of cycles in the network the reliability of the H_(N_i)[k] triples drops with increasing k, so one should keep k low. The aim for good connectivity should be to maximize the number of nodes/files/kibibytes reachable within a small number of hops, which should, for example, lead to quicker response times for queries. To achieve this, something around k = 4 or k = 5 should be fine. This is just a rough guess, it should be determined empirically. Requirements for multi-threading It is important that updating and collecting data for viewing and sending is an atomic process. This refers to reading and writing the values altogether (e.g. parallel read/write, write/read and write/write accesses), therefore such operations must be serialized. This may require proper locking-mechanisms in multi-threaded Gnutella clients, but can be ignored for single-threaded clients. As the required locking time is very short, it should be sufficient to use one global MUTEX lock, whenever the global or per-connection HSEP data is read or updated. Further considerations A differential approach (where only the difference between the current and the triple of the previous message to a connection is sent) would be possible. The compression algorithm might be happy with that since the entropy of the uncompressed data can be reduced, but the triples will probably be so dynamic that this will help very little in practice. Note that a differential approach has the undesirable property that data errors will accumulate. If wrong data is sent (due to implementation or transmission errors), these wrong values will accumulate on the peer's side and would be propagated through the network until the connection is closed. Due to these reasons no differential approach is used. Reference implementation The Gnutella servent gtk-gnutella contains a stable reference implementation of HSEP. The implementation is written in C, is well-documented and has been included since gtk-gnutella 0.94. The implementation contains all suggested implementation details from this document and will be kept in sync with it. Parts of the code is gtk-gnutella-specific, but it should be no problem to adapt the code to other servents or to port the implementation to other programming languages. The main files can be found here: * src/hsep.c * src/hsep.h * src/nodes.h Please refer to the SVN repository (specifically, the src/ directory) if you need any other files. References [1] http://rfc-gnutella.sourceforge.net/src/pong-caching.html