\documentclass[11pt,twocolumn]{article}
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage{url}

\newcommand{\Name}{\emph{Canopy}}

\title{\Name: We Really Need a Better Title} % XXX
\author{Dan Ports, Austin Clements, Jeff Arnold\\
  \texttt{\{drkp,amdragon,jbarnold\}@mit.edu}}
\date{November 18, 2005}

\begin{document}

\maketitle

\section{Introduction}

Non-determinism is both a strength and a weakness of network systems.
It goes hand-in-hand with the asynchrony that differentiates network
systems from sequential systems while putting a bane on development
and experimentation.  Sequential system debuggers lack both the
omniscient viewpoint required to debug systems that are spread across
multiple processes and computers as well as the strict control over
timing conditions required to effectively debug non-deterministic
systems.  The output of experiments may depend heavily on factors
outside the control of the experimenter, making results unreproducible
and conditions uncontrollable.

% I am not sure that I would put nondeterminism front-and-center as the
% primary strength and weakness of network systems, but I like the idea of
% showing how some of the strengths of network systems make them difficult
% to debug.  Here is a sketch of an argument that I would consider making
% in the introduction:                                        -- jbarnold

% Some of the most important properties of network systems -- scalability,
% isolation, and resilience -- make constructing a debugger for these
% systems particularly difficult.

% Scalability:  Networks allow you to attach many nodes together as part
% of a single powerful system, but debugging many interacting nodes is
% more difficult than debugging one.

% Isolation:  Networks are frequently used to provide isolation for
% fault-tolerance and security, but this same isolation makes building a
% meaningful debugger more difficult.

% Resilience:  Network systems are designed to work under many
% circumstances (ie, be resilient), but this attribute can contribute to
% the difficulty of understanding how a network system will behave.
% Constraints help us understand how a system will behave; flexibility can
% introduce subtle problems (race conditions, etc).

% Our solution?  Centralized view/control, timing control, determinism

Thus, to effectively debug and experiment with network systems, we
need to reign-in non-determinism and form a centralized, omniscient
viewpoint of the entire network system.  This hostile environment
calls for a new type of debugger that's not only aware of network
behavior, but that lets the user observe and control it. This
naturally requires some mechanism for putting the entire network
system into a closed environment that's both controlled and
observable.  To create this closed environment, \Name\
\emph{virtualizes} the entire network system.

The non-determinism of an individual node derives entirely from its
coupling with the ``real world'': the content and precise timing of
asynchronous hardware events such as clock interrupts and input
events.  The non-determinism of an overall network system follows from
the non-determinism of its individual nodes, as well as from events
that occur in the network fabric.  Thus, to reign-in non-determinism,
we strictly control the passage of \emph{virtual time} across the
system, as well as the content and timing of ``external inputs'' at
individual nodes such as network packets.  In addition to features
analogous to regular debuggers, such as observation, breakpointing,
and stepping, an effective debugger for a non-deterministic system
naturally requires a highly expressive way for developers to
experiment with the environmental knobs that make the system
non-deterministic: the ability to roll back, change the system's
behavior --- perhaps by dropping, delaying, or reordering packets, or
by failing nodes --- and play forward again to see how the system
being studied responds.

% XXX The above paragraph and the below paragraphs are somewhat
% redundant.  My intent was to give an overview of the next three
% paragraphs in the above paragraph before diving into them, but I
% went too deep.  These should be combined better

% 1a.  Efficient rollback to any time => many snapshots => efficient,
% incremental snapshots
%
% 1b.  Replay with only specified changes ==> deterministic qemus
%
% 2.  Many machines ==> master/slave daemons, barrier synchronization
%
% 3.  Replay with control of processor speed of each node ==> qemus
% with decoupled virtual time (also necessary to support #2)

\paragraph{Control of virtual time}
Variations in physical node hardware should not affect the observed
relative performance of virtual nodes. ...

We begin by precisely controlling time at individual nodes.
% By synchronizing virtual time between nodes, we control virtual time
% and thus non-determinism for the entire network system.
\footnote{Our implementation of deterministic time does not account
  for the introduction of non-determinism by cosmic rays.}

\paragraph{Rollback and deterministic replay}
A powerful debugging system should support rolling back the entire
system state to a user-specified time.  The user should then be able
to make changes to the system and play the simulation forward such
that the new execution path only differs from the original execution
path as a consequence of the user's changes.  In other words, the
evolution of a virtual system should be entirely independent of real
time or other conditions outside of the virtual environment.
% XXX Talk briefly about challenges of achieving deterministic replay

\paragraph{Scalable to large virtual distributed systems}
A debugger designed to operate on distributed systems must exhibit
reasonable scaling behavior since distributed systems can grow rather
large and complex.  In particular, having hardware requirements scale
approximately linearly with virtual system size is particularly
attractive.
% XXX Talk about master/slave daemons and barrier sync

% XXX This is out of place
\Name's node emulator is built atop Qemu 0.7.2, an efficient,
open-source whole-system emulator built around a dynamic binary
translator.  \Name's network emulator is built atop ns2.

\section{Related Work}

\subsection{Network Simulators and Emulators}

The ns2 network simulator~\cite{ns2} is frequently used for evaluating
network protocols because it supports a vast number of network
elements and traffic models. Typically it is used for packet-level
simulation, in which packets are generated by synthetic
sources. However, it also includes emulation
extensions~\cite{fall99:_networ_emulat_in_vint_ns_simul} which make it
possible to ``tap'' live networks, introducing their packets into the
simulated network, and to inject packets from the simulated network
into a live network. This sort of emulation makes it practical to test
real systems under esoteric network topologies, but does not provide
the corresponding level of control over source behavior that \Name\
provides by virtualizing not just the network but also the individual
nodes.
% ns2

% Emulab
Emulab~\cite{white02:_integ_exper_envir_for}... does stuff.
% XXX Write me!

\subsection{Replay}

Replaying an execution history has been used as a tool for debugging
both distributed and single-node systems. Nondeterminism exists not
just in distributed, networked systems of the type \Name\ targets, but
also in single-node multiprocessor systems, in multithreaded
single-CPU systems since the order in which tasks are scheduled can
affect the outcome, or even in sequential single-CPU programs if they
access external state. Because of this nondeterminism, replay is a
useful tool in debugging these systems.

\subsubsection{Replay of single-node sequential programs}
\label{sec:replay:sequential}


Though it is especially useful in networked or parallel systems where
execution is highly nondeterministic, replay is also useful even in
sequential execution environments with little or no nondeterminism: it
can be used simply as a tool for visualizing the execution of a
program. The EXDAMS debugger~\cite{balzer69:_exdam}, dating from the
late 1960s, recorded the history of program execution, and allowed a
user to walk through the log to inspect the execution and debug the
system. However, since it is not actually replaying execution, merely
browsing a pre-generated trace, it cannot rollback and make changes;
it is simply a visualization tool.

One of the earliest systems to apply rollback and replay to debugging
was the IGOR debugger for the DUNE distributed operating
system~\cite{feldman88:_igor}. IGOR uses periodic incremental
snapshots of a process's memory image to rollback a process to a
previous state. It was designed for debugging sequential,
single-process programs, and does not support parallel or distributed
systems.


\subsubsection{Replay of single-node parallel programs}
\label{sec:replay:parallel}

Recap~\cite{pan88:_suppor_rever_execut_for_paral_progr} was one of the
first systems to apply this approach to \emph{parallel} systems. To
achieve determinitic replay, it logs of system call results, signals,
and other asynchronous events (all external sources of nondeterminism)
and periodically checkpoints the system's state by forking and
suspending each process. Though it was designed for debugging parallel
systems, it only allows rolling back of a single process, not the
entire system state.

Instant Replay~\cite{leblanc87:_debug_paral_progr_with_instan_replay}
also uses replay for debugging parallel programs. This system records
the order in which threads acquire locks, which makes it possible to
replay an execution history provided that all interaction between
threads occurs using shared memory and is correctly synchronized with
locks.

Tai et al.~\cite{tai91:_debug_concur_ada_progr_by_deter_replay}
considered the challenges inherent in debugging a parallel (but
single-node) Ada system. They address the nondeterminism introduced by
concurrent execution by adding synchronization sequences that allow
an execution history to be deterministically replayed during debugging.

\subsubsection{Replay of distributed/networked systems}
\label{sec:replay:distributed}

Bugnet~\cite{curtis82:_bugnet, wittie86:_bugnet_distr_debug_system,
  wittie88:_debug_distr_c_progr} supports the debugging of distributed
systems via replay, using an approach quite similar to \Name's. Replay
is achieved by logging inter-process messages and performing periodic
global checkpoints. The architecture is also similar to \Name:
execution is distributed across multiple nodes using a Global Control
Module analogous to \Name's master daemon and a Local Control Module
analogous to \Name's slave daemons. Unlike \Name, it runs processes
directly with trapping of system calls rather than in full emulation;
this is less computationally intensive but does not provide the same
level of control. For example, it would not be possible to test
changes to the kernel scheduler or network drivers. Unlike \Name, it
also does not appear to perform any network emulation, merely passing
messages directly. This means it is not possible to test systems in
the face of complex, non-deterministic network behavior such as packet
delays, loss, and reordering.

Thane and Hannson~\cite{thane00:_using_deter_replay_for} devised a
scheme for debugging real-time distributed systems using a modified
kernel to provide replay. Their system logs events such as context
switches and data such as network messages and system call results in
order to provide a history to replay. The most interesting idea is
their notion of time synchronization, where time is quantized into
units of time $\Pi$ and synchronized between every node with a
specified precision $\delta < \Pi$. With this, it becomes possible to
decrease the amount of information that must be logged to enable
rollback to a specific point: the global time orderings make it
possible to rollback to an earlier point and deterministically replay
to the target point. \Name\ uses this approach.

The work most similar to ours is
PDB~\cite{harris02:_depen_comput_needs_pervas_debug,ho04:_pdb}. PDB
makes it easier to debug distributed systems by running the entire
system in a virtualization layer. It uses
Xen~\cite{dragovic03:_xen_and_art_of_virtual} virtualization for each
of the nodes; we considered using this approach, which would give
higher performance, in \Name, but decided to use Qemu instead for ease
of implementation. Like \Name, PDB allows simulating bandwidth
limitations, latency, and packet loss on network links. PDB also
includes a more comprehensive GDB~\cite{gilmore03:_gdb_inter}-like
debugger for examining the state of processes running within each
individual node in the system. We considered implementing a similar
debugger, but relegated this part of the system to future work because
of time constraints. Our principal improvement over PDB is our
scheme for distributing the execution over multiple physical hosts;
PDB is designed to run under a single host.

% Survey
\nocite{kim02:_distr_and_paral_debug}
% 
\nocite{cheung90:_framew_for_distr_debug}


% Local replay for non-distributed application debugging
% * This whole area seems to have started with a paper by Tai et
%   al. on debugging concurrent Ada programs (Jan. '91)
% Distributed replay
% * Thane, Hansson
% ** Notion of global time synchronization within some quanta
% * ENTRAPID

\section{Overall Architecture}

\begin{figure}[htbp]
  \centering
  \includegraphics{arch}
  \caption{The overall architecture of \Name.  Physical nodes are
    enclosed in solid boxes.  Virtual nodes are enclosed in dashed
    boxes.  Network connections are shown in bold.}
  \label{fig:arch}
\end{figure}

\Name\ runs atop one or more networked physical machines, depicted in
Figure~\ref{fig:arch}.  One of these machines is designated as the
\emph{master physical node}, and the rest of the machines are
\emph{slave physical nodes}.  The \emph{master daemon} oversees all of
the slave physical nodes, communicating with them through \emph{slave
  daemons}.  Each slave physical node runs a single \emph{slave
  daemon} responsible for overseeing zero or more \emph{virtual
  nodes}, depending on the number of physical nodes available and how
the master decides to assign virtual nodes to physical nodes.  A
virtual node consists primarily of a qemu process running a
user-provided environment.  Every virtual node also has its own
\emph{album} process that is responsible for archiving snapshots of
the state of the virtual node at various points in time.

Since a network message may potentially be transmitted between any two
virtual nodes at any time, all of the virtual nodes must progress in
their execution in such a way that their physical isolation does not
lead to unjustified network delay; the location and latency of
physical nodes should not affect the execution of virtual nodes in any
way.  We ensure that no virtual node ``misses'' a network message
arrival by executing past the message's correct virtual arrival time
before the underlying physical node learns about the existence of the
message.  In order to achieve this goal, the system employs barrier
synchronization with a barrier interval less than the minimal one-way
trip time between any two virtual nodes on the virtual network.  By
requiring every slave physical node to check with the master node for
outstanding messages at every barrier, we can guarantee that all slave
physical nodes will always receive a relevant network message before
the progress of virtual machine execution requires one or more packets
to be introduced to the destination virtual node.

\subsection{Master Daemon}

% User interface interface

\subsection{Slave Daemons}

\section{Nodes}

% Virtual time decoupled from real time
% * Define virtual time versus physical time
% * Time progresses with the number of instructions executed by the
%   VM.  This is effectively a cycle count.  By setting the constant
%   of proportionality between VM's cycle counter and its real-time
%   clock, we can control the ``MHz'' that the VM perceives.
% * Qemu acts as a discrete event simulator, interrupting the VM at
%   precisely the times when events or breakpoints should occur.
% * If the VM is idling, the time before the next event can simply be
%   skipped over.  In our current implementation, virtual time
%   progresses approximately 100 times faster when idling
% * All virtual hardware timings are based off this virtual timer
%   instead of physical time
% Deterministic replay
% * Arises naturally from the ability to serialize the complete state
%   of all virtual hardware, the deterministic progress of virtual
%   time, and the exact scheduling of ``asynchronous'' hardware events
% Efficient incremental snapshots
% * Use dirty bits to detect which memory pages have changed since the
%   last snapshot operation
% * Separate snapshot daemon so the emulator doesn't block on disk I/O

\section{Networks}

% Synchronization
% * Global virtual time synchronization: All of the machines need to
%   progress with roughly the same notion of real time (at any point,
%   for any two nodes, the real times of the two nodes must be within
%   delta of each other)
% * Strict virtual time synchronization: Specifically, if $\delta \le
%   \underbar{\lambda}$, where $\underbar{\lambda}$ is the smallest
%   possible packet latency, we can achieve perfect control over
%   simulated network delays. This has a lot of other cool properties,
%   too, like the ability to simulate a faster network over a slower
%   network (the emulated real time will simply progress slower than
%   physical real time to allow the physical network to keep up with
%   the simulated network).
% ** We can achieve this by placing synchronization barriers at every
%    $\delta$
% * Synchronization protocol diagram
% * This can be optimized by speculative execution, under the
%   assumption that hosts will not be receiving or sending packets
%   during the majority of time quanta.
% * Global rollback

\section{Implementation Progress}

% We have extended the Qemu emulator to support decoupled virtual
% time, deterministic replay, and efficient incremental snapshots.
%
% Some of the following paragraph might be better in the general
% information above.
%
% The dynamic translator has been augmented to insert the appropriate
% host instructions between each translated guest instruction to
% increment the virtual time-stamp counter and check for the
% time-stamp counter breakpoint.  We call these \emph{ticks}.  Qemu
% now acts as a discrete event simulator, executing translated guest
% code until the tick of the next hardware event is reached.  When the
% guest is idle (ie, when it is executing the HLT instruction), Qemu
% immediately jumps virtual time forward to the next pending hardware
% event.  The rest of Qemu has been modified so that all notions of
% time are based strictly off ticks and the ``ticks per virtual
% second'' of the virtual machine, so replay from a saved machine
% state is deterministic.
%
% The snapshot system has been augmented to maintain its own set of
% dirty bits for the virtual machine RAM (both addressable RAM and
% miscellaneous RAM managed by virtual PCI cards), which allow it to
% very efficiently record just the differences since the previous
% snapshot.  The constant overhead of an incremental snapshot is under
% 4K, and each modified page adds 4K.

\bibliographystyle{abbrv}
\bibliography{interim}
\end{document}