\documentclass[11pt]{article}
\usepackage{fullpage}

\newcommand{\Name}{\emph{Canopy}}

\begin{document}
\title{\Name: A Controlled Emulation Environment\\
  for Network System Experimentation}
% XXX Needs a better title
% XXX And a name.
%
% Canopy?  We can always change it later
%
% * A Distributed Debugulator for Network Systems
% * A Controlled {Execution,Emulation,Distributed} Environment for
%   Network System Experimentation

\author{Dan Ports, Austin Clements, Jeff Arnold\\
  \texttt{\{drkp,amdragon,jbarnold\}@mit.edu}}
\date{October 5, 2005}
\maketitle

\section{Summary}
\label{sec:summary}

We propose to build a debugger for distributed systems that places the
entire network system into a controlled environment. This allows
experimenting with the system
under a wide range of simulated network conditions that are not
possible with an ordinary test environment, including rolling the
system back to a previous state to change a variable and compare
outcomes. Such an environment poses new scalability challenges, and
will need to address complex issues surrounding the synchronization of
a distributed emulation environment capable of moving both forward and
backward in time.

% XXX ``Research question''
% Such an environment poses new challenges  ...
%
% and will need to address complex issues surrounding the efficient
% and scalable synchronization of a distributed emulation environment
% capable of moving both forward and backward in time.

\section{Motivation}
\label{sec:motivation}


Distributed systems are even harder to debug than traditional
applications because the network introduces a high degree of
nondeterminism that applications must cope with and respond
to. Traditional applications operate in an environment that is fairly
controllable and deterministic, whereas network applications may
depend heavily on variables that are well outside the control of any
individual node and are certainly outside the control of a traditional
debugger.

This environment calls for a new type of debugger that is not only
aware of network behavior, but that lets the user observe and control
it. To do so, we propose creating a controllable network environment
by running each node in the system in an emulator, and connecting them
via a simulated network. In addition to providing features analogous
to regular debuggers, such as observation of network traffic and node
state, breakpointing, and stepping, such a system provides new
capabilities for experimenting with the network. The network's behavior can be
changed --- perhaps by dropping, delaying, or reordering packets, or
by failing nodes --- to see how the system being studied responds.
Since the number of variables that can be changed is so large, the
debugger will take snapshots of each emulator's state, so a developer
can roll back the state of the system to virtually any past point,
change the network's characteristics, and play the system forward
again to compare the system's behavior before and after the change.
% XXX Rollback isn't just a cool feature, it's a necessity for
% effective debugging and experimentation with network systems.

This approach is well suited for simulating and debugging low-level
systems, such as network stacks or applications like RMTP that depend
on network behavior.  Example uses include testing new congestion
control implementations for TCP on various links, streaming algorithms
in the presence of packet loss, or fault-tolerant systems during node
failures.

\Name\ is not intended for high-level TCP-based distributed systems
since the TCP layer, by design, masks or dampens the application-level
impact of most network behavior, such as dropped or reordered packets,
thus reducing the utility of network-level debugging.  However, the
ability to crash nodes and affect latencies may still prove useful for
experimenting with certain high-level systems.

\section{Implementation Plan}
\label{sec:implementation}

We will build this environment using an emulator (probably QEMU) for
each node, connected via a network simulator such as dummynet or
ModelNet. The principal challenge will be making the system scalable:
the emulation will need to be spread over multiple physical machines
in order to support a reasonable number of simulated nodes. Simulation
and control overhead will need to be minimized. The goal will be to
make the number of simulatable nodes scale approximately linearly with
the available physical hardware. The other major challenges relate to
synchronization of virtual nodes and rollback of global system state.
The nodes' clocks must be kept synchronized to ensure that packets are
delivered with the correct latency. To be able to roll back the
system's state, a snapshot must be taken after each packet, so we will
need to improve the efficiency of QEMU's snapshot functionality.
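
As a rough illustration of the synchronization requirement, the
following Python sketch shows one way a simulated network could hold
packets in a queue ordered by virtual delivery time and release them
only as a shared virtual clock advances. It is a minimal sketch under
our own assumptions, not the interface of dummynet or ModelNet: the
names \texttt{Packet}, \texttt{SimulatedNetwork}, \texttt{send}, and
\texttt{advance\_to} are hypothetical, and a single fixed link latency
is assumed.

\begin{verbatim}
import heapq
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass(order=True)
class Packet:
    deliver_at: float   # virtual delivery time, in seconds
    seq: int            # tie-breaker that preserves send order
    dst: str = field(compare=False)
    payload: Any = field(compare=False)


class SimulatedNetwork:
    """Hypothetical virtual-time packet scheduler (illustration only)."""

    def __init__(self, latency: float) -> None:
        self.latency = latency          # assumed fixed link latency
        self.now = 0.0                  # shared virtual clock
        self._queue: List[Packet] = []  # min-heap ordered by deliver_at
        self._seq = 0

    def send(self, send_time: float, dst: str, payload: Any,
             extra_delay: float = 0.0, drop: bool = False) -> None:
        """Queue a packet; the debugger may delay or drop it."""
        if drop:
            return
        deliver_at = send_time + self.latency + extra_delay
        heapq.heappush(self._queue,
                       Packet(deliver_at, self._seq, dst, payload))
        self._seq += 1

    def advance_to(self, t: float) -> List[Tuple[float, str, Any]]:
        """Advance the clock to t and return packets due by then."""
        if t < self.now:
            raise ValueError("virtual time cannot move backward here")
        delivered = []
        while self._queue and self._queue[0].deliver_at <= t:
            pkt = heapq.heappop(self._queue)
            delivered.append((pkt.deliver_at, pkt.dst, pkt.payload))
        self.now = t
        return delivered
\end{verbatim}

In this sketch, reordering falls out of per-packet delays: a packet
sent earlier with a larger \texttt{extra\_delay} is delivered after a
later one. Moving the clock backward is deliberately rejected, since
that would be the job of the rollback mechanism rather than the
scheduler.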

The major tasks to be completed, in approximate order, are as
follows:
\begin{itemize}
\item Incremental local snapshots: add support for incremental
  snapshots to QEMU to make snapshotting efficient.
\item Emulated node state examination: console aggregation, syslog
  redirection, etc.\ for monitoring nodes, and (simple) user
  interface.
\item Simulated networking between emulated nodes, including link
  properties.
\item Packet-level operations, such as examination, dropping, and
  delaying.
\item Global snapshot synchronization.
\item Global rollback.
\end{itemize}
\end{itemize}
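
To make the incremental snapshot and rollback tasks concrete, the
following Python sketch models one node's memory as a list of pages
and records, for each snapshot, only the prior contents of pages
written after it. This is an in-memory illustration under our own
assumptions, not QEMU's snapshot mechanism: the class
\texttt{IncrementalSnapshots} and its methods are hypothetical names.

\begin{verbatim}
from typing import Dict, List


class IncrementalSnapshots:
    """Hypothetical per-node undo log (illustration only)."""

    def __init__(self, num_pages: int) -> None:
        self.pages: List[bytes] = [b""] * num_pages
        # _undo[i] maps a page index to the contents that page had
        # when snapshot i was taken; only pages written after
        # snapshot i (and before snapshot i+1) appear in it.
        self._undo: List[Dict[int, bytes]] = []

    def snapshot(self) -> int:
        """Take a snapshot; its cost is independent of memory size."""
        self._undo.append({})
        return len(self._undo) - 1

    def write(self, page: int, data: bytes) -> None:
        """Write a page, saving old contents on the first write."""
        if self._undo:
            self._undo[-1].setdefault(page, self.pages[page])
        self.pages[page] = data

    def rollback(self, snap_id: int) -> None:
        """Restore the state as of snapshot snap_id."""
        if not 0 <= snap_id < len(self._undo):
            raise IndexError("unknown snapshot")
        # Undo newest changes first so the oldest saved contents win.
        for undo in reversed(self._undo[snap_id:]):
            for page, old in undo.items():
                self.pages[page] = old
        del self._undo[snap_id + 1:]   # later snapshots are discarded
        self._undo[snap_id].clear()    # no writes since snap_id now
\end{verbatim}

Because each snapshot stores only the pages dirtied since the previous
one, taking a snapshot after every packet stays cheap when few pages
change between packets. A global rollback would additionally need to
coordinate such per-node rollbacks with the state of the simulated
network, which this sketch does not attempt.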

\end{document}
