\subsection{Structure Search: Initial Networks}

Before we can proceed to implement and test greedy structure search,
we must generate initial networks. We tested our greedy search
algorithm using the following initial networks:

\begin{itemize}
\item \textbf{Given:} the initial network provided in the data files.
\item \textbf{Unconnected:} an unconnected network --- all variables
  independent
\item \textbf{Connected:} a fully connected network --- each node is a
  child of every node before it in the topological ordering
\item \textbf{Random:} a randomly generated network, produced as
  follows: on each iteration, one parent and one child node are
  chosen, and an edge is added between them if valid. It is valid to
  add an edge if the edge does not already exist, and the child is not
  an ancestor of the parent; these conditions are necessary and
  sufficient for avoiding directed cycles\footnote{This is a subset of
    the rules for which operations are valid, described in Task 2,
    since here we are only adding edges and there we also consider
    reversing edges.} This process is repeated until there are $n$
  edges in the network, where $n$ is the number of nodes in the
  network.
\end{itemize}

The initial networks used in our experiments can be found in the first
graph of
Figures~\ref{fig:struct-a-given}~and~\ref{fig:struct-b-given} for
Given networks, 
Figures~\ref{fig:struct-a-unconnected}~and~\ref{fig:struct-b-unconnected} for
Unconnected networks,
Figures~\ref{fig:struct-a-connected}~and~\ref{fig:struct-b-connected} for
Connected networks, and
Figures~\ref{fig:struct-a-random1}--~\ref{fig:struct-a-random3}~and~\ref{fig:struct-b-random1}--~\ref{fig:struct-b-random3} for
Random networks.

\subsection{Implementation Details}

In this section, we discuss the implementation and results of our
local structure search algorithms.

\subsubsection{Structure-search Framework}

In planning our structure-search implementation, we made the
observation that all our search algorithms worked via a process of
iterative refinement: at any given point there is a single ``current
network structure,'' which may be paired with a ``best known
operator'' (if one has been found). Given the potential difficulty and
inefficiency of creating deep copies of the provided BayesNet objects,
we decided to implement our structure search based on a paradigm where
there was a single ``working copy'' of the BayesNet object, which
could be temporarily transformed in order to determine the score which
would result from various strucure-search operators.

To this end, we decided to implement a BNOp (Bayes Net Operation)
object class which could be used to represent any given operation.
When created, a BNOp object specifies the structure-operator (add,
delete, reverse), a source node, and a target node. We also
implemented an operator-validity test method which, given a current
network structure, evaluates the legality of the network which would
result from the application of a given BNOp.

In order to test operation validity without actually modifying the
network structure, we introduced a couple of concepts of
node-relationship: Ancestors and GrandAncestors. $A$ is an Ancestor of
$B$ if there is a directed route from $A$ to $B$. $A$ is a
GrandAncestor of $B$ iff there is a directed route from $A$ to $B$
which contains more than one edge\footnote{I.e. regardless of whether
$A$ is a \textit{direct} parent of $B$, is $A$ an \textit{indirect}
ancestor of $B$?}.

Operation validity was determined as follows:
\begin{itemize}
\item \textbf{Delete} Any existing graph edge
could be deleted\footnote{delete($a$, $b$) is valid IFF $a$ is a
parent of $b$}.

\item \textbf{Add} An edge could be added if both (1) it did not already
exist and (2) it did not create any directed cycles in the
graph\footnote{add($a$, $b$) is valid IFF ($a$ is not a parent of
$b$)$\wedge$($b$ is not an Ancestor of $a$)}.

\item \textbf{Reverse} Any existing graph edge
could be reversed if it did not create any directed cycles in the
graph\footnote{reverse($a$, $b$) is valid IFF ($a$ is a parent of
$b$)$\wedge$($a$ is not a GrandAncestor of $b$)}.

\end{itemize}

One may note that our Reverse-edge operator is directional -- if
Reverse($A$, $B$) is valid, Reverse($B$, $A$) is never valid; we
deliberately chose to make this adjustment so there would be no
redundancy among the valid operations\footnote{Redundancy among the
structure-search operations causes wasted work when one is
exhaustively calculating the scores of all operations. Given that
scoring is relatively slow, we decided to optimize by scoring each
reversal only once. Note that this optimization is functionally
equivalent to the non-optimized form, aside from reducing the
probability of randomly selecting a Reverse operation.}.

It can be shown that this formulation of operation validity is closed
over the set of legal network structures. (I.e., given a legal network
structure, one cannot generate a nonlegal network from the application
of BNOps which meet the above conditions for validity.)

Finally, we implemented an apply/unroll mechanism for BNOps, such that
they could be applied (and optionally later unrolled in the order of
original application) to a given BayesNet object. Thus, given a
current network structure, the BIC score of a given BNOp $X$ could be
calculated idempotently by applying $X$, calculating the MLE-based BIC
score, and then unrolling $X$.

\subsubsection{Greedy Hill-Climbing Implementation}

With this framework of Bayes Net operations, it is straightforward to
implement the greedy structure search algorithm. We begin with one of
the initial networks described in Task 1. At each step, we generate
all possible valid operations, compute the CPT values that best fit
the data using maximum likelihood estimation, and determine the BIC
score. We choose the operation that leads to the highest BIC score;
ties are broken randomly. If this operation improves the BIC score, it
is applied and the process is repeated; if no operation leads to a
better BIC score, we have reached a local maximum, and stop.

Note that the random tiebreaking process means that even with the same
initial network, running the greedy structure search algorithm can
lead to two different final networks. Indeed, even though the
randomization happens only when breaking a tie, the final BIC score
can be different since the different resulting networks can lead down
different paths of operations. We did not, however, observe any
nondeterministic results given a nonrandom initialization method;
repeating the procedure with the same initialization method gave the
same results\footnote{Accordingly, we show only one set of results for
  the non-random initialization methods}. From
this we may conclude that our search process most likely did not
encounter many ties in BIC scoring. On the other hand, as is described
in Section~\ref{sec:greedresult}, many different final networks are
possible when starting from a random initial network.
% XXX Mention whether we observed this?
% --> we do, described in 3rd paragraph of next section.

\subsubsection{First-Ascent Hill-Climbing Implementation}

Given the above implementation of Greedy Structure Search, it is
fairly straightforward to adjust the algorithm to perform First-Ascent
Hill-Climbing structure search. In first-ascent search, the
currently-valid operations are examined (in random order) until one is
found which results in an BIC-score improvement over the current graph
state. That operation is applied to yield a new ``current graph
state,'' and first-ascent is applied iteratively to the resulting
graph. When the graph reaches a state where no operation would improve
the BIC score, we have reached a local maximum, and stop.

Note that, at any given state, there are likely to be multiple
operations which would cause a scoring improvement. Unlike greedy
search, first-ascent search picks randomly among these operations;
thus, when attempting to find an optimal graph (given a particular
initial search state), first-ascent search is much more likely than
greedy search to climb to different local maxima if invoked multiple
times. Hence, nondeterministic behavior is relatively likely in
first-ascent search.

\subsection{Results}

\subsubsection{Greedy Structure Search Results}
\label{sec:greedresult}

In the case of the ``Data A'' network, where the supplied network
structure was presumably similar to that used in generating the data,
it is unsurprising that the very fastest search occurred when starting
with the given network configuration. Obviously, if one deliberately
starts out very close to a (local) maximum, it doesn't take many steps
to reach that maximum.

Aside from the above case, the fastest search was performed when
starting with an initially-unconnected network (regardless of which
dataset was used), as can be seen in Table~\ref{tab:scoretime}.
Similarly, the slowest search occurred when using the fully-connected
initialization method. These results should not be surprising either;
the dominating factor in our search times are the MLE/BIC
calculations, and these are fastest for small CPTs (as in the
fully/mostly disconnected case) and slowest for high-dimensional CPTs
(as occur in the fully-connected case). Furthermore, the
fully-connected initialization is also penalized by the fact that it
has to make $O(n^2)$ decisions about which edges to delete; it starts
with a heavily overspecified structure and must take many more steps
to reach a local maximum than in the case of an initially-unconnected
network. In the two experiments we ran, the fully-connected initial
network led to the best final structures, though it is not clear
whether this is just a property of the data sets we were considering,
or applied more generally. However, it took much longer than the other
methods to find a local maximum, so this extra computation time may not be
worthwhile.

In the case of the random initialization, we observed one of a variety
of different results each time we ran the search algorithm. This is
because the randomly-generated starting structures cause the greedy
hill-climbing search to begin at randomly selected points in the
search space, and there exist multiple local maxima towards which the
greedy search ascended. It is noteworthy that, for both networks, the
random initialization led in some cases to the same best final result
that we obtained from starting with a fully-connected
network. However, it did so much faster, requiring fewer steps and
less time to reach the local maximum. This suggests that starting from
a random initial configuration is a worthwhile approach for learning a
network structure, though the nondeterminism implies that it may be
best to run the greedy search procedure more than once on different
random networks.

\subsubsection{First-Ascent Search Results}

As can be seen in Table~\ref{tab:scoredim}, first-ascent search (with
AIC or BIC) yielded graphs of comparable complexity (i.e. dimension)
to greedy search performed with the same metric (respectively).
However, as shown by Table~\ref{tab:scoretime}, we can see that
first-ascent was typically significantly faster than greedy search.
Furthermore, Table~\ref{tab:scorequal} demonstrates that first-ascent
search typically achieved metric scores which were only slightly worse
than those for greedy search. However, the standard deviation of
scores were substantially greater for first-ascent search, due to the
greater degree of nondeterminism involved.

% In the case of the ``Data A'' network, first-ascent search with BIC
% scoring generally yielded graphs fairly similar to those produced by
% Greedy search. (As with Greedy search, use of AIC with first-ascent
% search resulted in higher-dimension structures than when using BIC, as
% seen in Figures~\ref{fig:struct-a-unconnected} and
% \ref{fig:struct-a-random1}.)
% 
% In the case of the ``Data B'' network,  

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "writeup"
%%% End: