\subsection{Parameter Estimation}

We began by implementing maximum likelihood estimation to determine
CPT parameters, for a given network structure. We initially did so by
computing each entry in the CPT as the observed number of samples
satisfying both the appropriate assignment to the node and its
parents, divided by the number of samples satisfying the appropriate
assignment to the parents. We later refined this by adding the
Laplacian correction to provide a prior estimation of the probability
of node values that depend on parent assignments that never occur in
the data.


The Bayes'-net parameters for the provided dataset B, given the
provided network graph, were estimated by our Parameter Estimation
implementation to be:

\input{estBcpts.tex}

\subsection{Structure Scoring Metrics}

Bayes' Net learning involves the comparison of various possible Bayes'
Network graph structures. As such, a scoring metric is required; we
experimented with both the Bayesian Information Criterion (BIC) and
the Akaike Information Criterion (AIC). These scoring mechanisms are
discussed in the following sections.

\subsubsection{The Bayesian Information Criterion}

We used the Bayesian Information Criterion (BIC) as a basis for
evaluating the quality of network structures. The BIC is defined as
follows:

\begin{equation}
  \label{eq:bic}
  \textsf{BIC}(G:D) = \ell\left(\hat{\theta}_G : D\right) -
  \frac{\log_2 M}{2} \text{Dim}[G]  
\end{equation}

The BIC score consists of the log-likelihood of the data given the MLE
estimation, reduced by a structural penalty of $\frac{\log_2 M}{2}
\text{Dim}[G]$, where $\text{Dim}[G]$ is the dimension of the
network. The network dimension is the number of parameters required to
specify the CPTs in the network: the sum over all nodes of the number
of possible assignments for the parents (number of rows in the CPT)
multiplied by the number of possible assignments minus 1 (number of
columns in the CPT). We subtract 1 from the number of columns in the
CPT because we use the observation that the last column can be derived
from the other ones. The effect of this structure penalty is to prefer
graphs with less connections, since the resulting networks are more
efficient to work with and avoid overfitting the data.

Here are the components of the BIC
score we calcuated for our parameter estimation of dataset ``B'' given
the provided network graph:

\begin{tabular}{r|c|c|}
\cline{2-3}
$\log_2$-Likelihood & $l(\hat{\theta}_G : D)$ & -34285.24 \\
\cline{2-3}
Structural Penalty & $\frac{\log_2M}{2}$Dim$[G]$ & 98.30 \\
\cline{2-3}
BIC Score & $S(G : D)$ & -34383.54 \\
\cline{2-3}
\end{tabular}

In the above, the magnitude of the log-likelihood of our data (given
the graph) is much larger than was computed for dataset ``A'', but
this is due to the greater number of data points. This difference in
magnitude may be plausibly explained as follows:

If one makes the (unreasonable) assumption that all 5000 samples in
dataset ``B'' were equally likely, then they each have probability
$2^{-34285/5000} \approx 0.0086$. Given that dataset ``B'' has 8
variables, there are $2^8 = 256$ possible combinations of node values.
If every possible outcome were equally likely, they would each have
probability $2^{-8} \approx 0.0039$, which is not so different from
the ``average'' likelihood achieved by our fit to the data. Thus, it's
reasonable that such a large log-likelihood does not necessarily
indicate a poor fit.

However, as we will see in Section~\ref{sec:greedresult}, this is a
relatively poor score. In particular, none of our Data-B search
attempts achieved a BIC score worse than -31653. Additionally, the BIC
scores we achieved through structure search all had standard
deviations smaller than 19, suggesting that a score difference of 2600
points is a rather significant difference in fit for this data.

\subsubsection{The Akaike Information Criterion}

We chose to also implement the Akaike Information
Criterion\footnote{Akaike, H. (1974). A new look at statistical model
identification. Transaction on Automatic Control 19,716-723.} (AIC)
for comparison with BIC. The AIC metric seemed appropriate for our
purposes, as it has been widely cited in comparison with the BIC
metric for use in scoring Bayes networks.\footnote{See, e.g. \url{http://www.cs.cmu.edu/~awm/15781/slides/Param_Struct_Learning05v1.pdf}}

The AIC is calculated very similarly to the
BIC:

\begin{equation}
  \label{eq:aic}
  \textsf{AIC}(G:D) = \ell\left(\hat{\theta}_G : D\right) - \text{Dim}[G]  
\end{equation}

That is, the AIC score of a particular graph structure is the log
likelihood of the data under the current estimation of parameters for
the graph, minus a structure penalty given simply by the dimension of
the graph. This is identical to the BIC score, except that the AIC
structure penalty does not have a factor of $\frac{\log_2
  M}{2}$. Hence, the AIC score does not penalize the complexity of the
network structure as much as the BIC score, particularly for very
large data sets.

As a result, we expect structure search based on the
AIC metric to produce networks with more dependencies than if the BIC
network is used. Though the resulting networks may produce higher
likelihood of the data, they do so at the cost of a more complex
network: this is undesirable both because the more complex network
requires a longer description and leads to increased runtime for
inference algorithms, and because it can overfit the data. As we
describe in Section~\ref{sec:struct-scor-results}, we did indeed
observe more complex networks when performing structure search using
the AIC metric.

\subsection{Structure Scoring Results and Comparison}
\label{sec:struct-scor-results}

Overall, we observed that AIC tends to yield more complex graphs than
those found when using BIC scoring. Qualitatively, we observed that
the graphs resulting from AIC are much more highly connected than
those resulting from BIC. Quantitatively, one can observe this
difference in ``graph complexity'' by comparing the dimension of the
graphs resulting from use of both scoring metrics.

To better compare our various search and scoring options, we performed
the following experiment: for each of the datasets under consideration
(``A'' and ``B''), we generated 200 randomly-connected initial graph
structures. We then separately applied Greedy+AIC, Greedy+BIC,
First-Ascent+AIC, and First-Ascent+BIC to each of these initial graph
structures, recording various details about each application of all
these search variants.

As can be seen in Table~\ref{tab:scoredim}, the dimension of graphs
produced from the above experiment vary consistently between AIC and
BIC. In particular, the average dimension yielded by all the AIC
data/search combinations are consistently larger than those found with
BIC; similarly, the standard deviations of the graph dimension are
larger for AIC than BIC. Most tellingly, in all but one case, the
smallest graph dimension yielded by AIC is larger than the largest
graph dimension yielded by BIC\footnote{Furthermore, in each case,
the mean BIC dimension is at least 2.7 of AIC's standard devs below
the mean AIC; by Chebyshev's Theorem, we have (as a lower bound) that
at least $1-\frac{1}{2.7^2} \approx 86\%$ of AIC's dimensions are
larger than the average BIC dimension.}. Thus, the bulk of the graphs
produced by AIC have a higher dimension than any of the graphs
produced when using BIC.

One may also compare the runtime of the various searches, and the
number of search steps taken before finding a result in each case. On
average, AIC also takes more time to yield a result, and uses more
search steps, than BIC (as can be seen in
Table~\ref{tab:scoretime}). This is because it requires more steps to
build up the complex resulting structures: AIC will continue to add
new edges to the graph beyond the point at which the BIC algorithm
concludes that additional edges do not provide enough of a likelihood
improvement to offset the increased complexity of the structure, and
considering and applying the new edges requires time.

From these comparisons, one might conclude that AIC tends to overfit
the given data, by producing Bayes' Net structures which are
unjustifiably complex. Furthermore, searching with AIC suffers the
additional cost of slightly lower efficiency than when using BIC. In
contrast, on the datasets under consideration, BIC tends to produce
relatively streamlined graphs; it seems likely that the structures
found by BIC provide more useful approximations of the underlying
Bayesian relationships which generated the data in the first place.


%% This is crying out for a graph. No time.
\begin{table}[ptbh]
 \centering

  \caption{Resulting graph dimensions, AIC vs. BIC}
  \label{tab:scoredim}
  \vspace{1em}

 \begin{tabular}{|l||c|c|c|c||c|c|c|c|}
 \hline
    & \multicolumn{4}{c||}{\textbf{AIC} dim}
    & \multicolumn{4}{c|}{\textbf{BIC} dim} \\
    & $\mu$ & $\sigma$ & min & max & $\mu$ & $\sigma$ & min & max\\
   \hline
   \hline
   \textbf{Data A, First-Ascent} & 8.674  & 0.469  & 8.000  &  9.000 & 7      & 0     & 7.000  &  7.000 \\
   \hline
   \textbf{Data A, Greedy}       & 8.421  & 0.494  & 8.000  &  9.000 & 7      & 0     & 7.000  &  7.000 \\
   \hline
   \textbf{Data B, First-Ascent} & 51.768 & 11.264 & 28.000 & 82.000 & 21.311 & 2.239 & 19.000 & 36.000 \\
   \hline
   \textbf{Data B, Greedy}       & 51.953 & 10.768 & 31.000 & 80.000 & 20.647 & 1.226 & 19.000 & 24.000 \\
   \hline
\end{tabular}
\end{table}

\begin{table}[ptbh]
 \centering

  \caption{Runtime and search steps, AIC vs. BIC}
  \label{tab:scoretime}
  \vspace{1em}

 \begin{tabular}{|l||c|c||c|c|}
 \hline
    & \multicolumn{2}{c||}{\textbf{AIC}}
    & \multicolumn{2}{c|}{\textbf{BIC}} \\
    & $\mu_{time}$ & $\mu_{steps}$ & $\mu_{time}$ & $\mu_{steps}$ \\
   \hline
   \hline
   \textbf{Data A, First-Ascent} & 40.922	& 5.118  & 36.049    & 4.755  \\
   \hline
   \textbf{Data A, Greedy}       & 71.049	& 4.029  & 69.127    & 3.794  \\
   \hline
   \textbf{Data B, First-Ascent} & 34920.358	& 24.907 & 27181.069 & 21.539 \\
   \hline
   \textbf{Data B, Greedy}       & 93357.363	& 15.623 & 78283.201 & 14.225 \\
   \hline
\end{tabular}
\end{table}

\begin{table}[ptbh]
 \centering

  \caption{Result quality, AIC vs. BIC}
  \label{tab:scorequal}
  \vspace{1em}

 \begin{tabular}{|l||c|c|c|c||c|c|c|c|}
 \hline
    & \multicolumn{4}{c||}{\textbf{AIC} score}
    & \multicolumn{4}{c|}{\textbf{BIC} score} \\
    & $\mu$ & $\sigma$ & min & max & $\mu$ & $\sigma$ & min & max\\
   \hline
   \hline
   \textbf{A, FA} & -201.587  & 0.403  & -202.450  &
   -201.236 & -220.545 & 1.476 & -222.823 & -219.493 \\
   \hline
   \textbf{A, Gr}       & -201.391 & 0.233 & -202.450 &
   -201.236 & -220.692 & 1.582  & -222.823 & -219.493 \\
   \hline
   \textbf{B, FA} &  -31439.454 & 3.710  & 
   -31450.532 & -31433.896 & -31562.090 &    18.520 & -31652.759 &
   -31547.159 \\
   \hline
   \textbf{B, Gr}       & -31437.906 &    3.030   &
   -31446.883 & -31433.836 &  -31557.216 &  12.737 & -31613.307 &
   -31547.159 \\
   \hline
\end{tabular}
\end{table}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "writeup"
%%% End: