We implemented both likelihood weighting and Gibbs sampling
techniques. To evaluate their effectiveness, we first investigated the
burn-in time for Gibbs sampling: the number of initial samples that
must be discarded before reaching the stationary distribution. We then
used this information when comparing the quality of likelihood
weighting and Gibbs sampling to the exact solution, for various
numbers of samples.

\subsection{Gibbs Sampling Burn-In Time}
\label{sec:gibbs-burnin}

The Gibbs sampling algorithm begins with a random sample that ignores
the evidence, and so requires some time for the Markov chain to
converge to its stationary distribution. Hence, the earliest samples
generated are not likely to be accurate, and we can expect that
discarding them will improve the overall accuracy of the algorithm.

To test this hypothesis, we experimented with the fraction of samples
that would be discarded. We ran experiments where 10,000 samples were
taken, and an initial prefix was discarded. Note that this means the
number of samples actually used decreases as the prefix length
increases. We tested the Gibbs sampler with varying prefix lengths on
queries in the Burglary (Figure~\ref{fig:burnin-burglary}), Insurance
(Figure~\ref{fig:burnin-insurance}), and Carpo
(Figure~\ref{fig:burnin-carpo}) networks, and plotted the
Kullback-Leibler divergence relative to the exact distribution. Each
experiment was repeated five times, and the mean and standard
deviation are shown.

\begin{figure}[tbhp]
  \centering
  \input{figures/gibbs-burnin-query12}
  \caption{Result quality vs. fraction of discarded samples, Burglary network}
  \label{fig:burnin-burglary}
\end{figure}

\begin{figure}[tbhp]
  \centering
  \input{figures/gibbs-burnin-query21}
  \caption{Result quality vs. fraction of discarded samples, Insurance network}
  \label{fig:burnin-insurance}
\end{figure}

\begin{figure}[tbhp]
  \centering
  \input{figures/gibbs-burnin-query24}
  \caption{Result quality vs. fraction of discarded samples, Carpo network}
  \label{fig:burnin-carpo}
\end{figure}

The Burglary and Carpo networks do not give interesting results: the
divergence remains very low with little variation until the burn-in
time is increased to a large fraction (0.7 or greater). This suggests
that the samples quickly converge to the stationary distribution; then
there are so few samples from before the Markov chain has converged
that they are easily outweighed by the rest of the sample, and so
discarding them has little effect. It isn't surprising that this is
the case for the Burglary and Carpo networks, since these are very
simple queries. The Burglary network is quite small, so only a few
passes will be required before all the nodes have been sampled from a
distribution that takes the evidence into account. The queries in the
Carpo network are independent of the evidence, which means both that
much of the network can be ignored, and that sampling is from the
prior distribution so there is effectively no mixing time. Finally, we
observe that with a very large fraction of samples discarded, the
divergence increases, as well as its variation between experiments ---
this is also no surprise, since the distribution is now being computed
from a smaller number of samples, increasing the error.

The Insurance network query (Figure~\ref{fig:burnin-insurance})
displays the expected pattern: if few or no samples are discarded, the
initial samples introduce error into the sampled distribution, and if
too many samples are discarded, the error increases because the
effective sample size is shrinking. The minimum error occurs somewhere
in the middle. For this query, the graph shows that discarding the
first 0.4 fraction of the samples gives the best results; the other
Insurance queries (not shown) agree with this result. There is quite a
bit of noise in the data, perhaps due to the influence that the random
starting positions have on the Markov chain's trajectory.

\subsection{Quality of Approximate Inference}
\label{sec:sampling-quality}

To evaluate the quality of these inference algorithms, we tested them
on four queries:

\begin{align*}
  \mathbf{(1)} \qquad & \text{Insurance network:} &P(PropCost | Age=Adolescent,
  Airbag=False,\\ & & Mileage=TwentyThou, MakeModel=Luxury) \\
  \mathbf{(2)} \qquad & \text{Insurance network:} & P(PropCost | Age=Adolescent,
  Airbag=False, \\ & & Mileage=TwentyThou, GoodStudent=True) \\
  \mathbf{(3)} \qquad & \text{Carpo network:} & P(N104|N41=2, N84=1, N116=0) \\
  \mathbf{(4)} \qquad & \text{Carpo network:} & P(N73|N152=1, N116=0, N43=1) 
\end{align*}

For each query, we ran both likelihood weighting and Gibbs sampling
with varying numbers of samples, and plotted the Kullback-Leibler
divergence of the sampled distribution relative to the exact
distribution. For Gibbs sampling, an initial prefix of 0.4 times the
number of samples was discarded for the burnin period, as per our
results in Section~\ref{sec:gibbs-burnin}. Likelihood weighting
results are shown in
Figures~\ref{fig:qual-lw-1},~\ref{fig:qual-lw-2},~\ref{fig:qual-lw-3},~and~\ref{fig:qual-lw-4}
(at the top of each page); Gibbs sampling results are shown in
Figures~\ref{fig:qual-gibbs-1},~\ref{fig:qual-gibbs-2},~\ref{fig:qual-gibbs-3},~and~\ref{fig:qual-gibbs-4}
(at the bottom of each page). Each experiment was performed ten times;
the individual experiment curves are shown on the left graph, and the
mean and standard deviation are plotted on the right graph.

To put the number of samples in context, we compared the runtime of
the variable elimination algorithm with the time taken for each
sample. Table~\ref{tab:sampling-runtime} shows the results. For VE,
the greedy elimination order was used, as it gave the best
performance.\footnote{Note that the runtimes for greedy VE in this
  experiment may differ from those presented in
  Section~\ref{VEAnalysis}, as they were performed on a different test
  system.} From this, we find that on the order of 500 LW samples or
5000 Gibbs samples can be performed in the same time as VE for the
Insurance network queries; for the Carpo network queries, these
numbers are approximately 250 and 2500 respectively. The tenfold
increase in sampling speed for the Gibbs sampler relative to LW comes
from the fact that the LW sampler must evaluate and resample the
entire network for each sample, while the Gibbs sample only needs to
resample one node. The Insurance network contains 27 nodes, so an
order of magnitude increase in runtime for the LW sampler is expected;
we see only a $\approx$10-fold rather 27-fold increase because the
Gibbs sampler must evaluate all the children of the selected node in
addition to the node itself.

\begin{table}[htbp]
  \centering
  \begin{tabular}{|c|ccc|}
    \hline
    \textbf{Query} & \textbf{VE runtime} & \textbf{LW time/sample} &
    \textbf{Gibbs time/sample} \\
    \hline
    1 & 149 & 0.479 & 0.0365 \\
    2 & 271 & 0.515 & 0.0397 \\
    3 & 35 & 0.151 & 0.0131 \\
    4 & 14 & 0.058 & 0.0062 \\
    \hline
  \end{tabular}
  \caption{Runtimes of sampling and VE algorithms (all times in ms)}
  \label{tab:sampling-runtime}
\end{table}

Comparing Figures~\ref{fig:qual-lw-1}--\ref{fig:qual-gibbs-2}, we gain
some insight about whether likelihood weighting or Gibbs sampling is
preferable for different kinds of queries in the Insurance network. In
query 1, Gibbs sampling is clearly preferable: with as low as 200
samples, the average KL divergence is below 5; likelihood weighting
requires 500-1000 samples to reach this level (which takes many times
longer). This means that likelihood weighting has little or no
utility for this query, since in the time required to perform 500 LW
samples, the exact solution could be computed with variable
elimination. Inspecting the network, we can see why this query is a
poor candidate for likelihood weighting: the evidence combination of
$Age = Adolescent$ and $MakeModel = Luxury$ is rare, so most samples
generated will have low weights, and there will be a large amount of
error unless many samples are used. Gibbs sampling does not have this
problem.

For query 2, likelihood weighting performs better than Gibbs
sampling. Likelihood weighting gives a divergence slightly greater than 5 with
only 20 samples, and below 1 when more than 40 samples are used. Gibbs
sampling gives reasonable results when 1000 or more samples are used
(KL divergence of about 5 with 1000 samples and less than 2 for
more than 4000). However, these results are still inferior in quality than
those from likelihood weighting, and they require more time. In
addition, Gibbs sampling has a higher variance of divergence: even
when repeating the experiment with the same number of samples, it is
possible to have an unlucky run that gives a large error. Likelihood
weighting shows much less variance. This query is well-suited for
likelihood weighting, since most of the evidence variables ($Age =
Adolescent$, $Mileage=TwentyThou$, $GoodStudent=True$) are located
near the top of the graph, so forcing them to take their observed
values does not greatly impact the particle weight.

The Carpo queries (3 and 4) are quite a different case because the query
variables are $d$-separated from the evidence variables. Hence, we are
effectively sampling the query values from the prior
distribution. Both sampling algorithms give very good results for such
queries. Likelihood weighting works well because the query variables
are near the top of the network --- the query variable in query 3 has
only one ancestor, and the variable in query 4 has none at all!
Indeed, for query 3 likelihood weighting will give a KL divergence of
0.5 with as few as 10 samples, and a divergence below 0.1 with 100 or
more; Gibbs sampling gives a KL divergence below 0.5 with 200 samples,
and below 0.1 with 1000 samples. For query 4, the results are even
better: a divergence below 0.05 can be obtained with as few as 10 LW
samples or 100 Gibbs samples. We can conclude that queries of this
sort can be sampled very accurately without much effort --- but this
is not a particularly interesting result since the queries are so
simple, with a graph in which most of the nodes can be
ignored. Moreover, the ultimate value of sampling may be limited,
since the exact solution to these queries can be computed in less than
50 ms by variable elimination.

A final observation is that the standard deviation of the divergences
can be quite high when the number of sample is small, especially for
Gibbs sampling. In particular, looking at the graphs of individual
runs in Figures~\ref{fig:qual-gibbs-1}~and~\ref{fig:qual-gibbs-2}, we
see that some of the runs give high-quality results, while there are
some outliers with very large errors. Hence, running Gibbs sampling
with few samples may a good result or a terrible one ---
unfortunately, one cannot tell which. This suggests that if few
samples are performed, repeating the experiment multiple times and
taking the average (or otherwise eliminating outliers) might be useful
to give a better guarantee of a good result. (Of course, this
increases runtime, so it would need to be compared against simply
taking more samples in one experiment.)

\begin{landscape}
  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/lw-quality-query21}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/lw-quality-query21-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for LW sampling, query 1}
    \label{fig:qual-lw-1}
  \end{figure}

  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query21}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query21-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for Gibbs sampling, query 1}
    \label{fig:qual-gibbs-1}
  \end{figure}
\end{landscape}

\begin{landscape}
  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/lw-quality-query22}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/lw-quality-query22-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for LW sampling, query 2}
    \label{fig:qual-lw-2}
  \end{figure}

  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query22}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query22-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for Gibbs sampling, query 2}
    \label{fig:qual-gibbs-2}
  \end{figure}
\end{landscape}

\begin{landscape}
  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/lw-quality-query23}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/lw-quality-query23-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for LW sampling, query 3}
    \label{fig:qual-lw-3}
  \end{figure}

  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query23}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query23-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for Gibbs sampling, query 3}
    \label{fig:qual-gibbs-3}
  \end{figure}
\end{landscape}

\begin{landscape}
  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/lw-quality-query24}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/lw-quality-query24-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for LW sampling, query 4}
    \label{fig:qual-lw-4}
  \end{figure}

  \begin{figure}[H]
    \centering
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query24}
    \end{minipage}
    \begin{minipage}{4in}
      \input{figures/gibbs-quality-query24-agg}
    \end{minipage}
    \caption{Divergence vs number of samples for Gibbs sampling, query 4}
    \label{fig:qual-gibbs-4}
  \end{figure}
\end{landscape}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "report"
%%% End: