\documentclass{article}
\input{6823-preamble}
\usepackage{ifsym}

\begin{document}
\date{2006/09/11}
\def\psetheadertitlevar{Self-Test}
\begin{pset}
  \begin{problem}
    \includegraphics[scale=1]{self-test-1}
  \end{problem}

  \begin{problem}
    \begin{subproblem}
      \begin{tabular}{rl}
      \texttt{clk} &\textifsym{LLLHHHLLLLHHHLLLHHHLLLLHHHLLLHHH}\\
      \texttt{e}   &\textifsym{HHHLLLLLLLHHHHHHLLLLLLLHHHHHHLLL}\\
      \texttt{q}   &\textifsym{LLLHHHHHHHHHHHHHHHHHHHHHHHHHHHHH}\\
      \texttt{d}   &\textifsym{HHHlhhHHHHHHLLLLLLHHHHHHHLLLLLLHHH}\\
      \end{tabular}
    \end{subproblem}

    \begin{subproblem}
      \[\frac{1}{t_{clkQ} + t_{inv} + t_{xor}}\]
    \end{subproblem}
  \end{problem}

  \begin{problem}
    \begin{subproblem}
      \begin{lstlisting}[language=C]
        int *a, b, c, d;        // r1, r2, r3, r4 respectively
        for (int i = 0; i < 80; i++) {
          c[i] = a[i] + b[i];
          d[i] = a[i] - b[i];
        }
      \end{lstlisting}
    \end{subproblem}

    \begin{subproblem}
      We can expect segment B to perform better when executed, because
      each iteration of the loop involves only two \texttt{LD}
      operations vs. 4 for segment A.
    \end{subproblem}

    \begin{subproblem}
      No. If the value in $r_3$ is the same as that in $r_1$ or $r_2$,
      then the two segments will behave differently. Specifically,
      using the pseudocode above, if $c = a$, then segment 1 will use
      the new value of $a[i]$ in calculating $d[i]$, whereas segment 2
      will use the old value.
    \end{subproblem}
  \end{problem}

  \begin{problem}
    \begin{subproblem}
      Since the branch condition is determined by the ALU, we don't
      know whether the branch will be taken until the third stage of
      the pipeline, even though two other instructions have already
      begun processing. Adding two delay slots after every branch
      would solve this problem; using branch prediction and
      speculative execution, we can increase performance.
    \end{subproblem}

    \begin{subproblem}
      The value in $r_1$ is updated by the first operation, and
      required for the second operation. However, the register is not
      written back until the first instruction reaches the final stage
      of the pipeline, though the second instruction accesses it from
      the register file in the second stage. Since the needed value is
      computed when the first instruction reaches the ALU stage, which
      is also when the second instruction requires it, we can solve
      this by adding a bypass path which replaces the output of the
      register file read with the output of the ALU when appropriate.
    \end{subproblem}

    \begin{subproblem}
      Again, the value in $r_1$ is needed by the second operation, but
      modified by the first operation. The same bypass path solution,
      however, does not work because the data is not available from
      memory until stage 5 of the pipeline. So, in addition to a
      bypass path from the memory output in stage 5 to the output of
      stage 2, we will need to stall the pipeline for two cycles.
    \end{subproblem}
  \end{problem}

  \begin{problem}
    Caching addresses the increasing gap between processor speed and
    memory access time by storing data that is expected to be accessed
    soon in faster memory. To determine which data to cache, we
    exploit \emph{temporal locality} (data that has been accessed
    recently is likely to be accessed again) and \emph{spatial
      locality} (two pieces of data located near each other are likely
    to be accessed together).

    The simplest organization of a cache is a direct-mapped one, where
    the low-order bits of the memory address identify one of the words
    in the cache. Hence, for any memory address, there is only one
    location in the cache where it can be stored. But there are many
    memory addresses that could be stored in that cache location, so
    in addition to the data, it's also necessary to store a tag
    containing the high-order bits of the memory address. Then, upon a
    memory access, the tag in the relevant cache line can be checked;
    if it does not match, the request will need to be sent to main
    memory. Alternatively, a $n$-way associative cache can be used,
    where the low-order bits specify not a single cache line but a set
    of $n$; the tags can be matched against the memory address in
    parallel. This adds complexity but can improve performance since
    memory access patterns may involve accessing multiple locations
    with the same low order bits. In both cases, the tag also needs
    valid bits to indicate whether the cache entry contains up-to-date
    information; this might not be the case on startup, or if the
    memory is modified by some other mechanism, such as another
    processor or peripheral device.

    The preceding discussion suggests that each cache location holds
    one word; in fact, the cache line size may be considerably
    larger. This can be implemented by using the low-order bits to
    index within a cache line, the middle bits to select a cache line,
    and the high-order bits for the tag. This takes advantage of
    spatial locality; multiple requests for adjacent memory locations
    can be batched rather than having to repeatedly endure the full
    memory access cost each time.
    
    The cache replacement policy dictates which of the $n$ entries in
    a $n$-way associative cache is evicted to make room for the new
    entry after a cache miss. LRU and random replacement policies are
    common; both are provably $k$-competitive with respect to the
    optimal replacement strategy. LRU seems likeliest to keep the most
    current data in the cache, though it can be difficult to
    implement, and it fares poorly on certain degenerate access
    patterns; pseudo-LRU replacement policies are often used. The
    random policy, by contrast, can be simpler to implement.
  \end{problem}

  \begin{problem}
    \begin{tabular}{|c|c|c|c|c|c|c|}
      \hline
      & \textbf{a} & \textbf{b} & \textbf{c} & \textbf{d} & \textbf{e}
      & \textbf{f} \\
      \hline
      \textbf{1} & 2 & 2 & 2 & 2 & 2 & 2 \\
      \hline
      \textbf{2} & 2 & 2 &&&&\\
      \hline
      \textbf{3} & 2 & 2 & 2 &&& \\
      \hline
      \textbf{4} & 2 & 2 & 2 &&& \\
      \hline
      \textbf{5} & 2 & 2 & 2 & 2 & 2 & \\
      \hline
    \end{tabular}
  \end{problem}
\end{pset}
\end{document}
