\documentclass{article}
\input{6046preamble}
\usepackage{amsmath}

\begin{document}
\pset{6-4}{Dan Ports}{No collab}{Recitation \#8}

\paragraph{Algorithm}
We are given a text $T$ of length $n$, a pattern $P$ of size $m$, in an alphabet $\Sigma$. We will transform this into a set of texts and patterns in the alphabet $\{A,G,T,C\}$ that can be handled by Tidor's algorithm. We assume that the elements of $\Sigma$ are the integers $\{0\ldots \abs{\Sigma}-1\}$, and we give a mapping between each element of $\Sigma$ and a sequence of $\lceil\log_4(\abs{\Sigma})\rceil$ elements of the $\{A,G,T,C\}$ alphabet. We write each integer from $0$ to $\abs{\Sigma}$ in base-4, then map the base-4 digits $\{0,1,2,3\} \mapsto \{A,G,T,C\}$ in that arbitrary order. Let $d = \lceil\log_4(\abs{\Sigma})\rceil$. Since $d$ digits are required to write these integers in base-4, we have found a mapping between each element of $\Sigma$ and a sequence of $d$ elements of the $\{A,G,T,C\}$ alphabet.
\paragraph{}
We use this result to generate a set $\{T_1, T_2, \ldots, T_d\}$ of texts each of length $n$ and a set $\{P_1, P_2, \ldots, P_d\}$ of patterns each of lengths $m$. We define these such that $T_i[j]$ is the $i$th element of the $\{A,G,T,C\}$-representation of $T[j]$, and likewise for $P_i[j]$. Thus the texts $T_i$ and patterns $P_i$ are in the $\{A,G,T,C\}$ alphabet, so we can use Tidor's algorithm to perform matches between $T_i$ and $P_i$. We do this (independently) for every $i$ from $1$ to $d$. Assume that the results are placed in a set of arrays $R_i[j]$ that equals $1$ if $T_i$ matches $P_i$ at position $j$. For every $j$ from $0$ to $n$, we output a match at position $j$ if $R_i[j] = 1$ for every $i$ from $0$ to $d$.
\paragraph{Proof of correctness}
Since the elements of $\Sigma$ can be represented by integers from $0$ to $\abs{\Sigma}$, we can certainly represent them uniquely by their base-4 representation, which has $d$ digits. Using any arbitrary mapping between $\{0,1,2,3\}$ and $\{A,G,T,C\}$, we see that we can uniquely represent every element of $\Sigma$ as a sequence of $d$ elements of the $\{A,G,T,C\}$ alphabet. We can therefore represent the text and pattern this way. Splitting the text and pattern by base-4 digit, we can certainly apply Tidor's algorithm to the resulting $d$ sets of texts and patterns. We assume that Tidor's algorithm gives the correct result. Two numbers are certainly equal if and only if every digit in their base-4 representation are equal. Therefore, if we have found a match at position $j$ in each text/pattern pair, it matches for each base-4 digit, and so it matches in the original alphabet $\Sigma$. Thus the algorithm gives the correct result.
\paragraph{Runtime analysis}
We need to transform the pattern and text into their $\{A,G,T,C\}$-representations. This requires $\Theta((n+m)d) = O(nd)$ time (since $m \leq n$). Next, we apply Tidor's algorithm to $d$ individual texts and patterns, all of length $n$ and $m$ respectively. This requires time $d T(n,m)$. Finally, for every position from $1$ to $n$, we need to check whether all $d$ strings indicate a match; this requires $\Theta(nd)$ time. Thus the algorithm overall has the desired runtime of $\Theta((n+T(n,m))d) = \Theta((n+T(n,m))\log\abs{\Sigma})$. 

\end{document}

