%-*- Mode: latex-mathematica; fill-column: 75 -*- %---TeX-start-of-header--- \documentclass[titlepage]{article} \usepackage{mathematica,figi, alltt} \mmanobreak \newenvironment{implementation}% {\vspace{-0.4cm}\footnotesize\textbf{Implementation notes: }}{\bigskip} \begin{comment} \begin{mathematica} <Infinity]; zero = 0.00001; ohide = DisplayFunction->Identity; oshow = DisplayFunction -> $DisplayFunction; ojoin = PlotJoined -> True; oall = PlotRange->All; osurf = HiddenSurface->False; deg = N[Degree]; pi = N[Pi]; \end{mathematica} \end{comment} \begin{document} \psfiginit %---TeX-end-of-header--- \author{Tomasz J.\ Cholewo} \title{Design and Analysis of Computer Algorithms, EMCS\,619\\ Computer Project \#3\\[0.5in] Pattern Matching Algorithms} \maketitle %==================================================================== \section{Listings of \Mma{} Functions} \subsection{Straightforward Matching} \subsubsection{Function \texttt{SFmatch}} Function \texttt{SFmatch} performs a striaghtforward search for all occurences of the pattern \texttt{p} in the text \texttt{t} and returns a list of positions where the match occured. Setting the third optional parameter \texttt{mode} to \texttt{Debug} causes the function to return the number of character comparisons performed. \begin{mathematica} SFmatch[t_String, p_String, mode_:Silent] := Module[{i, j, n = StringLength[t], m = StringLength[p], comp = 0, matches = {}}, i = 1; While[ m > 0, j = 1; While[ j <= m && i <= n, If[ comp++; StringTake[t, {i}] == StringTake[p, {j}], i++; j++, i = i - j + 2; j = 1 ] ]; If[ j > m, AppendTo[ matches, i - m ]; i = i - m + 1, Break[] ] ]; If[ mode === Debug, comp, matches ] ] \end{mathematica} \begin{implementation} Text and pattern are implemented as \Mma{} strings. Construct \texttt{StringTake[s, {i}]} returns $i$th character of the string \texttt{s}. \end{implementation} Demonstration: \begin{mathematica} SFmatch["ppppapp", "ppa"] . = {3} \end{mathematica} %============================================================ \subsection{Knuth-Morris-Penrose Matching} %----------------------------------------------------------------- \subsubsection{Function \texttt{KMPsetup}} Function \texttt{KMPsetup} calculates an array \texttt{fail} describing a finite state automaton used by the Knuth-Morris-Penrose algorithm. \begin{mathematica} KMPsetup[p_String] := Module[{k, r, fail = {0}}, For[ k = 2, k <= StringLength[p], k++, r = fail[[k - 1]]; While[ r > 0 && StringTake[p, {r}] != StringTake[p, {k - 1}], r = fail[[r]] ]; AppendTo[ fail, r + 1 ]; ]; fail ] \end{mathematica} Demonstrations: \begin{mathematica} KMPsetup /@ {"aabaabb", "ababcb"} . = {{0, 1, 2, 1, 2, 3, 4}, {0, 1, 1, 2, 3, 1}} \end{mathematica} %----------------------------------------------------------------- \subsubsection{Function \texttt{KMPmatch}} Function \texttt{KMPmatch} returns a list of positions where a match between text \texttt{t} and a pattern \texttt{p} occured. \begin{mathematica} KMPmatch[t_String, p_String, mode_:Silent] := Module[{i, j, fail = KMPsetup[p], n = StringLength[t], m = StringLength[p], comp = 0, matches = {}}, i = 1; While[ m > 0, j = 1; While[ i <= n && j <= m, If[ j == 0 || (comp++; StringTake[t, {i}] == StringTake[p, {j}]), i++; j++, j = fail[[j]] ] ]; If[ j > m, AppendTo[matches, i - m]; i = i - m + 1, Break[] ] ]; If[ mode === Debug, comp, matches ] ] \end{mathematica} Demostration: \begin{mathematica} KMPmatch["zuppapappa", "p"] . = {3, 4, 6, 8, 9} \end{mathematica} %============================================================ \subsection{Boyer-Moore Matching} %----------------------------------------------------------------- \subsubsection{Function \texttt{computejumps}} Function \texttt{computejumps} calculates an array based on single character positions used in pattern shifting. Here it is shown for demonstration purposes only as it is inlined in the main algorithm later. \begin{mathematica} computejumps[p_String] := Module[{charjump, m = StringLength[p]}, MapIndexed[ (charjump[ #1 ] = m - #2[[1]])&, Characters[p] ]; charjump[_] := m; charjump /@ Characters["abcde"] ] \end{mathematica} Demonstration: \begin{mathematica} computejumps["abcde"] . = {4, 3, 2, 1, 0} \end{mathematica} %----------------------------------------------------------------- \subsubsection{Function \texttt{computecharjumps}} Function \texttt{computematchjumps} finds an array storing information about how many places can a pattern \texttt{p} be shifted to the right after a mismatch. As discussed, the last part of the algorithm introduced by K.\ Mehlhorn is performed only if a partial match from 1 to \texttt{q} was found, i.e., $\texttt{q} < \texttt{m}$. \begin{mathematica} computematchjumps[p_String] := Module[{k, q, qq, m = StringLength[p], back}, back = Table[x, {m}]; matchjump = Table[2m - k, {k, 1, m}]; k = m; q = m + 1; While[ k > 0, back[[k]] = q; While[ q <= m && StringTake[p, {k}] != StringTake[p, {q}], matchjump[[q]] = Min[ matchjump[[q]], m - k ]; q = back[[q]]; ]; k--; q-- ]; For[ k = 1, k <= q, k++, matchjump[[k]] = Min[ matchjump[[k]], m + q - k ]; ]; If[ q < m, (* <- my modification *) qq = back[[q]]; While[ q <= m, While[ q <= qq, matchjump[[q]] = Min[ matchjump[[q]], qq - q + m ]; q++ ]; qq = back[[qq]]; ]; ]; matchjump ] \end{mathematica} Demonstrations (from the textbook): \begin{mathematica} computematchjumps /@ {"wowwow", "ab"} . = {{8, 7, 6, 7, 3, 1}, {3, 1}} \end{mathematica} %----------------------------------------------------------------- \subsubsection{Function \texttt{BMmatch}} Function \texttt{BMmatch} performs Boyer-Moore matching of text \texttt{t} with pattern \texttt{p}. All matching posiotions are returned in a list. Setting parameter \texttt{mode} to \texttt{Debug} causes that the number of comparisons is returned. \begin{mathematica} BMmatch[t_String, p_String, mode_:Silent] := Module[{i, j, n = StringLength[t], m = StringLength[p], comp = 0, charjump, matchjump = computematchjumps[p], matches = {} }, MapIndexed[ (charjump[ #1 ] = m - #2[[1]])&, Characters[p] ]; charjump[_] = m; i = m; While[ m > 0, j = m; While[ i <= n && j > 0, If[ (comp++; StringTake[t, {i}] == StringTake[p, {j}]), i--; j--, i = i + Max[ charjump[ StringTake[t, {i}] ], matchjump[[j]] ]; j = m ] ]; If[ i <= n, AppendTo[matches, i + 1]; i = i + m + 1, Break[]; ] ]; If[ mode === Debug, comp, matches ] ] \end{mathematica} \begin{implementation} Function \texttt{computejumps} is inlined to allow localizing its definition. It works for any input and pattern alphabet. All characters which didn't occur in the pattern are assigned value \texttt{m}. \end{implementation} Demonstration: \begin{mathematica} BMmatch["zupappppa", "pa"] . = {3, 8} \end{mathematica} %==================================================================== \newpage \section{Comparison} The three above algorithms will be compared on the three patterns: \begin{mathematica} testpatterns = {"abc", "dddac", "aaffaaf"}; \end{mathematica} The test text string is generated by means of the uniform random number generator with range 0--4 which is used for selecting each letter independently. \begin{mathematica} testtext = StringJoin[ Table[ FromCharacterCode[ Random[Integer, 4] + ToCharacterCode["a"] ], {5000}]]; \end{mathematica} We can check that the distribution is really close to uniform by counting the number of occurences of single letters in the test text. \begin{mathematica} Length[SFmatch[testtext, #]]& /@ {"a", "b", "c", "d", "e"} . = {1023, 971, 993, 1047, 966} \end{mathematica} The expected number of occurences of a pattern of lenght $m$ in a text of length $n$ where each character's occurence does not depend on its neighborhood is $ n / s^m $ where $s$ is the size of alphabet used in the text. It can be corroborated to some extent by the following table: \begin{mathematica} TableForm[{Length[BMmatch[testtext, #]], StringLength[testtext] / 5.^StringLength[#]}& /@ testpatterns, TableHeadings->{testpatterns, {"found", "expected"}}] . //TableForm= found expected abc 36 40. dddac 3 1.6 aaffaaf 0 0.064 \end{mathematica} It can be shown that the outputs of all three algorithms are identical (at least in the following case \verb+:-)+): \begin{mathematica} Through[{SFmatch, KMPmatch, BMmatch} [testtext, "abcd"]]//TableForm . //TableForm= 656 786 2247 3104 3694 656 786 2247 3104 3694 656 786 2247 3104 3694 \end{mathematica} The following table summarizes numbers of comparisons performed by all three algorithms on three given test patterns: \begin{mathematica} MatrixForm[(Through[{SFmatch, KMPmatch, BMmatch} [testtext, #, Debug]])& /@ testpatterns, TableHeadings->{testpatterns, {"SF", "KMP", "BM"}}] . //MatrixForm= SF KMP BM abc 6223 6059 2653 dddac 6332 6042 1820 aaffaaf 6233 6023 861 \end{mathematica} \begin{comment} \begin{mathematica} ll = {StringLength[#], Through[{SFmatch, KMPmatch, BMmatch} [testtext, #, Debug]]}&/@ testpatterns . = {{3, {6223, 6059, 2653}}, {5, {6332, 6042, 1820}}, {7, {6233, 6023, 861}}} \end{mathematica} \begin{mathematica} lll= Map[ {#[[1]], N[#[[2]]/StringLength[testtext]]}&, Transpose[Thread[#,List]&/@ll], {2}] . = {{{3, 1.2446}, {5, 1.2664}, {7, 1.2466}}, {{3, 1.2118}, {5, 1.2084}, {7, 1.2046}}, {{3, 0.5306}, {5, 0.364}, {7, 0.1722}}} \end{mathematica} \begin{mathematica} Display["g5a.eps", Show[ListPlot[#, ojoin]& /@lll], "EPS"]; \end{mathematica} \end{comment} The resulting plot is not very precise (Fig.~\ref{g5a}): \figi{g5a}{Number of comparisons per characters passed performed by the three algorithms in function of the pattern length. From top to bottom: SF, KMP, BM.} \begin{comment} \begin{mathematica} ll = {StringLength[#], Through[{SFmatch, KMPmatch, BMmatch} [testtext, #, Debug]]}&/@ (StringJoin[ Table["a", {#}]]& /@ Range[10]) \end{mathematica} \begin{mathematica} lll= Map[ {#[[1]], N[#[[2]]/StringLength[testtext]]}&, Transpose[Thread[#,List]&/@ll], {2}] . = {{{1, 1.}, {2, 1.2046}, {3, 1.2466}, {4, 1.2554}, {5, 1.257}, {6, 1.2576}, {7, 1.2578}, {8, 1.2578}, {9, 1.2578}, {10, 1.2578}}, {{1, 1.}, {2, 1.2046}, {3, 1.2134}, {4, 1.2078}, {5, 1.2064}, {6, 1.2054}, {7, 1.2046}, {8, 1.2046}, {9, 1.2046}, {10, 1.2046}}, {{1, 1.}, {2, 0.6746}, {3, 0.4544}, {4, 0.3342}, {5, 0.2678}, {6, 0.2196}, {7, 0.1872}, {8, 0.1582}, {9, 0.145}, {10, 0.1256}}} \end{mathematica} \begin{mathematica} Display["g5b.eps", Show[ListPlot[#, ojoin]& /@lll], "EPS"]; \end{mathematica} \end{comment} Better plot can be obtained for all pattern lengths from 1 to 10 (Fig.~\ref{g5b}). \figi{g5b}{Number of comparisons per characters passed performed by the three algorithms for pattern $a^m$, $m=1$, $\ldots$, $10$. From top to bottom: SF, KMP, BM.} Results are different from this in the textbook because they don't cover preprocessing costs. Results in the Smit's article which don't consider precomputation are nearly identical to these obtained here, though he used sets of random 20 patterns to obtain averaged results. Increasing the size of the text alphabet practically would not change the performance of the SF and KMP algorithms. For the BM algorithm the number of characters examined would assymptotically go down to $n/m$ on the average. %==================================================================== \end{document}