Exact Matching and Pattern Preprocessing

The exact matching problem is to find all occurrences of a pattern string P of length n in a text string T of length m

Extreme examples:

1. If P = ab and T = bba, there are no occurrences
2. If P = aaa and T = aaaaa, there are three (overlapping) occurrences

The naive approach would attempt to find P at positions 1, 2, 3, ... m-n+1 in T, possibly using n comparisons at each position, resulting in a worst-case running time proportional to mn. This time can be improved by preprocessing the pattern (fundamental preprocessing) as follows:

Define Zi(S) to be the length of the longest prefix of S that matches a substring of S starting at position i.( i > 1 )

For example, Z4(abbcabbdabc) = 0, Z5(abbcabbdabc) = 3, and Z9(abbcabbdabc) = 2

Associated with each Zi, define an interval [li, ri]: (Z-box)

• ri is the furthest position reached by a match starting between 1 and i inclusive
i.e. ri = max j + Zj - 1, where 1 < j <= i
• li is some value of j where this max is attained

Algorithm to compute Zi, li, ri

Input String S

Output: Array of Zi values (li, ri values can also be computed)

Variables: k (iteration number), l (li value from previous iteration), r (ri value from previous iteration)

Initial values: k = 2, l = 0, r = 0

Iteration k:

Case I: If k > r, find Zk explicitly. If this is positive, update r to k+Zk-1 and l to k

If k <= r, let k' = k - l + 1 be the position where S(k) matched when r was last updated

Case IIa: If Zk' < r - k + 1, then Zk = Zk' and l and r are not changed

Case IIb: If Zk' >= r - k + 1, compare characters starting at r+1 with characters starting at r-k+2 until a mismatch occurs at position q. Set Zk = q - k, l = k, r = q-1