Boyer-Moore

The Boyer-Moore Algorithm

Three ideas

Right-to left scan of the pattern at each possible alignment
Precompute R(x) to compute shifts when mismatches occur
Precompute L'(i) and l'(i) using suffix matches for further shifts

Definition: R(x) is the position of the rightmost occurrence of x in P (or zero)

Example: P = actca, so R(a) = 5, R(c) = 4, R(g) = 0, R(t) = 3

T = actgactaactca
P = actca

Bad character shift rule: If the first mismatch is found at position i (when scanning the pattern right to left) and T(k) is the text character that mismatches P(i), then shift the pattern right by max(1, i - R(T(k))

In the above example, i = k = 4 and T(k) = g, so the shift is 4 - R(g) = 4

Extended bad character shift rule: shift so that the next occurrence of T(k) in P (scanning left) is matched to T(k)
Preprocessing requires space proportional to the pattern length rather than the alphabet size

In the above example, the first shift results in

T = actgactaactca
P =     actca

in either case, but now the first rule would only give a second shift of max(1, 4-5) = 1

Good suffix rule: If t is the longest suffix of P that matches T in the current position, then P can be shifted so that the previous occurrence of t in P matches T. In fact, it can be required that the character before the previous occurrence of t be different from the character before the occurrence of t as a suffix. If no such previous occurrence of t exists, then the following cases apply:

Find the smallest shift that matches a prefix of the pattern to a suffix of t in the text
If there's no such match, shift the pattern by n (the length of P)

Note that these cases can be applied even if the length of t is n (a match of P was found in T)

More formally:

L(i) is the largest k < n so that P[i..n] matches a suffix of P[1..k] (or zero if none exists)

L'(i) is the largest k < n so that P[i..n] matches a suffix of P[1..k] and P(i-1) did not match the character before the suffix (or zero if there's no such j)

N_j(P) is the length of the longest suffix of P[1..j] that is also a suffix of P Note: N_j(P) = Z_n-j+1(P^r)

l'(i) is the length of the longest suffix of P[i..n] that is also a prefix of P (or zero if none exists)

Theorem 2.2.2: L(i) is the largest j < n so that N_j(P) >= |P[i..n]| = n - i + 1
L'(i) is the largest j < n so that N_j(P) = |P[i..n]| = n - i + 1

Given this theorem and the above note, one can calculate all the N_j(P) values using the Z (fundamental preprocessing) algorithm, then go through these values in order of j filling in a table of L'(i) = j, where i = n - N_j(P) + 1.

Theorem 2.2.4: l'(i) is the largest j <= |P[i..n]| = n - i + 1 so that N_j(P) = j

Shift due to good suffix rule: If a mismatch occurs at P(i-1) and L'(i) > 0, shift right by n - L'(i)

If L'(i) = 0, shift right by n - l'(i)

When an occurrence of P is found, shift right by n - l'(2)

If P(n) mismatches, shift right by 1