The Boyer-Moore Algorithm

Three ideas

Definition: R(x) is the position of the rightmost occurrence of x in P (or zero)

Example: P = actca, so R(a) = 5, R(c) = 4, R(g) = 0, R(t) = 3

T = actgactaactca
P = actca

Bad character shift rule: If the first mismatch is found at position i (when scanning the pattern right to left) and T(k) is the text character that mismatches P(i), then shift the pattern right by max(1, i - R(T(k))

In the above example, i = k = 4 and T(k) = g, so the shift is 4 - R(g) = 4

Extended bad character shift rule: shift so that the next occurrence of T(k) in P (scanning left) is matched to T(k)
Preprocessing requires space proportional to the pattern length rather than the alphabet size

In the above example, the first shift results in

T = actgactaactca
P =     actca

in either case, but now the first rule would only give a second shift of max(1, 4-5) = 1

Good suffix rule: If t is the longest suffix of P that matches T in the current position, then P can be shifted so that the previous occurrence of t in P matches T. In fact, it can be required that the character before the previous occurrence of t be different from the character before the occurrence of t as a suffix. If no such previous occurrence of t exists, then the following cases apply:

  1. Find the smallest shift that matches a prefix of the pattern to a suffix of t in the text
  2. If there's no such match, shift the pattern by n (the length of P)

Note that these cases can be applied even if the length of t is n (a match of P was found in T)

More formally:

L(i) is the largest k < n so that P[i..n] matches a suffix of P[1..k] (or zero if none exists)

L'(i) is the largest k < n so that P[i..n] matches a suffix of P[1..k] and P(i-1) did not match the character before the suffix (or zero if there's no such j)

Nj(P) is the length of the longest suffix of P[1..j] that is also a suffix of P Note: Nj(P) = Zn-j+1(Pr)

l'(i) is the length of the longest suffix of P[i..n] that is also a prefix of P (or zero if none exists)

Theorem 2.2.2: L(i) is the largest j < n so that Nj(P) >= |P[i..n]| = n - i + 1
L'(i) is the largest j < n so that Nj(P) = |P[i..n]| = n - i + 1

Given this theorem and the above note, one can calculate all the Nj(P) values using the Z (fundamental preprocessing) algorithm, then go through these values in order of j filling in a table of L'(i) = j, where i = n - Nj(P) + 1.

Theorem 2.2.4: l'(i) is the largest j <= |P[i..n]| = n - i + 1 so that Nj(P) = j

Shift due to good suffix rule: If a mismatch occurs at P(i-1) and L'(i) > 0, shift right by n - L'(i)

If L'(i) = 0, shift right by n - l'(i)

When an occurrence of P is found, shift right by n - l'(2)

If P(n) mismatches, shift right by 1