The Knuth-Morris-Pratt Algorithm

Definition: spi(P) is the length of the longest proper suffix of P[1..i] that matches a prefix of P

Example: P = abbcabbd, so sp1(P) = 0, sp2(P) = 0, ..., sp4(P) = 0, sp5(P) = 1, sp6(P) = 2, sp7(P) = 3, sp8(P) = 0

Definition: sp'i(P) is the length of the longest proper suffix of P[1..i] that matches a prefix of P with a mismatch between P(i+1) and the character following the prefix

Example: P = abbcabbd, so sp'1(P) = 0, sp'2(P) = 0, ..., sp'6(P) = 0, sp'7(P) = 3, sp'8(P) = 0

Knuth-Morris-Pratt shift rule: For a given alignment of P and T, suppose the lowest-numbered position with a mismatch is P(i+1) with T(k), then P can be shifted to the right by i-sp'i(P). If there was no mismatch (an occurrence of P was found), the shift is n-sp'n(P).

Definition: Position j maps to i if i is the right end of a Z-box starting at j (so i = j + Zj(P) - 1)

Theorem 2.3.4: For i > 1, sp'i(P) = Zj(P) where j > 1 is the first position that maps to i, or 0 if no position maps to i.

Knuth-Morris-Pratt preprocessing: Given the above theorem, the sp'i(P) values can be calculated from the Zj(P) values by scanning through them in order of decreasing j, setting sp'i(P) = j for i = j + Zj(P) - 1.

Note that Knuth-Morris-Pratt is not a real-time algorithm: if a character of T is involved in a mismatch the pattern is shifted and the same character can be compared with a different character of P. The number of times this can happen is proportional to the length of the pattern. the following modification converts Knuth-Morris-Pratt into a real-time algorithm by ensuring that no character of T is examined more than once. (Note that the alphabet is required to be finite.)

Definition: For every x in the alphabet and i > 0, let sp'(i,x)(P) be the length of the longest proper suffix of P[1..i] that matches a prefix of P and with P( 1 + sp'i(P) ) = x

Theorem 2.4.1: If P[i+1] != x, then sp'(i,x)(P) = i - j + 1, where j is the leftmost position that maps to i with P( Zj(P)+1 ) = x. If there is no such position, sp'(i,x)(P) = 0.