The Levenshtein distance, related distances, and similarity

Definition: The Levenshtein distance (or edit distance) between two strings is the minimal number of insertions, deletions, and substitutions of one character for another that will transform one string into the other.

Definition: A global alignment of strings S₁ and S₂ is a way of lining up the two strings (with spaces possibly inserted into one or both strings or at the ends) so that each letter or space in S₁ corresponds to a letter or space in S₂ and vice-versa. Note that a space indictaes an insertion or deletion and needs to be distinguished from a blank if "blank" is a member of the alphabet.

Definition: D(i, j) = edit distance between S₁[1..i] and S₂[1..j]

Recurrence: D(i, 0) = i, D(0, j) = j, and D(i, j) = min[ D(i-1, j)+1, D(i, j-1)+1, D(i-1, j-1) + ( S₁(i) != S₂(j) ) ], where (a!=b) has the value 1 if the characters a and b don't match and 0 if they match.

Operation-weight edit distance: an insertion or deletion costs d, a mismatch (substitution) costs r, and a match costs e. Edit transcripts with minimal operation weight can be found via the following recurrence:

D(i, 0) = d*i, D(0, j) = d*j, D(i, j) = min[ D(i-1, j)+d, D(i, j-1)+d, D(i-1, j-1) + ( S₁(i) != S₂(j) )*(r-e) + e ]

Alphabet-weight edit distance: the weights can also depend on the letters of the alphabet involved. In biology, the PAM and BLOSUM scoring matrices contain these weights.

Definition: The score for aligning x with y is s(x, y), where x and y can be letters in the alphabet, or space to denote insertion or deletion

If S₁ and S₂ are aligned by possibly inserting spaces into S₁ or S₂ or both, the value of the alignment is given by summing s( S₁'(i), S₂'(i) ) over all positions in S₁' (the string obtained by inserting spaces into S₁).

Definition: The similarity of S₁ and S₂ is the value of the optimal alignment.

Special cases:

If matches score 1 and mismatches, insertions, and deletions score 0 this is Longest Common Subsequence (LCS)
If spaces at the beginning and end don't count, this is end-space free alignment. In this case, V(i, 0) = V(0, j) = 0, and the solution is the maximum in the last row or column.
A substring T' of T is a delta-approximate occurrence of P if the optimal alignment of P with T' has value at least delta. This can be found by setting V(i, 0) = V(0, j) = 0 and finding values of j so that V(n, J) is at least delta.