The Levenshtein distance, related distances, and similarity

Definition: The Levenshtein distance (or edit distance) between two strings is the minimal number of insertions, deletions, and substitutions of one character for another that will transform one string into the other.

Definition: A global alignment of strings S1 and S2 is a way of lining up the two strings (with spaces possibly inserted into one or both strings or at the ends) so that each letter or space in S1 corresponds to a letter or space in S2 and vice-versa. Note that a space indictaes an insertion or deletion and needs to be distinguished from a blank if "blank" is a member of the alphabet.

Definition: D(i, j) = edit distance between S1[1..i] and S2[1..j]

Recurrence: D(i, 0) = i, D(0, j) = j, and D(i, j) = min[ D(i-1, j)+1, D(i, j-1)+1, D(i-1, j-1) + ( S1(i) != S2(j) ) ], where (a!=b) has the value 1 if the characters a and b don't match and 0 if they match.

Operation-weight edit distance: an insertion or deletion costs d, a mismatch (substitution) costs r, and a match costs e. Edit transcripts with minimal operation weight can be found via the following recurrence:

D(i, 0) = d*i, D(0, j) = d*j, D(i, j) = min[ D(i-1, j)+d, D(i, j-1)+d, D(i-1, j-1) + ( S1(i) != S2(j) )*(r-e) + e ]

Alphabet-weight edit distance: the weights can also depend on the letters of the alphabet involved. In biology, the PAM and BLOSUM scoring matrices contain these weights.

Definition: The score for aligning x with y is s(x, y), where x and y can be letters in the alphabet, or space to denote insertion or deletion

If S1 and S2 are aligned by possibly inserting spaces into S1 or S2 or both, the value of the alignment is given by summing s( S1'(i), S2'(i) ) over all positions in S1' (the string obtained by inserting spaces into S1).

Definition: The similarity of S1 and S2 is the value of the optimal alignment.

Special cases: