Suffix trees

Given a string, S, of length m, a suffix tree for S is a rooted directed tree with m leaves corresponding to the m suffixes of S. Each edge is labeled with a substring of S in such a way that the edge labels on any path from the root to a leaf spell out the suffix corresponding to that leaf when concatenated. No two edges out of a node begin with the same letter, and every internal node has at least two children (except possibly the root).

Definition: The string-depth of a node is the length of the string that leads to that node. A path that ends in the middle of an edge is said to split that edge.

Implicit suffix tree: An implicit suffix tree can be formed by building a suffix tree for S$ (the string S followed by a character not occurring in S) then removing every $, edges with no labels, and nodes with fewer than two children.

Ukkonen's algorithm forms the implicit suffix tree for the prefix S[1..i+1] from the implicit suffix tree for S[1..i] in phase i, which consists of i+1 extensions, extending the suffix S[j..i] to S[j..i+1] according to the following rules:

Extension rule 1: If S[j..i] ends at a leaf, the character S(i+1) is added to the label of that leaf.

Extension rule 2: If no path from the end of S[j..i] starts with S(i+1) a new leaf edge (to a leaf numbered j) is created. A new node needs to be created if S[j..i] ended in the middle of an edge.

Extension rule 3: If a path from the end of S[j..i] already starts with S(i+1) nothing needs to be done.

Definition: A suffix link goes from a node labeled xw to a node, s(v), labelled w. x is a single letter and w is a string (possibly empty).

Single extension algorithm:

  1. If S[j-1..i] doesn't label a node with a suffix link (or the root), go up at most one edge to such a node. let G be the string that labels the path from this node to the end of S[j-1..i]
  2. If not at the root, follow the suffix link to s(v) then follow the path labeled G, otherwise follow the path for S[j..i]
  3. Using extension rules 1-3 add S[j..i+1] to the tree if needed.
  4. If extension rule 2 created a new internal node, create the suffix link for this node.

Speedup trick 1: When following a path that must exist in the tree, just match the first letter on each edge

Edge label compression: Label edges with the pair of integers (begin position, end position) instead of explicitly giving the letters labeling the edge

Speedup trick 2: In any phase, when extension rule 3 first applies, that phase can be ended

Speedup trick 3: Label leaves (begin position, end of string) where "end of string" is a global position variable that is equal to i+1 in phase i+1.