
Contents

1 Introduction
1.1 Preliminaries
1.2 Overview of problems and structures

2 First concepts
2.1 Tries
2.2 Basic Suffix Tries
2.3 Suffix Trees
2.4 Suffix Arrays
2.5 Concluding remarks

3 The core enhancement: longest common prefix
3.1 Using lcp to improve binary search
3.2 Computing lcp via the lcp array
3.3 lcp intervals and the lcp tree
3.4 Lcp tree – suffix tree equivalence
3.5 RMQ
3.6 The Enhanced Suffix Array
3.7 Generalizing to word arrays
3.8 Alphabet size dependence
3.9 Concluding remarks

4 Construction
4.1 Enhanced Suffix Array - Suffix Tree conversion
4.2 Ukkonen’s algorithm
4.3 Concluding remarks

5 Conclusion
5.1 Pending matters

1 Introduction

Given a string 𝑆, a natural question is whether a pattern 𝑝 is a substring of 𝑆. Moreover, one might want to test many patterns against the same string. We call this the general substring problem. The seminal example of this is found in biology, where a full genome is checked for the presence of many gene sequences. In this article we will look at ways to preprocess 𝑆 in 𝒪 (|𝑆|) time, yielding an 𝒪 (|𝑆|)-size data structure that allows 𝒪 (|𝑝|) resolution of the general substring problem. These are trivial lower bounds for this problem.


We will consider a few data structures. In sections 2 and 3 we examine their use and memory footprint; in section 4 we will see how to construct these data structures in 𝒪 (|𝑆|) time.

The first data structure we look at is the trie, so named because it allows for easy retrieval of strings.

Applying the trie to the general substring problem yields the basic suffix trie. An optimization on tries yields the compressed trie. In the specific case of the basic suffix trie, this optimization yields the suffix tree. This structure has the desired 𝒪 (|𝑝|) search time and 𝒪 (|𝑆|) memory footprint.

The second structure we examine is the suffix array, derived from a simple ordered array of words, the word array. At first sight, the suffix array seems inferior to the suffix tree. However, with some elegant enhancements shown in section 3, it proves to be at least as efficient as the suffix tree when solving the general substring problem.

1.1 Preliminaries

Before we get to the actual data structures, we need to introduce notation and some terminology.

First, we keep our variable types consistent: variables 𝑥, 𝑦 and 𝑧 are nodes, 𝑖 through 𝑙 are integers, 𝑝 through 𝑤 are strings, and 𝑎 and 𝑐 (not 𝑏) are characters.

Second, our arrays are indexed starting from 0. Furthermore, by 𝐴[𝑖 : 𝑗] we mean the subarray starting at 𝑖 up to but excluding 𝑗.

We write 𝑢 ≺ 𝑣 or 𝑣 ≻ 𝑢 to denote 𝑢 lexicographically preceding 𝑣.

Since we are dealing with strings, we introduce terminology regarding strings and substrings. Σ is universally the alphabet over which we take our strings. 𝑆 is universally the string we want to preprocess and 𝑛 = |𝑆|. For a string 𝑆, 𝑣 is a substring if and only if ∃𝑢∃𝑤 : 𝑢𝑣𝑤 = 𝑆. We distinguish a few special sets of substrings. First we have the suffixes and prefixes of 𝑆, defined as:

Suff (𝑢) = {𝑠 | ∃𝑝 : 𝑝𝑠 = 𝑢}, the suffixes of 𝑢
Pref (𝑢) = {𝑝 | ∃𝑠 : 𝑝𝑠 = 𝑢}, the prefixes of 𝑢

We call 𝑢 a repeated substring if it occurs at least twice in 𝑆. Formally this can be written as: ∃𝑖 ∃𝑗 : 𝑖 ≠ 𝑗 ∧ 𝑆[𝑖 : 𝑖 + |𝑢|] = 𝑆[𝑗 : 𝑗 + |𝑢|] = 𝑢. A string that is both repeated and a suffix or a prefix is called, respectively, a nested suffix or a nested prefix. We shall see that nested suffixes can be quite troublesome. For this reason, we introduce a character $ ∉ Σ. Appending this to 𝑆 ensures 𝑆$ has no nested suffixes.

We call a string 𝑢 a right-branching substring if and only if:

∃𝑎∃𝑐 : 𝑎 ̸= 𝑐 ∧ 𝑢𝑎 a substring of 𝑆 ∧ 𝑢𝑐 a substring of 𝑆

Note that any right-branching substring must also be a repeated substring. Finally, we introduce the following function: suff (𝑖) = 𝑆[𝑖 : 𝑛], which allows us to easily address the suffixes of 𝑆.

We will also make use of edge-labelled trees. For such a tree 𝐿, we introduce the following notation: 𝑁𝐿 are the nodes of 𝐿. The edges are written as a triple: (parent, child, label). The set of all edges is written as 𝐸𝐿. To ensure this is actually a tree, every node except for one must have exactly one parent. The excepted node is the root and has no parent.

Finally, for a node 𝑥 we define 𝑇 (𝑥) to be the subtree rooted at 𝑥. We can define the structure of a tree (not the labels of the edges) by specifying the nodes in each subtree. If we know 𝑁𝑇 (𝑥) for each node 𝑥 we can deduce that:

𝐸𝐿 = {(𝑥, 𝑦, −) | 𝑦 ∈ 𝑁𝑇 (𝑥) ∖ {𝑥} ∧ ¬∃𝑧 ∈ 𝑁𝑇 (𝑥) ∖ {𝑥, 𝑦} : 𝑦 ∈ 𝑁𝑇 (𝑧)}

In words, there is an edge from 𝑥 only to its direct descendants.


Figure 1: Graph of relations. Round nodes correspond to problems, square nodes to structures; in brackets are the numbers of the sections where the concepts are introduced. The nodes are: general substring [1], trie [2.1], basic suffix trie [2.2], compressed trie [2.3], suffix tree [2.3], suffix array [2.4], LCP array [3.2], RMQ [3.5], ±RMQ [3.5], LCA [3.5] and enhanced suffix array [3.6]; the edges are labelled "solves", "reduces to", "contains", "derives from" and "requires".

1.2 Overview of problems and structures

We will see many interrelated problems and structures in this article. The graph in figure 1 shows the basic relations between these. Other relations of importance that are hard to capture in a graph are:

∙ The enhanced suffix array solves the general substring problem via both the lcp tree [3.3] and binary search [3.1].

∙ The lcp tree uses lcp intervals [3.3].

∙ The lcp tree is equivalent to the suffix tree [3.4].

∙ Binary search is worse than the other methods, unless we are dealing with large alphabets.

∙ The general substring problem and the solutions we present here can be generalized to finding prefixes of a set of words.

2 First concepts

2.1 Tries

Tries are the basic way to store strings for retrieval. They are trees that store strings by prefix. The root of a trie corresponds to the empty string, trivially a prefix of every string. Then, recursively, for each prefix it stores all characters we can append to that prefix to get a longer prefix. Each such character is stored in an edge ending in a new node corresponding to the longer prefix. For each prefix 𝑢 we shall denote the node in the trie corresponding to it by 𝑢. Conversely, for a node 𝑢 we define the function string(𝑢) = 𝑢. We capture this concept of the trie in the following definition:

Definition 2.1 (Trie) 𝑇 , the trie for a set of words 𝑊 is defined as an edge-labeled tree satisfying the following properties:

∙ There exists a bijection between 𝑃, the set of all prefixes of 𝑊 (i.e. 𝑃 = ⋃{Pref (𝑣) | 𝑣 ∈ 𝑊}), and 𝑁𝑇: 𝑢 ↦→ 𝑢

∙ The edges of T are given by:

𝐸𝑇 = {(𝑢, 𝑢𝑐, 𝑐) | 𝑢𝑐 ∈ 𝑃 }

From this it immediately follows that:

Corollary 2.1 Given trie 𝑇 for a set of strings 𝑊 the following hold:

1. For any node 𝑢, the string 𝑢 is obtained by concatenating the edge labels encountered when walking from the root to 𝑢.

2. There exists an injection between the leaves of 𝑇 and the words of 𝑊.
3. The root of 𝑇 is 𝜖.

4. No node has two outgoing edges with the same label.

5. The trie of 𝑊 is uniquely determined.

Note that point 2 of corollary 2.1 does not state a bijection because one word in 𝑊 may be the prefix of another. For example, in the trie seen in figure 2, searching for "at" would not end in a leaf. As such one cannot determine the words in 𝑊 from its trie. In the example we cannot deduce that "at" ∈ 𝑊 from the trie alone. We can prevent this situation by appending the sentinel $ to every word in 𝑊 . This way no word in 𝑊 is the prefix of another word. Doing so ensures a bijection between the leaves of 𝑇 and words in 𝑊 .

The point of a trie is to quickly be able to find prefixes of words in a set. Point 1 of corollary 2.1 is essential to this. Say we want to know if 𝑝 is a prefix of a word in 𝑊 . Given the trie 𝑇 for 𝑊 , point 1 would allow us to easily find incrementally longer prefixes of 𝑝 that occur in 𝑊 . This proceeds until we either find 𝑝 in 𝑊 , or can no longer find the next prefix. Moving from one prefix to the next is simple.

If we are at node 𝑝[0 : 𝑗], we need only look for an edge labelled 𝑝[𝑗], because that edge leads to the node 𝑝[0 : 𝑗 + 1]. This gives rise to algorithm 2.1.

Figure 2: A trie for 𝑊 = {at, ate, tea, ten, too}.


Algorithm 2.1 find(p, T)

Require: string 𝑝, the pattern to find and trie 𝑇 in which to search.

Ensure: Return the node 𝑝 if it exists, NO_SUCH_PATTERN otherwise.

{ Nodes of 𝑇 are assumed to have method 𝑔𝑒𝑡𝐶ℎ𝑖𝑙𝑑(𝑐) returning the child at the end of the edge labelled 𝑐 if it exists and 𝑛𝑢𝑙𝑙 otherwise.}

1: 𝑛𝑜𝑑𝑒 ← 𝑇.𝑟𝑜𝑜𝑡()

2: 𝑖𝑑𝑥 ← 0

3: while 𝑖𝑑𝑥 ̸= 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ do

4: 𝑛𝑜𝑑𝑒 ← 𝑛𝑜𝑑𝑒.𝑔𝑒𝑡𝐶ℎ𝑖𝑙𝑑(𝑝[𝑖𝑑𝑥])

5: if 𝑛𝑜𝑑𝑒 = 𝑛𝑢𝑙𝑙 then

6: return NO_SUCH_PATTERN

7: 𝑖𝑑𝑥 ← 𝑖𝑑𝑥 + 1

8: return 𝑛𝑜𝑑𝑒

The invariant here is: 𝑛𝑜𝑑𝑒 = 𝑝[0 : 𝑖𝑑𝑥]. Traversing the while loop takes 𝒪 (1) time, and we traverse it 𝒪 (|𝑝|) times, giving us a running time of 𝒪 (|𝑝|).

This gives us the ability to recognize prefixes of 𝑊 but not to distinguish full words. Ending in a leaf certainly guarantees we have a full word but, as stated in point 2 of corollary 2.1, the converse does not hold. Appending the sentinel $ to all words in 𝑊 solves this problem by making the converse hold. In this case, searching for 𝑝 only determines whether 𝑝 is a prefix of a word in 𝑊; searching for 𝑝$ determines whether 𝑝 is a word in 𝑊. In general one should always append the sentinel because it stores more information.
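To make the above concrete, here is a minimal Python sketch of a trie with the find routine of algorithm 2.1. It is illustrative only and not part of the original text; the class and method names are chosen freely.

class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child node at the end of that edge

class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, TrieNode())

    def find(self, p):
        # Algorithm 2.1: walk down one edge per character of p.
        # Returns the node for p if p is a prefix of some stored word, None otherwise.
        node = self.root
        for c in p:
            node = node.children.get(c)
            if node is None:
                return None  # NO_SUCH_PATTERN
        return node

# The trie of figure 2, with the sentinel $ appended to every word.
trie = Trie(w + "$" for w in ["at", "ate", "tea", "ten", "too"])
print(trie.find("te") is not None)   # True: "te" is a prefix of a word in W
print(trie.find("at$") is not None)  # True: "at" is a full word in W
print(trie.find("ta") is not None)   # False

Each step of find costs a constant number of dictionary lookups, which is the 𝒪 (|𝑝|) running time discussed above.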

2.2 Basic Suffix Tries

Tries facilitate the finding of a word, or a prefix of such a word, within a set of words. However, our initial problem requires easy access to all substrings of a string 𝑆. In this case the most obvious usage of a trie is to create a trie for the set of all substrings of 𝑆. However, there are 𝒪 (𝑛²) substrings of 𝑆. This leads to excessively large memory requirements. Luckily, we can do a lot better by exploiting the following:

Observation 2.2 Every substring 𝑢 of 𝑆 is a prefix of a suffix of 𝑆.

Since tries allow retrieval of not just words, but also prefixes, we need merely construct a trie containing all suffixes of 𝑆. It is obvious there are 𝒪 (𝑛) suffixes of 𝑆. The trie consisting of all of these suffixes is called the basic suffix trie. Formally, we define the basic suffix trie as follows:

Definition 2.2 (Basic suffix trie) Given a string 𝑆, its basic suffix trie 𝑇 is a trie for Suff (𝑆).

From this it immediately follows that:

Corollary 2.3 Given a string 𝑆 and its basic suffix trie 𝑇 :

1. 𝑢 ↦→ 𝑢 gives a bijection between the substrings of 𝑆 and the nodes of 𝑇.
2. For each leaf 𝑥 of 𝑇, string(𝑥) is a suffix of 𝑆.

3. If 𝑆 has no nested suffixes, for each suffix 𝑢, 𝑢 is a leaf.

The absence of nested suffixes can be assured simply by appending $ to 𝑆. Again, this should almost always be done. Take, for example the trie in figure 3. Were $ not appended here, it would be a lot harder to recognize that "ana" is a suffix.
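As a small, hedged illustration (not from the original text), a basic suffix trie can be built by inserting every suffix of 𝑆$ into a plain trie; nested dictionaries stand in for nodes here.

def basic_suffix_trie(s):
    # Insert all suffixes of s$ (observation 2.2: every substring of s
    # is a prefix of some suffix of s).
    s = s + "$"
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root

def is_substring(trie, p):
    # A standard trie search: p is a substring of s iff the walk succeeds.
    node = trie
    for c in p:
        node = node.get(c)
        if node is None:
            return False
    return True

st = basic_suffix_trie("banana")
print(is_substring(st, "ana"))  # True
print(is_substring(st, "nab"))  # False

The 𝒪 (𝑛²) space usage discussed below is visible here: each suffix is stored character by character.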


Figure 3: The basic suffix trie for "banana$".

Now, we can determine whether a string 𝑝 is a substring of 𝑆 by performing a standard trie search on the basic suffix trie of 𝑆. However, the basic suffix trie is not yet the optimal solution. Restricting ourselves to suffixes meant we only had to store 𝒪 (𝑛) words in our trie. However, these suffixes have average length 𝒪 (𝑛). As such, the basic suffix trie takes 𝒪 (𝑛²) space. This is still too big. Next, we shall see a structure that improves this to 𝒪 (𝑛) space: the suffix tree.

2.3 Suffix Trees

The basic suffix trie forms the basis for the suffix tree. We will reduce its 𝒪 (𝑛²) memory footprint to 𝒪 (𝑛) with two optimizations. The first optimization will reduce the number of nodes and edges to 𝒪 (𝑛), though the memory used per edge goes up, keeping the memory footprint at 𝒪 (𝑛²). The second optimization will push the size of an edge down to 𝒪 (1), giving us the desired memory footprint of 𝒪 (𝑛).

The first optimization works for tries in general. It relies on noticing a trie may have sequences of nodes with just a single child. These sequences don’t branch, so these nodes store no information about the structure of the tree. As such, we consolidate these sequences into single edges. We label that edge with the string obtained by concatenating the labels of the consolidated sequence. We call the resultant structure the compressed trie. The suffix tree is then defined as the compressed trie for all suffixes.

Formally, we define it as follows:


Definition 2.3 (Compressed trie & Suffix Tree) Given a trie 𝑇 , the corresponding compressed trie 𝐶 is an edge-labeled tree, with strings as edge labels. Its nodes and edges are:

𝑁𝐶 = {𝑇.𝑟𝑜𝑜𝑡} ∪ branching nodes of 𝑇 ∪ leaves of 𝑇
𝐸𝐶 = {(𝑢, 𝑣, 𝑟) | 𝑣 = 𝑢𝑟 ∧ ¬∃𝑝 ∈ Pref (𝑟) ∖ {𝜖, 𝑟} : 𝑢𝑝 ∈ 𝑁𝐶}

Given a string 𝑆, its suffix tree 𝑆𝑇 is then the compressed trie of the basic suffix trie of 𝑆.

Furthermore, for a node 𝑢, its implicit depth is |𝑢|. Its explicit depth is the normal ‘distance to root’ definition.

From this it immediately follows that:

Corollary 2.4 Given a compressed trie 𝑇 for 𝑊 , we know that:

1. No two edges originating from the same node in 𝑇 have labels starting with the same character.

2. Every internal node of 𝑇 has at least two children (except when 𝑇 only has two nodes).
3. 𝑇 has 𝒪 (𝑛) nodes and edges.

Furthermore, if 𝑇 is a suffix tree for 𝑆 we have the following:

For each internal node 𝑢 of 𝑇 , 𝑢 is a right-branching substring. For each leaf 𝑣, 𝑣 is a suffix. This is a result of the first two points of the corollary.

Before we get to the second optimization, we have to address a rather pressing issue. We no longer have a bijection between substrings and nodes as we did in corollary 2.3. Luckily, the information is still there. Some substrings are simply on the edge between two nodes. We call these ‘positions’ on the edge implicit nodes. Given any such implicit node, it is easy to figure out what its child is, and by what label the outgoing edge is labelled.

We still have to address the memory issue. The core of the problem is that, whilst we have reduced the number of edges and nodes, we changed the edge labels from characters to strings in the process. This means that every character that was stored in the basic suffix trie is also stored in the suffix tree. Since the basic suffix trie stored 𝒪 (𝑛²) characters, so does our suffix tree.

The key insight is that every string stored as an edge label is a substring of 𝑆. As such, we can store it by two indices: its start and end position in 𝑆. This reduces the space taken per edge to 𝒪 (1).

Figure 4: The suffix tree for "banana$".


Thus the total space of the tree is reduced to 𝒪 (𝑛). Note that with three indices, we can do the same for a compressed trie. Every label is a substring of some word in 𝑊. We use the first index to store which word, and the remaining two for the start and end of the substring.
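A minimal sketch of this storage scheme follows (illustrative only; the names Node, add_edge and label are not from the article): each edge keeps two integers pointing into 𝑆 instead of a string.

S = "banana$"

class Node:
    def __init__(self):
        self.edges = {}                 # first character of the label -> (start, end, child)

    def add_edge(self, start, end, child):
        # The label S[start:end] is represented by two integers only.
        self.edges[S[start]] = (start, end, child)

def label(start, end):
    return S[start:end]                 # recovered on demand; O(1) storage per edge

# The four edges leaving the root of the suffix tree of "banana$" (figure 4):
root = Node()
root.add_edge(1, 2, Node())             # label "a"
root.add_edge(0, 7, Node())             # label "banana$"
root.add_edge(2, 4, Node())             # label "na"
root.add_edge(6, 7, Node())             # label "$"
print(sorted(label(s, e) for (s, e, _) in root.edges.values()))
# ['$', 'a', 'banana$', 'na']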

In order to use the suffix tree, we require a way to represent the implicit nodes. We do this via the concept of a ‘reference pair’:

Definition 2.4 (Reference pair) Let 𝑆𝑇 be a suffix tree, 𝑢 be a node of 𝑆𝑇 and 𝑢𝑠 be an implicit node of 𝑆𝑇. We then define ⟨𝑢, 𝑠⟩ to be a reference pair referring to 𝑢𝑠. We call 𝑢 the anchor of the pair and 𝑠 the label.

For 𝑢𝑠, its reference pair with the deepest possible anchor is its canonical reference pair.

Furthermore, we extend our notation. For a substring 𝑣 of 𝑆, we define 𝑣 = ⟨𝑢, 𝑠⟩ where ⟨𝑢, 𝑠⟩ is canonical.

Which definition of 𝑢𝑠 we use will be clear from context. Finally, we define: string(⟨𝑢, 𝑠⟩) = 𝑢𝑠.

We introduced the canonical reference pair because it is unique; the same string can have many reference pairs with different anchors. For example, take the substring "anan" of "banana$" as seen in figure 4. For this implicit node, ⟨𝑎, 𝑛𝑎𝑛⟩ is a reference pair, but so are ⟨𝜖, 𝑎𝑛𝑎𝑛⟩ and ⟨𝑎𝑛𝑎, 𝑛⟩. However, of those three only ⟨𝑎𝑛𝑎, 𝑛⟩ is canonical. A beneficial property of the canonical reference pair is the following: if we were to walk from the anchor to the implicitly referenced node, we would only pass implicit nodes. This is clearly not the case for non-canonical reference pairs.

When storing a reference pair’s label we can again exploit the fact that it is a substring of 𝑆, allowing us to store it by two simple indices.

Interestingly, canonizing a reference pair, checking whether the reference pair is correct in the process, is essentially the substring-finding algorithm. After all, to find whether 𝑝 is a substring, all we need to do is canonize ⟨𝑟𝑜𝑜𝑡, 𝑝⟩. (Obviously, in this case we cannot store 𝑝 using indices, since we do not know whether it is a substring of 𝑆.) Canonizing a reference pair with such a check is quite simple; algorithm 2.2 does this.¹ Here, much like with the find algorithm for tries, the invariant is that string(𝑛𝑜𝑑𝑒) = string(⟨𝑓𝑖𝑟𝑠𝑡𝑁𝑜𝑑𝑒, 𝑓𝑖𝑟𝑠𝑡𝑆𝑡𝑟⟩)[0 : 𝑖𝑑𝑥]. The running time remains 𝒪 (|𝑝|) because we require 𝒪 (|𝑝|) character comparisons. Every other operation takes 𝒪 (1) time and the while loop is executed 𝒪 (|𝑝|) times.

¹ The algorithm also works for general compressed tries.

Algorithm 2.2 canonize(⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟⟩, 𝑇 )

Require: a reference pair ⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟⟩ to canonize and a compressed trie 𝑇 to work in.

Ensure: returns the canonized reference pair if ⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟⟩ refers to an existing node, INCORRECT_REFERENCE_PAIR otherwise

1: 𝑖𝑑𝑥 ← 0
2: 𝑠𝑡𝑎𝑟𝑡 ← 0 {index in 𝑠𝑡𝑟 of the first character not covered by the current anchor 𝑛𝑜𝑑𝑒}
3: 𝑓𝑖𝑟𝑠𝑡𝑁𝑜𝑑𝑒 ← 𝑛𝑜𝑑𝑒 {𝑓𝑖𝑟𝑠𝑡𝑁𝑜𝑑𝑒 and 𝑓𝑖𝑟𝑠𝑡𝑆𝑡𝑟 are only used in the proof}
4: 𝑓𝑖𝑟𝑠𝑡𝑆𝑡𝑟 ← 𝑠𝑡𝑟
5: while 𝑖𝑑𝑥 < 𝑠𝑡𝑟.𝑙𝑒𝑛𝑔𝑡ℎ() do
6: 𝑒𝑑𝑔𝑒 ← 𝑛𝑜𝑑𝑒.𝑔𝑒𝑡𝐸𝑑𝑔𝑒(𝑠𝑡𝑟[𝑖𝑑𝑥])
7: if 𝑒𝑑𝑔𝑒 = 𝑛𝑢𝑙𝑙 then
8: return INCORRECT_REFERENCE_PAIR
9: 𝑙𝑎𝑏𝑒𝑙 ← 𝑒𝑑𝑔𝑒.𝑙𝑎𝑏𝑒𝑙
10: 𝑙𝑒𝑛𝑔𝑡ℎ ← min(𝑙𝑎𝑏𝑒𝑙.𝑙𝑒𝑛𝑔𝑡ℎ, 𝑠𝑡𝑟.𝑙𝑒𝑛𝑔𝑡ℎ − 𝑖𝑑𝑥) {if we are assured the reference pair is correct, we can omit the check below, saving a lot of time}
11: if 𝑙𝑎𝑏𝑒𝑙[0 : 𝑙𝑒𝑛𝑔𝑡ℎ] ≠ 𝑠𝑡𝑟[𝑖𝑑𝑥 : 𝑖𝑑𝑥 + 𝑙𝑒𝑛𝑔𝑡ℎ] then
12: return INCORRECT_REFERENCE_PAIR
13: 𝑖𝑑𝑥 ← 𝑖𝑑𝑥 + 𝑙𝑒𝑛𝑔𝑡ℎ
14: if 𝑙𝑒𝑛𝑔𝑡ℎ = 𝑙𝑎𝑏𝑒𝑙.𝑙𝑒𝑛𝑔𝑡ℎ then
15: 𝑛𝑜𝑑𝑒 ← 𝑒𝑑𝑔𝑒.𝑐ℎ𝑖𝑙𝑑 {the whole edge label was consumed: 𝑒𝑑𝑔𝑒.𝑐ℎ𝑖𝑙𝑑 is a deeper anchor}
16: 𝑠𝑡𝑎𝑟𝑡 ← 𝑖𝑑𝑥
17: return ⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟[𝑠𝑡𝑎𝑟𝑡 : 𝑠𝑡𝑟.𝑙𝑒𝑛𝑔𝑡ℎ]⟩


It should be noted that suffix trees find applications in string processing far beyond the general substring problem. [Gus97] devotes the entirety of chapters 7 and 9 to the applications.

2.4 Suffix Arrays

Up until now, the key insight has been observation 2.2. It allowed us to transform finding substrings to finding prefixes of a set 𝑊. So far, we have used tries for this. However, there exists a much more basic structure that allows us to find prefixes: the word array. Essentially, it works as a dictionary, storing the words in lexicographical order. Taking 𝑊 = Suff (𝑆) then gives us the suffix array. Formally, we define these arrays as follows:

Definition 2.5 (Word array & Suffix array) Given an enumerated set of words 𝑊 = {𝑤1, 𝑤2, . . . , 𝑤𝑛}, its word array WA has size 𝑛. Its entries are defined as follows:

0 ≤ 𝑖 < 𝑗 < 𝑛 ⇐⇒ 𝑤WA[𝑖] ≺ 𝑤WA[𝑗]

The suffix array SA for string 𝑆 is then defined as the word array taking 𝑤𝑖 = suff (𝑖). We also define the following function to quickly access the words of the word array:

word (𝑖) = 𝑤WA[𝑖]

Finding words in 𝑊 , and indeed prefixes of such words, now becomes a simple matter of binary search.

However, this takes 𝒪 (log 𝑛) word-comparisons. When searching for a word 𝑝 each of these comparisons naïvely takes 𝒪 (|𝑝|) time. Thus, naïve binary search takes 𝒪 (|𝑝| log 𝑛) time, much worse than the trie’s 𝒪 (|𝑝|). It seems like the simplicity of the word array has come at a cost.

That said, the bound of 𝒪 (|𝑝| log 𝑛) is rather pessimistic. Unless very long prefixes of 𝑝 occur in our array, few word-comparisons will actually take 𝒪 (|𝑝|) time. Furthermore, we will see two successive speed-ups to binary search on a word array. These will finally yield a runtime bound of 𝒪 (|𝑝| + log 𝑛). Still worse than 𝒪 (|𝑝|), but not by much. Finally, we will see a surprising alternative use of an (enhanced) suffix array that actually manages 𝒪 (|𝑝|) time.
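The following Python sketch (illustrative, not from the article) builds a suffix array by plain sorting and answers the general substring problem with the naïve 𝒪 (|𝑝| log 𝑛) binary search discussed above.

def suffix_array(S):
    # Naive construction by sorting all suffixes; section 4 discusses faster methods.
    return sorted(range(len(S)), key=lambda i: S[i:])

def contains(S, SA, p):
    # Binary search for the first suffix that is >= p; p is a substring of S
    # iff that suffix starts with p. Each comparison may cost O(|p|) time.
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if S[SA[mid]:] < p:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(SA) and S[SA[lo]:SA[lo] + len(p)] == p

S = "banana$"
SA = suffix_array(S)
print(SA)                       # [6, 5, 3, 1, 0, 4, 2]
print(contains(S, SA, "nan"))   # True
print(contains(S, SA, "nab"))   # False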

These speed-ups to binary search both depend on the following property of lexicographical sorting:

Observation 2.5

if 𝑝𝑢 ≺ 𝑠 ≺ 𝑝𝑣 then 𝑝 ∈ Pref (𝑠)

This allows us to significantly reduce the number of character comparisons needed for each successive word-comparison. The first speed up (due to [Gus97]) is a basic application of this observation.

At any point in a binary search for 𝑝, we are considering three positions: the left boundary 𝑏𝐿, the midpoint 𝑏𝑀 and the right boundary 𝑏𝑅, satisfying word (𝑏𝐿) ≺ word (𝑏𝑀) ≺ word (𝑏𝑅). Now, we take 𝑛𝐿 to be the index in 𝑝 up to which we have matched word (𝑏𝐿) to 𝑝, and take 𝑛𝑅 analogously. Taking 𝑚𝑙𝑟 = min(𝑛𝐿, 𝑛𝑅), observation 2.5 tells us that word (𝑏𝑀) must match 𝑝 up to 𝑚𝑙𝑟. This allows us to skip the first 𝑚𝑙𝑟 characters when comparing word (𝑏𝑀) to 𝑝. This is captured in algorithm 2.3:

Although this algorithm saves a lot of redundant comparisons, we retain the 𝒪 (|𝑝| log 𝑛) worst-case bound. That said, it only occurs in degenerate cases, for example when searching 𝑆 = 𝑎𝑏 . . . 𝑏 for 𝑎𝑏𝑏𝑏𝑏𝑏𝑏𝑐. In this case word (𝑏𝑅) will always be of the form 𝑏 . . . 𝑏, and thus 𝑛𝑅 will remain 0. The second speed-up will improve this bound to 𝒪 (|𝑝| + log 𝑛). However, it will have to wait until section 3.1. It is much more complicated and depends on the not yet introduced concept of the longest common prefix function.


Algorithm 2.3 binarySearch(𝑝, WA)

Require: string 𝑝, the pattern to search and word array WA in which to search.

Ensure: returns 𝑡𝑟𝑢𝑒 if 𝑝 is a prefix of a word in WA and 𝑓 𝑎𝑙𝑠𝑒 otherwise.

𝑏𝐿 ← 0
𝑛𝐿 ← 0
𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑 ← word (𝑏𝐿)
while 𝑝[𝑛𝐿 + 1] = 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛𝐿 + 1] do 𝑛𝐿 ← 𝑛𝐿 + 1
if 𝑝[𝑛𝐿 + 1] ≺ 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛𝐿 + 1] then return 𝑓𝑎𝑙𝑠𝑒 {𝑝 ≺ word (0)}
𝑏𝑅 ← WA.𝑠𝑖𝑧𝑒 − 1
𝑛𝑅 ← 0
𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑 ← word (𝑏𝑅)
while 𝑝[𝑛𝑅 + 1] = 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛𝑅 + 1] do 𝑛𝑅 ← 𝑛𝑅 + 1
if 𝑝[𝑛𝑅 + 1] ≻ 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛𝑅 + 1] then return 𝑓𝑎𝑙𝑠𝑒 {word (WA.𝑠𝑖𝑧𝑒 − 1) ≺ 𝑝}
while 𝑏𝐿 ≠ 𝑏𝑅 do
  𝑏𝑀 ← (𝑏𝐿 + 𝑏𝑅)/2
  𝑛𝑀 ← min(𝑛𝐿, 𝑛𝑅) {𝑛𝑀 is the point up to which we know word (𝑏𝑀) and 𝑝 agree}
  𝑚𝑖𝑑𝑊𝑜𝑟𝑑 ← word (𝑏𝑀)
  while 𝑝[𝑛𝑀 + 1] = 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛𝑀 + 1] do 𝑛𝑀 ← 𝑛𝑀 + 1
  if 𝑛𝑀 = 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ − 1 then return 𝑡𝑟𝑢𝑒
  if 𝑝[𝑛𝑀 + 1] ≺ 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛𝑀 + 1] then
    𝑏𝑅 ← 𝑏𝑀
    𝑛𝑅 ← 𝑛𝑀
  if 𝑝[𝑛𝑀 + 1] ≻ 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛𝑀 + 1] then
    𝑏𝐿 ← 𝑏𝑀
    𝑛𝐿 ← 𝑛𝑀
return 𝑓𝑎𝑙𝑠𝑒

2.5 Concluding remarks

At first sight, the suffix tree seems indisputably better than the suffix array. The suffix array's 𝒪 (|𝑝| log 𝑛) is significantly worse than the suffix tree's 𝒪 (|𝑝|). However, the 𝒪 (|𝑝| log 𝑛) bound only occurs in pathological cases. Indeed, in [MM90] Manber and Myers report seeing 𝒪 (|𝑝| + log 𝑛) performance in the general case. This stands to reason as, in general, one expects both boundaries of a binary search to improve. This performance is a lot closer to that of the suffix tree.

Furthermore, an easy to overlook advantage of the suffix array is absolute memory footprint. Whilst both structures are 𝒪 (𝑛), the suffix tree has a significant constant factor when compared to the suffix array. In the case where there are no nested suffixes, we must have at least 𝑛 edges; after all, each suffix has a leaf and there are 𝑛 suffixes. Now, for each edge, we need to store 3 pointers: one to the child node, and two for the edge label. This already brings us to 3𝑛 words, ignoring the need to store the edges in each node. On the other hand, the suffix array takes exactly 𝑛 words to store. Due to I/O limitations, these differences in memory footprint can have significant performance repercussions.

In the next section, we will see how we can bring the suffix array's performance completely up to par with the suffix tree, whilst keeping the memory footprint below 3𝑛. We also still need to know if these structures can actually be constructed in 𝒪 (𝑛) time. We will see this in section 4.


3 The core enhancement: longest common prefix

The entirety of this section is about using the concept of the longest common prefix, specifically the length of that prefix. We capture this by the following function:

Definition 3.1 (lcp function) Given strings 𝑢 and 𝑣, we define the lcp function as follows:

lcp(𝑢, 𝑣) = max{ |𝑝| : 𝑝 ∈ Pref (𝑢) ∩ Pref (𝑣) }

Corollary 3.1

lcp(𝑢, 𝑤) ≥ min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤)) (1)

Furthermore, if 𝑢 ≺ 𝑣 ≺ 𝑤 we have:

lcp(𝑢, 𝑤) = min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤)) (2)

Proof Take 𝑚 = min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤)) and 𝑝 = 𝑢[0 : 𝑚]. Certainly 𝑝 ∈ Pref (𝑢); and since 𝑢, 𝑣 and 𝑤 all agree on their first 𝑚 characters, also 𝑝 ∈ Pref (𝑤). It then follows that 𝑚 = |𝑝| ≤ lcp(𝑢, 𝑤). This gives us the first claim.

Now for the second claim, suppose 𝑢 ≺ 𝑣 ≺ 𝑤.

We take 𝑟 = 𝑢[0 : lcp(𝑢, 𝑤)]. Trivially we have: 𝑟 ∈ Pref (𝑢) and 𝑟 ∈ Pref (𝑤). By observation 2.5, this gives 𝑟 ∈ Pref (𝑣). From which we can conclude 𝑟 to be a common prefix of 𝑢, 𝑣 and 𝑤. This gives lcp(𝑢, 𝑤) = |𝑟| ≤ 𝑚.

From the first claim, we have lcp(𝑢, 𝑤) ≥ 𝑚. Thus lcp(𝑢, 𝑤) = 𝑚 = min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤)). □

The utility of the lcp function lies in the fact that, after preprocessing, lcp queries can be answered in 𝒪 (1) time using the above corollary. We will see the exact mechanics of this later. For now, we focus on how to best exploit this easily computable function.

3.1 Using lcp to improve binary search

The first application of lcp (due to [Gus97]) is to speed up the binary search applied to word arrays, as promised. Recall how our first speed-up managed to reduce the number of character comparisons but still left us with the 𝒪 (|𝑝| log 𝑛) worst-case bound. Here we will reduce that bound to 𝒪 (|𝑝| + log 𝑛).

We do this by bounding the number of ‘redundant’ character comparisons to 1 per iteration. We call a comparison of a character of 𝑝 redundant when we have already compared it. This gives |𝑝| necessary comparisons and 𝒪 (log 𝑛) redundant ones. The bound 𝒪 (|𝑝| + log 𝑛) follows immediately.

We reiterate the definitions used in binary search previously, making use of the lcp function where possible:

𝑏𝐿 = the left boundary of the current search interval
𝑛𝐿 = lcp(word (𝑏𝐿), 𝑝)
𝑏𝑀 = the midpoint of the current search interval
𝑛𝑀 = lcp(word (𝑏𝑀), 𝑝)
𝑏𝑅 = the right boundary of the current search interval
𝑛𝑅 = lcp(word (𝑏𝑅), 𝑝)

The previous method is slow because it has potentially many redundant comparisons. Specifically, when 𝑛𝐿 ≠ 𝑛𝑅 we have performed max(𝑛𝐿, 𝑛𝑅) comparisons and yet we will start at min(𝑛𝐿, 𝑛𝑅), yielding max(𝑛𝐿, 𝑛𝑅) − min(𝑛𝐿, 𝑛𝑅) redundant comparisons.

Our speed up is achieved by improving this case where 𝑛𝐿̸= 𝑛𝑅. We proceed with the case of 𝑛𝐿> 𝑛𝑅. For the other case, all arguments below hold upon exchanging 𝐿 and 𝑅, and reversing the ordering of ≺.

The key concept to our speed up is the following observation, based on the contraposition of (2) of corollary 3.1:


Observation 3.2 Given strings 𝑢, 𝑣, 𝑤 such that 𝑢 ≺ 𝑣 and 𝑢 ≺ 𝑤: if lcp(𝑢, 𝑣) < lcp(𝑢, 𝑤), then 𝑢 ≺ 𝑤 ≺ 𝑣.

This, combined with word (𝑏𝐿) ≺ word (𝑏𝑀) and word (𝑏𝐿) ≺ 𝑝 allows us to deduce the ordering of word (𝑏𝐿), word (𝑏𝑀) and 𝑝 based on 𝑙𝑚 = lcp(word (𝑏𝐿), word (𝑏𝑀)) and 𝑛𝐿. We do this by distinguishing the following three cases:

𝑙𝑚 > 𝑛𝐿: Here, it follows that word (𝑏𝐿) ≺ word (𝑏𝑀) ≺ 𝑝. This means we set 𝑏𝐿 ← 𝑏𝑀. We need not change 𝑛𝐿, because 𝑛𝐿 = min(𝑙𝑚, 𝑛𝑀) and 𝑙𝑚 > 𝑛𝐿, so 𝑛𝐿 = 𝑛𝑀.

𝑙𝑚 < 𝑛𝐿: Here, it follows that word (𝑏𝐿) ≺ 𝑝 ≺ word (𝑏𝑀). This means we set 𝑏𝑅 ← 𝑏𝑀. We also set 𝑛𝑅 ← 𝑙𝑚, because 𝑙𝑚 = min(𝑛𝐿, 𝑛𝑀) < 𝑛𝐿, so 𝑙𝑚 = 𝑛𝑀.

𝑙𝑚 = 𝑛𝐿: In this case, observation 3.2 gives no information. However, we know 𝑛𝑀 ≥ 𝑛𝐿 because 𝑛𝑀 ≥ min(𝑙𝑚, 𝑛𝐿). This means we can start comparing at 𝑛𝐿 + 1 = max(𝑛𝐿, 𝑛𝑅) + 1.

This is implemented in algorithm 3.1. In this algorithm, if we do any comparison at all, we always start at max(𝑛𝐿, 𝑛𝑅) + 1 in 𝑝. Moreover, at that point we will not have compared any character beyond the first max(𝑛𝐿, 𝑛𝑅) + 1 characters of 𝑝 (the + 1 because we only know two strings agree up to 𝑖 when we see a difference at 𝑖 + 1). Therefore, we perform at most a single redundant comparison per iteration. This finally gives us the 𝒪 (|𝑝| + log 𝑛) bound.

3.2 Computing lcp via the lcp array

Having seen the power of the longest common prefix, we still need to know how to compute it in 𝒪 (1) time. The basis is the lcp array. This array enhances a word array WA. (Recall that word (𝑖) = 𝑤WA[𝑖].) It is defined as:

Definition 3.2 (lcp array) LCP[𝑖] = lcp(word (𝑖 − 1), word (𝑖))

Obviously, this only stores the answer to a small part of all possible lcp queries. However, due to the lexicographical ordering of the word array, we can compute lcp based on the lcp array:

Lemma 3.3 Given a word array WA and corresponding lcp array LCP, we have:

lcp(word (𝑖), word (𝑗)) = min(LCP[𝑖 + 1 : 𝑗 + 1]) for 𝑖 < 𝑗

Proof This is a simple consequence of recursive application of corollary 3.1 

However nice this result, naïve computation based on this formula takes 𝒪 (𝑛) time, far more than the promised 𝒪 (1). However, yet another function, the range minimum query or RMQ, allows us to reduce this to 𝒪 (1) time. This does require 𝒪 (𝑛) time and memory for preprocessing, but this is still acceptable.

Once again we first examine the definition and applications, deferring the internal workings to section 3.5. RMQ is defined as follows:

Definition 3.3 (Range minimum query) Given an array of integers 𝐴 and indices 𝑖 and 𝑗 into 𝐴, the function RMQ𝐴(𝑖, 𝑗) returns the index of the leftmost minimal element in the subarray 𝐴[𝑖 : 𝑗].

This function will be essential throughout this chapter. Here, it allows us to write:

lcp(word (𝑖), word (𝑗)) = min(LCP[𝑖 + 1 : 𝑗 + 1]) = LCP[RMQLCP(𝑖 + 1, 𝑗 + 1)]
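As a hedged illustration of the two formulas above (not part of the original text), the following sketch computes the LCP array naïvely and answers arbitrary lcp queries; the min over a range stands in for the 𝒪 (1) RMQ of section 3.5.

def lcp(u, v):
    k = 0
    while k < min(len(u), len(v)) and u[k] == v[k]:
        k += 1
    return k

def lcp_array(S, SA):
    # LCP[i] = lcp(word(i-1), word(i)); the undefined entry LCP[0] is set to 0.
    return [0] + [lcp(S[SA[i - 1]:], S[SA[i]:]) for i in range(1, len(SA))]

def lcp_query(LCP, i, j):
    # lcp(word(i), word(j)) = min(LCP[i+1 : j+1]) for i < j (lemma 3.3).
    return min(LCP[i + 1:j + 1])

S = "banana$"
SA = [6, 5, 3, 1, 0, 4, 2]
LCP = lcp_array(S, SA)
print(LCP)                   # [0, 0, 1, 3, 0, 0, 2]
print(lcp_query(LCP, 1, 3))  # 1 = lcp("a$", "anana$")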


Algorithm 3.1 improvedBinarySearch(𝑝, WA)

Require: string 𝑝, the pattern to search, word array WA in which to search, and algorithm 3.2 (basicStep).
Ensure: returns 𝑡𝑟𝑢𝑒 if 𝑝 is a prefix of a word in WA and 𝑓𝑎𝑙𝑠𝑒 otherwise.

𝑏𝐿 ← 0
𝑛𝐿 ← 0
𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑 ← word (𝑏𝐿)
while 𝑝[𝑛𝐿 + 1] = 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛𝐿 + 1] do 𝑛𝐿 ← 𝑛𝐿 + 1
if 𝑝[𝑛𝐿 + 1] ≺ 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛𝐿 + 1] then return 𝑓𝑎𝑙𝑠𝑒 {𝑝 ≺ word (0)}
𝑏𝑅 ← WA.𝑠𝑖𝑧𝑒 − 1
𝑛𝑅 ← 0
𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑 ← word (𝑏𝑅)
while 𝑝[𝑛𝑅 + 1] = 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛𝑅 + 1] do 𝑛𝑅 ← 𝑛𝑅 + 1
if 𝑝[𝑛𝑅 + 1] ≻ 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛𝑅 + 1] then return 𝑓𝑎𝑙𝑠𝑒 {word (WA.𝑠𝑖𝑧𝑒 − 1) ≺ 𝑝}
while 𝑏𝐿 ≠ 𝑏𝑅 do
  if 𝑛𝐿 = 𝑛𝑅 then
    𝑏𝑎𝑠𝑖𝑐𝑆𝑡𝑒𝑝(𝑏𝐿, 𝑏𝑅, 𝑛𝐿, 𝑛𝑅, 𝑝, WA)
    continue
  if 𝑛𝐿 > 𝑛𝑅 then
    𝑠𝑖𝑑𝑒 ← 𝐿
    𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝑅
  if 𝑛𝐿 < 𝑛𝑅 then
    𝑠𝑖𝑑𝑒 ← 𝑅
    𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝐿
  {set new 𝑏’s and 𝑛’s based on 𝑠𝑖𝑑𝑒 and 𝑙𝑐}
  𝑏𝑀 ← (𝑏𝐿 + 𝑏𝑅)/2
  𝑙𝑐 ← lcp(word (𝑏𝑠𝑖𝑑𝑒), word (𝑏𝑀))
  if 𝑙𝑐 = 𝑛𝑠𝑖𝑑𝑒 then
    𝑏𝑎𝑠𝑖𝑐𝑆𝑡𝑒𝑝(𝑏𝐿, 𝑏𝑅, 𝑛𝐿, 𝑛𝑅, 𝑝, WA)
    continue
  if 𝑙𝑐 > 𝑛𝑠𝑖𝑑𝑒 then
    𝑏𝑠𝑖𝑑𝑒 ← 𝑏𝑀
  if 𝑙𝑐 < 𝑛𝑠𝑖𝑑𝑒 then
    𝑏𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝑏𝑀
    𝑛𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝑙𝑐
return 𝑓𝑎𝑙𝑠𝑒

Algorithm 3.2 basicStep(𝑏𝐿, 𝑏𝑅, 𝑛𝐿, 𝑛𝑅, 𝑝, WA)

Require: 𝑏𝐿, 𝑏𝑅, 𝑛𝐿, 𝑛𝑅, the left and right boundaries and how far they match 𝑝, 𝑝 itself and word array WA to compute word .

Ensure: perform the standard binary search step.

𝑏𝑀 ← (𝑏𝐿 + 𝑏𝑅)/2
𝑛𝑀 ← max(𝑛𝐿, 𝑛𝑅)
𝑚𝑖𝑑𝑊𝑜𝑟𝑑 ← word (𝑏𝑀)
while 𝑝[𝑛𝑀 + 1] = 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛𝑀 + 1] do 𝑛𝑀 ← 𝑛𝑀 + 1
if 𝑛𝑀 = 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ − 1 then
  return 𝑡𝑟𝑢𝑒 {This return must cascade to the calling algorithm}
if 𝑝[𝑛𝑀 + 1] ≺ 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛𝑀 + 1] then
  𝑏𝑅 ← 𝑏𝑀
  𝑛𝑅 ← 𝑛𝑀
if 𝑝[𝑛𝑀 + 1] ≻ 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛𝑀 + 1] then
  𝑏𝐿 ← 𝑏𝑀
  𝑛𝐿 ← 𝑛𝑀


Since RMQ takes 𝒪 (1) time, so does computation of this formula. This is how we are able to answer arbitrary lcp queries in 𝒪 (1) time.

It is interesting to note that our speed-up of binary search does not need RMQ. The lcp values that are used are restricted, allowing us to pre-compute them. The set of all possible search intervals has a simple binary-tree structure; as such, within a word array there are only 𝒪 (𝑛) possible search intervals, and it suffices to pre-compute the lcp value for all of these. This can be done in 𝒪 (𝑛) time based on the lcp array, by dynamic programming, starting with the smaller intervals. For the details, see [Gus97, p. 145]. However, we will soon see situations where arbitrary lcp queries are needed.

3.3 lcp intervals and the lcp tree

We have already seen how the lcp array, pre-processed for RMQ, can be used to speed up binary search over a word array, and thus over a suffix array. Here, based on [FH07], we will see how the lcp array can take an even more prominent role. We will be focussing on the specific case of a suffix array. However, this entire section can be generalized to work for word arrays, as is explained in section 3.7.

For every prefix that is shared among multiple suffixes, these suffixes form a contiguous interval in the suffix array. This fact forms the basis for the concept of the ‘lcp interval’, which is defined as follows¹:

Definition 3.4 (proper lcp interval) Given a suffix array SA and a corresponding lcp array LCP we say that [𝑖, 𝑗) is a proper lcp interval of value 𝑙 if and only if the following all hold:

1. LCP[𝑖] < 𝑙
2. LCP[𝑗] < 𝑙
3. min(LCP[𝑖 + 1 : 𝑗]) = 𝑙

The indices 𝑘 with 𝑖 < 𝑘 < 𝑗 and LCP[𝑘] = 𝑙 are called the 𝑙-indices.

Conceptually, we take the undefined entries of the LCP array (at indices 0 and 𝑛) to be 0.

If [𝑖, 𝑗) is a proper lcp interval of value 𝑙, we can also write that 𝑙 -[𝑖, 𝑗) is a proper lcp interval.

Corollary 3.4 The following gives a bijection between all proper lcp intervals and (right branching substrings of 𝑆 ∪ nested suffixes of 𝑆):

string(𝑙 -[𝑖, 𝑗)) = suff (SA[𝑘])[0 : 𝑙] for any 𝑘 ∈ [𝑖, 𝑗) (𝑗 being excluded) (3)

The inverse of this function is:

𝑝 = |𝑝| -[𝑖, 𝑗) where [𝑖, 𝑗) = {𝑘 | 𝑝 ∈ Pref (suff (SA[𝑘]))} (4)

Proof By corollary 3.1 and point 3 of the definition, (3) is well defined. Furthermore, if we take 𝑖 to be an 𝑙-index, it follows that suff (SA[𝑖 − 1]) and suff (SA[𝑖]) differ on the (𝑙 + 1)-th character.

This means that suff (SA[𝑖])[0 : 𝑙] is either right branching or a nested suffix (or both).

For (4) we look to observation 2.5. This tells us that given any 𝑝, 𝐼 = {𝑘 | 𝑝 ∈ Pref (suff (SA[𝑘]))} forms an interval. By (3) we already know that if 𝑝 ∈ right-branching substrings of 𝑆 ∪ nested suffixes of 𝑆, then [𝑖, 𝑗) ⊆ 𝐼. Now, it follows from points 1 and 2 of the definition that neither 𝑖 − 1 nor 𝑗 is included in 𝐼. □

¹ Here, due to our convention, 𝑗 is not included in the interval. This makes things easier later, but is contrary to what is seen in the literature.


Note that, since we decided to take undefined values of the LCP array to be 0, [0, 𝑛) is also an lcp interval.

The next step is to extend our definition of the lcp interval. We are currently missing singleton intervals, intervals of the form [𝑖, 𝑖 + 1). These play an important role because they correspond to the suffixes.

Indeed, if we set 𝑙 = |suff (SA[𝑖])| then (3) extends to this case without effort. The use of suffixes is that they ‘terminate’ the branching done by right-branching substrings.

This extension is somewhat troublesome in the case of nested suffixes. If 𝑢 = suff (SA[𝑖]) is nested, it already corresponds to both a proper lcp interval and a singleton interval. In these cases, we define 𝑢 to be the proper lcp interval. This is a technical detail that never even comes up when implementing the algorithms. It does, however, illustrate the complexity induced by nested suffixes. With that detail taken care of, we can now define the complete set of lcp intervals:

Definition 3.5 (lcp interval) We say [𝑖, 𝑗) is an lcp interval of value 𝑙 when either of the following hold:

∙ 𝑙 -[𝑖, 𝑗) is a proper lcp interval.

∙ 𝑗 = 𝑖 + 1 and 𝑙 = |suff (SA[𝑖])|.

Again, if [𝑖, 𝑗) is an lcp interval of value 𝑙, we can also write that 𝑙 -[𝑖, 𝑗) is an lcp interval.

Corollary 3.5 If 𝑙 -[𝑖, 𝑗) is an lcp interval, then string(𝑙 -[𝑖, 𝑗)) ∈ right-branching substrings of 𝑆 ∪ Suff (𝑆)

Now, one lcp interval may very well contain other lcp intervals. In fact, if 𝑥 and 𝑦 are both lcp intervals, either one fully contains the other or they are disjoint. This allows us to define a descendant–ancestor relationship between intervals. We say that 𝑙′-[𝑖′, 𝑗′) is a descendant of 𝑙 -[𝑖, 𝑗) if and only if [𝑖′, 𝑗′) ⊂ [𝑖, 𝑗).

It follows trivially that 𝑙′ ≥ 𝑙. Furthermore, since [0, 𝑛) is an lcp interval, every other lcp interval is a descendant of it. This allows us to define the lcp tree:

Definition 3.6 (lcp tree) Given a suffix array SA and the corresponding LCP array, the lcp tree 𝐿 is defined as follows:

𝑁𝐿 = {𝑙 -[𝑖, 𝑗) | 𝑙 -[𝑖, 𝑗) is an lcp interval}

Furthermore, the structure is given by:

𝑇𝐿(𝑙 -[𝑖, 𝑗)) = {𝑙′-[𝑖′, 𝑗′) | 𝑖 ≤ 𝑖′ < 𝑗′ ≤ 𝑗}

Corollary 3.6 For an lcp tree 𝑇 , we conclude the following about the nodes:

1. The root of 𝑇 is the entire interval, which corresponds to 𝜖.

2. The leaves of 𝑇 are the singleton-intervals, corresponding to suffixes.

3. All internal nodes are proper lcp intervals, corresponding to right branching substrings and nested suffixes. Furthermore, since proper lcp intervals contain at least two singleton intervals, all internal intervals are branching.

In order to traverse this tree, we need an efficient way to find all the child intervals of an interval. The following lemma allows us to do this based on the 𝑙-indices.

Lemma 3.7 [FH07] Let 𝑙 -[𝑖, 𝑗) be an lcp interval. Furthermore, let 𝑘1 < 𝑘2 < . . . < 𝑘𝑚 be the 𝑙-indices of the interval. The child intervals of 𝑙 -[𝑖, 𝑗) are then: [𝑖, 𝑘1), [𝑘1, 𝑘2) . . . [𝑘𝑚, 𝑗).


Proof Define 𝑘0 = 𝑖 and 𝑘𝑚+1 = 𝑗. The intervals we are considering are then of the form [𝑘𝑎, 𝑘𝑎+1) for any 𝑎 with 0 ≤ 𝑎 ≤ 𝑚. It suffices to prove these are lcp intervals, since these intervals cover the interval [𝑖, 𝑗).

Singleton intervals are lcp intervals by definition. This leaves the non-singleton intervals. For these we have LCP[𝑘𝑎] ≤ 𝑙 and LCP[𝑘𝑎+1] ≤ 𝑙, with equality at the 𝑙-indices and strict inequality at the endpoints 𝑖 and 𝑗. Thus [𝑘𝑎, 𝑘𝑎+1) satisfies conditions 1 and 2 for any 𝑙′ > 𝑙. Furthermore we have 𝑘𝑎 < ℎ < 𝑘𝑎+1 → LCP[ℎ] > 𝑙, since such ℎ are not 𝑙-indices. This gives us condition 3 for 𝑙′ = min{LCP[ℎ] | 𝑘𝑎 < ℎ < 𝑘𝑎+1}. □

As such, finding the children of an interval only requires finding the 𝑙-indices. Looking at their definition, finding the leftmost 𝑙-index 𝑘1 of 𝑙 -[𝑖, 𝑗) can be done by computing 𝑘1 = RMQLCP(𝑖 + 1, 𝑗), which we know can be done in 𝒪 (1) time. If LCP[𝑘1] = 𝑙, then 𝑘1 is the leftmost 𝑙-index; if LCP[𝑘1] ≠ 𝑙, there are no 𝑙-indices. We can find the subsequent 𝑙-indices recursively, by taking 𝑘0 = 𝑖 and then applying

𝑘𝑎+1 = RMQLCP(𝑘𝑎 + 1, 𝑗) as long as LCP[𝑘𝑎+1] = 𝑙
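The recursion above is easy to express in code. The sketch below is illustrative only; the helper rmq is a naïve 𝒪 (𝑗 − 𝑖) stand-in for the 𝒪 (1) RMQ of section 3.5. It lists the child intervals of an lcp interval 𝑙 -[𝑖, 𝑗) via its 𝑙-indices, as in lemma 3.7.

def rmq(A, lo, hi):
    # Index of the leftmost minimum of A[lo:hi]; naive stand-in for real RMQ.
    return min(range(lo, hi), key=lambda k: A[k])

def child_intervals(LCP, i, j):
    if j - i <= 1:
        return []                         # singleton intervals have no children
    l = LCP[rmq(LCP, i + 1, j)]           # the value of the interval l-[i, j)
    bounds, k = [i], i
    while True:
        k = rmq(LCP, k + 1, j) if k + 1 < j else j
        if k == j or LCP[k] != l:
            break
        bounds.append(k)                  # k is the next l-index
    bounds.append(j)
    return list(zip(bounds, bounds[1:]))

# The lcp tree of "banana$" (SA = [6, 5, 3, 1, 0, 4, 2], LCP as computed earlier):
LCP = [0, 0, 1, 3, 0, 0, 2]
print(child_intervals(LCP, 0, 7))   # [(0, 1), (1, 4), (4, 5), (5, 7)]
print(child_intervals(LCP, 1, 4))   # [(1, 2), (2, 4)], the subtree below "a"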

3.4 Lcp tree – suffix tree equivalence

The structure we see in the lcp tree is very similar to one we have seen before: the suffix tree. Both trees have the empty string as their root, right-branching substrings as internal nodes and suffixes as leaves. We will show these two trees are isomorphic by introducing a tree ℬ defined for any string. To reduce technicalities, we further assume the string 𝑆 ends with a sentinel character, and therefore does not have any nested suffixes. Based on this assumption we will show that both the lcp tree and the suffix tree are isomorphic to this underlying tree ℬ.

We call this tree ℬ the branching set over string 𝑆. The essential idea is that each substring can be found by traversing ever longer right-branching substrings. It is defined as follows:

Definition 3.7 (Branching set) Given a string 𝑆, the branching set ℬ is a tree. The nodes of the tree are:

𝑁 = {𝜖} ∪ right-branching substrings of 𝑆 ∪ Suff (𝑆)

Furthermore, the structure is given by:

𝑣 ∈ 𝑇 (𝑢) ⇐⇒ 𝑢 a prefix of 𝑣

Since we assume 𝑆 to have no nested suffixes, this means that all leaves of ℬ are suffixes, and all internal nodes are right-branching substrings.

First, we will show the lcp tree and ℬ are isomorphic with the following lemma:

Lemma 3.8 The following is an isomorphism between the lcp tree and ℬ:

string : lcp intervals → 𝑁

Proof Corollary 3.5 combined with the definition of ℬ implies string is a bijection between the lcp intervals and the nodes of ℬ.

It remains to be shown that the structure of the trees is isomorphic. For this, we must show:

𝑙′-[𝑖′, 𝑗′) ∈ 𝑇 (𝑙 -[𝑖, 𝑗)) ⇐⇒ string(𝑙 -[𝑖, 𝑗)) ∈ Pref (string(𝑙′-[𝑖′, 𝑗′)))

This holds because:

⇒: We can immediately conclude that 𝑙 ≤ 𝑙′ and [𝑖′, 𝑗′) ⊆ [𝑖, 𝑗). Applying the definition of string (equation (3)) gives the desired result.

⇐: Due to the lexicographical ordering of SA, we know that all suffixes that have string(𝑙 -[𝑖, 𝑗)) as a prefix lie in a single interval. By definition of the lcp interval, this interval is SA[𝑖 : 𝑗]. Since string(𝑙 -[𝑖, 𝑗)) is a prefix of string(𝑙′-[𝑖′, 𝑗′)), all suffixes that have string(𝑙′-[𝑖′, 𝑗′)) as a prefix must form a subinterval of SA[𝑖 : 𝑗]. This gives us 𝑖 ≤ 𝑖′ < 𝑗′ ≤ 𝑗, which by definition means that 𝑙′-[𝑖′, 𝑗′) lies in the subtree of 𝑙 -[𝑖, 𝑗). □

Next, we will show the suffix tree and ℬ are isomorphic with the following lemma:

Lemma 3.9 The following is an isomorphism between the suffix tree 𝑆𝑇 and ℬ:

string : 𝑁𝑆𝑇 → 𝑁

Proof First, note that string(𝑢) = 𝑢. Due to corollary 2.4 and the remark following it, we have a bijection between the nodes of ℬ and 𝑆𝑇. Furthermore, from the definition of the suffix tree, we have: 𝑣 ∈ subtree of 𝑢 ⇐⇒ 𝑢 is a prefix of 𝑣. This immediately gives us the isomorphism. □

This finally gives us the equivalence between the suffix tree and the lcp tree, though only in the case where 𝑆 has no nested suffixes. But why do nested suffixes cause a problem here? At the heart of the matter lies the ambiguity of a string ‘branching’ when there is a nested suffix. Is a string branching when we can append two different characters to it, as with the suffix tree, or when it is a prefix of two different strings, as with the lcp tree? Neither choice is better. The lcp tree has duplicate nodes, whilst the suffix tree has suffixes without any (explicit) node. This is an important part of why nested suffixes are so troublesome.

3.5 RMQ

The subject of computing and pre-processing for RMQ is rich enough to devote an entire paper to. Here, we will only examine one method that is easily understood. This section is based mostly on [BFC00].

We will encounter two new problems in this algorithm: LCA and ±RMQ. LCA, or the lowest common ancestor problem, is defined as follows:

Definition 3.8 (lowest common ancestor) Given a tree 𝑇 , and two nodes 𝑥, 𝑦, we define the set of common ancestors:

𝐴 = {𝑧 ∈ 𝑇 | 𝑧 ancestor of 𝑥 ∧ 𝑧 ancestor of 𝑦}

The lowest common ancestor is then given by:

LCA𝑇(𝑥, 𝑦) = arg max𝑧∈𝐴 𝑑𝑒𝑝𝑡ℎ(𝑧)

The ±RMQ problem is defined as RMQ on a restricted class of arrays: ±arrays. These are arrays where successive values differ by either +1 or −1.

Our method works via two reductions. First, we reduce RMQ to LCA via the ‘Cartesian tree’. Second, we reduce LCA to ±RMQ by looking at the depths along an Euler tour. Finally, we will show how to solve ±RMQ.

3.5.1 RMQ to LCA

We can reduce the problem of RMQ on an array 𝐴 of size 𝑛 to the problem of LCA on a binary tree 𝐶. Here 𝐶 is the Cartesian tree of 𝐴. It is defined as follows:


Definition 3.9 (Cartesian tree) Given an array 𝐴 of size 𝑛, the Cartesian tree is recursively defined as follows:

The root of 𝐶 is the index 𝑖 of the (leftmost) minimal element of 𝐴. The left and right subtrees of the root are, respectively, the Cartesian trees for the left subarray 𝐴[0 : 𝑖] and the right subarray 𝐴[𝑖 + 1 : 𝑛].

From this definition, we derive the following theorem:

Theorem 3.10 Given an array 𝐴 and its Cartesian tree 𝐶, we have:

LCA𝐶(𝑖, 𝑗) = RMQ𝐴(𝑖, 𝑗 + 1)

Proof Taking 𝑘 = 𝐿𝐶𝐴𝐶(𝑖, 𝑗), 𝑖 and 𝑗 lie in respectively the left and right subtrees of 𝑘 and thus 𝑖 ≤ 𝑘 ≤ 𝑗.

Furthermore, 𝑘 is the leftmost minimal element of some range [𝑎, 𝑏]. Due to 𝑖 and 𝑗 being descendants of 𝑘, we have [𝑖, 𝑗] ⊂ [𝑎, 𝑏]. This, combined with 𝑖 ≤ 𝑘 ≤ 𝑗, ensures 𝑘 is the leftmost minimal element of [𝑖, 𝑗]. □

All that remains to be shown is that constructing the cartesian tree can be done in 𝒪 (𝑛) time. We present an iterative approach. Let 𝐶𝑖 be the Cartesian tree for 𝐴[0 : 𝑖] and let 𝑥 = 𝐴[𝑖] be the node we need to add. Furthermore, let 𝑅𝑖be the right path of 𝐶𝑖. Certainly, 𝑥 must be added as the child of some node in 𝑅𝑖. After all, every other node concerns a subarray not including the end point. Now, for any 𝑦 ∈ 𝑅𝑖 : 𝑦 ≤ 𝑥, 𝑥 must be added to the subtree rooted at 𝑦. For the other points in 𝑅𝑖: 𝑧 ∈ 𝑅𝑖: 𝑧 > 𝑥 we know they must move into the left subtree of 𝑥.

As such, we search 𝑅𝑖 bottom-up until we find the first point 𝑦 ∈ 𝑅𝑖: 𝑦 ≤ 𝑥. We then set 𝑥 as the new right-child of 𝑦 and set the old right subtree of 𝑦 to be the left subtree of 𝑥.

This construction method takes 𝒪 (𝑛) time, because any node that is compared along the right path and found larger is removed from it and never compared again; each insertion performs at most one further comparison. Since the runtime is proportional to the total number of comparisons, this gives us 𝒪 (𝑛) time in total. So we have reduced RMQ to LCA.
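A minimal sketch of this stack-based construction (illustrative only, not from the article): the stack holds the right path 𝑅𝑖, and parent/child links are stored per index.

def cartesian_tree(A):
    n = len(A)
    left = [-1] * n       # left[i]  = index of the left child of node i, or -1
    right = [-1] * n      # right[i] = index of the right child of node i, or -1
    stack = []            # the right path of the tree built so far
    for i in range(n):
        last = -1
        # Nodes on the right path with a larger value than A[i] ...
        while stack and A[stack[-1]] > A[i]:
            last = stack.pop()
        left[i] = last                      # ... become the left subtree of i,
        if stack:
            right[stack[-1]] = i            # and i becomes the new right child.
        stack.append(i)
    return stack[0], left, right            # stack[0] is the root

root, left, right = cartesian_tree([3, 1, 4, 1, 5, 9, 2])
print(root)   # 1: the index of the leftmost minimal element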

3.5.2 LCA to ±RMQ

Our next step is to reduce LCA to ±RMQ in 𝒪 (𝑛) time. Our method takes any tree 𝑇 with enumerated nodes. The key to this step is the following:

Observation 3.11 LCA𝑇(𝑥, 𝑦) is the shallowest node encountered after 𝑥 and before 𝑦 during an Euler tour.

We start by storing an Euler tour of 𝑇 in an array 𝐸 such that 𝐸[𝑖] is the 𝑖-th node visited on the Euler tour. Next, we create an array 𝐷 that stores the depth of the nodes visited on the Euler tour. That is: 𝐷[𝑖] = 𝑑𝑒𝑝𝑡ℎ(𝐸[𝑖]). Finally we create an array 𝑅 that stores the representative of each node, its first occurrence in 𝐸. This gives us 𝐸[𝑅[𝑖]] = 𝑖. We could use any occurrence, but making a choice makes 𝑅 well-defined. This fact, combined with observation 3.11, yields the following lemma:

Lemma 3.12 Defining 𝐸, 𝐷 and 𝑅 as stated above we have: LCA𝐶(𝑖, 𝑗) = 𝐸[RMQ𝐷(𝑅[𝑖], 𝑅[𝑗])]

where 𝐷 is a ±array.

Constructing 𝐸, 𝐷 and 𝑅 in 𝒪 (𝑛) time during an Euler tour is trivial. All that remains is solving ±RMQ in 𝒪 (1) time after 𝒪 (𝑛) time and memory for pre-processing. Note that, after having constructed arrays 𝐸, 𝐷 and 𝑅, we no longer need the actual Cartesian tree.
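A hedged sketch of this reduction follows (not from the article; the child lists are assumed to be given). It builds 𝐸, 𝐷 and 𝑅 by a depth-first Euler tour and answers LCA with a naïve ±RMQ stand-in.

def euler_tour(root, children):
    E, D, R = [], [], {}
    def dfs(x, depth):
        if x not in R:
            R[x] = len(E)              # representative: first occurrence of x in E
        E.append(x)
        D.append(depth)
        for y in children.get(x, []):
            dfs(y, depth + 1)
            E.append(x)                # we return to x after visiting each child
            D.append(depth)
    dfs(root, 0)
    return E, D, R                     # D is a +/-1 array by construction

def lca(E, D, R, x, y):
    lo, hi = sorted((R[x], R[y]))
    k = min(range(lo, hi + 1), key=lambda t: D[t])   # naive +/-RMQ stand-in
    return E[k]

# The Cartesian tree of [3, 1, 4, 1, 5, 9, 2] from the previous sketch:
children = {1: [0, 3], 3: [2, 6], 6: [4], 4: [5]}
E, D, R = euler_tour(1, children)
print(lca(E, D, R, 0, 2))   # 1, matching RMQ(0, 3) on the original array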


±RMQ

The matter at hand is computing RMQ for a ±array 𝐴 of size 𝑛. This should take 𝒪 (𝑛) time and memory for pre-processing and 𝒪 (1) time per query.

We proceed as follows: We divide the array 𝐴 into blocks of size 𝑘 = (log 𝑛)/2. We then distinguish two different cases:

(a) 𝑖 and 𝑗 lie in the same block.

(b) 𝑖 and 𝑗 lie in different blocks.

We call case (a) an ‘in-block query’. We shall solve these with a lookup table. We shall solve case (b) by comparing 3 minima:

1. The minimum between 𝑖 and the end of its block.

2. The minimum of any blocks between 𝑖’s block and 𝑗’s block.

3. The minimum between 𝑗 and the beginning of its block

If we know the (leftmost) indexes associated with these minima, calculating RMQ(𝑖, 𝑗) becomes trivial.

In this case, minima 1 and 3 will be computed as in-block queries; minimum 2 will be computed as a ‘superblock query’.

We shall first explain how to answer in-block queries.

In-block queries: To solve in-block queries we want to pre-compute enough answers that in-block queries become simple lookups. Essential to this is the following:

Observation 3.13 Given two arrays 𝑋 and 𝑌 that differ by some fixed value at every position, that is: ∃𝑐∀𝑖 : 𝑋[𝑖] = 𝑌 [𝑖] + 𝑐, RMQ𝑋 is equivalent to RMQ𝑌.

We call a block normalized when its first element is 0. Due to the above observation, we can reduce our pre-computation to only normalized blocks. This is where the ± property comes in, limiting the number of normalized blocks:

Observation 3.14 There are only 2^(𝑘−1) possible normalized ± blocks of length 𝑘.

We now simply pre-compute the answers to all 𝒪 (2^(𝑘−1) · 𝑘²) ⊆ 𝒪 (𝑛) possible in-block queries for these normalized ± blocks. Finally, for each block in 𝐴, we store which normalized block should be used.

Thus, in-block queries reduce to lookups.

Pre-computing all these answers can be done in 𝒪 (𝑛) time by dynamic programming, solving the queries for shorter intervals first.
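As a rough illustration of this pre-computation (not from the article), the sketch below enumerates all 2^(𝑘−1) normalized ± blocks and tabulates every in-block query for each of them; the total work is 𝒪 (2^(𝑘−1) · 𝑘²) ⊆ 𝒪 (𝑛).

from itertools import product

def inblock_tables(k):
    # For every +/-1 step pattern, tabulate the leftmost-minimum index for
    # all pairs (i, j) with 0 <= i <= j < k over the normalized block values.
    tables = {}
    for steps in product((+1, -1), repeat=k - 1):
        vals = [0]
        for s in steps:
            vals.append(vals[-1] + s)      # the normalized block itself
        answers = {}
        for i in range(k):
            best = i
            for j in range(i, k):
                if vals[j] < vals[best]:
                    best = j
                answers[(i, j)] = best
        tables[steps] = answers
    return tables

tables = inblock_tables(4)
print(len(tables))                         # 8 = 2**(4-1) normalized blocks
print(tables[(+1, -1, -1)][(0, 3)])        # 3: the minimum of [0, 1, 0, -1]

In use, each block of 𝐴 keeps a pointer to the table of its normalized shape, so an in-block query is a single lookup.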

Superblock queries: We introduce two new arrays storing information about the blocks as a whole.

𝐵 stores the minimum value of each block, and 𝐼 the corresponding leftmost index. Since there are 𝑚 = 𝑛/𝑘 = 2𝑛/log 𝑛 blocks, the arrays are of size 𝑚. Formally we have:

𝐼[𝑖] = RMQ𝐴(𝑖 · 𝑘, (𝑖 + 1) · 𝑘)
𝐵[𝑖] = 𝐴[𝐼[𝑖]]

Now, a superblock query from the 𝑖-th block to the 𝑗-th block reduces to a general RMQ query on 𝐵: we need simply return 𝐼[RMQ𝐵(𝑖, 𝑗 + 1)]. To solve this general RMQ query we use a method called the sparse table. Now, this method requires 𝒪 (𝑚 log 𝑚) time for preprocessing, which is why we do not use it to solve general RMQ directly. However, in this case it suffices because:

𝑚 log 𝑚 = (2𝑛/log 𝑛) · log(2𝑛/log 𝑛) ≤ (2𝑛/log 𝑛) · log(2𝑛) = (2𝑛/log 𝑛) · (log 𝑛 + log 2) ≤ (2𝑛/log 𝑛) · 2 log 𝑛 = 4𝑛 ∈ 𝒪 (𝑛).

The sparse table works as follows: for each interval of size 2^𝑎 starting at position 𝑖, we precompute the RMQ and store it in a table 𝑀. That is:

𝑀[𝑖][𝑎] = RMQ𝐵(𝑖, 𝑖 + 2^𝑎)

Now, for any 𝑖 and 𝑗 we compute RMQ𝐵(𝑖, 𝑗) as follows: we take 𝑎 = ⌊log(𝑗 − 𝑖)⌋. Then, the union of the intervals [𝑖, 𝑖 + 2^𝑎) and [𝑗 − 2^𝑎, 𝑗) is [𝑖, 𝑗). For these intervals, the RMQ answers are stored in 𝑀[𝑖][𝑎] and 𝑀[𝑗 − 2^𝑎][𝑎]. Based on these, computing RMQ𝐵(𝑖, 𝑗) in 𝒪 (1) time is trivial. This table 𝑀 clearly stores 𝒪 (𝑚 log 𝑚) values. We can also fill it in 𝒪 (𝑚 log 𝑚) time using dynamic programming, starting with the smaller intervals.
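A compact sketch of the sparse table follows (illustrative; the table is indexed here as M[a][i] rather than 𝑀[𝑖][𝑎]): it is filled by doubling interval lengths and answers a query with two overlapping dyadic intervals.

from math import floor, log2

def build_sparse_table(B):
    m = len(B)
    M = [list(range(m))]                    # intervals of length 2**0 = 1
    a = 1
    while (1 << a) <= m:
        prev, half = M[a - 1], 1 << (a - 1)
        row = []
        for i in range(m - (1 << a) + 1):
            l, r = prev[i], prev[i + half]  # minima of the two halves
            row.append(l if B[l] <= B[r] else r)
        M.append(row)
        a += 1
    return M

def rmq(B, M, i, j):
    # Leftmost minimum of B[i:j], covering [i, j) with two intervals of length 2**a.
    a = floor(log2(j - i))
    l, r = M[a][i], M[a][j - (1 << a)]
    return l if B[l] <= B[r] else r

B = [4, 2, 3, 7, 1, 5, 1, 6]
M = build_sparse_table(B)
print(rmq(B, M, 0, 4))   # 1, since B[1] = 2 is the minimum of B[0:4]
print(rmq(B, M, 2, 8))   # 4, the leftmost of the two 1s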

3.5.3 Alternative methods

What we have seen so far gives an easy to understand implementation of RMQ. However, it is somewhat unwieldy. The reduction to LCA alone already takes about 5𝑛 words of memory: arrays 𝐸 and 𝐷 each take 2𝑛 − 1 words, and 𝑅 takes another 𝑛.

Luckily, there are alternative methods. One important such method is presented in [FH07]. There, it is shown that RMQ can be solved using only 3𝑛 bits.¹ That is easily less than 𝑛 words. The key to their method is to realize that two ‘blocks’ have the same RMQ results when their Cartesian trees are the same.

3.6 The Enhanced Suffix Array

Based on what we have seen, we define the Enhanced Suffix Array (or ESA) as follows:

Definition 3.10 (Enhanced Suffix Array) Given a string 𝑆, the Enhanced Suffix Array or ESA is the normal suffix array SA together with its LCP array, where LCP has been pre-processed for RMQ.

Corollary 3.15 The ESA has a memory footprint of 𝒪 (𝑛). Furthermore, assuming SA and LCP can be constructed in 𝒪 (𝑛) time (as is shown in section 4), the ESA can be constructed in 𝒪 (𝑛) time as well.

Proof SA and LCP have, by definition, a memory footprint of 𝒪 (𝑛). Furthermore, the pre-processing of LCP for RMQ takes 𝒪 (𝑛) time and memory, as seen in section 3.5. □

¹ 2𝑛 + 𝑜(𝑛) bits, to be exact.

References

GERELATEERDE DOCUMENTEN

KVB= Kortdurende Verblijf LG= Lichamelijke Handicap LZA= Langdurig zorg afhankelijk Nah= niet aangeboren hersenafwijking. PG= Psychogeriatrische aandoening/beperking

If we would have added three elements with weights 0, 0, 119 to our example, Figure 3.2 would still correspond to a solution graph, but the instance would not be locally smallest..

This pattern also requires the addition of accessor methods for attributes when functionality from one logic specification requires access to an attribute defined in another

Target country location factors Market potential Competitive structure Entry barriers Marketing infrastructure Local production factors Cultural distance Legal &amp;

geïsoleerd te staan, bijvoorbeeld het bouwen van een vistrap op plaatsen waar vismigratie niet mogelijk is omdat de samenhangende projecten zijn vastgelopen op andere

De ACM heeft daarop destijds aangegeven aan GTS dat te willen doen op basis van zo recent mogelijke cijfers over realisaties (besparingen moeten blijken).. GTS geeft aan

De ACM heeft echter geen aanwijzingen dat zij geen goede schatter heeft voor de kosten van kwaliteitsconversie per eenheid volume.. Daarom komt zij tot de conclusie dat zij wel

De historische PV gemeten op de transportdienst achtte de ACM representatief voor de verwachte PV op de aansluitdienst.. De transportdienst vertegenwoordigt het grootste deel van