**Contents**

**1 Introduction**

1.1 Preliminaries

1.2 Overview of problems and structures

**2 First concepts**

2.1 Tries

2.2 Basic Suffix Tries

2.3 Suffix Trees

2.4 Suffix Arrays

2.5 Concluding remarks

**3 The core enhancement: longest common prefix**

3.1 Using lcp to improve binary search

3.2 Computing lcp via the lcp array

3.3 lcp intervals and the lcp tree

3.4 Lcp tree – suffix tree equivalence

3.5 RMQ

3.6 The Enhanced Suffix Array

3.7 Generalizing to word arrays

3.8 Alphabet size dependence

3.9 Concluding remarks

**4 Construction**

4.1 Enhanced Suffix Array - Suffix Tree conversion

4.2 Ukkonen's algorithm

4.3 Concluding remarks

**5 Conclusion**

5.1 Pending matters

**1** **Introduction**

Given a string 𝑆, a natural question is whether a pattern 𝑝 is a substring of 𝑆. Moreover, one might want to test many patterns against the same string. We call this the general substring problem. The seminal example is found in biology, where a full genome is checked for the presence of many gene sequences. In this article we will look at ways to preprocess 𝑆 in 𝒪 (|𝑆|) time, yielding an 𝒪 (|𝑆|)-size data structure that allows 𝒪 (|𝑝|)-time resolution of the general substring problem. These are trivially lower bounds for this problem.

We will consider a few data structures. In sections 2 and 3 we examine their use and memory footprint; in section 4 we will see how to construct these data structures in 𝒪 (|𝑆|) time.

The first data structure we look at is the trie, so named because it allows for easy retrieval of strings. Applying the trie to the general substring problem yields the basic suffix trie. An optimization on tries yields the compressed trie. In the specific case of the basic suffix trie, this optimization yields the suffix tree. This structure has the desired 𝒪 (|𝑝|) search time and 𝒪 (|𝑆|) memory footprint.

The second structure we examine is the suffix array, derived from a simple ordered array of words, the word array. At first sight, the suffix array seems inferior to the suffix tree. However, with some elegant enhancements shown in section 3, it proves to be at least as efficient as the suffix tree when solving the general substring problem.

**1.1** **Preliminaries**

Before we get to the actual data structures, we need to introduce notation and some terminology.

First, we keep our variable types consistent: variables 𝑥, 𝑦 and 𝑧 are nodes, 𝑖 through 𝑙 are integers, 𝑝 through 𝑤 are strings, and 𝑎 and 𝑐 (but not 𝑏) are characters.

*Second, with our arrays we take their first index to be 0. Furthermore, by 𝐴[𝑖 : 𝑗] we mean the subarray*
*starting from 𝑖 up to but excluding 𝑗.*

*We write 𝑢 ≺ 𝑣 or 𝑣 ≻ 𝑢 to denote 𝑢 lexicographically preceding 𝑣.*

Since we are dealing with strings, we introduce terminology regarding strings and substrings. Σ is
*universally the alphabet over which we take our strings. 𝑆 is universally the string we want to pre-*
*process and 𝑛 = |𝑆|. For a string 𝑆, 𝑣 is a substring if and only if ∃𝑢∃𝑤 : 𝑢𝑣𝑤 = 𝑆. We distinguish a*
*few special sets of substrings. First we have the suffixes and prefixes of 𝑆, defined as:*

*Suff (𝑢) = {𝑠 | ∃𝑝 : 𝑝𝑠 = 𝑢}, the suffixes of 𝑢*
*Pref (𝑢) = {𝑝 | ∃𝑠 : 𝑝𝑠 = 𝑢}, the prefixes of 𝑢*

We call 𝑢 a repeated substring if it occurs at least twice in 𝑆. Formally: ∃𝑖∃𝑗 : 𝑖 ̸= 𝑗 ∧ 𝑆[𝑖 : 𝑖 + |𝑢|] = 𝑆[𝑗 : 𝑗 + |𝑢|] = 𝑢. A string that is both repeated and a suffix or a prefix is called, respectively, a nested suffix or a nested prefix. We shall see that nested suffixes can be quite troublesome. For this reason, we introduce a character $ ̸∈ Σ. Appending this to 𝑆 ensures 𝑆$ has no nested suffixes.

*We call a string 𝑢 a right branching substring if and only if:*

*∃𝑎∃𝑐 : 𝑎 ̸= 𝑐 ∧ 𝑢𝑎 a substring of 𝑆 ∧ 𝑢𝑐 a substring of 𝑆*

Note that any right-branching substring must also be a repeated substring. Finally, we introduce the
*following function: suff (𝑖) = 𝑆[𝑖 : 𝑛], which allows us to easily address the suffixes of 𝑆.*

We will also make use of edge-labelled trees. For such a tree 𝐿, we introduce the following notation: 𝑁_𝐿 are the nodes of 𝐿. The edges are written as a triple: (parent, child, label). The set of all edges is written as 𝐸_𝐿. To ensure this is actually a tree, every node except for one must have exactly one parent. The excepted node is the root and has no parent.

Finally, for a node 𝑥 we define 𝑇 (𝑥) to be the subtree rooted at 𝑥. We can define the structure of a tree (not the labels of the edges) by specifying the nodes in each subtree. If we know 𝑁_𝑇(𝑥) for each node 𝑥, we can deduce that:

𝐸_𝐿 = {(𝑥, 𝑦, −) | 𝑦 ∈ 𝑁_𝑇(𝑥) ∧ ¬∃𝑧 ∈ 𝑁_𝑇(𝑥) ∖ {𝑥, 𝑦} : 𝑦 ∈ 𝑁_𝑇(𝑧)}

In words, there is an edge from 𝑥 exactly to its direct descendants.

Figure 1: Graph of relations between the problems and structures in this article: general substring [1], trie [2.1], basic suffix trie [2.2], compressed trie [2.3], suffix tree [2.3], suffix array [2.4], LCP array [3.2], LCA [3.5], RMQ [3.5], ±RMQ [3.5] and enhanced suffix array [3.6], connected by edges labelled "solves", "reduces to", "contains", "derives from" and "requires". Round nodes correspond to problems, square nodes to structures. In brackets are the numbers of the sections where the concepts are introduced.

**1.2** **Overview of problems and structures**

We will see many interrelated problems and structures in this article. The graph in figure 1 shows the basic relations between these. Other relations of importance that are hard to capture in a graph are:

**∙ The enhanced suffix array solves the general substring problem via both the lcp tree [3.3] and**
**binary search [3.1].**

**∙ The lcp tree uses lcp intervals [3.3].**

**∙ The lcp tree is equivalent to the suffix tree [3.4].**

**∙ Binary search is worse than the other methods, unless we are dealing with large alphabets.**

**∙ The general substring problem and the solutions we present here can be generalized to finding**
prefixes of a set of words.

**2** **First concepts**

**2.1** **Tries**

Tries are the basic way to store strings for retrieval. They are trees that store strings by prefix. The root of a trie corresponds to the empty string, trivially a prefix of every string. Then, recursively, for each prefix the trie stores all characters we can append to that prefix to get a longer prefix. Each such character is stored in an edge ending in a new node corresponding to the longer prefix. For each prefix 𝑢 we shall denote the node in the trie corresponding to it by 𝑢. Conversely, for a node 𝑢 we define the function string(𝑢) = 𝑢. We capture this concept of the trie in the following definition:

**Definition 2.1 (Trie)** 𝑇, the trie for a set of words 𝑊, is defined as an edge-labeled tree satisfying the following properties:

∙ There exists a bijection 𝑢 ↦→ 𝑢 between 𝑃, the set of all prefixes of 𝑊 (i.e. 𝑃 = ⋃{Pref (𝑣) | 𝑣 ∈ 𝑊 }), and 𝑁_𝑇.

∙ The edges of 𝑇 are given by:

𝐸_𝑇 = {(𝑢, 𝑢𝑐, 𝑐) | 𝑢𝑐 ∈ 𝑃 }

From this it immediately follows that:

**Corollary 2.1 Given trie 𝑇 for a set of strings 𝑊 the following hold:**

*1. For any node 𝑢, the string 𝑢 is obtained by concatenating the edge labels encountered when*
*walking from the root to 𝑢.*

*2. There exists an injection between the leaves of 𝑇 and the words of 𝑊 .*

*3. The root of 𝑇 is 𝜖.*

4. No node has two outgoing edges with the same label.

*5. The trie of 𝑊 is uniquely determined.*

*Note that point 2 of corollary 2.1 does not state a bijection because one word in 𝑊 may be the prefix of*
another. For example, in the trie seen in figure 2, searching for "at" would not end in a leaf. As such one
*cannot determine the words in 𝑊 from its trie. In the example we cannot deduce that "at" ∈ 𝑊 from*
*the trie alone. We can prevent this situation by appending the sentinel $ to every word in 𝑊 . This way*
*no word in 𝑊 is the prefix of another word. Doing so ensures a bijection between the leaves of 𝑇 and*
*words in 𝑊 .*

The point of a trie is to be able to quickly find prefixes of words in a set. Point 1 of corollary 2.1 is essential to this. Say we want to know if 𝑝 is a prefix of a word in 𝑊. Given the trie 𝑇 for 𝑊, point 1 allows us to find incrementally longer prefixes of 𝑝 that occur in 𝑊. This proceeds until we either find 𝑝 in 𝑊, or can no longer find the next prefix. Moving from one prefix to the next is simple: if we are at node 𝑝[0 : 𝑗], we need only look for an edge labelled 𝑝[𝑗], because that edge leads to the node 𝑝[0 : 𝑗 + 1]. This gives rise to algorithm 2.1.

Figure 2: A trie for 𝑊 = {at, ate, tea, ten, too}.

**Algorithm 2.1 find(p, T)**

**Require:** string 𝑝, the pattern to find, and trie 𝑇 in which to search.

**Ensure:** Return node 𝑝 if it exists, NO_SUCH_PATTERN otherwise.

{Nodes of 𝑇 are assumed to have a method 𝑔𝑒𝑡𝐶ℎ𝑖𝑙𝑑(𝑐) returning the child at the end of the edge labelled 𝑐 if it exists, and 𝑛𝑢𝑙𝑙 otherwise.}

1: 𝑛𝑜𝑑𝑒 ← 𝑇.𝑟𝑜𝑜𝑡()

2: 𝑖𝑑𝑥 ← 0

3: **while** 𝑖𝑑𝑥 ̸= 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ **do**

4: 𝑛𝑜𝑑𝑒 ← 𝑛𝑜𝑑𝑒.𝑔𝑒𝑡𝐶ℎ𝑖𝑙𝑑(𝑝[𝑖𝑑𝑥])

5: **if** 𝑛𝑜𝑑𝑒 = 𝑛𝑢𝑙𝑙 **then**

6: **return** NO_SUCH_PATTERN

7: 𝑖𝑑𝑥 ← 𝑖𝑑𝑥 + 1

8: **return** 𝑛𝑜𝑑𝑒

*The invariant here is: 𝑛𝑜𝑑𝑒 = 𝑝[0 : 𝑖𝑑𝑥]. Traversing the while loop takes 𝒪 (1) time, and we traverse it*
*𝒪 (|𝑝|) times, giving us a running time of 𝒪 (|𝑝|).*

This gives us the ability to recognize prefixes of 𝑊, but not to distinguish full words. Ending in a leaf certainly guarantees we have a full word but, as stated in point 2 of corollary 2.1, the converse does not hold. Appending the sentinel $ to all words in 𝑊 solves this problem by making the converse hold. In this case, searching for 𝑝 only determines whether 𝑝 is a prefix of a word in 𝑊; searching for 𝑝$ determines whether 𝑝 is a word in 𝑊. In general one should always append the sentinel because the resulting trie stores more information.
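As a concrete illustration, the trie and algorithm 2.1 can be sketched in Python with each node represented as a plain dictionary from characters to child nodes. The dictionary representation and the function names here are our own, not part of the definitions above:

```python
# A minimal trie sketch: each node is a dict mapping a character to a child
# node. find() mirrors algorithm 2.1: it returns the node for p if every
# prefix of p occurs in the word set, and None (NO_SUCH_PATTERN) otherwise.

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for c in word:
            node = node.setdefault(c, {})  # create the node for the longer prefix
    return node if False else root

def find(p, root):
    node = root
    for c in p:               # invariant: node corresponds to the prefix read so far
        node = node.get(c)    # follow the edge labelled c, if any
        if node is None:
            return None
    return node

trie = build_trie(["at", "ate", "tea", "ten", "too"])
assert find("te", trie) is not None   # "te" is a prefix of "tea" and "ten"
assert find("tx", trie) is None
```

Note that, as in the text, `find("at", trie)` succeeds but does not tell us whether "at" is itself a word; appending the sentinel to every word and searching for "at$" would.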

**2.2** **Basic Suffix Tries**

Tries facilitate the finding of a word, or a prefix of such a word, within a set of words. However, our initial problem requires easy access to all substrings of a string 𝑆. The most obvious approach is to create a trie for the set of all substrings of 𝑆. However, there are 𝒪(𝑛²) substrings of 𝑆, which leads to excessively large memory requirements. Luckily, we can do a lot better by exploiting the following:

**Observation 2.2** Every substring 𝑢 of 𝑆 is a prefix of a suffix of 𝑆.

Since tries allow retrieval of not just words, but also prefixes, we need merely construct a trie containing
*all suffixes of 𝑆. It is obvious there are 𝒪 (𝑛) suffixes of 𝑆. The trie consisting of all of these suffixes is*
called the basic suffix trie. Formally, we define the basic suffix trie as follows:

**Definition 2.2 (Basic suffix trie) Given a string 𝑆, its basic suffix trie 𝑇 is a trie for Suff (𝑆).**

From this it immediately follows that:

**Corollary 2.3 Given a string 𝑆 and its basic suffix trie 𝑇 :**

*1. 𝑢 ↦→ 𝑢 gives a bijection between the substrings of 𝑆 and the nodes of 𝑇 .*

*2. For each leaf 𝑥 of 𝑇 , string(𝑥) is a suffix of 𝑆.*

*3. If 𝑆 has no nested suffixes, then for each suffix 𝑢, 𝑢 is a leaf.*

*The absence of nested suffixes can be assured simply by appending $ to 𝑆. Again, this should almost*
always be done. Take, for example the trie in figure 3. Were $ not appended here, it would be a lot
harder to recognize that "ana" is a suffix.
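In the dictionary-based sketch used for tries above, the basic suffix trie is obtained by simply inserting every suffix; the names are illustrative, and the structure is quadratic in size, as discussed in the text:

```python
# Basic suffix trie sketch: a trie (dict-of-dicts) over Suff(S). A trie search
# for p then decides whether p is a substring of S.

def basic_suffix_trie(S):
    root = {}
    for i in range(len(S)):          # insert suff(i) for every position i
        node = root
        for c in S[i:]:
            node = node.setdefault(c, {})
    return root

def find(p, root):
    node = root
    for c in p:
        node = node.get(c)
        if node is None:
            return None
    return node

T = basic_suffix_trie("banana$")
assert find("nan", T) is not None    # "nan" is a substring of "banana$"
assert find("nab", T) is None
```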

Figure 3: the basic suffix trie for "banana$"

Now, we can determine whether a string 𝑝 is a substring of 𝑆 by performing a standard trie search on the basic suffix trie of 𝑆. However, the basic suffix trie is not yet the optimal solution. Restricting ourselves to suffixes meant we only had to store 𝒪 (𝑛) words in our trie. However, these suffixes have average length 𝒪 (𝑛). As such, the basic suffix trie takes 𝒪(𝑛²) space. This is still too big. Next, we shall see a structure that improves this to 𝒪 (𝑛) space: the suffix tree.

**2.3** **Suffix Trees**

The basic suffix trie forms the basis for the suffix tree. We will reduce its 𝒪(𝑛²) memory footprint to 𝒪 (𝑛) with two optimizations. The first optimization will reduce the number of nodes and edges to 𝒪 (𝑛), though the memory used per edge goes up, keeping the memory footprint at 𝒪(𝑛²). The second optimization will push the size of an edge down to 𝒪 (1), giving us the desired memory footprint of 𝒪 (𝑛).

The first optimization works for tries in general. It relies on noticing that a trie may have sequences of nodes with just a single child. These sequences don't branch, so these nodes store no information about the structure of the tree. As such, we consolidate each such sequence into a single edge, labelled with the string obtained by concatenating the labels of the consolidated sequence. We call the resultant structure the compressed trie. The suffix tree is then defined as the compressed trie for all suffixes.

Formally, we define it as follows:

**Definition 2.3 (Compressed trie & Suffix tree)** Given a trie 𝑇, the corresponding compressed trie 𝐶 is an edge-labeled tree, with strings as edge labels. Its nodes and edges are:

𝑁_𝐶 = {𝑇.𝑟𝑜𝑜𝑡} ∪ branching nodes of 𝑇 ∪ leaves of 𝑇
𝐸_𝐶 = {(𝑢, 𝑣, 𝑟) | 𝑣 = 𝑢𝑟 ∧ ¬∃𝑝 ∈ Pref (𝑟) ∖ {𝜖, 𝑟} : 𝑢𝑝 ∈ 𝑁_𝐶 }

Given a string 𝑆, its suffix tree 𝑆𝑇 is then the compressed trie of the basic suffix trie of 𝑆.

Furthermore, for a node 𝑢, its implicit depth is |𝑢|. Its explicit depth is the usual 'distance to the root'.

From this it immediately follows that:

**Corollary 2.4** Given a compressed trie 𝑇 for 𝑊, we know that:

*1. No two edges originating from the same node in 𝑇 have labels starting with the same character.*

*2. Every internal node of 𝑇 has at least two children (except when 𝑇 only has two nodes).*

*3. 𝑇 has 𝒪 (𝑛) nodes and edges.*

Furthermore, if 𝑇 is a suffix tree for 𝑆, we have the following: for each internal node 𝑢 of 𝑇, 𝑢 is a right-branching substring, and for each leaf 𝑣, 𝑣 is a suffix. This is a result of the first two points of the corollary.

Before we get to the second optimization, we have to address a rather pressing issue. We no longer have a bijection between substrings and nodes as we did in corollary 2.3. Luckily, the information is still there. Some substrings are simply on the edge between two nodes. We call these ‘positions’ on the edge implicit nodes. Given any such implicit node, it is easy to figure out what its child is, and by what label the outgoing edge is labelled.

We still have to address the memory issue. The core of the problem is that, whilst we have reduced the number of edges and nodes, we changed the edge labels from characters to strings in the process. This means that every character that was stored in the basic suffix trie is also stored in the suffix tree. Since the basic suffix trie stored 𝒪(𝑛²) characters, so does our suffix tree.

The key insight is that every string stored as an edge label is a substring of 𝑆. As such, we can store it by two indices: its start and end position in 𝑆. This reduces the space taken per edge to 𝒪 (1), thus

Figure 4: the suffix tree for "banana$"

reducing the total space of the tree to 𝒪 (𝑛). Note that with three indices, we can do the same for a compressed trie. Every label is a substring of some word in 𝑊. We use the first index to store which word, and the remaining two for the start and end of the substring.

In order to use the suffix tree, we require a way to represent the implicit nodes. We do this via the concept of a ‘reference pair’:

**Definition 2.4 (Reference pair)** Let 𝑆𝑇 be a suffix tree, 𝑢 be a node of 𝑆𝑇 and 𝑢𝑠 be an implicit node of 𝑆𝑇. We then define ⟨𝑢, 𝑠⟩ to be a reference pair referring to 𝑢𝑠. We call 𝑢 the anchor of the pair and 𝑠 the label.

For 𝑢𝑠, its reference pair with the deepest possible anchor is its canonical reference pair.

Furthermore, we extend our notation. For a substring 𝑣 of 𝑆, we define 𝑣 = ⟨𝑢, 𝑠⟩ where ⟨𝑢, 𝑠⟩ is canonical. Which definition of 𝑢𝑠 we use will be clear from context. Finally, we define: string(⟨𝑢, 𝑠⟩) = 𝑢𝑠.

We introduced the canonical reference pair because it is unique; the same string can have many reference pairs with different anchors. For example, take the substring "anan" of "banana$" as seen in figure 4. For this string, ⟨𝑎, 𝑛𝑎𝑛⟩ is a reference pair, but so are ⟨𝜖, 𝑎𝑛𝑎𝑛⟩ and ⟨𝑎𝑛𝑎, 𝑛⟩. However, of those three only ⟨𝑎𝑛𝑎, 𝑛⟩ is canonical. A beneficial property of the canonical reference pair is the following: if we were to walk from the anchor to the implicitly referenced node, we would only pass implicit nodes. This is clearly not the case for non-canonical reference pairs.

*When storing a reference pair’s label we can again exploit the fact that it is a substring of 𝑆, allowing*
us to store it by two simple indices.

Interestingly, canonizing a reference pair, while checking whether the reference pair is correct in the process, is essentially the substring-finding algorithm. After all, to find whether 𝑝 is a substring, all we need to do is canonize ⟨𝑟𝑜𝑜𝑡, 𝑝⟩. (Obviously, in this case we cannot store 𝑝 using indices, since we do not know whether it is a substring of 𝑆.) Canonizing a reference pair with such a check is quite simple; algorithm 2.2 does this.¹ Here, much like with the find algorithm for the basic suffix trie, the invariant is that string(𝑛𝑜𝑑𝑒) = string(⟨𝑓𝑖𝑟𝑠𝑡𝑁𝑜𝑑𝑒, 𝑓𝑖𝑟𝑠𝑡𝑆𝑡𝑟⟩)[0 : 𝑖𝑑𝑥]. The running time remains 𝒪 (|𝑝|) because we require 𝒪 (|𝑝|) character comparisons; every other operation takes 𝒪 (1) time and the while loop is executed 𝒪 (|𝑝|) times.

¹ The algorithm also works for general compressed tries.

**Algorithm 2.2 canonize(⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟⟩, 𝑇 )**

**Require:** a reference pair ⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟⟩ to canonize and a compressed trie 𝑇 to work in.

**Ensure:** returns the canonical reference pair if ⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟⟩ refers to an existing (possibly implicit) node, INCORRECT_REFERENCE_PAIR otherwise.

1: 𝑖𝑑𝑥 ← 0

2: 𝑓𝑖𝑟𝑠𝑡𝑁𝑜𝑑𝑒 ← 𝑛𝑜𝑑𝑒 {𝑓𝑖𝑟𝑠𝑡𝑁𝑜𝑑𝑒 and 𝑓𝑖𝑟𝑠𝑡𝑆𝑡𝑟 are only used in the proof}

3: 𝑓𝑖𝑟𝑠𝑡𝑆𝑡𝑟 ← 𝑠𝑡𝑟

4: **while** 𝑖𝑑𝑥 < 𝑠𝑡𝑟.𝑙𝑒𝑛𝑔𝑡ℎ **do**

5: 𝑒𝑑𝑔𝑒 ← 𝑛𝑜𝑑𝑒.𝑔𝑒𝑡𝐸𝑑𝑔𝑒(𝑠𝑡𝑟[𝑖𝑑𝑥])

6: **if** 𝑒𝑑𝑔𝑒 = 𝑛𝑢𝑙𝑙 **then**

7: **return** INCORRECT_REFERENCE_PAIR

8: 𝑙𝑎𝑏𝑒𝑙 ← 𝑒𝑑𝑔𝑒.𝑙𝑎𝑏𝑒𝑙

9: 𝑙𝑒𝑛𝑔𝑡ℎ ← min(𝑙𝑎𝑏𝑒𝑙.𝑙𝑒𝑛𝑔𝑡ℎ, 𝑠𝑡𝑟.𝑙𝑒𝑛𝑔𝑡ℎ − 𝑖𝑑𝑥)

10: **if** 𝑙𝑎𝑏𝑒𝑙[0 : 𝑙𝑒𝑛𝑔𝑡ℎ] ̸= 𝑠𝑡𝑟[𝑖𝑑𝑥 : 𝑖𝑑𝑥 + 𝑙𝑒𝑛𝑔𝑡ℎ] **then** {if we are assured the reference pair is correct, we can omit this check, saving a lot of time}

11: **return** INCORRECT_REFERENCE_PAIR

12: **if** 𝑙𝑒𝑛𝑔𝑡ℎ < 𝑙𝑎𝑏𝑒𝑙.𝑙𝑒𝑛𝑔𝑡ℎ **then** {𝑠𝑡𝑟 ends on this edge, so 𝑛𝑜𝑑𝑒 is the deepest explicit anchor}

13: **break**

14: 𝑛𝑜𝑑𝑒 ← 𝑒𝑑𝑔𝑒.𝑐ℎ𝑖𝑙𝑑

15: 𝑖𝑑𝑥 ← 𝑖𝑑𝑥 + 𝑙𝑒𝑛𝑔𝑡ℎ

16: **return** ⟨𝑛𝑜𝑑𝑒, 𝑠𝑡𝑟[𝑖𝑑𝑥 : 𝑠𝑡𝑟.𝑙𝑒𝑛𝑔𝑡ℎ]⟩
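Algorithm 2.2 can be sketched in Python over a small hand-built compressed trie. The `Node` class and the tuple-based edge representation are illustrative assumptions, not structures defined in the text:

```python
# Sketch of canonize: walk edges from the anchor, consuming full edge labels,
# and stop at the deepest explicit node; the unconsumed part of s becomes the
# canonical label. Returns None for an incorrect reference pair.

class Node:
    def __init__(self):
        self.edges = {}                 # first character -> (label, child)

    def add(self, label, child):
        self.edges[label[0]] = (label, child)

def canonize(node, s):
    idx = 0
    while idx < len(s):
        entry = node.edges.get(s[idx])
        if entry is None:
            return None                 # INCORRECT_REFERENCE_PAIR
        label, child = entry
        length = min(len(label), len(s) - idx)
        if label[:length] != s[idx:idx + length]:
            return None                 # mismatch on the edge
        if length < len(label):
            break                       # s ends inside this edge
        node = child                    # full edge consumed: advance the anchor
        idx += length
    return node, s[idx:]

# Tiny compressed trie for {"ana", "ann"}:
# root --"an"--> x, x --"a"--> l1, x --"n"--> l2
root, x, l1, l2 = Node(), Node(), Node(), Node()
root.add("an", x); x.add("a", l1); x.add("n", l2)

assert canonize(root, "a") == (root, "a")   # implicit node: anchor stays at root
assert canonize(root, "ana") == (l1, "")    # explicit node: empty label
assert canonize(root, "ax") is None
```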

It should be noted that suffix trees find applications in string processing far beyond the general substring problem. [Gus97] devotes the entirety of chapters 7 and 9 to the applications.

**2.4** **Suffix Arrays**

Up until now, the key insight has been observation 2.2. It allowed us to transform finding substrings to
*finding prefixes of a set 𝑊 . So far, we have used tries for this. However, there exists a much more basic*
structure that allows us to find prefixes, the word array. Essentially, it works as a dictionary, storing the
*words in lexicographical order. Taking 𝑊 = Suff (𝑆) then gives us the suffix array. Formally, we define*
these arrays as follows:

**Definition 2.5 (Word array & Suffix array)** Given an enumerated set of words 𝑊 = {𝑤_0, 𝑤_1, . . . , 𝑤_{𝑛−1}}, its word array WA has size 𝑛. Its entries are defined as follows:

0 ≤ 𝑖 < 𝑗 < 𝑛 =⇒ 𝑤_WA[𝑖] ≺ 𝑤_WA[𝑗]

The suffix array SA for string 𝑆 is then defined as the word array taking 𝑤_𝑖 = suff (𝑖). We also define the following function to quickly access the words of the word array:

word (𝑖) = 𝑤_WA[𝑖]
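As a baseline, the suffix array can be built directly from this definition by sorting the suffix starting positions; this naive construction is far from the linear-time methods of section 4, but makes the definition concrete:

```python
# Direct construction: sort the suffix starting positions by the suffixes
# they denote. word(i) is then S[SA[i]:].

def suffix_array(S):
    return sorted(range(len(S)), key=lambda i: S[i:])

S = "banana$"
SA = suffix_array(S)
# sorted suffixes: $, a$, ana$, anana$, banana$, na$, nana$
assert SA == [6, 5, 3, 1, 0, 4, 2]
assert [S[i:] for i in SA] == sorted(S[i:] for i in range(len(S)))
```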

*Finding words in 𝑊 , and indeed prefixes of such words, now becomes a simple matter of binary search.*

*However, this takes 𝒪 (log 𝑛) word-comparisons. When searching for a word 𝑝 each of these comparisons*
*naïvely takes 𝒪 (|𝑝|) time. Thus, naïve binary search takes 𝒪 (|𝑝| log 𝑛) time, much worse than the trie’s*
*𝒪 (|𝑝|). It seems like the simplicity of the word array has come at a cost.*

That said, the bound of 𝒪 (|𝑝| log 𝑛) is rather pessimistic. Unless very long prefixes of 𝑝 occur in our array, few word-comparisons will actually take 𝒪 (|𝑝|) time. Furthermore, we will see two successive speed-ups to binary search on a word array. These will finally yield a runtime bound of 𝒪 (|𝑝| + log 𝑛). Still worse than 𝒪 (|𝑝|), but not by much. Finally, we will see a surprising alternative use of an (enhanced) suffix array that actually manages 𝒪 (|𝑝|) time.

These speed-ups to binary search both depend on the following property of lexicographical sorting:

**Observation 2.5** If 𝑝𝑢 ≺ 𝑠 ≺ 𝑝𝑣, then 𝑝 ∈ Pref (𝑠).

This allows us to significantly reduce the number of character comparisons needed for each successive word-comparison. The first speed up (due to [Gus97]) is a basic application of this observation.

At any point in a binary search for 𝑝, we are considering three positions: the left boundary 𝑏_𝐿, the midpoint 𝑏_𝑀 and the right boundary 𝑏_𝑅, satisfying word (𝑏_𝐿) ≺ word (𝑏_𝑀) ≺ word (𝑏_𝑅). Now, we take 𝑛_𝐿 to be the index in 𝑝 up to which we have matched word (𝑏_𝐿) to 𝑝, and take 𝑛_𝑅 analogously. Taking 𝑚𝑙𝑟 = min(𝑛_𝐿, 𝑛_𝑅), observation 2.5 tells us that word (𝑏_𝑀) must match 𝑝 up to 𝑚𝑙𝑟. This allows us to skip the first 𝑚𝑙𝑟 characters when comparing word (𝑏_𝑀) to 𝑝. This is captured in algorithm 2.3.

Although this algorithm saves a lot of redundant comparisons, we retain the 𝒪 (|𝑝| log 𝑛) worst-case bound. That said, it only occurs in degenerate cases, for example when searching 𝑆 = 𝑎𝑏 . . . 𝑏 for 𝑎𝑏𝑏𝑏𝑏𝑏𝑏𝑐. In this case word (𝑏_𝑅) will always be of the form 𝑏 . . . 𝑏, and thus 𝑛_𝑅 will remain 0. The second speed-up will improve this bound to 𝒪 (|𝑝| + log 𝑛). However, it will have to wait until section 3.1; it is much more complicated and depends on the not yet introduced concept of the longest common prefix function.

**Algorithm 2.3 binarySearch(𝑝, WA)**

**Require:** string 𝑝, the pattern to search, and word array WA in which to search.

**Ensure:** returns 𝑡𝑟𝑢𝑒 if 𝑝 is a prefix of a word in WA and 𝑓𝑎𝑙𝑠𝑒 otherwise.

{𝑛_𝐿, 𝑛_𝑀 and 𝑛_𝑅 count the characters of 𝑝 matched so far, so 𝑝[𝑛] is the next character to compare. The matching loops implicitly stop when either string runs out of characters.}

1: 𝑏_𝐿 ← 0; 𝑛_𝐿 ← 0

2: 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑 ← word (𝑏_𝐿)

3: **while** 𝑝[𝑛_𝐿] = 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛_𝐿] **do** 𝑛_𝐿 ← 𝑛_𝐿 + 1

4: **if** 𝑛_𝐿 = 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ **then return** 𝑡𝑟𝑢𝑒

5: **if** 𝑝[𝑛_𝐿] ≺ 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛_𝐿] **then return** 𝑓𝑎𝑙𝑠𝑒 {𝑝 ≺ word (0)}

6: 𝑏_𝑅 ← WA.𝑠𝑖𝑧𝑒 − 1; 𝑛_𝑅 ← 0

7: 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑 ← word (𝑏_𝑅)

8: **while** 𝑝[𝑛_𝑅] = 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛_𝑅] **do** 𝑛_𝑅 ← 𝑛_𝑅 + 1

9: **if** 𝑛_𝑅 = 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ **then return** 𝑡𝑟𝑢𝑒

10: **if** 𝑝[𝑛_𝑅] ≻ 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛_𝑅] **then return** 𝑓𝑎𝑙𝑠𝑒 {word (WA.𝑠𝑖𝑧𝑒 − 1) ≺ 𝑝}

11: **while** 𝑏_𝑅 − 𝑏_𝐿 > 1 **do**

12: 𝑏_𝑀 ← ⌊(𝑏_𝐿 + 𝑏_𝑅)/2⌋

13: 𝑛_𝑀 ← min(𝑛_𝐿, 𝑛_𝑅) {word (𝑏_𝑀) is known to agree with 𝑝 up to 𝑛_𝑀}

14: 𝑚𝑖𝑑𝑊𝑜𝑟𝑑 ← word (𝑏_𝑀)

15: **while** 𝑝[𝑛_𝑀] = 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛_𝑀] **do** 𝑛_𝑀 ← 𝑛_𝑀 + 1

16: **if** 𝑛_𝑀 = 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ **then return** 𝑡𝑟𝑢𝑒

17: **if** 𝑝[𝑛_𝑀] ≻ 𝑚𝑖𝑑𝑊𝑜𝑟𝑑[𝑛_𝑀] **then** 𝑏_𝐿 ← 𝑏_𝑀; 𝑛_𝐿 ← 𝑛_𝑀 {word (𝑏_𝑀) ≺ 𝑝}

18: **else** 𝑏_𝑅 ← 𝑏_𝑀; 𝑛_𝑅 ← 𝑛_𝑀 {𝑝 ≺ word (𝑏_𝑀)}

19: **return** 𝑓𝑎𝑙𝑠𝑒
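A Python sketch of this search, structured slightly differently from the pseudocode: the character-matching loop is factored into a helper, and all names are our own. The essential point, resuming each comparison at 𝑚𝑙𝑟 = min(𝑛_𝐿, 𝑛_𝑅), is preserved:

```python
# Binary search for a prefix p over a lexicographically sorted word list,
# skipping the first min(nL, nR) characters of each midpoint comparison.

def match_from(word, p, start):
    """Extend a match of p against word from `start`; return the new match
    length and -1/0/+1 for p < word, p is a prefix of word, word < p."""
    i = start
    while i < len(p) and i < len(word) and p[i] == word[i]:
        i += 1
    if i == len(p):
        return i, 0                        # p fully matched: p is a prefix
    if i == len(word) or p[i] > word[i]:
        return i, 1                        # word precedes p
    return i, -1                           # p precedes word

def binary_search(p, words):
    """True iff p is a prefix of some word in the sorted list `words`."""
    bL, bR = 0, len(words) - 1
    nL, c = match_from(words[bL], p, 0)
    if c == 0: return True
    if c < 0: return False                 # p precedes word(0)
    nR, c = match_from(words[bR], p, 0)
    if c == 0: return True
    if c > 0: return False                 # the last word precedes p
    while bR - bL > 1:
        bM = (bL + bR) // 2
        nM, c = match_from(words[bM], p, min(nL, nR))  # skip mlr characters
        if c == 0: return True
        if c > 0: bL, nL = bM, nM          # word(bM) < p: raise left boundary
        else:     bR, nR = bM, nM          # p < word(bM): lower right boundary
    return False

S = "banana$"
words = sorted(S[i:] for i in range(len(S)))
assert binary_search("nan", words)
assert not binary_search("nab", words)
```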

**2.5** **Concluding remarks**

At first sight, the suffix tree seems indisputably better than the suffix array: the suffix array's 𝒪 (|𝑝| log 𝑛) search time is significantly worse than 𝒪 (|𝑝|). However, the 𝒪 (|𝑝| log 𝑛) bound only occurs in pathological cases, and indeed in [MM90] Manber and Myers report seeing 𝒪 (|𝑝| + log 𝑛) performance in the general case. This stands to reason as, in general, one expects both boundaries of a binary search to improve. This performance is a lot closer to that of the suffix tree.

Furthermore, an easy to overlook advantage of the suffix array is its absolute memory footprint. Whilst both structures are 𝒪 (𝑛), the suffix tree has a significant constant factor when compared to the suffix array. In the case where there are no nested suffixes, we must have at least 𝑛 edges; after all, each suffix has a leaf and there are 𝑛 suffixes. Now, for each edge, we need to store 3 pointers: one to the child node, and two for the edge label. This already brings us to 3𝑛 words, ignoring the need to store edges in a node. On the other hand, the suffix array takes exactly 𝑛 words to store. Due to I/O limitations, these differences in memory footprint can have significant performance repercussions.

In the next section, we will see how to bring the suffix array's performance completely up to par with the suffix tree, whilst keeping the memory footprint below 3𝑛. We also still need to know whether these structures can actually be constructed in 𝒪 (𝑛) time. We will see this in section 4.

**3** **The core enhancement: longest common prefix**

The entirety of this section is about using the concept of the longest common prefix, specifically the length of that prefix. We capture this by the following function:

**Definition 3.1 (lcp function)** Given strings 𝑢 and 𝑣, we define the lcp function as follows:

lcp(𝑢, 𝑣) = max{ |𝑝| | 𝑝 ∈ Pref (𝑢) ∩ Pref (𝑣) }

**Corollary 3.1**

*lcp(𝑢, 𝑤) ≥ min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤))* (1)

*Furthermore, if 𝑢 ≺ 𝑣 ≺ 𝑤 we have:*

*lcp(𝑢, 𝑤) = min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤))* (2)

**Proof** Take 𝑚 = min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤)) and 𝑝 = 𝑢[0 : 𝑚]. Certainly we have 𝑝 ∈ Pref (𝑢), and since 𝑚 ≤ lcp(𝑢, 𝑣) also 𝑝 ∈ Pref (𝑣); then 𝑚 ≤ lcp(𝑣, 𝑤) gives 𝑝 = 𝑣[0 : 𝑚] ∈ Pref (𝑤). It then follows that 𝑚 = |𝑝| ≤ lcp(𝑢, 𝑤). This gives us the first claim.

Now for the second claim, suppose 𝑢 ≺ 𝑣 ≺ 𝑤. We take 𝑟 = 𝑢[0 : lcp(𝑢, 𝑤)]. Trivially we have 𝑟 ∈ Pref (𝑢) and 𝑟 ∈ Pref (𝑤). By observation 2.5, this gives 𝑟 ∈ Pref (𝑣). From this we can conclude 𝑟 to be a common prefix of 𝑢, 𝑣 and 𝑤, which gives lcp(𝑢, 𝑤) = |𝑟| ≤ 𝑚. From the first claim, we have lcp(𝑢, 𝑤) ≥ 𝑚. Thus lcp(𝑢, 𝑤) = 𝑚 = min(lcp(𝑢, 𝑣), lcp(𝑣, 𝑤)).

The utility of the lcp function lies in the fact that, after preprocessing, lcp queries can be answered in 𝒪 (1) time using the above corollary. We will see the exact mechanics of this later. For now, we focus on how to best exploit this easily computable function.

**3.1** **Using lcp to improve binary search**


The first application of lcp (due to [Gus97]) is to speed up the binary search applied to word arrays, as promised. Recall how our first speed-up managed to reduce the number of character comparisons but still left us with the 𝒪 (|𝑝| log 𝑛) worst-case bound. Here we will reduce that bound to 𝒪 (|𝑝| + log 𝑛). We do this by bounding the number of 'redundant' character comparisons to 1 per iteration. We call a comparison of a character of 𝑝 redundant when we have already compared it. This gives |𝑝| necessary comparisons and 𝒪 (log 𝑛) redundant ones. The bound 𝒪 (|𝑝| + log 𝑛) follows immediately.

*We reiterate the definitions used in binary search previously, making use of the lcp function where*
possible:

𝑏_𝐿 = the left boundary of the current search interval
𝑛_𝐿 = lcp(word (𝑏_𝐿), 𝑝)

𝑏_𝑀 = the midpoint of the current search interval
𝑛_𝑀 = lcp(word (𝑏_𝑀), 𝑝)

𝑏_𝑅 = the right boundary of the current search interval
𝑛_𝑅 = lcp(word (𝑏_𝑅), 𝑝)

The previous method is slow because it potentially performs many redundant comparisons. Specifically, when 𝑛_𝐿 ̸= 𝑛_𝑅 we have already performed max(𝑛_𝐿, 𝑛_𝑅) comparisons and yet we start at min(𝑛_𝐿, 𝑛_𝑅), yielding max(𝑛_𝐿, 𝑛_𝑅) − min(𝑛_𝐿, 𝑛_𝑅) redundant comparisons.

Our speed-up is achieved by improving this case where 𝑛_𝐿 ̸= 𝑛_𝑅. We proceed with the case 𝑛_𝐿 > 𝑛_𝑅; for the other case, all arguments below hold upon exchanging 𝐿 and 𝑅 and reversing the ordering ≺.

The key concept to our speed up is the following observation, based on the contraposition of (2) of corollary 3.1:

**Observation 3.2** Given strings 𝑢, 𝑣, 𝑤 such that 𝑢 ≺ 𝑣 and 𝑢 ≺ 𝑤: if lcp(𝑢, 𝑣) < lcp(𝑢, 𝑤), then 𝑢 ≺ 𝑤 ≺ 𝑣.

This, combined with word (𝑏_𝐿) ≺ word (𝑏_𝑀) and word (𝑏_𝐿) ≺ 𝑝, allows us to deduce the ordering of word (𝑏_𝐿), word (𝑏_𝑀) and 𝑝 based on 𝑙𝑚 = lcp(word (𝑏_𝐿), word (𝑏_𝑀)) and 𝑛_𝐿. We do this by distinguishing the following three cases:

**𝑙𝑚 > 𝑛_𝐿:** Here, it follows that word (𝑏_𝐿) ≺ word (𝑏_𝑀) ≺ 𝑝. This means we set 𝑏_𝐿 ← 𝑏_𝑀. We need not change 𝑛_𝐿, because 𝑛_𝐿 = min(𝑙𝑚, 𝑛_𝑀) and 𝑙𝑚 > 𝑛_𝐿, so 𝑛_𝐿 = 𝑛_𝑀.

**𝑙𝑚 < 𝑛_𝐿:** Here, it follows that word (𝑏_𝐿) ≺ 𝑝 ≺ word (𝑏_𝑀). This means we set 𝑏_𝑅 ← 𝑏_𝑀. We also set 𝑛_𝑅 ← 𝑙𝑚, because 𝑙𝑚 = min(𝑛_𝐿, 𝑛_𝑀) < 𝑛_𝐿, so 𝑙𝑚 = 𝑛_𝑀.

**𝑙𝑚 = 𝑛_𝐿:** In this case, observation 3.2 gives no information. However, we know 𝑛_𝑀 ≥ 𝑛_𝐿 because 𝑛_𝑀 ≥ min(𝑙𝑚, 𝑛_𝐿). This means we can start comparing at 𝑛_𝐿 + 1 = max(𝑛_𝐿, 𝑛_𝑅) + 1.

This is implemented in algorithm 3.1. In this algorithm, if we do any comparison at all, we always start at max(𝑛_𝐿, 𝑛_𝑅) + 1 in 𝑝. Moreover, at that point we will not have compared any character beyond the first max(𝑛_𝐿, 𝑛_𝑅) + 1 characters of 𝑝 (the + 1 because we only know two strings agree up to 𝑖 when we see a difference at 𝑖 + 1). Therefore, we perform at most a single redundant comparison per iteration. This finally gives us the 𝒪 (|𝑝| + log 𝑛) bound.

**3.2** **Computing lcp via the lcp array**


Having seen the power of the longest common prefix, we still need to know how to compute it in 𝒪 (1) time. The basis is the lcp array. This array enhances a word array WA. (Recall that word (𝑖) = 𝑤_WA[𝑖].) It is defined as:

**Definition 3.2 (lcp array) LCP[𝑖] = lcp(word (𝑖 − 1), word (𝑖))**

Obviously, this only stores the answer to a small part of all possible lcp queries. However, due to the
*lexicographical ordering of the word array, we can compute lcp based on the lcp array:*

**Lemma 3.3** Given a word array WA and corresponding lcp array LCP, for 𝑖 < 𝑗 we have:

lcp(word (𝑖), word (𝑗)) = min(LCP[𝑖 + 1 : 𝑗 + 1])

**Proof** This is a simple consequence of recursively applying corollary 3.1.

However nice this result, naïve computation based on this formula takes 𝒪 (𝑛) time, far more than the promised 𝒪 (1). However, yet another function, the range minimum query or RMQ, allows us to reduce this to 𝒪 (1) time. This does require 𝒪 (𝑛) time and memory for preprocessing, but this is still acceptable.

Once again we first examine the definition and applications, deferring the internal workings to section 3.5. RMQ is defined as follows:

**Definition 3.3 (Range minimum query)** Given an array of integers 𝐴, and indices 𝑖 and 𝑗 into 𝐴, the function RMQ_𝐴(𝑖, 𝑗) returns the index of the leftmost minimal element in the subarray 𝐴[𝑖 : 𝑗].

This function will be essential throughout this chapter. Here, it allows us to write:

lcp(word (𝑖), word (𝑗)) = min(LCP[𝑖 + 1 : 𝑗 + 1]) = LCP[RMQ_LCP(𝑖 + 1, 𝑗 + 1)]
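The lcp array and its use for arbitrary lcp queries can be illustrated directly in Python, with `min()` standing in for the 𝒪 (1) RMQ; all function names here are illustrative:

```python
# Build the lcp array for the suffix array of S, then answer an arbitrary
# lcp query as the minimum over the consecutive entries LCP[i+1..j].
# A real implementation replaces min() with a preprocessed O(1) RMQ.

def lcp(u, v):
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k

S = "banana$"
SA = sorted(range(len(S)), key=lambda i: S[i:])
word = lambda i: S[SA[i]:]

# LCP[i] = lcp(word(i-1), word(i)); LCP[0] is unused and set to 0
LCP = [0] + [lcp(word(i - 1), word(i)) for i in range(1, len(SA))]
assert LCP == [0, 0, 1, 3, 0, 0, 2]

def lcp_query(i, j):               # lcp(word(i), word(j)) for i < j
    return min(LCP[i + 1 : j + 1])

assert lcp_query(2, 3) == 3        # lcp("ana$", "anana$")
assert lcp_query(1, 3) == 1        # lcp("a$", "anana$")
```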

**Algorithm 3.1 improvedBinarySearch(𝑝, WA)**

**Require:** string 𝑝, the pattern to search, word array WA in which to search, and algorithm 3.2.

**Ensure:** returns 𝑡𝑟𝑢𝑒 if 𝑝 is a prefix of a word in WA and 𝑓𝑎𝑙𝑠𝑒 otherwise.

{As in algorithm 2.3, 𝑛_𝐿 and 𝑛_𝑅 count the characters of 𝑝 matched against word (𝑏_𝐿) and word (𝑏_𝑅).}

1: 𝑏_𝐿 ← 0; 𝑛_𝐿 ← 0

2: 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑 ← word (𝑏_𝐿)

3: **while** 𝑝[𝑛_𝐿] = 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛_𝐿] **do** 𝑛_𝐿 ← 𝑛_𝐿 + 1

4: **if** 𝑛_𝐿 = 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ **then return** 𝑡𝑟𝑢𝑒

5: **if** 𝑝[𝑛_𝐿] ≺ 𝑙𝑒𝑓𝑡𝑊𝑜𝑟𝑑[𝑛_𝐿] **then return** 𝑓𝑎𝑙𝑠𝑒 {𝑝 ≺ word (0)}

6: 𝑏_𝑅 ← WA.𝑠𝑖𝑧𝑒 − 1; 𝑛_𝑅 ← 0

7: 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑 ← word (𝑏_𝑅)

8: **while** 𝑝[𝑛_𝑅] = 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛_𝑅] **do** 𝑛_𝑅 ← 𝑛_𝑅 + 1

9: **if** 𝑛_𝑅 = 𝑝.𝑙𝑒𝑛𝑔𝑡ℎ **then return** 𝑡𝑟𝑢𝑒

10: **if** 𝑝[𝑛_𝑅] ≻ 𝑟𝑖𝑔ℎ𝑡𝑊𝑜𝑟𝑑[𝑛_𝑅] **then return** 𝑓𝑎𝑙𝑠𝑒 {word (WA.𝑠𝑖𝑧𝑒 − 1) ≺ 𝑝}

11: **while** 𝑏_𝑅 − 𝑏_𝐿 > 1 **do**

12: 𝑏_𝑀 ← ⌊(𝑏_𝐿 + 𝑏_𝑅)/2⌋

13: **if** 𝑛_𝐿 = 𝑛_𝑅 **then** 𝑏𝑎𝑠𝑖𝑐𝑆𝑡𝑒𝑝(𝑏_𝐿, 𝑏_𝑅, 𝑛_𝐿, 𝑛_𝑅, 𝑝, WA); **continue**

14: **if** 𝑛_𝐿 > 𝑛_𝑅 **then** 𝑠𝑖𝑑𝑒 ← 𝐿; 𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝑅

15: **else** 𝑠𝑖𝑑𝑒 ← 𝑅; 𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝐿

16: 𝑙𝑐 ← lcp(word (𝑏_𝑠𝑖𝑑𝑒), word (𝑏_𝑀)) {set new 𝑏's and 𝑛's based on 𝑠𝑖𝑑𝑒 and 𝑙𝑐}

17: **if** 𝑙𝑐 = 𝑛_𝑠𝑖𝑑𝑒 **then** 𝑏𝑎𝑠𝑖𝑐𝑆𝑡𝑒𝑝(𝑏_𝐿, 𝑏_𝑅, 𝑛_𝐿, 𝑛_𝑅, 𝑝, WA); **continue**

18: **if** 𝑙𝑐 > 𝑛_𝑠𝑖𝑑𝑒 **then** 𝑏_𝑠𝑖𝑑𝑒 ← 𝑏_𝑀

19: **if** 𝑙𝑐 < 𝑛_𝑠𝑖𝑑𝑒 **then** 𝑏_𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝑏_𝑀; 𝑛_𝑜𝑡ℎ𝑒𝑟𝑆𝑖𝑑𝑒 ← 𝑙𝑐

20: **return** 𝑓𝑎𝑙𝑠𝑒

**Algorithm 3.2 basicStep(𝑏**_{𝐿}*, 𝑏*_{𝑅}*, 𝑛*_{𝐿}*, 𝑛*_{𝑅}*, 𝑝, WA)*

**Require: 𝑏***𝐿**, 𝑏**𝑅**, 𝑛**𝐿**, 𝑛**𝑅**, the left and right boundaries and how far they match 𝑝, 𝑝 itself and word array*
*WA to compute word .*

**Ensure: perform the standard binary search step**
*𝑏**𝑀* *← (𝑏**𝐿**+ 𝑏**𝑅**)/2*

*𝑛**𝑀* *← max(𝑛**𝐿**, 𝑛**𝑅*)
*𝑚𝑖𝑑𝑊 𝑜𝑟𝑑 ← word (𝑏**𝑚*)

**while 𝑝[𝑛***𝑀* *+ 1] = 𝑚𝑖𝑑𝑊 𝑜𝑟𝑑[𝑛**𝑀***+ 1] do**
*𝑛*_{𝑀}*← 𝑛**𝑀* + 1

**if 𝑛***𝑀* **= 𝑃.𝑙𝑒𝑛𝑔𝑡ℎ − 1 then**

**return 𝑡𝑟𝑢𝑒{This return must cascade to the calling algorithm}**

**if 𝑝[𝑛**_{𝑀}*+ 1] ≺ 𝑚𝑖𝑑𝑊 𝑜𝑟𝑑[𝑛*_{𝑀}**+ 1] then**
*𝑏*_{𝐿}*← 𝑏*_{𝑀}

*𝑛*_{𝐿}*← 𝑛*_{𝑀}

**if 𝑝[𝑛***𝑀* *+ 1] ≻ 𝑚𝑖𝑑𝑊 𝑜𝑟𝑑[𝑛**𝑀* **+ 1] then**
*𝑏**𝑅**← 𝑏**𝑀*

*𝑛**𝑅**← 𝑛**𝑀*

Since RMQ takes 𝒪 (1) time, so does the computation of this formula. This is how we are able to answer
arbitrary lcp queries in 𝒪 (1) time.

It is interesting to note that our speed-up of binary search does not need RMQ. The lcp values that
are used are restricted, allowing us to pre-compute them. The set of all possible search intervals has a
simple binary-tree structure; hence, within a word array there are only 𝒪 (𝑛) possible search intervals,
and it suffices to pre-compute the lcp value for each of them. This can be done in 𝒪 (𝑛) time based on the
lcp array by dynamic programming, starting with the smaller intervals. For the details,
see [Gus97, p. 145]. However, we will soon see situations where arbitrary lcp queries are needed.
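The claim that there are only 𝒪 (𝑛) possible search intervals can be illustrated with a short sketch (our own code; the midpoint rule is assumed to match algorithm 3.2):

```python
# Sketch: enumerate every (lo, hi) boundary pair a binary search with
# midpoint (lo + hi) // 2 can visit. The intervals form a binary tree,
# so there are O(n) of them, and the lcp values algorithm 3.1 needs
# can be tabulated in advance.

def search_intervals(lo, hi, out=None):
    """Collect all (lo, hi) pairs reachable by binary search on [lo, hi]."""
    if out is None:
        out = set()
    if (lo, hi) in out or hi - lo < 2:
        out.add((lo, hi))
        return out
    out.add((lo, hi))
    mid = (lo + hi) // 2
    search_intervals(lo, mid, out)      # boundaries shrink to (lo, mid)
    search_intervals(mid, hi, out)      # ... or to (mid, hi)
    return out

intervals = search_intervals(0, 1023)
assert len(intervals) <= 2 * 1024       # linear in n, not quadratic
```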

**3.3** **lcp intervals and the lcp tree**

We have already seen how the lcp array, pre-processed for RMQ, can be used to speed up binary search over a word array, and thus over a suffix array. Here, based on [FH07], we will see how the lcp array can take an even more prominent role. We will be focusing on the specific case of a suffix array. However, this entire section can be generalized to work for word arrays, as is explained in section 3.7.

For every prefix that is shared among multiple suffixes, these suffixes form a contiguous interval in the
suffix array. This fact forms the basis for the concept of the ‘lcp interval’, which is defined as follows^{1}:

**Definition 3.4 (proper lcp interval)** *Given a suffix array SA and a corresponding lcp array LCP, we say that [𝑖, 𝑗) is a proper lcp interval of value 𝑙 if and only if the following all hold:*

*1. LCP[𝑖] < 𝑙*

*2. LCP[𝑗] < 𝑙*

*3. min(LCP[𝑖 + 1 : 𝑗]) = 𝑙*

*The indices 𝑘 inside the interval with LCP[𝑘] = 𝑙 are called the 𝑙-indices.*

*Conceptually, we take undefined entries (both 0 and 𝑛) of the LCP array to be 0.*

*If [𝑖, 𝑗) is a proper lcp interval of value 𝑙, we can also write that 𝑙 -[𝑖, 𝑗) is a proper lcp interval.*

**Corollary 3.4** *The following gives a bijection between all proper lcp intervals and (right branching substrings of 𝑆 ∪ nested suffixes of 𝑆):*

*string(𝑙 -[𝑖, 𝑗)) = suff (SA[𝑘])[0 : 𝑙] for any 𝑘 ∈ [𝑖, 𝑗)* (3)

The inverse of this function is:

*𝑝 ↦ |𝑝| -[𝑖, 𝑗) where [𝑖, 𝑗) = {𝑘 | 𝑝 ∈ Pref (suff (SA[𝑘]))}* (4)

**Proof** By corollary 3.1 and point 3 of the definition, (3) is well defined. Furthermore, if
we take 𝑖 to be an 𝑙-index, it follows that suff (SA[𝑖 − 1]) and suff (SA[𝑖]) differ on the (𝑙 + 1)-th character.
This means that suff (SA[𝑖])[0 : 𝑙] is either right branching or a nested suffix (or both).

For (4) we look to 2.5. This tells us that, given any 𝑝, 𝐼 = {𝑘 | 𝑝 ∈ Pref (suff (SA[𝑘]))} forms an
interval.

By (3) we already know that if 𝑝 ∈ right branching substrings of 𝑆 ∪ nested suffixes of 𝑆, then
[𝑖, 𝑗) ⊂ 𝐼. Now, it follows from points 1 and 2 of the definition that neither 𝑖 − 1 nor 𝑗 is included in
𝐼.

1*here, due to our convention, 𝑗 is not included in the interval. This makes things easier later, but is contrary to what*
is seen in the literature.

*Note that, since we decided to take undefined values of the LCP array to be 0, [0, 𝑛) is also an lcp*
interval.

The next step is to extend our definition of the lcp interval. We are currently missing singleton intervals,
*intervals of the form [𝑖, 𝑖 + 1). These play an important role because they correspond to the suffixes.*

*Indeed, if we set 𝑙 = |suff (SA[𝑖])| then (3) extends to this case without effort. The use of suffixes is that*
they ‘terminate’ the branching done by right-branching substrings.

*This extension is somewhat troublesome in the case of nested suffixes. If 𝑢 = suff (SA[𝑖]) is nested, it*
*already corresponds to both a proper lcp interval and a singleton interval. In these cases, we define 𝑢*
to be the proper lcp interval. This is a technical detail that never even comes up when implementing
the algorithms. It does, however, illustrate the complexity induced by nested suffixes. With that detail
taken care of, we can now define the complete set of lcp intervals:

**Definition 3.5 (lcp interval)** *We say [𝑖, 𝑗) is an lcp interval of value 𝑙 when either of the following holds:*

*∙ 𝑙 -[𝑖, 𝑗) is a proper lcp interval.*

*∙ 𝑗 = 𝑖 + 1 and 𝑙 = |suff (SA[𝑖])|.*

*Again, if [𝑖, 𝑗) is an lcp interval of value 𝑙, we can also write that 𝑙 -[𝑖, 𝑗) is an lcp interval.*

**Corollary 3.5 If 𝑙 -[𝑖, 𝑗) is an lcp interval, then string(𝑙 -[𝑖, 𝑗)) ∈ right-branching substrings of 𝑆 ∪***Suff (𝑆)*

Now, one lcp interval may very well contain other lcp intervals. In fact, if 𝑥 and 𝑦 are both lcp intervals,
either one fully contains the other or they are disjoint. This allows us to define a descendant-ancestor
relationship between intervals. We say that 𝑙 -[𝑖, 𝑗) is a descendant of 𝑙′-[𝑖′, 𝑗′) if and only if [𝑖, 𝑗) ⊆ [𝑖′, 𝑗′).
It follows trivially that 𝑙 ≥ 𝑙′. Furthermore, since [0, 𝑛) is an lcp interval, every other lcp interval is a
descendant of it. This allows us to define the lcp tree:

**Definition 3.6 (lcp tree)** *Given a suffix array SA and the corresponding LCP array, the lcp tree 𝐿 is defined as follows. Its nodes are:*

𝑁_𝐿 = {𝑙 -[𝑖, 𝑗) | 𝑙 -[𝑖, 𝑗) is an lcp interval}

*Furthermore, the structure is given by:*

𝑇_𝐿(𝑙 -[𝑖, 𝑗)) = {𝑙′-[𝑖′, 𝑗′) | 𝑖 ≤ 𝑖′ < 𝑗′ ≤ 𝑗}

**Corollary 3.6 For an lcp tree 𝑇 , we conclude the following about the nodes:**

*1. The root of 𝑇 is the entire interval, which corresponds to 𝜖.*

*2. The leaves of 𝑇 are the singleton-intervals, corresponding to suffixes.*

3. All internal nodes are proper lcp intervals, corresponding to right branching substrings and nested suffixes. Furthermore, since proper lcp intervals contain at least two singleton intervals, all internal intervals are branching.

In order to traverse this tree, we need an efficient way to find all the child intervals of an interval. The
*following lemma allows us to do this based on the 𝑙-indices.*

**Lemma 3.7** *[FH07] Let 𝑙 -[𝑖, 𝑗) be an lcp interval. Furthermore, let 𝑘*1 *< 𝑘*2 *< . . . < 𝑘**𝑚* be the
*𝑙-indices of the interval. The child intervals of 𝑙 -[𝑖, 𝑗) are then: [𝑖, 𝑘*1*), [𝑘*1*, 𝑘*2*) . . . [𝑘**𝑚**, 𝑗).*

**Proof** *Define 𝑘_0 = 𝑖 and 𝑘_{𝑚+1} = 𝑗. The intervals we are considering are then of the form [𝑘_𝑎, 𝑘_{𝑎+1}) for any 𝑎 : 0 ≤ 𝑎 ≤ 𝑚. It suffices to prove that these are lcp intervals, since together they cover the interval [𝑖, 𝑗).*

*Singleton intervals are lcp intervals by definition. This leaves the normal intervals. For these we have LCP[𝑘_𝑎] ≤ 𝑙: equality for the 𝑙-indices and strict inequality at the endpoints 𝑖 and 𝑗. Thus [𝑘_𝑎, 𝑘_{𝑎+1}) satisfies conditions 1 and 2 for any 𝑙′ > 𝑙. Furthermore, 𝑘_𝑎 < ℎ < 𝑘_{𝑎+1} implies LCP[ℎ] > 𝑙, since such ℎ are not 𝑙-indices. This gives us condition 3 for 𝑙′ = min {LCP[ℎ] | 𝑘_𝑎 < ℎ < 𝑘_{𝑎+1}}.*

As such, finding the children of an interval only requires finding the 𝑙-indices. Looking at their definition,
the leftmost 𝑙-index 𝑘_1 of 𝑙 -[𝑖, 𝑗) can be found by computing 𝑘_1 = RMQ_LCP(𝑖 + 1, 𝑗), which we
know takes 𝒪 (1) time. If LCP[𝑘_1] = 𝑙, then 𝑘_1 is the leftmost 𝑙-index; if LCP[𝑘_1] ≠ 𝑙, there are no
𝑙-indices. We find the subsequent 𝑙-indices recursively by applying:

𝑘_{𝑎+1} = RMQ_LCP(𝑘_𝑎 + 1, 𝑗)  while LCP[𝑘_{𝑎+1}] = 𝑙
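As an illustration, the recursion above can be sketched as follows (our own code; a naive scan stands in for the 𝒪 (1) RMQ, and 𝑙 is assumed to be the lcp value of the interval):

```python
# Sketch of the child-interval enumeration of lemma 3.7. A naive scan
# stands in for the O(1) RMQ; l must be the lcp value of [i, j).

def rmq_scan(A, i, j):
    """Index of the leftmost minimal element of A[i:j]."""
    best = i
    for k in range(i + 1, j):
        if A[k] < A[best]:
            best = k
    return best

def child_intervals(LCP, i, j, l):
    """Split the lcp interval l-[i, j) at its l-indices."""
    if j - i <= 1:
        return []                      # singleton intervals are leaves
    bounds = [i]
    k = rmq_scan(LCP, i + 1, j)
    while LCP[k] == l:                 # k is the leftmost remaining l-index
        bounds.append(k)
        if k + 1 >= j:
            break
        k = rmq_scan(LCP, k + 1, j)
    bounds.append(j)
    return [(bounds[a], bounds[a + 1]) for a in range(len(bounds) - 1)]
```

For 𝑆 = abab we get SA = [2, 0, 3, 1] and LCP = [0, 2, 0, 1]; the root 0 -[0, 4) then splits into [0, 2) and [2, 4), and 2 -[0, 2) splits into its two singletons.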

**3.4** **Lcp tree – suffix tree equivalence**

The structure we see in the lcp tree is very familiar to one we have seen before, the suffix tree. Both
trees have the empty string as their root, right-branching substrings as internal nodes and suffixes as
leaves. We will show these two trees are isomorphic by introducing a tree ℬ defined for any string. To
*reduce technicalities, we further assume the string 𝑆 ends with a sentinel character, and therefore does*
not have any nested suffixes. Based on this assumption we will show that both the lcp tree and suffix
tree are isomorphic to this underlying tree ℬ.

*We call this tree ℬ the branching set over string 𝑆. The essential idea is that each substring can be*
found by traversing ever longer right-branching substrings. It is defined as follows:

**Definition 3.7 (Branching set)** *Given a string 𝑆, the branching set ℬ is a tree. The nodes of the tree are:*

𝑁 = {𝜖} ∪ right-branching substrings of 𝑆 ∪ Suff (𝑆)

*Furthermore, the edges are given by:*

𝑣 ∈ 𝑇 (𝑢) ⟺ 𝑢 is a prefix of 𝑣

*Since we assume 𝑆 to have no nested suffixes, this means that all leaves of ℬ are suffixes, and all internal*
nodes are right-branching substrings.

First, we will show the lcp tree and ℬ are isomorphic with the following lemma:

**Lemma 3.8 The following is an isomorphism between the lcp tree and ℬ:**

*string : lcp intervals → 𝑁*_{ℬ}

* Proof Corollary 3.5 combined with the definition of ℬ implies string is a bijection between the lcp*
intervals and the nodes of ℬ.

It remains to be shown that the structure of the trees is isomorphic. For this, we must show:

𝑙′-[𝑖′, 𝑗′) ∈ 𝑇 (𝑙 -[𝑖, 𝑗)) ⟺ string(𝑙 -[𝑖, 𝑗)) ∈ Pref (string(𝑙′-[𝑖′, 𝑗′)))

This holds because:

⇒: We can immediately conclude that 𝑙 < 𝑙′ and [𝑖, 𝑗) ⊃ [𝑖′, 𝑗′). Applying the definition of string
(equation (3)) gives the desired result.

⇐: Due to the lexicographical ordering of SA, we know that all suffixes that have string(𝑙 -[𝑖, 𝑗))
as a prefix lie in a single interval. By definition of the lcp interval, this interval is SA[𝑖 : 𝑗]. Since
string(𝑙 -[𝑖, 𝑗)) is a prefix of string(𝑙′-[𝑖′, 𝑗′)), all suffixes that have string(𝑙′-[𝑖′, 𝑗′)) as a prefix must form a
subinterval of SA[𝑖 : 𝑗]. This gives us 𝑖 ≤ 𝑖′ < 𝑗′ ≤ 𝑗, which by definition means that 𝑙′-[𝑖′, 𝑗′) ∈
the subtree of 𝑙 -[𝑖, 𝑗).

Next, we will show the suffix tree and ℬ are isomorphic with the following lemma:

**Lemma 3.9 The following is an isomorphism between the suffix tree 𝑆𝑇 and ℬ:**

*string : 𝑁**𝑆𝑇* *→ 𝑁*_{ℬ}
**Proof First, note that string(𝑢) = 𝑢.**

*Due to corollary 2.4 point 4, we have a bijection between the nodes of ℬ and 𝑆𝑇 . Furthermore, from*
the definition of the suffix tree, we have: *𝑣 ∈ subtree of 𝑢 ⇐⇒ 𝑢 is a prefix of 𝑣. This immediately*

gives us the isomorphism.

This finally gives us the equivalence between the suffix tree and the lcp tree, though only in the case where
𝑆 has no nested suffixes. But why do nested suffixes pose a problem here? At the heart of the matter
lies the ambiguity of a string ‘branching’ when there is a nested suffix. Is a string branching when it
can be extended by two different characters, as with the suffix tree, or when it is a prefix of two different
strings, as with the lcp tree? Neither choice is better. The lcp tree has duplicate nodes, whilst the suffix
tree has suffixes without any (explicit) node. This is an important part of why nested suffixes are so
troublesome.

**3.5** **RMQ**

*The subject of computing and pre-processing for RMQ is rich enough to devote an entire paper to. Here,*
we will only examine one method that is easily understood. This section is based mostly on [BFC00].

We will encounter two new problems in this algorithm: LCA and ±RMQ. LCA, or the lowest common
ancestor problem, is defined as follows:

**Definition 3.8 (lowest common ancestor)** *Given a tree 𝑇 and two nodes 𝑥, 𝑦, we define the set of common ancestors:*

𝐴 = {𝑧 ∈ 𝑇 | 𝑧 ancestor of 𝑥 ∧ 𝑧 ancestor of 𝑦}

*The lowest common ancestor is then given by:*

LCA_𝑇(𝑥, 𝑦) = arg max_{𝑧 ∈ 𝐴} depth(𝑧)

*The ±RMQ problem is defined as RMQ on a restricted class of arrays: ±arrays. These are arrays where*
successive values differ by either +1 or −1.

Our method works via two reductions. First, we reduce RMQ to LCA via the ‘Cartesian tree’. Second,
we reduce LCA to ±RMQ by looking at the depths along an Euler tour. Finally, we will show how to solve
±RMQ.

**3.5.1** **RMQ to LCA**

*We can reduce the problem of RMQ on an array 𝐴 of size 𝑛 to the problem of LCA on a binary tree 𝐶.*

*Here 𝐶 is the Cartesian tree of 𝐴. It is defined as follows:*

**Definition 3.9 (Cartesian tree)** *Given an array 𝐴 of size 𝑛, the Cartesian tree 𝐶 is recursively defined as follows:*

*The root of 𝐶 is the index 𝑖 of the (leftmost) minimal element of 𝐴. The left and right subtrees of the*
*root are respectively, the Cartesian trees for the left subarray 𝐴[0 : 𝑖] and the right subarray 𝐴[𝑖 + 1 : 𝑛].*

From this definition, we derive the following theorem:

**Theorem 3.10 Given an array 𝐴 and its Cartesian tree 𝐶, we have:**

*LCA**𝐶**(𝑖, 𝑗) = RMQ*_{𝐴}*(𝑖, 𝑗 + 1)*

**Proof Taking 𝑘 = 𝐿𝐶𝐴***𝐶**(𝑖, 𝑗), 𝑖 and 𝑗 lie in respectively the left and right subtrees of 𝑘 and thus*
*𝑖 ≤ 𝑘 ≤ 𝑗.*

*Furthermore, 𝑘 is the leftmost minimal element of some range [𝑎, 𝑏]. Due to 𝑖 and 𝑗 being descendants*
*of 𝑘, we have [𝑖, 𝑗] ⊂ [𝑎, 𝑏]. This, combined with 𝑖 ≤ 𝑘 ≤ 𝑗, ensures 𝑘 is the leftmost minimal element*

*of [𝑖, 𝑗].*

All that remains to be shown is that constructing the Cartesian tree can be done in 𝒪 (𝑛) time. We
present an iterative approach. Let 𝐶_𝑖 be the Cartesian tree for 𝐴[0 : 𝑖] and let 𝑥 = 𝐴[𝑖] be the node we
need to add. Furthermore, let 𝑅_𝑖 be the right path of 𝐶_𝑖. Certainly, 𝑥 must be added as the child of some
node in 𝑅_𝑖; after all, every other node concerns a subarray not including the end point. Now, for any
𝑦 ∈ 𝑅_𝑖 with 𝐴[𝑦] ≤ 𝑥, the new node must be added to the subtree rooted at 𝑦. The other nodes 𝑧 ∈ 𝑅_𝑖
with 𝐴[𝑧] > 𝑥 must move into the left subtree of 𝑥.

As such, we search 𝑅_𝑖 bottom-up until we find the first node 𝑦 ∈ 𝑅_𝑖 with 𝐴[𝑦] ≤ 𝑥. We then set 𝑥 as the new
right child of 𝑦 and set the old right subtree of 𝑦 to be the left subtree of 𝑥.

This construction method takes 𝒪 (𝑛) time, because any node that is compared along the right path is
subsequently removed from it. Therefore, every node is only ever compared once. Since the runtime is
proportional to the total number of comparisons, this gives us 𝒪 (𝑛) comparisons. So we have reduced
RMQ to LCA.
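The bottom-up search of the right path is naturally implemented with a stack. A sketch (our own code, returning parent pointers rather than an explicit tree):

```python
# Sketch: iterative Cartesian tree construction. The stack holds the
# indices on the current right path, bottom of the path on top.

def cartesian_tree(A):
    """Return parent[], where parent[i] is the parent index of node i in
    the Cartesian tree of A (the root has parent -1)."""
    parent = [-1] * len(A)
    stack = []                          # indices on the current right path
    for i, x in enumerate(A):
        last = -1
        # pop every right-path node whose value is larger than x
        while stack and A[stack[-1]] > x:
            last = stack.pop()
        if last != -1:
            parent[last] = i            # popped subtree becomes left child of i
        if stack:
            parent[i] = stack[-1]       # i becomes the new right child
        stack.append(i)
    return parent
```

Each index is pushed once and popped at most once, giving the 𝒪 (𝑛) bound argued above; popping only on a strict inequality makes the leftmost minimum the root.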

**3.5.2** **LCA to ±RMQ**

*Our next step is to reduce LCA to ±RMQ in 𝒪 (𝑛) time. Our method takes any tree 𝑇 with enumerated*
nodes. The key to this step is the following:

**Observation 3.11** *LCA_𝑇(𝑥, 𝑦) is the shallowest node encountered between a visit of 𝑥 and a visit of 𝑦 during an Euler tour.*

*We start by storing an Euler tour of 𝑇 in an array 𝐸 such that 𝐸[𝑖] is the 𝑖-th node visited on the Euler*
*tour. Next, we create an array 𝐷 that stores the depth of the nodes visited on the Euler tour. That is:*

*𝐷[𝑖] = 𝑑𝑒𝑝𝑡ℎ(𝐸[𝑖]). Finally we create an array 𝑅 that stores the representative of each node, its first*
*occurrence in 𝐸. This gives us 𝐸[𝑅[𝑖]] = 𝑖. We could use any occurrence, but making a choice makes 𝑅*
well-defined. This fact, combined with observation 3.11 yields the following lemma:

**Lemma 3.12** *Defining 𝐸, 𝐷 and 𝑅 as stated above, we have: LCA_𝑇(𝑖, 𝑗) = 𝐸[RMQ_𝐷(𝑅[𝑖], 𝑅[𝑗])], where 𝐷 is a ±array.*

*Constructing 𝐸, 𝐷 and 𝑅 in 𝒪 (𝑛) time during an Euler tour is trivial. All that remains is solving ±RMQ*
*in 𝒪 (1) time after 𝒪 (𝑛) time and memory for pre-processing. Note that, after having constructed arrays*
*𝐸, 𝐷 and 𝑅, we no longer need the actual Cartesian tree.*
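A sketch of this construction (our own code; the tree is given as child lists, and a naive minimum over 𝐷 stands in for ±RMQ):

```python
# Sketch: build E, D and R by a depth-first Euler tour, then answer
# LCA with a (naive, for illustration) range minimum over D.

def euler_tour(children, root=0):
    """Return E (nodes in tour order), D (their depths) and R (first
    occurrence of each node in E)."""
    E, D, R = [], [], {}

    def visit(u, depth):
        R.setdefault(u, len(E))        # representative: first occurrence
        E.append(u)
        D.append(depth)
        for v in children.get(u, []):
            visit(v, depth + 1)
            E.append(u)                # we return to u after each child
            D.append(depth)

    visit(root, 0)
    return E, D, R

def lca(E, D, R, x, y):
    i, j = sorted((R[x], R[y]))
    k = min(range(i, j + 1), key=D.__getitem__)   # leftmost minimum of D[i..j]
    return E[k]
```

Note that 𝐷 is indeed a ±array: consecutive depths differ by exactly 1.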

**±RMQ**

*The matter at hand is computing RMQ for a ±array 𝐴 of size 𝑛. This should take 𝒪 (𝑛) time and*
memory for pre-processing and 𝒪 (1) time per query.

We proceed as follows: we divide the array 𝐴 into blocks of size 𝑘 = (log 𝑛)/2. We then distinguish two
different cases:

*(a) 𝑖 and 𝑗 lie in the same block.*

*(b) 𝑖 and 𝑗 lie in different blocks.*

We call case (a) an ‘in-block query’. We shall solve these with a lookup table. We shall solve case (b) by comparing 3 minima:

*1. The minimum between 𝑖 and the end of its block.*

*2. The minimum of any blocks between 𝑖’s block and 𝑗’s block.*

*3. The minimum between 𝑗 and the beginning of its block*

If we know the (leftmost) indices associated with these minima, calculating RMQ(𝑖, 𝑗) becomes trivial.

In this case, minima 1 and 3 will be computed as in-block queries, and minimum 2 as a ‘superblock query’.

We shall first explain how to answer in-block queries.

**in-block queries:** To solve in-block queries, we want to pre-compute enough answers that in-block
queries become simple lookups. Essential to this is the following:

**Observation 3.13** *Given two arrays 𝑋 and 𝑌 that differ by some fixed value at every position, that is, ∃𝑐 ∀𝑖 : 𝑋[𝑖] = 𝑌[𝑖] + 𝑐, RMQ_𝑋 is equivalent to RMQ_𝑌.*

We call a block normalized when its first element is 0. Due to the above observation, we can reduce our pre-computation to only normalized blocks. This is where the ± property comes in, limiting the number of normalized blocks:

**Observation 3.14** *There are only 2^{𝑘−1} possible normalized ±blocks of length 𝑘.*

We now simply pre-compute the answers to all 𝒪(2^{𝑘−1} · 𝑘^2) ⊂ 𝒪 (𝑛) possible in-block queries for these
normalized ±blocks. Finally, for each block in 𝐴, we store which normalized block should be used.
Thus, in-block queries reduce to lookups.

*Pre-computing all these answers can be done in 𝒪 (𝑛) time by dynamic programming, solving the queries*
for shorter intervals first.
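A sketch of this pre-computation (our own encoding: a normalized ±block of length 𝑘 is identified by the 𝑘 − 1 bits of its steps):

```python
# Sketch: enumerate all 2**(k-1) normalized +-blocks and tabulate every
# in-block RMQ answer; a block is identified by its step bits (1 = +1).

from itertools import product

def block_id(block):
    """Encode the +1/-1 steps of a +-block as an integer."""
    bid = 0
    for a, b in zip(block, block[1:]):
        bid = (bid << 1) | (1 if b - a == 1 else 0)
    return bid

def build_tables(k):
    """For every normalized +-block of length k, tabulate the leftmost
    minimum index for every in-block range [i, j)."""
    table = {}
    for steps in product((1, -1), repeat=k - 1):
        block = [0]                         # normalized: first element is 0
        for s in steps:
            block.append(block[-1] + s)
        ans = {(i, j): min(range(i, j), key=block.__getitem__)
               for i in range(k) for j in range(i + 1, k + 1)}
        table[block_id(block)] = ans
    return table
```

A full implementation would additionally store, per block of 𝐴, its identifier, so that an in-block query becomes a single table lookup.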

**Superblock queries:** We introduce two new arrays storing information about the blocks as a whole.
𝐵 stores the minimum value of each block, and 𝐼 the corresponding leftmost index. Since there are
𝑚 = 𝑛/𝑘 = 2𝑛/log 𝑛 blocks, the arrays are of size 𝑚. Formally we have:

𝐼[𝑖] = RMQ_𝐴(𝑖 · 𝑘, (𝑖 + 1) · 𝑘)

𝐵[𝑖] = 𝐴[𝐼[𝑖]]

Now, a superblock query from the 𝑖-th block to the 𝑗-th block reduces to a general RMQ query on 𝐵:
we need simply return 𝐼[RMQ_𝐵(𝑖, 𝑗 + 1)]. To solve this general RMQ query we use a method
called the sparse table. Now, this method requires 𝒪 (𝑚 log 𝑚) time for preprocessing, which is why we do
not use it to solve general RMQ directly. However, in this case it suffices because:

𝑚 log 𝑚 = (2𝑛/log 𝑛) · log(2𝑛/log 𝑛) = (2𝑛/log 𝑛) · (log 𝑛 + log(2/log 𝑛)) ≤ (2𝑛/log 𝑛) · 2 log 𝑛 ∈ 𝒪 (𝑛).

The sparse table works as follows: for each interval of size 2^𝑎 starting at position 𝑖, we precompute the
RMQ and store it in a table 𝑀. That is:

𝑀[𝑖][𝑎] = RMQ_𝐵(𝑖, 𝑖 + 2^𝑎)

Now, for any 𝑖 < 𝑗 we compute RMQ_𝐵(𝑖, 𝑗) as follows. We take 𝑎 = ⌊log(𝑗 − 𝑖)⌋. Then, the union of
the intervals [𝑖, 𝑖 + 2^𝑎) and [𝑗 − 2^𝑎, 𝑗) is [𝑖, 𝑗). For these intervals, the RMQ answers are stored in 𝑀[𝑖][𝑎]
and 𝑀[𝑗 − 2^𝑎][𝑎]; based on these, computing RMQ_𝐵(𝑖, 𝑗) in 𝒪 (1) is trivial. This table 𝑀 clearly stores
𝒪 (𝑚 log 𝑚) values. We can also fill it in 𝒪 (𝑚 log 𝑚) time using dynamic programming, starting with
the smaller intervals.
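A sketch of the sparse table (our own code; ties break to the left so the leftmost minimum is returned):

```python
# Sketch: sparse table for RMQ. Row a answers all ranges of size 2**a;
# an arbitrary range is covered by two overlapping power-of-two ranges.

from math import floor, log2

def build_sparse(B):
    """M[a][i] = leftmost index of the minimum of B[i : i + 2**a]."""
    n = len(B)
    M = [list(range(n))]                   # intervals of size 2**0 = 1
    a = 1
    while (1 << a) <= n:
        prev, half = M[-1], 1 << (a - 1)
        row = []
        for i in range(n - (1 << a) + 1):
            l, r = prev[i], prev[i + half]
            row.append(l if B[l] <= B[r] else r)   # <= keeps the leftmost
        M.append(row)
        a += 1
    return M

def rmq_sparse(B, M, i, j):
    """Leftmost index of the minimum of B[i:j], in O(1) per query."""
    a = floor(log2(j - i))
    l, r = M[a][i], M[a][j - (1 << a)]
    return l if B[l] <= B[r] else r
```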

**3.5.3** **alternative methods**

What we have seen so far gives an easy-to-understand implementation of RMQ. However, it is somewhat
unwieldy: the reduction to LCA alone already takes about 5𝑛 words of memory, since arrays
𝐸 and 𝐷 each take 2𝑛 − 1 words and 𝑅 takes another 𝑛.

Luckily, there are alternative methods. One important such method is presented in [FH07]. There, it is
shown that RMQ can be solved using only 3𝑛 bits.^{1} That is easily less than 𝑛 words. The key to
their method is the realization that two ‘blocks’ have the same RMQ results exactly when their Cartesian
trees are the same.

**3.6** **The Enhanced Suffix Array**

Based on what we have seen, we define the Enhanced Suffix Array (or ESA) as follows:

**Definition 3.10 (Enhanced Suffix Array)** *Given a string 𝑆, the Enhanced Suffix Array or ESA is the normal suffix array SA together with its LCP array. Furthermore, LCP has been pre-processed for RMQ.*

**Corollary 3.15 The ESA has a memory footprint of 𝒪 (𝑛). Furthermore, assuming SA and LCP***can be constructed in 𝒪 (𝑛) time (as is shown in section 4), the ESA can be constructed in 𝒪 (𝑛)*
time as well.

**Proof** *SA and LCP have, by definition, a memory footprint of 𝒪 (𝑛). Furthermore, the pre-processing of LCP for RMQ takes 𝒪 (𝑛) time and memory, as seen in section 3.5.*

1*2𝑛 + 𝑜(𝑛) bits to be exact*