String to String Correction Kernelization


by

Nathaniel Watt

B.Sc., University of Victoria, 2010

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

© Nathaniel Watt, 2013

University of Victoria

Supervisory Committee

Dr. U. Stege, Co-Supervisor

(Department of Computer Science)

Dr. S. Ganti, Co-Supervisor


ABSTRACT

The StringToStringCorrection problem asks, given mutable string M, target string T, and positive integer k, can M be mutated into T using at most k operations (single symbol deletions or swaps of adjacent symbols) applied to M? The problem is known to be NP-complete. Abu-Khzam et al. give the first fixed-parameter algorithm for the problem when the parameter is the number of operations permitted. Their technique, charge and reduce, gives an $O^*(1.612^k)$ bounded search tree algorithm, but leaves open whether a poly-size kernel exists. We show, using two polynomial time reduction rules on large regions of the input strings, that the problem has a problem kernel of size $O(k^4)$. Our algorithm achieves this in a polynomial running time. Additionally, we introduce the problem k-MultiStringToStringCorrection (k-MS2SC), a generalized version of StringToStringCorrection, and prove it to be fixed-parameter tractable.


List of Figures

Figure 1.1 Correct ccoiriat to match victoria

Figure 2.1 Buss kernelization example I = (G, 4)
    (a) k = 4
    (b) k = 3
    (c) k = 2

Figure 2.2 Delete, insert, change, and swap instance trace

Figure 2.3 Traces for yes-instance I = (“cab”, “cbcab”, 3)
    (a) Optimal valid trace
    (b) Valid trace
    (c) Invalid trace

Figure 3.1 Tail T-symbol matching examples
    (a) T(|T|) matches M(i) and |O| = 4
    (b) T(|T|) matches M(j) and $|O'| = 3$
    (c) T(|T|) matches M(i) and |O| = 3
    (d) T(|T|) matches M(j) and $|O'| = 3$

Figure 3.2 Matching symbols T(i) and M(j)
    (a) i > j
    (b) i < j

Figure 3.3 Untouched M-regions one-to-one with URPs

Figure 3.4 Naive removal of untouched region pair
    (a) no-instance (k = 1)
    (b) yes-instance (k = 1)

Figure 3.5 Instance split reduction of untouched region pair
    (a) no-instance (k = 1)
    (b) no-instance (k = 1)

Figure 3.6 Valid traces for yes-instance I = (“aazcb”, “zaaccbz”, 4)

Figure 3.7 Region $R^T_{1,\ell}$ forming PURP with both $R^M_{4,\ell}$ and $R^M_{6,\ell}$ (∆


ACKNOWLEDGEMENTS

I would like to thank:

Ulrike Stege, for challenging me and for demonstrating immense amounts of patience. There are not many people who challenge me as you have. I loved our theoretical debates.

Sudhakar Ganti, for convincing me to accept an M.Sc. position. I am truly sorry we never worked on problems of mutual interest. Thanks for your support.

Jennifer Carlisle, for sharing amazing highs over our two years studying and adventuring together. You continue to inspire me.

Jennifer Sammut, for being my best friend and companion over the last six years. You have unquestionably offered more support than I could have ever hoped to have.

Camp out among the grass and gentians of glacier meadows, in craggy garden nooks full of Nature’s darlings. Climb the mountains and get their good tidings. Nature’s peace will flow into you as sunshine flows into trees. The winds will blow their own freshness into you, and the storms their energy, while cares will drop off like autumn leaves.

John Muir

DEDICATION

For my family.

Chapter 1 String Correction

Efficient algorithms measuring distances between strings are important for information retrieval, signal processing, computational biology, error correction, handwriting recognition and more [14, 8, 15]. Generally, string distance algorithms compare a query string against a goal or target string. Each algorithm uses different distance metrics to meet each application’s specific requirements. Distance metrics suitable for comparing DNA across species (e.g. sorting by reversals [10]) are unlikely to be useful metrics for spell checking.

For example, consider elementary teacher Jerry marking a spelling test. Three of Jerry’s students misspell target string T = being. Tim wrote $M_1$ = beeing, Alice wrote $M_2$ = bieng, and Esrever wrote $M_3$ = gnieb. All three answers are wrong, but Jerry wants to give partial marks based on how similar each answer is to target string T. Jerry decides deducting a mark for each change required to correct an incorrect answer is fair. That is, Jerry’s distance metric is the number of correction operations performed. Since he has not done this before, Jerry marks each test three times. He uses a different set of correction operations each time.

1. Substitution only.

2. Substring reversals only.

3. Substitution, deletion, insertion, and adjacent symbol interchange (swap).

Using only substitution operations, Jerry can change incorrect symbols into the correct target symbol. In the first set of marks, Alice loses two marks for bieng → beeng → being, and Esrever loses four marks for gnieb → bnieb → beieb → beinb → being. Tim loses infinite marks for beeing → beiing → beinng → beingg → ... → beingg. No number of substitutions can correct Tim’s solution, which contains too many letters. Tim’s mark seems harsh. After all, Tim’s answer is correct except for one extra ‘e’.

Using substring reversals, Jerry can reverse any substring such as changing bie into eib. In the second set of marks, Alice loses only one mark for $b\overleftarrow{ie}ng \to being$, and Esrever also loses only one mark for $\overleftarrow{gnieb} \to being$. Tim once again loses infinite marks since no number of reversal operations can remove the excess symbol.

Using substitution, deletion, insertion, and adjacent symbol swaps, Jerry can finally avoid holding Tim back a grade. Tim loses only one mark for beeing → being (deleting the extra ‘e’), Alice loses one mark for $b\overleftarrow{ie}ng \to being$, and Esrever loses four marks, since the four substitutions from the first set of marks remain available: gnieb → bnieb → beieb → beinb → being.

The third distance metric seems reasonable for spell checking; the first two schemes do not. However, other applications rely heavily on the first and second distance metrics. The first scheme is used extensively in error correction, as it calculates the Hamming distance [9] of two equal length strings. The second scheme is used in computational biology [10].
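As a small illustration of the first marking scheme, the following Python sketch counts the substitutions separating two strings and reports infinity when their lengths differ. The function name substitution_distance is chosen here for illustration only and does not come from the cited literature.

```python
import math


def substitution_distance(m: str, t: str) -> float:
    """Count the substitutions needed to turn m into t (Hamming distance).

    Returns math.inf when the lengths differ, because substitutions alone
    can neither add nor remove symbols.
    """
    if len(m) != len(t):
        return math.inf
    # Each position where the symbols differ costs exactly one substitution.
    return sum(1 for a, b in zip(m, t) if a != b)


if __name__ == "__main__":
    for answer in ("beeing", "bieng", "gnieb"):
        print(answer, substitution_distance(answer, "being"))
```

Applied to the three spelling answers, the sketch reproduces the first set of marks: two for bieng, four for gnieb, and infinity for beeing.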

String distance and string correction problems are closely related. If Jerry returns each question with only a numerical score, he is solving string distance. If instead Jerry returns a minimal set of allowable corrections, he solves string correction. When a mutable string M cannot be mutated into target string T the distance problem reports an infinite distance whereas string correction reports an unsolvable instance.

StringDistance

Input: Target string T, query string M, list $A = \{A_1, \ldots, A_n\}$ of allowable operations, and list $W = \{W_1, \ldots, W_n\}$ of integer weights assigned to each operation.

Question: What is the minimum total cost k required to mutate M into T using only weighted operations described in A?

Output: Integer cost k or infinity.

StringCorrection

Input: Target string T, query string M, list $A = \{A_1, \ldots, A_n\}$ of allowable operations, and list $W = \{W_1, \ldots, W_n\}$ of weights assigned to each operation.

Question: What is a minimum cost set S of operations that mutates M into T using only weighted operations described in A?

Output: Set S or error on unsolvable instance.

Another closely related problem, approximate string matching (also known as string matching with errors), finds locations where a query string M approximately matches in a large text.

StringMatchingWithErrors

Input: Large text string X, query string M, list $A = \{A_1, \ldots, A_n\}$ of allowable operations, list $W = \{W_1, \ldots, W_n\}$ of weights assigned to each operation, and integer k.

Question: Which locations within X contain string T that is within distance k of M using weighted operations in A?

Output: Set S containing all locations that match within distance k.

We note that efficient string correction algorithms for operation set A and weight set W can be modified to answer string distance or approximate matching problems. Approximate string matching returns the locations in text where string correction operation costs do not exceed k. Solving string correction and returning the sum of operation costs (or infinity for unsolvable instances) solves string distance; however, solving approximate string matching and string distance problems with distance algorithms directly is often more efficient since computing a set of operations is not necessarily required.

Extended string to string correction problems (ESSCP) [18] are one of the most thoroughly studied classes of string correction. ESSCP problems allow for substitution, deletion, insertion, and adjacent interchange (swap) operations and any non-negative weight assignments. To illustrate each of these operations, consider Figure 1.1 which shows correcting M = ccoiriat to match T = victoria where each operation weight is one. (1) Substitute M(1) ← v. (2) Swap M(3) ↔ M(4). (3) Swap M(2) ↔ M(3). (4) Insert t → M(4). (5) Delete M(9).
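The five steps listed above can be replayed mechanically. The sketch below is only an illustration: the helper names (substitute, swap, insert, delete) and the 1-based indexing convention are assumptions made to mirror the notation M(i), not part of any cited algorithm.

```python
def substitute(s: str, i: int, sym: str) -> str:
    """Replace the symbol at 1-based index i with sym."""
    return s[:i - 1] + sym + s[i:]


def swap(s: str, i: int) -> str:
    """Exchange the adjacent symbols at 1-based indices i and i + 1."""
    return s[:i - 1] + s[i] + s[i - 1] + s[i + 1:]


def insert(s: str, i: int, sym: str) -> str:
    """Insert sym so that it occupies 1-based index i afterwards."""
    return s[:i - 1] + sym + s[i - 1:]


def delete(s: str, i: int) -> str:
    """Remove the symbol at 1-based index i."""
    return s[:i - 1] + s[i:]


def correct_ccoiriat() -> list:
    """Return every intermediate string of the five-step correction."""
    steps = ["ccoiriat"]
    steps.append(substitute(steps[-1], 1, "v"))  # (1) Substitute M(1) <- v
    steps.append(swap(steps[-1], 3))             # (2) Swap M(3) <-> M(4)
    steps.append(swap(steps[-1], 2))             # (3) Swap M(2) <-> M(3)
    steps.append(insert(steps[-1], 4, "t"))      # (4) Insert t at index 4
    steps.append(delete(steps[-1], 9))           # (5) Delete M(9)
    return steps
```

Note that indices refer to the current state of the mutable string, which is why step (5) deletes index 9 even though the original M has only eight symbols.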

Figure 1.1: Correct ccoiriat to match victoria

An ESSCP problem may disallow any of the four operations by setting the associated weight to infinity. There are 15 possible combinations of allowable operations within ESSCP. Many classic computer science problems are solvable by variants of ESSCP: Hamming distance, longest common subsequence [2], Levenshtein distance [12], and Damerau-Levenshtein distance [4]. All but two ESSCP variants are solvable in polynomial time [16]. Garey & Johnson’s catalogue of NP-hard problems [7] lists string correction with swaps and deletes only as StringToStringCorrection (S2SC). Abu-Khzam et al. [1] give the first fixed-parameter tractable algorithm for k-S2SC.

Table 1.1 contains a sample of algorithms that handle ESSCP with all weights equal to one.

Operations                           Run time           Reference

Substitute                           O(n)               [11]
Delete                               O(n)               [11]
Insert                               O(n)               [11]
Swap                                 O(n^2)             [11]
Substitute, Delete                   O(n^2)             [16]
Substitute, Insert                   O(n^2)             [16]
Substitute, Swap                     O(n^2)             [16]
Delete, Insert                       O(n^2)             [17]
Delete, Swap                         O(1.6181^k n) ¹    [1]
Insert, Swap                         O(1.6181^k n) ¹    [1]
Substitute, Delete, Insert           O(kn/w) ²          [13]
Substitute, Delete, Swap             O(n^2)             [16]
Substitute, Insert, Swap             O(n^2)             [16]
Delete, Insert, Swap                 O(n^2)             [16]
Substitute, Delete, Insert, Swap     O(n^2)             [4]

Table 1.1: ESSCP variant run times

¹ k is the number of allowed operations.

Chapter 2 General Terminology

2.1 Fixed Parameter Tractability

What makes a problem intractable? Classical complexity analysis tells us problems are intractable when deterministic Turing machines cannot always produce a solution in polynomial time; yet, in practice many such problems are efficiently solvable. Classical complexity fails to capture the essence of problem tractability.

Two common approaches for coping with classically intractable problems are applying heuristics or using approximation algorithms. Heuristics are used widely, but are not theoretically sound. They often perform well yet provide no accuracy guarantees. Furthermore, heuristic approaches offer no insight into the essence of intractability. Approximation algorithms offer accuracy error bounds and are backed by a well developed theoretical base. Unlike heuristics, approximation theory also supports problem classification. In particular, problems with fully polynomial-time approximation schemes (FPTAS) are approximable with run-times polynomial in both problem size and error factor $1/\epsilon$. Problems that admit FPTAS are, typically, considered practical to approximate within a reasonable error allowance. Instinctively, we expect problems with FPTAS solutions are likely more tractable than problems without. Approximation theory, although useful, fails to formalize the essence of tractability much further. It is not surprising that neither heuristics nor approximation theory define tractability since both approaches still reside under the umbrella of classic complexity. To truly divide tractable and intractable problems, a paradigm shift is required.

Parameterized complexity [5] provides a very different view of intractability than classic complexity. Parameterized complexity offers a well established framework to classify degrees of intractability. The essence of tractability is captured by the complexity class Fixed Parameter Tractable (FPT). Parameterized complexity theory focuses on multi-dimensional problem analysis. Each problem is defined by not only the problem size, n, but also by a parameter k. A problem is fixed parameter tractable if and only if for some computable function f it can be solved in time $f(k) \cdot n^{O(1)}$ [6].

This is a relaxation of the classical understanding of tractability since exponential (or worse) time in the parameter is allowed. Typically, when k is relatively small, FPT algorithms are practical. Parameterized complexity also verifies our previous instinctive expectation regarding problems with an FPTAS; all problems with an FPTAS are contained in FPT.

2.1.1 Kernelization

Kernelization is one of many powerful tools of parameterized complexity. Kernelization algorithms preprocess a given problem instance yielding an equivalent parameterized instance with size bounded by the parameter alone.

Definition 2.1.1 (Kernelization). A kernelization for a k-parameterized problem P is an algorithm that maps any P instance $(x, k)$ to a P instance $(x', k')$ such that

1. $(x, k)$ is a yes-instance iff $(x', k')$ is,

2. the length of $x'$ is bounded by a polynomial function $f(k)$, and

3. $k'$ is bounded by some function $g(k)$.

A kernelization algorithm applies a set of reduction rules to a parameterized input instance $(x, k)$. Generally each reduction rule performs one of the following tasks:

1. Reject no-instance $(x, k)$.

2. Accept yes-instance $(x, k)$.

3. Map instance $(x, k) \to (x', k')$ where $|x'| < |x|$ and $k' \le g(k)$.

The output of a kernelization algorithm is called a kernel. Unlike heuristics, kernelization guarantees an exact size bound and yields an equivalent parameterized instance. If the size bound f is allowed to be any computable function rather than a polynomial, a kernelization algorithm exists for a decidable parameterized problem if and only if the problem is fixed parameter tractable. We conclude this section with a well-known example of kernelizing k-VertexCover.


Buss Kernelization [3]

We formally define k-VertexCover as follows:

k-VertexCover

Input: Graph G

Parameter: k

Question: Does G have a vertex cover of size at most k?

Consider k-VertexCover instance $I = ((V, E), k)$ (e.g., Figure 2.1a). Select any vertex $v \in V$. Either v is in the cover or all neighbouring vertices are in the cover. If v has more than k neighbours, it is safe to assume v is in the cover. In this case, if v’s neighbours are in an optimal cover and v is not, the cover is larger than k and I is a no-instance anyway. Buss kernelization repeatedly applies the following reduction rule: for every vertex $v \in V$ with $\deg(v) > k$ reduce I to $I' = (G - v, k - 1)$. If at any point $k < 0$, reject I as a no-instance since the cover is already too large. Let exhaustively applying the reduction yield instance $I' = (G' = (V', E'), k')$. Since each $v \in V'$ has $\deg(v) \le k'$, a cover of size $k'$ can cover at most $(k')^2$ edges. Therefore, if $|E'| > (k')^2$ reject I as a no-instance. If I has not been rejected, $V'$ has at most $2(k')^2$ non-isolated vertices, since each of the at most $(k')^2$ edges has two endpoints. Since isolated vertices cover no edges, remove each isolated vertex $v \in V'$ to yield instance $I' = (G' - v, k')$. Removing isolated vertices does not reduce the parameter $k'$ since they are not added to the cover. I has been kernelized to yield $I'$ containing at most $2(k')^2$ vertices.

Given k-VertexCover instance I = (G, 4) (Figure 2.1a), we kernelize I as follows:

1. $\deg(v_1) > 4$: compute kernelization of $I = (G - v_1, 3)$ (Figure 2.1b).

2. $\deg(v_3) > 3$: compute kernelization of $I = (G - v_3, 2)$ (Figure 2.1c).

3. Since no vertex in G has $\deg(v) > 2$, complete the kernelization by removing all isolated vertices.

4. Kernel I has $|V| = 3 \le 2k^2$ vertices.
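The kernelization just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the graph is stored as a dictionary of neighbour sets, the names graph_from_edges and buss_kernel are hypothetical, and any example graph passed to it is invented rather than taken from Figure 2.1.

```python
def graph_from_edges(vertices, edges):
    """Build an adjacency dictionary (vertex -> set of neighbours)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def buss_kernel(adj, k):
    """Return (kernel_adj, k') or None when the instance is rejected."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while True:
        if k < 0:
            return None  # the forced part of the cover is already too large
        # A vertex with more than k neighbours must belong to the cover.
        high = next((v for v, ns in adj.items() if len(ns) > k), None)
        if high is None:
            break
        for u in adj.pop(high):
            adj[u].discard(high)
        k -= 1
    # Every remaining vertex has degree <= k, so k vertices cover <= k^2 edges.
    num_edges = sum(len(ns) for ns in adj.values()) // 2
    if num_edges > k * k:
        return None
    # Isolated vertices cover no edges and are dropped without changing k.
    kernel = {v: ns for v, ns in adj.items() if ns}
    return kernel, k
```

On an accepted instance the returned kernel has at most $2(k')^2$ vertices, because it has at most $(k')^2$ edges and no isolated vertices.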

We conclude that k-VertexCover is in FPT since a k-VertexCover kernelization algorithm exists. Observe that approximation theory guarantees no FPTAS exists for k-VertexCover unless P = NP. This further demonstrates the superior intractability classifications of parameterized complexity. Even problems that are not approximable to better than a constant factor may still be tractable.

Figure 2.1: Buss kernelization example I = (G, 4), with panels (a) k = 4, (b) k = 3, and (c) k = 2

2.2 Traces

Two common representations of string to string correction solutions can be found in the literature. The most specific solution representation for an instance I = (T, M, k) is a tuple ((T, M), S) where S is a sequence of operations that mutates M into T. We use an alternate representation consisting of a set, rather than a sequence, of operations that mutate M into T. Specifically, we represent each solution as a trace. Traces were first introduced by Wagner and Fischer [17] and later extended to include swaps by Wagner and Lowrance [18]. A trace is a tuple ((T, M), L) where L is a set of lines (i, j) connecting the $i$th T-symbol ($T(i)$) to the $j$th M-symbol ($M(j)$). Traces are visualized as diagrams. Figure 2.2 shows a trace for a problem instance I = (“victoria”, “ccoiriat”) where delete, insert, substitute, and swap operations are all allowed.

Figure 2.2: Delete, insert, change, and swap instance trace

A set of operations that mutates ccoiriat into victoria is represented by the lines in Figure 2.2. Each line connecting different symbols, e.g., T(1) and M(1), indicates the M-symbol is substituted to match the T-symbol. Each line crossing represents a swap operation, e.g., M(3) swaps with M(4). Every M-symbol not connected by a line is deleted, e.g., M(8). Every T-symbol not connected by a line, e.g., T(4), requires insertion of an identical symbol into M at the same index.

For every solvable instance I, a trace for I contains only operations that are allowed by I’s problem definition. For example, consider a swap and delete only S2SC instance trace C. Since insertions are not allowed, every T -symbol in C is connected by a line to an M -symbol. Since substitutions are not allowed, each line connects identical symbols; therefore, in C, a line connects every T -symbol to a unique identical M -symbol. If no trace exists for an instance, the instance is unsolvable since M cannot mutate into T .

Traces provide significant advantages over operation sequences.

1. Sequence diagram size depends on the number of operations since each step requires a new depiction of the mutable string (e.g., Figure 1.1). Trace diagrams are very succinct (e.g., Figure 2.2).

2. While describing an operation sequence, the index of a particular M-symbol changes depending on prior operations. Symbols in a trace have a fixed index.

If a sequence is desired, one can be derived from a trace without much difficulty. To convert a trace into a sequence, the set of operations must be performed in a valid order. Note that not all orderings are valid. M(4) in Figure 2.2 swaps with both M(2) and M(3). Clearly any valid sequence must swap M(4) with M(3) before swapping with M(2). Another issue arises while ordering swap and delete operations. It is unnecessary to ever swap a symbol that is deleted or inserted. One method to avoid unnecessary swaps performs deletions first, then swaps, and finally insertions.
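For the swap and delete only setting, the deletions-first ordering can be sketched as follows. The function name trace_to_sequence and the encoding of a trace as a list (matching[i - 1] = j meaning T(i) is connected to M(j)) are assumptions made for this illustration; the swap phase is a plain bubble sort on target positions, which is one simple way to obtain a valid order.

```python
def trace_to_sequence(m, matching):
    """Convert a swap and delete trace into an operation sequence.

    m is the mutable string; matching[i - 1] = j (1-based) connects T(i)
    to M(j). Returns (ops, result), where each op is ("delete", index) or
    ("swap", index) and indices refer to the string at the time of the op.
    """
    target_of = {j: i for i, j in enumerate(matching, start=1)}
    # Pair each M-symbol with the T-index it must reach (None if unmatched).
    work = [(m[j - 1], target_of.get(j)) for j in range(1, len(m) + 1)]
    ops = []
    # Phase 1: deletions, right to left so pending indices stay correct.
    for pos in range(len(work), 0, -1):
        if work[pos - 1][1] is None:
            ops.append(("delete", pos))
            del work[pos - 1]
    # Phase 2: adjacent swaps until the target indices are in order.
    swapped = True
    while swapped:
        swapped = False
        for pos in range(1, len(work)):
            if work[pos - 1][1] > work[pos][1]:
                work[pos - 1], work[pos] = work[pos], work[pos - 1]
                ops.append(("swap", pos))
                swapped = True
    return ops, "".join(sym for sym, _ in work)
```

For example, with M = “cbcab” and the hypothetical matching [1, 4, 2] for T = “cab”, the sketch deletes two symbols and then performs a single swap.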

For our purposes we modify traces in two ways: (i) We redefine trace to be a tuple C = (I, O) where O is the set of operations given by interpreting the set of lines rather than the lines themselves; (ii) A trace of a parameterized instance C = ((T, M, k), O) is valid if the total cost of O is no greater than k and O does not contain any swaps of identical symbols; a trace is invalid otherwise. When each operation weight equals one, traces are valid only if |O| ≤ k. We note that disallowing identical symbol swaps in a valid trace is for convenience only. Identical symbols M(i) and M(j) that swap can always be unswapped by redrawing the trace lines to connect M(i) to the T-symbol originally connected to M(j) and vice versa. This new trace requires one fewer swap than before.

Figure 2.3 illustrates valid and invalid swap and delete (both weight one) only traces. Each subfigure is a different trace for parameterized instance I = (“cab”, “cbcab”, 3).

Figure 2.3: Traces for yes-instance I = (“cab”, “cbcab”, 3), with panels (a) Optimal valid trace, (b) Valid trace, and (c) Invalid trace

Figures 2.3a and 2.3b are both valid traces requiring two and three operations respectively. Figure 2.3a is additionally an optimal trace as the solution uses the minimum number of operations. Figure 2.3c is an invalid trace which requires four operations. The existence of a valid trace for a parameterized instance I is a necessary and sufficient condition for I being a yes-instance. The existence of an invalid optimal trace guarantees I is a no-instance.
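The operation counts quoted above can be checked with a short sketch that scores a swap and delete trace as the number of unmatched M-symbols plus the number of line crossings. The function names and the list encoding of a trace are assumptions for illustration, and the three matchings used below are hypothetical traces chosen to have costs two, three, and four; the exact lines drawn in Figure 2.3 are not recoverable from the text.

```python
def crossing_pairs(matching):
    """Return index pairs (a, b), a < b, whose trace lines cross."""
    n = len(matching)
    return [(a, b) for a in range(n) for b in range(a + 1, n)
            if matching[a] > matching[b]]


def trace_cost(t, m, matching):
    """Cost of a unit-weight swap and delete trace.

    matching[i - 1] = j (1-based) connects T(i) to M(j). Every unmatched
    M-symbol is deleted and every line crossing is one swap.
    """
    if len(matching) != len(t) or len(set(matching)) != len(matching):
        raise ValueError("each T-symbol needs its own M-symbol")
    if any(t[i] != m[j - 1] for i, j in enumerate(matching)):
        raise ValueError("lines must connect identical symbols")
    deletions = len(m) - len(t)
    return deletions + len(crossing_pairs(matching))


def is_valid_trace(t, m, k, matching):
    """Valid means cost at most k and no swap of identical symbols."""
    identical_swap = any(t[a] == t[b] for a, b in crossing_pairs(matching))
    return trace_cost(t, m, matching) <= k and not identical_swap
```

With T = “cab”, M = “cbcab”, and k = 3, the matchings [3, 4, 5], [1, 4, 2], and [3, 4, 2] cost two, three, and four operations respectively, so only the first two are valid.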


Chapter 3 Swap Delete Kernel

3.1 Introduction

We study here a parameterization of the NP-complete StringToStringCorrection (S2SC) [16] problem variant. S2SC is listed in Garey & Johnson’s catalogue of NP-hard problems [7]. This parameterized version of S2SC is defined as follows:

k-StringToStringCorrection (k-S2SC)

Input: Target string T, mutable string M, and positive integer k.

Parameter: k

Question: Can M mutate into T using at most k delete and swap operations? Here a delete is the deletion of a single M -symbol, and a swap exchanges two adjacent M -symbols.

Abu-Khzam et al. [1] give the first fixed-parameter algorithm for k-S2SC. We present the first polynomial kernel result for k-S2SC. That is to say, we present a polynomial-time kernelization algorithm that reduces any k-S2SC instance $(T, M, k)$ to a k-S2SC instance $(T', M', k')$, where $k' \le k$ and $|T'| \le |M'| \le (k')^c$ for some constant $c > 0$.

Our kernelization algorithm stems from one observation: every yes-instance’s mutable string contains at most 2k symbols involved in operations. This bound is met when k swaps are used and no symbol is swapped more than once. This means that there are at least |M| − 2k mutable string symbols not involved in operations. Since |M| is assumed to be significantly greater than k, there must be at least one long substring within M where no operations are performed. We show that in large instances it is possible to find and reduce such substrings.
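The observation can be quantified with a pigeonhole argument. The bound below is not stated in the text; it is a sketch under the assumption that a valid trace involves at most 2k symbols of M in operations, with L denoting (our notation) the length of the longest substring of M on which no operation is performed. The at most 2k touched symbols split M into at most 2k + 1 maximal operation-free substrings, which together contain at least |M| − 2k symbols.

```latex
\[
  |M| - 2k \;\le\; (2k + 1)\cdot L
  \quad\Longrightarrow\quad
  L \;\ge\; \left\lceil \frac{|M| - 2k}{2k + 1} \right\rceil .
\]
```

For instance, with |M| = 1000 and k = 4 the bound gives $L \ge \lceil 992 / 9 \rceil = 111$.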

This chapter is outlined as follows. In the next section we present a selection of reduction rules from the literature. Using these reduction rules we show how to handle special instances. We then introduce new terminology and useful properties in Sections 3.3 and 3.4. We describe three kernelization algorithms in Sections 3.5, 3.7, and 3.8. Some of our reduction rules rely on a related problem variant k-MultiStringToStringCorrection (k-MS2SC). We introduce k-MS2SC between our first two kernelization algorithms and prove its membership in FPT. Our first kernelization algorithm uses an oracle $O_p(\cdot)$ to aid analysis by abstracting away parts of the problem. Our second kernelization algorithm uses a less helpful oracle $O_w(\cdot)$. This approach helps separate our theory and reduction rule development into manageable pieces. We construct our final kernelization algorithm by putting the pieces together and providing a polynomial implementation of $O_w(\cdot)$. Our oracle-free kernelization algorithm yields a polynomial-sized kernel in polynomial time.

3.2 Initial Reduction Rules

Many practical reduction rules are introduced in [1]. We present five of them here, with brief justification. We then describe our first new reduction rule, which extends Reduction Rule 3.2.5.

If |T| > |M|, matching each T-symbol to a unique M-symbol is impossible.

Reduction Rule 3.2.1. Reject instance I = (T, M, k) if |T| > |M|.

If |M | > |T | + k, more than k deletions are required.

Reduction Rule 3.2.2. Reject instance I = (T, M, k) if |M | > |T | + k.

If a symbol occurs more times in T than in M , matching each occurrence to an M -symbol is impossible.

Reduction Rule 3.2.3. Reject instance I = (T, M, k) if T contains more occurrences of any particular symbol φ than M contains.

If Reduction Rule 3.2.3 does not reject an instance I = (T, M, k), every T-symbol can match a unique M-symbol. In other words, any instance I not rejected by Reduction Rule 3.2.3 is solvable (a trace exists for I). From this point forward, we assume Reduction Rule 3.2.3 does not apply to any instance we discuss. We note that Reduction Rule 3.2.3 rejects a superset of instances that Reduction Rule 3.2.1 rejects. Reduction Rule 3.2.1 is described separately for clarity only.
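The three rejection rules translate directly into code. The following is a minimal sketch; the function name rejecting_rule and its return convention (the label of the first rule that fires, or None) are choices made for this illustration.

```python
from collections import Counter


def rejecting_rule(t, m, k):
    """Return the label of the first rejection rule that applies, else None."""
    # Reduction Rule 3.2.1: T is longer than M.
    if len(t) > len(m):
        return "3.2.1"
    # Reduction Rule 3.2.2: more than k deletions would be required.
    if len(m) > len(t) + k:
        return "3.2.2"
    # Reduction Rule 3.2.3: some symbol occurs more often in T than in M.
    # Counter subtraction keeps only symbols with a positive surplus.
    if Counter(t) - Counter(m):
        return "3.2.3"
    return None
```

Because a longer T always has a surplus of some symbol, every instance caught by the first check would also be caught by the third, which matches the remark above.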

Symbols that occur in M but not in T are always deleted.

Reduction Rule 3.2.4. Within instance I = (T, M, k) delete all symbols from M that do not occur in T . Reduce k by the number of deleted symbols.

The final rule we use from literature reduces matching prefix strings.

Reduction Rule 3.2.5. If T (1) = M (1), reduce I = (T, M, k) by removing T (1) and M (1).

Note that the above rule does not change k. Reduction Rule 3.2.5 follows from a theorem [1] proving T (1) optimally matches the first occurrence of T (1) in M . In a similar fashion to the proof in [1], we prove that T (|T |) optimally matches with the last occurrence of T (|T |) in M .

Theorem 3.2.1. There exists an optimal trace for I = (T, M, k) where T (|T |) = z matches the last M -occurrence of z.

Proof. (By Contradiction) Assume for i < j, M(i) = M(j) = z and T(|T|) matches M(i) in trace C = (I, O) (e.g., Figures 3.1a and 3.1c). M(i) is involved in exactly one swap for every non-deleted M-symbol to the right of i. Since swapping M(i) with M(j) is forbidden, M(j) is deleted. We modify C to create trace $C' = (I, O')$ such that $|O'| \le |O|$ and T(|T|) matches M(j) as follows: delete M(i) instead of M(j) and connect T(|T|) to M(j) (e.g., Figures 3.1b and 3.1d). M(j) is involved in exactly one swap for every non-deleted M-symbol to the right of j. Since i < j, M(j) is never involved in more swaps than M(i); therefore, it is always optimal to match T(|T|) to the last occurrence of T(|T|) in M.

We create our first new reduction rule by combining Reduction Rule 3.2.5 and Theorem 3.2.1.

Reduction Rule 3.2.6 (Head/Tail). While T (1) = M (1) or T (|T |) = M (|M |) remove these matching symbols from instance I = (T, M, k).

We conclude, Reduction Rule 3.2.6 is sound by [1] and Theorem 3.2.1.

Theorem 3.2.2. Reduction Rule 3.2.6 is sound.

Figure 3.1: Tail T-symbol matching examples, with panels (a) T(|T|) matches M(i) and |O| = 4, (b) T(|T|) matches M(j) and $|O'| = 3$ (both for T = abz, M = zaazb), (c) T(|T|) matches M(i) and |O| = 3, and (d) T(|T|) matches M(j) and $|O'| = 3$ (both for T = xyz, M = xzxzy)
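Reduction Rules 3.2.4 and 3.2.6 can be combined into a single preprocessing pass. The sketch below is an illustration under assumptions: the name reduce_instance is hypothetical, and the rules are re-applied until nothing changes because trimming T can leave M-symbols that no longer occur in T.

```python
def reduce_instance(t, m, k):
    """Exhaustively apply Reduction Rules 3.2.4 and 3.2.6 to (T, M, k)."""
    changed = True
    while changed:
        changed = False
        # Rule 3.2.4: delete M-symbols that do not occur in T, paying for each.
        alphabet = set(t)
        kept = "".join(c for c in m if c in alphabet)
        if kept != m:
            k -= len(m) - len(kept)
            m = kept
            changed = True
        # Rule 3.2.6 (head): remove matching first symbols; k is unchanged.
        while t and m and t[0] == m[0]:
            t, m = t[1:], m[1:]
            changed = True
        # Rule 3.2.6 (tail): remove matching last symbols; k is unchanged.
        while t and m and t[-1] == m[-1]:
            t, m = t[:-1], m[:-1]
            changed = True
    return t, m, k
```

A negative k in the result means more than k deletions were forced, so the caller should reject the instance.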

We assume Reduction Rules 3.2.1-3.2.6 do not apply to any instance we discuss. Therefore, all instances discussed henceforth satisfy the following structural properties:

1. A trace C exists for I = (T, M, k).

2. |T| ≤ |M| ≤ |T| + k.

3. The first and last symbols of M are involved in at least one operation each.

With six reduction rules under our belt, we now introduce terminology required to develop more complex reduction rules.

3.3 Terminology and Basic Properties

A trace line connects every T-symbol in an S2SC trace to an M-symbol. We call symbols connected by a trace line matching symbols.

Definition 3.3.1 (Matching Symbols). Given trace C = ((T, M, k), O), symbols T (i) and M (j) are matching symbols if they are trace line connected in C.

We use two terms, touched symbol and untouched symbol, to differentiate between a symbol that is involved in operations and one that is not. Untouched symbols are defined as follows:

Definition 3.3.2 (Untouched Symbol). Let trace C = ((T, M, k), O). An M -symbol is untouched if it is neither deleted nor swapped in C. A T -symbol is untouched if it matches an untouched M -symbol.


Naturally, all other symbols are touched symbols. Given matching untouched symbols (T (i), M (j)), a T -symbol on either the left or right side of i cannot match an M -symbol on the opposite side of j.

Property 3.3.1. Let trace C = ((T, M, k), O) contain matching untouched symbols T(i) and M(j). T(i) and M(j) divide C into two independent sections. That is, each T-symbol to the left of T(i) matches an M-symbol to the left of M(j) and each T-symbol to the right of T(i) matches an M-symbol to the right of M(j).

Proof. Since M (j) is not deleted and cannot be swapped, no M -symbol to the left of M (j) can match a T -symbol to the right of T (i). Likewise, no M -symbol to the right of M (j) can match a T -symbol to the left of T (i).

The following property of matching symbols has an important consequence for matching untouched symbols.

Theorem 3.3.1. Let T (i) match M (j) in trace C = ((T, M, k), O). If i > j, M (j) is swapped at least (i − j) times.

Proof. Since i > j, there are not enough M -symbols in M (1)...M (j − 1) to match all the T -symbols in T (1)...T (i − 1) (Figure 3.2a). The excess (i − j) T -symbols must match M -symbols that have an index greater than j. Each such M -symbol must swap, from the right side, to the left side of M (j). Therefore, M (j) is swapped at least i − j times.

Alternatively, we could state that matching the excess (i − j) T -symbols forms (i − j) line crossings with line (i, j).

Corollary 3.3.1. If T(i) and M(j) are matching untouched symbols, i ≤ j.

Proof. Follows from Theorem 3.3.1.

Theorem 3.3.2. Let trace C = ((T, M, k), O) contain matching untouched symbols T(i) and M(j). If i < j, exactly (j − i) M-symbols prior to index j are deleted.

Proof. There are (j − i) excess M-symbols in M(1)...M(j − 1) that cannot match a T-symbol in T(1)...T(i − 1) (Figure 3.2b). By Property 3.3.1, no such M-symbol can match a T-symbol to the right of T(i). Therefore, each excess M-symbol is deleted. No more deletes prior to M(j) occur since all remaining (i − 1) M-symbols prior to M(j) are required to match the T-symbols prior to T(i).
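Corollary 3.3.1 and Theorem 3.3.2 can be checked on small traces. In the sketch below, a trace is again encoded as a list with matching[i - 1] = j meaning T(i) is connected to M(j); this encoding and all function names are assumptions made for illustration. A matched M-symbol is treated as untouched exactly when its line crosses no other line.

```python
def untouched_pairs(matching):
    """Return the (i, j) pairs whose trace line crosses no other line."""
    pairs = list(enumerate(matching, start=1))
    result = []
    for i, j in pairs:
        # Two lines cross when their T-order and M-order disagree.
        crossed = any((i2 < i and j2 > j) or (i2 > i and j2 < j)
                      for i2, j2 in pairs)
        if not crossed:
            result.append((i, j))
    return result


def deletions_before(matching, j):
    """Count unmatched (hence deleted) M-symbols at indices below j."""
    matched = set(matching)
    return sum(1 for p in range(1, j) if p not in matched)


def check_untouched_properties(matching):
    """Check i <= j and exactly j - i deletions before j for untouched pairs."""
    return all(i <= j and deletions_before(matching, j) == j - i
               for i, j in untouched_pairs(matching))
```

For M = “cbcab” and T = “cab”, the hypothetical matching [3, 4, 5] leaves all three pairs untouched, and each has exactly two deleted M-symbols before it.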

Figure 3.2: Matching symbols T(i) and M(j), with panels (a) i > j and (b) i < j

For ease of presentation, we require a compact representation to identify substrings. We also wish to indicate when substrings match each other or contain no operations. To this end, we define regions, matching regions, and untouched regions as follows:

Definition 3.3.3 (Region). For string S, region $R^S_{i,\ell}$ identifies the substring $S(i) \ldots S(i + \ell - 1)$, of length $\ell$.

Definition 3.3.4 (Matching Regions). Let trace $C = ((T, M, k), O)$ contain regions $R^T_{i,\ell}$ and $R^M_{j,\ell}$. Regions $R^T_{i,\ell}$ and $R^M_{j,\ell}$ are matching regions if $T(i + x)$ matches $M(j + x)$ for all $0 \le x < \ell$.

Definition 3.3.5 (Untouched Region). Untouched regions are regions that contain only untouched symbols.
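The region notation maps onto ordinary substring slicing once the 1-based indexing is accounted for. The helpers below are a sketch only: the names region and are_matching_regions, and the list encoding of a trace (matching[i - 1] = j meaning T(i) is connected to M(j)), are assumptions for illustration.

```python
def region(s, i, length):
    """Return the region of s starting at 1-based index i with the given length."""
    if i < 1 or length < 0 or i + length - 1 > len(s):
        raise ValueError("region lies outside the string")
    return s[i - 1:i - 1 + length]


def are_matching_regions(matching, i, j, length):
    """Check that T(i + x) is connected to M(j + x) for all 0 <= x < length."""
    if i < 1 or i + length - 1 > len(matching):
        raise ValueError("T-region lies outside the trace")
    return all(matching[i - 1 + x] == j + x for x in range(length))
```

For example, region("victoria", 4, 3) selects the substring starting at the fourth symbol, and under the hypothetical matching [3, 4, 5] the T-region starting at index 1 and the M-region starting at index 3, both of length 3, are matching regions.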

It is important to note that matching symbols, untouched symbols, matching regions, and untouched regions are all defined relative to a specific trace. Different traces for the same instance can have very different sets of matching and untouched symbols/regions.

3.4 Yes-Instance Constraints

In this section, we explore yes-instance constraints. In particular, we prove all matching symbols, within a valid trace, match within a limited index range. We then prove all non-kernelized instances contain untouched regions. Our results culminate in the following guarantee for any valid trace C: within C, any vertical cross section of length t ≥ k + (k + n) · ℓ contains at least n non-overlapping untouched matching regions of length ℓ. Our kernelization algorithms rely on this guarantee to locate and reduce matching untouched regions.


We have already observed that for any untouched matching symbols (T (i), M (j)), i ≤ j. We now show that within a valid trace, parameter k limits the maximum distance between untouched matching symbols.

Theorem 3.4.1. Let trace C = ((T, M, k), O) contain matching untouched symbols T (i) and M (j). If C is valid, i ≤ j ≤ i + k.

Proof. By Corollary 3.3.1, i ≤ j . Theorem 3.3.2 ensures C is valid only if j ≤ i + k; otherwise, C would contain more than k deletes prior to j.

Corollary 3.4.1. Let trace C = ((T, M, k), O) contain matching symbols T (i) and M (j). If C is valid, |i − j| ≤ k.

Proof. C is valid only if i − j ≤ k by Property 3.3.1. Furthermore, C is valid only if j − i ≤ k by Property 3.3.2.

Theorem 3.4.1 and Corollary 3.4.1 are generalizable into results for matching regions within a valid trace. Matching regions R^T_{i,t} and R^M_{j,t} are uniquely identified by their starting indices and length. Alternately, expressing j as i plus an M-offset ∆ yields M-region R^M_{i+∆,t}. In a valid trace |∆| ≤ k since matching symbols T(i) and M(i + ∆) satisfy |i − (i + ∆)| ≤ k (Corollary 3.4.1). If R^T_{i,t} and R^M_{i+∆,t} are untouched in a valid trace, ∆ is bounded by 0 ≤ ∆ ≤ k (Theorem 3.4.1). We use these constraints to define matching ranges for regions.

Definition 3.4.1 (Matching Range). Let trace ((T, M, k), O) contain regions R^T_{i,t} and R^M_{i+∆,t}. Regions R^T_{i,t} and R^M_{i+∆,t} are within matching range if |∆| ≤ k.

Definition 3.4.2 (Untouched Matching Range). Let trace ((T, M, k), O) contain regions R^T_{i,t} and R^M_{i+∆,t}. Regions R^T_{i,t} and R^M_{i+∆,t} are within untouched matching range if 0 ≤ ∆ ≤ k.

Theorem 3.4.2. Let trace C contain matching untouched regions R^T_{i,t} and R^M_{i+∆,t}. If C is valid, R^T_{i,t} and R^M_{i+∆,t} are within untouched matching range.

Proof. (By Contradiction) Let R^T_{i,t} match R^M_{i+∆,t} and assume either ∆ > k or ∆ < 0. Then untouched symbol T(i) matches untouched symbol M(i + ∆), contradicting Theorem 3.4.1.

Our introductory statement that we find and reduce untouched substrings is not the whole story. We actually find and reduce pairs of matching untouched regions. These matching untouched regions must be within untouched matching range; if they are not, we reject the instance. We define such regions as untouched region pairs.


[Trace diagram for T and M, with M indexed 1 to 9]

Figure 3.3: Untouched M -regions one-to-one with URPs

Definition 3.4.3 (Untouched Region Pair). Let instance I = (T, M, k) contain equal regions R^T_{i,ℓ} and R^M_{i+∆,ℓ}. Regions R^T_{i,ℓ} and R^M_{i+∆,ℓ} form an untouched region pair (URP) U = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}) when the following condition is satisfied: if I is a yes-instance, R^T_{i,ℓ} and R^M_{i+∆,ℓ} are untouched matching regions within a valid trace.

Within a valid trace all matching untouched symbols are within untouched matching range. In this scenario, matching untouched regions always form an URP. We say an URP U = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}) is contained within regions R^T_{x,t} and R^M_{y,t′} when R^T_{i,ℓ} is contained in R^T_{x,t} and R^M_{i+∆,ℓ} is contained in R^M_{y,t′}. URPs range from length one to length |T|. The shortest possible URP is a single pair of matching symbols. The longest possible URP occurs when T is a substring of M. Regardless of size, URPs prevent any symbol on one side from matching with any symbol on the other (Property 3.3.1).

Given a trace C = ((T, M, k), O), untouched M-regions form a one-to-one relationship with URPs. Untouched T-regions do not, in general, have the same one-to-one relationship since M-symbols can be deleted. We illustrate this concept in Figure 3.3. M contains nine untouched regions: five of length one (R^M_{1,1}, R^M_{2,1}, R^M_{3,1}, R^M_{7,1}, R^M_{8,1}), three of length two (R^M_{1,2}, R^M_{2,2}, R^M_{7,2}), and one of length three (R^M_{1,3}). Each untouched M-region forms an URP with the untouched T-region that it matches. In addition to the nine untouched T-regions that match untouched M-regions, six untouched T-regions do not (R^T_{3,2}, R^T_{2,3}, R^T_{3,3}, R^T_{1,4}, R^T_{2,4}, R^T_{1,5}). We prove this one-to-one relationship holds for M in Lemma 3.4.1.

Lemma 3.4.1. Given valid trace C = ((T, M, k), O), every untouched region R^M_{i,ℓ} forms an URP with an untouched T-region.

Proof. (By Contradiction) Assume that the symbols of R^M_{i,ℓ} match a T-subsequence Q that is not an untouched region. Unless a symbol T(r) occurs between Q-symbols and matches an M-symbol outside of R^M_{i,ℓ}, Q is an untouched region. Let M(j) occur outside of R^M_{i,ℓ} and match T(r). Matching T(r) to M(j) forces a symbol of R^M_{i,ℓ} to swap with M(j). This contradicts the fact that R^M_{i,ℓ} is untouched; Q is an untouched region.


Since every untouched M-region forms an URP with a T-region, I = (T, M, k) contains an URP if M contains an untouched region. Every valid trace for I = (T, M, k) with |M| > 2k + 1 contains a length one URP since at most 2k symbols (two per swap) in M are touched. Guaranteeing an URP exists somewhere in some valid trace for I is not particularly useful on its own. We wish to narrow down where within I URPs must occur. Our first step is to narrow down where untouched regions must occur within a valid trace. From there we use Lemma 3.4.1 to prove that every vertical cross section of a valid trace that exceeds a minimum length contains an URP.

Definition 3.4.4 (Guaranteed Untouched Region Function). Let N denote the set of positive integers. Define the guaranteed untouched region function F : N^3 → N by F(k, ℓ, n) = k + (k + n) · ℓ.

We show that in any valid trace ((T, M, k), O), any region of length t ≥ F(k, ℓ, n) contains at least n non-overlapping untouched regions of length ℓ.

Theorem 3.4.3. For S ∈ {T, M} and t ≥ F(k, ℓ, n), let trace C = ((T, M, k), O) contain region R^S_{i,t}. If C is valid, R^S_{i,t} contains at least n non-overlapping untouched regions of length ℓ.

Proof. Let C = ((T, M, k), O) be a valid trace and S ∈ {T, M} be a string. Up to k operations occur in R^S_{i,t}. At most 2k symbols are involved in the k operations. Untouched symbols can exist to the left and right of each operation. That is, there exist up to k + 1 separate 'holes' in R^S_{i,t} that can hold untouched symbols. A hole that contains at least δℓ untouched symbols contains δ non-overlapping untouched regions of length ℓ. Conversely, a hole that contains exactly δ non-overlapping regions of length ℓ contains at most (δ + 1) · ℓ − 1 symbols. Assume R^S_{i,t} has p holes and q untouched symbols. Then p ≤ k + 1 and q ≥ k + (k + n) · ℓ − 2k = k · (ℓ − 1) + nℓ. Suppose the first p − 1 holes contain exactly r non-overlapping regions of length ℓ, r ≥ 0. Then these holes contain at most (p − 1)(ℓ − 1) + rℓ ≤ k · (ℓ − 1) + rℓ symbols. Thus the pth hole contains at least

q − [(p − 1)(ℓ − 1) + rℓ] ≥ k · (ℓ − 1) + nℓ − k · (ℓ − 1) − rℓ = (n − r) · ℓ    (3.1)

symbols. Hence, the pth hole contains at least (n − r) non-overlapping regions of length ℓ, and so R^S_{i,t} contains at least n non-overlapping regions of length ℓ.

We now take our second step towards narrowing down where URPs must occur within I. By Theorem 3.4.3 any length t ≥ F(k, ℓ, n) region R^M_{i+k,t} of a valid trace C = ((T, M, k), O) contains a set D of at least n non-overlapping length ℓ untouched regions. Consider an arbitrary element R^M_{j+∆,ℓ} ∈ D. Region R^M_{j+∆,ℓ} forms an URP with R^T_{j,ℓ} by Lemma 3.4.1. Since 0 ≤ ∆ ≤ k, regions R^T_{i,t+k} and R^M_{i+k,t} contain all URPs formed with elements of D.

We improve this result to show that any length t vertical cross section of a valid trace contains an URP of length ℓ. This theorem applies to all vertical cross sections, including cross sections where T ends prior to the end of the cross section. Theorem 3.4.4 provides the guarantee we reference at the beginning of this section.

Theorem 3.4.4 (Untouched Region Pair Existence). Let trace C = ((T, M, k), O) contain regions R^T_{x,min(t,|T|−x+1)} and R^M_{x,t} where t ≥ F(k, ℓ, n). If C is valid, regions R^T_{x,min(t,|T|−x+1)} and R^M_{x,t} contain at least n non-overlapping length ℓ URPs.

Proof. (By Contradiction) For convenience, let z = min(t, |T| − x + 1). Assume that R^T_{x,z} and R^M_{x,t} contain fewer than n non-overlapping length ℓ URPs.

Region R^M_{x,t} contains at least n non-overlapping untouched regions of length ℓ (Theorem 3.4.3). Denote an arbitrary set of n such regions as D. Select the element d ∈ D with the lowest starting index. Replace d with the leftmost untouched region R^M_{r_L+∆_L,ℓ} contained in R^M_{x,t}. Each element d_i ∈ D forms an URP (R^T_{r_i,ℓ}, R^M_{r_i+∆_i,ℓ}) with untouched region R^T_{r_i,ℓ} (Lemma 3.4.1). Let set P contain all such URPs. Index r_L must be less than x; otherwise, R^T_{x,z} and R^M_{x,t} contain all n elements of P.

By Property 3.3.2, ∆_L deletes occur prior to index r_L + ∆_L. Therefore, at most k − ∆_L operations occur within the length t′ = t − (r_L + ∆_L − x) region R^M_{r_L+∆_L,t′}. Since t′ ≥ F(k − ∆_L, ℓ, n + ∆_L), R^M_{r_L+∆_L,t′} contains at least n + ∆_L non-overlapping regions of length ℓ. Denote an arbitrary set of n + ∆_L such regions as D′. No element of D′ can form an URP with an untouched region R^T_{j,ℓ} where j < r_L. Set D′ contains at most ∆_L elements that match untouched T-regions with starting index j ∈ [r_L, r_L + ∆_L − 1]. The remaining n elements of D′ must match T-regions starting at indices after x. Therefore, R^T_{x,z} and R^M_{x,t} contain at least n non-overlapping length ℓ URPs.

We have identified two uses for the guaranteed untouched region function F(k, ℓ, n). First, it guarantees that regions of minimum length F(k, ℓ, n) contain at least n non-overlapping untouched regions of length ℓ in every valid trace. Second, it guarantees that in every valid trace a vertical cross section of minimum length F(k, ℓ, n) contains at least n non-overlapping URPs of length ℓ. We use F(k, ℓ, n) in both contexts throughout our reduction rule development.

Often it is useful to find ℓ, given k, x, and n, such that F(k, ℓ, n) ≤ x < F(k, ℓ + 1, n). That is, we wish to derive a function L(k, x, n) from F(k, ℓ, n) to find the maximum ℓ that satisfies x ≥ F(k, ℓ, n).

Theorem 3.4.5. L(k, x, n) = ⌊(x − k)/(k + n)⌋ calculates the maximum value of ℓ that satisfies x ≥ F(k, ℓ, n).

Proof. By the following three equations with integer i > 0, L(k, x, n) calculates ℓ when F(k, ℓ, n) ≤ x < F(k, ℓ + 1, n).

1. L(k, F(k, ℓ, n), n) = ⌊(k + (k + n) · ℓ − k)/(k + n)⌋ = ℓ.

2. L(k, F(k, ℓ, n) + i, n) = ⌊(k + (k + n) · ℓ + i − k)/(k + n)⌋ = ℓ + ⌊i/(k + n)⌋.

3. L(k, F(k, ℓ + 1, n) − 1, n) = L(k, F(k, ℓ, n) + (k + n − 1), n) = ℓ + ⌊(k + n − 1)/(k + n)⌋ = ℓ.
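The relationship between F and L can be checked numerically. The following Python sketch is only an illustration; the function names guaranteed_length and max_region_length are hypothetical labels for F (Definition 3.4.4) and L (Theorem 3.4.5), and the parameter values in the loop are arbitrary.

```python
def guaranteed_length(k: int, ell: int, n: int) -> int:
    """F(k, l, n) = k + (k + n) * l, as in Definition 3.4.4."""
    return k + (k + n) * ell


def max_region_length(k: int, x: int, n: int) -> int:
    """L(k, x, n) = floor((x - k) / (k + n)), as in Theorem 3.4.5."""
    return (x - k) // (k + n)


if __name__ == "__main__":
    k, n = 3, 1  # arbitrary illustrative parameters
    for ell in range(1, 6):
        lo = guaranteed_length(k, ell, n)
        hi = guaranteed_length(k, ell + 1, n)
        # Every x with F(k, l, n) <= x < F(k, l + 1, n) maps back to l.
        assert all(max_region_length(k, x, n) == ell for x in range(lo, hi))
```

Integer floor division mirrors the floor in the theorem, so no floating point arithmetic is involved.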

3.5 Oracle Access Kernelization

Our first kernelization algorithm (Algorithm 1) consists of two sub-problems: finding URPs and reducing them.

We use oracle O_p(·) to find URPs for us. On input instance I = (T, M, k), O_p(I) internally generates an optimal trace C. Oracle O_p(I) then outputs a length t ≥ L(k, |M|, 1) URP U contained in C if one exists. If C does not contain an URP of length t, O_p(I) returns the empty string.

If I is a yes-instance and |M| ≥ F(k, 1, 1), O_p(I) will always return a length t URP (Theorem 3.4.4). Therefore, if O_p(I) outputs the empty string, I is either shorter than F(k, 1, 1) or a no-instance.

We develop an URP reduction algorithm, Algorithm 2 (ReduceURP), in Section 3.5.2. For now, assume ReduceURP is a polynomial-time black box that conforms to Definition 3.5.1.

Definition 3.5.1 (ReduceURP). Let instance I = (T, M, k) contain an URP U of length t > k + 1. On input (I, U), ReduceURP creates instance I′ = (T′, M′, k) by replacing U with an URP of length t′ = k + 1 and returns I′, which is a yes-instance iff I is.


Algorithm 1 OracleKernelization
Require: k-S2SC instance I = (T, M, k).
1: while |M| > F(k, k + 2, 1) do
2:     U ← O_p(I)
3:     if U is the empty string then
4:         return no-instance
5:     I ← ReduceURP(I, U)
6: return I

3.5.1 Algorithm 1 Correctness

Based on our assumptions about ReduceURP, we are able to prove Algorithm 1 is correct.

Theorem 3.5.1. Given access to polynomial-time oracle O_p(·), Algorithm 1 is a polynomial-time O(k^2) kernelization of k-S2SC.

Proof. Let instance I = (T, M, k). If |M| < F(k, k + 2, 1), I is already bounded by O(k^2) and returned on line 6. If I is a yes-instance and |M| ≥ F(k, k + 2, 1), oracle O_p(I) will return an URP U of length t ≥ L(k, |M|, 1) ≥ k + 2 since one exists. If |M| ≥ F(k, k + 2, 1) and oracle O_p(I) returns the empty string, we return 'no-instance' at line 4 since I is a no-instance. Every URP U returned by O_p(I) is reduced by ReduceURP to an URP of length k + 1 at line 5. In a trace C for I, T contains fewer than |T| non-overlapping untouched regions of length k + 2; therefore, fewer than |T| iterations of the loop are executed before either |M| < F(k, k + 2, 1) or 'no-instance' is returned. If querying oracle O_p(·) is a polynomial-time operation, Algorithm 1 is a polynomial-time algorithm.

Unfortunately, we are unable to present an FPT algorithm that implements O_p(·), nor can we prove that the existence of such an FPT algorithm would cause a complexity class hierarchy collapse. We leave this as an open challenge.

3.5.2 Untouched Region Pair Reduction

In this section, we identify a new reduction rule that provides a valid basis for ReduceURP. We then give a valid algorithm for ReduceURP. We assume we are given an instance I and an URP U ← O_p(I). Consider the following natural question: “Since symbols in URPs do not require operations, is it valid to reduce them entirely from both T and M?” Unfortunately, such a naive approach can lead to false yes-instances. We exemplify this fault by considering no-instance I = (abccca, bacccba, 1).


[Figure 3.4 panels: (a) no-instance (k = 1), (b) yes-instance (k = 1)]

Figure 3.4: Naive removal of untouched region pair

Figure 3.4a shows that I's optimal trace C contains URP U = (R^T_{3,3}, R^M_{3,3}). Therefore, we expect ReduceURP(I, U) to output no-instance I′. However, if ReduceURP(I, U) naively reduces U from I, the result is yes-instance I′ (Figure 3.4b).
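The failure of naive removal can be confirmed by exhaustive search on this small example. The Python sketch below is not one of our algorithms; it is a brute-force check whose function name is_yes_instance is hypothetical. It assumes the basic S2SC operation set (delete one symbol of M, or swap two adjacent symbols of M) and explores every string reachable within k operations, so it is only practical for tiny instances.

```python
def is_yes_instance(target: str, mutable: str, k: int) -> bool:
    """Decide by exhaustive search whether `mutable` can become `target`
    using at most k single-symbol deletions or adjacent swaps.

    The search is exponential in k and intended only for toy instances.
    """
    frontier = {mutable}  # strings first reached after exactly `step` operations
    seen = {mutable}
    for _step in range(k + 1):
        if target in frontier:
            return True
        successors = set()
        for s in frontier:
            for i in range(len(s)):
                successors.add(s[:i] + s[i + 1:])  # delete s[i]
                if i + 1 < len(s):
                    # swap the adjacent symbols s[i] and s[i + 1]
                    successors.add(s[:i] + s[i + 1] + s[i] + s[i + 2:])
        frontier = successors - seen
        seen |= frontier
    return False


if __name__ == "__main__":
    print(is_yes_instance("abccca", "bacccba", 1))  # instance I
    print(is_yes_instance("aba", "baba", 1))        # I after naively removing "ccc"
```

On these two inputs the search reports a no-instance for I and a yes-instance for the naively reduced instance, which is the discrepancy described above.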

By reducing U we declare that U is an URP in every trace we are interested in from this point forward. This reduces the solution space to include only traces that do not contain any symbols on one side of U that match a symbol on the other. A valid implementation of ReduceURP must reduce U while ensuring that, within output instance I′, no symbol to the left of U matches a symbol to the right of U.

One possibility is to remove U and split I = (T, M, k) into two disconnected smaller instances I_Left and I_Right with a shared parameter. We do not consider this approach here as it changes the definition of the problem. Later, we investigate a problem variant that handles multiple S2SC instances with a shared parameter. For now, we need an approach that does not change the problem definition.

We already know a method of separating traces into independent left and right sections. URPs do exactly that. The trick is creating instance I′ by replacing U with a smaller URP that is guaranteed to be an URP in all valid traces of I′. To this end, we introduce separator strings and dividers.

Definition 3.5.2 (Separator String). For instance I = (T, M, k) and alphabet Σ containing all symbols of I, a separator string consists of k + 1 occurrences of a symbol φ ∉ Σ.

Definition 3.5.3 (Divider). Let S be a separator string for instance I = (T, M, k). For 0 ≤ ∆ ≤ k, inserting S at T(i) and M(i + ∆) forms divider D = (R^T_{i,k+1}, R^M_{i+∆,k+1}).

Theorem 3.5.2. Let trace C = ((T, M, k), O) contain divider D. D is an URP in every valid trace.

Proof. Let trace C = ((T, M, k), O) be valid. To prove that D = (R^T_{i,k+1}, R^M_{i+∆,k+1}) is an URP in C we show that (i.) R^T_{i,k+1} and R^M_{i+∆,k+1} are matching regions and (ii.) no symbol of D is touched.

(i.) Let the symbols of D be occurrences of φ. No occurrence of φ is deleted since T and M contain an equal number of occurrences. Furthermore, the ith occurrence of φ in M matches the ith occurrence of φ in T since swapping two φ symbols is disallowed. Therefore, R^T_{i,k+1} and R^M_{i+∆,k+1} are matching regions.

(ii.) No symbol M(j) to the left of R^M_{i+∆,k+1} matches a symbol to the right of R^T_{i,k+1} since M(j) cannot swap to the right of the k + 1 symbols of R^M_{i+∆,k+1}. Since R^T_{i,k+1} matches R^M_{i+∆,k+1}, M(j) cannot match any T-symbol with index x ≥ i. Therefore, no D-symbol swaps with M(j). Similarly, no D-symbol swaps with an M-symbol that occurs to the right of R^M_{i+∆,k+1}.

We conclude that D is an URP since R^T_{i,k+1} and R^M_{i+∆,k+1} match and are involved in no operations.

We now present our URP reduction rule:

Reduction Rule 3.5.1 (Exact Untouched Region Pair Reduction). Given instance I = (T, M, k) and URP U = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}), replace U with a divider.

Theorem 3.5.3. Reduction Rule 3.5.1 is sound.

Proof. Let trace C = (I, O) for instance I = (T, M, k) contain URP U = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}). We prove that I is a yes-instance if and only if applying Reduction Rule 3.5.1 to U yields yes-instance I′ = (T′, M′, k).

The only difference between I and I′ is that I contains URP U = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}) and I′ contains URP D = (R^T_{i,k+1}, R^M_{i+∆,k+1}) instead.

⇐: Assume I′ is a yes-instance. Let C′ be a valid trace for I′. The regions of C′ prior to D contain some number of operations o′_Left. The regions of C′ that make up D contain no operations (Theorem 3.5.2). The regions of C′ after D contain some number of operations o′_Right. Create trace C by replacing D in C′ with U. Trace C is a trace for I. Furthermore, C contains o′_Left operations to the left of U and o′_Right operations to the right of U. URP U itself contains no operations. Valid trace C contains o′_Left + o′_Right ≤ k operations.

⇒: Assume I is a yes-instance. Let C be a valid trace for I. Create a trace C′ by replacing U with a divider D. By the same reasoning as above, C′ is a trace for I′ containing k or fewer operations.

We now present a valid algorithm for ReduceURP (Algorithm 2). Whenever reducing U with Reduction Rule 3.5.1 does not decrease the length of I, ReduceURP does not modify I (Step 2). Otherwise, ReduceURP reduces the length of input I by applying Reduction Rule 3.5.1 to U (Step 3).

With this implementation of ReduceURP, Algorithm 1 is approaching a realizable kernelization approach. Unfortunately, as mentioned in Section 3.5.1, we are unable to present an FPT implementation of O_p(·). Our next kernelization algorithm (Algorithm 3) in Section 3.7 uses a weaker oracle O_w(·) in lieu of O_p(·). We then show in Section 3.8 that this weaker oracle has a polynomial-time implementation.

Algorithm 2 ReduceURP(I, U)
Require: k-S2SC instance I = (T, M, k), URP U guaranteed to exist in some optimal trace for I.
Ensure: k-S2SC instance I′ = (T′, M′, k).
1: if |U| ≤ k + 1 then
2:     return I
3: Create instance I′ by applying Reduction Rule 3.5.1 to U
4: return I′
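Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the types Instance and URP, the function name reduce_urp, and the use of "#" as the separator symbol φ are all hypothetical, and the caller is trusted to pass a region pair that really is an URP. Indices are 1-based to match the region notation.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    target: str   # T
    mutable: str  # M
    k: int


class URP(NamedTuple):
    i: int       # 1-based start index of the T-region
    delta: int   # M-offset; the M-region starts at i + delta
    length: int  # common length of both regions


def reduce_urp(inst: Instance, urp: URP, phi: str = "#") -> Instance:
    """Replace `urp` with a divider of k + 1 copies of `phi` (Reduction Rule 3.5.1)."""
    if phi in inst.target or phi in inst.mutable:
        raise ValueError("the separator symbol must not occur in T or M")
    t0 = urp.i - 1              # 0-based start in T
    m0 = urp.i + urp.delta - 1  # 0-based start in M
    t_region = inst.target[t0:t0 + urp.length]
    m_region = inst.mutable[m0:m0 + urp.length]
    if len(t_region) != urp.length or t_region != m_region:
        raise ValueError("the two regions of an URP must be equal")
    if urp.length <= inst.k + 1:
        return inst  # Step 2: a divider would not shorten the instance
    separator = phi * (inst.k + 1)
    new_target = inst.target[:t0] + separator + inst.target[t0 + urp.length:]
    new_mutable = inst.mutable[:m0] + separator + inst.mutable[m0 + urp.length:]
    return Instance(new_target, new_mutable, inst.k)


if __name__ == "__main__":
    # The running example: U = (R^T_{3,3}, R^M_{3,3}) in I = (abccca, bacccba, 1).
    print(reduce_urp(Instance("abccca", "bacccba", 1), URP(3, 0, 3)))
```

For the running example the three c symbols on each side are replaced by two separator symbols, so the instance shrinks by one symbol per string while the left and right parts stay separated.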

3.6 Multi String-to-String Correction

Before presenting our next kernelization algorithm, we revisit an idea mentioned in Section 3.5.2. During the development of Reduction Rule 3.5.1 we solved the problem of naive URP removal by using dividers. However, dividers were not the only suggested solution. Alternatively, we could have removed the URPs entirely and split the instance into two smaller independent instances with a shared parameter. We introduce the k-MultiStringToStringCorrection problem which captures this alternative approach.

k-Multi-String-To-String Correction (k-MS2SC)

Input: List D = [d_i | d_i = (T_i, M_i) and i > 0] where T_i and M_i are strings, and positive integer k.

Parameter: k

Question: For each (T_i, M_i) ∈ D, can M_i mutate into T_i using a total of at most k delete and swap operations summed over all such mutations?

A MS2SC trace C = ((D, k), O) is very similar to a S2SC trace. In k-MS2SC, however, no element of O can swap symbols from two different elements of D. Trace C presented as a diagram looks like a k-S2SC trace broken up into multiple parts. Consider Figure 3.5a, which gives the optimal trace for k-S2SC no-instance I = (abccca, bacccba, 1). This is the same figure as Figure 3.4a, which we used to illustrate the problems with naive URP removal. We now remove "ccc" and split I into k-MS2SC no-instance I′ = ([(ab, ba), (a, ba)], 1). Figure 3.5b shows the optimal, and still invalid, trace for I′.

[Figure 3.5 panels: (a) no-instance (k = 1), (b) no-instance (k = 1)]

Figure 3.5: Instance split reduction of untouched region pair

To see that MS2SC is NP-hard, observe that S2SC is a special case of MS2SC for |D| = 1. Thus, we can conclude the following theorem from Wagner’s NP-completeness result for S2SC [16].

Theorem 3.6.1. MS2SC is NP-complete.

Furthermore, we prove that k-MS2SC is in FPT.

Theorem 3.6.2. k-Multi-String-To-String Correction is in FPT.

Proof. Given a k-MS2SC instance I = (D, k), where d_i = (T_i, M_i) for each d_i ∈ D, construct a k-S2SC instance I′ = (T, M, k) as follows:

1. Create separator string S_{Σ,k} where Σ contains all symbols within elements of D

2. T = T_1 S_{Σ,k} T_2 S_{Σ,k} ... S_{Σ,k} T_{|D|}

3. M = M_1 S_{Σ,k} M_2 S_{Σ,k} ... S_{Σ,k} M_{|D|}

I is a k-MS2SC yes-instance if and only if I′ is a k-S2SC yes-instance. Let φ be the single symbol of S_{Σ,k}. Because φ occurs the same number of times in T as in M, no occurrences of φ can be deleted in a yes-instance. Since the length of S_{Σ,k} is k + 1, no symbol on one side of S_{Σ,k} can swap to the other side in a yes-instance. Therefore, in a yes-instance, every ith occurrence of S_{Σ,k} in T forms an URP with the ith occurrence of S_{Σ,k} in M. Given a trace for I′, split the trace into pieces by


Corollary 3.6.1. Removing an URP from a k-S2SC yes-instance I = (T, M, k) yields a two part k-MS2SC yes-instance.

A k-MS2SC trace is optimal exactly when each independent part of the trace is an optimal S2SC trace. Therefore, optimal S2SC reductions are also optimal when applied to an independent k-MS2SC part. Each URP found in a part of an n-part k-MS2SC instance I_M can be reduced by either proposed reduction method. If a divider is used, the result is a new n-part k-MS2SC instance I′_M. If the URP is reduced by splitting, the result is a new (n + 1)-part k-MS2SC instance I′_M. We can in fact remove a divider and split the instance or join two parts together with a divider. Both the type of problem instance and the reduction methods are completely interchangeable. We generally choose to only represent problems as k-S2SC instances. We only use k-MS2SC instances to explain concepts that are easier to view from a different perspective. For example, the head/tail Reduction Rule 3.2.6 combined with k-MS2SC instances instantly increases the efficiency of Reduction Rule 3.5.1.
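The interchangeability of the two representations can be sketched in Python. The code below is an illustration only: the function names join_parts and split_parts and the separator symbol "#" standing in for φ are hypothetical, and split_parts assumes that every occurrence of the separator symbol belongs to a divider of exactly k + 1 symbols.

```python
from typing import List, Tuple


def join_parts(parts: List[Tuple[str, str]], k: int, phi: str = "#") -> Tuple[str, str, int]:
    """Join k-MS2SC parts (T_i, M_i) into one k-S2SC instance (T, M, k),
    placing a separator string of k + 1 copies of `phi` between parts."""
    if any(phi in t or phi in m for t, m in parts):
        raise ValueError("the separator symbol must not occur in any part")
    separator = phi * (k + 1)
    target = separator.join(t for t, _ in parts)
    mutable = separator.join(m for _, m in parts)
    return target, mutable, k


def split_parts(target: str, mutable: str, k: int, phi: str = "#") -> List[Tuple[str, str]]:
    """Split a k-S2SC instance at its dividers into k-MS2SC parts."""
    separator = phi * (k + 1)
    t_parts = target.split(separator)
    m_parts = mutable.split(separator)
    if len(t_parts) != len(m_parts):
        raise ValueError("T and M must contain the same number of dividers")
    return list(zip(t_parts, m_parts))


if __name__ == "__main__":
    joined = join_parts([("ab", "ba"), ("a", "ba")], 1)
    print(joined)
    print(split_parts(*joined))
```

Joining the two parts of the split example with a divider and splitting again returns the original parts, which mirrors the round trip described above.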

Theorem 3.6.3. Let trace C = (I, O) for k-S2SC instance I = (T, M, k) contain URP U = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}). Equal regions R^T_{j,t} and R^M_{j+∆,t} that contain U form URP U′ = (R^T_{j,t}, R^M_{j+∆,t}) in some trace C′. Trace C′ is valid if and only if C is valid.

Proof. First we split I into k-MS2SC instance I_M by removing U. Equivalently, we split trace C into k-MS2SC trace C_M. At this point we already know that C is valid if and only if C_M is valid. Apply head/tail Reduction Rule 3.2.6 to each part of I_M to form instance I′_M. Since this reduction is optimal, instance I′_M requires no more operations to solve than instance I_M. Equivalently, I′_M has a trace C′_M that contains no more operations than C_M. Now convert I′_M and C′_M into k-S2SC instance I′ and trace C′ by joining the two parts with a divider. Trace C is a valid trace if and only if C′ is a valid trace. In case it is not yet clear why U′ is an URP, we can go one step further. Replace the divider in C′ with U′ and match the regions of U′. URP U′ contains no operations as the regions were defined to be equal. We have just created a trace for I = (T, M, k) that contains U′ as an URP.

We do not actually need to perform all these conversions in practice. Theorem 3.6.3 guarantees that the following strategy is valid. Before applying Reduction Rule 3.5.1 to an URP U, first expand U to include any immediately adjacent matching regions. This ensures that Reduction Rule 3.5.1 reduces as many symbols as possible based on the selection of U.


[Three trace diagrams over T and M]

Figure 3.6: Valid traces for yes-instance I = (“aazcb”, “zaaccbz”, 4)

Unless explicitly stated otherwise, all instances from this point forward will once again be k-S2SC instances.

3.7 Weaker Oracle Access Kernelization

We now modify our first kernelization algorithm to require access to a weaker oracle O_w(·) (Algorithm 3). Oracle O_w(·) is O_p(·)'s lazy brother. On input instance I = (T, M, k), O_w(I) does not internally generate an optimal trace for I as O_p(I) does. Instead, O_w searches T and M for equal regions of length t = L(k, |M|, 1) that are within untouched matching range of each other. If appropriate regions are found, O_w(·) returns them. Otherwise, O_w(·) returns the empty string. From an outside perspective, O_w(·)'s output is indistinguishable from O_p(·)'s output. We are no longer able to blindly apply Reduction Rule 3.5.1 because the returned value may or may not be an URP. Since this output is indistinguishable from an URP, we call these pairs of regions potential untouched region pairs.

Definition 3.7.1 (Potential Untouched Region Pair). Given instance I = (T, M, k), P = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}) is a potential untouched region pair (PURP) when equal regions R^T_{i,ℓ} and R^M_{i+∆,ℓ} are within untouched matching range constraints.

PURPs satisfy the matching range and substring equivalence constraints of an URP. Therefore, every URP is a PURP. If an instance I = (T, M, k) does not have a length t = L(k, |M|, 1) PURP, I is a no-instance as it does not contain a length t URP.

All PURPs have the potential to be URPs in a given trace but are not guaranteed to be. In valid traces a PURP P may be an URP within one trace, many traces, or no traces at all. Consider Figure 3.6. There are two PURPs of length two: P = (R^T_{1,2}, R^M_{1+1,2}) and P′ = (R^T_{4,2}, R^M_{4+1,2}). In the first two traces P is an URP and P′ is not. Neither P nor P′ is an URP in the last trace.
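The search performed by the weaker oracle can be illustrated with a short Python sketch. It is a straightforward scan under our reading of Definition 3.7.1 (equal regions whose M-offset ∆ satisfies 0 ≤ ∆ ≤ k), not the polynomial-time implementation developed later; the function name find_purps is hypothetical, and it returns every PURP of the requested length rather than a single one.

```python
from typing import List, Tuple


def find_purps(target: str, mutable: str, k: int, length: int) -> List[Tuple[int, int]]:
    """Return (i, delta) for every PURP (R^T_{i,length}, R^M_{i+delta,length}),
    where i is a 1-based index and 0 <= delta <= k."""
    found = []
    for i in range(1, len(target) - length + 2):
        t_region = target[i - 1:i - 1 + length]
        for delta in range(k + 1):
            j = i + delta  # 1-based start of the candidate M-region
            if j + length - 1 > len(mutable):
                break  # larger offsets only move further past the end of M
            if mutable[j - 1:j - 1 + length] == t_region:
                found.append((i, delta))
    return found


if __name__ == "__main__":
    # Instance of Figure 3.6: I = ("aazcb", "zaaccbz", 4), PURPs of length two.
    print(find_purps("aazcb", "zaaccbz", 4, 2))
```

On the instance of Figure 3.6 the scan finds exactly the two length-two PURPs named above, both with M-offset 1.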

Our second kernelization algorithm (Algorithm 3) replaces O_p(·) with O_w(·) at line 2 and ReduceURP with ReducePURP at line 5 (Section 3.7.1). We also delay determination of the loop condition variable X_w at line 1 until Section 3.7.4.

Algorithm 3 Kernelization 2
Require: k-S2SC instance I = (T, M, k).
1: while |M| ≥ X_w do
2:     P ← O_w(I)
3:     if P is the empty string then
4:         return no-instance
5:     I ← ReducePURP(I, P)
6: return I

3.7.1 Potential Untouched Region Pair Reduction Opportunities

We require a better understanding of PURPs before we are able to develop an algorithm for ReducePURP. We already know that URPs of length k + 1 or less are not reducible with our current approach. Therefore, we only concern ourselves with PURPs of length greater than k + 1. Specifically, we focus on PURPs of length at least t = F(k, ℓ, 1) + ∆ where ∆ is the M-offset of the PURP. Any such PURP P = (R^T_{i,t}, R^M_{i+∆,t}) contains regions R^T_{i+∆,t−∆} and R^M_{i+∆,t−∆} that (due to Theorem 3.4.4) contain at least one length ℓ URP in every valid trace.

Lemma 3.7.1. Every length t ≥ F(k, ℓ, 1) PURP P contains at least t − ℓ + 1 length ℓ PURPs.

Proof. Let P = (R^T_{i,t}, R^M_{i+∆,t}) and let set 𝒫 contain all length ℓ PURPs within P. Since R^T_{i,t} equals R^M_{i+∆,t}, every region R^T_{j,n} contained in R^T_{i,t} equals region R^M_{j+∆,n}. Therefore, 𝒫 ⊇ {P_1 = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}), P_2 = (R^T_{i+1,ℓ}, R^M_{i+1+∆,ℓ}), ..., P_{t−ℓ+1} = (R^T_{i+t−ℓ,ℓ}, R^M_{i+t−ℓ+∆,ℓ})}.

Theorem 3.7.1. For yes-instance I = (T, M, k), every length t ≥ F(k, ℓ, 1) + ∆ PURP P = (R^T_{i,t}, R^M_{i+∆,t}) that contains exactly t + ∆ − ℓ + 1 length ℓ PURPs contains a length ℓ URP U = (R^T_{j,ℓ}, R^M_{j+∆,ℓ}) with the same M-offset ∆.

Proof. Let set 𝒫 contain all length ℓ PURPs within P. From the proof of Lemma 3.7.1, 𝒫 ⊇ 𝒫′ = {P_1 = (R^T_{i,ℓ}, R^M_{i+∆,ℓ}), P_2 = (R^T_{i+1,ℓ}, R^M_{i+1+∆,ℓ}), ..., P_{t−ℓ+1} = (R^T_{i+t−ℓ,ℓ}, R^M_{i+t−ℓ+∆,ℓ})}. Every element of 𝒫 has an M-offset ∆. Any URP contained in P must therefore have an M-offset ∆. By Theorem 3.4.4 there exists an URP U = (R^T_{j,ℓ}, R^M_{j+∆,ℓ}) contained within PURP P = (R^T_{i,t}, R^M_{i+∆,t}).

The conditions of Theorem 3.7.1 are satisfied only if each region R^T_{j,ℓ} of P can form a PURP with only one M-region of P. Since we know the exact M-offset of an URP contained in a PURP that satisfies Theorem 3.7.1, we call such a PURP an exact potential untouched region pair. In contrast, we call a PURP that does not satisfy Theorem 3.7.1 an inexact potential untouched region pair.

Our current definitions of exact and inexact PURPs, while valid, are ambiguous. Consider an instance I = (T, M, k) that contains a PURP P = (R^T_{i,t}, R^M_{i+∆,t}) of length t = F(k, 10, 1) + ∆. Since t ≥ F(k, ℓ, 1) + ∆ for 0 < ℓ ≤ 10, the exactness of P varies depending on the choice of ℓ. Although there is nothing inherently wrong with our current definition, we prefer Definition 3.7.2. Definition 3.7.2 uses the maximum possible value of ℓ to determine the exactness of a PURP, removing all ambiguity.

Definition 3.7.2 (Exact Potential Untouched Region Pair). Let instance I = (T, M, k) contain a PURP P with M-offset ∆ and length t. We call P an exact potential untouched region pair (exact PURP) if for ℓ = L(k, t − ∆, 1), P contains exactly t + ∆ − ℓ + 1 length ℓ PURPs.

We note that all PURPs that satisfy Theorem 3.7.1 are exact PURPs by Definition 3.7.2. The longer a T-region becomes, the more likely it is to form a PURP with only one M-region. Another important feature of exact PURPs is that we can decide whether a PURP is exact or not in polynomial time.

Theorem 3.7.2. Let instance I = (T, M, k) contain PURP P = (R^T_{i,t}, R^M_{i+∆,t}). The exactness of P is decidable in polynomial time.

Proof. Let ℓ = L(k, t − ∆, 1). Since P is a PURP, we already know that every region R^T_{j,ℓ} contained in P forms a PURP with region R^M_{j+∆,ℓ}. If any region R^T_{j,ℓ} contained in P also forms a PURP with a region R^M_{j+∆′,ℓ} where ∆ ≠ ∆′, P is inexact; otherwise, P is exact.

There are t + ∆ − ℓ + 1 < |T| + ∆ − ℓ + 1 unique T-regions of length ℓ contained in P. Check whether any of the t + ∆ − ℓ + 1 unique T-regions can form a PURP with any of the at most k M-regions contained in P that have M-offsets ∆′ ≠ ∆. This algorithm only requires a polynomial number of polynomial-sized string comparisons to decide exactness.
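The check described in this proof can be sketched in Python. The sketch is an illustration under stated assumptions rather than a definitive procedure: the function name is_exact_purp is hypothetical, indices are 1-based, and we read "M-regions contained in P" as length ℓ regions lying entirely inside R^M_{i+∆,t} whose offset ∆′ from the T-region satisfies 0 ≤ ∆′ ≤ k.

```python
def is_exact_purp(target: str, mutable: str, k: int, i: int, delta: int, t: int) -> bool:
    """Decide exactness of PURP P = (R^T_{i,t}, R^M_{i+delta,t}) by direct comparison."""
    if not 0 <= delta <= k:
        raise ValueError("P is not within untouched matching range")
    if target[i - 1:i - 1 + t] != mutable[i + delta - 1:i + delta - 1 + t]:
        raise ValueError("the regions of P are not equal")
    ell = (t - delta - k) // (k + 1)  # l = L(k, t - delta, 1)
    if ell < 1:
        raise ValueError("P is too short: t must be at least F(k, 1, 1) + delta")
    m_first = i + delta          # first index of the M-region of P
    m_last = i + delta + t - 1   # last index of the M-region of P
    for j in range(i, i + t - ell + 1):  # start of each length-l T-region in P
        t_region = target[j - 1:j - 1 + ell]
        for other in range(k + 1):       # candidate offsets delta' != delta
            if other == delta:
                continue
            start = j + other
            if start < m_first or start + ell - 1 > m_last:
                continue  # the M-region would not lie inside P
            if mutable[start - 1:start - 1 + ell] == t_region:
                return False  # a second matching M-region makes P inexact
    return True


if __name__ == "__main__":
    print(is_exact_purp("abc", "abc", 1, 1, 0, 3))
    print(is_exact_purp("aaa", "aaa", 1, 1, 0, 3))
```

The two nested loops perform at most (t − ℓ + 1) · k comparisons of length ℓ strings, in line with the polynomial bound argued above. With k = 1, the pair over "abc" is exact because each symbol matches only at offset 0, while the pair over "aaa" is inexact because each symbol also matches at offset 1.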


Since we have identified two types of PURPs, we revise our reduction strategy to deal with each type separately. We outline our high level algorithm for ReducePURP in Algorithm 4.

Algorithm 4 ReducePURP
Require: k-S2SC instance I = (T, M, k) and PURP P = (R^T_{i,t}, R^M_{i+∆,t}).
if P is an exact PURP then
    I′ ← ReduceExactPURP(I, P)
else
    I′ ← ReduceInexactPURP(I, P)
return I′

In the next section we show how to implement ReduceExactPURP using Reduction Rule 3.5.1. ReduceInexactPURP requires a new reduction rule, presented in Section 3.7.3.

3.7.2 Exact Potential Untouched Region Pair Reduction

Given a length t ≥ F(k, ℓ, 1) + ∆ exact PURP P = (R^T_{i,t}, R^M_{i+∆,t}), we know it contains URP U = (R^T_{j,ℓ}, R^M_{j+∆,ℓ}) for some j. We know almost everything about U. We know U's length, M-offset, and range of possible starting index values. Our knowledge of U is sufficient to prove that if I is a yes-instance, at least one valid trace contains an URP equal to P.

We use the ability to convert k-S2SC yes-instances into k-MS2SC and vice versa to prove the following theorem.

Theorem 3.7.3. For yes-instance I = (T, M, k) containing a length $t \geq F(k, \ell, 1) + \Delta$ exact PURP $P = (R^T_{i,t}, R^M_{i+\Delta,t})$, at least one valid trace containing URP U = P exists.

Proof. We already know that the set $\mathcal{P}$ of all length $\ell$ PURPs contained in P has exactly $t - \ell + 1$ elements. We create a set S of two-part k-MS2SC instances: each element of S is obtained by removing a single distinct element of $\mathcal{P}$ from P. If I is a k-S2SC yes-instance, at least one element of S is a k-MS2SC yes-instance (Corollary 3.6.1). We now show that all elements of S are equivalent.

Consider an arbitrary element s of S formed by removing PURP $P' = (R^T_{j,\ell}, R^M_{j+\Delta,\ell})$ from P. Since P′'s M-offset is equal to P's M-offset, applying head/tail reduction to s removes all of P's symbols and yields k-MS2SC instance s′. Head/tail reduction may remove even more symbols, but for the purpose of this proof we will stop reducing once all of P's symbols have been removed. Applying head/tail Reduction Rule 3.2.6 to every element of S reduces every element to s′. One element of S is known to be valid when I is a yes-instance. Optimally head/tail reducing this element results in s′. Therefore, s′ is a yes-instance when I is a yes-instance. Joining the two parts of s′ with a divider yields yes-instance I′. Given a valid trace for I′ we construct a valid trace for I by replacing the divider with URP U = P.

By Theorem 3.7.3, Reduction Rule 3.5.1 can replace exact PURP P contained in instance I with a divider.

We conclude that we do not need an additional reduction rule to handle exact PURPs. We only need to apply the existing Reduction Rule 3.5.1 to the entire exact PURP.
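As a rough illustration of what replacing an exact PURP with a divider looks like on the strings themselves, consider the sketch below. It is written under several assumptions that are not taken from the text: indices are 0-based, both components of the PURP are replaced by the same divider string, and the divider itself is supplied by the caller (its actual form is fixed by Reduction Rule 3.5.1). The function name is hypothetical.

```python
def replace_pair_with_divider(T, M, i, t, delta, divider):
    """Replace T[i:i+t] and M[i+delta:i+delta+t] with a caller-supplied divider.

    Assumption: the two regions are equal strings (they form a region pair).
    """
    t_region = T[i:i + t]
    m_region = M[i + delta:i + delta + t]
    if len(t_region) != t or t_region != m_region:
        raise ValueError("the two regions do not form a matching pair")
    new_T = T[:i] + divider + T[i + t:]
    new_M = M[:i + delta] + divider + M[i + delta + t:]
    return new_T, new_M


# The shared region "abcdef" sits at index 1 in T and at index 2 in M (offset 1).
print(replace_pair_with_divider("xabcdefy", "zxabcdefyq", 1, 6, 1, "|"))
```

Both strings shrink by t minus the divider length, so the reduction pays off whenever the PURP is longer than the divider.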

3.7.3 Inexact Potential Untouched Region Pair Reduction

For $t \geq F(k, \ell, 1) + \Delta$, inexact PURP $P = (R^T_{x,t}, R^M_{x+\Delta,t})$ contains at least one T-region $R^T_{i,\ell}$ that forms PURPs with more than one M-region. Each PURP that contains non-unique T-component $R^T_{i,\ell}$ has a very specific structure. In fact, Theorem 3.7.4 proves they are all patterned PURPs.

Definition 3.7.3 (Patterned PURP). A p-patterned PURP in given instance I = (T, M, k) is a PURP that consists exclusively of one or more occurrences of some shortest pattern string p, |p| ≤ k, where the final occurrence of p may be truncated.

Theorem 3.7.4. Given instance I = (T, M, k), PURP $P = (R^T_{i,\ell}, R^M_{i+\Delta,\ell})$ is the only PURP to contain component $R^T_{i,\ell}$ unless P is patterned.
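The pattern string p of Definition 3.7.3 is the prefix whose length is the shortest period of the region's string, and it can be found with the standard prefix-function (KMP failure function) computation. The sketch below is illustrative only; the function names are hypothetical, and it assumes that both components of a PURP spell the same string, so testing one component suffices.

```python
def shortest_pattern(s):
    """Return the shortest p such that s is p repeated, last copy possibly truncated."""
    n = len(s)
    if n == 0:
        return ""
    # pi[q] = length of the longest proper prefix of s[:q+1] that is also its suffix.
    pi = [0] * n
    for q in range(1, n):
        j = pi[q - 1]
        while j > 0 and s[q] != s[j]:
            j = pi[j - 1]
        if s[q] == s[j]:
            j += 1
        pi[q] = j
    period = n - pi[-1]  # the smallest period of s
    return s[:period]


def is_patterned(s, k):
    """Patterned in the sense of Definition 3.7.3: shortest pattern has length <= k."""
    return 0 < len(shortest_pattern(s)) <= k


print(shortest_pattern("abcabcab"))  # the final copy of the pattern is truncated
print(is_patterned("abcabcab", 3), is_patterned("abcd", 3))
```

The computation runs in time linear in the region length, so testing whether a region is patterned adds no more than polynomial overhead.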

Proof. We first observe that if |P | ≤ k, P is trivially patterned. In this case, P either consists of multiple repetitions of a pattern string p where |p| < |P | or a single occurrence where |p| = |P |.

Now consider when |P| > k. Consider two PURPs $P = (R^T_{i,\ell}, R^M_{i+\Delta,\ell})$ and $P' = (R^T_{i,\ell}, R^M_{i+\Delta',\ell})$ with $\ell \geq k + 1$ and $\Delta < \Delta'$. Assume to the contrary that P and P′ are not patterned. Divide $R^T_{i,\ell}$, $R^M_{i+\Delta,\ell}$, and $R^M_{i+\Delta',\ell}$ into length $\Delta' - \Delta$ substrings, with the last substring possibly shorter than $\Delta' - \Delta$. Label the $R^T_{i,\ell}$-substrings with $(\Psi_1, \ldots, \Psi_n)$, the $R^M_{i+\Delta,\ell}$-substrings with $(\omega_1, \ldots, \omega_n)$, and the $R^M_{i+\Delta',\ell}$-substrings with $(\omega'_1, \ldots, \omega'_n)$ as depicted in Figure 3.7. Regions $R^M_{i+\Delta,\ell}$ and $R^M_{i+\Delta',\ell}$ must overlap each other since $0 \leq \Delta < \Delta' \leq k < \ell$. Because the regions $R^M_{i+\Delta,\ell}$ and $R^M$
