A derivation of the Knuth-Morris-Pratt pattern matching program

Citation for published version (APA):

Cai Chengdian, N. V. (1985). A derivation of the Knuth-Morris-Pratt pattern matching program. (EUT report. WSK, Dept. of Mathematics and Computing Science; Vol. 85-WSK-02). Eindhoven University of Technology.

Document status and date: Published: 01/01/1985


ONDERAFDELING DER WISKUNDE EN INFORMATICA

DEPARTMENT OF MATHEMATICS AND COMPUTING SCIENCE

A Derivation of the Knuth-Morris-Pratt Pattern Matching Program

by

Cai Chengdian

EUT Report 85-WSK-02 ISSN 0167-9708

Department of Computer Science, Jinan University, Guangzhou, Guangdong, China

Department of Mathematics and Computing Science, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands

Abstract

The purpose of this note is to produce a formal derivation of the Knuth-Morris-Pratt pattern matching program.


Introduction

Pattern matching is one of the fundamental operations on strings, and several algorithms that solve the problem have been developed. Among these, the algorithm of Knuth, Morris and Pratt ([1]) is well known; its running time is linear in the length of the text and its storage requirements are linear in the length of the pattern. It requires, however, some complicated processing of the pattern that is difficult to understand, and this has limited the extent to which the program is used ([2]). In [1] the program is introduced by a play-by-play description of an example, and its formal correctness is not established in detail. In [3] a formal treatment of the program is given, but the processing of the pattern is not worked out in detail there either, and it differs somewhat from [1]. In this note we apply the formal techniques of [3] to produce a proof and a program for the pattern matching problem. The processing of the pattern is essentially the same as in [1].

Notational interlude and a property

The expression

$$(\mathbf{MIN}\ i : O(i) : E(i))$$

denotes the minimum value of $E(i)$ over all $i$ satisfying $O(i)$.

For $r \le s$, $x(r,s)$ denotes the sequence $x(r), x(r+1), \ldots, x(s-1)$ of $s - r$ elements.

For two sequences $x(s,s+k)$ and $y(r,r+k)$,

$$x(s,s+k) = y(r,r+k)$$

denotes

$$(\mathbf{A}\ i : 0 \le i < k : x(s+i) = y(r+i)),$$

and

$$x(s,s+k) \ne y(r,r+k)$$

denotes

$$(\mathbf{E}\ i : 0 \le i < k : x(s+i) \ne y(r+i)).$$

From the above, we have for $k > 0$ the following property:

$$0 < q \le k \;\wedge\; x(s,s+k) = y(r,r+k) \;\Rightarrow\; x(s+q,s+k) = y(r+q,r+k). \qquad (0)$$

The notations for proof structures and programs have been adapted from [3], [4] and [5].

The development of the pattern matching program

For two integer sequences $p(0,M)$ and $t(0,N)$ ($M > 1$ and $N \ge M$), a program is requested to determine whether $p$ occurs as a contiguous subsequence of $t$ and, if so, to find the position of the first occurrence of $p$ in $t$. Formally, our program has to establish

$$R1 \ \mathbf{cor}\ R0,$$

where

$$R1:\ (\mathbf{A}\ u :: \neg\,\mathrm{match}(u)),$$

$$R0:\ k = (\mathbf{MIN}\ u : \mathrm{match}(u) : u),$$

and

$$\mathrm{match}(u):\ t(u,u+M) = p(0,M) \;\wedge\; 0 \le u \le N - M.$$
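The specification can be read as an executable reference against which a derived program may be checked. The Python sketch below is only an illustration of the postcondition R1 cor R0, not part of the derivation; the names `match` and `first_match`, the use of Python lists for p and t, and the use of `None` to signal R1 are assumptions introduced here.

```python
def match(p, t, u):
    """match(u): t(u, u+M) = p(0, M) and 0 <= u <= N - M."""
    M, N = len(p), len(t)
    return 0 <= u <= N - M and t[u:u + M] == p


def first_match(p, t):
    """Return the k of R0 (the smallest u with match(u)), or None when R1 holds."""
    for u in range(len(t) - len(p) + 1):
        if match(p, t, u):
            return u
    return None
```

This direct search takes a number of comparisons proportional to M*N in the worst case, which is the cost the invariant-based development aims to avoid.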

To this end, we introduce the invariant $P0 \wedge P1$, where

$$P0:\ t(k,k+j) = p(0,j) \;\wedge\; 0 \le j \le M \;\wedge\; 0 \le k \le N - j,$$

and

$$P1:\ (\mathbf{A}\ u : 0 \le u < k : \neg\,\mathrm{match}(u)).$$

$P0$ and $P1$ derive their importance from

$$P0 \wedge P1 \wedge j = M \;\Rightarrow\; R0$$

and

$$P1 \wedge k > N - M \;\Rightarrow\; R1.$$

$P0 \wedge P1$ is trivially established by "k,j := 0,0", and we can sketch our program as follows:

    |[ k, j: int
       p(i: 0 ≤ i < M), t(i: 0 ≤ i < N): array of int
     ; k,j := 0,0 {P0 ∧ P1}
     ; do j ≠ M ∧ k ≤ N − M →
          "increase the bound function 2*k + j of j and k under invariance of P0 ∧ P1"
       od {P0 ∧ P1 ∧ (j = M ∨ k > N − M), hence R1 cor R0}
    ]|

Now we investigate increases of $j$ or $k$ that cause an increase of the bound function.

Because $j$ does not occur in $P1$, for an increase of $j$ we need to consider $P0$ only; an increase of $j$ by one is adequate. And we derive

$$\begin{aligned}
& P0^{j}_{j+1} \\
=\ \ & t(k,k+j+1) = p(0,j+1) \;\wedge\; 0 \le j+1 \le M \;\wedge\; 0 \le k \le N - (j+1) \\
\Leftarrow\ \ & P0 \;\wedge\; t(k+j) = p(j) \;\wedge\; j \ne M \;\wedge\; k \le N - M.
\end{aligned}$$

If $B$ holds, where

$$B:\ t(k+j) \ne p(j),$$

we know that if $k$ were kept constant, an increase of $j$ would destroy $P0$; thus we increase $k$. Because of

$$B \;\Rightarrow\; \neg\,\mathrm{match}(k), \qquad (1)$$

an increase of $k$ by 1 maintains $P1$ under $B$, and it trivially maintains $P0$ for $j = 0$.

As a consequence, "increase the bound function 2*k + j of j and k under invariance of P0 ∧ P1" can be refined as the following alternative construct:

    if t(k+j) = p(j) → j := j + 1 {P0 ∧ P1}
    [] j = 0 ∧ t(k+j) ≠ p(j) → k := k + 1 {P0 ∧ P1}
    [] j ≠ 0 ∧ t(k+j) ≠ p(j) → ?
    fi                                                  (2)

But for $j \ne 0 \wedge B$, $P0$ possibly allows a further increase of $k$ in view of what follows. For $0 < i \le j$,

$$t(k,k+j) = p(0,j) \;\Rightarrow\; \{(0)\} \quad t(k+i,k+j) = p(i,j). \qquad (3)$$

For $j \ne 0$, the equation

$$i:\ p(i,j) = p(0,j-i) \;\wedge\; 0 < i \le j \qquad (4)$$

has at least one solution, e.g. $i = j$.

Thanks to (3), we have for each solution $i$ of (4)

$$P0 \;\Rightarrow\; t(k+i,k+j) = p(0,j-i) \;\wedge\; 0 \le j-i < j \;\wedge\; 0 \le k+i \le N-(j-i). \qquad (5)$$

Therefore, $P0$ is maintained by

"k,j := k + i, j − i"

where $i$ is any solution of (4).

On the other hand, if $i$ satisfies

$$p(i,j) \ne p(0,j-i) \;\wedge\; 0 < i \le j,$$

we have, on account of (3),

$$P0 \;\Rightarrow\; t(k+i,k+j) \ne p(0,j-i),$$

hence

$$P0 \;\Rightarrow\; \neg\,\mathrm{match}(k+i). \qquad (6)$$

For $j \ne 0$, let

$$f(j) = (\mathbf{MIN}\ i : p(i,j) = p(0,j-i) \;\wedge\; 0 < i \le j : i).$$

We conclude that

"k,j := k + f(j), j − f(j)"    (7)

maintains $P0 \wedge P1$ under $B$:

(i) thanks to (5), (7) maintains $P0$, because $f(j)$ is a solution of (4);

(ii) in view of (1) and (6),

$$(\mathbf{A}\ i : 0 \le i < f(j) : P0 \wedge B \;\Rightarrow\; \neg\,\mathrm{match}(k+i))$$

holds, so that (7) maintains $P1$ under $B$.
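To make the definition of f(j) concrete, the following Python sketch computes it by direct search over the candidates of equation (4). It is not the computation derived in this note; the name `f_spec` and the representation of p as a Python list are assumptions made here for illustration.

```python
def f_spec(p, j):
    """f(j) = (MIN i: p(i, j) = p(0, j-i) and 0 < i <= j: i), for j != 0."""
    assert j > 0
    for i in range(1, j + 1):
        # p[i:j] is the segment p(i, j); p[0:j - i] is p(0, j-i).
        if p[i:j] == p[0:j - i]:
            return i  # i = j always qualifies, since both segments are then empty
```

For p = [1, 2, 1, 2, 3] this gives f(4) = 2: after a mismatch with j = 4, statement (7) moves k two places forward and keeps j = 2 matched elements.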

Consequently, the question mark in the alternative construct (2) can be replaced by (7), and we have developed our program.

The repetition in the program terminates, since 2*k + j is bounded from above by 2*N + M and increases by at least 1 at each iteration.
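The developed program can be transcribed into Python as a sketch. The function names are hypothetical, the table f is filled here naively from its definition (an assumption of this sketch, so that the block runs on its own), and the two `assert` statements check the segment part of P0 and the increase of the bound function at run time.

```python
def f_by_definition(p, j):
    """f(j) computed naively from its definition (an assumption of this sketch)."""
    return next(i for i in range(1, j + 1) if p[i:j] == p[:j - i])


def first_occurrence(p, t):
    """Return the smallest k with t[k:k+M] == p (R0), or None if there is none (R1)."""
    M, N = len(p), len(t)
    f = [None] + [f_by_definition(p, j) for j in range(1, M)]  # f[j] for 0 < j < M
    k, j = 0, 0
    while j != M and k <= N - M:
        bound = 2 * k + j
        if t[k + j] == p[j]:
            j += 1
        elif j == 0:
            k += 1
        else:
            k, j = k + f[j], j - f[j]      # statement (7)
        assert t[k:k + j] == p[:j]         # segment part of P0
        assert 2 * k + j > bound           # the bound function increases
    return k if j == M else None
```

Note that the index k + j never moves backwards: each alternative either increases it by one or leaves it unchanged, so every element of t is inspected without backing up in the text.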

The computation of f(1,M)

The only problem left is to obtain an $f(1,M)$ satisfying

$$Rf:\ (\mathbf{A}\ v : 0 < v < M : f(v) = (\mathbf{MIN}\ i : D(i,v) : i)),$$

where

$$D(i,v):\ p(i,v) = p(0,v-i) \;\wedge\; 0 < i \le v.$$

In the standard fashion we derive from $Rf$ an invariant $Pf$ by replacing the constant upper bound $M - 1$ by a suitably bounded variable $k$:

$$Pf:\ (\mathbf{A}\ v : 0 < v \le k : f(v) = (\mathbf{MIN}\ i : D(i,v) : i)) \;\wedge\; 1 \le k \le M - 1.$$

"k := 1; f:(k) = 1" establishes $Pf$, and we can sketch the computation of $f(1,M)$ as follows:

    |[ k: int
       p(i: 0 ≤ i < M): array of int
     ; k := 1; f:(k) = 1 {Pf}
     ; do k ≠ M − 1 →
          "compute f(k+1)" {Pf^k_{k+1}}
        ; k := k + 1 {Pf}
       od {Pf ∧ k = M − 1, hence Rf}
    ]|

The repetition in the sketch trivially terminates.

Now we refine "compute f(k+1)". From the definition of $f(k+1)$, i.e.

$$f(k+1) = (\mathbf{MIN}\ i : D(i,k+1) : i),$$

and

$$(i \le k \;\wedge\; D(i,k+1)) \;\Rightarrow\; D(i,k)$$

we conclude

$$D(f(k+1),k) \;\vee\; f(k+1) = k+1. \qquad (8)$$

$f(k)$ is the minimum solution of the equation

$$i:\ D(i,k), \qquad (9)$$

hence

$$f(k) \le f(k+1) \le k+1. \qquad (10)$$

In order to compute $f(k+1)$, it consequently suffices to search the solutions of (9), starting with $f(k)$, in increasing order, in the light of the linear search theorem ([3]).
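Properties (8) and (10) can be checked mechanically on small patterns. The Python sketch below evaluates D and f directly from their definitions and asserts both properties for every admissible k; the names `D`, `f_spec` and `check_8_and_10` are assumptions introduced for this illustration, and a passing check on examples is of course no substitute for the argument above.

```python
def D(p, i, v):
    """D(i, v): p(i, v) = p(0, v-i) and 0 < i <= v."""
    return 0 < i <= v and p[i:v] == p[0:v - i]


def f_spec(p, v):
    """f(v) straight from its definition as the minimum solution of D(i, v)."""
    return min(i for i in range(1, v + 1) if D(p, i, v))


def check_8_and_10(p):
    """Assert (8) and (10) for all k with 1 <= k < len(p); return True if all hold."""
    for k in range(1, len(p)):
        fk, fk1 = f_spec(p, k), f_spec(p, k + 1)
        assert D(p, fk1, k) or fk1 == k + 1   # property (8)
        assert fk <= fk1 <= k + 1             # property (10)
    return True
```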

We observe that the largest solution of (9) is $k$. Thus, the search has the form

    j := f(k)
    ; do j ≠ k ∧ ¬D(j,k+1) → j := F(j) od

where $F(j)$ is the minimum solution larger than $j$ of (9).

The search terminates, since $j < F(j) \le k$. And

$$Q:\ D(j,k) \;\wedge\; j \le f(k+1)$$

is an invariant of the search. The reason is:

(i) $Q$ is established by "j := f(k)", on account of (10) and $D(f(k),k)$;

(ii) $D(F(j),k)$ holds according to the definition of $F(j)$;

(iii) $j \le f(k+1) \;\wedge\; \neg D(j,k+1) \;\Rightarrow\; \{\text{(8) and the definition of } F(j)\} \quad F(j) \le f(k+1)$.

Considering $D(j,k) \wedge p(k) = p(k-j) \Rightarrow D(j,k+1)$, we obtain the following program to compute $f(k+1)$:

    |[ j: int
     ; j := f(k) {Q}
     ; do j ≠ k ∧ p(k) ≠ p(k − j) → j := F(j) od
       {Q ∧ (j = k ∨ p(k) = p(k − j))}
     ; if p(k) = p(k − j) → f:(k+1) = j
            {j ≤ f(k+1) ∧ D(j,k+1), hence f(k+1) = j}
       [] p(k) ≠ p(k − j) → f:(k+1) = k + 1
            {k ≤ f(k+1) ∧ ¬D(k,k+1), hence f(k+1) = k + 1}
       fi
    ]|                                                   (11)

Finally, by using

$$p(j,k) = p(0,k-j) \;\wedge\; 0 < i \le k-j \;\Rightarrow\; \{(0)\} \quad p(j+i,k) = p(i,k-j), \qquad (12)$$

$F(j)$ can be refined as follows:

$$\begin{aligned}
& F(j) \\
=\ \ & \{\text{the definition of } F(j)\} \\
& (\mathbf{MIN}\ i : p(i,k) = p(0,k-i) \;\wedge\; j < i \le k : i) \\
=\ \ & \\
& j + (\mathbf{MIN}\ i : p(j+i,k) = p(0,k-j-i) \;\wedge\; 0 < i \le k-j : i) \\
=\ \ & \{j \text{ is a solution of (9), hence (12)}\} \\
& j + (\mathbf{MIN}\ i : D(i,k-j) : i) \\
=\ \ & \{\text{the definition of } f(k-j)\} \\
& j + f(k-j).
\end{aligned}$$
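With F(j) = j + f(k − j), the sketch of the computation of f(1,M) and program (11) can be combined. The Python transcription below is a sketch under stated assumptions: the name `f_table` is hypothetical, p is a Python list with M > 1, and f is stored in a list whose entry 0 is unused. Inside the inner loop 0 < j < k holds, so f[k − j] has already been computed.

```python
def f_table(p):
    """Return f with f[v] = f(v) for 0 < v < M; f[0] is unused. Assumes M > 1."""
    M = len(p)
    f = [None] * M
    k = 1
    f[k] = 1                                # establishes Pf
    while k != M - 1:
        j = f[k]                            # establishes Q
        while j != k and p[k] != p[k - j]:
            j = j + f[k - j]                # j := F(j)
        f[k + 1] = j if p[k] == p[k - j] else k + 1
        k += 1
    return f
```

In terms of the failure function as it is commonly presented, v − f(v) is the length of the longest proper border of p(0,v), that is, the longest segment shorter than v that is both a prefix and a suffix of p(0,v).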

Thanks to $0 < j \le k \wedge j \ne k$ and $Pf$, $f(k-j)$ is defined. Therefore, in (11) "j := F(j)" can be replaced by "j := j + f(k − j)", and we have completed the computation of $f(1,M)$.

Acknowledgement

I am greatly indebted to A.J.M. van Gasteren for her instructive comments and several improvements. I would like to thank Edsger W. Dijkstra, W.H.J. Feijen, J.G. Wiltink and the members of the Tuesday Afternoon Club for many valuable suggestions.

References

[1] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast Pattern Matching in Strings, SIAM J. Comput. 6, 323-350 (1977).

[2] R. Sedgewick, Algorithms, Addison-Wesley (1983).

[3] Edsger W. Dijkstra, A Discipline of Programming, Prentice-Hall (1976).

[4] Edsger W. Dijkstra, Lecture notes "Predicate Transformers" (Draft), EWD 835 (1982).

[5] A.J.M. van Gasteren and Edsger W. Dijkstra, About the Presentation of Programs, AvG 2/EWD 781, Internal Report (1981).