
Extracting Declarative Constraints from Natural Language Text

Mark Bothof 1
Student ID: 5789303 | E-mail: markbothof@gmail.com
Thesis Master Information Studies: Business Information Systems
1 University of Amsterdam, Faculty of Science

Supervisor: Han van der Aa 2
2 VU University Amsterdam, Faculty of Science

Second examiner: Dick Heinhuis 1

April 20, 2018

Abstract. Contemporary organizations have to ensure that their business processes are compliant with laws and regulations. Failure to do so can lead to penalties, scandals, and a loss of business reputation. Compliance checking techniques allow organizations to monitor their compliance in real time and in an automated manner. However, existing techniques require compliance rules to be formally specified, which they generally are not. To bridge this gap, this thesis presents an approach to automatically extract formalized declarative constraints from natural language text. Organizations also struggle to document their more flexible and knowledge-intensive business processes in declarative process models. Techniques to extract imperative process models already exist. This thesis offers an approach to convert declarative process descriptions into declarative process constraints for improved documentation, visualization and applicability. A quantitative evaluation with 46 sentences demonstrates that the approach allows users to quickly and effectively extract declarative constraints that can be used as input for existing automated compliance checking techniques and easily converted into process models to document business processes.

Keywords. Business process management, natural language processing, compliance checking, declarative process modeling

1 Introduction

Contemporary organizations have to deal with an increasing number of constraints, which stem from various compliance sources, such as Sarbanes-Oxley (Sarbanes, 2002), Basel II (Decamps, Rochet & Roger, 2004) and others. Such normative laws and regulations require organizations to ensure that their business processes comply with them (Liu, Muller & Xu, 2007). Failure to do so can lead to penalties, scandals, and a loss of business reputation. Regulatory compliance of business operations is a critical problem for enterprises. Compliance checking of business processes can be both a mandatory and a practical operation: when government regulations change or new business processes are implemented, compliance checking is required. Furthermore, automated compliance checking can be useful for enforcing internal requirements, such as service levels. Optimization of this process can result in substantial savings in time and money for organizations (El Kharbili, de Medeiros, Stein & Van der Aalst, 2008). Nevertheless, the world of regulations keeps growing more complex and


organizations face the challenge of complying. Compliance checking techniques play an increasingly important role, because they allow organizations to monitor their compliance in real time and in an automated manner. However, existing techniques require compliance rules to be specified in a formalized notation, e.g. a formalized declarative constraint or process model. This stands in sharp contrast to the natural language text in which such rules are generally specified (Chowdhury, 2003). Nowadays, this conversion requires manual effort. This thesis aims to overcome this gap by exploring the possibilities to automate the extraction of compliance rules from natural language text and the transformation of those rules into a representation that can be used for automated compliance checking.

There are already many techniques to automatically extract imperative process models from text, but not for the extraction of declarative process models. In an imperative notation everything is illegal except the actions explicitly allowed, whereas in a declarative notation everything is allowed except those actions that are explicitly prohibited (Fahland, Lübke, et al., 2009). Nowadays, flexibility in business processes is becoming more and more important. In the past, organizations often used strict assembly-line-like processes, which required and allowed process descriptions to be very explicit. This also made supervision and control easy, as there were few deviations from the process. For increased understandability and applicability, the textual descriptions of such explicit processes would ideally be visualized and formalized, e.g. in a process model. These process models are imperative and call for a technique to derive imperative process models from natural language text. Such techniques exist nowadays, but not for the declarative counterpart. Over the last century, domains like accounting and regulation became more complex, which demanded more diverse processes and the possibility to handle exceptions. Developments like the telephone and the internet also resulted in a new type of organization, with the desire for more flexible business processes, e.g. for sales activities or product development. Present-day processes are more knowledge-intensive. Consequently, contemporary organizations require flexibility in their processes (Pichler et al., 2011), which results in a more declarative form of process description. Again, a visual and formal documentation of these descriptions has many benefits and therefore calls for an efficient and automated manner to extract declarative process models from text.
This thesis intends to bridge this gap with the automated extraction of declarative constraints from natural language text, the form in which processes are or can generally be described by practitioners.

The goal of this research is to transform compliance sources into a representation that can be used by existing compliance checking techniques. Such techniques compare the rules with business process models and point out inconsistencies. For this process to be automated, the rules must be specified in a formalized notation. A convenient notation for compliance sources that can be used for automated compliance checking is a declarative constraint or process model. Most documented business process models are imperative. Logically, a declarative and an imperative process model are easier to compare with one another than a declarative natural language text and an imperative process model. Furthermore, this thesis aims to transform declarative process descriptions into declarative process models for the sake of process visualization and documentation. This is a very important step for organizations to maintain supervision and control over their business processes. Both goals can be achieved with the same technique, because the natural language texts of compliance sources and declarative process descriptions are very similar. The tool designed in this research applies extraction rules to input text, which results in declarative constraints that can be used for compliance checking and business process documentation.

On a scientific level this thesis contributes to the fields of business process management, process modeling and natural language processing. Business process


management is a discipline in operations management that uses various methods to discover, model, analyze, measure, improve, optimize, and automate business processes (Jeston & Nelis, 2010). An interesting movement in science and practice is the rise of declarative approaches. There is much discussion about the advantages and disadvantages of different process modeling techniques. Imperative process models tend to be easier to understand, because of their sequential nature (Fahland, Lübke, et al., 2009). Declarative process models, however, offer and support better flexibility and are therefore easier to maintain (Fahland, Mendling, et al., 2009). Due to the complexity of business processes, the need for flexibility and the possibility to handle edge cases and exceptions, the use of declarative business processes is on the rise (Pichler et al., 2011). Simultaneously, the demand for an efficient and automated manner to convert declarative process descriptions into declarative process models grows. The extraction of declarative constraints from natural language text revolves around natural language processing, a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and natural languages and, in particular, with programming computers to fruitfully process large natural language corpora (samples of “real world” texts). Natural language processing is the ability of a computer program to understand human speech (Chowdhury, 2003). Traditionally, computers require humans to “speak” to them in a programming language that is precise, unambiguous and highly structured. Compliance sources and textual process descriptions generally do not meet these requirements. Through machine learning, however, patterns can be discovered that improve the computer’s understanding of such sources.

The remainder of this thesis is structured as follows. Section 2 explains the research problem using illustrative examples. Section 3 discusses related work and identifies the research gap of interest. Section 4 describes the proposed approach for declarative constraint extraction and formalization. In Section 5, I present a quantitative evaluation of the approach. Finally, I draw conclusions and present directions for future research in Section 6.

2 Problem Illustration

In this section I illustrate the challenges of extracting declarative constraints from natural language text based on clear examples. Section 2.1 describes Declare, the notation used for declarative constraints and modeling. Section 2.2 explains the main challenges of extracting declarative constraints from natural language text. Both sections provide examples for clarification.

2.1 Declare

Table 1 shows example sentences from a compliance source for an insurance claim process and the corresponding formalized constraints in the Declare notation. Declare is a constraint-based modeling language for the development of declarative models describing loosely-structured processes (Pesic, Schonenberg & Van der Aalst, 2007). An overview of the Declare notation can be found in Appendix A.


Sentence | Constraint
“A claim should be created, before it can be approved.” | Precedence(Create claim, Approve claim)
“If a claim is approved, then it must be paid.” | Response(Approve claim, Pay claim)
“If a claim is approved, it can be paid, and if a claim is approved, then it must be paid.” | Succession(Approve claim, Pay claim)
“The process starts when a claim is created.” | Init(Create claim)
“Payment of the claim occurs last.” | End(Pay claim)

Table 1. ​Unformalized and formalized notations of declarative constraints for an insurance claim process

The constraint templates of the Declare notation consist of the constraint type followed by the relevant process activities. This activity notation is typically known as the verb-object convention (Mendling, Reijers & Recker, 2010). An activity consists of a verb in the imperative form and a noun, like “Repair bicycle”. Declare constraints always contain one or two activities, depending on the constraint type. Table 2 shows the explanation of several Declare templates. Explanations of other Declare templates can be found in Appendix B.

Template | Explanation
Precedence(x, y) | if activity y occurs, then activity x must precede
Response(x, y) | if activity x occurs, then activity y must follow
Succession(x, y) | if activity x occurs, then activity y must follow AND if activity y occurs, then activity x must precede
Init(x) | activity x occurs first in the process
End(x) | activity x occurs last in the process

Table 2. ​Declare templates and explanations
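The semantics of these templates can be made concrete by checking a constraint against an example execution trace. The following Python sketch is illustrative only; the function names and trace representation are chosen for this example and are not part of any Declare tooling.

```python
# Illustrative sketch of the Declare template semantics of Table 2,
# expressed as checks over an execution trace (a list of completed activities).

def precedence(x, y, trace):
    # If activity y occurs, activity x must occur before it.
    return all(x in trace[:i] for i, a in enumerate(trace) if a == y)

def response(x, y, trace):
    # If activity x occurs, activity y must occur afterwards.
    return all(y in trace[i + 1:] for i, a in enumerate(trace) if a == x)

def succession(x, y, trace):
    # Succession combines Precedence and Response.
    return precedence(x, y, trace) and response(x, y, trace)

def init(x, trace):
    # Activity x occurs first in the process.
    return bool(trace) and trace[0] == x

def end(x, trace):
    # Activity x occurs last in the process.
    return bool(trace) and trace[-1] == x

trace = ["Create claim", "Approve claim", "Pay claim"]
print(precedence("Create claim", "Approve claim", trace))  # True
print(succession("Approve claim", "Pay claim", trace))     # True
print(init("Create claim", trace) and end("Pay claim", trace))  # True
```

A trace such as ["Approve claim"] violates Precedence(Create claim, Approve claim), since no creation precedes the approval.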

2.2 Challenges

By analyzing the syntax and semantics of the sentences, generic patterns can be discovered to automate the extraction of formalized declarative constraints. Syntax is about the structure or the grammar of natural language text and semantics is about


the meaning of the sentence or the individual words (Harel & Rumpe, 2004). The dependency relations between words can also be of crucial importance. Dependency relations connect pairs of words or phrases and name the relationship between these parts (Kao & Poteet, 2007). These results from existing natural language processing techniques are fundamental to the extraction of declarative constraints from natural language text.

The syntactic and semantic analysis comes with all kinds of challenges. An examination of compliance texts and process descriptions, combined with the declarative constraints that should be extracted from them, revealed some obvious difficulties that need to be overcome. The most important challenges for the extraction of formalized declarative constraints from natural language text are discussed below.

Sentence formats Each constraint can be written down in countless different sentence formats. For example, the “Precedence(Create claim, Approve claim)” constraint can be deduced from “When a claim is created, it may be approved.” and from “Approving the claim can happen at any time, unless it still needs to be created.”. These are just two very contrasting sentence formats out of many. Therefore, a broad spectrum of declarative sentences needs to be analyzed and converted into patterns for automated formalized constraint extraction.

Combined constraint types As can be concluded from the explanations in Table 2, a constraint type can also be a combination of other constraint types. The Succession constraint type is a combination of the Precedence and Response constraint types. This adds an extra layer of complexity.

Verb conjugations In the first four sentences of Table 1, the relevant verbs for activities are past participles, while constraint activities need a verb in the imperative form. In general, conjugations cause a problem. For verbs, the imperative is always identical to the lemma, so tracking down the lemmatization could help.
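Because the imperative of a verb equals its lemma, lemmatization solves the conjugation problem. The sketch below illustrates the idea with a tiny hand-made lemma table; in the actual approach, the lemmas come from full NLP tools (FreeLing and Google NLP, see Section 4), not from a lookup table like this.

```python
# Illustrative only: a tiny lemma table standing in for the lemmatization
# output of a full NLP tool such as FreeLing or Google NLP.
LEMMAS = {"created": "create", "approved": "approve", "paid": "pay"}

def imperative(verb):
    # For verbs, the imperative form is identical to the lemma, so the
    # lemma of a past participle is exactly the activity verb we need.
    return LEMMAS.get(verb.lower(), verb)

# "A claim ... can be approved" yields the activity verb "Approve".
print(imperative("approved").capitalize() + " claim")  # Approve claim
```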

Complex activities Furthermore, as can be seen in the last sentence of Table 1, activities can also be described in complex formats. “Payment of the claim” contains no verb, but should eventually be converted into the “Pay claim” activity. For this, the derivation should be analyzed.

Pronouns In the first three sentences of Table 1 the word “it” occurs, and such pronouns seem relevant for the extraction of activities. However, in formalized process modeling notations such vague and implicit terms are not useful and are therefore disallowed. References between words might provide an answer to this problem.

Activity order The order of activities is also very relevant for constraint types with two activities. For the Precedence and Response constraint types, the compliance rule is substantially different if the activities are represented in a different order: “Response(Approve claim, Pay claim)” has very different implications than “Response(Pay claim, Approve claim)”. Such errors can have grave consequences. For the Succession constraint type, however, the order is irrelevant, because reversed activities result in the same process constraint, as the explanation of the Succession constraint type in Table 2 clarifies.

No or multiple constraints Compliance sources and process descriptions also often contain sentences with explanatory meaning but no extractable constraints at all, while some sentences represent multiple constraints. These are just some of the challenges faced when extracting declarative constraints from natural language text. The design of an appropriate technique for extracting formalized declarative constraints from natural language text must deal with all of these challenges.


3 Related Work

The work presented in this thesis relates to the fields of automated compliance checking and process model generation from natural language text. In this section, relevant work from both fields is discussed to clarify the research gap.

Existing compliance checking techniques for business processes need formalized compliance rules as input (El Kharbili, de Medeiros, Stein & Van der Aalst, 2008). Compliance rules describe regulations, policies and quality constraints that business processes must adhere to. Given the large number of rules and their frequency of change, manual compliance checking can become a time-consuming task. Automated compliance checking of process activities and their ordering is an alternative whenever business processes and compliance rules are described in a formal way (Awad, Decker & Weske, 2008). Compliance rules serve as input to model checkers, which in turn verify whether a process model satisfies the requested compliance rule. Several approaches have been developed to support compliance checking of process models. One major challenge for such approaches is their ability to handle different modeling techniques and compliance rules in order to enable widespread adoption and application (Becker, Delfmann, Eggert & Schwittay, 2012). Current approaches mainly focus on specific modeling techniques and a restricted set of compliance rule types. Most approaches also abstain from real-world evaluation, which raises the question of their practical applicability.

Business process modeling has become an important tool for managing organizational change and for capturing software requirements. A central problem in this area is the fact that the acquisition of as-is models consumes up to 60 percent of the time spent on process management projects (Friedrich, Mendling & Puhlmann, 2011). This is paradoxical, as extensive documentation is often available in companies, but not in a ready-to-use format. To tackle this, automated approaches have been developed to generate process models from natural language text. However, these techniques can only handle imperative documentation, not declarative documentation. The dynamic nature of the modern business environment means processes are subject to an increasingly wide range of variations and must demonstrate flexible approaches to deal with these variations if they are to remain viable (Schonenberg, Mans, Russell, Mulyar & Van der Aalst, 2008). The challenge is to provide flexibility and offer process support at the same time. To increase flexibility in an imperative process, more execution paths have to be modeled explicitly, whereas flexibility in declarative processes is increased by reducing the number of constraints or weakening existing constraints. Therefore, the use of declarative business processes and the desire to automate the extraction of process models are increasing.

For the extraction of models or constraints from natural language text, natural language processing must be performed to obtain structured information of a syntactic and semantic nature. Patterns must then be discovered to draw generic conclusions and develop an automated approach. This is a form of text mining. Text mining refers to the process of extracting interesting and non-trivial patterns or knowledge from text documents (Tan, 1999). Data mining techniques are usually dedicated to information extraction from structured databases. Text mining techniques, on the other hand, are dedicated to information extraction from unstructured textual data, and natural language processing can then be seen as an interesting tool for the enhancement of information extraction procedures (Rajman & Besançon, 1998).

In summary, existing techniques do not provide the tools for automated process modeling from declarative natural language texts. In light of this research gap, I designed an approach to extract declarative constraints from natural language text.


4 Approach

This section describes the approach to extract declarative constraints from natural language text, which consists of a pre-processing step and three extraction steps. Section 4.1 presents an overview of the approach. Sections 4.2 through 4.4 subsequently describe the extraction steps in detail.

4.1 Overview

As depicted in Figure 1, the approach consists of four steps. The first step is a pre-processing step in which existing natural language processing tools are used to collect syntactic and semantic information about the unstructured natural language text. This structured and labeled data is analyzed in the other three steps to work towards a complete and true extraction of declarative constraints. The second step extracts the constraint types. The third step focuses on the extraction of process activities. The fourth and final step optimizes the results from the two previous steps.

Fig. 1. Outline of the approach

Natural language processing tools For the syntactic and semantic analysis of natural language text, existing natural language processing tools were analyzed and selected. After testing with a small set of compliance sentences, FreeLing (Carreras, Chao, Padró & Padró, 2004) and Google NLP were chosen. FreeLing is developed and maintained by the natural language processing research group of the Universitat Politècnica de Catalunya. Google NLP is a service of the United States-based Google LLC. The decision for these tools was based on the quality of their results, their rough fit with the earlier described challenges, and the available programming languages and data formats for integrating the tools into the new tool.

Constraint types The focus of this research is on the Precedence, Response, Succession, Init and End constraint types. These are the most fundamental constraints (Pesic, Schonenberg & Van der Aalst, 2007). Many of the other constraints are variations on these constraints. An overview of all constraint types can be found in Appendices A and B.

Input text Only individual compliance sentences that each contain exactly one constraint are analyzed. Of course, more complex sources also contain sentences with multiple constraints, sentences without any relevance to constraints, and sentences that need to be analyzed together to derive a single constraint. However, these were left out of the scope of this research.

Execution order A general issue that needs to be addressed is the order in which extraction rules are executed. Because the Succession constraint type is a combination of the Precedence and Response constraint types, a sentence with a Succession constraint will also positively trigger a Precedence and Response extraction rule. Therefore, combined with the condition that each sentence only


contains one constraint, it is useful to check for Succession constraints first and, if one is found, stop looking for other constraints. Therefore, in general, it seems relevant to order and execute the extraction rules from the most to the least complex and end the execution once a constraint type is found. The applied order is Succession, Precedence and Response, Init and End.
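This first-match ordering can be sketched as an ordered list of (constraint type, predicate) pairs. The predicates below are simplified single “sentence contains” stand-ins for the full rule set of Section 4.2 (one naive substring check per type), so this sketch over- and under-matches where the complete rules would not.

```python
# Sketch of the first-match execution order: rules are tried from the most
# to the least complex constraint type, and execution stops at the first hit.
# Each predicate is a naive "sentence contains" check standing in for the
# richer rule set of Section 4.2.

def contains_any(sentence, *keywords):
    # Naive substring matching, as in the "Sentence contains ..." rules.
    return any(k in sentence.lower() for k in keywords)

RULES = [
    ("Succession", lambda s: contains_any(s, "if", "when")
                             and contains_any(s, "can", "may")
                             and contains_any(s, "must", "should")),
    ("Precedence", lambda s: contains_any(s, "precede", "before")),
    ("Response",   lambda s: contains_any(s, "follow", "after")),
    ("Init",       lambda s: contains_any(s, "begin", "start", "first")),
    ("End",        lambda s: contains_any(s, "end", "stop", "last")),
]

def constraint_type(sentence):
    for name, matches in RULES:
        if matches(sentence):
            return name  # first match wins; stop looking for other types
    return None

print(constraint_type("If a claim is approved, it can be paid, "
                      "and if a claim is approved, then it must be paid."))
# Succession
```

Because Succession is checked first, the example sentence is not misclassified as the Precedence or Response constraint it also contains.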

Verb conjugations For the extraction of activities, the conjugations of verbs in sentences cause an issue. Fortunately, the natural language processing tools used provide lemmatization. These lemmas are used in all extraction rules and extracted elements. Activity verbs actually need to be in the imperative form, but this is always identical to the lemma.

The following sections describe the extraction rules and the patterns used in detail. Rules are created by analyzing similarities between sentences that should result in the same declarative constraint. In this way, only rules with some level of genericity are created, and not rules that apply to only a single sentence. The designed tool thus becomes a generic solution for the extraction of declarative constraints from all sorts of natural language texts, e.g. compliance sources and process descriptions. Sections 4.2.1 through 4.2.5 address the extraction of each constraint type. Section 4.3 covers the third step of the approach, the extraction of activities. Section 4.4 explains the designed optimization rules.

4.2 Constraint Types

4.2.1 Precedence

An example Precedence constraint is “Precedence(Create claim, Approve claim)”, derived from the sentence “A claim should be created, before it can be approved.”. The constraint template is “Precedence(x, y)”, which means that if activity y occurs, then activity x must precede. The following extraction rules are implemented to extract the Precedence constraint type.

Precedence 1 The semantic meaning of the words “precede” and “before” indicates a Precedence constraint. They indicate that some activity has to take place before another activity. For example in the sentence “The approval of the claim should be preceded by the creation of the claim.”. The formal extraction rule:

- Sentence contains “precede” OR “before”.

Precedence 2 If the words “only” and “after” occur consecutively, this indicates a Precedence constraint. For example in the sentence “Only after a claim is created, it is possible to approve the claim.”. The formal extraction rule:

- Sentence contains “only after”.

Precedence 3 The word “if” or “when” combined with the word “can” or “may” indicates a ​Precedence constraint. For example in the sentence “If a claim is created, it can be approved.”. The formal extraction rule:

- Sentence contains “if” OR “when” AND “can” OR “may”.

Precedence 4 The word “if” or “when” combined with the word “must” or “should” and with the word “first” indicates a ​Precedence constraint. For example in the sentence “If a claim is approved, then it must have been created first.”. The formal extraction rule:

- Sentence contains “if” OR “when” AND “must” OR “should” AND “first”.

4.2.2 Response

An example Response constraint is “Response(Approve claim, Pay claim)”, derived from the sentence “If a claim is approved, then it must be paid.”. The constraint template is “Response(x, y)”, which means that if activity x occurs, then activity y


must follow. The following extraction rules are implemented to extract the Response constraint type.

Response 1 The semantic meaning of the words “follow” and “after” indicates a Response constraint. They indicate that some activity has to take place after another activity. For example in the sentence “A claim should be paid, after it is approved.”. The formal extraction rule:

- Sentence contains “follow” OR “after”.

Response 2 The word “if” or “when” combined with the word “must” or “should” indicates a ​Response ​constraint. For example in the sentence “When a claim is approved, it should subsequently be paid.”. The formal extraction rule:

- Sentence contains “if” OR “when” AND “must” OR “should”.

4.2.3 Succession

The Succession constraint type is a combination of the Precedence and Response constraint types. An example constraint is “Succession(Approve claim, Pay claim)”, derived from the sentence “If a claim is approved, it can be paid, and if a claim is approved, then it must be paid.”. The constraint template is “Succession(x, y)”, which means that if activity x occurs, then activity y must follow, and if activity y occurs, then activity x must precede. The following extraction rules are implemented to extract the Succession constraint type.

Succession 1 The semantic meaning of the words “precede” and “before” indicates a Precedence constraint, and the semantic meaning of the words “follow” and “after” indicates a Response constraint. The combination results in a Succession constraint. For example in the sentence “A claim should be approved, before it can be paid, and after a claim is approved, it should be paid.”. The formal extraction rule:

- Sentence contains “precede” OR “before” AND “follow” OR “after”.

Succession 2 The word “if” or “when” combined with the word “can” or “may” indicates a Precedence constraint, and that same “if” or “when” combined with the word “must” or “should” indicates a Response constraint. The two constraints together result in a Succession constraint. For example in the sentence “If a claim is approved, it can be paid, and if a claim is approved, then it must be paid.”. The formal extraction rule:

- Sentence contains “if” OR “when” AND “can” OR “may” AND “must” OR “should”.

4.2.4 Init

An example ​Init ​constraint is “Init(Create claim)” derived from the sentence “The process starts when a claim is created.”. The constraint template is “Init(x)” which means that activity x occurs first in the process. The following extraction rule is implemented to extract the ​Init ​constraint type.

Init The semantic meaning of the words “begin”, “start” and “first” indicates an Init constraint. For example in the sentence “Creation of a claim occurs first.”. The formal extraction rule:

- Sentence contains “begin” OR “start” OR “first”.

4.2.5 End

An example ​End ​constraint is “End(Pay claim)” derived from the sentence “Payment of the claim occurs last.”. The constraint template is “End(x)” which


means that activity x occurs last in the process. The following extraction rule is implemented to extract the ​End ​constraint type.

End The semantic meaning of the words “end”, “stop” and “last” indicates an End constraint. For example in the sentence “The process stops with the payment of the claim.”. The formal extraction rule:

- Sentence contains “end” OR “stop” OR “last”.

4.3 Activities

All constraint types need one or two activities. Based on the dependencies in the Google NLP results, certain syntactic conclusions can be drawn to extract these activities, each consisting of a verb and a noun, in that particular order. The following extraction rules are implemented to extract activities.

Nsubjpass A passive nominal subject dependency in the Google NLP results indicates a process activity. For example in the sentence “A claim should be created, before it can be approved.”. In this sentence there are two of these dependencies: one between “claim” and “created” and another one between “it” and “approved”. By analyzing the lemmas and the right dependency direction, this adds up to the activities “Create claim” and “Approve it”. The formal extraction rule:

- Find words with dependency “NSUBJPASS” (passive nominal subject) in the Google NLP results. Save the lemma of the word the dependency refers to and then the initial word.

Dobj A direct object dependency in the Google NLP results indicates a process activity. For example in the sentence “They have to create the claim, before they can approve it.”. In this sentence there are two of these dependencies: one between “create” and “claim” and another one between “approve” and “it”. By analyzing the lemmas and the right dependency direction, this adds up to the activities “Create claim” and “Approve it”. The formal extraction rule:

- Find words with dependency “DOBJ” (direct object) in the Google NLP results. Save the lemma of the word the dependency refers to and then the initial word.
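The Nsubjpass and Dobj rules can be sketched over dependency-parsed tokens. The token tuples below are hand-constructed to mirror the general shape of a parser's output (word, lemma, dependency label, lemma of the head word the dependency refers to); they are not literal Google NLP API responses.

```python
# Sketch of the NSUBJPASS and DOBJ activity-extraction rules over tokens in
# the shape a dependency parser would return. Token values are written out
# by hand for illustration; a real pipeline would take them from the parser
# output (e.g. Google NLP) rather than construct them manually.
from collections import namedtuple

Token = namedtuple("Token", "text lemma dep head_lemma")

def extract_activities(tokens):
    activities = []
    for t in tokens:
        if t.dep in ("NSUBJPASS", "DOBJ"):
            # Verb = lemma of the head word the dependency refers to,
            # noun = the initial (dependent) word itself.
            activities.append(t.head_lemma.capitalize() + " " + t.text)
    return activities

# "A claim should be created, before it can be approved."
tokens = [
    Token("claim", "claim", "NSUBJPASS", "create"),   # claim <- created
    Token("created", "create", "ROOT", "create"),
    Token("it", "it", "NSUBJPASS", "approve"),        # it <- approved
    Token("approved", "approve", "ADVCL", "create"),
]
print(extract_activities(tokens))  # ['Create claim', 'Approve it']
```

Note that “Approve it” still contains a pronoun; resolving it to “Approve claim” is left to the coreference optimization of Section 4.4.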

Prep A prepositional modifier dependency in the Google NLP results can indicate a process activity, but only if the word it refers to has a nominal subject, object of a preposition, passive nominal subject or direct object dependency. For example in the sentence “Approval of the claim has payment of the claim as a response.”. There is a prepositional modifier dependency between “of” and “Approval”, and there is a nominal subject dependency between “Approval” and “claim”. By analyzing the lemmas and the right dependency directions, this adds up to the activity “Approve claim”. To get from a noun like “Approval” to the verb “Approve”, the derivation should be collected from a dictionary tool like WordsAPI. In a similar way the activity “Pay claim” can be extracted, but this one is based on a direct object dependency instead. There is even a third prepositional modifier dependency in this sentence, between “as” and “has”. However, the threshold for the second dependency is not met, which results in this activity extraction being correctly dismissed. The formal extraction rule:

- Find words with dependency “PREP” (prepositional modifier) in the Google NLP results. Find the word the dependency refers to check if this word has a dependency with label “NSUBJ” (nominal subject), “POBJ” (object of a preposition), “NSUBJPASS” (passive nominal subject) or “DOBJ” (direct object). If so, find words with dependency “POBJ” and check if this

(11)

11

dependency refers to the initial word with the “PREP” dependency. If this all checks out, then save the lemma of the second word and then the third word.
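The Prep rule can be sketched along the same lines. The token layout again mimics the Google NLP output, and the small DERIVATIONS map stands in for a dictionary service such as WordsAPI; both are assumptions for illustration:

```python
# Sketch of the Prep activity extraction rule. Tokens are
# (text, lemma, dep_label, head_index), a simplified stand-in for a Google
# NLP dependency parse; DERIVATIONS mimics a noun-to-verb lookup (WordsAPI).

DERIVATIONS = {"approval": "approve", "payment": "pay"}

def extract_prep_activities(tokens):
    activities = []
    for p_idx, (p_text, _, p_dep, p_head) in enumerate(tokens):
        if p_dep != "PREP":
            continue
        _, head_lemma, head_dep, _ = tokens[p_head]
        # The word the PREP refers to must itself carry one of these labels.
        if head_dep not in ("NSUBJ", "POBJ", "NSUBJPASS", "DOBJ"):
            continue
        verb = DERIVATIONS.get(head_lemma, head_lemma)  # e.g. approval -> approve
        # Find the object of the preposition attached to this PREP word.
        for o_text, o_lemma, o_dep, o_head in tokens:
            if o_dep == "POBJ" and o_head == p_idx:
                activities.append(f"{verb.capitalize()} {o_lemma}")
    return activities

# "Approval of the claim has payment of the claim as a response."
tokens = [
    ("Approval", "approval", "NSUBJ", 3),  # 0
    ("of", "of", "PREP", 0),               # 1
    ("claim", "claim", "POBJ", 1),         # 2
    ("has", "have", "ROOT", 3),            # 3
    ("payment", "payment", "DOBJ", 3),     # 4
    ("of", "of", "PREP", 4),               # 5
    ("claim", "claim", "POBJ", 5),         # 6
    ("as", "as", "PREP", 3),               # 7: head is ROOT, so dismissed
]
print(extract_prep_activities(tokens))  # ['Approve claim', 'Pay claim']
```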

4.4 Optimization

Some extraction rules benefit from general optimization. Since these optimization rules are relevant for multiple extraction rules, they are defined separately and executed after all the extraction rules. The following optimization rules are implemented.

Coreferences In natural language text it is very common to replace nouns with synonyms or pronouns like "it" that refer to an earlier introduced noun, for example in the sentence "A claim should be created, before it can be approved.". This causes problems for the extraction of activities, but can be solved by analyzing and using the coreferences from the FreeLing results. The optimization rule in short:

- Replace nouns in activities with coreference if this dependency exists in the FreeLing results.

Duplicates It is not uncommon for activities to be repeated, for example in a sentence with a Succession constraint like "If a claim is approved, it can be paid, and if a claim is approved, then it must be paid.". To prevent all these activities from showing up in the extracted elements, causing false elements and affecting the performance scores negatively, duplicate elements are simply removed. The optimization rule in short:

- Remove duplicates from extracted elements.

Multiple activities The Init and End constraint types should always have exactly one activity. So if the sentence contains more than one activity as well as an Init or End constraint type, it can be concluded that something is wrong. To prevent wrongful element extraction, the Init or End constraint type is then removed. The optimization rule in short:

- If the extracted elements contain more than one activity AND constraint “Init” OR “End”, then this constraint is removed.

Order For the construction of the declarative constraints in the Declare notation, the extracted elements need to be placed in the right order and a simple format needs to be applied. For example, an extracted "Precedence" constraint type with two activities "Create claim" and "Approve claim" needs to be formatted into "Precedence(Create claim, Approve claim)". The constraint type is always placed first, directly followed by an opening parenthesis "(". Then the activities are listed, in the right order, separated by a comma. Finally, a closing parenthesis ")" completes the Declare constraint.
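The Duplicates, Multiple activities, and Order rules can be sketched together. Representing the extracted elements as one constraint-type string plus a list of activity strings is an assumption made for illustration:

```python
# Sketch of the Duplicates, Multiple activities, and Order optimization rules,
# ending in the Declare formatting step. The element representation (one
# constraint-type string plus activity strings) is assumed for illustration.

def optimize_and_format(constraint, activities):
    # Duplicates: collapse repeated activities, preserving their order.
    seen, unique = set(), []
    for activity in activities:
        if activity not in seen:
            seen.add(activity)
            unique.append(activity)
    # Multiple activities: Init and End may carry only a single activity.
    if constraint in ("Init", "End") and len(unique) > 1:
        return None  # the constraint type is removed, nothing to format
    # Order: constraint type first, then the activities in parentheses.
    return f"{constraint}({', '.join(unique)})"

print(optimize_and_format("Precedence",
                          ["Create claim", "Approve claim", "Create claim"]))
# Precedence(Create claim, Approve claim)
print(optimize_and_format("Init", ["Create claim", "Approve claim"]))  # None
```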

5 Evaluation

This section presents a quantitative evaluation that demonstrates how well the proposed approach is able to extract declarative constraints from a collection of sentences from compliance sources and declarative process descriptions. Declarative constraints were manually annotated for a collection of 46 sentences obtained from practice. This annotation is referred to as the Gold Standard, against which the results of the approach are compared. After elaborating on the test collection in section 5.1, I present the setup of the evaluation in section 5.2, its results in section 5.3 and a discussion of the strengths and weaknesses of the approach in section 5.4.


5.1 Test Collection

To evaluate the approach, a collection of 46 sentences was used. The data collection, as presented in Appendix C, was derived from multiple compliance sources for business processes and real-life process descriptions (Di Ciccio, in press; Maggi, Di Ciccio, Di Francescomarino & Kala, 2017; Hildebrandt, Mukkamala & Slaats, 2011; Slaats, Mukkamala, Hildebrandt & Marquard, 2013). The collection was converted into a usable set of 46 sentences in five iterations, starting from an initial set of 50 pieces of natural language text, most of which contained multiple sentences.

As a first step towards the automation of extracting declarative constraints from natural language text, it was decided to focus on individual sentences with a single constraint and only the Precedence, Response, Succession, Init and End constraint types. These constraint types are the most common, and with them complete process models with multiple activities can be described and derived. All sentences from the initial compliance texts were separated. Sentences with different constraint types, or no constraints at all, were filtered out. Sentences with multiple constraints were manually split into multiple complete sentences with one constraint each, provided the constraint types were relevant. If no clear split could be made, the sentence was discarded from the collection. A few sentences were manually copied and modified to fit another constraint type, to ensure that every constraint type had sufficient representation and that certain syntax constructions had several variations. In this way the final set became a closer fit to real-life compliance sources and process descriptions, although it only contains individual sentences. Table 4 provides an overview of the data collection.

Constraint Type # Sentences

Precedence 17

Response 14

Succession 5

Init 5

End 5

Table 4. ​Overview of the data collection

For each sentence in the dataset a Gold Standard was defined. Two researchers were involved in its creation. The Gold Standard is a reference standard considered adequate to define the presence or absence of the condition of interest; in this case, the correct declarative constraints that should result from the automated extraction based on the generic patterns designed in this research. Several sentences from the dataset required some discussion. Issues that could not be resolved, or that caused uncertainty, resulted in the removal of the sentence from the collection.

5.2 Setup

To gain insight into the quality of the designed extraction rules, a suitable evaluation method was selected. To demonstrate the applicability of the approach presented in this thesis, the performance of each extraction rule, of related sets of rules, and of the total set of extraction rules was assessed with standard information retrieval metrics.

More specifically, Precision and Recall were calculated by comparing the computed results against the manually created Gold Standard. The accuracy of the test is then summarized by the F1 Score. The definitions of these performance metrics are displayed in Table 3.

Metric Definition

Precision extracted true elements / all extracted elements

Recall extracted true elements / all elements in Gold Standard

F1 Score 2 * ((Precision * Recall) / (Precision + Recall))

Table 3. ​Performance metrics and definitions

A true element is one that is also present in the Gold Standard and is therefore correctly extracted. A false element is an extracted element that does not appear in the Gold Standard and is considered incorrect.

Precision is the proportion of true elements among all extracted elements, true and false. Recall is the proportion of true elements among the total number of elements in the Gold Standard. The F1 Score measures the accuracy of a test. It is the harmonic mean of Precision and Recall, reaching its best value at 1.00 and its worst at 0.00. For these metrics it is important to define what an element is. In this case it was decided that a constraint type and a full activity, consisting of a verb and a noun, are elements. So, for example, the constraint "Precedence(Create claim, Approve claim)" contains three elements: "Precedence", "Create claim" and "Approve claim".
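Computed over lists of extracted elements, the three metrics look as follows. This is a sketch following the definitions in Table 3; the example elements are illustrative, not taken from the evaluation:

```python
# Precision, Recall and F1 Score over extracted elements versus the Gold
# Standard, following the definitions in Table 3.

def evaluate(extracted, gold):
    true_elements = [e for e in extracted if e in gold]
    precision = len(true_elements) / len(extracted) if extracted else 0.0
    recall = len(true_elements) / len(gold) if gold else 0.0
    f1 = (2 * (precision * recall) / (precision + recall)
          if precision + recall else 0.0)
    return round(precision, 2), round(recall, 2), round(f1, 2)

gold = ["Precedence", "Create claim", "Approve claim"]
extracted = ["Precedence", "Create claim", "Require creation"]  # one false element
print(evaluate(extracted, gold))  # (0.67, 0.67, 0.67)
```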

In constructing the extraction rules, extensive experimentation and testing was conducted. For an extraction rule to be considered a valuable addition to the new extraction tool, certain thresholds were defined:

A. The extraction rule has at least two positive triggers, unless it is derived from another extraction rule.

B. The extraction rule has a positive effect on the overall scores when the full dataset and all extraction rules are enabled, unless it is derived from another extraction rule.

A positive trigger, as mentioned in threshold A, occurs when an extraction rule is executed on a compliance sentence and the extracted element is true compared to the Gold Standard. A positive effect on the overall scores, as mentioned in threshold B, is a significant positive change in either the Precision, Recall or F1 Score. These scores are rounded to two decimals; any positive change in them is considered significant.

The exception mentioned in both thresholds is important to ensure consistency. Several constraint types are related to each other, for example as counterparts, or because one constraint type is a combination of others, like Succession being a combination of Precedence and Response. For example, if a new extraction rule for the Precedence constraint type and a new rule for the Response constraint type pass the defined thresholds and are added to the new tool, then for consistency purposes a combination of these extraction rules is also implemented for the Succession constraint type, even if this combined rule does not pass all thresholds itself.

A full insight into the setup can be found on my website. Different parts of the dataset can be selected and specific extraction rules can be applied. The results show the extracted elements, which are green if true and red if false, based on matching against the Gold Standard. For each sentence the performance scores are shown, and at the top the overall evaluation is visible. For convenience and extra insight, the original results of the natural language processing tools used, FreeLing and Google NLP, are also included.

5.3 Results

The Precision, Recall and F1 Score were computed for every extraction rule. For each rule the complete dataset was selected, to replicate a real-life situation. For the constraint type related extraction rules, the activities were logically excluded from the Gold Standard and thus from the results. The same applies to the activity related extraction rules, for which the constraint types were excluded from the Gold Standard. An overview of the results is presented in Figure 2. A results table can be found in Appendix D.

Fig. 2.​ Overview of the results

The Precedence set is a combination of all the numbered Precedence extraction rules. The same applies to the Response and Succession sets. The Constraint types set contains all the extraction rules shown above it in Figure 2, so all rules related to the extraction of constraint types. The Activities set is a combination of all the rules related to the extraction of activities, so Nsubjpass, Dobj and Prep. The optimization rules are enabled by default, so they were applied to all the previously mentioned rules and sets. No optimization refers to a combination of all the extraction rules, but with the optimization rules from section 4.4 disabled. The optimization rules could not be measured separately, because they rely on each other. By comparing the No optimization results with the results of All rules, the performance of the optimization rules can be measured. With the optimization rules, the overall scores for Precision, Recall and F1 Score increase from 0.73, 0.80 and 0.76 to 0.85, 0.88 and 0.86. These last three scores are the overall performance scores for the extraction tool designed in this research.

The Precision scores represent the extracted true elements versus all extracted elements. If the Precision is lower than 0.50, more false than true elements are extracted; in other words, the rule does more harm than good and could be considered to perform poorly. Fortunately, all extraction rules have a Precision of 0.50 or more, although the Dobj extraction rule, with a Precision of exactly 0.50, is an edge case. A Precision of 1.00 means that all extracted elements are true, so no false elements are extracted.

Any Recall score above 0.00 can actually be considered a positive contribution, other performance scores aside. Recall represents the extracted true elements versus all elements in the Gold Standard. When extraction rules are combined, their Recall scores stack up, unless some of the extracted true elements overlap; Recall scores never negatively influence each other when combined. Extraction rules with relatively low Recall scores can therefore still be considered well performing, like the Precedence 2 (0.12), 3 (0.12) and 4 (0.12), the Succession 2 (0.20) and the Dobj (0.10) extraction rules. A Recall of 1.00 means all elements in the Gold Standard are extracted; extending the tool with additional extraction rules for the same type of elements would then be irrelevant for Recall, although the Precision score could still be ripe for improvement.

The End extraction rule performs perfectly, with a score of 1.00 for all three performance metrics. However, more complex compliance sources and declarative process descriptions could still call for improvement of the designed extraction tool with regard to this constraint type.

Another notable result is that of the Dobj extraction rule, as mentioned before. Its performance scores seem quite low, but a closer analysis shows it can still be relevant for extracting compliance rules from natural language text. Without this extraction rule, the Precision, Recall and F1 Score for all the extraction rules combined would be 0.90, 0.83 and 0.86 instead of the current 0.85, 0.88 and 0.86. This means the rule negatively affects Precision but positively affects Recall by the same amount. With a different dataset, the positive effect could well outweigh the negative.

5.4 Discussion

In this section the strengths and weaknesses of the approach are discussed. The evaluation shows that the full approach successfully extracts declarative constraints from a collection of sentences while limiting the number of extracted false elements. A post-hoc analysis reveals several points of consideration. These points could help improve the designed tool, but could also pose a threat to the validity of this research.

First, the Nsubjpass and Dobj activity extraction rules could use some improvement. These two extraction rules trigger several wrongful activity extractions, and many of the extracted false elements look quite similar. For example, "Require creation" is extracted from the sentence "Creation of the claim is required, before it can be approved.". This type of false element occurs at least ten times in this research. More thorough experimentation with the activity extraction rules might have resulted in more refined rules and fewer extracted false elements, and thereby a better Precision score and F1 Score. A quick post-hoc analysis of a few of these sentences points to a possible solution: if the initial word in a Nsubjpass or Dobj dependency has "root" as parse label, an activity should not be extracted. There are also some activities with linking verbs, auxiliary verbs or modal verbs (Swan, 2005). These verb types are never used for process activities, so if an extracted element contains such a verb, it could be excluded, resulting in fewer extracted false elements as well.

Second, the threshold that an extraction rule is only implemented if it has at least two positive triggers also has a downside. The upside, and the reason this threshold is used, is that the implemented rules have some level of genericity. However, some unextracted true elements and extracted false elements appear to be unique in this research. These could possibly have been solved by an obvious new extraction rule, or an adaptation of an existing one, that appears quite generic but cannot be proven generic because it only applies to a single element. A larger data collection could have provided more of these identical cases and therefore could have resulted in more, or improved, extraction rules and better performance scores.

Third, several sentences have no extracted constraint type at all. Under the assumption applied in this research that every sentence has exactly one constraint, it would probably be effective to ensure that exactly one constraint type is always selected. If no constraint type can be extracted based on the implemented extraction rules, a probability analysis could be performed to at least select the most probable constraint type. An even simpler solution would be to always add the most common constraint type to the extracted elements if none is extracted.

Fourth, the extraction rules search for certain words, but none of them consider the order in which these words occur in the sentence. With a larger or real-life data collection this might be a relevant topic to address.

Fifth, the order of activities is currently not considered in the results. As mentioned in the challenges in section 2.2, this is very important for certain constraint types, namely Precedence and Response. For Succession constraints the order of activities does not matter, and Init and End should only have one activity, which also makes the order irrelevant. Mainly due to time limitations this issue was not researched, but it should be one of the first enhancements of the tool. Currently, the tool considers a constraint with reversed activities correct, but this would not work in real life; in fact, it would cause exactly the kind of erroneous situations that compliance checking and process documentation must prevent. Relatedly, the formatting of the extracted elements into the official Declare notation is also not performed. This would, however, be easy to implement before the tool is used in practice and was therefore considered unnecessary in this research.

The overall strength of the approach can be considered good. Performance scores between 0.80 and 0.90 do not offer perfect extraction, which would make human interference unnecessary and provide full automation. But with 80 to 90 percent of the work already done, practitioners tasked with checking the compliance of business processes against a compliance source, or with converting declarative process descriptions into declarative process models, can perform their work much more efficiently. However, the data collection used in this research is not fully representative of the real world. Individual sentences, each with exactly one constraint, were used. Actual compliance sources and business process descriptions are more complex. In other words, the performance scores of the designed tool would be lower in practice, but it could still prove worthwhile to simplify the work and provide benefits in time and money.

6 Conclusions

In this thesis, I presented an approach to automatically extract declarative constraints from natural language text. The practical applications include compliance checking and the visualization and documentation of business processes in process models. The approach combines best practices from the fields of business process management, in particular process modeling techniques, and natural language processing. First, a pre-processing step with two selected natural language processing tools is performed to collect syntactic and semantic information about the input text. Then three newly designed steps are executed to extract formalized declarative constraints: the first analyzes constraint types, the second extracts process activities and the third applies post-hoc optimizations to the results of the two previous steps.

A quantitative evaluation shows that this approach successfully extracts the majority of declarative constraints from compliance sources and business process descriptions. The evaluation furthermore reveals that each step in the approach and each extraction rule contributes to the performance of the newly designed tool. The use of certain qualitative thresholds ensured a sufficient level of genericity of the technique.

By using this approach, organizations can save time and money on compliance checking and business process documentation. The world of compliance checking keeps growing more complex, and contemporary organizations struggle to comply. In addition, the need for flexibility in business processes has induced a growth in business processes of a declarative nature. For visualization and documentation of such processes, for the sake of supervision and control, process models are created. But existing techniques can only convert explicit imperative process descriptions into process models. The technique presented in this thesis can be used to adhere to the modern standards of compliance checking and business process management.

In future work, the technique designed in this thesis can be extended with new and improved extraction and optimization rules. Developments in the field of natural language processing should also be closely monitored and applied in future research where relevant. Analysis and experimentation with more varied compliance sources and business process descriptions is essential to prove and improve the quality and reliability of the approach. The data collection used in this thesis only contained sentences with exactly one constraint. Real-life compliance sources and business process descriptions contain sentences with explanatory meaning but no extractable constraints, sentences with multiple constraints, and sentences that need to be analyzed together to extract a single constraint. Ideally, future research should aim to fully automate the process of compliance checking starting from compliance sources, and of creating declarative process models from declarative process descriptions. Another limitation of this research is the integrability of the designed tool. An open-source publication of the project could simplify future work, and an application programming interface (API) with clear documentation could make it usable and testable in practice.

References

1. Awad, A., Decker, G., & Weske, M. (2008). Efficient compliance checking using BPMN-Q and temporal logic. In International Conference on Business Process Management (pp. 326-341). Springer, Berlin, Heidelberg.

2. Becker, J., Delfmann, P., Eggert, M., & Schwittay, S. (2012). Generalizability and applicability of model-based business process compliance-checking approaches: a state-of-the-art analysis and research roadmap. Business Research, 5(2), 221-247.

3. Carreras, X., Chao, I., Padró, L., & Padró, M. (2004). FreeLing: An open-source suite of language analyzers. In LREC (pp. 239-242).

4. Chowdhury, G. G. (2003). Natural language processing. Annual Review of Information Science and Technology, 37(1), 51-89.

5. Decamps, J. P., Rochet, J. C., & Roger, B. (2004). The three pillars of Basel II: optimizing the mix. Journal of Financial Intermediation, 13(2), 132-155.

6. Di Ciccio, C. (in press)

7. El Kharbili, M., de Medeiros, A. K. A., Stein, S., & Van der Aalst, W. M. (2008). Business process compliance checking: Current state and future challenges. MobIS, 141, 107-113.

8. Fahland, D., Lübke, D., Mendling, J., Reijers, H., Weber, B., Weidlich, M., & Zugal, S. (2009). Declarative versus imperative process modeling languages: The issue of understandability. In Enterprise, Business-Process and Information Systems Modeling (pp. 353-366). Springer, Berlin, Heidelberg.

9. Fahland, D., Mendling, J., Reijers, H. A., Weber, B., Weidlich, M., & Zugal, S. (2009). Declarative versus imperative process modeling languages: The issue of maintainability. In International Conference on Business Process Management (pp. 477-488). Springer, Berlin, Heidelberg.

10. Friedrich, F., Mendling, J., & Puhlmann, F. (2011). Process model generation from natural language text. In International Conference on Advanced Information Systems Engineering (pp. 482-496). Springer, Berlin, Heidelberg.

11. Harel, D., & Rumpe, B. (2004). Modeling languages: Syntax, semantics and all that stuff.

12. Hildebrandt, T., Mukkamala, R. R., & Slaats, T. (2011). Designing a cross-organizational case management system using dynamic condition response graphs. In Enterprise Distributed Object Computing Conference (EDOC), 2011 15th IEEE International (pp. 161-170). IEEE.

13. Jeston, J., & Nelis, J. (2010). Business process management. Routledge.

14. Kao, A., & Poteet, S. R. (Eds.). (2007). Natural language processing and text mining. Springer Science & Business Media.

15. Liu, Y., Muller, S., & Xu, K. (2007). A static compliance-checking framework for business process models. IBM Systems Journal, 46(2), 335-361.

16. Maggi, F. M., Di Ciccio, C., Di Francescomarino, C., & Kala, T. (2017). Parallel algorithms for the automated discovery of declarative process models. Information Systems.

17. Mendling, J., Reijers, H. A., & Recker, J. (2010). Activity labeling in process modeling: Empirical insights and recommendations. Information Systems, 35(4), 467-482.

18. Pesic, M., Schonenberg, H., & Van der Aalst, W. M. (2007). Declare: Full support for loosely-structured processes. In Enterprise Distributed Object Computing Conference, 2007. EDOC 2007. 11th IEEE International (pp. 287-287). IEEE.

19. Pichler, P., Weber, B., Zugal, S., Pinggera, J., Mendling, J., & Reijers, H. A. (2011). Imperative versus declarative process modeling languages: An empirical investigation. In International Conference on Business Process Management (pp. 383-394). Springer, Berlin, Heidelberg.

20. Rajman, M., & Besançon, R. (1998). Text mining: natural language techniques and text mining applications. In Data Mining and Reverse Engineering (pp. 50-64). Springer US.

21. Sarbanes, P. (2002). Sarbanes-Oxley Act of 2002. In The Public Company Accounting Reform and Investor Protection Act. Washington DC: US Congress.

22. Schonenberg, H., Mans, R., Russell, N., Mulyar, N., & Van der Aalst, W. (2008). Process flexibility: A survey of contemporary approaches. In Advances in Enterprise Engineering I (pp. 16-30). Springer, Berlin, Heidelberg.

23. Slaats, T., Mukkamala, R. R., Hildebrandt, T., & Marquard, M. (2013). Exformatics declarative case management workflows as DCR graphs. In Business Process Management (pp. 339-354). Springer, Berlin, Heidelberg.

24. Swan, M. (2005). Practical English Usage. Oxford University Press.

25. Tan, A. H. (1999). Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases (Vol. 8, pp. 65-70).


Appendices


Appendix B. Declare templates and explanations

● Existence constraints:
  ○ Cardinality constraints:
    ■ Participation(x): x occurs at least once
    ■ AtMostOne(x): x occurs at most once
    ■ Existence(n, x): x occurs at least n times
    ■ Absence(n + 1, x): x occurs at most n times
  ○ Position constraints:
    ■ Init(x): x occurs first
    ■ End(x): x occurs last
● Relation constraints:
  ○ Forward unidirectional relation constraints:
    ■ RespondedExistence(x, y): if x occurs, then y must follow or precede
    ■ Response(x, y): if x occurs, then y must follow
    ■ AltResponse(x, y): if x occurs, then y must follow without x in between
    ■ ChainResponse(x, y): if x occurs, then y must follow immediately
  ○ Backward unidirectional relation constraints:
    ■ Precedence(x, y): if y occurs, then x must precede
    ■ AltPrecedence(x, y): if y occurs, then x must precede without y in between
    ■ ChainPrecedence(x, y): if y occurs, then x must precede immediately
  ○ Coupling constraints:
    ■ CoExistence(x, y): if x occurs, then y must follow or precede AND vice versa
    ■ AltSuccession(x, y): if x occurs, then y must follow without x in between AND if y occurs, then x must precede without y in between
    ■ Succession(x, y): if x occurs, then y must follow AND if y occurs, then x must precede
    ■ ChainSuccession(x, y): if x occurs, then y must follow immediately AND if y occurs, then x must precede immediately
  ○ Negative constraints:
    ■ NotCoExistence(x, y): if x occurs, then y must not occur
    ■ NotSuccession(x, y): if x occurs, then y must not follow
    ■ NotChainSuccession(x, y): if x occurs, then y must not follow immediately
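To make the template semantics above concrete, a few of the core constraints can be checked against an activity trace. This is a sketch; representing a trace as a list of activity names is an assumption for illustration:

```python
# Checking a few Declare templates against a trace of activity names.

def init(x, trace):
    return bool(trace) and trace[0] == x

def end(x, trace):
    return bool(trace) and trace[-1] == x

def response(x, y, trace):
    # Every occurrence of x must be followed by a later y.
    return all(y in trace[i + 1:] for i, a in enumerate(trace) if a == x)

def precedence(x, y, trace):
    # Every occurrence of y must be preceded by an earlier x.
    return all(x in trace[:i] for i, a in enumerate(trace) if a == y)

def succession(x, y, trace):
    return response(x, y, trace) and precedence(x, y, trace)

trace = ["Create claim", "Approve claim", "Pay claim"]
print(precedence("Create claim", "Approve claim", trace))  # True
print(succession("Approve claim", "Pay claim", trace))     # True
print(init("Approve claim", trace))                        # False
```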


Appendix D. Results table

Extraction Rule    Precision  Recall  F1 Score
Precedence 1       0.70       0.41    0.52
Precedence 2       1.00       0.12    0.21
Precedence 3       0.67       0.12    0.20
Precedence 4       0.67       0.12    0.20
Precedence         0.72       0.76    0.74
Response 1         0.45       0.36    0.40
Response 2         0.44       0.29    0.35
Response           0.45       0.64    0.53
Succession 1       1.00       0.60    0.75
Succession 2       1.00       0.20    0.33
Succession         1.00       0.80    0.89
Init               0.56       1.00    0.71
End                1.00       1.00    1.00
Constraint types   0.92       0.78    0.85
Nsubjpass          0.81       0.46    0.59
Dobj               0.50       0.10    0.16
Prep               1.00       0.39    0.56
Activities         0.82       0.95    0.88
No optimization    0.73       0.80    0.76
All                0.85       0.88    0.86
