MSc Artificial Intelligence
Master Thesis
Differentiable Fuzzy Logics
Integrated Learning and Reasoning using Gradient Descent
by
Emile van Krieken
11282304
February 12, 2019
36 EC, June 2018 - February 2019

Supervisors:
Dr. E. Acar
Prof. dr. F.A.H. van Harmelen
Assessor:
T.N. Kipf MSc
Abstract

In recent years there has been a push to integrate symbolic AI and deep learning, as it is argued that the strengths and weaknesses of these approaches are complementary. One such trend in the literature is a weakly supervised learning technique that we call Differentiable Logics. It employs prior background knowledge described using logic to benefit from unlabeled and noisy data. By interpreting logical symbols using neural networks, this background knowledge can be added to regular loss functions used in deep learning, thereby integrating reasoning and learning. In particular, we analyze how fuzzy logic behaves in a differentiable setting, in an approach we call Differentiable Fuzzy Logics. One of our findings is that there is a strong and influential imbalance between the gradients flowing into the antecedent and those flowing into the consequent of the implication. Furthermore, we show that it is possible to use Differentiable Logics for semi-supervised learning on the MNIST dataset, and we discuss extensions to large-scale problems.
Acknowledgements
Firstly, I would like to thank my excellent supervisors Erman Acar and Frank van Harmelen. I was surprised by their detailed feedback, pushing me to write clearly and with great precision. They always pointed my work into the right direction. Their enthusiasm for this research direction was obvious and very contagious. We talked about many fascinating ideas and had great discussions. I could not have hoped for better.
I also want to thank Peter Bloem, who helped me out greatly with his expertise on the machine learning side of things. Furthermore, I would like to thank Haukur Páll Jónsson, Jasper Driessens, Finn Potason, Ilaria Tiddi, Luciano Serafini and Fije van Overeem for additional great discussions, feedback and insight.
Of course I also need to thank my fantastic friends, family and girlfriend, who supported me through everything during the writing of this thesis. In particular I want to mention the fantastic days in the office with my fellow student Alex, it has been fun.
Finally, I want to thank my mother. I have no words for how you always supported me and how you always were proud of me. I miss you.
Contents
1 Introduction
1.1 Reasoning and Learning using Gradient Descent
1.2 Derivatives for Reasoning
1.3 Research Questions and Contributions
1.4 Outline

2 Background
2.1 Relational Logic
2.1.1 Syntax
2.1.2 Semantics
2.2 Fuzzy Logic
2.2.1 Fuzzy Operators
2.2.2 Fuzzy Implications

3 Differentiable Fuzzy Logics
3.1 Differentiable Fuzzy Logics
3.1.1 Semantics
3.1.2 Learning using Best Satisfiability
3.2 Derivatives of Operators
3.2.1 Aggregation
3.2.2 Conjunction and Disjunction
3.2.3 Implication
3.3 Differentiable Product Fuzzy Logic versus Semantic Loss
3.3.1 Semantic Loss
3.3.2 Differentiable Product Fuzzy Logic: Probabilistic Approximation?

4 Experiments
4.1 3-SAT
4.2 Measures
4.3 MNIST Experiments
4.3.1 Formulas
4.3.2 Experimental Setup
4.3.3 Results
4.3.4 Conclusion
4.4 Visual Genome Experiments
4.4.1 Logical formulation
4.4.2 Formulas
4.4.3 Experimental Setup
4.4.4 Results

5 Related Work
5.1 Differentiable Fuzzy Logics
5.2 Differentiable Fuzzy Logics on Artificial Data
5.3 Projected Fuzzy Logics
5.4 Differentiable Probabilistic Logics

6 Discussion and Conclusion
6.1 Discussion
6.2 Conclusion

A Derivations of Used Functions
A.1 p-Error Aggregators
A.2 Sigmoidal Aggregators
A.3 Nilpotent Aggregator

B Differentiable Product Fuzzy Logic

C MNIST Results Without Supervised Learning of same
Chapter 1
Introduction
In recent years there has been a push to integrate symbolic and statistical approaches to Artificial Intelligence (AI) (A. S. d. Garcez, Broda, and Gabbay 2012; Besold et al. 2017). This push coincides with critiques of the statistical method deep learning (Marcus 2018; Pearl 2018), which has been the dominant focus of the AI community in the last decade. While deep learning has caused many important breakthroughs in computer vision (Brock, Donahue, and Simonyan 2018), natural language processing (Devlin et al. 2018) and reinforcement learning (Silver et al. 2017), the concern is that progress will halt if its shortcomings are not dealt with. Among these is the massive amount of data deep learning needs to be effective, requiring thousands or even millions of examples to properly learn a concept. Symbolic AI, on the other hand, can reuse concepts and knowledge learned from a small amount of data. It is also usually far easier to interpret the decisions of a symbolic AI system, in contrast to deep learning models that act like black boxes. This is because the symbols that symbolic AI reasons with refer to concepts that have a clear meaning to humans, while deep learning uses millions or billions of numerical parameters to compute mathematical models that are extremely hard to grasp. Finally, it is much easier to describe a priori domain knowledge symbolically and to integrate it into such a system.
A major downside of symbolic AI is that it is unable to capture the nuances of sensory data, which is noisy and high-dimensional. Furthermore, it is difficult to express how small changes in the input data should produce different outputs. This is related to the symbol grounding problem. (Harnad 1990) defines the symbol grounding problem as how “the semantic interpretation of a formal symbol system can be made intrinsic to the system, rather than just parasitic on the meanings in our heads”. As we mentioned, symbols refer to concepts that have an intrinsic meaning to us humans, but computers that reason and act on these symbols using symbol manipulation cannot understand this meaning. In contrast to symbolic AI, a properly trained deep learning model excels at modeling complex sensory data and is, for example, able to recognize the guitar in the top right image of Figure 1.1. These models could provide exactly the intrinsic meaning needed to bridge the gap between symbolic systems and the real world. Therefore, several recent approaches, among which (Diligenti, Roychowdhury, and Gori 2017; Garnelo, Arulkumaran, and Shanahan 2016; Serafini and A. D. Garcez 2016) and (Manhaeve et al. 2018), aim to interpret the symbols used in logic-based systems using deep learning models. These are some of the first systems to implement a proposition from (Harnad 1990), namely “a hybrid nonsymbolic/symbolic system (...) in which the elementary symbols are grounded in (...) nonsymbolic representations that pick out, from their proximal sensory projections, the distal object categories to which the elementary symbols refer.”
1.1 Reasoning and Learning using Gradient Descent
This thesis is about what we call Differentiable Logics. Differentiable Logics integrate reasoning and learning by using logical formulas which express a priori background knowledge. The symbols in these formulas are interpreted using a deep learning model whose parameters are to be learned. However, these formulas cannot by themselves handle uncertainty, and they represent discrete structures which are not differentiable. Differentiable Logics therefore construct differentiable loss functions based on these formulas. As these loss functions are fully differentiable, we can backpropagate through them into the deep learning model; by minimizing them using gradient descent, we ensure that the model's interpretation of the symbols is consistent with the background knowledge.
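To make this concrete, the following minimal sketch (an illustration of the general idea only, not the implementation used in this thesis; all names are ours) treats the truth values of two propositions as sigmoids of real-valued parameters, scores their conjunction with the product t-norm introduced in Chapter 2, and minimizes the loss 1 − truth by gradient descent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Degree of truth of "p and q" under the product t-norm,
# with p and q parameterized by real numbers u and v.
def truth(u, v):
    return sigmoid(u) * sigmoid(v)

# Analytic gradient of the loss 1 - truth(u, v).
def grad(u, v):
    a, b = sigmoid(u), sigmoid(v)
    return -b * a * (1 - a), -a * b * (1 - b)

u, v = -1.0, 0.5              # initial, untrained parameters
before = truth(u, v)
for _ in range(200):          # plain gradient descent, learning rate 1.0
    gu, gv = grad(u, v)
    u, v = u - gu, v - gv
after = truth(u, v)
print(before, after)          # the degree of truth rises toward 1
```

In a real Differentiable Logic the sigmoids would be neural network outputs and the formula far larger, but the mechanism is the same: the gradient of the loss flows into the parameters that produce the truth values.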
The Differentiable Logics this thesis focuses on are Differentiable Fuzzy Logics (DFL), which are inspired by Real Logic (Serafini and A. D. Garcez 2016). In DFL the background knowledge is expressed in first-order logic and uses fuzzy logic semantics (Klir and Yuan 1995). In contrast to having binary truth values that are either true or false, propositions in fuzzy logic have truth values that can be any real number between 0 and 1, called the degree of truth. In DFL, the predicate, function and constant symbols are interpreted using the deep learning model. By maximizing the degree of truth of the background knowledge using gradient descent, learning and reasoning are performed in parallel. Another approach is Differentiable Probabilistic Logics (Manhaeve et al. 2018), which instead interprets propositions as being true with some probability.

Figure 1.1: Three images in the Visual Genome dataset annotated with their scene graph. Figure taken from visualgenome.org.
By adding the DFL loss function to other loss functions commonly used in deep learning, DFL can be used for more challenging machine learning tasks than (fully) supervised learning. These methods fall under the umbrella of weakly supervised learning (Zhou 2017). For example, DFL can handle noisy or inaccurate supervision by correcting inconsistencies between the labels, the model's predictions and the background knowledge (Donadello, Serafini, and A. d. Garcez 2017). Furthermore, by splitting the problem of arithmetic addition into recognizing digits using deep learning models and adding using symbolic reasoning, (Manhaeve et al. 2018) solves the problem of recognizing the sum of two handwritten numbers. A third application, and the one we will focus on, is semi-supervised learning, in which only a limited fraction of the dataset is labeled and a large part is unlabeled (Xu et al. 2018; Hu et al. 2016). If the predictions of the deep learning model on the unlabeled data are logically inconsistent, Differentiable Logics can be used to correct this.
We apply semi-supervised learning using DFL to the Scene Graph Parsing (SGP) task. In SGP, the goal is to generate a semantic description of the objects in an image (Johnson, Gupta, and Fei-Fei 2018). This description is represented as a labeled directed graph known as a scene graph. An example of a labeled dataset for this problem is Visual Genome (Krishna et al. 2017). Figure 1.1 shows some images from this dataset annotated with a part of their scene graph. The binary relations in particular make it difficult to train a strong deep learning model on this dataset, as there are many different pairs of objects that could be related. An example of this data sparsity problem is that there are likely few images of rabbits eating cabbage in Visual Genome, making it challenging to learn to recognize this concept. Furthermore, because labeling scene graphs is challenging and expensive for humans, there are not many images with labeled scene graphs. However, far larger datasets without such labels exist, such as ImageNet (Russakovsky et al. 2015). Such a dataset could be used to reduce the data sparsity when training a model for SGP. By expressing a priori background knowledge of the world, Differentiable Fuzzy Logics can be a viable candidate for this approach.
1.2 Derivatives for Reasoning
The major contribution of this thesis is an analysis of the choice of operators used to compute the logical connectives in DFL. An example of such an operator is a t-norm which connects two fuzzy propositions and returns the degree of truth of the event that both propositions are true. Therefore, a t-norm generalizes the
Boolean conjunction. Similarly, an implication operator generalizes the Boolean implication. These operators are differentiable and can thus be used in DFL. Interestingly, the derivatives of these operators determine how DFL corrects the deep learning model when its predictions are inconsistent with the background knowledge. Surprisingly, there are only a few works in the literature on the qualitative properties of these derivatives, even though the choice of operators is integral to both theory and practice.
For example, assume that the deep learning model observes a non-black raven. We might have some background knowledge encoded in Real Logic saying that all ravens are black. Note that this observation is not consistent with the background knowledge. During the backpropagation step, the deep learning model is corrected by DFL, and the way in which this is done is determined by our choice of implication operator. One way to correct this observation would be to tell the deep learning model that it was a black raven instead.
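As a concrete illustration (our own, with hypothetical numbers), take the Reichenbach implication I(a, c) = 1 − a + a·c from Section 2.2.2: its partial derivatives indicate which input gradient ascent on the degree of truth would adjust.

```python
# Reichenbach implication I(a, c) = 1 - a + a*c and its partial derivatives.
def I_rc(a, c):
    return 1 - a + a * c

def dI_da(a, c):   # derivative w.r.t. the antecedent's truth value
    return c - 1

def dI_dc(a, c):   # derivative w.r.t. the consequent's truth value
    return a

# Observation: the model is confident it sees a raven (a = 0.9)
# that is not black (c = 0.1).
a, c = 0.9, 0.1
print(I_rc(a, c))   # low degree of truth of "raven -> black" (0.19)
print(dI_da(a, c))  # -0.9: lowering a ("it was not a raven") raises the truth
print(dI_dc(a, c))  #  0.9: raising c ("it was black after all") raises it too
```

Both directions restore consistency, and which one the gradient favors depends entirely on the chosen operator; this is the kind of behavior analyzed in Chapter 3.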
1.3 Research Questions and Contributions
The main research question that we wish to answer is as follows:
“What is the effect of the choice of operators used to compute the logical connectives in Differentiable Fuzzy Logics? ”
To answer this question, we first analyze the theoretical properties of four types of operators: Aggregation functions, which are used to compute the universal quantifier ∀, conjunction and disjunction operators, which are used to compute the conjunction and disjunction connectives ∧ and ∨, and fuzzy implications which are used to compute the implication connective. Then, we perform two different experiments to compare a list of combinations of operators in practice. We conclude with several recommendations for operators to use in Differentiable Fuzzy Logics.
The second research question is on the practical problem of Scene Graph Parsing:
“Can the performance of a deep learning model on the Scene Graph Parsing task be improved with semi-supervised learning using Differentiable Fuzzy Logics? ”
To answer this question, we split the Visual Genome dataset into two datasets, where the first part is labeled and the second unlabeled. Next, we devise a set of formulas that express background knowledge of the Visual Genome dataset that is used in our experiments with DFL. The deep learning model is trained using both a supervised loss function on the labeled dataset and the DFL loss function on the unlabeled dataset. The latter computes the consistency of the model’s predictions with the background knowledge. We notice no clear improvement compared to the supervised baseline and present challenges with applying Differentiable Logics to a complex task like Scene Graph Parsing.
1.4 Outline
In Chapter 2 we introduce the background that is relevant for this work. In particular, we discuss relational logics in Section 2.1 and introduce fuzzy logics with several common operators in Section 2.2. In Chapter 3 we present DFL (Section 3.1) along with theoretical properties of the operators used in it in Section 3.2. In Chapter 4, we test combinations of aggregation functions and disjunction operators to solve the 3-SAT problem (Section 4.1), apply DFL in a semi-supervised setting to the MNIST dataset (Section 4.3) and use a priori background knowledge to attempt to solve the same task for the Visual Genome dataset (Section 4.4). In Chapter 5 we discuss related work and in Chapter 6 we conclude with a small discussion about challenges and possible future work.
Chapter 2

Background
2.1 Relational Logic
In this thesis, we will be using relational knowledge bases, which are sets of sentences expressed in a relational logic language L. By using relational logic, we will be limiting ourselves to function-free formulas.
2.1.1 Syntax
Formulas are constructed using constants (or objects) C = {c1, c2, ...}, variables x1, x2, ..., predicates P = {P, R, partOf, ...} and the logical connectives ¬, ∨, ∧, → and quantifiers ∃, ∀. There is an arity function σ : P → N that maps each predicate to a natural number. The syntax of the logic that we will use throughout this thesis is defined as follows.
Definition 2.1. A term in L is an individual variable or a constant symbol. If t1, ..., tn are terms and P ∈ P has arity n, then P(t1, ..., tn) is an atomic formula.
The well-formed formulas of L are defined inductively. An atomic formula is a formula. If φ is a formula, then ¬φ (negation) is also a formula. If φ and ψ are formulas, then φ ∨ ψ (disjunction), φ ∧ ψ (conjunction) and φ → ψ (implication) are also formulas. In an implication, φ is called the antecedent and ψ the consequent. If φ is a formula in which the variable x appears, then ∃x φ (existential quantification) and ∀x φ (universal quantification) are also formulas; x is then said to be a bound variable.
If φ is an atomic formula, then φ and ¬φ are literals.
We will only consider formulas in prenex form, namely formulas that start with quantifiers and bound variables followed by a quantifier-free subformula. An example of a formula in prenex form is
∀x, y P(x, y) ∧ Q(x) → R(y).
2.1.2 Semantics
To evaluate the truth value of a formula, we will need a way to interpret all symbols in the language so we can assign a truth value to every sentence. For this, we introduce two orthogonal semantics that we will use to describe and analyze our algorithms.
2.1.2.1 Standard Semantics
Traditional (or Tarskian) semantics (Van Dalen 2004) maps symbols in L to objects and relations using a structure which consists of a domain of discourse and an interpretation.
Definition 2.2. A domain of discourse is a nonempty set of objects O = {o1, o2, ...} that specifies the range of quantifiers. An interpretation η is a function: for each constant symbol c, η(c) is an object in O, and for each predicate symbol P with arity m, η(P) is a function Oᵐ → {true, false}.
For example, the truth value of P(c1, c2) is η(P)(η(c1), η(c2)). Using a structure, we can easily define the semantics of full sentences inductively. For this, we first need variable assignments, which associate variable symbols with elements of the domain.
Definition 2.3. A variable assignment µ is a function that associates each variable symbol x with an object from the domain O and each constant symbol c with its interpretation η(c).
Definition 2.4. The truth values of formulas are determined using the valuation function e, which uses a variable assignment µ and a structure ⟨O, η⟩, and is defined inductively as follows:

• For atomic formulas P(t1, ..., tm), e(P(t1, ..., tm)) = η(P)(µ(t1), ..., µ(tm)), where η(P) is the interpretation of the predicate P.
• If φ and ψ are formulas, then e(¬φ) is true whenever e(φ) is not, e(φ ∧ ψ) is true whenever both e(φ) and e(ψ) are, e(φ ∨ ψ) is true if at least one of e(φ) and e(ψ) is, and e(φ → ψ) is true if e(φ) is false or e(ψ) is true.
• For formulas with an existential quantifier, e(∃x φ) is true iff there is an object o ∈ O such that e(φ) is true under the variable assignment µ′ that differs from µ only in that x is assigned to o.
• For formulas with a universal quantifier, e(∀x φ) is true iff for every object o ∈ O, e(φ) is true under the assignment µ′ that differs from µ only in that x is assigned to o.
We say that a formula φ is satisfiable if there is a structure so that e(φ) is true. Such a structure is called a model of φ.
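The inductive definition above can be transcribed directly as a recursive evaluator. The sketch below is our own illustration (formulas as nested tuples, predicates as Boolean functions), not code from this thesis:

```python
# Formulas as nested tuples, e.g. ('forall', 'x', ('imp', ('P', 'x'), ('Q', 'x'))).
# 'domain' is the set O; 'interp' maps predicate names to Boolean functions;
# 'mu' is the variable assignment, a dict from term symbols to objects.
def ev(phi, domain, interp, mu):
    op = phi[0]
    if op == 'not':
        return not ev(phi[1], domain, interp, mu)
    if op == 'and':
        return ev(phi[1], domain, interp, mu) and ev(phi[2], domain, interp, mu)
    if op == 'or':
        return ev(phi[1], domain, interp, mu) or ev(phi[2], domain, interp, mu)
    if op == 'imp':
        return (not ev(phi[1], domain, interp, mu)) or ev(phi[2], domain, interp, mu)
    if op == 'exists':
        return any(ev(phi[2], domain, interp, {**mu, phi[1]: o}) for o in domain)
    if op == 'forall':
        return all(ev(phi[2], domain, interp, {**mu, phi[1]: o}) for o in domain)
    # atomic formula: ('P', t1, ..., tm)
    return interp[op](*(mu[t] for t in phi[1:]))

# Example: O = {1, 2, 3}, P = "is even", Q = "is at most 2".
domain = {1, 2, 3}
interp = {'P': lambda o: o % 2 == 0, 'Q': lambda o: o <= 2}
phi = ('forall', 'x', ('imp', ('P', 'x'), ('Q', 'x')))  # every even number is <= 2
print(ev(phi, domain, interp, {}))  # True on this domain
```

DFL (Chapter 3) follows exactly this recursive structure, but replaces the Boolean connectives with the differentiable fuzzy operators of Section 2.2.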
2.1.2.2 Herbrand Semantics
Herbrand semantics (Shoenfield 2010) refers not to external objects but rather to ground atoms. We will first introduce some definitions:
Definition 2.5. A ground term is a term which does not contain variables. The set of all ground terms is called the Herbrand universe. If t1, ..., tn are ground terms and P(t1, ..., tn) is an atomic formula, then it is also a ground atom. The set of all possible ground atoms is called the Herbrand base. A Herbrand interpretation assigns a truth value to every ground atom in the Herbrand base.
The difference with traditional semantics is that in Herbrand semantics the only objects are the ground terms. Because we do not use function symbols in our relational logic, all the objects we are interested in are the (named) constants C.
Given such a set of constants C, we can compute the full grounding of a formula in prenex normal form. This is done by assigning the variables bound by the quantifiers to every possible combination of objects from the constants C. Each resulting ground formula, that is, a formula without free variables, is called an instance of the formula. The conjunction of all instances is used to compute the truth value of the universal quantifier ∀, and the disjunction of all instances is used for the existential quantifier ∃.
Example 2.1. Say we have a language with two constants C = {c1, c2} and predicates P = {P, Q}, where P is a unary predicate and Q is a binary predicate. The full grounding of the formula ∀x, y P(x) → Q(x, y) is given by (P(c1) → Q(c1, c1)) ∧ (P(c1) → Q(c1, c2)) ∧ (P(c2) → Q(c2, c1)) ∧ (P(c2) → Q(c2, c2)). The Herbrand base is {P(c1), P(c2), Q(c1, c1), Q(c1, c2), Q(c2, c1), Q(c2, c2)}, and any subset of it (the set of ground atoms assigned true) is a possible Herbrand interpretation.
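The full grounding of Example 2.1 amounts to enumerating the Cartesian product of the constants over the bound variables, which is a one-liner with itertools (our sketch; the string formatting is purely illustrative):

```python
from itertools import product

# Generate all variable assignments for a prenex formula over a set of constants,
# one per element of constants^len(variables), as in Example 2.1.
def ground(variables, constants):
    for combo in product(constants, repeat=len(variables)):
        yield dict(zip(variables, combo))

constants = ['c1', 'c2']
instances = [f"P({g['x']}) -> Q({g['x']},{g['y']})"
             for g in ground(['x', 'y'], constants)]
print(instances)
# ['P(c1) -> Q(c1,c1)', 'P(c1) -> Q(c1,c2)', 'P(c2) -> Q(c2,c1)', 'P(c2) -> Q(c2,c2)']
```

Note that the number of instances grows as |C|^k for k bound variables, which is why grounding large formulas over many constants becomes expensive.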
2.2 Fuzzy Logic
Fuzzy logic is, contrary to classical logic, a real-valued logic. Truth values of propositions are not binary, that is, either true or false, but instead are real numbers in [0, 1], where 0 denotes completely false and 1 denotes completely true. Fuzzy logic models the concept of vagueness by arguing that the truth value of many propositions can be noisy to measure, or subjective. For example, the truth value of the predicate old is not easily determined. A person who is 50 years old would be called old by some, but most would certainly call someone who is 90 old as well. For this second person, however, the predicate old clearly holds to a higher degree, and this person would probably deem the first rather young.
We will be looking at predicate t-norm fuzzy logics in particular. Predicate fuzzy logics extend normal fuzzy logics with universal and existential quantification, mimicking the relational logic described in Section 2.1.
2.2.1 Fuzzy Operators
We will first introduce the semantics of the fuzzy operators ∧, ∨ and ¬ that are used to connect truth values of fuzzy predicates. We follow (Jayaram and Baczynski 2008) in this section and refer to it for proofs and additional results.
2.2.1.1 Properties of Functions
We first define several common properties of functions.

Definition 2.6. A function f : D → D is called
• continuous if for any a ∈ D, limx→af (x) = f (a);
• left-continuous if for each a ∈ D and every ε > 0 there exists a δ > 0 such that |f(x) − f(a)| < ε whenever a − δ < x < a;
• increasing if for all a, b ∈ D, if a ≤ b then f (a) ≤ f (b), and similarly for decreasing;
• strictly increasing if for all a, b ∈ D, if a < b then f(a) < f(b), and similarly for strictly decreasing.

A function f : D² → D is called
• commutative if for all a, b ∈ D, f (a, b) = f (b, a).
• associative if for all a, b, c ∈ D, f (f (a, b), c) = f (a, f (b, c)).
Left-continuity informally means that no ‘jumps’ occur when a point is approached from the left.

2.2.1.2 Fuzzy Negation
The functions that are used to compute the negation of a truth value are called fuzzy negations.
Definition 2.7. A fuzzy negation is a decreasing function N : [0, 1] → [0, 1] such that N(0) = 1 and N(1) = 0. N is called strict if it is strictly decreasing and continuous, and strong if it is an involution, that is, for all a ∈ [0, 1], N(N(a)) = a.
In this thesis we will exclusively use the strict and strong classic negation N_C(a) = 1 − a.
2.2.1.3 Triangular Norms
The functions that are used to compute the conjunction of two truth values are called t-norms.
Definition 2.8. A t-norm (triangular norm) is a function T : [0, 1]² → [0, 1] that is commutative and associative, and satisfies

• Monotonicity: For all a ∈ [0, 1], T(a, ·) is increasing and
• Neutrality: For all a ∈ [0, 1], T(1, a) = a.

The phrase ‘T(a, ·) is increasing’ means that whenever 0 ≤ b1 ≤ b2 ≤ 1, then T(a, b1) ≤ T(a, b2).
Definition 2.9. A t-norm T can have the following properties:
1. Continuity: A continuous t-norm is continuous in both arguments.
2. Left-continuity: A left-continuous t-norm is left-continuous in both arguments.
3. Idempotency: An idempotent t-norm has the property that for all a ∈ [0, 1], T (a, a) = a.
4. Strict monotonicity: A strictly monotone t-norm has the property that for all a ∈ (0, 1], T(a, ·) is strictly increasing.
5. Strict: A strict t-norm is continuous and strictly monotone.
Table 2.1 shows several common t-norms that we will investigate in this thesis alongside their properties. The product t-norm has a counterpart in probability theory, namely the probability that two independent events both occur.
Name              | T-norm                                                         | Properties
Gödel (minimum)   | T_G(a, b) = min(a, b)                                          | idempotent, continuous
Product           | T_P(a, b) = a · b                                              | strict
Łukasiewicz       | T_LK(a, b) = max(a + b − 1, 0)                                 | continuous
Nilpotent minimum | T_nM(a, b) = 0 if a + b ≤ 1, min(a, b) otherwise               | left-continuous
Yager             | T_Y(a, b) = max(1 − ((1 − a)^p + (1 − b)^p)^(1/p), 0), p ≥ 1   | continuous
Hamacher [1]      | T_H(a, b) = a·b / (v + (1 − v)(a + b − a·b)), v ≥ 0            | strict
Trigonometric [2] | T_T(a, b) = (2/π) arcsin(sin(aπ/2) · sin(bπ/2))                | strict

Table 2.1: Some common t-norms.

[1] See (László Gál et al. 2014). For v = 0, this is called the Hamacher product; v = 1 gives the normal product norm.
[2] See (Gál, Lovassy, and Kóczy 2010).
Name                        | T-conorm                                                     | Properties
Gödel (maximum)             | S_G(a, b) = max(a, b)                                        | idempotent, continuous
Product (probabilistic sum) | S_P(a, b) = a + b − a·b                                      | strict
Łukasiewicz                 | S_LK(a, b) = min(a + b, 1)                                   | continuous
Nilpotent maximum           | S_nM(a, b) = 1 if a + b ≥ 1, max(a, b) otherwise             | right-continuous
Yager                       | S_Y(a, b) = min((a^p + b^p)^(1/p), 1), p ≥ 1                 | continuous
Hamacher                    | S_H(a, b) = (a + b − a·b − (1 − v)·a·b) / (1 − (1 − v)·a·b), v ≥ 0 | strict
Trigonometric               | S_T(a, b) = (2/π) arccos(cos(aπ/2) · cos(bπ/2))              | strict

Table 2.2: Some common t-conorms.
2.2.1.4 Triangular Conorms
The functions that are used to compute the disjunction of two truth values are called t-conorms or s-norms.

Definition 2.10. A t-conorm (triangular conorm, also known as s-norm) is a function S : [0, 1]² → [0, 1] that is commutative and associative, and satisfies

• Monotonicity: For all a ∈ [0, 1], S(a, ·) is increasing and
• Neutrality: For all a ∈ [0, 1], S(0, a) = a.
T-conorms can be obtained from t-norms using De Morgan's laws from classical logic, in particular p ∨ q = ¬(¬p ∧ ¬q). Therefore, if T : [0, 1]² → [0, 1] is a t-norm and N_C the classic negation, then T's N_C-dual S : [0, 1]² → [0, 1] is calculated using

S(a, b) = 1 − T(1 − a, 1 − b)    (2.1)
Table 2.2 shows several common t-conorms derived using Equation 2.1 and the t-norms from Table 2.1. The same optional properties as those for t-norms in Definition 2.9 can hold for t-conorms and are presented in the same table. The t-conorm of the product t-norm also has a probabilistic interpretation, namely the probability that at least one of two independent events is true.
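Equation 2.1 is easy to check numerically. The sketch below (our own illustration) computes the N_C-duals of three t-norms from Table 2.1 and recovers the corresponding t-conorms of Table 2.2:

```python
# A few t-norms from Table 2.1 and their NC-duals via S(a,b) = 1 - T(1-a, 1-b).
t_norms = {
    'godel':       lambda a, b: min(a, b),
    'product':     lambda a, b: a * b,
    'lukasiewicz': lambda a, b: max(a + b - 1, 0),
}

def nc_dual(T):
    return lambda a, b: 1 - T(1 - a, 1 - b)

a, b = 0.7, 0.4
S_P = nc_dual(t_norms['product'])
print(S_P(a, b))                              # probabilistic sum a + b - a*b
print(nc_dual(t_norms['godel'])(a, b))        # max(a, b)
print(nc_dual(t_norms['lukasiewicz'])(a, b))  # min(a + b, 1)
```

For a = 0.7 and b = 0.4 this gives 0.82, 0.7 and 1, agreeing with the closed forms in Table 2.2.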
2.2.1.5 Aggregation operators
The functions that are used to compute quantifiers like ∀ and ∃ are aggregation functions (Y. Liu and Kerre 1998).
Definition 2.11. An aggregation operator is a function A : [0, 1]ⁿ → [0, 1] that is symmetric and increasing in each argument, and for which A(0, ..., 0) = 0 and A(1, ..., 1) = 1. A symmetric function is one whose output is the same for every ordering of its arguments.
Note that aggregation operators are essentially variadic functions, that is, functions that are defined for any finite number of arguments. For this reason we will often use the notation

A_{i=1..n} xi := A(x1, ..., xn).

Simple examples of aggregation operators are found by extending t-norms from 2-dimensional input to n-dimensional input using

A_T(x1, x2) = T(x1, x2)    (2.2)
A_T(x1, x2, ..., xn) = T(x1, A_T(x2, ..., xn))    (2.3)
Name              | Type   | Aggregation operator                                       | Characteristics
Minimum           | anding | A_TG(x1, ..., xn) = min(x1, ..., xn)                       | generalizes T_G
Product           | anding | A_TP(x1, ..., xn) = x1 · x2 · ... · xn                     | generalizes T_P
Łukasiewicz       | anding | A_TLK(x1, ..., xn) = max(x1 + ... + xn − (n − 1), 0)       | generalizes T_LK
Maximum           | oring  | A_SG(x1, ..., xn) = max(x1, ..., xn)                       | generalizes S_G
Probabilistic sum | oring  | A_SP(x1, ..., xn) = 1 − (1 − x1) · ... · (1 − xn)          | generalizes S_P
Bounded sum       | oring  | A_SLK(x1, ..., xn) = min(x1 + ... + xn, 1)                 | generalizes S_LK

Table 2.3: Some common aggregation operators.
where T is any t-norm. Because of the commutativity and associativity of T , the ordering of the arguments is irrelevant and thus AT is symmetric. The other required properties also follow from the definition of t-norms.
These operators do well for modeling the ∀ quantifier, as it can be seen as a series of conjunctions. We can do the same for s-norms:
A_S(x1, x2) = S(x1, x2)    (2.4)
A_S(x1, x2, ..., xn) = S(x1, A_S(x2, ..., xn))    (2.5)
where S is any s-norm. These operators in turn do well for modeling the ∃ quantifier, as it can be seen as a series of disjunctions.
Table 2.3 shows some common aggregation operators that we will talk about.
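The operators of Table 2.3 transcribe directly to code (our sketch, assuming Python 3.8+ for math.prod):

```python
import math

# The n-ary aggregation operators of Table 2.3 (illustrative sketch).
def agg_min(xs):  return min(xs)                             # generalizes T_G
def agg_prod(xs): return math.prod(xs)                       # generalizes T_P
def agg_luk(xs):  return max(sum(xs) - (len(xs) - 1), 0)     # generalizes T_LK
def agg_max(xs):  return max(xs)                             # generalizes S_G
def agg_psum(xs): return 1 - math.prod(1 - x for x in xs)    # generalizes S_P
def agg_bsum(xs): return min(sum(xs), 1)                     # generalizes S_LK

xs = [0.9, 0.8, 0.5]
print(agg_min(xs), agg_prod(xs), agg_luk(xs))   # 'anding' aggregators for forall
print(agg_max(xs), agg_psum(xs), agg_bsum(xs))  # 'oring' aggregators for exists
```

Note how differently the 'anding' aggregators treat the same inputs: the minimum only looks at the worst instance, while the product and Łukasiewicz aggregators combine all of them; this difference in behavior is exactly what the derivative analysis of Chapter 3 studies.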
2.2.2 Fuzzy Implications
The functions that are used to compute the implication of two truth values are called fuzzy implications (Jayaram and Baczynski 2008).
Definition 2.12. A fuzzy implication is a function I : [0, 1]² → [0, 1] such that for all a, c ∈ [0, 1], I(·, c) is decreasing, I(a, ·) is increasing, and I(0, 0) = 1, I(1, 1) = 1 and I(1, 0) = 0.
From this definition it follows that I(0, 1) = 1: since I(0, 0) = 1 and I(0, ·) is increasing, I(0, 1) cannot be any lower.
Definition 2.13. Let N be a fuzzy negation. A fuzzy implication I can have several properties that hold for all a, b, c ∈ [0, 1]:
1. Left-neutrality: For a left-neutral (LN) fuzzy implication holds that I(1, c) = c.
2. Exchange principle: For a fuzzy implication that satisfies the exchange principle (EP) holds that I(a, I(b, c)) = I(b, I(a, c)).
3. Identity principle: For a fuzzy implication that satisfies the identity principle (IP) holds that I(a, a) = 1.
4. Contraposition: For a fuzzy implication that is contrapositive symmetric with respect to N (denoted CS(N)) holds that I(a, c) = I(N(c), N(a)).
5. Left-contraposition: For a fuzzy implication that is left-contrapositive symmetric with respect to N (denoted L-CS(N )) holds that I(N (a), c) = I(N (c), a).
6. Right-contraposition: For a fuzzy implication that is right-contrapositive symmetric with respect to N (denoted R-CS(N )) holds that I(a, N (c)) = I(c, N (a)).
All these statements generalize a law from classical logic. A left neutral fuzzy implication generalizes (1 → p) = p, that is, if we know that the antecedent is true, p captures the truth value of 1 → p. The exchange principle generalizes p → (q → r) = q → (p → r), and the identity principle generalizes that p → p is a tautology.
When a fuzzy implication is contrapositive symmetric (with respect to a fuzzy negation N), it generalizes p → q = ¬q → ¬p. Left-contraposition furthermore generalizes ¬p → q = ¬q → p, and right-contraposition generalizes p → ¬q = q → ¬p.
Name                  | T-conorm | S-implication                                      | Properties
Gödel (Kleene-Dienes) | S_G      | I_KD(a, c) = max(1 − a, c)                         | all but IP
Product (Reichenbach) | S_P      | I_RC(a, c) = 1 − a + a·c                           | all but IP
Łukasiewicz           | S_LK     | I_LK(a, c) = min(1 − a + c, 1)                     | all
Nilpotent (Fodor)     | S_nM     | I_FD(a, c) = 1 if a ≤ c, max(1 − a, c) otherwise   | all

Table 2.4: Some common S-implications, derived from the four common t-conorms.
2.2.2.1 S-Implications
In classical logic, the (material) implication is defined as follows: p → q = ¬p ∨ q
Using this definition, we can use a t-conorm S and a fuzzy negation N to construct a fuzzy implication.

Definition 2.14. Let S be a t-conorm and N a fuzzy negation. The function I_{S,N} : [0, 1]² → [0, 1] is called an (S, N)-implication and is defined for all a, c ∈ [0, 1] as

I_{S,N}(a, c) = S(N(a), c).    (2.6)

If N is a strong fuzzy negation, then I_{S,N} is called an S-implication (or strong implication).
As we will only consider the classical negation NC, we omit the N and simply use IS to refer to IS,NC. All S-implications IS are fuzzy implications and satisfy LN, EP and R-CP(N). Additionally, if the negation N is strong, IS satisfies CP(N), and if, in addition, it is strict, it also satisfies L-CP(N). In Table 2.4 we show several S-implications that use the strong fuzzy negation NC and the t-conorms from Table 2.2. Note that these implications are nothing more than rotations of the t-conorms.

2.2.2.2 R-Implications
Where S-implications are constructed by generalizing the material implication, residuated implications (R-implications) are constructed in quite a different way. They are the standard choice in t-norm fuzzy logics, and use the following identity from set theory:

A′ ∪ B = (A \ B)′ = ⋃{C ⊆ X | A ∩ C ⊆ B}

where A and B are subsets of the universal set X and A′ denotes the complement of A.
Definition 2.15. Let T be a t-norm. The function IT : [0, 1]² → [0, 1] is called an R-implication and is defined as

IT(a, c) = sup{b ∈ [0, 1] | T(a, b) ≤ c}. (2.7)

The supremum of a set A, denoted sup A, is the smallest upper bound of A in [0, 1], where an upper bound is a value at least as large as every element of A. If, and only if, T is a left-continuous t-norm, the supremum can be replaced by the maximum, which finds the largest element of the set instead. Furthermore, T and IT then form an adjoint pair having the following residuation property for all a, b, c ∈ [0, 1]:

T(a, c) ≤ b ⟺ IT(a, b) ≥ c. (2.8)
All R-implications IT are fuzzy implications. Note that if a ≤ c, then IT(a, c) = 1. We can see this from Equation 2.7: the largest possible value for b is 1, and T(a, 1) = a ≤ c, because T(1, a) = a for all t-norms T and all a ∈ [0, 1]. Furthermore, all R-implications satisfy LN and EP.
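The supremum in Equation 2.7 can also be approximated by brute force over a grid, which lets us check the closed forms; a rough sketch with an illustrative grid resolution:

```python
# Sketch: approximating an R-implication I_T(a, c) = sup{b in [0,1] | T(a,b) <= c}
# by a grid search over b (illustrative only; closed forms exist).

def t_godel(a, b):
    """Goedel t-norm T_G(a, b) = min(a, b)."""
    return min(a, b)

def r_implication(t_norm, a, c, steps=10_000):
    """Grid approximation of Equation 2.7 over b in {0, 1/steps, ..., 1}."""
    return max(b / steps for b in range(steps + 1)
               if t_norm(a, b / steps) <= c)

# The Goedel R-implication has the closed form 1 if a <= c, else c:
assert abs(r_implication(t_godel, 0.8, 0.3) - 0.3) < 1e-3
assert abs(r_implication(t_godel, 0.2, 0.3) - 1.0) < 1e-3
```

The same grid search with the product t-norm recovers the Goguen implication c/a for a > c, up to the grid resolution.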
Table 2.5 shows the four R-implications created from the four common t-norms. Note that ILK and IFD appear in both tables: they are both S-implications and R-implications.

2.2.2.3 Contrapositivisation
As shown in Table 2.5, not all fuzzy implications are contrapositive symmetric, and R-implications in particular often are not. However, (Jayaram and Baczynski 2008) shows two techniques that can be used to create fuzzy implications that are contrapositive symmetric.
Name              | T-norm | R-implication                                | Properties
Gödel             | TG     | IG(a, c) = 1 if a ≤ c, else c                | LN, EP, IP, R-CP(ND1)
Product (Goguen)  | TP     | IGG(a, c) = 1 if a ≤ c, else c/a             | LN, EP, IP, R-CP(ND1)
Łukasiewicz       | TLK    | ILK(a, c) = min(1 − a + c, 1)                | All
Nilpotent (Fodor) | TNm    | IFD(a, c) = 1 if a ≤ c, else max(1 − a, c)   | All

Table 2.5: Four common R-implications.
Implication | Type  | Contrapositivisation                                              | Properties
IG          | upper | (IG)u_NC(a, c) = IFD(a, c)                                        | All
IG          | lower | (IG)l_NC(a, c) = 1 if a ≤ c, else min(1 − a, c)                   | All
IGG         | upper | (IGG)u_NC(a, c) = 1 if a ≤ c, else max(c/a, (1 − a)/(1 − c))      | All but EP
IGG         | lower | (IGG)l_NC(a, c) = 1 if a ≤ c, else min(c/a, (1 − a)/(1 − c))      | All but LN

Table 2.6: The Gödel and Goguen implications after upper and lower contrapositivisation.
Definition 2.16. Let I be a fuzzy implication and N a fuzzy negation. The upper contrapositivisation Iu_N and lower contrapositivisation Il_N of I with respect to N are defined as

Iu_N(a, c) = max(I(a, c), I(N(c), N(a))), (2.9)
Il_N(a, c) = min(I(a, c), I(N(c), N(a))). (2.10)

Iu_N and Il_N are both fuzzy implications, and if N is strong, CP(N) holds for both.
Table 2.6 shows the result of applying both lower and upper contrapositivisation to the Gödel implication IG and the Goguen implication IGG.
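A small sketch of Definition 2.16 in Python (function names ours) confirms the Gödel rows of Table 2.6:

```python
# Sketch: upper and lower contrapositivisation applied to the Goedel implication.

def n_classic(a):
    return 1.0 - a

def i_godel(a, c):
    """Goedel R-implication: 1 if a <= c, else c."""
    return 1.0 if a <= c else c

def upper_contrapositivisation(i, n):
    """Equation 2.9: I^u_N(a, c) = max(I(a, c), I(N(c), N(a)))."""
    return lambda a, c: max(i(a, c), i(n(c), n(a)))

def lower_contrapositivisation(i, n):
    """Equation 2.10: I^l_N(a, c) = min(I(a, c), I(N(c), N(a)))."""
    return lambda a, c: min(i(a, c), i(n(c), n(a)))

i_upper = upper_contrapositivisation(i_godel, n_classic)
i_lower = lower_contrapositivisation(i_godel, n_classic)

# For a > c, Table 2.6 predicts max(1 - a, c) (the Fodor implication)
# for the upper version and min(1 - a, c) for the lower version:
a, c = 0.9, 0.4
assert i_upper(a, c) == max(1 - a, c)
assert i_lower(a, c) == min(1 - a, c)
```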
Chapter 3
Differentiable Fuzzy Logics
Differentiable Logics (DL) are logics for which loss functions can be constructed that can be minimized with gradient descent methods. They are based on the following idea: use background knowledge, described using some logic, to deduce the truth values of ground atoms in unlabeled or poorly labeled data. This allows us to use large pools of such data in our learning, possibly together with normal labeled data. This can be beneficial as unlabeled or poorly labeled data is cheaper and easier to come by. Importantly, this is not like Inductive Logic Programming (Muggleton and De Raedt 1994), where we derive logically consistent rules from data. It is the other way around: the logic informs us what the truth values of the ground atoms could have been.
We motivate the use of Differentiable Logics with the following scenario. Assume we have an agent A whose goal is to describe the world around it. When it describes a scene, it gets feedback from a supervisor S. Now, S is a curious supervisor: it knows exactly how to describe some scenes. When our agent A communicates its description of one of these scenes, S can simply correct A by comparing A's description with its own. Yet for the other scenes, S is in the dark and does not know the true description of the world A is in. All it has access to is A's description of the scene. However, S does have a knowledge base K containing background knowledge about the concepts of the world, encoded in some logical formalism. The idea of Differentiable Logics is that S can correct those descriptions of scenes by A that are not consistent with the knowledge base K.
Example 3.1. To illustrate this idea, consider the following example. Say that our agent A comes across the scene I in Figure 3.1, which contains two objects, o1 and o2. A and the supervisor S only know of the unary class predicates {chair, cushion, armRest} and the binary predicate {partOf}. S does not have a description of I either, and will have to correct A based on the knowledge in the knowledge base K. A predicts the following using its current model of the world:
p(chair(o1)|I, o1) = 0.9 p(chair(o2)|I, o2) = 0.4
p(cushion(o1)|I, o1) = 0.05 p(cushion(o2)|I, o2) = 0.5
p(armRest(o1)|I, o1) = 0.05 p(armRest(o2)|I, o2) = 0.1
p(partOf(o1, o1)|I, o1) = 0.001 p(partOf(o2, o2)|I, o2) = 0.001
p(partOf(o1, o2)|I, o1, o2) = 0.01 p(partOf(o2, o1)|I, o2, o1) = 0.95
Say that the knowledge base K contains the following formula, written in the relational logic from Section 2.1:

∀x, y chair(x) ∧ partOf(y, x) → cushion(y) ∨ armRest(y)
where ∀x, y is short for ∀x∀y. S might now reason that, since A is very confident of chair(o1) and of partOf(o2, o1), the antecedent of this formula is satisfied, and thus cushion(o2) or armRest(o2) has to hold. Since p(cushion(o2)|I, o2) > p(armRest(o2)|I, o2), a possible correction would be to tell A to increase its degree of belief in cushion(o2). A can use this to update the model it uses to interpret future images.
We would like to automate the kind of reasoning S does in the previous example. In DL, we add a loss term that is computed using the formulas in K and the unlabeled data.1 This loss term is added to a normal supervised loss function. Assume we have a labeled dataset Dl and an unlabeled dataset Du. If we have a deep learning model pθ with model parameters θ, which is used to classify ground atoms, we can say that all these methods minimize with respect to θ

L(θ; Dl, Du, K) = LS(θ; Dl) + α · LDL(θ; Du, K). (3.1)

LS can be any supervised training loss acting on the labeled data Dl. In particular, for classification tasks we will use the common cross-entropy loss function (I. Goodfellow et al. 2016). LDL is the Differentiable Logics loss that uses the formulas in K and acts on the unlabeled data Du. It has to be differentiable with respect to θ so that the complete loss function L can be minimized using a form of gradient descent. The hyperparameter α ≥ 0 weights the influence of the DL loss relative to the supervised loss.
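As a sketch, Equation 3.1 amounts to the following, where the two loss functions are placeholder stand-ins rather than the thesis's implementations:

```python
# Sketch of Equation 3.1: a supervised loss on labeled data combined with a
# DL loss on unlabeled data, weighted by alpha. The concrete losses below
# are toy placeholders.

def total_loss(theta, labeled, unlabeled, knowledge_base, alpha,
               supervised_loss, dl_loss):
    """L(theta; D_l, D_u, K) = L_S(theta; D_l) + alpha * L_DL(theta; D_u, K)."""
    return (supervised_loss(theta, labeled)
            + alpha * dl_loss(theta, unlabeled, knowledge_base))

# Toy usage with dummy losses over a scalar parameter theta:
l_s = lambda theta, d: sum((theta - y) ** 2 for y in d) / len(d)
l_dl = lambda theta, d, k: (1 - theta) ** 2   # pretend K prefers theta near 1
loss = total_loss(0.5, labeled=[0.0, 1.0], unlabeled=[], knowledge_base=None,
                  alpha=0.1, supervised_loss=l_s, dl_loss=l_dl)
```

Setting `alpha=0` recovers purely supervised training, which is the natural baseline in the experiments.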
We identify two families of Differentiable Logics in the literature. The first is Differentiable Fuzzy Logics (DFL). In DFL, the knowledge base of formulas is interpreted using a fuzzy logic. The objective of DFL is to maximize the satisfaction of the full grounding of this fuzzy knowledge base. Truth values of ground atoms are not discrete but continuous, and logical connectives are interpreted using functions over these truth values. One such logic is Real Logic (Serafini and A. D. Garcez 2016), which uses fuzzy t-norms, dual t-conorms and S-implications. In Section 3.2 we will discuss in particular how the interpretation of the connectives influences the reasoning.
In the second family of Differentiable Logics, we maximize the likelihood that the prediction the agent makes will satisfy a knowledge base (Xu et al. 2018; Manhaeve et al. 2018). The knowledge base itself can be modeled using classical and probabilistic logics. We call this Differentiable Probabilistic Logics. The relation between a DFL we call Differentiable Product Fuzzy Logic and a Differentiable Probabilistic Logic called Semantic Loss (Xu et al. 2018) is discussed in Section 3.3. Furthermore, we talk about some Differentiable Probabilistic Logics in the related work in Section 5.4.
3.1 Differentiable Fuzzy Logics
Differentiable Fuzzy Logics (DFL) is a family of fuzzy logics in which the satisfaction of knowledge bases is differentiable and can be maximized to perform learning. It uses fuzzy operators as its connectives and is general enough to handle both predicates and functions. In our experiments, we are not interested in functions, existential quantifiers and negated universal quantifiers, and thus leave them out of the discussion.2 For this introduction, we follow both the introduction of Real Logic in (Serafini and A. D. Garcez 2016) and the embedding-based semantics in (Guha 2014). We will define our semantics only for the limited domain of the relational logic defined in Section 2.1.
1 For now, we limit our discussion to semi-supervised learning. In other approaches to weakly supervised learning in which, for example, many labels are inaccurate, the unlabeled data is replaced with this inaccurately labeled portion of the data.
2 Existential quantification can be modeled in much the same way as universal quantification is modeled in this thesis, but using 'oring' operators instead of 'anding' ones. Furthermore, functions are modelled in (Serafini and A. D. Garcez 2016) and (Marra et al. 2018).
3.1.1 Semantics
3.1.1.1 Embedded Semantics
DFL defines a new semantics which extends the traditional semantics from Section 2.1.2.1 to use vector embeddings. A structure in DFL again consists of a domain of discourse and an embedded interpretation:3

Definition 3.1. A DFL structure for a relational language L = ⟨C, P⟩4 is a tuple ⟨p, ηθ⟩. p is a domain distribution over objects o in a d-dimensional5 real-valued vector space. The domain of discourse is O = {o | p(o) > 0}, that is, all objects with non-zero probability. ηθ is an (embedded) interpretation, which is a function parameterized by θ that satisfies the following conditions:
• If c ∈ C is a constant symbol, then ηθ(c) ∈ O.
• If P ∈ P is a predicate with arity α, then ηθ(P) : O^α → [0, 1].
That is, objects in DFL semantics are d-dimensional vectors of reals. Their semantics come from the implicit meaning of the vector space: Terms are interpreted in a real (valued) world (Serafini and A. D. Garcez 2016). Likewise, predicates are interpreted as functions mapping these vectors to a fuzzy truth value. This can be seen as a solution to the symbol grounding problem (Harnad 1990). The domain distribution is used to limit the size of the vector space. For example, if we consider the space of all images, p might be the distribution over this space that represents the natural images.
Embedded interpretations can be implemented using any deep learning model.6 This model defines all functions used for the interpretation of the predicates, and can also define mappings of the constant symbols. Note that by a model we mean the model along with its trainable parameters, as different values of these parameters produce different outputs. Therefore, we include the parameters θ of the model in the notation of the embedded interpretation ηθ.
Now that we know how to interpret constants and predicates and we are able to associate variables to elements in the domain, we can compute the truth value of sentences of DFL.
Definition 3.2. Let O be a domain of discourse, ηθ an interpretation for the relational language L = ⟨C, P⟩, N a fuzzy negation, T a t-norm, S a t-conorm, I a fuzzy implication and A an aggregation operator. The valuation function e_{ηθ,O,N,T,S,I,A} (or, for brevity, eθ) computes the truth value of a well-formed formula ϕ for L given a variable assignment µ. It is defined inductively as follows:

eθ(P(x1, ..., xm), µ) = ηθ(P)(ηθ(l(x1, µ)), ..., ηθ(l(xm, µ))) (3.2)
eθ(¬φ, µ) = N(eθ(φ, µ)) (3.3)
eθ(φ ∧ ψ, µ) = T(eθ(φ, µ), eθ(ψ, µ)) (3.4)
eθ(φ ∨ ψ, µ) = S(eθ(φ, µ), eθ(ψ, µ)) (3.5)
eθ(φ → ψ, µ) = I(eθ(φ, µ), eθ(ψ, µ)) (3.6)
eθ(∀x φ, µ) = A_{o∈O} eθ(φ, µ ∪ {x/o}) (3.7)

where l is the assignment lookup function that finds the ground term o assigned to xi in µ.
Equation 3.2 defines the fuzzy truth value of an atomic formula. First, it determines the interpretation ηθ(P) of the predicate symbol, which is a function in R^{d·α} → [0, 1]. We then find the interpretations of the terms of the atomic formula by first finding the correct ground term using l and then determining the interpretation of this ground term, ηθ(l(xi, µ)) ∈ R^d. The resulting list of d-dimensional vectors is finally plugged into the interpretation ηθ(P) of the predicate symbol to get the fuzzy truth value of the statement.
Equations 3.3-3.6 define the truth values of the connectives using the operators N, T, S and I. The recursion is similar to that in classical logic.
Finally, Equation 3.7 defines the degree of truth of the statement 'for all x, φ'. This is done by applying an aggregation operator A over the enumeration of every possible assignment of an object of the domain of discourse O to the variable x. This assignment is done by adding x/o to the variable assignment µ.
3 (Serafini and A. D. Garcez 2016) uses the term "(semantic) grounding" or "symbol grounding" (Mayo 2003) instead of 'embedded interpretation', "to emphasize the fact that L is interpreted in a 'real world'", but we find this confusing as we also talk about groundings in Herbrand semantics. Furthermore, by using the word 'interpretation' we highlight the parallel with classical logical interpretations.
4 The L symbol is also used for loss functions. Context will make clear which of the two is referred to.
5 Without loss of generality, we fix the dimensionality of the vectors representing the objects. For our domain, namely natural images, some images may have different dimensions than others; however, their dimensionality is reduced to a fixed size somewhere in the pipeline of the neural network model we use. DFL could easily be extended to use different types of objects with a varying number of dimensions.
6 A note on terminology: when we talk about models in DFL, we talk about deep learning models such as neural networks, and not about models in the logical sense.
Importantly, this semantics of the ∀ quantifier assumes that the domain of discourse O is finite, as aggregation operators are only defined on a (possibly very large) finite number of arguments. For infinite domains, we would have to calculate limits of the aggregation operator instead. Furthermore, even if the domain is finite, the computation might still be intractable. We will therefore need to look at a slightly different semantics to make the computation of the DFL valuation feasible.
3.1.1.2 Sampled Embedded Semantics
Because the domain of discourse is generally too large or even infinite, we have to sample a batch of b objects from O to approximate the computation of the valuation. This can be done simply by replacing Equation 3.7 with

eθ(∀x φ, µ) = A_{i=1}^{b} eθ(φ, µ ∪ {x/oi}),   o1, ..., ob chosen from O. (3.8)

That is, we choose b objects from the domain of discourse. An obvious way would be to sample from the domain distribution p, if it happens to be available. It is commonly assumed in machine learning (I. Goodfellow et al. 2016, p. 109) that the unlabeled dataset Du contains independent samples from the domain distribution p, and thus using such samples approximates sampling from p. Obviously, by sampling we give up on the soundness of the method.
3.1.2 Learning using Best Satisfiability
Next, we explain how we learn the set of parameters θ using DFL. This is done using best satisfiability (Donadello, Serafini, and A. d. Garcez 2017): find parameters that maximize the valuation over all formulas in the knowledge base K.
Definition 3.3. Let O be a set of objects, K a knowledge base of formulas, ηθ an (embedded) interpretation of the symbols in K parameterized by θ, and ⟨N, T, S, I, A⟩ the usual operators. Then the Differentiable Fuzzy Logics loss LDFL of a knowledge base of formulas K is computed as

LDFL(θ; O, K) = − Σ_{ϕ∈K} wϕ · e_{ηθ,O,N,T,S,I,A}(ϕ, ∅), (3.9)

where wϕ is the weight for formula ϕ, which is assumed to be 1/|K| unless mentioned otherwise. The best satisfiability problem is the problem of finding parameters θ* so that the valuation using the interpretation ηθ* is a global minimum of the DFL loss:

θ* = argmin_θ LDFL(θ; O, K). (3.10)
This optimization problem can be solved using a form of gradient descent. Indeed, if the operators N, T, S, I and A are all differentiable, we can backpropagate through the computation graph of the valuation function and through the computation of the truth values of the ground atoms. This changes the parameters θ, resulting in a different embedded interpretation ηθ.
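As a toy sketch of this optimization (not the thesis's implementation), we can minimize the DFL loss of a one-formula knowledge base with finite-difference gradients, treating the ground-atom truth values themselves as θ:

```python
# Toy sketch: gradient descent on the DFL loss of Equation 3.9 for
# K = {forall x: P(x) -> Q(x)} over two objects, using the Reichenbach
# implication, the product aggregator, and finite-difference gradients.
# The ground-atom truth values play the role of theta here.

def valuation(theta):
    """Product aggregator over I_RC(P(o_i), Q(o_i)) for two objects."""
    p1, q1, p2, q2 = theta
    i_rc = lambda a, c: 1 - a + a * c       # Reichenbach implication
    return i_rc(p1, q1) * i_rc(p2, q2)

def dfl_loss(theta):
    return -valuation(theta)                # Equation 3.9 with w_phi = 1

def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta."""
    g = []
    for j in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[j] += eps
        dn[j] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g

theta = [0.9, 0.2, 0.8, 0.3]                # P confident, Q low: rule violated
for _ in range(100):                        # plain gradient descent, lr = 0.1
    theta = [min(1.0, max(0.0, t - 0.1 * gj))
             for t, gj in zip(theta, grad(dfl_loss, theta))]
# The loss approaches its minimum of -1 as the formula becomes satisfied.
```

Which atoms receive the most gradient depends on the chosen implication; this is exactly the kind of behavior analyzed in Section 3.2.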
3.1.2.1 Implementation
The computation of the satisfaction is shown in pseudocode form in Algorithm 1. By first computing the dictionary g that contains truth values for all ground atoms,7 we reduce the number of forward passes through the ground-atom computations required to compute the satisfaction. This algorithm can fairly easily be parallelized for efficient computation on a GPU by noting that the individual terms that are aggregated over in line 12 (the different instances of the universal quantifier) do not depend on each other. Assuming formulas are in prenex normal form, we can set up the dictionary g using tensor operations so that the recursion has to be done only once for each formula. This is achieved by applying the fuzzy operators elementwise over vectors of truth values instead of single truth values, where each element of the vector represents a variable assignment.
The complexity of this computation is O(|K| · P · b^d), where K is the set of formulas, P is the number of predicates used in each formula, b is the batch size and d is the maximum depth of nesting of universal quantifiers in the formulas in K (known as the quantifier rank). This is exponential in the number of quantifiers, as every object in the constants C has to be iterated over in line 12, although as mentioned earlier this can be mitigated somewhat using efficient parallelization. Still, computing the valuation for transitive rules (such as ∀x, y, z Q(x, z) ∧ R(z, y) → P(x, y)) will be far more demanding than for antisymmetry formulas (such as ∀x, y P(x, y) → ¬P(y, x)).
Algorithm 1 Computation of the Differentiable Fuzzy Logics loss. First it computes the fuzzy Herbrand interpretation g given the current embedded interpretation ηθ. This performs a forward pass through the neural networks that are used to interpret the predicates. Then it computes the valuation of each formula ϕ in the knowledge base K, implementing Equations 3.2-3.7.

1: function eN,T,S,I,A(ϕ, g, C, µ)  ▷ The valuation function computes the fuzzy truth value of ϕ.
2:   if ϕ = P(x1, ..., xm) then
3:     return g[P, (µ(x1), ..., µ(xm))]  ▷ Find the truth value of a ground atom using the dictionary g.
4:   else if ϕ = ¬φ then
5:     return N(eN,T,S,I,A(φ, g, C, µ))
6:   else if ϕ = φ ∧ ψ then
7:     return T(eN,T,S,I,A(φ, g, C, µ), eN,T,S,I,A(ψ, g, C, µ))
8:   else if ϕ = φ ∨ ψ then
9:     return S(eN,T,S,I,A(φ, g, C, µ), eN,T,S,I,A(ψ, g, C, µ))
10:  else if ϕ = φ → ψ then
11:    return I(eN,T,S,I,A(φ, g, C, µ), eN,T,S,I,A(ψ, g, C, µ))
12:  else if ϕ = ∀x φ then  ▷ Apply the aggregation operator as a quantifier.
13:    return A_{o∈C} eN,T,S,I,A(φ, g, C, µ ∪ {x/o})  ▷ Each assignment can be seen as an instance of ϕ.
14:  end if
15: end function
16:
17: procedure DFL(ηθ, P, K, O, N, T, S, I, A)  ▷ Computes the Differentiable Fuzzy Logics loss.
18:   C ← o1, ..., ob sampled from O  ▷ Sample b constants to use this pass.
19:   g ← dict()  ▷ Collects truth values for ground atoms.
20:   for P ∈ P do
21:     for o1, ..., oα(P) ∈ C do
22:       g[P, (o1, ..., oα(P))] ← ηθ(P)(o1, ..., oα(P))  ▷ Calculate the truth values of the ground atoms.
23:     end for
24:   end for
25:   return −Σ_{ϕ∈K} wϕ · eN,T,S,I,A(ϕ, g, C, ∅)  ▷ Calculate the valuation of each formula ϕ, starting with an empty variable assignment. This implements Equation 3.9.
26: end procedure
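A minimal Python rendering of Algorithm 1's valuation function might look as follows; the tuple-based formula encoding and the choice of Gödel operators are illustrative, not the thesis's implementation:

```python
# Sketch: recursive valuation over formulas encoded as nested tuples, with
# Goedel operators (min/max) and a precomputed ground-atom dictionary g.

OPS = {"not": lambda a: 1 - a,
       "and": min, "or": max,
       "implies": lambda a, c: 1.0 if a <= c else c,  # Goedel R-implication
       "forall": min}                                 # minimum aggregator

def valuate(phi, g, constants, mu):
    kind = phi[0]
    if kind == "atom":                    # ("atom", P, x1, ..., xm)
        _, pred, *vars_ = phi
        return g[(pred, tuple(mu[x] for x in vars_))]
    if kind == "not":
        return OPS["not"](valuate(phi[1], g, constants, mu))
    if kind in ("and", "or", "implies"):
        return OPS[kind](valuate(phi[1], g, constants, mu),
                         valuate(phi[2], g, constants, mu))
    if kind == "forall":                  # ("forall", x, body)
        _, var, body = phi
        return OPS["forall"](valuate(body, g, constants, {**mu, var: o})
                             for o in constants)
    raise ValueError(kind)

# Ground-atom truth values (the dictionary g set up in lines 19-24):
g = {("P", ("o1",)): 0.9, ("P", ("o2",)): 0.3,
     ("Q", ("o1",)): 0.8, ("Q", ("o2",)): 0.6}
phi = ("forall", "x", ("implies", ("atom", "P", "x"), ("atom", "Q", "x")))
print(valuate(phi, g, ["o1", "o2"], {}))  # -> 0.8
```

In practice the dictionary g would be filled by forward passes of the neural networks interpreting the predicates, and the recursion would be vectorized over variable assignments as described above.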
Example 3.2. To illustrate the computation of the valuation function eθ, we return to the problem in Example
3.1. The domain of discourse is the set of all subimages of natural images. The domain distribution is a distribution over the subimages of natural images. The constants are {o1, o2}, which is also the Herbrand
universe. The valuation of the rule ϕ = ∀x, y chair(x) ∧ partOf(y, x) → cushion(y) ∨ armRest(y) is computed as follows:
eθ(ϕ, {}) =A(A(I(T (ηθ(chair)(ηθ(o1)), ηθ(partOf)(ηθ(o1), ηθ(o1))), S(ηθ(cushion)(ηθ(o1)), ηθ(armRest)(ηθ(o1)))),
I(T (ηθ(chair)(ηθ(o1)), ηθ(partOf)(ηθ(o2), ηθ(o1))), S(ηθ(cushion)(ηθ(o2)), ηθ(armRest)(ηθ(o2))))),
A(I(T (ηθ(chair)(ηθ(o2)), ηθ(partOf)(ηθ(o1), ηθ(o2))), S(ηθ(cushion)(ηθ(o1)), ηθ(armRest)(ηθ(o1)))),
I(T (ηθ(chair)(ηθ(o2)), ηθ(partOf)(ηθ(o2), ηθ(o2))), S(ηθ(cushion)(ηθ(o2)), ηθ(armRest)(ηθ(o2))))))
To illustrate this more intuitively, Figure 3.2 shows the computation using a tree. Let us now make this computation concrete by choosing the product t-norm T = TP and t-conorm S = SP, alongside the product
aggregator A = ATP and the product S-implication known as the Reichenbach implication I = IRC. The resulting computation of the valuation function can then be written as
eθ(ϕ, {}) = Π_{x∈C} Π_{y∈C} [1 − (ηθ(chair)(ηθ(x)) · ηθ(partOf)(ηθ(y), ηθ(x)))
    + (ηθ(chair)(ηθ(x)) · ηθ(partOf)(ηθ(y), ηθ(x)))
    · (ηθ(cushion)(ηθ(y)) + ηθ(armRest)(ηθ(y)) − ηθ(cushion)(ηθ(y)) · ηθ(armRest)(ηθ(y)))] (3.11)

= Π_{x,y∈C} [1 − ηθ(chair)(ηθ(x)) · ηθ(partOf)(ηθ(y), ηθ(x))
    · (1 − ηθ(cushion)(ηθ(y))) · (1 − ηθ(armRest)(ηθ(y)))] (3.12)
If we interpret the predicate functions using a lookup in the table of probabilities from Example 3.1, so that ηθ(P)(ηθ(x)) = p(P(x)|I, x), we find that eθ(ϕ, {}) = 0.612.
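This value can be checked numerically; the following sketch evaluates Equation 3.12 with the probabilities of Example 3.1 looked up directly:

```python
# Numeric check of Equation 3.12, with the predicate interpretations looked
# up from the probability table of Example 3.1.

chair   = {"o1": 0.9,  "o2": 0.4}
cushion = {"o1": 0.05, "o2": 0.5}
arm     = {"o1": 0.05, "o2": 0.1}
part_of = {("o1", "o1"): 0.001, ("o2", "o2"): 0.001,
           ("o1", "o2"): 0.01,  ("o2", "o1"): 0.95}

objs = ["o1", "o2"]
val = 1.0
for x in objs:
    for y in objs:
        antecedent = chair[x] * part_of[(y, x)]                 # product t-norm
        consequent = cushion[y] + arm[y] - cushion[y] * arm[y]  # t-conorm S_P
        val *= 1 - antecedent + antecedent * consequent         # Reichenbach
print(round(val, 3))  # -> 0.612
```

The dominant factor is the instance (x = o1, y = o2), where the antecedent is confidently true but the consequent is not, which foreshadows the gradients computed next.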
Example 3.3. Continuing from Example 3.2, we can use gradient descent to update the table of probabilities from Example 3.1. Taking K = {ϕ}, we find that
∂LRL(K, ηθ)/∂p(chair(o1)|I, o1) = −0.4261          ∂LRL(K, ηθ)/∂p(chair(o2)|I, o2) = −0.0058
∂LRL(K, ηθ)/∂p(cushion(o1)|I, o1) = 0.0029         ∂LRL(K, ηθ)/∂p(cushion(o2)|I, o2) = 0.7662
∂LRL(K, ηθ)/∂p(armRest(o1)|I, o1) = 0.0029         ∂LRL(K, ηθ)/∂p(armRest(o2)|I, o2) = 0.4257
∂LRL(K, ηθ)/∂p(partOf(o1, o1)|I, o1) = −0.4978     ∂LRL(K, ηθ)/∂p(partOf(o2, o2)|I, o2) = −0.1103
∂LRL(K, ηθ)/∂p(partOf(o1, o2)|I, o1, o2) = −0.2219 ∂LRL(K, ηθ)/∂p(partOf(o2, o1)|I, o2, o1) = −0.4031
We can now do a gradient update step on the probabilities in the table, or find what the partial derivative of the parameters θ of some deep learning model pθ should be:

∂LRL(K, ηθ)/∂θ = ∂LRL(K, ηθ)/∂pθ(chair(o1)|I, o1) · ∂pθ(chair(o1)|I, o1)/∂θ + ... + ∂LRL(K, ηθ)/∂pθ(partOf(o2, o1)|I, o2, o1) · ∂pθ(partOf(o2, o1)|I, o2, o1)/∂θ
             = −0.4261 · ∂pθ(chair(o1)|I, o1)/∂θ + ... + −0.4031 · ∂pθ(partOf(o2, o1)|I, o2, o1)/∂θ
One particularly interesting property of Differentiable Fuzzy Logics is that the partial derivatives of the satisfaction of the knowledge base with respect to the subformulas have a somewhat explainable meaning. For example, as hypothesized in Example 3.1, the computed gradients reflect that we should increase p(cushion(o2)|I, o2), as it has indeed the largest partial derivative in absolute value.
Furthermore, note that there are many (relatively) large negative gradients. If we look at the computation of the valuation (Equation 3.12), it is easy to see why each partial derivative has its particular sign.
Figure 3.2: Representing the computation in Example 3.2 using a tree.
3.1.2.2 Discussion
Because the sampled semantics is defined on a subset C ⊆ O of the domain of discourse, the best satisfiability problem can also be understood as finding parameters θ such that all formulas are satisfied for this particular subset C. In the same way one could say a machine learning model learns to recognize cats by being fed pictures of cats, the machine learning model pθ learns to predict in a way that is logically consistent with K.
There is no guarantee, however, that the formulas are also satisfied if we evaluate them on objects other than those in C. In fact, as we will see, nothing intrinsically pushes the learning algorithm to make logically consistent predictions; rather, DFL shows the machine learning model more logically consistent examples. It is well known that most contemporary deep learning still has issues with generalization, such as a weakness to adversarial examples (I. J. Goodfellow, Shlens, and Szegedy 2014), and this is no different for this method.8
Furthermore, for most practical purposes C is not just a random subset of O but one distributed according to the domain distribution p(o). This maximizes the expected truth value of the knowledge base with respect to p(o), rather than maximizing the constraints irrespective of how common an object is. The machine learning model will therefore likely still predict inconsistently for uncommon scenes.
A second important point is that because the knowledge base K contains no facts9 (literals containing only constants), there is also no guarantee that the learned embedded interpretation will correspond to their true semantic meaning. In fact, the learned embedded interpretation might be one that assigns the wrong truth value to nearly every literal, as such an interpretation can still satisfy all formulas in the knowledge base.
Because of this, it is important to also learn what the predicate symbols mean in the real world, that is, from the data. This is where we use the supervised learning loss. If we look at the definition of the general Differentiable Logics loss in Equation 3.1 and substitute LDL(θ; Du, K) = LDFL(θ; Du, K), we can jointly optimize the supervised learning loss and the DFL loss, where the first provides examples of what the semantics of the predicates should be, and the second enforces logical consistency.
3.2 Derivatives of Operators
The main argument of this thesis is that the choice of operators determines the inferences that are made when using the DFL loss. If we had used a different set of operators in Example 3.2 than those based on the product t-norm, we would have gotten very different derivatives. These could in some ways make more sense, and in other ways less. In this section, we analyze many functions that can be used for logical reasoning and present several of their properties.
We will not go into depth on fuzzy negations, as the classical negation NC(a) = 1 − a is common, continuous, intuitive and has simple derivatives.
Definition 3.4. A function f : R → R is said to be nonvanishing if f(a) ≠ 0 for all a ∈ R, i.e., it is nonzero everywhere. A function f : R^n → R has a nonvanishing derivative if for all a1, ..., an ∈ R there is some 1 ≤ i ≤ n such that ∂f(a1, ..., an)/∂ai ≠ 0.
Note that even if we only use nonvanishing operators, the derivatives of composites of these functions can still be vanishing. For instance, using the product t-conorm and the classical negation on a ∨ ¬a, we find that the derivative of SP(a, 1 − a) is 2a − 1, which is 0 at a = 1/2. Another problem is that if the computation tree is very deep, gradients can vanish upstream. By the chain rule, all the partial derivatives of the connectives used from the root to a leaf of the tree are multiplied together. If many of these partial derivatives are smaller than 1, their product can approach 0, and in the case of an arithmetic underflow becomes 0.
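This vanishing point is easy to check numerically; a small sketch with an illustrative finite-difference helper:

```python
# Sketch: nonvanishing operators can still compose to a vanishing gradient.
# For S_P(a, 1 - a) = 1 - a + a^2 the derivative is 2a - 1, zero at a = 1/2.

def s_product(a, c):
    return a + c - a * c

def ddx(f, x, eps=1e-6):
    """Central finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

tautology = lambda a: s_product(a, 1 - a)      # valuation of a OR NOT a
assert abs(ddx(tautology, 0.5)) < 1e-6         # flat exactly at a = 1/2
assert abs(ddx(tautology, 0.9) - 0.8) < 1e-4   # 2a - 1 elsewhere
```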
3.2.1 Aggregation
Aggregation of both formulas and instances is an important choice when working with DFL. The function used for instance-aggregation is the interpretation of the ∀ quantifier in Equation 3.7. Because we limit ourselves to universal quantification, and formulas are aggregated under the assumption that every formula is true, we are not interested in 'oring' aggregators. We next discuss the theoretical benefits and problems of each aggregator.
8 In fact, this inspired (Minervini et al. 2017) to use generated adversarial examples that are not consistent with K as the 'sampled batch' C. For more details, see Section 5.2.
9 We assume in this thesis that the knowledge base K does not contain facts and only contains universally quantified formulas. We separate the learning of facts from the learning of the other formulas to be able to use standard supervised learning methods for learning the facts.
3.2.1.1 Minimum Aggregator

The minimum aggregator is given as

ATG(x1, ..., xn) = min(x1, ..., xn). (3.13)

It corresponds to strict universal quantification (or 'anding'): for this aggregator to be high, every single input element needs to be high. The partial derivatives are given by

∂ATG(x1, ..., xn)/∂xi = { 1 if i = argmin_j xj; 0 otherwise }. (3.14)
It is easy to see that this is a poor aggregator: there is a nonzero gradient on only a single element. However, many practical formulas have exceptions. For example, if we believe that all ravens are black, we would be surprised to see that white ravens do exist, even if they are very rare. Furthermore, a raven might turn red if someone throws a bucket of paint over it. Because only the lowest-scoring input has a nonzero gradient, this aggregator is likely to correct just that exception, 'forgetting' correct behavior. Additionally, the gradient is computed inefficiently, as we still have to compute the forward pass for all other inputs even though they receive no feedback signal.
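A finite-difference check illustrates how the minimum aggregator concentrates all gradient on its smallest input (the helper names are ours):

```python
# Sketch: the minimum aggregator only passes gradient to its smallest input.

def a_min(xs):
    return min(xs)

def partial(f, xs, i, eps=1e-6):
    """Central finite-difference partial derivative of f at xs along input i."""
    up, dn = list(xs), list(xs)
    up[i] += eps
    dn[i] -= eps
    return (f(up) - f(dn)) / (2 * eps)

xs = [0.9, 0.2, 0.7, 0.95]       # the 'white raven' sits at index 1
grads = [partial(a_min, xs, i) for i in range(len(xs))]
assert 0.99 < grads[1] < 1.01    # only the argmin input gets a gradient
assert all(abs(grads[i]) < 1e-12 for i in (0, 2, 3))
```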
3.2.1.2 Łukasiewicz Aggregator

The Łukasiewicz aggregator is given as

ATLU(x1, ..., xn) = max(Σ_{i=1}^{n} xi − (n − 1), 0). (3.15)

This again is strict universal quantification. The partial derivatives are given by

∂ATLU(x1, ..., xn)/∂xi = { 1 if Σ_{i=1}^{n} xi > n − 1; 0 otherwise }. (3.16)
This is also a very poor aggregation operator. There is only a gradient when Σ_{i=1}^{n} xi > n − 1, that is, only when the average value of the xi is larger than (n − 1)/n (Páll Jónsson 2018). Because lim_{n→∞} (n − 1)/n = 1, this effectively means that, for larger values of n, nearly all inputs must already be satisfied. In other words, we can only learn when we are already (almost) correct. As there would nearly never be any gradient during learning, this aggregation operator would render DFL useless.
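A quick sketch makes the saturation concrete:

```python
# Sketch: the Lukasiewicz aggregator saturates at 0 unless the inputs are
# already almost all true, so it rarely provides any gradient.

def a_lukasiewicz(xs):
    """Equation 3.15: max(sum(x_i) - (n - 1), 0)."""
    return max(sum(xs) - (len(xs) - 1), 0.0)

n = 100
# Mean 0.98 is below (n - 1)/n = 0.99: the aggregator is stuck at 0.
assert a_lukasiewicz([0.98] * n) == 0.0
# Only once the mean exceeds 0.99 does the output (and gradient) appear.
assert abs(a_lukasiewicz([0.999] * n) - 0.9) < 1e-9
```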
3.2.1.3 Yager Aggregator

The Yager aggregator is given as

ATY(x1, ..., xn) = max(1 − (Σ_{i=1}^{n} (1 − xi)^p)^{1/p}, 0),   p ≥ 1. (3.17)

The Łukasiewicz aggregator is the special case of the Yager aggregator with p = 1. Furthermore, as p approaches infinity, it approaches the minimum aggregator. The derivative of the Yager aggregator is

∂ATY(x1, ..., xn)/∂xi = { (Σ_{j=1}^{n} (1 − xj)^p)^{1/p − 1} · (1 − xi)^{p−1} if (Σ_{j=1}^{n} (1 − xj)^p)^{1/p} < 1; 0 otherwise }. (3.18)
This derivative vanishes whenever \(\left(\sum_{j=1}^n (1 - x_j)^p\right)^{\frac{1}{p}} \geq 1\). By exponentiating by p, we note that then also \(\sum_{j=1}^n (1 - x_j)^p \geq 1\) holds. As \(1 - x_i \in [0, 1]\), \((1 - x_i)^p\) is a decreasing function with respect to p. Therefore, \(\frac{1}{n}\sum_{i=1}^n (1 - x_i)^p < \frac{1}{n}\) holds for a larger proportion of the domain when p increases. We can quantify this for the common (Euclidean) case p = 2.
Proposition 3.1. The ratio of points \(x_1, \dots, x_n \in [0, 1]\) for which there is some \(x_i\) with \(\frac{\partial A_{T_Y}(x_1, \dots, x_n)}{\partial x_i} > 0\) is equal to \(\frac{\pi^{n/2}}{2^n \cdot \Gamma(\frac{1}{2}n + 1)}\).
Figure 3.3: The ratio of points in [0, 1]n for which ATY with p = 2 has a positive gradient.
Proof. We begin by noting from Equation 3.18 that there is only a gradient whenever \(\sum_{i=1}^n (1 - x_i)^2 < 1\). The points for which this inequality holds form an n-ball with radius 1. An n-ball is the generalization of the concept of a ball to n dimensions: the region enclosed by an (n − 1)-hypersphere.^10 A hypersphere with radius 1 is the set of points which are at a distance of 1 from its center. The volume of an n-ball is given by (Ball et al. 1997, p. 5):
\[
V(n) = \frac{\pi^{n/2}}{\Gamma(\frac{1}{2}n + 1)}. \tag{3.19}
\]
As we are only interested in the volume of this n-ball within a single orthant,^11 we have to divide this volume by the number of orthants in which the n-ball lies, which is \(2^n\).^12 The total volume of a single orthant is 1. Thus, the ratio of points in \([0, 1]^n\) that have a nonzero gradient is \(\frac{V(n)}{2^n} = \frac{\pi^{n/2}}{2^n \cdot \Gamma(\frac{1}{2}n + 1)}\).
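Proposition 3.1 can also be checked empirically. The sketch below (plain Python; function names are ours) compares the closed-form ratio with a Monte Carlo estimate of the fraction of points in \([0,1]^n\) where the p = 2 Yager aggregator has a nonzero gradient:

```python
import math
import random

def gradient_ratio(n):
    """Closed form from Proposition 3.1: pi^(n/2) / (2^n * Gamma(n/2 + 1))."""
    return math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))

def monte_carlo_ratio(n, samples=100_000, seed=0):
    """Fraction of uniform points in [0,1]^n with sum_i (1 - x_i)^2 < 1."""
    rng = random.Random(seed)
    hits = sum(
        sum((1 - rng.random()) ** 2 for _ in range(n)) < 1
        for _ in range(samples)
    )
    return hits / samples

for n in (2, 4, 8):
    print(n, gradient_ratio(n), monte_carlo_ratio(n))
# n = 2 gives pi/4 (about 0.785); by n = 8 the ratio is already below 2%.
```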
We plot the values of this ratio up to n = 12 in Figure 3.3. It shows that in practice this vanishing-gradient problem persists even for small input domains. If we are concerned only with optimizing the truth value, we can simply remove the max constraint, resulting in the ‘unbounded Yager’ norm
\[
A_{UY}(x_1, \dots, x_n) = 1 - \left(\sum_{i=1}^n (1 - x_i)^p\right)^{\frac{1}{p}}, \quad p \geq 1. \tag{3.20}
\]
However, then the co-domain of the function is no longer [0, 1]. We can do a linear transformation on this function to ensure this is the case (Appendix A.1).
Definition 3.5. For some \(p \geq 0\), the Mean-p Error aggregator \(A_{ME}^p\) is defined as
\[
A_{ME}^p(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_{i=1}^n (1 - x_i)^p\right)^{\frac{1}{p}}. \tag{3.21}
\]
The ‘error’ here is the difference between the predicted value \(x_i\) and the ‘ground truth’ value, 1. This function has the following derivative:
\[
\frac{\partial A_{ME}^p(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n}\left(\frac{1}{n}\sum_{j=1}^n (1 - x_j)^p\right)^{\frac{1}{p} - 1} \cdot (1 - x_i)^{p-1}. \tag{3.22}
\]
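To illustrate the contrast with the bounded Yager aggregator: in the sketch below (NumPy; names are ours), every input receives a gradient weighted by how badly it is satisfied, rather than the all-or-nothing signal of Equation 3.18:

```python
import numpy as np

def mean_p_error(x, p=2):
    """A_ME^p from Equation 3.21."""
    return 1 - np.mean((1 - x) ** p) ** (1 / p)

def mean_p_error_grad(x, p=2):
    """Equation 3.22: each input gets a signal growing with its own error."""
    n = len(x)
    m = np.mean((1 - x) ** p)
    return (1 / n) * m ** (1 / p - 1) * (1 - x) ** (p - 1)

x = np.array([0.9, 0.95, 0.1, 0.8])
print(mean_p_error(x, p=2))       # about 0.54
print(mean_p_error_grad(x, p=2))  # largest for the worst-satisfied input
```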
10. The 3-ball (or ball) is surrounded by a sphere (or 2-sphere). Similarly, the 2-ball (or disk) is surrounded by a circle (or 1-sphere).
11. An orthant in n dimensions is a generalization of the quadrant in two dimensions and the octant in three dimensions.
12. To help understand this, consider n = 2. The 2-ball is the disk with center (0, 0). The area of this disk is evenly distributed over the \(2^2 = 4\) quadrants.
We quickly mention two special cases. The first is p = 1:
\[
A_{MAE}(x_1, \dots, x_n) = 1 - \frac{1}{n}\sum_{i=1}^n (1 - x_i), \tag{3.23}
\]
which has the simple derivative \(\frac{\partial A_{MAE}(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n}\). This measure is equal to the mean absolute error (MAE) (as the error is always nonnegative) and is associated with the Lukasiewicz norm. Another special case is p = 2:
\[
A_{RMSE}(x_1, \dots, x_n) = 1 - \sqrt{\frac{1}{n}\sum_{i=1}^n (1 - x_i)^2}. \tag{3.24}
\]
This function is the root-mean-square error (RMSE) (also known as the root-mean-square deviation). It is commonly used for regression tasks and heavily weights outliers.
We can do the same for the Yager s-norm \(\min\big((a^p + b^p)^{1/p}, 1\big)\) (see Appendix A.1):

Definition 3.6. For some \(p \geq 1\), the p-Mean aggregator is defined as
\[
A_{M}^p(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_{i=1}^n x_i^p\right)^{\frac{1}{p}}. \tag{3.25}
\]
p = 1 corresponds to the arithmetic mean and p = 2 to the quadratic mean (the root mean square).^13 Additionally, for p > 1 its derivative has the issue of assigning high values to inputs that are already high. Note that the arithmetic mean \(A_M^1\) has the same derivative as the mean absolute error \(A_{MAE}\).
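A small numerical illustration of the p-Mean (a NumPy sketch; the function name is ours): as p grows, the aggregator interpolates from the arithmetic mean towards the maximum, which is why it behaves like an ‘oring’ operator:

```python
import numpy as np

def p_mean(x, p=1):
    """A_M^p from Equation 3.25."""
    return np.mean(np.asarray(x, dtype=float) ** p) ** (1 / p)

x = np.array([0.2, 0.4, 0.9])
print(p_mean(x, p=1))   # arithmetic mean, 0.5
print(p_mean(x, p=2))   # quadratic mean, pulled towards the high input
print(p_mean(x, p=50))  # approaches max(x) = 0.9 for large p
```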
3.2.1.4 Product Aggregator The product aggregator is given as
\[
A_{T_P}(x_1, \dots, x_n) = \prod_{i=1}^n x_i. \tag{3.26}
\]
This again is strict universal quantification, with the following partial derivatives:
\[
\frac{\partial A_{T_P}(x_1, \dots, x_n)}{\partial x_i} = \prod_{j=1,\, j \neq i}^n x_j. \tag{3.27}
\]
\(\nabla A_{T_P}(x_1, \dots, x_n) > 0\) if \(x_1, \dots, x_n > 0\), which is nonvanishing, as \(x_1 = \dots = x_n = 0\) is extremely unlikely to be relevant in practice. However, the derivative with respect to some input \(x_i\) is decreased if some other input \(x_j\) is low, even when the two inputs are independent. Furthermore, in practice we cannot compute this aggregation operator directly, as numerical underflow occurs when multiplying many small numbers. Luckily, we can use a common trick: we note that \(\operatorname{argmax} f(x) = \operatorname{argmax} \log(f(x))\), as log is a strictly monotonically increasing function.
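The underflow problem and the log trick are easy to demonstrate (a NumPy sketch):

```python
import numpy as np

x = np.full(1000, 0.01)  # a thousand low truth values

# The naive product aggregator underflows to exactly 0.0 in float64:
# 0.01^1000 = 1e-2000 is far below the smallest representable double.
print(np.prod(x))         # 0.0

# The log-product aggregator stays finite and keeps a usable gradient
# (each input contributes d/dx_i sum_j log(x_j) = 1/x_i).
print(np.sum(np.log(x)))  # about -4605.2
```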
If we use the product aggregator both for connecting instances and formulas, and our formulas are in prenex normal form, the best satisfiability problem in Equation 3.9 using the product norm \(A_{T_P}\) can be written as
\[
\eta_\theta^* = \operatorname*{argmin}_{\eta_\theta} - \prod_{\varphi \in K} e_\theta\big(\varphi = \forall x_1, \dots, x_{n_\varphi}\, \phi, \{\}\big)^{w_\varphi} \tag{3.28}
\]
\[
= \operatorname*{argmin}_{\eta_\theta} - \prod_{\varphi \in K} \prod_{o_1, \dots, o_{n_\varphi} \in C} e_\theta\big(\phi, \{x_1/o_1, \dots, x_{n_\varphi}/o_{n_\varphi}\}\big)^{w_\varphi} \tag{3.29}
\]
\[
= \operatorname*{argmin}_{\eta_\theta} - \sum_{\varphi \in K} w_\varphi \cdot \sum_{o_1, \dots, o_{n_\varphi} \in C} \log e_\theta\big(\phi, \{x_1/o_1, \dots, x_{n_\varphi}/o_{n_\varphi}\}\big) \tag{3.30}
\]
where \(n_\varphi\) is the depth of nesting of universal quantifiers in the prenex normal form formula \(\varphi = \forall x_1, \dots, x_{n_\varphi}\, \phi_\varphi\), and \(\phi_\varphi\) is the quantifier-free part of the formula \(\varphi\), also known as the matrix of \(\varphi\). We call this the log-product aggregator:
\[
A_{T_P}^{\log}(x_1, \dots, x_n) = (\log \circ A_{T_P})(x_1, \dots, x_n) = \sum_{i=1}^n \log(x_i). \tag{3.31}
\]
13. Donadello, Serafini, and A. d. Garcez (2017), Diligenti, Roychowdhury, and Gori (2017) and Marra et al. (2019) used these for the semantics of ∀, even though the p-Mean is an ‘oring’ and not an ‘anding’ aggregator. The motivation they give is that it is better than the minimum aggregator \(A_{T_G}\), as the more examples satisfy the formula, the higher the truth value of the formula. We agree with this