MSc Artificial Intelligence
Master Thesis
Differentiable Fuzzy Logics
Integrated Learning and Reasoning using Gradient Descent
by
Emile van Krieken
11282304
February 12, 2019
36 EC, June 2018 - February 2019

Supervisors:
Dr. E. Acar
Prof. dr. F.A.H. van Harmelen
Assessor:
T.N. Kipf MSc
Abstract

In recent years there has been a push to integrate symbolic AI and deep learning, as it is argued that the strengths and weaknesses of these approaches are complementary. One such trend in the literature is a weakly supervised learning technique that we call Differentiable Logics. It employs prior background knowledge described using logic to benefit from unlabeled and noisy data. By interpreting logical symbols using neural networks, this background knowledge can be added to regular loss functions used in deep learning, thereby integrating reasoning and learning. In particular, we analyze how fuzzy logic behaves in a differentiable setting, in an approach we call Differentiable Fuzzy Logics. One of our findings is that there is a strong and influential imbalance between the gradients flowing into the antecedent and those flowing into the consequent of the implication. Furthermore, we show that it is possible to use Differentiable Logics for semi-supervised learning on the MNIST dataset, and we discuss extensions to large-scale problems.
Acknowledgements
Firstly, I would like to thank my excellent supervisors Erman Acar and Frank van Harmelen. I was surprised by their detailed feedback, pushing me to write clearly and with great precision. They always pointed my work into the right direction. Their enthusiasm for this research direction was obvious and very contagious. We talked about many fascinating ideas and had great discussions. I could not have hoped for better.
I also want to thank Peter Bloem, who helped me out greatly with his expertise on the machine learning side of things. Furthermore, I would like to thank Haukur Páll Jónsson, Jasper Driessens, Finn Potason, Ilaria Tiddi, Luciano Serafini and Fije van Overeem for additional great discussions, feedback and insight.
Of course I also need to thank my fantastic friends, family and girlfriend, who supported me through everything during the writing of this thesis. In particular I want to mention the fantastic days in the office with my fellow student Alex, it has been fun.
Finally, I want to thank my mother. I have no words for how you always supported me and how you always were proud of me. I miss you.
Contents
1 Introduction
1.1 Reasoning and Learning using Gradient Descent
1.2 Derivatives for Reasoning
1.3 Research Questions and Contributions
1.4 Outline

2 Background
2.1 Relational Logic
2.1.1 Syntax
2.1.2 Semantics
2.2 Fuzzy Logic
2.2.1 Fuzzy Operators
2.2.2 Fuzzy Implications

3 Differentiable Fuzzy Logics
3.1 Differentiable Fuzzy Logics
3.1.1 Semantics
3.1.2 Learning using Best Satisfiability
3.2 Derivatives of Operators
3.2.1 Aggregation
3.2.2 Conjunction and Disjunction
3.2.3 Implication
3.3 Differentiable Product Fuzzy Logic versus Semantic Loss
3.3.1 Semantic Loss
3.3.2 Differentiable Product Fuzzy Logic: Probabilistic Approximation?

4 Experiments
4.1 3-SAT
4.2 Measures
4.3 MNIST Experiments
4.3.1 Formulas
4.3.2 Experimental Setup
4.3.3 Results
4.3.4 Conclusion
4.4 Visual Genome Experiments
4.4.1 Logical formulation
4.4.2 Formulas
4.4.3 Experimental Setup
4.4.4 Results

5 Related Work
5.1 Differentiable Fuzzy Logics
5.2 Differentiable Fuzzy Logics on Artificial Data
5.3 Projected Fuzzy Logics
5.4 Differentiable Probabilistic Logics

6 Discussion and Conclusion
6.1 Discussion
6.2 Conclusion

A Derivations of Used Functions
A.1 p-Error Aggregators
A.2 Sigmoidal Aggregators
A.3 Nilpotent Aggregator

B Differentiable Product Fuzzy Logic

C MNIST Results Without Supervised Learning of same
Chapter 1
Introduction
In recent years there has been a push to integrate symbolic and statistical approaches to Artificial Intelligence (AI) (A. S. d. Garcez, Broda, and Gabbay 2012; Besold et al. 2017). This push coincides with critiques of the statistical method deep learning (Marcus 2018; Pearl 2018), which has been the dominant focus of the AI community in the last decade. While deep learning has caused many important breakthroughs in computer vision (Brock, Donahue, and Simonyan 2018), natural language processing (Devlin et al. 2018) and reinforcement learning (Silver et al. 2017), the concern is that progress will halt if its shortcomings are not dealt with. Among these is the massive amount of data deep learning needs to be effective, requiring thousands or even millions of examples to properly learn a concept. Symbolic AI, on the other hand, can reuse concepts and knowledge learned from a small amount of data. It is also usually far easier to interpret the decisions of a symbolic AI system, in contrast to deep learning models that act like black boxes. This is because the symbols that symbolic AI reasons with refer to concepts that have a clear meaning to humans, while deep learning uses millions or billions of numerical parameters to compute mathematical models that are extremely hard to grasp. Finally, it is much easier to describe a priori domain knowledge symbolically and to integrate it into such a system.
A major downside of symbolic AI is that it is unable to capture the nuances of sensory data, which is noisy and high-dimensional. Furthermore, it is difficult to express how small changes in the input data should produce different outputs. This is related to the symbol grounding problem. (Harnad 1990) defines the symbol grounding problem as how “the semantic interpretation of a formal symbol system can be made intrinsic to the system, rather than just parasitic on the meanings in our heads”. As we mentioned, symbols refer to concepts that have an intrinsic meaning to us humans, but computers that reason and act on these symbols using symbol manipulation cannot understand this meaning. In contrast to symbolic AI, a properly trained deep learning model excels at modeling complex sensory data and is, for example, able to recognize the guitar in the top right image of Figure 1.1. These models could provide exactly the intrinsic meaning needed to bridge the gap between symbolic systems and the real world. Therefore, several recent approaches, among which (Diligenti, Roychowdhury, and Gori 2017; Garnelo, Arulkumaran, and Shanahan 2016; Serafini and A. D. Garcez 2016) and (Manhaeve et al. 2018), aim to interpret the symbols used in logic-based systems using deep learning models. These are some of the first systems to implement a proposition from (Harnad 1990), namely “a hybrid nonsymbolic/symbolic system (...) in which the elementary symbols are grounded in (...) nonsymbolic representations that pick out, from their proximal sensory projections, the distal object categories to which the elementary symbols refer.”
1.1 Reasoning and Learning using Gradient Descent
This thesis is about what we call Differentiable Logics. Differentiable Logics integrate reasoning and learning by using logical formulas which express a priori background knowledge. The symbols in these formulas are interpreted using a deep learning model whose parameters are to be learned. However, these formulas cannot by themselves handle uncertainty, and they represent discrete structures which are not differentiable. Differentiable Logics therefore construct differentiable loss functions based on these formulas. As these loss functions are fully differentiable, we can backpropagate through them into the deep learning model; by minimizing them using gradient descent, we ensure that the model's interpretation of the symbols is consistent with the background knowledge.
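To make this concrete, the following minimal sketch (an illustration of the general idea only, not the implementation used in this thesis; all names are ours) treats the truth values of two propositions as sigmoids of real-valued parameters, scores their conjunction with the product t-norm introduced in Chapter 2, and minimizes the loss 1 − truth by gradient descent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Degree of truth of "p and q" under the product t-norm,
# with p and q parameterized by real numbers u and v.
def truth(u, v):
    return sigmoid(u) * sigmoid(v)

# Analytic gradient of the loss 1 - truth(u, v).
def grad(u, v):
    a, b = sigmoid(u), sigmoid(v)
    return -b * a * (1 - a), -a * b * (1 - b)

u, v = -1.0, 0.5              # initial, untrained parameters
before = truth(u, v)
for _ in range(200):          # plain gradient descent, learning rate 1.0
    gu, gv = grad(u, v)
    u, v = u - gu, v - gv
after = truth(u, v)
print(before, after)          # the degree of truth rises toward 1
```

In a real Differentiable Logic the sigmoids would be neural network outputs and the formula far larger, but the mechanism is the same: the gradient of the loss flows into the parameters that produce the truth values.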
The Differentiable Logics this thesis focuses on are Differentiable Fuzzy Logics (DFL), which are inspired by Real Logic (Serafini and A. D. Garcez 2016). In DFL the background knowledge is expressed in first-order logic and uses fuzzy logic semantics (Klir and Yuan 1995). In contrast to having binary truth values that are either true or false, propositions in fuzzy logic have truth values that can be any real number between 0 and 1, called the degree of truth. In DFL, the predicate, function and constant symbols are interpreted using the deep learning model. By maximizing the degree of truth of the background knowledge using gradient descent, learning and reasoning are performed in parallel. Another approach is Differentiable Probabilistic Logics (Manhaeve et al. 2018), which instead interprets propositions as being true with some probability.

Figure 1.1: Three images in the Visual Genome dataset annotated with their scene graph. Figure taken from visualgenome.org.
By adding the DFL loss function to other loss functions commonly used in deep learning, DFL can be used for more challenging machine learning tasks than (fully) supervised learning. These methods fall under the umbrella of weakly supervised learning (Zhou 2017). For example, DFL can handle noisy or inaccurate supervision by correcting inconsistencies between the labels, the model's predictions and the background knowledge (Donadello, Serafini, and A. d. Garcez 2017). Furthermore, by splitting the problem of arithmetic addition into recognizing digits using deep learning models and adding using symbolic reasoning, (Manhaeve et al. 2018) solves the problem of recognizing the sum of two handwritten numbers. A third application, and the one we will focus on, is semi-supervised learning, in which only a limited fraction of the dataset is labeled and a large part is unlabeled (Xu et al. 2018; Hu et al. 2016). If the predictions of the deep learning model on the unlabeled data are logically inconsistent, Differentiable Logics can be used to correct this.
We apply semi-supervised learning using DFL to the Scene Graph Parsing (SGP) task. In SGP, the goal is to generate a semantic description of the objects in an image (Johnson, Gupta, and Fei-Fei 2018). This description is represented as a labeled directed graph known as a scene graph. An example of a labeled dataset for this problem is Visual Genome (Krishna et al. 2017). Figure 1.1 shows some images from this dataset annotated with a part of their scene graph. The binary relations in particular make it difficult to train a strong deep learning model on this dataset, as there are many different pairs of objects that could be related. An example of this data sparsity problem is that there are likely few images of rabbits eating cabbage in Visual Genome, making it challenging to learn to recognize this concept. Furthermore, because labeling scene graphs is challenging and expensive for humans, there are not many images with labeled scene graphs. However, far larger datasets without such labels exist, such as ImageNet (Russakovsky et al. 2015). Such a dataset could be used to reduce the data sparsity when training a model for SGP. By expressing a priori background knowledge of the world, Differentiable Fuzzy Logics can be a viable candidate for this approach.
1.2 Derivatives for Reasoning
The major contribution of this thesis is an analysis of the choice of operators used to compute the logical connectives in DFL. An example of such an operator is a t-norm which connects two fuzzy propositions and returns the degree of truth of the event that both propositions are true. Therefore, a t-norm generalizes the
Boolean conjunction. Similarly, an implication operator generalizes the Boolean implication. These operators are differentiable and can thus be used in DFL. Interestingly, the derivatives of these operators determine how DFL corrects the deep learning model when its predictions are inconsistent with the background knowledge. Surprisingly, there are only a few works in the literature on the qualitative properties of these derivatives, even though the choice of operators is integral to both theory and practice.
For example, assume that the deep learning model observes a non-black raven. We might have some background knowledge encoded in Real Logic saying that all ravens are black. Note that this observation is not consistent with the background knowledge. During the backpropagation step, the deep learning model is corrected by DFL, and the way in which this is done is determined by our choice of implication operator. One way to correct this observation would be to tell the deep learning model that it was a black raven instead.
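As a concrete illustration (our own, with hypothetical numbers), take the Reichenbach implication I(a, c) = 1 − a + a·c from Section 2.2.2: its partial derivatives indicate which input gradient ascent on the degree of truth would adjust.

```python
# Reichenbach implication I(a, c) = 1 - a + a*c and its partial derivatives.
def I_rc(a, c):
    return 1 - a + a * c

def dI_da(a, c):   # derivative w.r.t. the antecedent's truth value
    return c - 1

def dI_dc(a, c):   # derivative w.r.t. the consequent's truth value
    return a

# Observation: the model is confident it sees a raven (a = 0.9)
# that is not black (c = 0.1).
a, c = 0.9, 0.1
print(I_rc(a, c))   # low degree of truth of "raven -> black" (0.19)
print(dI_da(a, c))  # -0.9: lowering a ("it was not a raven") raises the truth
print(dI_dc(a, c))  #  0.9: raising c ("it was black after all") raises it too
```

Both directions restore consistency, and which one the gradient favors depends entirely on the chosen operator; this is the kind of behavior analyzed in Chapter 3.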
1.3 Research Questions and Contributions
The main research question that we wish to answer is as follows:
“What is the effect of the choice of operators used to compute the logical connectives in Differentiable Fuzzy Logics? ”
To answer this question, we first analyze the theoretical properties of four types of operators: Aggregation functions, which are used to compute the universal quantifier ∀, conjunction and disjunction operators, which are used to compute the conjunction and disjunction connectives ∧ and ∨, and fuzzy implications which are used to compute the implication connective. Then, we perform two different experiments to compare a list of combinations of operators in practice. We conclude with several recommendations for operators to use in Differentiable Fuzzy Logics.
The second research question is on the practical problem of Scene Graph Parsing:
“Can the performance of a deep learning model on the Scene Graph Parsing task be improved with semi-supervised learning using Differentiable Fuzzy Logics? ”
To answer this question, we split the Visual Genome dataset into two datasets, where the first part is labeled and the second unlabeled. Next, we devise a set of formulas that express background knowledge of the Visual Genome dataset that is used in our experiments with DFL. The deep learning model is trained using both a supervised loss function on the labeled dataset and the DFL loss function on the unlabeled dataset. The latter computes the consistency of the model’s predictions with the background knowledge. We notice no clear improvement compared to the supervised baseline and present challenges with applying Differentiable Logics to a complex task like Scene Graph Parsing.
1.4 Outline
In Chapter 2 we introduce the background that is relevant for this work. In particular, we discuss relational logics in Section 2.1 and introduce fuzzy logics with several common operators in Section 2.2. In Chapter 3 we present DFL (Section 3.1) along with theoretical properties of the operators used in it in Section 3.2. In Chapter 4, we test combinations of aggregation functions and disjunction operators to solve the 3-SAT problem (Section 4.1), apply DFL in a semi-supervised setting to the MNIST dataset (Section 4.3) and use a priori background knowledge to attempt to solve the same task for the Visual Genome dataset (Section 4.4). In Chapter 5 we discuss related work and in Chapter 6 we conclude with a small discussion about challenges and possible future work.
Chapter 2

Background
2.1 Relational Logic
In this thesis, we will be using relational knowledge bases, which are sets of sentences expressed in a relational logic language L. By using relational logic, we will be limiting ourselves to function-free formulas.
2.1.1 Syntax
Formulas are constructed using constants (or objects) C = {c1, c2, ...}, variables x1, x2, ..., predicates P = {P, R, partOf, ...} and the logical connectives ¬, ∨, ∧, → and quantifiers ∃, ∀. There is an arity function σ : P → N that maps each predicate to a natural number. The syntax of the logic that we will use throughout this thesis is defined as follows.
Definition 2.1. A term in L is an individual variable or a constant symbol. If t1, ..., tn are terms and P ∈ P has arity n, then P(t1, ..., tn) is an atomic formula.
The well-formed formulas of L are defined inductively. An atomic formula is a formula. If φ is a formula, then ¬φ (negation) is also a formula. If φ and ψ are formulas, then φ ∨ ψ (disjunction), φ ∧ ψ (conjunction) and φ → ψ (implication) are also formulas. In an implication, φ is called the antecedent and ψ the consequent. If φ is a formula in which the variable x appears, then ∃x φ (existential quantification) and ∀x φ (universal quantification) are also formulas; x is then said to be a bound variable.
If φ is an atomic formula, then φ and ¬φ are literals.
We will only consider formulas in prenex form, namely formulas that start with quantifiers and bound variables followed by a quantifier-free subformula. An example of a formula in prenex form is
∀x, y P(x, y) ∧ Q(x) → R(y).
2.1.2 Semantics
To evaluate the truth value of a formula, we will need a way to interpret all symbols in the language so we can assign a truth value to every sentence. For this, we introduce two orthogonal semantics that we will use to describe and analyze our algorithms.
2.1.2.1 Standard Semantics
Traditional (or Tarskian) semantics (Van Dalen 2004) maps symbols in L to objects and relations using a structure which consists of a domain of discourse and an interpretation.
Definition 2.2. A domain of discourse is a nonempty set of objects O = {o1, o2, ...} that specifies the range of quantifiers. An interpretation η is a function: for each constant symbol c, η(c) is an object in O, and for each predicate symbol P with arity m, η(P) is a function Oᵐ → {true, false}.
For example, the truth value of P(c1, c2) is η(P)(η(c1), η(c2)). Using a structure, we can easily define the semantics of full sentences inductively. For this, we first need variable assignments, which associate variable symbols with elements of the domain.
Definition 2.3. A variable assignment µ is a function that associates each variable symbol x with an object from the domain O and each constant symbol c with its interpretation η(c).
Definition 2.4. The truth values of formulas are determined using the valuation function e, which uses a variable assignment µ and a structure ⟨O, η⟩, and is defined inductively as follows:

• For atomic formulas P(t1, ..., tm), e(P(t1, ..., tm)) = η(P)(µ(t1), ..., µ(tm)), where η(P) is the interpretation of the predicate P.
• If φ and ψ are formulas, then e(¬φ) is true whenever e(φ) is not, e(φ ∧ ψ) is true whenever both e(φ) and e(ψ) are, e(φ ∨ ψ) is true if at least one of e(φ) and e(ψ) is, and e(φ → ψ) is true if e(φ) is false or e(ψ) is true.
• For formulas with an existential quantifier, e(∃x φ) is true iff there is an object o ∈ O such that e(φ) is true under the variable assignment µ′ that differs from µ only in that x is assigned to o.
• For formulas with a universal quantifier, e(∀x φ) is true iff for every object o ∈ O, e(φ) is true under the assignment µ′ that differs from µ only in that x is assigned to o.
We say that a formula φ is satisfiable if there is a structure so that e(φ) is true. Such a structure is called a model of φ.
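The inductive definition above can be transcribed directly as a recursive evaluator. The sketch below is our own illustration (formulas as nested tuples, predicates as Boolean functions), not code from this thesis:

```python
# Formulas as nested tuples, e.g. ('forall', 'x', ('imp', ('P', 'x'), ('Q', 'x'))).
# 'domain' is the set O; 'interp' maps predicate names to Boolean functions;
# 'mu' is the variable assignment, a dict from term symbols to objects.
def ev(phi, domain, interp, mu):
    op = phi[0]
    if op == 'not':
        return not ev(phi[1], domain, interp, mu)
    if op == 'and':
        return ev(phi[1], domain, interp, mu) and ev(phi[2], domain, interp, mu)
    if op == 'or':
        return ev(phi[1], domain, interp, mu) or ev(phi[2], domain, interp, mu)
    if op == 'imp':
        return (not ev(phi[1], domain, interp, mu)) or ev(phi[2], domain, interp, mu)
    if op == 'exists':
        return any(ev(phi[2], domain, interp, {**mu, phi[1]: o}) for o in domain)
    if op == 'forall':
        return all(ev(phi[2], domain, interp, {**mu, phi[1]: o}) for o in domain)
    # atomic formula: ('P', t1, ..., tm)
    return interp[op](*(mu[t] for t in phi[1:]))

# Example: O = {1, 2, 3}, P = "is even", Q = "is at most 2".
domain = {1, 2, 3}
interp = {'P': lambda o: o % 2 == 0, 'Q': lambda o: o <= 2}
phi = ('forall', 'x', ('imp', ('P', 'x'), ('Q', 'x')))  # every even number is <= 2
print(ev(phi, domain, interp, {}))  # True on this domain
```

DFL (Chapter 3) follows exactly this recursive structure, but replaces the Boolean connectives with the differentiable fuzzy operators of Section 2.2.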
2.1.2.2 Herbrand Semantics
Herbrand semantics (Shoenfield 2010) refers not to external objects but rather to ground atoms. We will first introduce some definitions:
Definition 2.5. A ground term is a term which does not contain variables. The set of all ground terms is called the Herbrand universe. If t1, ..., tn are ground terms and P(t1, ..., tn) is an atomic formula, then it is also a ground atom. The set of all possible ground atoms is called the Herbrand base. A Herbrand interpretation assigns a truth value to every ground atom in the Herbrand base.
The difference with traditional semantics is that in Herbrand semantics the only objects are the ground terms. Because we do not use function symbols in our relational logic, all the objects we are interested in are the (named) constants C.
Given such a set of constants C, we can compute the full grounding of a formula in prenex normal form. This is done by assigning the variables bound by the quantifiers to every possible combination of objects from the constants C. Each resulting ground formula, that is, a formula without free variables, is called an instance of the formula. The conjunction of all instances is used to compute the truth value of the universal quantifier ∀, and the disjunction of all instances is used for the existential quantifier ∃.
Example 2.1. Say we have a language with two constants C = {c1, c2} and predicates P = {P, Q}, where P is a unary predicate and Q is a binary predicate. The full grounding of the formula ∀x, y P(x) → Q(x, y) is given by (P(c1) → Q(c1, c1)) ∧ (P(c1) → Q(c1, c2)) ∧ (P(c2) → Q(c2, c1)) ∧ (P(c2) → Q(c2, c2)). The Herbrand base is {P(c1), P(c2), Q(c1, c1), Q(c1, c2), Q(c2, c1), Q(c2, c2)}, and any subset of it (the set of ground atoms assigned true) is a possible Herbrand interpretation.
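The full grounding of Example 2.1 amounts to enumerating the Cartesian product of the constants over the bound variables, which is a one-liner with itertools (our sketch; the string formatting is purely illustrative):

```python
from itertools import product

# Generate all variable assignments for a prenex formula over a set of constants,
# one per element of constants^len(variables), as in Example 2.1.
def ground(variables, constants):
    for combo in product(constants, repeat=len(variables)):
        yield dict(zip(variables, combo))

constants = ['c1', 'c2']
instances = [f"P({g['x']}) -> Q({g['x']},{g['y']})"
             for g in ground(['x', 'y'], constants)]
print(instances)
# ['P(c1) -> Q(c1,c1)', 'P(c1) -> Q(c1,c2)', 'P(c2) -> Q(c2,c1)', 'P(c2) -> Q(c2,c2)']
```

Note that the number of instances grows as |C|^k for k bound variables, which is why grounding large formulas over many constants becomes expensive.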
2.2 Fuzzy Logic
Fuzzy logic is, contrary to classical logic, a real-valued logic. Truth values of propositions are not binary, that is, either true or false, but instead are real numbers in [0, 1], where 0 denotes completely false and 1 denotes completely true. Fuzzy logic models the concept of vagueness by arguing that the truth value of many propositions can be noisy to measure, or subjective. For example, the truth value of the predicate old is not easily determined. A person who is 50 years old would be called old by some, but most would certainly call someone who is 90 old as well. For this second person, however, the predicate old clearly holds to a higher degree, and this person would probably deem the first rather young.
We will be looking at predicate t-norm fuzzy logics in particular. Predicate fuzzy logics extend normal fuzzy logics with universal and existential quantification, mimicking the relational logic described in Section 2.1.
2.2.1 Fuzzy Operators
We will first introduce the semantics of the fuzzy operators ∧, ∨ and ¬ that are used to connect truth values of fuzzy predicates. We follow (Jayaram and Baczynski 2008) in this section and refer to it for proofs and additional results.
2.2.1.1 Properties of Functions
We first define several common properties of functions.

Definition 2.6. A function f : D → D is called
• continuous if for any a ∈ D, limx→af (x) = f (a);
• left-continuous if for each a ∈ D and every ε > 0 there exists a δ > 0 such that |f(x) − f(a)| < ε whenever a − δ < x < a;
• increasing if for all a, b ∈ D, if a ≤ b then f (a) ≤ f (b), and similarly for decreasing;
• strictly increasing if for all a, b ∈ D, if a < b then f(a) < f(b), and similarly for strictly decreasing.

A function f : D² → D is called
• commutative if for all a, b ∈ D, f (a, b) = f (b, a).
• associative if for all a, b, c ∈ D, f (f (a, b), c) = f (a, f (b, c)).
Left-continuity informally means that no ‘jumps’ occur when a point is approached from the left.

2.2.1.2 Fuzzy Negation
The functions that are used to compute the negation of a truth value are called fuzzy negations.
Definition 2.7. A fuzzy negation is a decreasing function N : [0, 1] → [0, 1] such that N(0) = 1 and N(1) = 0. N is called strict if it is strictly decreasing and continuous, and strong if it is an involution, that is, for all a ∈ [0, 1], N(N(a)) = a.
In this thesis we will exclusively use the strict and strong classic negation N_C(a) = 1 − a.
2.2.1.3 Triangular Norms
The functions that are used to compute the conjunction of two truth values are called t-norms.
Definition 2.8. A t-norm (triangular norm) is a function T : [0, 1]² → [0, 1] that is commutative and associative, and satisfies

• Monotonicity: For all a ∈ [0, 1], T(a, ·) is increasing and
• Neutrality: For all a ∈ [0, 1], T(1, a) = a.

The phrase ‘T(a, ·) is increasing’ means that whenever 0 ≤ b1 ≤ b2 ≤ 1, then T(a, b1) ≤ T(a, b2).
Definition 2.9. A t-norm T can have the following properties:
1. Continuity: A continuous t-norm is continuous in both arguments.
2. Left-continuity: A left-continuous t-norm is left-continuous in both arguments.
3. Idempotency: An idempotent t-norm has the property that for all a ∈ [0, 1], T (a, a) = a.
4. Strict monotonicity: A strictly monotone t-norm has the property that for all a ∈ (0, 1], T(a, ·) is strictly increasing.
5. Strict: A strict t-norm is continuous and strictly monotone.
Table 2.1 shows several common t-norms that we will investigate in this thesis alongside their properties. The product t-norm has a counterpart in probability theory, namely the probability that two independent events both occur.
Name              | T-norm                                                         | Properties
Gödel (minimum)   | T_G(a, b) = min(a, b)                                          | idempotent, continuous
Product           | T_P(a, b) = a · b                                              | strict
Łukasiewicz       | T_LK(a, b) = max(a + b − 1, 0)                                 | continuous
Nilpotent minimum | T_nM(a, b) = 0 if a + b ≤ 1, min(a, b) otherwise               | left-continuous
Yager             | T_Y(a, b) = max(1 − ((1 − a)^p + (1 − b)^p)^(1/p), 0), p ≥ 1   | continuous
Hamacher [1]      | T_H(a, b) = a·b / (v + (1 − v)(a + b − a·b)), v ≥ 0            | strict
Trigonometric [2] | T_T(a, b) = (2/π) arcsin(sin(aπ/2) · sin(bπ/2))                | strict

Table 2.1: Some common t-norms.

[1] See (László Gál et al. 2014). For v = 0, this is called the Hamacher product; v = 1 gives the normal product norm.
[2] See (Gál, Lovassy, and Kóczy 2010).
Name                        | T-conorm                                                     | Properties
Gödel (maximum)             | S_G(a, b) = max(a, b)                                        | idempotent, continuous
Product (probabilistic sum) | S_P(a, b) = a + b − a·b                                      | strict
Łukasiewicz                 | S_LK(a, b) = min(a + b, 1)                                   | continuous
Nilpotent maximum           | S_nM(a, b) = 1 if a + b ≥ 1, max(a, b) otherwise             | right-continuous
Yager                       | S_Y(a, b) = min((a^p + b^p)^(1/p), 1), p ≥ 1                 | continuous
Hamacher                    | S_H(a, b) = (a + b − a·b − (1 − v)·a·b) / (1 − (1 − v)·a·b), v ≥ 0 | strict
Trigonometric               | S_T(a, b) = (2/π) arccos(cos(aπ/2) · cos(bπ/2))              | strict

Table 2.2: Some common t-conorms.
2.2.1.4 Triangular Conorms
The functions that are used to compute the disjunction of two truth values are called t-conorms or s-norms.

Definition 2.10. A t-conorm (triangular conorm, also known as s-norm) is a function S : [0, 1]² → [0, 1] that is commutative and associative, and satisfies

• Monotonicity: For all a ∈ [0, 1], S(a, ·) is increasing and
• Neutrality: For all a ∈ [0, 1], S(0, a) = a.
T-conorms can be obtained from t-norms using De Morgan's laws from classical logic, in particular p ∨ q = ¬(¬p ∧ ¬q). Therefore, if T : [0, 1]² → [0, 1] is a t-norm and N_C the classic negation, then T's N_C-dual S : [0, 1]² → [0, 1] is calculated using

S(a, b) = 1 − T(1 − a, 1 − b)    (2.1)
Table 2.2 shows several common t-conorms derived using Equation 2.1 and the t-norms from Table 2.1. The same optional properties as those for t-norms in Definition 2.9 can hold for t-conorms and are presented in the same table. The t-conorm of the product t-norm also has a probabilistic interpretation, namely the probability that at least one of two independent events is true.
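Equation 2.1 is easy to check numerically. The sketch below (our own illustration) computes the N_C-duals of three t-norms from Table 2.1 and recovers the corresponding t-conorms of Table 2.2:

```python
# A few t-norms from Table 2.1 and their NC-duals via S(a,b) = 1 - T(1-a, 1-b).
t_norms = {
    'godel':       lambda a, b: min(a, b),
    'product':     lambda a, b: a * b,
    'lukasiewicz': lambda a, b: max(a + b - 1, 0),
}

def nc_dual(T):
    return lambda a, b: 1 - T(1 - a, 1 - b)

a, b = 0.7, 0.4
S_P = nc_dual(t_norms['product'])
print(S_P(a, b))                              # probabilistic sum a + b - a*b
print(nc_dual(t_norms['godel'])(a, b))        # max(a, b)
print(nc_dual(t_norms['lukasiewicz'])(a, b))  # min(a + b, 1)
```

For a = 0.7 and b = 0.4 this gives 0.82, 0.7 and 1, agreeing with the closed forms in Table 2.2.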
2.2.1.5 Aggregation operators
The functions that are used to compute quantifiers like ∀ and ∃ are aggregation functions (Y. Liu and Kerre 1998).
Definition 2.11. An aggregation operator is a function A : [0, 1]ⁿ → [0, 1] that is symmetric and increasing in each argument, and for which A(0, ..., 0) = 0 and A(1, ..., 1) = 1. A symmetric function is one whose output is the same for every ordering of its arguments.
Note that aggregation operators are essentially variadic functions, that is, functions that are defined for any finite number of arguments. For this reason we will often use the notation

A_{i=1..n} xi := A(x1, ..., xn).

Simple examples of aggregation operators are found by extending t-norms from 2-dimensional input to n-dimensional input using

A_T(x1, x2) = T(x1, x2)    (2.2)
A_T(x1, x2, ..., xn) = T(x1, A_T(x2, ..., xn))    (2.3)
Name              | Type   | Aggregation operator                                       | Characteristics
Minimum           | anding | A_TG(x1, ..., xn) = min(x1, ..., xn)                       | generalizes T_G
Product           | anding | A_TP(x1, ..., xn) = x1 · x2 · ... · xn                     | generalizes T_P
Łukasiewicz       | anding | A_TLK(x1, ..., xn) = max(x1 + ... + xn − (n − 1), 0)       | generalizes T_LK
Maximum           | oring  | A_SG(x1, ..., xn) = max(x1, ..., xn)                       | generalizes S_G
Probabilistic sum | oring  | A_SP(x1, ..., xn) = 1 − (1 − x1) · ... · (1 − xn)          | generalizes S_P
Bounded sum       | oring  | A_SLK(x1, ..., xn) = min(x1 + ... + xn, 1)                 | generalizes S_LK

Table 2.3: Some common aggregation operators.
where T is any t-norm. Because of the commutativity and associativity of T , the ordering of the arguments is irrelevant and thus AT is symmetric. The other required properties also follow from the definition of t-norms.
These operators do well for modeling the ∀ quantifier, as it can be seen as a series of conjunctions. We can do the same for s-norms:
A_S(x1, x2) = S(x1, x2)    (2.4)
A_S(x1, x2, ..., xn) = S(x1, A_S(x2, ..., xn))    (2.5)
where S is any s-norm. These operators in turn do well for modeling the ∃ quantifier, as it can be seen as a series of disjunctions.
Table 2.3 shows some common aggregation operators that we will talk about.
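The operators of Table 2.3 transcribe directly to code (our sketch, assuming Python 3.8+ for math.prod):

```python
import math

# The n-ary aggregation operators of Table 2.3 (illustrative sketch).
def agg_min(xs):  return min(xs)                             # generalizes T_G
def agg_prod(xs): return math.prod(xs)                       # generalizes T_P
def agg_luk(xs):  return max(sum(xs) - (len(xs) - 1), 0)     # generalizes T_LK
def agg_max(xs):  return max(xs)                             # generalizes S_G
def agg_psum(xs): return 1 - math.prod(1 - x for x in xs)    # generalizes S_P
def agg_bsum(xs): return min(sum(xs), 1)                     # generalizes S_LK

xs = [0.9, 0.8, 0.5]
print(agg_min(xs), agg_prod(xs), agg_luk(xs))   # 'anding' aggregators for forall
print(agg_max(xs), agg_psum(xs), agg_bsum(xs))  # 'oring' aggregators for exists
```

Note how differently the 'anding' aggregators treat the same inputs: the minimum only looks at the worst instance, while the product and Łukasiewicz aggregators combine all of them; this difference in behavior is exactly what the derivative analysis of Chapter 3 studies.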
2.2.2 Fuzzy Implications
The functions that are used to compute the implication of two truth values are called fuzzy implications (Jayaram and Baczynski 2008).
Definition 2.12. A fuzzy implication is a function I : [0, 1]² → [0, 1] such that for all a, c ∈ [0, 1], I(·, c) is decreasing, I(a, ·) is increasing, and I(0, 0) = 1, I(1, 1) = 1 and I(1, 0) = 0.
From this definition it follows that I(0, 1) = 1: since I(0, 0) = 1 and I(0, ·) is increasing, I(0, 1) cannot be any lower.
Definition 2.13. Let N be a fuzzy negation. A fuzzy implication I can have several properties that hold for all a, b, c ∈ [0, 1]:
1. Left-neutrality: For a left-neutral (LN) fuzzy implication holds that I(1, c) = c.
2. Exchange principle: For a fuzzy implication that satisfies the exchange principle (EP) holds that I(a, I(b, c)) = I(b, I(a, c)).
3. Identity principle: For a fuzzy implication that satisfies the identity principle (IP) holds that I(a, a) = 1.
4. Contraposition: For a fuzzy implication that is contrapositive symmetric with respect to N (denoted CS(N)) holds that I(a, c) = I(N(c), N(a)).
5. Left-contraposition: For a fuzzy implication that is left-contrapositive symmetric with respect to N (denoted L-CS(N )) holds that I(N (a), c) = I(N (c), a).
6. Right-contraposition: For a fuzzy implication that is right-contrapositive symmetric with respect to N (denoted R-CS(N )) holds that I(a, N (c)) = I(c, N (a)).
All these statements generalize a law from classical logic. A left neutral fuzzy implication generalizes (1 → p) = p, that is, if we know that the antecedent is true, p captures the truth value of 1 → p. The exchange principle generalizes p → (q → r) = q → (p → r), and the identity principle generalizes that p → p is a tautology.
When a fuzzy implication is contrapositive symmetric (with respect to a fuzzy negation N), it generalizes p → q = ¬q → ¬p. Left-contraposition furthermore generalizes ¬p → q = ¬q → p, and right-contraposition generalizes p → ¬q = q → ¬p.
Name                  | T-conorm | S-implication                                      | Properties
Gödel (Kleene-Dienes) | S_G      | I_KD(a, c) = max(1 − a, c)                         | all but IP
Product (Reichenbach) | S_P      | I_RC(a, c) = 1 − a + a·c                           | all but IP
Łukasiewicz           | S_LK     | I_LK(a, c) = min(1 − a + c, 1)                     | all
Nilpotent (Fodor)     | S_nM     | I_FD(a, c) = 1 if a ≤ c, max(1 − a, c) otherwise   | all

Table 2.4: Some common S-implications, derived from the four common t-conorms.
2.2.2.1 S-Implications
In classical logic, the (material) implication is defined as follows: p → q = ¬p ∨ q
Using this definition, we can use a t-conorm S and a fuzzy negation N to construct a fuzzy implication.

Definition 2.14. Let S be a t-conorm and N a fuzzy negation. The function I_{S,N} : [0, 1]² → [0, 1] is called an (S, N)-implication and is defined for all a, c ∈ [0, 1] as

I_{S,N}(a, c) = S(N(a), c).    (2.6)

If N is a strong fuzzy negation, then I_{S,N} is called an S-implication (or strong implication).
As we will only consider the classical negation NC, we omit the N and simply use IS to refer to IS,NC. All S-implications IS are fuzzy implications and satisfy LN, EP and R-CP(N). Additionally, if the negation N is strong, IS satisfies CP(N), and if, in addition, it is strict, it also satisfies L-CP(N). In Table 2.4 we show several S-implications that use the strong fuzzy negation NC and the t-conorms from Table 2.2. Note that these implications are nothing more than rotations of the t-conorms.

2.2.2.2 R-Implications
Where S-implications are constructed by generalizing the material implication, residuated implications (R-implications) are constructed in quite a different way. They are the standard choice in t-norm fuzzy logics, and use the following identity from set theory:

A′ ∪ B = (A \ B)′ = ⋃{C ⊆ X | A ∩ C ⊆ B}

where A and B are subsets of the universal set X and A′ denotes the complement of A.
Definition 2.15. Let T be a t-norm. The function IT : [0, 1]² → [0, 1] is called an R-implication and is defined as

IT(a, c) = sup{b ∈ [0, 1] | T(a, b) ≤ c}. (2.7)

The supremum of a set A, denoted sup A, is the smallest upper bound of A in [0, 1], where an upper bound is a value at least as large as every element of A. If, and only if, T is a left-continuous t-norm, the supremum can be replaced by the maximum, which finds the largest element of the set instead. Furthermore, T and IT then form an adjoint pair having the following residuation property for all a, b, c ∈ [0, 1]:

T(a, c) ≤ b ⟺ IT(a, b) ≥ c. (2.8)
All R-implications IT are fuzzy implications. Note that if a ≤ c, then IT(a, c) = 1. We can see this from Equation 2.7: the largest possible value for b is 1, and T(a, 1) = a ≤ c, because T(1, a) = a for all t-norms T and all a ∈ [0, 1]. Furthermore, all R-implications satisfy LN and EP.
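The supremum in Equation 2.7 can also be approximated by brute force over a grid, which lets us check the closed forms; a rough sketch with an illustrative grid resolution:

```python
# Sketch: approximating an R-implication I_T(a, c) = sup{b in [0,1] | T(a,b) <= c}
# by a grid search over b (illustrative only; closed forms exist).

def t_godel(a, b):
    """Goedel t-norm T_G(a, b) = min(a, b)."""
    return min(a, b)

def r_implication(t_norm, a, c, steps=10_000):
    """Grid approximation of Equation 2.7 over b in {0, 1/steps, ..., 1}."""
    return max(b / steps for b in range(steps + 1)
               if t_norm(a, b / steps) <= c)

# The Goedel R-implication has the closed form 1 if a <= c, else c:
assert abs(r_implication(t_godel, 0.8, 0.3) - 0.3) < 1e-3
assert abs(r_implication(t_godel, 0.2, 0.3) - 1.0) < 1e-3
```

The same grid search with the product t-norm recovers the Goguen implication c/a for a > c, up to the grid resolution.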
Table 2.5 shows the four R-implications created from the four common t-norms. Note that ILK and IFD appear in both tables: they are both S-implications and R-implications.

2.2.2.3 Contrapositivisation
As shown in Table 2.5, not all fuzzy implications are contrapositive symmetric, and R-implications in particular often are not. However, (Jayaram and Baczynski 2008) shows two techniques that can be used to create fuzzy implications that are contrapositive symmetric.
Name              | T-norm | R-implication                                | Properties
Gödel             | TG     | IG(a, c) = 1 if a ≤ c, else c                | LN, EP, IP, R-CP(ND1)
Product (Goguen)  | TP     | IGG(a, c) = 1 if a ≤ c, else c/a             | LN, EP, IP, R-CP(ND1)
Łukasiewicz       | TLK    | ILK(a, c) = min(1 − a + c, 1)                | All
Nilpotent (Fodor) | TNm    | IFD(a, c) = 1 if a ≤ c, else max(1 − a, c)   | All

Table 2.5: Four common R-implications.
Implication | Type  | Contrapositivisation                                              | Properties
IG          | upper | (IG)u_NC(a, c) = IFD(a, c)                                        | All
IG          | lower | (IG)l_NC(a, c) = 1 if a ≤ c, else min(1 − a, c)                   | All
IGG         | upper | (IGG)u_NC(a, c) = 1 if a ≤ c, else max(c/a, (1 − a)/(1 − c))      | All but EP
IGG         | lower | (IGG)l_NC(a, c) = 1 if a ≤ c, else min(c/a, (1 − a)/(1 − c))      | All but LN

Table 2.6: The Gödel and Goguen implications after upper and lower contrapositivisation.
Definition 2.16. Let I be a fuzzy implication and N a fuzzy negation. The upper contrapositivisation Iu_N and lower contrapositivisation Il_N of I with respect to N are defined as

Iu_N(a, c) = max(I(a, c), I(N(c), N(a))), (2.9)
Il_N(a, c) = min(I(a, c), I(N(c), N(a))). (2.10)

Iu_N and Il_N are both fuzzy implications, and if N is strong, CP(N) holds for both.
Table 2.6 shows the result of applying both lower and upper contrapositivisation to the Gödel implication IG and the Goguen implication IGG.
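A small sketch of Definition 2.16 in Python (function names ours) confirms the Gödel rows of Table 2.6:

```python
# Sketch: upper and lower contrapositivisation applied to the Goedel implication.

def n_classic(a):
    return 1.0 - a

def i_godel(a, c):
    """Goedel R-implication: 1 if a <= c, else c."""
    return 1.0 if a <= c else c

def upper_contrapositivisation(i, n):
    """Equation 2.9: I^u_N(a, c) = max(I(a, c), I(N(c), N(a)))."""
    return lambda a, c: max(i(a, c), i(n(c), n(a)))

def lower_contrapositivisation(i, n):
    """Equation 2.10: I^l_N(a, c) = min(I(a, c), I(N(c), N(a)))."""
    return lambda a, c: min(i(a, c), i(n(c), n(a)))

i_upper = upper_contrapositivisation(i_godel, n_classic)
i_lower = lower_contrapositivisation(i_godel, n_classic)

# For a > c, Table 2.6 predicts max(1 - a, c) (the Fodor implication)
# for the upper version and min(1 - a, c) for the lower version:
a, c = 0.9, 0.4
assert i_upper(a, c) == max(1 - a, c)
assert i_lower(a, c) == min(1 - a, c)
```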
Chapter 3
Differentiable Fuzzy Logics
Differentiable Logics (DL) are logics for which loss functions can be constructed that can be minimized with gradient descent methods. They are based on the following idea: use background knowledge, described using some logic, to deduce the truth values of ground atoms in unlabeled or poorly labeled data. This allows us to use large pools of such data in our learning, possibly together with normal labeled data. This can be beneficial as unlabeled or poorly labeled data is cheaper and easier to come by. Importantly, this is not like Inductive Logic Programming (Muggleton and De Raedt 1994), where we derive logically consistent rules from data. It is the other way around: the logic informs us what the truth values of the ground atoms could have been.
We motivate the use of Differentiable Logics with the following scenario. Assume we have an agent A whose goal is to describe the world around it. When it describes a scene, it gets feedback from a supervisor S. Now, S is a curious supervisor: it knows exactly how to describe some scenes. When our agent A communicates its description of one of these scenes, S can simply correct A by comparing A's description with its own. Yet for the other scenes, S is in the dark and does not know the true description of the world A is in. All it has access to is A's description of the scene. However, S does have a knowledge base K containing background knowledge about the concepts of the world, encoded in some logical formalism. The idea of Differentiable Logics is that S can correct those descriptions of scenes by A that are not consistent with the knowledge base K.
Example 3.1. To illustrate this idea, consider the following example. Say that our agent A comes across the scene I in Figure 3.1, which contains two objects, o1 and o2. A and the supervisor S only know of the unary class predicates {chair, cushion, armRest} and the binary predicate {partOf}. S does not have a description of I either, and will have to correct A based on the knowledge in the knowledge base K. A predicts the following using its current model of the world:
p(chair(o1)|I, o1) = 0.9 p(chair(o2)|I, o2) = 0.4
p(cushion(o1)|I, o1) = 0.05 p(cushion(o2)|I, o2) = 0.5
p(armRest(o1)|I, o1) = 0.05 p(armRest(o2)|I, o2) = 0.1
p(partOf(o1, o1)|I, o1) = 0.001 p(partOf(o2, o2)|I, o2) = 0.001
p(partOf(o1, o2)|I, o1, o2) = 0.01 p(partOf(o2, o1)|I, o2, o1) = 0.95
Say that the knowledge base K contains the following formula, written in the relational logic from Section 2.1:

∀x, y chair(x) ∧ partOf(y, x) → cushion(y) ∨ armRest(y)
where ∀x, y is short for ∀x∀y. S might now reason that, since A is very confident of chair(o1) and of partOf(o2, o1), the antecedent of this formula is satisfied, and thus cushion(o2) or armRest(o2) has to hold. Since p(cushion(o2)|I, o2) > p(armRest(o2)|I, o2), a possible correction would be to tell A to increase its degree of belief in cushion(o2). A can use this to update the model it uses to interpret future images.
We would like to automate the kind of reasoning S does in the previous example. In DL, we add a loss term that is computed using the formulas in K and the unlabeled data.1 This loss term is added to a normal supervised loss function. Assume we have a labeled dataset Dl and an unlabeled dataset Du. If we have a deep learning model pθ with model parameters θ, which is used to classify ground atoms, we can say that all these methods minimize with respect to θ

L(θ; Dl, Du, K) = LS(θ; Dl) + α · LDL(θ; Du, K). (3.1)

LS can be any supervised training loss acting on the labeled data Dl. In particular, for classification tasks we will use the common cross-entropy loss function (I. Goodfellow et al. 2016). LDL is the Differentiable Logics loss that uses the formulas in K and acts on the unlabeled data Du. It has to be differentiable with respect to θ so that the complete loss function L can be minimized using a form of gradient descent. The hyperparameter α ≥ 0 weights the influence of the DL loss relative to the supervised loss.
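As a sketch, Equation 3.1 amounts to the following, where the two loss functions are placeholder stand-ins rather than the thesis's implementations:

```python
# Sketch of Equation 3.1: a supervised loss on labeled data combined with a
# DL loss on unlabeled data, weighted by alpha. The concrete losses below
# are toy placeholders.

def total_loss(theta, labeled, unlabeled, knowledge_base, alpha,
               supervised_loss, dl_loss):
    """L(theta; D_l, D_u, K) = L_S(theta; D_l) + alpha * L_DL(theta; D_u, K)."""
    return (supervised_loss(theta, labeled)
            + alpha * dl_loss(theta, unlabeled, knowledge_base))

# Toy usage with dummy losses over a scalar parameter theta:
l_s = lambda theta, d: sum((theta - y) ** 2 for y in d) / len(d)
l_dl = lambda theta, d, k: (1 - theta) ** 2   # pretend K prefers theta near 1
loss = total_loss(0.5, labeled=[0.0, 1.0], unlabeled=[], knowledge_base=None,
                  alpha=0.1, supervised_loss=l_s, dl_loss=l_dl)
```

Setting `alpha=0` recovers purely supervised training, which is the natural baseline in the experiments.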
We identify two families of Differentiable Logics in the literature. The first is Differentiable Fuzzy Logics (DFL). In DFL, the knowledge base of formulas is interpreted using a fuzzy logic. The objective of DFL is to maximize the satisfaction of the full grounding of this fuzzy knowledge base. Truth values of ground atoms are not discrete but continuous, and logical connectives are interpreted using functions over these truth values. One such logic is Real Logic (Serafini and A. D. Garcez 2016), which uses fuzzy t-norms, dual t-conorms and S-implications. In Section 3.2 we will discuss in particular how the interpretation of the connectives influences the reasoning.
In the second family of Differentiable Logics, we maximize the likelihood that the prediction the agent makes will satisfy a knowledge base (Xu et al. 2018; Manhaeve et al. 2018). The knowledge base itself can be modeled using classical and probabilistic logics. We call this Differentiable Probabilistic Logics. The relation between a DFL we call Differentiable Product Fuzzy Logic and a Differentiable Probabilistic Logic called Semantic Loss (Xu et al. 2018) is discussed in Section 3.3. Furthermore, we talk about some Differentiable Probabilistic Logics in the related work in Section 5.4.
3.1 Differentiable Fuzzy Logics
Differentiable Fuzzy Logics (DFL) is a family of fuzzy logics in which the satisfaction of knowledge bases is differentiable and can be maximized to perform learning. It uses fuzzy operators as its connectives and is general enough to handle both predicates and functions. In our experiments, we are not interested in functions, existential quantifiers and negated universal quantifiers, and thus leave them out of the discussion.2 For this introduction, we follow both the introduction of Real Logic in (Serafini and A. D. Garcez 2016) and the embedding-based semantics in (Guha 2014). We will define our semantics only for the limited domain of the relational logic defined in Section 2.1.
1 For now, we limit our discussion to semi-supervised learning. In other approaches to weakly supervised learning in which, for example, many labels are inaccurate, the unlabeled data is replaced with this inaccurately labeled portion of the data.
2 Existential quantification can be modeled in much the same way as universal quantification is modeled in this thesis, but using 'oring' operators instead of 'anding' ones. Furthermore, functions are modelled in (Serafini and A. D. Garcez 2016) and (Marra et al. 2018).
3.1.1 Semantics
3.1.1.1 Embedded Semantics
DFL defines a new semantics which extends the traditional semantics from Section 2.1.2.1 to use vector embeddings. A structure in DFL again consists of a domain of discourse and an embedded interpretation:3

Definition 3.1. A DFL structure for a relational language L = ⟨C, P⟩4 is a tuple ⟨p, ηθ⟩. p is a domain distribution over objects o in a d-dimensional5 real-valued vector space. The domain of discourse is O = {o | p(o) > 0}, that is, all objects with non-zero probability. ηθ is an (embedded) interpretation, which is a function parameterized by θ that satisfies the following conditions:
• If c ∈ C is a constant symbol, then ηθ(c) ∈ O.
• If P ∈ P is a predicate with arity α, then ηθ(P) : O^α → [0, 1].
That is, objects in DFL semantics are d-dimensional vectors of reals. Their semantics come from the implicit meaning of the vector space: Terms are interpreted in a real (valued) world (Serafini and A. D. Garcez 2016). Likewise, predicates are interpreted as functions mapping these vectors to a fuzzy truth value. This can be seen as a solution to the symbol grounding problem (Harnad 1990). The domain distribution is used to limit the size of the vector space. For example, if we consider the space of all images, p might be the distribution over this space that represents the natural images.
Embedded interpretations can be implemented using any deep learning model.6 This model defines all functions used for the interpretation of the predicates, and can also define mappings of the constant symbols. Note that by a model we mean the model along with its trainable parameters, as different values of these parameters produce different outputs. Therefore, we include the parameters θ of the model in the notation of the embedded interpretation ηθ.
Now that we know how to interpret constants and predicates and we are able to associate variables to elements in the domain, we can compute the truth value of sentences of DFL.
Definition 3.2. Let O be a domain of discourse, ηθ an interpretation for the relational language L = ⟨C, P⟩, N a fuzzy negation, T a t-norm, S a t-conorm, I a fuzzy implication and A an aggregation operator. The valuation function e_{ηθ,O,N,T,S,I,A} (or, for brevity, eθ) computes the truth value of a well-formed formula ϕ for L given a variable assignment µ. It is defined inductively as follows:

eθ(P(x1, ..., xm), µ) = ηθ(P)(ηθ(l(x1, µ)), ..., ηθ(l(xm, µ))) (3.2)
eθ(¬φ, µ) = N(eθ(φ, µ)) (3.3)
eθ(φ ∧ ψ, µ) = T(eθ(φ, µ), eθ(ψ, µ)) (3.4)
eθ(φ ∨ ψ, µ) = S(eθ(φ, µ), eθ(ψ, µ)) (3.5)
eθ(φ → ψ, µ) = I(eθ(φ, µ), eθ(ψ, µ)) (3.6)
eθ(∀x φ, µ) = A_{o∈O} eθ(φ, µ ∪ {x/o}) (3.7)

where l is the assignment lookup function that finds the ground term o assigned to xi in µ.
Equation 3.2 defines the fuzzy truth value of an atomic formula. First, it determines the interpretation ηθ(P) of the predicate symbol, which is a function in R^{d·α} → [0, 1]. We then find the interpretations of the terms of the atomic formula by first finding the correct ground term using l and then determining the interpretation of this ground term, ηθ(l(xi, µ)) ∈ R^d. The resulting list of d-dimensional vectors is finally plugged into the interpretation ηθ(P) of the predicate symbol to get the fuzzy truth value of the statement.
Equations 3.3-3.6 define the truth values of the connectives using the operators N, T, S and I. The recursion is similar to that in classical logic.
Finally, Equation 3.7 defines the degree of truth of the statement 'for all x, φ'. This is done by applying an aggregation operator A over the enumeration of every possible assignment of an object of the domain of discourse O to the variable x. This assignment is done by adding x/o to the variable assignment µ.
3 (Serafini and A. D. Garcez 2016) uses the term "(semantic) grounding" or "symbol grounding" (Mayo 2003) instead of 'embedded interpretation', "to emphasize the fact that L is interpreted in a 'real world'", but we find this confusing as we also talk about groundings in Herbrand semantics. Furthermore, by using the word 'interpretation' we highlight the parallel with classical logical interpretations.
4 The L symbol is also used for loss functions. Context will make clear which of the two is referred to.
5 Without loss of generality, we fix the dimensionality of the vectors representing the objects. For our domain, namely natural images, some images may have different dimensions than others; however, their dimensionality is reduced to a fixed size somewhere in the pipeline of the neural network model we use. DFL could easily be extended to use different types of objects with a varying number of dimensions.
6 A note on terminology: when we talk about models in DFL, we talk about deep learning models such as neural networks, and not about models in the logical sense.
Importantly, this semantics of the ∀ quantifier assumes that the domain of discourse O is finite, as aggregation operators are only defined on a (possibly very large) finite number of arguments. For infinite domains, we would have to calculate limits of the aggregation operator instead. Furthermore, even if the domain is finite, the computation might still be intractable. We will therefore need to look at a slightly different semantics to make the computation of the DFL valuation feasible.
3.1.1.2 Sampled Embedded Semantics
Because the domain of discourse is generally too large or even infinite, we have to sample a batch of b objects from O to approximate the computation of the valuation. This can be done simply by replacing Equation 3.7 with

eθ(∀x φ, µ) = A_{i=1}^{b} eθ(φ, µ ∪ {x/oi}),   o1, ..., ob chosen from O. (3.8)

That is, we choose b objects from the domain of discourse. An obvious way would be to sample from the domain distribution p, if it happens to be available. It is commonly assumed in machine learning (I. Goodfellow et al. 2016, p. 109) that the unlabeled dataset Du contains independent samples from the domain distribution p, and thus using such samples approximates sampling from p. Obviously, by sampling we give up on the soundness of the method.
3.1.2 Learning using Best Satisfiability
Next, we explain how we learn the set of parameters θ using DFL. This is done using best satisfiability (Donadello, Serafini, and A. d. Garcez 2017): find parameters that maximize the valuation over all formulas in the knowledge base K.
Definition 3.3. Let O be a set of objects, K a knowledge base of formulas, ηθ an (embedded) interpretation of the symbols in K parameterized by θ, and ⟨N, T, S, I, A⟩ the usual operators. Then the Differentiable Fuzzy Logics loss LDFL of a knowledge base of formulas K is computed as

LDFL(θ; O, K) = − Σ_{ϕ∈K} wϕ · e_{ηθ,O,N,T,S,I,A}(ϕ, ∅), (3.9)

where wϕ is the weight for formula ϕ, which is assumed to be 1/|K| unless mentioned otherwise. The best satisfiability problem is the problem of finding parameters θ* so that the valuation using the interpretation ηθ* is a global minimum of the DFL loss:

θ* = argmin_θ LDFL(θ; O, K). (3.10)
This optimization problem can be solved using a form of gradient descent. Indeed, if the operators N, T, S, I and A are all differentiable, we can backpropagate through the computation graph of the valuation function and through the computation of the truth values of the ground atoms. This changes the parameters θ, resulting in a different embedded interpretation ηθ.
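As a toy sketch of this optimization (not the thesis's implementation), we can minimize the DFL loss of a one-formula knowledge base with finite-difference gradients, treating the ground-atom truth values themselves as θ:

```python
# Toy sketch: gradient descent on the DFL loss of Equation 3.9 for
# K = {forall x: P(x) -> Q(x)} over two objects, using the Reichenbach
# implication, the product aggregator, and finite-difference gradients.
# The ground-atom truth values play the role of theta here.

def valuation(theta):
    """Product aggregator over I_RC(P(o_i), Q(o_i)) for two objects."""
    p1, q1, p2, q2 = theta
    i_rc = lambda a, c: 1 - a + a * c       # Reichenbach implication
    return i_rc(p1, q1) * i_rc(p2, q2)

def dfl_loss(theta):
    return -valuation(theta)                # Equation 3.9 with w_phi = 1

def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta."""
    g = []
    for j in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[j] += eps
        dn[j] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g

theta = [0.9, 0.2, 0.8, 0.3]                # P confident, Q low: rule violated
for _ in range(100):                        # plain gradient descent, lr = 0.1
    theta = [min(1.0, max(0.0, t - 0.1 * gj))
             for t, gj in zip(theta, grad(dfl_loss, theta))]
# The loss approaches its minimum of -1 as the formula becomes satisfied.
```

Which atoms receive the most gradient depends on the chosen implication; this is exactly the kind of behavior analyzed in Section 3.2.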
3.1.2.1 Implementation
The computation of the satisfaction is shown in pseudocode form in Algorithm 1. By first computing the dictionary g that contains truth values for all ground atoms,7 we reduce the number of forward passes through the ground-atom computations required to compute the satisfaction. This algorithm can fairly easily be parallelized for efficient computation on a GPU by noting that the individual terms that are aggregated over in line 12 (the different instances of the universal quantifier) do not depend on each other. Assuming formulas are in prenex normal form, we can set up the dictionary g using tensor operations so that the recursion has to be done only once for each formula. This is achieved by applying the fuzzy operators elementwise over vectors of truth values instead of single truth values, where each element of the vector represents a variable assignment.
The complexity of this computation is O(|K| · P · b^d), where K is the set of formulas, P is the number of predicates used in each formula, b is the batch size and d is the maximum depth of nesting of universal quantifiers in the formulas in K (known as the quantifier rank). This is exponential in the number of quantifiers, as every object in the constants C has to be iterated over in line 12, although as mentioned earlier this can be mitigated somewhat using efficient parallelization. Still, computing the valuation for transitive rules (such as ∀x, y, z Q(x, z) ∧ R(z, y) → P(x, y)) will be far more demanding than for antisymmetry formulas (such as ∀x, y P(x, y) → ¬P(y, x)).
Algorithm 1 Computation of the Differentiable Fuzzy Logics loss. First it computes the fuzzy Herbrand interpretation g given the current embedded interpretation ηθ. This performs a forward pass through the neural networks that are used to interpret the predicates. Then it computes the valuation of each formula ϕ in the knowledge base K, implementing Equations 3.2-3.7.

1: function eN,T,S,I,A(ϕ, g, C, µ)  ▷ The valuation function computes the fuzzy truth value of ϕ.
2:   if ϕ = P(x1, ..., xm) then
3:     return g[P, (µ(x1), ..., µ(xm))]  ▷ Find the truth value of a ground atom using the dictionary g.
4:   else if ϕ = ¬φ then
5:     return N(eN,T,S,I,A(φ, g, C, µ))
6:   else if ϕ = φ ∧ ψ then
7:     return T(eN,T,S,I,A(φ, g, C, µ), eN,T,S,I,A(ψ, g, C, µ))
8:   else if ϕ = φ ∨ ψ then
9:     return S(eN,T,S,I,A(φ, g, C, µ), eN,T,S,I,A(ψ, g, C, µ))
10:  else if ϕ = φ → ψ then
11:    return I(eN,T,S,I,A(φ, g, C, µ), eN,T,S,I,A(ψ, g, C, µ))
12:  else if ϕ = ∀x φ then  ▷ Apply the aggregation operator as a quantifier.
13:    return A_{o∈C} eN,T,S,I,A(φ, g, C, µ ∪ {x/o})  ▷ Each assignment can be seen as an instance of ϕ.
14:  end if
15: end function
16:
17: procedure DFL(ηθ, P, K, O, N, T, S, I, A)  ▷ Computes the Differentiable Fuzzy Logics loss.
18:   C ← o1, ..., ob sampled from O  ▷ Sample b constants to use this pass.
19:   g ← dict()  ▷ Collects truth values for ground atoms.
20:   for P ∈ P do
21:     for o1, ..., oα(P) ∈ C do
22:       g[P, (o1, ..., oα(P))] ← ηθ(P)(o1, ..., oα(P))  ▷ Calculate the truth values of the ground atoms.
23:     end for
24:   end for
25:   return −Σ_{ϕ∈K} wϕ · eN,T,S,I,A(ϕ, g, C, ∅)  ▷ Calculate the valuation of each formula ϕ, starting with an empty variable assignment. This implements Equation 3.9.
26: end procedure
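A minimal Python rendering of Algorithm 1's valuation function might look as follows; the tuple-based formula encoding and the choice of Gödel operators are illustrative, not the thesis's implementation:

```python
# Sketch: recursive valuation over formulas encoded as nested tuples, with
# Goedel operators (min/max) and a precomputed ground-atom dictionary g.

OPS = {"not": lambda a: 1 - a,
       "and": min, "or": max,
       "implies": lambda a, c: 1.0 if a <= c else c,  # Goedel R-implication
       "forall": min}                                 # minimum aggregator

def valuate(phi, g, constants, mu):
    kind = phi[0]
    if kind == "atom":                    # ("atom", P, x1, ..., xm)
        _, pred, *vars_ = phi
        return g[(pred, tuple(mu[x] for x in vars_))]
    if kind == "not":
        return OPS["not"](valuate(phi[1], g, constants, mu))
    if kind in ("and", "or", "implies"):
        return OPS[kind](valuate(phi[1], g, constants, mu),
                         valuate(phi[2], g, constants, mu))
    if kind == "forall":                  # ("forall", x, body)
        _, var, body = phi
        return OPS["forall"](valuate(body, g, constants, {**mu, var: o})
                             for o in constants)
    raise ValueError(kind)

# Ground-atom truth values (the dictionary g set up in lines 19-24):
g = {("P", ("o1",)): 0.9, ("P", ("o2",)): 0.3,
     ("Q", ("o1",)): 0.8, ("Q", ("o2",)): 0.6}
phi = ("forall", "x", ("implies", ("atom", "P", "x"), ("atom", "Q", "x")))
print(valuate(phi, g, ["o1", "o2"], {}))  # -> 0.8
```

In practice the dictionary g would be filled by forward passes of the neural networks interpreting the predicates, and the recursion would be vectorized over variable assignments as described above.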
Example 3.2. To illustrate the computation of the valuation function eθ, we return to the problem in Example
3.1. The domain of discourse is the set of all subimages of natural images. The domain distribution is a distribution over the subimages of natural images. The constants are {o1, o2}, which is also the Herbrand
universe. The valuation of the rule ϕ = ∀x, y chair(x) ∧ partOf(y, x) → cushion(y) ∨ armRest(y) is computed as follows:
eθ(ϕ, {}) =A(A(I(T (ηθ(chair)(ηθ(o1)), ηθ(partOf)(ηθ(o1), ηθ(o1))), S(ηθ(cushion)(ηθ(o1)), ηθ(armRest)(ηθ(o1)))),
I(T (ηθ(chair)(ηθ(o1)), ηθ(partOf)(ηθ(o2), ηθ(o1))), S(ηθ(cushion)(ηθ(o2)), ηθ(armRest)(ηθ(o2))))),
A(I(T (ηθ(chair)(ηθ(o2)), ηθ(partOf)(ηθ(o1), ηθ(o2))), S(ηθ(cushion)(ηθ(o1)), ηθ(armRest)(ηθ(o1)))),
I(T (ηθ(chair)(ηθ(o2)), ηθ(partOf)(ηθ(o2), ηθ(o2))), S(ηθ(cushion)(ηθ(o2)), ηθ(armRest)(ηθ(o2))))))
To illustrate this more intuitively, Figure 3.2 shows the computation using a tree. Let us now make this computation concrete by choosing the product t-norm T = TP and t-conorm S = SP, alongside the product
aggregator A = ATP and the product S-implication known as the Reichenbach implication I = IRC. The resulting computation of the valuation function can then be written as
eθ(ϕ, {}) = Π_{x∈C} Π_{y∈C} [1 − (ηθ(chair)(ηθ(x)) · ηθ(partOf)(ηθ(y), ηθ(x)))
    + (ηθ(chair)(ηθ(x)) · ηθ(partOf)(ηθ(y), ηθ(x)))
    · (ηθ(cushion)(ηθ(y)) + ηθ(armRest)(ηθ(y)) − ηθ(cushion)(ηθ(y)) · ηθ(armRest)(ηθ(y)))] (3.11)

= Π_{x,y∈C} [1 − ηθ(chair)(ηθ(x)) · ηθ(partOf)(ηθ(y), ηθ(x))
    · (1 − ηθ(cushion)(ηθ(y))) · (1 − ηθ(armRest)(ηθ(y)))] (3.12)
If we interpret the predicate functions using a lookup in the table of probabilities from Example 3.1, so that ηθ(P)(ηθ(x)) = p(P(x)|I, x), we find that eθ(ϕ, {}) = 0.612.
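This value can be checked numerically; the following sketch evaluates Equation 3.12 with the probabilities of Example 3.1 looked up directly:

```python
# Numeric check of Equation 3.12, with the predicate interpretations looked
# up from the probability table of Example 3.1.

chair   = {"o1": 0.9,  "o2": 0.4}
cushion = {"o1": 0.05, "o2": 0.5}
arm     = {"o1": 0.05, "o2": 0.1}
part_of = {("o1", "o1"): 0.001, ("o2", "o2"): 0.001,
           ("o1", "o2"): 0.01,  ("o2", "o1"): 0.95}

objs = ["o1", "o2"]
val = 1.0
for x in objs:
    for y in objs:
        antecedent = chair[x] * part_of[(y, x)]                 # product t-norm
        consequent = cushion[y] + arm[y] - cushion[y] * arm[y]  # t-conorm S_P
        val *= 1 - antecedent + antecedent * consequent         # Reichenbach
print(round(val, 3))  # -> 0.612
```

The dominant factor is the instance (x = o1, y = o2), where the antecedent is confidently true but the consequent is not, which foreshadows the gradients computed next.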
Example 3.3. Continuing from Example 3.2, we can use gradient descent to update the table of probabilities from Example 3.1. Taking K = {ϕ}, we find that
∂LRL(K, ηθ)/∂p(chair(o1)|I, o1) = −0.4261          ∂LRL(K, ηθ)/∂p(chair(o2)|I, o2) = −0.0058
∂LRL(K, ηθ)/∂p(cushion(o1)|I, o1) = 0.0029         ∂LRL(K, ηθ)/∂p(cushion(o2)|I, o2) = 0.7662
∂LRL(K, ηθ)/∂p(armRest(o1)|I, o1) = 0.0029         ∂LRL(K, ηθ)/∂p(armRest(o2)|I, o2) = 0.4257
∂LRL(K, ηθ)/∂p(partOf(o1, o1)|I, o1) = −0.4978     ∂LRL(K, ηθ)/∂p(partOf(o2, o2)|I, o2) = −0.1103
∂LRL(K, ηθ)/∂p(partOf(o1, o2)|I, o1, o2) = −0.2219 ∂LRL(K, ηθ)/∂p(partOf(o2, o1)|I, o2, o1) = −0.4031
We can now do a gradient update step on the probabilities in the table, or find what the partial derivative of the parameters θ of some deep learning model pθ should be:

∂LRL(K, ηθ)/∂θ = ∂LRL(K, ηθ)/∂pθ(chair(o1)|I, o1) · ∂pθ(chair(o1)|I, o1)/∂θ + ... + ∂LRL(K, ηθ)/∂pθ(partOf(o2, o1)|I, o2, o1) · ∂pθ(partOf(o2, o1)|I, o2, o1)/∂θ
             = −0.4261 · ∂pθ(chair(o1)|I, o1)/∂θ + ... + −0.4031 · ∂pθ(partOf(o2, o1)|I, o2, o1)/∂θ
One particularly interesting property of Differentiable Fuzzy Logics is that the partial derivatives of the satisfaction of the knowledge base with respect to the subformulas have a somewhat explainable meaning. For example, as hypothesized in Example 3.1, the computed gradients reflect that we should increase p(cushion(o2)|I, o2), as it has indeed the largest partial derivative in absolute value.
Furthermore, note that there are many (relatively) large negative gradients. If we look at the computation of the valuation (Equation 3.12), it is easy to see why each partial derivative has its particular sign.
Figure 3.2: Representing the computation in Example 3.2 using a tree.
3.1.2.2 Discussion
Because the sampled semantics is defined on a subset C ⊆ O of the domain of discourse, the best satisfiability problem can also be understood as finding parameters θ such that all formulas are satisfied for this particular subset C. In the same way one could say a machine learning model learns to recognize cats by being fed pictures of cats, the machine learning model pθ learns to predict in a way that is logically consistent with K.
There is no guarantee, however, that the formulas are also satisfied if we evaluate them on objects other than those in C. In fact, as we will see, nothing intrinsically pushes the learning algorithm to make logically consistent predictions; rather, DFL shows the machine learning model more logically consistent examples. It is well known that most contemporary deep learning still has issues with generalization, such as a weakness to adversarial examples (I. J. Goodfellow, Shlens, and Szegedy 2014), and this is no different for this method.8
Furthermore, for most practical purposes C is not just a random subset of O but one distributed according to the domain distribution p(o). This maximizes the expected truth value of the knowledge base with respect to p(o), rather than maximizing the constraints irrespective of how common an object is. The machine learning model will therefore likely still predict inconsistently for uncommon scenes.
A second important point is that because the knowledge base K contains no facts9 (literals containing only constants), there is also no guarantee that the learned embedded interpretation will correspond to their true semantic meaning. In fact, the learned embedded interpretation might be one that assigns the wrong truth value to nearly every literal, as such an interpretation can still satisfy all formulas in the knowledge base.
Because of this, it is important to also learn what the predicate symbols mean in the real world, that is, from the data. This is where we use the supervised learning loss. If we look at the definition of the general Differentiable Logics loss in Equation 3.1 and substitute LDL(θ; Du, K) = LDFL(θ; Du, K), we can jointly optimize the supervised learning loss and the DFL loss, where the first provides examples of what the semantics of the predicates should be, and the second enforces logical consistency.
3.2 Derivatives of Operators
The main argument of this thesis is that the choice of operators determines the inferences that are made when using the DFL loss. If we had used a different set of operators in Example 3.2 than those based on the product t-norm, we would have gotten very different derivatives. These could in some ways make more sense, and in other ways less. In this section, we analyze many functions that can be used for logical reasoning and present several of their properties.
We will not go into depth on fuzzy negations, as the classical negation NC(a) = 1 − a is common, continuous, intuitive and has simple derivatives.
Definition 3.4. A function f : R → R is said to be nonvanishing if f(a) ≠ 0 for all a ∈ R, i.e., it is nonzero everywhere. A function f : R^n → R has a nonvanishing derivative if for all a1, ..., an ∈ R there is some 1 ≤ i ≤ n such that ∂f(a1, ..., an)/∂ai ≠ 0.
Note that even if we only use nonvanishing operators, the derivatives of composites of these functions can still be vanishing. For instance, using the product t-conorm and the classical negation on a ∨ ¬a, we find that the derivative of SP(a, 1 − a) is 2a − 1, which is 0 at a = 1/2. Another problem is that if the computation tree is very deep, gradients can vanish upstream. By the chain rule, all the partial derivatives of the connectives used from the root to a leaf of the tree are multiplied together. If many of these partial derivatives are smaller than 1, their product can approach 0, and in the case of an arithmetic underflow becomes 0.
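This vanishing point is easy to check numerically; a small sketch with an illustrative finite-difference helper:

```python
# Sketch: nonvanishing operators can still compose to a vanishing gradient.
# For S_P(a, 1 - a) = 1 - a + a^2 the derivative is 2a - 1, zero at a = 1/2.

def s_product(a, c):
    return a + c - a * c

def ddx(f, x, eps=1e-6):
    """Central finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

tautology = lambda a: s_product(a, 1 - a)      # valuation of a OR NOT a
assert abs(ddx(tautology, 0.5)) < 1e-6         # flat exactly at a = 1/2
assert abs(ddx(tautology, 0.9) - 0.8) < 1e-4   # 2a - 1 elsewhere
```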
3.2.1 Aggregation
Aggregation of both formulas and instances is an important choice when working with DFL. The function used for instance-aggregation is the interpretation of the ∀ quantifier in Equation 3.7. Because we limit ourselves to universal quantification, and formulas are aggregated under the assumption that every formula is true, we are not interested in 'oring' aggregators. We next discuss the theoretical benefits and problems of each aggregator.
8 In fact, this inspired (Minervini et al. 2017) to use generated adversarial examples that are not consistent with K as the 'sampled batch' C. For more details, see Section 5.2.
9 We assume in this thesis that the knowledge base K does not contain facts and only contains universally quantified formulas. We separate the learning of facts from the learning of the other formulas to be able to use standard supervised learning methods for learning the facts.
3.2.1.1 Minimum Aggregator

The minimum aggregator is given as

ATG(x1, ..., xn) = min(x1, ..., xn). (3.13)

It corresponds to strict universal quantification (or 'anding'): for this aggregator to be high, every single input element needs to be high. The partial derivatives are given by

∂ATG(x1, ..., xn)/∂xi = { 1 if i = argmin_j xj; 0 otherwise }. (3.14)
It is easy to see that this is a poor aggregator: there is a nonzero gradient on only a single element. However, many practical formulas have exceptions. For example, if we believe that all ravens are black, we would be surprised to see that white ravens do exist, even if they are very rare. Furthermore, a raven might turn red if someone throws a bucket of paint over it. Because only the lowest-scoring input has a nonzero gradient, this aggregator is likely to correct just that exception, 'forgetting' correct behavior. Additionally, the gradient is computed inefficiently, as we still have to compute the forward pass for all other inputs even though they receive no feedback signal.
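A finite-difference check illustrates how the minimum aggregator concentrates all gradient on its smallest input (the helper names are ours):

```python
# Sketch: the minimum aggregator only passes gradient to its smallest input.

def a_min(xs):
    return min(xs)

def partial(f, xs, i, eps=1e-6):
    """Central finite-difference partial derivative of f at xs along input i."""
    up, dn = list(xs), list(xs)
    up[i] += eps
    dn[i] -= eps
    return (f(up) - f(dn)) / (2 * eps)

xs = [0.9, 0.2, 0.7, 0.95]       # the 'white raven' sits at index 1
grads = [partial(a_min, xs, i) for i in range(len(xs))]
assert 0.99 < grads[1] < 1.01    # only the argmin input gets a gradient
assert all(abs(grads[i]) < 1e-12 for i in (0, 2, 3))
```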
3.2.1.2 Łukasiewicz Aggregator

The Łukasiewicz aggregator is given as

ATLU(x1, ..., xn) = max(Σ_{i=1}^{n} xi − (n − 1), 0). (3.15)

This again is strict universal quantification. The partial derivatives are given by

∂ATLU(x1, ..., xn)/∂xi = { 1 if Σ_{i=1}^{n} xi > n − 1; 0 otherwise }. (3.16)
This is also a very poor aggregation operator. There is only a gradient when Σ_{i=1}^{n} xi > n − 1, that is, only when the average value of the xi is larger than (n − 1)/n (Páll Jónsson 2018). Because lim_{n→∞} (n − 1)/n = 1, this effectively means that, for larger values of n, nearly all inputs must already be satisfied. In other words, we can only learn when we are already (almost) correct. As there would nearly never be any gradient during learning, this aggregation operator would render DFL useless.
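A quick sketch makes the saturation concrete:

```python
# Sketch: the Lukasiewicz aggregator saturates at 0 unless the inputs are
# already almost all true, so it rarely provides any gradient.

def a_lukasiewicz(xs):
    """Equation 3.15: max(sum(x_i) - (n - 1), 0)."""
    return max(sum(xs) - (len(xs) - 1), 0.0)

n = 100
# Mean 0.98 is below (n - 1)/n = 0.99: the aggregator is stuck at 0.
assert a_lukasiewicz([0.98] * n) == 0.0
# Only once the mean exceeds 0.99 does the output (and gradient) appear.
assert abs(a_lukasiewicz([0.999] * n) - 0.9) < 1e-9
```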
3.2.1.3 Yager Aggregator

The Yager aggregator is given as

ATY(x1, ..., xn) = max(1 − (Σ_{i=1}^{n} (1 − xi)^p)^{1/p}, 0),   p ≥ 1. (3.17)

The Łukasiewicz aggregator is the special case of the Yager aggregator with p = 1. Furthermore, as p approaches infinity, it approaches the minimum aggregator. The derivative of the Yager aggregator is

∂ATY(x1, ..., xn)/∂xi = { (Σ_{j=1}^{n} (1 − xj)^p)^{1/p − 1} · (1 − xi)^{p−1} if (Σ_{j=1}^{n} (1 − xj)^p)^{1/p} < 1; 0 otherwise }. (3.18)
This derivative vanishes whenever \(\left(\sum_{j=1}^n (1 - x_j)^p\right)^{\frac{1}{p}} \geq 1\). By exponentiating by p, we note that then also \(\sum_{j=1}^n (1 - x_j)^p \geq 1\) holds. As \(1 - x_i \in [0, 1]\), \((1 - x_i)^p\) is a decreasing function with respect to p. Therefore, \(\frac{1}{n}\sum_{i=1}^n (1 - x_i)^p < \frac{1}{n}\) holds for a larger proportion of the domain when p increases. We can quantify this for the common (Euclidean) case p = 2.
Proposition 3.1. The ratio of points \(x_1, \dots, x_n \in [0, 1]\) for which there is some \(x_i\) with \(\frac{\partial A_{T_Y}(x_1, \dots, x_n)}{\partial x_i} > 0\) is equal to \(\frac{\pi^{n/2}}{2^n \cdot \Gamma(\frac{1}{2}n + 1)}\).
Figure 3.3: The ratio of points in [0, 1]n for which ATY with p = 2 has a positive gradient.
Proof. We begin by noting from Equation 3.18 that there is only a gradient whenever \(\sum_{i=1}^n (1 - x_i)^2 < 1\). The points for which this inequality holds form an n-ball with radius 1. An n-ball is the generalization of the concept of a ball to n dimensions: the region enclosed by an (n − 1)-hypersphere.^10 A hypersphere with radius 1 is the set of points which are at a distance of 1 from its center. The volume of an n-ball is given by (Ball et al. 1997, p. 5):
\[
V(n) = \frac{\pi^{n/2}}{\Gamma(\frac{1}{2}n + 1)}. \tag{3.19}
\]
As we are only interested in the volume of this n-ball within a single orthant,^11 we have to divide this volume by the number of orthants in which the n-ball lies, which is \(2^n\).^12 The total volume of a single orthant is 1. Thus, the ratio of points in \([0, 1]^n\) that have a nonzero gradient is \(\frac{V(n)}{2^n} = \frac{\pi^{n/2}}{2^n \cdot \Gamma(\frac{1}{2}n + 1)}\).
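Proposition 3.1 can also be checked empirically. The sketch below (plain Python; function names are ours) compares the closed-form ratio with a Monte Carlo estimate of the fraction of points in \([0,1]^n\) where the p = 2 Yager aggregator has a nonzero gradient:

```python
import math
import random

def gradient_ratio(n):
    """Closed form from Proposition 3.1: pi^(n/2) / (2^n * Gamma(n/2 + 1))."""
    return math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))

def monte_carlo_ratio(n, samples=100_000, seed=0):
    """Fraction of uniform points in [0,1]^n with sum_i (1 - x_i)^2 < 1."""
    rng = random.Random(seed)
    hits = sum(
        sum((1 - rng.random()) ** 2 for _ in range(n)) < 1
        for _ in range(samples)
    )
    return hits / samples

for n in (2, 4, 8):
    print(n, gradient_ratio(n), monte_carlo_ratio(n))
# n = 2 gives pi/4 (about 0.785); by n = 8 the ratio is already below 2%.
```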
We plot the values of this ratio up to n = 12 in Figure 3.3. It shows that in practice this vanishing-gradient problem persists even for small input domains. If we are concerned only with optimizing the truth value, we can simply remove the max constraint, resulting in the ‘unbounded Yager’ norm
\[
A_{UY}(x_1, \dots, x_n) = 1 - \left(\sum_{i=1}^n (1 - x_i)^p\right)^{\frac{1}{p}}, \quad p \geq 1. \tag{3.20}
\]
However, then the co-domain of the function is no longer [0, 1]. We can do a linear transformation on this function to ensure this is the case (Appendix A.1).
Definition 3.5. For some \(p \geq 0\), the Mean-p Error aggregator \(A_{ME}^p\) is defined as
\[
A_{ME}^p(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_{i=1}^n (1 - x_i)^p\right)^{\frac{1}{p}}. \tag{3.21}
\]
The ‘error’ here is the difference between the predicted value \(x_i\) and the ‘ground truth’ value, 1. This function has the following derivative:
\[
\frac{\partial A_{ME}^p(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n}\left(\frac{1}{n}\sum_{j=1}^n (1 - x_j)^p\right)^{\frac{1}{p} - 1} \cdot (1 - x_i)^{p-1}. \tag{3.22}
\]
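To illustrate the contrast with the bounded Yager aggregator: in the sketch below (NumPy; names are ours), every input receives a gradient weighted by how badly it is satisfied, rather than the all-or-nothing signal of Equation 3.18:

```python
import numpy as np

def mean_p_error(x, p=2):
    """A_ME^p from Equation 3.21."""
    return 1 - np.mean((1 - x) ** p) ** (1 / p)

def mean_p_error_grad(x, p=2):
    """Equation 3.22: each input gets a signal growing with its own error."""
    n = len(x)
    m = np.mean((1 - x) ** p)
    return (1 / n) * m ** (1 / p - 1) * (1 - x) ** (p - 1)

x = np.array([0.9, 0.95, 0.1, 0.8])
print(mean_p_error(x, p=2))       # about 0.54
print(mean_p_error_grad(x, p=2))  # largest for the worst-satisfied input
```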
10. The 3-ball (or ball) is surrounded by a sphere (or 2-sphere). Similarly, the 2-ball (or disk) is surrounded by a circle (or 1-sphere).
11. An orthant in n dimensions is a generalization of the quadrant in two dimensions and the octant in three dimensions.
12. To help understand this, consider n = 2. The 2-ball is the disk with center (0, 0). The area of this disk is evenly distributed over the \(2^2 = 4\) quadrants.
We quickly mention two special cases. The first is p = 1:
\[
A_{MAE}(x_1, \dots, x_n) = 1 - \frac{1}{n}\sum_{i=1}^n (1 - x_i), \tag{3.23}
\]
which has the simple derivative \(\frac{\partial A_{MAE}(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n}\). This measure is equal to the mean absolute error (MAE) (as the error is always nonnegative) and is associated with the Lukasiewicz norm. Another special case is p = 2:
\[
A_{RMSE}(x_1, \dots, x_n) = 1 - \sqrt{\frac{1}{n}\sum_{i=1}^n (1 - x_i)^2}. \tag{3.24}
\]
This function is the root-mean-square error (RMSE) (also known as the root-mean-square deviation). It is commonly used for regression tasks and heavily weights outliers.
We can do the same for the Yager s-norm \(\min\big((a^p + b^p)^{1/p}, 1\big)\) (see Appendix A.1):

Definition 3.6. For some \(p \geq 1\), the p-Mean aggregator is defined as
\[
A_{M}^p(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_{i=1}^n x_i^p\right)^{\frac{1}{p}}. \tag{3.25}
\]
p = 1 corresponds to the arithmetic mean and p = 2 to the quadratic mean (the root mean square).^13 Additionally, for p > 1 its derivative has the issue of assigning high values to inputs that are already high. Note that the arithmetic mean \(A_M^1\) has the same derivative as the mean absolute error \(A_{MAE}\).
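A small numerical illustration of the p-Mean (a NumPy sketch; the function name is ours): as p grows, the aggregator interpolates from the arithmetic mean towards the maximum, which is why it behaves like an ‘oring’ operator:

```python
import numpy as np

def p_mean(x, p=1):
    """A_M^p from Equation 3.25."""
    return np.mean(np.asarray(x, dtype=float) ** p) ** (1 / p)

x = np.array([0.2, 0.4, 0.9])
print(p_mean(x, p=1))   # arithmetic mean, 0.5
print(p_mean(x, p=2))   # quadratic mean, pulled towards the high input
print(p_mean(x, p=50))  # approaches max(x) = 0.9 for large p
```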
3.2.1.4 Product Aggregator The product aggregator is given as
\[
A_{T_P}(x_1, \dots, x_n) = \prod_{i=1}^n x_i. \tag{3.26}
\]
This again is strict universal quantification, with the following partial derivatives:
\[
\frac{\partial A_{T_P}(x_1, \dots, x_n)}{\partial x_i} = \prod_{j=1,\, j \neq i}^n x_j. \tag{3.27}
\]
\(\nabla A_{T_P}(x_1, \dots, x_n) > 0\) if \(x_1, \dots, x_n > 0\), which is nonvanishing, as \(x_1 = \dots = x_n = 0\) is extremely unlikely to be relevant in practice. However, the derivative with respect to some input \(x_i\) is decreased if some other input \(x_j\) is low, even when the two inputs are independent. Furthermore, in practice we cannot compute this aggregation operator directly, as numerical underflow occurs when multiplying many small numbers. Luckily, we can use a common trick: we note that \(\operatorname{argmax} f(x) = \operatorname{argmax} \log(f(x))\), as log is a strictly monotonically increasing function.
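The underflow problem and the log trick are easy to demonstrate (a NumPy sketch):

```python
import numpy as np

x = np.full(1000, 0.01)  # a thousand low truth values

# The naive product aggregator underflows to exactly 0.0 in float64:
# 0.01^1000 = 1e-2000 is far below the smallest representable double.
print(np.prod(x))         # 0.0

# The log-product aggregator stays finite and keeps a usable gradient
# (each input contributes d/dx_i sum_j log(x_j) = 1/x_i).
print(np.sum(np.log(x)))  # about -4605.2
```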
If we use the product aggregator both for connecting instances and formulas, and our formulas are in prenex normal form, the best satisfiability problem in Equation 3.9 using the product norm \(A_{T_P}\) can be written as
\[
\eta_\theta^* = \operatorname*{argmin}_{\eta_\theta} - \prod_{\varphi \in K} e_\theta\big(\varphi = \forall x_1, \dots, x_{n_\varphi}\, \phi, \{\}\big)^{w_\varphi} \tag{3.28}
\]
\[
= \operatorname*{argmin}_{\eta_\theta} - \prod_{\varphi \in K} \prod_{o_1, \dots, o_{n_\varphi} \in C} e_\theta\big(\phi, \{x_1/o_1, \dots, x_{n_\varphi}/o_{n_\varphi}\}\big)^{w_\varphi} \tag{3.29}
\]
\[
= \operatorname*{argmin}_{\eta_\theta} - \sum_{\varphi \in K} w_\varphi \cdot \sum_{o_1, \dots, o_{n_\varphi} \in C} \log e_\theta\big(\phi, \{x_1/o_1, \dots, x_{n_\varphi}/o_{n_\varphi}\}\big) \tag{3.30}
\]
where \(n_\varphi\) is the depth of nesting of universal quantifiers in the prenex normal form formula \(\varphi = \forall x_1, \dots, x_{n_\varphi}\, \phi_\varphi\), and \(\phi_\varphi\) is the quantifier-free part of the formula \(\varphi\), also known as the matrix of \(\varphi\). We call this the log-product aggregator:
\[
A_{T_P}^{\log}(x_1, \dots, x_n) = (\log \circ A_{T_P})(x_1, \dots, x_n) = \sum_{i=1}^n \log(x_i). \tag{3.31}
\]
13. Donadello, Serafini, and A. d. Garcez (2017), Diligenti, Roychowdhury, and Gori (2017) and Marra et al. (2019) used these for the semantics of ∀, even though the p-Mean is an ‘oring’ and not an ‘anding’ aggregator. The motivation they give is that it is better than the minimum aggregator \(A_{T_G}\), as the more examples satisfy the formula, the higher the truth value of the formula. We agree with this