=Paper=
{{Paper
|id=Vol-3261/paper5
|storemode=property
|title=A view to a KILL: knowledge injection via lambda layer
|pdfUrl=https://ceur-ws.org/Vol-3261/paper5.pdf
|volume=Vol-3261
|authors=Matteo Magnini,Giovanni Ciatto,Andrea Omicini
|dblpUrl=https://dblp.org/rec/conf/woa/MagniniCO22
}}
==A view to a KILL: knowledge injection via lambda layer==
A view to a KILL: Knowledge Injection via Lambda Layer

Matteo Magnini*, Giovanni Ciatto and Andrea Omicini
Dipartimento di Informatica – Scienza e Ingegneria (DISI), Alma Mater Studiorum—Università di Bologna

Abstract. We propose KILL (Knowledge Injection via Lambda Layer) as a novel method for the injection of symbolic knowledge into neural networks (NN), allowing data scientists to control what the network should (not) learn. Unlike other similar approaches, our method does not (i) require ground input formulæ, (ii) impose any constraint on the NN undergoing injection, or (iii) affect the loss function of the NN. Instead, it acts directly at the backpropagation level, by increasing the penalty whenever the NN output violates the injected knowledge. An experiment is reported to demonstrate the potential (and limits) of our approach.

Keywords: symbolic knowledge injection, AI, ML, neural networks

WOA 2022: Workshop “From Objects to Agents”, September 1–2, 2022, Genova, Italy. * Corresponding author. matteo.magnini@unibo.it (M. Magnini); giovanni.ciatto@unibo.it (G. Ciatto); andrea.omicini@unibo.it (A. Omicini). http://matteomagnini.apice.unibo.it (M. Magnini); http://giovanniciatto.apice.unibo.it (G. Ciatto); http://andreaomicini.apice.unibo.it (A. Omicini). ORCID: 0000-0001-9990-420X (M. Magnini); 0000-0002-1841-8996 (G. Ciatto); 0000-0002-6655-3869 (A. Omicini).

1. Introduction

A major issue in supervised machine learning (ML) is the usage of opaque predictors – such as neural networks (NN) – as black boxes [1]. It is not trivial to understand what NN actually learn from training data, and how they generalise from that data to the whole domain. Currently, state-of-the-art solutions address this issue by supporting the inspection of such predictors via a plethora of different mechanisms [2]. In this work we tackle the problem from another perspective. We discuss how prior symbolic knowledge can be injected into NN, during learning, in order to endow them with the designer’s common sense. Injected knowledge may thus harness the NN and make it learn what is desired, while preventing it from violating the designer’s constraints—expressed in a symbolic way.

Along this line, we propose KILL (Knowledge Injection via Lambda Layer) as a method for the injection of logic formulæ in Datalog form [3] into neural networks of any shape. Unlike other related works, our method (i) does not require the input formulæ to be ground, (ii) does not impose any constraint on the NN undergoing injection, and (iii) does not require the loss function of the network to be affected. Conversely, our method acts directly at the backpropagation level, by increasing the penalty to be backpropagated whenever the NN output violates the knowledge to be injected. In other words, KILL performs knowledge injection by constraining the network’s training to adhere to the symbolic knowledge.

To validate our method, we report a simple experiment where the designer’s common sense – conveniently represented as logic formulæ – is injected into a NN classifier, to improve its accuracy. Notably, our experiment shows how KILL can be exploited to improve classification performance in inconvenient scenarios where training data is relatively small, and classes are unbalanced and overlapping.
Indeed, thanks to symbolic knowledge injection, these inconveniences can be effectively tackled without re-engineering the dataset. The experiment also reveals a lack of robustness w.r.t. extremely rare classes. Notably, this limitation lets us elaborate an interesting discussion on the limits of symbolic knowledge injection attained via constraining.

Accordingly, the paper is organised as follows. Section 2 briefly summarises the background on symbolic knowledge injection, eliciting a number of related works. Then, Section 3 formally describes KILL, as well as its rationale and internal operation. Section 4 reports our experiments and their design, while results are discussed in Section 5. Finally, Section 6 concludes the paper, providing some insights about how the current limitations of KILL could be overcome.

2. Background & related works

In this paper, symbolic knowledge injection (SKI) is the task of letting a sub-symbolic predictor exploit formal, symbolic information to improve its predictive performance (e.g. accuracy, F1-measure, learning time, etc.) over data, or to use the predictor as a logic engine. Unlike the numeric data upon which predictors are commonly trained, symbolic data is generally more compact and expressive, as intensional representations of complex concepts may be concisely written. In particular, symbolic information may encode bold rules that must be satisfied by the concepts the predictor is willing to learn. Hence, provided that some SKI procedure is available, data scientists may craft ad-hoc collections of symbolic expressions aimed at aiding the training of a particular predictor, for a specific learning task. In other words, injection enables provisioning prior knowledge – namely, the designer’s common sense – to ML predictors under training.

When it comes to neural networks, approaches for SKI are manifold, and the literature on this topic is vast [4, 5, 6]. Broadly speaking, there exist at least two major sorts of approaches – not mutually exclusive – supporting the injection of symbolic knowledge into neural networks. Approaches of the first sort perform injection during the network’s training, using the symbolic knowledge as either a constraint or a guide for the optimisation process [7, 8, 9, 10, 11, 12]. Conversely, approaches of the second sort perform injection by altering the network’s architecture to make it mimic the symbolic knowledge [13, 14, 7, 15, 16, 17, 18, 11, 19]. In the remainder of this paper, we focus on approaches of the former sort, as our proposed method falls in this category. According to the state of the art, the former strategy commonly works by converting the symbolic knowledge into an additive regularisation term to be added to the loss function used for training. In particular, when the predictor is a NN, knowledge injection is performed during the backpropagation step. When the loss function is evaluated, if the network violates the integrated symbolic constraints, then the actual loss value will be greater than in the unconstrained case. In this way, data scientists can lead the network’s learning algorithm to minimise constraint violations, while minimising its error w.r.t. the data as well.
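For concreteness, the loss-based strategy described above can be illustrated with a small fragment. The following is only a schematic sketch – not the method proposed in this paper – assuming TensorFlow/Keras and a hypothetical, user-supplied violation_degree function returning, for each prediction, a value in [0, 1] measuring how much the symbolic constraints are violated:

  import tensorflow as tf

  # Schematic sketch of loss-based SKI: the training loss is augmented with an
  # additive regularisation term, so that constraint violations make the value
  # being minimised larger. `violation_degree` is a hypothetical user-supplied
  # function (it is NOT part of Keras) mapping predictions to violation degrees.
  def make_constrained_loss(violation_degree, weight=1.0):
      cce = tf.keras.losses.CategoricalCrossentropy()
      def loss(y_true, y_pred):
          data_term = cce(y_true, y_pred)                             # ordinary data-fitting error
          knowledge_term = tf.reduce_mean(violation_degree(y_pred))   # penalty for violating the rules
          return data_term + weight * knowledge_term
      return loss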
Concerning the symbolic knowledge, virtually all techniques we are aware of require information to be represented via (some subset of) first-order logic (FOL) formulæ—e.g. propositional logic (quite limited) [14, 7, 15, 18, 9, 10, 12] or full FOL [13, 16, 17]. Actual methods may then vary, depending on (i) which particular subset of FOL they rely upon, (ii) how logic formulæ are interpreted as constraints, and (iii) whether formulæ need to be grounded or not before SKI can occur. To the best of our knowledge, virtually all methods proposed so far work by converting formulæ into fuzzy logic functions, and they often require formulæ to be grounded at some point in the process.

With respect to other state-of-the-art contributions, our proposal differs in several ways. In particular, we accept logic formulæ in Datalog form as input – meaning that we support a restricted subset of FOL – and we do not require those formulæ to be ground—neither are formulæ grounded anywhere in the process.

3. Knowledge Injection via Lambda Layer

We propose KILL – short for Knowledge Injection via Lambda Layer – as an approach to SKI where the training process is constrained in such a way that the network is penalised whenever its predictions are in conflict with the symbolic knowledge to be injected. In doing so, our approach does not impose any constraint on the architecture (e.g. number of layers, number of neurons, types of activation functions, etc.) or the initialisation status (e.g. random weights or partially trained) of the network subject to injection. It does require, however, (i) the network to have an input and an output layer, and (ii) the network to be trained via gradient descent. Furthermore, it also requires (iii) the symbolic knowledge to be expressed via one or more Datalog formulæ, and (iv) those formulæ to encode logic statements about the network’s input or output features.

3.1. Λ-layers for SKI

KILL performs injection during training. It works by appending one further layer – the Λ-layer henceforth – at the output end of the neural network, and by training the overall network as usual, via gradient descent or other similar strategies. The Λ-layer is in charge of introducing an error (w.r.t. the actual prediction provided by the network’s output layer) whenever the prediction violates the symbolic knowledge. The error is expected to affect the gradient descent – or whatever optimisation function is used – in such a way that violating the symbolic knowledge is discouraged. In other words, the NN inductively learns, over the examples, the penalties applied to wrong predictions, and is therefore far more inclined to avoid such wrong predictions.

To serve its purpose, the Λ-layer requires an ad-hoc activation function altering the outcome of the network’s original output layer. It also needs the logic formulæ to be numerically interpreted – i.e., converted into functions of real numbers – to draw actual error values. Once the NN training is over, the injection phase is considered over as well, hence the Λ-layer can be removed and the remaining network can be used as usual. Hence, no architectural property of the original network is hindered by the addition of the Λ-layer.

Differences notwithstanding, injecting by regularising the loss function and injecting via the Λ-layer serve the same purpose.

Figure 1: General architecture of a NN with 𝑛 hidden layers, where 𝑋 and 𝑌 are respectively the input and output layers. Before training, the KILL algorithm is applied to the predictor so that a new model constrained by the knowledge base (KB) is created. The Λ-layer is kept during the training and then removed for inference.
However, in some cases it may be easier to manipulate the network via the Λ-layer rather than the loss function; in other scenarios, the designer may choose to preserve the layer after the training, possibly with a different cost function than the one reported in Equation (1). A more detailed discussion about the advantages of the Λ-layer is reported in Section 5. In the remainder of this section we delve into the details of KILL. First, we discuss how the Λ-layer affects the network architecture, and how it works in detail. Then, we discuss how logic formulæ can be numerically interpreted as errors.

3.2. Additional Λ-layer

We consider the case of a symbolic knowledge base, denoted by 𝒦, to be injected into a feed-forward NN of arbitrary depth, denoted by 𝒩. We denote as 𝑋 (resp., 𝑌) the input (resp., output) layer of 𝒩. Without loss of generality, we consider the case where the shape of 𝑌 is 𝑛 × 1, but a similar discussion would hold for layers of any shape (e.g. bi-, tri-, or multi-dimensional). We draw no assumptions on the activation function of 𝑌, nor on the amount, topology, or nature of the hidden layers connecting 𝑋 and 𝑌, nor on the shape of 𝑋. Hence, we denote by y = [𝑦1, . . . , 𝑦𝑛] the output of 𝑌—i.e., the prediction of the network for some input x. We also assume 𝒦 to be composed by as many rules as there are neurons in 𝑌, thus we write 𝒦 = {𝜑1, . . . , 𝜑𝑛}, where 𝜑𝑖 is a symbolic formula describing/constraining the relation among x and 𝑦𝑖.

To perform injection, we alter the structure of 𝒩 by adding one additional layer – namely, the Λ-layer – as depicted in Figure 1, and we then train it as usual, via gradient descent. The Λ-layer is densely connected with both 𝑋 and 𝑌, and its activation function aims at introducing a penalty on 𝑦𝑖 every time the formula 𝜑𝑖 is violated by some input–output pair (x, y). In particular, we denote as 𝜆 the output of the Λ-layer, which is defined as follows:

𝜆 = y × (1 + 𝐶(x, y))   (1)

where 𝐶(x, y) is a positive penalty vector representing the cost of modifying the actual output of the network y. In turn, the cost vector is defined as follows:

𝐶(x, y) = [𝑐1(x, 𝑦1), . . . , 𝑐𝑖(x, 𝑦𝑖), . . . , 𝑐𝑛(x, 𝑦𝑛)]   (2)

where 𝑐𝑖 : 𝑋 × 𝑌 → [0, 1] is a function interpreting 𝜑𝑖 as a cost in the [0, 1] range, for each possible actual value of x and 𝑦𝑖. An in-depth discussion about how logic formulæ can be interpreted as continuous (a.k.a. fuzzy) penalties is provided in Section 3.4. For now, it is enough to understand that a penalty is added to the output of the 𝑖th neuron of 𝑌 whenever the corresponding prediction violates the symbolic knowledge in 𝒦. Such a penalty is closer to 𝑦𝑖 when the formula 𝜑𝑖 is violated the most, while it is closer to 0 either when the formula is violated the least, or when there is no formula for that neuron.

The rationale behind the Λ-layer, and the penalty it introduces, is that of altering the output of the network in such a way that its error is very high when it violates the knowledge base to be injected. In this way, the network error is the result of a function of two different components: the actual prediction error and the penalty. The overall error is thus minimised during backpropagation, as well as both its components.
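To make the mechanics of Equations (1) and (2) more tangible, the following fragment sketches how a Λ-layer could be appended to an existing Keras model. It is a minimal illustration under our own assumptions (TensorFlow/Keras, and a user-supplied cost function computing 𝐶(x, y) as a per-neuron penalty in [0, 1]); it is not the actual PSyKI implementation, and the names build_predictor and inject are merely illustrative:

  import tensorflow as tf
  from tensorflow.keras import layers, Model

  def build_predictor(n_inputs: int, n_classes: int) -> Model:
      # An arbitrary feed-forward classifier: KILL draws no assumptions on its shape.
      x = layers.Input(shape=(n_inputs,))
      h = layers.Dense(128, activation="relu")(x)
      h = layers.Dense(128, activation="relu")(h)
      y = layers.Dense(n_classes, activation="softmax")(h)
      return Model(x, y)

  def inject(predictor: Model, cost) -> Model:
      # Λ-layer (Equation 1): amplify each output by (1 + c_i(x, y_i)), so that
      # predictions violating the knowledge base yield a larger training error.
      x, y = predictor.input, predictor.output
      lam = layers.Lambda(lambda t: t[1] * (1.0 + cost(t[0], t[1])))([x, y])
      return Model(x, lam)

  # Training is performed on the constrained model; since layers (and weights)
  # are shared, the original predictor can then be used for inference as usual,
  # which corresponds to removing the Λ-layer after training.
  # constrained = inject(build_predictor(10, 10), cost)
  # constrained.compile(optimizer="adam", loss="categorical_crossentropy")
  # constrained.fit(train_x, train_y, epochs=100, batch_size=32)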
3.3. Input knowledge

KILL supports the injection of knowledge bases composed of one or more logic formulæ in “stratified Datalog with negation” form—that is, a variant of Datalog [20] with no recursion (neither direct nor indirect), yet supporting negated atoms. We choose Datalog because of its expressiveness (strictly higher than propositional logic) and its acceptable limitations. The lack of recursion, in particular, prevents issues when it comes to converting formulæ into neural structures (which are DAGs). More precisely, Datalog is a restricted subset of FOL, representing knowledge via function-free Horn clauses [3]. Horn clauses, in turn, are formulæ of the form 𝜑 ← 𝜓1 ∧ 𝜓2 ∧ . . . denoting a logic implication (←) stating that 𝜑 (the head of the clause) is implied by the conjunction of a number of atoms 𝜓1, 𝜓2, . . . (the body of the clause). Since we rely on Datalog with negation, we allow atoms in the bodies of clauses to be negated. In case the 𝑖th atom in the body of some clause is negated, we write ¬𝜓𝑖. There, each atom 𝜑, 𝜓1, 𝜓2, . . . may be a predicate of arbitrary arity. An 𝑙-ary predicate 𝑝 denotes a relation among 𝑙 entities: p(𝑡1, . . . , 𝑡𝑙), where each 𝑡𝑖 is a term, i.e., either a constant (denoted in monospace) representing a particular entity, or a logic variable (denoted by Capitalised Italic) representing some unknown entity or value. Well-known binary predicates – e.g., >, <, = – are admissible, too, and retain their usual semantics from arithmetic. For the sake of readability, we may write these predicates in infix form—hence >(𝑋, 1) ≡ 𝑋 > 1.

Consider for instance the case of a rule aimed at defining when a Poker hand can be classified as a pair—the example may be useful in the remainder of this paper. Assuming that a Poker hand consists of 5 cards, each one denoted by a couple of variables 𝑅𝑖, 𝑆𝑖 – where 𝑅𝑖 (resp. 𝑆𝑖) is the rank (resp. suit) of the 𝑖th card in the hand –, hands of type pair may be described via a set of clauses such as the following ones:

pair(𝑅1, 𝑆1, . . . , 𝑅5, 𝑆5) ← 𝑅1 = 𝑅2
pair(𝑅1, 𝑆1, . . . , 𝑅5, 𝑆5) ← 𝑅2 = 𝑅3
. . .
pair(𝑅1, 𝑆1, . . . , 𝑅5, 𝑆5) ← 𝑅4 = 𝑅5   (3)

To support injection into a particular NN, we further assume the input knowledge base defines one (and only one) outer relation – say output or class – involving as many variables as the input and output features the NN has been trained upon. That relation must be defined via one clause per output neuron. Yet, each clause may contain other predicates in its body, in turn defined by one or more clauses. In that case, since we rely on stratified Datalog, we require the input knowledge to not include any (directly or indirectly) recursive clause definition. For example, for a 3-class classification task, any provided knowledge base should include clauses such as the following ones:

class(𝑋̄, y1) ← p1(𝑋̄) ∧ p2(𝑋̄)
p1(𝑋̄) ← . . .
p2(𝑋̄) ← . . .
class(𝑋̄, y2) ← p1′(𝑋̄) ∧ p2′(𝑋̄)
p1′(𝑋̄) ← . . .
p2′(𝑋̄) ← . . .
class(𝑋̄, y3) ← p1′′(𝑋̄) ∧ p2′′(𝑋̄)
p1′′(𝑋̄) ← . . .
p2′′(𝑋̄) ← . . .

where 𝑋̄ is a tuple having as many variables as the neurons in the output layer, y𝑖 is a constant denoting the 𝑖th class, and p1, p2, p1′, p2′, p1′′, p2′′ are ancillary predicates defined via Horn clauses as well.

3.4. Logic formulæ as penalties

Before undergoing injection, each formula corresponding to some output neuron must be converted into a real-valued function aimed at computing the cost of violating that formula. To this end, we rely on a multi-valued interpretation of logic inspired by Łukasiewicz’s logic [21]. Accordingly, we encode each formula via the ⟦·⟧ function, mapping logic formulæ into real-valued functions accepting real vectors of size 𝑚 + 𝑛 as input and returning scalars in ℝ as output.
These scalars are then clipped into the [0, 1] range via the function 𝜂 : ℝ → [0, 1], defined as follows:

𝜂(𝑥) = 0 if 𝑥 ≤ 0;  𝜂(𝑥) = 𝑥 if 0 < 𝑥 < 1;  𝜂(𝑥) = 1 if 𝑥 ≥ 1   (4)

The resulting values are the penalties discussed in Section 3.2. Hence, the penalty associated with the 𝑖th neuron violating rule 𝜑𝑖 can be written as 𝑐𝑖(x, 𝑦𝑖) = 𝜂(⟦𝜑𝑖⟧(x, 𝑦𝑖)). The ⟦·⟧ encoding function is recursively defined in Table 1.

Table 1: Encoding of logic formulæ into real-valued functions. There, 𝑋 is a logic variable, while 𝑥 is the corresponding real-valued variable, whereas 𝑋̄ is a tuple of logic variables. Similarly, k is a numeric constant and 𝑘 is the corresponding real value, whereas k𝑖 is the constant denoting the 𝑖th class of a classification problem. Finally, expr(𝑋̄) is an arithmetic expression involving the variables in 𝑋̄.

Formula | Continuous interpretation
⟦¬𝜑⟧ | 𝜂(1 − ⟦𝜑⟧)
⟦𝜑 ∧ 𝜓⟧ | 𝜂(max(⟦𝜑⟧, ⟦𝜓⟧))
⟦𝜑 ∨ 𝜓⟧ | 𝜂(min(⟦𝜑⟧, ⟦𝜓⟧))
⟦𝜑 = 𝜓⟧ | 𝜂(|⟦𝜑⟧ − ⟦𝜓⟧|)
⟦𝜑 ≠ 𝜓⟧ | ⟦¬(𝜑 = 𝜓)⟧
⟦𝜑 > 𝜓⟧ | 𝜂(0.5 − ⟦𝜑⟧ + ⟦𝜓⟧)
⟦𝜑 ≥ 𝜓⟧ | 𝜂(⟦𝜓⟧ − ⟦𝜑⟧)
⟦𝜑 < 𝜓⟧ | 𝜂(0.5 + ⟦𝜑⟧ − ⟦𝜓⟧)
⟦𝜑 ≤ 𝜓⟧ | 𝜂(⟦𝜑⟧ − ⟦𝜓⟧)
⟦class(𝑋̄, y𝑖) ← 𝜓⟧ | ⟦𝜓⟧ (encodes the penalty for the 𝑖th neuron)
⟦expr(𝑋̄)⟧ | expr(⟦𝑋̄⟧)
⟦true⟧ | 0
⟦false⟧ | 1
⟦𝑋⟧ | 𝑥
⟦k⟧ | 𝑘
⟦p(𝑋̄)⟧ | ⟦𝜓1 ∨ . . . ∨ 𝜓𝑘⟧ (assuming predicate p is defined by 𝑘 clauses of the form p(𝑋̄) ← 𝜓1, . . . , p(𝑋̄) ← 𝜓𝑘)

Put simply, when computing the penalty 𝑐𝑖(x, 𝑦𝑖) of the 𝑖th neuron, KILL looks for the only Datalog rule of the form class(𝑋̄, y𝑖) ← 𝜓. It then focuses on the body of this rule – namely, 𝜓 – ignoring its head—since the head simply reports which expected output the rule is focussing upon. If the body 𝜓 contains some predicates 𝑝1, 𝑝2 defined by one or more clauses in the provided knowledge base, then these predicates are replaced by the disjunction of the bodies of all the clauses defining them. This process is repeated until no ancillary predicates remain in 𝜓, except for binary expressions involving input variables, constants, arithmetic operators, and logical connectives. Finally, operators and connectives are replaced by continuous functions, as indicated by Table 1. The whole process produces a real-valued interpretation of the original formula, which is then used by KILL to compute 𝑐𝑖(x, 𝑦𝑖).

Figure 2 depicts an example of the encoding process, where a logic formula is firstly simplified – i.e., converted into a form where it only contains a minimal subset of operators – and then encoded into an actual real-valued function. The example formula is:

class(𝑋1, 𝑋2, z) ← (𝑋1 ≥ k) ∧ (𝑋2 ≥ h)

where k, h, z are numeric constants, while 𝑋1 and 𝑋2 are input variables and 𝑌 is an output variable. In particular, Figure 2a shows the abstract syntax tree (AST) of this formula, Figure 2b shows the same AST where the ≤ operator is replaced by a negated > operator, and Figure 2c shows the AST of the encoded function.

Figure 2: Example of the encoding process of logic formulæ into real-valued functions. Only ASTs are depicted: (a) AST of a logic formula; (b) simplified AST of the same formula; (c) AST of the same formula, encoded as a real-valued function. Boxes coloured in the same way represent the encoding of a given operator through each encoding step. So, for instance, the operator < (red) is firstly converted into a negated ≥, and then into a combination of max and subtractions.
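As a small worked illustration of Table 1 and Equation (4), the fragment below interprets the example formula of Figure 2 as a penalty function. It is a sketch under our own naming (eta, geq, conj, and cost_class_z are illustrative helpers, and the values of k and h are placeholders), not the actual encoder used in our implementation:

  import numpy as np

  def eta(x):
      # Equation (4): clip penalties into the [0, 1] range.
      return np.clip(x, 0.0, 1.0)

  def geq(a, b):
      # Table 1: the penalty of violating a >= b is eta(b - a).
      return eta(b - a)

  def conj(*penalties):
      # Table 1: a conjunction is as violated as its most violated conjunct.
      return eta(np.maximum.reduce(penalties))

  def cost_class_z(x1, x2, k=0.5, h=0.5):
      # Penalty for class(X1, X2, z) <- (X1 >= k) AND (X2 >= h):
      # 0 when both premises hold, growing towards 1 as they are violated.
      return conj(geq(x1, k), geq(x2, h))

  # cost_class_z(0.9, 0.8) == 0.0  (rule satisfied, no penalty)
  # cost_class_z(0.1, 0.8) == 0.4  (first premise violated, penalty added)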
4. Experiments

Here we report a number of experiments aimed at assessing KILL for SKI w.r.t. its capability to improve neural networks’ predictive performance (for the sake of reproducibility, the code of our experiments is available at https://github.com/MatteoMagnini/kill-experiments-woa-2022). A public implementation of the KILL algorithm is available in PSyKI [22].

The design of the experiments is straightforward. We consider a simple learning task – namely, classification – on a finite domain where (i) it is easy to formulate correct constraints in Datalog, and (ii) NN training is difficult because of, e.g., poor separability among classes, as well as unevenly distributed training data. Along this line, we write a set of logic formulæ, one for each class, logically denoting how classification should be performed. We then train a neural network to solve such a classification task, with and without injecting those formulæ. We repeat the experiment by injecting different subsets of formulæ each time. Finally, we assess if and under which circumstances SKI succeeds/fails in improving the network’s predictive accuracy.

The rationale behind the experiment design is to assess the effectiveness of SKI in a toy scenario where the correctness of the symbolic knowledge is undoubted, and where an ordinary NN may easily struggle in reaching good predictive performance in a reasonable time. A secondary goal of this design is to identify potential corner cases where SKI falls short. As we empirically demonstrate in the remainder of this section, KILL is capable of improving the predictive performance of a neural network classifier trained on such a dataset, despite being sensitive to severe class unbalancing. A discussion about the possible motivations behind such sensitivity is provided in Section 5.

4.1. Poker hand data set and logic rules

We rely on the poker hand data set [23], which subtends a multi-classification task on a finite – yet very large – discrete domain, where classes are overlapping and heavily unbalanced, while exact classification rules can be written as logic formulæ. It consists of a tabular dataset containing 1,025,010 records—each one composed of 11 features. Each record encodes a poker hand of 5 cards. Hence, each record involves 5 couples of features – denoting the cards in the hand – plus a single categorical feature denoting the class of the hand. Two features are necessary to identify each card: suit and rank. Suit is a categorical feature (heart, spade, diamond and club), while rank is an ordinal feature—suitably represented by an integer between 1 and 13 (ace to king). The multi-classification task consists in predicting the poker hand’s class. Each hand may be classified as one of 10 different classes denoting the nature of the hand according to the rules of Poker (e.g. nothing, pair, double pair, flush).

This data set satisfies all the aforementioned requirements: (i) the input space is discrete and finite in size – its size is the number of 5-permutations of 52 cards, i.e., 52!/(52−5)! – and the available dataset is just a small sample of it; (ii) classes are extremely unbalanced, as shown in Table 2: a few classes (e.g.
nothing and pair) cover nearly half of the dataset, while most classes cover less than 1% of it; (iii) there is a hierarchy between classes (e.g. if there are three cards with the same rank, then the class is three of a kind even if the condition for one pair is satisfied). We use 25,010 records for training and the remaining million for testing, as shown in Table 2.

Table 2: Poker hand dataset statistics per class.
Class | Train. instances | Train. freq. (%) | Test instances | Test freq. (%)
nothing | 12,493 | 49.95 | 501,209 | 50.12
pair | 10,599 | 42.38 | 422,498 | 42.25
two pairs | 1,206 | 4.82 | 47,622 | 4.76
three of a kind | 513 | 2.05 | 21,121 | 2.11
straight | 93 | 0.37 | 3,885 | 0.39
flush | 54 | 0.22 | 1,996 | 0.2
full house | 36 | 0.14 | 1,424 | 0.14
four of a kind | 6 | 0.024 | 230 | 0.023
straight flush | 5 | 0.02 | 12 | 0.001
royal flush | 5 | 0.02 | 3 | 1.2 · 10⁻⁵
Total | 25,010 | 100 | 1,000,000 | 100

We define a class rule for each class, encoding the preferred way of classifying a Poker hand. For example, let {𝑆1, 𝑅1, . . . , 𝑆5, 𝑅5} be the logic variables representing a Poker hand (𝑆 for suit and 𝑅 for rank); then for class flush we define the following rule:

class(𝑅1, 𝑆1, . . . , 𝑅5, 𝑆5, flush) ← flush(𝑅1, 𝑆1, . . . , 𝑅5, 𝑆5)   (5)
flush(𝑅1, 𝑆1, . . . , 𝑅5, 𝑆5) ← 𝑆1 = 𝑆2 ∧ 𝑆1 = 𝑆3 ∧ 𝑆1 = 𝑆4 ∧ 𝑆1 = 𝑆5

All other rules have the same structure as Equation (5): the left-hand side declares the expected class, while the right-hand side describes the necessary conditions for that class—possibly via some ancillary predicates such as flush. Table 3 provides an overview of all the rules we rely upon in our experiments.

4.2. Methodology

It is worth repeating that we choose to use the same data partitioning proposed by the authors of the dataset, meaning that the training set consists of 25,010 samples and the test set of 1,000,000. Such a small training set w.r.t. the test set is quite unusual: it makes the task more challenging, yet at the same time results are more reliable. We use the same starting model for all the experiments, consisting of a fully connected NN with 3 layers, where each layer has the rectified linear unit (ReLU) as activation function, except for the last one, which has softmax. We use categorical cross-entropy as the loss function for training. After empirical experiments, the best NN has 128 neurons in the first and second layers—and 10 in the output layer (the number of classes). We use a batch size of 32 and 100 epochs for the network’s training. Networks’ performance is evaluated using accuracy, macro-F1 and weighted-F1 score functions. To evaluate the network’s performance with knowledge injection, we remove the Λ-layer and use the resulting NN as is. Since we are not relying on any validation set, to avoid overfitting we use 3 stopping criteria during the training of the network: (i) for 99% of the training examples the activation of every output unit is within 0.25 of the correct value, (ii) at most 100 epochs, (iii) the predictor has at least 90% accuracy on training examples but has not improved its ability to classify training examples for 5 epochs. Similar criteria were used in [14]. Finally, we run 30 experiments for each configuration to have a statistical population for comparisons.
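For reference, a baseline predictor matching the description above could be set up roughly as follows. This is a sketch under stated assumptions – the optimiser and the flat 10-feature input encoding are our choices, not specified in the text – and stopping criterion (iii) is only approximated with Keras’ built-in EarlyStopping, while criterion (i) would require a custom callback and is omitted:

  import tensorflow as tf
  from tensorflow.keras import layers, callbacks

  # "Classic" predictor of Section 4.2: two hidden layers of 128 ReLU units,
  # a 10-unit softmax output (one neuron per poker-hand class), categorical
  # cross-entropy, batch size 32, at most 100 epochs.
  model = tf.keras.Sequential([
      layers.Dense(128, activation="relu", input_shape=(10,)),  # 5 cards x (suit, rank)
      layers.Dense(128, activation="relu"),
      layers.Dense(10, activation="softmax"),
  ])
  model.compile(optimizer="adam",  # optimiser is an assumption of this sketch
                loss="categorical_crossentropy",
                metrics=["accuracy"])

  # Rough stand-in for stopping criterion (iii): stop when training accuracy
  # fails to improve past the 0.9 baseline for 5 consecutive epochs.
  stop = callbacks.EarlyStopping(monitor="accuracy", baseline=0.9,
                                 patience=5, restore_best_weights=True)
  # model.fit(train_x, train_y, epochs=100, batch_size=32, callbacks=[stop])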
4.3. Results

We define two different configurations for the experiments: (i) “classic”, where we use the NN described in Section 4.2; (ii) “knowledge”, where we apply the KILL algorithm on the same network architecture. For both configurations we use the same hyper-parameters and run 30 experiments. Results are reported in Table 4 and in Figure 3.

Table 3: Datalog formulæ describing poker hands. For the sake of readability, variables in the heads of formulæ and in the arguments of predicates are abbreviated.

Pair:
class(𝑅1, . . . , 𝑆5, pair) ← pair(𝑅1, . . . , 𝑆5)
pair(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2    pair(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅3
pair(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅4    pair(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅5
pair(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅3    pair(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅4
pair(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅5    pair(𝑅1, . . . , 𝑆5) ← 𝑅3 = 𝑅4
pair(𝑅1, . . . , 𝑆5) ← 𝑅3 = 𝑅5    pair(𝑅1, . . . , 𝑆5) ← 𝑅4 = 𝑅5

Two Pairs:
class(𝑅1, . . . , 𝑆5, two) ← two(𝑅1, . . . , 𝑆5)
two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅3 = 𝑅4    two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅3 ∧ 𝑅2 = 𝑅4
two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅4 ∧ 𝑅2 = 𝑅3    two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅3 = 𝑅5
two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅3 ∧ 𝑅3 = 𝑅5    two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅5 ∧ 𝑅2 = 𝑅3
two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅4 = 𝑅5    two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅4 ∧ 𝑅2 = 𝑅5
two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅5 ∧ 𝑅2 = 𝑅4    two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅3 ∧ 𝑅4 = 𝑅5
two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅4 ∧ 𝑅3 = 𝑅5    two(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅5 ∧ 𝑅3 = 𝑅4
two(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅3 ∧ 𝑅4 = 𝑅5    two(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅4 ∧ 𝑅3 = 𝑅5
two(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅5 ∧ 𝑅3 = 𝑅4

Three of a Kind:
class(𝑅1, . . . , 𝑆5, three) ← three(𝑅1, . . . , 𝑆5)
three(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅1 = 𝑅3    three(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅1 = 𝑅4
three(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅1 = 𝑅5    three(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅3 ∧ 𝑅1 = 𝑅4
three(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅3 ∧ 𝑅1 = 𝑅5    three(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅4 ∧ 𝑅1 = 𝑅5
three(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅3 ∧ 𝑅2 = 𝑅4    three(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅3 ∧ 𝑅2 = 𝑅5
three(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅4 ∧ 𝑅2 = 𝑅5    three(𝑅1, . . . , 𝑆5) ← 𝑅3 = 𝑅4 ∧ 𝑅3 = 𝑅5

Straight:
class(𝑅1, . . . , 𝑆5, straight) ← royal(𝑅1, . . . , 𝑆5)
class(𝑅1, . . . , 𝑆5, straight) ← straight(𝑅1, . . . , 𝑆5)
straight(𝑅1, . . . , 𝑆5) ← (𝑅1 + 𝑅2 + 𝑅3 + 𝑅4 + 𝑅5) = (5 * min(𝑅1, . . . , 𝑅5) + 10) ∧ ¬pair(𝑅1, . . . , 𝑆5)
royal(𝑅1, . . . , 𝑆5) ← min(𝑅1, . . . , 𝑅5) = 1 ∧ (𝑅1 + 𝑅2 + 𝑅3 + 𝑅4 + 𝑅5 = 47) ∧ ¬pair(𝑅1, . . . , 𝑆5)

Flush:
class(𝑅1, . . . , 𝑆5, flush) ← flush(𝑅1, . . . , 𝑆5)
flush(𝑅1, . . . , 𝑆5) ← 𝑆1 = 𝑆2 ∧ 𝑆1 = 𝑆3 ∧ 𝑆1 = 𝑆4 ∧ 𝑆1 = 𝑆5

Four of a Kind:
class(𝑅1, . . . , 𝑆5, four) ← four(𝑅1, . . . , 𝑆5)
four(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅1 = 𝑅3 ∧ 𝑅1 = 𝑅4    four(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅1 = 𝑅3 ∧ 𝑅1 = 𝑅5
four(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅2 ∧ 𝑅1 = 𝑅4 ∧ 𝑅1 = 𝑅5    four(𝑅1, . . . , 𝑆5) ← 𝑅1 = 𝑅3 ∧ 𝑅1 = 𝑅4 ∧ 𝑅1 = 𝑅5
four(𝑅1, . . . , 𝑆5) ← 𝑅2 = 𝑅3 ∧ 𝑅2 = 𝑅4 ∧ 𝑅2 = 𝑅5

Full House:
class(𝑅1, . . . , 𝑆5, full) ← three(𝑅1, . . . , 𝑆5) ∧ two(𝑅1, . . . , 𝑆5) ∧ ¬four(𝑅1, . . . , 𝑆5)

Straight Flush:
class(𝑅1, . . . , 𝑆5, straight_flush) ← straight(𝑅1, . . . , 𝑆5) ∧ flush(𝑅1, . . . , 𝑆5)
class(𝑅1, . . . , 𝑆5, straight_flush) ← royal(𝑅1, . . . , 𝑆5) ∧ flush(𝑅1, . . . , 𝑆5)

Royal Flush:
class(𝑅1, . . . , 𝑆5, royal) ← royal(𝑅1, . . . , 𝑆5) ∧ flush(𝑅1, . . . , 𝑆5)

Nothing:
class(𝑅1, . . . , 𝑆5, nothing) ← ¬pair(𝑅1, . . . , 𝑆5) ∧ ¬flush(𝑅1, . . . , 𝑆5) ∧ ¬straight(𝑅1, . . . , 𝑆5) ∧ ¬royal(𝑅1, . . . , 𝑆5)

5. Discussion

In general, experiments show that both predictors are very good at classifying common classes such as “nothing” and “pair”. Also for the less frequent classes “two pairs” and “three of a kind”, accuracy is still high.

Table 4: Test set accuracy, macro-F1 and weighted-F1 on all classes, and single mean class accuracies. All measures represent the mean over the experiment population (30 runs).
Metric | Classic | Knowledge
Accuracy | 0.962 | 0.978
Macro-F1 | 0.512 | 0.538
Weighted-F1 | 0.96 | 0.977
Acc. Nothing | 0.977 | 0.989
Acc. Pair | 0.968 | 0.985
Acc. Two Pairs | 0.867 | 0.914
Acc. Three | 0.913 | 0.922
Acc. Straight | 0.415 | 0.509
Acc. Flush | 0.002 | 0.002
Acc. Full | 0.628 | 0.69
Acc. Four | 0.186 | 0.19
Acc. Straight F. | 0.003 | 0
Acc. Royal F. | 0 | 0
Figure 3: Class accuracy distributions of the two kinds of experiments, one distribution per class (from “Nothing” to “Royal F.”): blue is used for the “classic” experiments, while red is used for the ones with KILL (“knowledge”).

The remaining six classes represent only 0.8% of the training set and are therefore much more difficult to predict correctly. For “full house” and “straight”, accuracy reaches middling values, while for the remaining classes accuracy is pretty close to 0. All of this is quite expected, due to the strong unbalance in the distribution of classes. The only remarkable fact is that accuracy is extremely low for class “flush”, even though less frequent classes such as “full house” and “four of a kind” have higher accuracy. We hypothesise that this happens because the vast majority of the data consists of classes which depend only on the values of rank, and therefore the network tends to weigh rank much more. Indeed, only “flush”, “straight flush” and “royal flush” (about 0.26% of the training set) depend on the values of suit.

Concerning the comparison between the “classic” predictor – with no additional knowledge – and the “knowledge” predictor – obtained by applying the KILL algorithm – results show that the latter has higher performance with statistical significance (Student’s t-test with p-value < 0.01 when comparing the overall accuracy of the two populations of experiments). In particular, we can observe that the predictor exploiting knowledge during training has on average a higher accuracy for all frequent classes. Also for less common classes such as “straight” and “full house” there is great improvement. However, accuracy is not improved for “flush” and for the very sporadic classes.

We speculate that the reason for the lack of improvement in the prediction of rare classes lies in the kind of injection we are performing. Injecting knowledge by constraining the network during training is effective as long as classes are represented by a sufficient amount of examples. For instance, if a class is not represented at all, having a logic rule for that specific class is the same as not having it. So, if we have only a few examples representing a class over a million examples, the effect is too small to make a difference. We believe that this issue is possibly common to all SKI algorithms based on constraining; however, this should be verified in future works. To overcome such a limit, we may exploit the advantage of having a Λ-layer affecting the constraining instead of operating directly on the loss function of the network. More precisely, we can keep the Λ-layer after the training of the network and just change the function in Equation (1) in such a way that the cost of the violation of the knowledge is used to reduce the network’s error, and not to increase it. This should also be explored in future works.

6. Conclusion

In this work we define KILL, a general technique for prior symbolic knowledge injection into deep neural networks. Designers may use Datalog rules to express common sense, which is injected through the Λ-layer into the network. Rules are encoded into class-specific fuzzy logic functions that add a cost to the class prediction value when the rule is violated. We report a number of experiments where we compare networks without knowledge injection with networks that receive additional information in a multi-classification task.
The selected task has some common criticalities of ML classification tasks, in particular limited data set size, unbalanced classes, class overlapping, and intra-class constraints. Results show that our approach can improve the network’s accuracy for classes that are not too sporadic, as well as the overall network’s accuracy. However, results also reveal a limitation: KILL is quite sensitive w.r.t. situations which have been rarely met during training. We speculate this is a general limitation characterising SKI methods acting by constraining the predictor during training. Along this line, further experiments over different methods are required to confirm such a general statement. Accordingly, in our future works we shall consider different Λ-layer functions and test the technique on different domains. To mitigate the sensitivity w.r.t. rare situations, we intend to investigate scenarios where we keep the Λ-layer after the training but, instead of increasing the network’s error, it reduces the error w.r.t. the prior knowledge (i.e., it uses a different function in the Λ-layer). In this way we combine techniques based on both constraining during the training and structuring (i.e., altering the predictor’s architecture). This is expected to mitigate the issue, as information about rare situations is encoded into the network structure after the training.

Acknowledgments

This paper was partially supported by the CHIST-ERA IV project “Expectation” – CHIST-ERA-19-XAI-005 – co-funded by EU and the Italian MUR (Ministry for University and Research).

References

[1] Z. C. Lipton, The mythos of model interpretability, Communications of the ACM 61 (2018) 36–43. doi:10.1145/3233231.
[2] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys 51 (2019) 93:1–93:42. doi:10.1145/3236009.
[3] M. Ajtai, Y. Gurevich, Datalog vs first-order logic, Journal of Computer and System Sciences 49 (1994) 562–588. doi:10.1016/S0022-0000(05)80071-6.
[4] T. R. Besold, A. S. d’Avila Garcez, S. Bader, H. Bowman, P. M. Domingos, P. Hitzler, K. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima, L. de Penning, G. Pinkas, H. Poon, G. Zaverucha, Neural-symbolic learning and reasoning: A survey and interpretation, in: P. Hitzler, M. K. Sarker (Eds.), Neuro-Symbolic Artificial Intelligence: The State of the Art, volume 342 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2021, pp. 1–51. doi:10.3233/FAIA210348.
[5] Y. Xie, Z. Xu, K. S. Meel, M. S. Kankanhalli, H. Soh, Embedding symbolic knowledge into deep networks, in: H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, 2019, pp. 4235–4245. URL: https://proceedings.neurips.cc/paper/2019/hash/7b66b4fd401a271a1c7224027ce111bc-Abstract.html.
[6] R. Calegari, G. Ciatto, A. Omicini, On the integration of symbolic and sub-symbolic techniques for XAI: A survey, Intelligenza Artificiale 14 (2020) 7–32. doi:10.3233/IA-190036.
[7] V. Tresp, J. Hollatz, S. Ahmad, Network structuring and training using rule-based knowledge, in: S. J. Hanson, J. D. Cowan, C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1992, pp. 871–878. URL: http://papers.nips.cc/paper/638-network-structuring-and-training-using-rule-based-knowledge. NIPS Conference, Denver, Colorado, USA, November 30–December 3, 1992.
[8] S. Bader, S. Hölldobler, N. C. Marques, Guiding backprop by inserting rules, in: A. S. d. Garcez, P. Hitzler (Eds.), Proceedings of the 4th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy), Patras, Greece, July 21, 2008, volume 366 of CEUR Workshop Proceedings, CEUR-WS.org, 2008, pp. 19–22. URL: http://ceur-ws.org/Vol-366/paper-5.pdf.
[9] M. Diligenti, S. Roychowdhury, M. Gori, Integrating prior knowledge into deep learning, in: X. Chen, B. Luo, F. Luo, V. Palade, M. A. Wani (Eds.), 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017, Cancun, Mexico, December 18–21, 2017, IEEE, 2017, pp. 920–923. doi:10.1109/ICMLA.2017.00-37.
[10] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Van den Broeck, A semantic loss function for deep learning with symbolic knowledge, in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 5498–5507. URL: http://proceedings.mlr.press/v80/xu18h.html.
[11] R. Evans, E. Grefenstette, Learning explanatory rules from noisy data, Journal of Artificial Intelligence Research 61 (2018) 1–64. doi:10.1613/jair.5714.
[12] G. Marra, F. Giannini, M. Diligenti, M. Gori, LYRICS: A general interface layer to integrate logic inference and deep learning, in: U. Brefeld, É. Fromont, A. Hotho, A. J. Knobbe, M. H. Maathuis, C. Robardet (Eds.), Machine Learning and Knowledge Discovery in Databases – European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part II, volume 11907 of Lecture Notes in Computer Science, Springer, 2019, pp. 283–298. doi:10.1007/978-3-030-46147-8_17.
[13] D. H. Ballard, Parallel logical inference and energy minimization, in: T. Kehler (Ed.), Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia, PA, USA, August 11–15, 1986, Volume 1: Science, Morgan Kaufmann, 1986, pp. 203–209. URL: http://www.aaai.org/Library/AAAI/1986/aaai86-033.php.
[14] G. G. Towell, J. W. Shavlik, M. O. Noordewier, Refinement of approximate domain theories by knowledge-based neural networks, in: H. E. Shrobe, T. G. Dietterich, W. R. Swartout (Eds.), Proceedings of the 8th National Conference on Artificial Intelligence, Boston, Massachusetts, USA, July 29 – August 3, 1990, 2 Volumes, AAAI Press / The MIT Press, 1990, pp. 861–866. URL: http://www.aaai.org/Library/AAAI/1990/aaai90-129.php.
[15] A. S. d. Garcez, G. Zaverucha, The connectionist inductive learning and logic programming system, Applied Intelligence 11 (1999) 59–77. doi:10.1023/A:1008328630915.
[16] A. S. d. Garcez, D. M. Gabbay, Fibring neural networks, in: D. L. McGuinness, G. Ferguson (Eds.), Proceedings of the Nineteenth National Conference on Artificial Intelligence, Sixteenth Conference on Innovative Applications of Artificial Intelligence, July 25–29, 2004, San Jose, California, USA, AAAI Press / The MIT Press, 2004, pp. 342–347. URL: http://www.aaai.org/Library/AAAI/2004/aaai04-055.php.
[17] S. Bader, A. S. d. Garcez, P. Hitzler, Computing first-order logic programs by fibring artificial neural networks, in: I. Russell, Z. Markov (Eds.), Proceedings of the 18th International Florida Artificial Intelligence Research Society Conference (FLAIRS), Clearwater Beach, Florida, USA, May 15–17, 2005, AAAI Press, 2005, pp. 314–319. URL: http://www.aaai.org/Library/FLAIRS/2005/flairs05-052.php.
[18] M. V. M. França, G. Zaverucha, A. S. d. Garcez, Fast relational learning using bottom clause propositionalization with artificial neural networks, Machine Learning 94 (2014) 81–104. doi:10.1007/s10994-013-5392-1.
[19] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. De Raedt, Neural probabilistic logic programming in DeepProbLog, Artificial Intelligence 298 (2021) 103504. doi:10.1016/j.artint.2021.103504.
[20] D. Maier, K. T. Tekle, M. Kifer, D. S. Warren, Datalog: concepts, history, and outlook, in: M. Kifer, Y. A. Liu (Eds.), Declarative Logic Programming: Theory, Systems, and Applications, ACM / Morgan & Claypool, 2018, pp. 3–100. doi:10.1145/3191315.3191317.
[21] L. S. Hay, Axiomatization of the infinite-valued predicate calculus, The Journal of Symbolic Logic 28 (1963) 77–86. URL: http://www.jstor.org/stable/2271339.
[22] M. Magnini, G. Ciatto, A. Omicini, On the design of PSyKI: a platform for symbolic knowledge injection into sub-symbolic predictors, in: D. Calvaresi, A. Najjar, M. Winikoff, K. Främling (Eds.), Proceedings of the 4th International Workshop on EXplainable and TRAnsparent AI and Multi-Agent Systems, volume 13283 of Lecture Notes in Computer Science, Springer, 2022, pp. 90–108. doi:10.1007/978-3-031-15565-9_6.
[23] R. Cattral, F. Oppacher, Poker hand data set, UCI machine learning repository, 2007. URL: https://archive.ics.uci.edu/ml/datasets/Poker+Hand.