Exploiting T-norms for Deep Learning in Autonomous Driving

Mihaela C. Stoian¹,*, Eleonora Giunchiglia², and Thomas Lukasiewicz²,¹
¹ Department of Computer Science, University of Oxford, UK
² Institute of Logic and Computation, TU Wien, Austria

Abstract
Deep learning has been at the core of the development of the autonomous driving field, due to the neural networks’ success in finding patterns in raw data and turning them into accurate predictions. Moreover, recent neuro-symbolic works have shown that incorporating the available background knowledge about the problem at hand in the loss function via t-norms can further improve the deep learning models’ performance. However, t-norm-based losses may have very high memory requirements and, thus, they may be impossible to apply in complex application domains like autonomous driving. In this paper, we show how it is possible to define memory-efficient t-norm-based losses, allowing for exploiting t-norms for the task of event detection in autonomous driving. We conduct an extensive experimental analysis on the ROAD-R dataset and show (i) that our proposal can be implemented and run on GPUs with less than 25 GiB of available memory, while standard t-norm-based losses are estimated to require more than 100 GiB, far exceeding the amount of memory normally available, (ii) that t-norm-based losses improve performance, especially when limited labelled data are available, and (iii) that t-norm-based losses can further improve performance when exploited on both labelled and unlabelled data.

Keywords
Neuro-symbolic AI, Autonomous Driving, Logical Constraints, T-norms, Memory-efficiency

NeSy 2023, 17th International Workshop on Neural-Symbolic Learning and Reasoning, July 03–05, 2023, Certosa di Pontignano, Siena, Italy
* Corresponding author.
mihaela.stoian@cs.ox.ac.uk (M. C. Stoian); eleonora.giunchiglia@tuwien.ac.at (E. Giunchiglia); thomas.lukasiewicz@tuwien.ac.at (T. Lukasiewicz)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Deep learning has been at the core of the development of the autonomous driving field [1, 2], due to the neural networks’ success in finding patterns in raw data and turning them into accurate predictions. However, existing self-driving vehicle systems are very limited in their capabilities [3], with many of the obstacles to reaching a fully autonomous system being rooted in the underlying neural models’ own caveats, such as their inherent data greediness and the impossibility of incorporating background knowledge about the problem at hand. Recently, neuro-symbolic methods have emerged as a way to integrate background knowledge within the neural networks’ topology (see, e.g., [4, 5, 6]) and/or loss function (see, e.g., [7, 8, 9]), with a large number of them highlighting a positive impact particularly in scenarios where little annotated data are available (see, e.g., [10, 11, 12]). A popular method to include background knowledge expressed as logical constraints into neural networks consists in relaxing the constraints using t-norms and incorporating them in the loss function [7, 8, 13].
Such a t-norm-based loss function is not only intuitive, but it has also been shown to improve neural networks’ performance on a range of different tasks (including event detection in autonomous driving [14]), especially when limited data are available. However, t-norm-based losses may have very high memory requirements, and thus they may be impossible to apply when considering complex application domains like event detection in autonomous driving.

In this paper, we show how it is possible to define memory-efficient t-norm-based losses, allowing for exploiting t-norms for the task of event detection in autonomous driving. We conduct an extensive experimental analysis on the ROAD-R dataset [14] and show that our proposal can be implemented and run on GPUs with less than 25 GiB of available memory, while standard t-norm-based losses are estimated to require more than 100 GiB, far exceeding the amount of memory normally available. Then, we train our state-of-the-art event detection model using different percentages of labelled training data (i.e., 10%, 20%, 50%, 75%, and 100%), and we show that, while t-norm-based losses can improve the performance of the models in all cases, they are particularly helpful when data are scarce. Indeed, our models yield an improvement of up to 1.85% and 3.95% when using 10% and 20% of the labelled training data, respectively. Finally, we investigate the behaviour of the t-norm-based loss when only 10% annotated data are available, along with 10% unlabelled data, and find that applying the t-norm-based loss on both labelled and unlabelled data after a warm-up training phase leads to further improvements, i.e., up to 2.75% w.r.t. the fully supervised baseline.

2. Preliminaries

An event detection problem 𝒫 is a pair (𝒜, 𝒳), where 𝒜 is a finite set of labels, and 𝒳 is a set of pairs (𝑋, 𝒴), where:

1. 𝑋 ∈ ℝ^(3×𝑊×𝐻) is the tensor associated with each frame in the video. 𝑊 (resp., 𝐻) represents the width (resp., height) of each frame, while 3 is the number of channels used in the RGB encoding,
2. 𝒴 is the ground truth of 𝑋, comprising a set of pairs (𝑏, ℒ), where 𝑏 ∈ ℝ⁴ represents the coordinates of a bounding box, i.e., a rectangle marking the position of an agent in the frame, while ℒ represents the set of labels associated with 𝑏.

Figure 1: Visual example of a data point: a frame 𝑋 (shown as its 3 × 𝑊 × 𝐻 tensor of pixel values) together with its ground truth 𝒴, e.g., ([498.1, 264.1, 791.1, 621.2], {MedVeh, Brake, Stop, VehLane}) and ([801.4, 401.9, 866.9, 476.2], {LarVeh, MovTow, IncomLane}).

Figure 1 illustrates an example of a data point 𝑋 and its ground truth 𝒴. A model 𝑚 for 𝒫 takes as input a sequence of video frames and, for each input frame 𝑋, its outputs are the pairs (𝑏̂, 𝑦̂), where each 𝑏̂ ∈ ℝ⁴ represents a predicted bounding box, and 𝑦̂ ∈ [0, 1]^|𝒜| represents the confidence of the model regarding which labels can be associated with 𝑏̂. Given an output (𝑏̂, 𝑦̂), a prediction is then defined as the pair (𝑏̂, ℒ̂), where ℒ̂ is the set of labels associated with 𝑏̂; a label 𝐴 is associated with 𝑏̂ if 𝑦̂_𝐴 ≥ 𝜃, where 𝜃 is a user-defined threshold. To predict the bounding boxes in each frame, standard off-the-shelf event detection models use anchor boxes, which are predefined boxes of varying sizes at different locations within the frame. For each frame, the model defines 𝐷 anchor boxes whose positions are fixed, and then the position of each predicted bounding box is computed as the offset from one anchor box.
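For concreteness, the following is a minimal sketch, in PyTorch, of the thresholding step just described, i.e., turning a model output (𝑏̂, 𝑦̂) into a prediction (𝑏̂, ℒ̂). It is not the authors’ implementation: the label set, the threshold value, and all names (to_predictions, LABELS, THETA) are illustrative assumptions.

```python
import torch

LABELS = ["Car", "Moving", "Stopped"]  # a toy label set A; ROAD-R has 41 labels
THETA = 0.5                            # user-defined threshold (illustrative value)

def to_predictions(boxes: torch.Tensor, scores: torch.Tensor, theta: float = THETA):
    """boxes: D x 4 predicted bounding boxes; scores: D x |A| confidences in [0, 1].
    Returns one (b_hat, L_hat) pair per box, where L_hat = {A : scores_A >= theta}."""
    predictions = []
    for b, y in zip(boxes, scores):
        labels = {LABELS[j] for j in range(len(LABELS)) if y[j] >= theta}
        predictions.append((b.tolist(), labels))
    return predictions

# Toy usage with two predicted boxes:
boxes = torch.tensor([[498.1, 264.1, 791.1, 621.2],
                      [801.4, 401.9, 866.9, 476.2]])
scores = torch.tensor([[0.9, 0.8, 0.1],
                       [0.7, 0.2, 0.6]])
print(to_predictions(boxes, scores))
```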
An event detection problem with propositional logic constraints (𝒫, Π) consists of an event detection problem 𝒫 and a finite set of constraints Π, expressed over the set 𝒜 of labels in 𝒫. We assume w.l.o.g. that the constraints are given as a set of clauses, each of the form:

𝑙₁ ∨ 𝑙₂ ∨ ⋯ ∨ 𝑙ₙ,   (1)

where every 𝑙ᵢ is a literal, i.e., either a label 𝐴 ∈ 𝒜 or its negation ¬𝐴, for 𝑖 ∈ {1, …, 𝑛}. Intuitively, (1) expresses the fact that the model should always predict at least one of the literals in the clause, i.e., in {𝑙₁, …, 𝑙ₙ}. We assume that, in any clause, a label occurs either positively or negatively at most once. We say that a label 𝐴 occurs positively (resp., negatively) in the clause (1) if there is a literal 𝑙 in (1) such that 𝑙 = 𝐴 (resp., 𝑙 = ¬𝐴), and that 𝐴 occurs in (1) if 𝐴 occurs either positively or negatively in (1).

3. Memory-efficient t-norm-based loss

Inspired by [13, 7], we add a new regularisation term to the localisation and classification losses to express the degree of satisfaction of the logical constraints. Since our constraints are all of the form (1), we can easily convert each of them into a form containing only negations and conjunctions (i.e., 𝑙₁ ∨ 𝑙₂ ∨ ⋯ ∨ 𝑙ₙ ≡ ¬(¬𝑙₁ ∧ ¬𝑙₂ ∧ ⋯ ∧ ¬𝑙ₙ)). We can then relax:

1. the conjunction using different t-norms [15]. A t-norm is a function 𝑇 : [0, 1]² → [0, 1] such that, for every 𝑎, 𝑏, 𝑐 ∈ [0, 1]: 𝑇(𝑎, 𝑏) = 𝑇(𝑏, 𝑎), 𝑇(𝑎, 1) = 𝑎, 𝑇(𝑎, 0) = 0, 𝑇(𝑎, 𝑇(𝑏, 𝑐)) = 𝑇(𝑇(𝑎, 𝑏), 𝑐), and 𝑎 ≤ 𝑏 → 𝑇(𝑎, 𝑐) ≤ 𝑇(𝑏, 𝑐);
2. the negation using strong negation, which, given 𝑎 ∈ [0, 1], is defined as 1 − 𝑎.

The above operation is equivalent to directly relaxing the disjunction using the appropriate t-conorm. In the first two columns of Table 1, we summarise the most used t-norms together with their respective t-conorms.

We now show how to implement t-norm-based loss functions, first in the standard way, and then in a memory-efficient way using sparse tensors. Let (𝒫, Π) be an event detection problem with propositional logic constraints. We can express Π using two matrices 𝐶⁺ and 𝐶⁻, both of size |Π| × |𝒜|, such that 𝐶⁺ᵢⱼ = 1 (resp., 𝐶⁻ᵢⱼ = 1) if the 𝑗-th label appears positively (resp., negatively) in the 𝑖-th constraint, and 𝐶⁺ᵢⱼ = 0 (resp., 𝐶⁻ᵢⱼ = 0) otherwise. We call 𝐶⁺ (resp., 𝐶⁻) the positive (resp., negative) constraints matrix. Let 𝑚 be an event detection model for (𝒫, Π) using 𝐷 anchor boxes for each frame. For each input frame, 𝑚 outputs the prediction matrix 𝑃 of size 𝐷 × |𝒜|. Given 𝑃, 𝐶⁺, and 𝐶⁻, our goal is to compute the degree of satisfaction of each constraint for each output, which can be compactly expressed as a matrix 𝐺 of size 𝐷 × |Π|. Ultimately, we want to use 𝐺 to compute the frame-wise logic-based regularisation term in the loss, which we define as:

𝐿_logic(𝐺) = 1 − (1 / (𝐷 |Π|)) ∑_{𝑖,𝑗} 𝐺_{𝑖𝑗}.   (2)
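As an illustration, here is a minimal sketch, in PyTorch, of how the constraints matrices 𝐶⁺ and 𝐶⁻ and the regularisation term of Equation (2) could be computed. This is not the authors’ released code; the clause encoding and all names are illustrative assumptions, and the toy clauses anticipate Example 3.1 below.

```python
import torch

LABELS = ["Car", "Moving", "Stopped"]                    # the label set A
# Each clause is a list of (label, positive?) literals:
CLAUSES = [[("Moving", False), ("Car", True)],           # ¬Moving ∨ Car
           [("Moving", False), ("Stopped", False)]]      # ¬Moving ∨ ¬Stopped

def constraint_matrices(clauses, labels):
    """Returns C+ and C-, both of size |Pi| x |A|: C+[i, j] = 1 (C-[i, j] = 1)
    iff the j-th label occurs positively (negatively) in the i-th clause."""
    idx = {a: j for j, a in enumerate(labels)}
    c_pos = torch.zeros(len(clauses), len(labels))
    c_neg = torch.zeros(len(clauses), len(labels))
    for i, clause in enumerate(clauses):
        for label, positive in clause:
            (c_pos if positive else c_neg)[i, idx[label]] = 1.0
    return c_pos, c_neg

def logic_loss(g: torch.Tensor) -> torch.Tensor:
    """Equation (2): L_logic(G) = 1 - (1 / (D * |Pi|)) * sum_ij G_ij."""
    return 1.0 - g.mean()

c_pos, c_neg = constraint_matrices(CLAUSES, LABELS)
print(c_pos)  # [[1., 0., 0.], [0., 0., 0.]]
print(c_neg)  # [[0., 1., 0.], [0., 1., 1.]]
```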
Standard Approach. Let 𝑃̂ be the tensor obtained by stacking |Π| times the matrix 𝑃 along a new second dimension, and let 𝐶̂⁺ and 𝐶̂⁻ be the tensors obtained by stacking 𝐷 times the matrices 𝐶⁺ and 𝐶⁻ along a new first dimension. We then obtain three tensors, all of size 𝐷 × |Π| × |𝒜|. We can then choose the desired t-conorm and compute the goal matrix 𝐺 as:

𝐺 = t-conorm((𝑃̂ ⊙ 𝐶̂⁺) + (𝐶̂⁻ − 𝑃̂ ⊙ 𝐶̂⁻), dim = 3),

where ⊙ is the Hadamard product and, given a generic tensor 𝑄 of size 𝑝 × 𝑞 × 𝑠, t-conorm(𝑄, dim = 3) returns a matrix of size 𝑝 × 𝑞 whose element at position (𝑖, 𝑗) is equal to the value of the t-conorm computed over the third dimension, e.g., if we choose the Gödel t-conorm, we obtain t-conorm(𝑄, dim = 3)_{𝑖𝑗} = max(𝑄_{𝑖𝑗1}, 𝑄_{𝑖𝑗2}, …, 𝑄_{𝑖𝑗𝑠}).

Example 3.1. Let (𝒫, Π) be an event detection problem with propositional logic constraints such that 𝒜 = {Car, Moving, Stopped} and Π = {¬Moving ∨ Car, ¬Moving ∨ ¬Stopped}. Then, assuming the labels are numbered as listed in 𝒜, our positive and negative constraints matrices are:

𝐶⁺ = [1 0 0; 0 0 0],   𝐶⁻ = [0 1 0; 0 1 1].

Let 𝑚 be a model for (𝒫, Π) using 3 anchor boxes. Given the prediction matrix 𝑃 below, and supposing we use the Gödel t-conorm, 𝐺 is equal to:

𝑃 = [0.1 0.7 0.3; 0.9 0.9 0.2; 0.4 0.9 0.9],   𝐺 = [0.3 0.7; 0.9 0.8; 0.4 0.1].

The problem with this approach is that it requires working with dense 3-dimensional tensors, inducing a large computational overhead and making the method unfeasible, especially for application domains like autonomous driving. For example, given |Π| = 200 constraints, |𝒜| = 50 labels, and a model generating 𝐷 = 55K anchor boxes per frame (a common number for event detection problems) and taking as input sequences of 10 frames at a time, storing a single tensor of size 𝐷 × |Π| × |𝒜| over the 10-frame input (i.e., 550,000 × 200 × 50 float32 values) requires about 20 GiB. Moreover, the standard approach works with 5 tensors of this size to compute the t-norm loss in the forward pass and then backpropagate through it. Excluding any other memory allocation needed for storing the input and intermediate outputs, just computing the loss and backpropagating through it would take 100 GiB, exceeding the memory limits of even the largest GPU available today (i.e., the NVIDIA A100 Tensor Core GPU with 80 GiB of RAM [16]). Notice that this computation is done for a single input sequence; normally, deep learning models for event detection are trained using batches of 4–8 elements, each comprising 4 to 32 frames in sequence. It is thus impossible to use the above standard dense representation to train event detection models.

Table 1
For each t-norm: (i) its definition, (ii) the respective t-conorm, and (iii–iv) the operations used to update 𝐺_{𝑗_𝐴⁺} and 𝐺_{𝑗_𝐴⁻} on the grounds of the chosen t-conorm. Given two matrices, max (resp., min) represents the element-wise operation taking the maximum (resp., minimum) between the elements at the same position in the two input matrices. To simplify the notation, in the last two columns, we use 1 to refer to the matrix of ones of appropriate size.

Gödel: t-norm min(𝑎, 𝑏); t-conorm max(𝑎, 𝑏); update of 𝐺_{𝑗_𝐴⁺}: max(𝐺_{𝑗_𝐴⁺}, 𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁺|}); update of 𝐺_{𝑗_𝐴⁻}: max(𝐺_{𝑗_𝐴⁻}, 1 − 𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁻|}).
Łukasiewicz: t-norm max(𝑎 + 𝑏 − 1, 0); t-conorm min(𝑎 + 𝑏, 1); update of 𝐺_{𝑗_𝐴⁺}: min(𝐺_{𝑗_𝐴⁺} + 𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁺|}, 1); update of 𝐺_{𝑗_𝐴⁻}: min(𝐺_{𝑗_𝐴⁻} + 1 − 𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁻|}, 1).
Product: t-norm 𝑎 ⋅ 𝑏; t-conorm 1 − (1 − 𝑎)(1 − 𝑏); update of 𝐺_{𝑗_𝐴⁺}: 1 − (1 − 𝐺_{𝑗_𝐴⁺}) ⊙ (1 − 𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁺|}); update of 𝐺_{𝑗_𝐴⁻}: 1 − (1 − 𝐺_{𝑗_𝐴⁻}) ⊙ (𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁻|}).
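To make the memory issue concrete, the sketch below (PyTorch, an assumed illustration rather than the authors’ implementation) computes 𝐺 with the standard dense formulation and the Gödel t-conorm, reproducing Example 3.1; the broadcasting step materialises the full 𝐷 × |Π| × |𝒜| tensor, which is exactly what becomes prohibitive at ROAD-R scale.

```python
import torch

# Constraint matrices and predictions from Example 3.1
c_pos = torch.tensor([[1., 0., 0.],    # ¬Moving ∨ Car
                      [0., 0., 0.]])   # ¬Moving ∨ ¬Stopped
c_neg = torch.tensor([[0., 1., 0.],
                      [0., 1., 1.]])
p = torch.tensor([[0.1, 0.7, 0.3],
                  [0.9, 0.9, 0.2],
                  [0.4, 0.9, 0.9]])    # D = 3 anchors, |A| = 3 labels

def dense_goal_matrix(p, c_pos, c_neg):
    # Broadcasting creates a full D x |Pi| x |A| intermediate tensor, which is
    # what makes this formulation memory-hungry for large D and |Pi|.
    p_hat = p.unsqueeze(1)                               # D x 1 x |A|
    # (1 - P_hat) * C_neg equals C_neg - P_hat * C_neg, as in the formula above.
    literals = p_hat * c_pos + (1.0 - p_hat) * c_neg     # D x |Pi| x |A|
    return literals.max(dim=2).values                    # Gödel t-conorm over labels

g = dense_goal_matrix(p, c_pos, c_neg)
print(g)                 # [[0.3, 0.7], [0.9, 0.8], [0.4, 0.1]], as in Example 3.1
print(1.0 - g.mean())    # L_logic(G) from Equation (2)

# Back-of-the-envelope estimate for the ROAD-R-like sizes quoted above:
# a single 550,000 x 200 x 50 float32 tensor already takes ~20.5 GiB.
print(550_000 * 200 * 50 * 4 / 2**30)
```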
Sparse Matrix Representation Approach. Our solution mostly relies on the intuition that, in practice, most of the constraints are written over a subset of the available labels 𝒜, and that this subset is usually much smaller than 𝒜. For example, in our experimental analysis, we will see that although there are 41 labels available in ROAD-R, the longest constraint is written over just 15 labels. As a result, 𝐶⁺ and 𝐶⁻ contain mostly zeros. Hence, we designed a method to compute the logic-based loss that makes use of this sparsity property and ultimately avoids the high computational costs induced by the 3D tensors, operating only on 2D matrices.

Given Π, we associate with each constraint an index, and then we define the set of sequences 𝒥⁺ = {𝑗_𝐴⁺ : 𝐴 ∈ 𝒜}, where 𝑗_𝐴⁺ is the sequence of indices of the constraints in which 𝐴 occurs positively. Analogously, we define 𝒥⁻ = {𝑗_𝐴⁻ : 𝐴 ∈ 𝒜}, where 𝑗_𝐴⁻ is the sequence of indices of the constraints in which 𝐴 occurs negatively. Once we know which constraints each label occurs in, we can instantiate the goal matrix 𝐺 to the identity element of the disjunction, iterate through the labels in 𝒜, and, for each label 𝐴 ∈ 𝒜, update, according to the values in 𝑃, all the columns of 𝐺 associated with constraints where 𝐴 occurs. More specifically, we set 𝐺 = 𝟎_{𝐷×|Π|}, where 𝟎_{𝐷×|Π|} is the matrix of zeros of size 𝐷 × |Π|, and then, for each label 𝐴 ∈ 𝒜, we compute:

𝐺_{𝑗_𝐴⁺} ⟵ t-conorm(𝐺_{𝑗_𝐴⁺}, 𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁺|}),   𝐺_{𝑗_𝐴⁻} ⟵ t-conorm(𝐺_{𝑗_𝐴⁻}, 1 − 𝑃_𝐴 ⋅ 𝟙^⊤_{|𝑗_𝐴⁻|}),

where (i) 𝐺_{𝑗_𝐴⁺} (resp., 𝐺_{𝑗_𝐴⁻}) selects the columns of 𝐺 associated with constraints where 𝐴 occurs positively (resp., negatively), (ii) 𝑃_𝐴 corresponds to the column of 𝑃 associated with the label 𝐴, (iii) 𝟙_𝑛 indicates the unit column vector with 𝑛 elements, and (iv) t-conorm returns the pairwise t-conorm, i.e., given two matrices 𝑊, 𝑍 of the same size, t-conorm(𝑊, 𝑍)_{𝑖𝑗} = t-conorm(𝑊_{𝑖𝑗}, 𝑍_{𝑖𝑗}). The update operations for the Gödel, Łukasiewicz, and Product t-conorms are given in the last two columns of Table 1. Finally, given 𝐺, we compute 𝐿_logic(𝐺) as defined in Equation (2).

Example 3.2 (Example 3.1, cont’d). Let (𝒫, Π) be the problem in Example 3.1, and assume that we associate index 0 with the constraint (¬Moving ∨ Car) and index 1 with (¬Moving ∨ ¬Stopped). Let 𝜖 denote the empty sequence. Then, the sequences associated with each label are:

𝑗_Car⁺ = (0), 𝑗_Moving⁺ = 𝜖, 𝑗_Stopped⁺ = 𝜖, 𝑗_Car⁻ = 𝜖, 𝑗_Moving⁻ = (0, 1), 𝑗_Stopped⁻ = (1).

Suppose that we use the Gödel t-conorm. Then, after having initialised 𝐺 = 𝟎_{3×2}, we start updating it from the label Car:

𝐺_{𝑗_Car⁺} = max([0; 0; 0], [0.1; 0.9; 0.4]) = [0.1; 0.9; 0.4],   and thus 𝐺 = [0.1 0; 0.9 0; 0.4 0].

Since 𝑗_Car⁻ = 𝜖, we do not need to further update 𝐺 for Car. We then consider the label Moving and update 𝐺 according to 𝑗_Moving⁻ (as 𝑗_Moving⁺ = 𝜖):

𝐺_{𝑗_Moving⁻} = max([0.1 0; 0.9 0; 0.4 0], 1 − [0.7 0.7; 0.9 0.9; 0.9 0.9]) = [0.3 0.3; 0.9 0.1; 0.4 0.1],   and thus 𝐺 = [0.3 0.3; 0.9 0.1; 0.4 0.1].

Finally, we consider the label Stopped and perform the last update:

𝐺_{𝑗_Stopped⁻} = max([0.3; 0.1; 0.1], 1 − [0.3; 0.2; 0.9]) = [0.7; 0.8; 0.1],   and thus 𝐺 = [0.3 0.7; 0.9 0.8; 0.4 0.1].
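The per-label update can be sketched as follows (again in PyTorch, as an assumed illustration rather than the authors’ code); it reproduces Example 3.2 with the Gödel t-conorm and only ever allocates 𝐷 × |Π| matrices. Swapping in the Łukasiewicz or Product updates from Table 1 amounts to replacing the element-wise maximum; in an actual training loop, the in-place column updates may need an out-of-place (autograd-friendly) variant.

```python
import torch

LABELS = ["Car", "Moving", "Stopped"]
# j+_A / j-_A: indices of the constraints in which each label occurs positively / negatively.
J_POS = {"Car": [0], "Moving": [], "Stopped": []}
J_NEG = {"Car": [], "Moving": [0, 1], "Stopped": [1]}
N_CONSTRAINTS = 2

p = torch.tensor([[0.1, 0.7, 0.3],
                  [0.9, 0.9, 0.2],
                  [0.4, 0.9, 0.9]])    # D = 3 anchors, |A| = 3 labels

def sparse_goal_matrix(p, j_pos, j_neg, n_constraints, t_conorm=torch.maximum):
    d = p.shape[0]
    g = torch.zeros(d, n_constraints)          # identity element of the disjunction
    for a, label in enumerate(LABELS):
        p_a = p[:, a].unsqueeze(1)             # D x 1 column of P for label A
        if j_pos[label]:                       # columns where A occurs positively
            g[:, j_pos[label]] = t_conorm(g[:, j_pos[label]], p_a)
        if j_neg[label]:                       # columns where A occurs negatively
            g[:, j_neg[label]] = t_conorm(g[:, j_neg[label]], 1.0 - p_a)
    return g

g = sparse_goal_matrix(p, J_POS, J_NEG, N_CONSTRAINTS)
print(g)                 # [[0.3, 0.7], [0.9, 0.8], [0.4, 0.1]], as in Example 3.2
print(1.0 - g.mean())    # L_logic(G) from Equation (2)
```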
4. Experimental Analysis

We tested our t-norm loss on the task of event detection for autonomous driving, where the goal is to assign to each detected bounding box in each video a subset of labels, including one agent label and a subset of the action and location labels. To this end, we used the recently introduced ROAD-R dataset for autonomous driving [14], which extends the ROAD dataset [17] with 243 manually annotated constraints, provided in disjunctive normal form, as shown in Table 5 in Appendix A. The dataset contains 22 videos, each ∼8 minutes long, annotated with tubelets/tubes that link sequences of bounding boxes in time. Each bounding box is annotated with a subset of the 41 available labels (listed in Table 4 in Appendix A).

We used the available training partition for training the models and, for reproducibility purposes, we report our results on the validation set, as the test set is not publicly available. For our experiments, we used the 3D-RetinaNet [17] detector with a ResNet50 [18] backbone, combined with a Random Connectivity Gated Recurrent Unit (RCGRU) [19] for temporal feature learning, which we chose based on its high performance in [14]. We set a weight of 10 for the t-norm loss term and used sequences of 8 frames as input.

Memory assessment. To assess the efficiency of our method for computing the t-norm loss with respect to how much GPU memory is allocated during training, we compared it to the standard implementation of the t-norm-based loss. We used a Titan RTX GPU with 24 GiB of RAM for training models for 50 iterations on ROAD-R, while using different numbers of constraints. Note that, while most of the constraints in ROAD-R contain only two labels, to allow for a fair comparison with the standard implementation, we selected constraints with different numbers of labels. For reference, in each iteration, the number of anchors 𝐷 was about 67K. Figure 2 shows that our method significantly reduces the memory costs, making it possible to use t-norm-based losses on our dataset, where the number of constraints and the number of data points per batch are both large. The standard implementation supports at most 40 constraints, i.e., 203 fewer than the 243 constraints in ROAD-R.

Figure 2: Comparison between the standard approach and ours in terms of GPU memory allocated when using different numbers of constraints. Each point on the continuous (resp., dashed) line corresponds to an actual observation (resp., estimate).

Table 2
Comparison between our models with different t-norm-based losses and the baseline models when varying the percentage of labelled data, as indicated on top of each column.

| | 10% | 20% | 50% | 75% | 100% |
| Baseline | 24.49 | 26.81 | 31.56 | 33.57 | 33.39 |
| Gödel | 26.34 (+1.85) | 30.76 (+3.95) | 32.71 (+1.15) | 34.39 (+0.82) | 32.16 (−1.23) |
| Łukasiewicz | 26.24 (+1.75) | 29.13 (+2.32) | 34.07 (+2.51) | 33.89 (+0.32) | 34.42 (+1.03) |
| Product | 24.53 (+0.04) | 26.96 (+0.15) | 31.66 (+0.10) | 34.06 (+0.49) | 33.34 (−0.05) |

Results. We first investigated how our memory-efficient t-norm-based loss performs in a fully-supervised scenario. To this end, we tested our method with three different t-norm losses (i.e., Gödel, Łukasiewicz, and Product) using 10%, 20%, 50%, 75%, and 100% of the available annotated ROAD-R data, training for 110, 70, 45, 30, and 30 epochs, respectively. We used a learning rate of 0.0041 for all models; for 100% labelled data, we additionally dropped it at epochs 18 and 25 by a factor of 10, as in [14]. We always computed the t-norm-based losses w.r.t. all of the 243 constraints from ROAD-R. For evaluation, we used the frame-wise mean average precision (f-mAP) metric, computed by taking the mean average precision at a fixed intersection-over-union threshold of 0.5 over each frame for each class and then averaging these per-class scores, and we report the result at the best epoch. Table 2 summarises the results, from which we first observe that the t-norm-based losses always improve the baseline performance, except when using 100% labelled data, where only the Łukasiewicz t-norm outperforms the baseline, this being in line with the result on 100% labelled data from our previous work¹ [14].
We also notice that, unlike the Gödel and Łukasiewicz t-norms, in most cases the Product t-norm brings negligible improvements, if any. Lastly, as expected, integrating background knowledge (via the t-norm loss) into the neural models helps more when little data are available. Indeed, our models yield an improvement of up to 1.85% and 3.95% when using 10% and 20% of the labelled training data, respectively.

The last observation led us to investigate whether the known result that background knowledge helps when unlabelled data are available (see, e.g., [20, 12]) also holds in our setting. To this end, we trained models where we applied the t-norm-based losses on 10% labelled and 10% unlabelled data. As shown in the second column of Table 3, neither the Gödel nor the Łukasiewicz t-norm was particularly helpful w.r.t. its fully-supervised performance. Surprisingly, the Product t-norm instead improved its performance, now surpassing the baseline. Since the added unlabelled data were not really helpful in two out of three cases, we analysed the losses at the beginning of the training and hypothesised that it would be beneficial to introduce a warm-up training phase, during which the t-norm loss would be inactive, and after which the unlabelled data would be added and the t-norm loss activated. As expected, the results in the last column of Table 3 consistently improve the previous performances of all t-norm-based losses, with the Product t-norm giving the highest result (27.24 f-mAP) w.r.t. the baseline (24.49 f-mAP).

Table 3
The best- and worst-performing models across fully-supervised models and models using unlabelled data, with or without warm-up. All models here used 10% labelled data for training. The models in the last two columns also used 10% unlabelled data during the training phase.

| | Fully-sup. | With unlabelled data, no warm-up | With unlabelled data, warm-up |
| Gödel | 26.34 | 26.38 | 26.76 |
| Łukasiewicz | 26.24 | 25.48 | 26.75 |
| Product | 24.53 | 25.79 | 27.24 |

¹ These results are in line with those obtained in our work [14], where an early and less optimised version of this implementation of the t-norm-based loss (nevertheless, still capable of handling all 243 constraints) had been deployed but not described.

5. Related Work

Neuro-symbolic works have proposed ways to incorporate the available background knowledge by embedding it into the networks’ topology and/or into the loss. In the former category, we find works that build a constrained layer on top of a neural network, such as the Coherent-by-Construction Network (CCN) [5], MultiplexNet [21], and the Semantic Probabilistic Layer (SPL) [6], all of these approaches being able to guarantee the satisfaction of hard constraints under certain conditions. With a similar goal, NESTER [22] proposes another end-to-end approach, imposing soft and hard constraints via a program applied to the outputs. Yet another recent work is Iterative Local Refinement (ILR) [23], which proposes an analytic way of integrating t-norm-based functions as neural network layers to refine the predictions in a differentiable manner. The other main line of work comprises methods that relax the constraints and integrate them into the neural networks’ loss, directly relating to our method. Early work on semantic-based regularisation (SBR) [24] for kernel machines led to the development of ways to map the constraints into the neural networks’ loss according to t-norm operations [7, 12, 25, 26]. However, among other issues highlighted in [27, 28], one problem occurring in these approaches is that they are syntax-dependent.
To address this, Semantic Loss [9] and DL2 [11] introduced syntax-independent loss functions. Another work, by Ahmed et al. [29], integrates logic into the standard entropy regularisation [30] term of the loss. The most recent work, by Li et al. [31], explores another problem with the previous approaches, namely, that models tend to settle on the easier solutions that satisfy the constraints, and proposes a way to force the model to fully explore the available knowledge. While many such works have proved to be particularly helpful when little annotated data are available [10, 11, 12, 32, 33, 20], they have been designed for small and/or synthetic datasets and would not scale to complex scenarios. For a complete survey on how to incorporate logical constraints in deep learning, see [34].

6. Conclusion

In this paper, we formalise an approach for computing a memory-efficient t-norm-based loss to equip neural networks with background knowledge expressed as logical constraints. We show that, unlike standard implementations of t-norm-based losses, our method can be applied in resource-intensive scenarios, such as event detection for autonomous driving. On the ROAD-R dataset, we test our t-norm-based loss with different amounts of labelled data, showing that the t-norms indeed help in boosting the performance of state-of-the-art models, and we also present an effective way to use the t-norms in the presence of unlabelled data.

For future work, we plan to conduct a study of how the loss can help in a semi-supervised scenario, with varying amounts of unlabelled data added during training, and using the warm-up phase that we found helpful here for reducing the runtime strain brought by integrating background knowledge into neural networks. Another research direction could be a study of the effect of using warm-up intervals of different lengths before introducing the t-norm loss and applying it to the unlabelled data.

Acknowledgments

Mihaela C. Stoian is supported by the EPSRC under the grant EP/T517811/1. This work was also supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1, by the AXA Research Fund, by the EPSRC grant EP/R013667/1, and by the EU TAILOR grant. We also acknowledge the use of the EPSRC-funded Tier 2 facility JADE (EP/P020275/1) and GPU computing support by Scan Computers International Ltd.

References

[1] S. Grigorescu, B. Trasnea, T. Cocias, G. Macesanu, A survey of deep learning techniques for autonomous driving, Journal of Field Robotics 37 (2020).
[2] Y. Huang, Y. Chen, Autonomous driving with deep learning: A survey of state-of-art technologies, CoRR abs/2006.06091 (2020).
[3] S. Hacohen, O. Medina, S. Shoval, Autonomous driving: A survey of technological gaps using Google Scholar and Web of Science trend analysis, IEEE Trans. Intell. Transp. Syst. 23 (2022).
[4] A. d’Avila Garcez, G. Zaverucha, The connectionist inductive learning and logic programming system, Applied Intell. 11 (1999).
[5] E. Giunchiglia, T. Lukasiewicz, Multi-label classification neural networks with hard logical constraints, JAIR 72 (2021).
[6] K. Ahmed, S. Teso, K. Chang, G. Van den Broeck, A. Vergari, Semantic probabilistic layers for neuro-symbolic learning, in: Proc. of NeurIPS, 2022.
[7] M. Diligenti, S. Roychowdhury, M. Gori, Integrating prior knowledge into deep learning, in: Proc. of ICMLA, 2017.
[8] I. Donadello, L. Serafini, A. d’Avila Garcez, Logic tensor networks for semantic image interpretation, in: Proc. of IJCAI, 2017.
[9] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Van den Broeck, A semantic loss function for deep learning with symbolic knowledge, in: Proc. of ICML, 2018.
[10] T. Li, V. Srikumar, Augmenting neural networks with first-order logic, in: Proc. of ACL, 2019.
[11] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, M. Vechev, DL2: Training and querying neural networks with logic, in: Proc. of ICML, 2019.
[12] G. Marra, F. Giannini, M. Diligenti, M. Gori, LYRICS: A general interface layer to integrate logic inference and deep learning, in: Proc. of ECML-PKDD, 2019.
[13] M. Diligenti, M. Gori, C. Sacca, Semantic-based regularization for learning and inference, Artif. Intell. 244 (2017).
[14] E. Giunchiglia, M. C. Stoian, S. Khan, F. Cuzzolin, T. Lukasiewicz, ROAD-R: The autonomous driving dataset with logical requirements, Machine Learning (2023).
[15] G. Metcalfe, Fundamentals of fuzzy logics, https://www.logic.at/tbilisi05/Metcalfe-notes.pdf, 2005.
[16] NVIDIA, NVIDIA AI A100, https://www.nvidia.com/en-gb/data-center/a100/, 2023. Accessed: 17/03/2023.
[17] G. Singh, S. Akrigg, M. D. Maio, V. Fontana, R. J. Alitappeh, S. Khan, S. Saha, K. Jeddisaravi, F. Yousefi, J. Culley, T. Nicholson, J. Omokeowa, S. Grazioso, A. Bradley, G. D. Gironimo, F. Cuzzolin, ROAD: The road event awareness dataset for autonomous driving, IEEE TPAMI 45 (2023).
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. of CVPR, 2016.
[19] Y. Hua, Z. Zhao, Z. Liu, X. Chen, R. Li, H. Zhang, Traffic prediction based on random connectivity in deep learning with long short-term memory, in: Proc. of VTC-Fall, 2018.
[20] R. Stewart, S. Ermon, Label-free supervision of neural networks with physics and domain knowledge, in: Proc. of AAAI, 2017.
[21] N. Hoernle, R. Karampatsis, V. Belle, K. Gal, MultiplexNet: Towards fully satisfied logical constraints in neural networks, in: Proc. of AAAI, 2022.
[22] P. Dragone, S. Teso, A. Passerini, Neuro-symbolic constraint programming for structured prediction, in: Proc. of IJCLR-NeSy, 2021.
[23] A. Daniele, E. van Krieken, L. Serafini, F. van Harmelen, Refining neural network predictions using background knowledge, Machine Learning (2023).
[24] M. Diligenti, M. Gori, M. Maggini, L. Rigutini, Bridging logic and kernel machines, Machine Learning 86 (2012).
[25] L. Serafini, A. d’Avila Garcez, Logic tensor networks: Deep learning and logical reasoning from data and knowledge, in: Proc. of NeSy-HLAI, 2016.
[26] S. Badreddine, A. d’Avila Garcez, L. Serafini, M. Spranger, Logic tensor networks, Artif. Intell. 303 (2022).
[27] E. van Krieken, E. Acar, F. van Harmelen, Analyzing differentiable fuzzy implications, in: Proc. of KR, 2020.
[28] E. van Krieken, E. Acar, F. van Harmelen, Analyzing differentiable fuzzy logic operators, Artif. Intell. 302 (2022).
[29] K. Ahmed, E. Wang, K. Chang, G. Van den Broeck, Neuro-symbolic entropy regularization, in: Proc. of UAI, 2022.
[30] Y. Grandvalet, Y. Bengio, Semi-supervised learning by entropy minimization, in: Proc. of NeurIPS, 2004.
[31] Z. Li, Z. Liu, Y. Yao, J. Xu, T. Chen, X. Ma, J. Lü, Learning with logical constraints but without shortcut satisfaction, in: Proc. of ICLR, 2023.
[32] Z. Hu, X. Ma, Z. Liu, E. Hovy, E. Xing, Harnessing deep neural networks with logic rules, in: Proc. of ACL, 2016.
[33] Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing, Deep neural networks with massive learned knowledge, in: Proc. of EMNLP, 2016.
[34] E. Giunchiglia, M. C. Stoian, T. Lukasiewicz, Deep learning with logical constraints, in: Proc. of IJCAI, 2022.
A. Appendix: the ROAD-R dataset

Table 4
The labels (with their IDs) from the ROAD-R dataset used in our experiments.

Agent: 0 Pedestrian, 1 Car, 2 Cyclist, 3 Motorbike, 4 Medium vehicle, 5 Large vehicle, 6 Bus, 7 Emergency vehicle, 8 AV traffic light, 9 Other traffic light.
Action: 10 Move away, 11 Move towards, 12 Move, 13 Brake, 14 Stop, 15 Indicating left, 16 Indicating right, 17 Hazard lights on, 18 Turn left, 19 Turn right, 20 Overtake, 21 Wait to cross, 22 Cross road from left, 23 Cross road from right, 24 Crossing, 25 Push object, 26 Red traffic light, 27 Amber traffic light, 28 Green traffic light.
Location: 29 AV lane, 30 Outgoing lane, 31 Outgoing cycle lane, 32 Incoming lane, 33 Incoming cycle lane, 34 Pavement, 35 Left pavement, 36 Right pavement, 37 Junction, 38 Crossing location, 39 Bus stop, 40 Parking.

Table 5
Examples of logical constraints and their natural language explanations from Tables 9–11 of [14].

| Logical constraint | Description in natural language |
| {not Mobike, not Bus} | A motorbike cannot be a bus |
| {not TL, not TurLft} | A traffic light cannot turn left |
| {not Wait2X, not Ovtak} | An agent cannot wait to cross and overtake |
| {Ped, not PushObj} | If an agent pushes an object, then it is a pedestrian |
| {PushObj, not Ped, MovAway, MovTow, Mov, Stop, TurLft, TurRht, Wait2X, XingFmLft, XingFmRht, Xing} | A pedestrian can only push objects, move away, etc. |
| {VehLane, OutgoLane, OutgoCycLane, IncomLane, IncomCycLane, Pav, LftPav, RhtPav, Jun, XingLoc, BusStop, Parking, TL, OthTL} | Every agent but traffic lights must have a position |