<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>OVERLAY</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Integrating L0 regularization into Multi-layer Logical Perceptron for Interpretable Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gonzalo Jaimovitch-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Bergamin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Aiolli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Confalonieri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Internacional Menéndez Pelayo</institution>
          ,
          <addr-line>Santander</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua, Department of Mathematics</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>6</volume>
      <fpage>28</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Deep neural networks are widely used in practical applications of AI; however, their inner structure and complexity make them generally not easily interpretable. Model transparency and interpretability are key requirements in scenarios where high performance alone is not enough to adopt the proposed solution. In this work, we adapt a differentiable approximation of ℓ0 regularization to a logic-based neural network, the Multi-layer Logical Perceptron (MLLP), and we evaluate its effectiveness in reducing the complexity of its interpretable discrete version, the Concept Rule Set (CRS), while preserving its performance. Results are compared to alternative heuristics, such as Random Binarization of the network weights, to assess whether better results can be achieved with a less noisy technique that sparsifies the network based on the loss function rather than a random distribution.</p>
      </abstract>
      <kwd-group>
        <kwd>Logical Perceptron</kwd>
        <kwd>Propositional Network</kwd>
        <kwd>Interpretable Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Advances in deep learning have promoted the training of models with a continuously increasing number of parameters in search of higher performance and new capabilities, reaching the order of billions in some cases. However, some of these solutions became black-box models incapable of explaining the reasoning behind their decisions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Some scenarios with critical use cases, such as medicine, law, or finance, demand a higher level of explainability. Explainable AI techniques can help both reduce model complexity and improve the interpretability of the solutions [2].
      </p>
      <p>There are two important aspects of the interpretability of neural networks. The first, which is a key notion of transparency as explained in [3], is that each part of a model should have an intuitive explanation. In the case of neural networks, the use of activation functions in the neurons makes the transformation of each input infeasible for humans to understand. A second aspect is the existing trade-off among faithfulness, understandability, and model performance. Most of the methods in the literature compromise at least one of those requirements, which might not be suitable for highly sensitive scenarios [4].</p>
      <p>Typically, when explaining a black-box model using a surrogate symbolic model (e.g., a rule set or a decision tree), accuracy is often sacrificed for transparency. To address this, Wang et al. [5] proposed a hierarchical rule-based model obtained using a Multi-layer Logical Perceptron network (MLLP), where a rule-based model is learned through backpropagation and later discretized into rule-set form. A key challenge for rule-based models is finding an easily interpretable, concise structure. In this work, we claim that a sparser network naturally leads to simpler rules. Thus, to achieve higher interpretability, we promote network sparsity through the introduction of a regularization term into the neural network’s loss function. Specifically, we apply a differentiable approximation of ℓ0 regularization [6] and study its effectiveness in aligning the trained continuous model with its discrete, interpretable version.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Rule-based models (often presented as decision trees, rule lists, and rule sets) are widely used in the field due to their transparent inner structure, in contrast to approaches such as Deep Neural Networks (DNN), which are considered black-box models that are hard to interpret [7]. Multiple works have explored using backpropagation and/or multilayer structures to learn rule-based models and improve their performance while retaining their transparency [8, 9, 10]. Using differentiable proxies of logic operations is a recently studied topic, due to its easy integration with gradient-based optimization [11].</p>
      <p>Model compression for neural networks is a relevant field of study in deep learning, since it has been shown that models can be over-parameterized, leading to unnecessary computational resource usage, reduced efficiency, and lower generalization capabilities [2]. Finally, Binary Neural Networks (BNN) are a special type of neural network whose weights are binary. Courbariaux and Bengio [12] use the straight-through estimator to allow gradient descent over non-differentiable functions. Even though they reduce computational complexity and improve transparency, these models are still not considered interpretable due to the complex interactions of non-linear activation functions.</p>
      <p>The aim of regularization within machine learning is to reduce the generalization error without increasing the training error [13], i.e., to prevent a model from overfitting (which is especially an issue in neural networks due to their generally complex and deep structure), ensuring its performance is aligned on both the training data and new, unseen data. Some popular regularization techniques are ℓ2 norm regularization (also known as weight decay), ℓ1 norm regularization, and Dropout regularization.</p>
      <p>ℓ0 regularization imposes a penalty on the objective function directly through the number of non-zero parameters, strongly promoting sparsity and improving generalization. Furthermore, it can help speed up inference and training, as the weights that become zero remove some paths from the computational graph. ℓ0 regularization is an NP-hard problem from the computational complexity perspective [14], which is considered difficult to solve. In [6], an approximate approach to ℓ0 regularization is proposed, making the problem tractable, with experimental results that demonstrate its effectiveness in reducing the size of neural networks such as AlexNet. We argue that this regularization is a great fit for logic-based networks, since sparsity is, in our opinion, a key ingredient for building interpretable classifiers.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <sec id="sec-3-1">
        <title>3.1. Notation</title>
        <p>In this section, we introduce the main notation and concepts used throughout the paper. 𝒜 is a set of binary features and 𝒞 is a set of class labels. 𝒟 = {(x_1, y_1), . . . , (x_n, y_n)} denotes a training dataset with n instances, where x_i ∈ {0, 1}^|𝒜| is a binary feature vector and y_i ∈ {0, 1}^|𝒞| is a one-hot class label vector.</p>
        <p>ℛ^(l) and 𝒮^(l) will denote the l-th layer of the network when l is odd and even, respectively. The input layer (layer 0) consists of |𝒜| input neurons, and the output layer consists of |𝒞| output neurons. Using n_l to denote the number of nodes in the l-th layer, W^(l) is an n_l × n_{l−1} matrix containing the weights between the l-th layer and the (l−1)-th layer. Each element of W^(l) is referred to as W^(l)_{i,j}.</p>
        <p>A rule r is a conjunction of one or more Boolean variables, r = a_1 ∧ . . . ∧ a_k. A rule set s is a disjunction of one or more rules, s = r_1 ∨ . . . ∨ r_m, i.e., s is a Disjunctive Normal Form (DNF) clause.</p>
        <p>Finally, σ(x) denotes the sigmoid function, defined as σ(x) = 1 / (1 + e^(−x)).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Concept Rule Set (CRS)</title>
        <p>A Concept Rule Set (CRS) is a multi-level hierarchical rule-based model proposed in [5]. A CRS is an instance of a multi-layer logical perceptron network (see Section 3.3) where each weight in the network is binary, i.e., W^(l)_{i,j} ∈ {0, 1}. W^(l)_{i,j} = 1 indicates that there exists an edge connecting the i-th node in the l-th layer to the j-th node in the (l−1)-th layer; otherwise, W^(l)_{i,j} = 0. Following the notation adopted in [5], r_i^(l) and s_i^(l) denote the i-th node in layer ℛ^(l) and 𝒮^(l), respectively. These nodes are formally defined as follows:
r_i^(l) = ⋀_{j : W^(l)_{i,j} = 1} s_j^(l−1),    s_i^(l+1) = ⋁_{j : W^(l+1)_{i,j} = 1} r_j^(l)    (1)</p>
        <p>Given the above, node r_i^(l) corresponds to a rule, while node s_i^(l) corresponds to a rule set. In a CRS with L levels, each ℛ^(l) (l ∈ {1, 3, . . . , 2L − 1}) is known as a conjunction layer and each 𝒮^(l) (l ∈ {2, 4, . . . , 2L}) is known as a disjunction layer. A CRS consists of 2L + 1 layers, organized as one input layer followed by L pairs of conjunction and disjunction layers.</p>
        <p>By providing each input instance as the values of the input layer, once trained, a CRS model works as a classifier ℱ : {0, 1}^|𝒜| → {0, 1}^|𝒞|. The model outputs the values of the nodes in the last disjunction layer 𝒮^(2L), where s_i^(2L) = 1 indicates that ℱ classifies the input instance as the i-th class label. The representation learned by the l-th layer of the CRS is a binary vector h^(l):
h^(l) = [r_1^(l), r_2^(l), . . . , r_{n_l}^(l)]^⊤ for l ∈ {1, 3, . . . , 2L − 1};    h^(l) = [s_1^(l), s_2^(l), . . . , s_{n_l}^(l)]^⊤ for l ∈ {2, 4, . . . , 2L}
The value of h_i^(l) is equal to the value of the i-th node in the l-th layer, which corresponds to a rule or a rule set. This discrete and explicit representation makes the model transparent.</p>
        <p>We refer to the complexity of a CRS model as the total length of all rules. We use |r_i^(l)| and |s_i^(l)| to refer to the number of nodes contained in a rule and a rule set, respectively. Then, the complexity of a CRS model (c_ℱ) is defined as follows:
c_ℱ = ∑_{l=1}^{L} ( ∑_{i=1}^{n_{2l−1}} |r_i^(2l−1)| + ∑_{i=1}^{n_{2l}} |s_i^(2l)| )    (2)</p>
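        <p>Since the CRS weights are binary, the complexity measure of Eq. (2) can be computed directly from the weight matrices: the length of each rule or rule set is simply the number of 1-entries in the corresponding row. A minimal sketch in Python (layer shapes are illustrative, not taken from the paper):</p>
        <preformat>
```python
import numpy as np

def crs_complexity(weight_matrices):
    # Total rule length of a CRS (Eq. 2): each binary matrix W has shape
    # (n_l, n_{l-1}); row i has one 1-entry per literal of node i, so the
    # complexity is the total number of 1-entries over all layers.
    return int(sum(W.sum() for W in weight_matrices))

# Toy CRS: one conjunction layer followed by one disjunction layer.
W1 = np.array([[1, 0, 1],   # r1 = a1 AND a3 (length 2)
               [0, 1, 1]])  # r2 = a2 AND a3 (length 2)
W2 = np.array([[1, 1]])     # s1 = r1 OR r2  (length 2)
print(crs_complexity([W1, W2]))  # -> 6
```
        </preformat>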
        <p>Wang et al. [5] propose the use of an MLLP model (introduced in the next section), a neural network architecture, to search for the discrete solution of the CRS model in the continuous space by using gradient descent over the continuous weights of the MLLP. Subsequently, the weights of the network are binarized to transform the MLLP model into a CRS model, resulting in a classifier that ensures both good performance and transparency. For the discrete CRS extraction, a simple method of binarizing the weights using a threshold is applied.</p>
        <p>Conj(h, ŵ) = ∏_{i=1}^{n} F_c(h_i, ŵ_i),    F_c(h, w) = 1 − w(1 − h)    (3)</p>
        <p>Disj(h, ŵ) = 1 − ∏_{i=1}^{n} (1 − F_d(h_i, ŵ_i)),    F_d(h, w) = h · w    (4)</p>
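        <p>As a concrete illustration (a sketch, not the authors' implementation), Eqs. (3) and (4) can be written in a few lines of NumPy; with binary inputs and weights they reduce to an exact AND/OR over the selected inputs:</p>
        <preformat>
```python
import numpy as np

def conj(h, w):
    # Eq. (3): Fc(h_i, w_i) = 1 - w_i * (1 - h_i). A weight w_i = 0 makes
    # its factor equal to 1, so that input is ignored by the product.
    return np.prod(1.0 - w * (1.0 - h))

def disj(h, w):
    # Eq. (4): De Morgan's law over Fd(h_i, w_i) = h_i * w_i; a weight
    # w_i = 0 contributes 0 to the disjunction.
    return 1.0 - np.prod(1.0 - h * w)

h = np.array([1.0, 0.0, 1.0])
w = np.array([1.0, 0.0, 1.0])  # select inputs 1 and 3
print(conj(h, w))  # 1.0 = (h1 AND h3)
print(disj(h, w))  # 1.0 = (h1 OR h3)
```
        </preformat>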
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multi-layer Logical Perceptron</title>
        <p>The Multi-layer Logical Perceptron (MLLP) is a neural network architecture proposed in [5], designed so that each of its neurons corresponds to one node in the CRS. The main difference from a fully connected Multi-layer Perceptron is the specific design of the activation functions, which aim to replicate the behavior of the conjunction and disjunction logical operations. Another important difference is the presence of a selection mechanism that is used to attend to a subset of the inputs.</p>
        <p>Given the n-dimensional vectors h (a layer input vector) and ŵ (the weights of a given neuron), with ŵ_i ∈ [0, 1] denoting the weight of the connection between the input h_i and the neuron, the conjunction and disjunction functions are given by Eqs. (3) and (4).</p>
        <p>In Conj(h, ŵ), the conjunction operation is obtained by multiplying many values between 0 and 1 together. For Disj(h, ŵ), the disjunction is computed with a similar operation, applying De Morgan's law by negating both the inputs and the outputs of the function. To implement this negation, the mapping x ↦ 1 − x is used. F_c and F_d act as a selection mechanism: by turning a weight w to zero, F_c can learn to output 1 regardless of its input, leaving the subsequent conjunction operation unaffected. A similar mechanism is implemented for F_d, which outputs zero instead. Notice that when h and ŵ are both binary vectors, Eqs. 3 and 4 reduce to the conjunction and disjunction of a subset of the elements in h. The conjunction and disjunction functions are applied to the neurons in the l-th layer as follows:
r̂_i^(l) = Conj(ŝ^(l−1), ŵ_i^(l)) for l ∈ {1, 3, . . . , 2L − 1},    ŝ_i^(l) = Disj(r̂^(l−1), ŵ_i^(l)) for l ∈ {2, 4, . . . , 2L}    (5)
where r̂_i^(l) and ŝ_i^(l) are neurons in the l-th layer of the MLLP, and Ŵ^(l) is an n_l × n_{l−1} weight matrix.</p>
        <p>The weights of the MLLP need to be constrained in the range [0, 1] to ensure the proper functioning of the conjunction and disjunction activation functions. To this end, the weights of a given layer l are constrained using a clip function: Clip(Ŵ^(l)_{i,j}) = max(0, min(1, Ŵ^(l)_{i,j})). Given the encoding above, the MLLP network can function identically to the corresponding CRS when the weights are set to the same discrete values, while still remaining differentiable.</p>
        <p>Let ℱ̂ be the MLLP model and Ŵ be the weights to be learned by the network. The MLLP loss function is given by:
ℒ(Ŵ) = (1/n) ∑_{i=1}^{n} MSE(ℱ̂(x_i, Ŵ), y_i) + λ ∑_{l=1}^{2L} ‖Ŵ^(l)‖₂²    (6)
The second term on the rhs of Eq. 6 is the ℓ2 regularization term. This term is included to shrink the MLLP weights, and it induces a simpler CRS model.</p>
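        <p>The loss of Eq. (6) can be sketched as follows in NumPy (in practice the MLLP is trained with automatic differentiation; the names here are illustrative):</p>
        <preformat>
```python
import numpy as np

def mllp_loss(preds, targets, weights, lam):
    # Eq. (6): mean squared error over the batch plus an L2 penalty
    # (squared Frobenius norm) summed over all layer weight matrices.
    mse = np.mean((preds - targets) ** 2)
    l2 = sum(np.sum(W ** 2) for W in weights)
    return mse + lam * l2

preds = np.array([0.9, 0.1])
targets = np.array([1.0, 0.0])
print(mllp_loss(preds, targets, [np.array([[0.5]])], lam=1e-3))
```
        </preformat>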
        <p>After training the continuous model (MLLP), a discrete and interpretable model (CRS) can be extracted. Its behavior is not guaranteed to follow the continuous version, and the authors observe a drop in performance. This problem arises from the application of the conjunction and disjunction operations over continuous weights, as the MLLP weights can be anywhere in the range [0, 1]. To mitigate this issue, the authors propose a training method based on Random Binarization (RB).</p>
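        <p>The extraction step can be sketched as simple thresholding of the trained continuous weights (the threshold value below is illustrative):</p>
        <preformat>
```python
import numpy as np

def extract_crs(mllp_weights, threshold=0.5):
    # Binarize each continuous weight matrix to obtain the discrete CRS:
    # weights above the threshold become edges (1), the rest disappear (0).
    return [np.where(W > threshold, 1, 0) for W in mllp_weights]

W = np.array([[0.9, 0.1],
              [0.4, 0.7]])
print(extract_crs([W])[0])  # [[1 0]
                            #  [0 1]]
```
        </preformat>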
        <p>The RB method selects some of the weights during training using a random binarization rate in [0, 1], which is the probability of a weight being binarized. The binarized weights are frozen, and the forward (predict) and backpropagation passes are performed. After a given number of steps, the binarized weights are restored and a new set of random weights is binarized (in the experiments, the RB operation is applied every epoch). Therefore, during training, the MLLP model behaves more like the CRS, which ensures a closer performance between the MLLP and CRS models after the binarization of the continuous solution. W̃_i^(l) denotes the weights of the i-th neuron in the l-th layer of the MLLP after RB; these weights replace Ŵ^(l) in Eq. 5. The behavior of the RB method is similar to that of Dropout regularization [15], which helps address overfitting.</p>
        <p>3.4. ℓ0 Regularization Approximation</p>
        <p>In this work, we focus on the ℓ0 regularization approximation proposed in [6]. The authors propose a set of gates that determine whether a parameter or group of parameters of the network (e.g., all the weights associated with a given input neuron of a layer) is active (i.e., value greater than 0) or inactive (i.e., value equal to 0). The values z of those gates can be drawn from a distribution such as the binary Bernoulli. However, this distribution needs to be smoothed in order to be differentiable. Given s as a continuous random variable with a distribution q(s) and parameters φ, rectified to be in the interval [0, 1], the probability of the gate being active can be calculated using its cumulative distribution function (CDF) so that:
s ∼ q(s|φ),    z = min(1, max(0, s)),    q(z ≠ 0|φ) = 1 − Q(s ≤ 0|φ) = Q(s > 0|φ)    (7)</p>
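        <p>The RB schedule described above can be sketched as follows (a simplified illustration in which "binarizing" a weight means rounding it to 0 or 1; the helper names are ours, not the authors'):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def random_binarization(W, rate):
    # Binarize a random subset of the weights: each weight is selected
    # with probability `rate`. The caller keeps the original W so the
    # selected entries can be restored before the next epoch's draw.
    mask = np.less(rng.random(W.shape), rate)
    return np.where(mask, np.rint(W), W), mask

W = np.array([[0.9, 0.2],
              [0.6, 0.4]])
W_rb, mask = random_binarization(W, rate=1.0)
print(W_rb)  # with rate 1.0, every weight is rounded to 0 or 1
```
        </preformat>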
        <p>The distribution q selected in this work is the hard concrete distribution, obtained by stretching a binary concrete distribution [16], closely tied to the Bernoulli distribution, and rectifying each sampled value s with a hard-sigmoid to constrain it to the range [0, 1]. The parameters φ of this distribution are log α (the location) and β (the temperature, which controls the degree of recovery of the original Bernoulli distribution). γ and ζ determine the stretching of the original distribution. The hard concrete distribution can be formalized as follows:
u ∼ U(0, 1),    s = σ((log u − log(1 − u) + log α) / β),    s̄ = s(ζ − γ) + γ,    z = min(1, max(0, s̄))    (8)
The nature of the binary concrete distribution allows for the reparametrization trick [17], which prevents the introduction of randomness into the gradient descent process when sampling from the learnt distribution (as expressed in Eq. 8). On another note, the authors propose group sparsity instead of parameter sparsity as a means to achieve network speedups; in their approach, group sparsity is presented as input-neuron sparsity in the case of fully connected layers. The resulting expected ℓ0 penalty for the proposed grouped sparsity can be expressed as:
ℛ̂(Ŵ, φ) = ∑_{g=1}^{|G|} |g| (1 − Q(s_g ≤ 0 | φ_g))    (9)
where |G| corresponds to the number of groups and |g| corresponds to the number of parameters of group g. The probability of a gate z being active under the hard concrete distribution is:
1 − Q(s ≤ 0 | φ) = σ(log α − β log(−γ / ζ))    (10)</p>
        <p>During test, the deterministic values of the gates are obtained with the following estimator:
ẑ = min(1, max(0, σ(log α)(ζ − γ) + γ))    (11)</p>
        <p>4. Integrating ℓ0 Regularization into MLLP</p>
        <p>The original implementation of the MLLP architecture employs the RB method to address the misalignment between the continuous and discrete solutions, while also addressing overfitting, similarly to Dropout regularization. Although the CRS model learned through the MLLP achieves high transparency and performance, the resulting solutions can still be complex due to the large number of rules involved.</p>
        <p>ℓ0 regularization turns out to be an interesting method to increase the sparsity of a model, which, in turn, directly impacts the interpretability of the CRS model. Integrating ℓ0 has a number of advantages:</p>
        <p>• Direct Sparsity Induction: ℓ0 regularization directly promotes sparse solutions. This can also act as a feature selection process: input neurons that are not relevant can be ignored. Dropout regularization, in contrast, mainly improves robustness, and its solutions are not necessarily sparse.</p>
        <p>• Interpretability: with ℓ0 regularization, the result is a sparse and therefore less complex model. The resulting model is inherently more interpretable.</p>
        <p>• Training Efficiency: the training of sparse models can be made more efficient, since all the weights that become 0 are no longer part of the computational graph.</p>
        <p>To include ℓ0 regularization in the MLLP architecture, multiple changes need to be considered.</p>
        <p>[Figure 1: The forward pass. The weights Ŵ are multiplied (⊗) by the gate values (z during training, ẑ during test) and, during training, Random Binarization is applied before the activation a.]</p>
        <p>To start, the MLLP network needs to include the locations log α of a hard concrete distribution [16] for each gate as trainable parameters. In our implementation¹, we follow the same approach as [6]², where each gate represents the activation or deactivation of an input neuron, coined as group sparsity. These values are initialized by sampling a normal distribution determined by a dropout rate. Furthermore, we define a dropout rate for the input layer separate from the dropout rate used for the rest of the layers.</p>
        <p>When performing the continuous forward operation, the behavior differs between training and testing. During training, the value z of each gate is sampled from a hard concrete distribution, using the reparameterization trick by sampling a uniform distribution. Then, the weights are multiplied by the value of their corresponding ℓ0 gate, and randomly binarized (when the random binarization rate is greater than 0) before applying the conjunction or disjunction activation functions. During testing, the value of ẑ is obtained with a deterministic operation using Eq. 12. We summarize in Fig. 1 the operations of the forward pass.</p>
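        <p>Combining the gates with the weights in the forward pass can be sketched as follows (group sparsity: one gate per input neuron, i.e., one gate per column of the n_l × n_{l−1} weight matrix; the helper names are illustrative):</p>
        <preformat>
```python
import numpy as np

def gated_weights(W, gates):
    # Multiply each column of W (all weights leaving one input neuron)
    # by that neuron's gate value. `gates` holds the sampled z (Eq. 8)
    # during training, or the deterministic estimate (Eq. 12) at test time.
    return W * gates[np.newaxis, :]

W = np.ones((2, 3))
gates = np.array([1.0, 0.0, 0.5])  # the second input neuron is pruned
print(gated_weights(W, gates))
```
        </preformat>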
        <p>ẑ = min(1, max(0, σ(log α)(ζ − γ) + γ))    (12)</p>
        <p>A new threshold of 0.5 is included to binarize the ẑ value of each ℓ0 gate when applying the (binarized) forward operation to the CRS model. The regularized loss function is defined as follows:
ℒ′(Ŵ, φ) = (1/n) ∑_{i=1}^{n} MSE(ℱ̂(x_i, Ŵ × z), y_i) + λ ℒ_0(φ) + λ′ ℒ_2(Ŵ)    (13)</p>
        <p>Finally, since the original implementation of ℓ0 regularization targets classical neural networks, some minor adaptations were also needed. First, since the weights in the MLLP network are constrained between 0 and 1, the Kaiming initialization strategy outlined in [6] would not make sense; therefore, the initialization is left as in [5], using Uniform(0, 0.1). Second, MLLP networks do not have any bias term, so the bias was considered neither in the neurons nor in the locations log α of the ℓ0 gates.</p>
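        <p>Eq. (13) can be sketched by combining the data term with the two penalties (the names and the shape of the penalty terms are illustrative; `gate_probs` stands for the per-gate active probabilities of Eq. 10):</p>
        <preformat>
```python
import numpy as np

def regularized_loss(preds, targets, weights, gate_probs, lam_l0, lam_l2):
    # Eq. (13): MSE data term, plus lambda * expected L0 penalty (the sum
    # of the probabilities that each gate is active), plus lambda' * L2.
    mse = np.mean((preds - targets) ** 2)
    l0 = sum(np.sum(p) for p in gate_probs)
    l2 = sum(np.sum(W ** 2) for W in weights)
    return mse + lam_l0 * l0 + lam_l2 * l2

loss = regularized_loss(
    preds=np.array([1.0]), targets=np.array([1.0]),
    weights=[np.array([[1.0]])], gate_probs=[np.array([0.5, 0.5])],
    lam_l0=1e-3, lam_l2=1e-8)
print(loss)  # 1e-3 * 1.0 + 1e-8 * 1.0
```
        </preformat>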
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Evaluation</title>
      <p>In this section, we evaluate the proposed solution following these steps:
1. Analysis of the sparsity achieved using ℓ0 regularization adapted to the MLLP framework.
2. Comparison between the MLLP baseline and the framework adapted to use ℓ0 regularization, in terms of CRS and MLLP predictions, as well as the complexity of the rules in the CRS model.</p>
      <p>Dataset. We consider the connect-4 dataset as our benchmark. This dataset is relevant for rule-learning models [18] due to its considerable size (67557 instances) and its difficulty: despite being a deterministic game, i.e., the data contains no noise, classifiers struggle to achieve high F1-scores.</p>
      <p>¹ https://github.com/gonzalojaimovitch/mllp_l0.git</p>
      <p>² https://github.com/AMLab-Amsterdam/L0_regularization</p>
      <p>[Figure: F1 Score vs. Random Binarization Rate]</p>
      <p>Experimental Settings. The hyperparameter settings are taken from the experimental section of [5], namely: batch size of 128, 400 epochs, learning rate of 5 × 10⁻³, learning rate decay of 0.75 every 100 epochs, weight decay (λ′) of 10⁻⁸, a random binarization rate of 0.5, and an update of the randomly binarized weights every epoch (when applied). Regarding the data, 80% of the training data is used for training and 20% for validation when performing hyperparameter search. Furthermore, 5-fold cross-validation is adopted for more representative results. For analysing performance, the macro F1 score is the chosen metric, due to dataset imbalance. The ℓ0 settings are: λ and initial dropout rates of 0.001, a gate binarization threshold of 0.5, and a value of β fixed to 2/3 (as recommended by [16]).</p>
      <p>Sparsity analysis. Fig. 2 shows the evolution of the active weights of the MLLP models during training on the connect-4 dataset, for both the replicated results and the results including ℓ0 regularization. An active weight is a weight whose value is greater than 0. In the case of the models including ℓ0 regularization, the active weights are those for which the product of the weight with its corresponding ℓ0 gate is greater than 0. As expected, the models including ℓ0 regularization yield solutions that are sparser than those obtained by the replicated MLLP models. Furthermore, the models with a greater number of weights are sharply sparsified compared to the simpler models in terms of active weights. This is most evident in Fig. 2 (right), which shows how the active weights are drastically reduced for the model with the largest number of weights: the MLLP model with 3 hidden layers of 256 nodes each.</p>
      <p>MLLP performance comparison. Fig. 3 shows the average F1 scores and standard errors for both the replicated results from [5] and the results for CRS models trained with ℓ0 regularization. For the MLLP models, similar conclusions can be drawn as for the CRS results. Results are similar or considerably improved with the inclusion of ℓ0 regularization for random binarization rates of both 0 and 0.5, with special emphasis on the deep models at rate 0.5. However, when increasing the rate to 0.9, scores without ℓ0 regularization are generally higher, with some exceptions for the deeper models. In both cases, the greater the random binarization rate, the worse the performance of the MLLP models, with a similar trend of deterioration.</p>
      <p>[Figure: Total Length of Rules vs. Random Binarization Rate]</p>
      <p>Rule complexity. Fig. 4 shows the complexity measure (Eq. 2), i.e., the total length of rules, of the different CRS models, for both the replicated results from [5] and the results of CRS models for which ℓ0 regularization was included during the training of their respective MLLP models. When RB is not applied (rate 0), the total number of rules is quite similar in both cases, with the exception of the largest model, whose complexity is considerably reduced. In the case of rate 0.5, the complexity of the model with 3 hidden layers of 128 nodes each increases with the inclusion of ℓ0 regularization, which might be related to the drastic increase in performance it promotes. For the largest model, there is a great increase in performance together with a considerable reduction in complexity. Lastly, for rate 0.9, ℓ0 regularization helps reduce the complexity of almost every model, with special emphasis on the largest model, for which the total length of rules is reduced by approximately a factor of eight. This considerable reduction might be related to the slight decrease in performance compared to the models without ℓ0 regularization.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions and Future Works</title>
      <p>In this work, we proposed an adaptation of a computationally complex regularization technique to the MLLP framework, a logical network that acts as the continuous (differentiable) version of a multi-level hierarchical rule-based model. We enhanced the interpretability of a rule-based model by reducing its complexity through a model compression technique, in the form of a regularizer exploited during optimization. We showed that ℓ0 regularization can effectively reduce model complexity without affecting performance and, in some cases, even enhance it.</p>
      <p>As future work, we plan to extend the datasets considered and to introduce the capability of including logical constraints in the network. Specification of data requirements through explicit background knowledge could potentially help the network meet desirable properties such as safety, consistency, and fairness.</p>
      <p>References</p>
      <p>[2] M. Sabih, F. Hannig, J. Teich, Utilizing explainable AI for quantization and pruning of deep neural networks, arXiv preprint arXiv:2008.09072 (2020).
[3] Z. C. Lipton, The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery, Queue 16 (2018) 31–57.
[4] V. Swamy, S. Montariol, J. Blackwell, J. A. Frej, M. Jaggi, T. Käser, InterpretCC: Intrinsic user-centric interpretability through global mixture of experts (2024).
[5] Z. Wang, W. Zhang, N. Liu, J. Wang, Transparent classification with multilayer logical perceptrons and random binarization, in: The 34th AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 6331–6339. URL: https://doi.org/10.1609/aaai.v34i04.6102. doi:10.1609/aaai.v34i04.6102.
[6] C. Louizos, M. Welling, D. P. Kingma, Learning sparse neural networks through L0 regularization, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018. URL: https://openreview.net/forum?id=H1Y8hhg0b.
[7] R. Confalonieri, L. Coba, B. Wagner, T. R. Besold, A historical perspective of explainable artificial intelligence, WIREs Data Mining Knowl. Discov. 11 (2021). URL: https://doi.org/10.1002/widm.1391. doi:10.1002/widm.1391.
[8] Z. Wang, W. Zhang, N. Liu, J. Wang, Scalable rule-based representation learning for interpretable classification, in: Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS '21, Curran Associates Inc., Red Hook, NY, USA, 2021.
[9] F. Beck, J. Fürnkranz, An empirical investigation into deep and shallow rule learning, 2021. URL: https://arxiv.org/abs/2106.10254. arXiv:2106.10254.
[10] L. Dierckx, R. Veroneze, S. Nijssen, RL-Net: Interpretable rule learning with neural networks, in: Advances in Knowledge Discovery and Data Mining: 27th Pacific-Asia Conference, PAKDD 2023, May 25-28, Proceedings, 2023. URL: https://dial.uclouvain.be/pr/boreal/object/boreal:274378.
[11] E. van Krieken, E. Acar, F. van Harmelen, Analyzing differentiable fuzzy logic operators, Artificial Intelligence 302 (2022) 103602. URL: http://dx.doi.org/10.1016/j.artint.2021.103602. doi:10.1016/j.artint.2021.103602.
[12] M. Courbariaux, Y. Bengio, BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1, CoRR abs/1602.02830 (2016). URL: http://arxiv.org/abs/1602.02830. arXiv:1602.02830.
[13] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.
[14] T. T. Nguyen, C. Soussen, J. Idier, E.-H. Djermoune, NP-hardness of ℓ0 minimization problems: revision and extension to the non-negative setting, in: 13th International Conference on Sampling Theory and Applications, SampTA 2019, Bordeaux, France, 2019. URL: https://hal.science/hal-02112180. doi:10.1109/sampta45681.2019.9030937.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.
[16] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution: A continuous relaxation of discrete random variables, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017. URL: https://openreview.net/forum?id=S1jE5L5gl.
[17] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, in: Y. Bengio, Y. LeCun (Eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL: http://arxiv.org/abs/1312.6114.
[18] L. Bergamin, M. Polato, F. Aiolli, Improving rule-based classifiers by Bayes point aggregation, Neurocomputing 613 (2025) 128699. URL: https://www.sciencedirect.com/science/article/pii/S092523122401470X. doi:10.1016/j.neucom.2024.128699.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abuhmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>El-Sappagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Muhammad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alonso-Moral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Confalonieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Del</given-names>
            <surname>Ser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence</article-title>
          ,
          <source>Information Fusion</source>
          <volume>99</volume>
          (
          <year>2023</year>
          )
          101805. URL: https://www.sciencedirect.com/science/article/pii/S1566253523001148. doi:10.1016/j.inffus.2023.101805
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>