<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>OVERLAY</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Integrating L0 regularization into Multi-layer Logical Perceptron for Interpretable Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gonzalo Jaimovitch-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Bergamin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Aiolli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Confalonieri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Internacional Menéndez Pelayo</institution>
          ,
          <addr-line>Santander</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua, Department of Mathematics</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>6</volume>
      <fpage>28</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Deep neural networks are widely used in practical applications of AI; however, their inner structure and complexity make them generally not easily interpretable. Model transparency and interpretability are key requirements in scenarios where high performance alone is not enough to adopt the proposed solution. In this work, we adapt a differentiable approximation of ℓ0 regularization to a logic-based neural network, the Multi-layer Logical Perceptron (MLLP), and we evaluate its effectiveness in reducing the complexity of its interpretable discrete version, the Concept Rule Set (CRS), while preserving its performance. Results are compared to alternative heuristics, such as Random Binarization of the network weights, to assess whether better results can be achieved with a less noisy technique that sparsifies the network based on the loss function rather than a random distribution.</p>
      </abstract>
      <kwd-group>
        <kwd>Logical Perceptron</kwd>
        <kwd>Propositional Network</kwd>
        <kwd>Interpretable Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Advances in deep learning have promoted the training of models with a continuously increasing number of parameters in search of higher performance and new capabilities, reaching the order of billions in some cases. However, some of these solutions became black-box models incapable of explaining the reasoning behind their decisions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Some scenarios with critical use cases, such as medicine, law, or finance, demand a higher level of explainability. Explainable AI techniques can help both reduce model complexity and improve the interpretability of the solutions [2].
      </p>
      <p>There are two important aspects of the interpretability of neural networks. The first, which is a key notion of transparency as explained in [3], is that each part of a model should have an intuitive explanation. In the case of neural networks, the use of activation functions in the neurons makes the transformation of each input infeasible for humans to understand. A second aspect is the existing trade-off among faithfulness, understandability, and model performance. Most of the methods in the literature compromise at least one of those requirements, which might not be suitable for highly sensitive scenarios [4].</p>
      <p>Typically, when explaining a black-box model using a surrogate symbolic model (e.g., a rule set or a decision tree), accuracy is often sacrificed for transparency. To address this, Wang et al. [5] proposed a hierarchical rule-based model obtained using a Multi-layer Logical Perceptron network (MLLP), where a rule-based model is learned through backpropagation and later discretized into rule-set form. A key challenge for rule-based models is finding an easily interpretable, concise structure. In this work, we claim that a sparser network naturally leads to simpler rules. Thus, to achieve higher interpretability, we promote network sparsity through the introduction of a regularization term into the neural network’s loss function. Specifically, we apply a differentiable approximation of ℓ0 regularization [6] and study its effectiveness in aligning the trained continuous model with its discrete, interpretable version.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Rule-based models (often presented as decision trees, rule lists, and rule sets) are widely used in the field due to their transparent inner structure, in contrast to approaches such as Deep Neural Networks (DNN), which are considered black-box models that are hard to interpret [7]. Multiple works have explored using backpropagation and/or multilayer structures to learn rule-based models and improve their performance while retaining their transparency [8, 9, 10]. Using differentiable proxies of logic operations is a recently studied topic, due to its easy integration with gradient-based optimization [11].</p>
      <p>Model compression for neural networks is a relevant field of study in deep learning, since it has been shown that models can be over-parameterized, leading to unnecessary computational resource usage, reduced efficiency, and lower generalization capabilities [2]. Finally, Binary Neural Networks (BNN) are a special type of neural network whose weights are binary. Courbariaux and Bengio [12] use the straight-through estimator to allow gradient descent over non-differentiable functions. Even though they reduce computational complexity and improve transparency, these models are still not considered interpretable due to the complex interactions of non-linear activation functions.</p>
      <p>The aim of regularization within machine learning is to reduce the generalization error without increasing the training error [13], i.e., to prevent a model from overfitting (which is especially an issue in neural networks due to their generally complex and deep structure), ensuring its performance is aligned on both the training data and new, unseen data. Some popular regularization techniques are ℓ2 norm regularization (also known as weight decay), ℓ1 norm regularization, and Dropout regularization.</p>
      <p>ℓ0 regularization imposes a penalty on the objective function directly through the number of non-zero parameters, strongly promoting sparsity and improving generalization. Furthermore, it can help speed up inference and training, as the weights that become zero remove some paths from the computational graph. ℓ0 regularization is an NP-hard problem from the computational complexity perspective [14], which is considered difficult to solve. In [6], an approximate approach to ℓ0 regularization is proposed, making the problem tractable, with experimental results that demonstrate its effectiveness in reducing the size of neural networks such as AlexNet. We argue that this regularization is a great fit for logic-based networks, since sparsity is, in our opinion, a key ingredient for building interpretable classifiers.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <sec id="sec-3-1">
        <title>3.1. Notation</title>
        <p>In this section, we introduce the main notation and concepts used throughout the paper. 𝒜 is a set of binary features and 𝒞 is a set of class labels. 𝒟 = {(x_1, y_1), . . . , (x_n, y_n)} denotes a training dataset with n instances, where x_i ∈ {0, 1}^|𝒜| is a binary feature vector and y_i ∈ {0, 1}^|𝒞| is a one-hot class label vector.</p>
        <p>ℛ^(l) and 𝒮^(l) will denote the l-th layer of the network when l is odd and even, respectively. The input layer (layer 0) consists of |𝒜| input neurons, and the output layer consists of |𝒞| output neurons. Using n_l to denote the number of nodes in the l-th layer, W^(l) is an n_l × n_{l−1} matrix containing the weights between the l-th layer and the (l−1)-th layer. Each element of W^(l) is referred to as W^(l)_{i,j}.</p>
        <p>A rule r is a conjunction of one or more Boolean variables, r = a_1 ∧ . . . ∧ a_k. A rule set s is a disjunction of one or more rules, s = r_1 ∨ . . . ∨ r_m, i.e., s is a Disjunctive Normal Form (DNF) clause.</p>
        <p>Finally, σ(x) denotes the sigmoid function, defined as σ(x) = 1 / (1 + e^(−x)).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Concept Rule Set (CRS)</title>
        <p>A Concept Rule Set (CRS) is a multi-level hierarchical rule-based model proposed in [5]. A CRS is an instance of a multi-layer logical perceptron network (see Section 3.3) where each weight in the network is binary, i.e., W^(l)_{i,j} ∈ {0, 1}. W^(l)_{i,j} = 1 indicates that there exists an edge connecting the i-th node in the l-th layer to the j-th node in the (l−1)-th layer; otherwise, W^(l)_{i,j} = 0. Following the notation adopted in [5], r_i^(l) and s_i^(l) denote the i-th node in layer ℛ^(l) and 𝒮^(l), respectively. These nodes are formally defined as follows:
r_i^(l) = ⋀_{j : W^(l)_{i,j} = 1} s_j^(l−1),    s_i^(l+1) = ⋁_{j : W^(l+1)_{i,j} = 1} r_j^(l)    (1)</p>
        <p>Given the above, node r_i^(l) corresponds to a rule, while node s_i^(l) corresponds to a rule set. In a CRS with L levels, each ℛ^(l) (l ∈ {1, 3, . . . , 2L − 1}) is known as a conjunction layer and each 𝒮^(l) (l ∈ {2, 4, . . . , 2L}) is known as a disjunction layer. A CRS consists of 2L + 1 layers, organized as one input layer followed by L pairs of conjunction and disjunction layers.</p>
        <p>By providing each input instance as the values of the input layer, once trained, a CRS model works as a classifier ℱ : {0, 1}^|𝒜| → {0, 1}^|𝒞|. The model outputs the values of the nodes in the last disjunction layer 𝒮^(2L), where s_i^(2L) = 1 indicates that ℱ classifies the input instance as the i-th class label. The representation learned by the l-th layer of the CRS is a binary vector h^(l):
h^(l) = [r_1^(l), r_2^(l), . . . , r_{n_l}^(l)]^⊤ for l ∈ {1, 3, . . . , 2L − 1};    h^(l) = [s_1^(l), s_2^(l), . . . , s_{n_l}^(l)]^⊤ for l ∈ {2, 4, . . . , 2L}
The value of h_i^(l) is equal to the value of the i-th node in the l-th layer, which corresponds to a rule or a rule set. This discrete and explicit representation makes the model transparent.</p>
        <p>We refer to the complexity of a CRS model as the total length of all rules. We use |r_i^(l)| and |s_i^(l)| to refer to the number of nodes contained in a rule and a rule set, respectively. Then, the complexity of a CRS model (c_ℱ) is defined as follows:
c_ℱ = ∑_{l=1}^{L} ( ∑_{i=1}^{n_{2l−1}} |r_i^(2l−1)| + ∑_{i=1}^{n_{2l}} |s_i^(2l)| )    (2)</p>
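        <p>Since the CRS weights are binary, the complexity measure of Eq. (2) can be computed directly from the weight matrices: the length of each rule or rule set is simply the number of 1-entries in the corresponding row. A minimal sketch in Python (layer shapes are illustrative, not taken from the paper):</p>
        <preformat>
```python
import numpy as np

def crs_complexity(weight_matrices):
    # Total rule length of a CRS (Eq. 2): each binary matrix W has shape
    # (n_l, n_{l-1}); row i has one 1-entry per literal of node i, so the
    # complexity is the total number of 1-entries over all layers.
    return int(sum(W.sum() for W in weight_matrices))

# Toy CRS: one conjunction layer followed by one disjunction layer.
W1 = np.array([[1, 0, 1],   # r1 = a1 AND a3 (length 2)
               [0, 1, 1]])  # r2 = a2 AND a3 (length 2)
W2 = np.array([[1, 1]])     # s1 = r1 OR r2  (length 2)
print(crs_complexity([W1, W2]))  # -> 6
```
        </preformat>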
        <p>Wang et al. [5] propose the use of an MLLP model (introduced in the next section), a neural network architecture, to search for the discrete solution of the CRS model in the continuous space by using gradient descent over the continuous weights of the MLLP. Subsequently, the weights of the network are binarized to transform the MLLP model into a CRS model, resulting in a classifier that ensures both good performance and transparency. For the discrete CRS extraction, a simple method of binarizing the weights using a threshold is applied.</p>
        <p>Conj(h, ŵ) = ∏_{i=1}^{n} F_c(h_i, ŵ_i),    F_c(h, w) = 1 − w(1 − h)    (3)</p>
        <p>Disj(h, ŵ) = 1 − ∏_{i=1}^{n} (1 − F_d(h_i, ŵ_i)),    F_d(h, w) = h · w    (4)</p>
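        <p>As a concrete illustration (a sketch, not the authors' implementation), Eqs. (3) and (4) can be written in a few lines of NumPy; with binary inputs and weights they reduce to an exact AND/OR over the selected inputs:</p>
        <preformat>
```python
import numpy as np

def conj(h, w):
    # Eq. (3): Fc(h_i, w_i) = 1 - w_i * (1 - h_i). A weight w_i = 0 makes
    # its factor equal to 1, so that input is ignored by the product.
    return np.prod(1.0 - w * (1.0 - h))

def disj(h, w):
    # Eq. (4): De Morgan's law over Fd(h_i, w_i) = h_i * w_i; a weight
    # w_i = 0 contributes 0 to the disjunction.
    return 1.0 - np.prod(1.0 - h * w)

h = np.array([1.0, 0.0, 1.0])
w = np.array([1.0, 0.0, 1.0])  # select inputs 1 and 3
print(conj(h, w))  # 1.0 = (h1 AND h3)
print(disj(h, w))  # 1.0 = (h1 OR h3)
```
        </preformat>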
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multi-layer Logical Perceptron</title>
        <p>The Multi-layer Logical Perceptron (MLLP) is a neural network architecture proposed in [5], designed so that each of its neurons corresponds to one node in the CRS. The main difference from a fully connected Multi-layer Perceptron is the specific design of the activation functions, which aim to replicate the behavior of the conjunction and disjunction logical operations. Another important difference is the presence of a selection mechanism that is used to attend to a subset of the inputs.</p>
        <p>Given the n-dimensional vectors h (a layer input vector) and ŵ (the weights of a given neuron), with ŵ_i ∈ [0, 1] denoting the weight of the connection between the input h_i and the neuron, the conjunction and disjunction functions are given by Eqs. (3) and (4).</p>
        <p>In Conj(h, ŵ), the conjunction operation is obtained by multiplying many values between 0 and 1 together. For Disj(h, ŵ), the disjunction is computed with a similar operation, applying De Morgan's law by negating both the inputs and the outputs of the function. To implement this negation, the mapping x ↦ 1 − x is used. F_c and F_d act as a selection mechanism: by turning a weight w to zero, F_c can learn to output 1 regardless of its input, leaving the subsequent conjunction operation unaffected. A similar mechanism is implemented for F_d, which outputs zero instead. Notice that when h and ŵ are both binary vectors, Eqs. 3 and 4 reduce to the conjunction and disjunction of a subset of the elements in h. The conjunction and disjunction functions are applied to the neurons in the l-th layer as follows:
r̂_i^(l) = Conj(ŝ^(l−1), ŵ_i^(l)) for l ∈ {1, 3, . . . , 2L − 1},    ŝ_i^(l) = Disj(r̂^(l−1), ŵ_i^(l)) for l ∈ {2, 4, . . . , 2L}    (5)
where r̂_i^(l) and ŝ_i^(l) are neurons in the l-th layer of the MLLP, and Ŵ^(l) is an n_l × n_{l−1} weight matrix.</p>
        <p>The weights of the MLLP need to be constrained in the range [0, 1] to ensure the proper functioning of the conjunction and disjunction activation functions. To this end, the weights of a given layer l are constrained using a clip function: Clip(Ŵ^(l)_{i,j}) = max(0, min(1, Ŵ^(l)_{i,j})). Given the encoding above, the MLLP network can function identically to the corresponding CRS when the weights are set to the same discrete values, while still remaining differentiable.</p>
        <p>Let ℱ̂ be the MLLP model and Ŵ be the weights to be learned by the network. The MLLP loss function is given by:
ℒ(Ŵ) = (1/n) ∑_{i=1}^{n} MSE(ℱ̂(x_i, Ŵ), y_i) + λ ∑_{l=1}^{2L} ‖Ŵ^(l)‖₂²    (6)
The second term on the rhs of Eq. 6 is the ℓ2 regularization term. This term is included to shrink the MLLP weights, and it induces a simpler CRS model.</p>
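        <p>The loss of Eq. (6) can be sketched as follows in NumPy (in practice the MLLP is trained with automatic differentiation; the names here are illustrative):</p>
        <preformat>
```python
import numpy as np

def mllp_loss(preds, targets, weights, lam):
    # Eq. (6): mean squared error over the batch plus an L2 penalty
    # (squared Frobenius norm) summed over all layer weight matrices.
    mse = np.mean((preds - targets) ** 2)
    l2 = sum(np.sum(W ** 2) for W in weights)
    return mse + lam * l2

preds = np.array([0.9, 0.1])
targets = np.array([1.0, 0.0])
print(mllp_loss(preds, targets, [np.array([[0.5]])], lam=1e-3))
```
        </preformat>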
        <p>After training the continuous model (MLLP), a discrete and interpretable model (CRS) can be extracted. Its behavior is not guaranteed to follow the continuous version, and the authors observe a drop in performance. This problem arises from the application of the conjunction and disjunction operations over continuous weights, as the MLLP weights can be anywhere in the range [0, 1]. To mitigate this issue, the authors propose a training method based on Random Binarization (RB).</p>
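        <p>The extraction step can be sketched as simple thresholding of the trained continuous weights (the threshold value below is illustrative):</p>
        <preformat>
```python
import numpy as np

def extract_crs(mllp_weights, threshold=0.5):
    # Binarize each continuous weight matrix to obtain the discrete CRS:
    # weights above the threshold become edges (1), the rest disappear (0).
    return [np.where(W > threshold, 1, 0) for W in mllp_weights]

W = np.array([[0.9, 0.1],
              [0.4, 0.7]])
print(extract_crs([W])[0])  # [[1 0]
                            #  [0 1]]
```
        </preformat>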
        <p>The RB method selects some of the weights during training using a random binarization rate in [0, 1], which is the probability of a weight being binarized. The binarized weights are frozen, and the forward (predict) and backpropagation passes are performed. After a given number of steps, the binarized weights are restored and a new set of random weights is binarized (in the experiments, the RB operation is applied every epoch). Therefore, during training, the MLLP model behaves more like the CRS, which ensures a closer performance between the MLLP and CRS models after the binarization of the continuous solution. W̃_i^(l) denotes the weights of the i-th neuron in the l-th layer of the MLLP after RB; these weights replace Ŵ^(l) in Eq. 5. The behavior of the RB method is similar to that of Dropout regularization [15], which helps address overfitting.</p>
        <p>3.4. ℓ0 Regularization Approximation</p>
        <p>In this work, we focus on the ℓ0 regularization approximation proposed in [6]. The authors propose a set of gates that determine whether a parameter or group of parameters of the network (e.g., all the weights associated with a given input neuron of a layer) is active (i.e., value greater than 0) or inactive (i.e., value equal to 0). The values z of those gates can be drawn from a distribution such as the binary Bernoulli. However, this distribution needs to be smoothed in order to be differentiable. Given s as a continuous random variable with a distribution q(s) and parameters φ, rectified to be in the interval [0, 1], the probability of the gate being active can be calculated using its cumulative distribution function (CDF) so that:
s ∼ q(s|φ),    z = min(1, max(0, s)),    q(z ≠ 0|φ) = 1 − Q(s ≤ 0|φ) = Q(s > 0|φ)    (7)</p>
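        <p>The RB schedule described above can be sketched as follows (a simplified illustration in which "binarizing" a weight means rounding it to 0 or 1; the helper names are ours, not the authors'):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def random_binarization(W, rate):
    # Binarize a random subset of the weights: each weight is selected
    # with probability `rate`. The caller keeps the original W so the
    # selected entries can be restored before the next epoch's draw.
    mask = np.less(rng.random(W.shape), rate)
    return np.where(mask, np.rint(W), W), mask

W = np.array([[0.9, 0.2],
              [0.6, 0.4]])
W_rb, mask = random_binarization(W, rate=1.0)
print(W_rb)  # with rate 1.0, every weight is rounded to 0 or 1
```
        </preformat>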
        <p>The distribution q selected in this work is the hard concrete distribution, obtained by stretching a binary concrete distribution [16], closely tied to the Bernoulli distribution, and rectifying each sampled value s with a hard-sigmoid to constrain it to the range [0, 1]. The parameters φ of this distribution are log α (the location) and β (the temperature, which controls the degree of recovery of the original Bernoulli distribution). γ and ζ determine the stretching of the original distribution. The hard concrete distribution can be formalized as follows:
u ∼ U(0, 1),    s = σ((log u − log(1 − u) + log α) / β),    s̄ = s(ζ − γ) + γ,    z = min(1, max(0, s̄))    (8)
The nature of the binary concrete distribution allows for the reparametrization trick [17], which prevents the introduction of randomness into the gradient descent process when sampling from the learnt distribution (as expressed in Eq. 8). On another note, the authors propose group sparsity instead of parameter sparsity as a means to achieve network speedups; in their approach, group sparsity is presented as input-neuron sparsity in the case of fully connected layers. The resulting expected ℓ0 penalty for the proposed grouped sparsity can be expressed as:
ℛ̂(Ŵ, φ) = ∑_{g=1}^{|G|} |g| (1 − Q(s_g ≤ 0 | φ_g))    (9)
where |G| corresponds to the number of groups and |g| corresponds to the number of parameters of group g. The probability of a gate z being active under the hard concrete distribution is:
1 − Q(s ≤ 0 | φ) = σ(log α − β log(−γ / ζ))    (10)</p>
        <p>During test, the deterministic values of the gates are obtained with the following estimator:
ẑ = min(1, max(0, σ(log α)(ζ − γ) + γ))    (11)</p>
        <p>4. Integrating ℓ0 Regularization into MLLP</p>
        <p>The original implementation of the MLLP architecture employs the RB method to address the misalignment between the continuous and discrete solutions, while also addressing overfitting, similarly to Dropout regularization. Although the CRS model learned through the MLLP achieves high transparency and performance, the resulting solutions can still be complex due to the large number of rules involved.</p>
        <p>ℓ0 regularization turns out to be an interesting method to increase the sparsity of a model, which, in turn, directly impacts the interpretability of the CRS model. Integrating ℓ0 has a number of advantages:</p>
        <p>• Direct Sparsity Induction: ℓ0 regularization directly promotes sparse solutions. This can also act as a feature selection process: input neurons that are not relevant can be ignored. Dropout regularization, in contrast, mainly improves robustness, and its solutions are not necessarily sparse.</p>
        <p>• Interpretability: with ℓ0 regularization, the result is a sparse and therefore less complex model. The resulting model is inherently more interpretable.</p>
        <p>• Training Efficiency: the training of sparse models can be made more efficient, since all the weights that become 0 are no longer part of the computational graph.</p>
        <p>To include ℓ0 regularization in the MLLP architecture, multiple changes need to be considered.</p>
        <p>[Figure 1: The forward pass. The weights Ŵ are multiplied (⊗) by the gate values (z during training, ẑ during test) and, during training, Random Binarization is applied before the activation a.]</p>
        <p>To start, the MLLP network needs to include the locations log α of a hard concrete distribution [16] for each gate as trainable parameters. In our implementation¹, we follow the same approach as [6]², where each gate represents the activation or deactivation of an input neuron, coined as group sparsity. These values are initialized by sampling a normal distribution determined by a dropout rate. Furthermore, we define a dropout rate for the input layer separate from the dropout rate used for the rest of the layers.</p>
        <p>When performing the continuous forward operation, the behavior differs between training and testing. During training, the value z of each gate is sampled from a hard concrete distribution, using the reparameterization trick by sampling a uniform distribution. Then, the weights are multiplied by the value of their corresponding ℓ0 gate, and randomly binarized (when the random binarization rate is greater than 0) before applying the conjunction or disjunction activation functions. During testing, the value of ẑ is obtained with a deterministic operation using Eq. 12. We summarize in Fig. 1 the operations of the forward pass.</p>
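        <p>Combining the gates with the weights in the forward pass can be sketched as follows (group sparsity: one gate per input neuron, i.e., one gate per column of the n_l × n_{l−1} weight matrix; the helper names are illustrative):</p>
        <preformat>
```python
import numpy as np

def gated_weights(W, gates):
    # Multiply each column of W (all weights leaving one input neuron)
    # by that neuron's gate value. `gates` holds the sampled z (Eq. 8)
    # during training, or the deterministic estimate (Eq. 12) at test time.
    return W * gates[np.newaxis, :]

W = np.ones((2, 3))
gates = np.array([1.0, 0.0, 0.5])  # the second input neuron is pruned
print(gated_weights(W, gates))
```
        </preformat>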
        <p>ẑ = min(1, max(0, σ(log α)(ζ − γ) + γ))    (12)</p>
        <p>A new threshold of 0.5 is included to binarize the ẑ value of each ℓ0 gate when applying the (binarized) forward operation to the CRS model. The regularized loss function is defined as follows:
ℒ′(Ŵ, φ) = (1/n) ∑_{i=1}^{n} MSE(ℱ̂(x_i, Ŵ × z), y_i) + λ ℒ_0(φ) + λ′ ℒ_2(Ŵ)    (13)</p>
        <p>Finally, since the original implementation of ℓ0 regularization targets classical neural networks, some minor adaptations were also needed. First, since the weights in the MLLP network are constrained between 0 and 1, the Kaiming initialization strategy outlined in [6] would not make sense; therefore, the initialization is left as in [5], using Uniform(0, 0.1). Second, MLLP networks do not have any bias term, so the bias was considered neither in the neurons nor in the locations log α of the ℓ0 gates.</p>
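        <p>Eq. (13) can be sketched by combining the data term with the two penalties (the names and the shape of the penalty terms are illustrative; `gate_probs` stands for the per-gate active probabilities of Eq. 10):</p>
        <preformat>
```python
import numpy as np

def regularized_loss(preds, targets, weights, gate_probs, lam_l0, lam_l2):
    # Eq. (13): MSE data term, plus lambda * expected L0 penalty (the sum
    # of the probabilities that each gate is active), plus lambda' * L2.
    mse = np.mean((preds - targets) ** 2)
    l0 = sum(np.sum(p) for p in gate_probs)
    l2 = sum(np.sum(W ** 2) for W in weights)
    return mse + lam_l0 * l0 + lam_l2 * l2

loss = regularized_loss(
    preds=np.array([1.0]), targets=np.array([1.0]),
    weights=[np.array([[1.0]])], gate_probs=[np.array([0.5, 0.5])],
    lam_l0=1e-3, lam_l2=1e-8)
print(loss)  # 1e-3 * 1.0 + 1e-8 * 1.0
```
        </preformat>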
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Evaluation</title>
      <p>In this section, we evaluate the proposed solution following these steps:
1. Analysis of the sparsity achieved using ℓ0 regularization adapted to the MLLP framework.
2. Comparison between the MLLP baseline and the framework adapted to use ℓ0 regularization, in terms of CRS and MLLP predictions, as well as the complexity of the rules in the CRS model.</p>
      <p>Dataset. We consider the connect-4 dataset as our benchmark. This dataset is relevant for rule-learning models [18] due to its considerable size (67557 instances) and its difficulty: despite being a deterministic game, i.e., the data contains no noise, classifiers struggle to achieve high F1-scores.</p>
      <p>¹ https://github.com/gonzalojaimovitch/mllp_l0.git</p>
      <p>² https://github.com/AMLab-Amsterdam/L0_regularization</p>
      <p>[Figure: F1 Score vs. Random Binarization Rate]</p>
      <p>Experimental Settings. The hyperparameter settings are taken from the experimental section of [5], namely: batch size of 128, 400 epochs, learning rate of 5 × 10⁻³, learning rate decay of 0.75 every 100 epochs, weight decay (λ′) of 10⁻⁸, a random binarization rate of 0.5, and an update of the randomly binarized weights every epoch (when applied). Regarding the data, 80% of the training data is used for training and 20% for validation when performing hyperparameter search. Furthermore, 5-fold cross-validation is adopted for more representative results. For analysing performance, the macro F1 score is the chosen metric, due to dataset imbalance. The ℓ0 settings are: λ and initial dropout rates of 0.001, a gate binarization threshold of 0.5, and a value of β fixed to 2/3 (as recommended by [16]).</p>
      <p>Sparsity analysis. Fig. 2 shows the evolution of the active weights of the MLLP models during training on the connect-4 dataset, for both the replicated results and the results including ℓ0 regularization. An active weight is a weight whose value is greater than 0. In the case of the models including ℓ0 regularization, the active weights are those for which the product of the weight with its corresponding ℓ0 gate is greater than 0. As expected, the models including ℓ0 regularization yield solutions that are sparser than those obtained by the replicated MLLP models. Furthermore, the models with a greater number of weights are sharply sparsified compared to the simpler models in terms of active weights. This is most evident in Fig. 2 (right), which shows how the active weights are drastically reduced for the model with the largest number of weights: the MLLP model with 3 hidden layers of 256 nodes each.</p>
      <p>MLLP performance comparison. Fig. 3 shows the average F1 scores and standard errors for both the replicated results from [5] and the results for CRS models trained with ℓ0 regularization. For the MLLP models, similar conclusions can be drawn as for the CRS results. Results are similar or considerably improved with the inclusion of ℓ0 regularization for random binarization rates of both 0 and 0.5, with special emphasis on the deep models at rate 0.5. However, when increasing the rate to 0.9, scores without ℓ0 regularization are generally higher, with some exceptions for the deeper models. In both cases, the greater the random binarization rate, the worse the performance of the MLLP models, with a similar trend of deterioration.</p>
      <p>[Figure: Total Length of Rules vs. Random Binarization Rate]</p>
      <p>Rule complexity. Fig. 4 shows the complexity measure (Eq. 2), i.e., the total length of rules, of the different CRS models, for both the replicated results from [5] and the results of CRS models for which ℓ0 regularization was included during the training of their respective MLLP models. When RB is not applied (rate 0), the total number of rules is quite similar in both cases, with the exception of the largest model, whose complexity is considerably reduced. In the case of rate 0.5, the complexity of the model with 3 hidden layers of 128 nodes each increases with the inclusion of ℓ0 regularization, which might be related to the drastic increase in performance it promotes. For the largest model, there is a great increase in performance together with a considerable reduction in complexity. Lastly, for rate 0.9, ℓ0 regularization helps reduce the complexity of almost every model, with special emphasis on the largest model, for which the total length of rules is reduced by approximately a factor of eight. This considerable reduction might be related to the slight decrease in performance compared to the models without ℓ0 regularization.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions and Future Works</title>
      <p>In this work, we proposed an adaptation of a computationally complex regularization technique to the MLLP framework, a logical network that acts as the continuous (differentiable) version of a multi-level hierarchical rule-based model. We enhanced the interpretability of a rule-based model by reducing its complexity through a model compression technique, in the form of a regularizer exploited during optimization. We showed that ℓ0 regularization can effectively reduce model complexity without affecting performance and, in some cases, even enhance it.</p>
      <p>As future work, we plan to extend the datasets considered and to introduce the capability of including logical constraints in the network. Specification of data requirements through explicit background knowledge could potentially help the network meet desirable properties such as safety, consistency, and fairness.</p>
      <p>References</p>
      <p>[2] M. Sabih, F. Hannig, J. Teich, Utilizing explainable AI for quantization and pruning of deep neural networks, arXiv preprint arXiv:2008.09072 (2020).
[3] Z. C. Lipton, The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery, Queue 16 (2018) 31–57.
[4] V. Swamy, S. Montariol, J. Blackwell, J. A. Frej, M. Jaggi, T. Käser, InterpretCC: Intrinsic user-centric interpretability through global mixture of experts (2024).
[5] Z. Wang, W. Zhang, N. Liu, J. Wang, Transparent classification with multilayer logical perceptrons and random binarization, in: The 34th AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 6331–6339. URL: https://doi.org/10.1609/aaai.v34i04.6102. doi:10.1609/aaai.v34i04.6102.
[6] C. Louizos, M. Welling, D. P. Kingma, Learning sparse neural networks through L0 regularization, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018. URL: https://openreview.net/forum?id=H1Y8hhg0b.
[7] R. Confalonieri, L. Coba, B. Wagner, T. R. Besold, A historical perspective of explainable artificial intelligence, WIREs Data Mining Knowl. Discov. 11 (2021). URL: https://doi.org/10.1002/widm.1391. doi:10.1002/widm.1391.
[8] Z. Wang, W. Zhang, N. Liu, J. Wang, Scalable rule-based representation learning for interpretable classification, in: Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS '21, Curran Associates Inc., Red Hook, NY, USA, 2021.
[9] F. Beck, J. Fürnkranz, An empirical investigation into deep and shallow rule learning, 2021. URL: https://arxiv.org/abs/2106.10254. arXiv:2106.10254.
[10] L. Dierckx, R. Veroneze, S. Nijssen, RL-Net: Interpretable rule learning with neural networks, in: Advances in Knowledge Discovery and Data Mining: 27th Pacific-Asia Conference, PAKDD 2023, May 25-28, Proceedings, 2023. URL: https://dial.uclouvain.be/pr/boreal/object/boreal:274378.
[11] E. van Krieken, E. Acar, F. van Harmelen, Analyzing differentiable fuzzy logic operators, Artificial Intelligence 302 (2022) 103602. URL: http://dx.doi.org/10.1016/j.artint.2021.103602. doi:10.1016/j.artint.2021.103602.
[12] M. Courbariaux, Y. Bengio, BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1, CoRR abs/1602.02830 (2016). URL: http://arxiv.org/abs/1602.02830. arXiv:1602.02830.
[13] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.
[14] T. T. Nguyen, C. Soussen, J. Idier, E.-H. Djermoune, NP-hardness of ℓ0 minimization problems: revision and extension to the non-negative setting, in: 13th International Conference on Sampling Theory and Applications, SampTA 2019, Bordeaux, France, 2019. URL: https://hal.science/hal-02112180. doi:10.1109/sampta45681.2019.9030937.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.
[16] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution: A continuous relaxation of discrete random variables, in: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017. URL: https://openreview.net/forum?id=S1jE5L5gl.
[17] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, in: Y. Bengio, Y. LeCun (Eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL: http://arxiv.org/abs/1312.6114.
[18] L. Bergamin, M. Polato, F. Aiolli, Improving rule-based classifiers by Bayes point aggregation, Neurocomputing 613 (2025) 128699. URL: https://www.sciencedirect.com/science/article/pii/S092523122401470X. doi:10.1016/j.neucom.2024.128699.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abuhmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>El-Sappagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Muhammad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alonso-Moral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Confalonieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Del</given-names>
            <surname>Ser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence</article-title>
          ,
          <source>Information Fusion</source>
          <volume>99</volume>
          (
          <year>2023</year>
          )
          101805. URL: https://www.sciencedirect.com/science/article/pii/S1566253523001148. doi:10.1016/j.inffus.2023.101805
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>