Compiling Neural Network Classifiers into Boolean
Circuits for Efficient Shap-Score Computation
Leopoldo Bertossi¹, Jorge E. León²
¹ SKEMA Business School, Montreal, Canada
² Universidad Adolfo Ibáñez (UAI), Santiago, Chile


Abstract
We describe the transformation of binary neural networks (BNNs) for classification into deterministic and decomposable Boolean circuits by means of knowledge compilation techniques. The resulting circuit is used, as an open-box model, to compute Shap scores, taking advantage of a recent efficient algorithm for Shap on this class of circuits. We report experimental results that show a huge improvement in score-computation time in comparison with the computation that uses the BNN directly as a black-box model.

Keywords
Explainable ML, Shap Scores, Knowledge Compilation, Neural Networks, Boolean Decision Circuits




1. Introduction
Explanations for the outcomes of classification models come in different forms, and can be obtained through different approaches. A common one assigns attribution scores to the feature values associated with an input that goes through an ML-based model, to quantify their relevance for the obtained outcome. We concentrate on local scores, i.e. those associated with a particular input, as opposed to global scores that indicate the overall relevance of a feature. In this work, we also concentrate on explanations for binary classification models, whose features take binary values, as does the classification label, say 0 or 1.
   A popular local score is Shap [10], which is based on the Shapley value that has been introduced and used in coalition game theory and practice for a long time [15, 13]. Shap scores can be computed with a black-box or an open-box model [14]. With the former, we do not know or use its internal components, but only its input/output relation. This is the most common approach. In the latter case, we have access to its internal structure and components, and we can use them for score computation. It is common to consider neural-network-based models as black-box models, because their internal gates and structure may be difficult to understand or process when it comes to explaining classification outputs. However, a decision-tree model, due to its much simpler structure and use, is considered to be open-box for the same purpose.
   Even for binary classification models, the complexity of Shap computation is provably hard, actually #𝑃 -hard for several kinds of such models, independently of whether

AMW 2023: Alberto Mendelzon International Workshop on Foundations of Data Management, May 22–26, 2023, Santiago,
Chile
leopoldo.bertossi@skema.edu (L. Bertossi); jorgleon@alumnos.uai.cl (J. E. León)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)
the internal components of the model are used when computing Shap [3, 1, 2]. However, there
are classes of classifiers for which, using the model components and structure, the complexity
of Shap computation can be brought down to polynomial time [11, 2, 18].
   A polynomial time algorithm for Shap computation with deterministic and decomposable
Boolean circuits (dDBCs) was presented in [2]. From this result, the tractability of Shap computa-
tion can be obtained for a variety of Boolean circuit-based classifiers that can be represented as
(or compiled into) them. In particular, this holds for Ordered Binary Decision Diagrams (OBDDs)
[5], decision trees, and other established classification models that can be compiled into OBDDs
[16, 6, 12]. Similar results can be obtained for Sentential Decision Diagrams (SDDs) [9], which
can be seen as dDBCs, and form a convenient knowledge compilation target language [7, 17].
In [18], through a different approach, tractability of Shap computation was obtained for a collection of classifiers that intersects with that in [2].
   In this work, we concentrate on explicitly developing this approach to the efficient com-
putation of Shap for binary neural networks (BNNs). For this, and inspired by [16], a BNN
is transformed into a dDBC using techniques from knowledge compilation [7], an area that investigates the transformation of (usually) propositional theories into equivalent ones, with a canonical syntactic form that has some good computational properties, e.g. tractable model counting. The compilation may incur a relatively high computational cost [7, 8], but it may still be worth the effort when a particular property will be checked repeatedly, as is the case with explanations for the same BNN.
   We also make experimental comparisons between this open-box and circuit-based Shap
computation and that based directly on the BNN treated as a black-box, i.e. using only its
input/output relation. We perform comparisons in terms of computation time and alignment of
Shap scores. We confirm that Shap computation via the dDBC vastly outperforms the direct
Shap computation on the BNN. It is also the case that the scores obtained are fully aligned, as
expected since the dDBC represents the BNN. A detailed account of our work can be found in
[4].


2. Background
Consider a fixed entity e = ⟨𝐹1 (e), . . . , 𝐹𝑁 (e)⟩ subject to classification. It has values 𝐹𝑖 (e) for
features ℱ = {𝐹1 , . . . , 𝐹𝑁 }. In [10, 11], the Shapley value is applied with {𝐹1 (e), . . . , 𝐹𝑁 (e)}
as the set of players, and the game function 𝒢e (𝑠) := E(𝐿(e′ ) | e′𝑠 = e𝑠 ), giving rise to the
Shap score. Here, e𝑠 is the projection (restriction) of e onto the subset 𝑠 of features, and 𝐿 is the label function. The e′ inside the expected value is an entity whose values coincide with those of e for the features in 𝑠. For 𝐹 ∈ ℱ and an entity e:

$$\mathit{Shap}(\mathcal{F}, \mathcal{G}_{\mathbf{e}}, F) \;=\; \sum_{s \subseteq \mathcal{F}\smallsetminus\{F\}} \frac{|s|!\,(|\mathcal{F}|-|s|-1)!}{|\mathcal{F}|!}\, \Big[\, \mathbb{E}\big(L(\mathbf{e}') \mid \mathbf{e}'_{s\cup\{F\}} = \mathbf{e}_{s\cup\{F\}}\big) \;-\; \mathbb{E}\big(L(\mathbf{e}') \mid \mathbf{e}'_{s} = \mathbf{e}_{s}\big) \Big].$$
  The expected value is defined on the basis of an underlying probability distribution on the
entity population. Shap quantifies the contribution of feature value 𝐹 (e) to the outcome label.
In order to compute Shap, we only need the function 𝐿, and none of the internal components of the classifier. Given that all possible subsets of features appear in its definition, Shap is bound to be hard to compute. Actually, for some classifiers, its computation may become #𝑃 -hard (cf. [2]). However, in [2] it is shown that Shap can be computed in polynomial time for every deterministic and decomposable Boolean circuit (dDBC) used as a classifier. The circuit's internal structure is used in the computation.
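To see the exponential cost concretely, the following minimal sketch (ours, for illustration only) computes Shap exactly from the definition for a hypothetical black-box classifier over three binary features, under the uniform distribution. Note the nested enumerations over subsets and over completions, which is what blows up as the number of features grows.

```python
from itertools import combinations
from math import factorial

FEATURES = [0, 1, 2]   # three binary features (hypothetical toy classifier)

def label(x):
    # Black-box binary classifier: 1 iff at most one feature value is 1.
    return 1 if sum(x) <= 1 else 0

def cond_exp(e, s):
    # E(L(e') | e'_s = e_s) under the uniform distribution: average the label
    # over all completions of the features outside s.
    free = [i for i in FEATURES if i not in s]
    total = 0
    for bits in range(2 ** len(free)):
        x = list(e)
        for k, i in enumerate(free):
            x[i] = (bits >> k) & 1
        total += label(x)
    return total / (2 ** len(free))

def shap(e, f):
    n = len(FEATURES)
    others = [i for i in FEATURES if i != f]
    score = 0.0
    for size in range(n):                      # all subsets s of F \ {f}
        for s in combinations(others, size):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            score += w * (cond_exp(e, set(s) | {f}) - cond_exp(e, set(s)))
    return score

e = (1, 0, 0)                                  # entity to explain; label(e) = 1
print([round(shap(e, f), 4) for f in FEATURES])
```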
3. BNN Compilation
A BNN is first compiled into a propositional formula in Conjunctive Normal Form (CNF), which, in its turn, is compiled into an SDD, which is finally compiled into a dDBC. We compute Shap on the resulting circuit via the efficient algorithm in [2]. This compilation is performed once, and is independent of any input to the classifier. The final circuit can be used to compute Shap scores for different input entities. We show the compilation path by means of a simple example.
   In our BNNs we used, for each gate 𝑔, a step function as activation function, of the form

$$\varphi_g(\bar{\imath}) \;=\; \mathrm{sp}(\bar{w}_g \bullet \bar{\imath} + b_g) \;:=\; \begin{cases} \phantom{-}1 & \text{if } \; \bar{w}_g \bullet \bar{\imath} + b_g \ge 0,\\ -1 & \text{otherwise,} \end{cases}$$

which is parameterized by a vector of binary weights $\bar{w}_g$ and a real-valued constant bias $b_g$. For technical, non-essential reasons, we used inputs and outputs in $\{-1, 1\}$, except for the output gate, 𝑜, which returns 0 or 1.
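A minimal sketch of such a gate (ours; the specific weights and bias below are made up for illustration):

```python
def sp(w, i, b):
    # Step activation: 1 if the weighted input reaches the threshold, else -1.
    return 1 if sum(wj * ij for wj, ij in zip(w, i)) + b >= 0 else -1

w_h, b_h = [-1, -1, 1], 0.16       # hypothetical binary weights, real bias
print(sp(w_h, [-1, -1, 1], b_h))   # hidden gates: inputs/outputs in {-1, 1}

def output_gate(w, i, b):
    # The output gate o returns the classification label 0 or 1 instead.
    return (sp(w, i, b) + 1) // 2
```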
[Figure: the example BNN (LHS) and the propositional formulas defining its gates h1, h2, h3 and output o (RHS).]

   In CNF: 𝑜 ←→ (−𝑥1 ∨ −𝑥2 ) ∧ (−𝑥1 ∨ −𝑥3 ) ∧ (−𝑥2 ∨ −𝑥3 ).
   The BNN on the LHS above is transformed into the propositional formula on the RHS. After
that, it is transformed into a simplified CNF formula, which is shown right below. In its turn,
the CNF is transformed into the SDD on the RHS below. In general, this is the most expensive
step during the overall compilation time, but it has an upper bound that is exponential in the
tree-width of the formula [9].
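The CNF-to-SDD compiler itself is beyond the scope of this sketch, but the flavor of knowledge compilation can be conveyed with OBDDs, which, as mentioned in Section 1, can also be seen as dDBCs [5]. The following minimal sketch (ours; a pedagogical stand-in, not the compiler used in this work) builds a reduced, ordered decision diagram for the example CNF by Shannon expansion with a unique table:

```python
def f(x1, x2, x3):
    # The simplified CNF above: true iff at most one of x1, x2, x3 is 1.
    return not (x1 and x2) and not (x1 and x3) and not (x2 and x3)

NODES = {}   # unique table: (var, low, high) -> node id; ids 0/1 are terminals

def mk(var, low, high):
    if low == high:               # redundant test: skip this node
        return low
    return NODES.setdefault((var, low, high), len(NODES) + 2)

def build(var=0, partial=()):
    # Shannon expansion on the next variable, sharing subgraphs via NODES.
    if var == 3:
        return int(f(*partial))
    low = build(var + 1, partial + (False,))
    high = build(var + 1, partial + (True,))
    return mk(var, low, high)

root = build()
print(root, NODES)   # a small DAG; decision nodes act as deterministic gates
```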
[Figure: an SDD (a) and a vtree (b).]
n or equal to 0. This is       Finally, the SDD is easily transformed into a dDBC, the one shown on the RHS here. It can
 e; otherwise, it is false.      straightforwardly           CNF.4 with
 neral by: (Narodytska be used         as a binary intoclassifier,   The resulting   CNF formula
                                                                           binary input     features 𝑥1 , 𝑥2 , 𝑥3 . The binary label is read off
                                 is, in its turn, simplified into a shorter and simpler new CNF
                            fromformula
                                   the topbynode.     This   circuit  is deterministic   in
                                               means of the Confer SAT solver (Manthey 2017).that, for every ∨-gate, at most one of its inputs
                            is 1 when
                                 For thisthe    output
                                            example,  the is 1. It is CNF
                                                           simplified      formula is asinfollows:
                                                                      decomposable          that, for every ∧-gate, the inputs do not share
 ative weights in w̄. (5) features.
                               o ←→ This
                                      (−xdDBC   is also smooth, in that sub-circuits that feed a same ∨-gate share the same
                                          1 ∨ −x2 ) ∧ (−x1 ∨ −x3 ) ∧ (−x2 ∨ −x3 ). (10)
                             features. It has a fan-in at most two, in that every ∧-gate and ∨-gate have at most two inputs.
                                 Having a CNF formula will be convenient for the next
  ive weights:      d =          conversion steps along path (3).                                         2
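These structural properties are what the efficient algorithm exploits. In particular, on such a circuit each conditional expectation E(𝐿(e′) | e′𝑠 = e𝑠) under the uniform (more generally, a product) distribution comes out of a single bottom-up pass: decomposability makes the inputs of an ∧-gate independent, so probabilities multiply, and determinism makes the inputs of an ∨-gate mutually exclusive, so probabilities add. A simplified sketch of this pass (ours; the circuit encoding is hypothetical, and this is only an inner subroutine, not the full polynomial-time Shap algorithm of [2]):

```python
# Nodes: ("var", i) for feature i; ("not", n) for a negated input; and
# ("and", l, r), ("or", l, r) for fan-in-2 gates.

def prob_one(node, fixed):
    # P(node = 1) when the features in `fixed` are set as given and the
    # remaining features are independently uniform over {0, 1}.
    kind = node[0]
    if kind == "var":
        i = node[1]
        return float(fixed[i]) if i in fixed else 0.5
    if kind == "not":                  # negations sit at the inputs
        return 1.0 - prob_one(node[1], fixed)
    l, r = prob_one(node[1], fixed), prob_one(node[2], fixed)
    if kind == "and":                  # decomposability: independent inputs
        return l * r
    return l + r                       # "or"; determinism: exclusive inputs

# Hand-built dDBC for the example classifier "at most one of the three
# features is 1" (features numbered 0, 1, 2):
C = ("or",
     ("and", ("var", 0),
             ("and", ("not", ("var", 1)), ("not", ("var", 2)))),
     ("and", ("not", ("var", 0)),
             ("or", ("and", ("var", 1), ("not", ("var", 2))),
                    ("not", ("var", 1)))))

print(prob_one(C, {}))       # E(L) with nothing fixed: 4/8 = 0.5
print(prob_one(C, {0: 1}))   # E(L | first feature = 1): 1/4 = 0.25
```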
4. Experiments
For our experiments, we consider real estate as an application domain, where house prices depend on certain features, which we appropriately binarize.¹ The problem consists in classifying property blocks, represented as entity records of thirteen feature values, as high-value or low-value. This is a binary classification problem for which a BNN is first learnt, and then used. After the transformation of the BNN into a circuit 𝒞, we had the possibility to compute Shap, for a given input entity, in three different ways: (a) directly on the BNN as a black-box model, i.e. using only its input/output relation for multiple calls; (b) similarly, using the circuit 𝒞 as a black-box model; and (c) with the efficient algorithm in [2] for smooth dDBCs with fan-in 2, treating circuit 𝒞 as an open-box.
   We performed these three computations for sets of 20, 40, 60, 80, and 100 input entities, to compare average times with increasing numbers of entities. In all cases, Shap was computed with the uniform probability distribution over the joint feature domain of size 2¹³. The results shown right below report the seconds taken to compute Shap on 20, 40, 60, 80 and 100 entities, using the BNN as a black-box (blue bar), the dDBC as a black-box (red bar), and the dDBC as an open-box (orange bar). Note that the vertical axis employs a logarithmic scale.

[Figure: bar chart of Shap computation times per set of entities, on a logarithmic vertical scale.]
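For intuition on why (c) is so much faster than (a): every conditional expectation in the Shap formula costs up to 2¹³ classifier calls in (a), but only one linear bottom-up pass over the circuit in (c). Even the following naive combination (ours; it still enumerates subsets, unlike the fully polynomial-time algorithm of [2] used for (c)) removes that dominant exponential factor:

```python
from itertools import combinations
from math import factorial

def prob_one(node, fixed):
    # One bottom-up pass over the circuit, as in the earlier sketch.
    kind = node[0]
    if kind == "var":
        return float(fixed[node[1]]) if node[1] in fixed else 0.5
    if kind == "not":
        return 1.0 - prob_one(node[1], fixed)
    l, r = prob_one(node[1], fixed), prob_one(node[2], fixed)
    return l * r if kind == "and" else l + r

def shap_on_circuit(circuit, e, f, features):
    # Shap from the definition, with each conditional expectation computed by
    # a single circuit pass instead of 2^(N - |s|) black-box calls.
    n = len(features)
    others = [i for i in features if i != f]
    score = 0.0
    for size in range(n):
        for s in combinations(others, size):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            e_s = {i: e[i] for i in s}
            e_sf = {**e_s, f: e[f]}
            score += w * (prob_one(circuit, e_sf) - prob_one(circuit, e_s))
    return score

# Tiny dDBC for "not (x and y)" over two features, just to exercise it:
C = ("or", ("and", ("var", 0), ("not", ("var", 1))), ("not", ("var", 0)))
print(shap_on_circuit(C, e={0: 1, 1: 0}, f=0, features=[0, 1]))   # -0.125
```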
   In our experiment, the initial transformation of the BNN into CNF took 1.3 hrs, which was the most expensive part of the compilation. However, it is a one-time computation, and our rather naive transformation algorithm leaves considerable room for improvement.
¹ We use the California Housing Prices dataset available at https://www.kaggle.com/datasets/camnugent/california-housing-prices.


5. Acknowledgments
We are grateful, for their scientific and technical help, to: Arthur Choi, Andy Shih, Norbert
Manthey, Maximilian Schleich, and Adnan Darwiche. Part of this work was funded by ANID -
Millennium Science Initiative Program - Code ICN17002; and “Centro Nacional de Inteligencia
Artificial” CENIA, FB210017 (Financiamiento Basal para Centros Científicos y Tecnológicos
de Excelencia de ANID), Chile. Both CENIA and SKEMA Canada funded Jorge León’s visit to
Montreal. L. Bertossi is a Professor Emeritus at Carleton University, Ottawa, Canada; and a
Senior Fellow of the Universidad Adolfo Ibáñez (UAI), Chile.


References
 [1] Arenas, M., Barceló, P., Bertossi, L. and Monet, M. On the Complexity of SHAP-Score-Based
     Explanations: Tractability via Knowledge Compilation and Non-Approximability Results.
     Journal of Machine Learning Research, 2023, 24(63):1-58. Extended version of [2].
 [2] Arenas, M., Barceló, P., Bertossi, L. and Monet, M. The Tractability of SHAP-Score-Based
     Explanations for Classification over Deterministic and Decomposable Boolean Circuits. In
     Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, 6670–6678.
 [3] Bertossi, L., Li, J., Schleich, M., Suciu, D. and Vagena, Z. Causality-Based Explanation
     of Classification Outcomes. In Proceedings of the 4th International Workshop on "Data
     Management for End-to-End Machine Learning" (DEEM) at ACM SIGMOD/PODS, 2020, 1–10.
     Posted as ArXiv paper 2003.06868.
 [4] Bertossi, L. and Leon, J. E. Opening Up the Neural Network Classifier for Shap Score
     Computation. 2023, ArXiv paper 2303.06516.
 [5] Bryant, R. E. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Transac-
     tions on Computers, 1986, C-35(8):677–691.
 [6] Darwiche, A. and Hirth, A. On The Reasons Behind Decisions. In Proceedings of the 24th
     European Conference on Artificial Intelligence, 2020, 712–720.
 [7] Darwiche, A. and Marquis, P. A Knowledge Compilation Map. Journal of Artificial
     Intelligence Research, 2002, 17(1):229–264.
 [8] Darwiche, A. On the Tractable Counting of Theory Models and its Application to Truth
     Maintenance and Belief Revision. Journal of Applied Non-Classical Logics, 2001, 11(1-2):11–34.
 [9] Darwiche, A. SDD: A New Canonical Representation of Propositional Knowledge Bases.
     In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI-11),
     2011, 819–826.
[10] Lundberg, S. M. and Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In
     Proceedings of the 31st International Conference on Neural Information Processing Systems,
     2017, 4768–4777. ArXiv paper 1705.07874.
[11] Lundberg, S., Erion, G., Chen, H., DeGrave, A., Prutkin, J., Nair, B., Katz, R., Himmelfarb, J.,
     Bansal, N. and Lee, S.-I. From Local Explanations to Global Understanding with Explainable
     AI for Trees. Nature Machine Intelligence, 2020, 2(1):56–67. ArXiv paper 1905.04610.
[12] Narodytska, N., Kasiviswanathan, S., Ryzhyk, L., Sagiv, M. and Walsh, T. Verifying
     Properties of Binarized Deep Neural Networks. In Proceedings of the 32nd AAAI Conference
     on Artificial Intelligence, 2018, 6615–6624.
[13] Roth, A. E. The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge University
     Press, 1988.
[14] Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions
     and Use Interpretable Models Instead. Nature Machine Intelligence, 2019, 1:206–215. ArXiv
     paper 1811.10154.
[15] Shapley, L. S. A Value for n-Person Games. In Contributions to the Theory of Games (AM-28),
     1953, vol. 2, 307–318.
[16] Shi, W., Shih, A., Darwiche, A. and Choi, A. On Tractable Representations of Binary Neural
     Networks. In Proceedings of the 17th International Conference on Principles of Knowledge
     Representation and Reasoning, 2020, 882–892. ArXiv paper 2004.02082.
[17] Van den Broeck, G. and Darwiche, A. On the Role of Canonicity in Knowledge Compilation.
     In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015, 1641–1648.
[18] Van den Broeck, G., Lykov, A., Schleich, M. and Suciu, D. On the Tractability of SHAP
     Explanations. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021,
     6505–6513.