Causal Explanations for Performance in Radio Networks

Domokos M. Kelen1, Péter Kersch2 and András A. Benczúr1
1 Institute for Computer Science and Control (SZTAKI), 1111 Budapest, Kende u. 13-17, Hungary
2 Ericsson, 1117 Budapest, Magyar Tudósok Körútja 11, Hungary

Abstract
Towards machine learning (ML) for automated radio network control, we demonstrate how Explainable AI (XAI) can be used to explain and distinguish the effect of different factors on network performance KPIs. Additive local feature attribution methods like SHAP promise a model-agnostic way to gain insight into an ML model. However, there are many variations of these methods, including marginal and conditional SHAP, as well as ways to calculate attributions that respect the causal structure of the data, such as Asymmetric SHAP. We describe two new approaches to calculating attributions that follow causal relationships more closely. One approach is based on marginal explanations of a model trained in a specific way, while the other is based on calculating the equivalent of conditional Asymmetric SHAP by training multiple models. We demonstrate the approaches on both synthetic and real-world data.

Keywords
Explainable AI, Radio networks, SHAP, causality

1. Introduction

Using AI for automated radio network control is a fundamental challenge for future generation radio networks. Towards automation, a promising direction is Explainable AI (XAI), whose goal is to investigate tools and techniques aimed at opening the so-called opaque (or black-box) models (e.g., deep neural networks, DNN) or at devising intrinsically interpretable and accurate models (e.g., rule-based systems). More expressive models such as DNNs or gradient boosted trees can fit the observations better, but they are also more complex themselves. While more expressive models promise better insights, their complexity makes them more opaque, making it harder to infer useful information about the system.

In this paper, we propose an XAI method to explain and distinguish the effect of different factors on network performance metrics. The main technical difficulty is the complex causal relationship between these factors. For example, antenna tilt configuration affects network performance indirectly via coverage, interference and cell load. These indirect metrics have a more direct effect on network performance, hence XAI will primarily find coverage, interference, and load important and explain the predicted performance metric based on them, mostly ignoring how the network configuration affects this performance metric.

AI6G'22: First International Workshop on Artificial Intelligence in beyond 5G and 6G Wireless Networks, July 21, 2022, Padua, Italy
kdomokos@sztaki.hu (D. M. Kelen); peter.kersch@ericsson.com (P. Kersch); benczur@sztaki.hu (A. A. Benczúr)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Local feature attribution methods based on Shapley values, such as SHAP [1], provide us with a tool to gain insight into the behavior of any model in a black-box manner. Their recent popularity illustrates their usefulness in many domains, explaining predictions in terms of contributions provided by the knowledge of individual feature values.
However, there are many variations of such methods, and it is important to understand the advantages and disadvantages of conditional, marginal, and other SHAP attributions [2]. Further, it can be challenging to make sense of SHAP values in a causal setting [3].

In this paper, we describe our use of local feature attributions for explaining radio network performance. Radio networks used in telecommunications are complex systems with many different, inter-related variables and various causal structures between these variables. Our primary goal is root cause analysis for specific behavior, and our approach is to model the overall behavior of the system and investigate the model using explanation methods. It is, however, very important that the explanations take the causal relationships between the model features into consideration, as failing to do so could easily cause misattributions in our analysis.

Reliably inferring causal relationships from observational data is generally considered to be impossible [4]. Rather than inferring the relationships, causal attribution methods [3, 5] assume that the nature of the causal relationships is known from domain knowledge and attempt to calculate attributions that respect these relationships. One way to do this is through the use of Asymmetric SHAP [3], a variation of SHAP that is better equipped to deal with causality. In our use case, we measure the contributions of feature groups, while assuming the causal structure between the groups to be known.

An important distinction between different Shapley-value-based methods is whether they use a conditional or a marginal value function for evaluating the model. A conditional value function is a way to explain the data, while a marginal value function results in explanations that reflect the behavior of the model more closely [2]. Our aim is to gain insight into the behavior of the underlying system through examining the data; modeling and model explanation are only used as tools towards this end. It thus seems appropriate to use conditional SHAP; however, as it turns out, we can also use marginal SHAP as a tool to force the explanations to better reflect the causal structure of the data.

The rest of this paper is structured as follows. In Section 2, we describe the mathematical background of SHAP and related terms. In Section 3, we describe two ways of calculating causal attributions for our use case: one based on marginal explanations of a causally-trained model, and one based on calculating the equivalent of conditional SHAP by training multiple models. In Section 4, we describe the results of our experiments on simulated and real-world data to illustrate our methods. Finally, in Section 5, we share our conclusions.

2. Background

2.1. SHAP

The notion of Shapley values [6] originates from cooperative game theory and is a way to fairly calculate the contribution of individual players towards a shared objective. In this context, fairly is used in the sense that the provided values satisfy a number of conditions described in the original paper. The method is based on measuring the performance towards the objective with different coalitions of players participating, and combining these results in a certain way to calculate individual contributions. Using Shapley values to calculate additive local feature attributions for machine learning models has gained popularity in recent years.
SHAP [1] and related methods provide a compelling way to gain insight into the predictions made by any model, by quantifying how much the knowledge of each feature value contributed towards a specific prediction. Given a model $f$, the feature set $\mathcal{S}$, and a specific point $x$ from the dataset, the Shapley value $\varphi$ for feature $j$ can be defined as

$$\varphi_j(f, x) = \sum_{S \subseteq \mathcal{S} \setminus \{j\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \, \phi_j(f, x, S), \quad \text{where} \tag{1}$$

$$\phi_j(f, x, S) = v_{f(x)}(S \cup \{j\}) - v_{f(x)}(S). \tag{2}$$

Here $v_{f(x)}(S)$ is called the value function, and represents the function of evaluating the model $f$ at point $x$ while only using the coalition $S$ of features. It is, however, not evident how we can evaluate a model with coalitions of varying sizes, when the model was trained using a fixed number of features. Note that an alternative but equivalent formulation of Equation 1 is

$$\varphi_j(f, x) = \frac{1}{M!} \sum_{\pi \in \Pi} \Big[ v_{f(x)}(\{i : \pi(i) \le \pi(j)\}) - v_{f(x)}(\{i : \pi(i) < \pi(j)\}) \Big], \tag{3}$$

where $\Pi$ represents the set of all permutations of the ordering of model features.

We describe two ways of defining the value function $v_{f(x)}(S)$, corresponding to the marginal and conditional variants of SHAP values. In the marginal or interventional SHAP variant,

$$v_{f(x)}(S) = E_{p(x')} \big[ f(x_S \sqcup x'_{\bar{S}}) \big], \tag{4}$$

where the expression $f(x_S \sqcup x'_{\bar{S}})$ represents the value of evaluating the function $f$ with the in-coalition feature values $s \in S$ taken from $x$, and the out-of-coalition feature values taken from an identically distributed random $x'$. In contrast, in the conditional or observational variant,

$$v_{f(x)}(S) = E_{p(x' \mid x'_S = x_S)} \big[ f(x_S \sqcup x'_{\bar{S}}) \big], \tag{5}$$

where the distribution of $x'$ is conditioned on the known feature values $S$ from $x$. The difference between these two variants is whether the expectation taken over the out-of-coalition features is conditioned on the values of the in-coalition features.

Marginal SHAP values can be estimated by sampling out-of-coalition values from the training data for evaluating the value function, while conditional sampling is much harder to do and can itself become a separate modeling task [2]. However, an even more important distinction is whether the resulting explanation more strictly adheres to the model or to the data. Marginal SHAP explanations reflect the behavior of the model, meaning that the explanations respect whatever rules the model infers from the data. If there are multiple equivalent ways in which the prediction can be calculated, then the explanations reflect whichever way the model itself uses. A common criticism of marginal SHAP is that the evaluation can also include invalid data points, where combining the in-coalition and the sampled out-of-coalition features results in a data point that is unrealistic in the real world. On the other hand, conditional SHAP explanations reflect the dataset in that they do not depend on which equivalent formulation of the prediction the model uses. As a simple example, if the dataset contains two features that are identical, then the model is free to choose either of these (or any convex combination of them) for the prediction formula. Marginal SHAP attributions will mirror the model's choice, while conditional SHAP attributions will share the contribution equally between these features regardless of the model itself. This is often cited as a drawback of conditional SHAP, as it means that the explanation can reveal associations that are not present in the model itself.
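To make the distinction concrete, the following minimal sketch estimates the marginal value function of Equation 4 by Monte Carlo sampling from a background dataset. The names `f`, `x`, and `X_background` are our own illustrative assumptions rather than the interface of any particular SHAP implementation.

```python
import numpy as np

def marginal_value(f, x, coalition, X_background, n_samples=100, seed=0):
    """Estimate v_{f(x)}(S) = E_{p(x')}[ f(x_S combined with x'_outside_S) ].

    f            -- vectorized prediction function, e.g. model.predict
    x            -- the data point being explained (1-d array)
    coalition    -- iterable of in-coalition feature indices (the set S)
    X_background -- 2-d array of background data used to sample x'
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, X_background.shape[0], size=n_samples)
    samples = X_background[idx].copy()   # x' drawn from the data distribution
    cols = list(coalition)
    samples[:, cols] = x[cols]           # overwrite in-coalition values with x_S
    return f(samples).mean()

# The term of Equation 2 for one coalition S and feature j would then be
#   marginal_value(f, x, set(S) | {j}, X_bg) - marginal_value(f, x, S, X_bg).
# A conditional value function would instead require sampling x' from
# p(x' | x'_S = x_S), which is itself a modeling task [2].
```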
2.2. Asymmetric SHAP

In [3], Asymmetric SHAP is described as an approach to handling causal relationships between features. The method is based on breaking one of the fundamental requirements of the original Shapley values, symmetry. This requirement guarantees that if two features behave exactly the same way w.r.t. the value function, then their assigned contributions must also be equal. The way this relates to causality is that if two such variables have a direct causal relationship, then it is reasonable to want the causing variable to get a higher share of the contribution. Asymmetric SHAP modifies Equation 3 to only include permutations where, for each causal relationship, the causing variable is ordered before the affected variable; or, more generally, it allows any weighting scheme over the permutations instead of uniform weights:

$$\varphi^{\omega}_j(f) = \sum_{\pi \in \Pi} \omega(\pi) \Big[ v_{f(x)}(\{i : \pi(i) \le \pi(j)\}) - v_{f(x)}(\{i : \pi(i) < \pi(j)\}) \Big]. \tag{6}$$

In a causal setting, the weighting scheme described above is suggested, meaning that we modify the weights such that $\omega(\pi) \propto 1$ if $\pi(i) < \pi(j)$ for each $i, j$ where $i$ is a causal ancestor of $j$, and $\omega(\pi) = 0$ otherwise. The idea behind ordering causing variables first is that these get the attribution for the change they cause in the expected value of the prediction, while the affected variables get the change that is caused by including them after the causing variables.

3. Proposed approaches

3.1. Causal graph of feature groups

Our approaches rely on grouping the variables and assuming that the causal relationships between the groups are known. From an evaluation perspective, we treat a feature group as a single, multi-dimensional feature, meaning that while calculating SHAP, either all features of a feature group are included in a coalition or none of them are. This way, we get an overall contribution for each group. The causal relations between the groups can be represented in a directed acyclic graph, where the nodes are the feature groups, and the arcs point from the causing group to the affected group. This way we can say that feature group $i$ is a causal ancestor of feature group $j$ if and only if there is a path from $i$ to $j$ in the graph. We can see an example of such a graph in Figure 1. A topological ordering of this graph has the property that each variable precedes its causal descendants. In fact, the set of permutations with nonzero weight in our Asymmetric SHAP calculation, as described in Section 2.2, is exactly the set of all possible topological orderings of this graph.
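To illustrate how the nonzero-weight permutations of Equation 6 coincide with the topological orderings of the causal graph, the sketch below enumerates those orderings by brute force and averages the per-ordering contributions. The graph encoding, the hypothetical edge list, and the `value_fn` callable are our own assumptions for illustration; brute-force enumeration is only feasible for a small number of feature groups, as in our use case.

```python
from itertools import permutations

def topological_orderings(nodes, edges):
    """All orderings of `nodes` in which every (cause, effect) edge goes left to right."""
    return [p for p in permutations(nodes)
            if all(p.index(cause) < p.index(effect) for cause, effect in edges)]

def asymmetric_shap(nodes, edges, value_fn):
    """Equation 6 with uniform weight on the admissible orderings:
    phi_j = mean over orderings of v(predecessors of j + {j}) - v(predecessors of j)."""
    orderings = topological_orderings(nodes, edges)
    phi = {j: 0.0 for j in nodes}
    for order in orderings:
        for pos, j in enumerate(order):
            before = frozenset(order[:pos])
            phi[j] += value_fn(before | {j}) - value_fn(before)
    return {j: phi[j] / len(orderings) for j in nodes}

# Hypothetical edge list roughly following Figure 1 (Cell configuration omitted,
# as it is not present in the dataset); it admits exactly two topological orderings.
nodes = ["TA distribution", "Cell load", "Interference", "Channel quality"]
edges = [("TA distribution", "Interference"), ("Cell load", "Interference"),
         ("Interference", "Channel quality")]
# attributions = asymmetric_shap(nodes, edges, value_fn=some_group_level_value_function)
```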
3.2. Explaining a model trained in causal order

One way to have the attributions respect causal relationships is to use marginal SHAP to explain a model that itself calculates the prediction in a way that respects causal relationships. Having assumed a causal order of features, as described in Section 3.1, we can attempt to force the model to prefer features that appear earlier in the causal ordering, in the hope that the resulting explanation also attributes the prediction accordingly. If multiple causal orderings are possible, then we calculate the average over all of them with equal weight.

In our modeling, we primarily use gradient boosted decision trees [7]. These are models that use a linear combination of decision trees for predicting the output. During the training process, the linear combination is built up gradually, adding more decision trees to the result in each step. We can attempt to force this process to yield a model that respects a causal ordering. We do this by restricting the decision trees to only include features from a set that we gradually increase according to the causal ordering throughout the training process. In our case, this is done by first using only features from the first group and training until convergence, then continuing with features from the first two groups, and so on. This way, the final model has access to all of the features, while also placing more emphasis on features that appear earlier in the causal ordering.

One possible failure mode of this method is that there is no guarantee that the final model will be different from one that was trained with all features available at the same time. To see this, assume that we have the model $f_1(X_1)$ after training only on the first feature $X_1$. In the next step, the modeling process can continue by adding $f_2(X_1, X_2) = -f_1(X_1) + f_2'(X_1, X_2)$, so that $f_1 + f_2 = f_2'$. In other words, it is possible to cancel out the previous modeling steps in later steps. We can attempt to use regularization (e.g., limiting capacity) to discourage the model from doing this; however, if the advantage of using $X_2$ is great enough, it can overpower the regularization.
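A minimal sketch of one possible realization of this staged training with LightGBM is shown below: each stage is restricted to the feature groups allowed so far, and the previous stages' predictions are passed on through `init_score`, so later stages continue boosting from where the earlier ones stopped. The function names, group layout, and hyperparameters are our own illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np
import lightgbm as lgb

def train_in_causal_order(X, y, groups, params=None, rounds_per_stage=200):
    """Train boosted trees stage by stage along a causal ordering of feature groups.

    X      -- 2-d numpy array of features
    y      -- target values
    groups -- list of lists of column indices, given in causal order
    """
    params = dict(params or {"objective": "regression", "verbosity": -1})
    init_score = np.zeros(len(y))
    stages, allowed = [], []
    for cols in groups:
        allowed = allowed + list(cols)      # grow the usable feature set causally
        dataset = lgb.Dataset(X[:, allowed], label=y, init_score=init_score)
        booster = lgb.train(params, dataset, num_boost_round=rounds_per_stage)
        init_score = init_score + booster.predict(X[:, allowed])
        stages.append((list(allowed), booster))
    return stages

def predict_staged(stages, X):
    """The final model is the sum of the stage models' outputs."""
    return sum(booster.predict(X[:, cols]) for cols, booster in stages)
```

Marginal SHAP of the combined predictor (here, `predict_staged`) would then serve as the explanation; as noted above, nothing in this scheme prevents a later stage from cancelling out an earlier one.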
3.3. Conditional Asymmetric SHAP using multiple models

In this section, we describe a way of using the modeling process already in place for conditioning on the in-coalition feature values. Similar ideas are also described in [8] and [9], where the training of separate models is proposed to calculate feature importances in regression problems in order to deal with multicollinearity.

Conditional SHAP values are hard to calculate, and trying to sample feature values according to their conditional distribution can become a separate modeling task in itself [2], as filtering the data to the subset that satisfies the condition can leave very few records, or in many cases could essentially equal the data point $x$ itself. Let us assume instead that we can train a separate model for each coalition, i.e., train $f_S(x)$ for each coalition $S$, where the model is trained using only the features in $S$. Using these, we can define

$$\varphi_j = \sum_{S \subseteq \mathcal{S} \setminus \{j\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left( f_{S \cup \{j\}}(x) - f_S(x) \right), \tag{7}$$

meaning that the value function is simply equal to $f_S(x)$ instead of Equation 5. We give a heuristic argument that these two calculations approximate the same value. When modeling a regression task, we are in fact approximating an ideal function

$$f(x) \approx f^*(x) = E_{p(y \mid x)}[y], \tag{8}$$

with $y$ being the target variable. The SHAP value $\varphi_j(f)$ can then itself be viewed as an approximation of $\varphi_j(f^*)$, the feature contributions calculated for the ideal function $f^*$. In this sense, Equation 7 is just another way of approximating the same value that conditional SHAP estimates. This can be seen by observing the following about $f_S(x)$ and $v_{f(x)}(S)$:

$$f_S(x) \approx f_S^*(x) = E_{p(y \mid x_S)}[y] \quad \text{and} \tag{9}$$

$$v_{f(x)}(S) = E_{p(x' \mid x_S)}\big[f(x_S \sqcup x'_{\bar S})\big] \approx E_{p(x' \mid x_S)}\big[f^*(x_S \sqcup x'_{\bar S})\big] = E_{p(x' \mid x_S)}\Big[E_{p(y \mid x_S, x'_{\bar S})}[y]\Big] = E_{p(y \mid x_S)}[y]. \tag{10}$$

In Equations 9-10 we use Equation 8, while in Equation 10 we also use the law of total expectation. While this way we have to train a separate model for each coalition, the problem of conditioning part of the feature set on a given coalition is solved by reusing the same modeling process that we already have in place. Since we already trust this process to deliver insights about the dataset, this saves us from dealing with the conditioning as a separate sampling or modeling step.

Fortunately, Asymmetric SHAP greatly reduces the number of coalitions that are of interest, as we only need to perform the evaluation for the starting subsets of the causal orderings of features, which makes this method computationally feasible in our case. Given a single causal ordering, the contribution of a specific feature is given by the change in prediction when the feature is first included in the set of features available to the model, as formalized by Equation 6. Similar to regular SHAP, we start by calculating $\varphi_0 = E(y)$, which can be interpreted as the best possible model using the empty coalition of features.

4. Experiments

We run two kinds of experiments. The first one is a synthetic task where the variables and functions involved are simple enough that we know the exact result to expect from the explanation methods. The second is an example with real-world data. We report global feature importance. Since simply averaging contributions over all records results in values with an expected value of zero, importance is calculated by taking the average of the absolute values of the contributions. One unfortunate consequence of using this metric is that the importances do not sum to a fixed value. In our tables, Causal-order training refers to the approach described in Section 3.2, while Separate models refers to the approach described in Section 3.3. We denote the importance of variable $X$ by $\Phi_X$.

4.1. Synthetic data experiment

Here we demonstrate our methods on synthetic data. We start by defining five variables. Let

$$A, B, C \sim \mathcal{U}_{[0,1]}, \quad D = A + B + C, \quad \text{and} \quad Y = 2(A + B + C) + N, \tag{11}$$

where $\mathcal{U}_{[0,1]}$ denotes the uniform distribution over the $[0, 1]$ interval, and $N \sim \mathcal{N}(0, 0.1)$ is a normally distributed noise variable. We treat each variable as its own separate group, and assume the causal order $[A, B, C, D]$. The variable $Y$ is the target variable. It is clear that the target variable can be approximated from either the variables $A, B, C$ or the variable $D$, with the latter having a more direct relationship with the target.

Let us assume that the modeling process can learn the relationship between the features and the target exactly. We can express the expected feature importance analytically for each of the cases where either $A, B, C$ or $D$ gets the full contribution for calculating the prediction:

$$\Phi_A = \Phi_B = \Phi_C = 2 \int_0^1 \big|a - E(A)\big| \, \mathrm{d}a = 0.5, \tag{12}$$

$$\Phi_D = 2 \int_0^1 \int_0^1 \int_0^1 \big|a + b + c - E(A + B + C)\big| \, \mathrm{d}a \, \mathrm{d}b \, \mathrm{d}c \approx 0.8194. \tag{13}$$

This essentially means that in the case of marginal SHAP, on one end we get $\Phi_A = \Phi_B = \Phi_C = 0.5$ and $\Phi_D = 0$, while on the other end we get $\Phi_A = \Phi_B = \Phi_C = 0$ and $\Phi_D \approx 0.82$. Any convex combination of these two cases is possible as an actual modeling result. Conditional SHAP is expected to distribute the contributions uniformly, meaning a feature importance of $\Phi_A = \Phi_B = \Phi_C = 0.25$ and $\Phi_D \approx 0.41$.

We model on a training set of $10^4$ samples, and smaller evaluation and testing sets of $10^3$ samples each. We use gradient boosted decision tree models as implemented by the Python package LightGBM [7]. Measured feature contributions are reported in Table 1, for a regularly trained model as explained by TreeSHAP [10], as well as for our proposed approaches.

Table 1: Average absolute feature attributions made by different methods on synthetic data.

                          Expected value   Φ_A      Φ_B      Φ_C      Φ_D
  Regular model           3.0073           0.0029   0.0023   0.0027   0.7928
  Causal-order training   3.0070           0.5062   0.4973   0.5052   0.0003
  Separate models         3.0103           0.4829   0.4824   0.5116   0.0000

As we can observe, both of our proposed approaches are able to place the feature importance on the features that appear earlier in the causal ordering.
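For reference, the following sketch puts the synthetic setup of this section together with the separate-models attribution of Section 3.3 along the assumed causal ordering [A, B, C, D]. The sample size, hyperparameters, single ordering, and the choice of evaluating on the training set are simplifying assumptions of ours, so the resulting numbers will only approximate those in Table 1.

```python
import numpy as np
import lightgbm as lgb

# Generate the synthetic variables of Equation 11.
rng = np.random.default_rng(0)
n = 10_000
A, B, C = rng.uniform(0, 1, (3, n))
D = A + B + C
Y = 2 * (A + B + C) + rng.normal(0, 0.1, n)
X = np.column_stack([A, B, C, D])
features = ["A", "B", "C", "D"]

# One model per starting subset (prefix) of the causal ordering, as in Section 3.3.
params = {"objective": "regression", "verbosity": -1}
prefix_preds = [np.full(n, Y.mean())]   # empty coalition: phi_0 = E(y)
for k in range(1, len(features) + 1):
    booster = lgb.train(params, lgb.Dataset(X[:, :k], label=Y), num_boost_round=200)
    prefix_preds.append(booster.predict(X[:, :k]))

# Contribution of the k-th feature = change in prediction when it joins the coalition;
# global importance = average absolute contribution, as reported in Table 1.
for k, name in enumerate(features, start=1):
    contribution = prefix_preds[k] - prefix_preds[k - 1]
    print(name, np.mean(np.abs(contribution)))
```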
4.2. Real-world data experiment

We demonstrate our method on a proprietary real-world dataset of mobile telecommunications, consisting of performance management (PM) data from radio access network cells with 15-minute granularity. The model output is average downlink cell throughput; the model's explainers are then used to automatically determine the root cause of throughput degradations. The input features of the model are described in Table 2, while Figure 1 illustrates the causal relationships between these groups. The figure also includes a fifth group, Cell configuration. Features of this group are not present in the dataset, but they directly or indirectly affect all other features. This has the side effect that while the groups TA distribution and Cell load can be assumed independent given Cell configuration, the latter not being present in the dataset makes them no longer independent. The resulting interactions should be distributed equally between these two groups.

Table 2: Feature groups used in our real-world dataset.

  TA distribution:  Timing Advance distribution of mobile terminals. It is derived from radio propagation delay measurements between the mobile terminal and base stations and can be used to estimate the distance from the base station. We normalize this distance with the cell range, hence these features capture the cell edge versus cell center distribution of mobile terminals.
  Cell load:        Various PM metrics describing user plane and signaling load of a given cell.
  Interference:     Measured uplink interference distribution in a given cell (downlink interference is unfortunately not available in our dataset).
  Channel quality:  Various downlink channel quality metrics including CQI and rank distributions.

Figure 1: Causal structure of the different feature groups in our use case (feature groups: Cell configuration, TA distribution, Cell load, Interference, Channel quality; target: throughput).

There are only two possible topological orderings of these feature groups. In Table 3 we report feature importances according to both orderings separately, as well as the final feature importances averaged over the orderings.

Table 3: Average absolute feature attributions made by different methods on real-world data.

                                       TA distribution   Cell load   Interference   Channel quality
  Importances when Cell load ordered before TA distribution
  Causal-order training (Sec. 3.2)     0.2093            3.6073      0.7939         2.9326
  Separate models (Sec. 3.3)           0.4714            4.4750      0.9445         2.9343
  Importances when TA distribution ordered before Cell load
  Causal-order training (Sec. 3.2)     0.4317            3.4371      0.7808         2.9383
  Separate models (Sec. 3.3)           1.3429            4.1927      0.9344         2.9463
  Importances when contributions averaged over causal orderings
  Regular model                        0.1658            3.6426      0.8157         3.1973
  Causal-order training (Sec. 3.2)     0.3205            3.5222      0.7874         2.9354
  Separate models (Sec. 3.3)           0.9072            4.3339      0.9395         2.9403

The methods clearly have a similar effect to what we observed in our synthetic experiments, with features that appear earlier in the causal order getting increased importance. The ordering of the first two groups has a negligible effect on the later groups, which suggests the possibility of not having to recompute these values when dealing with multiple orderings. We can observe a substantial difference between the results of the approaches described in Sections 3.2 and 3.3: while both place increased importance on the group TA distribution, the latter places a much higher importance on it.
We attribute this to the possibility noted in Section 3.2: while the model is encouraged to use features that appear earlier in the causal ordering, it is not forced to do so. No such problem is present in the case of the other approach.

5. Conclusions and Acknowledgments

In this paper, we described two approaches to calculating causality-respecting additive local feature attributions for explaining the cause of radio network performance issues. We demonstrated both approaches on synthetic and real-world datasets and found them effective in placing more feature importance on features that appear earlier in a causal ordering.

This work has been partially funded by the Hungarian Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program.

References

[1] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).
[2] H. Chen, J. D. Janizek, S. M. Lundberg, S.-I. Lee, True to the model or true to the data?, CoRR abs/2006.16234 (2020). arXiv:2006.16234.
[3] C. Frye, C. Rowat, I. Feige, Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability, Advances in Neural Information Processing Systems 33 (2020) 1229–1239.
[4] C. Winship, S. L. Morgan, The estimation of causal effects from observational data, Annual Review of Sociology 25 (1999) 659–706.
[5] T. Heskes, E. Sijben, I. G. Bucur, T. Claassen, Causal Shapley values: Exploiting causal knowledge to explain individual predictions of complex models, Advances in Neural Information Processing Systems 33 (2020) 4778–4789.
[6] L. S. Shapley, A Value for N-Person Games, RAND Corporation, Santa Monica, CA, 1952. doi:10.7249/P0295.
[7] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems 30 (2017).
[8] S. Lipovetsky, M. Conklin, Analysis of regression in game theory approach, Applied Stochastic Models in Business and Industry 17 (2001) 319–330.
[9] E. Štrumbelj, I. Kononenko, M. R. Šikonja, Explaining instance classifications with interactions of subsets of feature values, Data & Knowledge Engineering 68 (2009) 886–904.
[10] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, S.-I. Lee, From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence 2 (2020) 56–67.