     LionForests: Local Interpretation of Random Forests
                         Ioannis Mollas, Nick Bassiliades, Ioannis Vlahavas, Grigorios Tsoumakas1


Abstract. Towards a future where ML systems will integrate into every aspect of people's lives, researching methods to interpret such systems is necessary, instead of focusing exclusively on enhancing their performance. Enriching the trust between these systems and people will accelerate this integration process. Many medical and retail banking/finance applications use state-of-the-art ML techniques to predict certain aspects of new instances. Thus, explainability is a key requirement for human-centred AI approaches. Tree ensembles, like random forests, are widely accepted solutions for these tasks, while at the same time they are avoided due to their black-box, uninterpretable nature, creating an unreasonable paradox. In this paper, we provide a methodology for shedding light on the predictions of the misjudged family of tree ensemble algorithms. Using classic unsupervised learning techniques and an enhanced similarity metric, to wander among transparent trees inside a forest following breadcrumbs, the interpretable essence of tree ensembles arises. An interpretation provided by these systems using our approach, which we call "LionForests", can be a simple, comprehensive rule.

1   INTRODUCTION

Machine learning (ML) models are becoming pervasive in our society and everyday life. Such models may contain errors, or may be subject to manipulation from an adversary. In addition, they may be mirroring the biases that exist in the data from which they were induced. For example, Apple's new credit card was recently investigated over claims it gives women lower credit [16], IBM Watson Health was accused of suggesting unsafe treatments for patients [7], and the state-of-the-art object detection model YOLOv2 is easily tricked by specially designed patches [10, 41]. Being able to understand how an ML model operates and why it predicts a particular outcome is important for engineering safe and unbiased intelligent systems.

Unfortunately, many families of highly accurate (and thus popular) models, such as deep neural networks and tree ensembles, are opaque: humans cannot understand the inner workings of such models and/or the reasons underpinning their predictions. This has recently motivated the development of a large body of research on interpretable ML (IML), concerned with the interpretation of black box models [1, 13, 14, 21, 23, 29, 39].

Methods for interpreting ML models are categorised, among other dimensions [21], into global ones that uncover the whole logic and structure of a model and local ones that aim to interpret a single prediction, such as "Why does this patient have to be immediately hospitalised?". This work focuses on the latter category. Besides their utility in uncovering errors and biases, local interpretation methods are in certain domains a prerequisite due to legal frameworks, such as the General Data Protection Regulation (GDPR) [36] of the EU and the Equal Credit Opportunity Act of the US².

Another important dimension, by which IML methods can be categorised, concerns the type of ML model that they interpret [1]. Model-agnostic methods [31, 37, 38] can be applied to any type of model, while model-specific methods [4, 30, 33, 34] are engineered for a specific type of model. Methods of the former category have wider applicability, but they only approximately explain the models they are applied to [1]. This work focuses on the latter category of methods, proposing a technique specified for tree ensembles [6, 8], which are very effective in several applications involving tabular and time-series data [45].

Past work on model-specific interpretation techniques for tree ensembles is limited [34, 43]. iForest [43], a global and local interpretation system for random forests (RF), provides insights for a decision through a visualisation tool. Notwithstanding, such a visual explanation tool is very complex for non-expert users, while at the same time it requires user interaction in order to construct the local interpretations. Another instance-level interpretation technique for RF [34] produces an interpretation in the form of a list of features with their ranges, accompanied by an influence metric. If the list of features is extensive, and the ranges are very narrow, the interpretation can be considered unreliable and untrustworthy, because small changes in the features will render the interpretation useless. Finally, both methods do not handle categorical data appropriately.

To address the above problems, we introduce a local-based, model-specific approach for interpreting an individual prediction of an RF through a single rule, in natural language form. The ultimate goal is mainly to reduce the number of features and secondly to broaden the feature-ranges, producing more robust, indisputable and intuitive interpretations. Additionally, the categorical features are handled properly, providing intelligible information about them throughout the rules. The constructed rule will be presented as the interpretation. We call this technique "LionForests" (Local Interpretation Of raNdom FORESTS) and we use its path and feature selection ability, based on unsupervised techniques like association rules [2] and k-medoids clustering [27], to process the interpretations in order to make them more comprehensible.

The rest of this work is structured in the following way. First, Section 2 introduces the related work, presenting interpretation techniques applicable to RF models. In Section 3, we present the methodology of our approach to the interpretation of RF models. In Section 4, we perform the quantitative and qualitative analysis. Finally, we conclude and present possible future directions of our strategy in Section 5.

1 Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece, email: {iamollas,nbassili,vlahavas,greg}@csd.auth.gr
2 ECOA 15 U.S. Code §1691 et seq.: https://www.law.cornell.edu/uscode/text/15/1691

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2   RELATED WORK

A number of works concerning the tree ensemble interpretation problem offer either model-agnostic or model-specific solutions, with a global or local scope, as presented with a similar taxonomy in a recent survey [22].

A family of model-agnostic interpretation techniques for black-box models, including tree ensembles, concerns the efficient calculation of feature importance. These are variations of feature permutation methods [17], partial dependence plots [26] and individual conditional expectation [18], which are global-based. SHAP [31] is an alternative method to compute feature importance for both global and local aspects of any black-box model.

Specifically, on tree ensembles, the most common techniques include the processes of extracting, measuring, pruning and selecting rules from the trees to compute the feature importance [11, 40]. A technique highly studied by many researchers [11, 12, 25, 44] attempts to globally interpret tree ensembles using single-tree approximations. But this method, as its name implies, approximates the performance of the model it seeks to explain. Thus, this approach is extremely problematic and criticised, because it is not feasible to summarise a complex model like a tree ensemble to a single tree [20]. An additional approach on interpreting tree ensembles focuses on clustering the trees of an ensemble using a tree dissimilarity metric [9].

Figure 1. A sample example from the visualisation tool of iForest adapted from [43]

iForest [43], a model-specific global and local approach, utilises a path distance metric to project an instance's paths to a two-dimensional space via t-Distributed Stochastic Neighbour Embedding (t-SNE) [32]. The path distance metric they propose considers two paths distant in two cases: a) if a feature exists in only one out of the two paths, the distance between those two paths increases, b) if a feature exists in both paths, the distance increases according to the non-common ranges of the feature on the paths divided by half. The total distance, which is the aggregation of those cases for all the features, is finally divided by the total number of features appearing in at least one out of the two paths. Besides the projection of the paths (Figure 1a), they provide a feature summary (Figure 1b) and a decision path flow (Figure 1c), which is a paths overview. In the feature summary, a stacked area plot visualises every path's range for a specific feature, while the decision path flow plot visualises the paths themselves. However, they do not provide this information automatically. The user has to draw a lasso (Figure 1f) around some points-paths in the paths projection plot in order to get the feature summary and paths overview. But requiring the user to select the appropriate paths is critical, simply because the user can easily choose wrong paths, a small set of paths, or even paths entirely different to the paths responsible for the prediction. That may lead to an incorrect feature summary and paths overview, and thus to a faulty interpretation.

Lastly, one technique [34] interprets tree ensembles at the instance level (local-based technique), providing as interpretation a set of features with their ranges, ranked based on their contribution (see Figure 2). Thus, the interpretation process consists of two parts. Firstly, they calculate the influence of a feature by monitoring the changes of the activated nodes for a specific instance's prediction. This influence is later used for the ranking process. The second step is to find the narrowest range across all trees for every feature. However, they do not attempt to expand these ranges, while they also claim that their influence metric assigns zero influence to some features, and, by extension, removing them could offer more compact explanations. In spite of this, by keeping only features with a non-zero influence, they do not know that these features will be present in at least half plus one paths to preserve the same prediction of an instance. Finally, they do not manage categorical features properly.

Figure 2. A sample explanation from [34]

3   OUR APPROACH

Our objective is to provide local interpretation of RF binary classifiers. In RF, a set of techniques, like data and feature sampling, is used in order to train a collection of T weak trees. Then, these trained trees vote for an instance's prediction:

    h(x_i) = (1/|T|) Σ_{t∈T} h_t(x_i)    (1)

where h_t(x_i) is the vote cast by the tree t ∈ T for the instance x_i ∈ X, representing the probability P(C = c_j | X = x_i) of x_i being assigned to class c_j ∈ C = {0, 1}, thus

    h_t(x_i) = 1  if P(C = 1 | X = x_i) ≥ 0.5
    h_t(x_i) = 0  if P(C = 0 | X = x_i) ≥ 0.5.

Figure 3. A decision tree binary classifier with 4 features

Each decision tree t ∈ T is a directed graph, and by deconstructing its structure, we are able to derive a set P_t of paths from the root to the leaves. Therefore, every instance can be classified with one of these paths. A path p ∈ P_t is a conjunction of conditions, and the conditions are features and values with relations ≤ and >. For example, a path from the tree in Figure 3 could be the following:
'if f1 ≤ 0.45 and f3 > 0.9 then Decision A'. Thus, each path p is expressed as a set:

    p = {f_i ⋄ v_j | f_i ∈ F, v_j ∈ ℝ, ⋄ ∈ {≤, >}}    (2)
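Such paths can be read off a trained tree programmatically. The following is a minimal sketch (our own illustration, not the LionForests source code) showing how conjunctions of the form of Eq. 2 could be extracted from a fitted scikit-learn DecisionTreeClassifier; the helper name extract_paths and the (feature, relation, threshold) tuple representation are illustrative assumptions.

    # Illustrative sketch: enumerate root-to-leaf paths of a fitted scikit-learn
    # decision tree as lists of (feature_name, relation, threshold) conditions.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    def extract_paths(tree, feature_names):
        t = tree.tree_
        paths = []

        def recurse(node, conditions):
            if t.children_left[node] == -1:          # leaf node: path is complete
                paths.append(list(conditions))
                return
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            recurse(t.children_left[node], conditions + [(name, "<=", thr)])
            recurse(t.children_right[node], conditions + [(name, ">", thr)])

        recurse(0, [])
        return paths

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    for p in extract_paths(clf, ["f1", "f2", "f3", "f4"])[:3]:
        print(" and ".join(f"{f} {op} {v:.2f}" for f, op, v in p))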
We are presenting LionForests, a framework for interpreting RF models at the instance level. LionForests is a pipeline of actions: a) feature-ranges extraction, reduction through b1) association rules, b2) clustering and b3) random selection, c) categorical feature handling, and, finally, d) interpretation composition.

3.1   Feature-Ranges Extraction

Our approach delivers local explanations for RF. Assume we have an RF of N trees and an instance x, which is classified by the forest as c_j ∈ C = {0, 1}. We focus on the K ≥ N/2 trees of the forest that classify x as c_j. For each feature, we compute from each of the K trees in which it appears its value range, as imposed by the conditions involving this feature in the path from the root to the leaf. Figure 4 shows an example of these ranges for the feature 'variance', with range −1...0.1, from the Banknote dataset [15] for a particular RF of 50 trees and a particular instance whose value for the skew feature is −0.179. Figure 6 shows the corresponding stacked area plot. The highlighted (cyan/light grey) area represents the intersection of these ranges, which will always contain the instance's value for the specific feature. Moreover, no matter how much the feature value is going to change, as long as it stays within this intersection range, the instance will not follow a different decision path in these trees.

Figure 4. Bar plot of value ranges of the 'variance' feature from the Banknote dataset for the M = 26 trees of an RF of K = 42 trees predicting the majority class

Figure 6. Stacked area plot of value ranges of the 'variance' feature from the Banknote dataset for the M = 26 trees of an RF of K = 42 trees predicting the majority class

In order to give a brief example, we have the following three paths:

    p1: if f1 ≤ 0.6 and ... then Class_A
    p2: if f1 ≤ 0.6 and f1 > 0.469 and ... then Class_A
    p3: if f1 > 0 and f1 ≤ 1 and ... then Class_A

Figure 5. Bar plot of the simple example

The intersection of these three paths is the cyan/light grey area in Figure 5, from which we can infer that the f1 feature's range can be: 0.47 ≤ f1 ≤ 0.6. This intersection range will always contain the instance's value for the specific feature. Moreover, no matter how much the feature value is going to change, as long as it stays within this intersection range, the decision paths are not going to change. For example, if the value of the instance for the feature f1 was 0.5 and the value changes to 0.52, each tree will take its decision through the same path. Summarising the aforementioned, an interpretation can have this shape:

    'if 0.47 ≤ f1 ≤ 0.6 and . . . then Class_A'

But with as many paths as possible voting for class A, we face the following two issues:

1. A lot of paths can lead to an explanation with many features, and by extension to an unintelligible interpretation and a frustrated user.
2. A lot of paths will lead to a small, strict and very specific feature range. For example, the f1 instance's value was 0.5 and the intersection range of all paths for this feature occurs to be 0.47 ≤ f1 ≤ 0.6, while the feature range is [−1, 1]. A narrow range, like the aforementioned, would result in a negative impression of the model, which will be considered unstable and unreliable. A broader range, then, will be less refutable.

Consequently, we formulate the optimisation problem (Eq. 3) to minimise the number of features that satisfy the paths of a subset of the original trees, thereby retaining the same classification result as the original set of trees and keeping the size of the selected subset of trees equal to or greater than the quorum, in order to ensure the consistency with the results of the original RF model.

    minimise_{F′ ⊆ F}   |F′|
    subject to   p = {f_i ⋄ v_j | f_i ∈ F′}, p ∈ P_t ∀ t ∈ T′,
                 ⌊ (1/|T′|) Σ_{t∈T′} h_t(x_i) + 1/2 ⌋ = ⌊ (1/|T|) Σ_{t∈T} h_t(x_i) + 1/2 ⌋,
                 |T′| ≥ quorum    (3)

To give an example of the expression ⌊ (1/|T|) Σ_{t∈T} h_t(x_i) + 1/2 ⌋: when 70 out of |T| = 100 trees are voting for class 1, then we have ⌊ (1/100)·70 + 0.5 ⌋ = ⌊1.2⌋ = 1. On the other side, if 25 out of |T| = 100 trees vote for class 1 (the minority), then we have ⌊ (1/100)·25 + 0.5 ⌋ = ⌊0.75⌋ = 0. Therefore, we are aiming to find the smallest T′ ⊆ T which will produce the same classification as the original T trees.
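To make the range-intersection and floor-based quorum computations of Section 3.1 concrete, here is a small sketch (our own illustration, not the LionForests source); the (lower, upper) bound representation and the helper vote are illustrative assumptions, and the bounds reproduce the p1-p3 example above.

    import math

    # Each path constrains feature f1 to an interval; the global feature range is [-1, 1].
    path_bounds_f1 = [(-1.0, 0.6), (0.469, 0.6), (0.0, 1.0)]

    lower = max(lo for lo, _ in path_bounds_f1)
    upper = min(up for _, up in path_bounds_f1)
    print(f"intersection: {lower:.3f} <= f1 <= {upper:.3f}")   # 0.469 <= f1 <= 0.6

    # Floor-based voting of Eq. 3: the reduced tree subset T' must keep the same class.
    def vote(votes_for_class_1, n_trees):
        return math.floor(votes_for_class_1 / n_trees + 0.5)

    assert vote(70, 100) == 1    # majority for class 1 -> prediction 1
    assert vote(25, 100) == 0    # minority for class 1 -> prediction 0
    quorum = 100 // 2 + 1        # at least half plus one trees must be kept
    print("quorum:", quorum)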
3.2   Reduction through Association Rules

The first step of the reduction process begins by using association rules. Association rules [2] mining is an unsupervised technique, which is used as a tool to extract knowledge from large datasets and explore relations between attributes. In association rules, the attributes are called items I = {i1, i2, . . . , in}. Each dataset contains sets of items, called itemsets T = {t1, t2, . . . , tm}, where t_i ⊆ I. Using all possible items of a dataset, we can find all the rules X ⇒ Y, where X, Y ⊆ I. X is called the antecedent, while Y is called the consequent. In association rules, the goal is to calculate the support and confidence of each rule in order to find useful relations. A simple observation is that X is independent of Y when the confidence is critically low. Furthermore, we can say that an X with high support is probably very important.

But how can we use association rules in random forests? We are going to do this at the path level. The items I will contain the features F of our original dataset. The dataset T, which we are going to use to mine the association rules, will contain sets of features that represent each path: t_i = {i_j | i_j = f_j, f_j ⋄ v_k ∈ p_i, p_i ∈ P}. It is significant to mention that we keep only the presence of a feature in the path, and we discard its value v_k. Then, it is feasible to apply association rule techniques, like the Apriori algorithm [3].

The next step is to sort the association rules extracted by the Apriori algorithm based on the ascending confidence score of each rule. For the rule X ⇒ Y with the lowest confidence, we will take the X and add its items to the list of features. Afterwards, we calculate the number of paths containing conjunctions which are satisfied with the new feature set. If the number of paths is at least half plus one of the total number of trees, we have found the reduced set of paths. We have a quorum! Otherwise, we iterate and add more features from the antecedent of the following rule. By using this technique, we reduce the number of features, and we have the new feature set F′ ⊆ F. Reducing the features can lead to a reduced set of paths too, because paths containing conjunctions with the redundant features will no longer be valid. Thus, for every path p we have the following representation:

    p = {f_i ⋄ v_j | f_i ∈ F′, v_j ∈ ℝ, ⋄ ∈ {≤, >}}.    (4)

To illustrate this, for a toy dataset of four features F = [f1, f2, f3, f4] and an RF model with five estimators T = [t1, t2, t3, t4, t5], for every instance x, from each t_i ∈ T we can extract a path p_i. Supposing that for the instance x we have five paths:

    p1: if f1 and f2 and f4 then Class_A
    p2: if f1 and f3 and f4 then Class_A
    p3: if f1 and f2 and f4 then Class_A
    p4: if f3 and f4 then Class_A
    p5: if f4 then Class_A

Then, we can compute the association rules using Apriori. Our objective is to create a set of features F′ ⊂ F. We take the first rule, f4 ⇒ (f3, f1), the rule with the lowest confidence. This rule informs us that f4, which has the highest support value, exists in 80% of the paths without (f1, f3). Thus, the first thing we add to our feature list is the antecedent of this rule, f4. Having added the feature, we count how many paths can be satisfied with the features of F′ = [f4]. Only one path is valid (p5), and this is not enough because we need a quorum. Skipping all the association rules having the chosen features in their antecedents, the next rule we have is f1 ⇒ f3. f1 has a 0.6 support value, and the rule has 0.33 confidence. This means that in 66.6% of the paths containing f1, f3 is absent. We add f1 to the feature list and now F′ = [f1, f4]. With this feature list, only p5 is activated again. Hence, we need another feature. The next rule we have is f3 ⇒ (f1, f4), with 0.4 support of f3 and confidence 0.2. Adding f3, the paths p2, p4 and p5 are now valid.

In the aforementioned example, we managed to reduce the features from four to three and the paths from five to three as well. However, when applying this method to datasets with plenty of features and models with more estimators, the reduction effect can be observed more clearly. Section 4 explores that effect through a set of experiments.
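A minimal sketch of this path-level mining, using the toy paths above. We assume the mlxtend library for Apriori and rule extraction; the selection loop is a simplified illustration of the procedure described in this section, not the exact LionForests implementation.

    # Illustrative sketch (assumes mlxtend is installed); items are the features
    # present in each of the five toy paths.
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    paths = [{"f1", "f2", "f4"}, {"f1", "f3", "f4"}, {"f1", "f2", "f4"},
             {"f3", "f4"}, {"f4"}]
    features = ["f1", "f2", "f3", "f4"]
    quorum = len(paths) // 2 + 1

    # One-hot "transaction" table: True if the feature appears in the path.
    trans = pd.DataFrame([[f in p for f in features] for p in paths], columns=features)

    freq = apriori(trans, min_support=0.1, use_colnames=True)
    rules = association_rules(freq, metric="confidence", min_threshold=0.0)
    rules = rules.sort_values("confidence")          # lowest-confidence rules first

    selected, kept_paths = set(), []
    for antecedent in rules["antecedents"]:
        if antecedent <= selected:                   # skip rules already covered
            continue
        selected |= set(antecedent)
        kept_paths = [p for p in paths if p <= selected]
        if len(kept_paths) >= quorum:                # half plus one paths reached
            break

    print("reduced feature set:", sorted(selected))
    print("paths kept:", len(kept_paths))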
3.3   Reduction through Clustering and Random Selection

However, association rules may not be able to reduce the number of features and, consequently, the number of paths. In that case, a second reduction technique based on clustering is applied. Clustering is yet another group of unsupervised ML techniques, aside from association rules. k-medoids [5, 27] is a well-known clustering algorithm, which considers as a cluster's centre an existing element from the dataset. This element is called a medoid. k-medoids, like other clustering techniques, needs a distance or a dissimilarity metric to find the optimum clusters. Thus, performing clustering on paths requires a path-specific distance or dissimilarity metric.

We designed a path similarity metric in Algorithm 1. This similarity metric is close to the distance metric introduced in iForest [43], but eliminates some minor problems of the aforementioned. In this algorithm, if a feature is absent from both paths, the similarity of these paths increases by 1. When a feature is present in both paths, the similarity increases by a value between 0 and 1, which is the intersection of the two ranges normalised by the union of the two ranges. The similarity metric is biased towards the absence of a feature from both paths, because it always assigns one similarity point. However, this is a desirable property for our goal of minimising the feature set.

    Algorithm 1: Path similarity metric
      input : p_i, p_j, feature_names, min_max_feature_values
      return: similarity_ij
      s_ij ← 0
      for f in feature_names do
          if f in p_i and f in p_j then
              find l_i, u_i, l_j, u_j lower and upper bounds
              inter ← min(u_i, u_j) − max(l_i, l_j)
              union ← max(u_i, u_j) − min(l_i, l_j)
              if inter > 0 and union ≠ 0 then
                  s_ij ← s_ij + inter/union
              end
          else if f not in p_i and f not in p_j then
              s_ij ← s_ij + 1
          end
      end
      return s_ij / len(feature_names)
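A direct Python rendering of Algorithm 1, included only as an illustration; the dictionary-based path representation (feature name mapped to its (lower, upper) bounds) is our own assumption.

    # Illustrative Python version of Algorithm 1 (path similarity metric).
    # A path is assumed to be a dict: {feature_name: (lower_bound, upper_bound)}.
    def path_similarity(p_i, p_j, feature_names):
        s_ij = 0.0
        for f in feature_names:
            if f in p_i and f in p_j:
                l_i, u_i = p_i[f]
                l_j, u_j = p_j[f]
                inter = min(u_i, u_j) - max(l_i, l_j)      # overlap of the two ranges
                union = max(u_i, u_j) - min(l_i, l_j)      # span covered by both ranges
                if inter > 0 and union != 0:
                    s_ij += inter / union
            elif f not in p_i and f not in p_j:
                s_ij += 1                                   # bias towards absent features
        return s_ij / len(feature_names)

    # Example with the global feature range [-1, 1] used for unconstrained bounds.
    p1 = {"f1": (-1.0, 0.6)}
    p2 = {"f1": (0.469, 0.6), "f3": (-1.0, 0.9)}
    print(path_similarity(p1, p2, ["f1", "f2", "f3", "f4"]))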
In Algorithm 2, we calculate the k-medoids and their clusters using the similarity metric of Algorithm 1. Afterwards, we perform an ordering of the medoids based on the number of paths they cover in their clusters. Then, we collect paths from the larger clusters into a list, until we acquire at least a quorum. By collecting the larger clusters first, the possibility of feature reduction increases, because the paths inside a cluster tend to be more similar among them. Additionally, the biased similarity metric would cluster paths with fewer irrelevant features, leading to a subset of paths that are satisfied with a smaller set of features.

    Algorithm 2: Paths reduction through k-medoids clustering
      input : similarity_matrix, no_of_estimators, paths, no_of_medoids
      return: paths
      quorum ← no_of_estimators/2 + 1
      m ← kmedoids(similarity_matrix, no_of_medoids)
      sorted_m ← sort_by_key(m, descending = True)
      count ← 0, size ← 0, reduced_paths ← []
      while size < quorum and count < len(sorted_m) do
          for j in m[sorted_m[count]] do
              reduced_paths.append(paths[j])
          end
          count ← count + 1, size ← len(reduced_paths)
      end
      if size ≥ quorum then
          paths ← reduced_paths
      end
      return paths

Performing clustering does not guarantee feature reduction, but there is a probability of an unanticipated reduction of the feature set. This procedure attempts to minimise the number of paths to at least the quorum. Unlike the association rules method, which may not accomplish feature reduction, clustering will significantly reduce the number of paths. By the end of the reduction process through clustering, random selection is applied to the paths to obtain the acceptable minimum number of paths, in case the reduction via clustering has not reached the quorum.
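A sketch of this clustering step, assuming the KMedoids implementation from scikit-learn-extra; converting the similarity matrix of Algorithm 1 into a distance matrix and the exact cluster-collection loop are our own illustrative choices, following the description above.

    # Illustrative sketch of Algorithm 2 using scikit-learn-extra's KMedoids
    # (assumed dependency). similarities[i][j] comes from path_similarity above.
    import numpy as np
    from sklearn_extra.cluster import KMedoids

    def reduce_paths_by_clustering(similarities, paths, n_estimators, n_medoids=5):
        quorum = n_estimators // 2 + 1
        distances = 1.0 - np.asarray(similarities)          # similarity -> distance
        labels = KMedoids(n_clusters=n_medoids, metric="precomputed",
                          random_state=0).fit_predict(distances)

        # Group path indices per cluster and visit the largest clusters first.
        clusters = {c: np.where(labels == c)[0] for c in range(n_medoids)}
        ordered = sorted(clusters.values(), key=len, reverse=True)

        reduced = []
        for members in ordered:
            if len(reduced) >= quorum:
                break
            reduced.extend(paths[j] for j in members)

        # Keep the reduction only if it still covers the quorum of trees.
        return reduced if len(reduced) >= quorum else paths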
3.4   Handling Categorical Features

It is possible, even expected, to deal with a dataset containing categorical features. Of course, a transformation through OneHot or Ordinal [42] encoding is used to make good use of these data. This transformed type of information will then be acceptable by ML systems. But is there any harm to interpretability caused by the use of encoding methods? Sadly, yes! Using ordinal encoding will transform a feature like country=[GR, UK, . . . ] to country=[0, 1, 2, . . . ]. As a result, we lose the intelligibility of the feature. On the other hand, using OneHot encoding will dramatically increase the number of features, leading to over-length and incomprehensible interpretations, by transforming the feature country to country_GR=[0, 1], country_UK=[0, 1], and so forth. Since the encoding transformations are part of the feature engineering process and are not fixed, we cannot construct a fully automated process to inverse-transform the features into human-interpretable forms within the interpretations.

Nevertheless, LionForests provides two automated encoding processes, using either OneHot or Ordinal encoding, and their inverse transformation for the interpretation extraction. Feature-ranges of Ordinal encoded data transform like (1 ≤ country ≤ 2) → (country = [UK, FR]), while feature-ranges of OneHot encoded data transform like (0.5 ≤ country_UK ≤ 1) → (country = UK), and feature-ranges like (0 ≤ country_GR ≤ 0.49) are removed. The excluded OneHot encoded features for the categorical feature should appear to the user as possible alternative values. If a feature reduction method reduces one of the encoded OneHot features, it will not appear in the feature's list of alternative values, but in the list of values that do not influence the prediction. For this reason, the categorical features will appear in the interpretations with a notation 'c', like 'categorical_feature_c = value'. Depending on the application and the interpretation, the user will be able to request a list of alternative values or will be able to simply hover over the feature to reveal the list. Section 4.3.3 showcases transformations of OneHot encoded features, as well as one example of a OneHot encoded feature's alternative values list.
3.5   Interpretation Composition

These processes are part of the LionForests technique, which, in the end, produces an interpretation in the form of a feature-range rule. Lastly, LionForests combines the ranges of the features in the reduced feature set into a single natural language rule. The order of appearance of the feature ranges in the rule is determined by using a global interpretation method, such as the SHAP TreeExplainer [30] or the Scikit [35] RF model's built-in feature importance attribute, for smaller and larger datasets, respectively. One notable example of an interpretation is the following:

    'if 0 ≤ f1 ≤ 0.5 and −0.5 ≤ f3 ≤ 0.15 then class_A'.    (5)

We interpret this feature-range rule as follows: "As long as the value of f1 is between 0 and 0.5, and the value of f3 is between −0.5 and 0.15, the system will classify this instance to class A. If the value of f1, f3, or both, surpasses the limits of their ranges, then the prediction may change. Note that the features are ranked through their influence". This type of interpretation is comprehensible and human-readable. Thus, if we manage to keep such rules shorter, they could be an ideal way to explain an RF model. A way to counter an over-length rule could be to hide the last n feature-ranges, which will be the least important due to the ordering process. At the same time, users will have the ability to expand their rules to explore the feature-ranges. However, we do not completely exclude those features; we only hide them, because otherwise we would affect the correctness of both the explanation and the prediction. An example is showcased in Section 4.3.3.
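A final sketch of the composition step: ordering the surviving feature-ranges by a global importance score and rendering the natural-language rule. Using the Random Forest's built-in feature_importances_ here is one of the two options mentioned above; the formatting helper itself is our own illustrative assumption.

    # Illustrative composition of the final rule. `ranges` holds the intersected
    # bounds of the reduced feature set; importances come from a fitted forest
    # (model.feature_importances_) or a SHAP TreeExplainer, as described above.
    def compose_rule(ranges, importances, predicted_class):
        ordered = sorted(ranges, key=lambda f: importances.get(f, 0.0), reverse=True)
        conditions = [f"{lo:g} <= {f} <= {hi:g}" for f in ordered
                      for lo, hi in [ranges[f]]]
        return "if " + " and ".join(conditions) + f" then {predicted_class}"

    rule = compose_rule({"f1": (0.0, 0.5), "f3": (-0.5, 0.15)},
                        {"f1": 0.7, "f3": 0.2}, "class_A")
    print(rule)   # if 0 <= f1 <= 0.5 and -0.5 <= f3 <= 0.15 then class_A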
4   EXPERIMENTAL RESULTS

We first discuss the setup of our experiments. Then, we present quantitative results, followed by qualitative results for the explanation of particular instances.

4.1   Experimental Setup

The implementation of LionForests, as well as of all the experiments in this paper, is available in the LionLearn repository at GitHub³.

Our experiments were conducted on the following three tabular binary classification datasets: Banknote Authentication [15], Heart (Statlog) [15] and Adult Census [28]. The top part of Table 1 shows the number of instances and features of each dataset. Particularly, Adult Census contains 6 numerical and 8 categorical features. The eight categorical features are transformed through OneHot encoding into 74 numerical ones, resulting in a total of 80 features.

We used the RandomForestClassifier from Scikit-learn [35] and the MinMaxScaler with feature range [−1, 1]. In order to work with optimised models, a 10-fold cross-validation grid search was carried out on each dataset, using the following set of parameters and values: max_depth {1, 5, 7, 10}, max_features {'sqrt', 'log2', 75%, None⁴}, min_samples_leaf {1, 2, 5, 10, 10%}, bootstrap {True, False}, n_estimators {10, 100, 500, 1000}. The parameters' values that achieved the best F1 score, along with the score itself, are shown in the middle and bottom parts of Table 1, respectively.

3 https://github.com/intelligence-csd-auth-gr/LionLearn
4 None = all features
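A sketch of this setup for one dataset (our own condensed reconstruction of the described configuration, not the repository code); percentages are written as fractions, as scikit-learn expects.

    # Illustrative reconstruction of the described grid search for one dataset.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import MinMaxScaler

    param_grid = {
        "max_depth": [1, 5, 7, 10],
        "max_features": ["sqrt", "log2", 0.75, None],   # None = all features
        "min_samples_leaf": [1, 2, 5, 10, 0.1],
        "bootstrap": [True, False],
        "n_estimators": [10, 100, 500, 1000],
    }

    # X, y: a tabular binary classification dataset (e.g. Banknote Authentication).
    def tune(X, y):
        X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
        search = GridSearchCV(RandomForestClassifier(random_state=0),
                              param_grid, scoring="f1", cv=10, n_jobs=-1)
        search.fit(X, y)
        return search.best_estimator_, search.best_score_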
                        Banknote    Heart (Statlog)    Adult Census
    instances             1372            270             48842
    features                 4             13            14 (80)
    max depth               10              5                10
    max features           0.75          'sqrt'           'sqrt'
    min samples leaf          1              5                 1
    bootstrap              True           False             False
    estimators              500            500               100
    F1 %                  99.43          81.89             88.71

Table 1. For each of the three datasets: main statistics (top), optimal parameter values of the Random Forest (middle), corresponding F1 score (bottom)

4.2   Quantitative Results

For each dataset, we train one RF model, using all data, with the parameters shown in Table 1. Then we apply LionForests to all instances of each dataset and report the mean feature and path reduction in Table 2.

In terms of path reduction, we notice that among the three reduction techniques, random selection consistently leads to the best results across all datasets, achieving an average reduction of 45.69%. Clustering is the second-best method, with only slightly worse results in the first two datasets, but much worse results in the Adult Census dataset, achieving an average of 41.94% reduction. Association rules is the worst technique, achieving zero reduction in the Adult Census dataset and an average of 19.90% reduction. Combining random selection with other techniques does not lead to improved results.

With respect to feature reduction, we find that the association rules strategy leads to the best results in the first two datasets, reaching an average reduction of 26.04%. The other two techniques achieve negligible reduction in these two datasets. In Banknote Authentication, combining the three methods improves the reduction slightly, from 30.70% to 30.85%. In Adult Census, on the other hand, association rules achieve zero reduction of features, similarly to paths. The best result in this dataset is achieved by random selection, followed closely by clustering. Combining these two techniques improves the reduction slightly, to 10%.

The weak performance of association rules in Adult Census is related to the small number of estimators (100) and the huge number of features (80). In particular, each of the estimators can have a maximum of eight features, as resulted from the grid search (max features = 'sqrt'). Thus, the overlapping features among the different paths are fewer, and by extension, techniques relying on feature selection, like association rules, cannot perform the feature reduction, as well as the path reduction. However, LionForests is an ensemble of techniques, and we have revealed with these quantitative experiments the importance of each part. We may infer that the LionForests strategy is considerably effective in reducing both features and paths.

    Reduction Technique                              Banknote Authentication        Heart Disease                   Adult Census                    Mean Reduction %
    Assoc. Rules   Clustering   Random Based         Feature %       Path %          Feature %        Path %         Feature %       Path %          Feature    Path
    X              X            X                    30.85±5.1e−2    49.47±5.5e−15   21.37±0.0        43.41±0.0      10.03±1.5e−2    44.18±5.5e−15   20.75      45.69
    -              X            X                    2.15±1.8e−1     49.47±5.5e−15   8.6e−3±1.3e−2    43.41±0.0      10.03±1.5e−2    44.18±5.5e−15   4.06       45.69
    X              -            X                    30.70±0.0       49.47±5.5e−15   21.37±0.0        43.41±0.0      9.11±1.4e−2     44.18±5.5e−15   20.39      45.69
    X              X            -                    30.84±4.4e−2    48.56±2.4e−2    21.37±0.0        42.88±2.2e−2   7.82±1.4e−2     36.26±2.6e−2    20.01      42.57
    X              -            -                    30.70±0.0       27.02±0.0       21.37±0.0        32.67±0.0      0.0±0.0         0.0±0.0         17.36      19.90
    -              X            -                    2.12±1.3e−1     47.57±3.6e−2    5.7e−3±1.1e−2    41.98±7.4e−2   7.82±1.4e−2     36.26±2.6e−2    3.32       41.94
    -              -            X                    0.0±0.0         49.47±5.5e−15   2.9e−3±8.6e−3    43.41±0.0      9.11±1.4e−2     44.18±5.5e−15   3.04       45.69

Table 2. Feature and path reduction ratios (%) on the three datasets. X marks the applied techniques

4.3   Qualitative Results

The LionForests technique provides consistent and robust rules, which are more indisputable than other interpretations because they are more compact, have broader ranges for the features, and at the same time present categorical data in a human-comprehensible form. In this section, we provide an example from each dataset to demonstrate how RF models can be interpreted efficiently.

4.3.1   Banknote Authentication

We observe that the complete methodology, incorporating association rules, clustering and random selection reduction, achieves the highest performance on both feature and path reduction for the first dataset, Banknote Authentication. We also provide an example of a pair of explanations (1) without and (2) with LionForests:

1. 'if 2.4 ≤ variance ≤ 6.83 and −3.13 ≤ skew ≤ −2.30 and 1.82 ≤ curtosis ≤ 2.13 and −0.64 ≤ entropy ≤ 0.73 then fake banknote'
2. 'if 2.4 ≤ variance ≤ 6.83 and −1.60 ≤ curtosis ≤ 17.93 then fake banknote'

The reduced rule (2) is smaller than the original by two features. In addition, the "curtosis" feature has a wider range. The instance has a value of 1.92 for the "curtosis" feature, and in the original rule this value is marginal for the very narrow range 1.82 ≤ curtosis ≤ 2.13, indicating that a small change may lead to a different outcome; this is not the case for the reduced rule. In addition, changing the skew value from −2.64 to −4, which is outside the range of the feature in the original rule, will not change the prediction and will produce the same reduced-range rule. We observe the same result when we change the value of the "entropy" feature, as well as when we tweak both "skew" and "entropy".

4.3.2   Heart Disease

Again, with LionForests we achieve both higher feature and path reduction ratios. There are thirteen features included in this particular dataset and, as a result, the interpretations may have a total of thirteen features. In fact, in this case the reduction of features is essential to provide comprehensible interpretations. We choose an example, and we present the original rule (1) and the reduced rule (2):
1. ‘if 6.5 ≤ reversable def ect ≤ 7.0 and 3.5 ≤ chest pain ≤ 4.0                          Possible values of “native_country”, which they
   and 0.0 ≤ number of major vessels ≤ 0.5 and 1.55 ≤                                   may affect the prediction          preserve the prediction
   oldpeak ≤ 1.7 and 0.5 ≤ exercise induced angina ≤ 1.0                          ‘Mexico’, ‘United-States’, ‘Canada’,
                                                                                                                               ‘India’, ‘France’,
   and 128.005 ≤ maximum heart rate achieved ≤ 130.998                             ‘Philippines’, ‘England’, ‘Thailand’,
                                                                                                                            ‘El-Salvador’, ‘Iran’,
                                                                                ‘Japan’, ‘China’, ‘Dominican-Republic’,
   and 1.5 ≤ the slope of the peak exercise ≤ 2.5 and                                                                           ‘Cuba’, ‘Haiti’,
                                                                                ‘Germany’, ‘South’, ‘Columbia’, ‘Italy’,
   sexc = M ale and 184.999 ≤ serum cholestoral ≤                                ‘Puerto-Rico’, ‘Vietnam’, ‘Cambodia’,
                                                                                                                             ‘Guatemala’, ‘Peru’,
   199.496 and 29.002 ≤ age ≤ 41.497 and 0.0 ≤                                                                               ‘Trinidad&Tobago’,
                                                                                 ‘Ireland’, ‘Taiwan’, ‘Portugal’, ‘Laos’,
   resting electrocardiographic results           ≤     0.5 and                                                                   ‘Honduras’
                                                                                  ‘Yugoslavia’, ‘Nicaragua’, ‘Scotland’
   119.0 ≤ resting blood pressure ≤ 121.491 and
                                                                                Table 3. List of values affecting or not the classification of an instance
   0.0 ≤ f asting blood sugar ≤ 0.5 then presence’
2. ‘if 6.5 ≤ reversable def ect ≤ 7.0 and 3.5 ≤ chest pain ≤ 4.0
   and 0.0 ≤ number of major vessels ≤ 0.5 and 1.55 ≤                          optimal ones but not the optimal explanations, the attempt to indif-
   oldpeak ≤ 1.7 and 0.5 ≤ exercise induced angina ≤ 1.0                       ferently interpret each black-box model will not lead to the desired
   and 128.005 ≤ maximum heart rate achieved ≤ 133.494                         outcome. In this work, we introduced a model-specific local-based
   and 1.5 ≤ the slope of the peak exercise ≤ 2.5 and                          approach for obtaining real interpretations of random forests predic-
   184.999 ≤ serum cholestoral ≤ 199.496 and 119.0 ≤                           tions. Other works [34, 43] attempt to provide explanations of this
   resting blood pressure ≤ 121.491 then presence’                             form, but they do not try to make them more comprehensible, either
                                                                               indisputable. A user may not be familiar with iForest’s [43] visualisa-
   The reduced rule is shorter than the original rule by four features.        tion tool. Besides, an interpretation containing a lot of features with
In addition, the “maximum heart rate achieved” feature has a wider             narrow ranges [34] may lead to incomprehensible and untrustworthy
range in the reduced rule. Changing the “sex” value from ‘Male’ (1)            rules. The proposed technique, LionForests, will provide users with
to ‘Female’ (0) did not change the reduced rule at all. We tweak               small rules as interpretations in natural language, which by widen-
“age” value from 35 to 15 and again the reduced rule remains the               ing the feature ranges will be more reliable and trustworthy as well.
same. Thus, features like “age”, “sex”, “resting electrocardiographic          We use classic unsupervised methods such as association rules and
results” and “fast blood sugar”, cannot influence the prediction.              k-medoids clustering to achieve feature and path reduction.
                                                                                  Nevertheless, LionForests is not a complete solution either, since
                                                                               it is not appropriate for tasks of model inspection. For example, if
4.3.3   Adult Census                                                           a researcher is working to build a reliable and stable model aiming
Finally, we present a pair of explanations provided by LionForests             for the highest performance, a visualisation tool like iForest may be
for an instance of Adult Census, without (1) and with (2) reduction:           preferred. This approach is the best and easiest way of providing an
                                                                               interpretation to a non-expert user. Lastly, by reducing the number of
1. ‘if marital statusc = M arried and sexc = F emale and                       paths to the quorum to minimise the features and at the same time
   educationc = HS grad and workclassc = P rivate and                          to increase the features’ ranges, the outcome would be a discounted
   94721 ≤ f nlwgt ≤ 161182 and 47 ≤ age ≤ 53 and 15 ≤                         probability of the classification of the instance, which poses ques-
   hours per week ≤ 25 and native country c = Jamaica and                      tions about the prediction’s reliability. This can be counter-attacked
   [other 2 feature-ranges] then income >50K’                                  by introducing a threshold parameter to the reduction effect, requir-
2. ‘if marital statusc = M arried and sexc = F emale and                       ing the algorithm to retain at least a specific percentage of the paths.
   educationc = HS_grad and workclassc = P rivate and                             Future research will explore the impact of tuning parameters, such
   87337 ≤ f nlwgt ≤ 382719 and 47 ≤ age ≤ 63 and 15 ≤                         as the number of estimators or the maximum number of features, on
   hours per week ≤ 99 and native country c = Jamaica and                      the reduction of features and paths. We also aim to apply LionForests
   [other 2 feature-ranges] then income >50K’                                  to different tree ensembles, rather than random forests, as well as to
                                                                               various datasets and data types. FPGrowth [24], and its variant FP-
   The reduced rule is thirteen features smaller than the original rule. This is not directly obvious, because some of the one-hot encoded categorical features are not presented. For example, only the valid features, such as "marital_status_Married" and "education_HS_Grad", are presented, as described in Section 3.4. Furthermore, we observe that the ranges of "age", "fnlwgt" and "hours per week" are broader. Specifically, the "age" range increased from [47, 53] to [47, 63], while the "hours per week" range expanded from [15, 25] to [15, 99]. Moreover, we can explore the alternative values of the categorical feature "native country". In Table 3, the first list refers to the values that may change the prediction of the instance, while the second shows the values that cannot affect the prediction. In the original rule this type of information was not available, because the non-affecting values were also present.
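   To make the structure of these rules concrete, the following minimal sketch shows how such a conjunctive rule could be rendered from per-feature ranges and the active one-hot categorical values. The render_rule helper and its input structures are illustrative assumptions for exposition only, not part of the LionForests implementation; the values are copied from the reduced rule (2) above.

# Minimal illustrative sketch (not the LionForests implementation): rendering a
# conjunctive rule from numerical feature ranges and active one-hot categorical values.
def render_rule(ranges, categories, label):
    # ranges:     {feature: (low, high)} for numerical features
    # categories: {feature: value}, keeping only the active value of each one-hot feature
    # label:      the predicted class
    conjuncts = ["{} = {}".format(f, v) for f, v in categories.items()]
    conjuncts += ["{} <= {} <= {}".format(lo, f, hi) for f, (lo, hi) in ranges.items()]
    return "if " + " and ".join(conjuncts) + " then " + label

# Values copied from the reduced Adult Census rule (2).
categories = {"marital_status": "Married", "sex": "Female", "education": "HS_grad",
              "workclass": "Private", "native_country": "Jamaica"}
ranges = {"fnlwgt": (87337, 382719), "age": (47, 63), "hours_per_week": (15, 99)}
print(render_rule(ranges, categories, "income >50K"))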
5   CONCLUSION

Providing helpful explanations is a challenging task. Providing comprehensible and undeniable explanations is even tougher. Since model-agnostic approaches produce approximations, i.e. the nearest […]

   Nevertheless, LionForests is not a complete solution either, since it is not appropriate for model inspection tasks. For example, if a researcher is working to build a reliable and stable model aiming for the highest performance, a visualisation tool like iForest may be preferred, whereas the rule-based approach of LionForests is the best and easiest way of providing an interpretation to a non-expert user. Lastly, by reducing the number of paths to the quorum in order to minimise the number of features and, at the same time, to widen the features' ranges, the outcome would be a lower probability for the classification of the instance, which raises questions about the prediction's reliability. This can be counteracted by introducing a threshold parameter on the reduction effect, requiring the algorithm to retain at least a specific percentage of the paths.
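   As a rough illustration of this idea (an assumption about a possible implementation, not the current algorithm), such a threshold could act as a guard that rejects any reduction falling below a minimum fraction of the paths that originally voted for the predicted class:

# Illustrative sketch of the proposed retention threshold; apply_reduction and
# min_retained are hypothetical names, not part of the current LionForests code.
def apply_reduction(paths, reduced_paths, min_retained=0.5):
    # Accept the reduced path set only if it keeps at least `min_retained`
    # of the paths that originally voted for the predicted class.
    if len(reduced_paths) >= min_retained * len(paths):
        return reduced_paths
    return paths  # otherwise fall back to the full set of paths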
   Future research will explore the impact of tuning parameters, such as the number of estimators or the maximum number of features, on the reduction of features and paths. We also aim to apply LionForests to tree ensembles other than random forests, as well as to various datasets and data types. FPGrowth [24] and its variant FPMax [19] will be tested against the Apriori algorithm. In addition, we will consider adapting LionForests to other tasks, such as multi-class and multi-label classification, as well as regression. We will also explore the possibility of using LionForests interpretations to provide descriptive narratives through counter-examples. Ultimately, by means of a qualitative, human-oriented analysis, we will try to demonstrate the intelligibility of this promising method and its necessity as a foundation for human-centred artificial intelligence systems based on interpretable ML methods.

ACKNOWLEDGEMENTS

This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825619, AI4EU Project (https://www.ai4eu.eu).
REFERENCES

[1] Amina Adadi and Mohammed Berrada, 'Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)', IEEE Access, 6, 52138–52160, (2018).
[2] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami, 'Mining association rules between sets of items in large databases', in ACM SIGMOD Record, volume 22, pp. 207–216. ACM, (1993).
[3] Rakesh Agrawal, Ramakrishnan Srikant, et al., 'Fast algorithms for mining association rules', in Proc. 20th Int. Conf. Very Large Data Bases, VLDB, volume 1215, pp. 487–499, (1994).
[4] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek, 'On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation', PloS ONE, 10(7), e0130140, (2015).
[5] Christian Bauckhage, 'Numpy/scipy recipes for data science: k-medoids clustering', ResearchGate.net, February, (2015).
[6] Leo Breiman, 'Random forests', Machine Learning, 45(1), 5–32, (2001).
[7] Angela Chen. IBM's Watson gave unsafe recommendations for treating cancer. https://cutt.ly/keHQDma, 2018. Accessed: 2019-11-18.
[8] Tianqi Chen and Carlos Guestrin, 'XGBoost: A scalable tree boosting system', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, (2016).
[9] H. A. Chipman, E. I. George, and R. E. McCulloch, 'Making sense of a forest of trees', Computing Science and Statistics, 84–92, (1998).
[10] Samantha Cole. This trippy t-shirt makes you invisible to AI. https://cutt.ly/FeHQHAa, 2019. Accessed: 2019-11-18.
[11] Houtao Deng, 'Interpreting tree ensembles with inTrees', International Journal of Data Science and Analytics, 7(4), 277–287, (2019).
[12] Pedro Domingos, 'Knowledge discovery via multiple models', Intelligent Data Analysis, 2(1-4), 187–202, (1998).
[13] Filip Karlo Došilović, Mario Brčić, and Nikica Hlupić, 'Explainable artificial intelligence: A survey', in 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 0210–0215. IEEE, (2018).
[14] Mengnan Du, Ninghao Liu, and Xia Hu, 'Techniques for interpretable machine learning', arXiv preprint arXiv:1808.00033, (2018).
[15] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[16] Niall Firth. Apple Card is being investigated over claims it gives women lower credit limits. https://cutt.ly/oeGYCx5, 2019. Accessed: 2019-11-18.
[17] Aaron Fisher, Cynthia Rudin, and Francesca Dominici, 'All models are wrong but many are useful: Variable importance for black-box, proprietary, or misspecified prediction models, using model class reliance', arXiv preprint arXiv:1801.01489, (2018).
[18] Alex Goldstein, Adam Kapelner, Justin Bleich, and Emil Pitkin, 'Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation', Journal of Computational and Graphical Statistics, 24(1), 44–65, (2015).
[19] Gösta Grahne and Jianfei Zhu, 'Efficiently using prefix-trees in mining frequent itemsets', in FIMI, volume 90, (2003).
[20] Stefan Th. Gries, 'On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement', Corpus Linguistics and Linguistic Theory, (2019).
[21] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi, 'A survey of methods for explaining black box models', ACM Computing Surveys (CSUR), 51(5), 93, (2019).
[22] Maissae Haddouchi and Abdelaziz Berrado, 'A survey of methods and tools used for interpreting random forest', in 2019 1st International Conference on Smart Systems and Data Science (ICSSD), pp. 1–6. IEEE, (2019).
[23] Tameru Hailesilassie, 'Rule extraction algorithm for deep neural networks: A review', arXiv preprint arXiv:1610.05267, (2016).
[24] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao, 'Mining frequent patterns without candidate generation: A frequent-pattern tree approach', Data Mining and Knowledge Discovery, 8(1), 53–87, (2004).
[25] Satoshi Hara and Kohei Hayashi, 'Making tree ensembles interpretable: A Bayesian model selection approach', in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, eds., Amos Storkey and Fernando Perez-Cruz, volume 84 of Proceedings of Machine Learning Research, pp. 77–85, Playa Blanca, Lanzarote, Canary Islands, (09–11 Apr 2018). PMLR.
[26] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Science & Business Media, 2009.
[27] Leonard Kaufman and Peter J. Rousseeuw, 'Clustering by means of medoids. Statistical data analysis based on the L1 norm', Y. Dodge, Ed., 405–416, (1987).
[28] Ron Kohavi, 'Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid', in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, (1996).
[29] Zachary C. Lipton, 'The mythos of model interpretability', Commun. ACM, 61(10), 36–43, (September 2018).
[30] Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee, 'Explainable AI for trees: From local explanations to global understanding', arXiv preprint arXiv:1905.04610, (2019).
[31] Scott M. Lundberg and Su-In Lee, 'A unified approach to interpreting model predictions', in Advances in Neural Information Processing Systems 30, eds., I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–4774, Curran Associates, Inc., (2017).
[32] Laurens van der Maaten and Geoffrey Hinton, 'Visualizing data using t-SNE', Journal of Machine Learning Research, 9(Nov), 2579–2605, (2008).
[33] Ioannis Mollas, Nikolaos Bassiliades, and Grigorios Tsoumakas, 'Lionets: Local interpretation of neural networks through penultimate layer decoding', in ECML PKDD 2019 AIMLAI XKDD Workshop, Würzburg, Germany, (2019).
[34] Alexander Moore, Vanessa Murdock, Yaxiong Cai, and Kristine Jones, 'Transparent tree ensembles', in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1241–1244. ACM, (2018).
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825–2830, (2011).
[36] General Data Protection Regulation, 'Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46', Official Journal of the European Union (OJ), 59(1-88), 294, (2016).
[37] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, 'Why should I trust you?: Explaining the predictions of any classifier', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, (2016).
[38] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, 'Anchors: High-precision model-agnostic explanations', in Thirty-Second AAAI Conference on Artificial Intelligence, (2018).
[39] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller, 'Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models', arXiv preprint arXiv:1708.08296, (2017).
[40] Gabriele Tolomei, Fabrizio Silvestri, Andrew Haines, and Mounia Lalmas, 'Interpretable predictions of tree-based ensembles via actionable feature tweaking', in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 465–474. ACM, (2017).
[41] James Vincent. This colorful printed patch makes you pretty much invisible to AI. https://cutt.ly/TeHQJHU, 2019. Accessed: 2019-11-18.
[42] Alexander von Eye and Clifford C. Clogg, Categorical Variables in Developmental Research: Methods of Analysis, Elsevier, 1996.
[43] Xun Zhao, Yanhong Wu, Dik Lun Lee, and Weiwei Cui, 'iForest: Interpreting random forests via visual analytics', IEEE Transactions on Visualization and Computer Graphics, 25(1), 407–416, (2018).
[44] Yichen Zhou and Giles Hooker, 'Interpreting models via single tree approximation', arXiv preprint arXiv:1610.09036, (2016).
[45] Andreas Ziegler and Inke R. König, 'Mining data with random forests: current options for real-world applications', Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(1), 55–63, (2014).

