     LionForests: Local Interpretation of Random Forests
                         Ioannis Mollas, Nick Bassiliades, Ioannis Vlahavas, Grigorios Tsoumakas1


Abstract. Towards a future where ML systems will integrate into every aspect of people's lives, researching methods to interpret such systems is necessary, instead of focusing exclusively on enhancing their performance. Enriching the trust between these systems and people will accelerate this integration process. Many medical and retail banking/finance applications use state-of-the-art ML techniques to predict certain aspects of new instances. Thus, explainability is a key requirement for human-centred AI approaches. Tree ensembles, like random forests, are widely accepted solutions for these tasks, while at the same time they are avoided due to their black-box, uninterpretable nature, creating an unreasonable paradox. In this paper, we provide a methodology for shedding light on the predictions of the misjudged family of tree ensemble algorithms. Using classic unsupervised learning techniques and an enhanced similarity metric, to wander among transparent trees inside a forest following breadcrumbs, the interpretable essence of tree ensembles arises. An interpretation provided by these systems using our approach, which we call "LionForests", can be a simple, comprehensive rule.

1   INTRODUCTION

Machine learning (ML) models are becoming pervasive in our society and everyday life. Such models may contain errors, or may be subject to manipulation from an adversary. In addition, they may be mirroring the biases that exist in the data from which they were induced. For example, Apple's new credit card was recently investigated over claims it gives women lower credit [16], IBM Watson Health was accused of suggesting unsafe treatments for patients [7], and the state-of-the-art object detection model YOLOv2 is easily tricked by specially designed patches [10, 41]. Being able to understand how an ML model operates and why it predicts a particular outcome is important for engineering safe and unbiased intelligent systems.

Unfortunately, many families of highly accurate (and thus popular) models, such as deep neural networks and tree ensembles, are opaque: humans cannot understand the inner workings of such models and/or the reasons underpinning their predictions. This has recently motivated the development of a large body of research on interpretable ML (IML), concerned with the interpretation of black box models [1, 13, 14, 21, 23, 29, 39].

Methods for interpreting ML models are categorised, among other dimensions [21], into global ones that uncover the whole logic and structure of a model and local ones that aim to interpret a single prediction, such as "Why does this patient have to be immediately hospitalised?". This work focuses on the latter category. Besides their utility in uncovering errors and biases, local interpretation methods are in certain domains a prerequisite due to legal frameworks, such as the General Data Protection Regulation (GDPR) [36] of the EU and the Equal Credit Opportunity Act of the US².

Another important dimension, by which IML methods can be categorised, concerns the type of ML model that they interpret [1]. Model-agnostic methods [31, 37, 38] can be applied to any type of model, while model-specific methods [4, 30, 33, 34] are engineered for a specific type of model. Methods of the former category have wider applicability, but they only approximately explain the models they are applied to [1]. This work focuses on the latter category of methods, proposing a technique specified for tree ensembles [6, 8], which are very effective in several applications involving tabular and time-series data [45].

Past work on model-specific interpretation techniques for tree ensembles is limited [34, 43]. iForest [43], a global and local interpretation system for random forests (RF), provides insights for a decision through a visualisation tool. Notwithstanding, such a visual explanation tool is very complex for non-expert users, while at the same time it requires user interaction in order to construct the local interpretations. Another instance-level interpretation technique for RF [34] produces an interpretation in the form of a list of features with their ranges, accompanied by an influence metric. If the list of features is extensive, and the ranges are very narrow, the interpretation can be considered unreliable and untrustworthy, because small changes in the features will render the interpretation useless. Finally, both methods do not handle categorical data appropriately.

To address the above problems, we introduce a local-based, model-specific approach for interpreting an individual prediction of an RF through a single rule, in natural language form. The ultimate goal is mainly to reduce the number of features and secondly to broaden the feature-ranges, producing more robust, indisputable and intuitive interpretations. Additionally, the categorical features are handled properly, providing intelligible information about them throughout the rules. The constructed rule will be presented as the interpretation. We call this technique "LionForests" (Local Interpretation Of raNdom FORESTS) and we use its path and feature selection ability, based on unsupervised techniques like association rules [2] and k-medoids clustering [27], to process the interpretations in order to make them more comprehensible.

The rest of this work is structured in the following way. First, Section 2 introduces the related work, presenting interpretation techniques applicable to RF models. In Section 3, we present the methodology of our approach to the interpretation of RF models. In Section 4, we perform the quantitative and qualitative analysis. Finally, we conclude and present possible future directions of our strategy in Section 5.

1 Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece, email: {iamollas,nbassili,vlahavas,greg}@csd.auth.gr
2 ECOA 15 U.S. Code §1691 et seq.: https://www.law.cornell.edu/uscode/text/15/1691

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2   RELATED WORK

A number of works concerning the tree ensemble interpretation problem offer either model-agnostic or model-specific solutions, with a global or local scope, as presented with a similar taxonomy in a recent survey [22].

A family of model-agnostic interpretation techniques for black-box models, including tree ensembles, concerns the efficient calculation of feature importance. These are variations of feature permutation methods [17], partial dependence plots [26] and individual conditional expectation [18], which are global-based. SHAP [31] is an alternative method to compute feature importance for both global and local aspects of any black-box model.

Specifically, on tree ensembles, the most common techniques include the processes of extracting, measuring, pruning and selecting rules from the trees to compute the feature importance [11, 40]. A technique highly studied by many researchers [11, 12, 25, 44] attempts to globally interpret tree ensembles using single-tree approximations. But this method, as its name implies, approximates the performance of the model it seeks to explain. Thus, this approach is extremely problematic and criticised, because it is not feasible to summarise a complex model like a tree ensemble to a single tree [20]. An additional approach on interpreting tree ensembles focuses on clustering the trees of an ensemble using a tree dissimilarity metric [9].

Figure 1. A sample example from the visualisation tool of iForest adapted from [43]

iForest [43], a model-specific global and local approach, utilises a path distance metric to project an instance's paths to a two-dimensional space via t-Distributed Stochastic Neighbour Embedding (t-SNE) [32]. The path distance metric they propose considers two paths distant in two cases: a) if a feature exists in only one out of the two paths, the distance between those two paths increases, b) if a feature exists in both paths, the distance increases according to the non-common ranges of the feature on the paths divided by half. The total distance, which is the aggregation of those cases for all the features, is finally divided by the total number of features appearing in at least one out of the two paths. Besides the projection of the paths (Figure 1a), they provide a feature summary (Figure 1b) and a decision path flow (Figure 1c), which is a paths overview. In the feature summary, a stacked area plot visualises every path's range for a specific feature, while the decision path flow plot visualises the paths themselves. However, they do not provide this information automatically. The user has to draw a lasso (Figure 1f) around some points-paths in the paths projection plot in order to get the feature summary and paths overview. But requiring the user to select the appropriate paths is critical, simply because the user can easily choose wrong paths, a small set of paths, or even paths entirely different to the paths responsible for the prediction. That may lead to an incorrect feature summary and paths overview, and thus to a faulty interpretation.

Lastly, one technique [34] interprets tree ensembles at the instance level (local-based technique), providing as interpretation a set of features with their ranges, ranked based on their contribution (see Figure 2). Thus, the interpretation process consists of two parts. Firstly, they calculate the influence of a feature by monitoring the changes of the activated nodes for a specific instance's prediction. This influence is later used for the ranking process. The second step is to find the narrowest range across all trees for every feature. However, they do not attempt to expand these ranges, while they also claim that their influence metric assigns zero influence to some features, and, by extension, removing them could offer more compact explanations. In spite of this, by keeping only features with a non-zero influence, they do not know that these features will be present in at least half plus one paths to preserve the same prediction of an instance. Finally, they do not manage categorical features properly.

Figure 2. A sample explanation from [34]

3   OUR APPROACH

Our objective is to provide local interpretation of RF binary classifiers. In RF, a set of techniques, like data and feature sampling, is used in order to train a collection of T weak trees. Then, these trained trees vote for an instance's prediction:

    h(x_i) = (1/|T|) Σ_{t∈T} h_t(x_i)    (1)

where h_t(x_i) is the vote cast by the tree t ∈ T for the instance x_i ∈ X, representing the probability P(C = c_j | X = x_i) of x_i being assigned to class c_j ∈ C = {0, 1}, thus

    h_t(x_i) = 1  if P(C = 1 | X = x_i) ≥ 0.5
    h_t(x_i) = 0  if P(C = 0 | X = x_i) ≥ 0.5.

Figure 3. A decision tree binary classifier with 4 features

Each decision tree t ∈ T is a directed graph, and by deconstructing its structure, we are able to derive a set P_t of paths from the root to the leaves. Therefore, every instance can be classified with one of these paths. A path p ∈ P_t is a conjunction of conditions, and the conditions are features and values with relations ≤ and >. For example, a path from the tree in Figure 3 could be the following:
'if f1 ≤ 0.45 and f3 > 0.9 then Decision A'. Thus, each path p is expressed as a set:

    p = {f_i ⋄ v_j | f_i ∈ F, v_j ∈ ℝ, ⋄ ∈ {≤, >}}    (2)
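Such paths can be read off a trained tree programmatically. The following is a minimal sketch (our own illustration, not the LionForests source code) showing how conjunctions of the form of Eq. 2 could be extracted from a fitted scikit-learn DecisionTreeClassifier; the helper name extract_paths and the (feature, relation, threshold) tuple representation are illustrative assumptions.

    # Illustrative sketch: enumerate root-to-leaf paths of a fitted scikit-learn
    # decision tree as lists of (feature_name, relation, threshold) conditions.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    def extract_paths(tree, feature_names):
        t = tree.tree_
        paths = []

        def recurse(node, conditions):
            if t.children_left[node] == -1:          # leaf node: path is complete
                paths.append(list(conditions))
                return
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            recurse(t.children_left[node], conditions + [(name, "<=", thr)])
            recurse(t.children_right[node], conditions + [(name, ">", thr)])

        recurse(0, [])
        return paths

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    for p in extract_paths(clf, ["f1", "f2", "f3", "f4"])[:3]:
        print(" and ".join(f"{f} {op} {v:.2f}" for f, op, v in p))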
We are presenting LionForests, a framework for interpreting RF models at the instance level. LionForests is a pipeline of actions: a) feature-ranges extraction, reduction through b1) association rules, b2) clustering and b3) random selection, c) categorical feature handling, and, finally, d) interpretation composition.

3.1   Feature-Ranges Extraction

Our approach delivers local explanations for RF. Assume we have an RF of N trees and an instance x, which is classified by the forest as c_j ∈ C = {0, 1}. We focus on the K ≥ N/2 trees of the forest that classify x as c_j. For each feature, we compute from each of the K trees in which it appears its value range, as imposed by the conditions involving this feature in the path from the root to the leaf. Figure 4 shows an example of these ranges for the feature 'variance', with range −1...0.1, from the Banknote dataset [15] for a particular RF of 50 trees and a particular instance whose value for the skew feature is −0.179. Figure 6 shows the corresponding stacked area plot. The highlighted (cyan/light grey) area represents the intersection of these ranges, which will always contain the instance's value for the specific feature. Moreover, no matter how much the feature value is going to change, as long as it stays within this intersection range, the instance will not follow a different decision path in these trees.

Figure 4. Bar plot of value ranges of the 'variance' feature from the Banknote dataset for the M = 26 trees of an RF of K = 42 trees predicting the majority class

Figure 6. Stacked area plot of value ranges of the 'variance' feature from the Banknote dataset for the M = 26 trees of an RF of K = 42 trees predicting the majority class

In order to give a brief example, we have the following three paths:

    p1: if f1 ≤ 0.6 and ... then Class_A
    p2: if f1 ≤ 0.6 and f1 > 0.469 and ... then Class_A
    p3: if f1 > 0 and f1 ≤ 1 and ... then Class_A

Figure 5. Bar plot of the simple example

The intersection of these three paths is the cyan/light grey area in Figure 5, from which we can infer that the f1 feature's range can be: 0.47 ≤ f1 ≤ 0.6. This intersection range will always contain the instance's value for the specific feature. Moreover, no matter how much the feature value is going to change, as long as it stays within this intersection range, the decision paths are not going to change. For example, if the value of the instance for the feature f1 was 0.5 and the value changes to 0.52, each tree will take its decision through the same path. Summarising the aforementioned, an interpretation can have this shape:

    'if 0.47 ≤ f1 ≤ 0.6 and . . . then Class_A'

But with as many paths as possible voting for class A, we face the following two issues:

1. A lot of paths can lead to an explanation with many features, and by extension to an unintelligible interpretation and a frustrated user.
2. A lot of paths will lead to a small, strict and very specific feature range. For example, the f1 instance's value was 0.5 and the intersection range of all paths for this feature occurs to be 0.47 ≤ f1 ≤ 0.6, while the feature range is [−1, 1]. A narrow range, like the aforementioned, would result in a negative impression of the model, which will be considered unstable and unreliable. A broader range, then, will be less refutable.

Consequently, we formulate the optimisation problem (Eq. 3) to minimise the number of features that satisfy the paths of a subset of the original trees, thereby retaining the same classification result as the original set of trees and keeping the size of the selected subset of trees equal to or greater than the quorum, in order to ensure the consistency with the results of the original RF model.

    minimise_{F′ ⊆ F}   |F′|
    subject to   p = {f_i ⋄ v_j | f_i ∈ F′}, p ∈ P_t ∀ t ∈ T′,
                 ⌊ (1/|T′|) Σ_{t∈T′} h_t(x_i) + 1/2 ⌋ = ⌊ (1/|T|) Σ_{t∈T} h_t(x_i) + 1/2 ⌋,
                 |T′| ≥ quorum    (3)

To give an example of the expression ⌊ (1/|T|) Σ_{t∈T} h_t(x_i) + 1/2 ⌋: when 70 out of |T| = 100 trees are voting for class 1, then we have ⌊ (1/100)·70 + 0.5 ⌋ = ⌊1.2⌋ = 1. On the other side, if 25 out of |T| = 100 trees vote for class 1 (the minority), then we have ⌊ (1/100)·25 + 0.5 ⌋ = ⌊0.75⌋ = 0. Therefore, we are aiming to find the smallest T′ ⊆ T which will produce the same classification as the original T trees.
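To make the range-intersection and floor-based quorum computations of Section 3.1 concrete, here is a small sketch (our own illustration, not the LionForests source); the (lower, upper) bound representation and the helper vote are illustrative assumptions, and the bounds reproduce the p1-p3 example above.

    import math

    # Each path constrains feature f1 to an interval; the global feature range is [-1, 1].
    path_bounds_f1 = [(-1.0, 0.6), (0.469, 0.6), (0.0, 1.0)]

    lower = max(lo for lo, _ in path_bounds_f1)
    upper = min(up for _, up in path_bounds_f1)
    print(f"intersection: {lower:.3f} <= f1 <= {upper:.3f}")   # 0.469 <= f1 <= 0.6

    # Floor-based voting of Eq. 3: the reduced tree subset T' must keep the same class.
    def vote(votes_for_class_1, n_trees):
        return math.floor(votes_for_class_1 / n_trees + 0.5)

    assert vote(70, 100) == 1    # majority for class 1 -> prediction 1
    assert vote(25, 100) == 0    # minority for class 1 -> prediction 0
    quorum = 100 // 2 + 1        # at least half plus one trees must be kept
    print("quorum:", quorum)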
3.2   Reduction through Association Rules

The first step of the reduction process begins by using association rules. Association rules [2] mining is an unsupervised technique, which is used as a tool to extract knowledge from large datasets and explore relations between attributes. In association rules, the attributes are called items I = {i1, i2, . . . , in}. Each dataset contains sets of items, called itemsets T = {t1, t2, . . . , tm}, where t_i ⊆ I. Using all possible items of a dataset, we can find all the rules X ⇒ Y, where X, Y ⊆ I. X is called the antecedent, while Y is called the consequent. In association rules, the goal is to calculate the support and confidence of each rule in order to find useful relations. A simple observation is that X is independent of Y when the confidence is critically low. Furthermore, we can say that an X with high support is probably very important.

But how can we use association rules in random forests? We are going to do this at the path level. The items I will contain the features F of our original dataset. The dataset T, which we are going to use to mine the association rules, will contain sets of features that represent each path: t_i = {i_j | i_j = f_j, f_j ⋄ v_k ∈ p_i, p_i ∈ P}. It is significant to mention that we keep only the presence of a feature in the path, and we discard its value v_k. Then, it is feasible to apply association rule techniques, like the Apriori algorithm [3].

The next step is to sort the association rules extracted by the Apriori algorithm based on the ascending confidence score of each rule. For the rule X ⇒ Y with the lowest confidence, we will take the X and add its items to the list of features. Afterwards, we calculate the number of paths containing conjunctions which are satisfied with the new feature set. If the number of paths is at least half plus one of the total number of trees, we have found the reduced set of paths. We have a quorum! Otherwise, we iterate and add more features from the antecedent of the following rule. By using this technique, we reduce the number of features, and we have the new feature set F′ ⊆ F. Reducing the features can lead to a reduced set of paths too, because paths containing conjunctions with the redundant features will no longer be valid. Thus, for every path p we have the following representation:

    p = {f_i ⋄ v_j | f_i ∈ F′, v_j ∈ ℝ, ⋄ ∈ {≤, >}}.    (4)

To illustrate this, for a toy dataset of four features F = [f1, f2, f3, f4] and an RF model with five estimators T = [t1, t2, t3, t4, t5], for every instance x, from each t_i ∈ T we can extract a path p_i. Supposing that for the instance x we have five paths:

    p1: if f1 and f2 and f4 then Class_A
    p2: if f1 and f3 and f4 then Class_A
    p3: if f1 and f2 and f4 then Class_A
    p4: if f3 and f4 then Class_A
    p5: if f4 then Class_A

Then, we can compute the association rules using Apriori. Our objective is to create a set of features F′ ⊂ F. We take the first rule, f4 ⇒ (f3, f1), the rule with the lowest confidence. This rule informs us that f4, which has the highest support value, exists in 80% of the paths without (f1, f3). Thus, the first thing we add to our feature list is the antecedent of this rule, f4. Having added the feature, we count how many paths can be satisfied with the features of F′ = [f4]. Only one path is valid (p5), and this is not enough because we need a quorum. Skipping all the association rules having the chosen features in their antecedents, the next rule we have is f1 ⇒ f3. f1 has a 0.6 support value, and the rule has 0.33 confidence. This means that in 66.6% of the paths containing f1, f3 is absent. We add f1 to the feature list and now F′ = [f1, f4]. With this feature list, only p5 is activated again. Hence, we need another feature. The next rule we have is f3 ⇒ (f1, f4), with 0.4 support of f3 and confidence 0.2. Adding f3, the paths p2, p4 and p5 are now valid.

In the aforementioned example, we managed to reduce the features from four to three and the paths from five to three as well. However, when applying this method to datasets with plenty of features and models with more estimators, the reduction effect can be observed more clearly. Section 4 explores that effect through a set of experiments.
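A minimal sketch of this path-level mining, using the toy paths above. We assume the mlxtend library for Apriori and rule extraction; the selection loop is a simplified illustration of the procedure described in this section, not the exact LionForests implementation.

    # Illustrative sketch (assumes mlxtend is installed); items are the features
    # present in each of the five toy paths.
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    paths = [{"f1", "f2", "f4"}, {"f1", "f3", "f4"}, {"f1", "f2", "f4"},
             {"f3", "f4"}, {"f4"}]
    features = ["f1", "f2", "f3", "f4"]
    quorum = len(paths) // 2 + 1

    # One-hot "transaction" table: True if the feature appears in the path.
    trans = pd.DataFrame([[f in p for f in features] for p in paths], columns=features)

    freq = apriori(trans, min_support=0.1, use_colnames=True)
    rules = association_rules(freq, metric="confidence", min_threshold=0.0)
    rules = rules.sort_values("confidence")          # lowest-confidence rules first

    selected, kept_paths = set(), []
    for antecedent in rules["antecedents"]:
        if antecedent <= selected:                   # skip rules already covered
            continue
        selected |= set(antecedent)
        kept_paths = [p for p in paths if p <= selected]
        if len(kept_paths) >= quorum:                # half plus one paths reached
            break

    print("reduced feature set:", sorted(selected))
    print("paths kept:", len(kept_paths))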
3.3   Reduction through Clustering and Random Selection

However, association rules may not be able to reduce the number of features and, consequently, the number of paths. In that case, a second reduction technique based on clustering is applied. Clustering is yet another group of unsupervised ML techniques, aside from association rules. k-medoids [5, 27] is a well-known clustering algorithm, which considers as a cluster's centre an existing element from the dataset. This element is called a medoid. k-medoids, like other clustering techniques, needs a distance or a dissimilarity metric to find the optimum clusters. Thus, performing clustering on paths requires a path-specific distance or dissimilarity metric.

We designed a path similarity metric in Algorithm 1. This similarity metric is close to the distance metric introduced in iForest [43], but eliminates some minor problems of the aforementioned. In this algorithm, if a feature is absent from both paths, the similarity of these paths increases by 1. When a feature is present in both paths, the similarity increases by a value between 0 and 1, which is the intersection of the two ranges normalised by the union of the two ranges. The similarity metric is biased towards the absence of a feature from both paths, because it always assigns one similarity point. However, this is a desirable property for our goal of minimising the feature set.

    Algorithm 1: Path similarity metric
      input : p_i, p_j, feature_names, min_max_feature_values
      return: similarity_ij
      s_ij ← 0
      for f in feature_names do
          if f in p_i and f in p_j then
              find l_i, u_i, l_j, u_j lower and upper bounds
              inter ← min(u_i, u_j) − max(l_i, l_j)
              union ← max(u_i, u_j) − min(l_i, l_j)
              if inter > 0 and union ≠ 0 then
                  s_ij ← s_ij + inter/union
              end
          else if f not in p_i and f not in p_j then
              s_ij ← s_ij + 1
          end
      end
      return s_ij / len(feature_names)
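A direct Python rendering of Algorithm 1, included only as an illustration; the dictionary-based path representation (feature name mapped to its (lower, upper) bounds) is our own assumption.

    # Illustrative Python version of Algorithm 1 (path similarity metric).
    # A path is assumed to be a dict: {feature_name: (lower_bound, upper_bound)}.
    def path_similarity(p_i, p_j, feature_names):
        s_ij = 0.0
        for f in feature_names:
            if f in p_i and f in p_j:
                l_i, u_i = p_i[f]
                l_j, u_j = p_j[f]
                inter = min(u_i, u_j) - max(l_i, l_j)      # overlap of the two ranges
                union = max(u_i, u_j) - min(l_i, l_j)      # span covered by both ranges
                if inter > 0 and union != 0:
                    s_ij += inter / union
            elif f not in p_i and f not in p_j:
                s_ij += 1                                   # bias towards absent features
        return s_ij / len(feature_names)

    # Example with the global feature range [-1, 1] used for unconstrained bounds.
    p1 = {"f1": (-1.0, 0.6)}
    p2 = {"f1": (0.469, 0.6), "f3": (-1.0, 0.9)}
    print(path_similarity(p1, p2, ["f1", "f2", "f3", "f4"]))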
In Algorithm 2, we calculate the k-medoids and their clusters using the similarity metric of Algorithm 1. Afterwards, we perform an ordering of the medoids based on the number of paths they cover in their clusters. Then, we collect paths from the larger clusters into a list, until we acquire at least a quorum. By collecting the larger clusters first, the possibility of feature reduction increases, because the paths inside a cluster tend to be more similar among them. Additionally, the biased similarity metric would cluster paths with fewer irrelevant features, leading to a subset of paths that are satisfied with a smaller set of features.

    Algorithm 2: Paths reduction through k-medoids clustering
      input : similarity_matrix, no_of_estimators, paths, no_of_medoids
      return: paths
      quorum ← no_of_estimators/2 + 1
      m ← kmedoids(similarity_matrix, no_of_medoids)
      sorted_m ← sort_by_key(m, descending = True)
      count ← 0, size ← 0, reduced_paths ← []
      while size < quorum and count < len(sorted_m) do
          for j in m[sorted_m[count]] do
              reduced_paths.append(paths[j])
          end
          count ← count + 1, size ← len(reduced_paths)
      end
      if size ≥ quorum then
          paths ← reduced_paths
      end
      return paths

Performing clustering does not guarantee feature reduction, but there is a probability of an unanticipated reduction of the feature set. This procedure attempts to minimise the number of paths to at least the quorum. Unlike the association rules method, which may not accomplish feature reduction, clustering will significantly reduce the number of paths. By the end of the reduction process through clustering, random selection is applied to the paths to obtain the acceptable minimum number of paths, in case the reduction via clustering has not reached the quorum.
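A sketch of this clustering step, assuming the KMedoids implementation from scikit-learn-extra; converting the similarity matrix of Algorithm 1 into a distance matrix and the exact cluster-collection loop are our own illustrative choices, following the description above.

    # Illustrative sketch of Algorithm 2 using scikit-learn-extra's KMedoids
    # (assumed dependency). similarities[i][j] comes from path_similarity above.
    import numpy as np
    from sklearn_extra.cluster import KMedoids

    def reduce_paths_by_clustering(similarities, paths, n_estimators, n_medoids=5):
        quorum = n_estimators // 2 + 1
        distances = 1.0 - np.asarray(similarities)          # similarity -> distance
        labels = KMedoids(n_clusters=n_medoids, metric="precomputed",
                          random_state=0).fit_predict(distances)

        # Group path indices per cluster and visit the largest clusters first.
        clusters = {c: np.where(labels == c)[0] for c in range(n_medoids)}
        ordered = sorted(clusters.values(), key=len, reverse=True)

        reduced = []
        for members in ordered:
            if len(reduced) >= quorum:
                break
            reduced.extend(paths[j] for j in members)

        # Keep the reduction only if it still covers the quorum of trees.
        return reduced if len(reduced) >= quorum else paths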
3.4   Handling Categorical Features

It is possible, even expected, to deal with a dataset containing categorical features. Of course, a transformation through OneHot or Ordinal [42] encoding is used to make good use of these data. This transformed type of information will then be acceptable by ML systems. But is there any harm to interpretability caused by the use of encoding methods? Sadly, yes! Using ordinal encoding will transform a feature like country=[GR, UK, . . . ] to country=[0, 1, 2, . . . ]. As a result, we lose the intelligibility of the feature. On the other hand, using OneHot encoding will dramatically increase the number of features, leading to over-length and incomprehensible interpretations, by transforming the feature country to country_GR=[0, 1], country_UK=[0, 1], and so forth. Since the encoding transformations are part of the feature engineering process and are not fixed, we cannot construct a fully automated process to inverse-transform the features into human-interpretable forms within the interpretations.

Nevertheless, LionForests provides two automated encoding processes, using either OneHot or Ordinal encoding, and their inverse transformation for the interpretation extraction. Feature-ranges of Ordinal encoded data transform like (1 ≤ country ≤ 2) → (country = [UK, FR]), while feature-ranges of OneHot encoded data transform like (0.5 ≤ country_UK ≤ 1) → (country = UK), and feature-ranges like (0 ≤ country_GR ≤ 0.49) are removed. The excluded OneHot encoded features for the categorical feature should appear to the user as possible alternative values. If a feature reduction method reduces one of the encoded OneHot features, it will not appear in the feature's list of alternative values, but in the list of values that do not influence the prediction. For this reason, the categorical features will appear in the interpretations with a notation 'c', like 'categorical_feature_c = value'. Depending on the application and the interpretation, the user will be able to request a list of alternative values or will be able to simply hover over the feature to reveal the list. Section 4.3.3 showcases transformations of OneHot encoded features, as well as one example of a OneHot encoded feature's alternative values list.
3.5   Interpretation Composition

These processes are part of the LionForests technique, which, in the end, produces an interpretation in the form of a feature-range rule. Lastly, LionForests combines the ranges of the features in the reduced feature set into a single natural language rule. The order of appearance of the feature ranges in the rule is determined by using a global interpretation method, such as the SHAP TreeExplainer [30] or the Scikit [35] RF model's built-in feature importance attribute, for smaller and larger datasets, respectively. One notable example of an interpretation is the following:

    'if 0 ≤ f1 ≤ 0.5 and −0.5 ≤ f3 ≤ 0.15 then class_A'.    (5)

We interpret this feature-range rule as follows: "As long as the value of f1 is between 0 and 0.5, and the value of f3 is between −0.5 and 0.15, the system will classify this instance to class A. If the value of f1, f3, or both, surpasses the limits of their ranges, then the prediction may change. Note that the features are ranked through their influence". This type of interpretation is comprehensible and human-readable. Thus, if we manage to keep such rules shorter, they could be an ideal way to explain an RF model. A way to counter an over-length rule could be to hide the last n feature-ranges, which will be the least important due to the ordering process. At the same time, users will have the ability to expand their rules to explore the feature-ranges. However, we do not completely exclude those features; we only hide them, because otherwise we would affect the correctness of both the explanation and the prediction. An example is showcased in Section 4.3.3.
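A final sketch of the composition step: ordering the surviving feature-ranges by a global importance score and rendering the natural-language rule. Using the Random Forest's built-in feature_importances_ here is one of the two options mentioned above; the formatting helper itself is our own illustrative assumption.

    # Illustrative composition of the final rule. `ranges` holds the intersected
    # bounds of the reduced feature set; importances come from a fitted forest
    # (model.feature_importances_) or a SHAP TreeExplainer, as described above.
    def compose_rule(ranges, importances, predicted_class):
        ordered = sorted(ranges, key=lambda f: importances.get(f, 0.0), reverse=True)
        conditions = [f"{lo:g} <= {f} <= {hi:g}" for f in ordered
                      for lo, hi in [ranges[f]]]
        return "if " + " and ".join(conditions) + f" then {predicted_class}"

    rule = compose_rule({"f1": (0.0, 0.5), "f3": (-0.5, 0.15)},
                        {"f1": 0.7, "f3": 0.2}, "class_A")
    print(rule)   # if 0 <= f1 <= 0.5 and -0.5 <= f3 <= 0.15 then class_A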
4   EXPERIMENTAL RESULTS

We first discuss the setup of our experiments. Then, we present quantitative results, followed by qualitative results for the explanation of particular instances.

4.1   Experimental Setup

The implementation of LionForests, as well as of all the experiments in this paper, is available in the LionLearn repository at GitHub³.

Our experiments were conducted on the following three tabular binary classification datasets: Banknote Authentication [15], Heart (Statlog) [15] and Adult Census [28]. The top part of Table 1 shows the number of instances and features of each dataset. Particularly, Adult Census contains 6 numerical and 8 categorical features. The eight categorical features are transformed through OneHot encoding into 74 numerical ones, resulting in a total of 80 features.

We used the RandomForestClassifier from Scikit-learn [35] and the MinMaxScaler with feature range [−1, 1]. In order to work with optimised models, a 10-fold cross-validation grid search was carried out on each dataset, using the following set of parameters and values: max_depth {1, 5, 7, 10}, max_features {'sqrt', 'log2', 75%, None⁴}, min_samples_leaf {1, 2, 5, 10, 10%}, bootstrap {True, False}, n_estimators {10, 100, 500, 1000}. The parameters' values that achieved the best F1 score, along with the score itself, are shown in the middle and bottom parts of Table 1, respectively.

3 https://github.com/intelligence-csd-auth-gr/LionLearn
4 None = all features
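A sketch of this setup for one dataset (our own condensed reconstruction of the described configuration, not the repository code); percentages are written as fractions, as scikit-learn expects.

    # Illustrative reconstruction of the described grid search for one dataset.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import MinMaxScaler

    param_grid = {
        "max_depth": [1, 5, 7, 10],
        "max_features": ["sqrt", "log2", 0.75, None],   # None = all features
        "min_samples_leaf": [1, 2, 5, 10, 0.1],
        "bootstrap": [True, False],
        "n_estimators": [10, 100, 500, 1000],
    }

    # X, y: a tabular binary classification dataset (e.g. Banknote Authentication).
    def tune(X, y):
        X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
        search = GridSearchCV(RandomForestClassifier(random_state=0),
                              param_grid, scoring="f1", cv=10, n_jobs=-1)
        search.fit(X, y)
        return search.best_estimator_, search.best_score_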
                        Banknote    Heart (Statlog)    Adult Census
    instances             1372            270             48842
    features                 4             13            14 (80)
    max depth               10              5                10
    max features           0.75          'sqrt'           'sqrt'
    min samples leaf          1              5                 1
    bootstrap              True           False             False
    estimators              500            500               100
    F1 %                  99.43          81.89             88.71

Table 1. For each of the three datasets: main statistics (top), optimal parameter values of the Random Forest (middle), corresponding F1 score (bottom)

4.2   Quantitative Results

For each dataset, we train one RF model, using all data, with the parameters shown in Table 1. Then we apply LionForests to all instances of each dataset and report the mean feature and path reduction in Table 2.

In terms of path reduction, we notice that among the three reduction techniques, random selection consistently leads to the best results across all datasets, achieving an average reduction of 45.69%. Clustering is the second-best method, with only slightly worse results in the first two datasets, but much worse results in the Adult Census dataset, achieving an average of 41.94% reduction. Association rules is the worst technique, achieving zero reduction in the Adult Census dataset and an average of 19.90% reduction. Combining random selection with other techniques does not lead to improved results.

With respect to feature reduction, we find that the association rules strategy leads to the best results in the first two datasets, reaching an average reduction of 26.04%. The other two techniques achieve negligible reduction in these two datasets. In Banknote Authentication, combining the three methods improves the reduction slightly, from 30.70% to 30.85%. In Adult Census, on the other hand, association rules achieve zero reduction of features, similarly to paths. The best result in this dataset is achieved by random selection, followed closely by clustering. Combining these two techniques improves the reduction slightly, to 10%.

The weak performance of association rules in Adult Census is related to the small number of estimators (100) and the huge number of features (80). In particular, each of the estimators can have a maximum of eight features, as resulted from the grid search (max features = 'sqrt'). Thus, the overlapping features among the different paths are fewer, and by extension, techniques relying on feature selection, like association rules, cannot perform the feature reduction, as well as the path reduction. However, LionForests is an ensemble of techniques, and we have revealed with these quantitative experiments the importance of each part. We may infer that the LionForests strategy is considerably effective in reducing both features and paths.

    Reduction Technique                              Banknote Authentication        Heart Disease                   Adult Census                    Mean Reduction %
    Assoc. Rules   Clustering   Random Based         Feature %       Path %          Feature %        Path %         Feature %       Path %          Feature    Path
    X              X            X                    30.85±5.1e−2    49.47±5.5e−15   21.37±0.0        43.41±0.0      10.03±1.5e−2    44.18±5.5e−15   20.75      45.69
    -              X            X                    2.15±1.8e−1     49.47±5.5e−15   8.6e−3±1.3e−2    43.41±0.0      10.03±1.5e−2    44.18±5.5e−15   4.06       45.69
    X              -            X                    30.70±0.0       49.47±5.5e−15   21.37±0.0        43.41±0.0      9.11±1.4e−2     44.18±5.5e−15   20.39      45.69
    X              X            -                    30.84±4.4e−2    48.56±2.4e−2    21.37±0.0        42.88±2.2e−2   7.82±1.4e−2     36.26±2.6e−2    20.01      42.57
    X              -            -                    30.70±0.0       27.02±0.0       21.37±0.0        32.67±0.0      0.0±0.0         0.0±0.0         17.36      19.90
    -              X            -                    2.12±1.3e−1     47.57±3.6e−2    5.7e−3±1.1e−2    41.98±7.4e−2   7.82±1.4e−2     36.26±2.6e−2    3.32       41.94
    -              -            X                    0.0±0.0         49.47±5.5e−15   2.9e−3±8.6e−3    43.41±0.0      9.11±1.4e−2     44.18±5.5e−15   3.04       45.69

Table 2. Feature and path reduction ratios (%) on the three datasets. X marks the applied techniques

4.3   Qualitative Results

The LionForests technique provides consistent and robust rules, which are more indisputable than other interpretations because they are more compact, have broader ranges for the features, and at the same time present categorical data in a human-comprehensible form. In this section, we provide an example from each dataset to demonstrate how RF models can be interpreted efficiently.

4.3.1   Banknote Authentication

We observe that the complete methodology, incorporating association rules, clustering and random selection reduction, achieves the highest performance on both feature and path reduction for the first dataset, Banknote Authentication. We also provide an example of a pair of explanations (1) without and (2) with LionForests:

1. 'if 2.4 ≤ variance ≤ 6.83 and −3.13 ≤ skew ≤ −2.30 and 1.82 ≤ curtosis ≤ 2.13 and −0.64 ≤ entropy ≤ 0.73 then fake banknote'
2. 'if 2.4 ≤ variance ≤ 6.83 and −1.60 ≤ curtosis ≤ 17.93 then fake banknote'

The reduced rule (2) is smaller than the original by two features. In addition, the "curtosis" feature has a wider range. The instance has a value of 1.92 for the "curtosis" feature, and in the original rule this value is marginal for the very narrow range 1.82 ≤ curtosis ≤ 2.13, indicating that a small change may lead to a different outcome; this is not the case for the reduced rule. In addition, changing the skew value from −2.64 to −4, which is outside the range of the feature in the original rule, will not change the prediction and will produce the same reduced-range rule. We observe the same result when we change the value of the "entropy" feature, as well as when we tweak both "skew" and "entropy".

4.3.2   Heart Disease

Again, with LionForests we achieve both higher feature and path reduction ratios. There are thirteen features included in this particular dataset and, as a result, the interpretations may have a total of thirteen features. In fact, in this case the reduction of features is essential to provide comprehensible interpretations. We choose an example, and we present the original rule (1) and the reduced rule (2):
1. ‘if 6.5 ≤ reversable def ect ≤ 7.0 and 3.5 ≤ chest pain ≤ 4.0                          Possible values of “native_country”, which they
   and 0.0 ≤ number of major vessels ≤ 0.5 and 1.55 ≤                                   may affect the prediction          preserve the prediction
   oldpeak ≤ 1.7 and 0.5 ≤ exercise induced angina ≤ 1.0                          ‘Mexico’, ‘United-States’, ‘Canada’,
                                                                                                                               ‘India’, ‘France’,
   and 128.005 ≤ maximum heart rate achieved ≤ 130.998                             ‘Philippines’, ‘England’, ‘Thailand’,
                                                                                                                            ‘El-Salvador’, ‘Iran’,
                                                                                ‘Japan’, ‘China’, ‘Dominican-Republic’,
   and 1.5 ≤ the slope of the peak exercise ≤ 2.5 and                                                                           ‘Cuba’, ‘Haiti’,
                                                                                ‘Germany’, ‘South’, ‘Columbia’, ‘Italy’,
   sexc = M ale and 184.999 ≤ serum cholestoral ≤                                ‘Puerto-Rico’, ‘Vietnam’, ‘Cambodia’,
                                                                                                                             ‘Guatemala’, ‘Peru’,
   199.496 and 29.002 ≤ age ≤ 41.497 and 0.0 ≤                                                                               ‘Trinidad&Tobago’,
                                                                                 ‘Ireland’, ‘Taiwan’, ‘Portugal’, ‘Laos’,
   resting electrocardiographic results           ≤     0.5 and                                                                   ‘Honduras’
                                                                                  ‘Yugoslavia’, ‘Nicaragua’, ‘Scotland’
   119.0 ≤ resting blood pressure ≤ 121.491 and
                                                                                Table 3. List of values affecting or not the classification of an instance
   0.0 ≤ f asting blood sugar ≤ 0.5 then presence’
2. ‘if 6.5 ≤ reversable def ect ≤ 7.0 and 3.5 ≤ chest pain ≤ 4.0
   and 0.0 ≤ number of major vessels ≤ 0.5 and 1.55 ≤                          optimal ones but not the optimal explanations, the attempt to indif-
   oldpeak ≤ 1.7 and 0.5 ≤ exercise induced angina ≤ 1.0                       ferently interpret each black-box model will not lead to the desired
   and 128.005 ≤ maximum heart rate achieved ≤ 133.494                         outcome. In this work, we introduced a model-specific local-based
   and 1.5 ≤ the slope of the peak exercise ≤ 2.5 and                          approach for obtaining real interpretations of random forests predic-
   184.999 ≤ serum cholestoral ≤ 199.496 and 119.0 ≤                           tions. Other works [34, 43] attempt to provide explanations of this
   resting blood pressure ≤ 121.491 then presence’                             form, but they do not try to make them more comprehensible, either
                                                                               indisputable. A user may not be familiar with iForest’s [43] visualisa-
   The reduced rule is shorter than the original rule by four features.        tion tool. Besides, an interpretation containing a lot of features with
In addition, the “maximum heart rate achieved” feature has a wider             narrow ranges [34] may lead to incomprehensible and untrustworthy
range in the reduced rule. Changing the “sex” value from ‘Male’ (1)            rules. The proposed technique, LionForests, will provide users with
to ‘Female’ (0) did not change the reduced rule at all. We tweak               small rules as interpretations in natural language, which by widen-
“age” value from 35 to 15 and again the reduced rule remains the               ing the feature ranges will be more reliable and trustworthy as well.
same. Thus, features like “age”, “sex”, “resting electrocardiographic          We use classic unsupervised methods such as association rules and
results” and “fast blood sugar”, cannot influence the prediction.              k-medoids clustering to achieve feature and path reduction.
                                                                                  Nevertheless, LionForests is not a complete solution either, since
                                                                               it is not appropriate for tasks of model inspection. For example, if
4.3.3   Adult Census                                                           a researcher is working to build a reliable and stable model aiming
Finally, we present a pair of explanations provided by LionForests             for the highest performance, a visualisation tool like iForest may be
for an instance of Adult Census, without (1) and with (2) reduction:           preferred. This approach is the best and easiest way of providing an
                                                                               interpretation to a non-expert user. Lastly, by reducing the number of
1. ‘if marital statusc = M arried and sexc = F emale and                       paths to the quorum to minimise the features and at the same time
   educationc = HS grad and workclassc = P rivate and                          to increase the features’ ranges, the outcome would be a discounted
   94721 ≤ f nlwgt ≤ 161182 and 47 ≤ age ≤ 53 and 15 ≤                         probability of the classification of the instance, which poses ques-
   hours per week ≤ 25 and native country c = Jamaica and                      tions about the prediction’s reliability. This can be counter-attacked
   [other 2 feature-ranges] then income >50K’                                  by introducing a threshold parameter to the reduction effect, requir-
2. ‘if marital statusc = M arried and sexc = F emale and                       ing the algorithm to retain at least a specific percentage of the paths.
   educationc = HS_grad and workclassc = P rivate and                             Future research will explore the impact of tuning parameters, such
   87337 ≤ f nlwgt ≤ 382719 and 47 ≤ age ≤ 63 and 15 ≤                         as the number of estimators or the maximum number of features, on
   hours per week ≤ 99 and native country c = Jamaica and                      the reduction of features and paths. We also aim to apply LionForests
   [other 2 feature-ranges] then income >50K’                                  to different tree ensembles, rather than random forests, as well as to
                                                                               various datasets and data types. FPGrowth [24], and its variant FP-
   The reduced rule is thirteen features smaller than the original rule. This is not directly obvious, because some of the one-hot encoded categorical features are not presented. For example, only the valid features, such as "marital_status_Married" and "education_HS_Grad", are presented, as described in Section 3.4. Furthermore, we observe that the ranges of "age", "fnlwgt" and "hours per week" are broader. Specifically, the "age" range increased from [47, 53] to [47, 63], while the "hours per week" range expanded from [15, 25] to [15, 99]. Moreover, we can explore the alternative values of the categorical feature "native country". In Table 3, the first list refers to the values that may change the prediction of the instance, while the second shows the values that cannot affect the prediction. In the original rule this type of information was not available, because the non-affecting values were also present.
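   To make the structure of these rules concrete, the following minimal sketch shows how such a conjunctive rule could be rendered from per-feature ranges and the active one-hot categorical values. The render_rule helper and its input structures are illustrative assumptions for exposition only, not part of the LionForests implementation; the values are copied from the reduced rule (2) above.

# Minimal illustrative sketch (not the LionForests implementation): rendering a
# conjunctive rule from numerical feature ranges and active one-hot categorical values.
def render_rule(ranges, categories, label):
    # ranges:     {feature: (low, high)} for numerical features
    # categories: {feature: value}, keeping only the active value of each one-hot feature
    # label:      the predicted class
    conjuncts = ["{} = {}".format(f, v) for f, v in categories.items()]
    conjuncts += ["{} <= {} <= {}".format(lo, f, hi) for f, (lo, hi) in ranges.items()]
    return "if " + " and ".join(conjuncts) + " then " + label

# Values copied from the reduced Adult Census rule (2).
categories = {"marital_status": "Married", "sex": "Female", "education": "HS_grad",
              "workclass": "Private", "native_country": "Jamaica"}
ranges = {"fnlwgt": (87337, 382719), "age": (47, 63), "hours_per_week": (15, 99)}
print(render_rule(ranges, categories, "income >50K"))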
5   CONCLUSION

Providing helpful explanations is a challenging task. Providing comprehensible and undeniable explanations is even tougher. Since model-agnostic approaches produce approximations, i.e. the nearest […]

   Nevertheless, LionForests is not a complete solution either, since it is not appropriate for model inspection tasks. For example, if a researcher is working to build a reliable and stable model aiming for the highest performance, a visualisation tool like iForest may be preferred, whereas the rule-based approach of LionForests is the best and easiest way of providing an interpretation to a non-expert user. Lastly, by reducing the number of paths to the quorum in order to minimise the number of features and, at the same time, to widen the features' ranges, the outcome would be a lower probability for the classification of the instance, which raises questions about the prediction's reliability. This can be counteracted by introducing a threshold parameter on the reduction effect, requiring the algorithm to retain at least a specific percentage of the paths.
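   As a rough illustration of this idea (an assumption about a possible implementation, not the current algorithm), such a threshold could act as a guard that rejects any reduction falling below a minimum fraction of the paths that originally voted for the predicted class:

# Illustrative sketch of the proposed retention threshold; apply_reduction and
# min_retained are hypothetical names, not part of the current LionForests code.
def apply_reduction(paths, reduced_paths, min_retained=0.5):
    # Accept the reduced path set only if it keeps at least `min_retained`
    # of the paths that originally voted for the predicted class.
    if len(reduced_paths) >= min_retained * len(paths):
        return reduced_paths
    return paths  # otherwise fall back to the full set of paths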
   Future research will explore the impact of tuning parameters, such as the number of estimators or the maximum number of features, on the reduction of features and paths. We also aim to apply LionForests to tree ensembles other than random forests, as well as to various datasets and data types. FPGrowth [24] and its variant FPMax [19] will be tested against the Apriori algorithm. In addition, we will consider adapting LionForests to other tasks, such as multi-class and multi-label classification, as well as regression. We will also explore the possibility of using LionForests interpretations to provide descriptive narratives through counter-examples. Ultimately, by means of a qualitative, human-oriented analysis, we will try to demonstrate the intelligibility of this promising method and its necessity as a foundation for human-centred artificial intelligence systems based on interpretable ML methods.

ACKNOWLEDGEMENTS

This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825619, AI4EU Project (https://www.ai4eu.eu).
REFERENCES

[1] Amina Adadi and Mohammed Berrada, 'Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)', IEEE Access, 6, 52138–52160, (2018).
[2] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami, 'Mining association rules between sets of items in large databases', in ACM SIGMOD Record, volume 22, pp. 207–216. ACM, (1993).
[3] Rakesh Agrawal, Ramakrishnan Srikant, et al., 'Fast algorithms for mining association rules', in Proc. 20th Int. Conf. Very Large Data Bases, VLDB, volume 1215, pp. 487–499, (1994).
[4] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek, 'On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation', PloS ONE, 10(7), e0130140, (2015).
[5] Christian Bauckhage, 'Numpy/scipy recipes for data science: k-medoids clustering', ResearchGate.net, February, (2015).
[6] Leo Breiman, 'Random forests', Machine Learning, 45(1), 5–32, (2001).
[7] Angela Chen. IBM's Watson gave unsafe recommendations for treating cancer. https://cutt.ly/keHQDma, 2018. Accessed: 2019-11-18.
[8] Tianqi Chen and Carlos Guestrin, 'XGBoost: A scalable tree boosting system', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, (2016).
[9] H. A. Chipman, E. I. George, and R. E. McCulloch, 'Making sense of a forest of trees', Computing Science and Statistics, 84–92, (1998).
[10] Samantha Cole. This trippy t-shirt makes you invisible to AI. https://cutt.ly/FeHQHAa, 2019. Accessed: 2019-11-18.
[11] Houtao Deng, 'Interpreting tree ensembles with inTrees', International Journal of Data Science and Analytics, 7(4), 277–287, (2019).
[12] Pedro Domingos, 'Knowledge discovery via multiple models', Intelligent Data Analysis, 2(1-4), 187–202, (1998).
[13] Filip Karlo Došilović, Mario Brčić, and Nikica Hlupić, 'Explainable artificial intelligence: A survey', in 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 0210–0215. IEEE, (2018).
[14] Mengnan Du, Ninghao Liu, and Xia Hu, 'Techniques for interpretable machine learning', arXiv preprint arXiv:1808.00033, (2018).
[15] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[16] Niall Firth. Apple Card is being investigated over claims it gives women lower credit limits. https://cutt.ly/oeGYCx5, 2019. Accessed: 2019-11-18.
[17] Aaron Fisher, Cynthia Rudin, and Francesca Dominici, 'All models are wrong but many are useful: Variable importance for black-box, proprietary, or misspecified prediction models, using model class reliance', arXiv preprint arXiv:1801.01489, (2018).
[18] Alex Goldstein, Adam Kapelner, Justin Bleich, and Emil Pitkin, 'Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation', Journal of Computational and Graphical Statistics, 24(1), 44–65, (2015).
[19] Gösta Grahne and Jianfei Zhu, 'Efficiently using prefix-trees in mining frequent itemsets', in FIMI, volume 90, (2003).
[20] Stefan Th. Gries, 'On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement', Corpus Linguistics and Linguistic Theory, (2019).
[21] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi, 'A survey of methods for explaining black box models', ACM Computing Surveys (CSUR), 51(5), 93, (2019).
[22] Maissae Haddouchi and Abdelaziz Berrado, 'A survey of methods and tools used for interpreting random forest', in 2019 1st International Conference on Smart Systems and Data Science (ICSSD), pp. 1–6. IEEE, (2019).
[23] Tameru Hailesilassie, 'Rule extraction algorithm for deep neural networks: A review', arXiv preprint arXiv:1610.05267, (2016).
[24] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao, 'Mining frequent patterns without candidate generation: A frequent-pattern tree approach', Data Mining and Knowledge Discovery, 8(1), 53–87, (2004).
[25] Satoshi Hara and Kohei Hayashi, 'Making tree ensembles interpretable: A Bayesian model selection approach', in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, eds., Amos Storkey and Fernando Perez-Cruz, volume 84 of Proceedings of Machine Learning Research, pp. 77–85, Playa Blanca, Lanzarote, Canary Islands, (09–11 Apr 2018). PMLR.
[26] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Science & Business Media, 2009.
[27] Leonard Kaufman and Peter J. Rousseeuw, 'Clustering by means of medoids. Statistical data analysis based on the L1 norm', Y. Dodge, Ed., 405–416, (1987).
[28] Ron Kohavi, 'Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid', in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, (1996).
[29] Zachary C. Lipton, 'The mythos of model interpretability', Commun. ACM, 61(10), 36–43, (September 2018).
[30] Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee, 'Explainable AI for trees: From local explanations to global understanding', arXiv preprint arXiv:1905.04610, (2019).
[31] Scott M. Lundberg and Su-In Lee, 'A unified approach to interpreting model predictions', in Advances in Neural Information Processing Systems 30, eds., I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–4774, Curran Associates, Inc., (2017).
[32] Laurens van der Maaten and Geoffrey Hinton, 'Visualizing data using t-SNE', Journal of Machine Learning Research, 9(Nov), 2579–2605, (2008).
[33] Ioannis Mollas, Nikolaos Bassiliades, and Grigorios Tsoumakas, 'Lionets: Local interpretation of neural networks through penultimate layer decoding', in ECML PKDD 2019 AIMLAI XKDD Workshop, Würzburg, Germany, (2019).
[34] Alexander Moore, Vanessa Murdock, Yaxiong Cai, and Kristine Jones, 'Transparent tree ensembles', in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1241–1244. ACM, (2018).
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825–2830, (2011).
[36] General Data Protection Regulation, 'Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46', Official Journal of the European Union (OJ), 59(1-88), 294, (2016).
[37] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, 'Why should I trust you?: Explaining the predictions of any classifier', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, (2016).
[38] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, 'Anchors: High-precision model-agnostic explanations', in Thirty-Second AAAI Conference on Artificial Intelligence, (2018).
[39] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller, 'Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models', arXiv preprint arXiv:1708.08296, (2017).
[40] Gabriele Tolomei, Fabrizio Silvestri, Andrew Haines, and Mounia Lalmas, 'Interpretable predictions of tree-based ensembles via actionable feature tweaking', in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 465–474. ACM, (2017).
[41] James Vincent. This colorful printed patch makes you pretty much invisible to AI. https://cutt.ly/TeHQJHU, 2019. Accessed: 2019-11-18.
[42] Alexander von Eye and Clifford C. Clogg, Categorical Variables in Developmental Research: Methods of Analysis, Elsevier, 1996.
[43] Xun Zhao, Yanhong Wu, Dik Lun Lee, and Weiwei Cui, 'iForest: Interpreting random forests via visual analytics', IEEE Transactions on Visualization and Computer Graphics, 25(1), 407–416, (2018).
[44] Yichen Zhou and Giles Hooker, 'Interpreting models via single tree approximation', arXiv preprint arXiv:1610.09036, (2016).
[45] Andreas Ziegler and Inke R. König, 'Mining data with random forests: current options for real-world applications', Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(1), 55–63, (2014).

