BoostEMM — Transparent Boosting using Exceptional Model Mining

Simon van der Zon 1, Oren Zeev Ben Mordehai 1, Tom Vrijdag 1, Werner van Ipenburg 2, Jan Veldsink 2, Wouter Duivesteijn 1, and Mykola Pechenizkiy 1

1 Eindhoven University of Technology, the Netherlands,
{s.b.v.d.zon, o.zeev.ben.mordehay, w.duivesteijn, m.pechenizkiy}@tue.nl, t.s.vrijdag@student.tue.nl
2 Cooperatieve Rabobank U.A., the Netherlands,
{werner.van.ipenburg, jan.veldsink}@rabobank.nl

Abstract. Boosting is an iterative ensemble-learning paradigm. In every iteration, a weak predictor learns a classification task, taking into account the performance achieved in previous iterations. This is done by assigning weights to individual records of the dataset, which are increased if the record is misclassified by the previous weak predictor. Hence, subsequent predictors learn to focus on problematic records in the dataset. Boosting ensembles such as AdaBoost have been shown to be effective at fighting both high variance and high bias, even in challenging situations such as class imbalance. However, some aspects of AdaBoost might limit its deployment in the real world. On the one hand, focusing on problematic records can lead to overfitting in the presence of random noise. On the other hand, learning a boosting ensemble that assigns higher weights to hard-to-classify people might raise serious questions in the age of responsible and transparent data analytics: if a bank must tell a customer that they are denied a loan because the underlying algorithm made a decision specifically focusing on the customer since they are hard to classify, this could be legally dubious. To kill these two birds with one stone, we introduce BoostEMM: a variant of AdaBoost where in every iteration of the procedure, rather than boosting problematic records, we boost problematic subgroups as found through Exceptional Model Mining.
Boosted records being part of a coherent group should prevent overfitting, and the explicit definitions of the subgroups of people being boosted enhance the transparency of the algorithm.

Keywords: Boosting, class imbalance, Exceptional Model Mining, model transparency, responsible analytics

1 Introduction

Arguably, AdaBoost [10] can be described as one of the best-performing out-of-the-box classifiers. By evaluating the performance of simple base classifiers, the algorithm learns which records of a dataset are easy/hard to classify. Subsequent base classifiers focus on the more problematic records. By properly weighing the decisions of the earlier, blunter base classifiers against the later, more specific ones, an ensemble classifier is built that performs well overall. Hence, AdaBoost as a mechanism shines in devoting appropriate attention of a classifier to those records of the dataset that prove to be problematic.

For all its strengths, AdaBoost also comes with two weaknesses. The records proving problematic for the classifier might encompass outliers, which could lead to overfitting. Moreover, in the age of responsible data analytics, we want to know not only what our algorithm does, but also why it is reasonable. A recent European Parliament resolution on the future of robotics and artificial intelligence in Europe contains the following [9, Section “Ethical principles”, point 12]:

  [. . . ] it should always be possible to supply the rationale behind any decision taken with the aid of AI that can have a substantive impact on one or more persons’ lives; [. . . ] it must always be possible to reduce the AI system’s computations to a form comprehensible by humans;

If decisions on loan applications in a dataset are outsourced to AdaBoost, the customer might demand insight into the reasoning behind a rejection.
When this resolution is put into law, it is likely that a customer demanding insight into the reasoning AdaBoost deployed behind its decision (which has a substantive impact on the customer’s life) must be presented not only with how AdaBoost decides where to focus its extra attention, but also why. Hence, in the near future, financial institutions will shy away from the liability associated with using AdaBoost if there is no transparent form of boosting.

In this paper, we fill that void by proposing BoostEMM. This method combines AdaBoost-style iterative learning of base classifiers, where the focus shifts towards problematic parts of the input space, with Exceptional Model Mining (EMM) [18,5]. This is a local pattern mining method, designed to find subgroups (subsets of the dataset at hand) that satisfy two conditions. On the one hand, subgroups must be interpretable. This is typically enforced by only allowing subgroups that can be defined as a conjunction of a few conditions on input attributes of the dataset; hence, subgroups come in terms that any domain expert can understand. On the other hand, subgroups must be exceptional. This is typically formalized in terms of an unusual interaction between several target attributes; hence, subgroups represent unusual behavior in the dataset. We choose targets that represent the actual class label and the label predicted by base classifiers, and define several quality measures that gauge unusual interaction between those targets. Hence, for various types of bad base classifier performance, EMM finds coherent parts of the classifier input space where this behavior occurs.

1.1 Main Contributions

We provide BoostEMM, a method encompassing core ideas from both AdaBoost and Exceptional Model Mining. By dovetailing these techniques, BoostEMM achieves two benefits over AdaBoost:

1. by defining specific EMM variants (cf. Section 3.3), we steer boosting to punish specific kinds of bad behavior (error rate, class imbalance, FPR/TPR), which is relevant for cost-sensitive applications;
2. by dovetailing an EMM run with every iteration of the boosting algorithm, at every step we can report the subgroups where extra boosting is needed. This adds transparency to the boosting process (cf. Section 3.4), which is relevant in the light of looming EU law.

2 Related Work

The groundbreaking paper on AdaBoost in its traditional form is [10]; this form is also explained in Section 3. A version incorporating class probabilities was introduced in [11]. Convex potential boosting algorithms (which include AdaBoost) cannot handle random classification noise well [20]; boosting tends to overfit towards the noisily labeled records of the dataset, reducing the generality of the learned classification model.

The resurgence of neural networks through the deep learning hype has led to a reappreciation of complex classifiers that perform extremely well on the task for which they have been designed, but whose internal reasoning is far too complex for a human to fully understand. As a reaction, papers have emerged that take a peek into the black box. Some of the first such papers include [12,13] for hard classifiers, and [8] for soft classifiers. These papers share the objective of transparency with BoostEMM, but they do not (as BoostEMM does) loop the interpretable results back into the classification process to improve performance.

The study of how local patterns can aid global models was the topic of the LeGo workshop [16]. This workshop encompassed both papers enhancing existing classifiers with local patterns and papers combining local patterns into a classifier. A few years later, LeGo-EMM [7] enhanced multi-label classifiers with local patterns found through Exceptional Model Mining with the Bayesian networks model class [6], which improved multi-label SVM classification.
These methods have in common that the learning process is a single line: first local patterns are mined, then a subset of those patterns is selected, and finally that subset is used to enhance/replace the feature set of a classifier, with which predictions are subsequently made. Hence, LeGo papers share with BoostEMM the incorporation of local pattern knowledge in classification, but they do not (as BoostEMM does) loop the output back into an iterative learning process.

Exceptional Model Mining seeks to find subgroups of a dataset where several targets interact in an unusual manner. A simpler cousin of EMM is Subgroup Discovery (SD) [15,21,14]: the task of finding subgroups of a dataset where a single target displays an unusual distribution. SD is closely related to Contrast Set Mining (CSM) [1] and Emerging Pattern Mining (EPM) [3]; the results of the latter technique have also been exploited to enhance classification [4], in the style of the LeGo workshop papers discussed in the previous paragraph. The relation between SD, CSM, and EPM is explored in detail in [17].

3 The BoostEMM Method

We are given a dataset Ω, which is a bag of N records r ∈ Ω of the form r = (a_1, . . . , a_k, ℓ), where {a_1, . . . , a_k} are the input attributes of the dataset, taken from some collective domain A, and ℓ is the class label, typically taken to be binary. If we need to refer to a specific record (or its corresponding data components), we do so by superscripts: the i-th record is denoted r^i, ℓ^i is its class label, and the value of its j-th input attribute is a^i_j. We also use the shorthand a^i to denote the collective input attribute values of r^i: a^i = (a^i_1, . . . , a^i_k). The goal of classification is to find a mapping P from the input attribute space to the class label, P : A → {0, 1}, such that we can predict the latter for unseen instances of the former. In boosting, these predictions are improved through the following methodology.
We iteratively build one strong learner or expert P = (E, W). This is done by constructing an ensemble E of h weak learners or predictors (i.e., classifiers that perform (slightly) better than random), E = (P_1, . . . , P_h), and an associated tuple W = (w_1, . . . , w_h) of weights related to the performance of each predictor. AdaBoost obtains the weight for each weak classifier P_i by a transformation (cf. Section 3.2) of its error rate err_i. In each iteration of the boosting process, a new weak learner P_j is constructed, taking into account the whole training set but also the up-to-date priorities, or weights, of the records. These record weights are maintained as another tuple of N weights, W = (w^1, . . . , w^N), associated with the records of the dataset. In the first iteration, all training data (unless given initial weights by the end user) are initialized with equal weights w^i = 1/N, and the first weak learner is trained. In subsequent iterations, the weights are increased for all samples that are classified incorrectly by P, after which the weights are normalized. In later sections, we replace this selection mechanism for the erroneous samples by an equivalent EMM function defining the subgroups to be boosted. AdaBoost updates the record weights in a manner similar to the classifier weights, by a transformation (cf. Section 3.2) based on err_j.

3.1 Exceptional Model Mining

We are given a dataset Ω, which is a bag of N records r ∈ Ω of the form (a_1, . . . , a_k, t_1, . . . , t_m), where {a_1, . . . , a_k} are the descriptors of the dataset, and {t_1, . . . , t_m} the targets. The goal of EMM is to find subgroups of the dataset at hand, defined in terms of a conjunction of a few conditions on single descriptors of the dataset (e.g.: a_7 ≤ 3 ∧ a_3 = true), for which the targets interact in an unusual manner.

Definition 1 (Subgroup). A subgroup corresponding to a description D is the bag of records G_D ⊆ Ω that D covers, i.e.
    G_D = { r^i ∈ Ω | D(a^i) = 1 }

From now on we omit the D if no confusion can arise, and refer to the coverage of a subgroup by n = |G|.

In order to objectively evaluate a candidate description in a given dataset, we need to define a quality measure. For each description D in a user-defined description language D, this function quantifies the extent to which the subgroup G_D corresponding to the description deviates from the norm.

Definition 2 (Quality Measure). A quality measure is a function ϕ : D → ℝ that assigns a numeric value to a description D.

3.2 Mining Descriptions for Boosting and Updating the Weights

The input attributes in classification/boosting correspond to the descriptors in EMM. Having trained a weak learner P_j, we generate targets that reflect how well the classification performs on each record. We explore several choices for unusual interaction between these targets in Section 3.3. Having thus defined a model class for EMM, we run the beam search algorithm [5, Algorithm 1] to generate a set D_top-q of subgroups G_D with their associated descriptions D. AdaBoost constructs the subset to be boosted by simply picking all erroneously classified samples. Instead, BoostEMM picks every record that is covered by at least one of the top subgroups found with EMM. Hence, BoostEMM adheres to the following scheme:

    err_j = ( Σ_{i=1}^{N} w^i · I(ℓ^i ≠ P_j(a^i)) ) / ( Σ_{i=1}^{N} w^i )

    α_j = log( (1 − err_j) / err_j )

    w^i ← w^i · exp( α_j · I( ∃D ∈ D_top-q : D(a^i) = 1 ) )

    P(a) = argmax_{ℓ ∈ {0,1}} Σ_{j=1}^{h} α_j · I(P_j(a) = ℓ)

In AdaBoost, the weight update function is instead given by:

    w^i ← w^i · exp( α_j · I(ℓ^i ≠ P_j(a^i)) )

3.3 The Transparent Boosting Model Class for EMM

The missing ingredient in the description of BoostEMM in the previous section is: how do we find the subgroup set D_top-q? We do so by Exceptional Model Mining. Within BoostEMM, every EMM run is encompassed by the boosting process.
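The weight update scheme of Section 3.2 can be sketched as follows (a minimal NumPy sketch under our own naming; the paper's implementation builds on scikit-learn and is not reproduced here):

```python
import numpy as np

def boostemm_update(w, y, y_pred, in_subgroup):
    """One BoostEMM weight update (cf. Section 3.2).

    w           : current record weights, shape (N,)
    y           : true class labels in {0, 1}
    y_pred      : predictions of the current weak learner P_j
    in_subgroup : boolean mask, True for records covered by at least
                  one of the top-q subgroups found by EMM
    Returns (alpha_j, new normalized record weights).
    """
    # weighted error rate err_j of the weak learner on the whole dataset
    err = np.sum(w * (y != y_pred)) / np.sum(w)
    alpha = np.log((1.0 - err) / err)
    # boost every record covered by a top subgroup; plain AdaBoost would
    # instead use the misclassification mask (y != y_pred)
    w_new = w * np.exp(alpha * in_subgroup)
    return alpha, w_new / np.sum(w_new)
```

Replacing `in_subgroup` by `(y != y_pred)` recovers the AdaBoost update given above.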
Hence, we have just trained a weak learner P_j to predict a specific class label ℓ for every possible input attribute vector a ∈ A. Within this setting, in order to employ EMM, we need to cast the available building blocks into a form that fits the EMM problem specification, as outlined in Section 3.1. We need to describe the dataset in terms of descriptors and targets, formulate a model class over the targets, and define a quality measure over this model class.

For the descriptors in EMM we take the input attributes a^i_1, . . . , a^i_k of the classification task given at the start of Section 3. In the Transparent Boosting model class for EMM there are two targets: the original class label, t^i_1 = ℓ^i, and the class label predicted by the available weak learner P_j, t^i_2 = P_j(a^i). The kind of interaction in which the Transparent Boosting model class is interested is an exceptional discord between the original class label and the predicted class label: where does our weak learner perform not so well?

This question can be answered in many reasonable manners. Which answer we choose depends on what kind of boosting we want to achieve. In EMM, the quality measure governs what exactly we find interesting within the kind of interaction defined by the model class. As is common in EMM, we build up the quality measure from two components: ϕ_TB(D) = ϕ_size(D) · ϕ_dev(D). The latter component measures the degree of exceptionality of the target interaction. Since a large value for this can easily be obtained in tiny subgroups, we need to prevent overfitting by multiplying with a component explicitly representing subgroup size. For this, we take ϕ_size(D) = log(|G_D|). We employ the logarithm here, since we do not want to penalize medium-sized subgroups compared to large subgroups; this component is only meant to discourage tiny subgroups. For the deviation component ϕ_dev(D), we develop four alternatives.
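The compound quality measure ϕ_TB(D) = ϕ_size(D) · ϕ_dev(D) can be sketched as a function taking a pluggable deviation component (a sketch under our own naming; as an example deviation component we use the weighted error rate inside the subgroup, which the paper formalizes as ϕ_err):

```python
import numpy as np

def phi_TB(cover, phi_dev, **kw):
    """Compound quality measure phi_size(D) * phi_dev(D).

    cover   : boolean mask of the records covered by description D
    phi_dev : deviation component, evaluated on the covered records
    The log(|G_D|) factor discourages tiny subgroups without
    penalizing medium-sized subgroups relative to large ones.
    """
    n = int(cover.sum())
    if n == 0:
        return 0.0
    return np.log(n) * phi_dev(cover, **kw)

def phi_err(cover, w, y, y_pred):
    """Weighted error rate of weak learner P_j, restricted to the subgroup."""
    wc = w[cover]
    return np.sum(wc * (y[cover] != y_pred[cover])) / np.sum(wc)
```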
Error-based boosting with ϕ_err. If one is merely interested in the error rate, we define the target interaction exceptionality of the subgroup as follows:

    ϕ_err(D) = ( Σ_{i:D(a^i)=1} w^i · I(ℓ^i ≠ P_j(a^i)) ) / ( Σ_{i:D(a^i)=1} w^i )

This quality measure computes the error rate, but only on the records covered by the subgroup. Hence, unlike AdaBoost, BoostEMM will also boost records of the dataset that were classified correctly by the weak learner. This is deliberate, since it ought to reduce the overfitting effect from which AdaBoost suffers.

Kappa-based boosting with ϕ_κ. In the presence of class imbalance, optimizing for the Kappa statistic is more appropriate than the error rate:

    ϕ_κ(D) = ( acc_obs(D) − acc_exp(D) ) / ( 1 − acc_exp(D) ),   where Σ_w(D) = Σ_{i:D(a^i)=1} w^i

    acc_obs(D) = 1/Σ_w(D) · Σ_{i:D(a^i)=1} w^i · I(ℓ^i = P_j(a^i))

    acc_exp(D) = ( pos(D) · pos_p(D) + neg(D) · neg_p(D) ) / (Σ_w(D))²

    pos(D) = Σ_{i:D(a^i)=1} w^i · I(ℓ^i = 1)      pos_p(D) = Σ_{i:D(a^i)=1} w^i · I(P_j(a^i) = 1)

    neg(D) = Σ_{i:D(a^i)=1} w^i · I(ℓ^i = 0)      neg_p(D) = Σ_{i:D(a^i)=1} w^i · I(P_j(a^i) = 0)

Cost-sensitive boosting with ϕ_FNR and ϕ_FPR. Based on the dataset domain at hand, one might be interested in cost-sensitive classification. When the costs of false negatives and false positives are substantially skewed, one would desire to find subgroups that boost for either of these components. Hence, we employ each type of classification mistake as a deviation component of its own:

    ϕ_FNR(D) = 1/n · Σ_{i:D(a^i)=1} w^i · I(ℓ^i ≠ P_j(a^i) ∧ ℓ^i = 1)

    ϕ_FPR(D) = 1/n · Σ_{i:D(a^i)=1} w^i · I(ℓ^i ≠ P_j(a^i) ∧ ℓ^i = 0)

3.4 Model Transparency

After one has trained an ensemble using boosting, it is insightful to know how the ensemble was constructed (i.e., which data were emphasized most during the boosting process). Especially when boosting for various quality measures, it can be interesting to see how the various boosting strategies behave. We present a visualization that shows the user exactly which regions have (successfully) been boosted the most.
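The Kappa-based and cost-sensitive deviation components of Section 3.3 can be sketched as follows (a NumPy sketch under our own naming; `cover` is the boolean coverage mask of D, and `w`, `y`, `y_pred` are record weights, true labels, and weak-learner predictions):

```python
import numpy as np

def phi_kappa(cover, w, y, y_pred):
    """Weighted Cohen's kappa of the weak learner inside the subgroup."""
    wc, yc, pc = w[cover], y[cover], y_pred[cover]
    sw = wc.sum()                                    # Sigma_w(D)
    acc_obs = np.sum(wc * (yc == pc)) / sw           # observed accuracy
    pos, neg = np.sum(wc * (yc == 1)), np.sum(wc * (yc == 0))
    pos_p, neg_p = np.sum(wc * (pc == 1)), np.sum(wc * (pc == 0))
    acc_exp = (pos * pos_p + neg * neg_p) / sw**2    # chance agreement
    return (acc_obs - acc_exp) / (1.0 - acc_exp)

def phi_fnr(cover, w, y, y_pred):
    """Weighted false-negative component, normalized by n = |G_D|."""
    wc, yc, pc = w[cover], y[cover], y_pred[cover]
    return np.sum(wc * ((yc != pc) & (yc == 1))) / cover.sum()

def phi_fpr(cover, w, y, y_pred):
    """Weighted false-positive component, normalized by n = |G_D|."""
    wc, yc, pc = w[cover], y[cover], y_pred[cover]
    return np.sum(wc * ((yc != pc) & (yc == 0))) / cover.sum()
```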
Our method can show a high number of descriptions by visualizing them in a tree. The tree is constructed by looping over the descriptions. For each description we create a branch:

1. A branch consists of nodes represented by the literals of a description, and the root of the branch is the first literal (which makes sense, since each subsequent literal is a refinement of the description).
2. The last node (literal) of the branch stores the weight of the description. The weight corresponds to the error of the weak learner that was constructed from this description (w).
3. If a literal already exists during creation of the branch, we increase the weight of the existing leaf by the weight of the literal, because the description was used more heavily (i.e., by multiple classifiers). We continue the creation of the branch along the existing path.

After tree construction, we merge sibling leaf nodes defined on the same attribute that originate from the same boosting iteration. For instance, two sibling leaf nodes “age < 20” and “age < 30” can be merged into a single leaf node “age < 30”, since all descriptions from the same round are boosted together. Figure 1 shows the descriptions encountered in a run of the BoostEMM process. The size of the nodes represents the weight, indicating the degree to which the constraint contributes to the selection of samples for boosting.

Fig. 1. Inspection of descriptions used for boosting (size indicates ensemble weight w_i). Descriptions correspond to a conjunction of literals, from root to leaf.

4 Experiments

Three datasets with a binary classification task were used for the experiments (cf. Table 1). The well-known Adults dataset stems from the UCI repository [19]; the positive class is high earners. Credit-card fraud [2] can be found at Kaggle; the task is to detect fraudulent transactions. Metadata for both these datasets is readily available online.
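The tree construction described in Section 3.4 can be sketched as follows (a simplified Python sketch under our own naming; the post-hoc merging of same-attribute sibling leaves from the same iteration is omitted):

```python
class Node:
    """One literal of a description, e.g. "age < 30"."""
    def __init__(self, literal):
        self.literal = literal
        self.weight = 0.0        # accumulated description weight
        self.children = {}       # literal -> Node

def build_description_tree(descriptions):
    """Aggregate boosted descriptions into a tree (cf. Section 3.4).

    descriptions : list of (literals, weight) pairs, where literals is
                   the ordered conjunction defining one description.
    Descriptions sharing a prefix of literals share a path; a
    description seen again accumulates weight at its final node.
    """
    root = Node(None)
    for literals, weight in descriptions:
        node = root
        for lit in literals:
            # reuse the existing path where possible, else branch off
            node = node.children.setdefault(lit, Node(lit))
        node.weight += weight    # last node stores the description weight
    return root
```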
The third dataset, however, is proprietary: Rabobank provided an anonymized real-life dataset related to on-line fraud. Rabobank is a financial institution, one of the three biggest banks in the Netherlands. Rabobank supplies a full range of financial products to its customers, including internet and mobile banking. The dataset we work with encompasses over 30 million samples of internet transactions and mobile payments, performed from January 2016 up to February 2017. Most of these samples represent genuine, non-fraudulent transactions, but a tiny fraction (∼2 000) were manually marked as fraudulent by domain experts. Known types of fraud include trojan attacks, phishing, and ID takeover. For trojan attacks and phishing, the client takes part in the fraudulent transaction by providing the two-factor authentication to the bank, after being misled by the fraudster into doing so on a payment prepared by the fraudster. In ID takeover, the fraudster has stolen the credentials of the client and is able to provide the authentication herself. Typically, different attackers and attack types occur in the same period. As attacks are being blocked, the modus operandi is changed or renewed within days or weeks. Old attacks are retried over time, and new attack vectors show up.

Each record consists of a timestamp, an identification hash, a binary label to indicate fraud, and 1 013 anonymized features. Attributes are masked by renaming. Each attribute is inspected algorithmically; if an attribute contains no more than 200 unique values, it is considered to be a code, which is recoded to a numeric value. Numeric and text values are frequency-based transformed into up to 801 bins.

Table 1. Datasets, with both original and ‘preprocessed’ feature sets used in practice.

    Dataset       #orig. attr.   #prep. attr.   #train Neg./Pos.   #test Neg./Pos.
    Adults              14            104        29 741/9 332       7 414/2 355
    Credit-card         30             29       227 451/394        56 864/98
    Rabobank          2487            200         2 015/385           513/87
Base attributes are constructed from the current transaction, as well as from the history of the accountholder and the beneficiary. Aggregations found to be useful for the current business rule system were added. The task in the Rabobank dataset is to predict whether transactions are fraudulent (label = 1) or not (label = 0), having learned from historical data only. In order to be useful alongside the current fraud detection, the bank requires the FPR to be stable and far below 1:10 000.

4.1 Experimental Setup

All datasets are imbalanced: there are substantially more negative records than positive ones. Since we build on scikit-learn’s Python Decision Tree implementation as weak learner, we use dummy variables to handle categorical variables in the Adults dataset. For the Credit-card and Rabobank datasets, all values were given as numeric in the first place. We discard the ‘Time’ column in the Credit-card dataset. The beam search algorithm for EMM [5, Algorithm 1] is parametrized with search width w = 3 and search depth d = 3, and incorporates the top-q subgroups into BoostEMM with q = 6. We use an AdaBoost implementation with decision stumps (i.e., decision trees of depth 1).

4.2 Experimental Results

We run comparative experiments with seven competitors; results can be found in Table 2. The models are Straw Man (majority class), AdaBoost SAMME [10], AdaBoost SAMME.R [11], and BoostEMM with each of the four target interaction exceptionality components (cf. Section 3.3). For each competitor we report accuracy evaluated on a withheld test set. Since all datasets are imbalanced, we also report Kappa and AUC. For all measures, higher is better.

Since Adults is the only non-anonymized dataset, we present descriptions discovered during the BoostEMM training for a qualitative inspection. Subgroups that are deemed most problematic by the four compound quality measures in the first five iterations of the BoostEMM process can be found in Tables 3–6.

Table 2. Results of comparative experiments.

                            Adults               Credit-card          Rabobank
    ML model            Acc.   κ     AUC     Acc.   κ     AUC     Acc.   κ     AUC
    Straw Man           0.759  0.0   0.5     0.998  0.0   0.5     0.855  0.0   0.5
    AdaBoost SAMME      0.855  0.566 0.757   0.999  0.758 0.852   0.917  0.647 0.808
    AdaBoost SAMME.R    0.869  0.616 0.787   0.999  0.815 0.883   0.91   0.623 0.799
    BoostEMM ϕ_err      0.808  0.289 0.606   0.999  0.310 0.592   0.9    0.442 0.66
    BoostEMM ϕ_κ        0.759  0.0   0.5     0.999  0.601 0.735   0.855  0.0   0.5
    BoostEMM ϕ_FNR      0.241  0.0   0.241   0.999  0.598 0.907   0.145  0.0   0.5
    BoostEMM ϕ_FPR      0.759  0.0   0.5     0.999  0.234 0.566   0.855  0.0   0.5

Table 3. Top subgroups found by BoostEMM ϕ_err in the first five iterations (Adults).

    Iter.  D                                                              |G_D|   TN    FP    FN   TP
    1      capital-gain ≥ 6896.48 ∧ age ≥ 19.52 ∧ education 7th-8th ≠ 1    1632   17     0  1615    0
    2      capital-loss ≥ 1802.48 ∧ marital-status Married-civ-spouse = 1
           ∧ capital-loss ≤ 1952.69                                         493   16     0   477    0
    3      marital-status Married-civ-spouse = 1 ∧ education-num ≤ 7.72
           ∧ capital-gain ≤ 3448.24                                        1847    0  1662     0  185
    4      capital-loss ≥ 1802.48 ∧ education-num ≥ 7.72 ∧ age ≥ 27.07     1041  240     0   801    0
    5      capital-gain ≥ 6896.48 ∧ age ≥ 19.52 ∧ education 7th-8th ≠ 1    1632   17     0  1615    0

5 Discussion

Table 2 shows that BoostEMM can sometimes match the performance of AdaBoost, but sometimes it does not do so well. As expected, ϕ_err mimics AdaBoost best in terms of pure performance; it barely loses accuracy on the Rabobank and Credit-card datasets in comparison with AdaBoost, while it has to cede some ground on the Adults dataset. Interestingly, while BoostEMM with ϕ_FNR leads to substantial accuracy loss on two of the datasets, it performs unexpectedly well on the third. All methods have a high accuracy on the Credit-card dataset, but in terms of AUC, ϕ_FNR outperforms all other methods, including AdaBoost. When we inspect the subgroups in more detail (cf. Tables 3–6), we obtain more transparency into, and hence accountability for, the boosting process.
This transparency is augmented by the visualization introduced in Figure 1. Additionally, from Table 6 we learn that BoostEMM suffers from a familiar problem in data mining. This table follows the process of boosting subgroups featuring an unusually high False Positive Rate. As the table shows, the top subgroups found in the first five iterations have an undefined FPR: there are no positives in these subgroups at all. This is caused by the first weak learner assigning all records to the majority class, which is negative: the process only features true and false negatives! In this setting, FPR boosting makes no sense. Therefore, in future work, we plan to tackle this problem by dovetailing the various kinds of boosting BoostEMM has to offer.

Table 4. Top subgroups found by BoostEMM ϕ_κ in the first few iterations (Adults).

    Iter.  D                                                                          |G_D|    TN    FP    FN  TP
    1      marital-status Separated ≠ 1 ∧ native-country Holand-Netherlands ≠ 1       37824  28562    0  9262   0
    2      occupation Prof-specialty ≠ 1 ∧ native-country Holand-Netherlands ≠ 1      34088  26982    0  7106   0
    3      occupation Other-service ≠ 1 ∧ native-country Holand-Netherlands ≠ 1       35157  25992    0  9165   0
    4      occupation Adm-clerical ≠ 1 ∧ native-country Holand-Netherlands ≠ 1        34592  25858    0  8734   0
    5      occupation Farming-fishing ≠ 1 ∧ native-country Holand-Netherlands ≠ 1     37892  28695    0  9197   0

Table 5. Top subgroups found by BoostEMM ϕ_FNR in the first few iterations (Adults).

    Iter.  D                                                                            |G_D|   TN   FP    FN  TP
    1      capital-gain ≥ 6896.48 ∧ age ≥ 19.52 ∧ education 7th-8th ≠ 1                  1632   17    0  1615   0
    2      capital-loss ≥ 1802.48 ∧ marital-status Married-civ-spouse = 1
           ∧ capital-loss ≤ 1952.69                                                       493   16    0   477   0
    3      capital-gain ≥ 24137.69 ∧ marital-status Married-civ-spouse ≠ 1 ∧ age ≥ 19.52   98    1    0    97   0
    4      capital-loss ≥ 1802.48 ∧ marital-status Married-civ-spouse = 1 ∧ age > 27.07   887  133    0   754   0
    5      capital-gain ≥ 24137.69 ∧ marital-status Married-civ-spouse ≠ 1 ∧ age ≥ 19.52   98    1    0    97   0
A similarly detailed investigation as the one in Tables 3–6 has been made for the Rabobank dataset. Here, the attribute names are all obfuscated; we find them in the form C 0010. However, we presented the resulting subgroups in such tables to domain experts at Rabobank, who possess the key to translate obfuscated features back to real-life information. They reported back that the subgroups focus on the historical behavior of the customer or counterparty. Subgroups reported in the first iteration make the initial, rough cut. Subgroups reported in the second iteration refine it towards a specific modus operandi. Client confidentiality disallows us to discuss more details about these subgroups, but the domain experts confirm that the problematic areas have a clear meaning to them, which provides us with confidence that BoostEMM indeed adds the desired transparency to the boosting process.

Table 6. Top subgroups found by BoostEMM ϕ_FPR in the first few iterations (Adults).

    Iter.  D                                       |G_D|    TN   FP    FN  TP
    1      native-country Holand-Netherlands ≠ 1   39073  29741   0  9332   0
    2      native-country Holand-Netherlands ≠ 1   39073  29741   0  9332   0
    3      native-country Holand-Netherlands ≠ 1   39073  29741   0  9332   0
    4      native-country Holand-Netherlands ≠ 1   39073  29741   0  9332   0
    5      native-country Holand-Netherlands ≠ 1   39073  29741   0  9332   0

References

1. S.D. Bay, M.J. Pazzani. Detecting Change in Categorical Data: Mining Contrast Sets. Proc. KDD, pp. 302–306, 1999.
2. A. Dal Pozzolo, O. Caelen, R.A. Johnson, G. Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. Proc. SSCI, pp. 159–166, 2015.
3. G. Dong, J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. Proc. KDD, pp. 43–52, 1999.
4. G. Dong, K. Ramamohanarao. Enhancing Traditional Classifiers Using Emerging Patterns. In: G. Dong, J. Bailey (eds.): Contrast Data Mining: Concepts, Algorithms, and Applications, pp. 187–196, CRC Press, 2013.
5. W. Duivesteijn, A.J. Feelders, A. Knobbe. Exceptional model mining. Data Mining and Knowledge Discovery 30(1):47–98, 2016.
6. W. Duivesteijn, A. Knobbe, A. Feelders, M. van Leeuwen. Subgroup Discovery meets Bayesian networks — an Exceptional Model Mining approach. Proc. ICDM, pp. 158–167, 2010.
7. W. Duivesteijn, E. Loza Mencía, J. Fürnkranz, A.J. Knobbe. Multi-label LeGo — Enhancing Multi-label Classifiers with Local Patterns. Proc. IDA, pp. 114–125, 2012.
8. W. Duivesteijn, J. Thaele. Understanding Where Your Classifier Does (Not) Work — The SCaPE Model Class for EMM. Proc. ICDM, pp. 809–814, 2014.
9. European Parliament. Resolution of 16 February 2017 with recommendations to the Commission on Civil Law Rules on Robotics (2015/2103(INL)). http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+TA+P8-TA-2017-0051+0+DOC+XML+V0//EN [accessed July 3, 2017], 2017.
10. Y. Freund, R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139, 1997.
11. J. Friedman, T. Hastie, R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics 28(2):337–407, 2000.
12. A. Henelius, K. Puolamäki, H. Boström, L. Asker, P. Papapetrou. A peek into the black box: exploring classifiers by randomization. Data Mining and Knowledge Discovery 28(5–6):1503–1529, 2014.
13. A. Henelius, K. Puolamäki, I. Karlsson, J. Zhao, L. Asker, H. Boström, P. Papapetrou. GoldenEye++: A Closer Look into the Black Box. Proc. SLDS, pp. 96–105, 2015.
14. F. Herrera, C.J. Carmona, P. González, M.J. del Jesús. An overview on subgroup discovery: foundations and applications. Knowledge and Information Systems 29(3):495–525, 2011.
15. W. Klösgen. Explora: A Multipattern and Multistrategy Discovery Assistant. Advances in Knowledge Discovery and Data Mining, pp. 249–271, 1996.
16. A. Knobbe, B. Crémilleux, J. Fürnkranz, M. Scholz. From Local Patterns to Global Models: The LeGo Approach to Data Mining. Proc. LeGo: From Local Patterns to Global Models workshop @ ECML/PKDD, pp. 1–16, 2008.
17. P. Kralj Novak, N. Lavrač, G.I. Webb. Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining. Journal of Machine Learning Research 10:377–403, 2009.
18. D. Leman, A. Feelders, A.J. Knobbe. Exceptional Model Mining. Proc. ECML/PKDD (2), pp. 1–16, 2008.
19. M. Lichman. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml. University of California, Irvine, School of Information and Computer Sciences, 2013.
20. P.M. Long, R.A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning 78(3):287–304, 2010.
21. S. Wrobel. An Algorithm for Multi-relational Discovery of Subgroups. Proc. PKDD, pp. 78–87, 1997.