<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rule Learning from Time-Dependent Data Applied to Fraud Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marine Collery</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In the financial environment, fraud detection is a challenging problem with tremendous financial impacts, where data is highly unbalanced, sequential and timestamped. An additional constraint comes from the fact that common machine learning methods cannot be used alone for fraud detection, as every decision made in order to label a transaction as fraudulent needs to be explainable and the complete model understandable. The use of a symbolic language, such as understandable classification rules, is therefore preferred or even required.</p>
      </abstract>
      <kwd-group>
        <kwd>Rule Learning</kwd>
        <kwd>Fraud Detection</kwd>
        <kwd>Time-Dependent Data</kwd>
        <kwd>Business Rules</kwd>
        <kwd>Interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1 IBM France Lab
2 Inria Saclay Île-de-France</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>For a few decades now, rule systems have been widely adopted in different
industrial fields. Business Rule Management Systems (BRMS) offer an intuitive,
human-readable and comprehensible way to define business rules and hide the
computational aspect from the business user.</p>
      <p>With the growth of machine learning in the past years, due to newly
available computational power combined with a growing number of accessible
datasets, improving the quality of a learned predictive model has been an important
research interest. Today, impressive models are learned, but they can lack the transparency,
interpretability and understandability characteristics that are required and
essential for numerous application fields. Those models, and especially the ones
based on neural networks, are commonly referred to as "black boxes". Focus is
progressively shifting towards providing an explanation for the decisions a learned
model took, as well as building interpretable, understandable and transparent
models from scratch.</p>
      <p>Combining the comprehensibility of business rules with the power of machine learning to
tackle the problem is the approach we are focusing on in this research project.</p>
      <p>This strategy is considered in the context of fraud detection, which comes with
a complex learning problem as well as a full-transparency requirement.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>Interpretability, interpretation, and explainability With the growth of
high-performance, non-interpretable black-box models, an important question is
raised: to what extent can a model be considered trustworthy, especially for
high-stakes decision making? Different terms are commonly used when referring to
this problem; we clarify their meaning here for further use. Model interpretability
is the ability (of the model) to explain or to present in understandable terms to
a human [15, 8]. Model rationale is how the model takes decisions. Interpretation
and explanation (methods) will be considered equivalent in this paper (subtle
differences are not considered). They both refer to methods that explain or
translate the model rationale.</p>
      <p>The context of fraud detection There are multiple types of financial fraud,
from credit card fraud to insurance fraud, which come with different detection
solutions, as described by J. West et al. [23]. Credit card fraud detection was,
for example, studied with sequential and non-sequential learning methods by J.
Jurgovsky et al. [13], and the two approaches led to different types of frauds being
detected. A spatio-temporal attention-based neural network for credit card fraud
detection was recently introduced by D. Cheng et al. [5] and brought
promising results for detecting `suspicious transactions and mining fraud
patterns'. However, as pointed out by J. Guo et al. [11], allowing for more long-range
dependencies than common machine learning models can help identify repeated
or cyclical appearances of fraudulent events, which seem to be the hardest to
catch. Very recently, tensor networks were used for anomaly detection [22], where
the model outperformed deep and classical algorithms on tabular datasets and
achieved competitive results on image datasets.</p>
      <p>
        Rule learning Another approach to detecting anomalies in runtime process logs,
taken by K. Böhmer et al., is rule mining [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It comes with some specific benefits,
especially explainability. In contrast to machine learning models, rules are
symbolic and key to bringing understandable artificial intelligence.
      </p>
      <p>Interpretable models should even be preferred to explaining uninterpretable
models a posteriori for any high-stakes decision, according to C. Rudin [20].
However, in some contexts rule-based models are not considered fully interpretable.
Indeed, as presented by Z. C. Lipton in [16], given the limited capacity of
human cognition, once a model reaches a sufficiently high dimension we could consider
it less interpretable than a simple compact neural network.</p>
      <p>Combining logic rules and deep neural networks is proposed by Z. Hu et al.
[12] to enhance the neural network's capabilities. This approach could actually
also be used for rule learning. We can also mention recent work from I. Kraiem,
who applied rule learning to multiple-anomaly detection [14], and M. Guillame-Bert, who
presented an association rule learning approach for temporal noisy data [10].</p>
      <p>
        More global approaches are proposed in [21] to induce if-then-else rules that
explain the predictions of supervised learning models, or in [18] to learn
compositional rules with very little data. As explained in [9], there are two main base
families of methods to induce rulesets from training data: extracting rules from
a decision tree (examples: CART [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and C4.5 [19]) or sequential covering, that
is, learning rules directly from the data (examples: CN2 [6] and RIPPER [7]).
      </p>
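      <p>Sequential covering can be sketched as follows. This is a minimal illustration, not the CN2 or RIPPER algorithms themselves (those add search beams, significance tests and pruning); it assumes categorical attributes and a hypothetical list of labeled transactions, greedily adding equality conditions and removing the positives each learned rule covers.

```python
def learn_rule(positives, negatives, attributes):
    """Greedily add (attribute == value) conditions that best separate
    the remaining positives from the negatives."""
    conditions = {}
    pos, neg = list(positives), list(negatives)
    while neg:
        best = None
        for attr in attributes:
            if attr in conditions:
                continue
            for value in {ex[attr] for ex in pos}:
                p = sum(1 for ex in pos if ex[attr] == value)
                n = sum(1 for ex in neg if ex[attr] == value)
                precision = p / (p + n) if p + n else 0.0
                if best is None or precision > best[0]:
                    best = (precision, attr, value)
        if best is None:
            break  # no attribute left to specialize on
        _, attr, value = best
        conditions[attr] = value
        pos = [ex for ex in pos if ex[attr] == value]
        neg = [ex for ex in neg if ex[attr] == value]
    return conditions

def sequential_covering(examples, attributes, label="fraud"):
    """Learn rules one at a time, removing the positives each rule covers."""
    pos = [ex for ex in examples if ex[label]]
    neg = [ex for ex in examples if not ex[label]]
    rules = []
    while pos:
        rule = learn_rule(pos, neg, attributes)
        covered = [ex for ex in pos
                   if all(ex[a] == v for a, v in rule.items())]
        if not rule or not covered:
            break
        rules.append(rule)
        pos = [ex for ex in pos if ex not in covered]
    return rules
```

The "weak internal data representation" limitation discussed in section 4.2 is visible here: conditions only ever test one original attribute against one constant.</p>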
      <p>We can also refer to Inductive Logic Programming (ILP), introduced by S.
Muggleton in 1991 [17], where an ILP system is a program that combines positive
and negative examples with background knowledge and outputs a correct logical
hypothesis. ILP systems rely on two main steps: searching for hypotheses and
then selecting the best one.</p>
    </sec>
    <sec id="sec-4">
      <title>Problem, goals and method</title>
      <p>Problem statement Modeling data in an interpretable and understandable
way is very challenging when working with large-scale, real-world datasets.
Interpretable models are commonly simple and have difficulty learning
complex patterns. Rule-based approaches typically tend to overfit complex patterns
because of the inappropriate simplicity of the rule language available (operators,
aggregates...). The dimensionality of overfitted models makes human understanding
of the model much harder. In the context of fraud detection, with imbalanced
datasets, evolving patterns and time dependency, those limitations are
accentuated.</p>
      <p>Problem How can we learn accurate, understandable and time-dependent rules
for decision making, and in particular for fraud detection problems?
Hypotheses The hypotheses on which the project is built are:
– rule-based models are fully interpretable, or at least more interpretable than
other models;
– machine learning models bring relevant statistical information to learn rules
from;
– sequential models (Hidden Markov Models, Matrix Product State based
models, ...) can bring interesting statistical information to learn rules from;
– fraud detection is a relevant application domain to illustrate the problem;
– an ideal trade-off between bias and variance can be found to generate rules
out of different fraud patterns (the more complex the patterns are, the harder it
is to learn rules and generalize).</p>
      <p>Purpose The purpose of this project is to induce sets of accurate and
understandable rules with, or from, machine learning models on time-dependent data.
It will help achieve fraud detection and prediction in the challenging context
of finance and banking environments, where full interpretability is required. A
longer-term objective is to integrate the induction solutions found into
IBM products (Operational Decision Manager (ODM) and Automation Decision
Services (ADS)).</p>
      <p>Goals The goal of the project is to build, tune, test and validate one or multiple
solid models and rule learning solutions to detect fraudulent patterns and events
resulting in a fraudulent event. This main project goal can be divided into multiple
goals:
– Acquiring expertise in fraud detection, rule induction and machine learning
models.
– Building one or more models and rule learning solutions, as well as an
evaluation process, to answer the stated problem.
– Experimenting with and validating the proposed solutions on synthetic and real
data.
– Sharing results.</p>
      <p>Tasks The following tasks will be part of this project:
– Write a state-of-the-art analysis of fraud detection models and solutions,
as well as an inventory of known fraud patterns.
– Write a state-of-the-art analysis of rule learning algorithms, as well as existing
solutions to optimize parameter values.
– Propose a mathematical model of the problem by specifying inputs and
outputs.
– Analyze available open-source datasets applicable to the stated problem.
– Experiment with different supervised and unsupervised models found in
state-of-the-art papers (reproduce when possible).
– Define an evaluation and test protocol.
– Work deeply on different approaches to the problem to improve results.
– Experiment on external synthetic data before experimenting in vivo on real
data.
– Present and make available proofs of concept.
– Write papers for conferences, workshops and journals (attend when possible).
– Write the final thesis.</p>
      <p>
        The project will use empirical methods [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The work will be based on
experimenting with specific datasets; performance metrics will be defined in order to
evaluate the results and draw conclusions.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Preliminary findings</title>
      <sec id="sec-5-1">
        <title>Fraud detection data</title>
        <p>This research project benefits from the fact that an IBM partner in the financial
area comes with a perfect use case for the project: detection of fraudulent events
in bank transfers and credit card transactions. Experiments with real data will be
feasible, but with no access to the dataset; only the resulting metrics will be shared.
This provides a good final testing environment but is not satisfactory
at the research level.</p>
        <p>Due to the difficulty of generating or collecting data for fraud detection, for obvious
confidentiality reasons, we have not found an existing reference dataset that
combines all the following conditions:
– Data should be composed of events which are financial transactions (ideally
not just credit card payment transactions).
– The profile of users should be extractable: we need the historical data
of a client in order to predict fraudulent behavior.
– As a consequence, data should include a notion of time.</p>
        <p>
          However, we can still use existing datasets that do not satisfy all of the
above conditions. For example, we can mention the Kaggle dataset Synthetic
Financial Datasets For Fraud Detection [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We learned the importance of feature
preprocessing through the use of this dataset, as shown later in section 4.3.
        </p>
        <p>We are currently searching for appropriate datasets to work on. An
alternative we selected, if we are not able to find viable fraud detection data, is to
start with anomaly detection data, which comes with comparable characteristics:
temporal, unbalanced and evolving patterns (not known when they appear).</p>
      </sec>
      <sec id="sec-5-2">
        <title>Rule language</title>
        <p>Our rule-learning state-of-the-art analysis highlighted an important
limitation on rule conditions in existing learning algorithms. Algorithms such as
RIPPER [7] or CN2 [6], for example, do not scale beyond basic condition
operators. This stems from a weak internal data representation that is based
only on the original attributes. With these conclusions in mind, we list below
increasingly complex rule structures. They reflect the rules we want to be able
to learn in order to describe complex models like fraud detection. In the
following rules, the xi are data attributes, {a, b, c, d, e, f} are fixed values
(numerical or categorical, valid according to the operator in use in the condition)
and ypred is the target class.</p>
        <p>1. Base rule structure (CN2- and RIPPER-like rules):
if x1 &lt; a and x2 &gt; b and x3 = c
then ypred = d</p>
        <p>2. Simple feature comparisons:
if x1 &lt; a and x2 &gt; x1 and x3 = c
then ypred = d</p>
        <p>3. Linear combinations:
if x1 &lt; a and b1·x2 &gt; b and x3/c1 = c2
then ypred = d</p>
        <p>4. Adding aggregates, for example sum, count, min, max, average..., applied
to a set of data. This is particularly useful when working with time-dependent
data. We define Ω, a set of aggregation functions that can have parameters
(Ω1, Ω2 ∈ Ω below):
if Ω1 &lt; a and b1·x2 &gt; x1 + b2 and Ω2(c1) = c2
then ypred = d</p>
        <p>5. Complex structures with aggregates:
if sum(a·e.x2 − e.x1) &gt; d for e ∈ events where e.x1 &gt; b over timewindow(c)
then ypred = d</p>
        <p>6. Complex temporal expressions between events e1 and e2:
if ∃e1 : e1.x1 &gt; 10 and ∃e2 : e2.x1 = e1.x2 where e1.time ∈ [e2.time, now]
then ypred = d</p>
        <p>7. Program induction extension, that is, increasing the complexity of the
right-hand side of the rule, by adding chaining or symbolic regression for
example. A new variable var is defined.
– Chaining:
if x1 &lt; a and x2 &gt; b and x3 = c
then var = x2 + d
if var = e
then ypred = f
– Symbolic regression:
if x1 &lt; a and x2 &gt; b and x3 = c
then ypred += x2 + d</p>
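        <p>To make these structures concrete, here is a minimal sketch of how structures 1 and 5 could be represented and evaluated over transaction events. The attribute names, thresholds and window length are hypothetical; conditions are held as data (attribute, operator, constant triples) so that a learner could manipulate them.

```python
import operator
from datetime import datetime, timedelta

# Structure 1: a base rule is a conjunction of (attribute, operator, constant)
# conditions; operator.lt encodes "attribute value is below the constant".
base_rule = [("amount", operator.lt, 500.0), ("country", operator.eq, "FR")]

def matches(event, rule):
    """True when every condition of the rule holds for the event."""
    return all(op(event[attr], const) for attr, op, const in rule)

# Structure 5: an aggregate over a sliding time window, in the spirit of
# "if sum(e.amount) exceeds d for events e where e.amount exceeds b
#  over timewindow(c), then fraud".
def window_sum_exceeds(events, now, b=100.0, c=timedelta(days=7), d=1000.0):
    """Sum the amounts of in-window events above the floor b, compare to d."""
    recent = [e for e in events
              if c >= now - e["time"] and e["amount"] > b]
    return sum(e["amount"] for e in recent) > d
```

Structure 1 stays within what CN2 and RIPPER can learn; the time-window aggregate of structure 5 is exactly the kind of condition their attribute-only internal representation cannot express.</p>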
      </sec>
      <sec id="sec-5-3">
        <title>First approach</title>
        <p>
          The first approach taken to learn rules with linear combinations (step 3) is to
use a data-driven preprocessing approach. As pointed out by Li et al. [15], data
preprocessing such as augmentation or regularization can impact
interpretability considerably. Very few preprocessing techniques can be used without loss
of interpretability; therefore, a simple linear approach is chosen. It consists in
adding new features to the data provided for the learning step. Those new
features are actually linear combinations of the original features. This approach was
chosen following first experiments done with the Synthetic Financial Datasets For
Fraud Detection dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which showed the difficulty of the RIPPER and CN2
algorithms in modeling data that are not ruled by the original features individually.
With the manual introduction of a new feature, results improved considerably,
as shown in Table 1. An automated feature generation process is created with
sum and difference operations. Interpretability is maintained thanks to a
dimensional consistency filter. However, this approach does not scale to more complex
operations and can have impacts on some learning algorithms (for example, on the
RIPPER stopping criterion, which depends on data dimensions).
        </p>
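        <p>The automated generation step can be sketched as follows. This is a simplified reconstruction, not the project's actual implementation; the column names (in the style of the Kaggle dataset) and unit labels are illustrative assumptions. New sum and difference features are derived only from pairs of original features that share a unit, which is what keeps the generated features interpretable.

```python
from itertools import combinations

def generate_features(rows, units):
    """Add pairwise sum and difference columns, keeping only pairs of
    source columns that share a unit (the dimensional consistency filter)."""
    out = [dict(row) for row in rows]
    for f1, f2 in combinations(sorted(units), 2):
        if units[f1] != units[f2]:
            continue  # dimensionally inconsistent pair: skip it
        for row in out:
            row[f"{f1}_plus_{f2}"] = row[f1] + row[f2]
            row[f"{f1}_minus_{f2}"] = row[f1] - row[f2]
    return out

# Hypothetical columns: the two balances share a unit, so only their
# combinations are generated; the time step is never mixed with them.
units = {"oldbalanceOrg": "currency",
         "newbalanceOrig": "currency",
         "step": "hours"}
```

A base-rule learner such as RIPPER can then place a simple threshold condition on a generated column like newbalanceOrig_minus_oldbalanceOrg, which it could not express from the original columns alone.</p>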
      </sec>
      <sec id="sec-5-4">
        <title>Future work and ideas</title>
        <p>An approach that we would like to develop is the use of intermediary models.
Rather than working on the dataset directly, we want to try modeling the data
first with an intermediary model (tensor networks, Bayesian models, etc.) before
learning rules from that new representation of the data. Additionally, further work
on how to approach the temporal aspect of the data needs to be completed. With
a fraud detection dataset, it would be interesting to apply anomaly detection
strategies (supervised and unsupervised), as both domains share data
characteristics (unbalanced, temporal).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we presented the doctoral research project. There is a growing need
for understandable AI models. A rule-based approach is one potential solution,
but such approaches no longer attract the same research interest as black-box models do. We
believe that this approach is a solution for many different kinds of applications,
especially financial applications. Modeling a time-dependent dataset with rules
requires a rule-language complexity that is not currently learnable with
available methods. This research project aims at going in that direction.</p>
      <p>Acknowledgements This thesis project is supported by PSPC AIDA
2019PSPC-09. It is supervised by Philippe Bonnard at IBM France Lab and François
Fages at Inria Saclay.</p>
      <p>5. Cheng, D., Xiang, S., Shang, C., Zhang, Y., Yang, F., Zhang, L.: Spatio-Temporal Attention-Based Neural Network for Credit Card Fraud Detection. Proceedings of the AAAI Conference on Artificial Intelligence 34(01), 362–369 (Apr 2020). https://doi.org/10.1609/aaai.v34i01.5371
6. Clark, P., Niblett, T.: The CN2 Induction Algorithm. Machine Learning 3(4), 261–283 (Mar 1989). https://doi.org/10.1023/A:1022641700528
7. Cohen, W.W.: Fast Effective Rule Induction. In: Proceedings of the Twelfth International Conference on Machine Learning. pp. 115–123. Morgan Kaufmann (1995)
8. Doshi-Velez, F., Kim, B.: Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [cs, stat] (Mar 2017)
9. Fürnkranz, J., Gamberger, D., Lavrač, N.: Foundations of Rule Learning. Springer Science &amp; Business Media (Nov 2012)
10. Guillame-Bert, M.: Apprentissage de règles associatives temporelles pour les séquences temporelles de symboles. p. 158
11. Guo, J., Liu, G., Zuo, Y., Wu, J.: Learning Sequential Behavior Representations for Fraud Detection. In: 2018 IEEE International Conference on Data Mining (ICDM). pp. 127–136 (Nov 2018). https://doi.org/10.1109/ICDM.2018.00028
12. Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing Deep Neural Networks with Logic Rules. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2410–2420. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1228
13. Jurgovsky, J., Granitzer, M., Ziegler, K., Calabretto, S., Portier, P.E., He-Guelton, L., Caelen, O.: Sequence classification for credit-card fraud detection. Expert Systems with Applications 100, 234–245 (Jun 2018). https://doi.org/10.1016/j.eswa.2018.01.037
14. Kraiem, I.B.: Détection d'Anomalies Multiples par Apprentissage Automatique de Règles dans les Séries Temporelles. Ph.D. thesis, Université de Toulouse-Jean Jaurès (Jan 2021)
15. Li, X., Xiong, H., Li, X., Wu, X., Zhang, X., Liu, J., Bian, J., Dou, D.: Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond. arXiv:2103.10689 [cs] (May 2021)
16. Lipton, Z.C.: The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31–57 (Jun 2018). https://doi.org/10.1145/3236386.3241340
17. Muggleton, S.: Inductive logic programming. New Generation Computing 8(4), 295–318 (Feb 1991). https://doi.org/10.1007/BF03037089
18. Nye, M.I., Solar-Lezama, A., Tenenbaum, J.B., Lake, B.M.: Learning Compositional Rules via Neural Program Synthesis. arXiv:2003.05562 [cs] (Mar 2020)
19. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
20. Rudin, C.: Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. arXiv:1811.10154 [cs, stat] (Sep 2019)
21. Sushil, M., Šuster, S., Daelemans, W.: Rule induction for global explanation of trained models. arXiv:1808.09744 [cs, stat] (Aug 2018)
22. Wang, J., Roberts, C., Vidal, G., Leichenauer, S.: Anomaly Detection with Tensor Networks. arXiv:2006.02516 [quant-ph, stat] (Jun 2020)
23. West, J., Bhattacharya, M., Islam, R.: Intelligent Financial Fraud Detection Practices: An Investigation. arXiv:1510.07165 [cs] (Oct 2015)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Synthetic Financial Datasets For Fraud Detection. https://kaggle.com/ntnutestimon/paysim1</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Bock, P.: Getting It Right: R&amp;D Methods for Science and Engineering. Elsevier Science (Apr 2020)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Böhmer, K., Rinderle-Ma, S.: Mining association rules for anomaly detection in dynamic process runtime behavior and explaining the root cause to users. Information Systems 90, 101438 (May 2020). https://doi.org/10.1016/j.is.2019.101438</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. Taylor &amp; Francis (Jan 1984)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>