<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rule Learning from Time-Dependent Data Applied to Fraud Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marine Collery</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In the financial environment, fraud detection is a challenging problem with tremendous financial impacts, where data is highly unbalanced, sequential and timestamped. An additional constraint comes from the fact that common machine learning methods cannot be used alone for fraud detection, as every decision made in order to label a transaction as fraudulent needs to be explainable and the complete model understandable. The use of a symbolic language, such as understandable classification rules, is therefore preferred or even required.</p>
      </abstract>
      <kwd-group>
        <kwd>Rule Learning</kwd>
        <kwd>Fraud Detection</kwd>
        <kwd>Time-Dependent Data</kwd>
        <kwd>Business Rules</kwd>
        <kwd>Interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1 IBM France Lab
2 Inria Saclay Île-de-France</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>For a few decades now, rule systems have been widely adopted in different
industrial fields. Business Rule Management Systems (BRMS) offer an intuitive,
human-readable and comprehensible way to define business rules and hide the
computational aspect from the business user.</p>
      <p>With the growth of machine learning in the past years, due to newly
available computational power combined with a growing number of accessible
datasets, improving the quality of a learned predictive model has been an important
research interest. Today, impressive models are learned, but they can lack the transparency,
interpretability and understandability characteristics that are required and
essential for numerous application fields. Those models, and especially the ones
based on neural networks, are commonly referred to as "black boxes". Focus is
progressively shifting towards providing an explanation for the decisions a learned
model took, as well as building interpretable, understandable and transparent
models from scratch.</p>
      <p>Combining the comprehensibility of business rules with the power of machine learning to
tackle the problem is the approach we are focusing on in this research project.</p>
      <p>This strategy is considered in the context of fraud detection, which comes with
a complex learning problem as well as a full-transparency requirement.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>Interpretability, interpretation, and explainability With the growth of
high-performance, non-interpretable black-box models, an important question is
raised: to what extent can a model be considered trustworthy, especially for
high-stakes decision making? Different terms are commonly used when referring to
this problem; we clarify their meaning here for further use. Model interpretability
is the ability (of the model) to explain or to present in understandable terms to
a human [15, 8]. Model rationale is how the model takes decisions. Interpretation
and explanation (methods) will be considered equivalent in this paper (subtle
differences are not considered). They both refer to methods that explain or
translate the model rationale.</p>
      <p>The context of fraud detection There are multiple types of financial fraud,
from credit card fraud to insurance fraud, which come with different detection
solutions, as described by J. West et al. [23]. Credit card fraud detection was,
for example, studied with sequential and non-sequential learning methods by J.
Jurgovsky et al. [13], and the two approaches led to different types of frauds being
detected. A spatio-temporal attention-based neural network for credit card fraud
detection was recently introduced by D. Cheng et al. [5] and brought
promising results for detecting `suspicious transactions and mining fraud
patterns'. However, as pointed out by J. Guo et al. [11], allowing for more long-range
dependencies than common machine learning models can help identify repeated
or cyclical appearances of fraudulent events, which seem to be the hardest to
catch. Very recently, tensor networks were used for anomaly detection [22], where
the model outperformed deep and classical algorithms on tabular datasets and
achieved competitive results on image datasets.</p>
      <p>
        Rule learning Another approach to detecting anomalies in runtime process logs,
taken by K. Böhmer et al., is rule mining [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It comes with some specific benefits,
especially explainability. In contrast to machine learning models, rules are
symbolic and key to bringing understandable artificial intelligence.
      </p>
      <p>Interpretable models should even be preferred to explaining uninterpretable
models a posteriori for any high-stakes decision, according to C. Rudin [20].
However, in some contexts rule-based models are not considered fully interpretable.
Indeed, as presented by Z. C. Lipton in [16], given the limited capacity of
human cognition, once a model reaches a sufficiently high dimension we could consider
it less interpretable than a simple compact neural network.</p>
      <p>Combining logic rules and deep neural networks is proposed by Z. Hu et al.
[12] to enhance the neural network's capabilities. This approach could actually
also be used for rule learning. We can also mention recent work from I. Kraiem,
who applied rule learning to multiple-anomaly detection [14], and M. Guillame-Bert, who
presented an association rule learning approach for temporal noisy data [10].</p>
      <p>
        More global approaches are proposed in [21] to induce if-then-else rules that
explain the predictions of supervised learning models, or in [18] to learn
compositional rules with very little data. As explained in [9], there are two main base
families of methods to induce rulesets from training data: extracting rules from
a decision tree (examples: CART [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and C4.5 [19]) or sequential covering, that
is, learning rules directly from the data (examples: CN2 [6] and RIPPER [7]).
      </p>
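      <p>Sequential covering can be sketched as follows. This is a minimal illustration, not the CN2 or RIPPER algorithms themselves (those add search beams, significance tests and pruning); it assumes categorical attributes and a hypothetical list of labeled transactions, greedily adding equality conditions and removing the positives each learned rule covers.

```python
def learn_rule(positives, negatives, attributes):
    """Greedily add (attribute == value) conditions that best separate
    the remaining positives from the negatives."""
    conditions = {}
    pos, neg = list(positives), list(negatives)
    while neg:
        best = None
        for attr in attributes:
            if attr in conditions:
                continue
            for value in {ex[attr] for ex in pos}:
                p = sum(1 for ex in pos if ex[attr] == value)
                n = sum(1 for ex in neg if ex[attr] == value)
                precision = p / (p + n) if p + n else 0.0
                if best is None or precision > best[0]:
                    best = (precision, attr, value)
        if best is None:
            break  # no attribute left to specialize on
        _, attr, value = best
        conditions[attr] = value
        pos = [ex for ex in pos if ex[attr] == value]
        neg = [ex for ex in neg if ex[attr] == value]
    return conditions

def sequential_covering(examples, attributes, label="fraud"):
    """Learn rules one at a time, removing the positives each rule covers."""
    pos = [ex for ex in examples if ex[label]]
    neg = [ex for ex in examples if not ex[label]]
    rules = []
    while pos:
        rule = learn_rule(pos, neg, attributes)
        covered = [ex for ex in pos
                   if all(ex[a] == v for a, v in rule.items())]
        if not rule or not covered:
            break
        rules.append(rule)
        pos = [ex for ex in pos if ex not in covered]
    return rules
```

The "weak internal data representation" limitation discussed in section 4.2 is visible here: conditions only ever test one original attribute against one constant.</p>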
      <p>We can also refer to Inductive Logic Programming (ILP), introduced by S.
Muggleton in 1991 [17], where an ILP system is a program that combines positive
and negative examples with background knowledge and outputs a correct logical
hypothesis. ILP systems rely on two main steps: searching for hypotheses and
then selecting the best one.</p>
    </sec>
    <sec id="sec-4">
      <title>Problem, goals and method</title>
      <p>Problem statement Modeling data in an interpretable and understandable
way is very challenging when working with large-scale, real-world datasets.
Interpretable models are commonly simple and have difficulty learning
complex patterns. Rule-based approaches typically tend to overfit complex patterns
because of the inappropriate simplicity of the rule language available (operators,
aggregates...). The dimensionality of overfitted models makes human understanding
of the model much harder. In the context of fraud detection, with imbalanced
datasets, evolving patterns and time dependency, those limitations are
accentuated.</p>
      <p>Problem How can we learn accurate, understandable and time-dependent rules
for decision making, and in particular for fraud detection problems?
Hypotheses The hypotheses on which the project is built are:
– rule-based models are fully interpretable, or at least more interpretable than
other models;
– machine learning models bring relevant statistical information to learn rules
from;
– sequential models (Hidden Markov Models, Matrix Product State based
models, ...) can bring interesting statistical information to learn rules from;
– fraud detection is a relevant application domain to illustrate the problem;
– an ideal trade-off between bias and variance can be found to generate rules
out of different fraud patterns (the more complex the patterns are, the harder it
is to learn rules and generalize).</p>
      <p>Purpose The purpose of this project is to induce sets of accurate and
understandable rules with, or from, machine learning models on time-dependent data.
It will help achieve fraud detection and prediction in the challenging context
of finance and banking environments, where full interpretability is required. A
longer-term objective is to integrate the induction solutions found into
IBM products (Operational Decision Manager (ODM) and Automation Decision
Services (ADS)).</p>
      <p>Goals The goal of the project is to build, tune, test and validate one or multiple
solid models and rule learning solutions to detect fraudulent patterns and events
resulting in a fraudulent event. This main project goal can be divided into multiple
goals:
– Acquiring expertise in fraud detection, rule induction and machine learning
models.
– Building one or more models and rule learning solutions, as well as an
evaluation process, to answer the stated problem.
– Experimenting with and validating the proposed solutions on synthetic and real
data.
– Sharing results.</p>
      <p>Tasks The following tasks will be part of this project:
– Write a state-of-the-art analysis of fraud detection models and solutions,
as well as an inventory of known fraud patterns.
– Write a state-of-the-art analysis of rule learning algorithms, as well as existing
solutions to optimize parameter values.
– Propose a mathematical model of the problem by specifying inputs and
outputs.
– Analyze available open-source datasets applicable to the stated problem.
– Experiment with different supervised and unsupervised models found in
state-of-the-art papers (reproduce when possible).
– Define an evaluation and test protocol.
– Work deeply on different approaches to the problem to improve results.
– Experiment on external synthetic data before experimenting in vivo on real
data.
– Present and make available proofs of concept.
– Write papers for conferences, workshops and journals (attend when possible).
– Write the final thesis.</p>
      <p>
        The project will use empirical methods [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The work will be based on
experimenting with specific datasets; performance metrics will be defined in order to
evaluate the results and draw conclusions.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Preliminary findings</title>
      <sec id="sec-5-1">
        <title>Fraud detection data</title>
        <p>This research project benefits from the fact that an IBM partner in the financial
area comes with a perfect use case for the project: detection of fraudulent events
in bank transfers and credit card transactions. Experiments with real data will be
feasible, but with no access to the dataset; only the resulting metrics will be shared.
This provides a good final testing environment but is not satisfactory
at the research level.</p>
        <p>Due to the difficulty of generating or collecting data for fraud detection, for obvious
confidentiality reasons, we have not found an existing reference dataset that
combines all the following conditions:
– Data should be composed of events which are financial transactions (ideally
not just credit card payment transactions).
– The profile of users should be extractable: we need the historical data
of a client in order to predict fraudulent behavior.
– As a consequence, data should include a notion of time.</p>
        <p>
          However, we can still use existing datasets that do not satisfy all of the
above conditions. For example, we can mention the Kaggle dataset Synthetic
Financial Datasets For Fraud Detection [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We learned the importance of feature
preprocessing through the use of this dataset, as shown later in section 4.3.
        </p>
        <p>We are currently searching for appropriate datasets to work on. An
alternative we selected, if we are not able to find viable fraud detection data, is to
start with anomaly detection data, which comes with comparable characteristics:
temporal, unbalanced and evolving patterns (not known when they appear).</p>
      </sec>
      <sec id="sec-5-2">
        <title>Rule language</title>
        <p>Our rule-learning state-of-the-art analysis highlighted an important
limitation on rule conditions in existing learning algorithms. Algorithms such as
RIPPER [7] or CN2 [6], for example, do not scale beyond basic condition
operators. This stems from a weak internal data representation that is based
only on the original attributes. With these conclusions in mind, we list below
increasingly complex rule structures. They reflect the rules we want to be able
to learn in order to describe complex models like fraud detection. In the
following rules, the xi are data attributes, {a, b, c, d, e, f} are fixed values
(numerical or categorical, valid according to the operator in use in the condition)
and ypred is the target class.</p>
        <p>1. Base rule structure (CN2- and RIPPER-like rules):
if x1 &lt; a and x2 &gt; b and x3 = c
then ypred = d</p>
        <p>2. Simple feature comparisons:
if x1 &lt; a and x2 &gt; x1 and x3 = c
then ypred = d</p>
        <p>3. Linear combinations:
if x1 &lt; a and b1·x2 &gt; b and x3/c1 = c2
then ypred = d</p>
        <p>4. Adding aggregates, for example sum, count, min, max, average..., applied
to a set of data. This is particularly useful when working with time-dependent
data. We define Ω, a set of aggregation functions that can have parameters
(Ω1, Ω2 ∈ Ω below):
if Ω1 &lt; a and b1·x2 &gt; x1 + b2 and Ω2(c1) = c2
then ypred = d</p>
        <p>5. Complex structures with aggregates:
if sum(a·e.x2 − e.x1) &gt; d for e ∈ events where e.x1 &gt; b over timewindow(c)
then ypred = d</p>
        <p>6. Complex temporal expressions between events e1 and e2:
if ∃e1 : e1.x1 &gt; 10 and ∃e2 : e2.x1 = e1.x2 where e1.time ∈ [e2.time, now]
then ypred = d</p>
        <p>7. Program induction extension, that is, increasing the complexity of the
right-hand side of the rule, by adding chaining or symbolic regression for
example. A new variable var is defined.
– Chaining:
if x1 &lt; a and x2 &gt; b and x3 = c
then var = x2 + d
if var = e
then ypred = f
– Symbolic regression:
if x1 &lt; a and x2 &gt; b and x3 = c
then ypred += x2 + d</p>
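        <p>To make these structures concrete, here is a minimal sketch of how structures 1 and 5 could be represented and evaluated over transaction events. The attribute names, thresholds and window length are hypothetical; conditions are held as data (attribute, operator, constant triples) so that a learner could manipulate them.

```python
import operator
from datetime import datetime, timedelta

# Structure 1: a base rule is a conjunction of (attribute, operator, constant)
# conditions; operator.lt encodes "attribute value is below the constant".
base_rule = [("amount", operator.lt, 500.0), ("country", operator.eq, "FR")]

def matches(event, rule):
    """True when every condition of the rule holds for the event."""
    return all(op(event[attr], const) for attr, op, const in rule)

# Structure 5: an aggregate over a sliding time window, in the spirit of
# "if sum(e.amount) exceeds d for events e where e.amount exceeds b
#  over timewindow(c), then fraud".
def window_sum_exceeds(events, now, b=100.0, c=timedelta(days=7), d=1000.0):
    """Sum the amounts of in-window events above the floor b, compare to d."""
    recent = [e for e in events
              if c >= now - e["time"] and e["amount"] > b]
    return sum(e["amount"] for e in recent) > d
```

Structure 1 stays within what CN2 and RIPPER can learn; the time-window aggregate of structure 5 is exactly the kind of condition their attribute-only internal representation cannot express.</p>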
      </sec>
      <sec id="sec-5-3">
        <title>First approach</title>
        <p>
          The first approach taken to learn rules with linear combinations (step 3) is to
use a data-driven preprocessing approach. As pointed out by Li et al. [15], data
preprocessing such as augmentation or regularization can impact
interpretability considerably. Very few preprocessing techniques can be used without loss
of interpretability; therefore, a simple linear approach is chosen. It consists in
adding new features to the data provided for the learning step. Those new
features are actually linear combinations of the original features. This approach was
chosen following first experiments done with the Synthetic Financial Datasets For
Fraud Detection dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which showed the difficulty of the RIPPER and CN2
algorithms in modeling data that are not ruled by the original features individually.
With the manual introduction of a new feature, results improved considerably,
as shown in Table 1. An automated feature generation process is created with
sum and difference operations. Interpretability is maintained thanks to a
dimensional consistency filter. However, this approach does not scale to more complex
operations and can have impacts on some learning algorithms (for example, on the
RIPPER stopping criterion, which depends on data dimensions).
        </p>
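        <p>The automated generation step can be sketched as follows. This is a simplified reconstruction, not the project's actual implementation; the column names (in the style of the Kaggle dataset) and unit labels are illustrative assumptions. New sum and difference features are derived only from pairs of original features that share a unit, which is what keeps the generated features interpretable.

```python
from itertools import combinations

def generate_features(rows, units):
    """Add pairwise sum and difference columns, keeping only pairs of
    source columns that share a unit (the dimensional consistency filter)."""
    out = [dict(row) for row in rows]
    for f1, f2 in combinations(sorted(units), 2):
        if units[f1] != units[f2]:
            continue  # dimensionally inconsistent pair: skip it
        for row in out:
            row[f"{f1}_plus_{f2}"] = row[f1] + row[f2]
            row[f"{f1}_minus_{f2}"] = row[f1] - row[f2]
    return out

# Hypothetical columns: the two balances share a unit, so only their
# combinations are generated; the time step is never mixed with them.
units = {"oldbalanceOrg": "currency",
         "newbalanceOrig": "currency",
         "step": "hours"}
```

A base-rule learner such as RIPPER can then place a simple threshold condition on a generated column like newbalanceOrig_minus_oldbalanceOrg, which it could not express from the original columns alone.</p>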
      </sec>
      <sec id="sec-5-4">
        <title>Future work and ideas</title>
        <p>An approach that we would like to develop is the use of intermediary models.
Rather than working on the dataset directly, we want to try modeling the data
first with an intermediary model (tensor networks, Bayesian models, etc.) before
learning rules from that new representation of the data. Additionally, further work
on how to approach the temporal aspect of the data needs to be completed. With
a fraud detection dataset, it would be interesting to apply anomaly detection
strategies (supervised and unsupervised), as both domains share data
characteristics (unbalanced, temporal).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we presented the doctoral research project. There is a growing need
for understandable AI models. A rule-based approach is one potential solution,
but such approaches no longer attract the same research interest as black-box models do. We
believe that this approach is a solution for many different kinds of applications,
especially financial applications. Modeling a time-dependent dataset with rules
requires a rule-language complexity that is not currently learnable with
available methods. This research project aims at going in that direction.</p>
      <p>Acknowledgements This thesis project is supported by PSPC AIDA
2019PSPC-09. It is supervised by Philippe Bonnard at IBM France Lab and François
Fages at Inria Saclay.</p>
      <p>5. Cheng, D., Xiang, S., Shang, C., Zhang, Y., Yang, F., Zhang, L.: Spatio-Temporal Attention-Based Neural Network for Credit Card Fraud Detection. Proceedings of the AAAI Conference on Artificial Intelligence 34(01), 362–369 (Apr 2020). https://doi.org/10.1609/aaai.v34i01.5371
6. Clark, P., Niblett, T.: The CN2 Induction Algorithm. Machine Learning 3(4), 261–283 (Mar 1989). https://doi.org/10.1023/A:1022641700528
7. Cohen, W.W.: Fast Effective Rule Induction. In: Proceedings of the Twelfth International Conference on Machine Learning. pp. 115–123. Morgan Kaufmann (1995)
8. Doshi-Velez, F., Kim, B.: Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [cs, stat] (Mar 2017)
9. Fürnkranz, J., Gamberger, D., Lavrač, N.: Foundations of Rule Learning. Springer Science &amp; Business Media (Nov 2012)
10. Guillame-Bert, M.: Apprentissage de règles associatives temporelles pour les séquences temporelles de symboles. p. 158
11. Guo, J., Liu, G., Zuo, Y., Wu, J.: Learning Sequential Behavior Representations for Fraud Detection. In: 2018 IEEE International Conference on Data Mining (ICDM). pp. 127–136 (Nov 2018). https://doi.org/10.1109/ICDM.2018.00028
12. Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing Deep Neural Networks with Logic Rules. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2410–2420. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1228
13. Jurgovsky, J., Granitzer, M., Ziegler, K., Calabretto, S., Portier, P.E., He-Guelton, L., Caelen, O.: Sequence classification for credit-card fraud detection. Expert Systems with Applications 100, 234–245 (Jun 2018). https://doi.org/10.1016/j.eswa.2018.01.037
14. Kraiem, I.B.: Détection d'Anomalies Multiples par Apprentissage Automatique de Règles dans les Séries Temporelles. Ph.D. thesis, Université de Toulouse-Jean Jaurès (Jan 2021)
15. Li, X., Xiong, H., Li, X., Wu, X., Zhang, X., Liu, J., Bian, J., Dou, D.: Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond. arXiv:2103.10689 [cs] (May 2021)
16. Lipton, Z.C.: The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31–57 (Jun 2018). https://doi.org/10.1145/3236386.3241340
17. Muggleton, S.: Inductive logic programming. New Generation Computing 8(4), 295–318 (Feb 1991). https://doi.org/10.1007/BF03037089
18. Nye, M.I., Solar-Lezama, A., Tenenbaum, J.B., Lake, B.M.: Learning Compositional Rules via Neural Program Synthesis. arXiv:2003.05562 [cs] (Mar 2020)
19. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
20. Rudin, C.: Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. arXiv:1811.10154 [cs, stat] (Sep 2019)
21. Sushil, M., Šuster, S., Daelemans, W.: Rule induction for global explanation of trained models. arXiv:1808.09744 [cs, stat] (Aug 2018)
22. Wang, J., Roberts, C., Vidal, G., Leichenauer, S.: Anomaly Detection with Tensor Networks. arXiv:2006.02516 [quant-ph, stat] (Jun 2020)
23. West, J., Bhattacharya, M., Islam, R.: Intelligent Financial Fraud Detection Practices: An Investigation. arXiv:1510.07165 [cs] (Oct 2015)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Synthetic Financial Datasets For Fraud Detection. https://kaggle.com/ntnutestimon/paysim1</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Bock, P.: Getting It Right: R&amp;D Methods for Science and Engineering. Elsevier Science (Apr 2020)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Böhmer, K., Rinderle-Ma, S.: Mining association rules for anomaly detection in dynamic process runtime behavior and explaining the root cause to users. Information Systems 90, 101438 (May 2020). https://doi.org/10.1016/j.is.2019.101438</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. Taylor &amp; Francis (Jan 1984)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>