Finding NeMo: Fishing in banking networks using network motifs Xavier Fontes David Aparício Maria Inês Silva Feedzai Feedzai Feedzai Porto, Portugal Porto, Portugal Lisbon, Portugal xavier.fontes@feedzai.com david.aparicio@feedzai.com ines.silva@feedzai.com Beatriz Malveiro João Tiago Ascensão Pedro Bizarro Feedzai Feedzai Feedzai Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal beatriz.malveiro@feedzai.com joao.ascensao@feedzai.com pedro.bizarro@feedzai.com ABSTRACT extract insights about how legitimate and fraudulent transactions Banking fraud causes billion-dollar losses for banks worldwide. "behave" and aid both fraud detection systems and fraud analysts. In fraud detection, graphs help understand complex transaction To the best of our knowledge, this work is the first to find and patterns and discovering new fraud schemes. This work explores explore heterogeneous network motifs in a real-world banking graph patterns in a real-world transaction dataset by extracting setting. Notably, we have two main contributions from this work: and analyzing its network motifs. Since banking graphs are hetero- • We propose a randomization process (i.e., a null model) ade- geneous, we focus on heterogeneous network motifs. Additionally, quate to banking datasets, a vital component for the defini- we propose a novel network randomization process that generates tion of network motifs used to provide the baseline frequen- valid banking graphs. From our exploratory analysis, we conclude cies of each subgraph (detailed in Section 2.2.1). that network motifs extract insightful and interpretable patterns. • We extract network motifs from graphs built from card trans- action data and review them thoroughly. We include an anal- Reference Format: ysis of how the motif significance evolves as more random Xavier Fontes, David Aparício, Maria Inês Silva, Beatriz Malveiro, João networks are used (detailed in Section 3). Tiago Ascensão, and Pedro Bizarro. Finding NeMo: Fishing in banking networks using network motifs. In the 2nd Workshop on Search, The remainder of this paper is organized as follows. Section 2 Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021). presents our method and discusses the key components necessary for our analysis. We describe the data and present results in Sec- tion 3. Usage scenarios are proposed in Section 4. We put forward 1 INTRODUCTION our main takeaways and offer directions for future work in Sec- Payment information theft renders online transactions susceptible tion 5. to fraud. Once detected, fraud entails a reimbursement from the cardholder’s bank, leading to monetary costs to financial institu- 2 METHOD tions and customer friction. In this section we present our methodological choices. First, we Fraud detection thus requires a deep understanding of the un- discuss how we build networks from banking datasets (Section 2.1). derlying fraud patterns, and graphs offer an intuitive way to vi- Then, we introduce heterogeneous network motifs and describe sualize these patterns. One can further leverage graph mining in how to compute and identify these motifs in banking datasets, transaction data to understand fraud schemes in a wide range of with a special focus on the null model and the measure of motif applications. Hajdu and Krész [1] developed a methodology to iden- significance. (Section 2.2) tify cycles in transaction networks as a means to detect fraudulent expenses. Micale et al. [2] retrieved the most frequent patterns in a 2.1 Graph representation relationship network of people involved in the Panama papers to identify the most relevant money laundering structures. Banking datasets usually consist of transactions between entities, In this work, we build networks from a real-world banking such as people, merchants, and businesses. They can include differ- dataset of card purchases and apply a widely used graph mining ent types of transactions such as card payments or bank transfers. tool – heterogeneous network motifs [7, 10]. Analyzing recurring From a banking dataset, we can build two graph representations, patterns in real banking networks sets a foundation for understand- namely (a) entity graphs and (b) transaction graphs, as illustrated ing how fraud materializes in transaction data. From there, one can in Figure 1. Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021 2.1.1 Entity graph. In an entity graph, 𝐺 = (𝑉 , 𝐸), nodes 𝑉 rep- for the volume as a collection by its editors. This volume and its papers are published resent entities, such as merchants or clients, and edges 𝐸 connect under the Creative Commons License Attribution 4.0 International (CC BY 4.0). entities with at least one shared transaction. This way of repre- Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Anal- ysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, senting banking datasets is helpful to highlight suspicious entity Copenhagen, Denmark) on CEUR-WS.org. behavior. Transaction Client Merchant Fraud? Entity T1 C1 M1 No Transaction Graph T2 C1 M1 No Graph (a) T3 C2 M2 Yes (b) T4 C1 M1 No C1 C2 T1 T2 T3 T4 M1 M2 Client Merchant Legitimate Fraud Legitimate Fraud Client Merchant Figure 1: Example of (a) an entity graph and (b) a transaction graph. Consider, for example, the entity graph in Figure 1 (a). There, entities, i..e, 𝜇 (𝑢, 𝑣) ∈ 𝑀, |𝑀 | = 2𝑘 − 1. In Figure 1, 𝑘 = 2, hence we represent C1 and M1 as two connected nodes of different types, there are 3 possible labels for edges, 𝑀 = 3. i.e., {𝐶1, 𝑀1} ∈ 𝑉 , (𝐶1, 𝑀1) ∈ 𝐸, and 𝜙 (𝐶1) ≠ 𝜙 (𝑀1), where Transaction can be used to investigate patterns between trans- 𝜙 (𝑢) is the type of node 𝑢. Since we connect all 𝑘 entities in a actions. Additionally, machine learning classification models can transaction, they form a 𝑘-node clique in the graph. When the same benefit from receiving topological information (including centrality two entities are parties to multiple transactions (e.g., a person makes measures or node embeddings) extracted from transaction graphs several purchases at a retail store), we aggregate the transaction [5]. information into a single edge (𝑢, 𝑣) ∈ 𝐸. Therefore, an entity graph is undirected and heterogeneous. The node label 𝜙 (𝑢) corresponds to the entity’s type (i.e., client, card, 2.2 Heterogeneous network motifs merchant, or terminal). The edge label 𝜇 (𝑢, 𝑣) is binary, indicating A network motif is a subgraph that appears more frequently than whether there is at least one fraudulent transaction involving the expected. The concept of appearing more frequently than expected two entities. commonly relies on building a large set of randomized networks R𝐺 that are similar to the original network [4]. In the literature, 2.1.2 Transaction graph. In a transaction graph, 𝐺 = (𝑉 , 𝐸), nodes authors typically use either 100 or 1000 randomized networks [9]. 𝑉 represent individual transactions, and edges 𝐸 connect transac- Suppose the frequency of a given subgraph 𝑚𝑖 in the original tions that share entities. network is, according to some significance measure (Section 2.2.2), In the transaction graph from Figure 1 (b), transactions T1 and much higher than the (average) frequency on similar randomized T2 are represented as two nodes connected by an edge indicating networks, i.e., 𝑓 𝑟𝑒𝑞(𝑚𝑖 , 𝐺) >> 𝑎𝑣𝑒𝑟𝑎𝑔𝑒_𝑓 𝑟𝑒𝑞(𝑚𝑖 , R𝐺 ). In that case, that the same client made both, i.e., {𝑇 1,𝑇 2} ∈ 𝑉 and (𝑇 1,𝑇 2) ∈ 𝐸. subgraph 𝑚𝑖 is a network motif. Similarly, if the subgraph’s fre- Since connecting all transactions with common entities would result quency is much lower than the (average) frequency on similar ran- in very dense graphs, we only connect transactions that occurred domized networks, that subgraph is an anti-motif. In this work, we within a time window, e.g., transactions made by a client in less than are interested in both motifs and anti-motifs. 24 hours. Moreover, we use edge direction to encode the temporal Motif discovery involves computing the frequency of a given set sequence of transactions, with edges connecting older transactions of subgraphs and entails subgraph counting [6]. Subgraph counting to more recent ones, i.e., if (𝑢, 𝑣) ∈ 𝐸, then 𝑣 is more recent than 𝑢. receives as input a graph G and a list of non-isomorphic subgraph If two transactions occur in the same timestamp, we add a bidirec- types (e.g., all possible unique subgraphs with four nodes). Then, tional edge between the two nodes, i.e., {(𝑢, 𝑣), (𝑣, 𝑢)} ∈ 𝐸. it outputs the frequencies of each subgraph type in G, e.g., the Thus, a transaction graph is directed and heterogeneous. The frequencies of 4-node cliques, 4-node chains, or 4-node stars. node label is binary (the transaction is fraudulent or legitimate), In this work, we use g-tries for subgraph counting since they are i.e., 𝜙 (𝑥) ∈ {𝑓 𝑟𝑎𝑢𝑑, 𝑙𝑒𝑔𝑖𝑡 }. The edge label is one of 2𝑘 − 1 possible a general framework able to count subgraphs of arbitrary size in labels, all the combinations of sharing one or more of 𝑘 different heterogenous graphs [7, 8]. Other approaches are faster for counting 2 (d) Heterogeneous (a) Homogeneous (b) Heterogeneous Nodes (c) Heterogeneous Edges Nodes and Edges Figure 2: Different 3-node cliques. In our case, we use both heterogeneous nodes and edges. specific subgraphs, but, as far as we know, g-tries are the only build the randomized networks from the randomized tabular data. method that support directed and heterogenous graphs [6]. Thus, the randomization procedure works as follows: Since our graphs are heterogeneous, we need network motifs (1) Shuffle all the fraud labels. This step maintains the fraud that consider node and edge heterogeneity. Concretely, we need to rate of the original dataset but randomizes its attribution to extract heterogeneous motifs and anti-motifs. different transactions. As an illustrative example, consider a 3-node clique where nodes (2) Iterate over each entity type and, for each entity type, we and edges are homogeneous (Figure 2 (a)) and the following hetero- randomly chose 𝜌 ∗ 𝑚 pairs of values to be swapped. Here, geneous graph settings: (i) if nodes can be of two different types, 𝜌 ∈ [0, 1] controls how much we want to randomize the there are four possible 3-node cliques (Figure 2 (b)), (ii) if edges original network, and 𝑚 is the number of transactions. can be of two different types, there are four possible 3-node cliques (3) Build the resulting graph from the randomized data, as de- (Figure 2 (c)). Thus, in our work, disregarding node or edge labels scribed in Sections 2.1.1 and 2.1.2. undermines the necessary differentiation of topological structures. Heterogeneous motif discovery is more informative than traditional This network randomization strategy guarantees semantically homogeneous motif discovery and more complex to extract and valid banking networks with a random topology. analyze. 2.2.2 Motif significance measures. After doing subgraph counting In banking fraud analysis, heterogeneous network motifs are on both the original network and the randomized networks, we use more helpful than homogeneous motifs. For instance, knowing that a motif significance measure to evaluate which subgraphs are over- "clients connected to many different cards are more likely to be and under-represented. Several measures have been proposed [11], fraudulent" is arguably more informative than just knowing that but here we focus on two of the most well-established: the z-score "dense subgraphs can be indicative of fraud". and the ratio. 2.2.1 Network randomization. Computing the expected frequency The z-score of a subgraph, 𝑧𝑖 , is computed as follows: of a given subgraph requires randomizing the original network so that randomized networks are similar to the original. However, 𝑓 𝑜 − 𝜇𝑅 defining network similarity is non-trivial and task-dependent. In 𝑧𝑖 = 𝑖 𝑅 𝑖 (1) practice, the following two approaches are common: 𝜎𝑖 • Initialize the randomized graph as a copy of the original Where, graph and iteratively swap random pairs of edges [9]. This • 𝑓𝑖𝑜 is the frequency of subgraph 𝑖 in the original network. method preserves vertices’ in- and out-degrees. • Initialize the randomized graph with the same nodes as the • 𝜇𝑖𝑅 and 𝜎𝑖𝑅 are the mean and standard deviation of the fre- original graph and incrementally add edges with probabili- quencies of subgraph 𝑖 in the randomized networks, respec- ties based on each nodes’ degree in the original graph [4]. tively. This method provides an approximation of the vertices’ in- We compute the ratio of a subgraph, 𝑟𝑖 , as follows: and out-degrees. However, these strategies are unsuitable to the banking fraud 𝑓𝑜 domain as they fundamentally change the semantic structure of 𝑟𝑖 = 𝑖𝑅 (2) 𝜇𝑖 banking graphs. For instance, in an entity graph, entities of the same type are never directly connected by an edge. However, adding Our goal is to find the subgraphs with the highest z-scores/ratio or removing edges indiscriminately (while preserving each node’s (i.e., the motifs) and subgraphs with the lowest z-scores/ratio (i.e., degree) will eventually lead to connecting nodes of the same type the anti-motifs). and thus compromise the validity of the randomized network. Since In our experiments (Section 3), we show the ratio of the sub- these strategies do not take node and edge labels into account, we graphs since it is more interpretable than the z-score: if 𝑟𝑖 = 100, need to follow a different approach for network randomization. the subgraph appears 100 times more often in the original network Instead of randomizing the original graph directly, we apply than in the randomized networks. We complement our analysis by a randomization procedure directly on the tabular data and then reporting 𝑓𝑖𝑜 , 𝜇𝑖𝑅 , and 𝜎𝑖𝑅 . 3 3 RESULTS Ratio 10,000 3.1 Data overview Table 1 outlines the parameters used to generate the entity graph Motifs and the transaction graph and provides summary statistics on the 1,000 two graphs. In the entity graph, we consider four node types (i.e., client, 100 merchant, terminal, and card) and two edge types (i.e., fraudulent or legitimate). In the transaction graph, we consider two node types (i.e., fraud- 10 ulent and legitimate) and three edge types (i.e., only client, only merchant, or both client and merchant). The rationale for choosing these two entities lies in the close relationship between clients and 1 cards and, similarly, merchants and terminals. In other words, since most clients use few cards and most merchants have few terminals, 0.1 we simplify the graph by dropping the card and terminal entities. Anti-Motifs Additionally, we found it necessary to constrain the considered time window due to computational constraints on subgraph count- 0.01 ing. 0 20 40 60 80 100 Both graphs have hundreds of thousands of edges and many Number of Random Networks different connected components. The fraud rate in the entity graph is the number of fraudulent edges, while, in the transaction graph, Figure 3: Entity graph’s ratio evolution, highlighting the top- the fraud rate is the number of fraudulent nodes. As a result, the 6 motifs in dark blue and the top-4 anti-motifs in light blue. fraud rates differ depending on the graph type. 3.2 Motif analysis have significantly lower ratio than the others (𝑟𝑖 ≈ 0.01). We con- In this subsection, we start with the entity graph results and then sider the first to be the motifs, and the last to be the anti-motifs, show the results of the transaction graph. shown in Figure 4 (a) and Figure 4 (b), respectively. We analyze which subgraphs are motifs and anti-motifs based From Figure 4 we notice that none of the subgraphs is a 3-node on the ratio (Equation 2): we consider subgraphs that have a con- clique. This result might seem counter-intuitive as all transactions siderably higher ratio than the others to be motifs, and subgraphs form a 4-node clique. A possible explanation is that some merchants with the lowest ratios to be anti-motifs. are hub nodes and thus induce chain-like subgraphs. Another gen- We focus on subgraphs of size three due to the higher computa- eral observation is that all edges in the motifs are fraudulent, which tional cost of larger subgraphs. is not true for the anti-motifs. Domain knowledge indicates that fraudulent transactions tend to be closely connected. 3.2.1 Entity graph. From Figure 3 we can see that, for most cases, The first three motifs from Figure 4 (a) have a client at its center, the ratio of all subgraphs stabilizes relatively quickly at ≈ 40 ran- connected to either (i) two terminals, (ii) one terminal and a mer- dom networks. We observe that a few subgraphs have significantly chant, or (iii) two merchants. Subgraphs (i) and (iii) tell us that the higher ratio than the others (𝑟𝑖 > 100). Moreover, a few subgraphs client made two fraudulent transactions at different merchants and terminals, respectively. These two subgraphs are motifs because fraud is sometimes recurring for fraudulent clients, and this infor- mation is lost in the randomized networks since they reshuffle fraud Table 1: Parameters used for graph generation and general labels. Subgraph (ii) tells us a similar story but notice that, since the statistics (NA means Not Applicable). merchant and the terminal are not connected, the merchant and the terminal are from different transactions. The last three motifs are equivalent to the first three but with the card at the center of the entity graph transaction graph subgraph. Since clients can use multiple cards, the original network period 2 days, 5 hours 1 day, 21 hours counts are lower for subgraphs with the card at the center of the lookback period NA 6 hours chain than the counts of the subgraphs with the client at the center random networks 100 of the chain. We also note that, in practice, it might be interesting 𝜌 0.8 to investigate all cards used by the client. nodes 51,428 20,209 The first three anti-motifs from Figure 4 (b) have a merchant edges 104,833 214,054 at its center. All three anti-motifs convey the same information: node types 4 2 merchants typically either have only legitimate transactions or only edge types 2 3 fraud transactions, i.e., fraud tends to cluster around the same risky conn. components 5,796 9,877 merchants. In the randomized networks, since we randomly swap fraud rate 1:1000 1:500 edges, these relations are lost. Finally, the last anti-motif tells us 4 Ratio 100,000 Motifs 10,000 700 600 550 1,000 14 / 0.02 ± 0.14 24 / 0.04 ± 0.28 11 / 0.02 ± 0.14 100 10 700 600 550 1 12 / 0.04 ± 0.20 20 / 0.08 ± 0.39 9 / 0.04 ± 0.20 Anti-Motifs (a) Motifs 0.1 0 20 40 60 80 100 Number of Random Networks Figure 5: Transaction graph’s ratio evolution, highlighting the top-4 motifs in dark blue and the top-1 anti-motif in light blue. 0.02 0.02 0.02 39 / 2335 ± 1011 39 / 2335 ± 1011 39 / 2324 ± 1020 Three of the four top motifs are a 3-node clique, which happens when three transactions that share a merchant or client occur within Client Merchant the same six-hour window. Once again, the randomization of the entities breaks such patterns. Card Terminal It is important to note that some motifs contain transactions Legitimate processed in the same timestamp (represented with bi-directional Fraud 0.03 edges). This pattern can happen in the same merchant when the 704 / 22554 ± 17 merchant is processing transactions in a batch or has multiple terminals. (b) Anti-Motifs The only anti-motif is a sequence of three transactions in the same merchant, where the middle transaction is fraudulent, and the Figure 4: Entity graph motifs and anti-motifs. For each sub- other two are legitimate. This pattern is less frequent in the original graph 𝑖, we show its ratio 𝑟𝑖 (first line) and 𝑓𝑖𝑜 / 𝜇𝑖𝑅 ±𝜎𝑖𝑅 (second network than in the randomized networks since fraud transactions line). typically occur together in reality, and the randomized networks break this pattern. 4 USAGE SCENARIOS that it is not common for clients to use multiple cards in legitimate transactions. Indeed, clients that use multiple cards are typically The motif analysis can be a tool to characterize banking datasets. Be- associated with fraudulent activity. yond summary statistics, the list of motifs and anti-motifs surfaces underlying fraud patterns. 3.2.2 Transaction graph. Figure 5 shows the evolution of the ratio Fraud experts, who may not be knowledgeable in data science or for each 3-node subgraph in the transaction graph. After 80 random statistics, often use graph visualization for data exploration. Motif networks, the ratios seem to stabilize, and we can clearly distinguish analysis serves as a visual summary characterization of a dataset five subgraphs that stand out, namely, the top-4 subgraphs and the for fraud detection. bottom-1 subgraph. These subgraphs are presented in Figure 6. As an illustrative example, let us consider a fraud expert at a The common feature in the four motifs is at least two transactions bank. Their role entails designing new rules to prevent fraud and that share the same client and merchant. We may lose this pattern reviewing unlabelled transactions. The expert can review the char- during the randomization since we swap merchants and clients acteristic patterns surfaced by motif analysis to tailor the fraud de- independently. However, it is still interesting to observe a significant tection system in place. Over time, fraudsters design new schemes number of patterns where the same client makes two or three to evade detection. Periodic analysis of motifs surfaces upcoming transactions in the same merchant in less than six hours. fraud schemes and overall behavioral trends. 5 As future work, one can investigate whether different banking datasets have similar motifs (and anti-motifs) and if those patterns are different in merchant datasets. This research would follow the findings by Milo et al. [3] where they report that networks with similar contexts have similar subgraph patterns. One can also ex- tend the analysis to larger motif sizes, different temporal windows, 12,564 8,357 and the inclusion of transaction amounts or fraud labels as graph 1759 / 0.14 ± 0.99 2507 / 0.30 ± 0.54 properties. REFERENCES [1] László Hajdu and Miklós Krész. 2020. Temporal network analytics for fraud detection in the banking sector. In ADBIS, TPDL and EDA 2020 Common Workshops. Springer, 145–157. [2] Giovanni Micale, Alfredo Pulvirenti, Alfredo Ferro, Rosalba Giugno, and Dennis Shasha. 2019. Fast methods for finding significant motifs on labelled multi- relational networks. Journal of Complex Networks 7, 6 (2019), 817–837. 2,223 1,567 [3] Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal 289 / 0.13 ± 0.34 1254 / 0.80 ± 6.94 Ayzenshtat, Michal Sheffer, and Uri Alon. 2004. Superfamilies of evolved and designed networks. Science 303, 5663 (2004), 1538–1542. [4] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, N Kashtan, Dmitri Chklovskii, and (a) Motifs Uri Alon. 2002. Network Motifs: Simple Building Blocks of Complex Networks. Science (New York, N.Y.) 298 (11 2002), 824–7. [5] Catarina Oliveira, João Torres, Maria Inês Silva, David Aparício, João Tiago Legitimate Ascensão, and Pedro Bizarro. 2021. GuiltyWalker: Distance to illicit nodes in the Bitcoin network. arXiv:2102.05373 [cs.LG] Fraud [6] Pedro Ribeiro, Pedro Paredes, Miguel EP Silva, David Aparicio, and Fernando Silva. 2019. A survey on subgraph counting: concepts, algorithms and applications Client to network motifs and graphlets. arXiv preprint arXiv:1910.13011 (2019). [7] Pedro Ribeiro and Fernando Silva. 2014. Discovering colored network motifs. In 0.29 Merchant Complex Networks V. Springer, 107–118. 3315 / 11493 ± 7582 [8] Pedro Ribeiro and Fernando Silva. 2014. G-Tries: a data structure for storing and finding subgraphs. Data Mining and Knowledge Discovery 28, 2 (2014), 337–377. [9] Pedro Ribeiro, Fernando Silva, and Marcus Kaiser. 2009. Strategies for network (b) Anti-Motifs motifs discovery. In 2009 Fifth IEEE International Conference on e-Science. IEEE, 80–87. [10] Ryan A Rossi, Nesreen K Ahmed, Aldo Carranza, David Arbour, Anup Rao, Figure 6: Transaction graph motifs and anti-motifs. For each Sungchul Kim, and Eunyee Koh. 2019. Heterogeneous network motifs. arXiv preprint arXiv:1901.10026 (2019). subgraph 𝑖, we show its ratio 𝑟𝑖 (fist line) and 𝑓𝑖𝑜 / 𝜇𝑖𝑅 ± 𝜎𝑖𝑅 [11] F. Xia, H. Wei, S. Yu, D. Zhang, and B. Xu. 2019. A Survey of Measures for (second line). Network Motifs. IEEE Access 7 (2019), 106576–106587. When reviewing an unlabelled transaction, the expert can com- pare the respective subgraph with known motifs. This context can give insight into which pattern, fraudulent or legitimate, might occur. On the other hand, by extracting the most relevant transaction patterns and uncovering new fraud schemes, motif analysis can complement common fraud detection systems. These insights can be used to improve both machine learning models and rule-based systems. 5 CONCLUSIONS We explore motifs and anti-motifs in the context of banking fraud using two graph representations, namely entity graphs and trans- action graphs. We propose a novel randomization method that operates directly on tabular data. This way, we overcome the limi- tations of current network randomization methods in the context of banking fraud. Moreover, we extract heterogeneous network motifs that convey more information than traditional network motifs and find they offer interpretable results. Insights extracted from motif analysis can be used to aid fraud analyst investigate specific cases and improve fraud detection systems by uncovering new fraud schemes. 6