Detecting Bursts of Activity from Streams

Bruno Veloso — INESC TEC, Porto, Portugal — bruno.m.veloso@inesctec.pt
Carlos Martins — WEDO, Braga, Portugal — Carlos.MMartins@wedotechnologies.com
Raphael Espanha — WEDO, Braga, Portugal — Raphael.Espanha@wedotechnologies.com
Raul Azevedo — WEDO, Braga, Portugal — Raul.Azevedo@wedotechnologies.com
João Gama — INESC TEC, Porto, Portugal — jgama@fep.up.pt

Abstract

The high asymmetry of international termination rates, where international calls are charged at significantly higher values, is fertile ground for the appearance of fraud in telecom companies. In this paper, we present a solution for a real problem called Interconnect Bypass Fraud. This problem is one of the most significant in the telecommunication domain and can be detected by the occurrence of bursts of calls from specific numbers. Based on this assumption, we propose the adoption of a new fast forgetting technique that works together with the Lossy Counting algorithm. Our goal is to detect, as soon as possible, items with abnormal behaviours, e.g. bursts of calls, repetitions and mirror behaviours. The results show that our technique not only complements the techniques used by the telecom company but also improves the performance of the Lossy Counting algorithm in terms of runtime, memory used and sensitivity in detecting the abnormal behaviours.

1 Introduction

Telecom fraud continues to account for more fraud complaints each year, and new technology has led to an onslaught of new telecommunication fraud tactics. Around 29 billion USD is lost annually in the telecom sector as a result of fraud, according to the Communications Fraud Control Association's (CFCA) 2017 Global Fraud Loss Survey [A+17], a main reference in the sector for such a sensitive matter. This corresponds to a direct impact of 1.25% on revenues and does not take into account reputation costs, as fraud is evolving in a way that increasingly affects subscribers' services.
CFCA's survey also confirmed the perception that the investment made in better practices and systems is paying off, with a reduction of 23.2% when compared with the 2015 survey edition, despite a significant increase in fraud attempts. In terms of fraud types, International Revenue Share Fraud continues to be the main revenue source for fraudsters, causing a loss of 6.10 billion USD according to the CFCA, closely followed by Interconnect Bypass Fraud, with an estimated loss of 4.27 billion USD according to the same source. Regarding Interconnect Bypass Fraud, recent changes in regulation, namely in the European Union, are creating a favourable context for this type of fraud to grow in more developed countries, as interconnect rates have grown to compensate for the revenue losses resulting from the Roaming Like at Home directive.

If Digital Transformation in Communication Service Providers raises new concerns, it also brings opportunities for a more effective fight against fraud. As systems evolve towards more real-time and intelligent capabilities, real-time fraud controls can be implemented in a faster and cheaper way, and space is created for a new generation of adaptive Machine Learning based controls. In this paper we focus on a solution to efficiently detect Interconnect Bypass Fraud, to be used in real time. This fraud exploits the forwarding of international calls over low cost IP connections. Despite the fact that fraudsters are shaping usage patterns to avoid traditional fraud detection controls, Interconnect Bypass Fraud is often characterised by the occurrence of bursts of calls. Based on this assumption, we propose in this paper the adoption of a fast forgetting technique together with the Lossy Counting algorithm. Our goal is to detect, as soon as possible, items with abnormal behaviours.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
This methodology, when compared with the pre-existing approaches referred to in Section 3, enables the detection of recent abnormal patterns such as bursts of calls, repetitions and mirror behaviours. The main contributions of this paper are summarised as follows:

• We perform experiments on a real-world data set, demonstrating that the fast forgetting technique significantly improves the detection of abnormal behaviours.
• We propose an extension of the Lossy Counting algorithm which includes the adoption of a new forgetting mechanism to rapidly detect abnormal behaviours.
• We apply the Lossy Counting algorithm with fast forgetting for the first time in the fraud detection domain.

In terms of organisation, this document contains six sections. Section 2 is dedicated to the problem definition. Section 3 presents the state of the art on frequent itemsets using Lossy Counting. Section 4 describes our approach, including the Lossy Counting algorithm with fast forgetting. Section 5 describes the experiments and discusses the results obtained. Finally, Section 6 draws the conclusions and suggests future developments.

2 Problem Definition

The high asymmetry of international termination rates vis-à-vis domestic ones, where international calls are charged by the operator where the call terminates at a significantly higher value, is fertile ground for the appearance of fraud in telecommunications. There are several types of fraud that exploit this differential, Interconnect Bypass Fraud being one of the most significant. In this type of fraud, one of the several intermediaries responsible for delivering the calls forwards the traffic over a low cost IP connection, reintroducing the call in the destination network as a local call by using VoIP gateways.
In this way, the intermediary charges the entity that sent the traffic the amount corresponding to the delivery of international traffic but, having illegally delivered it as national traffic, does not pay the international termination fee, appropriating this amount. In addition to the financial losses, this type of fraud carries reputation costs, because both the originating subscriber of the call and the one who receives it end up experiencing poor call quality. This type of fraud is traditionally detected by analysing the call patterns of these gateways which, once identified, have their SIM cards blocked. The gateways have evolved over time, with some of them resembling true SIM farms, capable of manipulating identifiers, simulating call patterns similar to those of normal users, and even being mounted on vehicles to make detection via location information more difficult. Another important aspect for the deterrence of this type of fraud is the speed of detection and blocking: if detection and blocking happen before the payback compensates for the cost and effort of setting up the system, the fraudster has no great incentive to bypass this operator. A new approach is therefore necessary which, on the one hand, provides enough adaptability to identify increasingly hidden fraud patterns and, on the other hand, allows large-volume monitoring in real time so that, as soon as a suspicious behaviour is detected, an alert is generated, or even an immediate block, according to the degree of confidence of the particular detection.
It should be noted that this type of fraud, which was usually more prominent in developing countries where the cost of termination has always been high, has recently grown significantly in Europe as a result of the increase in termination charges for calls originating outside Europe, introduced to compensate for the decrease in revenues resulting from the application of the Roaming Like at Home directive in the European area.

3 Related Work

The literature on data streams is abundant in approximate counting algorithms [Mut05]. Popular one-pass algorithms to approximate the most frequent items in an event stream are the SpaceSaving algorithm [MAEA05], the Frequent algorithm [CH10], the Sticky Sampling algorithm [MM02], and the Lossy Counting algorithm [MM02]. The first two provide a rank of the most frequent items, while the last two provide approximate counts of the frequent items. In all cases, the algorithms are designed for processing high-speed data streams using fixed and restricted memory. Sticky Sampling and Lossy Counting were both presented in [MM02]. They are based on the same idea of removing low-frequency items. While Lossy Counting is a deterministic algorithm, Sticky Sampling is a probabilistic one. Lossy Counting is more accurate, but Sticky Sampling requires constant space, whereas the space requirements of Lossy Counting increase logarithmically with the length of the stream. Sticky Sampling remembers every unique element that gets sampled, whereas Lossy Counting quickly chops off low-frequency elements, leaving only the high-frequency ones. Sticky Sampling can support infinite streams while keeping the same error guarantees as Lossy Counting. Cormode et al. discuss hierarchical heavy hitters (φ-HHH) in streaming data: given a hierarchy and a support φ, the goal is to find all nodes in the hierarchy whose total number of descendants in the data stream is no smaller than φ × N, after discounting the descendant nodes that are also φ-HHH [CKMS08].
This is of particular interest for monitoring structured and networked data, such as XML data, as it explores the internal structure of the data. Other recent works on the fast computation of frequent items include [LJ11, NIY19, SN18].

4 Frequent Itemsets using Lossy Counting

Manku and Motwani [MM02] present the Lossy Counting algorithm, a one-pass algorithm for computing frequency counts exceeding a user-specified threshold over data streams. Although the output is approximate, the error is guaranteed not to exceed a user-specified parameter. Lossy Counting requires two user-specified parameters: a support threshold s ∈ [0, 1] and an error parameter ε ∈ [0, 1] such that ε ≪ s. At any point in time, the Lossy Counting algorithm can produce a list of item(set)s along with their estimated frequencies.

4.1 The Lossy Counting Algorithm

Let N denote the current length of the stream. The answers produced will have the following guarantees:

• All item(set)s whose true frequency exceeds s × N are output; there are no false negatives;
• No item(set) whose true frequency is less than (s − ε) × N is output;
• Estimated frequencies are less than the true frequencies by at most ε × N.

The incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉ transactions each. Buckets are labelled with bucket ids, starting from 1. Denote the current bucket id by bcurrent. For an element e, denote its true frequency in the stream seen so far by fe. The frequent elements are stored in a data structure T. T contains a set of entries of the form (e, f, Δ), where e is the element, f its estimated frequency, and Δ is the maximum possible error in f. The pseudo-code of the Lossy Counting algorithm is presented in Algorithm 1 [Gam10]. It works as follows. Initially, T is empty. Whenever a new element e arrives, if an entry for e already exists, the algorithm increments its counter f. Otherwise, a new entry of the form (e, 1, ⌈εN⌉) is created. At bucket boundaries, the set T is pruned.
The rule for deletion is: an entry (e, f, Δ) is deleted if f + Δ ≤ ⌈εN⌉. When a user requests a list of items with threshold s, Lossy Counting outputs all the entries in T where f ≥ (s − ε) × N.

input: S: a sequence of examples; ε: error margin
begin
    n ← 0; Δ ← 0; T ← ∅
    foreach example e ∈ S do
        n ← n + 1
        if e is monitored then
            increment Count_e
        else
            T ← T ∪ {(e, 1 + Δ)}
        end
        if ⌈εn⌉ ≠ Δ then
            Δ ← ⌈εn⌉
            foreach j ∈ T do
                if Count_j < Δ then
                    T ← T \ {j}
                end
            end
        end
    end
end
Algorithm 1: The Lossy Counting Algorithm.

4.2 Frequent Itemsets using Lossy Counting

Depending on the application, the Lossy Counting algorithm might treat a tuple as a single item or as a set of items. In the latter case, the input stream is not processed transaction by transaction. Instead, the available main memory is filled with as many transactions as possible, and such a batch of transactions is processed together. Let β denote the number of buckets in memory. As for items, Lossy Counting maintains a data structure T as a set of entries of the form (set, f, Δ), where set is a subset of items, f is an integer representing its approximate frequency, and Δ is the maximum possible error in f. T is updated as follows:

• Update itemset: for each entry (set, f, Δ), update f by counting the occurrences of set in the current batch. Delete any entry such that f + Δ ≤ bcurrent;
• New itemset: if a set set has frequency f ≥ β in the current batch and does not occur in T, create a new entry (set, f, bcurrent − β).

Every set whose true frequency is f ≥ ε × N has an entry in T. Moreover, for any entry (set, f, Δ) ∈ T, the true frequency fset satisfies the inequality f ≤ fset ≤ f + Δ. When a user requests a list of items with threshold s, the output consists of those entries in T where f ≥ (s − ε) × N.

4.3 Fast Forgetting

The Lossy Counting algorithm computes the frequency of items or sets of items in a stream of buckets.
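As a concrete illustration of Algorithm 1, the per-item procedure can be sketched in Python. This is a minimal sketch under our reading of the pseudo-code: each entry stores its count plus slack, and the pruning threshold is the current bucket id ⌈n/w⌉ with w = ⌈1/ε⌉. The names `lossy_counting` and `frequent_items` are illustrative, not part of any library.

```python
import math

def lossy_counting(stream, epsilon):
    """Sketch of Algorithm 1: one-pass approximate counting.

    Each monitored element maps to its estimated count plus slack;
    at every bucket boundary (every w = ceil(1/epsilon) items) the
    entries below the current bucket id are pruned.
    """
    w = math.ceil(1 / epsilon)  # bucket width
    counts = {}                 # element -> estimated count (incl. slack)
    delta = 0                   # current pruning threshold (bucket id)
    n = 0                       # stream length seen so far
    for e in stream:
        n += 1
        if e in counts:
            counts[e] += 1                  # monitored element: increment
        else:
            counts[e] = 1 + delta           # new element: credit maximal slack
        bucket = (n + w - 1) // w           # integer ceil(n / w)
        if bucket != delta:                 # crossed a bucket boundary
            delta = bucket
            for item in list(counts):       # prune low-frequency entries
                if counts[item] < delta:
                    del counts[item]
    return counts, n

def frequent_items(counts, n, s, epsilon):
    """Answer a query with support s: keep entries with f >= (s - epsilon) * n."""
    return {e for e, f in counts.items() if f >= (s - epsilon) * n}
```

For example, for a stream of 100 calls in which number "a" appears 90 times and "b" 10 times, a query with s = 0.5 and ε = 0.1 returns only "a": "b" stays monitored but falls below the (s − ε) × N = 40 threshold.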
However, this algorithm does not capture small variations within each bucket, because of the continuous increment of the most frequent items of the stream. We propose the adoption of forgetting factors to reduce the relevance of old elements and capture small frequency variations in the stream. T contains a set of entries of the form (e, f, Δ), where e is the element, f its estimated frequency, and Δ is the maximum possible error in f, taking into account the accumulated forgetting values. δ denotes the accumulated effect of the forgetting factors applied to the buckets and is updated as δ ← w + δ × α, where w is the width of the bucket and α is the forgetting factor. The pseudo-code of Lossy Counting with fast forgetting is presented in Algorithm 2. It works as follows. Initially, T is empty. Whenever a new element e arrives, if an entry for e already exists, the algorithm increments its counter f. Otherwise, a new entry of the form (e, 1, ⌈εδ⌉) is created. At bucket boundaries, the algorithm applies the forgetting factor α to each element in the data structure T, and the set T is pruned taking into account the accumulated forgetting factors. The rule for deletion is: an entry (e, f, Δ) is deleted if f + Δ ≤ ⌈εδ⌉. When a user requests a list of items with threshold s, Lossy Counting with fast forgetting outputs all the entries in T where f ≥ (s − ε) × δ.

input: S: a sequence of examples; ε: error margin; α: fast forgetting parameter
begin
    n ← 0; Δ ← 0; T ← ∅
    foreach example e ∈ S do
        n ← n + 1
        if e is monitored then
            increment Count_e
        else
            T ← T ∪ {(e, 1 + Δ)}
        end
        if ⌈εn⌉ ≠ Δ then
            Δ ← ⌈εn⌉
            foreach j ∈ T do
                Count_j ← α × Count_j
                if Count_j < Δ then
                    T ← T \ {j}
                end
            end
        end
    end
end
Algorithm 2: The Lossy Counting with Fast Forgetting Algorithm.

4.4 Discussion

The results of Lossy Counting are order-dependent, giving heavier weight to the counts processed last. In our case, this makes perfect sense because we want to detect, as soon as possible, items with a burst in activity.
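To make this recency bias concrete, the decay step of Algorithm 2 can be sketched the same way. This is a hypothetical sketch under the same assumptions as the earlier one (count-plus-slack entries, bucket id as pruning threshold); the only change is the multiplication by α at each bucket boundary, and α = 1.0 recovers plain Lossy Counting.

```python
import math

def lossy_counting_ff(stream, epsilon, alpha):
    """Sketch of Algorithm 2: Lossy Counting with fast forgetting.

    Identical to plain Lossy Counting except that, at every bucket
    boundary, each stored count is first multiplied by the forgetting
    factor alpha before pruning. alpha = 1.0 recovers Algorithm 1.
    """
    w = math.ceil(1 / epsilon)  # bucket width
    counts = {}                 # element -> decayed estimated count
    delta = 0                   # current pruning threshold (bucket id)
    n = 0
    for e in stream:
        n += 1
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1 + delta
        bucket = (n + w - 1) // w           # integer ceil(n / w)
        if bucket != delta:                 # crossed a bucket boundary
            delta = bucket
            for item in list(counts):
                counts[item] *= alpha       # forget old evidence first
                if counts[item] < delta:    # then prune decayed entries
                    del counts[item]
    return counts
```

With α = 0.9, a number that dominated only the first half of the stream ends with a decayed count, while an equally sized fresh burst keeps its full weight, which is precisely what makes recent bursts stand out.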
These bursts are potential abnormal activities that must be analysed for fraud detection. The focus on recent items is greatly improved by the new forgetting mechanism we introduce.

5 Experimental Evaluation

The following subsections present the data description, the experiments, the results obtained and a final discussion. The experiments were performed on an Intel Core i5-5200U 2.20 GHz CPU, with 4 GB of DDR3 RAM and a 1 TB hard drive.

5.1 Data Description

Our proposal was evaluated with an anonymised phone calls data set provided by a telecommunication company. The data set contains information about: (i) origin numbers (A-Numbers); (ii) destination numbers (B-Numbers); (iii) timestamp; and (iv) result, which indicates whether the call is blacklisted (code 001) or not (code 000). The data was collected during three months, between 24/07/2018 and 21/10/2018, and includes 83 366 367 examples.

5.2 The Impact of Forgetting

We performed a set of experiments to verify the impact of applying forgetting to the Lossy Counting algorithm. The Lossy Counting algorithm proposed by Manku and Motwani [MM02] was our baseline. Figure 1 presents, for each day, the Top 5 A-Numbers with a frequency higher than the pruning threshold. The algorithm captures a set of 12 A-Numbers, or heavy hitters, that can potentially indicate some kind of fraudulent behaviour. However, using the Lossy Counting algorithm it is difficult to discriminate repetition patterns or bursts generated by the A-Numbers. When we applied our fast forgetting mechanism, the algorithm showed greater sensitivity, capturing additional A-Numbers. Figure 2 shows a set of 35 heavy hitters, and it is possible to verify variability in the Top 5 A-Numbers.
With our technique we can rapidly observe in Figure 3: (i) bursts of calls generated by an A-Number (ID 14); (ii) a repetition pattern observed for an A-Number (ID 1), which cannot be captured using Lossy Counting, as can be observed in Figure 1; and (iii) mirror patterns observed for pairs of A-Numbers (ID 8 and ID 22, ID 19 and ID 24).

Figure 1: The top-5 most active A-Numbers over time.

Table 1 presents the sensitivity to the forgetting parameter α. We can observe that, by adjusting the forgetting parameter, we can detect more Top 5 A-Numbers. When α = 1.0 the algorithm behaves like the Lossy Counting proposed by Manku and Motwani [MM02], i.e., no forgetting is applied to the data. When α = 0.0 everything in the previous bucket of data is forgotten.

Table 1: Forgetting parameter sensitivity; α is the forgetting parameter and UAN is Unique A-Numbers

α    0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
UAN  211  210  203  192  180  175  158  123  93   66   12

Table 2 presents the performance of our proposal compared with the Lossy Counting algorithm. Lossy Counting with fast forgetting maintains the computational complexity of the Lossy Counting algorithm proposed by Manku and Motwani [MM02], which according to Da San Martino et al. [MNS13] is O((1/ε) log(εN)); it reduces the runtime and the memory required to store the data structure T, and increases the throughput.

Table 2: Performance comparison of Lossy Counting (LC) vs Lossy Counting with Fast Forgetting (LCFF)

Algorithm  Runtime (s)  Memory (MB)  Throughput (examples/s)
LC         88           75.8           947 345
LCFF       72           28.8         1 157 866

6 Conclusions

Interconnect Bypass Fraud is one of the most significant fraud tactics in the telecommunication domain. Fraudsters exploit the forwarding of international calls over low cost IP connections to increase their profits. This type of fraud is characterised by the occurrence of bursts of calls. We explore the use of a new fast forgetting mechanism for the Lossy Counting algorithm.
This technique was designed to capture abnormal behaviours in phone calls as soon as possible.

Figure 2: Lossy Counting with Fast Forgetting, α = 0.99. The figure reports the top-5 most active A-Numbers over time.

The contributions of this paper are: (i) at the application level, a reduction of the impact of Interconnect Bypass Fraud on telecommunication companies; and (ii) at the methodology level, the extension of the Lossy Counting algorithm with a fast forgetting mechanism to rapidly detect abnormal behaviours. The experiments show the inability of the Lossy Counting algorithm to detect recent items with abnormal behaviours (bursts of calls, repetitions and mirror behaviours). The results show that our proposal improved the detection of these recent items. Furthermore, our forgetting mechanism reduces the execution time and the memory used to compute the data stream, increasing the throughput of the algorithm. In future research we will explore the use of number ranges to detect A-Numbers with similar behaviours.

Acknowledgements

The authors acknowledge the project ML-ABA - Machine Learn based Adaptive Business Assurance, Individual Demonstration Projects, NUP: FCOMP-01-0202-FEDER-038204, a project co-funded by the Incentive System for Research and Technological Development, from the Thematic Operational Program Competitiveness of the national framework program - Portugal2020. We also acknowledge the support of the project FailStopper (DSAIPA/DS/0086/2018).

References

[A+17] Communications Fraud Control Association et al. 2017 global fraud loss survey. Press Release, June 2017.

[CH10] Graham Cormode and Marios Hadjieleftheriou. Methods for finding frequent items in data streams. The VLDB Journal, 19(1):3–20, February 2010.

[CKMS08] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in streaming data. ACM Trans. Knowl. Discov. Data, 1(4):2:1–2:48, February 2008.

[Gam10] João Gama. Knowledge Discovery from Data Streams.
Chapman and Hall / CRC Data Mining and Knowledge Discovery Series. CRC Press, 2010.

Figure 3: Lossy Counting with Fast Forgetting, α = 0.99. A-Numbers with abnormal behaviours (bursts, repetition and mirror).

[LJ11] Chao-Wei Li and Kuen-Fang Jea. An adaptive approximation method to discover frequent itemsets over sliding-window-based data streams. Expert Syst. Appl., 38:13386–13404, 2011.

[MAEA05] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th International Conference on Database Theory, ICDT'05, pages 398–412, Berlin, Heidelberg, 2005. Springer-Verlag.

[MM02] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 346–357. VLDB Endowment, 2002.

[MNS13] Giovanni Da San Martino, Nicolò Navarin, and Alessandro Sperduti. A lossy counting based approach for learning on streams of graphs on a budget. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1294–1301. AAAI Press, 2013.

[Mut05] S. Muthukrishnan. Data streams: Algorithms and applications. Found. Trends Theor. Comput. Sci., 1(2):117–236, August 2005.

[NIY19] Takumi Nishina, Koji Iwanuma, and Yoshitaka Yamamoto. A skipping FP-tree for incrementally intersecting closed itemsets in on-line stream mining. In 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 1–4, 2019.

[SN18] B. Subbulakshmi and A. Periya Nayaki. Compact pattern stream tree algorithm. 2018.