=Paper=
{{Paper
|id=Vol-2646/38-paper
|storemode=property
|title=jKarma: A Highly-Modular Framework for Pattern-Based Change Detection on Evolving Data
|pdfUrl=https://ceur-ws.org/Vol-2646/38-paper.pdf
|volume=Vol-2646
|authors=Angelo Impedovo,Corrado Loglisci,Michelangelo Ceci,Donato Malerba
|dblpUrl=https://dblp.org/rec/conf/sebd/ImpedovoLCM20
}}
==jKarma: A Highly-Modular Framework for Pattern-Based Change Detection on Evolving Data==
jKarma: a Highly-Modular Framework for Pattern-Based Change Detection on Evolving Data (Discussion Paper) Angelo Impedovo, Corrado Loglisci, Michelangelo Ceci, Donato Malerba Dept. of Computer Science, University of Bari "Aldo Moro", Bari, Italy {name.surname}@uniba.it Abstract. Pattern-based change detection (PBCD) describes a class of change detection algorithms for evolving data. Contrary to conventional solutions, PBCD seeks changes exhibited by the patterns over time and therefore works on an abstract form of the data, which prevents the search for changes on the raw data. Moreover, PBCD provides argu- ments on the validity of the results because patterns mirror changes occurred with any form of evidence. However, the existing solutions dif- fer on data representation, mining algorithm and change identification strategy, which we can deem as main modules of a general architecture, so that any PBCD task could be designed by accommodating custom im- plementations for those modules. This is what we propose in this paper through jKarma, a highly-modular framework for designing and perform- ing PBCD. Keywords: Change Detection, Evolving Data, Software Framework 1 Introduction Pattern-based change detection (PBCD) refers to the class of change detection solutions able to find out data-points in which the data distribution changes by acting on the patterns rather than on raw data. Despite the attention it could raise, we ascertain lacking in comprehensive environments able to investigate the problem with alternative solutions or even with integrable implementations. Its main peculiarity is working in an unsupervised fashion, without relying on labeling, which often makes it preferable to the supervised approaches. The blueprint relies on three main methodological decisions, that is, data description, pattern mining algorithm, and change identification strategy. Pat- tern mining algorithms are in charge of building an abstract representation of the evolving data (patterns). The change identification strategy is in charge of Copyright © 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy. Impedovo et al. searching for changes expressed by the patterns by the effect of possible dis- tribution drifts in the underlying data. In PBCDs, the changes correspond to variations that occurred on the patterns discovered over time. While the deci- sion on which technique to use for the pattern mining and change identification components determines the algorithmic aspects of a PBCD solution, the data representation strictly concerns the formalism of the evolving data, character- istics of the original data to consider and pattern language. For instance, the PBCDs implemented in [1,2] identify the changes through a generic notion of Jaccard dissimilarity defined for three different types of patterns, that is, frequent subnetworks, and δ-closed itemsets. Our purpose is to provide the users with a software framework that supports the study of a predictive problem (change detection) through an unsupervised data mining task (pattern mining) while disseminating existing PBCDs and promoting the development of new ones. As our best knowledge, this is the first solution that combines change detection and pattern mining, while they have been explored as separate tasks in existing frameworks. For instance, MOA [3] has been designed to work on evolving data and offers algorithms that deal with concept drift in predictive tasks. SPMF [4] presents several classes of pat- terns (such as, sequential patterns and periodic patterns), but no one defined for change detection. In this discussion paper, we accomplish this with jKarma, a framework written in Java and proposed in [5] which offers loosely coupled modules, does not require programming efforts and enables the use of reusable, off-the-shelf and ad-hoc implementations for algorithmic components. 2 Background and PBCD architecture In this section we provide preliminary notions and explain the conceptual archi- tecture under which PBCD solutions can be collocated. Given the set of items I, a transactional database is the time-ordered sequence D = hT1 , T2 , . . . , Tn i. Each Ti ⊆ I is a transaction observed in ti and uniquely identified by id i. Thus, a pattern P ⊆ I is a set of |P | items, and, for PBCD purposes, they are discov- ered from transactions in time windows. A window W = [ti , tj ], with ti < tj , is the sequence of |W | = j − i + 1 transactions {Ti , . . . , Tj } ⊆ D. PW denotes the set of patterns discovered on the window W . In the blueprint of PBCD, the Mining step and the Identification step search for change-points on evolving data by using Time-windows models. In particular, two time-windows W and W 0 , W = [tb , te ] and W 0 = [t0b , t0e ] (tb ≤ t0b ≤ te+1 , te < t0e ) are built (Figure 1, Step 2) and input to a pattern mining algorithm, which discovers two pattern sets PW and PW 0 (Figure 1, Step 3). In these terms, the changes are attributed to the patterns which make PW different from PW 0 . In particular, we can determine the i) amount of the change through a quantifica- tion of the difference between the two pattern sets, ii) temporal collocation of the changes (change-points) as the time in which the difference-patterns occur (Fig- ure 1, Step 4). For this core procedure, jKarma offers a general architecture that supports software modularity (Figure 1). It makes the decisions on the specific jKarma: a highly-modular framework for PBCD on evolving data change not detected 7 keep old data 4 1 2 3 change new block accumulate pattern start identification of data new data mining step step 6 5 drop change old data explanation step change detected Fig. 1: Overview of the PBCD architecture implementation for Time-windows models, Mining step and Identification step independent from each other. Indeed, the time-window models allow us to build sub-sequences of data regardless of their original structure (such as, itemsets, subgraphs, subtrees) and the choice of the specific model to use (such as, slid- ing, landmark, tilted) is not constrained neither by pattern mining nor change identification, since the time-windows are only in charge of to scan evolving data and account for new (recent) transactions and old (past) transactions (Figure 1, Steps 2, 6 and 7). The sole assumption of jKarma is the the availability of evolving data in the form of transactional databases. The Identification step (Figure 1, Step 4) is in charge of spotting variations which the new pattern set PW 0 presents in comparison with the old pattern set PW . To do that, jKarma makes available different implementations of dissimi- larity measures (such as Jaccard dissimilarity, etc.) defined on several notions of evidence of the patterns (such as relative frequency, frequency ratio, periodicity, etc.). Not all the dissimilarity values are worthwhile of interest, but only those that exceed a desired degree of change, as well as, not all the patterns exhibit a variation in the evidence, but only that exceed a desired degree of evidence. This enables jKarma to provide "explanations" of the changes in the form of patterns that better express the underlying changes (Figure 1, Step 5). 3 Software Framework & Functionalities jKarma is an open-source highly-modular framework written in Java 8 for defin- ing and executing custom PBCDs. The framework, publicly available under the Apache License 2.0, exposes an API easing the rapid prototyping of custom PBCD strategies, tailored for data coming from transactional data sources, by implementing the general architecture seen in Section 2. Custom PBCDs are in- https://jkarma.bitbucket.io/ Impedovo et al. stantiated by composition, meaning that existing modules for the pattern min- ing, change identification and change explanation steps can be combined to de- sign PBCDs ready to be used. Every PBCD defined with jKarma is completely independent from other data mining and machine learning libraries, and third- parties data sources, thus offering two advantages, i) integrability with existing projects using their data sources (such as, relational databases, graph databases, xml documents), and ii) interoperability with existing analytics frameworks. The framework has been developed as a multi-module Maven project in which two modules expose the APIs for defining custom pattern mining strategies (jkarma-mining) and custom PBCDs pipelines on top of previously defined pat- tern mining strategies (jkarma-pbcd). In particular, the main functionalities are served by two main factory classes: i) org.jkarma.mining.structures.Strategies constructs MiningStrategy objects implementing the pattern mining algorithm to be used in the Mining step of the PBCD architecture, ii) org.jkarma.pbcd.detectors.Detectors constructs PBCD objects implement- ing the details of every step involved in the PBCD architecture. The expressiveness of the programming interface enables the modular design of custom PBCD strategies. This is done by reusing existing software for the (mining step and identification step) in the PBCD architecture. In the current version, it is possible to devise PBCDs based on 5 pattern mining algorithms (Eclat, diffEclat, LCM and LCM-Max, and PFPM), 3 pattern languages (item- sets, subgraphs, subtrees), 4 time-window models (blockwise sliding/landmark, cumulative sliding/landmark) and 2 search algorithms (depth-first search and beam search). Furthermore, the API allows the user to implement his modules when necessary. 4 Illustrative Examples In this section, we report some illustrative examples of how jKarma can be used for building different PBCDs by following a component-based architectural model. Specifically, in the following we will show how the user can specify the details about the Mining step, the Detection step, and the Explanation step in a two-step approach: the first step uses the Strategies class to define a MiningStrategy object, while the second step injects it into a PBCD object via the Detectors class. It is evident that the choices are domain-specific and affect the behavior of the PBCDs, thus resulting in different change detection results. 4.1 Definition of mining strategies The mining strategy is defined by instantiating a generic MiningStrategy object, hence by specifying the set of items (of type A), the pattern language, the pattern evidence criterion (implemented in a class of type B), the mining al- gorithm and the search strategy of patterns. Listing 1.1 shows the definition of a mining strategy, based on the Eclat algorithm, which searches for connected sub- graphs. The pattern evidence criterion filters out frequent connected subgraphs jKarma: a highly-modular framework for PBCD on evolving data (FCSs) whose frequency is lower than the minimum threshold (minSupp). Eclat computes the frequency of a pattern by inspecting its tidset, a data structure collecting the identifiers of the transactions in which the pattern occurs. public MiningStrategy < LabeledEdge , TidSet > defineStrategy ( double minSupp ) { TidsetProvider < LabeledEdge > accessor = new TidsetProvider < >( Windows . blockwiseSliding () ); return Strategies . uponSubgraphs () . eclat ( minSupp ) . limitDepth (3) . dfs ( accessor ); } Listing 1.1: FCS mining strategy based on Eclat. The strategy, which is an object of type MiningStrategy, is instantiated by the uponSubgraphs method that specifies the FCSs pattern lan- guage. The eclat method injects the mining algorithm into the mining strategy, while the limitDepth method limits the maximum number of edges in every FCS. Then, an instance of type TidsetProvider (accessor) scans the transactions and build the tidsets. Finally, the dfs method finalizes the strategy and forces the Eclat algorithm to run in a DFS fashion. An interesting aspect is that Eclat is used to mine FCSs, while natively it is a frequent itemset mining algorithm. In fact, in jKarma the pattern lan- guage is decoupled from the mining algorithm, so that equivalent strategies on different languages (e.g.: itemsets and subtrees) can be defined. For instance, the Eclat algorithm can be forced to discover frequent subtrees by replacing the uponSubgraphs method with the uponSubtrees one. However, since both the strategies are based on the Eclat algorithm, they will compute the fre- quencies of patterns by intersecting TidSet objects. Although this is a good choice on sparse datasets, it could be time-consuming for dense datasets [6], for which the diffEclat algorithm is more appropriate, since the frequency is computed using DiffSet data structures. In jKarma, mining strategies based on diffEclat are easily instantiated by i) invoking the diffEclat method instead of the eclat method, and ii) replacing the TidsetProvider data accessor with a DiffsetProvider. However, the main pitfall of the mining strategies discussed so far is their exhaustiveness, which leads to the discovery of complete sets of patterns. The exhaustive search is caused by the dfs method, which forces the mining algorithm to work in exhaustive mode. jKarma can be used to define non-exhaustive strategies based on beam-search and heuristics as done in [7]. 4.2 Definition of PBCDs As introduced in Section 2, PBCD relies on the sets of patterns PW and PW 0 to i) compute the dissimilarity score d(PW , PW 0 ) and quantify the degree of change and ii) arrange a change explanation. The dissimilarity score is computed on two equally-sized vector encodings FW and FW 0 , in which the i-th element cor- responds to the weight associated with the i-th pattern in the enumeration of PW ∪ PW 0 . This way, the change is quantified through vector measures, instead of set-based ones. Different weighting schemes and vector encodings could de- termine different change detection results. Impedovo et al. In jKarma, a PBCD pipeline is defined by injecting a MiningStrategy instance into a PBCD object via the Detectors class. This ensures the type-checking consistency between the patterns discovered in the mining step and those used in the identification step. The generic type C specifies the type of transactions that will be consumed by the PBCD, while the generic type D denotes the pattern weighting scheme adopted. Finally, a PBCD is finalized by providing details on the identification step and explanation step. In the following example, a PBCD is built by passing a MiningStrategy to the upon method. Then, a binary weighting scheme and the Jaccard dissimi- larity measure are specified via the unweighted method. The PBCD will use the isFrequent predicate when constructing the binary vector encodings, while the UnweightedJaccard computes the dissimilarity score. This PBCD explains changes by discovering emerging patterns via the Descriptors.eps method. Fi- nally, the PBCD is finalized with the build method which i) sets the minimum change threshold to 0.5, and ii) arranges data in blocks of 15 transactions. public PBCD < LabeledGraph , LabeledEdge , TidSet , Boolean > buildPBCD ( MiningStrategy < LabeledEdge , TidSet > strategy ) { UnweightedJaccard m = new UnweightedJaccard () ; return Detectors . upon ( strategy ) . unweighted ((p ,t) -> Patterns . isFrequent (p , minFreq , t) , m) . describe ( Descriptors . eps ( minGr )). build (0.5 , 15) ; } Listing 1.2: PBCD based on the unweighted jaccard dissimilarity between binary- valued vector encodings of patterns. 4.3 A complete example: the KARMA algorithm Detecting changes is particularly relevant for dynamic networked data, that is, networks which evolve over time, for which no common notion of change exists. In fact, different methods ascribe changes to variations in the observed nodes, while others focus on edges or subgraphs observed over time, which leads to clearly different results. Moreover, many proposed methods are not part of ex- isting software frameworks, which limits their versatility. Listing 1.3 reports a complete example in which jKarma is used so as implementing the KARMA PBCD algorithm presented in [1], which detects changes in dynamic networks by observing variations in the FCSs discovered over time. The example also shows how users can react to changes, by following the event-listener paradigm: the changeDetected method will be be executed in case of detected changes, otherwise, the changeNotDetected method will be executed. public PBCD < LabeledGraph , LabeledEdge , TidSet , Boolean > getKARMA ( double minSupp , double minChange , double minGr ) { // auxiliary components TidSetProvider < LabeledEdge > dataAccessor = new TidSetProvider < >( Windows . cumulativeLandmark () ); UnweightedJaccard m = new UnweightedJaccard () ; Descriptor descriptor = Descriptors . partialEps ( minSupp , minGr ); // mining strategy definition MiningStrategy < LabeledEdge , TidSet > strategy = Strategies . uponSubgraphs () . eclat ( minSupp ) https://bitbucket.org/jkarma/demo-karma-pbcd/ jKarma: a highly-modular framework for PBCD on evolving data . limitDepth (3) . dfs ( dataAccessor ); // PBCD definition return Detectors . upon ( strategy ) . unweighted ((p ,t) -> Patterns . isFrequent (p , minSupp , t) , m) . describe ( descriptor ). build ( minChange , 15) ; } public void runKARMA ( Stream < LabeledGraph > dataSource ) { PBCD < LabeledGraph , LabeledEdge , TidSet , Boolean > detector = this . getKarma (0.15 , 0.2 , 1.2) ; // change detection event listening detector . registerListener ( new PBCDEventListener < LabeledEdge , TidSet >() { public void changeDetected ( ChangeDetectedEvent < LabeledEdge , TidSet > e){ // reaction to change detected } public void changeNotDetected ( ChangeNotDetectedEvent < LabeledEdge , TidSet > e){ // reaction to change not detected } }) ; // consume the data source dataSource . forEach ( detector ); } Listing 1.3: Example of jKarma implementing the KARMA PBCD [1]. Indeed, jKarma enables the users to detect changes in dynamic networks with alternative approaches. The example shows how to instantiate the KARMA algo- rithm, which is a good choice when the change has to be detected on subgraphs. However, the solution could not be the best one when changes affects only some attributes of nodes. To this end, jKarma can be used to rapid prototyping of new algorithms in Java. 4.4 Comparative evaluation To show the effectiveness of jKarma in deploying actionable PBCDs, we compare the detection accuracy and running times of four PBCD algorithms (KARMA, PBCD-1, PBCD-2, and StreamKrimp) on three synthetic datasets with same minimum frequency and change thresholds (equal to 0.5). Specifically, KARMA, PBCD-1, and PBCD-2 have been implemented in jKarma. PBCD-1 and PBCD- 2 are non-exhaustive variants of the exhaustive KARMA algorithm that make use of the landmark window model and sliding window model, respectively. While StreamKrimp [8] is a non-exhaustive PBCD based on frequent itemsets discovered according to the MDL principle. The results (Table 1) show that non-exhaustive PBCDs (PBCD-1, PBCD-2, and StreamKrimp) are more accu- rate than those exhaustive (KARMA). Although exhaustive, KARMA is more efficient than StreamKrimp, which is not implemented with jKarma. Finally, PBCD-1 offers the higher accuracy, while PBCD-2 has the lower running times. 5 Conclusions We have introduced jKarma, an highly-modular framework for defining and exe- cuting customized pattern-based change detection approaches for evolving data, https://bitbucket.org/jkarma/datasets Impedovo et al. Table 1: Running times and accuracies on synthetic data. Running times (seconds) Accuracy dataset dataset PBCD-1 PBCD-2 KARMA S.Krimp PBCD-1 PBCD-2 KARMA S.Krimp synth-drifts-1 12.913 6.194 60.763 86.130 synth-drifts-1 0.987 0.918 0.804 0.931 synth-drifts-2 12.284 6.522 55.982 77.138 synth-drifts-2 0.991 0.916 0.799 0.911 synth-drifts-3 12.603 6.463 58.137 76.750 synth-drifts-3 0.988 0.918 0.796 0.916 in Java. jKarma enables the modular definition of custom PBCDs, with reduced or none implementation efforts, by following a component-based architectural model. The framework comes as a Java software library which is completely in- dependent of other data mining frameworks and existing data sources. As future work, we plan to investigate the periodicity of the changes [9]. Acknowledgments We acknowledge the support of the MIUR - Ministero dell’Istruzione dell’Università e della Ricerca through the project "TALIsMan - Tecnologie di Assistenza personALizzata per il Miglioramento della quAlità della vitA" (Grant ID: ARS01_01116), funding scheme PON RI 2014-2020 References 1. C. Loglisci, M. Ceci, A. Impedovo, D. Malerba, Mining microscopic and macro- scopic changes in network data streams, Knowl. Based Syst. 161 (2018) 294–312. doi:10.1016/j.knosys.2018.07.011. 2. D. Trabold, T. Horváth, Mining strongly closed itemsets from data streams, in: 20th International Conference, DS 2017, Kyoto, 2017, pp. 251–266. 3. A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, MOA: massive online analysis, J. Mach. Learn. Res. 11 (2010) 1601–1604. 4. P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu, V. S. Tseng, SPMF: a java open-source pattern mining library, J. Mach. Learn. Res. 15 (1) (2014) 3389– 3393. 5. A. Impedovo, C. Loglisci, M. Ceci, D. Malerba, jkarma: A highly-modular framework for pattern-based change detection on evolving data, Knowl. Based Syst. 192 (2020) 105303. doi:10.1016/j.knosys.2019.105303. 6. M. J. Zaki, K. Gouda, Fast vertical mining using diffsets, in: Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA, 2003, pp. 326–335. 7. A. Impedovo, M. Ceci, T. Calders, Efficient and accurate non-exhaustive pattern- based change detection in dynamic networks, in: Discovery Science - 22nd Interna- tional Conference, DS 2019, Split, Croatia, October 28-30, Proceedings, 2019. 8. M. van Leeuwen, A. Siebes, Streamkrimp: Detecting change in data streams, in: ECML/PKDD 2008, Belgium 2008, Proceedings, Part I, pp. 672–687. 9. C. Loglisci, M. Ceci, A. Impedovo, D. Malerba, Mining spatio-temporal patterns of periodic changes in climate data, in: 5th International Workshop, NFMCP 2016, Held in Conjunction with ECML-PKDD 2016, Italy, 2016, Revised Selected Papers, pp. 198–212. doi:10.1007/978-3-319-61461-8_13.