Graph Pattern Mining for Business Decision Support André Petermann supervised by Erhard Rahm Database Research Group University of Leipzig petermann@informatik.uni-leipzig.de ABSTRACT model like common data warehouse models, they need to ask To which extent can graph pattern mining enrich business several questions like ”Are As and Bs frequently connected intelligence? This question was the seed whose sprout be- via Ds?” and get simple ”Yes” or ”No” answers. came my PhD research. To find an answer, I investigated Summarized, wrapping the schemas of data sources into a graph-based data integration, the calculation of business graph super-model enables more generic queries and mining measures from graphs and suitable data mining techniques of self-descriptive patterns. In my PhD research, I developed based thereon. The latter should identify correlations be- the BIIIG approach (Business Intelligence with Integrated tween occurrences of specific graph patterns and values of Instance Graphs) to enable such flexible graph-based anal- business measures. Finally, interesting patterns should be yses of business data. Figure 1 provides an overview of the presented to decision makers. With real world applications approach. In the remainder of this paper, I will give a brief in mind, I additionally considered the requirements of big overview of my past and future work. data scenarios at all stages. In this paper, I summarize my recent contributions and give an outlook on the work re- 2. CONTRIBUTIONS quired to finally answer the motivating question. In the following, I will provide an overview of the contri- butions made during my past PhD research. 1. INTRODUCTION To make good decisions, enterprises have a permanent de- 2.1 Graph Representation of Business Objects sire to understand the reasons for certain values of business Business data of a company implicitly describes a graph measures. In a classical business intelligence development but is typically stored in one or more business information lifecycle a domain expert is choosing potential impact fac- systems based on relational databases. Thus, I first had tors and the analytical model is tailored to evaluate mea- to consider the process of turning data organized in tables sures by these factors. However, this approach often leads into graphs. In [16], I proposed a semi-automated solution to oversimplified models and, thus, unexpected patterns may to this problem and implemented a prototype based on a remain hidden. Hence, the use of graph models for business productive graph database [19]. The approach was evalu- intelligence is a promising approach for two reasons: First, ated using real and synthetic data. In the following, I will some patterns are too complex to be represented using tu- briefly discuss my approach to graph-based data transfor- ples. In particular, this applies to patterns where most of mation and integration but, due to limited space, only for the information is about relationships. relational databases. Second, graphs can loosen the coupling of experts’ bias In the initial step (step 1 of Figure 1) metadata of one and analytical results because data represented by rich graph or more data sources is acquired, stored in a graph model models like the property graph model [20] allows not only (unified metadata graph) and enriched by a domain expert. to evaluate instance data but also metadata occurrence, i.e., In this graph, every vertex represents a class of domain ob- schema-related information is part of the result and must no jects (class-like tables) and every edge an association be- be specified in a query. For example, to reveal patterns be- tween classes (foreign keys or m:n tables). Both, vertices tween objects of classes A and B, ideally analysts just want and edges, further contain information about their source to ask ”Which patterns typically connect As and Bs?” and system, semantic type, keys and attributes. expect an answer like ”Mostly via a sequence of Cs and Ds, In the second step (step 2 of Figure 1), vertices and edges but sometimes only via Es”. In contrast, using a structured of the metadata graph are interpreted to generate SQL state- ments. These are used to query instances (data objects and relationships) from the source databases. Afterwards, all data objects are transformed into vertices and all relation- ships into edges of a so-called integrated instance graph. I decided to use the property graph model [20], i.e., a di- rected labeled multigraph with named attributes (proper- ties). For both, vertices and edges, labels represent their Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich, semantic type and all attributes are stored using properties. Germany. Copyright (c) 2017 for this paper by its authors. Copying permitted for Another popular model to represent such graphs is the re- private and academic purposes. source description framework (RDF) [10]. However, RDF is Figure 1: Overview of the BIIIG approach [16] more general and provides no dedicated structures for logical nent’s vertices. In consequence, every transactional vertex relationships, labels and properties. In consequence every belongs to exactly one graph while master data instances attributed relationship must be represented by a subgraph may be part of many graphs. The algorithm’s only require- [7] and the total number of edges would be much higher. ment is the categorization of vertices to represent either mas- Besides model transformation, the second step may also ter or transactional data. This categorization is done by a include data integration. Every vertex has a globally unique domain expert at the class level and taken over by their source identifier composed from identifiers for source sys- instances. tem, class and record. Thus, the approach supports rela- Due to the bad availability of datasets from real business tionships across data sources. Such relationships may exist information systems, I designed and implemented FoodBro- for two reasons: First, data objects of different systems may ker [17], a data generator based on business process simula- reference each other, for example, a ticket of a customer is- tion. The generated data’s schema is inspired by real busi- sue tracking system may reference an invoice stored in an ness information systems. Further on, every master data accounting system. Second, certain master data is held re- object has a quality criterion and will, if participating, in- dundantly and copies refer to a global business key (e.g., fluence the process execution positively or negatively. For customer number). For the latter case, I proposed vertex example, the more poor master data objects interact in a fusion, a strategy to automatically merge the resulting ver- process the higher is the chance for a bad process outcome tices and to redirect their relationships. like financial loss. Thus, data generated by FoodBroker is suitable to evaluate the BIIIG approach. 2.2 Business Transaction Graphs Data warehouse models use a schema (e.g., star schema) 2.3 Business Measure Aggregation that needs to be defined in advance to link facts and dimen- To analyze graph collections, first, measures need to be sions. Data mining techniques based thereon can evaluate calculated on the graph-level. For this reason, I proposed the the co-occurrence of certain dimensional values (e.g., fea- graph aggregation operation [14, 16]. Aggregation derives a ture vectors). The major aim of the BIIIG approach was scalar value from an input graph’s vertices and edges includ- to enable an additional evaluation of the relationship struc- ing labels and properties, e.g., to count contained vertices ture among interrelated facts as well as between facts and of a certain type or to sum all values of a specified property. dimensions. Analyzing such structural patters is promis- The actual calculation is specified by a user-defined func- ing, for example, to reveal interaction patterns between cus- tion γ that is executed for every graph of a collection. For tomers and certain employees that lead to high sales profit. example, the attributes isClosed and soCount attached to Here, the first challenge was to find a suitable abstraction the graphs of Figure 2 represent the results of two different to enable such analyses. For this reason, I introduced the aggregation functions γisClosed and γsoCount . While γsoCount concept of business transaction graphs [16] as the base for counts vertices of type SalesOrder, γisClosed will check, if measure aggregation (Section 2.3) and graph pattern mining the graph contains a closed sales quotation, i.e., if the sales (Section 2.5). A business transaction graph represents, for process is finished. The result of an aggregation function example, a single execution of a business process like trading can be used to filter a graph collection. In our example, only or manufacturing goods. graphs with γisClosed = true were selected to apply γsoCount . I proposed a domain-specific algorithm to automatically Since vertices of type SalesOrder only exist in the case of a extract a collection of such graphs from the integrated in- confirmed (won) Quotation, this aggregation result can be stance graph (step 3 in Figure 1). Figure 2 shows four exam- used to categorize graphs into won (γsoCount > 0) and lost ple business transaction graphs of a sales process. For sake of (γsoCount = 0) ones. ease, edge types are omitted. The algorithm is based on the observation that transactional data (e.g., Email, Quotation, 2.4 Scalable Frequent Subgraph Mining SalesOrder) only link each other in the case of a causal To find correlations between certain business measures connection. Here, causally connected means object B (e.g., values and graph patterns, pattern frequencies need to be an invoice) would not exist without the prior existence of computed. This primitive operation is the well known prob- object A (e.g., a quotation). Thus, the algorithm first iden- lem of frequent subgraph mining [5]. Since the problem is tifies connected components of transactional data and, af- NP-complete and graph collections in business applications terwards, adds all master data (e.g., Customer, Employee, can be very large I required a massive parallel solution to Product) that is directly connected to one of the compo- minimize total response times. There are three distributed approaches to (exact and complete) frequent subgraph min- Figure 2: Example Business Transaction Graphs [14] ing based on MapReduce [4, 11, 12]. However, none of these approaches is capable to mine directed multigraphs. Thus, I discussed an extension of the popular gSpan algo- rithm [22] to support directed multigraphs in [15] and pro- posed DIMSpan [18], the first approach to frequent subgraph mining based on distributed in-memory dataflow systems like Apache Spark [23] or Apache Flink [2]. In comparison to the existing MapReduce based approaches, DIMSpan not only requires fewer disk access but also shuffles less data over the network and can reduce the total number of ex- pensive isomorphism resolutions to a minimum. In exper- imental evaluations I have shown that a lightweight data structure as well as effective and fast compression techniques based thereon are key techniques for good scalability in big data scenarios. Figure 3 shows example evaluation results of DIMSpan. The chart on the left hand side shows a perfect scalability for increasing input data volume, since compu- tation time for a portion of 100K graphs is decreasing for a growing number of graphs at different minimum support thresholds smin . The chart on the right hand side shows good speedup for an increasing cluster size. 2.5 Category Characteristic Patterns Since we are able to categorize graphs based on aggre- gated measures and can compute pattern frequencies, we can also mine correlations between categories and certain patterns. In [14] I proposed an analytical workflow to iden- tify such category characteristic patterns. Figure 2 shows four example graphs where the top 3 represent finished exe- cutions of a sales process (γisClosed = true) categorized into won (γsoCount > 0) and lost ones (γsoCount = 0). Blue and red color are used to highlight example patterns. The pattern in blue color represents ’a phone call made by Alice’ and requirements, especially they miss support for graph collec- the one in red color ’an email sent by Bob’. To enable the tions and graph properties. Thus, I joined the development extraction of patterns combining labels and values of cer- of Gradoop [9], a scalable framework supporting complex tain properties I additionally use a specific transformation workflows [15] of multiple operations on both graphs and between categorization and mining. graph collections. In contrast to basic frequent subgraph mining, I require The aggregation and vertex fusion operators proposed in patterns not just to be frequent but to be characteristic for [16] became part of Gradoop’s extended property graph a measure category. For example, the blue pattern is in- model. Additionally, DIMSpan [18] as well as the algo- teresting, as it occurs in all of the won cases but not in the rithms to extract business transaction graphs [16] and the lost one. By contrast, the red pattern occurs in all graphs of one to identify category characteristic patterns [14] were im- both categories and, thus, is considered to be trivial. There- plemented to fit a dedicated interface for plug-in algorithms fore, I use an interestingness measure comparing a pattern’s and are part of Gradoop’s open source repository1 . Be- frequency in different categories. The measure is a func- sides operators related to the BIIIG approach, the frame- tion that evaluates the relative support of a pattern within work provides further valuable analytical operators such as a category in relation to its average relative support in all graph grouping [8] and graph pattern matching [6]. categories. Based on this measure, the analyst sets an in- terestingness threshold to prune patterns by minimum in- terestingness. Additionally, there is a candidate threshold 3. PROBLEMS AND FUTURE WORK to specify the minimum support of a pattern inside a cate- In first evaluations of mining characteristic patterns from gory to be considered as a candidate. This parameter is used FoodBroker data I found out that the expected result was to save computations in exchange for result completeness. returned but the number of patterns quickly became very large and may overwhelm analysts. However, I was already 2.6 Framework Integration able to identify two particular ”data science” problems and their potential solutions: First, the method described in Sec- The implementation of the initial prototype [19] only cov- tion 2.5 eliminates trivial patterns for each category but not ered data integration and simple analytical queries. To find combinations of trivial and characteristic patterns. Thus, a suitable platform for complex workflows including mea- I’ll investigate ranking results using a fast analytical method sure calculation and pattern mining, I performed an in- of graph p-value calculation [13]. Based thereon most sig- depth comparison of graph databases [7] and examined the nificant patterns should be presented first. suitability of different graph processing technologies [14]. I 1 found out that none of the existing systems could satisfy my www.gradoop.com Second, patterns should contain different levels of dimen- Figure 3: DIMSpan: Scalability for increasing data sional attributes. To provide a simple example, on the one volume (left) and cluster size (right) [18] hand an analyst won’t be interested in the pattern bread and butter, if there are more specific patterns like wholegrain bread and butter. On the other hand, if bread and butter is not returned, the more general pattern of bakery products and butter could be. Thus, I will extend the DIMSpan algo- rithm to mine dimensional attributes across multiple levels. This approach has already been studied for itemsets [3] but not for graphs. Finally, I will evaluate BIIIG in a real world scenario in cooperation with a large-scale enterprise. The evaluation will be based on Gradoop and cover all steps of my ap- proach. The company will not only provide real business data but also valuate analytical results and scalability. [8] M. Junghanns, A. Petermann, and E. Rahm. Distributed 4. SUMMARY grouping of property graphs with gradoop. In 17th Conference on Database Systems for Business, Technology, In my past PhD research, I contributed to the fields of and Web (BTW), 2017. graph data management and graph data mining. In contrast [9] M. Junghanns, A. Petermann, N. Teichmann, K. Gómez, to other graph-based approaches to business intelligence [1, and E. Rahm. Analyzing extended property graphs with 21], BIIIG covers all steps from data integration to analyti- apache flink. In 1st SIGMOD Workshop on Network Data cal results and requires no advance definition of an analytical Analytics, page 3. ACM, 2016. schema. To the best of my knowledge, I proposed the first [10] G. Klyne and J. J. Carroll. Resource description framework approach to integrate data from multiple source into a single (RDF): Concepts and abstract syntax. 2006. instance graph and the first one using metadata-driven au- [11] W. Lin, X. Xiao, and G. Ghinita. Large-scale frequent subgraph mining in mapreduce. In Int. Conf. on Data tomation. Further on, I was the first who discussed the usage Engineering (ICDE), pages 844–855. IEEE, 2014. of graph collections to analyze the structure of interrelated [12] W. Lu, G. Chen, A. Tung, and F. Zhao. Efficiently business objects and to enable novel data mining techniques extracting frequent subgraphs using mapreduce. In Int. based thereon. Additionally, I presented the first horizon- Conf. on Big Data, pages 639–647. IEEE, 2013. tally scalable approach to transactional frequent subgraph [13] G. Micale, R. Giugno, A. Ferro, M. Mongiovı́, D. Shasha, mining using a distributed in-memory dataflow system and and A. Pulvirenti. Fast analytical methods for finding the first supporting directed multigraphs. To finish my PhD significant colored graph motifs. To be published in Data research, I will improve applicability by returning cross-level Mining and Knowledge Discovery, 2017. [14] A. Petermann and M. Junghanns. Scalable business results and ranking them by significance. intelligence with graph collections. it-Information Technology, 58(4):166–175, 2016. 5. ACKNOWLEDGMENTS [15] A. Petermann, M. Junghanns, S. Kemper, K. Gómez, N. Teichmann, and E. Rahm. Graph mining for complex This work is partially funded within the EU program Eu- data analytics. In Int. Conf. on Data Mining Workshops ropa fördert Sachsen of the European Social Fund and by the (ICDMW), pages 1316–1319. IEEE, 2016. German Federal Ministry of Education and Research under [16] A. Petermann, M. Junghanns, R. Müller, and E. Rahm. project ScaDS Dresden/Leipzig (BMBF 01IS14014B). BIIIG: Enabling Business Intelligence with Integrated Instance Graphs. In Int. Conf. on Data Engineering Workshops (ICDEW), pages 4–11. IEEE, 2014. 6. REFERENCES [17] A. Petermann, M. Junghanns, R. Müller, and E. Rahm. [1] D. Bleco and Y. Kotidis. Business intelligence on complex graph data. In Proceedings of the 2012 Joint EDBT/ICDT Foodbroker-generating synthetic datasets for graph-based Workshops, pages 13–20. ACM, 2012. business analytics. In Workshop on Big Data Benchmarks, [2] P. Carbone et al. Apache flink: Stream and batch pages 145–155. Springer, 2014. processing in a single engine. Data Eng., 38(4), 2015. [18] A. Petermann, M. Junghanns, and E. Rahm. [3] T. Eavis and X. Zheng. Multi-level frequent pattern Dimspan-transactional frequent subgraph mining with mining. In International Conference on Database Systems distributed in-memory dataflow systems. arXiv preprint for Advanced Applications, pages 369–383. Springer, 2009. arXiv:1703.01910, 2017. [4] S. Hill et al. An iterative mapreduce approach to frequent [19] A. Petermann et al. Graph-based Data Integration and subgraph mining in biological datasets. In Conference on Business Intelligence with BIIIG. PVLDB, 7(13), 2014. Bioinformatics, Computational Biology and Biomedicine, [20] M. A. Rodriguez and P. Neubauer. Constructions from dots pages 661–666. ACM, 2012. and lines. Bulletin of the American Society for Information [5] C. Jiang, F. Coenen, and M. Zito. A survey of frequent Science and Technology, 36(6):35–41, 2010. subgraph mining algorithms. The Knowledge Engineering [21] Z. Wang et al. Pagrol: Parallel graph olap over large-scale Review, 28(01):75–105, 2013. attributed graphs. In 30th Int. Conf. on Data Engineering [6] M. Junghanns, M. Kieling, A. Averbuch, A. Petermann, (ICDE), pages 496–507. IEEE, 2014. and E. Rahm. Cypher-based graph pattern matching in [22] X. Yan and J. Han. gspan: Graph-based substructure gradoop. In Proc. 5th Int. Workshop on Graph Data pattern mining. In International Conference on Data Management Experiences and Systems. ACM, 2017. Mining (ICDM), pages 721–724. IEEE, 2002. [7] M. Junghanns, A. Petermann, N. Neumann, and E. Rahm. [23] M. Zaharia et al. Resilient distributed datasets: A Management and analysis of big graph data: Current fault-tolerant abstraction for in-memory cluster computing. systems and open challenges. Big Data Handbook, In Proc. of the 9th USENIX conference on Networked Springer, 2017. Systems Design and Implementation, pages 2–2, 2012.