Graph Pattern Mining for Business Decision Support

                                                            André Petermann
                                                        supervised by Erhard Rahm
                                                        Database Research Group
                                                           University of Leipzig
                                             petermann@informatik.uni-leipzig.de


ABSTRACT                                                                  model like common data warehouse models, they need to ask
To which extent can graph pattern mining enrich business                  several questions like ”Are As and Bs frequently connected
intelligence? This question was the seed whose sprout be-                 via Ds?” and get simple ”Yes” or ”No” answers.
came my PhD research. To find an answer, I investigated                      Summarized, wrapping the schemas of data sources into a
graph-based data integration, the calculation of business                 graph super-model enables more generic queries and mining
measures from graphs and suitable data mining techniques                  of self-descriptive patterns. In my PhD research, I developed
based thereon. The latter should identify correlations be-                the BIIIG approach (Business Intelligence with Integrated
tween occurrences of specific graph patterns and values of                Instance Graphs) to enable such flexible graph-based anal-
business measures. Finally, interesting patterns should be                yses of business data. Figure 1 provides an overview of the
presented to decision makers. With real world applications                approach. In the remainder of this paper, I will give a brief
in mind, I additionally considered the requirements of big                overview of my past and future work.
data scenarios at all stages. In this paper, I summarize my
recent contributions and give an outlook on the work re-                  2.    CONTRIBUTIONS
quired to finally answer the motivating question.                           In the following, I will provide an overview of the contri-
                                                                          butions made during my past PhD research.
1.    INTRODUCTION
   To make good decisions, enterprises have a permanent de-
                                                                          2.1   Graph Representation of Business Objects
sire to understand the reasons for certain values of business                Business data of a company implicitly describes a graph
measures. In a classical business intelligence development                but is typically stored in one or more business information
lifecycle a domain expert is choosing potential impact fac-               systems based on relational databases. Thus, I first had
tors and the analytical model is tailored to evaluate mea-                to consider the process of turning data organized in tables
sures by these factors. However, this approach often leads                into graphs. In [16], I proposed a semi-automated solution
to oversimplified models and, thus, unexpected patterns may               to this problem and implemented a prototype based on a
remain hidden. Hence, the use of graph models for business                productive graph database [19]. The approach was evalu-
intelligence is a promising approach for two reasons: First,              ated using real and synthetic data. In the following, I will
some patterns are too complex to be represented using tu-                 briefly discuss my approach to graph-based data transfor-
ples. In particular, this applies to patterns where most of               mation and integration but, due to limited space, only for
the information is about relationships.                                   relational databases.
   Second, graphs can loosen the coupling of experts’ bias                   In the initial step (step 1 of Figure 1) metadata of one
and analytical results because data represented by rich graph             or more data sources is acquired, stored in a graph model
models like the property graph model [20] allows not only                 (unified metadata graph) and enriched by a domain expert.
to evaluate instance data but also metadata occurrence, i.e.,             In this graph, every vertex represents a class of domain ob-
schema-related information is part of the result and must no              jects (class-like tables) and every edge an association be-
be specified in a query. For example, to reveal patterns be-              tween classes (foreign keys or m:n tables). Both, vertices
tween objects of classes A and B, ideally analysts just want              and edges, further contain information about their source
to ask ”Which patterns typically connect As and Bs?” and                  system, semantic type, keys and attributes.
expect an answer like ”Mostly via a sequence of Cs and Ds,                   In the second step (step 2 of Figure 1), vertices and edges
but sometimes only via Es”. In contrast, using a structured               of the metadata graph are interpreted to generate SQL state-
                                                                          ments. These are used to query instances (data objects and
                                                                          relationships) from the source databases. Afterwards, all
                                                                          data objects are transformed into vertices and all relation-
                                                                          ships into edges of a so-called integrated instance graph. I
                                                                          decided to use the property graph model [20], i.e., a di-
                                                                          rected labeled multigraph with named attributes (proper-
                                                                          ties). For both, vertices and edges, labels represent their
Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich,       semantic type and all attributes are stored using properties.
Germany.
Copyright (c) 2017 for this paper by its authors. Copying permitted for   Another popular model to represent such graphs is the re-
private and academic purposes.                                            source description framework (RDF) [10]. However, RDF is
                                     Figure 1: Overview of the BIIIG approach [16]


more general and provides no dedicated structures for logical     nent’s vertices. In consequence, every transactional vertex
relationships, labels and properties. In consequence every        belongs to exactly one graph while master data instances
attributed relationship must be represented by a subgraph         may be part of many graphs. The algorithm’s only require-
[7] and the total number of edges would be much higher.           ment is the categorization of vertices to represent either mas-
   Besides model transformation, the second step may also         ter or transactional data. This categorization is done by a
include data integration. Every vertex has a globally unique      domain expert at the class level and taken over by their
source identifier composed from identifiers for source sys-       instances.
tem, class and record. Thus, the approach supports rela-             Due to the bad availability of datasets from real business
tionships across data sources. Such relationships may exist       information systems, I designed and implemented FoodBro-
for two reasons: First, data objects of different systems may     ker [17], a data generator based on business process simula-
reference each other, for example, a ticket of a customer is-     tion. The generated data’s schema is inspired by real busi-
sue tracking system may reference an invoice stored in an         ness information systems. Further on, every master data
accounting system. Second, certain master data is held re-        object has a quality criterion and will, if participating, in-
dundantly and copies refer to a global business key (e.g.,        fluence the process execution positively or negatively. For
customer number). For the latter case, I proposed vertex          example, the more poor master data objects interact in a
fusion, a strategy to automatically merge the resulting ver-      process the higher is the chance for a bad process outcome
tices and to redirect their relationships.                        like financial loss. Thus, data generated by FoodBroker is
                                                                  suitable to evaluate the BIIIG approach.
2.2    Business Transaction Graphs
   Data warehouse models use a schema (e.g., star schema)         2.3    Business Measure Aggregation
that needs to be defined in advance to link facts and dimen-         To analyze graph collections, first, measures need to be
sions. Data mining techniques based thereon can evaluate          calculated on the graph-level. For this reason, I proposed the
the co-occurrence of certain dimensional values (e.g., fea-       graph aggregation operation [14, 16]. Aggregation derives a
ture vectors). The major aim of the BIIIG approach was            scalar value from an input graph’s vertices and edges includ-
to enable an additional evaluation of the relationship struc-     ing labels and properties, e.g., to count contained vertices
ture among interrelated facts as well as between facts and        of a certain type or to sum all values of a specified property.
dimensions. Analyzing such structural patters is promis-          The actual calculation is specified by a user-defined func-
ing, for example, to reveal interaction patterns between cus-     tion γ that is executed for every graph of a collection. For
tomers and certain employees that lead to high sales profit.      example, the attributes isClosed and soCount attached to
Here, the first challenge was to find a suitable abstraction      the graphs of Figure 2 represent the results of two different
to enable such analyses. For this reason, I introduced the        aggregation functions γisClosed and γsoCount . While γsoCount
concept of business transaction graphs [16] as the base for       counts vertices of type SalesOrder, γisClosed will check, if
measure aggregation (Section 2.3) and graph pattern mining        the graph contains a closed sales quotation, i.e., if the sales
(Section 2.5). A business transaction graph represents, for       process is finished. The result of an aggregation function
example, a single execution of a business process like trading    can be used to filter a graph collection. In our example, only
or manufacturing goods.                                           graphs with γisClosed = true were selected to apply γsoCount .
   I proposed a domain-specific algorithm to automatically        Since vertices of type SalesOrder only exist in the case of a
extract a collection of such graphs from the integrated in-       confirmed (won) Quotation, this aggregation result can be
stance graph (step 3 in Figure 1). Figure 2 shows four exam-      used to categorize graphs into won (γsoCount > 0) and lost
ple business transaction graphs of a sales process. For sake of   (γsoCount = 0) ones.
ease, edge types are omitted. The algorithm is based on the
observation that transactional data (e.g., Email, Quotation,      2.4    Scalable Frequent Subgraph Mining
SalesOrder) only link each other in the case of a causal            To find correlations between certain business measures
connection. Here, causally connected means object B (e.g.,        values and graph patterns, pattern frequencies need to be
an invoice) would not exist without the prior existence of        computed. This primitive operation is the well known prob-
object A (e.g., a quotation). Thus, the algorithm first iden-     lem of frequent subgraph mining [5]. Since the problem is
tifies connected components of transactional data and, af-        NP-complete and graph collections in business applications
terwards, adds all master data (e.g., Customer, Employee,         can be very large I required a massive parallel solution to
Product) that is directly connected to one of the compo-          minimize total response times. There are three distributed
approaches to (exact and complete) frequent subgraph min-        Figure 2: Example Business Transaction Graphs [14]
ing based on MapReduce [4, 11, 12]. However, none of these
approaches is capable to mine directed multigraphs.
   Thus, I discussed an extension of the popular gSpan algo-
rithm [22] to support directed multigraphs in [15] and pro-
posed DIMSpan [18], the first approach to frequent subgraph
mining based on distributed in-memory dataflow systems
like Apache Spark [23] or Apache Flink [2]. In comparison
to the existing MapReduce based approaches, DIMSpan not
only requires fewer disk access but also shuffles less data
over the network and can reduce the total number of ex-
pensive isomorphism resolutions to a minimum. In exper-
imental evaluations I have shown that a lightweight data
structure as well as effective and fast compression techniques
based thereon are key techniques for good scalability in big
data scenarios. Figure 3 shows example evaluation results of
DIMSpan. The chart on the left hand side shows a perfect
scalability for increasing input data volume, since compu-
tation time for a portion of 100K graphs is decreasing for
a growing number of graphs at different minimum support
thresholds smin . The chart on the right hand side shows
good speedup for an increasing cluster size.

2.5   Category Characteristic Patterns
   Since we are able to categorize graphs based on aggre-
gated measures and can compute pattern frequencies, we
can also mine correlations between categories and certain
patterns. In [14] I proposed an analytical workflow to iden-
tify such category characteristic patterns. Figure 2 shows
four example graphs where the top 3 represent finished exe-
cutions of a sales process (γisClosed = true) categorized into
won (γsoCount > 0) and lost ones (γsoCount = 0). Blue and red
color are used to highlight example patterns. The pattern
in blue color represents ’a phone call made by Alice’ and        requirements, especially they miss support for graph collec-
the one in red color ’an email sent by Bob’. To enable the       tions and graph properties. Thus, I joined the development
extraction of patterns combining labels and values of cer-       of Gradoop [9], a scalable framework supporting complex
tain properties I additionally use a specific transformation     workflows [15] of multiple operations on both graphs and
between categorization and mining.                               graph collections.
   In contrast to basic frequent subgraph mining, I require         The aggregation and vertex fusion operators proposed in
patterns not just to be frequent but to be characteristic for    [16] became part of Gradoop’s extended property graph
a measure category. For example, the blue pattern is in-         model. Additionally, DIMSpan [18] as well as the algo-
teresting, as it occurs in all of the won cases but not in the   rithms to extract business transaction graphs [16] and the
lost one. By contrast, the red pattern occurs in all graphs of   one to identify category characteristic patterns [14] were im-
both categories and, thus, is considered to be trivial. There-   plemented to fit a dedicated interface for plug-in algorithms
fore, I use an interestingness measure comparing a pattern’s     and are part of Gradoop’s open source repository1 . Be-
frequency in different categories. The measure is a func-        sides operators related to the BIIIG approach, the frame-
tion that evaluates the relative support of a pattern within     work provides further valuable analytical operators such as
a category in relation to its average relative support in all    graph grouping [8] and graph pattern matching [6].
categories. Based on this measure, the analyst sets an in-
terestingness threshold to prune patterns by minimum in-
terestingness. Additionally, there is a candidate threshold
                                                                 3.     PROBLEMS AND FUTURE WORK
to specify the minimum support of a pattern inside a cate-          In first evaluations of mining characteristic patterns from
gory to be considered as a candidate. This parameter is used     FoodBroker data I found out that the expected result was
to save computations in exchange for result completeness.        returned but the number of patterns quickly became very
                                                                 large and may overwhelm analysts. However, I was already
2.6   Framework Integration                                      able to identify two particular ”data science” problems and
                                                                 their potential solutions: First, the method described in Sec-
  The implementation of the initial prototype [19] only cov-
                                                                 tion 2.5 eliminates trivial patterns for each category but not
ered data integration and simple analytical queries. To find
                                                                 combinations of trivial and characteristic patterns. Thus,
a suitable platform for complex workflows including mea-
                                                                 I’ll investigate ranking results using a fast analytical method
sure calculation and pattern mining, I performed an in-
                                                                 of graph p-value calculation [13]. Based thereon most sig-
depth comparison of graph databases [7] and examined the
                                                                 nificant patterns should be presented first.
suitability of different graph processing technologies [14]. I
                                                                 1
found out that none of the existing systems could satisfy my         www.gradoop.com
   Second, patterns should contain different levels of dimen-     Figure 3: DIMSpan: Scalability for increasing data
sional attributes. To provide a simple example, on the one        volume (left) and cluster size (right) [18]
hand an analyst won’t be interested in the pattern bread
and butter, if there are more specific patterns like wholegrain
bread and butter. On the other hand, if bread and butter is
not returned, the more general pattern of bakery products
and butter could be. Thus, I will extend the DIMSpan algo-
rithm to mine dimensional attributes across multiple levels.
This approach has already been studied for itemsets [3] but
not for graphs.
   Finally, I will evaluate BIIIG in a real world scenario in
cooperation with a large-scale enterprise. The evaluation
will be based on Gradoop and cover all steps of my ap-
proach. The company will not only provide real business
data but also valuate analytical results and scalability.
                                                                   [8] M. Junghanns, A. Petermann, and E. Rahm. Distributed
4.   SUMMARY                                                           grouping of property graphs with gradoop. In 17th
                                                                       Conference on Database Systems for Business, Technology,
  In my past PhD research, I contributed to the fields of
                                                                       and Web (BTW), 2017.
graph data management and graph data mining. In contrast           [9] M. Junghanns, A. Petermann, N. Teichmann, K. Gómez,
to other graph-based approaches to business intelligence [1,           and E. Rahm. Analyzing extended property graphs with
21], BIIIG covers all steps from data integration to analyti-          apache flink. In 1st SIGMOD Workshop on Network Data
cal results and requires no advance definition of an analytical        Analytics, page 3. ACM, 2016.
schema. To the best of my knowledge, I proposed the first         [10] G. Klyne and J. J. Carroll. Resource description framework
approach to integrate data from multiple source into a single          (RDF): Concepts and abstract syntax. 2006.
instance graph and the first one using metadata-driven au-        [11] W. Lin, X. Xiao, and G. Ghinita. Large-scale frequent
                                                                       subgraph mining in mapreduce. In Int. Conf. on Data
tomation. Further on, I was the first who discussed the usage
                                                                       Engineering (ICDE), pages 844–855. IEEE, 2014.
of graph collections to analyze the structure of interrelated     [12] W. Lu, G. Chen, A. Tung, and F. Zhao. Efficiently
business objects and to enable novel data mining techniques            extracting frequent subgraphs using mapreduce. In Int.
based thereon. Additionally, I presented the first horizon-            Conf. on Big Data, pages 639–647. IEEE, 2013.
tally scalable approach to transactional frequent subgraph        [13] G. Micale, R. Giugno, A. Ferro, M. Mongiovı́, D. Shasha,
mining using a distributed in-memory dataflow system and               and A. Pulvirenti. Fast analytical methods for finding
the first supporting directed multigraphs. To finish my PhD            significant colored graph motifs. To be published in Data
research, I will improve applicability by returning cross-level        Mining and Knowledge Discovery, 2017.
                                                                  [14] A. Petermann and M. Junghanns. Scalable business
results and ranking them by significance.
                                                                       intelligence with graph collections. it-Information
                                                                       Technology, 58(4):166–175, 2016.
5.   ACKNOWLEDGMENTS                                              [15] A. Petermann, M. Junghanns, S. Kemper, K. Gómez,
                                                                       N. Teichmann, and E. Rahm. Graph mining for complex
  This work is partially funded within the EU program Eu-
                                                                       data analytics. In Int. Conf. on Data Mining Workshops
ropa fördert Sachsen of the European Social Fund and by the           (ICDMW), pages 1316–1319. IEEE, 2016.
German Federal Ministry of Education and Research under           [16] A. Petermann, M. Junghanns, R. Müller, and E. Rahm.
project ScaDS Dresden/Leipzig (BMBF 01IS14014B).                       BIIIG: Enabling Business Intelligence with Integrated
                                                                       Instance Graphs. In Int. Conf. on Data Engineering
                                                                       Workshops (ICDEW), pages 4–11. IEEE, 2014.
6.   REFERENCES                                                   [17] A. Petermann, M. Junghanns, R. Müller, and E. Rahm.
 [1] D. Bleco and Y. Kotidis. Business intelligence on complex
     graph data. In Proceedings of the 2012 Joint EDBT/ICDT            Foodbroker-generating synthetic datasets for graph-based
     Workshops, pages 13–20. ACM, 2012.                                business analytics. In Workshop on Big Data Benchmarks,
 [2] P. Carbone et al. Apache flink: Stream and batch                  pages 145–155. Springer, 2014.
     processing in a single engine. Data Eng., 38(4), 2015.       [18] A. Petermann, M. Junghanns, and E. Rahm.
 [3] T. Eavis and X. Zheng. Multi-level frequent pattern               Dimspan-transactional frequent subgraph mining with
     mining. In International Conference on Database Systems           distributed in-memory dataflow systems. arXiv preprint
     for Advanced Applications, pages 369–383. Springer, 2009.         arXiv:1703.01910, 2017.
 [4] S. Hill et al. An iterative mapreduce approach to frequent   [19] A. Petermann et al. Graph-based Data Integration and
     subgraph mining in biological datasets. In Conference on          Business Intelligence with BIIIG. PVLDB, 7(13), 2014.
     Bioinformatics, Computational Biology and Biomedicine,       [20] M. A. Rodriguez and P. Neubauer. Constructions from dots
     pages 661–666. ACM, 2012.                                         and lines. Bulletin of the American Society for Information
 [5] C. Jiang, F. Coenen, and M. Zito. A survey of frequent            Science and Technology, 36(6):35–41, 2010.
     subgraph mining algorithms. The Knowledge Engineering        [21] Z. Wang et al. Pagrol: Parallel graph olap over large-scale
     Review, 28(01):75–105, 2013.                                      attributed graphs. In 30th Int. Conf. on Data Engineering
 [6] M. Junghanns, M. Kieling, A. Averbuch, A. Petermann,              (ICDE), pages 496–507. IEEE, 2014.
     and E. Rahm. Cypher-based graph pattern matching in          [22] X. Yan and J. Han. gspan: Graph-based substructure
     gradoop. In Proc. 5th Int. Workshop on Graph Data                 pattern mining. In International Conference on Data
     Management Experiences and Systems. ACM, 2017.                    Mining (ICDM), pages 721–724. IEEE, 2002.
 [7] M. Junghanns, A. Petermann, N. Neumann, and E. Rahm.         [23] M. Zaharia et al. Resilient distributed datasets: A
     Management and analysis of big graph data: Current                fault-tolerant abstraction for in-memory cluster computing.
     systems and open challenges. Big Data Handbook,                   In Proc. of the 9th USENIX conference on Networked
     Springer, 2017.                                                   Systems Design and Implementation, pages 2–2, 2012.