
Larissa C. Shimomura

l.capobianco.shimomura@tue.nl

Eindhoven University of Technology, Eindhoven, Netherlands

Keywords: Graph Dependencies, Data Profiling, Tuple-Generating Dependencies, Graph Databases

Data dependencies have a direct application to data profiling, as they can provide information about the metadata. The rise of practical real-world use of graph data has resulted in increased interest in studying dependencies and constraints in graphs and their applications. In this project, we propose a new class of dependencies for graph data named Graph Generating Dependencies, or GGDs. Informally, a GGD expresses constraints between two (possibly different) graph (sub-)structures, enforcing dependencies according to user-defined topological (graph) patterns and similarities in the corresponding data (property) values. The expressiveness of GGDs makes it possible to describe both the topology and the property values of graph data, making them an interesting class of dependencies for graph data profiling. In this paper, we present the GGDs and further topics on graph data profiling that we plan to investigate during the project.

1. Introduction

Data profiling refers to the task of collecting information about the content of the data. In the area of data profiling, data dependencies are used not only to assist the user in understanding the data and possible correlations between different attributes, but also to express and ensure data quality rules. The property graph model is the emerging standard, with current efforts from both industry and academia on standardizing a graph query language (GQL)¹. Therefore, it is important to define and study new classes of dependencies for this model and its practical use.

Consider a social network graph in which it has been identified that whenever two person vertices have the same last name and address property values, and have an edge labelled “friend” connecting them, then there should also exist an edge of the type “is family” connecting these two vertices. It is important to be able to capture such constraints and present them to the user, as they can arise naturally in graph data and such information is valuable for further profiling and use of the data.
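To make the rule above concrete, the following minimal Python sketch checks it on a toy graph. The data, the edge encoding and the function name `missing_family_edges` are hypothetical and chosen only for illustration; the sketch assumes undirected edges and exact equality of the property values.

```python
from itertools import combinations

# Hypothetical toy property graph: vertex id -> property values.
people = {
    1: {"last_name": "Silva", "address": "Main St 1"},
    2: {"last_name": "Silva", "address": "Main St 1"},
    3: {"last_name": "Jones", "address": "Main St 1"},
}

# Undirected labelled edges stored as (label, frozenset({u, v})).
edges = {
    ("friend", frozenset({1, 2})),
    ("friend", frozenset({1, 3})),
}


def missing_family_edges(people, edges):
    """Return friend pairs with equal last name and address but no 'is family' edge."""
    missing = []
    for u, v in combinations(sorted(people), 2):
        pair = frozenset({u, v})
        same_values = (
            people[u]["last_name"] == people[v]["last_name"]
            and people[u]["address"] == people[v]["address"]
        )
        is_friend = ("friend", pair) in edges
        has_family = ("is family", pair) in edges
        if is_friend and same_values and not has_family:
            missing.append((u, v))
    return missing


print(missing_family_edges(people, edges))  # [(1, 2)]
```

Vertices 1 and 2 are reported because they are friends with matching values but no “is family” edge; vertices 1 and 3 are friends with different last names, so the rule is not triggered for them.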

However, classes of graph dependencies [1, 2, 3] previously proposed in the literature for the property graph model cannot fully capture such information, as they are defined over a single graph pattern and focus on generalizing functional dependencies (i.e., variations of equality-generating dependencies, egds).

¹ https://www.gqlstandards.org/home

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.

To represent such information, in this project, we propose a new class of graph dependencies for the property graph model named Graph Generating Dependencies (GGDs) [4]. A GGD can express topological constraints according to two (possibly different) graph patterns, and data constraints that express the similarity of the property values of nodes and edges on the defined graph patterns. Given the expressiveness of GGDs, this class of dependencies can be used in practical scenarios such as describing the content of the data (discovery and visualization of GGDs) and ensuring data quality (detecting data inconsistencies, entity resolution and repair of graph data).

In this paper, we introduce in Section 3 the syntax and semantics of GGDs, their main reasoning problems and practical use cases. In Section 4, we present the topics regarding the use of GGDs for graph data profiling and exploration that we are currently investigating.

2. Related Work

We place GGDs in the context of relational and graph dependencies proposed in the literature.

The classical Functional Dependencies (FDs) have been studied and extended for contemporary applications in data management. Conditional Functional Dependencies (CFDs) [5] were later proposed for data cleaning tasks. CFDs enforce an FD only for a set of tuples specified by a condition, unlike the original FDs, in which the dependency holds for the whole relation. Due to their wide application to data cleaning, discovery algorithms and extensions were proposed for CFDs [6].
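The difference between an FD and a CFD can be illustrated with a small Python sketch. The relation, the attribute names and the helper `violates_fd` are hypothetical; the sketch only models the "condition on tuples" part of a CFD and ignores constant patterns on the right-hand side.

```python
def violates_fd(rows, lhs, rhs, condition=None):
    """Return True if lhs -> rhs is violated on the rows selected by `condition`.

    With condition=None this is a plain FD check over the whole relation;
    with a condition it behaves like a (simplified) CFD check.
    """
    seen = {}
    for row in rows:
        if condition is not None and not condition(row):
            continue  # tuple is outside the scope of the conditional dependency
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return True  # same left-hand side, different right-hand side
    return False


# Hypothetical relation used only for illustration.
rows = [
    {"country": "UK", "zip": "EH4", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4", "street": "Mayfield"},
    {"country": "US", "zip": "07974", "street": "Mtn Ave"},
    {"country": "US", "zip": "07974", "street": "Main St"},
]

# FD zip -> street over the whole relation is violated by the US tuples.
print(violates_fd(rows, ["zip"], ["street"]))  # True
# The same dependency restricted to country = "UK" holds.
print(violates_fd(rows, ["zip"], ["street"], lambda r: r["country"] == "UK"))  # False
```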

The ideas of FDs and CFDs were extended to graph dependencies in Graph Functional Dependencies (GFDs) [1] and Graph Entity Dependencies (GEDs) [2]. The GFDs are formally defined as a pair $(Q[\bar{x}], X \rightarrow Y)$ in which $Q[\bar{x}]$ is a graph pattern that defines a topological constraint, and $X$, $Y$ are two sets of literals over the vertex attributes that define the property-value functional dependencies of the GFD. Besides the property-value dependencies present in the GFDs, GEDs also carry special id literals to enable identification of vertices in the graph pattern.

Differential Dependencies (DDs) [7] were proposed to support applications such as entity resolution, in which the similarity of the attributes must be considered. The DDs extend the FDs by specifying constraints according to user-defined distance functions between attribute values [7]. This idea was also introduced in a class of graph dependencies, the Graph Differential Dependencies (GDDs) [3].

Recently, PG-Keys were proposed as a formalism to define keys over property graphs [8]. While GFDs and GEDs define constraints only over vertex attributes, PG-Keys can identify and constrain unique vertices, edges and properties in the property graph.

Tuple-generating dependencies (tgds) are a well-known type of dependency used in the areas of data integration and data exchange [9]. Special cases of tgds and their extensions also have a wide range of applications. One example of a special case of tgds are inclusion dependencies (INDs), which are often used to identify foreign keys in relational data [10]. Close to the idea of having constraints for property values, there is an extension to egds and tgds called constrained tuple-generating dependencies (ctgds) [11]. The ctgds extend the tgds by adding a condition (a constraint) on variables.

Other types of constraints for graphs include Graph Repairing Rules (GRRs [12]), an automatic repairing semantics for graphs, and Graph-Pattern Association Rules (GPARs [13]), a specific case of tgds that has been applied to social media marketing applications.

The main differences of GGDs compared to previous works are: (i) the use of differential constraints, (ii) edges are first-class citizens in the graph patterns (in alignment with the property graph model) and (iii) the ability to entail the generation of new vertices and edges. In general, GGD is the first constraint formalism for property graphs supporting both egds and tgds, and DDs for property values.

3. GGDs

In this section, we present the Graph Generating Dependencies (GGDs): their definition, reasoning problems and examples of practical use cases.

We start by presenting the syntax and semantics of the GGDs, previously published in [4]. A Graph Generating Dependency (GGD) is a dependency of the form $Q_s[\bar{x}], \phi_s \rightarrow Q_t[\bar{x}, \bar{y}], \phi_t$, where:

• $Q_s[\bar{x}]$ and $Q_t[\bar{x}, \bar{y}]$ are graph patterns, called source graph pattern and target graph pattern, respectively;
• $\phi_s$ is a set of differential constraints defined over the variables $\bar{x}$ (variables of the graph pattern $Q_s$); and
• $\phi_t$ is a set of differential constraints defined over the variables $\bar{x} \cup \bar{y}$, in which $\bar{x}$ are the variables of the source graph pattern $Q_s$ and $\bar{y}$ are any additional variables of the target graph pattern $Q_t$.

A differential constraint in $\phi_s$ on $Q_s[\bar{x}]$ (resp., in $\phi_t$ on $Q_t[\bar{x}, \bar{y}]$) is a constraint of one of the following forms [3, 7]:

1. $\delta_A(x.A, c) \leq t_A$
2. $\delta_{A_1 A_2}(x.A_1, x'.A_2) \leq t_{A_1 A_2}$
3. $x = x'$ or $x \neq x'$

where $x, x' \in \bar{x}$ (resp. $\in \bar{x} \cup \bar{y}$) for $Q_s[\bar{x}]$ (resp. for $Q_t[\bar{x}, \bar{y}]$), $\delta_A$ is a user-defined similarity function for the property $A$, $x.A$ is the property value of variable $x$ on $A$, $c$ is a constant of the domain of property $A$, and $t_A$ is a pre-defined threshold. The differential constraints defined by (1) and (2) can use the operators (=, <, >, ≤, ≥, ≠). The constraint (3) $x = x'$ states that $x$ and $x'$ refer to the same entity (vertex/edge), and it can also use the inequality operator, stating that $x \neq x'$. An important feature of GGDs is that both vertices and edges are considered variables (in source and target graph patterns), which allows the comparison of vertex-vertex, edge-edge and vertex-edge variables.

Consider a graph pattern $Q[\bar{x}]$, a set of differential constraints $\phi$ and a match of this pattern represented by $h[\bar{x}]$ in a graph $G$. The match $h[\bar{x}]$ satisfies $\phi$, denoted as $h[\bar{x}] \models \phi$, if the match $h[\bar{x}]$ satisfies every differential constraint in $\phi$. If $\phi = \emptyset$, then $h[\bar{x}] \models \phi$ for any match of the graph pattern $Q[\bar{x}]$ in $G$.

A GGD $\sigma = Q_s[\bar{x}], \phi_s \rightarrow Q_t[\bar{x}, \bar{y}], \phi_t$ holds in a graph $G$, denoted as $G \models \sigma$, if and only if for every match $h_s[\bar{x}]$ of the source graph pattern $Q_s[\bar{x}]$ in $G$ satisfying the set of constraints $\phi_s$, there exists a match $h_t[\bar{x}, \bar{y}]$ of the graph pattern $Q_t[\bar{x}, \bar{y}]$ in $G$ satisfying $\phi_t$ such that for each $x$ in $\bar{x}$ it holds that $h_s(x) = h_t(x)$. In case a GGD does not hold in $G$ (it is violated), it can be repaired by generating new vertices/edges in $G$.
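The semantics above can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the authors' implementation: pattern matches are assumed to be already computed (for instance by a graph query engine) and are given as dictionaries from variable names to entities; the tuple encoding of constraints, the `distance` function and the names `satisfies` and `ggd_holds` are hypothetical; only the ≤ operator is modelled; and `shared` lists the source variables that also occur in the target pattern.

```python
def distance(a, b):
    """Illustrative distance: |a - b| for numbers, 0/1 mismatch otherwise."""
    if a is None or b is None:
        return float("inf")  # a missing property never satisfies a threshold
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0 if a == b else 1


def satisfies(match, constraints):
    """Check h |= phi: every differential constraint must hold on the match."""
    for c in constraints:
        kind = c[0]
        if kind == "const":      # form 1: delta(x.A, c) <= t
            _, x, attr, const, t = c
            ok = distance(match[x]["props"].get(attr), const) <= t
        elif kind == "vars":     # form 2: delta(x.A1, x'.A2) <= t
            _, x, a1, y, a2, t = c
            ok = distance(match[x]["props"].get(a1),
                          match[y]["props"].get(a2)) <= t
        elif kind == "eq":       # form 3: x = x'
            _, x, y = c
            ok = match[x]["id"] == match[y]["id"]
        elif kind == "neq":      # form 3: x != x'
            _, x, y = c
            ok = match[x]["id"] != match[y]["id"]
        else:
            raise ValueError(f"unknown constraint kind: {kind}")
        if not ok:
            return False
    return True


def ggd_holds(source_matches, target_matches, phi_s, phi_t, shared):
    """Each source match satisfying phi_s needs an agreeing target match satisfying phi_t."""
    for hs in source_matches:
        if not satisfies(hs, phi_s):
            continue  # the dependency is not triggered by this match
        found = any(
            all(ht[x]["id"] == hs[x]["id"] for x in shared) and satisfies(ht, phi_t)
            for ht in target_matches
        )
        if not found:
            return False
    return True


# Hypothetical matches: a person appearing in a magazine that is about a genre.
p = {"id": "p1", "props": {"type": "TV-Personality"}}
m = {"id": "m1", "props": {}}
g = {"id": "g1", "props": {"name": "TV-Shows"}}
source_matches = [{"p": p, "m": m}]
target_matches = [{"m": m, "g": g}]
phi_s = [("const", "p", "type", "TV-Personality", 0)]
phi_t = [("const", "g", "name", "TV-Shows", 0)]

print(ggd_holds(source_matches, target_matches, phi_s, phi_t, shared=["m"]))  # True
print(ggd_holds(source_matches, [], phi_s, phi_t, shared=["m"]))              # False
```

Keeping pattern matching outside the sketch separates the topological part of a GGD from its differential constraints, which are the only part evaluated here.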

Example - The GGD in Figure 1 describes that whenever there is a match of the source graph pattern, in which a Person node of the type “TV-Personality” is connected to a Magazine by the edge “appeared on”, there should exist an edge of the type “is about” from this Magazine to a node labeled Genre whose attribute name is “TV-Shows”.

Figure 1: Source graph pattern: (p:Person)-[a:appeared_on]->(m:Magazine); target graph pattern: (m:Magazine)-[i:is_about]->(g:Genre); $\phi_s = \{\delta(p.type, \text{“TV-Personality”}) = 0\}$; $\phi_t = \{\delta(g.name, \text{“TV-Shows”}) = 0\}$.

3.2. Reasoning of GGDs

To understand the application of GGDs in real-world data and their properties, we study how we can reason about GGDs. We discuss three reasoning problems for GGDs: Satisfiability, Implication, and Validation. Due to space limitations, in this section we give an overview of how each of these problems can be solved.

Satisfiability - A set of GGDs $\Sigma$ is satisfiable if there exists a model, that is, a graph $G$ such that (i) $G \models \Sigma$ and (ii) for each GGD $\sigma \in \Sigma$ there exists a match of $Q_s[\bar{x}]$ in $G$. The satisfiability problem is to answer: given a set $\Sigma$, is $\Sigma$ satisfiable?

Informally, the Satisfiability problem is to verify if the set of GGDs $\Sigma$ is consistent. A set of GGDs $\Sigma$ is not satisfiable when contradictory constraints are enforced on the same node or edge in a graph $G$. Similar to the Satisfiability checking for GEDs [13], the satisfiability problem for GGDs can be solved by using a Chase procedure for GGDs. Given a graph $G_\Sigma$ which contains a match of each source side of each GGD in $\Sigma$, if the Chase procedure terminates and there are no infeasible/contradictory differential constraints enforced on any properties of any nodes/edges of $G_\Sigma$, we can conclude that the set $\Sigma$ is satisfiable.

Implication - Given a set of GGDs $\Sigma$ and a GGD $\sigma$, does $\Sigma$ imply $\sigma$ (denoted by $\Sigma \models \sigma$) for every non-empty graph $G$ that satisfies $\Sigma$?

The implication of data dependencies can be proven by using the Chase. Given an initial graph $G_\sigma$ which contains a match of the source of $\sigma$, the Chase procedure will iteratively apply the GGDs of $\Sigma$ and enforce their target constraints. After the Chase terminates, if for every match of the source of $\sigma$ there exists a match of its target in $G_\sigma$, then the implication is true; otherwise, it is false.

Validation - Given a set of GGDs $\Sigma$ and a non-empty graph $G$, does the set of GGDs $\Sigma$ hold in $G$, denoted as $G \models \Sigma$?

The validation problem can be solved by an algorithm that, for each match $h_s(\bar{x})$ of the source graph pattern of a GGD, performs the following steps:

1. Check if $h_s(\bar{x})$ satisfies the source constraints (i.e., $h_s(\bar{x}) \models \phi_s$). If yes, then continue.
2. Retrieve all matches $h_t(\bar{x}, \bar{y})$ of the target graph pattern $Q_t[\bar{x}, \bar{y}]$ where $h_s(x) = h_t(x)$ for all $x \in \bar{x}$. If there are no such matches of the target graph pattern, return false.
3. Verify if $h_t(\bar{x}, \bar{y}) \models \phi_t$. If there exists at least one match of the target graph pattern such that $h_t(\bar{x}, \bar{y}) \models \phi_t$, then return true, else return false.

3.3. Practical Use of GGDs

In this section, we present how GGDs can be used in practice in two different scenarios: (1) Identifying data inconsistencies and (2) Entity Resolution. The algorithms used in these scenarios were implemented in the sHINER² system using the G-Core language interpreter³ [14] and the Spark framework⁴.

Identifying Data Inconsistencies - Given a set of GGDs $\Sigma$, we define as inconsistent data a set of graph pattern matches of the source side of each GGD in $\Sigma$ for which there does not exist a match of the target side that satisfies the target constraints $\phi_t$.

From the definition of inconsistent data, we can observe that this problem is related to checking if a set of GGDs $\Sigma$ is violated or not. For this reason, to identify inconsistent data, we modify the previously introduced Validation algorithm to return which matches of the source ($h_s(\bar{x}) \models \phi_s$) were not validated, instead of returning true or false depending on whether the set is validated or not.

We implemented two versions of this algorithm using “left anti joins” and “left outer joins” to identify data inconsistencies. This choice of operators was based on previous studies in the literature on validation over tgds in the scenario of validating schema mappings [15, 16]. Although there is room for optimization and improvement of the implementation, the goal of these results is to show that GGDs are feasible even when using an available query engine such as SparkSQL.

We compared these two versions of the algorithm using the LDBC benchmark dataset⁵. For these datasets, we manually defined a set of GGDs. Even though both versions of the algorithm finished in a feasible time, the algorithm that uses left anti join performed better in terms of scalability (see Figure 2).

Figure 2: Scalability of Validation of GGDs (LDBC dataset; scale factors 0.1, 0.3, 1.0 and 3.0; left anti join versus left outer join).

Entity Resolution - Entity Resolution is the task of identifying instances of data that refer to the same real-world entity. Entity Resolution can be used not only to deduplicate data but also to integrate datasets according to real-world entities that they have in common.

GGDs can be used in practice to describe rules for Entity Resolution. By using GGDs, we are able to declare deduplication rules over graph data according to graph patterns and the similarity of the attributes of nodes and edges defined in these graph patterns. Figure 3 shows an example of a GGD for Entity Resolution.

Figure 3: Example of a GGD used for Entity Resolution.

To check if two entities match according to a GGD, we can use the same modified version of the Validation algorithm used for data inconsistencies. This algorithm will identify which matches of the source graph pattern should have an edge indicating that they are the same but do not actually have one in the dataset. Given this set of matches, the next step is to generate the new edges. We implemented a version of the GGDs repair algorithm by assuming that these missing edges will always be generated. Since the generation of an edge can trigger the validation of another GGD, this algorithm stops when there are no changes in the graph.

Using our implementation in sHINER and the expertise of our industrial partners in the SmartDataLake project, we set the GGDs manually and tested their performance on datasets from our industrial partners. Our partners were able to achieve the same Recall and Precision using GGDs compared to their internal tools, but with less human effort to run entity resolution.

² https://github.com/smartdatalake/gcore-spark-ggd
³ https://ldbcouncil.org/gcore-spark/
⁴ https://spark.apache.org/
⁵ https://ldbcouncil.org/benchmarks/snb/

4. Research Plan

Given the work on GGDs, in this section we introduce the topics that we will investigate on GGDs for graph data profiling.

Approximate Discovery of GGDs - Discovering data dependencies from the data is not a trivial task, and new challenges arise when dealing with graph data. Dependency classes proposed for graph data in the property graph model are usually defined over a graph pattern, which means that not only the constraints over attributes should be discovered, but also the constraints over the topology must be identified by the discovery algorithm, as in previous work on graph dependencies [1, 17]. For GGDs, correlations between graph patterns must additionally be identified. To the best of our knowledge, there is no method that can discover such correlations. Our goal is to develop an algorithm that, given a data graph and a degree of inconsistency, outputs a set of GGDs $\Sigma$ which holds on the graph with at most that degree of violation. To solve this problem, we are currently working on an approach that uses the ideas proposed in the association rule mining area [18] to discover correlations between graph patterns, and on metrics or strategies that can quantitatively measure this degree.

Visualization of GGDs - One of the main advantages of dependencies is that their syntax is human interpretable. However, depending on the content and volume of the data, it can be difficult for the general user to understand the cases in which a dependency holds or not. While there are many tools that have explored the idea of visualization of data dependencies using relational data [19], to the best of our knowledge, this topic has not been explored in the context of graph data. Moreover, many of these systems focus on ways to visualize the semantics of the dependency and not how this dependency occurs in the data. Given these challenges, the goal is to develop a system in which the user can understand the GGDs through visualization of examples of data in which the GGDs hold.

User-guided repair using GGDs - Repairing graphs with constraints is a key task to ensure data quality. The repairing problem for GGDs can be defined as: given a set of GGDs $\Sigma$ and a data graph $G$, make $G \models \Sigma$, meaning $\Sigma$ holds on $G$. Given the “generating” property of GGDs, a naive way to repair the graph data is to always create new nodes or edges. However, this solution can create even more noise in the data and might not generate useful information. To avoid this situation, the knowledge of the dataset specialist is crucial to correctly clean the data [20].
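The naive strategy mentioned above (always generate what is missing and repeat until the graph no longer changes) can be sketched as a simple fixpoint loop. This is a minimal illustration, not the sHINER implementation: edges are plain `(label, source, target)` tuples, each rule is a hypothetical Python function returning the edges it requires, and `max_rounds` is a safety guard added for this sketch.

```python
def naive_repair(edges, rules, max_rounds=100):
    """Add every edge required by some rule until no rule reports a missing edge."""
    edges = set(edges)
    for _ in range(max_rounds):
        missing = set()
        for rule in rules:
            # Required edges that are not in the graph yet (anti-join style difference).
            missing |= rule(edges) - edges
        if not missing:
            return edges  # fixpoint: no rule is violated any more
        edges |= missing  # generated edges may trigger other rules in the next round
    raise RuntimeError("no fixpoint reached within max_rounds")


# Hypothetical rules used only for illustration.
def friend_implies_knows(edges):
    """Every 'friend' edge requires a 'knows' edge between the same vertices."""
    return {("knows", u, v) for (label, u, v) in edges if label == "friend"}


def knows_is_symmetric(edges):
    """Every 'knows' edge requires the reverse 'knows' edge."""
    return {("knows", v, u) for (label, u, v) in edges if label == "knows"}


repaired = naive_repair({("friend", "a", "b")},
                        [friend_implies_knows, knows_is_symmetric])
print(sorted(repaired))
# [('friend', 'a', 'b'), ('knows', 'a', 'b'), ('knows', 'b', 'a')]
```

The second rule only fires after the first one has generated an edge, which is why the loop must re-check all rules after every round. The sketch also shows the drawback named in the text: every missing edge is created unconditionally, without asking whether it is meaningful.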

Involving the user can be very expensive because of the large number of possibilities to be verified. For this reason, this topic has two main challenges: (1) develop a mechanism to suggest to the user the best option on how to repair the data; and (2) develop a policy on how and in which order the suggestions should be presented to the user. To the best of our knowledge, there is no method for user-guided repair using dependencies in the property graph model; however, studies in the context of relational data and on knowledge bases [20, 21] can be reviewed and provide inspiration to solve the problem.

5. Conclusion

In this project, we propose GGDs, a new class of dependencies for the property graph model. GGDs can express an association between (possibly different) graph patterns and their attributes. GGDs can be used to describe meaningful information about the graph data and assist the user in further data analysis. In this paper, we presented the definition of GGDs, practical use cases of GGDs and the topics and challenges that we will investigate on graph data profiling using GGDs.

Acknowledgments

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825041.

References

[1] W. Fan, Y. Wu, J. Xu, Functional dependencies for graphs, in: SIGMOD, 2016, pp. 1843–1857.
[2] W. Fan, P. Lu, Dependencies for graphs, ACM Trans. Database Syst. 44 (2019) 5:1–5:40.
[3] S. Kwashie, L. Liu, J. Liu, M. Stumptner, J. Li, L. Yang, Certus: An effective entity resolution approach with graph differential dependencies (GDDs), Proc. VLDB Endow. 12 (2019) 653–666.
[4] L. C. Shimomura, G. Fletcher, N. Yakovets, GGDs: Graph generating dependencies, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2217–2220.
[5] P. Bohannon, W. Fan, F. Geerts, X. Jia, A. Kementsietsidis, Conditional functional dependencies for data cleaning, in: ICDE, 2007, pp. 746–755.
[6] W. Fan, F. Geerts, J. Li, M. Xiong, Discovering conditional functional dependencies, IEEE Transactions on Knowledge and Data Engineering 23 (2011) 683–698.
[7] S. Song, L. Chen, Differential dependencies: Reasoning and discovery, ACM Trans. Database Syst. 36 (2011) 16:1–16:41.
[8] R. Angles, A. Bonifati, S. Dumbrava, G. Fletcher, K. W. Hare, J. Hidders, V. E. Lee, B. Li, L. Libkin, W. Martens, F. Murlak, J. Perryman, O. Savković, M. Schmidt, J. Sequeda, S. Staworko, D. Tomaszuk, PG-Keys: Keys for property graphs, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 2423–2436.
[9] R. Fagin, P. G. Kolaitis, R. J. Miller, L. Popa, Data exchange: Semantics and query answering, in: D. Calvanese, M. Lenzerini, R. Motwani (Eds.), Database Theory - ICDT 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 207–224.
[10] J. Bauckmann, U. Leser, F. Naumann, V. Tietz, Efficiently detecting inclusion dependencies, in: 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 1448–1450. doi:10.1109/ICDE.2007.369032.
[11] M. J. Maher, D. Srivastava, Chasing constrained tuple-generating dependencies, in: Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, 1996, pp. 128–138.
[12] Y. Cheng, L. Chen, Y. Yuan, G. Wang, Rule-based graph repairing: semantic and efficient repairing methods, in: ICDE, 2018, pp. 773–784.
[13] W. Fan, Dependencies for graphs: Challenges and opportunities, J. Data and Information Quality 11 (2019) 5:1–5:12.
[14] R. Angles, M. Arenas, P. Barcelo, P. Boncz, G. Fletcher, C. Gutierrez, T. Lindaaker, M. Paradies, S. Plantikow, J. Sequeda, O. van Rest, H. Voigt, G-CORE: A core for future graph query languages, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1421–1432.
[15] B. Alexe, B. ten Cate, P. G. Kolaitis, W.-C. Tan, Designing and refining schema mappings via data examples, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 133–144.
[16] A. Bonifati, G. Mecca, A. Pappalardo, S. Raunich, G. Summa, Schema mapping verification: The spicy way, in: Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 85–96.
[17] M. Alipourlangouri, F. Chiang, Keyminer: Discovering keys for graphs, in: VLDB workshop TD-LSG, 2018.
[18] W. Fan, X. Wang, Y. Wu, J. Xu, Association rules with graph patterns, Proc. VLDB Endow. 8 (2015) 1502–1513.
[19] B. Breve, L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Visualizing dependencies during incremental discovery processes, in: EDBT/ICDT Workshops, 2020.
[20] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, I. F. Ilyas, Guided data repair, Proceedings of the VLDB Endowment 4 (2011).
[21] A. Arioua, A. Bonifati, User-guided repairing of inconsistent knowledge bases, in: EDBT: Extending Database Technology, OpenProceedings.org, 2018, pp. 133–144.