1 Introduction

Smart Data Exchange (DISCUSSION PAPER)

Sergio Greco

Michele Ianni

Elio Masciari

Domenico Sacca`

Irina Trubitsyna

elio:masciari@unina:it

greco

ianni

sacca

trubitsynag@dimes:unical:it

0 DIETI-Universita` degli Studi di Napoli Federico II , 80125 Napoli (NA) , Italy 1 DIMES-Universita` della Calabria , 87036 Rende (CS) , Italy

The problem of exchanging data, even considering incomplete and heterogeneous data, has been deeply investigated in the last years. The approaches proposed so far are quite rigid as they refer to fixed schema and/or are based on a deductive approach consisting in the use of a fixed set of (mapping) rules. In this paper we describe a smart data exchange framework integrating deductive and inductive techniques to obtain new knowledge. The use of graph-based representation of source and target data, together with the midway relational database and the extraction of new knowledge allow us to manage dynamic databases where also features of data may change over the time.

1 Introduction

Many proposal have been made for data exchange among heterogeneous sources. However, for the best of our knowledge, no generalization of the consolidated data exchange framework, that supports both the extraction of new knowledge and flexible representation of heterogeneous data has been defined so far. In this paper we describe an extension of the data exchange framework, called Smart Data Exchange (SDE), recently proposed in [ 13 ], that addresses these issues. The global scenario is showed in Fig. 1, where (i) Source, Target and Midway are three databases, (ii) SM and MT are extended TGDs (Tuple Generating Dependencies) mapping data from Source to Midway and standard TGDs mapping data from Midway to Target, respectively, (iii) M and T are extended TGDs defined over the Midway database and standard TGDs and EGDs (Equality Generating Dependencies) defined over the Target database, respectively.

The idea is that, by allowing an intermediate database and a richer language to derive new information, as well as information obtained by analyzing data, we may define more powerful and flexible tools for data exchange. To have a flexible and general representation of source and target databases, following the RDF approach, we decided to model them using a graph-based formalism where data are stored into ternary relations. Differently from other formalisms where data are stored into a unique relation [ 3, 18 ], we consider multiple ternary relations, but analogously to RDF our graph data consist of tuples of the form (i; a; v) with i being a (resource) identifier, a an attribute name and v a value (either a constant or an identifier). The midway database is a relational database containing tuples of any arity that are either imported from the source database using the SM source-midway mapping rules, or are generated by the Smart Analyzer Tool (SAT). Data dependencies denoted by M are used to generate new information useful to enrich the target database. The target database is graph-based and is built by importing data from the midway database using the mapping rules MT. Finally, as in the standard data exchange scenario, T consists of a set of TGDs or EGDs.

While we assume that SAT is a black-box aiming to produce new data stored in the midway database (in our prototype it implements several data mining algorithms specific for the domain application), the language used to define M is a rich logical language that makes use of comparison predicates, aggregate predicates and nondeterministic choice constructs (well studied in the past by the community of logic databases). This language allows to obtain more realistic information that could be used in a flexible way by both data experts (for analysis purposes) and inexperienced users (for a better navigation through data). Another peculiarity of our framework is that the data exchange takes also advantages of information (under the form of facts) generated by a data analyzer module and stored into the midway database, together with the data extracted from the source database.

The idea to model data analysis during the data mapping is present in Data Posting framework [ 8 ] and its simplified version [ 19 ]. Differently from SDE, the framework proposed in [ 8, 19 ] does not use graph-based formalisms for source and target data and does not have an intermediate level for data analysis. Moreover, it does not admit the use of existentially quantified variables in the mapping rules and uncertain values in the body of mapping rules can be represented only by means of non-deterministic variables, whose values can be selected from specific relations, possibly restricted by target count constraints. 2

Background

A Tuple Generating Dependency (TGD) is formula of the form: 8x (x) ! 9y (x; y), where (x) and (x; y) are conjunctions of atoms, and x; y are lists of variables. Full TGDs are TGDs without existentially quantified variables. An Equality Generating Dependency (EGD) is a formula of the form: 8x (x) ! x1 = x2, where (x) is conjunction of atoms, while x1 and x2 are variables in x. In our formulae we omit the universal quantifiers, when their presence is clear from the context.

Let S = S1; :::; Sn and T = T1; :::; Tm be two disjoint schemas. We refer to S (resp. T ) as the source (resp. target) schema and to the Si’s (resp. Tj ’s) as the source (resp. target) relation symbols. Instances over S (resp. T ) will be called source (resp. target) instances. If I is a source instance and J is a target instance, then we write hI; J i for the instance K over the schema S [ T such that K(Si) = I(Si) and K(Tj ) = J (Tj ), for i n and j m.

The data exchange setting [ 11, 2 ] is a tuple (S; T; st; t), where S is the source schema, T is the target schema, t are dependencies over T and st are source-totarget TGDs. The dependencies in st map data from the source to the target schema and are TGDs of the form 8x( s(x) ! 9y t(x; y) ), where s(x) and t(x; y) are conjunctions of atomic formulas on S and T , respectively. Dependencies in st are also called mapping dependencies. Dependencies in t specify constraints on the target schema and can be either TGDs or EGDs.

The data exchange problem associated with this setting is the following: given a finite source instance I, find a finite target instance J such that hI; J i satisfies st and J satisfies t. Such a J is called a solution for I. The computation of an universal solution (the compact representation of all possible solutions) can be done by means of the fixpoint chase algorithm, when it terminates [ 9 ]. Decidable (sufficient) conditions ensuring chase termination can be found in [ 16, 6, 15 ].

Queries over relational database can be expressed using Relational Algebra, or alternative equivalent languages such safe Relational Calculus and safe, nonrecursive Datalog (with negation). To make query languages more expressive, several additional feature have been added to these languages including the possibility to manage bags (SQL), aggregates (SQL), recursion (Datalog), existential variables (Datalog ) and others.

The choice constructs [ 20, 17 ] have been introduced in Datalog to get an increase in expressive power and to obtain simple declarative formulations of classical combinatorial problems, such as those which can be solved by means of greedy algorithms [ 20 ].

A choice atom is of the form choice((x); (y)), where x and y are lists of variables such that x \ y = ;, and can occur in the body of a rule. Its intuitive meaning is to force each initialization of x to be associated with a unique initialization of y, thus making the result of executing of a corresponding rule nondeterministic. A choice rule is of the form: A(w) B(z); choice((x); (y)), where A(w) is an atom, B(z) is a conjunction of atoms, x, y, z and w are lists of variables such that x; y; w z. The atom choice((x); (y)) is used to enforce the functional dependency x ! y on the set of atoms derived by means of the rule.

The choice-least and choice-most constructs [ 17 ] specialize the choice construct so as to force greedy selections among alternative choices—these turn out to be particularly useful to express classical greedy algorithms. A choice-least (resp. choicemost) atom is of the form choice least((x); (c)) (resp. choice most((x); (c))), where x is a list of variables and c is a single variable ranging over an ordered domain. A choice least((x); (c)) (resp. choice most((x); (c))) atom in a rule indicates that the functional dependency defined by the atom choice((x); (c)) is to be satisfied, and the c value assigned to a certain value of x has to be the minimum (resp. maximum) one among the candidate values. The body of a rule may contain even more than one choice constructs, but only one choice-least or choice-most atom. For instance, the rule p(x; y; c) q(x; y; c); choice((x); (y)); choice least((x); (c)) imposes the functional dependency x ! y; c on the possible instances of p. In addition, for each value of x, the minimum among the candidate values of c must be chosen. For instance, assuming that q is defined by the facts q(a; b; 1) and q(a; d; 2), from the rule above we might derive either p(a; b; 1) or p(a; d; 2). However, the choice-least atom introduces the additional requirement that the minimum value on the third attribute has to be chosen, so that only p(a; b; 1) is derived. The formal semantics is defined by rewriting rules with choice atoms into rules with negated literals and selecting (nondeterministically) one of the stable models of the rewritten program [ 20, 17 ]. 3

Smart Data Exchange

In this section we present the data exchange framework informally discussed in the Introduction. We assume the existence of the following countably infinite sets: relation names R, identifiers I, attribute names A, constants C, nulls N and variables V. The set of relation symbols (also called predicate symbols) is partitioned into three countable sets denoted by RS (source relations), RT (target relations) and RM (midway relations), whereas DS = I [ A [ C, DT = I [ A [ C [ N and DM = I [ A [ C denote the domains of relations RS, RT and RM, respectively. Relations in RS and RT have arity 3 and take values from I A I [ C and I [ N A [ N I [ C [ N respectively, n whereas relations in RM may have any arity n and take values from DM . The main difference between the source and the target databases is that the target database may also have nulls (corresponding to blank nodes in RDF) which are introduced to satisfy constraints. The set of source (resp. target, midway) relations define the source (resp. target, midway) database whose schema is denoted by S (resp. T , M ).

The model defined above states that the source and target databases are graph-based databases stored in a relational database (using triples), whereas the midway database is a standard relational database. This choice is due to the fact that we would exchange data among heterogeneous databases and we want model data whose schema may change over the time.

Extended TGDs. An atom is of the form p(t1; :::; tn), where t1; :::; tn are terms (standard atom), or of the form t1 t2, where is a comparison predicates and t1; t2 are terms (built-in atom). A literal is an atom A (positive literal) or its negation :A (negative literal). A conjunction of literals is of the form B1 ^ ^Bn where B1; :::; Bn are literals. A conjunction of literals is said to be safe if all variables occurring in built-in atoms and negated literals also appear in positive literals. From now on we assume that whenever we consider conjunctions of literals they are safe.

Definition 1. An extended TGD (ETGD) is a universally quantified implication formula of one of the following forms: – '(x) ^ choice(x0) ! (w), where '(x) is a safe conjunction of standard and built-in literals, (w) is a conjunction of atoms, w x and choice(x0), with x0 x, is a possibly empty conjunction of choice-atoms; – '(x) ! q(w0; c1hw1i; :::; cnhwni) (called aggregate dependency) where '(x) is a safe conjunction of standard and built-in literals, c1; :::; cn denote aggregate functions (e.g., min; max; sum; count), w0 (called “group-by” variables) and w1; :::; wn are lists of variables such that w0; :::; wn x and w0\wi = ; 8i 2 [1; n]. 2 For any set of data dependencies, the dependency graph G = (V; E) is built as follows: V consists of the predicate symbols occurring in , whereas there is an edge from p to q if there is a dependency having p in the body and q in the head. Moreover, the edge is labeled with : (resp. ag) if p occurs negated (resp. occurs in the head atom which contains aggregate functions). A set of dependency is said to be stratified if the depencecy graph does not contain cycles with labeled edges. Observe that since standard dependencies are positive (that is, all body literals are positive), to check stratification it is sufficient to check stratification of ETGDs. From now on we assume that our dependencies are stratified.

The semantics of ETGDs with choice atoms can be defined as in the case of Datalog rules. To this end, in order to eliminate head conjunctions, any ETGDs r : '(x) ! p1(y1) ^ ^ pk(yk) having k > 1 atoms in the head is rewritten into: (i) k ETGDs ri : '(x) ! pi(yi) (i 2 [1; k]), if r does not contain choice atoms, and (ii) k+1 ETGDs r0; :::; rk if r contains choice atoms, where r0 = '(x) ! hr(x) and ri = hr(x) ! pi(yi) (with i 2 1; k]), where hr is a fresh new predicate. Regarding the semantics of ETGDs with aggregate functions, as the set of ETGDs is stratified, we can partition it into strata so that, if a stratum contains an ETGD with aggregate functions, then it does not contain any other ETGD, and compute one stratum at time following topological order defined over the strata by the dependency graph. The computation of an ETGD with aggregate functions can be carried out in the same way of computing SQL queries with aggregates, as body predicates have been already computed and, therefore, they correspond to database relations in SQL (see also [ 12 ]).

Smart data exchange. Now we formally define the Smart Data Exchange Framework. – S is the source schema containing predicates taken from RS, – M is the midway schema containing predicates taken from RM, – T is the target schema containing predicates taken from RT, – SM is a source-to-midway set of safe ETGDs, – M is a midway-to-midway set of safe, stratified ETGDs, – MT is a midway-to-target set of standard TGDs, and – T is a target-to-target set of standard TGDs and standard EGDs.

As already pointed out, we assume that the source and target relations are graphbased data and every class of objects is modeled as a named set of triples. RDF graph, microdata and JSON representations are very closed to this description [ 1 ]. The midway database is a relational database containing relations of any arity including data in the format of the source and target database.

For any SDE E = (S; M; T; SM; M; MT; Ti, we shall use the following notation: 1 = SM [ M, 2 = MT [ T, and = 1 [ 2.

Example 1. Consider a graph relation in the source database containing facts of the form graph(id1; edge; id2) where id1 and id2 are node identifiers and edge is an attribute value denoting that there is an edge from id1 to id2. The data exchange problem consists in extracting a spanning tree to be stored in the target database. From the source database we can import edges and nodes in the midway database, as defined in the set

SM consisting of the ETGD: graph(x; edge; y) ! edge(x; y) ^ node(x) ^ node(y). Assume now that the midway database also contains a fact root(0), for instance generated by the SAT module or imported from the source database using another sourceto-midway dependency. The next set of ETGDs M shows how it is possible to generate a spanning tree rooted in the node x denoted by fact root(x).

root(x) ! st(nil; x) st(z; x) ^ edge(x; y); choice((y); (x)); ! st(x; y) ^ connected(y) node(x) ^ :connected(x); choice((); (x)) ! nextRoot(x) Here the first ETGD is used to start the computation by deriving an edge ending in the root node (the starting node is the dummy node nil), whereas the second ETGD, imposing the functional dependency y ! x, guarantees that the set of selected nodes is a spanning tree. The last ETGD in M gives a node not belonging to the spanning tree if the graph is not connected; this node can be used in the future as a root to compute another spanning tree.

Spanning tree edges and nextRoot facts are then imported in the target database (as triples) using the below set of TGDs MT: st(x; y) ! graph(x; st; y) nextRoot(x) ! 9y graph(nil; root; x) Information stored in the target database are next analyzed to generate new information which will be stored in the midway database. For instance, from a fact graph(nil; root; x) the SAT module could generate a fact root(x) which, after been stored in the midway database, could generate the computation of another spanning tree rooted in x. 2

Note that the process described in the previous example and showed in Fig. 1 is supervised, that is the activation of the module SAT is performed by the user.

The smart data exchange problem associated with this setting is the following: given a finite source instance I over a schema S and a smart data exchange E = (S; M; T; SM; M; MT; T), find finite instances J and K over the schemas M and T , respectively, such that hI; J i satisfies SM, J satisfies M, hJ; Ki satisfies MT and K satisfies M. The pair hJ; Ki is called a solution (or model) for hI; Ei, or equivalently for hI; i, where let 1 = SM [ M and 2 = MT [ T, = 1 [ 2. The set of solutions for hI; Ei (or equivalently for hI; i) is denoted by Sol(I; E), (resp. Sol(I; )). Moreover, J is also called solution (or model) for hI; 1i and, in cascade, K is a solution (or model) for hJ; 2i. Thus, the problem of finding solutions can be split into two problems: (i) finding a solutions J for I, and (ii) finding solutions K for J . The set of solutions for hI; 1i is denoted by Sol(I; 1), whereas the set of solutions for hJ; 2i is Sol(J; 2). Therefore, given a smart data exchange framework, starting from I with dependency 1 we find solutions J1; :::; Jm and, then starting from every Ji (1 i m) with dependency 2 we find solutions Ki1 ; :::; Kin . Observe that m is always finite, whereas in, in the general case, is not guaranteed to be finite [ 10 ]. Query answering. Since we have multiple models, we distinguish the certain answer derived from all models and the nondeterministic answer.

Definition 3. Given a query Q, a database I and an SDE E, the certain answer of Q over hI; Ei is Certain(Q; I; E) = ThJ;Ki2Sol(I;E) Q(K): 2

Since the certain answer could be equivalently defined as Certain(Q; I; E) = TJ2Sol(I; 1)^K2Sol(J; 2) Q(K), its computation can be optimized by considering the four sets of dependencies separately, that is by first computing solutions J 0 for hI; STi, then solutions J for hJ 0; Mi, next solutions K0 for hJ; MTi, and, finally solutions K for hK0; Ti. Let us now introduce the concepts of homomorphism and universal model.

A homomorphism from a set of atoms A1 to a set of atoms A2 is a mapping h from the domain of A1 (set of terms occurring in A1) to the domain of A2 such that: (i) h(c) = c, for every c 2 Const(A1); and (ii) for every atom R(t1; : : : ; tn) in A1, we have that R(h(t1); : : : ; h(tn)) is in A2. With a slight abuse of notation, we apply h also to sets of atoms and thus, for a given set of atoms A, we define h(A) = fR(h(t1); : : : ; h(tn)) j R(t1; : : : ; tn) 2 Ag.

Definition 4. A universal model of (I; ) is a model hJ; Ki of (I; ) such that for every model hJ 0; K0i of (I; ) there exists a homomorphism from K to K0. The set of all universal solutions of (I; ) will be denoted by USol(I; ). 2 Proposition 1. For any positive query Q, database I and SDE E, the certain answer of Q over hI; Ei is Certain(Q; I; E) = T J 2 Sol(I; 1) Q(KJ )#, where KJ is any universal solution of hJ; 2i and Q(KJ )# is the result of computing naively (i.e. considering nulls as constants) Q(KJ ) and deleting tuples with nulls. 2

Universal solutions for hJ; 2i can be easily computed by applying the classical fixpoint algorithm called Chase which computes a subset of universal solutions called canonical [ 4, 10 ].

As previously discussed, in several cases we are not interested in all models of hI; 1i, but only in one selected nondeterministically (e.g. the set of edges of any minimum spanning tree). Thus, we now introduce the definition of nondeterminitic answer. Definition 5. The nondeterministic answer to a query Q over a database instance I and SDE E = (S; M; T; SM; M; MT; T) is N onDet(Q; I; E) = T K 2 Sol(J; 2) Q(K), where, J is a model for hI; 1i selected nondeterministically. 2

Observe that the nondeterministic choice of the model is applied only to dependencies in 1, where users express explicitly that they want select nondeterministically a subset of tuples, whereas for hJ; 2i we consider all models. Note that, two evaluations of the nondeterministic answers could give different answers (as the choices made could be different) and that the responsibility of computing nondeterministic answers, instead of certain answers, is left to the user. Indeed, in several cases the user is not interested in specific models, but only in one model satisfying some properties (e.g. any spanning tree). Therefore, we introduce the concept of universal nondeterministc model. Definition 6. A universal nondeterministic model for hI; i is any universal model in USol(D; ) selected nondeterministically. 2 Proposition 2. For any positive query Q, database I and SDE E = (S; M; T; SM; M;

MT; T), the nondeterministic answer of Q over hI; Ei is N onDet(Q; I; E) = Q(KJ )#, where, let J be any model for hI; 1i selected nondeterministically, KJ is a universal solution of hJ; 2i 2

Conclusion

In this paper we have discussed a framework supporting both analysis and flexible representation of heterogeneous data. The use of graph-based representation of source and target data, together with the extraction of new knowledge made by the SAT tool allow us to manage dynamic databases where also features of data may change over the time. The theoretical complexity of the approach is discussed in [ 13 ]. Future work should include the use of more powerful declarative languages for the midway, where limited use of function symbols is allowed [ 5, 7, 14 ].

Angles and

Gutierrez . An introduction to graph data management . In Graph Data Management, Fundamental Issues and Recent Developments ., pages 1 - 32 . 2018 .

Arenas , P. Barcelo´,

Fagin , and

Libkin . Locally consistent transformations and query answering in data exchange . In PODS , 2004 .

Arenas , G.

Gottlob, and

Pieris . Expressive languages for querying the semantic web . ACM Trans. Database Syst ., 43 ( 3 ): 13 : 1 - 13 : 45 , 2018 .

Beeri and

M. Y.

Vardi . A proof procedure for data dependencies . J. ACM , 31 ( 4 ): 718 - 741 , 1984 .

Calautti ,

Greco ,

Molinaro , and I. Trubitsyna. Logic program termination analysis using atom sizes . In IJCAI , pages 2833 - 2839 . AAAI Press, 2015 .

Calautti ,

Greco ,

Molinaro , and I. Trubitsyna. Exploiting equality generating dependencies in checking chase termination . PVLDB , 9 ( 5 ): 396 - 407 , 2016 .

Calautti ,

Greco ,

Spezzano , and I. Trubitsyna. Checking termination of bottom-up evaluation of logic programs with function symbols . TPLP , 15 ( 6 ): 854 - 889 , 2015 .

Cassavia ,

Masciari ,

Pulice , and D. Sacca`. Discovering user behavioral features to enhance information search on big data . TiiS , 7 ( 2 ):7: 1 - 7 : 33 , 2017 .

Deutsch ,

Nash , and

J. B.

Remmel . The chase revisited . In PODS , 2008 .

10.

Fagin ,

P. G.

Kolaitis ,

R. J.

Miller , and

Popa . Data exchange: semantics and query answering . Theoretical Computer Science , 336 ( 1 ): 89 - 124 , 2005 .

11.

Fagin ,

P. G.

Kolaitis , and

Popa . Data Exchange: getting to the core . ACM Trans. Database Syst ., 30 ( 1 ): 174 - 210 , 2005 .

12.

Greco . Dynamic programming in datalog with aggregates . IEEE Trans. Knowl . Data Eng., 11 ( 2 ): 265 - 283 , 1999 .

13.

Greco ,

Masciari , D.

Sacca`, and

I. Trubitsyna.

HIKE: A step beyond data exchange . In ER Conference , 2019 .

14.

Greco ,

Molinaro , and I. Trubitsyna. Logic programming with function symbols: Checking termination of bottom-up evaluation through program adornments . Theory Pract . Log. Program., 13 ( 4-5 ): 737 - 752 , 2013 .

15.

Greco ,

Spezzano , and I. Trubitsyna. Stratification criteria and rewriting techniques for checking chase termination . PVLDB , 4 ( 11 ): 1158 - 1168 , 2011 .

16.

Greco ,

Spezzano , and I. Trubitsyna. Checking chase termination: Cyclicity analysis and rewriting techniques . IEEE Trans. Knowl . Data Eng., 27 ( 3 ): 621 - 635 , 2015 .

17.

Greco and

Zaniolo . Greedy algorithms in datalog . TPLP , 1 ( 4 ): 381 - 407 , 2001 .

18. L. Libkin , J. L.

Reutter , A.

Soto , and D.

Vrgoc . Trial: A navigational algebra for RDF triplestores . ACM Trans. Database Syst ., 43 ( 1 ):5: 1 - 5 : 46 , 2018 .

19. E. Masciari, I. Trubitsyna , and D. Sacca`. Simplified data posting in practice . In IDEAS , pages 29 : 1 - 29 : 7 , 2019 .

20. D. Sacca` and

Zaniolo . Stable models and non-determinism in logic programs with negation . In Proc. PODS Conf. , pages 205 - 217 , 1990 .