-

A Diagnosis and Repair Framework for DL-Lite KBs A

Michalis Chortis

Giorgos Flouris

0 0 ICS-FORTH , Greece

Several logical formalisms have been proposed in the literature for expressing structural and semantic integrity constraints of Linked Open Data (LOD). Still, the integrity of the datasets published in the LOD cloud needs to be improved, as published data often violate such constraints, jeopardising the value of applications consuming linked data in an automatic way. In this work, we propose a novel, fully automatic framework for detecting and repairing violations of integrity constraints, by considering both explicit and implicit ontological knowledge. Our framework relies on the ontology language DL-LiteA for expressing several useful types of constraints, while maintaining good computational properties. The experimental evaluation shows that our framework is scalable for large datasets and numbers of invalidities exhibited in reality by reference linked datasets (e.g., DBpedia).

Repairing diagnosis DL-LiteA integrity constraints

Linked Open Data (LOD) published on the Web of Data are often associated with various structural (e.g., primary key) and semantic (e.g., disjointness) integrity constraints. These constraints are usually expressed in ontological [ 19, 22 ] or database [ 10 ] logic frameworks. However, LOD sources do not impose such constraints a priori, when data are created, so violations of integrity constraints must be detected and repaired a posteriori. As have been reported in [ 15 ], reference LOD sources, such as DBpedia1 or LinkedGeoData2, exhibit millions of violations (this is also veri ed by our own experiments { see Table 3).

In most of the cases, LOD are manually repaired by their curators or by their consuming applications, using, at best, diagnosis approaches or tools (e.g., [ 16, 19, 22 ], Stardog3, QuOnto [ 1 ] etc.) for detecting violations of various types of integrity constraints. Obviously, the manual repair of millions of violations is a time-consuming and error-prone task, a fact that seriously limits the data quality of the available LOD sources. Thus, a major challenge is to automatically

1 http://dbpedia.org 2 http://linkedgeodata.org 3 http://stardog.com/

detect and repair violations of both structural and semantic integrity constraints, especially when ontology reasoning is involved (i.e., detect and repair violations of constraints like disjointness, functional constraints etc., taking into account logical inference and its interaction with those constraints).

In this work, we propose a novel automatic framework for assisting curators in the arduous task of enforcing integrity constraints in large datasets. We provide an e cient methodology for detecting invalidities (diagnosis ), as well as for automatically resolving them (repairing ), in a manner that has minimal impact in terms of lost knowledge on the Knowledge Base (KB), according to the principles set out in earlier works [ 2, 8 ].

We consider detecting and repairing of invalidities attributed to constraints of a purely logical nature (e.g., class disjointness). Constraints are expressed in the language DL-LiteA [ 4 ], which belongs to the DL-Lite family of ontology languages that forms the foundation of the popular OWL 2 QL4 language. The choice of DL-LiteA was motivated by the fact that it is arguably rich enough to capture several useful types of integrity constraints that are used in practice in LOD datasets, and their interaction with implicit knowledge, while at the same time supporting e cient query answering [ 4 ].

The main contributions of our work are the following: { We propose a framework for detecting and automatically repairing invalidities, for constraints that are expressed in DL-LiteA, namely: concept/property disjointness constraints, property domain/range disjointness constraints and functional constraints. Diagnosis of invalidities related to both explicit and inferred constraints can be performed in linear time with respect to the dataset size, whereas repairing can be performed in polynomial time with respect to the number of invalidities. { We have implemented an operational repairing system for real-world applications. Our implementation is modular, allowing each component to be implemented in a manner independent to the other components. This way, we managed to reuse o -the-shelf, state-of-the-art tools for many of the components, such as reasoning, storage, query answering, etc. { We have experimentally evaluated the scalability and performance of our algorithms, using real and synthetic datasets. The main conclusion drawn is that our framework can scale for very large datasets, such as DBpedia, as well as for large numbers (millions) of invalidities.

The rest of the paper is structured as follows: in Section 2, we motivate the use of the DL-LiteA language for this problem and explain its features; in Section 3, we describe our framework and explain how we address the problems of detecting and resolving invalidities; Section 4 describes our algorithms for diagnosis and repairing; in Section 5, we describe our experimental evaluation and report on the main conclusions drawn; nally, Section 6 compares our contributions to the related work and Section 7 concludes.

4 http://www.w3.org/TR/owl2-profiles/#OWL_2_QL Preliminaries

In DL-LiteA [ 4 ], concept expressions, hereafter expressed by the letter C, and role expressions, denoting binary relations between concepts and hereafter expressed by the letter R, are formed according to the following syntax, where A denotes an atomic concept and P denotes an atomic role:

C ! A j 9R R ! P j P A DL-LiteA TBox consists of axioms of the following form:

C1 v C2 C1 v :C2 R1 v R2 R1 v :R2 (funct R) A DL-LiteA ABox is a nite set of assertions of the following form:

A(x) P (x; y)

In order to guarantee good complexity results for reasoning tasks like consistency checking, DL-LiteA imposes a limitation in the TBox, namely that a functional role cannot be specialized by using it in the right-hand side of a role inclusion assertion. This means that if a DL-LiteA TBox contains an axiom of the form R0 v R, then it cannot contain (funct R) or (funct R ) [ 5 ]. Note that DL-LiteA assertions can be also expressed in OWL syntax.

DL-LiteA follows the standard reasoning semantics of DLs [4{6]. A DL-LiteA KB K = hT ; Ai is called inconsistent i T [ A is inconsistent (in the standard logical sense). It is called consistent otherwise.

With respect to performance, DL-LiteA has the important property of FOLReducibility [ 5 ], which essentially means that one can reduce the process of inconsistency checking and query answering to the evaluation of First-Order Logic (FOL) queries over the ABox, considered as a database; this makes both tasks tractable (in LogSpace with respect to the data) [ 5 ]. 3 3.1

Diagnosis and Repair

Constraints in DL-LiteA For the purposes of diagnosis and repair, we can distinguish three di erent types of DL-LiteA TBox axioms, namely positive inclusions (of the form C1 v C2, R1 v R2), negative inclusions (of the form C1 v :C2, R1 v :R2) and functionality assertions (of the form funct R). This distinction is important for diagnosis and repair due to the fact that the ABox is viewed under the Open World Assumption (OWA), which is considered for Description Logics and the ontology languages of the Semantic Web in general (such as OWL { but see [ 22 ] for an e ort to understand OWL under the Closed World Assumption, and the NRL language5 for a similar analysis). Due to the OWA, a TBox consisting of positive inclusions only can never lead to an inconsistent KB; therefore, the only interesting (from the diagnosis perspective) constraints are the negative inclusions and the functionality assertions. In the following, the term constraint will be used to refer to negative TBox inclusions and functionality assertions.

5 http://www.semanticdesktop.org/ontologies/2007/08/15/nrl

Despite that, positive inclusions are still relevant for the diagnosis process, because they may generate inferred information that should be taken into account. As an example, assume that the TBox contains the constraint A1 v :A3 and the axiom A2 v A3 (where A1; A2; A3 are atomic concepts), and suppose that the ABox contains both A1(x) and A2(x) for some x. Even though no constraint is explicitly violated, the combination of the ABox contents with the aforementioned TBox would lead to inferring both A3(x) and :A3(x), i.e., an invalidity. Note that the positive inclusion A2 v A3, albeit not violated itself, plays a critical role in creating this invalidity.

Rather than capturing such invalidities via the obvious method of computing the closure of the ABox, it is more e cient to identify the constraints implied by the explicitly declared constraints and the positive inclusions in the TBox. In our example, we could identify that the constraint A1 v :A2 is a consequence of the two explicit axioms in the TBox, so the presence of A1(x) and A2(x) violates this implicit constraint.

This process amounts to computing all explicit and implicit constraints of the TBox (denoted by cln(T )) [ 5 ], i.e., the set of all the functionality assertions and the explicit and implicit negative inclusions present in the TBox. In fact, it has been proven that, in order to check the consistency of a DL-LiteA KB, one has to take into account only the constraints in cln(T ) [ 6 ]. More formally, a DL-LiteA KB K = hT ; Ai is inconsistent i there is a constraint c 2 cln(T ) and a pair of assertions a1; a2 2 A such that the DL-LiteA KB K0 = hfcg; fa1; a2gi is inconsistent [ 6 ]. In the following, the triple (a1; a2; c) will be called an invalidity of K. It is obvious by the above result that in order to render a KB consistent, for each invalidity (a1; a2; c), one of a1; a2 has to be removed from the ABox. Example 1. Consider the following DL-LiteA KB K = hT ; Ai:

T = f(funct P1), A1 v :A2, 9P2 v A1g

A = fA1(x1); A2(x1); P2(x1; y1); P1(x3; y2); P1(x3; y3); P1(x3; y4)g The closure of negative inclusions and functionality assertions of T (cln(T )), computed in the way that was presented in [ 6 ], is the following: cln(T ) = f(funct P1), A1 v :A2, 9P2 v :A2g From the computed closure, we can easily deduce that (A1(x1); A2(x1); A1 v :A2) is one of the invalidities in the KB. tu 3.2

Approach for Diagnosis and Repair

Diagnosis amounts to identifying the invalidities, i.e., the data assertions and the (possibly implicit) constraint that are involved in an invalidity. Using the property of FOL-Reducibility, the identi cation of invalidities in a DL-LiteA KB can be reduced to the execution of adequately de ned FOL queries over a database [ 5 ] { see also Table 1. Exploiting this property, diagnosis is performed by simply executing the queries corresponding to the constraints in cln(T ), to get all the invalidities of the KB under question.

Repairing is based on the aforementioned property that restoring consistency requires eliminating all invalidities from a KB via removing either one of the two data assertions that take part in each invalidity; formally: De nition 1. Given a DL-LiteA KB K = hT ; Ai a repairing delta of K is a selection of data assertions RD, such that K0 = hT ; A n RDi is consistent. A repairing delta RD is called minimal i there is no repairing delta RD0, such that RD0 RD. tu

The notion of minimality is important, as many authors have proposed the identi cation of minimal repairing deltas (under di erent forms of minimality) as one of the main concerns during repairing [ 2, 8 ]; as is obvious by De nition 1, minimal repairing deltas correspond to subset repairs in the terminology of [ 2 ].

Identifying the minimal repairing delta(s) is not trivial. The computation of such delta(s) is based on the fact that constraints expressed in DL-LiteA allow the presence of interrelated invalidities, i.e., data assertions being involved in more than one invalidities. This implies that potential resolutions of such invalidities coincide, and that there exist resolutions which resolve more than one invalidity at the same time.

To help in the process of identifying the minimal repairing delta, the diagnosed invalidities are organized into an interdependency graph, which is used to identify assertions involved in multiple invalidities. Formally: De nition 2. The interdependency graph of a DL-LiteA KB K = hT ; Ai is an undirected labelled graph IG(K) = (V; E) such that V = fa j (a1; a2; c) is an invalidity of K and a = a1 or a = a2g and E = f(a1; a2; c) j (a1; a2; c) is an invalidity of Kg. tu

The use of the interdependency graph as a structure to represent the invalidities that are diagnosed in the KB gives the ability to get a better grasp of the form and complexity of the invalidities and their interrelationships, as well as to use methods and tools that come from graph theory in order to facilitate the repairing process. Note that an interdependency graph is di erent from a con ict-graph [ 9 ], as the interdependency graph does not contain every assertion in the ABox, having an obvious impact in the algorithm time-cost.

In terms of the interdependency graph, resolving an invalidity amounts to removing one of the two vertices that are connected by the edge representing this invalidity. Therefore, a minimal repairing delta is essentially the minimal vertex cover of the corresponding interdependency graph, which reduces the problem of repairing to the well-known problem of vertex cover [ 11 ]. This fact forms the basis of our algorithms presented in the next section. 4 4.1

Algorithms for Diagnosis and Repairing Diagnosis Algorithm

The diagnosis algorithm is used to detect all the invalidities in a KB, and provide them as output in the form of an interdependency graph. The steps needed to perform diagnosis are illustrated in Algorithm 1. Algorithm 1 Diagnosis(K)

The diagnosis algorithm starts by computing the closure cln(T ) of negative inclusions and functionality assertions of the TBox (line 2 of Algorithm 1), in order to get the full set of constraints that need to be checked over the ABox. Each of the constraints in cln(T ) is then transformed to a FOL query (line 4) using prede ned patterns, as de ned in Table 1 (see also [ 5 ]), whose answers determine the invalidities. These queries are executed over the ABox in line 5 (Ansqc contains pairs ha1; a2i such that (a1; a2; c) is an invalidity). Note that these FOL queries can be easily expressed as SPARQL queries over an ABox stored in a triple store, so that o -the-shelf, optimized tools can be used for query answering. The last step of the algorithm encodes the invalidities in the form of an interdependency graph (lines 6-9) as speci ed in De nition 2.

Constraint (c) c = A1 v :A2 (c) = q(x) c = A1 v :9P1 (or c = 9P1 v :A1) (c) = q(x) c = A1 v :9P1 (or c = 9P1 v :A1) (c) = q(x) c = 9P1 v :9P2 (c) = q(x) c = 9P1 v :9P2 (c) = q(x) c = 9P1 v :9P2 (c) = q(x) A1(x); A2(x) A1(x); P1(x; y) A1(x); P1(y; x) P1(x; y1); P2(x; y2) P1(y1; x); P2(y2; x)

P1(x; y1); P2(y2; x)

Transformation ( (c)) c = P1 v :P2 (or c = P1 v :P2 ) (c) = q(x; y) P1(x; y); P2(x; y) c = P1 v :P2 (c) = q(x; y) P1(x; y); P2(y; x) c =(funct P ) (c) = q(x) P (x; y1); P (x; y2) c =(funct P ) (c) = q(x) P (y1; x); P (y2; x) Table 1: Transformation of DL-LiteA constraints to FOL queries. The following example illustrates the diagnosis algorithm in action: Example 2. Consider the KB K and the cln(T ) of Example 1. The corresponding FOL queries to check for invalidities, according to Table 1 are:

As already mentioned, computing cln(T ) (line 2) is in LogSpace with respect to the data [ 5 ], whereas the remaining steps of the algorithm are linear with respect to the invalidities and the constraints in cln(T ). The repairing algorithm (Algorithm 2) takes as input the interdependency graph and is responsible for automatically repairing the KB. As explained in Section 3.2, the main idea behind the repairing algorithm is the computation of the vertex cover of the interdependency graph.

To do so, the repairing algorithm rst breaks the interdependency graph IG(K) into the set of its connected components (line 2). Note that the computation of the vertex cover for each of the connected components is independent to the others, and can be parallelized for better performance.

This computation (vertex cover) is performed in lines 3-5. Recall that vertex cover is a well-known NP-complete problem [ 20 ], but many approximation algorithms have been proposed, such as the 2-approximation algorithm [ 20 ], or the approximation algorithm presented in [ 14 ]. Algorithm 2 Repair(IG(K); A) Input: An interdependency graph IG(K) and a DL-LiteA ABox A Output: K in a consistent state 1: repairing delta ; 2: CC ConnectedComponents(IG(K)) 3: for all cc 2 CC do 4: repairing delta 5: end for 6: A A n repairing delta

repairing delta [ GreedyV ertexCover(cc)

For our implementation and experiments below, we chose (for e ciency) to compute the vertex cover in a greedy manner, as presented in [ 20 ] (but any other algorithm for vertex cover could be used instead). Greedy means that, in each step of the computation, the vertex that is chosen to be included in the cover is the vertex with the highest degree (in other words, the invalid data assertion that is part of the most invalidities). If there exist more than one vertices with the same degree, one of those vertices is arbitrarily chosen; this arbitrary choice avoids the need for complex and time-consuming selection conditions and guarantees that a single vertex cover is returned by the algorithm. This computation is performed in the GreedyVertexCover subroutine, which is omitted for brevity.

The output of lines 3-5 (repairing delta) contains the data assertions to be removed from the dataset in order to render it valid. The actual repairing is performed in line 6, through a single SPARQL-Update statement6 requesting the deletion of all the assertions in the repairing delta.

The correctness of our algorithms is guaranteed by our analysis in Section 3 and the results in [ 6 ]. The computational complexity of Algorithm 2 is dominated by the computation of the vertex cover, which is proven to achieve O(log n) approximation of the optimal solution (where n is the number of vertices of the graph), with a time complexity of O(n log n) [ 20 ].

The following concludes the running example for our framework: Example 3. Consider the interdependency graph of Figure 1. The repairing algorithm will compute the following repairing delta:

repairing delta = fA2(x1); P1(x3; y2); P1(x3; y3)g After the application of the repairing delta, the ABox A is in the following state:

A = fA1(x1); P2(x1; y1); P1(x3; y4)g which can be easily veri ed to be a consistent KB with respect to T . tu 6 http://www.w3.org/TR/2013/REC-sparql11-update-20130321/ 5.1

Experimental Evaluation Overview of Experimental Evaluation

We have implemented our framework as a Java web application. More speci cally, we have created a system that uses a triple store, which lies on a Virtuoso Open-Source Edition Server7 version 07.10, as a storage for the ABox instances and as an endpoint for query answering. For storing the set of constraints, we used a main memory model, which, along with the communication with the Virtuoso Server, are handled by the Apache Jena8 framework. Apache Jena is also used for the various reasoning tasks (e.g., computation of cln(T )). The system used for the experiments was an AMD Opteron 3280, 8-core CPU with 24GB RAM (we allocated 8GB for the JVM), running Ubuntu Server 12.04.

We have performed several experiments in order to measure the performance and scalability of our framework, as well as to determine the decisive factors for the performance of the di erent phases of the process. More speci cally, we performed three sets of experiments: (i) the rst set veri ed that our framework can handle millions of violations in real-world ABoxes that scale up to more than 2 billion triples, considering hundreds of thousands of constraints; (ii) the second set of experiments quanti ed the impact of ABox size on performance, by using real Tboxes with constraints and synthetic ABoxes of varying sizes; and, (iii) the third set quanti ed the impact of the number of invalid data assertions on performance, by using real TBoxes with constraints and synthetic ABoxes with varying number of invalid data assertions.

In all of the above sets of experiments, we measured the time needed to run the diagnosis algorithm and produce the interdependency graph (diagnosis time), the time needed by the repairing algorithm to compute the repairing delta (repair computation time) and the time needed to apply this repairing delta on the dataset, using a SPARQL-Update query (repair application time). All of our experiments were run in sets of 5 hot runs and the average times were taken. 5.2

Real and Synthetic Datasets Used

For the TBox, we used two versions (3.6, 3.9) of the DBpedia ontology, which is a reference dataset for LOD, already containing di erent amounts and types of constraints; this is illustrated in Table 2, which shows information on how many functional and (concept/domain/range) disjointness constraints exist in the original TBox, as well as how many of these exist in the closure of negative inclusions (cln(T ))9, and how many queries need to be executed for diagnosis. Property disjointness is the only type of constraint supported by DL-LiteA that

7 http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/ 8 http://jena.apache.org/

9 The big di erence in the amount of disjointness constraints between the original DBpedia 3.9 and its closure is caused by the many positive inclusions and their interaction with the negative inclusions during the computation of the closure. TBox version FunctCioonnasltDr aisijnotinstness FCuonncsttiornaainltDsisinjoicnltnn(eTss) Queries DBpedia 3.6 18 0 18 0 18 DBpedia 3.9 26 17 26 323.389 323.415 was not considered in our experiments, because we were unable to nd any real TBoxes with this constraint (so it seems irrelevant for practical applications).

For running our experiments, we used the above TBoxes, together with both real and synthetic ABoxes. Real ABoxes were used to evaluate our system in realistic conditions, whereas synthetic ABoxes allow controlling the important factors for the performance of our algorithm, such as size and number of invalidities, and the appropriate evaluation of their e ect on performance.

Real ABoxes were taken by the two DBpedia versions corresponding to the two aforementioned TBoxes, stored in a local Virtuoso instance. The DBpedia 3.6 ABox contains around 541 million triples, whereas the DBpedia 3.9 ABox contains more than 2 billion triples.

To generate synthetic ABoxes, we started from each TBox and an empty ABox, and added data and property instances of the classes/properties of the corresponding TBox, making sure to include some invalid pairs of assertions as well (taking into account the constraints). For the rst set of generated ABoxes we created a xed number of invalid data assertions (10K) and a varying ABox size (500K-5M triples, with a step of 500K triples). The second set of ABoxes had a xed size (10M triples) and a varying number of invalid data assertions (50K-500K, with a step of 50K). The above two sets of ABoxes were used in the second and third set of experiments respectively. The rst set of experiments aimed at verifying the scalability of our framework in real-world settings, with ABoxes of billions of triples and with large numbers of constraints (up to hundreds of thousands). For this purpose, we used DBpedia versions 3.6 and 3.9 (TBox and real ABox). For each version, we measured the diagnosis time (identifying invalidities and creating the interdependency graph), the repair computation time (computing the repairing delta), the repair application time (applying the repairing delta) and the total time (sum of the above). We also measured the number of invalid data assertions that appear in the datasets, to see how well our framework scales with respect to that, as well as the size of the repairing delta, to verify that a manual repair by the curator would be infeasible in this context.

The results of this set of experiments are illustrated in Table 3. In the table, IDA denotes the number of invalid data assertions, Delta is the size of the repairing delta (in triples), td denotes the diagnosis time, tr:c: the repair computation time, tr:a: the repair application time and tt the total time needed for diagnosis and repairing. All times are in milliseconds.

The results show that our framework is scalable, for both large datasets and big numbers of invalid data assertions, and that it can be applied in real-world settings. It also proves that already deployed and massively used reference KBs, such as DBpedia, don't have su cient mechanisms for preventing the introduction of invalid data or for detecting and repairing such invalid data. Moreover, our experiments illustrate that the number of invalid data assertions and the size of the repairing delta would be prohibitive for manual repairing.

Our second set of experiments evaluated the e ect of ABox size on performance using synthetic ABoxes of varying sizes and a xed number of invalid data assertions. The results of this set of experiments appear in Figure 2. Note that some of the curves in the graphs are di cult to distinguish, either because they are too close to the start of the x-axis (e.g., the repair computation time and the repair application time in the left gure), or because they are too close with another curve (e.g., the diagnosis time and the total time in the right gure). Fig. 2: Performance for DBpedia 3.6 (left) and 3.9 (right) with 10K invalidities.

From the results of this set of experiments, we conclude that diagnosis time grows linearly with respect to the ABox size and that it is the dominating factor of the total time, when the number of invalid data assertions is xed. This is an important conclusion because it shows that, overall, our framework scales linearly with respect to the dataset size.

The third set of experiments evaluated the e ect of the number of invalidities using ABoxes of xed size, but with a varying number of invalid data assertions. The results of this set of experiments are illustrated in Figure 3.

From these results, we can conclude that the number of invalid data assertions has no immediate impact on the diagnosis time. On the contrary, it is the main impact factor of the repair computation time. That was an expected behaviour, as the repair computation is done by computing the vertex cover of the interdependency graph. A bigger number of invalid data assertions leads to a bigger graph and this leads to a more costly computation of the vertex cover.

Another signi cant impact factor of the repair computation time is the amount of interdependencies in the interdependency graph. We can see that the repair computation time increases with a higher rate in the left graph of Figure 3 than in the right one, which can be explained by the fact that the DBpedia 3.6 TBox contains only functional constraints, which form cliques in the interdependency graph (thus, more interdependencies), whereas the DBpedia 3.9 TBox contains mainly disjointness constraints, which cause less interdependencies, therefore less \touching" edges in the interdependency graph.

Moreover, the repair application time seems to be negligible in all of the experiments. This is due to the fact that the repair application is performed by executing a single SPARQL-Update query requesting the deletion of all the triples in the repairing delta, which is very e cient due to the optimizations for batch operations of Virtuoso.

Fig. 3: Performance for DBpedia 3.6 (left) and 3.9 (right) with 10M triples.

The last signi cant conclusion comes from the comparison of the times measured for the two di erent DBpedia TBox versions. We see that the diagnosis times for version 3.9 are two orders of magnitude higher compared to the respective times of version 3.6. This is due to the fact that the closure of version 3.9 contains 323.415 constraints, whereas the closure of version 3.6 only 18; more constraints require the generation of more queries to be executed by the diagnosis algorithm, eventually causing this big di erence in the measurements.

The following main conclusions can be distilled from our evaluation: { Diagnosis can be performed in linear time with respect to the ABox size. { Repair computation can be performed in polynomial time with respect to the number of invalid data assertions that appear in the dataset. { Our implementation enjoys a decent performance in real-world settings with large datasets, numbers of constraints and invalidities, being able to repair the huge DBpedia 3.9 (>2B triples) in about 10 hours, which is a reasonable amount of time, given that repairing is expected to be an o ine process.

It should be noted that the experimental evaluation of the only other work in the literature that performs automated repairing of inconsistent DL-LiteA KBs ([ 18 ] { see Section 6), only considers datasets of size up to 30.000 triples, whereas we consider datasets of up to 5 orders of magnitude larger; thus, the results are not comparable. 6

Related Work

The problem of inconsistencies appearing in KBs can be tackled either by providing the ability to query inconsistent data and get consistent answers (Consistent Query Answering - CQA) [ 3 ], or by actually repairing the KB, which leads to a consistent version of it [ 12 ]. Both these approaches have attracted researchers' attention, mostly in the context of relational databases and, lately, in the context of linked data and ontology languages as well.

In the context of relational databases, CQA has been studied in various works dealing with di erent classes of conjunctive queries and denial constraints, mainly key constraints (e.g., [ 13, 23 ]). These works underline the main advantages of using First-Order query rewriting for the validation of integrity constraints. Note that CQA techniques systematically drop all information involved in a constraint violation, whereas repairing techniques, like ours, make explicit decisions on what to keep and what to drop, in accordance with the principles set out in [ 2 ], that require preserving as much information as possible.

Di erent semantics have been studied for the repairing of inconsistent relational databases, considering di erent kinds of constraints. For example, [ 9 ] studied the problem of repairing by allowing only tuple deletions and, in this way, resolving violations of denial constraints and inclusion dependencies, which is a more expressive set of constraints than the one we consider in this work. However, as proven in [ 9 ], the unrestricted combination of those constraints leads to intractability issues.

In the context of linked data and the corresponding languages and technologies, there has been research on the topic of using ontological languages to encode integrity constraints (ICs) that must be checked over a dataset. In [ 22 ], the authors present a way to integrate ICs in OWL and they show that IC validation can be reduced to query answering, for integrity constraints that fall into the SROI DL fragment. A similar approach has been followed in [ 19 ]. In [ 16 ], the presented approach integrates constraints that come from the relational world (primary-key, foreign-key) into RDF and provides a way to validate these constraints. IC validation is also an important part of some of the current OWL reasoners, such as Stardog10. The above approaches address, essentially, only 10 http://stardog.com/ the KB satis ability problem and do not consider detection and repairing of invalidities.

In the eld of diagnosis for DL-Lite KBs, there has been some work regarding inconsistency checking. The DL-LiteA reasoner QuOnto [ 1 ] has the ability to check the satis ability of a DL-LiteA KB. However, it does not detect the invalid data assertions in the ABox, neither repairs it. A problem very similar to repairing (but in a di erent setting) is addressed in the context of DL-Lite KB evolution (e.g., [ 7 ], [ 21 ]), where the objective is to identify the minimal set of assertions to remove in order to render a DL-Lite KB consistent during evolution.

Recently, there has also been some research on CQA for inconsistent knowledge bases expressed in Description Logic languages, using query rewriting techniques. For example, [ 17 ] deals with di erent variants of inconsistency-tolerant semantics to reach a good compromise between expressive power of the semantics and computational complexity of inconsistency-tolerant query answering.

Finally, [ 18 ] is (to our knowledge) the only work addressing the automatic repairing of an inconsistent DL-LiteA KB, and thus the closest to our work. It is based on the inconsistency-tolerant semantics studied in [ 17 ] and resolves each invalidity by removing both data assertions that take part in it. On the contrary, our repairing algorithm considers the removal of only one of two involved data assertions. Thus, [ 18 ] removes more information than necessary from the original KB. In addition, the work of [ 18 ] has only been evaluated with datasets that are unrealistically small (up to 30.000 triples). 7

Conclusion and Future Work

We presented a novel, fully automatic and modular diagnosis and repairing framework, which can be used on top of already deployed datasets to assist the curators in the task of enforcing the validity of logical integrity constraints, taking into account logical inference, in order to maintaining their consistency. Our experimental evaluation showed that our framework is scalable for large dataset sizes, often found in real reference linked datasets such as DBpedia.

As future work, we will try to improve the scalability properties of our algorithms, possibly using a parallel implementation relying on the MapReduce model. In addition, we will consider di erent models of interaction with the curator, to allow him to in uence the repairing process (e.g., via user guidelines or preferences) without being overwhelmed with the complete set of invalidities; the ultimate goal is to develop an interactive repairing process that will combine the quality of manual curation with the e ciency of automatic repairing. Another possible extension is to experiment with more LOD datasets, and provide a comprehensive study of the number and types of violations that exist in di erent popular datasets.

Acknowledgement. This work was partially supported by the EU project DIACHRON (ICT-2011.4.3, #601043), and by the State Scholarships Foundation.

1. Acciarri , A. , Calvanese , D. , De Giacomo , G. , Lembo , D. , Lenzerini , M. , Palmieri , M. , Rosati , R.: QUONTO: querying ontologies . In: AAAI. Volume 5 . ( 2005 )

2. Afrati , F. , Kolaitis , P. : Repair checking in inconsistent databases: algorithms and complexity . In: ICDT. ( 2009 )

3. Arenas , M. , Bertossi , L. , Chomicki , J.: Consistent query answers in inconsistent databases . In: PODS . ( 1999 )

4. Calvanese , D. , De Giacomo , G. , Lembo , D. , Lenzerini , M. , Poggi , A. , Rosati , R.: Linking data to ontologies: The Description Logic DL-LiteA . In: OWLED. ( 2006 )

5. Calvanese , D. , De Giacomo , G. , Lembo , D. , Lenzerini , M. , Rosati , R. : Tractable reasoning and e cient query answering in Description Logics: The DL-Lite family . Journal of Automated Reasoning . 39 ( 3 ). ( 2007 )

6. Calvanese , D. , De Giacomo , G. , Lembo , D. , Lenzerini , M. , Poggi , A. , RodriguezMuro , M. , Rosati , R.: Ontologies and databases: The DL-Lite approach . In: Reasoning Web . ( 2009 )

7. Calvanese , D. , Kharlamov , E. , Nutt , W. , Zheleznyakov , D. : Evolution of DL-Lite Knowledge Bases . In: ISWC. ( 2010 )

8. Chomicki , J. , Marcinkowski , J.: On the computational complexity of minimalchange integrity maintenance in relational databases . Inconsist. Toler . ( 2005 )

9. Chomicki , J. , Marcinkowski , J.: Minimal-change integrity maintenance using tuple deletions . Information and Computation . 197 ( 1 ). ( 2005 )

10. Deutsch , A. : FOL modeling of integrity constraints (dependencies) . In: Encyclopedia of Database Systems . Springer. ( 2009 )

11. Garey , M. , Johnson , D.: Computers and intractability . Vol. 174 . Freeman . ( 1979 )

12. Greco , G. , Greco , S. , Zumpano , E.: A logical framework for querying and repairing inconsistent databases . IEEE TKDE . 15 ( 6 ). ( 2003 )

13. Grieco , L. , Lembo , D. , Rosati , R. , Ruzzi , M. : Consistent query answering under key and exclusion dependencies: Algorithms and experiments . In: CIKM. ( 2005 )

14. Karakostas , G.: A better approximation ratio for the vertex cover problem . In: Automata, Languages and Programming . ( 2005 )

15. Kontokostas , D. , Westphal , P. , Auer , S. , Hellmann , S. , Lehmann , J. , Cornelissen , R.: Databugger: a test-driven framework for debugging the web of data . In: WWW (Companion

Volume).

( 2014 )

16. Lausen , G. , Meier , M. , Schmidt , M.: SPARQLing constraints for RDF . In: EDBT. ( 2008 )

17. Lembo , D. , Lenzerini , M. , Rosati , R. , Ruzzi , M. , Savo , D. : Query rewriting for inconsistent DL-Lite ontologies . In: Web Reasoning and Rule Systems . ( 2011 )

18. Masotti , G. , Rosati , R. , Ruzzi , M. : Practical ABox cleaning in DL-Lite (progress report) . In: Description Logics . ( 2011 )

19. Motik , B. , Horrocks , I. , Sattler , U. : Bridging the gap between OWL and relational databases . In: WWW . ( 2007 )

20. Papadimitriou , C. , Steiglitz , K. : Combinatorial optimization: algorithms and complexity . Courier Dover Publications . ( 1998 )

21. Qi , G. , Wang , Z. , Wang , K. , Fu , X. , Zhuang , Z. : Approximating model-based ABox revision in DL-Lite: Theory and practice . In: AAAI. ( 2015 )

22. Tao , J. , Sirin , E. , Bao , J. , McGuinness , D. : Integrity constraints in OWL . In: AAAI. ( 2010 )

23. Wijsen , J.: Consistent query answering under primary keys: a characterization of tractable queries . In: ICDT . ( 2009 )