OWL2 based Data Cleansing Using Conditional Exclusion Dependencies Olivier Curé1 , Chan Le Duc2 , Myriam Lamolle2 1 Université Paris-Est, LIGM, Marne-la-Vallée, France ocure@univ-mlv.fr 2 LIASD Université Paris 8 - IUT de Montreuil chan.leduc,myriam.lamolle@iut.univ-paris8.fr 1 Introduction Ontology-based Data Access (OBDA) [6] aims to provide access to heterogeneous data sources through a mediating ontology. Most research conducted in this area tackles ontology expressivity, computational efficiency of reasoning services and inferences associated to query answering. In this paper, we argue that data qual- ity and data cleansing are domains where OBDA could contribute in an efficient manner. That is we aim to prevent the execution of update operations which are corrupting database instances and to propose their (semi-)automatic cleansing. This can be performed by handling integrity constraints (ICs), especially those that can not be easily represented and processed in a strict relational context. In [3], we have already proposed a first solution based on extensions of standard dependencies, i.e. Conditional Functional and INclusion Dependencies (henceforth denoted CFDs and CINDs) in the context of OWL2. Based on ex- periments conducted on real world databases [2], we found out that a form of Conditional Exclusion Dependency (CED) may be relevant in capturing more real-life data inconsistencies. To the best of our knowledge, this work is a first approach to address data quality and cleansing problems using CEDs in either a strict relational or OBDA context. Intuitively, CEDs correspond to standard exclusion dependencies extended with a pattern tableau (inspired from tableau queries presented in [1]) containing variables and constant values which are part of the active domain of the database. The discovery of CEDs is a hard prob- lem since negative information are generally not stored in relational databases. In this work, we highlight that novel OWL2 constructs, e.g. negative property assertions, can assist our system in this discovery process. Moreover, the use of OWL2 reasoning services over ontologies containing concept and property ax- ioms enables to represent and process CEDs in a compact and efficient way using the formalism of SPARQL queries. In the rest of this paper, we consider that elements of a domain ontology are mapped to relations of a relational database. Due to space limitation, we do not present these mapping assertions on our running example. 2 Conditional Exclusion Dependencies An exclusion dependency (ED) [1] corresponds to forbidding the appearance of a given tuple in a relation S when a tuple appears in a relation R and is represented as the following axiom: ∀x, y, z, x0 , z 0 R(x, y, z) → ¬S(x0 , x, z 0 ). A conditional extension of an ED forbids the appearance of tuples in S when tuples satisfying a set of patterns appear in R. Formally, a CED φ, defined over a pair of relations R and S, is a pair (R(X; Xp ) ⊆ ¬S(Y ; Yp ), Tp ) where X, Xp and Y, Yp are attribute sets of respectively R and S. R(X) ⊆ ¬S(Y ) is a standard exclusion dependency and Tp is a tableau pattern of φ with attribute sets Xp and Yp such that for each pattern tp and each attribute B in Xp and Yp , tp [B] is either a constant in the domain of B or a wild card, denoted 0 0 . An instance (I1 , I2 ) of (R, S) satisfies a CED φ, denoted (I1 , I2 ) |= φ, iff for each tuple tp in Tp and for each t1 in I1 , if t1 [Xp ] = tp [Xp ] then there does not exist a tuple t2 in S such that t1 [X] = t2 [Y ] and t2 [Yp ] = tp [Yp ]. Example 1: We consider an extract of a medical database with the following relations: drug(idDrug, nameDrug, form), contraDrug(idDrug, idContra) and atcDrug(idDrug, atcCode) which respectively contain information con- cerning drug products (with an identifier, name and the form of the product, e.g. allopathy, homeopathy), contraindications of drug products (with a drug identifier and a contraindication identifier) and molecules of drugs identified by ATC3 codes. In this context, the following CEDs hold: φ1 : (drug(idDrug; f orm) ⊆ ¬contraDrug(idDrug; idContra), Tp1 ) with Tp1 as {(0 0 ;0 homeopathy 0 k0 0 ;0 0 ), (0 0 ;0 phytotherapy 0 k0 0 ;0 0 )} φ2 ; (atcDrug(idDrug; atcCode) ⊆ ¬contraDrug(idDrug; idContra), Tp2 ) where Tp2 is: {(0 0 ;0 R5DA90 k0 0 ;0 Anti-coughing0 )} These CEDs respectively state that homeopathy and phytotherapy drugs do not have contraindications and that the molecule identified with ’R5DA9’ must not be contraindicated to ’anti-coughing’. Note that these forms of CEDs, i.e. constants in the left hand side only for φ1 and constants on both sides in φ2 , correspond to the most widely encountered CEDs in studied use cases. 3 Discovery approaches OWL2 ontologies correspond to the SROIQ description logic [4] which allows for new role constructors such as composition, disjointness and negation. This enables to represent RBox axioms of the form R v ¬S where R and S are both DL roles. Note that this axiom corresponds to an exclusion dependency where the relations are necessarily binary and supports the discovery of EDs, i.e. CEDs with an empty pattern tableau. Such axioms are frequently encountered in role hierarchies. For instance, consider a property hasContraIndication with two subroles, hasDiseaseContraIndication and hasDrugContraIndication. Then it would be useful to state that these two subroles are disjoint. Note that class 3 Anatomical Therapeutic Chemical: http://www.whocc.no/atcddd/ disjointness, already available in the first version of OWL, can also be used to identify CEDs. In both cases, OBDA ’s mapping assertions need to be considered in order to maintain the data quality of underlying relational databases. Another form of CED related axioms found in OWL2 ontologies is supported by General Concept Inclusion (GCI) of the form: ∃R.C v ¬∃S.D where concept C corresponds to a nominal and D is either a nominal or the top concept (>). In the context of our running example, consider that correspondences between the atcDrug and contraDrug are defined with resp. hasATCCode and hasContraDrug and the hasForm property is mapped to the form attribute of the drug relation, then the following axioms correspond to resp. φ1 and φ2 : ∃f orm.{homeo} v ¬∃contraDrug.> ∃hasAT CCode.{R5DA9} v ¬∃contraDrug.{Anti-coughing}. Moreover, OWL2 ABoxes enable the definition of negative property assertions which together with property subsumption axioms enable to define CEDs. In an OBDA context, all the extensional data are stored in a (relational) database and serve to generate an ABox satisfying an ontology, e.g. by using a systems like QUONTO [6] or SOR [5]. Hence, end-users generally do not store asser- tions directly in the ABox. Nevertheless, such an approach could be useful to discover the pattern tableaux of our CEDs. These assertions could be defined as a complementary ABox and would mainly serve to store CEDs and enable some inferences. That is they would not be stored in the relational database generating the ABox. 4 Representation and Processing of CEDs We propose to represent the CEDs discovered using the formalism of SPARQL queries. This fits into the approach of [7] where the author argued that ICs are epistemic in nature and concern “what the knowledge base knows”. These queries aim to detect violations of CEDs and are generated by considering a CED has a graph over elements of the domain ontology. In this graph, the negated property is asserted to be true. Thus a translation of this graph into an SPARQL query enables to detect objects being violated. These objects are identified by the query’s distinguished variables which are selected using axioms of the domain ontology, e.g. domain and range of properties. They are working over the knowledge base underlying the application domain and which is mapped to the relational domain. Hence, using an approach similar to the QUONTO system, it is possible to translate these queries into SQL queries executed over relational databases. The SPARQL queries enabling to detect violations of φ1 and φ2 are respectively: SELECT ?x WHERE { ?x rdf:type :ATC. ?y rdf:type :Drug. ?y :hasATCCode ?x. ?y :form ’homeopathy’.} SELECT ?x WHERE { ?x rdf:type :Drug. ?y rdf:type :ATC. ?x :hasATCCode ?y. ?y :nameAtc ’R5DA9’. ?z rdf:type :Contra. ?x :hasContraDrug ?z. ?z :nameContra ’Anti-coughing’.} Moreover, in order to represent them in a compact way, we exploit and an- alyze the hierarchies of concepts present as constants in object properties used in CEDs. We provide an example of the use of such inferences with the ATC classification which divides drug molecules into different groups according to the organ or system on which they act and/or their therapeutic and chemical characteristics. The classification is organized in 5 levels with each level encoded using a letter or digits. For instance, the R5CA code subsumes 11 molecules, R5CA1 to R5CA11, which act as expectorants. Example 2: Consider CED φ2 with a pattern tableau Tp2 containing all 11 descendants of the R5CA code as Xp and with the ’expectorant’ constant in Yp . Then it will be much more compact to store one tuple with the ’R5CA’ code than 11 tuples containing its subsumed codes. The pattern would look like: (0 0 ;0 R5CA0 k0 0 ;0 Expectorant0 ). Note that in the medical domain, such gener- alizations frequently occur since molecules of a given family generally possess common properties. The main idea of this approach consists of generating a SPARQL query for each sub concept of the concept stored in a CED. The next step corresponds to the detection of a CED violation. Such detection is activated whenever a tuple of the data sources is updated, i.e. after the execution of a CRUD operation. In the context of a relational database, this can be handled by the definitions of SQL triggers. In fact, we automatically generate an AFTER/ROW LEVEL SQL trigger for each relation mapped to a property involved in a CED. These triggers call some generic programmed methods (in Java) defined in the framework of our data quality system. The purpose of these methods is to execute SPARQL queries and hence to discover and identify the data source tuples causing some inconsistencies. Note that we must associate a trigger to both the left and right hand side relations of a CED to ensure the consistency of data sources. Finally, we consider that the full potential of a data quality and cleansing implementation based on conditional dependencies lies in the study of possible interactions between discovered sets of CFDs, CINDs and CEDs. References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. 2. O. Curé. Improving the data quality of drug databases using conditional dependen- cies and ontologies. ACM Journal of Data and Information Quality (JDIQ). 3. O. Curé. Improving the data quality of relational databases using obda and owl2ql. In OWLED, 2009. 4. I. Horrocks, O. Kutz, and U. Sattler. The even more irresistible SROIQ. In KR, pages 57–67, 2006. 5. J. Lu, L. Ma, L. Z. 0007, J.-S. Brunner, C. Wang, Y. Pan, and Y. Yu. Sor: A practical system for ontology storage, reasoning and search. In VLDB, pages 1402– 1405, 2007. 6. A. Poggi, D. Lembo, D. Calvanese, G. D. Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. Data Semantics, 10:133–173, 2008. 7. R. Reiter. On integrity constraints. In Proc. TARK, 1988.