CANARD complex matching system: results of the 2018 OAEI evaluation campaign

Elodie Thiéblin, Ollivier Haemmerlé, Cassia Trojahn
IRIT & Université de Toulouse 2 Jean Jaurès, Toulouse, France
{firstname.lastname}@irit.fr

Abstract. This paper presents the results obtained by the CANARD system in the OAEI 2018 campaign. CANARD can produce complex alignments. This is the first participation of CANARD in the campaign. Even though the system was able to generate alignments for only one complex dataset (Taxon), the results are promising.

1 Presentation of the system

1.1 State, purpose, general statement

The CANARD (Complex Alignment Need and A-box based Relation Discovery) system discovers complex correspondences between populated ontologies based on Competency Questions for Alignment (CQAs). CQAs represent the knowledge needs of a user and define the scope of the alignment [3]. They are competency questions that need to be satisfied over two or more ontologies. Our approach takes as input a set of CQAs translated into SPARQL queries over the source ontology. The answer to each query is a set of instances retrieved from a knowledge base described by the source ontology. These instances are matched with those of a knowledge base described by the target ontology. The generation of the correspondence is performed by matching the graph pattern of the source query to the lexically similar surroundings of the target instances.

1.2 Specific techniques used

The CQAs taken as input by CANARD are limited to class expressions (interpreted as a set of instances). The approach proceeds in 11 steps, as depicted in Figure 1:

1. Extract the source DL formula e_s from the SPARQL CQA.
2. Extract lexical information from the CQA: L_s, the set of labels of the atoms of the DL formula.
3. Extract the source instances inst_s.
4. Find target instances inst_t that are equivalent or similar (same label) to the source instances inst_s.
5. Retrieve the description of the target instances: a set of triples and the object/subject type.
6. For each triple, retrieve L_t, the labels of its entities.
7. Compare L_s and L_t using a string comparison metric (e.g., Levenshtein distance with a threshold).
8. Keep the triples whose summed label similarity is above a threshold τ. Keep the object (resp. subject) type if its similarity is better than that of the object (resp. subject).
9. Express each kept triple as a DL formula e_t.
10. Aggregate the formulae e_t into an explicit or implicit form: if two DL formulae have a common atom in their right (target) member, the atoms which differ are put together.
11. Put e_s and e_t together in a correspondence (e_s ≡ e_t) and express this correspondence in EDOAL. The average string similarity between the aggregated formula and the CQA labels gives the confidence value of the correspondence.

Fig. 1: Schema of the general approach.

The instance matching phase (step 4) is based on existing owl:sameAs, skos:closeMatch and skos:exactMatch links, and on exact label matching. The similarity between the sets of labels L_s and L_t of step 7 is the sum of the string similarities over the cartesian product of L_s and L_t (equation 1):

  sim(L_s, L_t) = Σ_{l_s ∈ L_s} Σ_{l_t ∈ L_t} strSim(l_s, l_t)        (1)

strSim is the string similarity between two labels l_s and l_t (equation 2). τ is the threshold for the similarity measure; in our experiments, we empirically set τ = 0.5.

  strSim(l_s, l_t) = σ if σ > τ, 0 otherwise, where σ = 1 − levenshteinDist(l_s, l_t) / max(|l_s|, |l_t|)        (2)

The confidence value given to the final correspondence (step 11) is the similarity of the triple it comes from, or the average similarity if it comes from more than one triple.
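A minimal sketch of this similarity computation (equations 1 and 2), assuming a plain dynamic-programming Levenshtein distance and the τ = 0.5 threshold used in our experiments; function names are illustrative, not CANARD's actual code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[len(b)]

TAU = 0.5  # empirically chosen threshold (equation 2)

def str_sim(ls: str, lt: str) -> float:
    """strSim: normalised Levenshtein similarity, cut off below TAU."""
    sigma = 1 - levenshtein(ls, lt) / max(len(ls), len(lt))
    return sigma if sigma > TAU else 0.0

def sim(Ls, Lt) -> float:
    """sim: sum of strSim over the cartesian product of the two label sets."""
    return sum(str_sim(ls, lt) for ls in Ls for lt in Lt)
```

For example, `sim({"taxon"}, {"taxon", "taxum"})` adds a perfect match (1.0) to a two-edit match (σ = 1 − 2/5 = 0.6 > τ), giving 1.6.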
The confidence value is capped at 1 if it is initially greater than 1.

1.3 Adaptations made for the evaluation

Automatic generation of CQAs The CQAs cannot be given as input in the evaluation as none are available in the OAEI datasets. We developed a CQA generator that was integrated into the version of the system used in the evaluation. This generator produces two types of SPARQL queries: Classes and Property-Value pairs.

Classes For each owl:Class populated with at least one instance, a SPARQL query is created to retrieve all the instances of this class. If <C> is a populated class of the source ontology, the following query is created:

SELECT DISTINCT ?x WHERE {?x a <C>.}

Property-Value pairs Inspired by the approaches of [1,2,4], we create SPARQL queries of the form:
– SELECT DISTINCT ?x WHERE {?x <property> <o>.}
– SELECT DISTINCT ?x WHERE {<s> <property> ?x.}
– SELECT DISTINCT ?x WHERE {?x <property> "Value".}

These property-value pairs are computed as follows: for each property (object or data property), the numbers of distinct object and subject values are retrieved. If the ratio of these two numbers is over a threshold (arbitrarily set to 30) and the smallest number is below a threshold (arbitrarily set to 20), a query is created for each of these (fewer than 20) values. For example, if the property <property> has 300 different subject values and 3 different object values ("Value1", "Value2", "Value3"), then the ratio |subject|/|object| = 300/3 > 30 and |object| = 3 < 20. The 3 following queries are created as CQAs:
– SELECT DISTINCT ?x WHERE {?x <property> "Value1".}
– SELECT DISTINCT ?x WHERE {?x <property> "Value2".}
– SELECT DISTINCT ?x WHERE {?x <property> "Value3".}

The threshold on the smallest number ensures that the property-value pairs represent a category. The threshold on the ratio ensures that the properties represent categories rather than properties with few instantiations.

Implementation adaptations In the initial version of the system, Fuseki server endpoints are given as input.
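As a rough illustration of the property-value heuristic of the previous paragraphs, the following sketch replaces the triple-store access by an in-memory list of (subject, property, object) triples and only covers the object-side case of the paper's example (the subject-side case is symmetric); all names and thresholds' variable names are illustrative, not the actual CANARD code:

```python
RATIO_THRESHOLD = 30   # |subject| / |object| must exceed this (arbitrary, as in the paper)
COUNT_THRESHOLD = 20   # the smallest value count must stay below this

def property_value_cqas(triples):
    """For each property whose few distinct object values look like
    categories, emit one SELECT query per value."""
    subjects, objects = {}, {}
    for s, p, o in triples:
        subjects.setdefault(p, set()).add(s)
        objects.setdefault(p, set()).add(o)
    queries = []
    for p in subjects:
        n_subj, n_obj = len(subjects[p]), len(objects[p])
        if n_obj and n_subj / n_obj > RATIO_THRESHOLD and n_obj < COUNT_THRESHOLD:
            for value in sorted(objects[p]):
                queries.append(
                    'SELECT DISTINCT ?x WHERE {?x <%s> "%s".}' % (p, value))
    return queries
```

With 62 distinct subjects and 2 distinct object values for a property, the ratio 62/2 = 31 > 30 and 2 < 20, so one CQA per value is generated.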
For the SEALS evaluation, we embedded a Fuseki server inside the matcher. The ontologies are downloaded from the SEALS repository, then uploaded into the embedded Fuseki server before the matching process can start. This downloading-uploading phase may take time, in particular when dealing with large files. The CANARD system in the SEALS package is available at http://doi.org/10.6084/m9.figshare.7159760.v1. The generated alignments in EDOAL format are available at http://oaei.ontologymatching.org/2018/results/complex/taxon/CANARD.html (with a link for each pair of ontologies). Note that, as described below, CANARD was able to generate results for the Taxon track.

2 Results

The CANARD system could only output correspondences for the Taxon dataset of the Complex track. Indeed, the other datasets of this track do not contain instances, let alone common instances. Table 1 shows the run time of CANARD on all pairs of ontologies in the Taxon track, as well as the characteristics of the output alignments. As the alignment process is directional, we do not obtain symmetrical results for a pair of ontologies. CANARD is able to generate different kinds of correspondences: (1:1), (1:n) and (m:n). The best precision, 0.57, was obtained for the pair agronomicTaxon-agrovoc. CANARD did not output any correspondence for 4 oriented pairs (in grey in Table 1). These empty results may be due to the failure of the instance matching phase of our approach. We observed that with TaxRef as the source knowledge base, no correspondence could be generated. The exception is the pair taxref-agrovoc, where 8 correspondences were found, but only involving skos:exactMatch or skos:closeMatch properties in the constructions. The incorrect correspondences of this pair have a low confidence (between 0.05 and 0.30). Regarding the query rewriting task in Taxon, CANARD's alignment was used to rewrite the most queries (best qwr).
As CANARD does not deal with binary CQAs, none of the 3 binary queries × 12 pairs of ontologies = 36 binary query cases could be dealt with. Out of the 2 unary queries × 12 pairs = 24 unary query cases, CANARD could deal with 6 unary cases needing a complex correspondence and 2 needing simple correspondences, for a total of 8/24 = 33% of unary query cases. Overall, for the query cases needing complex correspondences, (0+6)/(28+16) ≈ 14% were covered by CANARD. Over all query cases, the CANARD system could provide an answer to 8/(36+24) ≈ 13% of the cases.

3 General comments

The CANARD approach relies on common instances between the ontologies to be aligned. Hence, when such instances are not available, as for the Conference, GeoLink and Hydrography datasets, the approach is not able to generate complex correspondences. Furthermore, CANARD is need-oriented and requires a set of competency questions to guide the matching process. Here, these "questions" have been automatically generated based on a set of patterns.

Test Case ID            Run Time (s)  output corres.  correct corres.  prec.  (1:1)  (1:n)  (m:n)
agronomicTaxon-agrovoc            37               7                4   0.57      0      7      0
agronomicTaxon-dbpedia            75              17                3   0.18      3     14      0
agronomicTaxon-taxref             87               9                3   0.33      1      8      0
agrovoc-agronomicTaxon            20               0                0    NaN      0      0      0
agrovoc-dbpedia                  128              13                3   0.23      0      0     13
agrovoc-taxref                    87               8                0      0      0      0      8
dbpedia-agronomicTaxon           556               0                0    NaN      0      0      0
dbpedia-agrovoc                  236              37                0      0      0     20     17
dbpedia-taxref                   333              43               14   0.33      0     17     26
taxref-agronomicTaxon            269               0                0    NaN      0      0      0
taxref-agrovoc                   283               8                0      0      0      0      8
taxref-dbpedia                   351               0                0    NaN      0      0      0
Global                          2468             142               27   0.20      4     66     72

Table 1: Results of CANARD on the Taxon track

The current version of the system is limited to finding complex correspondences involving classes; properties are not yet taken into account. We plan to extend the system to take binary relations into account in the next version. Another point that we would like to improve is the semantics of the confidence of the correspondences.
With respect to the technical environment, as mentioned before, the initial version of the system receives as input the endpoints of the populated ontologies. Under SEALS, the large ontologies are stored in repositories. Our system hence downloads them and stores them in an embedded Fuseki server. This configuration is not ideal, as we have to deal with large knowledge bases. Furthermore, we struggled with the SEALS dependencies in order to correctly package our system in the SEALS format. As we focus on user needs in order to avoid dealing with the whole alignment space, it would be interesting to have more need-oriented tasks with respect to alignment coverage.

4 Conclusions

This paper presented the adapted version of the CANARD system and its preliminary results in the OAEI 2018 campaign. This year, we participated only in the Taxon track, in which ontologies are populated with common instances. CANARD was the only system to output complex correspondences on the Taxon track.

Acknowledgements Cassia Trojahn has been partially supported by the French CIMI Labex project IBLiD (Integration of Big and Linked Data for On-Line Analytics).

References

1. Parundekar, R., Knoblock, C.A., Ambite, J.L.: Linking and building ontologies of linked data. In: ISWC. pp. 598–614. Springer (2010)
2. Parundekar, R., Knoblock, C.A., Ambite, J.L.: Discovering concept coverings in ontologies of linked data sources. In: ISWC. pp. 427–443. Springer (2012)
3. Thiéblin, E., Haemmerlé, O., Trojahn, C.: Complex matching based on competency questions for alignment: a first sketch. In: Ontology Matching Workshop. p. 5 (2018)
4. Walshe, B., Brennan, R., O'Sullivan, D.: Bayes-ReCCE: A Bayesian model for detecting restriction class correspondences in linked open data knowledge bases. International Journal on Semantic Web and Information Systems 12(2), 25–52 (2016)