=Paper=
{{Paper
|id=Vol-2456/paper1
|storemode=property
|title=Conference v3.0 : A Populated Version of the Conference Dataset
|pdfUrl=https://ceur-ws.org/Vol-2456/paper1.pdf
|volume=Vol-2456
|authors=Elodie Thiéblin,Cassia Trojahn
|dblpUrl=https://dblp.org/rec/conf/semweb/ThieblinT19
}}
==Conference v3.0 : A Populated Version of the Conference Dataset==
Elodie Thiéblin, Cassia Trojahn
Institut de Recherche Informatique de Toulouse, France
{firstname.lastname}@irit.fr

Abstract. The Conference dataset consists of independently designed ontologies in the domain of conference organisation, together with a subset of reference alignments between these ontologies. It has been widely used in ontology matching evaluation, in particular in the context of the Ontology Alignment Evaluation Initiative (OAEI). This dataset, however, is not equipped with instances, limiting its exploitation by matchers. This paper describes the methodology followed to populate a subset of the Conference dataset, both synthetically and with real data.

1 Introduction

Ontology matching is the task of generating alignments between the entities of different ontologies. Several matching approaches have been proposed in the literature [2] and their systematic evaluation has been carried out over the last fifteen years in the context of the Ontology Alignment Evaluation Initiative (OAEI)¹. One OAEI track offering expressive and real-world ontologies is the Conference track, whose dataset was proposed in [8]. This dataset consists of 16 independently designed ontologies of the conference organisation domain, together with a subset of 21 reference alignments between 7 of these ontologies. It has become one of the most widely used datasets in matching evaluation [7] and has been extended in different proposals [1, 3]. Recently, it has been extended with complex alignments [5].

The Conference dataset, however, is not equipped with instances, limiting the evaluation of matching approaches that rely on them. While in [4] a partially populated version of the dataset has been used to evaluate alignments on the query rewriting task, the resulting dataset is limited to the scope of the queries used in the evaluation (only ontology concepts corresponding to 18 queries). In this paper, a fully populated version of the dataset is proposed. We present the methodology followed to populate a subset of 5 Conference ontologies, both synthetically and with real data. It is based on the notion of competency questions for alignment (CQAs), which define the knowledge that needs to be covered (at best) by the ontologies and the alignment between them [6]. The use of CQAs ensures that the population is homogeneous across ontologies. Thanks to this dataset, it will be possible to automatise the evaluation of complex matchers using an evaluation strategy based on the comparison of instances in a query rewriting setting, rather than syntactically comparing complex correspondences to reference ones.

¹ http://oaei.ontologymatching.org/

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Overall methodology

The methodology followed for populating the dataset has the following main steps (an illustrative sketch of steps 3 to 5 is given after the list):

1. Create a set of CQAs based on an application scenario in order to guide the ontology interpretation by the experts. Examples of CQAs include: "What are the accepted papers?" (unary CQA) or "Which are the authors of accepted papers?" (binary CQA).

2. Create a pivot format (e.g., a JSON schema) covering the CQAs from step 1 (e.g., covering attributes describing specific types of objects, such as papers or people):

   { "id": "10",
     "title": "User-Centric Ontology Population",
     "authors": ["K. Clarkson", ...],
     "type": "Research track",
     "decision": "accept" }

3. For each ontology of the dataset, create SPARQL INSERT queries from the pivot format (an ontology may not cover the whole pivot format):

   INSERT DATA {
     {{pap}} a :Camera_ready_contribution .
     {{pap}} rdfs:label {{paptitle}} .
     {{pap}} :is_submitted_at {{conf}} .
     {{pap}} :has_authors {{auth}} .
     ...
   }

4. Instantiate the pivot format with real-life or synthetic data.

5. Populate the ontologies with the instantiated pivot format using the SPARQL INSERT queries.

6. Run a reasoner to verify the consistency of the populated ontologies. If an exception occurs, try to change the interpretation of the ontology and iterate over steps 3 to 5.
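To make steps 3 to 5 more concrete, the sketch below shows one way the population could be scripted with Python and rdflib. It is an illustrative approximation and not the authors' actual scripts (those are available in the repository referenced in Section 3): the namespace, the IRI scheme, the file layout and the exact INSERT template are assumptions, loosely modelled on the example query in step 3.

  # Illustrative sketch of steps 3 to 5 (not the authors' scripts): turn pivot-format
  # records into SPARQL INSERT DATA updates and apply them to one ontology.
  import json
  from rdflib import Graph

  # Step 3: a per-ontology INSERT template over the pivot format. The class and
  # property names mirror the example query above; the default prefix is a placeholder.
  INSERT_TEMPLATE = """
  PREFIX :     <http://example.org/onto#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  INSERT DATA {{
    <{pap}> a :Camera_ready_contribution ;
            rdfs:label "{title}" ;
            :is_submitted_at <{conf}> .
    {author_triples}
  }}
  """

  def populate(ontology_file, pivot_file, conf_iri):
      """Load an ontology, apply one INSERT per pivot record, return the populated graph."""
      g = Graph()
      g.parse(ontology_file)  # the original (unpopulated) ontology

      # Step 4: an instantiated pivot format, here one JSON object per paper.
      with open(pivot_file) as f:
          papers = json.load(f)

      # Step 5: run one SPARQL Update per pivot record.
      for paper in papers:
          pap_iri = f"http://example.org/paper/{paper['id']}"
          author_triples = "\n    ".join(
              f"<{pap_iri}> :has_authors <http://example.org/person/{a.replace(' ', '_')}> ."
              for a in paper["authors"]
          )
          g.update(INSERT_TEMPLATE.format(
              pap=pap_iri,
              title=paper["title"].replace('"', '\\"'),
              conf=conf_iri,
              author_triples=author_triples,
          ))
      return g

  # Step 6 (consistency checking with HermiT) would be run on the serialised result,
  # e.g. g.serialize(destination="populated.ttl", format="turtle") followed by Protégé,
  # or programmatically with owlready2's sync_reasoner().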
Clarkson", ...], "type": "Research track", "decision":"accept" } 3. For each ontology of the dataset, create SPARQL INSERT queries from the pivot format (here, an ontology may not cover the whole pivot format). INSERT DATA { {{pap}} a :Camera_ready_contribution. {{pap}} rdfs:label {{paptitle}}. {{pap}} :is_submitted_at {{conf}}. {{pap}} :has_authors {{auth}}. ... } 4. Instantiate the pivot format with real-life or synthetic data. 5. Populate the ontologies with the instantiated pivot format using the SPARQL IN- SERT queries. 6. Run a reasoner to verify the consistency of the populated ontologies. If an exception occurs, try to change the interpretation of the ontology and iterate over steps 3 to 5. 3 Populated dataset The methodology above has been followed to populate 5 ontologies from the Confer- ence data: cmt, conference (Sofsem), confOf (confTool), edas and ekaw (Table 1). This choice is motivated by the fact that these ontologies have been also the ones used in the complex version of this dataset. A total of 152 CQAs have been created by an ex- pert using as basis the ESWC 2018 conference scenario (whose data were fully open) and expanded by ontology exploration. The pivot format was first instantiated with data from the ESWC 2018 website and an automatic instantiation script of the pivot format was developed taking into account some statistics (e.g, proportion of members of the program committee author of articles, etc.). The dataset and instantiations of the pivot format have been made available2 . In addition to the ESWC 2018 dataset, 6 other datasets (with 25 artificial confer- enes) have been generated in order to cover the cases where ontologies share common 2 https://framagit.org/IRIT_UT2J/conference-dataset-population instances. In these artificial datasets, each ontology has been populated with 5 pivot instantiation data. In the “dataset 0%” all ontologies were populated with 5 different pivot format instantiations; in the “dataset 20%”, the ontologies were populated with 1 identical and 4 different instantiations; the other datasets (40%, 60%, 80%, and 100%) followed the same strategy. Since the size of each instantiation may differ, the percent- age of common instances between two ontologies varies. For example, in the dataset 20%, the instances Papers common to the ontologies represent between 7% instances of Papers of ekaw and 11% of instances of Papers of cmt. Table 1. Populated entities/total entities by ontology. Number of CQAs covered by each ontology. cmt conference confOf edas ekaw Classes 26 / 30 51 / 60 29 / 39 42 / 104 57 / 74 Obj. prop. 43 / 49 37 / 46 10 / 13 17 / 30 26 / 33 Data prop. 7 / 10 13 / 18 10 / 23 11 / 20 0 / 0 CQAs 46 90 67 60 84 4 Discussion Running the Hermit reasoner (step 6 of the methodology), several incoherences were encountered. For most of them, the problem was with the interpretation of the ontol- ogy. For example, in cmt, cmt:hasAuthor is functional; unlike primarily interpreted, this means that cmt:hasAuthor represents a “is first author of” relationship between a cmt:Paper and a cmt:Author. Hence, the SPARQL INSERT queries have been modified accordingly. We have also detected exceptions that could not be resolved by changing the interpretation. In that case, the original ontologies have been slightly modified. For instance, in cmt, the relation cmt:acceptPaper between an Administrator and a Paper was defined as functional and inverse functional. This leads to an inconsistency when a conference administrator accepts more than one paper. 
4 Discussion

When running the HermiT reasoner (step 6 of the methodology), several incoherences were encountered. For most of them, the problem was the interpretation of the ontology. For example, in cmt, cmt:hasAuthor is functional; contrary to our initial interpretation, this means that cmt:hasAuthor represents an "is first author of" relationship between a cmt:Paper and a cmt:Author. Hence, the SPARQL INSERT queries have been modified accordingly. We have also detected exceptions that could not be resolved by changing the interpretation. In that case, the original ontologies have been slightly modified. For instance, in cmt, the relation cmt:acceptPaper between an Administrator and a Paper was defined as both functional and inverse functional. This leads to an inconsistency when a conference administrator accepts more than one paper. cmt:acceptPaper has therefore been changed to be only inverse functional.

With respect to the CQAs, if a given CQA is not fully covered by an ontology, the ontology is not populated with the corresponding instances. This results in an uneven population of equivalent concepts. For example, ekaw and cmt both contain a Document class. However, the CQA "What are the documents?" is covered in ekaw by paper, review, web site and proceedings instantiations, as ekaw:Document has four subclasses (ekaw:Paper, ekaw:Review, ekaw:Web_Site and ekaw:Conference_Proceedings), whereas cmt:Document has only two subclasses (cmt:Paper and cmt:Review). We could also have populated each class with exactly the same instances, e.g., populating cmt:Document with all the Paper, Review, Web site and Conference proceedings instances, so that cmt:Document and ekaw:Document would share exactly the same instances. However, we chose to remain as close as possible to the original ontologies (the lack of a class in an ontology is due to the requirements of its creators). The instances therefore also reflect the conceptual mismatches between the ontologies.

In order to evaluate the dataset itself, we verified that two equivalent classes do not obtain a disjoint relation on the populated dataset. For that, we used the reference alignment ra1 from the original Conference dataset and modified it in order to take our interpretations into account. Then, the instances of the source and target members of each correspondence of the modified ra1 were compared: no correspondence turned out to have disjoint members. We have also calculated the intrinsic precision of reference alignments such as the simple alignment ra1 and the two complex ones of [5] (query rewriting and ontology merging) (Table 2). In a dataset where two common classes are either populated with the same instances, not populated, or share at least a subclass with the same instances, this metric gives a lower and an upper bound for the precision of the alignment. The lower bound is given by the classical score, in which only correspondences with equivalent members are considered correct. The upper bound is given by the not disjoint score, in which all correspondences with overlapping or empty members are considered correct.

Table 2. Results of comparing the instances related to the entities in the correspondences.

                     classical   recall oriented   precision oriented   overlap   not disjoint
  ra1                0.563       0.763             0.763                0.923     0.990
  Ontology merging   0.445       0.724             0.724                0.880     0.955
  Query rewriting    0.429       0.719             0.719                0.911     0.976
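Once each correspondence member has been evaluated against the populated dataset (e.g. as a SPARQL query returning its instances), the two bounds described above reduce to simple set comparisons. The sketch below illustrates only the classical and not disjoint scores; the recall-oriented, precision-oriented and overlap variants reported in Table 2 follow the definitions of [5] and are not reproduced here. The example data at the end is hypothetical.

  # Minimal sketch of the two precision bounds discussed above. Each correspondence
  # is represented by the instance sets of its source and target members.

  def classical_score(correspondences):
      """Lower bound: only members with identical instance sets count as correct."""
      correct = sum(1 for src, tgt in correspondences if src == tgt)
      return correct / len(correspondences)

  def not_disjoint_score(correspondences):
      """Upper bound: overlapping or empty members are also counted as correct."""
      correct = sum(1 for src, tgt in correspondences
                    if (src & tgt) or not src or not tgt)
      return correct / len(correspondences)

  # Hypothetical example: equal, overlapping and disjoint member instance sets.
  pairs = [({"p1", "p2"}, {"p1", "p2"}), ({"p1"}, {"p1", "p3"}), ({"p1"}, {"p4"})]
  print(classical_score(pairs), not_disjoint_score(pairs))  # ≈ 0.33 and ≈ 0.67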
5 Conclusion

This paper has presented a populated version of a subset of the Conference ontologies. This dataset will contribute to automatising the evaluation of complex matchers and to expanding the scope of its use in ontology matching in general. We plan to evaluate the behaviour of the approaches generating complex alignments under the different percentages of instance overlap, and to integrate this dataset into the evaluation of complex matchers in OAEI 2019.

References

1. M. Cheatham and P. Hitzler. Conference v2.0: An uncertain version of the OAEI Conference benchmark. In ISWC, pages 33–48, 2014.
2. J. Euzenat and P. Shvaiko. Ontology Matching. Springer Berlin Heidelberg, 2013.
3. C. Meilicke, R. Garcia-Castro, F. Freitas, W. R. van Hage, E. Montiel-Ponsoda, R. R. de Azevedo, H. Stuckenschmidt, O. Šváb-Zamazal, V. Svátek, A. Tamilin, C. Trojahn, and S. Wang. MultiFarm: A benchmark for multilingual ontology matching. JWS, 15:62–68, 2012.
4. A. Solimando, E. Jiménez-Ruiz, and C. Pinkel. Evaluating ontology alignment systems in query answering tasks. In ISWC Poster Track, pages 301–304, 2014.
5. É. Thiéblin, O. Haemmerlé, N. Hernandez, and C. Trojahn. Task-oriented complex ontology alignment: Two alignment evaluation sets. In ESWC, pages 655–670, 2018.
6. É. Thiéblin, O. Haemmerlé, and C. Trojahn. Complex matching based on competency questions for alignment: a first sketch. In OM@ISWC, 2018.
7. O. Zamazal and V. Svátek. The Ten-Year OntoFarm and its Fertilization within the OntoSphere. Web Semantics, 43:46–53, Mar. 2017.
8. O. Zamazal, V. Svátek, P. Berka, D. Rak, and P. Tomasek. OntoFarm: Towards an experimental collection of parallel ontologies. ISWC Poster Track, 2005.