Automating OAEI Campaigns (First Report)

Cássia Trojahn (INRIA & LIG, Grenoble, France), Christian Meilicke (University of Mannheim, Mannheim, Germany), Jérôme Euzenat (INRIA & LIG, Grenoble, France), Heiner Stuckenschmidt (University of Mannheim, Mannheim, Germany)

Abstract. This paper reports on the first effort to integrate the OAEI and SEALS evaluation campaigns. OAEI is an annual evaluation campaign for ontology matching systems. The 2010 campaign includes a new modality in coordination with the SEALS project. This project aims at providing standardized resources (software components and data sets) for automatically executing evaluations of typical semantic web tools, including ontology matching tools. A first version of the software infrastructure is based on a web service interface wrapping the functionality of the matching tool to be evaluated. In this setting, the evaluation results can be visualized and manipulated immediately in a direct feedback cycle. We describe how parts of the OAEI 2010 evaluation campaign have been integrated into the SEALS software infrastructure. In particular, we discuss technical and organizational aspects related to the use of the new technology for both participants and organizers of the OAEI.

1 Introduction

The Ontology Alignment Evaluation Initiative (OAEI) is a coordinated international initiative that organizes the evaluation of ontology matching systems [6]. The main goal of OAEI is to compare systems and algorithms on the same basis and to allow anyone to draw conclusions about the best matching strategies. The ambition is that, from such evaluations, tool developers can learn and improve their systems. The annual OAEI campaign provides the evaluation of matching systems on consensus test cases, which are organized by different groups of researchers. OAEI evaluations have been carried out since 2004.
Although OAEI (http://oaei.ontologymatching.org) has been the basis for ontology matching evaluation over the last years, additional efforts have to be made in order to catch up with the growth of ontology matching technology, especially in two main directions: large scale evaluation and automation of the evaluation process. The SEALS project (Semantic Evaluation at Large Scale, http://about.seals-project.eu/) aims at providing standardized data sets, evaluation campaigns for typical semantic web tools and, in particular, a software infrastructure for automatically executing evaluations. Ontology matching is one of the five semantic areas covered by SEALS. The SEALS infrastructure will allow developers to run their tools on an execution environment, both in the context of an evaluation campaign and on their own for a formative evaluation of their tool versions.

OAEI and SEALS are closely coordinated, and the plan is to progressively integrate the SEALS infrastructure within the OAEI campaigns. The 2010 OAEI campaign is the first effort in this direction. A subset of the OAEI tracks has been included in this new modality. Participants are invited to extend a web service interface (see http://alignapi.gforge.inria.fr/tutorial/tutorial5/) and deploy their matchers as web services, which are accessed in an evaluation experiment. This setting enables participants to debug their systems, run their own evaluations and manipulate the results immediately in a direct feedback cycle. On the other hand, runtime and memory consumption cannot be correctly measured because a controlled execution environment is missing. Further versions of the SEALS infrastructure will include the deployment of tools in such a controlled environment. In this paper, we report the first efforts on integrating OAEI campaigns into the SEALS infrastructure.

Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST 2010). Shanghai, China. November 8, 2010.
We describe the preparation of the evaluation campaign and comment on how the evaluation itself is conducted, taking into account its partial automation. Furthermore, we present the SEALS evaluation service that we have developed for this purpose and illustrate how it has been used with a concrete example.

The rest of the paper is structured as follows. We first present the evaluation design of the 2010 evaluation campaign (§2), commenting on the evaluation workflows, data sets, and criteria and metrics specified for this evaluation. Secondly, we detail how the designed evaluation is being conducted (§3). Thirdly, we give an overview of the main software components of the SEALS infrastructure (§4), together with an example of running an evaluation (§5). Preliminary results of the campaign are then presented (§6). Finally, we comment on the lessons learned (§7) and conclude the paper (§8).

2 Evaluation Design

The design of an evaluation campaign is conducted prior to the execution of the campaign itself. It involves specifying the data sets, criteria and metrics to be considered in the campaign, as well as how the several components (matchers, test providers, evaluators, etc.) interact in an evaluation experiment, i.e., the evaluation workflow.

2.1 Evaluation workflow

An alignment can be characterized as a set of pairs of entities (e and e′), coming from the ontologies to be aligned (o and o′), related by a particular relation (r), together with some confidence measure (n) expressing a degree of trust in the fact that the relation holds [2, 1, 3]. From this characterization it is possible to ask any alignment method to output an alignment, given: (i) the two ontologies to be aligned; (ii) a partial input alignment (possibly empty); and (iii) a characterization of the wanted alignment (e.g., one-to-one vs. many-to-many alignments).
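This characterization translates directly into a data structure. The sketch below is ours, for illustration only; the names are not part of any OAEI or Alignment API specification:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correspondence:
    """One element of an alignment: entities e and e' from the ontologies
    o and o', a relation r, and a confidence n expressing trust that the
    relation holds."""
    entity1: str       # URI of e in o
    entity2: str       # URI of e' in o'
    relation: str      # e.g. "=" for equivalence
    confidence: float  # degree of trust, typically in [0, 1]

# An alignment is then simply a set of correspondences.
alignment = {
    Correspondence("http://o1#Paper", "http://o2#Article", "=", 0.92),
    Correspondence("http://o1#Author", "http://o2#Writer", "=", 0.75),
}
```

Representing alignments as sets of such tuples is what makes set-based compliance measures (precision, recall) directly applicable.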
The quality of the generated alignment can be assessed with regard to different criteria. Figure 1 shows the evaluation workflow representing an OAEI evaluation experiment, in which several matchers are evaluated. The first step of the workflow is to retrieve from a database the test cases to be considered in the evaluation, where each test case consists of the two ontologies to be matched and the corresponding reference alignment. Next, each matching system performs the matching, taking as input the two ontologies o and o′, and generates the alignment A using a certain set of resources and parameters. An evaluation component receives this alignment and computes a (set of) quality measure(s) m – typically precision and recall – by comparing it to the reference alignment R. Finally, each result interpretation is stored into the result database.

Fig. 1. OAEI typical evaluation workflow.

This workflow represents a typical OAEI evaluation workflow. However, for some data sets, which have no complete reference alignments, extensions of this typical workflow have been designed. Usually, in such cases, the user takes the role of evaluator, and alternative approaches, such as manual labeling, data mining and logical reasoning, are applied to support the evaluation task. For instance, according to Figure 1, for each test case, the available matchers are executed and their generated alignments (A) are stored into the database. This content is then used later by data mining techniques, whose results are finally analysed by the user.

In previous OAEI campaigns, this workflow has been realized as follows: the required test cases have been made available to the participants for download and subsequently used by the participants to generate the matching results with their tool.
The results have then been submitted to the OAEI organizers, who used evaluation scripts to apply measures and store the results.

2.2 OAEI data sets

OAEI data sets have been extended and improved over the years. In the OAEI 2010 campaign, the following tracks and data sets have been selected:

The benchmark test aims at identifying the areas in which each matching algorithm is strong or weak. The test is based on one particular ontology dedicated to the very narrow domain of bibliography and a number of alternative ontologies of the same domain for which alignments are provided.

The anatomy test is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy. Its reference alignment has been generated by domain experts.

The conference test consists of a collection of ontologies describing the domain of organising conferences. Reference alignments are available for a subset of the test cases.

The directories and thesauri test cases propose web directories (matching website directories like the Open Directory or Yahoo's), thesauri (three large SKOS subject heading lists for libraries) and generally less expressive resources.

The instance matching test cases aim at evaluating tools able to identify similar instances among different data sets. This track features web data sets, as well as a generated benchmark.

Anatomy, Benchmark and Conference have been included in the SEALS evaluation modality. The reason for this is twofold: on the one hand, these data sets are well known to the organizers and have been used in many evaluations, contrary to, for instance, the test cases of the instance data sets. On the other hand, these data sets come with high quality reference alignments, which allows for computing compliance-based measures such as precision and recall.
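Over such test cases, the workflow of §2.1 amounts to a loop over test cases and matchers. The following is a minimal sketch under our own naming; the actual SEALS components are web services, not Python functions:

```python
def run_campaign(test_cases, matchers, evaluate, store):
    """For each test case (o, o', R) and each matcher, compute the
    alignment A = match(o, o') and the measures m = evaluate(A, R),
    then record the result (cf. Figure 1)."""
    for source, target, reference in test_cases:
        for name, match in matchers.items():
            alignment = match(source, target)
            measures = evaluate(alignment, reference)
            store(name, source, target, measures)

# Toy run with set-based alignments and a single illustrative "matcher".
results = []
run_campaign(
    test_cases=[("o1", "o2", {("Paper", "Article")})],
    matchers={"toy": lambda s, t: {("Paper", "Article"), ("Topic", "Subject")}},
    evaluate=lambda a, r: len(a & r) / len(a),   # precision only, for brevity
    store=lambda n, s, t, m: results.append((n, m)),
)
```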
2.3 Evaluation criteria and metrics

The diverse nature of OAEI data sets, especially in terms of the complexity of test cases and the presence/absence of (complete) reference alignments, requires the use of different evaluation measures. For the three data sets in the SEALS modality, the compliance of matcher alignments with respect to the reference alignments is evaluated. In the case of Conference, where the reference alignment is available only for a subset of test cases, compliance is measured over this subset. The most relevant measures are precision (true positives/retrieved), recall (true positives/expected) and f-measure (an aggregation of precision and recall). These metrics are also partially considered or approximated for the other data sets, which are not included in the SEALS modality (standard modality).

For Conference, alternative evaluation approaches have been applied. These approaches include manual labeling, alignment coherence [7] and correspondence pattern mining. They require a deeper analysis by experts than traditional compliance measures. For the first version of the evaluation service, we concentrate on the most important compliance-based measures because they do not require a complementary step of analysis/interpretation by experts, which is mostly performed manually and outside an automatic evaluation cycle. However, such approaches will be progressively integrated into the SEALS infrastructure. Nevertheless, for 2010, the generated alignments are stored in the results database (as detailed in §4) and can easily be retrieved by the organizers. It is thus still possible to exploit alternative evaluation techniques subsequently, as has been done in previous OAEI campaigns.

All the criteria above are about alignment quality. A useful comparison between systems also includes their efficiency, in terms of runtime and memory consumption.
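Concretely, the compliance measures above, over an alignment A and a reference R taken as sets of correspondences, can be computed as follows. This is a generic sketch of the standard definitions, not the code of the SEALS evaluation service:

```python
def compliance(found, reference, beta=1.0):
    """Precision = |A ∩ R| / |A|, recall = |A ∩ R| / |R|, and f-measure
    as the (weighted) harmonic mean of precision and recall."""
    tp = len(found & reference)  # true positives: found and expected
    p = tp / len(found) if found else 0.0
    r = tp / len(reference) if reference else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0
    return p, r, f

# A = {c1, c2, c3}, R = {c2, c3, c4}: 2 true positives out of 3 retrieved
# and 3 expected, hence p = r = f = 2/3.
p, r, f = compliance({"c1", "c2", "c3"}, {"c2", "c3", "c4"})
```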
The best way to measure efficiency is to run all systems under the same controlled evaluation environment. In previous OAEI campaigns, participants have been asked to run their systems on their own and to report the elapsed time for performing the matching task. Using the web based evaluation service, runtime cannot be correctly measured due to the fact that the systems run in different execution environments and, as they are exposed as web services, there are potential network delays.

3 Evaluation Process

Once the evaluation design has been specified, the evaluation campaign takes place in four main phases:

Preparatory phase: ontologies and alignments are provided to participants, who have the opportunity to send observations, bug corrections, remarks and other test cases;

Preliminary testing phase: participants ensure that their systems can load the ontologies to be aligned and generate the alignment in the correct format (the Alignment API format [3]);

Execution phase: participants use their algorithms to automatically match the ontologies;

Evaluation phase: the alignments provided by the participants are evaluated and compared.

The four phases are the same for both the standard and the SEALS modality. However, different tasks are required from the participants of each modality. In the preparatory phase, the data sets have been published on web sites and could be downloaded as zip files. In the future, it will be possible to use the SEALS portal to upload and describe new data sets. In addition, the test data repository supports versioning, which is an important issue regarding the bug fixes and improvements that have taken place over the years. In the phase of preliminary testing, the SEALS evaluation service pays off in terms of reduced effort.
In past years, participants submitted their preliminary results to the organizers, who analyzed them semi-automatically, often detecting problems related to the format or to the naming of the required result files. These problems had to be discussed with the participants via a time-consuming communication process. It is now possible to check these and related issues automatically (as detailed in §5).

In the execution phase, standard OAEI participants run their tools on their own machines and submit the results via mail to the organizers, while SEALS participants run their tools via web service interfaces. They get direct feedback on the results and can also discuss and analyse this feedback in their results paper (each participant in the OAEI, independently of the modality, has to write a paper that contains a system description and an analysis of the results from the point of view of the system developer). In the past, for many of the data sets, organizers could not deliver results to participants prior to the hard deadlines.

Finally, in the evaluation phase, organizers are in charge of evaluating the received alignments. For the SEALS modality, this effort has been minimized due to the fact that the results are automatically computed by the services in the infrastructure, as detailed in the next section.

4 Evaluation Service Architecture

The evaluation service is composed of three main components: a web user interface, a BPEL workflow and a set of web services. The web user interface is the entry point to the application. This interface is deployed as a web application in a Tomcat application server behind an Apache web server. It invokes the BPEL workflow, which is executed on the ODE engine (http://ode.apache.org/). This engine runs as a web application inside the application server.
The BPEL process accesses several services that provide different functionalities:

– The validation service ensures that (a) the matcher web service specified via its endpoint (URL) is available; (b) this service correctly implements the interface we have specified; and (c) the matcher generates an alignment in the correct format (the validation service uses two simple ontologies in order to test whether the matcher generates alignments in the correct format). If this is not the case, an error message is output to the user. This validation is done prior to any evaluation.

– The redirect service is used to redirect the request for running a matching task to the matcher service endpoint.

– The test iterator service is responsible for iterating over test cases and providing a reference to the required files. These files are the source ontology, the target ontology and the reference alignment. All the operations of this service make use of the SEALS test data repository.

– The evaluation service computes measures such as precision and recall for evaluating the alignments generated by the matching system.

– The result service is used for storing evaluation results in a relational database.

The user can start an evaluation by specifying the web service endpoint via the web interface. This data is then forwarded to BPEL as an input parameter. The complete evaluation workflow is executed as a series of calls to the services listed above. The specification of the web service endpoint is relevant for the invocation of the validation and redirect services: they internally implement web service clients that connect to the URL specified in the web user interface. The test and result services require access to additional data resources.
For test data, the test web service accesses the SEALS repository, extracts the relevant information and forwards the URLs of the required documents (source and target ontologies and reference alignment) via the redirect service to the matcher currently being evaluated. The result web service uses a connection to the database to store the results of each execution of an evaluation workflow. For visualizing and manipulating the stored results, an OLAP (Online Analytical Processing) application is available. Results can be re-accessed at any time, e.g., for comparing different tool versions against each other.

5 Running an Evaluation

To illustrate a complete evaluation cycle, we have extended the Anchor-Flood system [8] with the web service interface (available at http://mindblast.informatik.uni-mannheim.de:8080/sealstools/aflood/matcherWS?wsdl). This system has participated in the two previous OAEI campaigns and is thus a typical evaluation target. The current version of the web application described in the following is available at http://seals.inrialpes.fr/platform/. In order to start an evaluation, one must specify the URL of the matcher service, the class implementing the required interface and the name of the matching system to be evaluated (Figure 2). Three of the OAEI data sets have been selected, namely Anatomy, Benchmark and Conference. In this example, we have used the conference test case.

Upon submission of the form data, the BPEL workflow is invoked. It first validates the specified web service as well as its output format. In case of a problem, the concrete validation error is displayed to the user as direct feedback. In case of a successfully completed validation, the system returns a confirmation message and continues with the evaluation process. Every time an evaluation is conducted, the results are stored under the endpoint address of the deployed matcher (Figure 3). The results are displayed as a table (Figure 4) when clicking on one of the three evaluation IDs in Figure 3.
The results table is (partially) available while the evaluation itself is still running. By reloading the page from time to time, users can follow the progress of a running evaluation. In the results table, precision and recall are listed for each test case. Moreover, a detailed view on the alignment results is available (Figure 5) when clicking on the alignment icon in Figure 4. This detailed view lists the correspondences that (a) have been generated and are in the reference alignment (true positives), (b) have been generated but are not in the reference alignment (false positives), and (c) have not been generated but are in the reference alignment (false negatives).

Fig. 2. Specifying a matcher endpoint as evaluation target.

Fig. 3. Listing of available evaluation results.

The user can visualize the results in an OLAP application by clicking on the plot figure in Figure 3, in a way similar to what is shown in Figure 6, but in a setting where only the results of his/her system are shown. Furthermore, organizers have a similar tool for accessing the results registered for the campaign as well as all evaluations carried out in the evaluation service (even the evaluations executed for testing purposes).

6 Preliminary Results

The OAEI 2010 campaign counted 15 participants [5] (16 participants in 2009 [4]). Regarding the SEALS tracks, 11 participants have registered their results for Benchmark, 9 for Anatomy and 8 for Conference. Some participants in Benchmark have not participated in Anatomy or Conference and vice versa.

Fig. 4. Display of the results of an evaluation.

Fig. 5. Detailed view on an alignment.

Figure 6 shows some preliminary results. The values of precision, recall and f-measure are the average of the results for all test cases in each track.
For the benchmark track, two systems are ahead: ASMOV and RiMOM, with AgrMaker as a close follower, while SOBOM, GeRMeSMB and Ef2Match, respectively, achieve intermediary values of precision and recall. For anatomy, AgrMaker has generated the best alignments with respect to f-measure. This system is followed by three participants (Ef2Match, NBJLM and SOBOM) that share very similar characteristics regarding precision and recall. Finally, for conference, the matcher with the highest average f-measure was CODI, with Falcon, Ef2Match and ASMOV as followers. There is no unique set of systems ahead for all three tracks, which clearly demonstrates that systems exploiting different features of ontologies perform according to the features of each test case.

Fig. 6. Using OLAP for results visualization.

These preliminary results will be discussed at the fifth Ontology Matching Workshop, collocated with ISWC in Shanghai, China (http://om2010.ontologymatching.org), and they can be found at http://oaei.ontologymatching.org/2010/results/. The complete analysis of the results will be available soon after the workshop on its web site.

7 Lessons Learned

The new technology introduced in the OAEI affected both tool developers and organizers to a large degree. In the following we highlight some of the outcomes and describe the lessons we have learned from the experience gained so far.

As already argued, implementing the web service interface requires some effort on the side of the tool developer. We stayed in contact with some of the tool developers during this process and observed that the time required for implementing the interface varied between several hours and several days, depending on the technical skills of each developer. We also observed that the first version of the provided tutorial contained some unclear information, resulting in problems for some participants. Based on the feedback of the developers, we have improved the tutorial.
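Schematically, the interface a participant wraps around a tool reduces to a single matching operation. The sketch below is purely illustrative: the real interface is a Java web service described in the Alignment API tutorial, with different names and types:

```python
from abc import ABC, abstractmethod

class MatcherService(ABC):
    """Hypothetical rendering of the operation a wrapped matcher exposes."""

    @abstractmethod
    def align(self, source_url: str, target_url: str) -> str:
        """Match the ontologies at the two URLs and return the alignment
        serialized in the Alignment API (RDF/XML) format."""

class EmptyMatcher(MatcherService):
    # Toy implementation: always returns an empty alignment document.
    def align(self, source_url, target_url):
        return "<?xml version='1.0'?><Alignment/>"

result = EmptyMatcher().align("http://example.org/o1.owl",
                              "http://example.org/o2.owl")
```

Even such a minimal wrapper is enough for the validation service to check availability and output format, which is why the implementation effort remained a matter of hours to days.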
Another typical problem is related to the fact that some tool developers had only restricted access to a machine reachable from the Internet. These problems could finally be solved; however, system administrators of the particular company or research institute should be contacted early.

Once the technical problems had been solved, the evaluation service has been used extensively by some of the participants in the phase of preliminary testing. Obviously, the direct feedback of the evaluation service has supported the process of a formative evaluation well. Other participants used the service only for submitting their final results. Regarding the performance of the evaluation service, during the first weeks the runtime performance was suboptimal. We finally solved the underlying problems. These problems might have been the reason for some participants to abandon the use of the service during the first weeks. Once the problems had been solved, we contacted each participant in order to explain the problems, and they started to use the system again.

On the side of the organizers, the evaluation service reduced the effort of checking the formal correctness of the results to a large degree. In the past, it was required to communicate many of the problems in a time-consuming multilevel process. Typical examples are invalid XML, missing or incorrect namespace information, unsupported types of relations in generated alignments, incorrect directory structure and an incorrect naming style used for the submissions. All of these problems are now directly forwarded to the tool developer in an error message or in a preliminary result interpretation that does not fit with the expectations. Moreover, the organizers could analyse the results submitted so far at any time and had an overview of the participants using the system. However, while some analysis methods are already available, a number of specific services and operations are still missing.
The graphical support of the OLAP visualisation does not, for example, support the generation of the precision and recall graphs frequently used by OAEI organizers. In particular, evaluation and visualisation methods specific to ontology matching are not supported. However, most of these operations are already implemented in the Alignment API and will be made available in the future.

8 Final Remarks

This paper has reported the first efforts in integrating the SEALS evaluation service into OAEI evaluation campaigns. A preliminary version of this service has been exposed via a web service interface. For that reason, participants are asked to make their tools available as web services, which are accessed in the evaluation experiment. The resulting approach offers the minimal requirements needed to execute a complete evaluation cycle. The major benefit of this approach is to allow developers to debug their systems, run their own evaluations, and manipulate the results immediately in a direct feedback cycle. As a limitation, runtime and memory consumption cannot be correctly measured because there is no controlled execution environment. Another important drawback is related to the missing reproducibility of the generated results.

In a second development iteration, matching tools will be deployed and executed on the runtime environment of the SEALS infrastructure. This allows organizers to compare systems on the same basis, in particular in terms of runtime. It also solves the problem of reproducibility. This is also a test of the deployability of tools. The successful deployment relies on the Alignment API and requires additional information about how the tool can be executed in the platform and about its dependencies in terms of resources, e.g., installed databases or resources like WordNet.
For that reason, the challenging goal of the SEALS project can only be reached with the support of the matching community and depends highly on the acceptance of tool developers. We believe that an online evaluation service is a key component to raise this acceptance in the community.

Acknowledgements

The authors are partially supported by the SEALS project (IST-2009-238975).

References

1. P. Bouquet, M. Ehrig, J. Euzenat, E. Franconi, P. Hitzler, M. Krötzsch, L. Serafini, G. Stamou, Y. Sure, and S. Tessaris. Specification of a common framework for characterizing alignment. Deliverable D2.2.1, Knowledge Web NoE, 2004.

2. J. Euzenat. Towards composing and benchmarking ontology alignments. In Proc. ISWC Workshop on Semantic Integration, pages 165–166, Sanibel Island (FL US), 2003.

3. J. Euzenat. An API for ontology alignment. In Proc. 3rd International Semantic Web Conference (ISWC), volume 3298 of Lecture Notes in Computer Science, pages 698–712, Hiroshima (JP), 2004.

4. J. Euzenat, A. Ferrara, L. Hollink, A. Isaac, C. Joslyn, V. Malaisé, C. Meilicke, A. Nikolov, J. Pane, M. Sabou, F. Scharffe, P. Shvaiko, V. Spiliopoulos, H. Stuckenschmidt, O. Sváb-Zamazal, V. Svátek, C. Trojahn dos Santos, G. Vouros, and S. Wang. Results of the ontology alignment evaluation initiative 2009. In P. Shvaiko, J. Euzenat, F. Giunchiglia, H. Stuckenschmidt, N. Noy, and A. Rosenthal, editors, Proc. 4th ISWC workshop on ontology matching (OM), Chantilly (VA US), pages 73–126, 2009.

5. J. Euzenat, A. Ferrara, C. Meilicke, J. Pane, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Sváb-Zamazal, V. Svátek, and C. Trojahn dos Santos. Results of the ontology alignment evaluation initiative 2010. In P. Shvaiko, J. Euzenat, F. Giunchiglia, H. Stuckenschmidt, N. Noy, and A. Rosenthal, editors, Proc. 5th ISWC workshop on ontology matching (OM), Shanghai (China), pages 1–35, 2010.

6. J. Euzenat and P. Shvaiko. Ontology matching. Springer, Heidelberg (DE), 2007.

7. C. Meilicke and H. Stuckenschmidt. Incoherence as a basis for measuring the quality of ontology mappings. In Proc. of the ISWC 2008 Workshop on Ontology Matching, Karlsruhe, Germany, 2008.

8. H. Seddiqui and M. Aono. Anchor-Flood: results for OAEI 2009. In Proc. of the ISWC 2009 Workshop on Ontology Matching, Washington DC, USA, 2009.