ALIN Results for OAEI 2017

Jomar da Silva¹, Fernanda Araujo Baião¹, and Kate Revoredo¹

¹ Graduate Program in Informatics, Department of Applied Informatics
Federal University of the State of Rio de Janeiro (UNIRIO), Brazil
{jomar.silva, fernanda.baiao, katerevoredo}@uniriotec.br

Abstract. ALIN is an ontology alignment system specialized in the interactive alignment of ontologies. Its main characteristic is that it selects the correspondences to be shown to the expert according to the feedback the expert has already given. This selection is based on semantic and structural characteristics. ALIN obtained the alignment with the highest quality in the interactive track for the Conference data set. This paper describes its configuration for the OAEI 2017 competition and discusses its results.

Keywords: ontology matching, WordNet, interactive ontology matching, ontology alignment, interactive ontology alignment

1 Presentation of the system

A large number of data repositories have become available due to advances in information and communication technologies. Those repositories, however, are highly semantically heterogeneous, which hinders their integration. Ontology alignment has been successfully applied to solve this problem by discovering correspondences between two distinct ontologies which, in turn, conceptually define the data stored in each repository. Among the various ontology alignment approaches in the literature, interactive ontology alignment includes the participation of domain experts to improve the quality of the final alignment. This approach has proven more effective than non-interactive ontology alignment [1]. ALIN is an ontology alignment system specialized in interactive alignment.

1.1 State, purpose, general statement

ALIN is an ontology alignment system specialized in interactive ontology alignment. It is based primarily on linguistic matching techniques and uses WordNet as an external resource.
After generating an initial set of correspondences (called the set of candidate correspondences, that is, the correspondences selected to receive feedback from the expert), ALIN interacts with the expert, and at each interaction the set of candidate correspondences is modified. The set is modified using structural analysis of the ontologies and correspondence anti-patterns. The interactions continue until there are no more candidate correspondences left. ALIN was built with a special focus on the interactive matching track of OAEI 2017.

1.2 Specific techniques used

The ALIN algorithm is shown in Algorithm 1.

Algorithm 1 ALIN algorithm
Input: Two ontologies to be aligned
Output: Alignment between the two ontologies
 1: Load the ontologies
 2: Generate the initial set of candidate correspondences
 3: Automatically classify correspondences
 4: Remove correspondences with a low semantic similarity value
 5: while the set of candidate correspondences is not empty do
 6:   Choose correspondences to show to the expert
 7:   Receive expert feedback on the chosen correspondences and remove them from the set of candidate correspondences
 8:   Remove correspondences in a correspondence anti-pattern from the set of candidate correspondences
 9:   Insert some data property and object property correspondences into the set of candidate correspondences
10:   Insert some correspondences from the backup set into the set of candidate correspondences
11: end while

The steps of the ALIN algorithm are the following:

1. The ontologies are loaded, including their classes, object properties and data properties, through the Align API¹. For each entity, data such as its name and label are stored. For classes, their superclasses and disjunctions are saved. For object properties, the properties that are their hypernyms and their associated classes are saved. The classes of data properties are saved, too. ALIN does not use instances.
ALIN can only work with ontologies whose entity names are in English.

2. The initial set of candidate correspondences is generated with a stable marriage algorithm with incomplete preference lists whose maximum size is 1 [2]. Linguistic metrics are used to sort each preference list in decreasing order. With this algorithm, only correspondences whose first entity is in the list of the second entity, and vice versa, are selected. The linguistic metrics used are Jaccard, Jaro-Winkler and n-Gram [3], provided by the Simmetrics API², and Resnik, Jiang-Conrath and Lin [3], provided by the HESML API³, which uses WordNet. Using WordNet requires the canonical form of the entity names, which is obtained with the Stanford CoreNLP API⁴. The most frequent synsets of the words are used to calculate the semantic similarities; these synsets are found with the WS4J API⁵. The algorithm is run six times, once per metric, and the resulting set is the union of the results for each metric.

¹ Alignment API. Available at http://alignapi.gforge.inria.fr/ (last accessed on Oct 10, 2017).
² String Similarity Metrics for Information Integration. Available at http://www.coli.uni-saarland.de/courses/LT1/2011/slides/stringmetrics.pdf (last accessed on Oct 10, 2017).

3. The values of the similarity metrics (Resnik, Jiang-Conrath, Lin, Jaccard, Jaro-Winkler and n-Gram) range from 0 to 1 (1 is the maximum value). When all six metrics of a correspondence in the set of candidate correspondences have the maximum value, the correspondence is added to the final alignment and removed from the set of candidate correspondences. There are exceptions to this rule: correspondences that fall into certain structural patterns are neither put in the final alignment nor removed from the set of candidate correspondences.

4. Correspondences with one of their linguistic metrics below a given threshold are removed from the set of candidate correspondences.
These correspondences are put into a backup set and can return to the set of candidate correspondences through structural analysis. This technique is described in more detail in [4], with the difference that in [4], instead of applying a threshold, the correspondences whose classes were not in the same WordNet synset were removed.

5-11. At this point the interactions with the expert begin. The correspondences in the set of candidate correspondences are sorted by the sum of their similarity metric values, with the greatest sum first, and are shown to the expert. At first, the set of candidate correspondences contains only class correspondences. Each time the expert answers a question, the set of candidate correspondences is modified: depending on the answer, correspondences other than the one answered can be removed from the set, and new correspondences can be included. If the expert does not accept the correspondence, it is removed from the set of candidate correspondences. If the expert accepts the correspondence, it is removed from the set of candidate correspondences and put in the final alignment.

At each interaction with the expert:
- We remove from the set of candidate correspondences, and disregard, all correspondences that are in a correspondence anti-pattern [5] with the correspondences accepted by the expert;
- We insert into the set of candidate correspondences the data property and object property correspondences related to the class correspondences accepted by the expert;
- We insert into the set of candidate correspondences the correspondences of the backup set (step 4) whose entities are both subclasses of the classes of a correspondence accepted by the expert.

This step continues until the set of candidate correspondences is empty.

Detailed information about the ALIN system can be found in the master's thesis of Jomar da Silva⁶.

³ HESML. Available at https://www.researchgate.net/publication/313881253 HESML A scalable ontologybased semantic similarity measures library with a set of reproducible experiments and a replication dataset (last accessed on Oct 10, 2017).
⁴ Stanford CoreNLP. Available at http://stanfordnlp.github.io/CoreNLP/ (last accessed on Oct 10, 2017).
⁵ WS4J. Available at https://github.com/Sciss/ws4j (last accessed on Nov 08, 2017).

1.3 Link to the system and parameters file

ALIN is available through Google Drive (https://drive.google.com/open?id=1myVtcRoKKdUDHQTKNKsomna8AFbukanf) as a package to be run through the SEALS client.

2 Results

ALIN has been developed with a focus on interactive ontology alignment. The approach performs better when the number of data and object properties is proportionately large. ALIN considers the properties associated with corresponding classes when selecting entities for user feedback, which allows for increased recall. When the number of properties in the ontologies is small, the system still generates a very precise alignment, but its recall tends to decrease.

Another characteristic of ALIN is its reliance on an interactive phase. The non-interactive phase of the system is quite simple, based mainly on maximum string similarity. It aims at high precision without regard for recall, so the initial f-measure is low. The recall increases in the interactive phase.

Finally, ALIN is not robust to user errors. The system uses a number of techniques that take advantage of the expert feedback to reach further conclusions. When the expert gives a wrong answer, the error is propagated and generates other errors, thereby decreasing the f-measure.
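As an illustration of the non-interactive phase (steps 2 and 3 of Algorithm 1), the following Python sketch selects candidates by mutual best match, which is what a stable marriage with preference lists of size 1 appears to reduce to, and then separates the correspondences on which every metric reaches the maximum value. It is not ALIN's code: the two similarity functions are simple stand-ins for the Simmetrics and HESML metrics, the structural exceptions of step 3 are ignored, and all function names and entity names are hypothetical.

```python
import re

def tokens(name):
    """Split a camelCase or snake_case entity name into lowercase tokens."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*", name)}

def jaccard(a, b):
    """Jaccard similarity over name tokens (stand-in for a string metric)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def bigram(a, b):
    """Dice coefficient over character bigrams (stand-in for n-Gram)."""
    ga = {a.lower()[i:i + 2] for i in range(len(a) - 1)}
    gb = {b.lower()[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga and gb else 0.0

def mutual_best(src, tgt, sim):
    """Keep (s, t) only if each entity is the other's top-ranked choice."""
    best_t = {s: max(tgt, key=lambda t: sim(s, t)) for s in src}
    best_s = {t: max(src, key=lambda s: sim(s, t)) for t in tgt}
    return {(s, t) for s, t in best_t.items()
            if best_s[t] == s and sim(s, t) > 0.0}

def initial_candidates(src, tgt, metrics):
    """Run the selection once per metric and take the union (step 2)."""
    candidates = set()
    for sim in metrics:
        candidates |= mutual_best(src, tgt, sim)
    return candidates

def split_automatic(candidates, metrics):
    """Move pairs where every metric equals 1.0 to the alignment (step 3)."""
    auto = {c for c in candidates if all(m(*c) == 1.0 for m in metrics)}
    return auto, candidates - auto

# Hypothetical entity names from two small ontologies.
source = ["Paper", "Author", "Review"]
target = ["paper", "PaperAuthor", "Reviewer"]
metrics = [jaccard, bigram]

candidates = initial_candidates(source, target, metrics)
alignment, remaining = split_automatic(candidates, metrics)
print("accepted automatically:", sorted(alignment))
print("left for the expert:", sorted(remaining))
```

In this toy run only the pair on which both stand-in metrics agree perfectly is accepted without interaction, which mirrors the high-precision, low-recall behaviour described above; the other pairs would wait for expert feedback.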
2.1 Comments on the participation of ALIN in non-interactive tracks

As expected, the participation of ALIN in non-interactive alignment processes showed high precision and not so high recall. This can be seen for the Anatomy track⁷ in Table 1, where the Recall+ field refers to the non-trivial correspondences found and a + in the Coherent field indicates that the generated alignment is consistent.

⁶ Interactive Ontology Alignment: An Approach Based on the Interactive Modification of the Set of Candidate Correspondences. Available at http://www2.uniriotec.br/ppgi/banco-de-dissertacoes-ppgi-unirio/ano-2017/interactive-ontology-alignment-an-approach-based-on-the-interactive-modification-of-the-set-of-candidate-correspondences/view (last accessed on Nov 12, 2017).
⁷ Results for OAEI 2017 - Anatomy track. Available at http://oaei.ontologymatching.org/2017/results/anatomy/index.html (last accessed on Nov 12, 2017).

Regarding the Conference track⁸, ALIN evaluates only the properties associated with classes already evaluated as belonging to the alignment. As a result, the alignments of type M2 (which take into account only the properties of the ontologies) had an f-measure of 0, as can be seen in Table 2. Because properties are evaluated only in the interactive phase of ALIN, alignments of type M1 (only classes) kept a higher recall than those of type M3 (classes and properties), as can be seen in Table 2: the reference alignments of type M3 contain properties besides classes.

Table 1. Participation of ALIN in Anatomy non-interactive track

Runtime  Size  Precision  F-Measure  Recall  Recall+  Coherent
836      516   0.996      0.506      0.339   0.0      +

Table 2. Participation of ALIN in Conference non-interactive track

         Threshold  Precision  Recall  F1-Measure  F2-Measure  F0.5-Measure
ra1+m1   0.0        0.89       0.32    0.47        0.37        0.66
ra1+m2   0.0        0.0        0.0     0.0         0.0         0.0
ra1+m3   0.0        0.89       0.27    0.41        0.31        0.61

2.2 Comments on the participation of ALIN in interactive tracks

Table 3. Participation of ALIN in Anatomy interactive track - Error rate 0.0

Tool    Run Time (sec)  Precision  Recall  F-measure  Total Requests  Distinct Mappings
ALIN    1074            0.993      0.794   0.882      939             1472
AML     45              0.968      0.948   0.958      241             240
LogMap  23              0.982      0.846   0.909      388             1164
XMap    43              0.927      0.865   0.895      35              35

⁸ Results of Evaluation for the Conference track within OAEI 2017. Available at http://oaei.ontologymatching.org/2017/conference/eval.html (last accessed on Nov 12, 2017).

Table 4. Participation of ALIN in Anatomy interactive track - Error rate 0.1

Tool    Run Time (sec)  Precision  Recall  F-measure  Total Requests  Distinct Mappings
ALIN    1000            0.94       0.745   0.831      905             1352
AML     45              0.956      0.946   0.95       266             264
LogMap  23              0.962      0.83    0.891      388             1164
XMap    44              0.927      0.865   0.895      35              35

Table 5. Participation of ALIN in Conference interactive track - Error rate 0.0

Tool    Run Time (sec)  Precision  Recall  F-measure  Total Requests  Distinct Mappings
ALIN    35              0.957      0.731   0.829      329             571
AML     30              0.912      0.711   0.799      271             270
LogMap  35              0.886      0.61    0.723      82              246
XMap    21              0.837      0.57    0.678      4               4

Table 6. Participation of ALIN in Conference interactive track - Error rate 0.1

Tool    Run Time (sec)  Precision  Recall  F-measure  Total Requests  Distinct Mappings
ALIN    35              0.804      0.669   0.73       321             549
AML     30              0.841      0.701   0.765      282             275
LogMap  35              0.851      0.598   0.702      82              246
XMap    21              0.837      0.57    0.678      4               4

Anatomy track. In this track ALIN showed the highest precision among the four evaluated tools when the error rate is zero, as can be seen in Table 3. When the error rate increases, both precision and recall fall, reducing the f-measure, as can be seen in Table 4. This is expected, as explained in Section 2. As the ontologies of the Anatomy track contain almost no properties, some interactive techniques used in ALIN cannot be applied, such as the selection of properties associated with classes that received positive feedback. This limited the increase in recall, which in turn affected the f-measure.
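The f-measure values reported in the tables can be recomputed from the precision and recall columns. The sketch below assumes the standard F-beta definition (weighted harmonic mean of precision and recall); the function name `f_beta` is ours, and since the tabulated precision and recall are themselves rounded, the last digit of a recomputed value may occasionally differ from the published one.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention assumed here to avoid 0/0 (e.g. the ra1+m2 row)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# ALIN, Anatomy interactive track, error rate 0.0 (Table 3).
f1_anatomy = f_beta(0.993, 0.794)

# ALIN, Conference non-interactive track, ra1+m1 row (Table 2).
f1_conf = f_beta(0.89, 0.32)
f2_conf = f_beta(0.89, 0.32, beta=2.0)    # weights recall more heavily
f05_conf = f_beta(0.89, 0.32, beta=0.5)   # weights precision more heavily

print(round(f1_anatomy, 3), round(f1_conf, 2),
      round(f2_conf, 2), round(f05_conf, 2))
```

Because ALIN's precision is much higher than its recall, its F0.5 value is well above F1 and its F2 value well below, which matches the pattern in the ra1+m1 and ra1+m3 rows of Table 2.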
Conference track. In this track ALIN stood out, showing the highest f-measure among the four tools when the error rate is zero, as can be seen in Table 5, although the f-measure drops when the error rate increases, as can be seen in Table 6. Other results, including results with other error rates, can be seen on the OAEI 2017⁹ page.

⁹ Results for OAEI 2017 - Interactive Track. Available at http://oaei.ontologymatching.org/2017/results/interactive/index.html (last accessed on Nov 11, 2017).

2.3 Comparison of the participation of ALIN in OAEI 2017 with its participation in OAEI 2016

The difference between the participation of ALIN in OAEI 2016 and its participation in OAEI 2017 is the use of the HESML API in 2017, instead of the WS4J API, to calculate semantic similarities, which greatly increased the efficiency of these calculations. In ALIN's participation in OAEI 2016 [6], three semantic similarity metrics were used: Wu-Palmer, Jiang-Conrath and Lin. In OAEI 2017 the metrics Resnik, Jiang-Conrath and Lin were used. Wu-Palmer was replaced by Resnik because the Wu-Palmer metric in the HESML API took longer to execute than the same metric in the WS4J API. The Resnik metric proved to be much faster than the Wu-Palmer metric in the HESML API and, according to [7], is as good, so Resnik was chosen to take Wu-Palmer's place in the implementation of ALIN for OAEI 2017. More information about the HESML API can be found in [8].

Table 7 shows that the ALIN runtime decreased considerably with the use of the HESML API instead of the WS4J API. In the Anatomy interactive track of OAEI 2016, ALIN did not use the semantic metrics, only the string metrics, since computing the semantic metrics took so long that execution was not feasible.
In OAEI 2017, using the HESML API, it was possible to use semantic metrics, which increased the quality of the generated alignment but also increased the expert's participation. The execution time also increased with the inclusion of semantic metrics, as can be seen in Table 8.

Table 7. Participation of ALIN in Conference interactive track - OAEI 2016/2017 - Error rate 0.0

Year  Run Time (sec)  Precision  Recall  F-measure  Total Requests  Distinct Mappings
2016  101             0.957      0.735   0.831      326             574
2017  35              0.957      0.731   0.829      329             571

Table 8. Participation of ALIN in Anatomy interactive track - OAEI 2016/2017 - Error rate 0.0

Year  Run Time (sec)  Precision  Recall  F-measure  Total Requests  Distinct Mappings
2016  505             0.993      0.749   0.854      803             1221
2017  1074            0.993      0.794   0.882      939             1472

3 General Comments

The results show that the system can be improved by: (a) handling the user error rate; (b) generating a higher-quality initial alignment (especially with respect to recall) in its non-interactive phase; (c) reducing the number of interactions with the expert; and (d) optimizing the process to reduce its execution time, especially for alignments with large numbers of correspondences, such as Anatomy.

3.1 Conclusions

Under certain conditions, the ALIN system stands out in the ontology alignment process in interactive application scenarios, especially when the number of data and object properties is relatively large and when the expert does not make mistakes. Under these conditions it generates an alignment with relatively high precision and recall.

The third author was partially funded by project PQ-UNIRIO N01/2017 ("Aprendendo, adaptando e alinhando ontologias: metodologias e algoritmos.") and CAPES/PROAP.

References

1. H. Paulheim, S. Hertling, and D. Ritze, Towards Evaluating Interactive Ontology Matching Tools, Lect. Notes Comput. Sci., vol. 7882, pp. 31-45, 2013.
2. R. W. Irving, D. F. Manlove, and G. O'Malley, Stable marriage with ties and bounded length preference lists, J. Discret. Algorithms, vol. 7, no. 2, pp. 213-219, 2009.
3. J. Euzenat and P. Shvaiko, Ontology Matching, 2nd ed. Springer-Verlag, 2013.
4. J. Silva, F. A. Baião, K. Revoredo, and J. Euzenat, Semantic Interactive Ontology Matching: Synergistic Combination of Techniques to Improve the Set of Candidate Correspondences, n.d.
5. A. Guedes, F. Baião, and K. Revoredo, Digging Ontology Correspondence Antipatterns, Proc. 5th Int. Conf. Ontol. Semant. Web Patterns (WOP14), vol. 1302, pp. 38-48, 2014.
6. J. Silva, F. A. Baião, and K. Revoredo, ALIN Results for OAEI 2016, CEUR Workshop Proc., vol. 1766, 2016.
7. E. G. M. Petrakis, G. Varelas, A. Hliaoutakis, and P. Raftopoulou, Design and Evaluation of Semantic Similarity Measures for Concepts Stemming from the Same or Different Ontologies, Proc. 4th Work. Multimed. Semant., vol. 4, pp. 233-237, 2006.
8. J. J. Lastra-Díaz, A. García-Serrano, M. Batet, M. Fernández, and F. Chirigati, HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset, Information Systems, vol. 66, pp. 97-118, 2017. http://doi.org/10.1016/j.is.2017.02.002