StringsAuto and MapSSS Results for OAEI 2013

Michelle Cheatham and Pascal Hitzler

Kno.e.sis Center, Wright State University, Dayton, OH, USA
{cheatham.7, pascal.hitzler}@wright.edu

Abstract. StringsAuto and MapSSS are two closely related ontology alignment systems. The StringsAuto matcher seeks to explore the limits of a syntactic-only approach to alignment. The MapSSS system then expands on this work by embedding the syntactic matching of StringsAuto within a more complete alignment system that also makes use of semantic and structural information. In this paper we describe the basic operation of the two systems and discuss their performance in the OAEI 2013 evaluation.

1 Presentation of the system

1.1 State, purpose, general statement

The vast majority of ontology alignment systems use some form of string similarity metric. Our overall goal with StringsAuto and MapSSS is to explore the importance of the choice of a particular string metric. StringsAuto consists only of string metrics, while MapSSS uses strategically chosen string metrics within the context of a more fully-featured alignment system.

In [1] we analyzed the performance of eleven string similarity metrics (TF-IDF, Soft TF-IDF, Jaccard, Soft Jaccard, Exact Match, Longest Common Substring, Jaro-Winkler, Levenshtein, Monge-Elkan, N-gram, and Stoilos) on different types of ontologies (standard, biomedical, and multilingual). In addition, we experimented with the use of common string pre-processing methods (tokenization, normalization, stemming, stop word removal, synonyms, and translations). StringsAuto grew out of this work. Its purpose is to investigate string similarity metrics as applied to ontology alignment. In particular, it is of interest to compare the performance of this system to that of the very basic string-based matchers used as baselines for some of the OAEI tracks.

The MapSSS system was involved in previous OAEI evaluations. The three S's in MapSSS stand for syntactic, semantic, and structural, which are the three types of metrics used by the system. This year MapSSS has been augmented with a different semantic metric, based on Google queries, and modified to use the same syntactic metric selection strategy as StringsAuto. We are interested in comparing the performance of this version to that of previous years.

1.2 Specific techniques used

Based on the results of the string metric analysis in [1], we produced a set of guidelines for choosing string metrics and pre-processing strategies based on the characteristics of the ontologies to be aligned and on whether precision or recall is of primary concern. More information can be found in the referenced paper.

– Precision
  • Less than two words per label: Jaro-Winkler 1, 1
  • Two or more words per label
    ∗ Synonyms: Soft Jaccard .2, .5 with Levenshtein .9 base metric
    ∗ No synonyms: Soft Jaccard 1, 1 with Levenshtein .8 base metric
– Recall
  • Less than two words per label: TF-IDF .8, .8
  • Two or more words per label
    ∗ Synonyms: Soft TF-IDF .5, .8 with Jaro-Winkler .8 base metric
    ∗ Different languages: Soft TF-IDF 0, .7 with Jaro-Winkler .9 base metric
    ∗ Other: Soft TF-IDF .8, .8 with Jaro-Winkler .8 base metric

StringsAuto simply chooses two metrics based on these heuristics: one that prioritizes precision and another that focuses on recall (a sketch of this selection logic is given below). Each of these metrics is run (in series) and the resulting alignment is used as-is.
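To make the guidelines concrete, the following Python sketch encodes the selection table above. The function name, the encoding of a metric as a (name, threshold 1, threshold 2, base metric) tuple, and the input flags are our own illustrative choices, not the actual StringsAuto implementation.

# A sketch of the metric selection heuristic from Section 1.2. Each metric
# is encoded as a tuple (name, threshold 1, threshold 2, base metric); the
# names and the encoding are placeholders for illustration only.

def select_metrics(avg_words_per_label, has_synonyms, different_languages):
    """Return a (precision metric, recall metric) pair per the guidelines."""
    if avg_words_per_label < 2:
        return ("jaro-winkler", 1.0, 1.0, None), ("tf-idf", 0.8, 0.8, None)
    if has_synonyms:
        precision = ("soft-jaccard", 0.2, 0.5, ("levenshtein", 0.9))
        recall = ("soft-tf-idf", 0.5, 0.8, ("jaro-winkler", 0.8))
    else:
        precision = ("soft-jaccard", 1.0, 1.0, ("levenshtein", 0.8))
        if different_languages:
            recall = ("soft-tf-idf", 0.0, 0.7, ("jaro-winkler", 0.9))
        else:
            recall = ("soft-tf-idf", 0.8, 0.8, ("jaro-winkler", 0.8))
    return precision, recall

For example, select_metrics(2.4, False, True) selects the no-synonym Soft Jaccard metric for precision and the different-languages Soft TF-IDF metric for recall.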
When a metric is run, every label in the first ontology is compared to every label in the second ontology, and the results of the similarity metric are stored in a matrix. The stable marriage algorithm is then run over the matrix, and any matches greater than a threshold value are included in the alignment. If either entity involved in a match has already been used in the alignment, that match is ignored. This means that all alignments generated are 1:1 and the recall-centric metric cannot override the precision-centric one. A sketch of this matching step is given below.

MapSSS uses the same syntactic metric selection strategy as StringsAuto. In addition, it uses a semantic metric based on Google queries. When considering two labels, A from the first ontology and B from the second, this metric queries Google for the phrase "A definition". It then searches the snippets on the first page of results for B. If B is found, the metric returns true; otherwise it returns false. If this metric returns true in both directions (i.e. googleMetric(A, B) and googleMetric(B, A) are both true), then the mapping is added to the alignment. Finally, MapSSS also contains a structural metric: if all of the entities in the direct neighborhood of two classes are mapped to one another, then those classes are mapped. This approach is sometimes called "flooding." The structural metric is run repeatedly until no new mappings are created. Sketches of both MapSSS metrics follow the matching sketch below.
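The matching step just described can be sketched as follows. For brevity the sketch replaces the stable marriage algorithm with a greedy best-pair pass over the pairwise scores, which likewise yields a 1:1 alignment above the threshold; labels1, labels2, and similarity are hypothetical inputs, not the actual StringsAuto interfaces.

# A simplified sketch of the matching step. A greedy best-pair pass stands
# in for the stable marriage algorithm used by the actual system.

def match(labels1, labels2, similarity, threshold, alignment=None):
    """Add 1:1 matches scoring above the threshold to the alignment."""
    if alignment is None:
        alignment = {}
    used1, used2 = set(alignment.keys()), set(alignment.values())
    # Score every label pair, then consider the best-scoring pairs first.
    scores = [(similarity(a, b), a, b) for a in labels1 for b in labels2]
    for score, a, b in sorted(scores, reverse=True):
        if score <= threshold:
            break
        # Entities already used in the alignment are skipped, so matches
        # made by an earlier metric are never overridden.
        if a in used1 or b in used2:
            continue
        alignment[a] = b
        used1.add(a)
        used2.add(b)
    return alignment

Calling this once with the precision-oriented metric and then again with the recall-oriented metric, passing the first alignment back in, reproduces the "in series" behavior in which the recall-centric metric cannot override the precision-centric one.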
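The Google-based semantic metric can be sketched as below. The fetch_snippets parameter stands in for whatever search call retrieves the result snippets from the first page of a Google query; it is a placeholder rather than a real API.

# A sketch of the Google-based semantic metric. fetch_snippets(query) is
# assumed to return the snippet strings from the first page of results.

def google_metric(a, b, fetch_snippets):
    """True if label b appears in the snippets returned for 'a definition'."""
    snippets = fetch_snippets(a + " definition")
    return any(b.lower() in snippet.lower() for snippet in snippets)

def semantic_match(a, b, fetch_snippets):
    # The mapping is accepted only if the metric holds in both directions.
    return (google_metric(a, b, fetch_snippets)
            and google_metric(b, a, fetch_snippets))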
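Finally, the structural flooding metric can be sketched as follows, under one plausible reading of the neighborhood condition: the image of one class's direct neighborhood under the current alignment must equal the other class's direct neighborhood exactly. The neighbors1 and neighbors2 dictionaries, mapping each class to the set of entities in its direct neighborhood, are hypothetical stand-ins for the ontology graphs.

# A sketch of the structural "flooding" metric: two unmatched classes are
# mapped when every entity in the first class's neighborhood is already
# mapped onto the second class's neighborhood. Repeats until a fixpoint.

def flood(alignment, neighbors1, neighbors2):
    changed = True
    while changed:
        changed = False
        for c1, nbrs1 in neighbors1.items():
            if c1 in alignment:
                continue
            for c2, nbrs2 in neighbors2.items():
                if c2 in alignment.values():
                    continue
                # Map each neighbor of c1 through the current alignment;
                # require a non-empty neighborhood to avoid trivial matches.
                image = {alignment.get(n) for n in nbrs1}
                if nbrs1 and None not in image and image == set(nbrs2):
                    alignment[c1] = c2
                    changed = True
                    break
    return alignment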
1.3 Adaptations made for the evaluation

No significant adaptations were made for the OAEI evaluation. In particular, the heuristics used to select the string similarity metrics do not break cleanly along the different OAEI tracks. Some possibly relevant details of these alignment systems include:

– Neither alignment system attempts to align properties or instances; only classes are considered. Our previous work has shown that string similarity metrics perform particularly poorly on property labels.
– The systems determine the language of an ontology by randomly selecting a sample of ten entity labels and sending them to Google Translate. This assumes that each ontology involves predominantly one language.
– To determine if an ontology has embedded synonyms, the alignment systems look for tags involving the word "synonym." This is to some extent tailored to the anatomy track of the OAEI.
– The semantic metric within MapSSS uses the Google API. There is a limit on the number of queries that can be submitted using this API each day, as well as a monthly cap. This causes problems for some of the larger ontology alignment problems within the OAEI evaluation. We attempted to cache the query results to alleviate this problem, but the SEALS server configuration made this unworkable (we would need to be able to write to a file during execution and have this file available during subsequent runs of the program).

1.4 Link to the system and parameters file

StringsAuto is available at http://pascal-hitzler.de/resources/Strings.zip and MapSSS is available at http://pascal-hitzler.de/resources/MapSSS.zip.

2 Results

Development and testing of StringsAuto and MapSSS focused primarily on the conference, anatomy, and multifarm test sets, but we present results for all tracks in which alignments were produced.

2.1 anatomy

StringsAuto achieved an f-measure of 0.835 on this test set (see Table 1). This placed it 7th out of 21 participating systems. In particular, the results produced by StringsAuto were significantly better than those of StringsEquiv, a basic string equality matcher.

Interestingly, the performance of MapSSS did not differ greatly from that of StringsAuto. When compared to the performance of the 2012 version of MapSSS, we see that the precision has dropped while the recall has increased slightly. Notably, the recall+ measure is significantly higher with the current version, which makes use of a string similarity metric specifically chosen to enhance recall and of semantic information gleaned from Google queries.

Table 1. Anatomy Track Results

Alignment System   F-measure   Precision   Recall   Recall+
StringsEquiv       .766        .997        .622     0
StringsAuto        .835        .899        .779     .433
MapSSS 2013        .828        .898        .768     .443
MapSSS 2012        .831        .935        .747     .337

2.2 conference

The (ra2) results of StringsAuto and MapSSS on the conference track are shown in Table 2. StringsAuto outperformed both StringsEquiv and edna (an edit distance metric with a threshold of .82). Overall, StringsAuto was 6th out of 27 alignment systems in terms of f-measure, while edna was 11th and StringsEquiv was 22nd. The 2013 version of MapSSS significantly outperformed its predecessor but fell slightly short of StringsAuto.

Table 2. Conference Track Results

Alignment System   F-measure   Precision   Recall
StringsEquiv       .52         .76         .39
edna               .55         .73         .44
StringsAuto        .60         .74         .50
MapSSS 2013        .58         .77         .46
MapSSS 2012        .46         .47         .46

2.3 multifarm

There was a problem running both StringsAuto and MapSSS on the multifarm test set. While both systems were able to produce alignments, they had to fall back to their non-translating versions due to a problem reaching the Google Translate service from the OAEI test server. We attempted to fix this during the evaluation by caching the results of the translation queries, but this did not work, possibly due to write restrictions on the server itself (we need to be able to write to a file that will persist between different executions of the program). Here we report both the results achieved during the evaluation and the results we get when we run StringsAuto on a local computer. In addition, the code used to generate the StringsAuto results is available from http://pascal-hitzler.de/resources/Strings.zip, and the actual alignments produced on the multifarm test cases are available at http://pascal-hitzler.de/resources/Multifarm2013alignments.zip.

Table 3. Multifarm Track Results

                          Different ontologies              Same ontologies
Alignment System          F-measure   Precision   Recall    F-measure   Precision   Recall
StringsAuto               .14         .30         .09       .07         .51         .04
MapSSS                    .10         .27         .07       .06         .50         .03
StringsAuto (corrected)   .30         .42         .23       .36         .92         .23

2.4 library

MapSSS did not produce alignments for this track, likely due to the size of the thesauri causing the system to exceed the Google API query limit. StringsAuto finished below the reference (string equality) matchers on this test (see Table 4).

StringsAuto could very probably be improved by recognizing all of the labels of an entity as synonyms (as it does for the anatomy benchmark). Another potential issue is that StringsAuto might have decided that either or both of the ontologies was entirely in German due to its sampling technique, and then attempted to translate all of the labels in that ontology (even the ones already in English).

Table 4. Library Track Results

Alignment System   F-measure   Precision   Recall
StringsAuto        .302        .774        .188

2.5 large biomedical ontologies

In this track StringsAuto was only able to complete one out of the six tasks (FMA-NCI), and MapSSS was not able to complete any of them (again due to the Google API query limit). The results of StringsAuto on the FMA-NCI task were not very good.
The system achieved an f-measure of 0.667 (based on a precision of 0.838 and a recall of 0.554). This placed the system 20th out of 23. It is odd that the performance here is so different from that on the anatomy track. As with the problem on the library track, this is likely due in part to StringsAuto's inability to recognize multiple labels for a single entity as synonyms. The synonym extraction method should be adapted to include this information.

It might be surprising that StringsAuto was unable to complete more of the tasks in this track. While in theory a matcher that only does string comparisons of labels should scale very well, StringsAuto uses a global (m x n) matrix to store all of the pairwise similarity values and runs the stable marriage algorithm over this data. This obviously runs into memory limitations for large ontologies. In the future it might make sense to choose mappings based on a simple local maximum (with a threshold), as sketched below.
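A minimal sketch of this local-maximum alternative, assuming the same hypothetical label lists and similarity function as in the earlier matching sketch: each label from the first ontology is paired with its single best-scoring candidate, so no m x n matrix is ever materialized.

# A sketch of local-maximum mapping selection. Memory use is O(m) for the
# alignment rather than O(m * n) for a global similarity matrix.

def match_local_max(labels1, labels2, similarity, threshold):
    alignment = {}
    for a in labels1:
        best_score, best_label = threshold, None
        for b in labels2:
            score = similarity(a, b)
            # Keep only the best candidate seen so far above the threshold.
            if score > best_score:
                best_score, best_label = score, b
        if best_label is not None:
            alignment[a] = best_label
    return alignment

Note that, unlike the stable marriage approach, this does not guarantee a 1:1 alignment; a post-processing pass would be needed to restore that property.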
3 General comments

3.1 Comments on the results

Despite some technical problems, the performance of StringsAuto compared to that of the baseline string matchers shows that a careful selection of string similarity metrics leads to a significant performance increase in ontology alignment systems. In fact, StringsAuto finished in the top third of all alignment systems in both the anatomy and conference tracks. This shows that a significant amount of the semantic information within some ontologies is contained in the labels themselves, and string similarity metrics are therefore an important component of ontology alignment systems.

The lackluster performance of MapSSS when compared with StringsAuto was somewhat surprising. Further research will be needed to improve the utility of the Google-based semantic similarity metric. We have begun looking into leveraging other general sources of information, including Wikilinks (http://www.iesl.cs.umass.edu/data/wiki-links) and Wikipedia. It would be interesting to perform a comprehensive analysis of these types of metrics similar to the one done for string similarity metrics in [1].

3.2 Discussions on the way to improve the proposed system

While StringsAuto is basically a proof-of-concept alignment system, it could be extended in several ways that would improve its performance on the OAEI evaluation. In particular, it could be adapted to treat multiple labels for a single entity as synonyms and to avoid the use of a global data structure so that larger ontology pairs could be aligned.

The main problem with MapSSS is due to the Google API query limit. This is also a problem with Bing, according to their terms of service. To mitigate this issue, we need to either identify another general information source that does not have such a limit or invoke this metric in a more limited way.

3.3 Comments on the OAEI 2013 procedure

It would be convenient to provide a way to run all of the language pairs in the multifarm test set with a single command and produce the same results published by the organizers of that track (i.e. the precision, recall and f-measure separated into the "same" and "different" ontology categories).

3.4 Proposed new measures

It might be interesting to see some details about the alignments produced by the various tools. For instance, were there some mappings identified by all of the alignment systems? Were there some that were missed by all systems? This might provide insights that improve the performance of ontology alignment systems in general. It might also highlight any controversial mappings remaining in the reference alignments.

4 Conclusion

We have described two related ontology alignment systems, StringsAuto and MapSSS, which explore the role that string similarity metrics play in ontology alignment. The results of these matchers on the OAEI evaluation are significantly better than those of the baseline string similarity matchers, and in some cases the systems perform quite well when compared to all other alignment systems. The disappointing performance of the Google-based semantic similarity metric used in MapSSS indicates the need for further research in this area.

Acknowledgements

This work was supported by the National Science Foundation under award 1354778 "EAGER: Collaborative Research: EarthCube Building Blocks, Leveraging Semantics and Linked Data for Geoscience Data Sharing and Discovery." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

1. M. Cheatham and P. Hitzler. String similarity metrics for ontology alignment. In Proceedings of the 12th International Semantic Web Conference (ISWC 2013), Sydney, NSW, Australia, October 21-25, 2013. Springer, Heidelberg, 2013.