Compound Matching of Biomedical Ontologies

Daniela Oliveira∗ and Catia Pesquita
LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 1749-016, Portugal

∗ To whom correspondence should be addressed: doliveira@lasige.di.fc.ul.pt

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Biomedical ontologies have been particularly successful in standardizing the life sciences domain, and ontology matching systems are useful for discovering relationships between concepts of two different ontologies. However, that is also a limitation: there is a growing interest in discovering more complex kinds of mappings, and existing techniques are limited to matching two ontologies. Producing 'compound' alignments, which match more than two ontologies, could therefore support a next generation of semantic technologies. In this paper, we present a novel algorithm that produces compound matches between three different ontologies, and we evaluate its performance against seven automatically inferred reference alignments from the biomedical domain. We analyze all alignments manually to verify the results and propose a new way to complete the logical definitions of OBO cross-products.

1 INTRODUCTION

Biomedical ontologies typically contain a high number of classes and often cover the same or related fields, which hinders their interoperability. One approach to this problem is the use of matching systems, which are capable of establishing meaningful connections between ontologies.

Still, most ontology matching systems produce equivalence mappings between classes or properties in two ontologies. In a complex domain such as biomedicine, where several ontologies describe different but related aspects of biomedical phenomena, it may be advantageous to create mappings by combining entities from more than two ontologies. We argue that it would be useful for the developers of ontology alignment systems to develop new techniques and tools for identifying 'compound matches', i.e. matches between class or property expressions involving more than two ontologies. To the best of our knowledge, there are currently no ontology matching systems capable of generating such mappings.

The purpose of this work is to develop novel algorithms for the efficient and effective creation of alignments between a class A of one ontology and an expression relating classes B and C of two other ontologies, constituting a ternary relationship.

2 METHODS

We consider a ternary compound alignment to be a set of correspondences (mappings) between classes from a source ontology $O_s$ and class expressions obtained by combining two other classes, each belonging to a different target ontology, $O_{t1}$ and $O_{t2}$ (see Figure 1). That is, we define a ternary compound mapping as a tuple <X, Y, Z, R, M>, where X, Y and Z are classes from three distinct ontologies, R is a relation established between Y and Z to generate a class expression, and M is the mapping relation through which that expression is mapped to X. Here, we consider the ontology to which X belongs to be the source ontology, and the ontologies that define Y and Z to be target ontologies 1 and 2, respectively. In this particular case the relation R is always an intersection (regardless of any qualifier) and the mapping M an equivalence.

Fig. 1. Example of a possible ternary compound match.

2.1 Implementation

We developed a novel algorithm to establish compound mappings, integrated into the AgreementMakerLight (AML) (Faria et al., 2014) ontology matching system¹. Our algorithm exploits AML's Word Lexicon, the set of all words in an ontology's vocabulary, each of which is assigned an evidence content (EC) reflecting the usage of the word within the ontology.

¹ Available at: https://github.com/AgreementMakerLight/AML-Compound

In a first step, we perform a pairwise mapping of the labels of $O_s$ with the labels of $O_{t1}$. The similarity is the ratio between the sum of the EC of the words shared by the source label ($l_s$) and the target 1 label ($l_{t1}$), and the sum of the EC of the words in $l_{t1}$:

\[
sim(l_s, l_{t1}) = \frac{\sum EC(word \in (l_s \cap l_{t1}))}{\sum EC(word \in l_{t1})} \tag{1}
\]

We filter out all mappings with similarity below a given threshold.

In a second step, for each mapping found in step 1, we remove from the source label all the words that have already been matched, yielding the reduced label $l_s^*$. Taking as an example the mapping in Figure 1, after matching HP and FMA, which would capture the mapping for 'aorta', the HP class label would be reduced to 'stenosis'.

In a third step, for each mapping, we perform a pairwise comparison of the reduced source labels with the target 2 labels. Here, however, the ratio divisor corresponds to the sum of the EC of the words in the label with more words, to ensure the longest possible match:

\[
sim(l_s, l_{t2}) = \frac{\sum EC(word \in (l_s^* \cap l_{t2}))}{\sum EC(word \in longest(l_s, l_{t2}))} \tag{2}
\]

In a fourth step, the final similarity between the matched labels is computed as the average of the similarities computed in steps 1 and 3. Label mappings below the second threshold are filtered out.

Finally, the algorithm has a greedy selection step, which selects the mapping with the highest similarity for each source class with more than one mapping.

2.2 Evaluation

To evaluate our strategy we used a set of seven reference alignments (Pesquita et al., 2014), automatically created by inferring compound mappings from cross-products (Mungall et al., 2011) of the logical definitions in OBO ontologies (Smith et al., 2007). Against these, we computed precision, recall and F-measure. We also performed a manual evaluation of the results, in which we classified mappings into three possible categories: 'Correct', where the mapping is deemed correct and the source class has no mapping in the reference alignment; 'Conflict', where the mapping is deemed correct but the source class has a different mapping in the reference alignment; and 'Incorrect', where the mapping is deemed incorrect. We applied this to all mappings created by using 0.5 as the first threshold (step 1) and 0.9 as the second threshold.

3 RESULTS

Table 1 presents some statistics about the alignments obtained. Preliminary results using this evaluation approach show a low F-measure, with a higher precision, which ranges from 11.6% to 67.9%, and a recall that always falls below the 50% mark.

                   Precision   Recall    F-Measure
MP-CL-PATO         52.6 %      20.8 %    29.8 %
MP-GO-PATO         67.9 %      47.2 %    55.7 %
MP-NBO-PATO        47.3 %      30.1 %    36.8 %
MP-UBERON-PATO     64.7 %      19.4 %    29.9 %
WBP-GO-PATO        11.6 %       7.7 %     9.2 %
HP-FMA-PATO        21.2 %      12.4 %    15.6 %

Table 1. Evaluation results from the comparison with the automated reference alignments.

                   Correct     Conflict   Incorrect
MP-CL-PATO         63.71 %     34.60 %    1.69 %
MP-GO-PATO         92.16 %      6.97 %    0.87 %
MP-NBO-PATO        72.46 %     26.09 %    1.45 %
MP-UBERON-PATO     91.33 %      7.96 %    0.70 %
WBP-GO-PATO        88.55 %      7.49 %    3.96 %
HP-FMA-PATO        77.82 %     15.56 %    6.61 %

Table 2. Manual evaluation of results.

The manual inspection of the mappings (Table 2) revealed that the algorithm finds mostly correct mappings. The lowest percentage of correct mappings belongs to the MP-CL-PATO compound alignment, which also had the highest percentage of conflicting mappings.

4 DISCUSSION

One challenge in computing compound alignments is the memory requirements involved in the process. If matching two large biomedical ontologies is already a challenge for many ontology matching systems, handling three ontologies in a compound alignment scenario is even more demanding. Our algorithm reduces the search space by using the two-step matching approach, which reduces both the time and the memory requirements².

² The largest alignment takes less than 15 minutes with an Intel(R) Core(TM) i7-2600 CPU 3.40GHz x 8 processor and 16GB memory.

Although our algorithm's performance against the reference alignments is low (Table 1), the manual evaluation of the mappings reveals a very low proportion of incorrect mappings, so we investigated how these new mappings could impact the logical definitions of the source ontology. The results presented in Table 3 indicate that the logical definitions of the three source ontologies could be expanded with more than 800 new logical definitions.

Ontology   New Mappings   OBO classes   % of Growth
MP         422            7694           5.48
WBP        182             957          19.02
HP         259            14059          1.84

Table 3. Influence of the new mappings on the source ontology.

We can conclude that our approach is capable of producing good precision (Table 2 shows that, on average, 81% of the matches are correct) and is able to find many correct mappings that are not in the reference alignment. However, it struggles to capture many of the mappings in the references. This is mainly due to our algorithm's inability to distinguish between similar PATO classes (e.g., PATO:0000470: 'present in greater numbers in organism' vs. PATO:0002002: 'has extra parts of type'), or to the use of synonyms not defined in any of the ontologies.

5 CONCLUSION

We have presented, to the best of our knowledge, the first algorithm for compound matching of ontologies. It is particularly suited for biomedical ontologies, given its ability to handle large ontologies and the need in this domain to reveal more complex relations between them. Our preliminary experiments have shown that, despite the challenges in handling an increased matching space and the inherently more difficult-to-compute ternary mapping, our algorithm is able to produce mappings with good precision. Moreover, we posit that it could also be used as a first step in adding new logical definitions to ontologies, since we were able to find several correct mappings that were not in the reference alignments.

ACKNOWLEDGEMENTS

The authors are grateful to Daniel Faria for his technical support. This work was supported by FCT through funding of LaSIGE Research Unit, ref. UID/CEC/00408/2013.

REFERENCES

Faria, D., Pesquita, C., Santos, E., Cruz, I. F., and Couto, F. M. (2014). AgreementMakerLight: a scalable automated ontology matching system. 10th International Conference on Data Integration in the Life Sciences 2014 (DILS), page 29.

Mungall, C. J., Bada, M., Berardini, T. Z., Deegan, J., Ireland, A., Harris, M. A., Hill, D. P., and Lomax, J. (2011). Cross-product extensions of the Gene Ontology. Journal of Biomedical Informatics, 44(1), 80-86. Ontologies for Clinical and Translational Research.

Pesquita, C., Cheatham, M., Faria, D., Barros, J., Santos, E., and Couto, F. M. (2014). Building reference alignments for compound matching of multiple ontologies using OBO cross-products. In Ontology Matching Workshop at ISWC 2014.

Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., et al. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11), 1251-1255.
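
As an illustration of the matching procedure described in Section 2.1, the following minimal Python sketch walks through the four similarity steps and the greedy selection on toy data. It is not the AML implementation. The function names, the whitespace tokenizer, the EC values and the example labels are all hypothetical. Two points are our own reading of the text and may differ from the actual system: the "label with more words" in step 3 is taken between the reduced source label and the target 2 label (ties fall to the target label), and the greedy step keeps the single highest-scoring mapping per source label. The default thresholds 0.5 and 0.9 are the values reported in Section 2.2.

```python
from typing import Dict, Iterable, Set, Tuple


def words(label: str) -> Set[str]:
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return set(label.lower().split())


def ec_sum(ws: Iterable[str], ec: Dict[str, float]) -> float:
    """Sum the evidence content (EC) of a collection of words."""
    return sum(ec.get(w, 0.0) for w in ws)


def sim_step1(ls: str, lt1: str, ec: Dict[str, float]) -> float:
    """Step 1: EC of shared words over EC of the target 1 label words."""
    src, tgt = words(ls), words(lt1)
    denom = ec_sum(tgt, ec)
    return ec_sum(src & tgt, ec) / denom if denom else 0.0


def reduce_label(ls: str, lt1: str) -> Set[str]:
    """Step 2: drop the source words already matched against target 1."""
    return words(ls) - words(lt1)


def sim_step3(ls_star: Set[str], lt2: str, ec: Dict[str, float]) -> float:
    """Step 3: EC of shared words over EC of the label with more words.

    Assumption: the reduced source label is the one compared in length.
    """
    tgt = words(lt2)
    longest = ls_star if len(ls_star) > len(tgt) else tgt
    denom = ec_sum(longest, ec)
    return ec_sum(ls_star & tgt, ec) / denom if denom else 0.0


def compound_match(
    source_labels: Iterable[str],
    target1_labels: Iterable[str],
    target2_labels: Iterable[str],
    ec: Dict[str, float],
    threshold1: float = 0.5,
    threshold2: float = 0.9,
) -> Dict[str, Tuple[str, str, float]]:
    """Return, per source label, the best (target 1, target 2, similarity)."""
    target1_labels = list(target1_labels)
    target2_labels = list(target2_labels)
    best: Dict[str, Tuple[str, str, float]] = {}
    for ls in source_labels:
        for lt1 in target1_labels:
            s1 = sim_step1(ls, lt1, ec)
            if s1 < threshold1:  # filter after step 1
                continue
            ls_star = reduce_label(ls, lt1)
            for lt2 in target2_labels:
                s3 = sim_step3(ls_star, lt2, ec)
                final = (s1 + s3) / 2  # step 4: average of steps 1 and 3
                if final < threshold2:  # second threshold
                    continue
                # Greedy selection: keep the highest similarity per source.
                if ls not in best or final > best[ls][2]:
                    best[ls] = (lt1, lt2, final)
    return best


if __name__ == "__main__":
    # Hypothetical EC values and labels, for illustration only.
    toy_ec = {"aorta": 0.8, "stenosis": 0.6, "heart": 0.3}
    print(compound_match(["aorta stenosis"],
                         ["aorta", "heart"],
                         ["stenosis", "stenosis heart"],
                         toy_ec))
```

In this toy run, the source label matches 'aorta' in step 1, is reduced to 'stenosis' in step 2, and the longer target 2 label 'stenosis heart' is rejected by the second threshold because its extra word enlarges the divisor in step 3.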