=Paper=
{{Paper
|id=Vol-1515/regular15
|storemode=property
|title=Using Aber-OWL for fast and scalable reasoning over BioPortal ontologies
|pdfUrl=https://ceur-ws.org/Vol-1515/regular15.pdf
|volume=Vol-1515
|dblpUrl=https://dblp.org/rec/conf/icbo/SlaterGSH15
}}
==Using Aber-OWL for fast and scalable reasoning over BioPortal ontologies==
Luke Slater<sup>1,∗</sup>, Georgios V Gkoutos<sup>2</sup>, Paul N Schofield<sup>3</sup>, Robert Hoehndorf<sup>1</sup>

<sup>1</sup> Computational Bioscience Research Center, King Abdullah University of Science and Technology, 4700 KAUST, 23955-6900, Thuwal, Saudi Arabia

<sup>2</sup> Department of Computer Science, Aberystwyth University, Aberystwyth, SY23 3DB, Wales, United Kingdom

<sup>3</sup> Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, CB2 3EG, England, United Kingdom

<sup>∗</sup> To whom correspondence should be addressed: luke.slater@kaust.edu.sa

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

===ABSTRACT===
Reasoning over biomedical ontologies using their OWL semantics has traditionally been a challenging task due to the high theoretical complexity of OWL-based automated reasoning. As a consequence, ontology repositories, as well as most other tools utilizing ontologies, either provide access to ontologies without use of automated reasoning, or limit the number of ontologies for which automated reasoning-based access is provided. We apply the Aber-OWL infrastructure to provide automated reasoning-based access to all accessible and consistent ontologies in BioPortal (368 ontologies). We perform an extensive performance evaluation to determine query times, both for queries of different complexity and for queries that are performed in parallel over the ontologies. We demonstrate that, with the exception of a few ontologies, even complex and parallel queries can now be answered in milliseconds, therefore allowing automated reasoning to be used on a large scale, to run in parallel, and with rapid response times.

===1 INTRODUCTION===
Major ontology repositories such as the BioPortal (Noy et al., 2009), OntoBee (Xiang et al., 2011), or the Ontology Lookup Service (Cote et al., 2006), have existed for a number of years, and currently contain several hundred ontologies, enabling ontology creators and maintainers to publish their ontology releases and make them available to the wider community.

Besides the hosting functionality that such repositories offer, they usually also provide certain web-based features for browsing, comparing, visualising and processing ontologies. One particularly useful feature, currently missing from the major ontology repositories, is the ability to provide online access to reasoning services simultaneously over many ontologies. Such a feature would enable the use of semantics and deductive inference when processing data characterized with the ontologies these repositories contain (Hoehndorf et al., 2015). Moreover, the ability to query multiple ontologies simultaneously further enables data integration across domains and data sources. For example, there is an increasing amount of RDF (Manola and Miller, 2004) data becoming available through public SPARQL (Seaborne and Prud’hommeaux, 2008) endpoints (Jupp et al., 2014; The Uniprot Consortium, 2007; Belleau et al., 2008; Williams et al., 2012), which utilise multiple ontologies to annotate entities.

However, enabling automated reasoning over multiple ontologies is a challenging task, since automated reasoning can be highly complex and costly in terms of time and memory consumption (Tobies, 2000). In particular, ontologies formulated in the Web Ontology Language (OWL) (Grau et al., 2008) can utilize statements based on highly expressive description logics (Horrocks et al., 2000), and therefore queries that utilize automated reasoning cannot, in general, be guaranteed to finish in a reasonable amount of time.

Prior work on large-scale automated reasoning over biomedical ontologies has often focused on the set of ontologies in BioPortal, as it is one of the largest collections of ontologies freely available. To enable inferences over this set of ontologies, modularization techniques using the notion of locality-based modules have been applied (Del Vescovo et al., 2011), demonstrating that, for most ontologies and applications, relatively small modules can be extracted over which queries can be answered more efficiently. Other work has focused on predicting the performance of reasoners when applied to the set of BioPortal ontologies (Sazonau et al., 2013), and could demonstrate that the performance of particular reasoners can reliably be predicted; at the same time, the authors have conducted an extensive evaluation of average classification times of each ontology.

Other approaches apply RDFS reasoning (Patel-Schneider et al., 2004) for providing limited, yet fast, inference capabilities in answering queries over BioPortal’s set of ontologies through a SPARQL interface (Salvadores et al., 2012, 2013). Alternatively, systems such as OntoQuery (Tudose et al., 2013) provide access to ontologies through automated reasoning but limit the number of ontologies.

The Aber-OWL (Hoehndorf et al., 2015) system is a novel ontology repository that aims to allow access to multiple ontologies through automated reasoning utilizing the OWL semantics of the ontologies. Aber-OWL mitigates the complexity challenge by using a reasoner which supports only a subset of OWL (i.e., the OWL EL profile (Motik et al., 2009)), ignoring ontology axioms and queries that do not fall within this subset. This enables the provision of polynomial-time reasoning, which is sufficiently fast for many practical uses even when applied to large ontologies. However, thus far, the Aber-OWL software has only been applied to a few, manually selected, ontologies, and therefore does not have a similar coverage to other ontology repositories, nor does it cater for reasoning over large sets of ontologies such as the ones provided by the BioPortal ontology dataset (BioPortal contains, as of 9 March 2015, 428 ontologies consisting of 6,668,991 classes).

Here, we apply the Aber-OWL framework to reason over the majority of the available ontologies in BioPortal. We evaluate the performance of querying ontologies with Aber-OWL: utilizing 337 ontologies from BioPortal, we evaluate Aber-OWL’s ability to perform different types of queries as well as its scalability in performing queries that are executed in parallel. We demonstrate that the Aber-OWL framework makes it possible to provide, at least, light-weight description logic reasoning over most of the freely accessible ontologies contained in BioPortal, with a relatively low memory footprint and high scalability with respect to the number of queries executed in parallel, using only a single medium-sized server as hardware to provide these services. Furthermore, we identify several ontologies for which querying using automated reasoning performs significantly worse than for the majority of the other ontologies tested, and discuss potential explanations and solutions.

===2 METHODS===
====2.1 Selection of ontologies====
We selected all ontologies contained in BioPortal as candidate ontologies, and attempted to download the current versions of all the ontologies for which a download link was provided by BioPortal. A summary of the results is presented in Table 1.

{| class="wikitable"
|+ Table 1. Summary of ontologies used in our test. The loadable ontologies are the ones obtained from BioPortal which could be parsed using the OWL API and which were found to be consistent when classified with the ELK reasoner. We exclude 31 ontologies that do not contain any labels from our analysis.
! Category !! Number of ontologies
|-
| Total || 427
|-
| Loadable || 368
|-
| Used || 337
|-
| Unobtainable || 39
|-
| Non-parseable || 17
|-
| Inconsistent || 3
|-
| No Labels || 31
|}

Out of 427 total ontologies listed by BioPortal, only 368 could be directly downloaded and processed by Aber-OWL. Reasons for failure to load ontologies include the absence of a download link for listed ontologies, proprietary access to ontologies, or ontologies that are only available in proprietary data formats (e.g., some of the ontologies and vocabularies provided as part of the Unified Medical Language System (Bodenreider, 2004)). 39 ontologies were not obtainable. Furthermore, 17 ontologies that could be downloaded were not parseable with the OWL API, indicating a problem in the file format used to distribute the ontology. Three ontologies were inconsistent at the reasoning stage. Several ontologies also referred to unobtainable ontologies as imports; however, we included these ontologies in our analysis, utilizing only the classes and axioms that were accessible. As Aber-OWL currently relies on the use of labels to construct queries, we further removed 31 ontologies that did not include any labels from our test set.

Overall, we use a set of 337 ontologies in our experiments, consisting of 3,466,912 classes and 6,997,872 logical axioms (of which 12,721 are axioms involving relations, i.e., RBox axioms). In comparison, BioPortal currently (9 March 2015) includes a total of 6,668,991 classes.

====2.2 Use of the Aber-OWL reasoning infrastructure====
Aber-OWL (Hoehndorf et al., 2015) is an ontology repository and query service built on the OWLAPI (Horridge et al., 2007) library, which allows access to a number of ontologies through automated reasoning. In particular, Aber-OWL allows users or software applications to query the loaded ontologies using Manchester OWL Syntax (Horridge et al., 2006), using the class and property labels as short-form identifiers for classes. Aber-OWL exposes this functionality on the Internet through a JSON API as well as a web interface available on http://aber-owl.net. To answer queries, Aber-OWL utilizes the ELK reasoner (Kazakov et al., 2014, 2011), a highly optimized reasoner that supports the OWL-EL profile. Ontologies which are not in OWL-EL are automatically reduced to it by the reasoner, which ignores all non-EL axioms, though as of 2013, 50.7% of ontologies in BioPortal natively used the OWL-EL profile (Matentzoglu et al., 2013).

We extended the Aber-OWL framework to obtain a list of ontologies from the BioPortal repository, periodically checking for new ontologies as well as for new versions of existing ontologies. As a result, our testing version of Aber-OWL maintains a mirror of the accessible ontologies available in BioPortal. Furthermore, similarly to the functionality provided by BioPortal, a record of older versions of ontologies is kept within Aber-OWL, so that, in the future, the semantic difference between ontology versions could be computed.

In addition, we expanded the Aber-OWL software to count and provide statistics about:
* The ontologies which failed to load, with associated error messages;
* Axioms, axiom types, and number of classes per ontology; and
* Axioms, axiom types, and number of classes over all ontologies contained within Aber-OWL.

For each query to Aber-OWL, we also provide the query execution time within Aber-OWL and pass this information back to the client along with the result-set of the query. All information is available through Aber-OWL’s JSON API, and the source code is freely available at https://github.com/bio-ontology-research-group/AberOWL.

====2.3 Experimental setup====
In order to evaluate the performance of querying single and multiple ontologies in Aber-OWL, random queries of different complexity were generated and executed. Since the ELK reasoner utilises a cache for answering queries that have already been computed, each generated query consisted of a new class expression. The following types of class expressions were used in the generated queries (for randomly generated A, B, and R):
* Primitive class: A
* Conjunctive query: A and B
* Existential query: R some A
* Conjunctive existential query: A and R some B

300 random queries of each of these types were generated for each ontology that was tested (1,200 queries in total per ontology). Each set of 300 random queries was subsequently split into three sets, each of which contained 100 class expressions. The random class expressions contained in the resulting sets were then utilised to perform superclass (100 queries), equivalent (100 queries) and subclass (100 queries) queries, and the response time of the Aber-OWL framework was recorded for each query.

We further test the scalability of answering the queries by performing these queries in parallel. For this purpose, we remotely query Aber-OWL with one query at a time, 100 queries in parallel, and 1,000 queries in parallel. In our test, we record the response time of each query, based on the statistics provided by the Aber-OWL server; in particular, response time does not include network latency. All tests are performed on a server with 128 GB memory and two Intel Xeon E5-2680v2 10-core 2.8GHz CPUs with hyper-threading activated (resulting in 40 virtual cores). The ELK reasoner underlying Aber-OWL is permitted to use all available (i.e., all 40) cores to perform classification and respond to queries.

===3 RESULTS AND DISCUSSION===
On average, when performing a single query over Aber-OWL, query results are returned in 10.8 milliseconds (standard deviation: 48.0 milliseconds).
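For reference, the per-ontology query workload behind these measurements (four types of class expression, 300 random expressions per type, each set split into 100 superclass, 100 equivalent and 100 subclass queries) could be generated along the following lines. This is a minimal sketch and not the authors' benchmarking code: the function names, the fixed seed, the single-quoting of labels that contain spaces, and the toy labels in the usage note are all assumptions introduced here for illustration.

```python
import random

# The four class-expression types and the three query relations named in the text.
QUERY_TYPES = ("primitive", "conjunctive", "existential", "conjunctive_existential")
RELATIONS = ("superclass", "equivalent", "subclass")


def quote(label):
    """Wrap a label in single quotes if it contains a space (assumed convention)."""
    return f"'{label}'" if " " in label else label


def random_expression(kind, classes, properties, rng):
    """Build one Manchester OWL Syntax class expression of the given kind."""
    a = quote(rng.choice(classes))
    b = quote(rng.choice(classes))
    r = quote(rng.choice(properties))
    templates = {
        "primitive": a,
        "conjunctive": f"{a} and {b}",
        "existential": f"{r} some {a}",
        "conjunctive_existential": f"{a} and {r} some {b}",
    }
    return templates[kind]


def build_workload(classes, properties, per_type=300, seed=0):
    """Return (kind, relation, expression) tuples for one ontology.

    For each expression type, per_type random expressions are generated and
    split into equal consecutive chunks, one chunk per query relation.
    """
    rng = random.Random(seed)
    chunk = per_type // len(RELATIONS)
    workload = []
    for kind in QUERY_TYPES:
        expressions = [
            random_expression(kind, classes, properties, rng)
            for _ in range(per_type)
        ]
        for i, relation in enumerate(RELATIONS):
            for expression in expressions[i * chunk:(i + 1) * chunk]:
                workload.append((kind, relation, expression))
    return workload
```

Calling `build_workload(["heart", "lung", "cell"], ["part of"])` with these hypothetical labels yields 1,200 tuples, 100 for each combination of expression type and relation. The sketch does not check for repeated expressions; a workload meant to sidestep the reasoner's query cache, as described in Section 2.3, would additionally need to reject duplicates.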
The time required to answer a query using Aber-OWL correlates linearly with the number of logical axioms in the ontologies (Pearson correlation, ρ = 0.33), and also strongly correlates with the number of queries performed in parallel (Pearson correlation, ρ = 0.82). Figure 1 shows the query times for the ontologies based on the type of query, and Figure 2 shows the query times based on different numbers of queries run in parallel. The maximum observed memory consumption for the Aber-OWL server while performing these tests was 66.1 GB.

We observe several ontologies for which query times are significantly higher than for the other ontologies. The most prominent outlier is the NCI Thesaurus (Sioutos et al., 2007), for which the average query time is 600 ms when performing a single query over Aber-OWL. Previous analysis of the NCI Thesaurus has identified axioms which heavily impact the performance of classification for the ontology using multiple description logic reasoners (Gonçalves et al., 2011). The same analysis has also shown that adding inferred axioms to the ontology can significantly improve reasoning time. To test whether this would also allow us to improve reasoning time over the NCI Thesaurus in Aber-OWL using the ELK reasoner, we apply the Elvira modularization software (Hoehndorf et al., 2011), using the HermiT reasoner to classify the NCI Thesaurus and adding all inferred axioms that fall into the OWL-EL profile to the ontology, as opposed to ELK’s approach of ignoring non-EL axioms during classification. We then repeat our experiments. Figure 3 shows the different reasoning times for the NCI Thesaurus before and after processing with Elvira.
Query time reduces from 703 ms (standard deviation: 689 ms) before processing with Elvira to 51 ms (standard deviation: 42 ms) after processing with Elvira, demonstrating that adding inferred axioms and removing axioms that do not fall in the OWL-EL profile can be used to improve query time.

Another outlier with regard to average query time is the Natural Products Ontology (NATPRO, http://bioportal.bioontology.org/ontologies/NATPRO). However, as NATPRO is expressed in OWL-Full, it cannot reliably be classified with a Description Logic reasoner, and therefore we cannot apply the same approach to improve the performance of responding to queries.

Fig. 1: Query times as function of the number of logical axioms in the ontologies, separated by the type of query. Panels: (a) primitive classes; (b) conjunctive queries; (c) existential queries; (d) conjunctive existential queries.

Fig. 2: Query times as function of the number of logical axioms in the ontologies, separated by the number of queries executed in parallel. Panels: (a) sequential querying; (b) 100 parallel queries; (c) 1,000 parallel queries.

Fig. 3: Query times over the NCI Thesaurus.

====3.1 Future Work====
The performance of using automated reasoning for querying ontologies relies heavily on the type of reasoner used. We have used the ELK (Kazakov et al., 2014, 2011) reasoner in our evaluation; however, it is possible to substitute ELK with any other OWLAPI-compatible reasoner. In particular, novel reasoners such as Konclude (Steigmiller et al., 2014), which outperforms ELK in many tasks (Bail et al., 2014), may provide further improvements in performance and scalability.

We identified several ontologies as leading to performance problems, i.e., they are outliers during query time testing. For these ontologies, including the Natural Products Ontology (NATPRO) and, to a lesser degree, the Drug Ontology (DRON) (Hanna et al., 2013), similar ‘culprit-finding’ analysis methods may be applied as have previously been applied for the NCI Thesaurus (Gonçalves et al., 2011). These methods may also allow the ontology maintainers to identify possible modifications to their ontologies that would result in better reasoner performance.

===4 CONCLUSION===
We have demonstrated that it is feasible to reason over most of the ontologies available in BioPortal in real time, and that queries over these ontologies can be answered quickly using only standard server hardware. We further tested the performance of answering queries in parallel, and show that, for the majority of cases, even highly parallel access allows quick response times.

We have also identified a number of ontologies for which performance of automated reasoning, at least when using Aber-OWL and the ELK reasoner, is significantly worse, which renders them particularly problematic for applications that carry heavy parallel loads. At least for some of these ontologies, pre-processing ontologies using tools such as Elvira (Hoehndorf et al., 2011) can mitigate these problems.

The ability to reason over a very large number of ontologies, such as all the ontologies in BioPortal, opens up the possibility to frequently use reasoning not only locally when making changes to a single ontology, but also to monitor, in real time, the consequences that a change may have on other ontologies, in particular on ontologies that import the ontology that is being changed. Using automated reasoning over all ontologies within a domain therefore has the potential to increase interoperability between ontologies and associated data by verifying mutual consistency and enabling queries across multiple ontologies, and our results show that such a system can now be implemented with the available software tools and commonly used server hardware.

===REFERENCES===
* Bail, S., Glimm, B., Jiménez-Ruiz, E., Matentzoglu, N., Parsia, B., and Steigmiller, A., editors (2014). ORE 2014: OWL Reasoner Evaluation Workshop. Number 1207 in CEUR Workshop Proceedings. CEUR-WS.org, Aachen, Germany.
* Belleau, F., Nolin, M., Tourigny, N., Rigault, P., and Morissette, J. (2008). Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5), 706–716.
* Bodenreider, O. (2004). The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res, 32(Database issue), D267–D270.
* Cote, R., Jones, P., Apweiler, R., and Hermjakob, H. (2006). The ontology lookup service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics, 7(1), 97+.
* Del Vescovo, C., Gessler, D. D., Klinov, P., Parsia, B., Sattler, U., Schneider, T., and Winget, A. (2011). Decomposition and modular structure of BioPortal ontologies. In The Semantic Web–ISWC 2011, pages 130–145. Springer.
* Gonçalves, R. S., Parsia, B., and Sattler, U. (2011). Analysing multiple versions of an ontology: A study of the NCI Thesaurus. In 24th International Workshop on Description Logics, page 147. Citeseer.
* Grau, B., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., and Sattler, U. (2008). OWL 2: The next step for OWL. Web Semantics: Science, Services and Agents on the World Wide Web, 6(4), 309–322.
* Hanna, J., Joseph, E., Brochhausen, M., and Hogan, W. (2013). Building a drug ontology based on RxNorm and other sources. Journal of Biomedical Semantics, 4(1), 44.
* Hoehndorf, R., Dumontier, M., Oellrich, A., Wimalaratne, S., Rebholz-Schuhmann, D., Schofield, P., and Gkoutos, G. V. (2011). A common layer of interoperability for biomedical ontologies based on OWL EL. Bioinformatics, 27(7), 1001–1008.
* Hoehndorf, R., Slater, L., Schofield, P. N., and Gkoutos, G. V. (2015). Aber-OWL: a framework for ontology-based data access in biology. BMC Bioinformatics.
* Horridge, M., Drummond, N., Goodwin, J., Rector, A., Stevens, R., and Wang, H. (2006). The Manchester OWL Syntax. Proc. of the 2006 OWL Experiences and Directions Workshop (OWL-ED2006).
* Horridge, M., Bechhofer, S., and Noppens, O. (2007). Igniting the OWL 1.1 touch paper: The OWL API. In Proceedings of OWLED 2007: Third International Workshop on OWL Experiences and Directions.
* Horrocks, I., Sattler, U., and Tobies, S. (2000). Practical reasoning for very expressive description logics. Logic Journal of the IGPL, 8(3), 239–264.
* Jupp, S., Malone, J., Bolleman, J., Brandizi, M., Davies, M., Garcia, L., Gaulton, A., Gehant, S., Laibe, C., Redaschi, N., Wimalaratne, S. M., Martin, M., Le Novère, N., Parkinson, H., Birney, E., and Jenkinson, A. M. (2014). The EBI RDF platform: linked open data for the life sciences. Bioinformatics, 30(9), 1338–1339.
* Kazakov, Y., Krötzsch, M., and Simančík, F. (2011). Unchain my EL reasoner. In Proceedings of the 23rd International Workshop on Description Logics (DL’10), CEUR Workshop Proceedings. CEUR-WS.org.
* Kazakov, Y., Krötzsch, M., and Simančík, F. (2014). The incredible ELK. Journal of Automated Reasoning, 53(1), 1–61.
* Manola, F. and Miller, E., editors (2004). RDF Primer. W3C Recommendation. World Wide Web Consortium.
* Matentzoglu, N., Bail, S., and Parsia, B. (2013). A corpus of OWL DL ontologies. Description Logics, 1014, 829–841.
* Motik, B., Grau, B. C., Horrocks, I., Wu, Z., Fokoue, A., and Lutz, C. (2009). OWL 2 Web Ontology Language: Profiles. Recommendation, World Wide Web Consortium (W3C).
* Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D. L., Storey, M.-A. A., Chute, C. G., and Musen, M. A. (2009). BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37(Web Server issue), W170–173.
* Patel-Schneider, P. F., Hayes, P., and Horrocks, I. (2004). OWL Web Ontology Language semantics and abstract syntax, section 5: RDF-compatible model-theoretic semantics. Technical report, W3C.
* Salvadores, M., Horridge, M., Alexander, P. R., Fergerson, R. W., Musen, M. A., and Noy, N. F. (2012). Using SPARQL to query BioPortal ontologies and metadata. In The Semantic Web–ISWC 2012, pages 180–195. Springer.
* Salvadores, M., Alexander, P. R., Musen, M. A., and Noy, N. F. (2013). BioPortal as a dataset of linked biomedical ontologies and terminologies in RDF. Semantic Web, 4(3), 277–284.
* Sazonau, V., Sattler, U., and Brown, G. (2013). Predicting performance of OWL reasoners: Locally or globally? Technical report, School of Computer Science, University of Manchester.
* Seaborne, A. and Prud’hommeaux, E. (2008). SPARQL query language for RDF. W3C recommendation, W3C. http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
* Sioutos, N., de Coronado, S., Haber, M. W., Hartel, F. W., Shaiu, W.-L., and Wright, L. W. (2007). NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. Journal of Biomedical Informatics, 40(1), 30–43.
* Steigmiller, A., Liebig, T., and Glimm, B. (2014). Konclude: System description. Web Semantics: Science, Services and Agents on the World Wide Web, 27(1).
* The Uniprot Consortium (2007). The universal protein resource (UniProt). Nucleic Acids Res, 35(Database issue).
* Tobies, S. (2000). The complexity of reasoning with cardinality restrictions and nominals in expressive description logics. J. Artif. Int. Res., 12(1), 199–217.
* Tudose, I., Hastings, J., Muthukrishnan, V., Owen, G., Turner, S., Dekker, A., Kale, N., Ennis, M., and Steinbeck, C. (2013). OntoQuery: easy-to-use web-based OWL querying. Bioinformatics, 29(22), 2955–2957.
* Williams, A. J., Harland, L., Groth, P., Pettifer, S., Chichester, C., Willighagen, E. L., Evelo, C. T., Blomberg, N., Ecker, G., Goble, C., and Mons, B. (2012). Open PHACTS: semantic interoperability for drug discovery. Drug Discovery Today, 17(21–22), 1188–1198.
* Xiang, Z., Mungall, C. J., Ruttenberg, A., and He, Y. (2011). Ontobee: A linked data server and browser for ontology terms. In Proceedings of International Conference on Biomedical Ontology, pages 279–281.