=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Automation of Ontology Analysis Reporting
|pdfUrl=https://ceur-ws.org/Vol-1214/93.pdf
|volume=Vol-1214
|dblpUrl=https://dblp.org/rec/conf/itat/ZamazalS14
}}
==Towards Automation of Ontology Analysis Reporting==
V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 93–96, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 O. Zamazal, V. Svátek

Ondřej Zamazal, Vojtěch Svátek

Department of Knowledge and Information Engineering, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic

ondrej.zamazal@vse.cz, svatek@vse.cz

===Abstract===
Different kinds of ontologies are currently accessible either from different ontology catalogues or from various ontology search engines. The heterogeneity of ontologies and ontology resources hinders ontology users in their work, such as selecting an adequate ontology resource in which they could search for a proper ontology to be used, reused or adapted with regard to their use case. Although many works have provided ontology analyses from diverse aspects, none of them enables straightforward access at any time. In this paper we present preliminary results of our ontology analysis and our plan to provide an automatic and generally available ontology analysis reporting service providing time snapshots of available ontologies. Further, we present some available ontology catalogues, repositories and search engines and discuss how they could be included. In our work we focus on ontologies expressed in OWL and we address logical, structural, naming and annotation aspects of ontologies.

===1 Introduction===
Ontologies, as formal conceptual models typically describing a certain domain of discourse, have been an inherent part of the Semantic Web vision since its inception in 2001 [3]. As the Semantic Web was evolving, typical semantic web ontologies were changing: from large ontologies carefully designed by AI experts and highly reused within the community (nineties), e.g. GALEN<sup>1</sup>, via smaller web domain ontologies designed by many individuals and rarely used or reused (since 2000), to small simple domain ontologies driven mostly by data modelling requests, e.g. the FOAF vocabulary<sup>2</sup>. Many of these different kinds of ontologies are currently accessible on the Semantic Web either via ontology catalogues or via ontology search engines. We jointly call them ontology resources.

Since there are many different ontology resources varying in the typical ontologies they offer, ontology users face the difficult task of selecting an adequate ontology resource in which they will search for a proper ontology to be used, reused or adapted with regard to their use case. To support ontology users, many works have provided ontology analyses from various aspects, see Section 2. In a nutshell, these works differ in the aspects from which they analyse ontologies, but they all share a static nature, i.e. the analysis was done once at a certain point in time. Thus, ontology analysis research lacks automation and continuous access. While the interpretation of statistics gathered during ontology analysis and the subsequent lessons learned can hardly be an automatic process, automatically providing summary overviews, e.g. in tables, with regard to diverse ontology aspects is obviously doable.

Our long-term goal is to provide an automatic ontology analysis reporting service in order to facilitate regular and up-to-date snapshots of ontology repositories. We include ontologies expressed in OWL<sup>3</sup>, addressing four aspects:

* logical, including types of entities, axioms, constructs and expressiveness,
* structural, corresponding to the ontology graph based on asserted (rather than inferred) axioms and especially the taxonomy structure,
* naming, dealing with entity naming. Naming in an ontology is mostly related to local parts of entity URIs and labels (values of the rdfs:label property). Although ontologies are logical theories, entity naming is considered an important part of an ontology [8],
* annotation, which can contain important additional information written in textual or structural form.

The rest of the paper is structured as follows. Section 2 gives an overview of work related to ontology analysis. Section 3 presents some ontology repositories and ontology search engines considered so far. Next, we present the envisioned ontology analysis process in Section 4. Section 5 provides preliminary results for some characteristics of ontology repositories and concisely overviews further interesting ontology analysis characteristics. Finally, Section 6 wraps up the paper and foresees some future work.

===2 Related Work===
Ontology analysis has already been performed many times, from different perspectives and in different ranges. Ding and Finin [5] evaluated 1.7M RDF documents<sup>4</sup> in order to better understand the status quo of the semantic web to date. They employed a number of metrics and usage patterns, such as aggregation over URL domains and individual websites, or the number of triples used to define a term. Wang et al. [15] analysed a collection of ontologies (688 OWL ontologies and 587 RDF schemas) from the logical and structural viewpoint: the shape of the class hierarchy (lists, trees or multitrees), the proportion of certain OWL language constructs or logical expressiveness. The statistics counted for diverse metrics allowed them to characterize the semantic web from the semantic documents/terms perspective. Matentzoglu et al. [7] gathered a crawl-based OWL corpus (about 4500 ontologies) and compared it with 4 ontology repositories or samples from ontology search engines (the BioPortal<sup>5</sup>, the Oxford<sup>6</sup>, the Swoogle<sup>7</sup> and the TONES<sup>8</sup>) regarding basic characteristics such as the number of different kinds of entities, the number of various axiom types, the distribution of OWL profiles etc. They concluded that the crawl-based OWL corpus is close to curated repositories in terms of ontology size and expressivity. Their process of gathering ontologies includes a careful filtering procedure to ensure collecting real single OWL ontologies rather than arbitrary OWL files.

Vocabularies as lightweight ontologies have been inspected in many surveys. Suominen and Hyvönen [11] validated SKOS vocabularies<sup>9</sup> against (SKOS-specific) quality measures, and a tool (Skosify) was provided to correct some reported errors. The landscape of SKOS vocabularies was also inspected by Manaf et al. [6], where the focus was on high-level structural properties such as the number of hierarchy levels or incoming and outgoing links to other entities. A large number of vocabularies (almost three thousand) has been analyzed by Cheng [4], specifically focusing on the mutual relatedness of web vocabularies from the semantic, lexical, expressiveness and distribution perspective. The high number of vocabularies involved was however achieved by gathering them in a bottom-up manner, via extracting new vocabularies from RDF documents. Entities from diverse RDF documents (almost 16 million, from the Falcons search engine<sup>10</sup>) were grouped based on their common namespace. In order to measure the vocabularies' relatedness, their instances were also taken into account.

Besides ontologies and vocabularies, linked datasets are also being inspected. A tool for extensive analysis of linked data sets, LODStats, by Auer et al. [1] gathers comprehensive statistics about RDF datasets. Statistics are available either from the LODStats web page<sup>11</sup> or via a SPARQL endpoint<sup>12</sup>.

Other projects directly connected their ontology analysis with practical applications. While RDFS schemas have been analyzed by Theoharis et al. [13] in order to create a benchmark for semantic web tools, e.g. query language interpreters, Rosoiu et al. [9] concentrated on an analysis of OWL ontologies in order to generate a suitable benchmark for ontology alignment. Next, Tempich and Volz [12] analyzed ontologies from the DAML ontology library in order to tune parameters for the generation of synthetic ontologies suitable for performance evaluation of semantic web reasoners. They used a clustering approach for discovering structurally similar ontologies. Each ontology type is then represented as a synthetic ontology.

Last but not least, our work is strongly related to Ontology Evaluation, which focuses on assessing the quality of a single ontology. According to Vrandecic [14], an ontology can be evaluated from several aspects. The vocabulary aspect deals with evaluating the names used in the ontology. The syntax aspect includes quality issues related to ontology serialization in its surface syntax, where many trivial best practices should be fulfilled (e.g., terminological axioms should precede facts), along with syntax validation. The structure aspect deals with the surface structure of axioms and their constituent constructs. For the last aspect the number of proposed and implemented metrics is highest, since it can be effectively measured through common graph metrics and returns easily understandable numbers. The semantics aspect evaluates an ontology considering the inferential semantics of OWL. Thus, semantic metrics measure the models (and their entailments) that are described by the structure.

Our ongoing work on ontology analysis reporting is distinguished from other ontology analysis works by, among other things: 1) inclusion of the naming and annotation aspects of ontologies; 2) continuous provision of fresh results (given the dynamic state of the subject analysed) on a large scale; and 3) user access to all data and time snapshots of OWL ontologies via a web interface.

===3 Ontology Resources===
The six prominent ontology resources on which we base our research are as follows.

The BioPortal [16] is a web portal providing access to a library of well-curated biomedical ontologies via RESTful services. Currently, there are 417 ontologies in different formats. The BioPortal contains ontologies from another ontology repository, the OBO foundry<sup>13</sup>. The primary format of the OBO foundry is the OBO format, but the ontologies are also available in OWL. Ontologies in BioPortal vary a lot in terms of the number of entities (from a couple of entities to tens of thousands of entities) and complexity. We can find there ontologies of complexity lower than OWL-Lite as well as ontologies with the complexity of OWL 2.

LOV<sup>14</sup> is a well-curated collection of linked open vocabularies used in the Linked Data Cloud. To date there are 409 ontologies covering diverse domains, e.g., publications, science, business or city. Most ontologies are structurally simple, i.e. they often have complexity lower than OWL-Lite, and they are usually small; yet, they are used within diverse linked open data applications. Besides a list of available ontologies, there is also a SPARQL endpoint for accessing the ontologies' metadata.

The Protégé<sup>15</sup> ontology library mostly contains ontologies developed within the Protégé editor. As there is neither programmatic access to the library nor a concise list of available ontologies, links to OWL files must be extracted using a tailored wrapper. Currently, it has 98 ontologies, which also include well-known test ontologies, e.g. the Pizza ontology. This repository has rather small ontologies (up to hundreds of entities) and their complexity mostly corresponds to OWL DL.

The TONES repository contains ontologies of various domains, many of them however designed for testing purposes. Similarly to the Protégé library, it has no direct programmatic access and no list of available ontologies except the HTML page<sup>16</sup>. Currently, it has 174 ontologies including OBO ontologies (already present in BioPortal). Some of the ontologies are large (over 1000 entities) and most have the complexity of OWL-Lite or OWL-DL.

Besides ontology repositories there are also search engines. The Swoogle semantic web search engine extracts metadata for documents of specific filetypes (rdf or owl) and computes the relations among them. Nowadays, Swoogle indexes almost 4 million semantic documents and allows searching for ontologies and their instances over this index. This engine does not provide a public API.

Watson [2] is a semantic web search engine<sup>17</sup> for ontologies and semantic documents. There are about 20,000 cached ontologies. Watson provides keyword search and a number of methods for manipulating ontologies, e.g., basic metrics such as the number of classes.

In all, BioPortal, LOV and Watson provide programmatic access to ontologies; processing Swoogle output is restricted by the service [7]; ontologies in Protégé and TONES can be accessed using a tailored wrapper.

===4 Ontology Analysis Reporting===
We plan to make the ontology analysis reporting service (see Figure 1) available via a web interface where, on the one side, web users could ask for the latest summaries (automatic reports) of particular ontology repositories ("(a) retrieve summary") and, on the other side, they could ask for particular ontologies or ontologies meeting certain criteria ("(b) retrieve ontologies").

Figure 1: Ontology Analysis Reporting Architecture

In order to provide such services independently of the availability of ontology resources or ontologies, we materialize all ontologies into a database ("(1) materialization") as a central point of the software architecture design. The database is populated with ontologies from ontology resources using their programmatic access or a tailored wrapper. Imported ontologies are also stored in the database [7]. The materialization of ontologies in the database processes and decomposes ontologies into their parts: entities, names (local fragments of entity URIs), relations (axioms), imported ontologies etc. Next, various ontology metrics are computed and their results are stored in the database as well ("(2) metrics computation"). Since ontology resources can include the same ontologies, a deduplication process follows ("(3) deduplication"). The deduplication process could be based on entity-to-entity comparison. However, this would be computationally very demanding. Therefore, we first search for duplicate candidates based on computed metrics such as the number of classes, object properties etc.<sup>18</sup> and then we can apply a detailed (e.g. entity-to-entity) comparison to the duplicate candidates. This approach tends to be highly precise, since we do not exclude false duplicates. We think that ontology versions, being reflected as slight variants according to computed metrics, should be kept and analyzed as different ontologies. Summarizing the results of ontology metrics ("(4) summarization") uses the R language for statistical computing<sup>19</sup>.

We implement our ontology analysis workflow (so far partly the back-end components from Figure 1) via Java programs<sup>20</sup>. We manipulate the ontologies via OWL-API<sup>21</sup> and decompose them into a MySQL database.

===5 Ontology Analysis Characteristics===
In our work we consider ontology metrics related to four aspects of ontologies inspired by the related work in Section 2. Logical metrics represent basic characteristics in terms of the number of classes, complex classes (defined by anonymous expressions), properties, instances, axioms and annotations. We provide these characteristics<sup>22</sup> in Table 1, where average, median and maximum values are given.

{| class="wikitable"
|+ Table 1: Characteristics of ontology resources
! Metrics !! Statistic !! BioPortal !! LOV !! Protégé !! TONES !! Watson
|-
| rowspan="3" | Classes || Avg || 270 || 30 || 126 || 153 || 79
|-
| Med || 223 || 14 || 34 || 72 || 25
|-
| Max || 994 || 509 || 717 || 948 || 986
|-
| rowspan="3" | Complex classes || Avg || 191 || 22 || 163 || 126 || 76
|-
| Med || 50 || 4 || 27 || 28 || 7
|-
| Max || 5194 || 659 || 603 || 2752 || 1042
|-
| rowspan="3" | Object properties || Avg || 34 || 23 || 46 || 21 || 14
|-
| Med || 8 || 12 || 14 || 5 || 8
|-
| Max || 1337 || 288 || 313 || 414 || 291
|-
| rowspan="3" | Data properties || Avg || 10 || 11 || 12 || 17 || 9
|-
| Med || 0 || 3 || 7 || 0 || 2
|-
| Max || 488 || 217 || 74 || 708 || 159
|-
| rowspan="3" | Instances || Avg || 88 || 17 || 355 || 100 || 54
|-
| Med || 0 || 2 || 18 || 0 || 2
|-
| Max || 5819 || 702 || 2872 || 3542 || 2961
|-
| rowspan="3" | Axioms || Avg || 2634 || 543 || 3103 || 1588 || 941
|-
| Med || 1635 || 216 || 442 || 318 || 231
|-
| Max || 38056 || 23839 || 21087 || 44764 || 19361
|-
| rowspan="3" | Annotations || Avg || 1213 || 228 || 330 || 204 || 446
|-
| Med || 617 || 92 || 8 || 3 || 42
|-
| Max || 16058 || 2900 || 2937 || 4260 || 13764
|}

Due to various issues (e.g. ontologies not parseable by OWL-API or inaccessible imports), we could not automatically retrieve all ontologies from the ontology resources. Thus, we analyzed 171 ontologies from BioPortal, 302 from LOV, 20 from Protégé and 122 ontologies from TONES. For this first run of ontology analysis we restricted the process to ontologies with more than 0 and less than 1001 classes. From Table 1 we can see that on average BioPortal has large ontologies in terms of the number of classes, axioms and annotations. On the other side, LOV has, on average, very small ontologies in terms of all listed metrics in the table except annotations and axioms. While Protégé, TONES and Watson are on average comparable in terms of the number of classes and properties, Protégé typically has ontologies with many more instances than any other ontology resource.

In our ongoing work, we will also consider ontologies from the structural viewpoint, e.g., the number of top classes and leaf classes, or the maximum number of superclasses/subclasses. Next, we plan to provide ontology analysis for the naming aspect, i.e. the use of concatenation symbols, capitalization and complex analysis aimed at naming patterns [10]. Finally, we plan to inspect ontologies with regard to annotations, i.e. which types of annotations dominate in each ontology resource.

===6 Conclusions and Future Work===
Our ongoing work aims at an ontology analysis reporting service. We described the ontology repositories to be involved, provided a sketch of the ontology analysis reporting architecture, and presented the preliminary results of logical characteristics for five ontology resources, along with mentioning ontology metrics to be further considered.

In future we will implement all the ontology metrics mentioned and apply them to, at least, the six mentioned ontology resources. We also plan to provide a reporting service as a web interface enabling access to all ontologies and their respective characteristics as well as characteristics of each resource and across all ontologies in future. We also plan to proceed from elementary features to semi-automatic discovery of (intentional and implicit) patterns.

===Acknowledgement===
Ondřej Zamazal has been supported by the CSF grant no. 14-14076P, "COSOL – Categorization of Ontologies in Support of Ontology Life Cycle".

===Footnotes===
# http://www.opengalen.org/
# http://xmlns.com/foaf/spec/
# http://www.w3.org/TR/owl2-primer/
# http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
# http://bioportal.bioontology.org/
# http://www.cs.ox.ac.uk/isg/ontologies/
# http://swoogle.umbc.edu/
# http://owl.cs.manchester.ac.uk/repository/
# http://www.w3.org/2009/08/skos-reference/skos.html
# http://iws.seu.edu.cn/services/falcons/
# http://stats.lod2.eu/stats
# http://stats.lod2.eu/sparql
# http://obofoundry.org/
# http://lov.okfn.org/dataset/lov/
# http://protegewiki.stanford.edu/wiki/Protege_Ontology_Library
# The link to download all ontologies does not work. [June 2014]
# http://watson.kmi.open.ac.uk/WatsonWUI/
# The appropriate set of metrics to be used for deduplication will be tuned based on further testing.
# http://www.r-project.org/
# We plan to make all programs freely available.
# http://owlapi.sourceforge.net/
# As of the March 2014 snapshot.

===References===
[1] Auer S., Demter J., Martin M., Lehmann J.: LODStats - An Extensible Framework for High-performance Dataset Analytics. In: Proc. of the EKAW 2012. 2012.

[2] d'Aquin M., Baldassarre C., Gridinoc L., Angeletou S., Sabou M., Motta E.: Characterizing Knowledge on the Semantic Web with Watson. In: EON Workshop at ISWC'07, Busan, Korea. 2007.

[3] Berners-Lee T., Hendler J., Lassila O.: The Semantic Web. Scientific American 284(5), 28-37. 2001.

[4] Cheng G.: Relatedness between Vocabularies on the Web of Data: A Taxonomy and an Empirical Study. Journal of Web Semantics. 2013.

[5] Ding L., Finin T.: Characterizing the Semantic Web on the Web. In: ISWC 2006.

[6] Manaf N. A. A., Bechhofer S., Stevens R.: The Current State of SKOS Vocabularies on the Web. In: ESWC 2012. 2012.

[7] Matentzoglu N., Bail S., Parsia B.: A Snapshot of the OWL Web. In: Proc. ISWC 2013, Springer, LNCS, 2013.

[8] Nirenburg S., Wilks Y.: What's in a symbol: Ontology and the surface of language. Journal of Experimental and Theoretical AI, 2001.

[9] Rosoiu M. E., Trojahn C., Euzenat J.: Ontology Matching Benchmarks: Generation and Evaluation. In: Ontology Matching Workshop 2011.

[10] Šváb-Zamazal O., Svátek V.: Analysing Ontological Structures through Name Pattern Tracking. In: EKAW-2008, Acitrezza, Italy, 2008.

[11] Suominen O., Hyvönen E.: Improving the Quality of SKOS Vocabularies with Skosify. In: EKAW 2012.

[12] Tempich C., Volz R.: Towards a Benchmark for Semantic Web Reasoners - An Analysis of the DAML Ontology Library. In: Evaluation of Ontology-based Tools (EON). 2003.

[13] Theoharis Y., Tzitzikas Y., Kotzinos D., Christophides V.: On Graph Features of Semantic Web Schemas. Knowledge and Data Engineering, IEEE Transactions on, 20(5), 692-702. 2008.

[14] Vrandecic D.: Ontology Evaluation. Ph.D. Thesis. Karlsruhe. 2010.

[15] Wang T. D., Parsia B., Hendler J.: A Survey of the Web Ontology Landscape. In: ISWC-2006.

[16] Whetzel P. L., Noy N. F., Shah N. H., Alexander P. R., Nyulas C., Tudorache T., Musen M. A.: BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Research, 39(suppl 2), W541-W545.
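
Illustrative sketch (not part of the original paper). The deduplication and summarization steps described in Section 4 can be illustrated with a small example. The code below is a minimal sketch in Python, not the authors' Java, MySQL and R implementation: the record type <code>OntologyRecord</code>, its fields and all function names are hypothetical, and the metric signature (number of classes and object properties) is only one possible choice, since the paper states that the set of metrics used for deduplication is still to be tuned. Ontologies are first bucketed by the cheap metric signature, and the costly entity-to-entity comparison runs only inside buckets holding more than one ontology. Versions whose metrics differ slightly fall into different buckets and are therefore kept as distinct ontologies.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from statistics import mean, median


@dataclass(frozen=True)
class OntologyRecord:
    """Hypothetical record for one materialized ontology (illustrative fields only)."""
    resource: str             # ontology resource, e.g. "BioPortal" or "TONES"
    name: str                 # identifier of the ontology within the resource
    classes: int              # number of classes
    object_properties: int    # number of object properties
    entity_names: frozenset   # local fragments of entity URIs


def duplicate_candidates(records):
    """Group ontologies that share the same cheap metric signature."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec.classes, rec.object_properties)].append(rec)
    # Only buckets with at least two ontologies can contain duplicates.
    return [group for group in buckets.values() if len(group) > 1]


def confirmed_duplicates(records):
    """Apply the detailed entity-to-entity comparison inside candidate groups only."""
    pairs = []
    for group in duplicate_candidates(records):
        for a, b in combinations(group, 2):
            if a.entity_names == b.entity_names:
                pairs.append((a.name, b.name))
    return pairs


def summarize(records, metric):
    """Return (average, median, maximum) of one metric per ontology resource."""
    by_resource = defaultdict(list)
    for rec in records:
        by_resource[rec.resource].append(getattr(rec, metric))
    return {res: (mean(vals), median(vals), max(vals))
            for res, vals in by_resource.items()}
```

Calling <code>summarize(records, "classes")</code> on such records yields, per resource, the same three statistics (average, median, maximum) that Table 1 reports. Exact equality of entity names is a deliberately strict stand-in for the detailed comparison, which the paper does not specify further.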