=Paper= {{Paper |id=None |storemode=property |title=Towards Automation of Ontology Analysis Reporting |pdfUrl=https://ceur-ws.org/Vol-1214/93.pdf |volume=Vol-1214 |dblpUrl=https://dblp.org/rec/conf/itat/ZamazalS14 }} ==Towards Automation of Ontology Analysis Reporting== https://ceur-ws.org/Vol-1214/93.pdf
V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 93–96
http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 O. Zamazal, V. Svátek



                       Towards Automation of Ontology Analysis Reporting

                                                   Ondřej Zamazal, Vojtěch Svátek

                                        Department of Knowledge and Information Engineering
                                                Faculty of Informatics and Statistics
                                          University of Economics, Prague, Czech Republic
                                           ondrej.zamazal@vse.cz, svatek@vse.cz

Abstract: Different kinds of ontologies are currently accessible either from ontology catalogues
or from various ontology search engines. The heterogeneity of ontologies and ontology resources
hinders ontology users in their work, such as the selection of an adequate ontology resource in
which they could search for a proper ontology to be used, reused or adapted with regard to their
use case. Although many works have provided ontology analyses from diverse aspects, none of them
enables straightforward access at any time. In this paper we present preliminary results of our
ontology analysis and our plan to provide an automatic and generally available ontology analysis
reporting service providing time snapshots of available ontologies. Further, we present some
available ontology catalogues, repositories and search engines, and discuss how they could be
included. In our work we focus on ontologies expressed in OWL and we address logical, structural,
naming and annotation aspects of ontologies.

1 Introduction

Ontologies, as formal conceptual models typically describing a certain domain of discourse, have
been an inherent part of the Semantic Web vision since its inception in 2001 [3]. As the Semantic
Web evolved, typical semantic web ontologies changed: from large ontologies carefully designed by
AI experts and highly reused within the community (nineties), e.g. GALEN
(http://www.opengalen.org/), via smaller web domain ontologies designed by many individuals and
rarely used or reused (since 2000), to small simple domain ontologies driven mostly by data
modelling requests, e.g. the FOAF vocabulary (http://xmlns.com/foaf/spec/). Many of these
different kinds of ontologies are currently accessible on the Semantic Web either via ontology
catalogues or via ontology search engines. We jointly call them ontology resources.
   Since there are many different ontology resources varying in the typical ontologies they offer,
ontology users face the difficult situation of selecting an adequate ontology resource in which
they will search for a proper ontology to be used, reused or adapted with regard to their use
case. To support ontology users, many works have provided ontology analyses from various aspects,
see Section 2. In a nutshell, these works differ in the aspects from which they analyse
ontologies, but they all share a static nature, i.e. an analysis was done once at a certain point
in time. Thus, ontology analysis research lacks automation and continuous access. While the
interpretation of statistics gathered during ontology analysis and the subsequent lessons learned
can hardly be an automatic process, automatically providing summary overviews, e.g. in tables,
with regard to diverse ontology aspects is obviously doable.
   Our long-term goal is to provide an automatic ontology analysis reporting service in order to
facilitate regular and up-to-date snapshots of ontology repositories. We include ontologies
expressed in OWL (http://www.w3.org/TR/owl2-primer/), addressing four aspects:

   • logical, including types of entities, axioms, constructs and expressiveness,
   • structural, corresponding to the ontology graph based on asserted (rather than inferred)
     axioms and especially the taxonomy structure,
   • naming, dealing with entity naming. Naming in an ontology is mostly related to local parts of
     entity URIs and labels (values of the rdfs:label property). Although ontologies are logical
     theories, entity naming is considered an important part of an ontology [8].
   • annotation, which can contain important additional information written in textual or
     structural form.

   The rest of the paper is structured as follows. Section 2 gives an overview of related work on
ontology analysis. Section 3 presents some ontology repositories and ontology search engines
considered so far. Next, we present the envisioned ontology analysis process in Section 4.
Section 5 provides preliminary results on some characteristics of ontology repositories and
concisely overviews further interesting ontology analysis characteristics. Finally, Section 6
wraps up the paper and foresees some future work.

2 Related Work

Ontology analysis has already been performed many times, from different perspectives and in
different ranges. Ding and Finin [5] evaluated 1.7M RDF documents
(http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/) in order to better understand the status
quo of the semantic web to date. They employed a number of metrics and usage
patterns, such as aggregation over URL domains and individual websites, or the number of triples
used to define a term. Wang et al. [15] analysed a collection of ontologies (688 OWL ontologies
and 587 RDF schemas) from the logical and structural viewpoint: the shape of the class hierarchy
(lists, trees or multitrees), the proportion of certain OWL language constructs, or logical
expressiveness. The statistics counted for diverse metrics allowed them to characterize the
semantic web from the semantic documents/terms perspective. Matentzoglu et al. [7] gathered a
crawl-based OWL corpus (about 4500 ontologies) and compared it with four ontology repositories or
samples from ontology search engines, namely BioPortal (http://bioportal.bioontology.org/), Oxford
(http://www.cs.ox.ac.uk/isg/ontologies/), Swoogle (http://swoogle.umbc.edu/) and TONES
(http://owl.cs.manchester.ac.uk/repository/), regarding basic characteristics such as the number
of different kinds of entities, the number of various axiom types, the distribution of OWL
profiles etc. They concluded that the crawl-based OWL corpus is close to curated repositories in
terms of ontology size and expressivity. Their process of gathering ontologies includes a careful
filtering procedure to ensure collecting real single OWL ontologies rather than arbitrary OWL
files.
   Vocabularies as lightweight ontologies have been inspected in many surveys. Suominen and
Hyvönen [11] validated SKOS vocabularies (http://www.w3.org/2009/08/skos-reference/skos.html)
against (SKOS-specific) quality measures, and a tool (Skosify) was provided to correct some
reported errors. The landscape of SKOS vocabularies was also inspected by Manaf et al. [6], where
the focus was on high-level structural properties such as the number of hierarchy levels or
in- and outgoing links to other entities. A large number of vocabularies (almost three thousand)
was analyzed by Cheng [4], specifically focusing on the mutual relatedness of web vocabularies
from the semantic, lexical, expressiveness and distribution perspective. The high number of
vocabularies involved was however achieved by gathering them in a bottom-up manner, via extracting
new vocabularies from RDF documents. Entities from diverse RDF documents (almost 16 million, from
the Falcons search engine, http://iws.seu.edu.cn/services/falcons/) were grouped based on their
common namespace. In order to measure the vocabularies' relatedness, their instances were also
taken into account.
   Besides ontologies and vocabularies, linked datasets are also being inspected. LODStats, a tool
for extensive analysis of linked data sets by Auer et al. [1], gathers comprehensive statistics
about RDF datasets. The statistics are available either from the LODStats web page
(http://stats.lod2.eu/stats) or via a SPARQL endpoint (http://stats.lod2.eu/sparql).
   Other projects directly connected their ontology analysis with practical applications. While
RDFS schemas have been analyzed by Theoharis et al. [13] in order to create a benchmark for
semantic web tools, e.g. query language interpreters, Rosoiu et al. [9] concentrated on an
analysis of OWL ontologies in order to generate a suitable benchmark for ontology alignment. Next,
Tempich and Volz [12] analyzed ontologies from the DAML ontology library in order to tune
parameters for the generation of synthetic ontologies suitable for performance evaluation of
semantic web reasoners. They used a clustering approach for discovering structurally similar
ontologies. Each ontology type is then represented as a synthetic ontology.
   Last but not least, our work is strongly related to Ontology Evaluation, which focuses on
assessing the quality of a single ontology. According to Vrandecic [14], an ontology can be
evaluated from several aspects. The vocabulary aspect deals with evaluating names used in the
ontology. The syntax aspect includes quality issues related to ontology serialization in its
surface syntax, where many trivial best practices should be fulfilled (e.g., terminological
axioms should precede facts), along with syntax validation. The structure aspect deals with the
surface structure of axioms and their constituent constructs. For the last aspect the number of
proposed and implemented metrics is highest, since it can be effectively measured through common
graph metrics and returns easily understandable numbers. The semantics aspect evaluates an
ontology considering the inferential semantics of OWL. Thus, semantic metrics measure the models
(and their entailments) that are described by the structure.
   Our ongoing work on ontology analysis reporting is distinguished from other ontology analysis
works, among others, by: 1) inclusion of the naming and annotation aspects of ontologies;
2) continuous provision of fresh results (given the dynamic state of the subject analysed) on a
large scale; and 3) user access to all data and time snapshots of OWL ontologies via a web
interface.

3 Ontology Resources

Six prominent ontology resources, on which we base our research, are as follows.
   BioPortal [16] is a web portal providing access to a library of well-curated biomedical
ontologies via RESTful services. Currently, there are 417 ontologies in different formats.
BioPortal contains ontologies from another ontology repository, the OBO foundry
(http://obofoundry.org/). The primary format of the OBO foundry is the OBO format, but the
ontologies are also available in OWL. Ontologies in BioPortal vary a lot in terms of the number of
entities (from a couple of entities to tens of thousands of entities) and in complexity. We can
find there ontologies of complexity lower than OWL-Lite as well as ontologies with the complexity
of OWL 2.
   LOV (http://lov.okfn.org/dataset/lov/) is a well-curated collection of linked open vocabularies
used in the Linked Data Cloud. To date there
are 409 ontologies covering diverse domains, e.g., publications, science, business or city. Most
ontologies are structurally simple, i.e. they often have complexity lower than OWL-Lite, and they
are usually small; yet, they are used within diverse linked open data applications. Aside from a
list of available ontologies, there is also a SPARQL endpoint for accessing the ontologies'
metadata.
   The Protégé (http://protegewiki.stanford.edu/wiki/Protege_Ontology_Library) ontology library
mostly contains ontologies developed within the Protégé editor. As there is neither programmatic
access to the library nor a concise list of available ontologies, links to OWL files must be
extracted using a tailored wrapper. Currently, it has 98 ontologies, which also include well-known
test ontologies, e.g. the Pizza ontology. This repository has rather small ontologies (up to
hundreds of entities) and their complexity mostly corresponds to OWL DL.
   The TONES repository contains ontologies of various domains, many of them however designed for
testing purposes. Similarly to the Protégé library, it has neither direct programmatic access nor
a list of available ontologies except the HTML page (the link to download all ontologies did not
work as of June 2014). Currently, it has 174 ontologies including OBO ontologies (already present
in BioPortal). Some of the ontologies are large (over 1000 entities) and most have the complexity
of OWL-Lite or OWL-DL.
   Besides ontology repositories there are also search engines. The Swoogle semantic web search
engine extracts metadata for documents of specific filetypes (rdf or owl) and computes the
relations among them. Nowadays, Swoogle indexes almost 4 million semantic documents and allows
searching for ontologies and their instances over this index. This engine does not provide a
public API.
   Watson [2] is a semantic web search engine (http://watson.kmi.open.ac.uk/WatsonWUI/) for
ontologies and semantic documents. There are about 20,000 cached ontologies. Watson provides
keyword search and a number of methods for manipulating ontologies, e.g., basic metrics such as
the number of classes.
   In all, BioPortal, LOV and Watson provide programmatic access to ontologies; processing Swoogle
output is restricted by the service [7]; ontologies in Protégé and TONES can be accessed using a
tailored wrapper.

4 Ontology Analysis Reporting

We plan to make the ontology analysis reporting service (see Figure 1) available via a web
interface, where on the one side web users could ask for the latest summaries (automatic reports)
of particular ontology repositories (“(a) retrieve summary”) and on the other side they could ask
for particular ontologies or ontologies meeting certain criteria (“(b) retrieve ontologies”).

                    Figure 1: Ontology Analysis Reporting Architecture

   In order to provide such services independently of the availability of ontology resources or
ontologies, we materialize all ontologies into a database (“(1) materialization”) as a central
point of the software architecture design. The database is populated with ontologies from
ontology resources using their programmatic access or a tailored wrapper. Imported ontologies are
also stored in the database [7]. The materialization processes and decomposes ontologies into
their parts: entities, names (local fragments of entity URIs), relations (axioms), imported
ontologies etc. Next, various ontology metrics are computed and their results are stored in the
database as well (“(2) metrics computation”). Since ontology resources can include the same
ontologies, a deduplication process follows (“(3) deduplication”). Deduplication could be based on
entity-to-entity comparison. However, this would be computationally very demanding. Therefore, we
first search for duplicate candidates based on computed metrics such as the number of classes,
object properties etc. (the appropriate set of metrics to be used for deduplication will be tuned
based on further testing), and then we can apply a detailed (e.g. entity-to-entity) comparison to
the duplicate candidates. This approach tends to be highly precise, since we do not exclude false
duplicates. We think that ontology versions, being reflected as slight variants according to the
computed metrics, should be kept and analyzed as different ontologies. Summarizing the results
(“(4) summarization”) of ontology metrics uses the R language for statistical computing
(http://www.r-project.org/).
   We implement our ontology analysis workflow (so far partly the back-end components from
Figure 1) via Java programs (we plan to make all programs freely available). We manipulate the
ontologies via OWL-API (http://owlapi.sourceforge.net/) and decompose them into a MySQL database.

5 Ontology Analysis Characteristics

In our work we consider ontology metrics related to four aspects of ontologies inspired by the
related work in Section 2. Logical metrics represent basic characteristics in terms of the number
of classes, complex classes (defined by anonymous expressions), properties, instances, axioms and
annotations. We provide these characteristics (as of the March 2014 snapshot) in Table 1, where
average, median and maximum values are
given. Due to various issues (e.g. ontologies not parseable by OWL-API or inaccessible imports),
we could not automatically retrieve all ontologies from the ontology resources. Thus, we analyzed
171 ontologies from BioPortal, 302 from LOV, 20 from Protégé and 122 ontologies from TONES. For
this first run of ontology analysis we restricted the process to ontologies with more than 0 and
less than 1001 classes. From Table 1 we can see that on average BioPortal has large ontologies in
terms of the number of classes, axioms and annotations. On the other side, LOV has, on average,
very small ontologies in terms of all listed metrics in the table except annotations and axioms.
While Protégé, TONES and Watson are on average comparable in terms of the number of classes and
properties, Protégé typically has ontologies with many more instances than any other ontology
resource.

     Metrics                   BioPortal      LOV  Protégé    TONES   Watson
     Classes             Avg         270       30      126      153       79
                         Med         223       14       34       72       25
                         Max         994      509      717      948      986
     Complex classes     Avg         191       22      163      126       76
                         Med          50        4       27       28        7
                         Max        5194      659      603     2752     1042
     Object properties   Avg          34       23       46       21       14
                         Med           8       12       14        5        8
                         Max        1337      288      313      414      291
     Data properties     Avg          10       11       12       17        9
                         Med           0        3        7        0        2
                         Max         488      217       74      708      159
     Instances           Avg          88       17      355      100       54
                         Med           0        2       18        0        2
                         Max        5819      702     2872     3542     2961
     Axioms              Avg        2634      543     3103     1588      941
                         Med        1635      216      442      318      231
                         Max       38056    23839    21087    44764    19361
     Annotations         Avg        1213      228      330      204      446
                         Med         617       92        8        3       42
                         Max       16058     2900     2937     4260    13764

           Table 1: Characteristics of ontology resources

   In our ongoing work, we will also consider ontologies from the structural viewpoint, e.g., the
number of top classes and leaf classes, or the maximum number of superclasses/subclasses. Next, we
plan to provide ontology analysis for the naming aspect, i.e. the use of concatenation symbols,
capitalization and complex analysis aimed at naming patterns [10]. Finally, we plan to inspect
ontologies with regard to annotations, i.e. which types of annotations dominate in each ontology
resource.

6 Conclusions and Future Work

Our ongoing work aims at an ontology analysis reporting service. We described the ontology
repositories to be involved, provided a sketch of the ontology analysis reporting architecture,
and presented preliminary results on logical characteristics for five ontology resources, along
with mentioning ontology metrics to be further considered.
   In future we will implement all the ontology metrics mentioned and apply them to, at least,
the six mentioned ontology resources. We also plan to provide the reporting service as a web
interface enabling access to all ontologies and their respective characteristics, as well as to
the characteristics of each resource and across all ontologies. We also plan to proceed from
elementary features to semi-automatic discovery of (intentional and implicit) patterns.

Acknowledgement Ondřej Zamazal has been supported by the CSF grant no. 14-14076P, “COSOL –
Categorization of Ontologies in Support of Ontology Life Cycle”.

References

[1] Auer S., Demter J., Martin M., Lehmann J.: LODStats - An Extensible Framework for
    High-performance Dataset Analytics. In: Proc. of the EKAW 2012. 2012.
[2] d’Aquin M., Baldassarre C., Gridinoc L., Angeletou S., Sabou M., Motta E.: Characterizing
    Knowledge on the Semantic Web with Watson. In: EON Workshop at ISWC’07, Busan, Korea. 2007.
[3] Berners-Lee T., Hendler J., Lassila O.: The Semantic Web. Scientific American 284.5 (2001):
    28-37.
[4] Cheng G.: Relatedness between Vocabularies on the Web of Data: A Taxonomy and an Empirical
    Study. Journal of Web Semantics. 2013.
[5] Ding L., Finin T.: Characterizing the Semantic Web on the Web. In: ISWC 2006.
[6] Manaf N. A. A., Bechhofer S., Stevens R.: The Current State of SKOS Vocabularies on the Web.
    In: ESWC 2012. 2012.
[7] Matentzoglu N., Bail S., Parsia B.: A Snapshot of the OWL Web. In: Proc. ISWC 2013, Springer,
    LNCS, 2013.
[8] Nirenburg S., Wilks Y.: What's in a symbol: Ontology and the surface of language. Journal of
    Experimental and Theoretical AI, 2001.
[9] Rosoiu M. E., Trojahn C., Euzenat J.: Ontology Matching Benchmarks: Generation and Evaluation.
    In: Ontology Matching Workshop 2011.
[10] Šváb-Zamazal O., Svátek V.: Analysing Ontological Structures through Name Pattern Tracking.
    In: EKAW-2008, Acitrezza, Italy, 2008.
[11] Suominen O., Hyvönen E.: Improving the Quality of SKOS Vocabularies with Skosify. In: EKAW
    2012.
[12] Tempich C., Volz R.: Towards a Benchmark for Semantic Web Reasoners - An Analysis of the DAML
    Ontology Library. In: Evaluation of Ontology-based Tools (EON). 2003.
[13] Theoharis Y., Tzitzikas Y., Kotzinos D., Christophides V.: On Graph Features of Semantic Web
    Schemas. Knowledge and Data Engineering, IEEE Transactions on, 20(5), 692-702. 2008.
[14] Vrandecic D.: Ontology Evaluation. Ph.D. Thesis. Karlsruhe. 2010.
[15] Wang T. D., Parsia B., Hendler J.: A Survey of the Web Ontology Landscape. In: ISWC-2006.
[16] Whetzel P. L., Noy N. F., Shah N. H., Alexander P. R., Nyulas C., Tudorache T., Musen M. A.:
    BioPortal: enhanced functionality via new Web services from the National Center for Biomedical
    Ontology to access and use ontologies in software applications. Nucleic Acids Research,
    39(suppl 2), W541-W545.
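
The deduplication and summarization steps of Section 4 can be illustrated with a minimal sketch. It is written in Python purely for illustration (the paper states that the actual workflow uses Java programs and R), and it rests on several assumptions that do not come from the paper: the `OntologyRecord` type and its field names are hypothetical, duplicate candidates are taken to be ontologies whose metric signatures are exactly equal, and the detailed comparison is reduced to equality of entity URI sets. The class-count filter mirrors the restriction to more than 0 and less than 1001 classes used for Table 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from statistics import mean, median


@dataclass(frozen=True)
class OntologyRecord:
    """Hypothetical view of one materialized ontology (not the authors' schema)."""
    source: str          # ontology resource, e.g. "BioPortal"
    uri: str             # location the ontology was retrieved from
    metrics: tuple       # assumed order: (classes, object properties, data properties)
    entities: frozenset  # entity URIs, used only for the detailed comparison


def candidate_groups(records):
    """Stage 1: cheap search for duplicate candidates by identical metric signature."""
    groups = defaultdict(list)
    for record in records:
        groups[record.metrics].append(record)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(records):
    """Stage 2: detailed entity-to-entity comparison, run on candidates only."""
    duplicates = []
    for group in candidate_groups(records):
        for first, second in combinations(group, 2):
            if first.entities == second.entities:
                duplicates.append((first.uri, second.uri))
    return duplicates


def summarize_classes(records, min_classes=1, max_classes=1000):
    """Average, median and maximum number of classes per ontology resource."""
    per_source = defaultdict(list)
    for record in records:
        classes = record.metrics[0]
        if min_classes <= classes <= max_classes:
            per_source[record.source].append(classes)
    return {
        source: (mean(values), median(values), max(values))
        for source, values in per_source.items()
    }
```

Because ontology versions differ slightly in their metrics, they fall into different candidate groups under this exact-match assumption and are therefore kept as separate ontologies, which matches the intent stated in Section 4. A tolerance-based candidate search would need a different grouping strategy.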