=Paper= {{Paper |id=Vol-2849/paper-10 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-2849/paper-10.pdf |volume=Vol-2849 |dblpUrl=https://dblp.org/rec/conf/swat4ls/AlgergawyK19 }} ==None== https://ceur-ws.org/Vol-2849/paper-10.pdf
                      Partitioning of BioPortal Ontologies: An
                                  Empirical Study

                                      Alsayed Algergawy and Birgitta König-Ries

                              Heinz-Nixdorf Chair for Distributed Information Systems,
                                            Institute for Computer Science,
                                 Friedrich Schiller University of Jena, Jena, Germany
                            {alsayed.algergawy, birgitta.koenig-ries}@uni-jena.de



                      Abstract. BioPortal is a leading repository of biomedical ontologies
                      developed in different formats such as OWL and OBO. Both the number of
                      ontologies and the number of concepts available via this platform are
                      increasing. This has sparked a number of studies analyzing different
                      aspects of BioPortal ontologies, such as their quality and reuse. With
                      this paper, we add a new aspect to this body of work: the current version
                      of BioPortal supports access to whole ontologies; however, users are often
                      interested in accessing only subsets of ontologies. This is particularly
                      true for big ontologies, and it requires partitioning them. In this paper,
                      we therefore investigate how suitable BioPortal ontologies are for being
                      partitioned with state-of-the-art tools.



             1     Introduction

             Ontologies provide domain knowledge in machine-readable formats. They
             are widely used in various applications, e.g., to drive data annotation, data
             integration, and information retrieval, and in particular in biological
             and biomedical research [12]. Therefore, a large number of ontologies have been
             developed, and there is a growing need to keep them in common repositories
             to make them accessible and manageable. Examples of such repositories include
             BioPortal [17,25,32]¹, OntoBee [20]², and AgroPortal [14]³.
                 Since its deployment, the National Center for Biomedical Ontology (NCBO)
             BioPortal has evolved to become the prevalent repository of biomedical
             ontologies and terminologies [17,32]. In 2008, BioPortal had 72 ontologies with
             around 300,000 concepts, while the current version⁴ contains 821 ontologies
             with 8,859,512 concepts. This shows that there is a tremendous increase in both
             the number of ontologies and the number of concepts. BioPortal provides a
             number of services, amongst others to obtain ontology information, such as
             ontology metadata and individual ontology terms, to search within ontologies,
             and to visualize them.

              ¹ http://bioportal.bioontology.org/
              ² http://www.ontobee.org/
              ³ http://agroportal.lirmm.fr/
              ⁴ visited on 18.11.2019

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Due to its popularity, different aspects of BioPortal ontologies have been
investigated. For example, a number of approaches have been introduced to
investigate the quality of BioPortal ontologies [5,18]. Another aspect that
is intensively studied w.r.t. BioPortal ontologies is ontology reuse [8,15,19].
However, to the best of our knowledge, there is only one study focusing on
the decomposition and partitioning of BioPortal ontologies [31]. It was
conducted in 2011, when BioPortal contained 250 ontologies, of which 218 were
OWL or OBO ontologies, and it was carried out on that set of ontologies.
It was replicated in 2014 by another study that used the same
methodology and software to extend the experiments to the BioPortal ontologies
available at that time [13]. However, there is no recent or comprehensive study
of this important aspect. With this paper, we aim to fill this gap with an
empirical analysis of partitioning BioPortal ontologies.
    Ontology modularization covers the problem of identifying a fragment or a set
of fragments of an ontology. The process of identifying a fragment of an ontology
given a user input (request) is called ontology module extraction [9,23,24], while
the process that partitions the ontology into a set of fragments is called ontology
partitioning [2,4,22]. Ontology modularization can be used to support a number
of complex tasks, such as maintenance, reuse and knowledge selection [6],
reasoning [22] and integration of existing ontologies. An ontology module is
defined as a reusable component of a larger or more complex ontology [7,21],
which is self-contained but bears a definite association to other ontology modules,
including the original ontology. In this paper, we investigate
the partitionability of BioPortal ontologies. To this end, we adopt three
partitioning approaches belonging to two categories: PATO and OAPT
as structure-based approaches, and AD as a logic-based approach. PATO is
a tool used to partition large ontologies into smaller modules based on the
structure of the class hierarchy [26,27]. AD (Atomic Decomposition) relies
on a notion of logical dependence that allows the definition of clumps
of highly interlaced axioms (called atoms) that are never split across two or
more modules [13,31]. OAPT (Ontology Analysis and Partitioning Tool) aims to
split an ontology into a set of modules using a seeding-based clustering
approach [1,2]. We applied these approaches to the BioPortal ontology repository
and analyzed the partitioning results.
    The remainder of the paper is organized as follows: Section 2 presents the
background. Section 3 provides an overview of the proposed methodology. The
experimental evaluation of the partitioning approaches on BioPortal
ontologies is presented in Section 4. Finally, Section 5 concludes the paper.


2   Background

An ontology, O, is a set of axioms, each representing a statement about
the domain. The building blocks of axioms are entities, such as concepts,
properties, individuals, and data types [11,29]. We describe an ontology as
a 6-tuple, denoted as $O = \{C, P, H^C, H^P, A, I\}$. $C$ and $P$ are two disjoint
sets of classes (concepts) and properties, respectively.
$H^C = \{(C_1, C_2) \in C \times C \mid C_1 \text{ subsumes } C_2\}$
represents the hierarchy of class subsumption. Similarly,
$H^P$ is the hierarchy between properties. $A$ is a set of axioms and $I$ is a set of
instances associated with the sets of concepts $C$ and properties $P$. A signature $S$ of an
ontology $O$ based on a description logic $L$ is the union of concepts, properties,
and instances, i.e., $S = C \cup P \cup I$.
    A module M of an ontology O is a reusable part of O, which is self-contained
but bears a definite association to other ontology modules, including the original
ontology [7]. Formally, we define an ontology module Mi (O) following [9,10,16]:

Definition 1. A module $M_i(O)$ is a module of the ontology $O$ w.r.t. a
description logic $L$, if for every axiom $\alpha$ over $L$ with $S(\alpha) \subseteq S$, we have
$M_i(O) \models \alpha$ if $O \models \alpha$.

An ontology module can be represented as a 6-tuple
$M_i(O) = \{C_{M_i}, P_{M_i}, H^C_{M_i}, H^P_{M_i}, A_{M_i}, I_{M_i}\}$,
where $C_{M_i} \subseteq C$, $P_{M_i} \subseteq P$, etc. This
definition implies that any information that exists or can be entailed from the
module Mi (O) should also exist or could be entailed from the original ontology
O. This enables the reuse of ontology modules either as they are or by enlarging
them by adding more axioms. Therefore, each module can be considered as
an ontology by itself. To achieve this, each module should be self-contained,
consistent, and topic-centric [4,7].
     We define the ontology modularization process (partitioning) as follows:
given an ontology O, partition the ontology entities into a set of modules
M1 , M2 , ..., Mk such that the cohesion of entities in each module is high (i.e.
intra-module similarity is high), while the coupling between any pair of modules
is low. To analyze the partitioning process, a number of evaluation
criteria have been designed as a trade-off between the modularization quality and
the modularization efficiency [3,26]. In this study, we consider the number of
modules, the number of entities in each module, and the time needed to achieve
partitioning.
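
The cohesion and coupling notions above can be illustrated with a minimal sketch. It is not one of the evaluation metrics of [3,26]; it simply counts the links inside each module and the links crossing module boundaries. The concept names, the edge list, and the module assignment are hypothetical toy data.

```python
from collections import Counter

def cohesion_and_coupling(edges, module_of):
    """Count intra-module and inter-module links.

    edges: iterable of (concept_a, concept_b) links, e.g. is-a relations.
    module_of: dict mapping each concept to its module identifier.
    """
    intra = Counter()  # module -> number of links inside the module
    inter = Counter()  # frozenset({m1, m2}) -> number of links across the pair
    for a, b in edges:
        ma, mb = module_of[a], module_of[b]
        if ma == mb:
            intra[ma] += 1
        else:
            inter[frozenset((ma, mb))] += 1
    return intra, inter

# Hypothetical toy ontology: two small hierarchies joined by a single link.
edges = [("Cell", "Neuron"), ("Cell", "Myocyte"), ("Organ", "Heart"),
         ("Organ", "Brain"), ("Heart", "Myocyte")]
module_of = {"Cell": "M1", "Neuron": "M1", "Myocyte": "M1",
             "Organ": "M2", "Heart": "M2", "Brain": "M2"}
intra, inter = cohesion_and_coupling(edges, module_of)
```

A partitioning with many intra-module links and few inter-module links matches the informal goal of high cohesion and low coupling.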


3      Methodology

Partitioning BioPortal ontologies into suitable partitions (modules) is certainly
valuable when it comes to processing, editing, and analyzing them or reusing
their parts. To investigate the partitioning aspect of BioPortal ontologies, we
propose a workflow that contains the following main steps: i) get all accessible
ontologies using the BioPortal API⁵, ii) transform these ontologies into OWL or
OBO formats using the OWL API⁶, iii) partition these ontologies using one of the
following partitioning algorithms, and iv) analyze the partitioning results. To
keep this paper self-contained, we give a brief summary of each of the partitioning
algorithms used.

⁵ http://data.bioontology.org/documentation
⁶ http://owlcs.github.io/owlapi/
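
Step i) of the workflow could be prepared along the following lines. This is only a sketch under stated assumptions: the `/ontologies` listing endpoint, the `/ontologies/{acronym}/download` endpoint, the `apikey` query parameter, and the `acronym` field of the JSON listing are taken from the public BioPortal REST API documentation as we understand it and should be checked against footnote 5. The helper names are hypothetical, and no network request is made here.

```python
import json
from urllib.parse import quote, urlencode

BASE_URL = "http://data.bioontology.org"  # REST endpoint named in the footnote

def ontology_list_url(api_key):
    """URL that lists all ontologies (assumed endpoint: /ontologies)."""
    return f"{BASE_URL}/ontologies?{urlencode({'apikey': api_key})}"

def download_url(acronym, api_key):
    """URL for downloading one ontology (assumed endpoint: /ontologies/<acronym>/download)."""
    return f"{BASE_URL}/ontologies/{quote(acronym)}/download?{urlencode({'apikey': api_key})}"

def acronyms_from_listing(listing_json):
    """Extract ontology acronyms from a JSON listing (assumed field name: 'acronym')."""
    return [entry["acronym"] for entry in json.loads(listing_json) if "acronym" in entry]

# Usage with a hand-written stand-in for the server response.
sample_listing = '[{"acronym": "MSV"}, {"acronym": "GO-PLUS"}]'
urls = [download_url(a, "YOUR_API_KEY") for a in acronyms_from_listing(sample_listing)]
```

Ontologies whose download fails would then be counted as not accessible, and the downloaded files would be handed to the format conversion in step ii).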

3.1   PATO
PATO is an ontology partitioning tool that partitions an ontology in three
steps [27,28]: i) dependency graph creation: a graph structure is created to
represent dependencies between ontology entities, where the nodes of the graph
are the values of "rdfs:label" or "rdf:ID"; ii) graph partitioning: the tool
determines sets of nodes of a given minimal and maximal size in which the
connections between nodes inside the set are stronger than any connection to
nodes outside the set; each such set identifies ontology elements that should
be in one module; and iii) distributed ontology creation: a distributed
ontology is created based on the graph partitioning.
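
The condition used in step ii) can be sketched as a simple check. This is an illustration of the criterion as described above, not PATO's implementation: we assume that connection strengths are already given as undirected weights, and we read the criterion as "the set is connected using only internal connections that are stronger than every connection leaving the set". The function name, the parameters, and the toy strengths are hypothetical.

```python
from collections import defaultdict, deque

def is_island(candidate, strength, min_size=1, max_size=float("inf")):
    """Check whether `candidate` satisfies the (assumed) island condition.

    strength: dict {(a, b): s} of undirected connection strengths.
    """
    candidate = set(candidate)
    if not candidate or not (min_size <= len(candidate) <= max_size):
        return False
    # Strongest connection that crosses the boundary of the candidate set.
    crossing = [s for (a, b), s in strength.items()
                if (a in candidate) != (b in candidate)]
    threshold = max(crossing, default=float("-inf"))
    # Keep only internal connections that beat every crossing connection.
    adjacency = defaultdict(set)
    for (a, b), s in strength.items():
        if a in candidate and b in candidate and s > threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    # The candidate is an island if those strong connections link all its nodes.
    start = next(iter(candidate))
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen == candidate

# Hypothetical strengths along a chain A - B - C - D - E.
strength = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.2, ("D", "E"): 0.7}
tight = is_island({"A", "B", "C"}, strength, min_size=2, max_size=4)
loose = is_island({"C", "D"}, strength, min_size=2, max_size=4)
```

Here the set {A, B, C} qualifies because its internal connections (0.9 and 0.8) are stronger than its only outgoing connection (0.2), whereas {C, D} does not.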

3.2   OAPT
The ontology analysis and partitioning tool (OAPT) [2] partitions
ontologies into a set of modules using a seeding-based clustering
algorithm [1]. The algorithm has the following steps: i) rank the ontology
concepts: the first step is to quantify the importance of each concept within
the concept graph (ontology) in order to identify important concepts, some of
which are later elected as cluster heads, the seeds of the partitions;
ii) determine cluster heads: the next step is to select which concepts
represent cluster heads. Here, two questions arise: how many cluster heads
should be selected, and which ones? iii) partition: the seeding-based algorithm
initiates one partition for each cluster head. It then places the direct children
of each head in the corresponding partition and, finally, uses a membership
function to assign the remaining (non-partitioned) concepts to their best-fitting
partition; and iv) generate modules: the last step is to generate a module for
each partition, preserving the required intra-relationships between concepts in the
same partition as well as the inter-links between concepts from different partitions.
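
Steps i) to iii) can be sketched on a toy hierarchy. The sketch below is not OAPT's implementation: the ranking function (number of direct children) and the membership function (join the partition of the nearest already assigned neighbour) are stand-in assumptions, since the concrete measures are not spelled out here. The concept names are hypothetical.

```python
from collections import deque

def seed_partition(children, num_heads):
    """Seeding-based partitioning sketch over an is-a hierarchy.

    children: dict mapping a parent concept to a list of its direct children.
    """
    concepts = set(children) | {c for kids in children.values() for c in kids}
    # Step i (assumption): importance = number of direct children, ties by name.
    ranked = sorted(concepts, key=lambda c: (-len(children.get(c, [])), c))
    # Step ii: the top-ranked concepts become cluster heads.
    heads = ranked[:num_heads]
    # Step iii: one partition per head, direct children are placed first.
    partition_of = {head: head for head in heads}
    for head in heads:
        for child in children.get(head, []):
            partition_of.setdefault(child, head)
    # Membership function (assumption): breadth-first spread to unassigned neighbours.
    neighbours = {c: set() for c in concepts}
    for parent, kids in children.items():
        for kid in kids:
            neighbours[parent].add(kid)
            neighbours[kid].add(parent)
    queue = deque(sorted(partition_of))
    while queue:
        node = queue.popleft()
        for other in sorted(neighbours[node]):
            if other not in partition_of:
                partition_of[other] = partition_of[node]
                queue.append(other)
    modules = {}
    for concept, head in partition_of.items():
        modules.setdefault(head, set()).add(concept)
    return modules

# Hypothetical toy hierarchy with ten concepts.
children = {"Entity": ["Anatomy", "Disease"],
            "Anatomy": ["Heart", "Lung", "Brain"],
            "Disease": ["Cancer", "Flu", "Asthma"],
            "Heart": ["Valve"]}
modules = seed_partition(children, num_heads=2)
```

With two cluster heads, "Anatomy" and "Disease" are selected, and every remaining concept ends up in exactly one of the two partitions. Step iv), generating a module per partition, would then copy the relevant axioms and is omitted.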

3.3   AD
The atomic decomposition (AD) is a compact representation of the modular
structure of an ontology [13,31]. The AD of an ontology O is a pair consisting of a set
of atoms and a directed dependency relation over these atoms [30], where an atom
is a maximal set of axioms that are tightly bound to each other. To compute
atomic decompositions, we used the off-the-shelf implementation provided by
Del Vescovo and Palmisano [30]. The implementation is available via Maven
Central (maven.org) with an artifactId of owlapitools-atomicdecomposition.
The current implementation of the AD approach supports extracting three types
of syntactic-locality-based modules: the bottom module, the top module, and the
star module.
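
The pair of atoms and dependency relation can be illustrated with a toy sketch. It is not the cited implementation: we assume that the module of each axiom has already been computed by some locality-based extractor, we group axioms whose modules coincide into one atom, and we let an atom depend on every other atom contained in its module. The axiom labels and modules are hypothetical.

```python
def atomic_decomposition(module_of):
    """Toy atomic decomposition from precomputed per-axiom modules.

    module_of: dict mapping each axiom to the set of axioms in its module.
    Returns (atoms, depends_on).
    """
    # Axioms with identical modules are never separated, so they form one atom.
    atoms_by_module = {}
    for axiom, module in module_of.items():
        atoms_by_module.setdefault(frozenset(module), set()).add(axiom)
    atoms = [frozenset(members) for members in atoms_by_module.values()]
    # An atom depends on every other atom that lies inside its module.
    depends_on = {}
    for module, members in atoms_by_module.items():
        atom = frozenset(members)
        depends_on[atom] = {other for other in atoms
                            if other != atom and other <= module}
    return atoms, depends_on

# Hypothetical modules for four axioms a1..a4.
module_of = {"a1": {"a1"},
             "a2": {"a1", "a2", "a3"},
             "a3": {"a1", "a2", "a3"},
             "a4": {"a1", "a4"}}
atoms, depends_on = atomic_decomposition(module_of)
```

In this toy input, a2 and a3 always occur together and therefore form one atom, and both that atom and the atom {a4} depend on the atom {a1}.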


4     Evaluation
In this section, we first describe the setup of our evaluation and then discuss
evaluation results.

4.1   Setup
We carried out a set of experiments using a 3.4GHz Intel(R) Core i7
processor with 16GB RAM running Windows 7. We used the available
implementations of the partitioning tools: AD can be accessed through this link⁷,
and PATO and OAPT through this link⁸. We ran this set of experiments using the
BioPortal ontology repository version that contains 792 ontologies⁹, of which 710
are accessible and can be downloaded using the BioPortal API. Of these, 657 ontologies
are represented in or can be converted to OWL or OBO formats.

4.2   Results
We ran the PATO, OAPT, and AD partitioning tools to partition these 657
ontologies according to the respective partitioning algorithm implemented in
the tool. In the following, we present the results of PATO and OAPT together,
as both are classified as structure-based partitioning approaches, while the
results of AD, which is classified as a logic-based partitioning approach, are
presented separately for its different strategies: bottom (Bot), top, and star.

Number of modules. We started our analysis by considering how many
ontologies can be partitioned and how many partitions are generated for each
ontology. Results are summarized in Fig. 1, where results of OAPT and PATO
are shown in Fig. 1a and Fig. 1b, respectively. These figures show that OAPT and
PATO can partition 97% (635 out of 657) and 84% of the ontologies from the
repository, respectively. However, the two partitioning tools generate different
numbers of partitions according to their respective procedure.
    For OAPT, Fig. 1a shows that 142 ontologies are represented as
one-module ontologies. There are two main reasons for this. i) 101 ontologies
have fewer than 50 concepts; here, partitioning seems unnecessary. We also
report that 123 ontologies have fewer than 100 concepts, which means that at least
19 ontologies with more than 100 concepts are represented as one-module
ontologies. ii) Investigating this set of 19 ontologies, we found that 15 of
them have fewer than 200 concepts. For these, a representation as a
one-module ontology could also be acceptable. We reviewed the remaining four
ontologies (BFLC, LUNGMAP-HUMAN, MSV, and TM-SIGNS-AND-SYMPTS) and
found issues with them. For example, the MSV (Metagenome Sample
Vocabulary) ontology has been in its beta version since 2017. It has 648 concepts with
only five is-a relations. Fig. 1a also shows that about half of the accessible ontologies
(347 ontologies) in the BioPortal repository are partitioned into at most
five partitions, while 590 ontologies are partitioned into at most 30 modules.
The remaining ontologies, mostly the larger ones, are
partitioned into more than 30 partitions, requiring more computational resources.
For example, the GO-PLUS ontology, containing 80,999 concepts, is partitioned into
50 modules. The figure also illustrates that three ontologies (SEQ,
SMASH, and ENM) generate zero modules: the sequence ontology (SEQ)
has only one concept, which causes a problem during the extraction of the concept,
while the current JENA API fails to parse and read the other two ontologies.

⁷ https://web.stanford.edu/~horridge/publications/2014/iswc/atomic-decomposition/data/
⁸ https://github.com/fusion-jena/OAPT
⁹ at the time of evaluation execution




        Figure 1: No. of ontologies vs. no. of modules: (a) OAPT, (b) PATO



    For PATO, as shown in Fig. 1b, the tool generates a large number of partitions,
each with a small number of entities, based on a defined parameter. PATO
produces a single module for 10 ontologies, eight of which fall into
the same category with OAPT. The remaining two ontologies
are SEQ (which falls into the 0-module category with OAPT) and HORD. In
total, PATO generated 0 modules for 17 ontologies because the
tool fails to build the dependency graph for them. Fig. 1b also
shows that, among 554 ontologies, 140 are partitioned into more than
50 partitions.
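
The reported percentages can be reproduced with a short calculation. Treating the 554 ontologies mentioned for PATO as the number of ontologies it partitions is our assumption; the text itself only states the 84% figure.

```python
TOTAL_ONTOLOGIES = 657  # OWL/OBO ontologies used in the evaluation

def success_rate(partitioned, total=TOTAL_ONTOLOGIES):
    """Share of successfully partitioned ontologies, rounded to a whole percent."""
    return round(100 * partitioned / total)

oapt_rate = success_rate(635)  # 635 ontologies reported for OAPT
pato_rate = success_rate(554)  # assumption: 554 is the PATO count
```

Both values agree with the 97% and 84% stated above.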


Partitioning time. One important aspect to consider when
analyzing partitioning results is the partitioning performance. In this
analysis, we measured the time needed to read and partition
each ontology. We summed the execution times of the ontologies within
the same category and report the results in Fig. 2. The figure shows the
average execution time (avg. time) to partition an ontology within each category.


        Figure 2: No. of ontologies vs. average partitioning time: (a) OAPT, (b) PATO



For example, as shown in Fig. 2a, OAPT needs an average
of 8.6 seconds to partition an ontology in the 10-module category, while
PATO needs 86.5 seconds for an ontology in the same category,
as shown in Fig. 2b. Fig. 2a also shows that, in the 100-module category,
OAPT needs an average of 110 minutes (approximately two hours) to
partition an ontology. Another interesting finding that can be drawn
from the figure is that the execution time depends not only on the number of
modules (partitions) but also on other internal characteristics of the ontology. For
example, the FAST-EVENT ontology has only four concepts, yet it needs 1,286
and 103 seconds to be partitioned using OAPT and PATO, respectively. We
investigated this ontology and found that it has 15,700 individuals. Similarly,
the CU-VO ontology has 11 concepts and 7,320 individuals; it needs 920 seconds
for partitioning using OAPT, while PATO fails to partition it.
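
The per-category averaging described in this paragraph can be sketched as follows. The records are hypothetical placeholders, not measurements from the study; a category is taken to be the set of ontologies that yield the same number of modules.

```python
from collections import defaultdict

def average_time_per_category(records):
    """Average partitioning time per category.

    records: iterable of (num_modules, seconds) pairs, one per ontology.
    Returns a dict mapping num_modules to the mean time in seconds.
    """
    totals = defaultdict(lambda: [0.0, 0])  # num_modules -> [sum of times, count]
    for num_modules, seconds in records:
        totals[num_modules][0] += seconds
        totals[num_modules][1] += 1
    return {category: total / count for category, (total, count) in totals.items()}

# Hypothetical measurements: two ontologies with 10 modules, one with 5.
records = [(10, 8.0), (10, 9.2), (5, 1.5)]
avg_time = average_time_per_category(records)
```

Outliers such as ontologies with few concepts but many individuals would show up as unusually high times within an otherwise cheap category.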


AD results. Since results of the atomic decomposition of the BioPortal
ontologies have been reported in earlier studies [13,31], in this section
we present new results w.r.t. the current repository. These results
are summarized in Table 1. The AD approach was applied with its
different strategies (bottom (Bot), Top, and Star) to the ontology
repository with 657 ontologies. It generates correct
atoms (partitions) for 435, 410, and 442 ontologies using the Bot, Top,
and Star strategy, respectively, i.e., for roughly 62% to 67% of the whole repository. Even
though this category of partitioning approaches generates atoms very fast, it fails
to cope with a large number of BioPortal ontologies. The table also shows that
the AD approach decomposes a number of ontologies into 0 atoms: the Bot
and Star strategies produce 0 atoms for the same seven ontologies, while
the Top strategy produces 0 atoms for 31 ontologies. One important finding here
is that the ontology SMASH is partitioned by all tools into 0 partitions (atoms).


                           Criterion            Bot   Top   Star
                           No. of ontologies    511   510   509
                           Exception             27    27    18
                           0-atom                 7    31     7
                           no-result             42    42    42
                           with-result          435   410   442
                                Table 1: AD results
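
The figures in Table 1 can be cross-checked with a few lines of code. The dictionary keys are our own labels for the table rows; relating the with-result counts to the 657 ontologies of the repository follows the text above.

```python
# Values copied from Table 1; the key names are hypothetical labels.
table = {
    "Bot":  {"total": 511, "exception": 27, "zero_atom": 7,  "no_result": 42, "with_result": 435},
    "Top":  {"total": 510, "exception": 27, "zero_atom": 31, "no_result": 42, "with_result": 410},
    "Star": {"total": 509, "exception": 18, "zero_atom": 7,  "no_result": 42, "with_result": 442},
}

def row_is_consistent(row):
    """True if the four outcome counts add up to the number of ontologies."""
    outcomes = row["exception"] + row["zero_atom"] + row["no_result"] + row["with_result"]
    return outcomes == row["total"]

consistent = {name: row_is_consistent(row) for name, row in table.items()}
# Share of the 657 repository ontologies with correct atoms, in percent.
share = {name: round(100 * row["with_result"] / 657, 1) for name, row in table.items()}
```

For each strategy, the four outcome rows add up to the number of ontologies in the first row.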




5    Conclusion
BioPortal is an important resource for biomedical ontologies. It is thus
worthwhile to investigate the portal and the ontologies contained therein. While
existing work addresses a number of aspects, an analysis of the partitionability of
BioPortal ontologies was still missing. In this paper, we describe first steps to fill
this gap. We presented an empirical study and analysis of the applicability of
partitioning to BioPortal ontologies. We investigated the success of partitioning as
well as the partitioning performance of three existing partitioning tools. The study
showed that, overall, partitioning works for many, but not all, ontologies in BioPortal.
Failure to partition seems to be at least partially due to characteristics
of the ontologies rather than of the tools. The study also shows, however, that
different algorithms result in very different partitions. In future work, we plan
to study the semantic content and usage of individual modules.

6    Acknowledgments

This work has been mostly funded by the Deutsche Forschungsgemeinschaft
(DFG) as part of the CRC 1076 AquaDiva.

References

 1. A. Algergawy, S. Babalou, M. J. Kargar, and S. H. Davarpanah. Seecont: A new
    seeding-based clustering approach for ontology matching. In 19th International
    Conference on Advances in Databases and Information Systems, ADBIS, pages
    245–258, 2015.
 2. A. Algergawy, S. Babalou, F. Klan, and B. König-Ries. OAPT: A tool for ontology
    analysis and partitioning. In Proceedings of the 19th International Conference on
    Extending Database Technology, EDBT, pages 644–647, 2016.
 3. A. Algergawy, S. Babalou, and B. König-Ries. A new metric to evaluate
    ontology modularization. In 2nd International Workshop on Summarizing and
    Presenting Entities and Ontologies Co-located with the 13th Extended Semantic
    Web Conference, 2016.
 4. F. Amato, A. D. Santo, V. Moscato, F. Persia, A. Picariello, and S. R. Poccia.
    Partitioning of ontologies driven by a structure-based approach. In 2015 IEEE
    International Conference on Semantic Computing, pages 320–323, 2015.
 5. M. Amith, Z. He, J. Bian, J. A. Lossio-Ventura, and C. Tao. Assessing the practice
    of biomedical ontology evaluation: Gaps and opportunities. Journal of Biomedical
    Informatics, 80:1–13, 2018.


 6. M. d’Aquin, A. Schlicht, H. Stuckenschmidt, and M. Sabou. Ontology
    modularization for knowledge selection: Experiments and evaluations. In 18th
    International Conference on Database and Expert Systems Applications, DEXA,
    pages 874–883, 2007.
 7. P. Doran, V. A. M. Tamma, and L. Iannone. Ontology module extraction
    for ontology reuse: an ontology engineering perspective. In Proceedings of the
    Sixteenth ACM Conference on Information and Knowledge Management, CIKM,
    pages 61–70, 2007.
 8. M. Fernández-López, M. Poveda-Villalón, M. C. Suárez-Figueroa, and
    A. Gómez-Pérez. Why are ontologies not reused across the same domain? J.
    Web Semant., 57, 2019.
 9. B. C. Grau, I. Horrocks, Y. Kazakov, and U. Sattler. Just the right amount:
    extracting modules from ontologies. In Proceedings of the 16th International
    Conference on World Wide Web, WWW, pages 717–726, 2007.
10. B. C. Grau, I. Horrocks, Y. Kazakov, and U. Sattler. Modular reuse of ontologies:
    Theory and practice. J. Artif. Intell. Res. (JAIR), 31:273–318, 2008.
11. N. Guarino, D. Oberle, and S. Staab. What is an ontology? In Handbook on
    Ontologies, pages 1–17. 2009.
12. R. Hoehndorf, P. N. Schofield, and G. V. Gkoutos. The role of ontologies
    in biological and biomedical research: a functional perspective. Briefings in
    Bioinformatics, 16(6):1069–1080, 2015.
13. M. Horridge, J. Mortensen, B. Parsia, U. Sattler, and M. A. Musen. A study on
    the atomic decomposition of ontologies. In The Semantic Web - ISWC 2014 - 13th
    International Semantic Web Conference, Riva del Garda, Italy, October 19-23,
    2014. Proceedings, Part II, pages 65–80, 2014.
14. C. Jonquet, A. Toulet, E. Arnaud, S. Aubin, E. D. Y. Kaboré, V. Emonet,
    J. Graybeal, M. Laporte, M. A. Musen, V. Pesce, and P. Larmande. Agroportal:
    A vocabulary and ontology repository for agronomy. Computers and Electronics
    in Agriculture, 144:126–143, 2018.
15. M. R. Kamdar, T. Tudorache, and M. A. Musen. A systematic analysis of term
    reuse and term overlap across biomedical ontologies. Semantic Web, 8(6):853–871,
    2017.
16. B. Konev, C. Lutz, D. Walther, and F. Wolter. Model-theoretic inseparability and
    modularity of description logic ontologies. Artif. Intell., 203:66–103, 2013.
17. M. A. Musen, N. F. Noy, N. H. Shah, P. L. Whetzel, C. G. Chute, M.-A. Story,
    B. Smith, and the NCBO team. The national center for biomedical ontology.
    Journal of the American Medical Informatics Association, 19(2):190–195, 2012.
18. C. Ochs, Z. He, L. Zheng, J. Geller, Y. Perl, G. Hripcsak, and M. A. Musen.
    Utilizing a structural meta-ontology for family-based quality assurance of the
    bioportal ontologies. Journal of Biomedical Informatics, 61:63–76, 2016.
19. C. Ochs, Y. Perl, J. Geller, S. Arabandi, T. Tudorache, and M. A. Musen.
    An empirical analysis of ontology reuse in bioportal. Journal of Biomedical
    Informatics, 71:165–177, 2017.
20. E. Ong, Z. Xiang, B. Zhao, Y. Liu, Y. Lin, J. Zheng, C. Mungall, M. Courtot,
    A. Ruttenberg, and Y. He. Ontobee: A linked ontology data server to support
    ontology term dereferencing, linkage, query and integration. Nucleic Acids
    Research, 45(Database-Issue):D347–D352, 2017.
21. J. Pathak, T. M. Johnson, and C. G. Chute. Survey of modular ontology techniques
    and their applications in the biomedical domain. Integrated Computer-Aided
    Engineering, 16(3):225–242, 2009.


22. S. Priya, Y. Guo, M. Spear, and J. Heflin. Partitioning OWL knowledge bases
    for parallel reasoning. In 2014 IEEE International Conference on Semantic
    Computing, pages 108–115, 2014.
23. A. A. Romero, M. Kaminski, B. C. Grau, and I. Horrocks. Ontology module
    extraction via datalog reasoning. In 29th AAAI Conference on Artificial
    Intelligence, pages 1410–1416, 2015.
24. A. A. Romero, M. Kaminski, B. C. Grau, and I. Horrocks. Module extraction in
    expressive ontology languages via datalog reasoning. J. Artif. Intell. Res. (JAIR),
    55:499–564, 2016.
25. M. Salvadores, P. R. Alexander, M. A. Musen, and N. F. Noy. Bioportal as a
    dataset of linked biomedical ontologies and terminologies in RDF. Semantic Web,
    4(3):277–284, 2013.
26. A. Schlicht and H. Stuckenschmidt. Towards structural criteria for ontology
    modularization. In 1st International Workshop on Modular Ontologies, WoMO’06,
    co-located with the International Semantic Web Conference, ISWC’06, 2006.
27. A. Schlicht and H. Stuckenschmidt. A flexible partitioning tool for large ontologies.
    In International Conference on Web Intelligence, WI, pages 482–488, 2008.
28. H. Stuckenschmidt and M. C. A. Klein. Structure-based partitioning of large
    concept hierarchies. In Third International Semantic Web Conference, ISWC 2004,
    pages 289–303, 2004.
29. R. Studer, V. R. Benjamins, and D. Fensel. Knowledge engineering: Principles and
    methods. Data Knowl. Eng., 25(1-2):161–197, 1998.
30. C. Del Vescovo. The Modular Structure of an Ontology: Atomic Decomposition and
    its applications. PhD thesis, The University of Manchester, 2013.
31. C. Del Vescovo, D. Gessler, P. Klinov, B. Parsia, U. Sattler, T. Schneider, and
    A. Winget. Decomposition and modular structure of bioportal ontologies. In The
    Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn,
    Germany, October 23-27, 2011, Proceedings, Part I, pages 130–145, 2011.
32. P. L. Whetzel, N. F. Noy, N. H. Shah, P. R. Alexander, C. Nyulas, T. Tudorache,
    and M. A. Musen. Bioportal: enhanced functionality via new web services from
    the national center for biomedical ontology to access and use ontologies in software
    applications. Nucleic Acids Research, 39(Web-Server-Issue):541–545, 2011.



