Documentation Gap in Ontology Creation:
                      Insights into the Reality of Knowledge
                     Formalization in a Life Science Company

                                            Marius Michaelis and Olga Streibel

                             Bayer Business Services GmbH, 51368 Leverkusen, Germany
                                  {marius.michaelis,olga.streibel}@bayer.com


                      Abstract. To achieve the goal of FAIR — findable, accessible, interop-
                      erable, and reusable — data, life science companies employ Semantic Web
                      standards and Linked Data principles. In doing so, they create ontologies
                      that formally represent knowledge. This paper presents the results of a
                      survey among knowledge engineers and domain experts involved in ontol-
                      ogy creation for a global life science company. The survey results indicate
                      that the conceptualization phase of the ontology creation process, includ-
                      ing knowledge acquisition, remains largely undocumented. The majority
                      of knowledge engineers surveyed begin to document during or after the
                      creation of the formal knowledge model. The authors discuss the risks
                      that may arise from this documentation gap and recommend addressing
                      them by means of joint, timely, and structured documentation.

                      Keywords: Ontology · Documentation · Knowledge Management


             1     Introduction
             For over 10 years there have been initiatives to apply semantic technologies in the
             field of life sciences. In 2008, for instance, the W3C interest group Semantic Web
             Health Care and Life Sciences was founded, which has continued its work as a
             community group since 2018 [23]. Another example is the international Semantic
             Web Applications and Tools for Healthcare and Life Sciences (SWAT4HCLS)
             conference which has been taking place annually since 2008 [18]. As the shift
             towards the use of semantic technologies is becoming more common, the inter-
             national standardization organization Health Level Seven International (HL7)
             has published a Linked Data Module for its standard framework Fast Healthcare
             Interoperability Resources (FHIR) [10]. In addition, pharmaceutical companies
             and authorities such as the U.S. Food and Drug Administration (FDA) are in-
             volved in non-profit organizations like Pharmaceutical Users Software Exchange
             (PhUSE). There, the working group Linked Data & Graph Databases worked
             on the use of semantic technologies [11]. At EU level, the intergovernmental
             organization ELIXIR, which is engaged in the European Open Science Cloud
             (EOSC), encourages semantic integration with its Interoperability Platform to
             achieve the goal of FAIR life science data [4]. FAIR refers to a set of four princi-
             ples: data must be findable, accessible, interoperable, and reusable [24]. The FAIR


Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

strategy is mainly driven by the GO FAIR initiative. Both EOSC and GO FAIR
follow the recommendations of the European Commission expert group on FAIR
data [2,7], which recommends, among other things, the use of semantic technologies [3].
In not-for-profit collaborations such as the Pistoia Alliance, companies, vendors,
publishers, and academic groups are jointly dedicated to the implementation of
FAIR data principles in biopharmaceutical R&D [25]. In order to achieve the
goal of FAIR data, life science companies employ Semantic Web standards and
Linked Data principles. In doing so, they create ontologies that formally repre-
sent knowledge. This paper provides insights into the reality of ontology creation
in a life science company, focusing on the documentation that takes place during
the process.
     First, the process of ontology creation and the roles involved are briefly out-
lined in section 2. Following the description of the applied methodology in sec-
tion 3, the findings are presented in section 4. They provide information about
the company’s ontology creators and their approach to documentation. Based
on these findings, section 5 evaluates whether the prevailing documentation ap-
proach is sufficient. Finally, in section 6, we draw a conclusion on the challenges
life science companies face when creating ontologies.


2   Ontology Creation Process
Based on the definitions by Gruber [8] and Borst [1], Studer et al. define
ontologies as “a formal, explicit specification of a shared conceptualisation” [20].
There is no single, uniform approach to the structured development of ontologies.
Instead, over the last two decades, a variety of so-called ontology engineering
methodologies have been proposed in the literature that describe more or
less specific processes for ontology creation. The roles involved may differ from
methodology to methodology in terms of their number, designations, and
responsibilities [6,19]. Following the understanding of roles in the life science company,
this paper distinguishes between only two roles similar to the ones of the Unified
Process for Ontology Building: knowledge engineer and domain expert [17]. Do-
main experts (DEs) have expertise in a certain subject area, i.e. DEs are familiar
with the main concepts of a domain, their characteristics and relationships. In
terms of ontology development, this means DEs are knowledgeable in the domain
which is to be represented by the ontology. Knowledge engineers (KEs) capture,
structure and formalize knowledge so that it can be processed by machines in
order to solve certain problems. In terms of ontology development, the KEs are
those who build the ontology. In the following, a basic ontology creation process
is outlined, as it underlies many methodologies (see figure 1). For the sake of
clarity, neither feedback loops nor special cases are discussed.
 1. Ontology specification: Collection of requirements and definition of frame-
    work conditions [5,17,21]. Usually includes collecting so-called competency
    questions (CQs), i.e. questions to be answered by exploring and querying
    the ontology. CQs are initially expressed informally at the conceptual level,
    not as formal queries [9,17].
 2. Conceptualization: Acquisition of knowledge, during which KEs gather the
    required domain knowledge from non-human as well as human knowledge
    sources. To do so, they research explicit knowledge stored in media outside
    the human brain (e.g. in the form of databases, documents, vocabularies)
    on the one hand, and elicit tacit knowledge, which is bound to individuals
    (e.g. practical knowledge in the memory of a long-time employee), on the
    other hand, by interviewing, observing, and probing DEs [16]. The collected
    knowledge is conceptually analyzed by the KEs in order to create an informal
    knowledge model. [5,17,22]
 3. Implementation: KEs encode the informal knowledge model as an ontology
    using a formal ontology language. [5,17,22]
 4. Test: KEs and DEs evaluate the ontology’s quality in different dimensions
    [14]. Basically, the ontology must meet technical standards and the defined
    requirements so that it can be used to answer the collected CQs. [5,17,22]


       Fig. 1. Outline of a basic process for ontology creation without loops
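
The implementation step of the process outlined above can be illustrated with a minimal sketch. It assumes that the informal knowledge model is available as a simple list of concepts with definitions and optional broader concepts, and it encodes this list as OWL classes in Turtle syntax. The concept names, the `Concept` structure, the function `to_turtle`, and the namespace `http://example.org/onto#` are hypothetical and serve illustration only; they are not taken from the surveyed company.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    """One entry of an informal knowledge model (hypothetical structure)."""
    name: str                      # assumed to be a valid Turtle local name
    definition: str                # natural-language definition from the DEs
    broader: Optional[str] = None  # name of the parent concept, if any


def to_turtle(concepts: List[Concept], base: str = "http://example.org/onto#") -> str:
    """Encode the informal model as OWL classes in Turtle syntax."""
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for c in concepts:
        parts = [f":{c.name} a owl:Class"]
        if c.broader:
            parts.append(f"rdfs:subClassOf :{c.broader}")
        # Escape backslashes and quotes so the definition is a valid literal.
        escaped = c.definition.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'rdfs:comment "{escaped}"@en')
        lines.append(" ;\n    ".join(parts) + " .")
    return "\n".join(lines)


# Illustrative concepts only; not part of the survey.
model = [
    Concept("Substance", "Matter of defined composition."),
    Concept("ActiveIngredient",
            "Substance responsible for the intended effect.",
            broader="Substance"),
]
turtle = to_turtle(model)
print(turtle)
```

In practice, KEs would typically use an ontology editor or an RDF library rather than string formatting. The sketch only shows that the formal model is derived from knowledge that must first exist in some informal form.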


3   Methodology
A survey based on two different questionnaires was conducted, one addressed
to KEs, the other to DEs. The questionnaires consisted of five questions each.
For the purpose of this paper, only 5 of the 10 questions are presented. KEs
were asked when and how they document their work on ontologies and what
information they consider relevant while creating knowledge models. In addition,
they were asked whether fast or resilient results constitute their main goal. DEs
were asked what they expect from ontologies. In both cases, the questions
addressed only the current situation, not the desired ideal state. Therefore, the
survey results do not necessarily reflect an optimal situation. In other words, the
fact that the KEs work quickly and barely document does not mean that they
consider this to be the best solution. It may be an effect of economic constraints
rather than a reasoned decision from a professional perspective.
    In total, three groups were surveyed: (1) KEs and (2) DEs of a global life
science company based in Germany as well as (3) DEs of an international working
group, which are referred to as external DEs. The two questionnaires were
designed to be completed quickly and are therefore relatively simple. They were
sent electronically to people known to act as KEs or DEs. The response rates were
92.9 % for KEs (13 out of 14), 78.6 % for internal DEs (11 out of 14) and
11.6 % for external DEs (5 out of 43). The participation in the survey was
voluntary. As the sample sizes were small for both roles, the survey results do
not claim to represent the entirety of the KEs and DEs in the company or
the external working group. Nevertheless, they provide valuable insights into
corporate reality.
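
The reported response rates can be reproduced from the counts given above. The following short sketch recomputes them; the helper name `response_rate` is illustrative.

```python
def response_rate(responses: int, invited: int) -> float:
    """Return the response rate in percent, rounded to one decimal place."""
    return round(100 * responses / invited, 1)


# (responses, invited) per surveyed group, as reported in the text
groups = {
    "KEs": (13, 14),
    "internal DEs": (11, 14),
    "external DEs": (5, 43),
}

rates = {name: response_rate(r, n) for name, (r, n) in groups.items()}
total_des = groups["internal DEs"][0] + groups["external DEs"][0]

for name, rate in rates.items():
    print(f"{name}: {rate} %")
print(f"DE respondents in total: {total_des}")
```

The total of 16 DE respondents is the base for the DE results in section 4.2.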


4     Findings

4.1   Relevant Information per Concept

Figure 2 shows what information the KEs surveyed consider relevant for each
concept. Each of the answer options offered was selected at least once, and the
respondents added two free text entries. The most frequent choice was definition
or explanation, which is unsurprising, since the meaning of concepts must be
grasped in order to create ontologies. This is further supported by the results
for the answer option context. After all, information on context is needed to
situate a concept in a semantic network. However, the results also show that not
only information directly related to the concept’s meaning is considered relevant.
KEs also take into account related knowledge sources such as related vocabularies
and standards, related data sources, related people, and related literature.


4.2   Expectations towards Ontologies

Figure 3 shows what DEs expect from ontologies for their daily work. Almost all
of the DEs surveyed expected ontologies to work in the background to improve
the interaction between IT systems. More than half of the DEs surveyed, 9 out of
16, expected to be able to work directly with ontologies to gain knowledge about a
domain, complemented, among other free text entries, by the response “I expect
to receive information or explanation of data, which is currently not available”.


4.3   Main Goal

Concerning the main goal pursued by KEs, 69.2 % of the respondents aimed
for fast results (see figure 4). In return, they accepted less perfect knowledge
models.


          Fig. 2. Survey Results KEs: Relevant Information per Concept


4.4   Timepoint of Documentation
Figure 5 shows when KEs start documenting their work on a knowledge model.
Only two of the respondents started the documentation before creating the formal
model. The majority documented during or after the creation of the formal model.
In other words, the documentation usually took place after the conceptualization
phase and thus after the exchange of knowledge between KEs and DEs
(cf. figure 1). One of the KEs surveyed did not create any documentation at all.


4.5   Nature of Documentation
Figure 6 shows how KEs document the exchange with DEs, which takes place
primarily in the course of knowledge acquisition. Although most KEs started
their documentation in connection with the formal model, only 2 out of 13 KEs
documented formally, for instance by using annotation properties. The
other 11 KEs documented the insights they acquired from the exchange with DEs
informally, i.e. in natural language. Of these, a narrow majority of
6 KEs documented in an unstructured way, while the remaining 5 KEs documented in a
structured way, for instance by using templates. According to the results for
this question, all KEs documented the exchange with DEs. This is not fully
consistent with the results regarding the timepoint of documentation, where the
option “I don’t create a documentation” was selected once.
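
The inconsistency between the two questions can be made explicit with a small tally. The sketch below uses only the counts reported in this section and in section 4.4; the variable names are illustrative.

```python
from collections import Counter

# Answers to "nature of documentation", as reported above
nature = Counter({
    "formal": 2,
    "informal, unstructured": 6,
    "informal, structured": 5,
})

respondents = 13          # KEs who took part in the survey
no_documentation = 1      # "I don't create a documentation" (timepoint question)

documenting = sum(nature.values())

# If all 13 KEs document the exchange, nobody should have reported
# creating no documentation at all.
inconsistent = documenting + no_documentation > respondents

print(f"KEs reporting a documentation style: {documenting} of {respondents}")
print(f"Answers inconsistent across questions: {inconsistent}")
```

One possible reading, which the survey data cannot confirm, is that one respondent interpreted the two questions differently.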


           Fig. 3. Survey Results DEs: Expectations towards Ontologies


5   Discussion

According to the survey results, KEs consider information on the meaning of
concepts and the associated knowledge resources to be relevant in the course of
ontology creation. However, most KEs only begin to document during or after
implementation. This means that the conceptualization phase of the ontology
creation process remains largely undocumented. This poses a serious problem
because in this very phase the knowledge considered relevant is acquired. If
the laboriously researched and elicited knowledge is not explicitly recorded, it
remains as tacit knowledge in the mind of the respective KE and is therefore
difficult to access. As a consequence of this documentation gap, collaboration is
impeded and it is more complicated to distribute workload. In addition, there
is a risk of knowledge loss through individual and collective oblivion. Hence,
timely documentation is essential.
    If the documentation gap causes knowledge to be lost, this not only complicates
the work of the KEs, but also jeopardizes the fulfillment of the DEs’ expectations
towards ontologies. After all, they expect to be able to work directly with
ontologies to gain knowledge about a domain. Apart from preserving knowledge,
joint documentation may also make it possible to identify synergies and potential
misunderstandings at an early stage. Moreover, shared documentation is a way to
put definitions of terms up for discussion early enough. Thus, consensual knowledge
as required for the creation of ontologies can already be gathered during
knowledge acquisition. Without shared documentation, definitions are initially


                     Fig. 4. Survey Results KEs: Main Goal


            Fig. 5. Survey Results KEs: Timepoint of Documentation


hidden in the personal notes or mind of a KE, which means that consensus
building may only begin after the publication of the formal knowledge model.
    A possible explanation for the identified documentation gap may be the fact
that the majority of KEs in the life science company under investigation strive
for fast results, probably at the expense of timely documentation.
    With regard to the nature of the documentation, the structured docu-
mentation approach is recommended, as already adopted by some of the KEs
surveyed. Structured documentation or semi-formal documentation is written in
natural language and follows guidelines provided, for instance, by templates.
Hence, the documentation is clear and understandable for both KEs and DEs.
Unstructured documentation, also called informal documentation, by contrast, is
individual and does not follow guidelines, making it ambiguous and heteroge-
neous. Creating formal documentation, which is machine-readable, requires more
effort and specific skills that not all DEs have. Consequently, a joint documen-
tation should neither be formal nor unstructured, but structured and thus easy
to handle for all people involved. [12,13]
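
As a rough illustration of what structured documentation could look like, the following sketch models a per-concept template whose fields follow the information types that the KEs rated as relevant in section 4.1. The class `ConceptRecord`, the helper functions, and the example values are assumptions made for this sketch; they do not reproduce the graphical templates of the documentation concept cited in the conclusion.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ConceptRecord:
    """Hypothetical documentation template for one concept."""
    concept: str
    definition: str = ""
    context: str = ""
    related_vocabularies: List[str] = field(default_factory=list)
    related_data_sources: List[str] = field(default_factory=list)
    related_people: List[str] = field(default_factory=list)
    related_literature: List[str] = field(default_factory=list)


def missing_fields(record: ConceptRecord) -> List[str]:
    """List the template fields that have not been filled in yet."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def render(record: ConceptRecord) -> str:
    """Render the record as natural-language text readable by KEs and DEs."""
    lines = []
    for f in fields(record):
        value = getattr(record, f.name)
        if isinstance(value, list):
            value = ", ".join(value)
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"{label}: {value or 'TODO'}")
    return "\n".join(lines)


# Illustrative entry only; names and content are fictitious.
record = ConceptRecord(
    concept="ActiveIngredient",
    definition="Substance responsible for the intended effect.",
    related_people=["Conan (DE)"],
)
print(render(record))
print("Open fields:", missing_fields(record))
```

A fixed set of fields keeps entries comparable across authors, while the content itself remains natural language, so that DEs can contribute without knowledge of formal ontology languages.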
    To illustrate the described consequences of the documentation gap, two fictitious
scenarios are given below. They are based on personal experiences gained
by the authors while working as KEs for the life science company under inves-
tigation. In Scenario 1, the KEs document too late and insufficiently which, in


                 Fig. 6. Survey Results KEs: Nature of Documentation


our experience, constitutes the prevailing situation in the company. Scenario 2
represents the desired situation in which the challenges that life science compa-
nies face when creating ontologies are addressed by means of joint, timely, and
structured documentation.

Scenario 1 Ina¹ does not document acquired knowledge in a timely manner, so
that she has forgotten some information by the time of implementation (knowledge
loss through individual oblivion). Unfortunately, the knowledgeable colleague
is no longer available due to retirement (knowledge loss through organizational
oblivion). Before she can ask her KE colleague Cora¹, who has not created
any documentation either, she has to wait for Cora to return from vacation
(impeded collaboration). If the DE Conan¹ wants to propose a definition for
a concept, he must first write an e-mail to Ina, as there is no structured
documentation available in which he can enter information directly (impeded
distribution of workload). Ina does not forward Conan’s proposal to the other
DEs, which is why their disagreement with his definition becomes apparent only
after publication of the formal model (delayed consensus building).

Scenario 2 Ina¹, who works as a KE, promptly externalizes knowledge acquired during
the conceptualization phase in the form of structured documentation,
which can be edited remotely by her colleagues. This allows her KE colleague
Cora¹ to see which concepts are already described (collaboration). In addition,
the DE Conan¹ is able to add new definitions directly to the documentation
without having to contact Ina (distribution of workload). Following this, other
DEs can review Conan’s definition proposal and initiate a discussion if necessary
(consensus building). If Ina forgets something or leaves the company, the
documentation can be consulted (knowledge preservation).
¹ The names of the personas are fictitious.

6   Conclusion
In accordance with our personal experience as KEs, the presented survey results
suggest that there is a documentation gap between knowledge acquisition and
knowledge formalization in the process of ontology creation. Among the surveyed
KEs, ontologies are created in various projects for various domains and divisions
by international teams consisting of internal and external employees. At the
same time, collaborations with external working groups take place. As a result,
the challenge is to share knowledge acquired for ontology creation as early as
possible in the process. We recommend addressing this challenge by means of
structured documentation, which is created jointly and in a timely manner by
the KEs and DEs involved. This reduces the risk of knowledge loss while enabling
collaboration and distribution of workload. A solution developed for this purpose
is the documentation concept proposed by Michaelis [15], which enables the
company to overcome the documentation gap by providing guidelines in the form
of graphical templates on what should be documented by whom, how and when.
Further research is needed to determine whether the presented documentation
gap constitutes a phenomenon that is specific to the surveyed KEs or represents
a general pattern in the life science industry.


7   Acknowledgements
This paper is based on a bachelor thesis cooperation between Bayer and the
University of Applied Sciences Potsdam under supervision of Günther Neher.


References
 1. Borst, W.N.: Construction of engineering ontologies for knowledge sharing and
    reuse. PhD thesis, University of Twente, Enschede, NL (1997)
 2. Directorate-General for Research and Innovation: Prompting an EOSC in practice:
    Final report and recommendations of the Commission 2nd High Level Expert Group
    on the European Open Science Cloud (EOSC) (2018). https://doi.org/10.2777/112658
 3. Directorate-General for Research and Innovation: Turning FAIR data into reality:
    Final report and action plan from the European Commission expert group on FAIR
    data (2018). https://doi.org/10.2777/1524
 4. ELIXIR: Interoperability platform (2019), https://elixir-europe.org/platforms/
    interoperability
 5. Fernández, M., Gómez-Pérez, A., Juristo, N.: Methontology: From ontological art
    towards ontological engineering. In: Farquhar, A. (ed.) Ontological engineering,
    pp. 33–40. Technical report / American Association for Artificial Intelligence SS,
    AAAI Press, Menlo Park, Calif. (1997)
 6. Fernández-López, M., Gómez-Pérez, A.: Overview and analysis of methodologies
    for building ontologies. The Knowledge Engineering Review 17(02) (2002).
    https://doi.org/10.1017/S0269888902000462
 7. GO FAIR: Strategy (2018), https://www.go-fair.org/go-fair-initiative/strategy/
 8. Gruber, T.R.: A translation approach to portable ontology specifications. Knowl-
    edge Acquisition 5(2), 199–220 (1993). https://doi.org/10.1006/knac.1993.1008
 9. Grüninger, M., Fox, M.S.: Methodology for the design and evaluation of ontologies.
    Workshop on Basic Ontological Issues in Knowledge Sharing: International Joint
    Conference on Artificial Intelligence (1995)
10. HL7.org: FHIR Release 3 (STU): FHIR Linked Data Module (2019),
    http://hl7.org/fhir/STU3/linked-data-module.html
11. Kent Innovation Centre: PhUSE working groups: Linked data & graph databases
    (2017), https://www.phuse.eu/linked-data-graph-databases
12. Landes, D., Schneider, K., Houdek, F.: Organizational learning and experience
    documentation in industrial software projects. International Journal of Human-
    Computer Studies 51(3), 643–661 (1999). https://doi.org/10.1006/ijhc.1999.0280
13. Lehmann, A.: A documentation approach for higher education. In: Proceedings
    of 2018 IEEE Global Engineering Education Conference (EDUCON). pp. 43–50.
    IEEE, Piscataway, NJ (2018). https://doi.org/10.1109/EDUCON.2018.8363207
14. Lourdusamy, R., John, A.: A review on metrics for ontology evalua-
    tion. In: Proceedings of the Second International Conference on Inven-
    tive Systems and Control (ICISC 2018). pp. 1415–1421. IEEE (2018).
    https://doi.org/10.1109/ICISC.2018.8399041
15. Michaelis, M.: Documentation concept for the exchange of knowledge in the process
    of creating ontological knowledge models. Bachelor thesis, University of Applied
    Sciences Potsdam (2019), https://nbn-resolving.org/urn:nbn:de:kobv:525-23611
16. Milton, N.R.: Knowledge acquisition in practice: A step-by-step guide. De-
    cision Engineering, Springer, London and Berlin and Heidelberg (2007).
    https://doi.org/10.1007/978-1-84628-861-6
17. de Nicola, A., Missikoff, M., Navigli, R.: A software engineering ap-
    proach to ontology building. Information Systems 34(2), 258–275 (2009).
    https://doi.org/10.1016/j.is.2008.07.002
18. Semantic Web Applications and Tools for Healthcare and Life Sciences: About
    (2019), http://www.swat4ls.org/about/
19. Simperl, E., Luczak-Rösch, M.: Collaborative ontology engineering: a
    survey. The Knowledge Engineering Review 29(01), 101–131 (2014).
    https://doi.org/10.1017/S0269888913000192
20. Studer, R., Benjamins, V.R., Fensel, D.: Knowledge engineering: Princi-
    ples and methods. Data & Knowledge Engineering 25(1-2), 161–197 (1998).
    https://doi.org/10.1016/S0169-023X(97)00056-6
21. Suárez-Figueroa, M.C., Gómez-Pérez, A.: Ontology requirements specification. In:
    Suárez-Figueroa, M.C., Gómez-Pérez, A., Motta, E., Gangemi, A. (eds.) Ontology
    Engineering in a Networked World, pp. 93–106. Springer Berlin Heidelberg, Berlin,
    Heidelberg (2012). https://doi.org/10.1007/978-3-642-24794-1_5
22. Suárez-Figueroa, M.C., Gómez-Pérez, A., Fernández-López, M.: The NeOn methodology
    framework: A scenario-based methodology for ontology development. Applied
    Ontology 10(2), 107–145 (2015). https://doi.org/10.3233/AO-150145
23. W3C: Semantic web in health care and life sciences community group (2019),
    https://www.w3.org/community/hclscg/
24. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J.J., et al.: The FAIR Guiding
    Principles for scientific data management and stewardship. Scientific Data 3 (2016).
    https://doi.org/10.1038/sdata.2016.18
25. Wise, J., de Barron, A.G., Splendiani, A., et al.: Implementation and relevance
    of FAIR data principles in biopharmaceutical R&D. Drug Discovery Today (2019).
    https://doi.org/10.1016/j.drudis.2019.01.008