<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cleaning up a legacy thesaurus to make it fit for transformation into a Semantic Web KOS</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Informationsbibliothek</institution>
          ,
          <addr-line>Welfengarten 1B, 30167 Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Legacy knowledge organization systems (KOS), such as domain-specific thesauri that were created and maintained as print versions before being digitized, typically suffer from rather heterogeneous quality in terms of their structural consistency and interconnectivity. We take the domain thesaurus "Technik und Management" (TEMA), which originated from the merging of six print thesauri, highlight some exemplary structural challenges resulting from this merge and from various additions made since its creation, and suggest how to align it with the principles of the Semantic Web by using corresponding standards such as SKOS, with a special focus on the existing and potential relations between its concepts and terms. More specifically, we attempted to transform a subset of the thesaurus into an ontology and then realized that we would first have to improve the overall structure of the thesaurus before we could proceed. During the process, we also took into account the concerns of the subject experts at WTI, who are in charge of maintaining the thesaurus on a daily basis.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge organization systems</kwd>
        <kwd>legacy thesauri</kwd>
        <kwd>structural enhancement</kwd>
        <kwd>semantic relations</kwd>
        <kwd>Semantic Web standards</kwd>
        <kwd>SKOS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Point of departure</title>
      <sec id="sec-1-1">
        <title>The domain thesaurus "Technik und Management" (TEMA)</title>
        <p>The domain thesaurus "Technik und Management" (technical and management
topics; TEMA) started out as a product of the cooperative WTI-Frankfurt eG
(formerly "Fachinformationszentrum (FIZ) Technik", founded on an
initiative of the German government). Since 2013 it has been developed further in
collaboration with the Leibniz Information Centre for Science and Technology (TIB).</p>
        <p>The thesaurus was created by manually joining five print thesauri (on topics
such as mechanical, electrical, medical, materials and textile engineering,
electronics, and information technology); the contents of a sixth (naval architecture) were
added later on. The first printed version, with about 34 000 concepts and 80 000
terms, was issued in 1998, and the fourth and last, with 34 900 concepts and 97 800
terms, in 2003/4, when it was digitized. As of 2018, the thesaurus comprises about
57 000 concepts and 197 000 terms, both German and English, and is one of
the largest for the topics it covers. The terminological material is curated and
extended by the subject experts working at WTI on a day-to-day basis.</p>
        <p>The TEMA thesaurus is maintained on the Averbis Terminology Platform
(ATP) in a proprietary format, and its contents can be exported as a .txt file
(minus a few fields that are deemed relevant to the subject experts of WTI only).
The ATP features a layered system of access rights (read; propose; edit proposals;
edit, accept proposals, import/export, create subterminologies; admin) to
control editing. The clients of WTI mainly use the thesaurus
in combination with the domain-specific document databases provided by
WTI, as an indexing base for the associated search engine (TecFinder).</p>
      </sec>
      <sec id="sec-1-2">
        <title>Project "Fachontologie Technik" (2013–2017)</title>
        <p>
          The project "Fachontologie Technik" ("domain ontology for technical subjects")
was started in 2013 as a joint project of TIB and WTI with the goal of enriching
the TEMA thesaurus in various ways in order to take it to the next level of
digital development: for example, by transferring it to the ATP, by introducing
English terms and thus making the thesaurus bilingual throughout, by evaluating
term extraction tools that would help add more content, and by aligning it with
concepts from the German authority file ("Gemeinsame Normdatei"; GND) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
An additional goal was to increase the interoperability of the TEMA thesaurus
with other knowledge organization systems (KOS) by transforming it into a KOS
that complies with the established Semantic Web standards, models and
description languages such as RDF, RDF Schema, SKOS and/or OWL (depending on
the use case and, consequently, the required level of formalization).
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Steps towards a sandbox ontology for electric mobility</title>
      <p>In line with the project goal formulated in the previous section, the author
implemented a script (in both Perl and Python) that transforms the .txt output of
the ATP into a hybrid OWL/SKOS ontology serialized in Turtle syntax. SKOS covers the
classical thesaurus relations between concepts and their natural-language labels,
such as hyperonymy and synonymy, whereas OWL allows for a more formal,
set-theoretic approach and for additional logical rules and constraints. Since, on the one hand, a domain ontology should be rather concise if
it is to have potential for reasoning applications (see also Section 4) and, on the
other, both TIB and WTI had ongoing projects on mobility and transportation,
we chose electric mobility as an exemplary domain. A corresponding subset
of TEMA concepts was extracted by subject experts of WTI and provided in a
separate .txt file. We also integrated the "TEMA Fachordnung" into the
ontology; this is a system of classification codes for the TEMA subjects that can
be assigned to concepts on the ATP via a special relation.</p>
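      <p>The kind of transformation described above can be illustrated with a minimal sketch. It is not the author's script: the column layout of the export (concept ID, German label, English label, superconcept ID), the namespace http://example.org/tema/ and the sample rows are assumptions made purely for illustration, since the ATP export format is not specified here. Typing each concept as both skos:Concept and owl:Class is one possible reading of a "hybrid" ontology.</p>
      <preformat preformat-type="python"><![CDATA[
```python
import csv
import io

# Hypothetical namespace; the real TEMA URI scheme is not given in the text.
BASE = "http://example.org/tema/"

PREFIXES = [
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    f"@prefix tema: <{BASE}> .",
]


def escape(literal):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_turtle(text):
    """Convert tab-separated rows into Turtle.

    Assumed (hypothetical) columns: concept ID, German label,
    English label, ID of the superconcept (may be empty).
    """
    blocks = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        cid, label_de, label_en, broader = row
        statements = [
            f"tema:{cid} a skos:Concept, owl:Class",
            f'    skos:prefLabel "{escape(label_de)}"@de, "{escape(label_en)}"@en',
        ]
        if broader:
            # Thesaurus view (SKOS) and class-hierarchy view (OWL) side by side.
            statements.append(f"    skos:broader tema:{broader}")
            statements.append(f"    rdfs:subClassOf tema:{broader}")
        blocks.append(" ;\n".join(statements) + " .")
    return "\n".join(PREFIXES) + "\n\n" + "\n\n".join(blocks) + "\n"


# Invented sample rows, not taken from TEMA.
SAMPLE = "c1\tFahrzeug\tvehicle\t\nc2\tElektrofahrzeug\telectric vehicle\tc1\n"
turtle = rows_to_turtle(SAMPLE)
print(turtle)
```
]]></preformat>
      <p>Writing the Turtle by hand keeps the sketch free of dependencies; a production script would more likely build the graph with an RDF library and let it handle serialization and escaping.</p>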
      <p>
        The rationale for creating a hybrid ontology was the following. On the one
hand, we wanted to keep and enhance the characteristics of a high-quality
thesaurus, such as the distinction between concepts and terms, which had only
recently been introduced in the TEMA thesaurus by assigning separate IDs to
concepts and to each of their labelling terms (preferred, alternative, hidden). Such a
distinction can be implemented by using SKOS-XL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a standardized extension
of SKOS which conceptualizes terms as separate entities instead of mere strings,
so that they can be described by metadata and stand in relationships with other
entities themselves. Another SKOS extension we considered using is iso-thes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which aims at ensuring compliance with ISO 25964 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] on the design of thesauri and
on ways to ensure interoperability between them, for example by allowing
more complex relations between concepts such as compound equivalence.<sup>1</sup> We
wanted to comply as far as possible with those two standards in order to
maximize the potential of the result as a high-quality knowledge organization system.
      </p>
      <p><sup>1</sup> As an example, the English concept labelled by the term "pollution" is equivalent to
a concatenation of the German concepts labelled by "Umwelt" and "Verschmutzung".</p>
      <p>On the other hand, since according to the thesaurus manager at WTI the
hyperonym relation in the TEMA thesaurus (unlike in the majority of legacy
thesauri) is in large parts a proper subclass relation, in this first experiment
we wanted to make the hierarchy formed by the TEMA concepts visible by
interpreting it as an OWL class hierarchy and visualizing it in the Protégé editor.
Work on this first sandbox ontology provided various insights into the structure
of the thesaurus, for example the fact that, although the ATP visualizes the
terminology net in tree form, some concepts do have more than one superconcept,
so that identical subtrees are displayed multiple times
(Fig. 2). Polyhierarchies in thesauri are not prohibited as such, but it is important
to be aware of them when trying to visualize or improve their structure.</p>
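      <p>Becoming aware of polyhierarchies can be as simple as counting superconcepts per concept. The following sketch assumes that the hierarchy is available as (concept, superconcept) pairs; the function name and the sample labels are hypothetical and not taken from TEMA.</p>
      <preformat preformat-type="python"><![CDATA[
```python
from collections import defaultdict


def find_polyhierarchies(broader_pairs):
    """Return {concept: sorted superconcepts} for concepts with more than one superconcept."""
    parents = defaultdict(set)
    for concept, superconcept in broader_pairs:
        parents[concept].add(superconcept)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


# Invented sample data: "battery" sits under two superconcepts.
pairs = [
    ("battery", "energy storage"),
    ("battery", "vehicle component"),
    ("electric motor", "vehicle component"),
]
multi = find_polyhierarchies(pairs)
print(multi)
```
]]></preformat>
      <p>Every concept reported this way is the root of a subtree that a tree-shaped display has to show once per superconcept.</p>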
      <p>However, the analysis also revealed minor and major flaws in the structure of
the thesaurus. Some minor flaws were probably introduced during the manual
merge of the six thesauri that served as sources for the TEMA thesaurus, for
example the one in Fig. 2 on the right, where two concepts both have a
subconcept that logically should be their superconcept because it subsumes them both.
Therefore, it was decided to dedicate a subproject to cleaning up the thesaurus,
with a focus on the (poly-)hierarchy but also with some preliminary
considerations towards more expressive relations between concepts.</p>
    </sec>
    <sec id="sec-3">
      <title>A top-level structure for the TEMA thesaurus</title>
      <p>The structural analysis of the TEMA thesaurus had also shown that there were
approximately 2 200 concepts without a superconcept. Since the ATP visualizes the
terminology net in tree form, all those concepts are shown as direct children of an
artificial mother node. Consequently, we decided to provide the thesaurus with
a top-level ("roof") structure with no more than 40–50 branches per level and
between 3 and 10 levels that would allow subject-driven access to its contents.
The long-term objective is to enable visual, explorative navigation in
search portals for resources that have been indexed with the thesaurus.</p>
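      <p>Both observations in this paragraph lend themselves to simple checks. The sketch below lists concepts without a superconcept and measures the depth and the largest number of children of a candidate top-level tree. It assumes a cycle-free tree given as a mapping from each node to its children, and it reads "branches per level" as children per node, which is an interpretation rather than something stated in the text; all names and sample data are hypothetical.</p>
      <preformat preformat-type="python"><![CDATA[
```python
def concepts_without_superconcept(concepts, broader_pairs):
    """Return the concepts that never occur as the child in a (child, parent) pair."""
    has_parent = {child for child, _ in broader_pairs}
    return sorted(c for c in concepts if c not in has_parent)


def tree_stats(children, root):
    """Return (levels, widest): number of levels including the root,
    and the largest number of children found under any single node."""
    levels, widest = 0, 0
    level = [root]
    while level:
        levels += 1
        next_level = []
        for node in level:
            kids = children.get(node, [])
            widest = max(widest, len(kids))
            next_level.extend(kids)
        level = next_level
    return levels, widest


# Invented sample data.
orphans = concepts_without_superconcept(["a", "b", "c"], [("c", "a")])
stats = tree_stats({"root": ["a", "b"], "a": ["c"]}, "root")
print(orphans, stats)
```
]]></preformat>
      <p>The resulting numbers can then be compared with the targets named above, for example at most 50 children per node and between 3 and 10 levels.</p>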
      <p>Initially, we identified several candidates that could serve as a primary source
for such a top-level structure, among them the International Patent Classification
(IPC), the Dewey Decimal Classification (DDC), (a subset of) the GND
classification, and the "TEMA Fachordnung" (see Section 2). These were
visualized by the author as collapsible trees using JSON and the JavaScript library
D3, and provided to the subject experts of WTI for inspiration. In a next step,
each subject expert of WTI came up with their own proposal, and the
resulting structures were then merged manually by the author and visualized again
(the merge can be explored at https://ontologie.tib.eu/toplevelkand/sturmerge.html).
Currently, the subject experts of WTI are collaboratively refining and expanding
the latest candidate structure.</p>
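      <p>Collapsible-tree visualizations in D3 are typically fed with nested JSON in which every node carries a name and an optional list of children. How the author's trees were actually generated is not described; the following is only a sketch of one way to produce such JSON from a parent-to-children mapping, with invented category names.</p>
      <preformat preformat-type="python"><![CDATA[
```python
import json


def build_tree(name, children):
    """Recursively build the nested {"name", "children"} structure
    that d3.hierarchy accepts with its default children accessor."""
    node = {"name": name}
    kids = children.get(name, [])
    if kids:
        node["children"] = [build_tree(kid, children) for kid in kids]
    return node


# Invented sample classification, not one of the candidate structures.
children = {
    "Engineering": ["Electrical engineering", "Mechanical engineering"],
    "Electrical engineering": ["Electric drives"],
}
tree = build_tree("Engineering", children)
tree_json = json.dumps(tree, indent=2, ensure_ascii=False)
print(tree_json)
```
]]></preformat>
      <p>The recursion assumes that the mapping describes a tree without cycles; a polyhierarchical source would simply repeat the shared subtree under each parent.</p>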
      <p>Throughout the project, TIB and WTI hold regular workshops
to discuss issues that come up while working on the top-level structure. These include
questions such as how to integrate this new structure into the actual thesaurus
(physically or virtually) and which system best to use to maintain the thesaurus
and its top-level structure in the future, but also very fundamental questions of
thesaurus and ontology design: the proper division into categories, the degree of
formality to choose, and the set of relations to use for our specific purposes (for
example, to make the thesaurus a suitable source of concepts for smaller, more
formalized domain ontologies). We discuss the latter in the next section.</p>
    </sec>
    <sec id="sec-4">
      <title>Relations for the TEMA thesaurus</title>
      <p>A major issue for the subject experts of WTI when creating the top-level
structure was the question of which rationale to use to build the underlying hierarchy.
Since we had decided on a strict monohierarchy for the top-level structure, each
concept would have one superconcept only. As a consequence, should subareas
of chemistry such as petrochemistry be subsumed under chemistry or, given the applied
character of TEMA, under the areas where they are applied, e.g., the oil industry?
And since the top-level structure was meant to facilitate access by
subject, how would users with different approaches find their respective areas of
interest? This leads to the question of how to implement what is known in the knowledge
organization area as "facets". As of now, such facets are implemented
in the TEMA thesaurus by additional cross-sectional concepts such as "part of
a machine", "application (general)", etc., which is not ideal since it suggests an
underlying hierarchical principle rather than the intention of a facet. A future
solution should allow the extraction of subsets of concepts from the thesaurus
that share a common aspect but cannot be obtained by selecting a single subtree.</p>
      <p>
        One approach that we propose is to transfer the thesaurus as a whole into the
SKOS format and to exploit the resulting possibilities. Although the ATP does
have a SKOS export functionality, the output is rather unsatisfactory (missing
fields, no SKOS-XL support). The author has programmed an alternative to fix
this, but it only transfers concepts one by one from .txt into SKOS-XL. However,
SKOS provides constructs that can introduce more structure into a Linked Data
thesaurus: SKOS concept schemes and SKOS collections [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        A SKOS concept scheme is an aggregation of concepts sharing a common
topic (also called a "microthesaurus" [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]); via the relation
skos:hasTopConcept, one or more upper concepts can be specified within the scheme. Thus,
this would allow us to present the TEMA thesaurus as a cohesive SKOS
vocabulary with one or more hierarchical structures in it. SKOS collections, on the
other hand, are intended to represent groupings of concepts that share a certain
(ideally non-trivial) aspect and are thus additional organizational features
orthogonal to the concept hierarchy. An example of such a collection given in the
SKOS Primer [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is "milk by source animal", an aspect which is not likely to be
chosen as a top concept for a hierarchy but might nevertheless be relevant in some
search scenario. SKOS collections are flexible enough to represent both
subdivisions of concepts with the same superconcept, e.g., "milk" (also called "thesaurus
array" [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]), and facets, i.e., groupings of concepts sharing some aspect which may
not necessarily be their superconcept (e.g., "everything related to photography":
"photographer", "tripod", "camera"). Accordingly, we could define SKOS
collections for "theoretical foundations that are applied in the oil industry",
"theoretical foundations that are applied in the mining industry", etc., and make them
accessible as facets to search engines in portals for resources that are indexed
with the TEMA thesaurus. Since these measures towards a better structuring
of the thesaurus have not been undertaken yet, they are an obvious next step.
      </p>
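      <p>As a rough illustration of the two constructs, the following sketch emits a concept scheme with top concepts and a collection acting as a facet, both in Turtle. The namespace http://example.org/tema/, the identifiers and the member concepts are hypothetical, and labels are assumed to contain no quotation marks or backslashes; only the SKOS terms themselves (skos:ConceptScheme, skos:hasTopConcept, skos:Collection, skos:member) come from the standard.</p>
      <preformat preformat-type="python"><![CDATA[
```python
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix tema: <http://example.org/tema/> .\n"  # hypothetical namespace
)


def scheme_to_turtle(scheme_id, label, top_concept_ids):
    """Describe a concept scheme and the top concepts of its hierarchy."""
    tops = ", ".join(f"tema:{c}" for c in top_concept_ids)
    return (
        f"tema:{scheme_id} a skos:ConceptScheme ;\n"
        f'    skos:prefLabel "{label}"@en ;\n'
        f"    skos:hasTopConcept {tops} .\n"
    )


def collection_to_turtle(collection_id, label, member_ids):
    """Describe a collection, i.e. a grouping that cuts across the hierarchy."""
    members = ", ".join(f"tema:{m}" for m in member_ids)
    return (
        f"tema:{collection_id} a skos:Collection ;\n"
        f'    skos:prefLabel "{label}"@en ;\n'
        f"    skos:member {members} .\n"
    )


# Invented identifiers; the members may live in different subtrees.
document = "\n".join([
    PREFIXES,
    scheme_to_turtle("topLevel", "TEMA top-level structure", ["chemistry", "mechanics"]),
    collection_to_turtle(
        "oilIndustryFoundations",
        "Theoretical foundations that are applied in the oil industry",
        ["petrochemistry", "fluidMechanics"],
    ),
])
print(document)
```
]]></preformat>
      <p>Because membership in a collection is independent of skos:broader, the same concept can keep its single superconcept in the monohierarchy and still appear in any number of facets.</p>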
      <p>
        Another goal of the project "Fachontologie Technik", which we continue to pursue,
is to evaluate whether the TEMA thesaurus can serve as a base for the development
of more formalized domain ontologies that allow for scientific applications such
as reasoning and question answering (see also Section 2). It soon became clear that
we cannot transform the thesaurus as a whole into an ontology, since reasoning
in big ontologies generally involves too many steps and relies on the exactness
of too many relations, so that the result is likely to be unreliable. Moreover,
polyhierarchical ontologies are more difficult to maintain, and more care has to
be devoted to making sure that all inheritances remain correct when changing them.
However, it is possible to use thesauri as "quarries" from which to extract subsets
of concepts, enrich them with more expressive relations, axioms, and logical rules,
and thus transform them into concise ontologies for a certain domain (see for
example [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]). In order to transform the TEMA thesaurus into such a source,
we would have to identify relations that can be established in a meaningful
way between the concepts (such as isApplicationOf, isMachinePartOf). The
subject experts of WTI are collecting such relations during their work on the
structure of the thesaurus, for future reference and integration.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Summary and Outlook</title>
      <p>
        It is desirable to transfer as much domain knowledge as possible from legacy
vocabularies into well-structured KOS in compliance with current Semantic Web
standards so that, instead of getting lost, it can be used both in indexing and search
applications (in the form of a thesaurus) and in inferencing and question-answering
scenarios (in the form of ontologies). However, such a transformation is also a challenge: it
requires a thorough analysis of the terminological material so that historically
grown structural peculiarities can be eliminated before they become an
obstacle for the target system and its applications. A next step would be to continue
cleaning up and enriching the TEMA thesaurus using state-of-the-art thesaurus
engineering tools (e.g., VocBench3 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) in order to exploit its maximal potential
with respect to Semantic Web applications and alignment with other KOS.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bernauer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehlberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Runnwerth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
          </string-name>
          , G.:
          <article-title>Towards a comprehensive knowledge organisation system for the engineering domain</article-title>
          .
          <source>Slide presentation at the Workshop on Classification and Subject Indexing in Library and Information Science</source>
          (LIS'
          <year>2015</year>
          ),
          <source>in conjunction with the European Conference on Data Analysis (ECDA)</source>
          (
          <year>2015</year>
          ). Available from https://publikationen.bibliothek.kit.edu/1000049929
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>SKOS Simple Knowledge Organization System Primer</article-title>
          , https://www.w3.org/TR/skos-primer/. Last accessed 30 May 2018
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>SKOS Simple Knowledge Organization System eXtension for Labels (SKOS-XL) Namespace Document</article-title>
          , https://www.w3.org/TR/skos-reference/skos-xl.html. Last accessed 30 May 2018
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <article-title>ISO 25964 SKOS extension (iso-thes)</article-title>
          , https://lov.okfn.org/dataset/lov/vocabs/iso-thes. Last accessed 30 May 2018
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <article-title>ISO 25964: the international standard for thesauri and interoperability with other vocabularies</article-title>
          , https://www.niso.org/schemas/iso25964. Last accessed 30 May 2018
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kless</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lindenthal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiebensohn</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A method for re-engineering a thesaurus into an ontology</article-title>
          . In: Donnelly,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Guizzardi</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of the Seventh International Conference (FOIS</source>
          <year>2012</year>
          ), pp.
          <fpage>133</fpage>
          –
          <lpage>146</lpage>
          , IOS Press (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Nowroozi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirzabeigi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sotudeh</surname>
          </string-name>
          , H.:
          <article-title>The comparison of thesaurus and ontology: Case of ASIS&amp;T web-based thesaurus and designed ontology</article-title>
          ,
          <source>Library Hi Tech</source>
          (
          <year>2018</year>
          ). https://doi.org/10.1108/LHT-03-2017-0060
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. VocBench homepage, http://vocbench.uniroma2.it/. Last accessed 30 May 2018
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>