<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DSAT: Ontology-based Information Extraction on Technical Data Sheets</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>DLR Institute of Data Science</institution>
          ,
          <addr-line>Mälzerstraße 3, 07745 Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kobkaew Opasjumruskit</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Current engineering design processes often involve transferring information from manufacturer-provided data sheets into domain-specific design tools. Since most data sheets are provided only as PDF files, this remains a tedious, manual task. This paper presents the Data Sheets Annotation Tool (DSAT), which assists engineers in gathering the information required in the design process. Using an Ontology-Based Information Extraction (OBIE) method, the properties of components are extracted from data sheets and subsequently presented in an integrated, web-based interface. Engineers can then review and correct these automatic annotations before exporting them for further use. In the demonstration, we employ a real-world use case rooted in model-based space-system engineering. We show how the automated process can extract relevant component attributes from technical data sheets and how users can revise the results. We further highlight the impact of the content and quality of the underlying ontologies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        An important, recurring task in many engineering projects is to gather
information about components to be used. These are described by their physical
properties (e.g., spatial dimensions or mass) and the interfaces they provide or
require (e.g., propelling force or power consumption). The acquired descriptions
are subsequently fed into domain-specific design tools like Virtual Satellite [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
and allow engineers to compare, combine, and adjust the components based on
the project’s requirements.
      </p>
      <p>The information required is rarely available in machine-processable form;
it is usually provided as PDF files (see Fig. 1 for examples). Engineers are
required to obtain the data sheets of interest, scan these files, and manually copy
the important pieces of information to their respective design tools. Not only is
this process tedious and repetitive, it is also error-prone and time-consuming.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Existing information extraction approaches can alleviate this task. In
particular, Ontology-Based Information Extraction (OBIE) has proven itself a valuable
tool for converting text into machine-readable structures, as witnessed by [
        <xref ref-type="bibr" rid="ref10 ref5 ref7">5,7,10</xref>
        ].
However, some drawbacks remain for engineering projects. First, most tools are
tailored to extract entities and their relationships, whereas the main information to
be extracted from data sheets consists of key-value(-unit) tuples. Second, the vocabulary
used in data sheets is highly domain-specific and generally not used consistently.
Thus, OBIE approaches that rely on general-purpose ontologies are bound to
fail here. Finally, incorrectly extracted information is not tolerable. Depending
on the scope of the project, an erroneous value for some property can have fatal
and very costly consequences later in the development process.
      </p>
      <p>
        In the following we present the Data Sheets Annotation Tool (DSAT) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It
provides a human-in-the-loop interface for the extraction of technical properties
from PDF files. Each data sheet is initially processed by an OBIE-pipeline to
automatically detect relevant properties of the components described. For this
purpose, we employ domain-specific ontologies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] targeted at the specific types
of possible components. The results are presented to engineers, who can review,
amend, or remove the automatically created annotations directly on the PDF
file before exporting them to their respective design tool.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Sheets Annotation Tool (DSAT)</title>
      <p>DSAT is a web-based application that displays data sheets and allows
engineers to annotate attributes that will be used by different tools<sup>1</sup>. As shown in
Fig. 2, it allows users to upload and select their data sheets (area I). The selected
data sheet is displayed in the left panel (area II), where attributes can be highlighted.
The highlighted text can be categorized as a key, a value, or a remark. Engineers
can add, edit, or delete key-value pairs in the control form (area III). Users can also
add custom text in a comment box. One set of a key, a value, a remark, and a
comment is denoted as an annotation.</p>
      <p><sup>1</sup> A link to the demo video: https://zenodo.org/record/4034478/files/DSAT-screen-record.mp4</p>
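      <p>The annotation described above (a key, a value, a remark, and a comment) can be pictured as a small record type. The following is a minimal sketch only: the class name, field names, export function, and sample values are illustrative assumptions and are not taken from the DSAT code base.</p>
      <preformat><![CDATA[
```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Annotation:
    """One annotation: a key, a value, an optional remark, an optional comment."""
    key: str
    value: str
    remark: Optional[str] = None
    comment: Optional[str] = None


def export_annotations(annotations):
    """Serialize annotations to JSON, e.g. for hand-over to a design tool."""
    return json.dumps([asdict(a) for a in annotations], indent=2)


# Invented sample values, for illustration only.
annotations = [
    Annotation(key="Mass", value="350 g", remark="without baffle"),
    Annotation(key="Power consumption", value="1.5 W", comment="checked by hand"),
]
exported = export_annotations(annotations)
```
]]></preformat>
      <p>Keeping the remark and the comment optional mirrors the description above, where a key-value pair is the core of an annotation and the other two fields only add context.</p>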
      <p>Manual highlighting is needed when there is no initial knowledge of the data
sheet’s domain. If a domain-specific ontology is available, DSAT calls upon a
server-side information extraction pipeline to automatically detect key-value
pairs in data sheets, and the results are highlighted in area II. However,
these detected values should be reviewed and edited by domain experts before
being used further. All annotations, whether created manually or automatically,
are summarized in the bottom right panel (area IV).</p>
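      <p>To illustrate how ontology labels can drive the detection of key-value(-unit) tuples, the sketch below matches text lines against a small label dictionary. It is a simplified stand-in and not the server-side pipeline used by DSAT: the ontology fragment, the regular expression, and the sample lines are all assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical ontology fragment: attribute -> labels that may appear in data sheets.
ONTOLOGY_LABELS = {
    "mass": ["mass", "weight"],
    "power_consumption": ["power consumption", "power"],
}

# A number followed by an optional unit token, e.g. "350 g" or "1.5 W".
VALUE_PATTERN = re.compile(r"(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-zµ°%]+)?")


def extract_key_values(lines, ontology_labels):
    """Return (attribute, value, unit) tuples for lines starting with a known label."""
    results = []
    for line in lines:
        lowered = line.lower()
        for attribute, labels in ontology_labels.items():
            # Try longer labels first so "power consumption" wins over "power".
            ordered = sorted(labels, key=len, reverse=True)
            label = next((c for c in ordered if lowered.startswith(c)), None)
            if label is None:
                continue
            match = VALUE_PATTERN.search(line[len(label):])
            if match:
                results.append(
                    (attribute, float(match.group("value")), match.group("unit"))
                )
            break  # the line is attributed to the first matching attribute
    return results


# Invented sample lines, for illustration only.
sample_lines = ["Weight: 350 g", "Power consumption 1.5 W", "Field of view: wide"]
extracted = extract_key_values(sample_lines, ONTOLOGY_LABELS)
```
]]></preformat>
      <p>The third sample line is skipped because no label in the dictionary matches it, which is the situation in which manual highlighting or an enriched ontology is needed.</p>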
      <p>DSAT was evaluated by target users who are involved in the satellite
development process. The evaluation results were collected qualitatively. The
overall feedback is that DSAT is intuitive to use. However, the evaluation also
revealed some areas for further improvement, such as supporting a wider range of
data sheet formats or the ability to assign a single area of text to multiple
annotations.</p>
      <p>The manual annotations collected during this evaluation were also used to
evaluate the automatic extraction by OBIE. The envisioned workflow will be
presented to a wider audience to collect additional feedback and adapt it to its
users’ needs.</p>
      <p>
        To cope with the heterogeneity of the terms we want to search for in the data
sheets, we use an Ontology-Based Information Extraction (OBIE) approach.
The accuracy of the automatically extracted results depends highly on the domain-specific
ontology. For example, when processing a star sensor data sheet, an ontology
describing the specific attributes of a star sensor has to be provided. Our initial
ontologies can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Nevertheless, the terminology used by different
manufacturers varies widely. Creating a comprehensive list of the terms in use from
scratch is an almost insurmountable task given the constant evolution of the
field.
      </p>
      <p>
        To cope with this semantic challenge, we include external knowledge bases
such as Wikidata [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and WordNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to expand our initial ontologies and
disambiguate occurring terms<sup>2</sup>.
      </p>
      <p>We use Wikidata to find entities corresponding to occurring terms. The
entities returned from Wikidata contain semantic information such as alternative labels,
descriptions, superclasses, and similar entities. This information is inserted into
the initial ontology, so that it can be used to extend the scope when searching
for attributes in the data sheets.</p>
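      <p>The enrichment step can be sketched as merging an entity's labels into an ontology entry. The example below works on a hand-written record rather than on a live Wikidata response; the dictionary layout, the function name, and the sample entity are assumptions for illustration and do not reflect the actual Wikidata JSON schema or the DSAT implementation.</p>
      <preformat><![CDATA[
```python
def enrich_ontology(ontology, attribute, entity):
    """Merge an entity's label, aliases, and description into an ontology entry."""
    entry = ontology.setdefault(attribute, {"labels": [], "description": None})
    known = {label.lower() for label in entry["labels"]}
    for label in [entity["label"], *entity.get("aliases", [])]:
        # Compare case-insensitively so "Mass" does not duplicate "mass".
        if label.lower() not in known:
            entry["labels"].append(label)
            known.add(label.lower())
    if entry["description"] is None:
        entry["description"] = entity.get("description")
    return ontology


# Hypothetical initial ontology and hand-written entity record.
ontology = {"mass": {"labels": ["mass"], "description": None}}
entity = {
    "label": "mass",
    "aliases": ["weight", "Mass", "inertial mass"],
    "description": "illustrative description text",
}
enrich_ontology(ontology, "mass", entity)
```
]]></preformat>
      <p>After the call, the entry for the attribute carries the additional labels, so a later search over the data sheet can also match lines that use one of the alternative terms.</p>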
      <p>If multiple entities referring to different concepts are returned from
Wikidata, we use the context of the data sheets and definitions from the lexical
database WordNet to disambiguate between them. WordNet encapsulates the
different meanings of a word in one synset each. Each synset is composed of a
definition, examples of usage, and relations to other synsets such as synonyms or
hyponyms. First, we collect a set of domain-representing keywords from the corpus
of data sheets and retrieve the corresponding synsets from WordNet. If there are
multiple synsets for some keywords, the most coherent subset with respect to a
semantic relatedness measure is chosen (in our experiments we used Wu-Palmer
Similarity, but others will be explored in the future). These synsets now provide
enough context to disambiguate between the candidate entities of Wikidata by
comparing the respective textual descriptions. The selected entities are then used
to enrich the ontologies.</p>
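      <p>The selection of a coherent set of senses can be illustrated with Wu-Palmer similarity, which for two nodes a and b with lowest common subsumer c is 2 * depth(c) / (depth(a) + depth(b)). The sketch below applies it to a tiny hand-made taxonomy instead of WordNet and reads "most coherent" as the combination of one sense per keyword with the highest summed pairwise similarity. Both the taxonomy and this reading are assumptions for illustration, and the exhaustive search is only practical for a handful of keywords.</p>
      <preformat><![CDATA[
```python
from itertools import combinations, product

# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "entity": None,
    "object": "entity",
    "person": "entity",
    "device": "object",
    "star_celestial": "object",
    "sensor_device": "device",
    "tracker_device": "device",
    "star_celebrity": "person",
    "tracker_person": "person",
}


def path_to_root(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the lowest common subsumer.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def most_coherent(candidates):
    """Pick one sense per keyword, maximizing the summed pairwise similarity."""
    keywords = list(candidates)
    best = max(
        product(*(candidates[k] for k in keywords)),
        key=lambda combo: sum(wu_palmer(a, b) for a, b in combinations(combo, 2)),
    )
    return dict(zip(keywords, best))


candidates = {
    "star": ["star_celestial", "star_celebrity"],
    "sensor": ["sensor_device"],
    "tracker": ["tracker_device", "tracker_person"],
}
chosen = most_coherent(candidates)
```
]]></preformat>
      <p>In this toy setting the unambiguous keyword pulls the ambiguous ones toward the senses that share its branch of the taxonomy, which is the effect the domain keywords are meant to have on the candidate entities.</p>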
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Work</title>
      <p>
        In this demonstration, we presented DSAT, a tool to support engineers in
extracting technical information from PDF data sheets. Challenges similar to
those in the space domain also arise, for example, in patent analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or
medicine [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Therefore, we plan to adapt DSAT to these domains. This requires
a change in the underlying ontologies and possibly adaptations of the system for
the structure of documents typical for those domains.
      </p>
      <p>Currently, manual corrections by users only pertain to the specific data sheet
they were made in. We plan to leverage this expert knowledge to further enhance
the ontology over time. While synonyms of existing concepts are easy to include,
the situation gets more complex with hypernyms, hyponyms, or even terms that
are unrelated to already known attributes. Especially for these cases, we want
to give users direct access to the ontology. Since most users have little
experience with ontologies, the respective interface needs to be intuitive and
prevent users from introducing inconsistencies into the underlying ontologies.</p>
      <p><sup>2</sup> The extended ontologies used in this demonstration can be found here:
https://zenodo.org/record/4034478/files/enriched_ontology.zip</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Andersson</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hidir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <source>Proceedings of The 1st Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech 2019)</source>
          (
          <year>2019</year>
          ). https://doi.org/10.34726/PST2019
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. ConTrOn:
          <source>Contron - spacecraft parts ontology 1.2</source>
          (May
          <year>2020</year>
          ). https://doi.org/10.5281/zenodo.3862854
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <collab>German Aerospace Center (DLR)</collab>
          :
          <article-title>Virtual Satellite</article-title>
          . https://github.com/virtualsatellite, accessed: 2020-08-14
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>WordNet : an electronic lexical database</article-title>
          . MIT Press, Cambridge, Mass (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Murdaca</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berquand</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riccardi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerené</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brauer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Knowledge-based information extraction from datasheets of space parts</article-title>
          .
          In:
          <source>8th International Systems &amp; Concurrent Engineering for Space Applications Conference</source>
          (September
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Opasjumruskit</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Data Sheets Annotation Tool</article-title>
          . https://gitlab.com/kobkaew/dsat-client, accessed: 2020-08-17
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rizvi</surname>
            ,
            <given-names>S.T.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mercier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agne</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erkel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Ontology-based information extraction from technical documents</article-title>
          .
          In:
          <source>Proceedings of the 10th International Conference on Agents and Artificial Intelligence</source>
          . SCITEPRESS - Science and Technology Publications
          (
          <year>2018</year>
          ). https://doi.org/10.5220/0006596604930500
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Starlinger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kittner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blankenstein</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>How to improve information extraction from German medical records</article-title>
          .
          <source>it - Information Technology</source>
          <volume>59</volume>
          (
          <issue>4</issue>
          ) (
          <year>2017</year>
          ). https://doi.org/10.1515/itit-2016-0027
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: A free collaborative knowledgebase</article-title>
          .
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          (
          <year>Sep 2014</year>
          ). https://doi.org/10.1145/2629489
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wimalasuriya</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Ontology-based information extraction: An introduction and a survey of current approaches</article-title>
          .
          <source>Journal of Information Science</source>
          <volume>36</volume>
          ,
          <fpage>306</fpage>
          -
          <lpage>323</lpage>
          (
          <year>2010</year>
          ). https://doi.org/10.1177/0165551509360123
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>