<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SAKE: A Semantic Authoring and Annotation Tool for Knowledge Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Grau</string-name>
          <email>janerik.grau@student.unisg.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kimberly Garcia</string-name>
          <email>kimberly.garcia@unisg.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Mayer</string-name>
          <email>simon.mayer@unisg.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Semantic Authoring, Semantic Annotator, PDF annotator, Semantic Web Tool.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, University of St.Gallen</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Greenhouse Gas (GHG) accounting is traditionally a lengthy and manual process that requires the expertise of experienced environmental scientists; as upcoming regulations on GHG accounting around the planet acknowledge the climate crisis, the demand for tools that can support these environmental experts and accelerate their work is growing considerably. GHG accounting is merely one application of automated support tools that require the preservation of expert knowledge in a machine-readable and machine-understandable format; across fields, this is highly relevant for automating processes that today can only be performed by individuals with specialized training. In this paper, we present SAKE, a Semantic Authoring and Annotation tool for Knowledge Extraction that allows domain experts with no proficiency in semantic technologies to annotate domain-specific PDF files, creating a Knowledge Graph with instances of standardized (or new) ontologies. The resulting Knowledge Graph can then be integrated into systems to automate specialized processes. SAKE has been developed together with domain experts in the field of environmental science and is currently used in the scope of a joint project on GHG accounting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        From science to law and from research papers to regulatory documents, a large amount of textual
knowledge is today available in the form of PDF files. The knowledge transported through
these PDFs, while valuable for appropriately contextualized human readers, today remains
hard to integrate with automated systems. While current machine-learning methods, such as
large language models, mitigate this problem for content that aligns well with their training
data, these fall short for specialized knowledge that requires contextualized processing. Such
contextualization could be achieved if the information in a PDF was semantically integrated
with shared ontologies. This would not only enable automatic processing of the content, but
also, in line with the core tenet of the Semantic Web, support the interlinking of pieces of
information across documents, institutions, and domains. While semantic annotation is readily
supported for HTML content, e.g., with Web-Annotation-based tools such as dokieli [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], PDF
documents today remain sidelined in Semantic Web tooling. There are good historical, technical,
and social reasons for this; however, given the wide range of domains and large amount of
information available (often exclusively) through PDFs, we argue that it is time to pull
PDF-based communities into the world of Knowledge Graphs. Thus, we created SAKE, a Semantic
Authoring and Annotation tool for Knowledge Extraction that permits semantically lifting
PDF documents through ontology-based annotations generated by a user, thereby simplifying
the integration of information in PDF documents into the Semantic Web. The development of
SAKE was motivated by an innovation project that aims at automating GHG accounting through
Semantic Web technologies<fn id="fn1"><label>1</label><p>https://wiser-climate.com/</p></fn>. In this contribution, we introduce SAKE’s implementation and
features, and we discuss the GHG accounting project that is currently taking advantage of SAKE.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. The SAKE Annotation Tool</title>
      <p>SAKE is a Web application built upon PDF.js<fn id="fn2"><label>2</label><p>https://mozilla.github.io/pdf.js/</p></fn>, a library developed by Mozilla that provides all
the functionalities of a PDF reader. SAKE defines its own skin to offer semantic annotation
functionalities (see Figure 1(c)) next to the full capabilities of PDF.js. Specifically, SAKE enhances
the PDF highlighting functionality to allow users to transform relevant content found in a
document into structured knowledge expressed in the Resource Description Framework<fn id="fn3"><label>3</label><p>https://www.w3.org/RDF/</p></fn> (RDF).
SAKE’s current implementation uses AtomicData<fn id="fn4"><label>4</label><p>https://atomicdata.dev/</p></fn> as a semantic back end. AtomicData hosts
the ontologies used for annotating documents, as well as the user data used to add provenance information to
annotations. In our implementation, AtomicData could be easily replaced by any other
user-based graph database, such as Solid<fn id="fn5"><label>5</label><p>https://solidproject.org/</p></fn> or GraphDB<fn id="fn6"><label>6</label><p>https://graphdb.ontotext.com/</p></fn>.</p>
      <p>To annotate a PDF file, a user (assumed to be a domain expert) first loads an ontology (expressed
in RDF) into SAKE’s semantic back end. The classes specified in this ontology are considered the
user’s Known Concepts (KCs). SAKE displays all KCs on the right side of the user interface (see
Figure 1). To annotate a PDF entity (text or figure), the user selects a KC and then selects the PDF
entity. Then, SAKE displays a pop-up window that prompts the user for additional information
(see Figure 2) corresponding to the attributes and relationships related to the selected KC (i.e.,
object and data properties) and specified in the loaded ontology.</p>
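      <p>As an illustration of how Known Concepts and the fields of the annotation pop-up could be derived from a loaded ontology, the following Python sketch reads a small RDF/XML ontology and maps each class to the properties that declare it as their domain. The toy ontology, its http://example.org/ghg# IRIs, and the function name known_concepts are assumptions made for this example; they are not taken from SAKE's implementation, which is not described at this level of detail here.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Hypothetical toy ontology; all example.org IRIs are invented for illustration.
ONTOLOGY = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/ghg#EmissionFactor"/>
  <owl:Class rdf:about="http://example.org/ghg#Activity"/>
  <owl:DatatypeProperty rdf:about="http://example.org/ghg#hasValue">
    <rdfs:domain rdf:resource="http://example.org/ghg#EmissionFactor"/>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:about="http://example.org/ghg#appliesTo">
    <rdfs:domain rdf:resource="http://example.org/ghg#EmissionFactor"/>
  </owl:ObjectProperty>
</rdf:RDF>"""


def known_concepts(rdf_xml: str) -> dict[str, list[str]]:
    """Map each owl:Class IRI to the properties that name it as rdfs:domain."""
    root = ET.fromstring(rdf_xml)
    # Every top-level owl:Class becomes a Known Concept with no properties yet.
    concepts = {
        cls.get(f"{{{RDF}}}about"): []
        for cls in root.findall(f"{{{OWL}}}Class")
    }
    # Data properties are collected first, then object properties.
    for kind in ("DatatypeProperty", "ObjectProperty"):
        for prop in root.findall(f"{{{OWL}}}{kind}"):
            for domain in prop.findall(f"{{{RDFS}}}domain"):
                target = domain.get(f"{{{RDF}}}resource")
                if target in concepts:
                    concepts[target].append(prop.get(f"{{{RDF}}}about"))
    return concepts


kcs = known_concepts(ONTOLOGY)
for concept, properties in kcs.items():
    print(concept, properties)
```
]]></preformat>
      <p>In this sketch, the keys of the returned dictionary play the role of the KC list, and the property list of a selected key corresponds to the fields a pop-up would have to prompt for. A production system would more likely rely on a full RDF library than on direct XML parsing, since ontologies may be serialized in other RDF syntaxes.</p>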
      <p>
        To ensure compatibility with all common PDF readers (e.g., Adobe Acrobat), the annotation
is stored as an RDFa string in the PDF document’s Content dictionary (cf. the PDF
specification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Hence, the PDF document can be distributed with the embedded structured data,
and collaborators not using SAKE will still see the semantic annotations when using other PDF
readers. While the semantic annotations (being RDF) may be hard to read, they can still be
modified with any common PDF reader. Since SAKE embeds semantic annotations within the PDF
file, the file itself acts as a self-contained Knowledge Graph. Thus, SAKE RDFa annotations can immediately
be used with Semantic Web applications, such as dokieli [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, when an expert shares
an annotated PDF file with a colleague using SAKE, this colleague is able to read the text that
surrounds an annotation, providing them with context and improving their understanding of a
KG that has been created in a collaborative fashion.
      </p>
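      <p>To make the idea of an annotation serialized as an RDFa string more concrete, the following Python sketch assembles such a string from a selected text span, a Known Concept, and the property values entered by the user. The exact markup that SAKE writes into the PDF is not specified here, so the element layout, the function name annotation_to_rdfa, and the example.org IRIs are assumptions; only the RDFa attributes about, typeof, property, and content are standard.</p>
      <preformat><![CDATA[
```python
from html import escape


def annotation_to_rdfa(subject, concept, selected_text, properties):
    """Serialize one annotation as an RDFa string (illustrative layout only)."""
    # The outer element types the annotated span with the chosen concept.
    attrs = f'about="{escape(subject)}" typeof="{escape(concept)}"'
    parts = [f"<span {attrs}>", escape(selected_text)]
    # Each user-supplied value becomes a literal attached via @property/@content.
    for prop_iri, value in properties.items():
        parts.append(
            f'<span property="{escape(prop_iri)}" content="{escape(value)}"></span>'
        )
    parts.append("</span>")
    return "".join(parts)


# Hypothetical annotation; IRIs and values are invented for illustration.
snippet = annotation_to_rdfa(
    subject="http://example.org/doc#annotation-1",
    concept="http://example.org/ghg#EmissionFactor",
    selected_text="2.5 kg CO2e per unit & year",
    properties={"http://example.org/ghg#hasValue": "2.5"},
)
print(snippet)
```
]]></preformat>
      <p>Escaping the selected text and the attribute values keeps the string well-formed even when the highlighted PDF content contains markup characters.</p>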
      <p>Furthermore, SAKE integrates a Web server and responds to HTTP requests that specify
appropriate content types (e.g., text/turtle or application/ld+json) with the graph embedded
within the currently open PDF document. Finally, SAKE provides domain experts with the means
to add new concepts and properties to existing ontologies; when a new concept is added, SAKE
asks the expert to specify the corresponding HTML elements (e.g., a text field or a drop-down
menu) to be displayed in the annotation pop-up window that is associated with the concept.
This pop-up information is stored as a list of RDF instructions, which SAKE interprets at runtime.
These instructions include validating and mapping strings to concepts in the ontology.</p>
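      <p>The content-type-based responses described above follow the usual pattern of HTTP content negotiation. As a minimal sketch, assuming the server supports exactly the two media types named in the text, the following Python function selects a serialization from an Accept header. The function name negotiate and the simplified parsing (quality values and the */* wildcard only, no type/* ranges or specificity ordering) are assumptions for illustration, not SAKE's actual server code.</p>
      <preformat><![CDATA[
```python
# Media types named in the text; the tuple order is the tie-break for "*/*".
SUPPORTED = ("text/turtle", "application/ld+json")


def negotiate(accept_header, supported=SUPPORTED):
    """Return the supported media type with the highest q-value, or None."""
    best, best_q = None, 0.0
    for item in accept_header.split(","):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type == "*/*":
            candidates = supported
        elif media_type in supported:
            candidates = (media_type,)
        else:
            candidates = ()
        for candidate in candidates:
            if q > best_q:
                best, best_q = candidate, q
    return best


print(negotiate("application/ld+json;q=0.8, text/turtle"))
print(negotiate("text/html"))
```
]]></preformat>
      <p>A server built along these lines would serialize the graph embedded in the open PDF in the returned format, and would typically answer with status 406 (Not Acceptable) when the function returns None.</p>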
    </sec>
    <sec id="sec-4">
      <title>3. SAKE for GHG Accounting</title>
      <p>Today, GHG accounting is a time-consuming and expensive process that requires highly
specialized environmental scientists to manually analyze companies’ processes, including their
supply chains; even large multinational companies commission these assessments only rarely
because of the manual effort and expense involved. Faster and more cost-effective
GHG assessment is required not only to comply with sustainability reporting obligations (e.g.,
the Swiss Ordinance on Climate Disclosures), but also to regularly assess current practices
and reconsider company strategies to reach decarbonization goals. In this context, WISER is
an interdisciplinary project coordinated by Empa (Swiss Federal Laboratories for Materials
Science and Technology) that aims at providing technological tools to increase the efficiency of
GHG assessments. The project specifically required a way to capture knowledge from PDF
documents, as contextualized by the environmental scientists at Empa, in a machine-understandable
way. This applies primarily to Assessment Standards documents, which must be followed when
creating a GHG assessment and are published by different organizations (e.g., ISO, the European
Commission, the World Business Council for Sustainable Development, or the World Resources
Institute) that use idiosyncratic nomenclature and inconsistent concept definitions. Hence,
two GHG assessment reports might not be comparable if different standards were followed, or
even if the same standard was followed but interpreted differently.</p>
      <p>To increase reproducibility and consistency across GHG assessment reports, WISER aims to
create ontologies that describe different assessment standards, as well as bridge ontologies that identify
commonalities and thereby permit the automatic translation of reports across assessment standards.
Given that the environmental experts in our team are not ontologists, SAKE is proving valuable
in capturing their knowledge when they read an assessment standard. The KG resulting from
the experts’ annotations will be incorporated in a Web application that accelerates the creation of
GHG assessments and can translate reports from one assessment standard to another.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Related Work</title>
      <p>
        Providing experts who are not versed in semantic technologies with tools for creating semantically enriched
content has remained a challenge for several decades [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Early tools focused on bringing the
Semantic Web vision forward by, for example, annotating Web content with metadata. Such is
the case of Annotea [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which provided infrastructure to make remarks (in RDF) on content
available on the Web, at the resource level, or on selected text (e.g., add the place in which a
picture was taken). Loomp [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was a system for serving RDF or XHTML content; it proposed the
One Click Annotator, which allowed specialists (e.g., journalists) to create semantically enriched
documents (e.g., news articles), link them to data sources, and share them with
colleagues for further annotation or for publishing. Semantator [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a Protégé plugin for
annotating biomedical data that provides semi-automatic annotation support using domain
ontologies. SlideWiki [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provides manual and semi-automatic annotation tools for enriching
slide decks with linked data. It allows adding slide deck metadata or linking the content of a
slide to DBpedia entries. Dokieli [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a platform for decentralized authoring, annotating, and
publishing HTML documents while engaging in social interactions. Dokieli uses HTML+RDFa
to edit documents and discuss them collaboratively. Sangrahaka [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a Web application
that allows administrators to create a schema used by annotators; curators can then verify
annotations and resolve conflicts. Similarly, SenTag [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a Web application that allows users
to create XML annotations on plain text.
      </p>
      <p>As described, most of the related tools focus on HTML content rather than on PDF documents,
as SAKE does. PDF documents hold a vast amount of knowledge that becomes accessible when they are read and annotated by experts.
Moreover, SAKE targets high-quality semantic annotations that can be integrated into a
tool (e.g., a dashboard) to accelerate highly specialized real-world processes.</p>
      <p>To overcome one relevant entry barrier to using semantic technologies by domain experts,
we have created SAKE, a tool that allows domain experts to create structured knowledge from
PDF documents. This knowledge can then be exported as a KG and integrated into a tool
for supporting highly specialized tasks such as GHG accounting. SAKE is provided with this
publication as open source and remains in iterative development; it is currently used by
environmental scientists in the scope of an interdisciplinary GHG accounting project. Beyond this project,
we expect the need for semantic annotation, sharing, and automated reasoning on top of
extracted knowledge to keep growing across a variety of domains in which knowledge is still
documented in PDFs while its interpretation remains within the experts’ minds.</p>
      <p>Acknowledgments: We thank Dr. Didier Beloin-Saint-Pierre, Alexander Kirsten, and Dr.
Daniel Lachat, environmental scientists at Empa, for their support in testing SAKE. SAKE has
been developed as part of the WISER flagship project funded by Innosuisse.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Capadisli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <article-title>Decentralised authoring, annotations and notifications for a read-write web with dokieli</article-title>
          ,
          <source>in: Web Engineering</source>
          , Springer International Publishing,
          <year>2017</year>
          . doi:10.1007/978-3-319-60131-1_33.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>Document management - Portable document format - Part 2: PDF 2.0</source>
          ,
          <year>2020</year>
          . URL: https://www.iso.org/standard/75839.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          ,
          <article-title>S-CREAM - Semi-automatic CREAtion of Metadata</article-title>
          ,
          <source>in: Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web</source>
          , Springer Berlin Heidelberg,
          <year>2002</year>
          . doi:10.1007/3-540-45810-7_32.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-R.</given-names>
            <surname>Koivunen</surname>
          </string-name>
          ,
          <article-title>Annotea: an open RDF infrastructure for shared Web annotations</article-title>
          ,
          <source>in: Proceedings of the 10th international conference on World Wide Web</source>
          , ACM, Hong Kong,
          <year>2001</year>
          . doi:10.1145/371920.372166.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Luczak-Rosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heese</surname>
          </string-name>
          ,
          <article-title>Linked data authoring for non-expert</article-title>
          ,
          <source>in: Linked Data on the Web Workshop</source>
          ,
          <year>2009</year>
          . URL: https://ceur-ws.org/Vol-538/ldow2009_paper4.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Chute</surname>
          </string-name>
          ,
          <article-title>Semantator: Semantic annotator for converting biomedical text to linked data</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          <volume>46</volume>
          (
          <year>2013</year>
          ). doi:10.1016/j.jbi.2013.07.003.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khalili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>de Graaf</surname>
          </string-name>
          ,
          <article-title>SlideWiki - A Platform for Authoring FAIR Educational Content</article-title>
          , in: SEMANTiCS (Posters &amp; Demos),
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Terdalkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <article-title>Sangrahaka: a tool for annotating and querying knowledge graphs</article-title>
          ,
          <source>ACM</source>
          ,
          <year>2021</year>
          . doi:10.1145/3468264.3473113.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Loreggia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zerbinati</surname>
          </string-name>
          ,
          <article-title>SenTag: A Web-Based Tool for Semantic Annotation of Textual Documents</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          (
          <year>2022</year>
          ). doi:10.1609/aaai.v36i11.21724.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>