<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">SAKE: A Semantic Authoring and Annotation Tool for Knowledge Extraction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jan</forename><surname>Grau</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Computer Science</orgName>
								<orgName type="institution">University of St.Gallen</orgName>
								<address>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kimberly</forename><surname>Garcia</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Computer Science</orgName>
								<orgName type="institution">University of St.Gallen</orgName>
								<address>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simon</forename><surname>Mayer</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Computer Science</orgName>
								<orgName type="institution">University of St.Gallen</orgName>
								<address>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">SAKE: A Semantic Authoring and Annotation Tool for Knowledge Extraction</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">281F0B840D55C0DB938CB68E0B11FE51</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:46+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Semantic Authoring</term>
					<term>Semantic Annotator</term>
					<term>PDF annotator</term>
					<term>Semantic Web Tool</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Greenhouse Gas (GHG) accounting is traditionally a lengthy, manual process that requires the expertise of experienced environmental scientists. As upcoming regulations on GHG accounting around the world reflect the growing recognition of the climate crisis, the demand for tools that support these environmental experts and accelerate their work is growing considerably. GHG accounting is merely one application area for automated support tools that require the preservation of expert knowledge in a machine-readable and machine-understandable format; across fields, this is highly relevant for automating processes that today can only be performed by individuals with specialized training. In this paper, we present SAKE, a Semantic Authoring and Annotation tool for Knowledge Extraction that allows domain experts with no proficiency in semantic technologies to annotate domain-specific PDF files, creating a Knowledge Graph with instances of standardized (or new) ontologies. The resulting Knowledge Graph can then be integrated into systems to automate specialized processes. SAKE has been developed together with domain experts in the field of environmental science and is currently used in the scope of a joint project on GHG accounting.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>From science to law and from research papers to regulatory documents, a large amount of textual knowledge is today available in the form of PDF files. The knowledge conveyed by these PDFs, while valuable for appropriately contextualized human readers, remains hard to integrate with automated systems. While current machine-learning methods, such as large language models, mitigate this problem for content that aligns well with their training data, they fall short for specialized knowledge that requires contextualized processing. Such contextualization could be achieved if the information in a PDF were semantically integrated with shared ontologies. This would not only enable automatic processing of the content, but also, in line with the core tenet of the Semantic Web, support the interlinking of pieces of information across documents, institutions, and domains. While semantic annotation is readily supported for HTML content, e.g., with Web-Annotation-based tools such as dokieli <ref type="bibr" target="#b0">[1]</ref>, PDF documents today remain sidelined in Semantic Web tooling. There are good historical, technical, and social reasons for this; however, given the wide range of domains and the large amount of information available (often exclusively) through PDFs, we argue that it is time to pull PDF-based communities into the world of Knowledge Graphs.<note place="foot">SEMANTICS'24: Posters and Demos, September 17-19, 2024, Amsterdam, Netherlands. Email: janerik.grau@student.unisg.ch (J. Grau); kimberly.garcia@unisg.ch (K. Garcia); simon.mayer@unisg.ch (S. Mayer). ORCID: 0009-0006-0565-2034 (J. Grau); 0000-0002-4971-2944 (K. Garcia); 0000-0001-6367-3454 (S. Mayer).</note> 
Thus, we created SAKE, a Semantic Authoring and Annotation tool for Knowledge Extraction that permits semantically lifting PDF documents through ontology-based annotations generated by a user, thereby simplifying the integration of information in PDF documents into the Semantic Web. The development of SAKE was motivated by an innovation project that aims at automating GHG accounting through Semantic Web technologies<ref type="foot" target="#foot_0">1</ref>. In this contribution, we introduce SAKE's implementation and features, and we discuss the GHG accounting project that is currently taking advantage of SAKE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The SAKE Annotation Tool</head><p>SAKE is a Web application built upon PDF.js<ref type="foot" target="#foot_1">2</ref>, a library developed by Mozilla that provides all the functionalities of a PDF reader. SAKE defines its own skin to offer semantic annotation functionalities (see Figure <ref type="figure" target="#fig_0">1</ref>(c)) next to the full capabilities of PDF.js. Specifically, SAKE enhances the PDF highlighting functionality to allow users to transform relevant content found in a document into structured knowledge expressed in the Resource Description Framework<ref type="foot" target="#foot_2">3</ref> (RDF). SAKE's current implementation uses AtomicData<ref type="foot" target="#foot_3">4</ref> as a semantic back end. AtomicData hosts the ontologies used for annotating documents, as well as user data that adds provenance information to annotations. In our implementation, AtomicData could easily be replaced by any other user-based graph database, such as Solid<ref type="foot" target="#foot_4">5</ref> or GraphDB<ref type="foot" target="#foot_5">6</ref>. To annotate a PDF file, a user (in our case, a domain expert) first loads an ontology (expressed in RDF) into SAKE's semantic back end. The classes specified in this ontology are considered the user's Known Concepts (KCs). SAKE displays all KCs on the right side of the user interface (see Figure <ref type="figure" target="#fig_0">1</ref>). To annotate a PDF entity (text or figure), the user selects a KC and then selects the PDF entity. SAKE then displays a pop-up window that prompts the user for additional information (see Figure <ref type="figure" target="#fig_1">2</ref>). The requested fields correspond to the attributes and relationships of the selected KC (i.e., its data and object properties) as specified in the loaded ontology. 
To ensure compatibility with all common PDF readers (e.g., Adobe Acrobat), the annotation is stored as an RDFa string in the PDF document's Content dictionary (cf. the PDF specification <ref type="bibr" target="#b1">[2]</ref>). Hence, the PDF document can be distributed with the embedded structured data, and collaborators not using SAKE will still see the semantic annotations when using other PDF readers. While the semantic annotations (being RDF) may be hard to read, they can still be modified with any common PDF reader. Since SAKE embeds semantic annotations within the PDF file, the file itself acts as a self-contained Knowledge Graph (KG). Thus, SAKE's RDFa annotations can immediately be used with Semantic Web applications, such as dokieli <ref type="bibr" target="#b0">[1]</ref>. Moreover, when an expert shares an annotated PDF file with a colleague using SAKE, this colleague is able to read the text that surrounds an annotation, which provides context and improves their understanding of a KG that has been created collaboratively.</p><p>Furthermore, SAKE integrates a Web server and responds to HTTP requests that specify appropriate content types (e.g., text/turtle or application/ld+json) with the graph embedded within the currently open PDF document. Finally, SAKE provides domain experts with the means to add new concepts and properties to existing ontologies; when a new concept is added, SAKE asks the expert to specify the corresponding HTML elements (e.g., a text field or a drop-down menu) to be displayed in the annotation pop-up window that is associated with the concept. This pop-up information is stored as a list of RDF instructions, which SAKE interprets at runtime. These instructions include validating strings and mapping them to concepts in the ontology.</p></div>
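<div xmlns="http://www.tei-c.org/ns/1.0"><p>The content-type-dependent responses described above can be illustrated with a minimal Python sketch. It is not SAKE's implementation: the function names, the example.org IRIs, and the in-memory list of triples are assumptions made purely for illustration, and reading the requested type from an HTTP Accept header is likewise assumed. The sketch picks the best supported media type and serializes a small graph either as Turtle (one statement per line) or as expanded JSON-LD.</p>
<p><code lang="python"><![CDATA[
```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical annotation graph: (subject, predicate, (value, is_literal)).
# All example.org IRIs are placeholders, not SAKE's actual vocabulary.
GRAPH = [
    ("http://example.org/annotation/1", RDF_TYPE,
     ("http://example.org/onto#EmissionFactor", False)),
    ("http://example.org/annotation/1",
     "http://www.w3.org/2000/01/rdf-schema#label",
     ("Scope 1 emissions", True)),
]


def choose_media_type(accept_header, supported):
    """Return the supported media type with the highest q-value, or None."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Negated quality sorts best-first; position keeps header order on ties.
        candidates.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 marks a type as not acceptable
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return None


def to_turtle(triples):
    """Serialize one statement per line (N-Triples style, also valid Turtle)."""
    lines = []
    for subject, predicate, (value, is_literal) in triples:
        obj = json.dumps(value) if is_literal else "<" + value + ">"
        lines.append("<{}> <{}> {} .".format(subject, predicate, obj))
    return "\n".join(lines) + "\n"


def to_jsonld(triples):
    """Serialize as expanded JSON-LD, grouping statements by subject."""
    nodes = {}
    for subject, predicate, (value, is_literal) in triples:
        node = nodes.setdefault(subject, {"@id": subject})
        if predicate == RDF_TYPE:
            node.setdefault("@type", []).append(value)
        else:
            entry = {"@value": value} if is_literal else {"@id": value}
            node.setdefault(predicate, []).append(entry)
    return json.dumps(list(nodes.values()), indent=2)


SERIALIZERS = {"text/turtle": to_turtle, "application/ld+json": to_jsonld}


def respond(accept_header, triples):
    """Return (status, content_type, body) for a requested content type."""
    media_type = choose_media_type(accept_header, list(SERIALIZERS))
    if media_type is None:
        return 406, None, ""
    return 200, media_type, SERIALIZERS[media_type](triples)
```
]]></code></p>
<p>In this sketch, a request that prefers application/ld+json over text/turtle receives the JSON-LD serialization, whereas a request for an unsupported type such as image/png is answered with status 406. A real deployment would more likely delegate serialization to an RDF library than to hand-written serializers.</p></div>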
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">SAKE for GHG Accounting</head><p>Today, GHG accounting is a time-consuming and expensive process that requires highly specialized environmental scientists to manually analyze companies' processes, including their supply chains; even large multinational companies commission these assessments only rarely because of the costly manual effort required. Faster and more cost-effective GHG assessment is required not only to comply with sustainability reporting obligations (e.g., the Swiss Ordinance on Climate Disclosures), but also to regularly assess current practices and reconsider company strategies to reach decarbonization goals. In this context, WISER is an interdisciplinary project<ref type="foot" target="#foot_6">7</ref> coordinated by Empa (Swiss Federal Laboratories for Materials Science and Technology) that aims at providing technological tools to increase the efficiency of GHG assessments. The project specifically required a way to capture knowledge from PDF documents, as contextualized by the environmental scientists at Empa, in a machine-understandable way. This applies primarily to Assessment Standards documents that must be followed when creating a GHG assessment and are published by different organizations (e.g., ISO, the European Commission, the World Business Council for Sustainable Development, or the World Resources Institute), which use idiosyncratic nomenclature and inconsistent concept definitions. Hence, two GHG assessment reports might not be comparable if different standards were followed, or even if the same standard was followed but interpreted differently.</p><p>To increase reproducibility and consistency across GHG assessment reports, WISER aims to create ontologies that describe different assessment standards, as well as bridge ontologies that identify commonalities and thereby permit the automatic translation of reports across assessment standards. 
Given that the environmental experts in our team are not ontologists, SAKE is proving valuable for capturing their knowledge when they read an assessment standard. The KG resulting from the experts' annotations will be incorporated in a Web application that accelerates the creation of GHG assessments and can translate reports from one assessment standard to another.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Related Work</head><p>Providing experts who are not versed in semantic technologies with tools for creating semantically enriched content has remained a challenge for several decades <ref type="bibr" target="#b2">[3]</ref>. Early tools focused on bringing the Semantic Web vision forward by, for example, annotating Web content with metadata. Such is the case of Annotea <ref type="bibr" target="#b3">[4]</ref>, which provided infrastructure to make remarks (in RDF) on content available on the Web, at the resource level, or on selected text (e.g., adding the place in which a picture was taken). Loomp <ref type="bibr" target="#b4">[5]</ref> was a system for serving RDF or XHTML content. It proposed the One Click Annotator, which allowed specialists (e.g., journalists) to create semantically enriched documents (e.g., news articles), link them to data sources, and share them with colleagues for further annotation or for publishing. Semantator <ref type="bibr" target="#b5">[6]</ref> is a Protégé plugin for annotating biomedical data that provides semi-automatic annotation support using domain ontologies. SlideWiki <ref type="bibr" target="#b6">[7]</ref> provides manual and semi-automatic annotation tools for enriching slide decks with linked data. It allows adding slide deck metadata or linking the content of a slide to DBpedia entries. Dokieli <ref type="bibr" target="#b0">[1]</ref> is a platform for decentralized authoring, annotating, and publishing of HTML documents while engaging in social interactions. Dokieli uses HTML+RDFa to edit documents and discuss them collaboratively. Sangrahaka <ref type="bibr" target="#b7">[8]</ref> is a Web application that allows administrators to create a schema used by annotators; curators can then verify annotations and resolve conflicts. 
Similarly, SenTag <ref type="bibr" target="#b8">[9]</ref> is a Web application that allows users to create XML annotations on plain text.</p><p>As described, most of the relevant related tools focus on HTML content and not, as SAKE does, on PDF documents. These documents hold a vast amount of knowledge that becomes accessible when they are read and annotated by experts. Moreover, SAKE targets high-quality semantic annotations that can be integrated into a tool (e.g., a dashboard) to accelerate highly specialized real-world processes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Work</head><p>To lower one relevant entry barrier that keeps domain experts from using semantic technologies, we have created SAKE, a tool that allows domain experts to create structured knowledge from PDF documents. This knowledge can then be exported as a KG and integrated into a tool for supporting highly specialized tasks such as GHG accounting. SAKE is provided with this publication as open source<ref type="foot" target="#foot_7">8</ref> and remains in iterative development; it is currently used by environmental scientists in the scope of an interdisciplinary GHG accounting project. Beyond this project, we expect the need for semantic annotation, sharing, and automated reasoning on top of extracted knowledge to keep growing across a variety of domains in which knowledge is still documented in PDFs while its interpretation remains in the experts' minds.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: SAKE. (a) functionalities associated with a PDF reader (e.g., next page and zoom); (b) PDF to annotate; and (c) user Known Concepts from an ontology loaded through the semantic backend.</figDesc><graphic coords="2,120.54,371.56,354.20,115.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: (a) Highlighting tool after selecting a Known Concept; (b) Popup window to add more annotation information.</figDesc><graphic coords="3,89.29,123.25,416.70,108.70" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://wiser-climate.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://mozilla.github.io/pdf.js/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.w3.org/RDF/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://atomicdata.dev/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://solidproject.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://graphdb.ontotext.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://wiser-climate.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">https://github.com/jangrau13/semantics2024_sake</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments: We thank Dr. Didier Beloin-Saint-Pierre, Alexander Kirsten, and Dr. Daniel Lachat, environmental scientists at Empa, for their support in testing SAKE. SAKE has been developed as part of the WISER flagship project funded by Innosuisse.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Capadisli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Verborgh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lange</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-60131-1_33</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-60131-1_33" />
		<title level="m">Decentralised authoring, annotations and notifications for a read-write web with dokieli</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
		<respStmt>
			<orgName>Web Engineering</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<idno>PDF 2.0</idno>
		<ptr target="https://www.iso.org/standard/75839.html" />
		<title level="m">Document management - Portable document format</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">S-CREAM - Semi-automatic CREAtion of Metadata</title>
		<author>
			<persName><forename type="first">S</forename><surname>Handschuh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ciravegna</surname></persName>
		</author>
		<idno type="DOI">10.1007/3-540-45810-7_32</idno>
	</analytic>
	<monogr>
		<title level="m">Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web</title>
				<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Annotea: an open RDF infrastructure for shared Web annotations</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-R</forename><surname>Koivunen</surname></persName>
		</author>
		<idno type="DOI">10.1145/371920.372166</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th international conference on World Wide Web</title>
				<meeting>the 10th international conference on World Wide Web<address><addrLine>Hong Kong</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Linked data authoring for non-experts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Luczak-Rosch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Heese</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-538/ldow2009_paper4.pdf" />
	</analytic>
	<monogr>
		<title level="m">Linked Data on the Web Workshop</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Semantator: Semantic annotator for converting biomedical text to linked data</title>
		<author>
			<persName><forename type="first">C</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">G</forename><surname>Chute</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jbi.2013.07.003</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Biomedical Informatics</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Khalili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">A</forename><surname>De Graaf</surname></persName>
		</author>
		<title level="m">SlideWiki - A Platform for Authoring FAIR Educational Content</title>
				<imprint>
			<publisher>SEMANTiCS (Posters &amp; Demos)</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Sangrahaka: a tool for annotating and querying knowledge graphs</title>
		<author>
			<persName><forename type="first">H</forename><surname>Terdalkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bhattacharya</surname></persName>
		</author>
		<idno type="DOI">10.1145/3468264.3473113</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
			<publisher>ACM</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">SenTag: A Web-Based Tool for Semantic Annotation of Textual Documents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Loreggia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zerbinati</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v36i11.21724</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
