<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exposing Digital Content as Linked Data, and Linking them using StoryBlink</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ben</forename><surname>De Meester</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Ghent University - iMinds - Multimedia Lab</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tom</forename><surname>De Nies</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Ghent University - iMinds - Multimedia Lab</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Laurens</forename><surname>De Vocht</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Ghent University - iMinds - Multimedia Lab</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ruben</forename><surname>Verborgh</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Ghent University - iMinds - Multimedia Lab</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Erik</forename><surname>Mannens</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Ghent University - iMinds - Multimedia Lab</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rik</forename><surname>Van De Walle</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Ghent University - iMinds - Multimedia Lab</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exposing Digital Content as Linked Data, and Linking them using StoryBlink</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A0C7F55E469109184E46DD20CFA0EADA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T14:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>EPUB 3</term>
					<term>DBpedia Spotlight</term>
					<term>Linked Data</term>
					<term>NIF</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Digital publications host a large amount of data that currently is not harvested, due to its unstructured nature. However, manually annotating these publications is tedious. Current tools that automatically analyze unstructured text are too fine-grained for larger amounts of text such as books. A workable machine-interpretable version of larger bodies of text is thus necessary. In this paper, we suggest a workflow to automatically create and publish a machine-interpretable version of digital publications as linked data via DBpedia Spotlight. Furthermore, we make use of the Everything is Connected Engine on top of this published linked data to link digital publications using a Web application dubbed "StoryBlink". StoryBlink shows the added value of publishing machine-interpretable content of unstructured digital publications by finding relevant books that are connected to selected classic works. Currently, the time to find a connecting path can be quite long, but this can be overcome by using caching mechanisms, and the relevancy of found paths can be improved by better denoising the DBpedia Spotlight results, or by using alternative disambiguation engines.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Digital publications such as digital books (e-books) and online articles house a vast amount of content. Meanwhile, the amount of published content is rising more than ever, due to advancements in communication technologies. This situation leads to a (harmful) information overload for end users <ref type="bibr" target="#b7">[8]</ref>.</p><p>Current systems for finding relevant publications are largely based on two approaches <ref type="bibr" target="#b1">[2]</ref>: social recommendation, and (content-based) recommendation based on metadata. On the one hand, social recommendation systems are usually biased towards popular publications, making the unpopular publications harder to find and causing these publications to be read less (the so-called long tail effect <ref type="bibr" target="#b0">[1]</ref>). Metadata-based recommendation systems, on the other hand, rely on the availability of high-quality tags, yet these tags currently need to be entered manually, making this process very tedious and costly.</p><p>The contribution of this paper is twofold. First, we describe a system for the automatic creation of relevant tags using Natural Language Processing (NLP) techniques. Second, we apply these tags in a proof-of-concept that detects publications on a path between two selected publications, based on their content.</p><p>In Section 1, we review the state of the art and introduce some important relevant technologies. Afterwards, we present the two contributions in Section 2 and Section 3 respectively. These contributions are evaluated in Section 4, after which we conclude in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Background</head><p>In this section, we first have a look at systems that create machine-interpretable representations of natural language text, and link these representations to the Semantic Web (Subsection 1.1). Second, we review the used and relevant technologies for creating our proof-of-concept (Subsection 1.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Semantic Natural Language Processing</head><p>To link entities to Semantic data sources, these entities first need to be recognized in the text through Named Entity Recognition (NER). Then, a second analysis is needed to disambiguate these recognized entities into places, people, or other types (Named Entity Disambiguation or NED) <ref type="bibr" target="#b9">[10]</ref>. NERD <ref type="bibr" target="#b12">[13]</ref>, AGDISTIS <ref type="bibr" target="#b14">[15]</ref> and DBpedia Spotlight <ref type="bibr" target="#b6">[7]</ref> are examples of recognition and disambiguation engines that also connect the detected concepts to their URI on http://dbpedia.org. SHELDON <ref type="bibr" target="#b11">[12]</ref> is an information extraction framework, accessible via its web interface, that can represent outcomes from multiple NLP domains (e.g., Part-Of-Speech tagging, sentiment analysis, and relation extraction) as linked data. The core of SHELDON is FRED <ref type="bibr" target="#b10">[11]</ref>, which describes natural language sentences as linked data graphs. SHELDON can use different NLP mechanisms to extract extra information based on this graph. This linking from natural language to linked data sources is the key difference between conventional NLP tools and Semantic NLP tools, and it enables publishing (parts of) natural language text as linked data. Also, the latter two systems are open-source, whereas the other engines are closed-source or commercial products.</p><p>GERBIL <ref type="bibr" target="#b15">[16]</ref> is a general entity annotator benchmarking framework to compare the performance of different annotation engines. Although DBpedia Spotlight is not the most performant engine according to GERBIL, the fact that it can be installed locally makes it a very useful engine to annotate large corpora of text. Also, DBpedia Spotlight can perform entity recognition, whereas AGDISTIS is solely a disambiguation engine.</p></div>
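To illustrate how the output of such a disambiguation engine is consumed, the sketch below parses a DBpedia Spotlight annotate-style JSON response into (surface form, DBpedia URI, character offset) tuples. The response document here is a mock constructed for this example; the key names (`Resources`, `@URI`, `@surfaceForm`, `@offset`) follow the JSON format Spotlight returns, but should be verified against the Spotlight version in use.

```python
import json

def extract_annotations(spotlight_json: str):
    """Extract (surface form, DBpedia URI, offset) tuples from a
    DBpedia Spotlight annotate JSON response."""
    doc = json.loads(spotlight_json)
    results = []
    for res in doc.get("Resources", []):
        results.append((
            res["@surfaceForm"],
            res["@URI"],
            int(res["@offset"]),  # offsets are returned as strings
        ))
    # Spotlight lists annotations in order of appearance; sort to be safe.
    return sorted(results, key=lambda r: r[2])

# Mock response for the sentence "The chamois lives in the Alps."
sample = json.dumps({
    "@text": "The chamois lives in the Alps.",
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Chamois",
         "@surfaceForm": "chamois", "@offset": "4"},
        {"@URI": "http://dbpedia.org/resource/Alps",
         "@surfaceForm": "Alps", "@offset": "25"},
    ],
})

print(extract_annotations(sample))
```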
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Underlying Technologies</head><p>Digital books Digital books are usually distributed in a format that allows them to be read offline. EPUB, version 3 <ref type="bibr" target="#b2">[3]</ref>, is a packaging format created by the International Digital Publishing Forum (IDPF). Under the hood, it is a zipped package of a folder with content files (i.e., HTML, but also images and videos, for example), together with a package file that manages the links to these content files. What makes it stand out from other distribution formats is that it is an open format that makes use of the Open Web Platform technologies.</p><p>A notable part of the EPUB specification is the specification of the EPUB Canonical Fragment Identifier (CFI) <ref type="bibr" target="#b13">[14]</ref>. This identifier allows addressing any content within an EPUB package, be it a range of text or a DOM element, and it is used to create a URI fragment that uniquely defines this piece of content. It uses the XML structure of the manifest file and the HTML files to follow a slash-delimited path. It uses even numbers (n) to go into the (n/2)-th child element, and odd numbers to go into the text node of that child element. To improve robustness, the ids of the DOM elements may be added between square brackets ([id]). Listing 1 shows an EPUB CFI that identifies the tenth character of the fifth paragraph of the first chapter of the file book.epub.</p><p>Project Gutenberg Project Gutenberg<ref type="foot" target="#foot_0">1</ref> is an effort to digitize print books that are in the public domain. It currently houses over 49,000 free e-books in various formats, including plain text, HTML, and EPUB.
The sample content used in this paper has been downloaded from the Project Gutenberg website.</p><p>The NIF and ITS Ontologies The NLP Interchange Format, version 2.0 (NIF) is an RDF format that provides interoperability between NLP tools, language resources, and annotations <ref type="bibr" target="#b5">[6]</ref>. Although it is a very extensive vocabulary, we only need the ontology to link publications with the parts in which an entity was recognized. NIF is being actively used in several European projects<ref type="foot" target="#foot_1">2</ref> . To connect the NLP results to links on DBpedia semantically, we used the Internationalization Tag Set, version 2.0 (ITS). ITS is a vocabulary<ref type="foot" target="#foot_2">3</ref> to foster the automated processing of Web content <ref type="bibr" target="#b4">[5]</ref>.</p><p>Triple Pattern Fragments Triple Pattern Fragments offer an affordable and reliable way of hosting and querying linked data <ref type="bibr" target="#b16">[17]</ref>. By moving the complex query execution logic from the server to the client, and letting the server only answer simple patterns, the required server power is greatly decreased. This results in a much lower cost and an increased uptime when hosting such servers compared to standard SPARQL endpoints <ref type="foot" target="#foot_3">4</ref> . Because of the reliability of Triple Pattern Fragments servers, linked data applications can be built on top of live endpoints, whereas the uptime of other query endpoints is often not reliable enough for them to be used as back-ends for linked data applications.</p><p>Everything is Connected The Everything is Connected engine (EiCE) is a path finding engine <ref type="bibr" target="#b3">[4]</ref>. Given two semantic concepts, it returns paths between them using a linked data endpoint. Links between two nodes are found using configurable heuristics. In this paper, we chose to link nodes when they share non-trivial relations. For example, a non-trivial relation is that both nodes have their birthplace in Paris; a trivial relation is that both nodes are of the type Person. EiCE looks for these links directly between the two given concepts, or indirectly via intermediate concepts. The found paths are weighted based on the number of common relations per link and the length of the path.</p></div>
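The CFI path syntax described above can be sketched as a small parser. This is an illustrative sketch, not a complete CFI implementation: it handles only simple paths with element/text steps, id assertions, and a trailing character offset, as in the identifier of Listing 1.

```python
import re

def parse_cfi(cfi: str):
    """Parse a simple EPUB CFI into (steps, character offset).
    Each step is (index, id assertion): an even index selects the
    (index/2)-th child element, an odd index a text node. The '!'
    separates the package-document part from the content-document part;
    here it is treated as just another path separator."""
    path = re.fullmatch(r"epubcfi\((.+)\)", cfi).group(1)
    offset = None
    if ":" in path:
        path, off = path.rsplit(":", 1)  # trailing ':n' is a char offset
        offset = int(off)
    steps = []
    for part in re.split(r"[/!]", path):
        if not part:
            continue
        m = re.fullmatch(r"(\d+)(?:\[([^\]]*)\])?", part)
        steps.append((int(m.group(1)), m.group(2)))
    return steps, offset

steps, offset = parse_cfi("epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:10)")
print(steps)   # [(6, None), (4, 'chap01ref'), (4, 'body01'), (10, 'para05'), (1, None)]
print(offset)  # 10
```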
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">From Books to Linked Data</head><p>Our overall goal is to provide a linked data endpoint that houses a relevant view of the content of a digital publication. To this end, we use DBpedia Spotlight <ref type="bibr" target="#b6">[7]</ref> to extract and disambiguate the concepts in plain text (i.e., perform NER and NED), and link these concepts to their DBpedia URIs. However, two problems arise when using DBpedia Spotlight for an EPUB file, namely that (i) an EPUB file usually consists of multiple HTML files, whilst DBpedia Spotlight works with plain text, and (ii) DBpedia Spotlight has a practical limit on the maximum number of characters that can be analyzed within one run <ref type="foot" target="#foot_4">5</ref> .</p><p>To solve these problems, we extract the text from the HTML files in the EPUB file, in reading order. Note that correct handling of whitespace is crucial, as naively stripping the HTML tags could result in wrongly concatenated words, which in turn would yield wrong results from DBpedia Spotlight. Listing 2 shows how naively stripping the tags of a (valid) HTML document can introduce errors. Namely, the list [f, i, n, l] is wrongly concatenated with the word and, which results in the word finland, changing the original text You have following letters [f, i, n, l] and all are part of the latin alphabet to You have following letters finland all are part of the latin alphabet. For the stripped sentence, DBpedia Spotlight will return "Finland" and "Latin alphabet" as disambiguated results, of which "Finland" is wrongly identified. Thus, depending on the context (i.e., whether it is phrasing content or only flow content), additional whitespace should or should not be added to the original HTML <ref type="foot" target="#foot_5">6</ref> .</p><p>After all text is extracted from the EPUB file, this text is split up into text parts manageable by DBpedia Spotlight.
If we did not split up the text, the DBpedia Spotlight instance would need to analyze the entire text of a publication in one run, which could result in either server response time-outs or out-of-memory errors. As DBpedia Spotlight only takes limited contextualization into account, this splitting on sentence boundaries has little to no effect on the results of the NER and NED done by DBpedia Spotlight. For this paper, we split the text minimally on sentence boundaries, with a maximal substring length of 1500 <ref type="foot" target="#foot_6">7</ref> .</p><p>The "annotate"-function of DBpedia Spotlight, which recognizes entities to annotate and returns a single identifier for each recognized entity, returns a list of identifiers, sorted in order of appearance. We use this order to reconnect these results with the original HTML files. By comparing the original text of the detected concept with the words in the text nodes in HTML, in order, we can reconstruct the range of the original detection. We then generate the EPUB CFI of this range and use this CFI in the semantic representation of the content of the publication.</p><p>By making use of the NIF and ITS ontologies, we use well-defined, publicly available ontologies that are already being used in business environments. In our case, we used nif:sourceUrl on the one hand to indicate the link between the detected range and the original book this range originated from, and itsrdf:taIdentRef on the other hand to indicate the DBpedia link that was detected from the text in this range (see Listing 3 for an abbreviated resulting semantic representation).</p></div>
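The chunking procedure from the footnote (split off at most 1500 characters, backtracking to the last sentence boundary, and falling back to the full window when no boundary is found) can be sketched as follows; `chunk_text` is a hypothetical helper name, not part of the described toolchain.

```python
def chunk_text(text: str, max_len: int = 1500):
    """Split text into chunks of at most max_len characters, preferring
    to break after the last sentence boundary (., ?, !) in each window."""
    chunks = []
    while text:
        if len(text) <= max_len:
            chunks.append(text)
            break
        window = text[:max_len]
        # position of the last sentence boundary within the window
        cut = max(window.rfind("."), window.rfind("?"), window.rfind("!"))
        if cut == -1:
            cut = max_len - 1  # no boundary found: use the full window
        chunks.append(text[:cut + 1])
        text = text[cut + 1:].lstrip()
    return chunks

print(chunk_text("First sentence. Second sentence? Third!", max_len=20))
# ['First sentence.', 'Second sentence?', 'Third!']
```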
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">StoryBlink</head><p>Based on the previously described methodology, we have provided an automatic way of describing the content of (digital) publications as linked data. This linked data is not meant to be granular, i.e., to describe every sentence in the publication, but to provide a high-level overview of the relevant concepts of a publication in an automatic and machine-interpretable way. This extracted information is published using a Triple Pattern Fragments server <ref type="foot" target="#foot_7">8</ref> , from here on called the books endpoint. Using the HTML interface of the books endpoint, you can explore the mentioned concepts per book. We use this published linked data set in our proof-of-concept: StoryBlink. Through StoryBlink, we enable the discovery of stories by linking books based on their content.</p><p>StoryBlink is a Web application that allows users to find relevant books based on two books selected by the user. EiCE finds the paths between these two books, where the nodes are related books, and the links are built using the common concepts as detected in the books.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Web Application</head><p>The data to feed the Web application was extracted from twenty classic books as found on Project Gutenberg. The resulting linked data was published as the books endpoint, which serves as the endpoint for the Everything is Connected engine.</p><p>When opening StoryBlink<ref type="foot" target="#foot_8">9</ref> , the user needs to choose two books as endpoints. The Everything is Connected engine will then use the books endpoint to look for a path of books between the first book and the second book (Figure <ref type="figure" target="#fig_0">1</ref>). The advantage of using the EiCE is that this engine aims at finding relevant yet surprising links between concepts. This way, StoryBlink returns content-based recommendations based on these books, without returning too obvious results. For example, recommending a publication because it is also of the type Book is not the desired result. The (semantic) commonalities between two linked books can easily be found using a SPARQL query as shown in Listing 4, and are also visualized in StoryBlink when clicking on a link between two books. These visualized commonalities allow the user to personally assess the relevancy between two books on a content level, e.g., one user can assess a book to be relevant because it mentions the same locations, whilst another user can assess another book to be relevant because it mentions the same religion.</p><p>SELECT DISTINCT * { &lt;/book1&gt; ?predicate1 ?object . &lt;/book2&gt; ?predicate2 ?object . } Listing 4. SPARQL query to find the commonalities between two books.</p><p>@prefix schema: &lt;http://schema.org/&gt; . @prefix nif: &lt;http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#&gt; . @prefix itsrdf: &lt;http://www.w3.org/2005/11/its/rdf#&gt; . @prefix dbr: &lt;http://dbpedia.org/resource/&gt; . @prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; . 
@prefix pg84: &lt;http://www.gutenberg.org/ebooks/84.epub#&gt; . pg84:book a schema:Book . pg84:book itsrdf:taIdentRef dbr:Chamois, dbr:Desert, ... Listing 5. The extracted linked data can be greatly reduced.</p></div>
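The semantics of the commonalities query in Listing 4 can be illustrated in plain Python: two books share a commonality when they link, via any predicate, to the same object. The triples below are mock data in the style of Listing 5; `pg2701:book` is a hypothetical second book identifier introduced for this example.

```python
# Illustrative in-memory version of the commonalities query of Listing 4,
# over mock (subject, predicate, object) triples.
triples = [
    ("pg84:book",   "itsrdf:taIdentRef", "dbr:Chamois"),
    ("pg84:book",   "itsrdf:taIdentRef", "dbr:Desert"),
    ("pg2701:book", "itsrdf:taIdentRef", "dbr:Desert"),
    ("pg2701:book", "itsrdf:taIdentRef", "dbr:Whale"),
]

def commonalities(book1, book2, triples):
    """Objects linked (by any predicate) from both book1 and book2."""
    objects1 = {o for s, _, o in triples if s == book1}
    objects2 = {o for s, _, o in triples if s == book2}
    return objects1 & objects2

print(commonalities("pg84:book", "pg2701:book", triples))  # {'dbr:Desert'}
```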
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Improving the Performance</head><p>The books endpoint as described above introduces performance issues, not only in reasoning time, but also in the quality of the results: the EiCE returns many paths, and takes too much time to do so. This has two causes: the data model uses a two-step link between a book and its concepts, and many unimportant concepts are taken into account.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Two-step link</head><p>The data model, as discussed in Section 2, introduces a two-step link between a book and its detected concepts. First, the book has a certain range, as defined by its CFI, and second, this CFI references a certain concept in DBpedia. However, when we want to provide a high-level overview of a digital publication, we do not care where in the book the concept was detected, but just that it was detected. We can thus cut the data model short as shown in Listing 5. This not only greatly reduces the number of required triples, it also enables a direct link between a book and its detected concepts. Keeping all detected concepts There are two issues with keeping all detected concepts within a book in the data set, namely that (i) the data set grows larger, and (ii) concepts that are detected only a few times are irrelevant for the high-level overview of a book.</p><p>As the data set grows larger, the performance of the Everything is Connected engine becomes more troublesome. Indeed, when more data is available, the number of potential paths increases, and thus also the searching time of the path finding algorithm. And as this large data set contains many irrelevant concepts, the path finding algorithm also returns many more irrelevant paths.</p><p>To improve on this, we only keep the top X% of mentioned concepts. As the initial analysis results kept all references of a detected concept, we can easily count the number of mentions of a concept, and use those counts to only keep the most mentioned concepts. We keep the top 50% of mentioned concepts, as this is the ideal compromise between execution time and found paths (see <ref type="bibr">Section 4)</ref>.</p><p>Furthermore, we remove all Project Gutenberg-specific mentions. As these books all originate from Project Gutenberg, they all have an identical disclaimer at the beginning. Thus, all books have the concept Project Gutenberg detected. 
However, this link is irrelevant to the content of the book, which is why we remove all Project Gutenberg mentions from the database.</p><p>The aforementioned two optimizations have been implemented in the final proof-of-concept, running at http://uvdt.test.iminds.be/storyblink/.</p></div>
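The described reduction can be sketched as follows, assuming the full list of detected concept URIs per book is available. `reduce_concepts`, the mock mention list, and the exact blacklist handling are illustrative, not the authors' implementation; the DBpedia URI for the Project Gutenberg concept is an assumption.

```python
from collections import Counter

def reduce_concepts(mentions, keep_fraction=0.5,
                    blacklist=("http://dbpedia.org/resource/Project_Gutenberg",)):
    """Keep only the top keep_fraction of most-mentioned concepts and
    drop boilerplate concepts such as Project Gutenberg. `mentions` is
    the full list of detected concept URIs, one entry per detection."""
    counts = Counter(u for u in mentions if u not in blacklist)
    n_keep = max(1, round(len(counts) * keep_fraction))
    return [uri for uri, _ in counts.most_common(n_keep)]

# Mock detections: counts per concept decide what survives the cutoff.
mentions = (["dbr:Chamois"] * 5 + ["dbr:Desert"] * 3 +
            ["dbr:Paris"] * 2 + ["dbr:Ice"] * 1 +
            ["http://dbpedia.org/resource/Project_Gutenberg"] * 10)

print(reduce_concepts(mentions))  # ['dbr:Chamois', 'dbr:Desert']
```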
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>The proposed methodology allows for automatic analysis and extraction of relevant metadata, alleviating the need for manual annotations. The Everything is Connected engine uses these semantic annotations with no preference for popular or unpopular works, and thus avoids the long tail effect. However, the performance of StoryBlink is poor when taking into account all mentioned concepts, which is why we propose to only keep the top X% of all mentioned concepts. To find a good cutoff value, we evaluate the path finding algorithm in terms of time and number of paths found, whilst varying this cutoff value (Figure <ref type="figure">2</ref>) <ref type="foot" target="#foot_9">10</ref> . Given that the number of triples per cutoff value rises exponentially, it is no surprise that the computation time also rises exponentially. In fact, the correlation between the number of triples taken into consideration and the average path finding time is 99.45%. The number of paths found saturates around the 50% cutoff value.</p><p>When we keep all detected concepts, the path finding algorithm reaches the 60s timeout value. This results in a calculation time of 60s with 0 paths found, which is why we see a clear decrease in found paths when we keep all detected concepts in the books endpoint. In Figure <ref type="figure">2</ref>, we can see how the graph has a noticeable breakpoint at the 50% mark. We also see that, after that mark, there is very little gain in the number of found paths. If we compare the maximum number of found paths with the number of found paths at the 50% mark, we see that 94.06% of all potential paths can be found in about one eighth of the time, i.e., 5.28s.</p><p>Given the strong correlation between path finding time and the number of triples in the data set, we can calculate the linear regression between these variables, i.e., executionTime(ms) = 2.2195 · triples + 2194.7. 
Given that in the test data set there were on average 54.55 triples per book, we can compute that we can take into account at most 64 books when trying to find a path with a response time lower than 10s. This can be done by, e.g., picking 62 books at random out of a catalog besides the two already selected books. However, these calculations do not need to be done for every request, as the data set is a static data source. Therefore, caching (part of) the responses could greatly reduce the response times, thus allowing StoryBlink to take more books into account.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Amount of considered concepts (%)</head><p>Fig. <ref type="figure">2</ref>. When evaluating how many detected concepts to keep in the books endpoint, we see a saturation when we only take into account 50% of the mentioned concepts. The average path finding time for that cutoff value is 5.28s, and 94.06% of the maximal number of found paths is still found. We see a clear decrease in found paths for 100% of the considered concepts, as keeping all detected concepts introduces application timeouts.</p></div>
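The capacity estimate above can be reproduced from the fitted regression executionTime(ms) = 2.2195 · triples + 2194.7 and the average of 54.55 triples per book; `max_books` is a hypothetical helper name for this back-of-the-envelope calculation.

```python
import math

def max_books(budget_ms=10_000, slope=2.2195, intercept=2194.7,
              triples_per_book=54.55):
    """How many books fit under the given response-time budget, using the
    linear regression between triple count and path finding time."""
    max_triples = (budget_ms - intercept) / slope
    return math.floor(max_triples / triples_per_book)

print(max_books())  # 64, matching the paper's "at most 64 books"
```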
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions and Future Work</head><p>By using engines such as DBpedia Spotlight, it is possible to extract detected concepts from a digital publication, and as such, provide a high-level overview of the content of this publication. In this paper, we presented our methodology to achieve an automatic extraction and publication of detected concepts of a (digital) publication.</p><p>However, taking into account every detected concept harms the resulting data set, as irrelevant concepts are also taken into account. By extracting a simplified subset of all detection results, we were able to fuel a content-based recommendation system based on the Everything is Connected engine. This engine can connect digital publications with each other by trying to find relevant yet surprising paths between two concepts. The result is a novel way of content recommendation, no longer tied to social recommendation systems, thus avoiding the long tail effect.</p><p>The path finding algorithm takes on average 5.28s to find all relevant books for a data set of 20 books, which makes our Web application usable for book catalogs of at most 64 books. It is clear that this is not ready for real-life catalogs. However, this problem can easily be resolved by caching the results. As the data set is fixed and not prone to much change, this is a possible solution.</p><p>This work could be further improved as follows.</p><p>-The choice of keeping the top 50% of mentioned concepts biases the books endpoint towards common concepts. Future work is to compare this method with similar metrics that do not bias towards common concepts, such as term frequency-inverse document frequency (tf-idf). -Instead of selecting two books to find a path between them, StoryBlink could be adapted to start from only one book. This way, StoryBlink can become a recommendation system, suggesting linked books based on a starting book. This implies making changes to the Everything is Connected engine. -DBpedia Spotlight could be replaced by other NER and NED engines, to see how this influences the perceived quality of StoryBlink. We can evaluate how using, e.g., Babelfy <ref type="bibr" target="#b8">[9]</ref>, affects the results of StoryBlink. -As the Triple Pattern Fragments client allows for federated querying, we can expand the usage of StoryBlink by finding links across data sets, e.g., books and movies, or books and music.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Listing 1 .</head><label>1</label><figDesc>file:///book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/1:10) Example identifier for an EPUB file to identify, from right to left, the tenth character (:10) of the text (/1) of the fifth paragraph (/10[para05]) of the body (/4[body01]) of the first chapter (/6/4[chap01ref]) of the file book.epub.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. The StoryBlink Web application allows a user to find relevant links between two digital classic works.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Whitespace should be taken into account when stripping the tags of an HTML document</figDesc><table><row><cell>&lt;p&gt;You have following letters</cell></row><row><cell>&lt;ul&gt;&lt;li&gt;f&lt;/li&gt;&lt;li&gt;i&lt;/li&gt;&lt;li&gt;n&lt;/li&gt;&lt;li&gt;l&lt;/li&gt;&lt;/ul&gt;and</cell></row><row><cell>all are part of the latin alphabet.</cell></row><row><cell>&lt;/p&gt;</cell></row><row><cell>after stripping:</cell></row><row><cell>You have following letters</cell></row><row><cell>finland</cell></row><row><cell>all are part of the latin alphabet.</cell></row><row><cell>Listing 2.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>&lt;http://www.gutenberg.org/ebooks/84.epub#book&gt; a schema:Book . Sample abbreviated output of the entity extraction method that links the Gutenberg books with their detected concepts using the ITS and NIF ontologies.</figDesc><table><row><cell>&lt;http://www.gutenberg.org/ebooks/84.epub#epubcfi(/6/12!/4/2/4)&gt;</cell></row><row><cell>itsrdf:taIdentRef dbr:Chamois ;</cell></row><row><cell>nif:sourceUrl &lt;http://www.gutenberg.org/ebooks/84.epub&gt; .</cell></row><row><cell>&lt;http://www.gutenberg.org/ebooks/84.epub#epubcfi(/6/2!/4/46[chap01</cell></row><row><cell>]/16/42)&gt;</cell></row><row><cell>itsrdf:taIdentRef dbr:Chamois ;</cell></row><row><cell>nif:sourceUrl &lt;http://www.gutenberg.org/ebooks/84.epub&gt; .</cell></row><row><cell>&lt;http://www.gutenberg.org/ebooks/84.epub#epubcfi(/6/12!/4/2/6)&gt;</cell></row><row><cell>itsrdf:taIdentRef dbr:Desert ;</cell></row><row><cell>nif:sourceUrl &lt;http://www.gutenberg.org/ebooks/84.epub&gt; .</cell></row><row><cell>...</cell></row><row><cell>Listing 3.</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.gutenberg.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">e.g., LIDER and FREME (http://www.lider-project.eu/ and http://www. freme-project.eu/, respectively).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Also published as an ontology on http://www.w3.org/2005/11/its/rdf#.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">99.999% uptime between November 2014 and February 2015<ref type="bibr" target="#b17">[18]</ref> </note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/72</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">See http://www.w3.org/TR/html5/dom.html#kinds-of-content for the types of content in HTML5, and see https://github.com/bjdmeest/node-html-to-text for a possible HTML-to-text implementation.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">For each subsequent chunk of text, we split off the next 1500 characters and look for the last occurrence of a sentence boundary, i.e., the last period (.), question mark (?), or exclamation mark (!). If none is found, the full 1500 characters are used as the next chunk.</note>
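The chunking strategy of footnote 7 can be sketched as follows (a minimal illustration; the function and constant names are our own):

```python
CHUNK_SIZE = 1500  # maximum chunk length, per footnote 7

def chunk_text(text):
    """Split text into chunks of at most CHUNK_SIZE characters,
    preferring to break at the last sentence boundary (., ?, !)
    inside each window; fall back to a hard cut if none is found."""
    chunks = []
    while text:
        head = text[:CHUNK_SIZE]
        # position of the last sentence-ending punctuation in this window
        boundary = max(head.rfind("."), head.rfind("?"), head.rfind("!"))
        if boundary == -1 or len(text) <= CHUNK_SIZE:
            cut = len(head)          # no boundary, or final chunk: take it all
        else:
            cut = boundary + 1       # include the punctuation mark itself
        chunks.append(text[:cut])
        text = text[cut:]
    return chunks
```

This keeps each chunk within the 1500-character limit while ending chunks on sentence boundaries whenever one is present in the window.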
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">http://uvdt.test.iminds.be/storyblinkdata/books</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">http://uvdt.test.iminds.be/storyblink/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9">Windows 7 SP1 64bit, Intel i5-3340M CPU@2.70GHz, 8GB RAM, 256GB SSD.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgements The research activities described in this paper were funded by Ghent University, iMinds, the IWT Flanders, the FWO-Flanders, and the European Union, in the context of the project "Uitgeverij van de Toekomst" (Publisher of the Future).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The Long Tail: Why the Future of Business is Selling Less of More</title>
		<author>
			<persName><forename type="first">C</forename><surname>Anderson</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006-07">July 2006</date>
			<publisher>Hyperion</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Recommender systems survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bobadilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ortega</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hernando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gutiérrez</surname></persName>
		</author>
		<ptr target="http://www.sciencedirect.com/science/article/pii/S0950705113001044" />
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="109" to="132" />
			<date type="published" when="2013-07">Jul 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Conboy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Garrish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gylling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Mccoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Makoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weck</surname></persName>
		</author>
		<ptr target="http://www.idpf.org/epub/301/spec/epub-overview.html" />
		<title level="m">EPUB 3 Overview. Tech. rep</title>
				<imprint>
			<publisher>IDPF</publisher>
			<date type="published" when="2014-06">June 2014, accessed January 22nd, 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Discovering meaningful connections between resources in the Web of Data</title>
		<author>
			<persName><forename type="first">L</forename><surname>De Vocht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Coppens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Verborgh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vander Sande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mannens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van De Walle</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-996/papers/ldow2013-paper-04.pdf" />
	</analytic>
	<monogr>
		<title level="m">Linked Data on the Web (LDOW)</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Heath</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hausenblas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</editor>
		<meeting><address><addrLine>Rio De Janeiro, Brazil</addrLine></address></meeting>
		<imprint>
			<publisher>CEUR</publisher>
			<date type="published" when="2013-05">May 2013</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Filip</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccance</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lommel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kosek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sasaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Savourel</surname></persName>
		</author>
		<ptr target="http://www.w3.org/TR/its20/" />
		<title level="m">Internationalization Tag Set (ITS) Version 2.0. Tech. rep., W3C</title>
				<imprint>
			<date type="published" when="2013-10">October 2013, accessed June 16th, 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Hellmann</surname></persName>
		</author>
		<ptr target="http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html" />
		<title level="m">NIF 2.0 Core Ontology</title>
				<imprint>
			<date type="published" when="2015">2015, accessed June 16th, 2015</date>
		</imprint>
		<respStmt>
			<orgName>AKSW, University Leipzig</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. rep.</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">DBpedia Spotlight: Shedding light on the Web of documents</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Mendes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García-Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th International Conference on Semantic Systems</title>
				<meeting>the 7th International Conference on Semantic Systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Psychological and health outcomes of perceived information overload</title>
		<author>
			<persName><forename type="first">S</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Stokols</surname></persName>
		</author>
		<ptr target="http://eab.sagepub.com/content/44/6/737" />
	</analytic>
	<monogr>
		<title level="j">Environment and Behavior</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="737" to="759" />
			<date type="published" when="2012-11">Nov 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Entity Linking meets Word Sense Disambiguation: a Unified Approach</title>
		<author>
			<persName><forename type="first">A</forename><surname>Moro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raganato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<ptr target="http://wwwusers.di.uniroma1.it/~navigli/pubs/TACL_2014_Babelfy.pdf" />
	</analytic>
	<monogr>
		<title level="m">Transactions of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="231" to="244" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A survey of Named Entity Recognition and Classification</title>
		<author>
			<persName><forename type="first">D</forename><surname>Nadeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sekine</surname></persName>
		</author>
		<idno type="DOI">10.1075/li.30.1.03nad</idno>
		<ptr target="http://dx.doi.org/10.1075/li.30.1.03nad" />
	</analytic>
	<monogr>
		<title level="j">Lingvisticae Investigationes</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="3" to="26" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Knowledge extraction based on discourse representation theory and linguistic frames</title>
		<author>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Draicchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gangemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Engineering and Knowledge Management</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="114" to="129" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Extracting knowledge from text using SHELDON, a Semantic Holistic framEwork for LinkeD ONtology data</title>
		<author>
			<persName><forename type="first">D</forename><surname>Reforgiato Recupero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Nuzzolese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Consoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mongiovì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Peroni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th International Conference on World Wide Web Companion</title>
				<meeting>the 24th International Conference on World Wide Web Companion</meeting>
		<imprint>
			<publisher>International World Wide Web Conferences Steering Committee</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="235" to="238" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">NERD: Evaluating Named Entity Recognition tools in the Web of Data</title>
		<author>
			<persName><forename type="first">G</forename><surname>Rizzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Web Scale Knowledge Extraction</title>
				<meeting><address><addrLine>ISWC2011; Bonn, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011-10">October 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">EPUB Canonical Fragment Identifier (epubcfi) Specification</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sorotokin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Conboy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Duga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rivlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Beaver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ballard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fettes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weck</surname></persName>
		</author>
		<ptr target="http://www.idpf.org/epub/linking/cfi/epub-cfi.html" />
		<imprint>
			<date type="published" when="2014-06">June 2014, accessed June 16th, 2015</date>
			<publisher>IDPF</publisher>
		</imprint>
	</monogr>
	<note type="report_type">Tech. rep</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">AGDISTIS - graph-based disambiguation of Named Entities using Linked Data</title>
		<author>
			<persName><forename type="first">R</forename><surname>Usbeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Ngonga Ngomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gerber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Both</surname></persName>
		</author>
		<ptr target="http://svn.aksw.org/papers/2014/ISWC_AGDISTIS/public.pdf" />
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">GERBIL: General entity annotator benchmarking framework</title>
		<author>
			<persName><forename type="first">R</forename><surname>Usbeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Röder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Ngonga Ngomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Baron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Both</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brümmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ceccarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cornolti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cherix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Eickmann</surname></persName>
		</author>
		<ptr target="http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th International Conference on World Wide Web</title>
				<meeting>the 24th International Conference on World Wide Web</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1133" to="1143" />
		</imprint>
	</monogr>
	<note>International World Wide Web Conferences Steering Committee</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Querying datasets on the Web with high availability</title>
		<author>
			<persName><forename type="first">R</forename><surname>Verborgh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Hartig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>De Meester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Haesendonck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>De Vocht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vander Sande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cyganiak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Colpaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mannens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van De Walle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="180" to="196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Initial usage analysis of DBpedia&apos;s Triple Pattern Fragments</title>
		<author>
			<persName><forename type="first">R</forename><surname>Verborgh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mannens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van De Walle</surname></persName>
		</author>
		<ptr target="http://linkeddatafragments.org/publications/usewod2015.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th USEWOD Workshop on Usage Analysis and the Web of Data</title>
				<meeting>the 5th USEWOD Workshop on Usage Analysis and the Web of Data</meeting>
		<imprint>
			<date type="published" when="2015-06">Jun 2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
