<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exposing Digital Content as Linked Data, and Linking them using StoryBlink</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom De Nies</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurens De Vocht</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Mannens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rik Van de Walle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multimedia Lab</institution>,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Digital publications host a large amount of data that currently is not harvested, due to its unstructured nature. However, manually annotating these publications is tedious. Current tools that automatically analyze unstructured text are too fine-grained for larger amounts of text such as books. A workable machine-interpretable version of larger bodies of text is thus necessary. In this paper, we suggest a workflow to automatically create and publish a machine-interpretable version of digital publications as linked data via DBpedia Spotlight. Furthermore, we make use of the Everything is Connected Engine on top of this published linked data to link digital publications using a Web application dubbed "StoryBlink". StoryBlink shows the added value of publishing machine-interpretable content of unstructured digital publications by finding relevant books that are connected to selected classic works. Currently, the time to find a connecting path can be quite long, but this can be overcome by using caching mechanisms, and the relevancy of found paths can be improved by better denoising the DBpedia Spotlight results, or by using alternative disambiguation engines.</p>
      </abstract>
      <kwd-group>
        <kwd>EPUB 3</kwd>
        <kwd>DBpedia Spotlight</kwd>
        <kwd>Linked Data</kwd>
        <kwd>NIF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Digital publications such as digital books (e-books) and online articles house
a vast amount of content. Meanwhile, the amount of published content is
rising more than ever, due to advancements in communication technologies. This
situation leads to a (harmful) information overload for end users [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Current systems for finding relevant publications are largely based on two
approaches [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: social recommendation, and (content-based) recommendation based
on metadata. On the one hand, social recommendation systems are usually
biased towards popular publications, making unpopular publications harder
to find and causing them to be read less (the so-called long tail
effect [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Metadata-based recommendation systems on the other hand rely on
the availability of high-quality tags, yet these tags currently need to be entered
manually, making this process very tedious and costly.
      </p>
      <p>The contribution of this paper is twofold. First, we describe a system for the
automatic creation of relevant tags using Natural Language Processing (NLP)
techniques. Second, we apply these tags in a proof-of-concept that finds
publications linking two given publications, based on their content.</p>
      <p>In Section 1, we review the state of the art and introduce some important
relevant technologies. Afterwards, we present the two contributions in Section 2
and Section 3 respectively. These contributions are evaluated in Section 4, after
which we conclude in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>In this section, we first look at systems that create machine-interpretable
representations of natural language text and link these representations to the
Semantic Web (Subsection 1.1). Second, we review the relevant
technologies used to create our proof-of-concept (Subsection 1.2).</p>
      <sec id="sec-2-1">
        <title>Semantic Natural Language Processing</title>
        <p>
          To link entities to Semantic Web data sources, these entities first need to be
recognized in the text through Named Entity Recognition (NER). Then, a second
analysis is needed to disambiguate these recognized entities into places,
people, or other types (Named Entity Disambiguation or NED) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. NERD [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
AGDISTIS [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and DBpedia Spotlight [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] are examples of recognition and
disambiguation engines that also connect the detected concepts to their URI on
http://dbpedia.org. SHELDON [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is an information extraction framework,
accessible via its web interface, that can represent outcomes from multiple
NLP domains (e.g., Part-Of-Speech tagging, sentiment analysis, and relation
extraction) as linked data. The core of SHELDON is FRED [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which describes
natural language sentences as linked data graphs. SHELDON can use different
NLP mechanisms to extract extra information based on this graph. This linking
from natural language to linked data sources is the key difference between
conventional NLP tools and Semantic NLP tools, and enables publishing (parts
of) natural language text as linked data. Also, the latter two systems are
open-source, whereas the other engines are closed-source or commercial products.
        </p>
        <p>
          GERBIL [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is a general entity annotator benchmarking framework to
compare the performance of different annotation engines. Although DBpedia
Spotlight is not the best-performing engine according to GERBIL, the fact that it
can be installed locally makes it a very useful engine to annotate large
corpora of text. Also, DBpedia Spotlight can perform entity recognition, whereas
AGDISTIS is solely a disambiguation engine.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Underlying Technologies</title>
        <p>
          Digital books Digital books are usually distributed in a format that allows
them to be read offline. EPUB, version 3 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is a packaging format created by the
International Digital Publishing Forum (IDPF). Under the hood, it is a zipped
package of a folder with content files (i.e., HTML, but also images and videos,
for example), together with a package file that manages the links to these content
files. What makes it stand out from other distribution formats is that it is an
open format that makes use of Open Web Platform technologies.
Listing 1. Example identifier for an EPUB file, identifying (from right to left) the
tenth character (:10) of the text (/1) of the fifth paragraph (/10[para05]) of the body
(/4[body01]) of the first chapter (/6/4[chap01ref]) of the file book.epub.
        </p>
        <p>
          A notable part of the EPUB specification is the EPUB
Canonical Fragment Identifier (CFI) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. This identifier allows the specification
of any content within an EPUB package, be it a range of text or a DOM
element, and it is used to create a URI fragment that uniquely defines this piece
of content. It uses the XML structure of the manifest file and the HTML files to
follow a slash-delimited path. It uses even numbers (n) to step into the (n/2)th child
element, and odd numbers to step into the text node of that child element.
To improve robustness, the IDs of the DOM elements may be added between
square brackets ([id]). Listing 1 shows an EPUB CFI that identifies the tenth
character of the fifth paragraph of the first chapter of the file book.epub.
Project Gutenberg Project Gutenberg1 is an effort to digitize print books
that are in the public domain. It currently houses over 49,000 free e-books in
various formats, including plain text, HTML, and EPUB. The sample content
used in this paper has been downloaded from the Project Gutenberg website.
The NIF and ITS Ontologies The NLP Interchange Format, version 2.0
(NIF) is an RDF format that provides interoperability between NLP tools,
language resources, and annotations [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Although it is a very extensive vocabulary,
we only need the ontology to link publications with the parts in which an
entity was recognized. NIF is being actively used in several European projects2.
To semantically connect the NLP results to links on DBpedia, we used the
Internationalization Tag Set, version 2.0 (ITS). ITS is a vocabulary3 to foster the
automated processing of Web content [
        </p>
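        <p>The even/odd step rule of EPUB CFIs can be sketched in a few lines of Python. The following fragment is an illustrative sketch, not an implementation of the full EPUB CFI specification: it handles only even (element) steps, resolved against a small element tree built with the standard library.</p>

```python
# Illustrative sketch of EPUB CFI element steps (a small subset of the
# specification): an even step n selects the (n/2)-th child element
# (1-based). Odd steps, which select text nodes, are not handled here.
import xml.etree.ElementTree as ET

def resolve_element_steps(node, steps):
    """Follow even CFI steps down an element tree."""
    for n in steps:
        assert n % 2 == 0, "only element steps are handled in this sketch"
        node = list(node)[n // 2 - 1]  # the (n/2)-th child element, 1-based
    return node

# A tiny stand-in for the body of a chapter's HTML file.
body = ET.Element("body", {"id": "body01"})
ET.SubElement(body, "h1").text = "Chapter 1"
ET.SubElement(body, "p", {"id": "para05"}).text = "Some paragraph text."

# Step 4 selects the second child element of body, i.e. the paragraph.
print(resolve_element_steps(body, [4]).get("id"))  # para05
```

        <p>Odd steps and character offsets (such as :10 in Listing 1) would additionally require descending into text nodes, which this sketch deliberately omits.</p>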
        <p>
          Triple Pattern Fragments Triple Pattern Fragments are a way of hosting
and querying linked data in an affordable and reliable way [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. By moving the
complex query execution logic from the server to the client, and letting the server
answer only simple triple patterns, the required server power is
greatly decreased. This results in a much lower cost and an increased uptime
        </p>
        <sec id="sec-2-2-1">
          <title>1 http://www.gutenberg.org/</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>2 e.g., LIDER and FREME (http://www.lider-project.eu/ and http://www.</title>
          <p>freme-project.eu/, respectively).</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>3 Also published as an ontology on http://www.w3.org/2005/11/its/rdf#.</title>
          <p>
            when hosting such servers compared to standard SPARQL endpoints4. Because
of the reliability of Triple Pattern Fragments servers, linked data applications can
be built on top of live endpoints, whereas the uptime of other query endpoints
is often not reliable enough to serve as a back-end for linked data applications.
Everything is Connected The Everything is Connected engine (EiCE) is a
pathfinding engine [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Given two semantic concepts, it returns paths between
them using a linked data endpoint. Links between two nodes are found using
configurable heuristics. In this paper, we chose to link nodes when they
share non-trivial relations. For example, a non-trivial relation is that both nodes
have their birthplace in Paris; a trivial relation is that both nodes are of the type
Person. EiCE looks for these links directly between the two given concepts, or
indirectly through intermediate concepts. The found paths are weighted based on
the number of common relations per link and the length of the path.
          </p>
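            <p>The non-trivial-relation heuristic can be illustrated with a small Python sketch. The property names, and the set of predicates considered trivial, are assumptions for the example only.</p>

```python
# Illustrative sketch of the link heuristic described above: two nodes are
# linked when they share non-trivial (predicate, object) pairs. The sample
# relations and the "trivial" predicate set are made up for this example.
TRIVIAL_PREDICATES = {"rdf:type"}

def shared_relations(relations_a, relations_b):
    """Return the non-trivial (predicate, object) pairs both nodes share."""
    common = relations_a.intersection(relations_b)
    return {po for po in common if po[0] not in TRIVIAL_PREDICATES}

victor = {("rdf:type", "dbo:Person"), ("dbo:birthPlace", "dbr:Paris")}
emile = {("rdf:type", "dbo:Person"), ("dbo:birthPlace", "dbr:Paris")}
print(shared_relations(victor, emile))  # only the birthplace relation remains
```

            <p>Sharing dbo:birthPlace Paris links the two nodes, whereas merely sharing the type Person does not.</p>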
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>From Books to Linked Data</title>
      <p>
        Our overall goal is to provide a linked data endpoint that houses a relevant view
of the content of a digital publication. To this end, we use DBpedia Spotlight [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
to extract and disambiguate the concepts in plain text (i.e., perform NER and
NED), and link these concepts to their DBpedia URIs. However, two problems
arise when using DBpedia Spotlight on an EPUB file: (i) an EPUB
file usually consists of multiple HTML files, whilst DBpedia Spotlight works
with plain text, and (ii) DBpedia Spotlight has a practical limit on the maximal
number of characters that can be analyzed within one run5.
      </p>
      <p>To solve these problems, we extract the text from the HTML files in the
EPUB file, in reading order. Note that correct handling of
whitespace is important, as stripping the HTML tags could result in wrongly
concatenated words, which in turn would yield wrong results from DBpedia Spotlight.
Listing 2 shows how naively stripping the tags of a (valid) HTML document can
introduce errors: the list [f, i, n, l] is wrongly concatenated with the
word "and", which results in the word "finland", changing the original text "You have
following letters [f, i, n, l] and all are part of the latin alphabet" to "You have
following letters finland all are part of the latin alphabet". For the stripped sentence,
DBpedia Spotlight will return "Finland" and "Latin alphabet" as disambiguated
results, of which "Finland" is wrongly identified. Thus, depending on the context
(i.e., whether it is phrasing content or only flow content), additional whitespace
should or should not be added to the original HTML6.</p>
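      <p>The following Python sketch illustrates this whitespace handling. The set of block-level tags is an assumption for the example; a production implementation would follow the HTML5 content categories referenced above.</p>

```python
# A minimal sketch of whitespace-aware text extraction. The block-tag list
# is an assumption; the idea is that text around flow-level elements is
# joined with spaces, so stripping tags cannot glue "l" and "and" together.
import xml.etree.ElementTree as ET

BLOCK_TAGS = {"p", "ul", "ol", "li", "div"}

def extract_text(node):
    parts = [node.text or ""]
    for child in node:
        sep = " " if child.tag in BLOCK_TAGS else ""
        parts.append(sep + extract_text(child) + sep)
        parts.append(child.tail or "")
    return "".join(parts)

# Rebuild the HTML fragment of Listing 2 programmatically.
p = ET.Element("p")
p.text = "You have following letters "
ul = ET.SubElement(p, "ul")
ul.tail = "and all are part of the latin alphabet."
for letter in "finl":
    ET.SubElement(ul, "li").text = letter

# Normalizing the whitespace yields "... f i n l and ...", not "finland".
print(" ".join(extract_text(p).split()))
```

      <p>A naive concatenation of all text values of the same tree would instead glue the final list item to the following word, producing the erroneous "finland".</p>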
      <p>
        After all text is extracted from the EPUB file, this text is split up into text
parts manageable by DBpedia Spotlight. If we would not split up the text, the
4 99.999% uptime between November 2014 and February 2015 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
      </p>
      <sec id="sec-3-1">
        <title>5 https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/72</title>
      </sec>
      <sec id="sec-3-2">
        <title>6 See http://www.w3.org/TR/html5/dom.html#kinds-of-content for the types of</title>
        <p>content in HTML5, and see https://github.com/bjdmeest/node-html-to-text for
a possible HTML-to-text implementation.
&lt;p&gt;You have following letters
&lt;ul&gt;&lt;li&gt;f&lt;/li&gt;&lt;li&gt;i&lt;/li&gt;&lt;li&gt;n&lt;/li&gt;&lt;li&gt;l&lt;/li&gt;&lt;/ul&gt;and
all are part of the latin alphabet.
&lt;/p&gt;
after stripping:
You have following letters
finland
all are part of the latin alphabet.</p>
        <p>Listing 2. Whitespace should be taken into account when stripping the tags of an
HTML document.
DBpedia Spotlight instance would need to analyze the entire text of a publication
in one run, which could result in either server response time-outs or
out-of-memory errors. As DBpedia Spotlight only takes limited contextualization into
account, this splitting into sentence-sized chunks will have little to no effect on the results of
the NER and NED done by DBpedia Spotlight. For this paper, we split the text
minimally on sentence boundaries, with a maximal substring length of 1500 characters7.</p>
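        <p>The splitting strategy can be sketched as follows; the sentence-boundary characters (., ?, !) and the 1500-character limit follow the description above, while the function itself is illustrative.</p>

```python
# Sketch of the splitting strategy: chunks of at most `limit` characters,
# preferring to break at the last sentence boundary (., ? or !) found
# within each window; if none is found, the full window is taken.
def split_text(text, limit=1500):
    chunks = []
    while len(text) > limit:
        window = text[:limit]
        cut = max(window.rfind("."), window.rfind("?"), window.rfind("!"))
        if cut == -1:          # no sentence boundary: take the full window
            cut = limit - 1
        chunks.append(text[:cut + 1])
        text = text[cut + 1:]
    if text:
        chunks.append(text)
    return chunks

print(split_text("One sentence. Another one. A third!", limit=20))
# -> ['One sentence.', ' Another one.', ' A third!']
```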
        <p>The "annotate" function of DBpedia Spotlight, which recognizes entities to
annotate and returns a single identifier for each recognized entity, returns a
list of identifiers sorted in order of appearance. We use this order to
reconnect these results with the original HTML files. By comparing the original
text of the detected concept with the words in the HTML text nodes, in
order, we can reconstruct the range of the original detection. We then generate
the EPUB CFI of this range and use this CFI in the semantic representation of
the content of the publication.</p>
        <p>By making use of the NIF and ITS ontologies, we use well-defined, publicly
available ontologies that are already being used in business environments. In our
case, we used nif:sourceUrl on the one hand to indicate the link between the
detected range and the original book this range originated from, and
itsrdf:taIdentRef on the other hand to indicate the DBpedia link that was detected
from the text in this range (see Listing 3 for an abbreviated resulting semantic
representation).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>StoryBlink</title>
      <p>Based on the previously described methodology, we have provided an automatic
way of describing the content of (digital) publications as linked data. This linked
data is not meant to be granular, i.e., to describe every sentence in the publication,
7 For each following chunk of text, we split off the next 1500 characters and look for
the last occurrence of a sentence boundary, i.e., the last dot (.), question mark
(?), or exclamation mark (!). If no match is found, the full 1500 characters are used
as the next chunk of text.
@prefix schema: &lt;http://schema.org/&gt; .
@prefix nif:
&lt;http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#&gt; .
@prefix itsrdf: &lt;http://www.w3.org/2005/11/its/rdf#&gt; .
@prefix dbr: &lt;http://dbpedia.org/resource/&gt; .
@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
&lt;http://www.gutenberg.org/ebooks/84.epub#book&gt; a schema:Book .
&lt;http://www.gutenberg.org/ebooks/84.epub#epubcfi(/6/12!/4/2/4)&gt;
itsrdf:taIdentRef dbr:Chamois ;
nif:sourceUrl &lt;http://www.gutenberg.org/ebooks/84.epub&gt; .
&lt;http://www.gutenberg.org/ebooks/84.epub#epubcfi(/6/2!/4/46[chap01
]/16/42)&gt;
itsrdf:taIdentRef dbr:Chamois ;
nif:sourceUrl &lt;http://www.gutenberg.org/ebooks/84.epub&gt; .
&lt;http://www.gutenberg.org/ebooks/84.epub#epubcfi(/6/12!/4/2/6)&gt;
itsrdf:taIdentRef dbr:Desert ;
nif:sourceUrl &lt;http://www.gutenberg.org/ebooks/84.epub&gt; .
...</p>
      <p>Listing 3. Sample abbreviated output of the entity extraction method that links the
Gutenberg books with their detected concepts using the ITS and NIF ontologies.
but it is meant to provide a high-level overview of the relevant concepts of a
publication in an automatic and machine-interpretable way.</p>
      <p>This extracted information is published using a Triple Pattern Fragments
server8, henceforth called the books endpoint. Using the HTML interface of the
books endpoint, you can explore the mentioned concepts per book. We use this
published linked data set in our proof-of-concept: StoryBlink. Through
StoryBlink, we enable the discovery of stories by linking books based on their content.</p>
      <p>StoryBlink is a Web application that allows users to find relevant books
based on two books selected by the user. EiCE finds the paths between these
two books, where the nodes are related books, and the links are built using the
common concepts detected in the books.</p>
      <sec id="sec-4-1">
        <title>Web Application</title>
        <p>The data to feed the Web application was extracted from twenty classic books
found on Project Gutenberg. The resulting linked data was published as the
books endpoint, which serves as the endpoint for the Everything is Connected engine.</p>
        <p>When opening StoryBlink9, the user needs to choose two books as endpoints.
The Everything is Connected engine will then use the books endpoint to look
for a path of books between the first book and the second book (Figure 1).</p>
        <sec id="sec-4-1-1">
          <title>8 http://uvdt.test.iminds.be/storyblinkdata/books</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>9 http://uvdt.test.iminds.be/storyblink/</title>
          <p>Fig. 1. The StoryBlink Web application ("exploring stories through linked books"): the user selects a starting book (e.g., Notre-Dame De Paris) and an ending book (e.g., Les Misérables), after which StoryBlink displays the connecting path of linked books.</p>
          <p>The advantage of using the EiCE is that this engine aims at finding relevant
yet surprising links between concepts. This way, StoryBlink returns
content-based recommendations based on these books, without returning too-obvious
results. For example, recommending a publication merely because it is also of the type
Book is not the desired result. The (semantic) commonalities between two linked
books can easily be found using a SPARQL query as shown in Listing 4, and
are also visualized in StoryBlink when clicking on a link between two books.
These visualized commonalities allow the user to personally assess the relevancy
between two books on a content level; e.g., one user may judge a book to be
relevant because it mentions the same locations, whilst another user may judge
another book to be relevant because it mentions the same religion.</p>
          <p>SELECT DISTINCT * {
&lt;/book1&gt; ?predicate1 ?object .
&lt;/book2&gt; ?predicate2 ?object .
}</p>
          <p>Listing 4. SPARQL query to find the commonalities between two books.</p>
          <p>@prefix schema: &lt;http://schema.org/&gt; .
@prefix nif:
&lt;http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#&gt; .
@prefix itsrdf: &lt;http://www.w3.org/2005/11/its/rdf#&gt; .
@prefix dbr: &lt;http://dbpedia.org/resource/&gt; .
@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix pg84: &lt;http://www.gutenberg.org/ebooks/84.epub#&gt; .
pg84:book a schema:Book .
pg84:book itsrdf:taIdentRef dbr:Chamois,
dbr:Desert,
...</p>
          <p>Listing 5. The extracted linked data can be greatly reduced.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Improving the Performance</title>
        <p>The books endpoint as described above introduces performance issues, not only
in reasoning time but also in the quality of the results: the EiCE returns a lot of
paths, and takes too much time to do so. This has two causes: the
data model uses a two-step link between a book and its concepts, and
a lot of unimportant concepts are taken into account.</p>
        <p>Two-step link The data model, as discussed in Section 2, introduces a two-step
link between a book and its detected concepts. First, the book has a certain
range, as defined by its CFI, and second, this CFI references a certain concept
in DBpedia. However, when we want to provide a high-level overview of a digital
publication, we do not care where in the book the concept was detected, but just
that it was detected. We can thus shorten the data model as shown in Listing 5.
This not only greatly reduces the number of needed triples, it also enables a
direct link between a book and its detected concepts.</p>
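        <p>The reduction of Listing 5 amounts to dropping the CFI step from each detection. A minimal Python sketch, with sample values mirroring Listing 3:</p>

```python
# Sketch of the data-model reduction shown in Listing 5: per-range
# detections (book, CFI, concept) are collapsed into direct
# book-to-concept links, dropping the CFI component.
def reduce_detections(detections):
    """detections: iterable of (book, cfi, concept) triples."""
    reduced = {}
    for book, _cfi, concept in detections:
        reduced.setdefault(book, set()).add(concept)
    return reduced

detections = [
    ("pg84:book", "/6/12!/4/2/4", "dbr:Chamois"),
    ("pg84:book", "/6/2!/4/46[chap01]/16/42", "dbr:Chamois"),
    ("pg84:book", "/6/12!/4/2/6", "dbr:Desert"),
]
print(reduce_detections(detections))
# the two Chamois ranges collapse into a single book-to-Chamois link
```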
        <p>Keeping all detected concepts There are two issues with keeping all detected
concepts within a book in the data set: (i) the data set grows larger,
and (ii) concepts that are detected only a few times are irrelevant for the
high-level overview of a book.</p>
        <p>As the data set grows larger, the performance of the Everything is Connected
engine becomes more troublesome. Indeed, when more data is available, the
number of potential paths increases, and thus also the search time of
the pathfinding algorithm. And as this large data set contains a lot of irrelevant
concepts, the pathfinding algorithm also returns many more irrelevant paths.</p>
        <p>To improve on this, we only keep the top X% of mentioned concepts. As
the initial analysis results kept all references of a detected concept, we can easily
count the number of mentions of a concept, and use those counts to keep only the
most-mentioned concepts. We keep the top 50% of mentioned concepts, as this
is the best compromise between execution time and found paths (see Section 4).</p>
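        <p>The top-X% filter described above can be sketched as follows; the mention counts in the example are made up for illustration.</p>

```python
# Sketch of the top-X% filter: count how often each concept is mentioned
# and keep only the most-mentioned fraction of the distinct concepts.
from collections import Counter

def keep_top_concepts(mentions, keep_fraction=0.5):
    """mentions: list of concept URIs, one entry per detection."""
    counts = Counter(mentions)
    ranked = [concept for concept, _count in counts.most_common()]
    cutoff = max(1, round(len(ranked) * keep_fraction))
    return set(ranked[:cutoff])

mentions = (["dbr:Paris"] * 5 + ["dbr:Desert"] * 3
            + ["dbr:Chamois"] * 2 + ["dbr:Tea"])
print(keep_top_concepts(mentions))  # the top half: Paris and Desert
```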
        <p>Furthermore, we remove all Project Gutenberg-specific mentions. As these
books all originate from Project Gutenberg, they all have an identical disclaimer
at the beginning. Thus, all books have the concept Project Gutenberg detected.
However, this link is irrelevant to the content of the book, which is why we remove
all Project Gutenberg mentions from the database.</p>
        <p>The aforementioned two optimizations have been implemented in the final
proof-of-concept, running at http://uvdt.test.iminds.be/storyblink/.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>The proposed methodology allows for automatic analysis and extraction of
relevant metadata, alleviating the need for manual annotations. The Everything is
Connected engine uses these semantic annotations with no preference for popular
or unpopular works, and thus avoids the long tail effect. However, the
performance of StoryBlink is poor when taking into account all mentioned concepts,
which is why we propose to keep only the top X% of all mentioned concepts. To
find a good cutoff value, we evaluate the pathfinding algorithm in terms of time
and number of paths found, whilst varying this cutoff value (Figure 2)10. Given
that the number of triples per cutoff value rises exponentially, it is no surprise
that the computation time also rises exponentially. In fact, the correlation
between the number of triples taken into consideration and the average pathfinding
time is 99.45%. The number of paths found saturates around the 50% cutoff value.</p>
      <p>If we kept all detected concepts, the pathfinding
algorithm would reach the 60s timeout value. This results in a calculation time of
60s with 0 paths found, which is why we see a clear decrease in found paths
when we keep all detected concepts in the books endpoint. In Figure 2, we can
see that the graph has a noticeable breakpoint at the 50% mark. We also see
that, after that mark, there is very little gain in the number of found paths. If
we compare the maximum number of found paths with the number of found paths at
the 50% mark, we see that 94.06% of all potential paths can be found in about
one eighth of the time, i.e., 5.28s.</p>
      <p>Taking into account the large correlation between pathfinding time and the
number of triples in the data set, we can calculate the linear regression between
these variables, i.e., executionTime(ms) = 2.2195 × #triples + 2194.7. Given that
the test data set contained on average 54.55 triples per book, we can compute
that we can take into account at most 64 books when trying to find a path with
a response time lower than 10s. This can be done by, e.g., picking 62 books
at random from a catalog besides the two already selected books. However,
these calculations do not need to be done for every request, as the data set is
a static data source. Therefore, caching (part of) the responses could greatly improve
the response times, thus allowing StoryBlink to take more books into
account.</p>
      <p>Fig. 2. Number of found paths (#paths) and pathfinding time as a function of the
amount of considered concepts (%), i.e., the cutoff value.</p>
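      <p>The capacity figure of 64 books follows directly from the regression. A minimal Python check, assuming a response-time budget of 10s:</p>

```python
# Back-of-the-envelope check of the capacity claim, using the reported
# regression executionTime(ms) = 2.2195 * #triples + 2194.7 and the
# average of 54.55 triples per book in the test data set.
def max_books(time_budget_ms, slope=2.2195, intercept=2194.7,
              triples_per_book=54.55):
    max_triples = (time_budget_ms - intercept) / slope
    return int(max_triples / triples_per_book)

print(max_books(10_000))  # 64 books fit within a 10s response-time budget
```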
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>By using engines such as DBpedia Spotlight, it is possible to extract detected
concepts from a digital publication, and as such, provide a high-level overview
of the content of this publication. In this paper, we presented our methodology
for the automatic extraction and publication of detected concepts of a
(digital) publication.</p>
      <p>However, taking into account every detected concept harms the resulting
data set, as irrelevant concepts are also taken into account. By extracting a
simplified subset of all detection results, we were able to fuel a content-based
recommendation system based on the Everything is Connected engine. This
engine can connect digital publications with each other by trying to find relevant
yet surprising paths between two concepts. The result is a novel way of content
recommendation, no longer tied to social recommendation systems, thus avoiding
the long tail effect.</p>
      <p>The pathfinding algorithm takes on average 5.28s to find all relevant books
for a data set of 20 books, which makes our Web application usable for book
catalogs of at most 64 books. It is clear that this is not ready for real-life
catalogs. However, this problem can easily be resolved by caching the results. As
the data set is fixed and not prone to much change, this is a feasible solution.</p>
      <p>This work could be further improved as follows.
- The choice of keeping the top 50% of mentioned concepts biases the books
endpoint towards common concepts. Future work is to compare this method
with similar metrics that do not bias towards common concepts, such as
term frequency-inverse document frequency (tf-idf).
- Instead of selecting two books to find a path between them, StoryBlink could
be adapted to start from only one book. This way, StoryBlink can become a
recommendation system, suggesting linked books based on a starting book.</p>
      <p>
        This implies making changes to the Everything is Connected engine.
- DBpedia Spotlight could be replaced by other NER and NED engines, to
see how this influences the perceived quality of StoryBlink. We can evaluate
how using, e.g., Babelfy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], affects the results of StoryBlink.
- As the Triple Pattern Fragments client allows for federated querying, we
can expand the usage of StoryBlink by finding links across data sets, e.g.,
books and movies, or books and music.
      </p>
      <p>Acknowledgements The research activities described in this paper were funded
by Ghent University, iMinds, the IWT Flanders, the FWO-Flanders, and the
European Union, in the context of the project "Uitgeverij van de Toekomst"
(Publisher of the Future).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The Long Tail: Why the Future of Business is Selling Less of More</article-title>
          .
          <source>Hyperion (July</source>
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bobadilla</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hernando</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Recommender systems survey</article-title>
          .
          <source>Knowledge-Based Systems 46, 109-132 (Jul</source>
          <year>2013</year>
          ), http://www. sciencedirect.com/science/article/pii/S0950705113001044
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. <string-name><surname>Conboy</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Garrish</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Gylling</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>McCoy</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Makoto</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Weck</surname>, <given-names>D.</given-names></string-name>: <article-title>EPUB 3 Overview</article-title>. Tech. rep., <source>International Digital Publishing Forum (IDPF)</source> (June <year>2014</year>), http://www.idpf.org/epub/301/spec/epub-overview.html, accessed January 22nd, 2015
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. <string-name><surname>De Vocht</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Coppens</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Verborgh</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Vander Sande</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Mannens</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Van de Walle</surname>, <given-names>R.</given-names></string-name>: <article-title>Discovering meaningful connections between resources in the Web of Data</article-title>. In: Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M., Auer, S. (eds.) <source>Linked Data on the Web (LDOW)</source>. pp. <fpage>1</fpage>–<lpage>8</lpage>. CEUR, Rio De Janeiro, Brazil (May <year>2013</year>), http://ceur-ws.org/Vol-996/papers/ldow2013-paper-04.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. <string-name><surname>Filip</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>McCance</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Lewis</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Lieske</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Lommel</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kosek</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Sasaki</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Savourel</surname>, <given-names>Y.</given-names></string-name>: <article-title>Internationalization Tag Set (ITS) version 2.0</article-title>. Tech. rep., <source>W3C</source> (October <year>2013</year>), http://www.w3.org/TR/its20/, accessed June 16th, 2015
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. <string-name><surname>Hellman</surname>, <given-names>S.</given-names></string-name>: <article-title>NIF 2.0 core ontology</article-title>. Tech. rep., <source>AKSW, University Leipzig</source> (<year>2015</year>), http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html, accessed June 16th, 2015
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. <string-name><surname>Mendes</surname>, <given-names>P.N.</given-names></string-name>, <string-name><surname>Jakob</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>García-Silva</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name>: <article-title>DBpedia Spotlight: Shedding light on the Web of documents</article-title>. In: <source>Proceedings of the 7th International Conference on Semantic Systems</source>. pp. <fpage>1</fpage>–<lpage>8</lpage>. ACM (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. <string-name><surname>Misra</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Stokols</surname>, <given-names>D.</given-names></string-name>: <article-title>Psychological and health outcomes of perceived information overload</article-title>. <source>Environment and Behavior</source> <volume>44</volume>(<issue>6</issue>), <fpage>737</fpage>–<lpage>759</lpage> (Nov <year>2012</year>), http://eab.sagepub.com/content/44/6/737
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. <string-name><surname>Moro</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Raganato</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Navigli</surname>, <given-names>R.</given-names></string-name>: <article-title>Entity Linking meets Word Sense Disambiguation: a Unified Approach</article-title>. <source>Transactions of the Association for Computational Linguistics (TACL)</source> <volume>2</volume>, <fpage>231</fpage>–<lpage>244</lpage> (<year>2014</year>), http://wwwusers.di.uniroma1.it/~navigli/pubs/TACL_2014_Babelfy.pdf
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. <string-name><surname>Nadeau</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Sekine</surname>, <given-names>S.</given-names></string-name>: <article-title>A survey of Named Entity Recognition and Classification</article-title>. <source>Lingvisticae Investigationes</source> <volume>30</volume>(<issue>1</issue>), <fpage>3</fpage>–<lpage>26</lpage> (<year>2007</year>), http://dx.doi.org/10.1075/li.30.1.03nad
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. <string-name><surname>Presutti</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Draicchio</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Gangemi</surname>, <given-names>A.</given-names></string-name>: <article-title>Knowledge extraction based on discourse representation theory and linguistic frames</article-title>. In: <source>Knowledge Engineering and Knowledge Management</source>. pp. <fpage>114</fpage>–<lpage>129</lpage>. Springer (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. <string-name><surname>Reforgiato Recupero</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Nuzzolese</surname>, <given-names>A.G.</given-names></string-name>, <string-name><surname>Consoli</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Presutti</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Mongiovì</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Peroni</surname>, <given-names>S.</given-names></string-name>: <article-title>Extracting knowledge from text using SHELDON, a Semantic Holistic framEwork for LinkeD ONtology data</article-title>. In: <source>Proceedings of the 24th International Conference on World Wide Web Companion</source>. pp. <fpage>235</fpage>–<lpage>238</lpage>. International World Wide Web Conferences Steering Committee (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. <string-name><surname>Rizzo</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Troncy</surname>, <given-names>R.</given-names></string-name>: <article-title>NERD: Evaluating Named Entity Recognition tools in the Web of Data</article-title>. In: <source>Workshop on Web Scale Knowledge Extraction, ISWC2011</source>. Bonn, Germany (October <year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. <string-name><surname>Sorotokin</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Conboy</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Duga</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Rivlin</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Beaver</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ballard</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Fettes</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Weck</surname>, <given-names>D.</given-names></string-name>: <article-title>EPUB Canonical Fragment Identifier (epubcfi) Specification</article-title>. Tech. rep., <source>International Digital Publishing Forum (IDPF)</source> (June <year>2014</year>), http://www.idpf.org/epub/linking/cfi/epub-cfi.html, accessed June 16th, 2015
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. <string-name><surname>Usbeck</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Ngonga Ngomo</surname>, <given-names>A.C.</given-names></string-name>, <string-name><surname>Auer</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gerber</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Both</surname>, <given-names>A.</given-names></string-name>: <article-title>AGDISTIS - graph-based disambiguation of Named Entities using Linked Data</article-title>. In: <source>International Semantic Web Conference</source>. Springer (<year>2014</year>), http://svn.aksw.org/papers/2014/ISWC_AGDISTIS/public.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. <string-name><surname>Usbeck</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Röder</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ngonga Ngomo</surname>, <given-names>A.C.</given-names></string-name>, <string-name><surname>Baron</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Both</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Brümmer</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ceccarelli</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Cornolti</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Cherix</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Eickmann</surname>, <given-names>B.</given-names></string-name>, et al.: <article-title>GERBIL: General entity annotator benchmarking framework</article-title>. In: <source>Proceedings of the 24th International Conference on World Wide Web</source>. pp. <fpage>1133</fpage>–<lpage>1143</lpage>. International World Wide Web Conferences Steering Committee (<year>2015</year>), http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. <string-name><surname>Verborgh</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Hartig</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>De Meester</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Haesendonck</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>De Vocht</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Vander Sande</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Cyganiak</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Colpaert</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mannens</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Van de Walle</surname>, <given-names>R.</given-names></string-name>: <article-title>Querying datasets on the Web with high availability</article-title>. In: <source>International Semantic Web Conference 2014</source>. pp. <fpage>180</fpage>–<lpage>196</lpage>. Springer (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. <string-name><surname>Verborgh</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Mannens</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Van de Walle</surname>, <given-names>R.</given-names></string-name>: <article-title>Initial usage analysis of DBpedia's Triple Pattern Fragments</article-title>. In: <source>Proceedings of the 5th USEWOD Workshop on Usage Analysis and the Web of Data</source> (Jun <year>2015</year>), http://linkeddatafragments.org/publications/usewod2015.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>