<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LOTUS: Linked Open Text UnleaShed</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filip Ilievski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wouter Beek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marieke van Erp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurens Rietveld</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Schlobach</string-name>
          <email>k.s.schlobach@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Network Institute, VU University Amsterdam</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>It is difficult to find resources on the Semantic Web today, in particular if one wants to search for resources based on natural language</p>
      </abstract>
      <kwd-group>
        <kwd>Findability</kwd>
        <kwd>Text Indexing</kwd>
        <kwd>Semantic Search</kwd>
        <kwd>Big Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>LOTUS thus holds the potential to improve entity annotation recall by finding
resources in an extended set of knowledge sources.</p>
      <p>
        The LOTUS index focuses on the string values present in RDF statements
to create a fast and scalable index over the data in the LOD Laundromat.<sup>3</sup> This
enables easy querying of the LOD Laundromat through an API or web interface
and provides an option to search for strings in particular languages through an
associated language tag. LOTUS is a first step towards an accessible disambiguation
system over the LOD cloud, to be used for example in NLP applications that
currently suffer from the limited coverage of resources such as Wikipedia/DBpedia [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Furthermore, as LOTUS provides a link between text and documents in the LOD
cloud, Information Retrieval over the LOD cloud becomes an interesting option.
In Section 6, we discuss some use case scenarios. Our contributions are the
following:
      </p>
      <p>- A problem description of accessing textual resources on the Semantic Web
today (Section 2)
- The LOTUS system (Sections 4 and 5)
- Three use cases that showcase the potential of LOTUS (Section 6)</p>
    </sec>
    <sec id="sec-2">
      <title>Problem description</title>
      <p>How do we find relevant resources on the Semantic Web today? The Semantic
Web currently relies largely on two findability strategies: IRI dereferencing and
SPARQL query endpoints.</p>
      <p>Dereference. According to Semantic Web folklore, an IRI c should dereference
to a set of expressions Φ(c) in which that IRI appears (sometimes only in the
subject position, but this is immaterial to the current argument).</p>
      <p>Which expressions belong to Φ(c) is decided by the authority of c, i.e., the
person or organization that pays for the domain that appears in c's authority
component. Non-authoritative expressions about c, denoted Φ̄(c), are all
expressions in which c occurs (again, possibly as a subject) but which are not part of Φ(c).</p>
      <p>Findability of non-authoritative expressions occurs through an alternating
sequence of IRI dereference and graph traversal operations. Even though
expressions in Φ(c) ∪ Φ̄(c) by definition belong to the same graph component, it
may well be that no path from terms in Φ(c) to terms in Φ̄(c) exists.
Indeed, since the Semantic Web does not implement the notion of backlinks, an
architectural decision it has in common with the WWW, dereferenceability is
inherently unable to solve the findability problem.</p>
      <p>In addition to its theoretical limitations, the real-world implementation of
dereferenceability has proven to be both difficult and costly, since it requires a
web server to run on the dereferenced IRI's authority. It is not uncommon for
IRIs to be unavailable, either temporarily or permanently.</p>
      <p><sup>3</sup> The LOD Laundromat contained 38,606,408,433 triples on 6 July 2015.</p>
      <p>Since only IRIs can be dereferenced, natural language access to the Semantic
Web cannot be gained at all through dereference. It is therefore not possible
to find a resource based on the RDF literals to which it is related. It is certainly
not possible to search for resources based on keywords that only bear a close
similarity to (some of the) literals to which those resources are related.</p>
      <p>SPARQL endpoints. The second approach towards solving the findability
problem is through the use of SPARQL endpoints. When compared to
dereferenceability-based graph traversal, SPARQL endpoints provide a far more powerful approach
for finding an individual resource c based on the authoritative expressions Φ(c)
about it. As for the findability of non-authoritative expressions about c, SPARQL
endpoints have largely the same problems as the dereferenceability approach.</p>
      <p>
        While it is possible to evaluate a SPARQL query over multiple datasets,
thereby including non-authoritative expressions in Φ̄(c) as well, these datasets have to be included
explicitly by using the SERVICE keyword [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This requires that all endpoints
which disseminate non-authoritative expressions in Φ̄(c) are known in advance, making the findability
approach somewhat circular. While there are potentially many such endpoints
(depending on c), the list has to be assembled anew for each new term.
      </p>
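      <p>To make the circularity concrete, the following minimal sketch builds a federated SPARQL 1.1 query in which every endpoint has to be named in its own SERVICE block. The function name, the term IRI and the endpoint URLs are hypothetical and serve only as an illustration; the sketch does not send the query anywhere.</p>
      <preformat><![CDATA[
```python
from typing import Sequence


def build_federated_query(term_iri: str, endpoints: Sequence[str]) -> str:
    """Build a SPARQL 1.1 query that asks every listed endpoint about one IRI.

    Each endpoint contributes one SERVICE block; the blocks are joined with
    UNION so that a result from any endpoint is returned.
    """
    if not endpoints:
        raise ValueError("at least one endpoint must be named explicitly")
    blocks = [
        f"  {{ SERVICE <{endpoint}> {{ <{term_iri}> ?p ?o }} }}"
        for endpoint in endpoints
    ]
    body = "\n  UNION\n".join(blocks)
    return f"SELECT ?p ?o WHERE {{\n{body}\n}}"


if __name__ == "__main__":
    # Hypothetical term and endpoints, for illustration only.
    print(build_federated_query(
        "http://example.org/resource/c",
        ["http://example.org/sparql", "http://example.com/sparql"],
    ))
```
]]></preformat>
      <p>The endpoint list is an input to the query rather than something the query can discover, which is the limitation described above.</p>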
      <p>
        In addition to the impracticality of finding all non-authoritative endpoints
that say something about c, there is no guarantee that all non-authoritative expressions in Φ̄(c) are
disseminated by some SPARQL endpoint. Empirical studies show that SPARQL
endpoint availability is generally quite low, indicating that some
SPARQL-disseminated expressions may happen to be unavailable at query time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The SPARQL query language is largely oriented towards matching graph
patterns and datatyped values. As such, the SPARQL specification defines regular
expression-based operations on the lexical expressions of literals but does not
include string similarity matches or other more advanced NLP functionality.</p>
      <p>
        Querying multiple datasets is possible via other paradigms, such as Linked
Data Fragments [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. However, these largely share the disadvantages of SPARQL.
      </p>
      <p>Summarizing, findability is a problem on the Semantic Web today. The
findability problem will not be solved by implementing existing approaches or
standards in a better way, but requires a completely novel approach instead.</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>
        LOTUS bears much resemblance to Sindice [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], a system that allowed search on
Semantic Web documents based on IRIs and keywords that appeared in those
documents. Sindice crawled the network of dereferenceable IRIs and queryable
SPARQL endpoints to gather data documents. The contents of each document
were included in two centralized indices: one for text and one for IRIs. Sindice
also semantically interpreted inverse functional relations, e.g. mapping telephone
numbers onto individuals. Currently, LOTUS does not perform any type of
semantic interpretation, although such functionality could be built on top of it.
      </p>
      <p>There are several differences between LOTUS and Sindice. Some of these are
due to the underlying LOD Laundromat architecture and some to the LOTUS
system itself. Firstly, Sindice can relate IRIs and keywords to documents in
which the former occur. LOTUS can do much more (see Figure 1): it can relate
keywords, IRIs and documents to each other (in all directions).</p>
      <p>Secondly, Sindice requires data to adhere to the Linked Data principles.
Specifically, it requires an IRI to either dereference or be queryable in a SPARQL
endpoint. LOTUS is built on top of the LOD Laundromat, which accepts any
type of Linked Data; for example, it allows data to be submitted through Dropbox.</p>
      <p>Thirdly, LOTUS allows incorrect datasets to be partially included due to the
cleaning mechanism of the LOD Laundromat. This is an important feature since
empirical observations collected over the LOD Laundromat indicate that at least
70% of crawled data documents contain bugs such as syntax errors.</p>
      <p>
        Fourthly, since Sindice returns a list of online document links as a result, it
relies on the availability of the original data sources. While it has this in common
with search engines for the WWW, it is known that data sources on the Semantic
Web have much lower availability [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. LOTUS returns document IRIs that can
either be downloaded from their original sources or from a cleaned copy made
available through the LOD Laundromat Web Service.
      </p>
      <p>Fifthly, LOTUS operates on a much larger scale than Sindice did. Sindice
allowed 30M IRIs and 45M literals to be searched, while LOTUS allows 3,700M
IRIs and 5,320M literals, a difference in scale of more than a factor of 100.</p>
    </sec>
    <sec id="sec-4">
      <title>LOTUS</title>
      <p>The purpose of LOTUS is to relate unstructured (natural language text) data
to structured data, using RDF as the paradigm to express such structured data.
LOTUS has access to an underlying architecture that exposes a large collection of
resource-denoting terms and structured descriptions of those terms, all
formulated in RDF. It indexes natural text literals which appear in the object position
of RDF statements and allows the denoted resources to be findable based on
approximate literal matching and, optionally, an associated language tag.</p>
      <sec id="sec-4-1">
        <title>Described resources</title>
        <p>RDF defines a graph-based data model in which resources can be described in
terms of their relations to other resources. The textual labels denoting some of
these resources provide an opening to relate unstructured to structured data.</p>
        <p>An RDF statement expresses that a certain relation holds between a pair of
resources. We take a described resource to be any resource that is denoted
by at least one term appearing in the subject position of an RDF statement.</p>
        <p>LOTUS does not allow every resource in the Semantic Web to be found
through natural language search, as some described resources are not denoted
by a term that appears in the subject position of a triple whose object term
is a textual label. Fortunately, many Semantic Web resources have at least one
textual label linked to them and as the Semantic Web adheres to the Open World
Assumption, resources that have no textual description today may receive one
tomorrow, as everyone is free to add new content.</p>
      </sec>
      <sec id="sec-4-2">
        <title>RDF Literals</title>
        <p>
          In the context of RDF, textual labels appear mainly as part of RDF literals. An
RDF literal is either a pair ⟨D, LEX⟩ or a triple ⟨rdf:langString, LEX, LT⟩ [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
D is a datatype IRI denoting a datatype. LEX is a Normal Form C (NFC)
Unicode string [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. LT is a language tag identifying an IANA-registered natural
language per BCP 47 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
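        <p>The two literal shapes can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name and field names are hypothetical, and the default datatype of xsd:string follows RDF 1.1 rather than anything specific to LOTUS.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """An RDF literal: (D, LEX), or (rdf:langString, LEX, LT) when tagged."""

    lexical_form: str                # LEX
    datatype: str = XSD_STRING       # D, a datatype IRI
    langtag: Optional[str] = None    # LT, only for language-tagged strings

    def __post_init__(self) -> None:
        has_tag = self.langtag is not None
        is_langstring = self.datatype == RDF_LANGSTRING
        if has_tag != is_langstring:
            raise ValueError(
                "a language tag is present exactly when the datatype "
                "is rdf:langString"
            )


if __name__ == "__main__":
    print(Literal("Amsterdam", RDF_LANGSTRING, "nl"))
    print(Literal("42", "http://www.w3.org/2001/XMLSchema#integer"))
```
]]></preformat>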
        <p>Semantically speaking, RDF literals denote resources, similar to the way
in which IRIs denote resources. A datatype D defines the collection of allowed
lexical forms (lexical space), the collection of resources denoted by those lexical
forms (value space), a functional mapping from the former to the latter and a
non-functional mapping from the latter to the former.</p>
        <p>We are specifically interested in literals that contain natural language text.
However, not all RDF literals express, or are intended to express, natural
language text. For instance, there are datatype IRIs that describe a value space
of date-time points or polygons. Since we are working with RDF data, we cannot
rely on a whitelist of datatype IRIs in order to extract all and only natural
language texts from the LOD Cloud. Firstly, there is no fixed set of datatypes and
datatype IRIs, as datasets can define their own. Secondly, even if we settled
for a partial whitelist, we would not be able to list a collection of datatype IRIs
that denote only natural language texts. While natural language text is often
found together with the datatype IRI xsd:string, in practice we find that integers
and dates are also stored under that datatype, even though custom datatypes
for integers and dates exist. For these reasons, we select the lexical expressions
to index by means of a pattern, regardless of the datatype IRI associated with them.</p>
        <p>
          Finally, LOTUS indexes the language tags when explicitly specified by the
dataset author; no automatic language detection is performed.
LOTUS performs offline approximate string matching. Approximate string
matching [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is an alternative to exact string matching, where a given pattern is
matched to text while still allowing a number of errors. LOTUS preprocesses
text and builds the data index offline, allowing the approximation model to be
enriched with TF-IDF term weighting [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], phonetic matching, synonym
matching, and a choice of match granularity (phrase- or term-based match).
        </p>
        <p>The choice of an optimal string matching model is nontrivial and depends on
the intended application. News articles may benefit from term-based matching,
synonym matching and TF-IDF scoring, as they often deal with incomplete
entity phrases. In contrast, journal paper titles may better be matched as
a complete phrase, neglecting the TF-IDF score and synonyms of the individual
terms. Term-based matching (Q1 and Q2 in Section 5.2) is disjunctive: at least
one term should be matched. Phrase-based matching (Q3 and Q4) is conjunctive
and requires all terms to occur sequentially and in the given order.</p>
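        <p>The difference between the two granularities can be illustrated with a minimal sketch. It assumes a deliberately naive analyzer (lowercasing and whitespace splitting) and hypothetical function names; it is not how ElasticSearch analyzes or scores text, only an instance of disjunctive versus ordered conjunctive matching.</p>
        <preformat><![CDATA[
```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def term_match(pattern: str, literal: str) -> bool:
    """Disjunctive: at least one pattern term occurs in the literal."""
    literal_terms = set(tokenize(literal))
    return any(term in literal_terms for term in tokenize(pattern))


def phrase_match(pattern: str, literal: str) -> bool:
    """Conjunctive: all pattern terms occur consecutively, in the given order."""
    wanted, found = tokenize(pattern), tokenize(literal)
    if not wanted:
        return False
    width = len(wanted)
    return any(found[i:i + width] == wanted
               for i in range(len(found) - width + 1))


if __name__ == "__main__":
    literal = "De Grote Kerk"
    for pattern in ("Grote Markt", "Grote Kerk", "Kerk Grote"):
        print(pattern, term_match(pattern, literal), phrase_match(pattern, literal))
```
]]></preformat>
        <p>With the literal "De Grote Kerk", the pattern "Grote Markt" matches term-wise but not as a phrase, which mirrors the higher recall and lower precision of term-based matching.</p>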
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Implementation</title>
      <p>The LOTUS system architecture consists of two components: an Index Building
(IB) procedure and a Public Interface (PI). The role of the IB component is to
index strings from the LOD Laundromat; the role of the PI is to expose the indexed data
to users for querying. The two system components are executed sequentially: data
is first indexed, after which it can be queried through the exposed public interface.</p>
      <sec id="sec-5-1">
        <title>System Architecture</title>
        <p>Indexing of data is expensive. Hence, we initially create an index over all data
from the LOD Laundromat through a batch loading procedure, by streaming LOD
Laundromat statements through its query interface, Frank, to a client script. The
client script parses the received RDF statements and performs a bulk indexing
request in ElasticSearch,<sup>4</sup> where the textual index is built. Once the initial
indexing is finished, we only incrementally update the index when new data appears
in the LOD Laundromat, triggered by an event handler in the LOD Laundromat.</p>
        <p>We use Frank's frank documents command to enumerate all LOD
Laundromat data sets and download them sequentially in a stream (Step 1 in
Figure 2). Following the approach described in Section 4, we consider only the
statements that contain a literal as an object and use the regular expression
"/^[-\.,0-9]*$/" to filter out statements with numbers and dates as lexical
forms. The remaining statements are then parsed and the resulting data is
indexed in ElasticSearch (Step 2 in Figure 2). Each ElasticSearch entry has the
following format:</p>
        <preformat>{
  "docid": IRI,
  "langtag": STRING,
  "predicate": IRI,
  "string": STRING,
  "subject": IRI
}</preformat>
        <p>The fields "string" and "langtag" (language tag) are preprocessed
("analyzed") by ElasticSearch at indexing time, which allows for flexible, fuzzy lookup
of these fields. The motivation behind analyzing the "string" field comes
naturally, as it contains unstructured text and will rarely be queried for an exact
match. We also preprocess the language tag field: following BCP 47 semantics,<sup>5</sup>
a language tag can contain subtags, such as country codes. In order to also
retrieve the specific language tags (e.g. "en-US") when looking for general
language tags (e.g. "en"), we decided to preprocess the language tag field to allow
flexible matching. The remaining three fields can be looked up as exact matches,
as these are IRIs and contain structured text following the RDF standard.</p>
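        <p>The filtering step, the entry format and the language tag lookup described above can be sketched as follows. The regular expression and the five field names are taken from the text; the function names are hypothetical, and the prefix-based tag comparison is an assumption that approximates the described behaviour ("en" also retrieves "en-US") rather than a description of how ElasticSearch analyzes the field.</p>
        <preformat><![CDATA[
```python
import re
from typing import Dict, Optional

# Pattern quoted in the text: lexical forms made up only of digits, hyphens,
# dots and commas. Note that it also matches the empty string.
NUMERIC_OR_DATE = re.compile(r"^[-\.,0-9]*$")


def make_entry(subject: str, predicate: str, lexical_form: str,
               langtag: str, docid: str) -> Optional[Dict[str, str]]:
    """Return an index entry for a literal-valued statement, or None if skipped."""
    if NUMERIC_OR_DATE.match(lexical_form):
        return None
    return {
        "docid": docid,
        "langtag": langtag,
        "predicate": predicate,
        "string": lexical_form,
        "subject": subject,
    }


def langtag_matches(query_tag: str, entry_tag: str) -> bool:
    """True if the query tag equals the entry tag or is a subtag-wise prefix of it."""
    query = query_tag.lower().split("-")
    entry = entry_tag.lower().split("-")
    return entry[:len(query)] == query


if __name__ == "__main__":
    # Hypothetical IRIs, for illustration only.
    print(make_entry("http://example.org/s",
                     "http://www.w3.org/2000/01/rdf-schema#label",
                     "Amsterdam", "nl", "http://example.org/doc"))
    print(langtag_matches("en", "en-US"))
```
]]></preformat>
        <p>A lexical form such as 2015-07-06 is skipped by the pattern, whereas a date-time that contains letters or colons is not, so the filter is a heuristic rather than an exact test for numbers and dates.</p>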
        <sec id="sec-5-1-1">
          <title>API</title>
          <p>Users can search the underlying data through an API. The usual query flow is
described in steps 3-6 of Figure 2. Our NodeJs<sup>6</sup> interface to the indexed data
currently exposes four matching functions: two term-based and two phrase-based,
which rely on the underlying ElasticSearch string matching and scoring techniques.<sup>7</sup></p>
          <p><sup>4</sup> https://www.elastic.co/products/elasticsearch</p>
          <p><sup>5</sup> https://tools.ietf.org/html/bcp47</p>
          <p><sup>6</sup> https://nodejs.org</p>
          <p><sup>7</sup> A detailed description of the theoretical basis for matching and scoring candidates in
ElasticSearch is available at https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html</p>
          <table-wrap>
            <label>Table 1</label>
            <table>
              <tbody>
                <tr><td>total # literals encountered</td><td>12,018,939,378</td></tr>
                <tr><td># integers and dates</td><td>6,699,148,542</td></tr>
                <tr><td># indexed entries (= # string literals)</td><td>5,319,790,836</td></tr>
                <tr><td># distinct sources</td><td>508,244</td></tr>
                <tr><td># distinct language tag terms</td><td>713</td></tr>
                <tr><td># hours to create the index</td><td>56</td></tr>
                <tr><td>disk space used</td><td>484.77 GB</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>-Q1: terms(pattern, size): Disjunctive lookup of a set of terms (supplied via the
pattern parameter) occurring in the string field of an entry. The candidate score
is proportional to the number of terms spotted. The best size hits are returned.</p>
          <p>-Q2: langterms(pattern, size, langtag): Term-based query, with a langtag
value supplied as a preference: the hits which contain the preferred language
tag receive a higher score.</p>
          <p>-Q3: phrase(pattern, size): Matches a phrase pattern occurring in the string
field of an entry as a whole. The best size hits are returned.</p>
          <p>-Q4: langphrase(pattern, size, langtag): Phrase-based query, with a langtag
value expressed as a preference: the hits which match langtag are ranked higher.</p>
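          <p>One plausible way to express Q1 to Q4 as ElasticSearch request bodies is sketched below. This is an assumption for illustration, not the published LOTUS code: the helper name and its parameters are hypothetical, and only the standard query DSL clauses match, match_phrase and bool (with must and should) are used. The sketch only builds the request body and does not contact a server.</p>
          <preformat><![CDATA[
```python
from typing import Any, Dict, Optional


def build_query(pattern: str, size: int, phrase: bool = False,
                langtag: Optional[str] = None) -> Dict[str, Any]:
    """Build an ElasticSearch request body for one of Q1-Q4.

    phrase=False, langtag=None  -> Q1 terms
    phrase=False, langtag given -> Q2 langterms
    phrase=True,  langtag=None  -> Q3 phrase
    phrase=True,  langtag given -> Q4 langphrase
    """
    # "match" is disjunctive over the analyzed terms by default;
    # "match_phrase" requires the terms consecutively and in order.
    clause = "match_phrase" if phrase else "match"
    bool_query: Dict[str, Any] = {"must": [{clause: {"string": pattern}}]}
    if langtag is not None:
        # A "should" clause is optional: it raises the score of hits that
        # carry the preferred language tag without excluding the others.
        bool_query["should"] = [{"match": {"langtag": langtag}}]
    return {"size": size, "query": {"bool": bool_query}}


if __name__ == "__main__":
    print(build_query("grote kerk", 10))                            # Q1
    print(build_query("grote kerk", 10, phrase=True, langtag="nl"))  # Q4
```
]]></preformat>
          <p>Modelling the language tag as a should clause keeps it a preference, as in Q2 and Q4, rather than a filter.</p>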
          <p>LOTUS is also available as a web interface at http://lotus.lodlaundromat.org/
for simple exploration of the data. The code of the API functions and the data
from the use cases in Section 6 are available on GitHub.<sup>8</sup></p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Indexed Data</title>
        <p>Statistics over the indexed data are presented in Table 1. We encountered over
12 billion literals in the LOD Laundromat. We filtered out all number and date literals,
which amount to 55.7% of the total. The initial LOTUS index was
created in 56 hours and occupies 484.77 GB of disk space. The current
index consists of 5.3 billion entries, coming from 508,244 distinct sources.<sup>9</sup></p>
        <p>There are 713 distinct terms occurring in the langtag field. Figure 3 presents
the proportion of the 10 most frequent language tags. "en" is by far the most
frequently encountered tag: 1,049,037,147 literals have been tagged with an English
language tag, followed by 165,996,755 German tags and 149,507,401 French tags.
Figure 3 also shows the proportion of the 10 most popular languages with respect
to the overall set of 5.3 billion literals: most of the literals have no language tag
assigned to them.</p>
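        <p>The percentages in this section follow directly from the reported counts. The short check below recomputes them; the constants are the figures quoted above, while the variable and function names are merely illustrative.</p>
        <preformat><![CDATA[
```python
# Figures quoted from Table 1 and the surrounding text.
TOTAL_LITERALS = 12_018_939_378
NUMBERS_AND_DATES = 6_699_148_542
INDEXED_ENTRIES = 5_319_790_836
ENGLISH_TAGGED = 1_049_037_147


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


filtered_share = percentage(NUMBERS_AND_DATES, TOTAL_LITERALS)
english_share = percentage(ENGLISH_TAGGED, INDEXED_ENTRIES)
remaining = TOTAL_LITERALS - NUMBERS_AND_DATES

if __name__ == "__main__":
    print(f"filtered out: {filtered_share}% of all literals")
    print(f"left to index: {remaining:,} literals")
    print(f"tagged 'en': {english_share}% of indexed literals")
```
]]></preformat>
        <p>The literals that remain after filtering equal the number of indexed entries, and English-tagged literals make up roughly a fifth of the index.</p>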
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Use cases</title>
      <p><sup>8</sup> https://github.com/filipdbrsk/LOTUS_Search/</p>
      <p><sup>9</sup> The number of different source documents in LOTUS is lower than the overall
number of sources in the LOD Laundromat, as not every source document contains string
literals.</p>
      <sec id="sec-6-1">
        <p>
          To illustrate the need for access to multiple datasets, we perform a small recall
evaluation on a standard benchmark dataset, namely the CoNLL/AIDA named
entity linking benchmark dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We also present two domain-specific use
cases, namely Local monuments and Scientific journals.
        </p>
        <p>For each use case scenario, we gathered a set of entities and queried each entity
against LOTUS. We also counted the number of entities without results and the
proportion of DBpedia resources in the first 100 candidates, as a comparison to
the (currently) most popular knowledge base. We then inspected a number of
query results to assess their relevance to the search query. The obtained
quantitative results are presented in Table 2. In the remainder of this section we detail
the specifics of each use case.</p>
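        <p>The two quantities mentioned above can be computed along the following lines. This is a sketch under stated assumptions: the function names and the input structure are hypothetical, a DBpedia resource is recognized by its host name, and the candidates of all entities are pooled, which may differ from how the figures in Table 2 were aggregated.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Union
from urllib.parse import urlparse


def is_dbpedia(iri: str) -> bool:
    """Treat any IRI hosted on dbpedia.org or a subdomain as a DBpedia resource."""
    host = urlparse(iri).hostname or ""
    return host == "dbpedia.org" or host.endswith(".dbpedia.org")


def summarize(results: Dict[str, List[str]],
              top_n: int = 100) -> Dict[str, Union[int, float]]:
    """Count entities without hits and the DBpedia share among top-N candidates."""
    no_result = sum(1 for hits in results.values() if not hits)
    top = [iri for hits in results.values() for iri in hits[:top_n]]
    dbpedia = sum(1 for iri in top if is_dbpedia(iri))
    share = dbpedia / len(top) if top else 0.0
    return {"entities": len(results), "no_result": no_result,
            "dbpedia_share": share}


if __name__ == "__main__":
    # Hypothetical ranked candidate lists, keyed by entity phrase.
    print(summarize({
        "entity a": ["http://dbpedia.org/resource/A", "http://linkedgeodata.org/x"],
        "entity b": [],
    }))
```
]]></preformat>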
        <sec id="sec-6-1-1">
          <title>CoNLL/AIDA</title>
          <p>
            The CoNLL/AIDA dataset[
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] is an extension of the CoNLL 2003 Named Entity
Recognition Dataset [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] to also include links to Wikipedia entries for each entity.
7,112 of the entity phrases in CoNLL/AIDA have no DBpedia entry. We removed
the duplicate entities in each article, providing us with 5,628 entity mentions.
We focus on these to show the impact of having access to multiple datasets.
          </p>
          <p>We suspect that the growth in DBpedia since the release of this dataset
has improved recall on the named entities, but there is still a benefit in using
multiple data sources. For smaller locations, such as the "Chapman Golf Club",
relevant results are found in, for example, http://linkedgeodata.org/About.
Also, the fact that the different language versions of DBpedia are plugged in
helps in retrieving results from localised DBpedias, such as for "Ismaïl Boulahya",
a Tunisian politician described in http://fr.dbpedia.org/resource/Isma%C3%AFl_Boulahya.
Some of the retrieval is hampered by newspaper typos,
such as "Allan Mullally" ("Alan Mullally" is the intended entity).</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>Local Monuments Guided Walks</title>
          <p>The interest in applications such as Historypin (http://www.historypin.org)
or the Rijksmuseum API (https://www.rijksmuseum.nl/en/api) shows that
there are interesting use cases in cultural heritage and local data. To explore
the coverage of this domain in the LOD Laundromat, we created the local
monuments dataset by downloading a set of guided walks from the Dutch website
http://www.wandelnet.nl. We specifically focused on the tours created in
collaboration with the Dutch National Railways, as these often take walkers through
cities and along historic and monumental sites. From the walks 'Amsterdam
Westerborkpad' and 'Mastbos Breda', a human annotator identified 112 and 79
entities respectively. These are mostly monuments such as 'De Grote Kerk' (The
big church) or street names such as 'Beukenweg' (Beech street).</p>
          <p>We manually inspected the top 10 results for a number of queries. Here we
find that the majority of the highest ranking results still come from
DBpedia. However, when no DBpedia link is available, often a resource from the
Amsterdam Museum (http://semanticweb.cs.vu.nl/lod/am/) or Wikidata
(http://www.wikidata.org) is retrieved. The focus on entertainment in
DBpedia is also shown here for the query 'Jan Dokter', the person who first walked the
route to commemorate his family that died in WWII. 'Jan' is a very common
Dutch first name, and 'Dokter' means 'doctor', which leads to many results
about characters in Dutch and Flemish soap operas who happen to be doctors.
This expresses a need for allowing more context to be brought into the search
query to filter results better.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>Scientific Journals</title>
          <p>Whitelists (and blacklists) of scientific journals are used by many institutions to
gauge the output of their researchers. They are also used by researchers interested
in scientific social networks. One such list is made publicly available by the
Norwegian Social Science Data Services website (http://www.nsd.uib.no/).
Their level 2 publishing channel contains 231 journal titles. The majority of
these titles are in English, but the list also contains some German and Swedish titles,
which precludes the use of the language tag in querying.</p>
          <p>As the queries are generally longer and contain more context-specific terms
such as "journal", "transactions", "methods", and "association", the query
results are generally more relevant and fewer come from DBpedia. Instead, a
large part of the query results come from sources such as ZDB
(http://dispatch.opac.dnb.de/LNG=DU/DB=1.1/), the UK's 2001 Research Assessment Exercise as exposed
through RKB Explorer (http://rae2001.rkbexplorer.com/), Lobid
(http://lobid.org/) and again Wikidata. The more generic titles, such as
"Transportation", yield, as expected, more generic results.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>Section 4 points out that no single string matching strategy can be expected to fit
all use cases. Term-based string matching is more flexible and yields more results
than phrase-based matching. The right trade-off between these two strategies is
application-dependent and should be further investigated.</p>
      <p>In our use cases, we were able to find resources for the majority of the entities,
but many of the results still come from generic data sets such as DBpedia. Still,
the different use cases show that this proportion differs per domain, opening
up new perspectives and challenges for application areas such as Named Entity
Disambiguation and Information Retrieval.</p>
      <p>Finally, LOTUS currently lacks integration of structured and unstructured
data. We allow a transition from natural language text to literals, documents
and resources, but relations between the structured data are currently missing,
which prevents the user from knowing which of the retrieved resources are identical,
similar or share context in a certain sense.</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented the LOTUS system, a full text look-up application
over the LOD Laundromat Linked Open Data collection. We detailed the specific
difficulties in accessing textual content in the LOD cloud, and demonstrated the
potential of LOTUS in three small use case scenarios.</p>
      <p>We expect LOTUS to grow with its applications, in particular named entity
disambiguation, wikification and information retrieval. First, we foresee
the need for complementing the indexed data with implicit information about
the literals, such as the associated language tag (when missing) and topic domain.</p>
      <p>
        We also plan to expand LOTUS' search functionality. This can be achieved by
introducing logical operators, such as negation, and fuzzy matching (for instance,
based on Levenshtein edit distance [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Functionality to exclude or favor specific
knowledge bases or data sources may be added in a next version of LOTUS.
Certain use cases would also benefit from integration of LOTUS with structured
data, for instance via SPARQL endpoints.
      </p>
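      <p>As an illustration of the kind of fuzzy matching envisaged here, the sketch below computes the Levenshtein edit distance with the standard dynamic programming recurrence. It is a textbook formulation with a hypothetical function name, not part of the current LOTUS implementation.</p>
      <preformat><![CDATA[
```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Allan Mullally", "Alan Mullally"))
```
]]></preformat>
      <p>The newspaper typo from Section 6.1, "Allan Mullally" for "Alan Mullally", is a single deletion away from the intended name, so a small distance threshold would already recover it.</p>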
      <p>Finally, an in-depth evaluation of performance (scalability and response time)
and usability of results is desired. Evaluation can, for example, be done by
comparing LOTUS to standard WWW search engines and restricting the results
to filetype:rdf.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The research for this paper was supported by the European Union's 7th
Framework Programme via the NewsReader Project (ICT-316404) and the Netherlands
Organisation for Scientific Research (NWO) via the Spinoza fund.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Beek</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rietveld</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bazoobandi</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wielemaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>LOD Laundromat: a uniform way of publishing other people's dirty data</article-title>
          .
          <source>In: ISWC</source>
          <year>2014</year>
          , pp.
          <volume>213</volume>
          -
          <issue>228</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Buil-Aranda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandenbussche</surname>
          </string-name>
          , P.Y.:
          <article-title>SPARQL web-querying infrastructure: Ready for action?</article-title>
          <source>In: ISWC</source>
          <year>2013</year>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanthaler</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>RDF 1.1 concepts and abstract syntax (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whistler</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Unicode normalization forms</article-title>
          (
          <year>August 2012</year>
          ), http://www.unicode.org/reports/tr15/tr15-37.html
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hoffart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yosef</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordino</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , Furstenau, H.,
          <string-name>
            <surname>Pinkal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spaniol</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taneva</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thater</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Robust disambiguation of named entities in text</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>782</fpage>
          –
          <lpage>792</lpage>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>An Empirical Survey of Linked Data Conformance</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>14</volume>
          ,
          <fpage>14</fpage>
          –
          <lpage>44</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization</article-title>
          .
          <source>Tech. rep., DTIC Document</source>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kittur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>What's in Wikipedia?: mapping topics and conflict using socially annotated category structure</article-title>
          .
          <source>In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'09)</source>
          . pp.
          <fpage>1509</fpage>
          –
          <lpage>1512</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          :
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>In: Soviet Physics Doklady</source>
          . vol.
          <volume>10</volume>
          , pp.
          <fpage>707</fpage>
          –
          <lpage>710</lpage>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Silva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia Spotlight: Shedding Light on the Web of Documents</article-title>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          .
          <source>7th International Conference on Semantic Systems. ACM</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Moro</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raganato</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navigli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Entity linking meets word sense disambiguation: a unified approach</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>2</volume>
          ,
          <fpage>231</fpage>
          –
          <lpage>244</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Navarro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A guided tour to approximate string matching</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <fpage>31</fpage>
          –
          <lpage>88</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Tags for identifying languages</article-title>
          (
          <year>September 2009</year>
          ), http://www.rfc-editor.org/info/rfc5646
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buil-Aranda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>SPARQL 1.1 Federated Query</article-title>
          (
          <year>2013</year>
          ), http://www.w3.org/TR/sparql11-federated-query/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rizzo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>NERD: a framework for unifying named entity recognition and disambiguation extraction tools</article-title>
          .
          <source>In: Proceedings of EACL 2012</source>
          . pp.
          <fpage>73</fpage>
          –
          <lpage>76</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tjong Kim Sang</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meulder</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</article-title>
          .
          <source>In: Proceedings of CoNLL-2003</source>
          . pp.
          <fpage>142</fpage>
          –
          <lpage>147</lpage>
          . Edmonton, Canada (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tummarello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Sindice.com: Weaving the open linked data</article-title>
          .
          <source>In: Proceedings ISWC'07/ASWC'07</source>
          . pp.
          <fpage>552</fpage>
          –
          <lpage>565</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coppens</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Web-scale querying through linked data fragments</article-title>
          .
          <source>In: Proceedings of the 7th Workshop on Linked Data on the Web</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>