<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mapping Natural Language Labels to Structured Web Resources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <email>basile@di.unito.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Cabrio</string-name>
          <email>elena.cabrio@unice.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabien Gandon</string-name>
          <email>fabien.gandon@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debora Nozza</string-name>
          <email>debora.nozza@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Université Côte d'Azur, Inria, CNRS, I3S</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>63</fpage>
      <lpage>75</lpage>
      <abstract>
<p>Mapping natural language terms to a Web knowledge base enriches information systems, without additional context, with new relations and properties from the Linked Open Data. In this paper we formally define this task, which is related to word sense disambiguation, named entity recognition and ontology matching. We provide a manually annotated dataset of labels linked to DBpedia as a gold standard for evaluation, and we use it to experiment with a number of methods, including a novel algorithm that leverages the specific characteristics of the mapping task. The empirical evidence confirms that general term mapping is a hard task that cannot be easily solved by applying existing methods designed for related problems. However, incorporating NLP ideas such as representing the context and a proper treatment of multiword expressions can significantly boost the performance, in particular the coverage of the mapping. Our findings open up the challenge to find new ways of approaching term mapping to Web resources and bridging the gap between natural language and the Semantic Web.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Words, labels4, multiword expressions and keywords are used, among other
things, to summarize the topic of articles, to index documents, to improve search,
to organize collections and to annotate content in social media. Because of their
ubiquity, disambiguating and improving the processing of natural language terms
has an immediate and important impact on the performances of many
information systems.</p>
      <p>
        Being able to map sets of terms to linked data resources contributes to creating
new interoperable resources and to transferring knowledge across different
applications. Take for example a large, carefully crafted ontology such as KnowRob [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
a framework for robotic perception and reasoning. Despite its wide usage among
robotic applications, the basic concepts of KnowRob (objects, places, actions)
are labeled by arbitrary strings (keywords), isolated from general Web of Data
knowledge bases. (Throughout the paper, the words label and term are used somewhat
interchangeably, although we typically refer to labels as terms attached to other
entities, e.g., images.) Such a mapping would enrich the original resource with new
relations and properties, and the Linked Data cloud with new, often carefully crafted
information. Indeed, recent work in robotics highlights the need for linked data
resources as a source of background knowledge for domestic robots [
        <xref ref-type="bibr" rid="ref16 ref6">6, 16</xref>
        ].
      </p>
      <p>In this work, we explore the task of linking sets of labels from an arbitrary
resource to a structured knowledge base, a problem we consider at the crossroads
between word sense disambiguation and ontology matching. The contributions
of this paper are: i) a formal definition of the term mapping task, ii) a use
case scenario, where the labels of a resource from computer vision are linked
to a general-purpose Web resource, iii) a large, manually annotated dataset of
objects and locations linked to DBpedia, and iv) a novel method for solving the
mapping task and a benchmark for its evaluation.</p>
      <p>The problem is related to entity linking, that is, the task of detecting entities
in a segment of natural language text and linking them to an ontology. The
main differences are that 1) the terms to link are already given, and 2) there is
no context for the terms to disambiguate, which are instead given simply as a
list. With respect to the second point, we can alternatively state that the set of
keywords is itself the context, in the sense that it could give, as a whole, helpful
hints for the disambiguation of the single keywords.</p>
      <p>Formally, the problem is defined as follows.
Given an input set of terms K = {k1, ..., kn} and a target knowledge base S ⊆
(R ∖ L) × P × R, where R is a set of resources, P ⊆ R is the subset of properties,
and L ⊆ R is the subset of literals, the goal of the task is that of defining a
mapping function f : K → R.</p>
      <p>Two constraints can be optionally posed on the target mapping function: a
total function, that is, defined on the entire input set, would yield a mapping
where every input term is associated to a resource in the target knowledge base.
This could be useful in scenarios where the robustness of the mapping is more
important than its accuracy. One could also want the mapping function to be
injective, that is, no pair of distinct input terms is mapped to the same resource,
for example in applications where it is known a priori that the terms refer to
distinct entities.</p>
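      <p>The two optional constraints can be checked mechanically. A minimal sketch follows (the names and the dictionary representation of f are illustrative, not part of the paper):</p>

```python
# Sketch: a candidate mapping f : K -> R is represented as a dict from
# terms to resource URIs, with None marking an unmapped term.

def is_total(mapping, terms):
    """A total mapping assigns a resource to every input term."""
    return all(mapping.get(t) is not None for t in terms)

def is_injective(mapping):
    """An injective mapping never sends two distinct terms to one resource."""
    linked = [r for r in mapping.values() if r is not None]
    return len(linked) == len(set(linked))

terms = ["egg", "eggs", "folding_chair"]
mapping = {"egg": "dbr:Egg", "eggs": "dbr:Egg", "folding_chair": None}
print(is_total(mapping, terms))   # False: folding_chair is unmapped
print(is_injective(mapping))      # False: egg and eggs collide on dbr:Egg
```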
      <p>Depending on the application scenario, it also makes sense to constrain the set
of candidate resources in the target knowledge base. For example, we may want
to link a set of terms only to instances, rather than classes, or just properties,
leaving out other types of resources.</p>
      <p>The rest of the paper is structured as follows. We first give an overview of
problems and methods related to the term mapping task (Section 2), then we
introduce a number of methods to solve it (Section 3). We introduce a relevant
use case and test the approaches on a newly created benchmark (Section 4).
Finally, we discuss the results (Section 5) and lay down plans to approach term
mapping to Web resources in future work (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Linking terms to Web resources may find its application in several tasks.
Augenstein et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] explicitly approach the task of "mapping keywords to Linked
Data resources", with the main goal of producing better queries on linked data
resources. In their work, the authors propose a supervised method to map the
keywords in natural language queries to classes in linked data ontologies. This
work served as inspiration for us to formalize the term mapping problem, which
differs in its generality: the keywords are mapped to arbitrary linked data
resources and not only to classes. Freitas et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] present an overview of approaches
to querying linked data, where the subtask of entity reconciliation is basically
a term mapping step, and two main families of approaches are studied,
respectively coming from information retrieval and natural language processing. The
former solutions exploit linked data relations such as owl:sameAs to facilitate
the mapping of disambiguated keywords, while the latter approaches leverage
lexical resources and their network structure to link words to semantic entities.
      </p>
      <p>
        When dealing with flat lists of terms, a relevant task is that of inferring some
kind of structure among them. This problem can be approached by linking terms
to classes in an ontology first, in order to exploit the relations between classes.
Limpens [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], for instance, reports the need of solving term mapping-related issues
such as accounting for spelling variations and measuring the semantic similarity
of tags from a folksonomy in order to link them to an ontology on the Web.
Methods for solving such intermediate tasks, e.g., as described in Specia and
Motta [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Damme et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], could be directly integrated into a method
for term mapping. Meng [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] tackles a variant of our problem in the process of
inferring the topics of interests of online communities starting from the
folksonomy they produce. In this version of the task, natural language descriptions are
provided for the input keywords, as well as for the target entities, therefore they
can be exploited for the alignment. In particular distributional semantic models
of words are used to facilitate the mapping of keywords to entities based on their
descriptions. Similarly, Nooralahzadeh et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] rely on the knowledge graph
of the target resource to perform named entity recognition and linking.
      </p>
      <p>Whether we consider keyword-based querying, semantic labeling, ontology
matching, word sense disambiguation, or related tasks, the key difference with
general term mapping is that the former problems are more specific with respect
to the resources involved, and more task-oriented than the latter.</p>
    </sec>
    <sec id="sec-3">
      <title>Mapping Terms to DBpedia</title>
      <p>We propose a series of approaches to map a set of terms to DBpedia
(https://dbpedia.org/), a large knowledge base obtained by automatically parsing
semi-structured sections of pages of Wikipedia (https://wikipedia.org/). While our
problem's formulation is agnostic with respect to the knowledge base target of the
mapping, some of the features of DBpedia</p>
      <sec id="sec-3-1">
        <p>enable us to experiment with the methods described in this section. Moreover,
DBpedia is a very large and open-domain knowledge base, and due to its high
connectivity rate to other resources, its position in the Linked Data cloud
(https://lod-cloud.net/) is essentially that of a central hub.</p>
        <sec id="sec-3-1-1">
          <title>DBpedia lookup.</title>
          <p>The DBpedia project provides a lookup service for keywords
(http://wiki.dbpedia.org/projects/dbpedia-lookup). Querying the
REST Web service with a keyword returns a list of candidate entities ordered
by refCount, a measure of the commonness of the entity based on the number of
inward links to the resource page in Wikipedia. The candidates are selected by
matching the input keyword with the label of a resource, or with an anchor text
that was frequently used in Wikipedia to refer to a specific resource. For term
mapping, we query the DBpedia lookup service with each input term separately
(normalizing the keywords by removing affixes and replacing underscores with
whitespace) and retrieve the URI of the first result in the response, that is, the
resource with the highest refCount.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>String Match.</title>
          <p>We implemented an alternative algorithm based on string matching. For each
input term, if an entity having a matching label (with corrected capitalization)
is returned from the DBpedia API, then this entity is returned; otherwise, no
label is returned. For instance, given the input keyword hay bale, we perform an
HTTP request to the URL http://dbpedia.org/data/Hay_bale and check whether the
resource exists in DBpedia.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>String Match+Redirect.</title>
          <p>We also report the performance of the string matching method with the added
feature of following the redirection links provided by the resource. For instance,
"steel chair" matches dbr:Steel_chair, which in turn redirects to dbr:Folding_chair.
The redirection mostly (but not exclusively) helps in cases where there is
lexical or morphological variation such as plural forms, e.g., dbr:Eggs redirects to
dbr:Egg.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Babelfy.</title>
          <p>
            Finally, we tested the performance of a state-of-the-art algorithm for word sense
disambiguation and entity linking, Babelfy [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. Given an input text, Babelfy
extracts the spans of text which are most likely to be entities and concepts. For
each of these fragments (single words or multi-word expressions), Babelfy
generates a list of possible entries according to the semantic network of BabelNet [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>We query the Babelfy service using all the terms together, separated by
commas. Partial matches, e.g., https://en.wikipedia.org/wiki/Clothing and
https://en.wikipedia.org/wiki/Horse for the keyword clothes horse, are discarded.</p>
        <sec id="sec-3-2-1">
          <title>Modeling the keywords' context with vectors</title>
          <p>By analyzing the output of the systems described previously on a pilot test,
we identified two main avenues for improvement. Firstly, we noted that the vast
majority of the missed terms are composed of more than one word; we therefore
decided to implement an algorithm that explicitly tackles this issue by splitting
the multi-word expressions and searching for entities in DBpedia based on the
single words.</p>
          <p>The second improvement comes from the observation of the main flaw of the
string match method, that is, the disambiguation of each keyword in isolation.
As stated in the task definition, this kind of term mapping differs from
standard word sense disambiguation because of the lack of context to inform the
disambiguation process. However, we can make the assumption that the set of
keywords is its own context, i.e., that the disambiguation of one keyword provides
useful information to disambiguate the other keywords. We therefore test this
assumption by encoding it into our novel method for term mapping.</p>
          <p>The algorithm, henceforth called Vector-based Contextual Disambiguation
(VCD), works on top of the string match method; that is, we first run the string
match-based algorithm (including the redirection) and save its output, and then
run the new algorithm only on the terms that have not been linked in the
first step. Formally, the first step consists of running the string match method
described in Section 3.2 on the input set of terms K, and extracting the set of
linked entities L = SM(K). Each term for which an entity is not found in this
step is split into the words that compose it, e.g., basket of fruit → [basket,
of, fruit]:</p>
          <p>W = w1, ..., wn = split(ki)
For each word, a new entity is retrieved with the string match method as before,
e.g., [basket, of, fruit] → [dbr:Basket, dbr:Fruit]:</p>
          <p>E = e1, ..., em = {SM(wi) | wi ∈ W, SM(wi) ≠ nil}
Note that the number of retrieved entities (m) could be lower than the number
of words (n); for instance, in this example there could be no entity for the word
of. For each of the new entities, their semantic similarity is computed with all
the entities that have been previously recognized, and the average is taken (we
also test a variant where the maximum similarity (MAX) is computed instead of
the average similarity (AVG)):
agg_sim_AVG(ej, L) = (1/|L|) Σ_{l ∈ L} sim(ej, l)</p>
          <p>
This step of the VCD algorithm presupposes a way of computing the semantic
similarity between pairs of entities in the target resource. Depending on the
target resource, such a measure could be already defined, or it could be computed
based on lexical and structural properties of the knowledge base. For DBpedia,
we rely on the vector space model NASARI [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], in order to obtain pairwise
semantic similarity from vectorial representations of the concepts in DBpedia.
NASARI is an attempt at building a vector space of concepts starting from the
word sense disambiguation of natural language text. The NASARI approach
collects co-occurrence information of concepts from Wikipedia and then applies
a cluster-based dimensionality reduction. The context of a concept is based on
the set of Wikipedia pages where a mention of it is found. The final result is a
vector space of 4.5 million 300-dimensional dense vectors associated with BabelNet
synsets and DBpedia resources. Given two DBpedia resources, we can compute
their semantic similarity as the cosine similarity between their corresponding
vectors in NASARI.
          </p>
          <p>Finally, the entity ej with the highest aggregate similarity with the set L
of previously disambiguated entities is selected as the disambiguation of the
original term. Optionally, a threshold (T) can be imposed on the aggregate
similarity score, to avoid linking to entities for which even the highest similarity
with the set is still low. This allows us to control the balance between precision
and recall, with a lower threshold producing an output for more keywords at the
cost of a lower average precision.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        As a use case to experiment with the proposed task, we identify the problem of
linking a large database of labeled images to DBpedia. Images come from work in
computer vision, and the labels describe relations between objects and locations.
Extracting information of this kind is the goal of recent work in information
extraction and knowledge base building [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where authors create a resource of
objects and their typical locations by extracting common knowledge from text.
      </p>
      <p>In this section, we describe the large-scale resource for computer vision on
which our use case is based, the gold standard dataset we built, and the results
we obtained comparing the performances of the methods described in Section 3 on
the proposed use case.</p>
      <sec id="sec-4-1">
        <title>Data</title>
        <p>
          The SUN database [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is a large-scale resource for computer vision and object
recognition in images. It comprises 131,067 single images, each of them
annotated with a label for the type of scene, e.g., bedroom or dining room, and labels
for each object identified in the scene, e.g., wall or bed. The key pieces of
information that make the SUN database valuable for applications in robotics and AI
are the set of concepts categorized as objects, the set of concepts categorized as
scenes, and the implicit relation locatedIn between object-scene pairs. However,
the concepts in the SUN database are expressed as arbitrary labels, isolated
from Linked Data. Figure 1 shows a screenshot from the SUN database Web
interface, displaying information relative to an object and its related scenes in
the database. The images have been annotated with 908 categories based on
the type of scene (bedroom, garden, airway, ...). Moreover, 313,884 objects were
recognized and annotated with one out of 4,479 category labels.
        </p>
        <p>
          Despite the great amount of work that
went into the creation of the SUN database,
its applicability to fields related to, but
distinct from, computer vision is hindered by
the fact that the set of labels is specific to the
resource. To be fair, the creators used the
dictionary of lemmas from WordNet [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to choose
the single- or multi-word labels to annotate
scenes and objects. However, the labels
themselves are not disambiguated, thus they are
not directly mapped to any existing resource
to promote interoperability.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Gold Standard</title>
        <p>In order to assess the difficulty of mapping
the SUN labels to DBpedia, and to test the
performance of different solutions, a collection of
ground truth facts is needed, that is, a set of
terms correctly linked to the knowledge base.
This set will form the gold standard against
which baseline methods and further refined
approaches will be tested. (Fig. 1: portion of the SUN database Web interface
showing an object (chair) followed by the most frequent scenes associated to it
and a set of segmented instances.)</p>
        <p>We employed the popular crowdsourcing
platform Figure Eight (http://www.figure-eight.com) to ask paid
contributors to manually annotate the object and
scene labels from the SUN database. The labels are lowercase English words
separated by underscores, e.g., stand_clothes, deer, high-heeled_shoes. About
49% of the labels in both sets are multi-word expressions. The scene
labels are prefixed by the starting letter (presumably for the organization of
the dataset) and may contain an optional specification after a slash, e.g.,
n/newsstand/outdoor or b/bakery/shop.</p>
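        <p>Cleaning the raw labels before annotation can be sketched as follows (the helper names are illustrative, not from the paper):</p>

```python
# Small sketch of parsing the raw SUN labels: scene labels carry a
# one-letter prefix and an optional specification after a slash, while
# object labels just use underscores as separators.
def clean_scene_label(raw):
    parts = raw.split("/")              # e.g., ['n', 'newsstand', 'outdoor']
    name = parts[1] if len(parts) > 1 else parts[0]
    spec = parts[2] if len(parts) > 2 else None
    return name.replace("_", " "), spec

def clean_object_label(raw):
    return raw.replace("_", " ")

print(clean_scene_label("n/newsstand/outdoor"))  # ('newsstand', 'outdoor')
print(clean_object_label("stand_clothes"))       # stand clothes
```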
        <p>The task we designed is that of associating a valid URL from the English
Wikipedia (http://en.wikipedia.org) to each SUN label. This process involved
looking up Wikipedia, searching the Web for content related to the keyword, and
making non-trivial assignments, to ultimately pair the labels with DBpedia
entities. For simple instances, linking a term to a DBpedia URI is as easy as
checking the page with a matching label or a closely related one, or following a
redirection, for instance in the cases of differences in spelling such as British
vs. American English. A number of cases, however, are not trivial, mostly due to
specific concepts being absent from the target knowledge base. To overcome these
difficulties that are inherent to the task, we provided a detailed set of
guidelines and tips to the annotators, depicted in Figure 2.</p>
        <p>Please strictly use URLs from the English Wikipedia: https://en.wikipedia.org/wiki/...</p>
        <p>Search the page on your favorite search engine, limiting the search result
to Wikipedia EN. For instance, on Google you can use a search string like
"site:en.wikipedia.org KEYWORD".</p>
        <p>You can also search Wikipedia directly using the search box at the top of
https://en.wikipedia.org/wiki/Main_Page.</p>
        <p>Some keywords will be trivial to match with a Wikipedia page, while others will
be more difficult, for instance because a page that matches the keyword exactly
does not exist in Wikipedia. The following guidelines will help the task in such
cases:
- If a matching Wikipedia page cannot be found, try looking for a slightly
different one.
- When facing a choice, try linking to a page that describes the same kind
of concept as the keyword, e.g., parade ground → pavement rather than
parade ground → parade.
- Avoid linking to specific individuals, such as names of people, cities, ...
- Always look for alternatives, even if the keyword has a directly
corresponding Wikipedia page.
- There could be misspelled keywords, orthographical variations (e.g., plurals)
or different spellings (e.g., British vs. American English). In such cases,
mentally normalize the keyword and consider the singular form.
- Avoid Wikipedia disambiguation pages.</p>
        <p>We collected 14,071 single judgments, with at least three independent
judgments for each term, from 850 contributors. The entire experiment cost $199
and took roughly one day.</p>
        <p>Figure Eight automatically takes care of the aggregation of the contributors'
answers, and reports a confidence score associated with each aggregated answer.
The confidence score is a measure of the agreement between the annotators on a
particular keyword, weighted by a trust score assigned to them by Figure Eight
based on how they fare in their tasks. The confidence score is of great importance
in the analysis of the produced dataset, as we expect the most difficult cases to
be associated with a lower confidence score.</p>
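        <p>As an illustration only (this is not Figure Eight's actual formula), a trust-weighted vote produces an aggregated answer and a confidence of this kind:</p>

```python
# Illustrative sketch: each contributor's vote is weighted by their trust,
# and the confidence of the winning answer is its share of the total weight.
from collections import defaultdict

def confidence(judgments):
    """judgments: list of (answer, trust) pairs for one keyword."""
    weight = defaultdict(float)
    for answer, trust in judgments:
        weight[answer] += trust
    best = max(weight, key=weight.get)
    return best, weight[best] / sum(weight.values())

answer, conf = confidence([("dbr:Egg", 0.9), ("dbr:Egg", 0.8), ("dbr:Eggs", 0.6)])
print(answer, round(conf, 2))  # dbr:Egg 0.74
```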
        <p>Inspecting the resulting data set, we found several classes of difficult cases:
from cases where a term can be linked to different, closely related entities (e.g.,
wood railing to either dbr:Fence or dbr:Deck_railing) to cases where the input
term is complex and not directly represented in the target knowledge base (e.g.,
basket of banana linked to either dbr:Basket or dbr:Banana). There are also
errors (mostly spelling mistakes) in the original SUN labels, such as ston → stone
or pilow → pillow, which we corrected manually.</p>
        <p>We performed a post-processing step to ensure a high-quality gold standard
dataset, which led to the removal of 456 entries (9.7% of the total number of
original terms) from the gold standard dataset, e.g., because they were links to
DBpedia disambiguation pages. The final dataset consists of 4,239 term-DBpedia
URI pairs, 3,399 of which are objects and 840 scenes, and it is available at
https://project.inria.fr/aloof/data/.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Results</title>
        <p>We evaluated the methods introduced in Section 3 against the gold standard
sets. This quantitative evaluation consists in counting the number of correctly
predicted entity labels and the number of items for which any label is
produced at all. With these two pieces of information we can compute the
standard metrics used in information retrieval, that is, precision = #correct /
#retrieved and recall = #correct / #instances. We also compute the F-score, i.e.,
the harmonic mean of precision and recall, which summarizes the performance of
the methods in a single number: F-score = 2 · precision · recall / (precision +
recall).</p>
        <p>The results are shown in Table 1. For VCD, we evaluated the algorithm
using as parameters both aggregation methods (AVG and MAX) and a range of
thresholds T, and we report only the result of the best combination of parameters
(AVG and T=0.3 for the objects, AVG and T=0.8 for the scenes), since the
variation of scores was minimal.
From the results of the experiments, it is evident that simple string matching
has limited prediction power. Direct application of an entity linking algorithm
(Babelfy) also somewhat fails, arguably because of the lack of linguistic
context to help the disambiguation of the terms. (Throughout, we replace
http://dbpedia.org/resource/ with the namespace prefix dbr: to improve
readability.) Wikipedia redirection links, on
the other hand, represent a powerful mechanism to exploit in order to bridge
lexical variations of keywords to the entity labels. It must be noted, though, that
Wikipedia redirection is very specific to the resource we chose for our evaluation.
In the general case, such a useful tool cannot be taken for granted.</p>
        <p>The string match method that makes use of redirection links has the best
performance in terms of precision, both for objects and scenes. This is due to the
restricted number of terms that the method is able to retrieve, focusing only on
the input entries where there is a perfect string match and disregarding the
ones where the string relation between the entity and the URI is less evident.</p>
        <p>VCD obtains the best performance overall, in particular because of its higher
recall. The result is particularly notable on the object label dataset.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>In addition to the quantitative evaluation presented in the previous section,
we inspected a sample of the output of each method in order to assess their
performance qualitatively.</p>
      <p>For the DBpedia Lookup baseline, about half of the wrongly predicted
labels belong to a named entity (e.g., oyster bank is mapped to
dbr:Duchy_of_Cornwall). This behavior is specific to the DBpedia Lookup, indicating a bias
towards named entities that makes sense considering the encyclopedic type of
resource it targets.</p>
      <p>The string match baseline makes fewer mistakes than the DBpedia lookup, but
it still has low precision. Among the terms misclassified by this method, roughly
one third are due to the term being in the plural form, or other spelling variations
(e.g., Post-it note vs. Post it note). All these cases, among others, are
corrected by the version of the baseline algorithm that follows the redirects,
which obtains a much higher precision score (the highest in the experiment, in
fact).</p>
      <p>Analyzing the errors committed by the strongest baseline and by Babelfy, we
noticed that only a small subset of labels is wrongly predicted by both systems,
excluding the cases where nothing is returned by one of the systems. Similar
figures are found with other pairwise comparisons of the methods. This makes us
speculate that a joint system combining the strengths of several approaches
could achieve a much higher performance than any of the single systems.</p>
      <p>Providing an additional method for dealing with the cases where the baseline
method does not return any entry led to a significant improvement in terms of
coverage. As a consequence of considering a wider set of entries, the number of
errors increased. The overall performance, however, is better on both datasets.</p>
      <p>The novel method we propose, VCD, is designed to solve some of the issues
of the general term mapping task, namely the disambiguation of multi-word
expressions and the necessity of inferring a notion of context from the input set of
keywords. While the experimental results show that solving these problems leads
to a better mapping, there are other issues that are not accounted for by any
of the presented methods. For instance, the non-compositionality of multi-word
terms is never considered, while the best-performing method (VCD)
intrinsically assumes the strict compositionality of the term constituents. For instance,
billiard ball must be either a dbr:Ball (correct) or a dbr:Billiard (wrong),
but cannot be linked to any other concept in DBpedia by this algorithm.</p>
      <p>Finally, there is an underlying assumption that every keyword in the input set
has a corresponding "perfect match" resource in the target knowledge base. In
practice, this is hardly the case, and the resulting mismatch calls for slight
adaptations of the task definition.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we gave a definition of a general term mapping task, aimed at
mapping arbitrary sets of terms to a Web knowledge base, and related it to
well-known tasks in neighboring areas. We built an evaluation framework that
includes a manually annotated gold standard dataset and quantitative metrics of
performance. In this environment, we tested baseline algorithms based on
DBpedia and Wikipedia, as well as a state-of-the-art entity linking system, showing
their limitations when applied to general term mapping. We then proposed a
new approach that fills some of the performance gaps of out-of-the-box
solutions, and discussed the results of our experiments, concluding that the most
promising ways to approach context-less term mapping are methods that aim
for high coverage and employ the whole input set at once to provide context for
the term disambiguation.</p>
      <p>In the future, besides investigating methods to improve the coverage of the
systems, we need to look for general methods that go beyond the specificity of
DBpedia, possibly adapting solutions to related problems such as the tasks listed in
Section 2. We also plan to investigate methods that leverage the asymmetry of the
term mapping task as we defined it, that is, solutions that exploit the linguistic
features on one side of the mapping and the structural features on the other.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work of Valerio Basile is partially funded by Progetto di Ateneo/CSP 2016
(Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01).</p>
    </sec>
  </body>
  <back>
  </back>
</article>