<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Linking on Philosophical Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Ceccarelli</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto De Francesco</string-name>
          <email>alberto.defrancesco@imtlucca.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele Perego</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Segala</string-name>
          <email>marco.segala@univaq.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore Trani</string-name>
          <email>trani@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Scienze Umane</institution>
          ,
          <addr-line>Università dell'Aquila</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Istituto IMT Alti Studi di Lucca</institution>
          ,
          <addr-line>Lucca</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione - CNR</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity Linking consists in automatically enriching a document by detecting the text fragments mentioning a given entity in an external knowledge base, e.g., Wikipedia. This problem is a hot research topic due to its impact on several text-understanding related tasks. However, its application to some specific, restricted topic domains has not received much attention. In this work we study how we can improve entity linking performance by exploiting a domain-oriented knowledge base, obtained by filtering out from Wikipedia the entities that are not relevant for the target domain. We focus on the philosophical domain, and we experiment with a combination of three different entity filtering approaches: one based on the "Philosophy" category of Wikipedia, and two based on similarity metrics between philosophical documents and the textual description of the entities in the knowledge base, namely cosine similarity and Kullback-Leibler divergence. We apply traditional entity linking strategies to the domain-oriented knowledge base obtained with these filtering techniques. Finally, we use the resulting enriched documents to conduct a preliminary user study with an expert in the area.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Linking</kwd>
        <kwd>Entity Filtering</kwd>
        <kwd>Information Search and Retrieval</kwd>
        <kwd>Document Enriching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, document enriching via Entity Linking (EL) has gained increasing interest due to its impact on several text-understanding related tasks, e.g., web search, document classification, etc. [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. The EL problem was introduced in 2007 by Mihalcea and Csomai [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and consists in linking short fragments of text within a document to an entity listed in a given Knowledge Base (KB). The authors also propose to consider each Wikipedia article as an entity, and the title or the anchor text of the hyperlinks pointing to the article as potential mentions of the entity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Figure 1: Example of a document semantically enriched by an EL system</title>
      <p>In Dresden [http://en.wikipedia.org/wiki/Dresden], Schopenhauer [http://en.wikipedia.org/wiki/Arthur_Schopenhauer] became acquainted with the philosopher [http://en.wikipedia.org/wiki/Philosopher] and freemason [http://en.wikipedia.org/wiki/Freemasonry], Karl Christian Friedrich Krause [http://en.wikipedia.org/wiki/Karl_Christian_Friedrich_Krause].</p>
    </sec>
    <sec id="sec-3">
      <title>Entity Linking</title>
      <p>A typical EL system works in three steps: i) Spotting: the document is processed in order to detect a set of potential mentions (also referred to as surface forms or spots), and for each mention a list of candidate entities is produced; ii) Disambiguation: for each potential mention with more than one candidate entity, a single entity is selected. This is done by trying to maximize the coherence among the selected entities; iii) Filtering: only the most relevant annotations, i.e., the mentions linked with some entity, are selected, filtering out the irrelevant ones by using some measure of annotation confidence/importance. Due to the ambiguity of natural languages, the EL task is not trivial. In fact, the same mention could refer to more than one entity (polysemy) and the same entity could be referred to by more than one mention (synonymy).</p>
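      <p>The three steps above can be sketched in code. The following is a minimal, illustrative Python sketch, not the actual implementation of any EL system; the kb object, with its surface_forms dictionary and related map, is a hypothetical structure assumed only for this example.</p>

```python
def entity_link(text, kb, top_k=10):
    """Illustrative sketch of the three EL steps: spotting, disambiguation, filtering."""
    # 1) Spotting: match surface forms from the KB dictionary against the text,
    #    producing (start, end, mention, candidate entities) tuples.
    candidates = []
    for mention, entities in kb.surface_forms.items():
        start = text.lower().find(mention.lower())
        if start >= 0:
            candidates.append((start, start + len(mention), mention, entities))
    # 2) Disambiguation: keep one entity per mention, preferring the candidate
    #    most coherent with the other candidate entities in the document.
    all_entities = {e for _, _, _, ents in candidates for e in ents}
    annotations = []
    for start, end, mention, ents in candidates:
        best = max(ents, key=lambda e: len(kb.related[e] & all_entities))
        score = len(kb.related[best] & all_entities)
        annotations.append((start, end, mention, best, score))
    # 3) Filtering: keep only the most confident annotations.
    annotations.sort(key=lambda a: a[4], reverse=True)
    return [a[:4] for a in annotations[:top_k]]
```

      <p>Real systems use far richer spotting dictionaries and coherence measures, but the same three-phase shape applies.</p>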
      <p>Let us introduce an example to describe how the EL process works. Figure 1 shows a semantically enriched document produced by an EL system. In the reported text there are mentions (e.g., Dresden, Schopenhauer, or freemason) linked to their semantic concept by using the URI or the identifier in the KB, in our case Wikipedia. For example, the spot Dresden is linked to http://en.wikipedia.org/wiki/Dresden. It is worth noting that the mention Dresden could refer to many other meanings, as we can see by looking at the corresponding Wikipedia disambiguation page (see footnote 5).</p>
      <p>Now we introduce some notation used hereinafter in the paper. A Knowledge Base is a collection of entities, where each entity represents an artifact or a concept in the real world. An entity e is described by the following attributes:
- a Uniform Resource Identifier, which uniquely identifies the entity in the KB (e.g., the URL "http://en.wikipedia.org/wiki/Dresden" identifies the entity Dresden);
- a description: text describing what the entity represents, usually the content of its Wikipedia page (e.g., "Dresden is the capital city of...");
- a set of related entities that are connected to the given entity, usually derived from Wikipedia links (e.g., Germany is linked by Dresden);
- a set of surface forms, the fragments of text used to refer to the entity (e.g., "A. Schopenhauer" and "Arthur Schopenhauer" are both surface forms for http://en.wikipedia.org/wiki/Arthur_Schopenhauer);
- a set of categories, organized in a taxonomy, the entity belongs to (e.g., Schopenhauer belongs to the categories "Idealists" and "German atheists").</p>
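      <p>For illustration only, the entity attributes listed above can be modeled as a small Python structure (a sketch, not part of any specific EL framework):</p>

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A KB entity with the attributes described above."""
    uri: str                                          # uniquely identifies the entity in the KB
    description: str                                  # text of its Wikipedia page
    related: set = field(default_factory=set)         # URIs of connected entities
    surface_forms: set = field(default_factory=set)   # mentions used to refer to it
    categories: set = field(default_factory=set)      # Wikipedia categories it belongs to

schopenhauer = Entity(
    uri="http://en.wikipedia.org/wiki/Arthur_Schopenhauer",
    description="Arthur Schopenhauer was a German philosopher...",
    surface_forms={"A. Schopenhauer", "Arthur Schopenhauer"},
    categories={"Idealists", "German atheists"},
)
```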
      <p>The entity linking task consists in finding an annotation function f_EL that, given a KB and a raw text document d, returns an enriched version of the document d_e which also includes a list of annotations. Each annotation is described by a tuple &lt;start, end, text, entity&gt;, where:</p>
      <sec id="sec-3-1">
        <title>5 http://en.wikipedia.org/wiki/Dresden_(disambiguation)</title>
        <p>- start is the starting offset of the annotation in the document;
- end is the ending offset of the annotation in the document;
- text is the surface form of the entity detected in the document;
- entity is the URI of the entity detected in the document.</p>
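        <p>The annotation tuple can likewise be sketched as a named tuple whose offsets index directly into the raw document (illustrative only):</p>

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    """One annotation of the enriched document d_e: <start, end, text, entity>."""
    start: int    # starting offset of the annotation in the document
    end: int      # ending offset of the annotation in the document
    text: str     # surface form detected in the document
    entity: str   # URI of the linked entity

doc = "In Dresden, Schopenhauer became acquainted with the philosopher"
ann = Annotation(3, 10, "Dresden", "http://en.wikipedia.org/wiki/Dresden")
assert doc[ann.start:ann.end] == ann.text
```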
        <p>The research question behind this paper is the following: let us assume we have a collection of documents about a particular topic to enrich, e.g., Philosophy. Could we exploit the knowledge about the topic to improve the effectiveness of the entity linking process? The solution we propose works a priori on the Knowledge Base (KB) used for generating the EL model. The idea is to consider only the entities relevant for a target topic t of the documents we are going to annotate. These entities form a new domain-specific Knowledge Base, which in the following we refer to as the Topical Knowledge Base.</p>
        <p>
          To the best of our knowledge, we are the first to investigate how to perform topical EL by pre-filtering a general knowledge base. Mirylenka and Passerini [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and later Miao et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], applied EL techniques to the domain of scientific publications: in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] the authors propose a method of organizing search results into concise and informative topic hierarchies. They obtain the hierarchies by annotating the entities in a document with Wikipedia Miner [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], an open source entity linking tool. STICS [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a system that enriches news with entities and uses them to improve browsing and provide entity analytics of what is happening in the world. Finally, Ernst et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] applied entity linking to health and life sciences, through the KnowLife portal, a large KB automatically constructed from Web sources.
        </p>
        <sec id="sec-3-1-1">
          <title>The Knowledge Base Topic-Filtering Problem</title>
          <p>Let f_EL(KB, d) be an annotation function that, given a Knowledge Base KB and a document d, produces an enriched version of the document d_e, and let μ(f_EL(KB, d)) be an effectiveness measure of the annotation function, i.e., a common information retrieval quality metric such as precision.</p>
          <p>Given a collection of documents D_t related to a topic t, our objective is to find a subset KB_t of the knowledge base KB such that, for every d in D_t, μ(f_EL(KB_t, d)) ≥ μ(f_EL(KB, d)), with |KB_t| much smaller than |KB|.</p>
          <p>The topical knowledge base KB_t is obtained by filtering KB through a function φ(KB, t). Since KB is a collection of entities {e_1, e_2, …, e_n} and we filter each entity independently from the others, we can thus define:</p>
          <p>KB_t = ∪_{e ∈ KB} φ(e, t)</p>
          <p>Our claim is that such a function can improve the effectiveness of the entity linking task for the topic t. In particular, we propose three filtering methods:</p>
          <p>Cosine Similarity Filter, Kullback-Leibler Divergence Similarity Filter, and Category Filter. The first two approaches exploit the textual similarity between the documents in D_t and the description of the entity in KB (averaging the result with respect to the collection D_t). The latter exploits the Wikipedia Category Graph in order to detect how far the categories the entity belongs to are from the root category of the topic being considered.</p>
          <p>Cosine Similarity Filter</p>
          <p>
            The cosine similarity filter measures the similarity between two vectors. In Information Retrieval, the textual similarity between a document-query pair is a score aiming to provide a degree of similarity of a document with respect to a user information need. Let d be a document belonging to a collection of documents D related to a topic t, e_desc be the textual description of an entity (e.g., the text in its Wikipedia page), w_ki(d) the weight associated with the term-document pair (k_i, d), and w_ki(e_desc) the weight associated with the term-entity pair (k_i, e_desc). Then, in the textual similarity context, the cosine similarity is defined as:

cos(d, e_desc) = Σ_{k_i ∈ V} w_ki(d) · w_ki(e_desc) / ( √(Σ_{k_i ∈ V} w_ki(d)²) · √(Σ_{k_i ∈ V} w_ki(e_desc)²) )   (1)

where V = {k_1, …, k_n} is the vocabulary of terms, n is the number of distinct terms in the document collection, and k_i is a generic term. The weights w_ki(d) and w_ki(e_desc) are computed with the tf-idf [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] formula as follows, using the inverse document frequency of the term in Wikipedia (idf_w):

w_ki(d) = tf_d(k_i) · idf_w(k_i)   (2)
w_ki(e_desc) = tf_{e_desc}(k_i) · idf_w(k_i)   (3)

The cosine similarity ranges in [0, 1], with the maximum similarity reached at 1.
          </p>
          <p>Kullback-Leibler Divergence Filter</p>
          <p>
            The Kullback-Leibler Divergence (KLD) filter measures the relative entropy of two different probability distributions associated to the same event space. Let d and e_desc be defined as in the Cosine Similarity Filter, P_ki(d) be the probability of a term k_i in the document, and P_ki(e_desc) be the probability of a term k_i in the entity description. The Kullback-Leibler divergence is formulated in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] as follows:

KLD(d, e_desc) = Σ_{k_i ∈ V} P_ki(d) · log( P_ki(d) / P_ki(e_desc) )   (4)

where P_ki(d) and P_ki(e_desc) are respectively defined as:
          </p>
          <p>P_ki(d) = w_ki(d) / Σ_{k_j ∈ V} w_kj(d)   (5)</p>
          <p>P_ki(e_desc) = w_ki(e_desc) / Σ_{k_j ∈ V} w_kj(e_desc)   (6)</p>
          <p>where w_ki(d) and w_ki(e_desc) are respectively computed as in Equations (2, 3). For the sake of simplicity, in this paper we use the original formulation of the KLD, which is not symmetric (i.e., KLD(d, e_desc) ≠ KLD(e_desc, d)). A more reliable implementation could be the symmetrised version or the Jensen-Shannon divergence, because they also consider the similarity between the textual description of the entity and the document.</p>
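          <p>Both textual filters can be sketched over bag-of-words weight vectors as follows. This is a minimal illustration of Equations (1)-(6); the epsilon guarding against terms absent from the entity description is an assumption of this sketch, not part of the original formulation.</p>

```python
import math
from collections import Counter

def tfidf(text, idf):
    """tf-idf weights of Equations (2, 3): term frequency times Wikipedia idf."""
    tf = Counter(text.lower().split())
    return {k: tf[k] * idf.get(k, 0.0) for k in tf}

def normalize(w):
    """Term probabilities of Equations (5, 6): weights divided by their sum."""
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

def cosine(wd, we):
    """Cosine similarity of Equation (1); 1 means maximum similarity."""
    dot = sum(v * we.get(k, 0.0) for k, v in wd.items())
    nd = math.sqrt(sum(v * v for v in wd.values()))
    ne = math.sqrt(sum(v * v for v in we.values()))
    return dot / (nd * ne) if nd and ne else 0.0

def kld(wd, we, eps=1e-9):
    """Kullback-Leibler divergence of Equation (4); lower means more similar."""
    pd, pe = normalize(wd), normalize(we)
    return sum(p * math.log(p / pe.get(k, eps)) for k, p in pd.items() if p)
```

          <p>With these scores, an entity passes the textual filters when its cosine similarity with the reference documents is high and its KLD from them is low.</p>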
          <p>Category Filter</p>
          <p>The Category Filter takes advantage of the Wikipedia category graph and of the list of categories each entity belongs to. This information is used to compute the shortest path (and so the minimum distance) of an entity from a set of highly relevant category nodes (the roots of the visit) for the topic being considered (e.g., Philosophy, see footnote 7). Each Wikipedia article can appear in more than one category, and each category can appear in more than one parent category. Multiple categorization schemes co-exist simultaneously. In other words, categories do not form a strict hierarchy or tree structure, but a more general directed acyclic graph (DAG).</p>
          <p>In particular, let G = (C, E) be the Wikipedia category graph, with C the category nodes and E the direct connections between the categories, and let us define C^(t) = {c_1^(t), c_2^(t), …, c_m^(t)} ⊆ C as the set of highly relevant categories relative to the topic t. The minimum distance of each Wikipedia entity from the categories in C^(t) can thus be computed by exploiting a breadth-first search (BFS) visit of the graph G, starting from the nodes in C^(t). Let us denote such a method with the function φ:</p>
          <p>φ(c_i^(t)) = {BFS(G, C^(t))}   ∀i ∈ {1, …, n}   (7)</p>
          <p>where n is equal to |C|.</p>
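          <p>The BFS computation of the minimum category depths can be sketched as follows; the plain adjacency-list representation of the graph is an assumption of this sketch (in practice the category graph comes from the Wikipedia dump):</p>

```python
from collections import deque

def min_category_depth(graph, roots):
    """BFS over the category DAG from the topic's root categories; returns,
    for each reachable category, its minimum distance from the roots (Eq. 7)."""
    depth = {c: 0 for c in roots}
    queue = deque(roots)
    while queue:
        c = queue.popleft()
        for child in graph.get(c, ()):   # follow descendant links only
            if child not in depth:       # first visit in BFS = shortest distance
                depth[child] = depth[c] + 1
                queue.append(child)
    return depth

def entity_min_depth(entity_categories, depth):
    """Minimum depth over all the categories the entity belongs to."""
    return min((depth[c] for c in entity_categories if c in depth), default=None)
```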
        </sec>
        <sec id="sec-3-1-2">
          <title>Experiments</title>
          <p>In the following we introduce the philosophical document adopted as a reference document for the topic t = Philosophy. This document is used both for filtering</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>7 http://en.wikipedia.org/wiki/Category:Philosophy</title>
      </sec>
      <sec id="sec-3-3">
        <title>Philosophy! [http://en.wikipedia.org/wiki/Category:Philosophy]</title>
        <p>+</p>
      </sec>
      <sec id="sec-3-4">
        <title>Value! [http://en.wikipedia.org/wiki/Category:Value]</title>
        <p>by textual similarity and for performing the user study described in Section 4. Then we describe the methodology adopted to build the filtered KB and the differences in the EL annotations obtained by using the traditional KB and the filtered one.</p>
        <p>3.1 Reference document</p>
        <p>We adopt a philosophical text written by the philosopher Ludwig Wittgenstein around the middle of the last century as the reference document for the topic t being considered. The title of the book is On Certainty and it is a collection of aphorisms discussing the relation between knowledge and certainty. The book is composed of 676 paragraphs, with an average of 243 characters and 46 terms per paragraph. Since each paragraph is long enough and contains several philosophical notions, we consider it as an independent document d of D_t.</p>
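        <p>Treating each paragraph as an independent document of D_t, collection statistics of this kind can be computed with a few lines (an illustrative sketch, using whitespace splitting as the term tokenizer):</p>

```python
def paragraph_stats(paragraphs):
    """Number of paragraphs and average characters/terms per paragraph."""
    n = len(paragraphs)
    avg_chars = sum(len(p) for p in paragraphs) / n
    avg_terms = sum(len(p.split()) for p in paragraphs) / n
    return n, avg_chars, avg_terms
```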
        <p>3.2 Filtering methods</p>
        <p>In order to evaluate the impact of the proposed filtering on the KB and to gain some insights on the thresholds to adopt, we perform a study to measure the frequencies of the entities that pass each filtering strategy in isolation, by considering different values for the thresholds. Given the problem formalization described in Section 2, we apply each filtering method to each entity in the KB, computing a score that expresses how close the entity is to the topic t. We compute the textual similarity by applying the Cosine Similarity and the Kullback-Leibler Divergence between the reference document described above and the description of the entity in the KB, i.e., the content of the Wikipedia article. The Category Filter computes the minimum depth over all the categories each entity belongs to. The category graph as well as the categories related to an entity are taken from the KB.</p>
        <p>Figure 3 reports the application of the Cosine Similarity (Figure 3a) and Kullback-Leibler Divergence (Figure 3b) filters to the KB. The former obtains maximum similarity when the score is 1, while the latter when the score is 0. The two figures show that the cosine similarity is more spread out along the X axis (the confidence score thresholds) than KLD, which is indeed very thin-tailed. Finally, Figure 4 shows the category filter application, with the depth distribution of the categories in the category graph given the root node Philosophy (Figure 4a) and the distribution of the entities according to the minimum depth of the categories each entity belongs to (Figure 4b). The outcome of this figure is quite surprising: the category graph (which is a DAG according to Wikipedia) seems to interconnect categories that are not strictly related to each other. As a matter of fact, take a look at the odd path in Figure 2, where the category Valuation (Finance) is reached by traversing only two descendant links starting from the Philosophy category node. This evidence also clearly explains why so many categories (~750k) can be reached by descendant traversal of the category graph from the Philosophy node, with at most 10-11 steps.</p>
        <p>3.3 Filtering approach adopted</p>
        <p>According to the Wikipedia Philosophy Portal (see footnote 8), there are about 15k philosophical articles out of a total of 4.35M articles. Our intuition is that restricting ourselves to this topic-specific subset of articles may not be sufficient from an EL perspective. Indeed, also considering articles closely related to the topic could result in an improvement of the EL effectiveness, thanks to a disambiguation phase which makes use also of entities only marginally related to the topic t. Thus, our suggestion is to select twice the number of entities Wikipedia reports as belonging to the philosophical portal (i.e., our goal is to select ~30k entities). Obviously this approach is reasonable only because we are investigating a new, unexplored research direction. A more general way of solving the problem would be to not decide a priori the number of entities to select, but to let it depend on the domain and on how the domain is covered in Wikipedia. The same reasoning would lead to choosing the thresholds for the textual similarity strategies depending only on the domain-specificity of the entities.</p>
        <p>Hence, for filtering out entities not related to the topic t, we adopt the following approach:
1. We make use of the category filter to select entities belonging to the topic and to build the base KB. This choice is motivated by the high accuracy we expect from the category graph in selecting entities relevant for the topic,</p>
      </sec>
      <sec id="sec-3-5">
        <title>8 http://en.wikipedia.org/wiki/Portal:Philosophy</title>
        <p>because the categories are manually assigned (and validated) by the users to the entities. In order to maximize the probability of selecting only topic-related entities, we exploit a very aggressive threshold value of 3 (i.e., 3 is the maximum allowed minimum distance between at least one of the categories the entity belongs to and the root category of the topic, which in our case is the Philosophy category). By using this threshold the filter selects ~28k entities.
2. We expand the KB by also considering textual similarities with the reference document described in Section 3.1. The idea here is that some entities in Wikipedia could be misclassified (i.e., their categories could not reflect the real topic of the entity, or the entity could be missing some relevant categories). In order to apply such an expansion, we adopt the two textual filtering approaches described in Section 2 by combining them together, i.e., only entities that pass both the filters are added to the base KB. Since our target is to select 30k entities, and we have 28k from the categories, we add the missing 2k by considering the intersection between the subset of entities with the highest cosine similarity and the subset of entities with the highest KLD similarity (i.e., the lowest divergence). We select the subsets by finding two thresholds so that the sets have approximately the same size. We then filter the cosine similarity (respectively the Kullback-Leibler divergence) with a threshold of 0.35 (0.125), which lets ~52k (~79k) entities pass the filter.</p>
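        <p>The two-step combination above can be sketched as follows; the precomputed per-entity score dictionaries are an assumption of this sketch, while the thresholds are the ones reported in the text:</p>

```python
def build_topical_kb(entities, depth, cos_score, kld_score,
                     max_depth=3, cos_thr=0.35, kld_thr=0.125):
    """1) Base KB: entities whose minimum category depth is at most max_depth.
    2) Expansion: entities passing BOTH textual filters, i.e. cosine similarity
       at least cos_thr and KL divergence at most kld_thr (low KLD = similar)."""
    base = {e for e in entities
            if depth.get(e) is not None and max_depth >= depth[e]}
    expansion = {e for e in entities
                 if cos_score.get(e, 0.0) >= cos_thr
                 and kld_thr >= kld_score.get(e, float("inf"))}
    return base | expansion
```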
        <p>The resulting KB is made up of 30k entities, primarily selected by investigating the category graph and expanded with highly similar (from a language point of view) side entities. The size reduction compared to the full KB is 99%.</p>
        <p>3.4 Entity Linking differences</p>
        <p>
          Traditional entity linking strategies are applied to the reference document in order to evaluate how the annotation process is affected by the domain-oriented knowledge base obtained with the proposed filtering approach. We used the Dexter EL framework [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to annotate the philosophical document, both with the traditional KB and by plugging the domain-oriented KB into it. Each paragraph was annotated independently from the others, so that we are able to investigate
        </p>
        <p>Figure 5: (a) Spots distribution and (b) ambiguity distribution per paragraph, for the full KB and the filtered KB.</p>
        <p>the difference in the annotation process by looking at two important factors: how many spots per paragraph the system is able to annotate, and how ambiguous each spot is before performing the disambiguation phase, i.e., how many entities on average are candidates for each spot. In Figure 5 we report these factors by comparing the two KB solutions. As expected, Figure 5a shows that the distributions of spots per paragraph are very different: on average, by using the filtered KB the EL system annotates fewer spots per paragraph than by using the full KB, and this evidence is very clear if we look at the cumulative number of annotated spots. Indeed, by using the full KB the EL system annotates approximately 3 times more spots than by using the filtered one. If we look at the distributions of the ambiguity per spot, another important aspect arises: on average, the EL system which uses the filtered KB selects far fewer candidate entities per spot, with a 50% probability of selecting only one candidate per spot and a 25% probability of selecting two candidates for a spot. The latter evidence is really important because the disambiguation phase is simpler when the ambiguity is low (and it is worth noticing that 50% of the spots do not need disambiguation at all, due to a single candidate per spot being selected).</p>
        <sec id="sec-3-5-1">
          <title>User study</title>
          <p>
            For assessing the quality of the linking performed using the filtered KB, we set up a user study experiment. We selected the first 110 paragraphs from the On Certainty book, and we annotated each paragraph using a model generated from the filtered KB. Each paragraph was annotated using a dictionary generated from the Wikipedia anchors and, in case of ambiguous spots, we disambiguated using the TAGME disambiguation algorithm [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
          </p>
          <p>On average we annotated 4.54 entities per paragraph (500 in total). We designed a simple web application that allows a user to evaluate the annotations. The application allows the user to browse the paragraphs in two different ways:
Document-based browsing: the user can visualize a paragraph and move to the previous one, or the next. Annotated spots are highlighted and, if the user clicks on a spot, a description of the annotated entity pops up.
Entity-based browsing: if the user clicks on the name of the entity, then an entity-based view is presented: this page presents a description of the entity, and then a list of paragraphs where the entity is mentioned. For each paragraph the user can visualize the spot that was annotated with the entity, and the text around the annotation.</p>
          <p>The user can mark an annotation as good or bad, simply by clicking on it. One
click means good and the annotation is highlighted in green, while an additional
click means bad and the annotation is highlighted in red.</p>
          <p>We asked an expert human annotator in the field to judge the annotations. Subsequently we performed the same evaluation, but with the EL performed using the full KB. We obtained on average 9.72 annotations per paragraph (1070 in total). This high number is due to the fact that we did not pre-filter the annotations in any way. To speed up the expert assessment, we decided to remove the annotations with a confidence (i.e., a score assigned by the EL system which expresses the certainty of the annotation) lower than 20%. This threshold is absolutely reasonable since usually TAGME discards annotations with a confidence lower than 50%. The number of annotations per document decreased from 9.72 to 2.82 (globally from 1070 to 311). Moreover, we automatically copied the assessments relative to the same annotations (i.e., spots occurring in the same place of the text and linked to the same entity) from the previous judgment performed by the expert annotator. This avoided the evaluation of 113 (36%) annotations.</p>
          <p>Figure 6b shows the distribution of the annotations by their confidence score. We can observe that a traditional EL system encounters some problems when working with a filtered KB. Indeed the annotation confidence is on average quite low, thus suggesting that the mutual reinforcement of the entities in the disambiguation phase still has a lot of room for improvement when working on a topic-based KB. It is worth noticing that the disambiguation strategy adopted (TAGME) would have discarded the majority of the annotations by applying its threshold value (50%) on the annotation confidence.</p>
          <p>Figure 6a, on the other hand, illustrates the distribution of the two assessment classes by the confidence score of the corresponding annotations. We can identify a clear correlation between the positive class and the confidence score (higher is better). This can be explained by the fact that high confidence scores are assigned to entities strongly related with other entities in the annotated document, thus resulting in a more precise annotation which is less prone to errors.</p>
          <p>Finally, Figure 7 depicts the annotation effectiveness of the two EL systems, the first using the full KB and the second using the filtered one. The effectiveness is evaluated at different values of confidence in order to study the best threshold to adopt for filtering out bad annotations and maximizing the effectiveness of the annotation process. In the figures we show the support, i.e., the number of assessments with a confidence higher than or equal to the threshold adopted, and the precision, i.e., the fraction of positive assessments over the sum of positive and negative assessments. Figure 7a clearly shows that the precision goes from a value of 0.7 (obtained without using any threshold on the confidence, thus having the maximum support) to a maximum of 0.85 (using a threshold of 0.3, corresponding to a support of 430). Higher threshold values slowly decrease the precision but, more importantly, decrease the support. A very low support means very few annotations, and we should avoid such a compromise.</p>
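          <p>The precision and support reported in Figure 7 can be computed from the assessments with the following sketch, where each assessment is assumed to be a (confidence, is_good) pair:</p>

```python
def precision_support(assessments, threshold):
    """Support: number of assessments with confidence at or above the threshold.
    Precision: positive assessments over all retained assessments."""
    kept = [good for conf, good in assessments if conf >= threshold]
    support = len(kept)
    precision = sum(kept) / support if support else 0.0
    return precision, support
```

          <p>Sweeping the threshold over the assessed annotations yields the precision/support curves of Figures 7a and 7b.</p>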
          <p>Figure 7b studies the behavior of the EL system that makes use of the full KB. Here the precision ranges from a value of 0.65, starting from a confidence value of 0.2 (recall that annotations with a confidence below this threshold were discarded), up to a maximum of 0.85 obtained with a confidence value of 0.57 (with a support of 12). The latter result clearly shows how using only precision is not enough to measure the effectiveness of a system: in fact, the support value of 15 means that only 5% of the annotations are considered, resulting in a very poor document enriching (i.e., 0.14 average annotations per paragraph). By comparing the behavior of the two systems it is evident that the strategy of building a topical KB is a key idea for performing EL on a set of topic-specific documents. This is supported by the fact that we obtained a consistent improvement in terms of precision without penalizing the support too much.</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>Conclusions</title>
          <p>In this paper we study how to apply entity linking to a collection of documents about a particular topic. Our thesis is that pre-filtering a general knowledge base, keeping only the entities that are relevant for the topic, and then building the entity linking model only from these entities could improve the annotation performance. We propose three strategies for filtering the knowledge base, two based on textual similarity between the topic and the entities (i.e., cosine similarity and KL-divergence) and one based on the Wikipedia categories (i.e., considering only the categories that belong to the selected topic). We perform some preliminary experiments on the topic Philosophy, combining the three methods and performing the linking on the resulting filtered knowledge base. Finally, in a user study performed with an expert in the area, we compare the annotation performance on a philosophical document collection using a traditional knowledge base and the one filtered by our approach. The results confirm that the proposed technique is a promising idea for performing EL on a set of topic-specific documents.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Ceccarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lucchese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Trani</surname>
          </string-name>
          .
          <article-title>Dexter: an open source framework for entity linking</article-title>
          .
          <source>In Proceedings of the Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>KnowLife: a knowledge graph for health and life sciences</article-title>
          .
          <source>In Data Engineering (ICDE), 2014 IEEE 30th International Conference on</source>
          , pages
          <fpage>1254</fpage>
          -
          <lpage>1257</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Scaiella</surname>
          </string-name>
          .
          <article-title>TAGME: on-the-fly annotation of short text fragments (by wikipedia entities)</article-title>
          .
          <source>In Proceedings of CIKM</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Milchevski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Stics: searching with strings, things, and cats</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          , pages
          <fpage>1247</fpage>
          -
          <lpage>1248</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kullback</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Leibler</surname>
          </string-name>
          .
          <article-title>On information and sufficiency</article-title>
          .
          <source>Ann. Math. Statist.</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          , 03
          <year>1951</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <source>Introduction to Information Retrieval</source>
          , volume
          <volume>1</volume>
          . Cambridge University Press, Cambridge,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nishino</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Igata</surname>
          </string-name>
          .
          <article-title>Link scientific publications using linked data</article-title>
          .
          <source>In Semantic Computing (ICSC)</source>
          ,
          <year>2015</year>
          IEEE International Conference on, pages
          <fpage>268</fpage>
          -
          <lpage>271</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Csomai</surname>
          </string-name>
          .
          <article-title>Wikify!: linking documents to encyclopedic knowledge</article-title>
          .
          <source>In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management</source>
          , pages
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Milne</surname>
          </string-name>
          and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>An open-source toolkit for mining wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>194</volume>
          :
          <fpage>222</fpage>
          -
          <lpage>239</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>D.</given-names>
            <surname>Mirylenka</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          .
          <article-title>Navigating the topical structure of academic search results via the wikipedia category network</article-title>
          .
          <source>In Proceedings of the 22nd ACM international conference on Conference on information &amp; knowledge management</source>
          , pages
          <fpage>891</fpage>
          -
          <lpage>896</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Pantel</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Fuxman</surname>
          </string-name>
          .
          <article-title>Jigs and lures: Associating web queries with structured entities</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          .
          <article-title>From information to knowledge: harvesting entities and relationships from web sources</article-title>
          .
          <source>In Proceedings of PODS</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>