=Paper=
{{Paper
|id=Vol-1404/paper22
|storemode=property
|title=Entity Linking on Philosophical Documents
|pdfUrl=https://ceur-ws.org/Vol-1404/paper_22.pdf
|volume=Vol-1404
|dblpUrl=https://dblp.org/rec/conf/iir/CeccarelliFPSTT15
}}
==Entity Linking on Philosophical Documents==
Entity Linking on Philosophical Documents
Diego Ceccarelli1 , Alberto De Francesco1,2 , Raffaele Perego1 , Marco Segala3 ,
Nicola Tonellotto1 , and Salvatore Trani1,4
1
Istituto di Scienza e Tecnologie dell’Informazione - CNR, Pisa, Italy
{firstname.lastname}@isti.cnr.it
2
Istituto IMT Alti Studi di Lucca, Lucca, Italy
alberto.defrancesco@imtlucca.it
3
Dipartimento di Scienze Umane, Università dell’Aquila, Italy
marco.segala@univaq.it
4
Dipartimento di Informatica, Università di Pisa, Italy
trani@di.unipi.it
Abstract. Entity Linking consists in automatically enriching a docu-
ment by detecting the text fragments mentioning a given entity in an
external knowledge base, e.g., Wikipedia. This problem is a hot research
topic due to its impact in several text-understanding related tasks. How-
ever, its application to some specific, restricted topic domains has not
received much attention.
In this work we study how we can improve entity linking performance
by exploiting a domain-oriented knowledge base, obtained by filtering
out from Wikipedia the entities that are not relevant for the target do-
main. We focus on the philosophical domain, and we experiment a com-
bination of three different entity filtering approaches: one based on the
“Philosophy” category of Wikipedia, and two based on similarity metrics
between philosophical documents and the textual description of the enti-
ties in the knowledge base, namely cosine similarity and Kullback-Leibler
divergence. We apply traditional entity linking strategies to the domain-
oriented knowledge base obtained with these filtering techniques. Finally,
we use the resulting enriched documents to conduct a preliminary user
study with an expert in the area.
Keywords: Entity Linking, Entity Filtering, Information Search and
Retrieval, Document Enriching
1 Introduction
In the latest years, document enriching via Entity Linking (EL) has gained in-
creasing interests due to its impact in several text-understanding related tasks,
e.g., web search, document classification, etc [11, 12]. The EL problem has been
introduced in 2007 by Mihalcea and Csomai [8] and consists in linking short
fragments of text within a document to an entity listed in a given Knowledge
Base (KB). The authors also propose to consider each Wikipedia article as an
entity, and the title or the anchor text of the hyperlinks pointing to the article
as potential mentions to the entity.
2 Diego Ceccarelli, et al.
In Dresden→ [http://en.wikipedia.org/wiki/Dresden], Schopenhauer→
[http://en.wikipedia.org/wiki/Arthur Schopenhauer] became acquainted with the philosopher→
[http://en.wikipedia.org/wiki/Philosopher] and freemason→ [http://en.wikipedia.org/wiki/Freemasonry],
Karl Christian Friedrich Krause→ [http://en.wikipedia.org/wiki/Karl Christian Friedrich Krause]
Fig. 1: Example of annotated document
A typical EL system works in three steps: i) Spotting: the document is
processed in order to detect a set of potential mentions (also referred as surface
forms or spots), and for each mention a list of candidate entities is produced;
ii) Disambiguation: for each potential mention with more than one candidate
entity, a single entity is selected. This is done by trying to maximize the coherence
among the selected entities; iii) Filtering: only the most relevant annotations,
i.e., the mentions linked with some entity, are selected, filtering out the irrelevant
ones by using some measure of annotation confidence/importance. Due to the
ambiguity of natural languages, the EL task is not trivial. In fact the same
mention could refer to more than one entity (polysemy) and the same entity
could be referred by more than one mention (synonymy).
Let us introduce an example to describe how the EL process works. Fig-
ure 1 shows a semantically enriched document produced by an EL system.
In the reported text there are mentions (e.g., Dresden, Schopenhauer, or
freemason) linked to their semantic concept by using the URI or the identifier
in the KB, in our case Wikipedia. For example, the spot Dresden is linked
to http://en.wikipedia.org/wiki/Dresden. It is worth noting that the mention
Dresden could refer to many other meanings, as we can see by looking at the
corresponding Wikipedia disambiguation page5 .
Now we introduce some notations used thereinafter in the paper. A Knowl-
edge Base is a collection of entities, where each entity represents an artifact or
a concept in the real world. An entity e is described by the following attributes:
– a Uniform Resource Identifier: univocally identify the entity in KB (e.g.,
the url ”http://en.wikipedia.org/wiki/Dresden” identify the entity Dresden);
– a description: text describing what the entity represents, usually the con-
tent of its Wikipedia page (e.g., ”Dresden is the capital city of...” );
– a set of related entities that are connected to the given entity, usually
derived from Wikipedia links (e.g., Germany is linked by Dresden);
– a set of surface forms, the fragments of text used to refer the entity (e.g.,
“A. Schopenhauer” and “Arthur Schopenhauer” are both surface forms for
http://en.wikipedia.org/wiki/Arthur Schopenhauer);
– a set of categories, organized in a taxonomy, the entity belongs to (e.g.,
Schopenhauer belongs to the categories ”Idealists” and ”German atheists”).
The entity linking task consists in finding an annotation function fEL that,
given KB and a raw text document d, returns an enriched version of the docu-
ment de which includes also a list of annotations. Each annotation is described
by a tuple < start, end, text, entity >, where:
5
http://en.wikipedia.org/wiki/Dresden (disambiguation)
Entity Linking on Philosophical Documents 3
– start is the starting offset of the annotation in the document;
– end is the ending offset of the annotation in the document;
– text is the surface form of the entity detected in the document;
– entity is the URI of the entity detected in the document.
The research question behind this paper is the following: let us assume to have
a collection of documents about a particular topic to enrich, e.g., Philosophy.
Could we exploit the knowledge about the topic to improve the effectiveness
of the entity linking process? The solution we propose works a priori on the
Knowledge Base (KB) used for generating the EL model. The idea is to consider
only the entities relevant for a target topic t of the documents we are going to
annotate. These entities form a new domain-specific Knowledge Base, that in
the following we refer to Topical Knowledge Base.
To the best of our knowledge, we are the first to investigate how to perform
topical EL by prefiltering a general knowledge base. Mirylenka and Passerini [10],
and later Miao et al. [7], applied EL techniques on the domain of scientific
publications: in [10] authors propose a method of organizing the search results
into concise and informative topic hierarchies. They obtain the hierarchies by
annotating the entities in a document with Wikipedia Miner [9] – an open source
entity linking tool. STICS [4] is a system that enriches news with entities and uses
them for improve the browsing and provide entity analytics of what is happening
in the world. Finally, Ernst et al. [2] applied entity linking on health and life
sciences, through the KnowLife portal, a large KB automatically constructed
from Web sources.
2 The Knowledge Base Topic-Filtering Problem
Let fEL (KB, d) be an annotation function that, given a Knowledge Base KB
and a document d, produces an enriched version of the document de , and let
σ(fEL (KB, d)) be an effectiveness measure of the annotation function, i.e. a
common information retrieval quality metrics such as precision.
Given a collection of documents Dt related to a topic t, our objective is to
find a subset KBt of the knowledge base KB, such that:
∀d ∈ Dt , σ(fEL (KBt , d)) ≥ σ(fEL (KB, d))
|KBt | |KB|
The topical knowledge base KBt is obtained by filtering KB through a func-
tion φ(KB, t). Since KB is a collection of entities {e1 , e2 , · · · , en } and φ filter
each entity independently from the others, we can thus define:
[
KBt = φ(e, t)
e∈KB
Our claim is that such a function φ can improve the effectiveness of the entity
linking task for the topic t. In particular, we propose three filtering methods:
4 Diego Ceccarelli, et al.
Cosine Similarity Filter, Kullback-Leibler Divergence Similarity Filter, and Cat-
egory Filter. The first two approaches exploit the textual similarity6 between the
documents in Dt and the description of the entity in KB (averaging the result
with respect to the collection Dt ). The latter exploits the Wikipedia Category
Graph in order to detect how far are the categories the entity belongs to and
the root category of the topic being considered.
Cosine Similarity Filter
The cosine similarity filter measures the similarity between two vectors. Let
d be a document belonging to a collection of documents D related to a topic
t, edesc be the textual description of an entity (e.g., the text in its Wikipedia
(d) (e )
page), wki the weight associated with a term-document pair (ki , d), wkidesc
the weight associated with a term-entity pair (ki , edesc ). Then, in the textual
similarity context, the cosine similarity is defined as:
P (d) P (edesc )
wki wki
k V k V
sim(d, edesc ) = r iP × r iP (1)
(d) (e )
w 2 ki w2 kidesc
ki V ki V
where V = {k1 , . . . , kn } is the vocabulary of the terms, n is the number of
distinct terms in the document collection and ki be a generic term. The weights
(d) (e )
wki and wkidesc are computed with the tf-idf [6] formula as in the following,
using the inverse document frequency of the term in Wikipedia (idfw )
(d)
wki = tf (d) (ki ) × idfw (ki ) (2)
(e )
wkidesc = tf (edesc ) (ki ) × idfw (ki ) (3)
The cosine similarity ranges in [0, 1], with the maximum similarity reached at 1.
Kullback-Leibler Divergence Filter
The Kullback-Leibler Divergence (KLD) Filter measures the relative entropy
of two different probability distributions associated to the same event space. Let
(d)
d and edesc be defined as in Cosine Similarity Filter, Pki be the probability of a
(e )
term ki in a document, and Pki desc be the probability of a term ki in an entity
description, the Kullback-Leibler divergence is formulated in [5] as follows:
(e )
X n (e )
desc
Pki desc o
Pk i × log (d)
(4)
ki V Pki
where V is the vocabulary V = {k1 , . . . , kn } representing the set of all distinct
(d) (e )
index terms in the collection of documents, and the Pki and Pki desc probabilities
6
In Information Retrieval, the text similarity between document-query pair, is a score
aiming to provide a degree of similarity of a document with respect to an user
information need.
Entity Linking on Philosophical Documents 5
respectively defined as in the following:
(d)
(d) w
Pki = P ki (d) (5)
wki
ki V
(e )
(e ) wkidesc
Pki desc = P (edesc ) (6)
wki
ki V
(d) (e )
where wki and wkidesc are respectively computed as in Equations (2,3). For the
sake of simplicity in this paper we are using the original formulation of the KLD,
which is not symmetric (i.e., KLD(d, edesc ) 6= KLD(edesc , d)). A more reliable
implementation could be the symmetrised or the Jensen-Shannon divergence
because they consider also the similarity between the textual description of the
entity and the document.
Category Filter
The Category Filter takes advantages of the Wikipedia category graph and of
the list of categories each entity belongs to. This information is used to com-
pute the shortest path (and so the minimum distance) of an entity from a set of
highly relevant category node (the root of the visit) for the topic being consid-
ered (e.g., Philosophy 7 ). Each Wikipedia article can appear in more than one
category, and each category can appear in more than one parent category. Multi-
ple categorization schemes co-exist simultaneously. In other words, categories do
not form a strict hierarchy or tree structure, but a more general directed acyclic
graph (DAG).
In particular, given G = (C, E) be the Wikipedia category graph with C the
category nodes and E the direct connection between the categories, let us define
(t) (t) (t)
C (t) = {c1 , c2 , . . . , cm } be the set of highly relevant categories relative to the
topic t, with C (t) ⊂ C. The minimum distance of each wikipedia entity from
the categories in C (t) can so be computed by exploiting a breadth-first search
(BFS ) visit of the graph G, starting from the nodes in C (t) . Let us define such
a method with the function ϕ.
(t)
ϕ(ci ) = {BF S(G, C (t) )} ∀i ∈ 1, . . . , n (7)
where n is equal to |C|.
3 Experiments
In the following we introduce the philosophical document adopted as a reference
document for the topic t = P hilosophy . This document is used both for filtering
7
http://en.wikipedia.org/wiki/Category:Philosophy
6 Diego Ceccarelli, et al.
Philosophy→ [http://en.wikipedia.org/wiki/Category:Philosophy]
⇓
Value→ [http://en.wikipedia.org/wiki/Category:Value]
⇓
Valuation (finance)→ [http://en.wikipedia.org/wiki/Category:Valuation (finance)]
Fig. 2: Odd path in the category graph of wikipedia
by textual similarity and for performing the user study described in the Section 4.
Then we describe the methodology adopted to build the filtered KB and the
differences in the EL annotations obtained by using the traditional KB and the
filtered one.
3.1 Reference document
We adopt a philosophical text written by the philosopher Ludwig Wittgenstein
during the middle of the last century as the reference document for the topic
t being considered. The title of the book is On Certainty and it is a collec-
tion of aphorisms discussing the relation between knowledge and certainty. The
book is composed by 676 paragraphs, with an average of 243 characters and 46
terms per paragraph. Since each paragraph is long enough and contains several
philosophical notions, we consider it as an independent document d of Dt .
3.2 Filtering methods
In order to evaluate the impact of the proposed filtering on the KB and to
gain some insights on the thresholds to adopt, we perform a study to measure
the frequencies of the entities that pass each filtering strategy in isolation, by
considering different values for the thresholds. Given the problem formalization
described in Section 2, we apply each filtering method to each entity in KB,
computing a score that expresses how far/similar is the entity from the topic
t. We compute the textual similarity by applying the Cosine Similarity and the
Kullback-Leibler Distance between the reference document described above and
the description of the entity in KB, i.e., the content of the Wikipedia article. The
Category Filter computes the minimum depth of all the categories each entity
belongs to. The category graph as well as the categories related to an entity are
taken from the KB.
Figure 3 reports the application of the Cosine Similarity (Figure 3a) and
Kullback-Leibler Divergence (Figure 3b) filters to the KB. The former obtains
maximum similarity when the scores is 1, while the latter when the score is 0. The
two figures show that the cosine similarity is more spread out along the X axis
(the confidence score thresholds) than KLD, which is indeed very thin-tailed.
Finally, Figure 4 shows the category filter application, with the depth distri-
bution of the categories in the category graph given the root node Philosophy
(Figure 4a) and the distribution of the entities according to the minimum depth
Entity Linking on Philosophical Documents 7
200K 300K
frequency frequency
160K 240K
120K 180K
distribution
distribution
80K 120K
40K 60K
0 0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
cosine similarity KLD similarity
(a) Cosine similarity distribution (b) Kullback-Leibler distribution
Fig. 3: Distribution using the textual similarity filtering
of the categories each entity belongs to (Figure 4b). The outcome of this figure
is quite surprising: the category graph (which is a DAG according to Wikipedia)
seems to interconnect categories that are not strictly related each other. As a
matter of fact, take a look at the odd path in Figure 2, where the category
Valuation (Finance) is reached by traversing only two descendant link starting
from the Philosophy category node. This evidence clearly explains also why so
many categories (≈ 750k) can be reached by descendant traversing the category
graph from the Philosophy node, with at most 10-11 steps.
3.3 Filtering approach adopted
According to the Wikipedia Philosophical Portal 8 , there are about ≈ 15k philo-
sophical articles on the total of ≈ 4.35M articles. Our intuition is that by re-
stricting ourself to this topic specific subset of articles may not be sufficient from
an EL perspective. Indeed by considering also articles very related to the topic
could results in an improvements of the EL effectiveness, due to a disambigua-
tion phase which make use also of entities only marginally related with the topic
t. Thus, our suggestion is to select 2 times the number of entities wikipedia says
to belong to the philosophical portal (i.e., our goal is to select ≈ 30k entities).
Obviously this approach is reasonable only because we are investigating a new,
unexplorated research direction. A more general way of solving the problem
would lead to not decide a-priori the number of entitites to select, but it should
depends on the domain and on how the domain is covered in Wikipedia. The
same reasoning would lead to choose the thresholds for the textual similarity
strategies depending only from domain-specificity of the entities.
Hence, for filtering out entities not related to the topic t, we adopt the fol-
lowing approach:
1. We make use of the category filter to select entities belonging to the topic
and to build the base KB. This choice is motivated by the high accuracy we
expect from the category graph in selecting entities relevant for the topic,
8
http://en.wikipedia.org/wiki/Portal:Philosophy
8 Diego Ceccarelli, et al.
250K 1.2M
frequency frequency
200K 1M
800K
150K
distribution
distribution
600K
100K
400K
50K 200K
0K 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 inf
min depth min depth
(a) Category distribution (b) Entity Distribution
Fig. 4: Distributions using the category filtering
because the categories are manually assigned (and validated) by the user
to the entities. In order to maximize the probability of selecting only topic
related entities, we exploit a very aggressive threshold value of 3 (i.e., 3 is the
minimum distance between at least one of the category the entity belongs
to and the root category of the topic, which is in our case the Philosophical
category). By using this threshold the filter selects ≈ 28k entities.
2. We expand the KB by considering also textual similarities with the reference
document described in Section 3.1. The idea here is that some entities in
wikipedia could be misclassified (i.e., their categories could not reflect the
real topic of the entity or the entity could miss of some relevant categories).
In order to apply such an expansion, we adopt the two textual filtering
approaches described in Section 3.3 by combining them together, i.e., only
entities that pass both the filters are added to the base KB. Since our target
is to select ≈ 30k entities, and we have 28k from the Categories, we add the
missing 2k considering the intersection between the subset of entities with the
highest cosine and similarity and the subset of entities with the highest KLD
similarity. We select the subsets finding two thresholds so that the sets have
approximately the same size. We then filter the cosine similarity (respectively
the Kullback-Leibler divergence) with a threshold of 0.35 (0.125) which let
≈ 52k (≈ 79k) entities pass the filter.
The resulting KB is made up of ≈ 30k entities, primarily selected by inves-
tigating the category graph and expanded with highly similar (from a language
point of view) side entities. The size reduction compared to the full KB is ≈ 99%.
3.4 Entity Linking differences
Traditional entity linking strategies are applied to the reference document in
order to evaluate how the annotation process is affected by the domain-oriented
knowledge base obtained with the proposed filtering approach. We used the EL
Dexter framework [1] to annotate the philosophical document with both the
traditional KB and by plugging into it the domain oriented KB. Each paragraph
was annotated independently from the others, so that we are able to investigate
Entity Linking on Philosophical Documents 9
6K 1400
cumulative spots
4K
Full KB
1200 Filtered KB
2K
0 1000
0 5 10 15 20 25 30
140 800
Full KB
spots
120
100 Filtered KB 600
paragraphs
80
400
60
40 200
20
0 0
0 5 10 15 20 25 30 0 5 10 15 20
spots / paragraph ambiguity
(a) Spots distribution (b) Ambiguity distribution
Fig. 5: EL annotation differences using the full KB and the filtered one
the difference in the annotation process by looking at two important factors: how
many spots per paragraph the system is able to annotate and how much each
spot is ambiguous before performing the disambiguation phase, i.e., how many
entities on average are candidated for each spot. In Figure 7 we report these
factors by comparing the two KB solutions. As expected, Figure 5a shows that
the distributions of spots per paragraph is very different: on average by using the
filtered KB, the EL system annotates less spots per paragraph than using the full
KB, and this evidence is very clear if we look at the number of cumulative spots
annotated. Indeed by using the full KB, the EL system annotates approximately
3 times more spots than using the filtered one. If we look at the distributions
of the ambiguity per spot, another important aspect arises: on average the EL
system which uses the filtered KB selects far less entities as spots, with a 50%
probability to select only one candidate per spot and 25% probability to select
two candidates for a spot. The latter evidence is really important because the
disambiguation phase is simpler when the ambiguity is low (and it is worth to
notice that 50% of the spots do not need a disambiguation at all due to a single
candidate per spot selected).
4 User study
For assessing the quality of the linking performed using the filtered KB, we set
up a user study experiment. We selected the first 110 paragraphs from the On
Certainty book, and we annotated each paragraph using a model generated from
the filtered KB. Each paragraph was annotated using a dictionary generated from
the Wikipedia anchors, and, in case of ambiguous spots, we disambiguated using
the TAGME disambiguation algorithm [3].
On average we annotated 4.54 entities per paragraph (500 in total). We
designed a simple web application that allows a user to evaluate the annotations.
The application allows to browse the paragraphs in two different ways:
Document based browsing the user can visualize a paragraph and move to
the previous, or the next. Annotated spots are highlighted and if the user
clicks on a spot, a description of the annotated entity pops out;
10 Diego Ceccarelli, et al.
0.7 70
Negative
0.6 Positive 60
0.5 50
cumulative distribution
distribution
0.4 40
0.3 30
0.2 20
0.1 10
0.0 0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
annotation confidence annotation confidence
(a) Assessment distribution by confidence (b) Annotation confidence distribution
Fig. 6: EL assessment distributions using filtered KB
Entity based browsing If the user clicks on the name of the entity, then
an entity based view is presented: this page presents a description of the
entity, and then a list of paragraphs where the entity is mentioned. For each
paragraph the user can visualize the spot that was annotated with the entity,
and the text around the annotation.
The user can mark an annotation as good or bad, simply by clicking on it. One
click means good and the annotation is highlighted in green, while an additional
click means bad and the annotation is highlighted in red.
We asked an expert human annotator in the field to judge the annotations.
Subsequently we performed the same evaluation, but the EL was performed us-
ing the full KB. We obtained on average 9.72 annotations per paragraph (1070
in total). This high number is due to the fact that we did not pre-filter the an-
notations in any way. To speed up the expert assessment, we decided to remove
the annotations with a confidence (i.e., a score assigned by the EL system which
express the certainty of the annotation) lower than 20%. This threshold is abso-
lutely reasonable since usually TAGME discards annotations with a confidence
lower than 50%. The number of annotations per document decreased from 9.72 to
2.82 (globally from 1070 to 311). Moreover, we automatically copied the assess-
ments relative to the same annotations (i.e., spots occurring in the same place
of the text linked to the same entity) from the previous judgment performed by
the expert annotator. This avoided the evaluation of 113 (36%) annotations.
Figure 6b shows the distribution of the annotations by their confidence score.
We can observe that a traditional EL system encounters some problem on work-
ing with a filtered KB. Indeed the annotation confidence is on average quite low,
thus suggesting that the mutual reinforcement of the entities in the disambigua-
tion phase has still a lot of space for improvement when working on a topic based
KB. It is worth to notice that the disambiguation strategy adopted (TAGME)
would have discarded the majority of the annotations by applying its threshold
value (50%) on the annotation confidence.
Figure 6a on the other hand illustrates the distribution of the two assessment
classes by the confidence score of the relative annotations. We can identify a clear
correlation between the positive class and the confidence score (higher is better).
Entity Linking on Philosophical Documents 11
1.0 500 1.0 350
0.8 400 0.8 280
0.6 300 0.6 210
precision
precision
support
support
0.4 200 0.4 140
0.2 100 0.2 70
Precision Precision
Support Support
0.0 0 0.0 0
0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
confidence confidence
(a) Filtered KB (b) Full KB
Fig. 7: EL annotation effectiveness
This can be explained by the fact that high confidence scores are assigned to
entities strongly related with other entities in the annotated document, thus
resulting in a more precise annotation which is less prone to errors.
Finally, Figure 7 depicts the annotation effectiveness of the two EL systems,
the first using the full KB and the second using the filtered one. The effectiveness
is evaluated at different values of confidence in order to study the best threshold
to adopt for filtering out bad annotations and maximizing the effectiveness of
the annotation process. In the figures we show the support, i.e., the number of
assessments with a confidence higher or equal to the threshold adopted, and the
precision, i.e., the fraction of positive assessments over the sum of positive and
negative assessments. Figure 7a clearly depicts that the precision starts from a
value of ≈ 0.7 (obtained without using any threshold on the confidence, thus
having the maximum support) to a maximum of ≈ 0.85 (using a threshold of
0.3, corresponding to a support of ≈ 430). Further values slowly decrease in
precision, but – more importantly – decrease in support. A very low support
means very few annotations, and we should avoid such compromise.
Figure 7b studies the behavior of the EL system that makes use of the full
KB. Here the precision ranges from a value of ≈ 0.65, starting from a confidence
value of 0.2 (remind that annotations with a confidence below this threshold were
discarded) up to a maximum of ≈ 0.85 obtained with a confidence value of 0.57
(with a support of 12). The latter results clearly depict how using only precision
is not enough to measure the effectiveness of a system: in fact, the support
value of 15 means that only 5% of the annotations are considered, resulting in
a very poor document enriching (i.e., 0.14 average annotations per paragraph).
By comparing the behavior of the two systems it is evident that the strategy
to build a topical KB is clearly a key idea for performing EL on a set of topic
specific documents. This is supported by the fact that we obtained a consistent
improvement in terms of precision without penalizing too much the support.
5 Conclusions
In this paper we study how to apply entity linking on a collection of documents
about a particular topic. Our thesis is that pre-filtering a general knowledge base
12 Diego Ceccarelli, et al.
keeping only the entities that are relevant for the topic, and then building the
entity linking model only from these entities could improve the annotation per-
formance. We propose three strategies for filtering the knowledge base, two based
on textual similarity between the topic and the entities (i.e., cosine similarity and
KL-divergence) and one based on the Wikipedia categories (i.e., considering only
the categories that belong to the selected topic). We perform some preliminary
experiments on the topic Philosophy, combining the three methods and perform-
ing the linking on the resulting filtered knowledge base. Finally, in a user-study
performed by an expert in the area, we compare the annotation performance on
a philosophic document collection using a traditional knowledge base and the
one filtered by our approach. The results confirm that the proposed technique is
a promising idea for performing EL on a set of topic specific documents.
References
1. D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Dexter: an open
source framework for entity linking. In Proceedings of the Sixth International Work-
shop on Exploiting Semantic Annotations in Information Retrieval (ESAIR), 2013.
2. P. Ernst, C. Meng, A. Siu, and G. Weikum. Knowlife: A knowledge graph for health
and life sciences. In Data Engineering (ICDE), 2014 IEEE 30th International
Conference on, pages 1254–1257. IEEE, 2014.
3. P. Ferragina and U. Scaiella. Tagme: on-the-fly annotation of short text fragments
(by wikipedia entities). In Proceedings of CIKM, 2010.
4. J. Hoffart, D. Milchevski, and G. Weikum. Stics: searching with strings, things, and
cats. In Proceedings of the 37th international ACM SIGIR conference on Research
& development in information retrieval, pages 1247–1248. ACM, 2014.
5. S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist.,
22(1):79–86, 03 1951.
6. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval,
volume 1. Cambridge university press Cambridge, 2009.
7. Q. Miao, Y. Meng, L. Fang, F. Nishino, and N. Igata. Link scientific publica-
tions using linked data. In Semantic Computing (ICSC), 2015 IEEE International
Conference on, pages 268–271. IEEE, 2015.
8. R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge.
In Proceedings of the sixteenth ACM conference on Conference on information and
knowledge management, pages 233–242. ACM, 2007.
9. D. Milne and I. H. Witten. An open-source toolkit for mining wikipedia. Artificial
Intelligence, 194:222–239, 2013.
10. D. Mirylenka and A. Passerini. Navigating the topical structure of academic search
results via the wikipedia category network. In Proceedings of the 22nd ACM in-
ternational conference on Conference on information & knowledge management,
pages 891–896. ACM, 2013.
11. P. Pantel and A. Fuxman. Jigs and lures: Associating web queries with struc-
tured entities. In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies, 2011.
12. G. Weikum and M. Theobald. From information to knowledge: harvesting entities
and relationships from web sources. In Proceedings of PODS, 2010.