<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Linking on Philosophical Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Ceccarelli</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto De Francesco</string-name>
          <email>alberto.defrancesco@imtlucca.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele Perego</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Segala</string-name>
          <email>marco.segala@univaq.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore Trani</string-name>
          <email>trani@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Scienze Umane</institution>
          ,
          <addr-line>Università dell'Aquila</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Istituto IMT Alti Studi di Lucca</institution>
          ,
          <addr-line>Lucca</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione - CNR</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity Linking consists in automatically enriching a document by detecting the text fragments mentioning a given entity in an external knowledge base, e.g., Wikipedia. This problem is a hot research topic due to its impact on several text-understanding related tasks. However, its application to some specific, restricted topic domains has not received much attention. In this work we study how we can improve entity linking performance by exploiting a domain-oriented knowledge base, obtained by filtering out from Wikipedia the entities that are not relevant for the target domain. We focus on the philosophical domain, and we experiment with a combination of three different entity filtering approaches: one based on the "Philosophy" category of Wikipedia, and two based on similarity metrics between philosophical documents and the textual description of the entities in the knowledge base, namely cosine similarity and Kullback-Leibler divergence. We apply traditional entity linking strategies to the domain-oriented knowledge base obtained with these filtering techniques. Finally, we use the resulting enriched documents to conduct a preliminary user study with an expert in the area.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Linking</kwd>
        <kwd>Entity Filtering</kwd>
        <kwd>Information Search and Retrieval</kwd>
        <kwd>Document Enriching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, document enriching via Entity Linking (EL) has gained increasing interest due to its impact on several text-understanding related tasks, e.g., web search, document classification, etc. [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. The EL problem was introduced in 2007 by Mihalcea and Csomai [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and consists in linking short fragments of text within a document to an entity listed in a given Knowledge Base (KB). The authors also propose to consider each Wikipedia article as an entity, and the title or the anchor text of the hyperlinks pointing to the article as potential mentions of the entity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Figure 1: Example of a document semantically enriched by an EL system</title>
      <p>In Dresden [http://en.wikipedia.org/wiki/Dresden], Schopenhauer [http://en.wikipedia.org/wiki/Arthur_Schopenhauer] became acquainted with the philosopher [http://en.wikipedia.org/wiki/Philosopher] and freemason [http://en.wikipedia.org/wiki/Freemasonry], Karl Christian Friedrich Krause [http://en.wikipedia.org/wiki/Karl_Christian_Friedrich_Krause].</p>
    </sec>
    <sec id="sec-3">
      <title>Entity Linking</title>
      <p>A typical EL system works in three steps: i) Spotting: the document is processed in order to detect a set of potential mentions (also referred to as surface forms or spots), and for each mention a list of candidate entities is produced; ii) Disambiguation: for each potential mention with more than one candidate entity, a single entity is selected. This is done by trying to maximize the coherence among the selected entities; iii) Filtering: only the most relevant annotations, i.e., the mentions linked with some entity, are selected, filtering out the irrelevant ones by using some measure of annotation confidence/importance. Due to the ambiguity of natural languages, the EL task is not trivial. In fact, the same mention could refer to more than one entity (polysemy) and the same entity could be referred to by more than one mention (synonymy).</p>
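      <p>The three steps above can be sketched in code. The following is a minimal, illustrative Python sketch, not the actual implementation of any EL system; the kb object, with its surface_forms dictionary and related map, is a hypothetical structure assumed only for this example.</p>

```python
def entity_link(text, kb, top_k=10):
    """Illustrative sketch of the three EL steps: spotting, disambiguation, filtering."""
    # 1) Spotting: match surface forms from the KB dictionary against the text,
    #    producing (start, end, mention, candidate entities) tuples.
    candidates = []
    for mention, entities in kb.surface_forms.items():
        start = text.lower().find(mention.lower())
        if start >= 0:
            candidates.append((start, start + len(mention), mention, entities))
    # 2) Disambiguation: keep one entity per mention, preferring the candidate
    #    most coherent with the other candidate entities in the document.
    all_entities = {e for _, _, _, ents in candidates for e in ents}
    annotations = []
    for start, end, mention, ents in candidates:
        best = max(ents, key=lambda e: len(kb.related[e] & all_entities))
        score = len(kb.related[best] & all_entities)
        annotations.append((start, end, mention, best, score))
    # 3) Filtering: keep only the most confident annotations.
    annotations.sort(key=lambda a: a[4], reverse=True)
    return [a[:4] for a in annotations[:top_k]]
```

      <p>Real systems use far richer spotting dictionaries and coherence measures, but the same three-phase shape applies.</p>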
      <p>Let us introduce an example to describe how the EL process works. Figure 1 shows a semantically enriched document produced by an EL system. In the reported text there are mentions (e.g., Dresden, Schopenhauer, or freemason) linked to their semantic concept by using the URI or the identifier in the KB, in our case Wikipedia. For example, the spot Dresden is linked to http://en.wikipedia.org/wiki/Dresden. It is worth noting that the mention Dresden could refer to many other meanings, as we can see by looking at the corresponding Wikipedia disambiguation page (see footnote 5).</p>
      <p>Now we introduce some notation used hereinafter in the paper. A Knowledge Base is a collection of entities, where each entity represents an artifact or a concept in the real world. An entity e is described by the following attributes:
- a Uniform Resource Identifier, which uniquely identifies the entity in the KB (e.g., the URL "http://en.wikipedia.org/wiki/Dresden" identifies the entity Dresden);
- a description: text describing what the entity represents, usually the content of its Wikipedia page (e.g., "Dresden is the capital city of...");
- a set of related entities that are connected to the given entity, usually derived from Wikipedia links (e.g., Germany is linked by Dresden);
- a set of surface forms, the fragments of text used to refer to the entity (e.g., "A. Schopenhauer" and "Arthur Schopenhauer" are both surface forms for http://en.wikipedia.org/wiki/Arthur_Schopenhauer);
- a set of categories, organized in a taxonomy, the entity belongs to (e.g., Schopenhauer belongs to the categories "Idealists" and "German atheists").</p>
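      <p>For illustration only, the entity attributes listed above can be modeled as a small Python structure (a sketch, not part of any specific EL framework):</p>

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A KB entity with the attributes described above."""
    uri: str                                          # uniquely identifies the entity in the KB
    description: str                                  # text of its Wikipedia page
    related: set = field(default_factory=set)         # URIs of connected entities
    surface_forms: set = field(default_factory=set)   # mentions used to refer to it
    categories: set = field(default_factory=set)      # Wikipedia categories it belongs to

schopenhauer = Entity(
    uri="http://en.wikipedia.org/wiki/Arthur_Schopenhauer",
    description="Arthur Schopenhauer was a German philosopher...",
    surface_forms={"A. Schopenhauer", "Arthur Schopenhauer"},
    categories={"Idealists", "German atheists"},
)
```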
      <p>The entity linking task consists in finding an annotation function f_EL that, given a KB and a raw text document d, returns an enriched version of the document d_e which also includes a list of annotations. Each annotation is described by a tuple &lt;start, end, text, entity&gt;, where:</p>
      <sec id="sec-3-1">
        <title>5 http://en.wikipedia.org/wiki/Dresden_(disambiguation)</title>
        <p>- start is the starting offset of the annotation in the document;
- end is the ending offset of the annotation in the document;
- text is the surface form of the entity detected in the document;
- entity is the URI of the entity detected in the document.</p>
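        <p>The annotation tuple can likewise be sketched as a named tuple whose offsets index directly into the raw document (illustrative only):</p>

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    """One annotation of the enriched document d_e: <start, end, text, entity>."""
    start: int    # starting offset of the annotation in the document
    end: int      # ending offset of the annotation in the document
    text: str     # surface form detected in the document
    entity: str   # URI of the linked entity

doc = "In Dresden, Schopenhauer became acquainted with the philosopher"
ann = Annotation(3, 10, "Dresden", "http://en.wikipedia.org/wiki/Dresden")
assert doc[ann.start:ann.end] == ann.text
```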
        <p>The research question behind this paper is the following: let us assume we have a collection of documents about a particular topic to enrich, e.g., Philosophy. Could we exploit the knowledge about the topic to improve the effectiveness of the entity linking process? The solution we propose works a priori on the Knowledge Base (KB) used for generating the EL model. The idea is to consider only the entities relevant for a target topic t of the documents we are going to annotate. These entities form a new domain-specific Knowledge Base, which in the following we refer to as the Topical Knowledge Base.</p>
        <p>
          To the best of our knowledge, we are the first to investigate how to perform topical EL by pre-filtering a general knowledge base. Mirylenka and Passerini [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and later Miao et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], applied EL techniques to the domain of scientific publications: in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] the authors propose a method of organizing search results into concise and informative topic hierarchies. They obtain the hierarchies by annotating the entities in a document with Wikipedia Miner [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], an open source entity linking tool. STICS [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a system that enriches news with entities and uses them to improve browsing and provide entity analytics of what is happening in the world. Finally, Ernst et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] applied entity linking to health and life sciences, through the KnowLife portal, a large KB automatically constructed from Web sources.
        </p>
        <sec id="sec-3-1-1">
          <title>The Knowledge Base Topic-Filtering Problem</title>
          <p>Let f_EL(KB, d) be an annotation function that, given a Knowledge Base KB and a document d, produces an enriched version of the document d_e, and let μ(f_EL(KB, d)) be an effectiveness measure of the annotation function, i.e., a common information retrieval quality metric such as precision.</p>
          <p>Given a collection of documents D_t related to a topic t, our objective is to find a subset KB_t of the knowledge base KB such that, for every d in D_t, μ(f_EL(KB_t, d)) ≥ μ(f_EL(KB, d)), with |KB_t| much smaller than |KB|.</p>
          <p>The topical knowledge base KB_t is obtained by filtering KB through a function φ(KB, t). Since KB is a collection of entities {e_1, e_2, …, e_n} and we filter each entity independently from the others, we can thus define:</p>
          <p>KB_t = ∪_{e ∈ KB} φ(e, t)</p>
          <p>Our claim is that such a function can improve the effectiveness of the entity linking task for the topic t. In particular, we propose three filtering methods:</p>
          <p>Cosine Similarity Filter, Kullback-Leibler Divergence Similarity Filter, and Category Filter. The first two approaches exploit the textual similarity between the documents in D_t and the description of the entity in KB (averaging the result with respect to the collection D_t). The latter exploits the Wikipedia Category Graph in order to detect how far the categories the entity belongs to are from the root category of the topic being considered.</p>
          <p>Cosine Similarity Filter</p>
          <p>
            The cosine similarity filter measures the similarity between two vectors. In Information Retrieval, the textual similarity between a document-query pair is a score aiming to provide a degree of similarity of a document with respect to a user information need. Let d be a document belonging to a collection of documents D related to a topic t, e_desc be the textual description of an entity (e.g., the text in its Wikipedia page), w_ki(d) the weight associated with the term-document pair (k_i, d), and w_ki(e_desc) the weight associated with the term-entity pair (k_i, e_desc). Then, in the textual similarity context, the cosine similarity is defined as:

cos(d, e_desc) = Σ_{k_i ∈ V} w_ki(d) · w_ki(e_desc) / ( √(Σ_{k_i ∈ V} w_ki(d)²) · √(Σ_{k_i ∈ V} w_ki(e_desc)²) )   (1)

where V = {k_1, …, k_n} is the vocabulary of terms, n is the number of distinct terms in the document collection, and k_i is a generic term. The weights w_ki(d) and w_ki(e_desc) are computed with the tf-idf [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] formula as follows, using the inverse document frequency of the term in Wikipedia (idf_w):

w_ki(d) = tf_d(k_i) · idf_w(k_i)   (2)
w_ki(e_desc) = tf_{e_desc}(k_i) · idf_w(k_i)   (3)

The cosine similarity ranges in [0, 1], with the maximum similarity reached at 1.
          </p>
          <p>Kullback-Leibler Divergence Filter</p>
          <p>
            The Kullback-Leibler Divergence (KLD) filter measures the relative entropy of two different probability distributions associated to the same event space. Let d and e_desc be defined as in the Cosine Similarity Filter, P_ki(d) be the probability of a term k_i in the document, and P_ki(e_desc) be the probability of a term k_i in the entity description. The Kullback-Leibler divergence is formulated in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] as follows:

KLD(d, e_desc) = Σ_{k_i ∈ V} P_ki(d) · log( P_ki(d) / P_ki(e_desc) )   (4)

where P_ki(d) and P_ki(e_desc) are respectively defined as:
          </p>
          <p>P_ki(d) = w_ki(d) / Σ_{k_j ∈ V} w_kj(d)   (5)</p>
          <p>P_ki(e_desc) = w_ki(e_desc) / Σ_{k_j ∈ V} w_kj(e_desc)   (6)</p>
          <p>where w_ki(d) and w_ki(e_desc) are respectively computed as in Equations (2, 3). For the sake of simplicity, in this paper we use the original formulation of the KLD, which is not symmetric (i.e., KLD(d, e_desc) ≠ KLD(e_desc, d)). A more reliable implementation could be the symmetrised version or the Jensen-Shannon divergence, because they also consider the similarity between the textual description of the entity and the document.</p>
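          <p>Both textual filters can be sketched over bag-of-words weight vectors as follows. This is a minimal illustration of Equations (1)-(6); the epsilon guarding against terms absent from the entity description is an assumption of this sketch, not part of the original formulation.</p>

```python
import math
from collections import Counter

def tfidf(text, idf):
    """tf-idf weights of Equations (2, 3): term frequency times Wikipedia idf."""
    tf = Counter(text.lower().split())
    return {k: tf[k] * idf.get(k, 0.0) for k in tf}

def normalize(w):
    """Term probabilities of Equations (5, 6): weights divided by their sum."""
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

def cosine(wd, we):
    """Cosine similarity of Equation (1); 1 means maximum similarity."""
    dot = sum(v * we.get(k, 0.0) for k, v in wd.items())
    nd = math.sqrt(sum(v * v for v in wd.values()))
    ne = math.sqrt(sum(v * v for v in we.values()))
    return dot / (nd * ne) if nd and ne else 0.0

def kld(wd, we, eps=1e-9):
    """Kullback-Leibler divergence of Equation (4); lower means more similar."""
    pd, pe = normalize(wd), normalize(we)
    return sum(p * math.log(p / pe.get(k, eps)) for k, p in pd.items() if p)
```

          <p>With these scores, an entity passes the textual filters when its cosine similarity with the reference documents is high and its KLD from them is low.</p>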
          <p>Category Filter</p>
          <p>The Category Filter takes advantage of the Wikipedia category graph and of the list of categories each entity belongs to. This information is used to compute the shortest path (and so the minimum distance) of an entity from a set of highly relevant category nodes (the roots of the visit) for the topic being considered (e.g., Philosophy, see footnote 7). Each Wikipedia article can appear in more than one category, and each category can appear in more than one parent category. Multiple categorization schemes co-exist simultaneously. In other words, categories do not form a strict hierarchy or tree structure, but a more general directed acyclic graph (DAG).</p>
          <p>In particular, let G = (C, E) be the Wikipedia category graph, with C the category nodes and E the direct connections between the categories, and let us define C^(t) = {c_1^(t), c_2^(t), …, c_m^(t)} ⊆ C as the set of highly relevant categories relative to the topic t. The minimum distance of each Wikipedia entity from the categories in C^(t) can thus be computed by exploiting a breadth-first search (BFS) visit of the graph G, starting from the nodes in C^(t). Let us denote such a method with the function φ:</p>
          <p>φ(c_i^(t)) = {BFS(G, C^(t))}   ∀i ∈ {1, …, n}   (7)</p>
          <p>where n is equal to |C|.</p>
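          <p>The BFS computation of the minimum category depths can be sketched as follows; the plain adjacency-list representation of the graph is an assumption of this sketch (in practice the category graph comes from the Wikipedia dump):</p>

```python
from collections import deque

def min_category_depth(graph, roots):
    """BFS over the category DAG from the topic's root categories; returns,
    for each reachable category, its minimum distance from the roots (Eq. 7)."""
    depth = {c: 0 for c in roots}
    queue = deque(roots)
    while queue:
        c = queue.popleft()
        for child in graph.get(c, ()):   # follow descendant links only
            if child not in depth:       # first visit in BFS = shortest distance
                depth[child] = depth[c] + 1
                queue.append(child)
    return depth

def entity_min_depth(entity_categories, depth):
    """Minimum depth over all the categories the entity belongs to."""
    return min((depth[c] for c in entity_categories if c in depth), default=None)
```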
        </sec>
        <sec id="sec-3-1-2">
          <title>Experiments</title>
          <p>In the following we introduce the philosophical document adopted as a reference document for the topic t = Philosophy. This document is used both for filtering</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>7 http://en.wikipedia.org/wiki/Category:Philosophy</title>
      </sec>
      <sec id="sec-3-3">
        <title>Philosophy! [http://en.wikipedia.org/wiki/Category:Philosophy]</title>
        <p>+</p>
      </sec>
      <sec id="sec-3-4">
        <title>Value! [http://en.wikipedia.org/wiki/Category:Value]</title>
        <p>by textual similarity and for performing the user study described in Section 4. Then we describe the methodology adopted to build the filtered KB and the differences in the EL annotations obtained by using the traditional KB and the filtered one.</p>
        <p>3.1 Reference document</p>
        <p>We adopt a philosophical text written by the philosopher Ludwig Wittgenstein around the middle of the last century as the reference document for the topic t being considered. The title of the book is On Certainty and it is a collection of aphorisms discussing the relation between knowledge and certainty. The book is composed of 676 paragraphs, with an average of 243 characters and 46 terms per paragraph. Since each paragraph is long enough and contains several philosophical notions, we consider it as an independent document d of D_t.</p>
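        <p>Treating each paragraph as an independent document of D_t, collection statistics of this kind can be computed with a few lines (an illustrative sketch, using whitespace splitting as the term tokenizer):</p>

```python
def paragraph_stats(paragraphs):
    """Number of paragraphs and average characters/terms per paragraph."""
    n = len(paragraphs)
    avg_chars = sum(len(p) for p in paragraphs) / n
    avg_terms = sum(len(p.split()) for p in paragraphs) / n
    return n, avg_chars, avg_terms
```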
        <p>3.2 Filtering methods</p>
        <p>In order to evaluate the impact of the proposed filtering on the KB and to gain some insights on the thresholds to adopt, we perform a study to measure the frequencies of the entities that pass each filtering strategy in isolation, by considering different values for the thresholds. Given the problem formalization described in Section 2, we apply each filtering method to each entity in the KB, computing a score that expresses how close the entity is to the topic t. We compute the textual similarity by applying the Cosine Similarity and the Kullback-Leibler Divergence between the reference document described above and the description of the entity in the KB, i.e., the content of the Wikipedia article. The Category Filter computes the minimum depth over all the categories each entity belongs to. The category graph as well as the categories related to an entity are taken from the KB.</p>
        <p>Figure 3 reports the application of the Cosine Similarity (Figure 3a) and Kullback-Leibler Divergence (Figure 3b) filters to the KB. The former obtains maximum similarity when the score is 1, while the latter when the score is 0. The two figures show that the cosine similarity is more spread out along the X axis (the confidence score thresholds) than KLD, which is indeed very thin-tailed. Finally, Figure 4 shows the category filter application, with the depth distribution of the categories in the category graph given the root node Philosophy (Figure 4a) and the distribution of the entities according to the minimum depth of the categories each entity belongs to (Figure 4b). The outcome of this figure is quite surprising: the category graph (which is a DAG according to Wikipedia) seems to interconnect categories that are not strictly related to each other. As a matter of fact, take a look at the odd path in Figure 2, where the category Valuation (Finance) is reached by traversing only two descendant links starting from the Philosophy category node. This evidence also clearly explains why so many categories (~750k) can be reached by descendant traversal of the category graph from the Philosophy node, with at most 10-11 steps.</p>
        <p>3.3 Filtering approach adopted</p>
        <p>According to the Wikipedia Philosophy Portal (see footnote 8), there are about 15k philosophical articles out of a total of 4.35M articles. Our intuition is that restricting ourselves to this topic-specific subset of articles may not be sufficient from an EL perspective. Indeed, also considering articles closely related to the topic could result in an improvement of the EL effectiveness, thanks to a disambiguation phase which makes use also of entities only marginally related to the topic t. Thus, our suggestion is to select twice the number of entities Wikipedia reports as belonging to the philosophical portal (i.e., our goal is to select ~30k entities). Obviously this approach is reasonable only because we are investigating a new, unexplored research direction. A more general way of solving the problem would be to not decide a priori the number of entities to select, but to let it depend on the domain and on how the domain is covered in Wikipedia. The same reasoning would lead to choosing the thresholds for the textual similarity strategies depending only on the domain-specificity of the entities.</p>
        <p>Hence, for filtering out entities not related to the topic t, we adopt the following approach:
1. We make use of the category filter to select entities belonging to the topic and to build the base KB. This choice is motivated by the high accuracy we expect from the category graph in selecting entities relevant for the topic,</p>
      </sec>
      <sec id="sec-3-5">
        <title>8 http://en.wikipedia.org/wiki/Portal:Philosophy</title>
        <p>because the categories are manually assigned (and validated) by the users to the entities. In order to maximize the probability of selecting only topic-related entities, we exploit a very aggressive threshold value of 3 (i.e., 3 is the maximum allowed minimum distance between at least one of the categories the entity belongs to and the root category of the topic, which in our case is the Philosophy category). By using this threshold the filter selects ~28k entities.
2. We expand the KB by also considering textual similarities with the reference document described in Section 3.1. The idea here is that some entities in Wikipedia could be misclassified (i.e., their categories could not reflect the real topic of the entity, or the entity could be missing some relevant categories). In order to apply such an expansion, we adopt the two textual filtering approaches described in Section 2 by combining them together, i.e., only entities that pass both the filters are added to the base KB. Since our target is to select 30k entities, and we have 28k from the categories, we add the missing 2k by considering the intersection between the subset of entities with the highest cosine similarity and the subset of entities with the highest KLD similarity (i.e., the lowest divergence). We select the subsets by finding two thresholds so that the sets have approximately the same size. We then filter the cosine similarity (respectively the Kullback-Leibler divergence) with a threshold of 0.35 (0.125), which lets ~52k (~79k) entities pass the filter.</p>
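        <p>The two-step combination above can be sketched as follows; the precomputed per-entity score dictionaries are an assumption of this sketch, while the thresholds are the ones reported in the text:</p>

```python
def build_topical_kb(entities, depth, cos_score, kld_score,
                     max_depth=3, cos_thr=0.35, kld_thr=0.125):
    """1) Base KB: entities whose minimum category depth is at most max_depth.
    2) Expansion: entities passing BOTH textual filters, i.e. cosine similarity
       at least cos_thr and KL divergence at most kld_thr (low KLD = similar)."""
    base = {e for e in entities
            if depth.get(e) is not None and max_depth >= depth[e]}
    expansion = {e for e in entities
                 if cos_score.get(e, 0.0) >= cos_thr
                 and kld_thr >= kld_score.get(e, float("inf"))}
    return base | expansion
```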
        <p>The resulting KB is made up of 30k entities, primarily selected by investigating the category graph and expanded with highly similar (from a language point of view) side entities. The size reduction compared to the full KB is 99%.</p>
        <p>3.4 Entity Linking differences</p>
        <p>
          Traditional entity linking strategies are applied to the reference document in order to evaluate how the annotation process is affected by the domain-oriented knowledge base obtained with the proposed filtering approach. We used the Dexter EL framework [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to annotate the philosophical document, both with the traditional KB and by plugging the domain-oriented KB into it. Each paragraph was annotated independently from the others, so that we are able to investigate
        </p>
        <p>Figure 5: (a) Spots distribution and (b) ambiguity distribution per paragraph, for the full KB and the filtered KB.</p>
        <p>the difference in the annotation process by looking at two important factors: how many spots per paragraph the system is able to annotate, and how ambiguous each spot is before performing the disambiguation phase, i.e., how many entities on average are candidates for each spot. In Figure 5 we report these factors by comparing the two KB solutions. As expected, Figure 5a shows that the distributions of spots per paragraph are very different: on average, by using the filtered KB the EL system annotates fewer spots per paragraph than by using the full KB, and this evidence is very clear if we look at the cumulative number of annotated spots. Indeed, by using the full KB the EL system annotates approximately 3 times more spots than by using the filtered one. If we look at the distributions of the ambiguity per spot, another important aspect arises: on average, the EL system which uses the filtered KB selects far fewer candidate entities per spot, with a 50% probability of selecting only one candidate per spot and a 25% probability of selecting two candidates for a spot. The latter evidence is really important because the disambiguation phase is simpler when the ambiguity is low (and it is worth noticing that 50% of the spots do not need disambiguation at all, due to a single candidate per spot being selected).</p>
        <sec id="sec-3-5-1">
          <title>User study</title>
          <p>
            For assessing the quality of the linking performed using the filtered KB, we set up a user study experiment. We selected the first 110 paragraphs from the On Certainty book, and we annotated each paragraph using a model generated from the filtered KB. Each paragraph was annotated using a dictionary generated from the Wikipedia anchors and, in case of ambiguous spots, we disambiguated using the TAGME disambiguation algorithm [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
          </p>
          <p>On average we annotated 4.54 entities per paragraph (500 in total). We designed a simple web application that allows a user to evaluate the annotations. The application allows the user to browse the paragraphs in two different ways:
Document-based browsing: the user can visualize a paragraph and move to the previous one, or the next. Annotated spots are highlighted and, if the user clicks on a spot, a description of the annotated entity pops up.
Entity-based browsing: if the user clicks on the name of the entity, then an entity-based view is presented: this page presents a description of the entity, and then a list of paragraphs where the entity is mentioned. For each paragraph the user can visualize the spot that was annotated with the entity, and the text around the annotation.</p>
          <p>The user can mark an annotation as good or bad, simply by clicking on it. One
click means good and the annotation is highlighted in green, while an additional
click means bad and the annotation is highlighted in red.</p>
          <p>We asked an expert human annotator in the field to judge the annotations. Subsequently we performed the same evaluation, but with the EL performed using the full KB. We obtained on average 9.72 annotations per paragraph (1070 in total). This high number is due to the fact that we did not pre-filter the annotations in any way. To speed up the expert assessment, we decided to remove the annotations with a confidence (i.e., a score assigned by the EL system which expresses the certainty of the annotation) lower than 20%. This threshold is absolutely reasonable since usually TAGME discards annotations with a confidence lower than 50%. The number of annotations per document decreased from 9.72 to 2.82 (globally from 1070 to 311). Moreover, we automatically copied the assessments relative to the same annotations (i.e., spots occurring in the same place of the text and linked to the same entity) from the previous judgment performed by the expert annotator. This avoided the evaluation of 113 (36%) annotations.</p>
          <p>Figure 6b shows the distribution of the annotations by their confidence score. We can observe that a traditional EL system encounters some problems when working with a filtered KB. Indeed the annotation confidence is on average quite low, thus suggesting that the mutual reinforcement of the entities in the disambiguation phase still has a lot of room for improvement when working on a topic-based KB. It is worth noticing that the disambiguation strategy adopted (TAGME) would have discarded the majority of the annotations by applying its threshold value (50%) on the annotation confidence.</p>
          <p>Figure 6a, on the other hand, illustrates the distribution of the two assessment classes by the confidence score of the corresponding annotations. We can identify a clear correlation between the positive class and the confidence score (higher is better). This can be explained by the fact that high confidence scores are assigned to entities strongly related with other entities in the annotated document, thus resulting in a more precise annotation which is less prone to errors.</p>
          <p>Finally, Figure 7 depicts the annotation effectiveness of the two EL systems, the first using the full KB and the second using the filtered one. The effectiveness is evaluated at different values of confidence in order to study the best threshold to adopt for filtering out bad annotations and maximizing the effectiveness of the annotation process. In the figures we show the support, i.e., the number of assessments with a confidence higher than or equal to the threshold adopted, and the precision, i.e., the fraction of positive assessments over the sum of positive and negative assessments. Figure 7a clearly shows that the precision goes from a value of 0.7 (obtained without using any threshold on the confidence, thus having the maximum support) to a maximum of 0.85 (using a threshold of 0.3, corresponding to a support of 430). Higher threshold values slowly decrease the precision but, more importantly, decrease the support. A very low support means very few annotations, and we should avoid such a compromise.</p>
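          <p>The precision and support reported in Figure 7 can be computed from the assessments with the following sketch, where each assessment is assumed to be a (confidence, is_good) pair:</p>

```python
def precision_support(assessments, threshold):
    """Support: number of assessments with confidence at or above the threshold.
    Precision: positive assessments over all retained assessments."""
    kept = [good for conf, good in assessments if conf >= threshold]
    support = len(kept)
    precision = sum(kept) / support if support else 0.0
    return precision, support
```

          <p>Sweeping the threshold over the assessed annotations yields the precision/support curves of Figures 7a and 7b.</p>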
          <p>Figure 7b studies the behavior of the EL system that makes use of the full KB. Here the precision ranges from a value of 0.65, starting from a confidence value of 0.2 (recall that annotations with a confidence below this threshold were discarded), up to a maximum of 0.85 obtained with a confidence value of 0.57 (with a support of 12). The latter result clearly shows how using only precision is not enough to measure the effectiveness of a system: in fact, the support value of 15 means that only 5% of the annotations are considered, resulting in a very poor document enriching (i.e., 0.14 average annotations per paragraph). By comparing the behavior of the two systems it is evident that the strategy of building a topical KB is a key idea for performing EL on a set of topic-specific documents. This is supported by the fact that we obtained a consistent improvement in terms of precision without penalizing the support too much.</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>Conclusions</title>
          <p>In this paper we study how to apply entity linking to a collection of documents about a particular topic. Our thesis is that pre-filtering a general knowledge base, keeping only the entities that are relevant for the topic, and then building the entity linking model only from these entities could improve the annotation performance. We propose three strategies for filtering the knowledge base, two based on textual similarity between the topic and the entities (i.e., cosine similarity and KL-divergence) and one based on the Wikipedia categories (i.e., considering only the categories that belong to the selected topic). We perform some preliminary experiments on the topic Philosophy, combining the three methods and performing the linking on the resulting filtered knowledge base. Finally, in a user study performed with an expert in the area, we compare the annotation performance on a philosophical document collection using a traditional knowledge base and the one filtered by our approach. The results confirm that the proposed technique is a promising idea for performing EL on a set of topic-specific documents.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Ceccarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lucchese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Trani</surname>
          </string-name>
          .
          <article-title>Dexter: an open source framework for entity linking</article-title>
          .
          <source>In Proceedings of the Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>KnowLife: a knowledge graph for health and life sciences</article-title>
          .
          <source>In Data Engineering (ICDE), 2014 IEEE 30th International Conference on</source>
          , pages
          <fpage>1254</fpage>
          -
          <lpage>1257</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Scaiella</surname>
          </string-name>
          .
          <article-title>TAGME: on-the-fly annotation of short text fragments (by wikipedia entities)</article-title>
          .
          <source>In Proceedings of CIKM</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Milchevski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Stics: searching with strings, things, and cats</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          , pages
          <fpage>1247</fpage>
          -
          <lpage>1248</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kullback</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Leibler</surname>
          </string-name>
          .
          <article-title>On information and sufficiency</article-title>
          .
          <source>Ann. Math. Statist.</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          , 03
          <year>1951</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <source>Introduction to Information Retrieval</source>
          , volume
          <volume>1</volume>
          . Cambridge University Press, Cambridge,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nishino</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Igata</surname>
          </string-name>
          .
          <article-title>Link scientific publications using linked data</article-title>
          .
          <source>In Semantic Computing (ICSC)</source>
          ,
          <year>2015</year>
          IEEE International Conference on, pages
          <fpage>268</fpage>
          -
          <lpage>271</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Csomai</surname>
          </string-name>
          .
          <article-title>Wikify!: linking documents to encyclopedic knowledge</article-title>
          .
          <source>In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management</source>
          , pages
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Milne</surname>
          </string-name>
          and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>An open-source toolkit for mining wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>194</volume>
          :
          <fpage>222</fpage>
          -
          <lpage>239</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>D.</given-names>
            <surname>Mirylenka</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          .
          <article-title>Navigating the topical structure of academic search results via the wikipedia category network</article-title>
          .
          <source>In Proceedings of the 22nd ACM international conference on Conference on information &amp; knowledge management</source>
          , pages
          <fpage>891</fpage>
          -
          <lpage>896</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Pantel</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Fuxman</surname>
          </string-name>
          .
          <article-title>Jigs and lures: Associating web queries with structured entities</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          .
          <article-title>From information to knowledge: harvesting entities and relationships from web sources</article-title>
          .
          <source>In Proceedings of PODS</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>