=Paper=
{{Paper
|id=Vol-2696/paper_185
|storemode=property
|title=IRISA System for Entity Detection and Linking at CLEF HIPE 2020
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_185.pdf
|volume=Vol-2696
|authors=Cheikh Brahim El Vaigh,Guillaume Le Noé-Bienvenu,Guillaume Gravier,Pascale Sébillot
|dblpUrl=https://dblp.org/rec/conf/clef/VaighNGS20
}}
==IRISA System for Entity Detection and Linking at CLEF HIPE 2020==
Cheikh Brahim El Vaigh (1), Guillaume Le Noé-Bienvenu (2), Guillaume Gravier (3), and Pascale Sébillot (4)

1 INRIA, IRISA, Rennes, France, cheikh-brahim.el-vaigh@inria.fr
2 CNRS, IRISA, Rennes, France, guillaume.le-noe-bienvenu@irisa.fr
3 CNRS, IRISA, Rennes, France, guig@irisa.fr
4 INSA Rennes, IRISA, Rennes, France, pascale.sebillot@irisa.fr

Abstract. This note describes IRISA's system for the task of named entity processing on historical newspapers in French. Following a standard entity detection and linking pipeline, our system implements three steps to solve the named entity linking task. Named entity recognition (NER) is first performed to identify the entity mentions in a document, based on a conditional random fields classifier. Candidate entities from Wikidata are then generated for each mention found, using a simple search. Finally, every mention is linked to one of its candidate entities in a so-called linking step, leveraging various string metrics and the semantic structure of Wikidata to improve the linking decisions.

Keywords: Named entity recognition · CRF · Collective entity linking · WSRM entity relatedness measure.

1 Introduction

Entity linking is a core task in textual document processing, which consists in identifying the entities of a knowledge base (KB) that are mentioned in a text. Approaches from the literature typically implement three stages to resolve mention ambiguity in texts. The first stage consists in detecting the named entities within the text and is known as named entity recognition (NER). To further link the mentions found in the text, candidate entities are generated for each mention detected in the first stage. Finally, every mention is linked to one of its candidate entities in a so-called linking step. This last step can be performed independently for each individual mention, or collectively for all mentions at once.
In the first case, every mention in a text is assumed to be independent from the other mentions and is linked to a candidate entity on the sole basis of some similarity between the mention and the candidate entities, the so-called local scores. By contrast, in collective entity linking, the entity mentions and the corresponding entities are not assumed independent from one another but somehow semantically related within a (coherent) document, i.e., mention-to-entity linking decisions are interdependent. In this case, the local mention-entity scores are complemented with global scores reflecting to which extent the candidate entities chosen for the mentions under consideration are related in the KB, according to a so-called entity relatedness measure. The last two stages of the pipeline are also known as named entity linking (NEL).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

In the context of the shared task CLEF HIPE 2020 (Identifying Historical People, Places and other Entities), which addresses named entity processing on historical newspapers in French, German and English [3], entity linking techniques can be used to retrieve entities from text. CLEF HIPE 2020 is organised as a CLEF 2020 evaluation lab. However, the historical context makes the linking task harder, since the texts considered are the output of an optical character recognition (OCR) algorithm, which introduces noise. We therefore leveraged various features to reduce the impact of the OCR noise on named entity processing. Our system for CLEF HIPE 2020 follows a standard pipeline for entity linking and implements three separate stages:

1. We devise a NER stage on top of the baseline provided by the CLEF HIPE 2020 organizers. This system uses conditional random fields (CRFs) to detect and classify named entities.
We added several features that we found effective for the NER task.

2. The generation step consists in looking up Wikidata directly when searching for entities similar to a given mention. As a lookup in the heavy database (the CLEF HIPE 2020 Wikidata dump) is costly in time, we performed automatic searches for the entity mentions using the online Wikidata service. Note that this search relies on Wikidata's indexing algorithm.

3. The linking step decides which candidate should be retained for each mention within a document. We tried to link the mentions either separately or collectively, training a classifier to predict whether a mention is related to one of its candidate entities. The former is based solely on the similarity between a given mention and its candidate entities. The latter, which performs the linking collectively for all the mentions at once, also makes use of the entity relatedness measure WSRM that we proposed in [4], besides the previous similarity metrics.

The collective linking setup gave the best results and was ranked second for bundle 2 of the shared task CLEF HIPE 2020. Our source code, datasets and experimental results are made available online for reproducibility purposes (https://gitlab.inria.fr/celvaigh/hipe2020).

The note is organized as follows. We describe our method in Sec. 2. We then report the experimental results in Sec. 3, before discussing perspectives and concluding in Sec. 4.

2 System Architecture

This section describes our system. We distinguish two independent tasks for named entity processing, namely NER and NEL. Our solutions for NER and NEL are described in Sec. 2.1 and Sec. 2.2, respectively.

2.1 NER

The NER task aims at detecting the surface forms in a text that correspond to named entities and at classifying those forms into a type (PER, LOC, ORG, TIME, PROD).
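A CRF tagger of this kind scores label sequences from per-token feature dictionaries. The sketch below illustrates the style of features we list in Table 1 (lowercase form, suffixes, case and digit flags); the exact feature names and the baseline's full feature set are assumptions for illustration:

```python
def token_features(tokens, i):
    """Feature dictionary for token i, in the style of the CLEF HIPE baseline.

    Feature names are illustrative; the actual baseline set differs slightly.
    """
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),       # the token in lowercase
        "suffix3": tok[-3:],        # the last 3 letters of the token
        "suffix2": tok[-2:],        # the last 2 letters of the token
        "is_upper": tok.isupper(),  # boolean: token fully uppercase?
        "is_title": tok.istitle(),  # boolean: token in titlecase?
        "is_digit": tok.isdigit(),  # boolean: token is a digit?
    }
    # Context features from neighbouring tokens, a common CRF practice.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats

sent = ["Gustave", "Ador", "est", "né", "à", "Genève"]
features = [token_features(sent, i) for i in range(len(sent))]
```

Lists of such dictionaries, one per sentence, are the usual input to a crfsuite-style trainer.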
The NER system that we developed originates from the NER baseline provided by the organizing team of the CLEF HIPE 2020 evaluation campaign. This system uses conditional random fields (CRFs), based on a Python implementation [1], to detect and classify named entities. The features used in this system, as well as the ones we added, are described in Table 1.

2.2 NEL

In the NEL stage, we assume that the annotations (person, organization, location, etc.) are known for the mentions of each document. Those annotations are provided either by the NER system described in Sec. 2.1 or by an oracle NER. For the candidate generation stage we rely on a simple Wikidata web search. The candidate selection stage accounts for the WSRM entity relatedness measure between candidate entities within the document in an efficient manner, relying here on Wikidata, the KB provided by CLEF HIPE 2020 for named entity processing (see [4] for details on the measure). These different steps are described below.

Candidate Entities Generation. To generate candidate entities from the KB for each mention in a document, we chose a simple yet efficient method exploiting the index of Wikidata. For each mention found by the NER phase, we perform an online search using Wikidata web pages. We limit ourselves to the top 10 ranked candidate entities. The motivation behind this choice is to speed up the candidate generation step, as a lookup in the heavy Wikidata dump is costly in time compared to a simple web search.

Local Scores. The local scores depict the similarity between a mention and its candidate entities. If we assume the mentions to be independent in the text, the linking problem can be formalized as

    \hat{e} = \arg\max_{e_i} \phi(m, e_i)    (1)

where e_i is a candidate entity, m is an entity mention, and \phi is the local score function. We tried several metrics for \phi.
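The candidate generation step described above, an online Wikidata search truncated to the top 10 hits, can be sketched with the public `wbsearchentities` module of the Wikidata API. This is a minimal sketch under that assumption; rate limiting, retries, and the exact fields our system keeps are omitted:

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def candidate_query(mention, language="fr", limit=10):
    """Build the query URL for a top-`limit` Wikidata entity search."""
    params = {
        "action": "wbsearchentities",
        "search": mention,     # the mention surface form found by the NER
        "language": language,  # French for the HIPE track we address
        "limit": limit,        # we keep the top 10 ranked candidates
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)

def search_candidates(mention):
    """Return (QID, label) pairs for the candidate entities of a mention."""
    with urllib.request.urlopen(candidate_query(mention)) as resp:
        data = json.load(resp)
    return [(hit["id"], hit.get("label", "")) for hit in data["search"]]
```

For example, `search_candidates("Genève")` would query Wikidata's own index and return its top-ranked matching entities.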
Beside the longest contiguous matching sub-sequence, we tried a Levenshtein distance to handle the OCR noise, Wikipedia popularity [2,5], and the cosine similarity based on a word embedding model similar to the skip-gram embedding model [6].

Collective Entity Linking. In a collective NEL setup, the local score is complemented with a global score accounting for the intricate interrelationships that the candidate entities of the different mentions may share. The latter is known as an entity relatedness measure and is used to assess entity relationships in the KB, which allows estimating the interdependence of the mentions in the text. The collective entity linking (CEL) problem can thus be formalized as

    (\hat{e}_1, \ldots, \hat{e}_n) = \arg\max_{e_1, \ldots, e_n} \sum_{i=1}^{n} \phi(m_i, e_i) + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \psi(e_i, e_j)    (2)

where n is the total number of mentions in a text and \psi(e_i, e_j) denotes the entity relatedness measure. In the collective linking version of our system, we used the semantic entity relatedness measure WSRM [4], which weights the relation between entities: the more relations between two entities, the stronger their relationship. Formally, WSRM is defined between two entities e_i and e_j as

    \mathrm{WSRM}(e_i, e_j) = \frac{|\{r \mid (e_i, r, e_j) \in \mathrm{KB}\}|}{\sum_{e' \in E} |\{r' \mid (e_i, r', e') \in \mathrm{KB}\}|}    (3)

where E denotes the set of entities in the KB and |S| the cardinality of the set S. Because the directions of the relations are somewhat arbitrary in KBs, depending on how the relation vocabulary was designed (think of the symmetric RDF properties publishes and publishedBy), we use a symmetric version of WSRM defined as

    \psi(e_i, e_j) = \frac{1}{2} \left( \mathrm{WSRM}(e_i, e_j) + \mathrm{WSRM}(e_j, e_i) \right).    (4)

Using the NEL Output to Correct the NER Predictions. We also exploited the output of the NEL in order to enhance the NER results. First, we used the type (obtained from Wikidata) of the entities retrieved by the NEL and forwarded it to the NER stage, which can be updated accordingly.
Then we leveraged WSRM [4] to retrieve, for each entity found by the NEL, a list of potentially related entities from the KB. We argue that if an entity is mentioned in the text, its related entities in the KB should also be mentioned in that text. Our aim is to exploit the semantics of the KB for the NER task. The information provided by the NEL is used as pseudo-labels or features in the CRF to supervise the NER.

3 Experiments

Experimental validation was conducted on the CLEF HIPE 2020 French corpus to assess the quality of our system. The dataset is described in Sec. 3.1. Results are provided in Sec. 3.2 for the NER and in Sec. 3.3 for the NEL. Due to the high number of results given by the CLEF HIPE 2020 scorer, we decided to focus on only a couple of them, as given in the produced JSON file: NE-COARSE-LIT - ALL - strict - F1 micro and NE-COARSE-LIT - ALL - ent type - F1 micro.

3.1 Dataset

The evaluation corpus is composed of newspaper articles sampled among several Swiss, Luxembourgish and American historical newspapers on a diachronic basis. This corpus was digitised with an OCR algorithm, which highlights the historical context of the evaluation campaign.
The time-span of the whole corpus goes from 1798 until 2018. We used only the French version of the corpus, composed of a train, a validation and a test set (6).

| Feature | Was on baseline | Kept |
|---|---|---|
| the token in lowercase | yes | yes |
| the last 3 letters of the token | yes | yes |
| the last 2 letters of the token | yes | yes |
| a boolean on whether the token is in uppercase | yes | yes |
| a boolean on whether the token is in titlecase | yes | yes |
| a boolean on whether the token is a digit | yes | yes |
| the correct spelling of the word using an open-source library | no | no |
| the presence of the token in a list of named entities | no | no |
| the presence of the token in a list of first names | no | no |
| whether the word is a stop word or not | no | no |
| whether the word is a punctuation mark or not | no | no |
| the length of the token | no | no |
| the relative length of the token (small, medium, large) | no | no |
| the token without redundant letters | no | no |
| the first 2 characters of the token | no | no |
| the first 3 characters of the token | no | no |
| the POS tag of the token | no | no |
| whether the token matches a date regex | no | no |
| the token itself, with its diacritical characters converted into their ASCII equivalent | no | yes |
| the first 100 elements of the vectorial representation of the token using a fastText model provided by the organizers (fr-model-skipgram-300minc20-ws5-maxn-0.vec) | no | yes |

Table 1. The list of the features we tested for the NER.

3.2 Results of the NER

The NER classifier described in Sec. 2.1 is trained on the CLEF HIPE 2020 dataset. We added several features to the ones of the baseline and performed a random search to select the best features while controlling overfitting. The list of the features used is given in Table 1 and the list of the best hyper-parameters in Table 2. The system was trained on the train file and then tested on the dev and test files provided by the organizers. We compared our NER system with the baseline provided by the CLEF HIPE 2020 organizers on the validation set (dev file). The results are gathered in Table 3.
We can see that our NER system outperforms the baseline. We believe that its good performance is due to the choice of the selected features, e.g., the use of the tokens present in the text as features for the classifier. The fine-tuning of the hyper-parameters of our CRF also partly explains the results being better than those of the baseline. The results of our system on the test file are also gathered in Table 3.

6 Detailed statistics about the data can be found at https://impresso.github.io/CLEF-HIPE-2020/datasets.html

| Parameter | Best value found |
|---|---|
| c1, the coefficient for L1 regularization, between 0 and 1 | 0.1798 |
| c2, the coefficient for L2 regularization, between 0 and 1 | 0.0551 |
| min freq, cut-off threshold for the occurrence frequency of a feature, between 0 and 1 | 0 |
| max iterations, the maximum number of iterations of the optimization algorithm, between 100 and 1000 | 192 |
| all possible transitions | false |
| num memories, the number of limited memories for approximating the inverse Hessian matrix, between 4 and 8 | 4 |

Table 2. The best parameters for the CRF.

| File | Task | Baseline F1 | IRISA F1 | IRISA Precision | IRISA Recall |
|---|---|---|---|---|---|
| Dev | NERC coarse French strict (literal sense) | 0.622 | 0.716 | 0.768 | 0.671 |
| Dev | NERC coarse French fuzzy (literal sense) | 0.735 | 0.821 | 0.880 | 0.769 |
| Test | NERC coarse French strict (literal sense) | - | 0.668 | 0.705 | 0.634 |
| Test | NERC coarse French fuzzy (literal sense) | - | 0.784 | 0.828 | 0.744 |

Table 3. Scores of the NER systems on the dev and test files.

| Rank | Team name | System | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1 | L3i | team10 bundle1 fr 3 | 0.598 | 0.594 | 0.602 |
| 2 | L3i | team10 bundle1 fr 1 | 0.597 | 0.592 | 0.601 |
| 3 | L3i | team10 bundle1 fr 2 | 0.597 | 0.592 | 0.602 |
| 4 | IRISA | team7 bundle2 fr 2 | 0.421 | 0.446 | 0.399 |
| 5 | IRISA | team7 bundle2 fr 1 | 0.419 | 0.450 | 0.393 |
| 6 | IRISA | team7 bundle2 fr 3 | 0.413 | 0.437 | 0.391 |
| 7 | SBB | team33 bundle2 fr 1 | 0.407 | 0.594 | 0.310 |
| 8 | UvA.ILPS | team31 bundle2 fr 2 | 0.251 | 0.352 | 0.195 |
| 9 | ERTIM | team16 bundle1 fr 1 | 0.108 | 0.150 | 0.084 |

Table 4. Linking accuracy (F1 score) on the CLEF HIPE 2020 French dataset for bundle 2.
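For concreteness, the WSRM measure of Eq. (3) and its symmetrized version \psi of Eq. (4) can be computed over a toy set of KB triples as follows; the triples are invented for illustration and do not come from Wikidata:

```python
# Toy KB: (subject, relation, object) triples, invented for illustration.
KB = [
    ("e1", "capitalOf", "e2"),
    ("e1", "locatedIn", "e2"),
    ("e1", "partOf", "e3"),
    ("e2", "contains", "e1"),
]

def wsrm(ei, ej, kb):
    """WSRM(ei, ej), Eq. (3): distinct relations from ei to ej,
    normalized by all outgoing (relation, target) pairs of ei."""
    num = len({r for (s, r, o) in kb if s == ei and o == ej})
    den = len({(r, o) for (s, r, o) in kb if s == ei})
    return num / den if den else 0.0

def psi(ei, ej, kb):
    """Symmetric relatedness, Eq. (4)."""
    return 0.5 * (wsrm(ei, ej, kb) + wsrm(ej, ei, kb))
```

Here `wsrm("e1", "e2", KB)` is 2/3 (two of e1's three outgoing relations point to e2), `wsrm("e2", "e1", KB)` is 1, so `psi("e1", "e2", KB)` is 5/6; the asymmetry of the raw measure is what Eq. (4) averages out.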
3.3 Results of the NEL

Results of the entity linking process, evaluated in terms of micro-averaged F1 classification scores, are reported in Table 4. The three systems that we submitted to CLEF HIPE 2020 were ranked second (team7 results). We first evaluated the entity linking based on the sole use of the local scores, denoted team7 bundle2 fr 1. Second, we added the global score, devising a collective entity linking which we named team7 bundle2 fr 2. Finally, we changed the collective linking system to filter out the non-linkable mentions (NIL) based on a threshold, meaning that we only link a mention to a candidate entity if the prior probability is below a fixed threshold (here 0.5). We can see that the collective linking gave the best results, while the collective linking with a fixed threshold is worse than the non-collective one. These results show the benefit of the collective linking.

3.4 Supervising the NER with the NEL

A few experiments were carried out to exploit the outputs of the NEL in order to enhance the NER results. The first one consisted in using the types of the entities found by the NEL to change the NER labels; e.g., if the NER detects the entity 'Europe' and classifies it as 'PERS', the NEL links it to 'Q46' and provides the information that the type of 'Q46' is 'LOC'. The second consisted in generating entities closely related to the ones found by the NEL. We found that the output of the NEL stage can correct the NER, but can also introduce too much noise. Despite not being able to directly incorporate the output of the NEL with the existing features, we believe that applying a majority vote between the different versions of the NER, with and without the NEL output, can lead to an increase in the accuracy of the NER. Nonetheless, our system opened the door to incorporating the semantics of the KB into the NER task.
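The first correction strategy above, overriding a NER label with the coarse type of the linked Wikidata entity, can be sketched as follows. The mention-to-QID and QID-to-type lookup tables are stand-ins for the real NEL output (Q46 is indeed Europe on Wikidata, but the mappings here are toy data):

```python
# Stand-in NEL output: linked QID per mention, and coarse type per QID.
nel_links = {"Europe": "Q46"}
qid_types = {"Q46": "LOC"}

def correct_ner_label(mention, ner_label):
    """Override the NER label when the NEL linked the mention
    to an entity with a known coarse type."""
    qid = nel_links.get(mention)
    if qid is None:
        return ner_label          # NIL mention: keep the NER prediction
    return qid_types.get(qid, ner_label)

# The paper's example: the NER says PERS for 'Europe',
# the NEL-derived type corrects it to LOC.
corrected = correct_ner_label("Europe", "PERS")
```

In practice this override competes with the noise the NEL introduces, which is why a majority vote over NER variants is attractive.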
4 Conclusion

We built an entity processing system based on a CRF classifier for the NER task, and a collective entity linking system for the NEL one, exploiting the WSRM entity relatedness measure that we proposed in [4]. Our system was evaluated on the CLEF HIPE 2020 French dataset. Though we initially expected to, we did not succeed in incorporating the output of the NEL to correct the NER step, but we paved the way to fully exploiting the KB semantics in the NER task.

References

1. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., Varoquaux, G.: API design for machine learning software: Experiences from the scikit-learn project. In: European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases Workshop: Languages for Data Mining and Machine Learning. pp. 108-122 (2013)
2. Durrett, G., Klein, D.: A joint model for entity analysis: coreference, typing, and linking. Transactions of the Association for Computational Linguistics 2, 477-490 (2014)
3. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Overview of CLEF HIPE 2020: Named entity recognition and linking on historical newspapers. In: CLEF HIPE 2020. pp. 1-25 (2020)
4. El Vaigh, C.B., Goasdoué, F., Gravier, G., Sébillot, P.: Using knowledge base semantics in context-aware entity linking. In: ACM Symposium on Document Engineering 2019. pp. 8:1-8:10. Berlin, Germany (2019)
5. Francis-Landau, M., Durrett, G., Klein, D.: Capturing semantic similarity for entity linking with convolutional neural networks. In: 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1256-1261 (2016)
6. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111-3119 (2013)