=Paper=
{{Paper
|id=Vol-2696/paper_152
|storemode=property
|title=Experiments from LIMSI at the French Named Entity Recognition Coarse-grained Task
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_152.pdf
|volume=Vol-2696
|authors=Sahar Ghannay,Cyril Grouin,Thomas Lavergne
|dblpUrl=https://dblp.org/rec/conf/clef/GhannayGL20
}}
==Experiments from LIMSI at the French Named Entity Recognition Coarse-grained Task==
Sahar Ghannay1[0000-0002-7531-2522], Cyril Grouin1[0000-0001-5809-188X], and Thomas Lavergne1

Université Paris-Saclay, CNRS, LIMSI, 91405 Orsay, France
first.last@limsi.fr

Abstract. This paper presents the participation of the LIMSI team in the HIPE 2020 challenge, on the coarse-grained named entity recognition task for French. Our approach jointly predicts the literal and metonymic entities. For this, a CamemBERT base model and a CRF model were used. We submitted three systems: a joint model using only CamemBERT, a joint model extended with a CRF layer, and a CamemBERT model without the joint option. Experimental results show that the second system achieved the best results on the literal tags (F1=.814) while the third system performed best (F1=.667) on the metonymy tags. The second system allowed us to obtain our best results on both the dev and test datasets for the literal tags. Nevertheless, we observed a difference on the metonymy tags, where our first system obtained the best results on the dev dataset (F1=.663) while our third system performed best on the test dataset (F1=.667).

Keywords: Named Entity Recognition, historical texts, contextual word embeddings

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

In 2011 and 2012, two corpora were produced and annotated with extended named entities. The 2011 Quaero corpus focused on broadcast news [7], while the 2012 Quaero corpus is composed of French press archives from December 1880 [6]. These two corpora were used for NLP challenges that included both coarse-grained and fine-grained named entities, as well as several kinds of named entity imbrication, such as the metonymy phenomenon. The current HIPE 2020 challenge builds on the annotation guidelines produced during the 2011 and 2012 Quaero NLP challenges [7, 6].

Specifically for the HIPE 2020 challenge [4], one main issue concerns the digitization of texts from distinct periods (from 1798 to 2018 in the French data), with digitization errors such as the insertion and deletion of characters (e.g., “oppositipjn” instead of “opposition”), including the insertion and deletion of spaces, which produces tokenization issues (“limitrop he” vs. “limitrophe” or “rég iment” vs. “régiment”, producing two tokens instead of only one). Digitization errors mainly occur on grammatical words.¹ Nevertheless, such errors may also be found in named entities, which makes the NER task more difficult (e.g., in the person name “Picqu␣art” instead of “Picquart” or in the town name “Glascow” instead of “Glasgow”).

¹ The most common errors are found in short grammatical words (error/correct form): Cn/Un, co/ce, cotte/cette, do/de, k/à, lai/lui, lo/le, on/en, quo/que, uno/une.

The CLEF HIPE 2020 challenge proposed several tasks (coarse-grained and fine-grained named entity recognition (NER), and entity linking) in three languages (English, French, German). We are interested in sub-task 1.1, called NERC Coarse-grained, for French, which concerns the recognition and classification of entity mentions according to coarse-grained types (Person, Location, Organisation and Product). For this task, we distinguish two coarse tag types for each entity mention token, according to its literal sense and to its metonymic sense, named respectively the literal and metonymy tags. These coarse types correspond to the NE-COARSE-LIT and NE-COARSE-METO columns in the data.
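To make the data format concrete, the following minimal sketch reads the two coarse tag columns. It assumes a tab-separated layout with a header row naming the TOKEN, NE-COARSE-LIT and NE-COARSE-METO columns and ‘#’-prefixed comment lines; this may differ in detail from the actual HIPE release.

```python
import csv

def read_coarse_tags(path):
    """Read (token, literal tag, metonymy tag) triples from a
    HIPE-style TSV file. Assumes a header row with the columns
    TOKEN, NE-COARSE-LIT and NE-COARSE-METO, and '#'-prefixed
    comment lines; adjust if the actual release differs."""
    rows = []
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(
            (line for line in f if not line.startswith("#")),
            delimiter="\t",
        )
        for row in reader:
            rows.append(
                (row["TOKEN"], row["NE-COARSE-LIT"], row["NE-COARSE-METO"])
            )
    return rows
```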
For this challenge, we proposed three neural systems that benefit from contextual word embeddings from the CamemBERT model [22]. This model was extended to jointly predict the literal and metonymy NE tags, in addition to the use of a CRF layer on top of this model to further improve the predictions by taking advantage of neighboring labels.

The paper is organized along the following lines: Section 2 presents related work on the NER task. Section 3 describes the proposed NER system. The experimental setup and results are described in Section 4, just before the conclusion (Section 5).

2 Related work

The Named Entity Recognition (NER) task consists in identifying text spans that mention named entities (person names, companies, locations) and classifying them into predefined categories (Person, Location, Organisation and Product). The NER task is a key component of several Natural Language Processing (NLP) applications, such as information retrieval [8], text understanding [29], and question answering [23].

For decades, the NER task has been widely studied and different approaches have been proposed. Traditional approaches fall into three categories [28, 15]: rule-based [10], unsupervised learning [5], and feature-based supervised learning approaches [30, 17]. Recent approaches are based on neural network architectures in which hidden features are discovered automatically. Generally, a NER architecture can be regarded as the composition of an encoder (CNN, BiLSTM (bi-directional long short-term memory), RNN, transformer, etc.) and a decoder (BiLSTM, CRF, etc.) [18]. The first neural NER model, based on a unidirectional LSTM architecture, was proposed by Hammerton [9]. Collobert et al. [2] proposed a CNN-CRF architecture enriched with character-level embeddings. Lample et al. [14] proposed a BiLSTM-CRF architecture that benefits from both word- and character-level embeddings. State-of-the-art NER systems leverage recent advances in deep learning, in particular approaches that benefit from contextual or language-model embeddings such as BERT [20, 16, 18].

3 Proposed NER system

The proposed NER system is based on the CamemBERT model [22], which we extended to jointly predict both NE tags: literal and metonymy. In the following subsections, we briefly define the CamemBERT model and then present the proposed joint NER model.

3.1 CamemBERT

The CamemBERT model is based on RoBERTa (Robustly Optimized BERT Pretraining Approach) [19], which is in turn based on BERT (Bidirectional Encoder Representations from Transformers) [3]. BERT's architecture is a multi-layer bidirectional Transformer [27] encoder, trained with masked language modeling and next sentence prediction objectives. RoBERTa improves BERT's pre-training procedure by dynamically changing the masking pattern applied to the training data, removing the next sentence prediction task, and training with larger batches and longer sequences, on more data, and for longer.

Like BERT and RoBERTa, CamemBERT is a multi-layer bidirectional Transformer. It uses the original architectures of BERT-BASE (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters) and BERT-LARGE (24 layers, 1024 hidden dimensions, 16 attention heads, 340M parameters). CamemBERT follows RoBERTa's improved pre-training procedure, but uses whole-word masking and the SentencePiece tokenization [12] instead of WordPiece [25]. For more details about CamemBERT, we refer the reader to Martin et al. [22].
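As a concrete illustration (a sketch, not the exact code of our experiments), the base encoder can be loaded through the Hugging Face transformers library under the identifier camembert-base published by its authors:

```python
import torch
from transformers import CamembertModel, CamembertTokenizer

# Load the pre-trained SentencePiece tokenizer and the 12-layer encoder.
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

# One contextual vector of dimension 768 per sub-token.
inputs = tokenizer("Le régiment arrive à Paris.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```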
3.2 Joint NER

As mentioned above, the NER system has to jointly learn both the literal and metonymy NE tags. Inspired by previous work on intent classification and slot filling [1], we propose to extend CamemBERT for this purpose. Hence, the final hidden states of the tokens $h_2, \ldots, h_T$ (excluding the first special token) are fed into two softmax layers to classify over the literal and metonymy tags. Specifically, each input word is run through the SentencePiece tokenizer and the hidden state of its first sub-token is used as input to the softmax classifiers. The literal and metonymy tags are predicted respectively as:

$y_i^{LIT} = \mathrm{softmax}(W h_i + b), \quad i \in 1 \ldots N$   (1)

$y_i^{METO} = \mathrm{softmax}(W h_i + b), \quad i \in 1 \ldots N$   (2)

where $h_i$ is the hidden state of the first sub-token of the word $w_i$. To jointly learn the literal and metonymy tags, the learning objective is to maximize the conditional probability defined as follows:

$p(y^{LIT}, y^{METO} \mid w) = \prod_{i=1}^{N} p(y_i^{LIT} \mid w)\, p(y_i^{METO} \mid w)$   (3)

The model is fine-tuned end-to-end by minimizing the cross-entropy loss.

3.3 CRF

For the NER task, label predictions depend on the predictions for surrounding words. Thus, for a given input sentence, it is helpful to consider the correlations between neighboring labels and to jointly decode the best label chain. It has been shown that adding a conditional random field (CRF) [13] layer on top of a BiLSTM (bi-directional long short-term memory) encoder improves many sequence labeling tasks, including NER [21]. For that reason, we propose to add a CRF layer on top of the joint CamemBERT model to capture NE label dependencies.
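To make the joint architecture concrete, here is a minimal PyTorch sketch of the two classification heads of Equations (1) and (2) with the summed cross-entropy loss implementing the joint objective of Equation (3). It is an illustration under our own simplifying assumptions (separate weights per head, first-sub-token selection handled through the loss's ignore index, no CRF layer), not the exact training code:

```python
import torch.nn as nn
from transformers import CamembertModel

class JointCoarseNER(nn.Module):
    """CamemBERT encoder with two softmax heads, one for the literal
    tag set and one for the metonymy tag set (Eqs. 1-2)."""

    def __init__(self, n_lit_tags, n_meto_tags):
        super().__init__()
        self.encoder = CamembertModel.from_pretrained("camembert-base")
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        self.lit_head = nn.Linear(hidden, n_lit_tags)
        self.meto_head = nn.Linear(hidden, n_meto_tags)
        self.loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask, lit_tags=None, meto_tags=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        lit_logits = self.lit_head(h)    # (batch, seq, n_lit_tags)
        meto_logits = self.meto_head(h)  # (batch, seq, n_meto_tags)
        if lit_tags is None:
            return lit_logits, meto_logits
        # Joint objective (Eq. 3): the product of per-token probabilities
        # becomes a sum of the two cross-entropy losses.
        return (
            self.loss(lit_logits.transpose(1, 2), lit_tags)
            + self.loss(meto_logits.transpose(1, 2), meto_tags)
        )
```

In practice, only the hidden state of the first sub-token of each word carries a gold label; continuation sub-tokens and padding can be assigned the label -100 so the loss ignores them. For our second system, a CRF layer over each head's emissions would replace this independent softmax decoding.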
4 Experiments and results

4.1 Data description

For the coarse-grained NER task, we used the French datasets provided by the organizers. The corpus is divided into train, dev and test sets, composed respectively of 158, 43 and 43 documents from distinct periods of time. As documents are too long to be processed by our model, we split each document into several sentences of length ≥ l. Two rules are applied during splitting: i) take the NE tags into account, i.e., we reach the end of a tag before splitting so that no entity is cut in half; ii) take the end of the document into account: if the remaining part of the document is shorter than or equal to 2l, then we consider the whole remaining part as one sentence. Both rules are applied to the train and dev datasets, while only the second rule was applied to the test dataset, since its annotations were masked. After hyper-parameter tuning, we set the minimum length l to 50; the sentence length thus varies from 50 to 149. Table 1 reports the number of sentences in each data set.

Table 1. Number of sentences in each data set of the French corpus

Data set   #sentences
train      3080
dev        693
test       755

4.2 Training details

We used the CamemBERT base model as provided by its authors (https://camembert-model.fr/), which is composed of 12 layers, 768 hidden dimensions and 12 attention heads. CamemBERT is pre-trained on the French part of the OSCAR corpus [26]: a pre-filtered and pre-classified version of Common Crawl, composed of 138GB of raw text and 32.7B tokens after subword tokenization. For fine-tuning, all hyper-parameters are tuned on the development (dev) set. The minimum sentence length is selected from [10, 20, 50]. The maximum sequence length is 256. The batch size is selected from [32, 64, 128]. The maximum number of epochs is 100. For optimization, we used Adam [11] with an initial learning rate of 5e-5. The dropout probability is 0.1.

4.3 Results

This section reports the results of the three submitted systems, namely:

– sys1: fine-tuning the joint NER model without the CRF layer;
– sys2: fine-tuning the joint NER model with the CRF layer;
– sys3: fine-tuning the CamemBERT base model without the joint option; for this model the literal and metonymy tags are concatenated and considered as a single tag, so this system has more tags to predict than sys1 and sys2.

The results are evaluated at entity and document levels in terms of micro and macro precision, recall and F1-measure, considering two scenarios: exact (strict) and fuzzy (relaxed) boundary matching, for both literal and metonymy tags [24]. The best systems were selected based on the results on the dev set, which are summarized in Table 2 in terms of micro precision, recall and F1-measure for both tags. Note that our systems often ranked second best on French.

Table 2. Micro precision, recall and F1-score for the literal and metonymy tags on the dev set, in both the strict and fuzzy scenarios.

                        Strict                     Fuzzy
Tag    System    F1     Precision  Recall   F1     Precision  Recall
LIT    sys1      0.873  0.874      0.872    0.920  0.921      0.919
       sys2      0.882  0.879      0.886    0.929  0.925      0.933
       sys3      0.873  0.870      0.875    0.925  0.922      0.927
METO   sys1      0.663  0.725      0.611    0.663  0.725      0.611
       sys2      0.646  0.711      0.593    0.646  0.711      0.593
       sys3      0.612  0.573      0.657    0.612  0.573      0.657

On the dev set, the three systems achieve comparable results for the literal tag. For metonymy, sys1 achieves better results than sys3 and sys2, with improvements of 8.82% and 2.63% respectively in terms of micro F1, in both the strict and fuzzy scenarios, while sys2, which includes a CRF layer on top of the joint NER model, improves the results at the document level in terms of macro F1 (0.68), by 2.25% and 1.34% over sys3 and sys1 respectively, in both scenarios. Results on the test data in terms of micro precision, recall and F1-measure for both tags are summarized in Table 3.

Table 3. Micro precision, recall and F1-score for the literal and metonymy tags on the test set, in both the strict and fuzzy scenarios.

                        Strict                     Fuzzy
Tag    System    F1     Precision  Recall   F1     Precision  Recall
LIT    sys1      0.801  0.791      0.811    0.897  0.886      0.908
       sys2      0.814  0.799      0.829    0.896  0.880      0.913
       sys3      0.807  0.798      0.818    0.898  0.887      0.909
METO   sys1      0.603  0.690      0.536    0.603  0.690      0.536
       sys2      0.627  0.696      0.571    0.637  0.707      0.580
       sys3      0.667  0.647      0.688    0.675  0.655      0.696

We observe that the proposed systems obtain comparable results for the literal tag. Predicting metonymy tags is not an easy task, since metonymic mentions are rare: the B-org tag represents only 0.27% of tokens and I-org only 0.05% (in each of the train, dev and test sets), plus a few rare B-loc tags in train and dev and B-time tags in test. Treating the prediction of metonymy tags as a separate task (as sys1 and sys2 do) may therefore cause generalization problems. The results we obtained on the test data confirm this hypothesis: sys3, trained on the concatenation of both tags, achieved the best results in terms of micro F1 w.r.t. sys1 and sys2, which is not the case on dev. Thus, concatenating both tags helps to predict the metonymy tag and mitigates its low frequency: considering labels such as B-loc_B-org vs. B-loc_O (a sketch of this concatenation is given below), it is easier for the system to distinguish the LOC category with and without metonymy. Last, sys1 achieves the best results in terms of macro F1 (0.747) in comparison to sys2 (0.738) and sys3 (0.733).
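As a minimal illustration of the tag concatenation used by sys3 (the underscore separator is our own illustrative convention, not necessarily the exact label inventory of our experiments):

```python
def concat_tags(lit_tag: str, meto_tag: str) -> str:
    """Merge the NE-COARSE-LIT and NE-COARSE-METO columns into the
    single label predicted by sys3."""
    return f"{lit_tag}_{meto_tag}"

# A location used metonymically for an organisation and a plain
# location become two distinct labels in the cross-product tag set:
assert concat_tags("B-loc", "B-org") == "B-loc_B-org"
assert concat_tags("B-loc", "O") == "B-loc_O"
```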
5 Conclusions

In this paper, we presented our participation in the CLEF HIPE 2020 challenge, on the coarse-grained named entity recognition task for French. The proposed approach jointly predicts the literal and metonymic entities. For this, a CamemBERT base model and a CRF model were used. We submitted three systems: a joint model using only CamemBERT, a joint model extended with a CRF layer, and a CamemBERT model without the joint option.

On the test dataset, we achieved our best results on the literal tags using our second system (F1=.814), while on the metonymy tags our third system performed best (F1=.667). Our second system allowed us to obtain the best results on both the dev and test datasets for the literal tags. Nevertheless, we observed a difference on the metonymy tags, where our first system obtained the best results on the dev dataset (F1=.663) while our third system performed best on the test dataset (F1=.667); surprisingly, the differences between the first and third systems are about 5 points in strict F1-score.

References

1. Chen, Q., Zhuo, Z., Wang, W.: BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909 (2019)
2. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug), 2493–2537 (2011)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
4. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Overview of CLEF HIPE 2020: Named entity recognition and linking on historical newspapers. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020). Lecture Notes in Computer Science (LNCS), vol. 12260. Springer (2020)
5. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence 165(1), 91–134 (2005)
6. Galibert, O., Rosset, S., Grouin, C., Zweigenbaum, P., Quintard, L.: Extended named entities annotation on OCRed documents: From corpus constitution to evaluation campaign. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey (May 2012)
7. Grouin, C., Rosset, S., Zweigenbaum, P., Fort, K., Galibert, O., Quintard, L.: Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview. In: Proceedings of the Linguistic Annotation Workshop (LAW). Jeju-do, South Korea (2011)
8. Guo, J., Xu, G., Cheng, X., Li, H.: Named entity recognition in query. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 267–274 (2009)
9. Hammerton, J.: Named entity recognition with long short-term memory. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4. pp. 172–175. Association for Computational Linguistics (2003)
10. Kim, J.H., Woodland, P.C.: A rule-based named entity recognition system for speech input. In: Sixth International Conference on Spoken Language Processing (2000)
11. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6980
12. Kudo, T., Richardson, J.: SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 66–71. Association for Computational Linguistics, Brussels, Belgium (Nov 2018). https://doi.org/10.18653/v1/D18-2012, https://www.aclweb.org/anthology/D18-2012
13. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML). pp. 282–289 (2001)
14. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 260–270. Association for Computational Linguistics, San Diego, California (Jun 2016). https://doi.org/10.18653/v1/N16-1030
15. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering (2020)
16. Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., Li, J.: A unified MRC framework for named entity recognition. arXiv preprint arXiv:1910.11476 (2019)
17. Liao, W., Veeramachaneni, S.: A simple semi-supervised algorithm for named entity recognition. In: Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. pp. 58–65 (2009)
18. Liu, M., Tu, Z., Wang, Z., Xu, X.: LTP: A new active learning strategy for BERT-CRF based named entity recognition. arXiv preprint arXiv:2001.02524 (2020)
19. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
20. Luoma, J., Pyysalo, S.: Exploring cross-sentence contexts for named entity recognition with BERT. arXiv preprint arXiv:2006.01563 (2020)
21. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1064–1074. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1101, https://www.aclweb.org/anthology/P16-1101
22. Martin, L., Muller, B., Ortiz Suárez, P.J., Dupont, Y., Romary, L., de la Clergerie, É.V., Seddah, D., Sagot, B.: CamemBERT: A tasty French language model. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)
23. Mollá, D., Van Zaanen, M., Smith, D.: Named entity recognition for question answering (2006)
24. Moosavi, N.S., Strube, M.: Which coreference evaluation metric do you trust? A proposal for a link-based entity aware metric. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 632–642. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1060, https://www.aclweb.org/anthology/P16-1060
25. Schuster, M., Nakajima, K.: Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5149–5152. IEEE (2012)
26. Suárez, P.J.O., Sagot, B., Romary, L.: Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In: Challenges in the Management of Large Corpora (CMLC-7). p. 9 (2019)
27. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
28. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: COLING (2018)
29. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., Liu, Q.: ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129 (2019)
30. Zhou, G., Su, J.: Named entity recognition using an HMM-based chunk tagger. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 473–480. Association for Computational Linguistics (2002)