<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Projecting Heterogeneous Annotations for Named Entity Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rodrigo</forename><surname>Agerri</surname></persName>
							<affiliation key="aff0">
<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">German</forename><surname>Rigau</surname></persName>
							<affiliation key="aff0">
<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Projecting Heterogeneous Annotations for Named Entity Recognition</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D771628F98A471B96377B831E0B9C145</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T04:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Named Entity Recognition</term>
					<term>Information Extraction</term>
					<term>Natural Language Processing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe our participation in the CAPITEL shared task on Named Entity Recognition (NER) at IberLEF 2020. Our objectives in participating in the shared task were twofold: (i) to benchmark current rich multilingual representations of text against monolingual models trained specifically for Spanish; (ii) to study various methods of projecting annotations from several sources into a final target prediction. Our results show that monolingual models, even for a high-resource language such as Spanish, perform better on this particular NER benchmark. Furthermore, our projection method indicates that substantial gains in performance can be obtained by projecting annotations from various heterogeneous sources to obtain the final prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Named Entity Recognition (NER) is a widely studied Natural Language Processing (NLP) task. Briefly, the task involves annotating any mentions of entities (usually proper names) occurring in running text. The most common annotated corpora for NER focus on four types of named entities: Locations, Organizations, Persons and Other (Miscellaneous) entities. Spanish NER has been well studied, as Spanish was one of the languages proposed in the CoNLL NER shared tasks <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>.</p><p>As for many other NLP tasks, the current best performing models for NER are those based on large pre-trained language models, which allow building rich representations of text based on contextual word embeddings. These approaches are based on character-based models like Flair <ref type="bibr" target="#b2">[3]</ref> or masked language models like BERT <ref type="bibr" target="#b3">[4]</ref>. Furthermore, multilingual versions of these models have been trained: the multilingual version of BERT <ref type="bibr" target="#b3">[4]</ref> was trained for 104 languages. More recently, XLM-RoBERTa <ref type="bibr" target="#b4">[5]</ref> was trained for 100 languages.</p><p>These publicly available multilingual deep learning models excel in tasks involving high-resource languages such as English, but their performance drops when applied to low-resource languages <ref type="bibr" target="#b5">[6]</ref>. This may occur for a number of reasons. First, each language has to share the quota of substrings and parameters with the rest of the languages represented in the pre-trained multilingual model. As the quota of substrings partially depends on corpus size, larger languages such as English or Spanish are better represented than lower-resource languages such as Basque <ref type="bibr" target="#b5">[6]</ref>. 
Moreover, multilingual models also seem to perform better for structurally similar languages <ref type="bibr" target="#b6">[7]</ref>.</p><p>In our submission for the CAPITEL 2020 NER task <ref type="bibr" target="#b7">[8]</ref> we leverage both these multilingual models and monolingual models trained specifically for Spanish. Furthermore, we project the annotations provided by each system into a final target prediction. The projection of several source annotations into a target is loosely inspired by a method originally designed for projecting annotations across languages <ref type="bibr" target="#b8">[9]</ref>. Our projection method indicates that substantial gains in performance (around 1.3 points in F1 score) can be obtained by projecting annotations from various heterogeneous sources into a final target prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Deep learning methods in NLP rely on the ability to represent words as continuous vectors in a low-dimensional space, called word embeddings. The first approaches generated static word embeddings <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>, namely, they provided a unique vector-based representation for a given word, independently of the context in which the word occurs. This means that polysemy cannot be represented. Thus, if we consider the word 'bank', static word embedding approaches will generate only one vector representation even though the word may have different senses, namely, 'financial institution', 'bench', etc.</p><p>In order to address this problem, contextual word embeddings were proposed. The idea is to generate different word representations according to the context in which the word appears. Currently there are many approaches to generate such contextual word representations, but we will focus on those that have had a direct impact, in terms of performance, on the Named Entity Recognition task. First, Flair <ref type="bibr" target="#b2">[3]</ref> representations are built following an LSTM-based architecture and trained as language models. Second, models based on the Transformer architecture <ref type="bibr" target="#b11">[12]</ref>, of which BERT is perhaps the most popular example <ref type="bibr" target="#b3">[4]</ref>.</p><p>The multilingual counterpart of BERT, called mBERT, is a single language model pre-trained from corpora in more than 100 languages. Another standout model is XLM-RoBERTa <ref type="bibr" target="#b4">[5]</ref>, also based on the Transformer architecture, which provides a pre-trained language model for 100 languages trained on 2.5 TB of Common Crawl text. 
Both mBERT and XLM-RoBERTa enable knowledge transfer across languages <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b6">7]</ref>, although in this paper we use them in a monolingual setting for Spanish NER.</p></div>
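The limitation of static embeddings described above can be illustrated with a toy sketch (the vectors and vocabulary are made up for illustration and do not come from any of the cited systems): a static lookup table assigns 'bank' the identical vector regardless of context, so its senses are indistinguishable.

```python
# Toy static embedding table: one vector per surface form (made-up values).
static = {"bank": [0.1, 0.9], "river": [0.8, 0.2], "fee": [0.0, 1.0]}

def embed_static(sentence):
    # A static lookup ignores context entirely: same word, same vector.
    return [static.get(word, [0.0, 0.0]) for word in sentence.split()]

# 'bank' as a financial institution vs. the bank of a river:
v1 = embed_static("the bank charged a fee")[1]
v2 = embed_static("the river bank was muddy")[2]
# v1 and v2 are identical, so the two senses collapse into one representation.
```

A contextual model, by contrast, would produce different vectors for the two occurrences because its representation is a function of the whole sentence.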
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Flair</head><p>Flair refers both to a system based on a BiLSTM architecture <ref type="bibr" target="#b14">[15]</ref> and to a specific type of character-based contextual word embeddings. Flair (embeddings and system) has been successfully applied to sequence labeling tasks, obtaining state-of-the-art results on a number of Named Entity Recognition (NER) and Part-of-Speech tagging benchmarks <ref type="bibr" target="#b2">[3]</ref>.</p><p>Flair embeddings are built from sequences of characters. More specifically, sentences are processed as sequences of characters and fed into a character-level Long Short-Term Memory (LSTM) model. For each sentence, a forward LSTM language model processes its sequence of characters from the beginning of the sentence to the last character of the word we are modeling. Furthermore, a backward LSTM performs the same operation going from the end of the sentence up to the first character of the word. The extracted hidden states contain information propagated from the end and the beginning of the sentence up to the first and the last character of the target word. Finally, the resulting two hidden states are concatenated to generate the final embedding.</p><p>Pooled embeddings are a type of Flair embeddings which consider global information in order to generate the final word embedding <ref type="bibr" target="#b15">[16]</ref>. In this approach embeddings are kept in a memory which is later used in a pooling operation to obtain a global word representation. This representation is then concatenated with the local Flair contextualized embedding obtained for a given word. It should be noted that the pooling operation is involved in fine-tuning the Flair pre-trained models, not in training the language models themselves. 
We use the default pooling operation, min, which computes the vector of element-wise minimum values over all stored embeddings <ref type="bibr" target="#b15">[16]</ref>.</p></div>
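As a minimal sketch of the min pooling just described (variable and function names are ours, not from the Flair codebase), the global representation is the element-wise minimum over the memory of contextual embeddings collected for a word, which is then concatenated with the local embedding:

```python
import numpy as np

def min_pool(memory):
    # memory: list of contextual embeddings seen so far for the same word.
    # The pooled vector takes the element-wise minimum across all of them.
    return np.min(np.stack(memory), axis=0)

# Two contextual embeddings of the same word from different sentences
# (made-up 3-dimensional vectors for illustration):
memory = [np.array([0.5, -1.0, 2.0]), np.array([0.2, 0.3, -0.5])]
pooled = min_pool(memory)                 # element-wise minimum
local = memory[-1]                        # the current local embedding
final = np.concatenate([local, pooled])   # pooled-Flair-style representation
```

Here `pooled` is `[0.2, -1.0, -0.5]`, and `final` has twice the dimensionality of a single embedding because of the concatenation.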
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Transformers</head><p>LSTM-based language models such as the one presented in the previous section cannot capture long-range sequence information. Furthermore, they are quite hard to train at a large scale (see <ref type="bibr" target="#b16">[17]</ref>, especially Figure <ref type="figure">7</ref>). In order to address these issues, the Transformer architecture was proposed <ref type="bibr" target="#b11">[12]</ref>, based on multi-headed self-attention and positional encoding. The most popular Transformer is BERT <ref type="bibr" target="#b3">[4]</ref>, which pre-trains a Transformer encoder on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. BERT is composed of stacked layers of Transformer encoders <ref type="bibr" target="#b11">[12]</ref>. More specifically, in this paper we use the BERT BASE configuration, which contains 12 Transformer encoder layers, a hidden size of 768 and 12 self-attention heads, for a total of 110M parameters.</p><p>The MLM task is designed as follows: for an input sequence of 𝑛 tokens 𝑥1, 𝑥2, ..., 𝑥𝑛, 15% are selected as masking candidates. Of those candidates, 80% are masked (replaced with the [MASK] token), 10% are replaced by a random word and the remaining 10% are left unchanged. For the NSP task, two segments are selected from the training corpus, 𝐴 and 𝐵. In 50% of the cases 𝐵 is the true next segment for 𝐴; for the rest, 𝐵 is just a random segment. The model is trained to optimize the sum of the means of the MLM and NSP likelihoods.</p><p>It should be noted that the benefits of the NSP task during pre-training have been questioned <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20]</ref>. 
Thus, other Transformer proposals such as RoBERTa train without the NSP task, showing strong performance on the same downstream tasks.</p><p>XLM-RoBERTa relies exclusively on the MLM objective. The biggest update that XLM-RoBERTa offers is a significantly increased amount of training data: 2.5 TB of clean Common Crawl text <ref type="bibr" target="#b4">[5]</ref>. As with BERT, in this paper we use the base version of XLM-RoBERTa, since the base versions fit into a standard GPU card with 12 GB of RAM for fine-tuning.</p></div>
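The 15% / 80-10-10 masking procedure described above can be sketched as follows. This is a simplified illustration, not BERT's actual implementation; the function name, vocabulary argument and rates are ours.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Return (masked sequence, labels); labels[i] holds the original token
    only at positions selected for prediction, and None elsewhere."""
    rng = rng or random.Random(0)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() > mask_rate:       # ~85%: not a masking candidate
            continue
        labels[i] = tok                    # the model must predict this token
        r = rng.random()
        if r >= 0.9:                       # 10% of candidates: left unchanged
            continue
        if r >= 0.8:                       # 10%: replaced by a random token
            masked[i] = rng.choice(vocab)
        else:                              # 80%: replaced by [MASK]
            masked[i] = "[MASK]"
    return masked, labels
```

Keeping 10% of candidates unchanged forces the model to produce useful representations for every input token, not only for the literal [MASK] symbol.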
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experimental Setup</head><p>Named entities were originally annotated using the BIO encoding, which identifies the Beginning, the Inside and the Outside of named entities. Later on, the BILOU model<ref type="foot" target="#foot_0">1</ref> was proposed to mark tokens as the Beginning, the Inside and the Last tokens of multi-token entities, as well as Unit-length entities <ref type="bibr" target="#b20">[21]</ref>. Although the CAPITEL corpus is originally released using the BILOU model, we experiment with both types of encoding.</p><p>The CAPITEL corpus (Corpus del Plan de Impulso a las Tecnologías del Lenguaje) has been developed by the PlanTL, the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy. These organizations signed an agreement to develop a linguistically annotated corpus of Spanish news articles, with the objective of extending the language resource infrastructure for the Spanish language. CAPITEL is composed of contemporary news articles and contains annotations for Universal Dependencies and Named Entities. The NER portion of the corpus contains around one million words.</p><p>For the experiments performed for this paper, we use a number of publicly available models: 1. Multilingual BERT (mBERT); 2. XLM-RoBERTa (base); 3. BETO, a monolingual Spanish BERT trained with Wikipedia and Spanish data from the OPUS corpus [22]; 4. the official Flair models for Spanish. Additionally, we trained the following monolingual language models for Spanish:</p><p>1. Flair-GW: Flair character-based language model trained on the Spanish Wikipedia and the Gigaword 3rd edition corpus, containing around 11GB of text. 2. Flair-Oscar: Flair language model trained on the OSCAR Spanish corpus <ref type="bibr" target="#b22">[23]</ref>, which contains 157GB of Common Crawl text, cleaned and deduplicated.</p><p>The Flair embeddings for Flair-GW and Flair-Oscar were trained with the following parameters: hidden size 2048, sequence length of 250, and a mini-batch size of 100. The rest of the parameters were left at their default settings. 
For Flair-GW, training was done for 5 epochs over the full training corpus and took around 5 days on an Nvidia Titan V GPU. With respect to Flair-Oscar, only one epoch was performed, requiring around a month to complete.</p></div>
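Converting between the two encodings mentioned above is mechanical. As a small sketch (our own helper, not part of the CAPITEL or Flair tooling), BILOU tags map onto BIO by rewriting Unit tags as Begin tags and Last tags as Inside tags:

```python
def bilou_to_bio(tags):
    """Map a sequence of BILOU (a.k.a. BIOES) tags to BIO tags."""
    out = []
    for tag in tags:
        if tag.startswith("U-"):
            out.append("B-" + tag[2:])   # unit-length entity -> Begin
        elif tag.startswith("L-"):
            out.append("I-" + tag[2:])   # last token of entity -> Inside
        else:
            out.append(tag)              # B-, I- and O are unchanged
    return out

bilou_to_bio(["U-PER", "O", "B-ORG", "I-ORG", "L-ORG"])
# -> ["B-PER", "O", "B-ORG", "I-ORG", "I-ORG"]
```

The reverse direction (BIO to BILOU) additionally needs one token of lookahead to decide whether a token is the last of its entity.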
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Table <ref type="table" target="#tab_0">1</ref> reports only the best results obtained during the experimentation. Each of the S1-S8 results is the average of five randomly initialized runs. Flair models were trained using the default parameters, although we experimented with adding FastText embeddings to the Flair and Pooled embeddings. We used 10 percent of the training data for development of the Flair models. In the case of the Transformer models described in the previous section, we used the full training set for hyperparameter fine-tuning. For XLM-RoBERTa we used a maximum sequence length of 128, a mini-batch size of 16, a 5e-5 learning rate, and 4 epochs. For mBERT and BETO the best results were obtained using the same hyperparameters as for XLM-RoBERTa but increasing the sequence length to 256.</p><p>Out of the many experiments performed with the three Flair language models (Official, GW and Oscar), the best performing language model in every possible configuration was the Flair-Oscar model combined with the FastText embeddings trained on Wikipedia. In fact, Flair-Oscar was the best single system by a substantial margin. Apart from this, S2 and S3 show the small gains obtained by adding the 10 percent held out for development back into the training data for the final evaluation. Furthermore, S3 was trained when the language model had completed half an epoch of training, whereas S4 was trained using the final Oscar language model based on one full epoch. Finally, S5 is the same model as S1 but using the BIO encoding instead of the original BILOU encoding of the CAPITEL corpus. The best overall individual system was S4, significantly outperforming the multilingual and monolingual Transformer models.</p><p>With respect to the Transformer models, it can be seen that in general their results are lower than those obtained by the Flair-Oscar models. 
During the development phase they all performed very closely, although in the final official results XLM-RoBERTa was slightly superior to the rest. Furthermore, the results also show that mBERT performed worst and that XLM-RoBERTa obtained very similar results to the monolingual models.</p><p>The last three rows of Table <ref type="table" target="#tab_0">1</ref> report the three best projections. Once we had the best 8 systems, we projected their predictions by means of every possible combination of the 8 systems. The best three projections were picked based on two criteria: the F1 score obtained on the development data and the number of no-agreements recorded by each projection.</p><p>The projections were performed using 5 predictions as source. We tested various strategies, and the one we finally used to report the final results was, interestingly enough, the simplest of them all. It relies on the number of agreements between the predicted labels of the 5 source annotations: if the agreement is &gt;= 3, the label is projected; otherwise, "O" is projected.</p><p>As we could not compute F1 scores on the official test set released by the shared task, we simply picked the projection which recorded the fewest no-agreements. This corresponds to the best overall system (P3), which uses S3, S4, S6, S7 and S8 as sources to obtain the final prediction.</p></div>
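The agreement-based projection rule can be sketched as follows (function and variable names are illustrative; the short label sequences stand in for the five source systems' predictions over the same tokens):

```python
from collections import Counter

def project(predictions, min_agreement=3):
    """predictions: list of per-system label sequences, all the same length.
    For each token, project the majority label if enough systems agree,
    otherwise fall back to the outside label "O"."""
    projected = []
    for labels in zip(*predictions):          # labels for one token position
        label, count = Counter(labels).most_common(1)[0]
        projected.append(label if count >= min_agreement else "O")
    return projected

# Five source systems predicting over a three-token sentence:
sources = [
    ["B-PER", "I-PER", "O"],
    ["B-PER", "I-PER", "O"],
    ["B-PER", "O",     "O"],
    ["B-ORG", "I-PER", "B-LOC"],
    ["B-PER", "I-PER", "O"],
]
project(sources)  # -> ["B-PER", "I-PER", "O"]
```

Counting the positions where no label reaches the agreement threshold gives the no-agreement statistic used above to choose among candidate projections.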
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Concluding Remarks</head><p>In this paper we have described the experiments performed for our participation in the CAPITEL 2020 shared task on Named Entity Recognition. Even though the best results are obtained by the Flair-Oscar monolingual models, our results indicate that multilingual pre-trained models such as XLM-RoBERTa are performing increasingly close to monolingual models, even for a high-resource language such as Spanish. Furthermore, we also show the benefits of projecting named entity annotations from various heterogeneous sources in order to substantially improve performance (around 1.3 points in F1 score over the best individual system).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) email: rodrigo.agerri@ehu.eus (R. Agerri) orcid: 0000-0002-7303-7598 (R. Agerri); 0000-0003-1119-0930 (G. Rigau)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Overview results on both development and test data.</figDesc><table><row><cell></cell><cell></cell><cell>Development</cell><cell></cell><cell></cell><cell>Test</cell><cell></cell></row><row><cell>System</cell><cell cols="6">Precision Recall F1 score Precision Recall F1 score</cell></row><row><cell>S1 Flair-Oscar + FT</cell><cell>89.65</cell><cell>89.36</cell><cell>89.51</cell><cell>88.86</cell><cell>88.63</cell><cell>88.74</cell></row><row><cell>S2 Flair-Oscar + FT (dev)</cell><cell>89.67</cell><cell>89.53</cell><cell>89.60</cell><cell>88.97</cell><cell>88.75</cell><cell>88.86</cell></row><row><cell>S3 Pool-Oscar + FT (dev)</cell><cell>89.85</cell><cell>89.63</cell><cell>89.79</cell><cell>89.07</cell><cell>88.85</cell><cell>88.96</cell></row><row><cell>S4 Pool-Oscar + FT e1</cell><cell>89.78</cell><cell>89.72</cell><cell>89.75</cell><cell>89.29</cell><cell>88.82</cell><cell>89.07</cell></row><row><cell>S5 Flair-Oscar + FT BIO</cell><cell>89.71</cell><cell>89.58</cell><cell>89.64</cell><cell>89.19</cell><cell>88.78</cell><cell>88.99</cell></row><row><cell>S6 BETO</cell><cell>89.64</cell><cell>89.34</cell><cell>88.99</cell><cell>87.19</cell><cell>88.36</cell><cell>87.77</cell></row><row><cell>S7 mBERT</cell><cell>87.90</cell><cell>88.90</cell><cell>88.40</cell><cell>87.03</cell><cell>87.75</cell><cell>87.39</cell></row><row><cell>S8 XLM-RoBERTa</cell><cell>88.29</cell><cell>89.54</cell><cell>88.91</cell><cell>87.37</cell><cell>88.48</cell><cell>87.92</cell></row><row><cell>P1 S2-S3-S6-S7-S8</cell><cell>91.32</cell><cell>90.77</cell><cell>91.04</cell><cell>90.70</cell><cell>88.11</cell><cell>89.38</cell></row><row><cell>P2 S2-S4-S6-S7-S8</cell><cell>91.10</cell><cell>90.59</cell><cell>90.84</cell><cell>90.81</cell><cell>88.06</cell><cell>89.42</cell></row><row><cell>P3 
S3-S4-S6-S7-S8</cell><cell>91.19</cell><cell>90.72</cell><cell>90.96</cell><cell>90.50</cell><cell>90.17</cell><cell>90.34</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Nowadays also known as the BIOES encoding: Beginning, Inside, Outside, End of entity and Single entity.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Innovation and Universities (DeepReading RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE) and by Ayudas Fundación BBVA a Equipos de Investigación Científica 2018 (BigKnowledge). Rodrigo Agerri is funded by the RYC-2017-23647 fellowship and acknowledges the donation of a Titan V GPU by the NVIDIA Corporation.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">F</forename><surname>Tjong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kim</forename><surname>Sang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CoNLL-2002</title>
				<meeting>CoNLL-2002<address><addrLine>Taipei, Taiwan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="155" to="158" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Introduction to the CoNLL-2003 shared task: Languageindependent named entity recognition</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">F</forename><surname>Tjong Kim Sang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">De</forename><surname>Meulder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003</title>
				<meeting>the seventh conference on Natural language learning at HLT-NAACL 2003</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="142" to="147" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Contextual string embeddings for sequence labeling</title>
		<author>
			<persName><forename type="first">A</forename><surname>Akbik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Blythe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">27th International Conference on Computational Linguistics</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1638" to="1649" />
		</imprint>
	</monogr>
	<note>COL-ING 2018</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</title>
				<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019<address><addrLine>Minneapolis, MN, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">June 2-7, 2019. 2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Long and Short Papers</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.02116</idno>
		<title level="m">Unsupervised cross-lingual representation learning at scale</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Give your text representation models some love: the case for basque</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agerri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>San</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Vicente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Barrena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Saralegi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Soroa</surname></persName>
		</author>
		<author>
			<persName><surname>Agirre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 12th Language Resources and Evaluation Conference (LREC 2020)</title>
				<meeting>The 12th Language Resources and Evaluation Conference (LREC 2020)</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4781" to="4788" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Cross-lingual ability of multilingual bert: An empirical study</title>
		<author>
			<persName><forename type="first">K</forename><surname>Karthikeyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mayhew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations (ICLR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of CAPITEL Shared Tasks at IberLEF 2020: NERC and Universal Dependencies Parsing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Porta-Zamorano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espinosa-Anke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Iberian Languages Evaluation Forum</title>
				<meeting>the Iberian Languages Evaluation Forum<address><addrLine>IberLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Building named entity recognition taggers via parallel corpora</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agerri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Aldabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Aranberri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Labaka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rigau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</title>
				<meeting>the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3111" to="3119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Enriching word vectors with subword information</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="135" to="146" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Sequence tagging with contextual and non-contextual subword representations: A multilingual evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Heinzerling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Strube</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="273" to="291" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">How multilingual is multilingual BERT?</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pires</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schlinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garrette</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4996" to="5001" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Bidirectional LSTM-CRF Models for Sequence Tagging</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.01991</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Pooled contextualized embeddings for named entity recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Akbik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bergmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Annual Conference of the North American Chapter of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="724" to="728" />
		</imprint>
	</monogr>
	<note>NAACL 2019</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2001.08361</idno>
		<title level="m">Scaling laws for neural language models</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.08237</idno>
		<title level="m">XLNet: Generalized Autoregressive Pretraining for Language Understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<title level="m">RoBERTa: A robustly optimized BERT pretraining approach</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.07291</idno>
		<title level="m">Cross-lingual language model pretraining</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Design challenges and misconceptions in named entity recognition</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ratinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Conference on Computational Natural Language Learning</title>
				<meeting>the Thirteenth Conference on Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="147" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Parallel data, tools and interfaces in OPUS</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="2214" to="2218" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Ortiz Suárez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Romary</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019</title>
				<meeting>the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019<address><addrLine>Cardiff</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="9" to="16" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
