<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Projecting Heterogeneous Annotations for Named Entity Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rodrigo Agerri</string-name>
          <email>rodrigo.agerri@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>German Rigau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
        </aff>
      </contrib-group>
      <fpage>45</fpage>
      <lpage>51</lpage>
      <abstract>
<p>In this paper we describe our participation in the CAPITEL at IberLEF 2020 shared task on Named Entity Recognition (NER). Our objectives in participating in the shared task were twofold: (i) to benchmark current rich multilingual representations of text against monolingual models trained specifically for Spanish; and (ii) to study various methods of projecting annotations from several sources into a final target prediction. Our results show that monolingual models, even for a large language such as Spanish, perform better on this particular NER benchmark. Furthermore, our projection method indicates that substantial gains in performance can be obtained by projecting annotations from various heterogeneous sources to obtain the final prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this work we benchmark current multilingual pre-trained language models, such as mBERT and XLM-RoBERTa, against monolingual models trained specifically for Spanish. Furthermore, we project the annotations provided by each system into a final target prediction. The projection of several source annotations into a target is loosely inspired by a method originally designed for the projection of annotations across languages [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Our projection method indicates that substantial gains in performance (around 1.3 points in F1 score) can be obtained by projecting annotations from various heterogeneous sources into a final target prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Deep learning methods in NLP rely on the ability to represent words as continuous vectors in a low-dimensional space, called word embeddings. The first approaches generated static word embeddings [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], namely, they provided a unique vector-based representation for a given word independently of the context in which the word occurs. This means that polysemy cannot be represented. Thus, if we consider the word ‘bank’, static word embedding approaches will generate only one vector representation even though the word may have different senses, namely, ‘financial institution’, ‘bench’, etc.
      </p>
      <p>
        In order to address this problem, contextual word embeddings were proposed. The idea is to generate different word representations according to the context in which the word appears. Currently there are many approaches to generate such contextual word representations, but we will focus on those that have had a direct impact, in terms of performance, on the Named Entity Recognition task. First, Flair [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] representations are built following an LSTM-based architecture and trained as language models. Second, there are the models based on the Transformer architecture [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], of which BERT is perhaps the most popular example [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The multilingual counterpart of BERT, called mBERT, is a single language model pre-trained on corpora in more than 100 languages. Another standout model is XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], also based on the Transformer architecture, which provides a pre-trained language model for 100 languages trained on 2.5TB of Common Crawl text. Both mBERT and XLM-RoBERTa make it possible to transfer knowledge across languages [
        <xref ref-type="bibr" rid="ref13 ref14 ref7">13, 14, 7</xref>
        ], although in this paper we will use them in a monolingual setting for Spanish NER.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Flair</title>
        <p>
          Flair refers both to a system based on a BiLSTM architecture [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and to a specific type of character-based contextual word embeddings. Flair (embeddings and system) has been successfully applied to sequence labeling tasks, obtaining state-of-the-art results on a number of Named Entity Recognition (NER) and Part-of-Speech tagging benchmarks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>Flair embeddings are built from sequences of characters. More specifically, sentences are processed into sequences of characters and fed into a character-level Long Short-Term Memory (LSTM) model. For each sentence, a forward LSTM language model processes its sequence of characters from the beginning of the sentence to the last character of the word being modeled. Furthermore, a backward LSTM performs the same operation, going from the end of the sentence up to the first character of the word. The extracted hidden states contain information propagated from the beginning and the end of the sentence up to the last and the first character of the target word, respectively. Finally, the two resulting hidden states are concatenated to generate the final embedding.</p>
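        <p>The following is a minimal PyTorch sketch of this idea. It is illustrative only, not the actual Flair implementation: dimensions, variable names and the exact character offsets used for extraction are assumptions.</p>
        <preformat>
import torch
import torch.nn as nn

CHAR_VOCAB, CHAR_DIM, HIDDEN_DIM = 256, 50, 512   # illustrative sizes

char_emb = nn.Embedding(CHAR_VOCAB, CHAR_DIM)
forward_lm = nn.LSTM(CHAR_DIM, HIDDEN_DIM, batch_first=True)
backward_lm = nn.LSTM(CHAR_DIM, HIDDEN_DIM, batch_first=True)

def flair_like_embedding(char_ids, word_start, word_end):
    # char_ids: (1, sentence_length) tensor of character ids;
    # word_start / word_end: character offsets of the target word.
    x = char_emb(char_ids)                              # (1, L, CHAR_DIM)
    fwd_out, _ = forward_lm(x)                          # left-to-right pass over the sentence
    bwd_out, _ = backward_lm(torch.flip(x, dims=[1]))   # right-to-left pass
    bwd_out = torch.flip(bwd_out, dims=[1])             # restore the original character order
    h_fwd = fwd_out[0, word_end]     # forward state after the last character of the word
    h_bwd = bwd_out[0, word_start]   # backward state at the first character of the word
    return torch.cat([h_fwd, h_bwd], dim=-1)            # final contextual word embedding
        </preformat>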
        <p>
          Pooled embeddings are a type of Flair embeddings which take global information into account in order to generate the final word embedding [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In this approach, the contextualized embeddings computed for a word are kept in a memory which is later used in a pooling operation to obtain a global representation of that word. This global representation is obtained by pooling over all the local Flair contextualized embeddings seen so far for the word and is combined with the local embedding of its current occurrence. It should be noted that the pooling operation is applied when fine-tuning the pre-trained Flair models, not when training the language models themselves. We use the default pooling operation, min, which computes a vector of the element-wise minimum values [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
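        <p>A small sketch of the min pooling step follows, assuming a memory that stores every contextual embedding previously computed for a word; this is illustrative and not the Flair implementation.</p>
        <preformat>
import torch
from collections import defaultdict

memory = defaultdict(list)   # word -> list of contextual embeddings seen so far

def pooled_embedding(word, local_embedding):
    # Concatenate the current contextual embedding with the element-wise
    # minimum over all embeddings stored for this word so far (min pooling).
    memory[word].append(local_embedding)
    pooled = torch.stack(memory[word]).min(dim=0).values   # element-wise minimum
    return torch.cat([local_embedding, pooled], dim=-1)
        </preformat>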
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Transformers</title>
        <p>
          LSTM-based language models such as the one presented in the previous section have difficulty capturing long-range sequence information. Furthermore, they are quite hard to train at a large scale (see [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], especially Figure 7). In order to address these issues, the Transformer architecture was proposed [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], based on multi-headed self-attention and positional encoding. The most popular Transformer is BERT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which pre-trains a Transformer encoder on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. BERT is composed of stacked layers of Transformer encoders [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. More specifically, in this paper we use the BERT BASE configuration, which contains 12 Transformer encoder layers, a hidden size of 768 and 12 self-attention heads, for a total of 110M parameters.
        </p>
        <p>The MLM task is designed as follows: for an input sequence of n tokens t1, t2, ..., tn, 15% of the tokens are selected as masking candidates. Of those candidates, 80% are masked (they are replaced with the [MASK] token), 10% are replaced by a random word and the remaining 10% are left unchanged. For the NSP task, two segments A and B are selected from the training corpus. In 50% of the cases B is the true next segment following A; for the rest, B is just a random segment. The model is trained to optimize the sum of the means of the MLM and NSP likelihoods.</p>
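        <p>As an illustration of the 15% / 80-10-10 masking scheme described above, the following sketch applies it to a list of token ids. It is not BERT's actual implementation; the [MASK] id and vocabulary size are assumptions.</p>
        <preformat>
import random

MASK_ID = 103          # assumption: id of the [MASK] token
VOCAB_SIZE = 30000     # assumption: vocabulary size

def mask_tokens(token_ids):
    # BERT-style MLM masking: 15% of tokens are selected; of those,
    # 80% become [MASK], 10% become a random token, 10% are left unchanged.
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100: not predicted
    for i, tok in enumerate(token_ids):
        if random.random() &lt; 0.15:                 # select as a masking candidate
            labels[i] = tok                         # the model must recover the original token
            r = random.random()
            if r &lt; 0.8:
                inputs[i] = MASK_ID                 # 80%: replace with [MASK]
            elif r &lt; 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
        </preformat>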
      <p>
        It should be noted that the benefits of the NSP task during the pre-training process have been questioned [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 19, 20</xref>
        ]. Thus, other Transformer proposals such as RoBERTa train without the NSP task, showing strong performance on the same downstream tasks.
      </p>
      <p>
        XLM-RoBERTa relies exclusively on the MLM objective. The biggest update that XLM-RoBERTa offers is a significantly increased amount of training data, namely 2.5TB of clean Common Crawl data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As with BERT, in this paper we use the base version of XLM-RoBERTa, the reason being that the base versions fit into a standard GPU card with 12GB of RAM for fine-tuning.
      </p>
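        <p>For orientation, a hedged sketch of loading such a base checkpoint for token-classification fine-tuning with the Hugging Face transformers library is shown below; the library choice and the number of labels are assumptions, not necessarily the setup used in our experiments.</p>
        <preformat>
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",   # the base checkpoint fits a 12GB GPU for fine-tuning
    num_labels=9,         # assumption: size of the BIO/BILOU tag set used
)
        </preformat>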
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>Named entities were originally annotated using the BIO encoding, which identifies the Beginning, the Inside and the Outside of named entities. Later on, the BILOU model (nowadays also known as the BIOES encoding: Beginning, Inside, Outside, End of entity and Single entity) was proposed to mark tokens as the Beginning, the Inside and the Last tokens of multi-token entities, as well as Unit-length entities [21]. Although the CAPITEL corpus is originally released using the BILOU model, we experiment with both types of encoding.</p>
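      <p>For illustration, a small helper that maps BILOU (BIOES) tags to BIO, which is one way the BIO-encoded experiments can be derived from the original CAPITEL annotations; tag prefixes are assumed to follow the standard scheme.</p>
      <preformat>
def bilou_to_bio(tags):
    # Map BILOU/BIOES tags to BIO: U-/S- (single-token entity) becomes B-,
    # L-/E- (last token of an entity) becomes I-; B-, I- and O are unchanged.
    mapping = {'U': 'B', 'S': 'B', 'L': 'I', 'E': 'I'}
    out = []
    for tag in tags:
        if tag == 'O':
            out.append(tag)
        else:
            prefix, label = tag.split('-', 1)
            out.append(mapping.get(prefix, prefix) + '-' + label)
    return out

# Example: a two-token organisation followed by a single-token location.
# ['B-ORG', 'L-ORG', 'O', 'U-LOC']  becomes  ['B-ORG', 'I-ORG', 'O', 'B-LOC']
print(bilou_to_bio(['B-ORG', 'L-ORG', 'O', 'U-LOC']))
      </preformat>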
      <p>The CAPITEL corpus (Corpus del Plan de Impulso a las Tecnologías del Lenguaje) has been developed by the PlanTL, the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy. These organizations signed an agreement to develop a linguistically annotated corpus of Spanish news articles, with the objective of extending the language resource infrastructure for the Spanish language. CAPITEL is composed of contemporary news articles and contains annotations for Universal Dependencies and Named Entities. The NER portion of the corpus contains around one million words.</p>
      <p>For the experiments performed for this paper, we use a number of publicly available models:
1. Multilingual BERT (mBERT).
2. XLM-RoBERTa (base).
3. BETO, a monolingual Spanish BERT trained with Wikipedia and Spanish data from the OPUS corpus [22].
4. Official Flair models for Spanish.</p>
      <p>Additionally, we trained the following monolingual language models for Spanish:
1. Flair-GW: a Flair character-based language model trained on the Spanish Wikipedia and the Gigaword 3rd edition corpus, containing around 11GB of text.
2. Flair-Oscar: a Flair language model trained on the OSCAR Spanish corpus [23], which contains 157GB of Common Crawl text, cleaned and deduplicated.</p>
      <p>The Flair embeddings for Flair-GW and Flair-Oscar were trained with the following parameters: a hidden size of 2048, a sequence length of 250, and a mini-batch size of 100. The rest of the parameters were left at their default settings. For Flair-GW, training was performed for 5 epochs over the full training corpus and took around 5 days on an Nvidia Titan V GPU. For Flair-Oscar, only one epoch was performed, requiring around a month to complete.</p>
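      <p>A minimal sketch of how such a character-level Flair language model can be trained with the Flair library, using the hyperparameters listed above, is shown below. It assumes a Flair version contemporary with this work; the corpus and output paths are placeholders, and this is not necessarily the exact training script we used.</p>
      <preformat>
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

is_forward_lm = True                       # a backward model is trained analogously
dictionary = Dictionary.load('chars')      # default character dictionary

# corpus directory with a train/ split plus valid.txt and test.txt (placeholder path)
corpus = TextCorpus('corpora/es_gigaword', dictionary, is_forward_lm,
                    character_level=True)

# hidden size 2048 as reported above
language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=2048, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/flair_es_forward',
              sequence_length=250,         # sequence length of 250
              mini_batch_size=100,         # mini-batch size of 100
              max_epochs=5)                # 5 epochs for Flair-GW, 1 for Flair-Oscar
      </preformat>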
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>combined with the FastText embeddings trained on Wikipedia. In fact, Flair-Oscar was the best single system by a substantial margin. Apart from this, S2 and S3 show the small gains obtained by adding the 10 percent of the data used for development to the training set for the final evaluation. Furthermore, S3 was trained when the Oscar language model had only been trained for half an epoch, whereas S4 was trained using the final Oscar language model based on one full epoch. Finally, S5 is the same model as S1 but using the BIO encoding instead of the original BILOU encoding of the CAPITEL corpus. The best overall individual system was S4, significantly outperforming the multilingual and monolingual Transformer models.</p>
      <p>With respect to the Transformer models, it can be seen that in general their results are lower than those obtained by the Flair-Oscar models. During the development phase they all performed very closely, although in the final, official results XLM-RoBERTa was slightly superior to the rest. Furthermore, the results also show that mBERT performed worst and that XLM-RoBERTa obtains very similar results to the monolingual models.</p>
      <p>The last three rows of Table 1 report the three best projections. Once we had the best 8 systems, we proceeded to project their predictions by means of every possible combination of those 8 systems. The three best projections were picked based on two criteria: the F1 score obtained on the development data and the number of No-agreements recorded by each projection.</p>
      <p>The projections were performed using 5 predictions as source. We tested various strategies, and the one we finally used to report the final results was, interestingly enough, the simplest of them all: a voting scheme based on the number of agreements between the predicted labels of the 5 source annotations. If the agreement is &gt;= 3, the agreed label is projected; otherwise, “O” is projected.</p>
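      <p>A sketch of this projection rule for aligned token-level predictions follows; the label names and the helper functions are illustrative, not our exact implementation.</p>
      <preformat>
from collections import Counter

def project_token(source_labels, min_agreement=3):
    # Project the most frequent label among the source predictions if at
    # least min_agreement sources agree on it; otherwise project 'O'.
    label, count = Counter(source_labels).most_common(1)[0]
    return label if count &gt;= min_agreement else 'O'

def project_sentence(source_predictions):
    # source_predictions: list of 5 label sequences for the same sentence.
    return [project_token(labels) for labels in zip(*source_predictions)]

# Example with 5 source systems and 3 tokens:
preds = [['B-PER', 'I-PER', 'O'],
         ['B-PER', 'I-PER', 'O'],
         ['B-PER', 'O',     'O'],
         ['O',     'I-PER', 'B-LOC'],
         ['B-PER', 'I-PER', 'O']]
print(project_sentence(preds))   # ['B-PER', 'I-PER', 'O']
      </preformat>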
      <p>As we could not compute F1 scores on the official test set released by the shared task, we simply picked the projection which recorded the fewest No-agreements. This corresponds to the best overall system (P3), which uses S3, S4, S6, S7 and S8 as sources to obtain the final prediction.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Concluding Remarks</title>
      <p>In this paper we have described the experiments performed for our participation in the CAPITEL 2020 shared task on Named Entity Recognition. Even though the best results are obtained by the Flair-Oscar monolingual models, our results indicate that multilingual pre-trained models such as XLM-RoBERTa are performing increasingly close to monolingual models for a large-resourced language such as Spanish. Furthermore, we also show the benefits of projecting named entity annotations from various heterogeneous sources in order to substantially improve performance (around 1.3 points in F1 score over the best individual system).</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Spanish Ministry of Science, Innovation and Universities (DeepReading RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE) and by Ayudas Fundación BBVA a Equipos de Investigación Científica 2018 (BigKnowledge). Rodrigo Agerri is funded by the RYC-2017-23647 fellowship and acknowledges the donation of a Titan V GPU by the NVIDIA Corporation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          ,
          <article-title>Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition</article-title>
          ,
          <source>in: Proceedings of CoNLL-2002</source>
          , Taipei, Taiwan,
          <year>2002</year>
          , pp.
          <fpage>155</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          , F. De Meulder,
          <article-title>Introduction to the CoNLL-2003 shared task: Languageindependent named entity recognition</article-title>
          ,
          <source>in: Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          ,
          <source>in: COLING</source>
          <year>2018</year>
          , 27th International Conference on Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , arXiv preprint arXiv:1911.02116 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          , I. San Vicente,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Saralegi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          , E. Agirre,
          <article-title>Give your text representation models some love: the case for basque</article-title>
          ,
          <source>in: Proceedings of The 12th Language Resources and Evaluation Conference (LREC</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>4781</fpage>
          -
          <lpage>4788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Karthikeyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayhew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Cross-lingual ability of multilingual bert: An empirical study</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Porta-Zamorano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <article-title>Overview of CAPITEL Shared Tasks at IberLEF 2020: NERC and Universal Dependencies Parsing</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2020</year>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chung</surname>
          </string-name>
          , I. Aldabe,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aranberri</surname>
          </string-name>
          , G. Labaka, G. Rigau,
          <article-title>Building named entity recognition taggers via parallel corpora</article-title>
          ,
          <source>in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ),
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (</article-title>
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Heinzerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          ,
          <article-title>Sequence tagging with contextual and non-contextual subword representations: A multilingual evaluation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schlinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          ,
          <article-title>How multilingual is multilingual bert?</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4996</fpage>
          -
          <lpage>5001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Bidirectional LSTM-CRF Models for Sequence Tagging</article-title>
          ,
          <year>2015</year>
          . arXiv:1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Pooled contextualized embeddings for named entity recognition</article-title>
          ,
          <source>in: NAACL</source>
          <year>2019</year>
          ,
          <article-title>2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics</article-title>
          ,
          <year>2019</year>
          , p.
          <fpage>724</fpage>
          -
          <lpage>728</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          , T. B.
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Scaling laws for neural language models</article-title>
          , arXiv preprint arXiv:2001.08361 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. Carbonell, R. Salakhutdinov,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>XLNet: Generalized Autoregressive Pretraining for Language Understanding</article-title>
          , arXiv preprint arXiv:1906.08237 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291 (2019).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, 2009, pp. 147-155.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: LREC, volume 2012, 2012, pp. 2214-2218.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. J. Ortiz Suárez, B. Sagot, L. Romary, Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures, in: Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), Cardiff, 22 July 2019, 2019, pp. 9-16.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>