<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of CAPITEL Shared Tasks at IberLEF 2020: Named Entity Recognition and Universal Dependencies Parsing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jordi Porta-Zamorano</string-name>
          <email>porta@rae.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Espinosa-Anke</string-name>
          <email>espinosa-anke@cardiff.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Estudios de la Real Academia Española</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Informatics, Cardiff University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>31</fpage>
      <lpage>38</lpage>
      <abstract>
        <p>We present the results of the CAPITEL-EVAL shared task, held in the context of the IberLEF 2020 competition series. CAPITEL-EVAL consisted of two subtasks: (1) Named Entity Recognition and Classification and (2) Universal Dependency parsing. For both, the source data was a newly annotated corpus, CAPITEL, a collection of Spanish articles in the newswire domain. A total of seven teams participated in CAPITEL-EVAL, with 13 runs submitted across all subtasks. Data, results and further information about this task can be found at sites.google.com/view/capitel2020.</p>
      </abstract>
      <kwd-group>
        <kwd>IberLEF</kwd>
        <kwd>named entity recognition and classification</kwd>
        <kwd>NERC</kwd>
        <kwd>Universal Dependencies parsing</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Sub-task 1: NERC</title>
      <sec id="sec-2-1">
        <title>2.1. Description</title>
        <p>Information extraction tasks, formalized in the late 1980s, are designed to evaluate systems which
capture information present in free text, with the goal of enabling better and faster information and
content access. One important subset of this information comprises named entities (NE), which, roughly
speaking, are textual elements corresponding to names of people, places, organizations and others.
Three processes can be applied to NEs: recognition or identification (NER), categorization, i.e.,
assigning a type according to a predefined set of semantic categories (NERC), and linking, which consists of
disambiguating the in-text mention against a knowledge base or sense inventory (NEL). Since their
advent, NER tasks have had notable success, but despite the relative maturity of this subfield, work and
research continues to evolve, and new techniques and models appear alongside challenging datasets
in different languages, domains and textual genres. The aim of this sub-task, thus, was to challenge
participants to apply their systems or solutions to the problem of identifying and classifying NEs in
Spanish news articles. This two-stage process falls within the NERC evaluation framework.</p>
        <p>
          The following NE categories were evaluated: Person (PER), Location (LOC), Organization (ORG)
and Other (OTH), as defined in the Annotation Guidelines [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] that were shared with participants. The
criteria for the identification and classification of entities were based on the capitalization chapter of
the Spanish language orthography [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Contextual meaning was taken into account in the
classification of entities, so that an entity such as Madrid can be classified as PER (a surname), LOC (the
city), ORG (the football team) or even OTH (a book title). Moreover, in terms of nesting, only the
longest-spanning entities were considered, and coordinated entities were treated as one single entity,
except for those where the name indicating the nature of the NE is used in the plural to introduce several
entities ([Islas Baleares]LOC y [Canarias]LOC).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
        <p>A one-million-word subset of the CAPITEL corpus was randomly sampled into three subsets:
training, development and test. The training set comprises 60% of the corpus, whereas the development
and test sets roughly amount to 20% each. Descriptive statistics for these splits are provided in Table 1.
Together with the test set release, an additional collection of documents (background set) was
delivered to ensure that participating teams were not able to perform manual corrections, and also to
encourage features such as scalability to larger data collections. Finally, all documents were tokenized
and tagged with NEs following the IOBES format.</p>
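        <p>For illustration, the following minimal sketch (a hypothetical sentence and tags, not taken from the corpus) shows how a tokenized sentence is labelled under IOBES, where S- marks a single-token entity, B-/I-/E- mark the beginning, inside and end of a multi-token entity, and O marks tokens outside any entity:</p>
        <preformat>
# Hypothetical IOBES-tagged sentence (our illustration, not corpus data).
tokens = ["El", "Real", "Madrid", "visitó", "Islas", "Baleares", "."]
tags   = ["O",  "B-ORG", "E-ORG", "O",      "B-LOC", "E-LOC",    "O"]
</preformat>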
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Evaluation Metrics</title>
        <p>The metrics used for evaluation were Precision (the percentage of named entities in the system’s
output that are correctly recognized and classified), Recall (the percentage of named entities in the
test set that were correctly recognized and classified) and macro-averaged F1 score (the harmonic
mean of Precision and Recall), with the latter being used as the official evaluation score and for the
final ranking of the participating teams.</p>
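        <p>As an illustration, the sketch below computes entity-level scores with the open-source seqeval library; this is our own example and not necessarily the official scorer:</p>
        <preformat>
# Sketch: entity-level Precision/Recall/F1 over IOBES sequences with seqeval.
from seqeval.metrics import f1_score, precision_score, recall_score

gold = [["B-LOC", "E-LOC", "O", "S-PER"]]  # hypothetical gold tags
pred = [["B-LOC", "E-LOC", "O", "O"]]      # hypothetical system output

print(precision_score(gold, pred))            # fraction of predicted entities that are correct
print(recall_score(gold, pred))               # fraction of gold entities that are found
print(f1_score(gold, pred, average="macro"))  # macro F1, the ranking metric in this sub-task
</preformat>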
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Systems and Results</title>
        <p>We had 22 registrations and 5 final participants, who submitted 9 systems and 4 system description papers.</p>
        <p>
          The Ragerri Team from HiTZ Center-Ixa UPV/EHU presents in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] the combination of several
systems based on Flair [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and Transformer architectures [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. They perform experiments with
Multilingual BERT (mBERT), XLM-RoBERTa (base), BETO (a BERT-based model pre-trained on
Spanish texts [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]), off-the-shelf Flair models for Spanish, and a monolingual model trained on
the OSCAR corpus. All the individual systems’ F1 scores were within 88.29–89.95%, and the combination of
five of them using a simple three-vote agreement scheme achieved the first rank with 90.30% F1.
        </p>
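        <p>A minimal sketch of such an agreement-based combination (our reconstruction, assuming token-level voting over aligned tag sequences):</p>
        <preformat>
# Sketch: keep a tag when at least 3 of the 5 systems agree on it (assumed scheme).
from collections import Counter

def combine(predictions, min_agreement=3):
    """predictions: one tag sequence per system, all over the same tokens."""
    combined = []
    for token_tags in zip(*predictions):
        tag, votes = Counter(token_tags).most_common(1)[0]
        combined.append(tag if votes >= min_agreement else "O")
    return combined

systems = [
    ["B-LOC", "E-LOC", "O"],
    ["B-LOC", "E-LOC", "O"],
    ["B-ORG", "E-ORG", "O"],
    ["B-LOC", "E-LOC", "O"],
    ["O", "O", "O"],
]
print(combine(systems))  # ['B-LOC', 'E-LOC', 'O']
</preformat>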
        <p>
          The Vicomtech Team presents in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] a system based on the BERT architecture and several
experiments using multilingual BERT (mBERT) and BETO pre-trained models. The BERT models are used to
give each token a contextual embedding, which is then passed to a fully connected layer to classify
the token. Their work also addresses several interesting issues with the BETO vocabulary
and tokenizer, namely: punctuation marks missing from the tokenizer’s vocabulary and problems with
certain diacritics and characters. Their systems were fine-tuned with CAPITEL training data, and their
results were 2–3% F1 below the best-performing system.
        </p>
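        <p>A minimal sketch of this token-classification setup with the Hugging Face transformers API (the label count and other details are our assumptions):</p>
        <preformat>
# Sketch: BETO with a token-classification head (a fully connected layer on top).
from transformers import AutoModelForTokenClassification, AutoTokenizer

name = "dccuchile/bert-base-spanish-wwm-cased"  # BETO; mBERT would be "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name,
    num_labels=17,  # assumed: IOBES tags for PER/LOC/ORG/OTH plus O
)
inputs = tokenizer("El Real Madrid visitó Canarias.", return_tensors="pt")
logits = model(**inputs).logits  # one score per label for every word piece
</preformat>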
        <p>
          The Yanghao Team from Huawei Translation Service Center presents in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] a system that uses
Multilingual BERT as the encoder and a linear layer as the classifier, and is trained with an additional 38,000
sentences from the WMT news translation corpus [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] annotated using spaCy [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Their experimental
results suggest that pre-training on the augmented set and then fine-tuning on CAPITEL improves
performance compared to training on either dataset separately or on a mixture of both.
        </p>
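        <p>A sketch of how such silver-standard annotations can be produced with spaCy (the specific Spanish model below is our assumption; the team's exact pipeline is described in [<xref ref-type="bibr" rid="ref8">8</xref>]):</p>
        <preformat>
# Sketch: annotating extra sentences with a pre-trained spaCy Spanish pipeline.
import spacy

nlp = spacy.load("es_core_news_lg")  # assumed model; any Spanish NER pipeline would do

def silver_entities(sentence):
    doc = nlp(sentence)
    return [(ent.text, ent.label_) for ent in doc.ents]

print(silver_entities("El Gobierno de España se reunió en Madrid."))
</preformat>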
        <p>
          The Lirondos Team from ISI-USC presents in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] two sequence labelling systems: a CRF model
with handcrafted features and a BiLSTM-CRF model with word and character embeddings. A feature
ablation study demonstrated that all features contribute positively to the CRF model, with word
embeddings being the most informative feature, yielding an F1 score of 84.39%. On the other hand, their
BiLSTM-CRF model obtained an F1 score of 83.01%. An interesting error analysis showed that many
of the errors correspond to OTH entities, contextual annotation of some entities (OTH versus ORG
or LOC versus ORG), nested entities, and person nicknames with unusual typographical shapes.
        </p>
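        <p>For reference, a CRF with handcrafted features of this kind can be sketched with the sklearn-crfsuite library; the feature set below is illustrative only, not the authors' exact features:</p>
        <preformat>
# Sketch: CRF sequence labeller over simple handcrafted token features.
import sklearn_crfsuite

def token_features(tokens, i):
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "is_upper": w.isupper(),
        "suffix3": w[-3:],
        "prev_lower": tokens[i - 1].lower() if i > 0 else "&lt;s&gt;",
    }

# Toy training data (hypothetical); the real systems train on the CAPITEL corpus.
sentences = [["Visitó", "Islas", "Baleares", "."]]
tags = [["O", "B-LOC", "E-LOC", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, tags)
print(crf.predict(X))
</preformat>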
        <p>Finally, the LolaZarra Team was ranked last and did not submit a system description paper.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Sub-task 2: UD Parsing</title>
      <sec id="sec-3-1">
        <title>3.1. Description</title>
        <p>Dependency-based syntactic parsing has become popular in NLP in recent years. One of the reasons
for this popularity is the transparent encoding of predicate-argument structures, which is useful in
many downstream applications. Another reason is that it is better suited than phrase-structure
grammars for languages with free or flexible word order. Universal Dependencies (UD) is a framework for
consistent annotation of grammar (parts of speech, morphological features and syntactic
dependencies) across different human languages. Moreover, the UD initiative is an open community effort with
over 200 contributors which has produced more than 100 treebanks in over 70 languages.</p>
        <p>The aim of this sub-task was to challenge participants to apply their systems or solutions to the
problem of Universal Dependency parsing of Spanish news articles as defined in the Annotation
Guidelines [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] shared with participants.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>A 300,000-word subset of CAPITEL was provided for this sub-task. In addition to head and
dependency relations in CoNLL-U format, this subset was also tokenized and annotated with lemmas and
UD tags and features. Similarly to the NERC dataset, we randomly sampled it into three subsets:
training, development and test. The training set comprises about 50% of the corpus, whereas the
development and test sets roughly amount to 25% each. The description of the data sets can be found in
Table 3. In addition, the distribution of labels in the test set is given in Table 5 along with the results
of the sub-task. Together with the test set release, an additional collection of documents (background
set) was included to ensure that participating teams were not able to perform manual corrections,
and also to encourage features such as scalability to larger data collections.</p>
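        <p>For illustration, a minimal (hypothetical) CoNLL-U fragment with the ten standard columns, including lemma, UD tag, morphological features, head index and dependency relation:</p>
        <preformat>
# sent_id = example-1 (hypothetical sentence, not taken from CAPITEL)
# text = Llueve en Madrid.
1   Llueve   llover   VERB    _   Mood=Ind|Number=Sing|Person=3|Tense=Pres   0   root    _   _
2   en       en       ADP     _   _                                          3   case    _   _
3   Madrid   Madrid   PROPN   _   _                                          1   obl     _   _
4   .        .        PUNCT   _   _                                          1   punct   _   _
</preformat>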
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Metrics</title>
        <p>The metrics for the evaluation phase were Unlabeled Attachment Score (UAS), the percentage of
words that have the correct head, and Labeled Attachment Score (LAS), the percentage of words that
have the correct head and dependency label, with the latter being used as the official evaluation score
and for the final ranking of the participating teams.</p>
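        <p>A minimal sketch of both scores over aligned gold and predicted analyses (our illustration; the official scorer may handle tokenization mismatches differently):</p>
        <preformat>
# Sketch: UAS/LAS over per-word (head, deprel) pairs.
def attachment_scores(gold, pred):
    """gold, pred: lists of (head_index, deprel) pairs, one per word."""
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

gold = [(0, "root"), (3, "case"), (1, "obl"), (1, "punct")]
pred = [(0, "root"), (3, "case"), (1, "nmod"), (1, "punct")]
print(attachment_scores(gold, pred))  # (1.0, 0.75)
</preformat>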
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Systems and Results</title>
        <p>In this sub-task, we had 12 registrations and 2 final participants, who submitted 4 systems and 2 system
description papers.</p>
        <p>
          The Vicomtech Team presents in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] a system based on the BERT architecture and several
experiments using multilingual BERT (mBERT) and BETO pre-trained models. The BERT models are used
to build a matrix of all-vs-all token encoding vectors, which is then passed to several classification layers
predicting the connectivity of tokens and their relation types. Their work also addresses some issues
already discussed above (Section 2.4). Their systems were fine-tuned with CAPITEL training data, and results on
the development set were slightly better using BETO (UAS: 91.540, LAS: 88.410) than mBERT
(UAS: 91.220, LAS: 87.860), so only the BETO results were submitted as their official run.
        </p>
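        <p>The all-vs-all idea can be sketched as follows (our heavily simplified reconstruction; the actual system uses several classification layers for arcs and relation types):</p>
        <preformat>
# Sketch: score every (head, dependent) token pair from contextual embeddings.
import torch

def pair_scores(H, W):
    """H: (n, d) token embeddings from BERT; W: (d, d) learned weights.
    Returns an (n, n) matrix of head-dependent connectivity scores."""
    return H @ W @ H.T

H = torch.randn(5, 768)         # hypothetical embeddings for a 5-token sentence
W = torch.randn(768, 768)       # hypothetical learned parameter
print(pair_scores(H, W).shape)  # torch.Size([5, 5])
</preformat>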
        <p>The MartínLendinez Team presents in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] the combination of the output of different UD parsing
toolkits using a voting scheme, and the augmentation of the training set with 14,305 annotated sentences
from the AnCora annotated corpus [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].² Three different toolkits were selected, not because of their
performance in similar tasks, but for their accessibility and documentation. These toolkits were
UDPipe [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], NLP-Cube [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and Stanza [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. As we can see in the summary provided in Table 4, the final
submitted results were obtained with Stanza trained on CAPITEL (4), Stanza trained on CAPITEL and
AnCora (3), and the combination of the previous two plus NLP-Cube trained on CAPITEL (1).</p>
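        <p>A sketch of this kind of output combination (our reconstruction, assuming identical tokenization across toolkits; note that naive per-word voting does not guarantee a well-formed tree):</p>
        <preformat>
# Sketch: per-word majority vote over (head, deprel) pairs from several parsers.
from collections import Counter

def vote(parses):
    """parses: one list of (head, deprel) pairs per toolkit, same tokenization."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*parses)]

p1 = [(0, "root"), (1, "obj")]
p2 = [(0, "root"), (1, "obl")]
p3 = [(0, "root"), (1, "obj")]
print(vote([p1, p2, p3]))  # [(0, 'root'), (1, 'obj')]
</preformat>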
        <p>As can be seen in Table 4, the results of this sub-task are very tight, with the first and second systems
being only 0.06% apart, and with only 0.193% separating the first and the fourth. The submission by
MartínLendinez was the highest ranked, while Vicomtech’s was the simplest, acknowledged and described by
its authors as a sort of BERT-based baseline. We provide a breakdown of the results by relation type
in Table 5.</p>
        <p>²There is also a discussion of some differences in terms of tokenization and analysis between CAPITEL and AnCora.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>Most of the submitted systems obtained good results overall. In both sub-tasks, the majority of them
use BERT, either multilingual or monolingual, and some systems combine the output of several
models. Augmenting the training data with material from other corpora, or with data produced by other
annotation systems, whether added to the training set or used to fine-tune the models, has also shown
some modest improvements, despite the heterogeneity of the annotations and differences in domain.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgements</title>
      <p>We would especially like to thank David Pérez Fernández, Doaa Samy, and all the people involved
in the PlanTL, for their contribution in making these shared tasks possible, and José-Luis
Sancho-Sánchez and Rafael-J. Ureña-Ruiz from the Centro de Estudios de la RAE for their help in preparing
the data. We would also like to thank the task participants, who provided helpful input to improve
the quality of the dataset and the task itself.</p>
      <table-wrap id="tab5">
        <label>Table 5</label>
        <caption>
          <p>Breakdown of the results of the UD parsing sub-task by relation type: Labeled Attachment Score (LAS) and Unlabeled Attachment Score (UAS), in %.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th>LAS</th><th>UAS</th></tr>
          </thead>
          <tbody>
            <tr><td>acl</td><td>67.27</td><td>80.04</td></tr>
            <tr><td>acl:relcl</td><td>78.19</td><td>75.50</td></tr>
            <tr><td>advcl</td><td>71.02</td><td>78.69</td></tr>
            <tr><td>advmod</td><td>83.75</td><td>86.23</td></tr>
            <tr><td>amod</td><td>94.24</td><td>96.81</td></tr>
            <tr><td>appos</td><td>74.50</td><td>84.40</td></tr>
            <tr><td>aux</td><td>46.72</td><td>48.26</td></tr>
            <tr><td>aux:pass</td><td>83.93</td><td>100.00</td></tr>
            <tr><td>case</td><td>98.24</td><td>98.83</td></tr>
            <tr><td>cc</td><td>92.72</td><td>95.14</td></tr>
            <tr><td>ccomp</td><td>84.21</td><td>90.73</td></tr>
            <tr><td>compound</td><td>45.45</td><td>59.09</td></tr>
            <tr><td>conj</td><td>74.29</td><td>76.37</td></tr>
            <tr><td>cop</td><td>89.84</td><td>93.95</td></tr>
            <tr><td>csubj</td><td>63.96</td><td>82.88</td></tr>
            <tr><td>dep</td><td>3.57</td><td>75.00</td></tr>
            <tr><td>det</td><td>99.17</td><td>99.33</td></tr>
            <tr><td>discourse</td><td>8.33</td><td>77.78</td></tr>
            <tr><td>expl</td><td>41.30</td><td>97.83</td></tr>
            <tr><td>expl:impers</td><td>20.69</td><td>93.10</td></tr>
            <tr><td>expl:pass</td><td>82.50</td><td>99.44</td></tr>
            <tr><td>expl:pv</td><td>74.05</td><td>97.38</td></tr>
            <tr><td>fixed</td><td>65.75</td><td>70.32</td></tr>
            <tr><td>flat</td><td>53.85</td><td>91.54</td></tr>
            <tr><td>flat:foreign</td><td>70.17</td><td>91.44</td></tr>
            <tr><td>goeswith</td><td>0.00</td><td>0.00</td></tr>
            <tr><td>iobj</td><td>72.95</td><td>93.62</td></tr>
            <tr><td>mark</td><td>86.75</td><td>92.12</td></tr>
            <tr><td>mark:iobj</td><td>44.44</td><td>100.00</td></tr>
            <tr><td>mark:mod</td><td>83.69</td><td>93.62</td></tr>
            <tr><td>mark:obj</td><td>55.46</td><td>90.76</td></tr>
            <tr><td>mark:subj</td><td>87.01</td><td>94.61</td></tr>
            <tr><td>nmod</td><td>88.15</td><td>88.28</td></tr>
            <tr><td>nsubj</td><td>89.27</td><td>93.61</td></tr>
            <tr><td>nsubj:pass</td><td>55.17</td><td>96.55</td></tr>
            <tr><td>nummod</td><td>95.36</td><td>97.53</td></tr>
            <tr><td>obj</td><td>90.65</td><td>98.39</td></tr>
            <tr><td>obl</td><td>81.78</td><td>87.48</td></tr>
            <tr><td>obl:agent</td><td>85.11</td><td>100.00</td></tr>
            <tr><td>orphan</td><td>0.00</td><td>70.59</td></tr>
            <tr><td>parataxis</td><td>60.39</td><td>72.99</td></tr>
            <tr><td>punct</td><td>88.02</td><td>88.64</td></tr>
            <tr><td>root</td><td>93.32</td><td>93.23</td></tr>
            <tr><td>xcomp</td><td>72.31</td><td>72.04</td></tr>
            <tr><td>Total</td><td>88.600</td><td>91.773</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Porta-Zamorano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Romeu Fernández</surname>
          </string-name>
          , Esquema de anotación sintáctica de CAPITEL, Technical Report
          , Centro de Estudios de la Real Academia Española,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>RAE</surname>
          </string-name>
          ,
          <string-name>
            <surname>ASALE</surname>
          </string-name>
          , Ortografía de la lengua española,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rigau</surname>
          </string-name>
          ,
          <article-title>Projecting Heterogeneous Annotations for Named Entity Recognition</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Computational Linguistics</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chaperon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish Pre-Trained BERT Model and Evaluation Data</article-title>
          ,
          <source>in: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth International Conference on Learning Representations (ICLR 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>García Pablos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cuadros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zotova</surname>
          </string-name>
          ,
          <article-title>Vicomtech at CAPITEL 2020: Facing Entity Recognition and Universal Dependency Parsing of Spanish News Articles with BERT models</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>System Report of HW-TSC on the CAPITEL NER Evaluation</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <article-title>Parallel Data, Tools and Interfaces in OPUS</article-title>
          ,
          <source>in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Honnibal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Montani</surname>
          </string-name>
          ,
          <article-title>spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Álvarez Mellado</surname>
          </string-name>
          ,
          <article-title>Two Models for Named Entity Recognition in Spanish: Submission for the CAPITEL Shared Task at IberLEF 2020</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Porta-Zamorano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Romeu Fernández</surname>
          </string-name>
          , Esquema de anotación de entidades nombradas de CAPITEL, Technical Report
          , Centro de Estudios de la Real Academia Española,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sánchez-León</surname>
          </string-name>
          ,
          <article-title>Combining Different Parsers and Datasets for CAPITEL UD Parsing</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Martí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Recasens</surname>
          </string-name>
          ,
          <article-title>AnCora: Multilevel Annotated Corpora for Catalan and Spanish</article-title>
          ,
          <source>in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Straková</surname>
          </string-name>
          ,
          <article-title>Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe</article-title>
          ,
          <source>in: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Dumitrescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burtica</surname>
          </string-name>
          ,
          <article-title>NLP-Cube: End-to-End Raw Text Processing With Neural Networks</article-title>
          ,
          <source>in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bolton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Stanza: A Python Natural Language Processing Toolkit for Many Human Languages</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>