Overview of CAPITEL Shared Tasks at IberLEF 2020:
Named Entity Recognition and Universal Dependencies
Parsing
Jordi Porta-Zamorano a, Luis Espinosa-Anke b
a Centro de Estudios de la Real Academia Española, Madrid, Spain
b School of Computer Science and Informatics, Cardiff University, UK


Abstract
We present the results of the CAPITEL-EVAL shared task, held in the context of the IberLEF 2020 competition series. CAPITEL-EVAL consisted of two subtasks: (1) Named Entity Recognition and Classification and (2) Universal Dependency parsing. For both, the source data was a newly annotated corpus, CAPITEL, a collection of Spanish articles in the newswire domain. A total of seven teams participated in CAPITEL-EVAL, submitting 13 runs across the two subtasks. Data, results and further information about this task can be found at sites.google.com/view/capitel2020.

Keywords
IberLEF, named entity recognition and classification, NERC, Universal Dependencies parsing, evaluation




1. Introduction
Within the framework of the Spanish National Plan for the Advancement of Language Technologies
(PlanTL1 ), the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement
(SEAD) of the Ministry of Economy signed an agreement for developing a linguistically annotated
corpus of Spanish news articles, aimed at expanding the language resource infrastructure for the
Spanish language. This corpus, named CAPITEL (Corpus del Plan de Impulso a las Tecnologías del
Lenguaje), is composed of contemporary news articles obtained through agreements with a number
of news media providers. CAPITEL has three levels of linguistic annotation: morphosyntactic (with
lemmas and Universal Dependencies-style POS tags and features), syntactic (following Universal De-
pendencies v2), and named entities.
   The linguistic annotation of a subset of the CAPITEL corpus has been revised using a procedure
of automatic annotation followed by human revision. The manual revision was carried out by a team
of graduate linguists following a set of Annotation Guidelines created specifically for CAPITEL.
The named entity and syntactic layers of revised annotations comprise about 1 million words for
the former, and roughly 300,000 for the latter. Due to the size of the corpus and the nature of the
annotations, we proposed two IberLEF sub-tasks under the general umbrella task of CAPITEL @
IberLEF 2020: (1) Named Entity Recognition and Classification and (2) Universal Dependency Parsing.




Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: porta@rae.es (J. Porta-Zamorano); espinosa-anke@cardiff.ac.uk (L. Espinosa-Anke)
orcid: 0000-0001-5620-4916 (J. Porta-Zamorano); 0000-0001-6830-9176 (L. Espinosa-Anke)
                                       © 2020 Copyright for this paper by its authors.
                                       Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




                  1 https://www.plantl.gob.es
Table 1
Description of the data for CAPITEL sub-task 1: NERC
                     Dataset     PER     LOC      ORG      OTH     Sents.    Tokens
                      train      9,087    7,513    9,285   5,426   22,647     606,418
                      devel      2,900    2,490    3,058   1,781    7,549     202,408
                       test      2,996    2,348    3,143   1,739    7,549     199,773
                      Total     14,983   12,351   15,486   8,946   37,745   1,008,599


2. Sub-task 1: NERC
2.1. Description
Information extraction tasks, formalized in the late 1980s, are designed to evaluate systems which cap-
ture information present in free text, with the goal of enabling better and faster information and con-
tent access. One important subset of this information comprises named entities (NE), which, roughly
speaking, are textual elements corresponding to names of people, places, organizations and others.
Three processes can be applied to NEs: recognition or identification (NER), categorization, i.e., assign-
ing a type according to a predefined set of semantic categories (NERC), and linking, which consists of
disambiguating the in-text mention against a knowledge base or sense inventory (NEL). Since their
advent, NER tasks have had notable success, but despite the relative maturity of this subfield, work and
research continues to evolve, and new techniques and models appear alongside challenging datasets
in different languages, domains and textual genres. The aim of this sub-task, thus, was to challenge
participants to apply their systems or solutions to the problem of identifying and classifying NEs in
Spanish news articles. This two-stage process falls within the NERC evaluation framework.
   The following NE categories were evaluated: Person (PER), Location (LOC), Organization (ORG)
and Other (OTH) as defined in the Annotation Guidelines [1] that were shared with participants. The
criteria for the identification and classification of entities were based on the capitalization chapter of
the Spanish language orthography [2]. Contextual meaning was taken into account in the classification
of entities, so that an entity such as Madrid can be classified as PER (a surname), LOC (the city),
ORG (the football team) or even OTH (a book title). Moreover, in terms of nesting, only the
longest-spanning entities were considered, and coordinated entities were treated as a single entity,
except for those cases where the name indicating the nature of the NE is used in the plural to
introduce several entities ([Islas Baleares]loc y [Canarias]loc).

2.2. Dataset
A one-million-word subset of the CAPITEL corpus was randomly sampled into three subsets: train-
ing, development and test. The training set comprises 60% of the corpus, whereas the development
and test sets roughly amount to 20% each. Descriptive statistics for these splits are provided in Table 1.
Together with the test set release, an additional collection of documents (the background set) was
delivered to ensure that participating teams were not able to perform manual corrections, and also to
encourage desirable system properties such as scalability to larger data collections. Finally, all documents
were tokenized and tagged with NEs following the IOBES format.
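   As an illustration of the tagging scheme, a minimal invented example (not taken from the corpus), where S- marks single-token entities, B-, I- and E- mark the beginning, inside and end of multi-token entities, and O marks tokens outside any entity:

    # Invented example sentence tagged with the IOBES scheme.
    sentence = [
        ("El", "O"),
        ("Real", "B-ORG"),
        ("Madrid", "E-ORG"),
        ("visita", "O"),
        ("Barcelona", "S-LOC"),
        (".", "O"),
    ]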

2.3. Evaluation Metrics
The metrics used for evaluation were Precision (the percentage of named entities in the system's
output that are correctly recognized and classified), Recall (the percentage of named entities in the
test set that are correctly recognized and classified) and the macro-averaged F1 score (the harmonic
mean of Precision and Recall), with the latter used as the official evaluation score and for the final
ranking of the participating teams.
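   For reference, these metrics follow their standard entity-level definitions, where a predicted entity counts as a true positive (TP) only if both its span and its type match the gold annotation, FP denotes false positives and FN false negatives; the macro-averaged scores aggregate the per-type values as computed by the official scorer:

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2 \cdot P \cdot R}{P + R}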

2.4. Systems and Results
We had 22 registrations and 5 final participants, who submitted 9 systems and 4 system description papers.
   The Ragerri Team from the HiTZ Center-Ixa (UPV/EHU) presents in [3] the combination of several
systems based on Flair [4] and Transformer architectures [5]. They perform experiments with multilingual
BERT (mBERT), XLM-RoBERTa (base), BETO (a BERT-based model pre-trained on Spanish texts [6]),
Flair off-the-shelf models for Spanish, and a monolingual Flair model trained on the OSCAR corpus.
The F1 scores of all the individual systems were within 88.29-89.95%, and the combination of
five of them using a simple agreement scheme requiring three matching votes achieved the first rank with an F1 of 90.30%.
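   The exact combination procedure is described in [3]; as a rough illustration of this kind of agreement-based ensembling, the sketch below keeps an entity only when at least three of the five systems propose the same span with the same label (the span representation and the threshold are assumptions for the example):

    from collections import Counter

    def combine_by_agreement(system_outputs, min_votes=3):
        """system_outputs: one set of (start, end, label) entity spans per system.
        Keep a span-label combination proposed by at least `min_votes` systems."""
        votes = Counter()
        for spans in system_outputs:
            votes.update(spans)
        return {span for span, count in votes.items() if count >= min_votes}

    # Hypothetical usage with the predictions of five systems for one document:
    # combined = combine_by_agreement([s1, s2, s3, s4, s5], min_votes=3)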
   The Vicomtech Team presents in [7] a system based on the BERT architecture and several experiments
using multilingual BERT (mBERT) and BETO pre-trained models. BERT models are used to produce a
contextual embedding for each token, which is then passed to a fully connected layer that classifies
the token. Their work also addresses several interesting issues with the BETO vocabulary and tokenizer,
namely punctuation marks missing from the tokenizer's vocabulary and problems with certain diacritics
and characters. Their systems were fine-tuned with the CAPITEL training data, and their results were
2-3 F1 points below the best performing system.
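   As a minimal sketch of this general architecture (contextual token embeddings followed by a linear classification layer), and not of the Vicomtech system itself, the following Hugging Face Transformers snippet loads the BETO checkpoint with a token-classification head; the checkpoint name and the label count (17 = 4 entity types x B/I/E/S tags plus O) are assumptions for the example:

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Assumed BETO checkpoint name; the classification head below is randomly
    # initialized and would be fine-tuned on the CAPITEL training data.
    model_name = "dccuchile/bert-base-spanish-wwm-cased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=17)

    inputs = tokenizer("El Real Madrid visita Barcelona.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # shape: (1, sequence_length, num_labels)
    predicted_tag_ids = logits.argmax(dim=-1)  # one predicted tag id per sub-token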
   The Yanghao Team from the Huawei Translation Service Center presents in [8] a system that uses
multilingual BERT as the encoder and a linear layer as the classifier, and is trained with an additional
38,000 sentences from the WMT news translation corpus [9] annotated using spaCy [10]. Their experimental
results suggest that pre-training on the augmented set and then fine-tuning on CAPITEL improves
performance compared to training on either of them separately or on both mixed together.
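   As a sketch of how such silver annotations can be produced with spaCy (the pipeline name and the tag conversion are assumptions for the example, not necessarily the team's exact setup); note that spaCy's Spanish models use the labels PER, LOC, ORG and MISC, so a mapping to the CAPITEL tag set and a conversion from IOB to IOBES would still be needed:

    import spacy

    # Assumed Spanish pipeline; it must be installed beforehand, e.g. with
    # `python -m spacy download es_core_news_md`.
    nlp = spacy.load("es_core_news_md")

    def silver_annotate(sentence):
        """Return (token, IOB tag) pairs as predicted by spaCy's NER component."""
        doc = nlp(sentence)
        return [
            (tok.text, f"{tok.ent_iob_}-{tok.ent_type_}" if tok.ent_iob_ != "O" else "O")
            for tok in doc
        ]

    # Hypothetical usage on one sentence of the silver training data:
    # print(silver_annotate("El Real Madrid visita Barcelona."))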
   The Lirondos Team from ISI-USC presents in [11] two sequence labelling systems: a CRF model
with handcrafted features and a BiLSTM-CRF model with word and character embeddings. A feature
ablation study demonstrated that all features contribute positively to the CRF model, with word
embeddings being the most informative feature; the CRF yielded an F1 score of 84.39%. On the other
hand, their BiLSTM-CRF model obtained an F1 score of 83.01%. An interesting error analysis showed
that many of the errors correspond to OTH entities, to the contextual annotation of some entities
(OTH versus ORG or LOC versus ORG), to nested entities, and to person nicknames with unusual
typographical shapes.
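   As an illustration of the kind of handcrafted-feature CRF described here (a sketch using the sklearn-crfsuite package, not the exact feature set of [11]; in particular, the word-embedding features are omitted):

    import sklearn_crfsuite

    def token_features(tokens, i):
        """A few typical handcrafted features: word form, casing cues and
        a one-token context window."""
        word = tokens[i]
        return {
            "word": word,
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "word.isupper": word.isupper(),
            "word.isdigit": word.isdigit(),
            "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    def sentence_features(tokens):
        return [token_features(tokens, i) for i in range(len(tokens))]

    # X_train: list of sentences, each a list of feature dicts;
    # y_train: the corresponding lists of IOBES tags.
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    # crf.fit(X_train, y_train); y_pred = crf.predict(X_devel)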
   Finally, the LolaZarra Team was ranked last and did not submit a system description paper.


3. Sub-task 2: UD Parsing
3.1. Description
Dependency-based syntactic parsing has become popular in NLP in recent years. One of the reasons
for this popularity is the transparent encoding of predicate-argument structures, which is useful in
many downstream applications. Another reason is that it is better suited than phrase-structure gram-
mars for languages with free or flexible word order. Universal Dependencies (UD) is a framework for
consistent annotation of grammar (parts of speech, morphological features and syntactic dependen-
cies) across different human languages. Moreover, the UD initiative is an open community effort with
over 200 contributors which has produced more than 100 treebanks in over 70 languages.
   The aim of this sub-task was to challenge participants to apply their systems or solutions to the
problem of Universal Dependency parsing of Spanish news articles, as defined in the Annotation
Guidelines for the CAPITEL corpus [12].



Table 2
Results of the CAPITEL sub-task 1: NERC.
           Rank      Team     Ref. Metric       PER      LOC     ORG     OTH     Micro    Macro
                                         P      96.40    90.47   88.63   83.36   90.50    90.43
            (1)     ragerri     [3]     R       97.46    91.74   87.31   80.68   90.17    90.17
                                       F1       96.93    91.10   87.96   82.00   90.34    90.30
                                         P      96.50    90.19   88.05   84.37   90.46    90.39
            (2)     ragerri     [3]     R       97.46    91.27   87.21   81.02   90.09    90.09
                                       F1       96.98    90.73   87.63   82.66   90.27    90.23
                                         P      96.69    90.56   88.03   83.39   90.42    90.36
            (3)     ragerri     [3]     R       97.60    91.14   87.24   80.56   90.04    90.04
                                       F1       97.14    90.85   87.63   81.95   90.23    90.19
                                         P      93.48    89.36   85.76   79.63   87.88    87.81
            (4)   mcuadros      [7]     R       96.70    88.03   85.87   77.34   88.09    88.09
                                       F1       95.06    88.69   85.82   78.47   87.99    87.94
                                         P      94.30    87.30   84.99   79.52   87.38    87.32
            (5)    yanghao      [8]     R       96.16    89.86   85.94   77.69   88.43    88.43
                                       F1       95.22    88.56   85.46   78.59   87.90    87.87
                                         P      92.48    83.42   83.76   75.03   84.93    84.75
            (6)    lirondos    [11]     R       94.46    86.97   80.43   69.12   84.12    84.12
                                       F1       93.46    85.15   82.06   71.95   84.52    84.39
                                         P      91.52    83.39   80.10   78.31   83.93    83.90
            (7)   LolaZarra      -      R       92.62    80.41   83.39   73.72   83.77    83.77
                                       F1       92.07    81.87   81.71   75.95   83.85    83.80
                                         P      94.37    85.68   84.20   65.47   83.93    84.33
            (8)    lirondos    [11]     R       90.72    83.35   78.14   71.08   81.82    81.82
                                       F1       92.51    84.50   81.06   68.16   82.86    83.01
                                         P      93.23    82.05   84.55   63.89   82.67    83.01
            (9)    lirondos    [11]     R       90.09    82.54   73.85   67.17   79.46    79.46
                                       F1       91.63    82.29   78.84   65.49   81.03    81.11



3.2. Dataset
A 300,000-word subset of CAPITEL was provided for this sub-task. In addition to head and depen-
dency relations in CoNLL-U format, this subset was also tokenized and annotated with lemmas and
UD tags and features. Similarly to the NERC dataset, we randomly sampled it into three subsets:
training, development and test. The training set comprises about 50% of the corpus, whereas the de-
velopment and test sets roughly amount to 25% each. The description of the data sets can be found in
Table 3. In addition, the distribution of labels in the test set is given in Table 5 along with the results
of the sub-task. Together with the test set release, an additional collection of documents (the background
set) was included to ensure that participating teams were not able to perform manual corrections,
and also to encourage desirable system properties such as scalability to larger data collections.
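   For reference, each token in the treebank corresponds to a line with the ten standard CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), separated by tabs in the actual files; the short sentence below is invented for illustration and does not come from CAPITEL:

    # sent_id = example-1
    # text = Llueve en Madrid.
    1   Llueve   llover   VERB    _   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
    2   en       en       ADP     _   _                                                       3   case    _   _
    3   Madrid   Madrid   PROPN   _   _                                                       1   obl     _   SpaceAfter=No
    4   .        .        PUNCT   _   _                                                       1   punct   _   _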

3.3. Evaluation Metrics
The metrics for the evaluation phase were the Unlabeled Attachment Score (UAS), the percentage of
words that have the correct head, and the Labeled Attachment Score (LAS), the percentage of words
that have the correct head and dependency label, with the latter used as the official evaluation score
and for the final ranking of the participating teams.
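   A minimal sketch of how these scores can be computed from gold and predicted (head, label) pairs, simplifying details of the official scorer such as the alignment of tokenizations:

    def attachment_scores(gold, pred):
        """gold and pred are aligned lists of (head, deprel) pairs, one per word.
        Returns (UAS, LAS) as percentages."""
        assert len(gold) == len(pred) and gold
        uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
        las_hits = sum(g == p for g, p in zip(gold, pred))
        return 100.0 * uas_hits / len(gold), 100.0 * las_hits / len(gold)

    # Hypothetical usage for a three-word sentence:
    # attachment_scores([(2, "nsubj"), (0, "root"), (2, "obj")],
    #                   [(2, "nsubj"), (0, "root"), (2, "obl")])
    # -> (100.0, 66.66...)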




Table 3
Description of the data for CAPITEL sub-task 2: UD Parsing
                                             Dataset     Sents.    Tokens
                                              train       7,086    185,560
                                              devel       2,362     61,137
                                               test       2,363     62,682
                                              Total      11,811    309,379

Table 4
Results of the CAPITEL sub-task 2: UD Parsing
                            Rank              Team                 Ref.     UAS       LAS
                             (1)      MartinLendinez (CACV)        [13]    91.935    88.660
                             (2)        Vicomtech (BETO)            [7]    91.875    88.600
                             (3)       MartinLendinez (CA)         [13]    91.773    88.531
                             (4)       MartinLendinez (C)          [13]    91.715    88.467



3.4. Systems and Results
In this sub-task, we had 12 registrations and 2 final participants, who submitted 4 systems and 2 system
description papers.
   The Vicomtech Team presents in [7] a system based on the BERT architecture and several experiments
using multilingual BERT (mBERT) and BETO pre-trained models. BERT models are used to encode an
all-vs-all matrix of token encoding vectors, which is then passed to several classification layers that
predict the connectivity of the tokens and their relation types. Their work also addresses some of the
issues already described in Section 2.4. Their systems were fine-tuned with the CAPITEL training data,
and results on the development set were slightly better using BETO (UAS: 91.540, LAS: 88.410) than
mBERT (UAS: 91.220, LAS: 87.860), so only the BETO results were submitted as their official run.
   The MartínLendinez Team presents in [13] the combination of the output of different UD parsing
toolkits using a voting scheme, together with the augmentation of the training set with 14,305 annotated
sentences from the AnCora annotated corpus [14].2 Three different toolkits were selected, not because of
their performance in similar tasks but for their accessibility and documentation: UDPipe [15],
NLP-Cube [16] and Stanza [17]. As summarized in Table 4, the final submitted results were obtained
with Stanza trained on CAPITEL (4), Stanza trained on CAPITEL and AnCora (3), and the combination
of the previous two plus NLP-Cube trained on CAPITEL (1).
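   The precise combination scheme is described in [13]; as a rough illustration of per-word voting over the outputs of several parsers on the same tokenization (tie-breaking and tree well-formedness are simplified assumptions here):

    from collections import Counter

    def vote_per_word(parser_outputs):
        """parser_outputs: one list per parser, each a list of (head, deprel)
        pairs for the same words. Returns the most voted pair per word; ties
        resolve to the candidate proposed by the earliest parser. Note that
        voting words independently does not guarantee a well-formed tree."""
        combined = []
        for candidates in zip(*parser_outputs):
            combined.append(Counter(candidates).most_common(1)[0][0])
        return combined

    # Hypothetical usage with three parser outputs:
    # combined = vote_per_word([stanza_capitel, stanza_capitel_ancora, nlpcube_capitel])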
   As can be seen in Table 4, results on this sub-task are very tight, with the first and second systems
only 0.060 LAS points apart, and only 0.193 points between the first and the fourth. The submission by
MartínLendinez was the highest ranked, while Vicomtech's was the simplest, acknowledged and described
by its authors as a sort of BERT-based baseline. We provide a breakdown of the results by relation type
in Table 5.




   2 There is also a discussion of some differences in terms of tokenization and analysis between CAPITEL and AnCora.




4. Conclusions
Most of the submitted systems obtained good results overall. In both sub-tasks, the majority of them
use BERT, either multilingual or monolingual, and some systems combine the output of several models.
Augmenting the data with material from other corpora, or with material produced by other annotation
systems, whether added to the training data or used to fine-tune the models, has also shown some
modest improvements despite the heterogeneity of the annotations and the domain differences.


5. Acknowledgements
We would especially like to thank David Pérez Fernández, Doaa Samy, and all the people involved
in the PlanTL for their contribution to making these shared tasks possible, and José-Luis Sancho-
Sánchez and Rafael-J. Ureña-Ruiz from the Centro de Estudios de la RAE for their help in preparing
the data. We would also like to thank the task participants, who provided helpful input to improve
the quality of the dataset and the task itself.


References
 [1] J. Porta-Zamorano, J. Romeu Fernández, Esquema de anotación sintáctica de CAPITEL, Technical
     Report, Centro de Estudios de la Real Academia Española, 2019.
 [2] RAE, ASALE, Ortografía de la lengua española, 2010.
 [3] R. Agerri, G. Rigau, Projecting Heterogeneous Annotations for Named Entity Recognition, in:
     Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.
 [4] A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling, in: Pro-
     ceedings of the 27th International Conference on Computational Linguistics, 2018.
 [5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
     Attention is all you need, in: Advances in Neural Information Processing Systems 30, 2017.
 [6] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish Pre-Trained BERT Model and Evaluation
     Data, in: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth
     International Conference on Learning Representations (ICLR 2020), 2020.
 [7] A. García Pablos, M. Cuadros, E. Zotova, Vicomtech at CAPITEL 2020: Facing Entity Recognition
     and Universal Dependency Parsing of Spanish News Articles with BERT models, in: Proceedings
     of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.
 [8] L. Lei, M. Wang, H. Yang, S. Sun, Y. Qin, D. Wei, System Report of HW-TSC on the CAPITEL
     NER Evaluation, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020),
     2020.
 [9] J. Tiedemann, Parallel Data, Tools and Interfaces in OPUS, in: Proceedings of the Eighth
     International Conference on Language Resources and Evaluation (LREC’12), 2012.
[10] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings,
     convolutional neural networks and incremental parsing, 2017.
[11] E. Álvarez Mellado, Two Models for Named Entity Recognition in Spanish: Submission for the
     CAPITEL Shared Task at IberLEF 2020, in: Proceedings of the Iberian Languages Evaluation
     Forum (IberLEF 2020), 2020.
[12] J. Porta-Zamorano, J. Romeu Fernández, Esquema de anotación de entidades nombradas de
     CAPITEL, Technical Report, Centro de Estudios de la Real Academia Española, 2019.




[13] F. Sánchez-León, Combining Different Parsers and Datasets for CAPITEL UD Parsing, in: Pro-
     ceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.
[14] M. Taulé, M. A. Martí, M. Recasens, AnCora: Multilevel Annotated Corpora for Catalan and
     Spanish, in: Proceedings of the Sixth International Conference on Language Resources and
     Evaluation (LREC’08), 2008.
[15] M. Straka, J. Straková, Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe,
     in: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal
     Dependencies, 2017.
[16] T. Boros, S. D. Dumitrescu, R. Burtica, NLP-Cube: End-to-End Raw Text Processing With Neural
     Networks, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text
     to Universal Dependencies, 2018.
[17] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python Natural Language Pro-
     cessing Toolkit for Many Human Languages, in: Proceedings of the 58th Annual Meeting of the
     Association for Computational Linguistics: System Demonstrations, 2020.




Table 5
Detailed results of the CAPITEL sub-task 2: UD Parsing
                                        (1)                 (2)                     (3)                     (4)
          Label       Freq.     UAS        LAS      UAS            LAS      UAS            LAS      UAS            LAS
             acl         501    80.04      65.47    82.44          67.27    80.04          65.67    79.04          65.27
         acl:relcl     1,004    74.90      73.61    79.68          78.19    75.50          74.30    74.70          73.41
           advcl       1,004    79.18      73.11    78.29          71.02    78.69          73.11    79.18          73.61
         advmod        2,062    87.92      85.01    87.34          83.75    86.23          83.51    87.20          83.95
           amod        3,228    96.96      94.95    96.78          94.24    96.81          94.76    97.00          95.26
           appos       1,090    85.50      77.61    84.22          74.50    84.40          76.70    84.86          76.79
            aux          259    40.93      31.27    52.51          46.72    48.26          38.22    40.93          30.89
         aux:pass         56   100.00      94.64    98.21          83.93   100.00          94.64   100.00          94.64
            case       8,705    98.99      98.70    98.78          98.24    98.83          98.55    99.05          98.74
             cc        2,018    95.29      92.86    95.00          92.72    95.14          92.77    94.50          92.02
          ccomp          399    90.73      83.71    90.48          84.21    90.73          83.96    88.72          81.45
       compound           22    63.64      18.18    81.82          45.45    59.09          13.64    68.18          27.27
            conj       2,361    76.20      73.27    77.42          74.29    76.37          73.36    75.60          72.47
            cop          925    93.30      89.84    93.19          89.84    93.95          90.49    92.76          89.51
           csubj         111    81.08      60.36    83.78          63.96    82.88          62.16    79.28          52.25
            dep           28    75.00       7.14    67.86           3.57    75.00           7.14    71.43           7.14
             det       8,840    99.42      99.37    99.29          99.17    99.33          99.29    99.38          99.32
        discourse         36    80.56       2.78    86.11           8.33    77.78           2.78    77.78           5.56
            expl          46    97.83       6.52    95.65          41.30    97.83           6.52    97.83          23.91
       expl:impers        29    93.10       6.90    86.21          20.69    93.10           6.90    89.66          10.34
        expl:pass        360    99.44      75.56    99.17          82.50    99.44          75.56    98.89          79.72
          expl:pv        343    97.67      70.85    97.67          74.05    97.38          70.55    97.38          68.22
           fixed         219    71.23      68.04    71.23          65.75    70.32          66.67    71.23          68.49
            flat         130    92.31      45.38    88.46          53.85    91.54          45.38    90.77          50.00
       flat:foreign      409    90.71      86.80    78.48          70.17    91.44          87.53    91.69          87.29
        goeswith           2     0.00       0.00     0.00           0.00     0.00           0.00     0.00           0.00
            iobj         329    94.22      77.51    91.19          72.95    93.62          76.90    92.40          69.91
           mark        1,992    91.62      85.99    92.67          86.75    92.12          86.40    91.47          85.94
        mark:iobj          9   100.00      33.33    88.89          44.44   100.00          33.33   100.00          22.22
        mark:mod         282    93.97      79.79    93.97          83.69    93.62          79.43    94.33          80.85
        mark:obj         119    89.92      49.58    89.92          55.46    90.76          50.42    90.76          42.86
        mark:subj        816    94.24      87.99    95.10          87.01    94.61          88.36    93.87          88.36
           nmod        4,609    88.24      87.18    89.48          88.15    88.28          87.22    87.85          86.77
           nsubj       2,302    93.61      89.27    93.53          89.27    93.61          89.23    93.01          86.92
       nsubj:pass         29    96.55      58.62    96.55          55.17    96.55          58.62    96.55          55.17
        nummod           689    97.68      96.66    97.68          95.36    97.53          96.37    97.97          96.37
             obj       2,235    98.61      89.80    97.67          90.65    98.39          89.75    98.30          91.50
             obl       3,298    87.54      81.96    87.17          81.78    87.48          82.05    87.72          82.20
        obl:agent         94    98.94      86.17    97.87          85.11   100.00          87.23    97.87          90.43
          orphan          17    70.59       0.00    70.59           0.00    70.59           0.00    64.71           0.00
        parataxis        881    74.01      59.82    72.76          60.39    72.99          58.80    74.35          61.29
           punct       8,028    89.09      88.95    88.17          88.02    88.64          88.52    88.78          88.61
            root       2,394    93.27      93.07    93.48          93.32    93.23          93.02    92.86          92.65
          xcomp          372    75.81      69.09    80.38          72.31    72.04          65.59    79.03          71.51
           Total      62,682   91.935     88.660   91.875         88.600   91.773         88.531   91.715         88.467



