=Paper=
{{Paper
|id=Vol-2664/capitel_paper3
|storemode=property
|title=Vicomtech at CAPITEL 2020: Facing Entity Recognition and Universal Dependency Parsing of Spanish News Articles with BERT Models
|pdfUrl=https://ceur-ws.org/Vol-2664/capitel_paper3.pdf
|volume=Vol-2664
|authors=Aitor García-Pablos,Montse Cuadros,Elena Zotova
|dblpUrl=https://dblp.org/rec/conf/sepln/PablosCZ20
}}
==Vicomtech at CAPITEL 2020: Facing Entity Recognition and Universal Dependency Parsing of Spanish News Articles with BERT Models==
Vicomtech at CAPITEL 2020: Facing Entity Recognition and Universal Dependency Parsing of Spanish News Articles with BERT Models

Aitor García-Pablos, Montse Cuadros and Elena Zotova

SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San-Sebastián, 20009, Spain

email: agarciap@vicomtech.org (A. García-Pablos); mcuadros@vicomtech.org (M. Cuadros); ezotova@vicomtech.org (E. Zotova)

Abstract: These working notes describe the participation of the Vicomtech NLP team in the CAPITEL task, which is part of IberLEF 2020. The CAPITEL task included two sub-tasks: Named Entity Recognition and Classification (NERC) and Universal Dependency (UD) parsing of Spanish news articles. A specific system has been designed for each sub-task based on BERT architectures. Both systems have been tested with different settings, and the best-performing ones were submitted to the shared task. The resulting systems prove robust and competitive in both tasks despite their simple architectures.

Keywords: NERC, dependency parsing, deep learning

These working notes present an overview of Vicomtech's systems presented in the CAPITEL 2020 [1] tasks. CAPITEL 2020 is a shared task organised within the IberLEF 2020 campaign (https://sites.google.com/view/iberlef2020). The Corpus del Plan de Impulso a las Tecnologías del Lenguaje (CAPITEL) corpus is the result of an agreement between the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy, within the framework of PlanTL (https://www.plantl.gob.es/Paginas/index.aspx). This corpus is composed of contemporary news articles obtained thanks to agreements with a number of news media providers. CAPITEL has three levels of linguistic annotation: morphosyntactic (with lemmas and Universal Dependencies-style POS tags and features), syntactic (following Universal Dependencies v2, https://universaldependencies.org/u/dep/index.html) and named entities. The revised annotations comprise about 1 million words for the named entity layer and roughly 300,000 words for the syntactic layer. Given the size of the corpus and the nature of the annotations, two IberLEF sub-tasks were proposed under the more general, umbrella task of CAPITEL @ IberLEF 2020:

• Named Entity Recognition and Classification
• Universal Dependency Parsing

The Named Entity Recognition and Classification sub-task challenges participants to apply their systems or solutions to the problem of identifying and classifying named entities (NEs) in Spanish news articles. The following NE categories are evaluated: Person (PER), Location (LOC), Organization (ORG) and Other (OTH). The metrics used for the evaluation are Precision, Recall and F-measure (micro and macro average).

The Universal Dependency Parsing sub-task challenges participants to apply their systems or solutions to the problem of Universal Dependency parsing of Spanish news articles. The metrics used for the evaluation are the Unlabelled Attachment Score (UAS) and the Labelled Attachment Score (LAS): UAS stands for the percentage of words that have the correct head, and LAS for the percentage of words that have both the correct head and the correct dependency label.
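For concreteness, both metrics can be computed directly from these definitions. The following minimal sketch (our own Python illustration, not the official CAPITEL evaluation script; the function name and the (head, label) data layout are assumptions) shows the computation for a single parsed sentence:

```python
def uas_las(gold, pred):
    """Compute UAS and LAS (as percentages) for one parsed sentence.

    `gold` and `pred` are equal-length lists of (head_index, relation_label)
    tuples, one entry per word. Hypothetical data layout for illustration.
    """
    assert len(gold) == len(pred)
    correct_heads = 0     # words with the correct head (UAS numerator)
    correct_labelled = 0  # correct head AND correct label (LAS numerator)
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
        if g_head == p_head:
            correct_heads += 1
            if g_rel == p_rel:
                correct_labelled += 1
    n = len(gold)
    return 100.0 * correct_heads / n, 100.0 * correct_labelled / n

# Example: a three-word sentence where the third word gets the wrong head.
gold = [(2, "det"), (0, "root"), (2, "obj")]
pred = [(2, "det"), (0, "root"), (1, "obj")]
print(uas_las(gold, pred))  # (66.66..., 66.66...)
```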
We have participated in both sub-tasks with two different systems, both making use of simple architectures with BERT at their base. These working notes are organised as follows: Section 1 describes the systems presented for both tasks, with all the details concerning architecture and training setup; Section 2 shows the results obtained in both tasks; and Section 3 draws some conclusions and lines of future work.

1. System description

This section provides a description of the systems that we have developed to participate in CAPITEL's tasks. In the face of the widespread success of Transformer-based architectures [2] in virtually all Natural Language Processing (NLP) tasks, Vicomtech has implemented both systems based on BERT [3]: one that learns to recognise and classify entities, and another that learns to establish relations between sentence elements. The first task, NERC, is approached with a traditional sequence-labelling approach relying on BERT's contextual word representations. The second task, syntactic dependency parsing, is also based on BERT at its core, combining the semantic representations of the tokens to detect the syntactic relations among them. We have tried the same architectures with different pre-trained BERT models.

1.1. Architecture for the NERC system

The NERC system is a deep learning model based on BERT (Bidirectional Encoder Representations from Transformers). The model is the simplest approach in which BERT can be used to perform NERC. It makes use of BERT to encode the input, obtaining a contextual embedding for each input token. These contextual embeddings are the input to a fully connected feed-forward layer that classifies each token as one of the possible output tags. Figure 1 shows a diagram of the architecture.

Figure 1: Diagram of the NERC model based on BERT.

We have experimented with two different pre-trained BERT models. On the one hand, we have used the BERT-Base Multilingual Cased model shared by Google (https://github.com/google-research/bert/blob/master/multilingual.md). On the other hand, we have used BETO [4], a BERT-base architecture pre-trained only on Spanish texts.

1.2. Architecture for the dependency parsing system

The dependency parsing system is, again, a deep learning model relying on BERT. The model uses BERT to encode the input, obtaining contextual embeddings. Then, a tensor operation is performed over the contextual embeddings to obtain an all-vs-all combination of token vectors. This generates S × S combined embeddings that represent all the possible token pairs, S being the length of the input sequence. The resulting token-pair representations are then passed to several classification layers to make predictions about the relation between the tokens in each pair. First, the pairs are categorised by a binary classifier that decides whether the tokens that form the pair are connected by a relation (an arc of the dependency tree). The logits resulting from this arc classifier are concatenated with each token-pair embedding, and the resulting representation is passed to a final classification layer to obtain the type of relation for each token pair among the Universal Dependencies types.

Note that this all-vs-all token combination strategy has a quadratic computational cost with respect to the length of the sequence, so the approach could not be applied to full documents. However, the scope of dependency parsing is limited to individual sentences, and since the only operations over the all-vs-all pairs are a concatenation and a simple matrix multiplication (the feed-forward classification layers), the overall computational cost remains feasible. Figure 2 shows a diagram of the described architecture.

Figure 2: Diagram of the UD model based on BERT.

As for the NERC task, we have experimented with the BERT multilingual pre-trained model and with its Spanish pre-trained counterpart, BETO.
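The following is a minimal sketch of this pair-combination step, assuming PyTorch and the HuggingFace Transformers API; the class name, layer names and exact layer sizes are illustrative choices, not our exact implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertPairDependencyParser(nn.Module):
    """Sketch of the all-vs-all pair-scoring architecture described above."""

    def __init__(self, model_name: str, num_relations: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Binary classifier: is this token pair connected by an arc?
        self.arc_classifier = nn.Linear(2 * hidden, 2)
        # Relation-type classifier over pair embedding + arc logits.
        self.rel_classifier = nn.Linear(2 * hidden + 2, num_relations)

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings: (batch, seq_len, hidden)
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        b, s, d = h.shape
        # All-vs-all combination: every token vector concatenated with
        # every other, giving S x S pair embeddings of size 2 * hidden.
        heads = h.unsqueeze(2).expand(b, s, s, d)
        deps = h.unsqueeze(1).expand(b, s, s, d)
        pairs = torch.cat([heads, deps], dim=-1)          # (b, s, s, 2d)
        arc_logits = self.arc_classifier(pairs)           # (b, s, s, 2)
        rel_input = torch.cat([pairs, arc_logits], dim=-1)
        rel_logits = self.rel_classifier(rel_input)       # (b, s, s, num_relations)
        return arc_logits, rel_logits
```

The quadratic S × S tensor is affordable here precisely because, as noted above, the unit of processing is a single sentence rather than a document.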
1.3. Input and output handling

The input for the task comes already tokenized. However, these tokens are not equivalent to what a BERT model expects: each pre-trained BERT model needs the tokens as they are obtained with its own tokenizer, otherwise the input would be meaningless to the pre-trained model. This poses the additional challenge of keeping the alignment between the resulting tokens and the provided labels. To that end, each original token is retokenized with the corresponding BERT tokenizer. This results in additional tokens, since BERT uses WordPiece tokenization, which breaks words into word pieces (e.g. "Jim Henson" becomes "Jim Hen ##son"). The provided labels are mapped to the head of each token (i.e. the first piece of a sub-word) and the rest of the sub-tokens are assigned a special label 'X'. A mapping indicating the correspondence between head sub-words and original tokens is stored, so that the original token space can be rebuilt at the end of the process, resulting in token and label sequences of the same length as the original.
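A minimal sketch of this retokenization and label alignment, assuming the HuggingFace Transformers tokenizer API (the helper name and the choice of the multilingual checkpoint are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def retokenize_with_labels(tokens, labels):
    """Align gold labels with BERT word pieces.

    The first piece of each original token keeps the gold label; the
    remaining pieces receive the auxiliary label 'X'. `head_map` records
    the index of each original token's head piece, so the original token
    space can be rebuilt after prediction.
    """
    pieces, piece_labels, head_map = [], [], []
    for token, label in zip(tokens, labels):
        subtokens = tokenizer.tokenize(token) or [tokenizer.unk_token]
        head_map.append(len(pieces))  # index of the head piece
        pieces.extend(subtokens)
        piece_labels.extend([label] + ["X"] * (len(subtokens) - 1))
    return pieces, piece_labels, head_map

# e.g. retokenize_with_labels(["Jim", "Henson"], ["B-PER", "E-PER"])
# -> (["Jim", "Hen", "##son"], ["B-PER", "E-PER", "X"], [0, 1])
```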
1.3.1. BETO vocabulary issues

While using BETO, we realised that the special '[UNK]' token was being hit too often. This token is the default representation for out-of-vocabulary (OOV) values. When using more traditional tokenization approaches it is usual to encounter OOVs in NLP tasks, due to the limited size of whole-word vocabularies, but with modern sub-word approaches like WordPiece or BPE this is much less common. We discovered that the BETO tokenizer's WordPiece vocabulary was missing some common punctuation marks, such as the semicolon ';' or the percentage symbol '%'. We also noticed that any word containing certain diacritic marks, like 'cigüeña', 'piragüista' or 'Düsseldorf', was automatically marked as unknown. The same happened with words containing the character 'Ç', rather common in Catalan or French nouns. Having so many unknown values is inconvenient because all the occurrences of such words share the same vector representation; the contextual information coming from the surrounding tokens may alleviate the problem, but relevant information is still lost.

In order to deal with these issues, we manually added the missing punctuation marks and symbols to the BETO vocabulary, using the unused slots of the vocabulary that are reserved for the addition of new words. The newly added symbols have randomly initialised embeddings, because they were not part of the BETO pre-training, but this at least gives them the chance to learn a meaningful representation during the fine-tuning of the model on the downstream task. The problem with the diacritics was solved by replacing the offending characters with their diacritic-free counterparts, e.g. 'Düsseldorf' was converted into 'Dusseldorf'.
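One way to reproduce these two fixes with the HuggingFace tokenizer API is sketched below. The model identifier and the use of add_tokens() are our assumptions for illustration; the fix described above writes the symbols into the vocabulary's reserved unused slots directly, which has the same effect of giving them trainable, randomly initialised embeddings:

```python
import unicodedata
from transformers import AutoTokenizer

# Community BETO checkpoint id; an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

# Example symbols missing from the BETO WordPiece vocabulary. Their
# embeddings start untrained and are learned during fine-tuning
# (remember model.resize_token_embeddings(len(tokenizer)) on the model side).
tokenizer.add_tokens([";", "%"])

def strip_offending_diacritics(word: str) -> str:
    """Replace characters that BETO maps to [UNK] (e.g. 'ü', 'Ç') with
    their diacritic-free base character: 'Düsseldorf' -> 'Dusseldorf'."""
    chars = []
    for ch in word:
        if ch in "üÜçÇ":
            # NFD decomposition splits the base letter from the diacritic.
            chars.append(unicodedata.normalize("NFD", ch)[0])
        else:
            chars.append(ch)
    return "".join(chars)

print(strip_offending_diacritics("Düsseldorf"))  # Dusseldorf
```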
1.3.2. Post-processing IOBES tagging

The gold labels for the NERC task follow an IOBES tagging scheme, which indicates whether a token is at the Beginning, Inside, Outside or End of a named entity, or whether the entity spans a Single token. This means that for each given entity type ENT there are four possible labels: B-ENT, I-ENT, E-ENT or S-ENT. Sometimes the model selects a tag that correctly predicts the entity type but is invalid with respect to the IOBES scheme. Some of these mistakes can be corrected with a simple post-processing step: a B tag must be followed by an I tag or an E tag, otherwise it becomes an S tag; and if an I tag is followed by an O tag, the I must become an E tag, and vice versa.
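A minimal sketch of this repair step follows (our own illustration; we read the "vice versa" as an E tag followed by a continuation tag becoming an I tag, and entity-type mismatches between adjacent tags are not handled here):

```python
def fix_iobes(tags):
    """Repair invalid IOBES transitions in a predicted tag sequence.

    Implements the rules described above: a B- tag not followed by a
    continuation (I-/E-) becomes S-; an I- tag not followed by a
    continuation becomes E-; and, conversely, an E- tag followed by a
    continuation becomes I-. Entity types are kept as predicted.
    """
    fixed = list(tags)
    for i, tag in enumerate(fixed):
        nxt = fixed[i + 1] if i + 1 < len(fixed) else "O"
        continues = nxt.startswith(("I-", "E-"))
        if tag.startswith("B-") and not continues:
            fixed[i] = "S-" + tag[2:]  # lone B- can only be a single-token entity
        elif tag.startswith("I-") and not continues:
            fixed[i] = "E-" + tag[2:]  # I- at the end of a span must close it
        elif tag.startswith("E-") and continues:
            fixed[i] = "I-" + tag[2:]  # E- cannot be followed by a continuation
    return fixed

print(fix_iobes(["B-PER", "O", "B-ORG", "I-ORG", "O"]))
# ['S-PER', 'O', 'B-ORG', 'E-ORG', 'O']
```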
1.4. Training setup

We have experimented with two different pre-trained BERT models as the core for the semantic representation of the input tokens: BERT-Base Multilingual Cased (mBERT, https://github.com/google-research/bert/blob/master/multilingual.md) and BETO [4], a BERT model pre-trained on Spanish text. We have used the implementations from the HuggingFace Transformers library (https://huggingface.co/transformers/). We did not perform any in-domain language model fine-tuning of the base models; in this sense, the approach is general and domain-agnostic. The only resource used for fine-tuning the whole system is the training data provided for the tasks. For the NERC task the training data consisted of 22,647 sentences with a validation set of 7,549 sentences, while for the UD task the training set contained 7,086 sentences with a validation set of 2,362 sentences.

The training of the different variants was carried out on 2 Nvidia GeForce RTX 2080 GPUs with ∼11 GB of memory each. We applied the AdamW optimiser [5] with a base learning rate of 2e-5, combined with a linear LR schedule that warms up the learning rate during the first 5,000 training steps. For each trained model, the training monitored the weighted F1-score of the model predictions against the development set (i.e. the entity tags for the NERC system, and the syntactic dependency relations for the dependency parsing system). Models were trained for a maximum of 500 epochs with an early-stopping patience of 150 epochs. Finally, we chose the model checkpoints with the best development metrics.

2. Results

Table 1 shows the top results of the NERC task evaluated on the test set, and the results of our training evaluated on the development set. The first system belongs to the task winner (ragerri), while the second system is ours. In the official ranking of the competition our system appears in 4th position, after the three runs of ragerri, out of 9 submissions belonging to 5 different participants. We used the BETO-based system for the submission because it achieved better results than mBERT on the development set, as shown in the table. The results show that our system obtains high scores for all the entity types; the overall results are, on average, 2-3 points lower than the best performing system in the task.

Table 1: Results of the submitted system compared with the top-scoring participant in the NERC task (test set), and results of the Vicomtech system on the development set comparing mBERT and BETO.

| Dataset | Team | Metric | PER | LOC | ORG | OTH | Micro | Macro |
|---|---|---|---|---|---|---|---|---|
| TestSet | Agerri&Rigau [6] | P | 96.40 | 90.47 | 88.63 | 83.36 | 90.50 | 90.43 |
| | | R | 97.46 | 91.74 | 87.31 | 80.69 | 90.17 | 90.27 |
| | | F1 | 96.93 | 91.10 | 87.96 | 82.00 | 90.34 | 90.30 |
| TestSet | Vicomtech (BETO) | P | 93.48 | 89.36 | 85.76 | 79.63 | 87.88 | 87.81 |
| | | R | 96.70 | 88.03 | 85.76 | 77.34 | 88.89 | 88.09 |
| | | F1 | 95.06 | 88.69 | 85.82 | 78.47 | 87.99 | 87.94 |
| DevSet | Vicomtech (BETO) | P | 94.93 | 90.43 | 85.36 | 81.74 | 88.67 | 88.68 |
| | | R | 95.66 | 89.60 | 87.12 | 81.19 | 89.11 | 89.11 |
| | | F1 | 95.29 | 90.01 | 86.23 | 81.46 | 88.89 | 88.89 |
| DevSet | Vicomtech (mBERT) | P | 94.12 | 89.89 | 86.38 | 78.14 | 87.94 | 87.99 |
| | | R | 95.45 | 89.96 | 85.87 | 81.70 | 88.86 | 88.86 |
| | | F1 | 94.78 | 89.92 | 86.13 | 79.88 | 88.39 | 88.42 |

Regarding the UD task, Table 2 shows the top results evaluated on the test set and the results of our training evaluated on the development set. Our system achieves the second position in this task out of four submissions. The results show very similar scores in both metrics, UAS and LAS, compared to the winner of the task (0.06 points less in LAS). Again, as for the NERC task, our submitted system is based on BETO instead of mBERT because it achieved better results on the development set.

Table 2: Results of the submitted system compared with the top-scoring participant in the UD task (test set), and results of the Vicomtech system on the development set comparing mBERT and BETO.

| Dataset | Team | UAS | LAS |
|---|---|---|---|
| Test Set | Lendínez [7] | 91.935 | 88.660 |
| Test Set | Vicomtech (BETO) | 91.875 | 88.600 |
| Dev Set | Vicomtech (BETO) | 91.540 | 88.410 |
| Dev Set | Vicomtech (mBERT) | 91.220 | 87.860 |

3. Conclusions

In these working notes we have described our participation in the CAPITEL shared task, in its two available sub-tasks: NERC and dependency parsing based on Universal Dependencies (UD). We have presented the deep-learning-based architecture of our systems, which rely on pre-trained BERT models as the base for the semantic representation of the texts, trying two different pre-trained models: multilingual BERT (mBERT) and Spanish BERT (BETO). Although the presented systems are simple and domain-agnostic, they obtain high scores. In the NERC sub-task our system is the 4th best-performing submission, and our team achieves the 2nd position among five participants. In the UD sub-task our system ranks 2nd, scoring only 0.06 points less than the best performing system.

The described systems can almost be considered BERT-based baselines. As future work, we may experiment with other novel Transformer architectures, with additional in-domain pre-training, or with more sophisticated pre-training objectives. We would also experiment with additional layers on top of these basic architectures (from the well-known CRF for NERC to additional self-attention layers). Also, in particular for the NERC model, researching and designing an extensible way of injecting world knowledge about existing entities would be very interesting.

Acknowledgements

This work has been supported by Vicomtech and partially funded by the project DeepReading (RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE).

References

[1] J. Porta-Zamorano, L. Espinosa-Anke, Overview of CAPITEL Shared Tasks at IberLEF 2020: NERC and Universal Dependencies Parsing, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You Need, in: Proceedings of the Thirty-first Conference on Advances in Neural Information Processing Systems (NeurIPS 2017), 2017, pp. 5998–6008.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[4] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish Pre-Trained BERT Model and Evaluation Data, in: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth International Conference on Learning Representations (ICLR 2020), 2020, pp. 1–9.
[5] I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization, in: Proceedings of the Seventh International Conference on Learning Representations (ICLR 2019), 2019.
[6] R. Agerri, G. Rigau, Projecting Heterogeneous Annotations for Named Entity Recognition, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.
[7] F. Sánchez-León, Combining Different Parsers and Datasets for CAPITEL UD Parsing, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.