=Paper=
{{Paper
|id=Vol-2936/paper-15
|storemode=property
|title=Vicomtech at MESINESP2: BERT-based Multi-label Classification Models for Biomedical Text Indexing
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-15.pdf
|volume=Vol-2936
|authors=Aitor García-Pablos,Naiara Perez,Montse Cuadros
|dblpUrl=https://dblp.org/rec/conf/clef/PablosPC21
}}
==Vicomtech at MESINESP2: BERT-based Multi-label Classification Models for Biomedical Text Indexing==
Vicomtech at MESINESP2: BERT-based Multi-label Classification Models for Biomedical Text Indexing

Aitor García-Pablos, Naiara Perez and Montse Cuadros
SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San-Sebastián, 20009, Spain

Abstract

This paper describes the participation of the Vicomtech NLP team in the MESINESP2 shared task. The challenge consists in the development of systems for the automatic indexing with DeCS codes of health-related documents in Spanish. The systems submitted by Vicomtech are multi-label classifiers based on pre-trained BERT models. We have experimented with multiple ways of representing the documents, such as encoding DeCS term glosses along with the input text. According to the official evaluation results, our systems are surpassed by other competing teams: despite being fast and achieving good precision, we fall behind especially in recall. Overall, the task remains challenging even for the best performing systems, and there is ample room to advance the state of the art for this particular task.

Keywords

Biomedical Text, Automatic Indexing, DeCS, Spanish

1. Introduction

The MESINESP2 shared task [1], similar to the first MESINESP edition [2], is an open BioASQ [3] competition to develop automatic systems for the semantic indexing of Spanish documents with DeCS (http://decs.bvsalud.org/I/homepagei.htm), a structured medical vocabulary derived from the Medical Subject Headings (MeSH) [4]. DeCS comprises 34,294 descriptors and qualifiers.

The shared task is divided into three subtracks, each targeted at a different type of health-related document: scientific literature, clinical trials and patents, respectively. We have participated in all the subtracks implementing variations of a Transformers-based [5] multi-label classification model. In particular, our models feature a pre-trained BERT model [6] to encode the input text and, in some versions, to inject external knowledge (i.e., DeCS term glosses) into the model. Although our scores lag behind the best competing systems, our team ranks third in Subtrack 1 and second in Subtrack 2. Still, the overall challenge results show that there is room for improvement and future work.

The rest of the document is structured as follows. Section 2 introduces the data provided by the organizers of the challenge, with a special focus on the DeCS code imbalance and how we tackle this problem. Sections 3 and 4 describe our submitted systems and the training setup, respectively. Section 5 presents the official results. In Section 6, we discuss some decisions taken during the development and training phases, inherent flaws of our systems, and potential improvements. Finally, Section 7 provides some concluding remarks and future work hints.
2. Data description

MESINESP2 is organized into three subtracks, each focused on health-related documents of a different genre [7]. The sizes of the datasets per subtrack and split are shown in Table 1.

Table 1
Size of the shared task corpus in terms of number of documents per subtrack and split

                                       Training  Development  Background  Testing
  Subtrack 1  Scientific literature     237,574        1,065      10,174      500
  Subtrack 2  Clinical trials             3,560          147       8,919      250
  Subtrack 3  Patents                         0          115       6,000      150

The information available for each document of the corpus is the following:

• Document ID: a unique identifier for the document.
• Title: the title of the document.
• Abstract: the abstract of the document, which is the main source of text for the task.
• Metadata: journal, year and database.
• DeCS codes: the DeCS codes that characterize the content of the document.

The objective of the MESINESP competition is to develop a system capable of predicting the correct set of DeCS codes for any new document. From the total of 34,294 codes of the DeCS terminology, only around 22,000 are present in the training data. Furthermore, the frequency with which the codes occur is highly unbalanced: a minority of codes occur in more than 80% of the training documents, while the majority of codes are much sparser, with fewer than a hundred examples in the whole dataset. This problem is exacerbated by the fact that we use a multi-label classification approach, so the codes cannot be easily balanced.

A naive sub-sampling or over-sampling approach would sample the codes grouped by documents, and the imbalance would persist. Further, the codes that appear in barely a few tens of documents in the whole corpus are very unlikely to be learnt by the model, due to the lack of representation. We have addressed this problem by applying a minimum support cut-off. That is, the codes with a frequency lower than a certain preset value are not taken into account for training. A large minimum support cut-off would lead to discarding too many codes, while too small a cut-off value would keep many underrepresented labels in the output vocabulary. In order to maintain an equilibrium between these two extremes, we have calculated a cut-off value that minimizes the number of codes in the resulting vocabulary while keeping as many codes as possible from those that occur in the development set.

The resulting cut-off value calculated thus over the Subtrack 1 dataset is 80. That is, we have ignored all the codes that appear fewer than 80 times in the training set documents. This reduces the size of the output vocabulary from more than 22,000 codes to 3,274, which still represent 82% of the codes present in the development set.
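The selection of this cut-off is essentially a sweep over candidate values that trades vocabulary size against development-set coverage. The following minimal sketch illustrates the idea; it assumes the DeCS annotations are available as one list of codes per document, and names such as train_labels and dev_labels are illustrative only, not part of our actual code.

<pre>
from collections import Counter

def cutoff_tradeoff(train_labels, dev_labels, candidates=range(10, 201, 10)):
    """For each candidate minimum-support cut-off, report the size of the
    resulting output vocabulary and the fraction of development-set codes
    that it still covers."""
    train_counts = Counter(code for doc in train_labels for code in doc)
    dev_codes = {code for doc in dev_labels for code in doc}
    rows = []
    for cutoff in candidates:
        vocab = {code for code, freq in train_counts.items() if freq >= cutoff}
        coverage = len(vocab & dev_codes) / len(dev_codes)
        rows.append((cutoff, len(vocab), coverage))
    return rows

# Inspect the trade-off and pick a cut-off (80 in our case for Subtrack 1):
# for cutoff, vocab_size, dev_coverage in cutoff_tradeoff(train_labels, dev_labels):
#     print(f"cutoff={cutoff:4d}  vocab={vocab_size:6d}  dev coverage={dev_coverage:.2%}")
</pre>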
3. System description

3.1. Input representation

We use the titles and abstracts of the documents to be indexed as the main sources of information. In order to feed these fields to the BERT models that will generate the contextual word embeddings, we concatenate them using the usual BERT representation for two texts: the special [CLS] token, followed by the tokenised title, the special [SEP] token, the tokenised abstract, and a second [SEP] token (see Figure 1).

Figure 1: Representation of a MESINESP document title and abstract using a BERT model (example document: title "Base libre de carvedilol [...]", abstract "[...] a un pH de 5,5 [...]").

BERT-base models have a hard limit of 512 tokens, including the special tokens. The average length of the abstracts in the training set after tokenisation with the corresponding BERT tokeniser is around 300 tokens, with a standard deviation of about 120 tokens. That is, a large percentage of the documents fit in the model. The few that do not are simply truncated. We assume that, even in those cases in which the last words of the abstract are omitted, the amount of information encoded in the first few hundred words is enough to predict the most salient DeCS codes for a given document. This assumption is supported by the fact that, during some preliminary experiments, we did not observe major differences in the development set scores when varying the maximum allowed document length between 300 and 500 BERT tokens.
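As an illustration, this pair representation can be obtained with a standard Hugging Face tokenizer. The checkpoint name below is only a placeholder (not necessarily the model we used), and the example strings are those of Figure 1.

<pre>
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the pre-trained BERT model actually used (e.g., IXAmBERT).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

title = "Base libre de carvedilol [...]"
abstract = "[...] a un pH de 5,5 [...]"

# Encoding title and abstract as a text pair yields
# [CLS] title tokens [SEP] abstract tokens [SEP], truncated/padded to the maximum length.
encoding = tokenizer(
    title,
    abstract,
    max_length=300,       # maximum sequence length used in our experiments (see Table 4)
    truncation=True,      # overly long abstracts are simply cut off
    padding="max_length",
)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"])[:12])
</pre>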
3.2. Architectures

We have experimented with two different architectures. We henceforth refer to these systems as CSS and LabelGlosses. Both systems are multi-label classifiers built on top of a Transformers model. Given a document, they can predict any number of labels, from 0 to C, C being the size of the output vocabulary, i.e., 3,274 (in practice, each document is associated with a small number of labels, usually fewer than 10). The difference between the architectures lies in how they use the contextual word embeddings obtained from the Transformer model.

Figure 2: Architecture of the CSS model for multi-label classification of the documents.

Figure 2 shows a diagram of the CSS model. The contextual token embeddings obtained from the BERT model need to be gathered or processed in a way that provides a fixed-length representation, usually called document embedding, that serves as input to a classification head. There are different ways to obtain such a representation. The most direct and straightforward approach is to use the special [CLS] token to act as the document summary. For a model built on a BERT-base architecture, the [CLS] token is a vector of 768 values. After some experimentation, the use of just this token led to poor results. Our hypothesis is that summarizing the whole document into 768 values while aiming to discriminate several thousands of possible classes (i.e., DeCS codes) leads to a choke point. That is, the information represented in the [CLS] token alone is too compressed to serve as the input for a classifier with such a high number of output classes. To overcome this problem, we draw on other tokens that are always present in every document due to the way we represent them: the [SEP] tokens. For each document, we concatenate the [CLS] token and the two [SEP] tokens into a single vector of 3 × 768 values (hence the name of the model). This larger document embedding is then used as input for the classification head. The classification head is composed of a dense layer, followed by a nonlinear function, a dropout layer and a linear layer that maps the document embedding into the label space.
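A minimal PyTorch sketch of the CSS idea is shown below. It is not our exact implementation: the inner dimension of the dense layer is an assumption, the class and argument names are ours, and it presumes the positions of the two [SEP] tokens are known for each input sequence.

<pre>
import torch
import torch.nn as nn
from transformers import AutoModel

class CSSClassifier(nn.Module):
    """Concatenate the contextual embeddings of the [CLS] token and the two
    [SEP] tokens, then feed the 3xH document embedding to a classification head."""

    def __init__(self, pretrained_name, num_codes, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_name)
        hidden = self.encoder.config.hidden_size          # 768 for BERT-base
        self.head = nn.Sequential(
            nn.Linear(3 * hidden, hidden),                # dense layer (assumed size)
            nn.Mish(),                                    # non-linearity (see Table 4)
            nn.Dropout(dropout),                          # dropout layer
            nn.Linear(hidden, num_codes),                 # linear layer into the label space
        )

    def forward(self, input_ids, attention_mask, sep_positions):
        # sep_positions: (batch, 2) indices of the two [SEP] tokens in each sequence
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_emb = states[:, 0]                            # [CLS] is always at position 0
        batch_idx = torch.arange(states.size(0)).unsqueeze(1)
        sep_embs = states[batch_idx, sep_positions]       # (batch, 2, hidden)
        doc_emb = torch.cat([cls_emb, sep_embs.flatten(1)], dim=-1)   # (batch, 3*hidden)
        return torch.sigmoid(self.head(doc_emb))          # one score in [0, 1] per DeCS code
</pre>

For illustration the sketch returns sigmoid scores directly; a multi-label head of this kind is usually trained with a binary cross-entropy objective against the multi-hot vector of gold DeCS codes.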
3.2.1. LabelGlosses model

The initial components of the LabelGlosses model are very similar to those of the CSS model, the difference between the two architectures being that the LabelGlosses model contains an encoded representation of the glosses that describe the DeCS codes (see Figure 3). Table 2 shows some examples of such glosses.

Figure 3: Architecture of the LabelGlosses model. DeCS codes' glosses are encoded into embeddings and paired with each document embedding as the input for the classification head.

Table 2
Examples of glosses for several DeCS terms, together with their code and name

  Code        Term        Gloss
  D003970     Diastema    Espacio entre dos dientes adyacentes en el mismo arco dental. (Dorland, 27th ed)
  D007962     Leucocitos  Células sanguíneas blancas. Estas incluyen a los leucocitos granulares (BASOFILOS, EOSINOFILOS y NEUTROFILOS) así como a los leucocitos no granulares (LINFOCITOS y MONOCITOS).
  DDCS034870  Mareógrafo  Instrumento para registrar y medir las oscilaciones de las mareas. (Material IV - Glosario de Protección Civil, OPS, 1992)
  D011203     Pobreza     Acción y efecto de empobrecer o empobrecerse. (Fuente: Diccionario de la lengua española. Real Academia Española. Disponible en: https://dle.rae.es/?id=ErpRftz)

Prior to the training phase, the DeCS glosses are encoded using the same pre-trained BERT model that will be used for training. First, we strip all parenthetical content from the glosses, because such content is often a citation or other irrelevant boilerplate. The gloss encoding process consists in summing the contextual embeddings obtained from the tokens of each gloss, ignoring padding positions. The resulting vectors are used to initialize the DeCS gloss embeddings layer inside the model. This layer is of size C × H, C being the number of codes in the output vocabulary and H the size of the BERT embeddings (i.e., 768 for a BERT-base architecture). These embeddings are fine-tuned during training.

During training, the document embedding is obtained by averaging the contextual token embeddings from the BERT model. Then, this document embedding is combined with every gloss embedding, forming pairs: (document, gloss_1), (document, gloss_2), ..., (document, gloss_C). Each pair consists of two vectors of size H that are concatenated to obtain a single vector of size 2 × H. This combined vector is the input to a classification head. The classification head of the LabelGlosses model is the same as that of the CSS model: a dense layer followed by a nonlinear function, a dropout layer and a linear layer.
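The sketch below illustrates this pairing mechanism under the same caveats: class names and inner dimensions are assumptions, producing one score per (document, gloss) pair is our reading of the architecture rather than a verbatim reproduction of it, and the removal of parenthetical content from the glosses is assumed to happen beforehand.

<pre>
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LabelGlossesClassifier(nn.Module):
    """A learnable embedding per DeCS code, initialised from its BERT-encoded gloss,
    is paired with the document embedding before the classification head."""

    def __init__(self, pretrained_name, glosses, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_name)
        hidden = self.encoder.config.hidden_size
        # Gloss embeddings layer of size C x H, initialised from the encoded glosses
        # and fine-tuned during training (encode in mini-batches for a real vocabulary).
        self.gloss_embeddings = nn.Parameter(self._encode_glosses(pretrained_name, glosses))
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),   # dense layer over each (document, gloss) pair
            nn.Mish(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),            # one score per pair, i.e. per DeCS code
        )

    @staticmethod
    @torch.no_grad()
    def _encode_glosses(pretrained_name, glosses):
        tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
        encoder = AutoModel.from_pretrained(pretrained_name).eval()
        batch = tokenizer(glosses, padding=True, truncation=True, return_tensors="pt")
        states = encoder(**batch).last_hidden_state          # (C, seq_len, H)
        mask = batch["attention_mask"].unsqueeze(-1).float() # ignore padding positions
        return (states * mask).sum(dim=1)                    # sum token embeddings -> (C, H)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        doc_emb = (states * mask).sum(dim=1) / mask.sum(dim=1)  # average token embeddings
        batch, num_codes = doc_emb.size(0), self.gloss_embeddings.size(0)
        pairs = torch.cat(                                    # (batch, C, 2*H)
            [doc_emb.unsqueeze(1).expand(batch, num_codes, -1),
             self.gloss_embeddings.unsqueeze(0).expand(batch, num_codes, -1)],
            dim=-1,
        )
        return torch.sigmoid(self.head(pairs).squeeze(-1))    # (batch, C) scores
</pre>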
3.3. Output handling

The output of the model (be it CSS or LabelGlosses) is an individual score ranging between 0 and 1 for each DeCS code in the output vocabulary. When the model is confident about predicting a certain code, its corresponding score gets closer to 1, and vice versa. A threshold needs to be chosen to decide when a given score must be interpreted as the model predicting the corresponding code for the given input. With a threshold of 0, the model would predict all the codes regardless of the input, maximizing the recall but minimizing the precision. A threshold of 1 would mean that the model would never predict any code at all. In the absence of further information, a threshold of 0.5 is a reasonable default, but could be suboptimal.

In this work, we have used the development sets provided for each subtrack to find the decision threshold that best balances precision and recall, thus achieving the best possible F1-score for each trained model. The actual thresholds are reported in the next section.
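A minimal sketch of this selection procedure follows, assuming the development-set scores and gold labels are available as dense arrays; the function and variable names are illustrative.

<pre>
import numpy as np
from sklearn.metrics import f1_score

def best_global_threshold(dev_scores, dev_labels, candidates=np.arange(0.05, 1.0, 0.05)):
    """Sweep candidate thresholds and keep the one with the highest micro-averaged
    F1-score on the development set.

    dev_scores: (num_docs, num_codes) array of sigmoid scores
    dev_labels: (num_docs, num_codes) binary array with the gold DeCS codes
    """
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = (dev_scores >= t).astype(int)
        f1 = f1_score(dev_labels, preds, average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# At inference time, a document is assigned every code whose score reaches the threshold:
# predicted = [code for code, score in zip(vocabulary, doc_scores) if score >= best_t]
</pre>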
4. Training setup and submitted systems

We have participated in all the task subtracks with several variations of the CSS and LabelGlosses models, listed in Table 3. For Subtrack 1, we used IXAmBERT [8] as the pre-trained core model, a BERT-base model pre-trained for Spanish, Basque and English. The runs of Subtracks 2 and 3 use the fine-tuned model resulting from Subtrack 1 as the starting point for their own fine-tuning. The reason for this is that the training dataset for Subtrack 2 is small, while there is no training data at all for Subtrack 3 (see Table 1). To measure the impact these choices might have, Subtrack 2 includes a submission that starts from IXAmBERT, and we submit to Subtrack 3 a run that has not been fine-tuned on the subtrack data.

Table 3
Runs submitted to the competition, characterized by model architecture, pre-training of the encoder model, data split used for fine-tuning and inference threshold value

  Subtrack 1 (Scientific literature)
  Run  Architecture  Pre-train   Fine-tuned on        Threshold
  1.1  CSS           IXAmBERT    Subtrack 1 trainset  0.25
  1.2  CSS           IXAmBERT    Subtrack 1 trainset  0.30
  1.3  CSS           IXAmBERT    Subtrack 1 trainset  0.35
  1.4  LabelGlosses  IXAmBERT    Subtrack 1 trainset  0.10
  1.5  LabelGlosses  IXAmBERT    Subtrack 1 trainset  0.20

  Subtrack 2 (Clinical trials)
  2.1  CSS           IXAmBERT    Subtrack 2 trainset  0.25
  2.2  CSS           Run 1 CSS   Subtrack 2 trainset  0.20
  2.3  CSS           Run 1 CSS   Subtrack 2 trainset  0.25
  2.4  CSS           Run 1 CSS   Subtrack 2 trainset  0.30

  Subtrack 3 (Patents)
  3.1  CSS           Run 1 CSS   none                 0.05
  3.2  CSS           Run 1 CSS   Subtrack 3 devset    0.05
  3.3  CSS           Run 1 CSS   Subtrack 3 devset    0.10
  3.4  CSS           Run 1 CSS   Subtrack 3 devset    0.15
  3.5  CSS           Run 1 CSS   Subtrack 3 devset    0.20

The models have been trained using an NVIDIA 2080ti GPU with 11 GB of memory. The training ran for a maximum of 200 epochs, with an early-stopping patience of 50 epochs. Under these conditions, the training of the CSS model required 3-4 days, while the training of the LabelGlosses model required around a week to get to the best result validated on the development set. Other training hyperparameters for the described systems are shown in Table 4. The resulting systems process the background sets (6,000 to 10,000 documents) in 2-3 minutes, at a speed of ∼80 documents per second, using a single NVIDIA RTX 1080ti GPU.

Table 4
Training hyperparameters

  Hyperparameter           Value
  Max. sequence length     300
  Batch size               16
  Optimiser                AdamW [9]
  Learning rate            4E-5
  Learning rate warm-up    linear, 2 epochs
  Non-linearity            Mish [10]
  Max. training epochs     200 epochs
  Early stopping patience  50 epochs
  Dropout rate             0.1
  Monitored metric         micro F1-score
  Min. support cutoff      80

5. Results

Table 5 shows the official results of the competition, including the results of all our runs and the results of the winning system per subtrack. Internal evaluations with the development set showed scores around 44 micro-averaged F1-score points for Subtrack 1. However, in the test set our best system scores 38 points, 10 points below the best competing system in Subtrack 1, and 8 points in Subtrack 2. We achieve a reasonable level of precision, only 3 points below the winning system, but recall scores fall behind. This places us in the third and second position for Subtracks 1 and 2, respectively, after the groups of systems by two other teams.

Table 5
Official results per subtrack and run (the numbers next to the architecture names indicate the inference threshold values), including the best competing system per subtrack

  Subtrack 1 (Scientific literature)
  Run                                           F1     P      R      Acc
  1.1 CSS 0.25 w/ IXAmBERT                      38.23  45.09  33.18  23.99
  1.2 CSS 0.30 w/ IXAmBERT                      38.25  46.22  32.62  24.05
  1.3 CSS 0.35 w/ IXAmBERT                      38.01  47.10  31.86  23.90
  1.4 LabelGlosses 0.10 w/ IXAmBERT             37.04  45.26  31.34  23.13
  1.5 LabelGlosses 0.20 w/ IXAmBERT             37.46  45.60  31.79  23.23
  Best System (BERTDeCS version 4)              48.37  50.77  46.18  32.61

  Subtrack 2 (Clinical trials)
  2.1 CSS 0.25 w/ IXAmBERT                      24.85  27.21  22.87  13.84
  2.2 CSS 0.20 w/ Run 1 CSS                     28.10  28.88  27.36  16.31
  2.3 CSS 0.25 w/ Run 1 CSS                     28.19  29.33  27.15  16.36
  2.4 CSS 0.30 w/ Run 1 CSS                     28.07  29.24  26.78  16.29
  Best System (BERTDeCS version 2)              36.40  36.66  36.14  22.42

  Subtrack 3 (Patents)
  3.1 CSS 0.05 w/ Run 1 CSS (no fine-tuning)    19.68  27.00  15.48  10.76
  3.2 CSS 0.05 w/ Run 1 CSS                     26.51  25.47  27.64  15.72
  3.3 CSS 0.10 w/ Run 1 CSS                     28.34  31.88  25.51  16.89
  3.4 CSS 0.15 w/ Run 1 CSS                     29.08  35.96  24.40  17.29
  3.5 CSS 0.20 w/ Run 1 CSS                     29.21  38.90  23.28  17.25
  Best System (BERTDeCS version 2)              45.14  44.87  45.41  30.05

Unsurprisingly, the results in Subtrack 3 are lower than in the other two, as the runs submitted to this subtrack have seen very little to no in-domain training data, and do not exploit any other source of domain knowledge. It is noteworthy that having fine-tuned the model on just 115 examples (i.e., the development examples available for this subtrack) has had a remarkable positive impact, increasing recall by 12 points (compare Runs 3.1 and 3.2). Using the LabelGlosses architecture in Subtrack 3 might have helped mitigate the lack of training data, although it seems unlikely given the difference between CSS and LabelGlosses in Subtrack 1. Our attempts to include expert knowledge in the system by encoding DeCS glosses have not had a beneficial impact on the results. In fact, the results obtained by the LabelGlosses runs are slightly lower for all the metrics.

A final observation can be made for Runs 2.1 and 2.3, where the former's fine-tuning starts from IXAmBERT while the latter uses the CSS model resulting from Run 1. The knowledge captured from the Subtrack 1 data has helped raise all metrics, particularly recall (+4 points).

6. Discussion

During the training of our systems and their variations, we have made several noteworthy observations. First, the validation scores for all the systems progressed at different paces, some faster than others, towards a plateau of around 45-50 F1-score points in the development sets. Model and hyperparameter variations made little difference. We assume that this plateau is the limit of what the proposed models can learn to generalize from the training data, in particular for the least frequent DeCS codes. For this reason, we tried to inject external knowledge into the model by encoding DeCS term glosses. The proposed approach has not helped in this regard.

Second, our systems show a clear imbalance between precision and recall. We hypothesise that the imbalance is related, among other factors, to the exclusion of DeCS codes from the training data when applying the minimum support cut-off value, although further research would be necessary to confirm this or to uncover interactions between other elements of the systems that might be having this effect. One such relevant element is the decision threshold, which we use to interpret the output of our models.

For each model, we have computed the global threshold that maximises the F1-score in the corresponding development set. That is, the same threshold applies to all the modelled DeCS codes, regardless of the degree of confidence the model might have with respect to each individual code. Given the imbalance of code frequencies in the training data, the confidence is bound to vary greatly. Thus, a decision threshold better tailored to each DeCS code could benefit the precision-recall balance of the results.
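We did not implement per-code thresholds in the submitted runs; the sketch below merely illustrates how such thresholds could be fitted on the development set, reusing the same kind of score and label arrays as in the global-threshold sweep above.

<pre>
import numpy as np
from sklearn.metrics import f1_score

def per_code_thresholds(dev_scores, dev_labels, candidates=np.arange(0.05, 1.0, 0.05)):
    """Pick, for every DeCS code independently, the threshold that maximises its
    F1-score on the development set; codes without positive development examples
    keep a default threshold of 0.5."""
    num_codes = dev_scores.shape[1]
    thresholds = np.full(num_codes, 0.5)
    for c in range(num_codes):
        if dev_labels[:, c].sum() == 0:
            continue
        f1s = [f1_score(dev_labels[:, c], (dev_scores[:, c] >= t).astype(int), zero_division=0)
               for t in candidates]
        thresholds[c] = candidates[int(np.argmax(f1s))]
    return thresholds

# predictions = (scores >= thresholds).astype(int)   # thresholds broadcast per code
</pre>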
It will be interesting to learn how the winning systems have addressed all these problems. For instance, the official results show that the group of systems that obtained the second position in Subtrack 1, surpassing our models, ranks in third position in Subtrack 2, just behind our models. It would be interesting to study the cause of this variation and assess whether the difference lies in the approaches implemented or just in the training procedure.

Overall, the systems we have submitted, in particular the CSS model, are not complex, and the scores achieved are lower than expected given the results obtained on the development sets. However, our systems are lightweight and fast, being able to process about 80 documents per second on a commodity GPU while consuming less than 4 GB of GPU memory, which enables real-time processing scenarios.

7. Conclusions

In these working notes we describe our participation in the MESINESP2 shared task, focused on medical document indexing in Spanish. We have presented two systems based on Transformers, in particular using BERT-base pre-trained models to encode the text information and to perform multi-label classification over the large DeCS code vocabulary.

The simpler approach relies on combining special BERT-encoded tokens as the input for a classification head. In this sense, it is a straightforward model that works fast. The second proposed approach shares key components, namely the BERT-base model and the multi-label classification nature of the model. The main difference is that it adds an extra layer of DeCS code embeddings, which are meant to encode the meaning of each modelled DeCS code. The embeddings are initialized from the BERT-encoded glosses that provide human-readable definitions of the DeCS codes.

Despite our experiments on the development set having yielded scores around 44 F1-score points, our best results in the test set reach only around 38, falling 10 points behind the best competing system. Even with these lower results, our team achieves the third position among the competing teams in Subtrack 1, and the second position in Subtrack 2.

As future work, we have come across several issues that need to be addressed in order to better understand the performance of our systems and improve their results. Most interestingly, we have observed that, regardless of the approach and hyperparameter variations, the models reached a similar plateau in the validation score in all our experiments. This suggests that our approaches meet their limit there, and that additional external knowledge is needed to cross it. Thus, we will focus on better and more efficient representations of the DeCS codes, for instance including their hierarchical nature. It would also be interesting to explore approaches related to semantic information retrieval.

In conclusion, the task remains challenging regardless of the model and approach, with the winning system having achieved scores lower than 50 F1-score points. Further research will be necessary to improve the state of the art of the task proposed by MESINESP2.

Acknowledgments

This work has been partially funded by the projects DeepText (KK-2020-00088, SPRI, Basque Government) and DeepReading (RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE).

References

[1] L. Gasco, A. Nentidis, A. Krithara, D. Estrada-Zavala, R.-T. Murasaki, E. Primo-Peña, C. Bojo-Canales, G. Paliouras, M. Krallinger, Overview of BioASQ 2021-MESINESP track. Evaluation of Advance Hierarchical Classification Techniques for Scientific Literature, Patents and Clinical Trials (2021).
[2] C. Rodriguez-Penagos, A. Nentidis, A. Gonzalez-Agirre, A. Asensio, J. Armengol-Estapé, A. Krithara, M. Villegas, G. Paliouras, M. Krallinger, Overview of MESINESP8, a Spanish Medical Semantic Indexing Task within BioASQ 2020, in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, 2020, pp. 1–12.
[3] A. Nentidis, G. Katsimpras, E. Vandorou, A. Krithara, L. Gasco, M. Krallinger, G. Paliouras, Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering (2021).
[4] C. E. Lipscomb, Medical Subject Headings (MeSH), Bulletin of the Medical Library Association 88 (2000) 265–266.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You Need, in: Proceedings of the Thirty-first Conference on Advances in Neural Information Processing Systems (NeurIPS 2017), 2017, pp. 5998–6008.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[7] L. Gasco, M. Krallinger, M. Antonio, MESINESP2 Corpora: Annotated Data for Medical Semantic Indexing in Spanish, 2021. Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
[8] A. Otegi, A. Agirre, J. A. Campos, A. Soroa, E. Agirre, Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque, in: Proceedings of The 12th Language Resources and Evaluation Conference, 2020, pp. 436–442.
[9] I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization, in: Proceedings of the Seventh International Conference on Learning Representations (ICLR 2019), 2019, pp. 1–18.
[10] D. Misra, Mish: A Self Regularized Non-Monotonic Neural Activation Function, arXiv:1908.08681 (2019) 1–13.