=Paper=
{{Paper
|id=Vol-2943/ehealth_paper6
|storemode=property
|title=Vicomtech at eHealth-KD Challenge 2021: Deep Learning Approaches to Model Health-related Text in Spanish
|pdfUrl=https://ceur-ws.org/Vol-2943/ehealth_paper6.pdf
|volume=Vol-2943
|authors=Aitor García-Pablos,Naiara Perez,Montse Cuadros
|dblpUrl=https://dblp.org/rec/conf/sepln/PablosPC21
}}
==Vicomtech at eHealth-KD Challenge 2021: Deep Learning Approaches to Model Health-related Text in Spanish==
Aitor García-Pablos [0000-0001-9882-7521], Naiara Perez [0000-0001-8648-0428], and Montse Cuadros [0000-0002-3620-1053]

SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San-Sebastián, 20009, Spain
{agarciap,nperez,mcuadros}@vicomtech.org
https://www.vicomtech.org

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This paper describes the participation of the Vicomtech NLP team in the eHealth-KD 2021 shared task about detecting and classifying entities and relations in health-related texts written in Spanish. We participate with two systems. Joint Classifier is a simplified version of the system that won the main scenario of the previous eHealth-KD edition. It consists of a single end-to-end deep neural network with pre-trained BERT models as the core for the semantic representation of the input texts, which predicts all the output variables (entities and relations) at the same time, modelling the whole problem jointly. The main change with respect to the original implementation affects the representation of relations. The Joint Classifier model achieved the first position in the main scenario of the competition and ranked second in the rest of the scenarios. The second submitted system, Seq2Seq, uses an approach based on an encoder-decoder model that transduces the input text into an output sequence. The target sequence is a compact representation of the information contained in the gold labels of the datasets. This approach showed promising performance, although it was not competitive enough to rank among the top systems; however, it constitutes an interesting direction for future work.

Keywords: Entity detection · Relation extraction · Health documents

1 Introduction

This article describes Vicomtech's participation at the eHealth Knowledge Discovery challenge (eHealth-KD) 2021 (https://ehealthkd.github.io/2021) [12]. The challenge proposes a general-purpose semantic structure to model human language, consisting of 4 types of entities and 13 types of relations (see the example in Figure 1). We refer the reader to the challenge overview article [12] for detailed information about eHealth-KD 2021, such as descriptions of the provided corpus and evaluation scenarios.

Vicomtech's participation builds partly on the system submitted to the previous edition [4], which addressed the problems of entity and relation extraction jointly with several classification heads on top of a fine-tuned BERT encoder [3] and intermediate token-pair representations. This approach ranks first in the main challenge scenario for the second consecutive year.

As a novelty, Vicomtech has also experimented with a text-to-text approach, which has recently gained interest as a promising alternative for solving very different NLP tasks in a unified manner [22, 16, 7, 5]. We are interested in this approach because it does not suffer from the limitations of traditional sequence labelling techniques, in particular their difficulty representing non-contiguous and overlapping entities. Although this model ranks lower in the main challenge scenario, the results indicate that the text-to-text approach is a viable option that is worth exploring for this task as well.
Fig. 1: Example of eHealth-KD annotations in the sentence "The pain may start a day or two before your period."

The paper is organised as follows: Section 2 describes the two proposed models, Joint Classifier and Seq2Seq, and their training setups; Section 3 presents the results obtained, including a comparison to other competing systems; finally, Sections 4 and 5 comment on several design choices and provide some concluding remarks.

2 System descriptions

This section provides a comprehensive description of the two submitted systems. For each, we first describe its architecture and then describe how the inputs and outputs have been represented and handled. Finally, we present the training setup.

2.1 Joint Classifier

This model is the result of the revision of our participation in the previous eHealth-KD edition [4]. Most of the changes introduced in the current edition aim at simplifying components of the architecture that seemed redundant or needlessly complex. In addition, we have forgone the ability to predict multi-word discontinuous and/or overlapping entities like the one shown in Figure 1, which in any case required elaborate post-processing to yield acceptable results.

Architecture. Joint Classifier is a deep neural network that receives the input tokens and jointly emits predictions for two output variables:

– Entities: the classification of each individual token into one of the task's entity types or 'O' (from 'Out', meaning that the token is not part of any entity at all, such as "puede" in Figure 1). We use the classical BIO tagging scheme [14] to encode entity boundaries, so the effective output vocabulary size for the 4 entity types is 9.
– Relations: whether token pairs are related by any of the relation types described in the task, or 'O' when there is no relation between the tokens, for a total of 14 output labels.

Fig. 2: High-level diagram of the proposed Joint Classifier model, including the shapes of each layer: B=768 (BERT contextual embedding size); N=64 (entity embedding size); E=9 (output vocabulary size for the entities head); D=768 (DistilBERT contextual embedding size); R=14 (output vocabulary size for the relations head).

An overview of the inner workings of the network is given in Figure 2. The computation of the model starts by feeding the input tokens into a BERT model to obtain their contextual embeddings (1). These embeddings are passed to a classification layer that emits logits with the predictions about each token being or not an entity of a certain type (2). The output of this classification layer is one of the two outputs of the network. The entity logits pass through an argmax function to select the entity index that, in its current state, the model would predict for each token. These entity indices are used to select entity embeddings from a custom embeddings layer initialised at the beginning of the training (3). Each entity embedding is concatenated to the contextual embedding of the token it was predicted from.
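The entity side of this computation (steps 1 to 3) can be pictured with a minimal PyTorch sketch. The layer sizes follow Figure 2; the class name, the simplified single-linear classifier and the BETO checkpoint identifier are illustrative assumptions, not the authors' actual code:

```python
# Minimal sketch of the entity head of Joint Classifier (steps 1-3 above).
# Assumptions: sizes follow Figure 2 (B=768, N=64, E=9); names and the
# checkpoint identifier are illustrative, not taken from the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModel

NUM_ENTITY_TAGS = 9      # 4 entity types in BIO scheme + 'O'
ENTITY_EMB_SIZE = 64     # N in Figure 2

class EntityHead(nn.Module):
    def __init__(self, bert_name="dccuchile/bert-base-spanish-wwm-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)              # step 1
        hidden = self.bert.config.hidden_size                         # B = 768
        self.entity_classifier = nn.Linear(hidden, NUM_ENTITY_TAGS)   # step 2
        self.entity_embeddings = nn.Embedding(NUM_ENTITY_TAGS,
                                              ENTITY_EMB_SIZE)        # step 3

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings for each (sub)token: (batch, S, B)
        hidden_states = self.bert(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        entity_logits = self.entity_classifier(hidden_states)         # (batch, S, E)
        # Use the current entity predictions to look up entity embeddings
        predicted_tags = entity_logits.argmax(dim=-1)                 # (batch, S)
        tag_embeddings = self.entity_embeddings(predicted_tags)       # (batch, S, N)
        # Concatenate contextual and entity embeddings: (batch, S, B+N)
        token_repr = torch.cat([hidden_states, tag_embeddings], dim=-1)
        return entity_logits, token_repr
```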
The resulting vectors are combined to obtain an all-vs-all combination of all tokens, resulting in S × S combined embeddings that represent all the possible token pairs, S being the length of the input sequence (4). These embeddings are further passed to a small, randomly initialised DistilBERT model [15] with only two layers of two attention heads each. The objective of this model is to further capture interactions between the token pairs via self-attention. Finally, the resulting relation representations (5) are passed to another classification layer to predict the type of the relation between each pair of tokens, if there is a relation at all (6). Relations are modelled as outgoing arcs. That is, if the pair token_i ⊕ token_j is linked by a relation of type R, token_i and token_j are the source and the destination of the relation R, respectively. This is the second and last output of the network.

The network has a total of two classifiers built with the same stack of layers: a fully connected linear transformation layer, followed by a dropout layer and a non-linear activation function (Mish [9]), and a final linear transformation that outputs the logits for the given output variable. The output probabilities are obtained by applying the softmax function.
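A hedged sketch of the all-vs-all pair construction (step 4) and of the shared classifier stack is given below. It only shows one plausible way of building the S × S pair matrix by concatenation (the figure also mentions an element-wise product) and it omits the DistilBERT pair re-encoder; the Mish activation is available in recent PyTorch versions as nn.Mish, and all other names are illustrative assumptions:

```python
# Sketch of the all-vs-all token-pair representation (step 4) and the
# shared classifier stack (Linear -> Dropout -> Mish -> Linear).
# Dimensions follow Figure 2; names are illustrative assumptions.
import torch
import torch.nn as nn

def all_vs_all_pairs(token_repr: torch.Tensor) -> torch.Tensor:
    """token_repr: (batch, S, B+N) -> pair tensor (batch, S, S, 2*(B+N)).

    Position (i, j) holds the concatenation of the representations of
    token_i (candidate relation source) and token_j (candidate destination).
    """
    batch, seq_len, dim = token_repr.shape
    src = token_repr.unsqueeze(2).expand(batch, seq_len, seq_len, dim)
    dst = token_repr.unsqueeze(1).expand(batch, seq_len, seq_len, dim)
    return torch.cat([src, dst], dim=-1)

class ClassifierStack(nn.Module):
    """Classifier stack shared by the entity and relation heads."""
    def __init__(self, in_dim: int, hidden_dim: int, num_labels: int,
                 dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Dropout(dropout),
            nn.Mish(),
            nn.Linear(hidden_dim, num_labels),  # logits; softmax applied later
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example: relation logits over all S x S token pairs (R=14 labels),
# using token_repr as produced by the previous sketch.
# pairs = all_vs_all_pairs(token_repr)                   # (batch, S, S, 2*(B+N))
# relation_logits = ClassifierStack(pairs.size(-1), 768, 14)(pairs)
```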
Input and output handling. The challenge corpus has been provided in Brat [17] standoff format (see Figures 1 and 3a). This format is character span-based, while Joint Classifier works at token level. Consequently, this system relies on a set of pre-processing and post-processing transformation steps, explained below.

Input representation. Figure 3b shows an example of the information representation designed with all the network's input and output variables.

Notably, the proposed system does not address non-contiguous and/or overlapping entities, as it relies on the BIO tagging scheme. As shown in Figure 3b, the text span "uno o dos días" is represented as "uno" on the one hand and "dos días" on the other, while the gold annotations define the entities "uno días" and "dos días" for the same text span (see Figure 3a).

As explained in Section 2.1, Joint Classifier performs all the tasks end-to-end, using its own entity predictions as input to detect relations. In Task B, gold entity annotations are provided by the task organisers; systems need to focus on the relations only. In this case, our model accepts gold entity labels along with the input tokens, and replaces the predicted entities with a one-hot encoding of the gold ones as the input for detecting relations.

  T705 Action 10111 10116 dolor
  T706 Action 10123 10131 comenzar
  T707 Concept 10132 10135;10142 10146 uno días
  T708 Concept 10138 10141;10142 10146 dos días
  T709 Predicate 10147 10152 antes
  T710 Concept 10159 10166 período
  R628 in-time Arg1:T706 Arg2:T709
  R629 target Arg1:T706 Arg2:T705
  R630 arg Arg1:T709 Arg2:T710
  R631 in-context Arg1:T709 Arg2:T708
  R632 in-context Arg1:T709 Arg2:T707

(a) Representation in Brat's standoff format, i.e. the original format of the challenge corpus and the format expected for submission.

  El        O            O                      -
  dolor     B-Action     O                      -
  puede     O            O                      -
  comenzar  B-Action     target,in-time         1,8
  uno       B-Concept    O                      -
  o         O            O                      -
  dos       B-Concept    O                      -
  días      I-Concept    O                      -
  antes     B-Predicate  in-context,in-context  4,6

(b) Representation for Joint Classifier. The first column provides the input variables: the tokens. The second and third columns provide the output variables, namely, the entity tags and relation types. The last column contains pointers to the destination tokens of the relations (note that it is not an output variable; this information is encoded in the S × S matrices described in Section 2.1).

  comenzar:[Action];[in-time];antes:[Predicate]
  comenzar:[Action];[target];dolor:[Action]
  antes:[Predicate];[arg];período:[Concept]
  antes:[Predicate];[in-context];dos días:[Concept]
  antes:[Predicate];[in-context];uno días:[Concept]

(c) Representation for Seq2Seq (note that the actual representation consists of a single line, while here we show one pentad per line for better readability).

Fig. 3: Textual representations of the annotations depicted in Figure 1.

Output interpretation. The output of the entity classifier is straightforwardly interpreted as a regular sequence-labelling task, selecting the most probable prediction for each individual token. Next, we construct the span-based annotations from the emitted BIO tags. Ill-formed tag sequences are always solved by transforming the offending tag into the beginning of a new entity (e.g. the tag sequence [B-Concept, I-Concept, I-Action] would be fixed as [B-Concept, I-Concept, B-Action]).

As for relations, the network's outcome for each modelled relation variable forms an S × S matrix, S being the length of the token sequence. Each position (i, j), with i, j ∈ [0, S], contains the prediction for the relation between token_i (the source) and token_j (the destination). Again, as the expected output is span-based instead of token-based, the predicted relations are mapped from source/destination tokens to source/destination entities, the entities having been interpreted as explained above. Relations from/to tokens that are not part of an entity are simply ignored, as are repeated and reflexive relations.

Throughout the whole process, token positions must be correctly handled to account for deviations and extra offsets introduced by BERT's tokenization: BERT uses WordPiece tokenization [20], which breaks original tokens into sub-tokens; in addition, it requires that extra special tokens be added, which distort the token positions with respect to the original input.
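As a hedged illustration of the BIO post-processing described above, the following sketch repairs ill-formed tag sequences and groups the tags into token spans. Function names and the exact grouping logic are assumptions; they do not reproduce the authors' actual post-processing code:

```python
# Sketch of the BIO post-processing described above: ill-formed 'I-' tags
# are turned into the beginning of a new entity, then tags are grouped into
# (first_token, last_token, type) spans. Names are illustrative assumptions.
from typing import List, Tuple

def repair_bio(tags: List[str]) -> List[str]:
    """Turn an offending I-X tag into B-X when it cannot continue an entity."""
    repaired = []
    prev_type = None
    for tag in tags:
        if tag.startswith("I-"):
            ent_type = tag[2:]
            if prev_type != ent_type:
                tag = "B-" + ent_type      # e.g. [B-Concept, I-Concept, I-Action]
        prev_type = tag[2:] if tag != "O" else None    # -> [..., B-Action]
        repaired.append(tag)
    return repaired

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group repaired BIO tags into (first_token, last_token, entity_type)."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current is not None:
                spans.append((start, i - 1, current))
            start, current = i, tag[2:]
        elif tag == "O" and current is not None:
            spans.append((start, i - 1, current))
            start, current = None, None
    if current is not None:
        spans.append((start, len(tags) - 1, current))
    return spans

# Example from the text:
tags = repair_bio(["B-Concept", "I-Concept", "I-Action"])
# tags  == ["B-Concept", "I-Concept", "B-Action"]
spans = bio_to_spans(tags)
# spans == [(0, 1, "Concept"), (2, 2, "Action")]
```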
2.2 Seq2Seq

There are several phenomena in the challenge data, such as the non-contiguous multi-word entities mentioned earlier, that are difficult to model and predict with a traditional sequence labelling model. For this reason, we have explored a radically different strategy: the text-to-text paradigm. This paradigm is far more flexible because it can ingest a sequence of any length and output another arbitrary sequence. That is, the output sequence is not tied to the structure of the input; a text-to-text model can potentially encode any sort of information [13].

Architecture. The objective of the Seq2Seq model is to transform an input sequence (i.e. a text) into another sequence of semi-structured elements that represent the relevant information of the task. To implement this approach, we rely on a Transformer-based encoder-decoder model [18]. In particular, we use the T5 architecture [13] as is. The approach revolves around the representation format of the information encoded by the sequences, and how to reinterpret this information back into the original, task-specific representation format.

Input and output handling. How the output is encoded is critical to help the model learn to derive the desired information from the input. For this reason, we have attempted to produce a data format as compact and summarised as possible. Next, we describe this format and how the model output is mapped back to Brat's standoff format.

Target sequence representation. As shown in Figure 3a, Brat's standoff format assigns a unique identifier to each entity. These identifiers are used to declare the relations between the entities. Further, the entities are defined not only by their type and actual text, but also by the spans in which they occur (that is, the positions in the text in terms of character offsets). In order to relieve the encoder-decoder model of the burden of dealing with deictic information, the data has been represented as five-fold elements built around each relation: r = (e1_text, e1_type, r_type, e2_text, e2_type), where e1 and e2 are the source and destination entities of the relation r. Further, we serialise each of these elements with predefined punctuation marks as separators between the elements of the pentad. An example is shown in Figure 3c. The proposed format gets rid of the boilerplate used by Brat's standoff format (e.g. "Arg1:", "Arg2:", and so on), reducing the sparsity and redundancy while keeping the output sequences as short as possible. Further, the entity types, the relation types and the separators are added to the model's tokenizer vocabulary, so they receive their own word embedding.

Output interpretation. Simplifying the target sequence representation as explained above means that all the omitted information, such as the locations of the entities in the input text, needs to be automatically extrapolated from the system's output. This task is made more difficult by the fact that the same term or expression may occur more than once in the input text; thus, the system must choose which of the occurrences the output refers to. On top of that, the model may produce elements that do not strictly occur in the input text, because we do not impose any constraint on the generation process (this topic is further discussed in Section 4).

The procedure for constructing Brat annotations from the model's output is as follows. First, we parse the output sequence into an array of pentads, guided by the predefined separators. Ill-formed pentads are directly ignored. As explained above, each parsed pentad represents a relation between two entities. Then, these entities must be located in the input text based on their textual form (i.e. e1_text and e2_text).

The initial matching attempt is based on regular expressions. The pattern, built from the entity's text, allows for content of any length between the tokens of the entity. For instance, the pattern for "terapia biológica" would be (\bterapia\b)(?:.*?)(\bbiológica\b). Case-sensitive matches are preferred over insensitive ones, but the latter are allowed as well. Only if the regular expressions do not yield any match whatsoever do we proceed to apply a more flexible search: we retrieve as matches all the words of the input text that have a Levenshtein distance [6] to the entity smaller than 3 edit operations. This is restricted to entities composed of only one word that is 5 characters long or longer.

If more than one match is obtained for a given entity, we choose the occurrence that is closest in the input text to the other related entity of the pentad. Finally, we merge entities that occur exactly in the same spans, and discard reflexive and repeated relations.
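A hedged sketch of this entity-localisation step is shown below. It follows the regular-expression pattern and the Levenshtein fallback as described above; the helper names, the plain dynamic-programming edit distance and the example sentence are illustrative assumptions:

```python
# Sketch of locating a generated entity string in the input text:
# first a flexible regular expression, then a Levenshtein fallback for
# single-word entities of 5+ characters. Names are illustrative assumptions.
import re

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def find_entity_matches(entity_text: str, input_text: str):
    """Return a list of (start, end) character spans for the entity."""
    # Regex pattern: the entity's tokens, allowing anything in between,
    # e.g. "terapia biológica" -> (\bterapia\b)(?:.*?)(\bbiológica\b)
    tokens = [re.escape(t) for t in entity_text.split()]
    pattern = r"(?:.*?)".join(r"(\b{}\b)".format(t) for t in tokens)
    matches = [m.span() for m in re.finditer(pattern, input_text)]
    if not matches:  # case-insensitive matches allowed as a second choice
        matches = [m.span()
                   for m in re.finditer(pattern, input_text, re.IGNORECASE)]
    if not matches and len(tokens) == 1 and len(entity_text) >= 5:
        # Levenshtein fallback: words within 2 edit operations of the entity
        for m in re.finditer(r"\w+", input_text):
            if levenshtein(m.group(0).lower(), entity_text.lower()) < 3:
                matches.append(m.span())
    return matches

# Hypothetical usage:
spans = find_entity_matches("terapia biológica",
                            "La terapia con un fármaco biológica fue eficaz.")
```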
2.3 Training setup

Joint Classifier and Seq2Seq have been implemented in Python 3.7 with HuggingFace's transformers library (https://github.com/huggingface/transformers) [19], and each has been trained on one Nvidia GeForce RTX 2080 GPU with ~11 GB of memory.

For Joint Classifier, we have experimented with two pre-trained BERT models as the core for the semantic representation of the input tokens: IXAmBERT base cased [10] and BETO base cased [2]. The former has been pre-trained on Basque, English and Spanish Wikipedia content, while the latter is a monolingual model for Spanish.

The Seq2Seq system is built on a small mT5 model [21], a multilingual version of the T5 encoder-decoder [13]. The choice is merely based on the fact that mT5 is multilingual and that there are checkpoints available for several architectural sizes. Due to computational limitations, we use the mt5-small model.

Other training hyperparameters of both systems can be consulted in Table 1. It should be noted that no in-domain language post-training of the base models has been performed. In this sense, the approaches are general and domain agnostic. The only resource used for fine-tuning the whole systems is the training data provided for the task. We used the development data solely for the purpose of choosing the best models for submission.

Table 1: Training hyperparameters

                            Joint Classifier      Seq2Seq
  Max. input length         80                    50
  Max. target length        n/a                   250
  Batch size                2                     2
  Optimiser                 AdamW [8]             AdamW [8]
  Learning rate             2E-5                  2E-5
  Learning rate warm-up     linear, 50 epochs     linear, 5 epochs
  Gradient acc. steps       2                     4
  Classifier dropout rate   0.5                   n/a
  Other dropout rates       0.1                   0.1
  Early stopping patience   500 epochs            50 epochs
  Monitored metric          relations F1-score    tokens F1-score
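For orientation, the following sketch shows how the pre-trained checkpoints and the optimiser of Table 1 could be instantiated with HuggingFace transformers. The hub identifiers (dccuchile/bert-base-spanish-wwm-cased, ixa-ehu/ixambert-base-cased, google/mt5-small), the added marker tokens and the step counts are assumptions for illustration, not details taken from the authors' code:

```python
# Sketch of loading the base models and the optimiser/schedule of Table 1.
# Checkpoint names, marker tokens and step counts are assumptions.
import torch
from transformers import (AutoModel, MT5ForConditionalGeneration,
                          MT5Tokenizer, get_linear_schedule_with_warmup)

# Encoders tried for Joint Classifier
beto = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
ixambert = AutoModel.from_pretrained("ixa-ehu/ixambert-base-cased")

# Encoder-decoder for Seq2Seq
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
mt5_tok = MT5Tokenizer.from_pretrained("google/mt5-small")
# Entity/relation markers and separators get their own embeddings
# (the exact marker inventory is an assumption based on Figure 3c).
mt5_tok.add_tokens(["[Action]", "[Concept]", "[Predicate]", ";"])
mt5.resize_token_embeddings(len(mt5_tok))

# Optimiser and linear warm-up as listed in Table 1 (Seq2Seq values)
optimizer = torch.optim.AdamW(mt5.parameters(), lr=2e-5)
steps_per_epoch = 100    # placeholder; depends on dataset size and batching
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5 * steps_per_epoch,      # "linear, 5 epochs"
    num_training_steps=500 * steps_per_epoch)  # placeholder total budget
```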
3 Results

Vicomtech participated in the challenge with the following runs:

– Run 1: Seq2Seq with mT5 small
– Run 2: Joint Classifier with BETO base cased
– Run 3: Joint Classifier with IXAmBERT base cased

We did not submit Run 1 to Scenario 3 (relation extraction) because presently the Seq2Seq system does not have a mechanism to exploit gold entity annotations.

The results for each scenario and run are shown in Table 2. We provide overall results as well as separate results for the languages present in the testing data, namely Spanish and English. It must be noted that the training data consisted of content in Spanish exclusively. In addition, the best results obtained among all the participants in the challenge are also included per scenario for benchmarking purposes. Finally, we trained and evaluated Joint Classifier with BETO on the previous edition's data, in order to measure the impact of the changes introduced in its architecture; the results are shown in Table 3.

Table 2: Official results of the submitted runs and the best system in each scenario (P / R / F1 per language)

  Scenario 1 - Main                 ES                      EN                      Total
  Run 1: mT5                        67.01 / 51.64 / 58.33   33.69 / 32.12 / 32.88   51.27 / 43.44 / 47.03
  Run 2: BETO                       66.36 / 60.89 / 63.51   37.12 / 34.09 / 35.54   54.07 / 49.63 / 51.76
  Run 3: IXAmBERT (best)            68.55 / 62.68 / 65.49   35.41 / 40.73 / 37.88   52.75 / 53.46 / 53.11
  PUCRJ-PUCPR-UFMG [11] (2nd)       -                       -                       56.85 / 50.28 / 52.84

  Scenario 2 - Task A: entity recognition and classification
  Run 1: mT5                        83.46 / 71.06 / 76.77   56.33 / 44.11 / 49.48   69.99 / 57.11 / 62.90
  Run 2: BETO                       79.44 / 81.37 / 80.39   41.61 / 53.82 / 46.94   57.67 / 67.11 / 62.04
  Run 3: IXAmBERT (2nd)             79.60 / 83.48 / 81.49   50.79 / 66.53 / 57.60   63.10 / 74.71 / 68.41
  PUCRJ-PUCPR-UFMG [11] (best)      -                       -                       71.49 / 69.73 / 70.60

  Scenario 3 - Task B: relation extraction
  Run 2: BETO                       56.11 / 37.31 / 44.82   15.38 /  1.40 /  2.56   50.83 / 18.59 / 27.22
  Run 3: IXAmBERT (2nd)             57.88 / 40.10 / 47.38   47.77 / 17.48 / 25.60   54.19 / 28.31 / 37.19
  IXA [1] (best)                    -                       -                       45.36 / 40.95 / 43.04

Joint Classifier with IXAmBERT has achieved the best scores of the challenge in the Main scenario (53.11 F1-score) by a narrow margin. It is surpassed by other participants in the individual tasks (i.e. Scenarios 2 and 3), notably more so in the relation extraction task, where the best system achieves ~6 F1-score points more (37.19 vs 43.04) due to its remarkably higher recall. Still, Joint Classifier with IXAmBERT is the second-best system in Scenarios 2 and 3.

Joint Classifier with BETO performs consistently ~2 points worse than IXAmBERT on the Spanish evaluation dataset. As expected, this gap widens considerably on the English dataset, because IXAmBERT is a multilingual model, unlike BETO. Seq2Seq, also trained on a multilingual model (mT5-small), actually performs better than Joint Classifier with BETO on the NERC scenario; however, it obtains the worst results overall among Vicomtech's submitted runs. It must be noted, however, that it shows competitive results when compared to other systems presented in the task [12].

Regarding the comparison between the original [4] and the submitted Joint Classifier implementations, the results in Table 3 indicate that the simplifications introduced do affect the performance negatively, but the system remains competitive. The current version would still have won first place in the eHealth-KD 2020 edition despite achieving lower scores than the original version.

Table 3: Results on the eHealth-KD 2020 edition testing dataset (Scenario 1)

                                           P       R       F1
  2020 Joint Classifier with BETO          67.94   65.23   66.56
  2021 Joint Classifier with BETO          68.67   61.48   64.87

4 Discussion

The Joint Classifier model is an attempt to reduce the complexity of the model by removing several seemingly redundant layers from the winner system of the last eHealth-KD edition. The reworked model gives up the two-way representation of the relations: instead of representing the incoming and outgoing arcs for every relation, the new model represents only the outgoing ones. Ideally, this would not cause any information loss, since the relations are in fact unidirectional. In addition, the reworked Joint Classifier model does not model same-as and multi-word relations separately, dropping another set of layers. As a consequence, however, the model is unable to deal with non-contiguous or overlapping entities.

In spite of these simplifications, the resulting model has won this year's competition as well. According to our experiments, it would have obtained slightly lower scores on the 2020 data, suffering a loss of almost 4 recall points. This suggests that the redundancy in the modelling of relations did actually help the system detect more relations.
With reference to the Seq2Seq model, our experiments show promising results. A manual error analysis has revealed that many mismatches of entities are due to the model's output containing semantic, grammatical and/or orthographic variations of the input text (see examples in Table 4). Furthermore, we have observed that the model has difficulties with numeric expressions. It may also produce incorrect words, that is, words that do not exist in Spanish or English. On a few occasions it even produces expressions that have no apparent relation to the input target, as shown by the last three examples. These issues could be tackled by partly constraining the output of the model to elements present in the input text. Moreover, these models usually need more data than is available in this challenge (1,500 training sentences). A larger encoder-decoder model would also be necessary to obtain results that compete with the traditional approaches, given that the task requires a deep language understanding.

Table 4: Examples of predictions of entities generated by Seq2Seq ("*" means that the word is not part of the Spanish nor English vocabulary); the original terms to which the entity may be referring are marked in boldface in the original paper

  Prediction    Sentence
  quicker       [...] their detection can be much faster and simpler than RT-PCR.
  principal     [It] was the major blood immune response for COVID-19 infection.
  identificad   [...] mNGS identified six patients harboring transcriptionally active [...]
  pandemia      [...] la epidemia de COVID-19 podría prolongarse por doce semanas.
  Estas         Todas las personas tienen estreñimiento alguna vez.
  Usted         [Usted] también puede comunicarle sus deseos a su familia.
  25. 6 casos   [...] de coronavirus per cápita con 256.2 casos por millón de personas.
  60 años       Un hombre de 70 años en el cantón de habla italiana de Ticino [...]
  *probearse    La vacuna se está fabricando para que pueda probarse primero en animales.
  *toriste      Es más que sentirse "triste" por algunos días.
  guardan       También admiran lo externo, como a sus amigos, quienes suelen ser del mismo sexo.
  compra        El cirujano cose el pulmón nuevo a los vasos sanguíneos y las vías respiratorias.
  personas      Los nervios periféricos se encuentran fuera del cerebro y de la médula espinal.

5 Conclusions

In these working notes we have described Vicomtech's participation in the eHealth-KD 2021 shared task. We have presented a reworked version of the end-to-end deep-learning-based architecture that won last year's main scenario. This version of the model drops some expendable layers that were encoding somewhat redundant information. It still shows a very competitive performance, having won the main scenario again.

We have also presented an approach based on a sequence-to-sequence model. We have encoded the target information into a sequence of elements, so the model learns to derive those from the input text. This data representation is more flexible, and allows us to represent non-contiguous multi-word entities more easily. The results do not improve on the more traditional approach, but the obtained scores and the error analysis suggest that this approach may become competitive after addressing some specific issues, which we leave as future work. On the one hand, a more controlled and constrained output generation may improve the results, since a relevant source of errors was related to the generation of entities that are not present in the input text. On the other hand, the use of more data or additional pre-training, combined with larger or better encoder-decoder models, may also help.
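As a hedged illustration of what such constrained output generation could look like, the sketch below restricts mT5 decoding to tokens that occur in the input sentence plus the marker tokens, using the prefix_allowed_tokens_fn hook of HuggingFace's generate. This is only one possible reading of the idea left as future work, not something the authors implemented:

```python
# Sketch of constrained decoding: at every step the decoder may only emit
# tokens that appear in the input sentence, the added markers, or EOS.
# One possible reading of the "constrained generation" future work,
# not the authors' method; marker tokens are assumptions.
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
tokenizer.add_tokens(["[Action]", "[Concept]", "[Predicate]", ";"])
model.resize_token_embeddings(len(tokenizer))

sentence = "El dolor puede comenzar uno o dos días antes de su período."
inputs = tokenizer(sentence, return_tensors="pt")

# Whitelist: input tokens + marker tokens + end-of-sequence token.
markers = tokenizer.convert_tokens_to_ids(["[Action]", "[Concept]",
                                           "[Predicate]", ";"])
allowed = sorted(set(inputs["input_ids"][0].tolist())
                 | set(markers) | {tokenizer.eos_token_id})

def allowed_tokens(batch_id, generated_ids):
    # Same whitelist at every decoding step.
    return allowed

output_ids = model.generate(**inputs, max_length=250,
                            prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```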
All in all, the task remains challenging regardless of the model and approach. Further research will be necessary to improve the state of the art for key aspects of the task, in particular relation extraction.

Acknowledgments

This work has been partially funded by the projects DeepText (KK-2020-00088, SPRI, Basque Government) and DeepReading (RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE).

References

1. Andrés, E.: IXA at eHealth-KD Challenge 2021: Generic Sequence Labeling as Relation Extraction Approach. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021)
2. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth International Conference on Learning Representations (ICLR 2020). pp. 1–9 (2020)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
4. García-Pablos, A., Perez, N., Cuadros, M., Zotova, E.: Vicomtech at eHealth-KD Challenge 2020: Deep End-to-End Model for Entity and Relation Extraction in Medical Text. In: Proceedings of the Iberian Languages Evaluation Forum co-located with the 36th Conference of the Spanish Society for Natural Language Processing, IberLEF@SEPLN. pp. 102–111 (2020)
5. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring Massive Multitask Language Understanding. In: Proceedings of the Ninth International Conference on Learning Representations (ICLR 2021). pp. 1–27 (2021)
6. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics-Doklady 10(8), 707–710 (1966)
7. Lopez, L.E., Cruz, D.K., Cruz, J.C.B., Cheng, C.: Transformer-based end-to-end question generation. arXiv preprint arXiv:2005.01107, pp. 1–9 (2020)
8. Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: Proceedings of the Seventh International Conference on Learning Representations (ICLR 2019). pp. 1–18 (2019)
9. Misra, D.: Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv:1908.08681, pp. 1–13 (2019)
10. Otegi, A., Agirre, A., Campos, J.A., Soroa, A., Agirre, E.: Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 436–442 (2020)
11. Pavanelli, L., Rubel Schneider, E.T., Bonescki Gumiel, Y., Castro Ferreira, T., Ferro Antunes de Oliveira, L., Andrioli de Souza, J.V., Meneghel Paiva, G.P., Silva e Oliveira, L.E., Cabral Moro, C.M., Cabrera Paraiso, E., Labera, E., Pagano, A.: PUCRJ-PUCPR-UFMG at eHealth-KD Challenge 2021: A Multilingual BERT-based System for Joint Entity Recognition and Relation Extraction. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021)
12. Piad-Morffis, A., Gutiérrez, Y., Estevez-Velarde, S., Almeida-Cruz, Y., Muñoz, R., Montoyo, A.: Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2021. Procesamiento del Lenguaje Natural 67(0) (2021)
13. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 1–67 (2020)
14. Ramshaw, L.A., Marcus, M.P.: Text Chunking Using Transformation-Based Learning. In: Natural Language Processing Using Very Large Corpora, pp. 157–176. Springer (1999)
15. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) co-located with the Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019). pp. 1–5 (2019)
16. Santra, B., Anusha, P., Goyal, P.: Hierarchical Transformer for Task Oriented Dialog Systems. arXiv preprint arXiv:2011.08067, pp. 1–10 (2020)
17. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: BRAT: A Web-based Tool for NLP-assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12). pp. 102–107 (2012)
18. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. In: Proceedings of the Thirty-first Conference on Advances in Neural Information Processing Systems (NeurIPS 2017). pp. 5998–6008 (2017)
19. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771, pp. 1–11 (2019)
20. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144, pp. 1–23 (2016)
21. Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., Raffel, C.: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. arXiv:2010.11934, pp. 1–17 (2020)
22. Zou, Y., Zhang, X., Lu, W., Wei, F., Zhou, M.: Pre-training for Abstractive Document Summarization by Reinstating Source Text. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 3646–3660 (2020)