=Paper=
{{Paper
|id=Vol-2943/ehealth_paper1
|storemode=property
|title=IXA at eHealth-KD Challenge 2021: Generic Sequence Labelling as Relation Extraction Approach
|pdfUrl=https://ceur-ws.org/Vol-2943/ehealth_paper1.pdf
|volume=Vol-2943
|authors=Edgar Andrés
|dblpUrl=https://dblp.org/rec/conf/sepln/Andres21
}}
==IXA at eHealth-KD Challenge 2021: Generic Sequence Labelling as Relation Extraction Approach==
Edgar Andrés [0000-0002-8190-8963]
UPV/EHU IXA research group, Bilbao, Spain
edgar.andres@ehu.eus

Supported by the UPV/EHU IXA research group.
IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The eHealth-KD 2021 challenge addresses the automatic extraction of knowledge from health documents written in Spanish, including a small selection of sentences from other domains and languages to encourage cross-lingual and transfer learning approaches. We use a pre-trained Language Model (LM), namely XLM-RoBERTa-base, to provide a cross-lingual representation of tokens and the ability to transfer learning from general domains. Our group participated in all the proposed scenarios: the main one (F1 0.499), Entity Recognition (ER) (F1 0.653) and Relation Extraction (RE) (F1 0.430). The system was designed as a pipeline of generic sequence labellers, each of them independently fine-tuned for one subtask. A generic sequence labeller consists of a feed-forward network that learns how to map a sequence of tokens to a sequence of labels regardless of language and domain. This simple, straightforward system ranked third in the main and Entity Recognition scenarios and widely outperformed the other systems in the Relation Extraction scenario.

Keywords: eHealthKD 2021, Knowledge Discovery, Natural Language Processing, Deep Learning

1 Introduction

The eHealthKD series provides valuable scenarios to build and evaluate Natural Language Processing systems in the medical domain. This year, eHealthKD 2021 [1] includes a selection of sentences not exclusively from medical texts but also from other domains and languages. In recent years, the amount of medical text, regardless of format, has grown exponentially, and with it the interest in processing it for various clinical purposes. In this paper, we propose a system to extract entity mentions and their semantic relation types from Spanish texts in the context of the eHealthKD 2021 evaluation task. The system was built in two steps: we first identify and classify entity mentions in the sentence, and afterwards we classify the relation type of identified entity pairs. Sentences are encoded using fine-tuned XLM-RoBERTa [2], a neural language model trained on multiple languages, including Spanish. We transfer general knowledge by taking the pre-trained XLM-RoBERTa-base language model and fine-tuning it for the tasks of identifying entities and relations. We propose a simple yet robust model in which each component is trained separately. This strategy, contrary to joint models, makes learning easier and faster (focusing on one task at a time) and gives flexibility for domain and language adaptation. With this system, we test the hypothesis that generic sequence labellers can handle the relation extraction task competitively, given suitable formats to represent the problem and appropriate ways to condition the LM.

2 State Of The Art

Current systems focus on retraining LMs for span detection and entity detection [3, 4]. The LM chosen for a task is strongly tied to the results achieved in specific domains [5, 6]: since the appearance of LMs, the performance of NLP tasks has been directly related to the LM used to represent textual data.
Relation Extraction (RE) approaches saw their first revolution with [7], leading to high-performance systems [8–10] that are mainly based on BERT technologies and derivatives. In clinical-domain RE [11, 12] we find high-performance systems based on specific techniques such as novel Bi-LSTM architectures. State-of-the-art domain-agnostic approaches [13, 14] follow the idea of improving the LM representation with techniques that adapt it to the required task, be it Sequence Labelling (SL) or RE. Cross-lingual performance [15] derives mainly from the appearance of the BERT model and the cross-lingual features it provides. SL [16, 17], although an extensively researched field, is nowadays widely applied in new areas such as clinical data mining.

3 System Description

Generic sequence labelling. The two-stage system that extracts entity mentions and their semantic relation types from Spanish texts is based on a pipeline of fine-tuned generic sequence labellers, as depicted in Figure 1. We use a feed-forward network (FFN) to compute the probability $\tilde{y}_i = FFN(x_i)$ for each token, where each value in $\tilde{y}_i$ represents the score for a tag in the target tag set. Equation 1 shows how we formalize the feed-forward network:

$FFN(x_i) = \mathrm{softmax}(W_e x_i + b_e)$    (1)

We decided to apply a pipeline of sequence labellers to 1) keep the model as simple as possible and 2) avoid over-fitting, as a joint model could learn spurious dependencies from the training data. For the final prediction, we apply an argmax function over the label probability distribution obtained for each token. The sequence labelling is learned by minimizing the cross-entropy loss shown in Equation 2:

$L_t = -\frac{1}{N}\sum_{i=1}^{N} y_i \log \tilde{y}_i$    (2)

where $y_i$ is the true label vector for the input token $x_i$, and $N$ is the number of instances in the training set for the task. As can be deduced, the input file format consists of two columns, containing an $x_i$, $y_i$ pair per line; different examples are separated by empty lines. Finally, we use a special "line jump" token to mark the end of a text.

Fig. 1. 2-step relation extraction.

Subtask A: Entity recognition. The input provided by the organizers in BRAT standoff format (.ann) was split into texts preserving line breaks, and the texts were then divided into tokens on white space. These tokens were aligned with the labels following the Inside-Outside-Beginning (IOB) format. This format does not capture overlapping or discontinuous entities. The output of the system was converted back into .ann files.

Subtask B: Relation extraction. We applied the same tokenization strategy described for entity recognition. In this case, since the entities are already identified in the previous step, we pair each possible entity combination, generating one example per pair; entity markers [18] are added around the entities to avoid overfitting. We then align the marked entities with the relation type. The output of the system was transformed into the final .ann files.

Training setup. We used Hugging Face Transformers [19] with the default training parameter setup. Both systems were trained on their respective training sets and fine-tuned against their respective development sets; both ran for 40 epochs with a batch size of 40 examples, and each fine-tuning maximized the F1 score described in the CoNLL-2005 shared task [20]. The best model out of 11 checkpoints (1100 total steps) was used for entity recognition, and the best out of 12 checkpoints (12000 total steps) for relation extraction. Both models were trained in 30 minutes each on a single NVIDIA Titan V.
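To make Equations 1 and 2 concrete, the following sketch shows how such a generic sequence labeller could sit on top of the LM token embeddings. It is a minimal illustration under stated assumptions, not the exact implementation: the hidden size (768, as in XLM-RoBERTa-base) and a 9-tag set are placeholders.

```python
import torch
import torch.nn as nn

class GenericSequenceLabeller(nn.Module):
    """Minimal sketch of Eq. 1-2: a linear layer over tags, trained with
    cross-entropy. Hidden size 768 (XLM-RoBERTa-base) and a 9-tag set
    are illustrative assumptions, not values from the paper."""

    def __init__(self, hidden_size=768, num_tags=9):
        super().__init__()
        self.ffn = nn.Linear(hidden_size, num_tags)  # W_e x_i + b_e

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden_size) from the LM
        logits = self.ffn(token_embeddings)
        return logits  # the softmax of Eq. 1 is folded into the loss

    def loss(self, logits, gold_tags):
        # Eq. 2: cross-entropy averaged over the N labelled tokens
        return nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), gold_tags.view(-1))

    def decode(self, logits):
        # Final prediction: argmax over the per-token tag distribution
        return logits.argmax(dim=-1)
```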
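The two-column input format described above would look roughly as follows; the Spanish sentence and the IOB tags are invented for illustration and need not match the challenge tag set:

```
El        O
asma      B-Concept
afecta    B-Action
los       O
pulmones  B-Concept

La        O
fiebre    B-Concept
```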
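The IOB alignment of Subtask A could be sketched as below, assuming entities arrive as (start, end, type) character offsets taken from the .ann files; this is an illustrative reconstruction, not the original code:

```python
def spans_to_iob(text, entities):
    """Align whitespace tokens with BRAT-style (start, end, type)
    character spans, producing one IOB label per token. Overlapping
    and discontinuous entities are not representable, as noted above."""
    tokens, labels, offset = [], [], 0
    for token in text.split():
        start = text.index(token, offset)  # character offset of token
        end = start + len(token)
        offset = end
        label = "O"
        for (e_start, e_end, e_type) in entities:
            if start >= e_start and end <= e_end:
                # B- at the entity start, I- inside it
                label = ("B-" if start == e_start else "I-") + e_type
                break
        tokens.append(token)
        labels.append(label)
    return list(zip(tokens, labels))
```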
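Similarly, the pairwise example generation of Subtask B could look like the sketch below. The [E1]/[E2] marker strings follow the style of [18] but their exact surface form in this system is an assumption:

```python
def mark_pair(tokens, head_span, tail_span):
    """Wrap the two candidate entities in marker tokens, producing one
    input example per entity pair. Spans are (start, end) token indices;
    the marker strings [E1]...[/E1] are an assumed convention."""
    out = []
    for i, tok in enumerate(tokens):
        if i == head_span[0]: out.append("[E1]")
        if i == tail_span[0]: out.append("[E2]")
        out.append(tok)
        if i == head_span[1]: out.append("[/E1]")
        if i == tail_span[1]: out.append("[/E2]")
    return out
```

For example, mark_pair(["El", "asma", "afecta", "los", "pulmones"], (1, 1), (4, 4)) yields El [E1] asma [/E1] afecta los [E2] pulmones [/E2].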
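Finally, the training setup maps naturally onto the Transformers Trainer API [19]. The following is a sketch under the reported settings (40 epochs, batch size 40, best checkpoint by dev F1); the label count, checkpointing strategy and metric function are assumptions:

```python
from transformers import (AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

def build_trainer(train_set, dev_set, compute_metrics, num_labels=9):
    """Approximate training setup: XLM-RoBERTa-base fine-tuned for 40
    epochs with batch size 40, keeping the checkpoint with the best dev
    F1. num_labels and compute_metrics (a CoNLL-2005 style F1 [20]) are
    caller-supplied assumptions."""
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-base", num_labels=num_labels)
    args = TrainingArguments(
        output_dir="ehealth-kd-sl",
        num_train_epochs=40,              # as reported above
        per_device_train_batch_size=40,   # batch size of 40 examples
        evaluation_strategy="epoch",      # evaluate on dev each epoch
        save_strategy="epoch",            # assumed checkpoint cadence
        load_best_model_at_end=True,      # keep the best checkpoint
        metric_for_best_model="f1",
    )
    return Trainer(model=model, args=args, train_dataset=train_set,
                   eval_dataset=dev_set, compute_metrics=compute_metrics)
```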
4 Results

Table 1 summarizes the results in the three scenarios over the official test set; the best result for each metric is marked with an asterisk. The system obtains competitive results in all scenarios and wins the third one (Relation Extraction) with an outstanding margin. Despite the good results, we observe low precision figures; we attribute this to the generic language model we used (XLM-RoBERTa), and we encountered similar issues in the previous edition [21].

                      Scenario 1              Scenario 2              Scenario 3
Model                 Prec.   Rec.    F1      Prec.   Rec.    F1      Prec.   Rec.    F1
Vicomtech             0.541   0.535   0.531*  0.700   0.747*  0.684   0.542   0.283   0.372
PUCRJ-PUCPR-UFMG      0.568*  0.503   0.528   0.715*  0.697   0.706*  0.367   0.205   0.263
uhKD4                 0.485   0.374   0.423   0.518   0.537   0.527   0.556*  0.222   0.318
Baseline              0.337   0.177   0.232   0.350   0.272   0.306   0.438   0.017   0.033
IXA                   0.465   0.539*  0.499   0.614   0.698   0.653   0.454   0.409*  0.430*

Table 1. Results of the eHealth-KD 2021 task. We summarize the top four systems: Vicomtech [22], PUCRJ-PUCPR-UFMG [23], uhKD4 [24] and ours (IXA).

5 Conclusions

Simple compositions of a fine-tuned FFN and an LM can accurately describe the target language; this is sufficient to perform competitively in prediction tasks cast as sequence labelling, regardless of domain and language, which is what we call generic sequence labelling. We conclude that the sequence labelling formulation extends to many tasks, such as seq2seq or classification, with competitive performance at low cost, as seen in several approaches [25, 21]. This time we expand the idea, reinforcing the need for simple new mathematical modelling techniques to handle large amounts of complex data, as we have seen in the RE task.

References

1. Alejandro Piad-Morffis, Yoan Gutiérrez, Suilan Estevez-Velarde, Yudivián Almeida-Cruz, Rafael Muñoz, and Andrés Montoyo. Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2021. Procesamiento del Lenguaje Natural, 67(0), 2021.
2. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
3. Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.
4. Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. LUKE: Deep contextualized entity representations with entity-aware self-attention. arXiv preprint arXiv:2010.01057, 2020.
5. Yifan Peng, Shankai Yan, and Zhiyong Lu. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474, 2019.
6. Sapan Shah, Sreedhar Reddy, and Pushpak Bhattacharyya. A retrofitting model for incorporating semantic relations into word embeddings. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1292–1298, 2020.
7. Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. A general framework for information extraction using dynamic span graphs. arXiv preprint arXiv:1904.03296, 2019.
8. Jue Wang and Wei Lu. Two are better than one: Joint entity and relation extraction with table-sequence encoders. arXiv preprint arXiv:2010.03851, 2020.
9. Yucheng Wang, Bowen Yu, Yueyang Zhang, Tingwen Liu, Hongsong Zhu, and Limin Sun. TPLinker: Single-stage joint extraction of entities and relations through token pair linking. arXiv preprint arXiv:2010.13415, 2020.
10. Angrosh Mandya, Danushka Bollegala, and Frans Coenen. Graph convolution over multiple dependency sub-graphs for relation extraction. In COLING, pages 6424–6435. International Committee on Computational Linguistics, 2020.
11. Di Zhao, Jian Wang, Yijia Zhang, Xin Wang, Hongfei Lin, and Zhihao Yang. Incorporating representation learning and multihead attention to improve biomedical cross-sentence n-ary relation extraction. BMC Bioinformatics, 21(1):1–17, 2020.
12. Zhiheng Li, Zhihao Yang, Yang Xiang, Ling Luo, Yuanyuan Sun, and Hongfei Lin. Exploiting sequence labeling framework to extract document-level relations from biomedical texts. BMC Bioinformatics, 21:1–14, 2020.
13. Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7999–8009, 2020.
14. Hao Huang, Guodong Long, Tao Shen, Jing Jiang, and Chengqi Zhang. RatE: Relation-adaptive translating embedding for knowledge graph completion. arXiv preprint arXiv:2010.04863, 2020.
15. Jouni Luoma and Sampo Pyysalo. Exploring cross-sentence contexts for named entity recognition with BERT. arXiv preprint arXiv:2006.01563, 2020.
16. Hai-Long Trieu, Thy Thy Tran, Khoa NA Duong, Anh Nguyen, Makoto Miwa, and Sophia Ananiadou. DeepEventMine: End-to-end neural nested event extraction from biomedical texts. Bioinformatics, 36(19):4910–4917, 2020.
17. Bin Ji, Jie Yu, Shasha Li, Jun Ma, Qingbo Wu, Yusong Tan, and Huijun Liu. Span-based joint entity and relation extraction with attention-based span-specific and contextual semantic representations. In Proceedings of the 28th International Conference on Computational Linguistics, pages 88–99, 2020.
18. Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy, July 2019. Association for Computational Linguistics.
19. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.
20. Xavier Carreras and Lluís Màrquez. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 152–164, 2005.
21. E. Andrés, O. Sainz, A. Atutxa, and O. Lopez de Lacalle. IXA-NER-RE at eHealth-KD Challenge 2020: Cross-lingual transfer learning for medical relation extraction. In Proceedings of the Iberian Languages Evaluation Forum co-located with the 36th Conference of the Spanish Society for Natural Language Processing, IberLEF@SEPLN, volume 2020, 2020.
22. Aitor García-Pablos, Naiara Pérez, and Montse Cuadros. Vicomtech at eHealth-KD Challenge 2021: Deep learning approaches to model health-related text in Spanish. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), 2021.
23. Lucas Pavanelli, Elisa Terumi Rubel Schneider, Yohan Bonescki Gumiel, Thiago Castro Ferreira, Lucas Ferro Antunes de Oliveira, João Vitor Andrioli de Souza, Giovanni Pazini Meneghel Paiva, Claudia Maria Silva e Oliveira, Lucas Emanuel Cabral Moro, Emerson Cabrera Paraiso, Eduardo Labera, and Adriana Pagano. PUCRJ-PUCPR-UFMG at eHealth-KD Challenge 2021: A multilingual BERT-based system for joint entity recognition and relation extraction. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), 2021.
24. Dayany Alfaro-González, Dalianys Pérez-Perera, Gilberto González-Rodríguez, and Antonio Jesús Otaño-Barrera. uhKD4 at eHealth-KD Challenge 2021: Deep learning approaches for knowledge discovery from Spanish biomedical documents. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), 2021.
25. Edgar Andrés Santamaría. End-to-end approach for the i2b2 2012 challenge based on cross-lingual models. 2020.