=Paper=
{{Paper
|id=Vol-2943/meddoprof_paper7
|storemode=property
|title=Spanish Pre-Trained Language Models for HealthCare Industry
|pdfUrl=https://ceur-ws.org/Vol-2943/meddoprof_paper7.pdf
|volume=Vol-2943
|authors=Jalaj Harkawat,Tejas Vaidhya
|dblpUrl=https://dblp.org/rec/conf/sepln/HarkawatV21
}}
==Spanish Pre-Trained Language Models for HealthCare Industry==
Jalaj Harkawat1,2 and Tejas Vaidhya1,2
1 Indian Institute of Technology, Kharagpur, INDIA
2 Equal contribution by both authors

Abstract. Transformer-based models currently show high accuracy and good predictions on downstream tasks such as Named Entity Recognition and sentiment analysis. However, the terminology used in the healthcare sector, such as the names of different diseases, medicines and departments, makes accurate prediction difficult. In this paper we present a system for Named Entity tagging based on BETO (Spanish BERT). Experimental results show that our model gives better results than the current baseline of the MEDDOPROF Shared Task.

Keywords: BERT · NER · Healthcare Industry · Transformers · BART · BETO

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Natural Language Processing (NLP) is a rapidly expanding field with several applications, and we use it to gain more insight into our existing dataset. We all know how important our occupations and employment status are to our identities. Occupations have a significant influence on one's physical and mental health, as well as on habits and lifestyle choices. An entire medical specialty, occupational medicine, exists to prevent and control the negative health impacts of our occupations (workplace accidents, short- and long-term effects of exposure to toxic substances and pathogens, and work-related mental health issues such as overburden and stress). The COVID-19 epidemic has highlighted this impact, since many people in certain vocations have been disproportionately affected (for instance, health professionals and other essential workers).

"Tools that automatically detect these sociodemographic factors can help researchers to better characterize multiple health aspects related to specific occupations. However, up until now these entities have mostly been ignored. The MEDDOPROF Shared Task [12] takes a more comprehensive look at occupations, also considering employment statuses and non-paid activities." [4]

1.1 Background

We generate a lot of data as a result of continual technological development and a fast-paced environment, and with advancements in technology, particularly the deep learning techniques used in Natural Language Processing (NLP), there has been substantial improvement in Named Entity Recognition. Long Short-Term Memory (LSTM) [9] and Conditional Random Field (CRF) [11] models, for example, have significantly improved performance in biomedical Named Entity Recognition (NER) [17] in recent years.

In this paper we introduce our system for Named Entity Recognition tagging on the MEDDOPROF dataset. We use BETO [6], a BERT-based [7] model trained on a big Spanish corpus. Our code and fine-tuned model are available at: https://github.com/jharkawat/meddoprof shared task

2 Task Description and Dataset

MEDDOPROF (Medical Documents Profession Recognition) is a shared task organized within the IberLEF 2021 workshop that focuses on developing automatic occupation detection systems for Spanish medical texts. It has three sub-tracks; Shared Task 1, named MEDDOPROF-NER, is a Named Entity Recognition task that requires automatically finding mentions of occupations and classifying each of them as a profession, an employment status or an activity.

The task can be described as token-level classification. A sentence with n words is defined as

A = {a_1, a_2, a_3, ..., a_n}    (1)

and each token is classified into a label set with m labels:

y = {l_1, l_2, l_3, ..., l_m}    (2)

Given a named entity of type YYY, the first word of the entity is tagged B-YYY to mark that it begins a new entity (which also separates it from an immediately preceding entity of the same type), and the remaining words inside the entity are tagged I-YYY. For example, the sentence "equipo de psiquiatría" has the labels {B-PROFESION, I-PROFESION, I-PROFESION}.
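To make the labelling scheme concrete, the following minimal Python sketch converts entity spans into BIO labels for the example above. The spans_to_bio helper and the hand-written span are ours for illustration only; real labels come from the MEDDOPROF gold annotations.

```python
# Minimal sketch of the BIO labelling scheme described above.
# The entity span below is hand-written for illustration; actual labels
# are taken from the MEDDOPROF gold annotations, not from this toy lookup.

def spans_to_bio(tokens, entity_spans):
    """Convert (start, end, type) token-level spans into BIO labels (end exclusive)."""
    labels = ["O"] * len(tokens)
    for start, end, etype in entity_spans:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels

tokens = ["equipo", "de", "psiquiatría"]
# One PROFESION entity covering all three tokens (positions 0..2).
print(spans_to_bio(tokens, [(0, 3, "PROFESION")]))
# -> ['B-PROFESION', 'I-PROFESION', 'I-PROFESION']
```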
2.1 Dataset

The MEDDOPROF corpus [8] is a collection of 1,844 clinical cases from over 20 different specialities, annotated with professions and employment statuses. The corpus was annotated by a team of linguists and clinical specialists who followed specifically developed annotation guidelines, after many rounds of quality control and annotation-consistency analysis, before annotating the whole dataset. Each clinical case is stored as a separate file in the corpus, which is delivered in plain text with UTF-8 encoding. Refer to Figure 1 for an example of the corpus' annotation [15].

Fig. 1. BRAT annotation with profession and employment status labels. [3]

3 Approach

In Section 3.1 we describe BETO, the model used in our final submission; in Section 3.2 we describe our problem-solving strategy and the additional model we used during our experiments.

Baseline. We compare our system to the baseline provided by the organizers [1], which is a simple lookup system that uses the training set as a reference and then checks whether the extracted annotations are present in a fresh batch of text documents.

3.1 BETO

BETO is a BERT model trained on a big Spanish corpus [5] (ParaCrawl, EUBookshop [16], MultiUN [16], OpenSubtitles, DGC [16], DOGC [16], ECB, EMEA, Europarl, GlobalVoices [16], JRC, News-Commentary11 [16], TED, UN). BETO is around the same size (24-layer, 1024-hidden, 16-heads, 340M parameters) as BERT-Base and was trained using Whole Word Masking and Next Sentence Prediction. In most downstream tasks in Spanish it surpassed multilingual BERT. We believe such language-specific bidirectional representations are also important for our purpose.

3.2 Architecture

We first sub-word tokenize each token of a sentence using BETO's [6] WordPiece tokenizer from the Huggingface [18] library and pass it through BETO's BERT Transformer stack (trained on a big Spanish corpus) to extract contextualised, domain-specific representations. Then, for each word, we take the representation of its first sub-word token and fine-tune by training an additional feed-forward layer, log(softmax(CW)), that assigns a softmax probability distribution to each label. The loss function used is:

loss(x, class) = -log( exp(x[class]) / Σ_j exp(x[j]) ) = -x[class] + log Σ_j exp(x[j])    (3)
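The sketch below illustrates the fine-tuning head described in Section 3.2, using PyTorch and the Huggingface transformers library. It is not our exact training script: the checkpoint identifier, the reduced label set and the toy sentence are assumptions made for the example, and the full training loop and hyperparameters are omitted.

```python
# Illustrative sketch of the fine-tuning head from Section 3.2 (not the exact
# training script). The public BETO checkpoint name below is an assumption;
# any BERT-style Spanish checkpoint would work the same way.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "dccuchile/bert-base-spanish-wwm-cased"   # assumed BETO checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

labels = ["O", "B-PROFESION", "I-PROFESION"]      # reduced label set for the sketch
head = torch.nn.Linear(encoder.config.hidden_size, len(labels))

words = ["equipo", "de", "psiquiatría"]
gold = torch.tensor([1, 2, 2])                    # B-PROFESION, I-PROFESION, I-PROFESION

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt",
                truncation=True, max_length=512)
hidden = encoder(**enc).last_hidden_state[0]      # (num_subwords, hidden_size)

# Keep only the representation of the first sub-word of each word.
first_subword, prev = [], None
for i, wid in enumerate(enc.word_ids()):
    if wid is not None and wid != prev:
        first_subword.append(i)
    prev = wid
word_repr = hidden[first_subword]                 # (num_words, hidden_size)

# Feed-forward layer + log-softmax, trained with the loss in Eq. (3)
# (negative log-likelihood of the gold label, i.e. cross-entropy).
log_probs = torch.log_softmax(head(word_repr), dim=-1)
loss = torch.nn.functional.nll_loss(log_probs, gold)
loss.backward()                                   # gradients for head and encoder
```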
Additionally, we also tried Multilingual BERT (cased), a model pretrained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective. This model is case sensitive: it makes a difference between "english" and "English".

4 Settings and Results

4.1 Experimental Settings

We keep the maximum input sentence length at 512 to accommodate long sentences. Large models (24-layer, 1024-hidden, 16-heads, 340M parameters) are trained for 4 epochs with batch size 16. We early stop the models using the validation set. The dropout probability was set to 0.1 for all layers. Optimization is done using Adam [10] with a learning rate of 5e-5. The remaining hyperparameters were kept the same as for BERT. We used the PyTorch [13] implementation of BERT from Huggingface's transformers library. An overview of these parameters is given in Table 1.

Table 1. Values of different parameters during the experiment

  Parameter Name     Value
  max length         512
  learning rate      5e-5
  weight decay       0.01
  clip grad          5
  batch size         1
  epoch number       20
  min epoch number   5
  patience           0.02
  patience number    10

For selecting the best models in the experimental phase (i.e. before the release of the test set) we used a 60/20/20 split for train, dev and test respectively. For our final submission, we used a 70/30 train/validation split of the initial data and a pre-trained BETO model. We also split sentences with more than 512 tokens into two or more sentences to obtain the model's desired input length. To evaluate the performance of the system, an evaluation script [2] was provided by the organizers along with the dataset.
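As one possible way to enforce the 512-token limit mentioned above, the sketch below greedily splits a pre-tokenized sentence into chunks whose sub-word length stays within the model limit. The checkpoint name and the greedy heuristic are illustrative assumptions, not necessarily the exact splitting rule used for our submission.

```python
# Greedy splitting of over-long sentences into chunks that fit the 512-token
# limit of BERT-style models. Illustrative heuristic only; the checkpoint
# identifier is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

def split_long_sentence(words, max_subwords=510):   # leave room for [CLS]/[SEP]
    chunks, current, length = [], [], 0
    for word in words:
        n = len(tokenizer.tokenize(word))
        if current and length + n > max_subwords:
            chunks.append(current)
            current, length = [], 0
        current.append(word)
        length += n
    if current:
        chunks.append(current)
    return chunks

# Each chunk (with its aligned labels) is then fed to the model independently.
```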
4.2 Results

The BETO model we propose is a competitive solution that performs better than the task's baseline. Table 2 shows the system's performance on the test set, while Table 3 contains our ablation study; its results are based on partial matches, unlike the official results, which use exact matches. We also used multilingual BERT for performance comparison. BETO performed better than the multilingual model because it is trained on a large domain-specific (Spanish) corpus, whereas multilingual BERT is trained on relatively less data spread across multiple languages.

Table 2. Results on the test set

  Models     F1-score   Recall   Precision
  BETO       0.567      0.5      0.654
  Baseline   0.486      0.508    0.465

Table 3. Ablation study (partial match on the provided data)

  Models                                            F1-score
  BETO (mrm8488/bert-spanish-cased-finetuned-ner)   0.753
  Multilingual (bert-base-multilingual-cased)       0.6342

5 Error Analysis

Domain-specific pretrained transformer models have shown remarkable improvement on the majority of downstream NLP tasks, but there are instances where BERT failed drastically. In this section we try to identify some of the causes of failure in our system (BETO).

1. On our dataset, the BERT tokenizer is inefficient. Its vocabulary does not include terminology from the healthcare industry, and it has not been trained in a language-specific setting. As a result, learning encodings based on improperly sub-tokenized words is difficult for the BERT model. A feasible approach would be to train tokenizers on both biomedical and general text.

2. The dataset contains a large number of phrases that lack Named Entity Recognition tags, resulting in a large number of negative entries and a poor F1-score. A possible solution is to increase the dataset size and eliminate the sentences with no or few NER tags.

3. Because of the small dataset size, our Transformer-based model does not give very good results. A possible solution is to add more positive data points and break longer sentences into smaller ones. This would also help keep the token length below 512, as required by BERT-based models.

6 Conclusion and Future Work

In this paper, we have presented a system based on BETO for the first sub-track of the MEDDOPROF Shared Task, held as part of the IberLEF 2021 workshop. We built our models keeping in mind the success of pre-trained models, which generate bidirectional contextualized representations of each token that can be further utilised for task-specific fine-tuning. As future work, we would like to extend the current work by performing a layer-by-layer analysis of BERT, experimenting with other architectures such as XLNet [19], and making the system more cost- and memory-efficient using adapters [14].

7 Acknowledgement

We would like to thank the organisers of the MEDDOPROF shared task for providing us this opportunity to present our work.

References

1. Baseline code, https://github.com/TeMU-BSC/meddoprof-baseline
2. Evaluation script, https://github.com/TeMU-BSC/meddoprof-evaluation-library
3. Example, https://temu.bsc.es/meddoprof/data/
4. Home page, https://temu.bsc.es/meddoprof/
5. Cañete, J.: Compilation of large Spanish unannotated corpora (May 2019). https://doi.org/10.5281/zenodo.3247731
6. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: PML4DC at ICLR 2020 (2020)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2019)
8. Farré-Maduell, E., Lima-López, S., Miranda-Escalada, A., Briva-Iglesias, V., Krallinger, M.: MEDDOPROF corpus: test set (Jun 2021). https://doi.org/10.5281/zenodo.4889777
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017)
11. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)
12. Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V., Krallinger, M.: NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural 67 (2021)
13. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
14. Rücklé, A., Geigle, G., Glockner, M., Beck, T., Pfeiffer, J., Reimers, N., et al.: AdapterDrop: On the efficiency of adapters in transformers (2020)
15. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107. Association for Computational Linguistics, Avignon, France (Apr 2012), https://aclanthology.org/E12-2021
16. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Calzolari, N. (Conference Chair), Choukri, K., Declerck, T., Dogan, M.U., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S. (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey (May 2012)
17. Vaidhya, T., Kaushal, A.: IITKGP at W-NUT 2020 shared task-1: Domain specific BERT representation for named entity recognition of lab protocol. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 268–272. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.wnut-1.34
18. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al.: Huggingface's transformers: State-of-the-art natural language processing (2020)
19. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding (2020)