=Paper=
{{Paper
|id=Vol-2664/cantemist_paper10
|storemode=property
|title=Tumor Morphology Mentions Identification Using Deep Learning and Conditional Random Fields
|pdfUrl=https://ceur-ws.org/Vol-2664/cantemist_paper10.pdf
|volume=Vol-2664
|authors=Utpal Kumar Sikdar,Björn Gambäck,M Krishna Kumar
|dblpUrl=https://dblp.org/rec/conf/sepln/SikdarGK20
}}
==Tumor Morphology Mentions Identification Using Deep Learning and Conditional Random Fields==
Tumor Morphology Mentions Identification Using Deep Learning and Conditional Random Fields Utpal Kumar Sikdara , Björn Gambäckb and M Krishna Kumarc a IBS Software Pvt. Ltd., Trivandrum, Techopark main gate, India-695581 b Department of Computer Science, Norwegian University of Science and Technology, 7491 Trondheim, Norway c IBS Software Pvt. Ltd., Trivandrum, Techopark main gate, India-695581 Abstract The paper reports the application of several machine learning methods to the task of automatically find- ing tumor morphology mentions in Spanish clinical texts. Three setups based on Conditional Random Fields (CRF) techniques with different feature combinations were tested as well as a deep learning model (Bi-directional-LSTM-CNN). The best performance was achieved by combining two of the CRF-based learners and the neural network using a majority voting ensemble. Keywords named entity recognition, CRF, Bi-LSTM, CNN, GloVe 1. Introduction To understand diseases, we need to extract certain key entities such as symptoms, duration, patient age and weight, etc. from unstructured textual medical data. This task, clinical text mining, is important to enable better clinical decision-making. It is, for example, very helpful if we can extract key entities from a pandemic situation (such as COVID-19, SARS, and locations) and take appropriate actions based on the disease symptoms and their attributes. Natural Language Processing fills an important role in extracting such key entities from different types of textual sources in various languages. A myriad of medical texts are generated each day in various languages. Only in Spanish, almost a thousand electronic patient records are generated every minute. Hence automatically processing clinical texts in Spanish is a challenging task, but with a large potential for the medical user community as well as for the pharmaceutical industry and the patients. Similar to Named Entity Recognition, tumor mention identification is a sequence labelling task. Following results published by several researchers in 2016 [1, 2, 3], state-of-the-art work on such sequence labelling tasks has focused on deep learning setups using a neural network structure, in particular Long Short-Term Memory Recurrent Neural Networks [LSTM; 4], followed by Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) email: utpal.sikdar@gmail.com (U.K. Sikdar); gamback@ntnu.no (B. Gambäck); krishna.kumar@ibsplc.com (M.K. Kumar) url: https://www.linkedin.com/in/dr-utpal-kumar-sikdar-31a1779b/ (U.K. Sikdar); https://www.ntnu.edu/employees/gamback (B. Gambäck); https://www.linkedin.com/in/m-krishna-kumar-56383220/ (M.K. Kumar) orcid: 0000-0002-5252-707X (B. Gambäck) © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) a sequential Conditional Random Field [CRF; 5] layer. Hence Ma and Hovy [2] introduced a neutral network architecture that benefits from both word and character level representations automatically, by using a combination of bidirectional LSTM (bi-LSTM), CNN (Convolutional Neural Network) and CRF. They tested on the CoNLL 2003 English NER dataset [6], obtaining 97.55% accuracy for part-of-speech tagging and 91.21% F1 score for Named Entity Recognition. Chalapathy et al. [7] applied an LSTM-CRF classifier to the 2010 i2b2/VA Natural Language Processing Challenges for Clinical Records data [8], outperforming previous word on the data set. Habibi et al. [9] then compared the same setup to the best CRF-based results on 33 different evaluation sets in the biomedical domain, with the LSTM-CRF structure achieving F-scores on average 5% above the CRF baselines, mainly due to increased recall. However, most recent work has looked into other ways to obtain similar results. Straková et al. [10] compared a LSTM-CRF model to a sequence-to-sequence (seq2seq) model, where the input sequence are the tokens encoded by a bi-LSTM and the matching output label sequence is predicted be an LSTM decoder, showing the seq2seq model to outperform the state-of-the-art models for recognising nested named entities, while the baseline LSTM-CRF model still was competitive on simpler, flat NER structures. Overall focus in NLP in 2017 turned to pre-training using transformer-based neural models such as BERT, Bidirectional Encoder Representations from Transformers [11], and Baevski et al. [12] hence report top results (F1 = 92.8) on the CoNLL 2003 NER dataset with a bi-directional transformer model. The present work looks at various ways to compare and combine neural network learners (bi-LSTM and CNN) with conditional field classifiers, but rather than utilising the CRF directly as a layer in the deep learning setup, it is used in parallel to the network in an ensemble strategy, inspired by previous work on named entity recognition in social media data [13] and for under-resourced languages [14, 15]. Experiments were carried out on the Spanish CANTEMIST (CANcer TExt Mining Shared Task — tumor named entity recognition) data [16], which is introduced in the next section together with methods needed to further process the data in order to use it in the system setups described in Section 3. Experimental results are reported in Section 4 and analysed in Section 5. Finally, Section 6 sums up and points to ways the work could be extended and potentially improved. 2. Datasets and Preprocessing The CANTEMIST shared task organisers provided training, development-1, development-2, and test data [16]. The data had been annotated using the ‘brat’ format [17]. Statistics of the datasets are reported in Table 1. The training and development sets include data in a plain text file (.txt) together with a file containing the ‘brat’ annotation (.ann). The test data include the text file only. All tumor morphology mentions are annotated according to their corresponding character offsets in UTF-8 encoded plain text medical documents. The organisers provided 5,232 documents in the test data set, but out of those only 300 were actually utilised for evaluation purposes. Since CRFs cannot handle the ‘brat’ stand-off annotation format directly, Begin-Inside-Outside (BIO) tags had to be converted from the ‘brat’ notation by aligning their character offsets to the 413 Table 1 Shared Task training and development dataset statistics Data set Number of Documents Number of Mentions Training 501 6,272 Development-1 250 3,258 Development-2 250 2,607 Total 1,001 12,137 character offsets of the tokens: after tumor mentions identification, the BIO tags are converted to the ‘brat’ stand-off annotation format with the offsets of tumor mentions with respect to the plain text file provided by the shared task organisers. The NLTK [18] tool was used to tokenise the plain text. Finally, for evaluation purposes, the given gold label annotation is compared to the annotated ‘brat’ stand-off format given by the predicted BIO tagging assigned by the experimental models described below. 3. CANTEMIST NER Identification Within medical text processing, the task of named entity recognition is to identify medical entities from the unstructured clinical data. To identify CANTEMIST NERs, several methods were tested on the BIO tagging converted data, including Conditional Random Fields and combining a Bi-directional Long Short Term Memory network with a Convolutional Neural Network [19]. A majority voting ensemble approach was also applied to combine the outputs of the different methods. 3.1. Conditional Random Fields Conditional Random Field classifiers were chosen as baseline indicators since they have produced state-of-art results on sequence labelling tasks such as named entity recognition in different domains. Three systems were developed for identification of tumor morphology mentions from the unstructured text, using different feature sets. The classifiers were trained using the following three sets of features, with the first two sets being based on the focus word itself (textual features types resp. binary flag types) and the third feature set being based on information extracted from the word’s context. • Textual features – focus word (current word) – word-lower (lower case version of the focus word) – word-normalised (all upper case characters, lower case, digits and other characters are replaced by ‘A’, ‘a’, ‘0’ and ‘_’, respectively) – word stem – suffix n-grams (last one, two or three characters) – prefix n-grams (last one, two or three characters) • Binary features 414 Table 2 CRF model parameters and basic settings Parameter name Values algorithm lbfgs [21] c1 0.05 c2 0.01 max_iterations 60 min_freq 0 all_possible_transitions True – is-all-upper-case – starts-with-upper-case – is-upper-case-middle – is-any-digit – is-single-digit – is-double-digit – is-any-punctuation-character – is-any-under_score – is-any-special-character (based on a list of special characters extracted from the data provided by the organisers, e.g., ‘%’, ‘±’, ‘’, and ‘’) • Contextual features – local context (with a -m to +n window, i.e., from m preceding to n following tokens) – beginning-of-sentence – end-of-sentence For the experiments, crfsuite [20] was utilised as implemented in the sklearn package.1 The parameters and values given in Table 1 were used as the default model during the tumor mentions extraction process. 3.2. Bi-directional-LSTM-CNN A bi-directional Long Short-Term Memory network was also applied to identify tumor mentions, with a Convolutional Neural Network used to induce character-level inputs as features to the model. Word features along with the character-level information map each word in the input string to potential tumor mention scores for the different categories. The following features were used as model inputs for the tumor mention identification network: Word Embeddings A publicly available GloVe [22] word embedding for Spanish.2 For each word, a vector of size 300 was extracted from the pre-trained word embedding model. Character Embeddings A uniform distribution with range [0.5, 0.5] and size 52 was used as character embedding 1 https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html 2 https://github.com/dccuchile/spanish-word-embeddings 415 for the CNN layer. The character sets include all characters in the training and test data together with PADDING values. Word Label Features To guide the network, some binary features were extracted explicitly from the input word, namely starts-with-upper-case, is-all-upper-case, is-any-digit, is-all-digit, and ‘other’ (if the word does not match any of the previous features). These were fed to the model as word label features. Word Probability Features Using a Naïve Bayes approach, a probability was assigned to each word as to whether it belonged to either the tumor mentions category or the non-tumor mentions category. The probability 𝑃̂ (𝑤𝑖 , 𝑐) of a word 𝑤𝑖 belonging to a category 𝑐 (𝑐 is either the mentions category 𝑚 or the non-mentions 𝑛) was calculated based on its conditional probability 𝑃̂ (𝑤𝑖 |𝑐) given the category 𝑐 and the category’s prior probability, 𝑃̂ (𝑐): 𝑃̂ (𝑤𝑖 |𝑐) ∗ 𝑃̂ (𝑐) 𝑃̂ (𝑤𝑖 , 𝑐) = , 𝑐 ∈ {𝑚, 𝑛} (1) ∑𝑐 ∈{𝑚,𝑛} 𝑃̂ (𝑤𝑖 |𝑐𝑘 ) ∗ 𝑃̂ (𝑐𝑘 ) 𝑘 where the conditional probability was calculated using add-one (Laplace) smoothing as: freq(𝑤𝑖 , 𝑐) + 1 𝑃̂ (𝑤𝑖 |𝑐) = , 𝑐 ∈ {𝑚, 𝑛} (2) (∑𝑤∈𝐷 freq(𝑤, 𝑐)) + |𝐷| with |𝐷| being the size of the dictionary (all word unique unigrams in the data) and freq(𝑤𝑖 , 𝑐) the frequency of word 𝑤𝑖 in category 𝑐, while the prior probabilities for the categories were calculated based on the total number of mentions and non-mentions (again using Laplace smoothing): freq(𝑐) + 1 𝑃̂ (𝑐) = , 𝑐 ∈ {𝑚, 𝑛} (3) (∑𝑐𝑘 ∈{𝑚,𝑛} freq(𝑐𝑘 )) + 1 If the probability 𝑃̂ (𝑤𝑖 , 𝑐) was higher than 0.5, the word was considered to belong to that mentions category. 3.3. Majority Voting Ensemble Based on the feature combinations, three models were generated using Conditional Random Fields. Another model was developed using the Bi-directional-LSTM-CNN network introduced in Section 3.2. A fifth ensemble-based model was created by taking the majority vote of the outputs of two of the CRF models and the Bi-LSTM-CNN model. 4. Results Five test runs were submitted to the shared task using different models trained on the devel- opment set data. The different models used in the five runs were created as described in the previous section and characterised as follows: 416 Table 3 Development-1 data results Precision Recall F-score Run-1 0.768 0.728 0.748 Run-2 0.774 0.735 0.754 Run-3 0.768 0.727 0.747 Run-4 0.691 0.710 0.700 Run-5 0.779 0.735 0.757 Run-1: A Conditional Random Fields classifier along with the features mentioned in Section 3.1 using a context window size of two preceding to two following words. Run-2: A CRF classifier utilising fewer features than the one in Run-1, namely only: current word, word stem, prefix of two and three characters, suffix of two and three characters, starts-with-upper-case, is-upper-case-middle, is-any-digit, is-single-digit, and end-of- sentence, together with a context of two preceding and one following words. Run-3: A third CRF model, again with a context of two preceding and one following words, combined with probability of the current word belonging to a tumor mention as assigned by the Naïve Bayes classifier, as well as the probability given by the Naïve Bayes classifier of the current word belonging to the non-tumor mention category, and utilising the following additional features: current word, word-lower-case, word stem, prefix of one and two characters, suffix of two characters, starts-with-upper-case, is-all-upper-case, is-single-digit, is-double-digit, is-any-under_score, and end-of-sentence. Run-4: The bi-directional-LSTM-CNN model described in Section 3.2. Run-5: A combination of the outputs of Run-1, Run-2 and Run4 using majority voting. If the outputs of all three models differed, the mention category was chosen randomly among the outputs. The shared task organisers provided an evaluation script to measure system performance based on micro-averaged precision, recall and F1 -score. The five models described above were tested on the two data development sets, development-1 and development-2. As can be seen in Table 3, Run-5 performed best on the development-1 data, with micro-average precision, recall and F-score values of 77.9%, 73.5% and 75.7%, respectively. However, on the development-2 data set, Run-2 performed best of the models, with micro- averaged precision, recall and F1 values of 75.8%, 74.7% and 75.3%, respectively. The performance of all five models in terms of micro-averaged precision, recall and F-score on development-2 are reported in Table 4. All the models were applied to the blind test data which was provided by the shared task organisers. During testing, the shared task training and development data sets were merged and used as training set to build the models that were then applied to the test data. 417 Table 4 Development-2 data results Precision Recall F-score Run-1 0.753 0.742 0.747 Run-2 0.758 0.747 0.753 Run-3 0.743 0.739 0.741 Run-4 0.675 0.739 0.701 Run-5 0.757 0.746 0.752 Table 5 Blind test data results Precision Recall F-score Run-1 0.758 0.746 0.752 Run-2 0.746 0.745 0.746 Run-3 0.756 0.747 0.751 Run-4 0.697 0.751 0.723 Run-5 0.765 0.764 0.764 On the unseen test data, the Run-5 ensemble model outperformed all the other models, with the micro-averaged precision, recall and F-score values of 76.5%, 76.4% and 76.4%, respectively. The test results of all models are reported in Table 5. 5. Error Analysis and Discussion As can be seen in the tables in the previous section, the variations between the five models are small, in particular in terms of recall. However, the deep learner of Run-4 in general performed poorer in terms of precision than the CRF-based models. On the other hand, the deep learner actually showed slightly better recall than the CRF-based models on the unseen test data, indicating that it is better at generalising. Closer analysing the outputs on the two development data sets, Table 6 shows the confusion matrices for Run-1. It is clear that many of the tumor mentions were not identified by the system, with many mentions miss-classified into other categories. In particular the I(nside) and O(utside) mention tags often got confused, while the system in general was better at pin-pointing the B(egin) mention category. A common kind of error was found on multi-word tumor mentions due to incorrect boundary identification, such as finding only carcinoma ductal instead of the full multi-word mention carcinoma ductal de páncreas extendido al mesenterio. An inverse kind of boundary detection issue is early start of the predicted BIO tagging such as Infiltración del peritoneo parietal por adenocarcinoma being tagged instead of the actual tumor mention adenocarcinoma. Furthermore, in some cases two multi-word tumor mentions got grouped together (often along with a few non-mention words) and were tagged as one single multi-word mention by the prediction. 418 Table 6 Development data (Run-1) Confusion Matrices Actual Development-1 Development-2 Predicted B I O B I O B 2,813 98 252 2,328 78 217 I 74 2,942 760 77 2,560 635 O 371 1,282 208,309 202 892 168,441 Hence proper phrase identification must be in focus to avoid these kinds of incorrect entity boundaries. It is also possible that there is a need to add more features to the CRF-based models in order to extract more mentions and try to minimise category miss-classification. 6. Conclusion and Future Work Three Conditional Random Fields classifiers and a Deep Learning approach (a bi-LSTM-CNN combination) were trained and tested on the task of identifying tumor mentions in Spanish medical texts. The best performance was obtained with an ensemble model using majority voting to combine two of the CRF learners and the bi-LSTM-CNN model. Overall the differences between the models were small in terms of recall, while the deep learner struggled somewhat compared to the CRF-based models in terms of precision. On the unseen test data, however, the bi-LSTM-CNN network showed slightly better recall than the other individual models, although still being outperformed by the voting ensemble. To improve on the results, it would make potentially be good to incorporate other features such as part-of-speech tags and to utilise tools for noun phrase identification and chunking, at least in the CRF-based models. The deep learners could benefit from having access to word embeddings specifically pre-trained for the clinical domain. The machine learning models could also be improved by applying feature selection and hyper-parameter optimisation based on an evolutionary approach, such as Genetic Algorithms. Finally, more and other types of models could be generated using other classification algorithms, alternative neural network setups, and ensemble models with weighted voting approaches. 419 References [1] Z. Yang, R. Salakhutdinov, W. W. Cohen, Multi-task cross-lingual sequence tagging from scratch, CoRR abs/1603.06270 (2016). URL: http://arxiv.org/abs/1603.06270. [2] X. Ma, E. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1064–1074. URL: https://www.aclweb.org/anthology/P16-1101. doi:10.18653/ v1/P16-1101. [3] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, CA, USA, 2016, pp. 260–270. [4] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780. [5] J. Lafferty, A. McCallum, F. C. Pereira, Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data, in: Proceedings of the 18th International Conference on Machine Learning, Morgan Kaufmann, Williamstown, MA, USA, 2001, pp. 282–289. [6] E. F. Tjong Kim Sang, F. De Meulder, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Con- ference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147. URL: https://www.aclweb.org/anthology/W03-0419. [7] R. Chalapathy, E. Zare Borzeshi, M. Piccardi, Bidirectional LSTM-CRF for clinical concept extraction, in: Proceedings of the Clinical Natural Language Processing Workshop (Clinical- NLP), The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 7–12. URL: https://www.aclweb.org/anthology/W16-4202. [8] O. Uzuner, B. R. South, S. Shen, S. L. DuVall, 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, Journal of the American Medical Informat- ics Association 18 (2011) 552–556. URL: https://doi.org/10.1136/amiajnl-2011-000203. doi:10.1136/amiajnl-2011-000203. [9] M. Habibi, L. Weber, M. Neves, D. L. Wiegandt, U. Leser, Deep learning with word embeddings improves biomedical named entity recognition, Bioinformatics 33 (2017) i37– i48. URL: https://doi.org/10.1093/bioinformatics/btx228. doi:10.1093/bioinformatics/ btx228. [10] J. Straková, M. Straka, J. Hajič, Neural architectures for nested NER through lineariza- tion, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 5326–5331. [11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv. org/abs/1810.04805. [12] A. Baevski, S. Edunov, Y. Liu, L. Zettlemoyer, M. Auli, Cloze-driven pretraining of self- attention networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural 420 Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5363–5372. [13] U. K. Sikdar, B. Gambäck, Named entity recognition for Amharic using stack-based deep learning, in: International Conference on Computational Linguistics and Intelligent Text Processing, Springer, Budapest, Hungary, 2017, pp. 276–287. [14] B. Gambäck, U. K. Sikdar, Named entity recognition for Amharic using deep learning, in: 2017 IST-Africa Week Conference (IST-Africa), IEEE, Windhoek, Namibia, 2017, pp. 1–8. [15] U. K. Sikdar, B. Gambäck, A feature-based ensemble approach to recognition of emerging and rare named entities, in: Proceedings of the 3rd Workshop on Noisy User-generated Text, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 177–181. URL: https://www.aclweb.org/anthology/W17-4424. doi:10.18653/v1/W17-4424. [16] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normal- ization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Lan- guages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, Spanish Society for Natural Language Processing, Málaga, Spain, 2020. [17] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based tool for NLP-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Avignon, France, 2012, pp. 102–107. [18] S. Bird, E. Loper, NLTK: The natural language toolkit, in: Proceedings of the ACL Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 214–217. [19] J. P. C. Chiu, E. Nichols, Named entity recognition with bidirectional LSTM-CNNs, CoRR abs/1511.08308 (2015). URL: https://arxiv.org/abs/1511.08308. [20] N. Okazaki, CRFsuite: a fast implementation of conditional random fields (CRFs), 2007. URL: http://www.chokkan.org/software/crfsuite. [21] D. C. Liu, J. Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming 45 (1989) 503–528. [22] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. 421