=Paper=
{{Paper
|id=Vol-1976/paper08
|storemode=property
|title=Clinical Named Entity Recognition - ECUST in the CCKS-2017 Shared Task 2
|pdfUrl=https://ceur-ws.org/Vol-1976/paper08.pdf
|volume=Vol-1976
|authors=Yuhang Xia,Qi Wang
}}
==Clinical Named Entity Recognition - ECUST in the CCKS-2017 Shared Task 2==
Clinical Named Entity Recognition: ECUST in the CCKS-2017 Shared Task 2 Yuhang Xia and Qi Wang( ) East China University of Science and Technology, Shanghai, China 200237 153996626@qq.com, dsx4602@163.com Abstract. Clinical named entity recognition aims to identify and clas- sify clinical terms in electronic medical records, including diseases, symp- toms, treatments, exams, and body parts. Challenges occur due to ambi- guity in the boundary of Chinese words and the number limitation of an- notated training data. In this paper, we propose a Bi-LSTM CRF model along with self-taught learning, active learning, and ensemble learning to recognize clinical named entities. The results achieved on CCKS-2017 Task 2 dataset with a F1 -Measure of 89.88% ranks among the top sys- tems. Keywords: clinical named entity recognition, Bi-LSTM CRF, self-taught learning, active learning, ensemble learning 1 Introduction Electronic medical record systems have been widely used in China. Many tasks in clinical text mining rely on accurate clinical named entity recognition (NER), the identification of text spans mentioning a concept of a specific class, including disease, symptom, exam, treatment, and body part. Challenges occur due to ambiguity in the boundary of Chinese words and number limitation of annotated training data. Traditionally, most of the effective NER approaches are based on machine learning techniques, such as Support Vector Machines (SVM) [1], Hidden Markov Models (HMM) [2], Conditional Random Fields (CRF) [3], Convolutional Neural Network (CNN) based models [4], and Recurrent Neural Network (RNN) based models [5]. For biomedical NER tasks, existing efforts include rule or dictionary based methods [6], supervised methods [7], and distant supervision methods [8]. In this paper, we regard the clinical NER as a sequence labeling problem, and use the Bi-LSTM CRF model which is similar to the one presented by Huang et al. [5] to address the problem. Different from Huang et al., we exploit character embedding rather than word embedding to deal with the ambiguity in the boundary of Chinese words. In addition, self-taught learning and active learning is introduced to enlarge the training set. Finally, ensemble learning is used to obtain the best recognition performance for all five types of clinical named entities. The results achieved on CCKS-2017 Task 2 dataset with a F1 - Measure of 89.88% ranks among the top systems. 2 Problem Formalism The clinical named entity recognition task is defined as a sequence labeling problem in this paper. Given a text sequence X =< x1 , ..., xn >, the goal is to label X with tag sequence Y =< y1 , ..., yn >. We experiment with three differ- ent tagging formats for the recognition, including BIO (Begin, Inside, Outside), BIOS (Begin, Inside, Outside, Single), and BIEOS (Begin, Inside, End, Outside, Single). Examples of the three tagging formats can be found in Table 1. Table 1. The tag sequences for “腹平坦,未见腹壁静脉曲张。” with three different tagging formats. The B-tag indicates the beginning of an entity. The I-tag indicates the inside of an entity. The E-tag indicates the end of an entity. The O-tag indicates the character is outside an entity. The S-tag indicates the single character is an entity. As for entity types, the b-tag indicates the entity is a body part, and the s-tag indicates the entity is a symptom. 腹 平 坦 , 未 见 腹 壁 静 脉 曲 张 。 BIO B-b O O O O O B-b I-b B-s I-s I-s I-s O BIOS S-b O O O O O B-b I-b B-s I-s I-s I-s O BIEOS S-b O O O O O B-b E-b B-s I-s I-s E-s O 3 Methods In this section, we will describe the methods we employ, including Bi-LSTM CRF, self-taught learning, active learning, and ensemble learning. 3.1 Bi-LSTM CRF This model is similar to the one presented by Huang et. al. [5]. It combines the framework of bidirectional LSTM layer [9] with linear chain CRF [10]. Different from Huang et. al., we employ character embedding rather than word embedding to deal with the ambiguity in the boundary of Chinese words. The raw natural language input sentence is processed into sequence of char- acters X = [x]T1 . The character sequence is fed into an embedding layer, which produces dense vector representation of characters. The character vectors are then fed into a bidirectional LSTM layer. The LSTM [11] incorporates a gated memory-cell to capture long-range dependencies within the data. In the bidirec- → − tional LSTM, for any given sequence, the network computes both a left, ht , and ← − a right, ht , representations of the sequence context at every input, xt . The final → − ← − representation is created by concatenating them as ht = [ ht , ht ]. The bidirec- tional LSTM along with the embedding layer is the main machinery responsible for learning a good feature representation of the data. Then the network use sentence level tag information via a CRF layer fed by a fully connected hidden layer. The CRF layer is represented by lines which connect consecutive output layers, and has a state transition matrix as parameters. With such a layer, we can efficiently use past and future tags to predict the current tag, which is similar to the use of past and future input features via a bidirectional LSTM network. We consider the matrix of scores fθ ([x]T1 ) are output by the network. The element [fθ ]i,t of the matrix is the score output by the network with parameters θ, for the sentence [x]T1 and for the i-th tag, at the t-th character. We introduce a transition score [A]i,j to model the transition from i-th state to j-th for a pair of consecutive time steps. Note that this transition matrix is position independent. We now denote the new parameters for our network as θ̃ = θ ∪ {[A]i,j ∀i, j}. The score of a sentence [x]T1 along with a path of tags [i]T1 is then given by the sum of transition scores and network scores: T X S([x]T1 , [i]T1 , θ̃) = ([A][i]t−1 ,[i]t + [fθ ][i]t ,t ) (1) t=1 The dynamic programming [12] can be used efficiently to compute [A]i,j and optimal tag sequences for inference. See [10] for details. 3.2 Add Training Data by Self-taught Learning Generally, the more training data we have, the better performance we can get. Considering that we have some unlabeled sentences, a self-taught learning algorithm is introduced to enlarge the training set. First, we train a Bi-LSTM CRF model using the original training set. Second, we apply the trained model to annotate the unlabeled sentences. Third, we choose some high-quality anno- tation results and add them to the original training set. Specifically, we define an annotating confidence (AC) to evaluate the quality of the annotation results: T T eS([x]1 ,[y]1 ,θ̃) AC([x]T1 , [y]T1 , θ̃) = P (2) S([x]T T 1 ,[j]1 ,θ̃) je where [y]T1 is the tag sequence predicted by the trained model and [j]T1 is the set of all possible output sequences. We choose those annotation results whose AC is greater than a threshold. In addition, considering that disease and treatment entities are much fewer than symptom and exam entities, and the body part entity recognition is not well performed, we only choose the annotation results which must include disease, treatment, or body part entities. 3.3 Add Training Data by Active Learning We also use active learning algorithm to improve the recognition perfor- mance. First, we use the original training data along with the self-taught high- quality annotation results to train a Bi-LSTM CRF model. Second, we apply the trained model to annotate the rest unlabeled sentences. Third, we manually re-label a few low-quality annotation results whose AC is less than a threshold, and then add them to the training set. Similar to the self-taught algorithm, we only choose the annotation results which must include disease, treatment, or body part entities to re-label manually. 3.4 Improve Recognition Performance through Ensemble Learning The task has five types of clinical named entities to be recognized. If we train models for each type of entities respectively, a class imbalance problem (too many O-tags in sequences) will exist, and the models cannot utilize the information of other types of entities. Thus, we train models to annotate all five types of entities at the same time, but finally choose five models in which each model has the best performance for the corresponding type on validation data, so that we can obtain the best recognition performance for all five types of entities. For example, for disease entities, we choose the model which has the best performance in recognizing diseases, and only use its disease recognition results as part of the final recognition results. 4 Experiments 4.1 Dataset and Evaluation Metrics We use the CCKS-2017 Task 2 dataset to perform our experiments. The dataset contains 10,420 unannotated instances and 1,596 annotated instances with five types of clinical named entities, including diseases, symptoms, exams, treatments, and body parts. The annotated instances are already partitioned into 1,198 training instances and 398 test instances. Each instance has one or several sentences. We further partition the training sentences, take 70% of them as training data, and the rest 30% as validation data. We score our methods by using the CCKS-2017 Task 2 official metrics, which computes F1 -Measure. 4.2 Hyper-parameters and Training Details As for the Bi-LSTM CRF model, we initialize character embeddings via word2vec[13] on both the annotated data and the unannotated data. Each char- acter embedding is 100-dimensional. We compare the results with word em- bedding segmented by Jieba Chinese segmentation module, which is also 100- dimensional. We set the size of LSTM hidden layer to 64, and apply dropout [14] to the output of the Bi-LSTM layer. The dropout rate is 0.2. The Bi-LSTM CRF model is trained by AdaDelta [15] and the batch size is 128. For self-taught learn- ing, we add the top 3,850 automatically annotated sentences to the training set. For active learning, we re-label the worst 125 automatically annotated sentences manually. We split training data into sentences by periods, and set the maximum length of a sentence to 185. If a sentence is longer than 185 characters, it will be further split into clauses. Then we take all the labeled sentences and clauses as a whole data set after de-duplication. We also try to split all the training data into clauses by periods, commas and semicolons to train different models. 4.3 Results and Discussion To explore the impact of different settings, we train Bi-LSTM CRF models us- ing different tagging formats on training data both in clauses and sentences, and test them on validation data. The results are shown in Table 2. First, character- level models outperform word-level models. It is because word-level approaches may have segmentation error. What’s more, the word set is much bigger than the character set. This means the corpus is not big enough to learn word em- beddings effectively. Second, the recognition of diseases, treatments, and body parts is not well performed, which indicates the necessity of self-taught learning and active learning. Third, different types of entities are recognized the best by different settings, respectively. This indicates the necessity of ensemble learning. Table 2. Performance (F1 -Measure) of Bi-LSTM CRF models with different settings tested on validation data. Note that we report the best results for each type, which means different lines are reported by different models. tag in character level tag in word level train in clauses train in sentences train in clauses train in sentences BIO BIOS BIEOS BIO BIOS BIEOS BIO BIOS BIEOS BIO BIOS BIEOS disease 77.54 75.00 77.21 74.15 75.20 77.78 64.08 64.74 64.08 63.44 62.96 63.84 symptom 96.24 96.21 96.37 96.26 96.17 96.27 94.02 94.64 94.80 93.27 93.99 93.75 exam 95.21 95.57 95.54 94.92 94.80 95.09 84.32 84.68 84.93 83.67 83.59 83.91 treatment 79.65 80.71 82.37 78.94 77.97 79.59 74.89 76.37 77.10 74.78 76.40 75.19 body part 83.85 83.94 83.19 82.67 83.39 83.81 72.59 73.41 73.56 71.87 72.57 73.07 overall 88.62 90.29 90.28 88.33 89.62 90.33 78.57 82.10 82.47 77.97 81.37 81.52 To dissect the effectiveness of self-taught learning, active learning, and en- semble learning, we train Bi-LSTM CRF models in character level only, and test them on test data. Table 3 shows that the Bi-LSTM CRF model along with self-taught learning, active learning, and ensemble learning achieves the best performance in clinical NER, with a F1 -Measure of 89.88%. Table 3. Performance (F1 -Measure) of Bi-LSTM CRF models with different settings tested on test data. train in clauses train in sentences BIO BIOS BIEOS BIO BIOS BIEOS Bi-LSTM + CRF 87.81 88.36 88.47 87.09 87.74 87.87 + self-taught learning (only) 88.44 88.49 88.67 87.75 87.60 88.12 + active learning (only) 88.17 88.57 88.79 87.63 87.90 88.27 + self-taught and active learning 88.72 88.63 88.83 87.78 88.35 88.53 + ensemble learning 89.88 5 Conclusion In this paper, we propose a Bi-LSTM CRF model along with self-taught learning, active learning, and ensemble learning to recognize clinical named enti- ties. We exploit character embedding to deal with the ambiguity in the boundary of Chinese words, and employ self-taught learning and active learning to increase training data. After comparing different tagging schemes, we use ensemble learn- ing to obtain the best recognition performance for all five types of entities. The results achieved on CCKS-2017 Task 2 dataset ranks among the top systems. Acknowledgements. This work is supported by the 863 Program funded by China Ministry of Science and Technology (Program No.2015AA020107). References 1. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, Association for Computational Linguistics (2003) 8–15 2. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble: a high-performance learning name-finder. In: Proceedings of the fifth conference on Applied natural language processing, Association for Computational Linguistics (1997) 194–201 3. Mccallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Conference on Natural Language Learning at Hlt-Naacl. (2003) 188–191 4. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(1) (2011) 2493–2537 5. Huang, Z., Xu, W., Yu, K.: Bidirectional lstm-crf models for sequence tagging. Computer Science (2015) 6. Kipper-Schuler, K., Kaggal, V., Masanz, J., Ogren, P., Savova, G.: System evalu- ation on a named entity corpus from clinical notes. In: International Conference on Language Resources and Evaluation 2008. (2008) 3007–3011 7. Wang, Y., Yu, Z., Chen, L., Chen, Y., Liu, Y., Hu, X., Jiang, Y.: Supervised methods for symptom name recognition in free-text clinical records of traditional chinese medicine: An empirical study. Journal of biomedical informatics 47 (2014) 91–104 8. Bing, L., Ling, M., Wang, R.C., Cohen, W.W.: Distant ie by bootstrapping using lists and document structure. arXiv preprint arXiv:1601.00620 (2016) 9. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks the Official Journal of the International Neural Network Society 18(5) (2005) 602–610 10. Lafferty, John, D., McCallum, Andrew, Pereira, Fernando, C.N.: Conditional Ran- dom Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. (2001) 11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8) (1997) 1735–1780 12. Rabiner, L.R.: A tutorial on hidden markov models and selected applications in speech recognition. Readings in Speech Recognition 77(2) (1990) 267–296 13. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre- sentations in vector space. Computer Science (2013) 14. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. Com- puter Science 3(4) (2012) págs. 212–223 15. Zeiler, M.D.: Adadelta: An adaptive learning rate method. Computer Science (2012)