DUTIR at the CCKS-2018 Task 1: A Neural Network Ensemble Approach for Chinese Clinical Named Entity Recognition

Ling Luo, Nan Li, Shuaichi Li, Zhihao Yang* and Hongfei Lin
College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
*Corresponding author: yangzh@dlut.edu.cn

Abstract. As a fundamental task in information extraction, named entity recognition (NER) has received constant research attention in recent years. The 2018 China Conference on Knowledge Graph and Semantic Computing (CCKS) set up a task for clinical named entity recognition (CNER). For this task, we present a neural network ensemble approach that combines five individual neural network models (i.e., CNN-CRF, BiLSTM-CRF, BiLSTM-CNN-CRF, BiLSTM+CNN-CRF and Lattice LSTM). In this approach, various additional features (i.e., stroke, word segmentation and dictionary features) are adopted. On the official test set, our best submission achieves F-scores of 88.63% and 95.19% under the "strict" and "relaxed" criteria, respectively.

Keywords: Entity Recognition, Chinese Clinical Text, Neural Network, Ensemble.

1 Introduction

In recent years, medical information processing has become a popular research focus, driven by the generation of large amounts of electronic medical records and the growing demand for medical information services and medical decision support. As a fundamental task in medical information extraction, clinical named entity recognition (CNER) has received much attention and has been organized as a shared task in many challenges [1-3]. To promote the performance of CNER on Chinese clinical text, the 2018 China Conference on Knowledge Graph and Semantic Computing (CCKS-2018) organized a CNER task to identify and extract the related clinical entities (i.e., anatomy, symptom, independent symptom, drug and operation) from Chinese clinical text.

In previous work, state-of-the-art CRF-based NER methods depended on effective feature engineering, i.e., the design of effective features using various NLP tools and knowledge resources, which remains a labor-intensive and skill-dependent task. Recently, deep learning has become prevalent in the machine learning research community. For the NER task, several similar neural network architectures [4-6] have been proposed and exhibit promising results. Compared with traditional CRF-based methods, the key advantage of these deep learning methods is that the features are learned from data rather than designed by human engineers, so much less feature engineering is needed.

In this paper, we describe our method for the CCKS-2018 CNER task. In our method, five individual neural network models (i.e., CNN-CRF, BiLSTM-CRF, BiLSTM-CNN-CRF, BiLSTM+CNN-CRF and Lattice LSTM) are used for CNER. Afterwards, an ensemble model is built by combining these models' results with majority voting. Moreover, we explore the effect of additional features to further improve the performance.

2 Methods

Fig. 1. The processing flow of our method

In this section, our approach for CNER is described; Fig. 1 shows the processing flow. First, preprocessing steps including sentence splitting, word segmentation and stroke generation are performed. Second, character embeddings are learned from large amounts of unlabeled data with the cw2vec tool (https://github.com/bamtercelboo/cw2vec). Moreover, additional features (i.e., word embedding, dictionary feature and stroke feature) are introduced into the model. Then, with the embeddings as input, five neural network models are trained on the annotated training set. Finally, the results of these models are combined by majority voting and an entity-type ensemble. A detailed description of each step is given in the following sections.
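The paper does not spell out the exact preprocessing rules mentioned above. As a rough illustration only, sentence splitting on Chinese end-of-sentence punctuation and character-level tokenization might look like the following minimal sketch; the punctuation set and example text are our assumptions:

```python
import re

def split_sentences(text):
    # Split on Chinese end-of-sentence punctuation while keeping the
    # delimiter attached; the actual rule set is not specified in the paper.
    return [s for s in re.split(r"(?<=[。！？；])", text) if s.strip()]

def to_characters(sentence):
    # Character-based input: every Chinese character is one token.
    return list(sentence)

for sent in split_sentences("患者无发热。腹部无压痛。"):
    print(to_characters(sent))
```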
2.1 Features

Character embeddings are used as the basic features of our method, since character-based methods outperform word-based methods for Chinese NER [7, 8]. Moreover, to investigate the effects of other features (word embedding, dictionary feature and stroke feature), these features are added to the model as additional features. Details of each feature are provided below.

To obtain high-quality character and word embeddings, we collected a total of 3,005 clinical texts from the CCKS-2017 challenge. These texts, together with the whole CCKS-2018 CNER training set (600 texts), were split into words with the HanLP tool (http://hanlp.linrunsoft.com/). These data were then used to train 100-dimensional character embeddings and 50-dimensional word embeddings with the cw2vec tool, which serve as our pre-trained character and word embeddings.

Due to the complexity of natural language and the specificity of the clinical domain, linguistic and domain-resource features can be employed to improve the performance of our model. We therefore also explored the effect of dictionary and stroke features for our neural network models.

For the dictionary feature, we used the Sogou dictionary (https://pinyin.sogou.com/dict/detail/index/270) to generate a drug dictionary feature. First, the longest possible matches between the character sequences and dictionary entries are found. Then, each character in a match is tagged with the BIOES tagging scheme. Finally, a lookup table maps the tag of each character to a 50-dimensional dictionary embedding.

For the stroke feature, we first obtained the stroke sequence of every character from HanDian (http://www.zdic.net). A stroke lookup table containing an embedding for every stroke type is initialized randomly. The stroke embedding sequence of a character is then fed into a convolutional layer, and an attention pooling layer extracts global features from the convolution output as the stroke feature of the character.
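As an illustration of the dictionary feature, a minimal sketch of the longest-match BIOES encoding might look as follows. The toy dictionary entries and the `max_len` bound are our assumptions; in the real system a lookup table then maps each tag to a 50-dimensional embedding:

```python
def dict_bioes_tags(chars, dictionary, max_len=10):
    """Greedy longest match of dictionary entries against a character
    sequence; each matched span is tagged with the BIOES scheme."""
    tags = ["O"] * len(chars)
    i = 0
    while i < len(chars):
        match = 0
        # Try the longest candidate first (longest possible match).
        for j in range(min(len(chars), i + max_len), i, -1):
            if "".join(chars[i:j]) in dictionary:
                match = j - i
                break
        if match == 1:
            tags[i] = "S"
        elif match > 1:
            tags[i:i + match] = ["B"] + ["I"] * (match - 2) + ["E"]
        i += max(match, 1)
    return tags

drugs = {"阿司匹林", "胰岛素"}  # toy drug dictionary (invented entries)
print(dict_bioes_tags(list("口服阿司匹林后"), drugs))
# ['O', 'O', 'B', 'I', 'I', 'E', 'O']
```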
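The stroke feature can be sketched in PyTorch as follows. Only the overall design (randomly initialized stroke embeddings, one convolutional layer, attention pooling) comes from the paper; the dimensions, kernel size and the simplified handling of padding are our assumptions:

```python
import torch
import torch.nn as nn

class StrokeFeature(nn.Module):
    """Sketch: stroke embeddings -> 1-D convolution -> attention pooling,
    producing one stroke-feature vector per character."""
    def __init__(self, n_strokes, stroke_dim=30, n_filters=50, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_strokes, stroke_dim, padding_idx=0)
        self.conv = nn.Conv1d(stroke_dim, n_filters, kernel, padding=kernel // 2)
        self.att = nn.Linear(n_filters, 1)  # one attention score per position

    def forward(self, strokes):  # strokes: (batch, seq_len) stroke ids
        x = self.emb(strokes).transpose(1, 2)           # (batch, dim, len)
        h = torch.tanh(self.conv(x)).transpose(1, 2)    # (batch, len, filters)
        a = torch.softmax(self.att(h).squeeze(-1), -1)  # (batch, len)
        # Weighted sum over stroke positions (padding is not masked here).
        return (a.unsqueeze(-1) * h).sum(1)             # (batch, filters)

feat = StrokeFeature(n_strokes=6)  # e.g., 5 basic stroke types + padding
print(feat(torch.tensor([[1, 2, 3, 4, 0, 0]])).shape)  # torch.Size([1, 50])
```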
2.2 NN-CRF Models

In this section we describe in detail the five individual neural network models (i.e., CNN-CRF, BiLSTM-CRF, BiLSTM-CNN-CRF, BiLSTM+CNN-CRF and Lattice LSTM) used in our ensemble. These models share a similar architecture, i.e., a neural network with a conditional random field layer (NN-CRF), illustrated in Fig. 2 (a minimal code sketch of this shared skeleton is given after Section 2.3).

Fig. 2. The overall architecture of the NN-CRF model

- CNN-CRF model. In the convolutional neural network (CNN) with a CRF layer, a convolution operation is applied to produce local features. First, a sentence is represented as a sequence of embeddings. Next, the embeddings are given as input to a CNN layer. Then a tanh function on top of the CNN layer is used to learn higher-level features. Finally, a CRF layer is added after the tanh layer to predict the best label sequence among all possible tag paths.

- BiLSTM-CRF model. We also employed a bidirectional long short-term memory network with a CRF layer (BiLSTM-CRF) for CNER. First, a sentence is represented as a sequence of embeddings, which are given as input to a BiLSTM layer. In the BiLSTM layer, a forward LSTM computes a representation of the sequence from left to right, and a backward LSTM computes a representation of the same sequence in reverse. The two networks use different parameters, and the representation of each character is obtained by concatenating its left and right context representations. Then a tanh function on top of the BiLSTM layer is used to learn higher-level features. Finally, a CRF layer is added after the tanh layer to predict the best label sequence among all possible tag paths.

- BiLSTM-CNN-CRF model. Unlike the above models, the NN layer of this model is a BiLSTM-CNN layer: a BiLSTM computes a representation of the sequence, which is then fed into a CNN layer to learn higher-level features.

- BiLSTM+CNN-CRF model. This model is similar to the BiLSTM-CNN-CRF model, but the BiLSTM-CNN layer is replaced with a BiLSTM+CNN layer, in which the representation of the previous layer is fed into a BiLSTM layer and a CNN layer in parallel; their outputs are then concatenated and fed into a tanh layer.

- Lattice LSTM model. Recently, Zhang and Yang proposed a lattice-structured LSTM model for Chinese NER [9]. In this model, latent word information is integrated into a character-based LSTM-CRF by representing lexicon words from the sentence with a lattice-structured LSTM. The model explicitly leverages word and word-sequence information, and does not suffer from segmentation errors.

2.3 Ensemble

As introduced above, the five models were applied to the CNER task independently. To take advantage of the different models, we used a majority voting approach to combine all predicted entities (see the voting sketch below). In addition, different combinations of the models were investigated. Finally, the models with the best performance for each of the five entity types (i.e., anatomy, symptom, independent symptom, drug and operation) on the development set were combined to produce the final result.
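For reference, here is a minimal sketch of the shared NN-CRF skeleton of Section 2.2, instantiated with a BiLSTM as the NN layer. The CRF layer is taken from the third-party pytorch-crf package as one possible choice (not necessarily what the authors used), and the hyper-parameters are illustrative, not the paper's:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (our choice of CRF layer)

class BiLSTMCRF(nn.Module):
    """Sketch of the NN-CRF skeleton: embeddings -> BiLSTM ->
    tanh -> projection to tag scores -> CRF layer."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, chars):
        h, _ = self.lstm(self.emb(chars))  # concatenated context vectors
        return self.proj(torch.tanh(h))    # per-character tag scores

    def loss(self, chars, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self.emissions(chars), tags, mask=mask)

    def decode(self, chars, mask):
        # Best label sequence among all possible tag paths (Viterbi).
        return self.crf.decode(self.emissions(chars), mask=mask)
```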
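The entity-level majority vote of Section 2.3 can be sketched as follows. Representing entities as (start, end, type) spans and requiring three of five votes are our assumptions, since the paper does not give the exact combination rule:

```python
from collections import Counter

def majority_vote(predictions, min_votes=3):
    """Combine entity predictions from several models: an entity
    (start offset, end offset, type) is kept if at least `min_votes`
    models predicted it (a simple majority for five models)."""
    votes = Counter(ent for model in predictions for ent in set(model))
    return sorted(ent for ent, n in votes.items() if n >= min_votes)

models = [
    {(0, 4, "drug"), (10, 14, "anatomy")},
    {(0, 4, "drug")},
    {(0, 4, "drug"), (20, 23, "symptom")},
    {(10, 14, "anatomy")},
    {(0, 4, "drug")},
]
print(majority_vote(models))  # [(0, 4, 'drug')]
```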
3 Experiments

3.1 Dataset

In the CCKS-2018 CNER challenge, the organizers provided a corpus consisting of a training set and a test set. The training set contains 600 medical records annotated with five categories of entities: anatomy, symptom, independent symptom, drug and operation. The test set contains 400 medical records. In our experiments, we randomly selected 20% of the training set as a development set to tune the hyper-parameters. The statistics of the entities in each category are listed in Table 1.

Table 1. Statistics of the entities of different categories

Dataset          Anatomy  Symptom  Independent symptom  Drug   Operation
Training set     7,838    2,066    3,055                1,055  1,116
Development set  1,634    418      657                  166    213

3.2 Evaluation

The evaluation of this task uses two criteria: 1) strict metrics, under which an extraction result is correct only if it and the ground truth share the same mention, the same boundaries and the same entity type; and 2) relaxed metrics, which only require the result and the ground truth to have the same entity type and overlapping boundaries. All our evaluations were performed with the official evaluation tool of the CCKS-2018 CNER challenge, which outputs micro-averaged precision (Prec.), recall (Rec.) and F-score (F); unless otherwise noted, scores are reported under the strict metrics.
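To make the two criteria concrete, strict and relaxed matching for one gold/predicted entity pair might look like the following sketch; the half-open (start, end, type) span representation is our assumption:

```python
def strict_match(gold, pred):
    # Same boundaries and same entity type.
    return gold == pred

def relaxed_match(gold, pred):
    # Same entity type and overlapping boundaries.
    (gs, ge, gt), (ps, pe, pt) = gold, pred
    return gt == pt and gs < pe and ps < ge

gold = (5, 9, "anatomy")
print(strict_match(gold, (5, 9, "anatomy")))    # True
print(relaxed_match(gold, (7, 11, "anatomy")))  # True (overlap, same type)
```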
3.3 Experimental Results

In this task, we split the sentences directly into Chinese characters, which avoids entity boundary errors caused by word segmentation tools. The "BIOES" tags (B-begin, I-inside, E-end, S-single, O-outside) are used to represent the entities. Table 2 shows the performance of the various models on our development set. Our basic model is BiLSTM-CRF, which achieves an F-score of 90.88%. When the stroke and dictionary features are added to the basic model, its performance improves: BiLSTM-CRF+Stroke+dic achieves an F-score of 92.64%. This demonstrates that these features help boost the performance of the model. In addition, among the various models, BiLSTM-CRF+ALL and BiLSTM-CNN-CRF+ALL perform better than the others for anatomy entities, while BiLSTM-CRF+ALL and BiLSTM+CNN-CRF+ALL perform better for drug entities. To take advantage of the different models, we used majority voting to combine all predicted entities. The results show that the model ensemble achieves the highest F-score of 93.16%.

Table 2. Results (F-score, %) of various models on our development set

Models                 Anatomy  Symptom  Independent symptom  Drug   Operation  Overall
BiLSTM-CRF             91.09    90.30    93.08                85.02  88.03      90.88
BiLSTM-CRF+Stroke      92.39    91.89    93.55                86.14  88.42      91.95
BiLSTM-CRF+Stroke+dic  92.66    92.88    93.92                91.89  88.63      92.64
CNN-CRF+ALL            92.58    93.65    94.90                91.94  87.62      92.83
BiLSTM-CRF+ALL         93.13    92.43    94.17                92.77  88.22      92.89
BiLSTM-CNN-CRF+ALL     93.07    91.72    93.94                91.19  89.20      92.70
BiLSTM+CNN-CRF+ALL     92.56    92.97    94.26                92.49  87.76      92.57
Lattice-LSTM           91.60    93.29    94.36                86.93  88.32      91.94
Ensemble               93.21    92.79    94.98                91.84  89.67      93.16

Note: "Stroke" denotes the stroke feature; "dic" denotes the dictionary feature; "ALL" denotes all additional features.

Table 3 lists the results on the official test set. Our best submission achieves F-scores of 88.63% and 95.19% under the "strict" and "relaxed" criteria, respectively. Analyzing the results, we found that the anatomy and operation categories are recognized with comparatively low performance. Therefore, recognizing anatomy and operation entities more accurately will be the main focus of our future work.

Table 3. Results of our best submission on the official test set

                     Strict (%)              Relaxed (%)
Types                Prec.   Rec.    F       Prec.   Rec.    F
Overall              88.89   88.37   88.63   95.47   94.92   95.19
Anatomy              87.70   87.49   87.59   95.98   95.75   95.86
Symptom              92.73   88.89   90.77   94.77   90.85   92.77
Independent symptom  91.52   91.93   91.72   94.67   95.10   94.88
Drug                 92.69   90.41   91.53   95.21   92.87   94.92
Operation            85.62   86.67   86.41   93.68   94.83   94.25

4 Conclusion

In this paper, we present a neural network ensemble approach to automatically recognize clinical entities in Chinese clinical texts. In this approach, five different neural network models are explored, and their ensemble achieves better performance than any individual model. In addition, the effect of additional features on these models in the CNER task is explored; the experimental results show that the additional features effectively improve the performance of our system. Our best submission achieves F-scores of 88.63% and 95.19% under the "strict" and "relaxed" criteria on the official test set, respectively. In future work, we will focus on the more effective extraction of anatomy and operation entities.

5 References

1. Uzuner Ö, Solti I, Cadag E: Extracting medication information from clinical text. Journal of the American Medical Informatics Association 2010, 17(5):514-518.
2. Sun W, Rumshisky A, Uzuner O: Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association 2013, 20(5):806-813.
3. Bethard S, Savova G, Chen W-T, Derczynski L, Pustejovsky J, Verhagen M: SemEval-2016 Task 12: Clinical TempEval. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016. 1052-1062.
4. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P: Natural language processing (almost) from scratch. Journal of Machine Learning Research 2011, 12:2493-2537.
5. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.
6. Ma X, Hovy E: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354, 2016.
7. He J, Wang H: Chinese named entity recognition and word segmentation based on character. In: Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing, 2008.
8. Li H, Hagiwara M, Li Q, Ji H: Comparison of the impact of word segmentation on name tagging for Chinese and Japanese. In: LREC, 2014. 2532-2536.
9. Zhang Y, Yang J: Chinese NER using Lattice LSTM. arXiv preprint arXiv:1805.02023, 2018.