A Conditional Random Fields Approach to Clinical Named Entity Recognition

Xiaoran Yang and Wenkang Huang*
Alibaba Health Information Technology Limited
{xiaoyang.yxr, wenkang.hwk}@alibaba-inc.com

Abstract. Clinical named entity recognition (CNER) is an initial step in understanding and using the free text of electronic medical records. The CCKS committee set up a CNER task for recognizing five types of entities: body part, independent symptom, symptom description, operation, and drug. For this task, we develop a conditional random fields (CRF) model with char embedding, POS, radical, PinYin, dictionary, and rule features. Our best model achieves a strict F1-measure of 0.8926 on the test dataset, ranking first.

Keywords: Named Entity Recognition, Electronic Medical Records, NER

1 Introduction

With the growth of the scale of electronic medical records, clinical named entity recognition (CNER) has gradually become an important research topic. The progress of CNER research in China has been quite slow due to the lack of uniform standards and public datasets. For this reason, the CCKS 2018 conference, held in July 2018, set up a CNER task to identify entities in Chinese clinical text, together with a labeling specification and a training dataset.

Currently, the most effective way to identify named entities is machine learning, using algorithms such as support vector machines (SVM) [1], conditional random fields (CRF) [2], structured support vector machines (SSVM) [3], recurrent neural networks (RNN) and their variants [4], and convolutional neural networks (CNN) and their variants [5].

In this paper, we participate in the CCKS 2018 CNER task and develop a method based on conditional random fields. By evaluating and selecting a large number of different features, including character features and features based on external data, we achieve an F1-measure of 0.8926 on the CCKS 2018 CNER task dataset.

* Correspondence: wenkang.hwk@alibaba-inc.com

2 Task Formalism

The clinical named entity recognition task is usually treated as a sequence labeling task. Given a sentence X = <x1, ..., xn>, the goal is to label each character xi according to the context of X with the BMESO (B-Begin, M-Middle, E-End, S-Single, O-Outside) notation scheme. The CCKS 2018 evaluation Task 1 provides the annotated dataset and the unlabeled dataset with five pre-defined categories (body part, independent symptom, symptom description, operation, and drug). An example of the tag sequence for "患者2个月前因上腹部不适于我院就诊 (the patient went to see a doctor in our hospital two months ago because of upper abdominal discomfort)" is shown in Figure 1.

患\O 者\O 2\O 个\O 月\O 前\O 因\O 上\B-BOD 腹\M-BOD 部\E-BOD 不\B-DES 适\E-DES 于\O 我\O 院\O 就\O 诊\O

Fig. 1. Example of a tag sequence.

3 Methods

In this section, we first introduce the CRF algorithm, and then introduce the features used in the CRF model, including char embedding, POS, radical, PinYin, dictionary, and rule features.

3.1 Conditional Random Fields (CRF)

A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model that has been widely used for sequence labeling problems. Consider a character sequence z = {z_1, ..., z_n}, where z_i is the input vector composed of the character and features of the i-th character, a label sequence y = {y_1, ..., y_n} for z, and let \gamma(z) denote the set of all possible label sequences for z. The CRF model defines the probability of the label sequence y given the character sequence z as

    p(y \mid z; \theta) = \frac{\exp\big(\sum_{t=1}^{n} S(y_t, z_t, \theta)\big)}{\sum_{y' \in \gamma(z)} \exp\big(\sum_{t=1}^{n} S(y'_t, z_t, \theta)\big)}

where S(y_t, z_t, \theta) is the potential function and \theta denotes the parameters of the CRF. In our work, we use the character rather than the word as the unit of the sequence labeling model. The log-likelihood is used as the loss of the CRF layer, and the Viterbi algorithm is used for decoding.
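As an illustration of the decoding step described above, the following is a minimal, self-contained Viterbi sketch for a linear-chain model; the tag set, emission scores, and transition scores here are toy placeholders, not the potentials learned by our CRF.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for a linear chain.

    emissions:   (n, K) array, score of tag k at position t
    transitions: (K, K) array, score of moving from tag i to tag j
    """
    n, K = emissions.shape
    score = emissions[0].copy()              # best score ending in each tag at t = 0
    backpointers = np.zeros((n, K), dtype=int)
    for t in range(1, n):
        # candidate[i, j] = best path ending in tag i at t-1, then tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # follow back-pointers from the best final tag
    best_last = int(score.argmax())
    path = [best_last]
    for t in range(n - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return list(reversed(path))

# toy usage: 3 characters, tags {O, B-BOD, E-BOD}
tags = ["O", "B-BOD", "E-BOD"]
emissions = np.array([[2.0, 0.1, 0.1],
                      [0.2, 1.5, 0.3],
                      [0.1, 0.2, 1.8]])
transitions = np.array([[0.5, 0.1, -2.0],
                        [-2.0, -2.0, 1.0],
                        [0.5, 0.1, -2.0]])
print([tags[i] for i in viterbi_decode(emissions, transitions)])  # ['O', 'B-BOD', 'E-BOD']
```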
3.2 Features

3.2.1 Char Embedding

Given a sequence X = <x1, ..., xn>, a distributed embedding vector is used to represent the information of each character. Formally, we look up the embedding vector of each character xi in a character embedding matrix.

A single English character carries no semantics, while Chinese characters often carry strong semantic information. To utilize this information, we use cw2vec [7] instead of word2vec to construct the character embedding matrix. Unlike word2vec, cw2vec introduces the concept of "stroke n-grams", the semantic structure formed by n consecutive strokes of Chinese words (or Chinese characters). We trained a cw2vec model on the CCKS 2018 training and testing corpora with an embedding dimension of 128.

3.2.2 Part-of-Speech (POS)

Part-of-speech (POS) features can help identify clinical named entities. For example, body parts usually consist of nouns, such as "右上腹 (the right upper quadrant)", and a verb often precedes the name of an operation or a drug, as in taking a drug or performing a surgery. In this paper, the Python library Jieba is used as the POS tagger.

3.2.3 Chinese PinYin

Because of the PinYin input method, a large number of homophone typos appear in entities, and these mistyped entities are often not recognized. For example, "右附件 (the right adnexa)" appearing in the text can be identified, but it may be mistakenly written as "有附件 (have adnexa)" due to the PinYin input method, and such homophone variants cannot be identified. In addition, some similar Chinese characters with the same pronunciation have the same meaning. Therefore, we use character spelling (PinYin) features to help improve clinical named entity recognition.

3.2.4 Radical

Chinese characters are composed of smaller units, radicals, just as English words are composed of letters. These radicals often carry semantic information about the original character. For example, the characters "肠 (intestines)", "肺 (lung)", and "肝 (liver)" share the radical "月" and are all related to human body parts. We retrieved the radical composition of each character from the online Xinhua dictionary (http://tool.httpcn.com/Zi).

3.2.5 Dictionary

An additional dictionary was constructed from the training set and from open websites and databases such as DrugBank and "xunyiwenyao". The bi-directional maximum matching (BDMM) algorithm [8] is used to find dictionary words appearing in the sequences. To improve the accuracy of entity boundary recognition, the BMESO notation scheme is used for tagging the matches, which provides more information about each character's position.

3.2.6 Rule

With the dictionaries above, frequent pattern mining [9] can also find medical terms that do not appear in the dictionaries. For instance, from sequences such as "行子宫切除术 (perform hysterectomy)", with "子宫切除术 (hysterectomy)" in the operation dictionary, we can extract the pattern "行 (perform)". Using this pattern, we can also extract the operation entity "直肠癌切除术 (rectal cancer resection)" from "行直肠癌切除术 (perform rectal cancer resection)", even though "直肠癌切除术 (rectal cancer resection)" is not in the operation dictionary. We also use body part prefixes to extend body part entities, for example extracting "左侧卵巢 (the left ovary)" when only "卵巢 (ovary)" is in the body part dictionary. Words extracted by these patterns are also tagged with the BMESO notation scheme.
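To make the feature set concrete, the following is a minimal sketch of per-character feature extraction. It assumes Jieba for POS tagging (as in Section 3.2.2) and the pypinyin library for the PinYin feature (the paper does not name its PinYin tool); the radical and dictionary lookups are small placeholder tables standing in for the Xinhua dictionary and the BDMM matching described above.

```python
import jieba.posseg as pseg
from pypinyin import lazy_pinyin

# Placeholder lookups: in the paper these come from the online Xinhua dictionary
# and from the training-set / DrugBank / "xunyiwenyao" dictionaries respectively.
RADICALS = {"肠": "月", "肺": "月", "肝": "月", "腹": "月"}
BODY_DICT_TAGS = {"上": "B-BOD", "腹": "M-BOD", "部": "E-BOD"}

def char_features(sentence):
    """Build one feature dict per character, roughly mirroring Section 3.2."""
    # Character-level POS: propagate each word's POS tag to its characters.
    pos_per_char = []
    for word, pos in pseg.cut(sentence):
        pos_per_char.extend([pos] * len(word))
    # Assumes one pinyin syllable per Chinese character in the input.
    pinyin_per_char = lazy_pinyin(sentence)

    feats = []
    for i, ch in enumerate(sentence):
        feats.append({
            "char": ch,
            "pos": pos_per_char[i],
            "pinyin": pinyin_per_char[i],
            "radical": RADICALS.get(ch, "<none>"),
            "dict_tag": BODY_DICT_TAGS.get(ch, "O"),  # stand-in for BDMM dictionary matching
        })
    return feats

for f in char_features("上腹部不适"):
    print(f)
```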
4 Experiments

4.1 Datasets

The CCKS 2018 CNER task provides 600 annotated documents as the training dataset, with five types of entities (body part, independent symptom, symptom description, operation, and drug). 400 unlabeled documents are provided as the testing dataset for evaluating the model. The statistics of the different entity types in the training corpus are listed in Table 1. To choose features and the best hyper-parameters, we split the 600 training documents into 480 training documents and 120 validation documents.

Table 1. Statistics of entities of different categories in the training set.

Entity   Body   Symptom   Description   Operation   Drug   All
Count    5574   2764      1708          1085        849    11980

4.2 Experimental Settings

By tuning the hyper-parameters of the model on the validation set, the best hyper-parameters of the CRF model were obtained and are listed below. The model is trained with the Adam optimization algorithm [11].
(1) L1 penalty: 1;
(2) L2 penalty: 0.01;
(3) Max iterations: 100;
(4) Epsilon: 1e-5.
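The paper does not name the CRF toolkit used. As a sketch of how the hyper-parameters above map onto a common implementation, the snippet below assumes sklearn-crfsuite, whose L-BFGS trainer exposes L1/L2 penalties as c1/c2, a maximum iteration count, and an epsilon stopping tolerance; it does not provide Adam, so the optimizer here differs from the one reported above.

```python
import sklearn_crfsuite

# Hypothetical mapping of the reported hyper-parameters onto sklearn-crfsuite.
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=1.0,              # L1 penalty
    c2=0.01,             # L2 penalty
    max_iterations=100,  # max iterations
    epsilon=1e-5,        # convergence tolerance
    all_possible_transitions=True,
)

# X_train: list of sentences, each a list of per-character feature dicts
# (as in the sketch after Section 3.2); y_train: list of BMESO tag sequences.
# crf.fit(X_train, y_train)
# y_pred = crf.predict(X_test)
```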
4.3 Experiments with the CRF Model

In this section, we compare different combinations of the six types of features in the CRF model. The comparative results are listed in Table 2.

Table 2. Comparative results of different feature combinations in the CRF model.

Feature and CRF loss function                         F1 on validation set
Char                                                  0.9140
Char + Char embedding                                 0.9082
Char + Word segmentation                              0.9111
Char + Radical                                        0.9157
Char + Radical + POS                                  0.9160
Char + Radical + POS + PinYin                         0.9180
Char + Radical + POS + PinYin + Dictionary            0.9510
Char + Radical + POS + PinYin + Dictionary + Rule     0.9729

The result of the CRF model improves only slightly with the radical, POS, and PinYin features, but improves notably with the dictionary and rule features. It appears that the radical, POS, and PinYin features have only a limited influence on clinical named entity recognition, while the dictionary and rule features bring a clear improvement.

4.4 Comparison with a State-of-the-Art Model

In this section, we compare our best CRF model with a state-of-the-art model, Bi-LSTM-CRF, on the test set. The comparative results are summarized in Table 3.

Table 3. Comparative best results of the two models on the test set.

Model         Evaluation   Body     Symptom   Description   Operation   Drug     All
Bi-LSTM+CRF   Strict       0.8812   0.9184    0.8994        0.8506      0.9343   0.8897
              Relaxed      0.9572   0.9522    0.9220        0.9310      0.9516   0.9511
Our CRF       Strict       0.8797   0.9245    0.9059        0.8543      0.9449   0.8913
              Relaxed      0.9556   0.9552    0.9304        0.9325      0.9620   0.9522

Comparing the strict and relaxed results, we find that body parts and operations have a relatively low strict F-measure but a high relaxed F-measure. This means that the positions of these entities are found, but not their exact boundaries. Searching through the full testing corpus, it appears that the body part and operation entities lack a uniform labeling specification. Comparing the two models, the reason why the best CRF result is better than the Bi-LSTM-CRF result may be that the dataset is small relative to the number of entities, so the Bi-LSTM-CRF model tends to overfit. Looking through the results, we find that the Bi-LSTM-CRF model identifies more entities, but some of them are wrong. We believe that with a larger dataset the Bi-LSTM-CRF results would be better.

5 Conclusion

By building a number of features, including character-level features and features based on external data, we developed a clinical named entity recognition model based on the CRF algorithm. Compared with the state-of-the-art Bi-LSTM+CRF algorithm, our CRF model achieved better performance; the reason might be that the corpus is not large enough and the labeling specification is not uniform. In the CCKS 2018 CNER task, we achieved a strict F-measure of 0.8926, which ranked first. In the future, we will focus on more effective extraction of the boundaries of body part and operation entities.

References

1. Asahara, Masayuki, and Yuji Matsumoto: Japanese named entity extraction with redundant morphological analysis. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1. Association for Computational Linguistics, 8-15 (2003).
2. McCallum, Andrew, and Wei Li: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4. Association for Computational Linguistics, 188-191 (2003).
3. Lee, Yuh-Jye, and Olvi L. Mangasarian: SSVM: A smooth support vector machine for classification. Computational Optimization and Applications 20.1, 5-22 (2001).
4. Huang, Zhiheng, Wei Xu, and Kai Yu: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
5. Strubell, Emma, et al.: Fast and accurate entity recognition with iterated dilated convolutions. arXiv preprint arXiv:1702.02098 (2017).
6. Xu, Yan, et al.: Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries. Journal of the American Medical Informatics Association 21.e1, e84-e92 (2013).
7. Cao, Shaosheng, et al.: cw2vec: Learning Chinese word embeddings with stroke n-gram information. (2018).
8. Gai, Rong Li, et al.: Bidirectional maximal matching word segmentation algorithm with rules. Advanced Materials Research, Vol. 926. Trans Tech Publications, 3368-3372 (2014).
9. Xu, Dong, et al.: Data-driven information extraction from Chinese electronic medical records. PLoS ONE 10.8, e0136270 (2015).
10. Gross, Samuel S., et al.: Training conditional random fields for maximum labelwise accuracy. Advances in Neural Information Processing Systems (2007).
11. Kingma, Diederik P., and Jimmy Ba: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).