Disease Named Entity Recognition Using Conditional Random Fields

Hidayat Ur Rahman, Lahore Leads University, Near Garden Town Kalma Chowk, Lahore, Pakistan (Hidayat.R@gmail.com)
Thomas Hahn, University of Arkansas, South University Avenue, Little Rock, AR 72204 (Thomas.F@gmail.com)
Richard Segall, Arkansas State University, Computer Inform Tech Department, State University, AR 72404-0130 (rsegall@astate.edu)

Abstract

Named Entity Recognition (NER) is a crucial component of bio-medical text mining. In this paper a method for disease NER is proposed which uses sentence- and token-level features in a Conditional Random Field (CRF) model trained on the NCBI disease corpus. The feature set used in the experiment includes orthographic, contextual, affix, n-gram, part-of-speech and word-normalization features. Using these features, our approach achieves a maximum F-score of 94% on the training set under 10-fold cross-validation for semantic labeling of the NCBI disease corpus. On the test and development sets, F-scores of 88% and 85% were obtained.

1 Introduction

The increasing amount of bio-medical literature requires more robust approaches to information retrieval and knowledge discovery, because every single day more information is published than humans can read. Bio-medical Named Entity Recognition (NER) faces unique challenges caused by the structure of bio-medical Named Entities (NEs): they contain symbols and abbreviations that encode relationships, so their length is not consistent, which is the primary reason why Bio-NER shows lower performance than general-purpose NER (Lishuang et al., 2013). Bio-NER is the most important step in knowledge extraction, whose overall aim is to identify specific concepts or categories such as gene, protein, disease and drug. The current trend in NER is toward machine learning (ML) approaches, which offer more flexibility than purely statistical and rule-based techniques. However, the performance of ML techniques depends heavily on the availability of sufficient training data to adequately train the classifiers (Krallinger et al., 2011). In this article, Bio-NER for disease names is carried out to handle the challenges of boundary detection and entity classification using Conditional Random Fields (CRFs). The model uses an enriched feature set that includes boundary-detection features such as word normalization, affixes, orthographic and part-of-speech (POS) features; for semantic labeling, features such as n-grams and contextual features are used.

2 Methodology

For disease NER our methodology follows the traditional machine learning approach; Figure 1 depicts the work-flow. First, raw text is obtained from the training, test and development sets, and pre-processing removes characters and symbols such as the underscore character, full stop, etc. After pre-processing, the features described in Section 2.1 are extracted and fed into a sequential CRF as described in Section 2.2, yielding structured output in the form of annotated named entities. This section provides details of feature extraction and classification.

Figure 1: Flow chart of the proposed system

2.1 Feature Set

Feature extraction plays a vital role in the classification accuracy of a machine learning classifier, and hence of the NER system: selecting a relevant feature set improves the classification performance of Bio-NER. Table 1 lists the features used for Bio-NER, with short descriptions.

Feature               Description
Word normalization    Stemmed form of the named entity
Contextual features   w(i-2), w(i-1), w(i), w(i+1), w(i+2)
POS                   POS(w-2), POS(w-1), POS(w), POS(w+1), POS(w+2)
Orthographic          Uppercase, lowercase, title, hyphen, alphanumeric, etc.
Word n-grams          w(i-2)/w(i-1), w(i-1)/w(i), w(i)/w(i+1), w(i+1)/w(i+2)
POS n-grams           POS(w-2), POS(w-1), POS(w), POS(w+1), POS(w+2)
Prefix                PREFIX(w(i))
Suffix                SUFFIX(w(i))

Table 1: Feature set for named entity recognition

Word Normalization

Word normalization reduces different forms of words, such as nouns, adjectives and verbs, to their root form. The Porter stemmer has been used to reduce disease names to their roots. A few examples of disease names reduced with the Porter stemmer:

• Colorectal cancer – colorect cancer
• Endometrial cancer – endometri cancer
• Alzheimer disease – alzheim diseas
• Neurological disease – neurolog diseas
• Arthritis – arthriti

Orthographic Features

Orthographic features are related to the orthography of the text, such as capitalization, digits, numerics, single capitals, all capitals and punctuation. Four of the orthographic features used in our model concern capitalization:

• ALLCAPS: set to true if all letters in a given token are capitals; examples include DMD, BMD, FD, APC, FAP and HDD.
• TITLE: set to true if the first letter of a token is capitalized, as in Alzheimer disease, Huntington disease, and Combined genetic deficiency of C6 and C7.
• LOW: set to true if all letters in a given word are lower case, e.g. myotonic dystrophy, idiopathic dilated cardiomyopathy, and facial lesions.
• MIXED: set to true if a given sequence of words contains both upper and lower case, such as DMD defects, hypo myelination of the PNS, and deficiency of active AVP.
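To make the casing-related flags above (ALLCAPS, TITLE, LOW, MIXED) concrete, a minimal per-token sketch is given below. This is our own reconstruction, not the authors' implementation; the function name and the exact rules are assumptions, and note that the paper defines MIXED over a sequence of words, whereas this sketch tests a single token.

```python
def casing_flags(token: str) -> dict:
    """Boolean casing features for one token, mirroring the ALLCAPS, TITLE,
    LOW and MIXED flags described above (a reconstruction sketch)."""
    has_upper = any(c.isupper() for c in token)
    has_lower = any(c.islower() for c in token)
    return {
        "ALLCAPS": token.isalpha() and token.isupper(),           # e.g. "DMD"
        "TITLE":   token[:1].isupper() and token[1:].islower(),   # e.g. "Alzheimer"
        "LOW":     token.isalpha() and token.islower(),           # e.g. "myotonic"
        "MIXED":   (has_upper and has_lower
                    and not (token[:1].isupper() and token[1:].islower())),  # e.g. "vWD"
    }
```

For a multi-word mention such as "type IIA vWD", the flags would be computed per token and then combined with the window features of Table 1.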
Such features are very effective in boundary detection (Collier and Takeuchi, 2004). Eleven orthographic features are used in our model in total; the remaining seven are:

• IDASH: whether a token contains an inner dash, such as A-T, G6PD-deficient, Palizaeus-Merzbacher disease.
• 2IDASH: set to true if the IDASH count equals 2, e.g. X-linked Emery-Dreifuss muscular dystrophy, Borjeson-Forssman-Lehmann.
• BRACKS: a bracket is contained within a token; an example is hypoxanthine phosphoribosyl transferees [HPRT] deficiency.
• GREEKS: a Greek numeral such as I, II, III or IV is contained within a token, e.g. type IIA vWD, Type II ALD, type II Gaucher disease, type II GD and type III GD.
• SLASH: the character / is contained within a multi-word token, such as cleft lip/palate, CL/P, breast and/ovarian cancers, glucose/galactose malabsorption.
• ALPNUM: a given word contains both numeric and alphabetic characters, like abnormality of CYP27, C6 deficiency, achondrogensis 1B.
• PARN: a multi-word token contains parentheses, such as Arginine vasopressin (AVP) deficiency, palmoplanter keratoderma (PPK) conditions, sporadic (nonhereditary) ovarian cancers.

Contextual Features

Contextual features refer to the words preceding and following the NEs. They are the most important features in this experiment for the semantic labeling of disease names. Four contextual features are selected: the two words preceding and the two words following the named entity. For example, in "bactracin in colon carcinoma loss cells", colon carcinoma is the named entity, while "bactracin in" and "loss cells" are the two preceding and two following words.

Ngrams

N-grams are sequences of n tokens or words. The most common n-gram is the uni-gram, which contains a single token; bi-grams and tri-grams contain two and three tokens respectively. In this experiment uni-grams and bi-grams have been used. All digits within a word are replaced with d; e.g. the uni-gram of 33 is dd and the uni-gram of nt943 is ntddd. Bigram examples are ALD/Eighteen, skin tumor/caused, APC/protein, breast or ovarian cancer/novel, etc.

Part of Speech (POS) Tags

POS tags are helpful in defining the boundary of a phrase; the inclusion of POS features has been advocated by Kazama et al. (2002). Our experiment includes POS tags of the contextual features and of the bigrams. Adding POS tags to our feature set boosts the performance of the classifier, as shown in Table 3.

Affixes

Prefix and suffix features showed good performance in the recognition of NEs in this experiment. In Kazama et al. (2002) the authors collected the most frequent suffixes and prefixes from the training data. A prefix or suffix of length n consists of the first or last n characters of a token, respectively (Zhou and Su, 2002). In our model all combinations n = 1 through 4 have been used to boost performance. The prefixes of the word "tumour" are t, tu, tum and tumo; the suffixes of the same word are r, ur, our and mour. Besides the contextual features, affixes yielded improvements in the overall performance, as shown in Table 3.

2.2 Conditional Random Fields (CRF)

The CRF is a probabilistic model for labeling sequential data and is widely used for POS tagging and NER (Huang et al., 2007). CRFs have several advantages over the Hidden Markov Model (HMM) and the Support Vector Machine (SVM); in particular, they accommodate rich, overlapping feature sets through conditional probability. Given a sequence X = x_1, x_2, ..., x_n and its labels Y = y_1, y_2, ..., y_n, the CRF defines the conditional probability

    P(Y | X) ∝ exp( w^T f(y_n, y_{n-1}, x) )

(Sutton and McCallum, 2011), where w = (w_1, w_2, ..., w_M)^T is a weight vector with one weight per feature, and f(y_n, y_{n-1}, x) = (f_1(y_n, y_{n-1}, x), f_2(y_n, y_{n-1}, x), ..., f_M(y_n, y_{n-1}, x))^T is the vector of M feature functions. The weight vector is obtained using the L-BFGS method. In our experiment the CRFsuite implementation has been used through its Python API.

3 Experimental Setup

3.1 Dataset

Our experiment is based on the National Center for Biotechnology Information (NCBI) disease corpus, which is freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. The NCBI corpus includes 793 abstracts, which consist of 2783 sentences and a total of 6900 disease names (Dogan and Lu, 2012). Disease names are annotated according to the following criteria: mentions that describe a family of specific diseases are annotated as Disease Class, e.g. autosomal recessive disease, whereas text referring to a specific disease is annotated as Specific Disease, such as Diastrophic dysplasia. Strings referring to more than one disease name are annotated as Composite Mention; for example, "Duchene and Becker muscular dystrophy" contains two disease mentions and is therefore categorized as a Composite Mention. Certain disease mentions are used as modifiers for other concepts: a string may denote a disease name without being a noun phrase, in which case it is annotated as Modifier, e.g. colorectal cancer. Table 2 shows the distribution of disease names over the training, test and development sets.

Classes             Train set   Test set   Dev set
Modifier            1292        264        218
Specific Disease    2959        556        409
Composite Mention   116         20         37
Disease Class       781         121        127

Table 2: Dataset used in the experiment

3.2 Classification and Feature Selection

Table 3 shows the contribution of the features and their effect on the performance of the CRF. The feature set is divided into Contextual (Cc), Normalized (Nm), N-gram, Affix (Ax), part-of-speech (POS) and Orthographic (O) features. Performance evaluation has been carried out using precision, recall and F-score, and the results in Table 3 are based on 10-fold cross-validation on the training set. Orthographic features were taken as the benchmark, resulting in an F-score of 0.53, the lowest reported in this experiment. Adding the normalized features increased the F-score by 21%, and the further addition of POS tags increased performance by another 12%. With the addition of n-gram features the overall F-score reached 0.91; finally, with the addition of affixes, the final F-score obtained is 0.94. Compared to other state-of-the-art Bio-NER systems, such as BANNER, our system achieves a higher F-score under 10-fold cross-validation on the training set, due to the selection of good features for disease NER.

Features                     P      R      F
O                            0.54   0.62   0.53
O+Nm                         0.77   0.76   0.74
O+Nm+POS                     0.87   0.87   0.86
O+Nm+POS+Ngram               0.92   0.92   0.91
O+Nm+POS+Ngram+Cc            0.92   0.92   0.92
O+Nm+POS+Ngram+Cc+Affixes    0.94   0.94   0.94

Table 3: Combination of different features

4 Result and Discussion

For result visualization we have plotted the F-score of the individual classes. Figure 2 plots the F-score for each dataset; in the figure, DC denotes Disease Class, CM Composite Mention, SD Specific Disease and MD Modifier. Figure 2 shows that the highest F-scores on the training, test and development sets are reported for Modifier, followed by Specific Disease; the lowest F-score is shown by Composite Mention, followed by Disease Class. One reason for the relatively poor performance on Composite Mention is the small number of training samples compared with Specific Disease and Modifier, which each exceed 1000 training samples. The relatively poor performance on Disease Class is likewise explained by its second-smallest training sample, since the performance of machine-learning-based techniques depends heavily on the number of training samples.

Figure 2: F-score comparison of the training, test and development sets

5 Conclusion

This paper presents a machine learning approach for human disease NER using the NCBI disease corpus. The system takes advantage of a rich feature set which helps to represent and distinguish related concepts and categories, using simple features (orthographic, contextual, affixes, bigrams, part of speech and normalized tokens) without exploiting features such as head nouns or dictionaries. The model achieves state-of-the-art performance for semantic labeling of named entities on the NCBI disease corpus. Each feature set represents some knowledge about the named entity; hence, in order to evaluate the overall benefit of each feature, all possible combinations of feature additions need to be considered.

References

Sutton, C. and McCallum, A. 2011. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 267–373.

Dogan, R. Islamaj and Lu, Z. 2012. An improved corpus of disease mentions in PubMed citations. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, ACL, 91–99.

Lishuang, L., Fan, W. and Huang, D. 2013. A two-phase Bio-NER system based on integrated classifiers and multiagent strategy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(4), 897–904.

Krallinger, M., Vazquez, M., Leitner, F., Salgado, D., Chatr-Aryamontri, A., Winter, A., et al. 2011. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12(Suppl. 8).

Collier, N. and Takeuchi, K. 2004.
Comparison of character-level and part of speech features for name recognition in biomedical texts. Journal of Biomedical Informatics, 36.

Ratinov, L. and Roth, D. 2009. Design challenges and misconceptions in named entity recognition. Proceedings of the Thirteenth Conference on Computational Natural Language Learning, ACL.

Kazama, J., Makino, T., Ohta, Y. and Tsujii, J. 2002. Tuning support vector machines for biomedical named entity recognition. Proceedings of the Workshop on NLP in the Biomedical Domain, 1–8.

Zhou, G. and Su, J. 2002. Named entity recognition using an HMM-based chunk tagger. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 473–480.

Huang, H.-S., Lin, Y.-S., Lin, K.-T., Kuo, C.-J., Chang, Y.-M., Yang, B.-H., Chung, I.-F. and Hsu, C.-N. 2007. High-recall gene mention recognition by unification of multiple background parsing models. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.

Klinger, R., Friedrich, C. M., Fluck, J. and Hofmann-Apitius, M. 2007. Named entity recognition with combinations of conditional random fields. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.
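As a closing illustration of the contextual-window and digit-normalized n-gram features described in Section 2.1, the sketch below shows how such features might be assembled for one token position. It is our own reconstruction, not the authors' code; the function name, padding symbols and feature keys are assumptions.

```python
def token_features(tokens, i):
    """Contextual window w[-2]..w[2] plus digit-normalized word bigrams for
    position i, loosely following Section 2.1 (a reconstruction sketch)."""
    def norm(tok):
        # Replace every digit with 'd', as in the paper's examples:
        # "33" -> "dd", "nt943" -> "ntddd".
        return "".join("d" if c.isdigit() else c for c in tok)

    pad = ["<s>", "<s>"] + list(tokens) + ["</s>", "</s>"]
    j = i + 2  # index of token i in the padded sequence
    feats = {f"w[{k}]": norm(pad[j + k]) for k in range(-2, 3)}
    # Word bigrams over adjacent positions, e.g. w[-1]/w[0] and w[0]/w[1].
    feats["bigram[-1,0]"] = norm(pad[j - 1]) + "/" + norm(pad[j])
    feats["bigram[0,+1]"] = norm(pad[j]) + "/" + norm(pad[j + 1])
    return feats
```

For the sentence fragment used in the Contextual Features subsection, token_features(["bactracin", "in", "colon", "carcinoma", "loss", "cells"], 2) yields the window bactracin/in before colon and carcinoma/loss after it, together with the bigram colon/carcinoma, with digits already replaced by d.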