=Paper=
{{Paper
|id=Vol-1747/BP02_ICBO2016
|storemode=property
|title=Disease Named Entity Recognition Using NCBI Corpus
|pdfUrl=https://ceur-ws.org/Vol-1747/BP02_ICBO2016.pdf
|volume=Vol-1747
|authors=Thomas Hahn,Hidayat Ur Rahman,Richard Segall
|dblpUrl=https://dblp.org/rec/conf/icbo/HahnRS16
}}
==Disease Named Entity Recognition Using NCBI Corpus==
Biomedical Disease Named Entity Recognition Using NCBI Corpus

Hidayat Ur Rahman, Lahore Leads University, 5 Tipu Block Near Garden Town, Kalma Chowk, Lahore 54000, Pakistan, +92-332-9702722, Hidayat.Rhman@gmail.com
Thomas Hahn, University of Arkansas at Little Rock, 2801 South University Avenue, Little Rock, AR 72204, +1 (501) 301-4890, Thomas.F.Hahn3@gmail.com
Richard Segall, Computer & Information Technology Department, Arkansas State University, State University, AR 72404-0130, +1 (870) 972-3989, rsegall@astate.edu

Abstract— Named Entity Recognition (NER) in biomedical literature is a very active research area. NER is a crucial component of biomedical text mining because it enables information retrieval, reasoning and knowledge discovery. Much research in this area has used semantic type categories such as "DNA", "RNA", "proteins" and "genes". However, disease NER, and human disease NER in particular, has not yet received the attention it needs. Traditional machine learning approaches lack precision for disease NER due to their dependence on token-level features, sentence-level features and the integration of features such as orthographic, contextual and linguistic features. In this paper a method for disease NER is proposed which utilizes sentence- and token-level features with Conditional Random Fields on the NCBI disease corpus. Our system uses rich features including orthographic, contextual, affix, bigram, part-of-speech and stem-based features. Using these feature sets, our approach achieved a maximum F-score of 94% on the training set with 10-fold cross-validation for semantic labeling of the NCBI disease corpus. On the testing and development corpora the model achieved F-scores of 88% and 85%, respectively.

Keywords— NCBI disease corpus, naïve Bayesian, Bayesian networks, non-nested generalized exemplars

I. INTRODUCTION

Biomedical Named Entity Recognition (NER) is based on dictionary-based, rule-based and machine learning approaches [1], [2]. The major limitation of the dictionary-based approach is that not all terms are defined in the dictionary [3]. Rule-based approaches make decisions based on rules learned from the data in the form of text terms, but these rules are not applicable in all cases [3]. Machine learning approaches, on the other hand, require large amounts of annotated data to train the algorithm [4]. Nowadays machine learning approaches are commonly used for NER, e.g. Support Vector Machines (SVM) [5], Maximum Entropy (ME) [6], Hidden Markov Models (HMM) [7] and Conditional Random Fields (CRF) [8]. In [9] an HMM model was proposed to distinguish between DNA, RNA, protein, cell-type and cell-line. Kazama et al. proposed an SVM-based approach to identify DNA, cell-type, cell-line, protein and lipid, achieving an F-score of 73.6% [10]. In [11] a CRF-based NER system was developed to recognize protein mentions, achieving an F-score of 78.4%. Besides CRFs, in [12] the authors used ME to distinguish between 23 different biological categories, achieving an F-score of 72%.

The performance of biomedical NER is not satisfactory compared to general-purpose NER [13]. Many approaches have been used to enhance the performance of biomedical NER systems, e.g. adding biomedical domain knowledge [14], [15], applying post-processing [14] and combining different machine learning classifiers into a hybrid classification scheme [16]. Some of these applications are discussed below.

The same biomedical term can be referred to by abbreviations or synonyms. Therefore, abbreviation and synonym recognition are used to unify and normalize biomedical entities for biomedical NER. For example, in [17] the authors used logistic regression for abbreviation scoring based on the Medstract corpus, achieving a recall of 83% and a precision of 80%. In [18] an abbreviation recognition system was developed using the AB3P corpus, achieving a recall of 95.86% and a precision of 86.64%. In [19] pattern-matching rules were developed for matching abbreviations with their respective full terms, yielding a recall of 70% and a precision of 95%. In [20] a system based on collocations yielded a recall of 88.5% and a precision of 96.3%. In [21] a rule-based synonym recognition system was developed, and in [22] a pattern-matching system was developed to match abbreviations with their corresponding full names.

Much current research addresses entity recognition and normalization [23]. In the BioCreative III competition, one task focused on gene normalization, i.e. identifying genes and linking them to a standard database [24]. Such a system has also been developed in [25]. Relationships between biomedical entities, e.g. protein-protein and gene-disease interactions, are investigated in [26]. Much work has also been done in the field of relationship mining. For example, in [27] a relationship mining system was developed that uses MetaMap [28] to identify biomedical entities and linguistic rules to determine the semantic relationships between them. In [29] a gene-disease relationship extraction system was built from Medline abstracts using a machine learning approach; it performed better than dictionary- and rule-based approaches.

The research in this work focuses on biomedical disease classification using the National Center for Biotechnology Information (NCBI) disease corpus and applies combinations of machine learning approaches.
We found that selecting rich features and combining classifiers contributes to better performance.

II. DATASET DETAILS

Our dataset is the National Center for Biotechnology Information (NCBI) Disease Corpus, available at http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. It consists of 793 abstracts containing 2783 sentences, 3224 unique disease names [30] and about 6,900 disease mentions in total. The NCBI corpus annotators annotated every sentence of the PubMed abstracts, excluding organism names (e.g. human, virus and bacteria), gender terms (male and female), general terms (deficiencies and syndromes), biological references and nested diseases. Annotations were made with a web-based tool called PubTator [31]. The annotations were assigned to four categories based on the nature of the disease mention: 3922 specific disease annotations, 1029 disease class annotations, 1774 modifiers and 173 composite mentions. The dataset is further divided into training, testing and development sets as shown in Table-1.

Classes             Training set   Testing set   Development set
Modifiers           1292           264           218
Specific Disease    2959           556           409
Composite Mention   116            20            37
Disease Class       781            121           127

Table-1: Description of the training, testing and development sets.

III. FEATURE SET

To improve classification accuracy, selecting and defining the features is very important; enriching the feature set can improve the performance of a machine learning algorithm. To train our algorithm we used the following features:

1. Word Normalization
2. Orthographic
3. Part of Speech (POS) Tags
4. N-grams
5. Affixes
6. Contextual

Each of these six features is explained in more detail below. In the following they are abbreviated as Contextual (Cc), Normalized (Nm), Unigrams (Un), Bigrams (Bg), Affixes (Ax), Part of Speech (POS) and Orthographic (O). Performance evaluation was carried out using the standard metrics precision, recall and F-score.

A. Word Normalization

Word normalization attempts to reduce the different forms of a word (noun, adjective, verb, etc.) to a stemmed or root form. A common technique for word normalization is the use of a stemmer or lemmatizer, which reduces a word to its base form. The following patterns were analyzed and reduced to their root forms:

• Colorectal cancer → colorect cancer
• Endometrial cancer → endometri cancer
• Alzheimer disease → alzheim diseas
• Neurological disease → neurolog diseas
• Arthritis → arthriti
• Deficiency of DPD → defici of DPD
• Premenopausal ovarian cancer → premenopaus ovarian cancer
• Neurodegeneration → neurodegener
• Familial deficiency of the seventh component of complement → famili defici of the seventh compon of complement

B. Orthographic Features

Orthographic features relate to the written form of tokens, such as capitalization, digits, numerics, single caps, all caps, two caps, punctuation and symbols. Such features are very effective in NER, and their use has been advocated in [32-34].

C. Part Of Speech (POS) Tags

POS tags usually help define the boundaries of phrases, and in some scenarios they have improved NER performance [34-35]. Since POS tagging is a challenging and computationally demanding process, some researchers have not used it in NER [36]. We improved performance by including POS tags.

D. N-grams

An n-gram is a sequence of n tokens or words. The simplest n-gram is the unigram, which contains a single token; bigrams and trigrams contain 2 and 3 tokens, respectively. In general, the n-gram starting at token position i can be written as

  g_n(i) = (w_i, w_{i+1}, ..., w_{i+n-1})    (1)

From equation (1), n = 1 gives unigrams, a bigram adds one more word, a trigram adds two more words, and higher-order n-gram models follow the same pattern. In our experiment we used only unigrams and bigrams.
E. Affixes

Prefix and suffix features have significantly improved performance in the recognition of named entities. In [37] the authors collected the most frequent suffixes and prefixes from the training data, while in [38] the authors grouped the prefixes and suffixes into 23 categories. In our experiment, affixes combined with contextual features showed a significant improvement.

F. Contextual Features

Contextual features are the words preceding and following a named entity. Let w_i be the current token, i.e. the named entity; for each feature we use two token instances on either side of it. For the token at position i the contextual window is

  c = (w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2})    (2)

In our experiment contextual features combined with affixes were the most important features for recognizing named entities. Initially only the two contextual features adjacent to the current word were selected; after realizing their importance, four contextual features were selected, as in equation (2), i.e. the two words preceding and the two words following the named entity.

IV. CLASSIFICATION SCHEME

In this research Conditional Random Fields (CRF) were applied to the NCBI disease corpus. CRF is a probabilistic model for labeling sequential data that is widely used for part-of-speech tagging and named entity recognition [39, 40]. CRF has several advantages over HMM and SVM: it is a discriminative model and can therefore include a rich feature set with overlapping features through conditional probability. Given a token sequence x = (x_1, ..., x_T) and its labels y = (y_1, ..., y_T), the conditional probability is defined by CRF as follows [41]:

  p(y|x) = (1/Z(x)) exp( Σ_t Σ_{m=1}^{M} w_m f_m(y_{t-1}, y_t, x, t) )    (3)

where the f_m are the M feature functions, w = (w_1, ..., w_M) is the associated weight vector and Z(x) is a normalization term. The weight vector w is obtained using the L-BFGS method [42]. In our experiment we used CRFSUITE through its Python interface [43].

V. RESULT AND DISCUSSION

Table-2 shows the contributions of the features and their effects on the performance of CRF, measured with the standard metrics

  Precision = TP / (TP + FP)
  Recall = TP / (TP + FN)
  F-score = 2 × Precision × Recall / (Precision + Recall)

The results in Table-2 were obtained by applying 10-fold cross-validation on the training set.

Feature combination                 Precision   Recall   F-score
O                                   0.54        0.62     0.53
O + Nm                              0.77        0.76     0.74
O + Nm + POS                        0.87        0.87     0.86
O + Nm + POS + Un                   0.91        0.91     0.91
O + Nm + POS + Un + Bg              0.92        0.92     0.91
O + Nm + POS + Un + Bg + Cc         0.92        0.92     0.92
O + Nm + POS + Un + Bg + Cc + Ax    0.94        0.94     0.94

Table-2: Performance evaluation of the feature sets.

Table-2 shows combinations of different features for improving CRF performance. Orthographic features were taken as the benchmark, giving an F-score of 0.53, a precision of 0.54 and a recall of 0.62. Adding stemmed (normalized) features improved the F-score to 0.74, the precision to 0.77 and the recall to 0.76. Adding part-of-speech tags further improved the F-score by 12 percentage points, even though part-of-speech tags have been omitted from some recent NER systems. Unigram-based models have been the primary models in NER, and we therefore included them in our system; adding the unigram features improved the F-score by 5%. Adding bigram features did not raise the overall F-score but improved precision and recall by 1%. Adding contextual features improved the F-score slightly, by 1%, but had no effect on precision and recall. Combining all features, i.e. orthographic, normalized, part-of-speech, unigram, bigram, contextual and affix features, yielded 94% for precision, recall and F-score. This performance was achieved with 10-fold cross-validation on the training set, owing to the rich feature selection.

Figure 1 shows the F-scores for each of the four classes defined in our experiment:

• Disease Class = DC
• Composite Mention = CM
• Specific Disease = SD
• Modifier = MD

The F-scores of the training, development and testing sets are plotted in Figure 1. The best F-scores were achieved for the Modifier class: 0.96 on the training set and 0.92 on both the development and testing sets. The second highest F-scores were achieved for the Specific Disease class: 0.95 on the training set, 0.92 on the testing set and 0.88 on the development set. The third highest F-scores were achieved for the Disease Class: 0.86 on the training set and 0.71 on both the testing and development sets. The F-scores were lowest for the Composite Mention class: 0.72 on the training set, 0.52 on the testing set and 0.62 on the development set. We observed a positive correlation between the size of the training sample sets and the F-score. The largest training sample, comprising over 1,000 instances, was available for the Modifier class, followed by the Specific Disease class, then the Disease Class with the second smallest training sample, and finally the Composite Mention class with the smallest training sample. The performance of machine learning algorithms depends on the size of the training sample: training samples that are too small increase the risk of underfitting, while excessively large training samples increase the risk of overfitting.

System            Dataset       Precision   Recall   F-Measure
CRF result        Training      0.94        0.94     0.94
                  Testing       0.88        0.89     0.88
                  Development   0.86        0.86     0.85
BANNER result     Training      0.86        0.82     0.84
                  Testing       0.83        0.80     0.81
                  Development   0.82        0.81     0.81

Table-3: Comparison of BANNER and CRF results; precision, recall and F-score are reported for both classifiers.

Figure 2 also shows that our F-scores (depicted in blue) are much higher than those of BANNER (depicted in red).

Figure-2: Plot of BANNER vs. the proposed model.

In summary, it can be concluded that CRF based on the six features clearly outperformed BANNER.
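The precision, recall and F-score values reported in Tables 2 and 3 can be reproduced from token-level predictions. The following is a minimal sketch of such a micro-averaged evaluation; it is our own illustration, not the authors' evaluation script, and the label names merely follow the paper's class abbreviations.

```python
def precision_recall_f(gold, pred, outside="O"):
    """Token-level micro-averaged precision, recall and F-score.
    TP: predicted entity label equals the gold entity label;
    FP: an entity label predicted where gold disagrees;
    FN: a gold entity label that was not predicted correctly."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1
        elif p != outside:      # spurious or wrongly labeled token
            fp += 1
            if g != outside:    # a gold entity token was also missed
                fn += 1
        elif g != outside:      # entity token predicted as outside
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = ["O", "SD", "SD", "O", "MD", "O"]
pred = ["O", "SD", "O",  "O", "MD", "DC"]
p, r, f = precision_recall_f(gold, pred)
```

With one missed Specific Disease token and one spurious Disease Class token in this toy sequence, all three scores come out to 2/3.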
We compared the performance of our approach, which is based on combining features, with that of BANNER on the same dataset and classes. The results of this comparison are shown in Table-3; details about the BANNER results can be found in [30]. The data in Table-3 indicate that our approach yielded much higher F-scores than BANNER on all sets: the F-score obtained with our approach is 10% higher for the training set, 7% higher for the testing set and 4% higher for the development set. Hence, we clearly succeeded in outperforming BANNER. This shows that the sequential classifier CRF, based on rich features, is well suited for classifying biomedical literature.

Figure-1: F-score comparison of the training, testing and development data sets.

VI. CONCLUSION

This paper presents a machine learning approach for human disease named entity recognition using the NCBI disease corpus. The system takes advantage of the background knowledge captured by the selected features to better distinguish between the four classes. Improvements due to feature additions have been demonstrated; the highest improvement was obtained when adding the second feature to the first. However, in order to evaluate the overall benefit of each feature, all possible combinations of feature additions would need to be considered.

REFERENCES

[1]. A.M. Cohen, W.R. Hersh. A survey of current work in biomedical text mining. Brief Bioinform, 6 (2005), pp. 57–71.
[2]. L. Li, R. Zhou, D. Huang. Two-phase biomedical named entity recognition using CRFs. Comput Biol Chem, 33 (2009), pp. 334–338.
[3]. D. Rebholz-Schuhmann, A.J. Yepes, C. Li, S. Kafkas, I. Lewin, N. Kang, et al. Assessment of NER solutions against the first and second CALBC Silver Standard Corpus. J Biomed Semantics, 2 (Suppl. 5) (2011).
[4]. M. Krallinger, M. Vazquez, F. Leitner, D. Salgado, A. Chatr-Aryamontri, A. Winter, et al. The Protein–Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12 (Suppl. 8) (2011).
[5]. M.S. Habib, J. Kalita. Scalable biomedical named entity recognition: investigation of a database-supported SVM approach. Int J Bioinform Res Appl, 6 (2010), pp. 191–208.
[6]. S.K. Saha, S. Sarkar, P. Mitra. Feature selection techniques for maximum entropy based biomedical named entity recognition. J Biomed Inform, 42 (2009), pp. 905–911.
[7]. Y. Ephraim, N. Merhav. Hidden Markov processes. IEEE Trans Inform Theory, 48 (2002), pp. 1518–1569.
[8]. Y. He, M. Kayaalp. Biological entity recognition with conditional random fields. In: AMIA Annu Symp Proc; 2008, pp. 293–297.
[9]. G.D. Zhou, J. Su. Exploring deep knowledge resources in biomedical name recognition. In: JNLPBA; 2004, pp. 96–99.
[10]. J. Kazama, T. Makino, Y. Ohta, J. Tsujii. Tuning support vector machines for biomedical named entity recognition. In: Association for Computational Linguistics, Morristown, NJ, USA; 2002, pp. 1–8.
[11]. T. Tsai, W.C. Chou, S.H. Wu, T.Y. Sung, J. Hsiang, W.L. Hsu. Integrating linguistic knowledge into a conditional random field framework to identify biomedical named entities. Expert Syst Appl, 30 (2006), pp. 117–128.
[12]. Y.F. Lin, T.H. Tsai, W.C. Chou, K.P. Wu, T.Y. Sung, W.L. Hsu. A maximum entropy approach to biomedical named entity recognition. In: The 4th ACM SIGKDD Workshop on Data Mining in Bioinformatics; 2004, pp. 56–61.
[13]. C.R. Yen-Ching, T.-H. Tsai, W.-L. Hsu. New challenges for biological text-mining in the next decade. J Comput Sci Technol, 25 (2010), pp. 169–179.
[14]. Y. Sasaki, Y. Tsuruoka, J. McNaught, S. Ananiadou. How to make the most of NE dictionaries in statistical NER. BMC Bioinformatics, 9 (Suppl. 11) (2008), p. S5.
[15]. G.D. Zhou, J. Su. Exploring deep knowledge resources in biomedical name recognition. In: JNLPBA; 2004.
[16]. F. Zhu, B. Shen. Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing. PLoS One, 7 (6) (2012), p. e39230.
[17]. J.T. Chang, H. Schutze, R.B. Altman. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc, 9 (2002), pp. 612–620.
[18]. C.J. Kuo, M.H. Ling, K.T. Lin, C.N. Hsu. BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature. BMC Bioinformatics, 10 (Suppl. 15) (2009), p. S7.
[19]. H. Yu, G. Hripcsak, C. Friedman. Mapping abbreviations to full forms in biomedical articles. J Am Med Inform Assoc, 9 (2002), pp. 262–272.
[20]. H. Liu, C. Friedman. Mining terminological knowledge in large biomedical corpora. Pac Symp Biocomput (2003), pp. 415–426.
[21]. J. McCrae, N. Collier. Synonym set extraction from the biomedical literature by lexical pattern discovery. BMC Bioinformatics, 9 (2008), p. 159.
[22]. A.M. Cohen, W.R. Hersh, C. Dubay, K. Spackman. Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts. BMC Bioinformatics, 6 (2005), p. 103.
[23]. Z. Lu, H.-Y. Kao, C.-H. Wei, M. Huang, J. Liu, C.-J. Kuo, C.-N. Hsu, et al. The gene normalization task in BioCreative III. BMC Bioinformatics, 12 (2011).
[24]. C.N. Arighi, P.M. Roberts, S. Agarwal, S. Bhattacharya, G. Cesareni, A. Chatr-Aryamontri, et al. BioCreative III interactive task: an overview. BMC Bioinformatics, 12 (Suppl. 8) (2011).
[25]. M. Huang, J. Liu, X. Zhu. GeneTUKit: a software for document-level gene normalization. Bioinformatics, 27 (2011), pp. 1032–1033.
[26]. C.N. Arighi, Z. Lu, M. Krallinger, K.B. Cohen, W.J. Wilbur, A. Valencia, et al. Overview of the BioCreative III workshop. BMC Bioinformatics, 12 (Suppl. 8) (2011), p. S1.
[27]. A. Ben Abacha, P. Zweigenbaum. Automatic extraction of semantic relations between medical entities: a rule based approach. J Biomed Semantics, 2 (Suppl. 5) (2011), p. S4.
[28]. A.R. Aronson, F.M. Lang. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc, 17 (2010), pp. 229–236.
[30]. R. Islamaj Doğan, Z. Lu. An improved corpus for disease mentions in PubMed citations. In: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012), Montréal, Canada, June 2012, pp. 91–99.
[31]. R. Leaman, C. Miller, G. Gonzalez. Enabling recognition of diseases in biomedical text with machine learning: corpus and benchmarks. In: Symposium on Languages in Biology and Medicine; 2009, pp. 82–89.
[32]. C.-H. Wei, H.-Y. Kao, Z. Lu. PubTator: a PubMed-like interactive curation system for document triage and literature curation. In: Proceedings of the BioCreative 2012 Workshop, pp. 145–150.
[33]. N. Collier, K. Takeuchi. Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Inform, 37 (2004), pp. 423–435.
[34]. D. Shen, J. Zhang, G. Zhou, J. Su, C.-L. Tan. Effective adaptation of a Hidden Markov Model-based named entity recognizer for biomedical domain. In: Proceedings of the ACL 2003 Workshop on NLP in Biomedicine, Sapporo, Japan, 2003, pp. 49–56.
[35]. T.-H. Tsai, S.-H. Wu, W.-L. Hsu. Exploitation of linguistic features using a CRF-based biomedical named entity recognizer. In: ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Detroit, 2005.
[36]. L. Ratinov, D. Roth. Design challenges and misconceptions in named entity recognition. In: CoNLL; 2009.
[37]. J. Kazama, T. Makino, Y. Ohta, J. Tsujii. Tuning support vector machines for biomedical named entity recognition. In: Proceedings of the Workshop on NLP in the Biomedical Domain, ACL 2002, pp. 1–8.
[38]. G. Zhou, J. Su. Named entity recognition using an HMM-based chunk tagger. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL); 2002, pp. 473–480.
[39]. H.-S. Huang, Y.-S. Lin, K.-T. Lin, C.-J. Kuo, Y.-M. Chang, B.-H. Yang, I.-F. Chung, C.-N. Hsu. High-recall gene mention recognition by unification of multiple background parsing models. In: Proceedings of the 2nd BioCreative Challenge Evaluation Workshop; 2007, pp. 109–111.
[40]. R. Klinger, C.M. Friedrich, J. Fluck, M. Hofmann-Apitius. Named entity recognition with combinations of conditional random fields. In: Proceedings of the 2nd BioCreative Challenge Evaluation Workshop; 2007.
[41]. M.F. Porter. Snowball: a language for stemming algorithms. 2001.
[42]. A.G. Jivani. A comparative study of stemming algorithms. Int J Comp Tech Appl, 2 (6), pp. 1930–1938.
[43]. D. Lin, X. Wu. Phrase clustering for discriminative learning. In: Proceedings of ACL-IJCNLP 2009, Suntec, Singapore, August 2009, pp. 1030–1038.