=Paper=
{{Paper
|id=Vol-1718/paper4
|storemode=property
|title=SL − FII: Syntactic and Lexical Constraints with Frequency based Iterative
Improvement for Disease Mention Recognition in News Headlines
|pdfUrl=https://ceur-ws.org/Vol-1718/paper4.pdf
|volume=Vol-1718
|authors=Sidak Pal Singh,Sopan Khosla,Sajal Rustagi,Manisha Patel,Dhaval Patel
|dblpUrl=https://dblp.org/rec/conf/ijcai/SinghKRPP16
}}
==SL − FII: Syntactic and Lexical Constraints with Frequency based Iterative Improvement for Disease Mention Recognition in News Headlines==
SL-FII: Syntactic and Lexical Constraints with Frequency-based Iterative Improvement for Disease Mention Recognition in News Headlines

Sidak Pal Singh, Sopan Khosla, Sajal Rustagi, Manisha Patel (graduate students, IIT Roorkee) and Dhaval Patel (faculty, IIT Roorkee)

(sidakuec, sajalume, manipubt)@iitr.ac.in, khoslasopan@gmail.com, patelfec@iitr.ac.in

===Abstract===
News headlines are a vital source of information for the masses. Identifying diseases that are being spread or discovered is important for taking the necessary steps towards their prevention and cure. Our system uses a syntactic and lexical constraint-based approach, followed by a frequency-analysis phase, to extract meaningful disease names. In the task of top-150 (unique) disease mention recognition on the 2015 news headlines dataset, our approach shows a 40% gain in accuracy over other baseline approaches, illustrating its benefit.

===1 Introduction===
Disease Mention Recognition involves the extraction of disease names from a given text. News provides us access to current events and up-to-date information on varied fields. Rather than analyzing the entire news text across different sources, headlines are a quick and viable option for extracting useful knowledge.

Disease names found in news headlines can inform us about the kinds of diseases that are spreading or prevalent in different regions at various points in time. This has several advantages: taking adequate measures for the prevention and control of diseases, investing in research and development for their cures, predicting future epidemic outbreaks, etc. Correctly recognizing a disease mention is vital for improving disease-centric knowledge extraction tasks such as drug discovery [Agarwal and Searls, 2008].

We aim to discover patterns in the news headlines that contain diseases and use them to generalize to diseases that have not been seen. We identify a set of significantly covering word roots that signal disease mentions and then extract the sentence structure using rule-based inference techniques. In other words, we use syntactic and lexical (SL) constraints to extract the disease names from headlines in an initial pass. The highlight of our approach is the Frequency-based Iterative Improvement (FII), which leads to more accurate results by weeding out false positives.

We experimented on a total of 664163 headlines for the year 2014 and 408052 headlines for the year 2015, collected using the iMM (Indian news Media Monitoring) system described in [Mazumder et al., 2014]. Our SL-FII system (code and data available at https://github.com/sidak/Disease_Mention_Recognition) extracted a total of 3157 and 5058 correct occurrences of disease names for 2014 and 2015 respectively. To compare the performance of our method with baseline approaches, we use both manual analysis and an external knowledge source. Our system gives a 40% gain in accuracy over other baseline approaches in the task of top-150 (unique) disease mention recognition on the 2015 news headlines dataset.

===2 Related Work===
One of the essential requirements for a text mining application is the ability to identify relevant entities. There has been an increasing amount of research in Biomedical Named Entity Recognition (BNER), the task of locating the boundaries of biomedical entities in a given corpus and tagging them with the corresponding semantic type (e.g. proteins, vitamins, viruses). With various events such as i2b2 [Uzuner et al., 2011] and scientific challenges [Kim et al., 2009], BNER has seen huge development in recognizing mentions of genes [Lu et al., 2011], [Torii et al., 2009], organisms [Naderi et al., 2011] and diseases [Dogan and Lu, 2012]. Most of the research on BNER has focused on clinical texts [Pradhan et al., 2014], medical records and PubMed queries [Névéol et al., 2009], [Doan et al., 2014]. To the best of our knowledge, however, extracting disease mentions from news headlines has not been significantly explored in the literature.

Most of the techniques used in these tasks are based on machine learning approaches such as Support Vector Machines (SVM) and Conditional Random Fields (CRF). Disease mention recognition by [Jonnagaddala et al., 2015] was performed using a CRF approach on PubMed abstracts. BANNER, developed by [Leaman et al., 2008], is also based on a CRF approach, using syntactic, lexical and orthographic features to recognize disease mentions. This work was extended in the context of biomedical texts by [Chowdhury et al., 2010] through the use of contextual features in addition to the features extracted by BANNER.

In all these approaches, the corpus used (for example NCBI [Doan et al., 2014]) has detailed annotations at both the mention and the concept level. Our headline dataset, by contrast, does not have any form of annotation, which makes it difficult to apply supervised machine learning techniques. Further, due to the differences in structural patterns between news corpora and biomedical texts, the aforementioned approaches cannot be used effectively for disease mention recognition on a news corpus. Apart from this, an advantage of our approach is that it can generalize to entities belonging to domains as different as cricket, politics, etc. We illustrate this with an example in Section 3.3.
===3 Proposed Solution===
Our solution involves four basic stages: Pre-processing, Relational Extraction, Frequency-based Iterative Improvement and Post-processing. The architecture of our SL-FII system is shown in Figure 1.

Figure 1: The architecture of our SL-FII system.

===3.1 Dataset===
We use the iMM system [Mazumder et al., 2014] to collect news headlines for the years 2014 and 2015. We make use of a manually prepared list of 95 diseases to simulate annotations (see Section 4) and then use it to extract word roots that cover a significant portion of the disease-containing headlines. Some of the headlines collected by the iMM system that are used later in this paper to explain our technique are listed in Table 1.

Table 1: Sample headlines collected by the iMM system.
* Healthcare worker in Scotland diagnosed with Ebola
* Beyonce Reaches Out to Grieving Family of Teen Who Died of Cancer
* Bird flu outbreak in Kottayam, Alappuzha
* More details on Dan Uggla's concussion symptoms
* IMF rules out special treatment for Greece

===3.2 Pre-processing===
The first step of pre-processing is to remove apostrophe inconsistencies from the corpus of headlines. After this we use NLTK (http://www.nltk.org/, Natural Language Toolkit v3.1) to tokenize the headlines and then tag the resulting tokens with parts of speech (PoS tagging). The PoS tag symbols and their descriptions are shown in Table 2.

Table 2: Notation used for PoS tags.
* DT - determiner
* JJ - adjective
* NN - noun, singular
* NNS - noun, plural
* NNP - proper noun, singular
* NNPS - proper noun, plural
* RB - adverb
* VB - verb, base form
* VBZ - verb, 3rd person
* CD - cardinal number

News headlines may also contain grammatical inconsistencies, for example capitalizing the first letter of every word, punctuation mistakes, or missing articles. In such cases, PoS taggers may tag certain words incorrectly, as in the following example. Consider the headline "India Seeks Revenge From Australia":

PoS tags: [('India', 'NNP'), ('Seeks', 'NNP'), ('Revenge', 'NNP'), ('From', 'NNP'), ('Australia', 'NNP')]

To handle such inconsistencies we use the following approach:
1. Convert the headline to lower case and compare the PoS tags of its tokens with those of the original sentence.
2. If a token's PoS tag differs, use the lower-case form; otherwise, use the original one.

Thus, the example headline is compared to "india seeks revenge from australia":

PoS tags: [('india', 'NN'), ('seeks', 'VBZ'), ('revenge', 'NN'), ('from', 'IN'), ('australia', 'NN')]

On converting to lower case, the PoS tag of 'India' still semantically portrays a noun, whereas the PoS tag of 'Seeks' changes from noun to verb (NNP to VBZ). So the example headline finally gets converted to "India seeks Revenge from Australia":

PoS tags: [('India', 'NNP'), ('seeks', 'VBZ'), ('Revenge', 'NNP'), ('from', 'IN'), ('Australia', 'NNP')]
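The case-normalization heuristic described above can be sketched as follows. This is a minimal sketch: `normalize_case` and `stub_tag` are our own illustrative names, `stub_tag` is a toy stand-in for NLTK's `pos_tag` (which the paper actually uses), and the comparison of "respective PoS tags" is read here as comparing the coarse tag class (the tag's first letter), which reproduces the paper's example.

```python
def normalize_case(tokens, tag_fn):
    """Resolve capitalization inconsistencies in a headline.

    Tags both the original and the lower-cased token sequence; when a
    token's coarse PoS class (first letter of its tag) changes under
    lower-casing, the lower-case form is kept, otherwise the original.
    """
    orig_tags = tag_fn(tokens)
    lower_tags = tag_fn([t.lower() for t in tokens])
    return [low if otag[0] != ltag[0] else tok
            for (tok, otag), (low, ltag) in zip(orig_tags, lower_tags)]

def stub_tag(tokens):
    # Toy tagger standing in for nltk.pos_tag (illustration only):
    # title-cased words become NNP, the rest come from a small lookup.
    lex = {'india': 'NN', 'seeks': 'VBZ', 'revenge': 'NN',
           'from': 'IN', 'australia': 'NN'}
    return [(t, 'NNP' if t[0].isupper() else lex.get(t.lower(), 'NN'))
            for t in tokens]

print(normalize_case("India Seeks Revenge From Australia".split(), stub_tag))
# ['India', 'seeks', 'Revenge', 'from', 'Australia']
```

With a real tagger, `tag_fn=nltk.pos_tag` plugs in directly in place of the stub.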
===3.3 Relational Extraction===
In this section, we introduce two types of constraints, namely lexical and syntactic. These constraints help us discover and extract disease name - word root relations. We also discuss how to generalize this idea to entities from different domains.

====Lexical Constraints====
News headlines contain multiple entities. Our task is to identify the correct set of entities that correspond to disease names. In other words, we need to formulate a context that signals the occurrence of disease names. To define this context or neighbourhood, we extract certain word roots that indicate, with high confidence, the presence of disease names in headlines. The word roots are obtained by analyzing the headline data and the initial list of disease names. Based on how well they cover the disease-containing headlines, both quantitatively and qualitatively, we select a subset of these word roots. From our experiments, we find that a specific set of 10 word roots covers a significant portion of the disease-containing headlines:

* 'diagnos' derivatives
* 'outbreak' derivatives
* 'cur' derivatives
* 'vaccin' derivatives
* 'die' derivatives
* 'battling' derivatives
* 'symptom' derivatives
* 'treatment' derivatives
* 'virus' derivatives
* 'hospital' derivatives

Note that "derivatives" here means the inflected forms of the listed keywords, together with certain prepositions. For example, consider the derivatives of 'diagnos':

[mis]diagnos(e | es | ed | is | tic) [with | for | of | by]

Apart from the above list, word roots like 'drug', 'patient' and 'therapy' were also identified. Considering these additional word roots leads to only a marginal improvement in the number of identified disease mentions. An intuitive justification is that such word roots are quite often used along with entities other than disease names; in several cases they identify more false positives than true positives. Thus choosing this small list does not cause any significant loss and at the same time speeds up the system.

====Syntactic Constraints====
The headlines containing inflected forms of the different word roots are extracted using the lexical constraints. Using the PoS tags of the obtained headlines, we develop syntactic constraints/rules that capture the position at which disease names occur relative to the word roots identified above. Besides eliminating incoherent extractions, syntactic constraints also reduce uninformative extractions by capturing relation phrases that are expressed by only certain combinations.

Figure 2 shows the syntactic constraints developed for the inflected form 'diagnoses' of the word root 'diagnos', described in detail below:

* In the headline "Stool test diagnoses bowel disease", the inflected form 'diagnoses' is used as a 3rd-person verb (VBZ). The disease mention 'bowel disease', extracted as a pair of singular nouns (NN NN), occurs to the right of 'diagnoses'.
* In the headline "Autism diagnoses surge by 30 percent in kids", the inflected form 'diagnoses' is used as a plural noun (NNS). The disease mention 'Autism', extracted as a noun (NN), occurs to the left of 'diagnoses'.

Figure 2: Rules for the inflected form 'diagnoses' of the word root 'diagnos'.

Syntactic constraints for other word roots are developed in a similar manner. Disease mentions in news headlines generally occur around the word roots as described by the regular expression below:

E = [DT] (JJ)* (NN | NNP | NNS | NNPS | CD)+

In other words, disease mentions are phrases that contain an optional determiner or article (e.g. a, an), followed by any number of optional adjectives (e.g. fractured) and at least one noun (e.g. elbow, non-Hodgkin's lymphoma), with a maximum length of four tokens. Since disease names basically represent a kind of entity, the syntactic constraints extract the potential disease name phrases from news headlines. For obvious reasons, we omit the determiner or article (represented by the PoS tag DT) in order to get the disease names. Below is an example headline depicting the application of these constraints, with the extracted disease mention in bold:

Former Butler forward Andrew Smith diagnosed with '''non-Hodgkin's lymphoma'''.
====Generalization====
Since a basic entity is represented by the regular expression discussed earlier, our SL-FII approach can be generalized to entities from other domains via a simple modification of the lexical constraints to suit the new context. For example, in order to identify the names of cricketers, we can use word roots like:

* 'wicket haul' derivatives
* 'ton' derivatives

In the headlines below, we can successfully extract the cricketer's name using our SL-FII model:

* Gayle's 47-ball ton wipes out England
* Ashwin's 5 wicket haul takes India to the semis

Generalizing this approach to other languages depends mainly on the quality of the standard NLP pre-processing tools, such as the PoS tagger and stemmer, available for the required language; the approach itself remains largely the same.

===3.4 Frequency-based Iterative Improvement (FII)===
The main motivation is to use the experience gained in recognizing disease mentions with different word roots to weed out incorrect disease names. The constraints formulated above are used to extract potential disease names from the corpus in an initial pass, and these are then passed through the FII phase. In order to filter out false positives, we assign Disease Expectancy Score (DES) values, based on the probability of a word being a disease name, to the outputs of the initial pass. The idea behind our FII phase can be understood from the following examples and Equations 1 to 4.

* Mentions identified along with the word root 'battling' can come as 'battling insecurity', 'battling cancer', 'battling lust', whereas for the word root 'treatment' most occurrences are of the form 'Ebola treatment', 'treatment of cancer'. So a disease mention extracted using the word root 'treatment' is more probable to be an instance of a disease name than one extracted using 'battling'.
* A disease mention such as 'Cancer', which occurs with multiple word roots ('treatment', 'battling' and 'cure'), is more probable to be an instance of a disease name than a mention like 'seasonal' that is extracted using only the single word root 'outbreak'.

1. The probability of a phrase p being an instance of a disease name (p ∈ D) depends on the probability of the phrase being used with the different lexical rules (p ∩ rule_i), as defined in Equation 1:

P(p ∈ D) = Σ_rules P(p ∈ D | p ∩ rule_i) × P(p ∩ rule_i)    (1)

2. Based on the training set (the 2014 headlines), a weight W[rule_i] is assigned to each lexical rule, corresponding to the probability that a phrase extracted using that lexical rule is a valid disease name, as shown by Equation 2:

P(p ∈ D | p ∩ rule_i) = W[rule_i]    (2)

For example, rules involving 'diagnosed with' give better results than rules involving 'outbreak' (which also produces false positives with natural disasters), so a much higher weight is associated with phrases output via 'diagnosed with'. The weight assigned to each rule thus depends on the number of correctly recognized disease mentions and the total number of disease mentions extracted using it.

3. The probability of a phrase occurring with the lexical rule in question (p ∩ rule_i) depends on the frequency F[rule_i][p] with which the phrase occurs with that rule and on the size of our entire corpus (size), as shown by Equation 3:

P(p ∩ rule_i) = F[rule_i][p] / size    (3)

4. The final score, termed DES in Equation 4, is calculated by taking into account the above probabilities for each rule. The score is equivalent to the probability that a phrase detected in the initial pass is a disease:

DES[p] = Σ_rules W[rule_i] × F[rule_i][p]    (4)

FII increases accuracy and reduces false positives based on these probabilistic measures. Words with a higher DES have a higher chance of being a valid disease name, and phrases that occur with more rules are more reliable as disease names. For example, both Ebola and Tornado occur with 'outbreak', but Ebola also occurs with 'diagnosed with', 'died of' and several other rules, whereas Tornado does not. Thus FII increases belief in Ebola as a valid disease name.

===3.5 Post-processing===
The results of FII are sent for post-processing. First we normalize the results and then analyze them based on their temporal distribution. Valid disease names show distinct peaks, whereas non-disease entities show a uniform distribution over the specified time interval.
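The FII scoring of Equations 2 and 4 can be sketched in a few lines of Python. This is a minimal sketch: the function names are our own, and the rule counts and frequencies in the usage below are hypothetical, not figures from the paper.

```python
from collections import defaultdict

def learn_weights(train_counts):
    """Equation 2: W[rule] = correct extractions / total extractions
    observed for that lexical rule on the training (2014) headlines."""
    return {rule: correct / total
            for rule, (correct, total) in train_counts.items()}

def des_scores(freq, weights):
    """Equation 4: DES[p] = sum over rules of W[rule] * F[rule][p],
    where F[rule][p] counts how often phrase p co-occurs with the rule."""
    scores = defaultdict(float)
    for rule, phrase_counts in freq.items():
        w = weights.get(rule, 0.0)
        for phrase, count in phrase_counts.items():
            scores[phrase] += w * count
    return dict(scores)

# Hypothetical counts: 'diagnosed with' is more precise than 'outbreak'.
weights = learn_weights({'diagnosed with': (90, 100), 'outbreak': (40, 100)})
freq = {'diagnosed with': {'Ebola': 12},
        'outbreak': {'Ebola': 20, 'Tornado': 20}}
print(des_scores(freq, weights))
# Ebola outscores Tornado, since it co-occurs with both rules.
```

A phrase's DES thus grows both with how often it is extracted and with how trustworthy the rules extracting it are, which is exactly the Ebola-versus-Tornado argument above.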
===4 Experimental Results===
None of the 664163 headlines for the year 2014 and 408052 headlines for the year 2015 is annotated in any manner. To resolve this problem, we manually prepare a list of 95 disease names. Any headline that contains at least one of these diseases is then taken to be a disease-containing headline. In this manner, we obtain 1562 and 1884 headlines for the years 2014 and 2015 respectively that are considered to have a correct disease mention. Figure 3 shows the number of occurrences of the most frequent diseases in these headlines for the year 2014.

Figure 3: Number of occurrences of disease names in 2014.

A few of these headlines may actually use disease names from the prepared list in different contexts, for example 'fever'. Despite this, our system is able to handle such slight inconsistencies because of the FII phase.

The system is trained on headlines from 2014 to extract suitable SL constraints and to learn the weights of the lexical rules for the FII phase. When tested on all the headlines from 2015, the system extracted a total of 5058 correct disease mentions, a more than 2.5x gain over the number of disease-containing headlines for 2015. Besides this quantitative improvement, we also obtain many diseases, such as 'Ebola', 'Cancer' and 'Concussion', that were not in our input set of diseases, as shown in Table 3.

Table 3: Examples of newly extracted disease names (# instances in 2014 / 2015).
* Cancer: 1072 / 494
* Ebola: 182 / 990
* Concussion: 190 / 79
* Autism: 82 / 35
* Listeria: 18 / 13

Table 4 shows a sample of potential disease names extracted from headlines using the syntactic and lexical (SL) constraints.

Table 4: Diseases extracted from headlines using SL rules (headline - lexical rule - extracted name).
* "Healthcare worker in Scotland diagnosed with Ebola" - 'diagnosed with' - Ebola
* "Beyonce Reaches Out to Grieving Family of Teen Who died of Cancer" - 'died of' - Cancer
* "Bird flu outbreak in Kottayam, Alappuzha" - 'outbreak' - Bird flu
* "More details on Dan Uggla's concussion symptoms" - 'symptoms' - Concussion
* "IMF rules out special treatment for Greece" - 'treatment for' - Greece

The fifth headline is a sample false positive, where the SL constraints incorrectly label 'Greece' as a disease. This is handled once we pass the results through the FII phase, as illustrated in Table 5. The DES value of Greece is 0.32, much lower than the DES values of the other entities, so it can be filtered out using a threshold on the accepted DES value; this threshold can be chosen on the basis of accuracy requirements.

Table 5: Disease Expectancy Scores after FII.
* Ebola: 80.13
* Cancer: 14.66
* Bird Flu: 12.38
* Concussion: 1.86
* Greece: 0.32

Extracted disease names are sorted in descending order of their DES values. Our post-processing step, the temporal distribution analysis, improves the accuracy of the SL-FII system by a great extent, since uniformly occurring words like 'fight', 'water' and 'state' are filtered out. The difference in accuracy (% of correct disease mentions in the top-K unique predictions) on the test data (headlines of 2015) before and after the post-processing step can be observed in Figure 4.

Figure 4: Accuracy obtained for top-K unique instances.

We now compare our SL-FII approach with other techniques that can be used for the extraction of disease names.

Baseline I: the machine learning approach developed by [Chowdhury et al., 2010] for disease mention recognition in biomedical texts. This technique uses decision trees and extracts orthographic and linguistic features such as the PoS tag, suffixes and prefixes, whether a word begins with an upper-case letter, whether all its letters are upper-case, and more. 15000 instances were extracted from the news corpus in order to train the classifier. A 10-fold cross-validation accuracy of 74.89% was obtained using decision trees; other classifiers such as SVM (72.3%) gave lower accuracy on our data set.

Baseline II: a sequence approach based on the hypothesis that the context for disease names can be defined using the prefixes and suffixes that occur around them. This approach extracts all the unigrams and bigrams around disease names, i.e. all the one- and two-word prefixes and suffixes, and assigns them weights based on the probability of their occurrence with disease names. Some of the prefixes extracted by this approach were 'cured of', 'deaths from' and 'deadly'; similarly, 'outbreak rises', 'cases reported' and 'vaccine' were some of the extracted suffixes. Only mentions that score above a particular threshold value are considered disease mentions. Similarly, the probability of misclassification by each prefix/suffix with which a disease mention occurs is used to calculate a probability of error for each disease mention; if this error probability is below 5%, the disease mention is considered correct by the approach.

In our comparison, all the approaches are trained on headlines from 2014 and tested on headlines from 2015. Disease names extracted using all three approaches are sorted in order of the confidence values given by each approach. The extracted disease instances are further analyzed manually to estimate the top-K (unique) instance accuracies, as shown in Figure 5. Our method achieves around 88% accuracy in the top-50 instances.

Figure 5: Comparison of accuracies using the manual approach.

To compare our results with the other approaches automatically, we used the list of diseases provided by the Centers for Disease Control and Prevention (CDC, http://www.cdc.gov/diseasesconditions/), USA. The CDC identifies around 841 diseases, along with their conditions and variants. Figure 6 shows the automatic comparison of accuracies for the top-K (unique) disease instances extracted by each approach, using the disease names provided by the CDC.

Figure 6: Comparison of accuracies using the CDC's disease list.

From these experiments, we observe that our SL-FII technique performs considerably better than the other approaches. In the top-150 (unique) disease instances, our approach outperforms the best baseline by 40%, and in the top-50 by 22%.

===5 Conclusion===
Using syntactic and lexical constraints gives us potential disease names. We can then use the FII phase to weed out false disease names based on the experience acquired in recognizing disease mentions with different word roots. Even in the absence of annotations on the corpus, the initial list of diseases helps us simulate annotations. In the future, we plan to determine the minimum size of this initial list and the influence of removing some of the word roots on the accuracy.

===6 Acknowledgments===
We thank the anonymous reviewers for their valuable comments and suggestions for improving this paper. We would also like to thank Shubham Kumar Pandey, Paarth Neekhara, Vikash Kumar, Baviskar Hrishikesh Hari, Chandan Singha and Kavin Motlani for setting up the baseline experiments.

===References===
* [Agarwal and Searls, 2008] Pankaj Agarwal and David B. Searls. Literature mining in support of drug discovery. Briefings in Bioinformatics, 9(6):479-492, 2008.
* [Pradhan et al., 2014] Sameer Pradhan, Noémie Elhadad, Wendy Chapman, Suresh Manandhar, and Guergana Savova. SemEval-2014 Task 7: Analysis of clinical text. SemEval, 199(99):54, 2014.
* [Chowdhury et al., 2010] Mahbub Chowdhury, Md Faisal, et al. Disease mention recognition with specific features. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pages 83-90. Association for Computational Linguistics, 2010.
* [Torii et al., 2009] Manabu Torii, Zhangzhi Hu, Cathy H. Wu, and Hongfang Liu. BioTagger-GM: a gene/protein name recognition system. Journal of the American Medical Informatics Association, 16(2):247-255, 2009.
* [Dogan and Lu, 2012] Rezarta Islamaj Dogan and Zhiyong Lu. An inference method for disease name normalization. In Information Retrieval and Knowledge Discovery in Biomedical Text, Papers from the 2012 AAAI Fall Symposium, Arlington, Virginia, USA, November 2-4, 2012.
* [Uzuner et al., 2011] Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552-556, 2011.
* [Doan et al., 2014] Rezarta Islamaj Doan, Robert Leaman, and Zhiyong Lu. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1-10, 2014.
* [Jonnagaddala et al., 2015] Jitendra Jonnagaddala, Nai-Wen Chang, Toni Rose Jue, and Hong-Jie Dai. Recognition and normalization of disease mentions in PubMed abstracts. 2015.
* [Kim et al., 2009] Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, and Yoshinobu Kano. Overview of the BioNLP'09 shared task on event extraction. In Proceedings of the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop. Citeseer, 2009.
* [Leaman et al., 2008] Robert Leaman, Graciela Gonzalez, et al. BANNER: an executable survey of advances in biomedical named entity recognition. In Pacific Symposium on Biocomputing, volume 13, pages 652-663. Citeseer, 2008.
* [Lu et al., 2011] Zhiyong Lu, Hung-Yu Kao, Chih-Hsuan Wei, Minlie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hong-Jie Dai, Naoaki Okazaki, et al. The gene normalization task in BioCreative III. BMC Bioinformatics, 12(8):1, 2011.
* [Mazumder et al., 2014] Sahisnu Mazumder, Bazir Bishnoi, and Dhaval Patel. News headlines: What they can tell us? In Proceedings of the 6th IBM Collaborative Academia Research Exchange Conference (I-CARE 2014), pages 4:1-4:4, New York, NY, USA, 2014. ACM.
* [Naderi et al., 2011] Nona Naderi, Thomas Kappler, Christopher J. O. Baker, and René Witte. OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents. Bioinformatics, 27(19):2721-2729, 2011.
* [Névéol et al., 2009] Aurélie Névéol, Won Kim, W. John Wilbur, and Zhiyong Lu. Exploring two biomedical text genres for disease recognition. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, BioNLP '09, pages 144-152, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.