SL-FII: Syntactic and Lexical Constraints with Frequency-based Iterative Improvement for Disease Mention Recognition in News Headlines

Sidak Pal Singh, Sopan Khosla, Sajal Rustagi, Manisha Patel (Graduate students, IIT Roorkee)
Dhaval Patel (Faculty, IIT Roorkee)
(sidakuec, sajalume, manipubt)@iitr.ac.in, khoslasopan@gmail.com, patelfec@iitr.ac.in

Abstract

News headlines are a vital source of information for the masses. Identifying diseases that are being spread or discovered is important for taking the necessary steps towards their prevention and cure. Our system uses a syntactic and lexical constraint-based approach, followed by a frequency analysis phase, to extract meaningful disease names. In the task of top-150 (unique) disease mention recognition on the 2015 news headlines dataset, our approach shows a 40% gain in accuracy over other baseline approaches, illustrating its benefit.

1 Introduction

Disease Mention Recognition involves the extraction of disease names from a given text. News provides us access to current events and up-to-date information regarding varied fields. Rather than analyzing the entire news text across different sources, headlines are a quick and viable option for extracting useful knowledge.

Disease names found in news headlines can inform us about the kinds of diseases that are spreading or prevalent in different regions at various points of time. This has several advantages: taking adequate measures for the prevention and control of diseases, investing in research and development for their cures, predicting future epidemic outbreaks, etc. Correctly recognizing a disease mention is vital for improving disease-centric knowledge extraction tasks, like drug discovery [Agarwal and Searls, 2008].

We aim to discover patterns in the news headlines that contain diseases and use them to generalize over diseases that haven't been seen. We identify a set of significantly covering word roots that signal disease mentions and then extract the sentence structure using rule-based inference techniques. In other words, we use syntactic and lexical (SL) constraints to extract the disease names from headlines in an initial pass. The highlight of our approach is the Frequency-based Iterative Improvement (FII) that leads to more accurate results by weeding out the false positives.

We experimented on a total of 664163 headlines for the year 2014 and 408052 headlines for the year 2015, collected using the iMM (Indian news Media Monitoring) system described in [Mazumder et al., 2014]. Our SL-FII system¹ was able to extract a total of 3157 and 5058 correct occurrences of disease names for 2014 and 2015 respectively. In order to compare the performance of our method with baseline approaches, we use both manual analysis and an external knowledge source. Our system gives a 40% gain in accuracy over other baseline approaches in the task of top-150 (unique) disease mention recognition on the 2015 news headlines dataset.

¹ Code and data for our system can be found at https://github.com/sidak/Disease_Mention_Recognition

2 Related Work

One of the essential requirements for a text mining application is the ability to identify relevant entities. There has been an increasing amount of research in Biomedical Named Entity Recognition (BNER), which is the task of locating the boundaries of biomedical entities in a given corpus and tagging them with the corresponding semantic type (e.g. proteins, vitamins, viruses, etc.). With various events, such as i2b2 [Uzuner et al., 2011], and scientific challenges [Kim et al., 2009], BNER has seen huge development in recognizing mentions of genes [Lu et al., 2011], [Torii et al., 2009], organisms [Naderi et al., 2011] and diseases [Dogan and Lu, 2012].

Most of the research related to Biomedical Named Entity Recognition has focused on clinical texts [Pradhan et al., 2014], medical records and PubMed queries [Névéol et al., 2009], [Doan et al., 2014]. But to the best of our knowledge, extracting disease mentions from news headlines hasn't been significantly explored in the literature.

Most of the techniques used in these tasks are based on machine learning approaches such as Support Vector Machines (SVM) and Conditional Random Fields (CRF). Disease mention recognition by [Jonnagaddala et al., 2015] was performed using a CRF approach on PubMed abstracts. BANNER, developed by [Leaman et al., 2008], is also based on a CRF approach, using syntactic, lexical and orthographic features to recognize disease mentions. This work was further extended in the context of biomedical texts by [Chowdhury et al., 2010] through the use of contextual features in addition to the features extracted by BANNER.

In all these approaches, the corpus used (for example NCBI, [Doan et al., 2014]) has detailed annotations at both the mention and concept level.
But our headline dataset does not have any form of annotations. Such a scenario makes it difficult to apply supervised machine learning techniques. Further, due to differences in the structural patterns of news corpora and biomedical texts, the aforementioned approaches cannot be used effectively for disease mention recognition from a news corpus. Apart from this, an advantage of our approach is that it can generalize to entities belonging to domains as different as cricket, politics, etc. We illustrate this with an example in Section 3.3.

3 Proposed Solution

Our solution involves four basic stages: Pre-processing, Relational Extraction, Frequency-based Iterative Improvement and Post-processing. The architecture of our SL-FII system is shown in Figure 1.

Figure 1: The architecture of our SL-FII system.

3.1 Dataset

We use the iMM system [Mazumder et al., 2014] to collect news headlines for the years 2014 and 2015. We make use of a manually prepared list of 95 diseases to simulate annotations (see Section 4) and then use it to extract word roots that cover a significant portion of disease-containing headlines. Some of the headlines collected by the iMM system that are used later in this paper to explain our technique are listed in Table 1.

Sample Headlines
Healthcare worker in Scotland diagnosed with Ebola
Beyonce Reaches Out to Grieving Family of Teen Who Died of Cancer
Bird flu outbreak in Kottayam, Alappuzha
More details on Dan Uggla's concussion symptoms
IMF rules out special treatment for Greece

Table 1: Sample headlines collected by the iMM system

3.2 Pre-processing

The first step of pre-processing is to remove apostrophe inconsistencies from the corpus (headlines). After this, we use NLTK² to tokenize the headlines and then tag the produced tokens with parts of speech (i.e. PoS tagging).

² http://www.nltk.org/ Natural Language Toolkit (NLTK v3.1)

News headlines may also contain grammatical inconsistencies, for example capitalizing the first letter of every word, punctuation mistakes, or missing articles. In such cases, PoS taggers might tag certain words incorrectly, as the following example shows. Consider the headline,

"India Seeks Revenge From Australia"

PoS tags: [('India', 'NNP'), ('Seeks', 'NNP'), ('Revenge', 'NNP'), ('From', 'NNP'), ('Australia', 'NNP')]

(The symbols for the PoS tags and their corresponding descriptions are shown in Table 2.)

Symbol   Description
DT       determiner
JJ       adjective
NN       noun, singular
NNS      noun, plural
NNP      proper noun, singular
NNPS     proper noun, plural
RB       adverb
VB       verb, base form
VBZ      verb, 3rd person singular present
CD       cardinal number

Table 2: Notation used for PoS tags.

To handle such inconsistencies, we use the following approach:

1. Convert the headline to lower-case and then compare the respective PoS tags of its tokens with those of the original sentence.

2. If a PoS tag differs, use the lower-case form. Otherwise, use the original one.

Thus, the example headline is compared to "india seeks revenge from australia"

PoS tags: [('india', 'NN'), ('seeks', 'VBZ'), ('revenge', 'NN'), ('from', 'IN'), ('australia', 'NN')]

On converting to lower-case, the PoS tag of 'India' still semantically portrays a noun, whereas the PoS tag of 'Seeks' changes from noun to verb (NNP to VBZ). So the example headline finally gets converted to "India seeks Revenge from Australia"

PoS tagged: [('India', 'NNP'), ('seeks', 'VBZ'), ('Revenge', 'NNP'), ('from', 'IN'), ('Australia', 'NNP')]
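This pre-processing step can be sketched in a few lines of Python. The snippet below is a minimal illustration only, assuming NLTK's default tokenizer and PoS tagger; the coarse-category comparison mirrors the worked example above, and all function and variable names are ours rather than part of a released implementation.

    import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

    def coarse(tag):
        """Map a Penn Treebank tag to a coarse category (noun, verb, ...)."""
        for prefix in ("NN", "VB", "JJ", "RB"):
            if tag.startswith(prefix):
                return prefix
        return tag

    def normalize_case(headline):
        """Keep a token's original casing only if lower-casing it leaves
        its coarse PoS category unchanged (the Section 3.2 heuristic)."""
        original = nltk.pos_tag(nltk.word_tokenize(headline))
        lowered = nltk.pos_tag(nltk.word_tokenize(headline.lower()))
        fixed = []
        for (tok, tag), (low_tok, low_tag) in zip(original, lowered):
            # 'India' (NNP -> NN) stays a noun, so the original form is kept;
            # 'Seeks' (NNP -> VBZ) switches category, so lower-case wins.
            fixed.append(tok if coarse(tag) == coarse(low_tag) else low_tok)
        return " ".join(fixed)

    print(normalize_case("India Seeks Revenge From Australia"))
    # -> "India seeks Revenge from Australia", given the tags in the example above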
3.3 Relational Extraction

In this section, we introduce two types of constraints, namely lexical and syntactic. These constraints help us to discover and extract disease name - word root relations. We also discuss how to generalize this idea to entities from different domains.

Lexical Constraints

News headlines contain multiple entities. Our task is to identify the correct set of entities that correspond to disease names. In other words, we need to formulate a context that signals the occurrence of disease names.

In order to define this context or neighbourhood, we extract certain word roots that indicate the presence of disease names in headlines (with high confidence). The word roots are obtained by analyzing the headline data and the initial list of disease names. Based on how well they cover the disease-containing headlines, both quantitatively and qualitatively, we select a certain subset of these word roots. From our experiments, we find that the following specific set of 10 word roots covers a significant portion of disease-containing headlines:

• 'diagnos' derivatives
• 'outbreak' derivatives
• 'cur' derivatives
• 'vaccin' derivatives
• 'die' derivatives
• 'battling' derivatives
• 'symptom' derivatives
• 'treatment' derivatives
• 'virus' derivatives
• 'hospital' derivatives

Note that derivatives here means the inflected forms of the keyword, together with certain prepositions. For example, consider the derivatives of 'diagnos' (compiled into a concrete regular expression in the sketch below):

[mis]diagnos(e | es | ed | is | tic) [with | for | of | by]

Apart from the above list, word roots like 'drug', 'patient', 'therapy' and more were also identified. Considering these word roots, however, leads to only a marginal improvement in the number of identified disease mentions. An intuitive justification is that such word roots are quite often used along with entities other than disease names; in several cases, they identify more false positives than true positives. Thus, restricting ourselves to this small list doesn't lead to any significant loss, and at the same time speeds up the system.
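As an illustration, the 'diagnos' derivative pattern above can be compiled into a single regular expression. The following is a minimal sketch under our reading of that pattern; the exact rule set used by SL-FII may differ, and the names are ours.

    import re

    # One possible encoding of '[mis]diagnos(e|es|ed|is|tic) [with|for|of|by]':
    # an optional 'mis' prefix, an inflected ending, and an optional
    # trailing preposition.
    DIAGNOS_RULE = re.compile(
        r"\b(mis)?diagnos(e|es|ed|is|tic)\b(\s+(with|for|of|by)\b)?",
        re.IGNORECASE,
    )

    headlines = [
        "Healthcare worker in Scotland diagnosed with Ebola",
        "IMF rules out special treatment for Greece",
    ]
    # Only headlines matching a word-root rule survive the lexical pass.
    matches = [h for h in headlines if DIAGNOS_RULE.search(h)]
    print(matches)  # -> ['Healthcare worker in Scotland diagnosed with Ebola']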
Syntactic Constraints

The headlines containing inflected forms of the different word roots are extracted using the lexical constraints. Using the PoS tags of the obtained headlines, we then develop syntactic constraints/rules to capture the position at which disease names occur in relation to the word roots identified above. Besides eliminating incoherent extractions, syntactic constraints also reduce uninformative extractions by capturing relation phrases that are expressed only by certain combinations.

Figure 2 shows the syntactic constraints developed for the inflected form 'diagnoses' of the word root 'diagnos'; they are described in detail below.

• In the headline "Stool test diagnoses bowel disease", the inflected form 'diagnoses' of word root 'diagnos' is used as a 3rd person verb (VBZ). The disease mention 'bowel disease', extracted as a pair of singular nouns (NN NN), occurs to the right of 'diagnoses'.

• In another headline, "Autism diagnoses surge by 30 percent in kids", the inflected form 'diagnoses' of word root 'diagnos' is used as a plural noun (NNS). The disease mention 'Autism', extracted as a noun (NN), occurs to the left of 'diagnoses'.

Figure 2: Rules for the inflected form 'diagnoses' of word root 'diagnos'.

Syntactic constraints for other word roots are developed in a similar manner. Disease mentions in news headlines generally occur around the word roots, matching the regular expression given below:

E = [DT] (JJ)* (NN | NNP | NNS | NNPS | CD)+

In other words, disease mentions are phrases that contain an optional determiner or article (e.g. a, an), followed by any number of optional adjectives (e.g. fractured) and at least one noun (e.g. elbow, non-Hodgkins lymphoma), with a maximum length of four.

Since disease names basically represent a kind of entity, these syntactic constraints extract the potential disease name phrases from news headlines. For obvious reasons, we omit the determiner or article (represented by the PoS tag DT) in order to get the disease names. Below is an example headline depicting the application of these constraints; the extracted disease mention is 'non-Hodgkins lymphoma'.

Former Butler forward Andrew Smith diagnosed with non-Hodgkins lymphoma.
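A minimal sketch of this extraction step, pairing the pattern E with NLTK's chunk parser, is given below. The grammar string and the four-token cap follow the description above; the helper names are ours, and the exact output depends on the tagger used.

    import nltk

    # The pattern E: an optional determiner, any number of adjectives,
    # and at least one noun or cardinal number.
    GRAMMAR = "MENTION: {<DT>?<JJ>*<NN|NNP|NNS|NNPS|CD>+}"
    chunker = nltk.RegexpParser(GRAMMAR)

    def candidate_mentions(headline):
        """Return candidate phrases matching pattern E, dropping the
        determiner and keeping phrases of at most four tokens."""
        tagged = nltk.pos_tag(nltk.word_tokenize(headline))
        tree = chunker.parse(tagged)
        phrases = []
        for subtree in tree.subtrees(filter=lambda t: t.label() == "MENTION"):
            words = [w for w, tag in subtree.leaves() if tag != "DT"]
            if 0 < len(words) <= 4:
                phrases.append(" ".join(words))
        return phrases

    print(candidate_mentions("Stool test diagnoses bowel disease"))
    # e.g. ['Stool test', 'bowel disease']; the Figure 2 position rules
    # then keep only the phrase to the right of 'diagnoses'.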
Generalization

Since a basic entity is represented by the regular expression discussed earlier, our SL-FII approach can be generalized to entities of other domains. This is possible via a simple modification of the lexical constraints as per the context requirements. For example, in order to identify the names of cricketers, we can use word roots like:

• 'wicket haul' derivatives
• 'ton' derivatives

In the headlines below, we can successfully extract the cricketer's name using our SL-FII model.

Gayle's 47-ball ton wipes out England
Ashwin's 5 wicket haul takes India to the semis

Generalization of this approach to other languages mainly depends on the quality of the standard NLP pre-processing tools, such as the PoS tagger and stemmer, available for the required language, while the approach itself remains pretty much the same.

3.4 Frequency-based Iterative Improvement (FII)

The main motivation is to use the experience gained in recognizing disease mentions with different word roots to weed out incorrect disease names. The constraints formulated above are used to extract potential disease names from the corpus in an initial pass, and these are then passed through the FII phase. In order to filter out false positives, we assign Disease Expectancy Score (DES) values, based on the probability of a word being a disease name, to the outputs of the initial pass. The idea behind our FII phase can be easily understood with the following examples and Equations 1 to 4.

• Mentions identified along with the word root 'battling' can come as 'battling insecurity', 'battling cancer', 'battling lust', but for the word root 'treatment' most of the occurrences are of the form 'Ebola treatment', 'treatment of cancer'. So a disease mention extracted using the word root 'treatment' is more probable to be an instance of a disease name than a mention extracted using the word root 'battling'.

• A disease mention such as 'Cancer', which can be used along with multiple word roots ('treatment', 'battling' and 'cure'), is more probable to be an instance of a disease name than a mention like 'seasonal' that is extracted using only the single word root 'outbreak'.

Formally, the score is computed as follows:

1. The probability of a phrase p being an instance of a disease name (p ∈ D) depends on the probability of the phrase being used with the different lexical rules (p ∩ rule_i), as defined in Equation 1.

   P(p ∈ D) = Σ_rules P(p ∈ D | p ∩ rule_i) × P(p ∩ rule_i)   (1)

2. Based on the training set (2014 headlines), a weight W[rule_i] is assigned to each lexical rule; it corresponds to the probability that a phrase extracted using that lexical rule (p ∩ rule_i) is a valid disease name, as shown by Equation 2.

   P(p ∈ D | p ∩ rule_i) = W[rule_i]   (2)

   For example, rules involving 'diagnosed with' give better results than rules involving 'outbreak' (which also yields false positives with natural disasters). Thus a much higher weight is associated with phrases output via 'diagnosed with'. The weight assigned to each rule thus depends on the number of correctly recognized disease mentions and the total number of disease mentions extracted using it.

3. The probability of a phrase occurring with the lexical rule in consideration (p ∩ rule_i) depends on the frequency F[rule_i][p] of the phrase occurring with that lexical rule and the size of our entire corpus (size), as shown by Equation 3.

   P(p ∩ rule_i) = F[rule_i][p] / size   (3)

4. The final score, termed DES in Equation 4, is calculated by taking into account the above probabilities for each rule. Substituting Equations 2 and 3 into Equation 1, the score is proportional to the probability that a phrase detected in the initial pass is a disease; the constant factor 1/size is dropped since it does not affect the ranking.

   DES[p] = Σ_rules W[rule_i] × F[rule_i][p]   (4)

FII increases accuracy and reduces false positives based on these probabilistic measures. Words with a higher DES have a higher chance of being a valid disease name, and phrases which occur with more rules are more reliable as disease names. For example, both Ebola and Tornado occur with 'outbreak', but Ebola also occurs with 'diagnosed with', 'died of' and several other rules, whereas Tornado does not. Thus, FII increases the belief in Ebola being a valid disease name.
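Equations 1 to 4 boil down to a weighted co-occurrence count. The sketch below is a toy illustration: the rule weights and frequencies are invented for the example, not the values learned from the 2014 headlines.

    # Toy illustration of the DES computation (Equation 4).
    # W[rule] and F[rule][phrase] are invented here; in SL-FII they
    # are learned from the 2014 training headlines.
    W = {"diagnosed with": 0.9, "outbreak": 0.4, "died of": 0.8}
    F = {
        "diagnosed with": {"Ebola": 40, "Cancer": 10},
        "outbreak":       {"Ebola": 55, "Tornado": 30},
        "died of":        {"Ebola": 12, "Cancer": 25},
    }

    def des(phrase):
        """DES[p] = sum over rules of W[rule] * F[rule][p]."""
        return sum(W[r] * F[r].get(phrase, 0) for r in W)

    for p in ("Ebola", "Tornado"):
        print(p, des(p))
    # Ebola scores 0.9*40 + 0.4*55 + 0.8*12 = 67.6, while Tornado,
    # seen only with 'outbreak', scores 0.4*30 = 12.0 and can be
    # filtered out by a DES threshold.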
3.5 Post-processing

The results of FII are sent for post-processing. First we normalize the results, and then analyze each of them based on its temporal distribution. Valid disease names show distinct peaks, whereas non-disease entities show a uniform distribution over the specified time interval.
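One simple way to operationalize this peakedness test is to compare a term's maximum periodic count against its mean. The sketch below uses that heuristic; the peak-to-mean ratio and its threshold are our own illustrative choices, since the paper only requires distinguishing peaks from a near-uniform distribution.

    # Illustrative peakedness heuristic for the temporal-distribution check.
    # monthly_counts: how often a term appears in headlines, per month.
    def looks_bursty(monthly_counts, ratio=3.0):
        mean = sum(monthly_counts) / len(monthly_counts)
        return mean > 0 and max(monthly_counts) >= ratio * mean

    ebola = [2, 1, 3, 2, 40, 55, 20, 4, 2, 1, 1, 2]   # sharp outbreak peak
    state = [9, 8, 10, 9, 9, 10, 8, 9, 10, 9, 8, 9]   # near-uniform
    print(looks_bursty(ebola), looks_bursty(state))   # True False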
4 Experimental Results

Across all 664163 headlines for the year 2014 and 408052 headlines for the year 2015, none is annotated in any manner. To resolve this problem, we manually prepare a list of 95 disease names. Any headline which contains at least one of these diseases is then taken to be a disease-containing headline. In this manner, we obtain 1562 and 1884 headlines for the years 2014 and 2015 respectively that are considered to have a correct disease mention. Figure 3 shows the number of occurrences of the most frequently occurring diseases in these headlines for the year 2014.

Figure 3: Number of occurrences of disease names in 2014

A few of these headlines may actually contain disease names (from the prepared list) used in different contexts, for example, 'fever'. Despite this, our system is able to handle such slight inconsistencies because of the FII phase.

The system is trained on headlines from 2014 to extract suitable SL constraints and learn the weights of the lexical rules for the FII phase. When tested on all the headlines from 2015, the system was able to extract a total of 5058 correct disease mentions. This is more than a 2.5x gain over the number of disease-containing headlines for 2015.

Besides this quantitative improvement, we also obtain many diseases like 'Ebola', 'Cancer', 'Concussion' etc., which were not in our input set of diseases, as shown in Table 3.

Disease Name   # instances in 2014   # instances in 2015
Cancer         1072                  494
Ebola          182                   990
Concussion     190                   79
Autism         82                    35
Listeria       18                    13

Table 3: Examples of newly extracted disease names

Table 4 shows a sample of potential disease names extracted from headlines using the syntactic and lexical (SL) constraints. The 5th headline is a sample false positive, where the SL constraints incorrectly label 'Greece' as a disease. But this is handled once we pass it through the FII phase, as illustrated in Table 5. The DES value of Greece is 0.32, which is much less than the DES values of the other entities, so it can be filtered out as per the threshold accepted DES value. This threshold can be decided on the basis of accuracy requirements.

Headline                                                        Lexical rule     Disease name extracted
Healthcare worker in Scotland diagnosed with Ebola              diagnosed with   Ebola
Beyonce Reaches Out to Grieving Family of Teen Who died of      died of          Cancer
Cancer
Bird flu outbreak in Kottayam, Alappuzha                        outbreak         Bird flu
More details on Dan Uggla's concussion symptoms                 symptoms         Concussion
IMF rules out special treatment for Greece                      treatment for    Greece

Table 4: Diseases extracted from headlines using SL rules

Disease       DES
Ebola         80.13
Cancer        14.66
Bird Flu      12.38
Concussion    1.86
Greece        0.32

Table 5: Disease Expectancy Scores after FII

Extracted disease names are sorted in descending order of their DES values. Due to our post-processing step, i.e. Temporal Distribution Analysis, the accuracy of our SL-FII system improves to a great extent, since uniformly occurring words like 'fight', 'water', 'state' etc. are filtered out. The difference in accuracy (% of correct disease mentions in the top-K unique predictions) on the test data (headlines of 2015) before and after the post-processing step can be observed in Figure 4.

Figure 4: Accuracy obtained for top-K unique instances

Now, we compare our SL-FII approach to other techniques that can be used for the extraction of disease names.

Baseline I: A machine learning approach developed by [Chowdhury et al., 2010] for disease mention recognition in biomedical texts. This technique makes use of decision trees and extracts orthographic and linguistic features such as the PoS tag, suffixes and prefixes, whether the word begins with an upper-case letter, whether all its letters are upper-case, and more. 15,000 instances were extracted from the news corpus in order to train the classifier, and a 10-fold cross-validation accuracy of 74.89% was obtained using decision trees for classification. Other classifiers, such as SVM (72.3%), gave lower accuracy than decision trees over our dataset.
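For concreteness, the orthographic features described for Baseline I might look like the sketch below; the exact feature set of [Chowdhury et al., 2010] is richer, and these names are ours.

    # Sketch of per-token orthographic/linguistic features for Baseline I.
    def token_features(token, pos_tag):
        return {
            "pos": pos_tag,                   # PoS tag of the token
            "prefix3": token[:3],             # short prefix
            "suffix3": token[-3:],            # short suffix
            "init_cap": token[:1].isupper(),  # begins with upper-case letter
            "all_caps": token.isupper(),      # all letters upper-case
        }

    print(token_features("Ebola", "NNP"))
    # These feature dicts are then fed to a decision-tree classifier
    # (e.g. scikit-learn's DecisionTreeClassifier after vectorization).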
Baseline II: A sequence approach based on the hypothesis that the context for disease names can be defined using the prefixes and suffixes that occur around them. This approach extracts all the unigrams and bigrams that occur around disease names, i.e. all the one/two-word prefixes and suffixes. These prefixes and suffixes are assigned weights on the basis of the probability of their occurrence with disease names. Some of the prefixes extracted by this approach were 'cured of', 'deaths from' and 'deadly'; similarly, 'outbreak rises', 'cases reported' and 'vaccine' were some of the extracted suffixes. Only mentions that have a score above a particular threshold value are considered as disease mentions. Similarly, the probability of misclassification by each prefix/suffix with which a disease mention occurs is used to calculate the probability of error for each disease mention. If this error probability is below 5%, the disease mention is considered to be correct by the approach.
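Under our interpretation of this description, the weighting scheme amounts to normalized counts of context n-grams around known disease names. A toy sketch, with invented data:

    from collections import Counter

    # Toy corpus: (headline tokens, index of the known disease name).
    contexts = [
        (["deadly", "Ebola"], 1),
        (["cured", "of", "Cancer"], 2),
        (["Ebola", "outbreak", "rises"], 0),
    ]
    prefix_counts, suffix_counts = Counter(), Counter()
    for tokens, i in contexts:
        if i > 0:
            prefix_counts[tokens[i - 1]] += 1   # one-word prefix
        if i + 1 < len(tokens):
            suffix_counts[tokens[i + 1]] += 1   # one-word suffix

    total = sum(prefix_counts.values()) + sum(suffix_counts.values())
    weights = {g: c / total for g, c in (prefix_counts + suffix_counts).items()}
    print(weights)  # each context word gets weight 1/3 in this toy example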
In our comparison, all the approaches are trained using headlines from 2014 and tested on headlines from 2015. Disease names extracted using all three approaches are then sorted in order of the confidence values given by the approaches. Disease instances/names extracted using each approach are further analyzed manually to estimate the top-K (unique) instance accuracies, as shown in Figure 5. Our method achieves around 88% accuracy in the top-50 instances.

Figure 5: Comparison of accuracies using the manual approach

To automatically compare our results with the other approaches, we used the list of diseases provided by the Centers for Disease Control and Prevention (CDC)³, USA. The CDC identifies around 841 diseases, along with their conditions and variants. Figure 6 shows the automatic comparison of the accuracies for the top-K (unique) disease instances extracted by each approach, using the disease names provided by the CDC.

³ http://www.cdc.gov/diseasesconditions/

Figure 6: Comparison of accuracies using CDC's disease list

Using these experiments, we observe that our SL-FII technique performs considerably better than the other approaches. In the top-150 (unique) disease instances, our approach outperforms the best baseline by 40%, and similarly in the top-50 by 22%.

5 Conclusion

Using syntactic and lexical constraints gives us potential disease names. We can then use the FII phase to weed out false disease names based on the experience acquired in recognizing disease mentions with different word roots. Even in the absence of annotations on the corpus, the initial list of diseases helps us to simulate annotations. In the future, we plan to find the minimum size of this initial list and to study the influence of removing some of the word roots on the accuracy.

6 Acknowledgments

We thank the anonymous reviewers for their valuable comments and suggestions to improve this paper. We would also like to thank Shubham Kumar Pandey, Paarth Neekhara, Vikash Kumar, Baviskar Hrishikesh Hari, Chandan Singha and Kavin Motlani for setting up the baseline experiments.
References

[Agarwal and Searls, 2008] Pankaj Agarwal and David B. Searls. Literature mining in support of drug discovery. Briefings in Bioinformatics, 9(6):479-492, 2008.

[Chowdhury et al., 2010] Mahbub Chowdhury, Md Faisal, et al. Disease mention recognition with specific features. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pages 83-90. Association for Computational Linguistics, 2010.

[Dogan and Lu, 2012] Rezarta Islamaj Dogan and Zhiyong Lu. An inference method for disease name normalization. In Information Retrieval and Knowledge Discovery in Biomedical Text, Papers from the 2012 AAAI Fall Symposium, Arlington, Virginia, USA, November 2-4, 2012.

[Doan et al., 2014] Rezarta Islamaj Doan, Robert Leaman, and Zhiyong Lu. NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1-10, 2014.

[Jonnagaddala et al., 2015] Jitendra Jonnagaddala, Nai-Wen Chang, Toni Rose Jue, and Hong-Jie Dai. Recognition and normalization of disease mentions in PubMed abstracts. 2015.

[Kim et al., 2009] Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, and Yoshinobu Kano. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop. Citeseer, 2009.

[Leaman et al., 2008] Robert Leaman, Graciela Gonzalez, et al. BANNER: an executable survey of advances in biomedical named entity recognition. In Pacific Symposium on Biocomputing, volume 13, pages 652-663. Citeseer, 2008.

[Lu et al., 2011] Zhiyong Lu, Hung-Yu Kao, Chih-Hsuan Wei, Minlie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hong-Jie Dai, Naoaki Okazaki, et al. The gene normalization task in BioCreative III. BMC Bioinformatics, 12(8):1, 2011.

[Mazumder et al., 2014] Sahisnu Mazumder, Bazir Bishnoi, and Dhaval Patel. News headlines: What they can tell us? In Proceedings of the 6th IBM Collaborative Academia Research Exchange Conference (I-CARE 2014), pages 4:1-4:4, New York, NY, USA, 2014. ACM.

[Naderi et al., 2011] Nona Naderi, Thomas Kappler, Christopher J. O. Baker, and René Witte. OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents. Bioinformatics, 27(19):2721-2729, 2011.

[Névéol et al., 2009] Aurélie Névéol, Won Kim, W. John Wilbur, and Zhiyong Lu. Exploring two biomedical text genres for disease recognition. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, BioNLP '09, pages 144-152, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[Pradhan et al., 2014] Sameer Pradhan, Noémie Elhadad, Wendy Chapman, Suresh Manandhar, and Guergana Savova. SemEval-2014 task 7: Analysis of clinical text. SemEval, 199(99):54, 2014.

[Torii et al., 2009] Manabu Torii, Zhangzhi Hu, Cathy H. Wu, and Hongfang Liu. BioTagger-GM: A gene/protein name recognition system. Journal of the American Medical Informatics Association, 16(2):247-255, 2009.

[Uzuner et al., 2011] Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552-556, 2011.