<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SL-FII: Syntactic and Lexical Constraints with Frequency based Iterative Improvement for Disease Mention Recognition in News Headlines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sidak Pal Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sopan Khosla</string-name>
          <email>khoslasopan@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sajal Rustagi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manisha Patel</string-name>
          <role>Graduate student</role>
          <email>manipubt@iitr.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dhaval Patel</string-name>
          <role>Faculty</role>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIT Roorkee</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>News headlines are a vital source of information for the masses. Identifying diseases that are spreading or newly discovered is important for taking the necessary steps towards their prevention and cure. Our system uses a syntactic and lexical constraint-based approach, followed by a frequency analysis phase, to extract meaningful disease names. In the task of top-150 (unique) disease mention recognition on the 2015 news headlines dataset, our approach shows a 40% gain in accuracy over baseline approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Disease Mention Recognition involves the extraction of disease
names from a given text. News gives us access to
current events and up-to-date information across varied fields.
Rather than analyzing the entire news text across different
sources, headlines are a quick and viable option for extracting
useful knowledge.</p>
      <p>Disease names found in news headlines can inform us
about the kinds of diseases that are spreading or are
prevalent in different regions at various points of time. This
has several advantages: taking adequate measures for the
prevention and control of diseases, investing in research and
development for their cures, predicting future epidemic
outbreaks, etc. Correctly recognizing a disease mention is
vital for improving disease-centric knowledge extraction
tasks, like drug discovery [Agarwal and Searls, 2008].</p>
      <p>We aim to discover patterns in the news headlines that
contain diseases and use them to generalize to diseases that
have not been seen. We identify a set of word roots that signal
disease mentions and cover a significant share of such headlines, and then extract the
sentence structure using rule-based inference techniques. In
other words, we use syntactic and lexical (SL) constraints to
extract the disease names from headlines in an initial pass.
The highlight of our approach is the Frequency-based
Iterative Improvement (FII), which leads to more accurate results by
weeding out the false positives.</p>
      <p>We experimented on a total of 664163 headlines for the year
2014 and 408052 headlines for the year 2015, collected using
the iMM (Indian news Media Monitoring) system described
in [Mazumder et al., 2014]. Our SL-FII<sup>1</sup> system was
able to extract a total of 3157 and 5058 correct occurrences
of disease names for 2014 and 2015 respectively. In order
to compare the performance of our method with baseline
approaches, we use both manual analysis and an external
knowledge source. Our system gives a 40% gain in accuracy
over the baseline approaches in the task of
top-150 (unique) disease mention recognition on the 2015 news
headlines dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>One of the essential requirements for a text mining
application is the ability to identify relevant entities. There has been
an increasing amount of research in Biomedical Named
Entity Recognition (BNER), which is the task of locating the
boundaries of biomedical entities in a given corpus and
tagging them with the corresponding semantic type (e.g. proteins,
vitamins, viruses etc.). With events such as i2b2 [Uzuner
et al., 2011] and scientific challenges [Kim et al., 2009],
BNER has seen major progress in recognizing
mentions of genes [Lu et al., 2011], [Torii et al., 2009], organisms
[Naderi et al., 2011] and diseases [Dogan and Lu, 2012].</p>
      <p>Most of the research related to Biomedical Named Entity
Recognition has focused on clinical texts [Pradhan et
al., 2014], medical records and PubMed queries [Névéol et
al., 2009], [Doan et al., 2014]. To the best of our knowledge,
extracting disease mentions from news headlines
has not been significantly explored in the literature.</p>
      <p>Most of the techniques that have been used in these tasks
are based on machine learning approaches such as Support
Vector Machines (SVM) and Conditional Random Fields
(CRF). Disease mention recognition by [Jonnagaddala et al.,
2015] was performed using a CRF approach on PubMed
abstracts. BANNER, developed by [Leaman et al., 2008], is
also based on a CRF approach and uses syntactic, lexical and
orthographic features to recognize disease mentions.
This work was further extended in the context of biomedical
texts by [Chowdhury et al., 2010], who use contextual
features in addition to the features extracted by BANNER.</p>
      <p>
        In all these approaches, the corpus used
        <xref ref-type="bibr" rid="ref12 ref4 ref9">(for example
NCBI, [Doan et al., 2014])</xref>
        has detailed annotations at both the
mention and concept level. Our headline dataset, however, does
not have annotations of any form, which makes it
difficult to apply supervised machine learning techniques.
      </p>
      <p><sup>1</sup> Code and data for our system can be found at https://github.com/sidak/Disease_Mention_Recognition</p>
      <p>Further, due to differences in the structural patterns of news
corpora and biomedical texts, the aforementioned approaches
cannot be used effectively for disease mention recognition in a
news corpus. Apart from this, an advantage of our approach
is that it can generalize to entities belonging to domains as
different as cricket and politics. We illustrate this with an
example in Section 3.3.</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Solution</title>
      <p>Our solution involves four basic stages: Pre-processing,
Relational Extraction, Frequency-based Iterative Improvement
and Post-processing. The architecture of our SL-FII
system is shown in Figure 1.</p>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>We use the iMM system [Mazumder et al., 2014] to collect
news headlines for the years 2014 and 2015. We make use of a
manually prepared list of 95 diseases to simulate annotations
(see Section 4) and then use it to extract word roots that cover
a significant portion of disease-containing headlines. Some
of the headlines collected by the iMM system, which are
used in this paper to explain our technique,
are listed in Table 1.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Sample Headlines</title>
        <table-wrap>
          <label>Table 1</label>
          <caption><p>Sample headlines</p></caption>
          <table>
            <tbody>
              <tr><td>Healthcare worker in Scotland diagnosed with Ebola</td></tr>
              <tr><td>Beyonce Reaches Out to Grieving Family of Teen Who Died of Cancer</td></tr>
              <tr><td>Bird flu outbreak in Kottayam, Alappuzha</td></tr>
              <tr><td>More details on Dan Uggla’s concussion symptoms</td></tr>
              <tr><td>IMF rules out special treatment for Greece</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-2">
          <title>Pre-processing</title>
          <p>The first step of pre-processing is to remove apostrophe
inconsistencies from the corpus (headlines). After this we
use NLTK<sup>2</sup> to tokenize the headlines and then tag the
produced tokens with parts of speech (i.e. PoS tagging).
News headlines may also contain grammatical
inconsistencies, for example capitalization of the first letter of every
word, punctuation mistakes, or missing articles. In
such cases, PoS taggers might tag certain words incorrectly,
as the following example shows. Consider the headline
“India Seeks Revenge From Australia”</p>
          <p>POS tags: [(’India’, ’NNP’), (’Seeks’,
’NNP’), (’Revenge’, ’NNP’), (’From’, ’NNP’),
(’Australia’, ’NNP’)]</p>
          <p><sup>2</sup> Natural Language Toolkit (NLTK v3.1), http://www.nltk.org/</p>
          <p>(The symbols for the PoS tags and their corresponding
descriptions are shown in Table 2.)</p>
          <p>To handle such inconsistencies we use the following
approach:</p>
          <list list-type="order">
            <list-item><p>Convert headlines to lower-case and then compare the
respective PoS tags of the tokens with those of the original
sentence.</p></list-item>
            <list-item><p>If the PoS tag differs, use the lower-case form. Otherwise, use
the original one.</p></list-item>
          </list>
          <p>Thus, the example headline is compared to “india seeks
revenge from australia”</p>
          <p>PoS tags: [(’india’, ’NN’), (’seeks’,
’VBZ’), (’revenge’, ’NN’), (’from’, ’IN’),
(’australia’, ’NN’)]</p>
          <p>On converting to lower-case, the PoS tag of ‘India’ still
denotes a noun (NNP to NN), whereas the PoS tag of ‘Seeks’
changes from noun to verb (NNP to VBZ) and that of ‘From’
changes from noun to preposition (NNP to IN). So the example
headline finally gets converted to “India seeks Revenge from
Australia”</p>
          <p>PoS tags: [(’India’, ’NNP’), (’seeks’,
’VBZ’), (’Revenge’, ’NNP’), (’from’, ’IN’),
(’Australia’, ’NNP’)]</p>
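          <p>The case-reconciliation step above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the tagger is passed in as a function, the toy tagger simply replays the tags quoted in the example, and treating two tags as "the same" when their first two letters agree (so that NNP and NN both count as nouns) is our assumption about how the comparison is made.</p>
          <preformat><![CDATA[
```python
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]


def same_class(tag_a: str, tag_b: str) -> bool:
    # Assumption: compare coarse classes via the first two tag letters,
    # so NNP vs NN is "still a noun" while NNP vs VBZ is a real change.
    return tag_a[:2] == tag_b[:2]


def reconcile_case(tokens: List[str],
                   tagger: Callable[[List[str]], Tagged]) -> List[str]:
    """Keep a token as written unless lower-casing changes its PoS class."""
    original = tagger(tokens)
    lowered = tagger([t.lower() for t in tokens])
    result = []
    for (tok, tag), (low_tok, low_tag) in zip(original, lowered):
        result.append(tok if same_class(tag, low_tag) else low_tok)
    return result


# Toy tagger (hypothetical): replays the tags quoted in the example above.
PAPER_TAGS = {
    "India": "NNP", "Seeks": "NNP", "Revenge": "NNP",
    "From": "NNP", "Australia": "NNP",
    "india": "NN", "seeks": "VBZ", "revenge": "NN",
    "from": "IN", "australia": "NN",
}


def toy_tagger(tokens: List[str]) -> Tagged:
    return [(t, PAPER_TAGS[t]) for t in tokens]


headline = ["India", "Seeks", "Revenge", "From", "Australia"]
print(reconcile_case(headline, toy_tagger))
```
]]></preformat>
          <p>In practice a real tagger such as <monospace>nltk.pos_tag</monospace> could be passed in place of the toy tagger, provided its model data is installed.</p>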
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Relational Extraction</title>
        <p>In this section, we introduce two types of constraints, namely
lexical and syntactic. These constraints help us to discover
and extract disease name - word root relations. We also
discuss how to generalize this idea for entities pertaining to
different domains.</p>
      </sec>
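      <sec id="sec-3-4">
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Symbol</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>DT</td><td>determiner</td></tr>
              <tr><td>JJ</td><td>adjective</td></tr>
              <tr><td>NN</td><td>noun, singular</td></tr>
              <tr><td>NNS</td><td>noun, plural</td></tr>
              <tr><td>NNP</td><td>proper noun, singular</td></tr>
              <tr><td>NNPS</td><td>proper noun, plural</td></tr>
              <tr><td>RB</td><td>adverb</td></tr>
              <tr><td>VB</td><td>verb, base form</td></tr>
              <tr><td>VBZ</td><td>verb, 3rd person</td></tr>
              <tr><td>CD</td><td>cardinal number</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>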
      <sec id="sec-3-6">
        <title>Lexical Constraints</title>
        <p>News headlines contain multiple entities. Our task is to
identify the correct set of entities that correspond to disease
names. In other words, we need to formulate a context that
signals the occurrence of disease names.</p>
        <p>In order to define this context or neighbourhood, we
extract certain word roots that indicate, with high confidence, the presence of disease
names in headlines. The word roots
are obtained by analyzing the headline data and the initial
list of disease names. Based on how well they cover the
disease-containing headlines, both quantitatively and
qualitatively, we select a subset of these word roots. From
our experiments, we find that the following set of 10 word roots
covers a significant portion of disease-containing headlines:</p>
        <list list-type="bullet">
          <list-item><p>‘diagnos’ derivatives</p></list-item>
          <list-item><p>‘outbreak’ derivatives</p></list-item>
          <list-item><p>‘cur’ derivatives</p></list-item>
          <list-item><p>‘vaccin’ derivatives</p></list-item>
          <list-item><p>‘die’ derivatives</p></list-item>
          <list-item><p>‘battling’ derivatives</p></list-item>
          <list-item><p>‘symptom’ derivatives</p></list-item>
          <list-item><p>‘treatment’ derivatives</p></list-item>
          <list-item><p>‘virus’ derivatives</p></list-item>
          <list-item><p>‘hospital’ derivatives</p></list-item>
        </list>
        <p>Note that derivatives here means the inflected
forms of the listed keywords, optionally combined with certain
prepositions. For example, the derivatives of ‘diagnos’ are:</p>
        <p>[mis]diagnos(e | es | ed | is | tic) [with | for | of | by]</p>
        <p>Apart from the above list, word roots such as ‘drug’, ‘patient’
and ‘therapy’, among others, are also identified. Including these
word roots leads to only a marginal improvement in the number
of identified disease mentions. An intuitive justification
is that such word roots are quite often used
with entities other than disease names, and in several cases they
identify more false positives than true positives. Thus,
restricting ourselves to the small list above does not lead to any significant loss,
and at the same time speeds up the system.</p>
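        <p>As an illustration, the ‘diagnos’ derivatives above can be written as a regular expression. This is a hedged sketch: the pattern follows the bracketed form quoted in the text, while the function name, the case-insensitive matching and the word-boundary handling are our assumptions.</p>
        <preformat><![CDATA[
```python
import re
from typing import Optional

# [mis]diagnos(e|es|ed|is|tic) optionally followed by [with|for|of|by]
DIAGNOS = re.compile(
    r"\b(?:mis)?diagnos(?:e|es|ed|is|tic)\b(?:\s+(?:with|for|of|by)\b)?",
    re.IGNORECASE,
)


def find_lexical_rule(headline: str) -> Optional[str]:
    """Return the matched derivative of 'diagnos' in lower-case, or None."""
    match = DIAGNOS.search(headline)
    return match.group(0).lower() if match else None


print(find_lexical_rule("Healthcare worker in Scotland diagnosed with Ebola"))
print(find_lexical_rule("Stool test diagnoses bowel disease"))
print(find_lexical_rule("IMF rules out special treatment for Greece"))
```
]]></preformat>
        <p>Patterns for the other nine word roots could be built in the same way, one compiled expression per root.</p>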
      </sec>
      <sec id="sec-3-7">
        <title>Syntactic Constraints</title>
        <p>The headlines containing inflected forms of different word
roots are extracted using lexical constraints. Using the
PoS tags of obtained headlines, we develop syntactic
constraints/rules to capture the position of occurrence of disease
names, in relation to word roots identified above.</p>
        <p>Besides eliminating incoherent extractions, syntactic
constraints also reduce uninformative extractions by capturing
relation phrases that are expressed by only certain
combinations.</p>
        <p>Figure 2 shows the syntactic constraints developed for the
inflected form ‘diagnoses’ of the word root ‘diagnos’, which are
described in detail below.</p>
        <p>In the headline “Stool test diagnoses bowel disease”, the
inflected form ‘diagnoses’ of the word root ‘diagnos’ is used
as a 3rd person verb (VBZ). The disease mention ‘bowel
disease’, extracted as a pair of singular nouns (NN NN),
occurs to the right of ‘diagnoses’.</p>
        <p>In another headline, “Autism diagnoses surge by 30
percent in kids”, the inflected form ‘diagnoses’ of the word root
‘diagnos’ is used as a plural noun (NNS). The disease
mention ‘Autism’, extracted as a noun (NN), occurs to
the left of ‘diagnoses’.</p>
        <p>Syntactic constraints for other word roots are developed in
a similar manner. Disease mentions in news headlines
generally occur around the word roots in the form of the regular expression
given below:</p>
        <p>E = [DT](JJ)*(NN | NNP | NNS | NNPS | CD)+</p>
        <p>In other words, disease mentions are phrases that
contain an optional determiner or article (e.g. a, an), followed by
zero or more adjectives (e.g. fractured) and at least one
noun or cardinal number (e.g. elbow, non-Hodgskins lymphoma), with a
maximum length of four.</p>
        <p>Since disease names are a kind of entity,
the syntactic constraints extract the potential disease name
phrases from news headlines. We omit
the determiner or article (PoS tag DT), since it is not part of
the disease name. Below is an example
headline depicting the application of these constraints, with the
extracted disease mention highlighted in bold.</p>
        <p>Former Butler forward Andrew Smith diagnosed with
<bold>non-Hodgskins lymphoma</bold>.</p>
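        <p>A minimal sketch of how the pattern E could be applied to a PoS-tagged headline, scanning to the right of a word root, is given below. It is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the length limit of four is applied to the kept words (determiner excluded), and the tags in the usage lines are written by hand.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

NOUN_LIKE = {"NN", "NNP", "NNS", "NNPS", "CD"}
MAX_LEN = 4  # assumption: limit counts the kept words, not the determiner


def extract_right(tagged: List[Tuple[str, str]], start: int) -> List[str]:
    """Match [DT](JJ)*(NN|NNP|NNS|NNPS|CD)+ beginning at index `start`."""
    i = start
    if i < len(tagged) and tagged[i][1] == "DT":
        i += 1  # optional determiner is consumed but not kept
    adjectives = []
    while i < len(tagged) and tagged[i][1] == "JJ":
        adjectives.append(tagged[i][0])
        i += 1
    nouns = []
    while i < len(tagged) and tagged[i][1] in NOUN_LIKE:
        nouns.append(tagged[i][0])
        i += 1
    # At least one noun-like token is required, and it must fit in the limit.
    if not nouns or len(adjectives) >= MAX_LEN:
        return []
    return (adjectives + nouns)[:MAX_LEN]


# Hand-tagged examples (tags are illustrative assumptions).
stool = [("Stool", "NN"), ("test", "NN"), ("diagnoses", "VBZ"),
         ("bowel", "NN"), ("disease", "NN")]
print(extract_right(stool, 3))

elbow = [("diagnosed", "VBN"), ("with", "IN"), ("a", "DT"),
         ("fractured", "JJ"), ("elbow", "NN")]
print(extract_right(elbow, 2))
```
]]></preformat>
        <p>A mirror-image scan to the left of the word root would cover headlines such as “Autism diagnoses surge by 30 percent in kids”.</p>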
      </sec>
      <sec id="sec-3-8">
        <title>Generalization</title>
        <p>Since a basic entity is represented by the regular expression
discussed earlier, our SL-FII approach can be
generalized to entities of other domains through a
simple modification of the lexical constraints to suit the context.
For example, in order to identify the names of
cricketers, we can use word roots like:</p>
        <list list-type="bullet">
          <list-item><p>‘wicket haul’ derivatives</p></list-item>
          <list-item><p>‘ton’ derivatives</p></list-item>
        </list>
        <p>In the headlines below, we can successfully extract the
cricketer’s name using our SL-FII model.</p>
        <p>Gayle’s 47-ball ton wipes out England</p>
        <p>Ashwin’s 5 wicket haul takes India to the semis</p>
        <p>Generalization of this approach to other languages mainly
depends on the quality of the standard NLP pre-processing
tools, such as the PoS tagger and stemmer, that are available for
the required language; the approach itself remains largely
the same.</p>
      </sec>
      <sec id="sec-3-9">
        <title>Frequency-based Iterative Improvement (FII)</title>
        <p>The main motivation is to use the experience gained
in recognizing disease mentions with different word roots to
weed out incorrect disease names. The constraints
formulated above are used to extract potential disease names from
the corpus in an initial pass, and these are then passed through
the FII phase. In order to filter out false positives, we
assign Disease Expectancy Score (DES) values (based on the
probability of a phrase being a disease name) to the outputs of the
initial pass. The idea behind our FII phase can be
understood from the following examples and Equations 1 to 4.</p>
        <p>Mentions identified with the word root ‘battling’ can
appear as ‘battling insecurity’, ‘battling cancer’ or ‘battling
lust’, but for the word root ‘treatment’ most
occurrences are of the form ‘Ebola treatment’ or ‘treatment of
cancer’. So a mention extracted using the word
root ‘treatment’ is more likely to be an instance
of a disease name than a mention extracted using the
word root ‘battling’.</p>
        <p>A mention such as ‘Cancer’, which is used
with multiple word roots (‘treatment’, ‘battling’ and
‘cure’), is more likely to be an instance of a disease
name than a mention like ‘seasonal’ that is
only extracted using the single word root ‘outbreak’.</p>
        <p>1. The probability of a phrase (p) being an instance of a disease
name (D) depends on the probability of the phrase being
used with the different lexical rules (p ∩ rule<sub>i</sub>), as defined
in Equation 1.</p>
        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>P(p \in D) = \sum_{rules} P(p \in D \mid p \cap rule_i) \times P(p \cap rule_i)</tex-math>
          </disp-formula>
        </p>
        <p>2. Based on the training set (2014 headlines), a weight
W[rule<sub>i</sub>] is assigned to each lexical rule. It
corresponds to the probability that a phrase extracted using
that lexical rule (p ∩ rule<sub>i</sub>) is a valid disease name, as
shown by Equation 2.</p>
        <p>
          <disp-formula>
            <label>(2)</label>
            <tex-math>P(p \in D \mid p \cap rule_i) = W[rule_i]</tex-math>
          </disp-formula>
        </p>
        <p>For example, rules involving ‘diagnosed with’ give
better results than rules involving ‘outbreak’ (which also
gives false positives with natural disasters). Thus a much
higher weight is associated with phrases output via
‘diagnosed with’.</p>
        <p>Thus, the weight assigned to each rule depends on the
number of correctly recognized disease mentions and the
total number of disease mentions extracted using it.</p>
        <p>3. The probability of a phrase occurring with the lexical rule under
consideration (p ∩ rule<sub>i</sub>) depends on the frequency with which the
phrase occurs with that lexical rule, F[rule<sub>i</sub>][p], and on the size of our entire
corpus (size), as shown by Equation 3.</p>
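        <p>
          <disp-formula>
            <label>(3)</label>
            <tex-math>P(p \cap rule_i) = \frac{F[rule_i][p]}{size}</tex-math>
          </disp-formula>
        </p>
        <p>4. The final score (termed DES in Equation 4) is
calculated by taking into account the above probabilities for
each rule.</p>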
        <p>The score is equivalent to finding the probability that the
phrase detected after the initial pass is a disease.</p>
        <p>
          <disp-formula>
            <label>(4)</label>
            <tex-math>DES[p] = \sum_{rules} W[rule_i] \times F[rule_i][p]</tex-math>
          </disp-formula>
        </p>
        <p>FII increases accuracy and reduces false positives based
on probabilistic measures. Phrases with a higher DES have a
higher chance of being valid disease names, and phrases that
occur with more rules are more reliable as disease names.
For example, both Ebola and Tornado occur with ‘outbreak’, but Ebola
also occurs with ‘diagnosed with’, ‘died of’ and several
other rules, whereas Tornado does not. Thus, FII increases
belief in Ebola as a valid disease name.</p>
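        <p>The scoring described by Equations 2 and 4 can be sketched as follows. This is a simplified illustration and not the released implementation: rule weights are estimated as the fraction of a rule's training extractions that appear in a known disease list (our reading of "correctly recognized"), and all data in the usage lines is a toy example.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Extraction = Tuple[str, str]  # (lexical rule, extracted phrase)


def rule_weights(train: Iterable[Extraction],
                 known_diseases: Set[str]) -> Dict[str, float]:
    """W[rule] = correct extractions / all extractions made by that rule."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for rule, phrase in train:
        total[rule] += 1
        if phrase.lower() in known_diseases:
            correct[rule] += 1
    return {rule: correct[rule] / total[rule] for rule in total}


def disease_expectancy(test: Iterable[Extraction],
                       weights: Dict[str, float]) -> Dict[str, float]:
    """DES[p] = sum over rules of W[rule] * F[rule][p]."""
    scores = defaultdict(float)
    for rule, phrase in test:
        # Adding W[rule] once per occurrence equals W[rule] * F[rule][p].
        scores[phrase.lower()] += weights.get(rule, 0.0)
    return dict(scores)


# Toy data (hypothetical), mirroring the Ebola / Tornado example.
train = [("diagnosed with", "Ebola"), ("diagnosed with", "cancer"),
         ("outbreak", "Ebola"), ("outbreak", "tornado")]
weights = rule_weights(train, {"ebola", "cancer"})
test = [("diagnosed with", "Ebola"), ("outbreak", "Ebola"),
        ("outbreak", "Tornado")]
des = disease_expectancy(test, weights)
print(weights)
print(des)
```
]]></preformat>
        <p>In this toy run ‘ebola’ accumulates score from two rules while ‘tornado’ is supported only by the weaker ‘outbreak’ rule, which is the behaviour the FII phase relies on.</p>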
      </sec>
      <sec id="sec-3-10">
        <title>Post-processing</title>
        <p>The results of FII are sent for post-processing. First we
normalize the results and then analyze them based on their
temporal distribution. Valid disease names show distinct peaks,
whereas non-disease entities show a uniform distribution over
the specified time interval.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>None of the 664163 headlines for the year 2014 or the 408052
headlines for the year 2015 is annotated in
any manner. To resolve this problem, we manually prepare
a list of 95 disease names. Any headline that contains
at least one of these diseases is taken to be a disease-containing
headline. In this manner, we obtain 1562 and 1884 headlines
for the years 2014 and 2015 respectively that are considered
to have a correct disease mention. Figure 3 shows the number
of occurrences of the most frequently occurring diseases in
these headlines for the year 2014.</p>
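      <p>The simulated annotation described above amounts to a simple membership test, sketched below. The disease names in the usage lines are placeholders rather than entries confirmed to be in the authors' list of 95, and the whole-word, case-insensitive matching is our assumption.</p>
      <preformat><![CDATA[
```python
import re
from typing import Iterable


def contains_disease(headline: str, disease_names: Iterable[str]) -> bool:
    """True if any listed disease name occurs as a whole word or phrase."""
    text = headline.lower()
    return any(
        re.search(r"\b" + re.escape(name.lower()) + r"\b", text)
        for name in disease_names
    )


# Placeholder list (hypothetical entries).
seed_list = ["bird flu", "dengue"]
print(contains_disease("Bird flu outbreak in Kottayam, Alappuzha", seed_list))
print(contains_disease("IMF rules out special treatment for Greece", seed_list))
```
]]></preformat>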
      <p>A few of these headlines may actually contain disease
names (from the prepared list) used in a different context, for
example, ‘fever’. Our system is nevertheless able to
handle such slight inconsistencies, because of the FII phase.</p>
      <p>The system is trained on headlines from 2014 to extract
suitable SL constraints and learn the weights of the lexical rules
for the FII phase. When tested on all the headlines from
2015, the system was able to extract a total of 5058 correct
disease mentions. This is more than a 2.5x gain over the 1884
disease-containing headlines obtained for 2015 using the prepared list.
Besides this quantitative improvement, we also obtain
many diseases, such as ‘Ebola’, ‘Cancer’ and ‘Concussion’,
which were not in our input set of diseases, as shown
in Table 3.</p>
      <sec id="sec-4-1">
        <title>Disease Name</title>
        <sec id="sec-4-1-1">
          <title>Cancer</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Ebola</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Concussion</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Autism</title>
          <p>Listeria
1072
182
190
82
18
# instances
in 2014
# instances
in 2015
494
990
79
35
13</p>
          <p>Table 4 shows a sample of potential disease names
extracted from headlines using syntactic and lexical (SL)
constraints. The 5th headline is a sample false positive, where the
SL constraints incorrectly label ‘Greece’ as a disease. This
is handled once we pass it through the FII phase, as
illustrated in Table 5. The DES value of Greece is 0.32, which
is much lower than the DES values of the other entities, so it can
be filtered out by a threshold on the accepted DES value. This
threshold can be chosen on the basis of accuracy
requirements.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <table-wrap>
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th>Headline</th><th>Lexical rule</th><th>Disease name extracted</th></tr>
            </thead>
            <tbody>
              <tr><td>Healthcare worker in Scotland diagnosed with Ebola</td><td>diagnosed with</td><td>Ebola</td></tr>
              <tr><td>Beyonce Reaches Out to Grieving Family of Teen Who died of Cancer</td><td>died of</td><td>Cancer</td></tr>
              <tr><td>Bird flu outbreak in Kottayam, Alappuzha</td><td>outbreak</td><td>Bird flu</td></tr>
              <tr><td>More details on Dan Uggla’s concussion symptoms</td><td>symptoms</td><td>Concussion</td></tr>
              <tr><td>IMF rules out special treatment for Greece</td><td>treatment for</td><td>Greece</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-4">
        <sec id="sec-4-4-5">
          <p>Extracted disease names are sorted in descending order of
their DES values. Our post-processing step, Temporal
Distribution Analysis, improves the accuracy of the SL-FII
system considerably, since uniformly
occurring words like ‘fight’, ‘water’ and ‘state’ are filtered out.
The difference in accuracy (% of correct disease mentions in
the top K unique predictions) on the test data (headlines of 2015)
before and after the post-processing step can be observed in
Figure 4.</p>
          <p>We now compare our SL-FII approach to other
techniques that can be used for the extraction of disease names.</p>
          <p>Baseline I: the machine learning approach developed by
[Chowdhury et al., 2010] for disease mention recognition in
biomedical texts. This technique makes use of decision trees
and extracts orthographic and linguistic features such as PoS
tags, suffixes and prefixes, whether a word begins with an upper-case
letter, whether all letters are in upper-case, and more.
15,000 instances were extracted from the news corpus in order
to train our classifier. A 10-fold cross validation accuracy of
74.89% was obtained using decision trees for classification.
Other classifiers such as SVM (72.3%) gave lower accuracy
than decision trees on our data set.</p>
          <p>Baseline II: a sequence approach based on the
hypothesis that the context for disease names can be defined
using the prefixes and suffixes used around them. This approach
extracts all the unigrams and bigrams that occur around
disease names, i.e. all the one- and two-word prefixes and suffixes.
These prefixes and suffixes are assigned weights on the basis
of the probability of their occurrence with disease names.
Some of the prefixes extracted by this approach were
‘cured of’, ‘deaths from’ and ‘deadly’. Similarly, ‘outbreak
rises’, ‘cases reported’ and ‘vaccine’ were some of the suffixes
extracted by this approach. Only mentions that have a score
above a particular threshold value are considered as disease
mentions. In addition, the probability of misclassification by each
prefix/suffix with which a disease mention occurs is used to
calculate the probability of error for that disease mention.
If this error probability is below 5%, the disease mention is
considered to be correct by the approach.</p>
          <p>In our comparison, all the approaches are trained using
headlines from 2014 and are tested on headlines from 2015.
Disease names extracted using all three approaches are then
sorted in order of confidence-values given by the approaches.
Disease instances/names extracted using each approach are
further analyzed manually to estimate the top-K (unique)
instance accuracies, as shown in Figure 5. Our method achieves
around 88% accuracy in the top-50 instances.</p>
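          <p>The top-K (unique) accuracy used in this comparison can be computed along the following lines. The sketch assumes that duplicates are collapsed case-insensitively and that correctness is supplied by a caller-provided check (a manual judgement or a reference list); the names and toy data are illustrative only.</p>
          <preformat><![CDATA[
```python
from typing import Callable, List


def top_k_accuracy(ranked: List[str],
                   is_correct: Callable[[str], bool],
                   k: int) -> float:
    """Percentage of correct names among the first k unique predictions."""
    seen = set()
    unique = []
    for name in ranked:  # ranked: predictions in descending confidence
        key = name.lower()
        if key not in seen:
            seen.add(key)
            unique.append(name)
        if len(unique) == k:
            break
    if not unique:
        return 0.0
    hits = sum(1 for name in unique if is_correct(name))
    return 100.0 * hits / len(unique)


# Toy ranking and reference set (hypothetical).
ranked = ["Ebola", "Cancer", "ebola", "Greece", "Bird flu"]
gold = {"ebola", "cancer", "bird flu"}
print(top_k_accuracy(ranked, lambda n: n.lower() in gold, 4))
```
]]></preformat>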
          <p>To automatically compare our results with other
approaches, we used the list of diseases provided by the
Centers for Disease Control and Prevention (CDC)<sup>3</sup>, USA. The CDC
identifies around 841 diseases, along with their conditions
and variants. Figure 6 shows the automatic comparison of
accuracies for the top-K (unique) disease instances extracted by
each approach, using the disease names provided by the CDC.</p>
          <p>From these experiments, we observe that our SL-FII
technique performs considerably better than the other
approaches. In the top-150 (unique) disease instances, our
approach outperforms the best baseline by 40%, and
in the top-50 by 22%.</p>
          <p><sup>3</sup> http://www.cdc.gov/diseasesconditions/</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Using syntactic and lexical constraints gives us potential
disease names. We can then use the FII phase to weed out false
disease names based on the experience acquired in
recognizing disease mentions with different word roots. Even in the
absence of annotations on the corpus, the initial list of
diseases helps us to simulate annotations. In future, we plan to
find the minimum size of this initial list and the influence of
removing some of the word roots on the accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank anonymous reviewers for their valuable comments
and suggestions to improve this paper. We would also like to
thank Shubham Kumar Pandey, Paarth Neekhara, Vikash
Kumar, Baviskar Hrishikesh Hari, Chandan Singha and Kavin
Motlani for setting up the baseline experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Agarwal and Searls, 2008]
          <string-name>
            <given-names>Pankaj</given-names>
            <surname>Agarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>David B.</given-names>
            <surname>Searls</surname>
          </string-name>
          .
          <article-title>Literature mining in support of drug discovery</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>6</issue>
          ):
          <fpage>479</fpage>
          -
          <lpage>492</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Chowdhury et al.,
          <year>2010</year>
          ]
          <string-name>
            <given-names>Mahbub</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Md</given-names>
            <surname>Faisal</surname>
          </string-name>
          , et al.
          <article-title>Disease mention recognition with specific features</article-title>
          .
          <source>In Proceedings of the 2010 workshop on biomedical natural language processing</source>
          , pages
          <fpage>83</fpage>
          -
          <lpage>90</lpage>
          . Association for Computational Linguistics,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Dogan and Lu, 2012]
          <string-name>
            <given-names>Rezarta Islamaj</given-names>
            <surname>Dogan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>An inference method for disease name normalization</article-title>
          .
          <source>In Information Retrieval and Knowledge Discovery in Biomedical Text, Papers from the 2012 AAAI Fall Symposium</source>
          , Arlington, Virginia, USA, November 2-4, 2012,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Doan et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Rezarta Islamaj</given-names>
            <surname>Doan</surname>
          </string-name>
          , Robert Leaman, and Zhiyong Lu.
          <article-title>NCBI disease corpus: A resource for disease name recognition and concept normalization</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>47</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Jonnagaddala et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Jitendra</given-names>
            <surname>Jonnagaddala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nai-Wen</given-names>
            <surname>Chang</surname>
          </string-name>
          , Toni Rose Jue, and
          <string-name>
            <given-names>Hong-Jie</given-names>
            <surname>Dai</surname>
          </string-name>
          .
          <article-title>Recognition and normalization of disease mentions in pubmed abstracts</article-title>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Kim et al.,
          <year>2009</year>
          ]
          Jin-dong Kim, Tomoko Ohta, Sampo Pyysalo, and Yoshinobu Kano.
          <article-title>Overview of BioNLP'09 shared task on event extraction</article-title>
          .
          <source>In Proceedings of Natural Language Processing in Biomedicine (BioNLP)</source>
          NAACL 2009 Workshop. Citeseer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Leaman et al.,
          <year>2008</year>
          ]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Graciela</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          , et al.
          <article-title>BANNER: an executable survey of advances in biomedical named entity recognition</article-title>
          .
          <source>In Pacific Symposium on Biocomputing</source>
          , volume
          <volume>13</volume>
          , pages
          <fpage>652</fpage>
          -
          <lpage>663</lpage>
          . Citeseer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Lu et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hung-Yu</given-names>
            <surname>Kao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chih-Hsuan</given-names>
            <surname>Wei</surname>
          </string-name>
          , Minlie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu,
          <string-name>
            <given-names>Richard Tzong-Han</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hong-Jie</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Naoaki</given-names>
            <surname>Okazaki</surname>
          </string-name>
          , et al.
          <article-title>The gene normalization task in BioCreative III</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Mazumder et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Sahisnu</given-names>
            <surname>Mazumder</surname>
          </string-name>
          , Bazir Bishnoi, and
          <string-name>
            <given-names>Dhaval</given-names>
            <surname>Patel</surname>
          </string-name>
          .
          <article-title>News headlines: What they can tell us?</article-title>
          <source>In Proceedings of the 6th IBM Collaborative Academia Research Exchange Conference (I-CARE) on ICARE 2014</source>
          , I-CARE 2014, pages
          <fpage>4:1</fpage>
          -
          <lpage>4:4</lpage>
          , New York, NY, USA,
          <year>2014</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Naderi et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Nona</given-names>
            <surname>Naderi</surname>
          </string-name>
          , Thomas Kappler, Christopher JO Baker, and René Witte.
          <article-title>OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>27</volume>
          (
          <issue>19</issue>
          ):
          <fpage>2721</fpage>
          -
          <lpage>2729</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Névéol et al.,
          <year>2009</year>
          ]
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , Won Kim,
          <string-name>
            <given-names>W. John</given-names>
            <surname>Wilbur</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Exploring two biomedical text genres for disease recognition</article-title>
          .
          <source>In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing</source>
          ,
          <source>BioNLP '09</source>
          , pages
          <fpage>144</fpage>
          -
          <lpage>152</lpage>
          , Stroudsburg, PA, USA,
          <year>2009</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Pradhan et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Pradhan</surname>
          </string-name>
          , Noémie Elhadad, Wendy Chapman,
          <string-name>
            <given-names>Suresh</given-names>
            <surname>Manandhar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Guergana</given-names>
            <surname>Savova</surname>
          </string-name>
          .
          <article-title>SemEval-2014 Task 7: Analysis of clinical text</article-title>
          .
          <source>SemEval</source>
          ,
          <volume>199</volume>
          (
          <issue>99</issue>
          ):
          <fpage>54</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Torii et al.,
          <year>2009</year>
          ]
          <string-name>
            <given-names>Manabu</given-names>
            <surname>Torii</surname>
          </string-name>
          , Zhangzhi Hu,
          <string-name>
            <given-names>Cathy H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and Hongfang Liu.
          <article-title>BioTagger-GM: A gene/protein name recognition system</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
          <fpage>247</fpage>
          -
          <lpage>255</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Uzuner et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Uzuner</surname>
          </string-name>
          , Brett R South,
          <string-name>
            <given-names>Shuying</given-names>
            <surname>Shen</surname>
          </string-name>
          , and Scott L DuVall.
          <article-title>2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>552</fpage>
          -
          <lpage>556</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>