<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Biomedical Disease Name Entity Recognition Using NCBI Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hidayat Ur Rahman</string-name>
          <email>Hidayat.Rhman@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Hahn</string-name>
          <email>Thomas.F.Hahn3@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dr. Richard Segall</string-name>
          <email>rsegall@astate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Arkansas State University, Computer Inform Tech Department, State University</institution>
          ,
          <addr-line>AR 72404-0130, + 1 (870) 972-3989</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lahore Leads University</institution>
          ,
          <addr-line>5Tipu Block Near Garden Town Near, Kalma Chowk, Lahore 54000 Pakistan, +92-3329702722</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Arkansas at Little Rock</institution>
          ,
          <addr-line>2801 South University Avenue, Little Rock, AR, 72204, + 1 (501) 301 4890</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <volume>2</volume>
      <fpage>1930</fpage>
      <lpage>1938</lpage>
      <abstract>
        <p>Named Entity Recognition (NER) in biomedical literature is a very active research area. NER is a crucial component of biomedical text mining because it enables information retrieval, reasoning and knowledge discovery. Much research has been carried out in this area using semantic type categories such as “DNA”, “RNA”, “proteins” and “genes”. However, disease NER, and human disease NER in particular, has not yet received the attention it needs. Traditional machine learning approaches lack the precision for disease NER, due to their dependence on token level features, sentence level features and the integration of features, such as orthographic, contextual and linguistic features. In this paper a method for disease NER is proposed which utilizes sentence and token level features based on Conditional Random Fields using the NCBI disease corpus. Our system utilizes rich features including orthographic, contextual, affix, bigram, part-of-speech and stem-based features. Using these feature sets our approach achieved a maximum F-score of 94% on the training set under 10-fold cross validation for semantic labeling of the NCBI disease corpus. On the testing and development sets the model achieved F-scores of 88% and 85% respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>NCBI disease corpus</kwd>
        <kwd>naïve Bayesian</kwd>
        <kwd>Bayesian networks</kwd>
        <kwd>Non nested generalized exemplars</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Biomedical Named Entity Recognition (NER) is based on
dictionary-based, rule-based and machine learning approaches
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the dictionary-based approach, not all terms are
defined in the dictionary, which is the major limitation of this
approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Rule-based approaches make decisions based on
certain rules, which are learned from the data in the form of text
terms. However, these rules are not applicable in all cases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. On
the other hand, machine learning approaches require enormous
annotated data to train the algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Nowadays machine
learning approaches are commonly used for NER, e.g.,
Support Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Maximum Entropy
(ME) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Hidden Markov Models (HMM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
Conditional Random Fields (CRF) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] an HMM model
was proposed to distinguish between DNA, RNA,
protein, cell-type and cell-line. Kazama et al. proposed an
SVM-based approach to identify DNA, cell-type, cell-line,
protein and lipid, achieving an F-score of 73.6% [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
a CRF-based NER system was developed to recognize protein
mentions, achieving an F-score of 78.4%. Besides CRFs, in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
the authors used ME to distinguish between 23 different
biological categories, achieving an F-score of 72%.
The performance of biomedical NER, compared to general
purpose NER, is not satisfactory [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Many approaches have
been used to enhance the performance of biomedical NER
systems, e.g. adding biomedical domain knowledge [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
applying post-processing [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and combining different
machine learning classifiers to perform a hybrid classification
scheme [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Some of the above mentioned applications are
discussed below.
      </p>
      <p>
        The same biomedical term can be referred to by
abbreviations or synonyms. Therefore, abbreviation and
synonym recognition are used to unify and normalize
biomedical entities for biomedical NER. For example, in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
the authors have used logistic regression for abbreviation
scoring based on the Medstract corpus thus achieving a recall
of 83% and precision of 80%. In [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] an abbreviation
recognition system has been developed using the AB3P
corpus. Thus, a recall of 95.86% and precision of 86.64%
could be achieved. In [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] pattern-matching rules were
developed for matching abbreviations with their respective
full term. Thus, a recall of 70% and a precision of 95% could
be obtained. In [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] a system was developed based on
collocations yielding a recall of 88.5% and precision of
96.3%. In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] a rule-based synonym recognition system was
developed, and in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] a pattern matching system was developed
to match abbreviations with their corresponding full names.
Much current research focuses on entity recognition and
normalization [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. In the BioCreative III competition, one
task was focused on gene normalization, i.e. to identify and
link genes to the standard database [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Such a system has also
been developed in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Relationships between biomedical
entities, e.g. protein-protein interactions and gene-disease
interactions, are investigated in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>
        Much work has been done in the field of relationship mining.
For example, in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] a relationship mining system was
developed using MetaMap to identify biomedical entities [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
while using linguistic rules to determine the semantic
relationships between them. In [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] a gene-disease
relationship extraction system was developed from Medline
abstracts using a machine learning approach. It performed better
than dictionary- and rule-based approaches.
      </p>
      <p>The research in this work focuses on biomedical disease
classification using the National Center for Biotechnology
Information (NCBI) disease corpus and applying combinations of machine
learning approaches. We found that selecting rich features and
combining classifiers contributes to better performance.</p>
      <p>II. DATASET DETAILS</p>
      <p>
        Our dataset is the National Center for Biotechnology
Information (NCBI) Disease Corpus. It is available
at http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.
It consists of 793 abstracts containing 2783 sentences,
3224 unique disease names [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and about 6,900 disease
names in total. NCBI corpus annotators have annotated every
sentence of the PubMed abstracts, excluding organism names
(e.g. human, virus and bacteria), gender (male and female),
general terms (deficiencies and syndromes), biological
references and nested diseases. Annotations were made using a
web-based tool called PubTator [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. The corpus annotations
were assigned to four categories based on the nature of the
disease mention: 3922 specific disease annotations,
1029 disease class annotations, 1774 modifiers and 173
composite mentions. The dataset is further divided into
training, testing and development sets, as shown in Table 1.
      </p>
      <p>To improve classification accuracy, selecting and defining the
features is very important. Enriching the feature set can
improve the performance of a particular machine learning
algorithm. To train our algorithm we used the following
features:</p>
      <list list-type="order">
        <list-item><p>Word Normalization</p></list-item>
        <list-item><p>Orthographic</p></list-item>
        <list-item><p>Part of Speech (POS) Tags</p></list-item>
        <list-item><p>N-grams</p></list-item>
        <list-item><p>Affixes</p></list-item>
        <list-item><p>Contextual</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <p>Each of these 6 features is explained in more detail below:</p>
      <sec id="sec-2-1">
        <title>A. Word Normalization</title>
        <p>Word normalization attempts to reduce the different forms of
a word, such as its noun, adjective and verb forms, to a reduced (stemmed)
or root form. A common technique for word normalization
is the use of a stemmer or lemmatizer, which reduces a word to its
base form. The following patterns illustrate how disease mentions
are reduced to their root forms:</p>
        <list list-type="simple">
          <list-item><p>Endometrial cancer → endometri cancer</p></list-item>
          <list-item><p>Alzheimer disease → alzheim diseas</p></list-item>
          <list-item><p>Neurological disease → neurolog diseas</p></list-item>
          <list-item><p>Arthritis → arthriti</p></list-item>
          <list-item><p>Deficiency of DPD → defici of DPD</p></list-item>
          <list-item><p>Premenopausal ovarian cancer → premenopaus ovarian cancer</p></list-item>
          <list-item><p>Neurodegeneration → neurodegener</p></list-item>
          <list-item><p>Familial deficiency of the seventh component of complement → famili defici of the seventh compon of complement</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-9">
      <sec id="sec-9-1">
        <title>B. Orthographic Features</title>
        <p>
          Orthographic features are related to the surface form
of the text, such as capitalization, digits, numbers,
single caps, all caps, two caps, punctuation and
symbols. Such features are very effective in NER. The use of
orthographic features has been advocated in [
          <xref ref-type="bibr" rid="ref32 ref33 ref34">32-34</xref>
          ].
        </p>
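        <p>As a minimal sketch of how such orthographic flags might be computed for a single token, the following Python function derives a few of the properties listed above. The function name and the exact set of flags are illustrative assumptions, not the feature extractor used in our experiments.</p>
        <preformat>
```python
import string


def orthographic_features(token):
    """Return a dict of surface-form (orthographic) flags for one token."""
    n_caps = sum(1 for ch in token if ch.isupper())
    return {
        "init_cap": token[:1].isupper(),                  # starts with a capital
        "all_caps": token.isalpha() and token.isupper(),  # e.g. "DPD"
        "single_cap": len(token) == 1 and token.isupper(),
        "two_caps": n_caps == 2,                          # exactly two capitals
        "has_digit": any(ch.isdigit() for ch in token),
        "all_digits": token.isdigit(),
        "has_punct": any(ch in string.punctuation for ch in token),
    }


features = orthographic_features("DPD")
```
        </preformat>
        <p>Each flag can then be passed to the classifier as a binary token-level feature.</p>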
      </sec>
      <sec id="sec-9-2">
        <title>C. Part Of Speech (POS) Tags</title>
        <p>
          Usually POS tags help define the boundaries of phrases. In
some scenarios POS tags have improved NER performance
[
          <xref ref-type="bibr" rid="ref34 ref35">34-35</xref>
          ]. Since POS tagging is a challenging and
computationally demanding process, some researchers have
not used it in NER [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. In our experiments, including POS tags improved
performance.
        </p>
      </sec>
      <sec id="sec-9-3">
        <title>D. N-grams</title>
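        <p>N-grams are sequences of n consecutive tokens or words. The
simplest n-gram is the unigram, which contains a single
token. Bigrams and trigrams contain 2
and 3 tokens respectively. Generally, the n-gram starting at token
<inline-formula><tex-math>x_i</tex-math></inline-formula> is represented
by the equation</p>
        <disp-formula id="eq1"><tex-math>g_n(i) = (x_i, x_{i+1}, \ldots, x_{i+n-1}) \qquad (1)</tex-math></disp-formula>
        <p>In equation (1), <inline-formula><tex-math>n = 1</tex-math></inline-formula> gives
<inline-formula><tex-math>(x_i)</tex-math></inline-formula>, which
represents unigrams, while bigrams add one more word and
can be represented as
<inline-formula><tex-math>(x_i, x_{i+1})</tex-math></inline-formula>; trigrams add two
more words,
<inline-formula><tex-math>(x_i, x_{i+1}, x_{i+2})</tex-math></inline-formula>, and higher-order
n-gram models are obtained in the same way. In our experiment we only
used unigrams and bigrams.</p>
        <p>The feature sets are abbreviated as Contextual (Cc), Normalized (Nm), Unigrams (Un), Bigrams
(Bg), Affixes (Ax), Part of speech (POS) and Orthographic (O).
Performance evaluation was carried out using standard metrics
such as precision, recall and F-score, where TP, FP and FN denote the numbers of true positives, false positives and false negatives:</p>
        <disp-formula><tex-math>\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}</tex-math></disp-formula>
        <table-wrap id="tab2">
          <label>Table-2</label>
          <caption>
            <p>Results obtained by applying 10-fold cross validation on the training set.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Feature combination</th><th>Precision</th><th>Recall</th><th>F-score</th></tr>
            </thead>
            <tbody>
              <tr><td>O</td><td>0.54</td><td>0.62</td><td>0.53</td></tr>
              <tr><td>O + Nm</td><td>0.77</td><td>0.76</td><td>0.74</td></tr>
              <tr><td>O + Nm + POS</td><td>0.87</td><td>0.87</td><td>0.86</td></tr>
              <tr><td>O + Nm + POS + Un</td><td>0.91</td><td>0.91</td><td>0.91</td></tr>
              <tr><td>O + Nm + POS + Un + Bg</td><td>0.92</td><td>0.92</td><td>0.91</td></tr>
              <tr><td>O + Nm + POS + Un + Bg + Cc</td><td>0.92</td><td>0.92</td><td>0.92</td></tr>
              <tr><td>O + Nm + POS + Un + Bg + Cc + Affixes</td><td>0.94</td><td>0.94</td><td>0.94</td></tr>
            </tbody>
          </table>
        </table-wrap>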
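        <p>The unigram and bigram features used in our experiment can be illustrated with a short Python sketch. The helper name and the example sentence are hypothetical and serve only to show how contiguous n-grams are enumerated from a token list.</p>
        <preformat>
```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    # range() is empty when the list is shorter than n, so no n-gram is produced.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["familial", "deficiency", "of", "DPD"]
unigrams = ngrams(tokens, 1)   # one tuple per token
bigrams = ngrams(tokens, 2)    # one tuple per pair of adjacent tokens
```
        </preformat>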
      </sec>
      <sec id="sec-9-4">
        <title>E. Affixes</title>
        <p>
          Prefix and suffix features have significantly improved
performance in the recognition of named entities. In [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] the
authors have collected most frequent suffixes and prefixes
from the training data, while in [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] the authors have grouped
the prefixes and suffixes into 23 categories. In our experiment,
besides contextual features, affixes have shown a significant
improvement.
        </p>
      </sec>
      <sec id="sec-9-5">
        <title>F. Contextual features</title>
        <p>Contextual features refer to the words preceding and following
the named entities. Let <inline-formula><tex-math>x_i</tex-math></inline-formula> be the current token, i.e. the named
entity. For each feature we use the two token instances on either side of it,
i.e. <inline-formula><tex-math>x_{i-2}, x_{i-1}, x_{i+1}, x_{i+2}</tex-math></inline-formula>. For each token which
appears in the text at location <inline-formula><tex-math>i</tex-math></inline-formula> the
same features are calculated; more specifically,</p>
        <disp-formula id="eq2"><tex-math>c = (x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2}) \qquad (2)</tex-math></disp-formula>
        <p>is the contextual window. In our experiment
contextual features, combined with affixes, were the most important features in the
recognition of NEs. Initially two
contextual features followed by the current word were selected
for the experiment. However, once their importance became clear,
four contextual features were selected (see equation 2), i.e. the
two words preceding and the two words following the NE.</p>
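        <p>The contextual window of two preceding and two following words can be sketched in Python as follows. The function names, the feature-key format and the padding symbol used at sentence boundaries are assumptions made for illustration; how boundaries were handled in our system is not specified here.</p>
        <preformat>
```python
def context_window(tokens, i, size=2, pad="PAD"):
    """Return the tokens at positions i-size .. i+size, padding past the edges."""
    window = []
    for j in range(i - size, i + size + 1):
        inside = j >= 0 and len(tokens) > j
        window.append(tokens[j] if inside else pad)
    return window


def context_features(tokens, i, size=2):
    """Map each relative offset in the window to the token found there."""
    window = context_window(tokens, i, size)
    offsets = range(-size, size + 1)
    # Keys look like "w[-2]", "w[-1]", "w[+0]", "w[+1]", "w[+2]".
    return {"w[%+d]" % k: tok for k, tok in zip(offsets, window)}


tokens = ["patients", "with", "ovarian", "cancer", "were", "studied"]
window = context_window(tokens, 2)
```
        </preformat>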
        <p>IV. CLASSIFICATION SCHEME</p>
        <p>
          In this research Conditional Random Fields (CRF) were applied
to the NCBI disease corpus. A CRF is a probabilistic model for
labeling sequential data; it is widely used for part-of-speech
tagging and named entity recognition [
          <xref ref-type="bibr" rid="ref39">39, 40</xref>
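          ]. CRFs have several
advantages over HMMs and SVMs. A CRF is a
discriminative model. Hence, it can include a rich feature set
containing overlapping features using conditional probability.
Given a sequence <inline-formula><tex-math>x = (x_1, \ldots, x_n)</tex-math></inline-formula> and its
labels <inline-formula><tex-math>y = (y_1, \ldots, y_n)</tex-math></inline-formula>, the conditional probability
<inline-formula><tex-math>p(y \mid x)</tex-math></inline-formula> is defined by CRF as follows [41]:
        </p>
        <disp-formula id="eq3"><tex-math>p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{n} \sum_{m=1}^{M} w_m f_m(y_{t-1}, y_t, x, t) \right) \qquad (3)</tex-math></disp-formula>
      </sec>
    </sec>
    <sec id="sec-10">
      <p>Here <inline-formula><tex-math>w = (w_1, \ldots, w_M)</tex-math></inline-formula> is a weight vector.
Its weights are associated with the features, so its length is equal
to M. Each <inline-formula><tex-math>f_m</tex-math></inline-formula> is a feature function, and
<inline-formula><tex-math>Z(x)</tex-math></inline-formula> is the normalization term obtained by summing the exponentiated score over all possible label sequences. The weight vector w is
obtained using the L-BFGS method [42]. In our experiment
CRFsuite, a CRF implementation that can be used from Python, was applied [43].</p>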
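      <p>To make the conditional probability of a linear-chain CRF concrete, the following Python sketch scores a label sequence as a weighted sum of feature functions and normalizes by brute-force enumeration of all label sequences. The two feature functions, the weights and the "START" symbol are hypothetical toy values; practical implementations such as CRFsuite compute the normalization with dynamic programming rather than enumeration.</p>
      <preformat>
```python
import itertools
import math


def sequence_score(x, y, weights, feature_fns):
    """Sum of w_m * f_m(y_prev, y_t, x, t) over all positions t and features m."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "START"
        for w, f in zip(weights, feature_fns):
            total += w * f(y_prev, y[t], x, t)
    return total


def conditional_probability(x, y, labels, weights, feature_fns):
    """p(y | x), with the normalizer computed by enumerating every label sequence."""
    z = sum(
        math.exp(sequence_score(x, list(cand), weights, feature_fns))
        for cand in itertools.product(labels, repeat=len(x))
    )
    return math.exp(sequence_score(x, y, weights, feature_fns)) / z


# Two toy feature functions (hypothetical, for illustration only).
def f_cap_is_entity(y_prev, y_t, x, t):
    return 1.0 if x[t][:1].isupper() and y_t == "SD" else 0.0


def f_entity_continues(y_prev, y_t, x, t):
    return 1.0 if y_prev == "SD" and y_t == "SD" else 0.0


x = ["Alzheimer", "disease", "patients"]
labels = ["SD", "O"]
weights = [2.0, 1.0]
fns = [f_cap_is_entity, f_entity_continues]
p = conditional_probability(x, ["SD", "SD", "O"], labels, weights, fns)
```
      </preformat>
      <p>Because the enumeration grows exponentially with the sequence length, this form is only suitable for checking small examples by hand.</p>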
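      <p>V. RESULTS AND DISCUSSION</p>
      <p>Table-2 shows the contributions of features and their effects on
the performance of CRF. The feature set is divided into seven groups
that are added cumulatively, starting from the orthographic features (O) alone
and ending with the three largest combinations:
O + Nm + POS + Un + Bg,
O + Nm + POS + Un + Bg + Cc and
O + Nm + POS + Un + Bg + Cc + Affixes.</p>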
      <p>Table-2 shows combinations of different features for
improving CRF performance. Orthographic features were
taken as the benchmark. The benchmark performance was an
F-score of 0.53, a precision of 0.54 and a recall of 0.62. Adding
stemmed or normalized features improved the F-score to 0.74,
the precision to 0.77 and the recall to 0.76. Adding part of
speech tags further improved the F-score by 12 percentage points.
Nevertheless, the part of speech tags were recently removed
from the NER system. Unigram-based models have been the
primary models in NER and hence we included them in our
system. Adding the unigram features improved the F-score by
5 percentage points. Adding bigram features did not raise the overall F-score
but improved precision and recall by 1 percentage point. Adding contextual
features improved the F-score only slightly, by 1 percentage point, and had no
effect on precision and recall. Combining all features, i.e.
orthographic, normalized, part of speech, unigram, bigram,
contextual features and affixes, yielded 94% for precision,
recall and F-score. This performance, which we attribute to the rich feature
selection, was achieved with 10-fold cross-validation on the training set.</p>
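      <p>Precision, recall and F-score as reported in Table-2 are conventionally computed from counts of true positives, false positives and false negatives. The sketch below assumes exact-match scoring of mentions represented as (start, end, class) tuples; both the matching criterion and the tuple layout are assumptions for illustration and are not taken from our evaluation scripts.</p>
      <preformat>
```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(gold, predicted):
    """Score predicted mentions against gold mentions by exact match."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold.intersection(predicted))  # mentions found with the right span and class
    fp = len(predicted - gold)              # predicted mentions not in the gold standard
    fn = len(gold - predicted)              # gold mentions that were missed
    return precision_recall_f1(tp, fp, fn)


gold = [(0, 2, "SD"), (5, 6, "MD"), (9, 11, "DC")]
predicted = [(0, 2, "SD"), (5, 6, "DC")]
precision, recall, f1 = evaluate(gold, predicted)
```
      </preformat>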
      <p>Figure 1 shows the F-scores for each of the 4 classes. In
our experiment the following four classes were defined:</p>
      <list list-type="bullet">
        <list-item><p>Disease Class = DC</p></list-item>
        <list-item><p>Composite Mention = CM</p></list-item>
        <list-item><p>Specific Disease = SD</p></list-item>
        <list-item><p>Modifier = MD</p></list-item>
      </list>
      <p>The F-scores of the training, development and testing sets are
plotted in Figure 1. The best F-scores were achieved for the
Modifier class. For this class an F-score of 0.96 was
reached for the training dataset, and for the development and
testing datasets an F-score of 0.92 was obtained. The second
highest F-scores were achieved for the Specific Disease
class. For this class the F-score of the training dataset was
0.95, for the testing set it was 0.92 and for the development set
it was 0.88. The third highest F-scores were achieved for the
Disease Class. For this class the F-score for the training set was
0.86 and the F-scores for the testing and development sets were
both 0.71. The F-scores were lowest for the Composite
Mention class. For this class the F-score for the training set
was 0.72, for the testing set it was 0.52 and for the
development set it was 0.62. We observed a positive
correlation between the size of the training sample sets and
F-score. The largest training sample, comprising over 1,000 samples,
was available for the Modifier class, followed by the Specific
Disease class, followed by the Disease Class, which had the second
smallest training sample, followed by the Composite Mention
class, which had the smallest training sample. The
performance of machine learning algorithms depends on the
size of the training sample. Training samples that are too small increase
the risk of underfitting, while training samples that are too large
increase the risk of overfitting.</p>
      <p>
        We compared the performance of our approach, which is based
on combining features, with that of BANNER using the same
dataset and classes. The results of this comparison are shown in
Table 3. Details about the BANNER results can be found in [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
The data in Table 3 indicate that our approach yielded much
higher F-scores than BANNER for the training, testing and
development sets. The F-score obtained with our approach is
10% higher for the training set, 7% higher for the testing set
and 4% higher for the development set. Hence, we clearly
succeeded in outperforming BANNER.
      </p>
      <p>In summary it can be concluded that CRF based on 6
features clearly outperformed BANNER. This clearly shows
that the sequential classifier CRF is well suited for classifying
biomedical literature based on rich features.</p>
      <p>This paper presents a machine learning approach for human
disease named entity recognition using the NCBI disease
corpus. The system takes advantage of the background
knowledge obtained from the selected features to better
distinguish between the four classes. Improvements due to
feature additions have been demonstrated. The highest
improvement was obtained when adding a second feature
to the first. However, in order to evaluate the overall benefit of
each feature, all possible combinations of feature additions
need to be considered.</p>
      <p>[43]. Dekang Lin and Xiaoyun Wu. 2009. Phrase Clustering for
Discriminative Learning. In Proceedings of the Joint Conference of the
47th Annual Meeting of the ACL and the 4th International Joint
Conference on Natural Language Processing of the AFNLP, pages
1030–1038, Suntec, Singapore, August. Association for Computational
Linguistics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1].
          <string-name>
            <given-names>A.M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          <article-title>A survey of current work in biomedical text mining</article-title>
          <source>Brief Bioinform</source>
          ,
          <volume>6</volume>
          (
          <year>2005</year>
          ), pp.
          <fpage>57</fpage>
          -
          <lpage>71</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2].
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Two-phase biomedical named entity recognition using CRFs</article-title>
          .
          <source>Comput Biol Chem</source>
          ,
          <volume>33</volume>
          (
          <year>2009</year>
          ), pp.
          <fpage>334</fpage>
          -
          <lpage>338</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3].
          <string-name>
            <given-names>D.</given-names>
            <surname>Rebholz-Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.J.</given-names>
            <surname>Yepes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kafkas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lewin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kang</surname>
          </string-name>
          , et al.
          <article-title>Assessment of NER solutions against the first and second CALBC Silver Standard Corpus</article-title>
          .
          <source>J Biomed Semantics</source>
          ,
          <volume>2</volume>
          (
          <issue>Suppl</issue>
          . 5) (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4].
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vazquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Salgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chatr-Aryamontri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Winter</surname>
          </string-name>
          , et al.
          <article-title>The Protein-Protein Interaction tasks of BioCreative III: Classification/ranking of articles and linking bioontology concepts to full text</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <issue>Suppl</issue>
          . 8) (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5].
          <string-name>
            <given-names>M.S.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          ,
          <article-title>Scalable biomedical Named Entity Recognition: investigation of a database-supported SVM approach</article-title>
          .
          <source>Int J Bioinform Res Appl</source>
          ,
          <volume>6</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>191</fpage>
          -
          <lpage>208</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6].
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          .
          <article-title>Feature selection techniques for maximum entropy based biomedical named entity recognition</article-title>
          .
          <source>J Biomed Inform</source>
          ,
          <volume>42</volume>
          (
          <year>2009</year>
          ), pp.
          <fpage>905</fpage>
          -
          <lpage>911</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7].
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ephraim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Merhav</surname>
          </string-name>
          .
          <article-title>Hidden Markov processes</article-title>
          .
          <source>IEEE Trans Inform Theory</source>
          ,
          <volume>48</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>1518</fpage>
          -
          <lpage>1569</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8].
          <string-name>
            <surname>He</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kayaalp</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Biological entity recognition with conditional random fields</article-title>
          .
          <source>In: AMIA annu symp proc; 2008</source>
          . p.
          <fpage>293</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9].
          <string-name>
            <surname>Zhou</surname>
            <given-names>GD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Exploring deep knowledge resources in biomedical name recognition</article-title>
          . In: JNLPBA;
          <year>2004</year>
          . p.
          <fpage>96</fpage>
          -
          <lpage>99</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10].
          <string-name>
            <surname>Kazama</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makino</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohta</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Tuning support vector machines for biomedical named entity recognition</article-title>
          . In:
          <article-title>Association for computational linguistics Morristown, NJ</article-title>
          , USA;
          <year>2002</year>
          . p.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11].
          <string-name>
            <given-names>T.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.C.</given-names>
            <surname>Chou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.Y.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.L.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Integrating linguistic knowledge into a conditional random field framework to identify biomedical named entities</article-title>
          .
          <source>Expert Syst Appl</source>
          ,
          <volume>30</volume>
          (
          <year>2006</year>
          ), pp.
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12].
          <string-name>
            <surname>Lin</surname>
            <given-names>YF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsai</surname>
            <given-names>TH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chou</surname>
            <given-names>WC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>KP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sung</surname>
            <given-names>TY</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            <given-names>WL</given-names>
          </string-name>
          .
          <article-title>A maximum entropy approach to biomedical named entity recognition</article-title>
          .
          In:
          <source>The 4th ACM SIGKDD Workshop on Data Mining in Bioinformatics</source>
          ;
          <year>2004</year>
          . p.
          <fpage>56</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13].
          <string-name>
            <given-names>C.R.</given-names>
            <surname>Yen-Ching</surname>
          </string-name>
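          ,
          <string-name>
            <surname>Tsai</surname>
            <given-names>Tzong-Han</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            <given-names>Wen-Lian</given-names>
          </string-name>
          .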
          <article-title>New challenges for biological text-mining in the next decade</article-title>
          .
          <source>J Comput Sci Technol</source>
          ,
          <volume>25</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>169</fpage>
          -
          <lpage>179</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14].
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsuruoka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McNaught</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          .
          <article-title>How to make the most of NE dictionaries in statistical NER</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>Suppl. 11</issue>
          ) (
          <year>2008</year>
          ), p.
          <fpage>S5</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
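          [15].
          <string-name>
            <given-names>G.D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          .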
          <article-title>Exploring deep knowledge resources in biomedical name recognition</article-title>
          . In:
          <source>JNLPBA</source>
          ;
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16].
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shen</surname>
          </string-name>
          .
          <article-title>Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing</article-title>
          .
          <source>PLoS One</source>
          ,
          <volume>7</volume>
          (
          <issue>6</issue>
          ) (
          <year>2012</year>
          ), p.
          <fpage>e39230</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17].
          <string-name>
            <given-names>J.T.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schutze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.B.</given-names>
            <surname>Altman</surname>
          </string-name>
          .
          <article-title>Creating an online dictionary of abbreviations from MEDLINE</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          ,
          <volume>9</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>612</fpage>
          -
          <lpage>620</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18].
          <string-name>
            <given-names>C.J.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.H.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.N.</given-names>
            <surname>Hsu</surname>
          </string-name>
          .
          <article-title>BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>10</volume>
          (
          <issue>Suppl. 15</issue>
          ) (
          <year>2009</year>
          ), p.
          <fpage>S7</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19].
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hripcsak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Mapping abbreviations to full forms in biomedical articles</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          ,
          <volume>9</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>262</fpage>
          -
          <lpage>272</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20].
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Mining terminological knowledge in large biomedical corpora</article-title>
          .
          <source>Pac Symp Biocomput</source>
          (
          <year>2003</year>
          ), pp.
          <fpage>415</fpage>
          -
          <lpage>426</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21].
          <string-name>
            <given-names>J.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          .
          <article-title>Synonym set extraction from the biomedical literature by lexical pattern discovery</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <year>2008</year>
          ), p.
          <fpage>159</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22].
          <string-name>
            <given-names>A.M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dubay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Spackman</surname>
          </string-name>
          .
          <article-title>Using cooccurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>6</volume>
          (
          <year>2005</year>
          ), p.
          <fpage>103</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
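          [23].
          <string-name>
            <surname>Lu</surname>
            <given-names>Zhiyong</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kao</surname>
            <given-names>H.-Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            <given-names>Chih-Hsuan</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>Minlie</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>Jingchen</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuo</surname>
            <given-names>Cheng-Ju</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            <given-names>Chun-Nan</given-names>
          </string-name>
          , et al.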
          <article-title>The gene normalization task in BioCreative III</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24].
          <string-name>
            <given-names>C.N.</given-names>
            <surname>Arighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.M.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cesareni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chatr-Aryamontri</surname>
          </string-name>
          , et al.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25].
          <article-title>BioCreative III interactive task: an overview</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <issue>Suppl. 8</issue>
          ) (
          <year>2011</year>
          ),
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26].
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Zhu.</surname>
          </string-name>
          <article-title>GeneTUKit: a software for document-level gene normalization</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>27</volume>
          (
          <year>2011</year>
          ), pp.
          <fpage>1032</fpage>
          -
          <lpage>1033</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27].
          <string-name>
            <given-names>C.N.</given-names>
            <surname>Arighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.J.</given-names>
            <surname>Wilbur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Valencia</surname>
          </string-name>
          , et al.
          <article-title>Overview of the BioCreative III workshop</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <issue>Suppl. 8</issue>
          ) (
          <year>2011</year>
          ), p.
          <fpage>S1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28].
          <string-name>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          .
          <article-title>Automatic extraction of semantic relations between medical entities: a rule based approach</article-title>
          .
          <source>J Biomed Semantics</source>
          ,
          <volume>2</volume>
          (
          <issue>Suppl. 5</issue>
          ) (
          <year>2011</year>
          ), p.
          <fpage>S4</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29].
          <string-name>
            <given-names>A.R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.M.</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <article-title>An overview of MetaMap: historical perspective and recent advances</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          ,
          <volume>17</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>229</fpage>
          -
          <lpage>236</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
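          [30].
          <string-name>
            <given-names>Rezarta</given-names>
            <surname>Islamaj Dogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>An improved corpus of disease mentions in PubMed citations</article-title>
          .
          <source>Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012)</source>
          , pages
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          , Montréal, Canada, June 8,
          <year>2012</year>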
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31].
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <article-title>Enabling recognition of diseases in biomedical text with machine learning: corpus and benchmarks</article-title>
          .
          <source>Symposium on Languages in Biology and Medicine</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32].
          <string-name>
            <given-names>C.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>PubTator: a PubMed-like interactive curation system for document triage and literature curation</article-title>
          .
          <source>In: Proceedings of BioCreative Workshop 2012</source>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33].
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takeuchi</surname>
          </string-name>
          .
          <article-title>Comparison of character-level and part of speech features for name recognition in biomedical texts</article-title>
          .
          <source>J Biomed Inform</source>
          ,
          <volume>37</volume>
          (
          <year>2004</year>
          ), pp.
          <fpage>423</fpage>
          -
          <lpage>435</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34].
          <string-name>
            <given-names>D.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jian</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Effective Adaptation of a Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain</article-title>
          ,
          <source>In: Proceedings of ACL 2003 Workshop on NLP in Biomedicine</source>
          , Sapporo, Japan,
          pp.
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35].
          <string-name>
            <surname>Tsai</surname>
            ,
            <given-names>T.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.-H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>W.-L.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Exploitation of linguistic features using a CRF-based biomedical named entity recognizer</article-title>
          . To appear in
          <source>ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics</source>
          , Detroit
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36].
          <string-name>
            <given-names>L.</given-names>
            <surname>Ratinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In CoNLL, 6.</source>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37].
          <string-name>
            <given-names>J.</given-names>
            <surname>Kazama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Makino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <article-title>Tuning Support Vector Machines for Biomedical Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of Workshop on NLP in the Biomedical Domain</source>
          , ACL
          <year>2002</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38].
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>Named Entity Recognition using an HMM-based Chunk Tagger</article-title>
          .
          <source>In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pp.
          <fpage>473</fpage>
          -
          <lpage>480</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39].
          <string-name>
            <surname>Huang</surname>
            <given-names>H-S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>Y-S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>K-T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuo</surname>
            <given-names>C-J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>Y-M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>B-H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            <given-names>I-F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            <given-names>C-N</given-names>
          </string-name>
          :
          <article-title>High-recall gene mention recognition by unification of multiple backward parsing models</article-title>
          .
          <source>Proceedings of the 2nd BioCreative Challenge Evaluation Workshop</source>
          <year>2007</year>
          ,
          <volume>23</volume>
          :
          <fpage>109</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>