<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Disease Named Entity Recognition Using Conditional Random Fields</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hidayat Ur Rahman</string-name>
          <email>Hidayat.R@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lahore Pakistan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Little Rock</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Arkansas State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Arkansas</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Named Entity Recognition is a crucial component in biomedical text mining. In this paper a method for disease Named Entity Recognition is proposed which utilizes sentence- and token-level features based on Conditional Random Fields using the NCBI disease corpus. The feature set used for the experiment includes orthographic, contextual, affix, n-gram, part-of-speech and word normalization features. Using these features, our approach achieved a maximum F-score of 94% on the training set by applying 10-fold cross validation for semantic labeling of the NCBI disease corpus. For the testing and development sets, F-scores of 88% and 85% were obtained, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The increasing amount of biomedical literature requires more robust approaches for information retrieval and knowledge discovery, because every single day more information is published than humans can read. Biomedical Named Entity Recognition (NER) poses unique challenges because of the structure of biomedical named entities (NEs): they contain symbols and abbreviations used to infer relationships, so their length is not consistent, which is the primary reason why Bio-NER has lower performance than general-purpose NER
        <xref ref-type="bibr" rid="ref9">(Lishuang, L. and W. Fan, 2013)</xref>
        . Bio-NER is the most important step in the extraction of knowledge, with the overall aim of identifying specific concepts or categories, such as gene, protein, disease, drug, etc. The current trend in NER is based on machine learning (ML) approaches, which provide the flexibility of statistical and rule-based techniques. However, the performance of machine learning techniques depends highly on the availability of sufficient training data to adequately train the classifiers
        <xref ref-type="bibr" rid="ref10">(M. Krallinger et al., 2011)</xref>
        . In this article, Bio-NER for disease names has been carried out to handle the challenges of boundary detection and entity classification using Conditional Random Fields (CRF). The model consists of an enriched set of features. For boundary detection, features such as word normalization, affixes, orthographic features and part-of-speech (POS) features have been used. For semantic labeling, features such as n-grams and contextual features have been used.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>For disease NER our methodology follows the traditional machine learning approach. Figure 1 depicts the workflow of our methodology. Firstly, raw text is obtained from the training, testing and development sets; then pre-processing is carried out to remove characters and symbols such as the underscore character, the full stop, etc. After pre-processing, various features are extracted as described in Section 2.1. The features are fed into a sequential CRF as described in Section 2.2. Thus, structured output in the form of annotated named entities is obtained. This section provides details about feature extraction and classification.</p>
      <sec id="sec-2-1">
        <title>Feature Set</title>
        <p>Feature extraction plays a vital role in the classification accuracy of a machine learning classifier as well as of the NER system. Selecting a relevant feature set improves the classification performance of Bio-NER. Table 1 shows the list of features used for Bio-NER; short descriptions are given below.</p>
        <p>Word normalization attempts to reduce different forms of words, such as nouns, adjectives, verbs, etc., to their root form. For word normalization, the Porter stemmer has been used to reduce disease names to their root form. Below are a few examples of disease names and the root forms obtained with the Porter stemmer algorithm.</p>
        <p>Colorectal cancer – colorect cancer; Endometrial cancer – endometri cancer; Alzheimer disease – alzheim diseas; Neurological disease – neurolog diseas; Arthritis – arthriti.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Orthographic Features</title>
        <p>
          Orthographic features are related to the orthography of the text, such as capitalization, digits, single caps, all caps and punctuation. Such features are very effective in boundary detection
          <xref ref-type="bibr" rid="ref1">(Collier, Nigel and K. Takeuchi, 2004)</xref>
          . The eleven orthographic features below have been used in our model:
        </p>
        <p>IDASH: Whether a token/word contains an inner dash, such as A-T, G6PD-deficient, Pelizaeus-Merzbacher disease.</p>
        <p>2IDASH: If the number of inner dashes (IDASH) equals 2, e.g. X-linked Emery-Dreifuss muscular dystrophy, Borjeson-Forssman-Lehmann.</p>
        <p>ALLCAPS: Set to true if all the letters in a given token are capitals; examples include DMD, BMD, FD, APC, FAP and HDD.</p>
        <p>TITLE: If the first letter in a token is capitalized, such as Alzheimer disease, Huntington disease, Combined genetic deficiency of C6 and C7.</p>
        <p>LOW: All the letters in a given word are in lower case, e.g. myotonic dystrophy, idiopathic dilated cardiomyopathy, and facial lesions.</p>
        <p>MIXED: If a given sequence of words contains both upper and lower case, such as DMD defects, hypomyelination of the PNS, deficiency of active AVP.</p>
        <p>ALPNUM: If a given word contains both digits and letters, like abnormality of CYP27, C6 deficiency, achondrogenesis 1B.</p>
        <p>PARN: If a multi-word term contains parentheses, such as Arginine vasopressin (AVP) deficiency, palmoplantar keratoderma (PPK) conditions, sporadic (nonhereditary) ovarian cancers.</p>
        <p>BRACKS: A bracket is contained within a token; an example is hypoxanthine phosphoribosyltransferase [HPRT] deficiency.</p>
        <p>GREEKS: A Roman numeral such as I, II, III, IV, etc. is contained within a token, e.g. type IIA vWD, Type II ALD, type II Gaucher disease, type II GD and type III GD.</p>
        <p>SLASH: The character / is contained within the multi-word token, such as cleft lip/palate, CL/P, breast and/or ovarian cancers, glucose/galactose malabsorption.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption>
            <p>Features used for Bio-NER.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Features</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>Word normalization</td><td>Stemmed form of the Named Entity</td></tr>
              <tr><td>Contextual features</td><td>w<sub>i-2</sub>, w<sub>i-1</sub>, w<sub>i</sub>, w<sub>i+1</sub>, w<sub>i+2</sub></td></tr>
              <tr><td>POS</td><td>POS<sub>w-2</sub>, POS<sub>w-1</sub>, POS<sub>w</sub>, POS<sub>w+1</sub>, POS<sub>w+2</sub></td></tr>
              <tr><td>Orthographic features</td><td>Uppercase, lowercase, title, hyphen, alphanumeric, etc.</td></tr>
              <tr><td>Word N-grams</td><td>w<sub>i-2</sub>/w<sub>i-1</sub>, w<sub>i-1</sub>/w<sub>i</sub>, w<sub>i</sub>/w<sub>i+1</sub>, w<sub>i+1</sub>/w<sub>i+2</sub></td></tr>
              <tr><td>POS N-grams</td><td>POS<sub>w-2</sub>, POS<sub>w-1</sub>, POS<sub>wi</sub>, POS<sub>w+1</sub>, POS<sub>w+2</sub></td></tr>
              <tr><td>Prefix</td><td>PREFIX(w<sub>i</sub>)</td></tr>
              <tr><td>Suffix</td><td>SUFFIX(w<sub>i</sub>)</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
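      <p>The eleven orthographic flags described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name <monospace>orthographic_features</monospace>, the regular expressions, and the restriction of GREEKS to the numerals I to IV (optionally followed by A or B) are all hypothetical choices made to match the examples in the text.</p>
      <preformat>
```python
import re

# Assumption: GREEKS only covers the numerals I-IV seen in the examples,
# optionally followed by a subtype letter A or B (e.g. "IIA").
ROMAN = re.compile(r"\b(?:IV|I{1,3})[AB]?\b")


def orthographic_features(mention):
    """Return the eleven boolean orthographic flags for a token or mention."""
    letters = [c for c in mention if c.isalpha()]
    # An inner dash has a word character on both sides, e.g. "A-T".
    inner_dashes = len(re.findall(r"\w-(?=\w)", mention))
    has_upper = any(c.isupper() for c in letters)
    has_lower = any(c.islower() for c in letters)
    return {
        "IDASH": inner_dashes >= 1,
        "2IDASH": inner_dashes == 2,
        "ALLCAPS": bool(letters) and not has_lower,
        "TITLE": bool(letters) and letters[0].isupper(),
        "LOW": bool(letters) and not has_upper,
        "MIXED": has_upper and has_lower,
        "ALPNUM": bool(letters) and any(c.isdigit() for c in mention),
        "PARN": "(" in mention and ")" in mention,
        "BRACKS": "[" in mention and "]" in mention,
        "GREEKS": bool(ROMAN.search(mention)),
        "SLASH": "/" in mention,
    }


if __name__ == "__main__":
    for example in ["G6PD-deficient", "DMD", "type IIA vWD", "cleft lip/palate"]:
        active = [name for name, on in orthographic_features(example).items() if on]
        print(example, active)
```
      </preformat>
      <p>The flags are not mutually exclusive in this sketch: an all-caps token such as DMD also satisfies TITLE, since the text does not state whether the original features overlap.</p>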
      <sec id="sec-2-4">
        <title>N-grams</title>
        <p>N-grams are defined as a sequence of n tokens or words. The most common n-gram is the uni-gram, which contains a single token. Other examples are bi-grams and tri-grams, containing 2 and 3 tokens respectively. In this experiment uni-grams and bi-grams have been used. In this method all the digits within a word are replaced with d, e.g. the uni-gram of 33 is dd and the uni-gram of nt943 is ntddd. Bi-gram examples are ALD/Eighteen, skin tumor/caused, APC/protein, breast or ovarian cancer/novel, etc.</p>
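        <p>The digit replacement and the n-gram construction can be illustrated with the short sketch below. The helper names <monospace>normalize_digits</monospace> and <monospace>ngrams</monospace> are hypothetical, and joining tokens with a slash is an assumption that simply mirrors the way the bi-gram examples are written in the text.</p>
        <preformat>
```python
import re


def normalize_digits(token):
    """Replace every digit in a token with the letter d."""
    return re.sub(r"\d", "d", token)


def ngrams(tokens, n):
    """Return the digit-normalized n-grams of a token list, joined by '/'."""
    normalized = [normalize_digits(t) for t in tokens]
    return ["/".join(normalized[i:i + n]) for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(normalize_digits("nt943"))
    print(ngrams(["APC", "protein", "nt943"], 2))
```
        </preformat>
        <p>With this definition the uni-gram of 33 is dd and the uni-gram of nt943 is ntddd, as in the examples above.</p>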
      </sec>
      <sec id="sec-2-5">
        <title>Part of Speech (POS) tags</title>
        <p>
          POS tags are helpful in defining the boundary of a phrase; the inclusion of POS has been advocated by
          <xref ref-type="bibr" rid="ref3 ref4">(J. Kazama and T. Makino, 2002)</xref>
          . Our experiment includes POS tags of contextual features and bi-grams. Adding POS tags to our feature set boosts the performance of the classifier, as shown in Table 3.
        </p>
        <p>
          Affixes: Prefix and suffix features improved the recognition of NEs in this experiment. In
          <xref ref-type="bibr" rid="ref3 ref4">(J. Kazama and T. Makino, 2002)</xref>
          the authors collected the most frequent suffixes and prefixes from the training data. A prefix and a suffix are the n characters at the beginning and end of a token, respectively
          <xref ref-type="bibr" rid="ref3 ref4">(Zhou, G. Dong, and J. Sui, 2002)</xref>
          . In our model all lengths n = 1 through 4 have been used to boost performance. The prefixes of the word "tumour" are t, tu, tum and tumo; the suffixes of the same word are r, ur, our and mour. Besides contextual features, affixes yielded improvements in the overall performance, as shown in Table 3.
        </p>
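        <p>As a minimal sketch of the affix features, the function below collects the prefixes and suffixes of length 1 through 4 of a token. The name <monospace>affixes</monospace> and the handling of tokens shorter than four characters (the length is capped at the token length) are assumptions, since the text does not say how short tokens were treated.</p>
        <preformat>
```python
def affixes(token, max_len=4):
    """Return (prefixes, suffixes) of length 1..max_len for a token."""
    # Assumption: for short tokens, stop at the token length.
    limit = min(max_len, len(token))
    prefixes = [token[:n] for n in range(1, limit + 1)]
    suffixes = [token[-n:] for n in range(1, limit + 1)]
    return prefixes, suffixes


if __name__ == "__main__":
    prefixes, suffixes = affixes("tumour")
    print(prefixes)  # ['t', 'tu', 'tum', 'tumo']
    print(suffixes)  # ['r', 'ur', 'our', 'mour']
```
        </preformat>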
      </sec>
      <sec id="sec-2-6">
        <title>Contextual features</title>
        <p>Contextual features refer to the words preceding and following the NEs. Contextual features are the most important features in this experiment for semantic labeling of disease names. In our experiment four contextual features are selected: the two words preceding and the two words following the named entity. For example, in the phrase "bactracin in colon carcinoma loss cells", "colon carcinoma" is the named entity, while "bactracin in" and "loss cells" are the two preceding and the two following words.</p>
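        <p>The two-word context window can be sketched as follows. The function name <monospace>context_features</monospace>, the feature keys, and the <monospace>PAD</monospace> placeholder used at sentence boundaries are hypothetical; the text does not describe how boundaries were handled.</p>
        <preformat>
```python
PAD = "PAD"  # hypothetical placeholder for positions outside the sentence


def context_features(tokens, start, end, window=2):
    """Return the words around the entity tokens[start:end].

    Keys "w-k" and "w+k" hold the k-th word before and after the entity.
    """
    feats = {}
    for k in range(1, window + 1):
        left = start - k
        right = end - 1 + k
        feats[f"w-{k}"] = tokens[left] if left >= 0 else PAD
        feats[f"w+{k}"] = tokens[right] if len(tokens) > right else PAD
    return feats


if __name__ == "__main__":
    sentence = "bactracin in colon carcinoma loss cells".split()
    # "colon carcinoma" occupies positions 2 and 3.
    print(context_features(sentence, 2, 4))
```
        </preformat>
        <p>For the example phrase this yields "in" and "bactracin" as the preceding words and "loss" and "cells" as the following words.</p>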
      </sec>
      <sec id="sec-2-7">
        <title>Conditional Random Fields (CRF)</title>
        <p>
          CRF is a probabilistic model used for labeling sequential data. It is widely used for POS tagging and NER
          <xref ref-type="bibr" rid="ref5">(Huang H-S and Lin Y-S, 2007)</xref>
          . CRF has several advantages over the Hidden Markov Model (HMM) and the Support Vector Machine (SVM). Because it models a conditional probability, a CRF can include rich feature sets, i.e. overlapping features. For example, given a sequence
          <inline-formula><tex-math>X = x_1, x_2, x_3, x_4, \ldots, x_n</tex-math></inline-formula>
          and its labels
          <inline-formula><tex-math>Y = y_1, y_2, y_3, y_4, \ldots, y_n</tex-math></inline-formula>
          , the conditional probability
          <inline-formula><tex-math>P(Y \mid X)</tex-math></inline-formula>
          is defined by CRF as follows:
          <disp-formula><tex-math>P(Y \mid X) \propto \exp\left(w^{T} f(y_n, y_{n-1}, x)\right)</tex-math></disp-formula>
          <xref ref-type="bibr" rid="ref7">(Sutton, C. and McCallum, 2011)</xref>
          . Here
          <inline-formula><tex-math>w = (w_1, w_2, w_3, \ldots, w_M)^{T}</tex-math></inline-formula>
          is a weight vector. These weights are associated with the features, and the feature function has length M:
          <disp-formula><tex-math>f(y_n, y_{n-1}, x) = \left(f_1(y_n, y_{n-1}, x),\ f_2(y_n, y_{n-1}, x),\ f_3(y_n, y_{n-1}, x),\ \ldots,\ f_M(y_n, y_{n-1}, x)\right)^{T}</tex-math></disp-formula>
          The weight vector is obtained using the L-BFGS method. In our experiment CRFsuite, a CRF implementation with a Python application programming interface (API), has been used.
        </p>
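        <p>To make the role of the weight vector and the feature function concrete, the toy sketch below scores every labeling of a three-token sentence and normalizes the exponentiated scores by brute force. It is an illustration only and not how CRFsuite works internally: the label set, the three features, the weights, and the summation of the score over all positions n (as in the standard linear-chain formulation) are assumptions introduced here.</p>
        <preformat>
```python
import itertools
import math

LABELS = ["O", "DISEASE"]  # hypothetical label set


def features(y, y_prev, x, n):
    """Toy feature vector f(y_n, y_{n-1}, x) with M = 3 binary features."""
    token = x[n]
    return [
        1.0 if y == "DISEASE" and token.endswith("oma") else 0.0,
        1.0 if y == "DISEASE" and y_prev == "DISEASE" else 0.0,
        1.0 if y == "O" and token.islower() and not token.endswith("oma") else 0.0,
    ]


def score(ys, x, w):
    """Sum of w^T f(y_n, y_{n-1}, x) over all positions n."""
    total = 0.0
    y_prev = "START"
    for n, y in enumerate(ys):
        total += sum(wk * fk for wk, fk in zip(w, features(y, y_prev, x, n)))
        y_prev = y
    return total


def conditional_probability(ys, x, w):
    """P(Y | X): exponentiated score divided by the sum over all labelings."""
    z = sum(
        math.exp(score(list(candidate), x, w))
        for candidate in itertools.product(LABELS, repeat=len(x))
    )
    return math.exp(score(ys, x, w)) / z


if __name__ == "__main__":
    x = ["colon", "carcinoma", "cells"]
    w = [2.0, 0.5, 1.0]  # hypothetical weights; in practice learned with L-BFGS
    for candidate in itertools.product(LABELS, repeat=len(x)):
        print(candidate, round(conditional_probability(list(candidate), x, w), 3))
```
        </preformat>
        <p>Enumerating all labelings is exponential in the sentence length, so it is only feasible for toy inputs; practical CRF implementations compute the normalizer with dynamic programming.</p>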
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>
          Our experiment is based on the National Center for Biotechnology Information (NCBI) disease corpus, which is freely available at the NCBI website (http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/). The NCBI corpus includes 793 abstracts, which consist of 2783 sentences and a total of 6900 disease names
          <xref ref-type="bibr" rid="ref8">(Dogan, R. and Islamaj, 2012)</xref>
          . Disease names are annotated according to the following criteria. Disease mentions which describe a family of specific diseases are annotated as disease class, e.g. autosomal recessive disease, whereas text referring to a specific disease is annotated as specific disease, such as Diastrophic dysplasia. Strings referring to more than one disease name are annotated as composite mention, e.g. Duchenne and Becker muscular dystrophy refers to two diseases and hence is categorized as a composite mention. Certain disease mentions are used as modifiers for other concepts: a string may denote a disease name without being a noun phrase, and hence it is annotated as modifier, e.g. colorectal cancer. Table 2 shows the distribution of disease names in the training, testing and development sets.
        </p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption>
            <p>Distribution of disease names in the training, testing and development sets.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Classes</th><th>Train set</th><th>Test set</th><th>Dev set</th></tr>
            </thead>
            <tbody>
              <tr><td>Modifiers</td><td>1292</td><td>264</td><td>218</td></tr>
              <tr><td>Specific Disease</td><td>2959</td><td>556</td><td>409</td></tr>
              <tr><td>Composite Mention</td><td>116</td><td>20</td><td>37</td></tr>
              <tr><td>Disease Class</td><td>781</td><td>121</td><td>127</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-4">
        <title>Results</title>
        <p>Table 3 shows the contribution of each group of features and its effect on the performance of the CRF. The feature set is divided into Contextual (Cc), Normalized (Nm), N-grams, Affixes (Ax), Part of speech (POS) and Orthographic (O) features. Performance has been evaluated using the metrics precision, recall and F-score. The results in Table 3 are based on 10-fold cross validation on the training set. Orthographic features were taken as the baseline, which results in an F-score of 0.53, the lowest F-score reported in this experiment. Adding normalized features increased the F-score by 21%. Further addition of POS tags increased the performance by 12%. With the addition of N-gram features the overall F-score reached 0.91. Finally, with the addition of affixes, the final F-score obtained is 0.94. Compared to other state-of-the-art Bio-NER systems, such as BANNER, our system has a higher F-score under 10-fold cross validation on the training set, due to the selection of good features for disease NER.</p>
        <p>For result visualization we have plotted the F-score of the individual classes for each dataset in Figure 2, where DC denotes Disease Class, CM denotes Composite Mention, SD denotes Specific Disease and MD denotes Modifier. Figure 2 shows that the highest F-score is obtained for Modifier on the training, testing and development sets, followed by Specific Disease. The lowest F-score is obtained for Composite Mention, followed by Disease Class. One reason for the relatively poor performance on Composite Mention is the inadequate number of training samples compared to Specific Disease and Modifier, each of which exceeds 1000. The relatively poor performance on Disease Class arises because it has the second smallest training sample, and the performance of machine learning based techniques depends heavily on the number of training samples.</p>
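        <p>For reference, the three evaluation metrics can be computed from entity-level counts as in the sketch below. The function name and the example counts are hypothetical and are not taken from the experiment; the formulas are the standard definitions of precision, recall and the balanced F-score.</p>
        <preformat>
```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-score) from entity-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical counts: 9 correct entities, 1 spurious, 3 missed.
    print(precision_recall_f1(9, 1, 3))
```
        </preformat>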
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper presents a machine learning approach for human disease NER using the NCBI disease corpus. The system takes advantage of a rich feature set, which helps in representing and distinguishing related concepts and categories. It relies on simple features, including orthographic, contextual, affix, bi-gram, part-of-speech and normalized-token features, without exploiting features such as head nouns, dictionaries, etc. The model has achieved state-of-the-art performance for semantic labeling of named entities using the NCBI disease corpus. Each feature set represents some knowledge about the named entity; hence, in order to evaluate the overall benefit of each feature, all possible combinations of feature additions need to be considered.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Collier</surname>
            ,
            <given-names>Nigel</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Takeuchi</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Comparison of character-level and part of speech features for name recognition in biomedical texts</article-title>
          , volume
          <volume>36</volume>
          .
          <source>Journal of Biomedical Informatics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>Lev</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Kazama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Makino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ohta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Tuning Support Vector Machines for Biomedical Named Entity Recognition</article-title>
          .
          <source>Proceedings of Workshop on NLP in the Biomedical Domain. 1-8</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G. Dong</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sui</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Named entity recognition using an HMM-based chunk tagger</article-title>
          .
          <source>Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics</source>
          .
          <fpage>473</fpage>
          -
          <lpage>480</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            <given-names>H-S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>Y-S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>K-T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuo</surname>
            <given-names>C-J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>Y-M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>B-H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            <given-names>I-F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            <given-names>C-</given-names>
          </string-name>
          <string-name>
            <surname>Ni</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>High-recall gene mention recognition by unification of multiple background parsing models</article-title>
          <source>Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Klinger</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            <given-names>CM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fluck</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann-Apitius</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Named entity recognition with combinations of conditional random fields</article-title>
          .
          <source>Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McCallum</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>An introduction to conditional random fields</article-title>
          .
          <source>Foundations and Trends in Machine Learning</source>
          .
          <fpage>267</fpage>
          -
          <lpage>373</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Dogan</surname>
            ,
            <given-names>R. Islamaj</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>An improved corpus of disease mentions in PubMed citations</article-title>
          .
          <source>Proceedings of the 2012 workshop on biomedical natural language processing. Association for Computational Linguistics</source>
          .
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Lishuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A Two-Phase Bio-NER System Based on Integrated Classifiers and Multiagent Strategy</article-title>
          .
          <source>Computational Biology and Bioinformatics, IEEE/ACM Transactions on 10.4</source>
          .
          <fpage>897</fpage>
          -
          <lpage>904</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vazquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Salgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chatr-Aryamontri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Winter</surname>
          </string-name>
          , et al.
          <year>2011</year>
          .
          <article-title>The Protein-Protein Interaction tasks of BioCreative III: Classification/ranking of articles and linking bio-ontology concepts to full text</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <issue>Suppl</issue>
          . 8)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>