Disease Named Entity Recognition Using Conditional Random Fields

Hidayat Ur Rahman, Lahore Leads University, Near Garden Town Kalma Chowk, Lahore, Pakistan (Hidayat.R@gmail.com)
Thomas Hahn, University of Arkansas, South University Avenue, Little Rock, AR 72204 (Thomas.F@gmail.com)
Richard Segall, Arkansas State University, Computer Inform Tech Department, State University, AR 72404-0130 (rsegall@astate.edu)

Abstract

Named Entity Recognition (NER) is a crucial component of bio-medical text mining. In this paper a method for disease NER is proposed which uses sentence- and token-level features in a Conditional Random Field (CRF) model trained on the NCBI disease corpus. The feature set used in the experiment includes orthographic, contextual, affix, n-gram, part-of-speech and word-normalization features. Using these features, our approach achieves a maximum F-score of 94% on the training set under 10-fold cross-validation for semantic labeling of the NCBI disease corpus. On the test and development sets, F-scores of 88% and 85% were obtained.

1 Introduction

The increasing amount of bio-medical literature requires more robust approaches to information retrieval and knowledge discovery, because every single day more information is published than humans can read. Bio-medical Named Entity Recognition (NER) faces unique challenges caused by the structure of bio-medical Named Entities (NEs): they contain symbols and abbreviations that encode relationships, so their length is not consistent, which is the primary reason why Bio-NER shows lower performance than general-purpose NER (Lishuang et al., 2013). Bio-NER is the most important step in knowledge extraction, whose overall aim is to identify specific concepts or categories such as gene, protein, disease and drug. The current trend in NER is toward machine learning (ML) approaches, which offer more flexibility than purely statistical and rule-based techniques. However, the performance of ML techniques depends heavily on the availability of sufficient training data to adequately train the classifiers (Krallinger et al., 2011). In this article, Bio-NER for disease names is carried out to handle the challenges of boundary detection and entity classification using Conditional Random Fields (CRFs). The model uses an enriched feature set that includes boundary-detection features such as word normalization, affixes, orthographic and part-of-speech (POS) features; for semantic labeling, features such as n-grams and contextual features are used.

2 Methodology

For disease NER our methodology follows the traditional machine learning approach; Figure 1 depicts the work-flow. First, raw text is obtained from the training, test and development sets, and pre-processing removes characters and symbols such as the underscore character, full stop, etc. After pre-processing, the features described in Section 2.1 are extracted and fed into a sequential CRF as described in Section 2.2, yielding structured output in the form of annotated named entities. This section provides details of feature extraction and classification.

Figure 1: Flow chart of the proposed system

2.1 Feature Set

Feature extraction plays a vital role in the classification accuracy of a machine learning classifier, and hence of the NER system: selecting a relevant feature set improves the classification performance of Bio-NER. Table 1 lists the features used for Bio-NER, with short descriptions.

Feature               Description
Word normalization    Stemmed form of the named entity
Contextual features   w(i-2), w(i-1), w(i), w(i+1), w(i+2)
POS                   POS(w-2), POS(w-1), POS(w), POS(w+1), POS(w+2)
Orthographic          Uppercase, lowercase, title, hyphen, alphanumeric, etc.
Word n-grams          w(i-2)/w(i-1), w(i-1)/w(i), w(i)/w(i+1), w(i+1)/w(i+2)
POS n-grams           POS(w-2), POS(w-1), POS(w), POS(w+1), POS(w+2)
Prefix                PREFIX(w(i))
Suffix                SUFFIX(w(i))

Table 1: Feature set for named entity recognition

Word Normalization

Word normalization reduces different forms of words, such as nouns, adjectives and verbs, to their root form. The Porter stemmer has been used to reduce disease names to their roots. A few examples of disease names reduced with the Porter stemmer:

• Colorectal cancer – colorect cancer
• Endometrial cancer – endometri cancer
• Alzheimer disease – alzheim diseas
• Neurological disease – neurolog diseas
• Arthritis – arthriti

Orthographic Features

Orthographic features are related to the orthography of the text, such as capitalization, digits, numerics, single capitals, all capitals and punctuation. Four of the orthographic features used in our model concern capitalization:

• ALLCAPS: set to true if all letters in a given token are capitals; examples include DMD, BMD, FD, APC, FAP and HDD.
• TITLE: set to true if the first letter of a token is capitalized, as in Alzheimer disease, Huntington disease, and Combined genetic deficiency of C6 and C7.
• LOW: set to true if all letters in a given word are lower case, e.g. myotonic dystrophy, idiopathic dilated cardiomyopathy, and facial lesions.
• MIXED: set to true if a given sequence of words contains both upper and lower case, such as DMD defects, hypo myelination of the PNS, and deficiency of active AVP.
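To make the casing-related flags above (ALLCAPS, TITLE, LOW, MIXED) concrete, a minimal per-token sketch is given below. This is our own reconstruction, not the authors' implementation; the function name and the exact rules are assumptions, and note that the paper defines MIXED over a sequence of words, whereas this sketch tests a single token.

```python
def casing_flags(token: str) -> dict:
    """Boolean casing features for one token, mirroring the ALLCAPS, TITLE,
    LOW and MIXED flags described above (a reconstruction sketch)."""
    has_upper = any(c.isupper() for c in token)
    has_lower = any(c.islower() for c in token)
    return {
        "ALLCAPS": token.isalpha() and token.isupper(),           # e.g. "DMD"
        "TITLE":   token[:1].isupper() and token[1:].islower(),   # e.g. "Alzheimer"
        "LOW":     token.isalpha() and token.islower(),           # e.g. "myotonic"
        "MIXED":   (has_upper and has_lower
                    and not (token[:1].isupper() and token[1:].islower())),  # e.g. "vWD"
    }
```

For a multi-word mention such as "type IIA vWD", the flags would be computed per token and then combined with the window features of Table 1.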
Such features are very effective in boundary detection (Collier and Takeuchi, 2004). Eleven orthographic features are used in our model in total; the remaining seven are:

• IDASH: whether a token contains an inner dash, such as A-T, G6PD-deficient, Palizaeus-Merzbacher disease.
• 2IDASH: set to true if the IDASH count equals 2, e.g. X-linked Emery-Dreifuss muscular dystrophy, Borjeson-Forssman-Lehmann.
• BRACKS: a bracket is contained within a token; an example is hypoxanthine phosphoribosyl transferees [HPRT] deficiency.
• GREEKS: a Greek numeral such as I, II, III or IV is contained within a token, e.g. type IIA vWD, Type II ALD, type II Gaucher disease, type II GD and type III GD.
• SLASH: the character / is contained within a multi-word token, such as cleft lip/palate, CL/P, breast and/ovarian cancers, glucose/galactose malabsorption.
• ALPNUM: a given word contains both numeric and alphabetic characters, like abnormality of CYP27, C6 deficiency, achondrogensis 1B.
• PARN: a multi-word token contains parentheses, such as Arginine vasopressin (AVP) deficiency, palmoplanter keratoderma (PPK) conditions, sporadic (nonhereditary) ovarian cancers.

Contextual Features

Contextual features refer to the words preceding and following the NEs. They are the most important features in this experiment for the semantic labeling of disease names. Four contextual features are selected: the two words preceding and the two words following the named entity. For example, in "bactracin in colon carcinoma loss cells", colon carcinoma is the named entity, while "bactracin in" and "loss cells" are the two preceding and two following words.

Ngrams

N-grams are sequences of n tokens or words. The most common n-gram is the uni-gram, which contains a single token; bi-grams and tri-grams contain two and three tokens respectively. In this experiment uni-grams and bi-grams have been used. All digits within a word are replaced with d; e.g. the uni-gram of 33 is dd and the uni-gram of nt943 is ntddd. Bigram examples are ALD/Eighteen, skin tumor/caused, APC/protein, breast or ovarian cancer/novel, etc.

Part of Speech (POS) Tags

POS tags are helpful in defining the boundary of a phrase; the inclusion of POS features has been advocated by Kazama et al. (2002). Our experiment includes POS tags of the contextual features and of the bigrams. Adding POS tags to our feature set boosts the performance of the classifier, as shown in Table 3.

Affixes

Prefix and suffix features showed good performance in the recognition of NEs in this experiment. In Kazama et al. (2002) the authors collected the most frequent suffixes and prefixes from the training data. A prefix or suffix of length n consists of the first or last n characters of a token, respectively (Zhou and Su, 2002). In our model all combinations n = 1 through 4 have been used to boost performance. The prefixes of the word "tumour" are t, tu, tum and tumo; the suffixes of the same word are r, ur, our and mour. Besides the contextual features, affixes yielded improvements in the overall performance, as shown in Table 3.

2.2 Conditional Random Fields (CRF)

The CRF is a probabilistic model for labeling sequential data and is widely used for POS tagging and NER (Huang et al., 2007). CRFs have several advantages over the Hidden Markov Model (HMM) and the Support Vector Machine (SVM); in particular, they accommodate rich, overlapping feature sets through conditional probability. Given a sequence X = x_1, x_2, ..., x_n and its labels Y = y_1, y_2, ..., y_n, the CRF defines the conditional probability

    P(Y | X) ∝ exp( w^T f(y_n, y_{n-1}, x) )

(Sutton and McCallum, 2011), where w = (w_1, w_2, ..., w_M)^T is a weight vector with one weight per feature, and f(y_n, y_{n-1}, x) = (f_1(y_n, y_{n-1}, x), f_2(y_n, y_{n-1}, x), ..., f_M(y_n, y_{n-1}, x))^T is the vector of M feature functions. The weight vector is obtained using the L-BFGS method. In our experiment the CRFsuite implementation has been used through its Python API.

3 Experimental Setup

3.1 Dataset

Our experiment is based on the National Center for Biotechnology Information (NCBI) disease corpus, which is freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. The NCBI corpus includes 793 abstracts, which consist of 2783 sentences and a total of 6900 disease names (Dogan and Lu, 2012). Disease names are annotated according to the following criteria: mentions that describe a family of specific diseases are annotated as Disease Class, e.g. autosomal recessive disease, whereas text referring to a specific disease is annotated as Specific Disease, such as Diastrophic dysplasia. Strings referring to more than one disease name are annotated as Composite Mention; for example, "Duchene and Becker muscular dystrophy" contains two disease mentions and is therefore categorized as a Composite Mention. Certain disease mentions are used as modifiers for other concepts: a string may denote a disease name without being a noun phrase, in which case it is annotated as Modifier, e.g. colorectal cancer. Table 2 shows the distribution of disease names over the training, test and development sets.

Classes             Train set   Test set   Dev set
Modifier            1292        264        218
Specific Disease    2959        556        409
Composite Mention   116         20         37
Disease Class       781         121        127

Table 2: Dataset used in the experiment

3.2 Classification and Feature Selection

Table 3 shows the contribution of the features and their effect on the performance of the CRF. The feature set is divided into Contextual (Cc), Normalized (Nm), N-gram, Affix (Ax), part-of-speech (POS) and Orthographic (O) features. Performance evaluation has been carried out using precision, recall and F-score, and the results in Table 3 are based on 10-fold cross-validation on the training set. Orthographic features were taken as the benchmark, resulting in an F-score of 0.53, the lowest reported in this experiment. Adding the normalized features increased the F-score by 21%, and the further addition of POS tags increased performance by another 12%. With the addition of n-gram features the overall F-score reached 0.91; finally, with the addition of affixes, the final F-score obtained is 0.94. Compared to other state-of-the-art Bio-NER systems, such as BANNER, our system achieves a higher F-score under 10-fold cross-validation on the training set, due to the selection of good features for disease NER.

Features                     P      R      F
O                            0.54   0.62   0.53
O+Nm                         0.77   0.76   0.74
O+Nm+POS                     0.87   0.87   0.86
O+Nm+POS+Ngram               0.92   0.92   0.91
O+Nm+POS+Ngram+Cc            0.92   0.92   0.92
O+Nm+POS+Ngram+Cc+Affixes    0.94   0.94   0.94

Table 3: Combination of different features

4 Result and Discussion

For result visualization we have plotted the F-score of the individual classes. Figure 2 plots the F-score for each dataset; in the figure, DC denotes Disease Class, CM Composite Mention, SD Specific Disease and MD Modifier. Figure 2 shows that the highest F-scores on the training, test and development sets are reported for Modifier, followed by Specific Disease; the lowest F-score is shown by Composite Mention, followed by Disease Class. One reason for the relatively poor performance on Composite Mention is the small number of training samples compared with Specific Disease and Modifier, which each exceed 1000 training samples. The relatively poor performance on Disease Class is likewise explained by its second-smallest training sample, since the performance of machine-learning-based techniques depends heavily on the number of training samples.

Figure 2: F-score comparison of the training, test and development sets

5 Conclusion

This paper presents a machine learning approach for human disease NER using the NCBI disease corpus. The system takes advantage of a rich feature set which helps to represent and distinguish related concepts and categories, using simple features (orthographic, contextual, affixes, bigrams, part of speech and normalized tokens) without exploiting features such as head nouns or dictionaries. The model achieves state-of-the-art performance for semantic labeling of named entities on the NCBI disease corpus. Each feature set represents some knowledge about the named entity; hence, in order to evaluate the overall benefit of each feature, all possible combinations of feature additions need to be considered.

References

Sutton, C. and McCallum, A. 2011. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 267–373.

Dogan, R. Islamaj and Lu, Z. 2012. An improved corpus of disease mentions in PubMed citations. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, ACL, 91–99.

Lishuang, L., Fan, W. and Huang, D. 2013. A two-phase Bio-NER system based on integrated classifiers and multiagent strategy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(4), 897–904.

Krallinger, M., Vazquez, M., Leitner, F., Salgado, D., Chatr-Aryamontri, A., Winter, A., et al. 2011. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12(Suppl. 8).

Collier, N. and Takeuchi, K. 2004.
Comparison of character-level and part of speech features for name recognition in biomedical texts. Journal of Biomedical Informatics, 36.

Ratinov, L. and Roth, D. 2009. Design challenges and misconceptions in named entity recognition. Proceedings of the Thirteenth Conference on Computational Natural Language Learning, ACL.

Kazama, J., Makino, T., Ohta, Y. and Tsujii, J. 2002. Tuning support vector machines for biomedical named entity recognition. Proceedings of the Workshop on NLP in the Biomedical Domain, 1–8.

Zhou, G. and Su, J. 2002. Named entity recognition using an HMM-based chunk tagger. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 473–480.

Huang, H.-S., Lin, Y.-S., Lin, K.-T., Kuo, C.-J., Chang, Y.-M., Yang, B.-H., Chung, I.-F. and Hsu, C.-N. 2007. High-recall gene mention recognition by unification of multiple background parsing models. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.

Klinger, R., Friedrich, C. M., Fluck, J. and Hofmann-Apitius, M. 2007. Named entity recognition with combinations of conditional random fields. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.
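As a closing illustration of the contextual-window and digit-normalized n-gram features described in Section 2.1, the sketch below shows how such features might be assembled for one token position. It is our own reconstruction, not the authors' code; the function name, padding symbols and feature keys are assumptions.

```python
def token_features(tokens, i):
    """Contextual window w[-2]..w[2] plus digit-normalized word bigrams for
    position i, loosely following Section 2.1 (a reconstruction sketch)."""
    def norm(tok):
        # Replace every digit with 'd', as in the paper's examples:
        # "33" -> "dd", "nt943" -> "ntddd".
        return "".join("d" if c.isdigit() else c for c in tok)

    pad = ["<s>", "<s>"] + list(tokens) + ["</s>", "</s>"]
    j = i + 2  # index of token i in the padded sequence
    feats = {f"w[{k}]": norm(pad[j + k]) for k in range(-2, 3)}
    # Word bigrams over adjacent positions, e.g. w[-1]/w[0] and w[0]/w[1].
    feats["bigram[-1,0]"] = norm(pad[j - 1]) + "/" + norm(pad[j])
    feats["bigram[0,+1]"] = norm(pad[j]) + "/" + norm(pad[j + 1])
    return feats
```

For the sentence fragment used in the Contextual Features subsection, token_features(["bactracin", "in", "colon", "carcinoma", "loss", "cells"], 2) yields the window bactracin/in before colon and carcinoma/loss after it, together with the bigram colon/carcinoma, with digits already replaced by d.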