=Paper= {{Paper |id=Vol-1747/BIT102_ICBO2016 |storemode=property |title=One Tagger, Many Uses: Illustrating the Power of Ontologies in Dictionary-based Named Entity Recognition |pdfUrl=https://ceur-ws.org/Vol-1747/BIT102_ICBO2016.pdf |volume=Vol-1747 |authors=Lars Juhl Jensen |dblpUrl=https://dblp.org/rec/conf/icbo/Jensen16 }} ==One Tagger, Many Uses: Illustrating the Power of Ontologies in Dictionary-based Named Entity Recognition == https://ceur-ws.org/Vol-1747/BIT102_ICBO2016.pdf
One tagger, many uses
Illustrating the power of ontologies in dictionary-based named entity recognition

Lars Juhl Jensen
Novo Nordisk Foundation Center for Protein Research
Faculty of Health and Medical Sciences, University of Copenhagen
Copenhagen, Denmark
lars.juhl.jensen@cpr.ku.dk

This work was in part funded by the Novo Nordisk Foundation
(NNF14CC0001) and the National Institutes of Health (U54 CA189205-01).

Abstract— Automatic annotation of text is an important complement to manual
annotation, because the latter is highly labour intensive. We have developed
a fast dictionary-based named entity recognition (NER) system and addressed a
wide variety of biomedical problems by applying it to text from many
different sources. We have used this tagger both in real-time tools to
support curation efforts and in pipelines for populating databases through
bulk processing of the entire Medline, the open-access subset of PubMed
Central, NIH grant abstracts, FDA drug labels, electronic health records, and
the Encyclopedia of Life. Despite the simplicity of the approach, it
typically achieves 80–90% precision and 70–80% recall. Many of the underlying
dictionaries were built from open biomedical ontologies, which further
facilitate integration of the text-mining results with evidence from other
sources.

Keywords—named entity recognition; software; dictionaries;

I. INTRODUCTION

Named entity recognition (NER) is a fundamental task in biomedical text
mining and can benefit greatly from the use of ontologies. This is especially
true for dictionary-based NER methods, which with a good ontology at hand can
be quickly adapted to a new task without the need for a manually curated
corpus for training.

II. SOFTWARE IMPLEMENTATION

The core of our NER system is a highly optimized dictionary-based tagging
engine implemented in C++ [1]. The core tagger makes use of a custom hashing
function to process thousands of PubMed abstracts per second with a single
CPU thread. It is furthermore inherently thread-safe, allowing for perfect
scalability in multi-threaded use, and is available both as a command-line
tool and as a Python module that was generated in part by the Simplified
Wrapper and Interface Generator (SWIG). For real-time applications, we have
developed a multi-threaded HTTP server that utilizes this Python module to
expose the tagger as a RESTful web service, which includes support for the
Open Annotation model [2].

The ability to perform real-time tagging enabled us to develop the Reflect
[3] and EXTRACT [4] tools, which help curators identify and extract terms
from any web page; EXTRACT was evaluated favourably in the interactive
annotation track of BioCreative V [4].

III. DICTIONARIES AND APPLICATION AREAS

Software is only one half of a NER system; the other half is the dictionary
with all the names that the software matches against the text. When adapting
the NER system to a new biomedical application area, the main work required
is the construction of a suitable high-quality dictionary and a blacklist of
names not to be tagged. The latter is created through manual inspection of
the most frequently occurring dictionary names in a large text corpus.

A. Molecular Entities

NER and normalization of genes and proteins have been the subject of several
BioCreative tasks over the years, most recently in BioCreative III [5]. This
was also one of the very first uses of the tagger, which is a key component
of the text-mining pipeline in the STRING database [6]. The underlying
dictionary of gene/protein names is based on Ensembl [7] and RefSeq [8],
which were expanded with additional synonyms from UniProt [9]. An older
version of the system achieved F-scores of 91% and 66% for recognition and
normalization of yeast and fly genes, respectively [3]. This NER method is
also heavily used within the Illuminating the Druggable Genome program to
assess how well studied drug targets are, based on both publications and NIH
RePORTER funding data.

Identification of small-molecule chemical compounds in text was a task in
BioCreative V [10]. The tagger is also used for this in the STITCH database
[11], which relies on a dictionary constructed from a filtered version of
PubChem [12]. STRING and STITCH both employ statistical co-occurrence
analysis as well as natural language processing (NLP) for subsequent
extraction of relations between the identified molecular entities from
Medline and the open-access subset of PubMed Central. These relations are
integrated with evidence from many other sources including experimental data
and manually curated pathway databases.

B. Protein Localization and Expression

The COMPARTMENTS [13] and TISSUES [14] databases take a very similar
integrative approach to associate proteins with their subcellular
localizations and tissue expression patterns. To this end, we constructed
dictionaries based on the cellular component part of Gene Ontology [15] and
the Brenda Tissue Ontology [16], respectively. Both ontologies were well
populated with synonyms, which were automatically expanded
to construct plural and adjective forms. The resulting resources can be used
to filter protein networks from STRING to include only proteins from certain
subcellular and/or tissue contexts. This is useful, for example, in
prediction of host–pathogen interactions.

C. Diseases and Adverse Drug Reactions

The DISEASES database [17] uses the very same approach to extract
disease–gene associations from Medline abstracts. In this case, we run the
tagger with a dictionary based on Disease Ontology [18]; this NER approach
has been shown to compare favourably with other methods [19].

When treating diseases with drugs, patients may experience adverse drug
reactions (ADRs). The SIDER database [20] extracts information on known ADRs
from FDA drug labels using an NLP system, which uses the tagger Python module
to recognize names from the Unified Medical Language System (UMLS)
Metathesaurus for all terms of the Medical Dictionary for Regulatory
Activities (MedDRA). In a separate study, we showed that it is also possible
to identify ADRs in the clinical narrative text of electronic health records,
which required the construction of a separate ADR dictionary in Danish [21].
The latter achieved 89% precision and 75% recall.

D. Organisms and Environments/Habitats

The applications described so far all fall within molecular biomedicine;
however, the tagger has proven equally useful within biodiversity and
ecology. Specifically, we have created dictionaries of taxa [1] and
environmental descriptors [22] from the NCBI Taxonomy [23] and the
Environment Ontology [24], respectively. With these dictionaries, the tagger
achieved 83.9% precision and 72.6% recall for species [1] and 87.8% precision
and 77.0% recall for environments [22]. We use these dictionaries with the
tagger to extract structured information on habitats of organisms based on
their textual descriptions in the Encyclopedia of Life [22].

Most recently we participated in the related 2016 BioNLP shared task on
bacterial biotopes, specifically NER of bacteria and biotopes. To this end we
implemented rules to refine the match boundaries and normalization of
bacterial names and compiled a biotope dictionary by extending the
OntoBiotope habitat ontology with additional synonyms from other relevant
ontologies [25].

IV. CONCLUSIONS

Despite its simplicity, dictionary-based NER is a powerful approach that in
many cases can give comparable performance to more advanced methods, if care
is taken when constructing the dictionaries. The dictionary-based approach is
particularly attractive in the biomedical domain due to the many ontologies
that provide excellent starting points for constructing dictionaries.

REFERENCES

[1] E. Pafilis, et al., “The SPECIES and ORGANISMS resources for fast and
     accurate identification of taxonomic names in text,” PLoS One, vol. 8,
     e65390, 2013.
[2] S. Pyysalo, et al., “Sharing annotations better: RESTful Open
     Annotation,” Proc. ACL-IJCNLP, pp. 91–96, 2015.
[3] E. Pafilis, et al., “Reflect: augmented browsing for the life scientist,”
     Nat. Biotechnol., vol. 27, pp. 508–510, 2009.
[4] E. Pafilis, et al., “EXTRACT: Interactive extraction of environment
     metadata and term suggestion for metagenomic sample annotation,” Proc.
     BioCreative Challenge Evaluation Workshop, pp. 384–395, 2015.
[5] Z. Lu, et al., “The gene normalization task in BioCreative III,” BMC
     Bioinformatics, vol. 12(S8), S2, 2011.
[6] D. Szklarczyk, et al., “STRING v10: protein-protein interaction networks,
     integrated over the tree of life,” Nucleic Acids Res., vol. 43, pp.
     D447–D452, 2015.
[7] F. Cunningham, et al., “Ensembl 2015,” Nucleic Acids Res., vol. 43, pp.
     D662–D669, 2015.
[8] T. Tatusova, et al., “RefSeq microbial genomes database: new
     representation and annotation strategy,” Nucleic Acids Res., vol. 42,
     pp. D553–D559, 2014.
[9] UniProt Consortium, “UniProt: a hub for protein information,” Nucleic
     Acids Res., vol. 43, pp. D204–D212, 2015.
[10] C.-H. Wei, et al., “Assessing the state of the art in biomedical
     relation extraction: overview of the BioCreative V chemical-disease
     relation (CDR) task,” vol. 2016, baw032, 2016.
[11] M. Kuhn, et al., “STITCH 4: integration of protein-chemical interactions
     with user data,” Nucleic Acids Res., vol. 42, pp. D401–D407, 2014.
[12] E. Bolton, Y. Wang, P. A. Thiessen, and S. H. Bryant, “PubChem:
     integrated platform of small molecules and biological activities,” Annu.
     Rep. Comput. Chem., vol. 4, pp. 217–241, 2008.
[13] J. X. Binder, et al., “COMPARTMENTS: unification and visualization of
     protein subcellular localization evidence,” Database, vol. 2014, bau012,
     2014.
[14] A. Santos, et al., “Comprehensive comparison of large-scale tissue
     expression datasets,” PeerJ, vol. 3, e1054, 2015.
[15] M. Ashburner, et al., “Gene ontology: tool for the unification of
     biology. The Gene Ontology Consortium,” Nat. Genet., vol. 25, pp. 25–29,
     2000.
[16] A. Chang, et al., “BRENDA in 2015: exciting developments in its 25th
     year of existence,” Nucleic Acids Res., vol. 43, pp. D439–D446, 2015.
[17] S. Pletscher-Frankild, et al., “DISEASES: text mining and data
     integration of disease-gene associations,” Methods, vol. 74, pp. 83–89,
     2015.
[18] W. A. Kibbe, et al., “Disease Ontology 2015 update: an expanded and
     updated database of human diseases for linking biomedical knowledge
     through disease data,” Nucleic Acids Res., vol. 43, pp. D1071–D1078,
     2015.
[19] S. ElShal, et al., “A comprehensive comparison of two MEDLINE annotators
     for disease and gene linkage: sometimes less is more,” Lecture Notes in
     Computer Science, vol. 9656, pp. 765–778, 2016.
[20] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork, “The SIDER database of
     drugs and side effects,” Nucleic Acids Res., vol. 44, pp. D1075–D1079,
     2016.
[21] R. Eriksson, et al., “Dictionary construction and identification of
     possible adverse drug events in Danish clinical narrative text,” J. Am.
     Med. Inform. Assoc., vol. 20, pp. 947–953, 2013.
[22] E. Pafilis, et al., “ENVIRONMENTS and EOL: identification of Environment
     Ontology terms in text and the annotation of the Encyclopedia of Life,”
     Bioinformatics, vol. 31, pp. 1872–1874, 2015.
[23] S. Federhen, “Type material in the NCBI Taxonomy Database,” Nucleic
     Acids Res., vol. 43, pp. D1086–D1098, 2015.
[24] P. L. Buttigieg, et al., “The environment ontology: contextualising
     biological and biomedical entities,” J. Biomed. Semant., vol. 4, p. 43,
     2013.
[25] H. V. Cook, E. Pafilis, and L. J. Jensen, “A dictionary- and rule-based
     system for identification of bacteria and habitats in text,” to appear
     in Proc. BioNLP Shared Task Workshop, 2016.
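
ILLUSTRATIVE SKETCH

The interplay between a name dictionary and a blacklist described in
Section III can be illustrated with a small Python sketch. It is not the C++
tagger of [1] and does not use its Python module: the regular-expression
tokenization, the greedy longest-match strategy, the function name tag, the
max_tokens parameter, and the identifiers in the example dictionary are all
assumptions made for illustration only.

```python
import re


def tag(text, dictionary, blacklist=frozenset(), max_tokens=5):
    """Greedy longest-match dictionary tagging (illustrative sketch only).

    dictionary maps lower-cased names to sets of identifiers; blacklist
    holds lower-cased names that must never be tagged.
    Returns a list of (start, end, matched_text, identifiers) tuples.
    """
    # Word tokens with their character offsets in the original text.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        advanced = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            key = text[start:end].lower()
            if key in dictionary and key not in blacklist:
                matches.append((start, end, text[start:end],
                                sorted(dictionary[key])))
                i += n  # continue after the matched span
                advanced = True
                break
        if not advanced:
            i += 1
    return matches


# Hypothetical dictionary; the identifiers are placeholders, not real IDs.
example_dictionary = {
    "tp53": {"protein:TP53"},
    "was": {"protein:WAS"},  # gene symbol that collides with a common word
    "nucleus": {"compartment:nucleus"},
    "cell nucleus": {"compartment:nucleus"},
}
example_blacklist = frozenset({"was"})

for match in tag("TP53 was found in the cell nucleus.",
                 example_dictionary, example_blacklist):
    print(match)
```

In this toy example the blacklist suppresses the common word "was", which
would otherwise be tagged because it is also a gene symbol, while the
longest-match rule tags "cell nucleus" as one name rather than "nucleus"
alone.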