One tagger, many uses
Illustrating the power of ontologies in dictionary-based named entity recognition

Lars Juhl Jensen
Novo Nordisk Foundation Center for Protein Research
Faculty of Health and Medical Sciences, University of Copenhagen
Copenhagen, Denmark
lars.juhl.jensen@cpr.ku.dk

Abstract—Automatic annotation of text is an important complement to manual annotation, because the latter is highly labour intensive. We have developed a fast dictionary-based named entity recognition (NER) system and addressed a wide variety of biomedical problems by applying it to text from many different sources. We have used this tagger both in real-time tools to support curation efforts and in pipelines for populating databases through bulk processing of entire Medline, the open-access subset of PubMed Central, NIH grant abstracts, FDA drug labels, electronic health records, and the Encyclopedia of Life. Despite the simplicity of the approach, it typically achieves 80–90% precision and 70–80% recall. Many of the underlying dictionaries were built from open biomedical ontologies, which further facilitate integration of the text-mining results with evidence from other sources.

Keywords—named entity recognition; software; dictionaries

I. INTRODUCTION

Named entity recognition (NER) is a fundamental task in biomedical text mining and can benefit greatly from the use of ontologies. This is especially true for dictionary-based NER methods, which with a good ontology at hand can be quickly adapted to a new task without the need for a manually curated corpus for training.

II. SOFTWARE IMPLEMENTATION

The core of our NER system is a highly optimized dictionary-based tagging engine implemented in C++ [1]. The core tagger makes use of a custom hashing function to process thousands of PubMed abstracts per second with a single CPU thread. It is furthermore inherently thread-safe, allowing for perfect scalability in multi-threaded use, and is available both as a command-line tool and as a Python module that was generated in part by the Simplified Wrapper and Interface Generator (SWIG). For real-time applications, we have developed a multi-threaded HTTP server that utilizes this Python module to expose the tagger as a RESTful web service, which includes support for the Open Annotation model [2].

The ability to perform real-time tagging enabled us to develop the Reflect [3] and EXTRACT [4] tools, which help curators identify and extract terms from any web page. EXTRACT was evaluated favourably in the interactive annotation track of BioCreative V [4].

III. DICTIONARIES AND APPLICATION AREAS

Software is only one half of a NER system; the other half is the dictionary with all the names that the software matches against the text. When adapting the NER system to a new biomedical application area, the main work required is the construction of a suitable high-quality dictionary and a blacklist of names not to be tagged. The latter is created through manual inspection of the most frequently occurring dictionary names in a large text corpus.

A. Molecular Entities

NER and normalization of genes and proteins have been the subject of several BioCreative tasks over the years, most recently in BioCreative III [5]. This was also one of the very first uses of the tagger, which is a key component of the text-mining pipeline in the STRING database [6]. The underlying dictionary of gene/protein names is based on Ensembl [7] and RefSeq [8], which were expanded with additional synonyms from UniProt [9]. An older version of the system achieved F-scores of 91% and 66% for recognition and normalization of yeast and fly genes, respectively [3]. This NER method is also heavily used within the Illuminating the Druggable Genome program to assess how well studied drug targets are, based on both publications and NIH RePORTER funding data.

Identification of small-molecule chemical compounds in text was a task in BioCreative V [10]. The tagger is also used for this in the STITCH database [11], which relies on a dictionary constructed from a filtered version of PubChem [12]. STRING and STITCH both employ statistical co-occurrence as well as natural language processing (NLP) for subsequent extraction of relations between the identified molecular entities from Medline and the open-access subset of PubMed Central. These relations are integrated with evidence from many other sources, including experimental data and manually curated pathway databases.

B. Protein Localization and Expression

The COMPARTMENTS [13] and TISSUES [14] databases take a very similar integrative approach to associate proteins with their subcellular localizations and tissue expression patterns. To this end, we constructed dictionaries based on the cellular component part of Gene Ontology [15] and the Brenda Tissue Ontology [16], respectively. Both ontologies were well populated with synonyms, which were automatically expanded to construct plural and adjective forms. The resulting resources can be used to filter protein networks from STRING to include only proteins from certain subcellular and/or tissue contexts. This is useful, for example, in prediction of host–pathogen interactions.

C. Diseases and Adverse Drug Reactions

The DISEASES database [17] uses the very same approach to extract disease–gene associations from Medline abstracts. In this case, we run the tagger with a dictionary based on Disease Ontology [18]; this NER approach has been shown to compare favourably with other methods [19].

When diseases are treated with drugs, patients may experience adverse drug reactions (ADRs). The SIDER database [20] extracts information on known ADRs from FDA drug labels using an NLP system, which uses the tagger Python module to recognize names from the Unified Medical Language System (UMLS) Metathesaurus for all terms of the Medical Dictionary for Regulatory Activities (MedDRA). In a separate study, we showed that it is also possible to identify ADRs in the clinical narrative text of electronic health records, which required the construction of a separate ADR dictionary in Danish [21]. The latter achieved 89% precision and 75% recall.

D. Organisms and Environments/Habitats

The applications described so far all fall within molecular biomedicine; however, the tagger has proven equally useful within biodiversity and ecology. Specifically, we have created dictionaries of taxa [1] and environmental descriptors [22] from the NCBI Taxonomy [23] and the Environment Ontology [24], respectively. This achieved 83.9% precision and 72.6% recall for species [1] and 87.8% precision and 77.0% recall for environments [22]. We use these dictionaries with the tagger to extract structured information on habitats of organisms based on their textual descriptions in the Encyclopedia of Life [22].

Most recently we participated in the related 2016 BioNLP shared task on bacterial biotopes, specifically NER of bacteria and biotopes. To this end we implemented rules to refine the match boundaries and normalization of bacterial names and compiled a biotope dictionary by extending the OntoBiotope habitat ontology with additional synonyms from other relevant ontologies [25].

IV. CONCLUSIONS

Despite its simplicity, dictionary-based NER is a powerful approach that in many cases can give comparable performance to more advanced methods, if care is taken when constructing the dictionaries. The dictionary-based approach is particularly attractive in the biomedical domain due to the many ontologies that provide excellent starting points for constructing dictionaries.

This work was in part funded by the Novo Nordisk Foundation (NNF14CC0001) and the National Institutes of Health (U54 CA189205-01).

REFERENCES

[1] E. Pafilis, et al., “The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text,” PLoS One, vol. 8, e65390, 2013.
[2] S. Pyysalo, et al., “Sharing annotations better: RESTful Open Annotation,” Proc. ACL-IJCNLP, pp. 91–96, 2015.
[3] E. Pafilis, et al., “Reflect: augmented browsing for the life scientist,” Nat. Biotechnol., vol. 27, pp. 508–510, 2009.
[4] E. Pafilis, et al., “EXTRACT: Interactive extraction of environment metadata and term suggestion for metagenomic sample annotation,” Proc. BioCreative Challenge Evaluation Workshop, pp. 384–395, 2015.
[5] Z. Lu, et al., “The gene normalization task in BioCreative III,” BMC Bioinformatics, vol. 12(S8), S2, 2011.
[6] D. Szklarczyk, et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Res., vol. 43, pp. D447–D452, 2015.
[7] F. Cunningham, et al., “Ensembl 2015,” Nucleic Acids Res., vol. 43, pp. D662–D669, 2015.
[8] T. Tatusova, et al., “RefSeq microbial genomes database: new representation and annotation strategy,” Nucleic Acids Res., vol. 42, pp. D553–D559, 2014.
[9] UniProt Consortium, “UniProt: a hub for protein information,” Nucleic Acids Res., vol. 43, pp. D204–D212, 2015.
[10] C.-H. Wei, et al., “Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task,” vol. 2016, baw032, 2016.
[11] M. Kuhn, et al., “STITCH 4: integration of protein-chemical interactions with user data,” Nucleic Acids Res., vol. 42, pp. D401–D407, 2014.
[12] E. Bolton, Y. Wang, P. A. Thiessen, and S. H. Bryant, “PubChem: integrated platform of small molecules and biological activities,” Annu. Rep. Comput. Chem., vol. 4, pp. 217–241, 2008.
[13] J. X. Binder, et al., “COMPARTMENTS: unification and visualization of protein subcellular localization evidence,” Database, vol. 2014, bau012, 2014.
[14] A. Santos, et al., “Comprehensive comparison of large-scale tissue expression datasets,” PeerJ, vol. 3, e1054, 2015.
[15] M. Ashburner, et al., “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium,” Nat. Genet., vol. 25, pp. 25–29, 2000.
[16] A. Chang, et al., “BRENDA in 2015: exciting developments in its 25th year of existence,” Nucleic Acids Res., vol. 43, pp. D439–D446, 2015.
[17] S. Pletscher-Frankild, et al., “DISEASES: text mining and data integration of disease-gene associations,” Methods, vol. 74, pp. 83–89, 2015.
[18] W. A. Kibbe, et al., “Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data,” Nucleic Acids Res., vol. 43, pp. D1071–D1078, 2015.
[19] S. ElShal, et al., “A comprehensive comparison of two MEDLINE annotators for disease and gene linkage: sometimes less is more,” Lecture Notes in Computer Science, vol. 9656, pp. 765–778, 2016.
[20] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork, “The SIDER database of drugs and side effects,” Nucleic Acids Res., vol. 44, pp. D1075–D1079, 2016.
[21] R. Eriksson, et al., “Dictionary construction and identification of possible adverse drug events in Danish clinical narrative text,” J. Am. Med. Inform. Assoc., vol. 20, pp. 947–953, 2013.
[22] E. Pafilis, et al., “ENVIRONMENTS and EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life,” Bioinformatics, vol. 31, pp. 1872–1874, 2015.
[23] S. Federhen, “Type material in the NCBI Taxonomy Database,” Nucleic Acids Res., vol. 43, pp. D1086–D1098, 2015.
[24] P. L. Buttigieg, et al., “The environment ontology: contextualising biological and biomedical entities,” J. Biomed. Semant., vol. 4, p. 43, 2013.
[25] H. V. Cook, E. Pafilis, and L. J. Jensen, “A dictionary- and rule-based system for identification of bacteria and habitats in text,” to appear in Proc. BioNLP Shared Task Workshop, 2016.
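
As an illustration of the dictionary-plus-blacklist matching described in Section III, the following minimal Python sketch tags text against a toy dictionary. It is not the C++ tagger of [1]: Python's built-in dict stands in for the custom hashing function, and the function name tag, the tokenization rule, the four-token limit, and all dictionary entries and identifiers are hypothetical choices made for this example only.

```python
import re

# Hypothetical toy dictionary: lower-cased name -> entity identifier.
# The identifiers are placeholders invented for this illustration.
DICTIONARY = {
    "cdc28": "GENE:0001",
    "cyclin-dependent kinase 1": "GENE:0001",
    "mitochondrion": "COMPARTMENT:0001",
    "was": "GENE:0002",  # a symbol that collides with a common English word
}

# Dictionary names that should not be tagged (assumed to have been found
# by inspecting the most frequent matches in a large corpus).
BLACKLIST = {"was"}

# Assumed tokenization: runs of word characters, optionally joined by hyphens.
TOKEN = re.compile(r"\w+(?:-\w+)*")
MAX_TOKENS = 4  # assumed upper bound on the length of a name, in tokens


def tag(text, dictionary=DICTIONARY, blacklist=BLACKLIST):
    """Return a list of (start, end, matched_text, identifier) tuples."""
    tokens = list(TOKEN.finditer(text))
    matches = []
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate first (left-most, longest match).
        for n in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            span = tokens[i:i + n]
            candidate = " ".join(t.group().lower() for t in span)
            if candidate in blacklist:
                continue  # blacklisted names are never tagged
            if candidate in dictionary:
                start, end = span[0].start(), span[-1].end()
                matches.append((start, end, text[start:end], dictionary[candidate]))
                step = n  # skip past the tokens consumed by this match
                break
        i += step
    return matches
```

Calling tag("Cdc28 was found in the mitochondrion.") returns matches for "Cdc28" and "mitochondrion" only; the word "was" is present in the toy dictionary but is suppressed by the blacklist. Trying longer token spans first means that a multi-word name such as "Cyclin-dependent kinase 1" is matched as a whole before any shorter name it might contain.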