Large-scale Semantic Indexing with Biomedical Ontologies

Chih-Hsuan Wei, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, Maryland, USA
chih-hsuan.wei@nih.gov, robert.leaman@nih.gov, zhiyong.lu@nih.gov

Abstract— We introduce PubTator, a web-based application that enables large-scale semantic indexing and automatic concept recognition in biomedical ontologies. PubTator was not only formally evaluated and top-rated in BioCreative, but has also been widely adopted by the scientific community around the world, supporting both research projects and real-world applications in biocuration, crowdsourcing and translational bioinformatics.

Keywords—PubTator; TaggerOne; Text Mining; Biomedical Ontologies

I. INTRODUCTION

With over 26 million articles in PubMed, the biomedical literature is a knowledge-rich resource and forms an important foundation for future research. However, the rapid expansion of the scientific literature and the increasingly cross-disciplinary nature of biomedical research are making it more difficult than ever for individual researchers to find and assimilate all of the relevant information from the literature. Research in automated text processing is of growing importance for relieving today's information overload problem. Hence, processing the biomedical literature with automated tools becomes more important as its growth accelerates.

We present PubTator [1], a web-based application that indexes the ever-growing biomedical literature with ontological concepts in biomedicine. PubTator features a PubMed-like interface and is equipped with multiple high-performing text mining algorithms (e.g. DNorm for disease concepts in MeSH or SNOMED-CT) to ensure the quality of its text-mined results over the entire set of articles in PubMed. PubTator was first developed as an interactive text mining system through our participation in BioCreative (see [2] for more details and related work). More recently, we created RESTful Web Services [3] for PubTator to further increase its scalability and ease its use by non-experts in text mining, allowing its users to focus on results rather than technical methodology.

II. SYSTEM DESCRIPTION

A. Concept Recognition using PubTator

PubTator currently utilizes five state-of-the-art named entity recognition and normalization tools to locate and identify important biomedical entities. Specifically, the entity types currently supported and their respective systems with F-scores are: genes and proteins (GNormPlus [4] - 86.74%), diseases (DNorm [5] - 80.90%), chemicals (tmChem [6] - 87.51%), species (SR4GN [7] - 85.42%) and genetic variants (tmVar [8] - 91.39%).

Figure 1. Screenshot of a PubMed article in PubTator with concepts and relations highlighted in color.

While the entity types currently covered include those most commonly searched [9], our most recent work, TaggerOne, is trainable to identify arbitrary entity types, requiring only annotated training data and a corresponding lexicon [10]. TaggerOne employs a novel machine learning model to address named entity recognition (NER) and normalization jointly, reducing cascading errors and enabling the NER task to directly exploit the lexical information provided by the normalization. TaggerOne achieves state-of-the-art performance on diseases (NCBI Disease corpus [11]) and chemicals (BioCreative V CDR corpus [12]) and is being used to tag anatomy terms (including organs, tissues and cellular components) in PubMed articles so they can be mapped to the corresponding concept identifiers in multiple biomedical ontologies available at http://www.obofoundry.org/.
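To make the notion of normalization against a lexicon concrete, the following is a minimal sketch of mapping a recognized mention to a concept identifier by exact, case-insensitive lookup. It is not TaggerOne's method, which learns a joint model for recognition and normalization [10]. The identifiers (EX:0001, EX:0002), the lexicon entries and the function names are all invented for illustration.

```python
# Hypothetical lexicon: the identifiers and entries below are invented for
# illustration and do not come from any real ontology.
LEXICON = {
    "EX:0001": ["kidney", "renal"],
    "EX:0002": ["liver", "hepatic"],
}


def build_index(lexicon):
    """Invert the lexicon into a lower-cased term -> concept identifier map."""
    index = {}
    for concept_id, terms in lexicon.items():
        for term in terms:
            index[term.lower()] = concept_id
    return index


def normalize(mention, index):
    """Return the concept identifier for a mention, or None if it is unknown."""
    return index.get(mention.strip().lower())


index = build_index(LEXICON)
print(normalize("Renal", index))   # EX:0001
print(normalize("spleen", index))  # None
```

An exact-match lookup like this fails on any variation absent from the lexicon, which is one reason a lexical resource needs to list each concept's terms and their relevant variations.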
Recognizing ontological concepts requires the creation of a lexical resource identifying the concepts desired, their terms and relevant variations. We recently proposed a modification of TaggerOne to automatically identify inconsistencies that arise when creating a single lexical resource from multiple knowledge resources, including ontologies, and then resolve them semi-automatically. The proposed method actively learns a model to identify identical concepts from separate resources, with preliminary results showing that the model successfully identifies both synonymous tokens (e.g. “kidney” and “renal”) and contrastive terms (“dominant” vs. “recessive”).

B. Scalability and interoperability

Large-scale use of PubTator or open-source tools requires a significant investment in infrastructure and maintenance time. These barriers to entry reduce the ability of individual researchers to explore applying text mining to problems in their research area and consequently impair the continued adoption of text mining tools. Web services provide on-demand access to software tools through the Internet using straightforward interfaces and data formats. Providing text mining tools as web services therefore lowers the barrier to use for end users and bioinformatics researchers not working specifically in text mining, allowing free exploration and the ability to focus on results rather than methodology.

Therefore, we recently developed NCBI text-mining web services on top of PubTator using standard HTTP method calls (often known as RESTful services), which allow instant retrieval of pre-annotated PubTator results via HTTP GET. To improve system interoperability, we support multiple data formats including BioC/XML [13], PubTator/TXT [1] and PubAnnotation/JSON [14]. To simplify programmatic access to our web services, we also provide sample client code in Perl, Python and Java.

C. Evaluation & Usage

PubTator was formally assessed by a group of external evaluators during the BioCreative Interactive Text Mining challenge task, where it was top-rated in all categories from system design to learnability to usability [15].

More recently, through collaboration with curation groups, PubTator has been successfully integrated into the production pipeline of multiple curation databases including SwissProt [16] and the CDC’s human genome epidemiology knowledge base, HuGE Navigator [17].

Furthermore, since the inception of PubTator Web Services, millions of requests have been made by the scientific community from around the world. From interactions with some of our users, we learned that the results of our text-mining services are being used in many different research areas in bioinformatics. For instance, our web services are used to provide initial annotations for the mark2cure crowdsourcing project (https://mark2cure.org/).

III. CONCLUSIONS & FUTURE WORK

In the future, we plan to expand the automatic concept recognition to additional biomedical ontologies and include their results in PubTator. Text mining open-access full-length articles in PMC for key ontological concepts in real-world applications (e.g. computer-assisted biocuration) would be another exciting opportunity to pursue.

ACKNOWLEDGMENT

This research is supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.

REFERENCES

[1] C.-H. Wei, H.-Y. Kao, and Z. Lu, "PubTator: a web-based text mining tool for assisting biocuration," Nucleic Acids Research, vol. 41, 2013, pp. W518-W522.
[2] C. N. Arighi, B. Carterette, K. B. Cohen, M. Krallinger, W. J. Wilbur, P. Fey, et al., "An overview of the BioCreative 2012 Workshop Track III: interactive text mining task," Database, 2013, pp. bas056.
[3] C.-H. Wei, R. Leaman, and Z. Lu, "Beyond accuracy: Creating interoperable and scalable text mining web services," Bioinformatics, 2016.
[4] C.-H. Wei, H.-Y. Kao, and Z. Lu, "GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains," BioMed Research International, 2015, pp. 918710.
[5] R. Leaman, R. Islamaj Doğan, and Z. Lu, "DNorm: Disease name normalization with pairwise learning-to-rank," Bioinformatics, vol. 29, 2013, pp. 2909-2917.
[6] R. Leaman, C.-H. Wei, and Z. Lu, "tmChem: a high performance approach for chemical named entity recognition and normalization," Journal of Cheminformatics, vol. 7, 2015, pp. S3.
[7] C.-H. Wei, H.-Y. Kao, and Z. Lu, "SR4GN: a species recognition software tool for gene normalization," PLoS One, vol. 7, 2012, pp. e38460.
[8] C.-H. Wei, B. R. Harris, H.-Y. Kao, and Z. Lu, "tmVar: A text mining approach for extracting sequence variants in biomedical literature," Bioinformatics, vol. 29, 2013, pp. 1433-1439.
[9] R. I. Doğan, G. C. Murray, A. Névéol, and Z. Lu, "Understanding PubMed user search behavior through log analysis," Database, 2009, pp. bap018.
[10] R. Leaman and Z. Lu, "TaggerOne: Joint Named Entity Recognition and Normalization with Semi-Markov Models," Bioinformatics, in press, 2016.
[11] R. I. Doğan, R. Leaman, and Z. Lu, "NCBI disease corpus: A resource for disease name recognition and concept normalization," Journal of Biomedical Informatics, vol. 47, 2014, pp. 1-10.
[12] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, et al., "BioCreative V CDR task corpus: a resource for chemical disease relation extraction," Database, 2016, pp. baw068.
[13] D. C. Comeau, R. I. Doğan, P. Ciccarese, K. B. Cohen, M. Krallinger, F. Leitner, et al., "BioC: a minimalist approach to interoperability for biomedical text processing," Database, 2013, pp. bat064.
[14] J.-D. Kim, K. B. Cohen, and J.-J. Kim, "PubAnnotation-query: a search tool for corpora with multi-layers of annotation," BMC Proceedings, vol. 9, 2015, pp. A3.
[15] C.-H. Wei, B. R. Harris, D. Li, T. Z. Berardini, E. Huala, H.-Y. Kao, et al., "Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts," Database, 2012, pp. bas041.
[16] The UniProt Consortium, "UniProt: a hub for protein information," Nucleic Acids Research, vol. 43, 2015, pp. D204-D212.
[17] W. Yu, M. Gwinn, M. Clyne, A. Yesupriya, and M. J. Khoury, "A navigator for human genome epidemiology," Nature Genetics, vol. 40, 2008, pp. 124-125.
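As an illustration of the HTTP GET retrieval described in Section II.B, the following is a minimal client sketch using only the Python standard library. The paper does not give the service address or URL layout, so BASE_URL, the path scheme (base/PMID/format) and the function names are assumptions made for this example; the actual values should be taken from the service documentation. Only the three format names come from the text.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical address: the real service URL and path layout are not stated
# in this paper and must be replaced with the documented ones.
BASE_URL = "https://example.org/pubtator-rest"

# Format names taken from Section II.B, mapped to their serializations.
FORMATS = {"BioC": "XML", "PubTator": "TXT", "PubAnnotation": "JSON"}


def build_url(pmid, fmt="PubTator", base_url=BASE_URL):
    """Build a GET URL for one article's pre-computed annotations.

    The base/PMID/format path scheme is an assumption for illustration.
    """
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if not str(pmid).isdigit():
        raise ValueError(f"PMID must be numeric: {pmid!r}")
    return f"{base_url.rstrip('/')}/{quote(str(pmid))}/{fmt}"


def fetch_annotations(pmid, fmt="PubTator", base_url=BASE_URL, timeout=30):
    """Retrieve annotations with a plain HTTP GET and return them as text."""
    with urlopen(build_url(pmid, fmt, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; fetch_annotations would issue
# the actual request against a real service address.
print(build_url(12345, "BioC"))
```

Because the annotations are pre-computed, a client of this kind only needs to issue one GET per article and choose the serialization that suits its downstream tooling.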