Large-scale Semantic Indexing with Biomedical Ontologies

         Chih-Hsuan Wei                                   Robert Leaman                                   Zhiyong Lu
 National Center for Biotechnology                National Center for Biotechnology            National Center for Biotechnology
  Information (NCBI), National                     Information (NCBI), National                 Information (NCBI), National
    Library of Medicine (NLM)                        Library of Medicine (NLM)                    Library of Medicine (NLM)
     Bethesda, Maryland, USA                          Bethesda, Maryland, USA                      Bethesda, Maryland, USA
     chih-hsuan.wei@nih.gov                            robert.leaman@nih.gov                         zhiyong.lu@nih.gov


Abstract— We introduce PubTator, a web-based application that enables large-scale semantic indexing and automatic recognition of concepts from biomedical ontologies. Not only was PubTator formally evaluated and top-rated in BioCreative, it has also been widely adopted by the scientific community around the world, supporting both research projects and real-world applications in biocuration, crowdsourcing and translational bioinformatics.

Keywords—PubTator; TaggerOne; Text Mining; Biomedical Ontologies

I. INTRODUCTION

With over 26 million articles in PubMed, the biomedical literature is a knowledge-rich resource and forms an important foundation for future research. However, the rapid expansion of the scientific literature and the increasingly cross-disciplinary nature of biomedical research are making it more difficult than ever for individual researchers to find and assimilate all of the relevant information from the literature. Automated text processing is therefore of growing importance in relieving today's information overload, and becomes more important as the growth of the literature accelerates.

We present PubTator [1], a web-based application that indexes the ever-growing biomedical literature with ontological concepts in biomedicine. PubTator features a PubMed-like interface and is equipped with multiple high-performing text mining algorithms (e.g. DNorm for disease concepts in MeSH or SNOMED-CT) to ensure the quality of its text-mined results over the entire set of articles in PubMed. PubTator was first developed as an interactive text mining system through our participation in BioCreative (see [2] for more details and related work). More recently, we created RESTful Web Services [3] for PubTator to further increase its scalability and ease its use by non-experts in text mining, allowing its users to focus on results rather than technical methodology.

II. SYSTEM DESCRIPTION

A. Concept Recognition using PubTator

PubTator currently utilizes five state-of-the-art named entity recognition and normalization tools to locate and identify important biomedical entities. Specifically, the entity types currently supported and their respective systems with F-scores are: genes and proteins (GNormPlus [4] - 86.74%), diseases (DNorm [5] - 80.90%), chemicals (tmChem [6] - 87.51%), species (SR4GN [7] - 85.42%) and genetic variants (tmVar [8] - 91.39%).

Figure 1. Screenshot of a PubMed article in PubTator with concepts and relations highlighted in color.

While the entity types currently covered include those most commonly searched [9], our most recent work, TaggerOne, is trainable to identify arbitrary entity types, requiring only annotated training data and a corresponding lexicon [10]. TaggerOne employs a novel machine learning model to address named entity recognition (NER) and normalization jointly, reducing cascading errors and enabling the NER task to directly exploit the lexical information provided by the normalization. TaggerOne achieves state-of-the-art performance on diseases (NCBI Disease corpus [11]) and chemicals (BioCreative V CDR corpus [12]) and is being used to tag anatomy terms (including organs, tissues and cellular components) in PubMed articles so they can be mapped to the corresponding concept identifiers in multiple biomedical ontologies from the OBO Foundry (http://www.obofoundry.org/).
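As a rough illustration of what recognition and normalization produce, the Python sketch below locates mentions in a sentence and maps each one to a concept identifier. It is a toy dictionary lookup under stated assumptions, not the method used by GNormPlus, DNorm, tmChem, SR4GN, tmVar or TaggerOne, which are machine learning systems. The lexicon entries, the `EX:` identifiers and the names `Annotation` and `tag_concepts` are all hypothetical.

```python
import re
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    mention: str      # text exactly as it appears in the input
    concept_id: str   # identifier the mention is normalized to


# Hypothetical lexicon: lower-cased term -> made-up concept identifier.
TOY_LEXICON: Dict[str, str] = {
    "kidney": "EX:0001",
    "renal": "EX:0001",
    "breast cancer": "EX:0002",
}


def tag_concepts(text: str, lexicon: Dict[str, str]) -> List[Annotation]:
    """Find lexicon terms in text and attach their concept identifiers."""
    if not lexicon:
        return []
    # Try longer terms first so a multi-word term is preferred over
    # any shorter term that it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        Annotation(m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for ann in tag_concepts("Renal failure and kidney stones.", TOY_LEXICON):
        print(ann)
```

In this sketch recognition (the regular expression match) and normalization (the dictionary lookup) are separate steps, so a missed mention can never be normalized. That is the kind of cascading error a joint model such as TaggerOne is designed to reduce.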
Recognizing ontological concepts requires the creation of a lexical resource identifying the concepts desired, their terms and relevant variations. We recently proposed a modification of TaggerOne to automatically identify inconsistencies that arise when creating a single lexical resource from multiple knowledge resources, including ontologies, and then address the inconsistencies semi-automatically. The proposed method actively learns a model to identify identical concepts from separate resources, with preliminary results showing the model successfully identifies both synonymous tokens (e.g. “kidney” and “renal”) and contrastive terms (“dominant” vs. “recessive”).

B. Scalability and interoperability

Large-scale use of PubTator or open-source tools requires a significant investment in infrastructure and maintenance time. These barriers to entry reduce the ability of individual researchers to explore applying text mining to problems in their research area and consequently impair the continued adoption of text mining tools. Web services provide on-demand access to software tools through the Internet using straightforward interfaces and data formats. Providing text mining tools as web services therefore lowers the bar to use for end users and bioinformatics researchers not working specifically in text mining, allowing free exploration and the ability to focus on results rather than methodology.

Therefore, we recently developed NCBI text-mining web services on top of PubTator by using standard HTTP method calls (often known as RESTful services), which allow instant retrieval of pre-annotated PubTator results via HTTP GET. To improve system interoperability, we support multiple data formats including BioC/XML [13], PubTator/TXT [1] and PubAnnotation/JSON [14]. To simplify programmatic access to our web services, we also provide sample client code in Perl, Python and Java.

C. Evaluation & Usage

PubTator was formally assessed by a group of external evaluators during the BioCreative Interactive Text Mining challenge task, where it was top-rated in all categories from system design to learnability to usability [15].

More recently, through collaboration with curation groups, PubTator has been successfully integrated into the production pipeline of multiple curation databases including SwissProt [16] and the CDC’s human genome epidemiology knowledge base, HuGE Navigator [17].

Furthermore, since the inception of PubTator Web Services, millions of requests have been made by the scientific community from around the world. From interactions with some of our users, we learned that the results of our text-mining services are being used in many different research areas in bioinformatics. For instance, our web services are used to provide initial annotations for the mark2cure crowdsourcing project (https://mark2cure.org/).

III. CONCLUSIONS & FUTURE WORK

In the future, we plan to expand the automatic concept recognition to additional biomedical ontologies and include their results in PubTator. Text mining open-access full-length articles in PMC for key ontological concepts in real-world applications (e.g. computer-assisted biocuration) would be another exciting opportunity to pursue.

ACKNOWLEDGMENT

This research is supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.

REFERENCES

[1] C.-H. Wei, H.-Y. Kao, and Z. Lu, "PubTator: a web-based text mining tool for assisting biocuration," Nucleic Acids Research, vol. 41, 2013, pp. W518-W522.
[2] C. N. Arighi, B. Carterette, K. B. Cohen, M. Krallinger, W. J. Wilbur, P. Fey, et al., "An overview of the BioCreative 2012 Workshop Track III: interactive text mining task," Database, 2013, pp. bas056.
[3] C.-H. Wei, R. Leaman, and Z. Lu, "Beyond accuracy: Creating interoperable and scalable text mining web services," Bioinformatics, 2016.
[4] C.-H. Wei, H.-Y. Kao, and Z. Lu, "GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains," BioMed Research International, 2015, pp. 918710.
[5] R. Leaman, R. Islamaj Doğan, and Z. Lu, "DNorm: Disease name normalization with pairwise learning-to-rank," Bioinformatics, vol. 29, 2013, pp. 2909-2917.
[6] R. Leaman, C.-H. Wei, and Z. Lu, "tmChem: a high performance approach for chemical named entity recognition and normalization," Journal of Cheminformatics, vol. 7, 2015, pp. S3.
[7] C.-H. Wei, H.-Y. Kao, and Z. Lu, "SR4GN: a species recognition software tool for gene normalization," PLoS One, vol. 7, 2012, pp. e38460.
[8] C.-H. Wei, B. R. Harris, H.-Y. Kao, and Z. Lu, "tmVar: A text mining approach for extracting sequence variants in biomedical literature," Bioinformatics, vol. 29, 2013, pp. 1433-1439.
[9] R. I. Dogan, G. C. Murray, A. Névéol, and Z. Lu, "Understanding PubMed user search behavior through log analysis," Database, 2009, pp. bap018.
[10] R. Leaman and Z. Lu, "TaggerOne: Joint Named Entity Recognition and Normalization with Semi-Markov Models," Bioinformatics, in press, 2016.
[11] R. I. Doğan, R. Leaman, and Z. Lu, "NCBI disease corpus: A resource for disease name recognition and concept normalization," Journal of Biomedical Informatics, vol. 47, 2014, pp. 1-10.
[12] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, et al., "BioCreative V CDR task corpus: a resource for chemical disease relation extraction," Database, 2016, pp. baw068.
[13] D. C. Comeau, R. I. Doğan, P. Ciccarese, K. B. Cohen, M. Krallinger, F. Leitner, et al., "BioC: a minimalist approach to interoperability for biomedical text processing," Database, 2013, pp. bat064.
[14] J.-D. Kim, K. B. Cohen, and J.-J. Kim, "PubAnnotation-query: a search tool for corpora with multi-layers of annotation," BMC Proceedings, vol. 9, 2015, pp. A3.
[15] C.-H. Wei, B. R. Harris, D. Li, T. Z. Berardini, E. Huala, H.-Y. Kao, et al., "Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts," Database, 2012, pp. bas041.
[16] The UniProt Consortium, "UniProt: a hub for protein information," Nucleic Acids Research, vol. 43, 2015, pp. D204-D212.
[17] W. Yu, M. Gwinn, M. Clyne, A. Yesupriya, and M. J. Khoury, "A navigator for human genome epidemiology," Nature Genetics, vol. 40, 2008, pp. 124-125.