<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-scale Semantic Indexing with Biomedical Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chih-Hsuan Wei</string-name>
          <email>chih-hsuan.wei@nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyong Lu</string-name>
          <email>zhiyong.lu@nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM)</institution>
          ,
          <addr-line>Bethesda, Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We introduce PubTator, a web-based application that enables large-scale semantic indexing and automatic concept recognition in biomedical ontologies. PubTator was formally evaluated and top-rated in BioCreative, and it has been widely adopted by the scientific community around the world, supporting both research projects and real-world applications in biocuration, crowdsourcing and translational bioinformatics.</p>
      </abstract>
      <kwd-group>
        <kwd>PubTator</kwd>
        <kwd>TaggerOne</kwd>
        <kwd>Text Mining</kwd>
        <kwd>Biomedical Ontologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>With over 26 million articles in PubMed, the biomedical
literature is a knowledge-rich resource and forms an
important foundation for future research. However, the rapid
expansion of the scientific literature and the increasingly
cross-disciplinary nature of biomedical research are making
it more difficult than ever for individual researchers to find and
assimilate all of the relevant information in the literature.
Research in automated text processing is therefore of growing
importance for relieving today's information overload, and
automated tools will only become more important as the growth
of the literature accelerates.</p>
      <p>
        We present PubTator [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a web-based application that
indexes the ever-growing biomedical literature with
ontological concepts in biomedicine. PubTator features a
PubMed-like interface and is equipped with multiple
high-performing text mining algorithms (e.g. DNorm for disease
concepts in MeSH or SNOMED-CT) to ensure the quality of
its text-mined results over the entire set of articles in
PubMed. PubTator was first developed as an interactive text
mining system through our participation in BioCreative (see
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for more details and related work). More recently, we
created RESTful Web Services [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for PubTator to further
increase its scalability and make it easier to use for non-experts in text
mining, allowing users to focus on results rather than
technical methodology.
      </p>
      <p>II. SYSTEM DESCRIPTION</p>
    </sec>
    <sec id="sec-2">
      <title>A. Concept Recognition using PubTator</title>
      <p>
        PubTator currently utilizes five state-of-the-art named
entity recognition and normalization tools to locate and
identify important biomedical entities. Specifically, the entity
types currently supported and their respective systems with
F-scores are: genes and proteins (GNormPlus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] - 86.74%),
diseases (DNorm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] - 80.90%), chemicals (tmChem [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] - 87.51%), species (SR4GN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] - 85.42%) and genetic variants
(tmVar [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] - 91.39%).
      </p>
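      <p>The F-scores quoted above are not defined in the text. Assuming they are the standard balanced F-measure (the harmonic mean of precision and recall), the arithmetic can be sketched as follows. The function name f_score and the precision and recall values in the example are hypothetical, since the component figures for each tool are not reported here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta measure; beta = 1 gives the balanced F1 score."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is defined as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the arithmetic;
    # they are not figures reported for any of the tools above.
    print(round(f_score(0.90, 0.80), 4))
```
]]></preformat>
      <p>Because the harmonic mean is pulled toward the lower of its two inputs, a tool cannot reach a high F-score by trading recall away for precision or vice versa.</p>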
      <p>
        While the entity types currently covered include those
most commonly searched [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], our most recent work,
TaggerOne, is trainable to identify arbitrary entity types,
requiring only annotated training data and a corresponding
lexicon [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. TaggerOne employs a novel machine learning
model that addresses named entity recognition (NER) and
normalization jointly, reducing cascading errors and enabling
the NER task to directly exploit the lexical
information provided by the normalization. TaggerOne
achieves state-of-the-art performance on diseases (NCBI
Disease corpus [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) and chemicals (BioCreative V CDR
corpus [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) and is being used to tag anatomy terms
(including organs, tissues and cellular components) in PubMed
articles so that they can be mapped to the corresponding concept
identifiers in multiple biomedical ontologies listed at
http://www.obofoundry.org/.
      </p>
      <p>Recognizing ontological concepts requires the creation of
a lexical resource that identifies the desired concepts, their
terms and relevant variations. We recently proposed a
modification of TaggerOne to automatically identify
inconsistencies that arise when creating a single lexical
resource from multiple knowledge resources, including
ontologies, and then to resolve these inconsistencies
semi-automatically. The proposed method actively learns a model
to identify identical concepts from separate resources, with
preliminary results showing that the model successfully identifies
both synonymous tokens (e.g. “kidney” and “renal”) and
contrastive terms (e.g. “dominant” vs. “recessive”).</p>
    </sec>
    <sec id="sec-3">
      <title>B. Scalability and interoperability</title>
      <p>Large-scale use of PubTator or open-source tools requires
a significant investment in infrastructure and maintenance
time. These barriers to entry reduce the ability of individual
researchers to explore applying text mining to problems in
their research area and consequently impair the continued
adoption of text mining tools. Web services provide
on-demand access to software tools through the Internet using
straightforward interfaces and data formats. Providing text
mining tools as web services therefore lowers the barrier to use
for end users and bioinformatics researchers not working
specifically in text mining, allowing free exploration and the
ability to focus on results rather than methodology.</p>
      <p>
        Therefore, we recently developed NCBI text-mining web
services on top of PubTator using standard HTTP method
calls (often known as RESTful services), which allow
instant retrieval of pre-annotated PubTator results via HTTP
GET. To improve system interoperability, we support
multiple data formats including BioC/XML [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
PubTator/TXT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and PubAnnotation/JSON [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. To
simplify programmatic access to our web services, we also
provide sample client code in Perl, Python and Java.
      </p>
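      <p>As an illustration of the kind of client such a service enables, the following Python sketch composes a GET request URL and parses a PubTator/TXT response. It is a sketch under stated assumptions, not the sample client code mentioned above: the base URL is a placeholder, the concept/PMID/format path layout is hypothetical, and the response layout (title and abstract lines delimited by "|", followed by tab-separated annotation lines) is assumed from the commonly distributed PubTator text format. The names build_url, fetch, parse_pubtator and Annotation, and the concept identifier in the sample, are invented for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import List
from urllib.parse import quote
from urllib.request import urlopen


@dataclass
class Annotation:
    pmid: str
    start: int
    end: int
    mention: str
    concept_type: str
    concept_id: str


def build_url(base_url: str, concept: str, pmid: str, fmt: str) -> str:
    """Compose a GET URL. The path layout is an assumption, not the documented API."""
    parts = [quote(part, safe="") for part in (concept, pmid, fmt)]
    return base_url.rstrip("/") + "/" + "/".join(parts) + "/"


def fetch(url: str, timeout: float = 30.0) -> str:
    """Retrieve the response body with a plain HTTP GET."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_pubtator(text: str) -> List[Annotation]:
    """Parse tab-separated annotation lines, skipping all other lines."""
    annotations = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            # Title ("PMID|t|...") and abstract ("PMID|a|...") lines, or blanks.
            continue
        pmid, start, end, mention, concept_type = fields[:5]
        concept_id = fields[5] if len(fields) > 5 else ""
        annotations.append(
            Annotation(pmid, int(start), int(end), mention, concept_type, concept_id)
        )
    return annotations


# Hand-written sample in the assumed layout; the identifier is hypothetical.
SAMPLE = (
    "12345|t|Renal failure in a patient\n"
    "12345|a|We report a case.\n"
    "12345\t0\t13\tRenal failure\tDisease\tHYPOTHETICAL:0001\n"
)

if __name__ == "__main__":
    # Placeholder host; substitute the real service address before calling fetch().
    print(build_url("https://example.org/tmTool", "Disease", "12345", "PubTator"))
    for annotation in parse_pubtator(SAMPLE):
        print(annotation)
```
]]></preformat>
      <p>Keeping URL construction, retrieval and parsing in separate functions means the parser can be exercised on saved responses without network access, and a BioC/XML or PubAnnotation/JSON parser could be substituted without touching the request code.</p>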
    </sec>
    <sec id="sec-4">
      <title>C. Evaluation &amp; Usage</title>
      <p>
        PubTator was formally assessed by a group of external
evaluators during the BioCreative Interactive Text Mining
challenge task, where it was top-rated in all categories, from
system design to learnability to usability [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        More recently, through collaboration with curation
groups, PubTator has been successfully integrated into the
production pipeline of multiple curation databases including
SwissProt [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and the CDC’s human genome epidemiology
knowledge base called HuGE navigator [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>Furthermore, since the inception of PubTator Web
Services, millions of requests have been made by the
scientific community from around the world. From
interactions with some of our users, we learned that the
results of our text-mining services are being used in many
different research areas in bioinformatics. For instance, our
web services are used to provide initial annotations for the
mark2cure crowdsourcing project (https://mark2cure.org/).</p>
      <sec id="sec-4-1">
        <title>III. CONCLUSIONS &amp; FUTURE WORK</title>
        <p>In the future, we plan to expand automatic concept
recognition to additional biomedical ontologies and include
the results in PubTator. Text mining open-access
full-length articles in PMC for key ontological concepts in
real-world applications (e.g. computer-assisted biocuration)
would be another exciting opportunity to pursue.</p>
      </sec>
      <sec id="sec-4-2">
        <title>ACKNOWLEDGMENT</title>
        <p>This research is supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
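          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Kao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>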
          ,
          <article-title>"PubTator: a web-based text mining tool for assisting biocuration,"</article-title>
          <source>Nucleic Acids Research</source>
          , vol.
          <volume>41</volume>
          ,
          <year>2013</year>
          , pp.
          <fpage>W518</fpage>
          -
          <lpage>W522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Arighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Wilbur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fey</surname>
          </string-name>
          , et al.,
          <article-title>"An overview of the BioCreative 2012 Workshop Track III: interactive text mining task,"</article-title>
          <source>Database</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>bas056</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"Beyond accuracy: Creating interoperable and scalable text mining web services,"</article-title>
          <source>Bioinformatics</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Kao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains,"</article-title>
          <source>BioMed Research International</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>918710</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Islamaj</given-names>
            <surname>Doğan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"DNorm: Disease name normalization with pairwise learning-to-rank,"</article-title>
          <source>Bioinformatics</source>
          , vol.
          <volume>29</volume>
          ,
          <year>2013</year>
          , pp.
          <fpage>2909</fpage>
          -
          <lpage>2917</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"tmChem: a high performance approach for chemical named entity recognition and normalization,"</article-title>
          <source>Journal of Cheminformatics</source>
          , vol.
          <volume>7</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>S3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
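          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Kao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>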
          ,
          <article-title>"SR4GN: a species recognition software tool for gene normalization,"</article-title>
          <source>PLoS One</source>
          , vol.
          <volume>7</volume>
          ,
          <year>2012</year>
          , pp.
          <fpage>e38460</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Kao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"tmVar: A text mining approach for extracting sequence variants in biomedical literature,"</article-title>
          <source>Bioinformatics</source>
          , vol.
          <volume>29</volume>
          ,
          <year>2013</year>
          , pp.
          <fpage>1433</fpage>
          -
          <lpage>1439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. I.</given-names>
            <surname>Dogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"Understanding PubMed user search behavior through log analysis,"</article-title>
          <source>Database</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>bap018</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"TaggerOne: Joint Named Entity Recognition and Normalization with Semi-Markov Models,"</article-title>
          <source>Bioinformatics</source>
          , vol. In Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. I.</given-names>
            <surname>Doğan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>"NCBI disease corpus: A resource for disease name recognition and concept normalization,"</article-title>
          <source>Journal of Biomedical Informatics</source>
          , vol.
          <volume>47</volume>
          ,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sciaky</surname>
          </string-name>
          ,
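          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Davis</surname>
          </string-name>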
          , et al.,
          <article-title>"BioCreative V CDR task corpus: a resource for chemical disease relation extraction,"</article-title>
          <source>Database</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>baw068</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Comeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. I.</given-names>
            <surname>Doğan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciccarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leitner</surname>
          </string-name>
          , et al.,
          <article-title>"BioC: a minimalist approach to interoperability for biomedical text processing,"</article-title>
          <source>Database</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>bat064</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.-D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>"PubAnnotation-query: a search tool for corpora with multi-layers of annotation,"</article-title>
          <source>BMC Proceedings</source>
          , vol.
          <volume>9</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>A3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Z.</given-names>
            <surname>Berardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Huala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Kao</surname>
          </string-name>
          , et al.,
          <article-title>"Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts,"</article-title>
          <source>Database</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>bas041</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <collab>The UniProt Consortium</collab>
          ,
          <article-title>"UniProt: a hub for protein information,"</article-title>
          <source>Nucleic Acids Research</source>
          , vol.
          <volume>43</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>D204</fpage>
          -
          <lpage>D212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gwinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Clyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yesupriya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Khoury</surname>
          </string-name>
          ,
          <article-title>"A navigator for human genome epidemiology,"</article-title>
          <source>Nature Genetics</source>
          , vol.
          <volume>40</volume>
          ,
          <year>2008</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>