<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using biomedical databases as knowledge sources for large-scale text mining</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Fabio Rinaldi, Institute of Computational Linguistics, University of Zurich</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we discuss how terminological knowledge extracted from biomedical databases can be used effectively in large-scale processing of the biomedical literature. We briefly present an integrated information extraction and text mining environment capable of reliably identifying and disambiguating several categories of domain entities, which can then serve as indexing entries for efficient retrieval of relevant documents and passages. Additionally, the system generates ranked lists of candidate interactions among the detected entities, which can be useful for several purposes, from assisted literature curation to question answering systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The rapid increase of novel scientific results in the domain of molecular biology
makes it necessary to collect this information in structured repositories, so that it
becomes easily accessible to end users. Well-known databases such as UniProt, Mint,
IntAct, and BioGrid collect information about proteins and their interactions. PharmGKB
[
        <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
        ] curates knowledge about the impact of genetic variation on drug response for
clinicians and researchers. The Comparative Toxicogenomics Database (CTD) collects
interactions between chemicals and genes in order to support the study of the effects of
environmental chemicals on health [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A significant amount of manual effort is needed
to extract from the literature the information required to accurately populate those
databases (a process referred to as “curation”). Text mining solutions are increasingly
in demand to support the curation of biomedical databases.
      </p>
      <p>
        The OntoGene project<sup>1</sup> focuses on the improvement of biomedical text mining
through the use of advanced natural language processing techniques. Our approach
relies upon information delivered by a pipeline of NLP tools, including sentence
splitting, tokenization, part-of-speech tagging, term recognition, noun and verb phrase
chunking, and a dependency-based syntactic analysis of input sentences [
        <xref ref-type="bibr" rid="ref11 ref8">11, 8</xref>
        ]. The
results of the entity detection feed directly into the process of identification of
interactions.
      </p>
      <p>
        Different implementations of the OntoGene system have been used to participate
in several well-known text mining shared tasks, such as BioCreative, CALBC and
BioNLP, consistently obtaining competitive results. For example, in the BioCreative 2009
challenge the OntoGene system obtained the best results for protein-protein
interactions [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. More recently, within the scope of the SASEBio project (Semi-Automated
Semantic Enrichment of the Biomedical Literature), we have developed a user-friendly
interface (ODIN: OntoGene Document INspector) which can be used by database
curators to inspect the results of the text mining system. The interface is designed to simplify
the interaction of the user with the text mining system, allowing, for example, the
correction of wrong results. The system can then learn from this interaction.
      </p>
      <p>In the rest of this short paper we briefly describe the OntoGene pipeline architecture
and the ODIN interface for assisted curation.<sup>2</sup></p>
    </sec>
    <sec id="sec-2">
      <title>Information Extraction</title>
      <p>
        Biomedical terminological resources can be leveraged for the construction of large-scale
knowledge bases. One example is KaBOB (Knowledge Base of Biology), a large RDF
store based upon 17 prominent biomedical databases. KaBOB contains 5.6 billion
RDF triples [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Similar kinds of integrated data networks can be used for knowledge
discovery purposes through the use of semantic web technologies (see for example [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
      <p>In our own work we have used such databases as knowledge sources for the process
of semi-automated information extraction. In the rest of this section we describe the
OntoGene Text Mining pipeline which is used to (a) provide all basic preprocessing
(e.g. tokenization) of the target documents, (b) identify all mentions of domain entities
and normalize them to database identifiers, and (c) extract candidate interactions.</p>
      <sec id="sec-2-1">
        <title>Preprocessing and Detection of Domain Entities</title>
        <p>Several large-scale terminological resources are used by the OntoGene system in order
to detect names of relevant domain entities in biomedical literature (proteins, genes,
chemicals, diseases, etc.) and ground them to widely accepted identifiers assigned by
the original database, such as UniProt Knowledgebase, National Center for
Biotechnology Information (NCBI) Taxonomy, Proteomics Standards Initiative Molecular
Interactions Ontology (PSI-MI), Cell Line Knowledge Base (CLKB), etc.</p>
        <p>
          From the original databases we extract preferred names and synonyms for each
term, together with its unique identifier. This information is used to annotate the
input documents using an efficient lookup procedure. A term normalization step is used
to take into account a number of possible surface variations of the terms. The same
normalization is applied to the list of known terms at the beginning of the annotation
process, when it is read into memory, and to the candidate terms in the input text, so
that variants of the same term can be matched despite the
differences in their surface strings [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. For more technical details of the OntoGene terminology
recognition process, see [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
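        <p>The normalization and lookup steps described above can be illustrated with a minimal sketch. The normalization rules (lowercasing, treating hyphens and slashes as spaces, dropping whitespace), the function names and the identifiers are assumptions made for illustration only; they are not the actual rules or data used by OntoGene.</p>
        <preformat><![CDATA[
```python
import re


def normalize(term):
    """Reduce a term to a canonical surface form (illustrative rules only)."""
    term = term.lower()
    term = re.sub(r"[-_/]", " ", term)  # treat hyphens and slashes as spaces
    term = re.sub(r"\s+", "", term)     # then drop all whitespace
    return term


def build_index(entries):
    """Map each normalized name or synonym to the set of database IDs."""
    index = {}
    for identifier, names in entries:
        for name in names:
            index.setdefault(normalize(name), set()).add(identifier)
    return index


# Hypothetical entries: (identifier, [preferred name, synonyms...])
entries = [
    ("ID:0001", ["Interleukin-2", "IL 2"]),
    ("ID:0002", ["NF-kappa B"]),
]
index = build_index(entries)


def lookup(candidate):
    """Normalize a candidate string from the text and return matching IDs."""
    return sorted(index.get(normalize(candidate), set()))


print(lookup("IL-2"))      # matches the synonym "IL 2"
print(lookup("NFkappaB"))  # matches "NF-kappa B"
```
]]></preformat>
        <p>Because the same function is applied to the term list and to the text, surface variants such as “IL-2” and “IL 2” collapse to the same key.</p>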
        <p><sup>2</sup> Readers interested in more details are invited to consult the journal publications available from the
OntoGene web site.</p>
        <p>The terminological resource obtained as described above is used to annotate
biomedical text in a relatively straightforward way. First, in a preprocessing stage, the input
text is transformed into a custom XML format, and sentence and token boundaries
are identified. For this task, we use the LingPipe tokenizer and sentence splitter, which
have been trained on biomedical corpora. The tokenizer produces a granular set of
tokens, e.g. words that contain a hyphen (such as ‘Pop2p-Cdc18p’) are split into several
tokens, revealing the inner structure of such constructs, which makes it possible to discover
the interaction mention in “Pop2p-Cdc18p interaction”. Tagging of terms is performed
by sequentially processing each token in a sentence and, if it can start a term, annotating
the longest possible match (partial overlaps are excluded). In the case of success, all
the possible IDs (as found in the term list) are assigned to the candidate term.</p>
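        <p>The tagging strategy described above can be sketched as follows. This is a simplified illustration, not the OntoGene implementation: the regular-expression tokenizer stands in for LingPipe, and the term list, identifiers and the <monospace>max_len</monospace> window are hypothetical.</p>
        <preformat><![CDATA[
```python
import re


def tokenize(sentence):
    """Granular tokenization: punctuation such as hyphens becomes its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)


def tag_terms(tokens, term_index, max_len=5):
    """Left-to-right longest-match tagging; matched spans never overlap."""
    annotations = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window starting at token i first, then shrink it.
        for j in range(min(len(tokens), i + max_len), i, -1):
            key = " ".join(tokens[i:j]).lower()
            if key in term_index:
                # Assign all IDs found in the term list to the candidate term.
                match = (i, j, sorted(term_index[key]))
                break
        if match:
            annotations.append(match)
            i = match[1]  # resume after the match, so no partial overlaps
        else:
            i += 1
    return annotations


# Hypothetical term list: normalized term -> set of candidate IDs
term_index = {
    "pop2p": {"ID:A"},
    "cdc18p": {"ID:B", "ID:C"},
    "cell cycle": {"ID:D"},
    "cell": {"ID:E"},
}
tokens = tokenize("Pop2p-Cdc18p interaction in the cell cycle")
annotations = tag_terms(tokens, term_index)
print(annotations)
```
]]></preformat>
        <p>In this sketch “cell cycle” is preferred over the shorter term “cell”, and the ambiguous term “Cdc18p” keeps both of its candidate IDs for later disambiguation.</p>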
        <p>
          Ambiguity is a serious problem for several types of entities. For example, the names of
some proteins and genes can refer to several different database identifiers:
hemoglobin can refer to human hemoglobin or to mouse hemoglobin (or to that of any other
species). Moreover, even in humans there are several different types of hemoglobin.
Using knowledge about the organisms which are the focus of the experiments described
in each paper, we can disambiguate to a large extent entities such as proteins and genes.
In the OntoGene pipeline we apply an approach which we first described in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We
first create a ranked list of ’focus’ organisms based on all mentions of proteins, genes,
cell lines and organisms in the paper. In the disambiguation process we remove all the
IDs that do not correspond to an organism present in the list. Additionally, the scores
provided for each organism can be used in ranking the candidate IDs for each entity.
Such ranking is useful in a semi-automated curation environment where the curator is
expected to make the final decision. However, it can also be used in a fully automated
environment as a factor in ranking any other derived information, such as interactions
in which the given entity participates.
        </p>
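        <p>The filtering and ranking of candidate IDs by focus organism can be sketched as below. The scoring of organisms by relative mention frequency is an assumption made for illustration (the actual scoring is described in the cited work), and the identifiers are hypothetical placeholders.</p>
        <preformat><![CDATA[
```python
def rank_focus_organisms(mention_counts):
    """Turn raw mention counts per organism into relative-frequency scores.

    Relative frequency is an assumed scoring scheme, used only for illustration.
    """
    total = sum(mention_counts.values())
    return {org: count / total for org, count in mention_counts.items()}


def disambiguate(candidate_ids, id_to_organism, organism_scores):
    """Drop IDs whose organism is not a focus organism, then rank the rest."""
    kept = [cid for cid in candidate_ids
            if id_to_organism.get(cid) in organism_scores]
    return sorted(kept,
                  key=lambda cid: organism_scores[id_to_organism[cid]],
                  reverse=True)


# Hypothetical data: organism mentions in one paper, and candidate IDs
# for the ambiguous term "hemoglobin".
scores = rank_focus_organisms({"human": 6, "mouse": 2})
id_to_organism = {"HB-human": "human", "HB-mouse": "mouse", "HB-rat": "rat"}
ranked = disambiguate(["HB-rat", "HB-mouse", "HB-human"], id_to_organism, scores)
print(ranked)
```
]]></preformat>
        <p>Here the rat identifier is removed because rat is not among the focus organisms, and the human identifier is ranked first because human has the highest score.</p>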
      </sec>
      <sec id="sec-2-2">
        <title>Detection of Interactions</title>
        <p>Mentions of relevant domain entities in a given text span are used by the OntoGene
system to create candidate interactions. The selected text span can vary from a sentence
to a larger observation window. Simple co-occurrence in the selected text span is a
low-precision, but high-recall indication of a potential relationship among those entities. In
order to obtain better precision the OntoGene system uses the syntactic structure of
the sentence, and the global distribution of interactions in the original database. In this
section we describe in detail how candidate interactions are ranked by our system,
according to their relevance for the original database.</p>
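        <p>As an illustration of the co-occurrence baseline, the following sketch generates one candidate pair for every two distinct entities found in the same text span. The span representation (a list of entity IDs per sentence) and the identifiers are assumptions, not the data structures used by the system.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def candidate_pairs(spans):
    """Pair up all distinct entities that co-occur in the same text span."""
    pairs = set()
    for entity_ids in spans:
        # Sorting makes (a, b) and (b, a) collapse to the same unordered pair.
        for a, b in combinations(sorted(set(entity_ids)), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Hypothetical input: entity IDs detected in three sentences
spans = [["E1", "E2", "E3"], ["E2", "E1"], ["E4"]]
pairs = candidate_pairs(spans)
print(pairs)
```
]]></preformat>
        <p>Every pair that co-occurs is returned, which is why this baseline has high recall but low precision and needs the additional ranking described in this section.</p>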
        <p>The OntoGene system creates an initial ranking of the candidate relations from the
selected text span using only the frequency of the respective entities with the following
formula:</p>
        <p>
          <disp-formula>
            <tex-math><![CDATA[\mathrm{relscore}(e_1, e_2) = \frac{f(e_1) + f(e_2)}{f(E)}]]></tex-math>
          </disp-formula>
where f(e1) and f(e2) are the number of times the entities e1 and e2 are observed in
the abstract, while f(E) is the total count of all identifiers in the abstract. An additional
zone-based boost might be used in some cases (e.g. for entities mentioned in the title).</p>
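        <p>A small worked example of this frequency-based score is given below. The list of entity mentions is hypothetical, and the optional zone-based boost is left out.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def relscore(e1, e2, counts):
    """(f(e1) + f(e2)) / f(E), with f(E) the total count of all identifiers."""
    total = sum(counts.values())
    return (counts[e1] + counts[e2]) / total


# Hypothetical identifiers observed in one abstract, in order of appearance
mentions = ["E1", "E2", "E1", "E3", "E1", "E2", "E4", "E3"]
counts = Counter(mentions)

print(relscore("E1", "E2", counts))  # (3 + 2) / 8 = 0.625
print(relscore("E3", "E4", counts))  # (2 + 1) / 8 = 0.375
```
]]></preformat>
        <p>Pairs of frequently mentioned entities receive a higher initial score than pairs involving entities that are mentioned only once.</p>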
        <p>The OntoGene pipeline makes use of an internally developed dependency parser [13]
to parse all sentences in the input documents. The information derived from
the dependency analysis is used to improve on the baseline ranking of candidate
interactions. In addition, the syntactic analysis provides useful information for the extraction of
the interaction type. Given two terms identified in the same sentence, a collector
traverses the tree from each of the two terms upwards to the lowest common parent node,
recording all intermediate nodes and dependency paths along the route. An example of
such a traversal can be seen in Figure 1. Such traversals have been used in many PPI
applications and are commonly called tree walks or paths.</p>
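        <p>The traversal to the lowest common parent node can be sketched as follows. The tree encoding (a child-to-parent map plus a map of dependency labels) and the label names are assumptions chosen for illustration; they do not reflect the output format of the parser used in OntoGene.</p>
        <preformat><![CDATA[
```python
def path_to_root(node, parent):
    """Nodes visited when walking from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_walk(a, b, parent, label):
    """Walk upwards from both terms to their lowest common parent node.

    Returns the common node and the dependency labels collected on each side.
    """
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    on_path_a = set(up_a)
    # The first ancestor of b that also lies above a is the lowest common one.
    common = next(node for node in up_b if node in on_path_a)
    left = [label[node] for node in up_a[:up_a.index(common)]]
    right = [label[node] for node in up_b[:up_b.index(common)]]
    return common, left, right


# Hypothetical parse of "A interacts with B": child -> head, child -> label
parent = {"A": "interacts", "with": "interacts", "B": "with"}
label = {"A": "subj", "with": "prep", "B": "pobj"}

result = tree_walk("A", "B", parent, label)
print(result)
```
]]></preformat>
        <p>For this toy sentence the walk meets at the verb “interacts”, with the path <monospace>subj</monospace> on one side and <monospace>pobj</monospace>, <monospace>prep</monospace> on the other. Such a pair of label sequences is the kind of syntactic path that can serve as a feature.</p>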
        <p>
          Each candidate interaction is assigned a score, obtained by combining several
features, including: (1) Syntactic path, which encodes the information provided by the
dependency structure between the two entities in the candidate interaction; (2) Known
interaction: in order to better distinguish between ’novel’ interactions (more important
for the curation process) and ’older’ interactions (already known, thus less important
for the curation process), we penalize interactions that are already reported in the
reference databases, in proportion to their ’age’ (date at which the interaction was first
reported); (3) Novelty score: we also use linguistic clues in order to distinguish
sentences that report the results detected by the authors (e.g. “Here we report
that...”) from sentences that report background results. Interactions in ’novelty’
sentences are scored higher than interactions in ’background’ sentences; (4) Zoning:
different structural zones of the paper often have different levels of relevance. We observed
that novel interactions are often mentioned in the abstract and the conclusions, while
the introduction and methods sections are less likely to contain them and therefore get lower scores; (5)
Pair salience: the frequency of mentions in the paper of each of the entities in the
candidate pair is an important indicator of the relevance of that interaction in the paper.
Scores from each feature are then combined and normalized to the [0,1] range, in
order to produce a ranking for the candidate interactions.
        </p>
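        <p>The text does not specify how the feature scores are combined, so the following sketch only illustrates one plausible scheme: a weighted sum followed by min-max normalization to the [0,1] range. The weights, feature names and feature values are all hypothetical.</p>
        <preformat><![CDATA[
```python
def combine(features, weights):
    """Weighted sum of per-feature scores for one candidate (assumed scheme)."""
    return sum(weights[name] * value for name, value in features.items())


def normalize(scores):
    """Min-max rescaling of raw scores to the [0, 1] range."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


# Hypothetical weights; the negative weight penalizes known interactions.
weights = {"syntactic_path": 2.0, "known": -1.0, "novelty": 1.0,
           "zone": 1.0, "salience": 1.0}

# Hypothetical feature values for three candidate interactions
candidates = {
    ("E1", "E2"): {"syntactic_path": 1.0, "known": 0.0, "novelty": 1.0,
                   "zone": 1.0, "salience": 0.5},
    ("E1", "E3"): {"syntactic_path": 0.5, "known": 1.0, "novelty": 0.0,
                   "zone": 0.5, "salience": 0.5},
    ("E2", "E3"): {"syntactic_path": 0.0, "known": 0.0, "novelty": 0.0,
                   "zone": 0.0, "salience": 0.0},
}

raw = {pair: combine(feats, weights) for pair, feats in candidates.items()}
normalized = normalize(raw)
ranking = sorted(normalized, key=normalized.get, reverse=True)
print(ranking)
```
]]></preformat>
        <p>With these illustrative values the raw scores are 4.5, 1.0 and 0.0, so the pair (E1, E2) is ranked first with a normalized score of 1.0 and the pair (E2, E3) last with 0.0.</p>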
        <p>
          The results of the OntoGene text mining system are made accessible through a
curation system called ODIN (“OntoGene Document INspector”) which allows a user
to dynamically inspect the results of the text mining pipeline. An experiment in
interactive curation has been performed recently in collaboration with the PharmGKB
database [
          <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
          ]. The results of this experiment are described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] provides
further details on the architecture of the system. Figure 2 shows a screenshot of ODIN.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper we briefly described the OntoGene text mining system, targeted at the
extraction of entities and relationships from the biomedical literature. The OntoGene
pipeline leverages manually curated resources and is capable of reliably
identifying entities and relationships, which can optionally be delivered using standard
semantic web formats such as RDF or OWL. The long-term vision of the project is a deeper
integration of databases and literature.</p>
      <p>Acknowledgments</p>
      <p>This research is partially funded by the Swiss National Science Foundation (grant
100014-118396/1) and Novartis Pharma AG, NIBR-IT, Text Mining Services,
CH-4002 Basel, Switzerland.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bada</surname>
          </string-name>
          , Kevin Livingston, and
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Hunter</surname>
          </string-name>
          .
          <article-title>An ontological representation of biomedical data sources and records</article-title>
          .
          <source>Bio-Ontologies</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Huajun</given-names>
            <surname>Chen</surname>
          </string-name>
          , Li Ding,
          <string-name>
            <given-names>Zhaohui</given-names>
            <surname>Wu</surname>
          </string-name>
          , Tong Yu,
          <string-name>
            <given-names>Lavanya</given-names>
            <surname>Dhanapalan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jake Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Semantic web for integrated network analysis in biomedicine</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>177</fpage>
          -
          <lpage>192</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Kappeler</surname>
          </string-name>
          , Kaarel Kaljurand, and Fabio Rinaldi.
          <article-title>TX Task: Automatic Detection of Focus Organisms in Biomedical Publications</article-title>
          .
          <source>In Proceedings of the BioNLP workshop</source>
          , Boulder, Colorado, pages
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.E.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.T.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.L.</given-names>
            <surname>Easton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hewett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.E.</given-names>
            <surname>Oliver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.L.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Stuart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.B.</given-names>
            <surname>Altman</surname>
          </string-name>
          .
          <article-title>Integrating genotype and phenotype information: An overview of the PharmGKB project</article-title>
          .
          <source>The Pharmacogenomics Journal</source>
          ,
          <volume>1</volume>
          :
          <fpage>167</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.J.</given-names>
            <surname>Mattingly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.C.</given-names>
            <surname>Rosenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.T.</given-names>
            <surname>Colby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.N.</given-names>
            <surname>Forrest</surname>
            <suffix>Jr</suffix>
          </string-name>
          , and
          <string-name>
            <given-names>J.L.</given-names>
            <surname>Boyer</surname>
          </string-name>
          .
          <article-title>The Comparative Toxicogenomics Database (CTD): a resource for comparative toxicological studies</article-title>
          .
          <source>Journal of Experimental Zoology Part A: Comparative Experimental Biology</source>
          ,
          <volume>305A</volume>
          (
          <issue>9</issue>
          )
          :
          <fpage>689</fpage>
          -
          <lpage>692</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Simon Clematide, Yael Garten, Michelle Whirl-Carrillo,
          <string-name>
            <given-names>Li</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joan M.</given-names>
            <surname>Hebert</surname>
          </string-name>
          , Katrin Sangkuhl, Caroline F. Thorn,
          <string-name>
            <given-names>Teri E.</given-names>
            <surname>Klein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Russ B.</given-names>
            <surname>Altman</surname>
          </string-name>
          .
          <article-title>Using ODIN for a PharmGKB re-validation experiment</article-title>
          .
          <source>Database: The Journal of Biological Databases and Curation</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Kaarel Kaljurand, and
          <string-name>
            <given-names>Rune</given-names>
            <surname>Saetre</surname>
          </string-name>
          .
          <article-title>Terminological resources for text mining over biomedical scientific literature</article-title>
          .
          <source>Journal of Artificial Intelligence in Medicine</source>
          ,
          <volume>52</volume>
          (
          <issue>2</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          ,
          <year>June 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, and Therese Vachon.
          <article-title>OntoGene in BioCreative II</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>9</volume>
          (
          <issue>Suppl 2</issue>
          ):
          <fpage>S13</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gerold</given-names>
            <surname>Schneider</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Simon</given-names>
            <surname>Clematide</surname>
          </string-name>
          .
          <article-title>Relation mining experiments in the pharmacogenomics domain</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <year>2012</year>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.jbi.2012.04.014</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Therese Vachon, and Martin Romacker.
          <article-title>OntoGene in BioCreative II.5</article-title>
          .
          <source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>472</fpage>
          -
          <lpage>480</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Gerold Schneider,
          <string-name>
            <given-names>Kaarel</given-names>
            <surname>Kaljurand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hess</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Romacker</surname>
          </string-name>
          .
          <article-title>An Environment for Relation Mining over Richly Annotated Corpora: the case of GENIA</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>7</volume>
          (
          <issue>Suppl 3</issue>
          ):
          <fpage>S3</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Sangkuhl</surname>
          </string-name>
          , Dorit S. Berlin,
          <string-name>
            <given-names>Russ B.</given-names>
            <surname>Altman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Teri E.</given-names>
            <surname>Klein</surname>
          </string-name>
          .
          <article-title>PharmGKB: Understanding the effects of individual genetic variants</article-title>
          .
          <source>Drug Metabolism Reviews</source>
          ,
          <volume>40</volume>
          (
          <issue>4</issue>
          ):
          <fpage>539</fpage>
          -
          <lpage>551</lpage>
          ,
          <year>2008</year>
          . PMID:
          <pub-id pub-id-type="pmid">18949600</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>