<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PosMed: a biomedical entity prioritisation tool based on statistical inference over literature and the Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>BioResource Center, RIKEN</institution>
          ,
          <addr-line>Tsukuba</addr-line>
          ,
          <country country="JP">Japan</country>
          <email>hmasuya@brc.riken.jp</email>
          <email>takatter@brc.riken.jp</email>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Integrated Database Unit, Advanced Center for Computing and Communication, RIKEN</institution>
          ,
          <country country="JP">Japan</country>
          <email>makita@base.riken.jp</email>
          <email>manabu@base.riken.jp</email>
          <email>amatsus@base.riken.jp</email>
          <email>gifford@base.riken.jp</email>
          <email>toyoda@base.riken.jp</email>
        </aff>
      </contrib-group>
      <abstract>
        <p>Positional MEDLINE (PosMed) is a web application that quickly prioritises biomedical entities such as genes and diseases based on the statistical significance of associations between them and a user-specified keyword, by employing our original search engine named General and Rapid Association Study Engine (GRASE). GRASE search is modelled as an extension of SPARQL search with statistical analysis, which enables searching over semantic data including not only linked datasets but also significant semantic links extracted from multiple biomedical documents. PosMed was originally implemented for in silico positional cloning studies by prioritising genes. Further applications include bioresource search with associated genetic functions or ontologies, and functional interpretation of gene variants found by exome sequencing of personal genomes. PosMed is available at http://database.riken.jp/PosMed/.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked data prioritisation</kwd>
        <kwd>Statistical search</kwd>
        <kwd>Text mining</kwd>
        <kwd>Omics analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the life sciences field, a Semantic Web approach that employs
machine-readable linked data prepared from various conventional omics datasets has been
studied as a way to understand biomedical phenomena. However, because
generating semantic links for our biomedical knowledge is too expensive, and because such
knowledge is described in a vast amount of human-readable biomedical
literature, this semantic technology is still not widely adopted by biologists.</p>
      <p>For practical use of published biomedical data on the Semantic Web,
especially data that are difficult to utilise because they lack semantic links, it is beneficial
to reinforce the acquisition of such data with a hybrid methodology that
combines inference over knowledge described as linked data with
knowledge supported by statistical significance over a vast number of
raw documents.</p>
      <p>
        Our implementation of this methodology is the search engine named GRASE
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To confirm the problem-solving abilities of GRASE for the life sciences,
we developed a simple but effective graphical user interface for GRASE called
PosMed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and in 2005 published it as a service accessible from a user's web
browser. We started with mouse and human gene prioritisation for in silico
positional cloning, and have since extended the datasets and the service to intelligent
bioresource search and exome analysis for next-generation sequencing. The
rest of this paper presents the computational model of GRASE search and
problem-solving examples using PosMed with our latest datasets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Statistical search model of GRASE</title>
      <p>GRASE search is modelled as an extension of SPARQL search with statistical
analysis, which enables searching over semantic data including not only datasets
in Resource Description Framework (RDF) but also significant semantic links
extracted from multiple biomedical documents, including MEDLINE abstracts.</p>
      <p><bold>Direct search (keyword → entity)</bold> The GRASE search engine quickly prioritises
biomedical entities such as genes, diseases, drugs and mouse strains based on the
statistical significance of their associations with a user-specified keyword. More
concretely, for each entity GRASE generates a 2 × 2 contingency table
<inline-formula><tex-math><![CDATA[\begin{pmatrix} a & b \\ c & d \end{pmatrix}]]></tex-math></inline-formula>
consisting of the number of documents (<italic>a</italic>) where both the keyword and the
entity appear, (<italic>b</italic>) where the keyword appears but the entity does not appear, (<italic>c</italic>)
where the keyword does not appear and the entity does appear, and (<italic>d</italic>) where
neither the keyword nor the entity appears, then applies Fisher's exact test
to the contingency table to compute a P-value for the significance of the association.</p>
      <p><bold>Inference search (entity → entity)</bold> GRASE further infers other entities from
the result of direct search by applying semantic links described by RDF triples
and statistically extracted co-citation relationships. A co-citation relationship between two entities
<italic>e</italic><sub>1</sub> and <italic>e</italic><sub>2</sub>
appearing in a common document is evaluated by applying Fisher's exact test to the 2 × 2 contingency table
<inline-formula><tex-math><![CDATA[\begin{pmatrix} a & b \\ c & d \end{pmatrix}]]></tex-math></inline-formula>
consisting of the number of documents (<italic>a</italic>) where
both <italic>e</italic><sub>1</sub> and <italic>e</italic><sub>2</sub> appear, (<italic>b</italic>) where <italic>e</italic><sub>1</sub> appears but <italic>e</italic><sub>2</sub> does not appear, (<italic>c</italic>) where
<italic>e</italic><sub>1</sub> does not appear but <italic>e</italic><sub>2</sub> does appear, and (<italic>d</italic>) where neither <italic>e</italic><sub>1</sub> nor <italic>e</italic><sub>2</sub> appears.
Therefore, an entity can be searched via a search path keyword → entity 1 →
entity 2, and its significance is computed as
<inline-formula><tex-math><![CDATA[1 - (1 - p_d)(1 - p_i)]]></tex-math></inline-formula>,
where <italic>p</italic><sub>d</sub> and <italic>p</italic><sub>i</sub> are
the P-values of direct search and inference search, respectively.</p>
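      <p>As a minimal illustration of the scoring described in this section, the following Python sketch computes a Fisher's exact test P-value for a 2 × 2 contingency table and combines a direct-search and an inference-search P-value along a search path. It is not the GRASE implementation: the function names are hypothetical, the use of a one-sided test is an assumption (the text does not state which alternative hypothesis GRASE uses), and the exact integer arithmetic is only practical for small document counts.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of observing at least
    `a` co-occurrences (upper tail of the hypergeometric distribution).
    The one-sided alternative is an assumption made for this sketch.
    """
    row1 = a + b           # documents containing the keyword (or e1)
    col1 = a + c           # documents containing the entity (or e2)
    total = a + b + c + d  # all documents
    denominator = comb(total, col1)
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denominator


def path_significance(p_direct: float, p_inference: float) -> float:
    """Combine P-values along keyword -> entity 1 -> entity 2.

    Implements 1 - (1 - p_d)(1 - p_i) as given in the text.
    """
    return 1.0 - (1.0 - p_direct) * (1.0 - p_inference)
```
]]></preformat>
      <p>For example, <monospace>fisher_exact_greater(8, 2, 1, 5)</monospace> evaluates 280/11440, roughly 0.024, and <monospace>path_significance(0.01, 0.05)</monospace> gives 0.0595, so a path is never scored as more significant than its weaker step.</p>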
    </sec>
    <sec id="sec-3">
      <title>Practical applications of PosMed</title>
      <p>Since 2005, we have been extending the datasets in PosMed to keep pace
with ever-changing trends in biomedical applications. The datasets currently
available in PosMed are listed at http://database.riken.jp/PosMed/.</p>
      <sec id="sec-3-1">
        <title>In silico positional cloning</title>
        <p>
          A typical application of PosMed is searching for genes with user-specified
keywords and chromosomal intervals suggested by linkage analysis. To support inference
searches such as mouse gene–drug inference and mouse gene–human gene inference,
PosMed currently provides the following datasets:
        </p>
        <list list-type="bullet">
          <list-item>
            <p>
              up to 352,000 entities including not only genes in mouse, human, rat,
Arabidopsis and rice, but also drugs, metabolites, diseases and mouse strains,
associated with document sets including up to 9,870,000 documents from
MEDLINE abstracts, OMIM, gene annotation, molecular interaction, Open
Biomedical Ontologies (OBO) [
              <xref ref-type="bibr" rid="ref3">3</xref>
              ] including Gene Ontology, Mammalian
Phenotype, Human Disease Ontology and Plant Ontology, and
            </p>
          </list-item>
          <list-item>
            <p>up to 828,000 semantic links including homologue genes and mouse strain–gene
relationships.</p>
          </list-item>
        </list>
        <p>To achieve quick response times, the datasets listed above are distributed over
11 computers that work in parallel.</p>
        <p>
          PosMed was used to prioritise genes in the RIKEN large-scale mouse ENU
mutagenesis project and contributed to the successful identification of 65 responsible
genes [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. PosMed is also used worldwide and has successfully narrowed down candidate
genes responsible for a specific function after QTL analysis [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
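        <p>The positional part of such a query can be sketched as follows: given genes that already carry a P-value for the user's keyword, keep those overlapping the chromosomal interval suggested by linkage or QTL analysis and list the most significant first. This is only an illustrative sketch; the <monospace>GeneHit</monospace> record, its fields and the function name are hypothetical and do not describe PosMed's internal data model.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneHit:
    """Hypothetical record for one scored gene (not PosMed's schema)."""
    gene_id: str
    chromosome: str
    start: int       # gene start coordinate in base pairs
    end: int         # gene end coordinate in base pairs
    p_value: float   # significance of the association with the keyword


def rank_in_interval(hits, chromosome, start, end):
    """Return hits overlapping [start, end] on `chromosome`, smallest P-value first."""
    overlapping = [
        hit for hit in hits
        if hit.chromosome == chromosome
        and hit.start <= end    # gene begins before the interval ends
        and hit.end >= start    # gene ends after the interval begins
    ]
    return sorted(overlapping, key=lambda hit: hit.p_value)
```
]]></preformat>
        <p>Genes on other chromosomes or outside the interval are dropped regardless of how significant their keyword association is, which is what narrows a genome-wide ranking down to positional candidates.</p>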
      </sec>
      <sec id="sec-3-2">
        <title>Bioresource search in mice and Arabidopsis</title>
        <p>
          One conventional problem for a mouse bioresource database is that knockout
strains are not used when the targeted gene has an unknown function and no
observed phenotype. We introduced 19,885 mouse strains registered in the
International Mouse Strain Resource (IMSR) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] so that users can discover a wider range of resources than by
simple keyword search over mouse strain catalogues, which accelerated
bioresource utilisation, especially for strains with fewer phenotypic annotations.
        </p>
        <p>PosMed successfully connects these functionally unknown genes to known
genes using molecular interactions, pathway information and co-citations and as
a result enables suggestion of unobserved phenotypic bioresources. PosMed not
only allows users to retrieve mouse bioresources directly with the user's keywords
described in bioresource annotations, but also inferentially through
corresponding documents for mouse and human genes, diseases, drugs, ontologies, pathways,
metabolites, molecular interactions and MEDLINE abstracts.</p>
        <p>As an extension to other species, we newly introduced 7,207 Arabidopsis
bioresources and 823 Arabidopsis phenotype observations extracted by human
literature curation, so that PosMed also inferentially discovers Arabidopsis
bioresources through corresponding documents for genes, phenotypes,
ontologies, co-expressions, molecular interactions and MEDLINE abstracts.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Functional interpretation of gene variants</title>
        <p>PosMed can also be applied to functional interpretation of genetic variants
detected by exome sequencing studies using a next-generation sequencer. Since
exome sequencing studies usually find several hundreds or thousands of genetic
variants by comparing samples and controls, prioritisation of the candidate genes
using PosMed is an effective method for further functional analysis.</p>
        <p>Users can upload a tab-separated values file with gene IDs and their
descriptions. PosMed prioritises the genes listed in the file by statistical relevance between
the user's keywords and each gene, and displays the ranked genes together with the
user-uploaded descriptions and associated documents.</p>
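        <p>The upload-and-rank workflow can be sketched in Python as below. The sketch assumes a file without a header row in which the first column is the gene ID and the second the description; the actual layout expected by PosMed is not specified here, and both function names are hypothetical. The P-values are taken as given, standing in for the keyword association scores that PosMed computes.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import csv
import io


def read_gene_list(tsv_text):
    """Parse 'gene ID <TAB> description' rows into a dict, skipping blank rows.

    Assumed layout: no header, gene ID in column 1, description in column 2.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    genes = {}
    for row in reader:
        if not row or not row[0].strip():
            continue
        description = row[1].strip() if len(row) > 1 else ""
        genes[row[0].strip()] = description
    return genes


def rank_uploaded_genes(genes, p_values):
    """Return (gene_id, description, p_value) tuples, smallest P-value first.

    Genes with no entry in `p_values` (no association found) are left out.
    """
    scored = [
        (gene_id, description, p_values[gene_id])
        for gene_id, description in genes.items()
        if gene_id in p_values
    ]
    return sorted(scored, key=lambda item: item[2])
```
]]></preformat>
        <p>Carrying the uploaded description through the ranking keeps each variant's annotation next to its score, mirroring the combined display described above.</p>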
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and conclusion</title>
      <p>
        We proposed a Semantic Web data search methodology and tool that extends
conventional graph search such as SPARQL with statistical text mining over a vast
number of documents. In addition to discovering documents related to
biomedical entities for a given query, as the GoPubMed service does,
PosMed also supports biomedical entity prioritisation. Among several software
tools available to prioritise positional candidate genes, PosMed was evaluated as
especially effective in comparison with two other similar tools,
GeneSniffer and SUSPECTS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We expect our prioritisation tools to effectively assist
further practical life science studies, making the most of the data extensibility
of the Semantic Web.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toyoda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Statistical search on the Semantic Web</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>24</volume>
          (
          <issue>7</issue>
          ), pp.
          <fpage>1002</fpage>
          -
          <lpage>1010</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Makita</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshida</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mochizuki</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nishikata</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsushima</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishii</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takatsuki</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhatia</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khadbaatar</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watabe</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masuya</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toyoda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>PosMed: Ranking genes and bioresources based on Semantic Web Association Study</article-title>
          .
          <source>Nucleic Acids Res</source>
          .,
          <volume>41</volume>
          (
          <issue>Web Server issue</issue>
          ), pp.
          <fpage>W109</fpage>
          -
          <lpage>W114</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashburner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosse</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bug</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceusters</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eilbeck</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ireland</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mungall</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ;
          <collab>OBI Consortium</collab>
          ,
          <string-name>
            <surname>Leontis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocca-Serra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruttenberg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sansone</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheuermann</surname>
            ,
            <given-names>R.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whetzel</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration</article-title>
          .
          <source>Nat. Biotechnol</source>
          .,
          <volume>25</volume>
          (
          <issue>11</issue>
          ), pp.
          <fpage>1251</fpage>
          -
          <lpage>1255</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Masuya</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshikawa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heida</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toyoda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wakana</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shiroishi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Phenosite: a web database integrating the mouse phenotyping platform and the experimental procedures in mice</article-title>
          .
          <source>J. Bioinform. Comput. Biol</source>
          .,
          <volume>5</volume>
          , pp.
          <fpage>1173</fpage>
          -
          <lpage>1191</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kato</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watanabe</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohno</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inoue</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanno</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suzuki</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Okada</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Mapping quantitative trait loci for proteinuria-induced renal collagen deposition</article-title>
          .
          <source>Kidney Int</source>
          .,
          <volume>73</volume>
          , pp.
          <fpage>1017</fpage>
          -
          <lpage>1023</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eppig</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strivens</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding a mouse: the International Mouse Strain Resource (IMSR)</article-title>
          .
          <source>Trends Genet</source>
          .,
          <volume>15</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>81</fpage>
          -
          <lpage>82</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Thornblad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elliott</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jowett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Visscher</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Prioritization of Positional Candidate Genes Using Multiple Web-Based Software Tools</article-title>
          .
          <source>Twin Res. Hum. Genet</source>
          .,
          <volume>10</volume>
          (
          <issue>6</issue>
          ), pp.
          <fpage>861</fpage>
          -
          <lpage>870</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>