<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Integration of variation data through SPARQL Micro-Services</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Catherine Faron</string-name>
          <email>faron@i3s.unice.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frederic Metereau</string-name>
          <email>frederic.metereau@etu.univ-cotedazur.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franck Michel</string-name>
          <email>fmichel@i3s.unice.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Larmande</string-name>
          <email>pierre.larmande@ird.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guilhem Sempere</string-name>
          <email>guilhem.sempere@cirad.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIADE, IRD, Univ. Montpellier, CIRAD</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform</institution>
          ,
          <addr-line>Bioversity, CIRAD, INRAE, IRD, Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Intertryp</institution>
          ,
          <addr-line>CIRAD, INRAE, IRD, Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université Côte d'Azur</institution>
          ,
          <addr-line>Inria, CNRS, I3S (UMR 7271)</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Integrating genetic variation data is essential to understanding the interactions involving multiple genes in complex diseases. However, managing and extracting meaningful information from a large volume of genotyping data is challenging. This work aims to efficiently interconnect a MongoDB database with an RDF database through SPARQL Micro-Services. We first developed and implemented an RDF model that reuses existing ontologies. Then, we evaluated example queries interconnecting two applications, Gigwa (MongoDB) and AgroLD (SPARQL endpoint).</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>MongoDB</kwd>
        <kwd>FAIR data</kwd>
        <kwd>Genetic variations</kwd>
        <kwd>Bioinformatics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Sciences
problem, as RDF enables data to be interconnected across several databases. This work aims
to find an efficient way to interconnect a MongoDB database with an RDF database.</p>
      <p>
        As a proof of concept, we decided to use the Gigwa [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and AgroLD [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] database applications
to demonstrate the benefits of leveraging data semantics on a high volume of genomic data.
Gigwa is a web application designed to store large volumes of genotypes (up to tens of billions),
initially imported from VCF or other file formats, in a MongoDB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] database, and to provide a
straightforward interface for filtering these data. It lets users navigate search
results, visualize them in different ways, and re-export subsets of the data in various common
formats. AgroLD is a knowledge graph that uses Semantic Web technologies to integrate
data of interest to the plant science community. AgroLD is built incrementally and spans
many aspects of plant molecular interactions. The current phase covers information on genes,
proteins, predictions of homologous genes, metabolic pathways, plant trait associations, and
genetic studies.
      </p>
      <p>
        For this work, we first developed an RDF model based on existing ontologies and inspired by
DisGeNET [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We extended it with features required by the Gigwa data model, which
integrates gene annotation information. Then we developed SPARQL Micro-Services [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
using the Gigwa RESTful API. Finally, we developed and evaluated example SPARQL queries that
interconnect Gigwa and AgroLD.
      </p>
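      <p>The interconnection described above can be illustrated with a federated SPARQL 1.1 query that calls one SERVICE clause per data source and joins the results on a shared gene variable. The following is a minimal sketch only, not the implementation used in this work: both endpoint URLs, the geneId parameter, the ex: vocabulary and the function name build_federated_query are hypothetical placeholders chosen for illustration.</p>
      <preformat><![CDATA[
```python
"""Sketch: compose a federated SPARQL query over two services.

All IRIs, parameter names and vocabulary terms below are placeholders
(assumptions for illustration); none of them is taken from the paper.
"""

# Hypothetical SPARQL micro-service wrapping a Gigwa REST call.
GIGWA_MICRO_SERVICE = "https://example.org/sparql-ms/gigwa/getVariantsByGene"
# Hypothetical SPARQL endpoint standing in for AgroLD.
AGROLD_ENDPOINT = "https://example.org/agrold/sparql"


def build_federated_query(gene_id: str) -> str:
    """Return a SPARQL 1.1 query joining variants and gene labels on ?gene."""
    # Reject characters that could break out of the service IRI.
    if not gene_id.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"unexpected characters in gene id: {gene_id!r}")
    return f"""\
PREFIX ex: <https://example.org/vocab#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?variant ?gene ?label WHERE {{
  SERVICE <{GIGWA_MICRO_SERVICE}?geneId={gene_id}> {{
    ?variant ex:locatedInGene ?gene .
  }}
  SERVICE <{AGROLD_ENDPOINT}> {{
    ?gene rdfs:label ?label .
  }}
}}
"""


if __name__ == "__main__":
    # Print the query text; sending it to an endpoint is left out on purpose.
    print(build_federated_query("Os01g0100100"))
```
]]></preformat>
      <p>In this sketch the micro-service argument is passed in the query string of the service IRI, and the join between the two sources happens on the ?gene variable. The property that actually links a variant to a gene depends on the RDF model and is assumed here.</p>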
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sempéré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pétel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Frouin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hueber</surname>
          </string-name>
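          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Bellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Larmande</surname>
          </string-name>
          ,
          <article-title>Gigwa v2-Extended and improved genotype investigator</article-title>
          ,
          <source>GigaScience</source>
          <volume>8</volume>
          (
          <year>2019</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1093/gigascience/giz051</pub-id>
          .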
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Venkatesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Ngompe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Hassouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chentli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Guignon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Larmande</surname>
          </string-name>
          ,
          <article-title>Agronomic Linked Data (AgroLD): A knowledge-based system to enable integrative biology in agronomy</article-title>
          ,
          <source>PLOS ONE</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <elocation-id>e0198270</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1371/journal.pone.0198270</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamsky</surname>
          </string-name>
          ,
          <article-title>Adapting TPC-C benchmark to measure performance of multi-document transactions in MongoDB</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>2254</fpage>
          -
          <lpage>2262</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.14778/3352063.3352140</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Piñero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Queralt-Rosinach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bravo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deu-Pons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bauer-Mehren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sanz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. I.</given-names>
            <surname>Furlong</surname>
          </string-name>
          ,
          <article-title>DisGeNET: A discovery platform for the dynamical exploration of human diseases and their genes</article-title>
          ,
          <source>Database</source>
          <volume>2015</volume>
          (
          <year>2015</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1093/database/bav028</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Faron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gargominy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <article-title>Integration of Web APIs and Linked Data Using SPARQL Micro-Services-Application to Biodiversity Use Cases</article-title>
          ,
          <source>Information</source>
          <volume>9</volume>
          (
          <year>2018</year>
          )
          <elocation-id>310</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/info9120310</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>