<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Multi-Source Linked Open Data Fusion Method for Gene Disorder Drug Relationship Querying</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guozheng Rao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Li Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenwen Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fang Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cui Tao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin 300350</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Economics and Management, Tianjin University of Science and Technology</institution>
          ,
          <addr-line>Tianjin 300222</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of Texas School of Biomedical Informatics</institution>
          ,
          <addr-line>7000 Fannin St Suite 600, Houston, TX 77030</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The volume of biomedical data is steadily increasing, and a growing body of research focuses on querying the relationships among genes, drugs, and disorders. Information about genes, drugs, and disorders is generally stored in heterogeneous datasets that reside in different places and use different formats, such as RDF/XML, SQL relational databases, and plain text. For most applications based on gene-disorder-drug relationship querying, the challenge is to fuse these multi-source, cross-platform biomedical datasets. To tackle this problem, we propose a novel multi-source linked open data fusion method for gene-disorder-drug relationship querying. In this method, a variety of biomedical datasets are first converted into RDF triples, and the multi-source datasets are then combined into a single storage system by data fusion. After fusion, the system can query the relationships among entities from different datasets. SPARQL query experiments were carried out on 4 different datasets using the 9 kinds of common query questions proposed in this paper. The experimental results indicate that the method integrates multi-source heterogeneous biomedical datasets efficiently and reliably, and that most query results draw on more than one dataset. The method can also be used to fuse additional biomedical datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Open Data</kwd>
        <kwd>Gene Disorder Drug Relationship querying</kwd>
        <kwd>Data Fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Large amounts of semantic data are available in RDF format in many fields, such as
life science [1]. These datasets cover areas such as medicine and biology, and most
biomedical researchers hope to derive new findings from them. Querying the
relationships among disorders, genes, and drugs is important to many kinds of
biomedical research, such as extracting disease biomarkers [2], identifying
disease mechanisms [3], and predicting the health benefits (efficacy) of a drug over a
placebo [4], which could further facilitate precision medicine and clinical decision support.
However, gene, drug, and disorder datasets have been built by different research
organizations and institutions for different biomedical applications. Furthermore, these
datasets are stored in different places and in different formats, such as RDF/XML, SQL
relational databases, and plain text. Integrating multi-source, cross-platform
biomedical datasets is therefore a major challenge for gene-disorder-drug relationship
querying. Semantic web technology has been used to tackle this problem. For example,
Bio2RDF [5] was built to provide a more complete database of linked
bioscience data; a total of 35 datasets have been integrated.</p>
      <p>Most of these datasets are in a specific field. However, more and more gene, drug,
and disorder datasets will emerge, and many new genes, drugs, and disorders will be
described in them. How to fuse multi-source data as needed is therefore an important
issue. Another key issue is how to query large-scale data quickly across multiple
datasets for a specific application.</p>
    </sec>
    <sec id="sec-2">
      <title>Multi-source Data Fusion</title>
      <p>We present a comprehensive scheme for fusing multi-source data: the RDF vocabulary
is normalized so that a single vocabulary replaces the multiple original ones, and the
URIs of the same entity are normalized so that one URI replaces the old ones. The main
steps are as follows.</p>
      <p>1. Data collection: Data can be collected in three ways: imported from local
datasets, gathered by a crawler, or imported via SPARQL.</p>
      <p>2. Mapping to schema: The purpose of this step is to generate a unified RDF
vocabulary. R2R [6] is used as the mapping language that translates concepts from all
the datasets into the application's target vocabulary.</p>
      <p>3. Unifying URIs: The purpose of this step is to normalize the URI of the same
entity across data sources, so that one entity does not appear under different aliases.
The tool used is the Silk Link Discovery Framework.</p>
      <p>4. Quality evaluation and data fusion: LDIF [7] (Linked Data Integration Framework)
is an open-source linked data integration framework. Sieve, a module of LDIF, is used
to evaluate the quality of the data fusion.</p>
      <p>5. Output: In LDIF, the output data can be written to a file or stored in
the corresponding storage system.</p>
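      <p>As a rough illustration of step 3, the sketch below rewrites alias URIs to one
canonical URI once links between equivalent entities are known. The link pairs, the
identifiers such as kegg:drug1, and the function names are assumptions made for this
example only; in the workflow described here the links would be produced by the Silk
Link Discovery Framework, not written by hand.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_canonical_map(same_as_links: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every URI in a linked group to one representative URI (union-find)."""
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in same_as_links:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Keep the lexicographically smallest URI as the representative.
            keep, drop = sorted((root_l, root_r))
            parent[drop] = keep
    return {uri: find(uri) for uri in list(parent)}


def unify_uris(triples: Iterable[Triple], canonical: Dict[str, str]) -> List[Triple]:
    """Rewrite subjects and objects to canonical URIs and drop duplicate triples."""
    seen = set()
    result: List[Triple] = []
    for s, p, o in triples:
        t = (canonical.get(s, s), p, canonical.get(o, o))
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


# Hypothetical identifiers: three aliases of the same drug in three datasets.
links = [("kegg:drug1", "pharmgkb:drugA"), ("pharmgkb:drugA", "sem:aspirin")]
triples = [
    ("pharmgkb:drugA", "myprop:Label", "aspirin"),
    ("sem:aspirin", "myprop:Label", "aspirin"),
    ("sem:aspirin", "sem:treats", "sem:headache"),
]
canonical = build_canonical_map(links)
unified = unify_uris(triples, canonical)
```
]]></preformat>
      <p>Choosing the lexicographically smallest URI as the representative is an arbitrary
but deterministic rule; any stable choice would serve the same purpose.</p>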
      <p>In this paper, we first preprocess each dataset. PharmGKB is processed in three steps:
1) convert PharmGKB to plain text format;
2) import it into a relational database;
3) convert it to RDF data with the D2R tool.</p>
      <p>KEGG is preprocessed in the same way. SemMedDB is already a relational database,
so it only needs to be converted to RDF data with the D2R tool.</p>
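      <p>The last preprocessing step can be pictured with the following minimal sketch, which
turns relational rows into N-Triples lines. It is not the D2R tool and does not use its
mapping language; the namespace http://example.org/, the table and column names, and
the function names are hypothetical. Each row becomes a subject, and each non-key
column becomes a predicate with a literal value.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Iterator, List, Optional

BASE = "http://example.org/"  # hypothetical namespace, not from the paper
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_ntriples(table: str, key: str,
                     rows: List[Dict[str, Optional[str]]]) -> Iterator[str]:
    """Yield one type triple per row and one literal triple per non-key column.

    Assumes the key column holds values that are safe to embed in a URI.
    """
    for row in rows:
        subject = f"<{BASE}{table}/{row[key]}>"
        yield f"{subject} <{RDF_TYPE}> <{BASE}vocab/{table}> ."
        for column, value in row.items():
            if column == key or value is None:
                continue  # skip the key itself and SQL NULLs
            yield f'{subject} <{BASE}vocab/{table}_{column}> "{escape_literal(value)}" .'


rows = [{"id": "g1", "name": "BRCA1"}]
lines = list(rows_to_ntriples("gene", "id", rows))
```
]]></preformat>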
      <p>After preprocessing, we merge the datasets using Algorithm I.</p>
      <preformat>
while (getTriple(?s, rdfs:label, ?o) || getTriple(?s, pharmgkb:name, ?o)
       || getTriple(?s, kegg:name, ?o) || getTriple(?s, sem:name, ?o))
    replace the predicate with myprop:Label

while (getTriple(?s, a, kegg:drug) || getTriple(?s, a, pharmgkb:PharmGKB_Drugs)
       || getTriple(?s, a, sem:drug))
    replace the object with myclass:Drug

while (getTriple(?s, a, kegg:gene) || getTriple(?s, a, pharmgkb:PharmGKB_Genes)
       || getTriple(?s, a, sem:gene) || getTriple(?s, a, uniprot:Gene))
    replace the object with myclass:Gene

while (getTriple(?s, a, kegg:disorder) || getTriple(?s, a, pharmgkb:PharmGKB_Disorders)
       || getTriple(?s, a, sem:disorder))
    replace the object with myclass:Disorder
</preformat>
      <p>Fig. 1. Algorithm I: Dataset fusion (UniProt, PharmGKB, SemMedDB, KEGG_GENE,
KEGG_PATH)</p>
      <p>Algorithm I describes the dataset fusion operation:
1. Merge the name and name-like properties of each dataset into myprop:Label;
2. Normalize the drug class of each dataset to the customized myclass:Drug;
3. Normalize the gene class of each dataset to the customized myclass:Gene;
4. Normalize the disorder class of each dataset to the customized myclass:Disorder.</p>
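      <p>The vocabulary normalization of Algorithm I can be sketched in Python as a single
pass over the triples. The prefixed names are those used in Algorithm I; representing
triples as tuples of strings, writing rdf:type as "a", and the function name normalize
are assumptions of this sketch, which stands in for the mapping rules that would
otherwise be expressed in R2R.</p>
      <preformat><![CDATA[
```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "a"  # shorthand for rdf:type, as written in Algorithm I

# Name-like predicates that are merged into myprop:Label.
LABEL_PREDICATES = {"rdfs:label", "pharmgkb:name", "kegg:name", "sem:name"}

# Dataset-specific classes and the unified class that replaces them.
CLASS_MAP = {
    "kegg:drug": "myclass:Drug",
    "pharmgkb:PharmGKB_Drugs": "myclass:Drug",
    "sem:drug": "myclass:Drug",
    "kegg:gene": "myclass:Gene",
    "pharmgkb:PharmGKB_Genes": "myclass:Gene",
    "sem:gene": "myclass:Gene",
    "uniprot:Gene": "myclass:Gene",
    "kegg:disorder": "myclass:Disorder",
    "pharmgkb:PharmGKB_Disorders": "myclass:Disorder",
    "sem:disorder": "myclass:Disorder",
}


def normalize(triples: Iterable[Triple]) -> List[Triple]:
    """Apply the four replacement rules of Algorithm I to every triple."""
    out: List[Triple] = []
    for s, p, o in triples:
        if p in LABEL_PREDICATES:
            p = "myprop:Label"
        elif p == RDF_TYPE and o in CLASS_MAP:
            o = CLASS_MAP[o]
        out.append((s, p, o))
    return out


sample = [
    ("kegg:d1", "a", "kegg:drug"),
    ("kegg:d1", "kegg:name", "aspirin"),
    ("uniprot:g1", "a", "uniprot:Gene"),
    ("sem:x1", "sem:treats", "sem:y1"),
]
normalized = normalize(sample)
```
]]></preformat>
      <p>A single pass with lookup tables gives the same result as the four loops of
Algorithm I, because each rule touches a disjoint set of triples.</p>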
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <sec id="sec-3-1">
        <title>Datasets</title>
        <p>The following biomedical datasets are used in the experiment to verify the
method. PharmGKB [8] is a database of pharmacogenetics and pharmacogenomics.
UniProt [9] is a comprehensive resource for protein sequences and annotation data.
The KEGG [10] database is used as a reference knowledge base for the
integration of molecular datasets from genome sequencing. SemMedDB is a repository of
semantic predications extracted from MEDLINE citations (titles and abstracts) [11]. In
this paper, the gene-disorder-drug relationships extracted from SemMedDB are
converted to RDF and used in the experiments.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Query Design</title>
        <p>For gene-drug-disorder relationship querying, nine kinds of relationship queries
are designed, as listed in Table 1. These queries cover all the relationships among
genes, drugs, and disorders.</p>
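        <p>To make the idea of a relationship query over the fused data concrete, the sketch
below answers one possible question: which drugs treat a disorder associated with a
given gene. It is not one of the nine queries of this paper; the predicates
ex:associatedWith and ex:treats, the entity names, and the function name are
hypothetical, and the join is written in plain Python in place of a SPARQL endpoint.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def drugs_for_gene(triples: Iterable[Triple], gene: str) -> List[Tuple[str, str]]:
    """Return sorted (disorder, drug) pairs reachable from `gene`.

    Roughly corresponds to the SPARQL pattern
        ?gene ex:associatedWith ?disorder . ?drug ex:treats ?disorder .
    with ?gene bound to the given gene.
    """
    disorders_of_gene: Set[str] = set()
    drugs_by_disorder: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p == "ex:associatedWith" and s == gene:
            disorders_of_gene.add(o)
        elif p == "ex:treats":
            drugs_by_disorder[o].add(s)
    return sorted(
        (disorder, drug)
        for disorder in disorders_of_gene
        for drug in drugs_by_disorder.get(disorder, ())
    )


# Hypothetical fused triples; each line could originate from a different dataset.
fused = [
    ("ex:geneA", "ex:associatedWith", "ex:disorderX"),
    ("ex:drug1", "ex:treats", "ex:disorderX"),
    ("ex:drug2", "ex:treats", "ex:disorderY"),
]
pairs = drugs_for_gene(fused, "ex:geneA")
```
]]></preformat>
        <p>Because the URIs and vocabulary are unified beforehand, the two triple patterns can
be joined even when the gene-disorder and drug-disorder facts come from different
sources.</p>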
      </sec>
      <sec id="sec-3-3">
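        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Source datasets and their respective proportions of the results of each query.</p>
          </caption>
          <table>
            <thead>
              <tr><th>No.</th><th>SemMedDB</th><th>PharmGKB</th><th>KEGG</th><th>UniProt</th></tr>
            </thead>
            <tbody>
              <tr><td>Q1</td><td>62.22%</td><td>25.56%</td><td>—</td><td>12.22%</td></tr>
              <tr><td>Q2</td><td>63.48%</td><td>36.52%</td><td>—</td><td>—</td></tr>
              <tr><td>Q3</td><td>80%</td><td>20%</td><td>—</td><td>—</td></tr>
              <tr><td>Q4</td><td>100%</td><td>—</td><td>—</td><td>—</td></tr>
              <tr><td>Q5</td><td>—</td><td>100%</td><td>—</td><td>—</td></tr>
              <tr><td>Q6</td><td>76.60%</td><td>23.40%</td><td>—</td><td>—</td></tr>
              <tr><td>Q7</td><td>83.56%</td><td>—</td><td>16.44%</td><td>—</td></tr>
              <tr><td>Q8</td><td>74.42%</td><td>25.58%</td><td>—</td><td>—</td></tr>
              <tr><td>Q9</td><td>100%</td><td>—</td><td>—</td><td>—</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For the nine relationships between genes, disorders, and drugs, nine queries (Q1-Q9)
were designed. Table 2 shows the source datasets and their respective proportions of
each query's results. The results come mainly from SemMedDB and PharmGKB, and some
come from KEGG and UniProt. Except for Q4, Q5, and Q9, all query results come from
multiple datasets. This helps to obtain more valuable results for analyzing the
relationships among genes, disorders, and drugs.</p>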
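        <p>Per-source proportions of this kind can be computed once each query result is tagged
with the dataset it came from. The sketch below assumes such a provenance tag is
available for every result; the function name and the counts in the usage example are
illustrative and are not the counts behind the reported percentages.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, Iterable


def source_proportions(result_sources: Iterable[str]) -> Dict[str, float]:
    """Percentage of query results contributed by each source dataset."""
    counts = Counter(result_sources)
    total = sum(counts.values())
    if total == 0:
        return {}  # a query with no results has no proportions to report
    return {source: round(100.0 * n / total, 2) for source, n in counts.items()}


# Illustrative provenance tags for four results of one query.
example = source_proportions(["SemMedDB", "SemMedDB", "SemMedDB", "PharmGKB"])
```
]]></preformat>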
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper, a multi-source linked open data fusion method for gene-disorder-drug
relationship querying is proposed. A variety of biomedical datasets are converted into
RDF triples, and the datasets are then combined into a single storage system by data
fusion. After fusion, the system can discover the relationships among various
entities. SPARQL query experiments were carried out using the 9 kinds of common
query questions proposed in this paper. The experimental results show that most of the
query results came from different datasets. The multi-source medical linked data
storage method thus supports federated queries across multiple datasets.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This research is partially supported by the National Natural Science Foundation of
China (NSFC) (61373165, 61672377). The authors also appreciate the support from
the National Library of Medicine of the National Institutes of Health under Award
Number R01LM011829.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Cong</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Constructing Biomedical Knowledge Graph Based on SemMedDB and Linked Open Data</article-title>
          .
          <source>In: 2018 IEEE International Conference on Bioinformatics and Biomedicine</source>
          . IEEE, Madrid, Spain (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Vlietstra</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zielman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schultes</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiesman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Mulligen</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kors</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Automated extraction of potential migraine biomarkers using a semantic graph</article-title>
          .
          <source>J. Biomed. Inform</source>
          .
          <volume>71</volume>
          ,
          <fpage>178</fpage>
          -
          <lpage>189</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Hofmann-Apitius</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gebel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagewadi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Bono</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kodamullil</surname>
            ,
            <given-names>A.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Younesi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebeling</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tegnér</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canard</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders</article-title>
          .
          <source>Int. J. Mol. Sci</source>
          .
          <volume>16</volume>
          ,
          <fpage>29179</fpage>
          -
          <lpage>29206</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>PSPS: A pharmacological substances prediction system based on biomedical literature data</article-title>
          .
          <source>In: The 2nd International Workshop on the Semantics of Mental Health</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE, Xi'an, China (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Belleau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tourigny</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Good</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morissette</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Bio2RDF: a semantic web atlas of post genomic knowledge about human and mouse</article-title>
          .
          <source>In: International Workshop on Data Integration in the Life Sciences</source>
          . pp.
          <fpage>153</fpage>
          -
          <lpage>160</lpage>
          . Springer, Berlin, Heidelberg (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The R2R Framework: Publishing and Discovering Mappings on the Web</article-title>
          .
          <source>COLD</source>
          .
          <volume>665</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matteini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>LDIF - a framework for large-scale linked data integration</article-title>
          .
          <source>In: 21st International World Wide Web Conference Developers Track</source>
          . CEUR-WS.org, Lyon, France (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Whirl-Carrillo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonagh</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hebert</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sangkuhl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorn</surname>
            ,
            <given-names>C.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          :
          <article-title>Pharmacogenomics knowledge for personalized medicine</article-title>
          .
          <source>Clin. Pharmacol. Ther</source>
          .
          <volume>92</volume>
          ,
          <fpage>414</fpage>
          -
          <lpage>417</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <collab>UniProt Consortium</collab>
          :
          <article-title>UniProt: the universal protein knowledgebase</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>46</volume>
          ,
          <fpage>2699</fpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Kanehisa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sato</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawashima</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furumichi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanabe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>KEGG as a reference resource for gene and protein annotation</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>44</volume>
          ,
          <fpage>457</fpage>
          -
          <lpage>462</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>Inf. Serv. Use</source>
          .
          <volume>31</volume>
          ,
          <fpage>15</fpage>
          -
          <lpage>21</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>