<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PoLyInfo RDF: A Semantically Reinforced Polymer Database for Materials Informatics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masashi Ishii</string-name>
          <email>ISHII.Masashi@nims.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taro Takemura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikiko Tanifuji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute for Materials Science</institution>
          ,
          <addr-line>1-1 Namiki, Tsukuba, Ibaraki 305-0044</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RDF Design for a Polymer Database</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>For materials database integration, we introduce semantic web technology into a polymer database called PoLyInfo. The Resource Description Framework (RDF) was used to create a semantic description of the polymer formation process (polymerization). The polymerization correlates polymers with their source monomers, and the monomers in the PoLyInfo RDF were then conceptually linked to those in other chemical substance databases, such as Nikkaji. Although no common ontology was used in the PoLyInfo and Nikkaji RDFs, 94.3% of the monomers in PoLyInfo were assigned a Nikkaji substance ID, and the established information path provided probable polymerization information for monomers in the Nikkaji RDF.</p>
      </abstract>
      <kwd-group>
        <kwd>RDF</kwd>
        <kwd>Polymer Database</kwd>
        <kwd>Monomer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Semantic web technology using the Resource Description Framework (RDF) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for linked open
data (LOD) is widely accepted in the life science community, where open science has
long been embraced. However, in materials science, a sub-domain of physics and
chemistry, the industrial importance and confidentiality of material development are not
always compatible with open science, and hence semantic web technology has never
held a position of major importance. Despite this traditional market mechanism,
recent material development, with its attention to ecology and economy, accountability for
products, and social contribution, necessitates knowledge sharing. LOD satisfies this
demand and is considered the leading player in open-data-driven materials informatics
(ODD-MI) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this study, we design an RDF for a material (polymer) database, for
ODD-MI, and demonstrate the linking of data with other international databases.
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>1 https://www.nims.go.jp/eng/index.html, 2 https://www.jst.go.jp/EN/
(JST)2 to NIMS in 2003, the database has been unconcerned with semantic web
technology, that is, the database has been functioning as an unlinked data source. Due to
the skillful identification of polymer naming in academic articles, PoLyInfo has proved
expensive owing to an increase in polymer data that are not easily obtainable from other
chemical databases. On the other hand, the uniqueness of PoLyInfo means it contains
few common terms in comparison to other related elements, resulting in a technical
barrier to the linked data. From a survey of PoLyInfo aiming to realize linked data, we
deduced that the most important and linkable term in the database is the “monomer
name”. The grounds of this deduction were established as follows:
- Polymers are essentially synthesized from monomers.
- The synthesis (polymerization path) represents the primary information for polymer
developers.
- There are several monomer databases linked to life science with RDF.</p>
      <p>
        Fortunately, as PoLyInfo has managed polymer names, polymerization paths,
monomer names with original identification numbers (IDs), and a protocol between these
IDs, we were able to design an RDF-connected graph for PoLyInfo, as shown in Figure
1, without utilizing frontier tools [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], whereby we defined ns1 and ns2 as prefixes:
@prefix ns1: &lt;https://polymer.nims.go.jp/rdf/&gt; . (tentative prefix for development)
@prefix ns2: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
      </p>
      <p>The triple graph in Fig. 1 shows an example of an industrially important polymer,
“polyethylene”, and its semantic links can be described as follows:
1) The polymer with the composition unit identified by CU010001 has the name
“Polyethylene” (CU010001 ns2:label “Polyethylene”@en.).
2) CU010001 has a polymerization path with the ID number J000002 (CU010001
ns1:pHasPolymerizationPath J000002.).
3) J000002 has the name “Addition polymerization” (J000002 ns2:label “Addition
polymerization”@en.).
4) J000002 has source monomers with the ID numbers M0301021 and M0101001
(J000002 ns1:pHasMonomer M0301021, M0101001.).
5) M0301021 has the name “Buta-1,3-diene”, and M0101001 has the name “Ethene”
(M0301021 ns2:label “Buta-1,3-diene”@en. M0101001 ns2:label “Ethene”@en.).
Consequently, the polymer CU010001 is correlated to the monomers M0301021 and
M0101001.</p>
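      <p>The chain of links 1) to 5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the triples are held as plain tuples rather than in an RDF store, the ns1: prefix on the subject IDs is assumed (the text abbreviates it), and the helper names objects and monomers_of are hypothetical.</p>
      <preformat>
```python
# Illustrative triples from the Fig. 1 example, held as plain Python tuples.
# Assumption: subject IDs carry the ns1: prefix, which the text abbreviates.
LABEL = "ns2:label"
HAS_PATH = "ns1:pHasPolymerizationPath"
HAS_MONOMER = "ns1:pHasMonomer"

TRIPLES = [
    ("ns1:CU010001", LABEL, "Polyethylene"),
    ("ns1:CU010001", HAS_PATH, "ns1:J000002"),
    ("ns1:J000002", LABEL, "Addition polymerization"),
    ("ns1:J000002", HAS_MONOMER, "ns1:M0301021"),
    ("ns1:J000002", HAS_MONOMER, "ns1:M0101001"),
    ("ns1:M0301021", LABEL, "Buta-1,3-diene"),
    ("ns1:M0101001", LABEL, "Ethene"),
]


def objects(triples, subject, predicate):
    """Return every object of (subject, predicate, ?o), in insertion order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def monomers_of(triples, polymer):
    """Follow polymer, then polymerization path, then monomer; return monomer IDs."""
    found = []
    for path in objects(triples, polymer, HAS_PATH):
        for monomer in objects(triples, path, HAS_MONOMER):
            if monomer not in found:
                found.append(monomer)
    return found


if __name__ == "__main__":
    # List each source monomer of CU010001 together with its label.
    for m in monomers_of(TRIPLES, "ns1:CU010001"):
        print(m, objects(TRIPLES, m, LABEL)[0])
```
      </preformat>
      <p>The two-hop traversal mirrors the graph design: a polymer never points at a monomer directly, so the polymerization path is the only route between them.</p>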
      <p>
        In this study, we performed data linking with an organic chemical database created
by JST, called Nikkaji (Japan Chemical Substance Dictionary) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which primarily has
monomer information comprising 3,459,747 records. Nikkaji has previously been
published in RDF (Nikkaji RDF) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in 2015. The Nikkaji RDF was designed by the
National Bioscience Database Center (NBDC)3 with the aim of linking data to the life
science domain using a monomer. The Nikkaji RDF was standardized with an ontology
that is commonly used in the ChEMBL database of the European Bioinformatics
Institute, the European Molecular Biology Laboratory (EMBL-EBI)4 in the UK, and the
PubChem database [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] of the National Center for Biotechnology Information (NCBI)5
in the US. The latest Nikkaji RDF provides 146,220,942 triples.
3 https://biosciencedbc.jp/en/
4 https://www.ebi.ac.uk/
5 https://www.ncbi.nlm.nih.gov/
      </p>
      <p>
        To link the PoLyInfo RDF to the Nikkaji RDF, we introduced conceptual linking,
based on Simple Knowledge Organization System (SKOS) specifications [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
published as part of the World Wide Web Consortium (W3C) recommendations. In the
example illustrated in Fig. 1,
      </p>
      <p>6) M0301021 (Buta-1,3-diene) and M0101001 (Ethene) of PoLyInfo closely match
J4.043F and J1.939I of Nikkaji, respectively. The corresponding RDF triples in SKOS
can be expressed by:
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
ns1:M0301021 skos:closeMatch nikkaji:J4.043F .
ns1:M0101001 skos:closeMatch nikkaji:J1.939I .</p>
      <p>As these entities may not be chemically identical (e.g., they may differ in
purity), we judged “skos:closeMatch” more appropriate than “skos:exactMatch”
and “owl:sameAs”.
Based on the RDF designed in Fig. 1, we generated triples for the polymerization process
in PoLyInfo, as listed in Table 1.</p>
      <p>This table indicates that 15,884 polymers of 2) had information concerning the
polymerization path, and 35,018 monomers of 4) were related to these polymerizations.
PoLyInfo has 17,825 monomers, and the skos:closeMatch links of 6) amounted to
16,809 triples; thus, 94.3% (16,809/17,825) of the monomers in PoLyInfo were assigned a
Nikkaji substance ID. The total of 227,482 triples is not large. Nevertheless, we
performed a demonstration using a query written in SPARQL. The crossover task
was to compile a list of polymers in PoLyInfo that could be synthesized from the
monomers in Nikkaji, and the response returned 39,907 polymers. This result indicates
that the PoLyInfo RDF was successfully linked to the Nikkaji RDF.</p>
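      <p>The crossover task and the coverage figure can be illustrated with a small Python sketch. It emulates the join in plain Python rather than SPARQL and is not the query used in the study. Several points are assumptions: a polymer is counted when at least one of its source monomers carries a skos:closeMatch link, the entries CU999999, J999999 and M9999999 are hypothetical toy data, and the function names are invented for illustration.</p>
      <preformat>
```python
# Toy data: IDs from the Fig. 1 example plus one hypothetical polymer whose
# only monomer (ns1:M9999999) has no Nikkaji counterpart.
HAS_PATH = "ns1:pHasPolymerizationPath"
HAS_MONOMER = "ns1:pHasMonomer"
CLOSE_MATCH = "skos:closeMatch"

TRIPLES = [
    ("ns1:CU010001", HAS_PATH, "ns1:J000002"),
    ("ns1:J000002", HAS_MONOMER, "ns1:M0301021"),
    ("ns1:J000002", HAS_MONOMER, "ns1:M0101001"),
    ("ns1:M0301021", CLOSE_MATCH, "nikkaji:J4.043F"),
    ("ns1:M0101001", CLOSE_MATCH, "nikkaji:J1.939I"),
    ("ns1:CU999999", HAS_PATH, "ns1:J999999"),     # hypothetical polymer
    ("ns1:J999999", HAS_MONOMER, "ns1:M9999999"),  # hypothetical, unlinked
]


def polymers_linked_to_nikkaji(triples):
    """Polymers with at least one source monomer that has a skos:closeMatch."""
    linked = {s for s, p, _ in triples if p == CLOSE_MATCH}
    paths = {s for s, p, o in triples if p == HAS_MONOMER and o in linked}
    return sorted({s for s, p, o in triples if p == HAS_PATH and o in paths})


def coverage_percent(matched, total):
    """Share of monomers carrying a closeMatch link, rounded to one decimal."""
    return round(100 * matched / total, 1)


if __name__ == "__main__":
    print(polymers_linked_to_nikkaji(TRIPLES))
    # Counts reported in the text: 16,809 closeMatch triples, 17,825 monomers.
    print(coverage_percent(16809, 17825))
```
      </preformat>
      <p>With the counts reported above, coverage_percent(16809, 17825) evaluates to 94.3, matching the stated share of linked monomers.</p>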
      <p>Although we demonstrated the potential of linked data technology between
PoLyInfo and Nikkaji, we still encountered a number of problems intrinsic to polymer
science. In PoLyInfo, we normalized the polymer names to International Union of Pure
and Applied Chemistry (IUPAC)6 nomenclature. Unfortunately, this international rule
sometimes conflicts with the RDF format of W3C. An example of this conflict can be
seen in a triple for the polymerization ID J0018355:
&lt;https://polymer.nims.go.jp/rdf/J0018355&gt;
ns1:pHasMonomer &lt;https://polymer.nims.go.jp/rdf/M4000864&gt; ;
ns2:label
"Addition_polymerization_of_9,9,9',9',9",9"-hexahexyl-7-(4-vinylphenyl)-2,2':7',2"-terfluorene"@en .</p>
      <p>The triples indicate that J0018355, with source monomer M4000864, has the label
“Addition polymerization of
9,9,9',9',9",9"-hexahexyl-7-(4-vinylphenyl)-2,2':7',2"-terfluorene”. However, the IUPAC name of monomer M4000864 includes double quotes (")
without an escape code, which clearly conflicts with the RDF format (no escape code is
defined in IUPAC nomenclature) and results in a syntax error in the uploading process.</p>
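      <p>One possible way to avoid this syntax error is to escape the label before serialization. The sketch below is an assumption about how such a step could look, not a description of how the conflict was handled in this study. It relies only on the escape sequences that Turtle defines for quoted string literals, and the function name turtle_literal is hypothetical.</p>
      <preformat>
```python
def turtle_literal(text, lang=None):
    """Serialize text as a double-quoted Turtle string literal."""
    escaped = (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')        # the double primes of IUPAC names
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    literal = '"' + escaped + '"'
    return literal + "@" + lang if lang else literal


# Monomer name from the J0018355 example; it contains three double quotes.
NAME = "9,9,9',9',9\",9\"-hexahexyl-7-(4-vinylphenyl)-2,2':7',2\"-terfluorene"

if __name__ == "__main__":
    print(turtle_literal("Addition_polymerization_of_" + NAME, "en"))
```
      </preformat>
      <p>Single quotes need no escaping inside a double-quoted literal, so only the double primes of the name are changed.</p>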
    </sec>
    <sec id="sec-2">
      <title>Summary</title>
      <p>To establish linked data for the polymer database PoLyInfo, we designed RDF triples
that describe the polymerization process. The triples describing polymerization from
source monomers bridged PoLyInfo to the entities present in other chemical substance
databases, such as Nikkaji. By using the 227,482 triples created for the purposes of this
study, we found 39,907 polymers that could be synthesized from monomers registered
in Nikkaji. In some cases, we found an unexpected conflict between the International
Union of Pure and Applied Chemistry (IUPAC) nomenclature and the RDF format
recommended by W3C.</p>
      <p>This study was supported by the Cabinet Office, Government of Japan,
Cross-ministerial Strategic Innovation Promotion Program (SIP), “Technologies for Smart
Bioindustry and Agriculture” (funding agency: NARO).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>RDF 1.1 Primer</source>
          , https://www.w3.org/TR/rdf11-primer/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rajan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Materials Informatics: The Materials “Gene” and Big Data</article-title>
          .
          <source>Annu. Rev. Mater. Res</source>
          .
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>153</fpage>
          -
          <lpage>169</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <source>Polymer Database (PoLyInfo)</source>
          , https://polymer.nims.go.jp/index_en.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <source>The Linked Data Integration Framework</source>
          , http://silkframework.org/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <source>Nikkaji Web</source>
          , https://integbio.jp/dbcatalog/en/record/nbdc01530.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. NBDC NikkajiRDF, https://dbarchive.biosciencedbc.jp/en/nikkaji/desc.html.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <source>PubChem Database</source>
          , https://pubchem.ncbi.nlm.nih.gov/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <source>SKOS Primer</source>
          , https://www.w3.org/TR/skos-primer/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>