<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supporting Polystore Queries using Provenance in a Hyperknowledge Graph</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo G. Azevedo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renan Souza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elton Soares</string-name>
          <email>eltons@ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Thiago</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Oliveira</string-name>
          <email>acoliveira@cos.ufjr.br</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcio Moreno</string-name>
          <email>mmorenog@br.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern applications commonly need to manage various types of datasets, usually composed of heterogeneous data and schemas manipulated by disparate tools and techniques in an ad hoc way. This demo presents HKPoly, a solution that tackles the challenge of mapping and linking heterogeneous data and encapsulates data access by employing semantics, provenance, and data linkage.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph</kwd>
        <kwd>Hyperknowledge</kwd>
        <kwd>Polystore</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Modern applications usually manipulate diverse datasets with different
models and usages, employing several tools and techniques. For example, oil
reserves discovery is critical in the O&amp;G industry, and it involves several activities
performed by collaborating teams, consuming and generating data of distinct
sources, semantics, and formats [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We demonstrate our proposal in this scenario.
      </p>
      <p>
        Data management solutions have emerged to handle heterogeneous data
access, e.g., distributed file systems, NoSQL stores, and processing frameworks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In general,
each solution handles one or a few kinds of data models or formats and does not
provide a semantic abstraction for client access. However, modern applications
usually require handling heterogeneous data systems. So, a middleware that
provides a seamless interface with an independent data model and (perhaps) data
schema becomes necessary [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Federated data systems are the leading candidates
for such middleware [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but they lack an abstract semantic layer. Semantic
mapping and record linkage are still a challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: they call for the automatic
translation of client utterances into the encapsulated storage systems' dialects and
the integration of query results.
      </p>
      <p>
        This work demonstrates HKPoly, a solution that addresses this challenge
by employing Semantic Web concepts, namely Linked Data, Provenance Data, and
inference rules, to extract meaning and enable reasoning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>2 Hyperknowledge Polystore (HKPoly)</p>
      <p>HKPoly uses an abstract global representation and query language to encapsulate
heterogeneous remote data stores. It provides semantic mapping and record
linkage through a domain ontology, metadata about the remote data stores' schemas
and the domain ontology, and provenance techniques. Its main components are
(Figure 1.a): (i) HKPoly Core, which controls access to the remote data stores;
(ii) Provenance Manager, which captures and manages the provenance of the processes that
manipulate the (remote) data; and (iii) Provenance and Knowledge Graph, which stores
provenance data, the domain ontology and data stores' metadata, references to
remote data objects, and mappings between the domain and remote data store schemas.</p>
    </sec>
    <sec id="sec-2">
      <title>Figure 1</title>
      <p>[Figure 1. (a) HKPoly components: Client Application, API, HKPoly Core, Provenance Manager, Provenance and Knowledge Graph, and the remote data stores (RDBMS, NoSQL, Object Store, File System). (b) Implementation: Client Application, HyQL, KES, HKPoly Core, Provenance and Knowledge Graph, Catalog, SQL, PostgreSQL Foreign Data Wrapper, Provenance Manager, and the remote data stores: File System (SEGY files), PostgreSQL, MongoDB (geospatial indexes), and AllegroGraph (knowledge and annotations).]</p>
    </sec>
    <sec id="sec-11">
      <p>3 https://www.postgresql.org/docs/9.5/postgres-fdw.html</p>
      <p>4 The HyQL grammar is presented in https://ibm.ent.box.com/v/iswc2021-hyql-grammar</p>
      <p>[Figure: class diagram of the data store metamodel. Classes: Machine, DataStore, FileSystem, GraphDBMS, DocDBMS, ObjectStore, RDBMS, Database, DatabaseSchema, Attribute, ComplexAttribute, AttributeValue, ComplexAttributeValue, and DataReference. Relationships: wasRunOn, isInStore, wasInStore, isSchemaOf, isAttributeOf, isIdentifierOf, isMemberOf, wasMemberOf, wasDerivedFrom, referred, and alias.]</p>
      <p>
        KES (Knowledge Explorer System) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] provides a user interface (UI) for
managing HK bases (i.e., knowledge bases with HK specifications). KES supports
creating knowledge representations and validating and curating knowledge through
an interactive visual approach. We used KES to navigate, manipulate, and
visualize the Provenance and Knowledge Graph.
      </p>
      <p>
        The Provenance Manager was inherited from the ProvLake implementation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It
creates a provenance graph from data captured during workflow execution;
in HKPoly, these graphs are augmented with polystore semantics.
      </p>
      <p>HKPoly works in four steps: (i) create the FDW tables; (ii) create metadata about
the domain ontology, the data store configuration (i.e., access and schema metadata),
the FDW schemas, and the mappings between the domain ontology and the FDW schemas and
between the data store schemas and the FDW schemas; (iii) run the workflows that manipulate
the remote data and capture provenance; and (iv) process queries.</p>
      <p>In step (iii), the workflows' implementations are instrumented to capture each
workflow's time, input and output data, and data transformations, including
references to the remote data. This data is sent to the Provenance Manager, which
stores it in the Provenance and Knowledge Graph.</p>
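      <p>As an illustration of the instrumentation in step (iii), the following minimal Python sketch wraps a workflow task in a decorator that records its start and end times, inputs, outputs, and the remote data references it touches. The names capture_provenance, PROVENANCE_LOG, and assess_quality are hypothetical and are not part of ProvLake or HKPoly; a real deployment would presumably send each record to the Provenance Manager rather than append it to an in-memory list.</p>
      <preformat><![CDATA[
```python
import functools
import time

# Hypothetical in-memory stand-in for the Provenance Manager endpoint.
PROVENANCE_LOG = []


def capture_provenance(activity, data_refs=None):
    """Record timing, inputs, outputs, and remote data references of a task."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.time()
            result = func(*args, **kwargs)
            PROVENANCE_LOG.append({
                "activity": activity,
                "started_at": started,
                "ended_at": time.time(),
                "used": {"args": list(args), "kwargs": dict(kwargs)},
                "generated": result,
                # References to remote data, e.g. a file name or a row id.
                "data_references": dict(data_refs or {}),
            })
            return result
        return wrapper
    return decorator


@capture_provenance("data_quality_assessment",
                    data_refs={"FileSystem1": "netherlands.sgy"})
def assess_quality(file_name):
    # Placeholder transformation: accept files with the SEG-Y extension.
    return {"file": file_name, "quality_ok": file_name.endswith(".sgy")}
```
]]></preformat>
      <p>Calling assess_quality("netherlands.sgy") returns the task result unchanged and appends one record to PROVENANCE_LOG, so the workflow code itself needs no changes beyond the decorator.</p>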
      <p>In step (iv), a Client Application sends a HyQL query to HKPoly. HKPoly
parses the query to identify the queried elements and the related data sources.
It discovers the FDW foreign tables and creates a SQL query. Then, it sends the
query to PostgreSQL, which accesses the remote data using FDW wrappers<sup>5</sup>.</p>
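      <p>The translation in step (iv) can be pictured with the following minimal Python sketch. It assumes a hypothetical mapping (ATTRIBUTE_MAP) from domain ontology attributes to FDW foreign table columns and a hypothetical key column per table (KEY_COLUMN), and it builds a SQL string from a list of queried attributes and provenance-captured identifiers. It does not parse HyQL and is not HKPoly's actual implementation; the table names and aliases are borrowed from Listing 1.2, while all function and variable names are illustrative.</p>
      <preformat><![CDATA[
```python
# Hypothetical mapping: domain attribute -> (foreign table, alias, column).
ATTRIBUTE_MAP = {
    "Seismic.inline": ("seismic_header", "pg", "inline"),
    "Seismic.crossline": ("seismic_header", "pg", "crossline"),
    "Seismic.hasWell": ("kb_seismic", "ag", '"hasWell"'),
    "Seismic.hasHorizon": ("kb_seismic", "ag", '"hasHorizon"'),
}

# Hypothetical key column used to match each table with provenance data.
KEY_COLUMN = {"pg": "id", "ag": "uri"}


def build_sql(attributes, prov_ids):
    """Build a SQL query over foreign tables for the requested attributes.

    prov_ids maps a table alias to the identifier captured by provenance.
    """
    columns, tables = [], {}
    for attr in attributes:
        table, alias, column = ATTRIBUTE_MAP[attr]
        columns.append(f"{alias}.{column}")
        tables[alias] = table  # dict keeps first-seen order of aliases
    from_clause = ", ".join(f"{t} {a}" for a, t in tables.items())
    conditions = []
    for alias in tables:
        value = prov_ids[alias]
        # Sketch only: production code should use bound parameters.
        literal = str(value) if isinstance(value, int) else f"'{value}'"
        conditions.append(f"{alias}.{KEY_COLUMN[alias]} = {literal}")
    return (f"select distinct {', '.join(columns)} "
            f"from {from_clause} where {' AND '.join(conditions)}")


sql = build_sql(
    ["Seismic.inline", "Seismic.hasWell"],
    {"pg": 1,
     "ag": "http://br.ibm.com/hkpoly/seismicData_ABox#Netherland_3D"},
)
print(sql)
```
]]></preformat>
      <p>Only the tables that hold a requested attribute appear in the generated query, which mirrors the idea that the client names domain attributes and never the underlying stores.</p>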
      <p>
        Although several works have tackled the problem of database federation from
different perspectives [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], our solution encapsulates the remote data and
the complexity of its underlying model. The Client Application does not have
to specify the paths to navigate to remote data. It specifies a query in terms of
the domain ontology, and HKPoly uses the captured data references, schema
metadata, and mappings to get the remote data using FDW.
      </p>
      <p>5 We used the Multicorn (https://multicorn.org/) and file_fdw (https://www.postgresql.org/docs/9.5/file-fdw.html) PostgreSQL FDW extensions for the
wrappers implementation.</p>
      <p>3 HKPoly in use</p>
      <p>
We employed HKPoly in an oil reserves discovery scenario, which is critical in
the O&amp;G industry. It involves several activities, including seismic image
interpretation. We considered the scenario's heterogeneous data aspect (Figure 3 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]).
      </p>
      <p>Each activity uses and generates data from/to data stores with heterogeneous
data models. The first activity processes geological raw data files (residing on a
Parallel File System) to extract necessary metadata and assess their data quality
(stored in PostgreSQL, an R-DBMS). The second activity uses the high-quality
data files and generates geospatial indexes (stored in MongoDB, a Doc DBMS) to
accelerate geospatial queries over the geological data. The third activity uses the
high-quality data files and augments the raw geodata files with extra knowledge
informed by geoscience experts (stored in AllegroGraph, a T-DBMS). The last
activity prepares the learning datasets to be used by the DL algorithms.</p>
      <p>[Figure 3. Activities of the scenario and their data: data quality assessment, geospatial index generation, knowledge ingestion, and data preparation (labeled DT1 to DT4), with the geological raw data files in the Parallel File System, the R-DBMS, Doc DBMS, and T-DBMS stores, and the resulting training datasets. The legend distinguishes activity dependencies, used data, and generated data.]</p>
      <sec id="sec-11-4">
        <p>The user is an ML expert with deep knowledge of the domain who, after
running the workflow, has to report the ML model results. This requires
querying the processed domain data, which resides in the remote data stores.</p>
        <p>The user application interacts with HKPoly through its endpoints, e.g., by
sending queries in HyQL (Listing 1.1). HKPoly receives and parses the query,
producing the SQL query to be executed through FDW (Listing 1.2). The HyQL
query is at the domain abstraction level, i.e., the user does not have to be aware
of the complexities underlying the heterogeneous remote data systems. Hence,
using HKPoly is easier than writing the FDW query by hand or implementing
scripts to get the data from each remote data store independently.</p>
        <p>Listing 1.1. HyQL to get data of seismic Netherlands.</p>
        <preformat>select Seismic.inline, Seismic.crossline, Seismic.hasWell, Seismic.hasHorizon, Seismic.epsg
where Seismic from geological_data_ingestion_workflow
  and Seismic.name = "Netherlands"</preformat>
        <p>Listing 1.2. SQL to get remote data of seismic Netherlands.</p>
        <preformat>select distinct ag."hasHorizon", mg.uri,
       pg.crossline, ag."hasWell", pg.inline
from segy fl, kb_seismic ag, mongo_seismic mg, seismic_header pg,
     -- identifiers of the same seismic in each store, one column per store
     (VALUES ('netherlands.sgy',
              'http://br.ibm.com/hkpoly/seismicData_ABox#Netherland_3D',
              'http://br.ibm.com/hkpoly/seismicData_ABox#Netherland_3D', 1))
     as p(FileSystem1_prov_id, Allegro1_prov_id,
          Mongo1_prov_id, Postgres1_prov_id)
-- match each foreign table against its identifier
where ag.uri = p.Allegro1_prov_id AND mg.uri = p.Mongo1_prov_id
  AND pg.id = p.Postgres1_prov_id</preformat>
        <p>Demo video 1: HKPoly architecture; ProvLake and Data Store metamodel;
load domain knowledge; load data stores' configuration; load domain knowledge
schema; load FDW mappings.
https://ibm.box.com/v/iswc2021-hkpoly-demo-video1</p>
        <p>Demo video 2: the scenario used in the demo; input HyQL; FDW-generated
SQL for remote data access; an example of user code provenance
instrumentation; data visualization in KES; HKPoly steps; and the HKPoly service running.
https://ibm.box.com/v/iswc2021-hkpoly-demo-video2</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Azevedo</surname>
            ,
            <given-names>L.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>E.F.d.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>Modern federated database systems: An overview</article-title>
          .
          <source>In: 22nd International Conference in Enterprise Information Systems (ICEIS)</source>
          . pp.
          <fpage>276</fpage>
          –
          <lpage>283</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lassila</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The Semantic Web : a new form of Web content that is meaningful to computers will unleash a revolution of new possibilities</article-title>
          .
          <source>Scientific American</source>
          <volume>284</volume>
          (
          <issue>5</issue>
          ),
          <fpage>34</fpage>
          –
          <lpage>43</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandao</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cerqueira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Extending hypermedia conceptual models to support hyperknowledge specifications</article-title>
          .
          <source>International Journal of Semantic Computing</source>
          <volume>11</volume>
          (
          <issue>01</issue>
          ),
          <fpage>43</fpage>
          –
          <lpage>64</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandão</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrion</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cerqueira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Handling hyperknowledge representations through an interactive visual approach</article-title>
          .
          <source>In: IEEE Intl. Conf. on Information Reuse and Integration</source>
          . pp.
          <fpage>139</fpage>
          –
          <lpage>146</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Özsu</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valduriez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Principles of distributed database systems</article-title>
          . Springer, 4th edn. (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Present and future of Semantic Web Technologies: a Research Statement</article-title>
          .
          <source>Intl. Journal of Computers and Applications</source>
          pp.
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          (01
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SPARQL query language for RDF</article-title>
          (
          <year>2008</year>
          ), https://www.w3.org/TR/rdf-sparql-query/, accessed on April 12th,
          <year>2021</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azevedo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thiago</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al.:
          <article-title>Efficient runtime capture of multiworkflow data using provenance</article-title>
          .
          <source>In: 2019 15th International Conference on eScience (eScience)</source>
          . pp.
          <fpage>359</fpage>
          –
          <lpage>368</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The case for polystore</article-title>
          . https://wp.sigmod.org/?p=1629
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chirkova</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gadepally</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mattson</surname>
            ,
            <given-names>T.G.</given-names>
          </string-name>
          :
          <article-title>Enabling query processing across heterogeneous data models: A survey</article-title>
          .
          <source>In: IEEE Intl. Conf. on Big Data (Big Data)</source>
          . pp.
          <fpage>3211</fpage>
          –
          <lpage>3220</lpage>
          . IEEE
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>