<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Essam Mansour</string-name>
          <email>essam.mansour@concordia.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Concordia University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>In this talk, we present KGLac, a data discovery platform empowered by knowledge graph technologies, and highlight several open research challenges and opportunities.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Development and Opportunities</title>
      <p>
With the growing importance of data science and open data
initiatives, thousands of machine-readable, structured, and
semi-structured datasets are collected and made available via data
discovery systems in the case of enterprise datasets or via data portals in
the case of public datasets. Data portals are maintained, for example,
by governments (e.g., the USA, Canada, and the EU) and organizations
(such as WHO and WTO), and include ML portals such as Kaggle and OpenML.
Existing portals and systems suffer from limited discovery support
and do not track the use of a dataset or the insights derived from it.
Thus, data integration and enrichment are the primary
responsibility of data scientists, who spend most of their time determining where
a relevant dataset exists, understanding its impact on a specific task,
finding ways to enrich a dataset, and leveraging the derived insights.</p>
      <p>Data portals and search engines, such as Google Dataset Search,
provide primitive search capabilities to find and download open
datasets in different formats, such as CSV, JSON, and XML.
Moreover, many organizations are encouraged to build a navigational
data structure (data catalogue) to support data discovery [2, 4] or
to use tools such as Amundsen. Unfortunately, these systems and
tools suffer from limited query support and cannot find data items
based on learned representations (embeddings). There is a need for
an extensible set of effective discovery operations to find relevant
data in enterprise datasets accessible via data discovery
systems or in open datasets accessible via data portals.</p>
      <p>Several methods were proposed to measure table relatedness [5],
support table discovery [1], and find joinable tables [6]. These
methods work in isolation from each other and from data portals and
discovery systems. Thus, there is a need for data portals and
discovery systems with a flexible query language and an extensible set
of discovery operations. Moreover, existing data science platforms,
such as MLflow or Cloud AutoML, and tools, such as Jupyter
Notebooks or Google Colab, should be able to communicate easily with
these portals and systems.</p>
      <p>Figure 1: The KGLac architecture. KGLac gets access to a
local data lake to construct GLac. Different ML pipeline tools
can communicate with KGLac to facilitate data discovery.</p>
      <p>The development of KGLac [3], as illustrated in Figure 1, poses
research opportunities in various areas spanning data management
and AI. These research opportunities cover (i) abstracting and
capturing semantics from heterogeneous datasets, (ii) constructing
decentralized knowledge graphs (KGs) for datasets, (iii) supporting
inference and automatic graph learning to incrementally introduce
and enhance the relationships among different nodes in the graph,
and (iv) automating several aspects of data science including data
preparation, augmentation, and insights analysis.</p>
      <p>KGLac relies on different methods for data profiling and
representation learning (embeddings) to capture the metadata and
semantics of datasets and construct a knowledge graph (GLac). KGLac
provides an extensible set of data discovery operations implemented
using SPARQL queries, and supports ad-hoc queries. KGLac enables
automatic graph learning to support advanced functionalities, such as
classification of similar data items, finding unionable and joinable tables,
predicting shortest paths between tables, and inferring new
relationships. We designed KGLac to be deployed on top of a data
owner’s data lake to enable efficient and extensible data discovery
operations for data scientists who have access to the data lake.</p>
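      <p>As a rough illustration of two of the operations mentioned above, finding
joinable tables and finding a shortest path between tables, the following Python
sketch links tables whose columns share values and then searches the resulting graph.
It is not KGLac's implementation: KGLac builds on SPARQL queries over GLac and on
graph learning, whereas this sketch substitutes plain value containment between
columns and breadth-first search. The function names, the 0.8 threshold, and the toy
tables are all assumptions made for the example.</p>
      <preformat>
```python
from collections import deque


def containment(a, b):
    """Fraction of the distinct values of column a that also occur in column b."""
    a, b = set(a), set(b)
    return len(a.intersection(b)) / len(a) if a else 0.0


def build_join_graph(tables, threshold=0.8):
    """Link two tables when any pair of their columns overlaps enough.

    tables maps a table name to a dict of column name -> list of values.
    The threshold is an arbitrary value chosen for this example.
    """
    graph = {name: set() for name in tables}
    names = sorted(tables)
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            joinable = any(
                max(containment(lv, rv), containment(rv, lv)) >= threshold
                for lv in tables[left].values()
                for rv in tables[right].values()
            )
            if joinable:
                graph[left].add(right)
                graph[right].add(left)
    return graph


def shortest_join_path(graph, source, target):
    """Breadth-first search; returns the tables on a shortest path, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical toy data lake with three small tables.
tables = {
    "patients": {
        "patient_id": [1, 2, 3, 4],
        "city": ["Montreal", "Laval", "Quebec", "Montreal"],
    },
    "visits": {
        "patient_id": [1, 2, 3],
        "clinic_id": ["c1", "c2", "c1"],
    },
    "clinics": {
        "clinic_id": ["c1", "c2", "c3"],
        "name": ["North", "South", "East"],
    },
}

graph = build_join_graph(tables)
path = shortest_join_path(graph, "patients", "clinics")
print(path)
```
      </preformat>
      <p>On this toy input, the sketch reports the path patients, visits, clinics:
the patients and clinics tables share no column values directly, so they are connected
only through the visits table. A learned approach could replace the containment test
with a similarity score over column embeddings without changing the path search.</p>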
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>