=Paper=
{{Paper
|id=Vol-2929/poster5
|storemode=property
|title=A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities
|pdfUrl=https://ceur-ws.org/Vol-2929/poster5.pdf
|volume=Vol-2929
|authors=Essam Mansour
|dblpUrl=https://dblp.org/rec/conf/vldb/Mansour21
}}
==A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities==
<pdf width="1500px">https://ceur-ws.org/Vol-2929/poster5.pdf</pdf>
<pre>
A Data Discovery Platform Empowered by Knowledge Graph Technologies:
Challenges and Opportunities

Essam Mansour
Concordia University
essam.mansour@concordia.ca

ABSTRACT
In this talk, we present KGLac, a data discovery platform empowered by knowledge graph
technologies, and highlight several open research challenges and opportunities.

Reference Format:
Essam Mansour. A Data Discovery Platform Empowered by Knowledge Graph Technologies:
Challenges and Opportunities. In the 2nd Workshop on Search, Exploration, and Analysis in
Heterogeneous Datastores (SEA Data 2021).

1    DEVELOPMENT AND OPPORTUNITIES
With the growing importance of data science and open data initiatives, thousands of
machine-readable, structured, and semi-structured datasets are collected and made available
via data discovery systems in the case of enterprise datasets or via data portals in the case
of public datasets. Data portals are maintained, for example, by governments (e.g., USA,
Canada, and EU), organizations (such as WHO and WTO), and ML portals (such as Kaggle and
OpenML). Existing portals and systems suffer from limited discovery support and do not track
the use of a dataset and insights derived from it. Thus, data integration and enrichment are
the primary responsibility of data scientists, who spend most of their time finding out where
a relevant dataset exists, understanding its impact on a specific task, finding ways to
enrich a dataset, and leveraging the derived insights.

Data portals and search engines, such as Google Dataset Search, provide primitive search
capabilities to find and download open datasets in different formats, such as CSV, JSON, and
XML. Moreover, many organizations are encouraged to build a navigational data structure (data
catalogue) to support data discovery [2, 4] or to use tools such as Amundsen. Unfortunately,
these systems and tools suffer from limited query support and cannot find data items based on
learned representations (embeddings). There is a need for an extensible set of effective
discovery operations to find relevant data in enterprise datasets accessible via data
discovery systems or in open datasets accessible via data portals.

Several methods were proposed to measure table relatedness [5], support table discovery [1],
and find joinable tables [6]. These methods work in isolation from each other and from data
portals and discovery systems. Thus, there is a need for data portals and discovery systems
with a flexible query language and an extensible set of discovery operations. Moreover,
existing data science platforms, such as MLFlow or Cloud AutoML, and tools, such as Jupyter
Notebooks or Google Colab, should be able to communicate easily with these portals and
systems.

The development of KGLac [3], as illustrated in Figure 1, poses research opportunities in
various areas spanning data management and AI. These research opportunities cover
(i) abstracting and capturing semantics from heterogeneous datasets, (ii) constructing
decentralized knowledge graphs (KGs) for datasets, (iii) supporting inference and automatic
graph learning to incrementally introduce and enhance the relationships among different nodes
in the graph, and (iv) automating several aspects of data science including data preparation,
augmentation, and insights analysis.

[Figure 1 diagram labels: GLac Construction (Data Profiler, GLac Builder, Deductive Linker);
Storage; Interface Services (Discovery Operations, Query Manager, Embedding Similarity,
Access Control); HDFS; Data Lake; ML Pipeline Tools]
Figure 1: The KGLac architecture; KGLac gets access to a local data lake to construct GLac.
Different ML pipeline tools can communicate with KGLac to facilitate data discovery.

KGLac is supported by different methods for data profiling and representation learning
(embedding) to capture metadata and semantics of datasets to construct a knowledge graph
(GLac). KGLac provides an extensible set of data discovery operations implemented using
SPARQL queries, and supports ad-hoc queries. KGLac enables automatic graph learning to
advance functionalities, such as classification of similar data items, finding unionable and
joinable tables, predicting shortest paths between tables, and inferring new relationships.
We designed KGLac to be deployed on top of a data owner’s data lake to enable efficient and
extensible data discovery operations for data scientists who have access to the data lake.

REFERENCES
[1] Christina Christodoulakis, Eric Munson, Moshe Gabel, Angela Demke Brown, and Renée J.
    Miller. 2020. Pytheas: Pattern-based Table Discovery in CSV Files. PVLDB 13, 11.
[2] Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and
    Michael Stonebraker. 2018. Aurum: A Data Discovery System. In ICDE.
[3] Ahmed Helal, Mossad Helali, Khaled Ammar, and Essam Mansour. 2021. A Demonstration of
    KGLac: A Data Discovery and Enrichment Platform for Data Science. PVLDB 14, 12.
[4] Fatemeh Nargesian, Ken Q. Pu, Erkang Zhu, Bahar Ghadiri Bashardoost, and Renée J. Miller.
    2020. Organizing Data Lakes for Navigation. In SIGMOD.
[5] Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive
    Data Science. In SIGMOD.
[6] Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set
    Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD.

Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021 for the
volume as a collection by its editors. This volume and its papers are published under the
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in
Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen, Denmark)
on CEUR-WS.org.
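
The embedding-based linking and discovery operations mentioned in Section 1 can be
illustrated with a small sketch. The text gives no implementation details, so everything
here is an assumption made for illustration: the table and column names, the toy embedding
vectors, the 0.9 threshold, and the helper functions cosine, similarity_edges, and
related_tables are hypothetical and are not KGLac's API. The sketch links columns from
different tables whose embeddings are close under cosine similarity, then answers a simple
"which tables are related to this one" query over the resulting edges.

```python
import math
from itertools import combinations

# Hypothetical column embeddings keyed by (table, column); the vectors are made up.
EMBEDDINGS = {
    ("patients.csv", "country"): [0.9, 0.1, 0.0],
    ("who_stats.csv", "nation"): [0.8, 0.2, 0.1],
    ("sales.csv", "price"): [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(embeddings, threshold=0.9):
    """Return (column_a, column_b, score) for cross-table pairs at or above threshold."""
    edges = []
    for (col_a, emb_a), (col_b, emb_b) in combinations(sorted(embeddings.items()), 2):
        if col_a[0] == col_b[0]:
            continue  # skip pairs of columns from the same table
        score = cosine(emb_a, emb_b)
        if score >= threshold:
            edges.append((col_a, col_b, score))
    return edges


def related_tables(edges, table):
    """Toy discovery operation: tables linked to `table` by at least one edge."""
    related = set()
    for (table_a, _), (table_b, _), _ in edges:
        if table_a == table:
            related.add(table_b)
        elif table_b == table:
            related.add(table_a)
    return sorted(related)


EDGES = similarity_edges(EMBEDDINGS)
print(related_tables(EDGES, "patients.csv"))
```

In a platform like the one described, such edges would presumably be stored in the
knowledge graph and retrieved through SPARQL queries; the in-memory list here only stands
in for that storage.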

</pre>