=Paper=
{{Paper
|id=Vol-2929/poster5
|storemode=property
|title=A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities
|pdfUrl=https://ceur-ws.org/Vol-2929/poster5.pdf
|volume=Vol-2929
|authors=Essam Mansour
|dblpUrl=https://dblp.org/rec/conf/vldb/Mansour21
}}
==A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities==
A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities

Essam Mansour

Concordia University

essam.mansour@concordia.ca

ABSTRACT

In this talk, we present KGLac, a data discovery platform empowered by knowledge graph technologies, and highlight several open research challenges and opportunities.

Reference Format:

Essam Mansour. A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

[Figure 1 component labels: GLac Construction, Interface Services, Data Profiler, GLac Builder, Embedding, Linker, Storage, Discovery Operations, Query Manager, Access Control, HDFS, Data Lake, ML Pipeline Tools]

Figure 1: The KGLac architecture; KGLac gets access to a local data lake to construct GLac. Different ML pipeline tools can communicate with KGLac to facilitate data discovery.
1 DEVELOPMENT AND OPPORTUNITIES

With the growing importance of data science and open data initiatives, thousands of machine-readable, structured, and semi-structured datasets are collected and made available via data discovery systems in the case of enterprise datasets or via data portals in the case of public datasets. Data portals are maintained, for example, by governments (e.g., USA, Canada, and EU), organizations (such as WHO and WTO), and ML portals (such as Kaggle and OpenML). Existing portals and systems suffer from limited discovery support and do not track the use of a dataset and insights derived from it. Thus, data integration and enrichment are the primary responsibility of data scientists, who spend most of their time finding out where a relevant dataset exists, understanding its impact on a specific task, finding ways to enrich the dataset, and leveraging the derived insights.

Data portals and search engines, such as Google Dataset Search, provide primitive search capabilities to find and download open datasets in different formats, such as CSV, JSON, and XML. Moreover, many organizations are encouraged to build a navigational data structure (data catalogue) to support data discovery [2, 4] or to use tools such as Amundsen. Unfortunately, these systems and tools suffer from limited query support and cannot find data items based on learned representations (embeddings). There is a need for an extensible set of effective discovery operations to find relevant data in enterprise datasets accessible via data discovery systems or in open datasets accessible via data portals.

Several methods were proposed to measure table relatedness [5], support table discovery [1], and find joinable tables [6]. These methods work in isolation from each other and from data portals and discovery systems. Thus, there is a need for data portals and discovery systems with a flexible query language and an extensible set of discovery operations. Moreover, existing data science platforms, such as MLFlow or Cloud AutoML, and tools, such as Jupyter Notebooks or Google Colab, should be able to communicate easily with these portals and systems.

The development of KGLac [3], as illustrated in Figure 1, poses research opportunities in various areas spanning data management and AI. These research opportunities cover (i) abstracting and capturing semantics from heterogeneous datasets, (ii) constructing decentralized knowledge graphs (KGs) for datasets, (iii) supporting inference and automatic graph learning to incrementally introduce and enhance the relationships among different nodes in the graph, and (iv) automating several aspects of data science, including data preparation, augmentation, and insights analysis.

KGLac is supported by different methods for data profiling and representation learning (embedding) that capture the metadata and semantics of datasets in order to construct a knowledge graph (GLac). KGLac provides an extensible set of data discovery operations implemented using SPARQL queries, and supports ad-hoc queries. KGLac enables automatic graph learning to advance functionalities such as classifying similar data items, finding unionable and joinable tables, predicting shortest paths between tables, and inferring new relationships. We designed KGLac to be deployed on top of a data owner’s data lake to enable efficient and extensible data discovery operations for data scientists who have access to the data lake.

REFERENCES

[1] Christina Christodoulakis, Eric Munson, Moshe Gabel, Angela Demke Brown, and Renée J. Miller. 2020. Pytheas: Pattern-based Table Discovery in CSV Files. PVLDB 13, 11.

[2] Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A Data Discovery System. In ICDE.

[3] Ahmed Helal, Mossad Helali, Khaled Ammar, and Essam Mansour. 2021. A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science. PVLDB 14, 12.

[4] Fatemeh Nargesian, Ken Q. Pu, Erkang Zhu, Bahar Ghadiri Bashardoost, and Renée J. Miller. 2020. Organizing Data Lakes for Navigation. In SIGMOD.

[5] Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. In SIGMOD.

[6] Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD.

Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen, Denmark) on CEUR-WS.org.
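
ILLUSTRATIVE SKETCH

Section 1 names finding joinable tables among the functionalities that KGLac targets. As a rough illustration of what such a discovery operation computes, the sketch below scores pairs of columns from different tables by the Jaccard overlap of their values. It is a deliberately simple stand-in and not KGLac's method, which the text describes as relying on GLac, SPARQL queries, and graph learning. The function names, the threshold of 0.5, and the toy tables are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, compared as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # define the similarity of two empty columns as zero
    return len(set_a & set_b) / len(set_a | set_b)


def joinable_columns(tables, threshold=0.5):
    """Return (score, column_1, column_2) for cross-table column pairs
    whose value overlap reaches the threshold, best match first."""
    columns = [
        (table_name, column_name, values)
        for table_name, table in tables.items()
        for column_name, values in table.items()
    ]
    matches = []
    for (t1, c1, v1), (t2, c2, v2) in combinations(columns, 2):
        if t1 == t2:
            continue  # only compare columns that belong to different tables
        score = jaccard(v1, v2)
        if score >= threshold:
            matches.append((score, f"{t1}.{c1}", f"{t2}.{c2}"))
    return sorted(matches, reverse=True)


# Hypothetical toy data lake: table name -> column name -> values.
toy_lake = {
    "patients": {"country": ["CA", "US", "FR", "DE"], "age": [34, 51, 29, 62]},
    "regions": {"code": ["CA", "US", "FR", "EG"], "population": [38, 331, 67, 102]},
}

result = joinable_columns(toy_lake)
print(result)
```

On the toy data, the only pair that reaches the threshold is patients.country and regions.code, which share three of their five distinct values. A pairwise scan like this grows quadratically with the number of columns, which is one reason the cited work [6] studies dedicated overlap set similarity search for data lakes.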