<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Essam</forename><surname>Mansour</surname></persName>
							<email>essam.mansour@concordia.ca</email>
							<affiliation key="aff0">
								<orgName type="institution">Concordia University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">372E377486E9678B3AE0A11AE316834C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this talk, we present KGLac, a data discovery platform empowered by knowledge graph technologies, and highlight several open research challenges and opportunities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">DEVELOPMENT AND OPPORTUNITIES</head><p>With the growing importance of data science and open data initiatives, thousands of machine-readable, structured, and semi-structured datasets are collected and made available via data discovery systems in the case of enterprise datasets or via data portals in the case of public datasets. Data portals are maintained, for example, by governments, e.g., USA, Canada, and EU; organizations, such as WHO and WTO; and ML portals, such as Kaggle and OpenML. Existing portals and systems suffer from limited discovery support and do not track the use of a dataset or the insights derived from it. Thus, data integration and enrichment are the primary responsibility of data scientists, who spend most of their time locating relevant datasets, understanding their impact on a specific task, finding ways to enrich a dataset, and leveraging the derived insights.</p><p>Data portals and search engines, such as Google Dataset Search, provide primitive search capabilities to find and download open datasets in different formats, such as CSV, JSON, and XML. Moreover, many organizations are encouraged to build a navigational data structure (data catalogue) to support data discovery <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b3">4]</ref> or to use tools such as Amundsen. Unfortunately, these systems and tools suffer from limited query support and cannot find data items based on learned representations (embeddings). Data scientists need an extensible set of effective discovery operations to find relevant data in enterprise datasets accessible via data discovery systems or in open datasets accessible via data portals.</p><p>Several methods have been proposed to measure table relatedness <ref type="bibr" target="#b4">[5]</ref>, support table discovery <ref type="bibr" target="#b0">[1]</ref>, and find joinable tables <ref type="bibr" target="#b5">[6]</ref>. 
These methods work in isolation from each other and from data portals and discovery systems. Thus, there is a need for data portals and discovery systems with a flexible query language and an extensible set of discovery operations. Moreover, existing data science platforms, such as MLflow or Cloud AutoML, and tools, such as Jupyter Notebooks or Google Colab, should be able to communicate easily with these portals and systems.</p><p>The development of KGLac <ref type="bibr" target="#b2">[3]</ref>, as illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, presents research opportunities in various areas spanning data management and AI. These research opportunities cover (i) abstracting and capturing semantics from heterogeneous datasets, (ii) constructing decentralized knowledge graphs (KGs) for datasets, (iii) supporting inference and automatic graph learning to incrementally introduce and enhance the relationships among different nodes in the graph, and (iv) automating several aspects of data science, including data preparation, augmentation, and insights analysis.</p><p>KGLac uses different methods for data profiling and representation learning (embedding) to capture the metadata and semantics of datasets and construct a knowledge graph (GLac). KGLac provides an extensible set of data discovery operations implemented using SPARQL queries, and supports ad-hoc queries. KGLac enables automatic graph learning to support advanced functionalities, such as classification of similar data items, finding unionable and joinable tables, predicting shortest paths between tables, and inferring new relationships. 
We designed KGLac to be deployed on top of a data owner's data lake to enable efficient and extensible data discovery operations for data scientists who have access to the data lake.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The KGLac architecture; KGLac gets access to a local data lake to construct GLac. Different ML pipeline tools can communicate with KGLac to facilitate data discovery.</figDesc><graphic coords="1,334.88,173.84,95.16,56.23" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Pytheas: Pattern-based Table Discovery in CSV Files</title>
		<author>
			<persName><forename type="first">Christina</forename><surname>Christodoulakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><surname>Munson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Moshe</forename><surname>Gabel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Angela</forename><forename type="middle">Demke</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renée</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page">11</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Aurum: A Data Discovery System</title>
		<author>
			<persName><forename type="first">Raul</forename><surname>Castro Fernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ziawasch</forename><surname>Abedjan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Famien</forename><surname>Koko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gina</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><surname>Madden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Stonebraker</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>ICDE</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science</title>
		<author>
			<persName><forename type="first">Ahmed</forename><surname>Helal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mossad</forename><surname>Helali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Khaled</forename><surname>Ammar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Essam</forename><surname>Mansour</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page">12</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Organizing Data Lakes for Navigation</title>
		<author>
			<persName><forename type="first">Fatemeh</forename><surname>Nargesian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ken</forename><forename type="middle">Q</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erkang</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bahar</forename><surname>Ghadiri Bashardoost</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renée</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>SIGMOD</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Finding Related Tables in Data Lakes for Interactive Data Science</title>
		<author>
			<persName><forename type="first">Yi</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zachary</forename><forename type="middle">G</forename><surname>Ives</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>SIGMOD</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes</title>
		<author>
			<persName><forename type="first">Erkang</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dong</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fatemeh</forename><surname>Nargesian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renée</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>SIGMOD</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
