<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Hints to Save Time when Dealing with Big Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Damien</forename><surname>Graux</surname></persName>
							<email>damien.graux@inria.fr</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">Inria</orgName>
								<orgName type="institution" key="instit2">Université Côte d&apos;Azur</orgName>
								<orgName type="institution" key="instit3">CNRS</orgName>
								<address>
									<region>I3S</region>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Hints to Save Time when Dealing with Big Data</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">DBAA89B1797C1A8DE9BE840C4FD62847</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T04:15+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Considering the increasing number of available systems, paradigms and tools related to Big Data challenges, this keynote aims at providing hints and good practices to avoid the common time-consuming pitfalls of the domain.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>During the last decade, the availability of large datasets has enabled the design and exploration of novel scenarios that leverage both openly accessible and private datasets to gain competitive advantages. For example, Web users nowadays have access to general knowledge through the Wikidata endpoint <ref type="bibr" target="#b11">[13]</ref>, to public transport schedules in the GTFS format <ref type="bibr">[4]</ref>, to source code repositories <ref type="bibr" target="#b6">[8]</ref>, to proteins <ref type="bibr" target="#b1">[2]</ref>, to medical data<ref type="foot" target="#foot_0">1</ref>, to government records <ref type="bibr" target="#b0">[1]</ref>, etc. This availability has opened the door to more advanced and complex analytic scenarios where multiple sources are combined to build new blocks of knowledge, for instance touristic tours relying on geo-data, bus schedules and reviews from previous tourists <ref type="bibr" target="#b3">[5]</ref>. In practice, these new scenarios have led to the design of new paradigms where intermediate data structures are used to align, on common ground, the useful pieces of data coming from heterogeneous sources<ref type="foot" target="#foot_1">2</ref>. 
Consequently, with this profusion of data sources and, more generally, of available data, new paradigms were designed to cope with the large amounts of information; this is for instance the case of the MapReduce model <ref type="bibr" target="#b2">[3]</ref> and of the associated Apache Hadoop<ref type="foot" target="#foot_2">3</ref> and Apache Spark<ref type="foot" target="#foot_3">4</ref> systems, which handle Big Data processing tasks in practice when clusters of nodes are required because the data is distributed.</p><p>By nature, the Big Data landscape is cross-domain, and the available tools and systems are numerous (some of them created specifically for particular use-cases and datasets). That is why designing a solution for a particular problem in the Big Data context is challenging in several respects: one needs to know which tool to select, how to structure and combine the data, and where to find the missing information to complete the task, while keeping in mind that the solution might come from a different community facing an analogous problem. In this keynote, we provide several hints to avoid the common traps when dealing with Big Data challenges.</p><p>Data Distribution Landscape. First, it is important to know where the considered datasets are located in the data distribution landscape. Indeed, a use-case might link together datasets coming from several sources, see the right-hand side of Figure <ref type="figure" target="#fig_0">1</ref>. In parallel, each source could either sit on a single-node architecture or rely on a cluster of machines in charge of distributing the data and (possibly) the computations, see the left-hand side of Figure <ref type="figure" target="#fig_0">1</ref>. Figuring out where the current use-case is located helps to decide on the working paradigms and, more practically, on the systems to be used.</p><p>Taking the use-case into consideration. 
To build an efficient solution, it is also crucial to be use-case driven from the beginning. Typically, in a distributed context, one needs to know the type of Big Data the user is dealing with, i.e. does the data fit in the memory of a single node, does it fit in the aggregate memory of the cluster, or is it larger than the sum of the memories of all nodes? Depending on the context, the practitioner will need to select the "best" system(s) available. It is therefore important to choose, from the beginning, the performance indicators or metrics that are going to be used to evaluate and rank the various potential solutions and systems which could serve the use-case. In practice, relying on state-of-the-art benchmarks, surveys and comparative evaluations is often helpful; however, most of the time, a single study does not consider all the metrics that should be reviewed. For instance, to select a SPARQL evaluator, Graux et al. compared several solutions in the light of different general use-cases and chose the relevant set of metrics for each <ref type="bibr" target="#b4">[6]</ref>. They ended up with visual Kiviat charts, as depicted in Figure <ref type="figure" target="#fig_1">2</ref>, to guide the choice of their "best" system.</p><p>Data integration classification. Similarly to the data distribution landscape, it is also relevant to decide on the integration paradigm. As presented in Figure <ref type="figure" target="#fig_2">3</ref>, there are mainly four situations, depending on whether the datasets are structurally homogeneous or not and on how they are distributed. For instance, if several data sources have different data structures (e.g. relational tables, graphs, documents, etc.), the data integration will have to rely on wrappers to make the intermediate results compatible. 
More generally, it is worth noting that Semantic Web technologies and the OBDA approach are good candidates to integrate heterogeneous sources, see e.g. Squerall <ref type="bibr" target="#b8">[10,</ref><ref type="bibr" target="#b9">11]</ref> or SANSA <ref type="bibr" target="#b7">[9]</ref>.</p><p>The community effect. Finally, a glance at Figure <ref type="figure" target="#fig_3">4</ref> gives an insight into the complexity of finding and selecting useful tools for a dedicated use case. Indeed, the Big Data (&amp; AI) ecosystem listed by Matt Turck shows that several distinct tools exist for any single task, see for instance the number of storage solutions in the top-left corner of Figure <ref type="figure" target="#fig_3">4</ref>. As a consequence, the safest move is usually to select a tool based on the vitality of its community and not exclusively on its advertised features and performance. Typically, such a criterion can be checked using different indicators, to name a few: the response time of the main contributors on open issues, the release schedule, the quality of the documentation, and the answers received when asking for advice.</p><p>Summary. In a nutshell, when facing Big Data challenges, to save time from the very beginning, it is advised to take the following actions:</p><p>1. Check the situation of the needed datasets in the data distribution landscape; 2. Select the tool based on the final use-case, not strictly on performance, and design a suitable set of metrics to evaluate the solution; 3. Gain awareness of, and decide on, the data integration paradigm to be used; 4. Select the tool based on the vitality of its community.</p><p>Following these rules will significantly simplify the selection of paradigms for data integration, and thus help the practitioner with the specific use case implementation. 
To go further, we recommend exploring our open access book <ref type="bibr" target="#b5">[7]</ref>, which focuses on the different facets of the Big Data ecosystem.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Data distribution landscape.</figDesc><graphic coords="2,186.64,305.79,242.07,122.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Relative ranking of 10 systems.</figDesc><graphic coords="2,309.21,129.73,169.45,143.90" type="bitmap" /></figure>
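<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of choosing metrics per use-case before ranking candidate systems, the following minimal Python sketch ranks systems by a weighted sum of normalised metrics. It is only one possible scoring scheme and is not the method used in the cited study; the system names, metric names, measurements and weights are all hypothetical, and every metric is assumed to be "lower is better" and strictly positive.</p>
<p>
```python
from typing import Dict, List, Tuple

Measurements = Dict[str, Dict[str, float]]


def rank_systems(measurements: Measurements, weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank systems by a weighted sum of normalised metrics.

    Assumption: every metric is "lower is better" and strictly positive,
    so the best system on a metric scores 1.0 for that metric.
    """
    totals = {name: 0.0 for name in measurements}
    for metric, weight in weights.items():
        best = min(values[metric] for values in measurements.values())
        for name, values in measurements.items():
            totals[name] += weight * best / values[metric]
    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical measurements (seconds and gigabytes), invented for illustration.
measurements = {
    "system_a": {"query_time": 10.0, "load_time": 100.0, "disk_footprint": 4.0},
    "system_b": {"query_time": 20.0, "load_time": 50.0, "disk_footprint": 2.0},
}
# Two hypothetical use-cases that weight the same metrics differently.
interactive = {"query_time": 0.8, "load_time": 0.1, "disk_footprint": 0.1}
batch = {"query_time": 0.2, "load_time": 0.4, "disk_footprint": 0.4}

print(rank_systems(measurements, interactive)[0][0])
print(rank_systems(measurements, batch)[0][0])
```
</p>
<p>With these invented numbers, the same two systems swap places depending on the weights, which is why the metric set should be fixed per use-case before comparing systems.</p></div>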
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Data integration classification.</figDesc><graphic coords="2,134.77,460.71,345.73,165.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Big Data ecosystem in 2021 according to mattturck.com.</figDesc></figure>
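<div xmlns="http://www.tei-c.org/ns/1.0"><p>The question raised under "Taking the use-case into consideration" (does the data fit in the memory of one node, in the memory of the whole cluster, or in neither?) can be sketched as a small Python helper. This is a minimal sketch under stated assumptions: the function name, the tier labels and the cluster sizes are hypothetical, and raw memory capacities are compared directly, whereas a real deployment would also have to account for system overhead and intermediate results.</p>
<p>
```python
from typing import Sequence


def memory_tier(dataset_bytes: int, node_memory_bytes: Sequence[int]) -> str:
    """Classify a dataset by where it could fit in memory (raw capacities only)."""
    if not node_memory_bytes:
        raise ValueError("at least one node is required")
    if max(node_memory_bytes) >= dataset_bytes:
        return "single-node"
    if sum(node_memory_bytes) >= dataset_bytes:
        return "cluster-memory"
    return "larger-than-cluster-memory"


GIB = 1024 ** 3
cluster = [64 * GIB] * 4  # hypothetical cluster: four nodes with 64 GiB each

print(memory_tier(40 * GIB, cluster))   # fits on one node
print(memory_tier(200 * GIB, cluster))  # fits only across the cluster
print(memory_tier(500 * GIB, cluster))  # exceeds the total cluster memory
```
</p>
<p>The resulting tier is a first filter on the candidate systems: single-node tools remain an option in the first case, while the last case calls for systems that can work from disk.</p></div>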
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Health datasets available on: https://data.world/datasets/health</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">See for example the RDF data model<ref type="bibr" target="#b10">[12]</ref> often used in ontology-based data access solutions<ref type="bibr" target="#b12">[14]</ref> to virtualise the combined data.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://hadoop.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://spark.apache.org/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A systematic review of open government data initiatives</title>
		<author>
			<persName><forename type="first">J</forename><surname>Attard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Orlandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Scerri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Government Information Quarterly</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="399" to="418" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">UniProt: a worldwide hub of protein knowledge</title>
		<author>
			<persName><forename type="first">U</forename><surname>Consortium</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nucleic Acids Research</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="issue">D1</biblScope>
			<biblScope unit="page" from="D506" to="D515" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">MapReduce: simplified data processing on large clusters</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghemawat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="107" to="113" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Smart trip alternatives for the curious</title>
		<author>
			<persName><forename type="first">D</forename><surname>Graux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Geneves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Layaïda</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">15th International Semantic Web Conference (ISWC 2016), demo paper</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A multi-criteria experimental ranking of distributed SPARQL evaluators</title>
		<author>
			<persName><forename type="first">D</forename><surname>Graux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jachiet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Geneves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Layaïda</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 IEEE International Conference on Big Data (Big Data)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="693" to="702" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Knowledge graphs and Big Data processing</title>
		<author>
			<persName><forename type="first">V</forename><surname>Janev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Graux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jabeen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sallinger</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>Springer Nature</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Semangit: a linked dataset from git</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">O</forename><surname>Kubitza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Böckmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Graux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="215" to="228" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Distributed semantic analytics using the SANSA stack</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sejdiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bühmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Westphal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stadler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ermilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chakraborty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Saleem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C N</forename><surname>Ngomo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="147" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Squerall: Virtual ontology-based access to heterogeneous and large data sources</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Mami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Graux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Scerri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jabeen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="229" to="245" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Uniform access to multiform data lakes using semantic technologies</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Mami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Graux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Scerri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jabeen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st International Conference on Information Integration and Web-based Applications &amp; Services</title>
				<meeting>the 21st International Conference on Information Integration and Web-based Applications &amp; Services</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="313" to="322" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">RDF primer</title>
		<author>
			<persName><forename type="first">F</forename><surname>Manola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McBride</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">W3C recommendation</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">1-107</biblScope>
			<biblScope unit="page">6</biblScope>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Wikidata: a free collaborative knowledgebase</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandečić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="78" to="85" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Ontology-based data access: A survey</title>
		<author>
			<persName><forename type="first">G</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Calvanese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kontchakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lembo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rosati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zakharyaschev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Conferences on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
