<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hints to Save Time when Dealing with Big Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Damien Graux</string-name>
          <email>damien.graux@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Inria, Université Côte d'Azur</institution>
          ,
          <addr-line>CNRS, I3S</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Considering the increasing number of available systems, paradigms and tools related to Big Data challenges, this keynote aims at providing hints and good practices to avoid the common time-consuming pitfalls of the domain. During the last decade, the availability of large datasets has enabled the design and exploration of novel scenarios that leverage both openly accessible and private datasets to gain competitive advantages. For example, Web users nowadays have access to general knowledge through the Wikidata endpoint [13], to public transport schedules in the GTFS format [4], to source code repositories [8], to proteins [2], to medical data, to governments' records [1], etc. This availability has opened the door to more advanced and complex analytic scenarios where multiple sources are combined in order to build new blocks of knowledge, for instance touristic tours relying on geo-data, bus schedules and reviews from previous tourists [5]. In practice, these new scenarios have led to the design of new paradigms where intermediate data structures are used to align, on common ground, the useful pieces of data coming from heterogeneous sources. Consequently, with this profusion of data sources and, more generally, of available data, new paradigms were designed to cope with the large amounts of information; this is for instance the case of the MapReduce model [3] and the associated Apache Hadoop or Apache Spark, which deal in practice with Big Data processing tasks when clusters of nodes have to be used because data is distributed. By nature, the Big Data landscape is cross-domain and the tools and systems available are numerous (with some specifically created for particular use-cases and datasets).
That is why designing a solution for a particular problem in the Big Data context is challenging in several respects: one needs to know which tool to select, how to structure and combine the data, and where to find the missing information to complete the task, while keeping in mind that the solution might come from a different community facing an analogous problem. In this keynote, we provide several hints to avoid the common traps when dealing with Big Data challenges.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Data Distribution Landscape. First, it is important to know where the
considered datasets are located in the data distribution landscape. Indeed, a
use-case may link together datasets coming from several sources, see
Figure 1’s right-hand side. In parallel, each source could either sit on a
single-node architecture or rely on a cluster of machines in charge of distributing
the data and (possibly) the computations, see the left-hand side of Figure 1.
Figuring out where the current use-case is located helps to decide on the
working paradigms and, more practically, on the systems to be used.
Taking the use-case into consideration. To build an efficient solution, it is also
crucial to be use-case driven from the beginning. Typically, in a
distributed context, one needs to know, for instance, the type of Big Data the user
is dealing with, i.e. does the data fit in the memory of a single node, does it fit
in the aggregated memory of the cluster, or is it larger than the sum of the memories
of all nodes? Depending on the context, the practitioner will then need to select the
“best” system(s) available. It is therefore important to choose from the
beginning the performance indicators or metrics that are going to be used to evaluate
and rank the various potential solutions and systems which could be
used to achieve the use-case. In practice, relying on state-of-the-art benchmarks,
surveys and comparative evaluations is often helpful; however, most of the time, a
single study does not consider at once all the metrics that should be reviewed.
For instance, to select a SPARQL evaluator, Graux et al. compared several
solutions in the light of different general use-cases and chose the relevant set
of metrics for each [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They ended up with visual Kiviat charts, as depicted
in Figure 2, to guide the choice of their “best” system.
      </p>
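      <p>As a minimal sketch of the memory question raised above, the following Python snippet classifies a dataset according to whether it fits in the memory of a single node, in the aggregated memory of the cluster, or in neither. The names MemoryTier and classify_dataset, the tier labels and the sizes are illustrative assumptions and do not come from any specific system; raw capacities are compared directly, whereas a real deployment would keep a safety margin for the runtime.</p>
      <preformat><![CDATA[
```python
from enum import Enum


class MemoryTier(Enum):
    """Hypothetical labels for the three situations described in the text."""
    FITS_SINGLE_NODE = "fits in the memory of one node"
    FITS_CLUSTER = "fits in the aggregated cluster memory"
    EXCEEDS_CLUSTER = "larger than the sum of the node memories"


def classify_dataset(dataset_bytes, node_memories_bytes):
    """Compare a dataset size with per-node and aggregated memory.

    Assumption: raw capacities are compared directly; real systems reserve
    part of the memory for the runtime, so a safety margin is advisable.
    """
    if not node_memories_bytes:
        raise ValueError("at least one node is required")
    if dataset_bytes <= max(node_memories_bytes):
        return MemoryTier.FITS_SINGLE_NODE
    if dataset_bytes <= sum(node_memories_bytes):
        return MemoryTier.FITS_CLUSTER
    return MemoryTier.EXCEEDS_CLUSTER


GIB = 1024 ** 3
cluster = [64 * GIB] * 4  # illustrative cluster: four nodes with 64 GiB each
print(classify_dataset(200 * GIB, cluster).value)
```
]]></preformat>
      <p>Knowing the tier early narrows down the candidate systems: a dataset that fits on one node may not need a distributed engine at all, whereas one exceeding the cluster memory calls for systems able to work from disk.</p>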
      <p>
        Data integration classification. Similarly to the data distribution landscape, it
is also relevant to decide on the integration paradigm. As presented in Figure 3,
there are mainly four situations, depending on whether the datasets are structurally
homogeneous or not and on how they are distributed. For instance, if
several data sources have different data structures (e.g. relational tables, graphs,
documents, etc.), the data integration will have to rely on wrappers
to make the intermediate results compatible. More generally, it is worth noting
that Semantic Web technologies and the ontology-based data access (OBDA) approach
are good candidates to integrate heterogeneous sources, see e.g. Squerall [
        <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
        ] or SANSA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The community effect. Finally, a glance at Figure 4 gives an insight into
the complexity of finding and selecting useful tools for a dedicated use case.
Indeed, the Big Data (&amp; AI) ecosystem listed by Matt Turck shows that
several distinct tools exist to achieve a single task, see for instance the number of
storage solutions in the top-left corner of Figure 4. As a consequence, the safest
move is usually to select a tool based on the vitality of its community and not
exclusively on its advertised features and performance. Typically, such
a criterion can be checked using different indicators, to name a few: checking
the response time of the main contributors to the open issues, glancing at the
release agenda, reading the documentation, asking for advice.
      </p>
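      <p>The data integration classification discussed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the Source class, its fields and the integration_situation helper are hypothetical names, and the two boolean axes (structural heterogeneity and distribution) are a simplified reading of the four situations mentioned in the text, not a reproduction of Figure 3.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one data source."""
    name: str          # illustrative identifier
    structure: str     # e.g. "relational", "graph", "document"
    distributed: bool  # True if the source is spread over a cluster


def integration_situation(sources):
    """Place a set of sources on the two axes discussed in the text."""
    if not sources:
        raise ValueError("at least one source is required")
    structures = {source.structure for source in sources}
    heterogeneous = len(structures) > 1
    return {
        "heterogeneous": heterogeneous,
        "distributed": any(source.distributed for source in sources),
        # Different structures imply wrappers to align intermediate results.
        "needs_wrappers": heterogeneous,
    }


sources = [
    Source("customers", "relational", distributed=False),
    Source("social_graph", "graph", distributed=True),
]
print(integration_situation(sources))
```
]]></preformat>
      <p>Making the two axes explicit before choosing a system helps to spot early whether wrappers, and hence an intermediate data model, will be required.</p>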
      <p>Summary. In a nutshell, when facing Big Data challenges, it is advised to take the
following actions in order to save time from the very beginning:
1. Check the situation of the needed datasets in the data distribution landscape;
2. Select the tool based on the final use-case, not strictly on performance, and
design for that purpose a suitable set of metrics to evaluate the solution;
3. Gain awareness of and decide on the data integration paradigm to be used;
4. Select the tool based on the vitality of its community.</p>
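      <p>To illustrate the second action, the sketch below ranks candidate systems with a weighted sum over min-max normalised metrics. This is a deliberately simple stand-in chosen for illustration and not the method used in [6]; the system names, metric names, measurements and weights are all made up, and the weights are where the use-case would be encoded.</p>
      <preformat><![CDATA[
```python
def normalise(values, higher_is_better):
    """Min-max scale values to [0, 1], where 1 is always the best."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(value - lo) / (hi - lo) for value in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_systems(measurements, weights, higher_is_better):
    """Return system names sorted from best to worst weighted score."""
    names = list(measurements)
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        column = [measurements[name][metric] for name in names]
        normalised = normalise(column, higher_is_better[metric])
        for name, value in zip(names, normalised):
            scores[name] += weight * value
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Made-up measurements for three hypothetical systems.
measurements = {
    "system_a": {"query_time_s": 12.0, "load_time_s": 300.0, "ram_gib": 16.0},
    "system_b": {"query_time_s": 20.0, "load_time_s": 60.0, "ram_gib": 8.0},
    "system_c": {"query_time_s": 15.0, "load_time_s": 120.0, "ram_gib": 32.0},
}
# For all three illustrative metrics, lower values are better.
higher_is_better = {"query_time_s": False, "load_time_s": False, "ram_gib": False}
# Hypothetical use-case where query latency matters most.
weights = {"query_time_s": 0.6, "load_time_s": 0.2, "ram_gib": 0.2}
print(rank_systems(measurements, weights, higher_is_better))
```
]]></preformat>
      <p>Changing the weights, for instance to favour loading time over query time, can reorder the candidates, which is why the metrics and their importance should be fixed before comparing systems.</p>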
      <p>
        Following these rules will significantly simplify the selection of paradigms for
data integration, and thus help the practitioner with the implementation of the
specific use case. To go further, we recommend exploring our open-access book [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
focusing on the different facets of the Big Data ecosystem.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Attard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orlandi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A systematic review of open government data initiatives</article-title>
          .
          <source>Government Information Quarterly</source>
          <volume>32</volume>
          (
          <issue>4</issue>
          ),
          <fpage>399</fpage>
          -
          <lpage>418</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Consortium</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Uniprot: a worldwide hub of protein knowledge</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>47</volume>
          (
          <issue>D1</issue>
          ),
          <fpage>D506</fpage>
          -
          <lpage>D515</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>MapReduce: simplified data processing on large clusters</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>51</volume>
          (
          <issue>1</issue>
          ),
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Google: GTFS (
          <year>2006</year>
          ), https://developers.google.com/transit/gtfs/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Genevès</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Layaïda</surname>
            ,
            <given-names>N.:</given-names>
          </string-name>
          <article-title>Smart trip alternatives for the curious</article-title>
          .
          <source>In: 15th International Semantic Web Conference (ISWC 2016 demo paper)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jachiet</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Genevès</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Layaïda</surname>
            ,
            <given-names>N.:</given-names>
          </string-name>
          <article-title>A multi-criteria experimental ranking of distributed SPARQL evaluators</article-title>
          .
          <source>In: 2018 IEEE International Conference on Big Data (Big Data)</source>
          . pp.
          <fpage>693</fpage>
          -
          <lpage>702</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Janev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabeen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sallinger</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Knowledge graphs and Big Data processing</article-title>
          . Springer Nature (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kubitza</surname>
            ,
            <given-names>D.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Böckmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Semangit: a linked dataset from git</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <fpage>215</fpage>
          -
          <lpage>228</lpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sejdiu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bühmann</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Westphal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stadler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          , et al.:
          <article-title>Distributed semantic analytics using the SANSA stack</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mami</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabeen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Squerall: Virtual ontology-based access to heterogeneous and large data sources</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <fpage>229</fpage>
          -
          <lpage>245</lpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mami</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabeen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Uniform access to multiform data lakes using semantic technologies</article-title>
          .
          <source>In: Proceedings of the 21st International Conference on Information Integration and Web-based Applications &amp; Services</source>
          . pp.
          <fpage>313</fpage>
          -
          <lpage>322</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Manola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McBride</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.:
          <article-title>RDF primer</article-title>
          .
          <source>W3C recommendation 10(1-107)</source>
          ,
          <volume>6</volume>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontchakov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zakharyaschev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ontology-based data access: A survey</article-title>
          .
          <source>International Joint Conferences on Artificial Intelligence</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>