Hints to Save Time when Dealing with Big Data

Damien Graux
Inria, Université Côte d’Azur, CNRS, I3S, France
damien.graux@inria.fr

Abstract. Considering the increasing number of available systems, paradigms and tools related to Big Data challenges, this keynote aims at providing hints and good practices to avoid the common time-consuming pitfalls of the domain.

During the last decade, the availability of large datasets has enabled the design and exploration of novel scenarios that leverage both openly accessible and private datasets for gaining competitive advantages. For example, Web users nowadays have access to general knowledge through the Wikidata endpoint [13], to public transport schedules with the GTFS format [4], to source code repositories [8], to proteins [2], to medical data1, to governments’ records [1], etc. This availability has therefore opened the door to more advanced and complex analytic scenarios where multiple sources are combined in order to build new blocks of knowledge, for instance touristic tours relying on geo-data, bus schedules and reviews from previous tourists [5]. These new scenarios have practically led to the design of new paradigms where intermediate data structures are used to align on common ground the useful pieces of data coming from different heterogeneous sources2.

Consequently, with this profusion of data sources and, more generally, of available data, new paradigms were designed to cope with the large amounts of information; this is for instance the case of the MapReduce model [3] and the associated Apache Hadoop3 or Apache Spark4, which deal, practically, with Big Data processing tasks when clusters of nodes have to be used because data is distributed. By nature, the Big Data landscape is cross-domain and the available tools and systems are numerous (with some specifically created for particular use-cases and datasets).
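As a toy illustration of the MapReduce model mentioned above, a word count can be expressed as a map phase emitting key/value pairs, a shuffle grouping pairs by key, and a reduce phase aggregating each group. This is a minimal single-process Python sketch of the programming model only, not of how Hadoop or Spark actually distribute work over a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    # Emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the values of each key; here, summing the counts.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data tools", "big data systems", "data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'tools': 1, 'systems': 1}
```

Because map and reduce operate on independent keys, a framework can run both phases in parallel across machines; only the shuffle requires moving data between nodes, which is precisely why it dominates the cost of real distributed jobs.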
That is why the design of solutions for a particular problem in the Big Data context is challenging from different aspects: one needs to know which tool to select, how to structure and combine the data, and where to find the missing information to complete the task, while keeping in mind that the solution might come from a different community facing an analogous problem. In this keynote, we provide several hints to avoid the common traps when having to deal with Big Data challenges.

1 Health datasets available on: https://data.world/datasets/health
2 See for example the RDF data model [12] often used in ontology-based data access solutions [14] to virtualise the combined data.
3 https://hadoop.apache.org/
4 https://spark.apache.org/

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Fig. 1. Data distribution landscape.
Fig. 2. Relative ranking of 10 systems.
Fig. 3. Data integration classification.
Fig. 4. Big Data ecosystem in 2021 according to mattturck.com.

Data Distribution Landscape. First, it is important to know where the considered datasets are located in the data distribution landscape. Indeed, datasets might come from several sources for a use-case linking them together, see the right-hand side of Figure 1. In parallel, each source could either be on a single-node architecture or rely on a cluster of machines in charge of distributing the data and (possibly) the computations, see the left-hand side of Figure 1. Figuring out where the current use-case is located will help in deciding on the working paradigms and, more practically, on the systems to be used.

Taking the use-case into consideration. To build an efficient solution, it is also crucial to be use-case driven from the beginning. Typically, in a distributed context, one needs to know, for instance, the type of Big Data the user is dealing with, i.e.
does the data fit in the memory of a single node, does it fit in the aggregated memory of the cluster, or is it larger than the sum of the memories of all nodes? Depending on the context, the practitioner will need to select the “best” system(s) available. Typically, it is important to choose from the beginning the performance indicators or metrics that will be used to evaluate and rank the various potential solutions and systems which could be used to achieve the use-case. Practically, relying on state-of-the-art benchmarks, surveys and comparative evaluations is often helpful; however, most of the time, not all the metrics that should be reviewed are considered at once by a single study. For instance, to select a SPARQL evaluator, Graux et al. compared several solutions in the light of different general use-cases and chose the relevant set of metrics for each [6]. They ended up with visual Kiviat charts, as depicted in Figure 2, to guide their choice of the “best” system.

Data integration classification. Similarly to the data distribution landscape, it is also relevant to decide on the integration paradigm. As presented in Figure 3, there are mainly four situations, depending on whether the datasets are structurally homogeneous or not and on how they are distributed. For instance, if there are several data sources having different data structures (e.g. relational tables, graphs, documents, etc.), the data integration will have to rely on wrappers to make the intermediate results compatible. More generally, it is worth noticing that Semantic Web technologies and the OBDA approach are good candidates for integrating heterogeneous sources, see e.g. Squerall [10,11] or SANSA [9].

The community effect. Finally, a glance at Figure 4 gives an insight into the complexity of finding and selecting useful tools for a dedicated use case.
Indeed, the Big Data (& AI) ecosystem listed by Matt Turck shows that there exist several distinct tools to achieve one task, see for instance the number of storage solutions in the top-left corner of Figure 4. As a consequence, the safest move is usually to select a tool based on the liveliness of its community and not exclusively because of its advertised features and performance. Typically, such a criterion can be checked using different indicators, to name a few: checking the response time of the main contributors to open issues, glancing at the release agenda, reading the documentation, asking for advice.

Summary. In a nutshell, when facing Big Data challenges, to save time from the very beginning, it is advised to take the following actions:
1. Check the situation of the needed datasets in the data distribution landscape;
2. Select the tool based on the final use-case, not strictly on performance, and design for that purpose a suitable set of metrics to evaluate the solution;
3. Gain awareness of and decide on the data integration paradigm to be used;
4. Select the tool based on the liveliness of its community.
Following these rules will significantly simplify the selection of paradigms for data integration, and thus help the practitioner with the specific use-case implementation. To go further, we recommend exploring our open access book [7] focusing on the different facets of the Big Data ecosystem.

References
1. Attard, J., Orlandi, F., Scerri, S., Auer, S.: A systematic review of open government data initiatives. Government Information Quarterly 32(4), 399–418 (2015)
2. Consortium, U.: UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research 47(D1), D506–D515 (2019)
3. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
4. Google: GTFS (2006), https://developers.google.com/transit/gtfs/
5. Graux, D., Geneves, P., Layaïda, N.: Smart trip alternatives for the curious.
In: 15th International Semantic Web Conference (ISWC 2016 demo paper) (2016)
6. Graux, D., Jachiet, L., Geneves, P., Layaïda, N.: A multi-criteria experimental ranking of distributed SPARQL evaluators. In: 2018 IEEE International Conference on Big Data (Big Data). pp. 693–702. IEEE (2018)
7. Janev, V., Graux, D., Jabeen, H., Sallinger, E.: Knowledge Graphs and Big Data Processing. Springer Nature (2020)
8. Kubitza, D.O., Böckmann, M., Graux, D.: SemanGit: a linked dataset from Git. In: International Semantic Web Conference. pp. 215–228. Springer (2019)
9. Lehmann, J., Sejdiu, G., Bühmann, L., Westphal, P., Stadler, C., Ermilov, I., Bin, S., Chakraborty, N., Saleem, M., Ngomo, A.C.N., et al.: Distributed semantic analytics using the SANSA stack. In: International Semantic Web Conference. pp. 147–155. Springer (2017)
10. Mami, M.N., Graux, D., Scerri, S., Jabeen, H., Auer, S., Lehmann, J.: Squerall: Virtual ontology-based access to heterogeneous and large data sources. In: International Semantic Web Conference. pp. 229–245. Springer (2019)
11. Mami, M.N., Graux, D., Scerri, S., Jabeen, H., Auer, S., Lehmann, J.: Uniform access to multiform data lakes using semantic technologies. In: Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services. pp. 313–322 (2019)
12. Manola, F., Miller, E., McBride, B., et al.: RDF Primer. W3C Recommendation 10(1-107), 6 (2004)
13. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)
14. Xiao, G., Calvanese, D., Kontchakov, R., Lembo, D., Poggi, A., Rosati, R., Zakharyaschev, M.: Ontology-based data access: A survey. International Joint Conferences on Artificial Intelligence (2018)