Big Data Management Challenges in SUPERSEDE

Sergi Nadal, Alberto Abelló, Oscar Romero, Jovan Varga
Universitat Politècnica de Catalunya, BarcelonaTech
Barcelona, Spain
{snadal,aabello,oromero,jvarga}@essi.upc.edu

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION

The H2020 SUPERSEDE (www.supersede.eu) project aims to support decision-making in the evolution and adaptation of software services and applications by exploiting end-user feedback and runtime data, with the overall goal of improving the end users' quality of experience (QoE). QoE is defined as the overall performance of a system from the users' point of view, and its assessment must consider both the feedback and the runtime data gathered. End-user feedback is extracted from online forums, app stores, social networks and novel direct feedback channels, which connect the users of software applications and services to developers. Runtime data is primarily gathered by monitoring environmental sensors, infrastructures and usage logs. Hereafter, we discuss our solutions for the main data management challenges in SUPERSEDE.

2. CHALLENGES

2.1 Big Data Governance

One well-known problem of NoSQL repositories is the lack of semantics caused by their schemaless nature. Without a schema, the system cannot know which data is stored or how the stored items interrelate. Data analysts are thus burdened with data management tasks, such as understanding the specific structure of the data and parsing it, before they can write their analytical pipelines. In SUPERSEDE this is even more challenging, as the project aims at performing integrated analysis over multiple, evolving and heterogeneous data sources, a challenge that current Big Data technologies fail to address.

Big Data ecosystems demand complex metadata governance processes spanning all data management phases, from ingestion to analysis [2]. Semantic Web technologies have proven to be a valid asset for this purpose. The Resource Description Framework (RDF) allows concepts and their relationships to be flexibly defined in the form of a semantic graph. Furthermore, it can leverage the Linked Data initiative to (a) reuse existing vocabularies, (b) make data self-descriptive, and (c) publish such data to facilitate on-the-fly data crossing [1]. In SUPERSEDE, an integration-oriented RDF graph is used to represent and integrate the data related to monitoring and user feedback, as well as to cross it with contextual data from the use cases. The analytical processes that support decision making are also represented on top of these concepts.

2.2 Big Data Architectures

The λ-architecture [3] is currently the most widespread reference architecture for scalable and fault-tolerant Big Data processing. While it succeeds at managing very large amounts of data (i.e., in the Batch layer) as well as near-real-time data streams (i.e., in the Speed layer), it has two main drawbacks. First, it completely overlooks semantics, as discussed before, because it uses NoSQL technologies as its baseline components. Second, it is vaguely defined, which hinders its instantiation.

(i) Refining the λ-architecture, by defining its components as well as their interconnections, would facilitate its instantiation and allow a simpler deployment of SUPERSEDE's Big Data ecosystem. (ii) To accommodate the requirements on governance, metadata should be considered a first-class citizen throughout the data management processes.

3. PARTICIPATION BENEFITS

Our objective is twofold. Firstly, we aim at presenting our approach to tackling the previously described challenges. Secondly, by leading a round table, we aim at discussing the pros and cons of this and other solutions pursued by other researchers in similar settings.

4. ACKNOWLEDGEMENTS

This work has been partly supported by the SUPERSEDE project, funded by the European Union's Information and Communication Technologies Programme (H2020) under grant agreement number 644018.

5. REFERENCES

[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
[2] E. Kandogan, M. Roth, P. M. Schwarz, J. Hui, I. Terrizzano, C. Christodoulakis, and R. J. Miller. LabBook: Metadata-driven Social Collaborative Data Analysis. In IEEE Big Data, 2015.
[3] N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning, 1st edition, 2015.
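
As an illustration of the integration-oriented RDF graph discussed in Section 2.1, the following minimal sketch models a graph as a set of (subject, predicate, object) triples in plain Python. It is not the SUPERSEDE implementation: the namespace, the resource names (m1, f1, VideoPlayer) and the predicates (type, about, responseTimeMs, text) are all hypothetical and chosen only to show how monitoring data and user feedback can be crossed through a shared concept.

```python
from typing import Optional, Set, Tuple, List

Triple = Tuple[str, str, str]

# Hypothetical namespace; not a vocabulary used by SUPERSEDE.
EX = "http://example.org/supersede-sketch#"


def ex(name: str) -> str:
    """Build an IRI in the hypothetical example namespace."""
    return EX + name


# A tiny graph: one monitoring event and one feedback item,
# both linked to the same (hypothetical) application component.
graph: Set[Triple] = {
    (ex("m1"), ex("type"), ex("MonitoringEvent")),
    (ex("m1"), ex("about"), ex("VideoPlayer")),
    (ex("m1"), ex("responseTimeMs"), "840"),
    (ex("f1"), ex("type"), ex("FeedbackItem")),
    (ex("f1"), ex("about"), ex("VideoPlayer")),
    (ex("f1"), ex("text"), "Playback stutters"),
}


def match(g: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in g
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def subjects_about(g: Set[Triple], resource: str) -> List[str]:
    """All subjects linked to `resource` through the `about` predicate."""
    return sorted(t[0] for t in match(g, p=ex("about"), o=resource))


def types_of(g: Set[Triple], subject: str) -> List[str]:
    """All types declared for `subject`."""
    return sorted(t[2] for t in match(g, s=subject, p=ex("type")))


if __name__ == "__main__":
    # Cross monitoring data and feedback via the shared concept.
    for subject in subjects_about(graph, ex("VideoPlayer")):
        print(subject, types_of(graph, subject))
```

Because both sources describe their data with the same graph vocabulary, a single pattern query retrieves the monitoring event and the feedback item together, without the analyst having to parse two source-specific formats first. A real deployment would presumably rely on an RDF store and a query language such as SPARQL rather than an in-memory set.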