Big Data Management Challenges in SUPERSEDE

Sergi Nadal, Alberto Abelló, Oscar Romero, Jovan Varga
Universitat Politècnica de Catalunya, BarcelonaTech
Barcelona, Spain
{snadal,aabello,oromero,jvarga}@essi.upc.edu


1.    INTRODUCTION
The H2020 SUPERSEDE (www.supersede.eu) project aims to support decision-making in the evolution and adaptation of software services and applications by exploiting end-user feedback and runtime data, with the overall goal of improving the end-users’ quality of experience (QoE). QoE is defined as the overall performance of a system from the point of view of its users, and assessing it must consider both the feedback and the runtime data gathered. End-user feedback is extracted from online forums, app stores, social networks and novel direct feedback channels, which connect users of software applications and services to developers. Runtime data is primarily gathered by monitoring environmental sensors, infrastructures and usage logs. Hereafter, we discuss our solutions for the main data management challenges in SUPERSEDE.

2.    CHALLENGES

2.1    Big Data Governance
One well-known problem of NOSQL repositories is the lack of semantics caused by their schemaless nature. This lack of schema prevents the system from knowing which data is stored and how it interrelates. Thus, data analysts are burdened with data management tasks, such as understanding the specific structure of the data and parsing it, before writing their analytical pipelines. In SUPERSEDE, this becomes more challenging, as the project aims at performing integrated analysis over multiple, evolving and heterogeneous data sources, a challenge that current Big Data technologies fail to address.

Big Data ecosystems demand complex metadata governance processes spanning all data management phases, from ingestion to analysis [2]. Semantic Web technologies have proven to be a valid asset for this purpose. The Resource Description Framework (RDF) allows concepts and their relationships to be flexibly defined in the form of a semantic graph. Furthermore, it can leverage the Linked Data initiative to (a) reuse existing vocabularies, (b) make data self-descriptive, and (c) publish such data to facilitate on-the-fly data crossing [1]. In SUPERSEDE, an integration-oriented RDF graph is used to represent and integrate the data related to monitoring and user feedback, as well as to cross it with contextual data from the use cases. The analytical processes that support decision making are also represented on top of these concepts.

2.2    Big Data Architectures
The λ-architecture [3] is currently the most widespread reference architecture for scalable and fault-tolerant Big Data processing. While it succeeds at managing massive amounts of data (in the Batch layer) as well as near-real-time data streams (in the Speed layer), it has two main drawbacks. First, it completely overlooks semantics, as discussed above, since it uses NOSQL technologies as its baseline components. Second, it is vaguely defined, which hinders its instantiation.

(i) Refining the λ-architecture, by defining its components as well as their interconnections, would facilitate its instantiation and allow a simpler deployment of SUPERSEDE’s Big Data ecosystem. (ii) To accommodate the requirements on governance, metadata should be considered a first-class citizen throughout the data management processes.

3.    PARTICIPATION BENEFITS
Our objective is twofold. First, we aim at presenting our approach to tackling the previously described challenges. Second, by leading a round table, we aim at discussing the pros and cons of this and other solutions pursued by other researchers in similar settings.

4.    ACKNOWLEDGEMENTS
This work has been partly supported by the SUPERSEDE project, funded by the European Union’s Information and Communication Technologies Programme (H2020) under grant agreement number 644018.

5.    REFERENCES
[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
[2] E. Kandogan, M. Roth, P. M. Schwarz, J. Hui, I. Terrizzano, C. Christodoulakis, and R. J. Miller. LabBook: Metadata-driven Social Collaborative Data Analysis. In IEEE Big Data, 2015.
[3] N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning, 1st edition, 2015.

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
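
Illustrative sketch. The integration-oriented RDF graph described in Section 2.1 can be pictured with a few subject-predicate-object triples. The Python sketch below is not the SUPERSEDE implementation: it uses a tiny in-memory triple set instead of a real RDF store, and every identifier in it (the `ex:`, `src:`, `obs:`, `fb:` and `feature:` names, the `Graph` class, and the sample values) is a hypothetical assumption made only for illustration. It shows how schema-level metadata and instance data could live in one graph, so that monitoring data and user feedback about the same feature can be crossed without parsing each source separately.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class Graph:
    """Minimal in-memory triple set; a stand-in for a real RDF store."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self.triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


g = Graph()

# Schema-level metadata (hypothetical): which source provides which concept.
g.add("src:usageLog", "ex:provides", "ex:ResponseTime")
g.add("src:appStore", "ex:provides", "ex:UserRating")

# Instance-level data (hypothetical): one monitoring observation and one
# piece of user feedback, both about the same application feature.
g.add("obs:1", "rdf:type", "ex:ResponseTime")
g.add("obs:1", "ex:about", "feature:search")
g.add("obs:1", "ex:value", "950")
g.add("fb:7", "rdf:type", "ex:UserRating")
g.add("fb:7", "ex:about", "feature:search")
g.add("fb:7", "ex:value", "2")


def crossed(graph: Graph, feature: str) -> dict:
    """Collect, per concept, the values recorded about one feature."""
    result = {}
    for subject, _, _ in graph.match(p="ex:about", o=feature):
        concept = graph.match(s=subject, p="rdf:type")[0][2]
        value = graph.match(s=subject, p="ex:value")[0][2]
        result[concept] = value
    return result


def sources_for(graph: Graph, concept: str) -> List[str]:
    """Use the metadata triples to find which sources provide a concept."""
    return sorted(s for s, _, _ in graph.match(p="ex:provides", o=concept))
```

Calling `crossed(g, "feature:search")` gathers the runtime and feedback values attached to the same feature, while `sources_for(g, "ex:ResponseTime")` answers a governance question (where does this data come from?) from the same graph. In a real deployment, an RDF store queried with SPARQL and reused Linked Data vocabularies would take the place of this toy class.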