Big DataBase Management System
Alberto Abelló¹, Sergi Nadal¹

¹ Universitat Politècnica de Catalunya-BarcelonaTech


Abstract

A Big Data system is a tiny fraction of analytical code surrounded by a lot of “plumbing” devoted to managing the generated models and the associated data. Hence, we can consider that plumbing to be mimicking a DBMS, which is indeed a complex system that has to serve different purposes and hence provide multiple, independent functionalities. Thus, it can neither be studied nor built monolithically as an atomic unit. Rather, different inter-dependent software components interact in different ways to achieve the global purpose. As in a DBMS, in a Big Data system we have to understand, among other issues: how our system is going to collect data; how these data are going to be used; where they are going to be stored; how they are going to be related to the corresponding metadata; whether we are going to use any kind of master data and, if so, where these will come from and how they will be integrated; how the data are going to be processed; how replicas are going to be managed and their consistency guaranteed; etc. In this paper, we briefly discuss the difficulties of building such a system, paying special attention to how metadata can help storage and processing.

                                             Keywords
                                             Big Data architecture, DBMS



Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29–April 1, 2022), Edinburgh, UK
aabello@essi.upc.edu (A. Abelló); snadal@essi.upc.edu (S. Nadal)
https://www.essi.upc.edu/dtim/people/alberto (A. Abelló); https://www.essi.upc.edu/dtim/people/snadal (S. Nadal)
ORCID: 0000-0002-3223-2186 (A. Abelló); 0000-0002-8565-952X (S. Nadal)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Functionalities of a DBMS

In order to study the functional components required to manage data, it is essential to first understand the required functionalities. A database management system (DBMS) is a software system that provides the functionalities to manage large, shared and persistent data collections, while ensuring reliability and privacy. Out of the many functionalities provided by a DBMS, we highlight: Storage, Modeling, Ingestion, Processing, and Querying/Fetching.

Nowadays, a new kind of data-intensive system that gathers and analyses all kinds of data has emerged, bringing new challenges for data management and analytics. These are today referred to as Big Data systems. Thus, we can see a Big Data Management System as a DBMS that has to provide the previously highlighted functionalities adapted to the new scenarios posed by Big Data [1]. The development of a cohesive and integrated Big Data system is, however, a challenging task that requires understanding how the required functionalities can be performed by different, independent components. Hence, it is crucial to understand and establish how they interact.

We highlight some of the challenges such new systems face. On the one hand, the “Velocity” dimension of Big Data identifies the need to manage and process data streams which are generated at a very fast pace. Nevertheless, this does not mean everything must be done in real time. Indeed, some execution flows do not feel such pressure and do not require such low latency. Hence, it is the case that in many applications both batch and stream processing have to coexist in the same architecture. This, however, makes the architecture more complex in terms of number of components, also hindering their communication and data sharing, as well as the consistency of independent processing branches. On the other hand, the “Variety” dimension refers to the complexity of providing an on-demand integrated view over a heterogeneous and evolving set of data sources such that it conceptualizes the domain at hand. An example of this is BigBench [2], which defines a benchmark representative of real use cases of Big Data. We can observe that the main differences with traditional data-intensive applications are (i) the presence of external sources and (ii) the relevance of non-structured data (i.e., not typed and not tabular, which today represents the majority of data being generated). It is clear, thus, that to make use of rapidly generated, external and non-structured information sources, we need specific and specialized architectural components that interact with many others to transform such complex data into actionable knowledge.

2. Big Data Architectures

The previously identified challenges require a complete reconsideration of the classical DBMS architecture and components, which date back to the 70s. Yet, the approach so far adopted by the data management community has been that of developing components addressing each of the required functionalities (i.e., Storage, Modeling, Ingestion, Processing, and Querying/Fetching) as efficiently
as possible. Indeed, current technological stacks for the management and processing of data-intensive tasks are composed of independent components (commonly those in the NoSQL family) that generally work in isolation and are orchestrated together to map to what would be equivalent to the different functionalities of a DBMS. In complex applications, where many of these tools need to interact, it is definitely not wise to do so arbitrarily. Indeed, such an erroneous approach yields what has been coined a “pipeline jungle” [3]. The alternative is, thus, to reconsider some of the architectural patterns they implement.

2.1. New storage patterns

Big Data analysis is an exploratory task; thus, integrating and structuring the data a priori entails a large overhead. Alternatively, the Data Lake approach promotes maintaining raw data as they are in the sources, as a collection of independent files. Then, once data scientists require some of these data for a concrete purpose, the task of integrating, cleaning and structuring them into the right format and schema for the problem at hand is performed. This is referred to as the “Load-first, Model-later” approach.

The risk with this approach is that files can simply be massively accumulated without any order, resulting in what is called a “Data Swamp”, where just finding the relevant data would be a challenge. The solution for this is to create an organization of the files and semantically annotate them in a metadata catalog conceptualizing the domain (e.g., implemented via graph-based formalisms). Thus, to the already existing mappings in the catalog, we should add links from each file to such a graph containing the relevant concepts for our business. In this way, users would be able to perform guided searches over the metadata instead of blindly navigating the files. If properly done, a semantic approach can even facilitate the automation of integration and queries [4].

2.2. New processing patterns

Descriptive analytics studies how the business performs at different levels of granularity (e.g., regions, cities or districts), and how it evolves over time. Timeliness of data is usually not an issue for long-term trends, and days or even weeks are acceptable for the current data to be processed and made ready for the analysis. Oppositely, predictive analytics aims to foresee how a given entity (e.g., a customer) is going to behave in the near future. Obviously, since the purpose of a prediction is to react, or at least be ready to take some action, data freshness and response time are typically crucial in this case.

Consequently, since the time requirements are contradictory, we have to distinguish both processing flows, giving rise to what is known as the 𝜆-architecture. This consists of two execution branches fed from the same sources. One focuses on batch processing, and the other on stream processing. Nevertheless, maintaining such potentially redundant flows generates some management risks. For the prediction to be accurate, the newly arriving tuples have to go through the same transformation and preparation tasks as the training data (otherwise, the validity of the prediction would be compromised). The 𝜆-architecture evolved into the 𝜅-architecture, a simplification with a single execution engine (hence a single implementation of the transformations).¹ The batch processing is replaced by playing the data through the streaming system quickly. If for any reason we require different versions of the transformations for different predictive models, we can keep all of them in the same system and choose the most appropriate one at every moment, independently of whether it is for training or production.

¹ https://www.oreilly.com/radar/questioning-the-lambda-architecture

3. Conclusion

The current problem in Big Data is not how to make a more accurate predictive model, but how to manage the data needed for its training. The difficulty is amplified by having independent components that need to interact without a solid backbone that removes the burden of their connectivity from the shoulders of developers. If we pay attention to either Velocity or Variety, we can conclude that metadata is crucial for such a backbone and for the governance of Big Data. However, the current state of the art has not reached the required level of maturity to give the view of a single homogeneous system coordinated through those metadata instead of ad-hoc scripts or specifically programmed APIs and connectors.

References

[1] P. Jovanovic, S. Nadal, O. Romero, A. Abelló, B. Bilalli, Quarry: A user-centered big data integration platform, Inf. Syst. Frontiers 23 (2021) 9–33.
[2] T. Rabl, M. Frank, M. Danisch, H. Jacobsen, B. Gowda, The vision of BigBench 2.0, in: Proceedings of the 4th Workshop on Data Analytics in the Cloud (DanaC), ACM, 2015, pp. 3:1–3:4.
[3] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, D. Dennison, Hidden technical debt in machine learning systems, in: Advances in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015.
[4] S. Nadal, O. Romero, A. Abelló, P. Vassiliadis, S. Vansummeren, An integration-oriented ontology to govern evolution in big data ecosystems, Inf. Syst. 79 (2019) 3–19.
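As a purely illustrative companion to Section 2.1, the following sketch shows the shape a graph-based metadata catalog could take: files in the lake are linked to domain concepts, concepts are related to each other in a small graph, and a guided search retrieves files by concept (optionally following related concepts) instead of navigating paths blindly. This is not the authors' implementation; all names (`DataCatalog`, `annotate`, `find_files`, the example paths) are hypothetical, and a real catalog would rely on an RDF or property-graph store with proper mappings.

```python
# Hypothetical sketch of a graph-based metadata catalog for a Data Lake.
# A production catalog would use an RDF/property-graph store; this only
# illustrates the "guided search over metadata" idea from Section 2.1.

class DataCatalog:
    def __init__(self):
        self.concept_edges = {}  # concept -> set of related concepts
        self.file_links = {}     # concept -> set of annotated file paths

    def relate(self, concept, related):
        """Add an undirected edge between two domain concepts."""
        self.concept_edges.setdefault(concept, set()).add(related)
        self.concept_edges.setdefault(related, set()).add(concept)

    def annotate(self, path, concept):
        """Link a raw file to the domain concept it describes."""
        self.file_links.setdefault(concept, set()).add(path)

    def find_files(self, concept, hops=1):
        """Guided search: files annotated with `concept` or with any
        concept reachable within `hops` edges of the domain graph."""
        frontier, seen = {concept}, {concept}
        for _ in range(hops):
            frontier = {n for c in frontier
                        for n in self.concept_edges.get(c, set())} - seen
            seen |= frontier
        return sorted(p for c in seen
                      for p in self.file_links.get(c, set()))


catalog = DataCatalog()
catalog.relate("Customer", "Order")
catalog.annotate("s3://lake/raw/crm_dump.json", "Customer")
catalog.annotate("s3://lake/raw/sales_2022.csv", "Order")

# Searching for "Customer" also surfaces Order files one hop away.
print(catalog.find_files("Customer"))
# → ['s3://lake/raw/crm_dump.json', 's3://lake/raw/sales_2022.csv']
```

Searching by concept rather than by file name is what turns a Data Swamp back into a navigable lake: the graph, not the directory layout, carries the business semantics.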