=Paper= {{Paper |id=Vol-2971/paper12 |storemode=property |title=Towards an Integrated Solution for IoT Data Management |pdfUrl=https://ceur-ws.org/Vol-2971/paper12.pdf |volume=Vol-2971 |authors=Anderson Chaves |dblpUrl=https://dblp.org/rec/conf/vldb/Chaves21 }} ==Towards an Integrated Solution for IoT Data Management== https://ceur-ws.org/Vol-2971/paper12.pdf
         Towards an Integrated Solution for IoT Data Management
                                                                    Anderson Chaves
                                                                 Supervised by Fabio Porto
                                                                              LNCC, Brazil
                                                                            achaves@lncc.br
Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Denmark.
Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
The emergence of Big Data and the Internet of Things (IoT) is increasingly affecting all areas of
modern society, being characterized by a huge number of data streams that demand real-time
processing and analysis. The development of systems to assist in the management of these data
streams plays an important role for IoT applications. However, there are numerous challenges that
must be taken into account when building an efficient data system for handling large-scale,
dynamic, semi-structured data such as IoT data, and existing solutions only partially address the
requirements of these scenarios. In this PhD research, we summarize some of the main challenges
involved in building an efficient system for IoT data management and analysis, and how different
data management approaches such as Actor-oriented, Array and Active Databases fit together,
offering strong contributions to these requirements. We also examine the potential of performing
Machine Learning inference and handling Concept Drift in IoT as an integrated database process.
Through this work, we lay the structure for the development of a Database Management System to
support large-scale data stream based analysis capable of combining these different strategies.

1    INTRODUCTION
From smart home control systems to transportation, healthcare and industrial automation, the
Internet of Things has been enabling great benefits for both individuals and businesses, being
used for better decision making, planning and higher productivity [1]. The main characteristic
behind this IoT paradigm is the use of different technologies such as communication, embedded
systems and data analytics in order to create smart devices for intelligent monitoring, locating,
tracking and so forth [9, 18].
   The efficient management of sensor data from IoT devices is essential to perform IoT data
analysis. Through Complex Event Processing (CEP) methods, it is possible to detect anomalies and
meaningful events from data streams and perform real-time decision making. However, processing
and analyzing continuous data streams from heterogeneous networks still leads to a number of
different challenges, and requires the development of new techniques and strategies.
   A major challenge in an IoT environment is related to its large-scale data flows. Data in IoT
can originate from a very wide range of endpoints that generate masses of data, and is frequently
semi-structured or unstructured, which conforms to the Big Data paradigm [9]. Traditional DBMSs,
which need to store and index data before processing it, cannot fulfill the requirements of
timeliness and scalability of IoT data streams [10]. Besides, existing solutions for analysis and
visualization are often inefficient because of an incompatibility between the structure of the
source data and the analysis tool [7]. Finally, there are a number of privacy and security issues
as well as resource constraints such as memory, bandwidth and energy that must be taken into
account when building an IoT data management system.
   Another challenge in IoT is the necessity for on-line processing of data streams as opposed to
off-line analysis. Machine learning (ML) is one of the leading strategies to perform reliable,
efficient real-time analysis of IoT data in tasks such as prediction or anomaly detection [1].
However, the lack of integration between the ML application and the data system often limits
performance improvements, since optimizations such as query planning or lazy evaluation are not
possible when the two processes are treated as completely isolated tasks [8]. Additionally, when
dealing with dynamic stream data such as IoT data, the data distribution tends to change over
time, resulting in the phenomenon known as concept drift. It occurs when the statistical
properties of the target variable, which the model is trying to predict, change over time in
unforeseen ways [15]. When that happens, the patterns learned from past data may not be relevant
to the new data, leading to poor predictions and incorrect decisions. Machine Learning based
analysis needs to be able not only to detect the drift, but also to understand and react to it.
   We argue that data management systems demand efficient mechanisms to deal with large-scale,
heterogeneous IoT data. Recent work [25] has demonstrated that the programming model of
actor-oriented databases such as Orleans [5] and ReactDB [22], aimed specifically at concurrency
and inherent parallelism, is an adequate solution for systems focused on IoT data management.
Reactive behavior and CEP techniques are also essential for evaluating complex patterns over
high-throughput data streams such as those in IoT [13, 21]. Since a large part of the data made
available by IoT devices is multidimensional and spatio-temporal [9, 19], multidimensional array
data models could provide great advantages for its management [4]. However, managing several
different platforms instead of one makes the resulting solution unnecessarily complex and
potentially inefficient. To the best of our knowledge, no solution has yet been proposed to
combine all these approaches for IoT scenarios.
   Therefore, to address the challenges involved in the development of an adequate IoT solution,
we envision a Database Management System capable of offering scalable support for IoT data
management as well as analysis through Machine Learning. In this work, we present the following
contributions:


Feature group                 Features                                        Actor Oriented  Array      Active     Proposed
                                                                              Databases       Databases  Databases  Solution
Actor-Based Programming       Dynamic Scalability, Asynchronous primitives,   +               -          -          +
                              Encapsulation
Array Based Data Management   Array-Based Operations,                         -               +          -          +
                              Flexible Storage Format
Complex Event Handling        Event Detection, Reactive Behavior              -               -          +          +
Machine Learning Support      ML as first class operations,                   -               -          -          +
                              Concept Drift Handling

Table 1: Potential contributions from different models for IoT data management



      • We propose the development of a new Database Management System that offers CEP
        primitives through actor-based programming in order to perform rule-based monitoring
        for real-time scalable IoT scenarios.
      • We propose to further extend our solution to include ML inference as first class
        operators for CEP, enabling further integration between the data system and the
        Machine Learning tasks.
      • We propose to investigate the challenges involved in concept drift handling
        specifically in an IoT environment, and how to address these challenges in a data
        management system.

   The remainder of this paper is organized as follows. In Section 2 we present the base concepts
for the highlighted problems and proposed solutions. In Section 3 we present our idea of
leveraging array databases for a scalable, reactive and intelligent solution fit for IoT. We
conclude and present our research directions in Section 4.

2     RESEARCH CONTEXT
In this section, we introduce the base concepts of IoT data and challenges related to it.
Afterward, we present the different database models that serve as foundation to the proposed
solution. Finally, we describe the problem of Concept Drift in the IoT context.

2.1     IoT Big Data Challenges
According to [9], big data in IoT has three features that conform to the big data paradigm: (a) a
very big range of endpoints that generate masses of data; (b) semi-structured or unstructured
data; (c) it is only useful after being analyzed.
    Data generated by IoT usually has a high number of parallel sources, being subject to
inaccuracies and noise during acquisition and transmission. It can be streamed continuously or
accumulated as a source of big data. When dealing with big data analytics, it is possible to
produce insights several days after the data is generated, but in the case of streaming IoT
analytics, insights must be delivered within a few seconds at most. This real-time constraint
gives rise to the following challenges for IoT big data:
    Data Management: Data management is a big challenge to be addressed in order to realize the
full potential of IoT, and therefore has become a key research topic [17, 20]. Many IoT systems
are processor-intensive and require processing a massive amount of highly concurrently generated
data. How can the management of these data interactions be performed while ensuring low latency?
    Visualization: Visualization is important in big data analytics, especially for IoT systems
[18]. How can we perform visualization in the case of heterogeneous and diversely structured data
generated in IoT?
    Data Mining: The realization of the potential of IoT depends on being able to gain the
insights hidden in the vast and ever increasing available data. Current data mining approaches
don't scale well to IoT volumes. What characteristics are the most essential for a system fit
for such environments?
    Resource Constraints: In the IoT data stream model, a high volume of data is produced at high
speed. Therefore, algorithms that process it must do so under very strict constraints of space
and time. Addressing these constraints requires that a significant amount of data processing
happen on edge devices. How can we design algorithms that work efficiently in such environments?
    Security: Being able to deal with dynamic scaling while guaranteeing protection of data from
different entities is another significant challenge. What is the most effective way to ensure
access control and protection of data from large volumes of devices and, at the same time, ensure
the development of a dynamic and flexible application?

2.2    Data management solutions
2.2.1 Array Database Models. Most IoT environments consist of static or moving sensor devices
placed in specific locations that produce data continuously. Each data item has spatial
coordinates as well as an associated timestamp, resulting in a high temporal and spatial
correlation. Because of this multidimensional spatio-temporal nature of IoT data,
multidimensional array database models, built using arrays as the primary data representation,
offer advantages for efficient data management.
   Array databases were initially proposed to better represent sensor, image, simulation, and
statistics data of typically spatio-temporal dimensions [4]. They have special query languages
built upon array-based algebraic formalizations that model different kinds of operations such as
aggregations or subsetting. Cells in an array have an intrinsic ordering, making it easy to
quickly look up values by taking advantage of this ordering. Array indexes do not need to be
stored and can be inferred from the position of a cell, saving storage space. Arrays can also be
split into subarrays (called tiles or chunks)


that can be used as processing and storage units to help answer queries efficiently.
   Recently, research effort has been devoted to integrating ML tools and array DBMSs [24]. The
system Rasdaman [3] allows the implementation of machine learning algorithms through User Defined
Types and Functions that implement the underlying linear algebra operations directly over the
arrays. In the case of SciDB [23], users are provided with linear algebra operators that can be
used as building blocks to implement the ML algorithms. In SAVIME [11], users can perform
inference from machine learning models as part of the query expression, allowing the joint
optimization of the data preparation process and its input to the model.

2.2.2 Active Databases and Complex Event Processing. An event can be defined as an occurrence of
significance in a system [16]. Historically, many different initiatives have studied event
processing for different reasons. Active Databases were intended to extend traditional DBMSs by
enabling the specification of reactive behavior. The idea was to develop strategies to respond
automatically to events and changes in the database state through mechanisms formalized as ECA
rules [26]: if an event is detected, and any of the previously defined conditions becomes true,
then a corresponding action is taken without any external intervention.
   Complex Event Processing extends the logic behind ECA rules, being understood as a set of
techniques combined in order to perform real-time stream processing for monitoring and detection
of arbitrarily complex patterns in massive data streams [16]. These techniques are commonly used
in IoT environments to enable real-time or near real-time decisions [13]. In CEP, each data item
is abstracted as an event produced by a data source. A CEP engine combines multiple simpler
events to produce more complex ones that match previously defined patterns. It typically must
process multiple data streams from different sources in order to simultaneously track hundreds or
even thousands of different patterns through evaluation mechanisms such as non-deterministic
finite automata or tree-based plans [12].

2.2.3 Actor Oriented Databases. The actor programming model is a well-known model for distributed
and concurrent programming, in which the actor is the fundamental computing unit. Its main
principle is that in a system, the control flow and the data flow must be inseparable. Actors do
not share state and communicate via asynchronous messages. Because of these characteristics,
actors are a scalable solution to support the management of any number of independent and
heterogeneous streaming data sources.
   Recent work has demonstrated the effectiveness of integrating data management features such
as transactions and indexing into actor runtimes [6]. The authors of [25] demonstrate that this
solution is in fact very suitable for IoT data management. A similar approach has sought to
integrate actor primitives into relational databases [22] by extending the programmability of
stored procedures with actor objects, taking advantage of the state management features of
databases.

2.3          IoT Concept Drift
Concept drift can be formally defined as follows [15]: given a time period $[0, t]$, a set of
samples, denoted as $S_{0,t} = \{d_0, \ldots, d_t\}$, where $d_i = (X_i, y_i)$ is one observation
or data instance, $X_i$ is the feature vector, $y_i$ is the label, and $S_{0,t}$ follows a
certain distribution $F_{0,t}(X, y)$. Concept drift occurs at timestamp $t + 1$ if
$F_{0,t}(X, y) \neq F_{t+1,\infty}(X, y)$.
   Research on learning under concept drift presents three components beyond traditional training
and prediction: drift detection, drift understanding and drift adaptation. The first refers to
determining whether or not a concept drift occurs in a data stream. Drift understanding is
related to when, how and where it occurs. Finally, drift adaptation refers to reacting to the
existence of a drift.
   Recently, some works have been proposed to deal with concept drift specifically in IoT
platforms. For example, the work of [14] proposes an ensemble learning method based on offline
classifiers to address concept drift and data imbalance concurrently. In [2], an unsupervised
model-independent methodology is proposed to detect drifts in data generated from IoT devices.
In [27], a concept drift adaptive method for anomaly detection in IoT services is proposed that
considers the influence of time on changes in the sample distribution. However, this topic is not
yet fully explored and many research opportunities still exist.

3     LEVERAGE ARRAY DATABASES TO IOT COMPLEX EVENT PROCESSING
Historically, Database Management Systems have offered many benefits to data intensive
applications, such as transactions, indexing, query planning and declarative query languages. An
IoT data management solution must answer specific demands, such as encapsulation for isolating
state and access control, asynchronous primitives and dynamic scalability, since in many
scenarios sensing devices can instantly enter and leave a system. It should be able to detect and
react to predefined data patterns automatically, while providing quick data access and an
efficient integration with ML analysis. Table 1 highlights the strong contributions offered by
active, actor-oriented and array databases to each of these IoT demands.

[Figure 1: System Overview. Layers shown: Things Layer, Actors Layer, Analysis Layer. Component
labels: Sensor Devices, Stream Data, Query Processor (continuous), Working Storage, Concept Drift
Detector, Event Detector, Event Processor (Local), Event Processor (Global), Staging Data
Storage, Array Continuous Loader, Array Data Structures, Array Data Manager, Model Manager.]

   By taking our inspiration from the approaches of Orleans [5], which added data-management
functionality to a virtual actor runtime, and ReactDB [22], which integrates actor features into
a relational


database system, we investigate the potential of performing event                          Association for Computing Machinery, Washington, USA, 575–577.
detection and reactive behavior through actor-based primitives in                      [4] Peter Baumann, Dimitar Misev, Vlad Merticariu, and Bang Pham Huu. 2021.
                                                                                           Array databases: concepts, standards, implementations. Journal of Big Data 8, 1
an array database model. Figure 1 illustrates the proposed idea. At                        (2021), 1–61.
the things layer, data is collected from sensor devices and com-                       [5] Phil Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. 2014.
                                                                                           Orleans: Distributed virtual actors for programmability and scalability. MSR-TR-
municated to actor engines at the actor layer. Distributed actors                          2014–41 (2014).
manage these intermediate nodes that process and detect relevant                       [6] Philip A Bernstein, Mohammad Dashti, Tim Kiefer, and David Maier. 2017. In-
(local) events based on attached sensors before sending them to                            dexing in an Actor-Oriented Database.. In CIDR.
                                                                                       [7] Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, and Arie Shoshani. 2014.
the cloud based data center, along with relevant data in the form                          Parallel data analysis directly on scientific file formats. In Proceedings of the 2014
of array data structures. At the analysis layer, global queries and                        ACM SIGMOD international conference on Management of data. Association for
analysis that take into account alerts provided by actors can be                           Computing Machinery, Utah, USA, 385–396.
                                                                                       [8] Shaofeng Cai, Gang Chen, Beng Chin Ooi, and Jinyang Gao. 2019. Model slic-
made over the collected data. The intention is to provide a low                            ing for supporting complex analytics with elastic inference cost and resource
latency environment, in which there is a reduced communication                             constraints. Proceedings of the VLDB Endowment 13, 2 (2019), 86–99.
                                                                                       [9] Min Chen, Shiwen Mao, Yin Zhang, Victor CM Leung, et al. 2014. Big data: related
bottleneck.                                                                                technologies, challenges and future prospects. Vol. 96. Springer.
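
The layered flow described in Section 3, in which an actor receives sensor readings, evaluates rules locally and forwards only the relevant events toward the analysis layer, could be sketched roughly as follows. This is an illustrative sketch only, not the system proposed in the paper: the names `Reading`, `EcaRule`, `SensorActor` and `simulate`, the threshold rule and the use of Python's `asyncio` queues as actor mailboxes are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Reading:
    sensor_id: str
    timestamp: int
    value: float


@dataclass
class EcaRule:
    """Event-Condition-Action rule: on each reading (event), if the
    condition holds, the actor emits an alert (action)."""
    name: str
    condition: Callable[[Reading], bool]


class SensorActor:
    """Hypothetical actor: keeps private state and communicates only
    through asynchronous messages placed in its mailbox."""

    def __init__(self, sensor_id: str, rules: List[EcaRule], alerts: asyncio.Queue):
        self.sensor_id = sensor_id
        self.rules = rules
        self.alerts = alerts            # channel toward the analysis layer
        self.mailbox = asyncio.Queue()  # incoming messages
        self.buffer: List[Reading] = [] # local state, never shared

    async def run(self) -> None:
        while True:
            msg = await self.mailbox.get()
            if msg is None:             # sentinel: stop the actor
                break
            self.buffer.append(msg)
            for rule in self.rules:
                if rule.condition(msg):
                    await self.alerts.put((rule.name, msg))


async def simulate(values: List[float], threshold: float = 30.0) -> Tuple[list, int]:
    """Feed readings to one actor and collect the local events it detects."""
    alerts: asyncio.Queue = asyncio.Queue()
    rule = EcaRule("overheat", lambda r: r.value > threshold)
    actor = SensorActor("s1", [rule], alerts)
    task = asyncio.create_task(actor.run())
    for t, v in enumerate(values):
        await actor.mailbox.put(Reading("s1", t, v))
    await actor.mailbox.put(None)
    await task
    detected = []
    while not alerts.empty():
        detected.append(alerts.get_nowait())
    return detected, len(actor.buffer)
```

Calling `asyncio.run(simulate([21.0, 35.5, 29.9, 40.2]))` returns the alerts raised by the hypothetical "overheat" rule together with the number of readings the actor buffered. In a real deployment the alert queue would be replaced by a network channel to the data center, and the buffered readings would be shipped as array data structures.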
   The integration of ML-based analytics as part of the Data Man-                     [10] Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information:
agement System may lead to powerful optimization opportunities                             From data stream to complex event processing. ACM Computing Surveys (CSUR)
                                                                                           44, 3 (2012), 1–62.
since different parts of the ML process may be treated as operators                   [11] Anderson Chaves da Silva, Hermano Lourenço Souza Lustosa, Daniel Nasci-
of the query plan. To cope with the growing need for ML support                            mento Ramos da Silva, Fábio André Machado Porto, and Patrick Valduriez. 2020.
in IoT data systems, we aim to provide both a local and a global                           SAVIME: An Array DBMS for Simulation Analysis and ML Models Prediction.
                                                                                           Journal of Information and Data Management 11, 3 (2020).
event detector that supports ML inference from trained models as                      [12] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and
first class operators.                                                                     Minos Garofalakis. 2020. Complex event recognition in the big data era: a survey.
   In IoT environments, communicated data from devices is usu-                             The VLDB Journal 29, 1 (2020), 313–352.
                                                                                      [13] Ilya Kolchinsky and Assaf Schuster. 2019. Real-time multi-pattern detection over
ally collected and recorded by assuming a temporal relationship                            event streams. In Proceedings of the 2019 International Conference on Management
between records. As time goes on, concept drift is bound to occur,                         of Data. 589–606.
                                                                                      [14] Chun-Cheng Lin, Der-Jiunn Deng, Chin-Hung Kuo, and Linnan Chen. 2019.
which may cause an accuracy drop to any methods that rely on                               Concept drift detection and adaption in big imbalance industrial IoT data using
long-term statistical data attributes. The proposed solution will                          an ensemble learning method of offline classifiers. IEEE Access 7 (2019), 56198–
rely on a central drift detector that is able to determine if and                          56207.
                                                                                      [15] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. 2018.
when the drift occurred as well as the best reaction to it based on                        Learning under concept drift: A review. IEEE Transactions on Knowledge and
the local drift detectors.                                                                 Data Engineering 31, 12 (2018), 2346–2363.
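
To make the idea of a central detector that combines local drift detectors more concrete, the sketch below compares a frozen reference window with a sliding recent window on each stream, and lets a central component decide by quorum. The paper does not specify a detection method, so everything here is an assumption chosen for illustration: the class names `LocalDriftDetector` and `CentralDriftDetector`, the window size, the mean-shift test, the threshold and the quorum rule.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, List, Optional, Tuple


class LocalDriftDetector:
    """Hypothetical local detector: flags drift when the mean of the most
    recent window moves away from the mean of a frozen reference window
    by more than `threshold` reference standard deviations."""

    def __init__(self, window: int = 20, threshold: float = 3.0):
        self.window = window
        self.threshold = threshold
        self.reference: List[float] = []
        self.recent: deque = deque(maxlen=window)
        self.seen = 0
        self.drift_at: Optional[int] = None  # sample count at first detection

    def update(self, value: float) -> bool:
        self.seen += 1
        if len(self.reference) < self.window:
            self.reference.append(value)     # still building the reference
            return False
        self.recent.append(value)
        if self.drift_at is None and len(self.recent) == self.window:
            spread = pstdev(self.reference) or 1e-9  # avoid division by zero
            shift = abs(mean(self.recent) - mean(self.reference)) / spread
            if shift > self.threshold:
                self.drift_at = self.seen
        return self.drift_at is not None


class CentralDriftDetector:
    """Hypothetical central detector: declares a drift when at least a
    `quorum` fraction of local detectors report one, and dates it at the
    earliest local detection."""

    def __init__(self, quorum: float = 0.5):
        self.quorum = quorum

    def decide(self, detectors: List[LocalDriftDetector]) -> Tuple[bool, Optional[int]]:
        flagged = [d.drift_at for d in detectors if d.drift_at is not None]
        drifted = bool(detectors) and len(flagged) / len(detectors) >= self.quorum
        return drifted, (min(flagged) if drifted else None)


def feed(detector: LocalDriftDetector, values: Iterable[float]) -> LocalDriftDetector:
    """Push a whole stream through one local detector."""
    for v in values:
        detector.update(v)
    return detector
```

A stream whose values stay around the same level never triggers the local test, while a stream whose level jumps is flagged a few samples after the change. How to react once a drift is confirmed (for example retraining or swapping the model) is left open here, as it is in the text.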
                                                                                      [16] David C. Luckham. 2001. The Power of Events: An Introduction to Complex Event
                                                                                           Processing in Distributed Enterprise Systems. Addison-Wesley Longman Publishing
4    CONCLUSION AND RESEARCH DIRECTION                                                     Co., Inc., USA.
In this paper, we discuss characteristics and challenges of IoT data                  [17] Meng Ma, Ping Wang, and Chao-Hsien Chu. 2013. Data management for internet
                                                                                           of things: Challenges, approaches and opportunities. In 2013 IEEE International
management and summarize potential contributions from differ-                              conference on green computing and communications and IEEE Internet of Things
ent strategies in addressing each of them. Our goal is to build an                         and IEEE cyber, physical and social computing. IEEE, 1144–1151.
                                                                                      [18] Mohsen Marjani, Fariza Nasaruddin, Abdullah Gani, Ahmad Karim, Ibrahim
efficient, in-memory data management system that combines each                             Abaker Targio Hashem, Aisha Siddiqa, and Ibrar Yaqoob. 2017. Big IoT data
of these different contributions into a single integrated solution,                        analytics: architecture, opportunities, and open research challenges. IEEE Access
while offering robust support for data analysis through Machine                            5 (2017), 5247–5261.
                                                                                      [19] Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. 2018.
Learning. As the next step in our study, we aim to focus on the                            Deep learning for IoT big data and streaming analytics: A survey. IEEE Commu-
design refinement and implementation of a prototype system as                              nications Surveys & Tutorials 20, 4 (2018), 2923–2960.
a foundation for our subsequent investigations. To evaluate the                       [20] John Paparrizos, Chunwei Liu, Bruno Barbarioli, Johnny Hwang, Ikraduya Edian,
                                                                                           Aaron J Elmore, Michael J Franklin, and Sanjay Krishnan. 2021. VergeDB: A
viability of our approach, we intend to apply it to a real use-case                        Database for IoT Analytics on Edge Devices. In CIDR.
scenario that presents the IoT characteristics and challenges                         [21] José Roldán, Juan Boubeta-Puig, José Luis Martínez, and Guadalupe Ortiz. 2020.
                                                                                           Integrating complex event processing and machine learning: An intelligent
described. We also intend to perform comparative experiments with                          architecture for detecting IoT security attacks. Expert Systems with Applications
state-of-the-art big data frameworks in order to demonstrate the                           149 (2020), 113251.
optimization opportunities that we envision.                                          [22] Vivek Shah and Marcos Antonio Vaz Salles. 2018. Reactors: A case for predictable,
                                                                                           virtualized actor database systems. In Proceedings of the 2018 International Con-
                                                                                           ference on Management of Data. 259–274.
5    ACKNOWLEDGEMENT                                                                  [23] Michael Stonebraker, Paul Brown, Donghui Zhang, and Jacek Becla. 2013. SciDB:
                                                                                           A database management system for applications with complex analytics. Com-
We would like to thank CAPES for its scholarships, and Petrobras                           puting in Science & Engineering 15, 3 (2013), 54–62.
for financing this work through the Gypscie project.                                  [24] Sebastian Villarroya and Peter Baumann. 2020. On the Integration of Machine
                                                                                           Learning and Array Databases. In 2020 IEEE 36th International Conference on Data
                                                                                           Engineering (ICDE). IEEE, 1786–1789.
REFERENCES                                                                            [25] Yiwen Wang, Julio Cesar Dos Reis, Kasper Myrtue Borggren, Marcos Antonio Vaz
 [1] Furqan Alam, Rashid Mehmood, Iyad Katib, and Aiiad Albeshri. 2016. Analysis           Salles, Claudia Bauzer Medeiros, and Yongluan Zhou. 2019. Modeling and
     of eight data mining algorithms for smarter Internet of Things (IoT). Procedia        Building IoT Data Platforms with Actor-Oriented Databases. In EDBT. 512–523.
     Computer Science 98 (2016), 437–442.                                             [26] Jennifer Widom and Stefano Ceri. 1996. Active database systems: Triggers and
 [2] Mohsen Asghari, Daniel Sierra-Sosa, Michael Telahun, Anup Kumar, and Adel S           rules for advanced database processing. Morgan Kaufmann.
     Elmaghraby. 2020. Aggregate density-based concept drift identification for dy-   [27] Rongbin Xu, Yongliang Cheng, Zhiqiang Liu, Ying Xie, and Yun Yang. 2020.
     namic sensor data models. Neural Computing and Applications (2020), 1–13.             Improved Long Short-Term Memory based anomaly detection with concept drift
 [3] Peter Baumann, Andreas Dehmel, Paula Furtado, Roland Ritsch, and Norbert              adaptive method for supporting IoT services. Future Generation Computer Systems
     Widmann. 1998. The multidimensional database system RasDaMan. In Proceed-             (2020).
     ings of the 1998 ACM SIGMOD international conference on Management of data.