Towards an Integrated Solution for IoT Data Management

Anderson Chaves
Supervised by Fabio Porto
LNCC, Brazil
achaves@lncc.br

Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Denmark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

The emergence of Big Data and the Internet of Things (IoT) is increasingly affecting all areas of modern society, and is characterized by a huge number of data streams that demand real-time processing and analysis. The development of systems to assist in the management of these data streams plays an important role for IoT applications. However, there are numerous challenges that must be taken into account when building an efficient data system for handling large-scale, dynamic, semi-structured data such as IoT data, and existing solutions only partially address the requirements of these scenarios. In this PhD research, we summarize some of the main challenges involved in building an efficient system for IoT data management and analysis, and show how different data management approaches, namely actor-oriented, array and active databases, fit together and each contribute to these requirements. We also examine the potential of performing Machine Learning inference and handling concept drift in IoT as an integrated database process. Through this work, we lay the groundwork for the development of a Database Management System that supports large-scale, data-stream-based analysis and is capable of combining these different strategies.

1 INTRODUCTION

From smart home control systems to transportation, healthcare and industrial automation, the Internet of Things has been enabling great benefits for both individuals and businesses, being used for better decision making, planning and higher productivity [1]. The main characteristic behind this IoT paradigm is the exploitation of different technologies such as communication, embedded systems and data analytics in order to create smart devices for intelligent monitoring, locating, tracking and so forth [9, 18].

The efficient management of sensor data from IoT devices is essential to perform IoT data analysis. Through Complex Event Processing (CEP) methods, it is possible to detect anomalies and meaningful events in data streams and perform real-time decision making. However, processing and analyzing continuous data streams from heterogeneous networks still poses a number of different challenges, and requires the development of new techniques and strategies.

A major challenge in an IoT environment is related to its large-scale data flows. Data in IoT can originate from a very wide range of endpoints that generate masses of data, and is frequently semi-structured or unstructured, which makes it conform to the Big Data paradigm [9]. Traditional DBMSs, which need to store and index data before processing it, cannot fulfill the timeliness and scalability requirements of IoT data streams [10]. Besides, existing solutions for analysis and visualization are often inefficient, because they suffer from an incompatibility between the structure of the source data and the analysis tool [7]. Finally, there are a number of privacy and security issues, as well as resource constraints such as memory, bandwidth and energy, that must be taken into account when building an IoT data management system.

Another challenge in IoT is the necessity for on-line processing of data streams as opposed to off-line analysis. Machine learning (ML) is one of the leading strategies to perform reliable, efficient real-time analysis of IoT data in tasks such as prediction or anomaly detection [1]. However, the lack of integration between the ML application and the data system is often a barrier to performance improvements, since optimizations such as query planning or lazy evaluation are not possible when the two processes are treated as completely isolated tasks [8]. Additionally, when dealing with dynamic stream data such as IoT data, the data distribution tends to change over time, resulting in the phenomenon known as concept drift. It occurs when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways [15]. When that happens, the patterns learned from past data may not be relevant to the new data, leading to poor predictions and incorrect decisions. Machine Learning based analysis needs to be able not only to detect the drift, but also to understand and react to it.

We argue that data management systems demand efficient mechanisms to deal with large-scale, heterogeneous IoT data. A recent work [25] has demonstrated that the programming model of actor-oriented databases such as Orleans [5] and ReactDB [22], which is aimed specifically at concurrency and inherent parallelism, is an adequate solution for systems focused on IoT data management. Reactive behavior and CEP techniques are also essential for evaluating complex patterns over high-throughput data streams such as those found in IoT [13, 21]. Since a large part of the data made available by IoT devices is multidimensional and spatio-temporal [9, 19], multidimensional array data models could provide great advantages to its management [4]. However, managing several different platforms instead of one makes the resulting solution unnecessarily complex and potentially inefficient. To the best of our knowledge, no existing solution has yet been proposed that combines all these approaches for IoT scenarios.

Therefore, to address the challenges involved in the development of an adequate IoT solution, we envision a Database Management System capable of offering scalable support for IoT data management as well as analysis through Machine Learning. In this work, we present the following contributions:

• We propose the development of a new Database Management System that offers CEP primitives through actor-based programming in order to perform rule-based monitoring for real-time, scalable IoT scenarios.
• We propose to further extend our solution to include ML inference as first class operators for CEP, enabling further integration between the data system and the Machine Learning tasks.
• We propose to investigate the challenges involved in concept drift handling specifically in an IoT environment, and how to address these challenges in a data management system.

System                       Features                       Actor-Oriented  Array      Active     Proposed
                                                            Databases       Databases  Databases  Solution
Actor-Based Programming      Dynamic Scalability,           +               -          -          +
                             Asynchronous Primitives,
                             Encapsulation
Array-Based Data Management  Array-Based Operations,        -               +          -          +
                             Flexible Storage Format
Complex Event Handling       Event Detection,               -               -          +          +
                             Reactive Behavior
Machine Learning Support     ML as First Class Operations,  -               -          -          +
                             Concept Drift Handling

Table 1: Potential contributions from different models for IoT data management

The remainder of this paper is organized as follows. In Section 2 we present the base concepts for the highlighted problems and proposed solutions. In Section 3 we present our idea of leveraging array databases to build a scalable, reactive and intelligent solution fit for IoT. We conclude and present our research directions in Section 4.

2 RESEARCH CONTEXT

In this section, we introduce the base concepts of IoT data and the challenges related to it. Afterward, we present the different database models that serve as the foundation of the proposed solution. Finally, we describe the problem of concept drift in the IoT context.

2.1 IoT Big Data Challenges

According to [9], big data in IoT has three features that conform to the big data paradigm: (a) a very wide range of endpoints that generate masses of data; (b) semi-structured or unstructured data; (c) data that is only useful after being analyzed.

Data generated by IoT usually comes from a high number of parallel sources and is subject to inaccuracies and noise during acquisition and transmission. It can be streamed continuously or accumulated as a source of big data. In big data analytics, it is acceptable to produce insights several days after the data is generated, but in streaming IoT analytics, insights must be delivered within a few seconds at most. This real-time constraint gives rise to the following challenges for IoT big data:

Data Management: Data management is a big challenge to be addressed in order to realize the full potential of IoT, and therefore has become a key research topic [17, 20]. Many IoT systems are processor-intensive and require processing a massive amount of highly concurrently generated data. How can these data interactions be managed while ensuring low latency?

Visualization: Visualization is important in big data analytics, especially for IoT systems [18]. How can we perform visualization in the case of the heterogeneous and diversely structured data generated in IoT?

Data Mining: Realizing the potential of IoT depends on being able to extract the insights hidden in the vast and ever-increasing available data. Current data mining approaches do not scale well to IoT volumes. What characteristics are the most essential for a system suited to such environments?

Resource Constraints: In the IoT data stream model, a high volume of data is produced at high speed. Therefore, algorithms that process it must do so under very strict constraints of space and time. Addressing these constraints requires that a significant amount of data processing happen on edge devices. How can we design algorithms that work efficiently in such environments?

Security: Being able to deal with dynamic scaling while guaranteeing protection of data from different entities is another significant challenge. What is the most effective way to ensure access control and protection of data from large volumes of devices and, at the same time, ensure the development of a dynamic and flexible application?

2.2 Data management solutions

2.2.1 Array Database Models. Most IoT environments consist of static or moving sensor devices placed in specific locations that produce data continuously. Each data item has associated space coordinates as well as a timestamp, resulting in a high correlation in time and space. Because of this multidimensional spatio-temporal nature of IoT data, multidimensional array database models, built using arrays as the primary data representation, offer advantages for efficient data management.

Array databases were initially proposed to better represent sensor, image, simulation, and statistics data of typically spatio-temporal dimensions [4]. They have special query languages built upon array-based algebraic formalizations that model different kinds of operations such as aggregations or subsetting. Cells in an array have an intrinsic ordering, which makes it easy to look up values quickly. Array indexes do not need to be stored and can be inferred from the position of a cell, saving storage space. Arrays can also be split into subarrays (called tiles or chunks) that can be used as processing and storage units to help answer queries efficiently.

Recently, some research effort has been devoted to integrating ML tools and array DBMSs [24]. The system Rasdaman [3] allows the implementation of machine learning algorithms through User Defined Types and Functions that implement the underlying linear algebra operations directly over the arrays. In the case of SciDB [23], users are provided with linear algebra operators that can be used as building blocks to implement the ML algorithms. In SAVIME [11], users can perform inference from machine learning models as part of the query expression, allowing the joint optimization of the data preparation process and its input to the model.

2.2.2 Active Databases and Complex Event Processing. An event can be defined as an occurrence of significance in a system [16]. Historically, many different initiatives have studied event processing for different reasons. Active Databases were intended to extend traditional DBMSs by enabling the specification of reactive behavior. The idea was to develop strategies to respond automatically to events and changes in the database state through mechanisms formalized as ECA (Event-Condition-Action) rules [26]: if an event is detected, and any of the previously defined conditions becomes true, then a corresponding action is taken without any external intervention.

Complex Event Processing extends the logic behind ECA rules, being understood as a set of techniques combined in order to perform real-time stream processing for monitoring and detection of arbitrarily complex patterns in massive data streams [16]. These techniques are commonly used in IoT environments to enable real-time or near real-time decisions [13]. In CEP, each data item is abstracted as an event produced by a data source. A CEP engine combines multiple simpler events to produce more complex ones that match previously defined patterns. It typically must process multiple data streams from different sources in order to track hundreds or even thousands of different patterns simultaneously, through evaluation mechanisms such as non-deterministic finite automata or tree-based plans [12].

2.2.3 Actor Oriented Databases. The actor programming model is a well-known model for distributed and concurrent programming, in which the actor is the fundamental computing unit. Its main principle is that, in a system, the control flow and the data flow must be inseparable. Actors do not share state and communicate via asynchronous messages. Because of these characteristics, actors are a scalable solution to support the management of any number of independent and heterogeneous streaming data sources. Recent works have demonstrated the effectiveness of integrating data management features such as transactions and indexing into actor runtimes [6]. The authors of [25] demonstrate that this solution is in fact very suitable for IoT data management. A similar approach has sought to integrate actor primitives into relational databases [22] by extending the programmability of stored procedures with actor objects, taking advantage of the database's state management features.

2.3 IoT Concept Drift

Concept drift can be formally defined as follows [15]: given a time period $[0, t]$, a set of samples, denoted as $S_{0,t} = \{d_0, \ldots, d_t\}$, where $d_i = (X_i, y_i)$ is one observation or data instance, $X_i$ is the feature vector, $y_i$ is the label, and $S_{0,t}$ follows a certain distribution $F_{0,t}(X, y)$. Concept drift occurs at timestamp $t + 1$ if $F_{0,t}(X, y) \neq F_{t+1,\infty}(X, y)$.

Research on learning under concept drift presents three components beyond traditional training and prediction: drift detection, drift understanding and drift adaptation. The first refers to determining whether or not a concept drift occurs in a data stream. Drift understanding is related to when, how and where it occurs. Finally, drift adaptation refers to reacting to the existence of a drift.

Recently, some works have been proposed to deal with concept drift specifically in IoT platforms. For example, the work of [14] proposes an ensemble learning method based on offline classifiers to address concept drift and data imbalance concurrently. In [2], the authors propose an unsupervised, model-independent methodology to detect drifts in data generated from IoT devices. In [27], the authors propose a concept drift adaptive method for anomaly detection in IoT services that considers the influence of time on changes in the sample distribution. However, this topic is not yet fully explored and many research opportunities still exist.

3 LEVERAGING ARRAY DATABASES FOR IOT COMPLEX EVENT PROCESSING

Historically, Database Management Systems have offered many benefits to data-intensive applications, such as transactions, indexing, query planning and declarative query languages. An IoT data management solution must answer specific demands, such as encapsulation for isolating state and access control, asynchronous primitives and dynamic scalability, since in many scenarios sensing devices can enter and leave a system at any moment. It should be able to detect and react to predefined data patterns automatically, while providing quick data access and efficient integration with ML analysis. Table 1 highlights the contributions offered by active, actor-oriented and array databases to each of these IoT demands.

[Figure 1: System Overview. The architecture is organized into three layers: Things Layer, Actors Layer and Analysis Layer.]

Taking inspiration from the approaches of Orleans [5], which added data-management functionality to a virtual actor runtime, and ReactDB [22], which integrates actor features into a relational database system, we investigate the potential of performing event detection and reactive behavior through actor-based primitives in an array database model. Figure 1 illustrates the proposed idea. At the things layer, data is collected from sensor devices and communicated to actor engines at the actor layer. Distributed actors manage these intermediate nodes, which process and detect relevant (local) events based on attached sensors before sending them to the cloud-based data center, along with relevant data in the form of array data structures. At the analysis layer, global queries and analyses that take into account alerts provided by actors can be run over the collected data. The intention is to provide a low-latency environment in which the communication bottleneck is reduced.

The integration of ML-based analytics as part of the Data Management System may lead to powerful optimization opportunities, since different parts of the ML process may be treated as operators of the query plan. To cope with the growing need for ML support in IoT data systems, we aim to provide both a local and a global event detector that support ML inference from trained models as first class operators.

In IoT environments, data communicated from devices is usually collected and recorded assuming a temporal relationship between records. As time goes on, concept drift is bound to occur, which may cause an accuracy drop in any method that relies on long-term statistical data attributes. The proposed solution will rely on a central drift detector that is able to determine if and when a drift occurred, as well as the best reaction to it, based on the local drift detectors.

4 CONCLUSION AND RESEARCH DIRECTION

In this paper, we discuss characteristics and challenges of IoT data management and summarize potential contributions from different strategies in addressing each of them. Our goal is to build an efficient, in-memory data management system that combines each of these different contributions into a single integrated solution, while offering robust support for data analysis through Machine Learning. As the next step in our study, we aim to focus on the design refinement and implementation of a prototype system as a foundation for our subsequent investigations. To evaluate the viability of our approach, we intend to apply it to a real use-case scenario that presents the IoT characteristics and challenges described. We also intend to perform comparative experiments with state-of-the-art big data frameworks in order to demonstrate the optimization opportunities that we envision.

5 ACKNOWLEDGEMENT

We would like to thank CAPES for its scholarships, and Petrobras for financing this work through the Gypscie project.

REFERENCES

[1] Furqan Alam, Rashid Mehmood, Iyad Katib, and Aiiad Albeshri. 2016. Analysis of eight data mining algorithms for smarter Internet of Things (IoT). Procedia Computer Science 98 (2016), 437–442.
[2] Mohsen Asghari, Daniel Sierra-Sosa, Michael Telahun, Anup Kumar, and Adel S Elmaghraby. 2020. Aggregate density-based concept drift identification for dynamic sensor data models. Neural Computing and Applications (2020), 1–13.
[3] Peter Baumann, Andreas Dehmel, Paula Furtado, Roland Ritsch, and Norbert Widmann. 1998. The multidimensional database system RasDaMan. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data. Association for Computing Machinery, Washington, USA, 575–577.
[4] Peter Baumann, Dimitar Misev, Vlad Merticariu, and Bang Pham Huu. 2021. Array databases: concepts, standards, implementations. Journal of Big Data 8, 1 (2021), 1–61.
[5] Phil Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. 2014. Orleans: Distributed virtual actors for programmability and scalability. MSR-TR-2014–41 (2014).
[6] Philip A Bernstein, Mohammad Dashti, Tim Kiefer, and David Maier. 2017. Indexing in an Actor-Oriented Database. In CIDR.
[7] Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, and Arie Shoshani. 2014. Parallel data analysis directly on scientific file formats. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. Association for Computing Machinery, Utah, USA, 385–396.
[8] Shaofeng Cai, Gang Chen, Beng Chin Ooi, and Jinyang Gao. 2019. Model slicing for supporting complex analytics with elastic inference cost and resource constraints. Proceedings of the VLDB Endowment 13, 2 (2019), 86–99.
[9] Min Chen, Shiwen Mao, Yin Zhang, Victor CM Leung, et al. 2014. Big data: related technologies, challenges and future prospects. Vol. 96. Springer.
[10] Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR) 44, 3 (2012), 1–62.
[11] Anderson Chaves da Silva, Hermano Lourenço Souza Lustosa, Daniel Nascimento Ramos da Silva, Fábio André Machado Porto, and Patrick Valduriez. 2020. SAVIME: An Array DBMS for Simulation Analysis and ML Models Prediction. Journal of Information and Data Management 11, 3 (2020).
[12] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and Minos Garofalakis. 2020. Complex event recognition in the big data era: a survey. The VLDB Journal 29, 1 (2020), 313–352.
[13] Ilya Kolchinsky and Assaf Schuster. 2019. Real-time multi-pattern detection over event streams. In Proceedings of the 2019 International Conference on Management of Data. 589–606.
[14] Chun-Cheng Lin, Der-Jiunn Deng, Chin-Hung Kuo, and Linnan Chen. 2019. Concept drift detection and adaption in big imbalance industrial IoT data using an ensemble learning method of offline classifiers. IEEE Access 7 (2019), 56198–56207.
[15] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. 2018. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering 31, 12 (2018), 2346–2363.
[16] David C. Luckham. 2001. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley Longman Publishing Co., Inc., USA.
[17] Meng Ma, Ping Wang, and Chao-Hsien Chu. 2013. Data management for internet of things: Challenges, approaches and opportunities. In 2013 IEEE International conference on green computing and communications and IEEE Internet of Things and IEEE cyber, physical and social computing. IEEE, 1144–1151.
[18] Mohsen Marjani, Fariza Nasaruddin, Abdullah Gani, Ahmad Karim, Ibrahim Abaker Targio Hashem, Aisha Siddiqa, and Ibrar Yaqoob. 2017. Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5 (2017), 5247–5261.
[19] Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. 2018. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials 20, 4 (2018), 2923–2960.
[20] John Paparrizos, Chunwei Liu, Bruno Barbarioli, Johnny Hwang, Ikraduya Edian, Aaron J Elmore, Michael J Franklin, and Sanjay Krishnan. 2021. VergeDB: A Database for IoT Analytics on Edge Devices. In CIDR.
[21] José Roldán, Juan Boubeta-Puig, José Luis Martínez, and Guadalupe Ortiz. 2020. Integrating complex event processing and machine learning: An intelligent architecture for detecting IoT security attacks. Expert Systems with Applications 149 (2020), 113251.
[22] Vivek Shah and Marcos Antonio Vaz Salles. 2018. Reactors: A case for predictable, virtualized actor database systems. In Proceedings of the 2018 International Conference on Management of Data. 259–274.
[23] Michael Stonebraker, Paul Brown, Donghui Zhang, and Jacek Becla. 2013. SciDB: A database management system for applications with complex analytics. Computing in Science & Engineering 15, 3 (2013), 54–62.
[24] Sebastian Villarroya and Peter Baumann. 2020. On the Integration of Machine Learning and Array Databases. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1786–1789.
[25] Yiwen Wang, Julio Cesar Dos Reis, Kasper Myrtue Borggren, Marcos Antonio Vaz Salles, Claudia Bauzer Medeiros, and Yongluan Zhou. 2019. Modeling and Building IoT Data Platforms with Actor-Oriented Databases. In EDBT. 512–523.
[26] Jennifer Widom and Stefano Ceri. 1996. Active database systems: Triggers and rules for advanced database processing. Morgan Kaufmann.
[27] Rongbin Xu, Yongliang Cheng, Zhiqiang Liu, Ying Xie, and Yun Yang. 2020. Improved Long Short-Term Memory based anomaly detection with concept drift adaptive method for supporting IoT services. Future Generation Computer Systems (2020).
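
As an illustration of the remark in Section 2.2.1 that array indexes need not be stored because they follow from a cell's position, the Python sketch below maps between cell coordinates and a linear storage offset, and finds the chunk that holds a cell. It assumes dense, row-major storage and regularly shaped chunks. The function names and the (time, latitude, longitude) example are hypothetical and are not taken from any of the cited array DBMSs.

```python
from typing import Sequence, Tuple


def linear_offset(coords: Sequence[int], shape: Sequence[int]) -> int:
    """Row-major position of a cell; no per-cell index needs to be stored."""
    offset = 0
    for c, size in zip(coords, shape):
        if not 0 <= c < size:
            raise IndexError("coordinate outside the array bounds")
        offset = offset * size + c
    return offset


def coords_from_offset(offset: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Inverse mapping: recover the coordinates from the stored position."""
    coords = []
    for size in reversed(shape):
        coords.append(offset % size)
        offset //= size
    return tuple(reversed(coords))


def chunk_of(coords: Sequence[int], chunk_shape: Sequence[int]) -> Tuple[int, ...]:
    """Identifier of the tile (chunk) that contains the given cell."""
    return tuple(c // size for c, size in zip(coords, chunk_shape))


# Hypothetical sensor array: 4 time steps x 3 latitudes x 5 longitudes.
shape = (4, 3, 5)
cell = (2, 1, 3)
position = linear_offset(cell, shape)
recovered = coords_from_offset(position, shape)
chunk = chunk_of(cell, chunk_shape=(2, 3, 5))
```

Here `recovered` equals `cell`, which is the sense in which coordinates are implied by position, and `chunk` identifies the storage unit a query engine would need to read for that cell.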
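
The ECA mechanism described in Section 2.2.2 can be pictured with a minimal in-memory rule engine: when an event arrives, each rule registered for that event type evaluates its condition and, if it holds, runs its action with no external intervention. The classes `EcaRule` and `RuleEngine`, the dictionary layout of an event and the temperature threshold are all assumptions made for illustration; they do not reflect the API of any system cited in this paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class EcaRule:
    event_type: str                      # E: which kind of event triggers the rule
    condition: Callable[[Event], bool]   # C: predicate evaluated on the event
    action: Callable[[Event], None]      # A: reaction executed when C holds


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[EcaRule] = []

    def register(self, rule: EcaRule) -> None:
        self.rules.append(rule)

    def emit(self, event: Event) -> int:
        """Dispatch one event and return how many rules fired."""
        fired = 0
        for rule in self.rules:
            if rule.event_type == event.get("type") and rule.condition(event):
                rule.action(event)
                fired += 1
        return fired


# Hypothetical rule: record an alert when a temperature reading exceeds 80.0.
alerts: List[Event] = []
engine = RuleEngine()
engine.register(EcaRule("temperature", lambda e: e["value"] > 80.0, alerts.append))

engine.emit({"type": "temperature", "sensor": "s1", "value": 72.5})
engine.emit({"type": "temperature", "sensor": "s1", "value": 91.0})
engine.emit({"type": "humidity", "sensor": "s2", "value": 95.0})
```

Only the second event satisfies both the event type and the condition, so `alerts` ends up with a single entry. A CEP engine goes further than this sketch by matching patterns across several events rather than reacting to each one in isolation.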
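
As a rough illustration of the drift detection step in Section 2.3, the sketch below compares the empirical distribution of a reference window with that of a recent window using the two-sample Kolmogorov-Smirnov statistic, and flags a drift when the gap exceeds a threshold. This is a deliberate simplification and not the method proposed in this paper: it looks at a single numeric feature rather than the joint distribution $F(X, y)$, and the window contents and the 0.5 threshold are arbitrary assumptions.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], recent: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of the two samples."""
    ref, rec = sorted(reference), sorted(recent)
    gap = 0.0
    # The maximum gap is always reached at one of the observed values.
    for value in ref + rec:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_rec = bisect_right(rec, value) / len(rec)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap


def drift_detected(reference: Sequence[float], recent: Sequence[float],
                   threshold: float = 0.5) -> bool:
    """Flag a drift when the two windows look too different (assumed threshold)."""
    return ks_statistic(reference, recent) > threshold


# Hypothetical temperature windows from one sensor.
reference_window = [20.1, 20.4, 19.8, 20.0, 20.3]
stable_window = [20.2, 19.9, 20.1, 20.35, 20.05]
shifted_window = [25.2, 25.6, 24.9, 25.1, 25.4]

stable_flag = drift_detected(reference_window, stable_window)
shifted_flag = drift_detected(reference_window, shifted_window)
```

With these windows the stable case is not flagged and the shifted case is. In the architecture of Section 3, a check of this kind would correspond to a local detector whose output feeds the central drift detector; drift understanding and adaptation are outside the scope of the sketch.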