=Paper=
{{Paper
|id=Vol-2971/paper12
|storemode=property
|title=Towards an Integrated Solution for IoT Data Management
|pdfUrl=https://ceur-ws.org/Vol-2971/paper12.pdf
|volume=Vol-2971
|authors=Anderson Chaves
|dblpUrl=https://dblp.org/rec/conf/vldb/Chaves21
}}
==Towards an Integrated Solution for IoT Data Management==
Anderson Chaves
Supervised by Fabio Porto
LNCC, Brazil
achaves@lncc.br
Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Denmark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
ABSTRACT
The emergence of Big Data and the Internet of Things (IoT) is increasingly affecting all areas of modern society, being characterized by a huge number of data streams that demand real-time processing and analysis. The development of systems to assist in the management of these data streams plays an important role for IoT applications. However, there are numerous challenges that must be taken into account when building an efficient data system for handling large-scale, dynamic, semi-structured data such as IoT data, and existing solutions only partially address the requirements of these scenarios. In this PhD research, we summarize some of the main challenges involved in building an efficient system for IoT data management and analysis, and how different data management approaches such as Actor-Oriented, Array and Active Databases fit together, offering strong contributions to these requirements. We also examine the potential of performing Machine Learning inference and handling Concept Drift in IoT as an integrated database process. Through this work, we lay the structure for the development of a Database Management System to support large-scale data-stream-based analysis capable of combining these different strategies.
1 INTRODUCTION
From smart home control systems to transportation, healthcare and industrial automation, the Internet of Things has been enabling great benefits both for individuals and businesses, being used for better decision making, planning and higher productivity [1]. The main characteristic behind this IoT paradigm is the exploration of different technologies such as communication, embedded systems and data analytics in order to create smart devices for intelligent monitoring, locating, tracking and so forth [9, 18].
The efficient management of sensor data from IoT devices is essential to perform IoT data analysis. Through Complex Event Processing (CEP) methods, it is possible to detect anomalies and meaningful events from data streams and perform real-time decision making. However, processing and analyzing continuous data streams from heterogeneous networks still leads to a number of different challenges, and requires the development of new techniques and strategies.
A major challenge in an IoT environment is related to its large-scale data flows. Data in IoT can have its sources in a very big range of endpoints that generate masses of data, and is frequently semi-structured or unstructured, conforming to the Big Data paradigm [9]. Traditional DBMSs, which need to store and index data before processing it, cannot fulfill the requirements of timeliness and scalability of IoT data streams [10]. Besides, in order to perform analysis and visualization, existing solutions are often inefficient, because they incur an incompatibility between the structure of the source data and the analysis tool [7]. Finally, there are a number of privacy and security issues as well as resource constraints such as memory, bandwidth and energy that must be taken into account when building an IoT data management system.
Another challenge in IoT is the necessity for on-line processing of data streams as opposed to off-line analysis. Machine learning (ML) is one of the leading strategies to perform reliable, efficient real-time analysis of IoT data in tasks such as prediction or anomaly detection [1]. However, the lack of integration between the ML application and the data system is often a restraint to performance improvements, since optimizations such as query planning or lazy evaluation are not possible when the two processes are treated as completely isolated tasks [8]. Additionally, when dealing with dynamic stream data such as IoT, the nature of the data distribution tends to change over time, resulting in the phenomenon known as concept drift. It occurs when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways [15]. When that happens, the learned patterns of past data may not be relevant to the new data, leading to poor predictions and incorrect decisions. Machine Learning based analysis needs to be able not only to detect the drift, but also to understand and react to it.
We argue that data management systems demand efficient mechanisms to deal with large-scale, heterogeneous IoT data. A recent work [25] has demonstrated that the programming model of actor-oriented databases such as Orleans [5] and ReactDB [22], aimed specifically at concurrency and inherent parallelism, is an adequate solution for systems focused on IoT data management. Reactive behavior and CEP techniques are also essential for evaluating complex patterns over high-throughput data streams such as IoT [13, 21]. Since a large part of the data made available by IoT devices is multidimensional and spatio-temporal [9, 19], multidimensional array data models could provide great advantages to its management [4]. However, managing several different platforms instead of one makes the resulting solution unnecessarily complex and potentially inefficient. To the best of our knowledge, no existing solution has yet been proposed to combine all these approaches for IoT scenarios.
Therefore, to address the challenges involved in the development of an adequate IoT solution, we envision a Database Management System capable of offering scalable support for IoT data management as well as analysis through Machine Learning. In this work, we present the following contributions:
• We propose the development of a new Database Management System that offers CEP primitives through actor-based programming in order to perform rule-based monitoring for real-time scalable IoT scenarios.
• We propose to further extend our solution to include ML inference as first class operators for CEP, enabling further integration between the data system and the Machine Learning tasks.
• We propose to investigate the challenges involved in concept drift handling specifically in an IoT environment, and how to address these challenges in a data management system.
The remainder of this paper is organized as follows. In Section 2 we present the base concepts for the highlighted problems and proposed solutions. In Section 3 we present our idea of leveraging array databases to a scalable, reactive and intelligent solution fit for IoT. We conclude and present our research directions in Section 4.
{| class="wikitable"
|+ Table 1: Potential contributions from different models for IoT data management
! colspan="2" | System Features !! Actor Oriented Databases !! Array Databases !! Active Databases !! Proposed Solution
|-
| Actor-Based Programming || Dynamic Scalability; Asynchronous primitives; Encapsulation || + || - || - || +
|-
| Array Based Data Management || Array-Based Operations; Flexible Storage Format || - || + || - || +
|-
| Complex Event Handling || Event Detection; Reactive Behavior || - || - || + || +
|-
| Machine Learning Support || ML as first class operations; Concept Drift Handling || - || - || - || +
|}
2 RESEARCH CONTEXT
In this section, we introduce the base concepts of IoT data and challenges related to it. Afterward, we present the different database models that serve as foundation to the proposed solution. Finally, we describe the problem of Concept Drift in the IoT context.
2.1 IoT Big Data Challenges
According to [9], big data in IoT has three features that conform to the big data paradigm: (a) a very big range of endpoints that generate masses of data; (b) semi-structured or unstructured data; (c) it is only useful after being analyzed.
Data generated by IoT usually has a high number of parallel sources, being subject to inaccuracies and noise during acquisition and transmission. It can be streamed continuously or accumulated as a source of big data. When dealing with big data analytics, it is possible to produce insights several days after the data is generated, but in the case of streaming IoT analytics, they must be delivered within a few seconds at most. This real-time constraint leads to the following challenges for IoT big data:
Data Management: Data management is a big challenge to be addressed in order to realize the full potential of IoT, and therefore has become a key research topic [17, 20]. Many IoT systems are processor-intensive and require processing a massive amount of highly concurrently generated data. How to perform the management of these data interactions while ensuring low latency?
Visualization: Visualization is important in big data analytics, especially for IoT systems [18]. How can we perform visualization in the case of heterogeneous and diversely structured data generated in IoT?
Data Mining: The realization of the potential of IoT depends on being able to gain the insights hidden in the vast and ever increasing available data. Current data mining approaches don't scale well to IoT volumes. What characteristics are the most essential for a system fit to such environments?
Resource Constraints: In the IoT data stream model, a high volume of data is produced at high speed. Therefore algorithms that process it must do so under very strict constraints of space and time. Addressing these constraints requires that a significant amount of data processing must happen on edge devices. How can we design algorithms that work efficiently in such environments?
Security: Being able to deal with dynamic scaling while guaranteeing protection of data from different entities is another significant challenge. What is the most effective way to ensure access control and protection of data from large volumes of devices and, at the same time, ensure the development of a dynamic and flexible application?
2.2 Data management solutions
2.2.1 Array Database Models. Most IoT environments are constituted by static or moving sensor devices placed in specific locations that produce data continuously. Each data item has space coordinates as well as an associated time-stamp, resulting in a high time and space correlation. Because of this multidimensional spatio-temporal nature of IoT data, multidimensional array database models, built using arrays as the primary data representation, offer advantages for efficient data management.
Array databases were initially proposed to better represent sensor, image, simulation, and statistics data of typically spatio-temporal dimensions [4]. They have special query languages built upon array-based algebraic formalizations that model different kinds of operations such as aggregations or subsetting. Cells in an array have an intrinsic ordering, making it easy to quickly look up values by taking advantage of this ordering. Array indexes do not need to be stored and can be inferred from the position of a cell, saving storage space. Arrays can also be split into subarrays (called tiles or chunks)
that can be used as processing and storage units to help answer queries efficiently.
Recently, some research effort has been applied to integrating ML tools and array DBMSs [24]. The system Rasdaman [3] allows the implementation of machine learning algorithms through User Defined Types and Functions that implement the underlying linear algebra operations directly over the arrays. In the case of SciDB [23], users are provided with linear algebra operators that can be used as building blocks to implement the ML algorithms. In SAVIME [11], users can perform inference from machine learning models as part of the query expression, allowing the joint optimization of the data preparation process and its input to the model.
2.2.2 Active Databases and Complex Event Processing. An event can be defined as an occurrence of significance in a system [16]. Historically, many different initiatives have studied event processing for different reasons. Active Databases intended to extend traditional DBMSs by enabling the specification of reactive behavior. The idea was to develop strategies to respond automatically to events and changes in the database state through mechanisms formalized as Event-Condition-Action (ECA) rules [26]: if an event is detected, and any of the previously defined conditions becomes true, then a corresponding action is taken without any external intervention.
Complex Event Processing extends the logic behind ECA rules, being understood as a set of techniques combined in order to perform real-time stream processing for monitoring and detection of arbitrarily complex patterns in massive data streams [16]. These techniques are commonly used in IoT environments to enable real-time or near real-time decisions [13]. In CEP, each data item is abstracted as an event produced by a data source. A CEP engine combines multiple simpler events to produce more complex ones that match previously defined patterns. It typically must process multiple data streams from different sources in order to track simultaneously hundreds or even thousands of different patterns through evaluation mechanisms such as non-deterministic finite automata or tree-based plans [12].
2.2.3 Actor Oriented Databases. The actor programming model is a well-known model for distributed and concurrent programming, in which the actor is the fundamental computing unit. Its main principle is that in a system, the control flow and the data flow must be inseparable. Actors do not share state and communicate via asynchronous messages. Because of these characteristics, actors are a scalable solution to support the management of any number of independent and heterogeneous streaming data sources.
Recent work has demonstrated the effectiveness of integrating data management features such as transactions and indexing into actor runtimes [6]. The authors of [25] demonstrate that this solution is in fact very suitable to perform IoT data management. A similar approach has sought to integrate actor primitives into relational databases [22] by extending the programmability of stored procedures with actor objects, taking advantage of the database's state management features.
2.3 IoT Concept Drift
Concept drift can be formally defined as follows [15]: given a time period $[0, t]$, a set of samples, denoted as $S_{0,t} = \{d_0, \ldots, d_t\}$, where $d_i = (X_i, y_i)$ is one observation or data instance, $X_i$ is the feature vector, $y_i$ is the label, and $S_{0,t}$ follows a certain distribution $F_{0,t}(X, y)$. Concept drift occurs at timestamp $t + 1$ if $F_{0,t}(X, y) \neq F_{t+1,\infty}(X, y)$.
Research on learning under concept drift presents three components beyond traditional Training/Prediction: drift detection, drift understanding and drift adaptation. The first refers to whether or not a concept drift occurs in a data stream. Drift understanding is related to when, how and where it occurs. Finally, drift adaptation refers to reacting to the existence of a drift.
Recently, some works have been proposed to deal with concept drift specifically in IoT platforms. For example, the work of [14] proposes an ensemble learning method based on offline classifiers to address concept drift and imbalanced data concurrently. In [2], an unsupervised model-independent methodology is proposed to detect drifts in data generated from IoT devices. In [27], a concept drift adaptive method is proposed for anomaly detection in IoT services that considers the influence of time on changes in the sample distribution. However, this topic is not fully explored and many research opportunities still exist.
3 LEVERAGE ARRAY DATABASES TO IOT COMPLEX EVENT PROCESSING
Historically, Database Management Systems have offered many benefits to data intensive applications, such as transactions, indexing, query planning and declarative query languages. An IoT data management solution must answer specific demands, such as encapsulation for isolating state and access control, asynchronous primitives and dynamic scalability, since in many scenarios, sensing devices can instantly enter and leave a system. It should be able to detect and react to predefined data patterns automatically, while providing quick data access and an efficient integration to ML analysis. Table 1 highlights the strong contributions offered by active, actor-oriented and array databases to each of these IoT demands.
Figure 1: System Overview (architecture diagram with three layers: Things Layer, Actors Layer and Analysis Layer)
By taking our inspiration from the approaches of Orleans [5], which added data-management functionality to a virtual actor runtime, and ReactDB [22], which integrates actor features into a relational
database system, we investigate the potential of performing event detection and reactive behavior through actor-based primitives in an array database model. Figure 1 illustrates the proposed idea. At the things layer, data is collected from sensor devices and communicated to actor engines at the actor layer. Distributed actors manage these intermediate nodes that process and detect relevant (local) events based on attached sensors before sending them to the cloud based data center, along with relevant data in the form of array data structures. At the analysis layer, global queries and analysis that take into account alerts provided by actors can be made over the collected data. The intention is to provide a low latency environment, in which there is a reduced communication bottleneck.
The integration of ML-based analytics as part of the Data Management System may lead to powerful optimization opportunities since different parts of the ML process may be treated as operators of the query plan. To cope with the growing need for ML support in IoT data systems, we aim to provide both a local and a global event detector that supports ML inference from trained models as first class operators.
In IoT environments, communicated data from devices is usually collected and recorded by assuming a temporal relationship between records. As time goes on, concept drift is bound to occur, which may cause an accuracy drop in any method that relies on long-term statistical data attributes. The proposed solution will rely on a central drift detector that is able to determine if and when the drift occurred as well as the best reaction to it based on the local drift detectors.
4 CONCLUSION AND RESEARCH DIRECTION
In this paper, we discuss characteristics and challenges of IoT data management and summarize potential contributions from different strategies in addressing each of them. Our goal is to build an efficient, in-memory data management system that combines each of these different contributions into a single integrated solution, while offering robust support for data analysis through Machine Learning. As the next step in our study, we aim to focus on the design refinement and implementation of a prototype system as a foundation for our subsequent investigations. To evaluate the viability of our approach, we intend to submit it to a real use-case scenario that presents the IoT characteristics and challenges described. We also intend to perform comparative experiments with state-of-the-art big data frameworks in order to demonstrate the optimization opportunities that we envision.
5 ACKNOWLEDGEMENT
We would like to thank CAPES for its scholarships, and Petrobras for financing this work through the Gypscie project.
REFERENCES
[1] Furqan Alam, Rashid Mehmood, Iyad Katib, and Aiiad Albeshri. 2016. Analysis of eight data mining algorithms for smarter Internet of Things (IoT). Procedia Computer Science 98 (2016), 437–442.
[2] Mohsen Asghari, Daniel Sierra-Sosa, Michael Telahun, Anup Kumar, and Adel S Elmaghraby. 2020. Aggregate density-based concept drift identification for dynamic sensor data models. Neural Computing and Applications (2020), 1–13.
[3] Peter Baumann, Andreas Dehmel, Paula Furtado, Roland Ritsch, and Norbert Widmann. 1998. The multidimensional database system RasDaMan. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data. Association for Computing Machinery, Washington, USA, 575–577.
[4] Peter Baumann, Dimitar Misev, Vlad Merticariu, and Bang Pham Huu. 2021. Array databases: concepts, standards, implementations. Journal of Big Data 8, 1 (2021), 1–61.
[5] Phil Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. 2014. Orleans: Distributed virtual actors for programmability and scalability. MSR-TR-2014-41 (2014).
[6] Philip A Bernstein, Mohammad Dashti, Tim Kiefer, and David Maier. 2017. Indexing in an Actor-Oriented Database. In CIDR.
[7] Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, and Arie Shoshani. 2014. Parallel data analysis directly on scientific file formats. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. Association for Computing Machinery, Utah, USA, 385–396.
[8] Shaofeng Cai, Gang Chen, Beng Chin Ooi, and Jinyang Gao. 2019. Model slicing for supporting complex analytics with elastic inference cost and resource constraints. Proceedings of the VLDB Endowment 13, 2 (2019), 86–99.
[9] Min Chen, Shiwen Mao, Yin Zhang, Victor CM Leung, et al. 2014. Big data: related technologies, challenges and future prospects. Vol. 96. Springer.
[10] Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR) 44, 3 (2012), 1–62.
[11] Anderson Chaves da Silva, Hermano Lourenço Souza Lustosa, Daniel Nascimento Ramos da Silva, Fábio André Machado Porto, and Patrick Valduriez. 2020. SAVIME: An Array DBMS for Simulation Analysis and ML Models Prediction. Journal of Information and Data Management 11, 3 (2020).
[12] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and Minos Garofalakis. 2020. Complex event recognition in the big data era: a survey. The VLDB Journal 29, 1 (2020), 313–352.
[13] Ilya Kolchinsky and Assaf Schuster. 2019. Real-time multi-pattern detection over event streams. In Proceedings of the 2019 International Conference on Management of Data. 589–606.
[14] Chun-Cheng Lin, Der-Jiunn Deng, Chin-Hung Kuo, and Linnan Chen. 2019. Concept drift detection and adaption in big imbalance industrial IoT data using an ensemble learning method of offline classifiers. IEEE Access 7 (2019), 56198–56207.
[15] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. 2018. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering 31, 12 (2018), 2346–2363.
[16] David C. Luckham. 2001. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley Longman Publishing Co., Inc., USA.
[17] Meng Ma, Ping Wang, and Chao-Hsien Chu. 2013. Data management for internet of things: Challenges, approaches and opportunities. In 2013 IEEE International conference on green computing and communications and IEEE Internet of Things and IEEE cyber, physical and social computing. IEEE, 1144–1151.
[18] Mohsen Marjani, Fariza Nasaruddin, Abdullah Gani, Ahmad Karim, Ibrahim Abaker Targio Hashem, Aisha Siddiqa, and Ibrar Yaqoob. 2017. Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5 (2017), 5247–5261.
[19] Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. 2018. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials 20, 4 (2018), 2923–2960.
[20] John Paparrizos, Chunwei Liu, Bruno Barbarioli, Johnny Hwang, Ikraduya Edian, Aaron J Elmore, Michael J Franklin, and Sanjay Krishnan. 2021. VergeDB: A Database for IoT Analytics on Edge Devices. In CIDR.
[21] José Roldán, Juan Boubeta-Puig, José Luis Martínez, and Guadalupe Ortiz. 2020. Integrating complex event processing and machine learning: An intelligent architecture for detecting IoT security attacks. Expert Systems with Applications 149 (2020), 113251.
[22] Vivek Shah and Marcos Antonio Vaz Salles. 2018. Reactors: A case for predictable, virtualized actor database systems. In Proceedings of the 2018 International Conference on Management of Data. 259–274.
[23] Michael Stonebraker, Paul Brown, Donghui Zhang, and Jacek Becla. 2013. SciDB: A database management system for applications with complex analytics. Computing in Science & Engineering 15, 3 (2013), 54–62.
[24] Sebastian Villarroya and Peter Baumann. 2020. On the Integration of Machine Learning and Array Databases. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1786–1789.
[25] Yiwen Wang, Julio Cesar Dos Reis, Kasper Myrtue Borggren, Marcos Antonio Vaz Salles, Claudia Bauzer Medeiros, and Yongluan Zhou. 2019. Modeling and Building IoT Data Platforms with Actor-Oriented Databases. In EDBT. 512–523.
[26] Jennifer Widom and Stefano Ceri. 1996. Active database systems: Triggers and rules for advanced database processing. Morgan Kaufmann.
[27] Rongbin Xu, Yongliang Cheng, Zhiqiang Liu, Ying Xie, and Yun Yang. 2020. Improved Long Short-Term Memory based anomaly detection with concept drift adaptive method for supporting IoT services. Future Generation Computer Systems (2020).
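Illustrative sketch for Section 2.2.2. The ECA rules discussed there (an event is detected, a condition is checked, an action is taken without external intervention) can be pictured with a few lines of Python. This is only a minimal sketch under our own assumptions and not the proposed system: the names `Event`, `Rule` and `RuleEngine`, the `temperature` event kind and the 80-degree threshold are hypothetical and are not taken from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Event:
    """A single occurrence of significance, e.g. one sensor reading."""
    kind: str
    payload: Dict[str, Any]


@dataclass
class Rule:
    """One ECA rule: on `event_kind`, if `condition` holds, run `action`."""
    event_kind: str
    condition: Callable[[Event], bool]
    action: Callable[[Event], None]


class RuleEngine:
    """Hypothetical in-memory rule store; real systems persist and index rules."""

    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def register(self, rule: Rule) -> None:
        self.rules.append(rule)

    def dispatch(self, event: Event) -> int:
        """Evaluate every rule against the event; return how many rules fired."""
        fired = 0
        for rule in self.rules:
            # Event part: the rule only reacts to its own kind of event.
            if rule.event_kind != event.kind:
                continue
            # Condition part: checked only after the event matched.
            if rule.condition(event):
                # Action part: runs automatically, no external caller involved.
                rule.action(event)
                fired += 1
        return fired


alerts: List[str] = []
engine = RuleEngine()
engine.register(Rule(
    event_kind="temperature",
    condition=lambda e: e.payload["celsius"] > 80.0,  # assumed threshold
    action=lambda e: alerts.append(f"overheat at {e.payload['sensor']}"),
))

engine.dispatch(Event("temperature", {"sensor": "s1", "celsius": 21.5}))
engine.dispatch(Event("temperature", {"sensor": "s2", "celsius": 93.0}))
engine.dispatch(Event("humidity", {"sensor": "s2", "percent": 40}))
print(alerts)
```

Only the second reading satisfies both the event and the condition part, so the script prints a list with the single alert for sensor `s2`. A CEP engine as described in Section 2.2.2 would go further and correlate several such events over time, which this single-event sketch deliberately leaves out.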