=Paper=
{{Paper
|id=Vol-2978/casa-paper2
|storemode=property
|title=Preemptive Anomaly Prediction in IoT Components (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2978/casa-paper2.pdf
|volume=Vol-2978
|authors=Alhassan Boner Diallo,Hiroyuki Nakagawa,Tatsuhiro Tsuchiya
|dblpUrl=https://dblp.org/rec/conf/ecsa/DialloNT21
}}
==Preemptive Anomaly Prediction in IoT Components (short paper)==
Alhassan Boner Diallo, Hiroyuki Nakagawa and Tatsuhiro Tsuchiya

Osaka University, Osaka, Japan

===Abstract===
The Internet-of-Things (IoT) has become a very promising and fruitful area of research, and its rapid development is changing how we use technology in daily life. Under the IoT paradigm, the devices that make up a system are constrained in storage, computing power, and energy consumption. The paradigm nevertheless makes flexible and pervasive communication possible between these low-resource devices. These constraints may lead to anomalies at the component level that can impact the whole system. Some innovative techniques have been proposed to quantify the reliability of these devices under the aforementioned constraints. However, there is a gap between the quantification of component reliability and the predictive and preemptive maintenance of these components. In this study, we propose an approach that combines reliability quantification and reinforcement learning to build a mechanism that can achieve predictive maintenance for the components of an IoT system, such as devices and links. In the approach, a component-level mechanism is built to synthesize the reliability data and to determine the probability of anomaly occurrence for each component. The approach is being applied to DeltaIoT, a self-adaptive IoT system for smart environment monitoring.

Keywords: self-adaptive systems, IoT, reliability, reinforcement learning, Q-learning

CASA: 4th Context-aware, Autonomous and Smart Architectures International Workshop, ECSA'21, 15-17 September 2021. Contact: a-bonerdiallo@ist.osaka-u.ac.jp (A. B. Diallo); nakagawa@ist.osaka-u.ac.jp (H. Nakagawa); t-tutiya@ist.osaka-u.ac.jp (T. Tsuchiya). ORCID: 0000-0001-5280-4113 (H. Nakagawa). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===1. Introduction===
Recently, the Internet of Things (IoT) has been one of the fastest growing fields in the computing domain. Its paradigm has been applied to many critical applications such as early warning systems for earthquakes or tsunamis, smart home security, traffic management, healthcare, and education systems. Despite rapid development and improvement in the IoT research area, many challenges remain. They relate mainly to the following properties: scalability, availability, reliability, interoperability, security, mobility, and performance.

The IoT infrastructure is made up of low-resource devices, meaning that they have low storage and low computing power compared to other devices within the computing domain. This results from the need to limit energy consumption, as most of the components rely on a battery for power [1][2]. Nowadays, the IoT paradigm is applied to many mission-critical systems, such as factory management, personal body sensors in healthcare, and surveillance systems in nuclear power plants. These areas of application require a failure-free system; otherwise there will be disastrous consequences. We must be able to trust these systems in all conditions, as they affect the numerous decisions we make based on the data they collect and provide. The reliability of an IoT system depends on the reliability of the components that make up the system. As IoT devices are constrained by nature, some mechanism must be in place to ensure their reliability at all times, so that accurate decision models can be built on the data provided by the lower layer of the IoT architecture.

IoT reliability is a critical domain of research that has seen many important contributions over the years. Multiple ways of quantifying the reliability of IoT components have been proposed. However, there is a gap between that quantified reliability and its application in predictive maintenance. In other words, how can we predict an accurate maintenance date for IoT components based on the reliability measurement? To achieve that, we must first build mechanisms that can synthesize the reliability information from anomalies to determine whether the system has become less reliable as a result of an anomaly occurrence. The ability to reason about the quantified reliability of the IoT system is a valuable step towards achieving predictive maintenance. The idea here is to build a dynamic decision-making process that collects reliability data periodically and tries to estimate a future failure time.

Fundamentally, we can define reliability as the study of failures. The reliability of a system or a computing device is its quality over a certain period of time. To quantify the reliability of a system or computing device, we use standard time-related metrics such as Mean Time To Failure, Mean Time Between Failures, and Mean Time To Repair. Quantifying reliability is essential to assessing the continued success in the operation of an information system or a computing device.

===2. Background===
Computing systems require a high degree of performance and availability, but above all, they must be reliable. The appropriate way of assessing the reliability of a computing system depends on the type and mission of the system. In their study, Xie et al. [3] addressed several key metrics for reliability quantification, among them Mean Time To Failure (MTTF), Mean Time Between Failures (MTBF), and failure rate. The MTTF metric quantifies the expected operating time of a system before the occurrence of a failure. The MTBF metric, as the name indicates, quantifies the operating time between one failure occurrence and the next. The failure rate function helps to quantify the failures of a system within a specified window of time. The maintainability metric quantifies the probability that a system can go back to operating normally after the occurrence of an anomaly or a failure. The availability metric quantifies the probability that the system is operating normally as expected.

The methods and techniques to analyze the reliability of computing systems depend on the domains that make up the system. There are mainly four domains or levels: system, hardware, software, and network. The assessment of reliability at the system level results from the combined assessment of the hardware, software, and network levels. In the hardware domain, the reliability assessment is related to the decay over time of the quality of the physical components of the computing system. In the software domain, according to the study in [4], there is no concern over a physical decay of quality over time. Network reliability may be subject to a decrease of performance over time due to internal and external factors acting on the hardware and software that make up the network.

In the case of IoT systems, reliability can be assessed by quantifying the reliability of the different layers of their architecture. In [5], the IoT functionalities are grouped into sensing and actuation, communication, and end-user applications and services. A basic architecture of an IoT system can be divided into three layers: a device layer, a network layer, and an application layer. The device layer is responsible for sensing and actuation. At the device level, reliability is constrained by the battery life and by the low capacity of both the memory and the CPU, which prevents devices from using complex encryption to protect the transmitted data [6]. Device reliability is further constrained by false reading events that are common for sensors, when they collect and transmit data erroneously after an undetected failure [7].

The network layer is responsible for the communication between the devices of the system. The application layer is responsible for the services and the interactions with the end-user applications. In most cases, its reliability depends on the reliability of the device layer and the network layer. For example, the device layer may collect and transmit anomalous data, which are then sent through the network to the application layer. Beyond being able to reason about the fitness of our IoT devices, we must also be able to attest to the reliability of the network infrastructure that forms the backbone of IoT communication. Two kinds of network reliability studies are relevant here: studies for enhancing QoS in networks, and studies aimed at quantifying reliability metrics for networks. Some research has also been conducted to evaluate IoT reliability at the system level. These approaches are high-level and do not capture individual details of reliability, such as which devices are responsible for failures or which parts of the network are responsible for traffic problems.

===3. Focus: Anomaly Prediction===
In our study, we classify anomalies according to where and how frequently they occur. Anomalies can occur on each layer of the architecture with different degrees of frequency. The device layer and the network layer are where anomalies occur the most, whereas the application layer is less prone to anomalies. As for the occurrence frequency, we consider two main forms of occurrence in the IoT components: cyclic anomalies and random anomalies. Cyclic anomalies are linked to the nature of the component itself. Each component has a starting time and an ending time. The probability of anomaly occurrence is very small when the reliability is quantified close to the starting time, and large when it is quantified towards the ending time. Random anomalies stem from random external as well as internal factors, such as noise and interference.

Our approach combines reliability quantification and machine learning to solve the problem of predictive maintenance for the aforementioned anomalies. Reliability quantification is achieved using the metrics introduced in [3]. Even though the concept of component anomalies is mentioned throughout this paper, detecting anomalies is not the main focus of this study. In their review of IoT reliability and anomaly detection techniques, Moore et al. [8] noted that no study had explored the potential of synthesizing quantified reliability data. The study pointed out that a decrease in the reliability of a smart home system has different consequences than a decrease in the reliability of a power plant surveillance system. A decrease in the reliability of the IoT system increases the probability of anomaly occurrence within the system. As stated in the background, each layer of the IoT architecture has its own way of assessing reliability. In this study, we cover mainly anomaly occurrence at the device layer and the network layer of the architecture.

The main goal of our study is to enable the IoT system to achieve predictive maintenance, i.e., to predict a probable failure time of one or more components and preemptively apply corrections to the components, based on their quantified reliability. Given this goal, we include in the study only components to which corrections can be applied after a failure or an anomaly. Therefore, some components of the device layer, such as the battery, the memory and the CPU, are out of the scope of this paper, because they cannot be automatically maintained after a failure or an anomaly occurrence. Once their reliability has decreased or a failure has occurred, these components would require a human in the loop for maintenance.

Other components of an IoT system can be calibrated after a decrease of reliability or the occurrence of an anomaly. Such components include sensors at the device layer and links at the network layer. Therefore, our approach is applied to the sensor devices and the network links in order to achieve predictive maintenance. Ignoring undiagnosed anomalous data within the different layers of the IoT architecture has consequences. Therefore, to decrease the vulnerability of IoT-centered systems, there is a need to design lightweight solutions that can handle anomaly detection tasks without burdening the resource-constrained systems.

===4. Motivating Example: DeltaIoT===
In this section, we describe the motivating example of our research, a self-adaptive IoT system named DeltaIoT [9]. Self-adaptive systems are able to modify their behavior at runtime, in response to a change in their operating environment, to achieve their goals. This research is not only about engineering reactive self-adaptivity; it is also about designing robust IoT systems that are subject to environmental changes. A typical IoT network system is composed of devices with different types of sensors and actuators, usually linked together wirelessly through the internet [2]. The concept of Internet-of-Things enables devices to operate under the constraints of energy consumption, low computing power and low storage capacity. The networks connecting the devices are also prone to congestion, especially when there is a burst in demand, e.g., during an emergency situation for a system deployed to monitor large geographical areas to detect potential disasters as early as possible [10]. All these constraints make the engineering of dependable and reliable IoT systems more challenging. The next paragraph introduces the IoT system used in the case study of our approach.

The DeltaIoT system is a platform for smart environment monitoring. The system, introduced in [9], is self-adaptive, which enables it to react to environmental changes. The DeltaIoT system “enables researchers to evaluate and compare new methods, techniques and tools for self-adaptation in Internet of Things”. It has been built in two versions, both deployed at the campus of KU Leuven. The two versions differ in the number of devices present in each network and in their geographical deployment. The DeltaIoT system is shown in Figure 1.

Figure 1: DeltaIoT network structure

DeltaIoT uses multihop communication in cycles of 570 seconds. The system experiences external and internal stimuli that cause it to change its behavior to achieve its goals. There are two main causes for adaptation. The first is interference in the network, which causes the links to experience delay or packet loss. The second is the fluctuating load of messages, which can clog some or all links and create delay and packet loss. The system must fulfil three quality requirements. The first concerns the average packet loss over 12 hours, which should not exceed 10% of the overall messages sent through the links. The second concerns the average latency over 12 hours, which should not exceed 5% of the cycle time. The third concerns the average energy consumption over 12 hours, which has to be minimized during that period.

One of the main missions of Internet of Things systems is to collect and communicate data about the environment or the people around which they are deployed. DeltaIoT, like many other IoT systems, alternates sensing and actuation during its operation. In many cases, the actuation is performed based on the results of the sensing. Therefore, anomalies during data sensing and during data communication may have a negative effect on the system performance or operation. The collection of anomalous data typically happens at the device level, in the sensors. It can have different causes, such as noise or defects due to environmental factors. When this happens, the sensors can be calibrated again to restore their accuracy. Anomalies occurring on the links of the DeltaIoT system are related to a decrease in QoS. Packet loss and latency are some of the manifestations of the anomalies occurring in those links.

We have previously presented a mechanism for efficient configuration space reduction [11]. That mechanism focused on the analysis performed after an anomaly has happened at the component level. In this paper, the main focus is to forecast an anomaly before it happens. It is important to reduce the time between anomaly occurrence and detection. It is equally important to minimize the time from anomaly detection to correction. Moreover, a precise understanding of anomalies aids in constructing a more precise probabilistic model of the system, which helps to find a more reliable configuration of the system using probabilistic model checking [12]. Many anomaly detection techniques have been proposed for computing devices in general, each with its advantages and drawbacks. However, techniques for anomaly forecasting are few. In the Internet of Things domain, to the best of our knowledge, our study is the only one that uses reliability quantification together with a machine learning approach to predict anomaly occurrence. As explained in the approach, if the time of anomaly occurrence can be predicted, then corrective measures can be applied in order to prevent the anomaly from happening.

===5. Approach===
In this section we describe our approach and its practical implementation in detail. The goal of the approach is to determine a time at which a failure or an anomaly is highly probable, in order to apply corrective measures. We build two mechanisms. The first mechanism is at the component level, that is, the level of devices and links. It captures the behavior of each individual component and computes the reliability of each component. The second mechanism is at the level of the MAPE feedback loop. MAPE stands for Monitor, Analyze, Plan and Execute. The feedback loop is used in autonomic computing to achieve self-adaptation in software systems [13]. The system-level mechanism is connected to the monitor component of the feedback loop.

The backbone of the component-level mechanism is an anomaly agent that is instantiated by each component of the IoT system. The quantified reliability is determined using mainly two metrics: the mean time between anomalies and the anomaly rate. The function of the anomaly agent is to predict an anomaly time, depending on the quantified reliability of the component. The anomaly agent has to predict an accurate anomaly time. It behaves according to the principles of reinforcement learning: it is rewarded for an accurate prediction of the anomaly time. Figure 2 illustrates the component-level mechanism of the approach.

Figure 2: Overview of the component-level mechanism

According to [14], “reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward”. The main motivation for using reinforcement learning is to record the different states of the system and their transitions [15][16]. The system has an optimal state in which the reliability of each component is high. The next state is an in-between state where the component’s reliability is just average. Lastly, the system has a critical state in which an anomaly has already occurred or is very likely to occur. Capturing these different states and reasoning about them can be helpful in discovering an optimal time for predictive maintenance. To implement the anomaly agent, we use an approach that relies on Temporal Difference Learning [17]. The agent is implemented according to a Q-Learning algorithm [18]. The approach is well suited to situations with a great degree of randomness and uncertainty. In the next subsections, we explain the two mechanisms in detail.

====5.1. Component-level mechanism====
The network of most IoT systems is composed of several heterogeneous devices. These devices sometimes have different characteristics that can hinder their interoperability. Therefore, when designing a mechanism for anomaly prediction, each individual component of the network must have a self-centered module that captures its unique characteristics. The component-level mechanism is illustrated in Figure 2. The mechanism has two main parts.

The first part is a reliability quantification algorithm, in which the reliability of the module is quantified based on the previously mentioned metrics. The IoT system operates in an environment where the quality of its components degrades over time. Some components, like the sensors and the network links, can be calibrated back to normal. However, the physical aspect of the system in most cases cannot be calibrated. Therefore, that aspect is out of the scope of this study. We track the component based on three metrics: the mean time between anomalies (MTBA), the anomaly rate (AR) and the probability of anomaly (PA). First we determine the anomaly rate AR, the number of anomalies per cycle of time. It is calculated by dividing the number of anomalies by the cycle time.

$$AR = \frac{\mathit{Anomalies}}{\mathit{CycleTime}} \tag{1}$$

The MTBA is the time the system or the component operates normally before an anomaly occurrence. It is determined by the following formula.

$$MTBA = \frac{1}{AR} \tag{2}$$

The probability of anomaly occurrence PA is determined using the MTBA in the following formula.

$$PA = e^{\left(-\frac{1}{MTBA}\right) \cdot \mathit{time}} \tag{3}$$

The second part of the component-level mechanism is a Q-Learning agent, which learns the characteristics of the component based on the quantified reliability and the overall environment of the component. The agent must learn to predict an anomaly time. Therefore, the actions to be taken by the agent are prediction actions related to an anomaly time. We formalize our problem as a Markov Decision Process (MDP). The component, which is the environment interacting with the agent, is modeled as a Markov Process. The Q-learning algorithm is chosen to create the agent because it is a model-free, off-policy, and value-based algorithm.

The MDP describing the environment for the learning process is a tuple of four elements. The first element is a finite set of states S. The second element is a finite set of actions A. The number of states is a function of the number of actions. The actions to be performed by the agent are, for each run, to add an integer value to the current time and to check whether the resulting time corresponds to the anomaly time. The third element of the tuple is the reward R received after transitioning from state S to state S’ as a result of performing an action. The fourth element of the tuple is the probability P associated with the performed action.

The Q in Q-learning is a measure of the quality of a state-action combination. When an action is taken by a learning agent, the reward of that action, along with the learning rate, the discount factor and the initial condition or previous value of Q, is used to determine the new value of Q for that state.

$$Q_t(s, a) = Q_{t-1}(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) \right] \tag{4}$$

====5.2. System-level mechanism====
The mechanism is implemented at the monitor level of the MAPE feedback loop. The monitor part of the MAPE feedback loop observes the system and the operating environment with which the system is interacting, to check whether there are changes. We leverage this function of the monitor and attach the system-level mechanism to it. The mechanism performs two main tasks. The first task is to check the results from the components during each cycle performed by the IoT system. The second task is to aggregate the results from the components over all the cycles performed by the system.

====5.3. Learning Process====
In the component-level mechanism, our method first determines an accurate quantification of the metrics, which gives a snapshot of the quality of the component at each period of the system operating cycle. This is most needed during data sensing and data forwarding. The components of the IoT system, i.e., sensors and links, operate differently in the environment. We have described earlier the kinds of anomalies that the components are subject to. The sensors can have random anomalies like noise but also cyclic anomalies. The links of the network experience external interference or message clogging, which leads to anomalies. However, most of these anomalies are related to a decrease in the accuracy of the device and a decrease in the power settings of the link.

The approach determines the number of anomalies that occur during each cycle. Therefore, for each cycle we can observe a different anomaly rate. That observation helps to determine and update the mean time between anomalies over all the cycles. We define the actions performed by the agent as adding time in seconds to the current time. The reason is that the current time is the time at which the agent decides to make a prediction about an anomaly time. The agent decides to make a prediction after getting the anomaly probability for that period. The amount of time in seconds to add to the current time is a function of the anomaly probability of that period. If the anomaly probability is high, the amount is small; if it is low, either no time is added or a large amount is added. After each prediction, the Q value of that state-action combination is updated according to the reward obtained.

===6. Conclusion===
In this research, we are investigating the possibility of preemptively forecasting anomalies that occur at the device and network layers of an IoT architecture, by implementing an anomaly agent based on the Temporal Difference Learning method. In the next step, we plan to implement another anomaly agent based on the Monte Carlo method and evaluate the performance of these two agents.

===References===
[1] D. E. Kouicem, A. Bouabdallah, H. Lakhlef, Internet of things security: A top-down survey, Computer Networks 141 (2018) 199–221.

[2] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, M. Ayyash, Internet of things: A survey on enabling technologies, protocols, and applications, IEEE Communications Surveys & Tutorials 17 (2015) 2347–2376.

[3] M. Xie, Y.-S. Dai, K.-L. Poh, Computing system reliability: models and analysis, Springer Science & Business Media, 2004.

[4] A. Mavrogiorgou, A. Kiourtis, C. Symvoulidis, D. Kyriazis, Capturing the reliability of unknown devices in the IoT world, in: 2018 Fifth International Conference on Internet of Things: Systems, Management and Security, IEEE, 2018, pp. 62–69.

[5] A. Rayes, S. Salam, Internet of things from hype to reality, Springer (2017).

[6] F. A. Alaba, M. Othman, I. A. T. Hashem, F. Alotaibi, Internet of things security: A survey, Journal of Network and Computer Applications 88 (2017) 10–28.

[7] A. Karkouch, H. Mousannif, H. Al Moatassime, T. Noel, A model-driven architecture-based data quality management framework for the internet of things, in: 2016 2nd International Conference on Cloud Computing Technologies and Applications (CloudTech), IEEE, 2016, pp. 252–259.

[8] S. J. Moore, C. D. Nugent, S. Zhang, I. Cleland, IoT reliability: a review leading to 5 key research directions, CCF Transactions on Pervasive Computing and Interaction (2020) 1–17.

[9] M. U. Iftikhar, G. S. Ramachandran, P. Bollansée, D. Weyns, D. Hughes, DeltaIoT: A real world exemplar for self-adaptive internet of things (artifact), in: DARTS-Dagstuhl Artifacts Series, volume 3, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[10] S. Y. Shin, S. Nejati, M. Sabetzadeh, L. C. Briand, C. Arora, F. Zimmer, Dynamic adaptation of software-defined networks for IoT systems: a search-based approach, in: Proceedings of the IEEE/ACM 15th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, 2020, pp. 137–148.

[11] A. B. Diallo, H. Nakagawa, T. Tsuchiya, Adaptation space reduction using an explainable framework, in: Proc. of the IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC 2021), IEEE, 2021, pp. 1654–1661.

[12] H. Nakagawa, H. Toyama, T. Tsuchiya, Expression caching for runtime verification based on parameterized probabilistic models, The Journal of Systems and Software, Elsevier 156 (2019) 300–311.

[13] J. O. Kephart, D. M. Chess, The vision of autonomic computing, Computer 36 (2003) 41–50.

[14] J. Hu, H. Niu, J. Carrasco, B. Lennox, F. Arvin, Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning, IEEE Transactions on Vehicular Technology 69 (2020) 14413–14423.

[15] M. Wiering, M. Van Otterlo, Reinforcement learning, Adaptation, learning, and optimization 12 (2012).

[16] R. Riveret, Y. Gao, G. Governatori, A. Rotolo, J. Pitt, G. Sartor, A probabilistic argumentation framework for reinforcement learning agents, Autonomous Agents and Multi-Agent Systems 33 (2019) 216–274.

[17] R. S. Sutton, A. G. Barto, Temporal-difference learning, Reinforcement learning: an introduction (1998) 167–200.

[18] C. J. Watkins, P. Dayan, Q-learning, Machine learning 8 (1992) 279–292.
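As an illustration of the reliability quantification in Section 5.1, the three metrics of Equations (1) to (3) can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function names, the use of seconds as the time unit, and the handling of a cycle with zero anomalies are assumptions made here. Note also that Equation (3), as written, has the form of the exponential survival function; under a constant anomaly rate, one minus that value is the probability of at least one anomaly by the given time, and the sketch exposes both quantities.

```python
import math


def anomaly_rate(num_anomalies: int, cycle_time_s: float) -> float:
    """Eq. (1): number of anomalies divided by the cycle time."""
    if cycle_time_s <= 0:
        raise ValueError("cycle_time_s must be positive")
    return num_anomalies / cycle_time_s


def mean_time_between_anomalies(ar: float) -> float:
    """Eq. (2): MTBA = 1 / AR.

    Assumption: a cycle with no anomalies yields an infinite MTBA.
    """
    return math.inf if ar == 0 else 1.0 / ar


def pa_as_written(mtba_s: float, time_s: float) -> float:
    """Eq. (3) exactly as given in the paper: exp(-(1/MTBA) * time)."""
    return math.exp(-time_s / mtba_s)


def prob_anomaly_by(mtba_s: float, time_s: float) -> float:
    """Complement of Eq. (3): chance of at least one anomaly by time_s,
    assuming a constant anomaly rate (an assumption, not from the paper)."""
    return 1.0 - pa_as_written(mtba_s, time_s)


# Hypothetical example: 3 anomalies observed in one 570-second DeltaIoT cycle.
ar = anomaly_rate(3, 570.0)
mtba = mean_time_between_anomalies(ar)
print(f"AR = {ar:.5f} anomalies/s, MTBA = {mtba:.1f} s")
print(f"Eq. (3) at t = MTBA: {pa_as_written(mtba, mtba):.4f}")
```

With three anomalies in a 570-second cycle the MTBA is 190 seconds, and evaluating Equation (3) at that time gives e^(-1), roughly 0.37.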
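The Q-value update of Equation (4) in Section 5.1 can be sketched as a tabular update in Python. The three state names follow the optimal, in-between and critical states described in Section 5; everything else is hypothetical and chosen only for illustration: the candidate time increments in `ACTIONS`, the reward values, the default learning rate and discount factor, and the epsilon-greedy action selection, which the paper does not specify.

```python
import random
from collections import defaultdict

# Hypothetical time increments (seconds) the agent may add to the current time.
ACTIONS = [0, 10, 30, 60]
# State names taken from the three states described in Section 5.
STATES = ["optimal", "average", "critical"]


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply Eq. (4): Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


def choose_action(q, state, epsilon=0.1, rng=random):
    """Epsilon-greedy selection (an assumption; not stated in the paper)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


# Q-table with all entries initialised to 0.0.
q_table = defaultdict(float)

# One rewarded prediction followed by one unrewarded prediction.
first = q_update(q_table, "optimal", 30, reward=1.0, next_state="average")
second = q_update(q_table, "average", 10, reward=0.0, next_state="optimal")
print(first, second)
```

The first update moves Q("optimal", 30) from 0 to 0.1 because only the immediate reward contributes; the second update picks up the discounted value of the first through the max term, which is how an accurate prediction in one state propagates to the states that lead to it.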
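The three DeltaIoT quality requirements from Section 4 can also be written down as a simple check over per-cycle measurements. This is only a sketch under stated assumptions: the function name, the input format (one value per cycle, covering a 12-hour window), latency expressed in seconds, and the unspecified energy unit are all hypothetical; only the 10% packet-loss bound, the 5% latency bound and the 570-second cycle time come from the text.

```python
from statistics import mean

CYCLE_TIME_S = 570.0        # DeltaIoT cycle length stated in Section 4
MAX_PACKET_LOSS = 0.10      # average packet loss must not exceed 10%
MAX_LATENCY_FRACTION = 0.05  # average latency must not exceed 5% of the cycle


def check_quality(packet_loss_ratios, latencies_s, energies):
    """Evaluate the three requirements over per-cycle samples.

    Assumption: each list holds one measurement per cycle for the
    12-hour window; energy has no threshold and is only reported,
    since the requirement is to minimise it.
    """
    return {
        "packet_loss_ok": mean(packet_loss_ratios) <= MAX_PACKET_LOSS,
        "latency_ok": mean(latencies_s) <= MAX_LATENCY_FRACTION * CYCLE_TIME_S,
        "avg_energy": mean(energies),
    }


# Hypothetical measurements for two cycles.
report = check_quality([0.05, 0.12], [20.0, 30.0], [1.0, 3.0])
print(report)
```

With a 570-second cycle, the latency bound works out to 28.5 seconds, so the hypothetical sample above satisfies both thresholds even though one individual cycle exceeds the packet-loss bound, because the requirements are stated on averages.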