=Paper=
{{Paper
|id=Vol-2978/saml-paper3
|storemode=property
|title=A Microservices Architecture for Machine Learning Assisted Decision Support in a Real-Time Field Sensors Environment (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2978/saml-paper3.pdf
|volume=Vol-2978
|authors=Giovanni De Gasperis,Giuseppe Della Penna,Sante Dino Facchini
|dblpUrl=https://dblp.org/rec/conf/ecsa/GasperisPF21
}}
==A Microservices Architecture for Machine Learning Assisted Decision Support in a Real-Time Field Sensors Environment (short paper)==
Giovanni De Gasperis, Giuseppe Della Penna and Sante Dino Facchini

Università degli Studi dell’Aquila, Dipartimento di Ingegneria e Scienze dell’Informazione e Matematica, Via Vetoio, L’Aquila, 67100, Italy

ECSA2021 Companion Volume, Robert Heinrich, Raffaela Mirandola and Danny Weyns, Växjö, Sweden, 13-17 September 2021

Email: giovanni.degasperis@univaq.it (G. De Gasperis); giuseppe.dellapenna@univaq.it (G. Della Penna); santedino.facchini@student.univaq.it (S. Facchini)

URL: https://www.disim.univaq.it/main/home.php?users_username=giovanni.degasperis (G. De Gasperis); https://people.disim.univaq.it/~dellapenna (G. Della Penna)

ORCID: 0000-0001-9521-4711 (G. De Gasperis); 0000-0003-2327-9393 (G. Della Penna)

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===Abstract===
In this paper we describe the design and development of a real-world software system that integrates machine learning to augment a pre-existing remote surveillance framework. Machine learning was embedded as a service in the system, plugged in between back-end data flux handlers; the system has been redesigned following a microservices architecture to make it scalable and to allow a progressive adoption of the machine learning-powered assistance in the event management process. A case study of the application in an actual security company is analysed and discussed, where we show how this innovation helped human operators to better shield themselves from "information overload".

Keywords: Real-Time Critical Systems, Machine Learning, Big Data, Microservices

===1. Introduction===
In this paper we describe the design and development of a real-world software system that integrates big data analytics and machine learning into a pre-existing remote surveillance framework operated by a security company that monitors a number of sites through closed circuit and IP cameras, anti-theft sensors (e.g., volume and pressure sensors, door opening sensors, etc.) and also physical sensors (e.g., humidity and temperature).

Figure 1 shows a fragment of the process commonly followed to handle events and alarms coming from a surveillance network. When an alarm is received, the operators first check the surveillance videos. If such videos are not available or do not clearly show the event, the operator requests an on-site check from the security staff. This action and its outcome, as well as the outcomes of all the actions taken during the process, are stored in the system database. Then, if the event is in progress, the operator starts the true alarm handling process. Otherwise, if the notified event is not actually in progress, the operator must check for other alarms on the same site and, if any, restart the handling process for such new events. If no other site alarms are active, the operator must try to understand the reason for the notified alarm. If it is recognized as a false alarm, the case is simply closed. On the other hand, if the alarm is improper, i.e., it is due to a system anomaly, the operator starts an anomaly handling process.

Figure 1: Surveillance event handling process

The software adopted by the company to support such a process was a monolithic application that offered only basic functionalities such as collecting signals and data streams, presenting the events in a management console and saving them in a persistent database. Therefore, most of the operations described by the event management process above required a substantial amount of manual work by the control center operators.

While human intervention cannot be avoided in such a context, as in any security-related context, machine learning can be exploited to assist the operators in several steps of the process, leaving the humans with only the most critical steps to accomplish (see, e.g., [1, 2, 3, 4] for examples belonging to different surveillance contexts).

However, embedding machine learning in the company’s pre-existing software presented several challenges. First, we are modifying a production, real-time critical system, so we need to add such support gradually, in order to let the operators adapt to the new functionalities while verifying their reliability without interrupting the company services. Second, the closed, monolithic architecture of the company software described above makes any modification to the pre-existing process very complex and error-prone. It is also worth noting that such software, developed many years ago, was no longer adequate to meet the current high QoS levels or to comply with the latest safety regulations.

Therefore, we decided to rebuild the system from scratch, extracting only some relevant modules/algorithms from the old software and embedding them in the new release, to maintain some kind of "continuity". In particular, while having a completely redesigned core, the system has been designed to offer the basic functionalities of the previous one (to maintain the above service continuity) and extend them in order to

• acquire data and information on objects and events from multiple sources such as IoT devices, open systems and mobile services;
• combine and correlate the real-time big data streams in order to make the control center operators’ job easier and faster, in particular the event classification, the data storage and the log certification;
• offer decision support tools to assist operators in the real-time management of events.

All three extensions above could be based on simple (sometimes pre-existing) algorithms, but can also take advantage of machine learning. Thus, ML can be seen as a (plug-in) service in the overall system architecture. This consideration, together with a number of technical aspects concerning the modular structure of the application and its deployment environment, made us opt for a service-oriented architecture [5] as its development basis. In more detail, to be as lightweight as possible, we adopted a microservices architecture [6] to effectively split the complexity of the overall system into small specialized units, each with its REST API and containerized using Docker [7]. This allowed us to develop a scalable, versatile, and easily maintainable system, as opposed to the previous one. In particular, microservices allowed us to fragment the overall system functionality into a set of meaningful basic units that, apart from the known advantages in terms of scalability and maintainability, make it easier to progressively add ML support as a service (or set of services) interacting with the core system services to enhance their functionalities.

===2. System Architecture for a Reliable Event Handling===
At a macro level, the base requirements of the new system are the following:

1. acquire data, events and states from a heterogeneous set of sources spread over a wide geographical area and connected through a digital network;
2. aggregate such data streams in a configurable and scalable way, since the number and type of sources may vary;
3. correlate events through temporal and spatial logical rules;
4. visualize all the collected and elaborated information on an interactive web dashboard;
5. store all the data, as well as all the actions taken by the control center operators, in persistent storage;
6. provide a search engine with configurable queries to access such historical data.

Starting from such requirements, collected through interviews with the company staff and control center operators, we designed a microservice-based system architecture that aims to be easily testable, maintainable, and extensible. Figure 2 gives an overview of the developed system.

Figure 2: System architecture

The data is pushed into the system by a number of specialized source services (drawn in green in the figure): in the currently deployed platform, we provide services which support reading data from sensors using MQTT [8] and SNMP [9] as well as from specific proprietary sensors such as Papago [10]. These services satisfy requirement 1.

The overall system leverages three services that provide different "analysis horizons" on the event streams. The Prometheus [11] service (yellow), which wraps the corresponding software package, acts as the main data collector of the overall architecture, providing access to the events of the last three days gathered from all the sources and suitably aggregated (requirement 2). Prometheus is an open-source systems monitoring and alerting software, which provides a multi-dimensional data model with time series, and allows a variety of interactions with other third-party software components.

Next, the Redis [12] service (gray) wraps the well-known fast in-memory data store (supporting real-time data streams) and acts as a cache that gives the consumers fast access to the event information streams. It essentially takes the role of "live working memory" for the events and the corresponding management procedures, which are stored in the database until they reach the closing state. It is worth noting that the output of sensors like temperature and humidity, whose handling does not require human intervention (e.g., temperature sensor alarms are based on simple logic rules), is not stored in Redis.

Finally, the SQLite [13] service (gray) provides persistent storage for all the data flowing through Prometheus and Redis. This is mainly needed to later extract the evidence needed by the authorities (requirements 5 and 6).

To support requirement 4, the blue front-end services present the gathered and processed data in different formats, tailored to the specific needs of the different system users. In particular, three "IG Server" services handle the GUI for the operators, their supervisors and the system administrators. Such user interfaces present to the users the data streams coming from the sensors, read from the Redis service, as well as the data coming from Prometheus. A special "configurator" GUI is used to configure the system. The last two front-end services interact directly with Prometheus: the first is based on the Grafana software package [14], which provides advanced visualization and dashboards for the processed and aggregated data, whereas the alertmanager service pushes alerts automatically generated from Prometheus through PromQL queries directly into email and Telegram messages.

The red services are at the core of the architecture. In general, they read from both Prometheus and Redis and apply actions, possibly modifying the Redis data streams accordingly. In particular, gest_notifier is a critical module that notifies the supervisors (via text messages) about event escalation, i.e., alarms that are triggered by an event not being correctly and timely handled by the control center operators. On the other hand, the gest_control service manages the QoS by monitoring that the operator reaction times follow the company SLA, also generating alarms in case of inefficiencies (requirement 3). The gcounter service generates aggregate statistics from the Redis data and posts them back to Prometheus to support longer-term alarms (requirement 2). Finally, the gest_source service is the action actuator, i.e., it is called by the operator interfaces to actually apply the actions ("take charge", "start/pause/close workflow", etc.), executes them, and generates the action events that are stored back in Redis to log the handling process.

The data_analysis service, also drawn in red, is described in the next section. It provides ML support to the overall process, and in particular assists the operators by interacting with the graphical interfaces.

All the microservices above were developed in Python and containerized, to be easily deployable on the company’s infrastructure through an overall container deployment script that, in particular, takes into account the service inter-dependencies.

===3. Machine Learning Services Integration===
The event handling process, in the critical context where our system works, must be timely and effective. The architecture described in the previous section (Figure 2) has been designed to be reliable and fast, but the process (Figure 1) still includes a number of checks, calls, and lengthy actions that require a substantial amount of work from the operators.

Clearly, human intervention cannot be avoided when trying to solve potentially dangerous events. However, ML can help the operators in many ways, as noted in the introduction. Thus, we initially focused on an aspect that is well known to benefit from automatic reasoning: mitigating the effects of event flooding on the operators by pre-selecting or pre-classifying events. Indeed, an operator who manages events in the classical FIFO order may spend too much time on less significant events and delay the solution of the really critical ones. To avoid this, the Data Analysis service provides an adaptive event classification routine that prioritizes the events so that the ones that are considered more important, i.e., that may lead to real alarms, are presented first to the operators. Such a service stands between the gest_source and lgo_server services, learning from the operator actions and suitably modifying their data views in order to suggest the event classification.

Operators close each event handling process by labelling the events as true alarms, false alarms, inappropriate alarms (i.e., due to a system failure) or "other alarms" (typically due to test or maintenance). We focus on the true and false ones, since the other two types of events are a minority class that would be difficult and useless to consider in our context.

Thus, initially, we extracted a number of significant features from the events, relating them to the operator-assigned label. Such features include information such as the unique alarm ID, the alarm central where the event was generated, the related customer, the activated sensor name, the timestamp of the alarm, etc.

Then, we cleaned and refined these features to further focus on the information that seems to be more relevant for our classification task. As an example, we substituted the alarm timestamp, which conveys too much information, with the alarm weekday and the corresponding part of the day (morning, afternoon, evening, night). Moreover, we performed a K-means clustering [15] on the sensor names (suitably transformed into a numerical vector through a word embedding process), in order to extract an artificial "sensor type" feature. The significance of the selected features was validated by calculating the mutual information of each feature w.r.t. the classification label (see, e.g., [16] for an overview of mutual information applied to feature selection) on a set of 169,347 past events provided by the company.

Once the final 128 features were devised, we extracted an initial training dataset from the set of past events above. Unfortunately, the dataset was heavily imbalanced, since the false alarms far outnumbered the true ones [17]. Thus, we tried both a random undersampling of the majority class and the well-known Synthetic Minority Oversampling Technique (SMOTE, [18]) to rebalance it.

Finally, we built a deep neural network [19] with 128 input neurons (one for each feature), two hidden layers of 64 neurons each with ReLU [20] activation function, and a single output neuron with sigmoid activation function. We trained the network on our dataset in order to obtain the correct classification given the event features. The classifier validation showed that the dataset re-balanced through random undersampling achieves a better overall performance in this context, reaching an accuracy of 0.91, a recall of 0.93, and a precision of 0.95, with an F1-score of 0.92.

The trained network was then embedded in the Data Analysis service, where each new event is classified before being presented to the operators. However, since a wrong "false alarm" classification may always happen, we do not simply drop from the stream the events considered not harmful by the neural network. Rather, we extract the classification probability that can be read from its output neuron and use it to alter the priority value that is used to sort the events on the operator dashboards. In this way, a possibly false alarm will be handled later, but never dropped. After the event is handled, the correct, final classification given by the operator is sent back to the neural network to correct its predictions, if needed.

===4. Conclusions===
Thanks to its design and to the use of ML, the developed system meets the highest quality standards, in particular:

• it can acquire, aggregate and process data and information from a variety of IoT devices, which translates into a better service in terms of quality and flexibility;
• it guarantees high scalability and easy configurability;
• it is fully compliant with data privacy, integrity and security regulations.

The project requirements called for processing about 300,000 events a year with the current number of operators, and managing an alarm within at most 30 seconds, as set by the standard regulations. Currently the system has been deployed and is being tested by the company in a control room operating 24/7 on three eight-hour shifts, with two or three operators per shift. The staff is managing about 32,500 anti-theft and intrusion detection sensors and over 500 environmental IoT devices mainly targeted at precision agriculture. The network connects about 30 clients and its nodes are deployed on more than 600 different sources.

Our initial statistics show that the staff is now able to manage an average of 1,000 events per day, thus yielding a forecast of 365,000 events managed on a yearly basis, which doubles the performance achieved using the previous support software. The average time from the event arrival to its classification and taking charge is now 10 seconds, slightly better than the performance of the previous system but in a far more complex scenario.

Finally, the average event management time, including classification, site operations, police calls, alarm closing and archiving, has been halved, from 1,800 to 900 seconds (Figure 3 shows the current statistics generated by the Grafana service in the application), and the error ratio has been reduced almost to zero thanks to the ML priority classification system.

Figure 3: Average event management time

===Acknowledgments===
We would like to thank Mr. Andrea Perna, who implemented the first prototype of the machine learning model as part of his Master’s Thesis in Computer Science at our Department.

This work was funded by SPEE S.r.l. (https://www.spee.it), L’Aquila, Italy.

===References===
[1] A. Goyal, S. B. Anandamurthy, P. Dash, S. Acharya, D. Bathla, D. Hicks, A. Bhan, P. Ranjan, Automatic border surveillance using machine learning in remote video surveillance systems, in: T. Hitendra Sarma, V. Sankar, R. A. Shaik (Eds.), Emerging Trends in Electrical, Communications, and Information Technologies, Springer Singapore, Singapore, 2020, pp. 751–760.

[2] J. Albusac, J. Castro-Schez, L. Lopez-Lopez, D. Vallejo, L. Jimenez-Linares, A supervised learning approach to automate the acquisition of knowledge in surveillance systems, Signal Processing 89 (2009) 2400–2414. doi:10.1016/j.sigpro.2009.04.008. Special Section: Visual Information Analysis for Security.

[3] M. Elhoseny, Multi-object detection and tracking (modt) machine learning model for real-time video surveillance systems, Circuits, Systems, and Signal Processing 39 (2020) 611–630. doi:10.1007/s00034-019-01234-7.

[4] F. Opitz, K. Dästner, B. v. H. z. Roseneckh-Köhler, E. Schmid, Data analytics and machine learning in wide area surveillance systems, in: 2019 20th International Radar Symposium (IRS), 2019, pp. 1–10. doi:10.23919/IRS.2019.8768102.

[5] N. Niknejad, W. Ismail, I. Ghani, B. Nazari, M. Bahari, A. R. B. C. Hussin, Understanding service-oriented architecture (soa): A systematic literature review and directions for further investigation, Information Systems 91 (2020) 101491. URL: https://www.sciencedirect.com/science/article/pii/S0306437920300028. doi:10.1016/j.is.2020.101491.

[6] M. Fowler, Microservices: a definition of this new architectural term, 2014. URL: https://martinfowler.com/articles/microservices.html.

[7] D. Merkel, Docker: lightweight linux containers for consistent development and deployment, Linux journal 2014 (2014) 2.

[8] International Organization for Standardization, Information technology - message queuing telemetry transport (mqtt) v3.1.1 (iso/iec standard 20922:2016), 2016. URL: https://www.iso.org/standard/69466.html.

[9] D. Harrington, R. Presuhn, B. Wijnen, An architecture for describing simple network management protocol (snmp) management frameworks, 2002. doi:10.17487/RFC3411.

[10] Papouch, Papago sensor modules, 2021. URL: https://en.papouch.com/papago/.

[11] The Linux Foundation, many authors, Prometheus, 2021. URL: https://prometheus.io/.

[12] Redis Labs, S. Sanfilippo, Redis, 2021. URL: https://redis.io/.

[13] SQLite Consortium, many authors, SQLite, 2021. URL: https://www.sqlite.org/.

[14] Grafana Labs, many authors, Grafana: the open observability platform, 2021. URL: https://grafana.com/.

[15] A. K. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc., USA, 1988.

[16] I. Letteri, G. Della Penna, P. Caianiello, Feature selection strategies for http botnet traffic detection, in: IEEE Computer Society (Ed.), Proceedings of Workshop on Machine Learning for Cyber-Crime Investigation and Cybersecurity, Proceedings of 2019 IEEE European Symposium on Security and Privacy Workshops, 2019, pp. 202–210. doi:10.1109/EuroSPW.2019.00029.

[17] B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence 5 (2016) 221–232. doi:10.1007/s13748-016-0094-0.

[18] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357. URL: http://dx.doi.org/10.1613/jair.953. doi:10.1613/jair.953.

[19] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117. URL: https://www.sciencedirect.com/science/article/pii/S0893608014002135. doi:10.1016/j.neunet.2014.09.003.

[20] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
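As an illustration of the feature refinement described in Section 3, where the raw alarm timestamp is replaced by the alarm weekday and the part of the day, the following minimal Python sketch shows one possible mapping. The hour boundaries between morning, afternoon, evening and night are an assumption made only for this example, since the paper does not state them; the function name `refine_timestamp` is likewise hypothetical.

```python
from datetime import datetime
from typing import Tuple

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumed boundaries (not given in the paper): each entry is the first hour
# of a part of the day; hours before 06:00 count as "night".
PART_OF_DAY_STARTS = [(6, "morning"), (12, "afternoon"), (18, "evening")]


def refine_timestamp(ts: datetime) -> Tuple[str, str]:
    """Replace a full timestamp with the coarser (weekday, part of day) pair."""
    part = "night"
    for start_hour, name in PART_OF_DAY_STARTS:
        if ts.hour >= start_hour:
            part = name  # keep the latest boundary that has been passed
    # datetime.weekday() returns 0 for Monday, which matches the WEEKDAYS order.
    return WEEKDAYS[ts.weekday()], part


example = refine_timestamp(datetime(2021, 9, 13, 9, 30))
print(example)
```

The two resulting categorical values would still have to be encoded numerically (for instance one-hot) before being fed to a classifier; that encoding step is not shown here.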
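The random undersampling of the majority class mentioned in Section 3 can be sketched as follows. This is a generic, standard-library illustration of the technique, not the authors' code: the event representation (a feature vector paired with a 0/1 label) and the function name `random_undersample` are assumptions made for the example.

```python
import random
from typing import List, Tuple

# Assumed representation: (feature vector, label), 1 = true alarm, 0 = false alarm.
Event = Tuple[List[float], int]


def random_undersample(events: List[Event], seed: int = 0) -> List[Event]:
    """Randomly discard majority-class events until both classes have equal size."""
    rng = random.Random(seed)
    true_alarms = [e for e in events if e[1] == 1]
    false_alarms = [e for e in events if e[1] == 0]
    # The shorter list is the minority class, whichever label it carries.
    minority, majority = sorted((true_alarms, false_alarms), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # avoid leaving the two classes in separate blocks
    return balanced


# Toy imbalanced dataset: 90 false alarms and 10 true alarms.
events = [([float(i)], 0) for i in range(90)] + [([float(i)], 1) for i in range(10)]
balanced = random_undersample(events)
print(len(balanced), sum(label for _, label in balanced))
```

Undersampling discards data from the majority class, whereas SMOTE synthesizes new minority-class samples; the paper reports that, on its dataset, the undersampled variant gave the better overall classifier performance.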
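To make the shape of the classifier described in Section 3 concrete (128 inputs, two hidden layers of 64 ReLU neurons, one sigmoid output), here is a dependency-free sketch of its forward pass. The weights are random and untrained, so the output carries no meaning; the initialization range and all function names are assumptions for illustration only, and a real implementation would normally rely on a deep learning library.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """One row per output neuron; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def dense(layer: Matrix, inputs: List[float]) -> List[float]:
    """Weighted sum plus bias for every neuron (zip stops before the bias entry)."""
    return [sum(w * x for w, x in zip(row, inputs)) + row[-1] for row in layer]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def predict(layers: List[Matrix], features: List[float]) -> float:
    """Forward pass: 128 -> 64 (ReLU) -> 64 (ReLU) -> 1 (sigmoid)."""
    hidden1 = relu(dense(layers[0], features))
    hidden2 = relu(dense(layers[1], hidden1))
    return sigmoid(dense(layers[2], hidden2)[0])


rng = random.Random(42)
layers = [init_layer(128, 64, rng), init_layer(64, 64, rng), init_layer(64, 1, rng)]
features = [rng.random() for _ in range(128)]  # placeholder event features
probability = predict(layers, features)
print(probability)
```

The single sigmoid output is what makes it possible to read the result as a classification probability, which the Data Analysis service then uses to adjust event priority.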
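Section 3 states that events classified as probable false alarms are not dropped: the output probability only alters the priority used to sort events on the operator dashboards. The sketch below shows one way such a re-ordering could work. The `Event` fields, the `weight` parameter and the multiplicative adjustment rule are hypothetical, since the paper does not give the actual formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    event_id: str
    base_priority: float  # hypothetical priority assigned before classification
    p_true_alarm: float   # sigmoid output of the classifier, in [0, 1]


def adjusted_priority(event: Event, weight: float = 1.0) -> float:
    """Hypothetical rule: boost the base priority by the true-alarm probability."""
    return event.base_priority * (1.0 + weight * event.p_true_alarm)


def order_for_dashboard(events: List[Event]) -> List[Event]:
    """Sort by adjusted priority, highest first; no event is ever removed.

    Python's sort is stable, so events with equal priority keep their
    arrival (FIFO) order.
    """
    return sorted(events, key=adjusted_priority, reverse=True)


incoming = [
    Event("e1", 1.0, 0.05),  # probably a false alarm
    Event("e2", 1.0, 0.90),  # probably a true alarm
    Event("e3", 1.0, 0.40),
]
ordered = order_for_dashboard(incoming)
print([e.event_id for e in ordered])
```

Because the probability only shifts the ordering, a misclassified true alarm is delayed rather than lost, which matches the safety argument made in the paper.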