=Paper= {{Paper |id=Vol-2978/saml-paper3 |storemode=property |title=A Microservices Architecture for Machine Learning Assisted Decision Support in a Real-Time Field Sensors Environment (short paper) |pdfUrl=https://ceur-ws.org/Vol-2978/saml-paper3.pdf |volume=Vol-2978 |authors=Giovanni De Gasperis,Giuseppe Della Penna,Sante Dino Facchini |dblpUrl=https://dblp.org/rec/conf/ecsa/GasperisPF21 }} ==A Microservices Architecture for Machine Learning Assisted Decision Support in a Real-Time Field Sensors Environment (short paper)== https://ceur-ws.org/Vol-2978/saml-paper3.pdf
A Microservices Architecture for Machine Learning
Assisted Decision Support in a Real-Time Field Sensors
Environment
Giovanni De Gasperis¹, Giuseppe Della Penna¹ and Sante Dino Facchini¹
¹ Università degli Studi dell’Aquila, Dipartimento di Ingegneria e Scienze dell’Informazione e Matematica, Via Vetoio, L’Aquila, 67100, Italy


Abstract
In this paper we describe the design and development of a real-world software system that integrates machine learning to augment a pre-existing remote surveillance framework. Machine learning was embedded as a service in the system, plugged in between back-end data flow handlers; the system has been redesigned following a microservices architecture to make it scalable and to allow a progressive adoption of the machine learning-powered assistance in the event management process. A case study of the application in an actual security company is analysed and discussed, where we show how this innovation helped human operators to better shield themselves from "information overload".

Keywords
Real-Time Critical Systems, Machine Learning, Big Data, Microservices



ECSA2021 Companion Volume, Robert Heinrich, Raffaela Mirandola and Danny Weyns, Växjö, Sweden, 13-17 September 2021
Email: giovanni.degasperis@univaq.it (G. De Gasperis); giuseppe.dellapenna@univaq.it (G. Della Penna); santedino.facchini@student.univaq.it (S. Facchini)
URL: https://www.disim.univaq.it/main/home.php?users_username=giovanni.degasperis (G. De Gasperis); https://people.disim.univaq.it/~dellapenna (G. Della Penna)
ORCID: 0000-0001-9521-4711 (G. De Gasperis); 0000-0003-2327-9393 (G. Della Penna)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

In this paper we describe the design and development of a real-world software system that integrates big data analytics and machine learning into a pre-existing remote surveillance framework operated by a security company that monitors a number of sites through closed circuit and IP cameras, anti-theft sensors (e.g., volume and pressure sensors, door opening sensors, etc.) and also physical sensors (e.g., humidity and temperature).

Figure 1 shows a fragment of the process commonly followed to handle events and alarms coming from a surveillance network. When an alarm is received, the operators first check the surveillance videos. If such videos are not available or do not clearly show the event, the operator requests an on-site check by the security staff. This action and its outcome, as well as the outcomes of all the other actions taken during the process, are stored in the system database. Then, if the event is in progress, the operator starts the true alarm handling process. Otherwise, if the notified event is not actually in progress, the operator must check for other alarms on the same site and, if any, restart the handling process for such new events. If no other site alarms are active, the operator must try to understand the reason for the notified alarm. If it is recognized as a false alarm, the case is simply closed. On the other hand, if the alarm is improper, i.e., it is due to a system anomaly, the operator starts an anomaly handling process.

Figure 1: Surveillance event handling process

The software adopted by the company to support such a process was a monolithic application that offered only basic functionalities such as collecting signals and data streams, presenting the events in a management console and saving them in a persistent database. Therefore, most of the operations described by the event management process above required a substantial amount of manual work by the control center operators.

While human intervention cannot be avoided in such a context, as in any security-related context, machine learning can be exploited to assist the operators in several steps of the process, leaving the humans with only the most critical steps to accomplish (see, e.g., [1, 2, 3, 4] for examples belonging to different surveillance contexts).

However, embedding machine learning in the company's pre-existing software presented several challenges. First, we are modifying a production, real-time critical system, so we need to add such support gradually, in order to let the operators adapt to the new functionalities while verifying their reliability without interrupting the company services. Second, the closed, monolithic architecture of the company software described above makes any modification to the pre-existing process very complex and error-prone. It is also worth noting that this software, developed many years ago, was already inadequate to achieve the current high QoS levels and to comply with the latest safety regulations.

Therefore, we decided to rebuild the system from scratch, extracting only some relevant modules/algorithms from the old software in order to embed them in



the new release, to maintain some kind of "continuity". In particular, while having a completely redesigned core, the system has been designed to offer the basic functionalities of the previous one (to maintain the above service continuity) and extend them in order to

     • acquire data and information on objects and events from multiple sources such as IoT devices, open systems and mobile services;
     • combine and correlate the real-time big data streams in order to make the control center operators' job easier and faster, in particular the event classification, the data storage and the log certification;
     • offer decision support tools to assist operators in the real-time management of events.

All three extensions above could be based on simple (sometimes pre-existing) algorithms, but can also take advantage of machine learning. Thus, ML can be seen as a (plug-in) service in the overall system architecture. This consideration, together with a number of technical aspects concerning the modular structure of the application and its deployment environment, made us opt for
a service oriented architecture [5] as its development basis. In more detail, to be as lightweight as possible, we adopted a microservices architecture [6] to effectively split the complexity of the overall system into small specialized units, each with its own REST API and containerized using Docker [7]. This allowed us to develop a scalable, versatile, and easily maintainable system, as opposed to the previous one. In particular, microservices allowed us to split the overall system functionality into a set of meaningful basic units that, apart from the known advantages in terms of scalability and maintainability, ease the progressive addition of ML support as a service (or set of services) interacting with the core system services to enhance their functionalities.

2. System Architecture for a Reliable Event Handling

At a macro level, the base requirements of the new system are the following:

    1. acquire data, events and states from a heterogeneous set of sources spread over a wide geographical area and connected through a digital network;
    2. aggregate such data streams in a configurable and scalable way, since the number and type of sources may vary;
    3. correlate events through time and space logical rules;
    4. visualize all the collected and processed information on an interactive web dashboard;
    5. store all the data, as well as all the actions taken by the control center operators, in persistent memory;
    6. provide a search engine with configurable queries to access such historical data.

Starting from such requirements, collected through interviews with the company staff and control center operators, we designed a microservice-based system architecture that aims to be easily testable, maintainable, and extensible. Figure 2 gives an overview of the developed system.

Figure 2: System architecture

The data is pushed into the system by a number of specialized source services (drawn in green in the figure): in the currently deployed platform, we provide services which support reading data from sensors using MQTT [8] and SNMP [9] as well as from specific proprietary sensors such as Papago [10]. These services satisfy requirement 1.

The overall system relies on three services that provide different "analysis horizons" on the event streams: the Prometheus [11] service (yellow), which wraps the corresponding software package, acts as the main data collector of the overall architecture, providing access to the events of the last three days gathered from all the sources and suitably aggregated (requirement 2). Prometheus is an open-source systems monitoring and alerting software, which provides a multi-dimensional data model with time series, and allows a variety of interactions with other third-party software components.

Next, there is the Redis [12] (gray) service that wraps the well-known fast in-memory data store (supporting real-time data streams), which acts as a cache memory to give the consumers fast access to the event information streams. It essentially takes the role of "live working memory" for the events and the corresponding management procedures, which are stored in the database until they reach the closing state. It is worth noting that the output of sensors like temperature and humidity, whose handling does not require human intervention (e.g., temperature sensor alarms are based on simple logic rules), is not stored in Redis.

Finally, the SQLite [13] service (gray) provides persistent storage for all the data flowing through Prometheus and Redis. This is mainly needed to later extract the evidence required by the authorities (requirements 5 and 6).

To support requirement 4, the blue front-end services present the gathered and processed data in different formats, tailored for the specific needs of the different system users. In particular, three "IG Server" services handle the GUI for the operators, their supervisors and the system administrators. Such user interfaces present to the users the data streams coming from the sensors, read from the Redis service, as well as the data coming from Prometheus. A special "configurator" GUI is used to configure the system. The last two front-end services directly interact with Prometheus: the first is based on the Grafana software package [14], which provides advanced visualization and dashboards for the processed and aggregated data, whereas the alertmanager service pushes alerts automatically generated from Prometheus through PromQL queries directly into email and Telegram messages.

Finally, the red services are at the core of the architecture. In general, they read from both Prometheus and Redis and apply actions, possibly modifying the Redis data streams accordingly. In particular, gest_notifier is a critical module that notifies the supervisors (via text messages) about event escalation, i.e., alarms that are triggered by an event not being handled correctly and in time by the control center operators. On the other hand, the gest_control service manages the QoS by monitoring that the operator reaction times follow the company SLA, also generating alarms in case of inefficiencies (requirement 3). The gcounter service generates aggregate statistics from the Redis data and posts them back to Prometheus to support longer-term alarms (requirement 2). Finally, the gest_source service is the action actuator, i.e., it is



called by the operator interfaces to actually apply the actions ("take charge", "start/pause/close workflow", etc.), executes them, and generates the action events that are stored back in Redis to log the handling process.

The data_analysis service, also drawn in red, is described in the next section. It provides ML support to the overall process, and in particular assists the operators by interacting with the graphical interfaces.

All the microservices above were developed in Python and containerized, to be easily deployable on the company’s infrastructure through an overall container deployment script that, in particular, takes into account the service inter-dependencies.

3. Machine Learning Services Integration

The event handling process, in the critical context where our system works, must be timely and effective. The architecture described in the previous section (Figure 2) has been designed to be reliable and fast, but the process (Figure 1) still includes a number of checks, calls, and lengthy actions that require a substantial amount of work from the operators.

Clearly, human intervention cannot be avoided when trying to solve potentially dangerous events. However, ML can help the operators in many ways, as discussed in the introduction. Thus, we initially focused on an aspect that is well known to benefit from automatic reasoning: mitigating the effects of event flooding on the operators by pre-selecting or pre-classifying events.

Indeed, an operator who manages events in the classical FIFO order may spend too much time on less significant events and delay the solution of the really critical ones. To avoid this, the Data Analysis service provides an adaptive event classification routine that prioritizes the events so that the ones that are considered more important, i.e., that may lead to real alarms, are presented first to the operators. Such a service stands between the gest_source and lgo_server services, learning from the operator actions and suitably modifying the operator's data views in order to suggest the event classification.

Operators close each event handling process by labelling the events as true alarms, false alarms, inappropriate alarms (i.e., due to a system failure) or "other alarms" (typically due to test or maintenance). We focus on the true and false ones, since the other two types of events
are a minority class that would be difficult and useless to consider in our context.

Thus, initially, we extracted a number of significant features from the events, relating them to the operator-assigned label. Such features include information such as the unique alarm ID, the alarm central where the event was generated, the related customer, the activated sensor name, the timestamp of the alarm, etc.

Then, we cleaned and refined these features to further focus on the information that seems to be more relevant for our classification task. As an example, we substituted the alarm timestamp, which conveys too much information, with the alarm weekday and the corresponding part of the day (morning, afternoon, evening, night). Moreover, we performed a K-means clustering [15] on the sensor names (suitably transformed into numerical vectors through a word embedding process), in order to extract an artificial "sensor type" feature. The significance of the selected features was validated by calculating the mutual information of each feature w.r.t. the classification label (see, e.g., [16] for an overview of mutual information applied to feature selection) on a set of 169,347 past events provided by the company.

Once the final 128 features were devised, we extracted an initial training dataset from the set of past events above. Unfortunately, the dataset was heavily imbalanced, since the false alarms far outnumbered the true ones [17]. Thus, we tried both a random undersampling of the majority class and the well-known Synthetic Minority Oversampling Technique (SMOTE, [18]) to rebalance it.

Finally, we built a deep neural network [19] with 128 input neurons (one for each feature), two hidden layers of 64 neurons each with ReLU [20] activation function, and a single output neuron with sigmoid activation function. We trained the network on our dataset in order to obtain the correct classification given the event features. The classifier validation showed that the dataset rebalanced through random undersampling achieves a better overall performance in this context, reaching an accuracy of 0.91, a recall of 0.93, and a precision of 0.95, with an F1-Score of 0.92.

The trained network was then embedded in the Data Analysis service, where each new event is classified before being presented to the operators. However, since a wrong "false alarm" classification may always happen, we do not simply drop from the stream the events that the neural network considers harmless. Instead, we extract the classification probability that can be read from its output neuron and use it to alter the priority value that is used to sort the events on the operator dashboards. In this way, a possibly false alarm will be handled later, but never dropped. After the event is handled, the correct, final classification given by the operator is sent back to the neural network to fix its predictions, if needed.

4. Conclusions

Thanks to its design and to the use of ML, the developed system meets high quality standards, in particular:

     • it can acquire, aggregate and process data and information from a variety of IoT devices, which means offering a better service in terms of quality and flexibility;
     • it guarantees high scalability and easy configurability;
     • it is fully compliant with data privacy, integrity and security regulations.

The project requirements called for processing about 300,000 events a year with the current number of operators, and for managing an alarm within at most 30 seconds, as set by the standard regulations. Currently the system has been deployed and is being tested by the company in a control room operating 24/7 on three shifts of eight hours each, with two or three operators per shift. The staff is managing about 32,500 anti-theft and intrusion detection sensors and over 500 environmental IoT devices mainly targeted to precision agriculture. The network connects about 30 clients and its nodes are deployed on more than 600 different sources.

Our initial statistics show that the staff is now able to manage an average of 1,000 events per day, which yields a forecast of 365,000 events managed per year and doubles the throughput achieved using the previous support software. The average time from the event arrival to its classification and taking charge is now 10 seconds, slightly better than the performance of the previous system but in a far more complex scenario.

Finally, the average event management time, including classification, site operations, police calls, alarm closing and archiving, has been halved, from 1,800 to 900 seconds (Figure 3 shows the current statistics generated by the Grafana service in the application), and the error ratio has been reduced almost to zero thanks to the ML priority classification system.

Figure 3: Average event management time

Acknowledgments

We would like to thank Mr. Andrea Perna, who implemented the first prototype of the machine learning model as part of his Master’s Thesis in Computer Science at our Department.

This work was funded by SPEE S.r.l. (https://www.spee.it) in L’Aquila, Italy.
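As an illustration of the timestamp substitution described in Section 3, the following minimal Python sketch maps a raw alarm timestamp to the weekday and part-of-day features. The function name, the dictionary keys and, in particular, the hour boundaries of the four day parts are assumptions made for this example; the paper names the day parts but does not state their limits.

```python
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Hypothetical hour boundaries (start inclusive, end exclusive).
DAY_PARTS = (
    (6, 12, "morning"),
    (12, 18, "afternoon"),
    (18, 23, "evening"),
)


def timestamp_features(ts: datetime) -> dict:
    """Replace a raw alarm timestamp with two coarse categorical features."""
    part = "night"  # default: any hour not covered by DAY_PARTS
    for start, end, name in DAY_PARTS:
        if start <= ts.hour < end:
            part = name
            break
    return {"weekday": WEEKDAYS[ts.weekday()], "day_part": part}
```

Coarse categories of this kind keep the weekly and daily rhythm of the alarms while discarding the exact instant, which is unlikely to repeat in future events.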

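The random undersampling of the majority class mentioned in Section 3 can be sketched as follows. This is a generic illustration of the technique using only the Python standard library, not the code used in the deployed system; the event representation (a list of dictionaries with a "label" key) is an assumption.

```python
import random


def undersample(events, label_key="label", seed=0):
    """Randomly drop events of the larger classes until all classes are equal in size."""
    by_label = {}
    for event in events:
        by_label.setdefault(event[label_key], []).append(event)

    # Every class is reduced to the size of the smallest one.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced
```

With 90 false alarms and 10 true alarms, for instance, the function returns 20 events, 10 per class. Unlike SMOTE, no synthetic events are created, at the cost of discarding part of the majority class.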

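The network layout reported in Section 3 (128 inputs, two hidden layers of 64 ReLU neurons, one sigmoid output) and the probability-based ordering of events can be sketched in plain Python as below. The weights are random and untrained, all function and field names are hypothetical, and using the output probability directly as the sort key is a simplification: the paper only states that the probability is used to alter the priority value.

```python
import math
import random

LAYER_SIZES = [128, 64, 64, 1]  # layout reported in Section 3


def init_layers(seed=0):
    """Create random, untrained weights; a deployed service would load trained ones."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def true_alarm_probability(features, layers):
    """Forward pass: ReLU on the hidden layers, sigmoid on the single output."""
    activations = features
    last = len(layers) - 1
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        if index < last:
            activations = [max(0.0, s) for s in sums]
        else:
            activations = [1.0 / (1.0 + math.exp(-s)) for s in sums]
    return activations[0]


def sort_for_dashboard(events, layers):
    """Order events by descending probability; no event is ever dropped."""
    return sorted(
        events,
        key=lambda e: true_alarm_probability(e["features"], layers),
        reverse=True,
    )
```

Sorting instead of filtering reflects the design choice described in Section 3: an event that the classifier scores as a probable false alarm moves down the list but still reaches an operator.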

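The reaction-time monitoring attributed to the gest_control service in Section 2 can be illustrated with a small check against the 30-second limit quoted in the Conclusions. This is only a sketch: the event fields ("id", "arrived_at", "taken_at") and the function name are hypothetical, and treating the 30-second regulatory limit as the SLA threshold is an assumption, since the paper does not describe how the service is implemented.

```python
from datetime import datetime, timedelta

SLA_LIMIT = timedelta(seconds=30)  # assumed threshold, taken from the 30-second limit


def sla_violations(events, now):
    """Return the IDs of events not taken in charge within SLA_LIMIT."""
    late = []
    for event in events:
        taken = event.get("taken_at")
        # Events still waiting are measured against the current time.
        reference = taken if taken is not None else now
        if reference - event["arrived_at"] > SLA_LIMIT:
            late.append(event["id"])
    return late
```

In a setup like the one described, the returned IDs could feed the escalation notifications sent to supervisors; how the actual services exchange this information is not specified in the paper.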
References

[1] A. Goyal, S. B. Anandamurthy, P. Dash, S. Acharya, D. Bathla, D. Hicks, A. Bhan, P. Ranjan, Automatic border surveillance using machine learning in remote video surveillance systems, in: T. Hitendra Sarma, V. Sankar, R. A. Shaik (Eds.), Emerging Trends in Electrical, Communications, and Information Technologies, Springer Singapore, Singapore, 2020, pp. 751–760.
[2] J. Albusac, J. Castro-Schez, L. Lopez-Lopez, D. Vallejo, L. Jimenez-Linares, A supervised learning approach to automate the acquisition of knowledge in surveillance systems, Signal Processing 89 (2009) 2400–2414. doi:10.1016/j.sigpro.2009.04.008, special Section: Visual Information Analysis for Security.
[3] M. Elhoseny, Multi-object detection and tracking (MODT) machine learning model for real-time video surveillance systems, Circuits, Systems, and Signal Processing 39 (2020) 611–630. doi:10.1007/s00034-019-01234-7.
[4] F. Opitz, K. Dästner, B. v. H. z. Roseneckh-Köhler, E. Schmid, Data analytics and machine learning in wide area surveillance systems, in: 2019 20th International Radar Symposium (IRS), 2019, pp. 1–10. doi:10.23919/IRS.2019.8768102.
[5] N. Niknejad, W. Ismail, I. Ghani, B. Nazari, M. Bahari, A. R. B. C. Hussin, Understanding service-oriented architecture (SOA): A systematic literature review and directions for further investigation, Information Systems 91 (2020) 101491. URL: https://www.sciencedirect.com/science/article/pii/S0306437920300028. doi:10.1016/j.is.2020.101491.
[6] M. Fowler, Microservices: a definition of this new architectural term, 2014. URL: https://martinfowler.com/articles/microservices.html.
[7] D. Merkel, Docker: lightweight Linux containers for consistent development and deployment, Linux Journal 2014 (2014) 2.
[8] International Organization for Standardization, Information technology — Message Queuing Telemetry Transport (MQTT) v3.1.1 (ISO/IEC standard 20922:2016), 2016. URL: https://www.iso.org/standard/69466.html.
[9] D. Harrington, R. Presuhn, B. Wijnen, An architecture for describing Simple Network Management Protocol (SNMP) management frameworks, 2002. doi:10.17487/RFC3411.
[10] Papouch, Papago sensor modules, 2021. URL: https://en.papouch.com/papago/.
[11] The Linux Foundation, many authors, Prometheus, 2021. URL: https://prometheus.io/.
[12] Redis Labs, S. Sanfilippo, Redis, 2021. URL: https://redis.io/.
[13] SQLite Consortium, many authors, SQLite, 2021. URL: https://www.sqlite.org/.
[14] Grafana Labs, many authors, Grafana: the open observability platform, 2021. URL: https://grafana.com/.
[15] A. K. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc., USA, 1988.
[16] I. Letteri, G. Della Penna, P. Caianiello, Feature selection strategies for HTTP botnet traffic detection, in: IEEE Computer Society (Ed.), Proceedings of Workshop on Machine Learning for Cyber-Crime Investigation and Cybersecurity, Proceedings of 2019 IEEE European Symposium on Security and Privacy Workshops, 2019, pp. 202–210. doi:10.1109/EuroSPW.2019.00029.
[17] B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence 5 (2016) 221–232. doi:10.1007/s13748-016-0094-0.
[18] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357. URL: http://dx.doi.org/10.1613/jair.953. doi:10.1613/jair.953.
[19] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117. URL: https://www.sciencedirect.com/science/article/pii/S0893608014002135. doi:10.1016/j.neunet.2014.09.003.
[20] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.