Edge-Based Video Surveillance with Embedded Devices

Discussion Paper

Hanna Kavalionak1, Claudio Gennaro1, Giuseppe Amato1, Claudio Vairo1, Costantino Perciante2, Carlo Meghini1, Fabrizio Falchi1, and Fausto Rabitti1

1 ISTI-CNR, via G. Moruzzi 1, 56124 Pisa, Italy, name.surname@isti.cnr.it
2 Fluidmesh Networks, via Carlo Farini 5, Milano, Italy, costantino.perciante@fluidmesh.com



          Abstract. Video surveillance systems have become indispensable tools
          for the security and organization of public and private areas. In this work,
          we propose a novel distributed protocol for an edge-based face recogni-
          tion system that takes advantage of the computational capabilities of
          the surveillance devices (i.e., cameras) to perform person recognition.
          The cameras fall back to a centralized server if their hardware capabili-
          ties are not sufficient to perform the recognition. We evaluate the proposed algorithm via extensive experiments on a freely available dataset. As a prototype of a surveillance embedded device, we considered a Raspberry Pi with its camera module. Using simulations, we show that our algorithm can reduce the load on the server by up to 50%, with no negative impact on the quality of the surveillance service.

          Keywords: Edge Computing · Distributed Architectures · Internet of Things · Video Surveillance · Embedded Devices.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.




1        Introduction

Video surveillance is of paramount importance in areas such as law enforcement, the military, and commercial environments. A straightforward approach to video surveillance is the client-server communication model: the surveillance devices stream the video directly to a powerful main server, where the data can be displayed to the human operators responsible for analyzing the video [8]. Human resources used in the field of video surveillance services are both costly and unreliable. Moreover, this approach has several drawbacks, such as creating a bottleneck for system security and reliability and requiring a large and costly server infrastructure dedicated solely to the surveillance task. Hence, with the recent advances in smart technologies, automated video surveillance, in which the video streams are analyzed automatically by network edge devices, has gained considerable interest [13, 11].
    The idea of exploiting the computation capabilities and the topology of a distributed network of smart cameras to reduce the amount of messaging has been proposed in [14] and [15]. In both approaches, the communication is efficiently handled using a task-oriented node clustering that partitions the network into different groups according to the pathway among cameras. The former work, however, is limited to face tracking, while the latter targets collaborative person tracking with a combination of hardware acceleration and middleware. Both works focus on people tracking rather than recognition and use an efficient camera clustering protocol to dynamically form groups of cameras for in-network tracking of individual persons.
    This work summarizes the contribution presented in [7]. The motivating scenario that we consider for this work is that of face recognition [10, 3, 1, 4]. The surveillance system should be able to recognize faces and track their movements in a possibly large area (e.g., an airport, a city district, a campus, etc.), using various cameras installed for this purpose. Real-time video surveillance requires significant storage and processing resources. This work aims to analyze the issues and solutions related to resource allocation for executing automated edge-based video surveillance in a fully distributed environment [2, 12]. In particular, we suppose that the (smart) camera devices themselves can cooperate to execute the needed face detection, recognition, and visual analysis tasks. In this work, we study how the image recognition algorithms can be orchestrated across several devices so that bottlenecks are reduced and resources can be managed more effectively. This work extends and enhances our previously published research work [6], where we proposed a distributed algorithm for load balancing between Smart Sensing Units for the video surveillance task. The adaptive algorithm distributes, at run time, the recognition tasks between the resources of surveillance devices and servers. The detection and recognition tasks are executed locally by surveillance devices, which exploit both the spatial and temporal topology of the moving people to cache and reuse parts of the classification features locally. Only when devices are not able to execute the recognition task with the cache is a recognition request sent to the server. We extend the previously proposed adaptive algorithm by considering the overhead of real classification techniques.
     The rest of the paper is structured as follows. Section 2 provides the defini-
tion of the system model and problem statement, whereas Section 3 presents the
algorithm for the adaptive camera-assisted person recognition. Section 4 evalu-
ates the distributed surveillance solution through simulations. Finally, Section 5
concludes the paper.


2   System model and problem statement

In our work, we model the geographical area, where the surveillance system
is deployed, as divided into sectors. A set of Smart Sensing Units (SSUs) are




                               Fig. 1: System model



distributed among the sectors to execute the video surveillance task (Figure
1). Such units have various sensing, computing, storage, and communication
capabilities. An SSU is a logical unit: it can be a smart camera with onboard sensing, computing, and storage resources, or it can be composed of a camera connected to a computer. Moreover, we also assume the existence of a Main Server (MS) unit in the network. The main server is a stand-alone SSU characterized by (theoretically) infinite computing, storage, and communication capabilities. It also plays the role of the central server to which the devices send their requests when operating in client-server mode.
     In our work, we distinguish two possible scenarios: basic and enhanced. In the
basic scenario, all SSUs are connected to the central MS and their functionality
is limited to video observation and data streaming to the MS. In the enhanced
scenario, on the other hand, all SSUs are connected in a structured peer-to-
peer topology. SSUs can cooperate to execute complex tasks and jointly build a rich representation of the context where the infrastructure is deployed. We consider the peer-to-peer overlay connecting the SSUs to be fixed: each SSU can directly contact the MS and its geographically neighboring SSUs. We assume the communication between SSUs is reliable, i.e., no message is ever lost or corrupted. In the enhanced scenario, each SSU has computational resources and local storage memory, where it keeps a recognition library, which is a collection of recognizers, each specialized in recognizing a certain person. The MS's recognition library contains the recognizers for all the recognizable persons in the surveillance area, while the recognition library of an SSU is, basically, a local cache that keeps only a subset of all recognizable persons.
    As an illustrative example, let us consider the following scenario, taking place in a geographical area (the surveilled area) of appropriate size, where a number of SSUs have been deployed for video surveillance purposes. Consider a person moving in this area: the system should visually recognize the person in order to assess their presence in the area. One straightforward solution for area video surveillance is to stream all the data from the SSUs to the MS for subsequent processing and storage. This approach, however, raises some additional system issues. First, the interested organization needs to have a powerful server system devoted specifically to the surveillance needs.


    Algorithm 1: Active thread algorithm executed by SSU
     repeat
         recognizers.increaseAge();
         events ← getVisibleFaces();
         selection ← selectRecognizers(recognizers);
         foreach face in events do
             if alarm ← getStatus(face, selection) then
                 send(FACEID, face, alarm, view);
                 startAlarmActivity();
                 return;
             if recognized ← getStatus(face, selection) then
                 send(FACEID, face, recognized, view);
                 return;
             if null ← getStatus(face, selection) then
                 tδ ← setRecognitionTimeout();
                 reclBuffer.add(face, tδ);
                 send(ALARMREQUEST, face, null, view);
                 return;

         wait ∆T;
     until;




This server system needs to be powerful enough to continuously process the video streaming data coming from the SSUs. The second problem concerns the delays in person recognition caused by possible server overload and network bandwidth overhead when the rate of recognition requests is high.
    In order to tackle these issues, our solution exploits a distributed architecture for face recognition, where the detection and recognition tasks are moved to the SSUs whenever possible. We performed simulations in order to move as much computation as possible onto the cameras themselves, and we used optimized distributed computation strategies to solve the most resource-demanding tasks.


3     Algorithm

In this section, we describe our enhanced algorithm for distributing the recog-
nition processes on the SSUs. Each camera in the system keeps a list of its neighbor SSUs, which we call its view. In order to improve the efficiency of the recognition process, each camera locally keeps a recognition library that contains a subset of all face recognizers. Using these local recognizers, each camera in the network tries to recognize the face of a detected person locally and with the help of neighbor SSUs, without involving the MS, thus reducing the computational and bandwidth load on the MS. The size of the library is limited to recognizersLimit according to the available storage space of an SSU.
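To make the cache management more concrete, the following Python sketch shows one possible bounded, age-based recognition library. The class and method names and the oldest-first eviction policy are our own illustrative assumptions; the paper itself only fixes the recognizersLimit bound and the age attribute used when building the selection.

from dataclasses import dataclass

@dataclass
class Recognizer:
    person_id: str
    features: object   # classification features for one person
    age: int = 0       # cycles since this recognizer was last refreshed

class RecognitionLibrary:
    """Illustrative bounded cache of recognizers kept by an SSU."""

    def __init__(self, recognizers_limit: int):
        self.recognizers_limit = recognizers_limit
        self.entries = {}  # person_id -> Recognizer

    def increase_age(self):
        # Called once per active-thread cycle (every ΔT).
        for rec in self.entries.values():
            rec.age += 1

    def integrate(self, person_id, features):
        # Insert or refresh a recognizer; evict the oldest entry when full.
        if person_id not in self.entries and len(self.entries) >= self.recognizers_limit:
            oldest = max(self.entries.values(), key=lambda r: r.age)
            del self.entries[oldest.person_id]
        self.entries[person_id] = Recognizer(person_id, features)

    def select_recognizers(self, k: int):
        # The k recognizers with the lowest age (the `selection` of Algorithm 1).
        return sorted(self.entries.values(), key=lambda r: r.age)[:k]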
    Three threads are executed by each SSU: (1) the active thread (Algorithm 1) is responsible for actively monitoring the area, (2) the time-out thread handles the timeout events in the face recognition by neighbors, and (3) the passive thread (Algorithm 2) processes the incoming messages received by an entity.
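As an illustration of how these threads could coexist on a device, the following sketch (purely our own wiring; the three callbacks stand in for Algorithm 1, the time-out handling, and Algorithm 2) spawns them as daemon threads:

import threading
import time

def run_ssu(active_step, check_timeouts, handle_message,
            delta_t: float = 1.0, poll: float = 0.1):
    """Illustrative wiring of the three SSU threads."""

    def active_thread():      # one cycle of Algorithm 1 every ΔT
        while True:
            active_step()
            time.sleep(delta_t)

    def timeout_thread():     # escalate expired faces to the MS
        while True:
            check_timeouts()
            time.sleep(poll)

    def passive_thread():     # blocking receive + Algorithm 2 dispatch
        while True:
            handle_message()

    for target in (active_thread, timeout_thread, passive_thread):
        threading.Thread(target=target, daemon=True).start()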


    Algorithm 2: Passive thread algorithm executed by SSU
     on event msg (type, face, status, sender) receive do
         if type == alarmrequest then
             result ← getStatus(face, selection);
             send(alarmreply, face, result, sender);
         if type == alarmreply then
             if sender == MS then
                 if status == alarm then
                     startAlarmActivity();
                 integrate(face, recognizers);
                 send(faceid, face, status, view);
             else
                 reclBuffer.getFace(face).counter++;
                 if reclBuffer.contains(face) and status == null then
                     if reclBuffer.getFace(face).counter ≥ view.size then
                         send(alarmrequest, face, null, MS);
                         reclBuffer.remove(face);
                 else
                     if status == alarm then
                         startAlarmActivity();
                     integrate(face, recognizers);
                     reclBuffer.remove(face);
                     send(faceid, face, status, view);

         if type == faceid then
             integrate(face, recognizers);




    Every ∆T each SSU executes the active thread (Algorithm 1). First, it increases the "age" of all known recognizers in the library recognizers. Then, it processes the current surveillance area image and extracts the facial features of the detected persons. Finally, it selects the subset selection of recognizers with the lowest age parameter from the local collection.
    For each face in events an SSU executes the recognition algorithm. In case face corresponds to an entry in recognizers with alarm status, the SSU (1) sends a notification message of type faceid with the face features and the alarm tag to the neighbor SSUs, and then (2) starts a predefined alarm activity, for example, video recording and live streaming to the operator displays.
    In case a face is recognized, a faceid notification is distributed to the neighbor SSUs. If there is no correspondence in selection, the camera establishes a recognition waiting time and adds the face features to the reclassification buffer reclBuffer. It also sends the recognition request alarmrequest to the neighbor SSUs. When the recognition waiting time has passed, if the object is still in the reclBuffer, the camera sends the classification request alarmrequest to the MS and removes the features from the reclBuffer.
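The time-out thread is described only in prose. A minimal sketch of its behavior, assuming reclBuffer is a dictionary keyed by a face identifier and storing the face features together with the deadline tδ (both assumptions of ours), could look as follows:

import time

def timeout_thread(recl_buffer, send_to_ms, poll: float = 0.1):
    # Illustrative time-out thread of an SSU: any face whose recognition
    # waiting time has expired is escalated to the Main Server (MS).
    while True:
        now = time.time()
        for face_id, (face, deadline) in list(recl_buffer.items()):
            if now >= deadline:
                send_to_ms("alarmrequest", face)  # fall back to the MS
                del recl_buffer[face_id]          # drop it from the buffer
        time.sleep(poll)                          # polling granularity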
    When an SSU receives a message, it is processed according to Algorithm 2. We consider three types of messages: alarmrequest, alarmreply and faceid. The alarmrequest message contains the request for a face classification. The recipient checks the status of the received face in the local subset of its recognizers list and sends an alarmreply containing the recognition result to the sender. The faceid message is used to broadcast face features to the neighbor SSUs. Hence, whenever a camera receives this message, it integrates the received face features into its local recognizers library.
    The alarmreply message corresponds to the reply to a recognition request. If it comes from the MS, the recipient integrates the recognized face and its status into the local recognizers and sends the face notification message faceid to its neighbors view. Moreover, in case the received status is alarm, the camera initiates a predefined alarm activity. Each SSU keeps track of the replies of all the neighbor SSUs for each face in the reclBuffer. If no neighbor SSU is able to recognize the current face, the camera sends the recognition request alarmrequest to the MS and removes the face from the reclassification buffer reclBuffer. If, on the other hand, the received status is not null, it means the face has been recognized by the sender, and the camera integrates it into the local recognizers, removes the face from the reclassification buffer and sends the face notification message faceid to its neighbor SSUs. In case the received status is alarm, the camera also initiates a predefined alarm activity.
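For clarity, all three message types share the same structure (type, face, status, sender). A purely illustrative encoding is:

from dataclasses import dataclass
from enum import Enum, auto

class MsgType(Enum):
    ALARMREQUEST = auto()  # ask a neighbor SSU (or the MS) to classify a face
    ALARMREPLY = auto()    # result of a classification request
    FACEID = auto()        # broadcast of recognized face features to neighbors

@dataclass
class Message:
    type: MsgType
    face: object    # extracted face features
    status: object  # e.g. None (unknown), "recognized", or "alarm"
    sender: str     # identifier of the sending SSU or of the MS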


4   Evaluation of the area surveillance algorithm

In this section, we present a simulation experiment to evaluate the load imposed on the MS and the performance of the distributed surveillance system.
    In our simulation, we assume that the SSUs are uniformly distributed in the monitored area. Persons in the surveillance area move according to the mobility model described in [5] for the movements of avatars in a virtual environment. In this model, the area has a number of hotspots equal to nhs. The hotspots are the most attractive places in the area, where more persons are expected to be found than elsewhere. Each person entering the surveillance area behaves as follows: the person chooses a random number of hotspots to visit and moves towards them in sequence. After all the chosen hotspots have been visited, the person leaves the surveillance area. The population of the area follows a bell-shaped curve, very similar to a normal distribution, in which the population starts from 0 and reaches a maximum of 800 persons. In our simulations, we considered 10% of the persons in the area to be unknown (i.e., associated with the alarm tag).
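For illustration, a minimal sketch of such a hotspot-driven trajectory is given below; the constant speed, the step size, and the uniform random choice of hotspots are our simplifying assumptions and not the exact model of [5]:

import random

def person_trajectory(hotspots, speed=1.4, dt=1.0, start=(0.0, 0.0)):
    """Yield successive (x, y) positions of a person who visits a random
    subset of hotspots in sequence and then leaves the area."""
    x, y = start
    targets = random.sample(hotspots, k=random.randint(1, len(hotspots)))
    for tx, ty in targets:
        # Move in a straight line towards the current hotspot.
        while (x - tx) ** 2 + (y - ty) ** 2 > (speed * dt) ** 2:
            dx, dy = tx - x, ty - y
            dist = (dx ** 2 + dy ** 2) ** 0.5
            x, y = x + speed * dt * dx / dist, y + speed * dt * dy / dist
            yield (x, y)
        x, y = tx, ty
        yield (x, y)
    # After the last chosen hotspot, the person exits the surveillance area.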
    We executed a 6-hour simulation using the PeerSim simulator [9] and the mobility model described above. We evaluated two algorithms: (1) the baseline, in which all the data is transmitted to the MS for analysis, and (2) the adaptive algorithm described in Section 3.
    In the baseline model, all SSUs stream the video directly to the MS, which extracts the face features and recognizes the persons. The baseline curve in Figure 2 shows the number of recognitions the MS is required to process in a 60-second time interval. The other curves in the figure show the recognition requests sent to the MS when the enhanced algorithm is applied, under different system parameters, namely Tmax and ct, where Tmax is the maximum allowed face recognition delay and ct is the time needed to compare the extracted features with one class in the recognition library.
                  Fig. 2: MS recognition requests per 60 s for different system parameters. (a) ct = 450 ms, workload A: baseline vs. Tmax = 4 s, 6 s, 10 s, 15 s. (b) Tmax = 10 s, workload A: baseline vs. ct = 450 ms, 360 ms, 135 ms, 45 ms. (c) Aggregated MS classification requests over the 6-hour experiment, workload B, as a function of the maximum allowed delay, for ct = 450 ms, 360 ms, 135 ms, 45 ms.


The results show that the enhanced algorithm can reduce the recognition activity of the MS unit by up to 50% during peak hours.
    Each class of the cache is a simulated sample of the face features that characterize a person. Therefore, the time needed by an SSU to perform a single face recognition task is given by the size of the cache selection multiplied by ct. A higher number of face features in a class increases the recognition accuracy, but at the same time it also increases ct.
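As a back-of-the-envelope illustration, with ct = 450 ms (one of the values used in Figure 2) and a hypothetical selection of 20 cached classes, a single local recognition already exceeds a Tmax of 4 s:

def local_recognition_time(selection_size: int, ct: float) -> float:
    # One face is compared against every class in the local selection.
    return selection_size * ct

t_local = local_recognition_time(20, 0.450)  # 9.0 s for 20 cached classes
meets_deadline = t_local <= 4.0              # False: exceeds Tmax = 4 s,
                                             # so the SSU must fall back to the MS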
    As we can see in Figure 2a, with a lower Tmax the system does not have enough time for the whole recognition cycle of the algorithm and is forced to rely on MS recognition more often. Instead, a higher Tmax allows an SSU to perform the recognition using the cache and to receive the replies from the neighbor SSUs. Nevertheless, even with a low Tmax, the enhanced algorithm significantly reduces the recognition load of the MS compared to the baseline solution.
    Figure 2b shows the influence of the ct time interval on the MS recognition load. Even with high local recognition accuracy and a relatively low Tmax, the recognition-request load on the MS is significantly lower than in the baseline solution. Moreover, lower ct values further decrease the load imposed on the MS by recognition. Figure 2c shows the impact of the system parameters on the effectiveness of the algorithm: lower values of ct and larger values of Tmax significantly reduce the MS recognition load. One straightforward way to minimize ct is to reduce the accuracy of the local recognition algorithm by decreasing the number of features in the sample. When fast person recognition is not a requirement, increasing Tmax can also reduce the load of the MS.


5                                        Conclusion

In this paper, we propose a distributed edge-based protocol for area video surveil-
lance based on image classification. In order to minimize the classification load
on the main server, recognition tasks take place on the SSUs when possible. To
perform person recognition, an SSU uses the local resources together with the
resources of the neighbor SSUs. The surveillance devices fall back to the main
server when the classification cannot be done in the desired time interval. In this
extended abstract, we gave an overview of how our system works, described the proposed algorithms, and presented simulations showing that partially placing the recognition tasks on the local resources of surveillance devices can reduce the load on the main server by up to 50%.

References
 1. Amato, G., Carrara, F., Falchi, F., Gennaro, C., Vairo, C.: Facial-based intrusion
    detection system with deep learning in embedded devices. In: International Con-
    ference on Sensors, Signal and Image Processing (SSIP). pp. 64–68. ACM (2018)
 2. Amato, G., Chessa, S., Gennaro, C., Vairo, C.: Efficient detection of composite
    events in wireless sensor networks: design and evaluation. In: 2011 IEEE Sympo-
    sium on Computers and Communications (ISCC). pp. 821–823. IEEE (2011)
 3. Amato, G., Falchi, F., Gennaro, C., Vairo, C.: A comparison of face verification
    with facial landmarks and deep features. In: 10th International Conference on
    Advances in Multimedia (MMEDIA). pp. 1–6 (2018)
 4. Amato, G., Gennaro, C., Massoli, F.V., Passalis, N., Tefas, A., Trvillini, A., Vairo,
    C.: Face verification and recognition for digital forensics and information security.
    In: IEEE 7th International Symposium on Digital Forensic and Security (ISDFS).
    pp. 1–6. IEEE (2019)
 5. Kavalionak, H., Carlini, E., Ricci, L., Montresor, A., Coppola, M.: Integrating peer-
    to-peer and cloud computing for massively multiuser online games. Peer-to-Peer
    Networking and Applications (PPNA) pp. 1–19 (2013)
 6. Kavalionak, H., Gennaro, C., Amato, G., Meghini, C.: Dice: A distributed protocol
    for camera-aided video surveillance. In: Computer and Information Technology;
    Ubiquitous Computing and Communications; Dependable, Autonomic and Secure
    Computing; Pervasive Intelligence and Computing (CIT/IUCC/DASC/PICOM),
    2015 IEEE International Conference on. pp. 477–484. IEEE (2015)
 7. Kavalionak, H., Gennaro, C., Amato, G., Vairo, C., Perciante, C., Meghini, C.,
    Falchi, F.: Distributed video surveillance using smart cameras. Journal of Grid
    Computing 17(1), 59–77 (2019)
 8. Mishra, R., Kumar, P., Chaudhury, S., Indu, S.: Monitoring a large surveillance
    space through distributed face matching. In: Computer Vision, Pattern Recogni-
    tion, Image Processing and Graphics (NCVPRIPG), 2013 Fourth National Con-
    ference on. pp. 1–5. IEEE (2013)
 9. Montresor, A., Jelasity, M.: PeerSim: A scalable p2p simulator. In: Proc. of P2P’09.
    pp. 99–100. IEEE (2009)
10. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition (2015)
11. Song, M., Tao, D., Maybank, S.J.: Sparse camera network for visual surveillance–a
    comprehensive survey. arXiv preprint arXiv:1302.0446 (2013)
12. Vairo, C., Amato, G., Chessa, S., Valleri, P.: Modeling detection and tracking of
    complex events in wireless sensor networks. In: 2010 IEEE International Conference
    on Systems, Man and Cybernetics. pp. 235–242. IEEE (2010)
13. Wang, X.: Intelligent multi-camera video surveillance: A review. Pattern recogni-
    tion letters 34(1), 3–19 (2013)
14. Yoder, J., Medeiros, H., Park, J., Kak, A.C.: Cluster-based distributed face tracking
    in camera networks. Image Processing, IEEE Transactions on 19(10), 2551–2563
    (2010)
15. Zarezadeh, A., Bobda, C., Yonga, F., Mefenza, M.: Efficient network clustering for
    traffic reduction in embedded smart camera networks. Journal of Real-Time Image
    Processing pp. 1–14 (2015)