=Paper= {{Paper |id=Vol-3869/p07 |storemode=property |title=A Real-Time Machine Learning Based Solution for Privacy Enforcement in Video Recordings and Live Streaming |pdfUrl=https://ceur-ws.org/Vol-3869/p07.pdf |volume=Vol-3869 |authors=Pietro Manganelli Conforti,Matteo Emanuele,Lorenzo Mandelli |dblpUrl=https://dblp.org/rec/conf/icyrime/ConfortiEM24 }} ==A Real-Time Machine Learning Based Solution for Privacy Enforcement in Video Recordings and Live Streaming== https://ceur-ws.org/Vol-3869/p07.pdf
                                A Real-Time Machine Learning Based Solution for Privacy
                                Enforcement in Video Recordings and Live Streaming
                                Pietro Manganelli Conforti1 , Matteo Emanuele1 and Lorenzo Mandelli1
                                1
                                    Department of Computer, Control and Management Engineering, Sapienza University of Rome, Italy


                                                                          Abstract
In recent years the world has had to deal with an entirely new situation brought about by Covid-19. Everyone's routine changed, and we started spending far more time than before in virtual meetings, virtual chats and the like. With this, many privacy problems arose from all the video data generated by a single user. Google and Zoom introduced the possibility to blur out the background while using a front-facing camera, but this did not solve many privacy concerns, ranging from showing people in videos without their permission to the leaking of sensitive data and information from videos uploaded online. We propose a solution built on computer vision techniques such as image segmentation and classification for context recognition: a privacy enforcement system capable of fitting the user's personal needs, selectively blurring out specific objects from a video based on the user's preferences for each room they are in.

Keywords
Image segmentation, Context recognition, Detectron2, Privacy enforcement, Covid-19, AlexNet, Transfer learning



ICYRIME 2024: 9th International Conference of Yearly Reports on Informatics, Mathematics, and Engineering. Catania, July 29-August 1, 2024
$ mandelli@diag.uniroma1.it (L. Mandelli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

In the past years there has been a solid shift of the entire world population towards a more active presence online. Covid-19 has further pushed many activities to be carried out digitally. Virtual meeting applications like Zoom had 10 million daily meeting participants in December 2019, but by April 2020 that number had increased to reach up to 300 million [1]. It is estimated that in 2024 only 25% of business meetings will take place in person [2]. Studies started during 2020 have demonstrated that nowadays people spend on average far more time in virtual meetings than before [3], leading to many concerns for the individual. Users have started experiencing stress related to not being competent in the use of the technology, but most importantly "Zoom fatigue" due to it being always "on" [4]. Many privacy related issues have been crippling the user experience ever since, such as exposing private and personal spaces on camera, unintentionally framing a person who did not give consent to be on video, or sharing sensitive information leaked through careless online posting. Many solutions have been promptly developed to prevent such things from happening, providing virtual meeting room services with privacy-safeguarding functionalities like background blurring and virtual backgrounds [5, 6, 7, 8]. We present in this paper a novel computer vision based approach for privacy enforcement in video data, capable of filtering out from a video a list of objects that a user does not want to show, based on the recognition of the framed environment, in order to blur out objects according to both the user's needs and the context they are in.

2. Related works

With the advancement of technology, people have been sharing a continuously growing amount of personal data online. In addition to life-logging devices [9], social media have recently stepped in, quickly ending up dominating the landscape of mass-produced data with "visual data" (i.e. images and video). For instance, in 2020, the first year of the pandemic, users generated and shared via Facebook a total of 10.5 million videos [10].
This impactful amount of data brought many privacy related issues to the attention of experts and users; studies started identifying and observing how easily privacy could be violated just by unintentionally sharing personal data contained inside images and videos, and subsequently started proposing privacy models to formally approach and tackle said scenarios [11]. The scientific world went quickly from defining sub-fields like Privacy-Preserving Machine Learning (PPML) [12] to adopting deep learning models for image disguising [13], context recognition [14] as well as image-based localization [15], and again computer vision based frameworks [16] as novel solutions for privacy preservation in first-person vision image sequences, placing computer vision, artificial intelligence and data-driven approaches as state-of-the-art techniques for preserving privacy online. Among the many available solutions for privacy preservation and safeguarding, one is currently missing which allows single users to selectively censor objects from visual








Pietro Manganelli Conforti et al. CEUR Workshop Proceedings                                                          53–59




data depending on their personal needs and preferences. The proposed work therefore aims to provide inexperienced users with an intuitive, easy to use tool for privacy enforcement in video data based on computer vision techniques.

Figure 1: The pipeline of our system.

3. Implementation

Sensitive user data is crucial to keep private. The proposed software tracks such information by means of various modules, whose pipeline is shown in fig. 1. Memory buffers are used in between modules to guarantee flexibility towards input videos of any aspect ratio and fps, as well as to stabilize the output, overcoming the flickering commonly experienced in these kinds of applications. Thanks to such buffers it is possible to store past frames and reuse them to statistically smooth the final output; past frames are reused according to the level of confidence predicted by the class recognition module. Input videos can be directly uploaded to the system or streamed from cameras (i.e. webcams).
The proposed solution separates the overarching learning problem into two sub-problems, namely context recognition and image segmentation; this approach guarantees robustness through modularity and simplifies the overall functioning of the software.
Recognizing the users, their emotional state [17, 18, 19], their attentive state [20, 21, 22], and the surrounding context [23, 24, 25] allows the system to selectively obscure specific elements based on preferences the user expressed at registration time; such data is stored in a database for later inference. Context recognition has been tackled with a neural network inspired by AlexNet [26], a famous deep convolutional neural network designed for image classification.
Together with an RFID application [14], Detectron2, a very powerful instance segmentation network published by Facebook in 2019 [27], is used to identify the user's context-specific, privacy related data within video frames with no ambiguity; by combining the output masks produced by Detectron2 with all the information retrieved before, a particular region of the frame is identified and filtered with a Gaussian transform. Disambiguation of similar or identical contexts can be tackled and solved with the support of RFID technology: with the introduction of a beacon that sends a constant signal, it is possible to recognize and distinguish two apparently identical-looking environments. Such a discriminatory action is essential, yet simple to apply, since it can be integrated into any environment with low effort or invasiveness. A similar RFID-based solution for context recognition was already presented by earlier research [14]. Finally, the desired effect is obtained by processing and collecting all the frames of the video and setting the right frame rate.

3.1. Dataset

Distinct datasets have been used for the two different learning tasks, namely image segmentation and context recognition. The choice of the Detectron2 network for the image segmentation task leaves little to no choice but to use the 2017 version of the COCO dataset [28], with which it has been demonstrated to perform well.
COCO is a dataset composed of two groups of elements: images and annotations. The images contain a vast variety of objects, for a total of 80 different categories of elements. The network was capable of recognizing them all, and even apparently odd objects were left untouched and not removed. Together with the set of images, COCO comes with a set of so-called "annotations" that contain information related to the position of the object masks,







their bounding boxes, and their location in the image reference frame.
As concerns the context recognition part, a slightly modified version of a dataset available on Kaggle [29] has been used; this dataset is composed of 5 different classes representing five different kinds of rooms, two of which, namely living room and dining room, have been merged together. Each element is originally an RGB picture of a fixed size of 224x224x3, which has been resized to 227x227x3 to better fit AlexNet.
As part of the training & testing process, a defined set of image processing techniques has been organized into a pipeline. This transformation pipeline has been implemented using the Albumentations library [30], an easy-to-use and intuitive library for image processing; it consists of: ShiftScaleRotate, for shifting or rotating images; RGBShift, for randomly altering RGB channel values; RandomBrightnessContrast, for randomly changing images' brightness and contrast; MultiplicativeNoise, for randomly adding noise; Normalize, for normalizing data; HueSaturationValue, for randomly changing images' saturation.

3.2. Image Segmentation Network

Detectron2 is Facebook AI Research's library [27] that provides state-of-the-art detection and segmentation algorithms. It is the successor of Detectron [31], which is in turn based on the Maskrcnn-benchmark model [32]. It supports a great number of computer vision research projects thanks to its flexibility, output capabilities and available documentation.
Among the available Detectron2 architectures, mask-rcnn-fpn has been chosen. This architecture is mainly built from three modules: a Backbone Network, a Region Proposal Network and a Box Head.
The Backbone Network, whose role is to extract multi-scale feature maps with different receptive fields starting from the input image, is based on the Feature Pyramid Network [33] technique. In this way, areas of interest from different points of view are identified and passed to both of the next two modules. The Region Proposal Network detects object regions (the so-called "proposal boxes") based on multi-scale features, which together with the feature maps serve as input for the RoI (Region of Interest) Head. This last module warps feature maps using proposal boxes into multiple fixed-size features, and retrieves the fine-tuned box locations and classification results via fully-connected layers.

3.3. Context Recognition Network

The context recognition task is a classification problem, where each frame of the video is treated as an image to classify. For this task AlexNet has been fine-tuned on our reduced dataset by means of transfer learning.
The structure of the network is shown in figure 2.

Here we report the performance with which we evaluated our model. We give particular importance to both the accuracy and the F1-score of each class.

  Classes      precision   recall   F1-score   overall accuracy
  Bathroom       0.84       0.90      0.87
  LivingRoom     0.92       0.79      0.85
  Bedroom        0.69       0.76      0.76
  Kitchen        0.75       0.76      0.76
                                                     0.83

Table 1: Performance of our classification task

The performance of the network is reported in table 1. As can be seen, we are capable of obtaining high F1-score values for each class and an overall accuracy above 80%, making the results satisfying by our standards. We can consider the macro F1-score as a general metric of evaluation, defined as:

\[ \text{macro-}F1 = \frac{1}{N} \sum_{i=1}^{N} F1_i = 0.81 \]

which is simply the F1-score averaged over all classes. In our case, we obtain a macro F1-score of 0.81.

It is also possible to inspect the confusion matrix in figure 3, showing how the different samples from the test set were classified during the test phase.

3.4. Output stabilization techniques

Videos in daily life scenarios are likely to contain temporary blank frames, as well as artifacts, due to user or scene related conditions. A context recognition network will therefore generate a classification label that is trivially assigned, leading to an instability problem. We dealt with this problem through the introduction of a "memory buffer", which is capable of statistically stabilizing the result.

This is achieved by endowing the system with two buffers: one for the context recognition network, which stores the predicted context classes, and one used to track the instance segmentation network output and to store the classes predicted with enough accuracy. Storing past data allows creating a time relation between successive frames, thus reinforcing the output of each network and stabilizing the final one. This method allows correlating information inside the video with the least expenditure of resources. The two buffers are described below.
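The macro F1-score reported above can be reproduced directly from the per-class F1 values in Table 1; a minimal check:

```python
# Per-class F1-scores taken from Table 1.
f1_scores = {
    "Bathroom": 0.87,
    "LivingRoom": 0.85,
    "Bedroom": 0.76,
    "Kitchen": 0.76,
}

# Macro-F1 is the unweighted mean of the per-class F1-scores.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 2))  # 0.81
```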







Figure 2: Alexnet architecture. All rights reserved to the owner of the picture[34]



Figure 3: The confusion matrix, generated with the use of the scikit-learn library [35]

3.4.1. Context memory buffer

The first buffer, visible at the top of fig. 1, is the one dedicated to the output of AlexNet. The assumption behind this approach is that context changes do not take place suddenly but instead follow a smooth trend. For instance, if frame n is recognized as a specific context, there is a high probability that frame n + 1 will also carry similar information and represent the same context. Thus, averaging over past frames increases the overall accuracy by smoothing the output trend.
The length of the buffer is set dynamically in relation to the fps value retrieved from the video, and the information obtained from the frames of the last half second gets stored within it.
The trade-off of this method is a small delay in the context recognition module, because the output context label has to remain stable for at least half of the buffer length to change the output, and in this short period of time the context is wrongly classified with the previous stable label. This is strongly compensated by the stability provided, and the delay is short enough to be hard to notice.

3.4.2. Instance segmentation memory buffer

The other output stabilization technique is the instance segmentation memory buffer, dedicated to the Detectron output. The rationale here concerns the threshold used by the model to decide whether an element belongs to a certain class. Since it is a privacy concern to conceal as much of the sensitive information inside the frames as possible, false positives in exchange for a higher number of true positives are preferable. Therefore, two kinds of thresholds are considered: the basic one and the optimal one. The first is lower than the second and is the minimal value considered acceptable to take the output of the network into account; if the output confidence regarding a specific instance inside the frame falls below this value, it is considered too unclear and is not counted. The second threshold instead represents the optimal confidence value used by the system to properly recognize an element with enough accuracy. This information is used to track the last elements seen by the network, appending them to the buffer.
If the network finds an instance of a class already inside the buffer, even with a confidence value lower than the optimal threshold, it is still considered acceptable, and is therefore processed and eventually concealed. In this way it gets easier for the network to work with moving objects, because this method allows tracing them even in the case of uncertainty due to movement.
The buffer length is dynamically related to the fps value and stores information regarding the set of frames covering the last three seconds. If an element is recognized by the network after this time interval, it needs to overcome the optimal confidence threshold again in order to be evaluated.
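A minimal sketch of this dual-threshold logic in pure Python; the concrete threshold values and the per-class bookkeeping are illustrative assumptions, since the paper does not specify them:

```python
import time

class InstanceBuffer:
    """Remembers recently accepted classes so that low-confidence
    re-detections of a known object are still concealed (sketch only)."""

    def __init__(self, basic=0.5, optimal=0.8, horizon_s=3.0):
        self.basic = basic          # minimum confidence worth considering at all
        self.optimal = optimal      # confidence needed for a *new* class
        self.horizon_s = horizon_s  # how long a class stays "remembered" (~3 s)
        self._last_seen = {}        # class name -> timestamp of last acceptance

    def accept(self, cls, confidence, now=None):
        """Return True if this detection should be processed and blurred."""
        now = time.monotonic() if now is None else now
        # Forget entries older than the horizon.
        self._last_seen = {c: t for c, t in self._last_seen.items()
                           if now - t <= self.horizon_s}
        if confidence < self.basic:
            return False                # too unclear, not counted
        if confidence >= self.optimal or cls in self._last_seen:
            self._last_seen[cls] = now  # remember / refresh the class
            return True
        return False
```

A class first accepted at confidence above the optimal threshold is then concealed for up to three seconds even when its confidence dips into the basic-to-optimal range, which is what keeps moving objects covered despite motion blur.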



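Putting the pieces of Section 3 together, the concealment step itself reduces to filtering only the pixels covered by the accepted instance masks. A toy sketch follows, using NumPy only and a simple box blur standing in for the Gaussian transform used by the system:

```python
import numpy as np

def box_blur(img, k=5):
    """Naive k x k box blur of a 2-D grayscale image (sketch only)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def conceal(frame, masks):
    """Blur the frame pixels covered by any boolean instance mask."""
    blurred = box_blur(frame)
    combined = np.zeros(frame.shape, dtype=bool)
    for m in masks:
        combined |= m
    out = frame.astype(float).copy()
    out[combined] = blurred[combined]  # replace only masked pixels
    return out
```

In the real pipeline the masks would come from Detectron2's per-instance outputs, filtered through the buffers above, and the blur would be a Gaussian one; the point of the sketch is that concealment is a per-pixel replacement restricted to the union of the accepted masks.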




Figure 4: a: Kitchen, b: Bathroom, c: Bedroom



The trade-off of this system, as mentioned above, is a higher frequency of false positives, which can be misleading for the final result and whose number is inversely proportional to the two threshold values. Overall, the accuracy following this approach improved by a fair margin, mainly in the more dynamic scenarios.

4. Results

The results of the system are evaluated according to how many times the full procedure works consistently with respect to specific information given as input for a specific test video. Knowing this information in advance, that is, the settings inside a set of test videos as well as the list of elements inside them, we can measure the overall accuracy of the system.
For instance, if a video displays a specific context with a certain number of known elements inside of it, we can count how many times those elements are found by the two networks and by the two memory buffers in the output trend.
To achieve this, an evaluation procedure was implemented that, given an input video, follows similar steps to the system but keeps track of the number of times the output produced for each input frame is correct with respect to the total number of frames. Three test videos were used, providing the following scenarios:

      • Kitchen (fig. 4.a), where we want to blur out a bowl from a table, given that the system recognizes the context. The instance segmentation network can identify various objects such as a table, an oven and bottles. Due to the user preferences, the stationary object we want to blur out from all the video frames is just one, a black bowl.

      • Bathroom (fig. 4.b), where the user wants to blur out the WC. In this scenario, the instance segmentation network also recognizes the bidet as a WC, given the similarity in their structure.

      • Bedroom (fig. 4.c), where there is a bowl placed on a flat surface behind the bed, and the user wants to blur it out. This scenario can be potentially challenging, since the portion of the room being framed is very restricted, and the only object that can be considered a strong feature is the bed.

The results are shown in the following table. Here we have 5 different evaluation values for each test:

      • accuracy of the context recognition network with respect to the total number of frames (C.R.), indicating the percentage of success for the context recognition network applied to the frames of the video. This value does not show the improvements brought by the memory buffer.

      • accuracy of the context recognition network + memory buffer with respect to the total number of frames (C.R. \w B.), indicating the percentage of success for the context recognition network combined with the memory buffer for the context recognition task.

      • accuracy of the instance segmentation network in finding the objects of interest in a frame with respect to the total number of frames (I.S.), indicating the percentage of success for the instance segmentation task with respect to the objects of interest for the user. This value does not show the improvements brought by the memory buffer.

      • accuracy of the instance segmentation network + memory buffer in finding the objects of interest in a frame with respect to the total number of frames (I.S. \w B.), indicating the percentage of success for the instance segmentation task with respect to the objects of






          interest for the user. This value does not show               References
          the improvements brought by the memory buffer.
                                                                         [1] B. Evans, The zoom revolution: 10 eye-popping
     • overall accuracy of the whole system. This indi-                      stats from tech’s new superstar, 2020.
       cates, as the name states, the overall accuracy                   [2] W. Standaert, S. Muylle, A. Basu, How shall we
       of the whole pipeline. This accuracy is given as                      meet? understanding the importance of meeting
       a combined accuracy from the accuracy of the                          mode capabilities for different meeting objectives,
       two tasks, obtained as the product between the                        Information Management (2021). doi:https://
       accuracy of the instance segmentation consider-                       doi.org/10.1016/j.im.2020.103393.
       ing also the buffer and the context recognition                   [3] D. Chew, M. Azizi, The state of video conferencing
       considering also the buffer.                                          2022, 2022.
                                                                         [4] K. A. Karl, J. V. Peluchette, N. Aghakhani, Virtual
                                                                             work meetings during the covid-19 pandemic: The
In table 2 tests’ results are reported, confirming that mem-                 good, bad, and ugly, Small Group Research 0 (2021).
ory buffers contribute to increase the accuracy of both                      URL: https://doi.org/10.1177/10464964211015286.
tasks, translating in overall better system’s accuracy.                  [5] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi,
It must be noted that accuracy can be further improved                       G. Lo Sciuto, R. K. Nowicki, J. T. Starczewski, A mul-
by fine tuning the thresholds required by the instance                       tiscale image compressor with rbfnn and discrete
segmentation task. A general thumb rule is that, if the                      wavelet decomposition, in: Proceedings of the Inter-
accuracy is similar between the system using the buffers                     national Joint Conference on Neural Networks, vol-
and the system not using it, it is possible to improve the                   ume 2015-September, 2015. doi:10.1109/IJCNN.
performance through such fine tuning.                                        2015.7280461.
Table 2
Performance of the system (C.R. = context recognition, I.S. = instance segmentation, B. = memory buffer).

   Input      C.R.   C.R. w/ B.   I.S.   I.S. w/ B.   Overall %
  Kitchen     0.91      1.0       0.89      0.98        0.98
  Bathroom    0.74      0.81      0.89      1.0         0.81
  Bedroom     0.92      1.0       0.5       0.78        0.78
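As described above, the overall accuracy is the product of the two buffered task accuracies; a minimal check of this, with the values copied from Table 2 and rounded to two decimals:

```python
# Overall accuracy = product of context-recognition and instance-segmentation
# accuracies, both measured with the memory buffer (values from Table 2).
rows = {
    "Kitchen":  (1.0, 0.98),
    "Bathroom": (0.81, 1.0),
    "Bedroom":  (1.0, 0.78),
}

for scene, (cr_with_buffer, is_with_buffer) in rows.items():
    overall = round(cr_with_buffer * is_with_buffer, 2)
    print(f"{scene}: {overall}")  # matches the Overall % column
```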
                                                                         [8] M. Woźniak, D. Połap, M. Gabryel, R. K. Nowicki,
5. Conclusions

In this paper we presented a machine-learning-powered solution for privacy enforcement in video data: a data-driven implementation that safeguards the privacy of any user who spends many hours in videos and/or video meetings. This solution addresses a largely untouched problem that the field has not formally faced in recent years, as our lives have shifted increasingly toward time spent online.
Our implementation performs well even on noisy or foggy videos, and almost perfectly in the most common scenarios: videos with typical perspectives, extracted from general recordings made with mobile devices. The system's adaptability to the needs of different users, both in the objects of interest and in the contexts of interest, makes the proposed solution a solid step forward in the field of privacy enforcement for video data.
 [2] ...meet? Understanding the importance of meeting mode capabilities for different meeting objectives, Information & Management (2021). doi:10.1016/j.im.2020.103393.
 [3] D. Chew, M. Azizi, The state of video conferencing 2022, 2022.
 [4] K. A. Karl, J. V. Peluchette, N. Aghakhani, Virtual work meetings during the covid-19 pandemic: The good, bad, and ugly, Small Group Research (2021). doi:10.1177/10464964211015286.
 [5] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, R. K. Nowicki, J. T. Starczewski, A multiscale image compressor with rbfnn and discrete wavelet decomposition, in: Proceedings of the International Joint Conference on Neural Networks, volume 2015-September, 2015. doi:10.1109/IJCNN.2015.7280461.
 [6] G. Capizzi, S. Coco, G. L. Sciuto, C. Napoli, A new iterative fir filter design approach using a gaussian approximation, IEEE Signal Processing Letters 25 (2018) 1615–1619. doi:10.1109/LSP.2018.2866926.
 [7] D. Połap, M. Woźniak, C. Napoli, E. Tramontana, R. Damaševičius, Is the colony of ants able to recognize graphic objects?, Communications in Computer and Information Science 538 (2015) 376–387. doi:10.1007/978-3-319-24770-0_33.
 [8] M. Woźniak, D. Połap, M. Gabryel, R. K. Nowicki, C. Napoli, E. Tramontana, Can we process 2d images using artificial bee colony?, in: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), volume 9119, 2015, pp. 660–671. doi:10.1007/978-3-319-19324-3_59.
 [9] A. L. Allen, Dredging up the past: Lifelogging, memory and surveillance, University of Chicago Law Review 12 (2008) 2825–2830.
[10] T. Dobrilova, The most astonishing facebook statistics in 2022, 2022. URL: https://techjury.net/blog/facebook-statistics/.
[11] S. Cunningham, M. Masoodian, A. Adams, Privacy issues for online personal photograph collections, Journal of Theoretical and Applied Electronic Commerce Research 5 (2010). doi:10.4067/S0718-18762010000200003.
[12] R. Xu, N. Baracaldo, J. Joshi, Privacy-preserving machine learning: Methods, challenges and directions, 2021. arXiv:2108.04417.
[13] S. Sharma, K. Chen, Image disguising for privacy-preserving deep learning, in: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Association for Computing Machinery, New York, NY, USA, 2018. URL: https://doi.org/10.1145/3243734.3278511.
[14] G. M. Farinella, C. Napoli, G. Nicotra, S. Riccobene, A context-driven privacy enforcement system for autonomous media capture devices, Multimedia Tools and Applications 78 (2019) 14091–14108. URL: https://doi.org/10.1007/s11042-019-7376-z.
[15] P. Speciale, J. L. Schönberger, S. B. Kang, S. N. Sinha, M. Pollefeys, Privacy preserving image-based localization, CoRR abs/1903.05572 (2019).
[16] A. T.-Y. Chen, M. Biglari-Abhari, K. I.-K. Wang, Trusting the computer in computer vision: A privacy-affirming framework, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[17] I. E. Tibermacine, A. Tibermacine, W. Guettala, C. Napoli, S. Russo, Enhancing sentiment analysis on seed-iv dataset with vision transformers: A comparative study, in: ACM International Conference Proceeding Series, 2023, pp. 238–246. doi:10.1145/3638985.3639024.
[18] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, in: CEUR Workshop Proceedings, volume 3360, 2022, pp. 55–63.
[19] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel convmixer transformer based architecture for violent behavior detection, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 14126 LNAI, 2023, pp. 3–16. doi:10.1007/978-3-031-42508-0_1.
[20] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information (Switzerland) 14 (2023). doi:10.3390/info14120644.
[21] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote emdr therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[22] V. Ponzi, S. Russo, V. Bianco, C. Napoli, A. Wajda, Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study, in: CEUR Workshop Proceedings, volume 3118, 2021, pp. 26–33.
[23] R. Brociek, G. D. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 89–94.
[24] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.
[25] E. Iacobelli, S. Russo, C. Napoli, A machine learning based real-time application for engagement detection, in: CEUR Workshop Proceedings, volume 3695, 2023, pp. 75–84.
[26] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks (2012).
[27] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, https://github.com/facebookresearch/detectron2, 2019.
[28] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, P. Dollár, Microsoft coco: Common objects in context, 2015. arXiv:1405.0312.
[29] RobinReni, House rooms image dataset, 2020. URL: https://www.kaggle.com/robinreni/house-rooms-image-dataset.
[30] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, A. A. Kalinin, Albumentations: Fast and flexible image augmentations, Information 11 (2020) 125. URL: http://dx.doi.org/10.3390/info11020125.
[31] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, K. He, Detectron, 2018. URL: https://github.com/facebookresearch/detectron.
[32] F. Massa, R. Girshick, maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch, https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
[33] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, 2017. arXiv:1612.03144.
[34] A. Khvostikov, K. Aderghal, J. Benois-Pineau, A. Krylov, G. Catheline, 3d cnn-based classification using smri and md-dti images for alzheimer disease studies (2018).
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

Pietro Manganelli Conforti et al. CEUR Workshop Proceedings 53–59