<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Imaging 7 (2021) 76. doi:1 0 . 3 3 9 0 /
C. Vairo</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AIMH Lab 2023 Activities for Vision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Ciampi</string-name>
          <email>luca.ciampi@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Bolettieri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Carrara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Di Benedetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Gennaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Messina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Vadicamo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Vairo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision</institution>
          ,
          <addr-line>Multimedia Understanding, Deep Learning, Large-scale Video Retrieval, Learning with Scarce Data</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>via G. Moruzzi, 1, Pisa, 56100</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>2161</volume>
      <fpage>591</fpage>
      <lpage>596</lpage>
      <abstract>
        <p>The explosion of smartphones and cameras has led to a vast production of multimedia data. Consequently, Artificial Intelligence-based tools for automatically understanding and exploring these data have recently gained much attention. In this short paper, we report some activities of the Artificial Intelligence for Media and Humanities (AIMH) laboratory of the ISTI-CNR, tackling some challenges in the field of Computer Vision for the automatic understanding of visual data and for novel interactive tools aimed at multimedia data exploration. Specifically, we provide innovative solutions based on Deep Learning techniques carrying out typical vision tasks such as object detection and visual counting, with particular emphasis on scenarios characterized by scarcity of labeled data needed for the supervised training and on environments with limited power resources imposing miniaturization of the models. Furthermore, we describe VISIONE, our large-scale video search system designed to search extensive multimedia databases in an interactive and user-friendly manner.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Vision</kwd>
        <kwd>Multimedia Understanding</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Large-scale Video Retrieval</kwd>
        <kwd>Learning with Scarce Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The pervasive difusion of smartphones and cheap
cameras leads to an exponential daily production of digital
visual data, such as images and videos. In this context,
a constant increase of attention to the automatic
underputer Vision has become one of the hottest fields that
make extensive use of Artificial Intelligence (AI), to such
day lives, and they are making human life easier. Some
examples include pedestrian detection and human
activity monitoring in surveillance systems or face detection
and recognition in smartphones. Furthermore, an
imtities of data coming from diferent sources is the need
to eficiently and efectively organize them so that also
non-expert users can easily manage and browse them.
tions carried out by the Artificial Intelligence for Media
and Humanities (AIMH) laboratory of the ISTI-CNR,
focusing on multimedia understanding and novel
interactive software for multimedia data exploration.
Specifinized by CINI, May 29–31, 2023, Pisa, Italy
∗Corresponding author.
(C. Vairo)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Research Areas</title>
      <sec id="sec-2-1">
        <title>2.1. Learning with Scarce Data</title>
        <p>Current vision systems powered by data-driven AI
methods sufer from strong domain shifts and are not
invariant to substantial context variations. Indeed, these AI
technologies usually need a massive amount of
annotated data required for the supervised learning phase,
and they often sufer when applied to unseen data.
Consequently, adopting these solutions is often dampened for
large-scale contexts, considering that the data annotation
procedure requires extraordinary human efort and
colThis paper presents some research topics and applica- limited power resources.</p>
        <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License lecting data for every specific scenario is unfeasible. The
Attribution 4.0 International (CC BY 4.0).</p>
        <p>AIMH Lab is tackling this challenge from several sides, based on single-image classification [ 11], which can
mitofering diferent approaches detailed in the following. igate the domain gap between annotated datasets
containing violent/non-violent clips in general contexts and
2.1.1. Learning from Synthetic Data a recently introduced collection of videos specific for
detection of violent behaviors in public transport [12].</p>
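          <p>As a rough illustration of this labels-for-free pipeline, the sketch below loads synthetic frames together with the annotations emitted by the engine at capture time. The directory layout and JSON schema are illustrative assumptions, not the format of our released collections; PyTorch and torchvision are used.</p>
          <preformat><![CDATA[
# A minimal sketch (not the released toolchain) of consuming a synthetic
# collection whose labels were generated by the graphical engine itself.
import json
from pathlib import Path

import torch
from torch.utils.data import Dataset
from torchvision.io import read_image


class SyntheticDetectionDataset(Dataset):
    """Frames rendered by a graphical engine; labels come for free."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.frames = sorted((self.root / "images").glob("*.png"))

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, i):
        frame = self.frames[i]
        image = read_image(str(frame)).float() / 255.0
        # One JSON per frame, emitted by the engine at capture time (assumed).
        ann = json.loads((self.root / "labels" / f"{frame.stem}.json").read_text())
        target = {
            "boxes": torch.tensor(ann["boxes"], dtype=torch.float32),   # [N, 4] xyxy
            "labels": torch.tensor(ann["classes"], dtype=torch.int64),  # person, helmet, ...
        }
        return image, target
]]></preformat>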
        <p>An appealing solution to mitigate the human efort
needed for manual annotation is to gather synthetic data 2.1.3. Learning from multi-rating data
from virtual environments resembling the real world,
where the labels are automatically collected by interact- Often, non-trivial patterns produce a non-negligible
dising with the graphical engine. In this context, we released agreement between multiple annotators, such as when
and exploited several synthetic collections of images to dealing with biological structures in microscopy images.
build DL solutions that carry out several human-centered A possible solution to have more robust labels is to
agtasks. In particular, in [1], we presented an embedded gregate and average the decisions given by several
anmodular AI-assisted Computer Vision-based system that notators to the same data. However, the scale of many
provides many functionalities to help monitor individual tasks prevents the creation of large datasets annotated by
and collective human safety rules, ranging from social several experts, i.e., annotators prefer to label new data
distance estimation and crowd counting to Personal Pro- rather than label the same data more than once, resulting
tective Equipment (PPE) detection (such as helmets and in large, single-labeled weakly labeled datasets and very
masks). Our solution consists of multiple modules rely- small multi-labeled data, from which it is crucial to make
ing on neural network components, each responsible for the most. In [13], we proposed a two-stage counting
specific functionalities that users can easily enable, con- strategy in a weakly labeled data scenario. In the first
ifgure, and combine. One of the main peculiarity is that stage, we trained state-of-the-art DL-based
methodolosome of these components have been trained by exploit- gies to detect and count biological structures exploiting
ing synthetic data collected from the GTAV videogame a large set of single-labeled data sure to contain errors;
and automatically annotated [2] [3] [4] [5]. Furthermore, in the second stage, using a small set of multi-labeled
we employed the GTAV videogame also for gathering data, we refined the predictions, increasing the
correlaother collections of images and labels to train a DL-based tion between the scores assigned to the samples and the
approach for human fall detection [6] and a technique for agreement of the raters on the annotations, i.e., we
immulti-camera vehicle tracking in urban scenarios. Finally, proved confidence calibration by taking advantage of the
more recently, we proposed CrowdSim2, a new synthetic redundant information characterizing the multi-labeled
collection of images suitable for people and vehicle detec- data. Furthermore, we are currently exploring the
possition and tracking gathered from a simulator based on the bility of exploiting multi-labeled data from annotations
Unity graphical engine [7] [8] consisting of thousands automatically generated by several state-of-the-art
detecof images collected from various synthetic scenarios re- tors.
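          <p>The following is a hedged sketch of such output-space adversarial training: a density regressor is supervised on source data while a discriminator pushes target predictions to look source-like. The discriminator architecture and the adversarial weight are illustrative stand-ins, not the published design of [9, 10].</p>
          <preformat><![CDATA[
# Sketch of output-space adversarial UDA for density estimation (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class OutputDiscriminator(nn.Module):
    """Judges whether a predicted density map comes from source or target."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),
        )

    def forward(self, density):
        return self.net(density)


def uda_step(model, disc, opt_g, opt_d, src_img, src_gt, tgt_img, adv_w=0.01):
    # 1) Supervised density regression on the labeled source domain.
    src_pred = model(src_img)
    loss_sup = F.mse_loss(src_pred, src_gt)
    # 2) Adversarial alignment: make target predictions look source-like.
    tgt_pred = model(tgt_img)
    d_tgt = disc(tgt_pred)
    loss_adv = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))
    opt_g.zero_grad()
    (loss_sup + adv_w * loss_adv).backward()  # adv_w = 0.01 is an assumption
    opt_g.step()
    # 3) Train the discriminator to separate source from target outputs.
    d_src = disc(src_pred.detach())
    d_tgt = disc(tgt_pred.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
]]></preformat>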
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Learning from Multi-rating Data</title>
          <p>Often, non-trivial patterns produce a non-negligible disagreement between multiple annotators, such as when dealing with biological structures in microscopy images. A possible solution for obtaining more robust labels is to aggregate and average the decisions given by several annotators on the same data. However, the scale of many tasks prevents the creation of large datasets annotated by several experts, i.e., annotators prefer to label new data rather than label the same data more than once, resulting in large, single-labeled, weakly labeled datasets and very small multi-labeled ones, from which it is crucial to make the most. In [13], we proposed a two-stage counting strategy in a weakly labeled data scenario. In the first stage, we trained state-of-the-art DL-based methodologies to detect and count biological structures exploiting a large set of single-labeled data sure to contain errors; in the second stage, using a small set of multi-labeled data, we refined the predictions, increasing the correlation between the scores assigned to the samples and the agreement of the raters on the annotations, i.e., we improved confidence calibration by taking advantage of the redundant information characterizing the multi-labeled data. Furthermore, we are currently exploring the possibility of exploiting multi-labeled data from annotations automatically generated by several state-of-the-art detectors.</p>
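          <p>To make the second-stage goal concrete, the toy sketch below checks how well per-image model confidence tracks inter-rater agreement. The agreement measure (one minus the normalized spread of the annotators' counts) is an assumption for illustration, not the exact formulation of [13].</p>
          <preformat><![CDATA[
# Toy check: does model confidence correlate with rater agreement?
import numpy as np
from scipy.stats import pearsonr


def rater_agreement(counts_per_annotator):
    """counts_per_annotator: [num_images, num_annotators] object counts.
    Agreement = 1 - normalized spread of the annotators' counts (assumed)."""
    mean = counts_per_annotator.mean(axis=1)
    spread = counts_per_annotator.std(axis=1)
    return 1.0 - spread / np.maximum(mean, 1e-6)


# Toy multi-labeled set: 4 images, 3 annotators each.
counts = np.array([[10, 11, 10], [52, 48, 55], [7, 7, 7], [90, 70, 110]])
model_confidence = np.array([0.9, 0.7, 0.95, 0.4])  # per-image scores
r, _ = pearsonr(model_confidence, rater_agreement(counts))
print(f"confidence/agreement correlation: {r:.2f}")
]]></preformat>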
        <sec id="sec-2-1-1">
          <title>Trafic-related issues are constantly increasing, and to</title>
          <p>2.1.2. Unsupervised Domain Adaptation morrow’s cities can be considered intelligent only if they
provide smart mobility applications, such as smart
parkUnsupervised Domain Adaptation (UDA) is a technique ing and trafic management. In this context, city camera
that addresses the Domain Shift problem by taking a networks represent the perfect tool for monitoring large
source labeled dataset and a target unlabeled one. The urban areas while providing visual data to AI systems
challenge here is to automatically infer some knowledge responsible for extracting relevant information and
sugfrom the target data to reduce the gap between the two gesting/making decisions helpful for intelligent mobility
domains. In [9] and [10], the AIMH Lab introduced an applications. However, implementing these solutions is
end-to-end CNN-based UDA algorithm for trafic density often hampered by the massive flow of data that must be
estimation and counting, based on adversarial learning sent to central servers or the cloud for processing. On
performed directly on the output space. We validated our the other hand, the recent paradigm of edge computing
approach over diferent types of domain shifts, i.e., the promotes the decentralization of data processing to the
Camera2Camera, the Day2Night, and the Synthetic2Real border, i.e., where the data are gathered, thus reducing
domain shifts, demonstrating significant improvement the trafic on the network and the pressure on central
compared to the performance of the model without do- servers. Nonetheless, this promising standard brings
main adaptation. Furthermore, very recently, we also along with it also some new challenges related to the
proposed a UDA scheme for video violence detection limited computational resources on the disposable edge
devices and also concerning security inside IoT networks.</p>
        <p>The AIMH Lab proposed, and is actively researching, DL-based solutions for intelligent parking monitoring matching the edge AI idea, i.e., solutions that can run directly onboard embedded vision systems equipped with limited computational capabilities, able to capture images, process them, and eventually communicate the elaborated information to other devices. Specifically, in [14] and [15], we introduced a decentralized and efficient solution for visual parking lot occupancy detection, which exploits a miniaturized CNN to classify parking space occupancy. It runs directly onboard smart cameras built using the Raspberry Pi platform equipped with a camera module. On the other hand, in [16] and [17], we extended this application by proposing a DL-based method that can instead estimate the number of vehicles present in the Field of View of the smart cameras. Such a task is more flexible than the previous one since it does not rely on meta-information regarding the monitored scene, such as the position of the parking lots. Moreover, in [18], we proposed a DL solution to automatically detect and count vehicles in images taken from a camera-equipped drone, running directly onboard the UAV.</p>
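        <p>A minimal stand-in for such a miniaturized classifier is sketched below: a tiny CNN that labels a cropped parking space as free or busy, small enough to run on a Raspberry Pi class device. Layer sizes are illustrative assumptions, not the architecture of [14, 15].</p>
        <preformat><![CDATA[
# Tiny binary classifier for single parking-space crops (PyTorch sketch).
import torch
import torch.nn as nn


class MiniParkNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)  # 0 = free, 1 = busy

    def forward(self, x):  # x: [B, 3, H, W] crops of parking spaces
        return self.classifier(self.features(x).flatten(1))


model = MiniParkNet().eval()
crop = torch.rand(1, 3, 96, 96)  # one parking-space crop
with torch.no_grad():
    print(model(crop).softmax(-1))  # occupancy probabilities
]]></preformat>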
        <p>More recently, we proposed a multi-camera system capable of automatically estimating the number of cars in an entire parking lot directly on board the edge devices [19]. The peculiarity of this solution is that, unlike most of the works in the literature, which focus on the analysis of single images, it uses multiple visual sources to monitor a wider parking area from different perspectives. More in detail, it comprises an on-device DL-based detector that locates and counts the vehicles from the captured images of a single smart camera, together with a decentralized geometric-based approach that can analyze the inter-camera shared areas and merge the data acquired by all the devices.</p>
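        <p>The sketch below gives a flavor of the geometric merging step under simplifying assumptions: per-camera detections are projected onto a common ground plane through placeholder homographies, and points closer than an assumed radius are counted as the same physical vehicle. It is an illustration, not the exact procedure of [19].</p>
        <preformat><![CDATA[
# Geometric de-duplication of detections across overlapping camera views.
import numpy as np


def to_ground_plane(points_px, H):
    """Project [N, 2] pixel detections to the ground plane via homography H."""
    pts = np.c_[points_px, np.ones(len(points_px))] @ H.T
    return pts[:, :2] / pts[:, 2:3]


def merged_count(per_camera_pts, homographies, radius=1.0):
    """Count vehicles once across the cameras' shared areas."""
    kept = []
    for pts_px, H in zip(per_camera_pts, homographies):
        for p in to_ground_plane(pts_px, H):
            # Keep the point only if no other camera already reported it.
            if all(np.linalg.norm(p - q) >= radius for q in kept):
                kept.append(p)
    return len(kept)
]]></preformat>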
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Object Detection and Visual Counting in Multi-disciplinary Areas</title>
        <fig id="fig1">
          <label>Figure 1</label>
          <caption>
            <p>A sample of our MOBDrone dataset for man overboard detection from drones.</p>
          </caption>
        </fig>
        <p>The AIMH Lab is currently researching the field of object detection and crowd counting, proposing novel solutions in multi-disciplinary areas. In [20], we collected and publicly released MOBDrone, a large-scale dataset of aerial footage of people who, being in the water, simulated the need to be rescued. This data includes 66 video clips with 126,170 frames manually annotated with more than 180K bounding boxes (of which more than 113K belong to the person category). We provide a sample of our dataset in Figure 1. Furthermore, we presented an in-depth experimental analysis of the performance of several state-of-the-art object detectors over this newly established scenario.</p>
        <p>In [21], we introduced a Computer Vision tool in the Smart Agriculture area aimed at automatically counting pests in pictures of sticky paper traps; controlling pest population is crucial in agriculture, and an effective Integrated Pest Management tool can prevent crop damage and suggest corrective measures to keep pests from causing significant problems. On the other hand, in [22], we tackled the problem of counting cells in microscopy images, a fundamental step in diagnosing several diseases in biology and medicine. Finally, in [23], we proposed a smart-surveillance application for crowd counting in videos gathered from city cameras. Here, we also considered the temporal context, relying on the evidence that when instances of particular objects are moving – like persons or cars – it is usually easier to spot their presence and consequently count them with greater accuracy. Specifically, we introduced a transformer-based attentive mechanism where the movement flows are initially estimated through a network trained by enforcing person-conservation laws – no persons can suddenly disappear between consecutive frames in the video except at the frame borders. Then, the person flow is integrated to get the actual people count. This solution obtained state-of-the-art results, surpassing frame-based visual counting networks.</p>
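        <p>As a toy rendering of this flow-integration idea, the sketch below accumulates per-cell inflows and outflows so that counts evolve only through motion, in line with the conservation constraint. Shapes and flow maps are illustrative assumptions, not the learned flows of [23].</p>
        <preformat><![CDATA[
# Counting by integrating people flows instead of re-detecting each frame.
import numpy as np


def integrate_people_flow(initial_density, inflows, outflows):
    """initial_density: [H, W] map; inflows/outflows: [T, H, W] per-cell flows.

    Conservation constraint: densities change only through flows, so nobody
    appears or vanishes between frames except via the border cells, which
    the inflow/outflow maps account for explicitly."""
    density = initial_density.astype(float).copy()
    counts = [density.sum()]
    for t in range(len(inflows)):
        density = density + inflows[t] - outflows[t]
        counts.append(density.sum())
    return counts  # estimated people count per frame
]]></preformat>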
      </sec>
      <sec id="sec-2-4">
        <title>2.4. The VISIONE Search System</title>
        <p>With the increasing diffusion of multimedia databases, there is today, as never before, the need to analyze, organize, and index all the produced data so that they can be quickly and efficiently retrieved. The development of tools to automatically analyze and index all these contents constitutes a significant achievement in the automatic content-based organization and browsing of large multimedia databases. Exciting applications of these technologies are, for example, the organization of raw multimedia data scraped from the web or the browsing of audiovisual archives owned by national televisions. To fulfill these needs, the AIMH Lab is developing VISIONE [24, 25, 26, 27], a large-scale video search system designed to search extensive multimedia databases in an interactive and user-friendly manner. VISIONE employs various content-based analysis tools to extract knowledge from raw shots, and it uses reliable indexing techniques for achieving good scalability.</p>
        <p>The system offers advanced search functionalities powered by publicly-available state-of-the-art image analysis models and by technologies developed internally for advanced multimedia representation and large-scale indexing. Specifically, VISIONE enables the search for video shots based on specific object classes, employing a canvas-oriented interface that allows users to specify objects and colors in particular positions within the frame. The latest versions of VISIONE feature state-of-the-art cross-modal models able to search keyframes and videos using natural language prompts. Specifically, VISIONE integrates some CLIP-based models [28, 29], as well as a novel cross-modal retrieval deep neural network, called ALADIN (ALign And DIstill Network) [30]. This network generates fixed-length features in a common visual-textual space by distilling the relevance scores from large pre-trained vision-language transformers, enabling quick and accurate keyframe retrieval using detailed natural language prompts.</p>
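        <p>As an off-the-shelf approximation of this text-to-keyframe search (VISIONE couples CLIP-based models and ALADIN with its own indexes), one can rank keyframes by their CLIP text-image scores. The checkpoint and image files below are stand-ins, not VISIONE's deployed models.</p>
        <preformat><![CDATA[
# Rank keyframes against a natural-language prompt with a public CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical keyframe files extracted from video shots.
keyframes = [Image.open(p) for p in ["kf_001.jpg", "kf_002.jpg"]]
inputs = processor(text=["a red car parked near a fountain"],
                   images=keyframes, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text[0]  # one score per keyframe
print(scores.argsort(descending=True))  # keyframes ranked by text relevance
]]></preformat>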
        <p>VISIONE also provides visual similarity techniques to help users browse results and find keyframes similar to the selected shots based on instance or semantic similarities. An overview of the interface is shown in Figure 2. Finally, VISIONE supports temporal queries, allowing the user to specify two independent queries used for searching videos containing two keyframes satisfying the two queries but having a temporal distance smaller than 12 seconds. This enables users to easily search for long-lasting specific actions or videos containing particular scene cuts.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <caption>
            <p>An overview of the VISIONE search interface: canvas-based queries (objects/colors), query by text, a duplicated search interface for temporal queries, search by video ID, keyframes and playback, and similarity search.</p>
          </caption>
        </fig>
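        <p>Conceptually, resolving such a temporal query amounts to joining the two result lists and keeping pairs of matches from the same video at most 12 seconds apart. The sketch below illustrates this with assumed (video, timestamp, score) tuples and a simple product of scores for ranking.</p>
        <preformat><![CDATA[
# Join two ranked keyframe lists under a 12-second temporal constraint.
def temporal_join(results_a, results_b, max_gap_s=12.0):
    """Keep videos where a match for query B follows a match for query A
    by at most max_gap_s seconds; rank by the product of the two scores."""
    hits = []
    for vid_a, t_a, s_a in results_a:
        for vid_b, t_b, s_b in results_b:
            if vid_a == vid_b and 0.0 <= t_b - t_a <= max_gap_s:
                hits.append((vid_a, t_a, t_b, s_a * s_b))
    return sorted(hits, key=lambda h: -h[3])


a = [("v1", 10.0, 0.9), ("v2", 3.0, 0.8)]
b = [("v1", 18.5, 0.7), ("v2", 40.0, 0.95)]
print(temporal_join(a, b))  # only v1 satisfies the 12-second constraint
]]></preformat>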
        <sec id="sec-2-4-1">
          <title>1https://lucene.apache.org/ 2https://github.com/facebookresearch/faiss</title>
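        <p>The sketch below conveys the STR intuition with a deliberately simplified quantization, not the exact transformation of [31, 32]: each positive component of a dense feature becomes a synthetic term repeated proportionally to its magnitude, yielding a document that any full-text engine can index.</p>
        <preformat><![CDATA[
# Turn a dense feature into a surrogate text "document" (simplified STR).
import numpy as np


def surrogate_text(feature, levels=10):
    """Map each non-negative component to a repeated synthetic term."""
    clipped = np.maximum(np.asarray(feature, dtype=float), 0.0)
    if clipped.max() > 0:
        clipped = clipped / clipped.max()  # normalize to [0, 1]
    terms = []
    for dim, value in enumerate(clipped):
        reps = int(round(value * levels))  # term frequency tracks magnitude
        terms.extend([f"f{dim}"] * reps)   # e.g., "f1 f1 f1 ..."
    return " ".join(terms)


vec = np.array([0.05, 0.9, 0.0, 0.4])
print(surrogate_text(vec))  # a "document" any full-text engine can index
]]></preformat>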
        <p>The system participated in the 12th Video Browser Showdown competition [33], where it ranked second in the overall leaderboard and performed well in several subtasks, achieving first place in visual known item search.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This work was partially supported by: the AI4Media</title>
        <p>project, funded by the EC (H2020 - Contract n. 951911);
PNRR - M4C2 - Investimento 1.3, Partenariato Esteso
PE00000013 - ”FAIR - Future Artificial Intelligence
Research” - Spoke 1 ”Human-centered AI”, funded by the
European Commission under the NextGeneration EU
programme.
data. The widely spread of multimedia data, such as
images and videos gathered from smartphones or smart
cameras, is driving the research of these tools. Indeed,
this deluge of visual data needs more and more AI-based
techniques able to automatically understand and browse
it. Specifically, we described some particularly
interesting challenges we are tackling, such as the Deep Learning
methods operating in contexts of scarce data or in
limitedpowered environments, as well as systems designed to
search extremely large video databases with interactive
and user-friendly interfaces.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>