<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Imaging 7 (2021) 76. doi:1 0 . 3 3 9 0 /
C. Vairo</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AIMH Lab 2023 Activities for Vision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Ciampi</string-name>
          <email>luca.ciampi@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Bolettieri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Carrara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Di Benedetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Gennaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Messina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Vadicamo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Vairo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision</institution>
          ,
          <addr-line>Multimedia Understanding, Deep Learning, Large-scale Video Retrieval, Learning with Scarce Data</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>via G. Moruzzi, 1, Pisa, 56100</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>2161</volume>
      <fpage>591</fpage>
      <lpage>596</lpage>
      <abstract>
        <p>The explosion of smartphones and cameras has led to a vast production of multimedia data. Consequently, Artificial Intelligence-based tools for automatically understanding and exploring these data have recently gained much attention. In this short paper, we report some activities of the Artificial Intelligence for Media and Humanities (AIMH) laboratory of the ISTI-CNR, tackling some challenges in the field of Computer Vision for the automatic understanding of visual data and for novel interactive tools aimed at multimedia data exploration. Specifically, we provide innovative solutions based on Deep Learning techniques carrying out typical vision tasks such as object detection and visual counting, with particular emphasis on scenarios characterized by scarcity of labeled data needed for the supervised training and on environments with limited power resources imposing miniaturization of the models. Furthermore, we describe VISIONE, our large-scale video search system designed to search extensive multimedia databases in an interactive and user-friendly manner.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Vision</kwd>
        <kwd>Multimedia Understanding</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Large-scale Video Retrieval</kwd>
        <kwd>Learning with Scarce Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The pervasive difusion of smartphones and cheap
cameras leads to an exponential daily production of digital
visual data, such as images and videos. In this context,
a constant increase of attention to the automatic
underputer Vision has become one of the hottest fields that
make extensive use of Artificial Intelligence (AI), to such
day lives, and they are making human life easier. Some
examples include pedestrian detection and human
activity monitoring in surveillance systems or face detection
and recognition in smartphones. Furthermore, an
imtities of data coming from diferent sources is the need
to eficiently and efectively organize them so that also
non-expert users can easily manage and browse them.
tions carried out by the Artificial Intelligence for Media
and Humanities (AIMH) laboratory of the ISTI-CNR,
focusing on multimedia understanding and novel
interactive software for multimedia data exploration.
Specifinized by CINI, May 29–31, 2023, Pisa, Italy
∗Corresponding author.
(C. Vairo)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Research Areas</title>
      <sec id="sec-2-1">
        <title>2.1. Learning with Scarce Data</title>
        <p>Current vision systems powered by data-driven AI
methods sufer from strong domain shifts and are not
invariant to substantial context variations. Indeed, these AI
technologies usually need a massive amount of
annotated data required for the supervised learning phase,
and they often sufer when applied to unseen data.
Consequently, adopting these solutions is often dampened for
large-scale contexts, considering that the data annotation
procedure requires extraordinary human efort and
colThis paper presents some research topics and applica- limited power resources.</p>
        <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License lecting data for every specific scenario is unfeasible. The
Attribution 4.0 International (CC BY 4.0).</p>
        <p>AIMH Lab is tackling this challenge from several sides, based on single-image classification [ 11], which can
mitofering diferent approaches detailed in the following. igate the domain gap between annotated datasets
containing violent/non-violent clips in general contexts and
2.1.1. Learning from Synthetic Data a recently introduced collection of videos specific for
detection of violent behaviors in public transport [12].</p>
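          <p>As a rough illustration of this labels-for-free pipeline, the sketch below loads synthetic frames together with the annotations emitted by the engine at capture time. The directory layout and JSON schema are illustrative assumptions, not the format of our released collections; PyTorch and torchvision are used.</p>
          <preformat><![CDATA[
# A minimal sketch (not the released toolchain) of consuming a synthetic
# collection whose labels were generated by the graphical engine itself.
import json
from pathlib import Path

import torch
from torch.utils.data import Dataset
from torchvision.io import read_image


class SyntheticDetectionDataset(Dataset):
    """Frames rendered by a graphical engine; labels come for free."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.frames = sorted((self.root / "images").glob("*.png"))

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, i):
        frame = self.frames[i]
        image = read_image(str(frame)).float() / 255.0
        # One JSON per frame, emitted by the engine at capture time (assumed).
        ann = json.loads((self.root / "labels" / f"{frame.stem}.json").read_text())
        target = {
            "boxes": torch.tensor(ann["boxes"], dtype=torch.float32),   # [N, 4] xyxy
            "labels": torch.tensor(ann["classes"], dtype=torch.int64),  # person, helmet, ...
        }
        return image, target
]]></preformat>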
        <p>An appealing solution to mitigate the human efort
needed for manual annotation is to gather synthetic data 2.1.3. Learning from multi-rating data
from virtual environments resembling the real world,
where the labels are automatically collected by interact- Often, non-trivial patterns produce a non-negligible
dising with the graphical engine. In this context, we released agreement between multiple annotators, such as when
and exploited several synthetic collections of images to dealing with biological structures in microscopy images.
build DL solutions that carry out several human-centered A possible solution to have more robust labels is to
agtasks. In particular, in [1], we presented an embedded gregate and average the decisions given by several
anmodular AI-assisted Computer Vision-based system that notators to the same data. However, the scale of many
provides many functionalities to help monitor individual tasks prevents the creation of large datasets annotated by
and collective human safety rules, ranging from social several experts, i.e., annotators prefer to label new data
distance estimation and crowd counting to Personal Pro- rather than label the same data more than once, resulting
tective Equipment (PPE) detection (such as helmets and in large, single-labeled weakly labeled datasets and very
masks). Our solution consists of multiple modules rely- small multi-labeled data, from which it is crucial to make
ing on neural network components, each responsible for the most. In [13], we proposed a two-stage counting
specific functionalities that users can easily enable, con- strategy in a weakly labeled data scenario. In the first
ifgure, and combine. One of the main peculiarity is that stage, we trained state-of-the-art DL-based
methodolosome of these components have been trained by exploit- gies to detect and count biological structures exploiting
ing synthetic data collected from the GTAV videogame a large set of single-labeled data sure to contain errors;
and automatically annotated [2] [3] [4] [5]. Furthermore, in the second stage, using a small set of multi-labeled
we employed the GTAV videogame also for gathering data, we refined the predictions, increasing the
correlaother collections of images and labels to train a DL-based tion between the scores assigned to the samples and the
approach for human fall detection [6] and a technique for agreement of the raters on the annotations, i.e., we
immulti-camera vehicle tracking in urban scenarios. Finally, proved confidence calibration by taking advantage of the
more recently, we proposed CrowdSim2, a new synthetic redundant information characterizing the multi-labeled
collection of images suitable for people and vehicle detec- data. Furthermore, we are currently exploring the
possition and tracking gathered from a simulator based on the bility of exploiting multi-labeled data from annotations
Unity graphical engine [7] [8] consisting of thousands automatically generated by several state-of-the-art
detecof images collected from various synthetic scenarios re- tors.
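          <p>The following is a hedged sketch of such output-space adversarial training: a density regressor is supervised on source data while a discriminator pushes target predictions to look source-like. The discriminator architecture and the adversarial weight are illustrative stand-ins, not the published design of [9, 10].</p>
          <preformat><![CDATA[
# Sketch of output-space adversarial UDA for density estimation (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class OutputDiscriminator(nn.Module):
    """Judges whether a predicted density map comes from source or target."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),
        )

    def forward(self, density):
        return self.net(density)


def uda_step(model, disc, opt_g, opt_d, src_img, src_gt, tgt_img, adv_w=0.01):
    # 1) Supervised density regression on the labeled source domain.
    src_pred = model(src_img)
    loss_sup = F.mse_loss(src_pred, src_gt)
    # 2) Adversarial alignment: make target predictions look source-like.
    tgt_pred = model(tgt_img)
    d_tgt = disc(tgt_pred)
    loss_adv = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))
    opt_g.zero_grad()
    (loss_sup + adv_w * loss_adv).backward()  # adv_w = 0.01 is an assumption
    opt_g.step()
    # 3) Train the discriminator to separate source from target outputs.
    d_src = disc(src_pred.detach())
    d_tgt = disc(tgt_pred.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
]]></preformat>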
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Learning from Multi-rating Data</title>
          <p>Often, non-trivial patterns produce a non-negligible disagreement between multiple annotators, such as when dealing with biological structures in microscopy images. A possible solution for obtaining more robust labels is to aggregate and average the decisions given by several annotators on the same data. However, the scale of many tasks prevents the creation of large datasets annotated by several experts, i.e., annotators prefer to label new data rather than label the same data more than once, resulting in large, single-labeled, weakly labeled datasets and very small multi-labeled ones, from which it is crucial to make the most. In [13], we proposed a two-stage counting strategy in a weakly labeled data scenario. In the first stage, we trained state-of-the-art DL-based methodologies to detect and count biological structures exploiting a large set of single-labeled data sure to contain errors; in the second stage, using a small set of multi-labeled data, we refined the predictions, increasing the correlation between the scores assigned to the samples and the agreement of the raters on the annotations, i.e., we improved confidence calibration by taking advantage of the redundant information characterizing the multi-labeled data. Furthermore, we are currently exploring the possibility of exploiting multi-labeled data from annotations automatically generated by several state-of-the-art detectors.</p>
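          <p>To make the second-stage goal concrete, the toy sketch below checks how well per-image model confidence tracks inter-rater agreement. The agreement measure (one minus the normalized spread of the annotators' counts) is an assumption for illustration, not the exact formulation of [13].</p>
          <preformat><![CDATA[
# Toy check: does model confidence correlate with rater agreement?
import numpy as np
from scipy.stats import pearsonr


def rater_agreement(counts_per_annotator):
    """counts_per_annotator: [num_images, num_annotators] object counts.
    Agreement = 1 - normalized spread of the annotators' counts (assumed)."""
    mean = counts_per_annotator.mean(axis=1)
    spread = counts_per_annotator.std(axis=1)
    return 1.0 - spread / np.maximum(mean, 1e-6)


# Toy multi-labeled set: 4 images, 3 annotators each.
counts = np.array([[10, 11, 10], [52, 48, 55], [7, 7, 7], [90, 70, 110]])
model_confidence = np.array([0.9, 0.7, 0.95, 0.4])  # per-image scores
r, _ = pearsonr(model_confidence, rater_agreement(counts))
print(f"confidence/agreement correlation: {r:.2f}")
]]></preformat>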
        <sec id="sec-2-1-1">
          <title>Trafic-related issues are constantly increasing, and to</title>
          <p>2.1.2. Unsupervised Domain Adaptation morrow’s cities can be considered intelligent only if they
provide smart mobility applications, such as smart
parkUnsupervised Domain Adaptation (UDA) is a technique ing and trafic management. In this context, city camera
that addresses the Domain Shift problem by taking a networks represent the perfect tool for monitoring large
source labeled dataset and a target unlabeled one. The urban areas while providing visual data to AI systems
challenge here is to automatically infer some knowledge responsible for extracting relevant information and
sugfrom the target data to reduce the gap between the two gesting/making decisions helpful for intelligent mobility
domains. In [9] and [10], the AIMH Lab introduced an applications. However, implementing these solutions is
end-to-end CNN-based UDA algorithm for trafic density often hampered by the massive flow of data that must be
estimation and counting, based on adversarial learning sent to central servers or the cloud for processing. On
performed directly on the output space. We validated our the other hand, the recent paradigm of edge computing
approach over diferent types of domain shifts, i.e., the promotes the decentralization of data processing to the
Camera2Camera, the Day2Night, and the Synthetic2Real border, i.e., where the data are gathered, thus reducing
domain shifts, demonstrating significant improvement the trafic on the network and the pressure on central
compared to the performance of the model without do- servers. Nonetheless, this promising standard brings
main adaptation. Furthermore, very recently, we also along with it also some new challenges related to the
proposed a UDA scheme for video violence detection limited computational resources on the disposable edge
devices and also concerning security inside IoT networks.</p>
        <p>The AIMH Lab proposed, and is actively researching, DL-based solutions for intelligent parking monitoring matching the edge AI idea, i.e., solutions that can run directly onboard embedded vision systems equipped with limited computational capabilities, able to capture images, process them, and eventually communicate the elaborated information to other devices. Specifically, in [14] and [15], we introduced a decentralized and efficient solution for visual parking lot occupancy detection, which exploits a miniaturized CNN to classify parking space occupancy. It runs directly onboard smart cameras built using the Raspberry Pi platform equipped with a camera module. On the other hand, in [16] and [17], we extended this application by proposing a DL-based method that can instead estimate the number of vehicles present in the Field of View of the smart cameras. Such a task is more flexible than the previous one since it does not rely on meta-information regarding the monitored scene, such as the position of the parking lots. Moreover, in [18], we proposed a DL solution to automatically detect and count vehicles in images taken from a camera-equipped drone, running directly onboard the UAV.</p>
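        <p>A minimal stand-in for such a miniaturized classifier is sketched below: a tiny CNN that labels a cropped parking space as free or busy, small enough to run on a Raspberry Pi class device. Layer sizes are illustrative assumptions, not the architecture of [14, 15].</p>
        <preformat><![CDATA[
# Tiny binary classifier for single parking-space crops (PyTorch sketch).
import torch
import torch.nn as nn


class MiniParkNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)  # 0 = free, 1 = busy

    def forward(self, x):  # x: [B, 3, H, W] crops of parking spaces
        return self.classifier(self.features(x).flatten(1))


model = MiniParkNet().eval()
crop = torch.rand(1, 3, 96, 96)  # one parking-space crop
with torch.no_grad():
    print(model(crop).softmax(-1))  # occupancy probabilities
]]></preformat>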
        <p>More recently, we proposed a multi-camera system capable of automatically estimating the number of cars in an entire parking lot directly on board the edge devices [19]. The peculiarity of this solution is that, unlike most of the works in the literature, which focus on the analysis of single images, it uses multiple visual sources to monitor a wider parking area from different perspectives. More in detail, it comprises an on-device DL-based detector that locates and counts the vehicles from the captured images of a single smart camera, together with a decentralized geometric-based approach that can analyze the inter-camera shared areas and merge the data acquired by all the devices.</p>
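        <p>The sketch below gives a flavor of the geometric merging step under simplifying assumptions: per-camera detections are projected onto a common ground plane through placeholder homographies, and points closer than an assumed radius are counted as the same physical vehicle. It is an illustration, not the exact procedure of [19].</p>
        <preformat><![CDATA[
# Geometric de-duplication of detections across overlapping camera views.
import numpy as np


def to_ground_plane(points_px, H):
    """Project [N, 2] pixel detections to the ground plane via homography H."""
    pts = np.c_[points_px, np.ones(len(points_px))] @ H.T
    return pts[:, :2] / pts[:, 2:3]


def merged_count(per_camera_pts, homographies, radius=1.0):
    """Count vehicles once across the cameras' shared areas."""
    kept = []
    for pts_px, H in zip(per_camera_pts, homographies):
        for p in to_ground_plane(pts_px, H):
            # Keep the point only if no other camera already reported it.
            if all(np.linalg.norm(p - q) >= radius for q in kept):
                kept.append(p)
    return len(kept)
]]></preformat>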
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Object Detection and Visual Counting in Multi-disciplinary Areas</title>
        <fig id="fig1">
          <label>Figure 1</label>
          <caption>
            <p>A sample of our MOBDrone dataset for man overboard detection from drones.</p>
          </caption>
        </fig>
        <p>The AIMH Lab is currently researching the field of object detection and crowd counting, proposing novel solutions in multi-disciplinary areas. In [20], we collected and publicly released MOBDrone, a large-scale dataset of aerial footage of people who, being in the water, simulated the need to be rescued. This data includes 66 video clips with 126,170 frames manually annotated with more than 180K bounding boxes (of which more than 113K belong to the person category). We provide a sample of our dataset in Figure 1. Furthermore, we presented an in-depth experimental analysis of the performance of several state-of-the-art object detectors over this newly established scenario.</p>
        <p>In [21], we introduced a Computer Vision tool in the Smart Agriculture area aimed at automatically counting pests in pictures of sticky paper traps; controlling pest population is crucial in agriculture, and an effective Integrated Pest Management tool can prevent crop damage and suggest corrective measures to keep pests from causing significant problems. On the other hand, in [22], we tackled the problem of counting cells in microscopy images, a fundamental step in diagnosing several diseases in biology and medicine. Finally, in [23], we proposed a smart-surveillance application for crowd counting in videos gathered from city cameras. Here, we also considered the temporal context, relying on the evidence that when instances of particular objects are moving – like persons or cars – it is usually easier to spot their presence and consequently count them with greater accuracy. Specifically, we introduced a transformer-based attentive mechanism where the movement flows are initially estimated through a network trained by enforcing person-conservation laws – no persons can suddenly disappear between consecutive frames in the video except at the frame borders. Then, the person flow is integrated to get the actual people count. This solution obtained state-of-the-art results, surpassing frame-based visual counting networks.</p>
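        <p>As a toy rendering of this flow-integration idea, the sketch below accumulates per-cell inflows and outflows so that counts evolve only through motion, in line with the conservation constraint. Shapes and flow maps are illustrative assumptions, not the learned flows of [23].</p>
        <preformat><![CDATA[
# Counting by integrating people flows instead of re-detecting each frame.
import numpy as np


def integrate_people_flow(initial_density, inflows, outflows):
    """initial_density: [H, W] map; inflows/outflows: [T, H, W] per-cell flows.

    Conservation constraint: densities change only through flows, so nobody
    appears or vanishes between frames except via the border cells, which
    the inflow/outflow maps account for explicitly."""
    density = initial_density.astype(float).copy()
    counts = [density.sum()]
    for t in range(len(inflows)):
        density = density + inflows[t] - outflows[t]
        counts.append(density.sum())
    return counts  # estimated people count per frame
]]></preformat>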
      </sec>
      <sec id="sec-2-4">
        <title>2.4. The VISIONE Search System</title>
        <p>With the increasing diffusion of multimedia databases, there is today, as never before, the need to analyze, organize, and index all the produced data so that they can be quickly and efficiently retrieved. The development of tools to automatically analyze and index all these contents constitutes a significant achievement in the automatic content-based organization and browsing of large multimedia databases. Exciting applications of these technologies are, for example, the organization of raw multimedia data scraped from the web or the browsing of audiovisual archives owned by national televisions. To fulfill these needs, the AIMH Lab is developing VISIONE [24, 25, 26, 27], a large-scale video search system designed to search extensive multimedia databases in an interactive and user-friendly manner. VISIONE employs various content-based analysis tools to extract knowledge from raw shots, and it uses reliable indexing techniques for achieving good scalability.</p>
        <p>The system offers advanced search functionalities powered by publicly-available state-of-the-art image analysis models and by technologies developed internally for advanced multimedia representation and large-scale indexing. Specifically, VISIONE enables the search for video shots based on specific object classes, employing a canvas-oriented interface that allows users to specify objects and colors in particular positions within the frame. The latest versions of VISIONE feature state-of-the-art cross-modal models able to search keyframes and videos using natural language prompts. Specifically, VISIONE integrates some CLIP-based models [28, 29], as well as a novel cross-modal retrieval deep neural network, called ALADIN (ALign And DIstill Network) [30]. This network generates fixed-length features in a common visual-textual space by distilling the relevance scores from large pre-trained vision-language transformers, enabling quick and accurate keyframe retrieval using detailed natural language prompts.</p>
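        <p>As an off-the-shelf approximation of this text-to-keyframe search (VISIONE couples CLIP-based models and ALADIN with its own indexes), one can rank keyframes by their CLIP text-image scores. The checkpoint and image files below are stand-ins, not VISIONE's deployed models.</p>
        <preformat><![CDATA[
# Rank keyframes against a natural-language prompt with a public CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical keyframe files extracted from video shots.
keyframes = [Image.open(p) for p in ["kf_001.jpg", "kf_002.jpg"]]
inputs = processor(text=["a red car parked near a fountain"],
                   images=keyframes, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text[0]  # one score per keyframe
print(scores.argsort(descending=True))  # keyframes ranked by text relevance
]]></preformat>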
        <p>VISIONE also provides visual similarity techniques to help users browse results and find keyframes similar to the selected shots based on instance or semantic similarities. An overview of the interface is shown in Figure 2. Finally, VISIONE supports temporal queries, allowing the user to specify two independent queries used for searching videos containing two keyframes satisfying the two queries but having a temporal distance smaller than 12 seconds. This enables users to easily search for long-lasting specific actions or videos containing particular scene cuts.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <caption>
            <p>An overview of the VISIONE search interface: canvas-based queries (objects/colors), query by text, a duplicated search interface for temporal queries, search by video ID, keyframes and playback, and similarity search.</p>
          </caption>
        </fig>
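        <p>Conceptually, resolving such a temporal query amounts to joining the two result lists and keeping pairs of matches from the same video at most 12 seconds apart. The sketch below illustrates this with assumed (video, timestamp, score) tuples and a simple product of scores for ranking.</p>
        <preformat><![CDATA[
# Join two ranked keyframe lists under a 12-second temporal constraint.
def temporal_join(results_a, results_b, max_gap_s=12.0):
    """Keep videos where a match for query B follows a match for query A
    by at most max_gap_s seconds; rank by the product of the two scores."""
    hits = []
    for vid_a, t_a, s_a in results_a:
        for vid_b, t_b, s_b in results_b:
            if vid_a == vid_b and 0.0 <= t_b - t_a <= max_gap_s:
                hits.append((vid_a, t_a, t_b, s_a * s_b))
    return sorted(hits, key=lambda h: -h[3])


a = [("v1", 10.0, 0.9), ("v2", 3.0, 0.8)]
b = [("v1", 18.5, 0.7), ("v2", 40.0, 0.95)]
print(temporal_join(a, b))  # only v1 satisfies the 12-second constraint
]]></preformat>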
        <sec id="sec-2-4-1">
          <title>1https://lucene.apache.org/ 2https://github.com/facebookresearch/faiss</title>
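        <p>The sketch below conveys the STR intuition with a deliberately simplified quantization, not the exact transformation of [31, 32]: each positive component of a dense feature becomes a synthetic term repeated proportionally to its magnitude, yielding a document that any full-text engine can index.</p>
        <preformat><![CDATA[
# Turn a dense feature into a surrogate text "document" (simplified STR).
import numpy as np


def surrogate_text(feature, levels=10):
    """Map each non-negative component to a repeated synthetic term."""
    clipped = np.maximum(np.asarray(feature, dtype=float), 0.0)
    if clipped.max() > 0:
        clipped = clipped / clipped.max()  # normalize to [0, 1]
    terms = []
    for dim, value in enumerate(clipped):
        reps = int(round(value * levels))  # term frequency tracks magnitude
        terms.extend([f"f{dim}"] * reps)   # e.g., "f1 f1 f1 ..."
    return " ".join(terms)


vec = np.array([0.05, 0.9, 0.0, 0.4])
print(surrogate_text(vec))  # a "document" any full-text engine can index
]]></preformat>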
        <p>The system participated in the 12th Video Browser Showdown competition [33], where it ranked second in the overall leaderboard and performed well in several subtasks, achieving first place in visual known item search.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This work was partially supported by: the AI4Media</title>
        <p>project, funded by the EC (H2020 - Contract n. 951911);
PNRR - M4C2 - Investimento 1.3, Partenariato Esteso
PE00000013 - ”FAIR - Future Artificial Intelligence
Research” - Spoke 1 ”Human-centered AI”, funded by the
European Commission under the NextGeneration EU
programme.
data. The widely spread of multimedia data, such as
images and videos gathered from smartphones or smart
cameras, is driving the research of these tools. Indeed,
this deluge of visual data needs more and more AI-based
techniques able to automatically understand and browse
it. Specifically, we described some particularly
interesting challenges we are tackling, such as the Deep Learning
methods operating in contexts of scarce data or in
limitedpowered environments, as well as systems designed to
search extremely large video databases with interactive
and user-friendly interfaces.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>