<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi Sabetta</string-name>
          <email>luigi.sabetta.ext@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pelosin</string-name>
          <email>francesco.pelosin.ext@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Denevi</string-name>
          <email>giulia.denevi.ext@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Nicolosi</string-name>
          <email>alessandro.nicolosi@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leonardo Labs</institution>
          ,
          <addr-line>via Tiburtina, Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn>
          <p>∗ These authors contributed equally to this work.</p>
        </fn>
      </author-notes>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>The availability of labelled data is often limited, which hinders the potential of deep learning pipelines in industry. To address this issue, many industries resort to third-party solutions that involve human annotators manually labeling data. However, these solutions are costly, time-consuming, and their accuracy may be questionable. In this paper, we propose an alternative approach that utilizes a deep learning system capable of automatically labeling images with varying levels of supervision from human annotators. Our proposed Automatic Image Annotation system encodes a class using a prototype vector obtained by averaging the projections of images annotated as belonging to that class by a pre-trained backbone. The system efficiently annotates images in real-time without the need to memorize them. It can remember past annotations and also effectively identify new classes. We have developed a web application (link to code) to demonstrate the effectiveness of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Image Annotation</kwd>
        <kwd>Incremental Learning</kwd>
        <kwd>Few-Shot</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, Deep Learning has achieved impressive results on a variety of tasks, from computer vision [1] to NLP [2], and also as a tool to help the natural sciences model our world, such as in biology [3] and in physics [4]. This powerful tool is becoming more and more pervasive.</p>
      <sec id="sec-1-1">
        <title>But, it comes with a drawback: in the</title>
        <p>supervised learning realm
the training procedure
data generation does not constitute a problem; the
bottleneck lies in the slow and painful annotation procedure.
The standard way to cope with this incomplete data
is to rely on human annotators. Human annotation is
typically performed by companies that, after a careful
interaction with the costumer, agree on a labeling
scheme. When such scheme has been defined, the data
is forwarded to several humans that subjectively carry
the job. This subjective step intrinsically carries low
homogeneity on the final labeling. Another problem, is
the quantity of data to be curated. Obviously, this reflects
on the quantity of time required to accomplish the task
which, in the end, results in more expensive services.
Due to these reasons, the call to use automatic tools to
CEUR
Workshop
Proce dings
htp:/ceur-ws.org
ISN1613-073</p>
        <sec id="sec-1-1-1">
          <title>CEUR</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>Workshop Proceedings (CEUR-WS.org)</title>
          <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License</p>
          <p>Attribution 4.0 International (CC BY 4.0).
∗ These authors contributed equally to this work
annotate data is nowadays taking place [5]. Although
an important topic, Automatic Image Annotation (AIA)
has not received enough attention from the research
community. In fact, according to latest reviews on
the topic [5, 6] most of the published works are from
2003-2016.</p>
          <p>We then tackle such problem by devising a pipeline to
assist humans during the labeling process. By exploiting
a minimal human feedback, we can cut down the
timeonerous and error-prone process of image annotation.</p>
          <p>Contribution
the following.</p>
          <p>The main contributions of this work are
1. We develop an Automatic Image Annotation
(AIA) system to support humans in labelling
a stream of images by designing an appropriate
variant of the method described in [7].
2. The system is robust to domain-shifts . Since
the prototype vectors representing the diferent
classes are computed by projecting into the
embedding space of CLIP [8], the system is resilient
to domain-shifts and is almost free of
catastrophicforgetting.
3. The system is eficient and user friendly .</p>
          <p>The disentangled representation provided by
CLIP does not require additional expensive
training procedures and it reveals to be very efective
for this kind of application. The cost to store
the protypes for each class is negligible and
performs a on-line update which is computationally
eficient. Moreover it allows human interaction
at diferent levels. Such an interaction is also
facilitated by the development of a web app
implementing the system.</p>
          <p>4. The good performance of the system is con- side information, such as semantic label relationships, to
ifrmed through numerical experiments . We
analyze the proposed system under diferent
datasets and assess the optimal perfomance.</p>
          <p>Organization</p>
          <p>The work is organized as follows. In
section 2, we present an overview of the most related
work in literature. In section 3, we describe in details the
Automatic Image Annotation method we propose. In
seccorrectly predict tags [11].</p>
          <p>Our approach combines the deep learning and nearest
neighbor-based approaches, falling into a mixture of
these two categories. By leveraging the strengths of both
approaches, we achieve better performance in handling
the challenges posed by incremental learning scenarios,
as shown in our experimental results.
tion 4, we report the numerical experiments we used to
Incremental</p>
          <p>Learning</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Nowadays the need of test the performance of the method. Finally, in section 5, using Incremental Learning (or Continual Learning) we draw conclusion and possible future directions.</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related</title>
    </sec>
    <sec id="sec-3">
      <title>Works</title>
      <p>Our proposed method in this work combines Automatic
Image Annotation (AIA) with Incremental Learning. In
this section, we mention the most related literature of
these two fields with our work.</p>
      <p>Automatic Image Annotation
AIA has been the
subject of numerous studies in recent years, and the research
community has developed a common taxonomy for its
diferent categories [ 5, 6]. We briefly describe these
categories below. One category is generative model-based
AIA, which involves learning a joint probabilistic model
of image features and words from training datasets.
Anwhere the tag of the query data point is derived from the
most similar data points. For example, in [9], low-level
features are combined with distances to find the nearest
neighbor. Discriminative model-based AIA methods, on
the other hand, view image annotation as a multi-label
classification problem [ 10]. The third category, tag com- Let 
pletion models, works by assuming an optimal matrix
dataset describing the correspondence between data and
labels, and recovering such initial matrix. Lastly, deep
learning-based solutions couple feature extractors with
approaches to overcome data shortage is becoming more
and more critical. These approaches aim at facilitating
the learning process of new tasks, by exploiting the
knowledge accumulated by solving previous tasks.
However, these Incremental Learning systems have
often revealed to be subject to an undesired negative
efect: the so-called catastrophic forgetting.</p>
      <p>More
specifically, during the incremental learning process,
these models gradually forget the tasks they previously
learnt in the past. In quite recent years, the usage of
pre-trained backbones has revealed to be a possible
and efective solution to overcome this issue, see e.g.
[12, 13, 14, 15, 16]. The main idea supported in these
works is that pre-training</p>
      <p>mitigates forgetting by
exploiting the disentangling power of the pre-trained
backbones.
propose for Incremental Automatic Image Annotation.
3.</p>
    </sec>
    <sec id="sec-4">
      <title>Method</title>
      <p>be the images space. We propose a method to</p>
      <p>automatically label a sequence of images (  )=1 ∈   .</p>
      <p>The proposed method is reported in algorithm 1. As
explained in detail below, the algorithm allows the
interaction with a human annotator, at diferent levels.
other category is nearest neighbor model-based AIA, In the next section we describe in detail the method we
2: Initialization
and the new class detection accuracy   , which are
More specifically, at each iteration  = 1, … ,  , the
algorithm performs the steps below in order.</p>
      <p>1. The algorithm receives the current image   ∈ 
to be labelled.</p>
      <p>Φ.
2. The algorithm computes the corresponding
em</p>
      <p>bedded vector   = Φ(  ) ∈ ℝ by the backbone
3. If there exist a prototype vector in the current
memory with distance less than  to the current
embedded vector   , the algorithm associates to
the current image the class index  ̂  associated to
the closest prototype vector in the memory and
increases the frequency of that class represents ̂
by one. The algorithm also returns the indicator
 =̂ 0 , indicating that the returned class is among
the classes already observed in the memory. On
the contrary, if there no exist a prototype vector
in the memory distant at most  to the current
embedded vector, the algorithm associates the</p>
      <p>current image to a new class label  ̂  =  ,̂ with
frequency  ̂ ̂ = 1. In such a case, the algorithm</p>
      <p>also returns the indicator  =̂ 1 , indicating that
the returned class is a new class not contained in

̂

the actual memory.
4. The human annotator tells to the algorithm if the
current image belongs to.
current image belongs to a previously observed
class ( = 0 ) or a new one ( = 1 ) and it
provides to the algorithm the right class index   the
5. The algorithm uses the feedback received by the
human annotator in order to update its memory.</p>
      <p>Specifically, if the class has been already observed
before, the algorithm updates the prototype
vector associated to that class by computing an
incremental average of the prototype vectors
associated to that class. On the contrary, if the class is
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:</p>
      <p>∈ℳ (  ,   )
Else:</p>
      <p>Define  =̂ 0
Define  ̂  = argmin</p>
      <p>(old class)</p>
      <sec id="sec-4-1">
        <title>Update  ̂</title>
        <p>̂ =  ̂ ̂ + 1


 =̂  +̂ 1
Define  =̂ 1 (new class)
Define  ̂  =  ̂
Define  ̂ ̂ = 1



Update    =    + 1
Pick up    ∈ ℳ
Update    =   −1</p>
        <p>Else:</p>
        <p>Update    = 1
Define    =  
   +
1
 



5) If  = 0 :
4) Receive user’s check:  ∈ {0, 1} ,   ∈ ℕ</p>
        <p>Update ℳ+1 = ℳ ∪ {  }

6) Update the classification accuracy
 
=
 − 1</p>
        <p>1
+</p>
        <p>{ ̂ =  }
7) Update the new class confusion matrix
(, ) ̂ = (, ) ̂+ 1
(1)
(2)
26: Return</p>
        <p>, 
new, the algorithm adds the new prototype vector
  to its memory.
6. The algorithm updates the computation of the
classification accuracy and the new class
detection confusion matrix until that time, by
comparing the quantities estimated by the algorithm
(denoted by the symbol ⋅)̂ with the
corresponding exact counterparts returned by the human
annotator (denoted by the same letters without
the symbol ⋅)̂.</p>
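      <p>To make the steps above concrete, we sketch the prototype memory of Algorithm 1 in Python. This is an illustrative sketch, not the implementation behind our web app: the names PrototypeAnnotator, embed, distance and eps are ours, standing for the backbone Φ, the distance d and the threshold ε; prototypes are assumed to be NumPy-like vectors.</p>
      <preformat><![CDATA[
# Illustrative sketch of the prototype memory in Algorithm 1.
# `embed` stands for the backbone Phi (e.g. CLIP's image encoder),
# `distance` for d (Euclidean or cosine), `eps` for the threshold.
class PrototypeAnnotator:
    def __init__(self, embed, distance, eps):
        self.embed = embed
        self.distance = distance
        self.eps = eps
        self.prototypes = {}   # class index -> prototype vector p_c
        self.counts = {}       # class index -> frequency f_c

    def predict(self, image):
        """Steps 1-3: embed the image and query the prototype memory."""
        z = self.embed(image)
        if self.prototypes:
            c = min(self.prototypes,
                    key=lambda c: self.distance(self.prototypes[c], z))
            if self.distance(self.prototypes[c], z) < self.eps:
                return z, c, 0                      # old class: n_hat = 0
        new_c = max(self.prototypes, default=-1) + 1
        return z, new_c, 1                          # new class: n_hat = 1

    def update(self, z, y):
        """Step 5: incremental average p_y = ((f_y - 1) p_y + z) / f_y."""
        if y in self.prototypes:
            self.counts[y] += 1
            f = self.counts[y]
            self.prototypes[y] = ((f - 1) * self.prototypes[y] + z) / f
        else:
            self.counts[y] = 1
            self.prototypes[y] = z
]]></preformat>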
      <p>Interaction with the human annotator. In Algorithm 1, the human annotator can also be queried less frequently, only at some iterations. In such a case, at the iterations with no feedback from the human annotator, the update of the memory can be done in a similar way as described in step 5) of the algorithm, by replacing the true quantities with the corresponding estimates (denoted by the symbol ˆ). In section 4 we propose an analysis of the performance of the system under different quantities of human supervision.</p>
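      <p>As a minimal sketch of this partial-supervision mode (hypothetical names: oracle stands for the human annotator, interaction_prob for the human-interaction rate analyzed in section 4), the loop below queries the human only with a given probability and otherwise updates the memory with the algorithm's own estimate:</p>
      <preformat><![CDATA[
import random

def annotate_stream(annotator, images, oracle, interaction_prob=0.1):
    """Run the annotator over a stream, querying the human (`oracle`)
    only with probability `interaction_prob`; at the remaining
    iterations the memory is updated with the estimate y_hat itself."""
    labels = []
    for x in images:
        z, y_hat, n_hat = annotator.predict(x)
        if random.random() < interaction_prob:
            y = oracle(x)      # human feedback: the true class index
        else:
            y = y_hat          # self-labeling: trust the estimate
        annotator.update(z, y)
        labels.append(y)
    return labels
]]></preformat>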
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>To assess the performance of our method we defined
four diferent experimental settings and used the datasets
below.
• CIFAR100 [17]: the dataset is composed by 50000
train, 32 × 32 RGB images subdivided in 100
classess with 600 images each. This dataset has
been chosen to provide a comparative benchmark
in line with the research community.</p>
      <p>Accuracy and Distance Analysis The first
experiment is aimed at measuring the performance in terms of
the classification accuracy for the images in the dataset.</p>
      <p>We did not consider the first occurrence of each class
when computing the accuracy. In order to assess the
• CelebA [18]: the dataset is composed of 64 × 64 incremental improvement of the algorithm, we plot as
RGB images divided in 10177 classes; it is com- well the moving average accuracy with a variable time
posed of 202599 images. This dataset represents frame. We implemented algorithm 1 by using two
difera fine grained benchmark to assess our system. ent distances  ∶ ℝ  × ℝ → ℝ+:
• Core50 [19]: the dataset is composed by 164866,
128 × 128 RGB images of 50 domestic objects
divided in 10 classes. Each object appears in 11
diferent scenarios. We opted for this dataset to
provide a more realistic dataset benchmark and to
test the system under the domain shift . In Figure 4
we show the structure of the data.</p>
      <p>In all the experiments we implemented algorithm 1 with
backbone Φ equal to CLIP [8].
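      <p>For illustration, one possible way to instantiate the backbone Φ is through OpenAI's clip package; the paper does not state which CLIP variant or library was used, so the ViT-B/32 choice below is an assumption.</p>
      <preformat><![CDATA[
# Hypothetical instantiation of the backbone Phi with CLIP
# (https://github.com/openai/CLIP); the ViT-B/32 variant is an assumption.
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

def embed(image: Image.Image) -> torch.Tensor:
    """Project an image into CLIP's embedding space (no gradients needed)."""
    with torch.no_grad():
        return model.encode_image(preprocess(image).unsqueeze(0)).squeeze(0)
]]></preformat>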
      <p>Accuracy and Distance Analysis. The first experiment is aimed at measuring the performance in terms of classification accuracy for the images in the dataset. We did not consider the first occurrence of each class when computing the accuracy. In order to assess the incremental improvement of the algorithm, we also plot the moving average accuracy with a variable time frame. We implemented Algorithm 1 using two different distances d : ℝ^d × ℝ^d → ℝ_+: the Euclidean distance (ℓ2), d(z_1, z_2) = ‖z_1 − z_2‖_2, and the cosine distance (cos), d(z_1, z_2) = 1 − ⟨z_1, z_2⟩ / (‖z_1‖_2 ‖z_2‖_2).</p>
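      <p>In code, the two distances are a direct transcription of the formulas above (embeddings as NumPy vectors):</p>
      <preformat><![CDATA[
import numpy as np

def euclidean(z1: np.ndarray, z2: np.ndarray) -> float:
    # l2 distance: ||z1 - z2||_2
    return float(np.linalg.norm(z1 - z2))

def cosine(z1: np.ndarray, z2: np.ndarray) -> float:
    # cosine distance: 1 - <z1, z2> / (||z1||_2 * ||z2||_2)
    return float(1.0 - np.dot(z1, z2)
                 / (np.linalg.norm(z1) * np.linalg.norm(z2)))
]]></preformat>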
      <p>The comparison results in terms of accuracy across the three different datasets are presented in Figure 2. The system's performance shows a noticeable improvement over time, starting off with poor accuracy and gradually increasing across all datasets. This behavior is expected, as the centroids need to adjust to the data and "warm up" before delivering optimal performance. For a summary of the numerical accuracy values obtained, refer to Table 1. In the case of the Core50 dataset, the system is highly effective in separating all classes, achieving exceptional performance with just a few centroid updates. These results demonstrate the effectiveness of our approach in tackling real-world classification tasks. It is worth noting that, while challenging, the CIFAR100 dataset may not be fully representative of real-world usage. Nevertheless, we report our system's performance on this dataset to facilitate future comparisons. It is worth mentioning that the system requires 2000 iterations before achieving stable labeling on this dataset. The CelebA dataset poses the greatest challenge among the three, as it represents a fine-grained benchmark with a large number of classes and few examples per class. As a result, our system's performance on this dataset is relatively lower than on the other two. This observation highlights the importance of having a robust system that can align with the data, which requires a larger number of images (around 10k) for this particular dataset. Since the performance for cos is slightly better and stabler, we choose to use it for all the other experiments.</p>
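      <p>For reference, the running accuracy of equation (1) and the moving average accuracy plotted in these figures can be computed with small helpers such as the following (names and the default window are ours):</p>
      <preformat><![CDATA[
import numpy as np

def running_accuracy(correct):
    """acc_t = ((t - 1) * acc_{t-1} + 1{y_hat = y}) / t, as in Algorithm 1;
    `correct` is the boolean sequence 1{y_hat_t = y_t}."""
    acc, curve = 0.0, []
    for t, c in enumerate(correct, start=1):
        acc = ((t - 1) * acc + float(c)) / t
        curve.append(acc)
    return curve

def moving_average(values, window=50):
    """Moving-average accuracy over a sliding time frame."""
    v = np.asarray(values, dtype=float)
    return np.convolve(v, np.ones(window) / window, mode="valid")
]]></preformat>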
      <p>OOD (new class detection) Analysis. In this experiment, we conducted an out-of-distribution (OOD) analysis on the Core50 dataset by varying the classification threshold used to determine whether an instance belongs to a new class (i.e., is OOD) or not. The results are presented as the precision vs. recall curve relative to the confusion matrix in Figure 4. While these results are empirical and may not generalize to different datasets, they provide a starting point for a more thorough threshold estimation that could potentially be applicable to unseen datasets.</p>
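      <p>A sketch of this threshold sweep, under the assumption that we log, for each image, its distance to the closest prototype and a ground-truth flag marking whether it opens a new class (the helper name is ours):</p>
      <preformat><![CDATA[
import numpy as np

def new_class_precision_recall(distances, is_new, thresholds):
    """For each threshold eps, flag an image as a new class when its
    distance to the closest prototype exceeds eps, and compare with the
    ground truth to obtain one precision/recall point of the curve."""
    distances = np.asarray(distances, dtype=float)
    is_new = np.asarray(is_new, dtype=bool)
    curve = []
    for eps in thresholds:
        flagged = distances > eps
        tp = int(np.sum(flagged & is_new))
        precision = tp / max(int(np.sum(flagged)), 1)
        recall = tp / max(int(np.sum(is_new)), 1)
        curve.append((float(eps), precision, recall))
    return curve
]]></preformat>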
      <p>Domain Shift Analysis. In the third experiment, we evaluate the robustness of our pipeline under incremental domain shift on the Core50 dataset. Specifically, we compare the system's performance over a set of images (similarly to the previous plots) against the same set of images featuring coherent-ordered backgrounds; in other words, all images from the same background scenario are presented before moving to the next scenario. This experiment aims to demonstrate that the CLIP space is resilient enough to cope with distributional shift. As shown in Figure 4, there is only a slight drop in performance when the background scenario changes, which becomes increasingly irrelevant as the centroids fine-tune. These results demonstrate the effectiveness of our approach in handling domain shift, which is a critical aspect of real-world applications.</p>
      <p>Self-Annotation. In our final experiment, we evaluated the performance of our pipeline under minimal human feedback. We present the results on the challenging Core50 dataset under domain shift in Figure 4. The findings reveal that even with minimal interaction, our system can achieve good results, indicating that it can autonomously propose correct labels for the input data. These results demonstrate the effectiveness and efficiency of our approach in minimizing human intervention, making it suitable for real-world applications where manual labeling can be time-consuming and expensive.</p>
      <p>Figure 6: Comparison of different levels of human interaction (h.i.) in the self-labeling (SL) case. As can be seen, with a 10% probability of human feedback the system is able to autonomously label images with a small amount of data.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we proposed a deep learning system for automatically annotating a sequence of images with different levels of active human supervision. The system encodes a class by a prototype vector that is computed by averaging the projections of the images annotated as belonging to the same class by a pre-trained backbone. The system is computationally efficient and does not require memorizing the images. Our pipeline efficiently keeps memory of the past and, at the same time, identifies new classes. We also developed a web app for our method and carried out an extensive numerical analysis to assess the robustness of the system.</p>
      <p>In the future, it would be interesting to further investigate the applicability of the proposed method to different scenarios and to extend the pipeline with a learnable module. It would also be interesting to provide theoretical certification for its performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><mixed-citation>[1] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), NeurIPS, 2020.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, Y. Qiao, LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention, 2023. arXiv:2303.16199.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] J. M. Jumper et al., Highly accurate protein structure prediction with AlphaFold, Nature (2021).</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] J. Degrave, F. Felici, J. Buchli, M. Neunert, B. Tracey, F. Carpanese, T. Ewalds, R. Hafner, A. Abdolmaleki, D. de Las Casas, et al., Magnetic control of tokamak plasmas through deep reinforcement learning, Nature (2022).</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] I. Namatevs, K. Sudars, I. Polaka, Automatic data labeling by neural networks for the counting of objects in videos, Procedia Computer Science (2018). ICTE in Transportation and Logistics.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] Q. Cheng, Q. Zhang, P. Fu, C. Tu, S. Li, A survey and analysis on automatic image annotation, Pattern Recognition (2018).</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] F. Pelosin, Simpler is better: off-the-shelf continual learning through pretrained backbones, in: Transformers 4 Vision Workshop, CVPR, 2022.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: M. Meila, T. Zhang (Eds.), ICML, 2021.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in: ECCV, 2008.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] G. Carneiro, A. B. Chan, P. J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, TPAMI (2007).</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] Y. Niu, Z. Lu, J.-R. Wen, T. Xiang, S.-F. Chang, Multi-modal multi-scale deep learning for large-scale image annotation, 2018. arXiv:1709.01220.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] S. V. Mehta, D. Patil, S. Chandar, E. Strubell, An empirical investigation of the role of pre-training in lifelong learning, 2021. arXiv:2112.09153.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] A. Cossu, T. Tuytelaars, A. Carta, L. C. Passaro, V. Lomonaco, D. Bacciu, Continual pre-training mitigates forgetting in language and vision, 2022. arXiv:2205.09357.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] T. Wu, G. Swaminathan, Z. Li, A. Ravichandran, N. Vasconcelos, R. Bhotika, S. Soatto, Class-incremental learning with strong pre-trained models, in: CVPR, 2022.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] K. Lee, Y. Zhong, Y. Wang, Do pre-trained models benefit equally in continual learning?, in: WACV, 2023.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] T. Wu, M. Caccia, Z. Li, Y. Li, G. Qi, G. Haffari, Pretrained language model in continual learning: A comparative study, in: ICLR, 2022.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, University of Toronto, 2009.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: ICCV, 2015.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] V. Lomonaco, D. Maltoni, CORe50: a new dataset and benchmark for continuous object recognition, in: CoRL, 2017.</mixed-citation></ref>
    </ref-list>
  </back>
</article>