<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi Sabetta</string-name>
          <email>luigi.sabetta.ext@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pelosin</string-name>
          <email>francesco.pelosin.ext@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Denevi</string-name>
          <email>giulia.denevi.ext@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Nicolosi</string-name>
          <email>alessandro.nicolosi@leonardocompany.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leonardo Labs</institution>
          ,
          <addr-line>via Tiburtina, Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn>
          <p>∗ These authors contributed equally to this work.</p>
        </fn>
      </author-notes>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>The availability of labelled data is often limited, which hinders the potential of deep learning pipelines in industry. To address this issue, many industries resort to third-party solutions that involve human annotators manually labeling data. However, these solutions are costly, time-consuming, and their accuracy may be questionable. In this paper, we propose an alternative approach that utilizes a deep learning system capable of automatically labeling images with varying levels of supervision from human annotators. Our proposed Automatic Image Annotation system encodes a class using a prototype vector obtained by averaging the projections of images annotated as belonging to that class by a pre-trained backbone. The system efficiently annotates images in real-time without the need to memorize them. It can remember past annotations and also effectively identify new classes. We have developed a web application (link to code) to demonstrate the effectiveness of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Image Annotation</kwd>
        <kwd>Incremental Learning</kwd>
        <kwd>Few-Shot</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, Deep Learning has achieved impressive results on a variety of tasks, from computer vision [1] to NLP [2], and also as a tool to help the natural sciences model our world, such as in biology [3] and in physics [4]. This powerful tool is becoming more and more pervasive.</p>
      <sec id="sec-1-1">
        <title>But, it comes with a drawback: in the</title>
        <p>supervised learning realm
the training procedure
data generation does not constitute a problem; the
bottleneck lies in the slow and painful annotation procedure.
The standard way to cope with this incomplete data
is to rely on human annotators. Human annotation is
typically performed by companies that, after a careful
interaction with the costumer, agree on a labeling
scheme. When such scheme has been defined, the data
is forwarded to several humans that subjectively carry
the job. This subjective step intrinsically carries low
homogeneity on the final labeling. Another problem, is
the quantity of data to be curated. Obviously, this reflects
on the quantity of time required to accomplish the task
which, in the end, results in more expensive services.
Due to these reasons, the call to use automatic tools to
CEUR
Workshop
Proce dings
htp:/ceur-ws.org
ISN1613-073</p>
        <sec id="sec-1-1-1">
          <title>CEUR</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>Workshop Proceedings (CEUR-WS.org)</title>
          <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License</p>
          <p>Attribution 4.0 International (CC BY 4.0).
∗ These authors contributed equally to this work
annotate data is nowadays taking place [5]. Although
an important topic, Automatic Image Annotation (AIA)
has not received enough attention from the research
community. In fact, according to latest reviews on
the topic [5, 6] most of the published works are from
2003-2016.</p>
          <p>We then tackle such problem by devising a pipeline to
assist humans during the labeling process. By exploiting
a minimal human feedback, we can cut down the
timeonerous and error-prone process of image annotation.</p>
          <p>Contribution
the following.</p>
          <p>The main contributions of this work are
1. We develop an Automatic Image Annotation
(AIA) system to support humans in labelling
a stream of images by designing an appropriate
variant of the method described in [7].
2. The system is robust to domain-shifts . Since
the prototype vectors representing the diferent
classes are computed by projecting into the
embedding space of CLIP [8], the system is resilient
to domain-shifts and is almost free of
catastrophicforgetting.
3. The system is eficient and user friendly .</p>
          <p>The disentangled representation provided by
CLIP does not require additional expensive
training procedures and it reveals to be very efective
for this kind of application. The cost to store
the protypes for each class is negligible and
performs a on-line update which is computationally
eficient. Moreover it allows human interaction
at diferent levels. Such an interaction is also
facilitated by the development of a web app
implementing the system.</p>
          <p>4. The good performance of the system is con- side information, such as semantic label relationships, to
ifrmed through numerical experiments . We
analyze the proposed system under diferent
datasets and assess the optimal perfomance.</p>
          <p>Organization</p>
          <p>The work is organized as follows. In
section 2, we present an overview of the most related
work in literature. In section 3, we describe in details the
Automatic Image Annotation method we propose. In
seccorrectly predict tags [11].</p>
          <p>Our approach combines the deep learning and nearest
neighbor-based approaches, falling into a mixture of
these two categories. By leveraging the strengths of both
approaches, we achieve better performance in handling
the challenges posed by incremental learning scenarios,
as shown in our experimental results.
tion 4, we report the numerical experiments we used to
Incremental</p>
          <p>Learning</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Nowadays the need of test the performance of the method. Finally, in section 5, using Incremental Learning (or Continual Learning) we draw conclusion and possible future directions.</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related</title>
    </sec>
    <sec id="sec-3">
      <title>Works</title>
      <p>Our proposed method in this work combines Automatic
Image Annotation (AIA) with Incremental Learning. In
this section, we mention the most related literature of
these two fields with our work.</p>
      <p>Automatic Image Annotation
AIA has been the
subject of numerous studies in recent years, and the research
community has developed a common taxonomy for its
diferent categories [ 5, 6]. We briefly describe these
categories below. One category is generative model-based
AIA, which involves learning a joint probabilistic model
of image features and words from training datasets.
Anwhere the tag of the query data point is derived from the
most similar data points. For example, in [9], low-level
features are combined with distances to find the nearest
neighbor. Discriminative model-based AIA methods, on
the other hand, view image annotation as a multi-label
classification problem [ 10]. The third category, tag com- Let 
pletion models, works by assuming an optimal matrix
dataset describing the correspondence between data and
labels, and recovering such initial matrix. Lastly, deep
learning-based solutions couple feature extractors with
approaches to overcome data shortage is becoming more
and more critical. These approaches aim at facilitating
the learning process of new tasks, by exploiting the
knowledge accumulated by solving previous tasks.
However, these Incremental Learning systems have
often revealed to be subject to an undesired negative
efect: the so-called catastrophic forgetting.</p>
      <p>More
specifically, during the incremental learning process,
these models gradually forget the tasks they previously
learnt in the past. In quite recent years, the usage of
pre-trained backbones has revealed to be a possible
and efective solution to overcome this issue, see e.g.
[12, 13, 14, 15, 16]. The main idea supported in these
works is that pre-training</p>
      <p>mitigates forgetting by
exploiting the disentangling power of the pre-trained
backbones.
propose for Incremental Automatic Image Annotation.
3.</p>
    </sec>
    <sec id="sec-4">
      <title>Method</title>
      <p>be the images space. We propose a method to</p>
      <p>automatically label a sequence of images (  )=1 ∈   .</p>
      <p>The proposed method is reported in algorithm 1. As
explained in detail below, the algorithm allows the
interaction with a human annotator, at diferent levels.
other category is nearest neighbor model-based AIA, In the next section we describe in detail the method we
2: Initialization
and the new class detection accuracy   , which are
More specifically, at each iteration  = 1, … ,  , the
algorithm performs the steps below in order.</p>
      <p>1. The algorithm receives the current image   ∈ 
to be labelled.</p>
      <p>Φ.
2. The algorithm computes the corresponding
em</p>
      <p>bedded vector   = Φ(  ) ∈ ℝ by the backbone
3. If there exist a prototype vector in the current
memory with distance less than  to the current
embedded vector   , the algorithm associates to
the current image the class index  ̂  associated to
the closest prototype vector in the memory and
increases the frequency of that class represents ̂
by one. The algorithm also returns the indicator
 =̂ 0 , indicating that the returned class is among
the classes already observed in the memory. On
the contrary, if there no exist a prototype vector
in the memory distant at most  to the current
embedded vector, the algorithm associates the</p>
      <p>current image to a new class label  ̂  =  ,̂ with
frequency  ̂ ̂ = 1. In such a case, the algorithm</p>
      <p>also returns the indicator  =̂ 1 , indicating that
the returned class is a new class not contained in

̂

the actual memory.
4. The human annotator tells to the algorithm if the
current image belongs to.
current image belongs to a previously observed
class ( = 0 ) or a new one ( = 1 ) and it
provides to the algorithm the right class index   the
5. The algorithm uses the feedback received by the
human annotator in order to update its memory.</p>
      <p>Specifically, if the class has been already observed
before, the algorithm updates the prototype
vector associated to that class by computing an
incremental average of the prototype vectors
associated to that class. On the contrary, if the class is
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:</p>
      <p>∈ℳ (  ,   )
Else:</p>
      <p>Define  =̂ 0
Define  ̂  = argmin</p>
      <p>(old class)</p>
      <sec id="sec-4-1">
        <title>Update  ̂</title>
        <p>̂ =  ̂ ̂ + 1


 =̂  +̂ 1
Define  =̂ 1 (new class)
Define  ̂  =  ̂
Define  ̂ ̂ = 1



Update    =    + 1
Pick up    ∈ ℳ
Update    =   −1</p>
        <p>Else:</p>
        <p>Update    = 1
Define    =  
   +
1
 



5) If  = 0 :
4) Receive user’s check:  ∈ {0, 1} ,   ∈ ℕ</p>
        <p>Update ℳ+1 = ℳ ∪ {  }

6) Update the classification accuracy
 
=
 − 1</p>
        <p>1
+</p>
        <p>{ ̂ =  }
7) Update the new class confusion matrix
(, ) ̂ = (, ) ̂+ 1
(1)
(2)
26: Return</p>
        <p>, 
new, the algorithm adds the new prototype vector
  to its memory.
6. The algorithm updates the computation of the
classification accuracy and the new class
detection confusion matrix until that time, by
comparing the quantities estimated by the algorithm
(denoted by the symbol ⋅)̂ with the
corresponding exact counterparts returned by the human
annotator (denoted by the same letters without
the symbol ⋅)̂.</p>
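      <p>To make the steps above concrete, we sketch the prototype memory of Algorithm 1 in Python. This is an illustrative sketch, not the implementation behind our web app: the names PrototypeAnnotator, embed, distance and eps are ours, standing for the backbone Φ, the distance d and the threshold ε; prototypes are assumed to be NumPy-like vectors.</p>
      <preformat><![CDATA[
# Illustrative sketch of the prototype memory in Algorithm 1.
# `embed` stands for the backbone Phi (e.g. CLIP's image encoder),
# `distance` for d (Euclidean or cosine), `eps` for the threshold.
class PrototypeAnnotator:
    def __init__(self, embed, distance, eps):
        self.embed = embed
        self.distance = distance
        self.eps = eps
        self.prototypes = {}   # class index -> prototype vector p_c
        self.counts = {}       # class index -> frequency f_c

    def predict(self, image):
        """Steps 1-3: embed the image and query the prototype memory."""
        z = self.embed(image)
        if self.prototypes:
            c = min(self.prototypes,
                    key=lambda c: self.distance(self.prototypes[c], z))
            if self.distance(self.prototypes[c], z) < self.eps:
                return z, c, 0                      # old class: n_hat = 0
        new_c = max(self.prototypes, default=-1) + 1
        return z, new_c, 1                          # new class: n_hat = 1

    def update(self, z, y):
        """Step 5: incremental average p_y = ((f_y - 1) p_y + z) / f_y."""
        if y in self.prototypes:
            self.counts[y] += 1
            f = self.counts[y]
            self.prototypes[y] = ((f - 1) * self.prototypes[y] + z) / f
        else:
            self.counts[y] = 1
            self.prototypes[y] = z
]]></preformat>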
      <p>Interaction with the human annotator. In Algorithm 1, the human annotator can also be queried less frequently, only at some iterations. In such a case, at the iterations with no feedback from the human annotator, the update of the memory can be done in a similar way as described in step 5) of the algorithm, by replacing the true quantities with the corresponding estimates (denoted by the symbol ˆ). In section 4 we propose an analysis of the performance of the system under different quantities of human supervision.</p>
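      <p>As a minimal sketch of this partial-supervision mode (hypothetical names: oracle stands for the human annotator, interaction_prob for the human-interaction rate analyzed in section 4), the loop below queries the human only with a given probability and otherwise updates the memory with the algorithm's own estimate:</p>
      <preformat><![CDATA[
import random

def annotate_stream(annotator, images, oracle, interaction_prob=0.1):
    """Run the annotator over a stream, querying the human (`oracle`)
    only with probability `interaction_prob`; at the remaining
    iterations the memory is updated with the estimate y_hat itself."""
    labels = []
    for x in images:
        z, y_hat, n_hat = annotator.predict(x)
        if random.random() < interaction_prob:
            y = oracle(x)      # human feedback: the true class index
        else:
            y = y_hat          # self-labeling: trust the estimate
        annotator.update(z, y)
        labels.append(y)
    return labels
]]></preformat>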
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>To assess the performance of our method we defined
four diferent experimental settings and used the datasets
below.
• CIFAR100 [17]: the dataset is composed by 50000
train, 32 × 32 RGB images subdivided in 100
classess with 600 images each. This dataset has
been chosen to provide a comparative benchmark
in line with the research community.</p>
      <p>Accuracy and Distance Analysis The first
experiment is aimed at measuring the performance in terms of
the classification accuracy for the images in the dataset.</p>
      <p>We did not consider the first occurrence of each class
when computing the accuracy. In order to assess the
• CelebA [18]: the dataset is composed of 64 × 64 incremental improvement of the algorithm, we plot as
RGB images divided in 10177 classes; it is com- well the moving average accuracy with a variable time
posed of 202599 images. This dataset represents frame. We implemented algorithm 1 by using two
difera fine grained benchmark to assess our system. ent distances  ∶ ℝ  × ℝ → ℝ+:
• Core50 [19]: the dataset is composed by 164866,
128 × 128 RGB images of 50 domestic objects
divided in 10 classes. Each object appears in 11
diferent scenarios. We opted for this dataset to
provide a more realistic dataset benchmark and to
test the system under the domain shift . In Figure 4
we show the structure of the data.</p>
      <p>In all the experiments we implemented algorithm 1 with
backbone Φ equal to CLIP [8].
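      <p>For illustration, one possible way to instantiate the backbone Φ is through OpenAI's clip package; the paper does not state which CLIP variant or library was used, so the ViT-B/32 choice below is an assumption.</p>
      <preformat><![CDATA[
# Hypothetical instantiation of the backbone Phi with CLIP
# (https://github.com/openai/CLIP); the ViT-B/32 variant is an assumption.
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

def embed(image: Image.Image) -> torch.Tensor:
    """Project an image into CLIP's embedding space (no gradients needed)."""
    with torch.no_grad():
        return model.encode_image(preprocess(image).unsqueeze(0)).squeeze(0)
]]></preformat>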
      <p>Accuracy and Distance Analysis. The first experiment is aimed at measuring the performance in terms of classification accuracy for the images in the dataset. We did not consider the first occurrence of each class when computing the accuracy. In order to assess the incremental improvement of the algorithm, we also plot the moving average accuracy with a variable time frame. We implemented Algorithm 1 using two different distances d : ℝ^d × ℝ^d → ℝ_+: the Euclidean distance (ℓ2), d(z_1, z_2) = ‖z_1 − z_2‖_2, and the cosine distance (cos), d(z_1, z_2) = 1 − ⟨z_1, z_2⟩ / (‖z_1‖_2 ‖z_2‖_2).</p>
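      <p>In code, the two distances are a direct transcription of the formulas above (embeddings as NumPy vectors):</p>
      <preformat><![CDATA[
import numpy as np

def euclidean(z1: np.ndarray, z2: np.ndarray) -> float:
    # l2 distance: ||z1 - z2||_2
    return float(np.linalg.norm(z1 - z2))

def cosine(z1: np.ndarray, z2: np.ndarray) -> float:
    # cosine distance: 1 - <z1, z2> / (||z1||_2 * ||z2||_2)
    return float(1.0 - np.dot(z1, z2)
                 / (np.linalg.norm(z1) * np.linalg.norm(z2)))
]]></preformat>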
      <p>The comparison results in terms of accuracy across the three different datasets are presented in Figure 2. The system's performance shows a noticeable improvement over time, starting off with poor accuracy and gradually increasing across all datasets. This behavior is expected, as the centroids need to adjust to the data and "warm up" before delivering optimal performance. For a summary of the numerical accuracy values obtained, refer to Table 1. In the case of the Core50 dataset, the system is highly effective in separating all classes, achieving exceptional performance with just a few centroid updates. These results demonstrate the effectiveness of our approach in tackling real-world classification tasks. It is worth noting that, while challenging, the CIFAR100 dataset may not be fully representative of real-world usage. Nevertheless, we report our system's performance on this dataset to facilitate future comparisons. It is worth mentioning that the system requires 2000 iterations before achieving stable labeling on this dataset. The CelebA dataset poses the greatest challenge among the three, as it represents a fine-grained benchmark with a large number of classes and few examples per class. As a result, our system's performance on this dataset is relatively lower than on the other two. This observation highlights the importance of having a robust system that can align with the data, which requires a larger number of images (around 10k) for this particular dataset. Since the performance for cos is slightly better and stabler, we choose to use it for all the other experiments.</p>
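      <p>For reference, the running accuracy of equation (1) and the moving average accuracy plotted in these figures can be computed with small helpers such as the following (names and the default window are ours):</p>
      <preformat><![CDATA[
import numpy as np

def running_accuracy(correct):
    """acc_t = ((t - 1) * acc_{t-1} + 1{y_hat = y}) / t, as in Algorithm 1;
    `correct` is the boolean sequence 1{y_hat_t = y_t}."""
    acc, curve = 0.0, []
    for t, c in enumerate(correct, start=1):
        acc = ((t - 1) * acc + float(c)) / t
        curve.append(acc)
    return curve

def moving_average(values, window=50):
    """Moving-average accuracy over a sliding time frame."""
    v = np.asarray(values, dtype=float)
    return np.convolve(v, np.ones(window) / window, mode="valid")
]]></preformat>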
      <p>OOD (new class detection) Analysis. In this experiment, we conducted an out-of-distribution (OOD) analysis on the Core50 dataset by varying the classification threshold used to determine whether an instance belongs to a new class (i.e., is OOD) or not. The results are presented as the precision vs. recall curve relative to the confusion matrix in Figure 4. While these results are empirical and may not generalize to different datasets, they provide a starting point for a more thorough threshold estimation that could potentially be applicable to unseen datasets.</p>
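      <p>A sketch of this threshold sweep, under the assumption that we log, for each image, its distance to the closest prototype and a ground-truth flag marking whether it opens a new class (the helper name is ours):</p>
      <preformat><![CDATA[
import numpy as np

def new_class_precision_recall(distances, is_new, thresholds):
    """For each threshold eps, flag an image as a new class when its
    distance to the closest prototype exceeds eps, and compare with the
    ground truth to obtain one precision/recall point of the curve."""
    distances = np.asarray(distances, dtype=float)
    is_new = np.asarray(is_new, dtype=bool)
    curve = []
    for eps in thresholds:
        flagged = distances > eps
        tp = int(np.sum(flagged & is_new))
        precision = tp / max(int(np.sum(flagged)), 1)
        recall = tp / max(int(np.sum(is_new)), 1)
        curve.append((float(eps), precision, recall))
    return curve
]]></preformat>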
      <p>Domain Shift Analysis. In the third experiment, we evaluate the robustness of our pipeline under incremental domain shift on the Core50 dataset. Specifically, we compare the system's performance over a set of images (similarly to the previous plots) against the same set of images featuring coherent-ordered backgrounds; in other words, all images from the same background scenario are presented before moving to the next scenario. This experiment aims to demonstrate that the CLIP space is resilient enough to cope with distributional shift. As shown in Figure 4, there is only a slight drop in performance when the background scenario changes, which becomes increasingly irrelevant as the centroids fine-tune. These results demonstrate the effectiveness of our approach in handling domain shift, which is a critical aspect of real-world applications.</p>
      <p>Self-Annotation. In our final experiment, we evaluated the performance of our pipeline under minimal human feedback. We present the results on the challenging Core50 dataset under domain shift in Figure 4. The findings reveal that even with minimal interaction, our system can achieve good results, indicating that it can autonomously propose correct labels for the input data. These results demonstrate the effectiveness and efficiency of our approach in minimizing human intervention, making it suitable for real-world applications where manual labeling can be time-consuming and expensive.</p>
      <p>Figure 6: Comparison of different levels of human interaction (h.i.) in the self-labeling (SL) case. As can be seen, with a 10% probability of human feedback the system is able to autonomously label images with a small amount of data.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we proposed a deep learning system for automatically annotating a sequence of images with different levels of active human supervision. The system encodes a class by a prototype vector that is computed by averaging the projections of the images annotated as belonging to the same class by a pre-trained backbone. The system is computationally efficient and does not require memorizing the images. Our pipeline efficiently keeps memory of the past and, at the same time, identifies new classes. We also developed a web app for our method and carried out an extensive numerical analysis to assess the robustness of the system.</p>
      <p>In the future, it would be interesting to further investigate the applicability of the proposed method to different scenarios and to extend the pipeline with a learnable module. It would also be interesting to provide theoretical certification for its performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><mixed-citation>[1] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), NeurIPS, 2020.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, Y. Qiao, LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention, 2023. arXiv:2303.16199.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] J. M. Jumper et al., Highly accurate protein structure prediction with AlphaFold, Nature (2021).</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] J. Degrave, F. Felici, J. Buchli, M. Neunert, B. Tracey, F. Carpanese, T. Ewalds, R. Hafner, A. Abdolmaleki, D. de Las Casas, et al., Magnetic control of tokamak plasmas through deep reinforcement learning, Nature (2022).</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] I. Namatevs, K. Sudars, I. Polaka, Automatic data labeling by neural networks for the counting of objects in videos, Procedia Computer Science (2018). ICTE in Transportation and Logistics.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] Q. Cheng, Q. Zhang, P. Fu, C. Tu, S. Li, A survey and analysis on automatic image annotation, Pattern Recognition (2018).</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] F. Pelosin, Simpler is better: off-the-shelf continual learning through pretrained backbones, in: Transformers 4 Vision Workshop, CVPR, 2022.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: M. Meila, T. Zhang (Eds.), ICML, 2021.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in: ECCV, 2008.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] G. Carneiro, A. B. Chan, P. J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, TPAMI (2007).</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] Y. Niu, Z. Lu, J.-R. Wen, T. Xiang, S.-F. Chang, Multi-modal multi-scale deep learning for large-scale image annotation, 2018. arXiv:1709.01220.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] S. V. Mehta, D. Patil, S. Chandar, E. Strubell, An empirical investigation of the role of pre-training in lifelong learning, 2021. arXiv:2112.09153.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] A. Cossu, T. Tuytelaars, A. Carta, L. C. Passaro, V. Lomonaco, D. Bacciu, Continual pre-training mitigates forgetting in language and vision, 2022. arXiv:2205.09357.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] T. Wu, G. Swaminathan, Z. Li, A. Ravichandran, N. Vasconcelos, R. Bhotika, S. Soatto, Class-incremental learning with strong pre-trained models, in: CVPR, 2022.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] K. Lee, Y. Zhong, Y. Wang, Do pre-trained models benefit equally in continual learning?, in: WACV, 2023.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] T. Wu, M. Caccia, Z. Li, Y. Li, G. Qi, G. Haffari, Pretrained language model in continual learning: A comparative study, in: ICLR, 2022.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, University of Toronto, 2009.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: ICCV, 2015.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] V. Lomonaco, D. Maltoni, CORe50: a new dataset and benchmark for continuous object recognition, in: CoRL, 2017.</mixed-citation></ref>
    </ref-list>
  </back>
</article>