Analysis of Knowledge Distillation on Image Captioning Models

Srivatsan S¹ and Shridevi S²
¹ School of Computer Science and Engineering, Vellore Institute of Technology, India
² Centre for Advanced Data Science, Vellore Institute of Technology, India

Abstract
Image captioning is the task of generating a textual description of an image as accurately as possible. It combines computer vision and natural language processing techniques, each of which can be improved individually to raise the overall performance of the model. Knowledge Distillation is a model compression technique in which a smaller network learns from the predictions of a larger network in order to reach a better point of convergence. It improves the performance of the smaller network without introducing problems such as overfitting. Image captioning performance is typically measured with the BLEU and CIDEr scores. In this work, we test and record the performance of three different image captioning architectures, each as a large (teacher) network, a small (student) network, and a knowledge-distilled student network, in order to track and analyse the effects of Knowledge Distillation. The results are promising when compared with state-of-the-art models.

Keywords: Image Captioning, Deep Learning, Knowledge Distillation

ACI'22: Workshop on Advances in Computation Intelligence, its Concepts & Applications at ISIC 2022, May 17-19, Savannah, United States.

1. Introduction

Image captioning is the task of generating a natural language description of the scenes and objects in an image. The output sentences must be grammatically correct and describe the image as accurately as possible. An image captioning model must therefore recognize the objects, their context within the scene, and their relationships, and frame them into a proper sentence in the target language.

The task combines computer vision (CV) and natural language processing (NLP) techniques. The CV components handle object detection and recognition and the embedding of the image into a feature vector; the NLP components convert this feature vector into a sentence, choosing words according to the objects' locations, actions, and features, as well as their importance in the image. The typical input and output of an image captioning task are shown in Figure 1.

There have been tremendous advancements in the efficiency and accuracy of CV and NLP techniques, and this has led to a consequent increase in the performance of image captioning models. Models such as Inception and other advanced CNNs for image recognition, as well as Transformers for NLP tasks, make state-of-the-art components accessible on all devices.

Figure 1: Sample images with their captions

In this work, we examine the performance of three different model architectures typically used for image captioning. Generating a natural language description of an image has attracted considerable interest recently, both because of its importance in practical applications and because it connects two major machine learning fields: CV and NLP.
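The paper itself contains no code; purely as an illustration of the pipeline just described (a CNN backbone producing a feature vector that conditions a recurrent language model over a fixed vocabulary), the following minimal PyTorch sketch shows one possible arrangement. The class name CaptionModel, the layer sizes, and the 5001-word vocabulary are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner (illustrative sketch only)."""
    def __init__(self, vocab_size=5001, embed_dim=512, hidden_dim=512):
        super().__init__()
        # CV side: a ResNet backbone reduced to a single pooled feature vector.
        backbone = models.resnet50(weights=None)          # pretrained weights omitted for brevity
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.img_proj = nn.Linear(backbone.fc.in_features, embed_dim)
        # NLP side: token embeddings and an LSTM that emits vocabulary logits.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)           # (B, 2048) image feature vector
        img_token = self.img_proj(feats).unsqueeze(1)     # image vector primes the decoder
        tokens = self.embed(captions)                     # (B, T, E) caption word embeddings
        seq = torch.cat([img_token, tokens], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                           # (B, T+1, vocab) word scores

# Toy forward pass with random images and caption token ids.
model = CaptionModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5001, (2, 12)))
```

The real systems compared in Section 4 differ in their backbones and decoders, but all follow this encode-then-decode structure.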
Existing methodologies are either top-down, starting from an image and converting it into words, or bottom-up, generating words that describe different parts of the image and then combining them. The general approach is shown in Figure 2.

Figure 2: Image Captioning Network Architecture

The Transformer and BERT models, and their applications, show that architectures built on attention mechanisms, in which recurrent layers are replaced by self-attention, offer far better performance in sequence modelling. This alternative also provides flexible modelling capabilities, since the attention layer can be used in different ways within a multi-layer structure.

Image captioning may need to be deployed in a number of settings, including on edge devices or in areas with low bandwidth or low-powered hardware. Such environments may not support large, accurate models; hence the concept of Knowledge Distillation (KD) has been used to improve the performance of smaller, less accurate models. Hinton et al. [1] proposed Knowledge Distillation as a way of distilling the knowledge in an ensemble of models into a single model. In this work, we aim to compare the effects of this approach on several different models trained for the image captioning task. Multiple teacher and student models are trained, with each student gaining the distilled knowledge of its teacher and thereby improving its performance. The effects of KD on the different image captioning architectures are analysed, and the resulting performance metrics are reported in Section 5.

2. Related Work

Image captioning has seen much advancement in recent years, in part due to the independence of the modules involved, namely the CV and NLP modules: improvements in either field translate into a corresponding performance improvement in the overall image captioning task. There have also been substantial improvements in model evaluation and in the performance metrics used, to ensure that the generated captions are closer to the ground truth. The earliest works on image captioning include articles such as [2] and [3]. Image captioning approaches can be categorized in different ways: by strategy, as bottom-up, top-down, or hybrid [4], or by technique, as classical machine learning or deep learning based. Classical machine learning approaches use unsupervised learning to detect and extract features from the input data, which are then passed to a classifier. Deep learning techniques allow models to be trained on large datasets for higher accuracy, typically using Convolutional Neural Networks for image encoding and Recurrent Neural Networks for decoding.

More recently, the emergence of the attention mechanism has led to its widespread use in image recognition and object detection tasks, owing to its inherent correlation with how humans understand images. Papers such as [5] can be regarded as the start of the current research trend in image captioning, achieving state-of-the-art results on multiple datasets. They were followed by [6] and [7], which used the attention mechanism during training to achieve the best results while also visualizing the image regions the layers focus on. More recent papers utilizing attention for image captioning include [8] and [9]. [13] uses an object relation transformer to exploit the spatial relationships between objects through geometric attention. Approaches using more modern methods, such as Vision Transformers or spatial and semantic graphs, have also been explored.
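As a concrete illustration of the attention mechanism discussed above (and of the Bahdanau-style attention used later in the decoder of Model 2, Section 4.2), the following is a minimal additive-attention module in PyTorch. The dimensions (2048-dimensional spatial features, a 512-dimensional decoder state) are assumptions for illustration, not the exact values used in the paper.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention over spatial image features (illustrative sketch)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (B, L, feat_dim)  L spatial locations from the CNN
        # hidden:   (B, hidden_dim)   current decoder state
        scores = self.v(torch.tanh(self.w_feat(features) +
                                   self.w_hidden(hidden).unsqueeze(1)))  # (B, L, 1)
        weights = torch.softmax(scores, dim=1)        # where the decoder "looks"
        context = (weights * features).sum(dim=1)     # (B, feat_dim) weighted summary
        return context, weights.squeeze(-1)

attn = AdditiveAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
context, weights = attn(torch.randn(2, 64, 2048), torch.randn(2, 512))
```

The attention weights returned here are exactly what papers such as [7] visualize to show which image regions drive each generated word.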
With respect to the decoding component of image captioning, typical approaches include greedy search and beam search. Advances in deep learning have led to Recurrent Neural Network based NLP models that predict the most probable sentence for a given visual embedding. LSTM-based approaches have been quite prevalent, with additional layers leading to better performance [10]. Attention layers have also been used in the decoder NLP model [11]. The latest state-of-the-art approaches involve Transformers, introduced in [12]. Transformers provide inherent parallelism, which can be exploited for faster training. Just as object detection and classification tasks benefit from pre-training large models such as Inception V3 or YOLO and then applying transfer learning to the task at hand, the same strategy can be replicated for NLP by pre-training large Transformer models such as BERT and fine-tuning them later. Distillation in image captioning has also been explored in [14] and other papers; in this work, however, we compare how much performance degrades for each model architecture when it is compressed. The works above summarize the current state of the art in image captioning techniques.

3. Datasets

The datasets used for the task are Flickr8k and MS-COCO. The Flickr8k dataset consists of 8,000 photos, each accompanied by five different captions that clearly describe the salient objects and events. An example from the dataset is shown in Figure 3.

Figure 3: Example from the Flickr8k dataset

The COCO dataset is a large-scale object detection, segmentation, and captioning dataset. Each of its almost 82,000 images includes five captions for training image captioning models. For this work, only a small subset of around 5,000 images has been used. An example is shown in Figure 4.

Figure 4: Example from the MS-COCO dataset

4. Image Captioning Models

The best image captioning models use a state-of-the-art object detection/recognition model for the initial image processing. The final layer of the detection model is passed through an encoder, which captures it as an embedded feature vector. This is then passed through a decoder (typically an RNN architecture), which outputs probabilities for all 5001 words in the pre-defined vocabulary. In this work, the performance of three different image captioning techniques, with three models of varying architecture size for each technique, is compared to analyse the impact of KD. The three models used are:

1. ResNet50 + Beam Search
2. Inception V3 + Encoder-Decoder with Attention
3. EffNet-B0 + Transformer Encoder-Decoder

To analyse the effects of Knowledge Distillation on these image captioning models, we take a teacher and a student architecture for each and train them separately. The distilled model takes the untrained student architecture and learns from the teacher's predictions alongside the ground truth in order to converge more efficiently; a minimal sketch of this distillation objective is shown below. In total, nine different models have been trained and tested.
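A minimal sketch of the distillation objective in the sense of Hinton et al. [1]: the student is trained on a weighted sum of the usual cross-entropy against the ground-truth caption tokens and a KL-divergence term that matches the teacher's temperature-softened word distribution. The temperature and weighting values shown are illustrative assumptions; the paper does not state its exact distillation hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    """Hinton-style KD loss [1]: hard-label cross-entropy plus a soft-target term.
    Logits are per-word predictions flattened to (N, vocab_size)."""
    # Standard cross-entropy against the ground-truth caption tokens.
    hard = F.cross_entropy(student_logits, targets)
    # KL divergence between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)   # rescale so gradient magnitudes stay comparable [1]
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage: 8 word positions, vocabulary of 5001 tokens.
s, t = torch.randn(8, 5001), torch.randn(8, 5001)
y = torch.randint(0, 5001, (8,))
loss = distillation_loss(s, t, y)
```

In training, the teacher is frozen and only provides `teacher_logits`; the student's weights are updated by backpropagating this combined loss.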
4.1 ResNet50 + Beam Search

Figure 5: ResNet50 Architecture

ResNet50 is the object recognition model used in this technique. It is a ResNet variant with 50 layers: 48 convolutional layers, a max-pooling layer, and an average-pooling layer. The models shown in Figure 5 and Figure 6 are used for feature extraction, and the resulting embeddings are passed to the encoder-decoder network (RNN), which uses a beam search algorithm to produce the probabilities of the next word over the pre-defined dictionary (a minimal sketch of the beam search procedure is given after Table 1).

Figure 6: Beam Search

The pipeline begins with the ResNet50 model, whose final-layer output is passed to the encoder. The encoder comprises a fully connected (dense) layer and a normalization layer, after which the output is passed to the decoder. The decoder consists of an embedding layer, an LSTM layer, and a final fully connected layer of vocabulary size that produces the word probabilities. Table 1 lists the architecture details.

Table 1: Model 1 teacher and student architectures

Layer Type        Teacher Model    Student Model
ResNet            ResNet50         ResNet18
Fully Connected   512              512
BatchNorm         512              512
Embedding         5001, 512        5001, 512
LSTM              512, 512, 2      512, 256, 1
Fully Connected   512, 5001        256, 5001
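The following is a minimal sketch of the beam search decoding used by Model 1: at each step every partial caption is expanded with its top-k next words, and only the k highest-scoring candidates are kept. The `step_fn` interface, vocabulary size, and special-token indices are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Minimal beam search decoder (illustrative). `step_fn(seq)` is assumed to
    return next-token logits of shape (vocab,) for a partial caption `seq`."""
    beams = [([start_token], 0.0)]                      # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:                    # finished captions carry over unchanged
                candidates.append((seq, score))
                continue
            log_probs = F.log_softmax(step_fn(seq), dim=-1)
            top_lp, top_ix = log_probs.topk(beam_width)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((seq + [ix], score + lp))
        # Keep only the `beam_width` highest-scoring partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy usage with a random "model": vocabulary of 5001 tokens, <start>=1, <end>=2.
caption = beam_search(lambda seq: torch.randn(5001), start_token=1, end_token=2)
```

Greedy search is the special case with beam_width = 1; wider beams trade decoding time for captions with higher overall probability.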
4.2 Inception V3 + Encoder-Decoder with Attention

Because only the latest hidden state of the encoder RNN is used as the context vector for the decoder, the traditional seq2seq model generally cannot handle long input sequences effectively. In this model, Inception V3 is used to pre-process all the images in the datasets, and the captions are tokenized before training. The encoder-decoder architecture, shown in Figure 7, is similar to [17]. The output of the lower convolutional layer of Inception V3 is flattened and passed directly to the encoder's fully connected layer. It is then decoded by the RNN decoder (a GRU with attention), which predicts the caption.

Figure 7: Flow of the Model with Attention

The architecture, detailed in Table 2, uses the Inception V3 model for object recognition and image pre-processing. Its final layer is passed directly through a dense (fully connected) layer, which then feeds the decoder. The decoder consists of a Bahdanau attention layer, a GRU layer, which retains information over longer spans than a plain RNN while being computationally lighter than an LSTM, and dense layers that predict the word probabilities for the image.

Table 2: Model 2 teacher and student architectures

Layer Type        Teacher Model    Student Model
Inception         InceptionV3      InceptionV3
Dense             256              256
Embedding Layer   5001, 256        5001, 256
Attention         512              256
GRU Layer         512              256
Dense             512              256
Dense             5001             5001

4.3 EffNet-B0 + Transformer Encoder-Decoder

The Transformer is a neural network architecture designed for sequence-to-sequence tasks that also handles long-range dependencies well. One major distinction of these networks is that the input sequence can be processed in parallel, allowing the GPU to be fully exploited and increasing training speed. Because the architecture is built on multi-headed attention layers, the vanishing gradient problem is also largely avoided. Transformers, shown in Figure 8, have been successfully adapted to many deep learning tasks, often outperforming other network architectures.

Figure 8: Transformer Architecture

In Model 3, detailed in Table 3, the EfficientNet-B0 model performs object recognition and passes its final-layer output to the Transformer-based encoder. The encoder applies a normalization layer and a dense layer to the inputs received from the EfficientNet model; the result is passed through a multi-head attention layer and another dense layer that connects to the decoder. The decoder consists of an embedding layer for the encoder inputs, followed by multi-head attention layers that are normalized and projected through dense layers, with the final layer producing the word probabilities.

Table 3: Model 3 teacher and student architectures

Layer                  Teacher Model    Student Model
EfficientNet           EffNet-B0        EffNet-B0
Layer Normalization    -                -
Dense                  512              256
MHA (1 head)           512              512
Layer Normalization    -                -
Embedding Layer        512              512
MHA (2 heads)          512              512
Layer Normalization    -                -
MHA (2 heads)          512              512
Layer Normalization    -                -
Dense                  512              256
Dropout                0.3              0.3
Dense                  512              256
Layer Normalization    -                -
Dropout                0.5              0.5
Dense                  5001             5001

5. Results

We evaluated the results using the BLEU and CIDEr metrics. BLEU is the standard evaluation metric for measuring the degree of correspondence between the network output and the ground truth, while CIDEr is a consensus-based metric that measures the similarity between the generated output and a set of human-written reference sentences. The models were tested on a split of around 1,000 images; a minimal example of the BLEU computation is sketched at the end of this section. Tables 4 and 5 compare the performance of state-of-the-art image captioning architectures with that of our architectures. Despite their smaller size, the knowledge-distilled models show comparable performance and are far more efficient than the larger models.

Table 4: Results on the MS-COCO and Flickr8k datasets

                              MS-COCO                      Flickr8k
Model                         Avg. BLEU-1   CIDEr Score    Avg. BLEU-1   CIDEr Score
Teacher (Model 1)             61.7          84.7           41.8          28.8
Student (Model 1)             48.0          76.8           32.1          22.1
Distilled Student (Model 1)   54.3          80.4           34.3          26.3
Teacher (Model 2)             66.9          92.6           46.7          32.5
Student (Model 2)             53.4          81.0           37.3          24.6
Distilled Student (Model 2)   58.1          84.5           39.6          27.9
Teacher (Model 3)             73.8          103.4          52.5          36.5
Student (Model 3)             64.6          91.1           42.3          28.2
Distilled Student (Model 3)   67.3          96.2           46.6          29.1

Table 5: Karpathy split performance on MS-COCO

Model                              BLEU-1    CIDEr
SCST [15]                          78.1      114.7
GCN-LSTM [16]                      80.2      117.9
Recurrent Fusion Network [17]      80.4      122.9
Meshed Memory Transformer [18]     81.6      129.3
Distilled Student (Model 3)        72.1      104.7
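To make the scoring concrete, the following minimal example computes corpus-level BLEU-1 with NLTK on a toy caption. It only illustrates the metric; the paper does not specify its exact evaluation scripts, and the CIDEr computation is not shown here.

```python
# Minimal corpus-level BLEU-1 example (illustrative only).
from nltk.translate.bleu_score import corpus_bleu

references = [  # each image has several tokenized ground-truth captions
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "brown", "dog", "running", "along", "the", "shore"]],
]
hypotheses = [["a", "dog", "running", "on", "the", "beach"]]  # model output

# weights=(1.0, 0, 0, 0) restricts the score to unigram precision, i.e. BLEU-1.
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
print(f"BLEU-1: {100 * bleu1:.1f}")
```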
6. Conclusion and Future Discussion

In this work, we trained multiple teacher, student, and distilled models on the image captioning task, using two standard datasets, Flickr8k and MS-COCO. A comparison of the results for all the models shows that the Transformer-based models clearly outperformed their counterparts. In some cases a student model outperformed teacher models of other architectures (the Model 3 student versus the Model 1 teacher), while in others the knowledge-distilled model received enough of a boost to match or outperform other teachers (the Model 3 distilled student versus the Model 2 teacher). This highlights use cases in which a smaller model can replace, and to some extent even outperform, existing slower and larger models. Future work could include more models in the study and tune the distillation parameters; it could also study the effects of further distillation in order to establish a functional relationship between performance and the degree of distillation.

References

[1] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[2] J.-Y. Pan, H.-J. Yang, P. Duygulu, and C. Faloutsos, "Automatic image captioning," in ICME, 2004.
[3] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in ECCV, 2010.
[4] P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[5] P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[6] Q. You et al., "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[7] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," 2015.
[8] X. Zhang, X. Sun, Y. Luo, J. Ji, Y. Zhou, Y. Wu, F. Huang, and R. Ji, "RSTNet: Captioning with adaptive attention on visual and non-visual words," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15465-15474.
[9] Q. You et al., "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in CVPR, 2015.
[11] L. Huang et al., "Attention on attention for image captioning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017.
[13] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, "Image captioning: Transforming objects into words," 2019.
[14] X. Chen, M. Jiang, and Q. Zhao, "Self-distillation for few-shot image captioning," University of Minnesota, Twin Cities.
[15] S. J. Rennie et al., "Self-critical sequence training for image captioning," arXiv preprint arXiv:1612.00563, 2016.
[16] T. Yao et al., "Exploring visual relationship for image captioning," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 711-727.
[17] W. Jiang et al., "Recurrent fusion network for image captioning," in ECCV, 2018.
[18] M. Cornia et al., "Meshed-memory transformer for image captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.