<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of Knowledge Distillation on Image Captioning Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Srivatsan S</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shridevi S</string-name>
          <email>shridevi.s@vit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Image Captioning, Deep Learning, Knowledge Distillation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Advanced Data Science, Vellore Institute of Technology</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Engineering, Vellore Institute of Technology</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>327</fpage>
      <lpage>335</lpage>
      <abstract>
        <p>Image Captioning involves generating a textual description of an image as accurately as possible. It requires a combination of Computer Vision and Natural Language Processing techniques, each of which can be enhanced individually to improve the overall performance of the model. Knowledge Distillation is a model compression technique in which a smaller network learns from the predictions of a larger network in order to reach a better optimum. It effectively improves the performance of the smaller network without introducing problems such as overfitting. Typically, the performance of image captioning is measured in terms of the BLEU and CIDEr scores. In this work, we have tested and recorded the performance of three different image captioning architectures, each as a large network, a small network, and a knowledge-distilled small network, to track and analyse the effects of Knowledge Distillation. The results are promising when compared with state-of-the-art models.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Captioning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Knowledge Distillation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Image Captioning is the task of generating a natural language description of the scenes and objects
in an image. The sentences given as output must be grammatically correct and describe the image as
accurately as possible. Image captioning models must therefore be able to recognize and describe the
objects, their context within the scene, and their relationships, and frame them into a proper sentence
in the target language. The image captioning task involves a combination of computer vision (CV)
and natural language processing (NLP) techniques: the CV components handle object detection and
recognition and the embedding of the image into a feature vector, while the NLP components convert
this feature vector into a sentence, framing the words based on each object's location, action, and
features, as well as its importance in the image. The typical input and output of an image captioning
task are shown in Figure 1. There have been tremendous advances in the efficiency and accuracy of
CV and NLP techniques, and these have led to a corresponding increase in the performance of image
captioning models. Models like Inception and other advanced CNNs for image recognition, as well as
Transformers for NLP tasks, make state-of-the-art components available on all devices.</p>
      <p>In this work, we evaluate the performance of 3 different models typically used for image
captioning. Generating a natural language description of an image has attracted interest lately, both
because of its significance in practical applications and because it connects two major machine
learning fields: CV and NLP. Existing approaches are either top-down, starting from an image and
converting it into words, or bottom-up, producing words that describe different parts of an image and
then combining them. The general approach is shown in Figure 2 below.</p>
      <p>
        The Transformer and BERT models, and their applications, show that models with attention
mechanisms, in which the recurrent layers are replaced by self-attention, offer far better performance
in sequence modelling. This alternative also enables different architectural designs, as the attention
layer can be used in a multi-layer structure in various ways. Image captioning may need to be deployed
in a number of settings, including on edge devices or in areas with low bandwidth or low-powered
devices. Such settings may not support large, accurate models; hence the concept of Knowledge
Distillation (KD) has been used to improve the performance of less accurate models. Hinton et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
proposed the notion of Knowledge Distillation as distilling the knowledge of an ensemble of models
into a single model. In this work, we aim to compare the effects of this approach on a number of
different models trained for the image captioning task. Multiple teacher models and student models
are trained, with each student gaining the distilled knowledge of its teacher and thereby improving its
performance. The effects of KD on different image captioning models have been analyzed, and the
performance metrics are recorded in Section 5.
      </p>
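      <p>As a minimal sketch of this idea (assuming a TensorFlow/Keras setup, which this paper does not
specify, and with illustrative values for the temperature and the weighting), the distillation objective
blends the usual hard cross-entropy on the ground-truth words with a soft term computed from
temperature-softened teacher and student logits:</p>
      <preformat>
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      temperature=4.0, alpha=0.1):
    """Hinton-style KD loss: alpha * hard CE + (1 - alpha) * soft term.

    `temperature` and `alpha` are illustrative defaults, not this paper's settings.
    """
    # Hard loss: cross-entropy of the student against the ground-truth word indices.
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)

    # Soft loss: KL divergence between temperature-softened output distributions.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    soft = tf.keras.losses.kl_divergence(soft_teacher, soft_student)

    # The soft term is scaled by T^2 so its gradients stay comparable to the hard term.
    return alpha * hard + (1.0 - alpha) * soft * temperature ** 2
      </preformat>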
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Image Captioning has seen much advancement in recent years, in part due to the
independence of the modules involved, namely CV and NLP modules. Any improvements in one of
these fields can be shown to have a consequential performance improvement in the overall image
captioning task too. There have also been vast improvements in model evaluations and performance
metrics used, to ensure captions generated that are closer to the ground truth. The earliest works
involving Image Captioning include articles like [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Image Captioning tasks have different
criteria they can be split on, based on approach to the task as bottom-up or top-down or hybrid [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or
based on the techniques used as either classical machine learning or deep learning-based. Machine
Learning would involve unsupervised learning models to detect and extract features from input data
and pass to a classifier model. Deep learning techniques allow models to be trained on large datasets
for more accuracy, typically involving Convolutional Neural Networks in the image encoding
process, and Recurrent Neural Networks in the decoding process. More recently, the emergence of the
attention concept has led to its wide-spread use in any image recognition/ object detection tasks, due
to the inherent correlation to human understanding of images. Papers like [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] can be referred to as the
start of the trends in current research on image captioning, getting state-of-the-art results on multiple
datasets. It was followed by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which used the attention mechanism in training to achieve the
best results, as well as visualizing the parts focused on by the layers. More recent papers utilizing
attention for image captioning include [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] Uses an object relation transformer to exploit the
spatial relationships between objects using geometric attention. Approaches using more modern
methods like Vision Transformers, or Spatial and Semantic Graphs, have also been explored.
      </p>
      <p>
        With respect to the decoding components of image captioning, typical approaches include
greedy search and beam search. Advances in deep learning have led to Recurrent Neural
Network-based NLP models that predict the most probable sentence for a given visual
embedding. The LSTM-based approach has been quite prevalent, with additional layers leading to better
performance [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Attention layers have also been used in the decoder NLP model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The latest,
state-of-the-art approaches involve Transformers, introduced in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Transformers provide inherent
parallelism, which can be exploited for faster training. Just as object detection and classification
tasks have benefited from pre-training large models like Inception V3 or YOLO and then applying
transfer learning to customise them for the required task, the same strategy can be applied to
NLP tasks by pre-training large Transformer models like BERT and fine-tuning them later.
Distillation in image captioning has also been explored in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and other papers; in this work, however, we
compare how much performance degrades for each model when it is compressed. The works above
summarise the current state of the art in image captioning techniques.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>The datasets used for the task are the Flickr8k and MS-COCO datasets. The Flickr8k dataset
consists of 8,000 photos, each of which is accompanied by five different captions that provide clear
descriptions of the important items and events. An example from the dataset is shown below in Figure
3.</p>
      <p>The COCO dataset is a large-scale object detection, segmentation, and captioning dataset. Each of
its roughly 82,000 images includes 5 captions for training image captioning models. For this work,
only a small subset of around 5,000 images has been used. An example is shown in Figure 4. A sketch
of how the per-image caption lists are assembled is given below.</p>
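      <p>As a minimal, hedged sketch of how the five reference captions can be grouped per image (the file
name "Flickr8k.token.txt" and its tab-separated layout are assumptions about the standard Flickr8k
distribution, not details given in this paper):</p>
      <preformat>
from collections import defaultdict

def load_flickr8k_captions(token_file="Flickr8k.token.txt"):
    """Group the five reference captions per image.

    Assumes the usual Flickr8k token layout: 'image.jpg#n', a tab, then the
    caption text; adjust the parsing if your copy of the dataset differs.
    """
    captions = defaultdict(list)
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            image_id, caption = line.strip().split("\t")
            image_name = image_id.split("#")[0]
            captions[image_name].append(caption.lower())
    return captions  # maps an image file name to its list of 5 captions
      </preformat>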
    </sec>
    <sec id="sec-4">
      <title>4. Image Captioning Models</title>
      <p>The best image captioning models use a state-of-the-art object detection/recognition
model for the initial image processing. The output of the detection model's final layer is passed
through an encoder, which captures it as an embedded feature vector. This is finally passed
through a decoder (typically an RNN architecture), which outputs the probabilities of all 5001
words in the pre-defined vocabulary.</p>
      <p>In this work, the performances of 3 different image captioning techniques, with 3 models of
varying architecture sizes for each technique, have been compared to analyze the impact of KD. The 3
models used are:
1. ResNet50 + Beam Search
2. Inception V3 + Encoder-Decoder with Attention
3. EffNetB0 + Transformer Encoder-Decoder</p>
      <p>To analyze the effects of Knowledge Distillation on these image captioning models, we take a
teacher and a student architecture for each technique and train them separately. The distilled model
starts from the untrained student architecture and learns from the teacher's predictions alongside the
ground truth in order to converge more efficiently; a sketch of such a training step is given below. In
total, 9 different models have been trained and tested.</p>
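      <p>A sketch of one such distilled-student update (again assuming TensorFlow/Keras; the names
teacher, student, and batch_inputs are illustrative, and distillation_loss is the function sketched in
Section 1):</p>
      <preformat>
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()  # illustrative optimizer choice

@tf.function
def distilled_train_step(batch_inputs, target_words, teacher, student):
    """One student update; `batch_inputs` is whatever the captioning model expects
    (image features plus the partial caption), `target_words` the next-word indices."""
    # The teacher only supplies predictions; its weights are never updated.
    teacher_logits = teacher(batch_inputs, training=False)

    with tf.GradientTape() as tape:
        student_logits = student(batch_inputs, training=True)
        # Combine hard labels and the teacher's soft targets (see the Section 1 sketch).
        loss = tf.reduce_mean(
            distillation_loss(teacher_logits, student_logits, target_words))

    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
      </preformat>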
      <sec id="sec-4-1">
        <title>4.1 ResNet50 + Beam Search</title>
        <p>ResNet50 is the state-of-the-art object detection model used in this technique. It is a ResNet model
with 50 layers: 48 convolutional layers, a MaxPool layer, and an AveragePool layer. The model as
shown in Figure 5 and Figure 6 is used for feature extraction, and the embeddings are passed to the
encoder-decoder network (RNN), which uses a beam search algorithm to give the probabilities of the
next word over all the words in the pre-defined dictionary; a sketch of this decoding step is given below.
The pipeline starts with the ResNet50 model, whose final-layer output is passed to the Encoder.
The Encoder comprises a fully connected linear/dense layer and a normalization layer before being
passed to the Decoder. The Decoder involves an embedding layer, an LSTM layer, and a final fully
connected layer of the vocabulary size for the probabilities. Table 1 below depicts the architecture
details.</p>
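        <p>A framework-agnostic sketch of beam search decoding (the decode_step callable, the token ids,
and the beam width are placeholders standing in for the trained decoder and its settings):</p>
        <preformat>
import numpy as np

def beam_search(decode_step, start_id, end_id, beam_width=3, max_len=20):
    """Beam search over a captioning decoder.

    `decode_step(seq)` is assumed to return log-probabilities over the whole
    vocabulary (here, 5001 words) for the next word given the partial caption.
    """
    # Each beam is a (sequence of word ids, cumulative log-probability) pair.
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:          # finished captions are carried over as-is
                candidates.append((seq, score))
                continue
            log_probs = decode_step(seq)   # shape: (vocab_size,)
            top_ids = np.argsort(log_probs)[-beam_width:]
            for wid in top_ids:
                candidates.append((seq + [int(wid)], score + float(log_probs[wid])))
        # Keep only the `beam_width` most probable partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]
        </preformat>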
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Inception V3 + Encoder-Decoder with Attention</title>
        <p>
          Because only the final hidden state of the encoder RNN is used as the context vector for the decoder,
the traditional seq2seq model is generally unable to handle long input sequences effectively. In
this model, we have used Inception V3 to preprocess all the images in the datasets. The captions are
tokenized and the model is trained. The model consists of an encoder-decoder architecture, as in Figure
7, similar to [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The output from the lower convolutional layer of Inception V3 is squashed and
passed directly to the Encoder's fully connected layer. It is then decoded by the RNN decoder
(a GRU with attention), which predicts the caption.
        </p>
        <p>The architecture details in Table 2 consist of the Inception V3 model for object recognition and
image pre-processing. Its final-layer output is passed through a dense/fully connected layer, which
then passes it to the decoder. The decoder consists of a Bahdanau attention layer (sketched below), a
GRU layer, which retains information over longer spans than a plain RNN while being lighter than an
LSTM, and dense/fully connected layers to predict the probabilities of the words for the image.</p>
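        <p>A sketch of the Bahdanau (additive) attention layer described above, following the standard
TensorFlow/Keras formulation (the number of units is illustrative; the sizes actually used are those in
Table 2):</p>
        <preformat>
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over the Inception V3 feature map used by the GRU decoder."""
    def __init__(self, units=512):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects the image feature vectors
        self.W2 = tf.keras.layers.Dense(units)   # projects the previous GRU hidden state
        self.V = tf.keras.layers.Dense(1)        # scores each spatial location

    def call(self, features, hidden):
        # features: (batch, locations, embedding_dim) from the Inception V3 encoder
        # hidden:   (batch, units) previous decoder state
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(scores, axis=1)
        # The weighted sum of feature vectors is the context for predicting the next word.
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights
        </preformat>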
      </sec>
      <sec id="sec-4-3">
        <title>4.3 EffNetB0 + Transformer Encoder-Decoder</title>
        <p>The Transformer Neural Network is an architecture designed to tackle sequence-to-sequence tasks
while also being able to handle long-range dependencies. One major distinction of these networks is
that the input sequence can be processed in parallel, allowing the GPU to be fully exploited while also
increasing training speed. The vanishing gradient problem is also mitigated by a substantial margin
because the architecture is built on the multi-headed attention layer. Transformers, as in Figure 8, have
been successfully adapted to many deep learning tasks, easily outperforming other network architectures.
The architecture details (layer sizes, dropout rates, and the 5001-word vocabulary) are given in Table 3.</p>
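        <p>A sketch of one decoder block of this model (assuming TensorFlow/Keras with its built-in
MultiHeadAttention layer; the dimensions and number of heads are illustrative, and the causal-mask
argument requires TensorFlow 2.10 or later):</p>
        <preformat>
import tensorflow as tf

class TransformerDecoderBlock(tf.keras.layers.Layer):
    """One decoder block of the EffNetB0 + Transformer captioner (sizes illustrative)."""
    def __init__(self, embed_dim=256, num_heads=2, ff_dim=512):
        super().__init__()
        self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, embed_dim)
        self.cross_attn = tf.keras.layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, caption_embeddings, image_features):
        # Masked self-attention over the words generated so far.
        x = self.self_attn(caption_embeddings, caption_embeddings,
                           use_causal_mask=True)
        x = self.norm1(x + caption_embeddings)
        # Cross-attention from the partial caption to the EffNetB0 feature map.
        y = self.cross_attn(x, image_features)
        y = self.norm2(y + x)
        # Position-wise feed-forward layer; a Dense(vocab_size) head follows outside.
        return self.norm3(y + self.ffn(y))
        </preformat>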
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We have evaluated the results on the BLEU and CIDEr metrics. BLEU is the standard evaluation
metric for measuring the amount of correspondence between the network output and the ground truth.
The models were tested on a split of around 1000 images. The CIDEr metric is a consensus-based
metric that measures the similarity between the generated output and the set of human-written
reference sentences. Tables 4 and 5 compare the performance of state-of-the-art image captioning
architectures with that of our architectures; a sketch of the BLEU computation is given below. Despite
their smaller size, the knowledge-distilled models show comparable performance and are far more
efficient than the larger models.</p>
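      <p>A sketch of how the corpus-level BLEU scores can be computed with NLTK (the tokenised
example captions are made up for illustration; CIDEr is typically computed with a separate
implementation such as the COCO caption evaluation toolkit):</p>
      <preformat>
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def evaluate_bleu(references, hypotheses):
    """Corpus-level BLEU-1 to BLEU-4 for generated captions.

    `references` holds, per test image, the list of tokenised ground-truth
    captions; `hypotheses` holds the tokenised generated caption per image.
    """
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1 / 3, 1 / 3, 1 / 3, 0), (0.25, 0.25, 0.25, 0.25)]
    return [corpus_bleu(references, hypotheses, weights=w,
                        smoothing_function=smooth) for w in weights]

# Toy example with one reference per image; the datasets provide five each.
refs = [[["a", "dog", "runs", "on", "grass"]], [["a", "man", "rides", "a", "bike"]]]
hyps = [["a", "dog", "is", "running"], ["a", "man", "on", "a", "bike"]]
print(evaluate_bleu(refs, hyps))
      </preformat>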
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Discussion</title>
      <p>In this work, we trained multiple teacher, student, and distilled models on the image captioning
task. We used two standard datasets, namely Flickr8k and MS-COCO, to train the models. A
comparison of the results obtained for all the models shows that the transformer models effectively
outperformed their counterparts. We saw that in some cases the student model outperformed teacher
models of other architectures (Model 3 student vs Model 1 teacher), whereas in other cases the
knowledge-distilled model received enough of a boost to match or outperform other teachers (Model 3
distilled student vs Model 2 teacher). This clearly shows use cases where a smaller model can replace,
and to an extent even outperform, existing slower and larger models. Future work can include more
models in the study, as well as tuning of the distillation parameters. We can also study the effects of
further distillation to establish a relationship function between performance and distillation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Hinton</surname>
            , Geoffrey,
            <given-names>Oriol</given-names>
          </string-name>
          <string-name>
            <surname>Vinyals</surname>
            , and
            <given-names>Jeff</given-names>
          </string-name>
          <string-name>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>"Distilling the knowledge in a neural network</article-title>
          .
          <source>" arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Yang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Duygulu</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Faloutsos</surname>
          </string-name>
          , “
          <article-title>Automatic image captioning</article-title>
          ,” in ICME,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hejrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sadeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rashtchian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          , “
          <article-title>Every picture tells a story: Generating sentences from images</article-title>
          ,” in ECCV,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <surname>Peter</surname>
          </string-name>
          , et al.
          <article-title>"Bottom-up and top-down attention for image captioning and visual question answering</article-title>
          .
          <source>" Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <surname>Peter</surname>
          </string-name>
          , et al.
          <article-title>"Bottom-up and top-down attention for image captioning and visual question answering</article-title>
          .
          <source>" Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <surname>Quanzeng</surname>
          </string-name>
          , et al.
          <article-title>"Image captioning with semantic attention."</article-title>
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Kelvin</given-names>
          </string-name>
          &amp; Ba, Jimmy &amp; Kiros, Ryan &amp; Cho, Kyunghyun &amp; Courville, Aaron &amp; Salakhutdinov, Ruslan &amp; Zemel, Richard &amp; Bengio,
          <string-name>
            <surname>Y..</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Show, Attend and Tell: Neural Image Caption Generation with Visual Attention</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Xuying</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Xiaoshuai Sun, Yunpeng Luo, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Feiyue Huang, Rongrong Ji;
          <article-title>RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words ;</article-title>
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>15465</fpage>
          -
          <lpage>15474</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <surname>Quanzeng</surname>
          </string-name>
          , et al.
          <article-title>"Image captioning with semantic attention."</article-title>
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Anne</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Venugopalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          <article-title>, “Long-term recurrent convolutional networks for visual recognition and description</article-title>
          ,” in CVPR,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Lun</surname>
          </string-name>
          , et al.
          <article-title>"Attention on attention for image captioning</article-title>
          .
          <source>" Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>Ashish</given-names>
          </string-name>
          &amp; Shazeer, Noam &amp; Parmar, Niki &amp; Uszkoreit, Jakob &amp; Jones, Llion &amp; Gomez, Aidan &amp; Kaiser, Lukasz &amp; Polosukhin,
          <string-name>
            <surname>Illia.</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Attention Is All You Need</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Herdade</surname>
            ,
            <given-names>Simao</given-names>
          </string-name>
          &amp; Kappeler, Armin &amp; Boakye, Kofi &amp; Soares,
          <string-name>
            <surname>Joao.</surname>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Image Captioning: Transforming Objects into Words</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Xianyu</given-names>
            <surname>Chen</surname>
          </string-name>
          , Ming Jiang, and Qi Zhao,
          <article-title>Self-Distillation for Few-Shot Image Captioning</article-title>
          , University of Minnesota, Twin Cities.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Rennie</surname>
            ,
            <given-names>Steven J.</given-names>
          </string-name>
          , et al. “
          <article-title>Self-Critical Sequence Training for Image Captioning</article-title>
          .” ArXiv Preprint ArXiv:
          <volume>1612</volume>
          .00563,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ting</surname>
          </string-name>
          , et al. “
          <article-title>Exploring Visual Relationship for Image Captioning</article-title>
          .
          <source>” Proceedings of the European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>711</fpage>
          -
          <lpage>727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Jiang</surname>
          </string-name>
          , Wenhao et al. “
          <article-title>Recurrent Fusion Network for Image Captioning</article-title>
          .”
          <source>ECCV</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Cornia</surname>
          </string-name>
          ,
          <string-name>
            <surname>Marcella</surname>
          </string-name>
          , et al.
          <article-title>"Meshed-memory transformer for image captioning</article-title>
          .
          <source>" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>