Exploring the Relationship between Dataset Size and Image Captioning Model Performance

Tomáš Železný¹, Marek Hrúz¹
¹ Department of Cybernetics and New Technologies for the Information Society, Technická 8, 301 00 Plzeň, Czech Republic
zeleznyt@kky.zcu.cz (T. Železný); mhruz@ntis.zcu.cz (M. Hrúz)
ORCID: 0000-0002-0974-7069 (T. Železný), 0000-0002-7851-9879 (M. Hrúz)
26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023

Abstract
Image captioning is a deep learning task that involves computer vision methods to extract visual information from the image and natural language processing to generate the resulting caption in natural language. Image captioning models, just like other deep learning models, need a large amount of training data and require a long time to train. In this work, we investigate the impact of using a smaller amount of training data on the performance of the standard image captioning model Oscar. We train Oscar on different sizes of the training dataset and measure its performance in terms of accuracy and computational complexity. We observe that the computational time increases linearly with the amount of data used for training. However, the accuracy does not follow this linear trend, and the relative improvement diminishes as we add more data to the training. We also measure the consistency of individual sizes of the training sets and observe that the more data we use for training, the more consistent the metrics are. In addition to traditional evaluation metrics, we evaluate the performance using CLIP similarity. We investigate whether it can be used as a fully-fledged metric providing a unique advantage over the traditional metrics: it does not need reference captions acquired from human annotators. Our results show a high correlation between CLIP and the other metrics. This work provides valuable insights for understanding the requirements for training effective image captioning models. We believe our results can be transferred to other models, even to other deep-learning tasks.

Keywords
Image captioning, deep learning, computer vision, machine learning, data size analysis

1. Introduction

Image captioning is a task in computer vision that involves generating a textual description of an image. The goal is to provide a comprehensive and human-like description of the content of an image, which can be useful for a variety of applications, such as enabling individuals with visual impairments to better understand visual information, improving the accuracy and relevance of image search results, etc. It is a complex task because it requires the identification and interpretation of visual information, as well as the generation of grammatically correct and fluent sentences. This requires a combined effort of computer vision and natural language processing methods.
The scientific community has been interested in this task for over a decade [1]. Early methods relied on hand-crafted features and rule-based algorithms. Recent advances in machine learning and artificial intelligence have enabled the development of more effective image captioning models, which are able to generate high-quality captions for a wide range of images.

An important feature of image captioning is that there is not only one correct caption for an image. This is because different individuals may consider different aspects of an image to be important, and they may therefore describe the image in different ways. Because of this, there is no single ideal evaluation metric that can be used to measure the quality of a generated caption, as different metrics may be better suited for evaluating different attributes of the caption.

A general problem of deep learning is that it requires a large amount of data and the training process can be computationally intensive. In this work, we investigate the relationship between the size of the training dataset and the performance of a standard image captioning model, Oscar [2]. We train Oscar on different sizes of the training dataset and measure its performance in terms of accuracy as well as computational complexity. We expect the computational cost to behave linearly: increasing the size of the training dataset should result in a corresponding increase in computational time. This research is important because it can help us understand the limitations of deep learning models and the computational resources required to train them effectively. Additionally, our results can provide valuable insights for future research on image captioning and other applications of deep learning.

Our contribution in this work is an experiment that confirms the expected behavior of the Oscar model, i.e., the linear dependence of training time on dataset size. We also provide insight into the relationship between the size of the training dataset and the model's performance on selected metrics. Furthermore, we measure the consistency of the data for each of the metrics used, and we expect that smaller subsets of the data will have higher variance than larger subsets. Our results will help to better understand the requirements for training effective image captioning models and the potential trade-offs between dataset size and performance. Additionally, our findings may be useful for researchers and practitioners who are interested in optimizing the training of deep learning models in general.

In addition to using state-of-the-art evaluation metrics, we also evaluate our image captioning methods with CLIP (Contrastive Language-Image Pre-training) similarity [3]. We investigate whether CLIP can be used as a fully-fledged evaluation metric for image captioning. We find that it has a major advantage over traditional metrics: it does not require reference labels from annotators. This means that CLIP can be used to evaluate image captioning models in an unsupervised or self-supervised manner, which can be useful in situations where annotated data is not available or is too expensive to obtain.
2. Related Work

2.1. Datasets

Image captioning models are trained on large datasets consisting of pairs of images and captions. These datasets may differ in terms of the domain they cover, the number of image-caption pairs they contain, and the number of captions per image.

One well-known dataset for image captioning is Flickr30k [4], which includes approximately 31,000 images of everyday scenes, each described by five independent annotators, resulting in 155,000 image-caption pairs. Another popular dataset is COCO Captions [5], which contains over 164,000 images of everyday scenes, with five annotations per image, for a total of over 820,000 image-caption pairs. The Conceptual Captions dataset [6] comprises images collected from a large number of web pages, with one caption per image extracted from the alt-text HTML attribute. This dataset contains over 3,000,000 image-caption pairs. Conceptual12m [7] is a similar dataset, also extracted from web pages, with a total of over 12,000,000 image-caption pairs.

Each of these datasets has its own advantages and disadvantages. For instance, the Flickr30k dataset has good consistency and is well-suited for evaluation due to the multiple reference captions provided for each image. This is a valuable feature because a single image can often be described in multiple ways, and it is useful to have a diverse set of captions for each image to better capture the range of possible descriptions. However, the quality of datasets containing images collected from the internet, such as Conceptual Captions and Conceptual12m, may depend on the filtering applied during collection, and their consistency may be harder to guarantee. These datasets, however, offer a larger number of images and a greater variance. As a result, state-of-the-art image captioning models often utilize a combination of multiple datasets in order to achieve the best performance. In this work, we chose the COCO Captions dataset for our experiments due to its size, which is suitable both for training and for dividing into subsets. The COCO Captions dataset also has a sufficient number of images to allow for a robust evaluation of the model's performance.

2.2. Evaluation

The evaluation of image captions is a challenging task due to the inherent subjectivity of language and the multiple ways in which an image can be correctly described. Most evaluation metrics for image captioning compute the difference between a candidate caption and a reference caption provided by human annotators. Traditional metrics, such as BLEU [8], ROUGE [9], METEOR [10], and CIDEr [11], are based on the positions of n-grams in the candidate and reference captions. More advanced metrics, such as SPICE [12], measure the semantic similarity between the captions using graph-based representations.

Individual metrics may be suitable in different situations. For example, BLEU is a simple and inexpensive metric to compute, but it does not perform well when compared to other metrics [13]. On the other hand, CIDEr is considered to be the best-performing metric among those that compare n-grams in candidate and reference captions. However, it requires the entire dataset to be computed, making it computationally expensive for larger datasets. SPICE is a popular metric that compares the semantics of the captions rather than their syntax. However, it requires a complex model to accurately capture semantic relationships, making it computationally expensive.

In image generation, the Fréchet inception distance (FID) [14] is used to evaluate the quality of images generated by a generative model, such as a generative adversarial network (GAN) [15]. Similarly, CLIP [3] can be used to assess the similarity between an image and a text. CLIP is a deep learning model developed by OpenAI that encodes the image and the text into a common semantic space. The cosine similarity can then be used to compute the agreement between the input text and the image. Diffusion models for image generation also use CLIP [16] to evaluate the generated image based on the text input. In image captioning, CLIP can be used to evaluate the generated caption.
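To make the CLIP similarity score concrete, the following sketch computes the cosine similarity between an image and a candidate caption in CLIP's joint embedding space. It is only an illustration under our own assumptions: the paper does not state which CLIP variant or implementation was used, so the Hugging Face transformers interface and the ViT-B/32 checkpoint below are placeholders.

    # Illustrative sketch (not the authors' code): cosine similarity between an
    # image and a candidate caption in CLIP's joint embedding space.
    # The model variant and library are assumptions; the paper does not specify them.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_similarity(image_path: str, caption: str) -> float:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[caption], images=image, return_tensors="pt",
                           padding=True, truncation=True)
        with torch.no_grad():
            image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
        # Cosine similarity between L2-normalised embeddings.
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return float((image_emb * text_emb).sum())

A higher score means the caption agrees better with the image; no reference captions are involved at any point.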
Although CLIP has not been considered a standard evaluation metric for image captioning, in this study we present it as a potential fully-fledged metric that thoroughly assesses the semantic quality of candidate captions. We compute the correlation between CLIP and the other metrics and investigate whether CLIP can be used in this manner. A previous study [17] conducted similar experiments, but focused on computing the correlation with human judgment and comparing it to the correlations of other metrics, whereas we compute the correlations between the metrics directly.

Captions generated for the same image by models trained on the 1% subsets:
  sub01  a group of brown cows standing in a field
  sub02  a group of cows that are standing together.
  sub03  a group of cows are standing in the grass.
  sub04  a herd of black and white cows in a field.
  sub05  a group of cows stand together in a grassy area.
  sub06  a herd of cows standing in a field.
  sub07  a group of cows grazing on a field.
  sub08  a group of brown cows laying in a field
  sub09  a couple of cows standing together in a field.
  sub10  two cows in a field with a fence surrounded by green grass.
Captions generated for the same image by models trained on the 25% subsets:
  sub01  a cow that is laying down in the grass.
  sub02  a cow is standing in a field with another cow behind it.
  sub03  a cow is standing in a field with another cow.
  sub04  a cow with a red ear tag standing in a field.
  sub05  a black and white cow standing in a field.
  sub06  a cow is standing in the grass with another cow behind it.
  sub07  a cow is standing in a field with another cow behind it.
  sub08  a cow is standing in a field of grass.
  sub09  a cow is standing in a field with other cows.
  sub10  two cows are laying down in a field.
Figure 1: Examples of generated captions for the same image. The first group shows captions from different models trained on the 1% subsets of the data; the second group shows captions from models trained on the 25% subsets. We see that there is greater variability in the captions from the 1% subsets, while the semantics are mostly correct.
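The reference-based metrics discussed in this section (BLEU, METEOR, ROUGE, CIDEr, and SPICE) are commonly computed with the pycocoevalcap toolkit; the sketch below assumes that implementation, since the paper does not name the one it used (note that METEOR and SPICE additionally require a Java runtime).

    # Illustrative sketch, assuming the pycocoevalcap package; both dicts map an
    # image id to a list of captions (assumed already tokenized and lower-cased).
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider
    from pycocoevalcap.spice.spice import Spice

    def score_captions(references, candidates):
        """references: {image_id: [ref1, ref2, ...]}, candidates: {image_id: [hypothesis]}."""
        scorers = [(Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
                   (Meteor(), "METEOR"), (Rouge(), "ROUGE_L"),
                   (Cider(), "CIDEr"), (Spice(), "SPICE")]
        results = {}
        for scorer, name in scorers:
            score, _ = scorer.compute_score(references, candidates)
            if isinstance(name, list):      # Bleu returns one score per n-gram order
                results.update(dict(zip(name, score)))
            else:
                results[name] = score
        return results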
2.3. Image Captioning Methods

Recent advances in image captioning have seen the widespread adoption of deep learning techniques. Early methods used convolutional neural networks (CNNs) as encoders, such as the model proposed by [18]. More recent approaches have used Faster R-CNN [19] for object detection in images, leading to improved performance. The latest methods employ transformer architectures [20], which have achieved state-of-the-art performance on a variety of tasks. Among the best-performing methods are the transformer-based models Oscar [2], VinVL [21], and OFA [22], which use multimodal input. mPLUG [23] is another image captioning method that uses two unimodal encoders, one for images and one for text. These encoders are combined using a cross-modal skip-connected network, which consists of multiple skip-connected fusion blocks.

3. Experiments

In this work, we investigate the performance and efficiency of the image captioning method Oscar [2]. Our motivation for using this specific method is that we have previously used it in our own experiments and found it to be a convenient method to work with. While it may not currently be the best-performing model, Oscar is a transformer-based method, and we believe that the results of our experiments may generalize to other transformer-based or deep-learning models in the field.

To assess the performance of Oscar, we conducted two main experiments. The first experiment involved measuring the time needed to train the model using various amounts of data while tracking the performance on a set of chosen evaluation metrics. In addition to traditional metrics, we also evaluated the model using CLIP similarity [3]. In the second experiment, we measured the correlation between the various metrics used in order to determine the potential use of CLIP as a fully-fledged metric in the image captioning field.

3.1. Method

Our experiments are based on the training and evaluation of the image captioning model Oscar [2]. Oscar is a transformer-based model that uses a multimodal input. The input consists of feature vectors and tags of objects detected in the source image by an external object detector. The output is the predicted caption describing the source image.

The authors of Oscar provide a demonstration dataset of feature vectors and object tags that can be used as input to Oscar, but they do not specify the method by which these object detections were obtained. In order to generate captions for custom images outside of the demonstration dataset, we developed a full pipeline that takes a source image as input and produces a caption as output. According to [2], Oscar's input is a 2054-dimensional vector for each detected object, where the first 2048 dimensions are image features extracted from a detection network and the remaining 6 values contain the coordinates and size of the bounding box of the detected object. We used the Faster R-CNN detection network implemented in the Detectron2 [24] framework as the object detector, with the R50-C4 backbone, which meets the requirement of having a 2048-dimensional feature vector in the final layer. We use the feature vector from this layer together with the predicted class as the input to Oscar. The Faster R-CNN model was pre-trained on the COCO dataset [25] and is used without any further fine-tuning for our task. The quality of our pipeline is inherently restricted by the quality of the detector. In our case, we are able to detect only 80 possible classes (the COCO classes), which may limit the expressivity of the model.

Analysis of the demonstration dataset provided by Oscar revealed that there are always at least 10 detections per image, with confidence scores higher than 0.2. Based on this finding, we configured the object detector in our pipeline to generate detections with confidence scores higher than 0.2, and to include detections with lower confidence scores if there are fewer than 10 such detections in total. This ensures that the input to Oscar matches the format of the demonstration dataset.
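As an illustration of this input format, the sketch below assembles the per-object 2054-dimensional vectors and applies the detection-filtering rule described above. It is a minimal sketch rather than our actual pipeline code: the detector call is abstracted away, and the array names and the exact ordering of the six bounding-box values are assumptions.

    # Illustrative sketch (not the actual pipeline code) of assembling Oscar's
    # per-object input from detector outputs. The exact encoding of the six
    # bounding-box values is an assumption based on [2].
    import numpy as np

    def build_oscar_input(features, boxes, scores, classes,
                          score_thresh=0.2, min_detections=10):
        """features: (N, 2048) region features; boxes: (N, 4) as (x1, y1, x2, y2);
        scores: (N,) confidences sorted in descending order; classes: N object tags."""
        # Keep everything above the threshold, but never fewer than min_detections.
        keep = max(int((scores > score_thresh).sum()),
                   min(min_detections, len(scores)))
        widths = boxes[:keep, 2] - boxes[:keep, 0]
        heights = boxes[:keep, 3] - boxes[:keep, 1]
        # 2048 feature dimensions + 6 box values = 2054-dimensional vector per object.
        box_info = np.stack([boxes[:keep, 0], boxes[:keep, 1],
                             boxes[:keep, 2], boxes[:keep, 3],
                             widths, heights], axis=1)
        region_features = np.concatenate([features[:keep], box_info], axis=1)
        object_tags = " ".join(classes[:keep])  # tags fed to Oscar alongside the features
        return region_features, object_tags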
3.2. Dataset

In this work, we conducted experiments using the COCO Captions [5] dataset. It consists of 164,062 images with 5 captions each, divided into train, validation, and test sets. The annotation for the test set is not publicly available, so we redistributed the original train+val sets into our own train+val+test sets for evaluation on the COCO Captions dataset. The demonstration dataset provided by Oscar also consists of images from the COCO Captions dataset, split into train+val+test sets drawn from the original COCO Captions train+val sets. We decided to follow this distribution, resulting in final train+val+test sets of 113,287+5,000+5,000 images.

  1 %    a dog laying on top of a bed.
  10 %   a dog is laying on a bed in a room.
  25 %   a dog sitting on a bed next to a person.
  50 %   a dog sitting on a bed with clothes and a book.
  100 %  a dog sitting on a bed with a blanket and a pillow.
Figure 3: Examples of captions generated by the best models of each subset of the data. We can see the improvement of the caption as we add more data.

3.3. Impact of Different Volumes of Data on Model Performance

In this experiment, we evaluate the performance of the Oscar image captioning model on the COCO Captions dataset. As described in Section 3.2, the dataset was split into training, validation, and test sets, with the validation and test sets remaining unchanged for evaluation purposes.

To assess the effect of training data size on model performance, we selected various amounts of data from the training set to train Oscar. The sizes of the training subsets were 100%, 50%, 25%, 10%, and 1% of the original train set. For each subset size, multiple random selections were made from the full training set to measure the consistency of the selected data. The number of random selections for each subset size is shown in Table 1. The number of data selections was chosen to provide a sufficient number of samples to measure variance while also considering the computational resources available.

Table 1: Number of selections per subset size.
  Subset size   100 %   50 %   25 %   10 %   1 %
  Selections      1       5      10     10    10
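As an illustration, the random selections summarised in Table 1 could be drawn along the following lines. The fractions and selection counts follow the table; the seeding scheme and the naming of the subsets are our own assumptions, not the exact procedure used in the experiments.

    # Illustrative sketch (not the exact experimental script) of drawing the
    # random training subsets from Table 1.
    import random

    SUBSETS = {1.00: 1, 0.50: 5, 0.25: 10, 0.10: 10, 0.01: 10}  # fraction -> #selections

    def make_subsets(train_image_ids):
        subsets = {}
        for fraction, n_selections in SUBSETS.items():
            size = round(fraction * len(train_image_ids))
            for i in range(n_selections):
                random.seed(1000 * fraction + i)  # reproducible selections (assumed scheme)
                key = f"{int(fraction * 100)}pct_sub{i + 1:02d}"
                subsets[key] = random.sample(train_image_ids, size)
        return subsets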
The Oscar model was trained on the various training subsets for a total of 30 epochs, and the elapsed time was recorded. Training was conducted using NVIDIA GeForce GTX 1080 Ti GPUs. The relationship between elapsed time and training subset size is shown in Figure 2. As expected, this relationship follows a linear dependence between data size and computational time.

Figure 2: Relationship between the average time elapsed per epoch (y-axis, time per epoch [s]) and the subset size used for training (x-axis, size of subset [%]); individual runs are shown together with a linear approximation. We see that the measured data confirm the expected behaviour, i.e., linear dependence.

During training, the model was evaluated on the validation set after every 5th epoch, and the best-performing checkpoint was saved. The CIDEr metric was used for this evaluation because it has been found to correlate well with human judgment [17] and Oscar uses it as its default output score. After training, the best-performing checkpoint was selected based on its performance on the validation set and then evaluated on the test set. The resulting scores on the test set are shown in Figure 4.

Figure 4: Relationship between the size of the training set used to train Oscar [2] (x-axis, size of subset [%]) and the scores of the BLEU-4, CIDEr, and CLIP metrics obtained by evaluating the trained Oscar on the test set. We use a different axis for each metric to better visualize the trends in the individual metrics for a clearer comparison. The variance of the individual sets of given sizes is visualized by boxplots. We can see that the upper quartile of the smaller set does not intersect with the lower quartile of the larger set. Note that there is no variance for the 100% split because there was only one selection.

In order to assess the consistency of the evaluation results, we measured the variability of the metric scores for each subset size. The variability is visualized in Figure 4 using boxplots, which allow us to see the variance of the different metrics across the individual subsets. The non-overlapping quartiles of the boxplots indicate that there is a statistically significant difference in the scores depending on the subset size. This highlights the importance of carefully considering the subset size in order to obtain reliable results. For a qualitative assessment of this experiment, see Figures 1 and 3.

3.4. Evaluating Image Captioning with CLIP

In the second experiment, we investigate whether CLIP similarity can be used as a fully-fledged metric for evaluating image captioning. Our analysis of the data, as depicted in Figure 4, revealed that CLIP exhibits behavior similar to that of the other metrics. To further investigate this relationship, we calculated Pearson's correlation coefficient between all metrics across all subsets of the data. The resulting correlations are presented in Figure 5.

             Bleu_1  Bleu_2  Bleu_3  Bleu_4  METEOR  ROUGE_L  CIDEr   SPICE   CLIP
  Bleu_1     1.0000  0.9996  0.9990  0.9982  0.9981  0.9990   0.9986  0.9974  0.9983
  Bleu_2     0.9996  1.0000  0.9997  0.9991  0.9989  0.9994   0.9990  0.9981  0.9986
  Bleu_3     0.9990  0.9997  1.0000  0.9998  0.9994  0.9993   0.9993  0.9986  0.9983
  Bleu_4     0.9982  0.9991  0.9998  1.0000  0.9993  0.9989   0.9994  0.9986  0.9977
  METEOR     0.9981  0.9989  0.9994  0.9993  1.0000  0.9993   0.9993  0.9991  0.9986
  ROUGE_L    0.9990  0.9994  0.9993  0.9989  0.9993  1.0000   0.9988  0.9981  0.9985
  CIDEr      0.9986  0.9990  0.9993  0.9994  0.9993  0.9988   1.0000  0.9991  0.9984
  SPICE      0.9974  0.9981  0.9986  0.9986  0.9991  0.9981   0.9991  1.0000  0.9986
  CLIP       0.9983  0.9986  0.9983  0.9977  0.9986  0.9985   0.9984  0.9986  1.0000
Figure 5: Pearson's correlation coefficient matrix computed pair-wise for all used metrics. We see that all the metrics highly correlate.

Our findings show that all metrics are highly correlated. This indicates the correct, consistent, and expected behavior of all the metrics. In addition, we observed that the BLEU, METEOR, ROUGE, and CIDEr metrics tend to be, on average, more correlated with each other than with SPICE or CLIP. This trend is likely due to the fact that the former group of metrics compares the placement of n-grams in candidate and reference captions, while the latter two metrics do not consider syntactic content but rather focus on semantics.

The main takeaway is that CLIP is a viable metric for image captioning evaluation which does not need reference captions. This outcome is essential since it enables hypothetical training of a captioning system without references in an unsupervised manner.
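For completeness, the correlation analysis behind Figure 5 can be reproduced along the following lines. This is a rough sketch under our assumption about the data layout (one test-set score per trained model and per metric, in the same model order everywhere); it is not the exact evaluation script.

    # Illustrative sketch: pair-wise Pearson correlation between metrics, computed
    # across all trained models (i.e., across all subset selections).
    import numpy as np

    def metric_correlation_matrix(scores):
        """scores: {metric_name: [score of model 1, score of model 2, ...]}."""
        names = list(scores)
        values = np.array([scores[name] for name in names])  # (n_metrics, n_models)
        corr = np.corrcoef(values)                           # Pearson, pair-wise
        return names, corr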
4. Conclusion

In our work, we conducted several experiments to analyze the training of the image captioning method Oscar. First, we trained the method on different sizes of training data. We measured the elapsed time of the training loop and the performance on the chosen metrics. The training duration has a linear relationship with the volume of data that is used. Furthermore, we measured the behavior of the individual metrics based on the size of the training data and the consistency of the data for the individual subsets. We experimentally show that the models trained on smaller subsets have a higher variance in all the evaluation metrics than the models trained on larger sets. We observe that the scores converge to some value. However, the improvement of the individual metrics is not linearly dependent on the amount of data used for training: as we add more data for training, the improvement diminishes. This is affected by multiple phenomena. The first one is the capacity of the model itself, hence the convergence to a non-perfect value of the metrics. The second one is the quality of the dataset. We chose COCO Captions for multiple reasons: we believe it has good consistency, as it contains scenes of everyday life with a limited variety of objects, and it has 5 annotations per image. Another reason is its size: it is big enough to make an adequate 1% split, yet small enough for 36 training runs of 30 epochs to be computed in a reasonable time on our GPUs. Lastly, the quality of the detector producing the detections and feature vectors affects the performance. Based on our results, one can now decide to reduce the training data volume if the goal is to achieve a specific minimum score of a metric. It can be assumed that the behavior will be similar for other models and datasets.

In our second experiment, we evaluated the correlation between various state-of-the-art metrics and the CLIP metric, which, we believe, can be used as a fully-fledged metric for image captioning, with the major advantage that it does not need any reference captions. Our results showed that all the metrics, including CLIP, are highly correlated. This supports CLIP's potential use as a fully-fledged metric for image captioning. Previous research [17] has also investigated the CLIP metric, focusing on the correlation with human judgment and comparing it to the correlations of other metrics. In comparing those results to ours, we found that the ranking of the correlations of the individual metrics with human judgment corresponds to the ranking of their correlations with CLIP.

Acknowledgments

The work has been supported by the grant of the University of West Bohemia, project No. SGS-2022-017. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic. Also, we would like to thank RNDr. Blanka Šedivá, Ph.D. for giving us the initial idea for this research.

References

[1] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. Forsyth, Every picture tells a story: Generating sentences from images, in: European Conference on Computer Vision, Springer, 2010, pp. 15–29.
[2] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.
[3] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[4] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics 2 (2014) 67–78. doi:10.1162/tacl_a_00166.
[5] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. Zitnick, Microsoft COCO Captions: Data collection and evaluation server, arXiv preprint arXiv:1504.00325 (2015).
[6] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
[7] S. Changpinyo, P. Sharma, N. Ding, R. Soricut, Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, in: CVPR, 2021, pp. 3558–3568.
[8] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[9] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[10] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[11] R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: CVPR, 2015, pp. 4566–4575.
[12] P. Anderson, B. Fernando, M. Johnson, S. Gould, SPICE: Semantic propositional image caption evaluation, in: European Conference on Computer Vision, Springer, 2016, pp. 382–398.
[13] Y. Cui, G. Yang, A. Veit, X. Huang, S. Belongie, Learning to evaluate image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5804–5812.
[14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Advances in Neural Information Processing Systems 30 (2017).
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020) 139–144.
[16] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125 (2022).
[17] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, Y. Choi, CLIPScore: A reference-free evaluation metric for image captioning, arXiv preprint arXiv:2104.08718 (2021).
[18] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: CVPR, 2015, pp. 3156–3164.
[19] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[21] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual representations in vision-language models, in: CVPR, 2021, pp. 5579–5588.
[22] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, H. Yang, Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, arXiv preprint arXiv:2202.03052 (2022).
[23] C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, Z. Cao, et al., mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections, arXiv preprint arXiv:2205.12005 (2022).
[24] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, https://github.com/facebookresearch/detectron2, 2019.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.