Exploring the Relationship between Dataset Size and Image Captioning Model Performance

Tomáš Železný¹, Marek Hrúz¹
¹ Department of Cybernetics and New Technologies for the Information Society, Technická 8, 301 00 Plzeň, Czech Republic
zeleznyt@kky.zcu.cz (T. Železný); mhruz@ntis.zcu.cz (M. Hrúz)
ORCID: 0000-0002-0974-7069 (T. Železný), 0000-0002-7851-9879 (M. Hrúz)
26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023

Abstract
Image captioning is a deep learning task that involves computer vision methods to extract visual information from the image and natural language processing to generate the resulting caption in natural language. Image captioning models, just like other deep learning models, need a large amount of training data and require a long time to train. In this work, we investigate the impact of using a smaller amount of training data on the performance of the standard image captioning model Oscar. We train Oscar on different sizes of the training dataset and measure its performance in terms of accuracy and computational complexity. We observe that the computational time increases linearly with the amount of data used for training. However, the accuracy does not follow this linear trend, and the relative improvement diminishes as we add more data to the training. We also measure the consistency of individual sizes of the training sets and observe that the more data we use for training, the more consistent the metrics are. In addition to traditional evaluation metrics, we evaluate the performance using CLIP similarity. We investigate whether it can be used as a fully-fledged metric providing a unique advantage over the traditional metrics: it does not need reference captions acquired from human annotators. Our results show a high correlation between CLIP and the other metrics. This work provides valuable insights for understanding the requirements for training effective image captioning models. We believe our results can be transferred to other models, even to other deep-learning tasks.

Keywords
Image captioning, deep learning, computer vision, machine learning, data size analysis

1. Introduction

Image captioning is a task in computer vision that involves generating a textual description of an image. The goal is to provide a comprehensive and human-like description of the content of an image, which can be useful for a variety of applications, such as enabling individuals with visual impairments to better understand visual information, improving the accuracy and relevance of image search results, etc. It is a complex task because it requires the identification and interpretation of visual information, as well as the generation of grammatically correct and fluent sentences. This requires a combined effort of computer vision and natural language processing methods.
The scientific community has been interested in this task for over a decade [1]. Early methods relied on hand-crafted features and rule-based algorithms. Recent advances in machine learning and artificial intelligence have enabled the development of more effective image captioning models, which are able to generate high-quality captions for a wide range of images.

An important feature of image captioning is that there is not only one correct caption for an image. This is because different individuals may consider different aspects of an image to be important, and they may therefore describe the image in different ways. Because of this, there is no single ideal evaluation metric that can be used to measure the quality of a generated caption, as different metrics may be better suited for evaluating different attributes of the caption.

A general problem of deep learning is that it requires a large amount of data and the training process can be computationally intensive. In this work, we investigate the relationship between the size of the training dataset and the performance of a standard image captioning model, Oscar [2]. We train Oscar on different sizes of the training dataset and measure its performance in terms of accuracy as well as computational complexity. We expect the computational cost to behave linearly: increasing the size of the training dataset should result in a corresponding increase in computational time. This research is important because it can help us understand the limitations of deep learning models and the computational resources required to train them effectively. Additionally, our results can provide valuable insights for future research on image captioning and other applications of deep learning.

Our contribution in this work is an experiment that confirms the expected behavior of the Oscar model, i.e., the linear dependence of training time on dataset size. We also provide insight into the relationship between the size of the training dataset and the model's performance on selected metrics. Furthermore, we measure the consistency of the data for each of the metrics used, and we expect that smaller subsets of the data will have higher variance than larger subsets. Our results will help to better understand the requirements for training effective image captioning models and the potential trade-offs between dataset size and performance. Additionally, our findings may be useful for researchers and practitioners who are interested in optimizing the training of deep learning models in general.

In addition to using state-of-the-art evaluation metrics, we also evaluate our image captioning methods with CLIP (Contrastive Language-Image Pre-training) similarity [3]. We investigate whether CLIP can be used as a fully-fledged evaluation metric for image captioning. We find that it has a major advantage over traditional metrics: it does not require reference labels from annotators. This means that CLIP can be used to evaluate image captioning models in an unsupervised or self-supervised manner, which can be useful in situations where annotated data is not available or is too expensive to obtain.
2. Related Work

2.1. Datasets

Image captioning models are trained on large datasets consisting of pairs of images and captions. These datasets may differ in terms of the domain they cover, the number of image-caption pairs they contain, and the number of captions per image.

One well-known dataset for image captioning is Flickr30k [4], which includes approximately 31,000 images of everyday scenes, each described by five independent annotators, resulting in 155,000 image-caption pairs. Another popular dataset is COCO Captions [5], which contains over 164,000 images of everyday scenes, with five annotations per image, for a total of over 820,000 image-caption pairs. The Conceptual Captions dataset [6] comprises images collected from a large number of web pages, with one caption per image extracted from the alt-text HTML attribute. This dataset contains over 3,000,000 image-caption pairs. Conceptual12m [7] is a similar dataset, also extracted from web pages, with a total of over 12,000,000 image-caption pairs.

Each of these datasets has its own advantages and disadvantages. For instance, the Flickr30k dataset has good consistency and is well-suited for evaluation due to the multiple reference captions provided for each image. This is a valuable feature because a single image can often be described in multiple ways, and it is useful to have a diverse set of captions for each image to better capture the range of possible descriptions. However, the quality of datasets containing images collected from the internet, such as Conceptual Captions and Conceptual12m, may depend on the filtering applied during collection, and their consistency may be harder to guarantee. These datasets, however, offer a larger number of images and a greater variance. As a result, state-of-the-art image captioning models often utilize a combination of multiple datasets in order to achieve the best performance. In this work, we chose the COCO Captions dataset for our experiments due to its size, which is suitable both for training and for dividing into subsets. The COCO Captions dataset also has a sufficient number of images to allow for a robust evaluation of the model's performance.

2.2. Evaluation

The evaluation of image captions is a challenging task due to the inherent subjectivity of language and the multiple ways in which an image can be correctly described. Most evaluation metrics for image captioning compute the difference between a candidate caption and a reference caption provided by human annotators. Traditional metrics, such as BLEU [8], ROUGE [9], METEOR [10], and CIDEr [11], are based on the positions of n-grams in the candidate and reference captions. More advanced metrics, such as SPICE [12], measure the semantic similarity between the captions using graph-based representations.

Individual metrics may be suitable in different situations. For example, BLEU is a simple and inexpensive metric to compute, but it does not perform well when compared to other metrics [13]. On the other hand, CIDEr is considered to be the best-performing metric among those that compare n-grams in candidate and reference captions. However, it requires the entire dataset to be computed, making it computationally expensive for larger datasets. SPICE is a popular metric that compares the semantics of the captions rather than their syntax. However, it requires a complex model to accurately capture semantic relationships, making it computationally expensive.

In image generation, the Fréchet inception distance (FID) [14] is used to evaluate the quality of images generated by a generative model, such as a generative adversarial network (GAN) [15]. Similarly, CLIP [3] can be used to assess the similarity between an image and a text. CLIP is a deep learning model developed by OpenAI that encodes the image and the text into a common semantic space. The cosine similarity can then be used to compute the agreement between the input text and the image. Diffusion models for image generation also use CLIP [16] to evaluate the generated image based on the text input. In image captioning, CLIP can be used to evaluate the generated caption.
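To make the CLIP similarity score concrete, the following sketch computes the cosine similarity between an image and a candidate caption in CLIP's joint embedding space. It is only an illustration under our own assumptions: the paper does not state which CLIP variant or implementation was used, so the Hugging Face transformers interface and the ViT-B/32 checkpoint below are placeholders.

    # Illustrative sketch (not the authors' code): cosine similarity between an
    # image and a candidate caption in CLIP's joint embedding space.
    # The model variant and library are assumptions; the paper does not specify them.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_similarity(image_path: str, caption: str) -> float:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[caption], images=image, return_tensors="pt",
                           padding=True, truncation=True)
        with torch.no_grad():
            image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
        # Cosine similarity between L2-normalised embeddings.
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return float((image_emb * text_emb).sum())

A higher score means the caption agrees better with the image; no reference captions are involved at any point.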
Although CLIP has not been considered a standard evaluation metric for image captioning, in this study we present it as a potential fully-fledged metric that thoroughly assesses the semantic quality of candidate captions. We compute the correlation between CLIP and the other metrics and investigate whether CLIP can be used in this manner. A previous study [17] conducted similar experiments, but focused on computing the correlation with human judgment and comparing it to the correlations of other metrics, whereas we compute the correlations between the metrics directly.

Captions generated for the same image by models trained on the 1% subsets:
  sub01  a group of brown cows standing in a field
  sub02  a group of cows that are standing together.
  sub03  a group of cows are standing in the grass.
  sub04  a herd of black and white cows in a field.
  sub05  a group of cows stand together in a grassy area.
  sub06  a herd of cows standing in a field.
  sub07  a group of cows grazing on a field.
  sub08  a group of brown cows laying in a field
  sub09  a couple of cows standing together in a field.
  sub10  two cows in a field with a fence surrounded by green grass.
Captions generated for the same image by models trained on the 25% subsets:
  sub01  a cow that is laying down in the grass.
  sub02  a cow is standing in a field with another cow behind it.
  sub03  a cow is standing in a field with another cow.
  sub04  a cow with a red ear tag standing in a field.
  sub05  a black and white cow standing in a field.
  sub06  a cow is standing in the grass with another cow behind it.
  sub07  a cow is standing in a field with another cow behind it.
  sub08  a cow is standing in a field of grass.
  sub09  a cow is standing in a field with other cows.
  sub10  two cows are laying down in a field.
Figure 1: Examples of generated captions for the same image. The first group shows captions from different models trained on the 1% subsets of the data; the second group shows captions from models trained on the 25% subsets. We see that there is greater variability in the captions from the 1% subsets, while the semantics are mostly correct.
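The reference-based metrics discussed in this section (BLEU, METEOR, ROUGE, CIDEr, and SPICE) are commonly computed with the pycocoevalcap toolkit; the sketch below assumes that implementation, since the paper does not name the one it used (note that METEOR and SPICE additionally require a Java runtime).

    # Illustrative sketch, assuming the pycocoevalcap package; both dicts map an
    # image id to a list of captions (assumed already tokenized and lower-cased).
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider
    from pycocoevalcap.spice.spice import Spice

    def score_captions(references, candidates):
        """references: {image_id: [ref1, ref2, ...]}, candidates: {image_id: [hypothesis]}."""
        scorers = [(Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
                   (Meteor(), "METEOR"), (Rouge(), "ROUGE_L"),
                   (Cider(), "CIDEr"), (Spice(), "SPICE")]
        results = {}
        for scorer, name in scorers:
            score, _ = scorer.compute_score(references, candidates)
            if isinstance(name, list):      # Bleu returns one score per n-gram order
                results.update(dict(zip(name, score)))
            else:
                results[name] = score
        return results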
2.3. Image Captioning Methods

Recent advances in image captioning have seen the widespread adoption of deep learning techniques. Early methods used convolutional neural networks (CNNs) as encoders, such as the model proposed by [18]. More recent approaches have used Faster R-CNN [19] for object detection in images, leading to improved performance. The latest methods employ transformer architectures [20], which have achieved state-of-the-art performance on a variety of tasks. Among the best-performing methods are the transformer-based models Oscar [2], VinVL [21], and OFA [22], which use multimodal input. mPLUG [23] is another image captioning method that uses two unimodal encoders, one for images and one for text. These encoders are combined using a cross-modal skip-connected network, which consists of multiple skip-connected fusion blocks.

3. Experiments

In this work, we investigate the performance and efficiency of the image captioning method Oscar [2]. Our motivation for using this specific method is that we have previously used it in our own experiments and found it to be a convenient method to work with. While it may not currently be the best-performing model, Oscar is a transformer-based method, and we believe that the results of our experiments may generalize to other transformer-based or deep-learning models in the field.

To assess the performance of Oscar, we conducted two main experiments. The first experiment involved measuring the time needed to train the model using various amounts of data while tracking the performance on a set of chosen evaluation metrics. In addition to traditional metrics, we also evaluated the model using CLIP similarity [3]. In the second experiment, we measured the correlation between the various metrics used in order to determine the potential use of CLIP as a fully-fledged metric in the image captioning field.

3.1. Method

Our experiments are based on the training and evaluation of the image captioning model Oscar [2]. Oscar is a transformer-based model that uses a multimodal input. The input consists of feature vectors and tags of objects detected in the source image by an external object detector. The output is the predicted caption describing the source image.

The authors of Oscar provide a demonstration dataset of feature vectors and object tags that can be used as input to Oscar, but they do not specify the method by which these object detections were obtained. In order to generate captions for custom images outside of the demonstration dataset, we developed a full pipeline that takes a source image as input and produces a caption as output. According to [2], Oscar's input is a 2054-dimensional vector for each detected object, where the first 2048 dimensions are image features extracted from a detection network and the remaining 6 values contain the coordinates and size of the bounding box of the detected object. We used the Faster R-CNN detection network implemented in the Detectron2 [24] framework as the object detector, with the R50-C4 backbone, which meets the requirement of having a 2048-dimensional feature vector in the final layer. We use the feature vector from this layer together with the predicted class as the input to Oscar. The Faster R-CNN model was pre-trained on the COCO dataset [25] and is used without any further fine-tuning for our task. The quality of our pipeline is inherently restricted by the quality of the detector. In our case, we are able to detect only 80 possible classes (the COCO classes), which may limit the expressivity of the model.

Analysis of the demonstration dataset provided by Oscar revealed that there are always at least 10 detections per image, with confidence scores higher than 0.2. Based on this finding, we configured the object detector in our pipeline to generate detections with confidence scores higher than 0.2, and to include detections with lower confidence scores if there are fewer than 10 such detections in total. This ensures that the input to Oscar matches the format of the demonstration dataset.
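As an illustration of this input format, the sketch below assembles the per-object 2054-dimensional vectors and applies the detection-filtering rule described above. It is a minimal sketch rather than our actual pipeline code: the detector call is abstracted away, and the array names and the exact ordering of the six bounding-box values are assumptions.

    # Illustrative sketch (not the actual pipeline code) of assembling Oscar's
    # per-object input from detector outputs. The exact encoding of the six
    # bounding-box values is an assumption based on [2].
    import numpy as np

    def build_oscar_input(features, boxes, scores, classes,
                          score_thresh=0.2, min_detections=10):
        """features: (N, 2048) region features; boxes: (N, 4) as (x1, y1, x2, y2);
        scores: (N,) confidences sorted in descending order; classes: N object tags."""
        # Keep everything above the threshold, but never fewer than min_detections.
        keep = max(int((scores > score_thresh).sum()),
                   min(min_detections, len(scores)))
        widths = boxes[:keep, 2] - boxes[:keep, 0]
        heights = boxes[:keep, 3] - boxes[:keep, 1]
        # 2048 feature dimensions + 6 box values = 2054-dimensional vector per object.
        box_info = np.stack([boxes[:keep, 0], boxes[:keep, 1],
                             boxes[:keep, 2], boxes[:keep, 3],
                             widths, heights], axis=1)
        region_features = np.concatenate([features[:keep], box_info], axis=1)
        object_tags = " ".join(classes[:keep])  # tags fed to Oscar alongside the features
        return region_features, object_tags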
3.2. Dataset

In this work, we conducted experiments using the COCO Captions [5] dataset. It consists of 164,062 images with 5 captions each, divided into train, validation, and test sets. The annotation for the test set is not publicly available, so we redistributed the original train+val sets into our own train+val+test sets for evaluation on the COCO Captions dataset. The demonstration dataset provided by Oscar also consists of images from the COCO Captions dataset, split into train+val+test sets drawn from the original COCO Captions train+val sets. We decided to follow this distribution, resulting in final train+val+test sets of 113,287+5,000+5,000 images.

  1 %    a dog laying on top of a bed.
  10 %   a dog is laying on a bed in a room.
  25 %   a dog sitting on a bed next to a person.
  50 %   a dog sitting on a bed with clothes and a book.
  100 %  a dog sitting on a bed with a blanket and a pillow.
Figure 3: Examples of captions generated by the best models of each subset of the data. We can see the improvement of the caption as we add more data.

3.3. Impact of Different Volumes of Data on Model Performance

In this experiment, we evaluate the performance of the Oscar image captioning model on the COCO Captions dataset. As described in Section 3.2, the dataset was split into training, validation, and test sets, with the validation and test sets remaining unchanged for evaluation purposes.

To assess the effect of training data size on model performance, we selected various amounts of data from the training set to train Oscar. The sizes of the training subsets were 100%, 50%, 25%, 10%, and 1% of the original train set. For each subset size, multiple random selections were made from the full training set to measure the consistency of the selected data. The number of random selections for each subset size is shown in Table 1. The number of data selections was chosen to provide a sufficient number of samples to measure variance while also considering the computational resources available.

Table 1: Number of selections per subset size.
  Subset size   100 %   50 %   25 %   10 %   1 %
  Selections      1       5      10     10    10
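As an illustration, the random selections summarised in Table 1 could be drawn along the following lines. The fractions and selection counts follow the table; the seeding scheme and the naming of the subsets are our own assumptions, not the exact procedure used in the experiments.

    # Illustrative sketch (not the exact experimental script) of drawing the
    # random training subsets from Table 1.
    import random

    SUBSETS = {1.00: 1, 0.50: 5, 0.25: 10, 0.10: 10, 0.01: 10}  # fraction -> #selections

    def make_subsets(train_image_ids):
        subsets = {}
        for fraction, n_selections in SUBSETS.items():
            size = round(fraction * len(train_image_ids))
            for i in range(n_selections):
                random.seed(1000 * fraction + i)  # reproducible selections (assumed scheme)
                key = f"{int(fraction * 100)}pct_sub{i + 1:02d}"
                subsets[key] = random.sample(train_image_ids, size)
        return subsets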
The Oscar model was trained on the various training subsets for a total of 30 epochs, and the elapsed time was recorded. Training was conducted using NVIDIA GeForce GTX 1080 Ti GPUs. The relationship between elapsed time and training subset size is shown in Figure 2. As expected, this relationship follows a linear dependence between data size and computational time.

Figure 2: Relationship between the average time elapsed per epoch (y-axis, time per epoch [s]) and the subset size used for training (x-axis, size of subset [%]); individual runs are shown together with a linear approximation. We see that the measured data confirm the expected behaviour, i.e., linear dependence.

During training, the model was evaluated on the validation set after every 5th epoch, and the best-performing checkpoint was saved. The CIDEr metric was used for this evaluation because it has been found to correlate well with human judgment [17] and Oscar uses it as its default output score. After training, the best-performing checkpoint was selected based on its performance on the validation set and then evaluated on the test set. The resulting scores on the test set are shown in Figure 4.

Figure 4: Relationship between the size of the training set used to train Oscar [2] (x-axis, size of subset [%]) and the scores of the BLEU-4, CIDEr, and CLIP metrics obtained by evaluating the trained Oscar on the test set. We use a different axis for each metric to better visualize the trends in the individual metrics for a clearer comparison. The variance of the individual sets of given sizes is visualized by boxplots. We can see that the upper quartile of the smaller set does not intersect with the lower quartile of the larger set. Note that there is no variance for the 100% split because there was only one selection.

In order to assess the consistency of the evaluation results, we measured the variability of the metric scores for each subset size. The variability is visualized in Figure 4 using boxplots, which allow us to see the variance of the different metrics across the individual subsets. The non-overlapping quartiles of the boxplots indicate that there is a statistically significant difference in the scores depending on the subset size. This highlights the importance of carefully considering the subset size in order to obtain reliable results. For a qualitative assessment of this experiment, see Figures 1 and 3.

3.4. Evaluating Image Captioning with CLIP

In the second experiment, we investigate whether CLIP similarity can be used as a fully-fledged metric for evaluating image captioning. Our analysis of the data, as depicted in Figure 4, revealed that CLIP exhibits behavior similar to that of the other metrics. To further investigate this relationship, we calculated Pearson's correlation coefficient between all metrics across all subsets of the data. The resulting correlations are presented in Figure 5.

             Bleu_1  Bleu_2  Bleu_3  Bleu_4  METEOR  ROUGE_L  CIDEr   SPICE   CLIP
  Bleu_1     1.0000  0.9996  0.9990  0.9982  0.9981  0.9990   0.9986  0.9974  0.9983
  Bleu_2     0.9996  1.0000  0.9997  0.9991  0.9989  0.9994   0.9990  0.9981  0.9986
  Bleu_3     0.9990  0.9997  1.0000  0.9998  0.9994  0.9993   0.9993  0.9986  0.9983
  Bleu_4     0.9982  0.9991  0.9998  1.0000  0.9993  0.9989   0.9994  0.9986  0.9977
  METEOR     0.9981  0.9989  0.9994  0.9993  1.0000  0.9993   0.9993  0.9991  0.9986
  ROUGE_L    0.9990  0.9994  0.9993  0.9989  0.9993  1.0000   0.9988  0.9981  0.9985
  CIDEr      0.9986  0.9990  0.9993  0.9994  0.9993  0.9988   1.0000  0.9991  0.9984
  SPICE      0.9974  0.9981  0.9986  0.9986  0.9991  0.9981   0.9991  1.0000  0.9986
  CLIP       0.9983  0.9986  0.9983  0.9977  0.9986  0.9985   0.9984  0.9986  1.0000
Figure 5: Pearson's correlation coefficient matrix computed pair-wise for all used metrics. We see that all the metrics highly correlate.

Our findings show that all metrics are highly correlated. This indicates the correct, consistent, and expected behavior of all the metrics. In addition, we observed that the BLEU, METEOR, ROUGE, and CIDEr metrics tend to be, on average, more correlated with each other than with SPICE or CLIP. This trend is likely due to the fact that the former group of metrics compares the placement of n-grams in candidate and reference captions, while the latter two metrics do not consider syntactic content but rather focus on semantics.

The main takeaway is that CLIP is a viable metric for image captioning evaluation which does not need reference captions. This outcome is essential since it enables hypothetical training of a captioning system without references in an unsupervised manner.
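For completeness, the correlation analysis behind Figure 5 can be reproduced along the following lines. This is a rough sketch under our assumption about the data layout (one test-set score per trained model and per metric, in the same model order everywhere); it is not the exact evaluation script.

    # Illustrative sketch: pair-wise Pearson correlation between metrics, computed
    # across all trained models (i.e., across all subset selections).
    import numpy as np

    def metric_correlation_matrix(scores):
        """scores: {metric_name: [score of model 1, score of model 2, ...]}."""
        names = list(scores)
        values = np.array([scores[name] for name in names])  # (n_metrics, n_models)
        corr = np.corrcoef(values)                           # Pearson, pair-wise
        return names, corr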
4. Conclusion

In our work, we conducted several experiments to analyze the training of the image captioning method Oscar. First, we trained the method on different sizes of training data. We measured the elapsed time of the training loop and the performance on the chosen metrics. The training duration has a linear relationship with the volume of data that is used. Furthermore, we measured the behavior of the individual metrics based on the size of the training data and the consistency of the data for the individual subsets. We experimentally show that the models trained on smaller subsets have a higher variance in all the evaluation metrics than the models trained on larger sets. We observe that the scores converge to some value. However, the improvement of the individual metrics is not linearly dependent on the amount of data used for training: as we add more data for training, the improvement diminishes. This is affected by multiple phenomena. The first one is the capacity of the model itself, hence the convergence to a non-perfect value of the metrics. The second one is the quality of the dataset. We chose COCO Captions for multiple reasons: we believe it has good consistency, as it contains scenes of everyday life with a limited variety of objects, and it has 5 annotations per image. Another reason is its size: it is big enough to make an adequate 1% split, yet small enough for 36 training runs of 30 epochs to be computed in a reasonable time on our GPUs. Lastly, the quality of the detector producing the detections and feature vectors affects the performance. Based on our results, one can now decide to reduce the training data volume if the goal is to achieve a specific minimum score of a metric. It can be assumed that the behavior will be similar for other models and datasets.

In our second experiment, we evaluated the correlation between various state-of-the-art metrics and the CLIP metric, which, we believe, can be used as a fully-fledged metric for image captioning, with the major advantage that it does not need any reference captions. Our results showed that all the metrics, including CLIP, are highly correlated. This supports CLIP's potential use as a fully-fledged metric for image captioning. Previous research [17] has also investigated the CLIP metric, focusing on the correlation with human judgment and comparing it to the correlations of other metrics. In comparing those results to ours, we found that the ranking of the correlations of the individual metrics with human judgment corresponds to the ranking of their correlations with CLIP.

Acknowledgments

The work has been supported by the grant of the University of West Bohemia, project No. SGS-2022-017. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic. Also, we would like to thank RNDr. Blanka Šedivá, Ph.D. for giving us the initial idea for this research.

References

[1] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. Forsyth, Every picture tells a story: Generating sentences from images, in: European Conference on Computer Vision, Springer, 2010, pp. 15–29.
[2] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.
[3] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[4] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics 2 (2014) 67–78. doi:10.1162/tacl_a_00166.
[5] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. Zitnick, Microsoft COCO Captions: Data collection and evaluation server, arXiv preprint arXiv:1504.00325 (2015).
[6] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
[7] S. Changpinyo, P. Sharma, N. Ding, R. Soricut, Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, in: CVPR, 2021, pp. 3558–3568.
[8] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[9] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[10] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[11] R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: CVPR, 2015, pp. 4566–4575.
[12] P. Anderson, B. Fernando, M. Johnson, S. Gould, SPICE: Semantic propositional image caption evaluation, in: European Conference on Computer Vision, Springer, 2016, pp. 382–398.
[13] Y. Cui, G. Yang, A. Veit, X. Huang, S. Belongie, Learning to evaluate image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5804–5812.
[14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Advances in Neural Information Processing Systems 30 (2017).
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020) 139–144.
[16] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125 (2022).
[17] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, Y. Choi, CLIPScore: A reference-free evaluation metric for image captioning, arXiv preprint arXiv:2104.08718 (2021).
[18] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: CVPR, 2015, pp. 3156–3164.
[19] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[21] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual representations in vision-language models, in: CVPR, 2021, pp. 5579–5588.
[22] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, H. Yang, Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, arXiv preprint arXiv:2202.03052 (2022).
[23] C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, Z. Cao, et al., mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections, arXiv preprint arXiv:2205.12005 (2022).
[24] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, https://github.com/facebookresearch/detectron2, 2019.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.