1. Introduction

1613-0073

Captioning⋆

Zhishen Yang

zhishen.yang@nlp.c.titech.ac.jp 1 2

Raj Dabre

raj.dabre@nict.go.jp 0 2

Hideki Tanaka

hideki.tanaka@nict.go.jp 0 2

Naoaki Okazaki

okazaki@c.titech.ac.jp 1 2

Figure Captioning, Multimodal Machine Learning, Scientific Document Understanding

0 National Institute of Information and Communications Technology , 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289 , Japan 1 Tokyo Institute of Technology , 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8550 , Japan 2 Workshop Proce dings

In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating ifgure caption generation helps move model understandings of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific ifgure captioning as a knowledge-augmented image captioning task that models need to utilize knowledge embedded across modalities for caption generation. To this end, we extended the large-scale SciCap dataset [1] to SciCap+, which includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. Then, we conduct experiments with the M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention-paragraphs serve as additional context knowledge, significantly boosting the automatic standard image caption evaluation scores compared to the figure-only baselines. Human evaluations further reveal the challenges of generating figure captions that are informative to readers. The code and SciCap+ dataset are publicly available: https: various formats, such as journal articles, book chapters, caption should contain an analysis that the authors intables, they help communicate knowledge to readers. Us- ure captioning task as a figure-to-caption task: A model ∗Corresponding author.

1. Introduction

Scholarly documents are the primary source for sharing scientific knowledge. These documents are available in and conference proceedings. A significant portion of these documents is text and together with figures and ing figures provides visual representations of complex information that facilitate the sharing of scientific findings with readers eficiently and straightforwardly. The standard practice for scientific writing is to write a caption for each figure, accompanied by paragraphs with detailed explanations. Figures and captions should be standalone, and readers should be able to understand the figures without referring to the main text. Helping authors write appropriate and informative captions for ifgures will improve the quality of scientific documents, thereby enhancing the speed and quality of scientific communication. In this study, we focus on automating the generation of captions for figures in scientific papers.

Scientific figure captioning is a variant of the image captioning task. However, with the same goal of generating a caption, it has two unique challenges: 1. Figures are

CEUR

CEUR not natural images: In contrast to natural images, visual objects are texts and data points in scientific figures. 2.

The captions of the figures should explain: Instead of

simply identifying objects and texts in the figures, the tend to present and highlight findings.

A previous study [ 1 ], SciCap, defines the scientific figgenerates captions only referring to figures. Their work reported relatively lower scores as measured by automatic evaluation metrics, indicating that there is considerable room for improvement. Intuitively, writing appropriate figure captions without suficient background knowledge is dificult, since even humans will struggle to interpret a figure and write a caption unless some background knowledge is available. On the basis of this observation, we think that generating appropriate captions is infeasible without adding context knowledge to the caption generation model. This context comes in two forms: background knowledge from the running text and the OCR tokens in the figure, both of which should help reduce the burden on the captioning model. To this end, we augment the existing large-scale scientific figure captioning dataset: SciCap with mention-paragraphs and OCR tokens and call the resultant dataset as Sci

Cap+. We then pose scientific figure captioning as a mul

timodal summarization task and use the M4C captioner model [ 2 ] (a model that utilizes multimodal knowledge to generate captions) as a baseline to study the scientific ifgure captioning task. The experimental result of auCEUR

ceur-ws.org ures to highlight scientific findings that authors want to present to readers. With this unique characteristic, without referring to mention-paragraphs, which usually refer to the figure, it is extremely challenging for a human to have proper interpretations of figures. This is because they may lack background knowledge of the domain or context of the figure. As figure 1 shows, by only looking at the figure, we do not know what ”comm.(KB)” stands for; therefore lacking the knowledge to write informative captions is challenging. However, the mentionparagraph contains ”communication cost” and this is also present in the caption, indicating that such background knowledge should help in writing accurate captions.

3. Problem Formulation 4. SciCap+ Dataset

SciCap is a large-scale figure-caption dataset comprising graph plots extracted from 10 years of collections of arXiv computer science papers. We used around 414k figures from SciCap and augment each figure with its mentionparagraphs and OCR tokens with metadata. This section details the data set creation and data augmentation processes. Figure 2 shows the overall workflow behind the creation of SciCap+.

4.1. Mention-paragraph Extraction

Caption: Fig. 7. (a) Speedup of CHEETAH over GAZELLE for computing ReLu. (b) Comparison of communication cost for ReLu.

Mention-paragraph: Fig. 7 plots the speedup and communication cost as a function of the output dimension. Similarly, CHEETAH achieves an outstanding speedup with much smaller communication cost, independent of the output dimension, compared with GAZELLE. …… embedded in diferent modalities, especially in the form of mention-paragraphs and OCR tokens, significantly boosts performance.

In addition to experiments using automatic evaluation metrics, we also performed human generation and evaluation tasks in order to establish the inherent dificulty of scientific figure captioning. The results of the human evaluation reveal three findings: 1. Multimodal knowledge helps models outperform humans in caption generation tasks. 2. Model-generated captions are almost as informative as ground-truth captions: Human evaluators do not prefer either type of caption. 3. Even referring to mention-paragraphs, it is still challenging for humans to write captions that are close to ground truth. To the best of our knowledge, we are the first to pose scientific figure captioning as a multimodel summarization task and show that mention-paragraphs and OCR tokens as context substantially enhance the quality of generated captions.

We first obtained papers in PDF format from Kaggle arXiv

dastaset 1. The reason for using PDFs is that not all papers have source files and some are complicated to parse. 2. Preliminary Study After obtaining PDFs, we used PDFFigures 2.0 [ 4] 2 to extract the body text of each paper. PDFFigure 2.0 is a In the traditional image captioning task, captioning an im- tool that extracts figures, captions, tables, and text from age aims at describing the appearances or natures of rec- scholarly PDFs in computer science. In scholarly docuognized objects and illustrating the relationships between ments, authors label figures with numbers (e.g. Figure recognized objects. Unlike the usual image captioning 1. Fig. 1). For a figure, we used its figure number in a tasks, figures do not contain visual scenes. Instead, the 1https://www.kaggle.com/datasets/Cornell-University/arxiv captions provide interpretations of data presented in fig- 2https://github.com/allenai/pdffigures2 regular expression to locate a paragraph that mentions it.

4.2. OCR Extraction

The SciCap dataset also provides texts extracted from ifgures as metadata, but does not provide location information for each text. To include location information for each text in a figure, we used Google Vision OCR API to extract text tokens from each figure with its coordinates of bounding boxes.

4.3. Data Statistics The splitting of the SciCap dataset is at the figure level.

Therefore, figures from the same paper may appear in diferent splits. This will lead to unfair evaluation, since the information of one figure in one split may coincidentally overlap with the information of another figure. We thus re-split figures at the document level to eliminate this overlapping problem. Hsu et al. [ 1 ] show that text normalization and figure filtering do not improve model performance. Hence, we keep original captions and all ifgures (with/without sub-figures) in the SciCap+ dataset. For a figure, we kept only the first paragraph that mentions it in the body text. Table 1 shows statistics of the SciCap+ dataset. In all three splits, around 90% of the captions are less than 66 words. All figures are graph plots.

Split Training Test Validation

4.4. Dataset Quality Evaluation Before conducting experiments, we conducted human

evaluation of SciCap+ where we checked the mentionparagraphs and OCR tokens extraction quality. The aim was to establish whether the mention-paragraphs and OCR tokens were extracted correctly and relevant to the ifgure and its caption. To this end, we randomly selected 200 figures from the training set and for each figure, we asked two human evaluators to give scores of 1-5 (1 represents no relevance and 5 is highly relevant) for relevance between a caption of a figure and its mentionparagraphs and OCR tokens.

Compared to natural image captioning, human evaluation tasks for the figure captioning domain requires expert knowledge. We recruited two colleagues to carry out this evaluation task. Both of them have Ph.D. degrees in computer science and work as researchers. Their experience implies that they have adequate experience writing ifgure captions.

Figure 3 shows the distributions of the relevance scores.

We can observe that two evaluators gave most of the allows users to specify diverse pre-trained encoders for each modality, which can be fine-tuned or frozen during training. The M4C-captioner itself has = 768 hidden dimension size, = 4 transformer layers and 12 attention heads. We used sentencepiece [7] to obtain a dictionary of 32000 subwords built from both mention-paragraphs and OCR tokens. This is used as the M4C-captioner’s vocabulary. We followed the BERT-BASE hyperparameter setting and trained from scratch.

Regarding the encoders that feed features to M4Ccaptioner, we used pre-trained Resnet-152 as the figure’s Figure 3: Score distribution on correlations between mention- vision encoder. For each figure, we applied a 2D adapparagraph, OCR tokens and figure captions. Both evaluators tive average pooling over outputs from layer 5 to objudged most of the figures with at least moderate correlations tain a global visual feature vector with a dimension of with captions. 2048. Layers 2, 3 and 4 layers were fine-tuned during training. For mention-paragraph features, SciBERT [8] was used to encode3 it into 758-dimensional feature vecifgures (evaluator 1: 64% and evaluator 2: 79.5%) with tors. The number of vectors equals the number of subrelevance scores greater than 3 and a cohen kappa score word tokens in the mention-paragraph, which we limit to of 0.28. This evaluation result indicates that the mention- 192. The mention-paragraph encoder is also fine-tuned paragraphs and OCR tokens have a satisfactory extrac- during training. Finally, for OCR tokens, we use both tion quality and that the annotators considered most of text and visual features. We selected FastText [9] as the them as relevant to the figure and its caption. However, word encoder and Pyramidal Histogram of Characters the two annotators seem to have a relatively lower agree- (PHOC) [10] as the character encoder. Regarding the ment (0.28) regarding which figures and captions are visual feature encoder of OCR tokens, we first extracted relevant to their mention-paragraphs and OCR tokens. Faster R-CNN fc6 features and then applied fc7 weights We attribute this to the fact that evaluations of figure to it to obtain 2048-dimensional appearance features for captions are highly subjective. bounding boxes of OCR tokens. The fc7 weights were ifne-tuned during training. We kept a maximum of 95 OCR tokens per figure. 5. Experiments We trained a model on a GPU server with 8 Nvidia Tesla V100 GPUs. Training a model with a complete set of We conduct experiments using SciCap+ to empirically features took 13 hours. During training, we used a batch prove that scientific figure captioning is inherently a size of 128. We selected CIDEr as the evaluation metric. knowledge-augmented task and benefits from knowledge The evaluation interval is every 2000 iterations, we stop coming from both text and vision modalities. training if CIDEr score does not improve for 4 evaluation intervals. The optimizer is Adam with a learning rate of 5.1. Figure Captioning Model 0.001 and = 1.0 E−08. We also used a multistep learning rate schedule with warmup iterations of 1000 and a warmup factor of 0.2. We kept the maximum number of decoding steps at the decoding time as 67. For evaluation, we used five standard metrics for evaluating image captions: BLEU-4 [11], METEOR [12], ROUGE-L [13], CIDEr [14] and SPICE [15]. Since figure captions contain scientific terms which can be seen as uncommon words, among all five metrics, we are particularly interested in CIDEr since it emphasizes them.

We used M4C-Captioner [2] as the baseline model to

study the scientific figure captioning task. The M4CCaptioner is based on Multimodal Multi-Copy Mesh (M4C) [5] that jointly learns representations across input modalities. To solve the out-of-vocabulary problem during caption generation, it is equipped with a pointer network that picks up text from OCR tokens or a predeifned fixed dictionary. In this work, 3 input features are used, figure, mention-paragraphs and OCR tokens fed to encoders, the output representations of which are fed to the M4C-Captioner.

5.2. Implementation and Training Our implementation of M4C-Captioner is based on the

MMF framework [6] and Pytorch. The implementation

3We only used the first 3 layers of SciBERT for lightweightness.

Model 1. M4C-Captioner (Figure Only ) 2. M4C-Captioner (Mention Only) 3. M4C-Captioner (Figure and OCR features) 4. M4C-Captioner (Mention, Figure and OCR features) Ablation Study on Figures 5. M4C-Captioner (Mention and OCR features) Ablation Study on OCR features 6. M4C-Captioner (Mention, Figure and w/o OCR features ) 7. M4C-Captioner (Mention, Figure and OCR spatial features) 8. M4C-Captioner (Mention, Figure and OCR (w/o spatial features) features ) 9. M4C-Captioner (Mention, Figure and OCR (w/o visual features) features ) 6.3 6.4 5.8 6.4 6.2

6. Results

noise for the model. This is likely because the Resnet-152 visual encoder we used was not trained on figures. 6.1. Main Result We enriched the representations of the OCR features by adding text, visual, and spatial features. Ablation studThe experimental results in table 2 demonstrate that us- ies aim to reveal impacts of each OCR token feature. All ing the mention-paragraph and OCR tokens significantly comparisons are with row #4 even though row #5 gives improves scores on all five metrics compared to the figure- slightly better scores. With OCR features completely only baseline. The experimental results align with our removed in row #6, the CIDEr scores decrease by 5.3. hypothesis and preliminary study that scientific figure Using only OCR spatial features in row #7, the CIDEr captioning is a knowledge-augmented image captioning score dropped by 7.8. Removing OCR spatial features in task, OCR tokens and knowledge embedded in mention- row #8, the CIDEr scores dropped by 1.2. Upon removal paragraphs help in composing informative captions. of OCR visual features in row #9, the CIDEr score is close

We established a baseline M4C-Captioner (Figure only) to removing spatial features. with figures as the only input modality to the M4C- The above ablation study indicates that the enriched Captioner model in row #1. This baseline is in the non- OCR contributes to the informativeness of generated capknowledge setting. Therefore, low scores in all metrics tions. Unlike OCR features, where appearance features show that the model needs knowledge of other modal- are helpful to the model, removing visual features of ities. Using the mention only in row #2 shows that the ifgures increases CIDEr scores, further indicating that mention certainly contains a lot of useful information, as we need a specific vision encoder for figures to provide evidenced by the increase in performance. When OCR meaningful features. features are added to the figure input in row #3, scores for all metrics have significant gains compared to the ifgure-only baseline, but are still weaker than when only 7. Human Evaluation mentions are used. This motivates the combination of mentions and OCR features and in row #4, compared to Having established that knowledge helps a model perthe figure-only baseline and figure-OCR-only baseline, form figure captioning, we conducted some human evalthe performance further improves. Perhaps the most uation activities to determine their subjective quality. interesting result is in row #5 where we only use the We conducted human caption generation and evaluamentions and OCR features but not the figure and get tion tasks. The human generation task is to examine the best performance, particularly for SPICE and CIDEr, whether humans can write better captions than models. albeit comparable to when the figure is included in row The evaluation task is the appropriateness evaluation #4. All these results indicate that explicitly extracted task, which consists of evaluating how appropriate the multimodal knowledge helps to compose informative model-generated captions are versus ground-truth capcaptions. tions. Both tasks were performed by the same human subjects for the quality assessment of the data set.

6.2. Ablation Studies We first performed an ablation study on figures by removing visual feature vectors, the CIDEr score increases slightly, indicating that the visual feature is more like 7.1. Figure Caption Generation Task The figure caption generation task is to generate captions under two conditions separately: 1. Figure-only: Human

Annotator 1. Annotator 1 2. Annotator 2 3. M4C-Captioner 4. Annotator 1 5. Annotator 2 6. M4C-Captioner

Inputs Figure-only

Figure-only Image and OCR features

Figure-Mention

Figure-Mention Mention, Figure and OCR features annotators write captions given only figures. This is to Even given mention-paragraphs, our annotator wrote compare with captions generated by M4C-Captioner that captions with low scores across all standard image caponly has access to figures and OCR features. 2. Figure- tioning evaluation metrics. We ascribe it as figure Mention: Human annotators write captions given both captions are highly subjective and require in-domain ifgures and their mention-paragraphs. We randomly knowledge to write. Although our annotators are reselected 100 figures from the test set and to compare searchers, they cannot be professional in all knowlhuman-generated captions with captions generated by edge existing in the computer science domain. Granted M4C-Captioner. mention-paragraphs and OCR tokens as external knowl

The table 3 shows automatic evaluation results for hu- edge sources, and with large-amount data training, the man caption generation tasks. Given only figures (rows model can significantly outperform humans. #1, 2), both annotators got low scores across all metrics, among those, annotator 2 led all metrics except SPICE. 7.2. Appropriateness Evaluation Since humans perform OCR naturally with their eyes we compare with M4C-captioner (Figure and OCR features). This task evaluates the appropriateness of modelIt has the best SPICE score, although it outperformed an- generated and ground-truth captions. We used the same notator 1 in 4 of 5 evaluation metrics, it achieved similar set of 100 figures as in the figure caption generation task, performance compared with annotator 2. This shows and placed ground-truth captions and model-generated that without additional knowledge, humans aren’t that captions in random order. Then, human evaluators rank better than machines. each caption to give appropriateness scores (1-4) to each

However, given mention-paragraphs and figures (rows caption. The evaluation scale: 1. Inappropriate: a caption #4, 5), compared to the figure-only condition, both anno- does not match the figure, is not a sentence, is wrong, or tators got improved scores in BLEU-4, METEOR, ROUGE- is misleading. 2. Not sure: It is impossible to judge approL, and SPICE but lower scores in CIDEr. Previous studies priateness solely from the figure. 3. Possible: A possible have shown that CIDEr is more reliable as an evaluation candidate that is incomplete but not wrong. 4. Apprometric for caption generation, and the lowered CIDEr priate: An informative caption that interprets the figure scores indicates that humans are likely to struggle with well. Since an appropriate figure caption should stand additional knowledge. On the other hand, having access alone and readers should understand the messages the to full features, M4C-captioner gained a significantly bet- ifgure wants to represent without referring to the body ter CIDEr score compared to human annotators. The text, we do not show mention-paragraphs to evaluators. automatic evaluation results of the human generation Table 4 shows the results of the evaluations. Two evaltasks show the steep dificulty in writing figure captions uators gave low average scores to both model-generated close to ground truth. captions and ground-truth captions. In addition, eval9. Conclusion uators only reached fair agreements on scoring (0.230.36). Using the mention and OCR features (row #2), gets the best human evaluation scores and this is in line In this paper, we study the challenges of the scienwith the corresponding score in Table 2 where it also tific figure captioning task. Extending from the preachieves the best CIDEr performance, indicating that hu- vious study [ 1 ], we reframe this task as a knowledgeman evaluation is reliable despite the fair agreements. augmented image captioning task, that is, a model needs The evaluation results indicate that the model-generated to use knowledge extracted across modalities to generand ground-truth captions are not always informative ate captions. To this end, we released a new version of to both evaluators, which reveals the need to improve the SciCap dataset: SciCap+ by augmenting figures with caption writing quality and model performance. We ob- their mention-paragraphs and OCR tokens. We used served that captions tend to be written without following M4C-Captioner model as the baseline model to utilize specific rules, and this may contribute to lack of agree- knowledge across three modalities: mention-paragraphs, ment. With low inter-rater agreements, we found how ifgures, and OCR tokens. The automatic evaluation exinformative a figure caption is highly subjective and de- periments further reveal that using knowledge signifpends on in-domain background knowledge evaluators icantly improves evaluation metric scores. Compared have. with human-generated captions, we found models can generate better captions than humans regarding the automatic evaluation metrics. However, human evaluations 8. Related Work demonstrated that writing scientific figure captioning is challenging even for humans, and the model-generated ifgure captions, despite their reasonable automatic evaluation quality, are still far from achieving a level appropriate for humans. The release of the SciCap+ dataset is to promote the further development of scientific figure captioning. For future work, we are interested in how to use multimodal pretraining strategies in this task.

Unlike natural image captioning, figure captioning has

been scarcely studied in history. SciCap [ 1 ] is the most recent work on scientific figure captioning, they released a large-scale scientific figure captioning dataset that includes figures from academic papers in arXiv dataset. Before SciCap, FigCAP [16] [17] and FigureQA [18] are two figure captioning datasets, but their figures are synthesized. We decided to extend and study on SciCap dataset, since its figures are from real-world scientific papers. In this paper, we also have leveraged multimodal knowledge using pre-trained models.

Multimodal machine learning is to model knowledge across various modalities. The closest multimodal task to figure captioning is image captioning, a popular architecture is encode-decoder, where the decoder learns to generate captions conditioned on visual features extracted from the encoder. Recent works on integrating texts in natural images for visual question answering and image captioning tasks are based on transformer architecture augmented with a pointer network [5, 19]. The transformer enriches representations by integrating knowledge from both text and visual modality. The pointer network dynamically selects words from the fixed dictionary or OCR tokens during generation.

Using knowledge embedded in pre-trained models is a common practice in solving multimodal tasks. In this work, we used SciBert [8], a BERT model [20] that was pre-trained in scientific papers, to obtain informative representations for the texts extracted from computer science papers. Since terms that exist in the figures may be uncommon words, we also used FastText [21] to obtain word embeddings with subword information. For visual modality, we used Renst152 [22] and Faster R-CNN [23] used in extract features from images and bounding boxes. 10. Acknowledgment

These research results were partly obtained from the

commissioned research (No. 225) by National Institute of Information and Communications Technology (NICT), Japan, and partly obtained from the first author’s internship research under NICT. [4] C. Clark, S. Divvala, Pdfigures 2.0: Mining figures uation of summaries, in: Text summarization from research papers, in: 2016 IEEE/ACM Joint branches out, 2004, pp. 74–81.

Conference on Digital Libraries (JCDL), IEEE, 2016, [14] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: pp. 143–152. Consensus-based image description evaluation, in: [5] R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative Proceedings of the IEEE conference on computer answer prediction with pointer-augmented multi- vision and pattern recognition, 2015, pp. 4566–4575. modal transformers for textvqa, in: Proceedings [15] P. Anderson, B. Fernando, M. Johnson, S. Gould, of the IEEE Conference on Computer Vision and Spice: Semantic propositional image caption evaluPattern Recognition, 2020. ation, in: European conference on computer vision, [6] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, Springer, 2016, pp. 382–398.

X. Chen, M. Shah, M. Rohrbach, D. Batra, [16] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, T. Yu, D. Parikh, Mmf: A multimodal framework for vi- R. Rossi, R. Bunescu, Figure captioning with reasion and language research, https://github.com/ soning and sequence-level training, arXiv preprint facebookresearch/mmf, 2020. arXiv:1906.02850 (2019). [7] T. Kudo, J. Richardson, SentencePiece: A sim- [17] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, R. Rossi, ple and language independent subword tokenizer Figure captioning with relation maps for reasoning, and detokenizer for neural text processing, in: in: Proceedings of the IEEE/CVF Winter ConferProceedings of the 2018 Conference on Empirical ence on Applications of Computer Vision (WACV), Methods in Natural Language Processing: System 2020.

Demonstrations, Association for Computational [18] S. E. Kahou, V. Michalski, A. Atkinson, Á. Kádár, Linguistics, Brussels, Belgium, 2018, pp. 66–71. URL: A. Trischler, Y. Bengio, Figureqa: An annotated https://aclanthology.org/D18-2012. doi:10.18653/ ifgure dataset for visual reasoning, arXiv preprint v1/D18-2012. arXiv:1710.07300 (2017). [8] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained [19] O. Sidorov, R. Hu, M. Rohrbach, A. Singh, Textcaps: language model for scientific text, in: Proceed- a dataset for image captioningwith reading compreings of the 2019 Conference on Empirical Meth- hension, 2020. ods in Natural Language Processing and the 9th [20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: International Joint Conference on Natural Lan- Pre-training of deep bidirectional transformers for guage Processing (EMNLP-IJCNLP), Association language understanding, in: Proceedings of the for Computational Linguistics, Hong Kong, China, 2019 Conference of the North American Chap2019, pp. 3615–3620. URL: https://aclanthology.org/ ter of the Association for Computational LinguisD19-1371. doi:10.18653/v1/D19-1371. tics: Human Language Technologies, Volume 1 [9] J. Almazán, A. Gordo, A. Fornés, E. Valveny, Word (Long and Short Papers), Association for Comspotting and recognition with embedded attributes, putational Linguistics, Minneapolis, Minnesota, IEEE transactions on pattern analysis and machine 2019, pp. 4171–4186. URL: https://aclanthology.org/ intelligence 36 (2014) 2552–2566. N19-1423. doi:10.18653/v1/N19-1423. [10] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, En- [21] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information., riching word vectors with subword information, TACL 5 (2017) 135–146. URL: http://dblp.uni-trier. Transactions of the Association for Computational de/db/journals/tacl/tacl5.html#BojanowskiGJM17. Linguistics 5 (2017) 135–146. [11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: [22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learna method for automatic evaluation of machine ing for image recognition, in: Proceedings of the translation, in: Proceedings of the 40th Annual IEEE conference on computer vision and pattern Meeting of the Association for Computational Lin- recognition, 2016, pp. 770–778. guistics, Association for Computational Linguis- [23] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: tics, Philadelphia, Pennsylvania, USA, 2002, pp. Towards real-time object detection with region 311–318. URL: https://aclanthology.org/P02-1040. proposal networks, in: C. Cortes, N. Lawrence, doi:10.3115/1073083.1073135. D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances [12] S. Banerjee, A. Lavie, Meteor: An automatic met- in Neural Information Processing Systems, ric for mt evaluation with improved correlation volume 28, Curran Associates, Inc., 2015. URL: with human judgments, in: Proceedings of the https://proceedings.neurips.cc/paper/2015/file/ acl workshop on intrinsic and extrinsic evaluation 14bfa6bb14875e45bba028a21ed38046-Paper.pdf. measures for machine translation and/or summarization, 2005, pp. 65–72. [13] C.-Y. Lin, Rouge: A package for automatic eval

[1]

T.-Y.

Hsu ,

C. L.

Giles , T.-H. Huang, SciCap: Generating captions for scientific figures , in: Findings of the Association for Computational Linguistics: EMNLP 2021 , Association for Computational Linguistics , Punta Cana, Dominican Republic, 2021 , pp. 3258 - 3264 . URL: https://aclanthology.org/ 2021 .findings-emnlp. 277 . doi: 10 .18653/v1/ 2021 . findings- emnlp.277.

[2]

Sidorov ,

Hu ,

Rohrbach ,

Singh , Textcaps: a dataset for image captioning with reading comprehension , in: European conference on computer vision , Springer, 2020 , pp. 742 - 758 .

[3]

Zhang ,

Wang ,

Xin , H. Wu, Cheetah: An ultra-fast, approximation-free, and privacypreserved neural network framework based on joint obscure linear and nonlinear computations , arXiv preprint arXiv: 1911 . 05184 ( 2019 ).