=Paper=
{{Paper
|id=Vol-3656/paper13
|storemode=property
|title=SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning
|pdfUrl=https://ceur-ws.org/Vol-3656/paper13.pdf
|volume=Vol-3656
|authors=Zhishen Yang,Raj Dabre,Hideki Tanaka,Naoaki Okazaki
|dblpUrl=https://dblp.org/rec/conf/aaai/YangDTO23
}}
==SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning==
Zhishen Yang 1,∗, Raj Dabre 2, Hideki Tanaka 2 and Naoaki Okazaki 1

1 Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan
2 National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan

SDU@AAAI-23: The Third AAAI Workshop on Scientific Document Understanding, 2023
∗ Corresponding author. Emails: zhishen.yang@nlp.c.titech.ac.jp (Z. Yang), raj.dabre@nict.go.jp (R. Dabre), hideki.tanaka@nict.go.jp (H. Tanaka), okazaki@c.titech.ac.jp (N. Okazaki)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract

In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating figure caption generation helps move model understanding of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific figure captioning as a knowledge-augmented image captioning task in which models need to utilize knowledge embedded across modalities for caption generation. To this end, we extended the large-scale SciCap dataset [1] to SciCap+, which includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. We then conducted experiments with M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention-paragraphs serve as additional context knowledge, significantly boosting standard automatic image captioning evaluation scores compared to the figure-only baselines. Human evaluations further reveal the challenges of generating figure captions that are informative to readers. The code and SciCap+ dataset are publicly available at https://github.com/ZhishenYang/scientific_figure_captioning_dataset

Keywords
Figure Captioning, Multimodal Machine Learning, Scientific Document Understanding
1. Introduction

Scholarly documents are the primary source for sharing scientific knowledge. These documents are available in various formats, such as journal articles, book chapters, and conference proceedings. A significant portion of these documents is text, and together with figures and tables, the text helps communicate knowledge to readers. Figures provide visual representations of complex information that facilitate sharing scientific findings with readers efficiently and straightforwardly. The standard practice in scientific writing is to write a caption for each figure, accompanied by paragraphs with detailed explanations. Figures and captions should be standalone, and readers should be able to understand the figures without referring to the main text. Helping authors write appropriate and informative captions for figures will improve the quality of scientific documents, thereby enhancing the speed and quality of scientific communication. In this study, we focus on automating the generation of captions for figures in scientific papers.

Scientific figure captioning is a variant of the image captioning task. However, while it shares the goal of generating a caption, it has two unique challenges: 1. Figures are not natural images: in contrast to natural images, the visual objects in scientific figures are texts and data points. 2. Figure captions should explain: instead of simply identifying objects and texts in the figures, the caption should contain the analysis that the authors intend to present and should highlight findings.

A previous study [1], SciCap, defines the scientific figure captioning task as a figure-to-caption task: a model generates captions by referring only to figures. Their work reported relatively low scores as measured by automatic evaluation metrics, indicating that there is considerable room for improvement. Intuitively, writing appropriate figure captions without sufficient background knowledge is difficult, since even humans will struggle to interpret a figure and write a caption unless some background knowledge is available. On the basis of this observation, we think that generating appropriate captions is infeasible without adding context knowledge to the caption generation model. This context comes in two forms: background knowledge from the running text and the OCR tokens in the figure, both of which should help reduce the burden on the captioning model. To this end, we augment the existing large-scale scientific figure captioning dataset, SciCap, with mention-paragraphs and OCR tokens and call the resultant dataset SciCap+. We then pose scientific figure captioning as a multimodal summarization task and use the M4C-Captioner model [2] (a model that utilizes multimodal knowledge to generate captions) as a baseline to study the scientific figure captioning task. The experimental results of automatic evaluation demonstrate that using knowledge embedded in different modalities, especially in the form of mention-paragraphs and OCR tokens, significantly boosts performance.

In addition to experiments using automatic evaluation metrics, we also performed human generation and evaluation tasks in order to establish the inherent difficulty of scientific figure captioning. The results of the human evaluation reveal three findings: 1. Multimodal knowledge helps models outperform humans in caption generation tasks. 2. Model-generated captions are almost as informative as ground-truth captions: human evaluators do not prefer either type of caption. 3. Even when referring to mention-paragraphs, it is still challenging for humans to write captions that are close to the ground truth. To the best of our knowledge, we are the first to pose scientific figure captioning as a multimodal summarization task and to show that mention-paragraphs and OCR tokens as context substantially enhance the quality of generated captions.

2. Preliminary Study

In the traditional image captioning task, captioning an image aims at describing the appearances or natures of recognized objects and illustrating the relationships between them. Unlike the usual image captioning tasks, figures do not contain visual scenes. Instead, figure captions provide interpretations of the data presented in figures to highlight the scientific findings that the authors want to present to readers. Because of this unique characteristic, it is extremely challenging for a human to interpret a figure properly without referring to the mention-paragraphs, which usually refer to the figure, because the reader may lack background knowledge of the domain or context of the figure. As Figure 1 shows, by only looking at the figure, we do not know what "comm.(KB)" stands for; lacking that knowledge, writing an informative caption is challenging. However, the mention-paragraph contains "communication cost", which is also present in the caption, indicating that such background knowledge should help in writing accurate captions.

Figure 1: An example figure [3] with its caption, mention-paragraph, and the text tokens recognized via OCR. Without referring to the mention-paragraph and the OCR tokens that tie the figure and the mention together, we cannot properly interpret the data presented in the figure, namely the communication cost comparison and the speedup of CHEETAH over GAZELLE. Caption: "Fig. 7. (a) Speedup of CHEETAH over GAZELLE for computing ReLu. (b) Comparison of communication cost for ReLu." Mention-paragraph: "Fig. 7 plots the speedup and communication cost as a function of the output dimension. Similarly, CHEETAH achieves an outstanding speedup with much smaller communication cost, independent of the output dimension, compared with GAZELLE. …"

3. Problem Formulation

The previous study [1] defined this task as an image captioning task: given a figure I, the model generates a caption C = [c_0, c_1, ..., c_N]. We instead reframe scientific figure captioning as a knowledge-augmented image captioning task that requires knowledge extracted from the text and vision modalities. For a figure, we define a paragraph that mentions it (the mention-paragraph) and the text within the figure, extracted via OCR, as the text modality. The figure itself and the visual appearances of the OCR texts are the visual modality. Given a scientific figure I and knowledge extracted from the text and vision modalities, K_text and K_vision, we define the figure caption generation task as modeling P(C | I, K_text, K_vision).
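To make the formulation concrete, the sketch below shows how a single SciCap+ training instance could be represented under this framing. The field names and types are illustrative assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OCRToken:
    text: str                                     # token string recognized inside the figure
    box: Tuple[float, float, float, float]        # bounding box (x_min, y_min, x_max, y_max)

@dataclass
class SciCapPlusInstance:
    figure_path: str           # path to the figure image I
    mention_paragraph: str     # K_text: body-text paragraph that mentions the figure
    ocr_tokens: List[OCRToken] # OCR text (K_text) plus its location/appearance (K_vision)
    caption: str               # target caption C = [c_0, ..., c_N]

# A captioning model approximates P(C | I, K_text, K_vision) by conditioning its
# decoder on encodings of the figure, the mention-paragraph, and the OCR tokens.
```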
4. SciCap+ Dataset

SciCap is a large-scale figure-caption dataset comprising graph plots extracted from ten years of arXiv computer science papers. We used around 414k figures from SciCap and augmented each figure with its mention-paragraphs and OCR tokens with metadata. This section details the dataset creation and data augmentation processes. Figure 2 shows the overall workflow behind the creation of SciCap+.

Figure 2: The overall workflow of the data augmentation for creating the SciCap+ dataset. For each figure in SciCap+, we extracted its mention-paragraphs and OCR tokens (OCR texts and bounding boxes).

4.1. Mention-paragraph Extraction

We first obtained papers in PDF format from the Kaggle arXiv dataset (https://www.kaggle.com/datasets/Cornell-University/arxiv). The reason for using PDFs is that not all papers have source files, and some source files are complicated to parse. After obtaining the PDFs, we used PDFFigures 2.0 [4] (https://github.com/allenai/pdffigures2) to extract the body text of each paper. PDFFigures 2.0 is a tool that extracts figures, captions, tables, and text from scholarly PDFs in computer science. In scholarly documents, authors label figures with numbers (e.g., "Figure 1", "Fig. 1"). For each figure, we used its figure number in a regular expression to locate a paragraph that mentions it.
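A minimal sketch of such a mention-paragraph locator is shown below. The exact regular expression and paragraph segmentation used to build SciCap+ are not specified in the paper, so the pattern and the blank-line paragraph split here are illustrative assumptions.

```python
import re
from typing import List, Optional

def find_mention_paragraph(paragraphs: List[str], figure_number: int) -> Optional[str]:
    """Return the first body-text paragraph that mentions the given figure number.

    Matches common labels such as "Figure 3", "Fig. 3", or "Fig 3", while the
    trailing word boundary avoids partial matches like "Figure 30".
    """
    pattern = re.compile(rf"\b(?:Figure|Fig\.?)\s*{figure_number}\b", re.IGNORECASE)
    for paragraph in paragraphs:
        if pattern.search(paragraph):
            return paragraph
    return None

# Illustrative usage: split the extracted body text into paragraphs on blank lines.
body_text = "...\n\nFig. 7 plots the speedup and communication cost...\n\n..."
paragraphs = [p.strip() for p in body_text.split("\n\n") if p.strip()]
mention = find_mention_paragraph(paragraphs, figure_number=7)
```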
4.2. OCR Extraction

The SciCap dataset also provides texts extracted from figures as metadata, but it does not provide location information for each text. To include location information for each text in a figure, we used the Google Vision OCR API to extract text tokens from each figure together with the coordinates of their bounding boxes.
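A minimal sketch of this extraction step is shown below, assuming the google-cloud-vision Python client with application credentials already configured. The output format (token text plus polygon vertices) is an illustrative assumption rather than the dataset's exact schema.

```python
from google.cloud import vision

def extract_ocr_tokens(figure_path: str):
    """Run OCR on one figure and return token texts with bounding-box vertices."""
    client = vision.ImageAnnotatorClient()
    with open(figure_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)

    tokens = []
    # Index 0 holds the entire detected text block; the remaining entries are tokens.
    for annotation in response.text_annotations[1:]:
        vertices = [(v.x, v.y) for v in annotation.bounding_poly.vertices]
        tokens.append({"text": annotation.description, "box": vertices})
    return tokens
```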
4.3. Data Statistics

The original SciCap dataset is split at the figure level. Therefore, figures from the same paper may appear in different splits. This can lead to unfair evaluation, since the information of one figure in one split may coincidentally overlap with the information of another figure. We thus re-split the figures at the document level to eliminate this overlap problem. Hsu et al. [1] show that text normalization and figure filtering do not improve model performance. Hence, we keep the original captions and all figures (with or without sub-figures) in the SciCap+ dataset. For each figure, we kept only the first paragraph that mentions it in the body text. Table 1 shows statistics of the SciCap+ dataset. In all three splits, around 90% of the captions are shorter than 66 words. All figures are graph plots.

| Split | Figures | Words |
|---|---|---|
| Training | 394,005 | 12,336,511 |
| Test | 10,336 | 323,382 |
| Validation | 10,468 | 329,072 |

Table 1: Statistics of the SciCap+ dataset.
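The document-level re-split described above can be implemented by grouping figures by their source paper before assigning splits, so that no paper contributes figures to more than one split. A minimal sketch follows; the grouping key (an arXiv paper identifier), the split ratios, and the random seed are illustrative assumptions, not the authors' exact procedure.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

def split_by_document(figures: List[dict], seed: int = 0,
                      ratios: Tuple[float, float] = (0.95, 0.025)) -> Dict[str, List[dict]]:
    """Assign every figure of a paper to the same split to avoid information leakage."""
    by_paper = defaultdict(list)
    for fig in figures:
        by_paper[fig["paper_id"]].append(fig)   # e.g. the arXiv identifier of the source paper

    paper_ids = sorted(by_paper)
    random.Random(seed).shuffle(paper_ids)

    n_train = int(len(paper_ids) * ratios[0])
    n_val = int(len(paper_ids) * ratios[1])
    split_ids = {
        "train": paper_ids[:n_train],
        "val": paper_ids[n_train:n_train + n_val],
        "test": paper_ids[n_train + n_val:],
    }
    return {name: [f for pid in ids for f in by_paper[pid]] for name, ids in split_ids.items()}
```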
4.4. Dataset Quality Evaluation

Before conducting experiments, we carried out a human evaluation of SciCap+ in which we checked the quality of the mention-paragraph and OCR token extraction. The aim was to establish whether the mention-paragraphs and OCR tokens were extracted correctly and were relevant to the figure and its caption. To this end, we randomly selected 200 figures from the training set and, for each figure, asked two human evaluators to give scores of 1-5 (1 represents no relevance and 5 is highly relevant) for the relevance between a figure's caption and its mention-paragraphs and OCR tokens.

Compared to natural image captioning, human evaluation tasks in the figure captioning domain require expert knowledge. We recruited two colleagues to carry out this evaluation task. Both hold Ph.D. degrees in computer science and work as researchers, so they have adequate experience writing figure captions.

Figure 3 shows the distributions of the relevance scores. We can observe that the two evaluators gave most of the figures (evaluator 1: 64% and evaluator 2: 79.5%) relevance scores greater than 3, with a Cohen's kappa score of 0.28. This evaluation result indicates that the mention-paragraphs and OCR tokens have satisfactory extraction quality and that the annotators considered most of them relevant to the figure and its caption. However, the two annotators seem to have relatively low agreement (0.28) regarding which figures and captions are relevant to their mention-paragraphs and OCR tokens. We attribute this to the fact that evaluations of figure captions are highly subjective.

Figure 3: Score distribution of the relevance between mention-paragraphs, OCR tokens, and figure captions. Both evaluators judged most of the figures to have at least moderate relevance to their captions.
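Inter-annotator agreement of this kind can be computed with scikit-learn's Cohen's kappa. The sketch below is minimal and assumes the two evaluators' 1-5 relevance scores are stored as parallel lists; the values shown are toy data, not the study's actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# 1-5 relevance scores for the same figures, one list per evaluator (toy values).
evaluator_1 = [4, 3, 5, 2, 4, 1, 3, 5]
evaluator_2 = [4, 4, 5, 1, 3, 2, 3, 4]

kappa = cohen_kappa_score(evaluator_1, evaluator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # the paper reports 0.28 over the full 200-figure sample
```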
5. Experiments

We conduct experiments using SciCap+ to empirically show that scientific figure captioning is inherently a knowledge-augmented task and benefits from knowledge coming from both the text and vision modalities.

5.1. Figure Captioning Model

We used M4C-Captioner [2] as the baseline model to study the scientific figure captioning task. M4C-Captioner is based on the Multimodal Multi-Copy Mesh (M4C) [5], which jointly learns representations across input modalities. To address the out-of-vocabulary problem during caption generation, it is equipped with a pointer network that picks up text from the OCR tokens or from a predefined fixed dictionary. In this work, three input features are used: the figure, the mention-paragraphs, and the OCR tokens. Each is fed to an encoder, and the output representations are fed to the M4C-Captioner.
5.2. Implementation and Training

Our implementation of M4C-Captioner is based on the MMF framework [6] and PyTorch. The framework allows users to specify diverse pre-trained encoders for each modality, which can be fine-tuned or frozen during training. The M4C-Captioner itself has a hidden dimension of D = 768, K = 4 transformer layers, and 12 attention heads. We used SentencePiece [7] to obtain a dictionary of 32,000 subwords built from both the mention-paragraphs and the OCR tokens, which serves as the M4C-Captioner's vocabulary. We followed the BERT-BASE hyperparameter setting and trained from scratch.

Regarding the encoders that feed features to the M4C-Captioner, we used a pre-trained ResNet-152 as the figure's vision encoder. For each figure, we applied 2D adaptive average pooling over the outputs of layer 5 to obtain a global visual feature vector with a dimension of 2048. Layers 2, 3, and 4 were fine-tuned during training. For the mention-paragraph features, SciBERT [8] was used to encode the mention-paragraph into 768-dimensional feature vectors (we used only the first three layers of SciBERT to keep the encoder lightweight). The number of vectors equals the number of subword tokens in the mention-paragraph, which we limit to 192. The mention-paragraph encoder is also fine-tuned during training. Finally, for the OCR tokens, we use both text and visual features. We selected FastText [9] as the word encoder and Pyramidal Histogram of Characters (PHOC) [10] as the character encoder. As the visual feature encoder of the OCR tokens, we first extracted Faster R-CNN fc6 features and then applied the fc7 weights to them to obtain 2048-dimensional appearance features for the bounding boxes of the OCR tokens. The fc7 weights were fine-tuned during training. We kept a maximum of 95 OCR tokens per figure.
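As an illustration of the figure encoder described above, the sketch below extracts a 2048-dimensional global feature from a figure using torchvision's ResNet-152 and 2D adaptive average pooling. Which ResNet stage corresponds to "layer 5", as well as the image preprocessing, are assumptions for the sketch, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Pre-trained ResNet-152 without its pooling and classification head.
resnet = models.resnet152(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-2])  # output: (B, 2048, H', W')
pool = nn.AdaptiveAvgPool2d((1, 1))                       # 2D adaptive average pooling

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def figure_global_feature(image_path: str) -> torch.Tensor:
    """Return a (2048,) global visual feature vector for one figure."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feature_map = backbone(image)   # (1, 2048, 7, 7) for a 224x224 input
        pooled = pool(feature_map)      # (1, 2048, 1, 1)
    return pooled.flatten()             # (2048,)
```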
We trained the model on a GPU server with 8 Nvidia Tesla V100 GPUs. Training a model with the complete set of features took 13 hours. During training, we used a batch size of 128 and selected CIDEr as the model selection metric: the model was evaluated every 2,000 iterations, and we stopped training if the CIDEr score did not improve for 4 consecutive evaluation intervals. The optimizer is Adam with a learning rate of 0.001 and ε = 1.0e-08. We also used a multi-step learning rate schedule with 1,000 warmup iterations and a warmup factor of 0.2. We kept the maximum number of decoding steps at inference time at 67. For evaluation, we used five standard metrics for image captioning: BLEU-4 [11], METEOR [12], ROUGE-L [13], CIDEr [14], and SPICE [15]. Since figure captions contain scientific terms, which can be seen as uncommon words, among the five metrics we are particularly interested in CIDEr because it emphasizes such words.
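These metrics can be computed with the pycocoevalcap toolkit commonly used in image captioning work. The sketch below is a minimal example under the assumptions that captions are already lowercased and tokenized and that one reference caption is available per figure; it omits METEOR and SPICE, which additionally require a Java runtime.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# One reference caption and one generated caption per figure id (toy example).
references = {"fig_001": ["comparison of communication cost for relu"]}
hypotheses = {"fig_001": ["communication cost comparison for relu"]}

bleu_scores, _ = Bleu(4).compute_score(references, hypotheses)   # [BLEU-1, ..., BLEU-4]
rouge_l, _ = Rouge().compute_score(references, hypotheses)
cider, _ = Cider().compute_score(references, hypotheses)

print(f"BLEU-4: {bleu_scores[3]:.3f}  ROUGE-L: {rouge_l:.3f}  CIDEr: {cider:.3f}")
```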
| # | Model | BLEU-4 | METEOR | ROUGE-L | SPICE | CIDEr |
|---|-------|--------|--------|---------|-------|-------|
| 1 | M4C-Captioner (Figure only) | 1.5 | 5.6 | 15.4 | 4.3 | 4.6 |
| 2 | M4C-Captioner (Mention only) | 5.3 | 11.0 | 27.4 | 14.3 | 49.0 |
| 3 | M4C-Captioner (Figure and OCR features) | 2.6 | 7.6 | 20.5 | 10.1 | 22.2 |
| 4 | M4C-Captioner (Mention, Figure and OCR features) | 6.3 | 12.0 | 29.2 | 15.8 | 55.8 |
| | Ablation study on figures | | | | | |
| 5 | M4C-Captioner (Mention and OCR features) | 6.3 | 12.0 | 29.3 | 16.1 | 56.4 |
| | Ablation study on OCR features | | | | | |
| 6 | M4C-Captioner (Mention and Figure, w/o OCR features) | 6.4 | 11.5 | 27.9 | 14.6 | 50.5 |
| 7 | M4C-Captioner (Mention, Figure and OCR spatial features only) | 5.8 | 11.1 | 27.3 | 14.1 | 48.0 |
| 8 | M4C-Captioner (Mention, Figure and OCR features w/o spatial features) | 6.4 | 12.0 | 29.1 | 15.7 | 54.6 |
| 9 | M4C-Captioner (Mention, Figure and OCR features w/o visual features) | 6.2 | 11.9 | 28.9 | 15.6 | 54.1 |

Table 2: Automatic evaluation scores of M4C-Captioner on the SciCap+ dataset. Aggregating knowledge from the text and vision modalities significantly boosts model performance compared to the figure-only baseline.
6. Results

6.1. Main Result

The experimental results in Table 2 demonstrate that using the mention-paragraph and OCR tokens significantly improves scores on all five metrics compared to the figure-only baseline. The experimental results align with our hypothesis and preliminary study that scientific figure captioning is a knowledge-augmented image captioning task: the OCR tokens and the knowledge embedded in mention-paragraphs help in composing informative captions.

We established a baseline, M4C-Captioner (Figure only), with figures as the only input modality to the M4C-Captioner model in row #1. This baseline is in the non-knowledge setting, and its low scores on all metrics show that the model needs knowledge from other modalities. Using the mention only in row #2 shows that the mention certainly contains a lot of useful information, as evidenced by the increase in performance. When OCR features are added to the figure input in row #3, the scores for all metrics gain significantly compared to the figure-only baseline, but are still weaker than when only mentions are used. This motivates the combination of mentions and OCR features: in row #4, compared to the figure-only and figure-OCR-only baselines, the performance further improves. Perhaps the most interesting result is in row #5, where we use only the mentions and OCR features but not the figure and get the best performance, particularly for SPICE and CIDEr, albeit comparable to when the figure is included in row #4. All these results indicate that explicitly extracted multimodal knowledge helps to compose informative captions.

6.2. Ablation Studies

We first performed an ablation study on figures by removing the visual feature vectors. The CIDEr score increases slightly, indicating that the visual feature acts more like noise for the model. This is likely because the ResNet-152 visual encoder we used was not trained on figures.

We enriched the representations of the OCR features by adding text, visual, and spatial features. The ablation studies aim to reveal the impact of each OCR token feature. All comparisons are with row #4, even though row #5 gives slightly better scores. With the OCR features completely removed in row #6, the CIDEr score decreases by 5.3. Using only the OCR spatial features in row #7, the CIDEr score drops by 7.8. Removing the OCR spatial features in row #8, the CIDEr score drops by 1.2. Upon removal of the OCR visual features in row #9, the drop in CIDEr score is close to that of removing the spatial features.

The above ablation study indicates that the enriched OCR features contribute to the informativeness of the generated captions. Unlike the OCR features, where appearance features are helpful to the model, removing the visual features of figures increases the CIDEr score, further indicating that we need a vision encoder specific to figures to provide meaningful features.

7. Human Evaluation

Having established that knowledge helps a model perform figure captioning, we conducted human evaluation activities to determine the subjective quality of the generated captions. We conducted human caption generation and evaluation tasks. The human generation task examines whether humans can write better captions than models. The evaluation task is an appropriateness evaluation, which consists of evaluating how appropriate the model-generated captions are compared to the ground-truth captions. Both tasks were performed by the same human subjects as the dataset quality assessment (Section 4.4).

7.1. Figure Caption Generation Task
| # | Annotator | Inputs | BLEU-4 | METEOR | ROUGE-L | SPICE | CIDEr |
|---|-----------|--------|--------|--------|---------|-------|-------|
| 1 | Annotator 1 | Figure-only | 2.4 | 8.3 | 13.2 | 9.4 | 14.6 |
| 2 | Annotator 2 | Figure-only | 3.8 | 10.1 | 21.5 | 8.9 | 23.8 |
| 3 | M4C-Captioner | Figure and OCR features | 3.6 | 7.6 | 20.5 | 11.5 | 18.7 |
| 4 | Annotator 1 | Figure-Mention | 7.7 | 13.4 | 19.1 | 15.9 | 11.3 |
| 5 | Annotator 2 | Figure-Mention | 7.5 | 14.8 | 24.8 | 14.3 | 18.8 |
| 6 | M4C-Captioner | Mention, Figure and OCR features | 5.5 | 11.6 | 28.1 | 16.1 | 47.7 |

Table 3: Automatic evaluation scores on human-generated captions. The model performs similarly to humans when the figure is the only available source. Using knowledge from the vision and text modalities, the model has a larger gain in CIDEr score.
| # | Model | Avg. score (Evaluator 1) | Avg. score (Evaluator 2) | Cohen's kappa |
|---|-------|--------------------------|--------------------------|---------------|
| 1 | M4C-Captioner (Mention, Figure and OCR features) | 1.8 | 2.13 | 0.27 |
| 2 | M4C-Captioner (Mention and OCR features) | 2.03 | 2.35 | 0.23 |
| 3 | M4C-Captioner (Mention only) | 1.8 | 2.22 | 0.31 |
| 4 | M4C-Captioner (Figure and OCR features) | 1.91 | 2.08 | 0.36 |
| 5 | Ground truth | 1.95 | 2.07 | 0.32 |

Table 4: Average appropriateness scores of model-generated and ground-truth captions. The two evaluators gave low scores to both the model-generated and the ground-truth captions, with only fair inter-annotator agreement.
The figure caption generation task is to write captions under two conditions separately: 1. Figure-only: human annotators write captions given only the figures. This condition is compared with captions generated by an M4C-Captioner that only has access to figures and OCR features. 2. Figure-Mention: human annotators write captions given both the figures and their mention-paragraphs. We randomly selected 100 figures from the test set to compare human-generated captions with captions generated by M4C-Captioner.

Table 3 shows the automatic evaluation results for the human caption generation tasks. Given only figures (rows #1, 2), both annotators got low scores across all metrics; among them, annotator 2 led on all metrics except SPICE. Since humans perform OCR naturally with their eyes, we compare them with M4C-Captioner (Figure and OCR features, row #3). It has the best SPICE score; although it outperformed annotator 1 on 4 of 5 evaluation metrics, it achieved performance similar to annotator 2. This shows that, without additional knowledge, humans are not much better than machines.

However, given mention-paragraphs and figures (rows #4, 5), compared to the figure-only condition, both annotators obtained improved scores in BLEU-4, METEOR, ROUGE-L, and SPICE but lower scores in CIDEr. Previous studies have shown that CIDEr is more reliable as an evaluation metric for caption generation, and the lowered CIDEr scores indicate that humans are likely to struggle even with additional knowledge. On the other hand, with access to the full set of features, M4C-Captioner obtained a significantly better CIDEr score than the human annotators. The automatic evaluation results of the human generation tasks show the steep difficulty of writing figure captions close to the ground truth.

Even given mention-paragraphs, our annotators wrote captions with low scores across all standard image captioning evaluation metrics. We ascribe this to figure captions being highly subjective and requiring in-domain knowledge to write. Although our annotators are researchers, they cannot be experts in every area of the computer science domain. When granted mention-paragraphs and OCR tokens as external knowledge sources, and when trained on a large amount of data, the model can significantly outperform humans.

7.2. Appropriateness Evaluation

This task evaluates the appropriateness of model-generated and ground-truth captions. We used the same set of 100 figures as in the figure caption generation task and placed the ground-truth captions and model-generated captions in random order. Human evaluators then rated each caption, giving it an appropriateness score from 1 to 4. The evaluation scale is: 1. Inappropriate: the caption does not match the figure, is not a sentence, is wrong, or is misleading. 2. Not sure: it is impossible to judge appropriateness solely from the figure. 3. Possible: a possible candidate that is incomplete but not wrong. 4. Appropriate: an informative caption that interprets the figure well. Since an appropriate figure caption should stand alone, and readers should understand the messages the figure wants to convey without referring to the body text, we did not show mention-paragraphs to the evaluators.

Table 4 shows the results of the evaluation. The two evaluators gave low average scores to both the model-generated captions and the ground-truth captions. In addition, the evaluators only reached fair agreement on scoring (0.23-0.36). Using the mention and OCR features (row #2) obtains the best human evaluation scores, and this is in line with the corresponding result in Table 2, where this setting also achieves the best CIDEr performance, indicating that the human evaluation is reliable despite the fair agreement. The evaluation results indicate that the model-generated and ground-truth captions are not always informative to both evaluators, which reveals the need to improve caption writing quality and model performance. We observed that captions tend to be written without following specific rules, and this may contribute to the lack of agreement. Given the low inter-rater agreement, we found that how informative a figure caption is remains highly subjective and depends on the in-domain background knowledge that evaluators have.
8. Related Work

Unlike natural image captioning, figure captioning has been scarcely studied. SciCap [1] is the most recent work on scientific figure captioning; its authors released a large-scale scientific figure captioning dataset that includes figures from academic papers in the arXiv dataset. Before SciCap, FigCAP [16, 17] and FigureQA [18] were two figure captioning datasets, but their figures are synthesized. We decided to extend and study the SciCap dataset, since its figures come from real-world scientific papers. In this paper, we also leverage multimodal knowledge using pre-trained models.

Multimodal machine learning models knowledge across various modalities. The multimodal task closest to figure captioning is image captioning. A popular architecture is the encoder-decoder, where the decoder learns to generate captions conditioned on visual features extracted from the encoder. Recent works on integrating texts in natural images for visual question answering and image captioning tasks are based on a transformer architecture augmented with a pointer network [5, 19]. The transformer enriches representations by integrating knowledge from both the text and visual modalities. The pointer network dynamically selects words from the fixed dictionary or the OCR tokens during generation.

Using the knowledge embedded in pre-trained models is a common practice in solving multimodal tasks. In this work, we used SciBERT [8], a BERT model [20] pre-trained on scientific papers, to obtain informative representations of the texts extracted from computer science papers. Since terms that appear in figures may be uncommon words, we also used FastText [21] to obtain word embeddings with subword information. For the visual modality, we used ResNet-152 [22] and Faster R-CNN [23] to extract features from images and bounding boxes.
9. Conclusion

In this paper, we study the challenges of the scientific figure captioning task. Extending the previous study [1], we reframe this task as a knowledge-augmented image captioning task, that is, a model needs to use knowledge extracted across modalities to generate captions. To this end, we released a new version of the SciCap dataset, SciCap+, by augmenting figures with their mention-paragraphs and OCR tokens. We used the M4C-Captioner model as the baseline model to utilize knowledge across three modalities: mention-paragraphs, figures, and OCR tokens. The automatic evaluation experiments reveal that using this knowledge significantly improves evaluation metric scores. Compared with human-generated captions, we found that models can generate better captions than humans with respect to the automatic evaluation metrics. However, the human evaluations demonstrated that writing scientific figure captions is challenging even for humans, and that the model-generated figure captions, despite their reasonable automatic evaluation quality, are still far from a level appropriate for human readers. We release the SciCap+ dataset to promote the further development of scientific figure captioning. For future work, we are interested in how to use multimodal pretraining strategies for this task.

10. Acknowledgment

These research results were partly obtained from the commissioned research (No. 225) by the National Institute of Information and Communications Technology (NICT), Japan, and partly from the first author's internship research at NICT.
References
[1] T.-Y. Hsu, C. L. Giles, T.-H. Huang, SciCap: Generating captions for scientific figures, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 3258–3264. URL: https://aclanthology.org/2021.findings-emnlp.277. doi:10.18653/v1/2021.findings-emnlp.277.
[2] O. Sidorov, R. Hu, M. Rohrbach, A. Singh, TextCaps: A dataset for image captioning with reading comprehension, in: European Conference on Computer Vision, Springer, 2020, pp. 742–758.
[3] Q. Zhang, C. Wang, C. Xin, H. Wu, Cheetah: An ultra-fast, approximation-free, and privacy-preserved neural network framework based on joint obscure linear and nonlinear computations, arXiv preprint arXiv:1911.05184 (2019).
[4] C. Clark, S. Divvala, PDFFigures 2.0: Mining figures from research papers, in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), IEEE, 2016, pp. 143–152.
[5] R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[6] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, D. Parikh, MMF: A multimodal framework for vision and language research, https://github.com/facebookresearch/mmf, 2020.
[7] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 66–71. URL: https://aclanthology.org/D18-2012. doi:10.18653/v1/D18-2012.
[8] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. URL: https://aclanthology.org/D19-1371. doi:10.18653/v1/D19-1371.
[9] J. Almazán, A. Gordo, A. Fornés, E. Valveny, Word spotting and recognition with embedded attributes, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 2552–2566.
[10] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, TACL 5 (2017) 135–146. URL: http://dblp.uni-trier.de/db/journals/tacl/tacl5.html#BojanowskiGJM17.
[11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[12] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[13] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[14] R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[15] P. Anderson, B. Fernando, M. Johnson, S. Gould, SPICE: Semantic propositional image caption evaluation, in: European Conference on Computer Vision, Springer, 2016, pp. 382–398.
[16] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, T. Yu, R. Rossi, R. Bunescu, Figure captioning with reasoning and sequence-level training, arXiv preprint arXiv:1906.02850 (2019).
[17] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, R. Rossi, Figure captioning with relation maps for reasoning, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020.
[18] S. E. Kahou, V. Michalski, A. Atkinson, Á. Kádár, A. Trischler, Y. Bengio, FigureQA: An annotated figure dataset for visual reasoning, arXiv preprint arXiv:1710.07300 (2017).
[19] O. Sidorov, R. Hu, M. Rohrbach, A. Singh, TextCaps: A dataset for image captioning with reading comprehension, 2020.
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[21] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[23] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015. URL: https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf.