=Paper= {{Paper |id=Vol-3656/paper13 |storemode=property |title=SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning |pdfUrl=https://ceur-ws.org/Vol-3656/paper13.pdf |volume=Vol-3656 |authors=Zhishen Yang,Raj Dabre,Hideki Tanaka,Naoaki Okazaki |dblpUrl=https://dblp.org/rec/conf/aaai/YangDTO23 }} ==SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning== https://ceur-ws.org/Vol-3656/paper13.pdf
                                SciCap+: A Knowledge Augmented Dataset to Study the
                                Challenges of Scientific Figure Captioning⋆
                                Zhishen Yang1,∗ , Raj Dabre2 , Hideki Tanaka2 and Naoaki Okazaki1
                                1
                                    Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan
                                2
                                    National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan


                                                                          Abstract
In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating figure caption generation helps move models' understanding of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific figure captioning as a knowledge-augmented image captioning task in which models need to utilize knowledge embedded across modalities for caption generation. To this end, we extended the large-scale SciCap dataset [1] to SciCap+, which includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. We then conducted experiments with M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention-paragraphs serve as additional context knowledge, significantly boosting automatic standard image-caption evaluation scores compared to the figure-only baselines. Human evaluations further reveal the challenges of generating figure captions that are informative to readers. The code and SciCap+ dataset are publicly available: https://github.com/ZhishenYang/scientific_figure_captioning_dataset

                                                                          Keywords
                                                                          Figure Captioning, Multimodal Machine Learning, Scientific Document Understanding



SDU@AAAI-23: The Third AAAI Workshop on Scientific Document Understanding, 2023
∗ Corresponding author.
zhishen.yang@nlp.c.titech.ac.jp (Z. Yang); raj.dabre@nict.go.jp (R. Dabre); hideki.tanaka@nict.go.jp (H. Tanaka); okazaki@c.titech.ac.jp (N. Okazaki)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Scholarly documents are the primary source for sharing scientific knowledge. These documents are available in various formats, such as journal articles, book chapters, and conference proceedings. A significant portion of these documents is text, and together with figures and tables, the text helps communicate knowledge to readers. Figures provide visual representations of complex information that facilitate sharing scientific findings with readers efficiently and straightforwardly. The standard practice in scientific writing is to write a caption for each figure, accompanied by paragraphs with detailed explanations. Figures and captions should be standalone: readers should be able to understand the figures without referring to the main text. Helping authors write appropriate and informative captions for figures will improve the quality of scientific documents, thereby enhancing the speed and quality of scientific communication. In this study, we focus on automating the generation of captions for figures in scientific papers.

Scientific figure captioning is a variant of the image captioning task. However, while sharing the same goal of generating a caption, it has two unique challenges: 1. Figures are not natural images: in contrast to natural images, the visual objects in scientific figures are texts and data points. 2. The captions of figures should explain: instead of simply identifying the objects and texts in a figure, a caption should contain the analysis that the authors intend to present and should highlight their findings.

A previous study [1], SciCap, defines the scientific figure captioning task as a figure-to-caption task: a model generates captions referring only to figures. Their work reported relatively low scores as measured by automatic evaluation metrics, indicating that there is considerable room for improvement. Intuitively, writing appropriate figure captions without sufficient background knowledge is difficult, since even humans will struggle to interpret a figure and write a caption unless some background knowledge is available. On the basis of this observation, we think that generating appropriate captions is infeasible without adding context knowledge to the caption generation model. This context comes in two forms: background knowledge from the running text and the OCR tokens in the figure, both of which should help reduce the burden on the captioning model. To this end, we augment the existing large-scale scientific figure captioning dataset, SciCap, with mention-paragraphs and OCR tokens, and call the resultant dataset SciCap+. We then pose scientific figure captioning as a multimodal summarization task and use the M4C-Captioner model [2] (a model that utilizes multimodal knowledge to generate captions) as a baseline to study the scientific figure captioning task. The experimental result of automatic evaluation demonstrates that using knowledge




embedded in different modalities, especially in the form of mention-paragraphs and OCR tokens, significantly boosts performance.

In addition to experiments using automatic evaluation metrics, we also performed human generation and evaluation tasks in order to establish the inherent difficulty of scientific figure captioning. The results of the human evaluation reveal three findings: 1. Multimodal knowledge helps models outperform humans in the caption generation task. 2. Model-generated captions are almost as informative as ground-truth captions: human evaluators do not prefer either type of caption. 3. Even when referring to mention-paragraphs, it is still challenging for humans to write captions that are close to the ground truth. To the best of our knowledge, we are the first to pose scientific figure captioning as a multimodal summarization task and to show that mention-paragraphs and OCR tokens as context substantially enhance the quality of generated captions.

Caption:
Fig. 7. (a) Speedup of CHEETAH over GAZELLE for computing ReLu. (b) Comparison of communication cost for ReLu.
Mention-paragraph:
Fig. 7 plots the speedup and communication cost as a function of the output dimension. Similarly, CHEETAH achieves an outstanding speedup with much smaller communication cost, independent of the output dimension, compared with GAZELLE. ……

Figure 1: An example figure [3] with its caption, mention-paragraph, and the text tokens recognized via OCR. Without referring to the mention-paragraph and the OCR tokens to tie the figure and the mention together, we cannot properly interpret the data presented in the figure, which is a communication cost comparison and the speedup of CHEETAH over GAZELLE.

2. Preliminary Study

In the traditional image captioning task, captioning an image aims at describing the appearance or nature of recognized objects and illustrating the relationships between them. Unlike the usual image captioning tasks, figures do not contain visual scenes. Instead, the captions provide interpretations of the data presented in figures to highlight the scientific findings that authors want to present to readers. Given this unique characteristic, without referring to the mention-paragraphs, which usually refer to the figure, it is extremely challenging for a human to interpret figures properly, because they may lack background knowledge of the domain or context of the figure. As Figure 1 shows, by only looking at the figure, we do not know what "comm.(KB)" stands for; we therefore lack the knowledge to write an informative caption. However, the mention-paragraph contains "communication cost", which is also present in the caption, indicating that such background knowledge should help in writing accurate captions.

3. Problem Formulation

The previous study [1] defined this task as an image captioning task: given a figure I, the model generates a caption C = [c_0, c_1, ..., c_N]. However, we reframe the scientific figure captioning task as a knowledge-augmented image captioning task requiring knowledge extracted from the text and vision modalities. For a figure, we define a paragraph that mentions it (the mention-paragraph) and the text within the figure, extracted via OCR, as the text modality. The figure itself and the visual appearances of the OCR texts are the vision modality. Given a scientific figure I and the knowledge extracted from the text and vision modalities, K_text and K_vision, we define the figure caption generation task as modeling P(C | I, K_text, K_vision).

4. SciCap+ Dataset

SciCap is a large-scale figure-caption dataset comprising graph plots extracted from 10 years of arXiv computer science papers. We used around 414k figures from SciCap and augmented each figure with its mention-paragraphs and OCR tokens with metadata. This section details the dataset creation and data augmentation processes. Figure 2 shows the overall workflow behind the creation of SciCap+.

4.1. Mention-paragraph Extraction

We first obtained papers in PDF format from the Kaggle arXiv dataset¹. The reason for using PDFs is that not all papers have source files, and some source files are complicated to parse. After obtaining the PDFs, we used PDFFigures 2.0 [4]² to extract the body text of each paper. PDFFigures 2.0 is a tool that extracts figures, captions, tables, and text from scholarly PDFs in computer science. In scholarly documents, authors label figures with numbers (e.g., Figure 1, Fig. 1). For a figure, we used its figure number in a

¹ https://www.kaggle.com/datasets/Cornell-University/arxiv
² https://github.com/allenai/pdffigures2
Figure 2: The overall workflow of the data augmentation for creating the SciCap+ dataset. For each figure in SciCap+, we extracted its mention-paragraphs and OCR tokens (OCR texts and bounding boxes).



regular expression to locate a paragraph that mentions it.

4.2. OCR Extraction

The SciCap dataset also provides texts extracted from figures as metadata, but it does not provide location information for each text. To include location information for each text in a figure, we used the Google Vision OCR API to extract text tokens from each figure together with the coordinates of their bounding boxes.

4.3. Data Statistics

The original SciCap dataset is split at the figure level. Therefore, figures from the same paper may appear in different splits. This leads to unfair evaluation, since the information of one figure in one split may coincidentally overlap with the information of another figure. We thus re-split the figures at the document level to eliminate this overlap. Hsu et al. [1] show that text normalization and figure filtering do not improve model performance. Hence, we keep the original captions and all figures (with/without sub-figures) in the SciCap+ dataset. For each figure, we kept only the first paragraph that mentions it in the body text. Table 1 shows statistics of the SciCap+ dataset. In all three splits, around 90% of the captions are shorter than 66 words. All figures are graph plots.

  Split        Figures        Words
  Training     394,005   12,336,511
  Test          10,336      323,382
  Validation    10,468      329,072

Table 1
Statistics of the SciCap+ dataset.

4.4. Dataset Quality Evaluation

Before conducting experiments, we carried out a human evaluation of SciCap+ in which we checked the quality of the extracted mention-paragraphs and OCR tokens. The aim was to establish whether the mention-paragraphs and OCR tokens were extracted correctly and were relevant to the figure and its caption. To this end, we randomly selected 200 figures from the training set, and for each figure, we asked two human evaluators to score the relevance between the figure's caption and its mention-paragraphs and OCR tokens on a scale of 1-5 (1 represents no relevance and 5 represents high relevance).

Compared to natural image captioning, human evaluation tasks in the figure captioning domain require expert knowledge. We recruited two colleagues to carry out this evaluation task. Both of them hold Ph.D. degrees in computer science and work as researchers, so they have adequate experience writing figure captions.

Figure 3 shows the distributions of the relevance scores. We can observe that the two evaluators gave most of the figures (evaluator 1: 64%; evaluator 2: 79.5%) relevance scores greater than 3, with a Cohen's kappa score of 0.28. This evaluation result indicates that the mention-paragraphs and OCR tokens have satisfactory extraction quality and that the annotators considered most of them relevant to the figure and its caption. However, the two annotators have relatively low agreement (0.28) on which figures and captions are relevant to their mention-paragraphs and OCR tokens. We attribute this to the fact that evaluations of figure captions are highly subjective.

Figure 3: Score distribution of the relevance between mention-paragraphs, OCR tokens, and figure captions. Both evaluators judged most of the figures as having at least moderate relevance to the captions.

5. Experiments

We conduct experiments using SciCap+ to empirically show that scientific figure captioning is inherently a knowledge-augmented task and benefits from knowledge coming from both the text and vision modalities.

5.1. Figure Captioning Model

We used M4C-Captioner [2] as the baseline model to study the scientific figure captioning task. M4C-Captioner is based on the Multimodal Multi-Copy Mesh (M4C) [5], which jointly learns representations across input modalities. To solve the out-of-vocabulary problem during caption generation, it is equipped with a pointer network that picks up text from the OCR tokens or from a predefined fixed dictionary. In this work, three input features are used: the figure, the mention-paragraphs, and the OCR tokens, each fed to encoders whose output representations are fed to the M4C-Captioner.

5.2. Implementation and Training

Our implementation of M4C-Captioner is based on the MMF framework [6] and PyTorch. The implementation allows users to specify diverse pre-trained encoders for each modality, which can be fine-tuned or frozen during training. The M4C-Captioner itself has a hidden dimension size of D = 768, K = 4 transformer layers, and 12 attention heads. We used SentencePiece [7] to obtain a dictionary of 32,000 subwords built from both mention-paragraphs and OCR tokens; this is used as the M4C-Captioner's vocabulary. We followed the BERT-BASE hyperparameter settings and trained from scratch.

Regarding the encoders that feed features to the M4C-Captioner, we used a pre-trained ResNet-152 as the figure's vision encoder. For each figure, we applied 2D adaptive average pooling over the outputs of layer 5 to obtain a global visual feature vector with a dimension of 2048. Layers 2, 3, and 4 were fine-tuned during training. For mention-paragraph features, SciBERT [8] was used to encode³ the mention-paragraph into 768-dimensional feature vectors. The number of vectors equals the number of subword tokens in the mention-paragraph, which we limit to 192. The mention-paragraph encoder is also fine-tuned during training. Finally, for the OCR tokens, we use both text and visual features. We selected fastText [9] as the word encoder and the Pyramidal Histogram of Characters (PHOC) [10] as the character encoder. For the visual features of the OCR tokens, we first extracted Faster R-CNN fc6 features and then applied fc7 weights to them to obtain 2048-dimensional appearance features for the bounding boxes of the OCR tokens. The fc7 weights were fine-tuned during training. We kept a maximum of 95 OCR tokens per figure.

We trained each model on a GPU server with 8 Nvidia Tesla V100 GPUs. Training a model with the complete set of features took 13 hours. During training, we used a batch size of 128. We selected CIDEr as the validation metric, evaluated every 2,000 iterations, and stopped training if the CIDEr score did not improve for 4 evaluation intervals. The optimizer is Adam with a learning rate of 0.001 and ε = 1.0E−08. We also used a multi-step learning-rate schedule with 1,000 warmup iterations and a warmup factor of 0.2. We kept the maximum number of decoding steps at 67. For evaluation, we used five standard metrics for evaluating image captions: BLEU-4 [11], METEOR [12], ROUGE-L [13], CIDEr [14], and SPICE [15]. Since figure captions contain scientific terms, which can be seen as uncommon words, among all five metrics we are particularly interested in CIDEr because it emphasizes such words.

³ We only used the first 3 layers of SciBERT to keep the model lightweight.
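As an illustration of the PHOC character encoding used for OCR tokens, the sketch below builds a PHOC-style binary vector. This is a simplified re-implementation for illustration only (unigram levels over a lowercase alphanumeric alphabet); the PHOC of [10] additionally uses bigram levels and different level settings.

```python
# Simplified PHOC (Pyramidal Histogram of Characters) encoder.
# Illustrative sketch only; not the exact formulation of [10].

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word: str, levels=(1, 2, 3)) -> list[int]:
    """Binary PHOC vector: one alphabet-sized block per region per level."""
    word = word.lower()
    vector = [0] * (len(ALPHABET) * sum(levels))
    block = 0  # running offset over all regions of all levels
    for level in levels:
        for region in range(level):
            r_start, r_end = region / level, (region + 1) / level
            for i, ch in enumerate(word):
                if ch not in ALPHABET:
                    continue
                # Normalized interval occupied by the i-th character.
                c_start, c_end = i / len(word), (i + 1) / len(word)
                # A character belongs to a region if at least half of
                # its occupancy interval overlaps that region.
                overlap = min(c_end, r_end) - max(c_start, r_start)
                if overlap / (c_end - c_start) >= 0.5:
                    vector[block * len(ALPHABET) + ALPHABET.index(ch)] = 1
            block += 1
    return vector
```

For example, at level 2 the word "ab" activates 'a' only in the left-half region and 'b' only in the right-half region, so the encoding captures coarse character positions in addition to character identity.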
  Model                                                                         BLEU-4    METEOR     ROUGE-L     SPICE    CIDEr
  1. M4C-Captioner (Figure Only )                                                   1.5        5.6       15.4       4.3      4.6
  2. M4C-Captioner (Mention Only)                                                   5.3       11.0       27.4      14.3     49.0
  3. M4C-Captioner (Figure and OCR features)                                        2.6        7.6       20.5      10.1     22.2
  4. M4C-Captioner (Mention, Figure and OCR features)                               6.3      12.0        29.2      15.8     55.8
  Ablation Study on Figures
  5. M4C-Captioner (Mention and OCR features)                                       6.3       12.0        29.3     16.1     56.4
  Ablation Study on OCR features
  6. M4C-Captioner (Mention, Figure and w/o OCR features )                          6.4       11.5        27.9     14.6     50.5
  7. M4C-Captioner (Mention, Figure and OCR spatial features)                       5.8       11.1        27.3     14.1     48.0
  8. M4C-Captioner (Mention, Figure and OCR (w/o spatial features) features )       6.4       12.0        29.1     15.7     54.6
  9. M4C-Captioner (Mention, Figure and OCR (w/o visual features) features )        6.2       11.9        28.9     15.6     54.1

Table 2
Automatic evaluation scores of the M4C-Captioner on the SciCap+ dataset. Aggregating knowledge from the text and vision modalities significantly boosts model performance compared to the figure-only baseline.
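Returning to the mention-paragraph extraction of Section 4.1: the paper matches figure numbers in the body text with a regular expression but does not publish the exact pattern, so the sketch below is an assumption covering the common "Figure 1" / "Fig. 1" labelling styles, with `find_mention_paragraphs` and `first_mention` as hypothetical helper names.

```python
import re

def find_mention_paragraphs(paragraphs, figure_number):
    """Return body-text paragraphs that mention the given figure number."""
    # Hypothetical pattern: "Figure 7", "Fig. 7", "fig 7", etc.
    pattern = re.compile(
        rf"\b(?:Figure|Fig\.?)\s*{figure_number}\b", re.IGNORECASE
    )
    return [p for p in paragraphs if pattern.search(p)]

def first_mention(paragraphs, figure_number):
    """SciCap+ keeps only the first mention-paragraph per figure."""
    matches = find_mention_paragraphs(paragraphs, figure_number)
    return matches[0] if matches else None
```

The trailing `\b` prevents figure 7 from matching "Fig. 70", which is the kind of false positive a naive substring search would produce.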



6. Results

6.1. Main Result

The experimental results in Table 2 demonstrate that using the mention-paragraphs and OCR tokens significantly improves scores on all five metrics compared to the figure-only baseline. The results align with our hypothesis and preliminary study that scientific figure captioning is a knowledge-augmented image captioning task: OCR tokens and the knowledge embedded in mention-paragraphs help in composing informative captions.

We established a baseline, M4C-Captioner (Figure only), with figures as the only input modality to the M4C-Captioner model in row #1. This baseline is the non-knowledge setting; its low scores on all metrics show that the model needs knowledge from other modalities. Using the mention only in row #2 shows that the mention certainly contains a lot of useful information, as evidenced by the increase in performance. When OCR features are added to the figure input in row #3, all metrics gain significantly over the figure-only baseline, but the scores are still weaker than when only mentions are used. This motivates combining mentions and OCR features: in row #4, the performance improves further over both the figure-only and figure-OCR-only baselines. Perhaps the most interesting result is in row #5, where we use only the mentions and OCR features but not the figure and obtain the best performance, particularly for SPICE and CIDEr, albeit comparable to row #4, where the figure is included. All these results indicate that explicitly extracted multimodal knowledge helps to compose informative captions.

6.2. Ablation Studies

We first performed an ablation study on figures: after removing the visual feature vectors, the CIDEr score increases slightly, indicating that the visual features act more like noise for the model. This is likely because the ResNet-152 visual encoder we used was not trained on figures.

We enriched the representations of the OCR tokens by adding text, visual, and spatial features. The following ablation studies aim to reveal the impact of each OCR token feature. All comparisons are with row #4, even though row #5 gives slightly better scores. With OCR features completely removed in row #6, the CIDEr score decreases by 5.3. Using only OCR spatial features in row #7, the CIDEr score drops by 7.8. Removing OCR spatial features in row #8, the CIDEr score drops by 1.2. Upon removal of OCR visual features in row #9, the CIDEr score drop is close to that of removing the spatial features.

This ablation study indicates that the enriched OCR features contribute to the informativeness of generated captions. Unlike the OCR tokens, whose appearance features are helpful to the model, removing the visual features of figures increases CIDEr scores, further indicating that we need a vision encoder specific to figures to provide meaningful features.

7. Human Evaluation

Having established that knowledge helps a model perform figure captioning, we conducted human evaluations to determine the subjective quality of the generated captions. We conducted human caption generation and evaluation tasks. The human generation task examines whether humans can write better captions than models. The evaluation task is an appropriateness evaluation, which consists of judging how appropriate the model-generated captions are versus the ground-truth captions. Both tasks were performed by the same human subjects as the dataset quality assessment.

7.1. Figure Caption Generation Task

The figure caption generation task is to generate captions under two conditions separately: 1. Figure-only: Human
   Annotator            Inputs                              BLEU-4   METEOR   ROUGE-L   SPICE   CIDEr
   1. Annotator 1       Figure-only                            2.4      8.3      13.2     9.4    14.6
   2. Annotator 2       Figure-only                            3.8     10.1      21.5     8.9    23.8
   3. M4C-Captioner     Image and OCR features                 3.6      7.6      20.5    11.5    18.7
   4. Annotator 1       Figure-Mention                         7.7     13.4      19.1    15.9    11.3
   5. Annotator 2       Figure-Mention                         7.5     14.8      24.8    14.3    18.8
   6. M4C-Captioner     Mention, Figure and OCR features       5.5     11.6      28.1    16.1    47.7
Table 3
Automatic evaluation scores on human-generated captions. The model performs similarly to the human annotators when the figure is the only available source; with knowledge from both the vision and text modalities, the model gains most on CIDEr.

 Model                                                   Average Score   Average Score   Cohen's Kappa
                                                           Evaluator 1     Evaluator 2
 1. M4C-Captioner (Mention, Figure and OCR features)              1.80            2.13            0.27
 2. M4C-Captioner (Mention and OCR features)                      2.03            2.35            0.23
 3. M4C-Captioner (Mention only)                                  1.80            2.22            0.31
 4. M4C-Captioner (Figure and OCR features)                       1.91            2.08            0.36
 5. Ground truth                                                  1.95            2.07            0.32
Table 4
Average appropriateness scores on model-generated and ground-truth captions. Both evaluators gave low scores to model-generated and ground-truth captions alike, with fair inter-annotator agreement.
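The agreement column in Table 4 is Cohen's kappa, which corrects the raw fraction of identical scores for the agreement two raters would reach by chance given their individual score distributions. A minimal sketch of the computation (illustrative only; any standard statistics library provides an equivalent routine):

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    assert len(rater1) == len(rater2) and rater1
    n = len(rater1)
    # observed agreement: fraction of items scored identically
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # chance agreement from each rater's marginal score distribution
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if p_e == 1.0:  # both raters gave one identical constant score
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Kappa values of 0.21-0.40, as observed here, conventionally count as "fair" agreement.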



annotators write captions given only figures. This is to compare with captions generated by M4C-Captioner when it only has access to figures and OCR features. 2. Figure-Mention: Human annotators write captions given both figures and their mention-paragraphs. We randomly selected 100 figures from the test set to compare human-generated captions with those generated by M4C-Captioner.
   Table 3 shows the automatic evaluation results for the human caption generation task. Given only figures (rows #1, 2), both annotators received low scores across all metrics; between them, annotator 2 led on every metric except SPICE. Since humans perform OCR naturally with their eyes, we compare them with M4C-Captioner (Figure and OCR features). It achieved the best SPICE score, and although it outperformed annotator 1 on 4 of 5 evaluation metrics, it performed similarly to annotator 2. This shows that, without additional knowledge, humans are not much better than machines.
   However, given mention-paragraphs and figures (rows #4, 5), both annotators improved over the figure-only condition on BLEU-4, METEOR, ROUGE-L, and SPICE but received lower CIDEr scores. Previous studies have shown that CIDEr is a more reliable evaluation metric for caption generation, and the lowered CIDEr scores suggest that humans struggle to exploit the additional knowledge. In contrast, with access to the full set of features, M4C-Captioner achieved a significantly better CIDEr score than the human annotators. The automatic evaluation results of the human generation task show how difficult it is to write figure captions that are close to the ground truth.
   Even given mention-paragraphs, our annotators wrote captions with low scores across all standard image captioning evaluation metrics. We ascribe this to figure captions being highly subjective and requiring in-domain knowledge to write. Although our annotators are researchers, they cannot be experts in every area of computer science. Granted mention-paragraphs and OCR tokens as external knowledge sources, and trained on a large amount of data, the model can significantly outperform humans.

7.2. Appropriateness Evaluation

This task evaluates the appropriateness of model-generated and ground-truth captions. We used the same set of 100 figures as in the figure caption generation task and placed the ground-truth and model-generated captions in random order. Human evaluators then gave each caption an appropriateness score (1-4). The evaluation scale is: 1. Inappropriate: the caption does not match the figure, is not a sentence, is wrong, or is misleading. 2. Not sure: it is impossible to judge appropriateness solely from the figure. 3. Possible: a possible candidate that is incomplete but not wrong. 4. Appropriate: an informative caption that interprets the figure well. Since an appropriate figure caption should stand alone, and readers should understand the message the figure conveys without referring to the body text, we did not show mention-paragraphs to the evaluators.
   Table 4 shows the results of the evaluation. Both evaluators gave low average scores to model-generated and ground-truth captions alike. In addition, the evaluators only reached fair agreement on scoring (0.23-0.36). The model using mention and OCR features (row #2) obtained the best human evaluation scores, which is in line with the corresponding result in Table 2, where it also achieved the best CIDEr performance, indicating that the human evaluation is reliable despite the fair agreement. The evaluation results indicate that neither the model-generated nor the ground-truth captions are always informative to both evaluators, which reveals the need to improve both caption writing quality and model performance. We observed that captions tend to be written without following specific rules, which may contribute to the lack of agreement. Given the low inter-rater agreement, we find that how informative a figure caption is remains highly subjective and depends on the evaluators' in-domain background knowledge.

8. Related Work

Unlike natural image captioning, figure captioning has been scarcely studied. SciCap [1] is the most recent work on scientific figure captioning; it released a large-scale scientific figure captioning dataset with figures drawn from academic papers in the arXiv dataset. Before SciCap, FigCAP [16, 17] and FigureQA [18] were two figure captioning datasets, but their figures are synthesized. We decided to extend and study the SciCap dataset, since its figures come from real-world scientific papers. In this paper, we have also leveraged multimodal knowledge using pre-trained models.
   Multimodal machine learning models knowledge across various modalities. The multimodal task closest to figure captioning is image captioning; a popular architecture is the encoder-decoder, where the decoder learns to generate captions conditioned on visual features extracted by the encoder. Recent works on integrating text in natural images for visual question answering and image captioning are based on a transformer architecture augmented with a pointer network [5, 19]. The transformer enriches representations by integrating knowledge from both the text and the visual modality. The pointer network dynamically selects words from the fixed dictionary or from the OCR tokens during generation.
   Using knowledge embedded in pre-trained models is common practice in solving multimodal tasks. In this work, we used SciBERT [8], a BERT model [20] pre-trained on scientific papers, to obtain informative representations of the text extracted from computer science papers. Since terms that appear in figures may be uncommon words, we also used fastText [21] to obtain word embeddings with subword information. For the visual modality, we used ResNet-152 [22] and Faster R-CNN [23] to extract features from images and bounding boxes.

9. Conclusion

In this paper, we study the challenges of the scientific figure captioning task. Extending the previous study [1], we reframe this task as a knowledge-augmented image captioning task; that is, a model needs to use knowledge extracted across modalities to generate captions. To this end, we released a new version of the SciCap dataset, SciCap+, which augments figures with their mention-paragraphs and OCR tokens. We used the M4C-Captioner model as the baseline to utilize knowledge across three modalities: mention-paragraphs, figures, and OCR tokens. The automatic evaluation experiments reveal that using this knowledge significantly improves the evaluation metric scores. Compared with human-generated captions, we found that models can generate better captions than humans in terms of the automatic evaluation metrics. However, the human evaluations demonstrated that writing scientific figure captions is challenging even for humans, and that the model-generated captions, despite their reasonable automatic evaluation quality, are still far from a level appropriate for human readers. We release the SciCap+ dataset to promote the further development of scientific figure captioning. For future work, we are interested in applying multimodal pretraining strategies to this task.

10. Acknowledgment

These research results were partly obtained from the commissioned research (No. 225) by the National Institute of Information and Communications Technology (NICT), Japan, and partly from the first author's internship research at NICT.

References

 [1] T.-Y. Hsu, C. L. Giles, T.-H. Huang, SciCap: Generating captions for scientific figures, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 3258–3264. URL: https://aclanthology.org/2021.findings-emnlp.277. doi:10.18653/v1/2021.findings-emnlp.277.
 [2] O. Sidorov, R. Hu, M. Rohrbach, A. Singh, Textcaps: a dataset for image captioning with reading comprehension, in: European Conference on Computer Vision, Springer, 2020, pp. 742–758.
 [3] Q. Zhang, C. Wang, C. Xin, H. Wu, Cheetah: An ultra-fast, approximation-free, and privacy-preserved neural network framework based on joint obscure linear and nonlinear computations, arXiv preprint arXiv:1911.05184 (2019).
 [4] C. Clark, S. Divvala, Pdffigures 2.0: Mining figures from research papers, in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), IEEE, 2016, pp. 143–152.
 [5] R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative answer prediction with pointer-augmented multimodal transformers for textvqa, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
 [6] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, D. Parikh, Mmf: A multimodal framework for vision and language research, https://github.com/facebookresearch/mmf, 2020.
 [7] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 66–71. URL: https://aclanthology.org/D18-2012. doi:10.18653/v1/D18-2012.
 [8] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. URL: https://aclanthology.org/D19-1371. doi:10.18653/v1/D19-1371.
 [9] J. Almazán, A. Gordo, A. Fornés, E. Valveny, Word spotting and recognition with embedded attributes, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 2552–2566.
[10] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[12] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[13] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[14] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[15] P. Anderson, B. Fernando, M. Johnson, S. Gould, Spice: Semantic propositional image caption evaluation, in: European Conference on Computer Vision, Springer, 2016, pp. 382–398.
[16] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, T. Yu, R. Rossi, R. Bunescu, Figure captioning with reasoning and sequence-level training, arXiv preprint arXiv:1906.02850 (2019).
[17] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, R. Rossi, Figure captioning with relation maps for reasoning, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020.
[18] S. E. Kahou, V. Michalski, A. Atkinson, Á. Kádár, A. Trischler, Y. Bengio, Figureqa: An annotated figure dataset for visual reasoning, arXiv preprint arXiv:1710.07300 (2017).
[19] O. Sidorov, R. Hu, M. Rohrbach, A. Singh, Textcaps: a dataset for image captioning with reading comprehension, 2020.
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[21] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[23] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015. URL: https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf.