<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Captioning⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhishen Yang</string-name>
          <email>zhishen.yang@nlp.c.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raj Dabre</string-name>
          <email>raj.dabre@nict.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hideki Tanaka</string-name>
          <email>hideki.tanaka@nict.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naoaki Okazaki</string-name>
          <email>okazaki@c.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Figure Captioning, Multimodal Machine Learning, Scientific Document Understanding</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Information and Communications Technology</institution>
          ,
          <addr-line>3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tokyo Institute of Technology</institution>
          ,
          <addr-line>2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8550</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating ifgure caption generation helps move model understandings of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific ifgure captioning as a knowledge-augmented image captioning task that models need to utilize knowledge embedded across modalities for caption generation. To this end, we extended the large-scale SciCap dataset [1] to SciCap+, which includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. Then, we conduct experiments with the M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention-paragraphs serve as additional context knowledge, significantly boosting the automatic standard image caption evaluation scores compared to the figure-only baselines. Human evaluations further reveal the challenges of generating figure captions that are informative to readers. The code and SciCap+ dataset are publicly available: https: various formats, such as journal articles, book chapters, caption should contain an analysis that the authors intables, they help communicate knowledge to readers. Us- ure captioning task as a figure-to-caption task: A model ∗Corresponding author.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Scholarly documents are the primary source for sharing
scientific knowledge. These documents are available in
and conference proceedings. A significant portion of
these documents is text and together with figures and
ing figures provides visual representations of complex
information that facilitate the sharing of scientific
findings with readers eficiently and straightforwardly. The
standard practice for scientific writing is to write a
caption for each figure, accompanied by paragraphs with
detailed explanations. Figures and captions should be
standalone, and readers should be able to understand
the figures without referring to the main text. Helping
authors write appropriate and informative captions for
ifgures will improve the quality of scientific documents,
thereby enhancing the speed and quality of scientific
communication. In this study, we focus on automating
the generation of captions for figures in scientific papers.</p>
      <sec id="sec-1-1">
        <title>Scientific figure captioning is a variant of the image captioning task. However, with the same goal of generating a caption, it has two unique challenges: 1. Figures are</title>
        <p>CEUR</p>
        <p>CEUR
not natural images: In contrast to natural images, visual
objects are texts and data points in scientific figures. 2.</p>
      </sec>
      <sec id="sec-1-2">
        <title>The captions of the figures should explain: Instead of</title>
        <p>simply identifying objects and texts in the figures, the
tend to present and highlight findings.</p>
        <p>
          A previous study [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], SciCap, defines the scientific
figgenerates captions only referring to figures. Their work
reported relatively lower scores as measured by
automatic evaluation metrics, indicating that there is
considerable room for improvement. Intuitively, writing
appropriate figure captions without suficient background
knowledge is dificult, since even humans will struggle
to interpret a figure and write a caption unless some
background knowledge is available. On the basis of this
observation, we think that generating appropriate
captions is infeasible without adding context knowledge to
the caption generation model. This context comes in two
forms: background knowledge from the running text
and the OCR tokens in the figure, both of which should
help reduce the burden on the captioning model. To this
end, we augment the existing large-scale scientific
figure captioning dataset: SciCap with mention-paragraphs
and OCR tokens and call the resultant dataset as
Sci
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Cap+. We then pose scientific figure captioning as a mul</title>
        <p>
          timodal summarization task and use the M4C captioner
model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] (a model that utilizes multimodal knowledge
to generate captions) as a baseline to study the scientific
ifgure captioning task. The experimental result of
auCEUR
        </p>
        <p>ceur-ws.org
ures to highlight scientific findings that authors want
to present to readers. With this unique characteristic,
without referring to mention-paragraphs, which usually
refer to the figure, it is extremely challenging for a
human to have proper interpretations of figures. This is
because they may lack background knowledge of the
domain or context of the figure. As figure 1 shows, by only
looking at the figure, we do not know what ”comm.(KB)”
stands for; therefore lacking the knowledge to write
informative captions is challenging. However, the
mentionparagraph contains ”communication cost” and this is also
present in the caption, indicating that such background
knowledge should help in writing accurate captions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Problem Formulation</title>
    </sec>
    <sec id="sec-3">
      <title>4. SciCap+ Dataset</title>
      <p>SciCap is a large-scale figure-caption dataset comprising
graph plots extracted from 10 years of collections of arXiv
computer science papers. We used around 414k figures
from SciCap and augment each figure with its
mentionparagraphs and OCR tokens with metadata. This section
details the data set creation and data augmentation
processes. Figure 2 shows the overall workflow behind the
creation of SciCap+.</p>
      <sec id="sec-3-1">
        <title>4.1. Mention-paragraph Extraction</title>
        <p>Caption:
Fig. 7. (a) Speedup of CHEETAH over GAZELLE for computing ReLu.
(b) Comparison of communication cost for ReLu.</p>
        <p>Mention-paragraph:
Fig. 7 plots the speedup and communication cost as a function of the
output dimension. Similarly, CHEETAH achieves an outstanding
speedup with much smaller communication cost, independent of the
output dimension, compared with GAZELLE.
……
embedded in diferent modalities, especially in the form
of mention-paragraphs and OCR tokens, significantly
boosts performance.</p>
        <p>In addition to experiments using automatic
evaluation metrics, we also performed human generation and
evaluation tasks in order to establish the inherent
dificulty of scientific figure captioning. The results of the
human evaluation reveal three findings: 1. Multimodal
knowledge helps models outperform humans in caption
generation tasks. 2. Model-generated captions are
almost as informative as ground-truth captions: Human
evaluators do not prefer either type of caption. 3. Even
referring to mention-paragraphs, it is still challenging for
humans to write captions that are close to ground truth.
To the best of our knowledge, we are the first to pose
scientific figure captioning as a multimodel summarization
task and show that mention-paragraphs and OCR tokens
as context substantially enhance the quality of generated
captions.</p>
        <sec id="sec-3-1-1">
          <title>We first obtained papers in PDF format from Kaggle arXiv</title>
          <p>dastaset 1. The reason for using PDFs is that not all
papers have source files and some are complicated to parse.
2. Preliminary Study After obtaining PDFs, we used PDFFigures 2.0 [ 4] 2 to
extract the body text of each paper. PDFFigure 2.0 is a
In the traditional image captioning task, captioning an im- tool that extracts figures, captions, tables, and text from
age aims at describing the appearances or natures of rec- scholarly PDFs in computer science. In scholarly
docuognized objects and illustrating the relationships between ments, authors label figures with numbers (e.g. Figure
recognized objects. Unlike the usual image captioning 1. Fig. 1). For a figure, we used its figure number in a
tasks, figures do not contain visual scenes. Instead, the 1https://www.kaggle.com/datasets/Cornell-University/arxiv
captions provide interpretations of data presented in fig- 2https://github.com/allenai/pdffigures2
regular expression to locate a paragraph that mentions
it.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. OCR Extraction</title>
        <p>The SciCap dataset also provides texts extracted from
ifgures as metadata, but does not provide location
information for each text. To include location information for
each text in a figure, we used Google Vision OCR API to
extract text tokens from each figure with its coordinates
of bounding boxes.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Data Statistics</title>
        <sec id="sec-3-3-1">
          <title>The splitting of the SciCap dataset is at the figure level.</title>
          <p>
            Therefore, figures from the same paper may appear in
diferent splits. This will lead to unfair evaluation, since
the information of one figure in one split may
coincidentally overlap with the information of another figure. We
thus re-split figures at the document level to eliminate
this overlapping problem. Hsu et al. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] show that text
normalization and figure filtering do not improve model
performance. Hence, we keep original captions and all
ifgures (with/without sub-figures) in the SciCap+ dataset.
For a figure, we kept only the first paragraph that
mentions it in the body text. Table 1 shows statistics of the
SciCap+ dataset. In all three splits, around 90% of the
captions are less than 66 words. All figures are graph
plots.
          </p>
          <p>Split
Training
Test
Validation</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Dataset Quality Evaluation</title>
        <sec id="sec-3-4-1">
          <title>Before conducting experiments, we conducted human</title>
          <p>evaluation of SciCap+ where we checked the
mentionparagraphs and OCR tokens extraction quality. The aim
was to establish whether the mention-paragraphs and
OCR tokens were extracted correctly and relevant to the
ifgure and its caption. To this end, we randomly selected
200 figures from the training set and for each figure,
we asked two human evaluators to give scores of 1-5
(1 represents no relevance and 5 is highly relevant) for
relevance between a caption of a figure and its
mentionparagraphs and OCR tokens.</p>
          <p>Compared to natural image captioning, human
evaluation tasks for the figure captioning domain requires
expert knowledge. We recruited two colleagues to carry
out this evaluation task. Both of them have Ph.D. degrees
in computer science and work as researchers. Their
experience implies that they have adequate experience writing
ifgure captions.</p>
          <p>Figure 3 shows the distributions of the relevance scores.</p>
          <p>We can observe that two evaluators gave most of the
allows users to specify diverse pre-trained encoders for
each modality, which can be fine-tuned or frozen during
training. The M4C-captioner itself has  = 768 hidden
dimension size,  = 4 transformer layers and 12 attention
heads. We used sentencepiece [7] to obtain a dictionary
of 32000 subwords built from both mention-paragraphs
and OCR tokens. This is used as the M4C-captioner’s
vocabulary. We followed the BERT-BASE hyperparameter
setting and trained from scratch.</p>
          <p>Regarding the encoders that feed features to
M4Ccaptioner, we used pre-trained Resnet-152 as the figure’s
Figure 3: Score distribution on correlations between mention- vision encoder. For each figure, we applied a 2D
adapparagraph, OCR tokens and figure captions. Both evaluators tive average pooling over outputs from layer 5 to
objudged most of the figures with at least moderate correlations tain a global visual feature vector with a dimension of
with captions. 2048. Layers 2, 3 and 4 layers were fine-tuned during
training. For mention-paragraph features, SciBERT [8]
was used to encode3 it into 758-dimensional feature
vecifgures (evaluator 1: 64% and evaluator 2: 79.5%) with tors. The number of vectors equals the number of
subrelevance scores greater than 3 and a cohen kappa score word tokens in the mention-paragraph, which we limit to
of 0.28. This evaluation result indicates that the mention- 192. The mention-paragraph encoder is also fine-tuned
paragraphs and OCR tokens have a satisfactory extrac- during training. Finally, for OCR tokens, we use both
tion quality and that the annotators considered most of text and visual features. We selected FastText [9] as the
them as relevant to the figure and its caption. However, word encoder and Pyramidal Histogram of Characters
the two annotators seem to have a relatively lower agree- (PHOC) [10] as the character encoder. Regarding the
ment (0.28) regarding which figures and captions are visual feature encoder of OCR tokens, we first extracted
relevant to their mention-paragraphs and OCR tokens. Faster R-CNN fc6 features and then applied fc7 weights
We attribute this to the fact that evaluations of figure to it to obtain 2048-dimensional appearance features for
captions are highly subjective. bounding boxes of OCR tokens. The fc7 weights were
ifne-tuned during training. We kept a maximum of 95
OCR tokens per figure.
5. Experiments We trained a model on a GPU server with 8 Nvidia
Tesla V100 GPUs. Training a model with a complete set of
We conduct experiments using SciCap+ to empirically features took 13 hours. During training, we used a batch
prove that scientific figure captioning is inherently a size of 128. We selected CIDEr as the evaluation metric.
knowledge-augmented task and benefits from knowledge The evaluation interval is every 2000 iterations, we stop
coming from both text and vision modalities. training if CIDEr score does not improve for 4 evaluation
intervals. The optimizer is Adam with a learning rate of
5.1. Figure Captioning Model 0.001 and  = 1.0 E−08. We also used a multistep
learning rate schedule with warmup iterations of 1000 and a
warmup factor of 0.2. We kept the maximum number of
decoding steps at the decoding time as 67. For
evaluation, we used five standard metrics for evaluating image
captions: BLEU-4 [11], METEOR [12], ROUGE-L [13],
CIDEr [14] and SPICE [15]. Since figure captions contain
scientific terms which can be seen as uncommon words,
among all five metrics, we are particularly interested in
CIDEr since it emphasizes them.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>We used M4C-Captioner [2] as the baseline model to</title>
          <p>study the scientific figure captioning task. The
M4CCaptioner is based on Multimodal Multi-Copy Mesh
(M4C) [5] that jointly learns representations across
input modalities. To solve the out-of-vocabulary problem
during caption generation, it is equipped with a pointer
network that picks up text from OCR tokens or a
predeifned fixed dictionary. In this work, 3 input features are
used, figure, mention-paragraphs and OCR tokens fed to
encoders, the output representations of which are fed to
the M4C-Captioner.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>5.2. Implementation and Training</title>
        <sec id="sec-3-5-1">
          <title>Our implementation of M4C-Captioner is based on the</title>
          <p>MMF framework [6] and Pytorch. The implementation</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>3We only used the first 3 layers of SciBERT for lightweightness.</title>
          <p>Model
1. M4C-Captioner (Figure Only )
2. M4C-Captioner (Mention Only)
3. M4C-Captioner (Figure and OCR features)
4. M4C-Captioner (Mention, Figure and OCR features)
Ablation Study on Figures
5. M4C-Captioner (Mention and OCR features)
Ablation Study on OCR features
6. M4C-Captioner (Mention, Figure and w/o OCR features )
7. M4C-Captioner (Mention, Figure and OCR spatial features)
8. M4C-Captioner (Mention, Figure and OCR (w/o spatial features) features )
9. M4C-Captioner (Mention, Figure and OCR (w/o visual features) features )
6.3
6.4
5.8
6.4
6.2</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Results</title>
      <p>noise for the model. This is likely because the Resnet-152
visual encoder we used was not trained on figures.
6.1. Main Result We enriched the representations of the OCR features
by adding text, visual, and spatial features. Ablation
studThe experimental results in table 2 demonstrate that us- ies aim to reveal impacts of each OCR token feature. All
ing the mention-paragraph and OCR tokens significantly comparisons are with row #4 even though row #5 gives
improves scores on all five metrics compared to the figure- slightly better scores. With OCR features completely
only baseline. The experimental results align with our removed in row #6, the CIDEr scores decrease by 5.3.
hypothesis and preliminary study that scientific figure Using only OCR spatial features in row #7, the CIDEr
captioning is a knowledge-augmented image captioning score dropped by 7.8. Removing OCR spatial features in
task, OCR tokens and knowledge embedded in mention- row #8, the CIDEr scores dropped by 1.2. Upon removal
paragraphs help in composing informative captions. of OCR visual features in row #9, the CIDEr score is close</p>
      <p>We established a baseline M4C-Captioner (Figure only) to removing spatial features.
with figures as the only input modality to the M4C- The above ablation study indicates that the enriched
Captioner model in row #1. This baseline is in the non- OCR contributes to the informativeness of generated
capknowledge setting. Therefore, low scores in all metrics tions. Unlike OCR features, where appearance features
show that the model needs knowledge of other modal- are helpful to the model, removing visual features of
ities. Using the mention only in row #2 shows that the ifgures increases CIDEr scores, further indicating that
mention certainly contains a lot of useful information, as we need a specific vision encoder for figures to provide
evidenced by the increase in performance. When OCR meaningful features.
features are added to the figure input in row #3, scores
for all metrics have significant gains compared to the
ifgure-only baseline, but are still weaker than when only 7. Human Evaluation
mentions are used. This motivates the combination of
mentions and OCR features and in row #4, compared to Having established that knowledge helps a model
perthe figure-only baseline and figure-OCR-only baseline, form figure captioning, we conducted some human
evalthe performance further improves. Perhaps the most uation activities to determine their subjective quality.
interesting result is in row #5 where we only use the We conducted human caption generation and
evaluamentions and OCR features but not the figure and get tion tasks. The human generation task is to examine
the best performance, particularly for SPICE and CIDEr, whether humans can write better captions than models.
albeit comparable to when the figure is included in row The evaluation task is the appropriateness evaluation
#4. All these results indicate that explicitly extracted task, which consists of evaluating how appropriate the
multimodal knowledge helps to compose informative model-generated captions are versus ground-truth
capcaptions. tions. Both tasks were performed by the same human
subjects for the quality assessment of the data set.</p>
      <sec id="sec-4-1">
        <title>6.2. Ablation Studies</title>
        <sec id="sec-4-1-1">
          <title>We first performed an ablation study on figures by removing visual feature vectors, the CIDEr score increases slightly, indicating that the visual feature is more like</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>7.1. Figure Caption Generation Task</title>
        <sec id="sec-4-2-1">
          <title>The figure caption generation task is to generate captions under two conditions separately: 1. Figure-only: Human</title>
          <p>Annotator
1. Annotator 1
2. Annotator 2
3. M4C-Captioner
4. Annotator 1
5. Annotator 2
6. M4C-Captioner</p>
          <p>Inputs
Figure-only</p>
          <p>Figure-only
Image and OCR features</p>
          <p>Figure-Mention</p>
          <p>Figure-Mention
Mention, Figure and OCR features
annotators write captions given only figures. This is to Even given mention-paragraphs, our annotator wrote
compare with captions generated by M4C-Captioner that captions with low scores across all standard image
caponly has access to figures and OCR features. 2. Figure- tioning evaluation metrics. We ascribe it as figure
Mention: Human annotators write captions given both captions are highly subjective and require in-domain
ifgures and their mention-paragraphs. We randomly knowledge to write. Although our annotators are
reselected 100 figures from the test set and to compare searchers, they cannot be professional in all
knowlhuman-generated captions with captions generated by edge existing in the computer science domain. Granted
M4C-Captioner. mention-paragraphs and OCR tokens as external
knowl</p>
          <p>The table 3 shows automatic evaluation results for hu- edge sources, and with large-amount data training, the
man caption generation tasks. Given only figures (rows model can significantly outperform humans.
#1, 2), both annotators got low scores across all metrics,
among those, annotator 2 led all metrics except SPICE. 7.2. Appropriateness Evaluation
Since humans perform OCR naturally with their eyes we
compare with M4C-captioner (Figure and OCR features). This task evaluates the appropriateness of
modelIt has the best SPICE score, although it outperformed an- generated and ground-truth captions. We used the same
notator 1 in 4 of 5 evaluation metrics, it achieved similar set of 100 figures as in the figure caption generation task,
performance compared with annotator 2. This shows and placed ground-truth captions and model-generated
that without additional knowledge, humans aren’t that captions in random order. Then, human evaluators rank
better than machines. each caption to give appropriateness scores (1-4) to each</p>
          <p>
            However, given mention-paragraphs and figures (rows caption. The evaluation scale: 1. Inappropriate: a caption
#4, 5), compared to the figure-only condition, both anno- does not match the figure, is not a sentence, is wrong, or
tators got improved scores in BLEU-4, METEOR, ROUGE- is misleading. 2. Not sure: It is impossible to judge
approL, and SPICE but lower scores in CIDEr. Previous studies priateness solely from the figure. 3. Possible: A possible
have shown that CIDEr is more reliable as an evaluation candidate that is incomplete but not wrong. 4.
Apprometric for caption generation, and the lowered CIDEr priate: An informative caption that interprets the figure
scores indicates that humans are likely to struggle with well. Since an appropriate figure caption should stand
additional knowledge. On the other hand, having access alone and readers should understand the messages the
to full features, M4C-captioner gained a significantly bet- ifgure wants to represent without referring to the body
ter CIDEr score compared to human annotators. The text, we do not show mention-paragraphs to evaluators.
automatic evaluation results of the human generation Table 4 shows the results of the evaluations. Two
evaltasks show the steep dificulty in writing figure captions uators gave low average scores to both model-generated
close to ground truth. captions and ground-truth captions. In addition,
eval9. Conclusion
uators only reached fair agreements on scoring
(0.230.36). Using the mention and OCR features (row #2),
gets the best human evaluation scores and this is in line In this paper, we study the challenges of the
scienwith the corresponding score in Table 2 where it also tific figure captioning task. Extending from the
preachieves the best CIDEr performance, indicating that hu- vious study [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], we reframe this task as a
knowledgeman evaluation is reliable despite the fair agreements. augmented image captioning task, that is, a model needs
The evaluation results indicate that the model-generated to use knowledge extracted across modalities to
generand ground-truth captions are not always informative ate captions. To this end, we released a new version of
to both evaluators, which reveals the need to improve the SciCap dataset: SciCap+ by augmenting figures with
caption writing quality and model performance. We ob- their mention-paragraphs and OCR tokens. We used
served that captions tend to be written without following M4C-Captioner model as the baseline model to utilize
specific rules, and this may contribute to lack of agree- knowledge across three modalities: mention-paragraphs,
ment. With low inter-rater agreements, we found how ifgures, and OCR tokens. The automatic evaluation
exinformative a figure caption is highly subjective and de- periments further reveal that using knowledge
signifpends on in-domain background knowledge evaluators icantly improves evaluation metric scores. Compared
have. with human-generated captions, we found models can
generate better captions than humans regarding the
automatic evaluation metrics. However, human evaluations
8. Related Work demonstrated that writing scientific figure captioning is
challenging even for humans, and the model-generated
ifgure captions, despite their reasonable automatic
evaluation quality, are still far from achieving a level
appropriate for humans. The release of the SciCap+ dataset is
to promote the further development of scientific figure
captioning. For future work, we are interested in how to
use multimodal pretraining strategies in this task.
          </p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Unlike natural image captioning, figure captioning has</title>
          <p>
            been scarcely studied in history. SciCap [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] is the most
recent work on scientific figure captioning, they released
a large-scale scientific figure captioning dataset that
includes figures from academic papers in arXiv dataset.
Before SciCap, FigCAP [16] [17] and FigureQA [18] are
two figure captioning datasets, but their figures are
synthesized. We decided to extend and study on SciCap
dataset, since its figures are from real-world scientific
papers. In this paper, we also have leveraged multimodal
knowledge using pre-trained models.
          </p>
          <p>Multimodal machine learning is to model knowledge
across various modalities. The closest multimodal task
to figure captioning is image captioning, a popular
architecture is encode-decoder, where the decoder learns
to generate captions conditioned on visual features
extracted from the encoder. Recent works on integrating
texts in natural images for visual question answering
and image captioning tasks are based on transformer
architecture augmented with a pointer network [5, 19].
The transformer enriches representations by
integrating knowledge from both text and visual modality. The
pointer network dynamically selects words from the fixed
dictionary or OCR tokens during generation.</p>
          <p>Using knowledge embedded in pre-trained models is
a common practice in solving multimodal tasks. In this
work, we used SciBert [8], a BERT model [20] that was
pre-trained in scientific papers, to obtain informative
representations for the texts extracted from computer
science papers. Since terms that exist in the figures may
be uncommon words, we also used FastText [21] to obtain
word embeddings with subword information. For visual
modality, we used Renst152 [22] and Faster R-CNN [23]
used in extract features from images and bounding boxes.
10. Acknowledgment</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>These research results were partly obtained from the</title>
          <p>commissioned research (No. 225) by National Institute
of Information and Communications Technology (NICT),
Japan, and partly obtained from the first author’s
internship research under NICT.
[4] C. Clark, S. Divvala, Pdfigures 2.0: Mining figures uation of summaries, in: Text summarization
from research papers, in: 2016 IEEE/ACM Joint branches out, 2004, pp. 74–81.</p>
          <p>Conference on Digital Libraries (JCDL), IEEE, 2016, [14] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider:
pp. 143–152. Consensus-based image description evaluation, in:
[5] R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative Proceedings of the IEEE conference on computer
answer prediction with pointer-augmented multi- vision and pattern recognition, 2015, pp. 4566–4575.
modal transformers for textvqa, in: Proceedings [15] P. Anderson, B. Fernando, M. Johnson, S. Gould,
of the IEEE Conference on Computer Vision and Spice: Semantic propositional image caption
evaluPattern Recognition, 2020. ation, in: European conference on computer vision,
[6] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, Springer, 2016, pp. 382–398.</p>
          <p>X. Chen, M. Shah, M. Rohrbach, D. Batra, [16] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, T. Yu,
D. Parikh, Mmf: A multimodal framework for vi- R. Rossi, R. Bunescu, Figure captioning with
reasion and language research, https://github.com/ soning and sequence-level training, arXiv preprint
facebookresearch/mmf, 2020. arXiv:1906.02850 (2019).
[7] T. Kudo, J. Richardson, SentencePiece: A sim- [17] C. Chen, R. Zhang, E. Koh, S. Kim, S. Cohen, R. Rossi,
ple and language independent subword tokenizer Figure captioning with relation maps for reasoning,
and detokenizer for neural text processing, in: in: Proceedings of the IEEE/CVF Winter
ConferProceedings of the 2018 Conference on Empirical ence on Applications of Computer Vision (WACV),
Methods in Natural Language Processing: System 2020.</p>
          <p>Demonstrations, Association for Computational [18] S. E. Kahou, V. Michalski, A. Atkinson, Á. Kádár,
Linguistics, Brussels, Belgium, 2018, pp. 66–71. URL: A. Trischler, Y. Bengio, Figureqa: An annotated
https://aclanthology.org/D18-2012. doi:10.18653/ ifgure dataset for visual reasoning, arXiv preprint
v1/D18-2012. arXiv:1710.07300 (2017).
[8] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained [19] O. Sidorov, R. Hu, M. Rohrbach, A. Singh, Textcaps:
language model for scientific text, in: Proceed- a dataset for image captioningwith reading
compreings of the 2019 Conference on Empirical Meth- hension, 2020.
ods in Natural Language Processing and the 9th [20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
International Joint Conference on Natural Lan- Pre-training of deep bidirectional transformers for
guage Processing (EMNLP-IJCNLP), Association language understanding, in: Proceedings of the
for Computational Linguistics, Hong Kong, China, 2019 Conference of the North American
Chap2019, pp. 3615–3620. URL: https://aclanthology.org/ ter of the Association for Computational
LinguisD19-1371. doi:10.18653/v1/D19-1371. tics: Human Language Technologies, Volume 1
[9] J. Almazán, A. Gordo, A. Fornés, E. Valveny, Word (Long and Short Papers), Association for
Comspotting and recognition with embedded attributes, putational Linguistics, Minneapolis, Minnesota,
IEEE transactions on pattern analysis and machine 2019, pp. 4171–4186. URL: https://aclanthology.org/
intelligence 36 (2014) 2552–2566. N19-1423. doi:10.18653/v1/N19-1423.
[10] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, En- [21] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov,
Enriching word vectors with subword information., riching word vectors with subword information,
TACL 5 (2017) 135–146. URL: http://dblp.uni-trier. Transactions of the Association for Computational
de/db/journals/tacl/tacl5.html#BojanowskiGJM17. Linguistics 5 (2017) 135–146.
[11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: [22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual
learna method for automatic evaluation of machine ing for image recognition, in: Proceedings of the
translation, in: Proceedings of the 40th Annual IEEE conference on computer vision and pattern
Meeting of the Association for Computational Lin- recognition, 2016, pp. 770–778.
guistics, Association for Computational Linguis- [23] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn:
tics, Philadelphia, Pennsylvania, USA, 2002, pp. Towards real-time object detection with region
311–318. URL: https://aclanthology.org/P02-1040. proposal networks, in: C. Cortes, N. Lawrence,
doi:10.3115/1073083.1073135. D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances
[12] S. Banerjee, A. Lavie, Meteor: An automatic met- in Neural Information Processing Systems,
ric for mt evaluation with improved correlation volume 28, Curran Associates, Inc., 2015. URL:
with human judgments, in: Proceedings of the https://proceedings.neurips.cc/paper/2015/file/
acl workshop on intrinsic and extrinsic evaluation 14bfa6bb14875e45bba028a21ed38046-Paper.pdf.
measures for machine translation and/or
summarization, 2005, pp. 65–72.
[13] C.-Y. Lin, Rouge: A package for automatic
eval</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          , T.-H. Huang,
          <article-title>SciCap: Generating captions for scientific figures</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>3258</fpage>
          -
          <lpage>3264</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .findings-emnlp.
          <volume>277</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . findings- emnlp.277.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>Textcaps:</surname>
          </string-name>
          <article-title>a dataset for image captioning with reading comprehension</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>742</fpage>
          -
          <lpage>758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xin</surname>
          </string-name>
          , H. Wu,
          <string-name>
            <surname>Cheetah:</surname>
          </string-name>
          <article-title>An ultra-fast, approximation-free, and privacypreserved neural network framework based on joint obscure linear and nonlinear computations</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>05184</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>