Scientific Chart Summarization: Datasets and Improved
Text Modeling
Hao Tan1,a,b, Chen-Tse Tsai2,a, Yujie He2,a and Mohit Bansal1

1 University of North Carolina at Chapel Hill, USA
2 Bloomberg, USA


Abstract

Chart figures usually convey the key message in a multimodal document. Understanding charts automatically and making charts more accessible become indispensable in the information era. In this paper, we study the chart summarization problem, in which the goal is to generate sentences that describe the salient information in a chart image. To obtain training examples, we leverage image-caption pairs in multiple scientific areas. We create a dataset of single-chart images from research papers in PubMed Central (PMC) and arXiv. Most recent vision-and-language works focus on natural images, and several challenges posed by structured images such as charts remain under-explored. One key property of charts is that the text components (e.g., legends and axis names) carry important information. In our proposed model, we not only use a standard visual encoder but also a text encoder to encode a chart image. The visual and textual representations are connected to a large pre-trained language decoder via pre-embedding and cross-attention approaches, respectively. Experimental results show that the proposed model is significantly better than an image captioning baseline.

Keywords
Chart Summarization, Multimodal Learning, Document Understanding, Image Captioning, Natural Language Processing



1. Introduction

Information graphics, such as line charts and bar charts, are essential and common components of a document. Charts are usually used to visually summarize important information that a document intends to convey. Moreover, as shown in the study of Carberry et al. [1], information graphics in magazines and newspapers often convey messages that are not repeated in the text. Therefore, summarizing the primary message of a chart is an important step towards understanding a multimodal document. Potential applications of chart summarization include indexing information content for a search engine, making charts accessible to individuals with visual impairments, and simplifying the dissemination of technical visual information to a layperson.

We have seen recent success in image captioning, which can be viewed as generating summaries for an image. However, this research has mostly focused on natural images, while other types of images (e.g., the structured images shown in Fig. 2) are under-explored. On the other hand, abstractive text summarization models have also been greatly improved by the development of neural network models. However, these models only look at the text component of a document. In this work, we focus on the less-studied yet important task of ‘chart summarization’, where we want to generate a salient summary for structural charts. First, to obtain a large quantity of summaries of chart images, we leverage captions in scientific articles. Unlike magazines or newspapers, in which image captions can be less descriptive, captions in scientific papers tend to be more detailed and verbose. We build a chart summarization dataset from papers in arXiv and PubMed Central (PMC) by assuming that captions are salient summaries of chart figures. Image captions in these data sources are written by the corresponding papers’ authors, and hence tend to be natural in their language. Since these articles also contain figures other than charts, we create crowdsourcing tasks to select single-chart images and to collect these charts’ detailed types (e.g., line chart, bar chart, etc.).

Different from traditional captioning of natural images, there are two main challenges from the language perspective when the target images are charts: (1) besides visual content, charts usually also contain text (e.g., legends and axis titles) that carries significant information about the components of the chart; (2) charts tend to be used in specific domains, so the language generation model may suffer from rare-word issues.

To address these two challenges, we first use an optical character recognition (OCR) model to detect the text boxes in the charts. An OCR embedding layer is proposed to encode these extracted texts, together with their position information, into vectors, and these vector representations are treated as another input to the language decoder through a cross-attention mechanism. Second, to endow the decoder with domain-specific knowledge, we use a large pre-trained language decoder instead of training one from scratch.

a Equal Contribution.
b Work done during an internship at Bloomberg.
The second workshop on Scientific Document Understanding at AAAI 2022.
© 2022 Bloomberg Finance L.P. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Figure 1: Pipeline of dataset creation. We first sample scientific papers from arXiv and PubMed Central, then extract image-caption pairs by parsing the source LaTeX or XML files. Finally, crowdsourcing is applied to annotate whether an image contains a single chart and, if so, the corresponding chart type.


The chart information is connected to this pre-trained language decoder via two approaches: pre-embedding and cross-attention. We empirically find that using pre-embedding for the visual content and cross-attention for the OCR representations gives the best results.

We apply our models to our collected datasets from two scientific domains. We conduct both metric-based automatic evaluation and human-annotated qualitative evaluation. Experimental results show that our model, with the integration of OCR and a pre-trained language model, significantly outperforms the baseline image captioning model. We also present ablation studies that illustrate the effectiveness of our proposed methods.


2. Related Work

Most work on understanding chart images involves chart type classification. Savva et al. [2] classify given chart images into 10 chart categories using an SVM classifier with visual bag-of-words and text-region features. With a similar model, Ray Choudhury and Giles [3] proposed a binary classifier to determine whether an image is a line chart. Siegel et al. [4] experimented with CNN-based models for classifying images extracted from scholarly articles. In order to identify chart figures for training our summarization model, we build a binary classifier to identify common charts (e.g., line charts, bar charts, scatter plots, etc.).

There is a line of work on interpreting text components in chart images [5, 6, 7, 8, 9, 4, 10, 11, 12, 13]. One of the applications here is to recover visual encodings for the purposes of indexing and search. For example, Poco and Heer [14] proposed an end-to-end text analysis pipeline that identifies the text elements in a chart image, determines their bounding boxes, and classifies their roles in the chart (e.g., x-axis label, x-axis title, legend title, etc.). They also proposed a CNN model that classifies the type of graphical mark (e.g., bars or lines). We simply use a general-purpose OCR tool for recognizing text in chart images and focus more on the text generation model. These better text analysis models could potentially improve our model's performance, which we leave for future investigation. Kahou et al. [10] introduce FigureQA, a visual reasoning corpus of question-answer pairs over synthetic chart images. Instead of answering questions on synthetic charts, we aim at directly summarizing real chart images.

There are some earlier works on chart summarization. Elzer et al. [15] proposed SIGHT, a system that summarizes bar charts for visually impaired users. The system identifies one of twelve message categories that can be conveyed by a bar chart and produces a logical form. This logical representation is then translated into natural language via templates. Demir et al. [16] built on top of SIGHT. Their system first identifies, by rules, an additional set of propositions that may reflect some information in a bar chart. These propositions are then organized and structured by a bottom-up planner. Finally, a surface realizer is applied to produce natural language summaries.

Greenbacker et al. [17] built a corpus of human-written English summaries of line graphs. They selected 23 line graphs and asked annotators to summarize the most important information in each graph. As this process is difficult to scale up, we instead take the captions of chart images in scientific papers to represent the summaries. Greenbacker et al. [18] further used this corpus and proposed an abstractive summarization system for line charts. The system uses a Bayesian network to classify the intents of line segments, and then rules are applied to identify additional important informational propositions conveyed by the line graph. The sets of intents and propositions are pre-defined from a study of the corpus. They left the final step of generating natural language summaries from the propositions as future work; therefore, no evaluation results were reported.
A common challenge of these earlier works is that they are limited to a fixed set of propositions and need to convert the selected propositions to natural language. Instead of using a pipeline with hand-crafted intents and propositions, we propose to leverage an end-to-end neural network, which has been shown to be powerful in generating coherent and grammatical sentences in the context of image captioning and abstractive text summarization.

Another thread of related work is (natural) image captioning, which tries to generate descriptions for natural images. Vinyals et al. [19] first illustrate the end-to-end encoder-decoder architecture, and Xu et al. [20] extend it with attention modules. Ranzato et al. [21] use reinforcement learning to eliminate exposure bias, but this requires a large amount of data to reduce the high variance. Anderson et al. [22] take object-level information into account to enable fine-grained visual understanding. However, we empirically found that detection features for natural images do not work well for charts (structural images). Previous vision-and-language pre-training methods, e.g., VLP [23] and OSCAR [24], use pre-trained vision-and-language models to improve image captioning but require a large in-domain corpus and heavy pre-training.


3. Dataset Creation

We create our datasets based on image-caption pairs that appear in public scientific papers. Different from the figures in magazines or newspapers, where the captions can be less descriptive, figure captions in scientific articles tend to convey the key message of the figures. The assumption here is that captions written by the paper authors represent the most salient information in the figures and can therefore serve as summaries of the corresponding figures. The overview of our dataset creation pipeline is shown in Figure 1. We consider two data sources: arXiv1 and PMC.2 ArXiv is a free distribution service and an open-access archive for scholarly articles in fields such as physics, computer science, and mathematics. PMC is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine. We take articles in the Open Access Subset.3 These two data sources are chosen because they both provide structural data in addition to the PDF files. That is, we can obtain image-caption pairs by parsing the LaTeX source files provided by arXiv or the XML files provided by PMC. We write our own LaTeX parser for the arXiv data, and use a public PubMed parser4 for parsing the XML information.

1 https://arxiv.org/
2 https://www.ncbi.nlm.nih.gov/pmc/
3 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
4 https://github.com/titipata/pubmed_parser

Although we can extract lots of image-caption pairs, most of the figures in these papers are not charts. Hence, to be able to train and evaluate the proposed chart summarization model, we need to identify which figures are charts. In this work, we focus on the 5 most common chart types: line, bar, scatter, pie, and area charts (Figure 2). Moreover, we further focus on the simplest case, where an image contains only a single chart. Figures with multiple charts or with any non-chart component are considered negative images in this work. In the following sections, we describe how we obtain the single-chart and chart-type annotations.

3.1. PubMed Central Data

For the PMC data, we create a crowdsourcing task to annotate whether a given image contains a single chart. We randomly sample 50,000 images from the papers published from 2011 to 2019. For each image, we ask annotators whether it is a single-chart figure. If the answer is yes, the annotators are required to select a chart type from line, bar, scatter, pie, area, or other chart. Since this task is fairly simple, we ask two annotators to label each image in the first round. In most cases, the two annotators agree on the labels. More specifically, the Fleiss' kappa scores for the "whether it's a single chart" and "chart type" tasks are 0.56 and 0.73 respectively, which shows significant agreement.5

5 https://en.wikipedia.org/wiki/Fleiss%27_kappa

If there is a disagreement on either the single-chart label or the chart type, we further ask another three annotators to perform a second round of annotation on these images. Finally, majority vote is applied to resolve conflicts among all five annotators. We note that single charts with the "other" chart type are considered negative images in our experiments.

Among the 50,000 images, we obtain 7,397 positive images (single charts), including 3,681 line charts, 3,088 bar charts, 478 scatter charts, 125 pie charts, and 25 area charts. The positive ratio is about 13%. This low ratio is because most of the figures in scientific articles are non-chart figures (e.g., model architecture diagrams). In this work, we only use the chart types for analyzing model performance; that is, chart type information is not included explicitly in model training.

3.2. ArXiv Data

We also build another dataset from the arXiv data. We take papers in the Computer Vision, Computation and Language, Machine Learning, Artificial Intelligence, and Neural and Evolutionary Computing fields from 2008 to 2020. Because of copyright issues, we cannot put arXiv images on a public crowdsourcing platform. Instead, the authors went through and annotated 2,000 randomly sampled figures with the same crowdsourcing interface that we use for annotating the PMC data. This results in 370 single-chart images.
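
As a concrete illustration of the caption-extraction step in Figure 1, the snippet below shows the kind of lightweight LaTeX parsing we rely on for the arXiv data. It is a minimal sketch rather than our full parser: the regular expressions, the single-figure-environment assumption, and the helper name extract_figure_caption_pairs are simplifications for illustration only.

```python
import re
from pathlib import Path

# Minimal sketch: pull (image file, caption) pairs out of LaTeX source by
# scanning figure environments for \includegraphics and \caption commands.
FIGURE_ENV = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
GRAPHICS = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")
CAPTION = re.compile(r"\\caption\{((?:[^{}]|\{[^{}]*\})*)\}", re.DOTALL)

def extract_figure_caption_pairs(tex_path: str):
    """Return a list of (image_path, caption) pairs from one .tex file."""
    source = Path(tex_path).read_text(errors="ignore")
    pairs = []
    for env in FIGURE_ENV.finditer(source):
        body = env.group(1)
        images = GRAPHICS.findall(body)
        caption = CAPTION.search(body)
        # Keep only single-image figures with a caption; multi-panel figures
        # are handled later by the crowdsourced single-chart annotation.
        if len(images) == 1 and caption:
            pairs.append((images[0], " ".join(caption.group(1).split())))
    return pairs

if __name__ == "__main__":
    for image, caption in extract_figure_caption_pairs("paper.tex"):
        print(image, "->", caption[:80])
```
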
Figure 2: Example charts with the corresponding chart types from the PubMed Central dataset. The dataset we build contains the 5 most common chart types.


4. Methodology

In this section, we introduce the proposed models and training strategies for the chart summarization task. In this task, the model needs to generate a sequence of words $\{w_i\}$ describing the contents of a chart $x$. We start by introducing the basic captioning model. To enhance in-image text understanding and to endow the model with external knowledge, we incorporate an OCR encoder and a pre-trained language decoder. Lastly, we propose a simple semi-supervised learning and domain adaptation approach using a chart classifier.

4.1. Base Model

Our base model is adopted from the attentive encoder-decoder model for image captioning proposed by Xu et al. [20]. A ResNet-101 [25] visual feature extractor encodes the chart into a $7 \times 7 \times 2048$ feature map, where each vector in the feature map corresponds to a grid region of the image. The feature map is then flattened into a sequence of $49 \times 2048$ features $\{f_i\}$:

$$\{f_i\}_{i=1}^{49} = \mathrm{ResNet}(x)$$

At each decoding step $t$, the LSTM [26] language decoder outputs the hidden state $h_t$ and cell state $c_t$ by reading the previous word $w_{t-1}$ and the states $(h_{t-1}, c_{t-1})$. The attention module (denoted $\mathrm{Att}_{h \to f}$) then attends to the feature sequence $\{f_i\}$ with the hidden output $h_t$ as the query. The context $\hat{f}_t$ and the hidden vector $h_t$ are merged into an attentive hidden vector $\hat{h}_t$ with a fully-connected layer:

$$\tilde{w}_{t-1} = \mathrm{Embedding}(w_{t-1})$$
$$h_t, c_t = \mathrm{LSTM}(\tilde{w}_{t-1}, h_{t-1}, c_{t-1})$$
$$\hat{f}_t = \mathrm{Att}_{h \to f}(h_t, \{f_i\})$$
$$\hat{h}_t = \tanh\!\left(W_1 [\hat{f}_t ; h_t] + b_1\right)$$

The probability of generating the $k$-th token at time step $t$ is the softmax over a linear transformation of the attentive hidden state $\hat{h}_t$. The loss $\mathcal{L}_t$ is the negative log-likelihood of the ground-truth token $w_t^*$:

$$p_t(w_{t,k}) = \mathrm{softmax}_k\!\left(W_w \hat{h}_t + b_w\right)$$
$$\mathcal{L}_t = -\log p_t(w_t^*)$$

4.2. Text Understanding

Different from natural image captioning, the summarization of charts heavily relies on understanding the text inside the images. However, the ResNet visual encoder (Section 4.1) is insensitive to the text in the images (as also shown in Singh et al. [11]), so we need to build a pipeline to extract the text information from the images. Specifically, we first use Tesseract [27] to extract a sequence of $m$ texts $\mathit{text}_j$ with their positions $\mathit{pos}_j$ from the image $x$:

$$\{(\mathit{text}_j, \mathit{pos}_j)\}_{j=1}^{m} = \mathrm{OCR}(x) \quad (1)$$

Since the characters in charts are usually in a small font and sometimes blurred with the chart content, the copy mechanism [28, 29], which directly brings the text into the final summary, does not provide good results. We instead use a shallow text embedding layer to project the OCR text to dense vector representations, which denoises the OCR detection results. We also encode the position of each OCR token along with its text representation, since the spatial information indicates the role of the text (e.g., in the legend, in the title, or inside the chart):

$$g_j = \mathrm{Emb}_{\mathrm{text}}(\mathit{text}_j) + W_{\mathrm{pos}} \, \mathit{pos}_j \quad (2)$$

These OCR representations are treated as another view of the chart, and the language decoder simultaneously attends to the OCR information $\{g_j\}$ and the visual image features $\{f_i\}$. The final hidden output $\tilde{h}_t$ is calculated from the concatenation of the visually attended vector $\tilde{f}$, the OCR-attended vector $\tilde{g}$, and the hidden state $h_t$:

$$\tilde{f} = \mathrm{Att}_{h \to f}(h_t, \{f_i\}) \quad (3)$$
$$\tilde{g} = \mathrm{Att}_{h \to g}(h_t, \{g_j\}) \quad (4)$$
$$\tilde{h}_t = \tanh\!\left(W_2 [\tilde{f} ; \tilde{g} ; h_t] + b_2\right) \quad (5)$$

We then replace the original attentive hidden state $\hat{h}_t$ (Section 4.1) with this OCR-enhanced hidden output $\tilde{h}_t$ in succeeding decoding steps.
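
To make the notation concrete, the sketch below implements one decoding step of the base model extended with the OCR branch (the equations of Section 4.1 and Eqs. (1)–(5)). It is a simplified reading of the equations, assuming additive (Bahdanau-style) attention and hypothetical dimension choices; it is not the exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Att_{h->x}: score each feature with the decoder state as the query."""
    def __init__(self, query_dim, feat_dim, hidden_dim=512):
        super().__init__()
        self.q = nn.Linear(query_dim, hidden_dim)
        self.k = nn.Linear(feat_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, h_t, feats):                # h_t: (B, Dq), feats: (B, N, Df)
        scores = self.v(torch.tanh(self.q(h_t).unsqueeze(1) + self.k(feats)))
        alpha = F.softmax(scores, dim=1)           # (B, N, 1) attention weights
        return (alpha * feats).sum(dim=1)          # attended context: (B, Df)

class OCRAwareDecoderStep(nn.Module):
    """One step of the LSTM decoder attending to both {f_i} and {g_j}."""
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512,
                 img_dim=2048, ocr_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # w~_{t-1}
        self.lstm = nn.LSTMCell(emb_dim, hid_dim)                  # (h_t, c_t)
        self.att_img = AdditiveAttention(hid_dim, img_dim)         # Eq. (3)
        self.att_ocr = AdditiveAttention(hid_dim, ocr_dim)         # Eq. (4)
        self.fuse = nn.Linear(img_dim + ocr_dim + hid_dim, hid_dim)  # W_2, b_2
        self.out = nn.Linear(hid_dim, vocab_size)                  # W_w, b_w

    def forward(self, w_prev, state, img_feats, ocr_feats):
        h_t, c_t = self.lstm(self.embed(w_prev), state)
        f_tilde = self.att_img(h_t, img_feats)                     # Eq. (3)
        g_tilde = self.att_ocr(h_t, ocr_feats)                     # Eq. (4)
        h_tilde = torch.tanh(self.fuse(
            torch.cat([f_tilde, g_tilde, h_t], dim=-1)))           # Eq. (5)
        logits = self.out(h_tilde)                                 # softmax -> p_t
        return logits, (h_t, c_t)

# Usage: img_feats are the flattened ResNet map (B, 49, 2048); ocr_feats come
# from the OCR embedding layer of Eq. (2) (B, m, 300).
step = OCRAwareDecoderStep(vocab_size=30000)
logits, state = step(torch.zeros(2, dtype=torch.long),
                     (torch.zeros(2, 512), torch.zeros(2, 512)),
                     torch.randn(2, 49, 2048), torch.randn(2, 20, 300))
loss = F.cross_entropy(logits, torch.zeros(2, dtype=torch.long))  # -log p_t(w*)
```
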
Figure 3: Illustration of the proposed chart summarization model. We have two branches of image encoding: (1) the visual branch, via the ResNet and the fixed-length transformer, and (2) the text branch, via the OCR system and the OCR embedding layer. The outputs of these two branches are then fused into the pre-trained language decoder by pre-embedding (concatenation) and a cross-attention layer, respectively. The grey boxes are neural networks.


4.3. Pre-trained Language Decoder

When summarizing charts in news or scientific papers, a faithful description of the chart contents also relies on external knowledge, and hence a pre-trained language decoder might help the generation. Figure 3 illustrates our model, which integrates the pre-trained language decoder GPT-2 [30].6 As described in the previous section, we have two image encoders (i.e., the ResNet encoder and the OCR text encoder) to process the image content and the in-image text, respectively. The ResNet encoder maps the chart into a square feature map (the purple vector blocks in Figure 3), where each vector corresponds to a part of the image content. We view this feature map as a sequence of vectors (as in Section 4.1) in the following procedures. The OCR system (Eq. 1) maps the chart into a sequence of recognized words and their positions on the chart. The OCR embedding layer (Eq. 2) adds the word embedding and the position encoding into one vector for each OCR entry (the yellow vectors in Figure 3).

6 The method could also be applied to other pre-trained language decoders such as XLNet [31], T5 [32], and BART [33].

In order to connect this visual and textual information from the image to the language decoder, we adopt two approaches: appending pre-embeddings and adding cross-attention layers. The pre-embedding approach concatenates the sequence of visual vectors before the word embeddings, so the language decoder takes this concatenation as input (e.g., the concatenation of red blocks and blue blocks in Figure 3). The cross-attention approach adds cross-attention layers [34] inside the language decoder to fuse the visual information. The cross-attention layers contain residual shortcut connections, so the decoder still benefits from the pre-trained weights despite these additional layers.

As shown in Figure 3, we use the pre-embedding approach for the features from the visual image content (i.e., from the ResNet encoder) and the cross-attention layers for the OCR texts. The idea behind this specific design is that the generation is led by the image content and uses the OCR information to generate concrete words. We empirically find that this is the best combination for fusing information into the language decoder, and we show the comparison in Section 6.2. In detail, the length of the ResNet feature sequence is 49, and the order of the features is not aligned with the positional encoding of the pre-trained language decoder. We thus do not directly prepend it to the word embeddings but use a fixed-length transformer to map it to a sequence of 10 vectors (the red blocks in Figure 3; we only draw 3 vectors for simplicity). The fixed-length transformer is built from transformer decoder layers [34] with only positional embeddings (without word embeddings). We use only 1 layer in our experiments.
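
The wiring described above can be sketched with Hugging Face's GPT-2 implementation, which supports cross-attention layers when add_cross_attention=True is set in the config (the paper notes this support). The snippet below is a rough sketch under our reading of Section 4.3, with hypothetical dimensions, a single fixed-length transformer layer, and no training loop; it is not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class ChartToGPT2(nn.Module):
    def __init__(self, ocr_vocab=30000, n_query=10, d_model=768):
        super().__init__()
        cfg = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=cfg)
        # Fixed-length transformer: n_query learned position queries attend to
        # the 49 projected ResNet vectors and yield 10 "pre-embedding" vectors.
        self.img_proj = nn.Linear(2048, d_model)
        self.queries = nn.Parameter(torch.randn(n_query, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.fixed_len = nn.TransformerDecoder(layer, num_layers=1)
        # OCR embedding layer (Eq. 2): token embedding + projected (x, y) position.
        self.ocr_emb = nn.Embedding(ocr_vocab, d_model)
        self.pos_proj = nn.Linear(2, d_model)

    def forward(self, img_feats, ocr_ids, ocr_pos, input_ids, labels=None):
        # img_feats: (B, 49, 2048); ocr_ids: (B, m); ocr_pos: (B, m, 2)
        B = img_feats.size(0)
        mem = self.img_proj(img_feats)
        vis = self.fixed_len(self.queries.expand(B, -1, -1), mem)   # (B, 10, d)
        ocr = self.ocr_emb(ocr_ids) + self.pos_proj(ocr_pos)        # (B, m, d)
        # Pre-embedding: prepend the 10 visual vectors to the word embeddings.
        tok = self.decoder.transformer.wte(input_ids)
        inputs_embeds = torch.cat([vis, tok], dim=1)
        if labels is not None:
            # Ignore the loss on the 10 visual positions.
            pad = torch.full((B, vis.size(1)), -100, dtype=labels.dtype,
                             device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        # Cross-attention: the OCR vectors enter as encoder_hidden_states.
        return self.decoder(inputs_embeds=inputs_embeds,
                            encoder_hidden_states=ocr, labels=labels)
```
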
                      PMC (Supervised)                   PMC (Semi-Supervised)                  arXiv (Domain Adaptation)
             BLEU   ROUGE-L    METEOR    CIDEr    BLEU   ROUGE-L     METEOR      CIDEr   BLEU     ROUGE-L      METEOR       CIDEr
Base Model   1.66     11.35      2.77     2.76    2.09     11.05       2.91       4.49   3.55       14.10        3.79        8.99
+ OCR        1.97     11.77      3.09     6.00    2.53     11.95       3.50       7.98   4.78       15.88        4.68       15.88
+ GPT-2      3.19     11.66      3.68     1.57    4.47     12.46       4.32      10.30   5.89       14.32        4.92       32.34

Table 1
Results on the PubMed Central (PMC) and arXiv datasets. Supervised: training images are human-labeled single chart images.
Semi-Supervised: training images also include the positive images from the proposed chart classifier. Domain Adaptation:
the chart classifier trained on the PMC domain is applied on arXiv domain to obtain training data for the summarization
model. The best results are marked in bold.



4.4. Semi-Supervised Learning and Domain Adaptation

Although we can extract abundant image-caption pairs, most figures in scientific articles do not contain a chart, as discussed in Section 3. If we want to reserve enough human-annotated examples for metric-based evaluation, very little data is left for training, especially for the arXiv domain, in which we only have hundreds of single-chart images. Therefore, we leverage semi-supervised learning techniques to take advantage of the large amount of unannotated data, and use domain adaptation to transfer to other datasets. Both of these methods rely on a chart classifier, which we introduce first.

Chart Classifier. The key component in obtaining more training examples is a classifier that can identify single-chart images. We take ResNet [25] as the visual backbone and use a binary linear classifier on top of the mean-pooled features. Instead of freezing the backbone model as in previous work [20], we fine-tune the classifier with a small learning rate of $10^{-4}$. We find that this standard classifier reaches good results (see the Appendix for details).

Semi-Supervised Learning. In the semi-supervised learning setup, we have labeled data (Section 3) and want to improve performance using the unlabeled data. The unlabeled data contains both charts and non-chart images (e.g., model figures in scientific publications and natural images in news). Including these non-chart images in the training data would introduce noise and increase training time. To provide clean data for semi-supervised learning, we filter the unlabeled data with our chart classifier and train the summarization model on the filtered data. In this way, we increase both the amount of data and the coverage of topics.

Domain Adaptation. Different from semi-supervised learning, domain adaptation focuses on transferring the labeled dataset to another domain. Naïve transfer without training on the target domain would under-fit the target distribution, and we empirically show its ineffectiveness in the Appendix. To solve this issue, we use an approach similar to the semi-supervised learning one: we train the proposed summarization model on the dataset created by the chart classifier. More specifically, since we have much less labeled data in the arXiv domain, we treat it as the target domain, whereas PMC is the source domain. We train the chart classifier on the PMC data and apply it to the images from arXiv papers to obtain a large number of single-chart images.


5. Results

In this section, we evaluate our proposed methods on our collected datasets from two domains: PMC and arXiv. We start by describing the experimental setup and then show results with both automatic metric-based evaluation and human evaluation.

5.1. Experimental Setup

Data Setup. The supervised learning setup is conducted on our annotated PMC dataset. We randomly sample 1,000 charts as the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1.

In order to increase the number of training examples, we apply the proposed semi-supervised learning technique (Section 4.4). The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this binary classifier. After the model converges on the training set, we calibrate the classifier to optimize the recall while keeping a precision over 99% on the validation set. Since we have lots of images, we can afford a lower recall in exchange for high-quality positive examples. We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images, which serve as additional training examples for the summarization model.

For domain adaptation, we take charts and captions from arXiv as the target domain. As described in Section 3, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1.
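
A minimal sketch of the single-chart classifier and the precision-calibrated filtering described in Sections 4.4 and 5.1 is given below. The ResNet-101 backbone, the binary head, and the fine-tuning learning rate follow the text; the threshold-search helper, the data loaders, and the function names are our own simplifications.

```python
import torch
import torch.nn as nn
import torchvision

def build_chart_classifier():
    """ResNet-101 backbone with a binary linear head on the pooled features."""
    model = torchvision.models.resnet101(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 2)   # chart vs. non-chart
    return model

def fine_tune(model, train_loader, epochs=5, lr=1e-4, device="cuda"):
    """Fine-tune the whole backbone with the small learning rate from Sec. 4.4."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            opt.zero_grad()
            loss = ce(model(images.to(device)), labels.to(device))
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def calibrate_threshold(model, val_loader, target_precision=0.99, device="cuda"):
    """Pick the lowest score threshold whose validation precision stays >= 99%."""
    model.eval()
    scores, labels = [], []
    for images, y in val_loader:
        prob = torch.softmax(model(images.to(device)), dim=1)[:, 1]
        scores.append(prob.cpu()); labels.append(y)
    scores, labels = torch.cat(scores), torch.cat(labels)
    best = 1.0
    for thr in torch.linspace(0.5, 0.999, 100):
        pred = scores >= thr
        if pred.any():
            precision = (labels[pred] == 1).float().mean().item()
            if precision >= target_precision:
                best = min(best, thr.item())
    return best

@torch.no_grad()
def filter_unlabeled(model, unlabeled_loader, threshold, device="cuda"):
    """Keep only images the calibrated classifier marks as single charts."""
    model.eval()
    kept = []
    for images, paths in unlabeled_loader:
        prob = torch.softmax(model(images.to(device)), dim=1)[:, 1]
        kept.extend(p for p, s in zip(paths, prob.cpu()) if s >= threshold)
    return kept
```
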
Model Setup. For the base model, we use a ResNet-101 model from the Torchvision [35] library.7 We resize the image to 224 × 224, and the backbone model maps it to a 7 × 7 × 2048 feature map. We sort the OCR-extracted texts by their confidence and only keep the top 20 texts for post-processing. Since we want the image positions to correspond to the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 × 224. For the pre-trained GPT-2 [30] model, we download the small GPT-2 model from Hugging Face's Transformers [36]. The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. More implementation and hyper-parameter details can be found in the Appendix.

7 https://pytorch.org/docs/stable/torchvision/models.html

5.2. Metric-based Evaluation

In order to conduct efficient evaluation, we use automatic language metrics to evaluate our model. We report BLEU [37], ROUGE-L [38], METEOR [39], and CIDEr [40], as in previous image captioning papers. As shown in Table 1, we compare our proposed models (Sections 4.2 and 4.3) with the baseline captioning model (Section 4.1) on both the PMC and arXiv datasets. The model with the OCR text encoder is strictly better than the baseline captioning model on every metric, which indicates that in-chart text understanding is very important for generating good summaries of scientific charts. The integration of the pre-trained language model (GPT-2) further enhances the performance over the OCR encoder results. The pre-trained decoder shows more improvement in the semi-supervised setup, since the model needs enough data to learn the weights of the fixed-length transformer and the cross-attention modules, which bridge the vision encoder and the language decoder.

Note that the CIDEr score of the +GPT-2 model is lower than that of the +OCR model on the PMC dataset under the supervised setup. We find that this is due to the size of the data. The smaller size of the PMC data makes the learned model have a stronger bias towards the original GPT-2 generation. Namely, although the model generates more fluent sentences (reflected in the high BLEU score), it is biased towards the GPT-2 prior by relying mostly on common words. This bias is penalized by CIDEr's TF-IDF weighting of n-grams, which down-weights common words. However, under the semi-supervised setting, the CIDEr score is higher with GPT-2 because of the adequate amount of data. This also demonstrates the usefulness of the proposed semi-supervised approach.

          Baseline Better   Final Model Better   Equally Good   Equally Bad
PMC             20                  70                 3              7
arXiv           37                  50                 2             11

Table 2
Human study on the results, with 100 pairwise comparisons per dataset.

                 BLEU   ROUGE-L   METEOR   CIDEr
All              4.47     12.46     4.32   10.30
Line Chart       4.44     12.70     4.28   10.18
Bar Chart        4.77     12.30     4.71    7.14
Scatter Chart    5.96     16.63     5.39   40.78

Table 3
Results for different chart types.

5.3. Human Evaluation

In order to obtain a faithful evaluation, we conduct a human evaluation on 100 randomly sampled examples for each of PMC and arXiv. The human evaluation is conducted by the authors and their colleagues (4 in total), since this task requires certain expert knowledge. We use both the baseline captioning model and our final model (with the OCR encoder and GPT-2 decoder)8 to generate two summaries. Each image, with the generated summaries from the two models, is annotated by all 4 annotators. We randomly shuffle the order of the two summaries and only show A/B labels to the human annotators. The annotators are asked to choose one of four options: "Both Good", "Both Bad", "A wins", and "B wins". As shown in Table 2, our proposed model significantly outperforms the baseline model on both datasets. Moreover, we find that our annotators have high agreement on which generated sentence is better, since this scientific summarization is mostly about facts and salience.

8 The PMC model is the one from the semi-supervised setup.
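
For reference, the automatic metrics reported in Section 5.2 can be computed with the standard COCO caption evaluation toolkit. The snippet below is a sketch of such an evaluation; the paper does not state which implementation was used, so the pycocoevalcap package and the input format here are our assumption.

```python
# Assumes: pip install pycocoevalcap  (the standard COCO caption metrics toolkit)
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def evaluate(references, hypotheses):
    """references/hypotheses: dict image_id -> list of caption strings."""
    scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
               (Rouge(), "ROUGE-L"),
               (Meteor(), "METEOR"),
               (Cider(), "CIDEr")]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(references, hypotheses)
        if isinstance(name, list):        # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results

# Example: one chart, its ground-truth caption, and a generated summary.
refs = {"chart_0001": ["Fluorescence emission spectrum recorded from the sample."]}
hyps = {"chart_0001": ["fluorescence spectrum of the sample ."]}
print(evaluate(refs, hyps))
```
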
6. Analysis

In this section, we provide a fine-grained analysis to illustrate the effectiveness of each component in the proposed pipeline. We first present the results for different chart types and the cross-domain evaluation in Section 6.1. In Section 6.2, we empirically show the advantage of our pre-embedding and cross-attention combination.

6.1. Different Chart Categories

During our data collection, we also let the annotators select the type of each chart (Figure 2). In this paper, we aim for a general chart summarization model that does not rely on the details of each chart type. Here we analyze the performance of our final model, trained on PMC (Semi-Supervised), on each chart category. In Table 3, we show the results for the three most common chart types (i.e., "Line", "Bar", "Scatter"), which have a sufficient amount of data (513 for Line, 400 for Bar, and 57 for Scatter) to support automatic metric-based evaluation. Although the line charts contribute the most to the training and test data, their BLEU score is the lowest compared to the results of bar charts and scatter charts. The reason might be that the image features produced by convolutional neural networks (CNNs) are insensitive to properties of curved lines such as trends and crossings. At the same time, the CNN can capture the local intensity of points and thus yields higher results for scatter charts. Based on this observation, we think that using a visual encoder specifically designed for understanding the curved lines in charts might be a promising future direction.

Pre-Embed   Cross-Att   BLEU   ROUGE-L   METEOR   CIDEr
None        None        1.91     10.59     3.01    0.52
Concat      None        2.88     11.92     3.79    4.78
None        Concat      3.64     12.07     3.69    2.91
Img         OCR         4.47     12.46     4.32   10.30
OCR         Img         4.46     12.12     4.08   11.18
Concat      Concat      3.61     12.18     3.76    2.79

Table 4
Comparison of different approaches to connecting the image content and the language decoder.

6.2. Pre-Embeddings and Cross-Attention Layers

In Section 4.3, we discussed two ways to connect the visual information to the language decoder: the pre-embedding approach and the additional cross-attention layers. In Table 4, we show the results of different combinations on the PMC (semi-supervised) dataset. "Img" and "OCR" indicate using the image output and the OCR representations, respectively, as the input to the pre-embedding approach or the cross-attention layers. "None" means that we do not use that input and thus exclude the corresponding parameters. "Concat" means that we concatenate the image and OCR representations together and use the concatenation as the input. We can see that our approach (Img for Pre-Embed and OCR for Cross-Att) is comparable to its reverse (OCR for Pre-Embed and Img for Cross-Att) and is much better than the other alternatives.

6.3. Chart Classification Performance

In both the semi-supervised learning and the domain adaptation setups, we use a classifier to identify single-chart images from the large number of automatically extracted image-caption pairs. Since the images filtered by the classifier are further used for data augmentation, we take the F1 score as the main metric to balance precision and recall. We start with frozen ResNet-101 [25] features and an additional linear classifier. This setup achieves a 90% F1 score. After fine-tuning the backbone model on our data, the model achieves an F1 score of 94.9%. We also tried adding other neural modules (e.g., attentive modules and detection branches) and enhanced visual backbones, but we do not observe a significant improvement on the test set.

When we use this classifier in the semi-supervised and domain adaptation setups, we calibrate the classification threshold to maintain a precision over 99%, since we have lots of unannotated images. At this precision level, we achieve a recall of 59.8% and a precision of 99.2%. We keep the same classification threshold and test it on our annotated arXiv test split. The precision and recall are 93.4% and 65.7%, respectively.


7. Conclusions

In this paper, we propose datasets and models for summarizing scientific charts, a specific type of structured image. We construct datasets from PMC and arXiv by leveraging crowdsourcing and the figure captions in the papers. To enable better understanding of the text components in charts and to endow the model with external knowledge, we propose to use an OCR encoder and a pre-trained language decoder on top of a standard image captioning model. In our experiments, we show the effectiveness of our models in terms of both automatic evaluation metrics and human evaluation.


Acknowledgments

The authors thank Bloomberg's AI Engineering team, especially Alakananda Vempala, Ketevan Tsereteli, and Anju Kambadur, for helpful feedback and directions. Additional thanks to the anonymous reviewers for their insights. Hao Tan acknowledges support from Bloomberg's Data Science Ph.D. Fellowship.
References

[1] S. Carberry, S. Elzer, S. Demir, Information graphics: an untapped resource for digital libraries, in: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006, pp. 581–588.
[2] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, Revision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 393–402.
[3] S. Ray Choudhury, C. L. Giles, An architecture for information extraction from figures in digital libraries, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 667–672.
[4] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, Figureseer: Parsing result-figures in research papers, in: European Conference on Computer Vision, Springer, 2016, pp. 664–680.
[5] W. Huang, C. L. Tan, A system for understanding imaged infographics and its applications, in: Proceedings of the 2007 ACM symposium on Document engineering, 2007, pp. 9–18.
[6] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphics textually, Computational Linguistics 38 (2012) 527–574.
[7] Z. Chen, M. Cafarella, E. Adar, Diagramflyer: A search engine for data-driven diagrams, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 183–186.
[8] S. R. Choudhury, S. Wang, C. L. Giles, Scalable algorithms for scholarly figure mining and semantics, in: Proceedings of the International Workshop on Semantic Big Data, 2016, pp. 1–6.
[9] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: European Conference on Computer Vision, Springer, 2016, pp. 235–251.
[10] S. E. Kahou, A. Atkinson, V. Michalski, Á. Kádár, A. Trischler, Y. Bengio, Figureqa: An annotated figure dataset for visual reasoning, in: ICLR Workshop, 2018.
[11] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, M. Rohrbach, Towards vqa models that can read, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326.
[12] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. A. Bateman, Ai2d-rst: A multimodal corpus of 1000 primary school science diagrams, Language Resources and Evaluation (2020) 1–28.
[13] N. Methani, P. Ganguly, M. M. Khapra, P. Kumar, Plotqa: Reasoning over scientific plots, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1527–1536.
[14] J. Poco, J. Heer, Reverse-engineering visualizations: Recovering visual encodings from chart images, in: Computer Graphics Forum, volume 36, Wiley Online Library, 2017, pp. 353–363.
[15] S. Elzer, E. Schwartz, S. Carberry, D. Chester, S. Demir, P. Wu, A browser extension for providing visually impaired users access to the content of bar charts on the web, in: WEBIST (2), Citeseer, 2007, pp. 59–66.
[16] S. Demir, S. Carberry, K. McCoy, Generating textual summaries of bar charts, in: Proceedings of the Fifth International Natural Language Generation Conference, Association for Computational Linguistics, Salt Fork, Ohio, USA, 2008, pp. 7–15. URL: https://www.aclweb.org/anthology/W08-1103.
[17] C. Greenbacker, S. Carberry, K. McCoy, A corpus of human-written summaries of line graphs, in: Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop, Association for Computational Linguistics, Edinburgh, Scotland, 2011, pp. 23–27. URL: https://www.aclweb.org/anthology/W11-2703.
[18] C. Greenbacker, P. Wu, S. Carberry, K. McCoy, S. Elzer, Abstractive summarization of line graphs from popular media, in: Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, Association for Computational Linguistics, Portland, Oregon, 2011, pp. 41–48. URL: https://www.aclweb.org/anthology/W11-0506.
[19] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
[20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International conference on machine learning, 2015, pp. 2048–2057.
[21] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, in: International Conference on Learning Representations, 2016.
[22] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[23] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, J. Gao, Unified vision-language pre-training for image captioning and vqa, in: AAAI, 2019.
[24] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[26] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
[27] R. Smith, An overview of the tesseract ocr engine, in: Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, IEEE, 2007, pp. 629–633.
[28] J. Gu, Z. Lu, H. Li, V. O. Li, Incorporating copying mechanism in sequence-to-sequence learning, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1631–1640.
[29] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[38] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
[39] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
[40] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
     Summarization with pointer-generator networks,                 vision and pattern recognition, 2015, pp. 4566–4575.
     in: Proceedings of the 55th Annual Meeting of the         [41] D. P. Kingma, J. Ba, Adam: A method for stochastic
     Association for Computational Linguistics (Volume              optimization, in: ICLR, 2015.
     1: Long Papers), 2017, pp. 1073–1083.                     [42] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert:
[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,               Pre-training of deep bidirectional transformers for
     I. Sutskever, Language models are unsupervised                 language understanding, in: NAACL-HLT (1), 2019.
     multitask learners (2019).
[31] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhut-
     dinov, Q. V. Le, Xlnet: Generalized autoregressive
     pretraining for language understanding, in: Ad-
     vances in neural information processing systems,
     2019, pp. 5753–5763.
[32] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
     M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
     limits of transfer learning with a unified text-to-text
     transformer, JMLR (2019).
[33] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo-
     hamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart:
     Denoising sequence-to-sequence pre-training for
     natural language generation, translation, and com-
     prehension, in: ACL, 2020.
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
     L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At-
     tention is all you need, in: Advances in Neural
     Information Processing Systems, 2017, pp. 5998–
     6008.
[35] S. Marcel, Y. Rodriguez, Torchvision the machine-
     vision package of torch, in: Proceedings of the 18th
     ACM international conference on Multimedia, 2010,
     pp. 1485–1488.
[36] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. De-
     langue, A. Moi, P. Cistac, M. Funtowicz, J. Davison,
     S. Shleifer, et al., Transformers: State-of-the-art
     natural language processing, in: Proceedings of
     the 2020 Conference on Empirical Methods in Nat-
     ural Language Processing: System Demonstrations,
     2020, pp. 38–45.
[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
     method for automatic evaluation of machine trans-
     lation, in: Proceedings of the 40th annual meeting
     on association for computational linguistics, As-
A. Implementation Details

The supervised learning setup is conducted on our annotated English PMC dataset in Sec. 3. We keep 1,000 charts in the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1. We train our model on the training set and tune the hyperparameters on the validation set; the test set is only used to report results. We train for 200 epochs on this small dataset. All our code is written in PyTorch, and all experiments converge within 4-5 hours on a single Titan V GPU.

For the base model, we use a ResNet-101 model from the Torchvision [35] library (https://pytorch.org/docs/stable/torchvision/models.html). We resize the image to 224 x 224, and the backbone maps it to a 7 x 7 x 2048 feature map. We use 512 dimensions for the LSTM and 256 dimensions for the word embeddings. The attentive hidden states have the same size as the LSTM hidden states (512 dimensions). We use Adam [41] with a fixed learning rate of 10^-4 and a batch size of 64.

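To make the shapes concrete, the following minimal PyTorch sketch shows how a ResNet-101 trunk from Torchvision can produce the 7 x 7 x 2048 feature grid described above; the 2048-to-512 projection and all variable names are illustrative assumptions rather than the exact training code.

```python
import torch
import torchvision

# Sketch of the visual encoder: a ResNet-101 trunk maps a 224 x 224 chart image
# to a 7 x 7 x 2048 feature grid. In practice ImageNet-pretrained weights would
# be loaded; random initialization here just keeps the sketch lightweight.
backbone = torchvision.models.resnet101()
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

image = torch.randn(1, 3, 224, 224)            # one resized chart image
features = encoder(image)                      # (1, 2048, 7, 7)
grid = features.flatten(2).transpose(1, 2)     # (1, 49, 2048): 49 spatial positions

# Assumed projection into the 512-dimensional space used by the LSTM decoder.
proj = torch.nn.Linear(2048, 512)
visual_states = proj(grid)                     # (1, 49, 512)
print(visual_states.shape)
```
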
For the OCR model, we sort the OCR texts by their confidence scores and remove empty strings. We keep the top 20 OCR texts for post-processing and use 512 dimensions for the OCR feature representations (yellow blocks in Fig. 3). Since we want the image positions to correspond to the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 x 224.

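A minimal sketch of this OCR post-processing step is shown below. The dictionary fields on each OCR result are assumptions, since the text above only specifies the confidence-based sorting, the removal of empty strings, and the top-20 cutoff.

```python
# Sketch of the OCR text selection described above. Each OCR result is assumed
# to carry a text string, a confidence score, and a bounding box; the exact
# structure depends on the OCR engine (e.g., Tesseract [27]).
def select_ocr_texts(ocr_results, top_k=20):
    """Keep the top-k non-empty OCR texts, sorted by descending confidence."""
    non_empty = [r for r in ocr_results if r["text"].strip()]
    ranked = sorted(non_empty, key=lambda r: r["confidence"], reverse=True)
    return ranked[:top_k]

# Example usage with made-up OCR output:
ocr_results = [
    {"text": "Accuracy", "confidence": 0.97, "box": (10, 5, 80, 20)},
    {"text": "",         "confidence": 0.50, "box": (0, 0, 1, 1)},
    {"text": "Epochs",   "confidence": 0.91, "box": (40, 200, 90, 215)},
]
print(select_ocr_texts(ocr_results))
```
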
For the pre-trained GPT-2 [30] model, we download the small GPT-2 model (124M parameters) from Hugging Face's Transformers [36] library (https://github.com/huggingface/transformers). The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. We use Adam [41] with a weight decay of 0.01, following the practice in Devlin et al. [42]; we do not apply weight decay to the layer-normalization parameters and biases. We use a linear warmup with a peak learning rate of 10^-4, where the first 5% of the steps are warmup steps. The batch size is 64.

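The following sketch illustrates this setup with the Hugging Face Transformers API: GPT-2 small loaded with cross-attention enabled, a decoupled-weight-decay (AdamW-style) optimizer that exempts layer-normalization parameters and biases, and a linear warmup over the first 5% of steps. The 512-to-768 projection of the OCR features, the total step count, and all variable names are illustrative assumptions, not the authors' exact code.

```python
import torch
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
                          get_linear_schedule_with_warmup)

# GPT-2 small (124M) with cross-attention layers enabled so the decoder can
# attend to encoded OCR features. The cross-attention weights are newly
# initialized; only the language-modeling weights come from pre-training.
config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Assumed projection of the 512-d OCR features into GPT-2's 768-d hidden space.
ocr_proj = torch.nn.Linear(512, config.n_embd)

# Weight decay of 0.01, excluding layer-normalization parameters and biases.
no_decay = ("bias", "ln_")
named = list(model.named_parameters()) + list(ocr_proj.named_parameters())
param_groups = [
    {"params": [p for n, p in named if not any(k in n for k in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in named if any(k in n for k in no_decay)],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(param_groups, lr=1e-4)

# Linear warmup over the first 5% of steps; the total step count is illustrative.
total_steps = 10_000
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.05 * total_steps), num_training_steps=total_steps)

# One illustrative training step: caption tokens cross-attend to the OCR features.
caption = tokenizer("The chart shows accuracy over training epochs.", return_tensors="pt")
ocr_features = torch.randn(1, 20, 512)         # up to 20 OCR texts, 512-d each
output = model(input_ids=caption["input_ids"],
               encoder_hidden_states=ocr_proj(ocr_features),
               labels=caption["input_ids"])
output.loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```
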
In order to increase the number of training examples, we apply the proposed semi-supervised learning technique. The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this classifier. The training, validation, and test sets have 5,819, 646, and 1,000 data points, respectively; the data split is the same as in the supervised learning setup above. After the model converges on the training set, we calibrate the classifier on the validation set to maximize recall subject to a precision above 99%. Since we have a large number of images, we can afford a lower recall in exchange for high-quality positive examples.

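A minimal sketch of this calibration step is shown below: among all decision thresholds whose validation precision exceeds 99%, it picks the one with the highest recall. The helper name, the use of scikit-learn, and the toy data are assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def calibrate_threshold(val_scores, val_labels, min_precision=0.99):
    """Pick the score threshold with the highest recall among those whose
    validation precision is at least `min_precision`."""
    precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)
    # precision/recall have one more entry than thresholds; drop the final point.
    qualifies = precision[:-1] >= min_precision
    if not qualifies.any():
        raise ValueError("No threshold reaches the required precision.")
    best = np.argmax(np.where(qualifies, recall[:-1], -1.0))
    return thresholds[best]

# Toy usage with separable scores standing in for real classifier outputs.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=646)
scores = labels * 0.6 + rng.random(646) * 0.4
print(calibrate_threshold(scores, labels))
```
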
We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from the years 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images, which serve as additional training examples for the summarization model. The hyperparameters of the summarization model are the same as in the supervised setup. For the models trained on this dataset, we use a maximum sequence length of 80 and train for 100 epochs; the other hyperparameters are the same as for the small supervised PMC data.

For domain adaptation, we take charts and captions from English arXiv papers as the target domain. As described in the dataset section, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the semi-supervised learning setup to label 140,000 arXiv images, which results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1. The summarization model is trained on the training data, tuned on the validation data, and finally evaluated on the manually annotated test set. For the models trained on this dataset, we use a maximum sequence length of 40, since the captions in arXiv are much shorter. Since we halve the maximum sequence length, we train for 200 epochs, which keeps the computational budget roughly the same for both datasets.


B. Details of Data Collection

The crowdsourcing task is conducted on Appen (client.appen.com). There are 2,263 distinct annotators from 50 countries. Since the task is to classify image types, it does not require native English speakers. The top 5 countries are Venezuela (53%), USA (23%), Egypt (8%), Colombia (2%), and Peru (1.4%). We paid one cent per judgement (image). For the first round of annotation tasks, the Fleiss' kappa scores for the "whether it is a single chart" and "chart type" tasks are 0.56 and 0.73, respectively, which indicates moderate to substantial agreement.


C. Additional Analysis

C.1. Cross-Domain Transferability

To illustrate the need for the domain adaptation enabled by the chart classifier (in Sec. 4.4), we show the low cross-domain transferability of our models in this section. Table 5 reports the results of our final model trained on each dataset and evaluated on both test sets. The model does not transfer well between the two domains, probably because of the different figure styles and captioning conventions of the two communities. The different topics also introduce diverging vocabularies.

                       PMC                                 arXiv
           BLEU   ROUGE-L   METEOR   CIDEr     BLEU   ROUGE-L   METEOR   CIDEr
   PMC     4.47    12.46     4.32    10.30     0.06     8.19     1.93     0.63
   arXiv   0.22    10.11     3.25     1.43     5.89    14.32     4.92    32.34

Table 5
The transferability of our captioning model across different domains. The columns indicate the training dataset while the rows indicate the testing dataset. The PMC training data is augmented with filtered charts (in Sec. 4.4) and the arXiv training data is built by the chart classifier. All test data are human-annotated.

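The paper does not state which metric implementation produced the numbers in Table 5. The sketch below shows one common way to compute BLEU, ROUGE-L, and CIDEr for generated chart summaries with the pycocoevalcap package (an assumption), using made-up predictions and references; METEOR is omitted because the pycocoevalcap wrapper calls out to a bundled Java tool.

```python
# Assumes the pycocoevalcap package; inputs map an image id to a list of
# reference captions and to a single generated caption.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

references = {
    "img1": ["figure 2 shows the accuracy of the model over training epochs ."],
    "img2": ["the bar chart compares the runtime of three datasets ."],
}
predictions = {
    "img1": ["accuracy of the model over epochs ."],
    "img2": ["runtime comparison across datasets ."],
}

bleu, _ = Bleu(4).compute_score(references, predictions)    # BLEU-1..4
rouge_l, _ = Rouge().compute_score(references, predictions)
cider, _ = Cider().compute_score(references, predictions)
print("BLEU-4:", bleu[3], "ROUGE-L:", rouge_l, "CIDEr:", cider)
```
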


D. Ethical Considerations
The technique developed in this paper would help automatically summarize news, articles, and publications that contain charts. It would also help visually impaired people understand the content of charts. It would fail in cases where the OCR detector misses key information in a chart, which would lead to an unfaithful summary of that chart. Since we use a pre-trained language decoder in our final model, the generated summaries might be biased towards the pre-training domain of the language decoder. Regarding the dataset collection, we resolved all legal and license issues for the PMC dataset before showing the images to annotators. More specifically, we only use articles with CC BY licenses from the Open Access Subset of the PMC data. For the arXiv data, a small test set is annotated by the authors.