=Paper=
{{Paper
|id=Vol-3164/paper24
|storemode=property
|title=Scientific Chart Summarization: Datasets and Improved Text Modeling
|pdfUrl=https://ceur-ws.org/Vol-3164/paper24.pdf
|volume=Vol-3164
|authors=Hao Tan,Chen-Tse Tsai,Yujie He,Mohit Bansal
|dblpUrl=https://dblp.org/rec/conf/aaai/TanTHB22
}}
==Scientific Chart Summarization: Datasets and Improved Text Modeling==
Hao Tan¹ᵃᵇ, Chen-Tse Tsai²ᵃ, Yujie He²ᵃ and Mohit Bansal¹
¹ University of North Carolina at Chapel Hill, USA
² Bloomberg, USA
Abstract
Chart figures usually convey the key message in a multimodal document, so automatically understanding charts and making them more accessible has become indispensable in the information era. In this paper, we study the chart summarization problem, in which the goal is to generate sentences that describe the salient information in a chart image. To obtain training examples, we leverage image-caption pairs in multiple scientific areas. We create a dataset of single-chart images from research papers in PubMed Central (PMC) and arXiv. Most recent vision-and-language work focuses on natural images, and several challenges posed by structured images such as charts remain under-explored. One key property of charts is that the text components (e.g., legends and axis names) carry important information. In our proposed model, we therefore encode a chart image with a text encoder in addition to a standard visual encoder. The visual and textual representations are connected to a large pre-trained language decoder via pre-embedding and cross-attention approaches, respectively. Experimental results show that the proposed model is significantly better than an image captioning baseline.

Keywords
Chart Summarization, Multimodal Learning, Document Understanding, Image Captioning, Natural Language Processing
1. Introduction

Information graphics, such as line charts and bar charts, are essential and common components of a document. Charts are usually used for visually summarizing important information that a document intends to convey. Moreover, as shown in the study of Carberry et al. [1], information graphics in magazines and newspapers often convey messages that are not repeated in the text. Therefore, summarizing the primary message in a chart is an important step towards understanding a multimodal document. Potential applications of chart summarization include indexing information content for a search engine, making charts accessible for individuals with visual impairments, and simplifying the dissemination of technical visual information to a layperson.

We have recently seen the success of image captioning work, which can be viewed as generating summaries for an image. However, this research has mostly focused on natural images, while other types of images (e.g., the structured images shown in Fig. 2) are under-explored. On the other hand, abstractive text summarization models have also been greatly improved due to the development of neural network models. However, these models only look at the text component of a document. In this work, we focus on the less-studied yet important task of 'chart summarization', where we want to generate a salient summary for structural charts. First, to obtain a large quantity of summaries of chart images, we leverage captions in scientific articles. Unlike magazines or newspapers, in which image captions can be less descriptive, captions in scientific papers tend to be more detailed and verbose. We build a chart summarization dataset from papers in arXiv and PubMed Central (PMC) by assuming that captions are salient summaries of chart figures. Image captions in these data sources are written by the corresponding papers' authors, and hence tend to be natural in their language. Since these articles also contain figures other than charts, we create crowdsourcing tasks to select single-chart images and collect these charts' detailed types (e.g., line chart, bar chart, etc.).

Different from traditional captioning of natural images, there are two main challenges from the language perspective when the target images are charts: (1) Besides visual content, charts usually also contain text (e.g., legends and axis titles) which carries significant information about the components in charts. (2) Charts are likely to be used in specific domains, so the language generation model may suffer from rare-word issues.

To address these two challenges, we first use an optical character recognition (OCR) model to detect the text boxes in the charts. An OCR embedding layer is proposed to encode these extracted texts, together with their position information, into vectors, and these vector representations are treated as another input to the language decoder through a cross-attention mechanism. Secondly, to endow the decoder with domain-specific knowledge, we use a large pre-trained language decoder instead of training it from scratch.

ᵃ Equal Contribution.
ᵇ Work done during an internship at Bloomberg.
The Second Workshop on Scientific Document Understanding at AAAI 2022.
© 2022 Bloomberg Finance L.P. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 1: Pipeline of dataset creation. We first sample scientific papers from arXiv and PubMed Central, and then extract image-caption pairs by parsing the source LaTeX or XML files. Finally, crowdsourcing is applied to annotate whether an image contains a single chart and, if so, the corresponding chart type.
The chart information is connected to this pre-trained language decoder via two approaches: pre-embedding and cross-attention. We empirically find that using pre-embedding for visual content and cross-attention for OCR representations gives the best results.

We apply our models to our collected datasets of two scientific domains. We conduct both metric-based automatic evaluation and human-annotated qualitative evaluation. Experimental results show that our model, with the integration of OCR and a pre-trained language model, significantly outperforms the baseline image captioning model. We also show ablation studies that illustrate the effectiveness of our proposed methods.

2. Related Work

Most work on understanding chart images involves chart type classification. Savva et al. [2] classify given chart images into 10 chart categories using an SVM classifier with visual bag-of-words and text-region features. With a similar model, Ray Choudhury and Giles [3] proposed a binary classifier to determine whether an image is a line chart. Siegel et al. [4] experimented with CNN-based models for classifying images they extracted from scholarly articles. In order to identify chart figures for training our summarization model, we build a binary classifier to identify common charts (e.g., line charts, bar charts, scatter plots, etc.).

There is a line of work on interpreting text components in chart images [5, 6, 7, 8, 9, 4, 10, 11, 12, 13]. One of the applications here is to recover visual encodings for the purposes of indexing and search. For example, Poco and Heer [14] proposed an end-to-end text analysis pipeline that identifies text elements in a chart image, determines their bounding boxes, and classifies their roles in the chart (e.g., x-axis label, x-axis title, legend title, etc.). They also proposed a CNN model that classifies the type of graphical mark (e.g., bars or lines). We simply use a general-purpose OCR tool for recognizing text in chart images and focus more on the text generation model. These better text analysis models could potentially improve our model performance, which we leave for future investigation. Kahou et al. [10] introduce FigureQA, a visual reasoning corpus of question-answer pairs over synthetic chart images. Instead of answering questions on synthetic charts, we aim at directly summarizing real chart images.

There are some earlier works on chart summarization. Elzer et al. [15] proposed SIGHT, a system that summarizes bar charts for visually impaired users. The system identifies one of twelve message categories that can be conveyed by a bar chart and produces a logical form. This logical representation is then translated into natural language via templates. Demir et al. [16] built on top of SIGHT. Their system first identifies, by rules, an additional set of propositions that may reflect some information in a bar chart. These propositions are then organized and structured by a bottom-up planner. Finally, a surface realizer is applied to produce natural language summaries.

Greenbacker et al. [17] built a corpus of human-written English summaries of line graphs. They selected 23 line graphs and asked annotators to summarize the most important information in each graph. As this process is difficult to scale up, we take the captions of chart images in scientific papers to represent the summaries instead. Greenbacker et al. [18] further used this corpus and proposed an abstractive summarization system for line charts. The system uses a Bayesian network to classify the intents of line segments, and then rules are applied to identify additional important informational propositions conveyed by the line graph. The sets of intents and propositions are pre-defined from a study of the corpus. They left the final step of generating natural language summaries from propositions as future work, so no evaluation results were shown.
A common challenge of these earlier works is that they are limited to a fixed set of propositions and need to convert the selected propositions to natural language. Instead of using a pipeline with hand-crafted intents and propositions, we propose to leverage an end-to-end neural network, which has been shown to be powerful in generating coherent and grammatical sentences in the context of image captioning and abstractive text summarization.

Another thread of related work is (natural) image captioning, which tries to generate descriptions for natural images. Vinyals et al. [19] first illustrate the end-to-end encoder-decoder architecture, and Xu et al. [20] extend it with attention modules. Ranzato et al. [21] use reinforcement learning to eliminate exposure bias, but this requires a large amount of data to reduce the high variance. Anderson et al. [22] take object-level information to enable fine-grained visual understanding. However, we empirically found that detection features for natural images do not work well for charts (structural images). Previous vision-and-language pre-training, e.g., VLP [23] and OSCAR [24], uses pre-trained vision-and-language models to improve image captioning but requires a large in-domain corpus and heavy pre-training.

3. Datasets Creation

We create our datasets based on image-caption pairs that appear in public scientific papers. Different from figures in magazines or newspapers, where the captions can be less descriptive, figure captions in scientific articles tend to convey the key message of the figures. The assumption here is that captions written by the paper authors represent the most salient information in the figures, and therefore can serve as summaries of the corresponding figures. The overview of our dataset creation pipeline is shown in Figure 1. We consider two data sources: arXiv¹ and PMC². ArXiv is a free distribution service and an open-access archive for scholarly articles in fields such as physics, computer science, and mathematics. PMC is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine. We take articles in the Open Access Subset³. These two data sources are chosen because they both provide structural data in addition to the PDF files. That is, we can obtain image-caption pairs by parsing the LaTeX source files provided by arXiv or the XML files provided by PMC. We write our own LaTeX parser for the arXiv data, and use a public PubMed parser⁴ for parsing the XML information.

Although we can extract lots of image-caption pairs, most of the figures in these papers are not charts. Hence, to be able to train and evaluate the proposed chart summarization model, we need to identify which figures are charts. In this work, we focus on the 5 most common chart types: line, bar, scatter, pie, and area charts (Figure 2). Moreover, we further focus on the simplest case, where images contain only a single chart. Figures with multiple charts or with any non-chart component are considered negative images in this work. In the following sections, we describe how we obtain single-chart and chart-type annotations.

3.1. PubMed Central Data

For PMC data, we create a crowdsourcing task to annotate whether a given image contains a single chart. We randomly sample 50,000 images from the papers published from 2011 to 2019. For each image, we ask annotators whether it is a single-chart figure. If the answer is yes, the annotators are required to select a chart type from line, bar, scatter, pie, area, or other chart. Since this task is fairly simple, we ask two annotators to label each image in the first round. In most cases, the two annotators agree on the labels. More specifically, the Fleiss' kappa scores for the "whether it's a single chart" and "chart type" tasks are 0.56 and 0.73 respectively, which shows significant agreement⁵.

If there is a disagreement on either the single-chart label or the chart type, we further ask three other annotators to perform a second round of annotation on these images. Finally, majority vote is applied to resolve conflicts among all five annotators. We note that single charts with the "other" chart type are considered negative images in our experiments.

Among the 50,000 images, we obtain 7,397 positive images (single chart), including 3,681 line charts, 3,088 bar charts, 478 scatter charts, 125 pie charts, and 25 area charts. The positive ratio of the charts is about 13%. This low ratio is because most figures in scientific articles are non-chart figures (e.g., model architecture diagrams). In this work, we only use chart types in analyzing model performance. That is, chart type information is not included explicitly in model training.

3.2. ArXiv Data

We also build another dataset from the arXiv data. We take papers in the Computer Vision, Computation and Language, Machine Learning, Artificial Intelligence, and Neural and Evolutionary Computing fields from 2008 to 2020. Because of copyright issues, we cannot put arXiv images on a public crowdsourcing platform. Instead, the authors went through and annotated 2,000 randomly sampled figures with the same crowdsourcing interface that we use for annotating PMC data. This results in 370 single-chart images.

¹ https://arxiv.org/
² https://www.ncbi.nlm.nih.gov/pmc/
³ https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
⁴ https://github.com/titipata/pubmed_parser
⁵ https://en.wikipedia.org/wiki/Fleiss%27_kappa
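To make the extraction step of the pipeline concrete, below is a minimal, illustrative Python sketch of pulling figure-caption pairs out of an arXiv LaTeX source file with regular expressions. It is not the authors' released parser; the file path, function name, and the single-level `\caption{...}` matching are assumptions for illustration, and a robust implementation would need full LaTeX parsing.

```python
import re
from pathlib import Path

# Hypothetical sketch: pull (image file, caption) pairs out of one LaTeX source file.
# Real LaTeX parsing (nested braces, subfigures, macros) needs a proper parser;
# this regex-based version only illustrates the idea behind the pipeline in Figure 1.
FIGURE_RE = re.compile(r"\\begin{figure\*?}(?P<body>.*?)\\end{figure\*?}", re.DOTALL)
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?{(?P<path>[^}]+)}")
CAPTION_RE = re.compile(r"\\caption{(?P<caption>[^{}]*(?:{[^{}]*}[^{}]*)*)}", re.DOTALL)

def extract_image_caption_pairs(tex_path: str):
    """Return a list of (image_path, caption) pairs found in one .tex file."""
    source = Path(tex_path).read_text(errors="ignore")
    pairs = []
    for figure in FIGURE_RE.finditer(source):
        body = figure.group("body")
        images = GRAPHICS_RE.findall(body)
        caption = CAPTION_RE.search(body)
        # Keep only single-image figures with a caption, mirroring the
        # "single chart" focus of the dataset (multi-panel figures are skipped).
        if len(images) == 1 and caption:
            pairs.append((images[0], " ".join(caption.group("caption").split())))
    return pairs

if __name__ == "__main__":
    for image, caption in extract_image_caption_pairs("paper/main.tex"):
        print(image, "->", caption[:80])
```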
Figure 2: Example charts with the corresponding chart types from the PubMed Central dataset. The dataset we build contains the 5 most common chart types.

4. Methodology

In this section, we introduce the proposed models and training strategies for the chart summarization task. In this task, the model needs to generate a sequence of words {w_i} describing the contents of a chart x. We start by introducing the basic captioning model. To enhance in-image text understanding and endow external knowledge, we incorporate an OCR encoder and a pre-trained language decoder. Lastly, we propose a simple semi-supervised learning and domain adaptation approach using a chart classifier.

4.1. Base Model

Our base model is adopted from the attentive encoder-decoder model for image captioning proposed in Xu et al. [20]. A ResNet-101 [25] visual feature extractor encodes the chart into a 7 × 7 × 2048 dimensional feature map, where each vector in the feature map corresponds to a grid region of the image. The feature map is then flattened into a 49 × 2048 feature sequence {f_i}:

\{f_i\}_{i=1}^{49} = \mathrm{ResNet}(x)

At each decoding step t, the LSTM [26] language decoder outputs the hidden state h_t and cell c_t by reading the previous word w_{t-1} and the states (h_{t-1}, c_{t-1}). The attention module (denoted as \mathrm{Att}_{h \to f}) then attends to the feature sequence {f_i} with the hidden output h_t as the query. The context \hat{f}_t and the hidden vector h_t are merged into an attentive hidden vector \hat{h}_t with a fully-connected layer:

\tilde{w}_{t-1} = \mathrm{Embedding}(w_{t-1})
h_t, c_t = \mathrm{LSTM}(\tilde{w}_{t-1}, h_{t-1}, c_{t-1})
\hat{f}_t = \mathrm{Att}_{h \to f}(h_t, \{f_i\})
\hat{h}_t = \tanh(W_1 [\hat{f}_t ; h_t] + b_1)

The probability of generating the k-th token at time step t is the softmax over a linear transformation of the attentive hidden \hat{h}_t. The loss \mathcal{L}_t is the negative log-likelihood of the ground-truth token w_t^*:

p_t(w_{t,k}) = \mathrm{softmax}_k(W_w \hat{h}_t + b_w)
\mathcal{L}_t = -\log p_t(w_t^*)
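As a concrete illustration of the base model equations above, the following PyTorch sketch implements one attentive decoding step with the dimensions reported in the appendix (512-dimensional LSTM state, 256-dimensional word embeddings, 49 × 2048 image features). The class and variable names are ours and the dot-product attention form is an assumption; this is a sketch of the described architecture, not the authors' code.

```python
import torch
import torch.nn as nn

class AttentiveLSTMDecoder(nn.Module):
    """One-step attentive decoder in the spirit of Show, Attend and Tell [20]."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.key_proj = nn.Linear(feat_dim, hidden_dim)            # project image features for attention
        self.merge = nn.Linear(feat_dim + hidden_dim, hidden_dim)  # W_1 in the equations
        self.out = nn.Linear(hidden_dim, vocab_size)               # W_w in the equations

    def step(self, prev_word, state, feats):
        # prev_word: (B,), state: (h, c) each (B, hidden), feats: (B, 49, feat_dim)
        w_tilde = self.embedding(prev_word)                        # embedding(w_{t-1})
        h, c = self.lstm(w_tilde, state)                           # h_t, c_t
        # Dot-product attention of h_t over the 49 grid features.
        scores = torch.einsum("bd,bnd->bn", h, self.key_proj(feats))
        alpha = scores.softmax(dim=-1)
        f_hat = torch.einsum("bn,bnd->bd", alpha, feats)           # attended context \hat{f}_t
        h_hat = torch.tanh(self.merge(torch.cat([f_hat, h], dim=-1)))  # \hat{h}_t
        logits = self.out(h_hat)                                   # scores over the vocabulary
        return logits, (h, c)

# Tiny smoke test with random features standing in for the ResNet output.
decoder = AttentiveLSTMDecoder(vocab_size=1000)
feats = torch.randn(2, 49, 2048)
state = (torch.zeros(2, 512), torch.zeros(2, 512))
logits, state = decoder.step(torch.tensor([1, 2]), state, feats)
loss = nn.functional.cross_entropy(logits, torch.tensor([5, 7]))  # -log p_t(w_t^*)
```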
4.2. Text Understanding

Different from natural image captioning, the summarization of charts heavily relies on understanding the text inside the images. However, the ResNet visual encoder (in Section 4.1) is insensitive to the text in the images (as also shown in Singh et al. [11]), so we need to build a pipeline to extract the text information from the images. Specifically, we first use Tesseract [27] to extract a sequence of m texts text_j with their positions pos_j from the image x:

\{(\mathrm{text}_j, \mathrm{pos}_j)\}_{j=1}^{m} = \mathrm{OCR}(x)   (1)

Since the characters in charts are usually in a small font and sometimes blurred with the chart content, the copy mechanism [28, 29] that directly brings the text into the final summary does not provide good results. We instead use a shallow text embedding layer to project the OCR text into dense vector representations, which denoises the OCR detection results. We also encode the position of the OCR text along with the text representation, since the spatial information indicates the properties of the text (e.g., in the legend, in the title, or inside the chart):

g_j = \mathrm{Emb}_{\mathrm{text}}(\mathrm{text}_j) + W_{\mathrm{pos}} \, \mathrm{pos}_j   (2)

These OCR representations are treated as another view of the chart, and the language decoder simultaneously attends to the OCR information {g_j} and the visual image features {f_i}. The final hidden output \tilde{h}_t is calculated based on the concatenation of the visually attended vector \tilde{f}, the OCR attended vector \tilde{g}, and the hidden state h_t:
Figure 3: Illustration of the proposed chart summarization model. We have two branches of image encoding: (1) the visual branch via the ResNet and a fixed-length transformer, and (2) the text branch via the OCR system and the OCR embedding layer. The outputs of these two branches are then fused into the pre-trained language decoder by pre-embedding (concatenation) and a cross-attention layer, respectively. The grey boxes are neural networks.

\tilde{f} = \mathrm{Att}_{h \to f}(h_t, \{f_i\})   (3)
\tilde{g} = \mathrm{Att}_{h \to g}(h_t, \{g_j\})   (4)
\tilde{h}_t = \tanh(W_2 [\tilde{f} ; \tilde{g} ; h_t] + b_2)   (5)

We next replace the original attentive hidden \hat{h}_t (from Sec. 4.1) with this OCR-enhanced hidden output \tilde{h}_t in the succeeding decoding steps.
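The following sketch illustrates Eqs. (1)-(2) with the pytesseract wrapper around Tesseract and a small embedding module. The OCR-word vocabulary, the normalized four-value box encoding for pos_j, and the helper names are illustrative assumptions rather than the paper's exact implementation.

```python
import pytesseract
import torch
import torch.nn as nn
from PIL import Image

def run_ocr(image_path, top_k=20):
    """Eq. (1): return up to top_k (text, normalized box) pairs, sorted by confidence."""
    image = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    w, h = image.size
    boxes = []
    for text, conf, x, y, bw, bh in zip(
            data["text"], data["conf"], data["left"], data["top"],
            data["width"], data["height"]):
        if text.strip() and float(conf) > 0:    # drop empty and rejected detections
            boxes.append((float(conf), text.strip(),
                          (x / w, y / h, (x + bw) / w, (y + bh) / h)))
    boxes.sort(reverse=True)                    # keep the most confident detections
    return [(t, b) for _, t, b in boxes[:top_k]]

class OCREmbedding(nn.Module):
    """Eq. (2): g_j = Emb_text(text_j) + W_pos * pos_j."""
    def __init__(self, ocr_vocab_size, dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(ocr_vocab_size, dim)
        self.pos_proj = nn.Linear(4, dim)       # (x1, y1, x2, y2) box -> dim

    def forward(self, token_ids, boxes):
        # token_ids: (B, m) ids from an assumed OCR word vocabulary, boxes: (B, m, 4)
        return self.text_emb(token_ids) + self.pos_proj(boxes)
```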
As shown in Figure 3, we use the pre-embedding ap-
proach for the features from the visual image content
4.3. Pre-trained Language Decoder (i.e., from the ResNet encoder) and use the cross-attention
When summarizing charts in news or scientific papers, a layers for the OCR texts. The idea of this specific design
faithful description of the chart contents also relies on is that the generation would be led by the image content
external knowledge, and hence a pre-trained language and will use the OCR information to generate concrete
decoder might help the generation. As shown in Figure 3, words. We empirically find that it is the best combina-
we illustrate our model which integrates a pre-trained tion to fuse information into the language decoder, and
language decoder GPT-2 [30].6 As described in the pre- we show the comparison in Section 6.2. In detail, the
vious section, we have two image encoders (i.e., ResNet length of the ResNet feature map is 49 and the order of
encoder and OCR text encoder) to process the image con- the features is not aligned with the positional encoding
tent and image text respectively. The ResNet encoder in the pre-trained language decoder. We thus do not di-
maps the features into a squared feature map (the purple rectly append it before the word embedding but use a
vector blocks in Figure 3) where each vector corresponds fixed-length transformer to map it to a sequence of 10
to a part of image content. We will view this feature map vectors (the red blocks in Figure 3; we only draw 3 vec-
as a sequence of vectors (as in Eq. 1) in the following pro- tors for simplicity). The fixed-length transformer is built
cedures. The OCR encoder (Eq. 4.2) maps the chart into a by transformer decoder layers [34] with only positional
sequence of recognized words and their positions on the embedding (without word embedding). We use only 1
chart. The OCR embedding layer (Eq. 2) adds the word layer in our experiments.
embedding and the position encoding into one vector for
each OCR entry (the yellow vectors in Figure 3). 4.4. Semi-Supervised Learning and
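A hedged sketch of this decoder wiring with the Hugging Face transformers library is shown below: GPT-2 small is loaded with cross-attention enabled and consumes the OCR vectors as encoder_hidden_states, while a one-layer "fixed-length" transformer with 10 learned query vectors compresses the 49 ResNet vectors into the pre-embedding prefix. The module names, the 768-dimensional OCR projection, and the loss masking of the prefix positions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class FixedLengthTransformer(nn.Module):
    """Compress the 49-vector ResNet sequence into a fixed number of query vectors."""
    def __init__(self, feat_dim=2048, hidden=768, num_queries=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)  # 1 layer, no causal mask

    def forward(self, feats):                       # feats: (B, 49, feat_dim)
        memory = self.proj(feats)
        queries = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        return self.decoder(queries, memory)        # (B, 10, hidden)

# GPT-2 small with cross-attention layers enabled (supported by the HF implementation).
config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
visual_encoder = FixedLengthTransformer()

def forward_step(resnet_feats, ocr_vectors, input_ids, labels=None):
    # Pre-embedding: prepend the 10 visual vectors to the caption token embeddings.
    word_emb = gpt2.transformer.wte(input_ids)                     # (B, T, 768)
    inputs_embeds = torch.cat([visual_encoder(resnet_feats), word_emb], dim=1)
    if labels is not None:
        # Ignore the loss on the 10 prepended visual positions.
        pad = torch.full(labels.shape[:1] + (10,), -100, dtype=labels.dtype)
        labels = torch.cat([pad, labels], dim=1)
    # Cross-attention: OCR vectors (assumed already projected to 768 dims)
    # play the role of encoder_hidden_states.
    return gpt2(inputs_embeds=inputs_embeds,
                encoder_hidden_states=ocr_vectors,
                labels=labels)

out = forward_step(torch.randn(2, 49, 2048), torch.randn(2, 20, 768),
                   input_ids=torch.randint(0, 50257, (2, 12)),
                   labels=torch.randint(0, 50257, (2, 12)))
print(out.loss)
```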
4.4. Semi-Supervised Learning and Domain Adaptation

Although we can extract abundant image-caption pairs, most figures in scientific articles do not contain a chart, as discussed in Section 3. If we want to reserve enough human-annotated examples for metric-based evaluation, very little data is left for training, especially in the arXiv domain, where we only have hundreds of single-chart images. Therefore, we leverage semi-supervised learning techniques to take advantage of the large amount of unannotated data, and use domain adaptation to transfer to other datasets. Both of these methods rely on a chart classifier, which we introduce first.
                 PMC (Supervised)               PMC (Semi-Supervised)           arXiv (Domain Adaptation)
                 BLEU  ROUGE-L  METEOR  CIDEr   BLEU  ROUGE-L  METEOR  CIDEr    BLEU  ROUGE-L  METEOR  CIDEr
  Base Model     1.66   11.35    2.77    2.76   2.09   11.05    2.91    4.49    3.55   14.10    3.79    8.99
  + OCR          1.97   11.77    3.09    6.00   2.53   11.95    3.50    7.98    4.78   15.88    4.68   15.88
  + GPT-2        3.19   11.66    3.68    1.57   4.47   12.46    4.32   10.30    5.89   14.32    4.92   32.34

Table 1: Results on the PubMed Central (PMC) and arXiv datasets. Supervised: training images are human-labeled single-chart images. Semi-Supervised: training images also include the positive images from the proposed chart classifier. Domain Adaptation: the chart classifier trained on the PMC domain is applied to the arXiv domain to obtain training data for the summarization model. The best results are marked in bold.
Chart Classifier. The key component in getting more training examples is a classifier that can identify single-chart images. We take a ResNet [25] as the visual backbone and use a binary linear classifier on top of the mean-pooled features. Instead of freezing the backbone model as in previous work [20], we fine-tune the classifier with a small learning rate of 10⁻⁴. We find that this standard classifier reaches good results (see the Appendix for details).

Semi-Supervised Learning. In the semi-supervised learning setup, we have labeled data (Section 3) and want to improve performance using unlabeled data. The unlabeled data contains both charts and non-chart images (e.g., model figures in scientific publications and natural images in news). Including these non-chart images in the training data would introduce noise and increase training time. To provide clean data for semi-supervised learning, we filter the unlabeled data with our chart classifier and train the summarization model on the filtered data. In this way, we increase the amount of data and the coverage of topics.

Domain Adaptation. Different from semi-supervised learning, domain adaptation focuses on transferring the labeled dataset to another domain. Naïve transfer without training on the target domain would under-fit the target distribution, and we empirically show its ineffectiveness in the Appendix. To solve this issue, we use an approach similar to the semi-supervised learning setup and train the proposed summarization model on the dataset created by the chart classifier. More specifically, since we have much less labeled data in the arXiv domain, we treat it as the target domain, whereas the PMC data is the source domain. We train the chart classifier on the PMC data and apply it to the images from arXiv papers to obtain a large amount of single-chart images.
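A minimal sketch of such a single-chart classifier, together with the precision-constrained threshold calibration used for the semi-supervised and domain adaptation setups (Section 5.1 and the Appendix), is shown below. The data loaders are omitted, and the helper names and the threshold grid are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# ResNet-101 backbone with a binary linear head on the mean-pooled features.
backbone = resnet101(pretrained=True)
backbone.fc = nn.Linear(backbone.fc.in_features, 1)           # single logit: "is a single chart"
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)  # small LR, backbone not frozen
criterion = nn.BCEWithLogitsLoss()                            # used in the (omitted) training loop

@torch.no_grad()
def calibrate_threshold(model, val_loader, min_precision=0.99):
    """Pick the lowest score threshold whose validation precision stays above min_precision."""
    model.eval()
    scores, labels = [], []
    for images, targets in val_loader:
        scores.append(torch.sigmoid(model(images)).squeeze(1))
        labels.append(targets.float())
    scores, labels = torch.cat(scores), torch.cat(labels)
    best = 1.0
    for threshold in torch.linspace(0.5, 0.999, 100):
        pred = scores >= threshold
        tp = (pred & (labels == 1)).sum().item()
        fp = (pred & (labels == 0)).sum().item()
        if tp + fp > 0 and tp / (tp + fp) >= min_precision:
            best = threshold.item()
            break    # lowest qualifying threshold, i.e. approximately the highest recall
    return best
```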
5. Results

In this section, we evaluate our proposed methods on our collected datasets of two domains: PMC and arXiv. We start by describing the experimental setup and then show results with both automatic metric-based evaluation and human evaluation.

5.1. Experimental Setup

Data Setup. The supervised learning setup is conducted on our annotated PMC dataset. We randomly sample 1,000 charts as the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1.

In order to increase the number of training examples, we apply the proposed semi-supervised learning technique (Section 4.4). The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this binary classifier. After the model converges on the training set, we calibrate the classifier to optimize the recall at a precision over 99% on the validation set. Since we have lots of images, we can afford a lower recall in exchange for high-quality positive examples. We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from the years 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images which serve as additional training examples for the summarization model.

For domain adaptation, we take charts and captions from arXiv as the target domain. As described in Section 3, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the previous semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1.
Model Setup. For the base model, we use a ResNet-101 model from the Torchvision [35] library⁷. We resize the image to 224 × 224, and the backbone model maps it to a 7 × 7 × 2048 feature map. We sort the OCR-extracted texts by their confidence and only keep the top 20 texts for post-processing. Since we want the image positions to correspond to the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 × 224. For the pre-trained GPT-2 [30] model, we download the small GPT-2 model from Hugging Face's Transformers [36]. The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. More implementation and hyper-parameter details can be found in the Appendix.

⁷ https://pytorch.org/docs/stable/torchvision/models.html
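The visual preprocessing described above can be sketched with torchvision as follows: a deterministic resize to 224 × 224 (no random cropping, so OCR positions stay aligned with the image) and the 7 × 7 × 2048 feature map from ResNet-101, flattened to a 49 × 2048 sequence. The ImageNet normalization constants are a standard assumption, not something stated in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Deterministic resize only: no random crop, so OCR box positions stay aligned with the image.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Drop the average-pooling and classification head to keep the 7x7x2048 feature map.
resnet = models.resnet101(pretrained=True)
feature_extractor = nn.Sequential(*list(resnet.children())[:-2]).eval()

@torch.no_grad()
def chart_features(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    fmap = feature_extractor(x)                                   # (1, 2048, 7, 7)
    return fmap.flatten(2).transpose(1, 2)                        # (1, 49, 2048)
```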
           Baseline Better   Final Model Better   Equally Good   Equally Bad
  PMC            20                  70                 3              7
  arXiv          37                  50                 2             11

Table 2: Human study on the results with 100 pairwise comparisons.

                  BLEU   ROUGE-L   METEOR   CIDEr
  All             4.47    12.46     4.32    10.30
  Line Chart      4.44    12.70     4.28    10.18
  Bar Chart       4.77    12.30     4.71     7.14
  Scatter Chart   5.96    16.63     5.39    40.78

Table 3: Results regarding different types of charts.
5.2. Metric-based Evaluation

In order to conduct efficient evaluation, we use automatic language metrics to evaluate our model. We report BLEU [37], ROUGE-L [38], METEOR [39], and CIDEr [40], as in previous image captioning papers. As shown in Table 1, we compare our proposed models (in Section 4.2 and Section 4.3) with the baseline captioning model (in Section 4.1) on both the PMC and arXiv datasets. The model with the OCR text encoder is strictly better than the baseline captioning model on every metric, which indicates that in-chart text understanding is very important for generating good summaries of scientific charts. The integration of the pre-trained language model (GPT-2) further enhances the performance over the OCR encoder results. The pre-trained decoder shows more improvement in the semi-supervised setup, since the model needs enough data to learn the weights in the fixed-length transformer and the cross-attention modules, which bridge the vision encoder and the language decoder.

Note that the CIDEr score of the +GPT-2 model is lower than that of the +OCR model on the PMC dataset under the supervised setup. We find that this is due to the size of the data. The smaller size of the PMC data makes the learned model have a stronger bias towards the original GPT-2 generation. Namely, although the model generates more fluent sentences (reflected in the higher BLEU score), it is biased towards the GPT-2 prior and relies mostly on common words. This bias is penalized by CIDEr's TF-IDF weighting of n-grams, which down-weights frequent words. However, under the semi-supervised setting, the CIDEr score is higher with GPT-2 because of the adequate amount of data. This also demonstrates the usefulness of the proposed semi-supervised approach.
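For reference, a hedged sketch of computing these four metrics with the pycocoevalcap package (the Python port of the COCO caption evaluation toolkit) is shown below. The one-reference-per-chart dictionary format and the choice of BLEU-4 reflect our reading of the setup rather than a released evaluation script.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def evaluate(generated, references):
    """generated/references: dict chart_id -> list of sentences (one reference caption here)."""
    results = {}
    bleu, _ = Bleu(4).compute_score(references, generated)
    results["BLEU-4"] = bleu[3]                  # Bleu returns scores for BLEU-1..4
    for name, scorer in [("ROUGE-L", Rouge()), ("METEOR", Meteor()), ("CIDEr", Cider())]:
        score, _ = scorer.compute_score(references, generated)
        results[name] = score
    return results

generated = {"chart_001": ["fluorescence emission spectrum recorded from the sample"]}
references = {"chart_001": ["fluorescence emission spectrum recorded from the solution"]}
print(evaluate(generated, references))
```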
5.3. Human Evaluation

In order to get a faithful evaluation, we conduct a human evaluation on 100 randomly sampled examples each for PMC and arXiv. The human evaluation is conducted by the authors and their colleagues (4 in total), since this task requires a certain amount of expert knowledge. We use both the base captioning model and our final model (with the OCR encoder and GPT-2 decoder)⁸ to generate two summaries. Each image with the generated summaries from the two models is annotated by all 4 annotators. We randomly shuffle the order of the two summaries and only show A/B labels to the human annotators. The human annotators are asked to choose one of four options: "Both Good", "Both Bad", "A wins", and "B wins". As shown in Table 2, our proposed model significantly outperforms the baseline model on both datasets. Moreover, we find that our annotators have high agreement on which generated sentence is better, since this scientific summarization is mostly about facts and salience.

⁸ The PMC model is from the semi-supervised setup.

6. Analysis

In this section, we provide a fine-grained analysis to illustrate the effectiveness of each component in the proposed pipeline. We first present the results for different chart types and the cross-domain evaluation in Section 6.1. In Section 6.2, we empirically show the advantage of our pre-embedding and cross-attention combination.

6.1. Different Chart Categories

During our data collection, we also let the annotators select the type of each chart (Figure 2). In this paper, we aim for a general chart summarization model that does not rely on the details of each chart type. Here we analyze the performance of the proposed model on each chart category with our final model trained on PMC (Semi-Supervised). In Table 3, we show the results for the three most common chart types (i.e., "Line", "Bar", "Scatter") that have a sufficient amount of data (513 for Line, 400 for Bar, and 57 for Scatter) to support automatic metric-based evaluation. Although line charts contribute the most to the training and test data, their BLEU score is the lowest compared to the results of bar charts and scatter charts. The reason might be that the image features produced by convolutional neural networks (CNNs) are insensitive to the properties (e.g., trends, crossings) of the curved lines. At the same time, the CNN can capture the local intensity of points and thus shows higher results for scatter charts. Based on this observation, we think that using a visual encoder specifically designed for understanding the curved lines in charts might be a promising future direction.
6.2. Pre-Embeddings and Cross-Attention Layers

In Section 4.3, we discuss two ways to connect the visual information to the language decoder: the pre-embedding approach and the additional cross-attention layers. In Table 4, we show the results of different combinations on the PMC (semi-supervised) dataset. "Img" and "OCR" indicate using the image output and the OCR representations as the input to the pre-embedding approach and the cross-attention layers, respectively. "None" means that we do not use that input and thus exclude the corresponding parameters. "Concat" means that we concatenate the image and OCR representations and use the concatenation as the input. We can see that our approach (Img for Pre-Embed and OCR for Cross-Att) is comparable to its reverse (OCR for Pre-Embed and Img for Cross-Att) and is much better than the other alternatives.

  Pre-Embed   Cross-Att   BLEU   ROUGE-L   METEOR   CIDEr
  None        None        1.91    10.59     3.01     0.52
  Concat      None        2.88    11.92     3.79     4.78
  None        Concat      3.64    12.07     3.69     2.91
  Img         OCR         4.47    12.46     4.32    10.30
  OCR         Img         4.46    12.12     4.08    11.18
  Concat      Concat      3.61    12.18     3.76     2.79

Table 4: Comparison of different approaches of connecting the image content and the language decoder.

6.3. Chart Classification Performance

In both the semi-supervised learning and domain adaptation setups, we use a classifier to identify single-chart images from lots of automatically extracted image-caption pairs. Since the images filtered by the classifier will be further used as data augmentation, we take the F1 score as the main metric to balance precision and recall. We start with frozen ResNet-101 [25] features and an additional linear classifier. This setup achieves a 90% F1 score. After fine-tuning the backbone model on our data, the model achieves an F1 score of 94.9%. We also tried adding other neural modules (e.g., attentive modules and detection branches) and enhanced visual backbones, but we did not observe a significant improvement on the test set.

When we use this classifier in the semi-supervised and domain adaptation setups, we calibrate the classification threshold to maintain a precision over 99%, since we have lots of unannotated images. Under this precision level, we achieve a recall of 59.8% and a precision of 99.2%. We keep the same classification threshold and test it on our annotated arXiv test split. The precision and recall are 93.4% and 65.7%, respectively.

7. Conclusions

In this paper, we propose datasets and models for summarizing scientific charts, a specific type of structured images. We construct datasets from PMC and arXiv by leveraging crowdsourcing and the figure captions in the papers. To enable better understanding of the text components in charts and to endow the model with external knowledge, we propose to use an OCR encoder and a pre-trained language decoder on top of a standard image captioning model. In our experiments, we show the effectiveness of our models in terms of both automatic evaluation metrics and human evaluation.

Acknowledgments

The authors thank Bloomberg's AI Engineering team, especially Alakananda Vempala, Ketevan Tsereteli, and Anju Kambadur, for helpful feedback and directions. Additional thanks to the anonymous reviewers for their insights. Hao Tan acknowledges support from Bloomberg's Data Science Ph.D. Fellowship.

References

[1] S. Carberry, S. Elzer, S. Demir, Information graphics: an untapped resource for digital libraries, in: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006, pp. 581–588.
[2] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, Revision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 393–402.
[3] S. Ray Choudhury, C. L. Giles, An architecture for information extraction from figures in digital libraries, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 667–672.
[4] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, Figureseer: Parsing result-figures in research papers, in: European Conference on Computer Vision, Springer, 2016, pp. 664–680.
[5] W. Huang, C. L. Tan, A system for understanding imaged infographics and its applications, in: Proceedings of the 2007 ACM symposium on Document engineering, 2007, pp. 9–18.
[6] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphics textually, Computational Linguistics 38 (2012) 527–574.
[7] Z. Chen, M. Cafarella, E. Adar, Diagramflyer: A search engine for data-driven diagrams, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 183–186.
[8] S. R. Choudhury, S. Wang, C. L. Giles, Scalable algorithms for scholarly figure mining and semantics, in: Proceedings of the International Workshop on Semantic Big Data, 2016, pp. 1–6.
[9] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: European Conference on Computer Vision, Springer, 2016, pp. 235–251.
[10] S. E. Kahou, A. Atkinson, V. Michalski, Á. Kádár, A. Trischler, Y. Bengio, Figureqa: An annotated figure dataset for visual reasoning, in: ICLR Workshop, 2018.
[11] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, M. Rohrbach, Towards vqa models that can read, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326.
[12] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. A. Bateman, Ai2d-rst: A multimodal corpus of 1000 primary school science diagrams, Language Resources and Evaluation (2020) 1–28.
[13] N. Methani, P. Ganguly, M. M. Khapra, P. Kumar, Plotqa: Reasoning over scientific plots, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1527–1536.
[14] J. Poco, J. Heer, Reverse-engineering visualizations: Recovering visual encodings from chart images, in: Computer Graphics Forum, volume 36, Wiley Online Library, 2017, pp. 353–363.
[15] S. Elzer, E. Schwartz, S. Carberry, D. Chester, S. Demir, P. Wu, A browser extension for providing visually impaired users access to the content of bar charts on the web, in: WEBIST (2), Citeseer, 2007, pp. 59–66.
[16] S. Demir, S. Carberry, K. McCoy, Generating textual summaries of bar charts, in: Proceedings of the Fifth International Natural Language Generation Conference, Association for Computational Linguistics, Salt Fork, Ohio, USA, 2008, pp. 7–15. URL: https://www.aclweb.org/anthology/W08-1103.
[17] C. Greenbacker, S. Carberry, K. McCoy, A corpus of human-written summaries of line graphs, in: Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop, Association for Computational Linguistics, Edinburgh, Scotland, 2011, pp. 23–27. URL: https://www.aclweb.org/anthology/W11-2703.
[18] C. Greenbacker, P. Wu, S. Carberry, K. McCoy, S. Elzer, Abstractive summarization of line graphs from popular media, in: Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, Association for Computational Linguistics, Portland, Oregon, 2011, pp. 41–48. URL: https://www.aclweb.org/anthology/W11-0506.
[19] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
[20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International conference on machine learning, 2015, pp. 2048–2057.
[21] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, in: International Conference on Learning Representations, 2016.
[22] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[23] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, J. Gao, Unified vision-language pre-training for image captioning and vqa, in: AAAI, 2019.
[24] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[26] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
[27] R. Smith, An overview of the tesseract ocr engine, in: Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, IEEE, 2007, pp. 629–633.
[28] J. Gu, Z. Lu, H. Li, V. O. Li, Incorporating copying mechanism in sequence-to-sequence learning, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1631–1640.
[29] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083.
[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2019.
[31] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, in: Advances in neural information processing systems, 2019, pp. 5753–5763.
[32] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, JMLR (2019).
[33] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: ACL, 2020.
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[35] S. Marcel, Y. Rodriguez, Torchvision the machine-vision package of torch, in: Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1485–1488.
[36] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.
[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[38] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
[39] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
[40] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[41] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR, 2015.
[42] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), 2019.
A. Implementation Details

The supervised learning setup is conducted on our annotated English PMC dataset from Sec. 3. We keep 1,000 charts in the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1. We train our model on the training set and tune the hyperparameters on the validation set. The test set is only used to report results. We train for 200 epochs on this small dataset. All our code is written in PyTorch, and all experiments converge in 4-5 hours on 1 Titan V GPU.

For the base model, we use a ResNet-101 model from the Torchvision [35] library⁹. We resize the image to 224 × 224, and the backbone model maps it to a 7 × 7 × 2048 feature map. We use 512 dimensions for the LSTM and 256 dimensions for the word embedding. The attentive hidden states have the same size as the hidden states (512 dimensions). We use Adam [41] with a fixed learning rate of 10⁻⁴. The batch size is 64.

For the OCR model, we sort the OCR texts by their confidence and remove empty texts. We keep the top 20 OCR texts for post-processing. We use 512 dimensions for the OCR feature representations (the yellow blocks in Fig. 3). Since we want the image positions to correspond to the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 × 224.

For the pre-trained GPT-2 [30] model, we downloaded the small GPT-2 model (124M parameters) from Hugging Face's Transformers [36]¹⁰. The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. We use Adam [41] with a weight decay of 0.01, following the practice in Devlin et al. [42]. We do not apply weight decay to the layer normalization layers and biases. We use a linear warmup with a peak learning rate of 10⁻⁴. The first 5% of the steps are warmup steps. The batch size is 64.

In order to increase the number of training examples, we apply the proposed semi-supervised learning technique. The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this classifier. The training, validation, and test sets have 5,819, 646, and 1,000 data points, respectively. The data split is the same as in the above supervised learning setup. After the model converges on the training set, we calibrate the classifier to optimize the recall at a precision over 99% on the validation set. Since we have lots of images, we can afford a lower recall in exchange for high-quality positive examples.

We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from the years 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images which serve as additional training examples for the summarization model. The hyper-parameters of the summarization model are the same as the ones used in the supervised setup. For the models trained on this dataset, we use a maximum sequence length of 80 and train for 100 epochs. The other hyperparameters are the same as for the small supervised PMC data for each model.

For domain adaptation, we take charts and captions from English arXiv as the target domain. As described in the dataset section, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the previous semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1. The summarization model is trained on the training data, tuned on the validation data, and finally evaluated on the manually-annotated test set. For the models trained on this dataset, we use a maximum sequence length of 40, since the captions in arXiv are much shorter. Since we halve the maximum sequence length, we train for 200 epochs and thus roughly keep the same computational resources for both datasets.

⁹ https://pytorch.org/docs/stable/torchvision/models.html
¹⁰ https://github.com/huggingface/transformers
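A hedged sketch of the optimizer configuration described above (decoupled weight decay that skips LayerNorm parameters and biases, plus a linear schedule with 5% warmup to a 10⁻⁴ peak learning rate) using PyTorch and the transformers scheduler helper is shown below; the parameter-name filter and the total step count are illustrative assumptions.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, total_steps, peak_lr=1e-4, weight_decay=0.01):
    # Follow the BERT-style practice: no weight decay on biases and LayerNorm parameters.
    no_decay = ("bias", "ln_", "layernorm", "layer_norm")
    decay_params, plain_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (plain_params if any(k in name.lower() for k in no_decay) else decay_params).append(param)
    optimizer = torch.optim.AdamW(
        [{"params": decay_params, "weight_decay": weight_decay},
         {"params": plain_params, "weight_decay": 0.0}],
        lr=peak_lr)
    # Linear warmup over the first 5% of the steps, then linear decay.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),
        num_training_steps=total_steps)
    return optimizer, scheduler
```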
B. Details of Data Collection

The crowdsourcing task was conducted on Appen¹¹. There are 2,263 distinct annotators from 50 countries. Since the task is to classify image types, it does not require native English speakers. The top 5 countries are Venezuela (53%), USA (23%), Egypt (8%), Colombia (2%), and Peru (1.4%). We paid one cent per judgement (image). For the first round of annotation tasks, the Fleiss' kappa scores for the "whether it's a single chart" and "chart type" tasks are 0.56 and 0.73 respectively, which shows significant agreement.

¹¹ client.appen.com

C. Additional Analysis

C.1. Cross-Domain Transferability

To illustrate the need for the domain adaptation led by the chart classifier (in Sec. 4.4), we show the low cross-domain transferability of models in this section. Table 5 reports the results of our final model when it is trained on one domain and evaluated on the test set of each domain (the columns indicate the training dataset and the rows the test dataset). The model does not transfer well between different domains, probably because of the different figuring and captioning conventions of different communities. The different topics also introduce diverging vocabularies.

           Trained on PMC                       Trained on arXiv
  Test     BLEU   ROUGE-L   METEOR   CIDEr      BLEU   ROUGE-L   METEOR   CIDEr
  PMC      4.47    12.46     4.32    10.30      0.06     8.19     1.93     0.63
  arXiv    0.22    10.11     3.25     1.43      5.89    14.32     4.92    32.34

Table 5: The transferability of our captioning model across different domains. The columns indicate the training dataset while the rows indicate the testing dataset. The PMC training data is augmented with filtered charts (in Sec. 4.4) and the arXiv training data is built by the chart classifier. All test data are human-annotated.
D. Ethical Considerations

The technique developed in this paper could help automatically summarize news, articles, and publications in which charts are involved. It could also help visually impaired people understand the content of charts. It would fail in cases where the OCR detector misses key information in the charts, which would lead to unfaithful summarization of the chart. Since we use a pre-trained language decoder in our final model, the generated summarization might be biased towards the pre-training domain of the language decoder. Regarding the dataset collection, we resolved all legal and license issues for the PMC dataset before showing the images to annotators. More specifically, we only use articles with CC BY licenses from the Open Access Subset of the PMC data. For the arXiv data, a small test set was annotated by the authors.