
Scientific Chart Summarization: Datasets and Improved Text Modeling

Hao Tan

Chen-tse Tsai

Yujie He

Mohit Bansal

Bloomberg, USA
University of North Carolina at Chapel Hill, USA

Chart figures usually convey the key message in a multimodal document. Understanding charts automatically and making charts more accessible becomes indispensable in the information era. In this paper, we study the chart summarization problem in which the goal is to generate sentences that describe the salient information in a chart image. To obtain training examples, we leverage image-caption pairs in multiple scientific areas. We create a dataset of single-chart images from research papers in PubMed Central (PMC) and arXiv. Most recent vision-and-language works focus on natural images. Several challenges in structured images such as charts are under-explored. One key property of charts is that the text components (e.g., legends and axis names) carry important information. In our proposed model, we not only use a standard visual encoder but also a text encoder to encode a chart image. The visual and textual representations are connected to a large pre-trained language decoder via pre-embedding and cross-attention approaches, respectively. Experimental results show that the proposed model is significantly better than an image captioning baseline.

Keywords: Chart Summarization, Multimodal Learning, Document Understanding, Image Captioning, Natural Language Processing

1. Introduction

Information graphics, such as line charts and bar charts, are essential and common components of a document. Charts are usually used for visually summarizing important information that a document intends to convey. Moreover, as shown in the study of Carberry et al. [1], information graphics in magazines and newspapers often convey messages that are not repeated in the text. Therefore, summarizing the primary message in a chart is an important step towards understanding a multimodal document. Potential applications of chart summarization include indexing information content for a search engine, making charts accessible for individuals with eyesight impairments, and simplifying the dissemination of technical visual information to a layperson.

Image captioning, which can be viewed as generating summaries for an image, has seen considerable success recently. However, this research has mostly focused on natural images, while other types of images (e.g., the structured images shown in Fig. 2) are under-explored. On the other hand, abstractive text summarization models have also been greatly improved by the development of neural network models. However, these models only look at the text component of a document. In this work, we focus on the less-studied yet important task of 'chart summarization', where we want to generate a salient summary for structural charts. First, to obtain a large quantity of summaries of chart images, we leverage captions in scientific articles. Unlike magazines or newspapers, in which image captions could be less descriptive, captions in scientific papers tend to be more detailed and verbose. We build a chart summarization dataset from the papers in arXiv and PubMed Central (PMC) by assuming that captions are salient summaries of chart figures. Image captions in these data sources are written by the corresponding paper's authors, and hence would be more natural in the language format. Since these articles also contain figures other than charts, we create crowdsourcing tasks to select single-chart images and collect these charts' detailed types (e.g., line chart, bar chart, etc.).

Different from traditional captioning for natural images, there are two main challenges from the language perspective when the target images are charts: (1) Besides visual content, charts usually also contain text (e.g., legends and axis titles) which carries significant information about the components in charts. (2) Charts are likely to be used in specific domains, so the language generation model may suffer from rare-word issues.

To address these two challenges, we first use an optical character recognition (OCR) model to detect the text boxes in the charts. An OCR embedding layer is proposed to encode these extracted texts with their position information into vectors, and these vector representations are treated as another input to the language decoder through a cross-attention mechanism. Secondly, to endow the decoder with domain-specific knowledge, we use a large pre-trained language decoder instead of training it from scratch. The chart information is connected to this pre-trained language decoder via two approaches: pre-embedding and cross-attention. We empirically find that using pre-embedding for visual content and cross-attention for OCR representations gives the best results.

We apply our models on our collected datasets of two scientific domains. We conduct both metric-based automatic evaluation and human-annotated qualitative evaluation. Experimental results show that our model with the integration of OCR and a pre-trained language model significantly outperforms the baseline image captioning model. We also show ablation studies that illustrate the effectiveness of our proposed methods.

a Equal contribution.
b Work done during an internship at Bloomberg.
The Second Workshop on Scientific Document Understanding at AAAI 2022.
© 2022 Bloomberg Finance L.P. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Related Work

Most work on understanding chart images involves chart type classification. Savva et al. [2] classify given chart images into 10 chart categories using an SVM classifier with visual bag-of-words and text-region features. With a similar model, Ray Choudhury and Giles [3] proposed a binary classifier to determine whether an image is a line chart. Siegel et al. [4] experimented with CNN-based models for classifying images they extracted from scholarly articles. In order to identify chart figures for training our summarization model, we build a binary classifier to identify common charts (e.g., line charts, bar charts, scatter plots, etc.).

There is a line of works on interpreting text components in chart images [5, 6, 7, 8, 9, 4, 10, 11, 12, 13]. One of the applications here is to recover visual encodings for purposes of indexing and search. For example, Poco and Heer [14] proposed an end-to-end text analysis pipeline that identifies text elements in a chart image, determines their bounding boxes, and classifies their role in the chart (e.g., x-axis label, x-axis title, legend title, etc.). They also proposed a CNN model that classifies the type of graphical mark (e.g., bars or lines). We simply use a general-purpose OCR tool for recognizing text in chart images and focus more on the text generation model. These better text analysis models could potentially improve our model performance, which we leave for future investigation. Kahou et al. [10] introduce FigureQA, a visual reasoning corpus of question-answer pairs over synthetic chart images. Instead of answering questions on synthetic charts, we aim at directly summarizing real chart images.

There are some earlier works on chart summarization. Elzer et al. [15] proposed SIGHT, a system that summarizes bar charts for visually impaired users. The system identifies one of the twelve message categories that can be conveyed by a bar chart and produces a logical form. This logic representation is then translated into natural language via templates. Demir et al. [16] built on top of SIGHT. The proposed system first uses rules to identify an additional set of propositions that may reflect some information in a bar chart. These propositions are then organized and structured by a bottom-up planner. Finally, a surface realizer is applied to produce natural language summaries.

Greenbacker et al. [17] built a corpus of human-written English summaries of line graphs. They selected 23 line graphs and asked annotators to summarize the most important information in each graph. As this process is difficult to scale up, we take the captions of chart images in scientific papers to represent the summaries instead. Greenbacker et al. [18] further used this corpus and proposed an abstractive summarization system for line charts. The system uses a Bayesian network to classify the intents of line segments, and then rules are applied to identify additional important informational propositions conveyed by the line graph. The sets of intents and propositions are pre-defined from the study of the corpus. They left the final step of generating a natural language summary from propositions as future work. Therefore, no evaluation results were shown.

A common challenge of these earlier works is that they are limited to a fixed set of propositions and need to convert the selected propositions to natural language. Instead of using a pipeline with hand-crafted intents and propositions, we propose to leverage an end-to-end neural network, which has been shown to be powerful in generating coherent and grammatical sentences in the context of image captioning and abstractive text summarization.

Another thread of related works is (natural) image captioning, which tries to generate descriptions for natural images. Vinyals et al. [19] first illustrate the end-to-end encoder-decoder architecture and Xu et al. [20] extend it with attention modules. Ranzato et al. [21] use reinforcement learning to eliminate exposure bias, but this requires a large amount of data to reduce the high variance. Anderson et al. [22] take object-level information to enable fine-grained visual understanding. However, we empirically found that the detection features for natural images do not work well for charts (structural images). Previous vision-and-language pre-training works, e.g., VLP [23] and OSCAR [24], use a pre-trained vision-and-language model to improve image captioning but require a large in-domain corpus and heavy pre-training.

3. Datasets Creation

We create our datasets based on image-caption pairs that appear in public scientific papers. Different from the figures in magazines or newspapers, where the captions could be less descriptive, figure captions in scientific articles tend to convey the key message of figures. The assumption here is that captions written by the paper authors represent the most salient information in the figures and can therefore serve as summaries of the corresponding figures. The overview of our datasets creation pipeline is shown in Figure 1. We consider two data sources: arXiv (https://arxiv.org/) and PMC (https://www.ncbi.nlm.nih.gov/pmc/). ArXiv is a free distribution service and an open-access archive for scholarly articles in fields such as physics, computer science, and mathematics. PMC is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine. We take articles in the Open Access Subset (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). These two data sources are chosen because they both provide structural data in addition to the PDF files. That is, we can obtain image-caption pairs by parsing the LaTeX source files provided by arXiv or the XML files provided by PMC. We write our own LaTeX parser for the arXiv data, and use a public PubMed parser (https://github.com/titipata/pubmed_parser) for parsing XML information.

Although we can extract lots of image-caption pairs, most of the figures in these papers are not charts. Hence, to be able to train and evaluate the proposed chart summarization model, we need to identify which figures are charts. In this work, we focus on the 5 common chart types: line, bar, scatter, pie, and area charts (Figure 2). Moreover, we further focus on the simplest case, where images only contain a single chart. Figures with multiple charts or with any non-chart component are considered negative images in this work. In the following sections, we describe how we obtain single-chart and chart-type annotations.

3.1. PubMed Central Data

For PMC data, we create a crowdsourcing task to annotate whether a given image contains a single chart. We randomly sample 50,000 images from the papers published from 2011 to 2019. For each image, we ask annotators whether it is a single-chart figure. If the answer is yes, the annotators are required to select a chart type from line, bar, scatter, pie, area, or other chart. Since this task is pretty simple, we ask two annotators to label each image in the first round. In most cases, the two annotators agree on the labels. More specifically, the Fleiss' kappa scores for the "whether it's a single chart" and "chart type" tasks are 0.56 and 0.73 respectively, which shows significant agreement (https://en.wikipedia.org/wiki/Fleiss%27_kappa).

If there is a disagreement on either the single-chart label or the chart type, we further ask three other annotators to perform a second round of annotation on these images. Finally, majority vote is applied to resolve conflicts among all five annotators. We note that single charts with the "other" chart type are considered negative images in our experiments.

Among the 50,000 images, we obtain 7,397 positive images (single chart), including 3,681 line charts, 3,088 bar charts, 478 scatter charts, 125 pie charts, and 25 area charts. The positive ratio of the charts is about 13%. This low ratio is because most of the figures in scientific articles are non-chart figures (e.g., model architecture diagrams). In this work, we only use chart types in analyzing model performance. That is, chart type information is not included explicitly in model training.

3.2. ArXiv Data

We also build another dataset from the arXiv data. We take papers in the Computer Vision, Computation and Language, Machine Learning, Artificial Intelligence, and Neural and Evolutionary Computing fields from 2008 to 2020. Because of the copyright issue, we cannot put arXiv images on a public crowdsourcing platform. Instead, the authors went through and annotated 2,000 randomly sampled figures with the same crowdsourcing interface that
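The agreement figures reported in Section 3.1 are Fleiss' kappa scores. As a rough illustration of how such a score can be computed from annotation counts, the sketch below implements the standard Fleiss' kappa formula in plain Python. The function name `fleiss_kappa` and the toy count table are illustrative assumptions; they are not taken from the paper's annotation pipeline.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who put item i into category j.
    Every item must be rated by the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_items = len(counts)
    n_categories = len(counts[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        # Degenerate case: every rating falls into a single category.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (hypothetical data): two annotators answer
# "is this a single chart?" (yes, no) for four images.
toy_counts = [[2, 0], [0, 2], [2, 0], [1, 1]]
kappa = fleiss_kappa(toy_counts)
```

In the toy table, three of the four images receive identical labels, and the resulting kappa is 7/15 (about 0.47) because chance agreement is already high with only two categories.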

4. Methodology

In this section, we introduce the proposed models and training strategies for the chart summarization task. In this task, the model needs to generate a sequence of words $\{w_t\}$ describing the contents of a chart $I$. We start by introducing the basic captioning model. To enhance in-image text understanding and endow the model with external knowledge, we incorporate an OCR encoder and a pre-trained language decoder. Lastly, we propose a simple semi-supervised learning and domain adaptation approach using a chart classifier.

4.1. Base Model

Our base model is adopted from the attentive encoder-decoder model for image captioning proposed in Xu et al. [20]. A ResNet-101 [25] visual feature extractor encodes the chart $I$ into a $7 \times 7 \times 2048$ dimensional feature map, where each vector in the feature map corresponds to a grid region of the image. The feature map is then flattened to a $49 \times 2048$ feature sequence $\{f_i\}$:

$$\{f_i\}_{i=1}^{49} = \mathrm{ResNet}(I)$$

At each decoding step $t$, the LSTM [26] language decoder outputs the hidden output $h_t$ and cell $c_t$ by reading the embedded previous token $\tilde{w}_{t-1}$. An attention module (denoted as $\mathrm{Att}_{h \to f}$) then attends to the feature sequence $\{f_i\}$ with the hidden output $h_t$ as a query. The context $\hat{f}_t$ and the hidden vector $h_t$ are merged into an attentive hidden vector $\hat{h}_t$ with a fully-connected layer:

$$\tilde{w}_{t-1} = \mathrm{embedding}(w_{t-1})$$
$$h_t, c_t = \mathrm{LSTM}(\tilde{w}_{t-1}, h_{t-1}, c_{t-1})$$
$$\hat{f}_t = \mathrm{Att}_{h \to f}(h_t, \{f_i\})$$
$$\hat{h}_t = \tanh(W_1 [\hat{f}_t ; h_t] + b_1)$$

The probability of generating the $k$-th token at time step $t$ is the softmax over a linear transformation of the attentive hidden vector $\hat{h}_t$. The loss $\mathcal{L}_t$ is the negative log likelihood of the ground truth token $w_t^*$:

$$p_t(w_{t,k}) = \mathrm{softmax}_k(W_w \hat{h}_t + b_w)$$
$$\mathcal{L}_t = -\log p_t(w_t^*)$$

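To make the decoding step of Section 4.1 more concrete, the following is a minimal pure-Python sketch of the attention, the tanh merge into an attentive hidden vector, and the token-level negative log-likelihood. It is only an illustration under stated assumptions: the attention here is plain dot-product attention (which requires the query and the features to share a dimension), whereas the paper follows Xu et al. [20] and does not spell out the scoring function; the LSTM update is omitted and the hidden state is taken as given; all function names and the tiny weight matrices are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, features: Sequence[Vector]) -> Vector:
    """Dot-product attention (an assumption): weighted sum of the features."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]


def affine(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Compute weight @ x + bias for small dense matrices."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def attentive_hidden(h_t: Vector, features: Sequence[Vector], w1: Matrix, b1: Vector) -> Vector:
    """Merge the attended context and h_t: tanh(W1 [context ; h_t] + b1)."""
    context = attend(h_t, features)
    return [math.tanh(v) for v in affine(w1, b1, context + h_t)]


def token_nll(h_hat: Vector, w_out: Matrix, b_out: Vector, target: int) -> float:
    """Negative log-likelihood of the target token under softmax(W h_hat + b)."""
    probs = softmax(affine(w_out, b_out, h_hat))
    return -math.log(probs[target])


# Tiny hypothetical example: 2-d hidden state, two 2-d grid features, 3-word vocabulary.
h_t = [0.0, 0.0]
grid = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[0.0] * 4 for _ in range(2)]   # maps the 4-d concatenation back to 2-d
b1 = [0.0, 0.0]
w_out = [[0.0] * 2 for _ in range(3)]
b_out = [0.0, 0.0, 0.0]

h_hat = attentive_hidden(h_t, grid, w1, b1)
loss = token_nll(h_hat, w_out, b_out, target=1)
```

With all-zero weights the predicted distribution is uniform over the three tokens, so the loss equals log 3; a trained model would of course use learned weights and the 49 ResNet grid features.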
4.2. Text Understanding

Different from natural image captioning, the summarization of charts heavily relies on the understanding of text inside the images. However, the ResNet visual encoder (in Section 4.1) is insensitive to the text in the images (as shown in Singh et al. [11] as well), so we need to build a pipeline to extract the text information from the images. Specifically, we first use Tesseract [27] to extract a sequence of texts $\mathrm{text}_j$ with their positions $\mathrm{pos}_j$ from the image $I$, where $m$ denotes the number of detected text boxes:

$$\{(\mathrm{text}_j, \mathrm{pos}_j)\}_{j=1}^{m} = \mathrm{OCR}(I) \quad (1)$$

Since the characters in charts are usually in a small font and sometimes blurred with the chart content, the copy mechanism [28, 29] that directly brings the text into the final summarization does not provide good results. We instead use a shallow text embedding layer to project the OCR text to dense vector representations, which denoises the OCR detection results. We also encode the position of the OCR text along with the text representation, since the spatial information indicates the properties of the text (e.g., in the legend, in the title, or inside the chart):

$$o_j = \mathrm{Emb}_{\mathrm{text}}(\mathrm{text}_j) + W_{\mathrm{pos}}\, \mathrm{pos}_j \quad (2)$$

These OCR representations are treated as another view of the charts, and the language decoder simultaneously attends to the OCR information $\{o_j\}$ and the visual image features $\{f_i\}$. The final hidden output $\tilde{h}_t$ is calculated based on the concatenation of the visually attended vector $\tilde{f}_t$, the OCR attended vector $\tilde{o}_t$, and the hidden state $h_t$:

$$\tilde{f}_t = \mathrm{Att}_{h \to f}(h_t, \{f_i\}) \quad (3)$$
$$\tilde{o}_t = \mathrm{Att}_{h \to o}(h_t, \{o_j\}) \quad (4)$$
$$\tilde{h}_t = \tanh(W_2 [\tilde{f}_t, \tilde{o}_t, h_t] + b_2) \quad (5)$$

We next replace the original attentive hidden vector $\hat{h}_t$ (in Sec. 4.1) with this OCR-enhanced hidden output $\tilde{h}_t$ in the succeeding decoding steps.

[Figure 3: Overview of the proposed model. Labeled components: ResNet, Tokenizer, Text and Position embeddings for OCR entries, Fixed-Len Transformer, Word Embedding, and Pre-trained Language Decoder. The example summary begins "Fluorescence emission spectrum recorded from the ...".]

4.3. Pre-trained Language Decoder

When summarizing charts in news or scientific papers, a faithful description of the chart contents also relies on external knowledge, and hence a pre-trained language decoder might help the generation. Figure 3 illustrates our model, which integrates the pre-trained language decoder GPT-2 [30]. (The method could also be applied to other pre-trained language decoders such as XLNet [31], T5 [32], and BART [33].) As described in the previous section, we have two image encoders (i.e., the ResNet encoder and the OCR text encoder) to process the image content and the image text respectively. The ResNet encoder maps the features into a squared feature map (the purple vector blocks in Figure 3) where each vector corresponds to a part of the image content. We view this feature map as a sequence of vectors (as in Section 4.1) in the following procedures. The OCR encoder (Eq. 1) maps the chart into a sequence of recognized words and their positions on the chart. The OCR embedding layer (Eq. 2) adds the word embedding and the position encoding into one vector for each OCR entry (the yellow vectors in Figure 3).

In order to connect the visual and textual information from the image to the language decoder, we adopt two ways: appending pre-embeddings and adding cross-attention layers. The pre-embedding approach concatenates the sequence of visual vectors before the word embeddings, so the language decoder takes this concatenation as input (e.g., the concatenation of red blocks and blue blocks in Figure 3). The cross-attention approach adds cross-attention layers [34] inside the language decoder to fuse visual information. The cross-attention layers contain residual short-cut connections, so the decoder still benefits from the pre-trained weights with these additional layers.

As shown in Figure 3, we use the pre-embedding approach for the features from the visual image content (i.e., from the ResNet encoder) and use the cross-attention layers for the OCR texts. The idea of this specific design is that the generation is led by the image content and uses the OCR information to generate concrete words. We empirically find that it is the best combination to fuse information into the language decoder, and we show the comparison in Section 6.2. In detail, the length of the ResNet feature map is 49 and the order of the features is not aligned with the positional encoding in the pre-trained language decoder. We thus do not directly append it before the word embedding but use a fixed-length transformer to map it to a sequence of 10 vectors (the red blocks in Figure 3; we only draw 3 vectors for simplicity). The fixed-length transformer is built from transformer decoder layers [34] with only positional embedding (without word embedding). We use only 1 layer in our experiments.

4.4. Semi-Supervised Learning and Domain Adaptation

Although we can extract abundant image-caption pairs, most figures in scientific articles do not contain a chart, as we discussed in Section 3. If we want to reserve enough human-annotated examples for metric-based evaluation, that leaves very little data for training, especially for the arXiv domain, in which we only have hundreds of single-chart images. Therefore, we leverage semi-supervised learning techniques to take advantage of large unannotated data and use domain adaptation to transfer to other datasets. Both of these methods rely on a chart classifier, which we introduce first.

Chart Classifier. The key component in getting more training examples is a classifier that can identify single-chart images. We take the ResNet [25] as the visual backbone and use a binary linear classifier after the mean-pooled features. Instead of freezing the backbone model as in previous works [20], we fine-tune the classifier with a small learning rate, $10^{-4}$. We find that this standard classifier reaches good results (see Appendix for details).

Semi-Supervised Learning. In the semi-supervised learning setup, we have labeled data (Section 3) and we want to improve the performance using the unlabeled data. The unlabeled data contains both charts and non-chart images (e.g., model figures in scientific publications and natural images in news). Including these non-chart images in the training data would introduce noise and thus increase training time. To provide clean data in semi-supervised learning, we filter the unlabeled data with our chart classifier and train the summarization model on the filtered data. In this way, we increase the amount of data and the coverage of topics.

Domain Adaptation. Different from semi-supervised learning, domain adaptation focuses on transferring from the labeled dataset to another domain. Naïve transferring without training on the target domain would under-fit the target distribution, and we empirically show its ineffectiveness in the Appendix. To solve this issue, we use an approach similar to the semi-supervised learning, which trains the proposed summarization model on the dataset created by the chart classifier. More specifically, since we have far fewer labeled charts in the arXiv domain, we treat it as the target domain, whereas PMC data is the source domain. We train the chart classifier on the PMC data and apply it to the images from arXiv papers to obtain a large amount of single-chart images.

5. Results

In this section, we evaluate our proposed methods on our collected datasets of two domains: PMC and arXiv. We start by describing the experiment setups and then show results with both automatic metric-based evaluation and human evaluation.

5.1. Experimental Setup

Data Setup. The supervised learning setup is conducted on our annotated PMC dataset. We randomly sample 1,000 charts as the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1.

In order to increase the number of training examples, we apply the proposed semi-supervised learning technique (Section 4.4). The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from PMC data to build this binary classifier. After the model converges on the training set, we calibrate the classifier to optimize the recall with a precision over 99% on the validation set. Since we have lots of images, we can afford a lower recall for high-quality positive examples. We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images, which serve as additional training examples for the summarization model.

For domain adaptation, we take charts and captions from arXiv as the target domain. As described in Section 3, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the previous semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1.
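The data setup above can be sketched in a few lines of Python. This is only one possible procedure that happens to reproduce the reported split sizes (hold out the test set first, then give one tenth of the remainder, rounded down, to validation); the paper does not state the exact rounding or shuffling it used, and the function name `split_dataset` and the fixed seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    examples: Sequence[T], n_test: int, seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle, hold out n_test examples, then split the rest 9:1 into train/val."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = len(rest) // 10                # one tenth, rounded down, for validation
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


# PMC supervised setup: 7,465 annotated single-chart images, 1,000 held out for testing.
pmc_train, pmc_val, pmc_test = split_dataset(list(range(7465)), n_test=1000)

# arXiv domain adaptation: 22,044 classifier-selected images, no test hold-out here
# because the 370 manually annotated charts serve as the test set.
arxiv_train, arxiv_val, _ = split_dataset(list(range(22044)), n_test=0)
```

Under these assumptions the PMC split has 5,819 training and 646 validation examples, and the arXiv split has 19,840 and 2,204, matching the sizes reported in the text.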

Model Setup. For the base model, we use a ResNet-101 model from the Torchvision [35] library (https://pytorch.org/docs/stable/torchvision/models.html). We resize the image to 224 × 224, and the backbone model maps it to 7 × 7 × 2048 vectors. We sort the OCR-extracted texts by their confidence and only keep the top 20 texts for post-processing. Since we want the image position to be related to the OCR position, we do not apply random resizing and cropping but directly resize the chart to 224 × 224. For the pre-trained GPT-2 [30] model, we downloaded the small GPT-2 model from Hugging Face's Transformers [36]. The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. More implementation and hyper-parameter details can be found in the Appendix.

5.2. Metric-based Evaluation

In order to conduct efficient evaluation, we use automatic language metrics to evaluate our model. We report BLEU [37], ROUGE-L [38], METEOR [39], and CIDEr [40] as in previous image captioning papers. As shown in Table 1, we compare our proposed models (in Section 4.2 and Section 4.3) with the baseline captioning model (in Section 4.1) on both the PMC and arXiv datasets. The model with the OCR text encoder is strictly better than the baseline captioning model on every metric, which indicates that in-chart text understanding is very important for generating good summaries of scientific charts. The integration of the pre-trained language model (GPT-2) further enhances the performance over the OCR encoder results. The pre-trained decoder shows more improvement in the semi-supervised setup, since the model needs enough data to learn the weights in the fixed-length transformer and the cross-attention modules, which bridge the vision encoder and the language decoder.

Note that the CIDEr score of the +GPT-2 model is lower than that of the +OCR model on the PMC dataset under the supervised setup. We find that this is due to the size of the data. The smaller size of the PMC data makes the learned model have a stronger bias towards the original GPT-2 generation. Namely, although the model generates more fluent sentences (reflected in the high BLEU score), it is biased towards the GPT-2 prior by leveraging mostly common words. This bias is captured by the CIDEr metric's over-weighting protocol. However, under the semi-supervised setting, the CIDEr score is higher with GPT-2 because of the adequate amount of data. This also demonstrates the usefulness of the proposed semi-supervised approach.

[Table 1: Automatic evaluation of the Base Model, + OCR, and + GPT-2 variants (BLEU, ROUGE-L, METEOR, CIDEr) on PMC and arXiv. The table body is not recoverable from the source; the only surviving values are 7, 11, 4.32, 4.28, 4.71, and 5.39, without row or column labels.]

5.3. Human Evaluation

In order to get a faithful evaluation, we conduct a human evaluation on 100 randomly sampled examples for PMC and arXiv. The human evaluation is conducted by the authors and their colleagues (4 in total), since this task requires a certain expert knowledge. We use both the base captioning model and our final model (with OCR encoder and GPT-2 decoder; the PMC model uses the semi-supervised setup) to generate two summaries. Each image with the generated summaries from the two models is annotated by all 4 annotators. We randomly shuffle the order of these two summaries and only show the A/B labels to the human annotators. The human annotators are asked to choose one of four options: "Both Good", "Both Bad", "A wins", and "B wins". As shown in Table 2, our proposed model significantly outperforms the baseline model on both datasets. Moreover, we find that our annotators have a high agreement on which generated sentence is better, since this scientific summarization is mostly about facts and salience.

Table 2
Human evaluation results (the label of the third column is missing in the source).

          Baseline Better   Final Model Better   (unlabeled)
PMC       20                70                   3
arXiv     37                50                   2

6. Analysis

In this section, we provide a fine-grained analysis to illustrate the effectiveness of each component in the proposed pipeline. We first demonstrate the results for different chart types and cross-domain evaluation in Section 6.1. In Section 6.2, we empirically show the advantage of our pre-embedding and cross-attention combination.

6.1. Different Chart Categories

During our data collection, we also let the annotators select the type of the chart (Figure 2). In this paper, we aim for a general chart summarization model that does not rely on the details of each chart type. We here analyze the performance of the proposed model on each chart category with our final model trained on PMC (semi-supervised). In Table 3, we show the results of the three most common chart types (i.e., "Line", "Bar", "Scatter"), which have a sufficient amount of data (513 for Line, 400 for Bar, and 57 for Scatter) to support automatic metric-based evaluation. Although the line charts contribute the most to the training and test data, their BLEU score is the lowest compared to the results for bar charts and scatter charts. The reason might be that the image features produced by convolutional neural networks (CNNs) are insensitive to the properties (e.g., trends, crossings) of the curved lines. At the same time, the CNN can capture the local intensity of points and thus shows higher results for scatter charts. Based on this observation, we think that using visual encoders that are specifically designed for understanding the curved lines in charts might be a promising future direction.

Table 3
Results regarding different types of charts.

                BLEU    ROUGE-L
All             4.47    12.46
Line Chart      4.44    12.70
Bar Chart       4.77    12.30
Scatter Chart   5.96    16.63

Table 4
Input configurations compared in Section 6.2. Only the configuration labels survive in the source; the metric values are missing, and the assignment of the two label rows to Pre-Embed and Cross-Att is inferred from their order.

Pre-Embed   Cross-Att
None        None
Concat      None
None        Concat
Img         OCR
OCR         Img
Concat      Concat

6.2. Pre-Embeddings and Cross-Attention Layers

In Section 4.3, we discuss two ways to connect the visual information to the language decoder: the pre-embedding approach and the additional cross-attention layers. In Table 4, we show the results of different combinations on the PMC (semi-supervised) dataset. "Img" and "OCR" indicate using the image output or the OCR representations as the input to the pre-embedding approach or the cross-attention layers. "None" means that we do not use the input and thus exclude the corresponding parameters. "Concat" means that we concatenate the image output and the OCR representations and use the result as the input. We can see that our approach (Img for Pre-Embed and OCR for Cross-Att) is comparable to its reverse (OCR for Pre-Embed and Img for Cross-Att) and is much better than the other alternatives.
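As a small illustration of the pre-embedding idea compared in this section, the sketch below prepends a short sequence of visual vectors to the word embeddings and records which positions correspond to words. It is a hedged toy example rather than the authors' implementation: the function name `build_decoder_input`, the boolean word mask, and the tiny vector sizes are all assumptions, and a real system would pass the concatenated sequence to the pre-trained decoder as input embeddings.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def build_decoder_input(
    visual_prefix: Sequence[Vector], word_embeddings: Sequence[Vector]
) -> Tuple[List[Vector], List[bool]]:
    """Concatenate visual vectors before word embeddings.

    Returns the combined sequence and a mask that is True at word positions,
    which is where a language-modeling loss would typically be computed
    (treating visual positions as ignored is an assumption of this sketch).
    """
    if visual_prefix and word_embeddings:
        if len(visual_prefix[0]) != len(word_embeddings[0]):
            raise ValueError("visual and word vectors must share one dimension")

    sequence = [list(v) for v in visual_prefix] + [list(w) for w in word_embeddings]
    is_word = [False] * len(visual_prefix) + [True] * len(word_embeddings)
    return sequence, is_word


# Hypothetical sizes: 10 visual vectors (as produced by the fixed-length
# transformer in Section 4.3) and 3 word embeddings, all of dimension 4.
visual = [[0.1] * 4 for _ in range(10)]
words = [[0.2] * 4 for _ in range(3)]
sequence, is_word = build_decoder_input(visual, words)
```

The decoder then sees a sequence of 13 vectors, with the 10 visual vectors occupying the first positions. The cross-attention alternative would instead leave the input sequence as the word embeddings alone and expose the extra vectors through additional attention layers.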

6.3. Chart Classification Performance In both the semi-supervised learning and domain adap

tion setup, we use a classifier to identify single-chart images from lots of automatically extracted image-caption pairs. Since the images filtered by the classifier will be further used as data augmentation, we take the 1 score as the main metric to balance the precision and recall. We start with the frozen ResNet-101 [25] features with an additional linear classifier. This setup achieves 90% 1 score. After fine-tuning the backbone model on our data, the model achieves an 1 score of 94.9%. We also tried adding other neural modules (e.g., attentive modules and detection branches) and enhanced visual backbones but we do not observer a significant result improvement on the test set.

When we use this classifier in the semi-supervised and domain adaptation setups, we calibrate the classification threshold to maintain a precision over 99%, since we have lots of unannotated images. At this precision level, we achieve a recall of 59.8% and a precision of 99.2%. We keep the same classification threshold and test it on our annotated arXiv test split, where the precision and recall are 93.4% and 65.7%, respectively.
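The threshold calibration described above can be sketched as a simple search over candidate thresholds. This is an illustrative sketch only: the paper does not specify the search procedure, so the exhaustive scan over observed scores, the helper names (`precision_recall`, `f1`, `calibrate_threshold`), and the toy scores below are all assumptions.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scores, labels, min_precision=0.99):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if no
    threshold satisfies the constraint."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision and (best is None or r > best[2]):
            best = (t, p, r)
    return best


if __name__ == "__main__":
    # Toy validation scores and labels (hypothetical, not from the paper).
    scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
    labels = [1, 1, 0, 1, 1, 0]
    print(calibrate_threshold(scores, labels, min_precision=0.99))
```

Applying `f1` to the reported operating point (precision 99.2%, recall 59.8%) gives an F1 of about 0.746, which illustrates how much recall is traded away to keep the augmented examples clean.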

7. Conclusions

In this paper, we propose datasets and models for summarizing scientific charts, a specific type of structured image. We construct datasets from PMC and arXiv by leveraging crowdsourcing and the figure captions in the papers. To better understand the text components in charts and to endow the model with external knowledge, we propose to use an OCR encoder and a pre-trained language decoder on top of a standard image captioning model. In our experiments, we show the effectiveness of our models in terms of both automatic evaluation metrics and human evaluation.

Acknowledgments

The authors thank Bloomberg’s AI Engineering team, especially Alakananda Vempala, Ketevan Tsereteli, and Anju Kambadur, for helpful feedback and directions. Additional thanks to the anonymous reviewers for their insights. Hao Tan acknowledges support from Bloomberg’s Data Science Ph.D. Fellowship.

References

[2] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, Revision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 393–402.

[3] S. Ray Choudhury, C. L. Giles, An architecture for information extraction from figures in digital libraries, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 667–672.

[4] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, Figureseer: Parsing result-figures in research papers, in: European Conference on Computer Vision, Springer, 2016, pp. 664–680.

[5] W. Huang, C. L. Tan, A system for understanding imaged infographics and its applications, in: Proceedings of the 2007 ACM symposium on Document engineering, 2007, pp. 9–18.

[6] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphics textually, Computational Linguistics 38 (2012) 527–574.

[7] Z. Chen, M. Cafarella, E. Adar, Diagramflyer: A search engine for data-driven diagrams, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 183–186.

[8] S. R. Choudhury, S. Wang, C. L. Giles, Scalable algorithms for scholarly figure mining and semantics, in: Proceedings of the International Workshop on Semantic Big Data, 2016, pp. 1–6.

[9] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: European Conference on Computer Vision, Springer, 2016, pp. 235–251.

[10] S. E. Kahou, A. Atkinson, V. Michalski, Á. Kádár, A. Trischler, Y. Bengio, Figureqa: An annotated figure dataset for visual reasoning, in: ICLR Workshop, 2018.

[11] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, M. Rohrbach, Towards vqa models that can read, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326.

[12] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. A. Bateman, Ai2d-rst: A multimodal corpus of 1000 primary school science diagrams, Language Resources and Evaluation (2020) 1–28.

[13] N. Methani, P. Ganguly, M. M. Khapra, P. Kumar, Plotqa: Reasoning over scientific plots, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1527–1536.

[14] J. Poco, J. Heer, Reverse-engineering visualizations: Recovering visual encodings from chart images, in: Computer Graphics Forum, volume 36, Wiley Online Library, 2017, pp. 353–363.

[15] S. Elzer, E. Schwartz, S. Carberry, D. Chester, S. Demir, P. Wu, A browser extension for providing visually impaired users access to the content of bar charts on the web, in: WEBIST (2), Citeseer, 2007, pp. 59–66.

[16] S. Demir, S. Carberry, K. McCoy, Generating textual summaries of bar charts, in: Proceedings of the Fifth International Natural Language Generation Conference, Association for Computational Linguistics, Salt Fork, Ohio, USA, 2008, pp. 7–15. URL: https://www.aclweb.org/anthology/W08-1103.

[17] C. Greenbacker, S. Carberry, K. McCoy, A corpus of human-written summaries of line graphs, in: Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop, Association for Computational Linguistics, Edinburgh, Scotland, 2011, pp. 23–27. URL: https://www.aclweb.org/anthology/W11-2703.

[18] C. Greenbacker, P. Wu, S. Carberry, K. McCoy, S. Elzer, Abstractive summarization of line graphs from popular media, in: Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, Association for Computational Linguistics, Portland, Oregon, 2011, pp. 41–48. URL: https://www.aclweb.org/anthology/W11-0506.

[19] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.

[20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International conference on machine learning, 2015, pp. 2048–2057.

[21] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, in: International Conference on Learning Representations, 2016.

[22] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.

[23] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, J. Gao, Unified vision-language pre-training for image captioning and vqa, in: AAAI, 2019.

[24] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.

[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[26] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.

[27] R. Smith, An overview of the tesseract ocr engine, in: Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, IEEE, 2007, pp. 629–633.

[28] J. Gu, Z. Lu, H. Li, V. O. Li, Incorporating copying mechanism in sequence-to-sequence learning, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1631–1640.

[29] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083.

[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2019).

[31] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, in: Advances in neural information processing systems, 2019, pp. 5753–5763.

[32] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, JMLR (2019).

[33] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: ACL, 2020.

[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[35] S. Marcel, Y. Rodriguez, Torchvision the machine-vision package of torch, in: Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1485–1488.

[36] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.

[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, 2002, pp. 311–318.

[38] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.

[39] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.

[40] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.

[41] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR, 2015.

[42] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), 2019.

The supervised learning setup is conducted on our annotated English PMC dataset described in Sec. 3. We keep 1,000 charts in the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1. We train our model on the training set and tune the hyperparameters on the validation set. The test set is only used to report results. We train for 200 epochs on this small dataset. All our code is written in PyTorch and all experiments converge in 4 to 5 hours on 1 Titan V GPU.

For the base model, we use a ResNet-101 model from the Torchvision [35] library (see footnote 9). We resize the image to 224 x 224 and the backbone model maps it to a 7 x 7 x 2048 feature map. We use 512 dimensions for the LSTM and 256 dimensions for the word embedding. The attentive hidden states have the same size as the hidden states (512 dimensions). We use the Adam [41] optimizer with a fixed learning rate of 10^-4. The batch size is 64.

For the OCR model, we sort the OCR texts by their confidence and remove the empty texts. We keep the top 20 OCR texts for post-processing. We use 512 dimensions for the OCR feature representations (yellow blocks in Fig. 3). Since we want the image positions to be related to the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 x 224.

For the pre-trained GPT-2 [30] model, we downloaded the small GPT-2 model (124M parameters) from Hugging Face’s Transformers [36] (see footnote 10). The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. We use the Adam [41] optimizer with a weight decay of 0.01, following the practice in Devlin et al. [42]. We do not apply weight decay to the layer normalization layers and biases. We use a linear warmup with a peak learning rate of 10^-4. The first 5% of the steps are warmup steps. The batch size is 64.

In order to increase the number of training examples, we apply the proposed semi-supervised learning technique. The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this classifier. The training, validation, and test sets have 5,819, 646, and 1,000 data points, respectively. The data split is the same as in the supervised learning setup above. After the model converges on the training set, we calibrate the classifier to optimize the recall with a precision over 99% on the validation set. Since we have lots of images, we can afford a lower recall in exchange for high-quality positive examples.

We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images, which serve as additional training examples for the summarization model. The hyperparameters of the summarization model are the same as the ones used in the supervised setup. For the models trained on this dataset, we use a max sequence length of 80 and train for 100 epochs. The other hyperparameters are the same as on the small supervised PMC data for each model.

For domain adaptation, we take charts and captions from English arXiv as the target domain. As described in the dataset section, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the previous semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1. The summarization model is trained on the training data, tuned on the validation data, and finally evaluated on the manually annotated test set. For the models trained on this dataset, we use a max sequence length of 40 since the captions in arXiv are much shorter. Since we halve the max sequence length, we train for 200 epochs and thus roughly keep the same computational resources for both datasets.

B. Details of Data Collection

The crowdsourcing task is conducted on Appen (see footnote 11). There are 2,263 distinct annotators from 50 countries. Since the task is to classify image types, it does not require native English speakers. The top 5 countries are Venezuela (53%), USA (23%), Egypt (8%), Colombia (2%), and Peru (1.4%). We paid one cent per judgement (image). For the first round of annotation tasks, the Fleiss’ kappa scores for the “whether it’s a single chart” and “chart type” tasks are 0.56 and 0.73, respectively, which shows fairly strong agreement.

C. Additional Analysis

C.1. Cross-Domain Transferability

To illustrate the need for the domain adaptation led by the chart classifier (in Sec. 4.4), we show the low cross-domain transferability of models in this section. Each row in Table 5 shows the results of our final model trained on the designated dataset, while each column shows the evaluation results on the corresponding test set. The model does not transfer well between different domains, probably because of the different figure and captioning conventions of different communities. The different topics also introduce diverging vocabularies.

9 https://pytorch.org/docs/stable/torchvision/models.html

10 https://github.com/huggingface/transformers

11 client.appen.com

D. Ethical Considerations

The technique developed in this paper would help automatically summarize news, articles, and publications in which charts are involved. It would also help visually impaired people understand the content of charts. It would fail when the OCR detector misses key information in a chart, which would lead to an unfaithful summary of the chart. Since we use a pre-trained language decoder in our final model, the generated summaries might be biased towards the pre-training domain of the language decoder. Regarding the dataset collection, we resolved all legal and licensing issues for the PMC dataset before showing the data to annotators. More specifically, we only use articles with CC BY licenses from the Open Access Subset of the PMC data. For the arXiv data, a small test set was annotated by the authors.

[1] Carberry, Elzer, Demir, Information graphics: an untapped resource for digital libraries, in: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006, pp. 581–588.

[2] Savva, Kong, Chhajta, Fei-Fei, Agrawala, Heer, Revision: Automated classi-