Scientific Chart Summarization: Datasets and Improved Text Modeling

Hao Tan¹ᵃᵇ, Chen-Tse Tsai²ᵃ, Yujie He²ᵃ and Mohit Bansal¹
¹ University of North Carolina at Chapel Hill, USA
² Bloomberg, USA
ᵃ Equal contribution. ᵇ Work done during an internship at Bloomberg.
The Second Workshop on Scientific Document Understanding at AAAI 2022. © 2022 Bloomberg Finance L.P. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Chart figures usually convey the key message in a multimodal document. Understanding charts automatically and making charts more accessible has become indispensable in the information era. In this paper, we study the chart summarization problem, in which the goal is to generate sentences that describe the salient information in a chart image. To obtain training examples, we leverage image-caption pairs in multiple scientific areas. We create a dataset of single-chart images from research papers in PubMed Central (PMC) and arXiv. Most recent vision-and-language works focus on natural images, and several challenges in structured images such as charts are under-explored. One key property of charts is that the text components (e.g., legends and axis names) carry important information. In our proposed model, we not only use a standard visual encoder but also a text encoder to encode a chart image. The visual and textual representations are connected to a large pre-trained language decoder via pre-embedding and cross-attention approaches, respectively. Experimental results show that the proposed model is significantly better than an image captioning baseline.

Keywords
Chart Summarization, Multimodal Learning, Document Understanding, Image Captioning, Natural Language Processing

1. Introduction

Information graphics, such as line charts and bar charts, are essential and common components of a document. Charts are usually used for visually summarizing important information that a document intends to convey. Moreover, as shown in the study of Carberry et al. [1], information graphics in magazines and newspapers often convey messages that are not repeated in the text. Therefore, summarizing the primary message in a chart is an important step towards understanding a multimodal document. Potential applications of chart summarization include indexing information content for a search engine, making charts accessible for individuals with visual impairments, and simplifying the dissemination of technical visual information to a layperson.

We have recently seen the success of image captioning, which can be viewed as generating summaries for an image. However, this research has mostly focused on natural images, while other types of images (e.g., structured images shown in Fig. 2) are under-explored. On the other hand, abstractive text summarization models have also been greatly improved due to the development of neural network models. However, these models only look at the text component of a document. In this work, we focus on the less-studied yet important task of chart summarization, where we want to generate a salient summary for structured charts.

First, to obtain a large quantity of summaries of chart images, we leverage captions in scientific articles. Unlike magazines or newspapers, in which image captions can be less descriptive, captions in scientific papers tend to be more detailed and verbose. We build a chart summarization dataset from papers in arXiv and PubMed Central (PMC) by assuming that captions are salient summaries of chart figures. Image captions in these data sources are written by the corresponding paper's authors, and hence tend to be natural language. Since these articles also contain figures other than charts, we create crowdsourcing tasks to select single-chart images and to collect these charts' detailed types (e.g., line chart, bar chart, etc.).

Different from traditional captioning of natural images, there are two main challenges from the language perspective when the target images are charts: (1) besides visual content, charts usually also contain text (e.g., legends and axis titles) that carries significant information about the chart components; (2) charts are likely to be used in specific domains, so the language generation model may suffer from rare-word issues.
To address these two challenges, we first use an optical character recognition (OCR) model to detect the text boxes in the charts. An OCR embedding layer is proposed to encode these extracted texts with their position information into vectors, and these vector representations are treated as another input to the language decoder through a cross-attention mechanism. Second, to endow the decoder with domain-specific knowledge, we use a large pre-trained language decoder instead of training one from scratch. The chart information is connected to this pre-trained language decoder via two approaches: pre-embedding and cross-attention. We empirically find that using pre-embedding for the visual content and cross-attention for the OCR representations gives the best results.

We apply our models to our collected datasets of two scientific domains. We conduct both metric-based automatic evaluation and human-annotated qualitative evaluation. Experimental results show that our model with the integration of OCR and a pre-trained language model significantly outperforms the baseline image captioning model. We also show ablation studies that illustrate the effectiveness of the proposed methods.

Figure 1: Pipeline of dataset creation. We first sample scientific papers from arXiv and PubMed Central, and then extract image-caption pairs by parsing the source LaTeX or XML files. Finally, crowdsourcing is applied to annotate whether an image contains a single chart and the corresponding chart type.

2. Related Work

Most work on understanding chart images involves chart type classification. Savva et al. [2] classify given chart images into 10 chart categories using an SVM classifier with visual bag-of-words and text-region features. With a similar model, Ray Choudhury and Giles [3] proposed a binary classifier to determine whether an image is a line chart. Siegel et al. [4] experimented with CNN-based models for classifying images extracted from scholarly articles. In order to identify chart figures for training our summarization model, we build a binary classifier to identify common charts (e.g., line charts, bar charts, scatter plots, etc.).

There is a line of work on interpreting text components in chart images [5, 6, 7, 8, 9, 4, 10, 11, 12, 13]. One of the applications here is to recover visual encodings for the purposes of indexing and search. For example, Poco and Heer [14] proposed an end-to-end text analysis pipeline that identifies text elements in a chart image, determines their bounding boxes, and classifies their roles in the chart (e.g., x-axis label, x-axis title, legend title, etc.). They also proposed a CNN model that classifies the type of graphical mark (e.g., bars or lines). We simply use a general-purpose OCR tool for recognizing text in chart images and focus more on the text generation model. These better text analysis models could potentially improve our model performance, which we leave for future investigation. Kahou et al. [10] introduce FigureQA, a visual reasoning corpus of question-answer pairs over synthetic chart images. Instead of answering questions on synthetic charts, we aim at directly summarizing real chart images.
There are some earlier works on chart summarization. Elzer et al. [15] proposed SIGHT, a system that summarizes bar charts for visually impaired users. The system identifies one of twelve message categories that can be conveyed by a bar chart and produces a logical form. This logical representation is then translated into natural language via templates. Demir et al. [16] built on top of SIGHT. Their system first identifies, by rules, an additional set of propositions that may reflect some information in a bar chart. These propositions are then organized and structured by a bottom-up planner. Finally, a surface realizer is applied to produce natural language summaries.

Greenbacker et al. [17] built a corpus of human-written English summaries of line graphs. They selected 23 line graphs and asked annotators to summarize the most important information in each graph. As this process is difficult to scale up, we instead take the captions of chart images in scientific papers to represent the summaries. Greenbacker et al. [18] further used this corpus and proposed an abstractive summarization system for line charts. The system uses a Bayesian network to classify the intents of line segments, and then rules are applied to identify additional important informational propositions conveyed by the line graph. The sets of intents and propositions are pre-defined from a study of the corpus. They left the final step of generating natural language summaries from propositions as future work, so no evaluation results were shown.

A common challenge of these earlier works is that they are limited to a fixed set of propositions and need to convert the selected propositions into natural language. Instead of using a pipeline with hand-crafted intents and propositions, we propose to leverage an end-to-end neural network, which has been shown to be powerful for generating coherent and grammatical sentences in the context of image captioning and abstractive text summarization.

Another thread of related work is (natural) image captioning, which tries to generate descriptions for natural images. Vinyals et al. [19] first illustrate the end-to-end encoder-decoder architecture, and Xu et al. [20] extend it with attention modules. Ranzato et al. [21] use reinforcement learning to eliminate exposure bias, but it requires a large amount of data to reduce the high variance. Anderson et al. [22] use object-level information to enable fine-grained visual understanding. However, we empirically found that detection features for natural images do not work well for charts (structured images). Previous vision-and-language pre-training approaches, e.g., VLP [23] and OSCAR [24], use pre-trained vision-and-language models to improve image captioning but require a large in-domain corpus and heavy pre-training.

3. Datasets Creation

We create our datasets based on image-caption pairs that appear in public scientific papers. Different from figures in magazines or newspapers, where the captions can be less descriptive, figure captions in scientific articles tend to convey the key message of the figures. The assumption here is that captions written by the paper authors represent the most salient information in the figures and therefore can serve as summaries of the corresponding figures. The overview of our dataset creation pipeline is shown in Figure 1. We consider two data sources: arXiv¹ and PMC.² ArXiv is a free distribution service and an open-access archive for scholarly articles in fields such as physics, computer science, and mathematics. PMC is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine. We take articles in the Open Access Subset.³ These two data sources are chosen because they both provide structured data in addition to the PDF files. That is, we can obtain image-caption pairs by parsing the LaTeX source files provided by arXiv or the XML files provided by PMC. We write our own LaTeX parser for the arXiv data, and use a public PubMed parser⁴ for parsing the XML information.

Although we can extract many image-caption pairs, most of the figures in these papers are not charts. Hence, to be able to train and evaluate the proposed chart summarization model, we need to identify which figures are charts. In this work, we focus on the 5 common chart types: line, bar, scatter, pie, and area charts (Figure 2). Moreover, we further focus on the simplest case where images contain only a single chart. Figures with multiple charts or with any non-chart component are considered negative images in this work. In the following sections, we describe how we obtain the single-chart and chart-type annotations.

¹ https://arxiv.org/
² https://www.ncbi.nlm.nih.gov/pmc/
³ https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
⁴ https://github.com/titipata/pubmed_parser
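Neither parser is described in detail in the paper; the following is a minimal sketch of how single-image figure environments and their captions could be paired from arXiv LaTeX sources. The regular expressions and the function name are illustrative, not the authors' released code.

```python
import re

# Minimal sketch: pair \includegraphics files with \caption text from a LaTeX
# source string. Only simple, single-image figure environments are handled;
# multi-panel figures are treated as negatives, as in the paper.
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")
CAPTION_RE = re.compile(r"\\caption\{((?:[^{}]|\{[^{}]*\})*)\}", re.DOTALL)

def extract_image_caption_pairs(latex_source: str):
    """Return (image_path, caption) pairs found in figure environments."""
    pairs = []
    for body in FIGURE_RE.findall(latex_source):
        images = GRAPHICS_RE.findall(body)
        captions = CAPTION_RE.findall(body)
        if len(images) == 1 and captions:          # keep single-image figures only
            pairs.append((images[0], " ".join(captions[0].split())))
    return pairs
```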
3.1. PubMed Central Data

For the PMC data, we create a crowdsourcing task to annotate whether a given image contains a single chart. We randomly sample 50,000 images from the papers published from 2011 to 2019. For each image, we ask annotators whether it is a single-chart figure. If the answer is yes, the annotators are required to select a chart type from line, bar, scatter, pie, area, or other chart. Since this task is fairly simple, we ask two annotators to label each image in the first round. In most cases, the two annotators agree on the labels. More specifically, the Fleiss' kappa scores for the "whether it is a single chart" and "chart type" tasks are 0.56 and 0.73 respectively, which shows significant agreement.⁵

If there is a disagreement on either the single-chart label or the chart type, we further ask three additional annotators to perform a second round of annotation on these images. Finally, majority vote is applied to resolve conflicts among all five annotators. We note that single charts with the "other" chart type are considered negative images in our experiments.

Among the 50,000 images, we obtain 7,397 positive images (single chart), including 3,681 line charts, 3,088 bar charts, 478 scatter charts, 125 pie charts, and 25 area charts. The positive ratio is about 13%. This low ratio is because most of the figures in scientific articles are non-chart figures (e.g., model architecture diagrams). In this work, we only use chart types when analyzing model performance; that is, chart type information is not included explicitly in model training.

⁵ https://en.wikipedia.org/wiki/Fleiss%27_kappa
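As an illustration of how the reported agreement numbers can be computed, the sketch below uses the fleiss_kappa implementation from statsmodels; the library choice is an assumption (the paper does not name a tool), and the tiny `labels` array is hypothetical.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Illustrative agreement computation for the first annotation round
# (two annotators per image). `labels` is a hypothetical (n_images, 2) array
# of chart-type answers; the paper reports kappa = 0.73 for this question.
labels = np.array([
    ["line", "line"],
    ["bar",  "bar"],
    ["line", "scatter"],
    ["pie",  "pie"],
])
counts, _ = aggregate_raters(labels)   # (n_images, n_categories) count table
print(fleiss_kappa(counts))
```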
3.2. ArXiv Data

We also build another dataset from the arXiv data. We take papers in the Computer Vision, Computation and Language, Machine Learning, Artificial Intelligence, and Neural and Evolutionary Computing fields from 2008 to 2020. Because of copyright issues, we cannot put arXiv images on a public crowdsourcing platform. Instead, the authors went through and annotated 2,000 randomly sampled figures with the same crowdsourcing interface that we use for annotating the PMC data. This results in 370 single-chart images.

Figure 2: Example charts with the corresponding chart types from the PubMed Central dataset. The dataset we build contains the 5 most common chart types.

4. Methodology

In this section, we introduce the proposed models and training strategies for the chart summarization task. In this task, the model needs to generate a sequence of words $\{w_i\}$ describing the contents of a chart $x$. We start by introducing the basic captioning model. To enhance in-image text understanding and endow external knowledge, we incorporate an OCR encoder and a pre-trained language decoder. Lastly, we propose a simple semi-supervised learning and domain adaptation approach using a chart classifier.

4.1. Base Model

Our base model is adapted from the attentive encoder-decoder model for image captioning proposed in Xu et al. [20]. A ResNet-101 [25] visual feature extractor encodes the chart into a $7 \times 7 \times 2048$-dimensional feature map, where each vector in the feature map corresponds to a grid region of the image. The feature map is then flattened into a $49 \times 2048$ feature sequence $\{f_i\}$:

$\{f_i\}_{i=1}^{49} = \mathrm{ResNet}(x)$

At each decoding step $t$, the LSTM [26] language decoder outputs the hidden state $h_t$ and cell state $c_t$ by reading the previous word $w_{t-1}$ and the states $(h_{t-1}, c_{t-1})$. The attention module (denoted as $\mathrm{Att}_{h \to f}$) then attends to the feature sequence $\{f_i\}$ with the hidden output $h_t$ as the query. The context $\hat{f}_t$ and the hidden vector $h_t$ are merged into an attentive hidden vector $\hat{h}_t$ with a fully-connected layer:

$\tilde{w}_{t-1} = \mathrm{embedding}(w_{t-1})$
$h_t, c_t = \mathrm{LSTM}(\tilde{w}_{t-1}, h_{t-1}, c_{t-1})$
$\hat{f}_t = \mathrm{Att}_{h \to f}(h_t, \{f_i\})$
$\hat{h}_t = \tanh(W_1 [\hat{f}_t ; h_t] + b_1)$

The probability of generating the $k$-th token at time step $t$ is the softmax over a linear transformation of the attentive hidden state $\hat{h}_t$. The loss $\mathcal{L}_t$ is the negative log-likelihood of the ground-truth token $w_t^*$:

$p_t(w_{t,k}) = \mathrm{softmax}_k(W_w \hat{h}_t + b_w)$
$\mathcal{L}_t = -\log p_t(w_t^*)$
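The PyTorch sketch below shows one decoding step of this base model under the dimensions reported in the Appendix (512-d LSTM, 256-d word embeddings, 2048-d ResNet features). The dot-product attention and the layer names are illustrative simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDecoderStep(nn.Module):
    """One step of the attentive LSTM captioner (Sec. 4.1), simplified."""
    def __init__(self, vocab_size, d_word=256, d_hid=512, d_feat=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_word)
        self.lstm = nn.LSTMCell(d_word, d_hid)
        self.query = nn.Linear(d_hid, d_feat)          # project h_t for dot-product attention
        self.merge = nn.Linear(d_feat + d_hid, d_hid)  # W_1, b_1
        self.out = nn.Linear(d_hid, vocab_size)        # W_w, b_w

    def forward(self, w_prev, state, feats):
        # w_prev: (B,) previous token ids; feats: (B, 49, 2048) flattened ResNet map
        h, c = self.lstm(self.embedding(w_prev), state)
        scores = torch.bmm(feats, self.query(h).unsqueeze(2)).squeeze(2)   # (B, 49)
        alpha = F.softmax(scores, dim=1)
        f_hat = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)            # context f̂_t
        h_hat = torch.tanh(self.merge(torch.cat([f_hat, h], dim=1)))       # attentive hidden
        return F.log_softmax(self.out(h_hat), dim=1), (h, c)               # log p_t, new state
```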
4.2. Text Understanding

Different from natural image captioning, the summarization of charts heavily relies on understanding the text inside the images. However, the ResNet visual encoder (in Section 4.1) is insensitive to the text in the images (as also shown in Singh et al. [11]), so we need to build a pipeline that extracts the text information from the images. Specifically, we first use Tesseract [27] to extract a sequence of $m$ texts $\mathit{text}_j$ with their positions $\mathit{pos}_j$ from the image $x$:

$\{(\mathit{text}_j, \mathit{pos}_j)\}_{j=1}^{m} = \mathrm{OCR}(x) \quad (1)$

Since the characters in charts are usually in small fonts and sometimes blurred with the chart content, a copy mechanism [28, 29] that directly brings the text into the final summary does not provide good results. We instead use a shallow text embedding layer to project the OCR text into dense vector representations, which denoises the OCR detection results. We also encode the position of each OCR box along with the text representation, since the spatial information indicates the properties of the text (e.g., in the legend, in the title, or inside the chart):

$g_j = \mathrm{Emb}_{\mathrm{text}}(\mathit{text}_j) + W_{\mathrm{pos}} \, \mathit{pos}_j \quad (2)$

These OCR representations are treated as another view of the chart, and the language decoder simultaneously attends to the OCR information $\{g_j\}$ and the visual image features $\{f_i\}$. The final hidden output $\tilde{h}_t$ is calculated from the concatenation of the visually attended vector $\tilde{f}$, the OCR-attended vector $\tilde{g}$, and the hidden state $h_t$:

$\tilde{f} = \mathrm{Att}_{h \to f}(h_t, \{f_i\}) \quad (3)$
$\tilde{g} = \mathrm{Att}_{h \to g}(h_t, \{g_j\}) \quad (4)$
$\tilde{h}_t = \tanh(W_2 [\tilde{f} ; \tilde{g} ; h_t] + b_2) \quad (5)$

We then replace the original attentive hidden state $\hat{h}_t$ (in Sec. 4.1) with this OCR-enhanced hidden output $\tilde{h}_t$ in the succeeding decoding steps.

Figure 3: Illustration of the proposed chart summarization model. We have two branches of image encoding: (1) the visual branch via the ResNet and fixed-length transformer, and (2) the text branch via the OCR system and the OCR embedding layer. The outputs of these two branches are then fused into the pre-trained language decoder by pre-embedding (concatenation) and a cross-attention layer, respectively. The grey boxes are neural networks.
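A minimal sketch of this OCR branch, assuming the pytesseract wrapper around Tesseract and a simple word-level tokenization. The confidence-based top-20 filtering follows the model setup in Section 5.1; the function and layer names are illustrative.

```python
import pytesseract
from PIL import Image
import torch.nn as nn

def extract_ocr(path, top_k=20):
    """Return [(text, (x, y), confidence)] for the most confident OCR boxes."""
    img = Image.open(path)
    w, h = img.size
    d = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    boxes = [
        (d["text"][i], (d["left"][i] / w, d["top"][i] / h), float(d["conf"][i]))
        for i in range(len(d["text"])) if d["text"][i].strip()   # drop empty detections
    ]
    boxes.sort(key=lambda b: b[2], reverse=True)                 # sort by OCR confidence
    return boxes[:top_k]

class OCREmbedding(nn.Module):
    """Word embedding plus projected (x, y) position, as in Eq. 2."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d_model)   # Emb_text
        self.pos = nn.Linear(2, d_model)                # W_pos over normalized (x, y)

    def forward(self, token_ids, positions):
        # token_ids: (B, m) OCR token ids; positions: (B, m, 2) normalized coordinates
        return self.word(token_ids) + self.pos(positions)        # g_j
```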
4.3. Pre-trained Language Decoder

When summarizing charts in news or scientific papers, a faithful description of the chart contents also relies on external knowledge, and hence a pre-trained language decoder might help the generation. Figure 3 illustrates our model, which integrates the pre-trained language decoder GPT-2 [30].⁶ As described in the previous sections, we have two image encoders (i.e., the ResNet encoder and the OCR text encoder) to process the image content and the in-image text, respectively. The ResNet encoder maps the image into a square feature map (the purple vector blocks in Figure 3) where each vector corresponds to a part of the image content; we view this feature map as a sequence of vectors (as in Section 4.1) in the following procedures. The OCR system (Eq. 1) maps the chart into a sequence of recognized words and their positions on the chart, and the OCR embedding layer (Eq. 2) adds the word embedding and the position encoding into one vector for each OCR entry (the yellow vectors in Figure 3).

In order to connect this visual and textual information from the image to the language decoder, we adopt two approaches: appending pre-embeddings and adding cross-attention layers. The pre-embedding approach concatenates the sequence of visual vectors before the word embeddings, so the language decoder takes this concatenation as input (e.g., the concatenation of red blocks and blue blocks in Figure 3). The cross-attention approach adds cross-attention layers [34] inside the language decoder to fuse the visual information. The cross-attention layers contain residual shortcut connections, so the decoder still benefits from the pre-trained weights despite these additional layers.

As shown in Figure 3, we use the pre-embedding approach for the features from the visual image content (i.e., from the ResNet encoder) and use the cross-attention layers for the OCR texts. The idea behind this specific design is that the generation should be led by the image content and should use the OCR information to generate concrete words. We empirically find that this is the best combination for fusing information into the language decoder, and we show the comparison in Section 6.2. In detail, the length of the ResNet feature sequence is 49 and the order of the features is not aligned with the positional encoding in the pre-trained language decoder. We thus do not directly prepend it to the word embeddings, but use a fixed-length transformer to map it to a sequence of 10 vectors (the red blocks in Figure 3; we only draw 3 vectors for simplicity). The fixed-length transformer is built from transformer decoder layers [34] with only positional embeddings (without word embeddings). We use only 1 layer in our experiments.

⁶ The method could also be applied to other pre-trained language decoders such as XLNet [31], T5 [32], and BART [33].
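The sketch below shows one way to wire these two branches into GPT-2 with the Hugging Face transformers library: cross-attention layers are enabled via `add_cross_attention=True`, and a one-layer fixed-length transformer maps the 49 ResNet vectors onto 10 learned query slots that are prepended to the word embeddings. Projection sizes and module names are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class ChartSummarizer(nn.Module):
    """Sketch of the fusion in Sec. 4.3: visual slots via pre-embedding,
    OCR embeddings via GPT-2 cross-attention."""
    def __init__(self, d_feat=2048, d_ocr=512, n_slots=10):
        super().__init__()
        cfg = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2", config=cfg)
        d_model = cfg.n_embd                                     # 768 for GPT-2 small
        self.vis_proj = nn.Linear(d_feat, d_model)
        self.ocr_proj = nn.Linear(d_ocr, d_model)
        self.slots = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)  # positional queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.fixed_len = nn.TransformerDecoder(layer, num_layers=1)      # fixed-length transformer

    def forward(self, resnet_feats, ocr_embeds, input_ids, labels):
        # resnet_feats: (B, 49, 2048); ocr_embeds: (B, m, 512); input_ids/labels: (B, T)
        B = input_ids.size(0)
        vis = self.fixed_len(self.slots.unsqueeze(0).expand(B, -1, -1),
                             self.vis_proj(resnet_feats))        # (B, 10, 768)
        words = self.gpt2.transformer.wte(input_ids)             # GPT-2 word embeddings
        inputs = torch.cat([vis, words], dim=1)                  # pre-embedding (concatenation)
        pad = torch.full((B, vis.size(1)), -100,                 # ignore visual slots in the loss
                         dtype=labels.dtype, device=labels.device)
        return self.gpt2(inputs_embeds=inputs,
                         encoder_hidden_states=self.ocr_proj(ocr_embeds),  # cross-attended OCR
                         labels=torch.cat([pad, labels], dim=1))
```

Because the newly added cross-attention weights are randomly initialized while the rest of GPT-2 is pre-trained, the residual shortcut connections mentioned above let the decoder start from essentially its pre-trained behavior.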
4.4. Semi-Supervised Learning and Domain Adaptation

Although we can extract abundant image-caption pairs, most figures in scientific articles do not contain a chart, as discussed in Section 3. If we want to reserve enough human-annotated examples for metric-based evaluation, very little data is left for training, especially for the arXiv domain, in which we only have hundreds of single-chart images. Therefore, we leverage semi-supervised learning techniques to take advantage of the large amount of unannotated data, and use domain adaptation to transfer to other datasets. Both methods rely on a chart classifier, which we introduce first.

Chart Classifier. The key component in getting more training examples is a classifier that can identify single-chart images. We take ResNet [25] as the visual backbone and use a binary linear classifier on top of the mean-pooled features. Instead of freezing the backbone model as in previous works [20], we fine-tune the classifier with a small learning rate of 10⁻⁴. We find that this standard classifier reaches good results (see the Appendix for details).

Semi-Supervised Learning. In the semi-supervised learning setup, we have labeled data (Section 3) and want to improve performance using unlabeled data. The unlabeled data contains both charts and non-chart images (e.g., model figures in scientific publications and natural images in news). Including these non-chart images in the training data would introduce noise and increase training time. To provide clean data in semi-supervised learning, we filter the unlabeled data with our chart classifier and train the summarization model on the filtered data. In this way, we increase both the amount of data and the coverage of topics.

Domain Adaptation. Different from semi-supervised learning, domain adaptation focuses on transferring the labeled dataset to another domain. Naïve transfer without training on the target domain would under-fit the target distribution, and we empirically show its ineffectiveness in the Appendix. To solve this issue, we use an approach similar to the semi-supervised learning one, which trains the proposed summarization model on a dataset created by the chart classifier. More specifically, since we have far fewer labeled charts in the arXiv domain, we treat it as the target domain whereas the PMC data is the source domain. We train the chart classifier on the PMC data and apply it to the images from arXiv papers to obtain a large number of single-chart images.
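A sketch of the chart classifier and of the threshold calibration used to keep precision above 99% on the validation set. The calibration routine is an illustrative reconstruction (the paper does not spell out the procedure), and `val_probs`/`val_labels` are hypothetical arrays of validation scores and labels.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models

# ResNet-101 backbone fine-tuned end-to-end with a binary head and a small LR.
backbone = models.resnet101(pretrained=True)
backbone.fc = nn.Linear(backbone.fc.in_features, 1)          # single-chart logit
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)  # no frozen layers

def calibrate_threshold(val_probs, val_labels, min_precision=0.99):
    """Highest-recall score threshold whose validation precision stays >= min_precision."""
    order = np.argsort(-val_probs)                 # sort predictions by confidence
    probs, labels = val_probs[order], val_labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    ok = np.where(precision >= min_precision)[0]
    return probs[ok[-1]] if len(ok) else 1.0       # keep images scored above this value
```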
Table 1: Results on the PubMed Central (PMC) and arXiv datasets. Supervised: training images are human-labeled single-chart images. Semi-Supervised: training images also include the positive images from the proposed chart classifier. Domain Adaptation: the chart classifier trained on the PMC domain is applied to the arXiv domain to obtain training data for the summarization model. The best results are marked in bold.

             | PMC (Supervised)              | PMC (Semi-Supervised)         | arXiv (Domain Adaptation)
             | BLEU  ROUGE-L  METEOR  CIDEr  | BLEU  ROUGE-L  METEOR  CIDEr  | BLEU  ROUGE-L  METEOR  CIDEr
Base Model   | 1.66  11.35    2.77    2.76   | 2.09  11.05    2.91    4.49   | 3.55  14.10    3.79    8.99
+ OCR        | 1.97  11.77    3.09    6.00   | 2.53  11.95    3.50    7.98   | 4.78  15.88    4.68    15.88
+ GPT-2      | 3.19  11.66    3.68    1.57   | 4.47  12.46    4.32    10.30  | 5.89  14.32    4.92    32.34

5. Results

In this section, we evaluate our proposed methods on our collected datasets of two domains: PMC and arXiv. We start by describing the experimental setup and then show results with both automatic metric-based evaluation and human evaluation.

5.1. Experimental Setup

Data Setup. The supervised learning setup is conducted on our annotated PMC dataset. We randomly sample 1,000 charts as the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1.

In order to increase the number of training examples, we apply the proposed semi-supervised learning technique (Section 4.4). The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this binary classifier. After the model converges on the training set, we calibrate the classifier to optimize the recall at a precision over 99% on the validation set. Since we have many images, we can afford a lower recall for high-quality positive examples. We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from the years 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images, which serve as additional training examples for the summarization model.

For domain adaptation, we take charts and captions from arXiv as the target domain. As described in Section 3, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples, which we split into training data (19,840) and validation data (2,204) with a ratio of 9:1.

Model Setup. For the base model, we use a ResNet-101 model from the Torchvision [35] library.⁷ We resize the image to 224 × 224, and the backbone model maps it to a 7 × 7 × 2048 feature map. We sort the OCR-extracted texts by their confidence and only keep the top 20 texts for post-processing. Since we want the image positions to correspond to the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 × 224. For the pre-trained GPT-2 [30] model, we download the small GPT-2 model from Hugging Face's Transformers [36]. The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. More implementation and hyper-parameter details can be found in the Appendix.

⁷ https://pytorch.org/docs/stable/torchvision/models.html

5.2. Metric-based Evaluation

In order to conduct efficient evaluation, we use automatic language metrics to evaluate our model. We report BLEU [37], ROUGE-L [38], METEOR [39], and CIDEr [40], as in previous image captioning papers. As shown in Table 1, we compare our proposed models (Sections 4.2 and 4.3) with the baseline captioning model (Section 4.1) on both the PMC and arXiv datasets. The model with the OCR text encoder is strictly better than the baseline captioning model on every metric, which indicates that in-chart text understanding is very important for generating good summaries of scientific charts. The integration of the pre-trained language model (GPT-2) further enhances the performance over the OCR encoder results. The pre-trained decoder shows more improvement in the semi-supervised setup, since the model needs enough data to learn the weights of the fixed-length transformer and the cross-attention modules, which bridge the vision encoder and the language decoder.

Note that the CIDEr score of the +GPT-2 model is lower than that of the +OCR model on the PMC dataset under the supervised setup. We find that this is due to the size of the data: the smaller PMC training set leaves the learned model with a stronger bias towards the original GPT-2 generations. Namely, although the model generates more fluent sentences (reflected in the higher BLEU score), it is biased towards the GPT-2 prior and mostly uses common words, a bias that CIDEr penalizes through its tf-idf weighting of n-grams. Under the semi-supervised setting, however, the CIDEr score is higher with GPT-2 because of the adequate amount of data. This also demonstrates the usefulness of the proposed semi-supervised approach.
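The paper does not name its metric implementation; a common choice is the pycocoevalcap package, as sketched below, and "BLEU" is taken to mean BLEU-4 here as an assumption. Both dictionaries map an image id to a list of pre-tokenized, lower-cased strings, with a single predicted summary per image in `res`.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

def evaluate(gts, res):
    """gts/res: dict mapping image id -> list of tokenized strings.
    `gts` holds the reference caption(s); `res` holds exactly one generated
    summary per image. Note that the METEOR wrapper requires a Java runtime."""
    bleu, _ = Bleu(4).compute_score(gts, res)      # returns BLEU-1..4
    rouge, _ = Rouge().compute_score(gts, res)
    meteor, _ = Meteor().compute_score(gts, res)
    cider, _ = Cider().compute_score(gts, res)
    return {"BLEU-4": bleu[3], "ROUGE-L": rouge, "METEOR": meteor, "CIDEr": cider}
```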
5.3. Human Evaluation

In order to get a faithful evaluation, we conduct a human evaluation on 100 randomly sampled examples for each of PMC and arXiv. The human evaluation is conducted by the authors and their colleagues (4 in total), since this task requires certain expert knowledge. We use both the base captioning model and our final model (with the OCR encoder and GPT-2 decoder)⁸ to generate two summaries. Each image, together with the generated summaries from the two models, is annotated by all 4 annotators. We randomly shuffle the order of the two summaries and only show A/B labels to the human annotators, who are asked to choose one of four options: "Both Good", "Both Bad", "A wins", and "B wins". As shown in Table 2, our proposed model significantly outperforms the baseline model on both datasets. Moreover, we find that our annotators have high agreement on which generated sentence is better, since this scientific summarization is mostly about facts and salience.

Table 2: Human study on the results with 100 pairwise comparisons.

        | Baseline Better | Final Model Better | Equally Good | Equally Bad
PMC     | 20              | 70                 | 3            | 7
arXiv   | 37              | 50                 | 2            | 11

⁸ The PMC model is with the semi-supervised setup.

6. Analysis

In this section, we provide fine-grained analyses to illustrate the effectiveness of each component in the proposed pipeline. We first demonstrate the results for different chart types in Section 6.1, and in Section 6.2 we empirically show the advantage of our pre-embedding and cross-attention combination. Section 6.3 reports the chart classification performance; cross-domain evaluation is shown in the Appendix.

6.1. Different Chart Categories

During our data collection, we also let the annotators select the type of each chart (Figure 2). In this paper, we aim for a general chart summarization model that does not rely on the details of each chart type. Here we analyze the performance on each chart category with our final model trained on PMC (Semi-Supervised). In Table 3, we show the results for the three most common chart types (i.e., "Line", "Bar", "Scatter"), which have a sufficient amount of data (513 for Line, 400 for Bar, and 57 for Scatter) to support automatic metric-based evaluation. Although line charts contribute the most to the training and test data, their BLEU score is the lowest compared to the results of bar charts and scatter charts. The reason might be that the image features produced by convolutional neural networks (CNNs) are insensitive to the properties (e.g., trends, crossings) of curved lines. At the same time, a CNN can capture the local intensity of points and thus shows higher results for scatter charts. Based on this observation, we think that using a visual encoder specifically designed for understanding the curved lines in charts might be a promising future direction.

Table 3: Results regarding different types of charts.

              | BLEU | ROUGE-L | METEOR | CIDEr
All           | 4.47 | 12.46   | 4.32   | 10.30
Line Chart    | 4.44 | 12.70   | 4.28   | 10.18
Bar Chart     | 4.77 | 12.30   | 4.71   | 7.14
Scatter Chart | 5.96 | 16.63   | 5.39   | 40.78

6.2. Pre-Embeddings and Cross-Attention Layers

In Section 4.3, we discuss two ways to connect the visual information to the language decoder: the pre-embedding approach and the additional cross-attention layers. In Table 4, we show the results of different combinations on the PMC (semi-supervised) dataset. "Img" and "OCR" indicate using the image output or the OCR representations as the input to the pre-embedding approach or the cross-attention layers. "None" means that we do not use that input and exclude the corresponding parameters. "Concat" means that we concatenate the image and OCR representations and use the concatenation as the input. We can see that our approach (Img for Pre-Embed and OCR for Cross-Att) is comparable to its reverse (OCR for Pre-Embed and Img for Cross-Att) and is much better than the other alternatives.

Table 4: Comparison of different approaches for connecting the image content and the language decoder.

Pre-Embed | Cross-Att | BLEU | ROUGE-L | METEOR | CIDEr
None      | None      | 1.91 | 10.59   | 3.01   | 0.52
Concat    | None      | 2.88 | 11.92   | 3.79   | 4.78
None      | Concat    | 3.64 | 12.07   | 3.69   | 2.91
Img       | OCR       | 4.47 | 12.46   | 4.32   | 10.30
OCR       | Img       | 4.46 | 12.12   | 4.08   | 11.18
Concat    | Concat    | 3.61 | 12.18   | 3.76   | 2.79

6.3. Chart Classification Performance

In both the semi-supervised learning and domain adaptation setups, we use a classifier to identify single-chart images from the many automatically extracted image-caption pairs. Since the images filtered by the classifier are further used for data augmentation, we take the F1 score as the main metric, balancing precision and recall. We start with frozen ResNet-101 [25] features and an additional linear classifier, which achieves a 90% F1 score. After fine-tuning the backbone model on our data, the model achieves an F1 score of 94.9%. We also tried adding other neural modules (e.g., attention modules and detection branches) and enhanced visual backbones, but we did not observe a significant improvement on the test set.

When we use this classifier in the semi-supervised and domain adaptation setups, we calibrate the classification threshold to maintain a precision over 99%, since we have plenty of unannotated images. At this precision level, we achieve a recall of 59.8% and a precision of 99.2%. We keep the same classification threshold and test it on our annotated arXiv test split; the precision and recall are 93.4% and 65.7%, respectively.
7. Conclusions

In this paper, we propose datasets and models for summarizing scientific charts, a specific type of structured image. We construct datasets from PMC and arXiv by leveraging crowdsourcing and the figure captions in the papers. To enable better understanding of the text components in charts and to endow the model with external knowledge, we propose to use an OCR encoder and a pre-trained language decoder on top of a standard image captioning model. In our experiments, we show the effectiveness of our models in terms of both automatic evaluation metrics and human evaluation.

Acknowledgments

The authors thank Bloomberg's AI Engineering team, especially Alakananda Vempala, Ketevan Tsereteli, and Anju Kambadur, for helpful feedback and directions. Additional thanks to the anonymous reviewers for their insights. Hao Tan acknowledges support from Bloomberg's Data Science Ph.D. Fellowship.

References

[1] S. Carberry, S. Elzer, S. Demir, Information graphics: an untapped resource for digital libraries, in: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006, pp. 581–588.
[2] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, Revision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 393–402.
[3] S. Ray Choudhury, C. L. Giles, An architecture for information extraction from figures in digital libraries, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 667–672.
[4] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, Figureseer: Parsing result-figures in research papers, in: European Conference on Computer Vision, Springer, 2016, pp. 664–680.
[5] W. Huang, C. L. Tan, A system for understanding imaged infographics and its applications, in: Proceedings of the 2007 ACM symposium on Document engineering, 2007, pp. 9–18.
[6] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphics textually, Computational Linguistics 38 (2012) 527–574.
[7] Z. Chen, M. Cafarella, E. Adar, Diagramflyer: A search engine for data-driven diagrams, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 183–186.
[8] S. R. Choudhury, S. Wang, C. L. Giles, Scalable algorithms for scholarly figure mining and semantics, in: Proceedings of the International Workshop on Semantic Big Data, 2016, pp. 1–6.
[9] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: European Conference on Computer Vision, Springer, 2016, pp. 235–251.
[10] S. E. Kahou, A. Atkinson, V. Michalski, Á. Kádár, A. Trischler, Y. Bengio, Figureqa: An annotated figure dataset for visual reasoning, in: ICLR Workshop, 2018.
[11] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, M. Rohrbach, Towards vqa models that can read, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326.
[12] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. A. Bateman, Ai2d-rst: A multimodal corpus of 1000 primary school science diagrams, Language Resources and Evaluation (2020) 1–28.
[13] N. Methani, P. Ganguly, M. M. Khapra, P. Kumar, Plotqa: Reasoning over scientific plots, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1527–1536.
[14] J. Poco, J. Heer, Reverse-engineering visualizations: Recovering visual encodings from chart images, in: Computer Graphics Forum, volume 36, Wiley Online Library, 2017, pp. 353–363.
[15] S. Elzer, E. Schwartz, S. Carberry, D. Chester, S. Demir, P. Wu, A browser extension for providing visually impaired users access to the content of bar charts on the web, in: WEBIST (2), Citeseer, 2007, pp. 59–66.
[16] S. Demir, S. Carberry, K. McCoy, Generating textual summaries of bar charts, in: Proceedings of the Fifth International Natural Language Generation Conference, Association for Computational Linguistics, Salt Fork, Ohio, USA, 2008, pp. 7–15. URL: https://www.aclweb.org/anthology/W08-1103.
[17] C. Greenbacker, S. Carberry, K. McCoy, A corpus of human-written summaries of line graphs, in: Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop, Association for Computational Linguistics, Edinburgh, Scotland, 2011, pp. 23–27. URL: https://www.aclweb.org/anthology/W11-2703.
[18] C. Greenbacker, P. Wu, S. Carberry, K. McCoy, S. Elzer, Abstractive summarization of line graphs from popular media, in: Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, Association for Computational Linguistics, Portland, Oregon, 2011, pp. 41–48. URL: https://www.aclweb.org/anthology/W11-0506.
[19] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
[20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International conference on machine learning, 2015, pp. 2048–2057.
[21] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, in: International Conference on Learning Representations, 2016.
[22] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[23] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, J. Gao, Unified vision-language pre-training for image captioning and vqa, in: AAAI, 2019.
[24] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[26] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
[27] R. Smith, An overview of the tesseract ocr engine, in: Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, IEEE, 2007, pp. 629–633.
[28] J. Gu, Z. Lu, H. Li, V. O. Li, Incorporating copying mechanism in sequence-to-sequence learning, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1631–1640.
[29] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083.
[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2019).
[31] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, in: Advances in neural information processing systems, 2019, pp. 5753–5763.
[32] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, JMLR (2019).
[33] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: ACL, 2020.
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[35] S. Marcel, Y. Rodriguez, Torchvision the machine-vision package of torch, in: Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1485–1488.
[36] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.
[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[38] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
[39] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
[40] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[41] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR, 2015.
[42] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), 2019.

A. Implementation Details

The supervised learning setup is conducted on our annotated English PMC dataset (Sec. 3). We keep 1,000 charts for the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1. We train our model on the training set and tune the hyperparameters on the validation set; the test set is only used to report results. We train for 200 epochs on this small dataset. All our code is written in PyTorch, and all experiments converge in 4-5 hours on 1 Titan V GPU.

For the base model, we use a ResNet-101 model from the Torchvision [35] library.⁹ We resize the image to 224 × 224, and the backbone model maps it to a 7 × 7 × 2048 feature map. We use 512 dimensions for the LSTM and 256 dimensions for the word embedding. The attentive hidden state has the same size as the hidden state (512 dimensions). We use Adam [41] with a fixed learning rate of 10⁻⁴. The batch size is 64.

For the OCR model, we sort the OCR texts by their confidence and remove empty texts. We keep the top 20 OCR texts for post-processing. We use 512 dimensions for the OCR feature representations (yellow blocks in Fig. 3). Since we want the image positions to correspond to the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 × 224.

For the pre-trained GPT-2 [30] model, we downloaded the small GPT-2 model (124M parameters) from Hugging Face's Transformers [36].¹⁰ The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. We use Adam [41] with a weight decay of 0.01, following the practice in Devlin et al. [42]; we do not apply weight decay to the layer normalization parameters and biases. We use a linear warmup with a peak learning rate of 10⁻⁴, where the first 5% of the steps are warmup steps. The batch size is 64.

⁹ https://pytorch.org/docs/stable/torchvision/models.html
¹⁰ https://github.com/huggingface/transformers
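A sketch of this optimizer configuration, assuming the decoupled AdamW implementation in PyTorch and the linear-warmup schedule from the transformers library; the parameter-name patterns used to exclude weight decay are assumptions that match GPT-2's LayerNorm naming.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, lr=1e-4, weight_decay=0.01):
    """Adam with 0.01 weight decay, no decay for biases/LayerNorm, 5% linear warmup.
    `model` and `total_steps` are assumed to come from the training script."""
    no_decay = ["bias", "ln_", "layernorm", "layer_norm"]
    decay_params, plain_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (plain_params if any(k in name.lower() for k in no_decay) else decay_params).append(p)
    optimizer = torch.optim.AdamW(
        [{"params": decay_params, "weight_decay": weight_decay},
         {"params": plain_params, "weight_decay": 0.0}], lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=int(0.05 * total_steps), num_training_steps=total_steps)
    return optimizer, scheduler
```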
In order to increase the number of training examples, we apply the proposed semi-supervised learning technique. The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this classifier. The training, validation, and test sets have 5,819, 646, and 1,000 data points, respectively; the data split is the same as in the supervised learning setup above. After the model converges on the training set, we calibrate the classifier to optimize the recall at a precision over 99% on the validation set. Since we have many images, we can afford a lower recall for high-quality positive examples.

We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from the years 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images, which serve as additional training examples for the summarization model. The hyper-parameters of the summarization model are the same as the ones used in the supervised setup. For the models trained on this dataset, we use a maximum sequence length of 80 and train for 100 epochs. The other hyperparameters are the same as for the small supervised PMC dataset for each model.

For domain adaptation, we take charts and captions from English arXiv papers as the target domain. As described in the dataset section, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples, which we split into training data (19,840) and validation data (2,204) with a ratio of 9:1. The summarization model is trained on the training data, tuned on the validation data, and finally evaluated on the manually annotated test set. For the models trained on this dataset, we use a maximum sequence length of 40, since the captions in arXiv are much shorter. Since we halve the maximum sequence length, we train for 200 epochs, thus roughly keeping the same computational budget for both datasets.

B. Details of Data Collection

The crowdsourcing task is conducted on Appen.¹¹ There are 2,263 distinct annotators from 50 countries. Since the task is to classify image types, it does not require native English speakers. The top 5 countries are Venezuela (53%), USA (23%), Egypt (8%), Colombia (2%), and Peru (1.4%). We paid one cent per judgement (image). For the first round of annotation tasks, the Fleiss' kappa scores for the "whether it is a single chart" and "chart type" tasks are 0.56 and 0.73 respectively, which shows significant agreement.

¹¹ https://client.appen.com

C. Additional Analysis

C.1. Cross-Domain Transferability

To illustrate the need for the domain adaptation led by the chart classifier (Sec. 4.4), we show the low cross-domain transferability of our models in this section. Each column in Table 5 indicates the dataset on which our final model is trained, while each row indicates the test set used for evaluation.

Table 5: The transferability of our captioning model across different domains. The columns indicate the training dataset while the rows indicate the testing dataset. The PMC training data is augmented with filtered charts (Sec. 4.4) and the arXiv training data is built by the chart classifier. All test data are human-annotated.

Test \ Train | Trained on PMC                 | Trained on arXiv
             | BLEU  ROUGE-L  METEOR  CIDEr   | BLEU  ROUGE-L  METEOR  CIDEr
PMC          | 4.47  12.46    4.32    10.30   | 0.06  8.19     1.93    0.63
arXiv        | 0.22  10.11    3.25    1.43    | 5.89  14.32    4.92    32.34

The model does not transfer well between domains, probably because of the different figure and caption conventions in different communities. The different topics also introduce diverging vocabularies.

D. Ethical Considerations

The technique developed in this paper would help automatically summarize news, articles, and publications that contain charts. It would also help visually impaired people understand the content of charts. It would fail in cases where the OCR detector misses the key information of a chart, which would lead to an unfaithful summary.
Since we use a pre-trained language decoder in our final model, the generated summaries might be biased towards the pre-training domain of the language decoder. Regarding the dataset collection, we resolved all legal and license issues for the PMC dataset before showing images to annotators; more specifically, we only use articles with CC BY licenses from the Open Access Subset of the PMC data. For the arXiv data, a small test set is annotated by the authors.