ChartParser: Automatic Chart Parsing for Print-Impaired

Anukriti Kumar1,∗,†, Tanuja Ganu2,†
1 University of Washington
2 Microsoft Research

The Third AAAI Workshop on Scientific Document Understanding, Feb 14, 2023
∗ Corresponding author.
† These authors contributed equally.
anukumar@uw.edu (A. Kumar); tanuja.ganu@microsoft.com (T. Ganu)

Abstract

Infographics are often an integral component of scientific documents for reporting qualitative or quantitative findings, as they make it much simpler to comprehend the underlying complex information. However, their interpretation continues to be a challenge for blind, low-vision, and other print-impaired (BLV) individuals. In this paper, we propose ChartParser, a fully automated pipeline that leverages deep learning, OCR, and image processing techniques to extract all figures from a research paper, classify them into various chart categories (bar chart, line chart, etc.), and obtain relevant information from them, focusing on bar charts (including horizontal, vertical, stacked horizontal, and stacked vertical charts), which already pose several interesting challenges. Finally, we present the retrieved content in a tabular format that is screen-reader friendly and accessible to BLV users. We present a thorough evaluation of our approach by applying our pipeline to real-world annotated bar charts sampled from research papers.

Keywords: Infographics Accessibility, Visualization Design, Information Retrieval, Human-centered computing

1. Introduction

Academic research is advancing at an incredible pace, with thousands of scientific documents published monthly [1]. These documents often use figures and charts as a medium for data representation and interpretation. However, blind, low-vision, and other print-disabled (BLV) individuals are often deprived of the insights and understanding that these figures offer. Although figures are converted into non-visual, screen-reader-friendly representations such as alt-text and data tables, this conversion relies heavily on volunteers, making it an extremely time-consuming process, and in most cases even the alternate text fails to describe charts properly. Hence, our goal in this paper is to design a fully automated pipeline that extracts useful information from charts, specifically bar charts, and converts them into accessible data tables. Potential applications of our system include helping authors provide meaningful captions for the figures in their papers, improving search and retrieval of relevant information in the academic domain, generating summaries from charts, building query-answering systems, developing interfaces that provide simple and convenient access to complex information, making charts accessible to BLV individuals, and helping academic committees and publishers identify plagiarized articles.

Given the remarkable progress in analyzing natural scene images in recent years, it is generally assumed that analyzing scientific figures is a trivial task. However, understanding charts and infographics presents a plethora of complex challenges. First, a high level of accuracy is expected when parsing the figure plot data, as even a small mistake in analyzing chart data can lead to erroneous conclusions. Authors also employ different design conventions when structuring and formatting their figures, resulting in high variation across papers. It is likewise challenging to extract information from charts amidst heavy clutter and deformation within the plot area. Although color is an essential cue for differentiating the plot data, it may not always be present: many figures reuse similar colors, and some are published in grayscale. Finally, figure parsing poses an additional challenge because only one exemplar (the legend symbol) is available for model learning, in contrast to natural image recognition tasks, where the desired amount of labeled training data can be obtained to train models per category.
Due to these challenges, there is currently no system that can automatically parse data from scientific figures and charts.

In this paper, we make three key contributions. First, we propose ChartParser, a fully automated pipeline that leverages deep learning, OCR, and image processing techniques to extract all figures from a research paper, classify them into various chart categories, and retrieve useful information from them, specifically bar charts. Second, we address some of the key challenges present in existing systems. For example, our system can parse the legend and utilize color information for data association; it is also robust to variations in figure design and makes no assumptions about the position of the axes, legend, etc. And finally, we demonstrate the viability of our approach by applying our pipeline to a real-world dataset of research papers from different sources.

2. Related Work

Chart understanding in scientific literature has recently gained much traction, and there have been several attempts to classify charts using heuristics and expert rules. Various machine learning-based algorithms that rely on handcrafted features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), and others have been proposed in the literature [2, 3]. Several deep learning algorithms for chart and table image classification have recently been introduced [4, 5]. There is another line of work on interpreting text components in chart images [7, 8, 9, 10, 11, 12]. Although semi-automatic software solutions are available for data extraction from charts, using them requires the user to manually define the chart's coordinate system, provide metadata about the axes and data, or click on the data points [13, 14, 15].

One of the difficulties in accurately parsing bar charts is dealing with the different types of bar charts in scientific literature. Previous work, for example [16, 17], focused on developing heuristic models that detect key elements such as bars, legends, etc. Similarly, machine learning has recently been used to detect chart components (e.g., bars or legends) [18], and a deep learning object detection model is trained in [19] to identify sub-figures in compound figures. However, none of these works extracted data values from bar charts. Using synthetic data produced by the matplotlib toolkit, [20] created a model to boost accuracy when parsing bar values.

Most of the previous methods do not parse the legend. Some assumed that the legend was always placed below the chart [20] or horizontally along the same line [21], which limits the applicability of these models. Previous work was mostly designed for visualizations in grayscale, as it did not parse color information from the legend. There has also been less focus on measuring the accuracy of detecting the axes or label values, even though quantifying the accuracy of this semantic information is essential for understanding the capping limits of the evaluation process. In summary, although the extraction of information from charts and other infographics has been extensively explored, prior work, to our knowledge, has the shortcomings discussed above. As a result, we propose a fully automated system for data extraction from bar charts that addresses these limitations and can be extended to other types of charts, including line charts, scatter plots, etc.

3. Methodology

This section discusses our proposed pipeline for converting bar charts from scientific publications into data tables. The process is divided into three steps: first, we extract figures from research papers; second, we detect bar charts among the extracted figures; and finally, we extract content from the bar charts to obtain the desired data tables. These three steps are depicted in Figure 2.

3.1. Figure Extraction

To segment all the figures from a research paper, we use a pre-trained image segmentation model based on the Mask R-CNN architecture from the Detectron2 model zoo [6] to decompose a document into five categories: title, text block, list, figure, and table. The model uses the ResNet50 feature pyramid network (FPN) base config and is trained on the PubLayNet dataset for document layout analysis.
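A minimal sketch of this figure-extraction step, assuming Detectron2's standard Python API. The checkpoint path, class ordering, and function name below are illustrative placeholders rather than our exact configuration; PubLayNet-trained weights are distributed by third parties rather than the official Detectron2 model zoo.

```python
# Run a PubLayNet-style layout model (Mask R-CNN, ResNet50-FPN) over a page
# image and keep the crops of every region predicted as "figure".
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

PUBLAYNET_CLASSES = ["text", "title", "list", "table", "figure"]

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))  # R50-FPN base config
cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(PUBLAYNET_CLASSES)
cfg.MODEL.WEIGHTS = "publaynet_mask_rcnn_R_50_FPN_3x.pth"     # placeholder checkpoint
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
predictor = DefaultPredictor(cfg)

def extract_figures(page_image_path):
    """Return image crops for every region the layout model labels 'figure'."""
    image = cv2.imread(page_image_path)
    instances = predictor(image)["instances"].to("cpu")
    crops = []
    for box, cls in zip(instances.pred_boxes.tensor.numpy(),
                        instances.pred_classes.numpy()):
        if PUBLAYNET_CLASSES[cls] == "figure":
            x0, y0, x1, y1 = box.astype(int)
            crops.append(image[y0:y1, x0:x1])
    return crops
```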
3.2. Figure Classification

Most of the extracted figures are charts, including tree diagrams, network diagrams, bubble charts, etc. This step describes the chart classification model employed to detect bar charts.

3.2.1. Chart Images Dataset

We create a chart dataset to train and evaluate our chart classification model. We use the Python module google_images_download to obtain charts from 13 categories (scatter plots, bar charts, line charts, etc.), with 1,000 images per category. We then manually identify and remove incorrect samples from the downloaded charts. This yields a ground-truth dataset of approximately 12k charts in total, including 978 bar charts.

3.2.2. Chart Classification Model

We try out different models pre-trained on the ImageNet dataset and fine-tune them on the figure dataset created above. All the layers except the final convolutional layer are frozen, and the fully-connected layer uses a softmax function to classify figures into the 13 chart categories. Using Adadelta as the optimizer, we re-train the convolutional layer and the additional fully-connected layer for 30 iterations. We also add a dropout layer with a rate of 0.3 before the final fully-connected layer to avoid overfitting. Although all the baselines achieve similar accuracy, we choose MobileNet because it uses far fewer parameters on ImageNet than the others.
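A minimal sketch of this fine-tuning recipe, assuming a Keras/TensorFlow setup (the framework is not named above). For brevity, the entire pretrained backbone is frozen here, whereas our model also re-trains the final convolutional layer; the dataset objects in the commented line are hypothetical.

```python
# ImageNet-pretrained MobileNet backbone + dropout (0.3) + 13-way softmax head,
# trained with the Adadelta optimizer, as described in Section 3.2.2.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 13  # scatter plot, bar chart, line chart, ... categories

base = tf.keras.applications.MobileNet(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the pretrained layers

model = models.Sequential([
    base,
    layers.Dropout(0.3),                            # regularization before the head
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adadelta(),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30)  # hypothetical datasets
```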
Figure 1: Illustration of the ChartParser pipeline

3.3. Content Extraction

Content extraction from charts is a complex process. In this step, we employ OCR and image processing techniques to extract the relevant content from bar charts through various modules.

3.3.1. Axes Detection

We convert the image into a binary one and then obtain the maximum continuous run of ones along each row and column. For this, we scan the matrix vertically and horizontally to trace the continuity of black pixels within adjacent columns and rows. The y-axis is then the first column whose maximum continuous run of 1s falls in the region [max - threshold, max + threshold], where a predetermined threshold (= 10) is assumed. Similarly, for the x-axis, we choose the last row whose maximum continuous run of 1s falls within the range [max - threshold, max + threshold].
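The run-length heuristic above fits in a few lines of Python with NumPy. This is an illustrative sketch, assuming a binary image in which 1 marks dark (ink) pixels; the function and variable names are ours.

```python
# Pick the y-axis as the first column, and the x-axis as the last row, whose
# longest run of ink pixels is within `threshold` of the global maximum.
import numpy as np

def longest_run(bits):
    """Length of the longest consecutive run of 1s in a 1-D 0/1 array."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b else 0
        best = max(best, cur)
    return best

def detect_axes(binary, threshold=10):
    """binary: 2-D 0/1 array, 1 = dark pixel. Returns (y_axis_col, x_axis_row)."""
    col_runs = [longest_run(binary[:, c]) for c in range(binary.shape[1])]
    row_runs = [longest_run(binary[r, :]) for r in range(binary.shape[0])]
    y_axis = next(c for c, run in enumerate(col_runs)
                  if run >= max(col_runs) - threshold)  # first qualifying column
    x_axis = max(r for r, run in enumerate(row_runs)
                 if run >= max(row_runs) - threshold)   # last qualifying row
    return y_axis, x_axis
```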
3.3.2. Text Detection

We apply Azure Cognitive Services (ACS) Optical Character Recognition (OCR) to detect the text within a chart and extract the rectangular bounding boxes of all the detected text.

3.3.3. Axes Ticks Detection

We filter all the text boxes below the x-axis and to the right of the y-axis. We then run a sweeping line from the x-axis to the bottom of the image; the line that intersects the maximum number of text boxes provides the bounding boxes for all the x-axis ticks. A similar algorithm is used to detect the y-axis ticks using a vertical sweeping line.

3.3.4. Axes Label Detection

We filter the text boxes present below the x-axis ticks and again run a sweeping line from the x-axis ticks to the bottom of the image. While doing so, the line intersecting the maximum number of text boxes provides the bounding boxes for the x-axis label. Similarly, we obtain the y-axis label using a vertical sweeping line.

3.3.5. Legend Detection

First, we remove the bounding boxes of the axes labels and ticks. We then remove boxes containing only a single "I" character, because these are typically error bars read as text, and finally we remove text boxes with numeric values placed above bars. This implies that only legend names and color boxes remain among the text boxes. Because legend names may consist of multiple words, we combine bounding boxes that are less than 10 px apart into a single legend name. We then organize these bounding boxes into groups in which each member is either horizontally or vertically aligned with at least one other member. Finally, the largest group gives the bounding boxes for all the legends.

Table 1
Content extraction accuracy

Component          Accuracy (%)
X-axis             97
Y-axis             94
X-axis label       95
Y-axis label       91
X-axis ticks       89
Y-axis ticks       84
Legend             87
Legend color       87
Data association   76

3.3.6. Legend Color Estimation

The color boxes are assumed to be on the left or right side of the legend text, depending on the placement of the text bounding box extracted in the previous module. Ideally, all pixels within a color box should have the same value; since these values can vary for several reasons (such as image compression, scanning, etc.), we start a new group with a random pixel and gradually add pixels whose R, G, and B values differ by no more than 5 from the average of all the pixels already in the group. The color of a legend label is then determined by taking the per-channel (R, G, B) average over all the pixels in the largest group. Later, the bars matching a specific legend are identified using these colors.

3.3.7. Data Extraction

The bounding boxes for each legend are whitened, and we eliminate all the white pixels from the original chart image. The colors determined in the previous module serve as the initial clusters, and all of the image's pixel values are then assigned to these clusters. We then divide the given plot into multiple plots, one per cluster; in other words, clustering breaks a stacked bar chart down into several simpler plots. Next, we obtain all the contours within each plot and pick the closest bounding rectangle for each label. Further, we require a mapping function from pixel values to the actual values in the chart. Hence, we use the value-tick ratio (α) to convert each bar's pixel height into a data value. To find this ratio, we divide the average of the actual y-label tick values (N_ticks) by the average distance between ticks in pixels (Δd):

α = N_ticks / Δd    (1)

Finally, the bar chart's y values are defined as y = α × H, where H is the bar's height in pixels. After obtaining all the relevant information, we assemble it into a data table, as shown in Figure 2 (e).
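A small sketch of this pixel-to-value mapping, interpreting α in Eq. (1) as "data units per pixel", i.e., the average step between consecutive y-tick values divided by the average pixel distance between consecutive ticks; this reading, and all names and inputs below, are illustrative assumptions.

```python
# Convert bar heights in pixels into chart values via the value-tick ratio.
import numpy as np

def value_tick_ratio(tick_values, tick_rows_px):
    """tick_values: numeric y-tick labels (e.g., [0, 10, 20, 30]);
    tick_rows_px: pixel rows of the corresponding tick marks."""
    n_ticks = np.mean(np.abs(np.diff(sorted(tick_values))))   # avg value step
    delta_d = np.mean(np.abs(np.diff(sorted(tick_rows_px))))  # avg pixel step
    return n_ticks / delta_d                                  # alpha, Eq. (1)

def bar_values(bar_heights_px, tick_values, tick_rows_px):
    """y value = alpha * H for each bar height H (in pixels)."""
    alpha = value_tick_ratio(tick_values, tick_rows_px)
    return [float(alpha * h) for h in bar_heights_px]

# Ticks 0..40 spaced 50 px apart give alpha = 0.2, so a 125 px bar reads 25.0.
print(bar_values([125], [0, 10, 20, 30, 40], [400, 350, 300, 250, 200]))
```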
4. Results

This section focuses on creating a test dataset of bar charts from research papers and evaluating the various components of our pipeline on this dataset to demonstrate the viability of our approach.

4.1. Test Dataset

We sample research papers from two data sources: arXiv and PubLayNet. From the arXiv dataset published on Kaggle [22], we obtain research paper PDFs published in the years 2019-2021, resulting in a dataset of around 10,024 papers. We also use a subset of the PubLayNet dataset [23], from which we obtain approximately 15k document images. We then apply the first two steps of our fully automated pipeline, as described in Section 3, to these research papers: we extract approximately 51k figures from the research paper dataset using our image segmentation model, and applying our chart classification model to these figures yields approximately 2,112 bar charts. To evaluate our system, we sample 100 bar charts and manually annotate the relevant data, including the axes, axes labels, axes tick values, legend, legend colors, and the textual bounding boxes.

4.2. Chart Classification

The accuracy of our chart classification model is calculated using stratified five-fold cross-validation. Here, we use 20% of the chart images dataset, created using the google_images_download API, as our validation set; the category-wise performance (average accuracy) of our model is presented in Table 2. We observe that our model achieves an accuracy of 97.8% for bar charts.

Table 2
Category-wise average accuracy of the chart classification model

Category           Accuracy (%)
Bar Chart          97.8
Line Chart         96.86
Scatter Plot       92.00
Pareto Chart       84.20
Pie Chart          91.52
Venn Diagram       87.88
Box Plot           94.56
Network Diagram    68.97
Map                79.26
Tree Diagram       69.09
Area Graph         88.00
Flow Chart         75.54
Bubble Chart       92.20

4.3. Text Recognition

We use the Intersection over Union (IoU) metric to assess our text detection module. This metric matches each predicted bounding box to the most closely overlapping ground-truth box, calculates the area of the intersecting region divided by the area of the union region for each match, and considers a prediction successful if the IoU measure is higher than a threshold, for example, 0.5. We achieve an F1-score of 0.935 at an IoU threshold of 0.5, demonstrating that our module detects text bounding boxes within the plot area fairly well.
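For reference, a minimal sketch of this IoU test for axis-aligned boxes given as (x0, y0, x1, y1) corners; the matching step is simplified here to "best overlap above the threshold", and the names are illustrative.

```python
# IoU of two axis-aligned rectangles, and the pass/fail check used to score
# a predicted text box against the annotated ground truth.
def iou(a, b):
    """a, b: boxes as (x0, y0, x1, y1). Returns intersection over union."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_successful(pred, ground_truths, threshold=0.5):
    """True if the prediction's best-matching ground-truth box exceeds the IoU threshold."""
    return max((iou(pred, gt) for gt in ground_truths), default=0.0) > threshold
```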
4.4. Content Extraction

The performance of the final content extraction process depends on the sequential performance of each module, i.e., axis detection, axis tick value extraction, label extraction, legend detection, and so on. First, we apply the OCR and image processing techniques to the test dataset and extract the relevant content. Then, we compare the outcome with the manually annotated data and obtain the module-wise evaluation metrics presented in Table 1.

5. Limitations and Future Work

This section discusses the existing limitations of our fully automated pipeline and proposes future work for improvement.

Currently, our pipeline fails to parse the plotted data successfully when there is a lot of clutter. We can employ vascular tracking methods like those described in [24] to address this.

Our pipeline also fails to recognize the axes when there is no solid line indicating the y-axis. In this scenario, the y-axis can be identified by recognizing bounding boxes along a vertical line in the bar chart. Likewise, x-axis detection may fail when the x-axis is at the top of the graphic. This case can be handled by employing a bidirectional sweeping line with heuristic rules.

We also realize that the axes, legend, and data extraction modules are currently modeled and trained independently in our figure analysis approach. It could be an exciting direction to jointly model and train them within an end-to-end deep network.

In future work, we will extend our pipeline to other types of charts that have an L-shaped axis similar to bar charts, including line charts, scatter plots, etc., and follow a similar algorithm for the extraction of chart elements such as axes, labels, ticks, and legends. Instead of simply presenting the raw data in tabular form, we can also generate insights from the data by employing high-level reasoning on chart images to find relationships between various chart elements.

6. Conclusion

In this paper, we present our ongoing work on making scientific documents accessible to blind, low-vision, and print-disabled individuals. Our work focuses on the problem of the poor accessibility of infographics and charts in research papers. We propose an end-to-end pipeline to extract all figures from a research paper, classify them into various chart categories, obtain relevant information from them, specifically bar charts, and present the retrieved content as accessible data tables. Finally, we apply our pipeline to a test dataset of research papers from two different sources, arXiv and PMC, to demonstrate the viability of our approach. We continue to work towards making charts fully accessible to print-impaired individuals by overcoming the existing limitations of our work.

References

[1] arXiv Monthly Stats, Global survey monthly submissions, https://arxiv.org/stats/monthly_submissions, 2022. Accessed November 14, 2022.
[2] W. Huang, C. L. Tan, A system for understanding imaged infographics and its applications, in: Proceedings of the 2007 ACM Symposium on Document Engineering, 2007, pp. 9–18.
[3] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, ReVision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011, pp. 393–402.
[4] D. Jung, W. Kim, H. Song, J.-i. Hwang, B. Lee, B. Kim, J. Seo, ChartSense: Interactive data extraction from chart images, in: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017, pp. 6706–6717.
[5] J. Poco, J. Heer, Reverse-engineering visualizations: Recovering visual encodings from chart images, in: Computer Graphics Forum, volume 36, Wiley Online Library, 2017, pp. 353–363.
[6] J. Choi, S. Jung, D. G. Park, J. Choo, N. Elmqvist, Visualizing for the non-visual: Enabling the visually impaired to use visualization, in: Computer Graphics Forum, volume 38, Wiley Online Library, 2019, pp. 249–260.
[7] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphics textually, Computational Linguistics 38 (2012) 527–574.
[8] S. R. Choudhury, S. Wang, C. L. Giles, Scalable algorithms for scholarly figure mining and semantics, in: Proceedings of the International Workshop on Semantic Big Data, 2016, pp. 1–6.
[9] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: European Conference on Computer Vision, Springer, 2016, pp. 235–251.
[10] S. Ebrahimi Kahou, V. Michalski, A. Atkinson, A. Kadar, A. Trischler, Y. Bengio, FigureQA: An annotated figure dataset for visual reasoning, arXiv e-prints (2017) arXiv:1710.07300.
[11] Z. Chen, M. Cafarella, E. Adar, DiagramFlyer: A search engine for data-driven diagrams, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 183–186.
[12] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. A. Bateman, AI2D-RST: A multimodal corpus of 1000 primary school science diagrams, Language Resources and Evaluation 55 (2021) 661–688.
[13] D. Jung, W. Kim, H. Song, J.-i. Hwang, B. Lee, B. Kim, J. Seo, ChartSense: Interactive data extraction from chart images, in: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017, pp. 6706–6717.
[14] L. Yang, W. Huang, C. L. Tan, Semi-automatic ground truth generation for chart image recognition, in: International Workshop on Document Analysis Systems, Springer, 2006, pp. 324–335.
[15] W. R. Shadish, I. C. Brasil, D. A. Illingworth, K. D. White, R. Galindo, E. D. Nagler, D. M. Rindskopf, Using UnGraph to extract data from image files: Verification of reliability and validity, Behavior Research Methods 41 (2009) 177–183.
[16] N. Yokokura, T. Watanabe, Layout-based approach for extracting constructive elements of bar-charts, in: International Workshop on Graphics Recognition, Springer, 1997, pp. 163–174.
[17] Y. P. Zhou, C. L. Tan, Hough technique for bar charts detection and recognition in document images, in: Proceedings 2000 International Conference on Image Processing (Cat. No. 00CH37101), volume 2, IEEE, 2000, pp. 605–608.
[18] R. Al-Zaidy, C. Giles, A machine learning approach for semantic structuring of scientific charts in scholarly documents, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017, pp. 4644–4649.
[19] S. Tsutsui, D. J. Crandall, A data driven approach for compound figure separation using convolutional neural networks, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, IEEE, 2017, pp. 533–540.
[20] A. Balaji, T. Ramanathan, V. Sonathi, Chart-Text: A fully automated chart image descriptor, arXiv preprint arXiv:1812.10636 (2018).
[21] Y. He, X. Yu, Y. Gan, T. Zhu, S. Xiong, J. Peng, L. Hu, G. Xu, X. Yuan, Bar charts detection and analysis in biomedical literature of PubMed Central, in: AMIA Annual Symposium Proceedings, volume 2017, American Medical Informatics Association, 2017, p. 859.
[22] Kaggle arXiv Dataset, https://www.kaggle.com/datasets/Cornell-University/arxiv, 2017.
[23] ibm-aur-nlp, PubLayNet, https://github.com/ibm-aur-nlp/PubLayNet, 2019.
[24] A. Sironi, V. Lepetit, P. Fua, Multiscale centerline detection by learning a scale-space distance transform, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2697–2704.