=Paper= {{Paper |id=Vol-3656/paper14 |storemode=property |title=ChartParser: Automatic Chart Parsing for Print-Impaired |pdfUrl=https://ceur-ws.org/Vol-3656/paper14.pdf |volume=Vol-3656 |authors=Anukriti Kumar,Tanuja Ganu |dblpUrl=https://dblp.org/rec/conf/aaai/KumarG23 }} ==ChartParser: Automatic Chart Parsing for Print-Impaired== https://ceur-ws.org/Vol-3656/paper14.pdf
                                ChartParser: Automatic Chart Parsing for Print-Impaired
                                Anukriti Kumar1,∗,† , Tanuja Ganu2,†
                                1
                                    University of Washington
                                2
                                    Microsoft Research


                                                  Abstract
Infographics are often an integral component of scientific documents for reporting qualitative or quantitative findings, as they make the underlying complex information much simpler to comprehend. However, their interpretation continues to be a challenge for blind, low-vision, and other print-impaired (BLV) individuals. In this paper, we propose ChartParser, a fully automated pipeline that leverages deep learning, OCR, and image processing techniques to extract all figures from a research paper, classify them into various chart categories (bar chart, line chart, etc.), and obtain relevant information from them, focusing on bar charts (including horizontal, vertical, stacked horizontal, and stacked vertical charts), which already pose several interesting challenges. Finally, we present the retrieved content in a tabular format that is screen-reader friendly and accessible to BLV users. We present a thorough evaluation of our approach by applying our pipeline to annotated real-world bar charts sampled from research papers.

                                                  Keywords
                                                  Infographics Accessibility, Visualization Design, Information Retrieval, Human-centered computing



1. Introduction

Academic research is advancing at an incredible pace, with thousands of scientific documents published monthly [1]. These documents often use figures/charts as a medium for data representation and interpretation. However, blind, low-vision, and other print-disabled (BLV) individuals are often deprived of the insights and understanding offered by these figures. Although figures are converted into non-visual, screen-reader-friendly representations such as alt-text, data tables, etc., this conversion relies heavily on volunteers, making it an extremely time-consuming process. In most cases, even the alternate text fails to describe charts properly. Hence, our goal in this paper is to design a fully automated pipeline to extract useful information from charts, specifically bar charts, and convert them into accessible data tables. Potential applications of our system include helping authors provide meaningful captions to the figures in their papers, improving search and retrieval of relevant information in the academic domain, generating summaries from charts, building query-answering systems, developing interfaces that provide simple and convenient access to complex information, making charts accessible to BLV individuals, and helping academic committees and publishers identify plagiarized articles.

Given the remarkable progress in analyzing natural scene images observed in recent years, it is generally assumed that analyzing scientific figures is a trivial task. However, understanding charts/infographics presents a plethora of complex challenges. Firstly, a high level of accuracy is expected while parsing the figure plot data, as even a small mistake in analyzing chart data can lead to erroneous conclusions. Also, authors employ different design conventions while structuring and formatting their figures, resulting in high variation across papers. It is also challenging to extract information from charts amidst heavy clutter and deformation within the plot area. Even though color is an essential cue for differentiating the plot data, it may not always be present, because many figures reuse similar colors and some are even published in grayscale. Figure parsing also presents an additional challenge because there is only one exemplar (the legend symbol) available for model learning, in contrast to natural image recognition tasks, where the desired amount of labeled training data can be obtained to train models per category. Due to these challenges, there is currently no system that can automatically parse data from scientific figures/charts.

In this paper, we make three key contributions. First, we propose ChartParser, a fully automated pipeline that leverages deep learning, OCR, and image processing techniques to extract all figures from a research paper, classify them into various chart categories, and retrieve useful information from them, specifically bar charts. Second, we address some of the key challenges present in existing systems. For example, our system can parse the legend and utilize color information for data association. It is also robust to variations in figure design and makes no assumptions about the position of the axes, legend, etc. And finally, we demonstrate the viability of our approach by applying our pipeline to a real-world dataset of research papers from different sources.

The Third AAAI Workshop on Scientific Document Understanding, Feb 14, 2023
∗ Corresponding author.
† These authors contributed equally.
anukumar@uw.edu (A. Kumar); tanuja.ganu@microsoft.com (T. Ganu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Related Work

Chart understanding in scientific literature has recently gained much traction, and there have been several attempts to classify charts using heuristics and expert rules. Various machine learning-based algorithms that rely on handcrafted features such as histograms of oriented gradients (HOG), scale-invariant feature transform (SIFT), and others have been proposed in the literature [2, 3]. Several deep learning algorithms for chart and table image classification have also been introduced recently [4, 5, 6].

There is another line of work on interpreting text components in chart images [7, 8, 9, 10, 11, 12]. Although semi-automatic software solutions are available for data extraction from charts, using them requires the user to manually define the chart's coordinate system, provide metadata about the axes and data, or click on the data points [13, 14, 15].

One of the difficulties in accurately parsing bar charts is dealing with the different types of bar charts in scientific literature. Previous work, for example [16, 17], focused on developing heuristic models that detect key elements such as bars, legends, etc. Similarly, machine learning has also been used recently to detect chart components (e.g., bars or legends) [18]. Also, a deep learning object detection model is trained in [19] to identify sub-figures in compound figures. However, none of these works extracted data values from bar charts. Using synthetic data produced by the matplotlib toolkit, [20] created a model to boost accuracy while parsing bar values.

Most of the previous methods do not parse the legend. Some assumed that the legend was always placed below the chart [20] or horizontally along the same line [21], which limits the applicability of these models. Previous work was mostly created for grayscale visualizations, as it did not parse color information from the legend. Also, there has been less focus on measuring the accuracy of detecting the axes or label values; quantifying the accuracy of obtaining this semantic information is essential for understanding the upper limits of the evaluation process. Even though the process of extracting information from charts and other infographics has been extensively explored, to our knowledge prior work has the shortcomings discussed above. As a result, we propose a fully automated system for data extraction from bar charts that addresses these existing limitations and can be extended to other types of charts, including line charts, scatter plots, etc.

3. Methodology

This section discusses our proposed pipeline to convert bar charts from scientific publications into data tables. The process is divided into three steps: First, we extract figures from research papers. Second, we detect bar charts among the extracted figures. And finally, we extract content from the bar charts to obtain the desired data tables. These three steps are depicted in Figure 2.

3.1. Figure Extraction

To segment all the figures from a research paper, we use a pre-trained image segmentation model based on the Mask R-CNN architecture from the Detectron2 model zoo to decompose a document into five categories: title, text block, list, figure, and table. The model is based on the ResNet50 feature pyramid network (FPN) base config and is trained on the PubLayNet dataset for document layout analysis.

3.2. Figure Classification

Most of the extracted figures are charts, including tree diagrams, network diagrams, bubble charts, etc. This step describes the chart classification model employed to detect bar charts.

3.2.1. Chart Images Dataset

We create a chart dataset to train and evaluate our chart classification model. We use the Python module google images download to obtain charts from 13 categories (scatter plots, bar charts, line charts, etc.), 1000 images from each category. Then, we manually identify and remove incorrect samples among the downloaded charts. Finally, we obtain a ground-truth dataset with a total of approximately 12k charts, including 978 bar charts.

3.2.2. Chart Classification Model

We try out different models pre-trained on the ImageNet dataset and fine-tune them on the figure dataset created above. All layers but the final convolutional layer are frozen. The fully-connected layer uses a softmax function to classify figures into 13 chart categories. Using Adadelta as the optimizer, we re-train the convolutional layer and the additional fully-connected layer for 30 iterations. We also add a dropout layer with a rate of 0.3 before the final fully-connected layer to avoid overfitting. All the baselines achieve similar accuracy, so we choose MobileNet, as it uses far fewer parameters on ImageNet than the others.
Figure 1: Illustration of the ChartParser pipeline
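The three stages of the pipeline illustrated above can be summarized as a short driver script. Every function below is a hypothetical placeholder (with dummy return values) standing in for the corresponding module of Sections 3.1-3.3; none of these names come from the paper:

```python
# Hypothetical driver for the ChartParser pipeline in Figure 1.

def extract_figures(pdf_path):   # Section 3.1: layout-model figure extraction (stub)
    return ["figure-1.png", "figure-2.png"]

def classify_chart(figure):      # Section 3.2: 13-way chart classifier (stub)
    return "bar chart" if figure == "figure-1.png" else "line chart"

def bar_chart_to_table(figure):  # Section 3.3: OCR + image-processing extraction (stub)
    return {"figure": figure, "rows": [("label", "value")]}

def chartparser(pdf_path):
    """Run the three stages and collect screen-reader-friendly data tables."""
    tables = []
    for figure in extract_figures(pdf_path):
        if classify_chart(figure) == "bar chart":
            tables.append(bar_chart_to_table(figure))
    return tables

tables = chartparser("paper.pdf")  # only the bar chart yields a table
```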



3.3. Content Extraction

Content extraction from charts is a complex process; in this step, we employ OCR and image processing techniques to extract the relevant content from bar charts through various modules.

3.3.1. Axes Detection

We convert the image into a binary one and then obtain the maximum continuous runs of ones along each row and column. For this, we scan the matrix vertically and horizontally to trace the continuity of black pixels within adjacent columns and rows. Finally, the y-axis is the first column where the maximum continuous run of 1s falls in the region [max - threshold, max + threshold], where a predetermined threshold (=10) is assumed. Similarly, for the x-axis, the last row is chosen based on where the maximum continuous run of 1s falls within the range [max - threshold, max + threshold].

3.3.2. Text Detection

We apply Azure Cognitive Services (ACS) Optical Character Recognition (OCR) to detect text within a chart and extract the rectangular bounding boxes of all the detected text.

3.3.3. Axes Ticks Detection

We filter all the text boxes below the x-axis and to the right of the y-axis. Further, we run a sweeping line from the x-axis to the bottom of the image, and the line that intersects the maximum number of text boxes provides the bounding boxes for all the x-axis ticks. A similar algorithm with a vertical sweeping line is used for detecting the y-axis ticks.

3.3.4. Axes Label Detection

We filter the text boxes present below the x-axis ticks and again run a sweeping line from the x-axis ticks to the bottom of the image. While doing so, the line intersecting the maximum number of text boxes provides us with the bounding boxes for the x-axis label. Similarly, we also obtain the y-axis label using a vertical sweeping line.

3.3.5. Legend Detection

Firstly, we remove the bounding boxes of the axes labels and ticks. Then, we also remove boxes containing only a single "I" character, because these are typically error bars read as text, and finally, we remove text boxes with numeric values placed above bars. This implies that only legend names and color boxes remain among the remaining text boxes. Because legend names might consist of multiple words, we combine bounding boxes that are less than 10px apart into a single legend name. We organize these bounding boxes into groups where each member is either horizontally or vertically aligned with at least one other member. Finally, the largest group gives the bounding boxes for all the legends.

3.3.6. Legend Color Estimation

The color boxes are assumed to be on the left or right side, depending on the placement of the text bounding box within the legend extracted in the previous module. Ideally, all pixels within a color box should have the same value. Since these values can vary for several reasons (such as image compression, scanning, etc.), we start a new group with a random pixel and gradually add pixels whose R, G, and B values differ by no more than 5 from the average of all the pixels in the group. The color of a legend label is determined by taking the per-channel (R, G, and B) average of all the pixels in the largest group. Later, bars matching a specific legend are identified using these colors.

3.3.7. Data Extraction

The bounding boxes of each legend are whitened, and we eliminate all the white pixels from the original chart image. The colors decided upon in the previous module serve as the initial clusters, and all of the image's pixel values are then divided into these clusters. Then, we divide the given plot into multiple plots, one for each cluster. In other words, by clustering, we break a stacked bar chart down into several simpler plots. Then, we obtain all contours within the plot and subsequently pick the closest bounding rectangle for each label. Further, we require a mapping function to map pixel values to actual values in the chart. Hence, we use the value-tick ratio (α) to estimate the height of each bar. To find this ratio, we divide the average of the actual y-label ticks (N_ticks) by the average distance between ticks in pixels (Δd):

    α = N_ticks / Δd    (1)

Finally, the bar chart's y values are defined as y = α × H, where H is the bar's height. After obtaining all the relevant information, we create a data table from it, as shown in Figure 2 (e).

4. Results

This section focuses on creating a test dataset of bar charts from research papers and evaluating the various components of our pipeline on this dataset to demonstrate the viability of our approach.

Table 1
Content extraction accuracy

    Component           Accuracy (%)
    X-axis              97
    Y-axis              94
    X-axis label        95
    Y-axis label        91
    X-axis ticks        89
    Y-axis ticks        84
    Legend              87
    Legend color        87
    Data association    76

4.1. Test Dataset

We sample research papers from two data sources: arXiv and PubLayNet. From the arXiv dataset published on Kaggle [22], we obtained research paper PDFs published in the years 2019-2021, resulting in a dataset of around 10,024 papers. Also, we use a subset of the PubLayNet dataset [23] and obtain approximately 15k document images from it. Then, we apply the first two steps of our fully automated pipeline to these research papers, as described in Section 3. First, we extract approximately 51k figures from the research papers using our image segmentation model, and then, on applying our chart classification model to these figures, we obtain approximately 2,112 bar charts. To evaluate our system, we sample 100 bar charts and manually annotate the relevant data, including the axes, axes labels, axes tick values, legend, legend colors, and the textual bounding boxes.

4.2. Chart Classification

The accuracy of our chart classification model is calculated using stratified five-fold cross-validation. Here, we use 20% of the chart images dataset, created using the google images download API, as our validation set, and the category-wise performance (average accuracy) of our model is presented in Table 2. We observe that for bar charts, our model achieves an accuracy of 97.8%.

4.3. Text Recognition

We use the Intersection over Union (IoU) metric to assess our text detection module. This metric matches each predicted bounding box to the closest actual one, calculates the area of the intersection divided by the area of the union for each match, and considers the prediction successful if the IoU measure is higher than a threshold, for example, 0.5. We achieve an F1-score of 0.935 with an IoU threshold of 0.5, which demonstrates that our module detects text bounding boxes within the plot area fairly well.
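The pixel-to-value mapping of Equation 1 (Section 3.3.7) can be illustrated with a small worked example. Here we read N_ticks as the average increment between consecutive y-tick values and Δd as the average pixel gap between consecutive ticks; this is our interpretation of the formula, not the authors' exact code, and all numbers below are invented:

```python
# Hypothetical y-axis tick annotations (value, pixel row) as produced by
# the tick-detection module.
tick_values = [0.0, 10.0, 20.0, 30.0]
tick_pixels = [400, 300, 200, 100]   # image rows; row 0 is the top

# Value-tick ratio (Equation 1): average value step between consecutive
# ticks divided by the average pixel distance between them.
steps = [b - a for a, b in zip(tick_values, tick_values[1:])]
gaps = [abs(b - a) for a, b in zip(tick_pixels, tick_pixels[1:])]
n_ticks = sum(steps) / len(steps)   # 10.0 value units per tick step
delta_d = sum(gaps) / len(gaps)     # 100 pixels per tick step
alpha = n_ticks / delta_d           # 0.1 value units per pixel

# A detected bar contour H pixels tall maps to the value alpha * H.
bar_height_px = 250
bar_value = alpha * bar_height_px   # 0.1 * 250 = 25.0
```

With evenly spaced ticks the averaging is redundant, but it smooths out small OCR localization errors when the detected tick rows are slightly off.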
    Category           Accuracy (%)
    Bar Chart          97.8
    Line Chart         96.86
    Scatter Plot       92.00
    Pareto Chart       84.20
    Pie Chart          91.52
    Venn Diagram       87.88
    Box Plot           94.56
    Network Diagram    68.97
    Map                79.26
    Tree Diagram       69.09
    Area Graph         88.00
    Flow Chart         75.54
    Bubble Chart       92.20

Table 2
Category-wise average accuracy of the chart classification model

4.4. Content Extraction

The performance of the final content extraction process depends on the sequential performance of each module, i.e., axis detection, axis tick value extraction, label extraction, legend detection, and so on. First, we apply the OCR and image processing techniques to the test dataset and extract the relevant content. Then, we compare the outcome with the manually annotated data and obtain the module-wise evaluation metrics presented in Table 1.

5. Limitations and Future Work

This section discusses the existing limitations of our fully automated pipeline and proposes future work for improvement.

Currently, our proposed pipeline fails to parse the plotted data successfully when there is a lot of clutter. We can employ vascular tracking methods like those described in [24] to solve this.

Also, our pipeline fails to recognize the axes when there is no solid line indicating the y-axis. In this scenario, the y-axis can be identified by recognizing bounding boxes along a vertical line in the bar chart. Also, when the x-axis is at the top of the graphic, x-axis detection may fail. This case can be handled by employing a bidirectional sweeping line with heuristic rules.

We also realize that the axes, legend, and data extraction modules are currently modeled and trained independently in our figure analysis approach. It would be an exciting direction to jointly model and train them within an end-to-end deep network.

In our future work, we will extend our pipeline to other types of charts, including line charts, scatter plots, etc., which have an L-shaped axis similar to bar charts, and follow a similar algorithm for extracting chart elements such as axes, labels, ticks, and legends. Instead of simply presenting the raw data in tabular form, we can also generate insights from the data by employing high-level reasoning on chart images, finding relationships between the various chart elements.

6. Conclusion

In this paper, we present our ongoing work on making scientific documents accessible to blind, low-vision, and print-disabled individuals. Our work focuses on the problem of the poor accessibility of infographics/charts in research papers. We propose an end-to-end pipeline to extract all figures from a research paper, classify them into various chart categories, obtain relevant information from them, specifically bar charts, and present the retrieved content in accessible data tables. Finally, we apply our pipeline to a test dataset of research papers from two different sources, arXiv and PMC, to demonstrate the viability of our approach. We continue to work towards making charts fully accessible to print-impaired individuals by overcoming the existing limitations of our work.

References

 [1] arXiv Monthly Stats, Global survey monthly submissions, https://arxiv.org/stats/monthly_submissions, 2022. Accessed November 14, 2022.
 [2] W. Huang, C. L. Tan, A system for understanding imaged infographics and its applications, in: Proceedings of the 2007 ACM Symposium on Document Engineering, 2007, pp. 9–18.
 [3] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, ReVision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011, pp. 393–402.
 [4] D. Jung, W. Kim, H. Song, J.-i. Hwang, B. Lee, B. Kim, J. Seo, ChartSense: Interactive data extraction from chart images, in: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017, pp. 6706–6717.
 [5] J. Poco, J. Heer, Reverse-engineering visualizations: Recovering visual encodings from chart images, in: Computer Graphics Forum, volume 36, Wiley Online Library, 2017, pp. 353–363.
 [6] J. Choi, S. Jung, D. G. Park, J. Choo, N. Elmqvist, Visualizing for the non-visual: Enabling the visually impaired to use visualization, in: Computer Graphics Forum, volume 38, Wiley Online Library, 2019, pp. 249–260.
 [7] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphics textually, Computational Linguistics 38 (2012) 527–574.
 [8] S. R. Choudhury, S. Wang, C. L. Giles, Scalable algorithms for scholarly figure mining and semantics, in: Proceedings of the International Workshop on Semantic Big Data, 2016, pp. 1–6.
 [9] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: European Conference on Computer Vision, Springer, 2016, pp. 235–251.
[10] S. Ebrahimi Kahou, V. Michalski, A. Atkinson, A. Kadar, A. Trischler, Y. Bengio, FigureQA: An annotated figure dataset for visual reasoning, arXiv e-prints (2017) arXiv–1710.
[11] Z. Chen, M. Cafarella, E. Adar, DiagramFlyer: A search engine for data-driven diagrams, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 183–186.
[12] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. A. Bateman, AI2D-RST: A multimodal corpus of 1000 primary school science diagrams, Language Resources and Evaluation 55 (2021) 661–688.
[13] D. Jung, W. Kim, H. Song, J.-i. Hwang, B. Lee, B. Kim, J. Seo, ChartSense: Interactive data extraction from chart images, in: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017, pp. 6706–6717.
[14] L. Yang, W. Huang, C. L. Tan, Semi-automatic ground truth generation for chart image recognition, in: International Workshop on Document Analysis Systems, Springer, 2006, pp. 324–335.
[15] W. R. Shadish, I. C. Brasil, D. A. Illingworth, K. D. White, R. Galindo, E. D. Nagler, D. M. Rindskopf, Using UnGraph to extract data from image files: Verification of reliability and validity, Behavior Research Methods 41 (2009) 177–183.
[16] N. Yokokura, T. Watanabe, Layout-based approach for extracting constructive elements of bar-charts, in: International Workshop on Graphics Recognition, Springer, 1997, pp. 163–174.
[17] Y. P. Zhou, C. L. Tan, Hough technique for bar charts detection and recognition in document images, in: Proceedings 2000 International Conference on Image Processing (Cat. No. 00CH37101), volume 2, IEEE, 2000, pp. 605–608.
[18] R. Al-Zaidy, C. Giles, A machine learning approach for semantic structuring of scientific charts in scholarly documents, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017, pp. 4644–4649.
[19] S. Tsutsui, D. J. Crandall, A data driven approach for compound figure separation using convolutional neural networks, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, IEEE, 2017, pp. 533–540.
[20] A. Balaji, T. Ramanathan, V. Sonathi, Chart-Text: A fully automated chart image descriptor, arXiv preprint arXiv:1812.10636 (2018).
[21] Y. He, X. Yu, Y. Gan, T. Zhu, S. Xiong, J. Peng, L. Hu, G. Xu, X. Yuan, Bar charts detection and analysis in biomedical literature of PubMed Central, in: AMIA Annual Symposium Proceedings, volume 2017, American Medical Informatics Association, 2017, p. 859.
[22] Kaggle arXiv Dataset, arXiv dataset, https://www.kaggle.com/datasets/Cornell-University/arxiv, 2017.
[23] ibm-aur-nlp, GitHub - ibm-aur-nlp/PubLayNet, https://github.com/ibm-aur-nlp/PubLayNet, 2019.
[24] A. Sironi, V. Lepetit, P. Fua, Multiscale centerline detection by learning a scale-space distance transform, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2697–2704.