ACL-Fig: A Dataset for Scientific Figure Classification

Zeba Karishma1,*, Shaurya Rohatgi1, Kavya Shrinivas Puranik1, Jian Wu2 and C. Lee Giles1
1 The Pennsylvania State University, Westgate Building, University Park, PA 16802, United States
2 Old Dominion University, 5115 Hampton Blvd, Norfolk, VA 23529, United States

Washington, DC '23: The AAAI-23 Workshop on Scientific Document Understanding, Feb 07–14, 2023, Washington, DC
* Corresponding author.
zebakarishma@gmail.com (Z. Karishma); szr207@psu.edu (S. Rohatgi); kzp5555@psu.edu (K. S. Puranik); jwu@cs.odu.edu (J. Wu); clg20@psu.edu (C. L. Giles)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for scientific figures and tables. One challenge for such services is understanding the semantics of scientific figures, such as their types and purposes. A key obstacle is the need for datasets containing annotated scientific figures and tables, which can then be used for classification, question answering, and auto-captioning. Here, we develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we built the first large-scale automatically annotated corpus, ACL-Fig, consisting of 112,052 scientific figures extracted from ≈ 56K research papers in the ACL Anthology. The ACL-Fig-pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories. The dataset is accessible at https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.

Keywords
Scientific Figures, Figure Classification, ACL Anthology, ACL-Fig

1. Introduction

Figures are ubiquitous in scientific papers, illustrating experimental and analytical results. We refer to these figures as scientific figures to distinguish them from natural images, which usually contain richer colors and gradients. Scientific figures provide a compact way to present numerical and categorical data, often helping researchers draw insights and conclusions. Machine understanding of scientific figures can assist in developing effective retrieval systems over the hundreds of millions of scientific papers readily available on the Web [1]. State-of-the-art machine learning models can parse captions and shallow semantics for specific categories of scientific figures [2]. However, reliably classifying general scientific figures based on their visual features remains a challenge.

Here, we propose a pipeline to build categorized and contextualized scientific figure datasets. Applying the pipeline to 55,760 papers in the ACL Anthology (downloaded from https://aclanthology.org/ in mid-2021), we built two datasets: ACL-Fig and ACL-Fig-pilot. ACL-Fig consists of 112,052 scientific figures, their captions, inline references, and metadata. ACL-Fig-pilot (Figure 1) is a subset of the otherwise unlabeled ACL-Fig, consisting of 1,671 scientific figures that were manually labeled into 19 categories. The ACL-Fig-pilot dataset was used as a benchmark for scientific figure classification. The pipeline is open-source and configurable, enabling others to expand the datasets from other scholarly collections with pre-defined or new labels.

Figure 1: Example figures of each type in ACL-Fig-pilot.

2. Related Work

Scientific Figures Extraction. Automatically extracting figures from scientific papers is essential for many downstream tasks, and many frameworks have been developed.
A multi-entity extraction framework called PDFMEF, which incorporates a figure extraction module, was proposed in [3]. Shared tasks such as ImageCLEF [4] drew attention to compound figure detection and separation. Clark and Divvala [5] proposed a framework called PDFFigures that extracts figures and captions from research papers. The authors later extended their work and built a more robust framework, PDFFigures2 [6]. DeepFigures was subsequently proposed to incorporate deep neural network models [2].

Scientific Figure Classification. Scientific figure classification [7, 8] aids machines in understanding figures. Early work used a visual bag-of-words representation with a support vector machine classifier [7]. Zhou and Tan [9] applied Hough transforms to recognize bar charts in document images. Siegel et al. [10] used handcrafted features to classify charts in scientific documents. Tang et al. [11] combined convolutional neural networks (CNNs) and deep belief networks, which showed improved performance compared with feature-based classifiers.

Figure Classification Datasets. There are several existing datasets for figure classification, such as DocFigure [12], FigureSeer [10], Revision [7], and the datasets presented by Karthikeyani and Nagarajan [13] (Table 1). FigureQA is a public dataset similar to ours, consisting of over one million question-answer pairs grounded in over 100,000 synthesized scientific images [14] with five figure styles. Our dataset differs from FigureQA because its figures were extracted directly from research papers. Moreover, the training data of DeepFigures comes from arXiv and PubMed, is labeled with only "figure" and "table", and does not include fine-grained labels. Our dataset contains fine-grained labels and inline context and is compiled from a different domain.

Table 1: Scientific figure classification datasets.

Dataset                 Labels   #Figures    Image Source
Deepchart               5        5,000       Web images
Figureseer (1)          5        30,600      Web images
Prasad et al.           5        653         Web images
Revision                10       2,000       Web images
FigureQA (3)            5        100,000     Synthetic figures
DeepFigures             2        1,718,000   Scientific papers
DocFigure (2)           28       33,000      Scientific papers
ACL-Fig-pilot           19       1,671       Scientific papers
ACL-Fig (inferred) (4)  -        112,052     Scientific papers

(1) Only 1,000 images are public. (2) Not publicly available. (3) Scientific-style synthesized data. (4) ACL-Fig does not contain human-assigned labels.

Figure 2: Overview of the data generation pipeline (figure extraction with DeepFigures and PDFFigures2 yielding figures, captions, and inline references; VGG16 vector representation; k-means clustering with silhouette analysis; human labeling; pattern matching and automatic annotation producing labeled figures with metadata).

3. Data Mining Methodology

The ACL Anthology is a sizable, well-maintained PDF corpus with clean metadata, covering papers in computational linguistics with freely available full text. Previous work on figure classification used sets of pre-defined categories (e.g., [14]), which may cover only some figure types. To overcome this limitation, we use an unsupervised method to determine figure categories.
After the category label is assigned, each figure is automatically annotated with metadata, captions, and inline references. The pipeline includes three steps: figure extraction, clustering, and automatic annotation (Figure 2).

3.1. Figure Extraction

To mitigate the potential bias of a single figure extractor, we extracted figures using PDFFigures2 [6] and DeepFigures [2], which work in different ways. PDFFigures2 first identifies captions and the body text, because they can be identified relatively accurately. Regions containing figures are then located by identifying rectangular bounding boxes adjacent to captions that do not overlap with the body text. DeepFigures uses distant supervision to induce figure labels from a large collection of scientific documents in LaTeX and XML format. The model is based on TensorBox, applying the OverFeat detection architecture to image embeddings generated using ResNet-101 [2]. We utilized the publicly available model weights (https://github.com/allenai/deepfigures-open), trained on 4M induced figures and 1M induced tables, for extraction. The model outputs the bounding boxes of figures and tables. Unless otherwise stated, we collectively refer to figures and tables as "figures". We used multi-processing to process the PDFs; each worker process extracts figures following the steps below (a sketch of the loop follows the list). The system processed, on average, 200 papers per minute on a Linux server with 24 cores.

1. Retrieve a paper identifier from the job queue.
2. Pull the paper from the file system.
3. Extract figures and captions from the paper.
4. Crop the figures out of the rendered PDFs using the detected bounding boxes.
5. Save the cropped figures in PNG format and the metadata in JSON format.
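The per-process loop can be summarized with the following minimal Python sketch. The queue object, the record format, and the wrappers run_pdffigures2 and run_deepfigures around the two extractors are hypothetical names introduced for illustration; they are not part of the released pipeline.

```python
import json
from pathlib import Path

def extract_worker(job_queue, pdf_root: Path, out_root: Path):
    """One worker process: pull paper IDs from a job queue and extract figures.

    Assumptions for this sketch: `job_queue.get()` blocks and returns a paper
    ID (or None to stop); `run_pdffigures2`/`run_deepfigures` are hypothetical
    wrappers around the two extractors, each returning dicts with a cropped
    PIL image under "image" plus metadata (caption, bounding boxes, page, ...).
    """
    while True:
        paper_id = job_queue.get()                 # 1. retrieve a paper identifier
        if paper_id is None:
            break
        pdf_path = pdf_root / f"{paper_id}.pdf"    # 2. pull the paper from the file system

        # 3.-4. extract figures/captions and crop them from the rendered pages
        records = run_pdffigures2(pdf_path) + run_deepfigures(pdf_path)

        # 5. save cropped images as PNG and the remaining metadata as JSON
        for i, rec in enumerate(records):
            stem = out_root / f"{paper_id}_{i}"
            rec["image"].save(stem.with_suffix(".png"))
            with open(stem.with_suffix(".json"), "w") as fh:
                json.dump({k: v for k, v in rec.items() if k != "image"}, fh)
```

Each such worker would typically be launched as a separate process (e.g., with multiprocessing), matching the 24-core setup reported above.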
3.2. Clustering Methods

Next, we use an unsupervised method to label the extracted figures automatically. We extract visual features using VGG16 [15], pretrained on ImageNet [16]. All input figures are scaled to 224 × 224 pixels to be compatible with the input requirement of VGG16. The features were extracted from the second-to-last hidden (dense) layer and consist of 4,096 dimensions. Principal Component Analysis was adopted to reduce the dimensionality to 1,000.

Next, we cluster the figures, represented by the 1,000-dimensional vectors, using k-means clustering. We compare two heuristic methods to determine the optimal number of clusters: the Elbow method and Silhouette Analysis [17]. The Elbow method examines the explained variation, a measure that quantifies the ratio of the between-group variance to the total variance, as a function of the number of clusters; the pivot point (elbow) of the curve determines the number of clusters. Silhouette Analysis determines the number of clusters by measuring the distance between clusters. It considers multiple factors such as variance, skewness, and high-low differences, and is usually preferred over the Elbow method. The Silhouette plot displays how close each point in one cluster is to points in the neighboring clusters, allowing us to assess the number of clusters visually.
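This feature-extraction and clustering step can be sketched compactly with Keras and scikit-learn. The function names, file handling, and the rule of picking k by the maximum silhouette score are illustrative assumptions; the released pipeline may differ in details.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# VGG16 with ImageNet weights, truncated at the second-to-last dense layer ("fc2", 4096-d).
base = VGG16(weights="imagenet")
feat_model = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def figure_vector(png_path: str) -> np.ndarray:
    """Load a cropped figure, resize it to 224x224, and return its 4096-d feature."""
    img = image.load_img(png_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feat_model.predict(x, verbose=0)[0]

def cluster_figures(png_paths, k_range=range(2, 21)):
    """PCA to 1,000 dimensions, then k-means; choose k by the best silhouette score."""
    feats = np.stack([figure_vector(p) for p in png_paths])
    feats = PCA(n_components=1000).fit_transform(feats)      # 4096 -> 1000 dims
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
        scores[k] = silhouette_score(feats, labels)
    best_k = max(scores, key=scores.get)
    final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(feats)
    return final_labels, best_k
```

In Section 4.2 the silhouette score peaks at k = 15; the sketch mirrors that selection rule by scanning k from 2 to 20 and keeping the maximum.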
3.3. Linking Figures to Metadata

This module associates figures with metadata, including captions, inline references, figure type, figure boundary coordinates, caption boundary coordinates, and figure text (text appearing on figures, available only for results from PDFFigures2). The figure type is determined in the clustering step above. The inline references are obtained using GROBID (see below). The remaining metadata fields are output by the figure extractors. PDFFigures2 and DeepFigures extract the same metadata fields except for "image text" and "regionless captions" (captions for which no figure regions were found), which are available only for results of PDFFigures2.

An inline reference is a text span that contains a reference to a figure or a table. Inline references can help to understand the relationship between the text and the objects it refers to. After processing a paper, GROBID outputs a TEI file (a type of XML file) containing marked-up full text and references. We locate inline references using regular expressions and extract the sentences containing the reference marks.

4. Results

4.1. Figure Extraction

Figure 3: Numbers of extracted images (PDFFigures2 vs. DeepFigures; counts shown: 14,283, 240,623, and 9,046).

The numbers of figures extracted by PDFFigures2 and DeepFigures are illustrated in Figure 3, which indicates a significant overlap between the figures extracted by the two software packages. However, each package extracted a small fraction of figures (≈ 5%) that were not extracted by the other. By inspecting a random sample of figures extracted by either package, we found that DeepFigures tended to miss cases in which two figures were vertically adjacent to each other. We took the union of all figures extracted by both software packages to build the ACL-Fig dataset, which contains a total of 263,952 figures. All extracted images were converted to 100 DPI using standard OpenCV libraries. The total size of the data is ∼ 25 GB before compression. Inline references were extracted using GROBID; about 78% of the figures have inline references.

4.2. Automatic Figure Annotation

The extraction outputs 151,900 tables and 112,052 figures. Only the figures were clustered using the k-means algorithm. We varied k from 2 to 20 with an increment of 1 to determine the number of clusters. The results were analyzed using the Elbow method and Silhouette Analysis. No evident elbow was observed in the Elbow-method curve. The Silhouette diagram, a plot of the number of clusters versus the silhouette score, exhibited a clear turning point at k = 15, where the score reached its global maximum. Therefore, we grouped the figures into 15 clusters.

To validate the clustering results, 100 figures randomly sampled from each cluster were visually inspected. During the inspection, we identified three new figure types: word cloud, pareto, and venn diagram. The ACL-Fig-pilot dataset was then built using all manually inspected figures. Two annotators manually labeled and inspected these clusters. Inter-annotator agreement was measured using Cohen's Kappa coefficient, which was κ = 0.78 (substantial agreement) for the ACL-Fig-pilot dataset. For completeness, we added 100 randomly selected tables. The ACL-Fig-pilot dataset therefore contains a total of 1,671 figures and tables labeled with 19 classes. The distribution of all classes is shown in Figure 4.

Figure 4: Figure class distribution in the ACL-Fig-pilot dataset (classes: trees, natural images, confusion matrix, graph, architecture diagram, screenshots, bar charts, neural networks, NLP text_grammar_eg, line graph_chart, tables, algorithms, pie chart, scatter plot, maps, boxplots, word cloud, venn diagram, pareto).

5. Supervised Scientific Figure Classification

Based on the ACL-Fig-pilot dataset, we trained supervised classifiers. The dataset was split into a training and a test set (8:2 ratio). Three baseline models were investigated.

Model 1 is a 3-layer CNN trained with a categorical cross-entropy loss function and the Adam optimizer. The model contains three typical convolutional layers, each followed by a max-pooling and a dropout layer, and three fully-connected layers. The feature-map dimensions are reduced from 32 × 32 to 16 × 16 to 8 × 8. The last fully-connected layer classifies the encoded vector into 19 classes. This classifier achieves an accuracy of 59%.

Model 2 was trained based on the VGG16 architecture, except that the last three fully-connected layers in the original network were replaced by a long short-term memory layer followed by dense layers for classification. This model achieved an accuracy of ∼ 79%, 20 percentage points higher than Model 1.

Model 3 was the Vision Transformer (ViT) [18], in which a figure is split into fixed-size patches; each patch is linearly embedded and supplemented with position embeddings, and the resulting sequence of vectors is fed to a standard Transformer encoder. The ViT model achieved the best performance, with 83% accuracy.
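As a reference point, a Model-1-style baseline can be sketched in Keras as follows. The input size (64 × 64, so that the pooled feature maps pass through 32 × 32, 16 × 16, and 8 × 8), the filter counts, dense widths, and dropout rate are assumptions made for illustration, not the exact published configuration.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 19  # figure/table classes in ACL-Fig-pilot

def build_cnn_baseline(input_shape=(64, 64, 3), dropout=0.25):
    """Three conv/max-pool/dropout stages followed by three dense layers."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),            # feature maps: 64x64 -> 32x32
        layers.Dropout(dropout),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),            # 32x32 -> 16x16
        layers.Dropout(dropout),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),            # 16x16 -> 8x8
        layers.Dropout(dropout),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Models 2 and 3 can be assembled analogously from tensorflow.keras.applications.VGG16 and an off-the-shelf ViT implementation, respectively.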
6. Conclusion

Based on the ACL Anthology papers, we designed a pipeline and used it to build a corpus of automatically labeled scientific figures with associated metadata and context information. This corpus, named ACL-Fig, consists of ≈ 250K objects, of which about 42% are figures and about 58% are tables. We also built ACL-Fig-pilot, a subset of ACL-Fig consisting of 1,671 scientific figures with 19 manually verified labels. Our dataset includes figures extracted from real-world data and contains more classes than existing datasets, e.g., DeepFigures and FigureQA.

One limitation of our pipeline is that it used VGG16 pre-trained on ImageNet. In the future, we will improve figure representation by retraining more sophisticated models, e.g., CoCa [19], on scientific figures. Another limitation was that determining the number of clusters required visual inspection. We will consider density-based methods to fully automate the clustering module.

References

[1] M. Khabsa, C. L. Giles, The number of scholarly documents on the public web, PLoS ONE 9 (2014) e93949.
[2] N. Siegel, N. Lourie, R. Power, W. Ammar, Extracting scientific figures with distantly supervised neural networks, in: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018. doi:10.1145/3197026.3197040.
[3] J. Wu, J. Killian, H. Yang, K. Williams, S. R. Choudhury, S. Tuarob, C. Caragea, C. L. Giles, PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search, in: Proceedings of the 8th International Conference on Knowledge Capture, 2015.
[4] A. G. S. de Herrera, H. Muller, S. Bromuri, Overview of the ImageCLEF 2015 medical classification task, in: CLEF, 2015.
[5] C. Clark, S. Divvala, Looking beyond text: Extracting figures, tables and captions from computer science papers, in: AAAI Workshop: Scholarly Big Data, 2015.
[6] C. Clark, S. Divvala, PDFFigures 2.0: Mining figures from research papers, in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2016, pp. 143–152.
[7] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, ReVision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011.
[8] S. R. Choudhury, C. L. Giles, An architecture for information extraction from figures in digital libraries, in: Proceedings of the 24th International Conference on World Wide Web, 2015.
[9] Y. Zhou, C. Tan, Hough technique for bar charts detection and recognition in document images, in: Proceedings of the 2000 International Conference on Image Processing, vol. 2, 2000, pp. 605–608.
[10] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, FigureSeer: Parsing result-figures in research papers, in: ECCV, 2016.
[11] B. Tang, X. Liu, J. Lei, M. Song, D. Tao, S. Sun, F. Dong, DeepChart: Combining deep convolutional networks and deep belief networks in chart classification, Signal Processing 124 (2016) 156–161. doi:10.1016/j.sigpro.2015.09.027.
[12] K. V. Jobin, A. Mondal, C. V. Jawahar, DocFigure: A dataset for scientific document figure classification, in: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 2019, pp. 74–79. doi:10.1109/ICDARW.2019.00018.
[13] V. Karthikeyani, S. Nagarajan, Machine learning classification algorithms to recognize chart types in portable document format (PDF) files, International Journal of Computer Applications 39 (2012) 1–5.
[14] S. E. Kahou, V. Michalski, A. Atkinson, A. Kadar, A. Trischler, Y. Bengio, FigureQA: An annotated figure dataset for visual reasoning, 2018. arXiv:1710.07300.
[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[17] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53–65. doi:10.1016/0377-0427(87)90125-7.
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[19] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, Y. Wu, CoCa: Contrastive captioners are image-text foundation models, CoRR abs/2205.01917 (2022). arXiv:2205.01917.