ACL-Fig: A Dataset for Scientific Figure Classification

Zeba Karishma1,*, Shaurya Rohatgi1, Kavya Shrinivas Puranik1, Jian Wu2 and C. Lee Giles1
1 The Pennsylvania State University, Westgate Building, University Park, PA 16802, United States
2 Old Dominion University, 5115 Hampton Blvd, Norfolk, VA 23529, United States

Washington, DC '23: The AAAI-23 Workshop on Scientific Document Understanding, Feb 07–14, 2023, Washington, DC
* Corresponding author.
zebakarishma@gmail.com (Z. Karishma); szr207@psu.edu (S. Rohatgi); kzp5555@psu.edu (K. S. Puranik); jwu@cs.odu.edu (J. Wu); clg20@psu.edu (C. L. Giles)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for scientific figures and tables. One challenge for such services is understanding the semantics of scientific figures, such as their types and purposes. A key obstacle is the need for datasets containing annotated scientific figures and tables, which can then be used for classification, question answering, and auto-captioning. Here, we develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we built the first large-scale automatically annotated corpus, ACL-Fig, consisting of 112,052 scientific figures extracted from ≈ 56K research papers in the ACL Anthology. The ACL-Fig-pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories. The dataset is accessible at https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.

Keywords
Scientific Figures, Figure Classification, ACL Anthology, ACL-Fig

1. Introduction

Figures are ubiquitous in scientific papers, illustrating experimental and analytical results. We refer to these figures as scientific figures to distinguish them from natural images, which usually contain richer colors and gradients. Scientific figures provide a compact way to present numerical and categorical data, often helping researchers draw insights and conclusions. Machine understanding of scientific figures can assist in developing effective retrieval systems over the hundreds of millions of scientific papers readily available on the Web [1]. State-of-the-art machine learning models can parse captions and shallow semantics for specific categories of scientific figures [2]. However, reliably classifying general scientific figures based on their visual features remains a challenge.

Here, we propose a pipeline to build categorized and contextualized scientific figure datasets. Applying the pipeline to 55,760 papers in the ACL Anthology (downloaded from https://aclanthology.org/ in mid-2021), we built two datasets: ACL-Fig and ACL-Fig-pilot. ACL-Fig consists of 112,052 scientific figures, their captions, inline references, and metadata. ACL-Fig-pilot (Figure 1) is a subset of the otherwise unlabeled ACL-Fig, consisting of 1,671 scientific figures that were manually labeled into 19 categories. The ACL-Fig-pilot dataset was used as a benchmark for scientific figure classification. The pipeline is open-source and configurable, enabling others to expand the datasets from other scholarly collections with pre-defined or new labels.

Figure 1: Example figures of each type in ACL-Fig-pilot.

2. Related Work

Scientific Figures Extraction. Automatically extracting figures from scientific papers is essential for many downstream tasks, and many frameworks have been developed.
A multi-entity extraction framework called PDFMEF, which incorporates a figure extraction module, was proposed in [3]. Shared tasks such as ImageCLEF [4] drew attention to compound figure detection and separation. Clark and Divvala [5] proposed a framework called PDFFigures that extracts figures and captions from research papers. The authors later extended their work and built a more robust framework, PDFFigures2 [6]. DeepFigures was subsequently proposed to incorporate deep neural network models [2].

Scientific Figure Classification. Scientific figure classification [7, 8] aids machines in understanding figures. Early work used a visual bag-of-words representation with a support vector machine classifier [7]. Zhou and Tan [9] applied Hough transforms to recognize bar charts in document images. Siegel et al. [10] used handcrafted features to classify charts in scientific documents. Tang et al. [11] combined convolutional neural networks (CNNs) and deep belief networks, which showed improved performance compared with feature-based classifiers.

Figure Classification Datasets. There are several existing datasets for figure classification, such as DocFigure [12], FigureSeer [10], Revision [7], and the datasets presented by Karthikeyani and Nagarajan [13] (Table 1). FigureQA is a public dataset similar to ours, consisting of over one million question-answer pairs grounded in over 100,000 synthesized scientific images [14] with five figure styles. Our dataset differs from FigureQA because its figures were extracted directly from research papers. Moreover, the training data of DeepFigures comes from arXiv and PubMed, is labeled with only "figure" and "table", and does not include fine-grained labels. Our dataset contains fine-grained labels and inline context and is compiled from a different domain.

Table 1: Scientific figure classification datasets.

Dataset                 Labels   #Figures    Image Source
Deepchart               5        5,000       Web images
Figureseer (1)          5        30,600      Web images
Prasad et al.           5        653         Web images
Revision                10       2,000       Web images
FigureQA (3)            5        100,000     Synthetic figures
DeepFigures             2        1,718,000   Scientific papers
DocFigure (2)           28       33,000      Scientific papers
ACL-Fig-pilot           19       1,671       Scientific papers
ACL-Fig (inferred) (4)  -        112,052     Scientific papers

(1) Only 1,000 images are public. (2) Not publicly available. (3) Scientific-style synthesized data. (4) ACL-Fig does not contain human-assigned labels.

Figure 2: Overview of the data generation pipeline (figure extraction with DeepFigures and PDFFigures2 yielding figures, captions, and inline references; VGG16 vector representation; k-means clustering with silhouette analysis; human labeling; pattern matching and automatic annotation producing labeled figures with metadata).

3. Data Mining Methodology

The ACL Anthology is a sizable, well-maintained PDF corpus with clean metadata, covering papers in computational linguistics with freely available full text. Previous work on figure classification used sets of pre-defined categories (e.g., [14]), which may cover only some figure types. To overcome this limitation, we use an unsupervised method to determine figure categories.
After the category label is assigned, each figure is automatically annotated with metadata, captions, and inline references. The pipeline includes three steps: figure extraction, clustering, and automatic annotation (Figure 2).

3.1. Figure Extraction

To mitigate the potential bias of a single figure extractor, we extracted figures using PDFFigures2 [6] and DeepFigures [2], which work in different ways. PDFFigures2 first identifies captions and the body text, because they can be identified relatively accurately. Regions containing figures are then located by identifying rectangular bounding boxes adjacent to captions that do not overlap with the body text. DeepFigures uses distant supervision to induce figure labels from a large collection of scientific documents in LaTeX and XML format. The model is based on TensorBox, applying the OverFeat detection architecture to image embeddings generated using ResNet-101 [2]. We utilized the publicly available model weights (https://github.com/allenai/deepfigures-open), trained on 4M induced figures and 1M induced tables, for extraction. The model outputs the bounding boxes of figures and tables. Unless otherwise stated, we collectively refer to figures and tables as "figures". We used multi-processing to process the PDFs; each worker process extracts figures following the steps below (a sketch of the loop follows the list). The system processed, on average, 200 papers per minute on a Linux server with 24 cores.

1. Retrieve a paper identifier from the job queue.
2. Pull the paper from the file system.
3. Extract figures and captions from the paper.
4. Crop the figures out of the rendered PDFs using the detected bounding boxes.
5. Save the cropped figures in PNG format and the metadata in JSON format.
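The per-process loop can be summarized with the following minimal Python sketch. The queue object, the record format, and the wrappers run_pdffigures2 and run_deepfigures around the two extractors are hypothetical names introduced for illustration; they are not part of the released pipeline.

```python
import json
from pathlib import Path

def extract_worker(job_queue, pdf_root: Path, out_root: Path):
    """One worker process: pull paper IDs from a job queue and extract figures.

    Assumptions for this sketch: `job_queue.get()` blocks and returns a paper
    ID (or None to stop); `run_pdffigures2`/`run_deepfigures` are hypothetical
    wrappers around the two extractors, each returning dicts with a cropped
    PIL image under "image" plus metadata (caption, bounding boxes, page, ...).
    """
    while True:
        paper_id = job_queue.get()                 # 1. retrieve a paper identifier
        if paper_id is None:
            break
        pdf_path = pdf_root / f"{paper_id}.pdf"    # 2. pull the paper from the file system

        # 3.-4. extract figures/captions and crop them from the rendered pages
        records = run_pdffigures2(pdf_path) + run_deepfigures(pdf_path)

        # 5. save cropped images as PNG and the remaining metadata as JSON
        for i, rec in enumerate(records):
            stem = out_root / f"{paper_id}_{i}"
            rec["image"].save(stem.with_suffix(".png"))
            with open(stem.with_suffix(".json"), "w") as fh:
                json.dump({k: v for k, v in rec.items() if k != "image"}, fh)
```

Each such worker would typically be launched as a separate process (e.g., with multiprocessing), matching the 24-core setup reported above.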
3.2. Clustering Methods

Next, we use an unsupervised method to label the extracted figures automatically. We extract visual features using VGG16 [15], pretrained on ImageNet [16]. All input figures are scaled to 224 × 224 pixels to be compatible with the input requirement of VGG16. The features were extracted from the second-to-last hidden (dense) layer and consist of 4,096 dimensions. Principal Component Analysis was adopted to reduce the dimensionality to 1,000.

Next, we cluster the figures, represented by the 1,000-dimensional vectors, using k-means clustering. We compare two heuristic methods to determine the optimal number of clusters: the Elbow method and Silhouette Analysis [17]. The Elbow method examines the explained variation, a measure that quantifies the ratio of the between-group variance to the total variance, as a function of the number of clusters; the pivot point (elbow) of the curve determines the number of clusters. Silhouette Analysis determines the number of clusters by measuring the distance between clusters. It considers multiple factors such as variance, skewness, and high-low differences, and is usually preferred over the Elbow method. The Silhouette plot displays how close each point in one cluster is to points in the neighboring clusters, allowing us to assess the number of clusters visually.
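This feature-extraction and clustering step can be sketched compactly with Keras and scikit-learn. The function names, file handling, and the rule of picking k by the maximum silhouette score are illustrative assumptions; the released pipeline may differ in details.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# VGG16 with ImageNet weights, truncated at the second-to-last dense layer ("fc2", 4096-d).
base = VGG16(weights="imagenet")
feat_model = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def figure_vector(png_path: str) -> np.ndarray:
    """Load a cropped figure, resize it to 224x224, and return its 4096-d feature."""
    img = image.load_img(png_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feat_model.predict(x, verbose=0)[0]

def cluster_figures(png_paths, k_range=range(2, 21)):
    """PCA to 1,000 dimensions, then k-means; choose k by the best silhouette score."""
    feats = np.stack([figure_vector(p) for p in png_paths])
    feats = PCA(n_components=1000).fit_transform(feats)      # 4096 -> 1000 dims
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
        scores[k] = silhouette_score(feats, labels)
    best_k = max(scores, key=scores.get)
    final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(feats)
    return final_labels, best_k
```

In Section 4.2 the silhouette score peaks at k = 15; the sketch mirrors that selection rule by scanning k from 2 to 20 and keeping the maximum.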
3.3. Linking Figures to Metadata

This module associates figures with metadata, including captions, inline references, figure type, figure boundary coordinates, caption boundary coordinates, and figure text (text appearing on figures, available only for results from PDFFigures2). The figure type is determined in the clustering step above. The inline references are obtained using GROBID (see below). The remaining metadata fields are output by the figure extractors. PDFFigures2 and DeepFigures extract the same metadata fields except for "image text" and "regionless captions" (captions for which no figure regions were found), which are available only for results of PDFFigures2.

An inline reference is a text span that contains a reference to a figure or a table. Inline references can help to understand the relationship between the text and the objects it refers to. After processing a paper, GROBID outputs a TEI file (a type of XML file) containing marked-up full text and references. We locate inline references using regular expressions and extract the sentences containing the reference marks.

4. Results

4.1. Figure Extraction

Figure 3: Numbers of extracted images (PDFFigures2 vs. DeepFigures; counts shown: 14,283, 240,623, and 9,046).

The numbers of figures extracted by PDFFigures2 and DeepFigures are illustrated in Figure 3, which indicates a significant overlap between the figures extracted by the two software packages. However, each package extracted a small fraction of figures (≈ 5%) that were not extracted by the other. By inspecting a random sample of figures extracted by either package, we found that DeepFigures tended to miss cases in which two figures were vertically adjacent to each other. We took the union of all figures extracted by both software packages to build the ACL-Fig dataset, which contains a total of 263,952 figures. All extracted images were converted to 100 DPI using standard OpenCV libraries. The total size of the data is ∼ 25 GB before compression. Inline references were extracted using GROBID; about 78% of the figures have inline references.

4.2. Automatic Figure Annotation

The extraction outputs 151,900 tables and 112,052 figures. Only the figures were clustered using the k-means algorithm. We varied k from 2 to 20 with an increment of 1 to determine the number of clusters. The results were analyzed using the Elbow method and Silhouette Analysis. No evident elbow was observed in the Elbow-method curve. The Silhouette diagram, a plot of the number of clusters versus the silhouette score, exhibited a clear turning point at k = 15, where the score reached its global maximum. Therefore, we grouped the figures into 15 clusters.

To validate the clustering results, 100 figures randomly sampled from each cluster were visually inspected. During the inspection, we identified three new figure types: word cloud, pareto, and venn diagram. The ACL-Fig-pilot dataset was then built using all manually inspected figures. Two annotators manually labeled and inspected these clusters. Inter-annotator agreement was measured using Cohen's Kappa coefficient, which was κ = 0.78 (substantial agreement) for the ACL-Fig-pilot dataset. For completeness, we added 100 randomly selected tables. The ACL-Fig-pilot dataset therefore contains a total of 1,671 figures and tables labeled with 19 classes. The distribution of all classes is shown in Figure 4.

Figure 4: Figure class distribution in the ACL-Fig-pilot dataset (classes: trees, natural images, confusion matrix, graph, architecture diagram, screenshots, bar charts, neural networks, NLP text_grammar_eg, line graph_chart, tables, algorithms, pie chart, scatter plot, maps, boxplots, word cloud, venn diagram, pareto).

5. Supervised Scientific Figure Classification

Based on the ACL-Fig-pilot dataset, we trained supervised classifiers. The dataset was split into a training and a test set (8:2 ratio). Three baseline models were investigated.

Model 1 is a 3-layer CNN trained with a categorical cross-entropy loss function and the Adam optimizer. The model contains three typical convolutional layers, each followed by a max-pooling and a dropout layer, and three fully-connected layers. The feature-map dimensions are reduced from 32 × 32 to 16 × 16 to 8 × 8. The last fully-connected layer classifies the encoded vector into 19 classes. This classifier achieves an accuracy of 59%.

Model 2 was trained based on the VGG16 architecture, except that the last three fully-connected layers in the original network were replaced by a long short-term memory layer followed by dense layers for classification. This model achieved an accuracy of ∼ 79%, 20 percentage points higher than Model 1.

Model 3 was the Vision Transformer (ViT) [18], in which a figure is split into fixed-size patches; each patch is linearly embedded and supplemented with position embeddings, and the resulting sequence of vectors is fed to a standard Transformer encoder. The ViT model achieved the best performance, with 83% accuracy.
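As a reference point, a Model-1-style baseline can be sketched in Keras as follows. The input size (64 × 64, so that the pooled feature maps pass through 32 × 32, 16 × 16, and 8 × 8), the filter counts, dense widths, and dropout rate are assumptions made for illustration, not the exact published configuration.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 19  # figure/table classes in ACL-Fig-pilot

def build_cnn_baseline(input_shape=(64, 64, 3), dropout=0.25):
    """Three conv/max-pool/dropout stages followed by three dense layers."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),            # feature maps: 64x64 -> 32x32
        layers.Dropout(dropout),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),            # 32x32 -> 16x16
        layers.Dropout(dropout),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),            # 16x16 -> 8x8
        layers.Dropout(dropout),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Models 2 and 3 can be assembled analogously from tensorflow.keras.applications.VGG16 and an off-the-shelf ViT implementation, respectively.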
6. Conclusion

Based on the ACL Anthology papers, we designed a pipeline and used it to build a corpus of automatically labeled scientific figures with associated metadata and context information. This corpus, named ACL-Fig, consists of ≈ 250K objects, of which about 42% are figures and about 58% are tables. We also built ACL-Fig-pilot, a subset of ACL-Fig consisting of 1,671 scientific figures with 19 manually verified labels. Our dataset includes figures extracted from real-world data and contains more classes than existing datasets, e.g., DeepFigures and FigureQA.

One limitation of our pipeline is that it used VGG16 pre-trained on ImageNet. In the future, we will improve figure representation by retraining more sophisticated models, e.g., CoCa [19], on scientific figures. Another limitation was that determining the number of clusters required visual inspection. We will consider density-based methods to fully automate the clustering module.

References

[1] M. Khabsa, C. L. Giles, The number of scholarly documents on the public web, PLoS ONE 9 (2014) e93949.
[2] N. Siegel, N. Lourie, R. Power, W. Ammar, Extracting scientific figures with distantly supervised neural networks, in: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018. doi:10.1145/3197026.3197040.
[3] J. Wu, J. Killian, H. Yang, K. Williams, S. R. Choudhury, S. Tuarob, C. Caragea, C. L. Giles, PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search, in: Proceedings of the 8th International Conference on Knowledge Capture, 2015.
[4] A. G. S. de Herrera, H. Muller, S. Bromuri, Overview of the ImageCLEF 2015 medical classification task, in: CLEF, 2015.
[5] C. Clark, S. Divvala, Looking beyond text: Extracting figures, tables and captions from computer science papers, in: AAAI Workshop: Scholarly Big Data, 2015.
[6] C. Clark, S. Divvala, PDFFigures 2.0: Mining figures from research papers, in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2016, pp. 143–152.
[7] M. Savva, N. Kong, A. Chhajta, L. Fei-Fei, M. Agrawala, J. Heer, ReVision: Automated classification, analysis and redesign of chart images, in: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011.
[8] S. R. Choudhury, C. L. Giles, An architecture for information extraction from figures in digital libraries, in: Proceedings of the 24th International Conference on World Wide Web, 2015.
[9] Y. Zhou, C. Tan, Hough technique for bar charts detection and recognition in document images, in: Proceedings of the 2000 International Conference on Image Processing, vol. 2, 2000, pp. 605–608.
[10] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, FigureSeer: Parsing result-figures in research papers, in: ECCV, 2016.
[11] B. Tang, X. Liu, J. Lei, M. Song, D. Tao, S. Sun, F. Dong, DeepChart: Combining deep convolutional networks and deep belief networks in chart classification, Signal Processing 124 (2016) 156–161. doi:10.1016/j.sigpro.2015.09.027.
[12] K. V. Jobin, A. Mondal, C. V. Jawahar, DocFigure: A dataset for scientific document figure classification, in: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 2019, pp. 74–79. doi:10.1109/ICDARW.2019.00018.
[13] V. Karthikeyani, S. Nagarajan, Machine learning classification algorithms to recognize chart types in portable document format (PDF) files, International Journal of Computer Applications 39 (2012) 1–5.
[14] S. E. Kahou, V. Michalski, A. Atkinson, A. Kadar, A. Trischler, Y. Bengio, FigureQA: An annotated figure dataset for visual reasoning, 2018. arXiv:1710.07300.
[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[17] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53–65. doi:10.1016/0377-0427(87)90125-7.
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[19] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, Y. Wu, CoCa: Contrastive captioners are image-text foundation models, CoRR abs/2205.01917 (2022). arXiv:2205.01917.