Segmenting Technical Drawing Figures in US Patents

Md Reshad Ul Hoque¹, Xin Wei², Muntabir Hasan Choudhury², Kehinde Ajayi², Martin Gryder², Jian Wu² and Diane Oyen³

¹ Electrical and Computer Engineering, Old Dominion University, Norfolk, Virginia
² Computer Science, Old Dominion University, Norfolk, Virginia
³ Los Alamos National Laboratory, Los Alamos, New Mexico

SDU'22: The AAAI-22 Workshop on Scientific Document Understanding, March 1st, 2022, Virtual.
Contact: mhoqu001@odu.edu (M. R. U. Hoque); xwei001@odu.edu (X. Wei); mchou001@odu.edu (M. H. Choudhury); kajay001@odu.edu (K. Ajayi); j1wu@odu.edu (J. Wu); doyen@lanl.gov (D. Oyen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Image segmentation is the core computer vision problem of identifying objects within a scene. Segmentation is a challenging task because predicting each pixel's label requires contextual information. Most recent research deals with the segmentation of natural images rather than drawings, and there is very little research on sketched-image segmentation. In this study, we introduce a heuristic method (point-shooting) and deep learning-based methods (U-Net, HR-Net, MedT, DETR) to segment technical drawings in US patent documents. Our proposed methods achieve over 90% accuracy on the US patent dataset, with the transformer-based model performing best at 97% segmentation accuracy, which is promising and computationally efficient. Our source code and datasets are available at https://github.com/GoFigure-LANL/figure-segmentation.

Keywords
Segmentation, US Patents, Sketched images, CNN, Point-shooting, HR-Net, U-Net, Transformer

1. Introduction

Combining information contained in text and images is an important aspect of understanding scientific documents. However, patents and scientific documents often contain compound figures composed of subfigures, each with its own label, caption, and reference text. To associate individual subfigures with the appropriate caption and reference text, we must first segment the full figure into its individual subfigures. Although much research has been done on figure understanding and extraction for scientific documents, existing methods rely on either (1) manually designed rules and hand-crafted features, which do not generalize well to new datasets [1]; or (2) machine learning approaches, most of which were trained on natural images [2]. We demonstrate that we cannot simply apply approaches developed for other datasets to patent drawings, and we develop a novel approach while addressing questions about how to extend existing methods to novel datasets.

Image segmentation has been extensively studied with rule-based methods, such as watershed, and with machine learning methods applied to natural images [3, 4, 5]. In patent drawings, there is usually white space between individual drawings. A simple sweeping-line method, which detects subfigure boundaries by counting the maximum number of black pixels along a horizontal (or vertical) pixel array, e.g., Rane et al. [6], does not work well because (1) other components such as figure labels may be present, and (2) one figure may contain multiple disconnected parts. There are few existing papers on segmenting technical drawings in patent documents; most existing tools were developed for extracting figures from research papers. For example, Clark and Divvala [7] developed a framework that extracts figures from scientific papers in PDF format. Viziometrics [8], a figure-oriented literature mining system, works only on figures with certain patterns. These tools cannot be used for segmenting compound figures. For compound figure separation, background color, layout patterns, and the spaces and lines between subfigures have been used as important cues for rule-based methods [1]. Tsutsui et al. developed a data-driven deep learning model to segment compound figures [2]; they fine-tuned a pre-trained YOLO-2 model to segment compound images on the ImageCLEF Medical dataset.
In this paper, we report our preliminary work on automatically segmenting scientific figures appearing in patent documents, focusing on technical drawings. We propose a heuristic model and compare it with state-of-the-art convolutional neural network (CNN) based models, including U-Net and HR-Net, and transformer-based models, including MedT and DETR. The method we propose, called "point-shooting", correctly segments over 92.5% of the patent figures (compound and single). We perform a comparative study between the point-shooting method and the state-of-the-art deep learning methods on a benchmark dataset. The transformer-based model, MedT, fine-tuned on a small set of training samples, works best, with high accuracy (97%), and is also computationally efficient compared with the other methods. We release the benchmark dataset, which can be used for future work on the task of segmenting technical drawings.

Figure 1: An illustration of the point-shooting method.

2. Data

The data for this project were obtained from the United States Patent and Trademark Office (USPTO). The ground-truth dataset was developed on a corpus of 500 randomly selected figures from the design category of patents. The dataset consists of 20 figure files with single drawings and 480 figure files containing at least two subfigures. We preprocess each figure to remove text labels. The number of subfigures in each figure file is inferred from the number of text labels detected identifying subfigures, so segmentation is only necessary for figures containing multiple subfigures.

We use the VGG Image Annotator (VIA) to annotate our dataset. VIA is an open-source manual annotation tool for images, videos, and audio. We draw rectangular bounding boxes around subfigures; each figure consists of 2–12 subfigures. We also performed an independent human verification to ensure that the bounding boxes were drawn correctly. VIA allows exporting annotation results, including the filename, file size, region count (e.g., the number of bounding boxes for each figure in an image), region id, and the coordinates of the bounding boxes.

3. Segmentation Methods

3.1. Point Shooting Method

We propose a heuristic method for segmenting figures containing technical drawings. We call it point-shooting because it mimics shooting darts onto a dartboard. The goal is to draw bounding boxes around individual subfigures in a figure containing multiple technical drawings.

Figure 1 illustrates the procedure of this method. After removing the figure labels, we randomly pick a pixel in the figure and draw an open dot of radius r. For our experiment, we chose an empirical value r = 2. If a black pixel in the original figure is detected inside the open dot, the dot is retained; otherwise, the dot is removed. We constrain the circle centers so they do not fall outside the figure boundary. We then fill all retained circles and draw contours¹. Using the contour information, we draw rectangular bounding boxes to segment a figure.

¹ https://learnopencv.com/contour-detection-using-opencv-python-c/
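The following is a minimal sketch of the point-shooting heuristic as we understand it from the description above, using OpenCV and NumPy (the footnoted contour tutorial suggests OpenCV). The number of sampled points and the intensity cut-off used to decide whether a pixel is "black" are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of the point-shooting heuristic, assuming a binarized grayscale
# figure (dark strokes on a white background) loaded with OpenCV. The number of
# sampled points and the 128 intensity cut-off are illustrative assumptions.
import cv2
import numpy as np

def point_shooting(figure, n_points=100_000, r=2):
    h, w = figure.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    # Sample random dot centers, constrained to stay inside the figure boundary.
    ys = np.random.randint(r, h - r, size=n_points)
    xs = np.random.randint(r, w - r, size=n_points)
    for y, x in zip(ys, xs):
        # Retain a dot only if a black pixel of the drawing falls inside it.
        window = figure[y - r:y + r + 1, x - r:x + r + 1]
        if (window < 128).any():
            cv2.circle(mask, (int(x), int(y)), r, 255, thickness=-1)  # fill retained dot
    # Filled dots merge into blobs; each blob's outer contour gives one bounding box.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]  # list of (x, y, w, h)

# Example: boxes = point_shooting(cv2.imread("figure.png", cv2.IMREAD_GRAYSCALE))
```

Because the retained dots are drawn filled, dots that land on nearby strokes merge into connected blobs, and each blob's external contour yields one candidate bounding box.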
3.2. Deep Learning Methods

The point-shooting method is easy to implement and successfully segments most figures containing multiple technical drawings. However, it does not generalize well to certain figures in our dataset. One example is shown in Figure 3: in Row 2, the point-shooting method creates wrong bounding boxes (Column 1), while U-Net, a deep learning model, produces the correct bounding box (Column 2).

Therefore, we consider applying deep learning-based methods, including U-Net [9], HR-Net [10], and transformer models (MedT [11], DETR [12]). One challenge is that the ground truth only contains bounding boxes, while these models produce pixel-level masks, so the ground truth cannot be directly used for training them. To overcome this challenge, we first use the point-shooting method to generate masks for the training figures and use them to train U-Net and HR-Net and to fine-tune MedT and DETR. It is worth mentioning that the point-shooting method achieves an accuracy of 92.5%. Although its output is not 100% accurate, we hope that the neural networks can still encode and capture the right features and generalize well enough for reasonably good performance. Figure 2 shows the deep learning segmentation pipeline. All of our deep learning-based models except DETR [12] are semantic segmentation models that produce foreground-background masks for the input image; DETR is a transformer-based end-to-end object detection model that directly predicts bounding boxes on the input image.

Figure 2: The training (a) and testing (b) of deep learning-based segmentation models.

Most existing state-of-the-art deep learning models were pre-trained on natural or medical images, which usually contain rich color and/or gradient information compared with technical drawings, which are mostly sketched images containing black/grey-scale pixels. Therefore, these pre-trained models usually perform poorly when tested on the technical drawings in our dataset. To overcome this limitation, we fine-tuned pre-trained models or trained models from scratch. To reduce unnecessary computational cost, we rescale the input figures to 128 × 128 and use them to train the deep learning models. Each model produces a segmentation mask with a dimension of 128 × 128 × 3, which we use to draw contours and then bounding boxes around the contours. After obtaining bounding boxes from the low-resolution image, we linearly scale up the predicted bounding boxes to fit the original figure.
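As a rough illustration of this post-processing step, the sketch below converts a predicted 128 × 128 mask into bounding boxes and linearly rescales them to the original figure size. The 0.5 threshold and the function name are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of the mask-to-box post-processing, assuming the network outputs a
# 128x128 foreground probability mask for a figure whose original size is (orig_h, orig_w).
import cv2
import numpy as np

def mask_to_boxes(pred_mask, orig_h, orig_w, size=128, thresh=0.5):
    # Binarize the predicted mask and extract the contours of foreground regions.
    binary = (pred_mask > thresh).astype(np.uint8) * 255
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Linear scale factors that map low-resolution coordinates back to the original figure.
    sx, sy = orig_w / size, orig_h / size
    boxes = []
    for c in contours:
        x, y, bw, bh = cv2.boundingRect(c)
        boxes.append((int(x * sx), int(y * sy), int(bw * sx), int(bh * sy)))
    return boxes
```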
3.2.1. U-Net

The architecture of U-Net consists of a contracting path and an expanding path. The contracting path is a typical convolutional network containing a series of convolutional layers, each followed by a rectified linear unit (ReLU) and a max-pooling layer with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled. In the expanding path, each step consists of an upsampling of the feature map followed by an "up-convolution", a concatenation with the cropped feature map from the contracting path, and two convolutions, each followed by a ReLU.

3.2.2. HR-Net

In the contracting path of U-Net, feature maps are downsampled to lower resolutions using pooling and later upsampled in the decoder. In this process, high-resolution information is lost; although skip connections copy high-resolution information to the expansive path, they cannot fully recover it. To overcome this drawback, we apply the HR-Net model, which retains both high- and low-resolution information throughout the training process. The preserved information may be useful for reconstructing the segmentation mask. We simplified the original HR-Net to three resolution channels, capturing high-, mid-, and low-resolution information, respectively. The three channels contain five, three, and two convolutional blocks, respectively. Each convolutional block contains two convolutional layers followed by a batch normalization layer and a ReLU activation layer. The resolution gap between two adjacent channels is 2.

3.2.3. Transformer

Although CNN-based models have shown impressive performance on segmentation tasks [9], they cannot capture long-range dependencies between pixels due to their inherent inductive biases [11]. Transformers have significantly improved many fundamental natural language processing tasks; the novel idea behind their success is "self-attention" [13]. This mechanism automatically places more weight on more important features and can capture long-range dependencies. The computer vision community has borrowed this idea to improve vision-related tasks. We consider two transformer-based models.

3.2.4. MedT

The core component of MedT is a gated position-sensitive axial attention mechanism designed for small datasets [14]. Gated axial attention, which introduces an additional control mechanism in the self-attention module, is used to train a transformer on a small dataset. These gates control the influence of the relative positional encodings on the non-local context. The architecture contains two branches: a global branch that captures dependencies between pixels and the entire image, and a local branch that captures finer dependencies among neighbouring pixels. The training figures are passed through a convolution block before entering the global branch. The same figure is broken down into patches, which are sent through a similar convolution block before passing through the local branch sequentially. A re-sampler aggregates the outputs of the local branch based on the position of each patch and generates output feature maps. The outputs from both branches are added together, followed by a 1 × 1 convolutional layer that pools these feature maps into a segmentation mask.

3.2.5. DETR

DETR is an end-to-end object detection transformer model [12]. The architecture is simple and does not require specialized layers or custom functions (such as non-maximum suppression) for predicting bounding boxes. The original DETR model predicts bounding boxes for 80 classes; we fine-tuned this model to directly predict the bounding boxes of subfigures given a compound figure.

4. Results and Discussion

The segmentation task can be seen as a classification problem in which individual subfigures are foreground objects and the blank area between subfigures is the background. Although we use a training corpus with noisy labels, the deep learning models successfully capture latent representations and correctly segment individual drawings. The evaluation results are shown in Table 1. Visual comparisons of the segmentation results of different models on challenging cases are shown in Figure 3.

The performance of each model is measured using accuracy, calculated as the fraction of subfigures that are correctly segmented. We set aside 200 figures for evaluation. We use Intersection over Union (IOU), which compares the overlap between a predicted bounding box and the ground-truth bounding box. A segmentation is determined to be correct if the IOU is greater than an empirical threshold of 0.7. To verify consistency, we also perform a qualitative evaluation by visually inspecting the predicted and ground-truth segmentations. The manual inspection is consistent with the automatic evaluation, with an agreement rate of 98%.
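The sketch below illustrates this IOU-based accuracy computation for axis-aligned boxes in (x, y, w, h) form. The 0.7 threshold follows the paper, while counting a ground-truth subfigure as correct when any predicted box exceeds the threshold is an assumption about the exact evaluation protocol.

```python
# A minimal sketch of the IOU-based check, with boxes given as (x, y, w, h) tuples.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))   # width of the overlap
    ih = max(0, min(ay2, by2) - max(ay1, by1))   # height of the overlap
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def accuracy(predicted, ground_truth, thresh=0.7):
    # Fraction of ground-truth subfigures covered by some predicted box with IOU > thresh.
    hits = sum(1 for g in ground_truth if any(iou(p, g) > thresh for p in predicted))
    return hits / len(ground_truth)
```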
In general, the deep learning-based methods perform better than the point-shooting method, as in the segmentation results in Row 2 of Figure 3. However, in certain cases the point-shooting method produced the correct segmentation map while the deep learning-based methods failed, as in Row 1 and Row 3 of Figure 3. There are a few challenging cases in which all methods failed (Figure 3). This occurred when a subfigure contained relatively isolated fragments without prominent connections, which were treated as individual objects.

Table 1
Segmentation model evaluation. Each model was timed on segmenting 200 figures. Runtime (T) is in seconds. The point-shooting method is unsupervised; U-Net and HR-Net were trained from scratch; MedT and DETR were fine-tuned.

Models           Training   Automatic¹   Manual²   T
Point-shooting   NA         92.5%        92.5%     1035
U-Net            Scratch    90.5%        91.5%     15
HR-Net           Scratch    96.0%        96.5%     18
MedT             Transfer   97.0%        97.0%     29
DETR             Transfer   90.0%        91.0%     1396

¹ Automatic evaluation accuracy. ² Manual verification accuracy.

5. Conclusion

In conclusion, we compared heuristic and deep learning methods on the task of segmenting technical drawings in US patents. Both the heuristic and the deep learning-based models achieve over 90% accuracy. Interestingly, although we trained on data containing noisy labels generated by the point-shooting method, the deep learning models still captured the right features and outperformed the point-shooting method. The CNN-based models (e.g., HR-Net) underperform the transformer model by a small margin; we attribute this to the gated attention mechanism in the transformer model, which captures long-range relations between pixels.

Figure 3: Visualization of the segmentation performance of different models on challenging samples. Each row is a sample image. Row 1: a single subfigure with two nearby parts; Row 2: a single subfigure with an extended part; Row 3: a figure with 3 subfigures separated by relatively wide white space; Row 4: a single subfigure with a minor part connected to the main part by a band with sparse dots. Each column illustrates the segmentation results of a different model, from left to right: point-shooting, U-Net, HR-Net, MedT, and DETR. Note that all methods fail to correctly segment the last image.
References

[1] M. Taschwer, O. Marques, Automatic separation of compound figures in scientific articles, Multimedia Tools and Applications 77 (2018) 519–548.
[2] S. Tsutsui, D. J. Crandall, A data driven approach for compound figure separation using convolutional neural networks, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, IEEE, 2017, pp. 533–540.
[3] M. Bai, R. Urtasun, Deep watershed transform for instance segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[4] M. Ren, R. S. Zemel, End-to-end instance segmentation with recurrent attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6656–6664.
[5] S. Minaee, Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, D. Terzopoulos, Image segmentation using deep learning: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[6] C. Rane, S. M. Subramanya, D. S. Endluri, J. Wu, C. L. Giles, ChartReader: Automatic parsing of bar-plots, in: 22nd International Conference on Information Reuse and Integration for Data Science, IRI 2021, Virtual, IEEE, 2021.
[7] C. Clark, S. Divvala, PDFFigures 2.0: Mining figures from research papers, in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), IEEE, 2016, pp. 143–152.
[8] P.-s. Lee, J. D. West, B. Howe, Viziometrics: Analyzing visual information in the scientific literature, IEEE Transactions on Big Data 4 (2017) 117–129.
[9] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[10] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[14] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, V. M. Patel, Medical transformer: Gated axial-attention for medical image segmentation, arXiv preprint arXiv:2102.10662 (2021).