Segmenting Technical Drawing Figures in US Patents

Md Reshad Ul Hoque¹, Xin Wei², Muntabir Hasan Choudhury², Kehinde Ajayi², Martin Gryder², Jian Wu² and Diane Oyen³

¹ Electrical and Computer Engineering, Old Dominion University, Norfolk, Virginia
² Computer Science, Old Dominion University, Norfolk, Virginia
³ Los Alamos National Laboratory, Los Alamos, New Mexico

SDU'22: The AAAI-22 Workshop on Scientific Document Understanding, March 1st, 2022, Virtual.
Contact: mhoqu001@odu.edu (M. R. U. Hoque); xwei001@odu.edu (X. Wei); mchou001@odu.edu (M. H. Choudhury); kajay001@odu.edu (K. Ajayi); j1wu@odu.edu (J. Wu); doyen@lanl.gov (D. Oyen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Image segmentation is the core computer vision problem of identifying objects within a scene. Segmentation is a challenging task because predicting each pixel's label requires contextual information. Most recent research deals with the segmentation of natural images rather than drawings, and there is very little research on sketched-image segmentation. In this study, we introduce a heuristic method (point-shooting) and deep learning-based methods (U-Net, HR-Net, MedT, DETR) to segment technical drawings in US patent documents. Our proposed methods achieve over 90% accuracy on the US patent dataset, with the transformer-based model performing best at 97% segmentation accuracy, which is promising and computationally efficient. Our source code and datasets are available at https://github.com/GoFigure-LANL/figure-segmentation.

Keywords
Segmentation, US Patents, Sketched images, CNN, Point-shooting, HR-Net, U-Net, Transformer

1. Introduction

Combining information contained in text and images is an important aspect of understanding scientific documents. However, patents and scientific documents often contain compound figures composed of subfigures, each with its own label, caption, and reference text. To associate individual subfigures with the appropriate caption and reference text, we must first segment the full figure into its individual subfigures. Although much research has been done on figure understanding and extraction for scientific documents, existing methods rely on either (1) manually designed rules and hand-crafted features, which do not generalize well to new datasets [1]; or (2) machine learning approaches, most of which were trained on natural images [2]. We demonstrate that we cannot simply apply approaches developed for other datasets to patent drawings, and we develop a novel approach while addressing questions about how to extend existing methods to novel datasets.

Image segmentation has been extensively studied with rule-based methods, such as watershed, and with machine learning methods applied to natural images [3, 4, 5]. In patent drawings, there is usually white space between individual drawings. A simple sweeping-line method, which detects subfigure boundaries by counting the maximum number of black pixels along a horizontal (or vertical) pixel array, e.g., Rane et al. [6], does not work well because (1) other components such as figure labels may be present, and (2) one figure may contain multiple disconnected parts. There are few existing papers on segmenting technical drawings in patent documents; most existing tools were developed for extracting figures from research papers. For example, Clark and Divvala [7] developed a framework that extracts figures from scientific papers in PDF format. Viziometrics [8], a figure-oriented literature mining system, works only on figures with certain patterns. These tools cannot be used for segmenting compound figures. For compound figure separation, background color, layout patterns, and the spaces and lines between subfigures have been used as important cues for rule-based methods [1]. Tsutsui et al. developed a data-driven deep learning model to segment compound figures [2]; they fine-tuned a pre-trained YOLO-2 model to segment compound images on the ImageCLEF Medical dataset.
In this paper, we report our preliminary work on automatically segmenting scientific figures appearing in patent documents, focusing on technical drawings. We propose a heuristic model and compare it with state-of-the-art convolutional neural network (CNN) based models, including U-Net and HR-Net, and transformer-based models, including MedT and DETR. The method we propose, called "point-shooting", correctly segments over 92.5% of the patent figures (compound and single). We perform a comparative study between the point-shooting method and the state-of-the-art deep learning methods on a benchmark dataset. The transformer-based model, MedT, fine-tuned on a small set of training samples, works best, with high accuracy (97%), and is also computationally efficient compared with the other methods. We release the benchmark dataset, which can be used for future work on the task of segmenting technical drawings.

Figure 1: An illustration of the point-shooting method.

2. Data

The data for this project were obtained from the United States Patent and Trademark Office (USPTO). The ground-truth dataset was developed on a corpus of 500 randomly selected figures from the design category of patents. The dataset consists of 20 figure files with single drawings and 480 figure files containing at least two subfigures. We preprocess each figure to remove text labels. The number of subfigures in each figure file is inferred from the number of text labels detected identifying subfigures, so segmentation is only necessary for figures containing multiple subfigures.

We use the VGG Image Annotator (VIA) to annotate our dataset. VIA is an open-source manual annotation tool for images, videos, and audio. We draw rectangular bounding boxes around subfigures; each figure consists of 2–12 subfigures. We also performed an independent human verification to ensure that the bounding boxes were drawn correctly. VIA allows exporting annotation results, including the filename, file size, region count (e.g., the number of bounding boxes for each figure in an image), region id, and the coordinates of the bounding boxes.

3. Segmentation Methods

3.1. Point Shooting Method

We propose a heuristic method for segmenting figures containing technical drawings. We call it point-shooting because it mimics shooting darts onto a dartboard. The goal is to draw bounding boxes around individual subfigures in a figure containing multiple technical drawings.

Figure 1 illustrates the procedure of this method. After removing the figure labels, we randomly pick a pixel in the figure and draw an open dot of radius r. For our experiment, we chose an empirical value r = 2. If a black pixel in the original figure is detected inside the open dot, the dot is retained; otherwise, the dot is removed. We constrain the circle centers so they do not fall outside the figure boundary. We then fill all retained circles and draw contours¹. Using the contour information, we draw rectangular bounding boxes to segment a figure.

¹ https://learnopencv.com/contour-detection-using-opencv-python-c/
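The following is a minimal sketch of the point-shooting heuristic as we understand it from the description above, using OpenCV and NumPy (the footnoted contour tutorial suggests OpenCV). The number of sampled points and the intensity cut-off used to decide whether a pixel is "black" are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of the point-shooting heuristic, assuming a binarized grayscale
# figure (dark strokes on a white background) loaded with OpenCV. The number of
# sampled points and the 128 intensity cut-off are illustrative assumptions.
import cv2
import numpy as np

def point_shooting(figure, n_points=100_000, r=2):
    h, w = figure.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    # Sample random dot centers, constrained to stay inside the figure boundary.
    ys = np.random.randint(r, h - r, size=n_points)
    xs = np.random.randint(r, w - r, size=n_points)
    for y, x in zip(ys, xs):
        # Retain a dot only if a black pixel of the drawing falls inside it.
        window = figure[y - r:y + r + 1, x - r:x + r + 1]
        if (window < 128).any():
            cv2.circle(mask, (int(x), int(y)), r, 255, thickness=-1)  # fill retained dot
    # Filled dots merge into blobs; each blob's outer contour gives one bounding box.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]  # list of (x, y, w, h)

# Example: boxes = point_shooting(cv2.imread("figure.png", cv2.IMREAD_GRAYSCALE))
```

Because the retained dots are drawn filled, dots that land on nearby strokes merge into connected blobs, and each blob's external contour yields one candidate bounding box.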
3.2. Deep Learning Methods

The point-shooting method is easy to implement and successfully segments most figures containing multiple technical drawings. However, it does not generalize well to certain figures in our dataset. One example is shown in Figure 3: in Row 2, the point-shooting method creates wrong bounding boxes (Column 1), while U-Net, a deep learning model, produces the correct bounding box (Column 2).

Therefore, we consider applying deep learning-based methods, including U-Net [9], HR-Net [10], and transformer models (MedT [11], DETR [12]). One challenge is that the ground truth only contains bounding boxes, while these models produce pixel-level masks, so the ground truth cannot be directly used for training them. To overcome this challenge, we first use the point-shooting method to generate masks for the training figures and use them to train U-Net and HR-Net and to fine-tune MedT and DETR. It is worth mentioning that the point-shooting method achieves an accuracy of 92.5%. Although its output is not 100% accurate, we hope that the neural networks can still encode and capture the right features and generalize well enough for reasonably good performance. Figure 2 shows the deep learning segmentation pipeline. All of our deep learning-based models except DETR [12] are semantic segmentation models that produce foreground-background masks for the input image; DETR is a transformer-based end-to-end object detection model that directly predicts bounding boxes on the input image.

Figure 2: The training (a) and testing (b) of deep learning-based segmentation models.

Most existing state-of-the-art deep learning models were pre-trained on natural or medical images, which usually contain rich color and/or gradient information compared with technical drawings, which are mostly sketched images containing black/grey-scale pixels. Therefore, these pre-trained models usually perform poorly when tested on the technical drawings in our dataset. To overcome this limitation, we fine-tuned pre-trained models or trained models from scratch. To reduce unnecessary computational cost, we rescale the input figures to 128 × 128 and use them to train the deep learning models. Each model produces a segmentation mask with a dimension of 128 × 128 × 3, which we use to draw contours and then bounding boxes around the contours. After obtaining bounding boxes from the low-resolution image, we linearly scale up the predicted bounding boxes to fit the original figure.
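As a rough illustration of this post-processing step, the sketch below converts a predicted 128 × 128 mask into bounding boxes and linearly rescales them to the original figure size. The 0.5 threshold and the function name are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of the mask-to-box post-processing, assuming the network outputs a
# 128x128 foreground probability mask for a figure whose original size is (orig_h, orig_w).
import cv2
import numpy as np

def mask_to_boxes(pred_mask, orig_h, orig_w, size=128, thresh=0.5):
    # Binarize the predicted mask and extract the contours of foreground regions.
    binary = (pred_mask > thresh).astype(np.uint8) * 255
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Linear scale factors that map low-resolution coordinates back to the original figure.
    sx, sy = orig_w / size, orig_h / size
    boxes = []
    for c in contours:
        x, y, bw, bh = cv2.boundingRect(c)
        boxes.append((int(x * sx), int(y * sy), int(bw * sx), int(bh * sy)))
    return boxes
```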
3.2.1. U-Net

The architecture of U-Net consists of a contracting path and an expanding path. The contracting path is a typical convolutional network containing a series of convolutional layers, each followed by a rectified linear unit (ReLU) and a max-pooling layer with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled. In the expanding path, each step consists of an upsampling of the feature map followed by an "up-convolution", a concatenation with the cropped feature map from the contracting path, and two convolutions, each followed by a ReLU.

3.2.2. HR-Net

In the contracting path of U-Net, feature maps are downsampled to lower resolutions using pooling and later upsampled in the decoder. In this process, high-resolution information is lost; although skip connections copy high-resolution information to the expansive path, they cannot fully recover it. To overcome this drawback, we apply the HR-Net model, which retains both high- and low-resolution information throughout the training process. The preserved information may be useful for reconstructing the segmentation mask. We simplified the original HR-Net to three resolution channels, capturing high-, mid-, and low-resolution information, respectively. The three channels contain five, three, and two convolutional blocks, respectively. Each convolutional block contains two convolutional layers followed by a batch normalization layer and a ReLU activation layer. The resolution gap between two adjacent channels is 2.

3.2.3. Transformer

Although CNN-based models have shown impressive performance on segmentation tasks [9], they cannot capture long-range dependencies between pixels due to their inherent inductive biases [11]. Transformers have significantly improved many fundamental natural language processing tasks; the novel idea behind their success is "self-attention" [13]. This mechanism automatically places more weight on more important features and can capture long-range dependencies. The computer vision community has borrowed this idea to improve vision-related tasks. We consider two transformer-based models.

3.2.4. MedT

The core component of MedT is a gated position-sensitive axial attention mechanism designed for small datasets [14]. Gated axial attention, which introduces an additional control mechanism in the self-attention module, is used to train a transformer on a small dataset. These gates control the influence of the relative positional encodings on the non-local context. The architecture contains two branches: a global branch that captures dependencies between pixels and the entire image, and a local branch that captures finer dependencies among neighbouring pixels. The training figures are passed through a convolution block before entering the global branch. The same figure is broken down into patches, which are sent through a similar convolution block before passing through the local branch sequentially. A re-sampler aggregates the outputs of the local branch based on the position of each patch and generates output feature maps. The outputs from both branches are added together, followed by a 1 × 1 convolutional layer that pools these feature maps into a segmentation mask.

3.2.5. DETR

DETR is an end-to-end object detection transformer model [12]. The architecture is simple and does not require specialized layers or custom functions (such as non-maximum suppression) for predicting bounding boxes. The original DETR model predicts bounding boxes for 80 classes; we fine-tuned this model to directly predict the bounding boxes of subfigures given a compound figure.

4. Results and Discussion

The segmentation task can be seen as a classification problem in which individual subfigures are foreground objects and the blank area between subfigures is the background. Although we use a training corpus with noisy labels, the deep learning models successfully capture latent representations and correctly segment individual drawings. The evaluation results are shown in Table 1. Visual comparisons of the segmentation results of different models on challenging cases are shown in Figure 3.

The performance of each model is measured using accuracy, calculated as the fraction of subfigures that are correctly segmented. We set aside 200 figures for evaluation. We use Intersection over Union (IOU), which compares the overlap between a predicted bounding box and the ground-truth bounding box. A segmentation is determined to be correct if the IOU is greater than an empirical threshold of 0.7. To verify consistency, we also perform a qualitative evaluation by visually inspecting the predicted and ground-truth segmentations. The manual inspection is consistent with the automatic evaluation, with an agreement rate of 98%.
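The sketch below illustrates this IOU-based accuracy computation for axis-aligned boxes in (x, y, w, h) form. The 0.7 threshold follows the paper, while counting a ground-truth subfigure as correct when any predicted box exceeds the threshold is an assumption about the exact evaluation protocol.

```python
# A minimal sketch of the IOU-based check, with boxes given as (x, y, w, h) tuples.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))   # width of the overlap
    ih = max(0, min(ay2, by2) - max(ay1, by1))   # height of the overlap
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def accuracy(predicted, ground_truth, thresh=0.7):
    # Fraction of ground-truth subfigures covered by some predicted box with IOU > thresh.
    hits = sum(1 for g in ground_truth if any(iou(p, g) > thresh for p in predicted))
    return hits / len(ground_truth)
```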
In general, the deep learning-based methods perform better than the point-shooting method, as in the segmentation results in Row 2 of Figure 3. However, in certain cases the point-shooting method produced the correct segmentation map while the deep learning-based methods failed, as in Row 1 and Row 3 of Figure 3. There are a few challenging cases in which all methods failed (Figure 3). This occurred when a subfigure contained relatively isolated fragments without prominent connections, which were treated as individual objects.

Table 1
Segmentation model evaluation. Each model was timed on segmenting 200 figures. Runtime (T) is in seconds. The point-shooting method is unsupervised; U-Net and HR-Net were trained from scratch; MedT and DETR were fine-tuned.

Models           Training   Automatic¹   Manual²   T
Point-shooting   NA         92.5%        92.5%     1035
U-Net            Scratch    90.5%        91.5%     15
HR-Net           Scratch    96.0%        96.5%     18
MedT             Transfer   97.0%        97.0%     29
DETR             Transfer   90.0%        91.0%     1396

¹ Automatic evaluation accuracy. ² Manual verification accuracy.

5. Conclusion

In conclusion, we compared heuristic and deep learning methods on the task of segmenting technical drawings in US patents. Both the heuristic and the deep learning-based models achieve over 90% accuracy. Interestingly, although we trained on data containing noisy labels generated by the point-shooting method, the deep learning models still captured the right features and outperformed the point-shooting method. The CNN-based models (e.g., HR-Net) underperform the transformer model by a small margin; we attribute this to the gated attention mechanism in the transformer model, which captures long-range relations between pixels.

Figure 3: Visualization of the segmentation performance of different models on challenging samples. Each row is a sample image. Row 1: a single subfigure with two nearby parts; Row 2: a single subfigure with an extended part; Row 3: a figure with 3 subfigures separated by relatively wide white space; Row 4: a single subfigure with a minor part connected to the main part by a band with sparse dots. Each column illustrates the segmentation results of a different model, from left to right: point-shooting, U-Net, HR-Net, MedT, and DETR. Note that all methods fail to correctly segment the last image.
References

[1] M. Taschwer, O. Marques, Automatic separation of compound figures in scientific articles, Multimedia Tools and Applications 77 (2018) 519–548.
[2] S. Tsutsui, D. J. Crandall, A data driven approach for compound figure separation using convolutional neural networks, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, IEEE, 2017, pp. 533–540.
[3] M. Bai, R. Urtasun, Deep watershed transform for instance segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[4] M. Ren, R. S. Zemel, End-to-end instance segmentation with recurrent attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6656–6664.
[5] S. Minaee, Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, D. Terzopoulos, Image segmentation using deep learning: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[6] C. Rane, S. M. Subramanya, D. S. Endluri, J. Wu, C. L. Giles, ChartReader: Automatic parsing of bar-plots, in: 22nd International Conference on Information Reuse and Integration for Data Science, IRI 2021, Virtual, IEEE, 2021.
[7] C. Clark, S. Divvala, PDFFigures 2.0: Mining figures from research papers, in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), IEEE, 2016, pp. 143–152.
[8] P.-s. Lee, J. D. West, B. Howe, Viziometrics: Analyzing visual information in the scientific literature, IEEE Transactions on Big Data 4 (2017) 117–129.
[9] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[10] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[14] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, V. M. Patel, Medical transformer: Gated axial-attention for medical image segmentation, arXiv preprint arXiv:2102.10662 (2021).