<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Technical Drawing Figures in US Patents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Md Reshad Ul Hoque</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Wei</string-name>
          <email>xwei001@odu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muntabir Hasan Choudhury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kehinde Ajayi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Gryder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diane Oyen</string-name>
          <email>doyen@lanl.gov</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Old Dominion University</institution>
          ,
          <addr-line>Norfolk, Virginia</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical and Computer Engineering, Old Dominion University</institution>
          ,
          <addr-line>Norfolk, Virginia</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Los Alamos National Laboratory</institution>
          ,
<addr-line>Los Alamos, New Mexico</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Segmentation, US Patents</institution>
          ,
          <addr-line>Sketched images, CNN, Point-shooting, HR-Net, U-Net, Transformer</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>called MedT, fine-tuned on a small set of training samples</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>models, including MedT and DETR. The method we pro-</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Image segmentation is the core computer vision problem of identifying objects within a scene. Segmentation is a challenging task because predicting each pixel's label requires contextual information. Most recent research deals with the segmentation of natural images rather than drawings, and there is very little research on sketched image segmentation. In this study, we introduce a heuristic method (point-shooting) and deep learning-based methods (U-Net, HR-Net, MedT, DETR) to segment technical drawings in US patent documents. Our proposed methods achieve over 90% accuracy on the US Patent dataset, with the transformer-based model performing best at 97% segmentation accuracy while remaining computationally efficient. Our source code and datasets are available at https://github.com/GoFigure-LANL/figure-segmentation.</p>
      </abstract>
      <kwd-group>
        <kwd>Segmentation</kwd>
        <kwd>US Patents</kwd>
        <kwd>Sketched images</kwd>
        <kwd>CNN</kwd>
        <kwd>Point-shooting</kwd>
        <kwd>HR-Net</kwd>
        <kwd>U-Net</kwd>
        <kwd>Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Combining information contained in text and images is
an important aspect of understanding scientific
documents. However, patents and scientific documents often
contain compound figures composed of multiple subfigures. To
associate individual subfigures with the appropriate caption
and reference text, we must first segment the full figure
into its individual subfigures. Although much research
has been done on figure understanding and extraction
for scientific documents, existing methods rely on either
(1) manually designed rules and human-crafted features or
(2) machine learning approaches, most of which were trained
on natural images [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We demonstrate that approaches developed for other
datasets cannot simply be applied to novel datasets.
      </p>
      <p>
        Image segmentation has been extensively studied with
rule-based methods, such as the watershed transform, and with
machine learning methods applied to natural images [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. In
patent drawings, there is usually white space between
individual drawings, so a simple sweeping-line method can
detect the boundaries of subfigures by counting the
maximum number of black pixels along a horizontal (or
vertical) line.
      </p>
      <p>Our transformer-based model is also computationally efficient compared
with other methods. We release the benchmark dataset
so that it can be used for future work on the task of
segmenting technical drawings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <sec id="sec-2-1">
        <title>The data for this project is obtained from the United</title>
        <p>States Patent and Trademark Ofice (USPTO). The ground
truth dataset is developed on a corpus of 500 randomly
selected figures from the design category of patent. The
dataset consists of 20 figure files with single drawings
and 480 figure files, which containing at least two
subifgures. We preprocess each figure to remove text labels.</p>
      <p>The number of subfigures in each figure file is inferred from
the number of text labels detected identifying subfigures,
so segmentation is only necessary for figures containing
multiple subfigures.</p>
      <p>We use the VGG Image Annotator (VIA) to annotate our
dataset. VIA is an open-source tool for manually
annotating images, videos, and audio. We draw
rectangular bounding boxes around the subfigures. Each
figure consists of 2–12 subfigures. We also performed an
independent human verification to ensure the bounding
boxes were drawn correctly. VIA allows exporting
annotation results including the filename, file size, region count
(i.e., the number of bounding boxes for each figure),
region id, and the coordinates of the bounding boxes.</p>
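      <p>As an illustration, the following minimal Python sketch reads bounding boxes from a VIA 2.x JSON export; the file name annotations.json and the exact export keys are assumptions based on VIA's documented rectangle format, not details from the paper.</p>
      <preformat>
import json

# Load a VIA 2.x annotation export (assumed file name).
with open("annotations.json") as f:
    via = json.load(f)

boxes = {}  # filename -> list of (x, y, w, h) bounding boxes
for entry in via.values():
    rects = []
    for region in entry["regions"]:
        shape = region["shape_attributes"]
        if shape["name"] == "rect":  # rectangles drawn around subfigures
            rects.append((shape["x"], shape["y"],
                          shape["width"], shape["height"]))
    boxes[entry["filename"]] = rects
      </preformat>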
    </sec>
    <sec id="sec-3">
      <title>3. Segmentation Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Point Shooting Method</title>
        <p>We propose a heuristic method for segmenting figures
containing technical drawings. We call it point-shooting
because it mimics the shooting of darts onto a dartboard.
The goal is to draw bounding boxes around individual
subfigures on a figure containing multiple technical
drawings. Figure 1 illustrates the procedure. After
removing the figure labels, we randomly pick a pixel
in the figure and draw an open dot of radius r. For our
experiment, we chose an empirical value r = 2. If a black
pixel in the original figure was detected inside the open
dot, the dot was retained; otherwise, the dot was removed.
We constrain the circle centers so they do not fall outside
the figure boundary. We then fill all retained circles and
draw contours. Using the contour information, we draw
rectangular bounding boxes to segment a figure.</p>
        <p>The point-shooting method is easy to implement and
successfully segments most figures containing multiple
technical drawings. However, the method does not
generalize well for certain figures in our dataset. One example
is shown in Figure 3. One can see that in Row 2, the
point-shooting method creates wrong bounding boxes
(Column 1), while U-Net, a deep learning model, produces
the correct bounding box (Column 2).</p>
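        <p>The following is a minimal OpenCV sketch of the point-shooting heuristic described in this section; the sampling density and the helper name segment_figure are our own illustrative choices, not from the paper.</p>
        <preformat>
import cv2
import numpy as np

def segment_figure(path, r=2, n_points=20000):
    """Point-shooting: sample random dots, keep those that hit ink,
    fill the kept dots, then box the resulting contours."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = gray.shape
    ink = gray < 128                      # black pixels of the drawing

    canvas = np.zeros((h, w), np.uint8)
    rng = np.random.default_rng(0)
    # constrain centers so dots stay inside the figure boundary
    xs = rng.integers(r, w - r, n_points)
    ys = rng.integers(r, h - r, n_points)
    for x, y in zip(xs, ys):
        # retain the dot only if a black pixel falls inside it
        if ink[y - r:y + r + 1, x - r:x + r + 1].any():
            cv2.circle(canvas, (int(x), int(y)), r, 255, -1)  # filled dot

    # contours of the merged dots approximate the subfigure regions
    contours, _ = cv2.findContours(canvas, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) boxes
        </preformat>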
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Deep Learning Methods</title>
        <p>
          Therefore, we consider applying deep learning-based
methods including U-Net [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], HR-Net [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and
transformer models (MedT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], DETR [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]). One challenge
is that the ground truth only contains bounding boxes,
while these models produce pixel-level masks. Therefore,
the ground truth cannot be directly used for training
these deep learning models. To overcome this challenge,
we first use the point-shooting method to generate masks
for the training figures and use them as input to train U-Net
and HR-Net, and to fine-tune MedT and DETR. It is worth
mentioning that the point-shooting method achieves an
accuracy of 92.5%. Although the output of the
point-shooting method is not 100% accurate, we hope that the
neural networks can still encode and capture the right
features and achieve better generalization for reasonably
good performance. Figure 2 shows the deep learning
segmentation pipeline. All of our deep learning-based
models except DETR [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] are semantic segmentation
models, where the models produce foreground-background
masks for the input image. DETR is a transformer-based,
end-to-end object detection model that directly predicts
the bounding boxes on the input image.
        </p>
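        <p>As a sketch of this step, the snippet below converts point-shooting bounding boxes into binary training masks; the 128 × 128 training size comes from the pipeline description below, while the function name boxes_to_mask and everything else are illustrative assumptions.</p>
        <preformat>
import numpy as np
import cv2

def boxes_to_mask(boxes, height, width, out_size=128):
    """Rasterize (x, y, w, h) boxes into a foreground-background mask,
    rescaled to the low training resolution."""
    mask = np.zeros((height, width), np.uint8)
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = 1          # subfigure region = foreground
    # nearest-neighbor interpolation keeps the mask binary after rescaling
    return cv2.resize(mask, (out_size, out_size),
                      interpolation=cv2.INTER_NEAREST)
        </preformat>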
        <p>
          Most existing SoTA deep learning-based models were
pre-trained on natural or medical images, which usually
contain rich color and/or gradient information compared
with technical drawings, which are mostly sketched
images containing black/grey-scale pixels. Therefore, these
pre-trained models usually result in poor performance
when tested on the technical drawings in our dataset. To
overcome this limitation, we fine-tuned pre-trained
models or trained them from scratch. To reduce unnecessary
computational cost, we rescale the resolution of the
input figures to 128 × 128 and use them to train the deep
learning models. The model produces a segmentation
mask with a dimension of 128 × 128 × 3, which we use to
draw contours and then bounding boxes around the contours.
After obtaining bounding boxes from the low-resolution
image, we linearly scale up the predicted bounding boxes
to fit the original figure.
        </p>
        <sec id="sec-3-2-1">
          <title>3.2.1. U-Net</title>
          <p>The architecture of U-Net consists of a contracting path
and an expanding path. The contracting path is a typical
convolutional network containing a series of
convolutional layers, each followed by a rectified linear unit
(ReLU) and a max pooling layer with stride 2 for
downsampling. At each downsampling step, the number of
feature channels is doubled. In the expanding path, each step
consists of an upsampling of the feature map followed by
an “up-convolution”, a concatenation with the cropped
feature map from the contracting path, and two convolutions,
each followed by a ReLU.</p>
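          <p>To make the architecture concrete, here is a minimal PyTorch sketch of one contracting step and one expanding step as described above; the class names and channel widths are illustrative assumptions, not the exact configuration used in the paper.</p>
          <preformat>
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class Down(nn.Module):
    """Contracting step: max pool with stride 2, then double the channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.block = nn.Sequential(nn.MaxPool2d(2),
                                   double_conv(in_ch, in_ch * 2))

    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Expanding step: up-convolution, concatenate the skip feature map,
    then two convolutions."""
    def __init__(self, in_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, 2, stride=2)
        self.conv = double_conv(in_ch, in_ch // 2)

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([skip, x], dim=1))
          </preformat>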
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. HR-Net</title>
          <p>In the contracting path of U-Net, feature maps are
down-sampled to a lower resolution using pooling and later
up-sampled in the decoder part. In this process,
high-resolution information is lost. Although skip
connections are used to copy the high-resolution information
to the expansive path, they cannot fully recover the
high-resolution information. To overcome this drawback, we
apply the HR-Net model, which retains both high- and
low-resolution information throughout the training
process. The preserved information may be useful to
reconstruct the segmentation mask. We simplified the original
HR-Net to three resolution channels, each capturing
high-, mid-, and low-resolution information,
respectively. The three channels contain five, three, and
two convolutional blocks, respectively. Each
convolutional block contains two convolutional layers followed
by a batch normalization layer and a ReLU activation
layer. The resolution gap between two channels is 2.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Transformer</title>
          <p>
            Although the CNN-based models have shown
impressive performance on segmentation tasks [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], they
cannot capture the long-range dependencies between
pixels due to inherent inductive biases [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
Transformers have significantly improved many fundamental
natural language processing tasks. The novel idea behind
this success is “self-attention” [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. This mechanism
automatically puts more weight on more important features
and can capture long-range dependencies. The
computer vision domain has borrowed this idea to improve
vision-related tasks. We consider two transformer-based
models, MedT and DETR.
          </p>
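          <p>To illustrate the mechanism, below is a minimal scaled dot-product self-attention in PyTorch; every pixel (token) attends to every other one, which is how long-range dependencies are captured. This is a generic sketch of self-attention [13], not the exact layers used by MedT or DETR.</p>
          <preformat>
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (n_tokens, dim). Each token attends to all tokens,
    so dependencies are not limited to a local neighborhood."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise relevance
    weights = scores.softmax(dim=-1)         # more weight on important features
    return weights @ v

# Usage: a 16x16 feature map flattened into 256 tokens of dimension 64.
dim = 64
x = torch.randn(256, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # (256, 64)
          </preformat>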
      </sec>
      <sec id="sec-2-2">
        <title>These mechanisms control the influence of the relative</title>
        <p>
          module is used to train a transformer on a small dataset. tain cases, the point-shooting method produced the
correct segmentation map but deep learning-based methods
positional encoding on non-local context. This architec- failed, such as the segmentation results in Row 1 and
models.
3.2.4. MedT
The core component of MedT is a gated position-sensitive
axial attention mechanism designed for small size datasets
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] . Gated control axial attention which introduces
an additional control mechanism in the self-attention
an empirical threshold of 0.7. To verify consistency, we
also perform qualitative evaluation by visually inspecting
predicted and ground truth segmentations. The manual
inspection is consistent with automatic inspection with
an agreement rate of 98%.
        </p>
        <p>In general, deep learning-based methods perform
better than point-shooting methods, such as the
segmentation results in Row 2 of Figure 3. However, in
certure contains two branches, including a global branch
that captures the dependencies between pixels and the
entire image and a local branch that captures finer
dependencies among neighbouring pixels.</p>
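          <p>The gating idea can be sketched as follows: learnable scalar gates scale the relative positional terms before they enter the attention logits, so the network can mute unreliable positional information early in training. This toy 1-D version is our own illustration of the concept and differs from MedT's actual axial-attention layers.</p>
          <preformat>
import torch
import torch.nn as nn

class ToyGatedAttention(nn.Module):
    """Self-attention over one axis with a gated relative positional bias."""
    def __init__(self, dim, length):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.rel_bias = nn.Parameter(torch.randn(length, length))  # positional term
        self.gate = nn.Parameter(torch.tensor(0.1))  # learnable gate, starts small
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (batch, length, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale  # content-based term
        logits = logits + self.gate * self.rel_bias    # gated positional term
        return logits.softmax(dim=-1) @ v
          </preformat>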
          <p>The training figures are passed through a convolution
block before passing through the global branch. The
same figure is broken down into patches, which are sent
through a similar convolution block and then sequentially
through the local branch. A re-sampler aggregates the
outputs from the local branch based on the position of
each patch and generates output feature maps. The outputs
from both branches are added together, and a 1 × 1
convolutional layer pools these output feature maps
into a segmentation mask.</p>
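          <p>A compact sketch of this two-branch fusion is given below; the class name, channel width, and patch size are illustrative assumptions based on the description above, not MedT's actual implementation.</p>
          <preformat>
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Global branch + patch-wise local branch, added and pooled by 1x1 conv."""
    def __init__(self, ch=32, patch=32):
        super().__init__()
        self.patch = patch
        self.global_branch = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                           nn.ReLU())
        self.local_branch = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                          nn.ReLU())
        self.head = nn.Conv2d(ch, 1, kernel_size=1)  # 1x1 conv -> mask

    def forward(self, x):            # x: (B, 1, H, W), H and W divisible by patch
        g = self.global_branch(x)
        B, _, H, W = x.shape
        p = self.patch
        # break the figure into patches, run each through the local branch
        patches = x.unfold(2, p, p).unfold(3, p, p)   # (B, 1, H/p, W/p, p, p)
        local = torch.zeros_like(g)
        for i in range(H // p):
            for j in range(W // p):
                # re-sampler: place each patch output back by its position
                local[:, :, i*p:(i+1)*p, j*p:(j+1)*p] = \
                    self.local_branch(patches[:, :, i, j])
        return self.head(g + local)  # add branches, pool with 1x1 conv
          </preformat>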
        </sec>
        <sec id="sec-3-2-5">
          <title>3.2.5. DETR</title>
          <p>
            DETR is an end-to-end object detection transformer model [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. The architecture is simple and does not require a
specialized layer or a custom function (such as
non-maximum suppression) for predicting the
bounding boxes. The original DETR model predicts 80 classes
of bounding boxes. We fine-tuned this model to
directly predict the bounding boxes of subfigures given a
compound figure.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <p>The segmentation task can be seen as a classification
problem, in which individual subfigures are foreground
objects, and the blank area between subfigures is the
background. Although we use a training corpus with noisy
labels, the deep learning models successfully capture
latent representations and correctly segmented individual
drawings. The evaluation results are shown in Table 1.</p>
      <sec id="sec-3-1">
        <title>Visual comparisons of segmentation results of diferent</title>
        <p>models of challeging cases are shown in Figure 3.</p>
      </sec>
      <sec id="sec-3-2">
        <title>The performance of each model is measured using the accuracy, which is calculated as the fraction of subfigures</title>
      </sec>
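      <p>For reference, a minimal implementation of the IOU computation between two (x, y, w, h) boxes follows; the function name iou is our own.</p>
      <preformat>
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) bounding boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # overlap rectangle
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0, min(ax + aw, bx + bw) - ix)
    ih = max(0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# A predicted box is counted correct if iou(pred, truth) > 0.7.
      </preformat>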
      <sec id="sec-3-3">
        <title>Row 3 in Figure 3. There are a few challenging cases, in</title>
        <p>which all methods failed (Figure 3). This occurred when a
subfigure contains relatively isolated fragments without
prominent connections, which were treated as individual
objects.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In conclusion, we compared heuristic and deep learning
methods on the task of segmenting technical drawings
in US patents. Both heuristic and deep learning-based
models achieve over 90% accuracy. Interestingly, though
we trained using data containing noisy labels, generated
using the point shooting method, the deep learning
models still captured the right features and outperformed the
point shooting method. The CNN-based model (e.g.,
HR</p>
      <sec id="sec-4-1">
        <title>Net) under-performs the transformer model by a small margin. We attribute this to the gated attention mechanism in the transformer model, which captured the longrange relations between pixels.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taschwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marques</surname>
          </string-name>
          ,
          <article-title>Automatic separation of compound figures in scientific articles</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>77</volume>
          (
          <year>2018</year>
          )
          <fpage>519</fpage>
          -
          <lpage>548</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsutsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Crandall</surname>
          </string-name>
          ,
          <article-title>A data driven approach for compound figure separation using convolutional neural networks</article-title>
          ,
          <source>in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <article-title>Deep watershed transform for instance segmentation</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>End-to-end instance segmentation with recurrent attention</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6656</fpage>
          -
          <lpage>6664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Boykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Porikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kehtarnavaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Terzopoulos</surname>
          </string-name>
          ,
          <article-title>Image segmentation using deep learning: A survey</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Subramanya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Endluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <article-title>ChartReader: Automatic parsing of bar-plots</article-title>
          ,
          <source>in: 22nd International Conference on Information Reuse and Integration for Data Science, IRI</source>
          <year>2021</year>
          ,
          Virtual
          , IEEE,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <article-title>Pdfigures 2.0: Mining figures from research papers</article-title>
          ,
          <source>in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.-s.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <article-title>Viziometrics: Analyzing visual information in the scientific literature</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          <volume>4</volume>
          (
          <year>2017</year>
          )
          <fpage>117</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>
          ,
          <source>in: International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Cheng</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Jiang</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Deng</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Mu</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Tan</surname></string-name>
          ,
          <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>
          , et al.,
          <article-title>Deep high-resolution representation learning for visual recognition</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Heigold</surname></string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Carion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Synnaeve</surname></string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zagoruyko</surname>
          </string-name>
          ,
          <article-title>End-to-end object detection with transformers</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name><given-names>Ł.</given-names> <surname>Kaiser</surname></string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>J. M. J.</given-names> <surname>Valanarasu</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Oza</surname></string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Hacihaliloglu</surname></string-name>
          ,
          <string-name><given-names>V. M.</given-names> <surname>Patel</surname></string-name>
          ,
          <article-title>Medical transformer: Gated axial-attention for medical image segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2102.10662</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>