=Paper=
{{Paper
|id=Vol-2831/paper11
|storemode=property
|title=Recognizing Figure Labels in Patents
|pdfUrl=https://ceur-ws.org/Vol-2831/paper11.pdf
|volume=Vol-2831
|authors=Ming Gong,Xin Wei,Diane Oyen,Jian Wu,Martin Gryder,Liping Yang
|dblpUrl=https://dblp.org/rec/conf/aaai/GongWO0GY21
}}
==Recognizing Figure Labels in Patents==
Ming Gong (1,2), Xin Wei (3), Diane Oyen (2), Jian Wu (3), Martin Gryder (3), Liping Yang (4)
(1) University of Dayton; (2) Los Alamos National Laboratory; (3) Old Dominion University; (4) University of New Mexico
Abstract

Scientific documents often contain significant information in figures. The United States Patent and Trademark Office (USPTO) awards thousands of patents each week, with each patent containing on the order of a dozen figures. The information conveyed by these figures typically includes a drawing or diagram, a label, a caption, and reference text within the document. Yet associating these short pieces of text with the figure is challenging when labels are embedded within the figure, as they typically are in patents. Using patents as a testbench, this paper highlights an open challenge in analyzing all of the information presented in scientific and technical documents: there is a technological gap in recognizing characters embedded in drawings, which leads to difficulties in processing the text associated with scientific figures. We demonstrate that automatically reading the figure label in patent diagram figures is an open challenge, as we evaluate several state-of-the-art optical character recognition (OCR) methods on recent patents. Because the visual characteristics of drawings and diagrams are quite similar to those of text (high contrast, stroke width, etc.), separating the diagram from the text is challenging and leads to both (a) false detection of characters from pixels that are not text and (b) missed text that is critical for identifying the figure number. We develop a method for automatically reading patent figure labels by first identifying the bounding box containing the label using a novel non-convex hull approach, and then demonstrate the success of OCR when the text is isolated from the diagram.

Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

Recognition of text that is embedded in an image is a well-studied problem, which can be split into two separate problems: text detection — identifying regions of the image that contain text — and optical character recognition (OCR). OCR methods generally assume that the majority of the pixels in the image are text and map groups of pixels to characters. When the image contains non-text elements, it is best to first segment the regions containing text using text detection methods. Typically, text recognition is carried out following the basic pipeline of (1) text region detection, which may produce false-positive regions, followed by (2) OCR on each text region and (3) filtering of OCR results to discard regions in which OCR fails to produce reasonable character recognition results. By following this pipeline, OCR is more successful because distracting non-text pixels are absent from the image, while false-positive regions found in step (1) can be filtered out by step (3).

We evaluate open-source and commercial OCR methods on a set of patent drawings with the goal of recognizing all figure labels embedded within each patent figure, so that we can associate figure labels with specific drawings. Commercial methods significantly outperform open-source OCR, but even at a recall level of 0.95 there is room for improvement before such methods can be deployed at the scale needed to analyze a large number of patents. We find that open-source OCR works well when text is isolated from the rest of the figure, and therefore we develop a method to isolate figure label text in patent figures.

Label Text Detection using α-Shapes

Patent figures are generally composed of a drawing and label text, or of several drawings and multiple label texts (Figure 1). These drawings and labels are spatially arranged on the page such that it is easy for a person to read a figure label and associate it with the corresponding drawing; yet this remains challenging for computer vision and OCR methods. Our goal is to automatically segment patent figures into regions of drawing and text. We find that existing OCR methods often fail to recognize labels in patent drawings, primarily because patent drawings are mainly composed of lines (strokes, curves) that have visual characteristics similar to text, such as high foreground/background contrast, sharpness of edges, and ratio of foreground to background pixels.

In patent figures, the text labels occupy a reasonably compact region of the figure. Therefore, we propose to identify text regions using α-shapes. An α-shape is the smallest polygon that encloses all the points (foreground pixels) in an area, similar to a convex hull, but with an α allowance for non-convexity (Edelsbrunner, Kirkpatrick, and Seidel 1983). However, α-shape calculation is computationally expensive for a large number of foreground pixels. We therefore develop a workflow that first simplifies the figure to candidate regions of text, then removes dashed lines through morphological erosion, and finally isolates text regions with α-shapes.
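To illustrate the α-shape idea, the following is a minimal sketch that computes an α-shape around the foreground pixels of a candidate region using the open-source alphashape package. It is an illustration of the concept rather than this paper's exact implementation; the package choice, the alpha value, the subsampling stride, and the file name are assumptions.

```python
# Minimal sketch: compute an alpha-shape around foreground pixels of a
# candidate text region. Illustrative only; alpha value, stride, and file
# name are assumptions, not the settings used in the paper.
import numpy as np
import cv2
import alphashape

# Load the candidate region as a grayscale image and binarize it
# (ink becomes white/foreground after the inverted Otsu threshold).
img = cv2.imread("candidate_region.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Collect (x, y) coordinates of foreground pixels.
ys, xs = np.nonzero(binary)
points = list(zip(xs.tolist(), ys.tolist()))

# Alpha-shape computation is costly for many points (as noted above), so
# subsample the foreground pixels; the stride is an assumption.
points = points[::10]

# Compute the alpha-shape; alpha=0.0 degenerates to the convex hull, while
# larger values allow tighter, non-convex outlines around the text.
shape = alphashape.alphashape(points, alpha=0.1)
print(shape.bounds)  # bounding box of the enclosing polygon
```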
Figure 1: Identifying label regions in an example patent figure. (a) Input image binarized; (b) closed regions filled; (c) label candidates; (d) dashed lines removed; (e) generated α-shapes.
Patent figures are processed by first converting them to a binary image; closed regions are then filled and filtered by size (text contains small enclosed regions inside loops, like the digit "0", so we filter out large filled regions). However, technical drawings contain not only closed shapes but also lines, especially dashed lines. Dashed lines cause the α-shape to enclose not only the figure label but also parts of the drawing, as demonstrated in the right-most α-shape of Figure 2. Therefore, dashed lines have to be removed before generating the α-shape to ensure that the segmented label regions include only label text. Dashed-line removal is achieved by morphological erosion, both horizontally and vertically, with a 1-pixel-wide kernel.

Figure 2: Example of α-shapes without dashed-line removal, showing text being merged with a nearby dashed line.
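A minimal sketch of these preprocessing steps with OpenCV is shown below. The Otsu thresholding, region filling, size filtering, and horizontal/vertical erosions follow the description above, but the flood-fill hole-filling idiom, the area threshold, and the erosion kernel lengths are assumptions made for illustration, not the exact settings used in this work.

```python
# Sketch of the preprocessing described above: Otsu binarization, region
# filling, size filtering, and dashed-line removal by thin erosions.
# Area threshold and kernel sizes are assumed values, not the paper's.
import cv2
import numpy as np

img = cv2.imread("patent_figure.png", cv2.IMREAD_GRAYSCALE)

# Otsu threshold; invert so that ink (strokes, text) becomes white (255).
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Fill enclosed regions: flood-fill the background from a corner, invert,
# and OR with the binary image so that holes (e.g., loops in "0") are filled.
flood = binary.copy()
h, w = binary.shape
mask = np.zeros((h + 2, w + 2), np.uint8)
cv2.floodFill(flood, mask, (0, 0), 255)
filled = binary | cv2.bitwise_not(flood)

# Keep only small filled components as label candidates; large filled blobs
# belong to the drawing rather than to text.
n, labels, stats, _ = cv2.connectedComponentsWithStats(filled)
candidates = np.zeros_like(filled)
for i in range(1, n):  # label 0 is the background
    if stats[i, cv2.CC_STAT_AREA] < 5000:  # assumed size threshold
        candidates[labels == i] = 255

# Remove dashed lines by eroding with 1-pixel-wide horizontal and vertical
# kernels: thin dash segments vanish while thicker label strokes survive.
candidates = cv2.erode(candidates, np.ones((1, 3), np.uint8))
candidates = cv2.erode(candidates, np.ones((3, 1), np.uint8))
```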
With large enclosed regions and dashed lines removed, the α-shapes are generated and applied as a mask against the input image to segment label region candidates. Each label region candidate is then fed to OCR for text recognition. Here, we use Tesseract (Smith 2007), an efficient and portable OCR implementation that is widely used. However, Tesseract is unable to recognize rotated text in small images. Patent figures are often rotated 90 degrees counter-clockwise; therefore, a figure label region candidate with a height greater than its width is automatically rotated 90 degrees clockwise before being fed to OCR. After text is recognized, a filter determines whether the given region candidate is a figure label region by checking whether the text contains "Fig" (case-insensitive). The qualifying text is then preserved, and the coordinates of the corresponding figure label region, in the original image and, if the region was rotated, in the rotated image, are also recorded.
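As a concrete illustration, the snippet below applies Tesseract (via the pytesseract wrapper) to a candidate label region, rotating tall regions 90 degrees clockwise before OCR and keeping only regions whose recognized text contains "Fig", as described above. The choice of pytesseract and Pillow and the file name are assumptions about tooling, not the authors' exact code.

```python
# Sketch: OCR a candidate label region with Tesseract, rotating tall
# regions 90 degrees clockwise and filtering on the substring "fig".
# pytesseract/Pillow are assumed tooling; the file name is a placeholder.
from PIL import Image
import pytesseract

def read_label(region_path: str):
    region = Image.open(region_path)
    # Tall regions are assumed to hold text rotated 90 degrees
    # counter-clockwise, so rotate them 90 degrees clockwise before OCR.
    if region.height > region.width:
        region = region.rotate(-90, expand=True)
    text = pytesseract.image_to_string(region).strip()
    # Keep the candidate only if it looks like a figure label.
    return text if "fig" in text.lower() else None

print(read_label("label_candidate.png"))
```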
As shown in Figure 1, the steps for extracting labels from patent figures are:

1. Segment drawing regions and candidate text regions.
(a) Threshold the image to black on white using the Otsu method, Figure 1(a) (Otsu 1979; Sezgin and Sankur 2004).
(b) Fill regions using mathematical morphology to segment regions of content, Figure 1(b) (Gonzales and Woods 2002).
(c) Filter connected components by size to produce candidate text regions, Figure 1(c).

2. Generate α-shapes.
(a) Remove dashed lines by morphological erosion, both horizontally and vertically, Figure 1(d).
(b) Generate α-shapes of neighboring pixels to identify candidate bounding polygons of label text, Figure 1(e).

3. Read labels.
(a) Apply OCR to candidate text regions (Smith 2007), with regions rotated 90 degrees clockwise if the region height is greater than its width.
(b) Filter regions based on whether the recognized text fits a rule-based pattern of figure labels.

The resulting text mask facilitates label extraction. In addition to improving the accuracy of OCR, the text mask allows text to be removed from drawings, which can increase the accuracy of computer vision approaches to visual similarity comparison by reducing the impact of figure labels.
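For example, once a binary text mask is available, the label pixels can be blanked out of the drawing before any visual-similarity comparison. The sketch below shows one way to do this with OpenCV; the file names are placeholders.

```python
# Sketch: use a binary text mask (white where label text is) to blank the
# label pixels out of a drawing before visual-similarity comparison.
# File names are placeholders.
import cv2

drawing = cv2.imread("patent_figure.png", cv2.IMREAD_GRAYSCALE)
text_mask = cv2.imread("label_mask.png", cv2.IMREAD_GRAYSCALE)

# Set masked pixels to the white background so only the drawing remains.
drawing_no_labels = drawing.copy()
drawing_no_labels[text_mask > 0] = 255
cv2.imwrite("patent_figure_no_labels.png", drawing_no_labels)
```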
SWT Method. Existing text region detection methods include the stroke width transform (SWT) (Epshtein, Ofek, and Wexler 2010) and the deep-learning-based "efficient and accurate scene text" (EAST) detector (Zhou et al. 2017). These approaches generally look for regions of high contrast within an image and have been demonstrated to work well for text contained in natural images (such as street signs), as well as for text annotations pasted onto other images, where sharp transitions between the texture of text and non-text regions are evident. Of these, SWT is the most similar to the α-shape approach because it does not rely on machine learning and instead evaluates the width of the strokes themselves (i.e., it uses shape information).

Evaluation of Label Recognition

Ground Truth Corpus

We build the ground truth by annotating figures in US patents downloaded from patents.reedtech.com. Each patent folder includes an XML file that contains the full text marked up with XML tags, along with a number of TIF figure files. Patents are grouped into different types, such as DESIGN, PLANT, and UTILITY. Random sampling from all types of patents yields a heterogeneous corpus including tables, flow charts, architecture diagrams, etc. In this study, we focus on figures in the DESIGN category because they form a relatively homogeneous sample set consisting of technical drawings, such as the examples shown in Figure 2. As a pilot study, the final ground truth corpus consists of 100 figure files randomly selected from 100 USPTO DESIGN patents approved in January 2020. We convert the original TIF files to PNG because all methods accept PNG as input, but not all are compatible with the TIF format. A figure file may contain up to four figures with associated labels. Examples are shown in Figure 4. The aspect ratio ranges from 0.122 to 3.555. The PNG files range from 89 kB to 1.1 MB in size, and the TIF files range from 21.5 kB to 1.3 MB.

We created the ground truth by manually inspecting each figure file. The original capitalization (e.g., "FIG." and "Fig.") and number format (e.g., "3.4") were preserved. The final corpus contains 126 figures in 100 figure files.
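The format conversion itself is straightforward; a minimal sketch with Pillow is shown below (file names are placeholders, and the grayscale conversion is an assumption).

```python
# Sketch: convert a patent TIF drawing to PNG with Pillow.
# File names are placeholders.
from PIL import Image

with Image.open("figure.tif") as tif:
    tif.convert("L").save("figure.png")
```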
Methods Compared

We compare eight methods: four open-source methods (the α-shape-based method, the SWT-based method, the EAST-based method, and Tesseract) and four commercial methods (Abbyy FineReader, Adobe Acrobat, Amazon Textract, and Google vision API). To make recognition results depend solely on the methods and not on the input, we use PNG files as input for all methods. The output may contain irrelevant or gibberish characters in addition to figure labels, so we apply regular expressions to parse out figure labels. The parsed strings are compared against the labels in the ground truth. Precision is calculated as the number of correctly identified labels divided by the total number of labels identified by an OCR method (the parser can accurately find all labels if they exist). Recall is calculated as the number of correctly identified labels divided by the total number of labels in the ground truth, which is 126.
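A minimal sketch of this scoring, with an assumed regular expression for pulling figure labels out of noisy OCR output, is given below; the exact pattern and label normalization used in this work may differ.

```python
# Sketch: parse figure labels out of noisy OCR output with a regular
# expression, then score precision and recall against ground-truth labels.
# The pattern and normalization below are assumptions, not the paper's.
import re

LABEL_RE = re.compile(r"fig\.?\s*\d+(\.\d+)?", re.IGNORECASE)

def parse_labels(ocr_text: str) -> set:
    # Normalize by stripping spaces and upper-casing, e.g. "Fig. 3" -> "FIG.3".
    return {m.group(0).replace(" ", "").upper() for m in LABEL_RE.finditer(ocr_text)}

def precision_recall(predicted: set, ground_truth: set):
    correct = predicted & ground_truth
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

pred = parse_labels("FIG. 1 \n some noise xx FIG.2 zz")
truth = {"FIG.1", "FIG.2", "FIG.3"}
print(precision_recall(pred, truth))  # (1.0, 0.666...)
```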
For the text detection methods (SWT and EAST), which do not themselves perform OCR, we send the results of text detection to Tesseract for OCR (similar to step 3(a) of our α-shape method). The output of SWT is an image with only the text pixels present (non-text pixels are filtered out), so Tesseract receives this one image for OCR. SWT is unable to detect whether text is rotated to the correct orientation, so we provide hand-rotated images to SWT. The output of EAST is a bounding box for each text region in an image, so Tesseract receives just the region of the image within that bounding box, rotated so that the bounding box is wide rather than tall (we concatenate the results for all bounding boxes of an image).
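A sketch of that hand-off for a single EAST box is shown below, assuming the box is given as axis-aligned pixel coordinates; it mirrors the description above rather than reproducing the exact evaluation script, and the box coordinates are placeholders.

```python
# Sketch: crop one EAST bounding box from the page image, rotate it so the
# box is wide rather than tall, and hand the crop to Tesseract.
# The box coordinates and file name are placeholders.
import cv2
import pytesseract

image = cv2.imread("patent_figure.png", cv2.IMREAD_GRAYSCALE)
x, y, w, h = 120, 340, 40, 160  # assumed EAST detection (x, y, width, height)

crop = image[y:y + h, x:x + w]
if h > w:  # tall box: rotate so the text line runs left to right
    crop = cv2.rotate(crop, cv2.ROTATE_90_CLOCKWISE)

print(pytesseract.image_to_string(crop).strip())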
The results in Figure 3 indicate that the Google vision API outperforms the other methods, achieving F1 = 0.94, followed by α-shape (F1 = 0.91), Amazon Textract (F1 = 0.88), Adobe Acrobat (F1 = 0.83), and Abbyy FineReader (F1 = 0.77). Although the commercial methods achieve outstanding performance, the technology behind them is proprietary and the cost can be high when processing a large volume of figures in patents and other types of scholarly documents. The α-shape-based method achieves better performance than Amazon Textract and Adobe Acrobat. One characteristic common to all methods except SWT is that they achieve relatively high precision (0.96–1.00), while recall varies much more widely (0.44–0.89).

Figure 3: A comparison of precision, recall, and F1 of different OCR methods on the ground truth corpus.

Method      F1    Recall  Precision
Google      0.94  0.89    1.00
α-shape     0.91  0.87    0.96
Amazon      0.88  0.80    0.99
Adobe       0.83  0.71    0.99
Abbyy       0.77  0.64    0.96
Tesseract   0.61  0.44    0.97
EAST        0.13  0.12    0.14
SWT         0.12  0.10    0.16

We also compare the performance of the methods that support figures in TIF format: Tesseract, Adobe Acrobat, the SWT-based method, and the α-shape-based method. Results are generally similar or worse when using TIF rather than PNG as input, and therefore we do not show them. This is somewhat surprising, as the conversion from TIF to PNG could lose information, but it appears that the extra detail contained in the TIF format is not helpful for OCR.

Interestingly, the Google vision API fails to extract labels that contain only numbers, e.g., Figure 4(b). It also fails to extract labels of figures in which the labels are rotated by ±90 degrees (Figure 4(c)). Many of Adobe's failure cases are due to detecting an incorrect rotation of the image; Adobe's OCR method only appears to work on upright text. Abbyy and Tesseract have difficulty extracting labels if they are close to the object, e.g., Figure 4(a), or if the labels are in a box surrounding the object, e.g., Figure 4(d). Most labels extracted using the Google vision API are relatively clean, with little or no noise characters. Abbyy and Amazon produce slightly more noise characters, and Tesseract produces a relatively large amount of noisy characters. All of this implies the importance of segmenting a figure into text and objects if the label recognizer is built on Tesseract.

One advantage of the α-shape-based method is that it isolates text from drawings, which is why it achieves outstanding performance. For example, the label in Figure 4(e) is not extracted by the Google vision API, but is successfully extracted by the α-shape-based method. An error analysis indicates that most of this method's errors are caused by Tesseract. For example, in Figure 4(c), Tesseract reads "Fi." instead of "FIG.3". In some cases it is hard to determine the kernel size for removing dashed lines because both short and long dashed lines are present, e.g., Figure 4(a), in which our method failed to recognize "FIG. 4" because the α-shape enclosed the dashed-line point above "FIG. 4", which confuses Tesseract. The other text detectors tend to detect the "FIG" text but not the number, which accounts for their very low performance. The output from SWT includes quite a few areas of false-positive text detection, so the OCR output can be quite messy. The EAST detector performs quite well at detecting the "Fig" text regardless of rotation, with only a few false-positive regions, but misses the number.

Figure 4: Examples of challenging cases for Amazon Textract, Google vision API, and the α-shape-based methods. An orange border is added to mark figure boundaries.

Discussion

Our results indicate that accurately extracting figure labels is an open question, at least for open-source software. One challenge is to automatically rotate figures so that the label text is in the correct orientation before it is fed to OCR. Another challenge is isolating text from drawings. The α-shape-based method, which beats the Google vision API in certain cases, e.g., Figure 4(e), provides a promising solution.

Summary

In this study, we compared eight open-source and commercial OCR methods on figure label extraction, evaluated on a small sample of figures from US DESIGN patents. We also developed a heuristic method based on α-shapes. Although commercial methods achieve the highest precision and recall, the open-source α-shape-based method achieves comparable performance. We argue that developing a self-adaptable open-source framework for figure label detection is still an open challenge. Future work includes developing learning-based models that adapt the kernel size and α parameters to different label characteristics. The data and source code used in this work are publicly available on Figshare (https://doi.org/10.6084/m9.figshare.13416311.v1) and GitHub (https://github.com/GoFigure-LANL/patent-label).

Acknowledgments

Research conducted by MGong and DO presented in this paper was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number LDRD20200041ER. Research conducted by XW, JW, and MGryder presented in this paper was supported by Los Alamos National Laboratory subcontract BA601958 awarded to Old Dominion University.

References

Edelsbrunner, H.; Kirkpatrick, D.; and Seidel, R. 1983. On the shape of a set of points in the plane. IEEE Transactions on Information Theory.
Epshtein, B.; Ofek, E.; and Wexler, Y. 2010. Detecting text in natural scenes with stroke width transform. In IEEE Conference on Computer Vision and Pattern Recognition.
Gonzales, R. C.; and Woods, R. E. 2002. Digital Image Processing.
Otsu, N. 1979. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics.
Sezgin, M.; and Sankur, B. 2004. Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging.
Smith, R. 2007. An Overview of the Tesseract OCR engine. In IEEE Intl Conference on Document Analysis and Recognition.
Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; and Liang, J. 2017. EAST: An Efficient and Accurate Scene Text detector. In IEEE Conference on Computer Vision and Pattern Recognition.