Recognizing Figure Labels in Patents

Ming Gong (1,2), Xin Wei (3), Diane Oyen (2), Jian Wu (3), Martin Gryder (3), Liping Yang (4)

1 University of Dayton
2 Los Alamos National Laboratory
3 Old Dominion University
4 University of New Mexico

Abstract

Scientific documents often contain significant information in figures. The United States Patent and Trademark Office (USPTO) awards thousands of patents each week, with each patent containing on the order of a dozen figures. The information conveyed by these figures typically includes a drawing or diagram, a label, a caption, and reference text within the document. Yet associating these short bits of text with the figure is challenging when labels are embedded within the figure, as they typically are in patents. Using patents as a testbench, this paper highlights an open challenge in analyzing all of the information presented in scientific/technical documents: there is a technological gap in recognizing characters embedded in drawings, which leads to difficulties in processing the text associated with scientific figures. We demonstrate that automatically reading the figure label in patent diagram figures is an open challenge, as we evaluate several state-of-the-art optical character recognition (OCR) methods on recent patents. Because the visual characteristics of drawings/diagrams are quite similar to those of text (high contrast, width of strokes, etc.), separating the diagram from the text is challenging and leads to both (a) false detection of characters from pixels that are not text and (b) missed text that is critical for identifying the figure number. We develop a method for automatically reading patent figure labels by first identifying the bounding box containing the label using a novel non-convex hull approach, and then demonstrate the success of OCR when the text is isolated from the diagram.

Introduction

Recognition of text that is embedded in an image is a well-studied problem, which can be split into two separate problems: text detection — identifying regions of the image that contain text — and optical character recognition (OCR). OCR methods generally assume that the majority of the pixels in the image are text and map groups of pixels to characters. When the image contains non-text elements, it is best to first segment the regions containing text using text detection methods. Typically, text recognition is carried out following the basic pipeline of (1) text region detection, which may produce false-positive regions, followed by (2) OCR on each text region and (3) filtering of OCR results to discard regions in which OCR fails to produce reasonable character recognition results. By following this pipeline, OCR is more successful without distracting non-text pixels present in the image, while false-positive regions found in step (1) can be filtered out by step (3).

We evaluate open-source and commercial OCR methods on a set of patent drawings with the goal of recognizing all figure labels embedded within each patent figure, so that we can associate figure labels with specific drawings. Commercial methods significantly outperform open-source OCR, but even at a recall level of 0.95 there is room for improvement before such methods can be deployed at the scale needed to analyze a large number of patents. We find that open-source OCR works well when text is isolated from the rest of the figure, and therefore we develop a method to isolate figure label text in patent figures.

Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Label Text Detection using α-Shapes

Patent figures are generally composed of a drawing and label text, or of several drawings and multiple label texts (Figure 1). These drawings and labels are spatially located on the page such that it is easy for a person to read the figure label and associate it with the corresponding drawing, yet this remains challenging for computer vision and OCR methods. Our goal is to automatically segment patent figures into regions of drawing and text. We find that existing OCR methods often fail to recognize labels in patent drawings, primarily because patent drawings are mainly composed of lines (strokes and curves) that have visual characteristics similar to text, such as high foreground/background contrast, sharpness of edges, and ratio of foreground to background pixels.

In patent figures, the text labels occupy a reasonably compact region of the figure. Therefore, we propose to identify text regions using α-shapes. An α-shape is the smallest polygon that encloses all the points (foreground pixels) in an area, similar to a convex hull but with an α allowance for non-convexity (Edelsbrunner, Kirkpatrick, and Seidel 1983). However, α-shape calculation is computationally expensive for a large number of foreground pixels. We therefore develop a workflow that simplifies the figure to candidate regions of text, removes dashed lines through morphological erosion, and then isolates text regions with α-shapes.
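To make the α-shape step concrete, the following minimal Python sketch computes a bounding polygon for the foreground pixels of one candidate region. It uses the open-source alphashape package, which is our choice for illustration only; the paper does not name a specific α-shape implementation, and the α value shown is an arbitrary placeholder.

    import numpy as np
    import alphashape

    def label_polygon(binary_region, alpha=0.1):
        # binary_region: 2-D array in which candidate text pixels are non-zero.
        # alpha controls how much non-convexity the enclosing polygon may have.
        ys, xs = np.nonzero(binary_region)              # foreground pixel coordinates
        points = list(zip(xs.tolist(), ys.tolist()))
        return alphashape.alphashape(points, alpha)     # a shapely polygon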
Figure 1: Identifying label regions in an example patent figure: (a) input image binarized; (b) closed regions filled; (c) label candidates; (d) dashed lines removed; (e) generated α-shapes.

Patent figures are processed by first converting the image to a binary image; closed regions are then filled and filtered out by size (text contains small enclosed regions inside loops, such as the number "0", so we filter out large filled regions). However, technical drawings contain not only closed shapes but also lines, especially dashed lines in many cases. Dashed lines cause the α-shape to enclose not only the figure label but also parts of the drawing, as demonstrated in the right-most α-shape of Figure 2. Therefore, dashed lines have to be removed before generating the α-shape to ensure that the segmented label regions include only label text. Dashed-line removal is achieved by morphological erosion both horizontally and vertically with a 1-pixel-wide kernel.

Figure 2: Example of α-shapes without dashed-line removal, showing text being merged with the nearby dashed line.
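The paper specifies only that the erosion kernels are one pixel wide; the kernel length and the rule for discarding a component in the sketch below are illustrative assumptions. One plausible reading, shown here with OpenCV, is to drop candidate components that vanish under both a horizontal and a vertical line erosion (as short dash fragments do) while keeping components that contain a longer stroke in at least one direction.

    import cv2
    import numpy as np

    def remove_dashed_lines(mask, length=15):
        # mask: uint8 binary image of candidate components (foreground = 255).
        horiz = cv2.getStructuringElement(cv2.MORPH_RECT, (length, 1))  # 1-pixel-tall line
        vert = cv2.getStructuringElement(cv2.MORPH_RECT, (1, length))   # 1-pixel-wide line
        survives = cv2.bitwise_or(cv2.erode(mask, horiz), cv2.erode(mask, vert))
        n, labels = cv2.connectedComponents(mask)
        keep = np.zeros_like(mask)
        for i in range(1, n):                       # label 0 is the background
            component = labels == i
            if survives[component].any():           # some stroke survives the erosion
                keep[component] = 255
        return keep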
With large enclosed regions and dashed lines removed, the α-shapes are generated and used as a mask applied against the input image to segment label-region candidates. Each label-region candidate is then fed to OCR for text recognition. Here, we use Tesseract (Smith 2007), an efficient and portable OCR implementation that is widely used. However, Tesseract is unable to recognize rotated text in small images. Patent figures are often rotated 90 degrees counter-clockwise; therefore, a figure-label-region candidate with a height greater than its width is automatically rotated 90 degrees clockwise before being fed to OCR.
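A minimal sketch of this OCR step with the rotation heuristic, using the pytesseract binding as one possible way to call Tesseract (the paper does not state which binding is used):

    import cv2
    import pytesseract

    def ocr_candidate(region):
        # region: cropped label-region candidate as a numpy image array.
        h, w = region.shape[:2]
        if h > w:  # label likely rotated 90 degrees counter-clockwise in the drawing
            region = cv2.rotate(region, cv2.ROTATE_90_CLOCKWISE)
        return pytesseract.image_to_string(region)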
After text is recognized, a filter determines whether the given region candidate is a figure label region by checking whether the text contains "Fig" (case-insensitive). The qualifying text is preserved, and the coordinates of the corresponding figure label region, including the coordinates in the original image and, if applicable, in the rotated image, are also recorded.
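The exact rule-based pattern is not given in the paper; the sketch below shows one illustrative regular expression for pulling a figure label out of noisy OCR output:

    import re

    LABEL_RE = re.compile(r"fig\.?\s*\d+(\.\d+)?", re.IGNORECASE)

    def extract_label(ocr_text):
        # Return the first figure-label-like substring, or None if there is none.
        match = LABEL_RE.search(ocr_text)
        return match.group(0) if match else None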
As shown in Figure 1, the steps for extracting labels from patent figures are:

1. Segment drawing regions and candidate text regions (a sketch of this step follows the list).
   (a) Threshold the image to black on white using the Otsu method, Figure 1(a) (Otsu 1979; Sezgin and Sankur 2004).
   (b) Fill regions using mathematical morphology to segment regions of content, Figure 1(b) (Gonzales and Woods 2002).
   (c) Filter connected components by size to produce candidate text regions, Figure 1(c).

2. Generate α-shapes.
   (a) Remove dashed lines by morphological erosion both horizontally and vertically, Figure 1(d).
   (b) Generate α-shapes of neighboring pixels to identify candidate bounding polygons of label text, Figure 1(e).

3. Read labels.
   (a) Apply OCR to candidate text regions (Smith 2007), with regions rotated 90 degrees clockwise if the region height is greater than its width.
   (b) Filter regions based on whether the recognized text fits a rule-based pattern of figure labels.
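A minimal sketch of step 1 using OpenCV and SciPy; the hole-filling operation and the size bounds shown are illustrative assumptions, not the authors' tuned values:

    import cv2
    import numpy as np
    from scipy.ndimage import binary_fill_holes

    def candidate_text_regions(gray, min_area=20, max_area=5000):
        # (a) Otsu threshold with ink as foreground (255)
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # (b) fill closed regions so that drawings become large solid blobs
        filled = binary_fill_holes(binary > 0).astype(np.uint8) * 255
        # (c) keep only components small enough to be label text
        n, labels, stats, _ = cv2.connectedComponentsWithStats(filled)
        mask = np.zeros_like(filled)
        for i in range(1, n):                          # label 0 is the background
            if min_area <= stats[i, cv2.CC_STAT_AREA] <= max_area:
                mask[labels == i] = 255
        return mask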
The obtained text mask facilitates label extraction. In addition to improving the accuracy of OCR, the text mask allows text to be removed from drawings, which can increase the accuracy of computer-vision approaches to visual similarity comparison by reducing the impact of figure labels.

SWT Method. Existing text region detection methods include the stroke width transform (SWT) (Epshtein, Ofek, and Wexler 2010) and the deep-learning-based "efficient and accurate scene text" (EAST) detector (Zhou et al. 2017). These approaches generally look for regions of high contrast within an image and have been demonstrated to work well for text in natural images (such as reading street signs), as well as for text annotations pasted onto other images, where sharp transitions between the texture of text and non-text regions are evident. Of these, SWT is the most similar to the α-shape approach because it does not rely on machine learning and instead evaluates the width of the strokes themselves (i.e., it uses shape information).
Evaluation of Label Recognition

Ground Truth Corpus

We build the ground truth by annotating figures in US patents downloaded from patents.reedtech.com. Each patent folder includes an XML file that contains the full text marked up with XML tags and a number of TIF figure files. Patents are grouped into different types, such as DESIGN, PLANT, and UTILITY. Random sampling from all types of patents results in a heterogeneous corpus including tables, flow charts, architecture diagrams, etc. In this study, we focus on figures in the DESIGN category because they form a relatively homogeneous sample set consisting of technical drawings, such as the examples shown in Figure 2. As a pilot study, the final ground truth corpus consists of 100 figure files randomly selected from 100 USPTO DESIGN patents approved in January 2020. We convert the original TIF files to PNG files because all methods accept PNG as input, but not all are compatible with the TIF format. A figure file may contain up to four figures with associated labels. Examples are shown in Figure 4. The aspect ratio ranges from 0.122 to 3.555. The size of the PNG files ranges from 89 kB to 1.1 MB. The size of the TIF files ranges from 21.5 kB to 1.3 MB.

We created the ground truth corpus by manually inspecting each figure file. The original capitalization (e.g., "FIG." and "Fig.") and number format (e.g., "3.4") were preserved. The final corpus contains 126 figures in 100 figure files.
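The TIF-to-PNG conversion mentioned above can be done in a few lines of Python; this sketch uses Pillow, and the directory name is a placeholder:

    from pathlib import Path
    from PIL import Image

    for tif_path in Path("patent_figures").glob("*.tif"):
        with Image.open(tif_path) as img:
            img.save(tif_path.with_suffix(".png"))   # write a PNG copy next to the TIF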
Methods Compared

We compare eight methods: four open-source methods (the α-shape-based method, the SWT-based method, the EAST-based method, and Tesseract) and four commercial methods (Abbyy FineReader, Adobe Acrobat, Amazon Textract, and Google vision API). To make the recognition results depend solely on the methods and not on the input, we use PNG files as input for all methods. The output may contain irrelevant or gibberish characters in addition to figure labels, so we apply regular expressions to parse figure labels. The parsed strings are compared against labels in the ground truth. Precision is calculated as the number of correctly identified labels divided by the total number of labels identified by an OCR method (the parser can accurately find all labels if they exist). Recall is calculated as the number of correctly identified labels divided by the total number of labels in the ground truth, which is 126.
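A minimal sketch of these metrics as defined above; matching a predicted label to a ground-truth label is shown here as exact string comparison, which is an assumption:

    def precision_recall_f1(predicted, ground_truth):
        # predicted and ground_truth are lists of label strings.
        remaining = list(ground_truth)
        correct = 0
        for label in predicted:
            if label in remaining:
                remaining.remove(label)          # each ground-truth label matches once
                correct += 1
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(ground_truth) if ground_truth else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1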
For the text detection methods (SWT and EAST), which do not themselves perform OCR, we send the results from text detection to Tesseract for OCR (similar to step 3a in our α-shape method). The output of SWT is an image in which only the text pixels are present (non-text pixels are filtered out), so Tesseract receives this one image for OCR. SWT is unable to detect whether text is rotated to the correct orientation, so we provide hand-rotated images to SWT. The output of EAST is a bounding box for each text region in an image, so Tesseract receives just the region of the image within this bounding box, rotated so that the bounding box is wide rather than tall (and we concatenate the results for all bounding boxes of an image).
The results in Figure 3 indicate that the Google vision API outperforms the other methods, achieving F1 = 0.94, followed by α-shape (F1 = 0.91), Amazon Textract (F1 = 0.88), Adobe Acrobat (F1 = 0.83), and Abbyy FineReader (F1 = 0.77). Although commercial methods seem to achieve outstanding performance, the technology behind them is proprietary and the cost can be high when processing a large volume of figures in patents and other types of scholarly documents. The α-shape-based method achieves better performance than Amazon Textract and Adobe Acrobat. One common characteristic among all methods except SWT is that they achieve relatively high precision (0.96–1.00), whereas recall varies more widely (0.44–0.89).

We also compare the performance of the methods that support figures in TIF format, including Tesseract, Adobe Acrobat, and the SWT-based and α-shape-based methods. Generally, results are similar or worse when using TIF rather than PNG as input, and therefore we do not show them. This is somewhat surprising, as the conversion from TIF to PNG could lose information, but it seems that the extra detail contained in the TIF format is not helpful for OCR.

It is interesting that the Google vision API fails to extract labels containing only numbers, e.g., Figure 4(b). It also fails to extract labels of figures in which the labels are rotated by ±90 degrees (Figure 4(c)). Many of Adobe's failure cases are due to detecting an incorrect rotation of the image; Adobe's OCR appears to work only on upright text. Abbyy and Tesseract have difficulty extracting labels if they are close to the object, e.g., Figure 4(a), or if the labels are in a box surrounding the object, e.g., Figure 4(d). Most labels extracted using the Google vision API are relatively clean, with little or no noise characters. Abbyy and Amazon produce slightly more noise characters. Tesseract produces a relatively large amount of noisy characters. All of this implies the importance of segmenting a figure into text and objects if the label recognizer is built on Tesseract.

One advantage of the α-shape-based method is that it isolates text from figures, which is why it achieves outstanding performance. For example, the label in Figure 4(e) is not extracted by the Google vision API but is successfully extracted by the α-shape-based method. An error analysis of this method indicates that most errors are caused by Tesseract. For example, in Figure 4(c), Tesseract reads "Fi." instead of "FIG.3". In some cases, it is hard to determine the kernel size for removing dashed lines because of the existence of both short and long dashed lines, e.g., Figure 4(a), in which our method failed to recognize "FIG. 4" because the α-shape enclosed the dashed-line point above "FIG. 4", which confuses Tesseract. The other text detectors tend to detect the "FIG" text but not the number, and this accounts for their very low performance. The output from SWT includes quite a few areas of false-positive text detection, so the OCR output can be quite messy. The EAST detector performs quite well at detecting the "Fig" text regardless of rotation, with only a few false-positive regions, but misses the number.

Figure 3: A comparison of precision, recall, and F1 of different OCR methods on the ground truth corpus. The values shown in the chart are:

    Method      F1     Recall   Precision
    Google      0.94   0.89     1.00
    α-shape     0.91   0.87     0.96
    Amazon      0.88   0.80     0.99
    Adobe       0.83   0.71     0.99
    Abbyy       0.77   0.64     0.96
    Tesseract   0.61   0.44     0.97
    EAST        0.13   0.12     0.14
    SWT         0.12   0.10     0.16

Figure 4: Examples of challenging cases for Amazon Textract, Google vision API, and the α-shape-based methods. An orange border is added to mark figure boundaries.
Discussion

Our results indicate that accurately extracting figure labels is an open question, at least for open-source software. One challenge is to automatically rotate figures so that the label text is in the right orientation before being fed to OCR. Another challenge is isolating text from drawings. The α-shape-based method, which beats the Google vision API in certain cases, e.g., Figure 4(e), provides a promising solution.

Summary

In this study, we compared eight open-source and commercial OCR methods on figure label extraction, evaluated on a small sample of figures from US DESIGN patents. We also developed a heuristic method based on α-shapes. Although commercial methods achieve the highest precision and recall, the open-source α-shape-based method achieves comparable performance. We argue that developing a self-adaptable open-source framework for figure label detection is still an open challenge. Future work includes developing learning-based models to adapt the kernel size and α parameters to different label characteristics.

The data and source code used in this work are publicly available on Figshare (https://doi.org/10.6084/m9.figshare.13416311.v1) and GitHub (https://github.com/GoFigure-LANL/patent-label).

Acknowledgments

Research conducted by MGong and DO presented in this paper was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number LDRD20200041ER. Research conducted by XW, JW, and MGryder presented in this paper was supported by Los Alamos National Laboratory subcontract BA601958 awarded to Old Dominion University.

References

Edelsbrunner, H.; Kirkpatrick, D.; and Seidel, R. 1983. On the shape of a set of points in the plane. IEEE Transactions on Information Theory.

Epshtein, B.; Ofek, E.; and Wexler, Y. 2010. Detecting text in natural scenes with stroke width transform. In IEEE Conference on Computer Vision and Pattern Recognition.

Gonzales, R. C.; and Woods, R. E. 2002. Digital Image Processing.

Otsu, N. 1979. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics.

Sezgin, M.; and Sankur, B. 2004. Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging.

Smith, R. 2007. An overview of the Tesseract OCR engine. In IEEE International Conference on Document Analysis and Recognition.

Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; and Liang, J. 2017. EAST: An efficient and accurate scene text detector. In IEEE Conference on Computer Vision and Pattern Recognition.