<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing Figure Labels in Patents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ming Gong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Wei</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diane Oyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Gryder</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liping Yang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Los Alamos National Laboratory</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Old Dominion University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Dayton</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of New Mexico</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Scientific documents often contain significant information in figures. The United States Patent and Trademark Office (USPTO) awards thousands of patents each week, with each patent containing on the order of a dozen figures. The information conveyed by these figures typically includes a drawing or diagram, a label, a caption, and reference text within the document. Yet associating the short bits of text with the figure is challenging when labels are embedded within the figure, as they typically are in patents. Using patents as a testbench, this paper highlights an open challenge in analyzing all of the information presented in scientific/technical documents: namely, there is a technological gap in recognizing characters embedded in drawings, which leads to difficulties in processing the text associated with scientific figures. We demonstrate that automatically reading the figure label in patent diagram figures is an open challenge, as we evaluate several state-of-the-art optical character recognition (OCR) methods on recent patents. Because the visual characteristics of drawings/diagrams are quite similar to those of text (high contrast, width of strokes, etc.), separating the diagram from the text is challenging and leads to both (a) false detection of characters from pixels that are not text and (b) missed text that is critical for identifying the figure number. We develop a method for automatically reading the patent figure labels by first identifying the bounding box containing the label using a novel non-convex hull approach, and then demonstrate the success of OCR when the text is isolated from the diagram.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recognition of text that is embedded in an image is a
well-studied problem, which can be split into two separate
problems: text detection (identifying regions of the image that
contain text) and optical character recognition (OCR).
OCR methods generally assume that the majority of the pixels
in the image are text and map groups of pixels to
characters. When the image contains non-text elements, it is best
to first segment the regions containing text using text
detection methods. Typically, text recognition is carried out
following the basic pipeline of (1) text region detection, which
may produce false-positive regions, followed by (2) OCR on
each text region and (3) filtering of OCR results to discard
regions in which OCR fails to produce reasonable character
recognition results. By following this pipeline, OCR is more
successful because distracting non-text pixels are no longer present
in the image, while false-positive regions found in step (1)
can be filtered out by step (3).</p>
      <p>We evaluate open-source and commercial OCR methods
on a set of patent drawings with the goal of recognizing all
figure labels embedded within each patent figure, so that we
can associate figure labels with specific drawings.
Commercial methods significantly outperform open-source OCR, but
even at a recall level of 0.95 there is room for improvement
before such methods can be deployed at the scale needed to
analyze a large number of patents. We find that open-source
OCR works well when text is isolated from the rest of the
figure, and therefore we develop a method to isolate figure
label text in patent figures.</p>
    </sec>
    <sec id="sec-2">
      <title>Label Text Detection using α-Shapes</title>
      <p>Patent figures are generally composed of a drawing and
label text; or several drawings and multiple label texts
(Figure 1). These drawings and labels are spatially located on
the page such that it is easy for a person to read the figure
label and associate it with the corresponding drawing; yet this
remains challenging for computer vision and OCR methods.
Our goal is to automatically segment patent figures into
regions of drawing and text. We find that existing OCR
methods often fail to recognize labels in patent drawings,
primarily because patent drawings are mainly composed of lines
(e.g., strokes and curves), which have visual
characteristics similar to those of text, such as high foreground/background contrast,
sharpness of edges, and ratio of foreground/background
pixels.</p>
      <p>
        In patent figures, the text labels occupy a reasonably
compact region of the figure. Therefore, we propose to identify
text regions using α-shapes. An α-shape is the smallest
polygon that encloses all the points (foreground pixels) in an
area, similar to a convex hull, but with an allowance for
non-convexity
        <xref ref-type="bibr" rid="ref1">(Edelsbrunner, Kirkpatrick, and Seidel 1983)</xref>.
However, α-shape calculation is computationally expensive
for a large number of foreground pixels. We therefore
develop a workflow that simplifies the figure to candidate
regions of text, then removes dashed lines through
morphological erosion and then isolates text regions with α-shapes.
      </p>
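      <p>To make the idea of an α-shape concrete, the following is a minimal brute-force sketch, not the implementation used in this work. It assumes the classical disc-based definition: a pair of points forms a boundary edge if some disc of a given radius passes through both points and contains no other point. The function name alpha_shape_edges and the parameter radius (standing in for α, whose exact convention varies between libraries) are illustrative assumptions. The cubic cost of this version also illustrates why the calculation becomes expensive for many foreground pixels.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from itertools import combinations


def alpha_shape_edges(points, radius):
    """Return boundary edges (index pairs) of the alpha-shape of 2-D points.

    Brute-force version of the disc-based definition: an edge (p, q) is on
    the boundary if some disc of the given radius has p and q on its circle
    and no other point strictly inside. Cost is O(n^3), so this is only
    meant for small point sets.
    """
    eps = 1e-9
    edges = []
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        dx, dy = x2 - x1, y2 - y1
        d = math.hypot(dx, dy)
        if d == 0 or d > 2 * radius:
            continue  # coincident points, or too far apart to share a disc
        mx, my = (x1 + x2) / 2, (y1 + y2) / 2
        # Distance from the edge midpoint to either candidate disc centre.
        h = math.sqrt(max(radius * radius - (d / 2) ** 2, 0.0))
        ux, uy = -dy / d, dx / d  # unit vector perpendicular to the edge
        for sign in (1, -1):
            cx, cy = mx + sign * h * ux, my + sign * h * uy
            empty = all(
                math.hypot(px - cx, py - cy) >= radius - eps
                for k, (px, py) in enumerate(points)
                if k not in (i, j)
            )
            if empty:
                edges.append((i, j))
                break
    return edges


# Two compact clusters, e.g. a label and part of a drawing, 9 units apart.
points = [(0, 0), (1, 0), (1, 1), (0, 1), (10, 0), (11, 0), (11, 1), (10, 1)]
tight = alpha_shape_edges(points, radius=1.0)
loose = alpha_shape_edges(points, radius=100.0)
```
]]></preformat>
      <p>With the small radius, each cluster receives its own outline and no edge joins the two, which is the behaviour that lets a compact label be separated from nearby content. With the large radius the outline approaches a convex hull and spans both clusters.</p>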
      <p>[Figure 1 panels: (a) Input image binarized; (b) Closed regions filling; (c) Label candidates; (d) Dashed lines removal; (e) Generated α-shapes.]</p>
      <p>Patent figures are processed first by converting to a binary
image and then closed regions are filled and filtered out by
size (text contains small enclosed regions inside loops like
the number “0”, so we filter out large filled regions).
However, technical drawings contain not only closed shapes, but
also lines, especially dashed lines in many cases. Dashed
lines will cause the α-shape to enclose not only the figure
label, but also parts of the drawing, as demonstrated in the
right-most α-shape of Figure 2. Therefore, dashed lines have
to be removed before generating the α-shape to ensure the
segmented label regions only include label text. Dashed-line
removal is achieved by morphological erosion both
horizontally and vertically with a 1 pixel-wide kernel.</p>
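      <p>The dashed-line removal step can be illustrated with a small pure-Python sketch of binary erosion using a line-shaped kernel that is one pixel wide. The kernel length k, the function names, and the toy image are assumptions made for illustration; the text does not state the kernel length, and a real pipeline would normally use an image-processing library.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def erode(image, k, horizontal):
    """Binary erosion with a line kernel of odd length k (1 pixel wide).

    A pixel stays foreground (1) only if all k pixels centred on it along
    the chosen direction are foreground; pixels outside the image count
    as background.
    """
    rows, cols = len(image), len(image[0])
    half = k // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            keep = True
            for off in range(-half, half + 1):
                rr, cc = (r, c + off) if horizontal else (r + off, c)
                inside = 0 <= rr < rows and 0 <= cc < cols
                if not inside or image[rr][cc] == 0:
                    keep = False
                    break
            out[r][c] = 1 if keep else 0
    return out


def remove_thin_lines(image, k=3):
    """Erode horizontally, then vertically, with 1-pixel-wide kernels."""
    return erode(erode(image, k, horizontal=True), k, horizontal=False)


# Toy image: row 0 is a dashed line (dashes 2 pixels long, 1 pixel thick);
# rows 2 to 5 hold a 4x4 block standing in for a thick text stroke.
toy = [
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_thin_lines(toy, k=3)
```
]]></preformat>
      <p>In this toy case the dashes disappear because they are shorter than the kernel, while the core of the thick block survives. Thicker strokes are also thinned by erosion, so the kernel length has to be chosen relative to the dash length and stroke width.</p>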
      <p>
        With large enclosed regions and dashed lines removed,
the α-shapes are generated and applied as a mask
against the input image to segment label region
candidates. Each label region candidate is then fed to OCR
for text recognition. Here, we use Tesseract
        <xref ref-type="bibr" rid="ref6">(Smith 2007)</xref>,
which is an efficient and portable OCR implementation that
is widely used. However, Tesseract is unable to recognize
rotated text in small images. Patent figures are often rotated 90
degrees counter-clockwise; therefore, a figure label region
candidate with a height greater than its width is
automatically rotated 90 degrees clockwise before being fed to OCR.
      </p>
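      <p>The orientation rule described above can be sketched in a few lines. The representation of a region as a row-major list of pixel rows and the function names are assumptions for illustration only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def rotate_cw(image):
    """Rotate a row-major 2-D image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def upright_candidate(image):
    """Rotate a region clockwise when it is taller than it is wide."""
    height, width = len(image), len(image[0])
    return rotate_cw(image) if height > width else image


tall = [[1, 2],
        [3, 4],
        [5, 6]]          # 3 rows x 2 columns: taller than wide
wide = [[1, 2, 3]]       # already wider than tall, left unchanged

rotated = upright_candidate(tall)
unchanged = upright_candidate(wide)
```
]]></preformat>
      <p>Note that this rule assumes a tall region is rotated counter-clockwise text; a label rotated the other way would end up upside down and would need a separate check.</p>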
      <p>After text is recognized, a filter determines whether the given
region candidate is a figure label region by checking
whether the text contains “Fig” (case-insensitive). The
qualifying text is then preserved, and the
coordinates of the corresponding figure label region, in both the
original image and the rotated image (if
any), are also recorded.</p>
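      <p>A minimal sketch of this filter follows. The candidate format, a pair of recognized text and a bounding box tuple, is a hypothetical stand-in for whatever structure holds the OCR output and region coordinates.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Case-insensitive test for the substring "Fig", as described in the text.
FIG_PATTERN = re.compile(r"fig", re.IGNORECASE)


def is_figure_label(text):
    """True when the recognized text contains 'Fig' in any capitalization."""
    return FIG_PATTERN.search(text) is not None


def keep_label_regions(candidates):
    """Keep (text, bbox) candidates whose text looks like a figure label."""
    return [(text, box) for text, box in candidates if is_figure_label(text)]


# Hypothetical OCR results: (recognized text, (left, top, right, bottom)).
candidates = [
    ("FIG. 3", (10, 20, 80, 40)),
    ("12a", (0, 0, 5, 5)),
    ("Fig.2", (100, 20, 160, 40)),
]
labels = keep_label_regions(candidates)
```
]]></preformat>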
      <p>As shown in Figure 1, the steps for extracting labels from
patent figures are:</p>
      <list list-type="order">
        <list-item>
          <p>Segment drawing regions and candidate text regions.</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>Threshold image to black on white using Otsu method, Figure 1(a)
                <xref ref-type="bibr" rid="ref4 ref5">(Otsu 1979; Sezgin and Sankur 2004)</xref>.</p>
            </list-item>
            <list-item>
              <p>Fill regions using mathematical morphology to segment regions of content, Figure 1(b)
                <xref ref-type="bibr" rid="ref3">(Gonzales and Woods 2002)</xref>.</p>
            </list-item>
            <list-item>
              <p>Filter connected components by size to produce candidate text regions, Figure 1(c).</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Generate α-shape.</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>Remove dashed lines by morphological erosion both horizontally and vertically, Figure 1(d).</p>
            </list-item>
            <list-item>
              <p>Generate α-shapes of neighboring pixels to identify candidate bounding polygons of label text, Figure 1(e).</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Read label.</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>Apply OCR to candidate text regions
                <xref ref-type="bibr" rid="ref6">(Smith 2007)</xref>
                (with regions rotated 90-degrees clockwise if the region height is greater than its width).</p>
            </list-item>
            <list-item>
              <p>Filter regions based on whether the recognized text fits a rule-based pattern of figure labels.</p>
            </list-item>
          </list>
        </list-item>
      </list>
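      <p>As an illustration of the thresholding in step 1(a), the sketch below computes an Otsu threshold in pure Python by maximizing the between-class variance of the gray-level histogram. The function names, the 256-level assumption, and the convention of mapping dark ink to 1 are illustrative choices rather than details taken from this work.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    pixels: iterable of integer gray levels in [0, levels). Pixels with
    value <= t form one class and pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the low class so far
    sum0 = 0  # sum of gray levels in the low class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image, threshold):
    """Map dark pixels (<= threshold) to 1 (ink) and light pixels to 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Toy grayscale image: dark ink near 10, light paper near 240.
gray = [[10, 240, 242],
        [241, 12, 239],
        [243, 11, 240]]
t = otsu_threshold([p for row in gray for p in row])
binary = binarize(gray, t)
```
]]></preformat>
      <p>The resulting binary image is the kind of input that the later region filling, size filtering, and erosion steps operate on.</p>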
      <p>The resulting text mask facilitates label
extraction. In addition to improving OCR accuracy, the
text mask allows text to be removed from drawings, which can
increase the accuracy of computer vision approaches
to visual similarity comparison by reducing the impact of
figure labels.</p>
      <p>
        SWT Method. Existing text region detection methods
include stroke width transform (SWT)
        <xref ref-type="bibr" rid="ref2">(Epshtein, Ofek, and
Wexler 2010)</xref>
        and the deep-learning based “efficient and
accurate scene text” (EAST) detector
        <xref ref-type="bibr" rid="ref7">(Zhou et al. 2017)</xref>.
These approaches generally look for regions of high contrast
within an image and have been demonstrated to work well
for text contained in natural images (such as reading street
signs), as well as for text annotations pasted on other
images, where sharp transitions between the texture of text and
non-text regions are evident. Of these, SWT is the most
similar to α-shape because it does not rely on machine learning
and instead evaluates the width of the strokes themselves (i.e.,
using shape information).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation of Label Recognition</title>
      <sec id="sec-3-1">
        <title>Ground Truth Corpus</title>
        <p>We build the ground truth by annotating figures in
US patents, downloaded from patents.reedtech.com. Each
patent folder includes an XML file that contains full text
marked up with XML tags and a number of TIF figure files.
Patents are grouped into different types, such as DESIGN,
PLANT, and UTILITY. Random sampling from all types of
patents results in a heterogeneous corpus including tables,
flow charts, architecture diagrams, etc. In this study, we
focus on figures in the DESIGN category because they form
a relatively homogeneous sample set consisting of
technical drawings, such as the examples shown in Figure 2. As
a pilot study, the final ground truth corpus consists of 100
figure files randomly selected from 100 USPTO DESIGN
patents approved in January 2020. We convert the original
TIF files to PNG files because all methods accept PNG as
input but not all are compatible with the TIF format. A figure file
may contain up to four figures with associated labels.
Examples are shown in Figure 4. The aspect ratio ranges from
0.122 to 3.555. The size of the PNG files ranges from 89kB
to 1.1MB. The size of the TIF files ranges from 21.5kB to
1.3MB.</p>
        <p>We created the ground truth corpus by manually
inspecting each figure file. The original capitalization (e.g., “FIG.”
and “Fig.”) and number format (e.g., “3.4”) were preserved.
The final corpus contains 126 figures in 100 figure files.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Methods Compared</title>
        <p>We compare eight methods, including four open-source
methods (the α-shape-based method, the SWT-based
method, the EAST-based method, and Tesseract) and four
commercial methods (Abbyy FineReader, Adobe Acrobat,
Amazon Textract, and Google vision API). To make
recognition results depend solely on the methods and not on the
input, we use PNG files as input for all methods. The output
may contain irrelevant or gibberish characters in addition to
figure labels, so we apply regular expressions to parse
figure labels. The strings parsed are compared against labels in
the ground truth. The precision is calculated as the number
of correctly identified labels divided by the total number of
labels identified by an OCR method (the parser can
accurately find all labels if they exist). The recall is calculated as
the number of correctly identified labels divided by the total
number of labels found in the ground truth, which is 126.</p>
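        <p>The scoring procedure can be sketched as follows. The regular expression, the choice to compare figure numbers only, and the dictionary-based data layout are assumptions made for illustration; the actual parser and matching rules are not specified here. The F1 score reported in the results is the harmonic mean of precision and recall.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical pattern: "FIG"/"Fig", optional period and space, then a
# figure number such as 3 or 3.4.
LABEL_RE = re.compile(r"fig\.?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_labels(ocr_text):
    """Return the figure numbers found in raw OCR output."""
    return LABEL_RE.findall(ocr_text)


def score(predicted, truth):
    """Compute precision, recall, and F1 over all figure files.

    predicted, truth: dicts mapping a figure file name to a list of labels.
    """
    correct = identified = 0
    for name, labels in predicted.items():
        remaining = list(truth.get(name, []))
        identified += len(labels)
        for label in labels:
            if label in remaining:
                remaining.remove(label)  # each true label is matched once
                correct += 1
    total_truth = sum(len(v) for v in truth.values())
    precision = correct / identified if identified else 0.0
    recall = correct / total_truth if total_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy ground truth and noisy OCR output for three hypothetical files.
truth = {"a.png": ["1", "2"], "b.png": ["3"], "c.png": ["4"]}
ocr_output = {
    "a.png": "xx FIG. 1 ~~ Fig.2",
    "b.png": "FIG. 8",
    "c.png": "no label here",
}
predicted = {name: parse_labels(text) for name, text in ocr_output.items()}
precision, recall, f1 = score(predicted, truth)
```
]]></preformat>
        <p>In the toy data, two of the three parsed labels are correct and two of the four true labels are found, so precision is 2/3 and recall is 1/2.</p>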
        <p>For the text detection methods (SWT and EAST), which
do not themselves perform OCR, we send the results from
text detection to Tesseract for OCR (similar to step 3a in our
α-shape method). The output of SWT is an image with only
the text pixels present (non-text pixels are filtered out),
so Tesseract receives this one image for OCR. SWT is
unable to detect whether text is rotated to the correct
orientation, so we provide hand-rotated images to SWT. The
output of EAST is a bounding box for each text region in an
image, so Tesseract receives just the region of the image
within this bounding box, rotated so that the bounding box
is wide rather than tall (we concatenate the results for
all bounding boxes of an image).</p>
        <p>The results in Figure 3 indicate that the Google
vision API outperforms the other methods, achieving F1 =
0.94, followed by α-shape (F1 = 0.91), Amazon Textract
(F1 = 0.88), Adobe Acrobat (F1 = 0.83), and Abbyy
FineReader (F1 = 0.77). Although commercial methods
seem to achieve outstanding performance, the technology
behind them is proprietary and the cost can be high to
process a large volume of figures in patents and other types of
scholarly documents. The α-shape-based method achieves
better performance than Amazon Textract and Adobe
Acrobat. One common characteristic among all methods except
SWT is that they all achieve relatively high precision (0.96–1.00),
whereas their recall values vary more widely
(0.44–0.89).</p>
        <p>We also compare the performance of methods that
support figures in TIF format, including Tesseract, Adobe
Acrobat, SWT-based, and α-shape-based methods. Generally,
results are similar or worse when using TIF as an input rather
than PNG, and therefore we do not show them. This is
somewhat surprising as the conversion from TIF to PNG could
lose information, but it seems that the extra detail contained
in the TIF format is not helpful for OCR.</p>
        <p>It is interesting that Google vision API failed to extract
labels containing only numbers, e.g., Figure 4(b). It also
fails to extract labels of figures in which the labels are
rotated by 90 degrees (Figure 4(c)). Many of Adobe’s failure
cases are due to detecting an incorrect rotation of the image;
Adobe’s OCR method appears to work only on upright text.
Abbyy and Tesseract have difficulty extracting labels if they
are close to the object, e.g., Figure 4(a), or if the labels are in
a box surrounding the object, e.g., Figure 4(d). Most labels
extracted using Google vision API are relatively clean with
little or no noise characters. Abbyy and Amazon produce
slightly more noise characters. Tesseract produces a relatively
large number of noise characters. These observations all point to the importance
of segmenting a figure into text and objects if the label
recognizer is built on Tesseract.</p>
        <p>One advantage of the α-shape-based method is that it
separates text from figures, which is why it achieves
outstanding performance. For example, the label in Figure 4(e) is
not extracted by Google vision API, but is successfully
extracted by the α-shape-based method. An error analysis of
this method indicates that most errors are caused by
Tesseract. For example, in Figure 4(c), Tesseract reads “Fi.” instead
of “FIG.3”. In some cases, it is hard to determine the kernel
size for removing dashed lines because both short and long
dashed lines are present, e.g., Figure 4(a), in which
our method failed to recognize “FIG. 4” because the α-shape
enclosed the dashed-line point above “FIG. 4”, which confuses
Tesseract. The other text detectors tend to detect the “FIG”
text but not the number, and this accounts for their very low
performance. The output from SWT includes quite a few
areas of false-positive text detection, so the OCR output
can be quite messy. The EAST detector performs quite well
at detecting the “Fig” text no matter the rotation, with only
a few false-positive regions, but misses the number.</p>
        <p>[Figure 3: bar chart comparing the label recognition performance of the methods, including F1, on a 0.00 to 1.00 scale.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>Discussion</title>
        <p>Our results indicate that accurately extracting figure labels
is an open question, at least for open-source software. One
challenge is to automatically rotate figures so that the label
text is in the right orientation before being fed to OCR. Another
challenge is isolating text from drawings. The
α-shape-based method, which beats Google vision API in certain
cases, e.g., Figure 4(e), provides a promising solution.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Summary</title>
      <p>In this study, we compared eight open-source and
commercial OCR methods for figure label extraction, evaluated on
a small sample of figures from US DESIGN patents. We
also developed a heuristic method based on α-shapes.
Although commercial methods achieve the highest precision
and recall, the open-source α-shape-based method achieves
comparable performance. We argue that developing a
self-adaptable open-source framework for figure label detection
is still an open challenge. Future work includes developing
learning-based models to adapt kernel size and parameters
to different label characteristics.</p>
      <p>The data and source code used in this work are publicly
available on Figshare and GitHub.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Research conducted by MGong and DO presented in this
paper was supported by the Laboratory Directed Research
and Development program of Los Alamos National
Laboratory under project number LDRD20200041ER. Research
conducted by XW, JW, and MGryder presented in this paper
was supported by Los Alamos National Laboratory
subcontract BA601958 awarded to Old Dominion University.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Edelsbrunner</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kirkpatrick</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Seidel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>1983</year>
          .
          <article-title>On the shape of a set of points in the plane</article-title>
          .
          <source>IEEE Transactions on Information Theory.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Epshtein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ofek</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wexler</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Detecting text in natural scenes with stroke width transform</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Gonzales</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Woods</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          <year>2002</year>
          .
          <source>Digital Image Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Otsu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>1979</year>
          .
          <article-title>A threshold selection method from gray-level histograms</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Sezgin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sankur</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Survey over image thresholding techniques and quantitative performance evaluation</article-title>
          .
          <source>Journal of Electronic Imaging.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>An Overview of the Tesseract OCR engine</article-title>
          .
          <source>In IEEE Intl Conference on Document Analysis and Recognition.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>EAST: An Efficient and accurate scene text detector</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>