<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CVC-UAB's participation in the Flowchart Recognition Task of CLEF-IP 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marçal Rusiñol</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-Pere de las Heras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joan Mas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oriol Ramos Terrades</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimosthenis Karatzas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anjan Dutta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gemma Sánchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josep Lladós</string-name>
          <email>josepg@cvc.uab.cat</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Center, Dept. Ciències de la Computació, Edifici O, Universitat Autònoma de Barcelona</institution>
          ,
          <addr-line>08193 Bellaterra (Barcelona)</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The aim of this document is to describe the methods we used in the flowchart recognition task of the CLEF-IP 2012 track. The flowchart recognition task consisted in interpreting flowchart line-drawing images. The participants are asked to extract as much structural information as possible from these images and return it in a predefined textual format for further processing for the purpose of patent search. The Document Analysis Group from the Computer Vision Center (CVC-UAB) has been actively working on Graphics Recognition for over a decade. Our main aim in participating in the CLEF-IP flowchart recognition task is to test our graphics recognition architectures on this type of graphics understanding problem. Our recognition system comprises a modular architecture where modules tackle the different steps of the flowchart understanding problem. A text/graphics separation technique is applied to separate the textual elements from the graphical ones. An OCR engine is applied on the text layer, while the graphical layer is analyzed to identify nodes and edges as well as their relationships. We have proposed two different families of node and edge segmentation modules: one dealing with the raw pixel data and another working in the vectorial domain. The locations of the identified nodes are fed to the recognizer module, which is in charge of categorizing each node's type. We have proposed two different node descriptors for the recognizer module. The module analyzing the edges assesses the connections between nodes and categorizes the edge style. Finally, a post-processing module is applied in order to correct some syntactic errors. We have submitted four different runs by combining the two variants of the segmentation module with the two variants of the recognition module.</p>
      </abstract>
      <kwd-group>
        <kwd>Flowchart Recognition</kwd>
        <kwd>Text/Graphics Separation</kwd>
        <kwd>Raster-to-Vector Conversion</kwd>
        <kwd>Symbol Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Graphics Recognition can be defined as the branch of document analysis that focuses on the recovery of graphical information from documents. Graphical documents such as engineering drawings, diagrams, architectural floor plans, maps, etc. strongly depend on diagrammatic notations defined in terms of a visual language. A graphics recognition system can be structured in three levels. First, a lexical level aims to extract basic information components such as graphical primitives (straight lines, arcs, solid areas) and text zones. The second level, the syntactic level, is related to the structure, i.e. how graphical entities are constructed. It can be modeled by a grammar describing the valid instances of diagrams according to the notation. The recognition involved in the syntactic level locates the graphical entities and classifies them into various classes by their shape and context. Finally, a functional or semantic level defines what the syntactically recognized entities mean in the context where they appear.</p>
      <p>In this paper, we describe the flowchart recognition system that we proposed in the context of the Intellectual Property (IP) track at CLEF 2012. The IP track was first organized in 2009 as a prior-art retrieval task. In 2010 and 2011, the track continued as a benchmarking activity of the CLEF conference, and in 2011 two image-based tasks were added: one devoted to finding patent documents relevant to a given patent document containing images, and another aimed at categorizing given patent images into predefined categories (such as graph, flowchart, drawing, etc.). In CLEF-IP 2012 a new image-based task was proposed. The flowchart recognition task deals with the interpretation of flowchart line-drawing images. The participants are asked to extract as much structural information as possible from these images and return it in a predefined textual format for further processing for the purpose of patent search.</p>
      <p>The Document Analysis Group from the Computer Vision Center (CVC-UAB) has been actively working on the Graphics Recognition research topic, and in this notebook paper we describe our submitted flowchart recognition system. The rest of this paper is organized as follows. We detail in Section 2 the architecture overview and summarize the submitted runs. Section 3 details each of the modules that comprise the system's pipeline. In Section 4 we present and analyze the obtained results, and finally in Section 5 we draw our concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>Architecture Overview</title>
      <p>The architecture of our recognition system has been designed as follows. As we can see in Fig. 1, the system pipeline has been structured in separate modules dealing with the different steps of the flowchart identification problem.</p>
      <p>First of all, a text/graphics separation module has been applied to separate the textual elements from the graphical ones. An OCR engine has been applied on the text layer, while on the graphical layer we analyze the nodes and edges. For node/edge segmentation we have applied two different strategies, resulting in two alternative segmentation modules: a pixel-based and a vector-based approach. The vector-based approach requires a conversion module that transforms the raw pixel image into a vectorial representation.</p>
      <p>The output of the node segmentation module is a list of bounding-boxes of the detected nodes. These locations are subsequently fed to the recognizer module, which is in charge of establishing the node's type. Here we have used two different node descriptors, namely a descriptor based on the geometric moments and the Blurred Shape Model (BSM) descriptor [<xref ref-type="bibr" rid="ref1">1</xref>], resulting in two alternatives for the node recognition module. The module analyzing the edges is devoted to assessing which nodes are connected and classifying the edges in terms of their style. Finally, a post-processing module has been applied in order to correct certain syntactic errors.</p>
      <p>The combination of the two alternative segmentation modules with the two alternative recognition modules results in four system variants, producing the four submitted runs summarized in Table 1.</p>
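      <p>As a concrete illustration, the four runs are simply the Cartesian product of the two segmentation strategies and the two node descriptors described below. The following minimal Python sketch enumerates them; the labels are illustrative, not identifiers from our implementation.</p>
      <preformat>
from itertools import product

# Two node segmentation strategies and two node descriptors; the labels
# below are illustrative only.
segmenters = ["pixel-based", "vectorial-based"]
descriptors = ["geometric-moments", "BSM"]

# R1 = pixel + moments, R2 = pixel + BSM, R3 = vector + moments,
# R4 = vector + BSM, matching the run descriptions in this paper.
runs = {f"R{i + 1}": combo
        for i, combo in enumerate(product(segmenters, descriptors))}
for run_id, (segmenter, descriptor) in runs.items():
    print(run_id, segmenter, descriptor)
      </preformat>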
      <p>The first step in the proposed flowchart recognition architecture is to separate text elements from graphical elements. Within the Document Image Analysis research field, various text/graphics separation algorithms have been developed over the years. One of the most widely used methodologies, which yields acceptable results on a variety of mixed-type documents, is the algorithm proposed by Tombre et al. in [<xref ref-type="bibr" rid="ref6">6</xref>], based on the well-known approach of Fletcher and Kasturi in [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
      <p>This module proceeds as follows: given an input image, features are extracted from connected components (height and width ratios, orientation, etc.). Then, several adaptive thresholds determine whether a connected component corresponds to a graphical element or a textual element. The connected components that the system is not certain about are assigned to an undetermined layer via a rejection criterion. The output of the module is an image consisting of three separate layers, namely the graphical, the textual and the undetermined one. We can see an example of the results obtained by this text/graphics separation module in Fig. 2.</p>
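      <p>A minimal sketch of this kind of connected-component filtering is given below. It is a crude stand-in for the algorithm of [<xref ref-type="bibr" rid="ref6">6</xref>]: it uses only component size and elongation, the thresholds are illustrative, and the undetermined layer and text-string grouping of the full method are omitted.</p>
      <preformat>
import numpy as np
from scipy import ndimage

def split_text_graphics(binary, size_ratio=3.0, elong_ratio=10.0):
    """Toy text/graphics split via connected-component features.

    Components much larger or much more elongated than the median
    component are sent to the graphics layer; the rest are kept as text.
    Thresholds are illustrative, not the tuned values of the real module.
    """
    labels, n = ndimage.label(binary)
    slices = ndimage.find_objects(labels)
    heights = np.array([s[0].stop - s[0].start for s in slices], dtype=float)
    widths = np.array([s[1].stop - s[1].start for s in slices], dtype=float)
    med_h, med_w = np.median(heights), np.median(widths)

    text = np.zeros_like(binary, dtype=bool)
    graphics = np.zeros_like(binary, dtype=bool)
    for i, s in enumerate(slices):
        component = labels[s] == i + 1
        big = heights[i] > size_ratio * med_h or widths[i] > size_ratio * med_w
        elongated = (max(heights[i], widths[i]) >
                     elong_ratio * min(heights[i], widths[i]))
        target = graphics if (big or elongated) else text
        target[s] |= component
    return text, graphics
      </preformat>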
      <sec id="sec-2-1">
        <title>Raster-to-vector Conversion</title>
        <p>Some of the submitted runs are based on the analysis of vectorial data instead of working on the raw image domain. Thus, we need a module that converts the raster image into a vectorial representation. We have used the raster-to-vector conversion described by Tombre et al. in [<xref ref-type="bibr" rid="ref5">5</xref>]. This algorithm first computes the skeleton of the image by means of a distance transform. Chains of pixels are then polygonally approximated using the algorithm proposed by Rosin and West in [<xref ref-type="bibr" rid="ref4">4</xref>], which is based on a recursive split-and-merge technique. We can see an example of the obtained results after the raster-to-vector conversion in Fig. 3.</p>
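        <p>A rough equivalent can be obtained with off-the-shelf components. The sketch below substitutes scikit-image skeletonization and Douglas-Peucker simplification for the distance-transform skeleton of [<xref ref-type="bibr" rid="ref5">5</xref>] and the split-and-merge of [<xref ref-type="bibr" rid="ref4">4</xref>]; it approximates the module rather than reimplementing it.</p>
        <preformat>
import numpy as np
from skimage import measure
from skimage.morphology import skeletonize

def vectorize(binary):
    """Approximate raster-to-vector conversion.

    Skeletonize the drawing, then polygonally approximate the skeleton
    outlines with Douglas-Peucker (approximate_polygon) as a stand-in for
    the recursive split-and-merge of Rosin and West [4].
    """
    skeleton = skeletonize(binary)
    polylines = []
    # find_contours traces the outline of each 1-pixel-wide branch; a real
    # vectorizer would follow the pixel chains of the skeleton itself.
    for contour in measure.find_contours(skeleton.astype(float), 0.5):
        polylines.append(measure.approximate_polygon(contour, tolerance=2.0))
    return polylines  # list of (N, 2) arrays of row/column coordinates
        </preformat>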
      </sec>
      <sec id="sec-2-8">
        <title>Node Segmentation</title>
        <p>We have submitted two different alternatives for the node segmentation module, depending on the primitives used to segment nodes and edges from the flowchart images. On the one hand, two of the system variants are based on a node segmentation module that works with the raw pixels of the graphical layer; on the other hand, we propose another segmentation based on the analysis of the vectors arising from the raster-to-vector conversion of the graphical layer. In both cases, the segmentation strategy first attempts to segment symbol nodes (i.e. rectangles, ovals, diamonds, circles, etc.) and then segments the no-box nodes, corresponding to text which is not enclosed by any graphical entity.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Pixel-based Node Segmentation</title>
        <p>The node segmentation module takes as input the graphical layer obtained through the text/graphics separation module and outputs a list of bounding-boxes where the nodes might be found. The pixel-based approach analyses connected components (CCs) to determine whether they might be a node or not. A CC is a maximal set of spatially connected pixels that share a common property (here, the same colour). In the current system, the connected components of interest are the ones corresponding to the white area inside the nodes, and the main problem is to discriminate those from the connected components corresponding to the background. After a preliminary step where very small and very large CCs are filtered out, the remaining CCs are labeled as node candidates. Not all the remaining components correspond to nodes, since the white areas produced by loops formed by edges connecting nodes are also detected as candidates. In order to discriminate the real node components from the rest, we devise heuristics over a couple of features (see the sketch at the end of this subsection):</p>
        <list list-type="bullet">
          <list-item>
            <p><bold>Solidity:</bold> computed as the ratio between the number of pixels in the CC and the area of the convex hull of the CC. Since nodes tend to be convex, objects below a solidity threshold are rejected.</p>
          </list-item>
          <list-item>
            <p><bold>Vertical Symmetry:</bold> computed as the ratio between the amount of pixels in the right and the left part of the CC. Since nodes tend to be vertically symmetric, objects below a symmetry threshold are rejected.</p>
          </list-item>
        </list>
        <p>The remaining CCs after this filtering are considered as nodes, and the bounding-box of each element is provided. Going back to the textual layer, once the symbol nodes have been detected, we can easily extract the no-box nodes as the text clusters that do not fall within any detected symbol node. We can see an example of the node segmentation output in Fig. 4(b). This segmentation strategy has been used in the runs R1 and R2.</p>
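        <p>A minimal sketch of the two heuristics follows; the threshold values are assumptions for illustration, and the preliminary size filtering is omitted.</p>
        <preformat>
import numpy as np
from scipy.spatial import ConvexHull

SOLIDITY_MIN = 0.8   # illustrative thresholds, not our tuned values
SYMMETRY_MIN = 0.7

def is_node_candidate(component_mask):
    """Keep a white connected component only if it looks like a node interior."""
    ys, xs = np.nonzero(component_mask)
    points = np.column_stack([xs, ys])
    # Solidity: pixel count over convex-hull area (ConvexHull.volume is the
    # area for 2-D point sets). Loops formed by connecting edges score low.
    solidity = xs.size / ConvexHull(points).volume
    # Vertical symmetry: balance of pixel mass left/right of the centroid.
    right = np.count_nonzero(xs >= xs.mean())
    left = xs.size - right
    symmetry = min(left, right) / max(left, right)
    return solidity >= SOLIDITY_MIN and symmetry >= SYMMETRY_MIN
        </preformat>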
      </sec>
      <sec id="sec-2-3">
        <title>Vectorial-based Node Segmentation</title>
        <p>In contrast to the pixel-based segmentation, the vectorial-based node segmentation takes as input the vectorial representation of the images obtained after the raster-to-vector conversion, see Fig. 3(b). It is based on the exploration of loops in the planar graphs obtained from the vectorial images. In these graphs, nodes are the vectorial lines and edges are the connection points between these lines. The search for loops is driven by an implementation of the optimal algorithm for finding the regions of a planar graph by Jiang et al. in [<xref ref-type="bibr" rid="ref3">3</xref>]. As in the pixel-based segmentation, the Solidity and Symmetry heuristics presented above are applied to rule out inconsistent instances. Finally, the system outputs the bounding-boxes for the remaining nodes. The no-box nodes are determined by looking at textual clusters connected to shape nodes via an edge. This approach is implemented in runs R3 and R4.</p>
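        <p>As a hedged illustration of the loop search, the sketch below enumerates the faces of a planar graph using the combinatorial planar embedding from networkx rather than the algorithm of [<xref ref-type="bibr" rid="ref3">3</xref>]; the resulting faces would then be filtered with the same Solidity and Symmetry heuristics.</p>
        <preformat>
import networkx as nx

def planar_faces(segments):
    """Enumerate the faces (closed loops) of a planar graph.

    `segments` is a list of (endpoint, endpoint) pairs taken from the
    vectorial representation. Each face is returned as a list of vertices;
    the unbounded outer face is included and must be discarded by a caller.
    """
    graph = nx.Graph(segments)
    is_planar, embedding = nx.check_planarity(graph)
    assert is_planar, "a vectorized line drawing yields a planar graph"
    faces, visited = [], set()
    for u, v in embedding.edges():       # iterate over half-edges
        if (u, v) in visited:
            continue
        faces.append(embedding.traverse_face(u, v, mark_half_edges=visited))
    return faces
        </preformat>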
      </sec>
      <sec id="sec-2-4">
        <title>Node Recognition</title>
        <p>The module devoted to recognizing the nodes' types takes as input either of the two proposed node segmentations and tries to classify each node into one of the possible classes based on its shape. We have proposed two different shape recognizers.</p>
        <p>Pattern recognition systems usually require shape descriptors invariant to rigid (scale, translation, rotation) and even affine transforms. However, in the context of flowchart recognition, the shape descriptors needed to recognize the nodes' type just need to be invariant to translation and scale, while a lack of invariance to other transforms is actually beneficial, as it results in increased discrimination.</p>
        <p>Translation invariance is not an issue, since both node segmentations localise the node areas, and both proposed shape descriptors are invariant to scale. In order to assign a node label, a nearest-neighbour classifier is used over a training set of labeled nodes. Let us detail the two different descriptors we used.</p>
        <p>The Blurred Shape Model descriptor (BSM) was originally created to perform handwritten musical score recognition, but it has also been applied to other related document analysis tasks with different degrees of success [<xref ref-type="bibr" rid="ref1">1</xref>]. The BSM descriptor is a zoning-based descriptor. Shapes are divided into a 15×15 regular grid. Then, the area and the center of gravity are computed for each cell. The final BSM descriptor is constructed by weighting the computed areas by the inverse of the distances between the gravity centers of adjacent cells. This weighted average is performed in order to achieve robustness to local variations of the shape under analysis. A normalization by the object area is used in order to achieve invariance to scale. This shape recognizer is used in the runs R2 and R4.</p>
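        <p>The following is a minimal sketch of a BSM-style zoning descriptor. It replaces the inverse-distance weighting of [<xref ref-type="bibr" rid="ref1">1</xref>] with a simple sharing of mass between 4-neighbouring cells, so it only approximates the descriptor described above.</p>
        <preformat>
import numpy as np

def bsm_descriptor(shape_mask, grid=15):
    """Simplified Blurred Shape Model-style descriptor (225-D for grid=15)."""
    ys, xs = np.nonzero(shape_mask)
    # Map pixels to grid cells; using the bounding box gives scale invariance.
    gy = ((ys - ys.min()) * grid / (np.ptp(ys) + 1)).astype(int)
    gx = ((xs - xs.min()) * grid / (np.ptp(xs) + 1)).astype(int)
    hist = np.zeros((grid, grid))
    np.add.at(hist, (gy, gx), 1.0)
    # Blur: each cell shares half of its mass with its 4-neighbours
    # (a simplification of the inverse-distance weighting in [1]).
    blurred = hist.copy()
    blurred[1:, :] += 0.5 * hist[:-1, :]
    blurred[:-1, :] += 0.5 * hist[1:, :]
    blurred[:, 1:] += 0.5 * hist[:, :-1]
    blurred[:, :-1] += 0.5 * hist[:, 1:]
    return (blurred / blurred.sum()).ravel()  # area normalization
        </preformat>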
        <p>Geometric Moments have been widely used as shape descriptors, since lower-order moments represent certain well-known fundamental geometric properties of the underlying image functions. The central (p + q)-th order moment for a digital image I(x, y) is expressed by</p>
        <disp-formula>
          <tex-math><![CDATA[\mu_{pq} = \sum_{x,y} (x - \bar{x})^p (y - \bar{y})^q \, I(x, y), \qquad \bar{x} = \frac{m_{10}}{m_{00}}, \quad \bar{y} = \frac{m_{01}}{m_{00}}.]]></tex-math>
        </disp-formula>
        <p>The use of the centroid (x̄, ȳ) makes the descriptor invariant to translation. A normalization by the object area is used to achieve invariance to scale.</p>
        <p>We have used the geometric moments up to third order as a feature vector. This shape recognizer is implemented in the runs R1 and R3.</p>
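        <p>A direct transcription of the formula above into a feature vector might look as follows; keeping only the area-normalized moments of total order 2 and 3 is an assumption consistent with the description above.</p>
        <preformat>
import numpy as np

def central_moment_features(image, max_order=3):
    """Area-normalized central moments mu_pq of total order 2 and 3."""
    ys, xs = np.nonzero(image)
    w = image[ys, xs].astype(float)              # intensities I(x, y)
    m00 = w.sum()                                # object area (mass)
    xbar = (xs * w).sum() / m00                  # centroid, m10 / m00
    ybar = (ys * w).sum() / m00                  # centroid, m01 / m00
    features = []
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            order = p + q
            if order in range(2, max_order + 1):   # skip orders 0 and 1
                mu = ((xs - xbar) ** p * (ys - ybar) ** q * w).sum()
                # one standard area normalization for scale invariance
                features.append(mu / m00 ** (1 + order / 2))
    return np.array(features)
        </preformat>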
      </sec>
      <sec id="sec-2-5">
        <title>Edge Segmentation and Node Linking</title>
        <p>Edge segmentation has been done in a similar fashion in both the pixel-based and the vectorial-based approaches. Taking as input both the graphical layer and the bounding-boxes from the node segmentation, we obtain an edge layer image. We can see an example in Fig. 4.</p>
        <p>In order to assess which nodes are connected by an edge, we select pairs of connected components in the node layer and check whether any element of the edge layer causes those two disjoint components to merge into a single one. An analysis of the edge stroke's width is performed to differentiate between directed and undirected edges. Since our node linking modules strongly depend on the extraction of connected components, they fail at detecting dotted edges. The pixel-based approach fails if an edge is broken by some text, but the vectorial-based one performs a post-processing step that merges collinear broken edges, making it robust to such situations.</p>
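        <p>The merge test at the core of this module can be sketched as follows; the stroke-width analysis for arrowheads is omitted, and the boolean-mask representation is an assumption.</p>
        <preformat>
import numpy as np
from scipy import ndimage

def nodes_linked(node_a_mask, node_b_mask, edge_layer):
    """True if some edge-layer element merges the two node components."""
    combined = node_a_mask | node_b_mask | edge_layer
    labels, _ = ndimage.label(combined)
    labels_a = set(np.unique(labels[node_a_mask])) - {0}
    labels_b = set(np.unique(labels[node_b_mask])) - {0}
    # A shared label means the edge pixels joined the two disjoint nodes.
    return not labels_a.isdisjoint(labels_b)
        </preformat>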
      </sec>
      <sec id="sec-2-6">
        <title>Text Recognition</title>
        <p>The module dealing with text recognition receives as input the textual layer obtained by the text/graphics separation module and the bounding-boxes arising from the node segmentation module. The contents of each bounding-box are processed by the commercial OCR engine from ABBYY (ABBYY FineReader Engine 10, http://www.abbyy.com/ocr_sdk/), and no further post-processing steps are applied. Judging from the raw output of the OCR, some simple steps for dealing with out-of-vocabulary elements would drastically improve the performance of this module.</p>
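        <p>For a self-contained illustration, the sketch below crops each detected box and runs the open-source Tesseract engine through pytesseract as a stand-in for the commercial ABBYY engine we actually used.</p>
        <preformat>
from PIL import Image
import pytesseract  # open-source stand-in for the ABBYY engine

def recognize_node_text(page_image_path, boxes):
    """OCR the contents of each node bounding-box.

    `boxes` holds (left, top, right, bottom) tuples from the node
    segmentation module; as in our runs, no post-processing is applied.
    """
    page = Image.open(page_image_path)
    return [pytesseract.image_to_string(page.crop(box)).strip()
            for box in boxes]
        </preformat>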
      </sec>
      <sec id="sec-2-7">
        <title>Post-processing</title>
        <p>Testing our architecture against the training data, we realized that when two edges collide, the ground truth expects a new node of type point. Obviously, our node segmentation and recognition modules do not tackle such nodes, since they cannot be found via a connected-component analysis. We decided to include the detection of those nodes in a post-processing step.</p>
        <p>The final graph is syntactically analyzed, and when we find three or more nodes that are connected through exactly the same edge element, we add a new intermediate node of type point. This procedure is applied iteratively until there are no node junctions with cardinality higher than two.</p>
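        <p>A minimal sketch of this rule is given below; the mapping from edge elements to the nodes they touch is a hypothetical representation of the recognized graph.</p>
        <preformat>
def insert_point_nodes(edge_to_nodes, next_node_id):
    """Split any edge element touching 3+ nodes with a point-type node.

    `edge_to_nodes` maps an edge element id to the set of node ids it
    connects. A junction becomes a new point node plus binary links from
    that point to each touched node, as described above.
    """
    links, point_nodes = [], []
    for edge_id, nodes in edge_to_nodes.items():
        if len(nodes) > 2:               # junction: insert a point node
            point = next_node_id
            next_node_id += 1
            point_nodes.append(point)
            links.extend((point, node) for node in nodes)
        else:
            links.append(tuple(nodes))
    return links, point_nodes
        </preformat>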
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>Let us first evaluate the proposed system's performance qualitatively, by looking at specific cases where we face some problems, and then present the obtained results.</p>
      <sec id="sec-3-1">
        <title>Qualitative Evaluation</title>
        <p>Although visually the system seems to perform quite well, we have identified several cases where the proposed modules fail, examples of which are presented in Fig. 5.</p>
        <p>First of all, the node and edge segmentations are based on connected components. When either a node (Fig. 5(a)) or an edge (Fig. 5(b)) is broken, it is usually not well segmented by any of the proposed methods. In addition, low-quality documents are hard to "read" for the OCR engine (Fig. 5(c)). Finally, the text/graphics separation module is also based on the analysis of connected components, so when text characters overlap with graphical elements, they are not properly segmented. This is the case shown in Fig. 5(d), where the character F of FROM and the character D of FIELD touch the diamond shape and thus are classified as graphics and assigned to the graphical layer.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper, we presented the flowchart recognition system that the Document Analysis Group from the Computer Vision Center (CVC-UAB) has submitted in the context of the Intellectual Property (IP) track at CLEF 2012. The architecture of our recognition system is structured in different modules dealing with the different steps of the flowchart identification problem.</p>
      <p>First of all, a text/graphics separation technique has been applied to separate the textual elements from the graphical ones. An OCR engine has been applied on the text layer, while on the graphical layer we analyze the nodes and edges. For node/edge segmentation we have applied two different strategies: a pixel-based and a vector-based approach. The node locations have been fed to the recognizer module, which is in charge of categorizing each node's type. Here we have used two different node descriptors, namely a descriptor based on the geometric moments and the Blurred Shape Model (BSM) descriptor. The module analyzing the edges determines which nodes are connected and categorizes the edge style. The combination of the two segmentation modules together with the two recognition modules results in four system variants and an equal number of submitted runs.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work has been partially supported by the Spanish Ministry of Education and Science under projects RYC-2009-05031, TIN2011-24631, TIN2009-14633-C03-03, Consolider Ingenio 2010: MIPRCV (CSD2007-00018) and the grant 2009SGR-1434 of the Generalitat de Catalunya.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Escalera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pujol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Radeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lladós</surname>
          </string-name>
          .
          <article-title>Blurred shape model for binary and grey-level symbol recognition</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>30</volume>(<issue>15</issue>):<fpage>1424</fpage>-<lpage>1433</lpage>, November
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Fletcher</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kasturi</surname>
          </string-name>
          .
          <article-title>A robust algorithm for text string separation from mixed text/graphics images</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>10</volume>(<issue>6</issue>):<fpage>910</fpage>-<lpage>918</lpage>, November
          <year>1988</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>X.Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Bunke</surname>
          </string-name>
          .
          <article-title>An optimal algorithm for extracting the regions of a plane graph</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>14</volume>(<issue>7</issue>):<fpage>553</fpage>-<lpage>558</lpage>,
          <year>1993</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.L.</given-names>
            <surname>Rosin</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.A.</given-names>
            <surname>West</surname>
          </string-name>
          .
          <article-title>Segmentation of edges into lines and arcs</article-title>
          .
          <source>Image and Vision Computing</source>
          ,
          <volume>7</volume>(<issue>2</issue>):<fpage>109</fpage>-<lpage>114</lpage>, May
          <year>1989</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>K.</given-names>
            <surname>Tombre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ah-Soon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Masini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tabbone</surname>
          </string-name>
          .
          <article-title>Stable and robust vectorization: How to make the right choices</article-title>
          .
          In <source>Graphics Recognition: Recent Advances</source>, volume
          <volume>1941</volume> of <source>Lecture Notes in Computer Science</source>, pages
          <fpage>3</fpage>-<lpage>18</lpage>. Springer-Verlag,
          <year>2000</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>K.</given-names>
            <surname>Tombre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tabbone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pélissier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lamiroy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Dosch</surname>
          </string-name>
          .
          <article-title>Text/graphics separation revisited</article-title>
          .
          In <source>Document Analysis Systems</source>, volume
          <volume>2423</volume> of <source>Lecture Notes in Computer Science</source>, pages
          <fpage>615</fpage>-<lpage>620</lpage>. Springer-Verlag,
          <year>2002</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>