CVC-UAB’s participation in the Flowchart Recognition Task of CLEF-IP 2012 Marçal Rusiñol, Lluı́s-Pere de las Heras, Joan Mas, Oriol Ramos Terrades, Dimosthenis Karatzas, Anjan Dutta, Gemma Sánchez, and Josep Lladós Computer Vision Center, Dept. Ciències de la Computació, Edifici O, Universitat Autònoma de Barcelona, 08193 Bellaterra (Barcelona), Spain. {marcal, lpheras, jmas, oriolrt, dimos, adutta, gemma, josep}@cvc.uab.cat Abstract. The aim of this document is to describe the methods we used in the flowchart recognition task of the CLEF-IP 2012 track. The flowchart recognition task consisted in interpreting flowchart line- drawing images. The participants are asked to extract as much as struc- tural information in these images as possible and return it in a pre- defined textual format for further processing for the purpose of patent search. The Document Analysis Group from the Computer Vision Cen- ter (CVC-UAB) has been actively working on Graphics Recognition for over a decade. Our main aim in participating in the CLEF-IP flowchart recognition task is to test our graphics recognition architectures on this type of graphics understanding problem. Our recognition system comprises a modular architecture where mod- ules tackle different steps of the flowchart understanding problem. A text/graphic separation technique is applied to separate the textual el- ements from the graphical ones. An OCR engine is applied on the text layer while on the graphical layer identify with nodes and edges as well as their relationships. We have proposed two different families of node and edge segmentation modules. One dealing with the raw pixel data and another working in the vectorial domain. The locations of nodes identified are fed to the recognizer module which is in charge of catego- rizing the node’s type. We have proposed two different node descriptors for the recognizer module. The module analyzing the edges is analysing the connections between nodes and categorizes the edge style. Finally, a post-processing module is applied in order to correct some syntactic errors. We have submitted four different runs by combining the two variants of the segmentation module together with the two variants of the recogni- tion module. Keywords: Flowchart Recognition, Text/Graphics Separation, Raster- to-Vector Conversion, Symbol Recognition. 1 Introduction Graphics Recognition can be defined as the branch of document analysis that focuses on the recovery of graphical information from documents. Graphical 2 M. Rusiñol et al. documents such as engineering drawings, diagrams, architectural floor plans, maps, etc. strongly depend on diagrammatic notations defined in terms of a visual language. A Graphics recognition system can be structured in three lev- els. First, a lexical level aims to extract basic infomation components such as graphical primitives (straight lines, arcs, solid areas) and text zones. The second level, the syntactic level, is related to the structure, i.e. how graphical entities are constructed. It can be modeled by a grammar describing the valid instances of diagrams according to the notation. The recognition involved in the syntac- tic level locates the graphical entities and classifies them into various classes by their shape and context. Finally, a functional or semantic level defines what the syntactically recognized entities mean in the context where they appear. In this paper, we describe the flowchart recognition system that we proposed in the context of the Intellectual Property (IP) track at CLEF 2012. The IP track was first organized in 2009 as a prior art retrieval task. In 2010 and 2011, the track continued as a benchmarking activity of the CLEF conference and in 2011 two image-based tasks were added. One devoted to find patent docu- ments relevant to a given patent document containing images and another aimed at categorizing given patent images into predefined categories of images (such as graph, flowchart, drawing, etc.). In CLEF-IP 2012 a new image-based task was proposed. The flowchart recognition task deals with the interpretation of flowchart line-drawing images. The participants are asked to extract as much structural information as possible in these images and return it in a predefined textual format for further processing for the purpose of patent search. The Document Analysis Group from the Computer Vision Center (CVC- UAB) has been actively working on the Graphics Recognition research topic and in this notebook paper, we describe our submitted flowchart recognition system. The rest of this paper has been organized as follows. We detail in Section 2 the architecture overview and summarize the submitted runs. Section 3 details each of the modules that comprises the system’s pipeline. In section 4 we present and analyze the obtained results and finally in section 5 we draw our concluding remarks. 2 Architecture Overview The architecture of our recognition system has been designed as follows. As we can see in Fig. 1, the system pipeline has been structured in separate modules dealing with the different steps of the flowchart identification problem. First of all, a text/graphic separation module has been applied to separate the textual elements from the graphical ones. An OCR engine has been applied on the text layer while on the graphical layer we analyze the nodes and edges. For nodes/edge segmentation we have applied two different strategies resulting in two alternative segmentation modules: a pixel-based and a vector-based approach. The vector-based approach requires a conversion module that transforms the raw pixel image into a vectorial representation. The CVC-UAB in the Flowchart Recognition Task 3 Fig. 1. System’s architecture overview. The output of the node segmentation module is a list of bounding-boxes of the detected nodes. These locations were subsequently fed to the recognizer module which is in charge of establishing the node’s type. Here we have used two different node descriptors, namely a descriptor based on the geometric moments and the Blurred Shape Model (BSM) descriptor [1], resulting in two alternatives for the node recognition module. The module analyzing the edges is devoted to assess which nodes are connected and classify the edges in terms of their style..Finally, a post-processing module has been applied in order to correct certain syntactic errors. The combination of the two alternative segmentation modules with the two alternative recognition modules results to four system variants producing the four submitted runs summarized in Table 1. Table 1. Submitted runs produced by the four different system variants. Id Run Segmentation Descriptor R1 IP-FR-CLEF2012.CVC-UAB.GMOMENTS Pixel-based Geometric moments R2 IP-FR-CLEF2012.CVC-UAB.BSM Pixel-based BSM R3 IP-FR-CLEF2012.CVC-UAB.VECTORIALGMOMENTS Vectorial-based Geometric moments R4 IP-FR-CLEF2012.CVC-UAB.VECTORIALBSM Vectorial-based BSM 3 Module Description Let us detail in this Section each of the modules composing our architecture. 3.1 Text/Graphics Separation The first step in the proposed flowchart recognition architecture is to separate text elements from graphical elements. Within the Document Image Analysis research field, various text/graphics separation algorithms have been developed over many years. One of the most used methodologies that yields acceptable re- sults in a variety of mixed-type documents is the algorithm proposed by Tombre et al. in [6] based on the well-known approach of Fletcher and Kasturi in [2]. 4 M. Rusiñol et al. a) b) c) d) Fig. 2. Example of the text/graphics separation module. a) Original image, b) graph- ical layer, c) textual layer, d) undetermined layer. The CVC-UAB in the Flowchart Recognition Task 5 This module proceeds as follows, given an input image, features are extracted from connected components (height and width ratios, orientation, etc.). Then, several adaptive thresholds determine whether a connected component corre- sponds to a graphical element or a textual element. The connected components that the system is not certain about are assigned to an undetermind layer via a rejection criterion. The output of the module is an image consisting of three separate layers, namely the graphical, the textual and the undetermined one. We can see an example of the results obtained by this text/graphics separation module in Fig. 2. 3.2 Raster-to-vector Conversion Some of the submitted runs are based on the analysis of vectorial data instead of working on the raw image domain. Thus we need a module that converts the raster image into a vectorial representation. We have used the raster-to-vector conversion described in [5] by Tombre et al. This algorithm first computes the skeleton of the image by means of a distance transform. Chains of pixels are then polygonally approximated by using the algorithm proposed by Rosin and West in [4] which is based on a recursive split-and-merge technique. We can see an example of the obtained results after the raster-to-vector conversion in Fig. 3. a) b) Fig. 3. Example of the raster-to-vector conversion module. a) Original image, b) vec- torial representation applied to the graphical layer. 3.3 Node Segmentation We have submited two different alternatives for the node segmentation module depending on the primitives used to segment nodes and edges from the flowchart images. On one hand we have two system variants based on a node segmenta- tion module that works with the raw pixels from the graphical layer and on the 6 M. Rusiñol et al. other hand we propose another segmentation based on the analysis of vectors arisen from the raster-to-vector conversion of the graphical layer. In both cases, the segmentation strategy first attempted to segment symbol nodes (i.e. rectan- gles, ovals, diamonds, circles, etc.) and then tried to segment the no-box nodes, corresponding to text which is not enclosed by any graphical entity. Pixel-based Node Segmentation The node segmentation module takes as input the graphical layer obtained through the text/graphics separation module and outputs a list of bounding- boxes where the nodes might be found. The pixel-based approach analyses con- nected components (CCs) to determine whether they might be a node or not. A CC is a maximal set of spatially connected pixels that share a common property (here, the same colour). In the current system connected components of inter- est are the ones corresponding to the white area inside the nodes and the main problem is to discriminate those from connected components corresponding to the background. After a preliminary step where very small and very large CCs are filtered, the remaining CCs are labeled as node candidates. Here, not all the remaining components correspond to nodes since the white areas produced by loops formed by edges connecting nodes are also detected as candidates. In order to discriminate the real node components from the rest we set device heuristics over a couple of features: – Solidity: computed as the ratio between the number of pixels in the CCs and the area of the convex hull of the CCs. Since nodes tend to be convex, objects below a solidity threshold are rejected. – Vertical Symmetry: computed as the ratio between the amount of pixels in the right and the left part of the CCs. Since nodes tend to be vertically symmetric, objects below a symmetry threshold are rejected. The remaining CCs after this filtering are considered as nodes and the bounding- box of each element is provided. Going back to the textual layer, once the symbol nodes have been detected, we can easily extract the no-box nodes as the text clusters that do not fall within any detected symbol node. We can see an exam- ple of the node segmentation output in Fig. 4 b). This segmentation strategy has been used in the runs R1 and R2. Vectorial-based Node Segmentation Distinctively from pixel-based segmentation, the Vectorial-based Node Seg- mentation takes as input the vectorial representation of the images obtained after the Raster-to-vector Conversion, see Fig. 3 b). It is based on the explo- ration of loops in the planar graphs obtained from the vectorial images. In these graphs, nodes are the vectorial lines and edges are the connection points between these lines. The seek for loops is driven by the implementation of the optimal algorithm for finding regions in a planar graph from Jiang et al. in [3]. As in the The CVC-UAB in the Flowchart Recognition Task 7 pixel-based segmentation, the heuristic processes based on Solidity and Symme- try presented above are followed to rule out inconsistent instances. Finally, the system outputs the bounding-boxes for the remaining nodes. The no-box nodes are determined by looking at textual clusters connected to shape nodes via an edge. This approach is implemented in runs R3 and R4. 3.4 Node Recognition The module devoted to recognize the nodes’ types takes as input any of the two proposed node segmentations and tries to classify the node to one of the possible classes based on its shape. We have proposed two different shape recognizers. Pattern recognition systems usually require shape descriptors invariant to rigid (scale, translation, rotation) and even affine transforms. However, in the context of flowchart recognition, the shape descriptors needed to recognize the nodes’ type just need to be invariant to translation and scale, while a lack of invariance to other transforms is actually beneficial as it results to increased discrimination. Translation invariance is not an issue since both node segmentations localised node areas. Both shape descriptors proposed are invariant to scale. In order to assign a node label a nearest neighbor classifier is used over a training set of labeled nodes. Let us detail the two different descriptors we used. The Blurred Shape Model descriptor (BSM) was originally created to perform handwritten musical score recognition but it has also been applied to other related document analysis with different degrees of success [1]. The BSM descriptor is a zoning-based descriptor. Shapes are divided into a 15 × 15 regular grid. Then, the area and the center of gravity are computed for each cell. The final BSM descriptor is constructed by weighting the areas computed by the inverse of distances between two gravity centers of adjacent cells. This weighted average is performed in order to achieve robustness to local variations of the shape under analysis. A normalization by the object area is used in order to achieve invariance to scale. This shape recognizer is used in the runs R2 and R4. Geometric Moments have been widely used as shape descriptors since lower order moments represent certain well known fundamental geometric properties of the underlying image functions. The central (p + q)-th order moment for a digital image I(x, y) is expressed by X µpq = (x − x̄)p (y − ȳ)q I(x, y) (1) x,y The use of the centroid (x̄, ȳ) allow the descriptor to be invariant to transla- tion. A normalization by the object area is used to achieve invariance to scale. µpq p+q ηpq = where γ = +1 (2) µγ00 2 8 M. Rusiñol et al. We have used the geometric moments up to third order as a feature vector. This shape recognizer is implemented in the runs R1 and R3. 3.5 Edge Segmentation and Node Linking Edge segmentation has been done in a similar fashion in both the pixel-based and the vectorial-based approaches. Taking as input both the graphical layer and the bounding-boxes from the node segmentation we obtain an edge layer image. We can see an example in Fig. 4. a) b) c) Fig. 4. Example of the node and edge segmentation modules. a) Original image, b) node layer, b) edge layer. In order to assess which nodes are connected by an edge we pair-wisely select connected components in the node layer and check whether any element of the edge layer provokes that those two disjoint components merge into a single one. An analysis of the edge stroke’s width is performed to differentiate between directed and undirected edges. Since our node linking modules strongly depend on the extraction of connected components, they fail at detecting the dotted edges. The pixel-based approach fails if the edge is broken by some text but the vectorial-based one performs a post-processing step that merges colinear broken edges making it robust to such situations. 3.6 Text Recognition The module dealing with text recognition receives as input the textual layer ob- tained by the text/graphics separation module and the bounding-boxes arisen from the node segmentation module. The contents of each bounding-box are pro- cessed by the commercial OCR from ABBYY1 , and no further post-processing steps are applied. Taking a look at the raw output from the OCR, some simple 1 ABBYY Finereader Engine 10 "http://www.abbyy.com/ocr_sdk/" The CVC-UAB in the Flowchart Recognition Task 9 steps for dealing with out-of-vocabulary elements would drastically improve the performance of this module. 3.7 Post-processing Testing our architecture against the training data we realized that when two edges collide, in the ground-truth we expect to see a new node of type point. Obviously our node segmentation and recognition do not tackle such nodes since they can not be found via a connected component analysis. We decided to include the detection of those nodes in a post-processing step. The final graph is syntactically analyzed and when we find three or more nodes that are connected through exactly the same edge element, we add a new intermediate node of type point. This procedure is iteratively done since there are no node junctions with cardinality higher than two. 4 Evaluation Let us first evaluate the proposed system’s performance qualitatively by looking at specific cases where the we face some problems. Let us then present the obtained results. 4.1 Qualitative Evaluation Although visually the system seems to perform quite well, we have identified several cases where the proposed modules fail, examples of which are presented in Fig. 5. a) b) c) d) Fig. 5. Problematic cases. a) Broken node, b) broken edge, c) Low-quality text, d) Text/graphics overlapping. 10 M. Rusiñol et al. First of all, the node and edge segmentations are based on connected com- ponents. When either a node (Fig. 5a)) or an edge (Fig. 5b)) is broken it is usually not well segmented by any of the proposed methods. In addition, low- quality documents are hard to “read” by the OCR engine (Fig. 5c)). Finally, the text/graphics separation module is also based on the analysis of connected components, so when text characters overlap with graphical elements, they are not properly segmented. This is the case shown in Fig. 5d), where the character F of FROM and the character D from FIELD touch the diamond shape and thus are classified as graphics and assigned to the graphical layer. 5 Conclusions In this paper, we presented the flowchart recognition system that the Document Analysis Group from the Computer Vision Center (CVC-UAB) has submitted in the context of the Intellectual Property (IP) track at CLEF 2012. The archi- tecture of our recognition system is structured in different modules dealing with different steps of the flowchart identification problem. First of all, a text/graphic separation technique has been applied to separate the textual elements from the graphical ones. An OCR engine has been applied on the text layer while on the graphical layer we analyze the nodes and edges. For nodes/edge segmentation we have applied two different strategies: a pixel-based and a vector-based approach. The node locations have been fed to the recognizer module which is in charge of categorizing the node’s type. Here we have used two different node descriptors, namely a descriptor based on the geometric moments and the Blurred Shape Model (BSM) descriptor. The module analyzing the edges determines which nodes are connected and categorizes the edge style. The combination of the two segmentation modules together with the two recognition modules results to the four system variants and an equal number of submitted runs. Acknowledgment This work has been partially supported by the Spanish Ministry of Education and Science under projects RYC-2009-05031, TIN2011-24631, TIN2009-14633- C03-03, Consolider Ingenio 2010: MIPRCV (CSD200700018) and the grant 2009- SGR-1434 of the Generalitat de Catalunya. References 1. S. Escalera, A. Fornés, O. Pujol, P. Radeva, G. Sánchez, and J. Lladós. Blurred shape model for binary and grey-level symbol recognition. Pattern Recognition Letters, 30(15):1424–1433, November 2009. 2. L.A. Fletcher and R. Kasturi. A robust algorithm for text string separation from mixed text/graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):910–918, November 1988. The CVC-UAB in the Flowchart Recognition Task 11 3. X.Y. Jiang and H. Bunke. An optimal algorithm for extracting the regions of a plane graph. Pattern Recognition Letters, 14(7):553 – 558, 1993. 4. P.L. Rosin and G.A. West. Segmentation of edges into lines and arcs. Image and Vision Computing, 7(2):109–114, May 1989. 5. K. Tombre, C. Ah-Soon, P. Dosch, G. Massini, and S. Tabbone. Stable and ro- bust vectorization: How to make the right choices. In Graphics Recognition Recent Advances, volume 1941 of Lecture Notes in Computer Science, pages 3–18. Springer- Verlag, 2000. 6. K. Tombre, S. Tabbone, L. Pélissier, B. Lamiroy, and P. Dosch. Text/graphics separation revisited. In Document Analysis Systems, volume 2423 of Lecture Notes in Computer Science, pages 615–620. Springer-Verlag, 2002.