<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrew Thean</string-name>
          <email>andy.thean@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Marc Deltorn</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrice Lopez</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Romary</string-name>
          <email>laurent.romary@inria.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Measurement, Performance, Experimentation</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented: text-graphic segmentation based on connected-component clustering; line segment bridging with an adaptive, oriented filter; box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry; text object recognition using a noisy channel model to enhance the results of a commercial OCR package.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Certain inventions are most effectively summarised in pictorial form and patent professionals sometimes rely heavily on patent drawings for performing prior art searches. Searching patent drawings is currently a labour-intensive, error-prone task which would be facilitated by automatic indexing methods. Generic methods for automatically indexing patent drawings have been reported
        <xref ref-type="bibr" rid="ref2 ref20 ref7">(Hanbury et al., 2011; Vrochidis et al., 2010; Codina et al., 2009)</xref>
        but the heterogeneity of patent drawings means that it is difficult for a one-size-fits-all approach to reach high levels of performance on all image classes. Flowcharts represent a large and useful subclass of patent images for which specially-adapted methods will improve indexing performance. They have a special role in the patents domain since they offer a way to graphically represent a process (a process is one of four legally-recognised categories of patent claim alongside products, apparatus and use, e.g. Rule 43(2) of
        <xref ref-type="bibr" rid="ref3">EPC (2000)</xref>
        ) and often contain information-rich text that is prohibited from other types of patent drawing (they are usually considered an exception to rules that generally prohibit text from patent drawings, e.g. Rule 46(2)(j) of
        <xref ref-type="bibr" rid="ref3">EPC (2000)</xref>
        and Rule 6.2(b) of
        <xref ref-type="bibr" rid="ref15">PCT (1970)</xref>
        ). Flowcharts were one of the image categories targeted in the Patent Image Classification task of CLEF-IP 2011 (see
        <xref ref-type="bibr" rid="ref16">Piroi et al. (2011)</xref>
        and references therein), but the summarisation of flowcharts in patent drawings has not been discussed previously in the literature.
      </p>
      <p>
        Few studies have addressed the automatic analysis of flowcharts in general.
        <xref ref-type="bibr" rid="ref8">Ito (1982)</xref>
        used corner detection and edge-based block shape classification to derive FORTRAN code directly from a flowchart bitmap (see also US5386508). In Abe et al. (1986), a method for discriminating between the various components of a flowchart diagram (symbols, lines and characters) was discussed. Kasturi et al. (1990) proposed an automatic line drawing analysis system (including flowcharts) based on the reduction of the image into segments and closed loops. Yu et al. (1997) described a general framework for the analysis of engineering drawings characterised by alternating instances of symbols and connection lines. The particular case of flowchart summarisation was also discussed in
        <xref ref-type="bibr" rid="ref5">Futrelle and Nikolakis (1995)</xref>
        in the context of automatic document analysis. More recently,
        <xref ref-type="bibr" rid="ref18">Szwoch (2007)</xref>
        applied template matching to the recognition and description of hand-drawn flowchart components for the purpose of automatic diagram aestheticization. Vasudevan et al. (2008) described a prototype system of flowchart knowledge extraction from images based on chain code boundary descriptors and a neural network classifier. Finally, in
        <xref ref-type="bibr" rid="ref12">Lemaitre et al. (2010)</xref>
        the characteristic causal relationship between flowchart components (described as line strokes) was exploited through a set of syntactic constraints that helped address the problems of flowchart diagram recognition as well as text-graphics segmentation.
      </p>
      <p>The paper is set out as follows: in Section 2 the problem addressed in the CLEF-IP 2012 Flowchart Recognition task is defined in detail; in Section 3 a method of solving this problem is described; in Section 4 the performance of the method is reported and discussed.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Statement</title>
      <sec id="sec-2-1">
        <title>Input image data</title>
        <p>
          A training set of 50 images was released in March 2012, followed by a set of 100 test images three months later; results were submitted on July 10th 2012. The input image data consists of binary images of flowcharts taken from patent drawings. Each image contains a single figure and has been cropped from a patent drawing to exclude irrelevant information such as patent metadata and page numbers. As such the task does not include figure segmentation: in general a single patent drawing may contain multiple figures and a single flowchart may be spread over multiple drawings/figures. In principle, the appearance of any patent drawing is constrained by the rules applied at large patent offices (although not all patent drawings are compliant). Examples of the most important of these rules as applied by the EPO are as follows (Rule 46(2) of
          <xref ref-type="bibr" rid="ref3">EPC (2000)</xref>
          , Section A-X, 7.5.1 of
          <xref ref-type="bibr" rid="ref6">GL (2012)</xref>
          ):
        </p>
        <list list-type="bullet">
          <list-item><p>only black lines drawn with drafting instruments are allowed;</p></list-item>
          <list-item><p>the scale of the drawing should allow for two-thirds reduction in size;</p></list-item>
          <list-item><p>the character height should be larger than 0.32 mm;</p></list-item>
          <list-item><p>reference signs (no-box nodes) should be clear and without brackets;</p></list-item>
          <list-item><p>no text should be included in drawings unless it is indispensable (this rule is intended to minimise the need for language translation in drawings; however, text in flowcharts is usually considered indispensable);</p></list-item>
          <list-item><p>leading lines (wiggly edges) should be as short as possible and extend to the feature indicated.</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-2">
        <title>Text output</title>
        <p>The aim of the Flowchart Recognition task was to produce textual summaries in a predetermined format containing information about the flowchart structure in the following three categories.</p>
        <p>Metadata describing the number of nodes and edges in the flowchart and, optionally, the 'title' of the flowchart (usually the figure number).</p>
        <p>Node data describing the type of each node and, optionally, the text appearing in the node. The following node types are possible: oval, rectangle, double-rectangle, parallelogram, diamond, circle, cylinder, no-box, unknown and point. All types of node correspond to boxes or image regions that may contain text, apart from point nodes, which are the junction points where two or more edges meet.</p>
        <p>
          Edge data describing node connectivity relationships including, optionally, any text attached to the edge. Edges are either undirected or directed (in the sense indicated by arrow symbols) and have the following possible types: plain, dotted, or wiggly. The type wiggly refers to elements referred to as leading lines in patent law: wiggly edges are not necessarily wiggly in shape, but serve to connect reference signs to nodes (e.g. see Section A-X, 7.5.1 of
          <xref ref-type="bibr" rid="ref6">GL (2012)</xref>
          ).
        </p>
        <p>Figure 1: Possible membership states of image pixels in a flowchart. Numerical labels refer to the section headings below which describe how the various membership values are assigned. Note that joint membership states are, in principle, possible but are not allowed in the approach chosen.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>The input images may be scanned (and binarised) versions of hard copies filed with the patent application. Note that online filing of patent applications in electronic form now dominates submissions and leads to less noisy images, but scanned images dominate archived patent image databases. To remove some of the defects introduced by the digitisation process the following two-step method is applied. Firstly, a morphological opening with a kernel of 3 × 3 pixels removes isolated black dots. Secondly, a sequence of local median filters is applied in order to remove isolated line breaks and small 1-to-2-pixel gaps and to perform de-jagging of the lines. The choice of a median filter rather than a morphological closing is motivated by the aim of bridging line breaks without artificially connecting text characters with the neighbouring flowchart outline.</p>
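      <p>A minimal sketch of this preprocessing step (not the authors' reference code), assuming Python with OpenCV and the page held as an 8-bit binary NumPy array; the 3 × 3 kernel follows the text, while the number of median passes is an illustrative assumption:</p>
      <preformat><![CDATA[
import cv2
import numpy as np

def preprocess(binary_img: np.ndarray) -> np.ndarray:
    """Remove isolated dots and bridge 1-to-2-pixel gaps in a binary flowchart image."""
    ink = 255 - binary_img                      # work with ink as foreground

    # Step 1: 3x3 morphological opening removes isolated black dots.
    kernel = np.ones((3, 3), np.uint8)
    ink = cv2.morphologyEx(ink, cv2.MORPH_OPEN, kernel)

    # Step 2: repeated local median filtering bridges small line breaks and de-jags
    # lines without merging text characters with the neighbouring flowchart outline.
    for _ in range(2):                          # number of passes is an assumption
        ink = cv2.medianBlur(ink, 3)

    return 255 - ink
]]></preformat>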
      <sec id="sec-3-1">
        <title>Text-graphic segmentation</title>
        <p>For 70% of the training set it is possible to separate text from graphics (nodes and edges) by assuming that the graphics are the single largest connected component. Based on this observation, text-graphic segmentation was performed by analysing the spatial extent of all connected components in the image. A mixture of three two-dimensional Gaussians was fitted to the height-vs-width distribution of the connected components using the Expectation Maximisation (EM) algorithm. The cluster with the largest height and width was deemed to contain all graphics components.</p>
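        <p>A sketch of one way to implement this step (library choices and the rule for picking the 'largest' cluster are assumptions), using OpenCV connected components and a scikit-learn Gaussian mixture fitted by EM:</p>
        <preformat><![CDATA[
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def split_text_graphics(binary_img: np.ndarray):
    """Label connected components as graphics or text using a three-component
    Gaussian mixture fitted to their (width, height) distribution."""
    ink = (binary_img < 128).astype(np.uint8)   # ink pixels as foreground
    n, labels, stats, _ = cv2.connectedComponentsWithStats(ink, connectivity=8)
    sizes = stats[1:, [cv2.CC_STAT_WIDTH, cv2.CC_STAT_HEIGHT]].astype(float)

    gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
    clusters = gmm.fit_predict(sizes)

    # The cluster with the largest mean width and height is deemed to hold the graphics.
    graphics_cluster = int(np.argmax(gmm.means_.sum(axis=1)))
    graphics_labels = np.where(clusters == graphics_cluster)[0] + 1   # labels start at 1
    graphics_mask = np.isin(labels, graphics_labels)
    text_mask = ink.astype(bool) & ~graphics_mask
    return graphics_mask, text_mask
]]></preformat>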
      </sec>
      <sec id="sec-3-2">
        <title>Finding implicit-graphics pixels</title>
        <p>Certain types of connectivity between black pixels in a flowchart are merely implied, i.e. the connectivity of dotted/dashed line elements, the connectivity of edges and nodes that do not physically touch, and the outlines of broken box nodes (for example diamonds that are broken to accommodate text). The chosen method defines implicit-graphics pixels as white pixels that should be treated as black pixels in order to physically establish the connectivity implied by the flowchart author.</p>
        <p>While some of the smaller line breaks are removed by the first preprocessing step (see Section 3.1), larger gaps can remain that cannot be resolved by simple morphological operations. Wedge-shaped masks, defined by an opening angle and a radius, are used to bridge such line breaks and connect dashed line segments. Firstly, a skeleton of the isolated graphic component is obtained and an estimate of the orientation in the neighbourhood of each endpoint is derived. A wedge-shaped mask is placed at each endpoint and aligned with the local skeleton orientation. All intersecting pairs of masks are identified and two endpoints are reconnected by a straight line if their respective wedge masks intersect. This method is similar in nature to the dashed line detection algorithm presented in Kasturi et al. (1990).</p>
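        <p>A simplified sketch of the wedge-based bridging idea, assuming scikit-image and SciPy; instead of rasterising and intersecting wedge masks, this approximation reconnects endpoint pairs whose oriented wedges mutually contain the other endpoint (radius and opening angle are illustrative values):</p>
        <preformat><![CDATA[
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import skeletonize

def bridge_candidates(graphics_mask: np.ndarray, radius=20, opening_deg=30):
    """Return pairs of skeleton endpoints that should be reconnected by a straight line."""
    skel = skeletonize(graphics_mask > 0)
    # Endpoints are skeleton pixels with exactly one skeleton neighbour.
    neighbours = convolve(skel.astype(int), np.ones((3, 3), int), mode='constant') - skel
    endpoints = np.argwhere(skel & (neighbours == 1))

    def direction(p):
        # Local orientation: from the centroid of nearby skeleton pixels towards the endpoint.
        y, x = p
        window = skel[max(0, y - 5):y + 6, max(0, x - 5):x + 6]
        pts = np.argwhere(window) + [max(0, y - 5), max(0, x - 5)]
        v = p - pts.mean(axis=0)
        return v / (np.linalg.norm(v) + 1e-9)

    half_angle = np.deg2rad(opening_deg / 2.0)
    links = []
    for i in range(len(endpoints)):
        for j in range(i + 1, len(endpoints)):
            d = endpoints[j] - endpoints[i]
            dist = np.linalg.norm(d)
            if dist == 0 or dist > radius:
                continue
            u = d / dist
            # Both wedges must point towards the other endpoint.
            if (np.arccos(np.clip(np.dot(direction(endpoints[i]), u), -1, 1)) < half_angle
                    and np.arccos(np.clip(np.dot(direction(endpoints[j]), -u), -1, 1)) < half_angle):
                links.append((tuple(endpoints[i]), tuple(endpoints[j])))
    return links
]]></preformat>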
      </sec>
      <sec id="sec-3-3">
        <title>Segmentation of closed regions</title>
        <p>A connector is defined as a line in a flowchart that does not define a box node. Usually a connector pixel belongs to an edge, but it may also belong to a point node. The approach chosen for node-connector segmentation relies on a first step of box node classification. Closed regions in the graphics may correspond to either box nodes or loops (a loop is defined as a background region enclosed by edge lines). Node-connector segmentation is performed as follows. Firstly, loops are identified using the box classification technique described below. Secondly, the interior region of any remaining box node is dilated and used as a mask to separate nodes from connectors.</p>
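        <p>The paper does not specify how the closed regions themselves are extracted; one plausible sketch, assuming OpenCV 4, recovers them from the contour hierarchy of the graphic component:</p>
        <preformat><![CDATA[
import cv2
import numpy as np

def closed_regions(graphics: np.ndarray):
    """Return filled masks of the background regions enclosed by the graphics
    (box-node interiors and loops); these can later be dilated and used to
    separate nodes from connectors."""
    ink = (graphics > 0).astype(np.uint8)
    # RETR_CCOMP yields a two-level hierarchy: outer contours and the holes inside them.
    contours, hierarchy = cv2.findContours(ink, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    masks = []
    for idx, contour in enumerate(contours):
        if hierarchy[0][idx][3] != -1:          # has a parent, so this contour is a hole
            mask = np.zeros_like(ink)
            cv2.drawContours(mask, [contour], -1, 1, thickness=-1)
            masks.append(mask)
    return masks
]]></preformat>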
        <p>[Figure: overview of the processing pipeline — preprocessing and denoising; segmentation of the graphic component; graphic-component postprocessing; segmentation of closed regions; main-orientation check; region feature extraction and node classification; identification of connectors; identification of point nodes; association of nodes and edges; identification of no-box nodes; extraction of edge properties; extraction of text regions and OCR; generation of the output flowchart text description.]</p>
        <p>Estimating the global orientation of the flowchart was necessary to allow accurate box-node classification and OCR processing. The aspect ratio of the box node candidates identified after closed-region segmentation was measured and the entire image was rotated clockwise by 90 degrees if the fraction of box node candidates for which the height exceeded the width was greater than a fixed threshold.</p>
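        <p>Once the box-node candidates are available this check reduces to a few lines; a sketch in which the 0.5 threshold is an assumption:</p>
        <preformat><![CDATA[
import numpy as np

def fix_orientation(image: np.ndarray, box_regions, tall_fraction=0.5) -> np.ndarray:
    """Rotate the page 90 degrees clockwise when most box-node candidates are
    taller than they are wide."""
    if not box_regions:
        return image
    tall = sum(1 for region in box_regions if region.shape[0] > region.shape[1])
    if tall / len(box_regions) > tall_fraction:
        image = np.rot90(image, k=-1)           # k=-1 rotates clockwise
    return image
]]></preformat>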
        <p>Classifying box node candidates involved reducing each closed shape to a shape primitive. This step exploits the observation that during the flowchart authoring process individual box nodes may be squeezed or stretched in the horizontal and vertical directions according to the amount of text they contain. Since node classification should be invariant to such transformations, each candidate was subjected to a squeezing operation in the horizontal and vertical directions by application of the operators SH and SV, where SH = ∫ B(x, y) dy and SV = ∫ B(x, y) dx, discarding the segments for which SH or SV fall below a predefined threshold.</p>
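        <p>A sketch of one plausible reading of the squeeze operators, assuming B is a filled binary node-box array: the ink projection is computed along one axis and the rows or columns whose projection falls below the threshold are discarded (the axis conventions and the reading of 'segments' are assumptions):</p>
        <preformat><![CDATA[
import numpy as np

def squeeze_horizontal(box: np.ndarray, threshold: float) -> np.ndarray:
    """S_H: keep only the columns whose vertical ink projection reaches the threshold."""
    col_projection = box.sum(axis=0)            # integral of B(x, y) over y, per column x
    return box[:, col_projection >= threshold]

def squeeze_vertical(box: np.ndarray, threshold: float) -> np.ndarray:
    """S_V: keep only the rows whose horizontal ink projection reaches the threshold."""
    row_projection = box.sum(axis=1)            # integral of B(x, y) over x, per row y
    return box[row_projection >= threshold, :]

# Applying both operators reduces a stretched or squeezed node outline to a
# stretch-invariant primitive on which the reduced-shape features are computed.
]]></preformat>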
        <p>The result of the squeezing operations is a shape primitive which characterises the basic shape family of each node type (see Figure 3). This shape simplification step is particularly suitable for the present definition of shape categories. In the present task, the class oval does not only encompass elliptical outlines (ovals in the simplest sense) but also half-circles joined by two horizontal lines and even rounded rectangles (see Figure 3). Similarly, the class diamond includes, in addition to the standard four-sided parallelograms, hexagons and octagons (each having two parallel sides of the same size and aligned with the horizontal or the vertical). Reducing each of these shapes to a single primitive (an elliptical shape in the case of ovals, a quadrilateral shape for diamonds) enables each family to be characterised.</p>
        <p>Multiple shape features were extracted from the box nodes: a first set of features, f0, was derived from the original box node, while a second set, fR, was derived from the horizontally and/or vertically reduced shape. Features in the f0 set include area (A0), ellipticity (E0), solidity (S0: the ratio between the area within the bounding rectangle and the area within the box node perimeter) and the horizontal and vertical symmetries (SyH0 and SyV0). Reduced shape features include area (AR), ellipticity (ER), solidity (SR) as well as symmetry measures aimed at characterising specific node shapes: diamond, parallelogram and cylinder bodies (see Figure 4).</p>
        <p>DR measures the rectangularity (here estimated by the solidity, as defined above) of a reduced shape after decomposing it into four quadrants and reassembling them as in Figure 4(a). Diamonds are thus expected to be formed by complementary shapes and can be characterised by a value of DR close to 1 (indicative of a high degree of rectangularity of the recomposed shape). In the same way PR offers a measure of rectangularity of a recomposed shape obtained by cutting in two halves a horizontally-squeezed node box and repositioning its two portions (Figure 4(b)). Here again a high degree of rectangularity is expected from a horizontally-sheared rectangle (i.e. a parallelogram under the present task definitions). CB1R and CB2R measure the symmetries and complementarities expected from typical cylinder body shapes: CB1R measures the ellipticity of a composite shape obtained by adjoining the body half and the horizontally reflected top half of a vertically squeezed node box (Figure 4(c)), whereas CB2R measures the rectangularity of the two recombined halves. For an input node box region B(x, y) the set of features given in Tables 1 and 2 is generated.</p>
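        <p>A rough sketch of how the f0 features of Table 1 could be computed on a filled binary node-box region; the exact definitions of solidity, ellipticity and the symmetry measures below are interpretations of the text, not the authors' reference code:</p>
        <preformat><![CDATA[
import numpy as np

def f0_features(region: np.ndarray) -> dict:
    """One plausible reading of the Table 1 features for a filled binary node box."""
    ys, xs = np.nonzero(region)
    box = region[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(bool)
    h, w = box.shape
    area = box.sum()

    solidity = area / float(h * w)              # how well the region fills its bounding rectangle
    yy, xx = np.mgrid[0:h, 0:w]
    ellipse = (((xx - (w - 1) / 2.0) / (w / 2.0)) ** 2
               + ((yy - (h - 1) / 2.0) / (h / 2.0)) ** 2) <= 1.0
    ellipticity = (box & ellipse).sum() / float(ellipse.sum())   # fill of the inscribed ellipse
    sym_h = (box & np.fliplr(box)).sum() / float(area)           # left-right mirror overlap
    sym_v = (box & np.flipud(box)).sum() / float(area)           # top-bottom mirror overlap

    return {"A0": int(area), "E0": ellipticity, "S0": solidity, "SyH0": sym_h, "SyV0": sym_v}
]]></preformat>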
        <p>Based on the features defined in Tables 1 and 2, the box nodes are first classified into the following set of shape families, as defined in the CLEF-IP task: circle, oval, parallelogram, rectangle, diamond. On top of these required categories, we introduced the following additional classes: loop (to identify closed areas defined by edges and box outlines) and small (to flag regions having an area below a fixed threshold). The two other node classes that are not defined by a closed outline shape (i.e. point node and no-box node) are discussed in Sections 3.6 and 3.8.</p>
        <p>A simple hierarchical classification scheme is used to discard the small regions and separate asymmetrical shapes (loops) from the other classes. The rectangles are then identified. The remaining shapes (oval, circle, parallelogram, diamond, cylinder-body) are then classified using heuristic thresholds on feature values. Statistical pattern recognition techniques have not been applied due to time constraints and the lack of image-level annotations in the training set. Obviously, the application of such techniques is expected to significantly improve classification performance.</p>
        <p>Composite shapes (double rectangles and cylinders) are identified in a second pass as a combination of primitive shapes (double rectangles being formed by a collection of three adjoining rectangles, and cylinders by an oval and a cylinder body).</p>
      </sec>
      <sec id="sec-3-4">
        <title>Point node detection</title>
        <p>Point nodes, defined as the junctions where edges meet, were identified as follows. Firstly, horizontal and vertical lines in the connector-only image were detected using a simplified version of the Hough transform by projecting pixel values onto the vertical and horizontal axes and identifying maxima. Secondly, point node candidates were identified as the image coordinates where horizontal and vertical lines crossed. Thirdly, a box centred on the point node candidates was used for validation: candidates were accepted if the box contained a branch-point and 3 or more lines crossed the box perimeter.</p>
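        <p>A sketch of the projection-and-validation idea on a 0/1 connector-only image; the line-strength threshold, the box size and the simplified branch-point test are assumptions:</p>
        <preformat><![CDATA[
import numpy as np

def point_nodes(connectors: np.ndarray, line_fraction=0.5, half_box=7):
    """Candidate junctions where strong horizontal and vertical lines cross,
    validated by counting the line segments entering a small box around the crossing."""
    h, w = connectors.shape
    rows = np.where(connectors.sum(axis=1) > line_fraction * w)[0]   # strong horizontal lines
    cols = np.where(connectors.sum(axis=0) > line_fraction * h)[0]   # strong vertical lines

    nodes = []
    for y in rows:
        for x in cols:
            y0, y1 = max(0, y - half_box), min(h, y + half_box + 1)
            x0, x1 = max(0, x - half_box), min(w, x + half_box + 1)
            patch = connectors[y0:y1, x0:x1]
            border = np.concatenate([patch[0], patch[-1], patch[:, 0], patch[:, -1]])
            crossings = np.count_nonzero(np.diff(border.astype(int)) == 1)
            # Accept the candidate if it lies on ink and three or more lines cross the box.
            if patch[y - y0, x - x0] and crossings >= 3:
                nodes.append((int(y), int(x)))
    return nodes
]]></preformat>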
      </sec>
      <sec id="sec-3-5">
        <title>Edge classification</title>
        <p>The main challenge in classifying edge types surrounds the classification of dashed and wiggly edges. Dashed edges are not common and were identified as those containing multiple implicit-graphics segments (see Section 3.3). Counterintuitively, wiggly edges ('leading lines') are often straight and indistinguishable from plain undirected edges in the absence of context. These edges are classified solely according to their relationship to no-box nodes. Since wigglies correspond to leading lines and reference signs are of type no-box, any edge connecting a box node to a no-box was assumed to be a wiggly. Preliminary trials suggested that truly curved wiggly edges could be detected using features describing the orientation of their endpoints, which are neither horizontal nor vertical; such features could be useful for improving no-box classification in future.</p>
        <p>Edge directivity was classified using a simple set of features. Firstly, the two endpoints of each edge were identified and a fixed region around each endpoint was analysed. Next, the number of erosions required to remove all black pixels from the endpoint region was measured and a binary classification tree with heuristic thresholds was applied. Again, statistical pattern recognition techniques were not applied due to time constraints and the lack of image-level annotations.</p>
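        <p>A sketch of the erosion-count directivity feature, assuming an 8-bit binary edge image and OpenCV; the window size and the 3 × 3 structuring element are assumptions:</p>
        <preformat><![CDATA[
import cv2
import numpy as np

def erosion_count(edge_img: np.ndarray, endpoint, half_win=15) -> int:
    """Arrowheads are locally thick, so the number of 3x3 erosions needed to empty
    the endpoint neighbourhood is larger for directed (arrowed) edge ends."""
    y, x = endpoint
    patch = edge_img[max(0, y - half_win):y + half_win + 1,
                     max(0, x - half_win):x + half_win + 1].copy()
    kernel = np.ones((3, 3), np.uint8)
    erosions = 0
    while patch.any() and erosions < half_win:  # cap guards against degenerate patches
        patch = cv2.erode(patch, kernel)
        erosions += 1
    return erosions  # compared against heuristic thresholds in a small decision tree
]]></preformat>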
        <sec id="sec-3-5-1">
          <title>Feature Operator</title>
          <p>A0
E0
S0
SyH0
SyV0</p>
          <p>Area
Ellipticity
Solidity
Horizontal symmetry
Vertical symmetry</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>Feature</title>
        </sec>
        <sec id="sec-3-5-3">
          <title>Operator</title>
          <p>AR
ER
SR
DR
PR
CB1R
CB2R</p>
          <p>Area o SH o SV
Ellipticity o SH o SV
Solidity o SH o SV
Solidity o TDR o SH o SV
Solidity o TP R o SH
Ellipticity o TCB1R o SV
Solidity o TCB2R o SV
The text-graphic segmentation method (Section 3.2) results in a set of connected components
representing characters that must be further grouped into text objects and classi ed into
boxnode labels, edge labels, no-box nodes (reference signs) and metadata. Segmenting text objects
involved the hierarchical, agglomerative clustering of characters according to the proximity of their
centroids. Since characters are generally grouped horizontally, before clustering, vertical distances
were weighted with a positive scaling constant which favoured horizontal associations.</p>
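        <p>A sketch of the weighted agglomerative grouping of character centroids, assuming SciPy; the vertical weighting factor and the distance cut-off are illustrative values:</p>
        <preformat><![CDATA[
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_characters(centroids: np.ndarray, v_weight=3.0, cut=20.0) -> np.ndarray:
    """Group character centroids (rows of (y, x)) into text objects.  Vertical
    distances are inflated by v_weight so that horizontally adjacent characters
    merge first."""
    scaled = centroids * np.array([v_weight, 1.0])
    Z = linkage(scaled, method='single')             # hierarchical, agglomerative clustering
    return fcluster(Z, t=cut, criterion='distance')  # one text-object label per character
]]></preformat>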
        <p>Any text within the closed region of a box node was classified as a node label in a straightforward manner. To classify edge labels and no-box nodes each text object was dilated and checked for overlap with known edges. If the overlapping edge connects to two nodes (box nodes or point nodes) the text object is classified as an edge label, whereas if the overlapping edge connects to only one node the text object is classified as a no-box node and the edge is classified as a wiggly.</p>
        <p>The fonts used for metadata (figure titles) tend to be larger than the other types of text objects and titles usually appear at the image edges. After classification of text objects into box-node labels, edge labels and no-box nodes any remaining text is re-clustered according to centroid positions. The new text objects were scored proportionally to their total area and their distance from the image centre: the object with the highest score was classified as metadata.</p>
        <p>The recognition of the text in patent flowcharts is made difficult by several factors:</p>
        <list list-type="bullet">
          <list-item><p>The quality of image digitisation can be very low. This is mainly due to the low quality of the original scanning and the original paper support;</p></list-item>
          <list-item><p>The quality of patent flowchart draftsmanship is variable. Note that the rules and guidelines discussed in Section 2.1 do not exclude hand-drawn flowcharts;</p></list-item>
          <list-item><p>The segmentation of textual content is difficult (see Section 3.8);</p></list-item>
          <list-item><p>The text used in flowcharts is a particular form of technical language, different from the general domain language addressed by state-of-the-art OCR engines.</p></list-item>
        </list>
        <p>These different issues have been addressed using flowchart-specific text segmentation followed by state-of-the-art OCR software (ABBYY FineReader, version 10 and Tesseract 3.0) and customised post-correction techniques based on Shannon's noisy-channel model approach and language models.</p>
        <p>
          It has been shown that post-correction of recognition errors based on specialised models can provide a significant reduction of the word error rate produced by general OCR engines. The proposed technique is based on Shannon's noisy-channel model approach, more precisely a probabilistic modelling of the degradation of the original string by the OCR process. This approach is used by spell-checker systems and auto-correction/auto-completion systems for information retrieval. The model can then be applied to a noisy string for correcting recognition errors, as shown in
          <xref ref-type="bibr" rid="ref10">Kolak and Resnik (2002)</xref>
          .
        </p>
        <p>Character Level Model. Following a probabilistic framework, finding the correct string c given an original string o corresponds to finding argmaxc P(c | o) and, following Bayes' rule, argmaxc P(o | c) P(c).</p>
        <p>Following Shannon's noisy-channel model approach, P(c) is the source model estimating which string is likely to be sent, and P(o | c) is the channel model estimating how the signal is degraded. In our case, P(o | c) estimates the probability that the string o is recognised by the OCR when the original string present in the bitmap image is c. For the source and channel model, we first introduce an N-gram model of characters denoted CM, with N = 6:</p>
        <p>P(c) ≈ P(c | CM) = ∏m P(cm | cm−1, …, cm−5)</p>
        <p>For the channel/error model, we use a probabilistic framework for edit operations following a confusion model:</p>
        <p>P(o1..n | c1..m) = ∏ Pe(c1..m → o1..n)</p>
        <p>where Pe corresponds to the probability that the OCR recognised the string o1..n when interpreting the string c1..m, based on the four following edit operations:</p>
        <p>Pmatch(ci → oj | ci ∈ c1..m ∨ oj ∈ o1..n) = Pmatch(ci)</p>
        <p>Psubstitution(ci → oj | ci ∈ c1..m, oj ∈ o1..n) = Psubstitution(ci → oj)</p>
        <p>Pinsertion(ci → oj | ci ∈ c1..m, oj ∈ o1..n) = Pinsertion(ε → oj)</p>
        <p>Pdeletion(ci → oj | ci ∈ c1..m, oj ∈ o1..n) = Pdeletion(ci → ε)</p>
        <p>Having a probabilistic interpretation of the edit distance allows the probabilistic costs of the edit operations to be learned from a corpus of errors instead of setting costs manually or experimentally, as is often the case with existing spell checkers based on noisy channel models. The probabilities Psubstitution(ci → oj), Pinsertion(ε → oj) and Pdeletion(ci → ε) are learned from a corpus of OCR errors aligned with the expected correction following the Levenshtein distance:</p>
        <p>Pmatch(ci) = 1 − (count(ci → {cj}j≠i) + count(ci → ε)) / count(ci)</p>
        <p>Psubstitution(ci → oj) = count(ci → oj) / count(ci)</p>
        <p>Pinsertion(ε → oj) = count(ε → oj) / count(c)</p>
        <p>Pdeletion(ci → ε) = count(ci → ε) / count(ci)</p>
        <p>where c corresponds to any character of the alphabet of the source.</p>
        <p>A smoothing operation is then applied for estimating the edit operations unseen in the OCR-error training data. This was done by giving a very small additive probability to all character substitutions, insertions and deletions (Lidstone's additive smoothing).</p>
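        <p>A sketch of how these edit-operation probabilities could be estimated from a corpus of aligned (correct, OCR) string pairs; difflib is used here as a stand-in for the Levenshtein alignment described above, and the smoothing constant is illustrative:</p>
        <preformat><![CDATA[
from collections import Counter
from difflib import SequenceMatcher

def learn_channel_model(pairs, alphabet, alpha=1e-4):
    """Estimate substitution/insertion/deletion probabilities from (correct, ocr)
    string pairs, with Lidstone additive smoothing for unseen edit operations."""
    sub, ins, dele, seen = Counter(), Counter(), Counter(), Counter()
    for correct, ocr in pairs:
        seen.update(correct)
        for op, i1, i2, j1, j2 in SequenceMatcher(None, correct, ocr).get_opcodes():
            if op == 'replace':
                for ci, oj in zip(correct[i1:i2], ocr[j1:j2]):
                    sub[(ci, oj)] += 1                   # ci recognised as oj
                ins.update(ocr[j1 + (i2 - i1):j2])       # extra OCR chars: epsilon -> oj
                dele.update(correct[i1 + (j2 - j1):i2])  # dropped chars: ci -> epsilon
            elif op == 'insert':
                ins.update(ocr[j1:j2])
            elif op == 'delete':
                dele.update(correct[i1:i2])
    vocab = len(alphabet)
    total = sum(seen.values())
    p_sub = lambda ci, oj: (sub[(ci, oj)] + alpha) / (seen[ci] + alpha * vocab)
    p_ins = lambda oj: (ins[oj] + alpha) / (total + alpha * vocab)
    p_del = lambda ci: (dele[ci] + alpha) / (seen[ci] + alpha * vocab)
    return p_sub, p_ins, p_del
]]></preformat>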
        <p>Note that transposition is not used in our channel model because this sort of character error is relevant for spell-checker applications, but extremely rare in the case of OCR. The original Levenshtein distance was therefore preferred over the Damerau-Levenshtein distance generally used for spell-checking.</p>
        <p>Language Model. N-gram language models are used to capture constraints on word sequences. For estimating P(c), a word trigram language model LM was combined with the character model CM:</p>
        <p>P(c) ≈ P(c | CM) P(c | LM) = P(⟨cp⟩ | CM) ∏p P(cp | cp−1, cp−2)</p>
        <p>where ⟨cp⟩ denotes the complete concatenated character sequence with boundary characters corresponding to the p words cp.</p>
        <p>At this stage the main issue is the word segmentation of the corrected word sequence. The character model handles word splits and merges naturally by inserting, substituting or deleting character boundaries. Consequently, the application of the character model may result in different hypotheses having different numbers of words. The application of a language model on these different hypotheses will be biased by the fact that the probability of a longer word sequence is always penalised. To solve this problem, empty final words are introduced so that all hypotheses have the same word length (set to the maximum among the hypotheses). The transition cost to empty final words is computed dynamically as the average of the existing transitions.</p>
        <p>Given a noisy word sequence recognised by the OCR, the final aim is to find, for a sequence of corrections of words:</p>
        <p>argmaxc P(o | c) P(c) = argmaxc ∏0≤p≤pmax [ ∏m Pe(c1..m,p → on..q) ∏m P(cm,p | cm−1,p, …, cm−5,p) P(cp | cp−1, cp−2) ]</p>
        <p>where pmax is the maximum number of words in the segmentation introduced by the character model and n to q correspond to the character range of the corrupted character sequence corresponding to the word p in the correction.</p>
        <p>The decoding of the intended sequence of words given the observed sequence of words recognised by the OCR is implemented using the Viterbi approximation.</p>
        <p>Training. For the present experiments a set of approximately 2000 corrections of errors produced by ABBYY FineReader was used. A correction is simply the alignment of the erroneous string and the corrected one, which allows the probabilistic costs of the edit operations introduced in the previous section to be learned. Since the corpus of errors was only produced for ABBYY FineReader, the resulting model has limited applicability to Tesseract.</p>
        <p>For training the character and language models, a corpus of 10,000 patent descriptions and 2,000 scientific articles was used. In both cases, the resources were a mixture of English, German and French texts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and discussion</title>
      <sec id="sec-4-1">
        <title>Graph structure recognition</title>
        <p>At the time of writing no official performance results are available for the algorithm test run submitted. Instead, the performance of the algorithm has been measured by manually comparing algorithm output with known ground-truth values. This method of evaluation, although time-consuming, allows a detailed analysis of the modes of failure of the algorithm. For each class of objects in the flowchart, the precision, Pr, and the recall, Re, of the classification over the whole test set were determined as</p>
        <p>Pr = true positives / (true positives + false positives) and Re = true positives / (true positives + false negatives).</p>
        <sec id="sec-4-1-1">
          <title>In addition, the balanced F-score, F , is calculated as F = 2:</title>
        <p>[Figure 5: per-flowchart combined F-scores over the test set (vertical axis from 0 to 1); the numbers in parentheses identify the individual flowcharts referred to below.]</p>
          <p>Table 3 summarises the results of the evaluation for the various node and edge types. Note
the following.</p>
          <p>The relatively high values of Pr and Re for box nodes are achieved despite the use of a simplistic classification approach. This indicates that the box features defined in Section 3.5 are discriminative.</p>
          <p>The relatively low values of Pr for directed edges and Re for undirected edges are probably caused by an overly simplistic approach to extracting directivity features and classifying them. A large fraction of directed edges were misclassified as undirected edges, as shown by the relatively high values of Pr and Re when directivity is ignored.</p>
          <p>Since the identification of wiggly edges and no-box nodes occurs in conjunction, the performance scores for these two entities are almost identical: Prwiggly = 0.96 and Rewiggly = 0.76. A recall score of 83% was measured for identifying the position of the title (prior to OCR). The relatively low value of Prpoint is partly explained by touching text characters, which cause false positives.</p>
          <p>In order to examine the patterns of algorithm failure, a single combined F-score, FN, is calculated for all nodes of each flowchart (excluding no-box nodes since they are labels that do not define the flowchart logic). Similarly, a single combined F-score, FE, is calculated for all edges of each flowchart (ignoring directivity). F denotes the average of the node and edge F-scores for a given flowchart.</p>
          <p>The results of the combined F-score evaluation over the whole test set are summarised in Figure 5. Out of a total sample of 100 analysed flowcharts, 34 show an F-score of FN = FE = 1.00. 50% of the diagrams have an average F-score above 0.90, while 75% show an average score of 0.78 or better. Only 10% of the test sample have an average F-score below 0.50. For the purpose of identifying the most significant sources of errors, the flowcharts associated with F &lt; 0.8 were identified. Each of these flowcharts was assigned to one of three categories according to the main mode of failure: (i) initial text-graphics segmentation error, (ii) presence of text characters touching the flowchart graphics and (iii) discontinuities in the flowchart graphic structure (mainly due to broken lines).</p>
          <p>Looking in turn at each of these classes, it was observed that poor segmentation of the main flowchart component at the early stage of processing was the main cause of very low F-scores. Six flowcharts have F &lt; 0.5: EP00000821886 (13), EP00000926609 (23), EP000010469970029 (40), EP000010702880135 (45), EP0000110761 (50) and EP00001113655 (51). The mode of failure for this group of patents is associated with the text-graphic segmentation module (Section 3.2), in particular the assumption that the graphic component of the flowchart can be isolated from text by selecting the connected components with the largest heights and widths. A failure at this stage prohibits further node classification and edge identification.</p>
          <p>
            Characters touching graphics (mostly text intersecting node boundaries) are also an important source of errors. Six flowcharts fall into this category: EP000006215470107 (2), EP000006479080138 (3), EP000006879010063 (4), EP00001063844 (44), EP00001217549 (57) and EP00001526500 (91). Although this type of error is mostly due to a combination of design choices and scanning noise, it can also stem from the preprocessing steps designed to remove small gaps in the image components (Sections 3.1 and 3.3). This reflects the difficulty in bridging line breaks without creating unwanted pixel junctions between otherwise disconnected entities. It is noted that the problem of separating touching text from graphics remains a challenging issue for general OCR processing engines
            <xref ref-type="bibr" rid="ref14">(Nagy, 2000)</xref>
            . It has been studied in the document image analysis literature in relation to the analysis of maps
            <xref ref-type="bibr" rid="ref13 ref4">(Fletcher and Kasturi, 1988; Luo et al., 1995)</xref>
            , form processing
            <xref ref-type="bibr" rid="ref21">(Yoo et al., 1997)</xref>
            and technical drawing understanding
            <xref ref-type="bibr" rid="ref9">(Kasturi et al., 1990)</xref>
            . An improved text-graphics segmentation method would benefit from the inclusion of an additional stage of detection and separation of touching characters.
          </p>
          <p>The failure to correctly bridge broken lines (Section 3.3) is the leading source of error for values of F &gt; 0.5. A total of 10 flowcharts in the range 0.3 &lt; F &lt; 0.75 show significant errors that can be attributed to discontinuous graphical components. Line breaks can affect the connecting lines between nodes, creating false positives in the edge components, as is the case for: EP00000866609 (17), EP000000884919 (18), EP00000098456 (29) and EP000001513121 (89). Line breaks can also prevent a proper identification of the nodes, as for: EP000001104172 (49), EP000001197682 (55), EP000001201195 (56), EP000001223519 (59), EP0000012274360 (60) and EP000001521176 (90). Improving gap bridging schemes would improve performance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Recognition of text objects</title>
        <p>In order to evaluate the performance of the text object recognition module (Section 3.9), text
labels were transcribed manually for approximately 20% of the test set (412 out of a total of 2023
text boxes). The sentence and word scores given below are based on this sample.</p>
        <p>The following scores are presented in Table 4.</p>
        <list list-type="bullet">
          <list-item><p>A coverage score, which is the percentage of cases where the OCR returns a result.</p></list-item>
          <list-item><p>An accuracy score, which is the percentage of correct results produced by the OCR given the total expected results. When the OCR does not produce a result, the result is considered incorrect. An accuracy evaluation is given at both word level and sentence level, i.e. whether the complete text associated with a graphical object is correctly recognised.</p></list-item>
          <list-item><p>A precision score, which is the percentage of correct results produced by the OCR given the total results produced by the OCR, similarly at word and sentence level.</p></list-item>
        </list>
        <sec id="sec-4-2-1">
          <title>Technique</title>
          <p>Coverage (sentence)
Coverage (word)
Accuracy (sentence)
Accuracy (word)
Precision (sentence)
Precision (word)</p>
          <p>ABBYY</p>
        <p>As a comparison, the highest accuracy of current state-of-the-art OCR engines is considered to be around 98-99% correctly recognised words in optimal conditions. It is clear from the coverage evaluation that Tesseract filters out a large number of results that it judges incorrect. Although the Tesseract confidence threshold affects its coverage, no exploration of this parameter was performed for lack of time. It can be observed that, although producing fewer results, the precision of Tesseract out-of-the-box remains significantly below that of ABBYY FineReader.</p>
        <p>
          As the post-correction technique has been trained only on ABBYY FineReader errors, it is less effective at improving Tesseract results. The accuracy of ABBYY FineReader is improved by 9.4%, meaning a reduction of the word error rate of 26.2%. This reduction of the error rate is lower than the 60% reduction reported by
          <xref ref-type="bibr" rid="ref11">Kolak et al. (2003)</xref>
          using a similar approach (but correcting an OCR engine with a significantly lower baseline performance). These results should be considered carefully because, due to time constraints, only a partial specialisation of the models was applied to the flowchart text problem:
        </p>
        <p>The texts considered for training the probabilistic costs of the edit operations and the character and language models were patent descriptions and scientific articles, without text from flowcharts. Ideally, the text present in the flowcharts of the training data (50 flowcharts) should have been combined with the other training resources.</p>
          <p>The errors used for training the costs of the edit operations were generated for ABBYY
FineReader only. A fairer comparison would use Tesseract-generated errors for correcting
Tesseract output.</p>
        <p>The ideal source for training the language model is the description of the patent in which the flowchart figure appears. As the patent publication number was given with the flowchart images, the usage of the description text would have been possible, for instance by retrieving the ST.36 document via the 'Open Patent Service' of the EPO.</p>
        <p>These issues suggest clear directions for future improvement. It should be possible to improve the post-correction model significantly by improving the way specialised training data is selected and exploited.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Abe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Azumatani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mukouda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Suzuky</surname>
          </string-name>
          .
          <article-title>Discrimination of symbols, lines and characters in owchart recognition</article-title>
          .
          <source>In 8th International Conference on Pattern Recognition</source>
          , pages pages
          <fpage>1071</fpage>
          {
          <fpage>1074</fpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Codina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          .
          <article-title>Integration of semantic, metadata and image search engines with a text search engine for patent retrieval</article-title>
          .
          <year>2009</year>
          . URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.98.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>EPC. European Patent Convention. 14th edition</source>
          ,
          <year>2000</year>
          . URL http://www.epo.org/law-practice/legal-texts/epc.htm14l.
          <article-title>Note: the wording of Rule 46 EPC is almost identical to the wording used in PCT Rules 6.2 (b</article-title>
          ),
          <source>11.11 and 11.13 and US Manual of Patent Examining Procedure, Section</source>
          <volume>1</volume>
          .
          <fpage>84</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Fletcher</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kasturi</surname>
          </string-name>
          .
          <article-title>A robust algorithm for text string separation from mixed text/graphics images</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on,
          <volume>10</volume>
          (
          <issue>6</issue>
          ):
          <volume>910</volume>
          {918, nov
          <year>1988</year>
          . ISSN 0162-
          <fpage>8828</fpage>
          . doi:
          <volume>10</volume>
          .1109/34.9112.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>R.P.</given-names>
            <surname>Futrelle</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikolakis</surname>
          </string-name>
          .
          <article-title>E cient analysis of complex diagrams using constraint-based parsing</article-title>
          .
          <source>In Document Analysis and Recognition</source>
          ,
          <year>1995</year>
          ., Proceedings of the Third International Conference on, volume
          <volume>2</volume>
          , pages
          <fpage>782</fpage>
          <lpage>{</lpage>
          790 vol.
          <volume>2</volume>
          , aug
          <year>1995</year>
          . doi:
          <volume>10</volume>
          .1109/ICDAR.
          <year>1995</year>
          .
          <volume>602019</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          GL.
          <article-title>Guidelines for Examination in the European Patent O ce</article-title>
          ,
          <year>2012</year>
          . http://www.epo.org/law-practice/legal-texts/guidelines.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bhatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <surname>R.</surname>
          </string-name>
          <article-title>Morzinger. Patent image retrieval: a survey</article-title>
          .
          <source>In Proceedings of the 4th workshop on Patent information retrieval</source>
          ,
          <source>PaIR '11</source>
          , pages
          <issue>3</issue>
          {
          <fpage>8</fpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
          <source>ISBN 978-1-4503-0955-4</source>
          . doi:
          <volume>10</volume>
          .1145/2064975.2064979. URL http://doi.acm.
          <source>org/10</source>
          .1145/2064975.2064979.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Ito</surname>
          </string-name>
          .
          <article-title>Automatic input of ow chart in document images</article-title>
          .
          <source>In 6th international conference on Software engineering</source>
          , pages
          <volume>319</volume>
          {
          <fpage>328</fpage>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Kasturi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Bow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>El-Masri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.R.</given-names>
            <surname>Gattiker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.B.</given-names>
            <surname>Mokate</surname>
          </string-name>
          .
          <article-title>A system for interpretation of line drawings</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on,
          <volume>12</volume>
          (
          <issue>10</issue>
          ):
          <volume>978</volume>
          {992, oct
          <year>1990</year>
          .
          <article-title>ISSN 0162-8828</article-title>
          . doi:
          <volume>10</volume>
          .1109/34.58870.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolak</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          .
          <article-title>OCR error correction using a noisy channel model</article-title>
          .
          <source>In Proceedings of the Human Language Technology Conference (HLT)</source>
          , San Diego, California, USA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Byrne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          .
          <article-title>A generative probabilistic OCR model for NLP applications</article-title>
          .
          <source>In Proceedings of the Human Language Technology Conference (HLT-NAACL)</source>
          , Edmonton, Canada,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Lemaitre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mouchere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camillerapp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Coasnon</surname>
          </string-name>
          .
          <article-title>Interest of syntactic knowledge for on-line owchart recognition</article-title>
          .
          <source>In 9th IAPR International Workshop on Graphics RECognition (GREC</source>
          <year>2011</year>
          ), pages
          <fpage>85</fpage>
          {
          <fpage>88</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          , G. Agam,
          <string-name>
            <surname>and I. Dinstein.</surname>
          </string-name>
          <article-title>Directional mathematical morphology approach for line thinning and extraction of character strings from maps and line drawings</article-title>
          .
          <source>In Document Analysis and Recognition</source>
          ,
          <year>1995</year>
          ., Proceedings of the Third International Conference on, volume
          <volume>1</volume>
          , pages
          <fpage>257</fpage>
          <lpage>{</lpage>
          260 vol.
          <volume>1</volume>
          , aug
          <year>1995</year>
          . doi:
          <volume>10</volume>
          .1109/ICDAR.
          <year>1995</year>
          .
          <volume>598989</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Nagy</surname>
          </string-name>
          .
          <article-title>Twenty years of document image analysis in pami</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <volume>38</volume>
          {62, jan
          <year>2000</year>
          . ISSN 0162-
          <fpage>8828</fpage>
          . doi:
          <volume>10</volume>
          .1109/34.824820.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>PCT. Patent Cooperation Treaty</surname>
          </string-name>
          .
          <year>1970</year>
          . Done at Washington on June 19,
          <year>1970</year>
          , amended on September 28,
          <year>1979</year>
          , modi ed
          <source>on February 3</source>
          ,
          <year>1984</year>
          , and
          <issue>on October 3</issue>
          ,
          <year>2001</year>
          ,
          <article-title>(as in force from April 1,</article-title>
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          2011:
          <article-title>Retrieval in the Intellectual Property Domain</article-title>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          http://clef2011.org/resources/proceedings/Overview-CLEF-
          <article-title>IP Clef2011</article-title>
          .pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>W.</given-names>
            <surname>Szwoch</surname>
          </string-name>
          .
          <article-title>Recognition, understanding and aestheticization of freehand drawing owcharts</article-title>
          .
          <source>In Document Analysis and Recognition</source>
          ,
          <year>2007</year>
          .
          <article-title>ICDAR 2007</article-title>
          . Ninth International Conference on, volume
          <volume>2</volume>
          , pages
          <fpage>1138</fpage>
          {1142, sept.
          <year>2007</year>
          . doi:
          <volume>10</volume>
          .1109/ICDAR.
          <year>2007</year>
          .
          <volume>4377093</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>B.G.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhanapanichkul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          .
          <article-title>Flowchart knowledge extraction on image processing</article-title>
          .
          <source>In Neural Networks</source>
          ,
          <year>2008</year>
          .
          <source>IJCNN</source>
          <year>2008</year>
          .
          <article-title>(IEEE World Congress on Computational Intelligence)</article-title>
          . IEEE International Joint Conference on, pages
          <volume>4075</volume>
          {4082, june 2008. doi:
          <volume>10</volume>
          .1109/IJCNN.
          <year>2008</year>
          .
          <volume>4634384</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Stefanos</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          , Symeon Papadopoulos, Anastasia Moumtzidou, Panagiotis Sidiropoulos, Emanuelle Pianta, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Towards content-based patent image retrieval: A framework perspective</article-title>
          .
          <source>World Patent Information</source>
          ,
          <volume>32</volume>
          (
          <issue>2</issue>
          ):
          <volume>94</volume>
          {
          <fpage>106</fpage>
          ,
          <year>2010</year>
          . ISSN 0172-
          <fpage>2190</fpage>
          . doi:
          <volume>10</volume>
          .1016/j.wpi.
          <year>2009</year>
          .
          <volume>05</volume>
          .010. URL http://www.sciencedirect.com/science/article/pii/S0172219009000489.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>J-Y. Yoo</surname>
          </string-name>
          ,
          <string-name>
            <surname>M-Ki. Kim</surname>
            ,
            <given-names>S. Y.</given-names>
          </string-name>
          <string-name>
            <surname>Ban</surname>
          </string-name>
          , and
          <string-name>
            <surname>Y-B. Kwon</surname>
          </string-name>
          .
          <article-title>Line removal and restoration of handwritten characters on the form documents</article-title>
          .
          <source>In Document Analysis and Recognition</source>
          ,
          <year>1997</year>
          .,
          <source>Proceedings of the Fourth International Conference on</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>128</fpage>
          <lpage>{</lpage>
          131 vol.
          <volume>1</volume>
          , aug
          <year>1997</year>
          . doi:
          <volume>10</volume>
          .1109/ICDAR.
          <year>1997</year>
          .
          <volume>619827</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>