MolRec at CLEF 2012 — Overview and Analysis of Results Noureddin M. Sadawi, Alan P. Sexton and Volker Sorge School of Computer Science, University of Birmingham Email: N.M.Sadawi|A.P.Sexton|V.Sorge@cs.bham.ac.uk URL: www.cs.bham.ac.uk/~nms|aps|vxs Abstract. We present the results and analysis of our chemical struc- ture recognition system, MolRec, in the CLEF 2012 chemical structure recognition task. MolRec analyses a diagram image, extracts vectorised components from the image and applies a rule based system to construct an internal representation of the chemical structure. This internal repre- sentation can then be exported to MOL or SMILE format. The task assigned in CLEF was to analyse two sets of chemical diagram images clipped from patent documents. The first set is of 965 diagram im- ages, the results of which could be evaluated automatically using Open- Babel. The second set is a more challenging collection of 95 images which include elements not supported by OpenBabel and which therefore have to be evaluated manually. On the first set, MolRec achieved recognition rates of between 94.91% and 96.18% over 4 runs with slightly different parameters. On the more exacting second set, MolRec’s recognition rate was between 46.32% and 58.95%. Overall the results testified to high performance on a large sample of quite complex diagrams but also to the challenges posed by the more difficult images that appear in real patent documents. 1 Introduction We present the recognition results of our MolRec system on the CLEF 2012 corpus of chemical molecule diagrams. MolRec is a rule based system, in that after an initial preprocessing phase, the primary recognition task is performed by a rule engine, in which largely disjoint rules are repeatedly applied to an initial set of geometric primitives, thereby rewriting the set into a graph representation of the given molecule diagram. This final graph structure then serves as a basis from which other efficient electronic representation formats, such as MOL files, can be generated. A previous implementation of the system has already performed well in the TREC 2011 competition [7]. For CLEF 2012 we have used an improved sys- tem with a fully overhauled implementation of the rewrite engine that not only leads to better recognition performance but is also computationally much more efficient. In this note we will first give a short overview of the system, detailing its overall structure and briefly summarising the rewriting rules that perform the primary recognition (Sec. 2). For a more detailed overview we refer the reader to [5]. We will then present the results of our system on the CLEF 2012 recogni- tion task (Sec. 3) and follow it up by a more detailed analysis and discussion of images that were not successfully recognised (Sec. 4) but which motivate future improvements. 2 Overview of MolRec MolRec employs a rule-based approach for the recognition of chemical struc- ture diagrams. It consists of two modules, a vectoriser and a rule-engine. The vectoriser preprocesses an input image, analyses the chemical structure diagram it represents and generates a set of geometric primitives. These primitives are then picked up by the rule engine which rewrites them into a graph structure representation of the recognised molecule. In a post-processing step this graph structure can then be translated into a variety of output formats such as MOL files [6] or SMILES [1,8]. 2.1 Vectorisation The vectorisation works essentially in three steps: 1. Image binarisation 2. Optical character recognition (OCR) 3. Separation of bond elements Binarisation For the first step of image binarisation we use Otsu’s method [4]. This is followed by labelling of connected components. OCR In a second step optical character recognition is performed by extracting a set of structural features from connected components and applying a nearest neighbour classification based on a Euclidean metric. All connected components recognised as characters are removed from the image. Some contextual information is used to disambiguate difficult cases. For example, the lower case, sans serif letter “l” is often visually indistinguishable from short line segments in a molecule diagram, but in all examples we have come across, it does not appear except beside other letters (usually after a capital “C”, to denote a Chlorine atom). The result of this step is a skeleton molecule with all detected characters removed. Separation of Bond Elements At this point we produce a new copy of the (character free) diagram and apply a thinning algorithm to connected compo- nents to thin them to a single pixel width. Using the thinned lines as a guide, we walk the corresponding paths in the original image to determine the average 2 line width by finding the largest disk that fits wholly with the stroke width of the line. At the same time, we build a polyline representation of the thinned lines. At every junction where three or more polylines meet, we split them into separate polylines. Closed polylines are also identified. Because of scanning, discretisation and thinning artifacts, these polylines are not, as we would like, smooth idealised representations of the lines in the original diagram. Therefore we clean them up by applying the Douglas-Peucker line simplification algorithm [2], where we set the simplification threshold to between 1 and 2 average line widths as found above. This is sufficient to smooth out the polylines, removing almost all artifacts, without losing the significant corners in the lines in the diagram. Basing the threshold on the average line width allows the algorithm to adapt to the different line styles that appear in molecule diagrams in practice. In addition to detecting and separating polylines we also detect circles as well as lines with arrows heads and solid triangles. The latter two are then annotated with their respective direction. 2.2 Rule Engine The rule engine essentially works with the geometric primitives resulting from the vectorisation. In particular it uses character groups from the OCR step, as well as line segments, circles, solid triangles and arrows from the bond separation. The goal of the rule engine is to rewrite the input set of primitives into a graph structure that represents the molecule in terms of the atoms (or superatoms) and different types of bonds between them. Rule are defined in terms of preconditions and consequences. A rule is appli- cable if there exist geometric objects that satisfies its preconditions. The conse- quence results in the removal of existing geometric objects and the addition of elements to the graph as well as possibly the addition of new geometric objects. In general, preconditions of different rules are mutually exclusive, and thus the order of rule application is irrelevant. Rules work with a number of parameters, both fuzzy and strict, that set certain thresholds, for instance the minimal bond length, under which decisions will be made. These parameters allow for the cus- tomisation of MolRec and its adaptation to particular requirements of datasets. In this section we will only briefly summarise the main rules and present an example. For more details we refer the reader to [5]. The rule engine consists of 18 rules altogether. Two of these rules have to be applied before all other rules. These two rules deal with the recognition of bridge bonds, 3-dimensional structures representing multiple different connection paths between different parts of the molecule. These are typically presented in a 21/2- dimensional perspective drawing form such as in Fig. 1. The other 16 rules can be applied in arbitrary order. They deal with the recognition of a number of other bonds that MolRec can handle and that are presented in Fig. 2. All these bonds consist of one or several geometric objects, which a rule can select using its preconditions and rewrite into a corresponding graph entry for the recognised bond type. 3 (a) Open-bridge (b) Closed-bridge Fig. 1. Closed and open bridge bonds (circled) (a) Single Planar (b) Double Planar (c) Triple Planar (d) Wedge (e) Hollow wedge (f) Bold (g) Dashed wedge (h) Dashed (i) Dashed bold (j) Wavy (k) Dative Fig. 2. Bond types recognised by MolRec. However, there are also single geometric objects that possibly represent more than one bond. An example are so called implicit nodes presented in Fig. 3. Here carbon atoms are understood to be at the grey circled areas separating the bonds. These cases are dealt with by rules that pick double or triple bonds, respectively, while also producing new geometric objects by effectively cutting the bonds at the implicit nodes. These new objects can then be further processed by other rules. Not all decisions on bond types can be made by inspecting locally a number of geometric objects only. Consequently some rules mark some of their results as possibly ambiguous. These ambiguities have to be resolved taking context information into account. This is done after all geometric objects have been rewritten within the context of the resulting graph. Furthermore, at this stage MolRec also adds the character groups to the graph, which can be used as further aid for disambiguation. For example, disambiguation of lower case “l”, capital case “I”, the digit “1” and a vertical single bond is carried out at this stage. In addition character groups identifying more than one atom are identified as superatoms. Their structure is looked up in a dictionary and the character group is replaced with the molecule subgraph corresponding to that superatom. 4 Fig. 3. Implicit Nodes (circled) An Example Rule - Wavy Bond Wavy bonds (Figure 2(j)) are commonly used in chemical structure diagrams. As the name suggests, they have a wavy form although a less commonly used saw-tooth form can be encountered in the literature. A vectorisation process will most likely turn a wavy bond into a connected sequence of short line segments arranged in a saw-tooth, or a zig-zag, pattern. As illustrated in Figure 4, a straight line can pass through the centre points of these line segments. A pattern of this form can be identified using the following conditions. 1. L = {l1 , . . . , ln }, where n ≥ 3, is a set of line segments, 2. ∀l ∈ L : length(l) ∈ dl , where dl is the dash length: a parameter of the system representing a range of acceptable values for the length of an individual dash in a dashed line. 3. All elements of L are connected. 4. The centre points of the elements of L are approximately collinear.1 5. Two elements of L, called the end elements, dash-neighbour2 precisely one other element of L. All other elements of L, called internal elements, dash- neighbour precisely two other elements of L. 6. Two end points that are not connected must be the pair of end points that are furthest apart. Consequence A wavy bond between the furthest two endpoints. The new wavy bond has unknown direction. To understand this rule, note that condition 1 simply selects possible line segments, such as those from Figure 4, for consideration. Condition 2 ensures that each segment is of an appropriate length. That the endpoints of the line segments are connected is guaranteed by condition 3. The approximate collinearity of their centre points in condition 4 ensures that this sequence of line segments, although having a zig-zag form at the micro-structure level, is straight at a macro-structure level. Finally, the last two conditions ensures that the segments form a single sequence as would be obtained from a wavy bond, and not, for example, from any 1 Approximate collinearity is a precisely defined relationship under which a set of points can be considered to be collinear within the constraints of the limitations of the construction, printing and scanning technologies used for the image. 2 Dash neighbouring is a precisely defined relationship which specify the conditions under which two line segments can be considered to be consecutive dashes in a dashed line. 5 Fig. 4. Dashes in a Wavy Bond kind of star structure with multiple end points. The consequence merely identifies the structure as a wavy bond between the end points and leaves undecided the directionality of the bond. 3 Analysis of MolRec’s Performance We were given a set of 961 test images by CLEF12’s organisers. This collection was split into two sets. The first set was of 865 images selected for automatic evaluation by comparison of generated MOL files with the ground truth MOL files using the OpenBabel toolkit[3]. However, there are chemical diagrams whose valid MOL files are beyond OpenBabel’s ability to compare (typically because they contain some form of Markush structure) and a second set of 95 such diagrams was selected for manual, visual evaluation. This second set was in- tentionally included to provide a greater challenge to the participating diagram recognition systems. We ran MolRec four times on these sets where we slightly adjusted its internal parameters and MolRec achieved the results illustrated in Table 1 and Table 2. The two tables show the number of correct and incorrect recognitions for the four runs on the manual and automatic evaluation sets respectively. Notice that be- cause most of the diagrams mis-recognised in some runs were also mis-recognised in other runs, there were a total of 52 different diagrams mis-recognised in the manual evaluation set and a total of 46 different diagrams mis-recognised in the automatic evaluation set. Some of these diagrams failed for multiple reasons, so we were able to identify the reasons illustrated in Table 3. Run # Recognitions # Mis-Recognitions Accuracy 1 44 51 46.32% 2 56 39 58.95% 3 44 51 46.32% 4 54 41 56.84% Table 1. Four Runs on the Manual Evaluation Set (95 images) 6 Run # Recognitions # Mis-Recognitions Accuracy 1 832 33 96.18% 2 821 44 94.91% 3 821 44 94.91% 4 832 33 96.18% Table 2. Four Runs on the Automatic Evaluation Set (865 images) Reason Manual Set Automatic Set #Images #Images Character Grouping 26 0 Touching Characters 8 1 Vectorisation of Four-way Junction 6 7 Missed Solid Wedge 0 6 Missed Dashed Wedge 0 6 OCR Errors 5 11 Missed wavy bond 2 1 Missed charge sign 1 2 Incorrect Stereocentre 0 1 Atom too close to line endpoint 0 1 Line end too close to closed node 0 1 Table 3. Reasons for Mis-Recognition of Molecules. 4 Evaluation and Analysis of Results We now briefly discuss some of the problems that have lead to MolRec mis- recognising molecule diagrams in the test set. 4.1 Character Grouping An error in the implementation of our character group formation algorithm lead to the digit “1” being repeated when it appears within a atom group, so, for example, MolRec recognised R21 incorrectly as R211 . A separate problem was the difficulty in correctly separating different atom groups which are closely spaced, as shown in Figure 5. Fig. 5. Atoms too Close 7 4.2 Touching Characters We currently do not handle touching characters and therefore they will likely cause mis-recognition. Figure 6 shows several examples. (a) (b) (c) (d) Fig. 6. Example of Touching Characters 4.3 Vectorisation of Four-way Junctions Vectorising junctions where four lines meet was another reason for recognition failure. Two examples are given in Figure 7. MolRec misses such junctions. (a) (b) Fig. 7. Example of Four-way Junctions Missed by MolRec 4.4 OCR Errors There were some cases of OCR errors. These included a “G” interpreted as an “O”, and the “alkyl” atom (Figure 8(a)) being mis-recognised. Also, as illus- trated in Figure 8(b), there was several cases where an “I” was interpreted as a vertical single bond. 4.5 Missed Solid Wedge, Dashed Wedge and Wavy Bonds As shown in Figure 9, MolRec incorrectly recognised a number of solid wedge, dashed wedge and wavy bonds. 8 (a) (b) Fig. 8. OCR Errors (a) (b) (c) Fig. 9. Examples of Missed Solid Wedge, Dashed Wedge and Wavy Bonds 4.6 Missed Charge sign The plus and minus signs, or “+” and “-”, are often used to indicate the exis- tence of positive and negative charges respectively. As shown in Figure 10, they are usually placed to the top right of an atom. While correctly recognising the positive charge sign, MolRec missed three negative charge signs including one that was placed at the top left of an atom name. (a) (b) (c) Fig. 10. Missed Charge Signs 4.7 Other Mis-recognition Reasons These include an atom that was too close to a bond’s endpoint and which was therefore erroneously considered connected (Figure 11(a)), a solid wedge bond that was too close to a closed node so they were considered connected (Fig- ure 11(b)) and a dashed bold bond whose stereocentre was incorrectly deter- mined (Figure 11(c)). 9 (a) (b) (c) Fig. 11. Other Mis-recognition Reasons 4.8 Errors in the Dataset While analysing our results we came across a small number of cases where the ground truth was incorrect and our recognition was correct. As shown in Table 4, there were 11 images in total. Such ground truth dataset errors are very difficult to avoid in such a complex task. US20040254236A1 p0037 x0488 y1434 c00146 US20060122222A1 p0015 x0638 y2146 c00043 US20060122222A1 p0015 x0638 y2624 c00044 US20060122222A1 p0025 x0363 y1056 c00070 US20060122222A1 p0027 x0379 y1064 c00078 US20060122222A1 p0040 x1313 y0721 c00105 US20060154945A1 p0017 x0404 y2072 c00059 US20070155803A1 p0020 x1323 y2114 c00037 US20070155803A1 p0029 x1358 y1732 c00073 US20070179154A1 p0046 x1376 y0890 c00066 US20070270434A1 p0015 x1376 y2386 c00029 Table 4. Images with Incorrect MOL files 5 Conclusions Despite scoring high recognition rates throughout four runs, the presented ex- periments demonstrate that there is still plenty of room to improve MolRec. We believe many of the mis-recognition problems can be solved with some relatively simple enhancements of our system, e.g. the error in character grouping or the vectorisation of four-way junctions. Tackling the notoriously difficult touching character segmentation problem is one aspect where we plan to explore viable solutions. Another area we plan to investigate is the recognition of more general Markush structures. Additionally, robust charge sign spotting, accurate identi- fication of solid wedge bonds and precise identification of dashed wedge bonds are also areas we need to address further. However, we are pleased with the per- formance of MolRec and hope to participate in similar events in the future so 10 that we can contribute to progress in the state of the art of chemical structure diagrams. References 1. Daylight Chemical Information Systems, Inc. SMILES — a simplified chemical language, 2008. http://www.daylight.com/dayhtml/doc/theory/theory.smiles. html. 2. David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the num- ber of points required to represent a digitized line or its caricature. Cartographica, 10(2):112–122, 1973. 3. Open Babel: The open source chemistry toolbox. http://openbabel.org/. 4. N. Otsu. A threshold selection method from gray-level histograms. IEEE Transac- tions on Systems, Man and Cybernetics, 9:62–66, January 1979. 5. Noureddin M. Sadawi, Alan P. Sexton, and Volker Sorge. Chemical structure recog- nition: A rule based approach. In Christian Viard-Gaudin and Richard Zanibbi, editors, 19th Document Recognition and Retrieval Conference (DRR 2012). SPIE, January 2012. 6. Symyx. CTfile formats, 2010. http://www.symyx.com/downloads/public/ctfile/ ctfile.jsp. 7. Text REtrieval Conference (TREC). Image2structure task, 2011. http://trec. nist.gov/pubs/trec20/t20.proceedings.html. 8. D. Weininger. SMILES, a chemical language and information system. J Chem Inform Comput Sci, 1:31–36, 1988. 11