<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optical Structure Recognition Application entry to CLEF-IP 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Igor V. Filippov</string-name>
          <email>igor.filippov@nih.gov</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Katsubo</string-name>
          <email>dmitry.katsubo@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc C. Nicklaus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chemical Biology Laboratory, NCI, NIH, DHHS, Frederick National Lab</institution>
          ,
          <addr-line>Frederick, Maryland, 21702</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Chemical Biology Laboratory, SAIC-Frederick, Inc., Frederick National Lab</institution>
          ,
          <addr-line>Frederick, Maryland, 21702</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Life Sciences Department</institution>
          ,
          <addr-line>European Patent Office, The Hague</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present our entry to the CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both the bounding box extraction and the molecule structure recognition tasks using the Optical Structure Recognition Application (OSRA). OSRA is an open source utility that converts images of chemical structures to connection tables in established computerized molecular formats. It has been under constant development since 2007.</p>
      </abstract>
      <kwd-group>
        <kwd>image recognition</kwd>
        <kwd>document analysis</kwd>
        <kwd>chemoinformatics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Page segmentation</title>
      <p>
        The general workflow of OSRA has been presented at several meetings and
conferences before: [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A few modifications were made in the recently
released version of OSRA to allow for more accurate reporting of bounding box
coordinates. Internally, OSRA does not use a bounding box paradigm; it relies
instead on the minimum pairwise distance between points of different components.
This makes it possible to split (or keep together) objects that cannot be separated
with a bounding box approach, for example a larger molecule almost surrounding a
smaller one. For this task we submitted two runs: the first used
tiffsplit to split multi-page TIFF images into separate pages, and the second
used the built-in OSRA facilities for page splitting. Surprisingly, this led to
significantly different results, which we can only attribute to the internal conversion
of the TIFF format that takes place within the tiffsplit procedure.
      </p>
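      <p>The difference between distance-based grouping and a bounding box test can be illustrated with a minimal Python sketch. It is not OSRA's actual implementation: the function names, the union-find grouping, the sample coordinates, and the threshold value are all assumptions made for illustration. Each component is a list of pixel coordinates, and two components are merged only when their minimum pairwise distance is at most a chosen threshold.</p>
      <preformat preformat-type="code">
```python
from itertools import combinations


def min_pairwise_distance(a, b):
    """Smallest Euclidean distance between any point of a and any point of b."""
    return min(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
               for ax, ay in a for bx, by in b)


def bounding_box(points):
    """Axis-aligned box (x0, y0, x1, y1) enclosing all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(p, q):
    """True when two (x0, y0, x1, y1) boxes share at least one point."""
    return q[2] >= p[0] and p[2] >= q[0] and q[3] >= p[1] and p[3] >= q[1]


def group_components(components, threshold):
    """Merge components whose minimum pairwise distance is at most threshold.

    Returns sorted lists of component indices, one list per group.
    """
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(components)), 2):
        if threshold >= min_pairwise_distance(components[i], components[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy data: a C-shaped outer component almost surrounding a
# small inner one, plus a fragment lying close to the outer component.
outer = ([(0, y) for y in range(21)]
         + [(x, 0) for x in range(21)]
         + [(x, 20) for x in range(21)])
inner = [(x, y) for x in range(9, 12) for y in range(9, 12)]
fragment = [(22, 0), (23, 0)]

overlap = boxes_overlap(bounding_box(outer), bounding_box(inner))
groups = group_components([outer, inner, fragment], threshold=3)
```
      </preformat>
      <p>In this toy data the bounding boxes of the outer and inner components overlap, so a box-based test cannot tell them apart, whereas the distance criterion keeps them separate while still attaching the nearby fragment to the outer component.</p>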
      <p>Table 1 shows the results of the page segmentation task when tiffsplit was used.
The number of structures in the ground truth set was 5421, and the total number
of returned records was 8800. Tolerance is the allowed margin of error, in pixels,
in bounding box detection.
For the run using OSRA's native TIFF reading capabilities (Table 2), the
number of returned records was 5254 and the overall precision was much higher.
Both runs demonstrate competitive recall values.</p>
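      <p>To make the roles of tolerance, precision, and recall concrete, the following Python sketch scores a set of returned boxes against a ground truth. The exact matching rule used by the task organizers is not described here, so the rule below (every box coordinate within the tolerance, greedy one-to-one matching) and the sample boxes are assumptions for illustration only.</p>
      <preformat preformat-type="code">
```python
def boxes_match(found, truth, tolerance):
    """True when every coordinate of found is within tolerance pixels of truth.

    Boxes are (x0, y0, x1, y1) tuples in pixels.
    """
    return all(tolerance >= abs(f - t) for f, t in zip(found, truth))


def score(returned, ground_truth, tolerance):
    """Greedy one-to-one matching; returns (precision, recall)."""
    unmatched = list(ground_truth)
    hits = 0
    for box in returned:
        for truth in unmatched:
            if boxes_match(box, truth, tolerance):
                unmatched.remove(truth)  # each truth box is matched at most once
                hits += 1
                break
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical boxes, not taken from the task data.
ground_truth = [(10, 10, 100, 100), (200, 50, 300, 150)]
returned = [(12, 9, 101, 98), (205, 50, 300, 150), (400, 400, 450, 450)]

strict = score(returned, ground_truth, tolerance=0)
medium = score(returned, ground_truth, tolerance=2)
loose = score(returned, ground_truth, tolerance=5)
```
      </preformat>
      <p>Under such a one-to-one matching assumption, a run that returns 8800 records against 5421 ground truth structures cannot exceed a precision of 5421/8800, roughly 0.62, regardless of tolerance, which is consistent with the lower precision observed for the tiffsplit run.</p>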
    </sec>
    <sec id="sec-2">
      <title>Structure recognition</title>
      <p>For the structure recognition task the test set was split into two parts: the first
allowed for automatic evaluation of results using InChI keys, in the same way as
at the TREC-CHEM 2011 meeting. The second part could only be
evaluated manually due to the presence of Markush-style atomic labels. The
results are presented in Table 3.</p>
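      <p>An automatic evaluation based on InChI keys can be as simple as an exact string comparison per image, as in the Python sketch below. The function name, the image identifiers, and the key strings are made-up placeholders, and the official evaluation procedure may have differed in its details.</p>
      <preformat preformat-type="code">
```python
def inchikey_accuracy(predicted, reference):
    """Fraction of reference images whose predicted InChIKey equals the reference.

    Both arguments map an image identifier to an InChIKey string; an image
    missing from predicted counts as a failed recognition.
    """
    if not reference:
        return 0.0
    correct = sum(1 for image_id, key in reference.items()
                  if predicted.get(image_id) == key)
    return correct / len(reference)


# Placeholder identifiers and keys, invented for illustration only.
reference = {
    "img001": "AAAAAAAAAAAAAA-AAAAAAAAAA-N",
    "img002": "BBBBBBBBBBBBBB-BBBBBBBBBB-N",
    "img003": "CCCCCCCCCCCCCC-CCCCCCCCCC-N",
}
predicted = {
    "img001": "AAAAAAAAAAAAAA-AAAAAAAAAA-N",  # recognized correctly
    "img002": "XXXXXXXXXXXXXX-XXXXXXXXXX-N",  # recognized as a different structure
    "img003": "CCCCCCCCCCCCCC-CCCCCCCCCC-N",  # recognized correctly
}

accuracy = inchikey_accuracy(predicted, reference)
```
      </preformat>
      <p>Exact key comparison of this kind has no way to represent generic labels, which is one reason structures with Markush-style atomic labels require manual evaluation.</p>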
      <p>The results are consistent with those presented at the TREC-CHEM 2011
meeting, where OSRA achieved the second-highest score among the 6 participating
projects.</p>
      <p>Funding Disclaimer: This project has been funded in whole or in part with
federal funds from the National Cancer Institute, National Institutes of Health,
under contract N01-CO-12400. The content of this publication does not
necessarily reflect the views or policies of the Department of Health and Human Services,
nor does mention of trade names, commercial products, or organizations imply
endorsement by the U.S. Government. This research was supported in part by
the Intramural Research Program of the NIH, National Cancer Institute, Center
for Cancer Research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Filippov</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Nicklaus</surname>
          </string-name>
          .
          <article-title>Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution</article-title>
          .
          <source>Journal of Chemical Information and Modeling</source>
          ,
          <volume>49</volume>
          (
          <issue>3</issue>
          ):
          <fpage>740</fpage>
          –
          <lpage>743</lpage>
          ,
          <month>March</month>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Filippov</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Nicklaus</surname>
          </string-name>
          .
          <article-title>Extracting chemical structure information: Optical structure recognition application</article-title>
          .
          <source>In Proceedings of the Eighth IAPR International Workshop on Graphics Recognition</source>
          , pages
          <fpage>133</fpage>
          –
          <lpage>142</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Filippov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Nicklaus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John</given-names>
            <surname>Kinney</surname>
          </string-name>
          .
          <article-title>Improvements in Optical Structure Recognition Application</article-title>
          .
          <source>International Workshop on Document Analysis Systems (DAS 2010)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Filippov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Katsubo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Nicklaus</surname>
          </string-name>
          .
          <article-title>Optical Structure Recognition Application entry in Image2Structure task</article-title>
          .
          <source>In Proceedings of the Twentieth Text REtrieval Conference (TREC 2011)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>