<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting hierarchical data points and tables from scanned contracts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Stadermann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephan Symons</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingo Thon</string-name>
          <email>ingo.thong@recommind.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Recommind Inc.</institution>
          ,
          <addr-line>650 California Street, San Francisco, CA 94108</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a technique for developing systems to automatically extract information from scanned semi-structured contracts. Such contracts are based on a template, but have different layouts and client-specific changes. While the presented technique is applicable to all kinds of such contracts, we specifically focus on so-called ISDA credit support annexes. The data model for such documents consists of 150 individual entities, some of which are tables that can span multiple pages. The information extraction is based on the Apache UIMA framework. It consists of a collection of small and simple Analysis Components that extract increasingly complex information based on earlier extractions. This technique is applied to extract individual data points and tables. Experiments show an overall precision of 97% with a recall of 93% for individual/simple data points, and 89%/81% for table cells, measured against manually entered ground truth. Due to its modular nature, our system can be easily extended and adapted to other collections of contracts as long as a data model can be formulated.</p>
      </abstract>
      <kwd-group>
        <kwd>OCR robust information extraction</kwd>
        <kwd>hierarchical taggers</kwd>
        <kwd>table extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Despite the existence of electronic document handling and content management
systems, there is still a large amount of paper-based contracts. Even when scanned
and OCRed, the interesting data contained in the document is not
machine-readable, as no semantics are attached to the text. Especially in the banking
domain it is necessary to have the underlying information available, e.g., for risk
assessment. Until now, the information has had to be extracted by human reviewers.
The goal of the system presented here is to automatically obtain the relevant
information from OTC (over-the-counter) contracts which are based on a
template provided by the ISDA (International Swaps and Derivatives Association). The
data is given in the form of image-embedded PDF documents. Each contract contains
around 150 data points organized in a complex hierarchical data model. A data
point can be either a (possibly multi-valued) simple field or a table. The main
challenges of such a system are:
1. The complex legal language used in the contracts.
2. Despite existing contract templates, the wording varies across customers.
3. The layout varies. Especially tables can be represented in various forms.
4. The scanning quality of the contracts is often poor, especially in old
contracts or documents sent by fax. Still, the remaining information needs to be
extracted correctly.
Information extraction from such documents has previously been addressed using
rule-based approaches [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or discriminative context-free grammars [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Closest
to our solution is a system described by Surdeanu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. They employ two
layers of extraction using Conditional Random Fields [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and deal with OCR
data. For table extraction, heuristic methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have been proposed as well as
Conditional Random Fields [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        In contrast, our system uses a theoretically unlimited number of layers with
separate classifiers for each piece of information, including tables, on each level.
Instead of processing the whole text at once, our classifiers just collect the
information they require, and decide only on that data. Therefore, they allow for
better performance and extensibility, as additional data does not affect the
existing classifiers. Our work follows strategies commonly used in spoken dialogue
systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and uses a set of small classifiers, which is inspired by the boosting
idea [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In addition, we use automatically extracted segmentation information
and cross-checks between our classifiers to increase the precision of the extracted
data. From a UI standpoint there is a similar application called GATE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which
extracts entities based on given rule-sets. This application provides a
hierarchical organization of entities, and its architecture appears to be very similar to the
UIMA framework. However, GATE has no special provisions to deal with noise
from the OCR step, and it only allows simple extraction rules to be specified.
Furthermore, there is no direct way to make the entity extraction itself work
hierarchically; only the result can be organized in a hierarchical way.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Information extraction</title>
      <p>An overview of our system's architecture is shown in figure 2. Prior to information
extraction, the OmniPage OCR engine (http://www.nuance.com/omnipage) is used to
convert the image to readable text. However, many character-level errors and layout
distortions remain, which need to be dealt with in the following processing steps.
The overall strategy is based on the idea that small pieces of relevant text can be
extracted quite accurately even in the presence of OCR errors. On top of these
pieces we build several layers of higher-level extractors, here called "experts",
that combine these small pieces to decide on a final data point. The extraction of
tables works in a similar fashion by first trying to extract small pieces that form
table cells. Then stretches of cells are collected, trying to deduce a layout from
the order and type of the pieces. Finally, an optimal result table is selected (see
section 2.2).</p>
      <p>
        Our solution is based on the UIMA framework [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Each type of expert is
implemented as a configurable annotation engine. The overall extraction system
consists of a large hierarchy of analysis engines, encompassing several hundred
elements. The type system, in contrast, only consists of three principal types, i.e.
for simple fields, tables and table rows. Annotation types, extracted values, etc.
are stored as features. Both final and intermediate annotations are represented
by these types.
      </p>
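      <p>
The layered design can be illustrated with a minimal sketch in plain Python (this is not the UIMA API; all names here are illustrative): each analysis engine reads the document and the annotations produced by earlier layers, and appends new, higher-level annotations.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str       # e.g. "Currency", "EligibleCurrencyTerm"
    begin: int      # character offsets into the OCR text
    end: int
    value: str = ""

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

class Annotator:
    """One analysis engine: reads the document (and earlier
    annotations) and appends new annotations."""
    def process(self, doc: Document) -> None:
        raise NotImplementedError

class Pipeline:
    """Runs annotators in order, so lower layers (dictionaries,
    regular expressions) feed the higher-level experts."""
    def __init__(self, annotators):
        self.annotators = annotators

    def run(self, doc: Document) -> Document:
        for annotator in self.annotators:
            annotator.process(doc)
        return doc
```

A concrete annotator subclasses Annotator and adds, for example, Currency annotations found by a dictionary lookup; experts in later layers then consume those annotations.
      </p>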
      <p>[Figure 2: System architecture. OCR produces recognized text (XML); the information extraction stage chains RegExp extractors, dictionary extractors and normalization steps that feed the experts, and writes XML with metadata to the document index.]</p>
      <p>2.1 Extraction of simple-valued fields</p>
      <p>
We use the term "simple-valued fields" for data points, where one key has one or
more values. They differ from named entities as they may include multi-valued
data. Figure 1(a) shows an example of the key eligible currency with the
(normalized) values "USD" and "Base currency". Fields are extracted layer-wise.
On the lowest layer, all instances of the identifying term "Eligible currency" are
captured, as well as the different currency expressions, including the special
term "Base currency", which refers to another simple field. On this level we
typically use annotators based on dictionaries and regular expressions, where
variations due to OCR errors are reflected in dictionary variants and in the
regular expressions, respectively. All such annotators are implemented as analysis
engines. On the next level, so-called "expert-extractors" combine the existing
annotations into a new one. An expert is a rule, defined as a set of slots for
annotations of specific types, and a definition of which slots form a new
annotation if the rule is satisfied, i.e. if all slots are filled. To allow for
fine-tuning the experts, slots can be configured, e.g. by indicating certain slots
as optional. Furthermore, it is possible to specify the order in which the
annotations in the slots appear in the document. It is also possible to specify a
maximum distance. If the distance between two found annotations exceeds the
defined threshold for this expert, the expert assumes it is in the wrong area of
the document and clears its internal state to start all over
again. Finally, slots can be write-protected, accepting only the first occurrence
of the configured annotation.</p>
      <p>[Figure 3: Expert 1 collects adjacent Currency annotations ("Base currency", "USD") into a Currencies annotation; Expert 2 combines the EligibleCurrency term with the Currencies annotation when their distance is below 20, yielding (eligible_currency, "Base currency, USD").]</p>
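      <p>
The slot-filling logic of such an expert can be sketched as follows (illustrative Python, not the production implementation; slot configuration is reduced to a set of required annotation types and a distance threshold):

```python
from collections import namedtuple

# A minimal stand-in for an annotation: type, character span, value.
Ann = namedtuple("Ann", "type begin end value")

def run_expert(annotations, slot_types, result_type, max_distance=50):
    """Fill one slot per required annotation type, in document order.
    If the gap between consecutive matches exceeds max_distance, the
    expert assumes it is in the wrong area and clears its state.
    When all slots are filled, a new annotation spanning them is emitted."""
    slots = {}          # slot type -> matched annotation
    results = []
    last_end = None
    for ann in sorted(annotations, key=lambda a: a.begin):
        if ann.type not in slot_types:
            continue
        if last_end is not None and ann.begin - last_end > max_distance:
            slots = {}  # distance exceeded: start all over again
        slots[ann.type] = ann
        last_end = ann.end
        if len(slots) == len(slot_types):   # rule satisfied
            begin = min(a.begin for a in slots.values())
            end = max(a.end for a in slots.values())
            results.append((result_type, begin, end))
            slots, last_end = {}, None
    return results
```

For the eligible-currency example, the required slot types would be the "Eligible Currency" term annotation and the collected Currencies annotation, combined into an EligibleCurrency annotation when both occur close together.
      </p>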
      <p>To extract eligible currency, two experts are employed (see figure 3). The
first expert collects adjacent currency annotations. The second one combines the
"Eligible Currency" term and the collected currencies found by expert one, if
both annotations are found within a short distance. The resulting annotation
will span the relevant currency terms. This modular design allows us to reduce
the number of extractors and to re-use already made annotations for completely
different data points. In general, the pieces of information found in the examined
contracts are not independent of each other. We use business rules and other
constraints to validate and normalize the found results; e.g., the set of
currencies is well-defined. If the validation fails, or the normalization repairs
some value due to business rules, a corresponding message can be attached to the
annotation to inform the reviewer.</p>
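      <p>
The validation and normalization step can be sketched as follows (the variant table and message texts are hypothetical; the real well-defined currency set and business rules come from the deployment's configuration):

```python
# Hypothetical normalization table mapping raw (possibly OCR-damaged)
# tokens to canonical values; a real system would load this from
# configuration and cover the full currency list.
CURRENCY_VARIANTS = {
    "USD": "USD", "US0": "USD", "U5D": "USD",   # common OCR confusions
    "EUR": "EUR", "EUP": "EUR",
    "BASE CURRENCY": "Base currency",
}

def normalize_currency(raw):
    """Return (normalized value, message). If normalization had to
    repair the value, or validation fails, a message that can be
    attached to the annotation for the reviewer is returned."""
    key = raw.strip().upper()
    if key in CURRENCY_VARIANTS:
        value = CURRENCY_VARIANTS[key]
        msg = None if key == value.upper() else f"repaired '{raw}' to '{value}'"
        return value, msg
    return None, f"validation failed for '{raw}'"
```
      </p>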
      <p>2.2 Extraction of tables</p>
      <p>We define a table as multi-dimensional, structured data present in a document
either in a classical tabular layout, or defined in a series of sentences or
paragraphs in free-text form (as in figure 4). We aim at extracting tables of both
structure types and intermediate formats (e.g. as in figure 1(b)) from the
document's OCR output at character level only. In our application, table extraction
extends the simple-valued field extraction: the basic input for a table expert
is a document annotated with simple-valued fields and intermediate annotations.
The experts attempt to match sequences of simple annotations to a set of table
models. A table model is user-defined and describes which columns the resulting
extracted table should have. Each column can contain multiple types of simple
fields. Furthermore, columns can be configured to be optional and to accept
only unique or non-overlapping annotations. This allows for both more general
models with variable columns and fine-tuning of the accepted annotations.</p>
      <p>The process of detecting tables by the table expert (see figure 4 for an
example) begins with collecting all accepted annotations for a model, within a
predefined range or until a table-stop annotation is found, into a list sorted by
order of appearance. For each such list, several filling strategies are employed.
A filling strategy addresses the problem that multiple columns may accept the
same types of annotations. If elements appear row-wise or column-wise, the
corresponding strategies will recover the correct table, also compensating for
some errors from omitted table elements. In mixed cases, adding a new table
cell to the shortest relevant column is used as a fall-back strategy. Each strategy
is evaluated using the fraction of cells filled in the resulting table, c, and the
filling-strategy-specific score s. The latter score measures how well the
annotations match the expectations of the filling strategy. The table which maximizes
sf = c · s is annotated as a candidate if sf is above a predefined threshold. The
table expert is implemented as an analysis engine. Configuration encompasses the
columns describing the table model, the distance and scoring thresholds, and the
set of filling strategies to be evaluated. The output is a table-type annotation,
which in turn contains several table rows, each containing simple fields as cells.</p>
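      <p>
A row-wise filling strategy and the candidate score sf = c · s can be sketched as follows (illustrative Python; column models are reduced to sets of accepted cell types, and the strategy-specific score s is passed in as a number):

```python
def fill_row_wise(cells, columns):
    """Row-wise filling strategy sketch: walk the collected cell
    annotations in document order, assign each to the first column
    that accepts its type, and start a new row whenever the target
    column is already filled in the current row."""
    rows, current = [], {}
    for cell_type, value in cells:
        col = next((i for i, accepted in enumerate(columns)
                    if cell_type in accepted), None)
        if col is None:
            continue                 # annotation not accepted by this model
        if col in current:           # column already filled: new row
            rows.append(current)
            current = {}
        current[col] = value
    if current:
        rows.append(current)
    return rows

def table_score(rows, n_columns, strategy_score):
    """sf = c * s: fraction of filled cells times the strategy score."""
    total = len(rows) * n_columns
    filled = sum(len(r) for r in rows)
    c = filled / total if total else 0.0
    return c * strategy_score
```

A candidate table would be kept only if its sf value exceeds the configured threshold; column-wise and shortest-column strategies would fill the same collected list differently and be scored the same way.
      </p>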
      <p>Multiple table experts may be used to generate candidate tables for a single
target, and candidates may occur in several locations in a document. Usually,
the correct location gives rise to tables with certain properties, e.g. short, dense
tables. This is exploited by a feature-based selection of the optimal table
candidate. We model this using both general-purpose features (e.g. size and the
number of empty cells) as well as domain-specific features. The table with the
highest weighted sum of score features is selected as the final output. The weights
can either be user-defined or fitted using a formal optimization model.</p>
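      <p>
The weighted-sum selection over candidate tables can be sketched as follows (feature names and weights are illustrative, not the system's actual feature set):

```python
def select_best_table(candidates, weights):
    """Feature-based selection sketch: each candidate carries a dict
    of feature values; the candidate with the highest weighted sum
    of its features wins. Unknown features get weight 0."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(candidates, key=lambda c: score(c["features"]))
```
      </p>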
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We composed a document set containing 449 documents (see
tinyurl.com/csa-example for a public sample document) to measure the
extraction quality of our system. These documents are from various customers and
represent as many variants of different wordings and layouts as possible.</p>
      <p>With our customers we agreed upon certain quality gates that the automatic
extraction system has to meet. Due to the nature of the contracts, it is much more
important to achieve a high precision of the extracted data than a high recall. For
simple fields the gate's threshold is 95% precision and 80% recall. Table cells
are more difficult to extract since the OCR component not only mis-recognizes
individual characters but also makes errors on the structure of a table. For table
cells, our goal is to have a high recall, since errors within a structured table are
easier for a human reviewer to detect and correct than simple-field errors. Table 1
(columns: insertions, deletions, substitutions, correct, precision, recall; rows:
simple fields and table cells) shows our results against a manually created ground
truth. The numbers represent the total number of data points and errors,
respectively, over all of our documents. In total, we meet our gate criterion for
simple fields. Precision can be as low as 33% for rare fields, where fitting
appropriate data experts is hard. In contrast, for frequent fields, precision may
exceed 99%. In principle, the same is true for recall, with both maximum and
minimum lower, due to our target criteria. For table cells, the precision needs
improvement, mainly due to the OCR's structural errors like swapping rows within
a table or switching between row-wise and column-wise recognition in one table.
This is especially true for tables which are complex with respect to both layout
and contents, like the collateral eligibility table in figure 1(b). Here, precision
and recall are 84.4% and 80.2%, respectively. In contrast, structurally simple
tables, like the interest rate table (see figure 4 for an example), can be extracted
with much higher confidence (97.4% precision and 90.8% recall).</p>
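      <p>
Assuming the common convention that insertions and substitutions count against precision while deletions and substitutions count against recall (the exact accounting is not spelled out here), the reported measures derive from the error counts as:

```python
def precision_recall(correct, insertions, deletions, substitutions):
    """Derive precision and recall from error counts against ground
    truth (assumed convention: insertions/substitutions hurt
    precision, deletions/substitutions hurt recall)."""
    precision = correct / (correct + insertions + substitutions)
    recall = correct / (correct + deletions + substitutions)
    return precision, recall
```
      </p>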
    </sec>
    <sec id="sec-4">
      <title>Conclusion and outlook</title>
      <p>This article presents a system to automatically extract simple data points and
tables from OTC contract images. The system consists of an OCR component
and a hierarchical set-up of small modular extractors, either capturing (noisy)
text or combining already annotated clues using a slot-filling strategy. Our
experiments are conducted on an in-house contract collection, resulting in a
precision of 97% (recall 93%) on simple fields and a precision of 89% (recall 81%)
on table cells. While the evaluation we conducted is limited, we expect overfitting
to be moderate. The legal nature of the contracts limits the layout and wording
options. Our next steps include the introduction of a confidence score on
data-point level and the use of statistical classification methods for selecting
the best-suited table model.</p>
      <p>Acknowledgement. We would like to thank our partner Rule Financial for
providing the data model and for their assistance in understanding the documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Srikanth</given-names>
            <surname>Ramaka</surname>
          </string-name>
          .
          <article-title>Unsupervised ontology-based semantic tagging for knowledge markup</article-title>
          .
          <source>In Proceedings of the Workshop on Learning in Web Search at the International Conference on Machine Learning</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Hamish</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>GATE, a general architecture for text engineering</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>254</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>David</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lally</surname>
          </string-name>
          .
          <article-title>UIMA: an architectural approach to unstructured information processing in the corporate research environment</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>327</fpage>
          -
          <lpage>348</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Kyungduk</given-names>
            <surname>Kim</surname>
          </string-name>
          et al.
          <article-title>A frame-based probabilistic framework for spoken dialog management using dialog examples</article-title>
          .
          <source>In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Conditional random fields: probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Machine Learning</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ron</given-names>
            <surname>Meir</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gunnar</given-names>
            <surname>Rätsch</surname>
          </string-name>
          .
          <article-title>An introduction to boosting and leveraging</article-title>
          .
          <source>In Advanced lectures on machine learning</source>
          , pages
          <fpage>118</fpage>
          -
          <lpage>183</lpage>
          . Springer,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>David</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xing</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Table extraction using conditional random fields</article-title>
          .
          <source>In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>235</fpage>
          -
          <lpage>242</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Pallavi</given-names>
            <surname>Pyreddy</surname>
          </string-name>
          and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Tintin: A system for retrieval in text tables</article-title>
          .
          <source>In Proceedings of the second ACM international conference on Digital libraries</source>
          , pages
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Lev</given-names>
            <surname>Ratinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In Proceedings of the thirteenth conference on Computational Natural Language Learning</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Soderland</surname>
          </string-name>
          .
          <article-title>Learning information extraction rules for semi-structured and free text</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>34</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>272</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Nallapati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Legal claim identification: information extraction with hierarchically labeled data</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on the Semantic Processing of Legal Texts</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Suzanne Liebowitz</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard</given-names>
            <surname>Fritzson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jon A</given-names>
            <surname>Pastor</surname>
          </string-name>
          .
          <article-title>Extraction of data from preprinted forms</article-title>
          .
          <source>Machine Vision and Applications</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>222</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Viola</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mukund</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          .
          <article-title>Learning to extract information from semi-structured text using a discriminative context free grammar</article-title>
          .
          <source>In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>330</fpage>
          -
          <lpage>337</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>