<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting hierarchical data points and tables from scanned contracts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Stadermann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephan Symons</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingo Thon</string-name>
          <email>ingo.thong@recommind.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Recommind Inc.</institution>
          ,
          <addr-line>650 California Street, San Francisco, CA 94108</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a technique for developing systems to automatically extract information from scanned semi-structured contracts. Such contracts are based on a template, but have different layouts and client-specific changes. While the presented technique is applicable to all kinds of such contracts, we specifically focus on so-called ISDA credit support annexes. The data model for such documents consists of 150 individual entities, some of which are tables that can span multiple pages. The information extraction is based on the Apache UIMA framework. It consists of a collection of small and simple Analysis Components that extract increasingly complex information based on earlier extractions. This technique is applied to extract individual data points and tables. Experiments show an overall precision of 97% with a recall of 93% for individual/simple data points, and 89%/81% for table cells, measured against manually entered ground truth. Due to its modular nature, our system can be easily extended and adapted to other collections of contracts as long as a data model can be formulated.</p>
      </abstract>
      <kwd-group>
        <kwd>OCR robust information extraction</kwd>
        <kwd>hierarchical taggers</kwd>
        <kwd>table extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Despite the existence of electronic document handling and content management
systems, there is still a large amount of paper-based contracts. Even when scanned
and OCRed, the interesting data contained in the document is not
machine-readable, as no semantics are attached to the text. Especially in the banking
domain it is necessary to have the underlying information available, e.g., for risk
assessment. Until now, the information has had to be extracted by human reviewers.
The goal of the system presented here is to automatically obtain the relevant
information from OTC (over-the-counter) contracts which are based on a
template provided by the ISDA (International Swaps and Derivatives Association). The
data is given in the form of image-embedded PDF documents. Each contract contains
around 150 data points organized in a complex hierarchical data model. A data
point can be either a (possibly multi-valued) simple field or a table. The main
challenges of such a system are:
1. The complex legal language used in the contracts.
2. Despite existing contract templates, the wording varies across customers.
3. The layout varies. Especially tables can be represented in various forms.
4. The scanning quality of the contracts is often poor, especially in old
contracts or documents sent by fax. Still, the remaining information needs to be
extracted correctly.
Information extraction from such documents has previously been addressed using
rule-based approaches [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or discriminative context-free grammars [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Closest
to our solution is a system described by Surdeanu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. They employ two
layers of extraction using Conditional Random Fields [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and deal with OCR
data. For table extraction, heuristic methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have been proposed as well as
Conditional Random Fields [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        In contrast, our system uses a theoretically unlimited number of layers with
separate classifiers for each piece of information, including tables, on each level.
Instead of processing the whole text at once, our classifiers just collect the
information they require, and decide only on that data. Therefore, they allow for
better performance and extensibility, as additional data does not affect the
existing classifiers. Our work follows strategies commonly used in spoken dialogue
systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and uses a set of small classifiers, which is inspired by the boosting
idea [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In addition, we use automatically extracted segmentation information
and cross-checks between our classifiers to increase the precision of the extracted
data. From a UI standpoint there is a similar application called GATE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which
extracts entities based on given rule-sets. This application provides a
hierarchical organization of entities, and its architecture appears to be very similar to the
UIMA framework. However, GATE has no special provisions to deal with noise
from the OCR step, and it only allows simple extraction rules to be specified.
Furthermore, there is no direct way to make the entity extraction itself work
hierarchically; only the result can be organized in a hierarchical way.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Information extraction</title>
      <p>An overview of our system's architecture is shown in figure 2. Prior to information
extraction, the OmniPage OCR engine (http://www.nuance.com/omnipage) is used to
convert the image to readable text. However, many character-level errors and layout
distortions remain, which need to be dealt with in the following processing steps.
The overall strategy is based on the idea that small pieces of relevant text can be
extracted quite accurately even in the presence of OCR errors. On top of these
pieces we build several layers of higher-level extractors, here called "experts",
that combine these small pieces to decide on a final data point. The extraction of
tables works in a similar fashion by first trying to extract small pieces that form
table cells. Then stretches of cells are collected, trying to deduce a layout from
the order and type of the pieces. Finally, an optimal result table is selected (see
section 2.2).</p>
      <p>
        Our solution is based on the UIMA framework [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Each type of expert is
implemented as a configurable annotation engine. The overall extraction system
consists of a large hierarchy of analysis engines, encompassing several hundred
elements. The type system, in contrast, only consists of three principal types, i.e.
for simple fields, tables and table rows. Annotation types, extracted values, etc.
are stored as features. Both final and intermediate annotations are represented
by these types.
      </p>
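      <p>
The layered design can be illustrated with a minimal sketch in plain Python (this is not the UIMA API; all names here are illustrative): each analysis engine reads the document and the annotations produced by earlier layers, and appends new, higher-level annotations.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str       # e.g. "Currency", "EligibleCurrencyTerm"
    begin: int      # character offsets into the OCR text
    end: int
    value: str = ""

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

class Annotator:
    """One analysis engine: reads the document (and earlier
    annotations) and appends new annotations."""
    def process(self, doc: Document) -> None:
        raise NotImplementedError

class Pipeline:
    """Runs annotators in order, so lower layers (dictionaries,
    regular expressions) feed the higher-level experts."""
    def __init__(self, annotators):
        self.annotators = annotators

    def run(self, doc: Document) -> Document:
        for annotator in self.annotators:
            annotator.process(doc)
        return doc
```

A concrete annotator subclasses Annotator and adds, for example, Currency annotations found by a dictionary lookup; experts in later layers then consume those annotations.
      </p>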
      <p>[Figure 2: System architecture. OCR produces recognized text (XML); the information extraction stage chains RegExp extractors, dictionary extractors and normalization steps that feed the experts, and writes XML with metadata to the document index.]</p>
      <p>2.1 Extraction of simple-valued fields</p>
      <p>
We use the term "simple-valued fields" for data points, where one key has one or
more values. They differ from named entities as they may include multi-valued
data. Figure 1(a) shows an example of the key eligible currency with the
(normalized) values "USD" and "Base currency". Fields are extracted layer-wise.
On the lowest layer, all instances of the identifying term "Eligible currency" are
captured, as well as the different currency expressions, including the special
term "Base currency", which refers to another simple field. On this level we
typically use annotators based on dictionaries and regular expressions, where
variations due to OCR errors are reflected in dictionary variants and in the
regular expressions, respectively. All such annotators are implemented as analysis
engines. On the next level, so-called "expert-extractors" combine the existing
annotations into a new one. An expert is a rule, defined as a set of slots for
annotations of specific types, and a definition of which slots form a new
annotation if the rule is satisfied, i.e. if all slots are filled. To allow for
fine-tuning the experts, slots can be configured, e.g. by indicating certain slots
as optional. Furthermore, it is possible to specify the order in which the
annotations in the slots appear in the document. It is also possible to specify a
maximum distance. If the distance between two found annotations exceeds the
defined threshold for this expert, the expert assumes it is in the wrong area of
the document and clears its internal state to start all over
again. Finally, slots can be write-protected, accepting only the first occurrence
of the configured annotation.</p>
      <p>[Figure 3: Expert 1 collects adjacent Currency annotations ("Base currency", "USD") into a Currencies annotation; Expert 2 combines the EligibleCurrency term with the Currencies annotation when their distance is below 20, yielding (eligible_currency, "Base currency, USD").]</p>
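      <p>
The slot-filling logic of such an expert can be sketched as follows (illustrative Python, not the production implementation; slot configuration is reduced to a set of required annotation types and a distance threshold):

```python
from collections import namedtuple

# A minimal stand-in for an annotation: type, character span, value.
Ann = namedtuple("Ann", "type begin end value")

def run_expert(annotations, slot_types, result_type, max_distance=50):
    """Fill one slot per required annotation type, in document order.
    If the gap between consecutive matches exceeds max_distance, the
    expert assumes it is in the wrong area and clears its state.
    When all slots are filled, a new annotation spanning them is emitted."""
    slots = {}          # slot type -> matched annotation
    results = []
    last_end = None
    for ann in sorted(annotations, key=lambda a: a.begin):
        if ann.type not in slot_types:
            continue
        if last_end is not None and ann.begin - last_end > max_distance:
            slots = {}  # distance exceeded: start all over again
        slots[ann.type] = ann
        last_end = ann.end
        if len(slots) == len(slot_types):   # rule satisfied
            begin = min(a.begin for a in slots.values())
            end = max(a.end for a in slots.values())
            results.append((result_type, begin, end))
            slots, last_end = {}, None
    return results
```

For the eligible-currency example, the required slot types would be the "Eligible Currency" term annotation and the collected Currencies annotation, combined into an EligibleCurrency annotation when both occur close together.
      </p>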
      <p>To extract eligible currency, two experts are employed (see figure 3). The
first expert collects adjacent currency annotations. The second one combines the
"Eligible Currency" term and the collected currencies found by expert one, if
both annotations are found within a short distance. The resulting annotation
will span the relevant currency terms. This modular design allows us to reduce
the number of extractors and to re-use already made annotations for completely
different data points. In general, the pieces of information found in the examined
contracts are not independent of each other. We use business rules and other
constraints to validate and normalize the found results; e.g., the set of
currencies is well-defined. If the validation fails, or the normalization repairs
some value due to business rules, a corresponding message can be attached to the
annotation to inform the reviewer.</p>
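      <p>
The validation and normalization step can be sketched as follows (the variant table and message texts are hypothetical; the real well-defined currency set and business rules come from the deployment's configuration):

```python
# Hypothetical normalization table mapping raw (possibly OCR-damaged)
# tokens to canonical values; a real system would load this from
# configuration and cover the full currency list.
CURRENCY_VARIANTS = {
    "USD": "USD", "US0": "USD", "U5D": "USD",   # common OCR confusions
    "EUR": "EUR", "EUP": "EUR",
    "BASE CURRENCY": "Base currency",
}

def normalize_currency(raw):
    """Return (normalized value, message). If normalization had to
    repair the value, or validation fails, a message that can be
    attached to the annotation for the reviewer is returned."""
    key = raw.strip().upper()
    if key in CURRENCY_VARIANTS:
        value = CURRENCY_VARIANTS[key]
        msg = None if key == value.upper() else f"repaired '{raw}' to '{value}'"
        return value, msg
    return None, f"validation failed for '{raw}'"
```
      </p>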
      <p>2.2 Extraction of tables</p>
      <p>We define a table as multi-dimensional, structured data present in a document
either in a classical tabular layout, or defined in a series of sentences or
paragraphs in free-text form (as in figure 4). We aim at extracting tables of both
structure types and intermediate formats (e.g. as in figure 1(b)) from the
document's OCR output at character level only. In our application, table extraction
extends the simple-valued field extraction: the basic input for a table expert
is a document annotated with simple-valued fields and intermediate annotations.
The experts attempt to match sequences of simple annotations to a set of table
models. A table model is user-defined and describes which columns the resulting
extracted table should have. Each column can contain multiple types of simple
fields. Furthermore, columns can be configured to be optional and to accept
only unique or non-overlapping annotations. This allows for both more general
models with variable columns and fine-tuning of the accepted annotations.</p>
      <p>The process of detecting tables by the table expert (see figure 4 for an
example) begins with collecting all accepted annotations for a model, within a
predefined range or until a table-stop annotation is found, into a list sorted by
order of appearance. For each such list, several filling strategies are employed.
A filling strategy addresses the problem that multiple columns may accept the
same types of annotations. If elements appear row-wise or column-wise, the
corresponding strategies will recover the correct table, also compensating for
some errors from omitted table elements. In mixed cases, adding a new table
cell to the shortest relevant column is used as a fall-back strategy. Each strategy
is evaluated using the fraction of cells filled in the resulting table, c, and the
filling-strategy-specific score s. The latter score measures how well the
annotations match the expectations of the filling strategy. The table which maximizes
sf = c · s is annotated as a candidate if sf is above a predefined threshold. The
table expert is implemented as an analysis engine. Configuration encompasses the
columns describing the table model, the distance and scoring thresholds, and the
set of filling strategies to be evaluated. The output is a table-type annotation,
which in turn contains several table rows, each containing simple fields as cells.</p>
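      <p>
A row-wise filling strategy and the candidate score sf = c · s can be sketched as follows (illustrative Python; column models are reduced to sets of accepted cell types, and the strategy-specific score s is passed in as a number):

```python
def fill_row_wise(cells, columns):
    """Row-wise filling strategy sketch: walk the collected cell
    annotations in document order, assign each to the first column
    that accepts its type, and start a new row whenever the target
    column is already filled in the current row."""
    rows, current = [], {}
    for cell_type, value in cells:
        col = next((i for i, accepted in enumerate(columns)
                    if cell_type in accepted), None)
        if col is None:
            continue                 # annotation not accepted by this model
        if col in current:           # column already filled: new row
            rows.append(current)
            current = {}
        current[col] = value
    if current:
        rows.append(current)
    return rows

def table_score(rows, n_columns, strategy_score):
    """sf = c * s: fraction of filled cells times the strategy score."""
    total = len(rows) * n_columns
    filled = sum(len(r) for r in rows)
    c = filled / total if total else 0.0
    return c * strategy_score
```

A candidate table would be kept only if its sf value exceeds the configured threshold; column-wise and shortest-column strategies would fill the same collected list differently and be scored the same way.
      </p>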
      <p>Multiple table experts may be used to generate candidate tables for a single
target, and candidates may occur in several locations in a document. Usually,
the correct location gives rise to tables with certain properties, e.g. short, dense
tables. This is exploited by a feature-based selection of the optimal table
candidate. We model this using both general-purpose features (e.g. size and the
number of empty cells) as well as domain-specific features. The table with the
highest weighted sum of score features is selected as the final output. The weights
can either be user-defined or fitted using a formal optimization model.</p>
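      <p>
The weighted-sum selection over candidate tables can be sketched as follows (feature names and weights are illustrative, not the system's actual feature set):

```python
def select_best_table(candidates, weights):
    """Feature-based selection sketch: each candidate carries a dict
    of feature values; the candidate with the highest weighted sum
    of its features wins. Unknown features get weight 0."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(candidates, key=lambda c: score(c["features"]))
```
      </p>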
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We composed a document set containing 449 documents (see
tinyurl.com/csa-example for a public sample document) to measure the
extraction quality of our system. These documents are from various customers and
represent as many variants of different wordings and layouts as possible.</p>
      <p>With our customers we agreed upon certain quality gates that the automatic
extraction system has to meet. Due to the nature of the contracts, it is much more
important to achieve a high precision of the extracted data than a high recall. For
simple fields the gate's threshold is 95% precision and 80% recall. Table cells
are more difficult to extract since the OCR component not only mis-recognizes
individual characters but also makes errors on the structure of a table. For table
cells, our goal is to have a high recall, since errors within a structured table are
easier for a human reviewer to detect and correct than simple-field errors. Table 1
(columns: insertions, deletions, substitutions, correct, precision, recall; rows:
simple fields and table cells) shows our results against a manually created ground
truth. The numbers represent the total number of data points and errors,
respectively, over all of our documents. In total, we meet our gate criterion for
simple fields. Precision can be as low as 33% for rare fields, where fitting
appropriate data experts is hard. In contrast, for frequent fields, precision may
exceed 99%. In principle, the same is true for recall, with both maximum and
minimum lower, due to our target criteria. For table cells, the precision needs
improvement, mainly due to the OCR's structural errors like swapping rows within
a table or switching between row-wise and column-wise recognition in one table.
This is especially true for tables which are complex with respect to both layout
and contents, like the collateral eligibility table in figure 1(b). Here, precision
and recall are 84.4% and 80.2%, respectively. In contrast, structurally simple
tables, like the interest rate table (see figure 4 for an example), can be extracted
with much higher confidence (97.4% precision and 90.8% recall).</p>
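      <p>
Assuming the common convention that insertions and substitutions count against precision while deletions and substitutions count against recall (the exact accounting is not spelled out here), the reported measures derive from the error counts as:

```python
def precision_recall(correct, insertions, deletions, substitutions):
    """Derive precision and recall from error counts against ground
    truth (assumed convention: insertions/substitutions hurt
    precision, deletions/substitutions hurt recall)."""
    precision = correct / (correct + insertions + substitutions)
    recall = correct / (correct + deletions + substitutions)
    return precision, recall
```
      </p>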
    </sec>
    <sec id="sec-4">
      <title>Conclusion and outlook</title>
      <p>This article presents a system to automatically extract simple data points and
tables from OTC contract images. The system consists of an OCR component
and a hierarchical set-up of small modular extractors, either capturing (noisy)
text or combining already annotated clues using a slot-filling strategy. Our
experiments are conducted on an in-house contract collection, resulting in a
precision of 97% (recall 93%) on simple fields and a precision of 89% (recall 81%)
on table cells. While the evaluation we conducted is limited, we expect overfitting
to be moderate. The legal nature of the contracts limits the layout and wording
options. Our next steps include the introduction of a confidence score on
data-point level and the use of statistical classification methods for selecting
the best-suited table model.</p>
      <p>Acknowledgement. We would like to thank our partner Rule Financial for
providing the data model and for their assistance in understanding the documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Srikanth</given-names>
            <surname>Ramaka</surname>
          </string-name>
          .
          <article-title>Unsupervised ontology-based semantic tagging for knowledge markup</article-title>
          .
          <source>In Proceedings of the Workshop on Learning in Web Search at the International Conference on Machine Learning</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Hamish</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>GATE, a general architecture for text engineering</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>254</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>David</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lally</surname>
          </string-name>
          .
          <article-title>UIMA: an architectural approach to unstructured information processing in the corporate research environment</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>327</fpage>
          -
          <lpage>348</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Kyungduk</given-names>
            <surname>Kim</surname>
          </string-name>
          et al.
          <article-title>A frame-based probabilistic framework for spoken dialog management using dialog examples</article-title>
          .
          <source>In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Conditional random fields: probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Machine Learning</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ron</given-names>
            <surname>Meir</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gunnar</given-names>
            <surname>Rätsch</surname>
          </string-name>
          .
          <article-title>An introduction to boosting and leveraging</article-title>
          .
          <source>In Advanced lectures on machine learning</source>
          , pages
          <fpage>118</fpage>
          -
          <lpage>183</lpage>
          . Springer,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>David</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xing</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Table extraction using conditional random fields</article-title>
          .
          <source>In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>235</fpage>
          -
          <lpage>242</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Pallavi</given-names>
            <surname>Pyreddy</surname>
          </string-name>
          and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Tintin: A system for retrieval in text tables</article-title>
          .
          <source>In Proceedings of the second ACM international conference on Digital libraries</source>
          , pages
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Lev</given-names>
            <surname>Ratinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In Proceedings of the thirteenth conference on Computational Natural Language Learning</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Soderland</surname>
          </string-name>
          .
          <article-title>Learning information extraction rules for semi-structured and free text</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>34</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>272</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Nallapati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Legal claim identification: information extraction with hierarchically labeled data</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on the Semantic Processing of Legal Texts</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Suzanne Liebowitz</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard</given-names>
            <surname>Fritzson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jon A</given-names>
            <surname>Pastor</surname>
          </string-name>
          .
          <article-title>Extraction of data from preprinted forms</article-title>
          .
          <source>Machine Vision and Applications</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>222</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Viola</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mukund</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          .
          <article-title>Learning to extract information from semi-structured text using a discriminative context free grammar</article-title>
          .
          <source>In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>330</fpage>
          -
          <lpage>337</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>