<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Unstructured Information Management Architecture (UIMA) 3rd UIMA@GSCL Workshop</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jean-Cedric Chappelier</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pei Chen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joan Codina</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Padó</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Puppe</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Telefont</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingo Thon</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Toepfer</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>23</volume>
      <issue>2013</issue>
      <fpage>25</fpage>
      <lpage>66</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Preface</title>
      <p>Copyright © 2013 for the individual papers by the papers' authors. Copying
permitted only for private and academic purposes. This volume is published and
copyrighted by its editors.</p>
      <p>For many decades, NLP has suffered from low software engineering standards,
causing a limited degree of re-usability of code and interoperability of different
modules within larger NLP systems. While this did not really hamper success
in limited task areas (such as implementing a parser), it caused serious
problems for the emerging field of language technology, where the focus is on building
complex integrated software systems, e.g., for information extraction or machine
translation. This lack of integration has led to duplicated software development,
work-arounds for programs written in different (versions of) programming
languages, and ad-hoc tweaking of interfaces between modules developed at
different sites.</p>
      <p>In recent years, the Unstructured Information Management Architecture
(UIMA) framework has been proposed as a middleware platform which offers
integration by design through common type systems and standardized
communication methods for components analysing streams of unstructured
information, such as natural language. The UIMA framework offers a solid processing
infrastructure that allows developers to concentrate on the implementation of
the actual analytics components. An increasing number of members of the NLP
community have thus adopted UIMA as a platform facilitating the creation of
reusable NLP components that can be assembled to address different NLP tasks
depending on their order, combination and configuration.</p>
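<p>The component model described above can be sketched in a few lines. The following is a conceptual illustration only, not the actual UIMA API (all class and method names are ours): annotators read and write typed annotations over a shared, CAS-like store, so they can be reordered and recombined freely.</p>

```python
# Conceptual sketch only -- NOT the real UIMA API. A shared, CAS-like
# annotation store lets independently developed components interoperate:
# each annotator reads and writes typed annotations over the same text.

class Cas:
    """Holds the document text plus typed (type, begin, end) annotations."""
    def __init__(self, text):
        self.text = text
        self.annotations = []

    def select(self, type_name):
        return [a for a in self.annotations if a[0] == type_name]

class Tokenizer:
    def process(self, cas):
        pos = 0
        for tok in cas.text.split():
            begin = cas.text.index(tok, pos)
            cas.annotations.append(("Token", begin, begin + len(tok)))
            pos = begin + len(tok)

class CapitalizedTagger:
    """Depends only on the Token type, not on which tokenizer produced it."""
    def process(self, cas):
        for _, b, e in cas.select("Token"):
            if cas.text[b].isupper():
                cas.annotations.append(("Capitalized", b, e))

def run_pipeline(text, components):
    cas = Cas(text)
    for component in components:   # order/configuration is just a list here
        component.process(cas)
    return cas

cas = run_pipeline("UIMA components are Reusable",
                   [Tokenizer(), CapitalizedTagger()])
print(len(cas.select("Token")), len(cas.select("Capitalized")))  # prints: 4 2
```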
      <p>This workshop aims at bringing together members of the NLP community
– users, developers or providers of either UIMA components or UIMA-related
tools – in order to explore and discuss the opportunities and challenges in using
UIMA as a platform for modern, well-engineered NLP.</p>
      <p>This volume contains the proceedings of the 3rd UIMA workshop to
be held under the auspices of the German Language Technology and
Computational Linguistics Society (Gesellschaft für Sprachverarbeitung und
Computerlinguistik – GSCL) in Darmstadt, September 23, 2013. From 11 submissions, the
programme committee selected 7 full papers and 2 short papers. The organizers
of the workshop wish to thank all people involved in this meeting – submitters
of papers, reviewers, GSCL staff and representatives – for their great support,
rapid and reliable responses, and willingness to act on very sharp timelines. We
appreciate their enthusiasm and cooperation.</p>
    </sec>
    <sec id="sec-2">
      <title>September 2013</title>
      <p>Peter Kluegl, Richard Eckart de Castilho, Katrin Tomanek (Eds.)</p>
      <sec id="sec-2-1">
        <title>Program Committee</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>University of Manchester</title>
      <p>KU Leuven
Nuance Deutschland
University of Bielefeld
IBM Thomas J. Watson Research Center
University of Colorado
Technische Universität Darmstadt
Averbis
Technische Universität Darmstadt
Temis Deutschland
IBM Deutschland
FSU Jena
University of Nantes
IBM Deutschland
Vassar College
University of Würzburg
Carnegie Mellon University
Averbis
IBM Thomas J. Watson Research Center
University of Würzburg
Averbis
National ICT Australia
University of Helsinki
University of Duisburg-Essen</p>
      <sec id="sec-3-1">
        <title>Additional Reviewers</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Roman Klinger</title>
    </sec>
    <sec id="sec-5">
      <title>University of Bielefeld</title>
      <p>Storing UIMA CASes in a relational database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10</p>
      <p>Georg Fette, Martin Toepfer and Frank Puppe
CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration 14
Elmer Garduno, Zi Yang, Avner Maiberg, Collin McCormack, Yan Fang and Eric
Nyberg
Aid to spatial navigation within a UIMA annotation index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18</p>
      <p>Nicolas Hernandez
Using UIMA to Structure An Open Platform for Textual Entailment . . . . . . . . . . . . . . . . . . . . . 26</p>
      <p>Tae-Gil Noh and Sebastian Padó
Bluima: a UIMA-based NLP Toolkit for Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34</p>
      <p>Renaud Richardet, Jean-Cedric Chappelier and Martin Telefont
Sentiment Analysis and Visualization using UIMA and Solr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Carlos Rodríguez-Penagos, David García Narbona, Guillem Massó Sanabre, Jens
Grivolla and Joan Codina
Extracting hierarchical data points and tables from scanned contracts . . . . . . . . . . . . . . . . . . . . 50</p>
      <p>Jan Stadermann, Stephan Symons and Ingo Thon
Constraint-driven Evaluation in UIMA Ruta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Andreas Wittek, Martin Toepfer, Georg Fette, Peter Kluegl and Frank Puppe</p>
      <sec id="sec-5-1">
        <title>Keynote: Apache clinical Text Analysis and Knowledge Extraction System (cTAKES)</title>
        <p>
The presentation will focus on methods and software development behind the
cTAKES platform. An overview of the modules will set the stage, followed by
more in-depth discussion of some of the methods and evaluations of select
modules. The second part of the presentation will shift to software development
topics such as optimization and distributed computing including UIMA
integration, UIMA-AS, as well as our plans for UIMA-DUCC integration. A live
demo of cTAKES will conclude the talk.</p>
        <sec id="sec-5-1-1">
          <title>About the speakers</title>
          <p>Pei Chen is a Vice President of the Apache Software Foundation, leading the
top-level cTAKES project1. He is also a lead application development
specialist at the Informatics Program at Boston Children's Hospital/Harvard Medical
School. Mr. Chen's interests lie in building practical applications using machine
learning techniques. He has a passion for the end-user experience and has a
background in Computer Science/Economics. Mr. Chen is a firm believer in the
open source community, contributing to cTAKES as well as other Apache
Software Foundation projects.</p>
          <p>Guergana Savova, Ph.D. is a member of the faculty at Harvard Medical School
and Children's Hospital Boston. Her research interest is in natural language
processing (NLP), especially as applied to the text generated by physicians (the
clinical narrative), focusing on higher-level semantic and discourse processing,
which includes topics such as named entity recognition, event recognition,
relation detection, and classification including co-reference and temporal relations.
The methods are mostly machine learning, spanning supervised, lightly
supervised, and completely unsupervised. Her interest is also in the application of
NLP methodologies to biomedical use cases. Dr. Savova has been leading the
development and is the principal architect of cTAKES. She holds a Master of
Science in Computer Science and a PhD in Linguistics with a minor in Cognitive
Science from the University of Minnesota.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>A Model-driven approach to NLP programming with UIMA</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Alessandro Di Bari, Alessandro Faraotti, Carmela Gambardella, and Guido Vetere</title>
      <p>IBM Center for Advanced Studies of Trento</p>
      <p>Piazza Manci 1, Povo di Trento
Abstract. In Natural Language Processing, more complex business use
cases and shorter delivery times drive a growing need for smoother, more
flexible and faster implementations. This trend also requires integrating
and orchestrating different functionalities delivered by services
belonging to different technological platforms. All these needs imply raising the
level of abstraction for NLP component development. In this paper we
present a Model Driven Architecture approach suitable to develop an
open and interoperable UIMA-based NLP stack. By decoupling UIMA
NLP models from other solution-specific platforms and services, we
obtain major architectural improvements.</p>
      <sec id="sec-6-1">
        <title>Introduction</title>
        <p>As Natural Language Processing (NLP) approaches complex tasks such as
Question Answering or Dialog Management, the capability for NLP tools to
seamlessly interoperate with other software services, such as knowledge bases or rules
engines, becomes crucial. Such a level of integration may require linguistic models
to be shared among a variety of different platforms, each of which comes with
its own information representation language. Platforms like UIMA1 or GATE2
consist of middleware and tools for designing and pipelining NLP specific tasks,
including support for modeling data structures for text annotation, such as
lexical, morphological and syntactic features, which may be embedded in
interprocess communication protocols. However, while perfectly suited for
annotation purposes, NLP specific schema languages, such as the UIMA Type System,
fall short of fulfilling solution-level modeling needs. Model-Driven software
Architectures (MDA), on the other hand, are specifically aimed at tackling the
complexity of modern software infrastructures, with emphasis on the integration
and the orchestration of different technological platforms. The MDA approach
is based on providing formal descriptions (models) of requirements, interactions,
data structures, protocols, and many other aspects of the desired system, which
are automatically turned into technical resources, such as schemes and software
modules, by activating transformation rules.
1 http://uima.apache.org/
2 http://gate.ac.uk/</p>
        <p>
          Based on this consideration, we adopted an MDA approach to develop a
“Watson ready”3, UIMA-based NLP stack for Italian, as part of the activity
of the newborn IBM Language &amp; Knowledge Center for Advanced Studies of
Trento4. We wanted our stack to be as open and interoperable as possible, to
help users leverage the availability of NLP resources and tools in the Open
Source / Open Data space. In addition, our stack aims at being independent
of language-specific issues and domains, to facilitate its reuse across projects
and within our (multinational) Company. The basic idea was to design a highly
modularized general model including all the required structures, and to obtain
technical platform-specific resources from a suitable set of model-to-model
transformations. Also, we embraced the idea of abstracting semantic information away
from the UIMA Type System, as in [
          <xref ref-type="bibr" rid="ref18 ref5">5</xref>
          ] and in [
          <xref ref-type="bibr" rid="ref20 ref7">7</xref>
          ], and evaluated the benefit of
representing such kind of information by specific means. In sum, we looked at
UIMA as a well-suited platform for linguistic analysis, which allows the
integration of analytic components into managed workflow pipelines, but regarded
the UIMA Type System as a schema specification for that platform, rather than
as a general modeling language for any NLP-based solution.
        </p>
        <p>Here we present an overview of the basic ideas behind our approach, introduce
our project, and discuss future directions. At the present stage of development,
we can share our vision on MDA positioning and motivation with respect to NLP
development (section 3), and we can report our first implementation experiences
(section 4). Finally, we outline some related topics and introduce future work.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Motivating Scenario</title>
        <p>Natural Language based solutions may require the NLP stack to cooperate with
other components in a complex system. Such cooperation typically involves data
exchanges with reference to a shared information model. Figure 1 shows
the integration of an NLP stack with a Knowledge Base (e.g. an Ontology-based
Data Access System) and a Rule Engine.</p>
        <p>A UIMA-based NLP pipeline produces an annotated text (step 1 in
Figure 1) contained in a UIMA CAS (Common Analysis Structure). A wrapper
of the UIMA Type System defines all the operations needed for a consumer (the
Rule Engine in this case) in order to access the CAS and invoke the
appropriate operations within the cooperating subsystem when needed (see 4.2). When
developing and maintaining the solution, an Engineer builds a rule set (see step
3) in order to process linguistic structures and interact with a Knowledge Base
(step 4), which, in turn, uses the annotated text to store assertions as the result
of an Information Extraction process (step 5). In a separate flow, the Knowledge
Base can be queried by a User through a Question Answering System based on
a suitable query language (step 6). The integration of all components involved
is guaranteed by a common abstract model (Platform Independent Model) that
contains the overall conceptualization of the system. The transition from one
3 www.ibm.com/watson/
4 www.ibm.com/ibm/cas/
platform-specific data structure to another is handled by a set of
Model-to-Model transformations (steps 7 and 8). The figure also shows the link to legacy
(possibly huge) conceptual models, such as the KB ontology (step 9).</p>
      </sec>
      <sec id="sec-6-3">
        <title>Model Driven Architecture for NLP</title>
        <p>
          Model Driven Architecture (MDA) [
          <xref ref-type="bibr" rid="ref19 ref6">6</xref>
          ] is a development approach, strictly based
on formal specifications of information structures and behaviors, and their
semantics. MDA is managed by Object Management Group (OMG)5 based on
several modeling standards such as: Unified Modeling Language (UML)6,
Meta Object Facility (MOF), XML Metadata Interchange (XMI) and others. MDA
supports Model Driven Development/Engineering (MDD, MDE).
        </p>
        <p>The key idea behind MDA is to provide a higher level of abstraction so that
software can be fully designed independently from the underlying technological
platform. More formally, MDA defines three macro “modeling” layers:
– Computation Independent Model (CIM)
– Platform Independent Model (PIM)
– Platform Specific Model (PSM)</p>
        <p>The first one can be related to a Business Process Model and does not
necessarily imply the existence of a system that automates it. The PIM is a model
that is independent from any technical platform; the third (PSM) layer is the
actual implementation of the model with respect to a given technology and it is
automatically derived from the PIM. Notice that the PIM allows a
comprehensive representation of the structure and behavior of the system being developed.
5 http://omg.org/
6 http://www.uml.org/
The modeling language is typically UML or EMF7, but it could actually be any
other Domain Specific Language (DSL).</p>
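<p>The layering can be illustrated with a toy transformation (the model encoding and generator functions below are our own sketch, not the actual IBM tooling): one platform-independent class description is turned, by two transformation rules, into two platform-specific artifacts — a UIMA-style type system fragment and a relational schema.</p>

```python
# Hedged sketch of PIM -> PSM transformations. The dict encoding and both
# generators are illustrative inventions; only the target notations (UIMA
# type descriptors, SQL DDL) mirror real platforms.

pim = {
    "name": "Person",
    "parent": "uima.tcas.Annotation",
    "features": [("fullName", "String"), ("age", "Integer")],
}

def to_uima_type(model):
    """PSM 1: a simplified UIMA-style type system fragment."""
    feats = "\n".join(
        '    <feature name="{}" range="uima.cas.{}"/>'.format(n, t)
        for n, t in model["features"]
    )
    return ('<typeDescription name="{}" supertype="{}">\n{}\n</typeDescription>'
            .format(model["name"], model["parent"], feats))

def to_sql_table(model):
    """PSM 2: a relational schema generated from the same PIM."""
    type_map = {"String": "TEXT", "Integer": "INTEGER"}
    cols = ", ".join("{} {}".format(n, type_map[t])
                     for n, t in model["features"])
    return "CREATE TABLE {} (id INTEGER PRIMARY KEY, {});".format(
        model["name"], cols)

print(to_uima_type(pim))
print(to_sql_table(pim))
```

Because both artifacts derive from the same source model, a change to the PIM propagates to every platform by re-running the transformations.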
        <p>
          Developing powerful NLP tasks, such as Question Answering systems,
requires combining a great variety of analytic components, which is what UIMA
has been designed for. We consider UIMA the standard solution for document
workflow analysis. Within this framework, MDD tools can be effectively used to
better manage the UIMA Type System. In particular, we decided to look at it
as a PSM dedicated to text annotation. The motivation for leveraging MDD (in
the NLP field) can be summarized as follows:
– Formalization: MDA languages are well studied in logic, and reasoning
mechanisms can be developed upon them [
          <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ]
– Expressiveness: MOF meta-modeling allows great and well-founded
expressiveness [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ], including modeling behaviors.
– Support: The availability of tools, including diagramming and code
generation, improves software life-cycle and team collaboration.
        </p>
        <p>In particular, with respect to our architecture, we modeled UIMA
annotations by defining classes rather than just (data) types, so that a consumer is able
to invoke operations designed for those objects. Access to UIMA annotations is
then achieved by means of automatically generated wrappers. Another
motivation for a model driven approach was the need to represent complex linguistic
data, and exploit existing tooling and resources for generating training data for
a statistical parser.</p>
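<p>The wrapper idea can be sketched as follows (all names are illustrative placeholders, not the generated code): the PIM declares an operation on the annotation class, and a generated wrapper exposes that operation while delegating storage to the raw, platform-specific object, so consumers never touch the platform type directly.</p>

```python
# Illustrative sketch of the generated-wrapper pattern (our own names).
# RawAnnotation stands in for the platform object (e.g. a JCas-generated
# class); the wrapper implements the PIM-level interface and operations.

class RawAnnotation:
    """Platform-specific storage: text plus character offsets."""
    def __init__(self, text, begin, end):
        self.text, self.begin, self.end = text, begin, end

class TokenWrapper:
    """Generated wrapper: PIM interface, delegating to the raw object."""
    def __init__(self, raw):
        self._raw = raw

    def covered_text(self):      # structural access, delegated
        return self._raw.text[self._raw.begin:self._raw.end]

    def is_acronym(self):        # behavior declared at the PIM level
        t = self.covered_text()
        return len(t) > 1 and t.isupper()

tok = TokenWrapper(RawAnnotation("UIMA rocks", 0, 4))
print(tok.covered_text(), tok.is_acronym())   # prints: UIMA True
```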
        <p>In sum, we tried to exploit the maturity and flexibility of MDD tools while
keeping up the power of UIMA as a framework for component integration,
pipeline execution, and workflow management in general. As the PIM language,
we chose EMF because it is already integrated with UIMA and provides
powerful and mature model-driven features. Once the code is also generated (by
UIMA JCasGen), the type system corresponds to an implementation of a
(business) domain model, limited to the structural aspects (as opposed to behavioral
aspects).</p>
        <p>At PIM level, we also have to represent those properties that, once
transformed against a target model, give specific characteristics to that model. For
instance, in order to generate the UIMA Type System (PSM) starting from the
PIM, we have to represent on the source model whether a class (that is, a root
in a hierarchy on the PIM model) will be generated as a UIMA annotation or
not (UIMA TOP). Here we have taken two possible scenarios into account:
– Having a UML PIM, this specification is easily accomplished by using a
UML profile8. Profiles define stereotypes that can be further structured with
custom properties. This way, we have a generic “Unstructured Information”
profile that, at a minimum, encompasses an Annotation stereotype; thus a class
that is intended to become an annotation will simply be “marked” with this
stereotype.
7 http://www.eclipse.org/modeling/emf/
8 http://www.omg.org/spec/#M&amp;M
– Having an EMF PIM (such as our current implementation), we can represent
the same thing as an EMF annotation. Therefore (we apologize for the
terminological conflict), we will have a class annotated as Annotation.</p>
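<p>A hypothetical sketch of this marking mechanism (the decorator and generator below are our own illustration, not EMF or UML tooling): a class-level marker plays the role of the stereotype, and the generator consults it to decide whether a class becomes an annotation type or an ordinary type.</p>

```python
# Sketch of the "Annotation stereotype" idea with a plain class marker.
# The decorator name and generator are illustrative; only the UIMA
# supertype names (uima.tcas.Annotation, uima.cas.TOP) are real.

def annotation(cls):
    """Plays the role of the UML stereotype / EMF annotation marker."""
    cls._is_annotation = True
    return cls

@annotation
class Token:                 # marked: becomes a document annotation
    pass

class LexiconEntry:          # unmarked: stays a plain structure
    pass

def generate_supertype(cls):
    """The transformation inspects the marker to pick the UIMA supertype."""
    if getattr(cls, "_is_annotation", False):
        return "uima.tcas.Annotation"
    return "uima.cas.TOP"

print(generate_supertype(Token), generate_supertype(LexiconEntry))
# prints: uima.tcas.Annotation uima.cas.TOP
```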
        <p>In any case, a class stereotyped as Annotation on the PIM will take the role of
a generic annotation for document analysis, independently from the underlying
framework.</p>
        <p>The main benefit of our approach is the ability to represent NLP objects
independently from any particular implementation: we are using different
(generated) PSMs (that are better explained in section 4), all deriving from the
starting (PIM) model, as shown in Figure 2.</p>
        <p>These benefits certainly come at a price, which is essentially the
cost of developing the necessary transformations. However, following basic
assumptions of the MDD approach, we estimate that those costs pay off,
especially when heterogeneous components have to be integrated,
development is managed iteratively, and models are subject to high volatility.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Model Driven Implementation Aspects</title>
        <p>To clarify how we are leveraging the Model Driven approach, we list
here the artifacts (PSMs and code) we are generating through the appropriate
transformations that we have developed.
Starting from our “application” model:
– UIMA type system (we modified the existing transformation from EMF in
order to avoid any further modification on the UIMA type system)
– EMF wrapper of UIMA type system
– this wrapper also acts as the input for creating the model for the Rule engine
as explained below
Starting from (our) models of common standard data formats for parser training, such as
CoNLL, PENN and others, we generated all the necessary (OpenNLP-specific) data
for training the parser on:
– Tokenization
– Named Entities
– Part of Speech tagging
– Chunking
– Parsing</p>
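<p>As a rough illustration of such a format transformation (the column layout below is a simplified CoNLL-style format, and the converter is our own sketch, not the authors' JET templates): token/POS lines are flattened into the one-sentence-per-line word_TAG layout used by OpenNLP's POS trainer.</p>

```python
# Hedged sketch: convert simplified CoNLL-style columns (token TAB tag,
# blank line between sentences) into OpenNLP POS training lines
# (one sentence per line, tokens written as word_TAG).

def conll_to_opennlp_pos(conll_text):
    sentences, current = [], []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line:                       # blank line = sentence boundary
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        token, pos = line.split("\t")[:2]
        current.append("{}_{}".format(token, pos))
    if current:                            # flush the last sentence
        sentences.append(" ".join(current))
    return "\n".join(sentences)

sample = "UIMA\tNNP\nrocks\tVBZ\n\nIt\tPRP\nworks\tVBZ\n"
print(conll_to_opennlp_pos(sample))
```

Keeping such conversions in one small, regenerable template is what makes swapping the parser or the source format cheap.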
        <p>To represent the model (PIM), we use the Eclipse Modeling Framework9
(EMF), which represents a de facto Java-based standard for meta-modeling.
Informally, we may say EMF represents a subset of UML (the structural part)
with very precise semantics for code generation. In the future, we could move
this representation to a profiled UML, as mentioned above (see section 3).
Furthermore, EMF offers very powerful generation features. Summarizing, in the
current implementation we use EMF in two ways:</p>
      <p>1. A language to represent the model. 2. A PIM model to generate different
target PSMs.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4.1 NLP Parser</title>
      <p>
The NLP Parser component is implemented using Apache OpenNLP10 and
UIMA11; it is based on a UIMA Type System built from the Syntax and the
Abstract models using the UIMA transformation utility. The training corpora
for the parser have to be provided in a specific format required by OpenNLP.
Since the data that we had available for training were in standard formats such
as PENN12, CONLL13 and others, some transformations were required. Ecore
models have been created for the purpose of representing source formats.
Furthermore, some simple JET14 transformations have been developed in order to
generate our corpora (in specific OpenNLP formats).</p>
      <p>Compared to other solutions, this makes our infrastructure extremely flexible:
should the parser be replaced or the data formats changed, the only change
we will have to make is to modify the JET template accordingly.</p>
      <p>4.2 Type System EMF Wrapper
As anticipated in section 2, in the higher layers of our architecture, we have a Rule
Engine that acts as a reasoner on annotation objects coming from the UIMA
pipeline. We wanted this layer to be able to call operations implemented on those
objects (as explained in section 3), with those objects always implementing the
exact interfaces of the (Ecore) PIM model. Given these requirements, we developed
a transformation that generates a wrapper of the UIMA type system and that
9 http://www.eclipse.org/modeling/emf/
10 http://opennlp.apache.org/
11 http://uima.apache.org/
12 http://www.cis.upenn.edu/~treebank/
13 http://ilk.uvt.nl/conll/#dataformat
14 http://www.eclipse.org/modeling/m2t/?project=jet
fully reflects the starting PIM model, including operations. Once implemented,
the code will also be preserved across future re-generations, thanks to the merging
capabilities of this transformation. Thus, as shown in Figure 1, the Rule Engine
“consumes” instances of this wrapper, and still can access the underlying UIMA
annotation. We considered the possibility of directly adding these operations on
classes generated by UIMA (via JCAS generation utility) but this would not be
consistent with our model-driven approach since those operations would not be
part of a general, system-wide model.
</p>
      <p>4.3 Rule Engine
As far as the Rule Engine is concerned, we chose IBM Operational Decision
Manager (ODM)15. ODM rules have to be written against a specific model,
called Business Object Model (BOM), that allows user-friendly business rule
editing; ODM provides tools to set up a natural language vocabulary: users can
use it to write business rules in a pseudo-natural language. Once defined, the
rules are executed on a BOM-related Java implementation named Execution
Object Model (XOM). We obtained the BOM by reverse engineering the XOM,
and the XOM directly from Java classes (implementing the type system wrapper)
generated from our PIM (EMF) model. Therefore, the BOM model can be seen
as just another manifestation of our PIM model.
</p>
      <p>
        4.4 Knowledge Base
Our architecture is backed by a Knowledge Base Management System which
stores and reasons on information extracted from many sources. Leveraging
the Knowledge Model included in the PIM, we were able to integrate an external
pre-existing system, named ONDA (Ontology Based Data Access) [
        <xref ref-type="bibr" rid="ref16 ref3">3</xref>
        ]. ONDA
supports Ontology Based Data Access (OBDA) on OWL2-QL16, by ensuring
sound and complete conjunctive query answering with the same efficiency and
scalability as a traditional database [
        <xref ref-type="bibr" rid="ref15 ref2">2</xref>
        ]. Because the ONDA underlying Knowledge
Model was already designed with EMF, we simply adopted it in order to be
included in the PIM. This way, reasoning and query answering services have been
included in the PIM model as operations available to all other components (e.g.
the Rule Engine).
      </p>
      <sec id="sec-7-1">
        <title>Conclusion and future work</title>
        <p>We have outlined here an innovative approach to NLP development, based on
the idea of setting UIMA as the target platform in a Model-Driven development
process. A major benefit of this approach consists in giving NLP models a greater
value, especially in terms of generality, usability, and interoperability.
15 http://www-03.ibm.com/software/products/us/en/odm/
16 http://www.w3.org/TR/owl2-profiles/</p>
        <p>
          While developing this idea, we understood that a suitable Model-Driven
machinery for NLP should be supported by specific design patterns for concrete
models. In particular, the model we have developed has been abstracted both
from morphosyntactic specificity and from semantic aspects. The former
(including part-of-speech classes, genders, numbers, verbal tenses, etc) may
significantly vary among different languages; the latter (including concepts like
persons, events, places, etc) are related to specific application domains. By
decoupling these layers, we achieved a lightweight “generic” UIMA type system [
          <xref ref-type="bibr" rid="ref20 ref7">7</xref>
          ], we
designed a powerful generic model for morphosyntactic features, and we managed
ontological information with proper expressive means. Refining and extending
this model is part of our future plans.
        </p>
        <p>We implemented a first prototype of a Knowledge Base query system based
on the Eclipse Modeling Framework (EMF). For the future, we are considering
the possibility of representing the model in UML, in order to have a greater
representational power (such as modeling sequence diagrams).</p>
        <p>The work presented here is still at an early stage. More work is needed to
complete the linguistic model, for instance in the area of argument structures,
such as verbal frames. From an implementation standpoint, our priority is to
consolidate, improve and extend the set of Model-to-Model transformations, and
to further exploit MDD tools.</p>
        <sec id="sec-7-1-1">
          <title>Storing UIMA CASes in a relational database</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Georg Fette12, Martin Toepfer1, and Frank Puppe1</title>
      <p>1 Department of Computer Science VI, University of Wuerzburg,</p>
      <p>Am Hubland, Wuerzburg, Germany
2 Comprehensive Heart Failure Center, University Hospital Wuerzburg,</p>
      <p>Straubmuehlweg 2a, Wuerzburg, Germany
Abstract. In the UIMA text annotation framework the most common
way to store annotated documents (CAS) is by serializing the document
to XML and storing this XML in a file in the file system. We present a
framework to store CASes as well as their type systems in a relational
database. This not only provides a way to improve document
management but also the possibility to access and manipulate selected parts
of the annotated documents using the database’s index structures. The
approach has been implemented for MSSQL and MySQL databases.</p>
      <sec id="sec-8-1">
        <title>Introduction</title>
        <p>
          UIMA [
          <xref ref-type="bibr" rid="ref15 ref2">2</xref>
          ] has become a well known and often used framework for processing text
data. The main component of the UIMA infrastructure is the CAS (Common
Analysis Structure), a data structure which combines the actual data (the text
of a document), annotations on this data and the type system the annotations
are based on. In many UIMA projects CASes are stored as serialized XML-files
in file folders with the corresponding type system file in a separate location.
In this storage mode, the responsibility for keeping track of which CAS has to be
loaded with which type system lies with the programmer who wants to perform an
operation on specific documents. However, manual management of files in folders
on local machines or network folders can quickly become confusing and messy,
especially when projects get bigger. We present a framework to store CASes as
well as their corresponding type systems in a relational database. This storage
mode provides the possibility to access the data in a centralized, organized way.
Furthermore, the approach provides all the benefits that come along with relational
databases including search indices on the data, selective storage, retrieval and
deletion as well as the possibility to perform complex queries on the stored data
in the well-known SQL language.
        </p>
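<p>As a toy illustration of the idea (the schema below is our own minimal sketch, far simpler than the framework described in this paper): once annotations live in an indexed table, selective retrieval by annotation type becomes a plain SQL query instead of deserializing a whole XML file.</p>

```python
# Minimal, illustrative sketch only -- not the paper's actual schema.
# Annotations are rows in an indexed table; selective loading of one
# annotation type is a single SQL query.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT)")
con.execute("CREATE TABLE annotation ("
            "doc_id INTEGER, ann_type TEXT, begin_off INTEGER, end_off INTEGER)")
con.execute("CREATE INDEX idx_ann_type ON annotation(ann_type)")

con.execute("INSERT INTO document VALUES (1, 'UIMA stores CASes')")
con.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    [(1, "Token", 0, 4), (1, "Token", 5, 11), (1, "Token", 12, 17),
     (1, "Sentence", 0, 17)])

# Selective retrieval: load only the Token annotations of document 1.
tokens = con.execute(
    "SELECT begin_off, end_off FROM annotation "
    "WHERE doc_id = 1 AND ann_type = 'Token' ORDER BY begin_off").fetchall()
print(tokens)   # prints: [(0, 4), (5, 11), (12, 17)]
```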
        <p>
          The structure of the paper is as follows: Section 2 describes the related work,
Section 3 describes the technical details of the database storage mechanism,
Section 4 illustrates query possibilities using the database, Section 5 demonstrates
performance experiences with the framework and Section 6 concludes with a
summary of the presented work.
To the best of the authors' knowledge, the only existing approach that stores
CASes in a database is the Julielab DB Mapper [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ], which serializes CASes to a
PostgreSQL database. However, the mechanism does not store the CASes’ type
systems nor does it support features like referencing of annotations by features
or derivation of annotation types. Other approaches use indices to improve query
performance but do not allow to reconstruct the annotated documents from the
index (Lucene based: LUCAS [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ], Fangorn [
          <xref ref-type="bibr" rid="ref16 ref3">3</xref>
          ]; relational database based: XPath
[
          <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ], ANNIS [
          <xref ref-type="bibr" rid="ref20 ref7">7</xref>
          ]; proprietary index based: TGrep/TGrep2 [
          <xref ref-type="bibr" rid="ref19 ref6">6</xref>
          ], SystemT [
          <xref ref-type="bibr" rid="ref18 ref5">5</xref>
          ]). The
indices still require the documents to be stored in the file system. Furthermore, some
of the mentioned indices offer only specialized search capabilities (e.g. an emphasis
on parse trees) provided by the respective search index, and they cannot
search directly on the UIMA data structures. In contrast to these approaches, our
system allows searches on arbitrary type systems by formulating queries closely
related to the involved annotation and feature types.
        </p>
      </sec>
      <sec id="sec-8-2">
        <title>Database storage</title>
        <p>The storage mechanism is based on a relational database for which the table
model is illustrated in Figure 1. The schema can be subdivided into a document-related
part (left), an annotation instance part (middle) and a type-system-related
part (right). Documents are stored as belonging to a named collection and
can be manipulated (retrieved, deleted, etc.) as a group, e.g. deleting all
annotations of a specific type. Annotated documents can be handled individually
by loading/saving a single CAS or by processing a whole collection by creating
a collection reader/writer. Either way, any communication (loading/saving)
can (but need not) be parametrized so that only the desired annotation types are
loaded/saved, thus speeding up processing, reducing memory consumption
and facilitating debugging. A type system, instead of being stored in an
XML file and containing a fixed type system, can be retrieved from the database
in different task specific ways. One way is by requesting the type system which
is needed to load all the annotated documents belonging to a certain collection.
Other possibilities are by providing a set of desired type names or by providing
a regular expression determining all desired type names. The storage mechanism
is able to store the inheritance structures of UIMA type systems as well as
referencing of annotations by features of other annotations. For further information
on the technical aspects we refer to the documentation of the framework3.
</p>
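        <p>As an illustration of this table model, the following sketch builds a miniature version with SQLite; the table and column names are our assumptions for exposition, not the exact uima-sql schema.</p>

```python
# Illustrative sketch (not the exact uima-sql schema): a minimal relational
# layout with a document-related part, an annotation-instance part and a
# type-system part whose supertype column stores the inheritance structure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collection (collection_ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE document   (document_ID INTEGER PRIMARY KEY,
                         collection_ID INTEGER REFERENCES collection,
                         text TEXT);
CREATE TABLE annot_type (annot_type_ID INTEGER PRIMARY KEY, name TEXT,
                         supertype_ID INTEGER REFERENCES annot_type);
CREATE TABLE annot_inst (annot_inst_ID INTEGER PRIMARY KEY,
                         document_ID INTEGER REFERENCES document,
                         annot_type_ID INTEGER REFERENCES annot_type,
                         begin_pos INTEGER, end_pos INTEGER);
""")
conn.execute("INSERT INTO collection VALUES (1, 'demo')")
conn.execute("INSERT INTO document VALUES (1, 1, 'Dogs walk.')")
conn.execute("INSERT INTO annot_type VALUES (1, 'uima.tcas.Annotation', NULL)")
conn.execute("INSERT INTO annot_type VALUES (2, 'Token', 1)")  # Token inherits from Annotation
conn.execute("INSERT INTO annot_inst VALUES (1, 1, 2, 0, 4)")  # covers 'Dogs'

# group-wise manipulation, e.g. deleting all annotations of a specific type
conn.execute("DELETE FROM annot_inst WHERE annot_type_ID = "
             "(SELECT annot_type_ID FROM annot_type WHERE name = 'Token')")
```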
      </sec>
      <sec id="sec-8-3">
        <title>Querying</title>
        <p>A benefit of storing data in an SQL database is the database index and the
well-established SQL query standard. The database can be queried for counts of
occurrences of specific annotation types, counts of covered texts of annotations,
or even complex annotation structures in the documents. We want to exemplify
this with a query on documents which have been annotated with a dependency
parser using the type system shown in Figure 2.
3 http://code.google.com/p/uima-sql/</p>
        <p>Fig. 1. Schema of the relational database storing CASes and type systems.
&lt;typeDescription&gt;
&lt;name&gt;Token&lt;/name&gt;
&lt;supertypeName&gt;
uima.tcas.Annotation
&lt;/supertypeName&gt;
&lt;features&gt;
&lt;featureDescription&gt;
&lt;name&gt;Governor&lt;/name&gt;
&lt;rangeTypeName&gt;Token&lt;/rangeTypeName&gt;
&lt;/featureDescription&gt;&lt;/features&gt;
&lt;/typeDescription&gt;</p>
        <p>SELECT govText.covered FROM
annot_inst govToken, annot_inst_covered govText,
annot_inst baseToken, annot_inst_covered baseText,
feat_inst, feat_type WHERE
baseText.covered = 'walk' AND
baseToken.covered_ID = baseText.covered_ID AND
baseToken.annot_inst_ID = feat_inst.annot_inst_ID AND
feat_inst.feat_type_ID = feat_type.feat_type_ID AND
feat_type.name = 'Governor' AND
feat_inst.value = govToken.annot_inst_ID AND
govText.covered_ID = govToken.covered_ID</p>
        <p>Fig. 3. SQL query for governor tokens</p>
        <p>To query for all words governing the word walk, we have to look for tokens
with the desired covered text, find the tokens governing those tokens and return
their covered text. The SQL command for this task is shown in Figure 3. An
abstraction layer hiding this complexity could be put on top (such as a graph query
language), but even in the presented form, with standard SQL, the capabilities of
the database engine can serve as a useful tool to improve corpus analysis.</p>
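        <p>The Figure 3 query can be exercised end to end on a toy SQLite database; the table split and column names follow the figure, while the populated data is an invented example.</p>

```python
# Toy reproduction of the governor query from Figure 3 using SQLite.
# The schema follows the figure; the data ('must' governing 'walk') is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annot_inst_covered (covered_ID INTEGER PRIMARY KEY, covered TEXT);
CREATE TABLE annot_inst (annot_inst_ID INTEGER PRIMARY KEY, covered_ID INTEGER);
CREATE TABLE feat_type (feat_type_ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feat_inst (annot_inst_ID INTEGER, feat_type_ID INTEGER, value INTEGER);
INSERT INTO annot_inst_covered VALUES (1, 'must'), (2, 'walk');
INSERT INTO annot_inst VALUES (10, 1), (11, 2);
INSERT INTO feat_type VALUES (100, 'Governor');
INSERT INTO feat_inst VALUES (11, 100, 10);  -- Governor of 'walk' is token 10 ('must')
""")

QUERY = """
SELECT govText.covered FROM
annot_inst govToken, annot_inst_covered govText,
annot_inst baseToken, annot_inst_covered baseText,
feat_inst, feat_type WHERE
baseText.covered = 'walk' AND
baseToken.covered_ID = baseText.covered_ID AND
baseToken.annot_inst_ID = feat_inst.annot_inst_ID AND
feat_inst.feat_type_ID = feat_type.feat_type_ID AND
feat_type.name = 'Governor' AND
feat_inst.value = govToken.annot_inst_ID AND
govText.covered_ID = govToken.covered_ID
"""
governors = [row[0] for row in conn.execute(QUERY)]
```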
      </sec>
      <sec id="sec-8-4">
        <title>Performance</title>
        <p>To run a performance test on the storage engine we created a corpus of 1000
documents, each consisting of 1000 words. The words were taken from a
dictionary of 1000 randomly created words, each of 8 characters length. From each
document we created a CAS and added annotations so that each word was
covered, with the annotations covering 1 to 5 successive words. Each annotation
was given two features, one String feature with a value randomly taken from the
word dictionary and a Long feature containing a random number. All documents
were stored and then loaded again. This was done with the database engine as
well as with a local file folder on the same hard drive the database files were
located on. In a second experiment the same documents were loaded again
and we added an annotation of another type with a Long feature containing a
random number to each document. After adding the additional annotation the
documents were stored again. In a third experiment we wanted to query for the
frequencies of annotations covering each of the words from the word dictionary.
For file system storage this was done by accumulating the annotation counts
during an iteration over all serialized CASes, for database storage this was done
by performing a single SQL query for each of the words from the dictionary.</p>
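        <p>For concreteness, the synthetic corpus described above can be generated along these lines; this is our reconstruction of the setup, not the original benchmark code.</p>

```python
# Reconstruction of the synthetic benchmark corpus: 1000 documents of 1000 words
# drawn from a dictionary of 1000 random 8-character words, fully covered by
# annotations spanning 1-5 successive words, each carrying a random String
# feature (a dictionary word) and a random Long feature.
import random, string

random.seed(0)
dictionary = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(1000)]

def make_document(n_words=1000):
    words = random.choices(dictionary, k=n_words)
    annotations, i = [], 0
    while i < n_words:                               # cover every word
        span = min(random.randint(1, 5), n_words - i)
        annotations.append({"begin": i, "end": i + span,
                            "strFeat": random.choice(dictionary),
                            "longFeat": random.getrandbits(32)})
        i += span
    return words, annotations

corpus = [make_document() for _ in range(1000)]
```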
        <p>In Table 1 we can observe that the time needed for database storage is quite
long but reading is as fast as from the file system. Storing to the database during
the second experiment was faster than in the first one, because this time only the
additional annotations had to be incrementally stored. Storage to the file system
again performed about five times faster than to the database but the benefit of
being able to incrementally store only the additional annotations can be clearly
observed. Physical storage space consumption is larger for database storage, but
that should not pose a major problem as hard disk space is not an overly expensive
resource nowadays. Query performance in the database is about 20 times faster
than using file system storage, illustrating the benefit of the database approach.</p>
        <p>[Table 1: save, load and query performance for DB vs. file system storage]</p>
        <p>We have presented a framework to store/retrieve CASes and perform analysis
queries on them using a relational database. We examined the save, load and
query speed compared to regular file-based storage and presented examples of how
to use the database index structures to analyze annotations in the corpus. We
hope to be able to improve the storage speed of the database engine so that the
choice between file system storage and database storage will not be influenced
by the still quite large difference in speed performance.</p>
        <p>This work was supported by grants from the Bundesministerium fuer Bildung
und Forschung (BMBF01 EO1004).</p>
        <p>
          CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration
Elmer Garduno1, Zi Yang2, Avner Maiberg2, Collin McCormack3, Yan Fang4, and Eric Nyberg2
Abstract. To efficiently build data analysis and knowledge discovery pipelines, researchers
and developers tend to leverage available services and existing components by plugging them
into different phases of the pipelines, and then spend hours to days seeking the right
components and configurations that optimize the system performance. In this paper, we introduce
the CSE framework, a distributed system for a parallel experimentation test bed based on
UIMA and uimaFIT, which is general and flexible to configure and powerful enough to sift
through thousands of option combinations to determine which represents the best system
configuration.
To efficiently build data analysis and knowledge discovery “pipelines”, researchers and
developers tend to leverage available services and existing components by plugging them into different
phases of the pipelines [
          <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ], and then spend hours seeking the components and configurations
that optimize the system performance. The Unstructured Information Management Architecture
(UIMA) [
          <xref ref-type="bibr" rid="ref16 ref3">3</xref>
          ] provides a general framework for defining common types in the information system
(type system), designing pipeline phases (CPE descriptor), and further configuring the
components (AE descriptor) without changing the component logic. However, there is no easy way to
configure and execute a large set of combinations without repeated executions, while evaluating
the performance of each component and configuration.
        </p>
        <p>
          To fully leverage existing components, it must be possible to automatically explore the space
of system configurations and determine the optimal combination of tools and parameter settings
for a new task. We refer to this problem as configuration space exploration, which can be formally
defined as a constraint optimization problem. A particular information processing task is defined
by a configuration space, which consists of mt components that define each of the n phases with
corresponding configurations. Given a limited total resource capacity C and input set S,
configuration space exploration (CSE) aims to find the trace (a combination of configured components)
within the space that achieves the highest expected performance without exceeding C total cost.
Details on the mathematical definition and proposed greedy solutions can be found in [
          <xref ref-type="bibr" rid="ref19 ref6">6</xref>
          ].
        </p>
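        <p>As a minimal illustration of the search problem (the exhaustive baseline only; the greedy solutions of [6] are not reproduced here), a trace picks one configured component per phase, and exploration keeps the best-scoring trace within the budget C. The component names, costs and scores below are invented.</p>

```python
# Minimal sketch of configuration space exploration by exhaustive enumeration:
# a trace is one component per phase; we keep the best-performing trace whose
# total cost stays within the capacity C. All numbers here are invented.
from itertools import product

# n phases, each with candidate components as (name, cost, performance)
phases = [
    [("tokenizer-a", 1, 0.70), ("tokenizer-b", 2, 0.75)],
    [("tagger-a", 2, 0.60), ("tagger-b", 4, 0.80)],
    [("ranker-a", 3, 0.65), ("ranker-b", 1, 0.55)],
]
C = 7  # total resource capacity

best_trace, best_score = None, float("-inf")
for trace in product(*phases):              # every combination of components
    cost = sum(c for _, c, _ in trace)
    score = sum(p for _, _, p in trace)     # toy additive performance model
    if cost <= C and score > best_score:
        best_trace, best_score = trace, score
```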
        <p>
          In this paper, we introduce the CSE framework implementation, a distributed system for
a parallel experimentation test bed based on UIMA and uimaFIT [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ]. In addition, we highlight the results
from two case studies where we applied the CSE framework to the task of building biomedical
question answering systems.
We highlight some features of the implementation in this section. Source code, examples,
documentation, and other resources are publicly available on GitHub5. To benefit developers who are
already familiar with the UIMA framework, we have developed a CSE tutorial in alignment with the
examples in the official UIMA tutorial.
        </p>
        <p>Declarative descriptors. To leverage the CSE framework, users need to specify how the
components should be organized in the pipeline, which values need to be specified for each
component configuration, what the input set is, and which measurement metrics should be applied.
Analogous to a typical UIMA CPE descriptor, components, configurations, and collection
readers in the CSE framework are declared in extended configuration descriptors which are based on
the YAML format. An example of the main pipeline descriptor and a component descriptor are
shown in Figure 1.</p>
        <p>Architecture. Each pipeline can contain an arbitrary number of AnalysisEngines declared by
using the class keyword or by inheriting configuration options from other components by name.
Combinations of components are configured using an options block and parameter combinations
within a component are configured on a cross-opts block. To take full advantage of the CSE
framework capabilities, users inherit from a cse.phase, a CAS multiplier that provides option
multiplexing, intermediate resource persistence, and resource management for long-running
components. The architecture also supports grouping options into sub-pipelines as a convenient way
of reducing the configuration space for combinations whose performance is already known.</p>
        <p>Evaluation. Unlike a traditional scientific workflow management system, CSE emphasizes
the evaluation of component performance, based on user-specified evaluation metrics and
gold-standard outputs at each phase. In addition, the framework keeps track of the performance of all
executed traces; this allows inter-component evaluation and automatic tracking of performance
improvements over time.</p>
        <p>Automatic data persistence. To support further error analysis and reproduction of
experimental results, intermediate data (CASes) and evaluation results are kept in a repository
accessible from any trace at any point during the experiment. To prevent duplicate execution of traces, the
system keeps track of all the executed traces and recovers those CASes whose predecessors have
already been processed.
5 http://oaqa.github.io/</p>
        <p>[Figure: system overview — review database; linguistic annotation (POS, lemmas, NER, chunks, dependencies, etc.); Opinionated Unit detection (target and cue correlation via dependencies); OU polarity assignment; OU indexing; data visualization]</p>
      </sec>
      <sec id="sec-8-5">
        <title>Architecture and Implementation</title>
        <p>This section describes all UIMA modules used in the prototype, as shown
in Figure 3. Some of them are existing open source components, some are
adaptations, and some are our own custom developments. We have been publishing
our work on Github and will continue doing so as far as possible.2
UIMA Collection Tools This prototype is designed to work on a static
document collection, previously loaded into a MySQL database (including the review
text as well as associated metadata). UIMA Collection Tools3 is an ecosystem of
tools that allow UIMA pipelines to store and retrieve data in database
systems, such as MySQL. Plain text documents can be retrieved from a database,
XMI documents can be retrieved from and stored in a database either
compressed or uncompressed, features can be extracted into a database table, and
annotations within database-stored XMI blobs can be visualized the same way
as the standard AnnotationViewer does for XMI files.</p>
        <p>– DBCollectionReader is a UIMA collection reader which retrieves plain text
documents stored in a MySQL database. The database connection parameters
as well as the SQL query have to be specified in the component descriptor. It is
derived from the FileSystemCollectionReader.
– SolrCollectionReader is equivalent to DBCollectionReader, but using a Solr
index as the document source.
– DBXMICollectionReader is a UIMA collection reader that retrieves XMI
documents stored in a MySQL database. DBXMICollectionReader is also
prepared to read compressed XMI documents by means of ZLIB compression.</p>
        <p>This option can be set in the descriptor file.
– DBAnnotationsCASConsumer is a CAS consumer which stores values of the
features specified in the component descriptor file in a MySQL database
table. Each table row corresponds to the annotation defined as the splitting
annotation, e.g. if the Sentence annotation has been defined as the splitting
annotation, each table row will correspond to a Sentence, and this row will
2 See https://github.com/BarcelonaMedia-ViL/
3 The UIMA Collection tools have been developed at Barcelona Media, some of
them based on the example Collection Readers and CAS Consumers provided
with the UIMA distribution. They are published under the Apache License at
https://github.com/BarcelonaMedia-ViL/uima-collection-tools.
contain features of the Sentence annotation and/or features of annotations
covered by the Sentence annotation.
– DBXMICASConsumer is a CAS consumer that persists XMI documents in
a database. DBXMICASConsumer is also prepared to store compressed XMI
documents by means of ZLIB compression.
– DBAnnotationViewer is a modification of the Annotation Viewer, and
allows reading XMI files directly from a MySQL database without needing to
extract them first.</p>
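        <p>The reader pattern shared by these components (a connection plus an SQL query configured in the descriptor, documents streamed row by row) can be approximated as follows; SQLite stands in for MySQL and the table layout is an assumption.</p>

```python
# Rough, non-UIMA sketch of the DBCollectionReader idea: a reader configured
# with a connection and an SQL query that yields one document per result row.
# SQLite stands in for MySQL; the table layout is an invented example.
import sqlite3

def db_collection_reader(conn, query):
    """Yield (doc_id, text) pairs, one per row of the configured query."""
    for doc_id, text in conn.execute(query):
        yield doc_id, text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [(1, "Great hotel."), (2, "Noisy room.")])

docs = list(db_collection_reader(conn, "SELECT id, body FROM reviews ORDER BY id"))
```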
        <p>OpenNLP We use OpenNLP4 with the standard UIMA wrappers for our base
pipeline, including Sentence Detector, Tokenizer, and POS Tagger, using our
own trained models for Spanish.</p>
        <p>Lemmatizer We apply Lemmatization using a large dictionary developed
inhouse. All candidate lemmas are first added to the CAS using ConceptMapper5
but a second custom component selects the right one using the POS tag.
JNET For ML-based detection of Targets and Cues we use JNET6 (the Julielab
Named Entity Tagger), which is based on Conditional Random Fields (CRF).
It detects token sequences that belong to certain classes, taking into account a
variety of features associated with each token (such as the surface form, lemma,
POS tag, surface features such as capitalization, etc.) as well as its context of
preceding and successive tokens. While originally intended for Named Entity
Recognition, we trained JNET with our own manually annotated corpus.</p>
        <p>Compared to the original JNET as released by JulieLab, we introduced a series
of changes, most importantly making it type-system independent by taking all
input and output types and features as parameters, and fixing some bugs that
were triggered when using a larger amount of token features. We expect to release
our changes soon, but are still looking into the question of licensing, to comply
with JNET’s original license.</p>
        <p>DeSR We developed a UIMA wrapper for the DeSR dependency
parser7. The parser creates dependency annotations based on previously
generated sentence, token and POStag annotations. It is available at
https://github.com/BarcelonaMedia-ViL/desr-uima. The UIMA DeSR analysis
engine is a UIMA C++ annotator, developed using the C++ SDK provided by
UIMA. It translates between the format required by the DeSR parser shared
library and the UIMA CAS format. The mapping between UIMA types and
features and the features used internally by DeSR is configurable in the annotator
descriptor.
4 http://opennlp.apache.org/
5 http://uima.apache.org/sandbox.html#concept.mapper.annotator
6 http://www.julielab.de/Resources/Software/NLP Tools.html
7 https://sites.google.com/site/desrparser/
DependencyTreeWalker This is a Pythonnator-based analysis engine for
wrapping the DependencyGraph Python module (both developed in-house). This
allows us to work easily with the dependency graph generated by DeSR in order
to e.g. determine and validate the path between two given UIMA annotations.
Weka Wrapper We used the Mayo Weka/UIMA Integration (MAWUI8), as
a basis for the machine learning tools. The version we use is adapted to newer
versions of UIMA and made much more configurable. MAWUI generates a single
vector for each document, that is used to classify it as a whole. In our case, a
document can contain several Opinionated Units that need to be classified. For
this reason the Weka Wrapper was adapted to be able to deal with all the
annotations of a given type inside a document (or collection when generating
the training data).
</p>
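        <p>The essential change to the Weka Wrapper — one instance per annotation of a given type rather than one vector per document — can be sketched as below; the annotation layout and the two toy features are hypothetical.</p>

```python
# Sketch of the per-annotation adaptation: instead of one feature vector per
# document (as in MAWUI), emit one vector per annotation of a chosen type.
# The annotation dicts and the two toy features are hypothetical.

def vectors_for_annotations(doc_text, annotations, ann_type="OpinionatedUnit"):
    """Build one feature vector per annotation of ann_type in the document."""
    vectors = []
    for ann in annotations:
        if ann["type"] != ann_type:
            continue
        span = doc_text[ann["begin"]:ann["end"]]
        vectors.append({"length": len(span.split()),        # toy feature 1
                        "has_not": "not" in span.lower()})  # toy feature 2
    return vectors

doc = "The room was not clean. The staff was friendly."
anns = [{"type": "OpinionatedUnit", "begin": 0, "end": 23},
        {"type": "OpinionatedUnit", "begin": 24, "end": 47},
        {"type": "Sentence", "begin": 0, "end": 23}]
vecs = vectors_for_annotations(doc, anns)
```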
      </sec>
      <sec id="sec-8-6">
        <title>Visualization</title>
        <p>Beyond being able to extract and classify the opinions, users need an interface
that allows them to access and explore the data. They need to know which
Targets or which of their features are addressed by the opinions and what
is being said about them, and this has to be shown in an aggregated way, with
drill-down capabilities, so that the end user has a clear view of the contents of
hundreds or thousands of opinions.</p>
        <p>UIMA does not provide tools to deal with collections of documents, so
we use Solr, a Lucene-based indexing tool, to index the Opinionated Units.
Through the use of Solr’s faceting and pivot utilities we are able to graphically
summarize thousands of opinions. Special charts have been constructed in order
not only to represent the data but also to select subsets of opinions and to
summarize and compare them. For example, we can compare the global users’
opinions with the opinions about a single hotel or the hotels in a specific area.</p>
        <p>To index the data we needed the linguistic information, but also the metadata
associated with the opinion, which is located in databases and is not processed
with UIMA. For this reason we import the data into Solr in two steps. In the first
step we generate from UIMA a table with the data, which we then import into Solr
together with the metadata.
To index the Opinionated Units we use the DBAnnotationsCASConsumer
component. We generate a register for each OU, containing: the Target, the Cue,
the text span, the polar words, their polarity, the polarity of the cue, and the
polarity of the Opinionated Unit. Cues and targets are grouped in single tokens
by means of underscoring.
8 http://informatics.mayo.edu/text/index.php?page=weka</p>
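        <p>The underscore grouping used when building the OU register amounts to a simple join of the multi-word expression; the field names here are illustrative, not the exact table layout.</p>

```python
# Sketch of an OU register entry: multi-word Targets and Cues are grouped into
# single tokens by underscoring before indexing. Field names are illustrative.

def underscore(words):
    """Join a multi-word expression into one indexable token."""
    return "_".join(words)

def ou_register(target, cue, span, cue_polarity):
    return {"target": underscore(target),
            "cue": underscore(cue),
            "text_span": span,
            "cue_polarity": cue_polarity}

reg = ou_register(["front", "desk"], ["very", "helpful"],
                  "the front desk was very helpful", "positive")
```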
        <p>We use the DataImportHandler from Solr in order to import the data
from the database. To do so, a query combines the opinionated unit information
with the information related to the hotel or the user who wrote the opinion. Cues
are indexed twice, once all merged and then in different fields depending on
the opinion’s polarity, making it easy to retrieve just the positive or negative
opinion markers. We selected this option because it is a bit faster, more flexible
and reliable than the other ones: when indexing directly from UIMA we have
problems in adding all the desired metadata, and if we call UIMA from Solr
(or Lucene) then it is difficult to have a general framework that splits a single
document into several Opinionated Units.</p>
        <p>AJAX-Solr9 is a JavaScript library for creating user interfaces to Apache
Solr. This library works with facets. Faceting is a capability of Solr that provides
fast statistics of the most frequent terms in each field after performing
a query. Since version 4.0 Solr also has pivots that combine the facets from two
or more different fields. We adapted AJAX-Solr to work with pivots and wrote
a series of widgets to visualize them. Our own extensions to AJAX-Solr are also
published on github10.</p>
        <p>By clicking the different facets that appear on the widgets, the user
can build a query that restricts the set of opinions to summarize. These opinions
are then summarized by showing the most frequent terms they contain, or the
most differentiating ones (i.e. those terms that are frequent in the current subset
but that are less frequent in the general one). Figure 4 shows the pivot result in
text and force diagram formats. It shows the relationship between Targets, and
positive and negative Cues. In the textual representation, the relationships are
not shown directly but scaled to magnify the most discriminative ones.
The combination of UIMA and Solr has allowed us to develop a very flexible
platform that makes it easy to integrate and combine processing modules from a
variety of sources and in a variety of programming languages, as well as navigate
and visualize the results easily and efficiently.</p>
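        <p>One plausible reading of the “most differentiating terms” computation — terms frequent in the selected subset but comparatively rare overall — is a relative-frequency ratio; this scoring is our assumption for illustration, not the exact formula used by the widgets.</p>

```python
# One plausible scoring of "most differentiating terms": rank terms by how much
# more frequent they are in the selected subset than in the whole collection.
# This ratio is an assumption for illustration, not the system's exact formula.
from collections import Counter

def differentiating_terms(subset_docs, all_docs, top=3):
    sub = Counter(w for d in subset_docs for w in d.split())
    gen = Counter(w for d in all_docs for w in d.split())
    n_sub, n_gen = sum(sub.values()), sum(gen.values())
    # relative frequency in the subset divided by relative frequency overall
    score = {w: (sub[w] / n_sub) / (gen[w] / n_gen) for w in sub}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top]]

all_docs = ["clean room good breakfast", "noisy street good location",
            "good pool clean towels", "noisy bar late music"]
subset = ["noisy street good location", "noisy bar late music"]
top3 = differentiating_terms(subset, all_docs)
```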
        <p>In our evaluations with 700 OUs manually annotated by 3 independent
reviewers, agreement on the correctness of the OUs identified by the
system was 88.5%, while the assigned polarity was found to be correct in an average
of 70% of cases.</p>
        <p>We found many useful UIMA components to be available as open source, and
encountered few compatibility issues (other than adapting some components to
be type system independent). Solr provides us with a very flexible platform to
access large document collections, and in combination with UIMA allows us to
explore even complex hidden relationships within those collections.</p>
        <p>One of our main objectives was to make all modules configurable and
reusable, inasmuch as Sentiment Analysis in general requires tweaking to adapt
to domain and genre, but this generalization often requires considerable effort.
We found the different open source communities to be very receptive, and we
try to participate by publishing our own contributions under permissive licenses
that make them easy for others to adopt and use.
</p>
      </sec>
      <sec id="sec-8-7">
        <title>Thanks</title>
        <p>This work has been partially funded by the Spanish Government project
Holopedia, TIN2010-21128- C02-02, and the CENIT program project Social Media,
CEN-20101037.</p>
        <sec id="sec-8-7-1">
          <title>Extracting hierarchical data points and tables from scanned contracts</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Jan Stadermann, Stephan Symons, and Ingo Thon</title>
      <p>Recommind Inc., 650 California Street, San Francisco, CA 94108, United States
{jan.stadermann,stephan.symons,ingo.thon}@recommind.com</p>
      <p>http://www.recommind.com
Abstract. We present a technique for developing systems to
automatically extract information from scanned semi-structured contracts. Such
contracts are based on a template, but have different layouts and
clientspecific changes. While the presented technique is applicable to all kinds
of such contracts we specifically focus on so called ISDA credit support
annexes. The data model for such documents consists of 150 individual
entities some of which are tables that could span multiple pages. The
information extraction is based on the Apache UIMA framework. It
consists of a collection of small and simple Analysis Components that extract
increasingly complex information based on earlier extractions. This
technique is applied to extract individual data points and tables. Experiments
show an overall precision of 97% with a recall of 93% regarding
individual/simple data points and 89%/81% for table cells measured against
manually entered ground truth. Due to its modular nature our system
can be easily extended and adapted to other collections of contracts as
long as some data model can be formulated.</p>
      <p>Keywords: OCR robust information extraction, hierarchical taggers,
table extraction
Despite the existence of electronic document handling and content management
systems, there is still a large amount of paper-based contracts. Even when scanned
and OCRed, the interesting data contained in a document is not
machine-readable as there are no semantics attached to the text. Especially in the banking
domain it is necessary to have the underlying information available, e.g., for risk
assessment. Until now, the information has had to be extracted by human reviewers.
The goal of the system presented here is to automatically obtain the relevant
information from OTC (over-the-counter) contracts which are based on a
template provided by the ISDA1. The data is given in the form of image-embedded
pdf documents. Each contract contains around 150 data points organized in a
complex hierarchical data model. A data point can be either a (possibly multi-valued)
simple field or a table. The main challenges of such a system are:
1 International Swaps and Derivatives Association, www.isda.org</p>
    </sec>
    <sec id="sec-10">
      <title>Challenges</title>
      <p>1. The complex legal language used in the contracts.
2. Despite existing contract templates, the wording varies across customers.
3. The layout varies. Especially tables can be represented in various forms.
4. The scanning quality of the contracts is often poor, especially in old
contracts or documents sent by fax. Still, the remaining information needs to be
extracted correctly.</p>
      <p>Figure 1 shows examples of two simple data points (a), and a table (b).</p>
      <p>
        In general, on the one hand, there are a lot of sophisticated entity extraction
systems that try to find flat entities only (“Named entity extraction”) [
        <xref ref-type="bibr" rid="ref22 ref9">9</xref>
        ]. These
systems sometimes use hierarchical information, like Tokens, Part-of-Speech
tags, and Sentences, but only on a linguistic level without collecting and combining
this information. These approaches work well on well-defined and general
entities such as persons or locations. However, they are difficult to adapt to a new
domain since a new classifier needs to be created which requires huge amounts
of labeled training data which is expensive to produce.
      </p>
      <p>
        On the other hand, there are systems that use a deep hierarchical structure, e.g.
represented using Ontologies, but still do the classification in one single, flat step
[
        <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ]. This approach is not as flexible and extensible as the presented
one since, in general, it requires re-training or re-building of the classifier if
layers within the hierarchy are changed. An early solution for dealing with scanned
forms was presented by Taylor et al., who used a model-based approach for data
extraction from tax forms [
        <xref ref-type="bibr" rid="ref12 ref25">12</xref>
        ]. Semi structured texts have been analyzed using
rule based approaches [
        <xref ref-type="bibr" rid="ref10 ref23">10</xref>
        ] or discriminative context free grammars [
        <xref ref-type="bibr" rid="ref13 ref26">13</xref>
        ]. Closest
to our solution is a system described by Surdeanu et al. [
        <xref ref-type="bibr" rid="ref11 ref24">11</xref>
        ]. They employ two
layers of extraction using Conditional Random Fields [
        <xref ref-type="bibr" rid="ref18 ref5">5</xref>
        ], and deal with OCR
data. For table extraction, heuristic methods [
        <xref ref-type="bibr" rid="ref21 ref8">8</xref>
        ] have been proposed as well as
Conditional Random Fields [
        <xref ref-type="bibr" rid="ref20 ref7">7</xref>
        ].
      </p>
      <p>
        In contrast, our system uses a theoretically unlimited number of layers with
separate classifiers for each piece of information, including tables, on each level.
Instead of processing the whole text at once, our classifiers just collect the
information they require, and decide only on that data. Therefore, they allow for
better performance and extensibility, as additional data does not affect the
existing classifiers. Our work follows strategies commonly used in spoken dialogue
systems [
        <xref ref-type="bibr" rid="ref17 ref4">4</xref>
        ] and uses a set of small classifiers which is inspired by the boosting
idea [
        <xref ref-type="bibr" rid="ref19 ref6">6</xref>
        ]. In addition, we use automatically extracted segmentation information
and cross-checks between our classifiers to increase the precision of the extracted
data. From a UI standpoint there is a similar application called GATE [
        <xref ref-type="bibr" rid="ref15 ref2">2</xref>
        ] which
extracts entities based on given rule-sets. This application provides a
hierarchical organization of entities and the architecture seems to be very similar to the
UIMA framework. However, GATE has no special provisions to deal with noise
from the OCR step, and it only allows simple extraction rules to be specified.
Furthermore, there is no direct way for the entity extraction to work hierarchically;
only the result can be organized in a hierarchical way.
      </p>
      <sec id="sec-10-1">
        <title>Information extraction</title>
        <p>An overview of or system’s architecture is shown in figure 2. Prior to information
extraction, the OmniPage2 OCR engine is used to convert the image to readable
text. However, many character level errors, and layout distortions remain which
need to be dealt with in the following processing steps. The overall strategy
is based on the idea that small pieces of relevant text can be extracted quite
accurately even in the presence of OCR errors. On top of these pieces we build
several layers of higher-level extractors – here called “experts” – that combine
these small pieces to decide on a final data point. The extraction of tables works
in a similar fashion by first trying to extract small pieces that form table cells.
Then stretches of cells are collected, trying to deduce a layout from order and
type of the pieces. Finally, an optimal result table is selected (see section 2.2).</p>
        <p>
          Our solution is based on the UIMA framework [
          <xref ref-type="bibr" rid="ref16 ref3">3</xref>
          ]. Each type of expert is
implemented as a configurable annotation engine. The overall extraction system
consists of a large hierarchy of analysis engines, encompassing several hundred
elements. The type system, in contrast, only consists of three principal types, i.e.
for simple fields, tables and table rows. Annotation types, extracted values, etc.
are stored as features. Both final and intermediate annotations are represented
by these types.
      </p>
        <p>Fig. 2. Extraction architecture: document images are converted by OCR into recognized
text (XML); the information extraction stage applies regular-expression extractors and
normalization steps, producing XML with meta data for the document index.</p>
        <p>Extraction of simple-valued fields
We use the term “simple-valued fields” for data points, where one key has one or
more values. They differ from named entities as they may include multi-valued
data. Figure 1(a) shows an example of the key eligible currency with the
(normalized) values “USD” and “Base currency”. Fields are extracted layer-wise.
On the lowest layer, all instances of the identifying term “Eligible currency”, are
captured, as well as the different currency expressions, including the special
term “Base currency”, which refers to another simple field. On this level we
typically use annotators based on dictionaries and regular expressions, where
variations due to OCR errors are reflected in the dictionary variants and the
regular expressions, respectively. All such annotators are implemented as analysis engines.
On the next level, so-called “expert-extractors” combine the existing annotations
to a new one. An expert is a rule, defined as a set of slots for annotations of
specific types, and a definition of which slots form a new annotation if the rule
is satisfied, i.e. if all slots are filled. To allow for fine tuning the experts, slots
can be configured, e.g. by indicating certain slots as optional. Furthermore, it is
possible to specify the order of annotations in slots appearing in the document. It
is also possible to specify a maximum distance. If the distance between two found
annotations exceeds the defined threshold for this expert, the expert assumes it
is in the wrong area of the document and clears its internal state to start all over
again. Finally, slots can be write-protected, accepting only the first occurrence
of the configured annotation.</p>
        <p>Fig. 3. Extraction of a simple field. First-level components have tagged the “Eligible
Currency” phrase and the different variants of currencies. Expert 1 collects two or more
currencies (the third slot is optional). The resulting annotation is used by Expert 2
to build the final annotation. All elements are represented in UIMA as simple field
types.</p>
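        <p>To make the mechanism concrete, here is a minimal, illustrative Python sketch (not the authors' code; the class name Expert, the slot representation, and max_distance are hypothetical stand-ins for the configurable analysis engines described above):</p>

```python
# Illustrative sketch of a slot-filling "expert" (hypothetical names; the real
# system implements experts as configurable UIMA analysis engines).
from dataclasses import dataclass

@dataclass
class Annotation:
    type: str
    begin: int
    end: int

class Expert:
    def __init__(self, slot_types, optional=(), max_distance=20):
        self.slot_types = slot_types      # required annotation type per slot
        self.optional = set(optional)     # indices of optional slots
        self.max_distance = max_distance  # gap threshold before a state reset

    def apply(self, annotations, result_type):
        slots = [None] * len(self.slot_types)
        last_end = None
        results = []
        for ann in sorted(annotations, key=lambda a: a.begin):
            # If the gap to the previous match is too large, the expert assumes
            # it is in the wrong area of the document and clears its state.
            if last_end is not None and ann.begin - last_end > self.max_distance:
                slots = [None] * len(self.slot_types)
            for i, t in enumerate(self.slot_types):
                if slots[i] is None and ann.type == t:
                    slots[i] = ann
                    last_end = ann.end
                    break
            # In this simplified sketch the rule fires as soon as every
            # non-optional slot is filled; optional slots may stay empty.
            if all(s is not None or i in self.optional
                   for i, s in enumerate(slots)):
                filled = [s for s in slots if s is not None]
                results.append(Annotation(result_type,
                                          min(s.begin for s in filled),
                                          max(s.end for s in filled)))
                slots = [None] * len(self.slot_types)
        return results
```

        <p>Applied to the situation of figure 3, an expert with slots for the “Eligible Currency” term and the collected currencies would emit one combined annotation spanning all filled slots.</p>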
        <p>To extract eligible currency, two experts are employed (see figure 3). The
first expert collects adjacent currency annotations. The second one combines the
“Eligible Currency” term, and the collected currencies found by expert one, if
both annotations are found within a short distance. The resulting annotation
will span the relevant currency terms. This modular design allows us to reduce
the number of extractors and to re-use existing annotations for completely
different data points. In general, the pieces of information found in the examined contracts
are not independent of each other. We use business rules and other constraints
to validate and normalize the extracted results; e.g., the set of currencies is
well-defined. If the validation fails, or if the normalization repairs some value due to
business rules, a corresponding message can be attached to the annotation to
inform the reviewer.
</p>
        <p>2.2 Extraction of tables
We define a table as multi-dimensional, structured data present in a document
either in a classical tabular layout, or defined in a series of sentences or
paragraphs in free text form (like in figure 4). We aim at extracting tables of both
structure types and intermediate formats (e.g. as in figure 1(b)) only from the
document’s OCR output at character level. In our application, table extraction
extends the simple-valued field extraction: the basic input for a table expert
is a document annotated with simple value fields and intermediate annotations.
The experts attempt to match sequences of simple annotations to a set of table
models. A table model is user-defined and describes which columns the resulting
extracted table should have. Each column can contain multiple types of simple
fields. Furthermore, columns can be configured to be optional and to accept
only unique or non-overlapping annotations. This allows for both more general
models with variable columns and fine-tuning the accepted annotations.</p>
        <p>The process of detecting tables by the table expert (see figure 4 for an
example) begins with collecting all accepted annotations for a model, within a
predefined range or until a table stop annotation is found, into a list sorted by
order of appearance. For each such list, several filling strategies are employed.
A filling strategy addresses the problem that multiple columns may accept the
same types of annotations. If elements appear row-wise, or column-wise, the
corresponding strategies will recover the correct table, also compensating for
some errors from omitted table elements. In mixed cases, adding a new table
cell to the shortest relevant column is used as a fall back strategy. Each strategy
is evaluated, using the fraction of cells filled in the resulting table c and the
filling strategy specific score s. The latter score measures how well the
annotations match the expectations of the filling strategy. The table which maximizes
sf = c · s is annotated as a candidate, if sf is above a predefined threshold. The
table expert is implemented as an analysis engine. Configuration encompasses the
columns describing the table model, distance and scoring threshold, and the set
of filling strategies to be evaluated. The output is a table type annotation, which
in turn contains several table rows, each containing simple fields as cells.</p>
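        <p>The strategy evaluation can be sketched as follows (a simplified illustration under assumed names; fill_row_wise and the per-strategy fit scores stand in for the filling strategies described above, with sf = c · s as the selection criterion):</p>

```python
# Simplified sketch of the table expert's strategy evaluation (assumed names).
def fill_row_wise(cells, n_cols):
    """Arrange the collected cell annotations into rows of n_cols,
    padding an incomplete last row with empty cells (None)."""
    rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
    if rows and len(rows[-1]) != n_cols:
        rows[-1] = rows[-1] + [None] * (n_cols - len(rows[-1]))
    return rows

def fraction_filled(table):
    """c: the fraction of non-empty cells in the candidate table."""
    cells = [c for row in table for c in row]
    return sum(c is not None for c in cells) / len(cells) if cells else 0.0

def select_candidate(cells, n_cols, strategies, threshold=0.5):
    """Evaluate each (strategy, fit score s) pair and keep the table
    maximizing sf = c * s, if sf exceeds the predefined threshold."""
    best, best_sf = None, threshold
    for strategy, fit_score in strategies:
        table = strategy(cells, n_cols)
        sf = fraction_filled(table) * fit_score
        if sf > best_sf:
            best, best_sf = table, sf
    return best
```

        <p>A column-wise strategy or the shortest-column fall-back would plug in as further (strategy, score) pairs.</p>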
        <p>Multiple table experts may be used to generate candidate tables for a single
target, and candidates may occur in several locations in a document. Usually,
the correct location gives rise to tables with certain properties, e.g. short, dense
tables. This is used by a feature-based selection of the optimal table candidate.
We model this using both general purpose features (e.g. size, and number of
empty cells) as well as domain specific features. The table with the highest
weighted sum of score features is selected as the final output. The weights can
either be user-defined or fitted using a formal optimization model.</p>
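        <p>The final selection step amounts to a weighted feature sum over candidate tables; a minimal sketch (feature functions and weights are placeholders, not the system's actual features):</p>

```python
# Minimal sketch of feature-based candidate selection (placeholder names).
def select_table(candidates, features, weights):
    """Score each candidate table by a weighted sum of feature values and
    return the best one; weights may be user-defined or fitted."""
    def score(table):
        return sum(w * f(table) for f, w in zip(features, weights))
    return max(candidates, key=score) if candidates else None
```

        <p>General-purpose features such as the fraction of non-empty cells, or domain-specific ones, plug in directly as feature functions.</p>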
      </sec>
      <sec id="sec-10-2">
        <title>Experiments</title>
        <p>We composed a document set containing 449 documents3 to measure the
extraction quality of our system. These documents are from various customers and
represent as many variants of different wordings and layouts as possible.</p>
        <p>With our customers we agreed upon certain quality gates that the automatic
extraction system has to meet. Due to the nature of the contracts, it is much more
important to achieve high precision of the extracted data than high recall. For
simple fields the gate’s threshold is 95% precision and 80% recall. Table cells
are more difficult to extract since the OCR component not only mis-recognizes
individual characters but makes errors on the structure of a table. For table cells,
our goal is to have a high recall since errors within a structured table are easier
to detect and correct than simple field errors by a human reviewer. Table 1 shows
our results against a manually created ground truth. The numbers represent the
total number of data points and errors, respectively, over all of our documents.
3 see tinyurl.com/csa-example for a public sample document.</p>
        <p>Table 1. Extraction results against the manually created ground truth:
                 Insertions  Deletions  Substitutions  Correct  Precision  Recall
Simple fields       375        1267          330        20519      97%       93%
Table cells        1492        3563          906        18838      89%       81%</p>
        <p>
In total, we meet our gate criterion for simple fields. Precision can be as low as
33% for rare fields, where fitting appropriate data experts is hard. In contrast,
for frequent fields, precision may exceed 99%. In principle, the same is true for
recall, with both maximum and minimum lower, due to our target criteria. For
table cells, the precision needs improvement mainly due to the OCR’s structural
errors like swapping rows within a table or switching between row-wise and
column-wise recognition in one table. This is especially true for tables which are
complex with respect to both layout and contents, like the collateral eligibility
table in figure 1(b). Here, precision and recall are 84.4% and 80.2%, respectively.
In contrast, structurally simple tables, like the interest rate table (see figure 4
for an example) can be extracted with much higher confidence (97.4% precision
and 90.8% recall).
</p>
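        <p>Assuming the standard edit-style convention (a reading consistent with the precision and recall figures reported in the conclusion), precision and recall follow from the counts in Table 1 as:</p>

```python
# Sketch: precision/recall from edit-style error counts. The formulas are an
# assumed (standard) convention, consistent with the figures reported here.
def precision_recall(correct, insertions, deletions, substitutions):
    precision = correct / (correct + insertions + substitutions)
    recall = correct / (correct + deletions + substitutions)
    return precision, recall

# Simple fields: 375 insertions, 1267 deletions, 330 substitutions, 20519 correct
p, r = precision_recall(20519, 375, 1267, 330)  # p ≈ 0.967, r ≈ 0.928
```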
      </sec>
      <sec id="sec-10-3">
        <title>Conclusion and outlook</title>
        <p>This article presents a system to automatically extract simple data points and
tables from OTC contract images. The system consists of an OCR component
and a hierarchical set-up of small modular extractors either capturing (noisy)
text or combining already annotated clues using a slot-filling strategy. Our
experiments are conducted on an in-house contract collection, resulting in a precision
of 97% (recall 93%) on simple fields and a precision of 89% (recall 81%) on table
cells. While the evaluation we conducted is limited, we expect overfitting to be
moderate. The legal nature of the contracts limits the layout and wording
options. Our next steps include the introduction of a confidence score on data-point
level and the use of statistical classification methods for selecting the best-suited
table model.</p>
        <p>Acknowledgement. We would like to thank our partner Rule Financial for
providing the data model and for their assistance in understanding the documents.</p>
        <sec id="sec-10-3-1">
          <title>Constraint-driven Evaluation in UIMA Ruta</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Andreas Wittek1, Martin Toepfer1, Georg Fette12, Peter Kluegl12, and Frank Puppe1</title>
      <p>1 Department of Computer Science VI, University of Wuerzburg,</p>
      <p>Am Hubland, Wuerzburg, Germany
2 Comprehensive Heart Failure Center, University of Wuerzburg,</p>
      <p>
        Straubmuehlweg 2a, Wuerzburg, Germany
{a.wittek,toepfer,fette,pkluegl,puppe}@informatik.uni-wuerzburg.de
Abstract. This paper presents an extension of the UIMA Ruta
Workbench for estimating the quality of arbitrary information extraction
models on unseen documents. The user can specify expectations on the
domain in the form of constraints, which are applied in order to predict the
F1 score or the ranking. The applicability of the tool is illustrated in a
case study for the segmentation of references, which also examines the
robustness for different models and documents.
Apache UIMA [
        <xref ref-type="bibr" rid="ref18 ref5">5</xref>
        ] and the surrounding ecosystem provide a powerful framework
for engineering state-of-the-art Information Extraction (IE) systems, e.g., in the
medical domain [
        <xref ref-type="bibr" rid="ref13 ref26">13</xref>
        ]. Two main approaches for building IE models can be
distinguished. One approach is based on manually defining a set of rules, e.g., with
UIMA Ruta3 (Rule-based Text Annotation) [
        <xref ref-type="bibr" rid="ref20 ref7">7</xref>
        ]4, that is able to identify the
interesting information or annotations of specific types. A knowledge engineer
writes, extends, refines and tests the rules on a set of representative documents.
The other approach relies on machine learning algorithms, such as
probabilistic graphical models like Conditional Random Fields (CRF) [
        <xref ref-type="bibr" rid="ref10 ref23">10</xref>
        ]. Here, a set
of annotated gold documents is used as a training set in order to estimate the
parameters of the model. The resulting IE system of both approaches, the
statistical model and the set of rules, is evaluated on an additional set of annotated
documents in order to estimate its accuracy or F1 score, which is then assumed
to hold for the application in general. However, even if the system performed well
in the evaluation setting, its accuracy may decrease when applied to unseen
documents, perhaps because the set of documents used for developing the IE system
was not large or representative enough. In order to estimate the actual
performance, either more data is labeled or the results are manually checked by a
human, who is able to validate the correctness of the annotations.
      </p>
      <p>
        Annotated documents are essential for developing IE systems, but there is
a natural lack of labeled data in most application domains, and its creation is
error-prone, cumbersome and time-consuming, as is the manual validation.
3 http://uima.apache.org/ruta.html
4 previously published as TextMarker
An
automatic estimation of the IE system’s quality on unseen documents would
therefore provide many advantages. A human is able to validate the created
annotations using background knowledge and expectations on the domain. This
kind of knowledge is already used by current research in order to improve the
IE models (c.f. [
        <xref ref-type="bibr" rid="ref1 ref11 ref14 ref19 ref24 ref6">1, 6, 11</xref>
        ]), but barely to estimate IE system’s quality.
      </p>
      <p>This paper introduces an extension of the UIMA Ruta Workbench for exactly
this use case: Estimating the quality and performance of arbitrary IE models
on unseen documents. The user can specify expectations on the domain in the
form of constraints, hence the name Constraint-driven Evaluation (CDE). The
constraints rate specific aspects of the labeled documents and are aggregated
to a single cde score, which provides a simple approximation of the
evaluation measure, e.g., the token-based F1 score. The framework currently supports
two different kinds of constraints: Simple UIMA Ruta rules, which express
specific expectations concerning the relationship of annotations, and
annotation-distribution constraints, which rate the coverage of features. We distinguish two
tasks: predicting the actual F1 score of a document and estimating the ranking
of the documents specified by the actual F1 score. The former task can give
answers on how well the model performs. The latter task points to documents
where the IE model can be improved. We evaluate the proposed tool in a case
study for the segmentation of scientific references, which tries to estimate the
F1 score of a rule-based system. The expectations are additionally applied on
documents of a different distribution and on documents labeled by a different
IE model. The results emphasize the advantages and usability of the approach,
which already works with minimal effort due to a simple fact: it is much easier
to estimate how well a document is annotated than to actually identify the
positions of defective or missing annotations.</p>
      <p>
        The rest of the paper is structured as follows. In the upcoming section, we
describe how our work relates to other fields of Information Extraction research.
We explain the proposed CDE approach in Section 3. Section 4 covers the case
study and the corresponding results. We conclude with pointers to future work
in Section 5.
Besides standard classification methods, which fit all model parameters against
the labeled data of the supervised setting, there have been several efforts to
incorporate background knowledge from either user expectations or external
data analysis. Bellare et al. [
        <xref ref-type="bibr" rid="ref1 ref14">1</xref>
        ], Grac¸a et al. [
        <xref ref-type="bibr" rid="ref19 ref6">6</xref>
        ] and Mann and McCallum [
        <xref ref-type="bibr" rid="ref11 ref24">11</xref>
        ], for
example, showed how moments of auxiliary expectation functions on unlabeled
data can be used for such a purpose with special objective functions and an
alternating optimization procedure. Our work on constraint-driven evaluation is
partly inspired by this idea; however, we address a different problem: we suggest
using auxiliary expectations to estimate the quality of classifiers on unseen data.
      </p>
      <p>
        A classifier’s confidence describes the degree to which it believes that its
own decisions are correct. Several classifiers provide intrinsic measures of
confidence, for example, naive Bayes classifiers. Culotta and McCallum [
        <xref ref-type="bibr" rid="ref17 ref4">4</xref>
        ], for
instance, studied confidence estimation for information extraction. They focus on
predictions about field and record correctness of single instances. Their main
motivation is to filter high precision results for database population. Similar to
CDE, they use background knowledge features like record length, single field
label assignments and field confidence values to estimate record confidence. CDE
generalizes common confidence estimation because the goal of CDE is the
estimation of the quality of arbitrary models.
      </p>
      <p>
        Active learning algorithms are able to choose the order in which training
examples are presented in order to improve learning, typically by selective
sampling [
        <xref ref-type="bibr" rid="ref15 ref2">2</xref>
]. While the general CDE setting does not necessarily contain aspects
of selective sampling (consider, for example, the batch F1 score prediction task),
the ranking task can be used as a selective sampling strategy in applications
to find instances that support system refactoring. The focus of the F1 ranking
task, however, still differs from active learning goals which is essential for the
design of such systems. Both approaches are supposed to favor different
techniques to fit their different objectives. Popular active learning approaches such
as density-weighting (e.g., [
        <xref ref-type="bibr" rid="ref12 ref25">12</xref>
        ]) focus on dense regions of the input distribution.
CDE, however, tries to estimate the quality of the model on the whole data set
and hence demands differently designed methods. Despite their differences,
the combination of active learning and CDE would be an interesting subject for
future work. CDE may be used to find weak learners of ensembles and
informative instances for these learners.
The Constraint-driven Evaluation (CDE) framework presented in this work
allows the user to specify expectations about the domain in form of constraints.
These constraints are applied on documents with annotations, which have been
created by an information extraction model. The results of the constraints are
aggregated to a single cde score, which reflects how well the annotations fulfill
the user’s expectations and thus provide a predicted measurement of the model’s
quality for these documents. The framework is implemented as an extension of
the UIMA Ruta Workbench. Figure 1 provides a screenshot of the CDE
perspective, which includes different views to formalize the set of constraints and
to present the predicted quality of the model for the specified documents.
      </p>
      <p>
We define a constraint in this work as a function C : CAS → [0, 1], which
returns a confidence value for an annotated document (CAS) where high values
indicate that the expectations are fulfilled. Two different types of constraints
are currently supported: Rule constraints are simple UIMA Ruta rules without
actions, and allow specifying sequential patterns or other relationships between
annotations that need to be fulfilled. The result is basically the ratio of how
often the rule has tried to match compared to how often the rule has actually
matched.</p>
      <p>Fig. 1. CDE perspective in the UIMA Ruta Workbench. Bottom left: Expectations
on the domain formalized as constraints. Top right: Set of documents and their cde
scores. Bottom right: Results of the constraints for the selected document.</p>
      <p>An example for such a constraint is Document{CONTAINS(Author)};,
which specifies that each document must contain an annotation of the type
Author. The second type of supported constraints are Annotation Distribution
(AD) constraints (c.f. Generalized Expectations [
        <xref ref-type="bibr" rid="ref11 ref24">11</xref>
        ]). Here, the expected
distribution of an annotation or word is given for the evaluated types. The result of
the constraint is the cosine similarity of the expected and the observed presence
of the annotation or word within annotations of the given types. A constraint
like "Peter": Author 0.9, Title 0.1, for example, indicates that the word
“Peter” should rather be covered by an Author annotation than by a Title
annotation. The set of constraints and their weights can be defined using the CDE
Constraint view (c.f. Figure 1, bottom left).
      </p>
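      <p>The two constraint types can be mimicked in a few lines of Python (an illustration only; the actual constraints are UIMA Ruta rules and configuration entries, and the function names here are hypothetical):</p>

```python
# Illustrative sketch of the two constraint types (hypothetical names).
import math

def rule_constraint(tried, matched):
    """Ratio of actual matches to match attempts of a rule, in [0, 1]."""
    return matched / tried if tried else 1.0

def ad_constraint(expected, observed):
    """Cosine similarity between the expected and the observed coverage,
    e.g. expected = {"Author": 0.9, "Title": 0.1} for the word "Peter"."""
    types = set(expected) | set(observed)
    dot = sum(expected.get(t, 0.0) * observed.get(t, 0.0) for t in types)
    ne = math.sqrt(sum(v * v for v in expected.values()))
    no = math.sqrt(sum(v * v for v in observed.values()))
    return dot / (ne * no) if ne and no else 0.0
```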
      <p>For a given set of constraints C = {C1, C2, ..., Cn} and corresponding weights
w = {w1, w2, ..., wn}, the cde score for each document is defined by the weighted
average:
cde = (1/n) · Σ_{i=1}^{n} w_i · C_i</p>
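      <p>As a one-line sketch, the weighted average above reads:</p>

```python
# The cde score: cde = (1/n) * sum_i w_i * C_i, for constraint results in [0, 1].
def cde_score(constraint_values, weights):
    n = len(constraint_values)
    return sum(w * c for w, c in zip(weights, constraint_values)) / n
```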
      <p>The cde scores for a set of documents may already be very useful as a
report how well the annotations comply with the expectations on the domain.
However, one can further distinguish two tasks for CDE: the prediction of the
actual evaluation score of the model, e.g., the token-based F1 score, and the
prediction of the quality ranking of the documents. While the former task can
give answers on how well the model performs or whether the model is already good
enough for the application, the latter task provides a useful tool for introspection:
Which documents are poorly labeled by the model? Where should the model
be improved? Are the expectations on the domain realistic? Due to the limited
expressiveness of the aggregation function, we concentrate on the latter task. The
cde scores for the annotated documents are depicted in the CDE Documents
view (c.f. Figure 1, top right). The result of each constraint for the currently
selected document is given in the CDE Results view (c.f. Figure 1, bottom right).</p>
      <p>The development of the constraints needs to be supported by tooling in order
to achieve an improved prediction in the intended task. If the user extends or
refines the expectations on the domain, then a feedback whether the prediction
has improved or deteriorated is very valuable. For this purpose, the framework
provides functionality to evaluate the prediction quality of the constraints themselves.
Given a set of documents with gold annotations, the cde score of each document
can be compared to the actual F1 score. Four measures are applied to evaluate the
prediction quality of the constraints: the mean squared error, the Spearman’s
rank correlation coefficient, the Pearson correlation coefficient and the cosine
similarity. For optimizing the constraints to approximate the actual F1 score,
the Pearson’s r is maximized, and for improving the predicted ranking, the
Spearman’s ρ is maximized. If documents with gold annotations are available,
then the F1 scores and the values of the four evaluation measures are given in
the CDE Documents view (c.f. Figure 1, top right).</p>
      <p>The usability and advantages of the presented work are illustrated with a simple
case study concerning the segmentation of scientific references, a popular domain
for evaluating novel information extraction models. In this task, the information
extraction model normally identifies about 12 different entities of the reference
string, but in this case study we limited the relevant entities to Author, Title
and Date, which are commonly applied in order to identify the cited publication.</p>
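      <p>The correlation-based comparison of cde scores against gold F1 scores (Spearman’s ρ for the ranking, Pearson’s r for linear dependency) can be sketched in pure Python (illustrative only; ties in the Spearman ranking are not handled, and any statistics library would do):</p>

```python
# Pure-Python sketch of the two prediction-quality measures.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    # Rank the values (ties not handled in this sketch), then correlate ranks.
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return pearson_r(ranks(xs), ranks(ys))
```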
      <p>
        In the main scenario of the case study, we try to estimate the extraction
quality of a set of UIMA Ruta rules that shall identify the Author, Title and
Date of a reference string. For this purpose, we define constraints representing
the background knowledge about the domain for this specific set of rules.
Additionally to this main setting of the case study, we also measure the prediction of
the constraints in two different scenarios: In the first one, the documents have
been labeled not by UIMA Ruta rules, but by a CRF model [
        <xref ref-type="bibr" rid="ref10 ref23">10</xref>
        ]. The CRF
model was trained with a limited number of iterations in a 5-fold manner. In
a second scenario, we apply the UIMA Ruta rules on a set of documents of a
different distribution including unknown style guides.
      </p>
      <p>
        Table 1 provides an overview of the applied datasets. We make use of the
references dataset of [
        <xref ref-type="bibr" rid="ref22 ref9">9</xref>
]. This data set is homogeneously divided into three
sub-datasets with respect to their style guides and number of references.
      </p>
      <p>Druta: 219 references in 8 documents, used to develop the set of UIMA Ruta rules.</p>
      <p>Ddev: 192 references in 8 documents, labeled by the UIMA Ruta rules and applied
for developing the constraints.</p>
      <p>Dtest: 155 references in 7 documents, labeled by the UIMA Ruta rules and applied to
evaluate the constraints.</p>
      <p>Dcrf: Druta, Ddev and Dtest (566 references in 23 documents), labeled by a (5-fold)
CRF model.</p>
      <p>Dgen: 452 references in 28 documents from a different source with unknown style
guides, labeled by the UIMA Ruta rules.</p>
      <p>
These datasets are applied to develop the UIMA Ruta rules, to define the set of constraints, and to
evaluate the prediction of the constraints compared to the actual F1 score. The CRF
model is trained on the partitions given in [
        <xref ref-type="bibr" rid="ref22 ref9">9</xref>
        ]. The last dataset Dgen consists of
a mixture of the datasets Cora, CiteSeerX and FLUX-CiM described in [
        <xref ref-type="bibr" rid="ref16 ref3">3</xref>
        ]
generated by the rearrangement of [
        <xref ref-type="bibr" rid="ref21 ref8">8</xref>
        ].
      </p>
      <p>Cruta+bib: Cruta extended with one additional AD constraint covering the
entity distribution of words extracted from Bibsonomy. The weight of each
constraint is set to 1.</p>
      <p>Cruta+5xbib: Same set of constraints as in Cruta+bib, but the weight of the additional
AD constraint is set to 5.</p>
      <p>Table 2 provides an overview of the different sets of constraints, whose
predictions are compared to the actual F1 score. First, we extended and refined a
set of UIMA Ruta rules until they achieved an F1 score of 1.0 on the dataset
Druta. Then, 15 Rule constraints Cruta5 have been specified using the dataset
Ddev. The definition of the UIMA Ruta rules took about two hours and the
definition of the constraints about one hour. Additionally to the Rule constraints,
we created an AD constraint, which consists of the entity distribution of words
that occurred at least 1000 times in the latest Bibtex database dump of
Bibsonomy6. The set of constraints Cruta+bib and Cruta+5xbib combine both types of
constraints with different weighting.</p>
      <p>5 The actual implementation of the constraints as UIMA Ruta rules is depicted in
Figure 1 (lower left part).
6 http://www.kde.cs.uni-kassel.de/bibsonomy/dumps</p>
      <p>Table 3 contains the evaluation, which compares the predicted cde score
to the actual token-based F1 score for each document. We apply two different
correlation coefficients for measuring the quality of the prediction: Spearman’s ρ
gives an indication about the ranking of the documents and Pearson’s r provides
a general measure of linear dependency.</p>
      <p>Although the expectations defined by the sets of constraints are limited and
quite minimalistic, covering mostly only common expectations, the results
indicate that they can be useful in all examined scenarios. The results for dataset Ddev are
only given for completeness since this dataset was applied to define the set of
constraints. The results for the dataset Dtest, however, reflect the prediction
on unseen documents of the same distribution. The ranking of the documents
was almost perfectly estimated with a Spearman’s ρ of 0.96157. The coefficients
for the other scenarios Dcrf and Dgen are considerably decreased, but the cde
scores are nevertheless very useful for an assessment of the extraction model’s
quality. The five worst documents in Dgen (including new style guides), for
example, have been reliably detected. The results show that the AD constraints
can improve the prediction, but do not exploit their full potential in the current
implementation. The impact measured for the dataset Dcrf is not as distinctive
since the CRF model already includes such features and thus is able to avoid
errors that are detected by these constraints. However, the prediction in the
dataset Dgen is considerably improved. The UIMA Ruta rules produce severe
errors in documents with new style guides, which are easily detected by the word
distribution.
</p>
      <sec id="sec-11-1">
        <title>Conclusions</title>
        <p>This paper presented a tool for the UIMA community implemented in UIMA
Ruta, which enables estimating the extraction quality of arbitrary models on
unseen documents. Its introspective report is able to improve the development
of information extraction models with minimal effort. This is achieved
by formalizing the background knowledge about the domain with different types
of constraints. We have shown the usability and advantages of the approach in
a case study about segmentation of references. Concerning future work, many
prospects for improvement remain, for example a logistic regression model for
approximating the scores of arbitrary evaluation measures, new types of
constraints, or approaches to automatically acquire the expectations on a domain.
7 The actual cde and F1 scores of Dtest are depicted in Figure 1 (right part).
Acknowledgments This work was supported by the Competence Network
Heart Failure, funded by the German Federal Ministry of Education and
Research (BMBF01 EO1004).</p>
      </sec>
      <sec id="sec-11-2">
        <title>References</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Srikanth</given-names>
            <surname>Ramaka</surname>
          </string-name>
          .
          <article-title>Unsupervised ontology-based semantic tagging for knowledge markup</article-title>
          .
          <source>In Proceedings of the Workshop on Learning in Web Search at the International Conference on Machine Learning</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Hamish</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>GATE, a general architecture for text engineering</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>254</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>David</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lally</surname>
          </string-name>
          .
          <article-title>UIMA: an architectural approach to unstructured information processing in the corporate research environment</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>327</fpage>
          -
          <lpage>348</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Kyungduk</given-names>
            <surname>Kim</surname>
          </string-name>
          et al.
          <article-title>A frame-based probabilistic framework for spoken dialog management using dialog examples</article-title>
          .
          <source>In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando C.N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Conditional random fields: probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Machine Learning</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ron</given-names>
            <surname>Meir</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gunnar</given-names>
            <surname>Rätsch</surname>
          </string-name>
          .
          <article-title>An introduction to boosting and leveraging</article-title>
          .
          <source>In Advanced lectures on machine learning</source>
          , pages
          <fpage>118</fpage>
          -
          <lpage>183</lpage>
          . Springer,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>David</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xing</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Table extraction using conditional random fields</article-title>
          .
          <source>In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>235</fpage>
          -
          <lpage>242</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Pallavi</given-names>
            <surname>Pyreddy</surname>
          </string-name>
          and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Tintin: A system for retrieval in text tables</article-title>
          .
          <source>In Proceedings of the second ACM international conference on Digital libraries</source>
          , pages
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Lev</given-names>
            <surname>Ratinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In Proceedings of the thirteenth conference on Computational Natural Language Learning</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Soderland</surname>
          </string-name>
          .
          <article-title>Learning information extraction rules for semi-structured and free text</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>34</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>272</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Nallapati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Legal claim identification: Information extraction with hierarchically labeled data</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on the Semantic Processing of Legal Texts</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Suzanne Liebowitz</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard</given-names>
            <surname>Fritzson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jon A</given-names>
            <surname>Pastor</surname>
          </string-name>
          .
          <article-title>Extraction of data from preprinted forms</article-title>
          .
          <source>Machine Vision and Applications</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>222</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Viola</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mukund</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          .
          <article-title>Learning to extract information from semistructured text using a discriminative context free grammar</article-title>
          .
          <source>In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>330</fpage>
          -
          <lpage>337</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bellare</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Druck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Alternating Projections for Learning with Expectation Constraints</article-title>
          .
          <source>In: Proceedings of the Twenty-Fifth Conference on Uncertainty in AI</source>
          . pp.
          <fpage>43</fpage>
          -
          <lpage>50</lpage>
          . AUAI Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atlas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladner</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Improving generalization with active learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>15</volume>
          ,
          <fpage>201</fpage>
          -
          <lpage>221</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          3.
          <string-name>
            <surname>Councill</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>M.Y.</given-names>
          </string-name>
          :
          <article-title>ParsCit: an Open-source CRF Reference String Parsing Package</article-title>
          .
          <source>In: Proceedings of the Sixth International Language Resources and Evaluation (LREC'08)</source>
          . ELRA, Marrakech, Morocco (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          4.
          <string-name>
            <surname>Culotta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Confidence Estimation for Information Extraction</article-title>
          .
          <source>In: Proceedings of HLT-NAACL 2004: Short Papers</source>
          . pp.
          <fpage>109</fpage>
          -
          <lpage>112</lpage>
          . HLT-NAACL-Short '04, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lally</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          /4),
          <fpage>327</fpage>
          -
          <lpage>348</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          6.
          <string-name>
            <surname>Graca</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganchev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Expectation Maximization and Posterior Constraints</article-title>
          . In:
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roweis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (eds.) NIPS 20, pp.
          <fpage>569</fpage>
          -
          <lpage>576</lpage>
          . MIT Press, Cambridge, MA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kluegl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atzmueller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>TextMarker: A Tool for Rule-Based Information Extraction</article-title>
          . In:
          <string-name>
            <surname>Chiarcos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Castilho</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stede</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the 2nd UIMA@GSCL Workshop</source>
          . pp.
          <fpage>233</fpage>
          -
          <lpage>240</lpage>
          . Gunter Narr Verlag (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kluegl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Local Adaptive Extraction of References</article-title>
          .
          <source>In: 33rd Annual German Conference on Artificial Intelligence (KI 2010)</source>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kluegl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toepfer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lemmerich</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Collective Information Extraction with Context-Specific Consistencies</article-title>
          . In:
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Bie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.)
          <source>ECML/PKDD (1). Lecture Notes in Computer Science</source>
          , vol.
          <volume>7523</volume>
          , pp.
          <fpage>728</fpage>
          -
          <lpage>743</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</article-title>
          .
          <source>Proc. 18th International Conf. on Machine Learning</source>
          pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>11</volume>
          ,
          <fpage>955</fpage>
          -
          <lpage>984</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          12.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Employing EM and Pool-Based Active Learning for Text Classification</article-title>
          . In:
          <string-name>
            <surname>Shavlik</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          (ed.) ICML. pp.
          <fpage>350</fpage>
          -
          <lpage>358</lpage>
          . Morgan Kaufmann (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          13.
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masanz</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogren</surname>
            ,
            <given-names>P.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kipper-Schuler</surname>
            ,
            <given-names>K.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications</article-title>
          .
          <source>Journal of the American Medical Informatics Association: JAMIA</source>
          <volume>17</volume>
          (
          <issue>5</issue>
          ),
          <fpage>507</fpage>
          -
          <lpage>513</lpage>
          (
          <year>Sep 2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>