<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Unstructured Information Management Architecture (UIMA) 3rd UIMA@GSCL Workshop</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jean-Cedric Chappelier</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pei Chen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joan Codina</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Padó</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Puppe</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Telefont</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingo Thon</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Toepfer</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>23</volume>
      <issue>2013</issue>
      <fpage>25</fpage>
      <lpage>66</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Preface</title>
      <p>Copyright © 2013 for the individual papers by the papers' authors. Copying
permitted only for private and academic purposes. This volume is published and
copyrighted by its editors.</p>
      <p>For many decades, NLP has suffered from low software engineering standards,
causing a limited degree of re-usability of code and interoperability of different
modules within larger NLP systems. While this did not really hamper success
in limited task areas (such as implementing a parser), it caused serious
problems for the emerging field of language technology, where the focus is on building
complex integrated software systems, e.g., for information extraction or machine
translation. This lack of integration has led to duplicated software development,
work-arounds for programs written in different (versions of) programming
languages, and ad-hoc tweaking of interfaces between modules developed at
different sites.</p>
      <p>In recent years, the Unstructured Information Management Architecture
(UIMA) framework has been proposed as a middleware platform which offers
integration by design through common type systems and standardized
communication methods for components analysing streams of unstructured
information, such as natural language. The UIMA framework offers a solid processing
infrastructure that allows developers to concentrate on the implementation of
the actual analytics components. An increasing number of members of the NLP
community have thus adopted UIMA as a platform facilitating the creation of
reusable NLP components that can be assembled to address different NLP tasks
depending on their order, combination and configuration.</p>
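<p>The component model described above can be sketched in a few lines. The following is a conceptual illustration only, not the actual UIMA API (all class and method names are ours): annotators read and write typed annotations over a shared, CAS-like store, so they can be reordered and recombined freely.</p>

```python
# Conceptual sketch only -- NOT the real UIMA API. A shared, CAS-like
# annotation store lets independently developed components interoperate:
# each annotator reads and writes typed annotations over the same text.

class Cas:
    """Holds the document text plus typed (type, begin, end) annotations."""
    def __init__(self, text):
        self.text = text
        self.annotations = []

    def select(self, type_name):
        return [a for a in self.annotations if a[0] == type_name]

class Tokenizer:
    def process(self, cas):
        pos = 0
        for tok in cas.text.split():
            begin = cas.text.index(tok, pos)
            cas.annotations.append(("Token", begin, begin + len(tok)))
            pos = begin + len(tok)

class CapitalizedTagger:
    """Depends only on the Token type, not on which tokenizer produced it."""
    def process(self, cas):
        for _, b, e in cas.select("Token"):
            if cas.text[b].isupper():
                cas.annotations.append(("Capitalized", b, e))

def run_pipeline(text, components):
    cas = Cas(text)
    for component in components:   # order/configuration is just a list here
        component.process(cas)
    return cas

cas = run_pipeline("UIMA components are Reusable",
                   [Tokenizer(), CapitalizedTagger()])
print(len(cas.select("Token")), len(cas.select("Capitalized")))  # prints: 4 2
```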
      <p>This workshop aims at bringing together members of the NLP community
– users, developers or providers of either UIMA components or UIMA-related
tools – in order to explore and discuss the opportunities and challenges in using
UIMA as a platform for modern, well-engineered NLP.</p>
      <p>This volume contains the proceedings of the 3rd UIMA workshop to
be held under the auspices of the German Language Technology and
Computational Linguistics Society (Gesellschaft für Sprachverarbeitung und
Computerlinguistik – GSCL) in Darmstadt, September 23, 2013. From 11 submissions, the
programme committee selected 7 full papers and 2 short papers. The organizers
of the workshop wish to thank all people involved in this meeting – submitters
of papers, reviewers, GSCL staff and representatives – for their great support,
rapid and reliable responses, and willingness to act on very sharp timelines. We
appreciate their enthusiasm and cooperation.</p>
    </sec>
    <sec id="sec-2">
      <title>September 2013</title>
      <p>Peter Kluegl, Richard Eckart de Castilho, Katrin Tomanek (Eds.)</p>
      <sec id="sec-2-1">
        <title>Program Committee</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>University of Manchester</title>
      <p>KU Leuven
Nuance Deutschland
University of Bielefeld
IBM Thomas J. Watson Research Center
University of Colorado
Technische Universität Darmstadt
Averbis
Technische Universität Darmstadt
Temis Deutschland
IBM Deutschland
FSU Jena
University of Nantes
IBM Deutschland
Vassar College
University of Würzburg
Carnegie Mellon University
Averbis
IBM Thomas J. Watson Research Center
University of Würzburg
Averbis
National ICT Australia
University of Helsinki
University of Duisburg-Essen</p>
      <sec id="sec-3-1">
        <title>Additional Reviewers</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Roman Klinger</title>
    </sec>
    <sec id="sec-5">
      <title>University of Bielefeld</title>
      <p>Storing UIMA CASes in a relational database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10</p>
      <p>Georg Fette, Martin Toepfer and Frank Puppe
CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration 14
Elmer Garduno, Zi Yang, Avner Maiberg, Collin McCormack, Yan Fang and Eric
Nyberg
Aid to spatial navigation within a UIMA annotation index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18</p>
      <p>Nicolas Hernandez
Using UIMA to Structure An Open Platform for Textual Entailment . . . . . . . . . . . . . . . . . . . . . 26</p>
      <p>Tae-Gil Noh and Sebastian Padó
Bluima: a UIMA-based NLP Toolkit for Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34</p>
      <p>Renaud Richardet, Jean-Cedric Chappelier and Martin Telefont
Sentiment Analysis and Visualization using UIMA and Solr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Carlos Rodríguez-Penagos, David García Narbona, Guillem Massó Sanabre, Jens
Grivolla and Joan Codina
Extracting hierarchical data points and tables from scanned contracts . . . . . . . . . . . . . . . . . . . . 50</p>
      <p>Jan Stadermann, Stephan Symons and Ingo Thon
Constraint-driven Evaluation in UIMA Ruta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Andreas Wittek, Martin Toepfer, Georg Fette, Peter Kluegl and Frank Puppe</p>
      <sec id="sec-5-1">
        <title>Keynote: Apache clinical Text Analysis and Knowledge Extraction System (cTAKES)</title>
        <p>
The presentation will focus on methods and software development behind the
cTAKES platform. An overview of the modules will set the stage, followed by
more in-depth discussion of some of the methods and evaluations of select
modules. The second part of the presentation will shift to software development
topics such as optimization and distributed computing including UIMA
integration, UIMA-AS, as well as our plans for UIMA-DUCC integration. A live
demo of cTAKES will conclude the talk.</p>
        <sec id="sec-5-1-1">
          <title>About the speakers</title>
          <p>Pei Chen is a Vice President of the Apache Software Foundation, leading the
top-level cTAKES project1. He is also a lead application development
specialist at the Informatics Program at Boston Children's Hospital/Harvard Medical
School. Mr. Chen's interests lie in building practical applications using machine
learning techniques. He has a passion for the end-user experience and has a
background in Computer Science/Economics. Mr. Chen is a firm believer in the
open source community, contributing to cTAKES as well as other Apache
Software Foundation projects.</p>
          <p>Guergana Savova, Ph.D. is a member of the faculty at Harvard Medical School
and Children's Hospital Boston. Her research interest is in natural language
processing (NLP), especially as applied to the text generated by physicians (the
clinical narrative), focusing on higher-level semantic and discourse processing,
which includes topics such as named entity recognition, event recognition,
relation detection, and classification including co-reference and temporal relations.
The methods are mostly machine learning, spanning supervised, lightly
supervised, and completely unsupervised. Her interest is also in the application of
NLP methodologies to biomedical use cases. Dr. Savova has been leading the
development and is the principal architect of cTAKES. She holds a Master of
Science in Computer Science and a PhD in Linguistics with a minor in Cognitive
Science from the University of Minnesota.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>A Model-driven approach to NLP programming with UIMA</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Alessandro Di Bari, Alessandro Faraotti, Carmela Gambardella, and Guido Vetere</title>
      <p>IBM Center for Advanced Studies of Trento</p>
      <p>Piazza Manci 1, Povo di Trento
Abstract. In Natural Language Processing, more complex business use
cases and shorter delivery times drive a growing need for smoother, more
flexible and faster implementations. This trend also requires integrating
and orchestrating different functionalities delivered by services
belonging to different technological platforms. All these needs imply raising the
level of abstraction for NLP component development. In this paper we
present a Model Driven Architecture approach suitable to develop an
open and interoperable UIMA-based NLP stack. By decoupling UIMA
NLP models from other solution-specific platforms and services, we
obtain major architectural improvements.</p>
      <sec id="sec-6-1">
        <title>Introduction</title>
        <p>As Natural Language Processing (NLP) approaches complex tasks such as
Question Answering or Dialog Management, the capability for NLP tools to
seamlessly interoperate with other software services, such as knowledge bases or rules
engines, becomes crucial. Such a level of integration may require linguistic models
to be shared among a variety of different platforms, each of which comes with
its own information representation language. Platforms like UIMA1 or GATE2
consist of middleware and tools for designing and pipelining NLP specific tasks,
including support for modeling data structures for text annotation, such as
lexical, morphological and syntactic features, which may be embedded in
interprocess communication protocols. However, while perfectly suited for
annotation purposes, NLP specific schema languages, such as the UIMA Type System,
fall short of fulfilling solution-level modeling needs. Model-Driven software
Architectures (MDA), on the other hand, are specifically aimed at tackling the
complexity of modern software infrastructures, with emphasis on the integration
and the orchestration of different technological platforms. The MDA approach
is based on providing formal descriptions (models) of requirements, interactions,
data structures, protocols, and many other aspects of the desired system, which
are automatically turned into technical resources, such as schemes and software
modules, by activating transformation rules.
1 http://uima.apache.org/
2 http://gate.ac.uk/</p>
        <p>
          Based on this consideration, we adopted an MDA approach to develop a
“Watson ready”3, UIMA-based NLP stack for Italian, as part of the activity
of the newborn IBM Language &amp; Knowledge Center for Advanced Studies of
Trento4. We wanted our stack to be as open and interoperable as possible, to
help users leverage the availability of NLP resources and tools in the Open
Source / Open Data space. In addition, our stack aims at being independent
of language-specific issues and domains, to facilitate its reuse across projects
and within our (multinational) Company. The basic idea was to design a highly
modularized general model including all the required structures, and to obtain
technical platform-specific resources from a suitable set of model-to-model
transformations. Also, we embraced the idea of abstracting semantic information away
from the UIMA Type System, as in [
          <xref ref-type="bibr" rid="ref18 ref5">5</xref>
          ] and in [
          <xref ref-type="bibr" rid="ref20 ref7">7</xref>
          ], and evaluated the benefit of
representing such kind of information by specific means. In sum, we looked at
UIMA as a well-suited platform for linguistic analysis, which allows the
integration of analytic components into managed workflow pipelines, but regarded
the UIMA Type System as a schema specification for that platform, rather than
as a general modeling language for any NLP-based solution.
        </p>
        <p>Here we present an overview of the basic ideas behind our approach, introduce
our project, and discuss future directions. At the present stage of development,
we can share our vision on MDA positioning and motivation with respect to NLP
development (section 3), and we can report our first implementation experiences
(section 4). Finally, we outline some related topics and introduce future work.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Motivating Scenario</title>
        <p>Natural Language based solutions may require the NLP stack to cooperate with
other components in a complex system. Such cooperation typically involves data
exchanges with reference to a shared information model. Figure 1 shows
the integration of an NLP stack with a Knowledge Base (e.g. an Ontology-based
Data Access System) and a Rule Engine.</p>
        <p>A UIMA-based NLP pipeline produces an annotated text (step 1 in
Figure 1) contained in a UIMA CAS (Common Analysis Structure). A wrapper
of the UIMA Type System defines all the operations needed for a consumer (the
Rule Engine in this case) in order to access the CAS and invoke the
appropriate operations within the cooperating subsystem when needed (see 4.2). When
developing and maintaining the solution, an Engineer builds a rule set (see step
3) in order to process linguistic structures and interact with a Knowledge Base
(step 4), which, in turn, uses the annotated text to store assertions as the result
of an Information Extraction process (step 5). In a separate flow, the Knowledge
Base can be queried by a User through a Question Answering System based on
a suitable query language (step 6). The integration of all components involved
is guaranteed by a common abstract model (Platform Independent Model) that
contains the overall conceptualization of the system. The transition from one
3 www.ibm.com/watson/
4 www.ibm.com/ibm/cas/
platform-specific data structure to another is handled by a set of
Model-to-Model transformations (steps 7 and 8). The figure also shows the link to legacy
(possibly huge) conceptual models, such as the KB ontology (step 9).</p>
      </sec>
      <sec id="sec-6-3">
        <title>Model Driven Architecture for NLP</title>
        <p>
          Model Driven Architecture (MDA) [
          <xref ref-type="bibr" rid="ref19 ref6">6</xref>
          ] is a development approach, strictly based
on formal specifications of information structures and behaviors, and their
semantics. MDA is managed by Object Management Group (OMG)5 based on
several modeling standards such as: Unified Modeling Language (UML)6,
Meta Object Facility (MOF), XML Metadata Interchange (XMI) and others. MDA
supports Model Driven Development/Engineering (MDD, MDE).
        </p>
        <p>The key idea behind MDA is to provide a higher level of abstraction so that
software can be fully designed independently from the underlying technological
platform. More formally, MDA defines three macro “modeling” layers:
– Computation Independent Model (CIM)
– Platform Independent Model (PIM)
– Platform Specific Model (PSM)</p>
        <p>The first one can be related to a Business Process Model and does not
necessarily imply the existence of a system that automates it. The PIM is a model
that is independent from any technical platform; the third (PSM) layer is the
actual implementation of the model with respect to a given technology and it is
automatically derived from the PIM. Notice that the PIM allows a
comprehensive representation of the structure and behavior of the system being developed.
5 http://omg.org/
6 http://www.uml.org/
The modeling language is typically UML or EMF7, but it could actually be any
other Domain Specific Language (DSL).</p>
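<p>The layering can be illustrated with a toy transformation (the model encoding and generator functions below are our own sketch, not the actual IBM tooling): one platform-independent class description is turned, by two transformation rules, into two platform-specific artifacts — a UIMA-style type system fragment and a relational schema.</p>

```python
# Hedged sketch of PIM -> PSM transformations. The dict encoding and both
# generators are illustrative inventions; only the target notations (UIMA
# type descriptors, SQL DDL) mirror real platforms.

pim = {
    "name": "Person",
    "parent": "uima.tcas.Annotation",
    "features": [("fullName", "String"), ("age", "Integer")],
}

def to_uima_type(model):
    """PSM 1: a simplified UIMA-style type system fragment."""
    feats = "\n".join(
        '    <feature name="{}" range="uima.cas.{}"/>'.format(n, t)
        for n, t in model["features"]
    )
    return ('<typeDescription name="{}" supertype="{}">\n{}\n</typeDescription>'
            .format(model["name"], model["parent"], feats))

def to_sql_table(model):
    """PSM 2: a relational schema generated from the same PIM."""
    type_map = {"String": "TEXT", "Integer": "INTEGER"}
    cols = ", ".join("{} {}".format(n, type_map[t])
                     for n, t in model["features"])
    return "CREATE TABLE {} (id INTEGER PRIMARY KEY, {});".format(
        model["name"], cols)

print(to_uima_type(pim))
print(to_sql_table(pim))
```

Because both artifacts derive from the same source model, a change to the PIM propagates to every platform by re-running the transformations.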
        <p>
          Developing powerful NLP tasks, such as Question Answering systems,
requires combining a great variety of analytic components, which is what UIMA
has been designed for. We consider UIMA the standard solution for document
workflow analysis. Within this framework, MDD tools can be effectively used to
better manage the UIMA Type System. In particular, we decided to look at it
as a PSM dedicated to text annotation. The motivation for leveraging MDD (in
the NLP field) can be summarized as follows:
– Formalization: MDA languages are well studied in logic, and reasoning
mechanisms can be developed upon them [
          <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ]
– Expressiveness: MOF meta-modeling allows great and well-founded
expressiveness [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ], including modeling behaviors.
– Support: The availability of tools, including diagramming and code
generation, improves software life-cycle and team collaboration.
        </p>
        <p>In particular, with respect to our architecture, we modeled UIMA
annotations by defining classes rather than just (data) types, so that a consumer is able
to invoke operations designed for those objects. Access to UIMA annotations is
then achieved by means of automatically generated wrappers. Another
motivation for a model driven approach was the need to represent complex linguistic
data, and exploit existing tooling and resources for generating training data for
a statistical parser.</p>
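<p>The wrapper idea can be sketched as follows (all names are illustrative placeholders, not the generated code): the PIM declares an operation on the annotation class, and a generated wrapper exposes that operation while delegating storage to the raw, platform-specific object, so consumers never touch the platform type directly.</p>

```python
# Illustrative sketch of the generated-wrapper pattern (our own names).
# RawAnnotation stands in for the platform object (e.g. a JCas-generated
# class); the wrapper implements the PIM-level interface and operations.

class RawAnnotation:
    """Platform-specific storage: text plus character offsets."""
    def __init__(self, text, begin, end):
        self.text, self.begin, self.end = text, begin, end

class TokenWrapper:
    """Generated wrapper: PIM interface, delegating to the raw object."""
    def __init__(self, raw):
        self._raw = raw

    def covered_text(self):      # structural access, delegated
        return self._raw.text[self._raw.begin:self._raw.end]

    def is_acronym(self):        # behavior declared at the PIM level
        t = self.covered_text()
        return len(t) > 1 and t.isupper()

tok = TokenWrapper(RawAnnotation("UIMA rocks", 0, 4))
print(tok.covered_text(), tok.is_acronym())   # prints: UIMA True
```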
        <p>In sum, we tried to exploit the maturity and flexibility of MDD tools while
keeping up the power of UIMA as a framework for component integration,
pipeline execution, and workflow management in general. As the PIM language,
we chose EMF because it is already integrated with UIMA and provides
powerful and mature model-driven features. Once the code is also generated (by
UIMA JCasGen), the type system corresponds to an implementation of a
(business) domain model, limited to the structural aspects (as opposed to behavioral
aspects).</p>
        <p>At PIM level, we also have to represent those properties that, once
transformed against a target model, give specific characteristics to that model. For
instance, in order to generate the UIMA Type System (PSM) starting from the
PIM, we have to represent on the source model whether a class (that is, a root
in a hierarchy on the PIM model) will be generated as a UIMA annotation or
not (UIMA TOP). Here we have taken two possible scenarios into account:
– Having a UML PIM, this specification is easily accomplished by using a
UML profile8. Profiles define stereotypes that can be further structured with
custom properties. This way, we have a generic “Unstructured Information”
profile that, at a minimum, encompasses an Annotation stereotype; thus a class
that is intended to become an annotation will simply be “marked” with this
stereotype.
7 http://www.eclipse.org/modeling/emf/
8 http://www.omg.org/spec/#M&amp;M
– Having an EMF PIM (such as our current implementation), we can represent
the same thing as an EMF annotation. Therefore (we apologize for the
terminological conflict), we will have a class annotated as Annotation.</p>
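<p>A hypothetical sketch of this marking mechanism (the decorator and generator below are our own illustration, not EMF or UML tooling): a class-level marker plays the role of the stereotype, and the generator consults it to decide whether a class becomes an annotation type or an ordinary type.</p>

```python
# Sketch of the "Annotation stereotype" idea with a plain class marker.
# The decorator name and generator are illustrative; only the UIMA
# supertype names (uima.tcas.Annotation, uima.cas.TOP) are real.

def annotation(cls):
    """Plays the role of the UML stereotype / EMF annotation marker."""
    cls._is_annotation = True
    return cls

@annotation
class Token:                 # marked: becomes a document annotation
    pass

class LexiconEntry:          # unmarked: stays a plain structure
    pass

def generate_supertype(cls):
    """The transformation inspects the marker to pick the UIMA supertype."""
    if getattr(cls, "_is_annotation", False):
        return "uima.tcas.Annotation"
    return "uima.cas.TOP"

print(generate_supertype(Token), generate_supertype(LexiconEntry))
# prints: uima.tcas.Annotation uima.cas.TOP
```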
        <p>In any case, a class stereotyped as Annotation on the PIM will take the role of
a generic annotation for document analysis, independently from the underlying
framework.</p>
        <p>The main benefit of our approach is the ability to represent NLP objects
independently from any particular implementation: we are using different
(generated) PSMs (that are better explained in section 4), all deriving from the
starting (PIM) model, as shown in Figure 2.</p>
        <p>These benefits certainly come at a price, which is essentially the
cost of developing the necessary transformations. However, following basic
assumptions of the MDD approach, we estimate that those costs pay off,
especially when heterogeneous components have to be integrated,
development is managed iteratively, and models are subject to high volatility.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Model Driven Implementation Aspects</title>
        <p>To clarify how we are leveraging the Model Driven approach, we list
here the artifacts (PSMs and code) we are generating through the appropriate
transformations that we have developed.
Starting from our “application” model:
– UIMA type system (we modified the existing transformation from EMF in
order to avoid any further modification on the UIMA type system)
– EMF wrapper of UIMA type system
– this wrapper also acts as the input for creating the model for the Rule engine
as explained below
Starting from (our) models of common standard data formats for parser training, such as
CoNLL, PENN and others, we generated all the necessary (OpenNLP-specific) data
for training the parser on:
– Tokenization
– Named Entities
– Part of Speech tagging
– Chunking
– Parsing</p>
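<p>As a rough illustration of such a format transformation (the column layout below is a simplified CoNLL-style format, and the converter is our own sketch, not the authors' JET templates): token/POS lines are flattened into the one-sentence-per-line word_TAG layout used by OpenNLP's POS trainer.</p>

```python
# Hedged sketch: convert simplified CoNLL-style columns (token TAB tag,
# blank line between sentences) into OpenNLP POS training lines
# (one sentence per line, tokens written as word_TAG).

def conll_to_opennlp_pos(conll_text):
    sentences, current = [], []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line:                       # blank line = sentence boundary
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        token, pos = line.split("\t")[:2]
        current.append("{}_{}".format(token, pos))
    if current:                            # flush the last sentence
        sentences.append(" ".join(current))
    return "\n".join(sentences)

sample = "UIMA\tNNP\nrocks\tVBZ\n\nIt\tPRP\nworks\tVBZ\n"
print(conll_to_opennlp_pos(sample))
```

Keeping such conversions in one small, regenerable template is what makes swapping the parser or the source format cheap.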
        <p>To represent the model (PIM), we use the Eclipse Modeling Framework9
(EMF), which represents a de facto Java-based standard for meta-modeling.
Informally, we may say EMF represents a subset of UML (the structural part)
with very precise semantics for code generation. In the future, we could move
this representation to a profiled UML, as mentioned above (see section 3).
Furthermore, EMF offers very powerful generation features. Summarizing, in the
current implementation we use EMF in two ways:</p>
      <p>1. A language to represent the model. 2. A PIM model to generate different
target PSMs.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4.1 NLP Parser</title>
      <p>
The NLP Parser component is implemented using Apache OpenNLP10 and
UIMA11; it is based on a UIMA Type System built from the Syntax and the
Abstract models using the UIMA transformation utility. The training corpora
for the parser have to be provided in a specific format required by OpenNLP.
Since the data that we had available for training were in standard formats such
as PENN12, CONLL13 and others, some transformations were required. Ecore
models have been created for the purpose of representing source formats.
Furthermore, some simple JET14 transformations have been developed in order to
generate our corpora (in specific OpenNLP formats).</p>
      <p>Compared to other solutions, this makes our infrastructure extremely flexible:
should the parser be replaced or the data formats changed, the only change
we will have to make is to modify the JET template accordingly.</p>
      <p>4.2 Type System EMF Wrapper
As anticipated in section 2, in the higher layers of our architecture, we have a Rule
Engine that acts as a reasoner on annotation objects coming from the UIMA
pipeline. We wanted this layer to be able to call operations implemented on those
objects (as explained in section 3), with those objects always implementing the
exact interfaces of the (Ecore) PIM model. Given these requirements, we developed
a transformation that generates a wrapper of the UIMA type system and that
9 http://www.eclipse.org/modeling/emf/
10 http://opennlp.apache.org/
11 http://uima.apache.org/
12 http://www.cis.upenn.edu/~treebank/
13 http://ilk.uvt.nl/conll/#dataformat
14 http://www.eclipse.org/modeling/m2t/?project=jet
fully reflects the starting PIM model, including operations. Once implemented,
the code will also be preserved across future re-generations, thanks to the merging
capabilities of this transformation. Thus, as shown in Figure 1, the Rule Engine
“consumes” instances of this wrapper, and still can access the underlying UIMA
annotation. We considered the possibility of directly adding these operations on
classes generated by UIMA (via JCAS generation utility) but this would not be
consistent with our model-driven approach since those operations would not be
part of a general, system-wide model.
</p>
      <p>4.3 Rule Engine
As far as the Rule Engine is concerned, we chose IBM Operational Decision
Manager (ODM)15. ODM rules have to be written against a specific model,
called Business Object Model (BOM), that allows user-friendly business rule
editing; ODM provides tools to set up a natural language vocabulary: users can
use it to write business rules in a pseudo-natural language. Once defined, the
rules are executed on a BOM-related Java implementation named Execution
Object Model (XOM). We obtained the BOM by reverse engineering the XOM,
and the XOM directly from Java classes (implementing the type system wrapper)
generated from our PIM (EMF) model. Therefore, the BOM model can be seen
as just another manifestation of our PIM model.
</p>
      <p>
        4.4 Knowledge Base
Our architecture is backed by a Knowledge Base Management System which
stores and reasons on information extracted from many sources. Leveraging
the Knowledge Model included in the PIM, we were able to integrate an external
pre-existing system, named ONDA (Ontology Based Data Access) [
        <xref ref-type="bibr" rid="ref16 ref3">3</xref>
        ]. ONDA
supports Ontology Based Data Access (OBDA) on OWL2-QL16, by ensuring
sound and complete conjunctive query answering with the same efficiency and
scalability as a traditional database [
        <xref ref-type="bibr" rid="ref15 ref2">2</xref>
        ]. Because the ONDA underlying Knowledge
Model was already designed with EMF, we simply adopted it in order to be
included in the PIM. This way, reasoning and query answering services have been
included in the PIM model as operations available to all other components (e.g.
the Rule Engine).
      </p>
      <sec id="sec-7-1">
        <title>Conclusion and future work</title>
        <p>We have outlined here an innovative approach to NLP development, based on
the idea of setting UIMA as the target platform in a Model-Driven development
process. A major benefit of this approach consists in giving NLP models a greater
value, especially in terms of generality, usability, and interoperability.
15 http://www-03.ibm.com/software/products/us/en/odm/
16 http://www.w3.org/TR/owl2-profiles/</p>
        <p>
          While developing this idea, we understood that a suitable Model-Driven
machinery for NLP should be supported by specific design patterns for concrete
models. In particular, the model we have developed has been abstracted both
from morphosyntactic specificity and from semantic aspects. The former
(including part-of-speech classes, genders, numbers, verbal tenses, etc) may
significantly vary among different languages; the latter (including concepts like
persons, events, places, etc) are related to specific application domains. By
decoupling these layers, we achieved a lightweight “generic” UIMA type system [
          <xref ref-type="bibr" rid="ref20 ref7">7</xref>
          ], we
designed a powerful generic model for morphosyntactic features, and we managed
ontological information with proper expressive means. Refining and extending
this model is part of our future plans.
        </p>
        <p>We implemented a first prototype of a Knowledge Base query system based
on the Eclipse Modeling Framework (EMF). For the future, we are considering
the possibility of representing the model in UML, in order to have a greater
representational power (such as modeling sequence diagrams).</p>
        <p>The work presented here is still at an early stage. More work is needed to
complete the linguistic model, for instance in the area of argument structures,
such as verbal frames. From an implementation standpoint, our priority is to
consolidate, improve and extend the set of Model-to-Model transformations, and
to further exploit MDD tools.</p>
        <sec id="sec-7-1-1">
          <title>Storing UIMA CASes in a relational database</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Georg Fette12, Martin Toepfer1, and Frank Puppe1</title>
      <p>1 Department of Computer Science VI, University of Wuerzburg,</p>
      <p>Am Hubland, Wuerzburg, Germany
2 Comprehensive Heart Failure Center, University Hospital Wuerzburg,</p>
      <p>Straubmuehlweg 2a, Wuerzburg, Germany
Abstract. In the UIMA text annotation framework the most common
way to store annotated documents (CAS) is by serializing the document
to XML and storing this XML in a file in the file system. We present a
framework to store CASes as well as their type systems in a relational
database. This not only provides a way to improve document
management but also the possibility to access and manipulate selected parts
of the annotated documents using the database’s index structures. The
approach has been implemented for MSSQL and MySQL databases.</p>
      <sec id="sec-8-1">
        <title>Introduction</title>
        <p>
          UIMA [
          <xref ref-type="bibr" rid="ref15 ref2">2</xref>
          ] has become a well known and often used framework for processing text
data. The main component of the UIMA infrastructure is the CAS (Common
Analysis Structure), a data structure which combines the actual data (the text
of a document), annotations on this data and the type system the annotations
are based on. In many UIMA projects CASes are stored as serialized XML-files
in file folders with the corresponding type system file in a separate location.
In this storage mode, the responsibility for keeping track of which CAS has to be
loaded with which type system lies with the programmer who wants to perform an
operation on specific documents. However, manual management of files in folders
on local machines or network folders can quickly become confusing and messy,
especially when projects get bigger. We present a framework to store CASes as
well as their corresponding type systems in a relational database. This storage
mode provides the possibility to access the data in a centralized, organized way.
Furthermore, the approach provides all the benefits that come along with relational
databases including search indices on the data, selective storage, retrieval and
deletion as well as the possibility to perform complex queries on the stored data
in the well-known SQL language.
        </p>
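<p>As a toy illustration of the idea (the schema below is our own minimal sketch, far simpler than the framework described in this paper): once annotations live in an indexed table, selective retrieval by annotation type becomes a plain SQL query instead of deserializing a whole XML file.</p>

```python
# Minimal, illustrative sketch only -- not the paper's actual schema.
# Annotations are rows in an indexed table; selective loading of one
# annotation type is a single SQL query.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT)")
con.execute("CREATE TABLE annotation ("
            "doc_id INTEGER, ann_type TEXT, begin_off INTEGER, end_off INTEGER)")
con.execute("CREATE INDEX idx_ann_type ON annotation(ann_type)")

con.execute("INSERT INTO document VALUES (1, 'UIMA stores CASes')")
con.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    [(1, "Token", 0, 4), (1, "Token", 5, 11), (1, "Token", 12, 17),
     (1, "Sentence", 0, 17)])

# Selective retrieval: load only the Token annotations of document 1.
tokens = con.execute(
    "SELECT begin_off, end_off FROM annotation "
    "WHERE doc_id = 1 AND ann_type = 'Token' ORDER BY begin_off").fetchall()
print(tokens)   # prints: [(0, 4), (5, 11), (12, 17)]
```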
        <p>
          The structure of the paper is as follows: Section 2 describes the related work,
Section 3 describes the technical details of the database storage mechanism,
Section 4 illustrates query possibilities using the database, Section 5 demonstrates
performance experiences with the framework and Section 6 concludes with a
summary of the presented work.
To the best of the authors' knowledge, the only existing approach that stores
CASes in a database is the Julielab DB Mapper [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ], which serializes CASes to a
PostgreSQL database. However, the mechanism does not store the CASes’ type
systems nor does it support features like referencing of annotations by features
or derivation of annotation types. Other approaches use indices to improve query
performance but do not allow to reconstruct the annotated documents from the
index (Lucene based: LUCAS [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ], Fangorn [
          <xref ref-type="bibr" rid="ref16 ref3">3</xref>
          ]; relational database based: XPath
[
          <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ], ANNIS [
          <xref ref-type="bibr" rid="ref20 ref7">7</xref>
          ]; proprietary index based: TGrep/TGrep2 [
          <xref ref-type="bibr" rid="ref19 ref6">6</xref>
          ], SystemT [
          <xref ref-type="bibr" rid="ref18 ref5">5</xref>
          ]). The
indices still require the documents to be stored in the file system. Furthermore, some
of the mentioned indices offer only specialized search capabilities (e.g. an emphasis
on parse trees) provided by the respective search index, and they cannot
search directly on the UIMA data structures. In contrast to these approaches, our
system allows searches on arbitrary type systems by formulating queries closely
related to the involved annotation and feature types.
        </p>
      </sec>
      <sec id="sec-8-2">
        <title>Database storage</title>
        <p>The storage mechanism is based on a relational database for which the table
model is illustrated in Figure 1. The schema can be subdivided into a document-related
part (left), an annotation instance part (middle) and a type-system-related
part (right). Documents are stored as belonging to a named collection and
can be manipulated (retrieved, deleted, etc.) as a group, e.g. deleting all
annotations of a specific type. Annotated documents can be handled individually
by loading/saving a single CAS or by processing a whole collection by creating
a collection reader/writer. Either way, any communication (loading/saving)
can (but need not) be parametrized so that only the desired annotation types are
loaded/saved, thus speeding up processing, reducing memory consumption
and facilitating debugging. A type system, instead of being stored in an
XML file and containing a fixed type system, can be retrieved from the database
in different task specific ways. One way is by requesting the type system which
is needed to load all the annotated documents belonging to a certain collection.
Other possibilities are by providing a set of desired type names or by providing
a regular expression determining all desired type names. The storage mechanism
is able to store the inheritance structures of UIMA type systems as well as
referencing of annotations by features of other annotations. For further information
on the technical aspects we refer to the documentation of the framework3.
</p>
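        <p>As an illustration of this table model, the following sketch builds a miniature version with SQLite; the table and column names are our assumptions for exposition, not the exact uima-sql schema.</p>

```python
# Illustrative sketch (not the exact uima-sql schema): a minimal relational
# layout with a document-related part, an annotation-instance part and a
# type-system part whose supertype column stores the inheritance structure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collection (collection_ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE document   (document_ID INTEGER PRIMARY KEY,
                         collection_ID INTEGER REFERENCES collection,
                         text TEXT);
CREATE TABLE annot_type (annot_type_ID INTEGER PRIMARY KEY, name TEXT,
                         supertype_ID INTEGER REFERENCES annot_type);
CREATE TABLE annot_inst (annot_inst_ID INTEGER PRIMARY KEY,
                         document_ID INTEGER REFERENCES document,
                         annot_type_ID INTEGER REFERENCES annot_type,
                         begin_pos INTEGER, end_pos INTEGER);
""")
conn.execute("INSERT INTO collection VALUES (1, 'demo')")
conn.execute("INSERT INTO document VALUES (1, 1, 'Dogs walk.')")
conn.execute("INSERT INTO annot_type VALUES (1, 'uima.tcas.Annotation', NULL)")
conn.execute("INSERT INTO annot_type VALUES (2, 'Token', 1)")  # Token inherits from Annotation
conn.execute("INSERT INTO annot_inst VALUES (1, 1, 2, 0, 4)")  # covers 'Dogs'

# group-wise manipulation, e.g. deleting all annotations of a specific type
conn.execute("DELETE FROM annot_inst WHERE annot_type_ID = "
             "(SELECT annot_type_ID FROM annot_type WHERE name = 'Token')")
```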
      </sec>
      <sec id="sec-8-3">
        <title>Querying</title>
        <p>A benefit of storing data in an SQL database is the database index and the
well-established SQL query standard. The database can be queried for counts of
occurrences of specific annotation types, counts of covered texts of annotations,
or even complex annotation structures in the documents. We want to exemplify
this with a query on documents which have been annotated with a dependency
parser using the type system shown in Figure 2.
3 http://code.google.com/p/uima-sql/</p>
        <p>Fig. 1. Schema of the relational database storing CASes and type systems.
&lt;typeDescription&gt;
&lt;name&gt;Token&lt;/name&gt;
&lt;supertypeName&gt;
uima.tcas.Annotation
&lt;/supertypeName&gt;
&lt;features&gt;
&lt;featureDescription&gt;
&lt;name&gt;Governor&lt;/name&gt;
&lt;rangeTypeName&gt;Token&lt;/rangeTypeName&gt;
&lt;/featureDescription&gt;&lt;/features&gt;
&lt;/typeDescription&gt;</p>
        <p>SELECT govText.covered FROM
annot_inst govToken, annot_inst_covered govText,
annot_inst baseToken, annot_inst_covered baseText,
feat_inst, feat_type WHERE
baseText.covered = 'walk' AND
baseToken.covered_ID = baseText.covered_ID AND
baseToken.annot_inst_ID = feat_inst.annot_inst_ID AND
feat_inst.feat_type_ID = feat_type.feat_type_ID AND
feat_type.name = 'Governor' AND
feat_inst.value = govToken.annot_inst_ID AND
govText.covered_ID = govToken.covered_ID</p>
        <p>Fig. 3. SQL query for governor tokens</p>
        <p>To query for all words governing the word walk, we have to look for tokens
with the desired covered text, find the tokens governing those tokens and return
their covered text. The SQL command for this task is shown in Figure 3. An
abstraction layer hiding this complexity could be put on top (such as a graph query
language), but even in the presented form, with standard SQL, the capabilities of
the database engine can serve as a useful tool to improve corpus analysis.</p>
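        <p>The Figure 3 query can be exercised end to end on a toy SQLite database; the table split and column names follow the figure, while the populated data is an invented example.</p>

```python
# Toy reproduction of the governor query from Figure 3 using SQLite.
# The schema follows the figure; the data ('must' governing 'walk') is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annot_inst_covered (covered_ID INTEGER PRIMARY KEY, covered TEXT);
CREATE TABLE annot_inst (annot_inst_ID INTEGER PRIMARY KEY, covered_ID INTEGER);
CREATE TABLE feat_type (feat_type_ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feat_inst (annot_inst_ID INTEGER, feat_type_ID INTEGER, value INTEGER);
INSERT INTO annot_inst_covered VALUES (1, 'must'), (2, 'walk');
INSERT INTO annot_inst VALUES (10, 1), (11, 2);
INSERT INTO feat_type VALUES (100, 'Governor');
INSERT INTO feat_inst VALUES (11, 100, 10);  -- Governor of 'walk' is token 10 ('must')
""")

QUERY = """
SELECT govText.covered FROM
annot_inst govToken, annot_inst_covered govText,
annot_inst baseToken, annot_inst_covered baseText,
feat_inst, feat_type WHERE
baseText.covered = 'walk' AND
baseToken.covered_ID = baseText.covered_ID AND
baseToken.annot_inst_ID = feat_inst.annot_inst_ID AND
feat_inst.feat_type_ID = feat_type.feat_type_ID AND
feat_type.name = 'Governor' AND
feat_inst.value = govToken.annot_inst_ID AND
govText.covered_ID = govToken.covered_ID
"""
governors = [row[0] for row in conn.execute(QUERY)]
```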
      </sec>
      <sec id="sec-8-4">
        <title>Performance</title>
        <p>To run a performance test on the storage engine we created a corpus of 1000
documents, each consisting of 1000 words. The words were taken from a
dictionary of 1000 randomly created words, each of 8 characters length. From each
document we created a CAS and added annotations so that each word was
covered, with the annotations covering 1 to 5 successive words. Each annotation
was given two features, one String feature with a value randomly taken from the
word dictionary and a Long feature containing a random number. All documents
were stored and then loaded again. This was done with the database engine as
well as with a local file folder on the same hard drive the database files were
located on. In a second experiment the same documents were loaded again
and we added an annotation of another type with a Long feature containing a
random number to each document. After adding the additional annotation the
documents were stored again. In a third experiment we wanted to query for the
frequencies of annotations covering each of the words from the word dictionary.
For file system storage this was done by accumulating the annotation counts
during an iteration over all serialized CASes, for database storage this was done
by performing a single SQL query for each of the words from the dictionary.</p>
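        <p>For concreteness, the synthetic corpus described above can be generated along these lines; this is our reconstruction of the setup, not the original benchmark code.</p>

```python
# Reconstruction of the synthetic benchmark corpus: 1000 documents of 1000 words
# drawn from a dictionary of 1000 random 8-character words, fully covered by
# annotations spanning 1-5 successive words, each carrying a random String
# feature (a dictionary word) and a random Long feature.
import random, string

random.seed(0)
dictionary = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(1000)]

def make_document(n_words=1000):
    words = random.choices(dictionary, k=n_words)
    annotations, i = [], 0
    while i < n_words:                               # cover every word
        span = min(random.randint(1, 5), n_words - i)
        annotations.append({"begin": i, "end": i + span,
                            "strFeat": random.choice(dictionary),
                            "longFeat": random.getrandbits(32)})
        i += span
    return words, annotations

corpus = [make_document() for _ in range(1000)]
```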
        <p>In Table 1 we can observe that the time needed for database storage is quite
long but reading is as fast as from the file system. Storing to the database during
the second experiment was faster than in the first one, because this time only the
additional annotations had to be incrementally stored. Storage to the file system
again performed about five times faster than to the database but the benefit of
being able to incrementally store only the additional annotations can be clearly
observed. Physical storage space consumption is larger for database storage, but
that should not pose a major problem as hard disk space is not an overly expensive
resource nowadays. Query performance in the database is about 20 times faster
than using file system storage, illustrating the benefit of the database approach.</p>
        <p>[Table 1: save, load and query performance for DB vs. file system storage]</p>
        <p>We have presented a framework to store/retrieve CASes and perform analysis
queries on them using a relational database. We examined the save, load and
query speed compared to regular file-based storage and presented examples of how
to use the database index structures to analyze annotations in the corpus. We
hope to be able to improve the storage speed of the database engine so that the
choice between file system storage and database storage will not be influenced
by the still quite large difference in speed performance.</p>
        <p>This work was supported by grants from the Bundesministerium fuer Bildung
und Forschung (BMBF01 EO1004).</p>
        <p>
          CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration
Elmer Garduno1, Zi Yang2, Avner Maiberg2, Collin McCormack3, Yan Fang4, and Eric Nyberg2
Abstract. To efficiently build data analysis and knowledge discovery pipelines, researchers
and developers tend to leverage available services and existing components by plugging them
into different phases of the pipelines, and then spend hours to days seeking the right
components and configurations that optimize the system performance. In this paper, we introduce
the CSE framework, a distributed system for a parallel experimentation test bed based on
UIMA and uimaFIT, which is general and flexible to configure and powerful enough to sift
through thousands of option combinations to determine which represents the best system
configuration.
To efficiently build data analysis and knowledge discovery “pipelines”, researchers and
developers tend to leverage available services and existing components by plugging them into different
phases of the pipelines [
          <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ], and then spend hours seeking the components and configurations
that optimize the system performance. The Unstructured Information Management Architecture
(UIMA) [
          <xref ref-type="bibr" rid="ref16 ref3">3</xref>
          ] provides a general framework for defining common types in the information system
(type system), designing pipeline phases (CPE descriptor), and further configuring the
components (AE descriptor) without changing the component logic. However, there is no easy way to
configure and execute a large set of combinations without repeated executions, while evaluating
the performance of each component and configuration.
        </p>
        <p>
          To fully leverage existing components, it must be possible to automatically explore the space
of system configurations and determine the optimal combination of tools and parameter settings
for a new task. We refer to this problem as configuration space exploration, which can be formally
defined as a constraint optimization problem. A particular information processing task is defined
by a configuration space, which consists of mt components that define each of the n phases with
corresponding configurations. Given a limited total resource capacity C and input set S,
configuration space exploration (CSE) aims to find the trace (a combination of configured components)
within the space that achieves the highest expected performance without exceeding C total cost.
Details on the mathematical definition and proposed greedy solutions can be found in [
          <xref ref-type="bibr" rid="ref19 ref6">6</xref>
          ].
        </p>
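        <p>As a minimal illustration of the search problem (the exhaustive baseline only; the greedy solutions of [6] are not reproduced here), a trace picks one configured component per phase, and exploration keeps the best-scoring trace within the budget C. The component names, costs and scores below are invented.</p>

```python
# Minimal sketch of configuration space exploration by exhaustive enumeration:
# a trace is one component per phase; we keep the best-performing trace whose
# total cost stays within the capacity C. All numbers here are invented.
from itertools import product

# n phases, each with candidate components as (name, cost, performance)
phases = [
    [("tokenizer-a", 1, 0.70), ("tokenizer-b", 2, 0.75)],
    [("tagger-a", 2, 0.60), ("tagger-b", 4, 0.80)],
    [("ranker-a", 3, 0.65), ("ranker-b", 1, 0.55)],
]
C = 7  # total resource capacity

best_trace, best_score = None, float("-inf")
for trace in product(*phases):              # every combination of components
    cost = sum(c for _, c, _ in trace)
    score = sum(p for _, _, p in trace)     # toy additive performance model
    if cost <= C and score > best_score:
        best_trace, best_score = trace, score
```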
        <p>
          In this paper, we introduce the CSE framework implementation, a distributed system for
a parallel experimentation test bed based on UIMA and uimaFIT [
          <xref ref-type="bibr" rid="ref17 ref4">4</xref>
          ]. In addition, we highlight the results
from two case studies where we applied the CSE framework to the task of building biomedical
question answering systems.
We highlight some features of the implementation in this section. Source code, examples,
documentation, and other resources are publicly available on GitHub5. To benefit developers who are
already familiar with the UIMA framework, we have developed a CSE tutorial in alignment with the
examples in the official UIMA tutorial.
        </p>
        <p>Declarative descriptors. To leverage the CSE framework, users need to specify how the
components should be organized in the pipeline, which values need to be specified for each
component configuration, what the input set is, and which measurement metrics should be applied.
Analogous to a typical UIMA CPE descriptor, components, configurations, and collection
readers in the CSE framework are declared in extended configuration descriptors which are based on
the YAML format. An example of the main pipeline descriptor and a component descriptor are
shown in Figure 1.</p>
        <p>Architecture. Each pipeline can contain an arbitrary number of AnalysisEngines declared by
using the class keyword or by inheriting configuration options from other components by name.
Combinations of components are configured using an options block and parameter combinations
within a component are configured on a cross-opts block. To take full advantage of the CSE
framework capabilities, users inherit from a cse.phase, a CAS multiplier that provides option
multiplexing, intermediate resource persistence, and resource management for long-running
components. The architecture also supports grouping options into sub-pipelines as a convenient way
of reducing the configuration space for combinations whose performance is already known.</p>
        <p>Evaluation. Unlike a traditional scientific workflow management system, CSE emphasizes
the evaluation of component performance, based on user-specified evaluation metrics and
gold-standard outputs at each phase. In addition, the framework keeps track of the performance of all
executed traces; this allows inter-component evaluation and automatic tracking of performance
improvements over time.</p>
        <p>Automatic data persistence. To support further error analysis and reproduction of
experimental results, intermediate data (CASes) and evaluation results are kept in a repository
accessible from any trace at any point during the experiment. To prevent duplicate execution of traces, the
system keeps track of all the executed traces and recovers those CASes whose predecessors have
already been processed.
5 http://oaqa.github.io/</p>
        <p>[Figure: system overview — review database; linguistic annotation (POS, lemmas, NER, chunks, dependencies, etc.); Opinionated Unit detection (target and cue correlation via dependencies); OU polarity assignment; OU indexing; data visualization]</p>
      </sec>
      <sec id="sec-8-5">
        <title>Architecture and Implementation</title>
        <p>This section describes all UIMA modules used in the prototype, as shown
in Figure 3. Some of them are existing open source components, some are
adaptations, and some are our own custom developments. We have been publishing
our work on Github and will continue doing so as far as possible.2
UIMA Collection Tools This prototype is designed to work on a static
document collection, previously loaded into a MySQL database (including the review
text as well as associated metadata). UIMA Collection Tools3 is an ecosystem of
tools that allow UIMA pipelines to store and retrieve data in database
systems, such as MySQL. Plain text documents can be retrieved from a database,
XMI documents can be retrieved from and stored in a database either
compressed or uncompressed, features can be extracted into a database table, and
annotations within database-stored XMI blobs can be visualized the same way
as the standard AnnotationViewer does for XMI files.</p>
        <p>– DBCollectionReader is a UIMA collection reader which retrieves plain text
documents stored in a MySQL database. The database connection parameters
as well as the SQL query have to be specified in the component descriptor. It is
derived from the FileSystemCollectionReader.
– SolrCollectionReader is equivalent to DBCollectionReader, but using a Solr
index as the document source.
– DBXMICollectionReader is a UIMA collection reader that retrieves XMI
documents stored in a MySQL database. DBXMICollectionReader is also
prepared to read compressed XMI documents by means of ZLIB compression.</p>
        <p>This option can be set in the descriptor file.
– DBAnnotationsCASConsumer is a CAS consumer which stores values of the
features specified in the component descriptor file in a MySQL database
table. Each table row corresponds to the annotation defined as the splitting
annotation, e.g. if the Sentence annotation has been defined as the splitting
annotation, each table row will correspond to a Sentence, and this row will
2 See https://github.com/BarcelonaMedia-ViL/
3 The UIMA Collection tools have been developed at Barcelona Media, some of
them based on the example Collection Readers and CAS Consumers provided
with the UIMA distribution. They are published under the Apache License at
https://github.com/BarcelonaMedia-ViL/uima-collection-tools.
contain features of the Sentence annotation and/or features of annotations
covered by the Sentence annotation.
– DBXMICASConsumer is a CAS consumer that persists XMI documents in
a database. DBXMICASConsumer is also prepared to store compressed XMI
documents by means of ZLIB compression.
– DBAnnotationViewer is a modification of the Annotation Viewer, and
allows reading XMI files directly from a MySQL database without needing to
extract them first.</p>
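        <p>The reader pattern shared by these components (a connection plus an SQL query configured in the descriptor, documents streamed row by row) can be approximated as follows; SQLite stands in for MySQL and the table layout is an assumption.</p>

```python
# Rough, non-UIMA sketch of the DBCollectionReader idea: a reader configured
# with a connection and an SQL query that yields one document per result row.
# SQLite stands in for MySQL; the table layout is an invented example.
import sqlite3

def db_collection_reader(conn, query):
    """Yield (doc_id, text) pairs, one per row of the configured query."""
    for doc_id, text in conn.execute(query):
        yield doc_id, text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [(1, "Great hotel."), (2, "Noisy room.")])

docs = list(db_collection_reader(conn, "SELECT id, body FROM reviews ORDER BY id"))
```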
        <p>OpenNLP We use OpenNLP4 with the standard UIMA wrappers for our base
pipeline, including Sentence Detector, Tokenizer, and POS Tagger, using our
own trained models for Spanish.</p>
        <p>Lemmatizer We apply Lemmatization using a large dictionary developed
inhouse. All candidate lemmas are first added to the CAS using ConceptMapper5
but a second custom component selects the right one using the POS tag.
JNET For ML-based detection of Targets and Cues we use JNET6 (the Julielab
Named Entity Tagger), which is based on Conditional Random Fields (CRF).
It detects token sequences that belong to certain classes, taking into account a
variety of features associated with each token (such as the surface form, lemma,
POS tag, surface features such as capitalization, etc.) as well as its context of
preceding and successive tokens. While originally intended for Named Entity
Recognition, we trained JNET with our own manually annotated corpus.</p>
        <p>Compared to the original JNET as released by JulieLab, we introduced a series
of changes, most importantly making it type-system independent by taking all
input and output types and features as parameters, and fixing some bugs that
were triggered when using a larger amount of token features. We expect to release
our changes soon, but are still looking into the question of licensing, to comply
with JNET’s original license.</p>
        <p>DeSR We developed a UIMA wrapper for the DeSR dependency
parser7. The parser creates dependency annotations based on previously
generated sentence, token and POStag annotations. It is available at
https://github.com/BarcelonaMedia-ViL/desr-uima. The UIMA DeSR analysis
engine is a UIMA C++ annotator, developed using the C++ SDK provided by
UIMA. It translates between the format required by the DeSR parser shared
library and the UIMA CAS format. The mapping between UIMA types and
features and the features used internally by DeSR is configurable in the annotator
descriptor.
4 http://opennlp.apache.org/
5 http://uima.apache.org/sandbox.html#concept.mapper.annotator
6 http://www.julielab.de/Resources/Software/NLP Tools.html
7 https://sites.google.com/site/desrparser/
DependencyTreeWalker This is a Pythonnator-based analysis engine for
wrapping the DependencyGraph Python module (both developed in-house). This
allows us to work easily with the dependency graph generated by DeSR in order
to e.g. determine and validate the path between two given UIMA annotations.
Weka Wrapper We used the Mayo Weka/UIMA Integration (MAWUI8), as
a basis for the machine learning tools. The version we use is adapted to newer
versions of UIMA and made much more configurable. MAWUI generates a single
vector for each document, that is used to classify it as a whole. In our case, a
document can contain several Opinionated Units that need to be classified. For
this reason the Weka Wrapper was adapted to be able to deal with all the
annotations of a given type inside a document (or collection when generating
the training data).
</p>
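        <p>The essential change to the Weka Wrapper — one instance per annotation of a given type rather than one vector per document — can be sketched as below; the annotation layout and the two toy features are hypothetical.</p>

```python
# Sketch of the per-annotation adaptation: instead of one feature vector per
# document (as in MAWUI), emit one vector per annotation of a chosen type.
# The annotation dicts and the two toy features are hypothetical.

def vectors_for_annotations(doc_text, annotations, ann_type="OpinionatedUnit"):
    """Build one feature vector per annotation of ann_type in the document."""
    vectors = []
    for ann in annotations:
        if ann["type"] != ann_type:
            continue
        span = doc_text[ann["begin"]:ann["end"]]
        vectors.append({"length": len(span.split()),        # toy feature 1
                        "has_not": "not" in span.lower()})  # toy feature 2
    return vectors

doc = "The room was not clean. The staff was friendly."
anns = [{"type": "OpinionatedUnit", "begin": 0, "end": 23},
        {"type": "OpinionatedUnit", "begin": 24, "end": 47},
        {"type": "Sentence", "begin": 0, "end": 23}]
vecs = vectors_for_annotations(doc, anns)
```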
      </sec>
      <sec id="sec-8-6">
        <title>Visualization</title>
        <p>Beyond being able to extract and classify the opinions, users need an interface
that allows them to access and explore the data. They need to know which
Targets or which of their features are addressed by the opinions and what
is being said about them, and this has to be shown in an aggregated way, with
drill-down capabilities, so that the end user has a clear view of the contents of
hundreds or thousands of opinions.</p>
        <p>UIMA does not provide tools to deal with collections of documents, so
we use Solr, a Lucene-based indexing tool, to index the Opinionated Units.
Through the use of Solr’s faceting and pivot utilities we are able to graphically
summarize thousands of opinions. Special charts have been constructed in order
not only to represent the data but also to select subsets of opinions and to
summarize and compare them. For example, we can compare the global users’
opinions with the opinions about a single hotel or the hotels in a specific area.</p>
        <p>To index the data we needed the linguistic information, but also the metadata
associated with the opinion, which is located in databases and is not processed
with UIMA. For this reason we import the data into Solr in two steps. In the first
step we generate from UIMA a table with the data, which we then import into Solr
together with the metadata.
To index the Opinionated Units we use the DBAnnotationsCASConsumer
component. We generate a register for each OU, containing: the Target, the Cue,
the text span, the polar words, their polarity, the polarity of the cue, and the
polarity of the Opinionated Unit. Cues and targets are grouped in single tokens
by means of underscoring.
8 http://informatics.mayo.edu/text/index.php?page=weka</p>
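        <p>The underscore grouping used when building the OU register amounts to a simple join of the multi-word expression; the field names here are illustrative, not the exact table layout.</p>

```python
# Sketch of an OU register entry: multi-word Targets and Cues are grouped into
# single tokens by underscoring before indexing. Field names are illustrative.

def underscore(words):
    """Join a multi-word expression into one indexable token."""
    return "_".join(words)

def ou_register(target, cue, span, cue_polarity):
    return {"target": underscore(target),
            "cue": underscore(cue),
            "text_span": span,
            "cue_polarity": cue_polarity}

reg = ou_register(["front", "desk"], ["very", "helpful"],
                  "the front desk was very helpful", "positive")
```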
        <p>We use the DataImportHandler from Solr in order to import the data
from the database. To do so, a query combines the opinionated unit information
with the information related to the hotel or the user who wrote the opinion. Cues
are indexed twice, once all merged and then in different fields depending on
the opinion’s polarity, making it easy to retrieve just the positive or negative
opinion markers. We selected this option because it is a bit faster, more flexible
and reliable than the other ones: when indexing directly from UIMA we have
problems in adding all the desired metadata, and if we call UIMA from Solr
(or Lucene) then it is difficult to have a general framework that splits a single
document into several Opinionated Units.</p>
        <p>AJAX-Solr9 is a JavaScript library for creating user interfaces to Apache
Solr. This library works with facets. Faceting is a capability of Solr that provides
fast statistics of the most frequent terms in each field after performing
a query. Since version 4.0 Solr also has pivots that combine the facets from two
or more different fields. We adapted AJAX-Solr to work with pivots and wrote
a series of widgets to visualize them. Our own extensions to AJAX-Solr are also
published on github10.</p>
        <p>By clicking the different facets that appear on the widgets, the user
can build a query that restricts the set of opinions to summarize. These opinions
are then summarized by showing the most frequent terms they contain, or the
most differentiating ones (i.e. those terms that are frequent in the current subset
but that are less frequent in the general one). Figure 4 shows the pivot result in
text and force diagram formats. It shows the relationship between Targets, and
positive and negative Cues. In the textual representation, the relationships are
not shown directly but scaled to magnify the most discriminative ones.
The combination of UIMA and Solr has allowed us to develop a very flexible
platform that makes it easy to integrate and combine processing modules from a
variety of sources and in a variety of programming languages, as well as navigate
and visualize the results easily and efficiently.</p>
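        <p>One plausible reading of the “most differentiating terms” computation — terms frequent in the selected subset but comparatively rare overall — is a relative-frequency ratio; this scoring is our assumption for illustration, not the exact formula used by the widgets.</p>

```python
# One plausible scoring of "most differentiating terms": rank terms by how much
# more frequent they are in the selected subset than in the whole collection.
# This ratio is an assumption for illustration, not the system's exact formula.
from collections import Counter

def differentiating_terms(subset_docs, all_docs, top=3):
    sub = Counter(w for d in subset_docs for w in d.split())
    gen = Counter(w for d in all_docs for w in d.split())
    n_sub, n_gen = sum(sub.values()), sum(gen.values())
    # relative frequency in the subset divided by relative frequency overall
    score = {w: (sub[w] / n_sub) / (gen[w] / n_gen) for w in sub}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top]]

all_docs = ["clean room good breakfast", "noisy street good location",
            "good pool clean towels", "noisy bar late music"]
subset = ["noisy street good location", "noisy bar late music"]
top3 = differentiating_terms(subset, all_docs)
```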
        <p>In our evaluations with 700 OUs manually annotated by 3 independent
reviewers, agreement on the correctness of the OUs identified by the
system was 88.5%, while the assigned polarity was found to be correct in an average
of 70% of cases.</p>
        <p>We found many useful UIMA components to be available as open source, and
encountered few compatibility issues (other than adapting some components to
be type system independent). Solr provides us with a very flexible platform to
access large document collections, and in combination with UIMA allows us to
explore even complex hidden relationships within those collections.</p>
        <p>One of our main objectives was to make all modules configurable and
reusable, inasmuch as Sentiment Analysis in general requires tweaking to adapt
to domain and genre, but this generalization often requires considerable effort.
We found the different open source communities to be very receptive, and we
try to participate by publishing our own contributions under permissive licenses
that make them easy for others to adopt and use.
</p>
      </sec>
      <sec id="sec-8-7">
        <title>Thanks</title>
        <p>This work has been partially funded by the Spanish Government project
Holopedia, TIN2010-21128- C02-02, and the CENIT program project Social Media,
CEN-20101037.</p>
        <sec id="sec-8-7-1">
          <title>Extracting hierarchical data points and tables from scanned contracts</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Jan Stadermann, Stephan Symons, and Ingo Thon</title>
      <p>Recommind Inc., 650 California Street, San Francisco, CA 94108, United States
{jan.stadermann,stephan.symons,ingo.thon}@recommind.com</p>
      <p>http://www.recommind.com
Abstract. We present a technique for developing systems to
automatically extract information from scanned semi-structured contracts. Such
contracts are based on a template, but have different layouts and
clientspecific changes. While the presented technique is applicable to all kinds
of such contracts we specifically focus on so called ISDA credit support
annexes. The data model for such documents consists of 150 individual
entities some of which are tables that could span multiple pages. The
information extraction is based on the Apache UIMA framework. It
consists of a collection of small and simple Analysis Components that extract
increasingly complex information based on earlier extractions. This
technique is applied to extract individual data points and tables. Experiments
show an overall precision of 97% with a recall of 93% regarding
individual/simple data points and 89%/81% for table cells measured against
manually entered ground truth. Due to its modular nature our system
can be easily extended and adapted to other collections of contracts as
long as some data model can be formulated.</p>
      <p>Keywords: OCR robust information extraction, hierarchical taggers,
table extraction
Despite the existence of electronic document handling and content management
systems, there is still a large amount of paper-based contracts. Even when scanned
and OCRed, the interesting data contained in a document is not
machine-readable as there are no semantics attached to the text. Especially in the banking
domain it is necessary to have the underlying information available, e.g., for risk
assessment. Until now, the information has had to be extracted by human reviewers.
The goal of the system presented here is to automatically obtain the relevant
information from OTC (over-the-counter) contracts which are based on a
template provided by the ISDA1. The data is given in the form of image-embedded
pdf documents. Each contract contains around 150 data points organized in a
complex hierarchical data model. A data point can be either a (possibly multi-valued)
simple field or a table. The main challenges of such a system are:
1 International Swaps and Derivatives Association, www.isda.org</p>
    </sec>
    <sec id="sec-10">
      <title>Challenges</title>
      <p>1. The complex legal language used in the contracts.
2. Despite existing contract templates, the wording varies across customers.
3. The layout varies. Especially tables can be represented in various forms.
4. The scanning quality of the contracts is often poor, especially in old
contracts or documents sent by fax. Still, the remaining information needs to be
extracted correctly.</p>
      <p>Figure 1 shows examples of two simple data points (a), and a table (b).</p>
      <p>
        In general, on the one hand, there are a lot of sophisticated entity extraction
systems that try to find flat entities only (“Named entity extraction”) [
        <xref ref-type="bibr" rid="ref22 ref9">9</xref>
        ]. These
systems sometimes use hierarchical information, like Tokens, Part-of-Speech
tags, and Sentences, but only on a linguistic level without collecting and combining
this information. These approaches work well on well-defined and general
entities such as persons or locations. However, they are difficult to adapt to a new
domain since a new classifier needs to be created which requires huge amounts
of labeled training data which is expensive to produce.
      </p>
      <p>
        On the other hand, there are systems that use a deep hierarchical structure, e.g.
represented using Ontologies, but still do the classification in one single, flat step
[
        <xref ref-type="bibr" rid="ref1 ref14">1</xref>
          ]. This approach is not as flexible and extensible as the presented
one since, in general, it requires re-training or re-building of the classifier if
layers within the hierarchy are changed. An early solution for dealing with scanned
forms was presented by Taylor et al., who used a model-based approach for data
extraction from tax forms [
        <xref ref-type="bibr" rid="ref12 ref25">12</xref>
        ]. Semi structured texts have been analyzed using
rule based approaches [
        <xref ref-type="bibr" rid="ref10 ref23">10</xref>
        ] or discriminative context free grammars [
        <xref ref-type="bibr" rid="ref13 ref26">13</xref>
        ]. Closest
to our solution is a system described by Surdeanu et al. [
        <xref ref-type="bibr" rid="ref11 ref24">11</xref>
        ]. They employ two
layers of extraction using Conditional Random Fields [
        <xref ref-type="bibr" rid="ref18 ref5">5</xref>
        ], and deal with OCR
data. For table extraction, heuristic methods [
        <xref ref-type="bibr" rid="ref21 ref8">8</xref>
        ] have been proposed as well as
Conditional Random Fields [
        <xref ref-type="bibr" rid="ref20 ref7">7</xref>
        ].
      </p>
      <p>
        In contrast, our system uses a theoretically unlimited number of layers with
separate classifiers for each piece of information, including tables, on each level.
Instead of processing the whole text at once, our classifiers just collect the
information they require, and decide only on that data. Therefore, they allow for
better performance and extensibility, as additional data does not affect the
existing classifiers. Our work follows strategies commonly used in spoken dialogue
systems [
        <xref ref-type="bibr" rid="ref17 ref4">4</xref>
        ] and uses a set of small classifiers which is inspired by the boosting
idea [
        <xref ref-type="bibr" rid="ref19 ref6">6</xref>
        ]. In addition, we use automatically extracted segmentation information
and cross-checks between our classifiers to increase the precision of the extracted
data. From a UI standpoint there is a similar application called GATE [
        <xref ref-type="bibr" rid="ref15 ref2">2</xref>
        ] which
extracts entities based on given rule-sets. This application provides a
hierarchical organization of entities and the architecture seems to be very similar to the
UIMA framework. However, GATE has no special provisions to deal with noise
from the OCR step, and it only allows simple extraction rules to be specified.
Furthermore, there is no direct way for the entity extraction to work hierarchically;
only the result can be organized in a hierarchical way.
      </p>
      <sec id="sec-10-1">
        <title>Information extraction</title>
        <p>An overview of or system’s architecture is shown in figure 2. Prior to information
extraction, the OmniPage2 OCR engine is used to convert the image to readable
text. However, many character level errors, and layout distortions remain which
need to be dealt with in the following processing steps. The overall strategy
is based on the idea that small pieces of relevant text can be extracted quite
accurately even in the presence of OCR errors. On top of these pieces we build
several layers of higher-level extractors – here called “experts” – that combine
these small pieces to decide on a final data point. The extraction of tables works
in a similar fashion by first trying to extract small pieces that form table cells.
Then stretches of cells are collected, trying to deduce a layout from order and
type of the pieces. Finally, an optimal result table is selected (see section 2.2).</p>
        <p>
          Our solution is based on the UIMA framework [
          <xref ref-type="bibr" rid="ref16 ref3">3</xref>
          ]. Each type of expert is
implemented as a configurable annotation engine. The overall extraction system
consists of a large hierarchy of analysis engines, encompassing several hundred
elements. The type system, in contrast, only consists of three principal types, i.e.
for simple fields, tables and table rows. Annotation types, extracted values, etc.
are stored as features. Both final and intermediate annotations are represented
by these types.
      </p>
        <p>Fig. 2. Extraction architecture: document images are converted by OCR into recognized
text (XML); the information extraction stage applies regular-expression extractors and
normalization steps, producing XML with meta data for the document index.</p>
        <p>Extraction of simple-valued fields
We use the term “simple-valued fields” for data points, where one key has one or
more values. They differ from named entities as they may include multi-valued
data. Figure 1(a) shows an example of the key eligible currency with the
(normalized) values “USD” and “Base currency”. Fields are extracted layer-wise.
On the lowest layer, all instances of the identifying term “Eligible currency”, are
captured, as well as the different currency expressions, including the special
term “Base currency”, which refers to another simple field. On this level we
typically use annotators based on dictionaries and regular expressions, where
variations due to OCR errors are reflected in the dictionary variants and the
regular expressions, respectively. All such annotators are implemented as analysis engines.
On the next level, so-called “expert-extractors” combine the existing annotations
to a new one. An expert is a rule, defined as a set of slots for annotations of
specific types, and a definition of which slots form a new annotation if the rule
is satisfied, i.e. if all slots are filled. To allow for fine tuning the experts, slots
can be configured, e.g. by indicating certain slots as optional. Furthermore, it is
possible to specify the order of annotations in slots appearing in the document. It
is also possible to specify a maximum distance. If the distance between two found
annotations exceeds the defined threshold for this expert, the expert assumes it
is in the wrong area of the document and clears its internal state to start all over
again. Finally, slots can be write-protected, accepting only the first occurrence
of the configured annotation.</p>
        <p>Fig. 3. Extraction of a simple field. First-level components have tagged the “Eligible
Currency” phrase and the different variants of currencies. Expert 1 collects two or more
currencies (the third slot is optional). The resulting annotation is used by Expert 2
to build the final annotation. All elements are represented in UIMA as simple field
types.</p>
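        <p>To make the mechanism concrete, here is a minimal, illustrative Python sketch (not the authors' code; the class name Expert, the slot representation, and max_distance are hypothetical stand-ins for the configurable analysis engines described above):</p>

```python
# Illustrative sketch of a slot-filling "expert" (hypothetical names; the real
# system implements experts as configurable UIMA analysis engines).
from dataclasses import dataclass

@dataclass
class Annotation:
    type: str
    begin: int
    end: int

class Expert:
    def __init__(self, slot_types, optional=(), max_distance=20):
        self.slot_types = slot_types      # required annotation type per slot
        self.optional = set(optional)     # indices of optional slots
        self.max_distance = max_distance  # gap threshold before a state reset

    def apply(self, annotations, result_type):
        slots = [None] * len(self.slot_types)
        last_end = None
        results = []
        for ann in sorted(annotations, key=lambda a: a.begin):
            # If the gap to the previous match is too large, the expert assumes
            # it is in the wrong area of the document and clears its state.
            if last_end is not None and ann.begin - last_end > self.max_distance:
                slots = [None] * len(self.slot_types)
            for i, t in enumerate(self.slot_types):
                if slots[i] is None and ann.type == t:
                    slots[i] = ann
                    last_end = ann.end
                    break
            # In this simplified sketch the rule fires as soon as every
            # non-optional slot is filled; optional slots may stay empty.
            if all(s is not None or i in self.optional
                   for i, s in enumerate(slots)):
                filled = [s for s in slots if s is not None]
                results.append(Annotation(result_type,
                                          min(s.begin for s in filled),
                                          max(s.end for s in filled)))
                slots = [None] * len(self.slot_types)
        return results
```

        <p>Applied to the situation of figure 3, an expert with slots for the “Eligible Currency” term and the collected currencies would emit one combined annotation spanning all filled slots.</p>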
        <p>To extract eligible currency, two experts are employed (see figure 3). The
first expert collects adjacent currency annotations. The second one combines the
“Eligible Currency” term, and the collected currencies found by expert one, if
both annotations are found within a short distance. The resulting annotation
will span the relevant currency terms. This modular design allows us to reduce
the number of extractors and to re-use existing annotations for completely
different data points. In general, the pieces of information found in the examined contracts
are not independent of each other. We use business rules and other constraints
to validate and normalize the extracted results; e.g., the set of currencies is
well-defined. If the validation fails, or if the normalization repairs some value due to
business rules, a corresponding message can be attached to the annotation to
inform the reviewer.
</p>
        <p>2.2 Extraction of tables
We define a table as multi-dimensional, structured data present in a document
either in a classical tabular layout, or defined in a series of sentences or
paragraphs in free text form (like in figure 4). We aim at extracting tables of both
structure types and intermediate formats (e.g. as in figure 1(b)) only from the
document’s OCR output at character level. In our application, table extraction
extends the simple-valued field extraction: the basic input for a table expert
is a document annotated with simple value fields and intermediate annotations.
The experts attempt to match sequences of simple annotations to a set of table
models. A table model is user-defined and describes which columns the resulting
extracted table should have. Each column can contain multiple types of simple
fields. Furthermore, columns can be configured to be optional and to accept
only unique or non-overlapping annotations. This allows for both more general
models with variable columns and fine-tuning the accepted annotations.</p>
        <p>The process of detecting tables by the table expert (see figure 4 for an
example) begins with collecting all accepted annotations for a model, within a
predefined range or until a table stop annotation is found, into a list sorted by
order of appearance. For each such list, several filling strategies are employed.
A filling strategy addresses the problem that multiple columns may accept the
same types of annotations. If elements appear row-wise, or column-wise, the
corresponding strategies will recover the correct table, also compensating for
some errors from omitted table elements. In mixed cases, adding a new table
cell to the shortest relevant column is used as a fall back strategy. Each strategy
is evaluated, using the fraction of cells filled in the resulting table c and the
filling strategy specific score s. The latter score measures how well the
annotations match the expectations of the filling strategy. The table which maximizes
sf = c · s is annotated as a candidate, if sf is above a predefined threshold. The
table expert is implemented as an analysis engine. Configuration encompasses the
columns describing the table model, distance and scoring threshold, and the set
of filling strategies to be evaluated. The output is a table type annotation, which
in turn contains several table rows, each containing simple fields as cells.</p>
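        <p>The strategy evaluation can be sketched as follows (a simplified illustration under assumed names; fill_row_wise and the per-strategy fit scores stand in for the filling strategies described above, with sf = c · s as the selection criterion):</p>

```python
# Simplified sketch of the table expert's strategy evaluation (assumed names).
def fill_row_wise(cells, n_cols):
    """Arrange the collected cell annotations into rows of n_cols,
    padding an incomplete last row with empty cells (None)."""
    rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
    if rows and len(rows[-1]) != n_cols:
        rows[-1] = rows[-1] + [None] * (n_cols - len(rows[-1]))
    return rows

def fraction_filled(table):
    """c: the fraction of non-empty cells in the candidate table."""
    cells = [c for row in table for c in row]
    return sum(c is not None for c in cells) / len(cells) if cells else 0.0

def select_candidate(cells, n_cols, strategies, threshold=0.5):
    """Evaluate each (strategy, fit score s) pair and keep the table
    maximizing sf = c * s, if sf exceeds the predefined threshold."""
    best, best_sf = None, threshold
    for strategy, fit_score in strategies:
        table = strategy(cells, n_cols)
        sf = fraction_filled(table) * fit_score
        if sf > best_sf:
            best, best_sf = table, sf
    return best
```

        <p>A column-wise strategy or the shortest-column fall-back would plug in as further (strategy, score) pairs.</p>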
        <p>Multiple table experts may be used to generate candidate tables for a single
target, and candidates may occur in several locations in a document. Usually,
the correct location gives rise to tables with certain properties, e.g. short, dense
tables. This is used by a feature-based selection of the optimal table candidate.
We model this using both general purpose features (e.g. size, and number of
empty cells) as well as domain specific features. The table with the highest
weighted sum of score features is selected as the final output. The weights can
either be user-defined or fitted using a formal optimization model.</p>
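        <p>The final selection step amounts to a weighted feature sum over candidate tables; a minimal sketch (feature functions and weights are placeholders, not the system's actual features):</p>

```python
# Minimal sketch of feature-based candidate selection (placeholder names).
def select_table(candidates, features, weights):
    """Score each candidate table by a weighted sum of feature values and
    return the best one; weights may be user-defined or fitted."""
    def score(table):
        return sum(w * f(table) for f, w in zip(features, weights))
    return max(candidates, key=score) if candidates else None
```

        <p>General-purpose features such as the fraction of non-empty cells, or domain-specific ones, plug in directly as feature functions.</p>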
      </sec>
      <sec id="sec-10-2">
        <title>Experiments</title>
        <p>We composed a document set containing 449 documents3 to measure the
extraction quality of our system. These documents are from various customers and
represent as many variants of different wordings and layouts as possible.</p>
        <p>With our customers we agreed upon certain quality gates that the automatic
extraction system has to meet. Due to the nature of the contracts, it is much more
important to achieve high precision of the extracted data than high recall. For
simple fields the gate’s threshold is 95% precision and 80% recall. Table cells
are more difficult to extract since the OCR component not only mis-recognizes
individual characters but makes errors on the structure of a table. For table cells,
our goal is to have a high recall since errors within a structured table are easier
to detect and correct than simple field errors by a human reviewer. Table 1 shows
our results against a manually created ground truth. The numbers represent the
total number of data points and errors, respectively, over all of our documents.
3 see tinyurl.com/csa-example for a public sample document.</p>
        <p>Table 1. Extraction results against the manually created ground truth:
                 Insertions  Deletions  Substitutions  Correct  Precision  Recall
Simple fields       375        1267          330        20519      97%       93%
Table cells        1492        3563          906        18838      89%       81%</p>
        <p>
In total, we meet our gate criterion for simple fields. Precision can be as low as
33% for rare fields, where fitting appropriate data experts is hard. In contrast,
for frequent fields, precision may exceed 99%. In principle, the same is true for
recall, with both maximum and minimum lower, due to our target criteria. For
table cells, the precision needs improvement mainly due to the OCR’s structural
errors like swapping rows within a table or switching between row-wise and
column-wise recognition in one table. This is especially true for tables which are
complex with respect to both layout and contents, like the collateral eligibility
table in figure 1(b). Here, precision and recall are 84.4% and 80.2%, respectively.
In contrast, structurally simple tables, like the interest rate table (see figure 4
for an example) can be extracted with much higher confidence (97.4% precision
and 90.8% recall).
</p>
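        <p>Assuming the standard edit-style convention (a reading consistent with the precision and recall figures reported in the conclusion), precision and recall follow from the counts in Table 1 as:</p>

```python
# Sketch: precision/recall from edit-style error counts. The formulas are an
# assumed (standard) convention, consistent with the figures reported here.
def precision_recall(correct, insertions, deletions, substitutions):
    precision = correct / (correct + insertions + substitutions)
    recall = correct / (correct + deletions + substitutions)
    return precision, recall

# Simple fields: 375 insertions, 1267 deletions, 330 substitutions, 20519 correct
p, r = precision_recall(20519, 375, 1267, 330)  # p ≈ 0.967, r ≈ 0.928
```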
      </sec>
      <sec id="sec-10-3">
        <title>Conclusion and outlook</title>
        <p>This article presents a system to automatically extract simple data points and
tables from OTC contract images. The system consists of an OCR component
and a hierarchical set-up of small modular extractors either capturing (noisy)
text or combining already annotated clues using a slot-filling strategy. Our
experiments are conducted on an in-house contract collection, resulting in a precision
of 97% (recall 93%) on simple fields and a precision of 89% (recall 81%) on table
cells. While the evaluation we conducted is limited, we expect overfitting to be
moderate. The legal nature of the contracts limits the layout and wording
options. Our next steps include the introduction of a confidence score on data-point
level and the use of statistical classification methods for selecting the best-suited
table model.</p>
        <p>Acknowledgement. We would like to thank our partner Rule Financial for
providing the data model and for their assistance in understanding the documents.</p>
        <sec id="sec-10-3-1">
          <title>Constraint-driven Evaluation in UIMA Ruta</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Andreas Wittek1, Martin Toepfer1, Georg Fette12, Peter Kluegl12, and Frank Puppe1</title>
      <p>1 Department of Computer Science VI, University of Wuerzburg,</p>
      <p>Am Hubland, Wuerzburg, Germany
2 Comprehensive Heart Failure Center, University of Wuerzburg,</p>
      <p>
        Straubmuehlweg 2a, Wuerzburg, Germany
{a.wittek,toepfer,fette,pkluegl,puppe}@informatik.uni-wuerzburg.de
Abstract. This paper presents an extension of the UIMA Ruta
Workbench for estimating the quality of arbitrary information extraction
models on unseen documents. The user can specify expectations on the
domain in the form of constraints, which are applied in order to predict the
F1 score or the ranking. The applicability of the tool is illustrated in a
case study for the segmentation of references, which also examines the
robustness for different models and documents.
Apache UIMA [
        <xref ref-type="bibr" rid="ref18 ref5">5</xref>
        ] and the surrounding ecosystem provide a powerful framework
for engineering state-of-the-art Information Extraction (IE) systems, e.g., in the
medical domain [
        <xref ref-type="bibr" rid="ref13 ref26">13</xref>
        ]. Two main approaches for building IE models can be
distinguished. One approach is based on manually defining a set of rules, e.g., with
UIMA Ruta3 (Rule-based Text Annotation) [
        <xref ref-type="bibr" rid="ref20 ref7">7</xref>
        ]4, that is able to identify the
interesting information or annotations of specific types. A knowledge engineer
writes, extends, refines and tests the rules on a set of representative documents.
The other approach relies on machine learning algorithms, such as
probabilistic graphical models like Conditional Random Fields (CRF) [
        <xref ref-type="bibr" rid="ref10 ref23">10</xref>
        ]. Here, a set
of annotated gold documents is used as a training set in order to estimate the
parameters of the model. The resulting IE system of both approaches, the
statistical model and the set of rules, is evaluated on an additional set of annotated
documents in order to estimate its accuracy or F1 score, which is then assumed
to hold for the application in general. However, even if the system performed well
in the evaluation setting, its accuracy may decrease when applied to unseen
documents, perhaps because the set of documents used for developing the IE system
was not large or representative enough. In order to estimate the actual
performance, either more data is labeled or the results are manually checked by a
human, who is able to validate the correctness of the annotations.
      </p>
      <p>
        Annotated documents are essential for developing IE systems, but there is
a natural lack of labeled data in most application domains, and its creation is
error-prone, cumbersome and time-consuming, as is the manual validation.
3 http://uima.apache.org/ruta.html
4 previously published as TextMarker
An
automatic estimation of the IE system’s quality on unseen documents would
therefore provide many advantages. A human is able to validate the created
annotations using background knowledge and expectations on the domain. This
kind of knowledge is already used by current research in order to improve the
IE models (c.f. [
        <xref ref-type="bibr" rid="ref1 ref11 ref14 ref19 ref24 ref6">1, 6, 11</xref>
        ]), but barely to estimate IE system’s quality.
      </p>
      <p>This paper introduces an extension of the UIMA Ruta Workbench for exactly
this use case: Estimating the quality and performance of arbitrary IE models
on unseen documents. The user can specify expectations on the domain in the
form of constraints, hence the name Constraint-driven Evaluation (CDE). The
constraints rate specific aspects of the labeled documents and are aggregated
to a single cde score, which provides a simple approximation of the
evaluation measure, e.g., the token-based F1 score. The framework currently supports
two different kinds of constraints: Simple UIMA Ruta rules, which express
specific expectations concerning the relationship of annotations, and
annotation-distribution constraints, which rate the coverage of features. We distinguish two
tasks: predicting the actual F1 score of a document and estimating the ranking
of the documents specified by the actual F1 score. The former task can give
answers on how well the model performs. The latter task points to documents
where the IE model can be improved. We evaluate the proposed tool in a case
study for the segmentation of scientific references, which tries to estimate the
F1 score of a rule-based system. The expectations are additionally applied on
documents of a different distribution and on documents labeled by a different
IE model. The results emphasize the advantages and usability of the approach,
which already works with minimal effort due to a simple fact: it is much easier
to estimate how well a document is annotated than to actually identify the
positions of defective or missing annotations.</p>
      <p>
        The rest of the paper is structured as follows. In the upcoming section, we
describe how our work relates to other fields of Information Extraction research.
We explain the proposed CDE approach in Section 3. Section 4 covers the case
study and the corresponding results. We conclude with pointers to future work
in Section 5.
Besides standard classification methods, which fit all model parameters against
the labeled data of the supervised setting, there have been several efforts to
incorporate background knowledge from either user expectations or external
data analysis. Bellare et al. [
        <xref ref-type="bibr" rid="ref1 ref14">1</xref>
        ], Grac¸a et al. [
        <xref ref-type="bibr" rid="ref19 ref6">6</xref>
        ] and Mann and McCallum [
        <xref ref-type="bibr" rid="ref11 ref24">11</xref>
        ], for
example, showed how moments of auxiliary expectation functions on unlabeled
data can be used for such a purpose with special objective functions and an
alternating optimization procedure. Our work on constraint-driven evaluation is
partly inspired by this idea; however, we address a different problem: we suggest
using auxiliary expectations to estimate the quality of classifiers on unseen data.
      </p>
      <p>
        A classifier’s confidence describes the degree to which it believes that its
own decisions are correct. Several classifiers provide intrinsic measures of
confidence, for example, naive Bayes classifiers. Culotta and McCallum [
        <xref ref-type="bibr" rid="ref17 ref4">4</xref>
        ], for
instance, studied confidence estimation for information extraction. They focus on
predictions about field and record correctness of single instances. Their main
motivation is to filter high precision results for database population. Similar to
CDE, they use background knowledge features like record length, single field
label assignments and field confidence values to estimate record confidence. CDE
generalizes common confidence estimation because the goal of CDE is the
estimation of the quality of arbitrary models.
      </p>
      <p>
        Active learning algorithms are able to choose the order in which training
examples are presented in order to improve learning, typically by selective
sampling [
        <xref ref-type="bibr" rid="ref15 ref2">2</xref>
]. While the general CDE setting does not necessarily contain aspects
of selective sampling (consider, for example, the batch F1 score prediction task),
the ranking task can be used as a selective sampling strategy in applications
to find instances that support system refactoring. The focus of the F1 ranking
task, however, still differs from active learning goals which is essential for the
design of such systems. Both approaches are supposed to favor different
techniques to fit their different objectives. Popular active learning approaches such
as density-weighting (e.g., [
        <xref ref-type="bibr" rid="ref12 ref25">12</xref>
        ]) focus on dense regions of the input distribution.
CDE, however, tries to estimate the quality of the model on the whole data set
and hence demands differently designed methods. Despite their differences,
the combination of active learning and CDE would be an interesting subject for
future work. CDE may be used to find weak learners of ensembles and
informative instances for these learners.
The Constraint-driven Evaluation (CDE) framework presented in this work
allows the user to specify expectations about the domain in form of constraints.
These constraints are applied on documents with annotations, which have been
created by an information extraction model. The results of the constraints are
aggregated to a single cde score, which reflects how well the annotations fulfill
the user’s expectations and thus provide a predicted measurement of the model’s
quality for these documents. The framework is implemented as an extension of
the UIMA Ruta Workbench. Figure 1 provides a screenshot of the CDE
perspective, which includes different views to formalize the set of constraints and
to present the predicted quality of the model for the specified documents.
      </p>
      <p>
We define a constraint in this work as a function C : CAS → [0, 1], which
returns a confidence value for an annotated document (CAS) where high values
indicate that the expectations are fulfilled. Two different types of constraints
are currently supported: Rule constraints are simple UIMA Ruta rules without
actions, and allow specifying sequential patterns or other relationships between
annotations that need to be fulfilled. The result is basically the ratio of how
often the rule has tried to match compared to how often the rule has actually
matched.</p>
      <p>Fig. 1. CDE perspective in the UIMA Ruta Workbench. Bottom left: Expectations
on the domain formalized as constraints. Top right: Set of documents and their cde
scores. Bottom right: Results of the constraints for the selected document.</p>
      <p>An example for such a constraint is Document{CONTAINS(Author)};,
which specifies that each document must contain an annotation of the type
Author. The second type of supported constraints are Annotation Distribution
(AD) constraints (c.f. Generalized Expectations [
        <xref ref-type="bibr" rid="ref11 ref24">11</xref>
        ]). Here, the expected
distribution of an annotation or word is given for the evaluated types. The result of
the constraint is the cosine similarity of the expected and the observed presence
of the annotation or word within annotations of the given types. A constraint
like "Peter": Author 0.9, Title 0.1, for example, indicates that the word
“Peter” should rather be covered by an Author annotation than by a Title
annotation. The set of constraints and their weights can be defined using the CDE
Constraint view (c.f. Figure 1, bottom left).
      </p>
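      <p>The two constraint types can be mimicked in a few lines of Python (an illustration only; the actual constraints are UIMA Ruta rules and configuration entries, and the function names here are hypothetical):</p>

```python
# Illustrative sketch of the two constraint types (hypothetical names).
import math

def rule_constraint(tried, matched):
    """Ratio of actual matches to match attempts of a rule, in [0, 1]."""
    return matched / tried if tried else 1.0

def ad_constraint(expected, observed):
    """Cosine similarity between the expected and the observed coverage,
    e.g. expected = {"Author": 0.9, "Title": 0.1} for the word "Peter"."""
    types = set(expected) | set(observed)
    dot = sum(expected.get(t, 0.0) * observed.get(t, 0.0) for t in types)
    ne = math.sqrt(sum(v * v for v in expected.values()))
    no = math.sqrt(sum(v * v for v in observed.values()))
    return dot / (ne * no) if ne and no else 0.0
```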
      <p>For a given set of constraints C = {C1, C2, ..., Cn} and corresponding weights
w = {w1, w2, ..., wn}, the cde score for each document is defined by the weighted
average:
cde = (1/n) · Σ_{i=1}^{n} w_i · C_i</p>
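      <p>As a one-line sketch, the weighted average above reads:</p>

```python
# The cde score: cde = (1/n) * sum_i w_i * C_i, for constraint results in [0, 1].
def cde_score(constraint_values, weights):
    n = len(constraint_values)
    return sum(w * c for w, c in zip(weights, constraint_values)) / n
```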
      <p>The cde scores for a set of documents may already be very useful as a
report how well the annotations comply with the expectations on the domain.
However, one can further distinguish two tasks for CDE: the prediction of the
actual evaluation score of the model, e.g., the token-based F1 score, and the
prediction of the quality ranking of the documents. While the former task can
give answers on how well the model performs or whether the model is already good
enough for the application, the latter task provides a useful tool for introspection:
Which documents are poorly labeled by the model? Where should the model
be improved? Are the expectations on the domain realistic? Due to the limited
expressiveness of the aggregation function, we concentrate on the latter task. The
cde scores for the annotated documents are depicted in the CDE Documents
view (c.f. Figure 1, top right). The result of each constraint for the currently
selected document is given in the CDE Results view (c.f. Figure 1, bottom right).</p>
      <p>The development of the constraints needs to be supported by tooling in order
to achieve an improved prediction in the intended task. If the user extends or
refines the expectations on the domain, then a feedback whether the prediction
has improved or deteriorated is very valuable. For this purpose, the framework
provides functionality to evaluate the prediction quality of the constraints themselves.
Given a set of documents with gold annotations, the cde score of each document
can be compared to the actual F1 score. Four measures are applied to evaluate the
prediction quality of the constraints: the mean squared error, the Spearman’s
rank correlation coefficient, the Pearson correlation coefficient and the cosine
similarity. For optimizing the constraints to approximate the actual F1 score,
the Pearson’s r is maximized, and for improving the predicted ranking, the
Spearman’s ρ is maximized. If documents with gold annotations are available,
then the F1 scores and the values of the four evaluation measures are given in
the CDE Documents view (c.f. Figure 1, top right).</p>
      <p>The usability and advantages of the presented work are illustrated with a simple
case study concerning the segmentation of scientific references, a popular domain
for evaluating novel information extraction models. In this task, the information
extraction model normally identifies about 12 different entities of the reference
string, but in this case study we limited the relevant entities to Author, Title
and Date, which are commonly applied in order to identify the cited publication.</p>
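      <p>The correlation-based comparison of cde scores against gold F1 scores (Spearman’s ρ for the ranking, Pearson’s r for linear dependency) can be sketched in pure Python (illustrative only; ties in the Spearman ranking are not handled, and any statistics library would do):</p>

```python
# Pure-Python sketch of the two prediction-quality measures.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    # Rank the values (ties not handled in this sketch), then correlate ranks.
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return pearson_r(ranks(xs), ranks(ys))
```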
      <p>
        In the main scenario of the case study, we try to estimate the extraction
quality of a set of UIMA Ruta rules that shall identify the Author, Title and
Date of a reference string. For this purpose, we define constraints representing
the background knowledge about the domain for this specific set of rules.
Additionally to this main setting of the case study, we also measure the prediction of
the constraints in two different scenarios: In the first one, the documents have
been labeled not by UIMA Ruta rules, but by a CRF model [
        <xref ref-type="bibr" rid="ref10 ref23">10</xref>
        ]. The CRF
model was trained with a limited number of iterations in a 5-fold manner. In
a second scenario, we apply the UIMA Ruta rules on a set of documents of a
different distribution including unknown style guides.
      </p>
      <p>
        Table 1 provides an overview of the applied datasets. We make use of the
references dataset of [
        <xref ref-type="bibr" rid="ref22 ref9">9</xref>
]. This data set is homogeneously divided into three
sub-datasets with respect to their style guides and number of references.
      </p>
      <p>Druta: 219 references in 8 documents, used to develop the set of UIMA Ruta rules.</p>
      <p>Ddev: 192 references in 8 documents, labeled by the UIMA Ruta rules and applied
for developing the constraints.</p>
      <p>Dtest: 155 references in 7 documents, labeled by the UIMA Ruta rules and applied to
evaluate the constraints.</p>
      <p>Dcrf: Druta, Ddev and Dtest (566 references in 23 documents), labeled by a (5-fold)
CRF model.</p>
      <p>Dgen: 452 references in 28 documents from a different source with unknown style
guides, labeled by the UIMA Ruta rules.</p>
      <p>
These datasets are applied to develop the UIMA Ruta rules, to define the set of constraints, and to
evaluate the prediction of the constraints compared to the actual F1 score. The CRF
model is trained on the partitions given in [
        <xref ref-type="bibr" rid="ref22 ref9">9</xref>
        ]. The last dataset Dgen consists of
a mixture of the datasets Cora, CiteSeerX and FLUX-CiM described in [
        <xref ref-type="bibr" rid="ref16 ref3">3</xref>
        ]
generated by the rearrangement of [
        <xref ref-type="bibr" rid="ref21 ref8">8</xref>
        ].
      </p>
      <p>Cruta+bib: Cruta extended with one additional AD constraint covering the
entity distribution of words extracted from Bibsonomy. The weight of each
constraint is set to 1.</p>
      <p>Cruta+5xbib: Same set of constraints as in Cruta+bib, but the weight of the additional
AD constraint is set to 5.</p>
      <p>Table 2 provides an overview of the different sets of constraints, whose
predictions are compared to the actual F1 score. First, we extended and refined a
set of UIMA Ruta rules until they achieved an F1 score of 1.0 on the dataset
Druta. Then, 15 Rule constraints Cruta5 have been specified using the dataset
Ddev. The definition of the UIMA Ruta rules took about two hours and the
definition of the constraints about one hour. Additionally to the Rule constraints,
we created an AD constraint, which consists of the entity distribution of words
that occurred at least 1000 times in the latest Bibtex database dump of
Bibsonomy6. The set of constraints Cruta+bib and Cruta+5xbib combine both types of
constraints with different weighting.</p>
      <p>5 The actual implementation of the constraints as UIMA Ruta rules is depicted in
Figure 1 (lower left part).
6 http://www.kde.cs.uni-kassel.de/bibsonomy/dumps</p>
      <p>Table 3 contains the evaluation, which compares the predicted cde score
to the actual token-based F1 score for each document. We apply two different
correlation coefficients for measuring the quality of the prediction: Spearman’s ρ
gives an indication about the ranking of the documents and Pearson’s r provides
a general measure of linear dependency.</p>
      <p>Although the expectations defined by the sets of constraints are limited and
quite minimalistic, covering mostly only common expectations, the results
indicate that they can be useful in all examined scenarios. The results for dataset Ddev are
only given for completeness since this dataset was applied to define the set of
constraints. The results for the dataset Dtest, however, reflect the prediction
on unseen documents of the same distribution. The ranking of the documents
was almost perfectly estimated with a Spearman’s ρ of 0.96157. The coefficients
for the other scenarios Dcrf and Dgen are considerably decreased, but the cde
scores are nevertheless very useful for an assessment of the extraction model’s
quality. The five worst documents in Dgen (including new style guides), for
example, have been reliably detected. The results show that the AD constraints
can improve the prediction, but do not exploit their full potential in the current
implementation. The impact measured for the dataset Dcrf is not as distinctive
since the CRF model already includes such features and thus is able to avoid
errors that are detected by these constraints. However, the prediction in the
dataset Dgen is considerably improved. The UIMA Ruta rules produce severe
errors in documents with new style guides, which are easily detected by the word
distribution.
</p>
      <sec id="sec-11-1">
        <title>Conclusions</title>
        <p>This paper presented a tool for the UIMA community implemented in UIMA
Ruta, which enables estimating the extraction quality of arbitrary models on
unseen documents. Its introspective report is able to improve the development
of information extraction models with minimal effort. This is achieved
by formalizing the background knowledge about the domain with different types
of constraints. We have shown the usability and advantages of the approach in
a case study about segmentation of references. Concerning future work, many
prospects for improvement remain, for example a logistic regression model for
approximating the scores of arbitrary evaluation measures, new types of
constraints, or approaches to automatically acquire the expectations on a domain.
7 The actual cde and F1 scores of Dtest are depicted in Figure 1 (right part).
Acknowledgments This work was supported by the Competence Network
Heart Failure, funded by the German Federal Ministry of Education and
Research (BMBF01 EO1004).</p>
      </sec>
      <sec id="sec-11-2">
        <title>References</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Srikanth</given-names>
            <surname>Ramaka</surname>
          </string-name>
          .
          <article-title>Unsupervised ontology-based semantic tagging for knowledge markup</article-title>
          .
          <source>In Proceedings of the Workshop on Learning in Web Search at the International Conference on Machine Learning</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Hamish</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>GATE, a general architecture for text engineering</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>254</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>David</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lally</surname>
          </string-name>
          .
          <article-title>UIMA: an architectural approach to unstructured information processing in the corporate research environment</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>327</fpage>
          -
          <lpage>348</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Kyungduk</given-names>
            <surname>Kim</surname>
          </string-name>
          et al.
          <article-title>A frame-based probabilistic framework for spoken dialog management using dialog examples</article-title>
          .
          <source>In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando C.N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Conditional random fields: probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Machine Learning</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ron</given-names>
            <surname>Meir</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gunnar</given-names>
            <surname>Rätsch</surname>
          </string-name>
          .
          <article-title>An introduction to boosting and leveraging</article-title>
          .
          <source>In Advanced lectures on machine learning</source>
          , pages
          <fpage>118</fpage>
          -
          <lpage>183</lpage>
          . Springer,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>David</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xing</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Table extraction using conditional random fields</article-title>
          .
          <source>In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>235</fpage>
          -
          <lpage>242</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Pallavi</given-names>
            <surname>Pyreddy</surname>
          </string-name>
          and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Tintin: A system for retrieval in text tables</article-title>
          .
          <source>In Proceedings of the second ACM international conference on Digital libraries</source>
          , pages
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Lev</given-names>
            <surname>Ratinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In Proceedings of the thirteenth conference on Computational Natural Language Learning</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Soderland</surname>
          </string-name>
          .
          <article-title>Learning information extraction rules for semi-structured and free text</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>34</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>272</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Nallapati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Legal claim identification: Information extraction with hierarchically labeled data</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on the Semantic Processing of Legal Texts</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Suzanne Liebowitz</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard</given-names>
            <surname>Fritzson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jon A</given-names>
            <surname>Pastor</surname>
          </string-name>
          .
          <article-title>Extraction of data from preprinted forms</article-title>
          .
          <source>Machine Vision and Applications</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>222</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Viola</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mukund</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          .
          <article-title>Learning to extract information from semistructured text using a discriminative context free grammar</article-title>
          .
          <source>In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>330</fpage>
          -
          <lpage>337</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bellare</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Druck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Alternating Projections for Learning with Expectation Constraints</article-title>
          .
          <source>In: Proceedings of the Twenty-Fifth Conference on Uncertainty in AI</source>
          . pp.
          <fpage>43</fpage>
          -
          <lpage>50</lpage>
          . AUAI Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atlas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladner</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Improving generalization with active learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>15</volume>
          ,
          <fpage>201</fpage>
          -
          <lpage>221</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          3.
          <string-name>
            <surname>Councill</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>M.Y.</given-names>
          </string-name>
          :
          <article-title>ParsCit: an Open-source CRF Reference String Parsing Package</article-title>
          .
          <source>In: Proceedings of the Sixth International Language Resources and Evaluation (LREC'08)</source>
          . ELRA, Marrakech, Morocco (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          4.
          <string-name>
            <surname>Culotta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Confidence Estimation for Information Extraction</article-title>
          .
          <source>In: Proceedings of HLT-NAACL 2004: Short Papers</source>
          . pp.
          <fpage>109</fpage>
          -
          <lpage>112</lpage>
          . HLT-NAACL-Short '04, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lally</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          /4),
          <fpage>327</fpage>
          -
          <lpage>348</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          6.
          <string-name>
            <surname>Graca</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganchev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Expectation Maximization and Posterior Constraints</article-title>
          . In:
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roweis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (eds.) NIPS 20, pp.
          <fpage>569</fpage>
          -
          <lpage>576</lpage>
          . MIT Press, Cambridge, MA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kluegl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atzmueller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>TextMarker: A Tool for Rule-Based Information Extraction</article-title>
          . In:
          <string-name>
            <surname>Chiarcos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Castilho</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stede</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the 2nd UIMA@GSCL Workshop</source>
          . pp.
          <fpage>233</fpage>
          -
          <lpage>240</lpage>
          . Gunter Narr Verlag (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kluegl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Local Adaptive Extraction of References</article-title>
          .
          <source>In: 33rd Annual German Conference on Artificial Intelligence (KI 2010)</source>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kluegl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toepfer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lemmerich</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Collective Information Extraction with Context-Specific Consistencies</article-title>
          . In:
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Bie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.)
          <source>ECML/PKDD (1). Lecture Notes in Computer Science</source>
          , vol.
          <volume>7523</volume>
          , pp.
          <fpage>728</fpage>
          -
          <lpage>743</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</article-title>
          .
          <source>Proc. 18th International Conf. on Machine Learning</source>
          pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>11</volume>
          ,
          <fpage>955</fpage>
          -
          <lpage>984</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          12.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Employing EM and Pool-Based Active Learning for Text Classification</article-title>
          . In:
          <string-name>
            <surname>Shavlik</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          (ed.) ICML. pp.
          <fpage>350</fpage>
          -
          <lpage>358</lpage>
          . Morgan Kaufmann (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          13.
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masanz</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogren</surname>
            ,
            <given-names>P.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kipper-Schuler</surname>
            ,
            <given-names>K.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications</article-title>
          .
          <source>Journal of the American Medical Informatics Association: JAMIA</source>
          <volume>17</volume>
          (
          <issue>5</issue>
          ),
          <fpage>507</fpage>
          -
          <lpage>513</lpage>
          (
          <year>Sep 2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>