Unstructured Information Management Architecture (UIMA)
3rd UIMA@GSCL Workshop, September 23, 2013

Copyright © 2013 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

Preface

For many decades, NLP has suffered from low software engineering standards, causing a limited degree of re-usability of code and interoperability of different modules within larger NLP systems. While this did not really hamper success in limited task areas (such as implementing a parser), it caused serious problems for the emerging field of language technology, where the focus is on building complex integrated software systems, e.g., for information extraction or machine translation. This lack of integration has led to duplicated software development, work-arounds for programs written in different (versions of) programming languages, and ad-hoc tweaking of interfaces between modules developed at different sites.

In recent years, the Unstructured Information Management Architecture (UIMA) framework has been proposed as a middleware platform which offers integration by design through common type systems and standardized communication methods for components analysing streams of unstructured information, such as natural language. The UIMA framework offers a solid processing infrastructure that allows developers to concentrate on the implementation of the actual analytics components. An increasing number of members of the NLP community have thus adopted UIMA as a platform facilitating the creation of reusable NLP components that can be assembled to address different NLP tasks depending on their order, combination and configuration.

This workshop aims at bringing together members of the NLP community – users, developers or providers of either UIMA components or UIMA-related tools – in order to explore and discuss the opportunities and challenges in using UIMA as a platform for modern, well-engineered NLP.

This volume contains the proceedings of the 3rd UIMA workshop, held under the auspices of the German Language Technology and Computational Linguistics Society (Gesellschaft für Sprachverarbeitung und Computerlinguistik – GSCL) in Darmstadt, September 23, 2013. From 11 submissions, the programme committee selected 7 full papers and 2 short papers.

The organizers of the workshop wish to thank all people involved in this meeting – submitters of papers, reviewers, GSCL staff and representatives – for their great support, rapid and reliable responses, and willingness to act on very tight timelines. We appreciate their enthusiasm and cooperation.

September 2013
Peter Kluegl, Richard Eckart de Castilho, Katrin Tomanek (Eds.)

Program Committee

Sophia Ananiadou, University of Manchester
Steven Bethard, KU Leuven
Ekaterina Buyko, Nuance Deutschland
Philipp Cimiano, University of Bielefeld
Anni R. Coden, IBM Thomas J. Watson Research Center
Kevin Cohen, University of Colorado
Richard Eckart de Castilho, Technische Universität Darmstadt
Frank Enders, Averbis
Nicolai Erbs, Technische Universität Darmstadt
Stefan Geissler, Temis Deutschland
Thilo Götz, IBM Deutschland
Udo Hahn, FSU Jena
Nicolas Hernandez, University of Nantes
Michael Herweg, IBM Deutschland
Nancy Ide, Vassar College
Peter Kluegl, University of Würzburg
Eric Nyberg, Carnegie Mellon University
Kai Simon, Averbis
Michael Tanenblatt, IBM Thomas J. Watson Research Center
Martin Toepfer, University of Würzburg
Katrin Tomanek, Averbis
Karin Verspoor, National ICT Australia
Graham Wilcock, University of Helsinki
Torsten Zesch, University of Duisburg-Essen

Additional Reviewers

Roman Klinger, University of Bielefeld

Table of Contents

Keynote: Apache clinical Text Analysis and Knowledge Extraction System (cTAKES)
  Pei Chen and Guergana Savova
A Model-driven approach to NLP programming with UIMA
  Alessandro Di Bari, Alessandro Faraotti, Carmela Gambardella and Guido Vetere
Storing UIMA CASes in a relational database
  Georg Fette, Martin Toepfer and Frank Puppe
CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration
  Elmer Garduno, Zi Yang, Avner Maiberg, Collin McCormack, Yan Fang and Eric Nyberg
Aid to spatial navigation within a UIMA annotation index
  Nicolas Hernandez
Using UIMA to Structure An Open Platform for Textual Entailment
  Tae-Gil Noh and Sebastian Padó
Bluima: a UIMA-based NLP Toolkit for Neuroscience
  Renaud Richardet, Jean-Cedric Chappelier and Martin Telefont
Sentiment Analysis and Visualization using UIMA and Solr
  Carlos Rodríguez-Penagos, David García Narbona, Guillem Massó Sanabre, Jens Grivolla and Joan Codina
Extracting hierarchical data points and tables from scanned contracts
  Jan Stadermann, Stephan Symons and Ingo Thon
Constraint-driven Evaluation in UIMA Ruta
  Andreas Wittek, Martin Toepfer, Georg Fette, Peter Kluegl and Frank Puppe

Keynote: Apache clinical Text Analysis and Knowledge Extraction System (cTAKES)

Abstract

The presentation will focus on the methods and software development behind the cTAKES platform (http://ctakes.apache.org/). An overview of the modules will set the stage, followed by a more in-depth discussion of the methods and evaluations of selected modules. The second part of the presentation will shift to software development topics such as optimization and distributed computing, including UIMA integration, UIMA-AS, as well as our plans for UIMA-DUCC integration. A live demo of cTAKES will conclude the talk.

About the speakers

Pei Chen is a Vice President of the Apache Software Foundation, leading the top-level cTAKES project. He is also a lead application development specialist at the Informatics Program at Boston Children's Hospital/Harvard Medical School. Mr. Chen's interests lie in building practical applications using machine learning techniques. He has a passion for the end-user experience and has a background in Computer Science/Economics. Mr. Chen is a firm believer in the open source community, contributing to cTAKES as well as other Apache Software Foundation projects.

Guergana Savova, Ph.D. is a member of the faculty at Harvard Medical School and Children's Hospital Boston.
Her research interest is in natural language processing (NLP), especially as applied to the text generated by physicians (the clinical narrative), focusing on higher-level semantic and discourse processing, which includes topics such as named entity recognition, event recognition, relation detection, and classification including co-reference and temporal relations. The methods are mostly machine learning, spanning supervised, lightly supervised, and completely unsupervised approaches. Her interest is also in the application of NLP methodologies to biomedical use cases. Dr. Savova has been leading the development and is the principal architect of cTAKES. She holds a Master of Science in Computer Science and a PhD in Linguistics with a minor in Cognitive Science from the University of Minnesota.

A Model-driven approach to NLP programming with UIMA

Alessandro Di Bari, Alessandro Faraotti, Carmela Gambardella, and Guido Vetere
IBM Center for Advanced Studies of Trento, Piazza Manci, 1, Povo di Trento

Abstract. In Natural Language Processing, more complex business use cases and shorter delivery times drive a growing need for smoother, more flexible and faster implementations. This trend also requires integrating and orchestrating different functionalities delivered by services belonging to different technological platforms. All these needs imply raising the level of abstraction for NLP component development. In this paper we present a Model Driven Architecture approach suitable for developing an open and interoperable UIMA-based NLP stack. By decoupling UIMA NLP models from other solution-specific platforms and services, we obtain major architectural improvements.

1 Introduction

As Natural Language Processing (NLP) approaches complex tasks such as Question Answering or Dialog Management, the capability for NLP tools to seamlessly interoperate with other software services, such as knowledge bases or rule engines, becomes crucial. Such a level of integration may require linguistic models to be shared among a variety of different platforms, each of which comes with its own information representation language. Platforms like UIMA (http://uima.apache.org/) or GATE (http://gate.ac.uk/) consist of middleware and tools for designing and pipelining NLP-specific tasks, including support for modeling data structures for text annotation, such as lexical, morphological and syntactic features, which may be embedded in inter-process communication protocols. However, while perfectly suited for annotation purposes, NLP-specific schema languages, such as the UIMA Type System, fall short of fulfilling solution-level modeling needs. Model-Driven software Architectures (MDA), on the other hand, are specifically aimed at tackling the complexity of modern software infrastructures, with emphasis on the integration and orchestration of different technological platforms. The MDA approach is based on providing formal descriptions (models) of requirements, interactions, data structures, protocols, and many other aspects of the desired system, which are automatically turned into technical resources, such as schemes and software modules, by activating transformation rules.

Based on this consideration, we adopted an MDA approach to develop a "Watson ready" (www.ibm.com/watson/), UIMA-based NLP stack for Italian, as part of the activity of the newborn IBM Language & Knowledge Center for Advanced Studies of Trento (www.ibm.com/ibm/cas/).
We wanted our stack to be as open and interoperable as possible, to help users leverage the availability of NLP resources and tools in the Open Source / Open Data space. In addition, our stack aims at being independent from language-specific issues and domains, to facilitate its reuse across projects and within our (multinational) Company. The basic idea was to design a highly modularized general model including all the required structures, and to obtain technical platform-specific resources from a suitable set of model-to-model transformations. Also, we embraced the idea of abstracting semantic information away from the UIMA Type System, as in [5] and in [7], and evaluated the benefit of representing such kind of information by specific means. In sum, we looked at UIMA as a well-suited platform for linguistic analysis, which allows the integration of analytic components into managed workflow pipelines, but regarded the UIMA Type System as a schema specification for that platform, rather than as a general modeling language for any NLP-based solution. Here we present an overview of the basic ideas behind our approach, introduce our project, and discuss future directions. At the present stage of development, we can share our vision on MDA positioning and motivation with respect to NLP development (Section 3), and we can report our first implementation experiences (Section 4). Finally, we outline some related topics and introduce future work.

2 Motivating Scenario

Natural Language based solutions may require the NLP stack to cooperate with other components in a complex system. Such cooperation typically involves data exchanges with reference to a shared information model. Figure 1 shows the integration of an NLP stack with a Knowledge Base (e.g. an Ontology-based Data Access System) and a Rule Engine.

A UIMA-based NLP pipeline produces an annotated text (step 1 in Figure 1) contained in a UIMA CAS (Common Analysis Structure). A wrapper of the UIMA Type System defines all the operations needed for a consumer (the Rule Engine in this case) in order to access the CAS and invoke the appropriate operations within the cooperating subsystem when needed (see Section 4.2). When developing and maintaining the solution, an Engineer builds a rule set (step 3) in order to process linguistic structures and interact with a Knowledge Base (step 4), which, in turn, uses the annotated text to store assertions as the result of an Information Extraction process (step 5). In a separate flow, the Knowledge Base can be queried by a User through a Question Answering System based on a suitable query language (step 6). The integration of all components involved is guaranteed by a common abstract model (Platform Independent Model) that contains the overall conceptualization of the system. The transition from one platform-specific data structure to another is handled by a set of Model-to-Model transformations (steps 7 and 8). The figure also shows the link to legacy (possibly huge) conceptual models, such as the KB ontology (step 9).

Fig. 1. Architectural sketch

3 Model Driven Architecture for NLP

Model Driven Architecture (MDA) [6] is a development approach strictly based on formal specifications of information structures and behaviors, and their semantics.
MDA is managed by the Object Management Group (OMG, http://omg.org/) and builds on several modeling standards such as the Unified Modeling Language (UML, http://www.uml.org/), the Meta-Object Facility (MOF), XML Metadata Interchange (XMI) and others. MDA supports Model Driven Development/Engineering (MDD, MDE). The key idea behind MDA is to provide a higher level of abstraction so that software can be fully designed independently from the underlying technological platform. More formally, MDA defines three macro "modeling" layers:

– Computation Independent Model (CIM)
– Platform Independent Model (PIM)
– Platform Specific Model (PSM)

The first can be related to a Business Process Model and does not necessarily imply the existence of a system that automates it. The PIM is a model that is independent from any technical platform; the third (PSM) layer is the actual implementation of the model with respect to a given technology, and it is automatically derived from the PIM. Notice that the PIM allows a comprehensive representation of the structure and behavior of the system being developed. The modeling language is typically UML or EMF (http://www.eclipse.org/modeling/emf/), but it could actually be any other Domain Specific Language (DSL).

Developing powerful NLP tasks, such as Question Answering systems, requires combining a great variety of analytic components, which is what UIMA has been designed for. We consider UIMA the standard solution for document analysis workflows. Within this framework, MDD tools can be effectively used to better manage the UIMA Type System. In particular, we decided to look at it as a PSM dedicated to text annotation. The motivation for leveraging MDD (in the NLP field) can be summarized as follows:

– Formalization: MDA languages are well studied in logics, and reasoning mechanisms can be developed upon them [1].
– Expressiveness: MOF meta-modeling allows great and well-founded expressiveness [4], including modeling behaviors.
– Support: The availability of tools, including diagramming and code generation, improves the software life cycle and team collaboration.

In particular, with respect to our architecture, we modeled UIMA annotations by defining classes rather than just (data) types, so that a consumer is able to invoke operations designed for those objects. Access to UIMA annotations is then achieved by means of automatically generated wrappers. Another motivation for a model-driven approach was the need to represent complex linguistic data, and to exploit existing tooling and resources for generating training data for a statistical parser. In sum, we tried to exploit the maturity and flexibility of MDD tools while keeping up the power of UIMA as a framework for component integration, pipeline execution, and workflow management in general. As the PIM language, we chose EMF because it is already integrated with UIMA and provides powerful and mature model-driven features. Once the code is also generated (by UIMA JCasGen), the type system corresponds to an implementation of a (business) domain model, limited to the structural aspects (as opposed to behavioral aspects).

At PIM level, we also have to represent those properties that, once transformed against a target model, give specific characteristics on that model. For instance, in order to generate the UIMA Type System (PSM) starting from the PIM, we have to represent on the source model whether a class (that is a root in a hierarchy on the PIM model) will be generated as a UIMA annotation or not (UIMA TOP).
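In the EMF scenario discussed below, one way to carry such a marker at PIM level is an EAnnotation attached to the class. The following is only a minimal sketch under our own assumptions; the marker URI and model names are hypothetical and not taken from the paper:

    import org.eclipse.emf.ecore.EAnnotation;
    import org.eclipse.emf.ecore.EAttribute;
    import org.eclipse.emf.ecore.EClass;
    import org.eclipse.emf.ecore.EPackage;
    import org.eclipse.emf.ecore.EcoreFactory;
    import org.eclipse.emf.ecore.EcorePackage;

    public class PimAnnotationMarker {
        // Hypothetical marker URI; the paper does not name the one actually used.
        static final String UIMA_ANNOTATION_MARKER =
                "http://example.org/unstructured#Annotation";

        public static EClass buildTokenClass() {
            EcoreFactory f = EcoreFactory.eINSTANCE;
            EPackage pkg = f.createEPackage();
            pkg.setName("nlp");
            pkg.setNsURI("http://example.org/nlp");

            EClass token = f.createEClass();          // a PIM class, e.g. Token
            token.setName("Token");
            pkg.getEClassifiers().add(token);

            EAttribute lemma = f.createEAttribute();  // a structural feature
            lemma.setName("lemma");
            lemma.setEType(EcorePackage.Literals.ESTRING);
            token.getEStructuralFeatures().add(lemma);

            // Mark the class so that a PIM-to-UIMA transformation could generate
            // it as a subtype of uima.tcas.Annotation rather than of uima.cas.TOP.
            EAnnotation marker = f.createEAnnotation();
            marker.setSource(UIMA_ANNOTATION_MARKER);
            token.getEAnnotations().add(marker);
            return token;
        }
    }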
Here we have taken two possible scenarios into account:

– Having a UML PIM, this specification is easily accomplished by using a UML profile (http://www.omg.org/spec/#M&M). Profiles define stereotypes that can be further structured with custom properties. This way, we have a generic "Unstructured Information" profile that, at least, encompasses an Annotation stereotype; thus a class that is meant to become an annotation will simply be "marked" with this stereotype.
– Having an EMF PIM (such as our current implementation), we can represent the same thing as an EMF annotation. Therefore (we apologize for the clash of terms), we will have a class annotated as Annotation.

In any case, a class stereotyped as Annotation on the PIM will take the role of a generic annotation for document analysis, independently from the underlying framework.

The main benefit of our approach is the ability to represent NLP objects independently from any particular implementation: we are using different (generated) PSMs (which are better explained in Section 4), all deriving from the starting (PIM) model, as shown in Figure 2.

Fig. 2. MDD for NLP: different abstraction layers

These benefits certainly have a price, which is essentially the cost of developing the necessary transformations. However, following the basic assumptions of the MDD approach, we estimate that these costs pay off well, especially when heterogeneous components have to be integrated, development is managed iteratively, and models are subject to high volatility.

4 Model Driven Implementation Aspects

In order to clarify how we are leveraging the Model Driven approach, we list here the artifacts (PSMs and code) we are generating through appropriate transformations that we have developed. Starting from our "application" model:

– UIMA type system (we modified the existing transformation from EMF in order to avoid any further modification on the UIMA type system)
– EMF wrapper of the UIMA type system – this wrapper also acts as the input for creating the model for the Rule Engine, as explained below

Starting from (our) models of common standard data formats for parser training, such as CONLL, PENN and others, we generated all the necessary (OpenNLP-specific) data for training the parser on:

– Tokenization
– Named Entities
– Part-of-speech tagging
– Chunking
– Parsing

To represent the model (PIM), we use the Eclipse Modeling Framework (EMF), which represents a de facto Java-based standard for meta-modeling. Informally, we may say EMF represents a subset of UML (the structural part) with very precise semantics for code generation. In the future, we could move this representation to a profiled UML, as mentioned above (see Section 3). Furthermore, EMF offers very powerful generation features. Summarizing, in the current implementation we use EMF in two ways:

1. As a language to represent the model
2. As a PIM model to generate different target PSMs

4.1 NLP Parser

The NLP Parser component is implemented using Apache OpenNLP (http://opennlp.apache.org/) and UIMA; it is based on a UIMA Type System built from the Syntax and the Abstract models using the UIMA transformation utility. The training corpora for the parser have to be provided in a specific format required by OpenNLP. Since the data that we had available for training were in standard formats such as PENN (http://www.cis.upenn.edu/~treebank/), CONLL (http://ilk.uvt.nl/conll/#dataformat) and others, some transformations were required. Ecore models have been created for the purpose of representing the source formats. Furthermore, some simple JET (http://www.eclipse.org/modeling/m2t/?project=jet) transformations have been developed in order to generate our corpora (in the specific OpenNLP formats).
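For illustration only (this is not one of the authors' JET templates), a conversion of this kind essentially maps rows of a CoNLL-style file onto the one-sentence-per-line, word_TAG format that the OpenNLP POS tagger trainer expects. A minimal Java sketch, assuming the word form and POS tag are the first two tab-separated columns (actual CoNLL column layouts vary by year):

    import java.util.ArrayList;
    import java.util.List;

    public class ConllToOpenNlpPos {
        // Convert CoNLL-style rows (one token per line, blank line between
        // sentences) into the OpenNLP POS trainer format: one sentence per
        // line, consisting of word_TAG pairs.
        public static List<String> convert(List<String> conllLines) {
            List<String> sentences = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (String line : conllLines) {
                if (line.isBlank()) {                  // sentence boundary
                    if (current.length() > 0) {
                        sentences.add(current.toString().trim());
                        current.setLength(0);
                    }
                } else {
                    String[] cols = line.split("\t");
                    if (cols.length < 2) continue;     // skip malformed rows
                    // assumed layout: column 0 = word form, column 1 = POS tag
                    current.append(cols[0]).append('_').append(cols[1]).append(' ');
                }
            }
            if (current.length() > 0) {
                sentences.add(current.toString().trim());
            }
            return sentences;
        }
    }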
Compared to other solutions, this makes our infrastructure extremely flexible: should the parser be replaced or the data formats changed, the only operation we have to perform is to modify the JET templates accordingly.

4.2 Type System EMF Wrapper

As anticipated in Section 2, in the higher layers of our architecture we have a Rule Engine that acts as a reasoner on annotation objects coming from the UIMA pipeline. We wanted this layer to be able to call operations implemented on those objects (as explained in Section 3), with those objects always implementing the exact interfaces of the (Ecore) PIM model. Given these requirements, we developed a transformation that generates a wrapper of the UIMA type system which fully reflects the starting PIM model, including operations. Once implemented, the code is also preserved across future re-generations, thanks to the merging capabilities of this transformation. Thus, as shown in Figure 1, the Rule Engine "consumes" instances of this wrapper, and can still access the underlying UIMA annotations. We considered the possibility of directly adding these operations to the classes generated by UIMA (via the JCas generation utility), but this would not be consistent with our model-driven approach, since those operations would not be part of a general, system-wide model.
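The following sketch conveys the shape such a generated wrapper could take; all names are illustrative assumptions, not the authors' generated code:

    // A PIM-level interface with an operation, and a wrapper that implements
    // it by delegating storage to the underlying UIMA/JCas annotation.
    interface Word {
        String getLemma();
        boolean isVerb();   // behavior modeled at PIM level, not just data
    }

    public class WordWrapper implements Word {
        private final de.example.types.Word jcasWord;  // hypothetical JCasGen class

        public WordWrapper(de.example.types.Word jcasWord) {
            this.jcasWord = jcasWord;
        }

        @Override
        public String getLemma() {
            return jcasWord.getLemma();                // delegate to the JCas feature
        }

        @Override
        public boolean isVerb() {
            // implemented once against the model; kept across re-generations
            // by the transformation's merge support
            return jcasWord.getPos() != null && jcasWord.getPos().startsWith("V");
        }
    }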
4.3 Rule Engine

As far as the Rule Engine is concerned, we chose IBM Operational Decision Manager (ODM, http://www-03.ibm.com/software/products/us/en/odm/). ODM rules have to be written against a specific model, called the Business Object Model (BOM), that allows user-friendly business rule editing; ODM provides tools to set up a natural language vocabulary which users can use to write business rules in a pseudo-natural language. Once defined, the rules are executed on a BOM-related Java implementation named the Execution Object Model (XOM). We obtained the BOM by reverse engineering the XOM, and the XOM directly from the Java classes (implementing the type system wrapper) generated from our PIM (EMF) model. Therefore, the BOM can be seen as just another manifestation of our PIM model.

4.4 Knowledge Base

Our architecture is backed by a Knowledge Base Management System which stores and reasons on information extracted from many sources. Leveraging the Knowledge Model included in the PIM, we were able to integrate an external pre-existing system named ONDA (Ontology Based Data Access) [3]. ONDA supports Ontology Based Data Access (OBDA) on OWL2-QL (http://www.w3.org/TR/owl2-profiles/), ensuring sound and complete conjunctive query answering with the same efficiency and scalability as a traditional database [2]. Because the underlying ONDA Knowledge Model was already designed with EMF, we simply adopted it and included it in the PIM. This way, reasoning and query answering services have been included in the PIM model as operations available to all other components (e.g. the Rule Engine).

5 Conclusion and future works

We have outlined here an innovative approach to NLP development, based on the idea of setting UIMA as the target platform in a Model-Driven development process. A major benefit of this approach consists in giving NLP models a greater value, especially in terms of generality, usability, and interoperability.

While developing this idea, we understood that a suitable Model-Driven machinery for NLP should be supported by specific design patterns for concrete models. In particular, the model we have developed has been abstracted both from morphosyntactic specificity and from semantic aspects. The former (including part-of-speech classes, genders, numbers, verbal tenses, etc.) may significantly vary among different languages; the latter (including concepts like persons, events, places, etc.) are related to specific application domains. By decoupling these layers, we achieved a lightweight "generic" UIMA type system [7], we designed a powerful generic model for morphosyntactic features, and we managed ontological information with proper expressive means. Refining and extending this model is part of our future plans. We implemented a first prototype of a Knowledge Base query system based on the Eclipse Modeling Framework (EMF). For the future, we are considering the possibility of representing the model in UML, in order to have greater representational power (such as modeling sequence diagrams).

The work presented here is still at an early stage. More work is needed to complete the linguistic model, for instance in the area of argument structures, such as verbal frames. From an implementation standpoint, our priority is to consolidate, improve and extend the set of Model-to-Model transformations, and to further exploit MDD tools.

References

1. A. Calì, D. Calvanese, G. De Giacomo, and M. Lenzerini. A formal framework for reasoning on UML class diagrams. In Proceedings of the 13th International Symposium on Foundations of Intelligent Systems, ISMIS '02, pages 503–513, London, UK, 2002. Springer-Verlag.
2. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. DL-Lite: Tractable description logics for ontologies. In AAAI, volume 5, pages 602–607, 2005.
3. P. Cangialosi, C. Consoli, A. Faraotti, and G. Vetere. Accessing data through ontologies with ONDA. In Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research, CASCON '10, pages 13–26, Riverton, NJ, USA, 2010. IBM Corp.
4. Liliana Favre. A formal foundation for metamodeling. In F. Kordon and Y. Kermarrec, editors, Reliable Software Technologies – Ada-Europe 2009, volume 5570 of Lecture Notes in Computer Science, pages 177–191. Springer Berlin Heidelberg, 2009.
5. D. Ferrucci, J. W. Murdock, and C. Welty. Overview of component services for knowledge integration in UIMA (aka SUKI). Technical report, IBM Research Report RC24074, 2006.
6. J. Miller and J. Mukerji. MDA guide version 1.0.1. Technical report, Object Management Group (OMG), 2003.
7. K. Verspoor, W. Baumgartner Jr, C. Roeder, and L. Hunter. Abstracting the types away from a UIMA type system. From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, pages 249–256, 2009.

Storing UIMA CASes in a relational database

Georg Fette (1,2), Martin Toepfer (1), and Frank Puppe (1)
(1) Department of Computer Science VI, University of Wuerzburg, Am Hubland, Wuerzburg, Germany
(2) Comprehensive Heart Failure Center, University Hospital Wuerzburg, Straubmuehlweg 2a, Wuerzburg, Germany

Abstract. In the UIMA text annotation framework, the most common way to store annotated documents (CASes) is to serialize each document to XML and store this XML in a file in the file system.
We present a framework to store CASes as well as their type systems in a relational database. This not only provides a way to improve document management, but also the possibility to access and manipulate selected parts of the annotated documents using the database's index structures. The approach has been implemented for MSSQL and MySQL databases.

Keywords: UIMA, data management, relational databases, SQL

1 Introduction

UIMA [2] has become a well-known and often-used framework for processing text data. The main component of the UIMA infrastructure is the CAS (Common Analysis Structure), a data structure which combines the actual data (the text of a document), annotations on this data, and the type system the annotations are based on. In many UIMA projects, CASes are stored as serialized XML files in file folders, with the corresponding type system file in a separate location. In this storage mode, the resource management of which CAS to load with which type system is the responsibility of the programmer who wants to perform an operation on specific documents. However, manual management of files in folders on local machines or network folders can quickly become confusing and messy, especially when projects get bigger. We present a framework to store CASes as well as their corresponding type systems in a relational database. This storage mode provides the possibility to access the data in a centralized, organized way. Furthermore, the approach provides all the benefits that come along with relational databases, including search indices on the data, selective storage, retrieval and deletion, as well as the possibility to perform complex queries on the stored data in the well-known SQL language.

The structure of the paper is as follows: Section 2 describes the related work, Section 3 describes the technical details of the database storage mechanism, Section 4 illustrates query possibilities using the database, Section 5 reports performance experiences with the framework, and Section 6 concludes with a summary of the presented work.
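Before turning to related work, note that the conventional file-based storage mode described above boils down to standard UIMA XMI serialization; a minimal sketch, shown here only for contrast:

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.XmiCasSerializer;

    public class XmiFileStorage {
        // Serialize a CAS to an XMI file; the type system must be managed
        // separately, which is exactly the bookkeeping the paper criticizes.
        public static void save(CAS cas, String path) throws Exception {
            try (OutputStream out = new FileOutputStream(path)) {
                XmiCasSerializer.serialize(cas, out);
            }
        }
    }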
2 Related Work

To the best of the authors' knowledge, the only other approach in which CASes are stored in a database is the JULIE Lab DB Mapper [4], which serializes CASes to a PostgreSQL database. However, that mechanism stores neither the CASes' type systems, nor does it support features like the referencing of annotations by features or the derivation of annotation types. Other approaches use indices to improve query performance but do not allow reconstructing the annotated documents from the index (Lucene-based: LUCAS [4], Fangorn [3]; relational-database-based: XPath [1], ANNIS [7]; proprietary-index-based: TGrep/TGrep2 [6], SystemT [5]). These indices still need the documents to be stored in the file system. Furthermore, some of the mentioned indices only offer specialized search capabilities (e.g. with an emphasis on parse trees) which are provided by the respective search index and cannot search directly on the UIMA data structures. In contrast to these approaches, our system allows searches on arbitrary type systems by formulating queries closely related to the involved annotation and feature types.

3 Database storage

The storage mechanism is based on a relational database whose table model is illustrated in Figure 1. The schema can be subdivided into a document-related part (left), an annotation instance part (middle) and a type-system-related part (right). Documents are stored as belonging to a named collection and can be manipulated (retrieved, deleted, etc.) as a group, e.g. deleting all annotations of a specific type. Annotated documents can be handled individually, by loading/saving a single CAS, or as a whole collection, by creating a collection reader/writer. Either way, any communication (loading/saving) can (but need not) be parametrized so that only the desired annotation types are loaded/saved, thus speeding up processing time, reducing memory consumption and facilitating debugging. A type system, instead of being stored in an XML file with a fixed set of types, can be retrieved from the database in different task-specific ways. One way is to request the type system which is needed to load all the annotated documents belonging to a certain collection. Other possibilities are to provide a set of desired type names, or a regular expression determining all desired type names. The storage mechanism is able to store the inheritance structures of UIMA type systems as well as the referencing of annotations by features of other annotations. For further information on the technical aspects we refer to the documentation of the framework (http://code.google.com/p/uima-sql/).

Fig. 1. Schema of the relational database storing CASes and type systems

4 Querying

A benefit of storing data in an SQL database is the database index and the well-established SQL query standard. The database can be queried for counts of occurrences of specific annotation types, counts of covered texts of annotations, or even complex annotation structures in the documents. We want to exemplify this with a query on documents which have been annotated with a dependency parser, using the type system shown in Figure 2: a Token type inheriting from uima.tcas.Annotation, with a Governor feature referencing the governing Token.

Fig. 2. Type system for parses

To query for all words governing the word walk, we have to look for tokens with the desired covered text, find the tokens governing those tokens, and return their covered text. The SQL command for this task is shown in Figure 3:

    SELECT govText.covered
    FROM annot_inst govToken,
         annot_inst_covered govText,
         annot_inst baseToken,
         annot_inst_covered baseText,
         feat_inst,
         feat_type
    WHERE baseText.covered = 'walk'
      AND baseToken.covered_ID = baseText.covered_ID
      AND baseToken.annot_inst_ID = feat_inst.annot_inst_ID
      AND feat_inst.feat_type_ID = feat_type.feat_type_ID
      AND feat_type.name = 'Governor'
      AND feat_inst.value = govToken.annot_inst_ID
      AND govText.covered_ID = govToken.covered_ID

Fig. 3. SQL query for governor tokens

An abstraction layer to hide this complexity could be put on top (like a graph query language), but even in the presented form, with standard SQL, the capabilities of the database engine can serve as a useful tool to improve corpus analysis.
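Such queries can of course be issued from analysis code over plain JDBC. A minimal sketch, assuming the table and column names shown in Figure 1 (the framework's actual Java API may differ), counting how often each covered text occurs among the stored annotations:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class AnnotationFrequency {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/uima", "user", "secret")) {
                // Join annotation instances with their covered texts and
                // count how often each covered text occurs.
                String sql = "SELECT c.covered, COUNT(*) AS freq "
                           + "FROM annot_inst a JOIN annot_inst_covered c "
                           + "  ON a.covered_ID = c.covered_ID "
                           + "GROUP BY c.covered ORDER BY freq DESC";
                try (PreparedStatement ps = con.prepareStatement(sql);
                     ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("covered")
                                + "\t" + rs.getLong("freq"));
                    }
                }
            }
        }
    }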
5 Performance

To run a performance test on the storage engine, we created a corpus of 1000 documents, each consisting of 1000 words. The words were taken from a dictionary of 1000 randomly created words, each 8 characters long. From each document we created a CAS and added annotations so that each word was covered, with the annotations covering 1 to 5 successive words. Each annotation was given two features: one String feature with a value randomly taken from the word dictionary, and a Long feature containing a random number. All documents were stored and then loaded again. This was done with the database engine as well as with a local file folder on the same hard drive the database files were located on.

In a second experiment, the same documents were loaded again and we added to each document an annotation of another type, with a Long feature containing a random number. After adding the additional annotation, the documents were stored again.

In a third experiment, we queried for the frequencies of annotations covering each of the words from the word dictionary. For file system storage this was done by accumulating the annotation counts during an iteration over all serialized CASes; for database storage this was done by performing a single SQL query for each of the words from the dictionary.

In Table 1 we can observe that the time needed for database storage is quite long, but reading is as fast as from the file system. Storing to the database in the second experiment was faster than in the first one, because this time only the additional annotations had to be incrementally stored. Storage to the file system again performed about five times faster than to the database, but the benefit of being able to incrementally store only the additional annotations can be clearly observed. Physical storage space consumption is larger for database storage, but that should not pose a major problem, as hard disk space is not an overly expensive resource nowadays. Query performance in the database is about 20 times faster than using file system storage, illustrating the benefit of the database approach.

Table 1. Performance measures comparing database and file system storage

                 exp1                          exp2                            exp3
                 saving (sec.)  loading (sec.) saving (sec.) storage size (MB) query (sec.)
    DB           36.0           1.1            7.2           42.3              0.16
    FileSystem    2.6           1.1            2.7            6.5              7.0

6 Conclusion

We have presented a framework to store and retrieve CASes and to perform analysis queries on them using a relational database. We examined the save, load and query speed compared to regular file-based storage, and presented examples of how to use the database index structures to analyze annotations in a corpus. We hope to be able to improve the storage speed of the database engine, so that the choice between file system storage and database storage will no longer be influenced by the still quite large difference in speed.

This work was supported by grants from the Bundesministerium fuer Bildung und Forschung (BMBF01 EO1004).

References

1. Bird, S., Lee, H.: Designing and evaluating an XPath dialect for linguistic queries. In: 22nd International Conference on Data Engineering (2006)
2. Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering 10(3–4), 327–348 (2004)
3. Ghodke, S., Bird, S.: Fangorn: A system for querying very large treebanks. In: COLING (Demos), pp. 175–182 (2012)
4. Hahn, U., Buyko, E., Landefeld, R., Mühlhausen, M., Poprat, M., Tomanek, K., Wermter, J.: An overview of JCoRe, the JULIE lab UIMA component repository. In: LREC'08 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP' (2008)
5. Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F., Vaithyanathan, S., Zhu, H.: SystemT: a system for declarative information extraction. SIGMOD Rec. (2009)
6. Rohde, D.L.T.: TGrep2 user manual (2001)
7. Zeldes, A., Lüdeling, A., Ritz, J., Chiarcos, C.: ANNIS: a search tool for multi-layer annotated corpora. In: Proceedings of Corpus Linguistics 2009 (2009)
CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration

Elmer Garduno (1), Zi Yang (2), Avner Maiberg (2), Collin McCormack (3), Yan Fang (4), and Eric Nyberg (2)
(1) Sinnia, elmerg@sinnia.com
(2) Carnegie Mellon University, {ziy, amaiberg, ehn}@cs.cmu.edu
(3) Boeing Company, collin.w.mccormack@boeing.com
(4) Oracle Corporation, yan.fang@oracle.com

Abstract. To efficiently build data analysis and knowledge discovery pipelines, researchers and developers tend to leverage available services and existing components by plugging them into different phases of the pipelines, and then spend hours to days seeking the right components and configurations that optimize system performance. In this paper, we introduce the CSE framework, a distributed system for a parallel experimentation test bed based on UIMA and uimaFIT, which is general and flexible to configure and powerful enough to sift through thousands of option combinations to determine which represents the best system configuration.

1 Introduction

To efficiently build data analysis and knowledge discovery "pipelines", researchers and developers tend to leverage available services and existing components by plugging them into different phases of the pipelines [1], and then spend hours seeking the components and configurations that optimize system performance. The Unstructured Information Management Architecture (UIMA) [3] provides a general framework for defining common types in the information system (type system), designing pipeline phases (CPE descriptor), and further configuring the components (AE descriptor) without changing the component logic. However, there is no easy way to configure and execute a large set of combinations without repeated executions, while evaluating the performance of each component and configuration. To fully leverage existing components, it must be possible to automatically explore the space of system configurations and determine the optimal combination of tools and parameter settings for a new task. We refer to this problem as configuration space exploration, which can be formally defined as a constraint optimization problem. A particular information processing task is defined by a configuration space, which consists of m_t candidate components, with corresponding configurations, for each of the n phases. Given a limited total resource capacity C and input set S, configuration space exploration (CSE) aims to find the trace (a combination of configured components) within the space that achieves the highest expected performance without exceeding C total cost. Details of the mathematical definition and the proposed greedy solutions can be found in [6].
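In one possible notation (a sketch consistent with the description above, not the exact formulation of [6]), a trace picks one configured component c_t from each phase's candidate set C_t, and CSE solves:

    \max_{(c_1,\ldots,c_n)\,:\;c_t \in \mathcal{C}_t}
        \mathbb{E}\big[\,\mathrm{perf}(c_1,\ldots,c_n \mid S)\,\big]
    \quad \text{subject to} \quad
        \sum_{t=1}^{n} \mathrm{cost}(c_t) \le C,
    \qquad |\mathcal{C}_t| = m_t,

where perf is the user-specified evaluation metric over the input set S, and cost aggregates the resources consumed by the configured components of the trace.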
In this paper, we introduce the CSE framework implementation, a distributed parallel experimentation test bed based on UIMA and uimaFIT [4]. In addition, we highlight the results of two case studies in which we applied the CSE framework to the task of building biomedical question answering systems.

Fig. 1. Example YAML-based pipeline descriptors

2 Framework Architecture

We highlight some features of the implementation in this section. Source code, examples, documentation, and other resources are publicly available on GitHub (http://oaqa.github.io/). To benefit developers who are already familiar with the UIMA framework, we have developed a CSE tutorial in alignment with the examples in the official UIMA tutorial.

Declarative descriptors. To leverage the CSE framework, users need to specify how the components should be organized in the pipeline, which values need to be specified for each component configuration, what the input set is, and which measurement metrics should be applied. Analogous to a typical UIMA CPE descriptor, components, configurations, and collection readers in the CSE framework are declared in extended configuration descriptors based on the YAML format. An example of the main pipeline descriptor and a component descriptor is shown in Figure 1.
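The YAML of Figure 1 did not survive extraction; purely as an illustration of the keywords described here and in the next paragraph (class, inherit, options, cross-opts), and with all other names hypothetical, such a descriptor could look like:

    pipeline:
      - inherit: bioqa.collection.genomics-reader      # collection reader, by name
      - class: org.example.cse.KeytermExtractorPhase   # phase with alternatives
        options:
          - inherit: bioqa.keyterm.baseline
          - class: org.example.cse.NGramExtractor
            cross-opts:                                # parameter sweep inside one component
              n: [1, 2, 3]
              stemming: [true, false]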
Architecture. Each pipeline can contain an arbitrary number of AnalysisEngines, declared by using the class keyword or by inheriting configuration options from other components by name. Combinations of components are configured using an options block, and parameter combinations within a component are configured in a cross-opts block. To take full advantage of the CSE framework capabilities, users inherit from cse.phase, a CAS multiplier that provides option multiplexing, intermediate resource persistence, and resource management for long-running components. The architecture also supports grouping options into sub-pipelines as a convenient way of reducing the configuration space for combinations whose performance is already known.

Evaluation. Unlike a traditional scientific workflow management system, CSE emphasizes the evaluation of component performance, based on user-specified evaluation metrics and gold-standard outputs at each phase. In addition, the framework keeps track of the performance of all executed traces, which allows inter-component evaluation and automatic tracking of performance improvements over time.

Automatic data persistence. To support further error analysis and the reproduction of experimental results, intermediate data (CASes) and evaluation results are kept in a repository accessible from any trace at any point during the experiment. To prevent duplicate execution of traces, the system keeps track of all execution traces and recovers those CASes whose predecessors have already been executed. The overall results from experiments are also kept in a historical database to allow researchers to keep track of performance improvements over time.

Configurable selection and pruning. If gold-standard data is provided for a certain phase, then components up to that phase can be evaluated. Given the measured cost of executing the components provided, components can be ranked, selected or pruned for the evaluation and optimization of subsequent phases. The component ranking strategy can be configured by the user; several heuristic strategies are implemented in the open source software.

Distributed architecture. We have extended the CSE framework implementation to execute the task set in parallel on a distributed system using JMS. The components and configurations are deployed into the cluster beforehand. The execution, fault tolerance and bookkeeping are managed by a master server. In addition, we leverage the UIMA-AS capabilities to execute specific configurations in parallel as separate services directly from the pipeline.

3 Building biomedical QA Systems via CSE

As a case study, we apply the CSE framework to the problem of building effective biomedical question answering (BioQA) systems on two different tasks. In one case, we employ the topic set and benchmarks, including gold-standard answers and evaluation metrics, from the question answering task of the TREC Genomics Track 2006, as well as commonly-used tools, resources, and algorithms cited by participants. The implemented components, benchmarks and task-specific evaluation methods are included in a domain-specific layer named BioQA, which was plugged into the BaseQA framework. The configuration space was explored with the CSE framework, automatically yielding an optimal configuration of the given components which outperformed published results for the same task. We compare the settings and results for the experiment with the official TREC 2006 Genomics test results for the participating systems in Table 1. We can see that the best system derived automatically by the proposed CSE framework outperforms the best participating system in terms of both DocMAP and PsgMAP, with fewer, more basic components. This experiment ran on a 40-node cluster for 24 hours, allowing the execution of 200K components over 2,700 execution traces. A more detailed analysis can be found in [6].

Table 1. Case study results

                 # Comp  # Conf  # Trace  # Exec    DocMAP                 PsgMAP
                                                    Max    Median  Min     Max    Median  Min
    Participants ∼1,000  ∼1,000  92       ∼1,000    .5439  .3083   .0198   .1486  .0345   .0007
    CSE          13      32      2700     190,680   .5648  .4770   .1087   .1773  .1603   .0311

We also used the CSE framework to automatically configure a different type of biomedical question answering system for the QA4MRE (Question Answering for Machine Reading Evaluation) task at CLEF. The CSE framework identified a better combination, which achieved a 59.6% performance gain over the original pipeline. Details can be found in the working note paper [5].

4 Related Work

Previous work has been done in this area, in particular DKPro Lab [2], a flexible lightweight framework for parameter sweep experiments, and the U-Compare framework [7], an evaluation platform for running tools on text targets and comparing components, which generates statistics and instance-based visualizations of outputs. One of the main advantages of the CSE framework is that it allows the exploration of very large configuration spaces by distributing the experiments over a cluster of workers and collecting the statistics in a centralized way. Another advantage of the CSE framework is that configurations can have arbitrary nesting levels, as long as they form a DAG, by using sub-pipelines. Also, results can be compared end-to-end at a global level to understand overall performance trends over time. One area where CSE could take advantage of the aforementioned frameworks is in having a graphical UI for pipeline configuration, better visualization tools for combinatorial and instance-based comparison, and a more expressive language for workflow definition.

5 Conclusion & Future Work

In this paper, we present a UIMA-based distributed system to solve a common problem in rapid domain adaptation, referred to as Configuration Space Exploration. It features declarative descriptors, evaluation support, automatic data persistence, global resource caching, configurable selection and pruning, and a distributed architecture. As a case study, we applied the CSE framework to build a biomedical question answering system, which incorporated the benchmark from the TREC Genomics QA task, and the results showed the effectiveness of the CSE framework.
We plan to adapt the system to a wide variety of interesting information processing problems, to facilitate rapid domain adaptation, system building and evaluation within the community. For educational purposes, we are also interested in adopting the CSE framework as an experimentation platform to teach students principled ways to design, implement and evaluate an information system.

Acknowledgement. We thank Leonid Boystov, Di Wang, Jack Montgomery, Alkesh Patel, Rui Liu, Ana Cristina Mendes, Kartik Mandaville, Tom Vu, Naoki Orii, and Eric Riebling for their contributions to the design and development of the system and their valued suggestions on the paper.

References

1. D. Ferrucci et al. Towards the Open Advancement of Question Answering Systems. Technical report, IBM Research, Armonk, New York, 2009.
2. R. E. de Castilho and I. Gurevych. A lightweight framework for reproducible parameter sweeping in information retrieval. In Proceedings of the DESIRE'11 workshop, New York, NY, USA, Oct. 2011.
3. D. Ferrucci and A. Lally. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng., 10(3–4), Sept. 2004.
4. P. Ogren and S. Bethard. Building test suites for UIMA components. In Proceedings of the SETQA-NLP 2009 workshop, Boulder, Colorado, June 2009. Association for Computational Linguistics.
5. A. Patel, Z. Yang, E. Nyberg, and T. Mitamura. Building an optimal QA system automatically using configuration space exploration for QA4MRE'13 tasks. In Proceedings of CLEF 2013, 2013.
6. Z. Yang, E. Garduno, Y. Fang, A. Maiberg, C. McCormack, and E. Nyberg. Building optimal information systems automatically: Configuration space exploration for biomedical information systems. In Proceedings of CIKM'13, 2013.
7. Y. Kano, W. Baumgartner, L. McCrohon, S. Ananiadou, K. Cohen, L. Hunter, and J. Tsujii. U-Compare: share and compare text mining tools with UIMA. Bioinformatics, 25(15):1997–1998, 2009.

Aid to spatial navigation within a UIMA annotation index

Nicolas Hernandez
Université de Nantes

Abstract. In order to support interoperability within UIMA workflows, we address the problem of accessing one annotation from another when the type system does not specify an explicit link between the two kinds of objects, but a semantic relation between them can be inferred from a spatial relation which connects them. We discuss the limitations of the framework and briefly present the interface we have developed to support such navigation.

Keywords: Apache UIMA, Type System interoperability, Annotation Index, Spatial navigation

1 Introduction

One of the main ideas in using document analysis frameworks such as the Apache Unstructured Information Management Architecture (UIMA, http://uima.apache.org) [3] is to move away from handling the raw subject of analysis directly. The idea is to enrich the raw data with descriptions which can be used as the basis for the processing of subsequent components. In the UIMA framework, the descriptions are typed feature structures. The component developer defines a type system which specifies the features of a type (a set of (attribute, typed value) pairs) as well as how the types are arranged together (through inheritance and aggregation relations). Annotations are feature structures attached to specific regions of documents.
In this paper, we address the problem of accessing one annotation from another when the type system does not specify an explicit link between the two kinds of objects, but a semantic relation between them can be inferred from a spatial relation which connects them. The situation is a case of the interoperability issues which can be encountered when developing a component (e.g. a term extractor) that uses analysis results produced by two components developed by different developers (e.g. part-of-speech and lemma information each held by distinct annotations over the same spans).

In practice, most of the existing type systems define annotation types which inherit from the built-in uima.tcas.Annotation type [4, 5, 7]. This type contains begin and end features which are used to attach the description to a specific region of the text being analysed. Thanks to these features, it is possible to cross-reference the annotations which extend this type.
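A minimal sketch of such offset-based cross-referencing with the stock UIMA API (using the short type name POS from the example type system introduced in Section 2; a real type system would use qualified names):

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;

    public class SameSpanLookup {
        // Find a POS annotation spanning exactly the same offsets as a given word.
        public static AnnotationFS posOf(CAS cas, AnnotationFS word) {
            Type posType = cas.getTypeSystem().getType("POS");
            for (AnnotationFS a : cas.getAnnotationIndex(posType)) {
                if (a.getBegin() == word.getBegin() && a.getEnd() == word.getEnd()) {
                    return a;   // same span: likely another facet of the same object
                }
            }
            return null;        // no POS annotation at this span
        }
    }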
Indeed, we would like to be able to navigate within an annotation index from an annotation to its covering/covered annotations or to the spatially following/preceding annotation of a given type having such or such properties. We defined this problem as a navigation problem within an annotation index. 3 Accessing the annotations in the UIMA framework The problem of accessing the annotations depends on the means offered by the framework2 to build annotation indexes and to navigate within them. 2 See the Reference Guide http://uima.apache.org/d/uimaj-2.4.0/references. html and the Javadoc http://uima.apache.org/d/uimaj-2.4.0/apidocs. 19 3.1 Defining feature structures indexes Adding a feature structure to a subject of analysis corresponds to the act of indexing the feature structure. By default, an unnamed built-in bag index ex- ists which holds all feature structures which are indexed. The framework defines also a built-in annotation index, called AnnotationIndex, which automatically indexes all feature structures of type (and subtypes of) uima.tcas.Annotation. As reported in the documentation, ”the index sorts annotations in the order in which they appear in the document. Annotations are sorted first by increasing begin position. Ties are then broken by decreasing end position (so that longer annotations come first). Annotations that match in both their begin and end fea- tures are sorted using a type priority”. If no type priority is defined in the compo- nent descriptor3 , the order of the annotations sharing the same span in the text is undefined in the index. The UIMA API provides getAnnotationIndex methods to get all the annotations of that index (subtypes of uima.tcas.Annotation) or the annotations of a given subtype. The UIMA framework allows also to define indexes and possibly to sort the feature structures within them. 3.2 Parsing the annotation index The UIMA API offers several methods to parse the AnnotationIndex. Given an annotation index, the iterator method returns an object of the same name which allows to move to the first (respectively the last) annotation of the index, the next (respectively the previous) annotation (depending on its position in the index) or to a given annotation in the index. It is also possible to get an unambiguous iterator to navigate among contiguous annotations in the text. In practice, this iterator consists of getting successively the first annotation in the index whose begin value is higher than the end of the current one. We will call this mechanism the first-contiguous-in-the-index principle. The subiterator method returns an iterator whose annotations fall within the span of another annotation. It is possible to specify whether the returned annotations should be strictly covered (i.e. both begin and end offsets covered) or if it concerns only its begin offset. Subiterator can also be unambiguous. Annotations at the same span may be not returned depending on the order in the index as well as the type priority definition. The constrained iterator allows to iterate over feature structures which satisfy given constraints. The constraints are objects that can test the type of a feature structure, or the type and the value of its features. The tree method returns an AnnotationTree structure which contains nodes representing the results of doing recursively a strict, unambiguous subiterator over the span of a given annotation. The API offers methods to navigate within the tree from the root node. 
The tree method returns an AnnotationTree structure which contains nodes representing the results of recursively applying a strict, unambiguous subiterator over the span of a given annotation. The API offers methods to navigate within the tree from the root node. From any other node, it is possible to get the children nodes, the next or the previous sibling node, and the parent node.

3 In a UIMA workflow, a component is interfaced by a text descriptor that indicates how to use the component.

4 Limitations of the UIMA framework

Table 1a shows the AnnotationIndex containing the analysis results of the data string "Verne visited the seaport of Nantes.\n". Annotations were initially added to the index in this order: first the Document, then the Source, the Sentence, the Words, the POS, the Chunks and the NamedEntities.

Offsets  Annotations   Covered text                            LocatedAnnotations
(0,37)   Document      Verne visited the seaport of Nantes.\n  Document
(0,36)   Sentence1     Verne visited the seaport of Nantes.    Sentence
(0,5)    Word1         Verne                                   Word1 NamedEntity1 Chunk1 POS1
(0,5)    NamedEntity1  Verne
(0,5)    POS1          Verne
(0,5)    Chunk1        Verne
(0,0)    Source                                                Source
(6,13)   Word2         visited                                 Word2 Chunk2 POS2
(6,13)   POS2          visited
(6,13)   Chunk2        visited
(14,35)  Chunk3        the seaport of Nantes                   Chunk3
(14,25)  Chunk4        the seaport                             Chunk4
(14,17)  Word3         the                                     Word3 POS3
(14,17)  POS3          the
(18,25)  Word4         seaport                                 Word4 POS4
(18,25)  POS4          seaport
(26,35)  Chunk5        of Nantes                               Chunk5
(26,28)  Word5         of                                      Word5 POS5
(26,28)  POS5          of
(29,35)  Word6         Nantes                                  Word6 NamedEntity2 POS6
(29,35)  NamedEntity2  Nantes
(29,35)  POS6          Nantes
(35,36)  Word7         .                                       Word7 POS7
(35,36)  POS7          .

Table 1: An AnnotationIndex (a: columns Offsets, Annotations, Covered text) and its corresponding LocatedAnnotationIndex (b: column LocatedAnnotations). Both are aligned for comparison. Annotations and LocatedAnnotations are sorted in increasing order from the top of the table. Annotations are identified by their type and an index number.

4.1 Index limitations

The definition of an index is usually done in the component descriptor. A defined index can only contain one specific type (and subtypes) of feature structures. So, to get an index made of two distinct types, the trick is to declare them as subtypes of the same common type in the type system, and to get the index of this super type. This can lead to a less consistent type system from a linguistic point of view, but it is still coherent with the UIMA approach of doing whatever you need in your component. The framework also allows sorting the feature structures of a defined index, but with some restrictions. The sorting key, which should be a feature of the indexed type, can only be a string or a numerical value. Only the natural way of sorting such elements is available; there is no way to declare one's own comparator to set the order between two elements. To sort on a different kind of key, the developer has to fall back on the available mechanisms. In addition, the type system may need to be modified to add a feature playing the role of the sorting key, which can also make the type system less consistent.

4.2 Navigation limitations within an annotation index

Iterator With an ambiguous iterator, the result of a move to the previous/next annotation in the index may not correspond to the annotation which precedes/follows spatially in the text; it can also be a covering or a covered one. In Table 1a, the annotation preceding Word3 is the covering Chunk4. Unambiguous iterators force the methods to return only spatially contiguous annotations. In practice, the method does not always return the expected result. When called on the full annotation index, it starts from the first annotation in the index: in Table 1a, it returns only the Document annotation and no further annotation.
When calling an unambiguous iterator on a typed annotation index, the effect of the first-contiguous-in-the-index principle becomes apparent if some annotations occur at the same span. In that situation, the developer has no access to all the annotations which effectively follow/precede the current annotation spatially. In Table 1a, an unambiguous iteration over the Chunk type returns Chunk1, Chunk2 and Chunk3; Chunk4 and Chunk5 are not reachable. To iterate unambiguously over annotations of distinct types (e.g. Named Entities and POS, to get the named entities followed by a verb), the developer has to create a super-type over them and call the iterator method on this super-type. The super-type may not have linguistic consistency, and the iterator will still suffer from the limitation we have previously mentioned. Another drawback of the unambiguous iterator can be noticed when iterating over an index in reverse order. If two overlapping annotations precede the current one, the one returned will be the one whose begin offset is the smallest, and not the one with the highest end value lower than the begin value of the current one: the iterator follows the first-contiguous-in-the-index principle in the normal order. Finally, the API does not allow iterating over the index and over the text spatiality at the same time; it is not possible to switch from an ambiguous iterator to an unambiguous one (and vice-versa).

Subiterator is the kind of method used to get the annotations covered by another one, like the words of a given sentence. Its major drawback is that, without a type priority definition, there is no assurance that annotations occurring at the same text span will fit an expected conceptual order. In Table 1a, an ambiguous subiterator over each chunk annotation for getting the words returns the Source annotation for Chunk1, nothing for Chunk2, and the expected words (plus more annotations to filter out) for all the remaining chunks. Concerning the unambiguous subiterator, the first-contiguous-in-the-index principle causes some annotations to be hidden. In Table 1a, when applying an unambiguous subiterator over each chunk, Chunk1 and Chunk2 return the same bad result as previously; Chunk3 only returns the Chunk4 and Chunk5 annotations, while Chunk4 and Chunk5 return the right word annotations. To subiterate unambiguously over a set of specific types, a super-type, which encompasses both the covered and the covering types, has to be defined in the type system. But the problem of the unambiguous iteration remains.

Constraint objects aim at testing one given feature structure at a time. The framework does not allow defining dynamic constraints: the values to test cannot be instantiated relatively to a feature structure in the index. A constraint cannot be set to select annotations whose begin feature value is higher than the end feature value of another one; rather, the exact value to compare against has to be specified when the constraint is created. Constraint objects are also complex to understand and to set up. It requires, for example, seven lines of code to create an iterator which will get the annotations with a given lemma feature. Constraint iterators remain iterators, with the same limitations.
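To give an idea of the machinery involved, the following sketch builds the kind of constrained iterator just described: it filters the Word annotations of the running example whose lemma feature equals visit. The fully-qualified type and feature names are assumptions for this illustration.

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.ConstraintFactory;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.FSMatchConstraint;
import org.apache.uima.cas.FSStringConstraint;
import org.apache.uima.cas.FSTypeConstraint;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.FeaturePath;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class LemmaFilter {
    public static FSIterator<AnnotationFS> wordsWithLemmaVisit(CAS cas) {
        Type wordType = cas.getTypeSystem().getType("example.Word"); // assumed type name
        Feature lemmaFeat = wordType.getFeatureByBaseName("lemma");  // assumed feature name

        ConstraintFactory cf = ConstraintFactory.instance();
        FSTypeConstraint isWord = cf.createTypeConstraint();
        isWord.add(wordType);

        FSStringConstraint isVisit = cf.createStringConstraint();
        isVisit.equals("visit");
        FeaturePath toLemma = cas.createFeaturePath();
        toLemma.addFeature(lemmaFeat);
        FSMatchConstraint lemmaIsVisit = cf.embedConstraint(toLemma, isVisit);

        // Combine the two constraints and create the filtered iterator.
        FSMatchConstraint constraint = cf.and(isWord, lemmaIsVisit);
        return cas.createFilteredIterator(cas.getAnnotationIndex().iterator(), constraint);
    }
}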
The Tree method returns an object close to the kind of structure we would like to navigate within. Unfortunately, it can only give the children of a covering annotation. So to get the parent of an annotation, a trick could be to build the tree of the whole document by taking the most covering annotation as the root, then to browse the tree until finding the desired annotation, and finally to get its parent. But in any case, there is no way to access a node directly, and the structure still suffers from the remarks we made about unambiguous subiterators (consequently, some annotations may not be present in the tree).

Missing Methods The existing methods only partially answer the problem, and some navigation methods are missing. There is no dedicated method: to super-iterate and get the annotations covering a given annotation; to move to the first/next annotation of a given type (respectively the last/previous annotation of a given type); or to get partially-covering preceding or following annotations. All these remarks lead developers to preferentially use ambiguous iterators and subiterators, even if this means writing more code to search the index backward/forward and tests to filter the desired annotations.

5 Supporting the spatial navigation

To support a spatial navigation among the annotations, we propose to index the annotations by their offsets in a structure called LocatedAnnotationIndex, and to merge the annotations occurring at the same spans into a structure called LocatedAnnotation. Table 1b illustrates the transformation of the AnnotationIndex depicted in Table 1a into a LocatedAnnotationIndex. Figure 1 shows the spatial links which interconnect the LocatedAnnotations. The LocatedAnnotationIndex is a sorted structure which follows the same sorting order as the AnnotationIndex: from a given LocatedAnnotation, covering and preceding LocatedAnnotations are located backward in the index, and covered and following LocatedAnnotations forward in the index. The structure allows direct access to a LocatedAnnotation through a pair of begin/end offsets. The first characteristic of a LocatedAnnotation is to list all the annotations occurring at the same offsets. This removes the need to define a type priority to handle the limitation of the subiterator. The structure comes with several kinds of links to navigate both within the LocatedAnnotationIndex and spatially in the text. Indeed, the structure has links to visit its spatial vicinity (the parent/children/following/preceding LocatedAnnotations), as well as links to access the previous/next element in the index. The contiguous spatial vicinity of each LocatedAnnotation is computed when the LocatedAnnotationIndex is built. The API also offers methods to dynamically search for a LocatedAnnotation containing annotations of a given type among the ancestors, the descendants or self. Similarly, it is also possible to search for the first/last (respectively following/preceding) LocatedAnnotation containing annotations of a given type. In terms of memory consumption, the built LocatedAnnotationIndex takes approximately as much memory as its AnnotationIndex; only the local vicinity of each LocatedAnnotation is kept in memory. The CPU time for building the index depends on the AnnotationIndex size. Some preliminary tests indicate that the time increases by a factor of three when doubling the size of the annotated text. It takes about 2 seconds to build the index of a 50-sentence text analysed with sentences, chunks and words.
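The fragment below sketches the kind of navigation this structure enables on the data of Table 1b. The method names (get, getParent, getChildren, findFollowing) are illustrative assumptions, not the actual signatures of the uima-common API.

// Illustrative sketch only; see the uima-common project for the real API.
LocatedAnnotationIndex index = new LocatedAnnotationIndex(jcas);

// Direct access by offsets: the LocatedAnnotation at span (0,5) merges
// Word1, NamedEntity1, Chunk1 and POS1.
LocatedAnnotation verne = index.get(0, 5);

// Precomputed spatial vicinity.
LocatedAnnotation parent = verne.getParent();            // the Sentence
List<LocatedAnnotation> children = parent.getChildren();

// Typed spatial search: the next LocatedAnnotation containing a Chunk.
LocatedAnnotation nextChunk = verne.findFollowing(Chunk.class);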
Fig. 1: Example of LocatedAnnotationIndex. The boxes represent the LocatedAnnotations; they are aligned on the text span they cover. The solid lines represent the spatial preceding/following relation, while the dotted lines represent the parent/child relations. The parent is indicated by an arrow.

6 Related work

With the prospect of developing a pattern matching engine over annotations, [2] have addressed some design considerations for navigating annotation lattices. They have thus proposed a language for specifying spatial constraints among annotations, and an engine has been implemented within the UIMA framework. Due to this technical choice, the design of the language and its implementation may suffer from the drawbacks we have enumerated; indeed, there is no example of patterns which involve annotation types without an inheritance relation. In addition, as pointed out in the authors' perspectives, it is not clear how the engine behaves when handling multiple annotations over the same spans without the guarantee of a consistent type priority. The LocatedAnnotation structure is a solution to the need for defining type priorities. More generally, the methods of our API can play the role of the navigation devices required for the development of a pattern matching engine.

uimaFIT4 is a well-known library which aims at simplifying UIMA development. One appealing navigation option it offers is similar to our API: some methods are designed to move from one annotation to the closest (covering/covered/preceding/following) ones by specifying the type of the annotations to get. In practice, nevertheless, the implementation relies on the UIMA API and may share some of its restrictions. The selectFollowing method, for example, follows the first-contiguous-in-the-index mechanism: in Table 1a, when called successively to get the following chunk starting from the first chunk, it returns only Chunks 1 to 3 and misses the 4th and the 5th.

7 Conclusion and perspectives

Solving the interoperability issues in the UIMA framework is a serious problem [1, 6]. Our position is that developers should be given the means to do what they want. We show that the UIMA API presents some limitations regarding the spatial navigation among the annotations of a text. We also show that, by adapting the problem definition to the framework's requirements, the developer may succeed in accomplishing the task. But this adaptation has a cost in development time and requires skills in the framework. To overcome the problem, we have developed a library which transforms an AnnotationIndex into a navigable structure which can be used in a UIMA component. It is available in the uima-common project5. Our perspectives are twofold: reducing the processing time, and adding a mechanism for updating the LocatedAnnotationIndex.

References

1. Ananiadou, S., Thompson, P., Kano, Y., McNaught, J., Attwood, T.K., Day, P.J.R., Keane, J., Jackson, D., Pettifer, S.: Towards interoperability of European language resources. Ariadne 67 (2011)
2. Boguraev, B., Neff, M.S.: A framework for traversing dense annotation lattices. Language Resources and Evaluation 44(3), 183–203 (2010)
3. Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering 10(3-4), 327–348 (2004)
4. Gurevych, I., Mühlhäuser, M., Müller, C., Steimle, J., Weimer, M., Zesch, T.: Darmstadt knowledge processing repository based on UIMA. In: First Workshop on UIMA at GSCL. Tübingen, Germany (2007)
5. Hahn, U., Buyko, E., Tomanek, K., Piao, S., McNaught, J., Tsuruoka, Y., Ananiadou, S.: An annotation type system for a data-driven NLP pipeline. In: The LAW at ACL 2007. pp. 33–40 (2007)
6. Hernandez, N.: Tackling interoperability issues within UIMA workflows. In: LREC. pp. 3618–3625 (2012)
7. Kano, Y., McCrohon, L., Ananiadou, S., Tsujii, J.: Integrated NLP evaluation system for pluggable evaluation metrics with extensive interoperable toolkit. In: SETQA-NLP. pp. 22–30 (2009)

4 http://uimafit.googlecode.com
5 https://uima-common.googlecode.com

Using UIMA to Structure an Open Platform for Textual Entailment

Tae-Gil Noh and Sebastian Padó
Department of Computational Linguistics, Heidelberg University, 69120 Heidelberg, Germany
{noh, pado}@cl.uni-heidelberg.de

Abstract. EXCITEMENT is a novel, open software platform for Textual Entailment (TE) which uses the UIMA framework. This paper discusses the design considerations regarding the roles of UIMA within the EXCITEMENT Open Platform (EOP). We focus on two points: (a) how to best design the representation of entailment problems within the UIMA CAS and its type system; and (b) the integration and usage of UIMA components among non-UIMA components.

Keywords: Textual Entailment, UIMA type system, UIMA application

1 Introduction

Textual Entailment (TE) captures a common-sense notion of inference and expresses it as a relation between two natural language texts. It is defined as follows: a Text (T) entails a Hypothesis (H) if a typical human reading of T would infer that H is most likely true [4]. Consider the following example:

T: That was the 1908 Tunguska event in Siberia, known as the Tunguska meteorite fall.
H1: A shooting star fell in Russia in 1908.
H2: Tunguska fell to Siberia in 1908.

The text (T) entails the first hypothesis (H1), since a typical human reader of T would (arguably) believe that H1 is true. In contrast, T does not entail H2. Nor does H1 entail T; that is, entailment is a directed relation.

The promise of TE lies in its potential to subsume the semantic processing needs of many NLP applications, offering a uniform, theory-independent semantic processing paradigm. Software for the Recognition of Textual Entailment (RTE) has been used to build proof-of-concept versions of various tasks, including Question Answering, Machine Translation Evaluation, Information Visualization, etc. [1, 7].

As a consequence of the theory-independence of TE, there are many different strategies to build RTE systems [1]. This has led to a practical problem of fragmentation: various systems exist, and some have been made available as open-source systems, but there is little to no interoperability between them, since the systems are, as a rule, designed to implement one specific algorithm to solve RTE. The problem is compounded by the fact that RTE systems generally rely on tightly integrated components such as linguistic analysis tools and knowledge resources. Thus, when researchers want to develop a new RTE algorithm, they often need to invest major effort to build a novel system from scratch: many of the components already exist – but just not in a usable form. The EXCITEMENT open platform (EOP) has been developed to address those problems. It is a suite of textual inference components which can be combined into complete textual inference systems.
The platform aims to become a common development platform for RTE researchers, and we hope that it can establish itself in the RTE community in a similar way to MOSES [6] in Machine Translation. Compared to Machine Translation, however, a major challenge is that semantic processing typically depends on linguistic analysis as well as large knowledge sources, which is a direct source of the reusability problems mentioned above. In this paper, we focus on the architectural side of the platform, which was designed with the explicit goal of improving component re-usability. We have adopted UIMA (Unstructured Information Management Architecture) and the UIMA CAS (Common Analysis Structure) as the central building blocks for data representation and preprocessing within the EOP.

One interesting aspect is that our adoption of UIMA has been partial and parallel. By partial, we mean that there are two groups of sharable components within the EOP: the "core" components and the "LAP" components (see Section 2). We have adopted UIMA only for LAPs; however, we use the UIMA CAS as one of the standard data containers, even in non-UIMA components. Parallel refers to the fact that we allow non-UIMA components to be integrated into our LAPs transparently.

2 EXCITEMENT: An Open Platform for Textual Entailment Systems

RTE systems traditionally rely on self-defined input types, pre-processing (linguistic annotation) representations, and resources, tailored to a specific approach to RTE. The EXCITEMENT open platform (EOP) tries to alleviate this situation by providing a generic platform for sharable RTE components. The platform has the following requirements:

Reusing of existing software: The platform must permit easy integration and re-use of existing software, including language processing tools, RTE components, and knowledge resources.

Multilinguality: The platform is not tied to a specific language. Adding suites for a new language in the future should not be restricted by the platform design.

Component independence: Components of the EOP should be independent and complete as they are, so that they can be used by different RTE approaches. This is also true for linguistic annotation pipelines and their components: an annotation pipeline as a whole, or an individual component of the pipeline, can be replaced with equivalent components.

[Fig. 1: EXCITEMENT Architecture Overview – the Linguistic Analysis Pipeline (LAP, UIMA components) turns raw entailment problems into annotated ones, which the Entailment Core (EC, Java components) analyzes with an Entailment Decision Algorithm (EDA) and dynamic and static components (algorithms and knowledge) to produce entailment decisions.]

Figure 1 visualizes the top level of the platform. At this level, the platform can be grouped into two boxes: one is the Linguistic Analysis Pipeline (LAP), and the other is the Entailment Core (EC). Entailment problems are first analyzed in the LAP, since almost all RTE algorithms require some level of linguistic annotation (e.g., POS tagging, parsing, NER, or lemmatization). The annotated TE problems are then passed to the EC box. In this box, the problems are analyzed by Entailment Decision Algorithms (EDAs), which are the "core" algorithms that make the entailment call and may in turn call other core components to provide either algorithmic additions or knowledge. Finally, the EDA returns an entailment decision.
It is relatively natural to think of the LAP in terms of UIMA, since the typical computational linguistic analysis workflow corresponds well to UIMA's annotation pipeline concept. Each annotator in the LAP adds some annotations, and downstream annotators can use existing annotations and add richer annotations. The UIMA CAS and its type system are expressive enough to represent any data, and UIMA AEs (Analysis Engines) are a good solution for encapsulating and using annotator components. In Section 3, we describe the UIMA adoption in the LAP in more detail.

For Entailment Core (EC) components, however, the situation is different. In contrast to the LAP, the functionalities of EC components often do not map naturally onto "annotation behavior". To visualize this, consider the example in Figure 2. The figure shows a conceptual search process of an RTE system that is based on textual rewriting. In this example, the text is "Google bought Motorola.", and the system tries to establish the hypothesis "Motorola is acquired by Google." as an entailment. The example system gets a dependency parse tree of the text and starts the rewriting process. In each iteration, it generates possible entailed sentences by querying knowledge bases. In the example, lexical knowledge is used in the first rewriting (buy entails acquire), and syntactic knowledge (change to passive voice) is used in the second derivation. The process will generate many derived candidates per iteration. The algorithm must employ a good search strategy to find the best rewriting path from text (T) to hypothesis (H).

[Fig. 2: Entailment as a search on possible rewritings – successive derived parse trees lead from "Google bought Motorola" to "Motorola is acquired by Google".]

In this example, there are three major component types: one is the knowledge component type that supports knowledge look-up, another is the generation of derived parse trees, and finally the decision algorithm itself drives the search process and makes the entailment decision. Expressing the behavior of such components in terms of annotations on the artifact might be possible, but would be very hard and counter-intuitive.

Following this line of reasoning, we decided that the EC components are better thought of as Java modules whose common behavior is defined by a set of Java interfaces, data types, and contracts, and we have defined them accordingly in the EXCITEMENT open platform specification.1 More specifically, we have defined a typology of components. They include a type for the top-level EDA as well as (currently) five major component types: (1) a feature extractor (gets a T-H pair CAS, returns a set of features for the T-H pair); (2) a semantic distance calculator (gets a T-H pair CAS, returns a semantic similarity); (3) a lexical resource type (lexical relation database); (4) a syntactic resource type (phrasal relation database); (5) an annotation component (dynamic enrichment of entailment problems).

Although UIMA components are not suitable for conceptualizing inference components, we decided to keep the CAS as the data container even in the EC components as far as possible, to take advantage of the CAS objects created in the LAP. Thus, various components (including EDAs) get the CAS (as a JCas) as an argument to their methods.
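Purely as an illustration of this design, the following sketch shows what such Java contracts could look like. The names and signatures are invented for this example; the actual EOP interfaces are defined in the platform specification and differ in naming and detail.

import org.apache.uima.jcas.JCas;

// Illustrative contracts only, not the actual EOP specification.
interface EntailmentDecisionAlgorithm {
    // Decide entailment for the T-H pair held in the given CAS.
    Decision process(JCas entailmentProblem);
}

interface SemanticDistanceCalculator {
    // Semantic distance between the text and hypothesis views.
    double distance(JCas entailmentProblem);
}

final class Decision {
    enum Label { ENTAILMENT, NONENTAILMENT, UNKNOWN }
    final Label label;
    final double confidence;
    Decision(Label label, double confidence) {
        this.label = label;
        this.confidence = confidence;
    }
}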
Also note that the LAP and EC boxes are independent: as long as the CAS holds correct data, the EC components do not care which pipeline has generated the data.

1 Specification and architecture for EXCITEMENT open platform, http://excitement-project.eu/index.php/results

3 Details on the UIMA usage in EXCITEMENT

3.1 CAS for Entailment Problems

The input to any RTE system is a set of entailment problems, typically Text-Hypothesis pairs, each of which is represented in one CAS. Figure 3 shows a pictorial example of the CAS data structure for the example pair (T, H1) from Section 1. It contains the two text fragments (in two views) and their annotations (here, POS tags and dependencies), as well as global data such as generic metadata (e.g., language) and entailment-specific metadata (e.g., the gold-standard answer).

[Fig. 3: CAS representation of a Text-Hypothesis pair – one CAS holds entailment metadata (language, channel, docId, collectionId, ...), the entailment pair (pairId, goldAnswer, text, hypothesis), and a Text View and a Hypothesis View, each with its subject of analysis and its POS, token and dependency annotations.]

On the level of the CAS representation, we had to address two points: one is the representation of entailment problems in terms of CASes, the other is the type definitions.

Regarding the first point, the general practice in text analysis use cases is to have one UIMA CAS correspond to one document. This suggests representing both text and hypothesis (including, if available, their document context) as separate CASes. However, we decided to store complete entailment problems as individual CASes, where each CAS has two named views (one view for the text, the other for the hypothesis); a sketch of this design is given below. This approach has two major advantages: first, it enables us to represent cross-annotations between texts and hypotheses, notably alignments, which can be added by annotators. Second, it enables us to define a straightforward extension from "simple" entailment problems (one text and one hypothesis) to "complex" entailment problems (one text and multiple hypotheses or vice versa, as in the "RTE search" task [2]).

Regarding the second point, we adopted the DKPro type system [5], which was designed with language independence in mind. It provides types for morphological information, POS tags, dependency structure, named entities, co-reference, etc. We extended the DKPro type system with the types necessary to define textual entailment-specific annotations. This involved types for marking stretches of text as texts and hypotheses, respectively, as well as for storing correspondence information between texts and hypotheses, pair IDs, gold labels, and some metadata. We also added types for linguistic annotation that are not exclusively entailment-specific but were not yet covered by DKPro. These include annotations for polarity, reference of temporal expressions, word and phrase alignments, and semantic role labels. Details about the newly defined types can be found in the platform specification, and the type definition files are part of the platform code distribution.
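As a hedged sketch of the two-view design (the view names and the uimaFIT factory are illustrative; the EOP specification fixes the actual names):

import org.apache.uima.jcas.JCas;
import org.uimafit.factory.JCasFactory;

public class EntailmentPairCas {
    public static JCas createPair() throws Exception {
        JCas pair = JCasFactory.createJCas(); // uimaFIT convenience factory

        // One named view per side of the entailment problem.
        JCas text = pair.createView("TextView"); // assumed view name
        text.setDocumentText("That was the 1908 Tunguska event in Siberia, "
                + "known as the Tunguska meteorite fall.");
        text.setDocumentLanguage("EN");

        JCas hypothesis = pair.createView("HypothesisView"); // assumed view name
        hypothesis.setDocumentText("A shooting star fell in Russia in 1908.");
        hypothesis.setDocumentLanguage("EN");

        return pair;
    }
}

Cross-view annotations such as alignments can then reference feature structures from both views within the same CAS.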
3.2 Wrapping the Linguistic Annotation Pipeline

One decision that may be surprising at first glance is that we defined our own top-level Java interface for users of the LAP that hides UIMA's own run-time access methods. This interface dictates the common capabilities that all LAP pipelines should provide. The reason for this decision is twofold and pragmatic in nature: making transitioning to and using the EOP as easy as possible for developers.

The first aspect is the learning curve. We would like to avoid the need for Entailment Core developers to deeply understand UIMA AEs and Aggregate Analysis Engines (AAEs). We feel that a deep understanding of these points requires substantial effort but is not really to the point, since many EC developers will only want to use pre-existing LAPs. By making the UIMA aspect of the LAP transparent to the Entailment Core, EC developers do not need to know how the LAP works internally beyond knowledge of the (fairly minimal) LAP interface. Of course, the EC developers still need to understand the UIMA CAS very well.

The second aspect is migration cost. If the LAP pipelines were nothing but UIMA AEs, all analysis pipelines of existing RTE systems would have to be deeply refactored, which comes at a considerable cost. Our approach allows such analysis pipelines to be kept largely intact and merely surrounded by a wrapper that provides the required functionality and converts their output into valid UIMA CASes according to the EOP's specification. Nevertheless, there are good reasons to encourage the use of AE-based LAPs: AE-based components are generally much more flexible, and they are very easy to assemble into AAE pipelines. Therefore, we encourage AE-based LAP development by providing ready-to-use code that implements our LAP interface, taking a list of AEs as input. Thus, if the individual components are already present as AEs, the implementation effort to assemble them into a LAP is near zero. In this sense, we see our LAP interface as a thin wrapper above UIMA with the purpose of enabling peaceful co-existence between UIMA and non-UIMA pipelines. In the long run, we also hope to contribute some new AEs back to the UIMA community.

4 Some Open Issues

In this section, we discuss two open questions that we are facing in future work.

CAS in non-UIMA environments. There is a considerable number of best-practice strategies for handling CAS objects (reset the data structure instead of creating a new one; use a CAS pool instead of generating multiple CASes, etc.). When a CAS is used in a UIMA context (i.e., in the LAP), it is not hard to guide developers to follow these rules. However, with the CAS being used as a general data container throughout the EOP, developers also often encounter CAS (JCas) objects outside specific UIMA contexts, and we have found it harder to guide the developers towards "proper usage". For example, one part of the EXCITEMENT project is concerned with the construction of Entailment Graphs [3], structured knowledge repositories whose vertices are statements and whose edges indicate entailment relations. Since the standard data structure for annotations is the JCas, the graph developers tend to add one JCas for each node. This is unproblematic for small graphs, but becomes an issue as the graph grows: a CAS is a very large data structure, and its creation and deletion take time. We are still trying to establish best practices for using CASes in non-UIMA EOP environments.

Annotation styles: hidden dependencies. One of the EOP design requirements was the clear separation of LAP and EC. This has been fairly well achieved, at least on a technical level.
However, it is clear that there are still implicit dependencies between linguistic analysis tools and entailment core components. Consider the case of syntactic knowledge components such as DIRT-style paraphrase rules in the Entailment Core. Such components store entailment rules as pairs of partial dependency trees which have typically been extracted from large corpora. If the corpus used for rule induction was parsed with a different parser than the current entailment problem, then matching the sentence against the rule base will result in missing rules, due to differences in the analysis style. Note that this implicit dependency does not break the UIMA pipeline, since it does not involve the use of a novel type system, but rather differences in the interpretation of shared types. We are currently investigating what types of "style differences" can be observed in actual annotators.

5 Conclusion

In this paper, we have provided an overview of the EXCITEMENT open platform architecture and its adoption of UIMA. We have adopted and adapted the UIMA CAS and the DKPro type system as a flexible, language-independent data container for Textual Entailment problems. UIMA also provides the backbone for the platform's LAP components. There are several open issues that are still to be resolved, but the EXCITEMENT project has already profited substantially from the use of the abstractions that UIMA offers, as well as from the integration of existing components from the UIMA community.

The first version of the EXCITEMENT open platform has been finished2, with three fully running RTE systems integrated with all core components and annotation pipelines. The platform currently supports three languages (German, Italian and English), and it also ships with various tools and resources for TE researchers. We believe that the platform will become a valuable tool for researchers and users of Textual Entailment.

Acknowledgment. This work was supported by the EC-funded project EXCITEMENT (FP7 ICT-287923).

References

1. Androutsopoulos, I., Malakasiotis, P.: A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research 38 (2010) 135–187
2. Bentivogli, L., Magnini, B., Dagan, I., Trang Dang, H., Giampiccolo, D.: The fifth PASCAL recognising textual entailment challenge. In: Proceedings of the TAC 2009 Workshop on Textual Entailment, Gaithersburg, MD (2009)
3. Berant, J., Dagan, I., Goldberger, J.: Learning entailment relations by global graph structure optimization. Computational Linguistics 38(1) (2012) 73–111
4. Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognising textual entailment challenge. In: Proceedings of the First PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, UK (2005)
5. Gurevych, I., Mühlhäuser, M., Müller, C., Steimle, J., Weimer, M., Zesch, T.: Darmstadt knowledge processing repository based on UIMA. In: Proceedings of the First Workshop on Unstructured Information Management Architecture at the Conference of the Society for Computational Linguistics and Language Technology, Tübingen, Germany (2007)
6. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic (2007) 177–180
7. Sammons, M., Vydiswaran, V., Roth, D.: Recognizing textual entailment. In Bikel, D.M., Zitouni, I., eds.: Multilingual Natural Language Applications: From Theory to Practice. Prentice Hall (2012)

2 The platform has been released under an open source license, and all code and resources can be freely accessed via the project repository: http://hltfbk.github.io/Excitement-Open-Platform/project-info.html

Bluima: a UIMA-based NLP Toolkit for Neuroscience

Renaud Richardet, Jean-Cédric Chappelier, Martin Telefont
Blue Brain Project, EPFL, 1015 Lausanne, Switzerland
renaud.richardet@epfl.ch

Abstract. This paper describes Bluima, a natural language processing (NLP) pipeline focusing on the extraction of neuroscientific content and based on the UIMA framework. Bluima builds upon models from biomedical NLP (BioNLP) like specialized tokenizers and lemmatizers. It adds further models and tools specific to neuroscience (e.g. named entity recognizers for neuron or brain region mentions) and provides collection readers for neuroscientific corpora. Two novel UIMA components are proposed: the first allows configuring and instantiating UIMA pipelines using a simple scripting language, enabling non-UIMA experts to design and run UIMA pipelines. The second component is a common analysis structure (CAS) store based on MongoDB, used to perform incremental annotation of large document corpora.

Keywords: UIMA, natural language processing, NLP, neuroinformatics, NoSQL

1 Introduction

Bluima started as an effort to develop a high-performance natural language processing (NLP) toolkit for neuroscience. The goal was to extract structured knowledge from the biomedical literature (PubMed1), in order to help neuroscientists gather data to specify parameters for their models. In particular, the focus was set on extracting entities that are specific to neuroscience (like brain regions and neurons) and that are not yet covered by existing text processing systems.

1 http://www.ncbi.nlm.nih.gov/pubmed

After careful evaluation of different NLP frameworks, the UIMA software system was selected for its open standards, its performance and stability, and its usage in several other biomedical NLP (BioNLP) projects, e.g. JulieLab [11], ClearTK [22], DKPro [6], cTAKES [28], ccp-nlp, U-Compare [15], SciKnowMine [26], Argo [25]. Initial development went fast, and several existing BioNLP models and UIMA components could rapidly be reused or integrated into UIMA without the need to modify its core system, as presented in Section 2.1.

Once the initial components were in place, an experimentation phase started where different pipelines were created, each with different components and parameters. Pipeline definition in verbose XML was greatly improved by the use of UIMAFit [21] (to define pipelines in compact Java code) but ended up being problematic, as it requires some Java knowledge and recompilation for each component or parameter change. To allow for more agile prototyping, especially by non-specialist end users, a pipeline scripting language was created; it is described in Section 2.2. Another concern was the incremental annotation of large document corpora, for example when running an initial pre-processing pipeline on several million documents and then annotating them again at a later time. The initial strategy was to store the documents on disk, and to overwrite them every time they were incrementally annotated.
Eventually, a CAS store module was developed to provide a stable and scalable strategy for incremental annotation, as described in Section 2.3. Finally, Section 3 presents two case studies illustrating the scripting language and evaluating the performance of the CAS store against existing serialization formats.

2 Bluima Components

Bluima contains several UIMA modules to read neuroscientific corpora, perform preprocessing, create simple configuration files to run pipelines, and persist documents on disk.

2.1 UIMA Modules

Bluima's type system builds upon the type system from JulieLab [10], which was chosen for its strong biomedical orientation and its clean architecture. Bluima's type system adds neuroscientific annotations, like CellType, BrainRegion, etc.

Bluima includes several collection readers for selected neuroscience corpora, like PubMed XML dumps, PubMed Central NXML files, the BioNLP 2011 GENIA Event Extraction corpus [24], the BioCreative II annotated corpus [16], the GENIA annotated corpus [14], and the WhiteText brain regions corpus [8]. A PDF reader was developed to provide robust and precise text extraction from scientific articles in PDF format. The PDF reader performs content correction and cleanup, like dehyphenation, removal of ligatures, glyph mapping correction, table detection, and removal of non-informative footers and headers.

For pre-processing, the OpenNLP wrappers developed by JulieLab for sentence segmentation, word tokenization and part-of-speech tagging [31] were used and updated to UIMAFit. Lemmatization is performed by the domain-specific tool BioLemmatizer [19]. Abbreviation recognition (the task of identifying abbreviations in text) is performed by BIOADI, a supervised machine learning model trained on the BIOADI corpus [17]. Bluima uses UIMA's ConceptMapper [29] to build lexicon-based NERs from several neuroscientific lexica and ontologies (Table 1). These lexica and ontologies were either developed in-house or imported from existing sources. Bluima also wraps several machine learning-based NERs, like OSCAR4 [13] (chemicals, reactions), Linnaeus [9] (species), BANNER [18] (genes and proteins), and Gimli [5] (proteins).

Name           Source             Scope                                  # forms
Age            BlueBrain          age of organism, developmental stage   138
Sex            BlueBrain          sex (male, female) and variants        10
Method         BlueBrain          experimental methods in neuroscience   43
Organism       BlueBrain          organisms used in neuroscience         121
Cell           BlueBrain          cell, sub-cell and region              862
Ion channel    Channelpedia [27]  ion channels                           868
Uniprot        Uniprot [1]        genes and proteins                     143,757
Biolexicon     Biolexicon [30]    unified lexicon of biomedical terms    2.2 Mio
Verbs          Biolexicon         verbs extracted from the Biolexicon    5,038
Cell ontology  OBO [2]            cell types (prokaryotic to mammalian)  3,564
Disease ont.   OBO [23]           human disease ontology                 24,613
Protein ont.   OBO [20]           protein-related entities               29,198
Brain region   Neuronames [3]     hierarchy of brain regions             8,211
Wordnet        Wordnet [7]        general English                        155,287
NIFSTD         NIF [12,4]         neuroscience ontology                  16,896

Table 1. Lexica and ontologies used for lexical matching.

2.2 Pipeline Scripting Language

Tool               Advantages      Disadvantages
UIMA GUI           GUI             minimalistic UI, cannot reuse pipelines
XML descriptor     typed (schema)  very verbose
raw UIMA Java API  typed           verbose, requires writing and compiling Java
UIMAFit            compact, typed  requires writing and compiling Java code

Table 2. Different approaches to writing and running UIMA pipelines.

There are several approaches2 to writing and running UIMA pipelines (see Table 2).
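To make the comparison concrete, a minimal uimaFIT-style pipeline looks roughly as follows; the component classes and parameter names are placeholders standing for any CollectionReader and AnalysisEngine implementations, not actual Bluima classes.

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.collection.CollectionReaderDescription;
import org.uimafit.factory.AnalysisEngineFactory;
import org.uimafit.factory.CollectionReaderFactory;
import org.uimafit.pipeline.SimplePipeline;

public class PipelineExample {
    public static void main(String[] args) throws Exception {
        // MyReader and MySentenceAnnotator are placeholder component classes.
        CollectionReaderDescription reader = CollectionReaderFactory
                .createDescription(MyReader.class, "inputFile", args[0]);
        AnalysisEngineDescription sentences = AnalysisEngineFactory
                .createPrimitiveDescription(MySentenceAnnotator.class,
                        "modelFile", "models/sentence.bin.gz");
        SimplePipeline.runPipeline(reader, sentences);
    }
}

Each component or parameter change still requires recompiling such Java code, which is what the scripting language described next avoids.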
All Bluima components were initially written in Java with the UIMAFit library, which allows for compact code. To improve the design of and experimentation with UIMA pipelines, and to enable researchers without Java or UIMA knowledge to easily design and run such pipelines, a minimalistic scripting (domain-specific) language was developed, allowing UIMA pipelines to be configured with text files in a human-readable format (Table 3). A pipeline script begins with the definition of a collection reader (starting with cr:), followed by several annotation engines (starting with ae:)3. Parameter specification starts with a space, followed by the parameter name, a colon and its value. The scripting language also supports embedding of inline Python and Java code, reuse of a portion of a pipeline with include statements, and variable substitution similar to shell scripts. Extensive documentation (in particular snippets of scripts) is automatically generated for all components, using the JavaDoc and the UIMAFit annotations.

2 Other interesting solutions exist (e.g. IBM LanguageWare, Argo), but are not open source.
3 If no package namespace is specified, Bluima loads Reader and Annotator classes from the default namespace.

2.3 CAS Store

A CAS store was developed to persist annotated documents, resume their processing and add new annotations to them. This CAS store was motivated by the common use case of repetitively and incrementally processing the same documents with different UIMA pipelines, where some pipeline steps are duplicated among the runs. For example, when performing resource-intensive operations (like extracting the text from full-text PDF articles, or performing syntactic parsing), one might want to perform these preliminary operations once, store the results, and subsequently perform different experiments with different UIMA modules and parameters. The CAS store thus makes it possible to perform the preprocessing only once, to persist the annotated documents, and to run the various experiments in parallel.

MongoDB4 was selected as the datastore backend. MongoDB is a scalable, high-performance, open-source, schema-free (NoSQL), document-oriented database. No schema is required on the database side, since the UIMA type system acts as a schema and data is validated on-the-fly by the module. Every CAS is stored as a MongoDB document, along with its annotations. UIMA annotations and their features are explicitly mapped to MongoDB fields, using a simple and declarative language; for example, a Protein annotation is mapped to a prot field in MongoDB (see the sketch below). The mappings are used when persisting to and loading from the database. As of this writing, the mappings are declared in Java source files; in future versions, we plan to store them directly in MongoDB to improve flexibility. Persistence of complex type systems has not been implemented yet, but could easily be added in the future.

4 http://www.mongodb.org/

Currently, the following UIMA components are available for the CAS store:

– MongoCollectionReader reads CASes from a MongoDB collection. Optionally, a (filter) query can be specified;
– RegexMongoCollectionReader is similar to MongoCollectionReader but allows specifying a query with a regular expression on a specific field;
– MongoWriter persists new UIMA CASes into MongoDB documents;
– MongoUpdateWriter persists new annotations into an existing document;
– MongoCollectionRemover removes selected annotations in a MongoDB collection.
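As an illustration of the mapping idea, a persisted abstract with one Protein annotation might look as follows when written with the MongoDB Java driver. The field names and document layout are assumptions for the example ("prot" follows the Protein mapping mentioned above), not Bluima's actual schema.

import java.util.Arrays;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class MappingExample {
    public static void store(DBCollection collection) {
        // One MongoDB document per CAS: the text plus its mapped annotations.
        BasicDBObject doc = new BasicDBObject()
                .append("text", "BDNF is expressed in the hippocampus.")
                .append("prot", Arrays.asList(
                        new BasicDBObject("begin", 0).append("end", 4)));
        collection.insert(doc);
    }
}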
With the above components, it is possible, within a single pipeline, to read an existing collection of annotated documents, perform some further processing, add more annotations, and store these annotations back into the same MongoDB documents.

3 Case Studies and Evaluation

A first experiment illustrates the scripting language on a large dataset of full-text biomedical articles. A second, simulated experiment evaluates the performance of the MongoDB CAS store against existing serialization formats.

3.1 Scripting and Scale-Out

# collection reader configured with a list of files (provided as external params)
cr: FromFilelistReader
 inputFile: $1
# processes the content of the PDFs
ae: ch.epfl.bbp.uima.pdf.cr.PdfCollectionAnnotator
# tokenization and lemmatization
ae: SentenceAnnotator
 modelFile: $ROOT/modules/julielab_opennlp/models/sentence/PennBio.bin.gz
ae: TokenAnnotator
 modelFile: $ROOT/modules/julielab_opennlp/models/token/Genia.bin.gz
ae: BlueBioLemmatizer
# lexical NERs, instantiated with some helper java code
ae_java: ch.epfl.bbp.uima.LexicaHelper.getConceptMapper("/bbp_onto/brainregion")
ae_java: ch.epfl.bbp.uima.LexicaHelper.getConceptMapper("/bams/bams")
# removes duplicate annotations and extracts collocated brainregion annotations
ae: DeduplicatorAnnotator
 annotationClass: ch.epfl.bbp.uima.types.BrainRegionDictTerm
ae: ExtractBrainregionsCoocurrences
 outputDirectory: $2

Table 3. Pipeline script for the extraction of brain region mention co-occurrences from PDF documents.

Bluima was used to extract brain region mention co-occurrences from scientific articles in PDF. The pipeline script (Table 3) was created and tested on a development laptop. Scale-out was performed on a 12-node (144-core) cluster managed by SLURM (Simple Linux Utility for Resource Management). The 383,795 PDFs were partitioned into 767 jobs. Each job was instantiated with the same pipeline script, using different input and output parameters. The processing completed in 809 minutes (≈ 8 PDF/s).

3.2 MongoDB CAS Store

The MongoDB CAS store (MCS) has been evaluated against 3 other available serialization formats (XCAS, XMI and ZIPXMI). For each, 3 settings were evaluated: writes (CASes are persisted to disk), reads (CASes are loaded from their persisted states), and incremental (CASes are first read from their persisted states, then further processed, and finally persisted again to disk).

          Write [s]  Write size [MB]  Read [s]  Incremental [s]
XCAS      4014       41718            3407      31.7
XMI       4479       32236            3090      42.2
ZIPXMI    5033       4677             2790      43.6
MongoDB   3281       16724            730       22.5

Fig. 1. Performance evaluation of the MongoDB CAS store against 3 other serialization formats.

Writes and reads were performed on a random sample of 500,000 PubMed abstracts, annotated with all available Bluima NERs. Incremental annotation was performed on a random sample of 5,000 PubMed abstracts, incrementally annotated with the Stopwords annotator. Processing time and disk space were measured on a commodity laptop (4 cores, 8GB RAM). In terms of speed, the MCS significantly outperforms the other formats, especially for reads (Figure 1). The MCS disk size is significantly smaller than with the XCAS and XMI formats, but almost 4 times larger than with the compressed ZIPXMI format. Incremental annotation is significantly faster with MongoDB, and does not require duplicating or overwriting files, as with the other serialization formats.
The MCS could be scaled up in a cluster setup, or by using solid state drives (SSDs). Writes could probably be improved by turning MongoDB's "safe mode" option off. Furthermore, by adding indexes, the MCS can act as a searchable annotation database.

4 Conclusions and Future Work

In the process of developing Bluima, a toolkit for neuroscientific NLP, we integrated and wrapped several specialized resources to process neuroscientific articles. We also created two UIMA modules (a scripting language and a CAS store). These additions proved to be very effective in practice and allowed us to leverage UIMA, an enterprise-grade framework, while at the same time enabling agile development and deployment of NLP pipelines. In the future, we will open-source Bluima and add more models for NER and relationship extraction. We also plan to ease the deployment of Bluima (and its scripting language) on a Hadoop cluster.

References

1. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M.: The universal protein resource (UniProt). Nucleic Acids Research 33(suppl 1), D154–D159 (2005)
2. Bard, J., Rhee, S.Y., Ashburner, M.: An ontology for cell types. Genome Biology 6(2) (2005)
3. Bowden, D., Dubach, M.: NeuroNames 2002. Neuroinformatics 1(1), 43–59 (2003)
4. Bug, W.J., Ascoli, G.A., Grethe, J.S., Gupta, A., Fennema-Notestine, C., Laird, A.R., Larson, S.D., Rubin, D., Shepherd, G.M., Turner, J.A.: The NIFSTD and BIRNLex vocabularies: building comprehensive ontologies for neuroscience. Neuroinformatics 6(3), 175–194 (2008)
5. Campos, D., Matos, S., Oliveira, J.L.: Gimli: open source and high-performance biomedical name recognition. BMC Bioinformatics 14(1), 54 (Feb 2013)
6. De Castilho, R.E., Gurevych, I.: DKPro-UGD: a flexible data-cleansing approach to processing user-generated discourse. In: Online proceedings of the First French-speaking meeting around the framework Apache UIMA, LINA CNRS UMR (2009)
7. Fellbaum, C.: WordNet. Theory and Applications of Ontology: Computer Applications p. 231–243 (2010)
8. French, L., Lane, S., Xu, L., Pavlidis, P.: Automated recognition of brain region mentions in neuroscience literature. Front Neuroinformatics 3 (Sep 2009)
9. Gerner, M., Nenadic, G., Bergman, C.: Linnaeus: A species name identification system for biomedical literature. BMC Bioinformatics 11(1), 85 (2010)
10. Hahn, U., Buyko, E., Tomanek, K., Piao, S., McNaught, J., Tsuruoka, Y., Ananiadou, S.: An annotation type system for a data-driven NLP pipeline (2007)
11. Hahn, U., Buyko, E., Landefeld, R., Mühlhausen, M., Poprat, M., Tomanek, K., Wermter, J.: An overview of JCoRe, the JULIE lab UIMA component repository. In: Proceedings of the LREC. vol. 8, p. 1–7 (2008)
12. Imam, F.T., Larson, S.D., Grethe, J.S., Gupta, A., Bandrowski, A., Martone, M.E.: NIFSTD and NeuroLex: a comprehensive neuroscience ontology development based on multiple biomedical ontologies and community involvement (2011)
13. Jessop, D., et al.: OSCAR4: a flexible architecture for chemical text-mining. Journal of Cheminformatics 3(1), 41 (Oct 2011)
14. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics 19, i180–i182 (Jul 2003)
15. Kontonatsios, G., Korkontzelos, I., Kolluru, B., Thompson, P., Ananiadou, S.: Deploying and sharing U-Compare workflows as web services. J. Biomedical Semantics 4, 7 (2013)
16. Krallinger, M., Morgan, A., Smith, L., Leitner, F., Tanabe, L., Wilbur, J., Hirschman, L., Valencia, A.: Evaluation of text-mining systems for biology: overview of the second BioCreative community challenge. Genome Biology 9(Suppl 2), S1 (2008)
17. Kuo, C.J., et al.: BioAdi: a machine learning approach to identifying abbreviations and definitions in biological literature. BMC Bioinformatics 10(Suppl 15), S7 (Dec 2009)
18. Leaman, R., Gonzalez, G., et al.: BANNER: an executable survey of advances in biomedical named entity recognition. In: Pacific Symposium on Biocomputing. vol. 13, p. 652–663 (2008)
19. Liu, H., et al.: BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics 3(1), 3 (Apr 2012)
20. Natale, D.A., Arighi, C.N., Barker, W.C., Blake, J.A., Bult, C.J., Caudy, M., Drabkin, H.J., D'Eustachio, P., Evsikov, A.V., Huang, H., Nchoutmboube, J., Roberts, N.V., Smith, B., Zhang, J., Wu, C.H.: The protein ontology: a structured representation of protein forms and complexes. Nucleic Acids Res. 39(Database issue), D539–545 (Jan 2011)
21. Ogren, P.V., Bethard, S.J.: Building test suites for UIMA components. NAACL HLT 2009 p. 1 (2009)
22. Ogren, P.V., Wetzler, P.G., Bethard, S.J.: ClearTK: a UIMA toolkit for statistical natural language processing. Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP p. 32 (2008)
23. Osborne, J., Flatow, J., Holko, M., Lin, S.M., Kibbe, W.A., Zhu, L.J., Danila, M.I., Feng, G., Chisholm, R.L.: Annotating the human genome with disease ontology. BMC Genomics 10(Suppl 1), S6 (Jul 2009)
24. Pyysalo, S., Ohta, T., Rak, R., Sullivan, D., Mao, C., Wang, C., Sobral, B., Tsujii, J., Ananiadou, S.: Overview of the ID, EPI and REL tasks of BioNLP shared task 2011. BMC Bioinformatics 13(Suppl 11), S2 (Jun 2012)
25. Rak, R., Rowley, A., Black, W., Ananiadou, S.: Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: the journal of biological databases and curation 2012 (2012)
26. Ramakrishnan, C., Baumgartner Jr, W.A., Blake, J.A., Burns, G.A., Cohen, K.B., Drabkin, H., Eppig, J., Hovy, E., Hsu, C.N., Hunter, L.E.: Building the scientific knowledge mine (SciKnowMine): a community-driven framework for text mining tools in direct service to biocuration. Malta. Language Resources and Evaluation (2010)
27. Ranjan, R., Khazen, G., Gambazzi, L., Ramaswamy, S., Hill, S.L., Schürmann, F., Markram, H.: Channelpedia: an integrative and interactive database for ion channels. Frontiers in Neuroinformatics 5 (2011)
28. Savova, G.K., Masanz, J.J., Ogren, P.V., Zheng, J., Sohn, S., Kipper-Schuler, K.C., Chute, C.G.: Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17(5), 507–513 (2010)
29. Tanenblatt, M.A., Coden, A., Sominsky, I.L.: The ConceptMapper approach to named entity recognition. In: LREC (2010)
30. Thompson, P., et al.: The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 12(1), 397 (2011)
31. Tomanek, K., Wermter, J., Hahn, U.: A reappraisal of sentence and token splitting for life sciences documents. Studies in Health Technology and Informatics 129(Pt 1), 524–528 (2006)

Sentiment Analysis and Visualization using UIMA and Solr

Carlos Rodríguez-Penagos, David García Narbona, Guillem Massó Sanabre, Jens Grivolla, Joan Codina Filbá
Barcelona Media Innovation Centre

Abstract.
In this paper we present an overview of a UIMA-based system for Sentiment Analysis in hotel customer reviews. It extracts object-opinion/attribute-polarity triples using a variety of UIMA modules, some of which are adapted from freely available open source components and others developed fully in-house. A Solr-based graphical interface is used to explore and visualize the collection of reviews and the opinions expressed in them.

1 Introduction

With the continuing growth of Social Media such as Twitter, Facebook, and many others, both in terms of the volume of content produced daily by users and in terms of the impact it can have on reputation and decision making (buying, travelling, ...), there is a strong commercial need (and social interest) to efficiently analyze those vast amounts of mostly unstructured information and extract summarized knowledge, while also being able to explore and navigate the content.

We present here a prototype system for analyzing customer reviews of hotels, detecting what people talk about and what opinions they express. The literature agrees on two main approaches for classifying opinion expressions: using supervised learning methods and applying dictionary/rule-based knowledge (see [3] for an overview). The choice of content to be processed also determines what kind of technique yields better results: longer, more textured text accommodates deeper linguistic analyses including, for example, dependency parsing (see, for example, the use of Machine Learning informed with linguistic analyses in [5]), while shorter, noisy messages, such as those from Twitter microblogs, can be tackled with more superficial processing that is strengthened by massive training data and extensive lexical resources (as shown in previous work from some of the authors: [2,4]). Each of them on its own has been used in workable systems (e.g. [6]), and a principled combination of both can yield good results on noisy data, since generally one (dictionaries/rules) offers good precision while the other (ML) is able to discover unseen examples and thus enhances recall. In the case at hand, the processing at the level of individual reviews is done using UIMA with a variety of analysis engines using both stochastic and symbolic approaches; the summary of the results and the visualization and exploration interface are based on Solr.

2 Extraction of Opinionated Units

The prototype presented here focuses on the extraction of customer opinions from full-text unstructured reviews, provided by the users of a big customer review site. We identified as the object of interest for our analysis what we call "opinionated units". These OUs consist of:

– the object of the opinion, i.e. the thing that is being commented on, which we call the Target;
– the opinion expression, i.e. the words or sentence fragments that represent what is being said about the target, which we call the Cues;
– the polarity of the opinion, as it relates to the target.

Our system proceeds by first detecting possible opinion Targets and possible Cues in the review text. These Target and Cue candidates are then correlated to form Opinionated Units, using relevant paths of the syntactic dependency graph that link the two together. Finally, the polarity of the opinionated unit is established using an a priori polarity taken from the Cue (possibly dependent on the type of target) and taking into account quantifiers and negations that appear in the context of the opinionated unit; a toy sketch of this composition follows.
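The weights below are invented and serve only to illustrate the composition just described, not the system's actual scoring.

public class PolarityToy {
    // Toy illustration: start from the cue's a priori polarity, flip it
    // under negation, and scale it by quantifier strength.
    public static double finalPolarity(double cueAPriori,
                                       boolean negated,
                                       double quantifierWeight) {
        double polarity = negated ? -cueAPriori : cueAPriori; // "not good" flips "good"
        return polarity * quantifierWeight;                   // "very" > 1, "slightly" < 1
    }
}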
For the detection of the possible Targets and Cues that anchor the OUs, we relied on a JNET annotator that uses Conditional Random Fields over richly annotated vectors (POS, NER, polar words, NP chunks, etc.) and that was trained on a manually annotated corpus of similar customer reviews.1 We used this supervised approach since the hotel review domain is quite regular in the kinds of things and features people comment on, but we wanted to leave open the possibility of discovering items and concepts outside a closed list.

2.1 Recognizing Opinionated Units and their polarity

In order to detect candidate opinion-bearing linguistic structures we parsed the sentences with the DeSR dependency parser [1]. We looked for possible paths in the graph linking Cues to Targets, as shown in Figure 1, where the Target "la habitación" (the room) is qualified by "pequeña" (small) and the Target "el desayuno" (breakfast) is described as "with room for improvement". We identified both correct and wrong paths between Targets and Cues in annotated documents and analysed them. This allowed us to extract relevant patterns and to individualize opinionated units even if they were expressed in the same phrase. The most important patterns are structures of linking verbs and noun-adjective relations, but there are prepositional phrases, adverbial phrases and subject-verb relations as well. All the relevant paths can be represented by a limited number of regular expressions, which are used to correlate the Targets and Cues of the same OU. This approach maximized precision at the expense of recall, focusing further analyses only on semantically relevant fragments, as identified through Targets and Cues.

1 On evaluating this process, we allowed for partial overlap (e.g. "The room" and "room" counted as equally correct answers), and we obtained models with an F1 of 0.69 for Targets only, 0.54 for Cues, and a combined Target/Cue identification model with an F1 of 0.62; top precision was 0.84 for the Cue-only model and top coverage (recall) was 0.63 for the Target-only model.

Fig. 1. Opinionated Units in the dependency graph

We used different strategies and tools to detect the possible polarity of an Opinionated Unit, since each one has both advantages and weaknesses. A first strategy is to assign polarities to Cues and then expand this polarity to the Opinionated Unit, using an approach based on the word sequence (Conditional Random Fields). A second strategy uses Support Vector Machines on a "bag of features" that includes words, polar words and their polarity, negations, and quantifiers to build a feature vector used for training and classification. After the statistical models have been applied, we also used heuristics that combine those polarities with the polarities of key words (detected using dictionaries) in the context of the OUs, in order to assign a final polarity to the Opinionated Unit.

Fig. 2. Type representation of the Opinionated Unit visualized with the Annotation Viewer

Our UIMA type for Opinionated Units representing the Target-Polarity-Cue triplet has pointers to the corresponding Targets and Cues from the relevant dependency graph, as well as the span and ultimate polarity of the complete object covered by it, as shown in Figure 2 for the text "The location [Target] was very good [Cue]".
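As a toy illustration of how a final heuristic of this kind might combine the evidence, the sketch below takes the cue's prior polarity, the two model votes and a negation count; the consensus scheme and the negation flip are illustrative assumptions, not the system's actual rules.

public class PolarityHeuristic {
  // Combine an a-priori cue polarity with the CRF and SVM votes (each -1, 0 or +1),
  // then flip the result for an odd number of negations in the OU context.
  static int finalPolarity(int cuePrior, int crfVote, int svmVote, int negations) {
    int polarity = Integer.signum(cuePrior + crfVote + svmVote); // simple consensus
    return (negations % 2 == 1) ? -polarity : polarity;          // "no era buena" flips
  }

  public static void main(String[] args) {
    // Positive cue prior, one negation in context: the OU becomes negative.
    System.out.println(finalPolarity(+1, +1, 0, 1)); // prints -1
  }
}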
Fig. 3. System overview

3 Architecture and Implementation

This section describes all UIMA modules used in the prototype, as shown in Figure 3. Some of them are existing open source components, some are adaptations, and some are our own custom developments. We have been publishing our work on GitHub and will continue doing so as far as possible.2

UIMA Collection Tools This prototype is designed to work on a static document collection, previously loaded into a MySQL database (including the review text as well as associated metadata). UIMA Collection Tools3 is an ecosystem of tools that allows UIMA pipelines to store and retrieve data from database systems such as MySQL. Plain text documents can be retrieved from a database, XMI documents can be retrieved from and stored in a database either compressed or uncompressed, features can be extracted into a database table, and annotations within database-stored XMI blobs can be visualized the same way as the standard AnnotationViewer does for XMI files. The tools comprise the following components (a minimal reader skeleton is sketched below):

– DBCollectionReader is a UIMA collection reader which retrieves plain text documents stored in a MySQL database. Database connection parameters as well as the SQL query have to be specified in the component descriptor. It is derived from the FileSystemCollectionReader.
– SolrCollectionReader is equivalent to DBCollectionReader, but uses a Solr index as the document source.
– DBXMICollectionReader is a UIMA collection reader that retrieves XMI documents stored in a MySQL database. It can also read XMI documents compressed with ZLIB; this option can be set in the descriptor file.
– DBAnnotationsCASConsumer is a CAS consumer which stores values of the features specified in the component descriptor file in a MySQL database table. Each table row corresponds to the annotation defined as the splitting annotation, e.g. if the Sentence annotation has been defined as the splitting annotation, each table row will correspond to a Sentence, and this row will contain features of the Sentence annotation and/or features of annotations covered by the Sentence annotation.
– DBXMICASConsumer is a CAS consumer that persists XMI documents in a database. It can also store XMI documents compressed with ZLIB.
– DBAnnotationViewer is a modification of the Annotation Viewer that allows reading XMI files directly from a MySQL database without needing to extract them first.

2 See https://github.com/BarcelonaMedia-ViL/
3 The UIMA Collection Tools have been developed at Barcelona Media, some of them based on the example Collection Readers and CAS Consumers provided with the UIMA distribution. They are published under the Apache License at https://github.com/BarcelonaMedia-ViL/uima-collection-tools.
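For orientation, a stripped-down reader of this kind could look as follows. This is only a sketch of the pattern, not the published component: the table layout (documents(id, text)), the hard-coded connection settings and the eager loading of all rows are simplifying assumptions; the real DBCollectionReader takes these from its descriptor.

import java.io.IOException;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class MinimalDbReader extends CollectionReader_ImplBase {
  private final List<String> texts = new ArrayList<>();
  private int next = 0;

  @Override
  public void initialize() throws ResourceInitializationException {
    // A real component would stream rows instead of loading everything up front.
    try (Connection con = DriverManager.getConnection(
            "jdbc:mysql://localhost/reviews", "user", "password");
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT text FROM documents")) {
      while (rs.next()) texts.add(rs.getString(1));
    } catch (SQLException e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public boolean hasNext() { return next < texts.size(); }

  @Override
  public void getNext(CAS cas) throws IOException, CollectionException {
    cas.setDocumentText(texts.get(next++)); // one CAS per database row
  }

  @Override
  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(next, texts.size(), Progress.ENTITIES) };
  }

  @Override
  public void close() throws IOException { /* nothing to release */ }
}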
OpenNLP We use OpenNLP4 with the standard UIMA wrappers for our base pipeline, including the Sentence Detector, Tokenizer, and POS Tagger, using our own trained models for Spanish.

Lemmatizer We apply lemmatization using a large dictionary developed in-house. All candidate lemmas are first added to the CAS using ConceptMapper5, but a second custom component selects the right one using the POS tag.

JNET For ML-based detection of Targets and Cues we use JNET6 (the Julielab Named Entity Tagger), which is based on Conditional Random Fields (CRF). It detects token sequences that belong to certain classes, taking into account a variety of features associated with each token (such as the surface form, lemma, POS tag, surface features such as capitalization, etc.) as well as its context of preceding and following tokens. While originally intended for Named Entity Recognition, we trained JNET on our own manually annotated corpus. Compared to the original JNET as released by JulieLab, we introduced a series of changes, most importantly making it type system independent by taking all input and output types and features as parameters, and fixing some bugs that were triggered when using a larger number of token features. We expect to release our changes soon, but are still looking into the question of licensing, to comply with JNET's original license.

DeSR We developed a UIMA wrapper for the DeSR dependency parser7. The parser creates dependency annotations based on previously generated sentence, token and POS-tag annotations. It is available at https://github.com/BarcelonaMedia-ViL/desr-uima. The UIMA DeSR analysis engine is a UIMA C++ annotator, developed using the C++ SDK provided by UIMA. It translates between the format required by the DeSR parser shared library and the UIMA CAS format. The mapping between UIMA types and features and the features used internally by DeSR is configurable in the annotator descriptor.

4 http://opennlp.apache.org/
5 http://uima.apache.org/sandbox.html#concept.mapper.annotator
6 http://www.julielab.de/Resources/Software/NLP Tools.html
7 https://sites.google.com/site/desrparser/

DependencyTreeWalker This is a Pythonnator-based analysis engine wrapping the DependencyGraph Python module (both developed in-house). It allows us to work easily with the dependency graph generated by DeSR, in order to e.g. determine and validate the path between two given UIMA annotations.

Weka Wrapper We used the Mayo Weka/UIMA Integration (MAWUI8) as a basis for the machine learning tools. The version we use is adapted to newer versions of UIMA and made much more configurable. MAWUI generates a single vector for each document, which is used to classify it as a whole. In our case, a document can contain several Opinionated Units that need to be classified. For this reason the Weka Wrapper was adapted to deal with all the annotations of a given type inside a document (or with a whole collection when generating the training data).

8 http://informatics.mayo.edu/text/index.php?page=weka
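The following self-contained sketch shows the per-annotation variant of that idea: one "bag of features" vector per Opinionated Unit instead of one per document. The feature names and the tiny lexicons are illustrative assumptions, not MAWUI's actual feature set.

import java.util.*;

public class OuFeatures {
  // Build one feature vector for the text span covered by a single OU annotation.
  static Map<String, Double> featureVector(String ouText, Set<String> negations,
      Map<String, Integer> polarLexicon) {
    Map<String, Double> v = new HashMap<>();
    for (String tok : ouText.toLowerCase().split("\\s+")) {
      v.merge("word=" + tok, 1.0, Double::sum);             // plain bag of words
      Integer prior = polarLexicon.get(tok);                // polar word and polarity
      if (prior != null) v.merge("polar=" + Integer.signum(prior), 1.0, Double::sum);
      if (negations.contains(tok)) v.merge("negation", 1.0, Double::sum);
    }
    return v;
  }

  public static void main(String[] args) {
    System.out.println(featureVector("la habitación no era buena",
        Set.of("no"), Map.of("buena", 1, "pequeña", -1)));
  }
}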
4 Visualization

Beyond extracting and classifying the opinions, users need an interface that allows them to access and explore the data. They need to know which Targets and which of their features are being addressed by the opinions and what is being said about them, and this has to be shown in an aggregated way, with drill-down capabilities, so that the end user has a clear view of the contents of hundreds or thousands of opinions. UIMA does not provide tools to deal with collections of documents, so we use Solr, a Lucene-based indexing tool, to index the Opinionated Units. Through the use of Solr's faceting and pivot utilities we are able to graphically summarize thousands of opinions. Special charts have been constructed that allow us not only to represent the data but also to select subsets of opinions and to summarize and compare them. For example, we can compare the global users' opinions with the opinions about a single hotel or about the hotels in a specific area.

To index the data we need the linguistic information, but also the metadata associated with the opinion, which is located in databases and is not processed with UIMA. For this reason we import the data into Solr in two steps: in a first step we generate from UIMA a table with the data, which we then import into Solr together with the metadata.

4.1 Indexing Opinionated Units

To index the Opinionated Units we use the DBAnnotationsCASConsumer component. We generate a record for each OU, containing: the Target, the Cue, the text span, the polar words, their polarity, the polarity of the cue, and the polarity of the Opinionated Unit. Cues and Targets are grouped into single tokens by means of underscoring.

We use the DataImportHandler from Solr in order to import the data from the database. To do this, a query combines the opinionated unit information with the information related to the hotel or the user who wrote the opinion. Cues are indexed twice, once all merged and again in different fields depending on the opinion's polarity, making it easy to retrieve just the positive or negative opinion markers. We selected this option because it is a bit faster, more flexible and more reliable than the alternatives: when indexing directly from UIMA we had problems adding all the desired metadata, and if we call UIMA from Solr (or Lucene) then it is difficult to have a general framework that splits a single document into several Opinionated Units.

AJAX-Solr9 is a JavaScript library for creating user interfaces to Apache Solr. This library works with facets. Faceting is a capability of Solr that provides fast statistics of the most frequent terms in each field after performing a query. Since version 4.0 Solr also has pivots, which combine the facets from two or more different fields. We adapted AJAX-Solr to work with pivots and wrote a series of widgets to visualize them. Our own extensions to AJAX-Solr are also published on GitHub10.

By clicking the different facets that appear on the widgets, the user can build a query that restricts the set of opinions to summarize. These opinions are then summarized by showing the most frequent terms they contain, or the most differentiating ones (i.e. those terms that are frequent in the current subset but less frequent in the general one). Figure 4 shows the pivot result in text and force-diagram formats. It shows the relationship between Targets and positive and negative Cues. In the textual representation, the relationships are not shown directly but scaled to magnify the most discriminative ones.

Fig. 4. Visualization of Cue and Target correlations across the whole corpus

9 https://github.com/evolvingweb/ajax-solr
10 https://github.com/BarcelonaMedia-ViL/ajax-solr
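As an illustration of the pivot mechanism, the following SolrJ sketch (written against the SolrJ 4.x API) requests nested counts of polarity per target; the core name and the field names target and polarity are assumptions, not the prototype's actual schema.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.PivotField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PivotExample {
  public static void main(String[] args) throws SolrServerException {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/opinions");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                             // only the aggregated counts are needed
    q.setFacet(true);
    q.set("facet.pivot", "target,polarity");  // nested facet: per target, per polarity
    QueryResponse rsp = solr.query(q);
    for (PivotField target : rsp.getFacetPivot().get("target,polarity")) {
      System.out.println(target.getValue() + ": " + target.getCount());
      for (PivotField pol : target.getPivot()) // e.g. positive/negative counts
        System.out.println("  " + pol.getValue() + " -> " + pol.getCount());
    }
  }
}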
5 Conclusion

The combination of UIMA and Solr has allowed us to develop a very flexible platform that makes it easy to integrate and combine processing modules from a variety of sources and in a variety of programming languages, as well as to navigate and visualize the results easily and efficiently. In our evaluations with 700 OUs manually annotated by 3 independent reviewers, the annotators agreed that 88.5% of the OUs identified by the system were correct, while the assigned polarity was found to be correct in 70% of the cases on average. We found many useful UIMA components available as open source, and encountered few compatibility issues (other than adapting some components to be type system independent). Solr provides us with a very flexible platform to access large document collections, and in combination with UIMA it allows us to explore even complex hidden relationships within those collections. One of our main objectives was to make all modules configurable and reusable, since Sentiment Analysis in general requires tweaking to adapt to domain and genre, and this generalization often requires considerable effort. We found the different open source communities to be very receptive, and we try to participate by publishing our own contributions under permissive licenses that make them easy for others to adopt and use.

6 Thanks

This work has been partially funded by the Spanish Government project Holopedia (TIN2010-21128-C02-02) and the CENIT program project Social Media (CEN-20101037).

References

1. G. Attardi, F. Dell'Orletta, M. Simi, A. Chanev, and M. Ciaramita. Multilingual dependency parsing and domain adaptation using DeSR. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 1112–1118, 2007.
2. Jose M. Chenlo, Jordi Atserias, Carlos Rodriguez, and Roi Blanco. FBM-Yahoo! at RepLab 2012. In CLEF (Online Working Notes/Labs/Workshop), 2012.
3. Bing Liu. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167, 2012.
4. Carlos Rodríguez-Penagos, Jordi Atserias, Joan Codina-Filba, David García-Narbona, Jens Grivolla, Patrik Lambert, and Roser Saurí. FBM: combining lexicon-based ML and heuristics for social media polarities. In SemEval 2013, 2013.
5. Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3):399–433, 2009.
6. L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, and B. Liu. Combining lexicon-based and learning-based methods for Twitter sentiment analysis. HP Technical Report HPL-2011-89, 2011.

Extracting hierarchical data points and tables from scanned contracts

Jan Stadermann, Stephan Symons, and Ingo Thon
Recommind Inc., 650 California Street, San Francisco, CA 94108, United States
{jan.stadermann,stephan.symons,ingo.thon}@recommind.com
http://www.recommind.com

Abstract. We present a technique for developing systems to automatically extract information from scanned semi-structured contracts. Such contracts are based on a template, but have different layouts and client-specific changes. While the presented technique is applicable to all kinds of such contracts, we specifically focus on so-called ISDA credit support annexes. The data model for such documents consists of 150 individual entities, some of which are tables that can span multiple pages. The information extraction is based on the Apache UIMA framework. It consists of a collection of small and simple Analysis Components that extract increasingly complex information based on earlier extractions. This technique is applied to extract individual data points and tables. Experiments show an overall precision of 97% with a recall of 93% for individual/simple data points, and 89%/81% for table cells, measured against manually entered ground truth. Due to its modular nature our system can be easily extended and adapted to other collections of contracts as long as a data model can be formulated.
Keywords: OCR-robust information extraction, hierarchical taggers, table extraction

1 Introduction

Despite the existence of electronic document handling and content management systems, there is still a large amount of paper-based contracts. Even when scanned and OCRed, the interesting data contained in the document is not machine-readable, as there are no semantics attached to the text. Especially in the banking domain it is necessary to have the underlying information available, e.g., for risk assessment. Until now, the information has had to be extracted by human reviewers. The goal of the system presented here is to automatically obtain the relevant information from OTC (over-the-counter) contracts which are based on a template provided by the ISDA1. The data is given in the form of image-embedded PDF documents. Each contract contains around 150 data points organized in a complex hierarchical data model. A data point can be either a (possibly multi-valued) simple field or a table. The main challenges for such a system are:

1. The complex legal language used in the contracts.
2. Despite existing contract templates, the wording varies across customers.
3. The layout varies. Tables, especially, can be represented in various forms.
4. The scanning quality of the contracts is often poor, especially for old contracts or documents sent by fax. Still, the remaining information needs to be extracted correctly.

1 International Swaps and Derivatives Association, www.isda.org

Fig. 1. (a) Example simple-valued fields (base currency and eligible currency). Note that for the eligible currency one or more currencies can be specified. (b) Example table (collateral eligibility).

Figure 1 shows examples of two simple data points (a) and a table (b).

In general, on the one hand, there are many sophisticated entity extraction systems that find flat entities only ("named entity extraction") [9]. These systems sometimes use hierarchical information, like tokens, part-of-speech tags and sentences, but only on a linguistic level, without collecting and combining this information. These approaches work well on well-defined and general entities such as persons or locations. However, they are difficult to adapt to a new domain, since a new classifier needs to be created, which requires huge amounts of labeled training data that is expensive to produce. On the other hand, there are systems that use a deep hierarchical structure, e.g. represented using ontologies, but still do the classification in one single, flat step [1]. This approach is not as flexible and extensible as the presented one, since in general it requires re-training or re-building the classifier if layers within the hierarchy are changed. An early solution for dealing with scanned forms was presented by Taylor et al., who used a model-based approach for data extraction from tax forms [12]. Semi-structured texts have been analyzed using rule-based approaches [10] or discriminative context free grammars [13]. Closest to our solution is a system described by Surdeanu et al. [11]. They employ two layers of extraction using Conditional Random Fields [5], and deal with OCR data. For table extraction, heuristic methods [8] have been proposed as well as Conditional Random Fields [7]. In contrast, our system uses a theoretically unlimited number of layers with separate classifiers for each piece of information, including tables, on each level.
Instead of processing the whole text at once, our classifiers just collect the information they require and decide only on that data. Therefore, they allow for better performance and extensibility, as additional data does not affect the existing classifiers. Our work follows strategies commonly used in spoken dialogue systems [4] and uses a set of small classifiers, which is inspired by the boosting idea [6]. In addition, we use automatically extracted segmentation information and cross-checks between our classifiers to increase the precision of the extracted data. From a UI standpoint there is a similar application called GATE [2], which extracts entities based on given rule sets. This application provides a hierarchical organization of entities, and its architecture seems to be very similar to the UIMA framework. However, GATE has no special provisions to deal with noise due to the OCR step, and it only allows specifying simple extraction rules. Furthermore, the entity extraction itself does not work hierarchically; only the result can be organized in a hierarchical way.

2 Information extraction

An overview of our system's architecture is shown in Figure 2. Prior to information extraction, the OmniPage2 OCR engine is used to convert the image to readable text. However, many character-level errors and layout distortions remain, which need to be dealt with in the following processing steps. The overall strategy is based on the idea that small pieces of relevant text can be extracted quite accurately even in the presence of OCR errors. On top of these pieces we build several layers of higher-level extractors – here called "experts" – that combine these small pieces to decide on a final data point. The extraction of tables works in a similar fashion, by first trying to extract small pieces that form table cells. Then stretches of cells are collected, trying to deduce a layout from the order and type of the pieces. Finally, an optimal result table is selected (see Section 2.2).

Our solution is based on the UIMA framework [3]. Each type of expert is implemented as a configurable annotation engine. The overall extraction system consists of a large hierarchy of analysis engines, encompassing several hundred elements. The type system, in contrast, only consists of three principal types, i.e. for simple fields, tables and table rows. Annotation types, extracted values, etc. are stored as features. Both final and intermediate annotations are represented by these types.

2 http://www.nuance.com/omnipage

Fig. 2. Extraction architecture

2.1 Extraction of simple-valued fields

We use the term "simple-valued fields" for data points where one key has one or more values. They differ from named entities in that they may include multi-valued data. Figure 1(a) shows an example of the key eligible currency with the (normalized) values "USD" and "Base currency". Fields are extracted layer-wise. On the lowest layer, all instances of the identifying term "Eligible currency" are captured, as well as the different currency expressions, including the special term "Base currency", which refers to another simple field.
On this level we typically use annotators based on dictionaries and regular expressions, where variations due to OCR errors are reflected in dictionary variants and in the regular expressions, respectively. All such annotators are implemented as analysis engines. On the next level, so-called "expert extractors" combine the existing annotations into a new one. An expert is a rule, defined as a set of slots for annotations of specific types, together with a definition of which slots form a new annotation if the rule is satisfied, i.e. if all slots are filled. To allow for fine-tuning the experts, slots can be configured, e.g. by marking certain slots as optional. Furthermore, it is possible to specify the order in which the annotations in the slots must appear in the document. It is also possible to specify a maximum distance: if the distance between two found annotations exceeds the defined threshold for this expert, the expert assumes it is in the wrong area of the document and clears its internal state to start all over again. Finally, slots can be write-protected, accepting only the first occurrence of the configured annotation.

Fig. 3. Extraction of a simple field. First-level components have tagged the "Eligible Currency" phrase and the different variants of currencies. Expert 1 collects two or more currencies (the third slot is optional). The resulting annotation is used by Expert 2 to build the final annotation. All elements are represented in UIMA as simple field types.

To extract eligible currency, two experts are employed (see Figure 3). The first expert collects adjacent currency annotations. The second one combines the "Eligible Currency" term and the collected currencies found by expert one, if both annotations are found within a short distance. The resulting annotation will span the relevant currency terms. This modular design allows us to reduce the number of extractors and to reuse existing annotations for completely different data points. In general, the pieces of information found in the examined contracts are not independent of each other. We use business rules and other constraints to validate and normalize the found results; e.g., the set of currencies is well-defined. If the validation fails, or if the normalization repairs some value due to business rules, a corresponding message can be attached to the annotation to inform the reviewer.
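The slot-filling behaviour just described can be condensed into a few lines. The following self-contained sketch illustrates the mechanism only; the annotation representation, the reset policy and the spanning result are simplifying assumptions, not the actual component.

import java.util.*;

public class ExpertSketch {
  record Ann(String type, int begin, int end) {}

  // Fill slots in document order; a gap larger than maxDistance clears the expert's
  // internal state; once every slot is filled, a new spanning annotation is emitted.
  static Ann apply(List<Ann> annotations, List<String> slotTypes, int maxDistance,
      String resultType) {
    List<Ann> sorted = new ArrayList<>(annotations);
    sorted.sort(Comparator.comparingInt(Ann::begin));
    Ann[] slots = new Ann[slotTypes.size()];
    int next = 0, lastEnd = -1;
    for (Ann a : sorted) {
      if (lastEnd >= 0 && a.begin() - lastEnd > maxDistance) { // wrong document area:
        Arrays.fill(slots, null); next = 0; lastEnd = -1;      // start all over again
      }
      if (next < slots.length && a.type().equals(slotTypes.get(next))) {
        slots[next++] = a;
        lastEnd = a.end();
        if (next == slots.length)                              // rule satisfied
          return new Ann(resultType, slots[0].begin(), a.end());
      }
    }
    return null; // expectations not met
  }
}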
2.2 Extraction of tables

We define a table as multi-dimensional, structured data present in a document either in a classical tabular layout or defined in a series of sentences or paragraphs in free-text form (as in Figure 4). We aim at extracting tables of both structure types, as well as intermediate formats (e.g. as in Figure 1(b)), solely from the document's OCR output at character level. In our application, table extraction extends the simple-valued field extraction: the basic input for a table expert is a document annotated with simple-valued fields and intermediate annotations. The experts attempt to match sequences of simple annotations to a set of table models. A table model is user-defined and describes which columns the resulting extracted table should have. Each column can contain multiple types of simple fields. Furthermore, columns can be configured to be optional and to accept only unique or non-overlapping annotations. This allows for both more general models with variable columns and fine-tuning of the accepted annotations.

The process of detecting tables by the table expert (see Figure 4 for an example) begins with collecting all accepted annotations for a model, within a predefined range or until a table stop annotation is found, into a list sorted by order of appearance. For each such list, several filling strategies are employed. A filling strategy addresses the problem that multiple columns may accept the same types of annotations. If elements appear row-wise or column-wise, the corresponding strategies will recover the correct table, also compensating for some errors from omitted table elements. In mixed cases, adding a new table cell to the shortest relevant column is used as a fallback strategy. Each strategy is evaluated using the fraction of cells filled in the resulting table, c, and the filling-strategy-specific score s. The latter score measures how well the annotations match the expectations of the filling strategy. The table which maximizes s_f = c · s is annotated as a candidate if s_f is above a predefined threshold. The table expert is implemented as an analysis engine. Its configuration encompasses the columns describing the table model, the distance and scoring thresholds, and the set of filling strategies to be evaluated. The output is a table type annotation, which in turn contains several table rows, each containing simple fields as cells.

Multiple table experts may be used to generate candidate tables for a single target, and candidates may occur in several locations in a document. Usually, the correct location gives rise to tables with certain properties, e.g. short, dense tables. This is used by a feature-based selection of the optimal table candidate. We model this using both general-purpose features (e.g. size and number of empty cells) and domain-specific features. The table with the highest weighted sum of score features is selected as the final output. The weights can either be user-defined or fitted using a formal optimization model.
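A minimal sketch of the candidate scoring just described; the filling strategies themselves and the computation of the strategy score s are taken as given.

public class TableScoring {
  // s_f = c * s, where c is the fraction of filled cells in the candidate table
  // and s is the filling-strategy-specific score.
  static double candidateScore(String[][] table, double strategyScore) {
    int filled = 0, total = 0;
    for (String[] row : table)
      for (String cell : row) { total++; if (cell != null) filled++; }
    double c = (total == 0) ? 0.0 : (double) filled / total;
    return c * strategyScore;
  }

  // Keep the best-scoring candidate only if it clears the predefined threshold.
  static int bestCandidate(String[][][] candidates, double[] strategyScores,
      double threshold) {
    int best = -1;
    double bestScore = threshold;
    for (int i = 0; i < candidates.length; i++) {
      double sf = candidateScore(candidates[i], strategyScores[i]);
      if (sf > bestScore) { bestScore = sf; best = i; }
    }
    return best; // index of the winning candidate, or -1 if none qualifies
  }
}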
3 Experiments

We composed a document set containing 449 documents3 to measure the extraction quality of our system. These documents come from various customers and represent as many variants of different wordings and layouts as possible. With our customers we agreed upon certain quality gates that the automatic extraction system has to meet. Due to the nature of the contracts it is much more important to achieve a high precision of the extracted data than a high recall. For simple fields the gate's threshold is 95% precision and 80% recall. Table cells are more difficult to extract, since the OCR component not only mis-recognizes individual characters but also makes errors in the structure of a table. For table cells, our goal is a high recall, since errors within a structured table are easier for a human reviewer to detect and correct than simple field errors. Table 1 shows our results against a manually created ground truth. The numbers represent the total number of data points and errors, respectively, over all of our documents.

3 See tinyurl.com/csa-example for a public sample document.

Table 1. Results for simple fields and table cells on our document corpus, shown as absolute numbers of (in)correct data points and values for precision and recall.

              Insertions  Deletions  Substitutions  Correct  Precision  Recall
Simple fields        375       1267            330    20519      0.966   0.928
Table cells         1492       3563            906    18838      0.887   0.808

Fig. 4. Textual source (OCR output rendered as rich text) and extraction result of an interest rate table. The tabular result is extracted from the paragraph with three similar sentences which contain currencies (solid frame) and interest rate names (dashed frame). Domain knowledge is used to fill the maturity and spread columns.

In total, we meet our gate criterion for simple fields. Precision can be as low as 33% for rare fields, where fitting appropriate data experts is hard. In contrast, for frequent fields, precision may exceed 99%. In principle, the same holds for recall, with both maximum and minimum lower, due to our target criteria. For table cells, the precision needs improvement, mainly due to the OCR's structural errors such as swapping rows within a table or switching between row-wise and column-wise recognition in one table. This is especially true for tables which are complex with respect to both layout and contents, like the collateral eligibility table in Figure 1(b). Here, precision and recall are 84.4% and 80.2%, respectively. In contrast, structurally simple tables, like the interest rate table (see Figure 4 for an example), can be extracted with much higher confidence (97.4% precision and 90.8% recall).

4 Conclusion and outlook

This article presents a system to automatically extract simple data points and tables from OTC contract images. The system consists of an OCR component and a hierarchical set-up of small modular extractors, either capturing (noisy) text or combining already annotated clues using a slot-filling strategy. Our experiments are conducted on an in-house contract collection, resulting in a precision of 97% (recall 93%) on simple fields and a precision of 89% (recall 81%) on table cells. While the evaluation we conducted is limited, we expect overfitting to be moderate, as the legal nature of the contracts limits the layout and wording options. Our next steps include the introduction of a confidence score on the data-point level and the use of statistical classification methods for selecting the best-suited table model.

Acknowledgement. We would like to thank our partner Rule Financial for providing the data model and for their assistance in understanding the documents.

References

1. Paul Buitelaar and Srikanth Ramaka. Unsupervised ontology-based semantic tagging for knowledge markup. In Proceedings of the Workshop on Learning in Web Search at the International Conference on Machine Learning, 2005.
2. Hamish Cunningham. GATE, a general architecture for text engineering. Computers and the Humanities, 36(2):223–254, 2002.
3. David Ferrucci and Adam Lally. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348, 2004.
4. Kyungduk Kim et al. A frame-based probabilistic framework for spoken dialog management using dialog examples. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, 2008.
5. John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.
6. Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, pages 118–183. Springer, 2003.
7. David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. Table extraction using conditional random fields.
In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 235–242, 2003.
8. Pallavi Pyreddy and W. Bruce Croft. TINTIN: A system for retrieval in text tables. In Proceedings of the Second ACM International Conference on Digital Libraries, pages 193–200, 1997.
9. Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155, 2009.
10. Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
11. Mihai Surdeanu, Ramesh Nallapati, and Christopher D. Manning. Legal claim identification: Information extraction with hierarchically labeled data. In Proceedings of the LREC 2010 Workshop on the Semantic Processing of Legal Texts, 2010.
12. Suzanne Liebowitz Taylor, Richard Fritzson, and Jon A. Pastor. Extraction of data from preprinted forms. Machine Vision and Applications, 5(3):211–222, 1992.
13. Paul Viola and Mukund Narasimhan. Learning to extract information from semi-structured text using a discriminative context free grammar. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 330–337, 2005.

Constraint-driven Evaluation in UIMA Ruta

Andreas Wittek1, Martin Toepfer1, Georg Fette1,2, Peter Kluegl1,2, and Frank Puppe1
1 Department of Computer Science VI, University of Wuerzburg, Am Hubland, Wuerzburg, Germany
2 Comprehensive Heart Failure Center, University of Wuerzburg, Straubmuehlweg 2a, Wuerzburg, Germany
{a.wittek,toepfer,fette,pkluegl,puppe}@informatik.uni-wuerzburg.de

Abstract. This paper presents an extension of the UIMA Ruta Workbench for estimating the quality of arbitrary information extraction models on unseen documents. The user can specify expectations on the domain in the form of constraints, which are applied in order to predict the F1 score or the ranking. The applicability of the tool is illustrated in a case study on the segmentation of references, which also examines the robustness for different models and documents.

1 Introduction

Apache UIMA [5] and the surrounding ecosystem provide a powerful framework for engineering state-of-the-art Information Extraction (IE) systems, e.g., in the medical domain [13]. Two main approaches for building IE models can be distinguished. One approach is based on manually defining a set of rules, e.g., with UIMA Ruta3 (Rule-based Text Annotation) [7]4, that are able to identify the interesting information or annotations of specific types. A knowledge engineer writes, extends, refines and tests the rules on a set of representative documents. The other approach relies on machine learning algorithms, such as probabilistic graphical models like Conditional Random Fields (CRF) [10]. Here, a set of annotated gold documents is used as a training set in order to estimate the parameters of the model. The resulting IE system of either approach, the statistical model or the set of rules, is evaluated on an additional set of annotated documents in order to estimate its accuracy or F1 score, which is then assumed to hold for the application in general.
However, a system that performed well in the evaluation setting may show decreased accuracy when applied to unseen documents, for example because the set of documents used for developing the IE system was not large or representative enough. In order to estimate the actual performance, either more data is labeled or the results are manually checked by a human, who is able to validate the correctness of the annotations. Annotated documents are essential for developing IE systems, but there is a natural lack of labeled data in most application domains, and its creation is error-prone, cumbersome and time-consuming, as is the manual validation. An automatic estimation of the IE system's quality on unseen documents would therefore provide many advantages. A human is able to validate the created annotations using background knowledge and expectations on the domain. This kind of knowledge is already used by current research in order to improve IE models (cf. [1, 6, 11]), but barely to estimate an IE system's quality.

3 http://uima.apache.org/ruta.html
4 previously published as TextMarker

This paper introduces an extension of the UIMA Ruta Workbench for exactly this use case: estimating the quality and performance of arbitrary IE models on unseen documents. The user can specify expectations on the domain in the form of constraints, hence the name Constraint-driven Evaluation (CDE). The constraints rate specific aspects of the labeled documents and are aggregated into a single cde score, which provides a simple approximation of the evaluation measure, e.g., the token-based F1 score. The framework currently supports two different kinds of constraints: simple UIMA Ruta rules, which express specific expectations concerning the relationships of annotations, and annotation-distribution constraints, which rate the coverage of features. We distinguish two tasks: predicting the actual F1 score of a document, and estimating the ranking of the documents as specified by the actual F1 score. The former task can answer how well the model performs. The latter task points to documents where the IE model can be improved. We evaluate the proposed tool in a case study on the segmentation of scientific references, which tries to estimate the F1 score of a rule-based system. The expectations are additionally applied to documents from a different distribution and to documents labeled by a different IE model. The results emphasize the advantages and usability of the approach, which works with minimal effort due to a simple fact: it is much easier to estimate how well a document is annotated than to actually identify the positions of defective or missing annotations.

The rest of the paper is structured as follows. In the upcoming section, we describe how our work relates to other fields of Information Extraction research. We explain the proposed CDE approach in Section 3. Section 4 covers the case study and the corresponding results. We conclude with pointers to future work in Section 5.

2 Related Work

Besides standard classification methods, which fit all model parameters against the labeled data of the supervised setting, there have been several efforts to incorporate background knowledge from either user expectations or external data analysis. Bellare et al. [1], Graça et al. [6] and Mann and McCallum [11], for example, showed how moments of auxiliary expectation functions on unlabeled data can be used for such a purpose, with special objective functions and an alternating optimization procedure.
Our work on constraint-driven evaluation is partly inspired by this idea; however, we address a different problem. We suggest using auxiliary expectations to estimate the quality of classifiers on unseen data.

A classifier's confidence describes the degree to which it believes that its own decisions are correct. Several classifiers provide intrinsic measures of confidence, for example naive Bayes classifiers. Culotta and McCallum [4], for instance, studied confidence estimation for information extraction. They focus on predictions about field and record correctness of single instances. Their main motivation is to filter high-precision results for database population. Similar to CDE, they use background knowledge features like record length, single field label assignments and field confidence values to estimate record confidence. CDE generalizes common confidence estimation because the goal of CDE is the estimation of the quality of arbitrary models.

Active learning algorithms are able to choose the order in which training examples are presented in order to improve learning, typically by selective sampling [2]. While the general CDE setting does not necessarily involve selective sampling (consider, for example, the batch F1 score prediction task), the ranking task can be used as a selective sampling strategy in applications, to find instances that support system refactoring. The focus of the F1 ranking task, however, still differs from active learning goals, which is essential for the design of such systems: both approaches are expected to favor different techniques to fit their different objectives. Popular active learning approaches such as density-weighting (e.g., [12]) focus on dense regions of the input distribution. CDE, however, tries to estimate the quality of the model on the whole data set and hence demands differently designed methods. Despite their differences, the combination of active learning and CDE would be an interesting subject for future work. CDE may be used to find weak learners of ensembles and informative instances for these learners.

3 Constraint-driven Evaluation

The Constraint-driven Evaluation (CDE) framework presented in this work allows the user to specify expectations about the domain in the form of constraints. These constraints are applied to documents with annotations which have been created by an information extraction model. The results of the constraints are aggregated into a single cde score, which reflects how well the annotations fulfill the user's expectations and thus provides a predicted measurement of the model's quality for these documents. The framework is implemented as an extension of the UIMA Ruta Workbench. Figure 1 provides a screenshot of the CDE perspective, which includes different views to formalize the set of constraints and to present the predicted quality of the model for the specified documents.

Fig. 1. CDE perspective in the UIMA Ruta Workbench. Bottom left: Expectations on the domain formalized as constraints. Top right: Set of documents and their cde scores. Bottom right: Results of the constraints for the selected document.

We define a constraint in this work as a function C : CAS → [0, 1], which returns a confidence value for an annotated document (CAS), where high values indicate that the expectations are fulfilled. Two different types of constraints are currently supported. Rule constraints are simple UIMA Ruta rules without actions; they allow specifying sequential patterns or other relationships between annotations that need to be fulfilled. The result is basically the ratio of how often the rule has actually matched to how often it has tried to match.
An example of such a constraint is Document{CONTAINS(Author)};, which specifies that each document must contain an annotation of the type Author. The second type of supported constraints are Annotation Distribution (AD) constraints (cf. Generalized Expectations [11]). Here, the expected distribution over the evaluated types is given for an annotation or word. The result of the constraint is the cosine similarity of the expected and the observed presence of the annotation or word within annotations of the given types. A constraint like "Peter": Author 0.9, Title 0.1, for example, indicates that the word "Peter" should rather be covered by an Author annotation than by a Title annotation. The set of constraints and their weights can be defined using the CDE Constraint view (cf. Figure 1, bottom left). For a given set of constraints C = {C_1, C_2, ..., C_n} and corresponding weights w = {w_1, w_2, ..., w_n}, the cde score for each document is defined by the weighted average

cde = (1/n) · Σ_{i=1}^{n} w_i · C_i   (1)

The cde scores for a set of documents may already be very useful as a report of how well the annotations comply with the expectations on the domain. However, one can further distinguish two tasks for CDE: the prediction of the actual evaluation score of the model, e.g., the token-based F1 score, and the prediction of the quality ranking of the documents. While the former task can answer how well the model performs, or whether the model is already good enough for the application, the latter task provides a useful tool for introspection: Which documents are poorly labeled by the model? Where should the model be improved? Are the expectations on the domain realistic? Due to the limited expressiveness of the aggregation function, we concentrate on the latter task. The cde scores for the annotated documents are depicted in the CDE Documents view (cf. Figure 1, top right). The result of each constraint for the currently selected document is given in the CDE Results view (cf. Figure 1, bottom right).

The development of the constraints needs to be supported by tooling in order to achieve an improved prediction for the intended task. If the user extends or refines the expectations on the domain, then feedback on whether the prediction has improved or deteriorated is very valuable. For this purpose, the framework provides functionality to evaluate the prediction quality of the constraints themselves. Given a set of documents with gold annotations, the cde score of each document can be compared to the actual F1 score. Four measures are applied to evaluate the prediction quality of the constraints: the mean squared error, Spearman's rank correlation coefficient, the Pearson correlation coefficient and the cosine similarity. For optimizing the constraints to approximate the actual F1 score, Pearson's r is maximized; for improving the predicted ranking, Spearman's ρ is maximized. If documents with gold annotations are available, then the F1 scores and the values of the four evaluation measures are shown in the CDE Documents view (cf. Figure 1, top right).
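For concreteness, here is a small self-contained sketch of the two computations just described: the aggregation from Equation (1) and the cosine similarity used by AD constraints. The plain-array representation is illustrative, not the Workbench's actual implementation.

public class CdeScore {
  // Equation (1): cde = (1/n) * sum_i w_i * C_i, with constraint results in [0, 1].
  static double cde(double[] results, double[] weights) {
    double sum = 0.0;
    for (int i = 0; i < results.length; i++) sum += weights[i] * results[i];
    return sum / results.length;
  }

  // Cosine similarity of the expected vs. observed label distribution, as used by
  // an AD constraint such as "Peter": Author 0.9, Title 0.1.
  static double cosine(double[] expected, double[] observed) {
    double dot = 0, ne = 0, no = 0;
    for (int i = 0; i < expected.length; i++) {
      dot += expected[i] * observed[i];
      ne += expected[i] * expected[i];
      no += observed[i] * observed[i];
    }
    return (ne == 0 || no == 0) ? 0.0 : dot / Math.sqrt(ne * no);
  }
}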
4 Case Study

The usability and advantages of the presented work are illustrated with a simple case study concerning the segmentation of scientific references, a popular domain for evaluating novel information extraction models. In this task, the information extraction model normally identifies about 12 different entities of the reference string, but in this case study we limited the relevant entities to Author, Title and Date, which are commonly applied in order to identify the cited publication.

In the main scenario of the case study, we try to estimate the extraction quality of a set of UIMA Ruta rules that identify the Author, Title and Date of a reference string. For this purpose, we define constraints representing the background knowledge about the domain for this specific set of rules. In addition to this main setting of the case study, we also measure the prediction of the constraints in two different scenarios: in the first one, the documents have been labeled not by UIMA Ruta rules but by a CRF model [10]; the CRF model was trained with a limited number of iterations in a 5-fold manner. In the second scenario, we apply the UIMA Ruta rules to a set of documents from a different distribution, including unknown style guides.

Table 1 provides an overview of the applied datasets. We make use of the references dataset of [9]. This data set is homogeneously divided into three sub-datasets with respect to their style guides and number of references, which are applied to develop the UIMA Ruta rules, to define the set of constraints, and to evaluate the prediction of the constraints compared to the actual F1 score. The CRF model is trained on the partitions given in [9]. The last dataset D_gen consists of a mixture of the datasets Cora, CiteSeerX and FLUX-CiM described in [3], generated by the rearrangement of [8].

Table 1. Overview of utilized datasets.

D_ruta  219 references in 8 documents, used to develop the set of UIMA Ruta rules.
D_dev   192 references in 8 documents, labeled by the UIMA Ruta rules and applied for developing the constraints.
D_test  155 references in 7 documents, labeled by the UIMA Ruta rules and applied to evaluate the constraints.
D_crf   D_ruta, D_dev and D_test (566 references in 23 documents), labeled by a (5-fold) CRF model.
D_gen   452 references in 28 documents from a different source with unknown style guides, labeled by the UIMA Ruta rules.

Table 2. Overview of evaluated sets of constraints.

C_ruta       15 Rule constraints describing general expectations for the entities Author, Title and Date. The weight of each constraint is set to 1.
C_ruta+bib   C_ruta extended with one additional AD constraint covering the entity distribution of words extracted from Bibsonomy. The weight of each constraint is set to 1.
C_ruta+5xbib The same set of constraints as in C_ruta+bib, but the weight of the additional AD constraint is set to 5.

Table 2 provides an overview of the different sets of constraints, whose predictions are compared to the actual F1 score. First, we extended and refined a set of UIMA Ruta rules until they achieved an F1 score of 1.0 on the dataset D_ruta. Then, the 15 Rule constraints C_ruta5 were specified using the dataset D_dev. The definition of the UIMA Ruta rules took about two hours and the definition of the constraints about one hour. In addition to the Rule constraints, we created an AD constraint, which consists of the entity distribution of words that occurred at least 1000 times in the latest BibTeX database dump of Bibsonomy6. The sets of constraints C_ruta+bib and C_ruta+5xbib combine both types of constraints with different weightings. Table 3 contains the evaluation, which compares the predicted cde score to the actual token-based F1 score for each document.
We apply two different correlation coefficients for measuring the quality of the prediction: Spearman's ρ gives an indication of the ranking of the documents, and Pearson's r provides a general measure of linear dependency.

5 The actual implementation of the constraints as UIMA Ruta rules is depicted in Figure 1 (lower left part).
6 http://www.kde.cs.uni-kassel.de/bibsonomy/dumps

Table 3. Spearman's ρ and Pearson's r for the predicted cde score (for each document) compared to the actual F1 score.

           C_ruta          C_ruta+bib      C_ruta+5xbib
Dataset    ρ       r       ρ       r       ρ       r
D_dev      0.8708  0.9306  0.9271  0.9405  0.8051  0.6646
D_test     0.9615  0.9478  0.9266  0.8754  0.8154  0.6758
D_crf      0.6793  0.7881  0.7429  0.8011  0.7117  0.7617
D_gen      0.7089  0.8002  0.7724  0.8811  0.8150  0.9504

Although the expectations defined by the sets of constraints are limited and quite minimalistic, covering mostly common expectations, the results indicate that they can be useful in any scenario. The results for the dataset D_dev are only given for completeness, since this dataset was applied to define the set of constraints. The results for the dataset D_test, however, reflect the prediction on unseen documents of the same distribution. The ranking of the documents was almost perfectly estimated, with a Spearman's ρ of 0.9615; the actual cde and F1 scores of D_test are depicted in Figure 1 (right part). The coefficients for the other scenarios D_crf and D_gen are considerably decreased, but the cde scores are nevertheless very useful for an assessment of the extraction model's quality. The five worst documents in D_gen (including new style guides), for example, have been reliably detected. The results show that the AD constraints can improve the prediction, but do not exploit their full potential in the current implementation. The impact measured for the dataset D_crf is not as distinctive, since the CRF model already includes such features and thus is able to avoid errors that would be detected by these constraints. However, the prediction on the dataset D_gen is considerably improved: the UIMA Ruta rules produce severe errors in documents with new style guides, which are easily detected by the word distribution.

5 Conclusions

This paper presented a tool for the UIMA community, implemented in UIMA Ruta, which makes it possible to estimate the extraction quality of arbitrary models on unseen documents. Its introspective report is able to improve the development of information extraction models already with minimal effort. This is achieved by formalizing the background knowledge about the domain with different types of constraints. We have shown the usability and advantages of the approach in a case study on the segmentation of references. Concerning future work, many prospects for improvement remain, for example a logistic regression model for approximating the scores of arbitrary evaluation measures, new types of constraints, or approaches to automatically acquire the expectations on a domain.

Acknowledgments This work was supported by the Competence Network Heart Failure, funded by the German Federal Ministry of Education and Research (BMBF01 EO1004).

References

1. Bellare, K., Druck, G., McCallum, A.: Alternating Projections for Learning with Expectation Constraints. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in AI. pp. 43–50. AUAI Press (2009)
2. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Machine Learning 15, 201–221 (1994)
3. Councill, I., Giles, C.L., Kan, M.Y.: ParsCit: an Open-source CRF Reference String Parsing Package. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC'08). ELRA, Marrakech, Morocco (2008)
4. Culotta, A., McCallum, A.: Confidence Estimation for Information Extraction. In: Proceedings of HLT-NAACL 2004: Short Papers. pp. 109–112. HLT-NAACL-Short '04, Association for Computational Linguistics, Stroudsburg, PA, USA (2004)
5. Ferrucci, D., Lally, A.: UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering 10(3/4), 327–348 (2004)
6. Graca, J., Ganchev, K., Taskar, B.: Expectation Maximization and Posterior Constraints. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) NIPS 20, pp. 569–576. MIT Press, Cambridge, MA (2008)
7. Kluegl, P., Atzmueller, M., Puppe, F.: TextMarker: A Tool for Rule-Based Information Extraction. In: Chiarcos, C., de Castilho, R.E., Stede, M. (eds.) Proceedings of the 2nd UIMA@GSCL Workshop. pp. 233–240. Gunter Narr Verlag (2009)
8. Kluegl, P., Hotho, A., Puppe, F.: Local Adaptive Extraction of References. In: 33rd Annual German Conference on Artificial Intelligence (KI 2010). Springer (2010)
9. Kluegl, P., Toepfer, M., Lemmerich, F., Hotho, A., Puppe, F.: Collective Information Extraction with Context-Specific Consistencies. In: Flach, P.A., Bie, T.D., Cristianini, N. (eds.) ECML/PKDD (1). Lecture Notes in Computer Science, vol. 7523, pp. 728–743. Springer (2012)
10. Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289 (2001)
11. Mann, G.S., McCallum, A.: Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. J. Mach. Learn. Res. 11, 955–984 (2010)
12. McCallum, A., Nigam, K.: Employing EM and Pool-Based Active Learning for Text Classification. In: Shavlik, J.W. (ed.) ICML. pp. 350–358. Morgan Kaufmann (1998)
13. Savova, G.K., Masanz, J.J., Ogren, P.V., Zheng, J., Sohn, S., Kipper-Schuler, K.C., Chute, C.G.: Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17(5), 507–513 (2010)

Author Index

Chappelier, Jean-Cedric 34
Chen, Pei 1
Codina, Joan 42
Di Bari, Alessandro 2
Fang, Yan 14
Faraotti, Alessandro 2
Fette, Georg 10, 58
Gambardella, Carmela 2
García Narbona, David 42
Garduno, Elmer 14
Grivolla, Jens 42
Hernandez, Nicolas 18
Kluegl, Peter 58
Maiberg, Avner 14
Massó Sanabre, Guillem 42
McCormack, Collin 14
Noh, Tae-Gil 26
Nyberg, Eric 14
Padó, Sebastian 26
Puppe, Frank 10, 58
Richardet, Renaud 34
Rodríguez-Penagos, Carlos 42
Savova, Guergana 1
Stadermann, Jan 50
Symons, Stephan 50
Telefont, Martin 34
Thon, Ingo 50
Toepfer, Martin 10, 58
Vetere, Guido 2
Wittek, Andreas 58
Yang, Zi 14