Using UIMA to Structure an Open Platform for Textual Entailment

Tae-Gil Noh and Sebastian Padó
Department of Computational Linguistics, Heidelberg University
69120 Heidelberg, Germany
{noh, pado}@cl.uni-heidelberg.de

Abstract. EXCITEMENT is a novel, open software platform for Textual Entailment (TE) which uses the UIMA framework. This paper discusses the design considerations regarding the roles of UIMA within the EXCITEMENT Open Platform (EOP). We focus on two points: (a) how to best design the representation of entailment problems within the UIMA CAS and its type system, and (b) the integration and usage of UIMA components among non-UIMA components.

Keywords: Textual Entailment, UIMA type system, UIMA application

1 Introduction

Textual Entailment (TE) captures a common-sense notion of inference and expresses it as a relation between two natural language texts. It is defined as follows: A Text (T) entails a Hypothesis (H) if a typical human reading of T would infer that H is most likely true [4]. Consider the following example:

T: That was the 1908 Tunguska event in Siberia, known as the Tunguska meteorite fall.
H1: A shooting star fell in Russia in 1908.
H2: Tunguska fell to Siberia in 1908.

The text (T) entails the first hypothesis (H1), since a typical human reader of T would (arguably) believe that H1 is true. In contrast, T does not entail H2. Nor does H1 entail T; that is, entailment is a directed relation.

The promise of TE lies in its potential to subsume the semantic processing needs of many NLP applications, offering a uniform, theory-independent semantic processing paradigm. Software for the Recognition of Textual Entailment (RTE) has been used to build proof-of-concept versions of various tasks, including Question Answering, Machine Translation Evaluation, and Information Visualization [1, 7].

As a consequence of the theory-independence of TE, there are many different strategies for building RTE systems [1]. This has led to a practical problem of fragmentation: various systems exist, and some have been made available as open-source systems, but there is little to no interoperability between them, since the systems are, as a rule, designed to implement one specific algorithm to solve RTE. The problem is complicated by the fact that RTE systems generally rely on tightly integrated components such as linguistic analysis tools and knowledge resources. Thus, when researchers want to develop a new RTE algorithm, they often need to invest major effort to build a novel system from scratch: many of the components already exist, but just not in a usable form.

The EXCITEMENT Open Platform (EOP) has been developed to address these problems. It is a suite of textual inference components which can be combined into complete textual inference systems. The platform aims to become a common development platform for RTE researchers, and we hope that it can establish itself in the RTE community in a similar way to MOSES [6] in Machine Translation. Compared to Machine Translation, however, a major challenge is that semantic processing typically depends on linguistic analysis as well as large knowledge sources, which is a direct source of the reusability problems mentioned above. In this paper, we focus on the architectural side of the platform, which was designed with the explicit goal of improving component reusability.
We have adopted UIMA (Unstructured Information Management applications) and the UIMA CAS (Common Analysis Structure) as the central building blocks for data representation and preprocessing within the EOP. One interesting aspect is that our adoption of UIMA has been partial and parallel. By partial, we mean that there are two groups of sharable components within the EOP: the "core" components and the "LAP" components (see Section 2). We have adopted UIMA only for LAPs; however, we use the UIMA CAS as one of the standard data containers, even in non-UIMA components. Parallel refers to the fact that we allow non-UIMA components to be integrated into our LAPs transparently.

2 EXCITEMENT: An Open Platform for Textual Entailment Systems

RTE systems traditionally rely on self-defined input types, pre-processing (linguistic annotation) representations, and resources, tailored to a specific approach to RTE. The EXCITEMENT Open Platform (EOP) tries to alleviate this situation by providing a generic platform for sharable RTE components. The platform has the following requirements.

Reuse of existing software: The platform must permit easy integration and reuse of existing software, including language processing tools, RTE components, and knowledge resources.

Multilinguality: The platform is not tied to a specific language. Adding suites for a new language in the future should not be restricted by the platform design.

Component Independence: Components of the EOP should be independent and complete as they are, so that they can be used by different RTE approaches. This is also true for linguistic annotation pipelines and their components: an annotation pipeline as a whole, or an individual component of the pipeline, can be replaced with equivalent components.

Fig. 1: EXCITEMENT Architecture Overview. (Figure: the Linguistic Analysis Pipeline (LAP), built from UIMA components, turns raw entailment problems into annotated entailment problems; the Entailment Core (EC), built from Java components, contains the Entailment Decision Algorithm (EDA) and dynamic and static components (algorithms and knowledge), and produces entailment decisions.)

Figure 1 visualizes the top level of the platform. At this level, the platform can be grouped into two boxes: one is the Linguistic Analysis Pipeline (LAP), and the other is the Entailment Core (EC). Entailment problems are first analyzed in the LAP, since almost all RTE algorithms require some level of linguistic annotation (e.g., POS tagging, parsing, NER, or lemmatization). The annotated TE problems are then passed to the EC box. In this box, the problems are analyzed by Entailment Decision Algorithms (EDAs), which are the "core" algorithms that make the entailment call and may in turn call other core components to provide either algorithmic additions or knowledge. Finally, the EDA returns an entailment decision.

It is relatively natural to think of the LAP in terms of UIMA, since the typical computational linguistic analysis workflow corresponds well to UIMA's annotation pipeline concept. Each annotator in the LAP adds some annotations, and downstream annotators can use existing annotations and add richer annotations. The UIMA CAS and its type system are expressive enough to represent all the data the platform needs. UIMA AEs (Analysis Engines) are a good solution for encapsulating and using annotator components. In Section 3, we describe the UIMA adoption in the LAP in more detail.

For Entailment Core (EC) components, however, the situation is different.
In contrast to the LAP, the functionality of EC components is often not naturally expressed as "annotation behavior". To see this, consider the example in Figure 2. The figure shows a conceptual search process of an RTE system that is based on textual rewriting. In this example, the text is "Google bought Motorola.", and the system tries to establish the hypothesis "Motorola is acquired by Google." as an entailment.

Fig. 2: Entailment as a search on possible rewritings. (Figure: the parse tree of "Google bought Motorola" is successively rewritten into derived parse trees for "Google acquired Motorola" and "Motorola is acquired by Google".)

The example system obtains a dependency parse tree of the text and starts the rewriting process. On each iteration, it generates possible entailed sentences by querying knowledge bases. In the example, lexical knowledge is used in the first rewriting step (buy entails acquire), and syntactic knowledge (change to passive voice) is used in the second derivation. The process will generate many derived candidates per iteration. The algorithm must employ a good search strategy to find the best rewriting path from text (T) to hypothesis (H).

In this example, there are three major component types. One is the knowledge component type that supports knowledge look-up, another is the generation of derived parse trees, and finally the decision algorithm itself drives the search process and makes the entailment decision. Expressing the behavior of such components in terms of annotations on the artifact might be possible, but it is very hard and counter-intuitive.

Following this line of reasoning, we decided that the EC components are better thought of as Java modules whose common behavior is defined by a set of Java interfaces, data types, and contracts, and we have defined them accordingly in the EXCITEMENT Open Platform specification.¹ More specifically, we have defined a typology of components. It includes a type for the top-level EDA as well as (currently) five major component types: (1) a feature extractor (gets a T-H pair CAS, returns a set of features for the T-H pair); (2) a semantic distance calculator (gets a T-H pair CAS, returns a semantic similarity); (3) a lexical resource type (lexical relation database); (4) a syntactic resource type (phrasal relation database); (5) an annotation component (dynamic enrichment of entailment problems).

Although UIMA components are not suitable for conceptualizing inference components, we decided to keep the CAS as the data container even in the EC components as far as possible, to take advantage of the CAS objects created in the LAP. Thus, various components (including EDAs) receive a CAS (as JCas) as an argument to their methods. Also note that the LAP and EC boxes are independent: as long as the CAS holds correct data, the EC components do not care which pipeline has generated the data.

¹ Specification and architecture for the EXCITEMENT Open Platform, http://excitement-project.eu/index.php/results

Fig. 3: CAS representation of a Text-Hypothesis pair. (Figure: one CAS with entailment metadata (language, channel, docId, collectionId, ...), an entailment pair (pairId, goldAnswer, text, hypothesis), and a Text View and a Hypothesis View, each a subject of analysis carrying Token, POS, and Dependency annotations.)
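To make the component typology above more concrete, the following is a minimal Java sketch of how such interface-based components might look. All interface, class, and method names here are illustrative assumptions on our part rather than the actual EOP specification; the point is only that each component exposes its behavior through plain Java methods and, where needed, receives the LAP-produced CAS (as a JCas) instead of acting as a UIMA annotator.

```java
import java.util.List;
import org.apache.uima.jcas.JCas;

/** Hypothetical sketch of Entailment Core contracts (all names are illustrative). */
interface EntailmentDecisionAlgorithm {
    /** Receives a T-H pair CAS produced by the LAP and returns a decision. */
    EntailmentDecision process(JCas thPairCas);
}

interface DistanceCalculator {
    /** Returns a semantic distance/similarity score for the T-H pair in the CAS. */
    double calculateDistance(JCas thPairCas);
}

interface LexicalResource {
    /** Looks up lexical rules (e.g. "buy" entails "acquire") for a lemma and POS. */
    List<LexicalRule> getRulesForLeft(String lemma, String partOfSpeech);
}

/** Minimal value classes used by the sketch above. */
class EntailmentDecision {
    enum Label { ENTAILMENT, NONENTAILMENT, UNKNOWN }
    final Label label;
    final double confidence;
    EntailmentDecision(Label label, double confidence) {
        this.label = label;
        this.confidence = confidence;
    }
}

class LexicalRule {
    final String leftLemma, rightLemma;
    final double confidence;
    LexicalRule(String leftLemma, String rightLemma, double confidence) {
        this.leftLemma = leftLemma;
        this.rightLemma = rightLemma;
        this.confidence = confidence;
    }
}
```

An EDA implementation would typically combine several such components, for instance querying a lexical resource while computing features or distances over the JCas it receives.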
3 Details on the UIMA usage in EXCITEMENT

3.1 CAS for Entailment Problems

The input to any RTE system is a set of entailment problems, typically Text-Hypothesis pairs, each of which is represented in one CAS. Figure 3 shows a pictorial example of the CAS data structure for the example pair (T, H1) from Section 1. It contains the two text fragments (in two views) and their annotations (here, POS tags and dependencies), as well as global data such as generic metadata (e.g., language) and entailment-specific metadata (e.g., the gold-standard answer).

On the level of the CAS representation, we had to address two points: one is the representation of entailment problems in terms of CASes, the other is the type definitions.

Regarding the first point, general practice in text analysis use cases is to have one UIMA CAS correspond to one document. This suggests representing both text and hypothesis (including, if available, their document context) as separate CASes. However, we decided to store complete entailment problems as individual CASes, where each CAS has two named views (one view for the text, the other for the hypothesis). This approach has two major advantages: first, it enables us to represent cross-annotations between texts and hypotheses, notably alignments, which can be added by annotators. Second, it enables a straightforward extension from "simple" entailment problems (one text and one hypothesis) to "complex" entailment problems (one text and multiple hypotheses or vice versa, as in the "RTE search" task [2]).

Regarding the second point, we adopted the DKPro type system [5], which was designed with language independence in mind. It provides types for morphological information, POS tags, dependency structure, named entities, co-reference, and more. We extended the DKPro type system with the types necessary to define textual entailment-specific annotation. This involved types for marking stretches of text as texts and hypotheses, respectively, as well as for storing correspondence information between texts and hypotheses, pair IDs, gold labels, and some metadata. We also added types for linguistic annotation that are not exclusively entailment-specific but were not yet covered by DKPro. These include annotations for polarity, reference of temporal expressions, word and phrase alignments, and semantic role labels.

Details about the newly defined types can be found in the platform specification, and the type definition files are part of the platform code distribution.

3.2 Wrapping the Linguistic Annotation Pipeline

One decision that may be surprising at first glance is that we defined our own top-level Java interface for users of the LAP that hides UIMA's own runtime access methods. This interface dictates the common capabilities that all pipelines of the LAP should provide. The reason for this decision is twofold and pragmatic in nature, making transitioning to and using the EOP as easy as possible for developers.

The first aspect is the learning curve. We would like to avoid the need for Entailment Core developers to deeply understand UIMA AEs and Aggregate Analysis Engines (AAEs). We feel that a deep understanding of these points requires substantial effort but is not really to the point, since many EC developers will only want to use pre-existing LAPs. By making the UIMA aspect of the LAP transparent to the Entailment Core, EC developers do not need to know how the LAP works internally beyond knowledge of the (fairly minimal) LAP interface.
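To illustrate, here is a rough sketch, under our own naming assumptions, of what such a thin LAP wrapper could look like: a small interface that hides UIMA's runtime machinery from EC developers, together with an AE-based implementation of the kind encouraged further below, which builds a two-view CAS for a Text-Hypothesis pair and runs a list of Analysis Engines over it using uimaFIT. The interface, class, and view names are placeholders rather than the actual EOP API, and the sketch glosses over details such as mapping the AEs onto the two views.

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

/** Hypothetical minimal LAP contract: hides UIMA runtime details from EC developers. */
interface LinguisticAnalysisPipeline {
    /** Builds and annotates a CAS holding one Text-Hypothesis pair. */
    JCas generateTHPairCAS(String text, String hypothesis, String language) throws Exception;
}

/** Sketch of an AE-based implementation: assembling existing AEs costs almost no code. */
class AEBasedLAP implements LinguisticAnalysisPipeline {
    private final AnalysisEngineDescription[] engines;

    AEBasedLAP(AnalysisEngineDescription... engines) {
        this.engines = engines;
    }

    @Override
    public JCas generateTHPairCAS(String text, String hypothesis, String language)
            throws Exception {
        JCas jcas = JCasFactory.createJCas();

        // One CAS per entailment problem, with two named views (view names are placeholders).
        JCas textView = jcas.createView("TextView");
        textView.setDocumentText(text);
        textView.setDocumentLanguage(language);

        JCas hypothesisView = jcas.createView("HypothesisView");
        hypothesisView.setDocumentText(hypothesis);
        hypothesisView.setDocumentLanguage(language);

        // Run the configured annotators (tokenizer, POS tagger, parser, ...) over the CAS.
        // A real implementation would map each engine to the proper view (sofa mapping);
        // this sketch omits that step for brevity.
        SimplePipeline.runPipeline(jcas, engines);
        return jcas;
    }
}
```

A concrete pipeline could then be assembled simply by passing engine descriptions (e.g. created with uimaFIT's AnalysisEngineFactory.createEngineDescription) for a tokenizer, a POS tagger, and a parser to the constructor.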
Of course, the EC developers still need to understand the UIMA CAS very well.

The second aspect is migration cost. If the LAP pipelines were nothing but UIMA AEs, all analysis pipelines of existing RTE systems would have to be deeply refactored, which comes at a considerable cost. Our approach allows such analysis pipelines to be kept largely intact and merely surrounded by a wrapper that provides the required functionality and converts their output into valid UIMA CASes according to the EOP's specification.

Nevertheless, there are good reasons to encourage the use of AE-based LAPs: AE-based components are generally much more flexible, and they are very easy to assemble into AAE pipelines. Therefore, we encourage AE-based LAP development by providing ready-to-use code that implements our LAP interface, taking a list of AEs as input. Thus, if the individual components are already present as AEs, the implementation effort to assemble them into a LAP is near zero. In this sense, we see our LAP interface as a thin wrapper above UIMA with the purpose of enabling peaceful co-existence between UIMA and non-UIMA pipelines. In the long run, we also hope to contribute some new AEs back to the UIMA community.

4 Some Open Issues

In this section, we discuss two open questions that we are facing in future work.

CAS in non-UIMA environments. There is a considerable number of best-practice strategies for handling CAS objects (reset the data structure instead of creating a new one; use a CAS pool instead of generating multiple CASes; etc.). When a CAS is used in a UIMA context (i.e., in the LAP), it is not hard to guide developers to follow these rules. However, with the CAS being used as a general data container throughout the EOP, developers also often encounter CAS (JCas) objects outside specific UIMA contexts, and we have found it harder to guide them towards "proper usage". For example, one part of the EXCITEMENT project is concerned with the construction of Entailment Graphs [3], structured knowledge repositories whose vertices are statements and whose edges indicate entailment relations. Since the standard data structure for annotations is the JCas, the graph developers tend to add one JCas for each node. This is unproblematic for small graphs, but it becomes an issue once the graph grows: a CAS is a very large data structure, and its creation and deletion take some time. We are still trying to establish best practices for using CASes in non-UIMA EOP environments.

Annotation Styles: Hidden dependencies. One of the EOP design requirements was the clear separation of the LAP and the EC. This has been fairly well achieved, at least on a technical level. However, it is clear that there are still implicit dependencies between linguistic analysis tools and Entailment Core components. Consider the case of syntactic knowledge components such as DIRT-style paraphrase rules in the Entailment Core. Such components store entailment rules as pairs of partial dependency trees which have typically been extracted from large corpora. If the corpus used for rule induction was parsed with a different parser than the current entailment problem, then matching the sentence against the rule base will result in missing rules, due to differences in the analysis style. Note that this implicit dependency does not break the UIMA pipeline, since it does not involve the use of a novel type system, but rather differences in the interpretation of shared types.
We are currently investigating what types of "style differences" can be observed in the output of actual annotators.

5 Conclusion

In this paper, we have provided an overview of the EXCITEMENT Open Platform architecture and its adoption of UIMA. We have adopted and adapted the UIMA CAS and the DKPro type system as a flexible, language-independent data container for Textual Entailment problems. UIMA also provides the backbone for the platform's LAP components. Several open issues remain to be resolved in the future, but the EXCITEMENT project has already profited substantially from the abstractions that UIMA offers as well as from the integration of existing components from the UIMA community.

The first version of the EXCITEMENT Open Platform has been completed,² with three fully running RTE systems integrated with all core components and annotation pipelines. The platform currently supports three languages (German, Italian, and English), and it also ships with various tools and resources for TE researchers. We believe that the platform will become a valuable tool for researchers and users of Textual Entailment.

Acknowledgment. This work was supported by the EC-funded project EXCITEMENT (FP7 ICT-287923).

References

1. Androutsopoulos, I., Malakasiotis, P.: A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research 38 (2010) 135–187
2. Bentivogli, L., Magnini, B., Dagan, I., Trang Dang, H., Giampiccolo, D.: The fifth PASCAL recognising textual entailment challenge. In: Proceedings of the TAC 2009 Workshop on Textual Entailment, Gaithersburg, MD (2009)
3. Berant, J., Dagan, I., Goldberger, J.: Learning entailment relations by global graph structure optimization. Computational Linguistics 38(1) (2012) 73–111
4. Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognising textual entailment challenge. In: Proceedings of the First PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, UK (2005)
5. Gurevych, I., Mühlhäuser, M., Müller, C., Steimle, J., Weimer, M., Zesch, T.: Darmstadt knowledge processing repository based on UIMA. In: Proceedings of the First Workshop on Unstructured Information Management Architecture at the Conference of the Society for Computational Linguistics and Language Technology, Tübingen, Germany (2007)
6. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic (2007) 177–180
7. Sammons, M., Vydiswaran, V., Roth, D.: Recognizing textual entailment. In: Bikel, D.M., Zitouni, I. (eds.): Multilingual Natural Language Applications: From Theory to Practice. Prentice Hall (2012)

² The platform has been released under an open source license, and all code and resources can be freely accessed via the project repository: http://hltfbk.github.io/Excitement-Open-Platform/project-info.html