Aid to spatial navigation within a UIMA annotation index

Nicolas Hernandez
Université de Nantes

Abstract. In order to support interoperability within UIMA workflows, we address the problem of accessing one annotation from another when the type system does not specify an explicit link between the two kinds of objects, but when a semantic relation between them can be inferred from a spatial relation which connects them. We discuss the limitations of the framework and briefly present the interface we have developed to support such navigation.

Keywords: Apache UIMA, Type System interoperability, Annotation Index, Spatial navigation

1 Introduction

One of the main ideas behind using document analysis frameworks such as Apache Unstructured Information Management Architecture1 (UIMA) [3] is to move away from handling the raw subject of analysis directly. The idea is to enrich the raw data with descriptions which can be used as a basis for the processing of subsequent components. In the UIMA framework, the descriptions are typed feature structures. The component developer defines a type system which describes the features of a type (a set of (attribute, typed value) pairs) as well as how the types are arranged together (through inheritance and aggregation relations). Annotations are feature structures attached to specific regions of documents.

In this paper, we address the problem of accessing one annotation from another when the type system does not specify an explicit link between the two kinds of objects, but when a semantic relation between them can be inferred from a spatial relation which connects them. This situation is an interoperability issue which can be encountered when developing a component (e.g. a term extractor) that uses analysis results produced by two components written by different developers (e.g. part-of-speech and lemma information held by distinct annotations over the same spans). In practice, most of the existing type systems define annotation types which inherit from the built-in uima.tcas.Annotation type [4, 5, 7]. This type contains begin and end features which attach the description to a specific region of the text being analysed. Thanks to these features, it is possible to cross-reference the annotations which extend this type.

1 http://uima.apache.org

We argue that the Apache UIMA Application Programming Interface (API) is not intuitive enough for a Natural Language Processing (NLP) developer, that it has restrictions which prevent complete and free navigation among the annotations, and that it constrains the way algorithms are developed. In section 2, we define the kind of spatial navigation we would like to perform within an annotation index. In section 3, we describe the Apache UIMA solutions for indexing annotations and exploring the indexes. In section 4, we discuss the API and show its limitations for accessing annotations through spatial relations. Finally, in section 5, we briefly present the structures and the interface we have developed to support such spatial navigation.

2 The spatial navigation problem

By spatial relations we mean that the annotations in a text can be located in a two-dimensional space: one axis represents the position along the text and an orthogonal axis represents the degree of covering between annotations. Indeed, annotations can cover, be covered by, precede, or follow (contiguously or not) other annotations.
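As an illustration (a minimal sketch under our own naming, not part of the UIMA API), these spatial relations can be expressed directly in terms of the begin and end features of the built-in uima.tcas.Annotation type:

    import org.apache.uima.jcas.tcas.Annotation;

    // Illustrative helpers (the class and method names are ours) expressing the
    // spatial relations in terms of the built-in begin/end features.
    public final class SpatialRelations {

        // a covers b when the span of b lies within the span of a
        public static boolean covers(Annotation a, Annotation b) {
            return a.getBegin() <= b.getBegin() && b.getEnd() <= a.getEnd();
        }

        // a precedes b (contiguously or not) when a ends before or where b begins
        public static boolean precedes(Annotation a, Annotation b) {
            return a.getEnd() <= b.getBegin();
        }

        // a and b occur at exactly the same span
        public static boolean sameSpan(Annotation a, Annotation b) {
            return a.getBegin() == b.getBegin() && a.getEnd() == b.getEnd();
        }
    }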
The spatial configuration may inform about the semantic relations. Two annotations at the same span may mean that they are different aspects of the same object: they can have complementary features, or one of them can be a property of the other. One annotation covering some others may mean that the former is made of the others and, conversely, that the others are part of the former. The semantic interpretation of the spatial relations may depend on the linguistic paradigm under consideration.

To give examples of the situations we are dealing with, let us assume the following type system: Document, Source (information about the document such as its URI), Sentence, Chunk, Word (with a feature whose value gives the lemma), POS (with a feature whose value gives the part-of-speech) and NamedEntity. Let us also assume that these types hold no explicit references to each other and that there is no inheritance relation between them. In that context, examples of accesses we would like to carry out are: the words of a given sentence (which can be interpreted as a made of relation); the sentence of a given word (is part of relation); the words which are verbs (is a property of relation); the named entities followed by a word which is a verb and has the lemma visit (followed by relation); and so on. In other words, we would like to be able to navigate within an annotation index from an annotation to its covering/covered annotations, or to the spatially following/preceding annotation of a given type having given properties. We define this problem as a navigation problem within an annotation index.

3 Accessing the annotations in the UIMA framework

The problem of accessing the annotations depends on the means offered by the framework2 to build annotation indexes and to navigate within them.

2 See the Reference Guide http://uima.apache.org/d/uimaj-2.4.0/references.html and the Javadoc http://uima.apache.org/d/uimaj-2.4.0/apidocs.

3.1 Defining feature structure indexes

Adding a feature structure to a subject of analysis corresponds to the act of indexing the feature structure. By default, an unnamed built-in bag index exists which holds all the feature structures that are indexed. The framework also defines a built-in annotation index, called AnnotationIndex, which automatically indexes all feature structures of type (and subtypes of) uima.tcas.Annotation. As reported in the documentation, "the index sorts annotations in the order in which they appear in the document. Annotations are sorted first by increasing begin position. Ties are then broken by decreasing end position (so that longer annotations come first). Annotations that match in both their begin and end features are sorted using a type priority". If no type priority is defined in the component descriptor3, the order of the annotations sharing the same span in the text is undefined in the index. The UIMA API provides getAnnotationIndex methods to get all the annotations of that index (subtypes of uima.tcas.Annotation) or only the annotations of a given subtype. The UIMA framework also allows the developer to define custom indexes and, possibly, to sort the feature structures within them.

3.2 Parsing the annotation index

The UIMA API offers several methods to parse the AnnotationIndex.
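Before detailing these methods, here is a minimal sketch of retrieving the built-in index and obtaining iterators from it (assuming a JCas instance jcas and the JCas cover class generated for the hypothetical Sentence type of section 2):

    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.text.AnnotationIndex;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    static void listSentences(JCas jcas) {
        // Built-in annotation index restricted to the Sentence type and its subtypes
        AnnotationIndex<Annotation> sentenceIndex = jcas.getAnnotationIndex(Sentence.type);

        // Ambiguous iterator: index order (increasing begin, decreasing end, type priority)
        FSIterator<Annotation> it = sentenceIndex.iterator();
        for (it.moveToFirst(); it.isValid(); it.moveToNext()) {
            Annotation sentence = it.get();
            // ... use the sentence annotation
        }

        // Unambiguous iterator: only spatially contiguous annotations are returned
        FSIterator<Annotation> contiguousIt = sentenceIndex.iterator(false);
    }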
Given an annotation index, the iterator method returns an object of the same name which allows moving to the first (respectively the last) annotation of the index, to the next (respectively the previous) annotation (depending on its position in the index), or to a given annotation of the index. It is also possible to get an unambiguous iterator to navigate among contiguous annotations of the text. In practice, this iterator works by successively getting the first annotation in the index whose begin value is higher than the end of the current one. We will call this mechanism the first-contiguous-in-the-index principle.

The subiterator method returns an iterator whose annotations fall within the span of another annotation. It is possible to specify whether the returned annotations should be strictly covered (i.e. both their begin and end offsets fall within the span) or whether only their begin offset must. A subiterator can also be unambiguous. Annotations at the same span may not be returned, depending on their order in the index and on the type priority definition.

The constrained iterator allows iterating over the feature structures which satisfy given constraints. The constraints are objects that can test the type of a feature structure, or the type and the value of its features.

The tree method returns an AnnotationTree structure whose nodes represent the result of recursively applying a strict, unambiguous subiterator over the span of a given annotation. The API offers methods to navigate within the tree from the root node. From any other node, it is possible to get the child nodes, the next or the previous sibling node, and the parent node.

3 In a UIMA workflow, a component is interfaced through a descriptor which indicates how to use the component.

4 Limitations of the UIMA framework

Table 1a shows the AnnotationIndex containing the analysis results of the data string "Verne visited the seaport of Nantes.\n". Annotations were initially added to the index in the following order: first the Document, then the Source, the Sentence, the Words, the POS, the Chunks and the NamedEntities.

Offsets | Annotations  | Covered text                           | LocatedAnnotations
(0,37)  | Document     | Verne visited the seaport of Nantes.\n | Document
(0,36)  | Sentence1    | Verne visited the seaport of Nantes.   | Sentence
(0,5)   | Word1        | Verne                                  | Word1 NamedEntity1 Chunk1 POS1
(0,5)   | NamedEntity1 | Verne                                  |
(0,5)   | POS1         | Verne                                  |
(0,5)   | Chunk1       | Verne                                  |
(0,0)   | Source       |                                        | Source
(6,13)  | Word2        | visited                                | Word2 Chunk2 POS2
(6,13)  | POS2         | visited                                |
(6,13)  | Chunk2       | visited                                |
(14,35) | Chunk3       | the seaport of Nantes                  | Chunk3
(14,25) | Chunk4       | the seaport                            | Chunk4
(14,17) | Word3        | the                                    | Word3 POS3
(14,17) | POS3         | the                                    |
(18,25) | Word4        | seaport                                | Word4 POS4
(18,25) | POS4         | seaport                                |
(26,35) | Chunk5       | of Nantes                              | Chunk5
(26,28) | Word5        | of                                     | Word5 POS5
(26,28) | POS5         | of                                     |
(29,35) | Word6        | Nantes                                 | Word6 NamedEntity2 POS6
(29,35) | NamedEntity2 | Nantes                                 |
(29,35) | POS6         | Nantes                                 |
(35,36) | Word7        | .                                      | Word7 POS7
(35,36) | POS7         | .                                      |

Table 1: An AnnotationIndex (a: columns Offsets, Annotations and Covered text) and its corresponding LocatedAnnotationIndex (b: column LocatedAnnotations). Both parts are aligned for comparison. Annotations and LocatedAnnotations are sorted in increasing order from the top of the table. Annotations are identified by their type and an index number.

4.1 Index limitations

The definition of an index is usually done in the component descriptor. A defined index can only contain one specific type (and its subtypes) of feature structures. So, to get an index made of two distinct types, the trick would be to declare them as subtypes of the same common type in the type system and to get the index of this super-type.
This can make the type system less consistent from a linguistic point of view, but it remains coherent with the UIMA approach of doing whatever you need in your component.

The framework also allows sorting the feature structures of a defined index, with some restrictions. The sorting key, which must be a feature of the indexed type, can only be a string or a numerical value, and only the natural ordering of such elements is available. There is no way to declare one's own comparator to set the order between two elements. To sort on a different kind of key, the developer has to fall back on the available mechanisms. In addition, the type system may need to be modified to add a feature playing the role of the sorting key, which can also make the type system less consistent.

4.2 Navigation limitations within an annotation index

Iterator. With an ambiguous iterator, the result of a move to the previous/next annotation in the index may not correspond to the annotation which spatially precedes/follows in the text; it can also be a covering or a covered one. In Table 1a, the annotation preceding Word3 in the index is the covering Chunk4. Unambiguous iterators force the methods to return only spatially contiguous annotations. In practice, the method does not always return the expected result. When called on the full annotation index, it starts from the first annotation in the index; in Table 1a, it returns only the Document annotation and no further annotation. When calling an unambiguous iterator on a typed annotation index, the effect of the first-contiguous-in-the-index principle is noticeable whenever some annotations occur at the same span: the developer has no access to all the annotations which actually follow/precede the current annotation spatially. In Table 1a, an unambiguous iteration over the Chunk type returns Chunk1, Chunk2 and Chunk3; Chunk4 and Chunk5 are not reachable.

To iterate unambiguously over annotations of distinct types (e.g. NamedEntity and POS to get the named entities followed by a verb), the developer has to create a super-type over them and call the iterator method on this super-type. The super-type may not be linguistically consistent, and the iterator will still suffer from the limitation previously mentioned. Another drawback of the unambiguous iterator can be noticed when iterating over an index in reverse order: if two overlapping annotations precede the current one, the one returned is the one whose begin offset is the smallest, and not the one with the highest end value lower than the begin value of the current one, because the iterator follows the first-contiguous-in-the-index principle of the normal order. Finally, the API does not allow iterating over the index and over the text space at the same time: it is not possible to switch from an ambiguous iterator to an unambiguous one (and vice versa).

Subiterator. The subiterator is the kind of method used to get the annotations covered by another one, like the words of a given sentence. Its major drawback is that, without a type priority definition, there is no assurance that annotations occurring at the same text span will come in the expected conceptual order. In Table 1a, an ambiguous subiterator over each chunk annotation for getting the words returns the Source annotation for Chunk1, nothing for Chunk2, and the expected words (and more to filter) for all the remaining chunks.
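As a minimal sketch of the typical workaround (assuming the hypothetical Word and Sentence JCas types of section 2), the words of a sentence can be collected with the default ambiguous subiterator whose results are filtered by type:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    // Collect the Word annotations covered by a sentence: the default (ambiguous,
    // strict) subiterator is used and its results are filtered explicitly.
    static List<Word> wordsOf(JCas jcas, Sentence sentence) {
        List<Word> words = new ArrayList<Word>();
        FSIterator<Annotation> sub = jcas.getAnnotationIndex().subiterator(sentence);
        while (sub.hasNext()) {
            Annotation a = sub.next();
            if (a instanceof Word) {    // filter out Source, POS, Chunk, ...
                words.add((Word) a);
            }
        }
        return words;
    }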
Concerning the unambiguous subiterator, the first-contiguous-in-the-index principle hides some annotations. In Table 1a, when applying an unambiguous subiterator over each chunk, Chunk1 and Chunk2 return the same bad results as previously, Chunk3 returns only the Chunk4 and Chunk5 annotations, while Chunk4 and Chunk5 return the right word annotations. To subiterate unambiguously over a set of specific types, a super-type encompassing both the covered and the covering types has to be defined in the type system, but the problem of the unambiguous iteration remains.

Constraints. Constraint objects aim at testing one given feature structure at a time. The framework does not allow defining dynamic constraints, which means that the values to test cannot be instantiated relatively to a feature structure in the index. For instance, a constraint cannot be set to select the annotations whose begin feature value is higher than the end feature value of another annotation; rather, the exact value to compare against has to be specified when the constraint is created. Constraint objects are also complex to understand and to set up: it takes, for example, seven lines of code to create an iterator which returns the annotations having a lemma feature. Constraint iterators remain iterators, with the same limitations.

Tree. The tree method returns an object close to the kind of structure we would like to navigate within. Unfortunately, it can only give the children of a covering annotation. So, to get the parent of an annotation, a trick could be to build the tree of the whole document by taking the most covering annotation as the root, then to browse the tree until the desired annotation is found, and finally to get its parent. But in any case, there is no way to access a node directly, and the structure still suffers from the remarks we made about unambiguous subiterators (consequently, some annotations may not be present in the tree).

Missing methods. The existing methods only partially answer the problem, and some navigation methods are missing. There is no dedicated method: to super-iterate, i.e. to get the annotations covering a given annotation; to move to the first/next annotation of a given type (respectively the last/previous annotation of a given type); or to get partially covering preceding or following annotations.

All these remarks lead developers to prefer ambiguous iterators and subiterators, even though this means writing more code to search the index backward/forward and more tests to filter the desired annotations.

5 Supporting the spatial navigation

To support spatial navigation among the annotations, we propose to index the annotations by their offsets in a structure called LocatedAnnotationIndex, and to merge the annotations occurring at the same span into a structure called LocatedAnnotation. Table 1b illustrates the transformation of the AnnotationIndex depicted in Table 1a into a LocatedAnnotationIndex. Figure 1 shows the spatial links which interconnect the LocatedAnnotations.

The LocatedAnnotationIndex is a sorted structure which follows the same sorting order as the AnnotationIndex: from a given LocatedAnnotation, the covering and preceding LocatedAnnotations are located backward in the index, and the covered and following LocatedAnnotations forward in the index. The structure gives direct access to a LocatedAnnotation from a pair of begin/end offsets. The first characteristic of a LocatedAnnotation is to list all the annotations occurring at the same offsets. This removes the need to define a type priority to work around the limitation of the subiterator.
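For illustration only, a LocatedAnnotation can be sketched as follows; the field names are ours and do not necessarily match the actual uima-common implementation:

    import java.util.List;
    import org.apache.uima.jcas.tcas.Annotation;

    // Illustrative sketch of a LocatedAnnotation (field names are hypothetical).
    public class LocatedAnnotation {
        int begin;                           // shared begin offset
        int end;                             // shared end offset
        List<Annotation> annotations;        // all annotations occurring at (begin, end)

        // Position in the LocatedAnnotationIndex
        LocatedAnnotation previousInIndex;
        LocatedAnnotation nextInIndex;

        // Spatial vicinity in the text, computed when the index is built
        LocatedAnnotation parent;            // closest covering LocatedAnnotation
        List<LocatedAnnotation> children;    // covered LocatedAnnotations
        LocatedAnnotation preceding;         // contiguous preceding LocatedAnnotation
        LocatedAnnotation following;         // contiguous following LocatedAnnotation
    }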
The structure comes with several kinds of links to navigate both within the LocatedAnnotationIndex and spatially in the text. The structure has links to visit its spatial vicinity (the parent, children, following and preceding LocatedAnnotations), as well as links to access the previous/next element in the index. The contiguous spatial vicinity of each LocatedAnnotation is computed when the LocatedAnnotationIndex is built. The API also offers methods to dynamically search for a LocatedAnnotation containing annotations of a given type among the ancestors, the descendants or the element itself. Similarly, it is possible to search for the first/last (respectively following/preceding) LocatedAnnotation containing annotations of a given type.

In terms of memory consumption, the built LocatedAnnotationIndex takes approximately as much memory as its AnnotationIndex; only the local vicinity of each LocatedAnnotation is kept in memory. The CPU time for building the index depends on the AnnotationIndex size. Some preliminary tests indicate that the time increases by a factor of three when the size of the annotated text doubles. It takes about 2 seconds to build the index of a 50-sentence text analysed for sentences, chunks and words.

Fig. 1: Example of LocatedAnnotationIndex. The boxes represent the LocatedAnnotations; they are aligned on the text span they cover. The solid lines represent the spatial preceding/following relation while the dotted lines represent the parent/child relations. The parent is indicated by an arrow.

6 Related works

With the prospect of developing a pattern matching engine over annotations, [2] have addressed some design considerations for navigating annotation lattices. They have thereby proposed a language for specifying spatial constraints among annotations, and an engine has been implemented within the UIMA framework. Due to this technical choice, the design of the language and its implementation may suffer from the drawbacks we have enumerated: indeed, there is no example of patterns which involve annotation types without an inheritance relation. In addition, as pointed out in the authors' perspectives, it is not clear how the engine behaves when handling multiple annotations over the same spans without the guarantee of a consistent type priority. The LocatedAnnotation structure is a solution to the need for defining type priorities. More generally, the methods of our API can play the role of the navigation devices required for the development of a pattern matching engine.

uimaFIT4 is a well-known library which aims at simplifying UIMA development. One appealing navigation facility it offers is similar to our API: some methods are designed to move from one annotation to the closest (covering/covered/preceding/following) ones by specifying the type of the annotations to get. In practice, nevertheless, the implementation relies on the UIMA API and may inherit some of its restrictions. The selectFollowing method, for example, follows the first-contiguous-in-the-index mechanism: in Table 1a, when called successively to get the following chunk starting from the first chunk, it returns only Chunks 1 to 3 and misses the 4th and 5th.
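As a minimal sketch of this behaviour (with the hypothetical Chunk type of section 2; depending on the uimaFIT version, JCasUtil lives in the org.uimafit.util or the org.apache.uima.fit.util package):

    import java.util.Collection;
    import java.util.List;
    import org.apache.uima.jcas.JCas;
    import org.uimafit.util.JCasUtil;

    // Walk forward chunk by chunk with uimaFIT. According to the behaviour
    // reported above, on the index of Table 1a this loop visits Chunk1, Chunk2
    // and Chunk3 only, since the chunks overlapping the current one are skipped.
    static void walkChunks(JCas jcas) {
        Collection<Chunk> chunks = JCasUtil.select(jcas, Chunk.class);
        Chunk current = chunks.isEmpty() ? null : chunks.iterator().next(); // first chunk
        while (current != null) {
            // ... process the current chunk
            List<Chunk> following = JCasUtil.selectFollowing(jcas, Chunk.class, current, 1);
            current = following.isEmpty() ? null : following.get(0);
        }
    }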
7 Conclusion and perspectives

Solving the interoperability issues in the UIMA framework is a serious problem [1, 6]. Our position is to give developers the means to do what they want. We show that the UIMA API presents some limitations regarding the spatial navigation among the annotations of a text. We also show that, by adapting the problem definition to the framework requirements, the developer may succeed in accomplishing the task; but the adaptation has a cost in development time and requires skills in the framework. To overcome this problem, we have developed a library which transforms an AnnotationIndex into a navigable structure that can be used in a UIMA component. It is available in the uima-common project5. Our perspectives are twofold: reducing the processing time and adding a mechanism for updating the LocatedAnnotationIndex.

References

1. Ananiadou, S., Thompson, P., Kano, Y., McNaught, J., Attwood, T.K., Day, P.J.R., Keane, J., Jackson, D., Pettifer, S.: Towards interoperability of European language resources. Ariadne 67 (2011)
2. Boguraev, B., Neff, M.S.: A framework for traversing dense annotation lattices. Language Resources and Evaluation 44(3), 183–203 (2010)
3. Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering 10(3-4), 327–348 (2004)
4. Gurevych, I., Mühlhäuser, M., Müller, C., Steimle, J., Weimer, M., Zesch, T.: Darmstadt knowledge processing repository based on UIMA. In: First Workshop on UIMA at GSCL. Tübingen, Germany (2007)
5. Hahn, U., Buyko, E., Tomanek, K., Piao, S., McNaught, J., Tsuruoka, Y., Ananiadou, S.: An annotation type system for a data-driven NLP pipeline. In: The LAW at ACL 2007. pp. 33–40 (2007)
6. Hernandez, N.: Tackling interoperability issues within UIMA workflows. In: LREC. pp. 3618–3625 (2012)
7. Kano, Y., McCrohon, L., Ananiadou, S., Tsujii, J.: Integrated NLP evaluation system for pluggable evaluation metrics with extensive interoperable toolkit. In: SETQA-NLP. pp. 22–30 (2009)

4 http://uimafit.googlecode.com
5 https://uima-common.google.com