KAT: Enabling the Semantification of STEM Documents

Felix Schmoll (f.schmoll@jacobs-university.de)
Tom Wiesing (t.wiesing@jacobs-university.de)
Jacobs University Bremen

Abstract

As the body of mathematical knowledge grows constantly, it becomes more and more difficult for individuals to take full advantage of all the information available. To semantify documents, we need to be able to create annotations (marking definitions or declarations as well as usages of concepts) efficiently and conveniently on a large corpus of documents. Eventually this can be achieved automatically, but as a first step a gold standard has to be created by humans. KAT, the KWARC Annotation Tool, has the goal of allowing users to create, view and update annotations on arbitrary (X)HTML documents. We have presented our approach before; in this paper we give an update on our progress.

1 Introduction

STEM (Science, Technology, Engineering and Mathematics) documents often not only introduce the reader to new areas but also build heavily on previous knowledge. We want to make STEM documents, in particular mathematical documents, more accessible and lower the burden of gaining access to a new topic. Users should be able to navigate intuitively through existing knowledge so as to make the process of understanding more efficient. In the context of mathematical documents this requires annotating documents and marking up definitions, declarations and other linguistic phenomena. Furthermore, we want to mark up all usages of these concepts within the document to allow readers to, for example, click on a concept and navigate to its definition. To bring us closer to this goal it is necessary to do this on a large scale. As it is non-trivial to annotate a huge corpus of documents, one eventually wants to do this automatically, with only little human interaction.
During the development of tools to achieve this, it is however common to annotate a small subset of documents manually, creating a “gold standard” that can then be used as a basis for further development, either using automated machine learning approaches or smart rule-based software. Annotation tools to create, edit and view annotations are necessary here and can further be helpful during later phases of development in order to evaluate performance.¹ Additionally, they can be used by mathematical readers to interactively navigate through annotated content.

STEM documents usually consist of a multitude of content, ranging from pure text over mathematical/chemical formulae to tables and diagrams. Most common annotation tools, such as brat [BR], Yawas [YW], and Annotatie [AN], only work properly with textual content: they store the position of annotations as offsets in a character string. While some of them can work with more advanced content, they do so by treating it as a black box and replacing it with a placeholder. Such a treatment, however, prevents for example the annotation of sub-formulae, as any embedded formula is considered a single object.

¹ Indeed our tool has been used during the development of a declaration spotter in order to evaluate performance; we refer interested readers to [Sch16].

Copyright © by the paper’s authors. Copying permitted for private and academic purposes.

We have developed KAT, the KWARC Annotation Tool, to address and solve these problems. KAT is a browser-based tool that allows users to annotate arbitrary (X)HTML documents. In this paper we present its newest iteration, building on the previous versions presented in [DGKMMW14] and [GLKW15]. We proceed in Section 2 by giving a short review of the existing KAT system architecture.
In Section 3 we then proceed by giving a more detailed look into how annotations can be created in the browser and in particular focus on the progress we have achieved since the last iteration. We then conclude in Section 4 with an outlook on how we plan to develop the system further.

2 Recap of KAT system architecture

KAT is implemented as a JavaScript library working with (X)HTML5 documents and can be integrated into arbitrary websites with little effort. The basic system architecture can be seen in Figure 1. This architecture consists of four main components: i) the KAT Annotator, the KAT tool running in the browser; ii) the KAnnSpec, which serves as a description of iii) the Annotation ontology and iv) a document management system (here called CorTeX).

[Figure 1: The KAT System Architecture. Diagram labels: KAT Annotator, KAnnSpec, Annotation Ontology, CorTeX, Document Store ((X)HTML5), Annotation Store (RDF).]

Instead of describing this architecture in detail, we refer interested readers to [DGKMMW14] and [GLKW15]. KAT is capable of working as a standalone tool, however it is best used in the context of a corpus management system such as the CorTeX system [CT]. This scenario is intended for creating and improving a “gold standard”. Users receive a document from the document store in CorTeX and create annotations for it using KAT. These annotations are then sent back to the system in RDF form. Alternatively the users receive an already annotated document and check existing annotations. These two scenarios drive the KAT development.

KAT annotations are represented as RDF subject / predicate / object triples. The subjects are the text fragments that we are annotating whereas the objects are either concepts from the annotation ontology (in the case of classificational annotations) or other text fragments (in the case of relational annotations). KAT is not tied to a particular annotation ontology.
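As a minimal sketch of the triple representation described above, the following TypeScript models classificational and relational annotations. All type and function names, the example URIs, and the use of `rdf:type` as the predicate for classificational annotations are assumptions made for illustration; they are not taken from the KAT code base.

```typescript
// Hypothetical types for illustration only; none of these names come from KAT.
type FragmentURI = string; // URI identifying a text fragment of the document
type ConceptURI = string;  // URI of a concept in the annotation ontology

interface Triple {
  subject: FragmentURI;             // the annotated text fragment
  predicate: string;                // URI of the relation
  object: FragmentURI | ConceptURI; // a concept or another fragment
}

// Assumed predicate for "fragment is an instance of concept".
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

// Classificational annotation: the object is an ontology concept.
function classify(fragment: FragmentURI, concept: ConceptURI): Triple {
  return { subject: fragment, predicate: RDF_TYPE, object: concept };
}

// Relational annotation: the object is another text fragment.
function relate(from: FragmentURI, predicate: string, to: FragmentURI): Triple {
  return { subject: from, predicate, object: to };
}

// Serialize a triple as one N-Triples statement (all three terms are IRIs here).
function toNTriple(t: Triple): string {
  return `<${t.subject}> <${t.predicate}> <${t.object}> .`;
}
```

For instance, `classify("http://example.org/doc#f1", "http://example.org/onto#Definition")` would state that the (hypothetical) fragment `f1` is a definition, while `relate` could link a usage of a concept to the fragment that defines it.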
Instead, at startup KAT loads a set of annotation descriptions referred to as the KAT Annotation Specification, or KAnnSpec for short. These are a set of custom XML documents that describe the annotation ontology, its concepts and additional constraints for the KAT user interface.

As KAT is XHTML-based, we refer to text fragments using URIs. For this we make use of the XPointer framework [GMMW03] and have developed a custom XPointer scheme. Each text fragment is a contiguous range of elements in the DOM tree and can be identified by giving the first and last elements contained in it. This imposes the limitation that text fragments must consist of entire elements, making it very difficult to annotate linguistically meaningful concepts. In previous iterations of the system we worked around this by TEI-tokenizing the document, that is, adding elements around each word in the document. However, we are in the process of implementing an updated XPointer scheme that allows us to reference text ranges within elements, thereby eliminating the limitation entirely.

3 Editing Annotations In The Browser

KAT offers three different operating modes that can be selected via a sidebar. Each of them supports a different way of working with annotations:

1. Annotation Mode. This mode provides a user interface for creating and editing annotations on documents. It is kept separate from the modes for viewing and evaluating annotations, which allows specialized operations, e.g. on right-click.
2. Reading Mode. Once annotations have been created they can be used to obtain a better understanding of the structure of a mathematical document as a whole. This mode attempts to provide as much information as possible about all given annotations in an intuitive way.
3. Review Mode. When automating the creation of annotations, one wants to go over the created annotations and evaluate them.
This mode allows an evaluation of annotations by providing an interface for judging them as good (thumbs up) or bad (thumbs down).

3.1 Creating Annotations

As KAT is a browser-based tool, creating annotations is heavily form-oriented. A range of text is selected and then an annotation category is chosen from a right-click context menu. Subsequently a form appears in the sidebar where more specific information can be entered. An example of this can be seen in Figure 2.

Figure 2: Creating annotations in KAT.

It is possible to restrict input in the KAnnSpec, for example by providing a regular expression that a name has to match. To ensure that entered content is valid, an annotation can only be saved once all conditions are fulfilled. To indicate this visually, the color of a text field changes from red to green, and the save button becomes active only once the validation constraints are fulfilled. Additionally, the question mark right next to the input field can be used to obtain more information about these constraints (in this particular case the regular expression restricting the input format would be shown).

3.2 Visualizing Annotations And References

In Reading mode all annotations are displayed to the user. This information is conveyed by several means:

• Highlighting. Each concept from the KAnnSpec is assigned its own color and each annotation is highlighted in the appropriate color.
• Tooltip. When hovering over an annotation, information about its fields is displayed. The tooltip is generated using the