An Active Annotation Support System for Regulatory Documents

Andreas Korger, Joachim Baumeister
University of Würzburg, Am Hubland, D-97074 Würzburg

Abstract
Manual document annotation is a resource-intensive task. The costs of annotation can be lowered by pre-processing the available corpus and by actively supporting annotating users during the process. To integrate different components into a coherent active annotation support system, the XML Metadata Interchange (XMI) standard can be used to exchange objects on the basis of a meta-metadata model. Further, to integrate an existing knowledge graph into an annotation support system, the RDF query language SPARQL can be used as an interface to analyze existing documents and to declare new knowledge. In this manner, the presented efforts contribute to structuring and standardizing the process of manual knowledge acquisition from regulatory documents.

Keywords
Knowledge Management, Document Annotation, Meta-Metadata Models, SPARQL, Ontology Population, Natural Language Processing

1. Introduction

A core task in the field of knowledge management is to provide insight into documents related to the current problem situation. For example, when servicing industrial machines, the service technician needs quick access to the appropriate documentation. For internal knowledge management, large companies often provide access to regulatory compliance documents for use by their employees. This access is depicted conceptually on the left side of Figure 1. In these documents, the textual parts are usually annotated with (semantic) metadata in order to implement quick and problem-oriented access to information. This semantic metadata is in turn used by semantic search engines and navigation interfaces to provide quick and context-oriented access [1].
Whereas for some new documents the authoring process can be extended to include the creation of metadata, for legacy documents the metadata needs to be attached after the fact. Despite progress in natural language processing, information extraction, and ontology population, the attachment of metadata to document passages is often done manually in order to achieve a high annotation quality. However, the manual annotation of documents is a cumbersome and costly task, and available frameworks for general textual annotation lack active annotation support. In this work, we propose a semantic approach to actively annotate documents by integrating a domain-specific ontology into the annotation task. The domain-specific ontology represents prior domain knowledge and is used to integrate new knowledge collected in the active annotation process. The right side of Figure 1 shows a conceptual view of this approach, which reduces the effort and improves the quality of annotation compared to fully manual approaches.

LWDA'22: Lernen Wissen Daten Analysen, 2022, Hildesheim
a.korger@informatik.uni-wuerzburg.de (A. Korger); joba@uni-wuerzburg.de (J. Baumeister)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Figure 1: A conceptual view of annotating documents in knowledge management.

The rest of the paper is organized as follows: In Section 2 we describe the components of the proposed framework as well as their interaction. In Section 3 we present a case study using the system for the annotation of regulatory documents for nuclear safety. We finish the paper with related work, future work, and a concluding statement.

2.
Components of an Active Annotation Support System

In the following, we use an explanatory example from the domain of fire safety. A fire is an incident for which appropriate measures need to be taken. A fire extinguisher is such a measure, but so is a fire blanket. Such incidents and measures are the targets of discovery for semantic annotation in unknown documents. An active annotation support system (AASS) suggests such annotations to the user by recommending likely annotation features. It proposes automatically discovered annotations, so-called pre-annotations, on the basis of machine learning and natural language processing techniques. Such a system is called active if the choice made by the user for the current annotation is incorporated into the next annotation recommendation. In this context, the system can also decide which textual passages are presented to the user for annotation in order to optimize the overall performance. The annotations are made on the basis of a semantic model for regulatory knowledge represented in an ontology. The semantic model is populated with instances created during the annotation step, which make up a knowledge graph of applied regulatory knowledge. In this paper, an AASS for the knowledge management of regulatory documents is introduced, and the major components necessary for its implementation are pointed out and explained in detail. The workflow of bringing the different components of the architecture together in a performant and consistent manner is explained. Essentially, two processing environments are in use: first, the annotation frontend, in which the user works and performs manual annotations; second, the backend, which does background work such as file handling, provides NLP functionality, and manages consistency. A graphical view of the workflow and the incorporated components is presented in Figure 2; the components shown in the figure are explained in more detail in the following sections.
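The frontend/backend split implies a shared, serializable representation of annotations that both sides can exchange. A minimal sketch of such a record follows; the field names are illustrative stand-ins, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # Character offsets into the document text (begin inclusive, end exclusive).
    begin: int
    end: int
    type_name: str          # e.g. "Measure" or "Incident" from the type system
    source: str             # "manual" or "pre-annotation" (simple provenance)
    features: dict = field(default_factory=dict)

# A hypothetical suggestion produced by the backend for the frontend:
text = "Use a fire extinguisher in case of fire."
suggestion = Annotation(begin=6, end=23, type_name="Measure", source="pre-annotation")

# The offsets address the annotated span in the document text.
assert text[suggestion.begin:suggestion.end] == "fire extinguisher"
```

Because the record carries only offsets and typed features, the frontend can render it next to manual annotations, and the backend can later merge accepted suggestions back into the knowledge graph.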
Figure 2: A view of the main components and their workflow for the active annotation of regulatory documents, accessed by users over the frontend and processed in the backend.

2.1. Natural Language Processing Components

The handling of natural language text contained in regulatory documents requires appropriate techniques, namely to extract relevant entities and their relations [2]. For the processing of natural language text, a supporting NLP container is necessary that provides functionality such as text offset handling, tokenization, string matching, and rule matching. Each document is encapsulated in such a container on the backend side, shown as component (4) in Figure 2. The container provides an interface to the semantic knowledge contained in the regulatory domain knowledge ontology (1). It applies natural language processing steps to grant access to the document on the level of tokens and entities. Rule matching steps are applied to identify relevant known and unknown entities together with their relations. The document container is created from the document content (3). From the existing ontology structure (1), together with the annotations already saved in the corresponding knowledge graph (2), a hierarchical type system is derived and integrated into the annotation environment. For this purpose, the information has to be transferred into a format that can be consumed and displayed by the frontend tools. This type system holds the supportive information for the user: it represents the options a user has for annotating entities and their relations.
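The string-matching step of the NLP container can be pictured with a small sketch. The gazetteer and label below are hypothetical stand-ins for entries that would come from the domain ontology:

```python
import re

# Known entities from the domain ontology (hypothetical gazetteer).
KNOWN_MEASURES = ["fire extinguisher", "fire blanket"]

def pre_annotate(text, phrases, label):
    """Simple string matching that yields pre-annotations as offset spans."""
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    return [
        {"begin": m.start(), "end": m.end(), "label": label, "source": "pre-annotation"}
        for m in pattern.finditer(text)
    ]

doc = "A fire extinguisher or a fire blanket is an appropriate measure."
spans = pre_annotate(doc, KNOWN_MEASURES, "Measure")

# Each span can be verified against the text via its offsets.
assert [doc[s["begin"]:s["end"]] for s in spans] == ["fire extinguisher", "fire blanket"]
```

In the actual system, tokenization and rule matching (e.g. with spaCy, as used in the case study) would replace this bare regular-expression matching, but the output shape — typed spans with offsets — stays the same.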
The functionality of the annotation environment is used to retrieve feedback within the active learning step. As the presented architecture combines the strengths of diverse data models, it is important to guarantee the consistency of the data. For practical application, the documents need to be uniquely addressable, and different annotated versions of the same document need to be identifiable. Furthermore, the annotated metadata needs to be manageable; for instance, it is important to know which annotation was made by which annotating component. Therefore, a systematic model of unique identifiers is maintained. Additionally, meta information about the provenance of data is stored in the knowledge graph. The document container itself provides measures to maintain its own identifiers and assures at least the inherent (syntactical) correctness when transferring identifiers to the frontend. This container therefore needs to communicate with the annotation tool, for which a standard for the interchange of metadata is necessary. The pre-annotations (5) created by the NLP container are transferred to the document container, which forwards them to the frontend. There they are displayed in the same manner as manual annotations and can be accepted or adjusted by the user. The user might want to change the beginning and end of an automatic annotation, change the annotated features, or delete the whole annotation. The annotations of the user are communicated to the backend in the same manner as they are passed to the frontend.

2.2. Semantic Technology Components

Semantic components are modularized and stacked with the intention of re-using and adapting them to new regulatory domains. The foundation is an ontology (1) that represents the basic regulatory knowledge model. This model is instantiated, and the instances are aggregated in different knowledge graphs (2).
Semantic technologies are used for three different purposes. First, prior and learned domain knowledge is organized in the knowledge graph on the basis of the knowledge organization model. For this task we use the SKOS (Simple Knowledge Organization System) scheme [3] and build a domain-specific semantic model on top of it. The domain knowledge consists of entities that are described with labels and descriptions, as well as proprietary relations between them. These are the entities and relations the user wants to discover and annotate in the available document corpus. Second, a set of extraction patterns is matched against the corpus to identify unknown entities and relations. All discovered information units are saved in the knowledge graph together with their annotations. An annotation made in a document on the basis of the domain knowledge signifies, for instance, that a certain entity occurs at this position in the document. These automatic annotations are complemented by the manual annotations of the user. A large corpus quickly entails a large number of annotations. This data has to be queryable to give humanly perceivable insight for assessing the quality of the annotations and for using the discovered knowledge. Additionally, structured data storage in this manner allows for semantic search in the annotated data [2]. Third, provenance information is stored in the knowledge graph to organize, for instance, different authors, different documents, and experiments. We chose the Provenance Ontology (PROV-O) [4] as a base on which to build a domain-specific system.

2.3. Textual Similarity Assessment

What distinguishes the annotation of regulatory documents from the annotation of general documents? We see the difference in the availability of a semantic model coding prior knowledge, founded on several previous case studies.
Further, this provides the basis for constructing a domain-independent NLP engine exploiting textual phenomena that occur especially in regulatory texts. The combination of both facilitates the annotation of unknown regulatory texts. For instance, having taxonomies of regulatory domain-specific entities available allows for recommending annotation features based on generalization and specialization in the neighborhood defined by the topology of the taxonomy [5]. Knowing the domain-specific relations, and which textual indicators in regulatory language point to them, simplifies the discovery of specific entities. Supported by regulatory semantic prior knowledge, the "active" component of the annotation system can also exploit the user's annotations efficiently. For instance, if the user has annotated a relation but not the entities, the recommendation of fitting entities becomes possible due to the domain knowledge. Take the fire safety example "... detecting and extinguishing quickly those fires which do start ...". We already know that a fire extinguisher is a measure against fires; this can be exploited by textual similarity assessment to identify that the verb extinguishing itself also denotes a measure. Additionally, extinguishing is connected with detecting via the conjunction and, which allows the inference that detecting is also a measure and should be annotated as one. The adverb quickly can be added to create the more specific entities detecting quickly and extinguishing quickly. These would be stored in the knowledge graph in nominalized form as the measures quick detection and quick extinguishment, as well as the more general entities detection and extinguishment. One additional benefit of having domain knowledge available is that specific similarity measures can be created, either on the basis of a language model like BERT or with a feature-based similarity measure [6].
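A simple feature-based similarity measure of the kind mentioned above can be sketched as word-set overlap (Jaccard); a BERT-based variant would replace this with cosine similarity of embeddings. The example phrases are invented for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Feature-based similarity over word sets (one simple choice of features)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

known = "extinguishing fires quickly"                      # already classified as a measure
candidate = "detecting and extinguishing quickly those fires"
unrelated = "annual report of the agency"

# The candidate passage is closer to the known measure than an unrelated one,
# so it is a better target for an annotation recommendation.
assert jaccard(known, candidate) > jaccard(known, unrelated)
```

In practice such scores would be computed between unknown passages and all classified passages, and the highest-scoring known annotations would drive the recommendation.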
In the case of the language model approach, a pre-trained model is used; a model can also be trained on the specific textual data if sufficient data is available. These similarity measures are then used to identify unknown textual passages that are similar to known textual passages that have already been classified. In the following section we describe an implementation of all components and report on a case study in the domain of nuclear safety.

3. Case Study - Active Learning in the Domain of Nuclear Safety Regulations

Worldwide, several governmental and non-governmental authorities provide regulatory compliance documents to assure nuclear safety. The scope reaches from recommendations for the safe operation of nuclear power plants to the safe execution of X-ray radiography. These documents are a valuable source of knowledge. To access this knowledge in a systematic manner, metadata needs to be available within the documents. This metadata has to be created by manual or automatic textual annotation. The information about how the regulatory metadata is structured is so-called meta-metadata. In parallel with the annotation process, this knowledge about the structure of the metadata is improved. To evaluate and improve the approach of active annotation support explained above, a case study in the domain of nuclear safety was carried out. The framework was used to annotate selected documents from a corpus of 143 regulatory documents provided by the International Atomic Energy Agency (IAEA) [7]. A task in this context is the annotation of potential incidents with the corresponding safety measures as recommended in a document. For instance, the phrase "manual fire fighting" is annotated as a measure to react to a discovered "fire". Subsequently, it is desirable to provide this information in the next annotation step.
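One way to carry an accepted annotation into subsequent steps is to suggest further occurrences of the same phrase across the corpus. A sketch follows; the document identifiers and texts are invented:

```python
def propose_occurrences(corpus_docs, accepted_phrase, label):
    """After a user accepts an annotation, suggest the same phrase
    wherever it occurs in the remaining documents (case-insensitive)."""
    suggestions = []
    needle = accepted_phrase.lower()
    for doc_id, text in corpus_docs.items():
        low = text.lower()
        start = 0
        while (i := low.find(needle, start)) != -1:
            suggestions.append((doc_id, i, i + len(needle), label))
            start = i + 1
    return suggestions

docs = {
    "d1": "Manual fire fighting shall be initiated promptly.",
    "d2": "The plan covers manual fire fighting and evacuation.",
}
s = propose_occurrences(docs, "manual fire fighting", "Measure")
assert [(d, docs[d][b:e]) for d, b, e, _ in s] == [
    ("d1", "Manual fire fighting"), ("d2", "manual fire fighting")]
```

The alternative "on the fly" mode would defer these suggestions and only surface them when the expert reaches the corresponding passage.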
This can be done either by annotating all occurrences of the phrase "manual fire fighting" or "on the fly" by providing the available information to the domain expert as a likely annotation to choose from. In this manner the knowledge is continuously improved with every step of manual annotation. The overall view of the tools and techniques used to implement the whole system is depicted in Figure 3.

Figure 3: A view of the tools and techniques used to implement an active annotation support system for regulatory documents in the domain of nuclear safety. [Figure labels: KnowWE, webATHEN, UIMA, spaCy, SKOS, PROV, SPARQL, XMI.]

3.1. The UIMA CAS Object and its Serialized Representation

Apache UIMA (Unstructured Information Management Architecture) is a framework for the management of unstructured information with the goal of structuring it through a processing pipeline of annotation steps [8]. In this setup, UIMA is used as the document container in the backend, depicted as component number three. UIMA provides functionality to hold the textual content of the document and the information about entities and their relations, and it is capable of transferring this information into a serialized data format. For each document, a so-called CAS object (Common Analysis Structure) is created, with access to the type system schema necessary for the serialization and the communication to the frontend (5). The CAS object maintains the correctness of indices and the corresponding correctness of the serialized format. The XMI standard can be used for the serialized exchange of data objects on the basis of meta-metadata models [9].
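The index bookkeeping of the CAS object can be pictured with a minimal sketch. This is not the UIMA API, only an illustration of the invariant it maintains: the container owns the sofa (document text) and refuses annotations whose offsets do not address a valid span of it.

```python
class MiniCas:
    """Toy stand-in for a CAS: a sofa string plus offset-checked annotations."""

    def __init__(self, sofa_string: str):
        self.sofa_string = sofa_string
        self.annotations = []

    def add_annotation(self, begin: int, end: int, type_name: str, **features):
        # Reject spans that do not address the sofa string correctly.
        if not (0 <= begin < end <= len(self.sofa_string)):
            raise ValueError(f"invalid span [{begin}, {end})")
        self.annotations.append(
            {"begin": begin, "end": end, "type": type_name, **features})

    def covered_text(self, ann: dict) -> str:
        return self.sofa_string[ann["begin"]:ann["end"]]

cas = MiniCas("A fire extinguisher is a reactive measure to a fire.")
cas.add_annotation(2, 19, "Measure")
assert cas.covered_text(cas.annotations[0]) == "fire extinguisher"
```

Serialization to XMI then only has to write out data that is already guaranteed to be index-consistent, which is what makes the round trip to the frontend safe.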
The content of an exemplary XMI file can be seen in Listing 1. The file holds the document content with its MIME type in the sofa string (Subject of Analysis), as well as two entities with the attributes measureRoot and incidentRoot and the relation between them with the label isReactiveMeasureTo.