An Active Annotation Support System for Regulatory
Documents
Andreas Korger1 , Joachim Baumeister1
1 University of Würzburg, Am Hubland, D-97074 Würzburg


                  Abstract
                  Manual document annotation is a resource-intensive task. The costs of annotation can be lowered by
                  supporting the manual annotation with pre-processing of the available corpus and active in-process
                  support of the annotating users. To integrate different components into a coherent active annotation
                  support system, the XML Metadata Interchange standard can be used to exchange objects on the basis
                  of a meta-metadata model. Further, to integrate an existing knowledge graph into an annotation support
                  system, the RDF query language SPARQL can be used as an interface to analyze existing documents and
                  declare new knowledge. In this manner, the presented efforts contribute to structuring and standardizing
                  the process of manual knowledge acquisition from regulatory documents.

                  Keywords
                  Knowledge Management, Document Annotation, Meta-Meta Data Models, SPARQL, Ontology Population,
                  Natural Language Processing




1. Introduction
A core task in the field of knowledge management is to provide insight to documents related
to the current problem situation. For example, for servicing industrial machines, the service
technician will need quick access to the appropriate documentation. For internal knowledge
management, large companies often provide access to regulatory compliance documents for
use by their employees. This access is depicted conceptually on the left side of Figure 1. In these
documents, the textual parts are usually annotated with (semantic) metadata in order to implement
quick and problem-oriented access to information. This semantic metadata is in turn used
by semantic search engines and navigation interfaces to provide quick and context-oriented
access [1]. Whereas for new documents the authoring process can be extended to include the
creation of metadata, for legacy documents the metadata has to be attached afterwards.
Despite the progress in natural language processing, information extraction, and ontology
population, the attachment of metadata to document passages is often done manually in order
to achieve a high annotation quality. However, the manual annotation of documents is a
cumbersome and costly task.
   Available frameworks for general textual annotation lack active annotation support. In this
work, we propose a semantic approach to actively annotate documents by integrating a domain-specific
ontology into the annotation task. The domain-specific ontology represents prior domain knowledge
and is used to integrate new knowledge collected during the active annotation process.

LWDA’22: Lernen Wissen Daten Analysen, 2022, Hildesheim
✉ a.korger@informatik.uni-wuerzburg.de (A. Korger); joba@uni-wuerzburg.de (J. Baumeister)
    © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
[Figure 1 shows, on the left, users with semantic access and annotations of documents and, on the right, the active annotation support system combining domain knowledge and an annotation engine to produce the annotated document corpus.]

Figure 1: A conceptual view of annotating documents in knowledge management.


The right side of Figure 1 shows a conceptual view of this approach, which reduces the annotation
effort and improves the annotation quality compared to fully manual approaches.
   The rest of the paper is organized as follows: In Section 2 we describe the components of the
proposed framework as well as their interaction. In Section 3 we present a case study using the
system for the annotation of regulatory documents for nuclear safety. We finish the paper with
related work, future work, and a concluding statement.


2. Components of an Active Annotation Support System
In the following, we use an explanatory example from the domain of fire safety. A fire is an
incident for which appropriate measures need to be taken. A fire extinguisher is such a measure,
as is a fire blanket. Such incidents and measures are in the
scope of discovery for semantic annotation in unknown documents. An active annotation
support system (AASS) suggests such annotations to the user with the recommendation of likely
annotation features. It proposes automatically discovered annotations, so-called pre-annotations,
on the basis of machine learning and natural language processing techniques. Such a system
is called active if the choice made by the user for the current annotation is incorporated into
the next annotation recommendation. In this context, the system can also decide which textual
passages are presented to the user for annotation to optimize the overall performance. The
annotations are done on the basis of a semantic model for regulatory knowledge represented in
an ontology. The semantic model is populated with instances created during the annotation
step which makes up a knowledge graph of applied regulatory knowledge.
   In this paper, an AASS for the knowledge management of regulatory documents is introduced;
the major components necessary for its implementation are pointed out and explained in detail.
The workflow of bringing the different components of the architecture together in a performant
and consistent manner is explained. Basically, two processing environments are in use:
first, the annotation frontend, in which the user works and performs manual annotations; second,
the backend, which does background work such as file handling, provides NLP functionality, and
manages consistency. A graphical view of the workflow and the incorporated components is
presented in Figure 2; the components shown in the figure are explained in more detail in the
following sections.

[Figure 2 shows the annotation frontend used by the users, with semantic access, pre-annotations (5), and the semantic type system (6), and the backend engine with the regulatory domain knowledge ontology (1), the regulatory knowledge graphs (2) holding provenance information, document structure, and document annotations, the document container (3), and the NLP container (4); the output is the annotated document corpus.]

Figure 2: A view of the main components and their workflow for the active annotation of
regulatory documents accessed by users over the frontend and processed in the backend.


2.1. Natural Language Processing Components
The handling of natural language text contained in regulatory documents requires appropriate
techniques, namely to extract relevant entities and their relations [2]. For the processing
of natural language text a supporting NLP container is necessary that provides functionality
like text offset handling, tokenization, string matching, and rule matching. Each document
is encapsulated in such a container on the backend side, shown as component (4) in Figure 2.
The container provides an interface to the semantic knowledge contained in the
regulatory domain knowledge ontology (1). It applies natural language processing steps to
grant access to the document on a level of tokens and entities. Rule matching steps are applied
to identify relevant known and unknown entities together with their relations.
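To illustrate this step, the following sketch shows how ontology labels could be matched against a document with spaCy (the library shown in the NLP container in Figure 3) to produce pre-annotations; the labels and phrases are hypothetical fire safety examples, not the actual ontology content.

import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline is sufficient for tokenization and phrase matching.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical entity labels and phrases taken from the fire safety example.
ontology_phrases = {
    "Measure": ["fire extinguisher", "fire blanket", "manual fire fighting"],
    "Incident": ["fire"],
}
for label, phrases in ontology_phrases.items():
    matcher.add(label, [nlp.make_doc(p) for p in phrases])

def pre_annotate(text: str):
    """Return pre-annotations as (label, start_char, end_char, surface) tuples."""
    doc = nlp(text)
    annotations = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        annotations.append((nlp.vocab.strings[match_id],
                            span.start_char, span.end_char, span.text))
    return annotations

print(pre_annotate("A fire extinguisher is a measure against a fire."))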
   The document container is created with the input of the document content (3). From the
existing ontology structure (1), together with the already available annotations saved in the
corresponding knowledge graph (2), a hierarchical type system is derived that is integrated
into the annotation environment. Therefore, the information has to be transferred into a format
that can be consumed and displayed by the frontend tools. This type system holds the supportive
information for the user: it represents the options a user has for annotating entities and their
relations. The functionality of the annotation environment is used to retrieve feedback within
the active learning step. As the presented architecture picks the best of diverse data models,
it is important to guarantee data consistency. For the practical application, the documents need
to be uniquely addressable, and different annotated versions of the same document need to be
distinguishable. Furthermore, the annotated metadata
needs to be manageable. For instance, it is important to know which annotation was done by
which annotating component. Therefore, a systematic model of unique identifiers is maintained.
Additionally, meta information about the provenance of data is stored in the knowledge graph.
The document container itself provides measures to maintain its own identifiers and assures at
least the inherent (syntactical) correctness when transferring identifiers to the frontend.
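As an illustration of the type system derivation mentioned above, the following sketch shows, under simplifying assumptions, how a hierarchical type system could be read from the SKOS-based knowledge graph with rdflib; the file name and concept labels are hypothetical.

from rdflib import Graph
from rdflib.namespace import SKOS

# Load the regulatory domain knowledge graph (file name is hypothetical).
g = Graph()
g.parse("regulatory_knowledge.ttl", format="turtle")

# Derive parent/child pairs of the annotation type system from skos:broader.
type_hierarchy = {}
for child, _, parent in g.triples((None, SKOS.broader, None)):
    child_label = g.value(child, SKOS.prefLabel)
    parent_label = g.value(parent, SKOS.prefLabel)
    type_hierarchy.setdefault(str(parent_label), []).append(str(child_label))

# The resulting hierarchy can then be transferred to the frontend as its type system.
for parent, children in type_hierarchy.items():
    print(parent, "->", children)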
   This container therefore needs to communicate with the annotation tool, which requires a
standard for the interchange of metadata. The pre-annotations (5) created by the NLP container
are transferred to the document container, which forwards them to the frontend. They are
displayed in the same manner as manual annotations and can be accepted or adjusted by the user.
The user might change the beginning and ending of an automatic annotation, adjust the annotated
features, or delete the whole annotation. The annotations of the user are communicated to the
backend side in the same manner as they are passed to the frontend.

2.2. Semantic Technology Components
Semantic components are modularized and stacked with the intention to re-use and adapt
them to new regulatory domains. The foundation is an ontology (1) that represents the basic
regulatory knowledge model. This model is instantiated and instances are aggregated in different
knowledge graphs (2). Semantic technologies are used for three different purposes:
   First, prior and learned domain knowledge is organized in the knowledge graph on the
basis of the knowledge organization model. For this task we use the SKOS (Simple Knowledge
Organization System) vocabulary [3] and build a domain-specific semantic model on top of it. The
domain knowledge consists of entities that are described with labels and descriptions as well
as proprietary relations between them. These are the entities and relations the user wants to
discover and annotate in the available document corpus.
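For illustration, the fire safety entities mentioned above could be declared in such a SKOS-based model as sketched below; the namespace is a hypothetical example, and only the relation label isReactiveMeasureTo is taken from the example discussed later in this paper.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/firesafety#")  # hypothetical namespace
g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

# Entities with labels ...
g.add((EX.Fire, RDF.type, SKOS.Concept))
g.add((EX.Fire, SKOS.prefLabel, Literal("fire", lang="en")))
g.add((EX.FireExtinguisher, RDF.type, SKOS.Concept))
g.add((EX.FireExtinguisher, SKOS.prefLabel, Literal("fire extinguisher", lang="en")))

# ... and a proprietary relation between them.
g.add((EX.FireExtinguisher, EX.isReactiveMeasureTo, EX.Fire))

print(g.serialize(format="turtle"))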
   Second, a set of extraction patterns is matched to the corpus to identify unknown entities and
relations. All these discovered information units are saved in the knowledge graph together with
their annotations. An annotation made in a document on the basis of the domain knowledge
signifies, for instance, that a certain entity occurs at this position in the document. These
automatic annotations are complemented by the manual annotations of the user. A large corpus
quickly entails a large number of annotations. This data has to be queryable so that humans
can inspect it, assess the quality of the annotations, and make use of the discovered knowledge.
Additionally, a structured data storage in this manner allows for
semantic search in the annotated data [2].
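The following sketch indicates how such annotation data could be queried with SPARQL; the property names and the graph layout are simplified assumptions about the knowledge graph, not its actual schema.

from rdflib import Graph

g = Graph()
g.parse("annotation_graph.ttl", format="turtle")  # hypothetical file

# Retrieve all annotated occurrences of the concept "fire extinguisher".
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ex:   <http://example.org/firesafety#>

SELECT ?document ?begin ?end WHERE {
    ?annotation ex:annotates   ?concept ;
                ex:inDocument  ?document ;
                ex:beginOffset ?begin ;
                ex:endOffset   ?end .
    ?concept skos:prefLabel "fire extinguisher"@en .
}
"""
for row in g.query(query):
    print(row.document, row.begin, row.end)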
   Third, provenance information is stored in the knowledge graph to organize, for instance,
different authors, different documents, and experiments. We chose the Provenance Ontology
(PROV-O) as a base on which to build a domain-specific system [4].
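A minimal sketch of how provenance information could be attached to an annotation on the basis of PROV-O is given below; the agent and activity identifiers are hypothetical.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, PROV, XSD

EX = Namespace("http://example.org/provenance#")  # hypothetical namespace
g = Graph()
g.bind("prov", PROV)

# An annotation is modelled as an entity that was generated by an annotation
# activity and attributed to a human annotator or an automatic component.
g.add((EX.annotation_42, RDF.type, PROV.Entity))
g.add((EX.annotation_42, PROV.wasAttributedTo, EX.annotator_jane))
g.add((EX.annotation_42, PROV.wasGeneratedBy, EX.annotationSession_7))
g.add((EX.annotationSession_7, RDF.type, PROV.Activity))
g.add((EX.annotationSession_7, PROV.startedAtTime,
       Literal("2022-06-01T10:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))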

2.3. Textual Similarity Assessment
What distinguishes the annotation of regulatory documents from the annotation of general
documents? We see the difference in the availability of a semantic model encoding prior knowledge,
founded on several previous case studies. Further, this provides the basis for constructing a
domain-independent NLP engine that exploits textual phenomena occurring especially in regulatory
texts. The combination of both facilitates the annotation of unknown regulatory texts.
For instance, having taxonomies of regulatory domain-specific entities available allows for
recommending annotation features based on generalization and specialization in the neighborhood
defined by the topology of the taxonomy [5]. Knowing the domain-specific relations and which
textual indicators in regulatory language point to them simplifies the discovery of specific entities.
   Supported by regulatory semantic prior knowledge, the “active” component of the
annotation system can also exploit the user's annotations efficiently. For instance, if the user
has annotated a relation but not the entities, the recommendation of fitting entities becomes
possible due to domain knowledge. Consider the fire safety example “... detecting
and extinguishing quickly those fires which do start ...”. We already know that a fire extinguisher
is a measure against fires; this can be exploited by textual similarity assessment to identify that
the verb extinguishing itself also denotes a measure. Additionally, extinguishing is connected with
detecting via the conjunction and, which allows the inference that detecting also is a measure
and should be annotated as one. The adverb quickly can be added to create the more specific
entities detecting quickly and extinguishing quickly. This would be stored in the knowledge
graph as the measures quick detection and quick extinguishment in nominalized speech as well
as the more general entities detection and extinguishment.
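A possible realization of this conjunction heuristic with a dependency parse is sketched below; it assumes a pre-trained spaCy model (en_core_web_sm) and is only one way to implement the idea.

import spacy

# Pre-trained pipeline with a dependency parser (model choice is an assumption).
nlp = spacy.load("en_core_web_sm")

text = "... detecting and extinguishing quickly those fires which do start ..."
doc = nlp(text)

known_measures = {"extinguishing"}  # already known from the domain knowledge

for token in doc:
    if token.text.lower() in known_measures:
        # Tokens coordinated via "and"/"or" are likely measures as well.
        for conjunct in token.conjuncts:
            print("measure candidate:", conjunct.text)
        # Adverbial modifiers yield more specific measure entities.
        for child in token.children:
            if child.dep_ == "advmod":
                print("specific measure:", f"{token.text} {child.text}")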
   One additional aspect of having domain knowledge available is that specific similarity mea-
sures can be created. This is done either on the basis of a language model like BERT or by using
a feature-based similarity measure [6]. In the case of the language model approach, a pre-trained
model is used; a model can also be trained on the specific textual data if sufficient data is
available. These similarity measures are then used to identify unknown textual passages that
are similar to known textual passages that have already been classified. In the following section
we describe an implementation of all components and report a case study in the domain of
nuclear safety.
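Before that, as an illustration of the language model variant mentioned above, the following sketch compares an unclassified passage to already classified passages using a pre-trained sentence embedding model; the library and model choice are assumptions, not the implementation described here.

from sentence_transformers import SentenceTransformer, util

# Pre-trained sentence embedding model (model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")

classified_passages = [
    "Portable fire extinguishers shall be provided as a reactive measure.",
    "Smoke detectors shall be installed in all compartments.",
]
unknown_passage = "Fires which do start shall be extinguished quickly."

# Embed the passages and rank the classified ones by cosine similarity.
known_emb = model.encode(classified_passages, convert_to_tensor=True)
unknown_emb = model.encode(unknown_passage, convert_to_tensor=True)
scores = util.cos_sim(unknown_emb, known_emb)[0].tolist()

for passage, score in sorted(zip(classified_passages, scores),
                             key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {passage}")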


3. Case Study - Active Learning in the Domain of Nuclear Safety
   Regulations
Worldwide, several governmental and non-governmental authorities provide regulatory com-
pliance documents to assure nuclear safety. The scope reaches from recommendations for
the safe operation of nuclear power plants to the safe execution of x-ray radiography. These
documents are a valuable source of knowledge. To access this knowledge in a systematic
manner, metadata within the documents is needed. This metadata has to be created by manual
or automatic textual annotation. The information about how the regulatory metadata is
structured is so-called meta-metadata. In parallel with the annotation process, this knowledge
about the structure of the metadata is improved. To evaluate and improve the approach of
active annotation support explained above, a case study in the domain of nuclear safety was
put into practice. The framework was used to annotate selected documents of a corpus of 143
regulatory documents provided by the International Atomic Energy Agency (IAEA) [7].
   A task in this context is the annotation of potential incidents with the corresponding safety
measures recommended in a document. For instance, the phrase “manual fire fighting” is
annotated as a measure to react to a discovered “fire”. Subsequently, it is desirable to provide this
information in the next annotation step. This can be done either by annotating all occurrences
of the phrase “manual fire fighting” or “on the fly” by providing the available information to
the domain expert as a likely annotation to choose from. In this manner, the knowledge is
continuously improved with every working step of manual annotation. The overall view of the
tools and techniques used to implement the whole system is depicted in Figure 3.

[Figure 3 shows the architecture of Figure 2 together with the implementing tools: webATHEN as annotation frontend with the semantic type system (6), XMI for the exchange of pre-annotations (5) and annotations with the backend, KnowWE and SKOS for the regulatory domain knowledge ontology (1), SPARQL and PROV for the regulatory knowledge graphs (2), UIMA for the document container (3), and spaCy for the NLP container (4); the output is the annotated document corpus.]

Figure 3: A view of used tools and techniques to implement an active annotation support system
for regulatory documents in the domain of nuclear safety.


3.1. The UIMA CAS Object and its Serialized Representation
Apache UIMA (Unstructured Information Management Architecture) is a framework for the
management of unstructured information with the goal of structuring it into a processing
pipeline of annotation steps [8]. In this setup, UIMA is used as the document container in the
backend, depicted as component (3) in Figure 3. UIMA provides functionality to hold the textual
content of the document, the information about entities and their relations, and is capable of
transferring this information into a serialized data format. For each document a so-called CAS
object (Common Analysis Structure) is created with access to the type system schema necessary
for the serialization and the communication to the frontend (5).
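From the Python side of the backend, such a CAS object could, for example, be created and serialized to XMI with the dkpro-cassis library; the library choice, type names, and offsets are illustrative assumptions, as the paper itself only prescribes UIMA and XMI.

from cassis import Cas, TypeSystem

# Define a minimal type system with an entity type and a string feature.
ts = TypeSystem()
Entity = ts.create_type(name="regulatory.Entity",
                        supertypeName="uima.tcas.Annotation")
ts.create_feature(domainType=Entity, name="semanticType",
                  rangeType="uima.cas.String")

# Create the CAS and set the subject of analysis (SOFA) with its MIME type.
cas = Cas(typesystem=ts)
cas.sofa_string = "A fire extinguisher is a measure against a fire."
cas.sofa_mime = "text/plain"

# Add two entity annotations with character offsets into the SOFA string.
cas.add(Entity(begin=2, end=19, semanticType="measureRoot"))
cas.add(Entity(begin=43, end=47, semanticType="incidentRoot"))

# Serialize type system and CAS for the exchange with the frontend.
ts.to_xml("typesystem.xml")
cas.to_xmi("document.xmi")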
   The CAS object maintains the correctness of indices and the corresponding correctness of the
serialized format. The XMI standard can be used for the serialized exchange of data objects on
the basis of meta-metadata models [9]. The content of an exemplary XMI file can be seen in the
following Listing 1. The file holds the document content with its MIME type in the SOFA string
(Subject Of Analysis) as well as two entities with the attributes measureRoot and incidentRoot
and the relation between them with the label isReactiveMeasureTo.