
Semantic Extraction of Key Figures and Their Properties From Tax Legal Texts Using Neural Models

Daniel Steinigen

Marcin Namysl

Markus Hepperle

Jan Krekeler

Susanne Landgraf

0 Bucerius Law School, Jungiusstraße 6, Hamburg, 20355, Germany
1 Federal Ministry of Finance, Wilhelmstraße 97, Berlin, 10117, Germany
2 Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Schloss Birlinghoven 1, Sankt Augustin, 53757, Germany

Applying information extraction to legislative texts is a challenging task that requires a specification to distinguish the relevant parts of the text from the less relevant ones. Moreover, there is still a lack of appropriate language- and domain-specific data in the field of information extraction. This work investigates the extraction and modeling of key figures from legal texts. We introduce a universally applicable annotation scheme together with a semantic model for key figures and their logically connected properties in legal texts. Moreover, we release KeyFiTax, a dataset with key figures based on paragraphs of German tax acts manually annotated by tax experts, together with a knowledge graph populated from these paragraphs based on our semantic model. Using our dataset, we also evaluate and compare state-of-the-art entity extraction models in terms of long entity spans and low-resource data. Furthermore, we present a transformer-based approach for relation extraction using entity markers to obtain a logical formulation of the key figures. Finally, we introduce task triggers for training a combined resource-efficient entity and relation extraction model. We make our dataset, the semantic model, and the knowledge graph publicly available, as well as the implementation of the entity and relation extraction approaches investigated in this work.

Keywords: information extraction, entity extraction, relation extraction, ontologies, knowledge graphs, transformers, language models, German datasets, legal texts, tax key figures

1. Introduction

Key figures represent a central component in legal texts of tax laws. They are crucial for applying laws and are an important criterion in the amendment of laws. Such key figures are, e.g., Entfernungspauschale ‘distance allowance’, Kinderfreibetrag ‘child tax-free allowance’, or Werbungskostenpauschale ‘flat-rate income-related expenses allowance’.

Changing key figures in the tax laws directly affects the resulting tax revenue. An example would be increasing the commuter allowance to 50 cents per km. To estimate the impact of a change in the law, a model can be used to simulate what effect an adjustment of the key figures will have on the specific tax forecast. To facilitate this, in this paper we propose an approach based on information extraction and semantic technologies.

Figure 1: Ontology for semantic modeling of key figures and their logically connected properties in legal texts

For this, it is first necessary to recognize and extract the key figures with their logically connected properties and rules from legal texts. This task requires an automatic understanding of the legal texts and recognition of the relevant information within the text. Then it is necessary to semantically model the extracted information using a specific ontology and to populate a Knowledge Graph (KG) from this information. This makes it possible to compare the KGs of existing and new legal texts to identify legislative changes. In this paper we focus on the information extraction part and the semantic modeling part. We leave the differential analysis and the prediction of the impact to future work.

We apply machine learning (ML) approaches to extract key figures from legal texts. Specifically, we consider this problem a token-level classification task, known as sequence labeling. With this approach, each token of a text is classified according to the predefined categories, whereby tokens not assigned to any class are labeled with zeros [1]. More precisely, this can also be interpreted as an entity extraction task in which individual entities can span many words or tokens. Entity extraction is widely used in the research area of information extraction (IE) and has also been applied in the legal domain [2].

We face several challenges in applying standard entity extraction approaches in our work. Since we focus on German tax legal texts, we have both language- and domain-specific data. This means we are in a low-resource domain and have to deal with limited training data. Moreover, the entities can span many tokens, making it harder for the models to recognize the complete entities. Furthermore, not all numeric currency values are directly relevant to key figures. Therefore, the model must learn the text semantics and what specific tokens refer to in order to extract the relevant values.

To obtain a logical formulation of the key figures, it is necessary to extract the key figures represented by their entities and the relations between them. To address this, we also consider relation extraction approaches in our work. To facilitate resource-efficient training and to get more benefit from the limited amount of available training data, training a combined model for both entity and relation extraction is reasonable.

As a prerequisite to training models for the automatic extraction of key figures, we also introduce an annotation scheme together with a semantic model for key figures in legal texts. A variety of approaches, ontologies, and knowledge graphs already exist for semantic modeling of legal texts. LegalRuleML by Palmirani et al. [3] is intended to model legal rules and to connect legal sources with the metadata of the rules. They also introduce a metamodel with defined nodes (classes) and edges (properties) to expose the LegalRuleML metadata as linked data. Moreno Schneider et al. [4] propose a Legal Knowledge Graph that integrates and links heterogeneous compliance data sources including legislation, case law, regulations, standards, and private contracts.

Holzenberger and Van Durme [5] investigated the performance of natural language understanding approaches on statutory reasoning by introducing the SARA dataset, which consists, among other things, of extracted arguments and a graph-based representation of those arguments. Nevertheless, these approaches are either too general or too specifically modeled for a particular problem to fit our use case of modeling key figures. Therefore, we propose a new semantic model tailored to our use case that models the key figures with their properties in detail. The authors of this paper are a diverse team of NLP and ML experts and tax experts. In interdisciplinary cooperation, we have developed an annotation scheme and a semantic model in an iterative process, which contains the classes and properties required for the complete specification of the key figures.

There are various challenges when annotating the key figures. Since legal texts can be structured in a complex way, the goal is to find a universally applicable annotation schema. Furthermore, most key figures contain not just a single value but different values that apply under different conditions. Using the created annotation scheme, we generated a manually annotated gold standard dataset based on paragraphs of German tax laws. This dataset is the basis for training and evaluating different state-of-the-art information extraction models. Figure 2 shows two examples of annotated paragraphs with distinct categories or entity types.

By applying our information extraction models and our semantic model, the adjusted key figures will be extracted and semantically modeled when the legal texts have changed so that they can be taken into account in the tax forecast. In summary, the contributions of this paper are as follows:

• An annotation scheme together with a semantic model for key figures in legal texts
• A dataset consisting of paragraphs of German tax laws with annotated key figures and a knowledge graph populated with these key figures
• Evaluation and comparison of state-of-the-art entity extraction models in terms of long entity spans and low-resource data utilizing the proposed dataset
• A transformer-based approach for a combined resource-efficient extraction of entities and relations from legal data.

2. Semantic Model and Dataset

2.1. Data Sources and Data Selection

The initial data basis for generating the annotated dataset is legal texts in the German language. For this purpose, we took advantage of the publicly accessible website of the German Federal Ministry of Justice and the Federal Office of Justice¹, which contains the current German laws and legal regulations. These legal texts are available in various data formats, such as XML, PDF, or HTML. For our purpose, we use the XML files and automatically extract the contained legal paragraphs.

¹ https://www.gesetze-im-internet.de

In accordance with the overall aim of providing a model for determining the impact of legislative change on tax revenues, we select, in a first step, the relevant German tax laws, notably the Fiscal Code (Abgabenordnung), the Income Tax Act (Einkommensteuergesetz), the Corporate Tax Act (Körperschaftsteuergesetz), the Inheritance Tax Act (Erbschaft- und Schenkungsteuergesetz), and further tax acts regulating German direct and indirect taxes. To generate a larger dataset, we also considered tax acts from other German-language jurisdictions, such as Austria or Switzerland. We abandoned this because these acts use the same key figures with a different meaning, or different key figures with the same meaning, compared to the German jurisdiction, and such inconsistencies would harm the dataset.

In the second step, we determine the relevant sections and paragraphs of the selected acts. To this end, we ask which rules directly impact the tax revenues and do not merely have a serving or systematizing function. Accordingly, we select those sections and paragraphs that contain a key figure with a corresponding value and unit. These are the essential and mandatory components of the relevant key figures, whereas the other categories are optional. The categories are described in detail in the next section.

2.2. Semantic Model

We introduce our annotation scheme and our semantic model for creating the dataset with different semantic categories for the key figures. The goal is to provide a comprehensive specification of the key figures so that they can be used independently of the legal text for downstream applications, such as tax forecasts. The annotation scheme and the semantic model should be universally applicable to legal texts, which can be structured in various complex ways. We identified the semantic categories in an iterative process by analyzing different paragraphs of tax acts and revising our annotation scheme continuously.

First, we introduce the category for the key figure itself as a central category, which is characterized by having one or more values that have an impact on tax revenue. The annotation can be considered as the name or label for the key figure. It corresponds to a text phrase or word that describes this key figure. Figure 2 shows, for example, the annotation of the key figures distance allowance and child allowance. Then, since every key figure we consider here should have at least one value, there is the category for these values, which we call the expression of the key figure. These are numerical values or terms to which the key figures refer, such as the values 0.30 or 4 500 in Figure 2. The expressions can be specified in certain units, so there is a category for units. In the case of monetary amounts, which often appear in tax acts, the unit is in most cases a currency, such as Euro.

With these three categories, simple key figures can already be specified. However, while analyzing the legal texts, we found that the key figures can also be structured in a much more complex way. Most key figures contain not only a single expression but different expressions that apply under different conditions, and there are also preconditions for specific key figures. For this purpose, we introduce the category condition. It includes spans of text over several words with conditions that apply to a key figure or for which a key figure has specific expressions. An example is the commuter allowance, which amounts to 0.30 euros up to 20 kilometers driven and increases to 0.35 euros from kilometer 21. Another example is given in Figure 2, where the child allowance has different expressions (values) depending on the number of children, which is the condition there.
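To make the three basic categories and the condition category more concrete, the following minimal sketch models the commuter allowance example from the text as plain Python data classes. The class names, field names, and the `allowance_for_distance` helper are illustrative assumptions and not part of the released semantic model; the helper only shows how a downstream application might evaluate the two conditions.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class Expression:
    """A value of a key figure, optionally tied to a condition."""
    value: Decimal
    unit: str
    condition: Optional[str] = None  # free-text condition span, if any


@dataclass
class KeyFigure:
    """A named key figure with one or more expressions."""
    name: str
    expressions: List[Expression] = field(default_factory=list)


# Commuter allowance as described in the text: 0.30 euros up to
# kilometer 20 and 0.35 euros from kilometer 21.
commuter_allowance = KeyFigure(
    name="commuter allowance",
    expressions=[
        Expression(Decimal("0.30"), "Euro", "up to 20 kilometers"),
        Expression(Decimal("0.35"), "Euro", "from kilometer 21"),
    ],
)


def allowance_for_distance(kilometers: int) -> Decimal:
    """Hypothetical evaluation of both conditions for a given distance."""
    low, high = commuter_allowance.expressions
    first_part = min(kilometers, 20)       # kilometers covered by the first condition
    second_part = max(kilometers - 20, 0)  # kilometers covered by the second condition
    return first_part * low.value + second_part * high.value
```

Under these assumptions, a distance of 30 kilometers combines 20 kilometers at the first expression with 10 kilometers at the second one.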

We also found that there can be different types of conditions, namely negative, alternative, or cumulative conditions. An example of a negative condition can be found in section 24 sentence 2 of the Corporate Tax Act. The provision stipulates that the allowance for corporate tax subjects, as regulated in sentence 1, does not apply to the type of subjects specified in numbers 1 to 3 of the provision. Alternative conditions are, for instance, used in section 10b para. 1 sentence 8 of the Income Tax Act. The sentence regulates that certain membership fees cannot be deducted if they are paid to corporations serving certain purposes specified in numbers 1 to 5. The deduction prohibition already applies if only one of these numbers is fulfilled, as indicated by the word "or" between the penultimate and the last number. Section 10b para. 1a sentence 1 contains one of many examples of cumulative conditions, where donations into the assets of a foundation are only declared deductible if they meet the requirements of a donation into the assets of a foundation, the provisions of para. 1 sentences 2 to 6 are fulfilled, and an application has been filed.
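The logical difference between the three condition types can be sketched with small Boolean combinators: a negative condition negates, an alternative condition holds if at least one part holds, and a cumulative condition holds only if all parts hold. The function names and fact names below are hypothetical and only loosely follow the cumulative example above; they are not taken from the annotation scheme.

```python
from typing import Callable, Dict, Iterable

Facts = Dict[str, bool]
Condition = Callable[[Facts], bool]


def atom(name: str) -> Condition:
    """A basic condition that holds if the named fact is true."""
    return lambda facts: facts.get(name, False)


def negative(cond: Condition) -> Condition:
    """Holds exactly when the wrapped condition does not hold."""
    return lambda facts: not cond(facts)


def alternative(conds: Iterable[Condition]) -> Condition:
    """Holds if at least one of the conditions holds (logical or)."""
    parts = list(conds)
    return lambda facts: any(c(facts) for c in parts)


def cumulative(conds: Iterable[Condition]) -> Condition:
    """Holds only if all of the conditions hold (logical and)."""
    parts = list(conds)
    return lambda facts: all(c(facts) for c in parts)


# Hypothetical fact names for the three cumulative requirements.
donation_deductible = cumulative([
    atom("is_donation_into_foundation_assets"),
    atom("para_1_sentences_2_to_6_fulfilled"),
    atom("application_filed"),
])
```

With this representation, dropping any single requirement makes the cumulative condition fail, whereas an alternative condition built from the same atoms would still hold.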

Another point to consider when describing key figures is that the expressions are not always just fixed values but can also define a range in which a key figure applies. This is covered by the range category. The range is an indicator for the area in which an expression is valid. This area can be defined by either an upper limit, a lower limit, or some limit range. Figure 2 shows an example of an upper limit "at most" and a lower limit "an amount greater than". In addition, there is also a weighting of the expressions, which we call factors. This category characterizes the factor that must be considered for an expression and indicates what the expression refers to. These factors can be further divided into temporal factors, which refer to periods of time, such as months or years, and quantitative factors, which refer to some absolute amount. For example, the paragraph in Figure 2 includes a temporal factor "per calendar year" and a quantitative factor "for each full kilometer".

Furthermore, we found that not all key figures have their expressions explicitly mentioned as such in the legal texts. This means that the key figures sometimes cannot be recognized as distinct mentions of a short sequence of words, and expressions do not always occur as easily recognizable numerical values. Instead, the key figures and expressions can also be implicitly described in the legal texts using long phrases in a declarative manner. To tackle this, we have two additional categories for the declarative phrases of the key figures and expressions, called declarative key figures and declarative expressions.

For the cases where the key figures and expressions are explicitly mentioned, we use the categories stated key figure and stated expression. Table 1 shows all introduced semantic categories with some sample formulations and their English translations.

To assign the annotations created according to the semantic categories to each other in order to obtain a logical formulation of the key figures, we also introduce relation types between the categories. This is particularly important since a single paragraph may contain multiple key figures with the associated other categories. Based on the defined semantic categories and relation types, we build a semantic model in the form of an ontology, as shown in Figure 1, using the RDF Schema² vocabulary. The semantic categories become the classes and the relations become the properties of this ontology, which also define the permissible properties between these classes. For the class expression, we have also defined data properties for storing the numeric values if they are explicitly specified or the phrases for the declarative expressions. This model allows the assignment of the key figures to the associated conditions and expressions during annotation. Noteworthy are the properties hasCondition and hasExpression, since they can be applied to two different classes as a head. Conditions can apply directly to certain key figures or define the validity of different expressions. Expressions, on the other hand, can be derived directly from a key figure or can also be part of a condition.

Furthermore, we introduce the relation join to link related annotations from the same semantic category, since there are cases where a single entity is spread across multiple annotations. Beyond the key figures, we also model the paragraphs that contain the key figures and the legal sources, in our case the tax acts that consist of the paragraphs. In addition, since conditions can not only be expressed by natural text but can also depend on other paragraphs, we also introduce a property refersTo between condition and paragraph.

2.3. Annotation Rules and Dataset Acquisition

Given the developed annotation schema and the collected data sources, the next step is to annotate the legal texts and build up the dataset. For the further procedure of annotating the dataset and applying the information extraction models, we refer to the semantic categories or classes as entities and the properties as relations. We first used the selected paragraphs from Section 2.1 and performed a simple pre-annotation task. Using rule-based approaches and pattern matching, we automatically enriched the paragraphs with annotations for the expression and unit categories. The annotators reviewed these pre-annotations and corrected, removed, or complemented them as necessary. For storing the pre-annotated data, we have chosen the CAS format serialized as an XMI file. It allows us to import the data directly into the annotation tool. For manual annotation of the texts, we use the INCEpTION tool³ [6] as it has an intuitive graphical user interface and can be configured well for specific annotation tasks.

Furthermore, we defined a set of annotation rules. We only allow complete words to be annotated and not parts of words. We do not allow multi-label annotation, which means that each token can only be labeled with one of the defined semantic categories. Conditions are the only exception to this rule. Each token already labeled as a condition can also have a label of another category because conditions can also represent a key figure concurrently, and conditions themselves can contain expressions. For example, section 10 para. 1a sentence 1 number 1 of the Income Tax Act contains the key figure of maintenance payments to the divorced or permanently separated spouse who is subject to unlimited income tax liability, which is a condition of this key figure. This is because the key figure and its expression only apply if the maintenance payment, as defined elsewhere (in the German Civil Code) but referenced here, is paid.

We also found that, besides different types, the conditions can also have different formats. Regarding length, some conditions span only a few words, while others span entire sentences. Here we do not limit the length of the conditions and allow arbitrarily long phrases. The same applies to the categories declarative key figure and declarative expression.

For our annotation task, we simplify for now the issue that there are different condition types and do not distinguish these types during annotation. We define that cumulative conditions are labeled contiguously and that alternative conditions are labeled separately as long as they do not have a common beginning or end of sentence. In addition, the relations between the entities are also annotated. However, the relations are only allowed between certain entity types, in a defined direction. This annotation was done in accordance with the classes and properties defined in the ontology in Figure 1.

The data was annotated in an iterative process by tax experts who coauthored this paper. In this process, we also continuously developed the annotation scheme together with the semantic model. The first semantic model was more restrictive, and as the work progressed we allowed more relations when necessary. In addition to the specifications already mentioned, there were other aspects and challenges to be considered during the annotation. The general challenge is the complexity of the German tax regulations, which are often long, convoluted, and contain references to other provisions. Hence, compromises were often necessary between annotating as accurately as possible and managing the complexity of annotations that would otherwise result in specifying rules that affect only a small number of tokens. Because the dataset is of a manageable size, the annotation agreement was that the annotation is done piecewise by both annotators simultaneously. Anomalies and deviations were then discussed together with the NLP engineers, and the annotation scheme was readjusted if necessary.

2.4. Dataset Statistics

The generated dataset includes 106 annotated paragraphs from 14 different German tax acts. Table 2 shows the statistics of the generated dataset with the number of annotated instances and the token sequence length for each category. It shows that the dataset contains 157 annotations of key figures, with the corresponding additional categories. The statistics also illustrate that the annotations for the categories condition, declarative key figure, and declarative expression contain very long token sequences. We further populated a KG from this annotated dataset using the defined semantic model from Section 2.2. The annotated dataset, as well as the KG and the list of tax acts of which paragraphs are included in the dataset, have been made publicly available and can be found in the project repository.

Table 2: Statistics of the entities and relations in our dataset. No. is the number of annotated instances and Tok. the mean number of tokens for each category.

Entity Type                No.      Tok.   Relation Type            No.
Key figure (stated)        129      4      hasKeyFigure             157
Expression (stated)        295      2      hasExpression            319
Condition / Unit           429814   114    hasCondition / hasUnit   237999
Range                      75       2      hasRange                 75
Factor                     97       11     hasFactor                137
Key figure (declarative)   28       14     hasParagraph             106
Expression (declarative)   32       6      join                     139
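The rule-based pre-annotation step from Section 2.3 can be illustrated with a small pattern-matching sketch. The regular expression, the list of unit words, and the function name `pre_annotate` are assumptions made for illustration; the actual rules used to create the dataset are not specified in the text. The sketch marks a number (with optional thousands groups and a decimal comma) as an expression and the unit word that follows it as a unit.

```python
import re
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start, end, label) as character offsets

# Assumed pattern: a number with optional thousands groups (space or dot)
# and an optional decimal comma, followed by one of a few unit words.
VALUE_WITH_UNIT = re.compile(
    r"(?P<expr>\d+(?:[ .]\d{3})*(?:,\d+)?)\s+(?P<unit>Euro|Cent|Prozent)\b"
)


def pre_annotate(text: str) -> List[Annotation]:
    """Return candidate Expression and Unit annotations for manual review."""
    annotations: List[Annotation] = []
    for match in VALUE_WITH_UNIT.finditer(text):
        annotations.append((match.start("expr"), match.end("expr"), "Expression"))
        annotations.append((match.start("unit"), match.end("unit"), "Unit"))
    return annotations
```

Such candidates are only a starting point: as described above, annotators still need to correct, remove, or complement them, since not every numeric currency value belongs to a key figure.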

3. Approaches for Key Figure Extraction

Given the dataset described in Section 2, the goal is to automatically extract the key figures specified by their semantic types from the legal texts. We address this problem by employing entity extraction approaches. In the entity extraction task, each token of a text is assigned a label according to some predefined categories, whereby tokens not assigned to any category are labeled with zeros. The individual entities can then span a large number of tokens. Based on this, ML-based classification models can be trained to classify each token. Ideally, the model learns from the examples seen during training and generalizes to unseen examples.
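The token-level labeling described above can be sketched as follows. The function name, the span format (token indices with an exclusive end), and the entity types assigned to the example sentence are illustrative assumptions; only the idea of labeling unassigned tokens with zeros is taken from the text.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def label_tokens(tokens: Sequence[str], spans: Sequence[Span],
                 outside: str = "0") -> List[str]:
    """Assign one label per token; tokens outside all spans get the zero label."""
    labels = [outside] * len(tokens)
    for start, end, label in spans:
        for index in range(start, end):
            labels[index] = label
    return labels


tokens = "Das Kindergeld beträgt monatlich 219 Euro".split()
# Assumed annotation of the example sentence, for illustration only.
spans = [(1, 2, "KeyFigure"), (3, 4, "Factor"), (4, 5, "Expression"), (5, 6, "Unit")]
labels = label_tokens(tokens, spans)
```

A token classification model is then trained to predict `labels` from `tokens`; entities spanning several tokens simply repeat the same label over the span in this simplified scheme.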

3.1. Approaches from NLP libraries

In our work, we consider and compare different approaches for entity extraction. First, we investigate the approaches of two well-known NLP libraries, spaCy and RASA. For spaCy, we take advantage of the provided predefined pipelines for training named entity recognition (NER) models⁴. We used the recommended settings and adjusted the hyperparameters for our use case, as shown in Table 7. From RASA, we use an entity extraction approach based on a conditional random field (CRF) model⁵. This model utilizes sklearn-crfsuite⁶ and uses features of the words (e.g., capitalization, part-of-speech tagging) and their context to assign probabilities to certain entity classes.

⁴ https://spacy.io/usage/training/
⁵ https://rasa.com/docs/rasa/components/#crfentityextractor
⁶ https://sklearn-crfsuite.readthedocs.io/en/latest/

3.2. Transformer Models for Entity Extraction

We also consider transformer-based approaches as we investigate the low-resource scenario and have to cope with long entity spans. The transformer architecture aims to solve sequence-to-sequence tasks while being able to consider long-distance dependencies across several words in a sentence by employing the attention mechanism [7]. Transformer-based language models can be pre-trained on large text corpora, allowing them to understand the contextual relationships between individual words and sentences. Considering the entity extraction task, we choose models that utilize the encoder part of the transformer architecture. These models provide an encoded representation of the input sentences. We use a final classification layer to classify the sentence tokens according to our annotation scheme.

For our work, we select relevant models pre-trained on German text data. First, we consider the German BERT model and the GBERT and GElectra models by Chan et al. [8], which, in addition to Wikipedia and news articles, is also pre-trained on 2.4 GB of German legal texts from Open Legal Data⁷ [9]. We also consider a multilingual language model, XLM-RoBERTa [10], which is pre-trained on 2.5 TB of data from 100 different languages, including about 100 GB of German texts.

In order to face the challenge of long input sequences due to the long paragraphs legal texts can have, we also consider the Longformer model by Beltagy et al. [11]. In contrast to the other models, which only allow a maximum length of 512 tokens as input, this model allows up to 4096 tokens. Specifically, we use the XLM-R Longformer model by Sagen [12]⁸. This is an XLM-RoBERTa model that has been extended to allow sequence lengths up to 4096 tokens using the Longformer pre-training scheme.

3.3. Relation Extraction

As described in Section 2, our goal is to automatically extract key figures represented by their entities and the relations between them to obtain the logical formulation of key figures. We employ a relation extraction approach to classify the relationship between the entities. Table 2 lists the relations in our dataset. Note that a simple rule-based assignment of the relation type based on the entity types according to the ontology in Figure 1 is not straightforward, as the relationship may or may not exist depending on many other factors. Therefore, we apply ML-based approaches to this task. We adopt a transformer-based approach inspired by Zhou and Chen [13] and introduce typed entity markers to the input text before feeding it into the model.

First, we add special tokens into the vocabulary of the model and use them to enclose subject and object entities within the input paragraph: [SUB], [/SUB], [OBJ], [/OBJ]. In addition to the subject and object, we also mark the type of entities in the input text by using additional special tokens for each entity type, which provides the neural network with prior knowledge that facilitates the learning process.

Multiple training samples are generated for each input paragraph depending on the number of entities contained in that paragraph. For each sample, we mark one entity as a subject and all other entities as objects. Similar to the sequence labeling approach (Section 3.2), we feed the text with marked entities to the encoder to obtain a token-level representation of the input. Then, we apply a classification layer to classify the relations between the subject and objects. We label each [OBJ] token with the ...

Figure 3: Example excerpt of a paragraph with marked entities: [CLS] [RE] Das [OBJ] Kindergeld [/OBJ] beträgt [OBJ] monatlich [/OBJ] [OBJ] für das erste und zweite Kind [/OBJ] [SUB] 219 [/SUB] [OBJ] Euro [/OBJ]
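The construction of relation extraction samples with entity markers can be sketched as follows, using the Kindergeld excerpt from the text. The function names and the entity span format are hypothetical, the entity types in the example are assumed, and the additional typed markers per entity type are omitted because their exact format is not given here. One sample is produced per entity, with that entity marked as subject and all others as objects.

```python
from typing import List, Sequence, Tuple

Entity = Tuple[int, int, str]  # (start token, end token exclusive, assumed type)


def mark_entities(tokens: Sequence[str], entities: Sequence[Entity],
                  subject_index: int) -> List[str]:
    """Enclose one entity in [SUB] markers and all other entities in [OBJ] markers."""
    opening, closing = {}, {}
    for index, (start, end, _type) in enumerate(entities):
        tag = "SUB" if index == subject_index else "OBJ"
        opening[start] = f"[{tag}]"
        closing[end - 1] = f"[/{tag}]"
    marked: List[str] = []
    for position, token in enumerate(tokens):
        if position in opening:
            marked.append(opening[position])
        marked.append(token)
        if position in closing:
            marked.append(closing[position])
    return marked


def relation_samples(tokens: Sequence[str],
                     entities: Sequence[Entity]) -> List[List[str]]:
    """Generate one marked sample per entity acting as the subject."""
    return [mark_entities(tokens, entities, i) for i in range(len(entities))]


tokens = "Das Kindergeld beträgt monatlich für das erste und zweite Kind 219 Euro".split()
entities = [(1, 2, "KeyFigure"), (3, 4, "Factor"), (4, 10, "Condition"),
            (10, 11, "Expression"), (11, 12, "Unit")]
samples = relation_samples(tokens, entities)
```

The sketch assumes non-overlapping entity spans. In a real setup, the marker strings would also have to be registered as special tokens in the tokenizer vocabulary so that they are not split into sub-word pieces.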

3.4. Joint Entity and Relation Extraction

Extending the approach from Section 3.3 further, it is possible to use the same network architecture to train a combined entity and relation extraction model. To this end, we introduce new special tokens called task triggers to distinguish the entity and relation extraction task: [EE] and [RE], respectively. We insert these tokens at the beginning of each paragraph right after the [CLS] token.

Moreover, since the condition class may overlap with other classes in our dataset, we employ task triggers to distinguish between groups of entities by defining additional triggers for each group. This allows us to separate entities into groups of types with non-overlapping annotations. Specifically, we have one entity group for conditions, marked with [GRP-1], and one group for the remaining entity types, marked with [GRP-2]. Considering that we have two entity groups, this gives us two training samples for entity extraction and multiple samples (depending on the number of entities) for relation extraction for each paragraph. By executing multiple forward passes on a single token classification model, we can recognize entities with overlapping annotations as well as the relations between these entities.

One advantage of this approach is that we do not need to train separate models for the different entity groups and for relation extraction, which saves computational resources for training and memory resources for inference. Another advantage is that we get a larger number and variety of samples for training the model and thus more benefit from the limited training data available. Figure 3 shows an example excerpt of a paragraph with the marked entities and the labeled relations. A detailed overview of all generated training samples for this excerpt can be found in the project repository.

4. Experimental Evaluation

4.1. Comparison of Approaches for Entity Extraction From Legal Data

In this experiment, we evaluate the entity extraction approaches described in Section 3 on the dataset introduced in Section 2. For this purpose, we only use the superclasses condition, unit, range and factor of our semantic model and do not distinguish between their subclasses. However, for the key figure and expression classes we retain the distinction between the stated and declarative subclasses.

4.1.1. Experimental Setup

Data Split and Data Partition. We use different strategies for splitting the data. For evaluating the different types of transformer models described in Section 3.2 and finding the best-performing model on our dataset, we randomly split the data into fixed training (80%) and evaluation (20%) subsets. This results in 85 paragraphs for training and 21 for evaluation. For the condition class, which is trained separately, there are 73 paragraphs for training and 18 for evaluation. This allows us to identify the most suitable models in less time and with less computational effort compared to the more complex evaluation approach we used afterward.

In the next step, we select the best-performing transformer model and compare it with the other approaches described in Section 3.1 using k-fold cross-validation. This validation technique is particularly suitable for the low-resource scenario considered here, as it reduces the influence of the distribution of data across the training and test splits on the evaluation results of the models. We choose k = 5 and randomly divide the dataset into five equal-sized subsets. In each iteration, one subset is retained as the data used for testing the model, and the remaining four subsets are used as training data. Thus, each subset is used once for evaluation and four times for training the model. The results are then averaged to produce the final scores.
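The k-fold procedure described above can be sketched in a few lines of standard-library Python. The function name, the fixed random seed, and the use of paragraph indices as items are assumptions for illustration; with 106 paragraphs and k = 5, the folds can only be approximately equal in size.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 5,
                  seed: int = 0) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs so that each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Distribute the shuffled items round-robin over k folds.
    folds = [shuffled[i::k] for i in range(k)]
    for test_index in range(k):
        test = folds[test_index]
        train = [item for j, fold in enumerate(folds) if j != test_index
                 for item in fold]
        yield train, test


# 106 paragraph indices, as in the dataset size reported in the text.
splits = list(k_fold_splits(range(106), k=5))
```

Each model would be trained once per split, and the per-split scores averaged to obtain the final result.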

Training Setup As the annotations for the condition class may overlap with other annotations, we train two separate models: one for the recognition of the condition type and the other for the recognition of the remaining entity types. We train the transformer model for 200 epochs with a batch size of 8 and a learning rate of 1 × 10⁻⁵. All other relevant hyperparameters and the configuration files used for the other approaches are documented in the project repository.
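To give a rough sense of the training budget this implies, the sketch below records the three stated hyperparameters and derives the number of optimizer updates. `TrainConfig` and `total_steps` are hypothetical names. The calculation assumes one training sample per paragraph and no gradient accumulation, neither of which is stated in the text. The paragraph counts (85 and 73) are those of the fixed split.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in the training setup (names are illustrative)."""
    epochs: int = 200
    batch_size: int = 8
    learning_rate: float = 1e-5

    def total_steps(self, n_samples):
        # Assumption: one sample per paragraph, one update per batch.
        return self.epochs * math.ceil(n_samples / self.batch_size)


config = TrainConfig()
steps_remaining_types = config.total_steps(85)  # model for the non-condition types
steps_condition = config.total_steps(73)        # model for the condition type
```

Under these assumptions each model sees on the order of two thousand updates, which is small enough to make the repeated runs of cross-validation affordable.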

Evaluation Metric For each entity type individually, we report the token-level micro-averaged F1 score on the test set as the evaluation metric in the charts. We also provide the macro-averaged F1 score over all classes as a tabular overview. For k-fold cross-validation, we report the average F1 score achieved over all five training runs.

4.1.2. Results and Discussion
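As a reading aid for the scores in this section, the metric defined in Section 4.1.1 can be sketched as follows: a token-level F1 score per entity type, pooled over all test tokens, and an unweighted mean of these scores as the macro average. The function names, the `"O"` label for tokens outside any entity, and the toy label sequences are assumptions for illustration only.

```python
def token_f1(gold, pred, label):
    """Token-level F1 for one entity type; gold and pred are parallel label lists."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-type F1 scores."""
    return sum(token_f1(gold, pred, label) for label in labels) / len(labels)


# Toy example with two of the entity types; "O" marks tokens outside any entity.
gold = ["O", "unit", "unit", "factor", "O"]
pred = ["O", "unit", "O", "factor", "factor"]
score = macro_f1(gold, pred, ["unit", "factor"])
```

Because the macro average weights every type equally, a rare and difficult type lowers it as much as a frequent one would.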

Transformer Models The evaluation results for comparing different pre-trained transformer models are presented in Table 3 as a summary overview. The detailed performance of the evaluated models per class is visualized in the project repository. The results show that the GBERT and XLM-RoBERTa models outperform other models for the declarative expression class. The best-performing transformer model is XLM-RoBERTa-LARGE with an F1 score of 56.8 %.

Table 3: Macro-averaged F1 score (%) of the evaluated models. Models evaluated on the fixed split: GBERT-BASE, GBERT-LARGE, GElectra-BASE, GElectra-LARGE, Longformer, XLM-RoBERTa-BASE, XLM-RoBERTa-LARGE. Cross-validated results:

spaCy-NER (cross-validated) 45.78
RASA-CRF (cross-validated) 44.10
XLM-RoBERTa-LARGE (cross-validated) 60.91
XLM-RoBERTa-LARGE-Triggers (cross-validated) 58.78

Model comparison Having chosen XLM-RoBERTa-LARGE, we cross-validate this model together with the spaCy-NER and RASA-CRF approaches. Figure 4 presents the results of this experiment.

In the case of the unit class, all models achieved high F1 scores. This is unsurprising, since the instances of this class are single-token entities (e.g., Euro, EUR) that pose few challenges to the examined models. Similarly, the scores for the stated expression class were also high.

The range and factor classes were recognized relatively well, especially by XLM-RoBERTa-LARGE and, in the case of the factor class, also by spaCy-NER. Note that these two classes have about one third as many samples as the expression and unit types. Despite a lower number of examples, similar scores are achieved on the declarative expression class by XLM-RoBERTa-LARGE.

All models except XLM-RoBERTa-LARGE perform relatively poorly on the key figure class. Interestingly, the variance of the results for this class is relatively large: RASA-CRF achieves an F1 score of only 0.16, whereas XLM-RoBERTa-LARGE scores about three times higher.

Every model examined in our experiment performs worst on the declarative key figure class. We believe that this is due to the complexity of this class and the low number of instances in the data (see the max. length and num. samples plots in Figure 4, respectively).

Despite a large number of available samples, the score on the condition class is also low for spaCy-NER and RASA-CRF, but acceptable for XLM-RoBERTa-LARGE. We believe that the length and the complexity of this class could cause this. Note that the longest instances of this class have over 100 tokens. Moreover, the concept of a condition is not as strictly defined as, e.g., expression, unit, or factor.

Looking at the overall performance across all classes, XLM-RoBERTa-LARGE clearly scores the best with a macro-averaged F1 score of 60.9 %. SpaCy-NER and RASA-CRF perform comparably in terms of overall performance but are still about 15 percentage points behind XLM-RoBERTa-LARGE.

4.2. Combined Extraction of Entities and Relations From Legal Data

In this experiment, we evaluate the approach described in Section 3.4 for combined entity and relation extraction on the dataset introduced in Section 2. We use the same classes as in Section 4.1.

4.2.1. Experimental Setup

Training Setup We select the XLM-RoBERTa-LARGE model for this experiment as its results in Section 4.1 were the most consistent among the examined models. Using the approach described in Section 3.3, we train one model for extracting the two groups of entities and the relations.

Dataset We expand our training data according to Section 3.3. For each record, we create one training sample for each entity group and one training sample for each possible subject entity containing the entity markers for relation extraction. Then, analogous to Section 4.1, we apply cross-validation to evaluate the model's performance.
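The expansion of one annotated paragraph into training samples can be sketched as follows. The text names the triggers [GRP-1] and [GRP-2] but does not specify how triggers and entity markers are encoded. Prepending the trigger as the first token, the `[REL]` trigger name, the `<label>…</label>` marker format, and treating every entity as a possible subject are therefore assumptions for illustration, as is the toy German excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # index of the first token
    end: int    # index one past the last token
    label: str


GRP_1_LABELS = {"condition"}  # [GRP-1]: conditions; [GRP-2]: all remaining types
REL_TRIGGER = "[REL]"         # hypothetical trigger name for relation extraction


def build_samples(tokens, entities):
    """Expand one annotated paragraph into entity and relation samples."""
    samples = []
    # One entity-extraction sample per entity group.
    for trigger in ("[GRP-1]", "[GRP-2]"):
        wanted = [e for e in entities
                  if (e.label in GRP_1_LABELS) == (trigger == "[GRP-1]")]
        samples.append({"tokens": [trigger] + tokens, "entities": wanted})
    # One relation-extraction sample per (assumed) subject entity,
    # with typed markers wrapped around the subject span.
    for subject in entities:
        marked = (tokens[:subject.start]
                  + [f"<{subject.label}>"]
                  + tokens[subject.start:subject.end]
                  + [f"</{subject.label}>"]
                  + tokens[subject.end:])
        samples.append({"tokens": [REL_TRIGGER] + marked, "subject": subject})
    return samples


# Toy excerpt (hypothetical): an amount, its unit, and a condition.
tokens = ["Betrag", "von", "1000", "Euro", "bei", "Ledigen"]
entities = [Entity(2, 3, "expression"), Entity(3, 4, "unit"),
            Entity(4, 6, "condition")]
samples = build_samples(tokens, entities)
```

For this excerpt the expansion yields two entity-extraction samples and three relation-extraction samples, which shows how a small set of paragraphs turns into a noticeably larger set of training samples for a single shared model.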

4.2.2. Results and Discussion

Entity Extraction The results of this experiment are presented in Figure 4 and Table 3 under the name XLM-RoBERTa-LARGE-Triggers, for comparison with the other models. The evaluation shows that the jointly trained model achieves entity extraction performance comparable to that of the XLM-RoBERTa-LARGE models trained separately for conditions and other entities. Even though the jointly trained model slightly underperforms on the classes factor, range and declarative expression compared to the separately trained models, XLM-RoBERTa-LARGE-Triggers achieves better performance on the most relevant class, key figure. Moreover, it performs better on the most complex class, condition.

Relation Extraction The performance of this model in the relation extraction task is presented in Table 4. The results show that the F1 scores for all relation types are above 0.6. The F1 scores are especially high for the relations hasUnit, hasRange, hasFactor and hasExpression. The model recognized the relationship between expressions and units almost perfectly.

Table 4: F1 scores (%) for relation extraction. Macro-averaged: 77.34.

5. Related Work

5.1. NLP datasets in Legal Domain

Chalkidis et al. [14] provide a dataset for entity recognition consisting of 3,500 English contracts manually annotated with 11 entity types (party name, termination date, jurisdiction, etc.). Chalkidis et al. [15] release a multi-label text classification dataset based on the EUR-LEX portal (https://eur-lex.europa.eu/). Leitner et al. [16] develop a dataset consisting of German court decisions annotated with 19 entity types (person, judge, lawyer, ordinance, court decision, etc.) and examine, among others, CRFs for entity extraction. Glaser et al. [17] introduce a dataset of 100k German court rulings with short summaries to study the performance of text summarization systems. Wrzalik and Krechel [18] release a dataset for legal information retrieval (IR), which is based on case documents from the Open Legal Data platform [9]. Chalkidis et al. [19] present FairLex, a multilingual fairness benchmark of four legal datasets that covers five languages and five sensitive attributes. They employ FairLex to evaluate the fairness of pre-trained language models (PLMs) and the techniques used to fine-tune them. Holzenberger and Durme [5] introduce the SARA dataset to investigate the performance of natural language understanding approaches on statutory reasoning. Waltl et al. [20] present an automated classification of legal norms with regard to their semantic type and propose a semantic type taxonomy for norms in the German civil law domain.

5.2. NLP Approaches in Legal Domain

Dozier et al. [21] discuss NER and named entity disambiguation (NED) in legal documents such as US case law, depositions, pleadings, etc. Glaser et al. [22] evaluate NER and NED approaches on a manually annotated German court decisions dataset. Chalkidis et al. [23] apply sequence labeling techniques to extracting core information from contracts. Large PLMs are usually trained using generic corpora and tend to underperform in specialized domains [24, 25]. Chalkidis et al. [2] apply BERT models [26] to English downstream legal tasks (text classification and sequence labeling) by exploring different pretraining and fine-tuning strategies.

Andrew [27] uses statistical and rule-based techniques to extract entities such as names, organizations and roles, and their relations, in legal documents. Chen et al. [28] propose a legal triplet extraction system for drug-related criminal judgment documents. Hong et al. [29] perform IE of case factors from a dataset of parole hearings.

Cardellino et al. [30] employ IE in legal texts to recognize mentions of entities and link them to a structured knowledge representation (LKIF ontology: http://www.estrellaproject.org/lkif-core/). Lüdemann et al. [31] use KGs to model business entities of multinational companies and employ them for assessing tax planning strategies.

6. Conclusion and Future Work

In this work, we investigated extracting relevant key figures from legislative texts. To this end, we provided a universally applicable annotation schema together with a semantic model for key figures and their properties in legal texts. We successfully applied the schema and the model to legal texts. Moreover, we presented a dataset manually annotated by tax experts, which includes 85 annotated paragraphs from 14 different German tax acts with 157 annotated tax key figures, as well as a knowledge graph populated from these annotated paragraphs based on our semantic model.

We evaluated state-of-the-art entity extraction models on the proposed dataset, facing the challenges of the low-resource scenario and long entity spans. The results showed that all models perform well for classes with low complexity and sufficient training data available. Nonetheless, for more complex entities the transformer-based language models significantly outperform the other models. However, as a limitation, such models also require a certain amount of training data to achieve acceptable performance. We further provided a transformer-based relation extraction approach using typed entity markers, which performed very well in our experiments. Moreover, we introduced task triggers for training a combined model for entity and relation extraction and for different groups of entities with overlapping annotations. We have shown that this combined model achieves performance comparable to that of separately trained models. Using a combined model saves computational resources for training and memory resources for inference.

We make our dataset together with the semantic model and the KG, as well as the implementation of the entity and relation extraction approaches investigated in this work, publicly available (https://github.com/danielsteinigen/nlp-legal-texts). To showcase our work, we also provide a simple demonstrator application (https://huggingface.co/spaces/danielsteinigen/NLP-Legal-Texts).

Future Work In the future, we also plan to consider alternative modeling approaches for the entity and relation extraction task, e.g., span-based classification, machine reading comprehension, or unsupervised approaches utilizing large PLMs. Even with the relation extraction approach used in this work, a more comprehensive evaluation can be performed by considering different entity markers and providing more or less information about the entities, such as the entity types.

The KGs populated from the extracted key figures allow us, as a next step, to compare the KGs of existing and new law texts in terms of their key figures. In this future work, we also plan to evaluate other approaches for differential analysis and then compare them to the semantic approach described in this work. The detected changes then provide the input for an application to predict the impact of the law change on the expected tax revenue.

The ontology developed in this work on the basis of German tax acts can also be applied universally to other legal fields and languages.

Acknowledgments

The authors acknowledge the financial support by the German Federal Ministry of Finance in the project "KISS - KI-gestütztes System zur Steueranalyse".

References

[1] M. Namysł, Robust Information Extraction From Unstructured Documents, Ph.D. thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, 2023. URL: https://hdl.handle.net/20.500.11811/10560.

[2] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.

[3] M. Palmirani, G. Governatori, A. Rotolo, S. Tabet, H. Boley, A. Paschke, Legalruleml: Xml-based rules and norms, RuleML America 7018 (2011) 298–312. doi:10.1007/978-3-642-24908-2_30.

[4] J. Moreno Schneider, G. Rehm, E. Montiel-Ponsoda, V. Rodríguez-Doncel, P. Martín-Chozas, M. Navas-Loro, M. Kaltenböck, A. Revenko, S. Karampatakis, C. Sageder, J. Gracia, F. Maganza, I. Kernerman, D. Lonke, Lynx: A knowledge-based AI service platform for content processing, enrichment and analysis for the legal domain, Information Systems 106 (2022) 101966. URL: https://www.sciencedirect.com/science/article/pii/S0306437921001563. doi:10.1016/j.is.2021.101966.

[5] N. Holzenberger, B. V. Durme, Factoring statutory reasoning as language understanding challenges, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, Association for Computational Linguistics, 2021, pp. 2742–2758. URL: https://doi.org/10.18653/v1/2021.acl-long.213. doi:10.18653/v1/2021.acl-long.213.

[6] J.-C. Klie, M. Bugert, B. Boullosa, R. E. de Castilho, I. Gurevych, The inception platform: Machine-assisted and knowledge-oriented interactive annotation, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 2018, pp. 5–9.

[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).

[8] B. Chan, S. Schweter, T. Möller, German's next language model, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 6788–6796. URL: https://aclanthology.org/2020.coling-main.598. doi:10.18653/v1/2020.coling-main.598.

[9] M. Ostendorf, T. Blume, S. Ostendorf, Towards an open platform for legal information, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, JCDL '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 385–388. URL: https://doi.org/10.1145/3383583.3398616. doi:10.1145/3383583.3398616.

[10] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.

[11] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, 2020. URL: https://arxiv.org/abs/2004.05150. doi:10.48550/ARXIV.2004.05150.

[12] M. Sagen, Large-Context Question Answering with Cross-Lingual Transfer, Master's thesis, Uppsala University, Department of Information Technology, 2021.

[13] W. Zhou, M. Chen, An improved baseline for sentence-level relation extraction, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online only, 2022, pp. 161–168. URL: https://aclanthology.org/2022.aacl-short.21.

[14] I. Chalkidis, I. Androutsopoulos, A. Michos, Extracting contract elements, in: Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law, ICAIL '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 19–28. URL: https://doi.org/10.1145/3086512.3086515. doi:10.1145/3086512.3086515.

[15] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi-label text classification on EU legislation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 6314–6322. URL: https://aclanthology.org/P19-1636. doi:10.18653/v1/P19-1636.

[16] E. Leitner, G. Rehm, J. Moreno-Schneider, A dataset of German legal documents for named entity recognition, in: Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 4478–4485. URL: https://aclanthology.org/2020.lrec-1.551.

[17] I. Glaser, S. Moser, F. Matthes, Summarization of German court rulings, in: Proceedings of the Natural Legal Language Processing Workshop 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 180–189. URL: https://aclanthology.org/2021.nllp-1.19. doi:10.18653/v1/2021.nllp-1.19.

[18] M. Wrzalik, D. Krechel, GerDaLIR: A German dataset for legal information retrieval, in: Proceedings of the Natural Legal Language Processing Workshop 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 123–128. URL: https://aclanthology.org/2021.nllp-1.13. doi:10.18653/v1/2021.nllp-1.13.

[19] I. Chalkidis, T. Pasini, S. Zhang, L. Tomada, S. Schwemer, A. Søgaard, FairLex: A multilingual benchmark for evaluating fairness in legal text processing, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4389–4406. URL: https://aclanthology.org/2022.acl-long.301. doi:10.18653/v1/2022.acl-long.301.

[20] B. Waltl, G. Bonczek, E. Scepankova, F. Matthes, Semantic types of legal norms in German laws: classification and analysis using local linear explanations, Artificial Intelligence and Law 27 (2019) 43–71. doi:10.1007/s10506-018-9228-y.

[21] C. Dozier, R. Kondadadi, M. Light, A. Vachher, S. Veeramachaneni, R. Wudali, Named Entity Recognition and Resolution in Legal Text, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 27–43. URL: https://doi.org/10.1007/978-3-642-12837-0_2. doi:10.1007/978-3-642-12837-0_2.

[22] I. Glaser, B. Waltl, F. Matthes, Named entity recognition, extraction, and linking in German legal contracts, in: IRIS: Internationales Rechtsinformatik Symposium, 2018, pp. 325–334.

[23] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Neural contract element extraction revisited, in: Workshop on Document Intelligence at NeurIPS 2019, 2019. URL: https://openreview.net/forum?id=B1x6fa95UH.

[24] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. URL: https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.

[25] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. URL: https://aclanthology.org/D19-1371. doi:10.18653/v1/D19-1371.

[26] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[27] J. J. Andrew, Automatic extraction of entities and relation from legal documents, in: Proceedings of the Seventh Named Entities Workshop, Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 1–8. URL: https://aclanthology.org/W18-2401. doi:10.18653/v1/W18-2401.

[28] Y. Chen, Y. Sun, Z. Yang, H. Lin, Joint entity and relation extraction for legal documents with legal feature enhancement, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 1561–1571. URL: https://aclanthology.org/2020.coling-main.137. doi:10.18653/v1/2020.coling-main.137.

[29] J. Hong, D. Chong, C. Manning, Learning from limited labels for long legal dialogue, in: Proceedings of the Natural Legal Language Processing Workshop 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 190–204. URL: https://aclanthology.org/2021.nllp-1.20. doi:10.18653/v1/2021.nllp-1.20.

[30] C. Cardellino, M. Teruel, L. A. Alemany, S. Villata, A low-cost, high-coverage legal named entity recognizer, classifier and linker, in: Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law, ICAIL '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 9–18. URL: https://doi.org/10.1145/3086512.3086514. doi:10.1145/3086512.3086514.

[31] N. Lüdemann, A. Shiba, N. Thymianis, N. Heist, C. Ludwig, H. Paulheim, A knowledge graph for assessing aggressive tax planning strategies, in: J. Z. Pan, V. Tamma, C. d'Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne, L. Kagal (Eds.), The Semantic Web – ISWC 2020, Springer International Publishing, Cham, 2020, pp. 395–410.