
Concept Hierarchy Extraction from Legal Literature

Stefan Langer, Legal Horizon AG, Magdeburg, Germany, stefan.langer@legalhorizon.ag

David Broneske, Otto von Guericke University Magdeburg, Germany

Gunter Saake, Otto von Guericke University Magdeburg, Germany

Sabine Wehnert, Otto von Guericke University Magdeburg, Germany

2018

Due to the ever-increasing amount of legal regulations, scholars have become interested in finding ways to capture domain-relevant knowledge and to facilitate navigation in legal text corpora. Furthermore, the contextual nature of legislation requires enhanced semantic capabilities to identify relevant regulations for specific user needs. This work aims at collecting concept hierarchies from German literature in the legal domain, which are then integrated into a knowledge base with multiple clusters, allowing for different perspectives and efficient lookups. With references to regulations in the leaves of the concept tree and higher levels carrying an increasingly abstract context, the resulting hierarchies provide the basis for creating legal domain knowledge in German law. Starting with rule-based annotation, we cluster extracted references, given their context features derived from tables of contents and reasons for citing from various textbook formats. We study the expressiveness of the obtained reference context features. Since different authors have their own notion of hierarchy given by the table of contents, we propose a heterogeneous lightweight ontology allowing for the coexistence of similar, yet diverse concept hierarchies to dynamically determine the best fit for a user in a semi-supervised setting. This approach is novel, since state-of-the-art ontologies are conventionally modeled under full integration and in a top-down manner, often not accounting for perspectives in knowledge representation.

1 Introduction

Nowadays, enterprises as well as lawyers are facing the challenge of keeping track of an overwhelming number of legal texts from different jurisdictions. Yet, it is their obligation to ensure compliance, so that often manual efforts are made to monitor changes in law. On the other hand, this means that new developments need to be integrated into already existing knowledge, e.g., if a law is amended and impacts other regulations which are used in a specific scenario, the knowledge needs to be adapted accordingly. There is a need for context-sensitive search and a grouping method which ensures that all relevant documents are retrieved for a specific situation. The natural language processing (NLP) community has made many advances, such as building citation networks [ZK07, WLM16]. Surprisingly, there are few works addressing the extraction of legal concept hierarchies based on implicit semantic relations between legal texts. We define implicit semantic relations as relationships among legal texts which only apply in specific contexts, so that they are not coded as explicit citations within generally applicable regulations. For example, depending on the expertise of a lawyer (i.e., knowledge about implicit semantic relations), he can use his background to identify connected laws which are important for a specific case.

In this paper, we propose a method to extract information from a large number of textbooks. It can be used to identify contextually relevant texts based on their mentions within literature, providing evidence of a semantic relationship between legal texts depending on their closeness within the resulting concept hierarchy. This form of domain knowledge is modeled in a bottom-up manner, using the references to legal texts in the literature as instances in the bottom levels of the concept hierarchy. Above, descriptive context representations are desired, which we refer to as reasons for citing, for each respective regulation. These representations and relationships can be modeled according to the desired expressiveness of the resulting ontology. Winkels et al. show that reasons for citing can be extracted from the sentence referring to the respective regulation, and narrow them down to four relationship categories: selection, application, concluding (denying) and a category for "in relation to" [WBVvS14]. Zhang and Koppaka link relevant legal texts based on reasons for citing and let experts assess their contextual quality [ZK07]. There are works addressing legal text linking based on the information given therein [FMPT10, BDCG+15]. These approaches use explicit citations from within the document itself or its metadata. We choose to use external knowledge from literature to find relationships which cannot be directly detected within these documents. For this, we model relationships among legal texts in a concept hierarchy, founded upon the spatial co-occurrence of their mentions in legal literature.

Our approach is therefore a step in a new direction of legal informatics, because we consider legal literature as a source of concept hierarchies to build domain knowledge. We base our method on the assumption that a (sub-) chapter headline corresponds approximately to the concept described in the section. Furthermore, the cited legal texts in each passage are seen as semantically related to the discussed concept of the respective section. While this assumption does not always hold - especially in cases where authors use creative titles - our studied literature contains descriptive concepts in most headings of sections.

For the scope of this paper, we establish a connection between legal documents which co-occur in the same chapter, part, section or lower level subsections. By means of a concept hierarchy, we are able to identify closely related legal texts in the lower parts, as well as those which have a higher distance given only one common concept on a high abstraction level. A limitation of this approach is that we extract and maintain explicit keywords forming a concept. Hence, we do not integrate it into a common understanding of standardized concepts, as it can be encountered in standard ontologies. Having legal textbooks of many different formats and authors as data sources, we expect many contradictions to occur during an attempt to establish mapping rules for a standard axiomatic ontology. Therefore, we follow a different notion of knowledge representation.

Similar to the process of studying law, we aim for a diversity of perspectives within our system, which are chosen depending on the context. Specifically, we are interested in the effects of letting a concept hierarchy remain in its original structure, derived from the table of contents (TOC), and coexist among other similar concept hierarchies belonging to the same cluster. In this work, we show how such an approach can model the contextual application of regulations and how it is able to adapt to user-given feedback. Thus, the contribution of this work is a combination of the following techniques:

We apply rules to annotate elements in a textbook.

We access DBpedia knowledge for named entity resolution.

We form concept hierarchies and evaluate their components.

We group concept hierarchies with nominal clustering.

We discuss the use of heterogeneous lightweight ontology clusters for legal texts.

The remainder is structured as follows: Section 2 contains related work regarding concept hierarchy extraction, lightweight ontologies and the formation of clusters. Since our approach is derived from observations of research gaps for our specific use case, we provide a justification of our methods alongside. In Section 3, we describe our method of extracting concept hierarchies from legal literature and the subsequent steps of constructing the domain knowledge. We discuss experimental results in Section 4. Finally, we conclude our findings and unveil future research potential.

2 Related Work

We introduce three main aspects regarding our aim of capturing and applying knowledge from textbooks. The concept hierarchy is derived from the inherent structure of a piece of literature. In this section, we first name some alternative approaches to extract concept hierarchies. Second, we provide the background for the formation of our knowledge base, being derived from a heterogeneous ontology. Third, we briefly outline a clustering method because it provides some further optimization options to control the cluster formation of a heterogeneous ontology.

2.1 Concept Hierarchy Extraction

Concept hierarchies are a means for representing knowledge in a hierarchical manner, having nodes of increasing abstraction per level and things as instances in the leaves of the tree. We intend to represent links between legal texts by shared concepts: The higher a linking node between two instances is located in the concept hierarchy, the more distant the two documents are. There are several approaches for extracting concept hierarchies from unstructured text. Among them, we find rules to detect hyponymy relations based on Hearst Patterns [Hea92], for example to represent legal vocabularies. Eigenvector decomposition is another method for identifying term taxonomies [BDMP06]. Those patterns, however, are not applicable for the use case of linking legal texts. Lexical hyponymies are not suitable for references modeled as instances of the concept hierarchy tree, since the subsumption relation is not based on the vocabulary, but on semantic relatedness gained from textbooks. Kuo et al. [KTH06] propose hierarchical clustering to build concept hierarchies, while the extraction of noun groups is also a valid approach [ROB17].
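The relation between the height of a linking node and the distance of two references can be illustrated with a minimal sketch. The toy hierarchy, the node names and the edge-counting distance below are assumptions made for illustration only; the paper does not prescribe a particular distance function.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent). Leaves are references,
# inner nodes are concepts taken from a table of contents.
PARENT: Dict[str, Optional[str]] = {
    "Book": None,
    "Chapter 1": "Book",
    "Chapter 2": "Book",
    "Section 1.1": "Chapter 1",
    "Section 1.2": "Chapter 1",
    "REF_A": "Section 1.1",
    "REF_B": "Section 1.1",
    "REF_C": "Section 1.2",
    "REF_D": "Chapter 2",
}


def path_to_root(node: str) -> List[str]:
    """Return the node followed by all of its ancestors up to the root."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def linking_node(a: str, b: str) -> str:
    """Return the lowest concept shared by both nodes."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def distance(a: str, b: str) -> int:
    """Count the edges from both nodes up to their linking node."""
    link = linking_node(a, b)
    return path_to_root(a).index(link) + path_to_root(b).index(link)
```

In this sketch, two references in the same section meet at that section and receive a small distance, whereas references that only share the book-level node receive the largest one.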

We examine methods of noun group extraction combined with hierarchical clustering further, and propose a combination of them for concept hierarchy extraction from literature. This approach is based on the assumption that an author captures the topic of a section within its title. In the highest levels of abstraction within our concept hierarchy, we gather elements from the Tables of Contents (TOC) within literature. Finally, we obtain a coarse- to fine-grained clustering of regulations based on the understanding of the corresponding author, while we assume that the reasons for citing in particular are relevant features justifying the cluster membership of a regulation.

Similar to this work, Günel and Aşlıyan [GA10] describe how to extract concepts from tutoring material in TeX format using domain relevance, entropy and lexical cohesion as inclusion criteria. Wang et al. extract concept hierarchies from textbooks by the TOC and Wikipedia [WLW+15]. We also use the TOC to find local relatedness of regulations given the section title and Wikipedia for Named Entity Resolution. Robin et al. compare two approaches for legal concept hierarchy extraction: hierarchical clustering and the extraction of topical expressions composed of noun groups [ROB17]. Bruckschen et al. populate a legal ontology based on Named Entity Recognition [BNS+10]. In a related field, an approach using syntactic positions, called Formal Concept Analysis, is suggested by Cimiano et al. to extract concept hierarchies [CHS04]. Based on topic modeling, part-of-speech tags and tf-idf weighting, Anoop et al. [AAD16] suggest an unsupervised method for concept hierarchy extraction. A possible drawback of statistical topic modeling methods is the instability of retrieved topics and their keywords if the process is repeated on the same data. Belford et al. propose a method relying on matrix factorization to increase the stability and accuracy of topic models [BMNG18].

In contrast to these implementations, we use a rule-based approach to extract information. Legal applications can benefit from the control over data quality that a system designer has while using rule-based approaches, without compromising on the amount of data. There are some deviations from the pattern where authors incorporate creative headings for didactic purposes, but we find very few of these cases in our collection of legal literature. We show the results of our approach in Section 4.

2.2 Heterogeneous Legal Ontology

Despite some variation in the style format among the pieces of literature, another major challenge arises from the obtained concept hierarchies themselves: Initially, we obtain standalone hierarchies from each book, and the difference among them is unknown. However, topical overlaps are possible for diversified literature, thus posing a challenge in integrating all concept hierarchies in a non-contradicting manner.

Instead, we capture the contextual character of legal texts. Following the notion of hierarchical ontology clusters proposed in [VC98], we develop the idea of allowing multiple concept hierarchies to coexist without integrating them. Conventionally, one common language and understanding is desired for system architectures whose components access the same domain knowledge. Despite these advantages, for our application such an ontology requires high maintenance efforts resulting from frequent insertions of further knowledge, either by automatically determining valid mappings or checking for logically matching candidates.

In the legal domain, a common requirement is to ensure that all relevant documents are retrieved, thus we optimize for a high recall. This is however challenging when working with natural language, for example when encountering its cases of ambiguity, near-synonyms and polysemy. We therefore argue that concepts in legal literature may differ even for equal topics, which is due to different perspectives of the authors and their own interpretation. However, any human regularly overcomes these inconsistencies and ambiguity by either choosing one concept for a narrow but consistent understanding, or by broadening the scope and encompassing multiple sources to avoid omissions of important items, while accessing the most appropriate fit based on a contextual decision criterion. This criterion can be derived from user-provided feedback, for example by marking a document as irrelevant. Then, the concept hierarchy will be selected which most likely captures the user need based on the recomputation of relevance.

Since our intended knowledge base is built in a bottom-up manner, this work is different from axiomatic ontologies. There are legal ontologies available such as ALLOT [BDIPV13] or LKIF [HBDB+07], which are able to encompass multiple legal data sources, but also require alignment of the respective classes. These ontologies are built upon a document standard called Akoma Ntoso [VZ07] and offer many ways of standardized information modeling on the document level and beyond. For our specific use case, we identify two possibilities to achieve our goal: Either an expert maintains contextual information regarding specific applications of laws together in such a standardized ontology (for instance, by using the contextual ontology language C-OWL [BGvH+03]), or there is a system for legal literature covering different scenarios, user categories and jurisdictions, ideally resulting in a complete collection of all regulations needed for a case. Several bottom-up lightweight ontologies for legislative terms and entities exist [BGBI16, ABC+16]. Our knowledge representation differs from these works substantially in terms of the application scenario and extraction method. To the best of our knowledge, there is no approach for the same use case within the legal domain allowing for a fair comparison with our work.

2.3 Concept Hierarchy Clusters

Given a large collection of textbooks, we apply clustering to increase contextuality and to reduce the search space for finding the most applicable concept hierarchy for a context. As a result, many references from different concept hierarchies are merged together. In order to structure the cluster, the distance information given by a hierarchical clustering algorithm can be exploited. For user-centered applications, a semi-supervised clustering method has been proposed by Bade and Nürnberger [BN14]. They introduce must-link-before constraints for clustering algorithms which can be applied to hierarchical agglomerative clustering. Those constraints identify instances to be linked and those which shall remain separate. Different from other works, this method also implies the means to model the hierarchical order of instances without requiring to define the exact level difference. As a use case for an enforced hierarchy, consider a scenario where a distance between European and national law is desired. After including must-link-before constraints, instances from the specified category are located closer to the reference instance than those which are forced to link on a higher node of the concept tree. The algorithm we use in the scope of this work allows for must-link and cannot-link constraints by defining a relationship between two features [MHAK16]. Due to space limitations, we leave the examination of constraint effects for future work and implement the clustering algorithm without constraints.

3 Concept Extraction for Heterogeneous Ontologies

Following relevant literature and the justification of our method, we outline our approach for building a heterogeneous ontology. In particular, we describe the process of annotating features in textbooks to obtain a contextual representation of the reference by means of concept hierarchy clusters. Figure 1 depicts the workflow.

1. An electronic literature resource is converted into a txt file.
2. The text is preprocessed by performing tokenization, sentence chunking, orthographic coreference resolution, part-of-speech tagging, Roman numeral identification and named entity resolution using web knowledge from DBpedia.
3. Rule-based annotation is applied to match TOC components (Chapter, Part, Subchapter, Subsubchapter) and CS components (regulation name REG, DBpedia concept DBp, relationship REL and references REF).
4. All annotations are extracted into a csv file, resulting in a table of tokens T with their respective annotation features.
5. The file is treated as a lookup table and for each TOC component, boundaries are determined.
6. All references are matched in document order to each TOC component with respect to the different section boundaries. Also, the CS information is retrieved from an extracted annotation file and assigned to the REF.
7. After the feature information has been detected, a flat representation of the concept hierarchy is stored, with one REF instance per line and its TOC and CS feature information.
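Steps 5 to 7 can be sketched as a single pass over the annotation table in document order. This is a minimal illustration under stated assumptions, not the implementation used in this work: the row layout (token index, annotation type, value), the sample rows and the function name `flat_hierarchy` are hypothetical, and section boundaries are handled implicitly by resetting the deeper TOC levels whenever a new heading is encountered.

```python
from typing import Dict, List, Optional, Tuple

# TOC levels from most abstract to most specific, as named in step 3.
TOC_LEVELS = ["Chapter", "Part", "Subchapter", "Subsubchapter"]

# Hypothetical extract of the annotation table:
# (token index, annotation type, annotation value).
ANNOTATIONS: List[Tuple[int, str, str]] = [
    (0, "Chapter", "1"),
    (5, "REF", "REF1"),
    (10, "Subchapter", "1.1"),
    (14, "REF", "REF2"),
    (20, "Subchapter", "1.2"),
    (27, "REF", "REF3"),
]


def flat_hierarchy(
    annotations: List[Tuple[int, str, str]]
) -> List[Dict[str, Optional[str]]]:
    """Return one row per reference with the TOC headings it falls under."""
    current: Dict[str, Optional[str]] = {level: None for level in TOC_LEVELS}
    rows: List[Dict[str, Optional[str]]] = []
    for _, kind, value in sorted(annotations):  # document order
        if kind in TOC_LEVELS:
            depth = TOC_LEVELS.index(kind)
            current[kind] = value
            # A new heading closes every section below its own level.
            for deeper in TOC_LEVELS[depth + 1:]:
                current[deeper] = None
        elif kind == "REF":
            rows.append({"REF": value, **current})
    return rows
```

Each returned row corresponds to one line of the flat concept hierarchy; the CS features would be joined to the row in the same way once they are available.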

[Figure 1: Workflow with the stages Book Preprocessing, Annotation, Concept Annotation Extraction, Lookup and Matching (grouping REF by TOC component), Compose Flat Concept Hierarchy, Cluster Concept Hierarchy Instances, and Query Knowledge Base with Feedback.]

10. A feedback mechanism can be implemented to narrow down relevant references. Different from our idea, Boonchom and Soonthornphisaj use term frequency-based ontology seeds for a legal ontology search task [sBS12]. A similar approach for query expansion using a hierarchical legal knowledge base is by Schweighofer et al. [SG+07]. Yet, their relevance feedback is based on the preferences of other users, unlike our approach focusing only on content.

Selected process steps to obtain the knowledge base are described in more detail in the following. We share more implementation details and program code on GitHub.

Footnote 2: https://gate.ac.uk/sale/tao/splitch8.html
Footnote 3: We use the German german-hgc.tagger from the Stanford parser, https://nlp.stanford.edu/software/tagger.shtml

Depending on the publisher, a table of contents manifests itself in various styles. From numeral-only versions to mixed alphabet, Roman numeral and numeric variations, we define separate rules to capture each distinct heading element including its level in the context of the table of contents. Despite the efforts in rule definition, there are not many substantial variations within each publishing style, so that minor inconsistencies may be captured by generalization from seen examples. Waltl et al. combine the advantages of rule-based approaches with those of machine learning techniques because domain knowledge can be directly incorporated into the training phase to obtain more control over results [WBM18]. However, training an annotation classifier is out of scope of this work and remains a potential future optimization task. After annotation, we export the TOC features. Based on the detected elements, we determine the boundaries for each level of the TOC hierarchy to store the respective references contained per part, subchapter and subsubchapter.

3.1.2 Reasons for Citing (RFC) and Relationships (REL)

Each sentence with a reference to a legal text potentially contains information about the rationale of this citation, which serves as a contextual summary. We divide the citation summary CS into the regulation name REG, the reason for citing RFC (following the notion of an entity) and its relationship REL with the regulation, captured by verb forms. Extracting the CS serves as feature information for a clustering algorithm. Another application is in connection with a reasoner based on the abstract relationships. Similar to the approach of Winkels et al. [WBVvS14], a model of relationships among legal texts can be derived from textbooks and then be incorporated into the concept hierarchy. In addition, reasons for citing RFC can be considered for the user of a (content-based) legal recommender system as an explanatory component, to be displayed alongside the reference as a context descriptor. We find several pattern varieties proposed for keyphrase extraction and consider them for the RFC [WZH16, Hul03]. While the respective authors analyze the English language and capture adjective groups in addition to noun groups, there are more distinctions available for part-of-speech tags in the German language. Since including all adjective groups results in a larger number of distinct nominal features, we limit the pattern to minor sequence variations allowing for attributive adjectives. In our use case, we define the following expression to capture the RFC:

RFC = (NN | NNS | NNP | NNPS | NE | (NN (ADJA | NN) NN))+    (1)

Due to space limitations, this pattern is a simplified version of the actual one, here only listing candidate part-of-speech (POS) tags using the STTS tagset [STT95]. Our rules account for a variety of possible sentence structures in German natural language. Those patterns which are formulated by using the more expressive JAPE rule syntax are defined with priorities, so that the most restrictive rule is applied first.
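To make the simplified pattern (1) concrete, the following sketch scans a sequence of POS tags and returns the longest spans matching it. It is an illustration of the simplified expression only, not of the JAPE rules used in this work; the function name `rfc_spans` and the sample tag sequences are assumptions. Because a plain NN already matches on its own, the nested alternative only needs special treatment for the NN ADJA NN variant.

```python
from typing import List, Tuple

# Single-token alternatives of the simplified pattern (1).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "NE"}


def rfc_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, matching pattern (1)."""
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tags):
        start = i
        while i < len(tags):
            if tags[i:i + 3] == ["NN", "ADJA", "NN"]:
                # Noun, attributive adjective, noun. The NN NN NN variant
                # is covered by three consecutive single-token matches.
                i += 3
            elif tags[i] in NOUN_TAGS:
                i += 1
            else:
                break
        if i > start:
            spans.append((start, i))
        else:
            i += 1  # no match at this position, move on
    return spans


# Hypothetical tag sequence for one sentence.
example = ["ART", "NN", "ADJA", "NN", "VVFIN", "NE"]
spans = rfc_spans(example)
```

For the hypothetical sequence above, the sketch yields one span covering the noun, adjective and noun in positions 1 to 3 and a second span for the named entity at the end.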
Likewise, patterns for relationship extraction have been examined by multiple authors [FSE11]. We adapted them to the German language and added negation tags with

REL = (PTKNEG | V-INF | V-PP | V-FIN)+    (2)

as the simplified relationship pattern REL. In the verb categories we subsume the tags using a hyphen; for example, V-INF is a placeholder for VAINF, VVINF and VMINF, which are originally output by the Stanford parser. The relationship feature of the annotation in this case is formed as a concatenation of REL matches within a sentence containing RFC. We adjust the matching rule regarding specific word patterns for important indicators, namely strings indicating contradictions (e.g., in German "Widerspruch") or selections (e.g., in German "Beispiel"), which cannot be generalized with part-of-speech information. Also, if there is a syntactic indication of a legal term definition (e.g., in German "nach" or "gemäß") within a law, we fill undetected REL fields with an is-relationship (in German: "ist"). Furthermore, we clean the matches by parsing out non-descriptive strings for a relationship between a reference and its reason for citing (e.g., in German "denke"). This consequently results in sparse relationship features, since the above rules are both specified within sentence boundaries. While we assume that a sentence citing a regulation contains RFC and REL patterns, this is not always the case. For the subsequent steps, we only consider those regulations containing RFC, and optionally REL. Any annotated regulation contained in the document where RFC is missing may not hold enough context information to determine its applicability for the context. Despite this limitation, it should not have severe consequences in case of a sufficiently large heterogeneous ontology, since other extracted concept hierarchies for the same context should cover possible gaps due to the highly regularized nature of legislation.

3.1.3 Regulations (REG, REF)

Many scholars have examined methods to extract regulations from unstructured text [WLM16], often to create a citation network based on the references within the original regulation text [WBVvS14]. While currently machine learning approaches remain popular, rule-based methods achieve high precision and recall as well, which is due to the highly regularized pattern of regulation citation. In German law, there are fixed citation guidelines. Therefore, a sufficiently high proportion of citations can be detected with rules, with precision and recall in the range from 80% to 90% [WLM16]. In addition, legal language contains term definitions, which are implicitly referenced by other laws [WLM16]. Those term definitions can be extracted with rules and stored in a lookup dictionary. Although it is out of scope of this work, we plan to analyze and enrich regulations with legal term definitions, to be found in other regulations, to gain more context information from the knowledge provided in the data source itself. We considered corner cases in reference citations, thus aiming for an improvement of the already high regulation coverage. These corner cases include references containing more than two regulations from different sources, and occurrences of connection indicators, in German abbreviated as "i. V. m.". These annotations shall contribute to a rich knowledge base.

3.1.4 Access Web Knowledge (DBp)

Wang et al. suggest in their approach to apply web knowledge for identifying concept candidates [WLW+15]. We access Wikipedia-based linked open data through the DBpedia Spotlight (footnote 4) plugin for GATE (footnote 5). Unlike their method, we intend the knowledge base to perform named entity resolution directly on the citation summary. If a DBpedia entry exists in the sentence containing a reference, we split the URI to obtain the concept name as a nominal feature. We observe that most matches occur for the regulation or the RFC tokens. There is one frequent misclassification regarding the German Civil Code (BGB), where the DBpedia lookup yields a Swiss political party instead of the civil code, which we manually corrected before composing the concept hierarchy. After having annotated the nine feature types (Chapter, Part, Subchapter, Subsubchapter, REG, DBp, RFC, REL, REF), we export them from GATE and build the concept hierarchy.

3.2 Compose Concept Hierarchies

Figure 2 shows how we compose and evaluate the concept hierarchy. In this example, there are two simplified concept hierarchies, which are obtained from the JAPE rule-based annotations. In the fictive CS node, we summarize the features REG, DBp, RFC, REL for space reasons; however, they are all standalone features. Each element has mandatory values for the Chapter, RFC and Reference. The other fields are optional because we do not assert that the rules return values for each feature.

Footnote 4: https://www.dbpedia-spotlight.org/
Footnote 5: http://www.semanticsoftware.info/lodtagger
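The distinction between mandatory and optional features can be captured in a small record type. The sketch below is an assumption about how one line of the flat concept hierarchy could be represented; the class name `ConceptHierarchyRow`, the field names and the sample values are hypothetical and only mirror the nine feature types named in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConceptHierarchyRow:
    """One reference with its TOC and citation summary features."""

    # Mandatory features according to the text: Chapter, RFC and Reference.
    chapter: str
    rfc: str
    ref: str
    # Optional features, since the rules may not return a value for them.
    part: Optional[str] = None
    subchapter: Optional[str] = None
    subsubchapter: Optional[str] = None
    reg: Optional[str] = None
    dbp: Optional[str] = None
    rel: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject rows that lack one of the mandatory values.
        for name in ("chapter", "rfc", "ref"):
            if not getattr(self, name):
                raise ValueError(f"missing mandatory feature: {name}")


# Hypothetical instance with placeholder values.
row = ConceptHierarchyRow(chapter="1", rfc="Haftung", ref="REF1", subchapter="1.1")
```

Rejecting incomplete rows at construction time mirrors the decision to keep only references for which an RFC could be extracted.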

Given the illustrated concept hierarchy in Figure 2, we evaluate the results by setting the Chapter as a class label (thus expecting a reproduction of the structure of a chapter) and by not including it in the features to be processed. As indicated by the arrows, the test data can match the learned examples by comparison of the subfeatures, and early merges are an indicator for higher similarity between two instances. A possible limitation of this approach comes from the reliance on explicitly stated information. For instance, if the RFC are not indicated within the reference sentence or if they are extracted incorrectly, this can decrease the expressiveness of the features for the desired structure. Since the resulting concept hierarchy depends on the author of the book, the author's perspective may not be suitable for every user. Therefore, we see a possible remedy in the notion of concept hierarchy clusters, forming a heterogeneous lightweight ontology.

3.2.1 Concept Hierarchy Clusters

Extracting a narrow concept hierarchy with only nominal features leads to a lower probability of getting all relevant references for a specific information need. Consider the following example: While one book may focus on the aspects of national law, another depicts European legislation. In reality, this information needs to be considered as a whole, since European legislation supersedes national law.

Recalling the discussion from Section 2.2, we show how exactly a heterogeneous ontology can serve a user who is interested in complete, reliable and founded information. Aside from our experiment of matching extracted instances with Chapter labels, an actual application of this method is to classify for Relevance instead. Figure 3 illustrates how a heterogeneous ontology in legal contexts may emerge. In the setting of a recommender system, suppose there is a cluster containing two concept hierarchies with sets of instances (1, 5, 8) and (1, 2, 4, 8) respectively. In the first scenario depicted on the left hand side, the recommender system receives positive user feedback regarding instance 1. Since this instance is present in the current context, which is narrower than the other concept hierarchy, the context is not altered. In contrast, a similarity function receives negative feedback for instance 5 in the second scenario, thus resulting in a context switch to the other concept hierarchy without instance 5. There are several approaches for similarity

[Figure 2: Two simplified concept hierarchies with the levels Book, Chapter, Part, Subchapter, Subsubchapter and Citation Summary (REG, DBp, RFC, REL) above the references (§). Legend: unknown connection, inferred membership from feature, training data, test data.]
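The feedback scenario with the two concept hierarchies (1, 5, 8) and (1, 2, 4, 8) can be sketched with a simple set-based rule. The rule below is a hypothetical stand-in, not the similarity function of this work: it keeps the current hierarchy while it agrees with the feedback and otherwise switches to the narrowest hierarchy in the cluster that contains every instance marked relevant and none marked irrelevant. The function name `select_context` and the tie-breaking by size are assumptions.

```python
from typing import FrozenSet, List, Set


def select_context(
    cluster: List[FrozenSet[int]],
    current: int,
    relevant: Set[int],
    irrelevant: Set[int],
) -> int:
    """Return the index of the hierarchy to use after user feedback."""

    def consistent(hierarchy: FrozenSet[int]) -> bool:
        # All relevant instances present, no irrelevant instance present.
        return relevant <= hierarchy and not (irrelevant & hierarchy)

    if consistent(cluster[current]):
        return current  # feedback does not contradict the current context
    candidates = [i for i, h in enumerate(cluster) if consistent(h)]
    if not candidates:
        return current  # nothing better available, keep the context
    # Assumption: prefer the narrowest consistent hierarchy.
    return min(candidates, key=lambda i: len(cluster[i]))


# The cluster from the example in the text.
CLUSTER = [frozenset({1, 5, 8}), frozenset({1, 2, 4, 8})]

after_positive = select_context(CLUSTER, current=0, relevant={1}, irrelevant=set())
after_negative = select_context(CLUSTER, current=0, relevant=set(), irrelevant={5})
```

With positive feedback on instance 1 the context stays on the first hierarchy, while negative feedback on instance 5 moves it to the second one, as in the two scenarios described above.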

We conducted some experiments with subsets from the 78 documents (subchapters from three fixed chapters); the results are shown in Section 4.

4 Results

To show the effect of adding knowledge to the heterogeneous lightweight ontology, we evaluate the annotation and perform two experiments. The first experiment applies COBWEB clustering on the features, without knowing the Chapter class label. The second experiment is a classifier for the same features, this time using the COBWEB tree. Before we present their results, we describe the experiment setting and evaluation measures.

4.1 Evaluation Setup

The aim of this evaluation is to determine the expressiveness of our selected features to distinguish between abstract concepts. In this work, we intend to show the feasibility of our proposed knowledge extraction and representation method. Therefore, we create clusters of semantically similar concept hierarchies by using the COBWEB algorithm [Fis87]. It is a recursive

4.2 Evaluation Measures

Regarding the annotation success, we determine the effectiveness of context feature extraction by computing the average coverage of references REF by RFC annotations. Basically, if a sentence contains a pattern which can be detected by our JAPE rules, there will be an RFC annotation. Since we only considered those regulations whose context features (especially RFC) could be retrieved, this evaluation is important to understand how many data points were the basis for the subsequent steps of clustering and classification.
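The coverage measure can be written down in a few lines. The sketch assumes that coverage is computed per document as the share of REF annotations with an RFC and then averaged over documents; whether the averaging is done per document or over pooled counts is an assumption here, and the counts in the example are invented for illustration.

```python
from typing import Dict, Tuple


def rfc_coverage(ref_count: int, rfc_count: int) -> float:
    """Share of references in one document that carry an RFC annotation."""
    if ref_count <= 0:
        raise ValueError("a document needs at least one reference")
    return rfc_count / ref_count


def average_coverage(counts: Dict[str, Tuple[int, int]]) -> float:
    """Mean per-document coverage; documents without references are skipped."""
    ratios = [
        rfc_coverage(refs, rfcs) for refs, rfcs in counts.values() if refs > 0
    ]
    if not ratios:
        raise ValueError("no document with references")
    return sum(ratios) / len(ratios)


# Hypothetical counts per document: (number of REF, number of RFC).
EXAMPLE_COUNTS = {"doc_1": (40, 30), "doc_2": (10, 5), "doc_3": (0, 0)}
example_average = average_coverage(EXAMPLE_COUNTS)
```

Averaging per document weights short and long subchapters equally; pooling the counts instead would give more weight to documents with many references.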

Our evaluation measure for the supervised clustering experiment is the Adjusted Rand Index (ARI), originally proposed by Hubert and Arabie [HA85]. It quantifies the overlap between two partitioning approaches; in our case, we compare the COBWEB clustering and the class labels (i.e., textbook chapters). Its expected value 0 indicates a random clustering, while a value close to 1 corresponds to a high agreement

Footnote 6: https://github.com/cmaclell/concept_formation

[Figure 3: Context selection within a cluster of two concept hierarchies. Left: instance 1 is marked relevant, current context (1, 5, 8). Right: instance 5 is marked irrelevant, current context (1, 2, 4, 8).]

In Table 1, we list the number of reference annotations corresponding to the book chapters: (1) Bankvertragliche Grundlagen (English: Foundations of Banking Contracts), (4) Kapitalmarkt- und Auslandsgeschäfte (English: Capital Market and Foreign Transactions), (8) Europäisches Bankenrecht mit Länderabschnitten (English: European Banking Law by Country). Additionally, we indicate the number of RFC annotations and the average percentage of REF annotations per chapter for which an RFC was detected. The numbers in the column header denote the document number, corresponding to the subchapters of the textbook. We find that almost 75% of the references have an annotation value for RFC. The restrictions we included in our pattern prevent us from extracting the chapter name as a REF. Some references and RFC are missed due to long-range dependencies within the sentence or unwanted headline text insertions at page breaks; apart from that, the noise in the text data (e.g., citations of other books in a reference-like format) did not affect the extraction substantially. Nevertheless, all subsequent steps depend on the annotation, so that a loss in this step propagates forward to the clustering and classification tasks.

4.3 Evaluation of Annotation

We evaluate our annotation results by comparing the number of detected references REF with the number of extracted RFC in the chapter, since we require the latter for concept formation. Spiegel-Rösing found descriptive RFC context in 80% of the sentences of scientific texts. We assume that in a German legal textbook slightly fewer RFC will be detected, due to a different writing style (e.g., more complex syntax and longer sentences). Consequently, our target for RFC annotation is set to 70% of REF occurrences. Therefore,

4.4 Evaluation of Heterogeneous Legal Ontology

We evaluate our results for the COBWEB clustering algorithm using the extracted Chapter feature as the ground truth class. The COBWEB clustering algorithm is supposed to group the instances based on the remaining context information, from the Part feature down to the REF feature. In order to show the effect of a successful extraction method, we restricted the instances to those cases where a value could be retrieved for the Part feature, since this is the most abstract class. To obtain an equal class distribution, we downsampled the instances of the other chapters to match the class with the fewest remaining instances. We did not select instances at random; instead, we selected a group of instances which were spatially close in the textbook. This has the advantage of not missing important context, as well as limiting the variance in nominal features. Running the evaluation with different instance groups yielded mostly similar results; however, we observe that more variability leads to less similar examples and thus a lower ARI score.
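The balancing step can be sketched as follows. The sketch assumes that instances are dictionaries listed in textbook order and that the contiguous group is a window centred within each chapter; the text only states that spatially close instances were selected, so the window position, the function name and the `Chapter` key are assumptions.

```python
from collections import defaultdict


def contiguous_balanced_sample(instances, label_key="Chapter"):
    """Downsample every class to the size of the smallest one, keeping one
    contiguous run of instances per class instead of a random subset."""
    by_class = defaultdict(list)
    for instance in instances:  # instances are assumed to be in textbook order
        by_class[instance[label_key]].append(instance)
    if not by_class:
        return []

    target = min(len(group) for group in by_class.values())
    sample = []
    for group in by_class.values():
        start = (len(group) - target) // 2  # assumption: centred window
        sample.extend(group[start:start + target])
    return sample


# Hypothetical instances: 5, 3 and 4 references from chapters 1, 4 and 8.
instances = [
    {"Chapter": chapter, "position": position}
    for chapter, size in (("1", 5), ("4", 3), ("8", 4))
    for position in range(size)
]
balanced = contiguous_balanced_sample(instances)
print(len(balanced))
```

Keeping a contiguous window preserves neighbouring references that share Part and section context, which is the motivation given above for not sampling at random.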

For the first evaluation, shown in Figure 4, with 2 principal components p, 3 Chapters and 1020 instances i of balanced classes, we obtain an Adjusted Rand Index (ARI) of 0.28. Each axis holds one principal component analysis (PCA) dimension to visualize a projection of the cluster shape. In line with our expectation, there are three clusters, each consisting of two to three elliptical shapes. The chapter labels in Figure 4 indicate that the algorithm does not have enough information to distinguish between chapter (1) (labeled as B), chapter (4) (labeled as K) and chapter (8) (labeled as E). Many instances, particularly of chapters (4) and (8), are placed in the wrong cluster. From this, we conclude that despite having balanced classes, there may be topical overlaps among the concept hierarchies, which should either result in a merge or lack evidence for separate groups.

If we allow for a slight class imbalance by increasing the number of chapter (1) and (4) instances by a comparable amount, to 1149 instances, the ARI increases to 0.64, as shown in Figure 5. This also led to a different cluster shape and a better discrimination between the three chapter classes. The improvement can be seen in the classes, where more labels correspond to the cluster membership. It indicates that the clustering approach found more agreement between clusters and the ground truth classes. From this observation we conclude that additional examples can lead to a higher ARI if they broaden the feature value space only moderately. In previous experiments, we applied the algorithm to all extracted instances, leading to an ARI of 0.05, presumably because of the high variance of instances within a chapter and the different chapter lengths.
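As background for the clusterings evaluated above: COBWEB [Fis87] decides where to place an instance by maximising category utility, which rewards partitions whose clusters make attribute values more predictable than they are in the data as a whole. The sketch below computes category utility for nominal attributes only. The feature names and values are hypothetical, and the concept formation library used in the experiments may compute the measure differently.

```python
from collections import Counter


def expected_correct_guesses(instances):
    """Sum over all attribute values of P(attribute = value)^2 in a group."""
    n = len(instances)
    counts = Counter(
        (attribute, value)
        for instance in instances
        for attribute, value in instance.items()
    )
    return sum((count / n) ** 2 for count in counts.values())


def category_utility(clusters):
    """Category utility of a partition of nominal instances."""
    clusters = [cluster for cluster in clusters if cluster]
    if not clusters:
        return 0.0
    everything = [instance for cluster in clusters for instance in cluster]
    baseline = expected_correct_guesses(everything)
    gain = sum(
        len(cluster) / len(everything) * (expected_correct_guesses(cluster) - baseline)
        for cluster in clusters
    )
    return gain / len(clusters)


# Hypothetical instances with a single nominal context feature.
separated = [[{"Part": "A"}, {"Part": "A"}], [{"Part": "B"}, {"Part": "B"}]]
mixed = [[{"Part": "A"}, {"Part": "B"}], [{"Part": "A"}, {"Part": "B"}]]
print(category_utility(separated), category_utility(mixed))
```

A partition whose clusters mix feature values gains nothing over the baseline, which is one way to read the low ARI obtained when chapters overlap topically.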

[Figure caption: COBWEB tree with r=10, num=100]

Since this class imbalance will naturally occur in a heterogeneous ontology, we need to investigate further how the approach scales and what the limitations are regarding feature diversity.

We perform a second experiment on the same data, but in the classification setting, with a COBWEB tree with 10 runs r and 300 training instances num. The results of the classification algorithm are shown in Figures 6 and 7, including 95% confidence intervals for the average precision and recall values. In Figure 6, the confidence intervals span a range of 40 percentage points (pp), indicating an unstable classification result of 80% precision and 87% recall on average after 200 training examples. The effect of adding further examples is illustrated in Figure 7 and is similar to the previous experiment: precision gains about 10pp and the average recall increases slightly, by 5pp. Note that the range of the confidence interval is reduced to 20pp for recall and to 10pp for precision, which is a considerable improvement of the classifier performance. In summary, the results for the COBWEB algorithm vary depending on the number of examples for each concept hierarchy. A recall of more than 90% is desirable, so we regard the results from the second setup of each experiment as sufficient evidence that the features are descriptive enough to distinguish between different contexts.

We now discuss the general applicability of the results. There is more research potential in the question of whether this approach also works for literature from other domains, or what happens if other clustering algorithms with advanced capabilities of constraint formulation are chosen. Considering that we used concept hierarchies mostly about general banking law, financial markets and European banking law, the overlap of REG and RFC is considerable. After other books about different subjects are added, those three concept hierarchies may form a cluster.

During the concept hierarchy extraction, we found four major limitations of our approach. First, literature resources are needed which cover the information need; otherwise, users may not find their case represented. Second, each textbook can have a different format of citations or TOC components, which results in a higher manual effort for rule formulation. Third, since we only had the PDF files of the literature available, there were challenges in segmenting the file and assigning references to each section, leading to missing feature values. Fourth, despite having gained much domain information from the textbook, we need to investigate more methods of leveraging it. Since we plan to implement a lightweight heterogeneous ontology, we outline future research fields in Section 5.
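The confidence intervals reported for the classification experiment summarise how much precision and recall vary across the r runs. The text does not state how the intervals were computed, so the following sketch assumes a normal approximation over per-run scores; the function name and the example values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(scores) / len(scores) ** 0.5
    return mean(scores), half_width


# Hypothetical precision values from r = 10 runs; not the reported results.
precision_runs = [0.62, 0.91, 0.78, 0.85, 0.70, 0.95, 0.66, 0.88, 0.81, 0.84]
average, half_width = mean_with_ci(precision_runs)
print(f"precision = {average:.2f} +/- {half_width:.2f}")
```

With only 10 runs, a Student-t quantile would give a somewhat wider interval than the normal approximation used here.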

5 Conclusion and Future Work

To conclude, our lightweight heterogeneous ontology is composed of concept hierarchies which are derived from literature. It is a promising area for further work. We pointed out the reasons for accepting coexisting perspectives in the legal domain and gave indications of how to take advantage of many sources, while still controlling the results with constraints and user feedback. The rule-based annotation method provided features for context-aware classification and clustering of the concept hierarchies. Overall, the results indicate that the chosen features, the extraction method and the concept formation library are suitable for detecting semantic similarity in the book we selected. Regarding future work, we are curious how this method performs if additional features of the content of referenced regulations and term definitions are taken into account. Another field to study is the impact of abstract relationship categories on clustering. We see possible applications of the learned ontology in the fields of law clustering, legal context search, topic detection and legal recommender systems, and intend to explore these use cases further.

6 Acknowledgements

The authors would like to thank Andreas Nürnberger and the anonymous referees for their valuable comments. The work is supported by Legal Horizon AG, Grant No. 1704/00082.

[AAD16] Anoop, Asharaf, and Deepak. Unsupervised concept hierarchy learning: a topic modeling guided approach. Procedia Computer Science, 89:386–394, 2016.

[ABC+16] Gianmaria Ajani, Guido Boella, Luigi Di Caro, Livio Robaldo, Llio Humphreys, Sabrina Praduroux, Piercarlo Rossi, and Andrea Violato. The european taxonomy syllabus: A multi-lingual, multi-level ontology framework to untangle the web of european legal terminology. Applied Ontology, 11(4):325–375, 2016.

[BDCG+15] Guido Boella, Luigi Di Caro, Michele Graziadei, Loredana Cupi, Carlo Emilio Salaroglio, Llio Humphreys, Hristo Konstantinov, Kornel Marko, Livio Robaldo, Claudio Ruffini, Kiril Simov, Andrea Violato, and Veli Stroetmann. Linking legal open data: Breaking the accessibility and language barrier in european legislation and case law. In Proceedings of the 15th International Conference on Artificial Intelligence and Law, ICAIL '15, pages 171–175, New York, NY, USA, 2015. ACM.

[BDIPV13] Gioele Barabucci, Angelo Di Iorio, Francesco Poggi, and Fabio Vitali. Integration of legal datasets: From meta-model to implementation. In Proceedings of International Conference on Information Integration and Web-based Applications & Services, IIWAS '13, pages 585:585–585:594, New York, NY, USA, 2013. ACM.

[BDMP06] Holger Bast, Georges Dupret, Debapriyo Majumdar, and Benjamin Piwowarski. Discovering a term taxonomy from term similarities using principal component analysis. In Markus Ackermann, Bettina Berendt, Marko Grobelnik, Andreas Hotho, Dunja Mladenic, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtech Svatek, and Maarten van Someren, editors, Semantics, Web and Mining, pages 103–120. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.

[BGBI16] María G. Buey, Angel Luis Garrido, Carlos Bobed, and Sergio Ilarri. The ais project: Boosting information extraction from legal documents by using ontologies. In Proceedings of the 8th International Conference on Agents and Artificial Intelligence, pages 438–445, 2016. Exported from https://app.dimensions.ai on 2018/08/19.

[BGvH+03] Paolo Bouquet, Fausto Giunchiglia, Frank van Harmelen, Luciano Serafini, and Heiner Stuckenschmidt. C-owl: Contextualizing ontologies. In Dieter Fensel, Katia Sycara, and John Mylopoulos, editors, The Semantic Web - ISWC 2003, pages 164–179, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.

[BMNG18] Mark Belford, Brian Mac Namee, and Derek Greene. Stability of topic modeling via matrix factorization. Expert Systems with Applications, 91:159–169, 2018.

[BN14] Korinna Bade and Andreas Nürnberger. Hierarchical constraints - providing structural bias for hierarchical clustering. Machine Learning, 94(3):371–399, 2014.

[BNS+10] Mírian Bruckschen, Caio Northfleet, DM Silva, Paulo Bridi, Roger Granada, Renata Vieira, Prasad Rao, and Tomas Sander. Named entity recognition in the legal domain for ontology population. In 3rd Workshop on Semantic Processing of Legal Texts (SPLeT 2010), page 16, 2010.

[CHS04] Philipp Cimiano, Andreas Hotho, and Steffen Staab. Clustering concept hierarchies from text. In Proceedings of the Conference on Lexical Resources and Evaluation (LREC), pages 1721–1724, 2004.

[DKB08] Peter Derleder, Kai-Oliver Knops, and Heinz Georg Bamberger. Handbuch zum deutschen und europäischen Bankrecht. Springer Science & Business Media, 2008.

[Fis87] Douglas H Fisher. Knowledge acquisition via incremental conceptual clustering. Machine learning, 2(2):139–172, 1987.

[FMPT10] Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia. Integrating a bottom-up and top-down methodology for building semantic resources for the multilingual legal domain. In Semantic Processing of Legal Texts, pages 95–121. Springer, 2010.

[PARR11] Karteeka Pavan, Allam Appa Rao, and A V Rao. An automatic clustering technique for optimal clusters. abs/1109.1068:133–144, 09 2011.

[FSE11] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the conference on empirical methods in natural language processing, pages 1535–1545. Association for Computational Linguistics, 2011.

[GA10] Korhan Günel and Rıfat Aşlıyan. Extracting learning concepts from educational texts in intelligent tutoring systems automatically. Expert Systems with Applications: An International Journal, 37(7):5017–5022, 2010.

[GF14] Marian George and Christian Floerkemeier. Recognizing products: A per-exemplar multilabel image classification approach. In European Conference on Computer Vision, pages 440–455. Springer, 2014.

[HA85] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.

[HBDB+07] Rinke Hoekstra, Joost Breuker, Marcello Di Bello, Alexander Boer, et al. The lkif core ontology of basic legal concepts. LOAIT, 321:43–63, 2007.

[Hea92] Marti A Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics-Volume 2, pages 539–545. Association for Computational Linguistics, 1992.

[Hul03] Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 216–223. Association for Computational Linguistics, 2003.

[KTH06] Huang-Cheng Kuo, Tsung-Han Tsai, and Jen-Peng Huang. Building a concept hierarchy by hierarchical clustering with join/merge decision. In Proceedings of the 9th Joint Conference on Information Sciences, JCIS 2006, volume 2006, 01 2006.

[ROB17] Cecile Robin, James O'Neill, and Paul Buitelaar. Automatic taxonomy generation - A use-case in the legal domain. CoRR, abs/1710.01823, 2017.

[sBS12] Vi-sit Boonchom and Nuanwan Soonthornphisaj. Atob algorithm: an automatic ontology construction for thai legal sentences retrieval. Journal of Information Science, 38(1):37–51, 2012.

[SE09] Jorge M Santos and Mark Embrechts. On the use of the adjusted rand index as a metric for evaluating supervised classification. In International Conference on Artificial Neural Networks, pages 175–184. Springer, 2009.

[SG+07] Erich Schweighofer, Anton Geist, et al. Legal query expansion using ontologies and relevance feedback. In LOAIT, pages 149–160, 2007.

[SN11] Sebastian Stober and Andreas Nürnberger. An experimental comparison of similarity adaptation approaches. In International Workshop on Adaptive Multimedia Retrieval, pages 96–113. Springer, 2011.

[STT95] Anne Schiller, Simone Teufel, and Christine Thielen. Guidelines für das tagging deutscher textcorpora mit stts. Technical report, Universitäten Stuttgart und Tübingen, 1995.

[VC98] Pepijn R.S. Visser and Zhan Cui. Heterogeneous ontology structures for distributed architectures, 1998.

[VZ07] Fabio Vitali and Flavio Zeni. Towards a country-independent data format: the akoma ntoso experience. In Proceedings of the V legislative XML workshop, pages 67–86. Florence, Italy: European Press Academic Publishing, 2007.

[WBM18] Bernhard Waltl, Georg Bonczek, and Florian Matthes. Rule-based information extraction - advantages, limitations, and perspectives. Jusletter IT, 02 2018.

[MHAK16] C.J. MacLellan, E. Harpstead, V. Aleven, and K.R. Koedinger. Trestle: A model of concept formation in structured domains. Advances in Cognitive Systems, 4:131–150, 2016.

[WBVvS14] Radboud Winkels, Alexander Boer, Bart Vredebregt, and Alexander van Someren. Towards a legal recommender system. In JURIX, volume 271, pages 169–178, 2014.

[WLM16] Bernhard Waltl, Jörg Landthaler, and Florian Matthes. Differentiation and empirical analysis of reference types in legal documents. In JURIX, pages 211–214, 2016.

[WLW+15] Shuting Wang, Chen Liang, Zhaohui Wu, Kyle Williams, Bart Pursel, Benjamin Brautigam, Sherwyn Saul, Hannah Williams, Kyle Bowen, and Lee Giles. Concept hierarchy extraction from textbooks. In Proceedings of the 2015 ACM Symposium on Document Engineering, pages 147–156. ACM, 2015.

[WZH16] Minmei Wang, Bo Zhao, and Yihua Huang. Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications. In International Conference on Neural Information Processing, pages 120–128. Springer, 2016.

[ZK07] Paul Zhang and Lavanya Koppaka. Semantics-based legal citation network. In Proceedings of the 11th international conference on Artificial intelligence and law, pages 123–130. ACM, 2007.