Concept Hierarchy Extraction from Legal Literature

Sabine Wehnert, Otto von Guericke University, Magdeburg, Germany, sabine.wehnert@ovgu.de
David Broneske, Otto von Guericke University, Magdeburg, Germany, david.broneske@ovgu.de
Stefan Langer, Legal Horizon AG, Magdeburg, Germany, stefan.langer@legalhorizon.ag
Gunter Saake, Otto von Guericke University, Magdeburg, Germany, gunter.saake@ovgu.de

Abstract

Due to the ever-increasing amount of legal regulations, it has become of interest to scholars to find ways of capturing domain-relevant knowledge and of facilitating navigation in legal text corpora. Furthermore, the contextual nature of legislation requires enhanced semantic capabilities to identify relevant regulations for specific user needs. This work aims at collecting concept hierarchies from German literature in the legal domain, which are then integrated into a knowledge base with multiple clusters, allowing for different perspectives and efficient lookups. Having references to regulations in the leaves of the concept tree and higher levels with an increasingly abstract context, the resulting hierarchies provide the basis for creating legal domain knowledge in German law. Starting with rule-based annotation, we cluster extracted references, given their context features derived from tables of contents and reasons for citing from various textbook formats. We study the expressiveness of the obtained reference context features. Since different authors have their own notion of hierarchy given by the table of contents, we propose a heterogeneous lightweight ontology allowing for the coexistence of similar, yet diverse concept hierarchies to dynamically determine the best fit for a user in a semi-supervised setting. This approach is novel, since state-of-the-art ontologies are conventionally modeled under full integration and in a top-down manner, often not accounting for perspectives in knowledge representation.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Nowadays, enterprises as well as lawyers are facing the challenge of keeping track of an overwhelming number of legal texts from different jurisdictions. Yet, it is their obligation to ensure compliance, so that often manual efforts are made to monitor changes in law. On the other hand, this means that new developments need to be integrated into already existing knowledge: if a law is amended and impacts other regulations which are used in a specific scenario, the knowledge needs to be adapted accordingly. There is a need for context-sensitive search and a grouping method which ensures that all relevant documents are retrieved for a specific situation. The natural language processing (NLP) community has made many advances, such as building citation networks [ZK07, WLM16]. Surprisingly, there are few works addressing the extraction of legal concept hierarchies based on implicit semantic relations between legal texts. We define implicit semantic relations as relationships among legal texts which only apply in specific contexts, so that they are not coded as explicit citations within generally applicable regulations. For example, depending on their expertise (i.e., knowledge about implicit semantic relations), lawyers can use their background to identify connected laws which are important for a specific case.

In this paper, we propose a method to extract information from a large number of textbooks. It can be used to identify contextually relevant texts based on their mentions within literature, providing evidence of a semantic relationship between legal texts depending on their closeness within the resulting concept hierarchy. This form of domain knowledge is modeled in a bottom-up manner, using the references to legal texts in the literature as instances in the bottom levels of the concept hierarchy. Above them, descriptive context representations are desired for each respective regulation, which we refer to as reasons for citing. These representations and relationships can be modeled according to the desired expressiveness of the resulting ontology.

Winkels et al. show that reasons for citing can be extracted from the sentence referring to the respective regulation, and narrow them down to four relationship categories: selection, application, concluding (denying) and a category for "in relation to" [WBVvS14]. Zhang and Koppaka link relevant legal texts based on reasons for citing and let experts assess their contextual quality [ZK07]. There are works addressing legal text linking based on the information given therein [FMPT10, BDCG+15]. These approaches use explicit citations from within the document itself or its metadata. We choose to use external knowledge from literature to find relationships which cannot be directly detected within these documents. For this, we model relationships among legal texts in a concept hierarchy, founded upon the spatial co-occurrence of their mentions in legal literature.

Our approach is therefore a step in a new direction of legal informatics, because we consider legal literature as a source of concept hierarchies to build domain knowledge. We base our method on the assumption that a (sub)chapter headline corresponds approximately to the concept described in the section. Furthermore, the cited legal texts in each passage are seen as semantically related to the discussed concept of the respective section. While this assumption does not always hold, especially in cases where authors use creative titles, our studied literature contains descriptive concepts in most section headings.

For the scope of this paper, we establish a connection between legal documents which co-occur in the same chapter, part, section or lower-level subsection. By means of a concept hierarchy, we are able to identify closely related legal texts in the lower parts, as well as those which have a higher distance, given only one common concept on a high abstraction level. A limitation of this approach is that we extract and maintain explicit keywords forming a concept. Hence, we do not integrate them into a common understanding of standardized concepts, as can be encountered in standard ontologies. Having legal textbooks of many different formats and authors as data sources, we expect many contradictions to occur during an attempt to establish mapping rules for a standard axiomatic ontology. Therefore, we follow a different notion of knowledge representation.

Similar to the process of studying law, we aim for a diversity of perspectives within our system, which are chosen depending on the context. Specifically, we are interested in the effects of letting a concept hierarchy remain in its original structure, derived from the table of contents (TOC), and coexist among other similar concept hierarchies belonging to the same cluster. In this work, we show how such an approach can model the contextual application of regulations and how it is able to adapt to user-given feedback. Thus, the contribution of this work is a combination of the following techniques:

• We apply rules to annotate elements in a textbook.
• We access DBpedia knowledge for named entity resolution.
• We form concept hierarchies and evaluate their components.
• We group concept hierarchies with nominal clustering.
• We discuss the use of heterogeneous lightweight ontology clusters for legal texts.

The remainder is structured as follows: Section 2 contains related work regarding concept hierarchy extraction, lightweight ontologies and the formation of clusters. Since our approach is derived from observations of research gaps for our specific use case, we provide a justification of our methods alongside. In Section 3, we describe our method of extracting concept hierarchies from legal literature and the subsequent steps of constructing the domain knowledge. We discuss experimental results in Section 4. Finally, we conclude our findings and unveil future research potential.

2 Related Work

We introduce three main aspects regarding our aim of capturing and applying knowledge from textbooks. The concept hierarchy is derived from the inherent structure of a piece of literature. In this section, we first name some alternative approaches to extract concept hierarchies. Second, we provide the background for the formation of our knowledge base, which is derived from a heterogeneous ontology. Third, we briefly outline a clustering method because it provides some further optimization options to control the cluster formation of a heterogeneous ontology.

2.1 Concept Hierarchy Extraction

Concept hierarchies are a means for representing knowledge in a hierarchical manner, having nodes of increasing abstraction per level and things as instances in the leaves of the tree. We intend to represent links between legal texts by shared concepts: the higher a linking node between two instances is located in the concept hierarchy, the more distant the two documents are. There are several approaches for extracting concept hierarchies from unstructured text. Among them, we find rules to detect hyponymy relations based on Hearst patterns [Hea92], for example to represent legal vocabularies. Eigenvector decomposition is another method for identifying term taxonomies [BDMP06]. Those patterns, however, are not applicable to the use case of linking legal texts. Lexical hyponymy relations are not suitable for references modeled as instances of the concept hierarchy tree, since the subsumption relation is not based on the vocabulary, but on semantic relatedness gained from textbooks. Kuo et al. [KTH06] propose hierarchical clustering to build concept hierarchies, while the extraction of noun groups is also a valid approach [ROB17].

We examine methods of noun group extraction combined with hierarchical clustering further, and propose a combination of them for concept hierarchy extraction from literature. This approach is based on the assumption that an author captures the topic of a section within its title. In the highest levels of abstraction within our concept hierarchy, we gather elements from the tables of contents (TOC) within literature. Finally, we obtain a coarse- to fine-grained clustering of regulations based on the understanding of the corresponding author, while we assume that the reasons for citing in particular are relevant features justifying the cluster membership of a regulation.

Similar to this work, Günel and Aşlıyan [GA10] describe how to extract concepts from tutoring material in TeX format using domain relevance, entropy and lexical cohesion as inclusion criteria. Wang et al. extract concept hierarchies from textbooks by the TOC and Wikipedia [WLW+15]. We also use the TOC to find local relatedness of regulations given the section title, and Wikipedia for named entity resolution. Robin et al. compare two approaches for legal concept hierarchy extraction: hierarchical clustering and the extraction of topical expressions composed of noun groups [ROB17]. Bruckschen et al. populate a legal ontology based on named entity recognition [BNS+10]. In a related field, an approach using syntactic positions, called Formal Concept Analysis, is suggested by Cimiano et al. to extract concept hierarchies [CHS04]. Based on topic modeling, part-of-speech tags and tf-idf weighting, Anoop et al. [AAD16] suggest an unsupervised method for concept hierarchy extraction. A possible drawback of statistical topic modeling methods is the instability of retrieved topics and their keywords if the process is repeated on the same data. Belford et al. propose a method relying on matrix factorization to increase the stability and accuracy of topic models [BMNG18].

In contrast to these implementations, we use a rule-based approach to extract information. Legal applications can benefit from the control over data quality that a system designer has while using rule-based approaches, without compromising on the amount of data. Although authors occasionally deviate from the pattern by incorporating creative headings for didactic purposes, we find very few of these cases in our collection of legal literature. We show the results of our approach in Section 4.

2.2 Heterogeneous Legal Ontology

Besides some variation in the style format among the pieces of literature, another major challenge arises from the obtained concept hierarchies themselves: initially, we obtain standalone hierarchies from each book, and the difference among them is unknown. However, topical overlaps are possible for diversified literature, thus posing a challenge in integrating all concept hierarchies in a non-contradicting manner. Instead, we capture the contextual character of legal texts. Following the notion of hierarchical ontology clusters proposed in [VC98], we develop the idea of allowing multiple concept hierarchies to coexist without integrating them. Conventionally, one common language and understanding is desired for system architectures whose components access the same domain knowledge. Despite these advantages, for our application such an ontology requires high maintenance efforts resulting from frequent insertions of further knowledge, either by automatically determining valid mappings or by checking for logically matching candidates.

In the legal domain, a common requirement is to ensure that all relevant documents are retrieved, thus we optimize for a high recall. This is however challenging when working with natural language, for example when encountering cases of ambiguity, near-synonyms and polysemy. We therefore argue that concepts in legal literature may differ even for equal topics, which is due to the different perspectives of the authors and their own interpretation. However, any human regularly overcomes these inconsistencies and ambiguities by either choosing one concept for a narrow but consistent understanding, or by broadening the scope and encompassing multiple sources to avoid omissions of important items, while accessing the most appropriate fit based on a contextual decision criterion. This criterion can be derived from user-provided feedback, for example by marking a document as irrelevant. Then, the concept hierarchy which most likely captures the user need is selected, based on the recomputation of relevance.

Since our intended knowledge base is built in a bottom-up manner, this work is different from axiomatic ontologies. There are legal ontologies available such as ALLOT [BDIPV13] or LKIF [HBDB+07], which are able to encompass multiple legal data sources, but also require alignment of the respective classes. These ontologies are built upon a document standard called Akoma Ntoso [VZ07] and offer many ways of standardized information modeling on the document level and beyond. For our specific use case, we identify two possibilities to achieve our goal: either an expert maintains contextual information regarding specific applications of laws together in such a standardized ontology (for instance, by using the contextual ontology language C-OWL [BGvH+03]), or there is a system for legal literature covering different scenarios, user categories and jurisdictions, ideally resulting in a complete collection of all regulations needed for a case. Several bottom-up lightweight ontologies for legislative terms and entities exist [BGBI16, ABC+16]. Our knowledge representation differs from these works substantially in terms of the application scenario and extraction method. To the best of our knowledge, there is no approach for the same use case within the legal domain allowing for a fair comparison with our work.

2.3 Concept Hierarchy Clusters

Given a large collection of textbooks, we apply clustering to increase contextuality and to reduce the search space for finding the most applicable concept hierarchy for a context. As a result, many references from different concept hierarchies are merged together. In order to structure the cluster, the distance information given by a hierarchical clustering algorithm can be exploited. For user-centered applications, a semi-supervised clustering method has been proposed by Bade and Nürnberger [BN14]. They introduce must-link-before constraints for clustering algorithms, which can be applied to hierarchical agglomerative clustering. Those constraints identify instances to be linked and those which shall remain separate. Different from other works, this method also provides the means to model the hierarchical order of instances without requiring the exact level difference to be defined. As a use case for an enforced hierarchy, consider a scenario where a distance between European and national law is desired. After including must-link-before constraints, instances from the specified category are located closer to the reference instance than those which are forced to link on a higher node of the concept tree. The algorithm we use in the scope of this work allows for must-link and cannot-link constraints by defining a relationship between two features [MHAK16]. Due to space limitations, we leave the examination of constraint effects for future work and implement the clustering algorithm without constraints.

3 Concept Extraction for Heterogeneous Ontologies

Following relevant literature and the justification of our method, we outline our approach for building a heterogeneous ontology. In particular, we describe the process of annotating features in textbooks to obtain a contextual representation of the reference by means of concept hierarchy clusters. Figure 1 depicts the workflow.

1. An electronic literature resource is converted into a txt file.

2. The text is preprocessed by performing tokenization, sentence chunking, orthographic coreference resolution, part-of-speech tagging, Roman numeral identification and named entity resolution using web knowledge from DBpedia.

3. Rule-based annotation is applied to match TOC components (Chapter, Part, Subchapter, Subsubchapter), CS components (regulation name REG, DBpedia concept DBp, reason for citing RFC, relationship REL) and references REF.

4. All annotations are extracted into a csv file, resulting in a table of tokens T with their respective annotation features.

5. The file is treated as a lookup table and for each TOC component, boundaries are determined.

6. All references are matched in document order to each TOC component with respect to the different section boundaries. Also, the CS information is retrieved from an extracted annotation file and assigned to the REF.

7. After the feature information has been detected, a flat representation of the concept hierarchy is stored, with one REF instance per line and its TOC and CS feature information.

8. The instances are clustered, using their nominal features.

9. We included a possible use case, where a user searches for context information of a regulation REF1. Here, for REF1, cluster context descriptors C1 and C5 are retrieved. The user decides for C1 and receives references linked to REF1 depending on the data contained in the respective concept hierarchy cluster.

10. A feedback mechanism can be implemented to narrow down relevant references. Different from our idea, Boonchom and Soonthornphisaj use term frequency-based ontology seeds for a legal ontology search task [sBS12]. A similar approach for query expansion using a hierarchical legal knowledge base is by Schweighofer et al. [SG+07]. Yet, their relevance feedback is based on the preferences of other users, unlike our approach focusing only on content.

[Figure 1: Workflow towards a lightweight heterogeneous ontology, used in a query expansion setting. The diagram shows the ten numbered steps above, from (1) Book through Preprocessing, Annotation, Extraction, Lookup, Matching, composing the flat concept hierarchy, clustering the concept hierarchy instances and querying the knowledge base, to (10) Feedback.]

Selected process steps to obtain the knowledge base are described in more detail in the following. We share more implementation details and program code on GitHub.¹

¹ https://github.com/anybass/HONto

3.1 Annotation

Since digital literature is conventionally available in PDF format, making use of formatting information similar to the approach of Günel and Aşlıyan on corresponding TeX files can be cumbersome [GA10]. Alternatively, we convert the PDFs into txt files to speed up subsequent preprocessing steps. We use GATE, a widely adopted framework for text processing, to preprocess the text, and JAPE grammar rules² to annotate the concept hierarchy elements. For example, based on the pattern of a book publisher for a TOC, we specify matching criteria including orthographic information, Roman numerals and part-of-speech tags³. The patterns for reasons for citing are described in Equation (1) and for the respective relationship in Equation (2). There is a trade-off between statistical and rule-based approaches: the former is faster to implement but less accurate, the latter is slow to implement but more accurate. Waltl et al. emphasize the effectiveness of rule-based information extraction due to explicitly applied domain knowledge and suggest this approach as an alternative to machine learning algorithms, since the latter often require a sufficient quality of training data [WBM18]. Regarding the annotation of several elements within a textbook, we define rules suited for the respective elements which we consider as expressive features. We proceed with a description of these rules for TOCs, reasons for citing and regulations.

² https://gate.ac.uk/sale/tao/splitch8.html
³ We use the German german-hgc.tagger from the Stanford parser https://nlp.stanford.edu/software/tagger.shtml

3.1.1 Table of Contents (TOC)

Depending on the publisher, a table of contents manifests itself in various styles. From numeral-only versions to mixed alphabetic, Roman numeral and numeric variations, we define separate rules to capture each distinct heading element including its level in the context of the table of contents. Despite the efforts in rule definition, there are not many substantial variations within each publishing style, so that minor inconsistencies may be captured by generalization from seen examples. Waltl et al. combine the advantages of rule-based approaches with those of machine learning techniques because domain knowledge can be directly incorporated into the training phase to obtain more control over results [WBM18]. However, training an annotation classifier is out of the scope of this work and remains a potential future optimization task. After annotation, we export the TOC features. Based on the detected elements, we determine the boundaries for each level of the TOC hierarchy to store the respective references contained per part, subchapter and subsubchapter.

3.1.2 Reasons for Citing (RFC) and Relationships (REL)

Each sentence with a reference to a legal text potentially contains information about the rationale of this citation, which serves as a contextual summary. We divide the citation summary CS into the regulation name REG, the reason for citing RFC (following the notion of an entity) and its relationship REL with the regulation, captured by verb forms. Extracting the CS serves as feature information for a clustering algorithm. Another application is in connection with a reasoner based on the abstract relationships. Similar to the approach of Winkels et al. [WBVvS14], a model of relationships among legal texts can be derived from textbooks and then be incorporated into the concept hierarchy. In addition, reasons for citing RFC can be considered for the user of a (content-based) legal recommender system as an explanatory component, to be displayed alongside the reference as a context descriptor. We find several pattern varieties proposed for keyphrase extraction and consider them for the RFC [WZH16, Hul03]. While the respective authors analyze English and also capture adjective groups in addition to noun groups, there are more distinctions available for part-of-speech tags in German. Since including all adjective groups results in a larger number of distinct nominal features, we limit the pattern to minor sequence variations allowing for attributive adjectives. In our use case, we define the following expression to capture the RFC:

$$\mathrm{RFC} = (\mathrm{NN} \mid \mathrm{NNS} \mid \mathrm{NNP} \mid \mathrm{NNPS} \mid \mathrm{NE} \mid (\mathrm{NN}\ (\mathrm{ADJA} \mid \mathrm{NN})^{*}\ \mathrm{NN}))^{+} \tag{1}$$

Due to space limitations, this pattern is a simplified version of the actual one, here only listing candidate part-of-speech (POS) tags using the STTS tagset [STT95]. Our rules account for a variety of possible sentence structures in German natural language. Those patterns which are formulated by using the more expressive JAPE rule syntax are defined with priorities, so that the most restrictive rule is applied first. Likewise, there are patterns for relationship extraction examined by multiple authors [FSE11]. We adapted them to the German language and added negation tags with

$$\mathrm{REL} = (\mathrm{PTKNEG} \mid \mathrm{V\text{-}INF} \mid \mathrm{V\text{-}PP} \mid \mathrm{V\text{-}FIN})^{+} \tag{2}$$

as the simplified relationship pattern REL. In the verb categories, we subsume the tags using a hyphen; for example, V-INF is a placeholder for VAINF, VVINF and VMINF, which are originally output by the Stanford parser. The relationship feature of the annotation in this case is formed as a concatenation of REL matches within a sentence containing RFC. We adjust the matching rule regarding specific word patterns for important indicators which cannot be generalized with part-of-speech information, namely strings indicating contradictions (e.g., in German "Widerspruch") or selections (e.g., in German "Beispiel"). Also, if there is a syntactic indication of a legal term definition (e.g., in German "nach" or "gemäß") within a law, we fill undetected REL fields with an is-relationship (in German: "ist"). Furthermore, we clean the matches by parsing out non-descriptive strings for a relationship between a reference and its reason for citing (e.g., in German "denke"). This consequently results in sparse relationship features, since the above rules are both specified within sentence boundaries. While we assume that a sentence citing a regulation contains RFC and REL patterns, this is not always the case. For the subsequent steps, we only consider those regulations containing RFC, and optionally REL. Any annotated regulation contained in the document where RFC is missing may not hold enough context information to determine its applicability for the context. Despite this limitation, it shall not have severe consequences in case of a sufficiently large heterogeneous ontology, since other extracted concept hierarchies for the same context shall cover possible gaps due to the highly regularized nature of legislation.

3.1.3 Regulations (REG, REF)

Many scholars have examined methods to extract regulations from unstructured text [WLM16], often to create a citation network based on the references within the original regulation text [WBVvS14]. While machine learning approaches currently remain popular, rule-based methods achieve high precision and recall as well, which is due to the highly regularized pattern of regulation citation. In German law, there are fixed citation guidelines. Therefore, a sufficiently high proportion of citations can be detected with rules, with precision and recall in the range from 80% to 90% [WLM16]. In addition, legal language contains term definitions, which are implicitly referenced by other laws [WLM16]. Those term definitions can be extracted with rules and stored in a lookup dictionary. Although it is out of the scope of this work, we plan to analyze and enrich regulations with legal term definitions, to be found in other regulations, to gain more context information from the knowledge provided in the data source itself. We considered corner cases in reference citations, thus aiming for an improvement of the already high regulation coverage. These corner cases include references containing more than two regulations from different sources, and occurrences of connection indicators, in German abbreviated as "i. V. m.". These annotations shall contribute to a rich knowledge base.

3.1.4 Access Web Knowledge (DBp)

Wang et al. suggest in their approach to apply web knowledge for identifying concept candidates [WLW+15]. We access Wikipedia-based linked open data through the DBpedia Spotlight⁴ plugin for GATE⁵. Unlike their method, we intend the knowledge base to perform named entity resolution directly on the citation summary. If a DBpedia entry exists in the sentence containing a reference, we split the URI to obtain the concept name as a nominal feature. We observe that most matches occur for the regulation or the RFC tokens. There is one frequent misclassification regarding the German Civil Code (BGB), where the DBpedia lookup yields a Swiss political party instead of the civil code, which we manually corrected before composing the concept hierarchy. After having annotated the nine feature types (Chapter, Part, Subchapter, Subsubchapter, REG, DBp, RFC, REL, REF), we export them from GATE and build the concept hierarchy.

⁴ https://www.dbpedia-spotlight.org/
⁵ http://www.semanticsoftware.info/lodtagger

3.2 Compose Concept Hierarchies

Figure 2 shows how we compose and evaluate the concept hierarchy. In this example, there are two simplified concept hierarchies, which are obtained from the JAPE rule-based annotations. In the fictive CS node, we summarize the features REG, DBp, RFC and REL for space reasons; however, they are all standalone features. Each element has mandatory values for the Chapter, RFC and Reference. The other fields are optional because we do not assert that the rules return values for each feature.

Given the illustrated concept hierarchy in Figure 2, we evaluate the results by setting the Chapter as a class label, thus expecting a reproduction of the structure of a chapter, and by not including it in the features to be processed. As indicated by the arrows, the test data can match the learned examples by comparison of the subfeatures, and early merges are an indicator for higher similarity between two instances. A possible limitation of this approach comes from the reliance on explicitly stated information. For instance, if the RFC are not indicated within the reference sentence or if they are extracted incorrectly, this can decrease the expressiveness of the features for the desired structure. Since the resulting concept hierarchy depends on the author of the book, the author's perspective may not be suitable for every user. Therefore, we see a possible remedy in the notion of concept hierarchy clusters, forming a heterogeneous lightweight ontology.

[Figure 2: Structure and Evaluation of a Concept Hierarchy. Legend: B: Book, C: Chapter, P: Part, S: Subchapter, SS: Subsubchapter, CS: Citation Summary (REG, DBp, RFC, REL), §: reference. The figure distinguishes training data from test data and marks unknown connections and memberships inferred from a feature.]

3.2.1 Concept Hierarchy Clusters

Extracting a narrow concept hierarchy with only nominal features leads to a lower probability of getting all relevant references for a specific information need. Consider the following example: while one book may focus on the aspects of national law, another depicts European legislation. In reality, this information needs to be considered as a whole, since European legislation supersedes national law.

Recalling the discussion from Section 2.2, we show how exactly a heterogeneous ontology can serve a user who is interested in complete, reliable and founded information. Aside from our experiment of matching extracted instances with Chapter labels, an actual application of this method is to classify for Relevance instead. Figure 3 illustrates how a heterogeneous ontology in legal contexts may emerge. In the setting of a recommender system, suppose there is a cluster containing two concept hierarchies with sets of instances (1, 5, 8) and (1, 2, 4, 8) respectively. In the first scenario, depicted on the left-hand side, the recommender system receives positive user feedback regarding instance 1. Since this instance is present in the current context, which is narrower than the other concept hierarchy, the context is not altered. In contrast, a similarity function (A) receives negative feedback for instance 5 in the second scenario, thus resulting in a context switch to the other concept hierarchy without instance 5. There are several approaches for similarity adaptation, as investigated by Stober and Nürnberger in [SN11]. In addition, the heterogeneous ontology can also be used for query expansion, as previously pointed out regarding Figure 1.

We find that for a legal recommender system, heterogeneous ontologies, defined in this work as clusters of concept hierarchies acquired from suitable literature, can indeed fulfill the following desirable functions:

1. They group semantically related concept hierarchies.

2. Their clusters allow for efficient lookups, instead of querying the whole ontology.

3. They are sensitive towards user feedback.

4. They are as relevant as possible by applying the narrowest context given user feedback constraints.

We conducted some experiments with subsets from the 78 documents (subchapters from three fixed chapters); the results are shown in Section 4.

4 Results

To show the effect of adding knowledge to the heterogeneous lightweight ontology, we evaluate the annotation and perform two experiments. The first experiment applies COBWEB clustering on the features, without knowing the Chapter class label. The second approach is a classifier for the same features; this time we use the COBWEB tree.

[...] hierarchical tree algorithm, which learns incrementally from new instances, given four options of incorporating them (creating a new child node, adding to an existing child node, merging two similar child nodes and incorporating the newest instance therein, and splitting a node, so that it becomes a child of the current node) [MHAK16]. We visualize our results by using the Python library concept_formation⁶ by MacLellan et al. [MHAK16]. Instead of incorporating several books, we evaluate this method with respect to the most high-level concepts (i.e., chapter titles) of one comprehensive book. In particular, we used chapters (1), (4) and (8) from Derleder et al. because they were perceived as topically related, while still treating different concepts [DKB08]. For a rich heterogeneous ontology, multiple books need to be taken into account, among which several topical overlaps shall occur to compensate for losses from the extraction process or a different focus of an author. In case of significant overlaps, two concept hierarchies shall be merged.

4.2 Evaluation Measures

Regarding the annotation success, we determine the effectiveness of context feature extraction by computing the average coverage of references REF by RFC annotations. Basically, if a sentence contains a pattern which can be detected by our JAPE rules, there will be an RFC annotation. Since we only considered those regulations whose context features (especially RFC)
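The distance notion of Section 2.1 (the higher the linking node, the more distant the documents) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every reference is stored together with the path of TOC headings leading to the section that cites it; the function names `shared_depth` and `tree_distance` and the example paths are hypothetical and not part of the published implementation.

```python
from typing import Sequence


def shared_depth(path_a: Sequence[str], path_b: Sequence[str]) -> int:
    """Count the leading TOC levels that two section paths have in common."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def tree_distance(path_a: Sequence[str], path_b: Sequence[str]) -> int:
    """Number of edges between two sections via their lowest common ancestor."""
    common = shared_depth(path_a, path_b)
    return (len(path_a) - common) + (len(path_b) - common)


# Hypothetical TOC paths (chapter, subchapter, subsubchapter) of citing sections.
ref1 = ("1", "1.1", "1.1.1")
ref2 = ("1", "1.1", "1.1.2")  # shares chapter and subchapter with ref1
ref3 = ("1", "1.2")           # shares only the chapter with ref1
ref4 = ("2", "2.1")           # shares only the book root with ref1

for other in (ref2, ref3, ref4):
    print(other, tree_distance(ref1, other))
```

In this toy setting, references cited in sibling subsubchapters are closer than references that only share a chapter, which in turn are closer than references that only share the book root.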
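Steps 5 to 7 of the workflow (determining TOC boundaries, matching references in document order and storing a flat representation) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the annotation export is reduced to hypothetical (offset, level, label) tuples, the level order Chapter, Part, Subchapter, Subsubchapter is taken from the feature list in this paper, and the helper name `flatten` is invented for this example.

```python
LEVELS = ["Chapter", "Part", "Subchapter", "Subsubchapter"]

# Hypothetical annotation export: token offset, TOC level, heading label.
toc = [
    (0, "Chapter", "1"),
    (10, "Subchapter", "1.1"),
    (40, "Subchapter", "1.2"),
    (70, "Chapter", "2"),
]
# Hypothetical reference annotations: token offset, reference label.
refs = [(15, "REF1"), (45, "REF2"), (80, "REF3")]


def flatten(toc, refs):
    """Assign each reference to the TOC headings that are open at its offset."""
    # Sort headings and references into one stream in document order;
    # the second tuple element places a heading before a reference at the same offset.
    events = sorted(
        [(off, 0, level, label) for off, level, label in toc]
        + [(off, 1, "REF", label) for off, label in refs]
    )
    current = {}  # currently open heading per level
    rows = []     # flat representation: one row per reference
    for _, _, kind, label in events:
        if kind == "REF":
            rows.append({**current, "REF": label})
        else:
            # A new heading closes all deeper levels (section boundary).
            current[kind] = label
            for deeper in LEVELS[LEVELS.index(kind) + 1:]:
                current.pop(deeper, None)
    return rows


for row in flatten(toc, refs):
    print(row)
```

Each resulting row corresponds to one REF instance per line with its TOC features; the CS features (REG, DBp, RFC, REL) would be joined to the rows in the same way and are omitted here for brevity.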
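Splitting a DBpedia URI into a concept name, as described in Section 3.1.4, might look like the following Python sketch. The helper name `concept_name` and the example URIs are illustrative assumptions; the exact splitting used in the published code may differ.

```python
from urllib.parse import unquote, urlsplit


def concept_name(uri: str) -> str:
    """Take the last path segment of a resource URI as a readable concept name."""
    last_segment = urlsplit(uri).path.rsplit("/", 1)[-1]
    # Decode percent-encoded characters and replace underscores by spaces.
    return unquote(last_segment).replace("_", " ")


# Illustrative URIs in the usual DBpedia resource layout.
print(concept_name("http://de.dbpedia.org/resource/B%C3%BCrgerliches_Gesetzbuch"))
print(concept_name("http://dbpedia.org/resource/Tenancy"))
```

A manual correction such as the one for the BGB could then be applied as a simple lookup on the returned name before the value is stored as the nominal DBp feature.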
Before we present their could be retrieved, this evaluation is important to un- results, we describe the experiment setting and evalu- derstand how many data points were the basis for the ation measures. subsequent steps of clustering and classification. Our evaluation measure for the supervised cluster- 4.1 Evaluation Setup ing experiment is the Adjusted Rand Index (ARI), originally proposed by Hubert and Arabie [HA85]. It The aim of this evaluation is to determine the expres- quantifies the overlap between two partitioning ap- siveness of our selected features to distinguish between proaches, in our case, we compare the COBWEB clus- abstract concepts. In this work, we intend to show tering and the class labels (i.e., textbook chapters). Its the feasibility of our proposed knowledge extraction expected value 0 indicates a random clustering, while and representation method. Therefore, we create clus- a value close to 1 corresponds to a high agreement ters of semantically similar concept hierarchies by us- ing the COBWEB algorithm [Fis87]. It is a recursive 6 https://github.com/cmaclell/concept formation 1 = relevant 5 = irrelevant A A current context current context 1 5 8 1 2 4 8 1 5 8 1 2 4 8 Figure 3: Incorporating user feedback in a cluster of concept hierarchies with an adaptation function (A) between the resulting clustering and class label parti- we define a JAPE rule and annotate the text based tions. Santos and Embrechts suggest using the ARI on a pattern that is able to detect several citation for- for supervised multilabel classification evaluation due mats: to its ability to measure the relationship of two el- German law: § 676 a Abs. 1 Satz 1 BGB ements instead of the correct class label assignment [SE09]. While we only use one book, we expect an German law: Art. 1 und 2 Abs. 1 GG ARI above 0.5 because each chapter contains unique European law: 2000 / 46 / EG themes and possible overlaps in cited regulations REG. 
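The citation formats listed above can be approximated outside GATE with ordinary regular expressions. The following is a minimal Python sketch and not the authors' JAPE grammar, which is not reproduced in the text; the pattern names, the helper `find_citations` and the accepted directive suffixes (EG, EWG, EU) are assumptions made for illustration.

```python
import re

# Hypothetical approximation of the citation pattern; the real JAPE rule may differ.
GERMAN_CITATION = re.compile(
    r"""
    (?P<unit>§§?|Art\.)\s*                      # section sign(s) or "Art."
    (?P<number>\d+(?:\s?[a-z]\b)?)              # "676", "676 a", "1"
    (?:\s+(?:und|bis)\s+\d+(?:\s?[a-z]\b)?)*    # further numbers, e.g. "und 2"
    (?:\s+Abs\.\s*\d+)?                         # optional paragraph ("Abs. 1")
    (?:\s+Satz\s*\d+)?                          # optional sentence ("Satz 1")
    \s+(?P<code>[A-ZÄÖÜ][A-Za-zÄÖÜäöü]*[A-ZÄÖÜ])\b  # law abbreviation, e.g. BGB, GG
    """,
    re.VERBOSE,
)

# Assumed suffixes for European acts; only "EG" appears in the text.
EU_DIRECTIVE = re.compile(
    r"(?P<year>\d{4})\s*/\s*(?P<number>\d+)\s*/\s*(?P<body>EG|EWG|EU)\b"
)


def find_citations(sentence):
    """Return (kind, matched_text) pairs for every citation found in a sentence."""
    hits = [("german", m.group(0)) for m in GERMAN_CITATION.finditer(sentence)]
    hits += [("european", m.group(0)) for m in EU_DIRECTIVE.finditer(sentence)]
    return hits


if __name__ == "__main__":
    for example in (
        "§ 676 a Abs. 1 Satz 1 BGB",
        "Art. 1 und 2 Abs. 1 GG",
        "2000 / 46 / EG",
    ):
        print(example, "->", find_citations(example))
```

Two separate patterns keep the German and the European format independent, so that corner cases such as several regulations in one sentence simply yield several matches.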
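Section 4.2 uses the Adjusted Rand Index [HA85] to compare the COBWEB clustering with the chapter labels. As a rough illustration of how the index is obtained from the contingency counts of two partitions, the sketch below implements the Hubert and Arabie formula with the Python standard library only. The function name and the toy labels are assumptions; the labels are not data from the experiments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index of two labelings of the same instances."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same instances")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two instances: partitions agree trivially

    # Pairs of instances that share both the class and the cluster.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs that share the class, and pairs that share the cluster.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single class and a single cluster
    return (sum_cells - expected) / (maximum - expected)


if __name__ == "__main__":
    # Toy example: chapter labels B, K, E versus arbitrary cluster ids.
    chapters = ["B", "B", "B", "K", "K", "K", "E", "E", "E"]
    clusters = [0, 0, 1, 1, 1, 1, 2, 2, 2]
    print(adjusted_rand_index(chapters, clusters))
```

Because the index is corrected for chance, renaming the cluster ids does not change the result, which is why no mapping between clusters and chapters is needed.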
Having heterogeneous ontology clusters, an automatic merging criterion can be applied to achieve clusters of topically related concept hierarchies. Based on the ARI, this merging criterion has been implemented by Pavan et al. to extend k-means clustering [PARR11].

For the classification task, we use average values of precision and recall. Calculating average recall is rather unconventional [GF14]; however, optimizing for a high recall is crucial in the legal domain. Those two measures quantify how well the COBWEB tree is able to infer the correct class membership given the instance features, as shown in Figure 2. In particular, our average precision measures the percentage of correctly identified class members compared to all instances labeled as class members by the algorithm, averaged over the number of runs and all classes. The average recall in our case is defined as the fraction of correctly identified instances of a class compared to all that belong to the respective class, averaged over all runs and classes. Intuitively, a false positive recommendation of a regulation is not as severe as a false negative for the legal domain.

4.3 Evaluation of Annotation

We evaluate our annotation results regarding the number of detected references REF compared to the number of extracted RFC in the chapter, since we require the latter for concept formation. Spiegel-Rosing found for scientific texts descriptive RFC context in 80% of the sentences. We assume that in a German legal textbook, slightly fewer RFC will be detected, due to a different writing style (e.g., more complex syntax and longer sentences). Consequently, our aim for RFC annotation is set to 70% of REF occurrences. Therefore,

In Table 1, we list the number of reference annotations corresponding to the book chapters: (1) Bankvertragliche Grundlagen (English: Foundations of Banking Contracts), (4) Kapitalmarkt- und Auslandsgeschäfte (English: Capital Market and Foreign Transactions), (8) Europäisches Bankenrecht mit Länderabschnitten (English: European Banking Law by Country). Additionally, we indicate the number of RFC and the average percentage of detected RFC from all REF annotations per chapter. The numbers in the column header depict the document number, corresponding to the subchapters of the textbook. We find that almost 75% of the references have an annotation value for RFC. The restrictions we included in our pattern prevent us from extracting the chapter name as a REF, and despite some missing references and RFC due to long-range dependencies within the sentence or unwanted headline text insertions at page breaks, the noise in the text data (e.g., citations of other books in a reference-like format) did not affect the extraction substantially. Nevertheless, all subsequent steps depend on the annotation, so that a loss in this step propagates forward to the clustering and classification task.

Table 1: Evaluation of REF and RFC detection. From three chapters, we analyzed all subchapters.

Chapter (1)
Doc.    1    2    3    4    5    6    7    8    9   Avg. %
REF   197   40  196   47   41  107  568  131  250
RFC   170   30  168   37   31   83  385   74  160       72

Chapter (4)
Doc.   50   51    52   53   54   55   56   57   58   59   60   61   62   63   Avg. %
REF   211   82  1091  283  119   41   82  283  270  483  112  115  164  237
RFC   158   60   643  232   85   33   70  215  227  400   85   93  111  221       74

Chapter (8)
Doc.   72   73   74   75   76   77   78   Avg. %
REF    47   90  188   40   67   28  370
RFC    36   61  147   30   43   16  275       73

4.4 Evaluation of Heterogeneous Legal Ontology

We evaluate our results for the COBWEB clustering algorithm using the extracted Chapter feature as the ground truth class. With the remaining context information starting with the Part feature until the REF feature, the instances are supposed to be grouped by the COBWEB clustering algorithm. In order to show the effect of a successful extraction method, we restricted the instances only to those cases where a value could be retrieved for the Part feature, since this is the most abstract class. To have an equal class distribution, we downsampled the instances of other chapters to match the class with the fewest instances left. This has not been achieved with a random selection, but instead we selected a group of instances which were previously spatially close in the textbook. This has the advantage of not missing important context, as well as limiting the variance in nominal features. For a fair comparison, running the evaluation with different instance groups yielded mostly similar results; however, we observe that more variability leads to less similar examples and thus a lower ARI score.

For the first evaluation shown in Figure 4 with 2 principal components p, 3 Chapters and 1020 instances i of balanced classes, we obtain an Adjusted Rand Index (ARI) of 0.28. Each axis holds one principal component analysis (PCA) dimension to visualize a projection of the cluster shape. In line with our expectation, there are three clusters, while each cluster consists of two to three elliptical shapes. The chapter labels in Figure 4 indicate that the algorithm does not have enough information to distinguish between chapter (1) (labeled as B), chapter (4) (labeled as K) and chapter (8) (labeled as E). Many instances, particularly of chapters (4) and (8), are placed in the wrong cluster. From this, we conclude that despite having balanced classes, there may be topical overlaps among the concept hierarchies which shall either result in a merge or are lacking evidence for separate groups. If we allow for a slight class imbalance of the instances by increasing the number of chapter (1) and (4) instances by a comparable amount, to i=1149, the ARI increases to 0.64, as shown in Figure 5. This also led to a different cluster shape and a better discrimination between the three chapter classes. The improvement can be seen in the classes, where more labels correspond to the cluster membership. It indicates that the clustering approach found more agreement between clusters and the ground truth classes. That observation lets us conclude that additional examples can lead to a higher ARI if they only broaden the feature value space moderately. In previous experiments, we applied the algorithm to all extracted instances, leading to an ARI of 0.05, presumably because of the high variance of instances within a chapter and different chapter length.

Figure 4: COBWEB clustering with p=2, i=1020 and the ARI evaluation [HA85]

Figure 5: COBWEB clustering with p=2, i=1149 and the ARI evaluation [HA85]

Since this class imbalance will naturally occur in a heterogeneous ontology, we need to investigate further how the approach scales and what the limitations are regarding the feature diversity.

We perform a second experiment on the same data, but in the classification setting with a COBWEB tree with 10 runs r and 300 training instances num. The result of the classification algorithm is shown in Figures 6 and 7, including 95% confidence intervals for the average precision and recall values. In Figure 6, the confidence intervals span a range of 40 percentage points (pp), indicating an unstable classification result of 80% precision and 87% recall on average after 200 training examples. The effect of adding further examples is illustrated in Figure 7 and is similar to the previous experiment: it manifests in a gain in precision of about 10pp and a slight increase of 5pp in the average recall score. Please note that the range of the confidence interval is reduced to 20pp for recall and to 10pp for precision, which is a significant improvement of the classifier performance. In summary, the results for the COBWEB algorithm vary depending on the number of examples for each concept hierarchy. A recall of more than 90% is desirable, so that the results from the second setup of each experiment are regarded as sufficient evidence for descriptive features to distinguish between different contexts. We discuss the general applicability of the results.

Figure 6: COBWEB tree with r=10, num=100, i=1020

Figure 7: COBWEB tree with r=10, num=100, i=1149

4.5 Discussion

There is more research potential in the question whether this approach also works for other domain literature, or what happens if other clustering algorithms with advanced capabilities of constraint formulation are chosen. Considering that we used concept hierarchies mostly about general banking law, financial markets and European banking law, the overlap of REG and RFC is considerable. After other books about different subjects are added, those three concept hierarchies may form a cluster. During the concept hierarchy extraction, we found that there are four major limitations of our approach: First, literature resources are needed which cover the information need. Otherwise, a user may not find his case represented. Second, for each textbook, there can be a different format of citations or the TOC components. This results in a higher manual effort for rule formulation. Third, since we only had the PDF files of literature available, there were challenges in segmenting the file and assigning references to each section, leading to missing feature values. Fourth, despite having gained much domain information from the textbook, we need to investigate more methods of leveraging those. Since we plan to implement a lightweight heterogeneous ontology, we uncover future research fields in Section 5.

5 Conclusion and Future Work

To conclude, our lightweight heterogeneous ontology is composed of concept hierarchies which are derived from literature. It is a promising area for further work. We pointed out the reasons for accepting coexisting perspectives in the legal domain and gave indications of how to take advantage of many sources, while still controlling the results with constraints and user feedback. The rule-based annotation method provided features for context-aware classification and clustering of the concept hierarchies. Overall, the results indicate that the chosen features, the extraction method and the concept formation library are suitable for detecting semantic similarity in the book we selected. Regarding future work, we are curious about how this method performs if additional features of the content of referenced regulations and term definitions are taken into account. Another field to study is the impact of abstract relationship categories on clustering. We see possible applications of the learned ontology in the field of law clustering, legal context search, topic detection and legal recommender systems and intend to explore more about these use cases.

6 Acknowledgements

The authors would like to thank Andreas Nürnberger and the anonymous referees for their valuable comments. The work is supported by Legal Horizon AG, Grant No.:1704/00082.

References

[AAD16] VS Anoop, S Asharaf, and P Deepak. Unsupervised concept hierarchy learning: a topic modeling guided approach. Procedia Computer Science, 89:386–394, 2016.

[ABC+16] Gianmaria Ajani, Guido Boella, Luigi Di Caro, Livio Robaldo, Llio Humphreys, Sabrina Praduroux, Piercarlo Rossi, and Andrea Violato. The european taxonomy syllabus: A multi-lingual, multi-level ontology framework to untangle the web of european legal terminology. Applied Ontology, 11(4):325–375, 2016.

[BDCG+15] Guido Boella, Luigi Di Caro, Michele Graziadei, Loredana Cupi, Carlo Emilio Salaroglio, Llio Humphreys, Hristo Konstantinov, Kornel Marko, Livio Robaldo, Claudio Ruffini, Kiril Simov, Andrea Violato, and Veli Stroetmann. Linking legal open data: Breaking the accessibility and language barrier in european legislation and case law. In Proceedings of the 15th International Conference on Artificial Intelligence and Law, ICAIL ’15, pages 171–175, New York, NY, USA, 2015. ACM.

[BDIPV13] Gioele Barabucci, Angelo Di Iorio, Francesco Poggi, and Fabio Vitali. Integration of legal datasets: From meta-model to implementation. In Proceedings of International Conference on Information Integration and Web-based Applications & Services, IIWAS ’13, pages 585:585–585:594, New York, NY, USA, 2013. ACM.

[BDMP06] Holger Bast, Georges Dupret, Debapriyo Majumdar, and Benjamin Piwowarski. Discovering a term taxonomy from term similarities using principal component analysis. In Markus Ackermann, Bettina Berendt, Marko Grobelnik, Andreas Hotho, Dunja Mladenič, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtěch Svátek, and Maarten van Someren, editors, Semantics, Web and Mining, pages 103–120. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.

[BGBI16] María G. Buey, Angel Luis Garrido, Carlos Bobed, and Sergio Ilarri. The ais project: Boosting information extraction from legal documents by using ontologies. In Proceedings of the 8th International Conference on Agents and Artificial Intelligence, pages 438–445, 2016. Exported from https://app.dimensions.ai on 2018/08/19.

[BGvH+03] Paolo Bouquet, Fausto Giunchiglia, Frank van Harmelen, Luciano Serafini, and Heiner Stuckenschmidt. C-owl: Contextualizing ontologies. In Dieter Fensel, Katia Sycara, and John Mylopoulos, editors, The Semantic Web - ISWC 2003, pages 164–179, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.

[BMNG18] Mark Belford, Brian Mac Namee, and Derek Greene. Stability of topic modeling via matrix factorization. Expert Systems with Applications, 91:159–169, 2018.

[BN14] Korinna Bade and Andreas Nürnberger. Hierarchical constraints - providing structural bias for hierarchical clustering. Machine Learning, 94(3):371–399, 2014.

[BNS+10] Mírian Bruckschen, Caio Northfleet, DM Silva, Paulo Bridi, Roger Granada, Renata Vieira, Prasad Rao, and Tomas Sander. Named entity recognition in the legal domain for ontology population. In 3rd Workshop on Semantic Processing of Legal Texts (SPLeT 2010), page 16, 2010.

[CHS04] Philipp Cimiano, Andreas Hotho, and Steffen Staab. Clustering concept hierarchies from text. In Proceedings of the Conference on Lexical Resources and Evaluation (LREC), pages 1721–1724, 2004.

[DKB08] Peter Derleder, Kai-Oliver Knops, and Heinz Georg Bamberger. Handbuch zum deutschen und europäischen Bankrecht. Springer Science & Business Media, 2008.

[Fis87] Douglas H Fisher. Knowledge acquisition via incremental conceptual clustering. Machine learning, 2(2):139–172, 1987.

[FMPT10] Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia. Integrating a bottom–up and top–down methodology for building semantic resources for the multilingual legal domain. In Semantic Processing of Legal Texts, pages 95–121. Springer, 2010.

[FSE11] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the conference on empirical methods in natural language processing, pages 1535–1545. Association for Computational Linguistics, 2011.

[GA10] Korhan Günel and Rıfat Aşlıyan. Extracting learning concepts from educational texts in intelligent tutoring systems automatically. Expert Systems with Applications: An International Journal, 37(7):5017–5022, 2010.

[GF14] Marian George and Christian Floerkemeier. Recognizing products: A per-exemplar multi-label image classification approach. In European Conference on Computer Vision, pages 440–455. Springer, 2014.

[HA85] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.

[HBDB+07] Rinke Hoekstra, Joost Breuker, Marcello Di Bello, Alexander Boer, et al. The lkif core ontology of basic legal concepts. LOAIT, 321:43–63, 2007.

[Hea92] Marti A Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics-Volume 2, pages 539–545. Association for Computational Linguistics, 1992.

[Hul03] Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 216–223. Association for Computational Linguistics, 2003.

[KTH06] Huang-Cheng Kuo, Tsung-Han Tsai, and Jen-Peng Huang. Building a concept hierarchy by hierarchical clustering with join/merge decision. In Proceedings of the 9th Joint Conference on Information Sciences, JCIS 2006, volume 2006, 01 2006.

[MHAK16] C.J. MacLellan, E. Harpstead, V. Aleven, and K.R. Koedinger. Trestle: A model of concept formation in structured domains. Advances in Cognitive Systems, 4:131–150, 2016.

[PARR11] Karteeka Pavan, Allam Appa Rao, and A V Rao. An automatic clustering technique for optimal clusters. abs/1109.1068:133–144, 09 2011.

[ROB17] Cécile Robin, James O’Neill, and Paul Buitelaar. Automatic taxonomy generation - A use-case in the legal domain. CoRR, abs/1710.01823, 2017.

[sBS12] Vi-sit Boonchom and Nuanwan Soonthornphisaj. Atob algorithm: an automatic ontology construction for thai legal sentences retrieval. Journal of Information Science, 38(1):37–51, 2012.

[SE09] Jorge M Santos and Mark Embrechts. On the use of the adjusted rand index as a metric for evaluating supervised classification. In International Conference on Artificial Neural Networks, pages 175–184. Springer, 2009.

[SG+07] Erich Schweighofer, Anton Geist, et al. Legal query expansion using ontologies and relevance feedback. In LOAIT, pages 149–160, 2007.

[SN11] Sebastian Stober and Andreas Nürnberger. An experimental comparison of similarity adaptation approaches. In International Workshop on Adaptive Multimedia Retrieval, pages 96–113. Springer, 2011.

[STT95] Anne Schiller, Simone Teufel, and Christine Thielen. Guidelines für das tagging deutscher textcorpora mit stts. Technical report, Universitäten Stuttgart und Tübingen, 1995.

[VC98] Pepijn R.S. Visser and Zhan Cui. Heterogeneous ontology structures for distributed architectures, 1998.

[VZ07] Fabio Vitali and Flavio Zeni. Towards a country-independent data format: the akoma ntoso experience. In Proceedings of the V legislative XML workshop, pages 67–86. Florence, Italy: European Press Academic Publishing, 2007.

[WBM18] Bernhard Waltl, Georg Bonczek, and Florian Matthes. Rule-based information extraction - advantages, limitations, and perspectives. Jusletter IT, 02 2018.

[WBVvS14] Radboud Winkels, Alexander Boer, Bart Vredebregt, and Alexander van Someren. Towards a legal recommender system. In JURIX, volume 271, pages 169–178, 2014.

[WLM16] Bernhard Waltl, Jörg Landthaler, and Florian Matthes. Differentiation and empirical analysis of reference types in legal documents. In JURIX, pages 211–214, 2016.

[WLW+15] Shuting Wang, Chen Liang, Zhaohui Wu, Kyle Williams, Bart Pursel, Benjamin Brautigam, Sherwyn Saul, Hannah Williams, Kyle Bowen, and C Lee Giles. Concept hierarchy extraction from textbooks. In Proceedings of the 2015 ACM Symposium on Document Engineering, pages 147–156. ACM, 2015.

[WZH16] Minmei Wang, Bo Zhao, and Yihua Huang. Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications. In International Conference on Neural Information Processing, pages 120–128. Springer, 2016.

[ZK07] Paul Zhang and Lavanya Koppaka. Semantics-based legal citation network. In Proceedings of the 11th international conference on Artificial intelligence and law, pages 123–130. ACM, 2007.