An Analysis of Topic Modelling for Legislative Texts

James O' Neill (Insight Centre for Data Analytics, IDA Business Park, Galway, Ireland) james.oneill@insight-centre.org
Cécile Robin (Insight Centre for Data Analytics, IDA Business Park, Galway, Ireland) cecile.robin@insight-centre.org
Leona O' Brien (Governance, Risk and Compliance Technology Centre, University College Cork, Cork, Ireland) leona.obrien@ucc.ie
Paul Buitelaar (Insight Centre for Data Analytics, IDA Business Park, Galway, Ireland) paul.buitelaar@insight-centre.org

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org

ABSTRACT
The volume of legislative documents produced within the past decade has risen dramatically, making it difficult for law practitioners to attend to legislation such as Statutory Instrument orders and Acts. This work focuses on the use of topic models for summarizing and visualizing British legislation, with a view toward easier browsing and identification of salient legal topics and their respective sets of topic-specific terms. We provide an initial qualitative evaluation from a legal expert of how the models have performed, ranking them for each jurisdiction according to topic coherency and relevance.

1 INTRODUCTION
The legal domain is experiencing a major shift towards automated tools that can perform tasks which are becoming increasingly difficult for legal practitioners to carry out, due to the rate of change in the legal domain. Regulatory change (RC) is a notable area that has gained attention in recent years due to the difficulties involved in compliance. In order to build automated solutions for compliance and verification, automated knowledge acquisition is imperative for related tasks. An initial step towards such a system requires an overview and summarization of the core topics within the domain, in order to identify salient terms within the topics that are potentially associated with compliance across various documents. Many approaches in legal systems require metadata from an XML schema to carry out analysis such as topic modelling. This paper analyzes the use of topic models to do this automatically from raw text. We start with a background to the models used for testing.

2 TOPIC MODELLING

2.1 Dimensionality Reduction Approaches
A basic approach to modelling topics is to view a corpus as a set of term frequencies (tf), where the weight for each term is also dependent on the inverse document frequency (idf): e.g. "and" occurs many times in a document, therefore its weight is low. Formally, the weight is f_{t,d} \cdot \log(N / n_t), where N represents the number of documents and n_t is the number of documents term t appears in. From a term-document matrix M, dimensionality reduction techniques are often used to reduce all terms to a set of concepts, which can be interpreted as approximations of "topics" in a given corpus. The matrix factorization techniques we discuss include Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF).

2.1.1 Non-Negative Matrix Factorization. NMF is specifically for factorizing matrices with non-negative values, which is why it is particularly suitable for term-document matrices. Since M is represented as non-negative values, features are composed of additive computations, resulting in a parts-based representation (as opposed to subtracting values, which would not lead to a parts-based factored representation) [6]. The objective of NMF is to find an approximation of matrix M by factorizing it into W (r × k) and H (k × c) such that M ≈ WH, where k is of lower rank than M. The reconstruction error is minimized according to Equation 1 [11, 12].

\frac{1}{2}\|M - WH\|_F^2 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\big(M_{ij} - (WH)_{ij}\big)^2    (1)

As described by Lee and Seung [11], the multiplicative update algorithm is used for updating both W and H. Both update rules are outlined in Equation 2. The formulation ensures that the minimization is constrained to W and H being positive and that the distance between them is positive.

H_{\alpha\mu} \leftarrow H_{\alpha\mu}\,\frac{(W^{T}M)_{\alpha\mu}}{(W^{T}WH)_{\alpha\mu}}, \qquad W_{i\alpha} \leftarrow W_{i\alpha}\,\frac{(MH^{T})_{i\alpha}}{(WHH^{T})_{i\alpha}}    (2)

In this work, instead of using gradient descent to minimize the sum of squared (Euclidean) distances (SSD) between M and WH, we use the Coordinate Descent solver. Lin [13] describes a process that builds upon the multiplicative update algorithm by applying Alternating Non-negative Least Squares (ANLS) using projected gradient descent, a parameter estimator with lower-bounded constraints. Although NMF is widely used for topic modelling [21], it is sometimes known to produce non-meaningful topics, particularly if the term-document matrix is relatively sparse. Therefore, identifying both rare and non-distinct terms for removal before factorization is an important step. Furthermore, NMF can be prone to local minima.

2.1.2 Singular Value Decomposition. SVD decomposes a matrix into three parts, as shown in Equation 3, in order to find a lower-rank approximation of the term-document matrix (the rank of a matrix is the number of linearly independent column vectors, which can be used to reconstruct all column vectors). Consider M to be a tf-idf matrix representation of the corpus, where U diagonalizes MM^T and u_i represents the corresponding eigenvector. Similarly, V^* diagonalizes M^T M and v_i represents the eigenvectors of M^T M. The diagonal values of Σ are the ordered singular values (the square roots of the eigenvalues).

M = U \Sigma V^*    (3)

SVD on a term-document matrix is also referred to as Latent Semantic Analysis (LSA), as the lower-rank matrix M is said to represent a latent semantic space. In information retrieval it is referred to as Latent Semantic Indexing (LSI), where SVD is used to index documents by representing documents (document-document) and terms (document-term, where terms are query terms) in a vector space whose elements correspond to the degree of association a term or document has with a given topic. The similarity between a query and a given set of documents can then be determined using a term-topic matrix [18]. This is particularly helpful for distinguishing polysemous and synonymous terms.

2.2 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) was first introduced by Blei et al. [2] and has since been a state-of-the-art (SoTA) topic model, shown to be more expressive than probabilistic LSA (pLSA) [3]. LDA builds a Bayesian generative model using Dirichlet priors for the topic mixtures (an assumed prior probability for each topic distribution; the Dirichlet acts as a prior over categorical distributions in this sense), in contrast to pLSA, which can be considered to use a uniform prior distribution for the topic mixtures. Extensions have since been made to improve and adapt this model in a continuous-space setting, in which continuous word embeddings are used. Categorical distributions are replaced with multivariate Gaussian distributions, meaning that Gaussian LDA is capable of handling out-of-vocabulary words in unseen text [8]. The probability of a word w is dependent on a topic k in z, which is dependent on the probability of a document θ_d drawn from a Dirichlet prior α. Likewise, a word w is also dependent on the probability ϕ that word w belongs to topic k.

The LDA generative process is described by Blei [3]. For each document, a parameter θ_d is chosen from a Dirichlet prior distribution; then, for each word in d, a topic category is chosen according to the Dirichlet, and a word w is generated given the topic z_w and β. The aforementioned Gaussian LDA represents words as continuous embedded vectors instead of discrete co-occurrence counts, replacing the categorical distributions for z_n and w_n with Gaussians.

The saliency of terms within a topic is considered by Chuang et al. [7] and formulated in Equation 4. A distinctive word w is a word that has a higher log-likelihood of being in a topic K compared to a random word. Hence, if a word w occurs in many topics it is non-informative, resulting in lower saliency. More informative, topic-specific and less general terms are desired by legal practitioners, hence we use this saliency measure in our analysis.

S(w) = P(w)\sum_{K} P(K|w)\,\log\frac{P(K|w)}{P(K)}    (4)

Sievert and Shirley [19] describe the relevance measure, shown in Equation 5, where ϕ_k(w) is the probability of w for topic k and p(w) is the probability of observing w in corpus D. In this work, λ can be chosen between 0 and 1. We set λ according to term relevance judgments made by a legal practitioner, prior to the final analysis of each topic model.

r(w, k \mid \lambda) = \lambda\,\log(\phi_k(w)) + (1 - \lambda)\,\log\frac{\phi_k(w)}{p(w)}    (5)

2.3 Saffron
Saffron is a software tool (see http://saffron.insight-centre.org/) that can construct a model-free topic hierarchy. It extracts terms related to the domain of expertise, establishes semantic relations between them, and constructs a taxonomy out of them. Saffron also deals with multi-word expressions, which can improve topic coherency, as phrases are often necessary for better readability and understanding.

Saffron builds the topic hierarchy of a corpus by first capturing the expertise domain through a model represented as a single-word list. The latter is extracted using feature selection during a term and linguistic pattern extraction phase. It uses constraints such as limiting candidates to contentful parts of speech, to single words (in order to target a more generic level), and to terms distributed across at least a quarter of the corpus (for specificity to the area of expertise). Topic coherency, a main issue for statistically driven models if Subject Matter Experts (SMEs) are to rely upon them, is tackled here by using semantic relatedness to filter the candidate words. It is interpreted as a domain coherency measure using Pointwise Mutual Information (PMI) (see [4] for more details). The domain model is then used as a base to measure the coherence of the topics within the domain in the next phase.

After extracting candidate terms following a standard multi-word term extraction technique (see [4]), the first step involves searching for words from the domain model in the immediate context of those candidates. This allows a term's coherence within the domain to be determined, again through PMI calculation, by using top-level terms to extract intermediate-level terms.

To create the pruned graph which represents the taxonomy, the strength of the relationship between two terms is measured, defined as I_{ij} = D_{ij} / (D_i \times D_j), where D_i is the number of documents that mention the term T_i in our corpus, D_j is the number of documents that mention the term T_j, and D_{ij} is the number of documents in which both terms appear. Edges are added to the graph for all pairs that appear together in at least three documents, a threshold fixed based on the results of previous studies and tests (see [4] for more details). Saffron also uses a generality measure to direct edges from generic concepts to more specific ones. This results in a dense, noisy directed graph that is further trimmed using a specific branching algorithm which was successfully applied for the construction of domain taxonomies in [14]. This yields a tree structure where the root is the most generic term and the leaves are the most specific terms.
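As a concrete illustration of the objective in Equation 1 and the multiplicative updates in Equation 2, the following sketch factorizes a toy random non-negative matrix with NumPy. The matrix size, rank and iteration count are invented for illustration and are not the settings used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative "term-document" matrix M (20 terms x 12 documents).
M = rng.random((20, 12))
k = 4  # number of topics, i.e. the lower rank of the factorization

# Random non-negative initialisation of W (terms x topics) and H (topics x documents).
W = rng.random((20, k))
H = rng.random((k, 12))

def reconstruction_error(M, W, H):
    """The objective of Equation 1: 0.5 * ||M - WH||_F^2."""
    return 0.5 * np.linalg.norm(M - W @ H, "fro") ** 2

errors = [reconstruction_error(M, W, H)]
eps = 1e-10  # guards against division by zero
for _ in range(200):
    # Multiplicative update rules of Lee and Seung (Equation 2).
    H *= (W.T @ M) / (W.T @ W @ H + eps)
    W *= (M @ H.T) / (W @ H @ H.T + eps)
    errors.append(reconstruction_error(M, W, H))

# The factors stay non-negative and the objective decreases.
assert (W >= 0).all() and (H >= 0).all()
assert errors[-1] < errors[0]
```

scikit-learn's `NMF` class exposes the same objective, with `solver='cd'` selecting the Coordinate Descent variant used later in this work.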
3 RELATED WORK
Wiltshire et al. [20] introduced a large-scale machine learning system that incorporates hierarchical topic construction after the extraction of terms, legal phrases and case cites. Their system allows for a ranking and classification of topics given a legal concept as input, according to a scoring criterion. George et al. [10] provide a legal system for ranking documents according to their similarity to legal cases, by finding similarity between documents in the latent topic space and query terms. They then use human assistance to annotate documents that are relevant to the query in a semi-supervised fashion. In contrast, our work is fully unsupervised, with no human assistance during the topic modelling process. LDA has been used extensively on natural language texts such as social media texts [16], publication texts, newspapers etc., but typically not in formal settings such as legal texts.

Raghuveer and Kumar [17] use LDA to cluster Indian legal judgements, with cosine similarity as the distance measure between documents for clustering. However, their evaluation does not draw on the prior knowledge of a legal expert to determine whether the clusters coincide with legal knowledge within the domain.

O' Neill et al. [15] have identified salient legal statements (in contrast to salient topics) by extracting deontic modalities, using a small number of labeled samples to train a recurrent neural network.

Ahmed and Xing [1] use a dynamic HDP to track topics over time; documents can be exchanged while the ordering remains intact. They also use longitudinal NIPS papers to track emerging and decaying topics (this is worth noting, particularly for tracking changing topics around compliance issues).

The use of the aforementioned Saffron has been previously demonstrated in a wide range of projects from several domains and for different tasks. In [5], Bordea used Saffron's topic extractor to analyze legal documents arising around the financial crisis in 2008. She framed the problem as an expert finding task, which aims at ranking people that have knowledge about a given topic. In that particular context, the task allowed the identification of individuals involved in defining the response of the U.S. government to the financial crisis by searching for a topic of interest. In [4], Saffron was used as a tool to detect the presence of different disciplines within the field of Web Science. Running it on over 10 years of Web Science conference series documents resulted in the discovery of 4 communities (Communication, Computer Science, Psychology, and Sociology), and of trends over time and across types of paper. Saffron was also used in a demo for an Irish bookshop website (http://kennys.insight-centre.org/) to extract topics from book descriptions/reviews and then classify them accordingly, as well as to link the books for the creation of a multi-level browsing application for book navigation.

4 METHODOLOGY
This section outlines the steps towards creating each topic model and the configurations used for analysis. We start with a brief introduction to the corpora used and the preprocessing steps common to all topic models. United Kingdom legislative texts (retrieved from http://www.legislation.gov.uk/) were used for topic modelling. The corpus contains 41,518 documents between 2000 and 2016. However, for practical purposes the analysis is carried out on the year 2016 only, to lessen the reading burden on the legal practitioner. The legislative types consist of the following: 304 Northern Ireland Statutory Rules, 838 UK Statutory Instruments, 132 Welsh Statutory Instruments and 317 Scottish Statutory Instruments.

4.1 Text Preprocessing
Corpus-specific regular expressions (REs) are used to clean legal domain syntax (e.g. bracketed alphanumerics), followed by tokenization and lemmatization using the WordNet lemmatizer [9]. The structure usually contains nested expressions, e.g. (ii) followed by (a) and (b) subsections. This syntax is removed using the regular expressions, along with other standard REs for identifying references and alphanumeric expressions, e.g. "Regulation EC No. 1370/2007 means Regulation 1370/2007 ...". Redundant stopwords and words with frequency f < 2 are removed from the corpora. This is carried out under the supervision of a subject expert, by analysing a subsample of the terms considered for removal. We assume that terms with high frequency are not specific to a particular topic (e.g. 'the', 'of'), and that rare terms occurring infrequently are not representative of a single topic, since they do not appear often enough to infer that they are salient for a topic. Each corpus (one corpus per jurisdiction) is then converted to a term-document matrix, where each word is weighted using the aforementioned tf-idf weighting scheme. Furthermore, the top 30 terms for all models except Saffron are listed for the SME to rank. For Saffron, we rely on a visualization of the term hierarchy for a domain expert to judge.

4.2 Ranking Criterion and Model Configurations
In order for a legal practitioner to assess the models in a fair manner, a set of guidelines is presented for the ranking of the models. An important aspect of the ranking is the pre-tuning of the term relevance parameter λ, which selects the top 30 terms presented for each topic within each jurisdiction. We also assess a number of parameter settings for NMF, LSA, LDA and HDP before finally choosing the final set of 10 topics on which the legal expert makes their final judgment. Since the term-document matrix is quite sparse (evident from Figure 1), NMF is initialized using Non-Negative Singular Value Decomposition (NNSVD). The Coordinate Descent solver is used for minimizing the reconstruction error, as mentioned in Section 2.1.1. The number of components is set to n_k = 10. LSI uses standard SVD, which does not require much tuning other than choosing the number of singular values, also set to n_k = 10. For LDA we choose a low relevance λ = 0.25 to highlight topic-specific terms.

5 RESULTS
In this section we analyse the topics retrieved by each approach; an SME evaluated the topics for the regulations. Figure 1 compares the effects of dictionary size as infrequent terms are increasingly removed. It is evident that after removing terms that occur less than twice, the corpus' size decreases dramatically, meaning that a significant number of terms are too specific to a particular document. We remove these terms for subsequent analysis.

[Figure 1: Rare-word Removal For Each Corpus]
[Figure 2: LDA topics for Northern Ireland Statutory Rules projected to 2 principal components using multi-dimensional scaling (MDS)]
[Figure 3: Latent Dirichlet Allocation terms for topic 10 of Northern Ireland Statutory Rules]
[Figure 4: Support Allowance topic within Northern Ireland Statutory Rules]

Latent Dirichlet Allocation Visualization. For the visualization of LDA topics, we use the pyLDAvis [19] visualization tool.
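The effect of the relevance setting λ = 0.25 can be illustrated with a small sketch of Equation 5. The word probabilities below are invented for illustration; with λ = 1 the ranking reduces to the raw topic probability, while a low λ up-weights the lift ϕ_k(w)/p(w) and favours topic-specific terms:

```python
import math

def relevance(phi_kw: float, p_w: float, lam: float) -> float:
    """Equation 5: r(w, k | lambda) = lam*log(phi_k(w)) + (1-lam)*log(phi_k(w)/p(w))."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

# Toy figures (invented): a corpus-wide frequent word vs. a topic-specific one.
# phi_k(w): probability of w under topic k; p(w): marginal probability in the corpus.
generic = {"phi": 0.020, "p": 0.019}   # e.g. "regulation": common everywhere
specific = {"phi": 0.015, "p": 0.001}  # e.g. "bioliquid": concentrated in one topic

for lam in (1.0, 0.25):
    r_gen = relevance(generic["phi"], generic["p"], lam)
    r_spec = relevance(specific["phi"], specific["p"], lam)
    print(f"lambda={lam}: generic={r_gen:.2f}, specific={r_spec:.2f}")
```

This mirrors the λ slider in pyLDAvis, where the top terms displayed for each topic are re-ranked by r(w, k | λ).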
A multidimensional scaling projects the t-dimensional topic space to 2 dimensions, as shown in Figure 2. Ten topics for the Northern Ireland Statutory Rules (NISR) are presented with the relevance metric set to λ = 0.25 (which decides the term-topic specificity). This is done under the supervision of a legal practitioner, to ensure that λ is tuned to a correct specificity and that topics are coherent, before a final evaluation. Some terms, such as biomass, biomaterial, bioliquid, fossil and fuel, show a clear and distinct topic and are quite topic-specific given λ = 0.25; this is shown by the red bars, which indicate the term frequency within the given topic, as opposed to the blue bars, which indicate the term frequency across the whole corpus.

Saffron. In Saffron's results, a cluster is located around the extracted topic of department of justice, and around support allowance, which derives the whole taxonomy for the Northern Ireland Statutory Rules. The latter topic is thus the primary node of the 2016 corpus. In Figure 4, we zoom in on a subset of this graph (and thus of the sub-domains) which includes housing benefit, income support, social security and personal_independence payment. These are all semantically related to the mother node support allowance, but tackle different aspects of it. We can see the advantage of the hierarchical structure of the graph, with semantically related topics going from the more generic to the more specialized. In this way we can identify a waterfall structure from the housing benefit branch, logically followed by the more specific local housing allowance, and then local housing allowance determination. Another quite clear example can be observed in the child support branch, related to the personal independence_payment node. From child support, the directed edge links to child support maintenance, then maintenance calculation, and finally the three topics child_support_maintenance_calculation_regulation, welfare service and maintenance assessment. The police service node is at the root of a taxonomy that includes the child nodes northern_ireland_reserve ⇒ notice_of_appeal ⇒ written_representation, avoiding service ⇒ reasonable_amount_of_duty_time.
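The taxonomy chains above follow from the edge construction described in Section 2.3. Below is a minimal sketch of the strength measure I_ij = D_ij/(D_i × D_j) and the three-document co-occurrence threshold; the document sets are invented for illustration, and plain document frequency stands in for Saffron's generality measure:

```python
from itertools import combinations

# Toy document-occurrence sets (invented): ids of documents each term appears in.
docs = {
    "child support":             {1, 2, 3, 4, 5, 6, 7, 8},
    "child support maintenance": {1, 2, 3, 4},
    "maintenance calculation":   {2, 3, 4},
    "welfare service":           {9, 10},
}

edges = {}
for t_i, t_j in combinations(docs, 2):
    d_ij = len(docs[t_i] & docs[t_j])
    if d_ij >= 3:  # threshold from Section 2.3: co-occurrence in at least 3 documents
        # Strength of the relationship: I_ij = D_ij / (D_i * D_j)
        edges[(t_i, t_j)] = d_ij / (len(docs[t_i]) * len(docs[t_j]))

# A generality proxy directs each edge from the more generic term (here simply
# the one occurring in more documents) towards the more specific one.
directed = {(a, b) if len(docs[a]) >= len(docs[b]) else (b, a): w
            for (a, b), w in edges.items()}
print(directed)
```

On this toy data, edges run child support ⇒ child support maintenance ⇒ maintenance calculation, while welfare service stays disconnected because it never co-occurs with the others in three documents.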
This example summary allows a legal practitioner to identify topics surrounding certain legal issues, or simply to summarize a complete jurisdiction. Zooming in on a subset of the hierarchical tree, we highlight a topic with coherent terms that correspond to compliance-related issues: multi-word expressions summarizing an area within the Northern Ireland Statutory Rules, shown in Figure 5.

[Figure 5: Police Service topic within Northern Ireland Statutory Rules]

After evaluation, Saffron has been consistently ranked as the most favourable of all models, as the aforementioned vocabulary pruning and usage of multi-word expressions has played a fundamental role in topic coherency. Standard LDA has performed the best of all single-term models, particularly when top terms are chosen according to their topic specificity. HDP has inferred a similar number of topics to LDA, according to an analysis of the log-likelihood curve and the legal practitioner's judgment. This work is an early indication of how legal practitioners can identify salient and coherent topics using automatic topic modelling tools.

Ranking. Table 1 shows the results of the SME ranking after assessing each topic model for each jurisdiction. Saffron is favored overall for all jurisdictions, considering it is the only model that performs multi-word expression topic extraction and weighting of descriptive noun terms/phrases. We conjecture that the appeal of a hierarchical structure and multi-word noun expressions has influenced the interpretation of the salient terms in the domain, making it easier for legal practitioners to identify important and coherent legal topics. We emphasize at this point that single-word topic models and multi-word hierarchical models are not directly comparable, for the reasons outlined; however, they are included in Table 1 to highlight the importance of longer expressions linked in a taxonomy, providing more clarity on what the emerging topics are.

Table 1: Subject Matter Expert Ranking of Topic Models

Rank  NISR     SSI      UKSI      WSI
1     Saffron  Saffron  Saffron   Saffron
2     LDA      LDA      LDA       LDA
3     HDP      NMF      HLDP/LSI  HLDP/LSI
4     LSA      LSI      HLDP/LSI  HLDP/LSI
5     NMF      HLDP     NMF       NMF

6 CONCLUSION
This work has presented a fully automated approach for identifying topics in regulations that assist in easier tracking of important domain

REFERENCES
[1] Amr Ahmed and Eric P. Xing. 2012. Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream. CoRR abs/1203.3463 (2012). http://arxiv.org/abs/1203.3463
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937
[4] Georgeta Bordea. 2013. Domain Adaptive Extraction of Topical Hierarchies for Expertise Mining. Ph.D. Dissertation.
[5] Georgeta Bordea, Kartik Asooja, Paul Buitelaar, and Leona O'Brien. 2014. Gaining Insights into the Global Financial Crisis Using Saffron. NLP Unshared Task in PoliInformatics (2014).
[6] Deng Cai, Xiaofei He, Xiaoyun Wu, and Jiawei Han. 2008. Non-negative Matrix Factorization on Manifold. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 63–72.
[7] Jason Chuang, Christopher D. Manning, and Jeffrey Heer. 2012. Termite: Visualization Techniques for Assessing Textual Topic Models. In Proceedings of the International Working Conference on Advanced Visual Interfaces. ACM, 74–77.
[8] Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for Topic Models with Word Embeddings. In ACL (1). 795–804.
[9] Christiane Fellbaum. 1998. WordNet. Wiley Online Library.
[10] Clint Pazhayidam George, Sahil Puri, Daisy Zhe Wang, Joseph N. Wilson, and William F. Hamilton. 2014. SMART Electronic Legal Discovery via Topic Modeling. In FLAIRS Conference.
[11] Daniel D. Lee and H. Sebastian Seung. 1999. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401, 6755 (1999), 788–791.
[12] Daniel D. Lee and H. Sebastian Seung. 2001. Algorithms for Non-negative Matrix Factorization. In Advances in Neural Information Processing Systems. 556–562.
[13] Chih-Jen Lin. 2007. Projected Gradient Methods for Nonnegative Matrix Factorization. Neural Computation 19, 10 (2007), 2756–2779.
[14] Roberto Navigli, Paola Velardi, and Stefano Faralli. 2011. A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three (IJCAI'11). AAAI Press, 1872–1877. DOI: http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-313
[15] James O' Neill, Paul Buitelaar, Cecile Robin, and Leona O' Brien. 2017. Classifying Sentential Modality in Legal Language: A Use Case in Financial Regulations, Acts and Directives. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law. ACM, 159–168.
[16] Marco Pennacchiotti and Siva Gurumurthy. 2011. Investigating Topic Models for Social Media User Recommendation. In Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 101–102.
[17] K. Raghuveer. 2012. Legal Documents Clustering Using Latent Dirichlet Allocation. IAES Int. J. Artif. Intell. 2, 1 (2012), 34–37.
[18] Barbara Rosario. 2000. Latent Semantic Indexing: An Overview. Techn. rep. INFOSYS 240 (2000).
[19] Carson Sievert and Kenneth E. Shirley. 2014. LDAvis: A Method for Visualizing and Interpreting Topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 63–70.
[20] James S. Wiltshire Jr., John T. Morelock, Timothy L. Humphrey, X. Allan Lu, James M. Peck, and Salahuddin Ahmed. 2002. System and Method for Classifying Legal Concepts Using Legal Topic Scheme. (Dec. 31 2002). US Patent 6,502,081.
[21] Xiaohui Yan, Jiafeng Guo, Shenghua Liu, Xueqi Cheng, and Yanfeng Wang. 2013. Learning Topics in Short Texts by Non-negative Matrix Factorization on Term Correlation Matrix. In Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 749–757.