<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Use of Topic Analysis in Latvian Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rinalds Vīksna</string-name>
          <email>rinaldsviksna@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marite Kirikova</string-name>
          <email>Marite.Kirikova@rtu.lv</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daiga Kiopa</string-name>
          <email>daiga@lursoft.lv</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence and Systems Engineering, Riga Technical University</institution>
          ,
          <country country="LV">Latvia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lursoft</institution>
          ,
          <country country="LV">Latvia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The large number of legislative documents produced every day makes it difficult to follow each and every document. However, it is important for enterprises to comply with all current legislative acts. In this paper we demonstrate the application of different topic analysis algorithms and stop word filtering approaches to the corpus of legal texts of the Republic of Latvia. This is done for the purpose of supporting the discovery of expressive and meaningful legal topics and marking respective documents according to those topics. Topic models produced in this work are intended to be used as an aid for experts, enabling faster document browsing possibilities.</p>
      </abstract>
      <kwd-group>
<kwd>Topic Analysis</kwd>
        <kwd>Legal Analysis</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Every enterprise must conform to and comply with current regulatory acts. Moreover,
some types of regulations may be used as blueprints for business process models [1].
Legislative documents may be laws issued by parliament, regulations issued by the
Cabinet of Ministers, Municipalities or other institutions, as well as industry
standards, various contracts and other documents [2]. Many regulations are related to
others, being either an update of an earlier regulation or depending on or being
implemented by other regulations. Keeping track of the changing regulatory environment
requires significant time and effort.</p>
<p>In this paper we envision a solution that may help to save effort by providing an
overview and summarization of various topics within the Latvian law domain. The goal of this
paper is to explore the application of different topic analysis algorithms and stop word
filtering approaches on the corpus of legal texts of the Republic of Latvia. For the
demonstration we use three common topic analysis algorithms, briefly introduced in
Section 2.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. COUrT - CAiSE for Legal Documents, June 9, 2020, Virtual Workshop.</p>
      <p>The paper presents the research in progress that is a part of more extensive
research activity, the aim of which is to find core topics in Latvian legislation, as well
as to identify, for further exploration, a method for automated document tagging with
salient topics.</p>
      <p>The paper is organized as follows. Section 2 discusses the problem domain and
available topic analysis algorithms. Section 3 shows data preparation steps and the
results of stop word removal. Section 4 discusses the results obtained and Section 5
provides brief conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work and Background</title>
      <p>Topic analysis (often called “topic modeling” or “topic detection”) is a text-mining
[3] technique for soft clustering (where each document has a probability distribution
over all the clusters) of documents according to distribution of terms that occur in the
text body. O’Neill et al. [4] used topic analysis to summarize and visualize British
legislation to find useful topics and terms. Wyner et al. applied topic analysis to
profile and extract arguments from legal cases [5]. Soria et al. applied topic analysis to
annotate each paragraph in Italian law texts with semantic information [6]. Sulea et al.
explore the use of text classification methods in [7]; however, text classification
methods require that documents have known labels. The results may
differ depending on the language used. In this work we address regulatory (legal)
documents in Latvian with the purpose of supporting legal document handling
activities by experts.</p>
<p>Topic analysis is an unsupervised learning method that produces a number of
topics, each of which consists of related terms and their respective weights. Topic
analysis is most often done using Latent Semantic Indexing (LSI), Latent Dirichlet
Allocation (LDA) [8], or its variant, the Hierarchical Dirichlet Process (HDP)
[9], [3]. These algorithms are briefly described below and, being the most popular
ones, were used in the experiments reported in Section 4.</p>
      <sec id="sec-2-1">
        <title>Latent Semantic Indexing</title>
        <p>Latent Semantic Analysis, also called LSI, is a method for extracting and representing
the contextual usage meaning of words by statistical computations applied to a large
corpus of text. The text corpus is viewed as a set of term tf-idf weights, where tf is a
term frequency in the given text, and idf is an inverse document frequency. To this
term-document matrix singular value decomposition (SVD) is applied. In SVD a
rectangular matrix is decomposed into the product of three other matrices in order to find
a lower-rank approximation of the term-document matrix [10]. LSI is implemented
in the gensim Python library (https://radimrehurek.com/gensim/).</p>
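<p>As an illustration of the steps above (tf-idf weighting followed by a truncated SVD), the decomposition can be sketched with numpy alone; the 4-term, 3-document count matrix and the rank k = 2 are illustrative assumptions, not values from our corpus:</p>

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents,
# entries = raw term counts (assumed values, for illustration only).
counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 0, 2],
    [0, 1, 0],
], dtype=float)

# tf-idf weighting: tf is the in-document relative frequency, idf
# penalizes terms that occur in many documents.
tf = counts / counts.sum(axis=0, keepdims=True)
df = (counts > 0).sum(axis=1, keepdims=True)   # document frequency per term
idf = np.log(counts.shape[1] / df)
tfidf = tf * idf

# SVD decomposes the matrix into U * diag(s) * Vt; keeping only the k
# largest singular values gives the rank-k "latent semantic" approximation.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(approx.round(2))
```

<p>Columns of Vt[:k, :] then serve as k-dimensional document representations for similarity comparison.</p>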
      </sec>
      <sec id="sec-2-2">
        <title>Latent Dirichlet Allocation</title>
        <p>Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The
basic idea is that documents are represented as random mixtures over latent topics
where each topic is characterized by a distribution over words. The LDA algorithm is
described in [11]; it is implemented in the scikit-learn Python library (https://scikit-learn.org/stable/) and in gensim.
Term distinctiveness and saliency are used to evaluate the generated topics. For a given
word w, the unconditional probability P(w) and the probability P(T|w) that the given word
was generated by latent topic T are computed. The probability P(T) that a randomly selected
word w' was generated by topic T is also computed. The distinctiveness of word w is then
calculated as follows [12]:</p>
        <p>distinctiveness(w) = Σ_T P(T|w) log( P(T|w) / P(T) )   (1)</p>
        <p>Equation (1) describes how informative word w is for determining
topic T. If a word occurs in all topics, observing the word tells little about the
document’s topic, and the word has little distinctiveness. The saliency of a word is defined
as [12]:</p>
        <p>saliency(w) = P(w) ∗ distinctiveness(w)   (2)</p>
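<p>Equations (1) and (2) can be computed directly from a fitted model's distributions. The following pure-Python sketch uses an assumed toy model with two topics and three words to show that a word spread evenly across topics has zero distinctiveness:</p>

```python
import math

# Assumed toy model: topic prior P(T) and per-topic word distributions P(w|T).
p_topic = {"T1": 0.5, "T2": 0.5}
p_word_given_topic = {
    "T1": {"tax": 0.7, "court": 0.1, "the": 0.2},
    "T2": {"tax": 0.1, "court": 0.7, "the": 0.2},
}

def p_word(w):
    # Unconditional probability P(w) = sum over T of P(w|T) * P(T).
    return sum(p_word_given_topic[t][w] * p_topic[t] for t in p_topic)

def distinctiveness(w):
    # Eq. (1): KL divergence of P(T|w) from the topic prior P(T).
    d = 0.0
    for t in p_topic:
        p_t_given_w = p_word_given_topic[t][w] * p_topic[t] / p_word(w)
        if p_t_given_w > 0:
            d += p_t_given_w * math.log(p_t_given_w / p_topic[t])
    return d

def saliency(w):
    # Eq. (2): saliency(w) = P(w) * distinctiveness(w).
    return p_word(w) * distinctiveness(w)

# "the" occurs equally in both topics, so observing it tells us nothing:
print(distinctiveness("the"))        # 0.0
print(saliency("tax") > saliency("the"))  # True
```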
      </sec>
      <sec id="sec-2-3">
        <title>Hierarchical Dirichlet Process</title>
        <p>Hierarchical Dirichlet Process (HDP) is a Bayesian nonparametric model for
unsupervised analysis of grouped data. Documents are viewed as bags of words, which are
drawn from a number of latent clusters or "topics", where "a topic" is modeled as a
multinomial probability distribution on words from some basic vocabulary. Given a
collection of documents, HDP finds latent clusters, without the need to specify the
number of topics as a parameter [9]. HDP analysis requires multiple passes through
all the data and is therefore poorly suited for massive and streaming data. Wang et al.
proposed online variational inference for HDP, which requires only one pass through
the data and is significantly faster [9]; it is implemented in the gensim library.</p>
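<p>HDP's ability to grow the number of topics with the data comes from its Dirichlet process prior. The "Chinese restaurant process" view of that prior can be sketched in a few lines; this illustrates the prior only, not the variational inference used by gensim, and the customer count and concentration alpha are arbitrary:</p>

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample a partition of n_customers "words" into an open-ended
    number of "topics" (tables) under a Dirichlet process prior."""
    rng = random.Random(seed)
    tables = []  # tables[t] = number of customers seated at table t
    for i in range(n_customers):
        # Customer i joins table t with probability n_t / (i + alpha),
        # or opens a new table with probability alpha / (i + alpha).
        r = rng.random() * (i + alpha)
        acc = 0.0
        for t, n_t in enumerate(tables):
            acc += n_t
            if r < acc:
                tables[t] += 1
                break
        else:
            tables.append(1)  # new table: the model grew a new topic

    return tables

tables = chinese_restaurant_process(1000, alpha=5.0)
print(len(tables))  # the number of topics was not fixed in advance
```

<p>Larger alpha yields more tables, which is why HDP needs only an upper bound (150 topics in gensim) rather than an exact topic count.</p>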
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Corpus and Data Preparation and Analysis</title>
      <p>In this paper, we use the corpus of legal acts from http://likumi.lv/ - a website of legal
acts that ensures free access to systematized (consolidated) legal acts of the Republic
of Latvia. Documents were downloaded as HTML documents which were kept for
later metadata extraction (document information – issuer, status, adoption,
end-of-validity date, related documents, etc.). Text contents of the downloaded HTML
documents were extracted and saved as plain text in UTF-8 format. Documents that
contained mostly Russian or English text were dropped from the corpus.</p>
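<p>The extraction step can be sketched with the Python standard library's html.parser; the set of tags to skip and the tiny input document are illustrative assumptions:</p>

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, skipping script/style."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(html_to_text("<h1>Likums</h1><script>var x;</script><p>1. pants.</p>"))
# → Likums 1. pants.
```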
<sec id="sec-3-1">
        <p>In total, over 50,000 documents in the Latvian language were collected. With these
documents, the experiments regarding stop word removal and different topic models were
made.</p>
<p>In the remainder of this section, we first perform exploratory data analysis and assess
the impact of stop word removal on the performance of clustering algorithms, with
LDA as an example (sub-Section 3.1); in the second part (sub-Section 3.2) we
explore alternative clustering algorithms.</p>
        <sec id="sec-3-1-1">
          <title>Experiments with Stop Word Removal Approaches</title>
<p>During text preprocessing, boilerplate content (irrelevant text, ads) and generic
Latvian stop words identified by Garkaje et al. [13] were removed. After this step, the 10 most
common words in the corpus were identified (Fig. 1).</p>
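<p>The frequency check can be sketched with a standard Counter; the token stream below is an assumed stand-in for the preprocessed corpus:</p>

```python
from collections import Counter

# Assumed toy token stream standing in for the preprocessed corpus.
tokens = ("valsts likums spēkā valsts pants likums valsts "
          "spēkā nodoklis pants valsts likums").split()

counts = Counter(tokens)
for word, n in counts.most_common(3):
    print(word, n)
# "valsts" dominates here, mirroring the corpus-wide "state" in Fig. 1
```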
<p>As we see in Fig. 1, the most common words (“state”, “latvian”, “in force”, etc.) occur
multiple times in most documents in the corpus and are not very informative, so in
this context those words are stop words and were removed. To find stop words
specific to this domain, a normalized tf-idf metric was used [14]. The tf-idf metric was
calculated for each word as given in [14]:
tf-idf(k) = tf_norm ∗ idf(k),   tf_norm = −log(TF / U),   idf(k) = log(N(doc) / N(k))   (3)
where TF – term frequency, is the number of times a certain word appears in this
corpus; N(doc) – number of documents in the corpus; N(k) – number of documents
containing term k, and U – the total number of words in the corpus. The tf-idf score was
calculated for each word in the corpus, and 140 words with a tf-idf score of less than 9 were
selected as stop words. This custom stop word list was combined with the general Latvian
stop word list from Garkaje et al. [13], 450 stop words in total. After additional stop word
filtering, documents contained relatively more informative words (see Fig. 2).</p>
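<p>Equation (3) can be sketched in pure Python; the three-document toy corpus and the threshold value are illustrative assumptions (the paper's threshold of 9 applies to the full 50,000-document corpus):</p>

```python
import math

# Assumed toy corpus: each document is a list of tokens.
docs = [
    ["likums", "nosaka", "nodokli"],
    ["likums", "groza", "pantu"],
    ["likums", "nosaka", "kartību"],
]

U = sum(len(d) for d in docs)   # total number of words in the corpus
n_docs = len(docs)              # N(doc)

def tf_idf(word):
    # Eq. (3): tf-idf = tf_norm * idf, with tf_norm = -log(TF / U)
    # and idf = log(N(doc) / N(k)).
    tf = sum(d.count(word) for d in docs)      # corpus-wide frequency TF
    n_k = sum(1 for d in docs if word in d)    # documents containing the word
    return -math.log(tf / U) * math.log(n_docs / n_k)

# Words scoring below a threshold are candidate domain-specific stop words.
threshold = 2.0  # illustrative for this tiny corpus
stop_candidates = sorted(w for w in {t for d in docs for t in d}
                         if tf_idf(w) < threshold)
print(stop_candidates)
```

<p>Words present in every document (here “likums”) get idf = 0 and thus the lowest possible score, which is the desired stop-word behaviour.</p>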
<p>To evaluate topics created using different stop word selections, we used preassigned
theme labels given in likumi.lv (https://likumi.lv/ta/tema). Documents belonging to the same theme
should have similar content or describe similar topics and questions. It should be
noted that each document in likumi.lv may belong to more than one theme. We used
documents from 3 themes: “human rights”, “banks, finance, budget” and “taxes and
fees”. Two topic models were created (one using the generic stop word set and another
using the adapted stop word set), by which the selected documents were then classified as
belonging to particular topics. Document distribution by topics is shown in Fig. 3 and
Fig. 4.</p>
          <p>To assess the impact of domain-specific stop words removal, we implemented the
LDA model using a general list of stop words (Fig. 3) and then compared it with the
model generated using a domain-specific list of stop words (Fig. 4). One topic is
found by both models: Topic 9 in Fig. 3 and topic 15 in Fig. 4 represent a document
with significant English content – it is indicated by keywords in the English language.
Other topics, although different, display some similarities (Table 1 and Table 2).</p>
          <p>Fig. 3. Document distribution by topic using the generic stop word list (series: bankas-finanses-budzets, cilvektiesibas, nodokli-un-nodevas; horizontal axis: topic number (0–20); vertical axis: documents in topic, %).</p>
<p>Table 1 (excerpt). Keywords translated into English: Latvian, republic, councils, european.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
<p>As we see in Table 1, many keywords are present in multiple topics and are not
representative (“year”, “in law”). Topics generated using the second model (Table 2)
contain representative words (e.g. “cadaster”, “convention”), which indicate that a
topic is about real estate (“cadaster”) or international treaties (“convention”).</p>
        <p>Fig. 4. Document distribution by topic using the adapted stop word list (series: bankas-finanses-budzets, cilvektiesibas, nodokli-un-nodevas; vertical axis: documents in topic, %).</p>
        <p>Table 2. Topic keywords produced with the adapted stop word list (Latvian keywords, with English translations in parentheses).
Topic # 3: valstī, līgumslēdzējas, nodokļiem, state, līgumslēdzējā (in the country, contracting, taxes, state, contracting).
Topic # 5: likumu, persona, daļā, tiesas, redakcijā (law, a person, part, courts, version).
Topic # 9: kadastra, nekustamā, eiro, nodokļa, īpašumu (cadaster, real, euro, tax, property).
Topic # 10: atbalsta, izmaksas, programmas, ietvaros, sadarbības (supports, costs, programs, within, cooperation).
Topic # 12: kapitāla, ieguldījumu, tirgus, pārskata, apdrošināšanas (capital, investment, market, review, insurance).
Topic # 14: vēlēšanu, domes, pilsētas, komisija, pārvaldes dienesta (election, city council, cities, commission, administration, service).
Topic # 15: or, shall, be, for, by, article (or, shall, be, for, by, article).
Topic # 16: puses, līgumslēdzējas, puse, konvencijas, teritorijā (sides, contracting, sides, convention, territory).</p>
        <p>The LDA model using the adapted stop word list created more meaningful topics, as they
contained more representative words, which tell us more about their content. Furthermore,
as most of the more popular words were labeled as stop words, the model was able to
classify more tax-related documents into very expressive topics 3, 5, 9 and 12 in Fig. 4,
while the model with only generic stop word filtering applied created broader
topics, such as # 4, 6 and 12 in Fig. 3. Both models classified documents into multiple
topics, some of which corresponded to the themes assigned to those documents in
likumi.lv. However, most topics were different; for instance, topics containing
English words, or terms related to international treaties. This shows that topics
found using topic analysis give insights into the data and offer alternative classification
schemes.</p>
        <sec id="sec-3-2-1">
          <title>Topic Models and Their Evaluation</title>
<p>To evaluate the performance of different topic analysis algorithms, we built topic
models using the LDA, HDP and LSI algorithms, and visualized the topics found using
the gensim visualization tool. The HDP algorithm does not need to be given the number of
topics, as it is able to determine it automatically. In this case it found 150
topics, which is the maximum allowed by the gensim implementation. The first topic (see
Fig. 5, left) contains 80% of the tokens (words), the second topic 16% of the tokens, the
third 2.5% of the tokens, and the rest of the topics contain less than 1.5% of the tokens each.</p>
          <p>The LDA model, in comparison with the HDP model, is more balanced – the largest
topic contains 11.7% of tokens and the smallest one contains 1.7% of tokens. It was
not possible to visualize the LSI model, as it contains negative weights for terms,
which are not supported by gensim. Therefore, models were evaluated using
coherence metrics proposed by Röder et al. [15]. The results are shown in Table 3.</p>
<p>Table 3. Coherence measure (u_mass) for each topic model.
HDP: -7.906688044302112
LDA (20 topics): -7.7222343265180715
LSI: -9.482837750182188</p>
          <p>In Table 3, the LDA 20-topic model has the highest coherence measure among the
three models.</p>
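<p>The u_mass coherence of a topic's top words can be sketched from document co-occurrence counts alone; this is a simplified version of the measure evaluated in [15], and the toy corpus and topic word lists are assumed:</p>

```python
import math

# Assumed toy corpus: each document as a set of words.
docs = [
    {"tax", "euro", "property"},
    {"tax", "euro", "court"},
    {"tax", "property", "cadaster"},
    {"court", "person", "law"},
]

def doc_count(*words):
    # Number of documents containing all the given words.
    return sum(1 for d in docs if all(w in d for w in words))

def u_mass(topic_words):
    # UMass coherence: for the ranked top words w1..wN of a topic, sum
    # log((D(wi, wj) + 1) / D(wj)) over pairs j < i, where D counts
    # (co-)occurrence in documents; values closer to 0 are better.
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((doc_count(wi, wj) + 1) / doc_count(wj))
    return score

coherent = u_mass(["tax", "euro", "property"])   # words that co-occur
incoherent = u_mass(["tax", "law", "cadaster"])  # words that rarely co-occur
print(coherent > incoherent)  # True
```

<p>Averaging this score over a model's topics gives a single number comparable across models, as in Table 3.</p>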
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
<p>We explored the application of different topic analysis algorithms and stop word
filtering approaches to a corpus of legal texts in the Latvian language.
Domain-adapted stop word filtering improves the topic models produced by the LDA
algorithm, yielding more expressive topics that allow documents to be separated into more
distinctive groups. The stop word filtering methods explored in this work are applicable
to other corpora and languages, provided the corpus contains multiple documents.
Compared to LSI and HDP, the LDA algorithm produced a topic model that
performed better than the alternatives. However, for the topics generated by
LDA to be of practical value, some further fine-tuning needs to be done, as, currently,
topics from different dimensions are mixed – for instance, there was a topic for
documents in English and a topic for documents that are international treaties, and
both topics encompass documents about different themes such as civil
rights and taxes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          Lect. Notes Bus.
          <source>Inf. Process. 113 LNBIP</source>
          ,
          <fpage>241</fpage>
          -
          <lpage>254</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>T.L.L</surname>
          </string-name>
          .S. of Washington, Types of Legislative Documents. (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Allahyari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouriyeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Assefi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Safaei</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trippe</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kochut</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
<article-title>A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques</article-title>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Brien</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An analysis of topic modelling for legislative texts</article-title>
          .
          <source>CEUR Workshop Proc. 2143</source>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Wyner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mochales-Palau</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milward</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Approaches to text mining arguments from legal cases</article-title>
          .
          <source>Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics)</source>
          .
          <source>6036 LNAI</source>
          ,
          <fpage>60</fpage>
          -
          <lpage>79</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Soria</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartolini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montemagni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pirrelli</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Automatic extraction of semantics in law documents</article-title>
          .
          <source>Proc. V Legis. XML Work. (February</source>
          <year>2007</year>
          ).
          <fpage>253</fpage>
          -
          <lpage>266</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Mehdiyev</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sodhi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acharya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rana</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          :
          <article-title>Topic subject creation using unsupervised learning for topic modeling</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paisley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.:</given-names>
          </string-name>
          <article-title>Online variational inference for the hierarchical Dirichlet process</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>15</volume>
          ,
          <fpage>752</fpage>
          -
          <lpage>760</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Chuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Heer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Termite: Visualization techniques for assessing textual topic models</article-title>
          .
          <source>Proc. Work. Adv. Vis. Interfaces AVI</source>
          .
          <fpage>74</fpage>
          -
          <lpage>77</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Garkaje</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zilgalve</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Dargis</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Normalization and Automatized Sentiment Analysis of Contemporary Online Latvian Language. Front. Artif. Intell. Appl</source>
          .
          <volume>268</volume>
          ,
          <fpage>83</fpage>
          -
          <lpage>86</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>WSDM 2015 - Proc. 8th ACM Int. Conf. Web Search Data Min</source>
          .
          <fpage>399</fpage>
          -
          <lpage>408</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>