<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Use of Topic Analysis in Latvian Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rinalds Vīksna</string-name>
          <email>rinaldsviksna@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marite Kirikova</string-name>
          <email>Marite.Kirikova@rtu.lv</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daiga Kiopa</string-name>
          <email>daiga@lursoft.lv</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence and Systems Engineering, Riga Technical University</institution>
          ,
          <country country="LV">Latvia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lursoft</institution>
          ,
          <country country="LV">Latvia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The large number of legislative documents produced every day makes it difficult to follow each and every document. However, it is important for enterprises to comply with all current legislative acts. In this paper we demonstrate the application of different topic analysis algorithms and stop word filtering approaches to the corpus of legal texts of the Republic of Latvia. This is done for the purpose of supporting the discovery of expressive and meaningful legal topics and marking respective documents according to those topics. Topic models produced in this work are intended to be used as an aid for experts, enabling faster document browsing possibilities.</p>
      </abstract>
      <kwd-group>
<kwd>Topic Analysis</kwd>
        <kwd>Legal Analysis</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Every enterprise must conform to and comply with current regulatory acts. Moreover,
some types of regulations may be used as blueprints for business process models [1].
Legislative documents may be laws issued by parliament, regulations issued by the
Cabinet of Ministers, Municipalities or other institutions, as well as industry
standards, various contracts and other documents [2]. Many regulations are related to
others, being either an update of an earlier regulation or depending on or being
implemented by other regulations. Keeping track of the changing regulatory environment
requires significant time and effort.</p>
<p>In this paper we envision a solution that may help to save effort by providing an
overview and summarization of various topics within the Latvian law domain. The goal of this
paper is to explore the application of different topic analysis algorithms and stop word
filtering approaches on the corpus of legal texts of the Republic of Latvia. For the
demonstration we use three common topic analysis algorithms, briefly introduced in
Section 2.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. COUrT - CAiSE for Legal Documents, June 9, 2020, Virtual Workshop.</p>
      <p>The paper presents the research in progress that is a part of more extensive
research activity, the aim of which is to find core topics in Latvian legislation, as well
as to identify, for further exploration, a method for automated document tagging with
salient topics.</p>
      <p>The paper is organized as follows. Section 2 discusses the problem domain and
available topic analysis algorithms. Section 3 shows data preparation steps and the
results of stop word removal. Section 4 discusses the results obtained and Section 5
provides brief conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work and Background</title>
      <p>Topic analysis (often called “topic modeling” or “topic detection”) is a text-mining
[3] technique for soft clustering (where each document has a probability distribution
over all the clusters) of documents according to distribution of terms that occur in the
text body. O’Neill et al. [4] used topic analysis to summarize and visualize British
legislation to find useful topics and terms. Wyner et al. applied topic analysis to
profile and extract arguments from legal cases [5]. Soria et al. applied topic analysis to
annotate each paragraph in Italian law texts with semantic information [6]. Sulea et al.
explore the use of text classification methods in [7]; however, text classification
methods require that documents have known labels. The results may
differ depending on the language used. In this work we address regulatory (legal)
documents in Latvian with the purpose of supporting legal document handling
activities by experts.</p>
<p>Topic analysis is an unsupervised learning method that produces a number of
topics, each of which consists of related terms and their respective weights. Topic
analysis is most often done using Latent Semantic Indexing (LSI), Latent Dirichlet
Allocation (LDA) [8], or its variant, the Hierarchical Dirichlet Process (HDP)
[9], [3]. These algorithms are briefly described below and, being the most popular
ones, were used in the experiments reported in Section 4.</p>
      <sec id="sec-2-1">
        <title>Latent Semantic Indexing</title>
        <p>Latent Semantic Analysis, also called LSI, is a method for extracting and representing
the contextual usage meaning of words by statistical computations applied to a large
corpus of text. The text corpus is viewed as a set of term tf-idf weights, where tf is a
term frequency in the given text, and idf is an inverse document frequency. To this
term-document matrix singular value decomposition (SVD) is applied. In SVD a
rectangular matrix is decomposed into the product of three other matrices in order to find
a lower-rank approximation of the term-document matrix [10]. LSI is implemented
in the gensim Python library (https://radimrehurek.com/gensim/).</p>
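<p>As an illustration of the steps above (tf-idf weighting followed by a truncated SVD), the decomposition can be sketched with numpy alone; the 4-term, 3-document count matrix and the rank k = 2 are illustrative assumptions, not values from our corpus:</p>

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents,
# entries = raw term counts (assumed values, for illustration only).
counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 0, 2],
    [0, 1, 0],
], dtype=float)

# tf-idf weighting: tf is the in-document relative frequency, idf
# penalizes terms that occur in many documents.
tf = counts / counts.sum(axis=0, keepdims=True)
df = (counts > 0).sum(axis=1, keepdims=True)   # document frequency per term
idf = np.log(counts.shape[1] / df)
tfidf = tf * idf

# SVD decomposes the matrix into U * diag(s) * Vt; keeping only the k
# largest singular values gives the rank-k "latent semantic" approximation.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(approx.round(2))
```

<p>Columns of Vt[:k, :] then serve as k-dimensional document representations for similarity comparison.</p>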
      </sec>
      <sec id="sec-2-2">
        <title>Latent Dirichlet Allocation</title>
        <p>Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The
basic idea is that documents are represented as random mixtures over latent topics
where each topic is characterized by a distribution over words. The LDA algorithm is
described in [11]; it is implemented in the scikit-learn Python library (https://scikit-learn.org/stable/) and in gensim.
Term distinctiveness and saliency are used to evaluate the generated topics. For a given
word w, the unconditional probability P(w) and the probability P(T|w) that the given word
was generated by latent topic T are computed. The probability P(T) that a randomly selected
word w' was generated by topic T is also computed. The distinctiveness of word w is then
calculated as follows [12]:</p>
        <p>distinctiveness(w) = Σ_T P(T|w) log( P(T|w) / P(T) )   (1)</p>
        <p>Equation (1) describes how informative word w is for determining
topic T. If a word occurs in all topics, observing the word tells little about the
document’s topic, and the word has little distinctiveness. The saliency of a word is defined
as [12]:</p>
        <p>saliency(w) = P(w) ∗ distinctiveness(w)   (2)</p>
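<p>Equations (1) and (2) can be computed directly from a fitted model's distributions. The following pure-Python sketch uses an assumed toy model with two topics and three words to show that a word spread evenly across topics has zero distinctiveness:</p>

```python
import math

# Assumed toy model: topic prior P(T) and per-topic word distributions P(w|T).
p_topic = {"T1": 0.5, "T2": 0.5}
p_word_given_topic = {
    "T1": {"tax": 0.7, "court": 0.1, "the": 0.2},
    "T2": {"tax": 0.1, "court": 0.7, "the": 0.2},
}

def p_word(w):
    # Unconditional probability P(w) = sum over T of P(w|T) * P(T).
    return sum(p_word_given_topic[t][w] * p_topic[t] for t in p_topic)

def distinctiveness(w):
    # Eq. (1): KL divergence of P(T|w) from the topic prior P(T).
    d = 0.0
    for t in p_topic:
        p_t_given_w = p_word_given_topic[t][w] * p_topic[t] / p_word(w)
        if p_t_given_w > 0:
            d += p_t_given_w * math.log(p_t_given_w / p_topic[t])
    return d

def saliency(w):
    # Eq. (2): saliency(w) = P(w) * distinctiveness(w).
    return p_word(w) * distinctiveness(w)

# "the" occurs equally in both topics, so observing it tells us nothing:
print(distinctiveness("the"))        # 0.0
print(saliency("tax") > saliency("the"))  # True
```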
      </sec>
      <sec id="sec-2-3">
        <title>Hierarchical Dirichlet Process</title>
        <p>Hierarchical Dirichlet Process (HDP) is a Bayesian nonparametric model for
unsupervised analysis of grouped data. Documents are viewed as bags of words, which are
drawn from a number of latent clusters or "topics", where "a topic" is modeled as a
multinomial probability distribution on words from some basic vocabulary. Given a
collection of documents, HDP finds latent clusters, without the need to specify the
number of topics as a parameter [9]. HDP analysis requires multiple passes through
all the data and is therefore poorly suited for massive and streaming data. Wang et al.
proposed online variational inference for HDP, which requires only one pass through
the data and is significantly faster [9]; it is implemented in the gensim library.</p>
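<p>HDP's ability to grow the number of topics with the data comes from its Dirichlet process prior. The "Chinese restaurant process" view of that prior can be sketched in a few lines; this illustrates the prior only, not the variational inference used by gensim, and the customer count and concentration alpha are arbitrary:</p>

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample a partition of n_customers "words" into an open-ended
    number of "topics" (tables) under a Dirichlet process prior."""
    rng = random.Random(seed)
    tables = []  # tables[t] = number of customers seated at table t
    for i in range(n_customers):
        # Customer i joins table t with probability n_t / (i + alpha),
        # or opens a new table with probability alpha / (i + alpha).
        r = rng.random() * (i + alpha)
        acc = 0.0
        for t, n_t in enumerate(tables):
            acc += n_t
            if r < acc:
                tables[t] += 1
                break
        else:
            tables.append(1)  # new table: the model grew a new topic

    return tables

tables = chinese_restaurant_process(1000, alpha=5.0)
print(len(tables))  # the number of topics was not fixed in advance
```

<p>Larger alpha yields more tables, which is why HDP needs only an upper bound (150 topics in gensim) rather than an exact topic count.</p>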
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Corpus and Data Preparation and Analysis</title>
      <p>In this paper, we use the corpus of legal acts from http://likumi.lv/ - a website of legal
acts that ensures free access to systematized (consolidated) legal acts of the Republic
of Latvia. Documents were downloaded as HTML documents which were kept for
later metadata extraction (document information – issuer, status, adoption,
end-of-validity date, related documents, etc.). Text contents of the downloaded HTML
documents were extracted and saved as plain text in UTF-8 format. Documents that
contained mostly Russian or English text were dropped from the corpus.</p>
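<p>The extraction step can be sketched with the Python standard library's html.parser; the set of tags to skip and the tiny input document are illustrative assumptions:</p>

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, skipping script/style."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(html_to_text("<h1>Likums</h1><script>var x;</script><p>1. pants.</p>"))
# → Likums 1. pants.
```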
<sec id="sec-3-1">
        <p>In total, over 50,000 documents in the Latvian language were collected. With these
documents, the experiments regarding stop word removal and different topic models were
made.</p>
<p>In the remainder of this section, we first perform exploratory data analysis and assess
the impact of stop word removal on the performance of clustering algorithms, with
LDA as an example (sub-Section 3.1); in the second part (sub-Section 3.2) we
explore alternative clustering algorithms.</p>
        <sec id="sec-3-1-1">
          <title>Experiments with Stop Word Removal Approaches</title>
<p>During text preprocessing, boilerplate content (irrelevant text, ads) and generic
Latvian stop words identified by Garkaje et al. [13] were removed. After this step, the 10 most
common words in the corpus were identified (Fig. 1).</p>
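<p>The frequency check can be sketched with a standard Counter; the token stream below is an assumed stand-in for the preprocessed corpus:</p>

```python
from collections import Counter

# Assumed toy token stream standing in for the preprocessed corpus.
tokens = ("valsts likums spēkā valsts pants likums valsts "
          "spēkā nodoklis pants valsts likums").split()

counts = Counter(tokens)
for word, n in counts.most_common(3):
    print(word, n)
# "valsts" dominates here, mirroring the corpus-wide "state" in Fig. 1
```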
<p>As we see in Fig. 1, the most common words (“state”, “latvian”, “in force”, etc.) occur
multiple times in most documents in the corpus and are not very informative, so in
this context those words are stop words and were removed. To find stop words
specific to this domain, a normalized tf-idf metric was used [14]. The tf-idf metric was
calculated for each word as given in [14]:
tf-idf(k) = tf_norm ∗ idf(k),   tf_norm = −log(TF / U),   idf(k) = log(N(doc) / N(k))   (3)
where TF – term frequency, is the number of times a certain word appears in this
corpus; N(doc) – number of documents in the corpus; N(k) – number of documents
containing term k, and U – the total number of words in the corpus. The tf-idf score was
calculated for each word in the corpus, and 140 words with a tf-idf score of less than 9 were
selected as stop words. This custom stop word list was combined with the general Latvian
stop word list from Garkaje et al. [13], 450 stop words in total. After additional stop word
filtering, documents contained relatively more informative words (see Fig. 2).</p>
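<p>Equation (3) can be sketched in pure Python; the three-document toy corpus and the threshold value are illustrative assumptions (the paper's threshold of 9 applies to the full 50,000-document corpus):</p>

```python
import math

# Assumed toy corpus: each document is a list of tokens.
docs = [
    ["likums", "nosaka", "nodokli"],
    ["likums", "groza", "pantu"],
    ["likums", "nosaka", "kartību"],
]

U = sum(len(d) for d in docs)   # total number of words in the corpus
n_docs = len(docs)              # N(doc)

def tf_idf(word):
    # Eq. (3): tf-idf = tf_norm * idf, with tf_norm = -log(TF / U)
    # and idf = log(N(doc) / N(k)).
    tf = sum(d.count(word) for d in docs)      # corpus-wide frequency TF
    n_k = sum(1 for d in docs if word in d)    # documents containing the word
    return -math.log(tf / U) * math.log(n_docs / n_k)

# Words scoring below a threshold are candidate domain-specific stop words.
threshold = 2.0  # illustrative for this tiny corpus
stop_candidates = sorted(w for w in {t for d in docs for t in d}
                         if tf_idf(w) < threshold)
print(stop_candidates)
```

<p>Words present in every document (here “likums”) get idf = 0 and thus the lowest possible score, which is the desired stop-word behaviour.</p>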
<p>To evaluate topics created using different stop word selections, we used preassigned
theme labels given in likumi.lv (https://likumi.lv/ta/tema). Documents belonging to the same theme
should have similar content or describe similar topics and questions. It should be
noted that each document in likumi.lv may belong to more than one theme. We used
documents from 3 themes: “human rights”, “banks, finance, budget” and “taxes and
fees”. Two topic models were created (one using the generic stop word set and another
using the adapted stop word set), by which the selected documents were then classified as
belonging to particular topics. Document distribution by topics is shown in Fig. 3 and
Fig. 4.</p>
          <p>To assess the impact of domain-specific stop words removal, we implemented the
LDA model using a general list of stop words (Fig. 3) and then compared it with the
model generated using a domain-specific list of stop words (Fig. 4). One topic is
found by both models: Topic 9 in Fig. 3 and topic 15 in Fig. 4 represent a document
with significant English content – it is indicated by keywords in the English language.
Other topics, although different, display some similarities (Table 1 and Table 2).</p>
          <p>Fig. 3. Document distribution by topic using the generic stop word list (series: bankas-finanses-budzets, cilvektiesibas, nodokli-un-nodevas; horizontal axis: topic number (0–20); vertical axis: documents in topic, %).</p>
<p>Table 1 (excerpt). Keywords translated into English: Latvian, republic, councils, european.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
<p>As we see in Table 1, many keywords are present in multiple topics and are not
representative (“year”, “in law”). Topics generated using the second model (Table 2)
contain representative words (e.g. “cadaster”, “convention”), which indicate that a
topic is about real estate (“cadaster”) or international treaties (“convention”).</p>
        <p>Fig. 4. Document distribution by topic using the adapted stop word list (series: bankas-finanses-budzets, cilvektiesibas, nodokli-un-nodevas; vertical axis: documents in topic, %).</p>
        <p>Table 2. Topic keywords produced with the adapted stop word list (Latvian keywords, with English translations in parentheses).
Topic # 3: valstī, līgumslēdzējas, nodokļiem, state, līgumslēdzējā (in the country, contracting, taxes, state, contracting).
Topic # 5: likumu, persona, daļā, tiesas, redakcijā (law, a person, part, courts, version).
Topic # 9: kadastra, nekustamā, eiro, nodokļa, īpašumu (cadaster, real, euro, tax, property).
Topic # 10: atbalsta, izmaksas, programmas, ietvaros, sadarbības (supports, costs, programs, within, cooperation).
Topic # 12: kapitāla, ieguldījumu, tirgus, pārskata, apdrošināšanas (capital, investment, market, review, insurance).
Topic # 14: vēlēšanu, domes, pilsētas, komisija, pārvaldes dienesta (election, city council, cities, commission, administration, service).
Topic # 15: or, shall, be, for, by, article (or, shall, be, for, by, article).
Topic # 16: puses, līgumslēdzējas, puse, konvencijas, teritorijā (sides, contracting, sides, convention, territory).</p>
        <p>The LDA model using the adapted stop word list created more meaningful topics, as they
contained more representative words, which tell us more about their content. Furthermore,
as most of the more popular words were labeled as stop words, the model was able to
classify more tax-related documents into very expressive topics 3, 5, 9 and 12 in Fig. 4,
while the model with only generic stop word filtering applied created broader
topics, such as # 4, 6 and 12 in Fig. 3. Both models classified documents into multiple
topics, some of which corresponded to the themes assigned to those documents in
likumi.lv. However, most topics were different; for instance, topics containing
English words, or terms related to international treaties. This shows that topics
found using topic analysis give insights into the data and offer alternative classification
schemes.</p>
        <sec id="sec-3-2-1">
          <title>Topic Models and Their Evaluation</title>
<p>To evaluate the performance of different topic analysis algorithms, we built topic
models using the LDA, HDP and LSI algorithms, and visualized the topics found using
the gensim visualization tool. The HDP algorithm does not need to be given the number of
topics, as it is able to determine it automatically. In this case it found 150
topics, which is the maximum allowed by the gensim implementation. The first topic (see
Fig. 5, left) contains 80% of the tokens (words), the second topic 16% of the tokens, the
third 2.5% of the tokens, and the rest of the topics contain less than 1.5% of the tokens each.</p>
          <p>The LDA model, in comparison with the HDP model, is more balanced – the largest
topic contains 11.7% of tokens and the smallest one contains 1.7% of tokens. It was
not possible to visualize the LSI model, as it contains negative weights for terms,
which are not supported by gensim. Therefore, models were evaluated using
coherence metrics proposed by Röder et al. [15]. The results are shown in Table 3.</p>
<p>Table 3. Coherence measure (u_mass) for each topic model.
HDP: -7.906688044302112
LDA (20 topics): -7.7222343265180715
LSI: -9.482837750182188</p>
          <p>In Table 3, the LDA 20-topic model has the highest coherence measure among the
three models.</p>
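<p>The u_mass coherence of a topic's top words can be sketched from document co-occurrence counts alone; this is a simplified version of the measure evaluated in [15], and the toy corpus and topic word lists are assumed:</p>

```python
import math

# Assumed toy corpus: each document as a set of words.
docs = [
    {"tax", "euro", "property"},
    {"tax", "euro", "court"},
    {"tax", "property", "cadaster"},
    {"court", "person", "law"},
]

def doc_count(*words):
    # Number of documents containing all the given words.
    return sum(1 for d in docs if all(w in d for w in words))

def u_mass(topic_words):
    # UMass coherence: for the ranked top words w1..wN of a topic, sum
    # log((D(wi, wj) + 1) / D(wj)) over pairs j < i, where D counts
    # (co-)occurrence in documents; values closer to 0 are better.
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((doc_count(wi, wj) + 1) / doc_count(wj))
    return score

coherent = u_mass(["tax", "euro", "property"])   # words that co-occur
incoherent = u_mass(["tax", "law", "cadaster"])  # words that rarely co-occur
print(coherent > incoherent)  # True
```

<p>Averaging this score over a model's topics gives a single number comparable across models, as in Table 3.</p>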
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
<p>We explored the application of different topic analysis algorithms and stop word
filtering approaches to a corpus of legal texts in the Latvian language.
Domain-adapted stop word filtering improves the topic models produced by the LDA
algorithm, yielding more expressive topics that allow documents to be separated into more
distinctive groups. The stop word filtering methods explored in this work are applicable
to other corpora and languages, provided the corpus contains multiple documents.
Compared to LSI and HDP, the LDA algorithm produced a topic model that
performed better than the alternatives. However, for the topics generated by
LDA to be of practical value, some further fine-tuning needs to be done, as, currently,
topics from different dimensions are mixed – for instance, there was a topic for
documents in English and a topic for documents that are international treaties, and
both topics encompass documents about different themes such as civil
rights and taxes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          Lect. Notes Bus.
          <source>Inf. Process. 113 LNBIP</source>
          ,
          <fpage>241</fpage>
          -
          <lpage>254</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>T.L.L</surname>
          </string-name>
          .S. of Washington, Types of Legislative Documents. (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Allahyari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouriyeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Assefi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Safaei</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trippe</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kochut</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
<article-title>A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques</article-title>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Brien</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An analysis of topic modelling for legislative texts</article-title>
          .
          <source>CEUR Workshop Proc. 2143</source>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Wyner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mochales-Palau</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milward</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Approaches to text mining arguments from legal cases</article-title>
          .
          <source>Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics)</source>
          .
          <source>6036 LNAI</source>
          ,
          <fpage>60</fpage>
          -
          <lpage>79</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Soria</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartolini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montemagni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pirrelli</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Automatic extraction of semantics in law documents</article-title>
          .
          <source>Proc. V Legis. XML Work. (February</source>
          <year>2007</year>
          ).
          <fpage>253</fpage>
          -
          <lpage>266</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Mehdiyev</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sodhi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acharya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rana</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          :
          <article-title>Topic subject creation using unsupervised learning for topic modeling</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paisley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.:</given-names>
          </string-name>
          <article-title>Online variational inference for the hierarchical Dirichlet process</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>15</volume>
          ,
          <fpage>752</fpage>
          -
          <lpage>760</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Chuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Heer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Termite: Visualization techniques for assessing textual topic models</article-title>
          .
          <source>Proc. Work. Adv. Vis. Interfaces AVI</source>
          .
          <fpage>74</fpage>
          -
          <lpage>77</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Garkaje</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zilgalve</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Dargis</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Normalization and Automatized Sentiment Analysis of Contemporary Online Latvian Language. Front. Artif. Intell. Appl</source>
          .
          <volume>268</volume>
          ,
          <fpage>83</fpage>
          -
          <lpage>86</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>WSDM 2015 - Proc. 8th ACM Int. Conf. Web Search Data Min</source>
          .
          <fpage>399</fpage>
          -
          <lpage>408</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>