=Paper= {{Paper |id=Vol-3226/paper16 |storemode=property |title=Keyphrase extraction from Slovak court decisions |pdfUrl=https://ceur-ws.org/Vol-3226/paper16.pdf |volume=Vol-3226 |authors=Dávid Varga,Šimon Horvát,Zoltán Szoplák,Ľubomír Antoni,Stanislav Krajči,Peter Gurský,Laura Bachňáková Rózenfeldová |dblpUrl=https://dblp.org/rec/conf/itat/VargaHSAKGR22 }} ==Keyphrase extraction from Slovak court decisions== https://ceur-ws.org/Vol-3226/paper16.pdf
Keyphrase extraction from Slovak court decisions
Dávid Varga, Šimon Horvát, Zoltán Szoplák, Ľubomír Antoni, Stanislav Krajči, Peter Gurský
and Laura Bachňáková Rózenfeldová
Pavol Jozef Šafárik University in Košice, Faculty of Science, Institute of Computer Science, Jesenná 5, 040 01 Košice, Slovakia


Abstract
Keyphrase extraction is a vital subtask of text summarization and comparison, through which we can obtain the most relevant set of words and phrases that describe the content of a given document. In this paper we test multiple approaches to unsupervised keyword extraction on a set of court decisions. These approaches are TF-IDF, YAKE! and a graph-based weighted PageRank algorithm. We combine these algorithms with a dictionary-based word embedding method in order to capture the semantic relationships between the potential keyphrases. Extracted keyphrases can be used for semantic indexing of court decisions, which can help with finding decisions with similar content.

Keywords
keyphrase, keyword, extraction, legal text, word network, embedding, court decision



ITAT'22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
david.varga@student.upjs.sk (D. Varga); simon.horvat@student.upjs.sk (Š. Horvát); zoltan.szoplak@student.upjs.sk (Z. Szoplák); lubomir.antoni@upjs.sk (Ľ. Antoni); stanislav.krajci@upjs.sk (S. Krajči); peter.gursky@upjs.sk (P. Gurský); laura.rozenfeldova@upjs.sk (L. B. Rózenfeldová)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

In their decision-making, judges need to ensure the consistency of decisions with the standard practice of courts. Getting an overview of similar relevant court decisions is a time-consuming process. Currently available tools have limited options for filtering the set of all decisions, often resulting in an extensive collection of documents. In the Slovak court system, only the Supreme Court has an analytical department with the human resources to create overviews of relevant court decisions for judges. With a vast number of court cases, ordinary judges often do not have the time and resources to get to all relevant documents, which can cause essential decisions to be overlooked. The analytical department of the Supreme Court manually creates metadata for all Supreme Court decisions, including keyphrases, to speed up the overview-making process, especially by narrowing the search results down to a reasonable size. Automatic keyphrase extraction can help with manual annotation by providing hints, thus making the annotation process semi-automatic and faster. This increases the number of court decision annotations that can be used for searching and filtering.

In the field of natural language processing, automatic keyphrase extraction can be used as a form of text summarization. Manually extracting keyphrases consists of reading the whole document, understanding its content and selecting the phrases used, or generating phrases that aptly describe the document. Manual extraction of keyphrases from long texts or from a large number of texts is time-consuming and demanding on human resources. These are the reasons why it is appropriate to automate this process. The process of automated extraction of keyphrases consists of selecting candidate phrases from a document or external source, which are evaluated according to how well they describe the document. An evaluation algorithm is used to evaluate the candidate phrases; it calculates the score according to statistics, semantics, or both at the same time. The candidate phrases with the highest score are then selected as keyphrases.

Keyphrase extraction algorithms are divided into two main groups, supervised and unsupervised algorithms. We can train supervised algorithms on a labeled dataset, and the resulting models often achieve high accuracy [1]. If a dataset that is labeled with keyphrases is not available, it is advisable to use unsupervised algorithms. These types of algorithms usually use statistical metrics that take into account the number of occurrences of phrases, the co-occurrence of phrases, the position of phrases within the document and others. These algorithms are often combined with graph algorithms, word embeddings, or other language models.

In this article, we will focus on the extraction of keyphrases from Slovak court decisions. This dataset does not contain manually extracted keyphrases, therefore we decided to use a combination of unsupervised statistical and semantic approaches.

The objectives of this article are:

  • design and implementation of an algorithm for extracting keyphrases from Slovak court decisions;
  • evaluation of the results of extracted keyphrases on a set of court decisions.
This article is organized into four sections. In Section 2, we describe the related approaches to automated keyphrase extraction and other works related to legal document processing. In Section 3, we propose multiple algorithms to extract keyphrases from Slovak court decisions. Finally, we analyze the results of the algorithms in Section 4.

2. Related works

A lot of research has been done on applying NLP techniques to law texts and court decisions. NLP techniques are used in different tasks, for example: predicting the outcomes of court decisions [2, 3, 4], searching for insufficiently reasoned court decisions [5], creating electronic versions of court decisions [6] or creating a collection of datasets for evaluating performance across different legal text understanding tasks [7].

An international voluntary association called the Free Access to Law Movement (FALM) [8] was founded in 1992 and has more than 60 member organisations from around the globe. FALM members provide free access to legal information, group legal documents into one place and analyse law texts. FALM member CanLII [9] uses software to process Canadian court decisions. CanLII creates links to articles that are used in court decisions and to other court decisions used as citations. This software also creates a short description of the court decision and selects keyphrases. These can be used to save time and effort for legal experts such as judges and lawyers.

Algorithms used for legal text summarization are summed up in a survey paper [10]. However, in this section, we focus specifically on keyphrase extraction using statistical approaches and unsupervised algorithms. We also summarize the principles of selecting appropriate keyphrases based on observations.

The simplest approach to selecting keyphrases is to count the n-grams in the text and select the most common n-grams [11]. This approach is also called Bag of Words or BoW and does not take into account synonyms, grammar or the meaning of individual n-grams. The downside of using a BoW approach is that it does not select keyphrases that are characteristic of the text but at the same time occur rarely in it.

A significant improvement over the BoW method is TF-IDF [12]. TF-IDF takes into account the whole corpus and penalizes phrases that occur in many documents. It is often used as a baseline method or in one of the steps of an algorithm, for example KP-Miner [13] or Liu's clustering algorithm [14]. We will describe TF-IDF in more detail in the next section.

Three desirable properties of keyphrases are described in [14]:

  • Understandable. Keyphrases should be easy to understand.
  • Relevant. Keyphrases should relate to the main topic of the document.
  • Good coverage. Keyphrases should cover all parts of the document appropriately.

Liu's clustering algorithm [14] was created according to these properties and used statistical, semantic and clustering methods simultaneously. The first step of the algorithm was to search for candidate words. From these, keyphrases of several words are composed in the later steps of the algorithm. Subsequently, semantic closeness scores were calculated for the candidate words, according to their common occurrences within a fixed-length window and also according to an external source, Wikipedia. For each word, they created an embedding, where at each index of the vector, a value representing the relationship between the word and a specific article from Wikipedia was calculated using TF-IDF. Candidate words were clustered according to semantic closeness, which grouped semantically similar words into individual clusters. Subsequently, exemplary words representing the entire cluster were selected from individual clusters, which had to be extended to phrases composed of several words. The keyphrases for the document were selected so that the algorithm processed all the words of the document, and if a word was a noun that was also an exemplary word, then the word was added to the list of keyphrases along with the adjectives in its neighbourhood in the original text.

One of the latest language-independent unsupervised keyphrase extraction algorithms is YAKE! [15]. It uses statistical information, such as word counts and word occurrences, to identify keyphrases in unstructured texts. Its great advantage is that it only works with the current document during extraction, so it is not necessary to have the whole corpus of similar texts or other text sources available. The algorithm consists of five steps: (1) preprocessing the document into a machine-readable format, which results in tagged individual words; (2) for each word, a representation is created consisting of a set of properties evaluated by statistical measurements; (3) the individual properties of the words are heuristically combined into one score, which represents the importance of the word; (4) generating n-grams from candidate words and assigning a degree of relevance; (5) deduplication of keyphrases that are too similar and ranking by relevance.

Another approach to extracting keyphrases is to use graphs and graph algorithms. The text may be represented by a graph such that the vertices of the graph are candidate phrases and the edges represent the relationship between these phrases. Subsequently, a value for each vertex of the graph is assigned using the selected
evaluation function, and the edges and their weights are used to calculate this value. Thus, the individual methods differ in the use of different types of graphs and evaluation functions. One of the first algorithms to extract keywords from a text that uses a graph is TextRank [16], which has inspired a number of other graph-based algorithms. Its evaluation function calculates the values for the vertices of the directed graph recursively, and the information at the input of this function is global, i.e., in each step it comes from the whole graph. The evaluation function used is the PageRank [17] algorithm, which is iterative and takes the directed graph as its input. PageRank was originally designed for scoring web pages by importance on the web, but in TextRank it is used to assign scores to candidate keyphrases.

RAKE [18] is another graph-based unsupervised algorithm. It uses word frequency and word co-occurrence to create a graph and assign scores to phrases. It needs a list of stop-words and delimiters at the input, but it is able to identify interior stop-words in phrases.

3. Methods

3.1. Background knowledge

In this section we describe how we mine knowledge from sources other than the document from which we want to extract keyphrases. This background knowledge is used as a weighting mechanism in the methods described in the next section.

3.1.1. Term frequency – inverse document frequency.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This measure is the product of two metrics:

    1. term frequency expresses how many times a word appears in a document,
    2. the inverse document frequency expresses how unique a given word is to a document. It is based on the inverse of the word's document frequency across the set of all documents $D$:

       $$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

       where $|\{d \in D : t \in d\}|$ is the number of documents where the term $t$ appears.

So idf examines the frequency values in all documents to reduce the impact of frequent words.

3.1.2. Phrase network.

To use keyphrase extraction methods it is in our best interest to develop a vocabulary of potential keyphrases. Keyphrase extraction methods such as TF-IDF prioritize keyphrases that are unique to a specific document, which might not be suitable for the purposes of topic clustering. Using this vocabulary as a basis for creating phrase embeddings can help us semantically compare the keyphrases in order to facilitate better keyphrase selection.

Let $V$ be the set of all unigrams, bigrams and trigrams used in court decision documents¹. Relations among phrases in $V$ are denoted as the set $E$. We mine these relations mainly from the Slovak Law Thesaurus (SLT) [19] as follows:

    1. Let $\text{phrase}_i$ and $\text{phrase}_j$ be words or phrases defined in SLT. If $\text{phrase}_j$ occurs in the definition of $\text{phrase}_i$, we expand our set $E$ by the pair $(\text{phrase}_i, \text{phrase}_j)$.
    2. Let $\text{phrase}_i$ be a word or phrase defined in SLT. Let $\{\text{phrase}_1, \text{phrase}_2, \dots, \text{phrase}_j\} \subseteq V$ be the words used in the definition of $\text{phrase}_i$, but without a definition in SLT. We add the pairs $(\text{phrase}_i, \text{phrase}_1), (\text{phrase}_i, \text{phrase}_2), \dots, (\text{phrase}_i, \text{phrase}_j)$ to the set $E$. Not all words appearing in the definition are related to the defined phrase, therefore we weigh these relations with the TF-IDF used in our global weight function. The set of documents $D$ used for the IDF calculation is the set of all definitions from SLT.
    3. Let $\text{phrase}_i \in V$ be a phrase that is not found in SLT. We find the definitions of the individual words that make up the phrase in the Dictionary of Slovak language and continue as in the previous step. Let $\{\text{phrase}_1, \text{phrase}_2, \dots, \text{phrase}_j\} \subseteq V$ be the words used in the definition of $\text{phrase}_i$. We add the pairs $(\text{phrase}_i, \text{phrase}_1), (\text{phrase}_i, \text{phrase}_2), \dots, (\text{phrase}_i, \text{phrase}_j)$ to the set $E$. The set of documents $D$ used for the IDF calculation is the set of all definitions from the Dictionary of Slovak language.

¹ Vocabulary $V$ does not contain stop words.

Using this set of definitions, we model a network of legal phrases defined as follows:

Let $G = (V, E, \phi)$ be a directed weighted graph, where $\phi : E \to \mathbb{R}$ is the function

$$\phi(e) = \begin{cases} 1, & \text{if } e \text{ was obtained in step 1} \\ \text{tf-idf}(\text{phrase}_i, \text{phrase}_j), & \text{if } e \text{ was obtained in step 2 or 3} \end{cases}$$

such that $e = (\text{phrase}_i, \text{phrase}_j) \in E$, $\text{phrase}_i \in V$ is the defined phrase and $\text{phrase}_j \in V$ is a phrase occurring in the definition of $\text{phrase}_i$.
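The edge weighting of the phrase network can be illustrated with a short sketch. It is not the authors' implementation: the names `idf`, `build_edges` and `toy_definitions` are hypothetical, definitions are assumed to be already tokenized into vocabulary items, the natural logarithm is assumed for idf, and tf-idf$(\text{phrase}_i, \text{phrase}_j)$ is assumed to mean the count of $\text{phrase}_j$ in the definition of $\text{phrase}_i$ multiplied by its idf over all definitions. Only steps 1 and 2 are covered; step 3 would apply the same weighting to a second dictionary.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t, D) = log(|D| / |{d in D : t in d}|); natural log is an assumption."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def build_edges(definitions):
    """Build weighted edges from a thesaurus-like mapping.

    definitions: dict mapping a defined phrase to the token list of its
    definition (hypothetical input format).
    Returns a dict {(phrase_i, phrase_j): weight}.
    """
    docs = list(definitions.values())
    edges = {}
    for phrase_i, tokens in definitions.items():
        for phrase_j, tf in Counter(tokens).items():
            if phrase_j == phrase_i:
                continue  # ignore a phrase mentioned in its own definition
            if phrase_j in definitions:
                # Step 1: both phrases are defined, weight 1.
                edges[(phrase_i, phrase_j)] = 1.0
            else:
                # Step 2: undefined word, weight tf * idf over all definitions.
                edges[(phrase_i, phrase_j)] = tf * idf(phrase_j, docs)
    return edges


# Toy English data, invented purely for illustration.
toy_definitions = {
    "contract": ["agreement", "obligation", "parties"],
    "agreement": ["consent", "obligation", "parties"],
    "lease": ["contract", "rent", "rent"],
}
edges = build_edges(toy_definitions)
```

In this toy example the edge from "lease" to "contract" receives weight 1 because both phrases are defined, while the edge from "lease" to "rent" is weighted by tf-idf, so words that appear in many definitions contribute less.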
In the next step, we use the graph embedding techniques described in [20], which produce a semantic representation for each phrase from $V$. In our approach, we use the Node2Vec algorithm, described in [21], which is one of the graph embedding techniques based on a random walk. These vectors with semantic interpretation are used as background knowledge for the algorithms described below. A detailed description of the method for obtaining embeddings is given in [22].

Suppose we need an embedding for a phrase² consisting of more than one word. We compute it as an element-wise average of the embeddings of all words occurring in the phrase.

² We already have embeddings for phrases defined in SLT. Here we talk about phrases from $V$ (or unseen) that do not occur in any relation to $E$.

3.2. Weighted PageRank

In order to incorporate our vocabulary and embeddings, we can use a keyphrase selection method described in [23] in conjunction with our phrase embeddings.

First, we create an undirected weighted graph representing a given court decision, with each node corresponding to a phrase of the decision present in our vocabulary $V$. A pair of nodes $v_1$ and $v_2$, each representing a potential keyphrase, will be connected by an edge if they are located within a fixed-size sliding window. The weight of an edge represents the similarity between the potential keyphrases at its endpoints. This similarity is defined by two metrics. One of them is the dice coefficient, which measures the interlinkedness of the two phrases. It is calculated as twice the number of times the phrases appear in the decision as a tuple, divided by the sum of the individual frequencies of the phrases:

$$\mathrm{dice}(v_i, v_j) = \frac{2 \times \mathrm{freq}(v_i, v_j)}{\mathrm{freq}(v_i) + \mathrm{freq}(v_j)} \tag{1}$$

where $v_i$ and $v_j$ are vertices connected by an edge, $\mathrm{freq}(v_i)$ is the number of times the vertex $v_i$ appears in the document, and $\mathrm{freq}(v_i, v_j)$ is the number of times the vertices $v_i$ and $v_j$ form a tuple, in whichever order.

The second metric is inspired by Newton's law of universal gravitation. The frequencies of the phrases are used as the masses of the objects, and the distance is calculated as the cosine distance between the embeddings of the two phrases:

$$\mathrm{attr}(v_i, v_j) = \frac{\mathrm{freq}(v_i) \times \mathrm{freq}(v_j)}{d(v_i, v_j)^2} \tag{2}$$

where $d(v_i, v_j)$ is the cosine distance between the embeddings of phrases $v_i$ and $v_j$.

The weight of an edge is then calculated by combining the attraction force and the dice coefficient:

$$w_{ij} = \mathrm{attr}(v_i, v_j) \times \mathrm{dice}(v_i, v_j) \tag{3}$$

To extract keyphrases from the nodes of the graph, we make use of the weighted PageRank algorithm. The PageRank algorithm is an iterative algorithm that calculates a score for each node of the graph, with a higher score indicating higher suitability as a keyphrase. The weighted PageRank algorithm ranks a node according to the sum of the ranks of its adjacent nodes, as well as the weights of the edges that connect them.

Then, the PageRank score is calculated for each node of the graph recursively. The score at a given time step is calculated as:

$$P_t(v_i) = (1 - d) + d \times \sum_{v_j \in C(v_i)} \frac{w_{ij}}{\sum_{v_k \in C(v_j)} w_{jk}} P_{t-1}(v_j) \tag{4}$$

where $P_t(v_i)$ is the PageRank score for the node $v_i$ at time $t$, $C(v_i)$ is the set of nodes adjacent to node $v_i$, and $d$ is the damping factor.

The results obtained from the PageRank algorithm can then be used to determine the most likely keyphrase candidates, with a higher score representing a more suitable keyphrase.

The issue with using the weighted PageRank algorithm on its own is that it works only with a given document, which makes it useful in extracting keyphrases that describe the text itself, but not what differentiates it from other texts. Since the texts are judicial decisions, many court-centric phrases would hinder our ability to differentiate court decisions by topic. Therefore the score for each phrase we obtained from the weighted PageRank was multiplied by its IDF score, calculated from all available court decisions as described by the TF-IDF metric. Multiplying the PageRank score by the IDF should favor keywords that are not as frequent and would therefore probably not be court-centric and thus more relevant to the specific topic of that decision.

3.3. Autoencoders

Keyword extraction methods like TF-IDF penalize phrases that are frequent in many documents, but infrequent phrases are not necessarily semantically informative. The task of removing court-centric phrases would be better achieved by using some form of semantic comparison. Phrases that are semantically dissimilar to the meaning of the majority of phrases are more likely to be keyphrases that can be used to meaningfully cluster documents. To perform semantic comparisons, we can combine our phrase embeddings with the autoencoder method.

Autoencoders, described in detail in [24], are unsupervised neural networks that aim to create a representation
                                                                       4. Evaluation
                                                                       We have implemented two algorithms to serve as our
                                                                       baseline. The first is the regular TF-IDF metric used
                                                                       for keyword extraction, using all available court deci-
                                                                       sions to calculate the IDF value. This method is corpus-
                                                                       dependent, so other documents are taken into account.
                                                                       The second is the YAKE! algorithm [15], which takes
                                                                       into account only the current document. The algorithm
                                                                       described in 3.2 combines weighted PageRank with our
                                                                       phrase embeddings and multiplies the result by the IDF
                                                                       score of the TF-IDF metric. We will refer to this algorithm
                                                                       as WPR. The algorithm that multiplies regular TF-IDF
                                                                       score with cosine distance between and the algorithm
                                                                       described in 3.3 we labelled as AE.
Figure 1: The scheme of autoencoder

of data that selects only the most relevant parameters, which can be used to reconstruct the
original data. Autoencoders consist of two main parts: the encoder, which converts the input into
an encoding (usually of lower dimensionality than the input), and the decoder, which tries to
reconstruct the input from the encoding (Fig. 1). Using simple feedforward neural networks, the
encoding $h$ can be calculated as:

$$h = \omega(W x + b) \qquad (5)$$

where $x$ is the input, $\omega$ is the element-wise activation function, $W$ is a weight matrix
and $b$ is the bias. This encoding can then be used to obtain $x'$, the reconstruction of the
input. The reconstruction is calculated as:

$$x' = \omega'(W' h + b') \qquad (6)$$

where $\omega'$, $W'$ and $b'$ might be different from $\omega$, $W$ and $b$.

We have trained our autoencoder to reconstruct the embeddings of phrases of the vocabulary $V$,
described in 3.1.2 (see footnote 3). Because the vocabulary is made up primarily of phrases
relevant to court decisions, we can infer that the reconstruction performance will be better for
phrases explicitly related to court decisions. However, these phrases are detrimental to
topic-based differentiation. Therefore, by penalizing a high reconstruction success of a keyphrase,
we can filter out those that are not relevant to the topic of that court decision. In our case, we
multiplied the TF-IDF score of keyphrases by the cosine distance between the input embedding and
the reconstructed embedding from the autoencoder:

$$\mathrm{score}(v_i) = \text{tf-idf}(v_i) \cdot \cos(\mathrm{emb}(v_i), \mathrm{rec}(\mathrm{emb}(v_i))) \qquad (7)$$

Footnote 3: Link to lemmatized court decisions. https://bit.ly/3zUwbYA

Since we did not have access to extracted keyphrases of any court decisions, we chose five random
court decisions for manual and expert evaluation. We asked a legal expert to evaluate the results
in three ways:

    • creation of abstracts that offer a brief summary of the content of the decisions (see
      Tables 1 and 3),
    • manual extraction of keyphrases from the decisions using the dictionary of keyphrases used by
      the analytical department of the Supreme Court (see Tables 1 and 3),
    • the expert's opinion on the potential of the computed keyphrases to be included in the
      dictionary or to be used in any other way (see section 4.1).

We summarized the outputs of the algorithms in Tables 2 and 4, where the rows are documents and the
columns are algorithms. Each table cell consists of the top five keyphrases found by the given
algorithm for the given document.

We have compared the computed keyphrases with the abstracts and the manually extracted keyphrases.
The phrases that are present in the abstract are highlighted in yellow. If a keyphrase matches a
manually extracted keyphrase, it is highlighted by a black frame.

As we can see, the YAKE! algorithm provides many keyphrases that cannot be found in the abstracts
or among the manual keyphrases. This is because the chosen keyphrases are too long and closely tied
to the general topic of judicial decisions; since the method is corpus-independent, they offer
little in terms of differentiating decisions from one another.

The WPR algorithm, whose score is multiplied by the IDF score, performs noticeably better,
achieving good performance on documents 3 and 5, but is outperformed by the algorithms using TF-IDF
as the basis of selection. This is likely because the WPR algorithm prefers phrases that are
frequent and that are semantically similar to the other keyphrases, which is a good approach for
general
keyphrase extraction; however, those phrases might not be well suited to clustering within a
corpus.

TF-IDF on its own achieves good performance, as the metric is built for extracting phrases that are
good unique descriptors of documents. It brings many matches on all of the documents, with the top
five keyphrases being good topic descriptors for all documents.

The highest number of matches with abstracts and manual keyphrases was achieved by the AE
algorithm, which combines TF-IDF with the reconstruction error of the autoencoder.

An interesting finding across all evaluated methods is that the resulting phrases are found mainly
in the abstracts and less often among the manually obtained phrases. We would also like to point
out that several manually extracted phrases do not appear in the abstracts themselves.

We asked a legal expert to weigh in on the results from her perspective. We present her statement
in full in the next section.

4.1. Legal expert statement

The keyphrases selected by the analysis define the nature of the respective judicial decisions to
varying degrees. In some cases, the selected keyphrases sufficiently characterize the decisions,
e.g. as regards the second decision where it is clear that the decision regards the cancellation of
the child support obligation. In other cases, the keyphrases extracted from the decisions' text
describe the factual circumstances of the case rather than the relevant legal institutes applied in
them or the legal process as such. To illustrate, the keyphrases describing the first decision
focus on the factual background of the case, namely the asserting of warranty ("refund") for the
services provided ("to train"), but do not specifically define the applicable legal institute
(liability for defects), or the type of contract concluded between the parties to a dispute
(framework agreement on cooperation), which would be most likely the keyphrases used by the legal
expert to search for decisions in analogous cases. Similarly, it is unclear from the keyphrases
characterizing other decisions examined what type of a decision is adopted (decision on the merits
of the case or a procedural decision). To demonstrate, it is not apparent that the third decision
regards the appellate court's reversal and referral of the decision of the court of the first
instance, that in the fourth decision, the court discontinued the execution of a judgment or that
the fifth decision approves the agreement on guilt and punishment (although in this case the phrase
"approve the agreement" has been selected). This is, however, understandable, as these are all
legal categories that may not be immediately identifiable from the decisions' text alone without
previous legal input.

4.2. Summary and future work

As a proof of concept, this paper proposes and evaluates unsupervised keyword extraction methods,
because we lack labeled data. We can conclude from the statement of the legal expert that the most
relevant keyphrases are legal institutes and legal processes.

In our new project, we plan to cooperate with the Supreme Court of the Slovak Republic, which
should allow us to work with manually extracted phrases from their court decisions. This
cooperation will allow us to design and test supervised keyword extraction methods and compare them
with the methods presented in this paper. In our future work, we want to include laws and
regulations cited by court decisions as a source of names of legal institutes.

5. Conclusion

In this article, we studied the problem of revealing keyphrases in the court decisions of the
Slovak Republic. We proposed two unsupervised algorithms and evaluated them on five arbitrary court
decisions. We compared the computed keyphrases with expert-written abstracts and manually extracted
keyphrases. The results show that the methods extract keyphrases that are mainly included in the
abstracts rather than among the manually extracted keyphrases. The best results were achieved by
the AE algorithm, which combines TF-IDF with the reconstruction error of the autoencoder.

We believe that the results of the algorithms can be used as recommendations for manual annotation
of court decisions with keyphrases if the found keyphrases are intersected with a dictionary of
legal phrases. They can also be used to enrich search results and expand filtering options.

Acknowledgement

This work was supported by the Slovak Research and Development Agency under contract No.
APVV-21-0336 Analysis of court decisions by methods of artificial intelligence. This work was
supported by the Scientific Grant Agency of the Ministry of Education, Science, Research and Sport
of the Slovak Republic under contract VEGA 1/0177/21 Descriptive and computational complexity of
automata and algorithms. This work was supported by the internal project at the Faculty of Science
at Pavol Jozef Šafárik University in Košice vvgs-pf-2021-1789 Legal text analysis using computer
linguistics.
References                                                           C. Nunes, A. Jatowt, Yake! keyword extraction
                                                                     from single documents using multiple local features,
 [1] E. Papagiannopoulou, G. Tsoumakas, A review                     Information Sciences 509 (2020) 257–289.
     of keyphrase extraction, Wiley Interdisciplinary           [16] R. Mihalcea, P. Tarau, Textrank: Bringing order
     Reviews: Data Mining and Knowledge Discovery                    into text, in: Proceedings of the 2004 conference on
     10 (2020) e1339.                                                empirical methods in natural language processing,
 [2] M. Medvedeva, M. Vols, M. Wieling, Using machine                2004, pp. 404–411.
     learning to predict decisions of the european court        [17] S. Brin, L. Page, The anatomy of a large-scale hy-
     of human rights, Artificial Intelligence and Law 28             pertextual web search engine, Computer networks
     (2020) 237–266.                                                 and ISDN systems 30 (1998) 107–117.
 [3] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro,          [18] S. Rose, D. Engel, N. Cramer, W. Cowley, Auto-
     V. Lampos, Predicting judicial decisions of the eu-             matic keyword extraction from individual docu-
     ropean court of human rights: A natural language                ments, Text mining: applications and theory 1
     processing perspective, PeerJ Computer Science 2                (2010) 10–1002.
     (2016) e93.                                                [19] Slovak law thesaurus,        Legislative and infor-
 [4] D. Alghazzawi, O. Bamasag, A. Albeshri, I. Sana,                mation portal, Ministry of Justice of the Slo-
     H. Ullah, M. Z. Asghar, Efficient prediction of                 vak Republic (2022). URL: https://www.slov-lex.sk/
     court judgments using an lstm+ cnn neural network               zoznam-tezaurov.
     model with an optimal feature set, Mathematics 10          [20] P. Goyal, E. Ferrara, Graph embedding techniques,
     (2022) 683.                                                     applications, and performance: A survey, Knowl.
 [5] D. Varga, Z. Szoplák, S. Krajci, P. Sokol, P. Gurskỳ,           Based Syst. 151 (2018) 78–94.
     Analysis and prediction of legal judgements in the         [21] A. Grover, J. Leskovec, node2vec: Scalable feature
     slovak criminal (2021).                                         learning for networks, Proceedings of the 22nd
 [6] P. H. Luz de Araujo, T. E. de Campos, F. Ataides Braz,          ACM SIGKDD International Conference on Knowl-
     N. Correia da Silva, VICTOR: a dataset for Brazilian            edge Discovery and Data Mining (2016).
     legal documents classification, in: Proceedings of         [22] S. Horvát, S. Krajči, L. Antoni, Semantic representa-
     the 12th Language Resources and Evaluation Con-                 tion of slovak words, CEUR Workshop Proceedings
     ference, European Language Resources Association,               Vol-2718 (2020).
     Marseille, France, 2020, pp. 1449–1458. URL: https:        [23] R. Wang, Corpus-independent generic keyphrase
     //www.aclweb.org/anthology/2020.lrec-1.181.                     extraction using word embedding vectors, Software
 [7] I. Chalkidis, A. Jana, D. Hartung, M. Bommar-                   engineering research conference Vol. 39 (2014).
     ito, I. Androutsopoulos, D. M. Katz, N. Aletras,           [24] D. Bank, N. Koenigstein, R. Giryes, Autoencoders,
     LexGLUE: A benchmark dataset for legal lan-                     CoRR abs/2003.05991 (2020). URL: https://arxiv.org/
     guage understanding in english, arXiv preprint                  abs/2003.05991. arXiv:2003.05991 .
     arXiv:2110.00976 (2021).
 [8] The Free Access to Law Movement (FALM), 2022.
     URL: http://falm.info/.
 [9] The Canadian Legal Information Institute (CanLII),
     2022. URL: https://www.canlii.org/.
[10] A. Kanapala, S. Pal, R. Pamula, Text summarization
     from legal documents: a survey, Artificial Intelli-
     gence Review 51 (2019) 371–402.
[11] Z. S. Harris, Distributional structure, Word 10
     (1954) 146–162.
[12] K. S. Jones, A statistical interpretation of term speci-
     ficity and its application in retrieval, Journal of
     documentation (1972).
[13] S. R. El-Beltagy, A. Rafea, Kp-miner: A keyphrase
     extraction system for english and arabic documents,
     Information systems 34 (2009) 132–144.
[14] Z. Liu, P. Li, Y. Zheng, M. Sun, Clustering to find ex-
     emplar terms for keyphrase extraction, in: Proceed-
     ings of the 2009 conference on empirical methods
     in natural language processing, 2009, pp. 257–266.
[15] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge,
Table 1
Abstracts from court decisions and manually extracted keyphrases by legal expert translated to English.
  No.                                                 Abstract                                                Manually extracted keyphrases
  1      The complainant (lector) demanded via judicial proceedings that the defendant pays the              contract, liability for defects, liability,
         full price of the invoice for the services provided (realization of professional training). The     default, client, innominate contract,
         defendant, who was the complainant’s customer, paid the invoice only in part (liability             warranty, service, action
         for delay) due to considering the services provided by the complainant to be of poor
         quality (liability for defects). The defendant has also demanded a refund.
  2      The complainant demanded the court to cancel the duty to support and maintain against               alimony, duty to support and maintain
         the two defendants, who graduated from high school, are legal adults who are able to
         earn a living wage. The defendants agreed with the cancellation of the duty to support
         and maintain.
  3      The complainant applied a bill of exchange against the defendant, which was rejected by             bill of exchange, claim, commercial
         the district court. The reason for the rejection was that the district court called on              paper, appeal, referral, reversing
         the complainant to fill in additional data in the proposal form, which the complainant              decision
         did not do. The court of appeals ruled in favour of the complainant, affirming that he did
         not need to fill in his proposal with additional data. The first instance court arrived at the
         decision by applying incorrect legislation and incorrect interpretation of the legislation
         and EU law.
  4      The court rejected the proposal of granting authorization to a court distrainor and                 discontinue distraint, distraint pro-
         stopped all distraint proceedings. The court didn’t assign the distraint expenses to                ceedings, distraint, court distrainor
         the court distrainor.
  5      The accused was negligently driving a motor vehicle, not paying attention to the traffic            bodily harm, agreement on guilt and
         situation on the road and did not give way to a crossing pedestrian. A collision occurred,          punishment, negligence, punishment,
         in which the pedestrian suffered injuries consisting of multiple bone fractures and internal        criminal offence, punishment by
         bleeding. The accused inflicted grievous bodily harm to the pedestrian due to negligence,           disqualification
         due to which the accused was charged with inflicting injury. The accused received a
         fine, had her driving license revoked for all types of motor vehicles, and entered a
         plea agreement.




Table 2
Top 5 keyphrases translated to English language.
  No.   TF-IDF                          YAKE!                                         WPR                    AE
  1     to train                        according to the PRINCE methodology           to train               to train
        customer                        between the participants of the proceedings   lector                 customer
        lector                          PRINCE methodology training                   trainer                lector
        project                         according to the commercial law section       accreditation          email
        studies                         participants of the proceedings was           studies                refund

  2     studies                         district court of Námestovo                   loader                 duty to support and maintain
        duty to support and maintain    by the judgment of the district court         high school            court of Námestovo
        support and maintain            on the basis of an employment contract        worker                 cancel the duty to support and maintain
        to work                         to support according to the paragraph         to take care of        contract of employment
        court of Námestovo              he finished high school studies               part-time job          obligation towards

  3     bill of exchange                low value of the dispute                      assumption             bill of exchange
        form                            to apply the claim of the court               receiving              the first instance court
        first instance court            to apply the claim                            bill of exchange       to apply the claim
        first instance                  the first instance court                      form                   form of application
        fill out                        in connection to the court of appeals         stage                  owner of the bill of exchange

  4     court distrainor                court of Dolný Kubín                          Dolný Kubín            court distrainor
        Dolný Kubín                     first instance court                          court distrainor       Dolný Kubín
        Dolný                           district court of Dolný Kubín                 Dolný                  to grant authorization
        to grant authorization          apartment Dolný Kubín                         to apply to instruct   court court
        to grant                        Dolný Kubín case reference                    case reference         expenses of distraint

  5     penalty                         by paragraph paragraph                        pedestrian             road traffic
        guilt                           paragraph paragraph letter                    pedestrian crossing    fracture
        bone                            health by paragraph                           shovel                 bone
        to charge                       months by paragraphs                          bone                   penalty
        fracture                        Euro by paragraph                             lane                   approve the agreement
Table 3
Abstracts from court decisions and manually extracted keyphrases by legal expert in Slovak.
  No.                                            Abstract                                                    Manually extracted keyphrases
  1     Navrhovateľ (lektor) sa súdnym konaním domáhal, aby odporca uhradil faktúru za                     zmluva, zodpovednosť za vady, zod-
        poskytnuté služby (realizácia odborných školení) v plnej výške. Odporca, ktorý bol                 povednosť, omeškanie, objednávateľ,
        zákazníkom navrhovateľa, uhradil faktúru iba čiastočne (zodpovednosť za omeškanie)                 nepomenovaná zmluva, reklamácia,
        kvôli tomu, že navrhovateľ podľa neho poskytol vadné služby (zodpovednosť za vady).                služba, žaloba
        Navrhovateľ taktiež podal reklamáciu.
  2     Navrhovateľka žiadala, aby súd zrušil jej vyživovaciu povinnosť voči dvom odporcom, ktorí          výživné, vyživovacia povinnosť
        ukončili stredoškolské štúdium, sú plnoletí a zarábajú si sami na živobytie. Odporcovia
        súhlasili so zrušením vyživovacej povinnosti.
  3     Navrhovateľ si v návrhu uplatnil voči odporcovi pohľadávku, ktorú mu okresný súd                   zmenka, pohľadávka, cenné papiere,
        zamietol. Dôvodom zamietnutia bol ten, že okresný súd vyzval navrhovateľa o doplne-                odvolanie, vrátenie veci, zrušujúce
        nie údajov prostredníctvom tlačiva na doplnenie návrhu, ktoré navrhovateľ nedoplnil.               rozhodnutie
        Odvolací súd dal navrhovateľovi za pravdu, teda že navrhovateľ nemusel dopĺňať svoj
        návrh o ďalšie údaje. Prvostupňový súd dospel k rozhodnutiu na základe aplikácie ne-
        správnych právnych predpisov a nesprávnej interpretácie príslušných právnych predpisov
        a práva EÚ.

  4     Súd zamietol žiadosť o udelenie poverenia pre súdnu exekútorku a zastavil exekučné                 zastavenie exekúcie, exekučné ko-
        konanie. Súd exekútorke trovy exekúcie neprisúdil.                                                 nanie, exekúcia, exekútor
  5     Obvinená viedla motorové vozidlo a nevenovala plnú pozornosť vedeniu vozidla. Nesle-               ujma na zdraví, dohoda o vine a treste,
        dovala situáciu v cestnej premávke a nedala prednosť chodcovi prechádzajúceho cez                  nedbanlivosť, trest, trestný čin, trest
        priechod pre chodcov. Došlo k zrážke, pričom chodec utrpel poranenia pozostávajúce zo              zákazu činnosti
        zlomením viacerých kostí a vnútorných krvácaní. Z nedbanlivosti spôsobila ťažkú ujmu
        na zdraví chodcovi, čím spáchala prečin ublíženia na zdraví. Obvinená dostala peňažný
        trest a trest zákazu činnosti viesť všetky druhy motorových vozidiel, pričom uzavrela
        dohodu o vine a treste.




Table 4
Top 5 keyphrases in Slovak language.
  No.   TF-IDF                  YAKE!                           WPR                AE
  1     školiť                  podľa metodiky PRINCE           školiť             školiť
        zákazník                medzi účastníkmi konania        lektor             zákazník
        lektor                  školenia metodiky PRINCE        školiteľ           lektor
        projekt                 podľa ods obchodného            akreditácia        email
        štúdium                 účastníkmi konania bola         štúdia             reklamácia

  2     štúdium                 okresného súdu námestovo        nakladač           vyživovacia povinnosť
        vyživovacia povinnosť   rozsudkom okresného súdu        stredoškolský      súd námestovo
        vyživovací              základe pracovnej zmluvy        robotník           zrušiť vyživovaciu povinnosť
        pracovať                živiť podľa ods                 opatrovať          pracovná zmluva
        súd Námestovo           ukončil stredoškolské štúdium   brigáda            povinnosť voči

  3     zmenka                  nízkou hodnotou sporu           dohad              zmenka
        tlačivo                 uplatnenie pohľadávky súdu      prijímací          prvostupňový súd
        prvostupňový súd        uplatnenie pohľadávky           zmenka             uplatniť pohľadávku
        prvostupňový            prvostupňový súd                tlačivo            tlačivo návrh
        vyplniť                 súvislosti odvolací súd         etapa              majiteľ zmenky

  4     súdna exekútorka        súd Dolný Kubín                 Dolný Kubín        súdna exekútorka
        Dolný Kubín             súd prvého stupňa               súdna exekútorka   dolný kubín
        dolný                   okresný súd dolný               Dolný              udelenie poverenia
        udelenie poverenia      bytom Dolný Kubín               uplatniť poučiť    súd súdny
        udelenie                dolný kubín spisová             spisová značka     trovy exekúcie

  5     trest                   podľa ods ods                   chodec             cestná premávka
        vina                    ods ods písm                    priechod           zlomenina
        kosť                    zdraví podľa ods                lopata             kosť
        obviniť                 mesiacov podľa ods              kosť               trest
        zlomenina               eur podľa ods                   pruh               schváliť dohodu
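
The comparison of computed keyphrases against abstracts and manually extracted keyphrases can be approximated programmatically. The sketch below is only an illustration: it assumes a simple case-insensitive, whole-word match on the English translations, whereas the evaluation in the paper was done by hand and the exact matching criterion is not stated. The function names are hypothetical, the sample data are taken from document 2 in Tables 1 and 2, and the resulting flags are not claimed to reproduce the highlighting of the original tables.

```python
import re


def tokens(text):
    """Lower-case word tokens; \\w also matches accented letters in Python 3."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(haystack_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in haystack_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        haystack_tokens[i:i + n] == phrase_tokens
        for i in range(len(haystack_tokens) - n + 1)
    )


def match_keyphrases(extracted, abstract, manual):
    """Flag each extracted phrase as present in the abstract and/or manual list."""
    abstract_tokens = tokens(abstract)
    manual_token_lists = [tokens(phrase) for phrase in manual]
    flags = []
    for phrase in extracted:
        phrase_tokens = tokens(phrase)
        flags.append({
            "phrase": phrase,
            "in_abstract": contains_phrase(abstract_tokens, phrase_tokens),
            "in_manual": phrase_tokens in manual_token_lists,
        })
    return flags


# Document 2: abstract and manual keyphrases (Table 1), AE top five (Table 2).
ABSTRACT_2 = (
    "The complainant demanded the court to cancel the duty to support and "
    "maintain against the two defendants, who graduated from high school, "
    "are legal adults who are able to earn a living wage. The defendants "
    "agreed with the cancellation of the duty to support and maintain."
)
MANUAL_2 = ["alimony", "duty to support and maintain"]
AE_TOP5_2 = [
    "duty to support and maintain",
    "court of Námestovo",
    "cancel the duty to support and maintain",
    "contract of employment",
    "obligation towards",
]

if __name__ == "__main__":
    for flag in match_keyphrases(AE_TOP5_2, ABSTRACT_2, MANUAL_2):
        print(flag)
```

Matching on whole tokens rather than raw substrings avoids false hits such as "form" inside "information". Applying the same idea to the Slovak originals in Tables 3 and 4 would need lemmatization first, since Slovak inflection changes word endings.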