=Paper=
{{Paper
|id=Vol-3226/paper16
|storemode=property
|title=Keyphrase extraction from Slovak court decisions
|pdfUrl=https://ceur-ws.org/Vol-3226/paper16.pdf
|volume=Vol-3226
|authors=Dávid Varga,Šimon Horvát,Zoltán Szoplák,Ľubomír Antoni,Stanislav Krajči,Peter Gurský,Laura Bachňáková Rózenfeldová
|dblpUrl=https://dblp.org/rec/conf/itat/VargaHSAKGR22
}}
==Keyphrase extraction from Slovak court decisions==
Dávid Varga, Šimon Horvát, Zoltán Szoplák, Ľubomír Antoni, Stanislav Krajči, Peter Gurský
and Laura Bachňáková Rózenfeldová
Pavol Jozef Šafárik University in Košice, Faculty of Science, Institute of Computer Science, Jesenná 5, 040 01 Košice, Slovakia
Abstract
Keyphrase extraction is a vital subtask of text summarization and comparison, through which we can obtain the most
relevant set of words and phrases that describe the content of a given document. In this paper we test multiple approaches
of unsupervised keyword extraction on a set of court decisions. These approaches are TF-IDF, YAKE! and a graph-based
weighted PageRank algorithm. We combine these algorithms with a dictionary-based word embedding method in order to
capture the semantic relationships between the potential keyphrases. Extracted keyphrases can be used for semantic indexing
of court decisions, which can help with finding decisions with similar content.
Keywords
keyphrase, keyword, extraction, legal text, word network, embedding, court decision
1. Introduction
In their decision-making, judges need to ensure the consistency of decisions with the standard practice of courts. Getting an overview of similar relevant court decisions is a time-consuming process. Currently available tools have limited options for filtering the set of all decisions, often resulting in an extensive collection of documents. In the Slovak court system, only the Supreme Court has an analytical department with the human resources to create overviews of relevant court decisions for judges. With a vast number of court cases, ordinary judges often do not have the time and resources to get to all relevant documents, which can cause essential decisions to be overlooked. The analytical department of the Supreme Court manually creates metadata for all Supreme Court decisions, including keyphrases, to speed up the overview-making process, especially by narrowing the search results down to a reasonable size. Automatic keyphrase extraction can help with manual annotation by providing hints, thus making the annotation process semi-automatic and faster. This increases the number of court decision annotations that can be used for searching and filtering.
In the field of natural language processing, automatic keyphrase extraction can be used as a form of text summarization. Manually extracting keyphrases consists of reading the whole document, understanding its content and selecting the phrases used, or generating phrases that aptly describe the document. Manual extraction of keyphrases from long texts or from a large number of texts is time-consuming and demanding on human resources. These are the reasons why it is appropriate to automate this process. The process of automated extraction of keyphrases consists of selecting candidate phrases from a document or external source, which are evaluated according to how well they describe the document. An evaluation algorithm is used to score the candidate phrases according to statistics, semantics, or both at the same time. The candidate phrases with the highest score are then selected as keyphrases.
Keyphrase extraction algorithms are divided into two main groups, supervised and unsupervised algorithms. Supervised algorithms can be trained on a labeled dataset, and the resulting models often achieve high accuracy [1]. If a dataset labeled with keyphrases is not available, it is advisable to use unsupervised algorithms. These types of algorithms usually use statistical metrics that take into account the number of occurrences of phrases, the co-occurrence of phrases, the position of phrases within the document and others. These algorithms are often combined with graph algorithms, word embeddings, or other language models.
In this article, we focus on the extraction of keyphrases from Slovak court decisions. This dataset does not contain manually extracted keyphrases, therefore we decided to use a combination of unsupervised statistical and semantic approaches.
The objectives of this article are:
• design and implementation of an algorithm for extracting keyphrases from Slovak court decisions;
• evaluation of the results of extracted keyphrases on a set of court decisions.
ITAT’22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
Email: david.varga@student.upjs.sk (D. Varga); simon.horvat@student.upjs.sk (Š. Horvát); zoltan.szoplak@student.upjs.sk (Z. Szoplák); lubomir.antoni@upjs.sk (Ľ. Antoni); stanislav.krajci@upjs.sk (S. Krajči); peter.gursky@upjs.sk (P. Gurský); laura.rozenfeldova@upjs.sk (L. B. Rózenfeldová)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
This article is organized into four sections. In Section 2, we describe the related approaches to automated keyphrase extraction and other works related to legal document processing. In Section 3, we propose multiple algorithms to extract keyphrases from Slovak court decisions. Finally, we analyze the results of the algorithms in Section 4.
2. Related works
A lot of research has been done on applying NLP techniques to law texts and court decisions. NLP techniques are used in different tasks, for example: predicting the outcomes of court decisions [2, 3, 4], searching for insufficiently reasoned court decisions [5], creating electronic versions of court decisions [6] or creating a collection of datasets for evaluating performance across different legal text understanding tasks [7].
An international voluntary association called the Free Access to Law Movement (FALM) [8] was founded in 1992 and has more than 60 member organisations from around the globe. FALM members provide free access to legal information, group legal documents into one place and analyse law texts. FALM member CanLII [9] uses software to process Canadian court decisions. CanLII creates links to articles that are used in court decisions and to other court decisions used as citations. This software also creates a short description of the court decision and selects keyphrases. These can be used to save time and effort for legal experts such as judges and lawyers.
Algorithms used for legal text summarization are summed up in a survey paper [10]. However, in this section, we focus specifically on keyphrase extraction using statistical approaches and unsupervised algorithms. We also summarize the principles of selecting appropriate keyphrases based on observations.
The simplest approach to selecting keyphrases is to count the n-grams in the text and select the most common n-grams [11]. This approach is also called Bag of Words (BoW) and does not take into account synonyms, grammar or the meaning of individual n-grams. The downside of using a BoW approach is that it does not select keyphrases that are specific to the text but occur rarely in it.
A significant improvement over the BoW method is TF-IDF [12]. TF-IDF takes into account the whole corpus and penalizes phrases that occur in many documents. It is often used as a baseline method or in one of the steps of an algorithm, for example KP-Miner [13] or Liu’s clustering algorithm [14]. We will describe TF-IDF in more detail in the next section.
Three desirable properties of keyphrases are described in [14]:
• Understandable. Keyphrases should be easy to understand.
• Relevant. Keyphrases should relate to the main topic of the document.
• Good coverage. Keyphrases should cover all parts of the document appropriately.
According to these properties, Liu’s clustering algorithm [14] was created, which used statistical, semantic and clustering methods simultaneously. The first step of the algorithm was to search for candidate words, from which keyphrases of several words were composed in later steps. Subsequently, semantic closeness scores were calculated for the candidate words according to their co-occurrence within a fixed-length window and also according to an external source, Wikipedia. For each word, an embedding was created in which each index of the vector held a value, calculated using TF-IDF, representing the relationship between the word and a specific Wikipedia article. Candidate words were clustered according to semantic closeness, which grouped semantically similar words into individual clusters. Subsequently, exemplary words representing the entire cluster were selected from the individual clusters and had to be extended to phrases composed of several words. The keyphrases for the document were selected by processing all the words of the document: if a word was a noun that was also an exemplary word, it was added to the list of keyphrases along with the adjectives in its neighbourhood in the original text.
One of the latest language-independent unsupervised keyphrase extraction algorithms is YAKE! [15]. It uses statistical information, such as word counts and word occurrences, to identify keyphrases in unstructured texts. Its great advantage is that it only works with the current document during extraction, so it is not necessary to have the whole corpus of similar texts or other text sources available. The algorithm consists of five steps: (1) preprocessing the document into a machine-readable format, which results in tagged individual words; (2) creating, for each word, a representation consisting of a set of properties evaluated by statistical measurements; (3) heuristically combining the individual properties of each word into one score, which represents the importance of the word; (4) generating n-grams from candidate words and assigning them a degree of relevance; (5) deduplicating keyphrases that are too similar and ranking them by relevance.
Another approach to extracting keyphrases is to use graphs and graph algorithms. The text may be represented by a graph such that the vertices of the graph are candidate phrases and the edges represent the relationship between these phrases. Subsequently, a value for each vertex of the graph is assigned using the selected
evaluation function, and the edges and their weights are used to calculate this value. Thus, the individual methods differ in the types of graphs and evaluation functions they use. One of the first graph-based algorithms for extracting keywords from a text is TextRank [16], which has inspired a number of other graph-based algorithms. Its evaluation function calculates the values for the vertices of the directed graph recursively, and the information at the input of this function is global, i.e. in each step it comes from the whole graph. The evaluation function used is the PageRank [17] algorithm, which is iterative and takes the directed graph as its input. PageRank was originally designed for scoring web pages by importance on the web, but in TextRank it is used to score candidate keyphrases.
RAKE [18] is another graph-based unsupervised algorithm. It uses word frequency and word co-occurrence to create a graph and assign scores to phrases. It needs a list of stop-words and delimiters at the input, but it is able to identify interior stop-words in phrases.
3. Methods
3.1. Background knowledge
In this section we describe how we mine knowledge from sources other than the document from which we want to extract keyphrases. This background knowledge is used as a weighting mechanism in the methods described in the next section.
3.1.1. Term frequency – inverse document frequency.
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This measure is the product of two metrics:
1. term frequency expresses how many times a word appears in a document,
2. the inverse document frequency expresses how unique a given word is to a document. It is the logarithmically scaled inverse of the fraction of documents in the set 𝐷 that contain the word:
<math>\operatorname{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}</math>
where |{𝑑 ∈ 𝐷 ∶ 𝑡 ∈ 𝑑}| is the number of documents in which the term 𝑡 appears.
So idf examines the frequency values in all documents to reduce the impact of frequent words.
3.1.2. Phrase network.
To use keyphrase extraction methods, it is in our best interest to develop a vocabulary of potential keyphrases. Keyphrase extraction methods such as TF-IDF prioritize keyphrases that are unique to a specific document, which might not be suitable for the purposes of topic clustering. Using this vocabulary as a basis for creating phrase embeddings can help us semantically compare the keyphrases in order to facilitate better keyphrase selection.
Let 𝑉 be the set of all unigrams, bigrams and trigrams used in court decision documents.¹ The set of relations among phrases in 𝑉 is denoted 𝐸. We mine these relations mainly from the Slovak Law Thesaurus (SLT) [19] as follows:
1. Let phrase𝑖 and phrase𝑗 be words or phrases defined in SLT. If phrase𝑗 occurs in the definition of phrase𝑖, we expand our set 𝐸 by the pair (phrase𝑖, phrase𝑗).
2. Let phrase𝑖 be a word or phrase defined in SLT. Let {phrase1, phrase2, … , phrase𝑗} ⊆ 𝑉 be the words used in the definition of phrase𝑖 but without a definition in SLT. We add the pairs (phrase𝑖, phrase1), (phrase𝑖, phrase2), … , (phrase𝑖, phrase𝑗) to the set 𝐸. Not all words appearing in the definition are related to the defined phrase, therefore we weigh these relations with the TF-IDF used in our global weight function. The set of documents 𝐷 used for the IDF calculation is the set of all definitions from SLT.
3. Let phrase𝑖 ∈ 𝑉 be a phrase that is not found in SLT. We find the definitions of the individual words that make up the phrase in the Dictionary of Slovak language and continue as in the previous step. Let {phrase1, phrase2, … , phrase𝑗} ⊆ 𝑉 be the words used in the definition of phrase𝑖. We add the pairs (phrase𝑖, phrase1), (phrase𝑖, phrase2), … , (phrase𝑖, phrase𝑗) to the set 𝐸. The set of documents 𝐷 used for the IDF calculation is the set of all definitions from the Dictionary of Slovak language.
Using this set of definitions, we model a network of legal phrases defined as follows:
Let 𝐺 = (𝑉, 𝐸, 𝜙) be a directed edge-weighted graph, where 𝜙 ∶ 𝐸 → ℝ is the function:
<math>\phi(e) = \begin{cases} 1, & \text{if } e \text{ was obtained in step 1} \\ \text{tf-idf}(\mathrm{phrase}_i, \mathrm{phrase}_j), & \text{if } e \text{ was obtained in step 2 or 3} \end{cases}</math>
such that 𝑒 = (phrase𝑖, phrase𝑗) ∈ 𝐸, phrase𝑖 ∈ 𝑉 is the defined phrase and phrase𝑗 ∈ 𝑉 is a phrase occurring in the definition of phrase𝑖.
¹ Vocabulary 𝑉 does not contain stop words.
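The idf definition and the edge-weight function 𝜙 above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the names `idf`, `tf_idf` and `build_phrase_network` are hypothetical, the toy English definitions stand in for SLT entries, term frequency is taken as a raw count, and a single definition corpus 𝐷 is used for every edge, whereas the text uses SLT definitions in step 2 and dictionary definitions in step 3.

```python
import math
from collections import Counter


def idf(term, documents):
    """idf(t, D) = log(|D| / |{d in D : t in d}|); each document is a token list."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    """Raw count of `term` in `document` times its idf over `documents`."""
    return Counter(document)[term] * idf(term, documents)


def build_phrase_network(definitions, thesaurus_terms):
    """Return {(defined_phrase, used_phrase): weight} following phi.

    definitions: maps a defined phrase to the tokens of its definition
                 (assumed to be lemmatized and free of stop words).
    thesaurus_terms: phrases that have their own thesaurus entry.
    """
    corpus = list(definitions.values())  # simplification: one corpus D for all edges
    edges = {}
    for phrase, tokens in definitions.items():
        for used in set(tokens):
            if used == phrase:
                continue  # skip self-loops
            if phrase in thesaurus_terms and used in thesaurus_terms:
                edges[(phrase, used)] = 1.0  # step 1: both phrases are defined
            else:
                edges[(phrase, used)] = tf_idf(used, tokens, corpus)  # steps 2 and 3
    return edges


# Hypothetical toy data, not taken from the Slovak Law Thesaurus.
definitions = {
    "lease": ["contract", "property", "rent"],
    "contract": ["agreement", "parties"],
    "rent": ["payment", "property"],
}
thesaurus_terms = {"lease", "contract", "rent"}

edges = build_phrase_network(definitions, thesaurus_terms)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target}: {weight:.3f}")
```

In this toy example the word "property" appears in two of the three definitions, so its edges receive a lower weight than edges to words such as "agreement" that appear in only one definition, which is the effect the TF-IDF weighting is meant to have.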
In the next step, we use the graph embedding techniques described in [20], which produce a semantic representation for each phrase from 𝑉. In our approach, we use the Node2Vec algorithm described in [21], which is one of the graph embedding techniques based on random walks. These vectors with semantic interpretation are used as background knowledge for the algorithms described below. A detailed description of the method for obtaining embeddings is given in [22].
Suppose we need an embedding for a phrase² consisting of more than one word. We compute it as the element-wise average of the embeddings of all words occurring in the phrase.
² We already have embeddings for phrases defined in SLT. Here we talk about phrases from 𝑉 (or unseen) that do not occur in any relation in 𝐸.
3.2. Weighted PageRank
In order to incorporate our vocabulary and embeddings, we can use the keyphrase selection method described in [23] in conjunction with our phrase embeddings.
First, we create an undirected weighted graph representing a given court decision, with each node corresponding to a phrase of the decision present in our vocabulary 𝑉. A pair of nodes 𝑣1 and 𝑣2, each representing a potential keyphrase, is connected by an edge if the phrases are located within a fixed-size sliding window. The weight of an edge represents the similarity between the two potential keyphrases it connects. This similarity is defined by two metrics. One of them is the dice coefficient, which measures the interlinkedness of the two phrases. It is calculated as twice the number of times the phrases appear in the decision as a tuple, divided by the sum of the individual frequencies of the phrases:
<math>\operatorname{dice}(v_i, v_j) = \frac{2 \times \operatorname{freq}(v_i, v_j)}{\operatorname{freq}(v_i) + \operatorname{freq}(v_j)}</math> (1)
where 𝑣𝑖 and 𝑣𝑗 are vertices connected by an edge, freq(𝑣𝑖) is the number of times the vertex 𝑣𝑖 appears in the document, and freq(𝑣𝑖, 𝑣𝑗) is the number of times the vertices 𝑣𝑖 and 𝑣𝑗 form a tuple, in whichever order.
The second metric is inspired by Newton’s law of universal gravitation. The frequencies of the phrases are used as the masses of the objects, and the distance is calculated as the cosine distance between the embeddings of the two phrases:
<math>\operatorname{attr}(v_i, v_j) = \frac{\operatorname{freq}(v_i) \times \operatorname{freq}(v_j)}{d(v_i, v_j)^2}</math> (2)
where 𝑑(𝑣𝑖, 𝑣𝑗) is the cosine distance between the embeddings of phrases 𝑣𝑖 and 𝑣𝑗.
The weight of an edge is then calculated by combining the attraction force and the dice coefficient:
<math>w_{ij} = \operatorname{attr}(v_i, v_j) \times \operatorname{dice}(v_i, v_j)</math> (3)
To extract keyphrases from the nodes of the graph, we make use of the weighted PageRank algorithm. The PageRank algorithm is an iterative algorithm that calculates a score for each node of the graph, with a higher score indicating higher suitability as a keyphrase. The weighted PageRank algorithm ranks a node according to the ranks of its adjacent nodes, as well as the weights of the edges that connect them.
The PageRank score is then calculated recursively for each node of the graph. The score at a given time step is calculated as:
<math>P_t(v_i) = (1 - d) + d \times \sum_{v_j \in C(v_i)} \frac{w_{ij}}{\sum_{v_k \in C(v_j)} w_{jk}} P_{t-1}(v_j)</math> (4)
where 𝑃𝑡(𝑣𝑖) is the PageRank score for the node 𝑣𝑖 at time 𝑡, 𝐶(𝑣𝑖) is the set of nodes adjacent to node 𝑣𝑖, and 𝑑 is the damping factor.
The results obtained from the PageRank algorithm can then be used to determine the most likely keyphrase candidates, with a higher score representing a more suitable keyphrase.
The issue with using the weighted PageRank algorithm on its own is that it works only with a given document, which makes it useful for extracting keyphrases that describe the text itself, but not what differentiates it from other texts. Since the texts are judicial decisions, many court-centric phrases would hinder our ability to differentiate court decisions by topic. Therefore, the score for each phrase obtained from the weighted PageRank was multiplied by its IDF score, calculated from all available court decisions as described by the TF-IDF metric. Multiplying the PageRank score by the IDF should favor keywords that are not as frequent and would therefore probably not be court-centric, and thus more relevant to the specific topic of that decision.
3.3. Autoencoders
Keyword extraction methods like TF-IDF penalize phrases that are frequent in many documents, but infrequent phrases are not necessarily semantically informative. The task of removing court-centric phrases would be better achieved by using some form of semantic comparison. Phrases that are semantically dissimilar to the meaning of the majority of phrases are more likely to be keyphrases that can be used to meaningfully cluster documents. To perform semantic comparisons, we can combine our phrase embeddings with the autoencoder method.
Autoencoders, described in detail in [24], are unsupervised neural networks that aim to create a representation
of data that selects only the most relevant parameters, which can be used to reconstruct the original data. Autoencoders consist of two main parts: the encoder, which converts the input into an encoding (usually of lesser dimensionality than the input), and a decoder that tries to reconstruct the input from the encoding (Fig. 1). Using simple feedforward neural networks, the encoding ℎ can be calculated as:
<math>h = \omega(Wx + b)</math> (5)
where 𝑥 is the input, 𝜔 is the element-wise activation function, 𝑊 is a weight matrix and 𝑏 is the bias. This encoding can then be used to obtain 𝑥′, the reconstruction of the input. The reconstruction is calculated as:
<math>x' = \omega'(W'h + b')</math> (6)
where 𝜔′, 𝑊′ and 𝑏′ might be different from 𝜔, 𝑊 and 𝑏.
Figure 1: The scheme of autoencoder
We have trained our autoencoder to reconstruct the embeddings of phrases of the vocabulary 𝑉, described in 3.1.2.³ Because the vocabulary is made up primarily of phrases relevant to court decisions, we can infer that the reconstruction performance will be better for phrases explicitly related to court decisions. However, these phrases are detrimental to topic-based differentiation. Therefore, by penalizing a high reconstruction success of a keyphrase, we can filter out those that are not relevant to the topic of that court decision. In our case, we multiplied the TF-IDF score of keyphrases by the cosine distance between the input embedding and the reconstructed embedding from the autoencoder:
<math>\operatorname{score}(v_i) = \text{tf-idf}(v_i) \cdot \cos(\operatorname{emb}(v_i), \operatorname{rec}(\operatorname{emb}(v_i)))</math> (7)
³ Link to lemmatized court decisions: https://bit.ly/3zUwbYA
4. Evaluation
We have implemented two algorithms to serve as our baseline. The first is the regular TF-IDF metric used for keyword extraction, using all available court decisions to calculate the IDF value. This method is corpus-dependent, so other documents are taken into account. The second is the YAKE! algorithm [15], which takes into account only the current document. The algorithm described in 3.2 combines weighted PageRank with our phrase embeddings and multiplies the result by the IDF score of the TF-IDF metric. We will refer to this algorithm as WPR. The algorithm described in 3.3, which multiplies the regular TF-IDF score by the cosine distance between a phrase embedding and its autoencoder reconstruction, is labelled AE.
Since we did not have access to extracted keyphrases of any court decisions, we have chosen five random court decisions for manual and expert evaluation. We have asked a legal expert to evaluate the results in three ways:
• creation of abstracts that offer a brief summary of the content of the decisions (see Tables 1 and 3),
• manual extraction of keyphrases from the decisions using the dictionary of keyphrases used by the analytical department of the Supreme Court (see Tables 1 and 3),
• the expert’s opinion on the potential of the computed keyphrases to be included in the dictionary or to be used in any other way (see section 4.1).
We summarized the outputs of the algorithms in Tables 2 and 4, where the rows are documents and the columns are algorithms. Each table cell consists of the top five keyphrases found by the given algorithm for the given document.
We have compared the computed keyphrases with the abstracts and the manually extracted keyphrases. The phrases that are present in the abstract are highlighted in yellow. If a keyphrase matches a manually extracted keyphrase, it is highlighted by a black frame.
As we can see, the YAKE! algorithm provides many keyphrases that cannot be found in the abstracts or among the manual keyphrases. This is because the chosen keyphrases are too long and heavily related to the general topic of judicial decisions, so they offer little in terms of differentiating decisions from one another, since the method is corpus-independent.
The WPR algorithm, which multiplies the weighted PageRank score by the IDF score, performs quite a bit better, achieving good performance on documents 3 and 5, but is outclassed by the algorithms using TF-IDF as the basis of selection. This is likely because the WPR algorithm prefers phrases that are frequent and that are semantically similar to the other keyphrases, which is a good approach for general
keyphrase extraction; however, those might not be well suited to clustering within a corpus.
TF-IDF on its own achieves good performance, as the metric is built for extracting phrases that are good unique descriptors of documents. It brings many matches on all of the documents, with the top five keyphrases being good topic descriptors for all documents.
The AE algorithm, which combines TF-IDF with the reconstruction error of the autoencoder, achieved the most matches with both the abstracts and the manually extracted keyphrases.
An interesting finding across all evaluated methods is that the resulting phrases are found mainly in the abstracts and less among the manually obtained phrases. We would also like to point out that several manually extracted phrases are not even in the abstracts themselves.
We asked a legal expert to weigh in on the results from her perspective. We present her statement in full in the next section.
4.1. Legal expert statement
The keyphrases selected by the analysis define the nature of the respective judicial decisions to varying degrees. In some cases, the selected keyphrases sufficiently characterize the decisions, e.g. as regards the second decision where it is clear that the decision regards the cancellation of the child support obligation. In other cases, the keyphrases extracted from the decisions’ text describe the factual circumstances of the case rather than the relevant legal institutes applied in them or the legal process as such. To illustrate, the keyphrases describing the first decision focus on the factual background of the case, namely the asserting of warranty (“refund”) for the services provided (“to train”), but do not specifically define the applicable legal institute (liability for defects), or the type of contract concluded between the parties to a dispute (framework agreement on cooperation), which would be most likely the keyphrases used by the legal expert to search for decisions in analogous cases. Similarly, it is unclear from the keyphrases characterizing other decisions examined what type of a decision is adopted (decision on the merits of the case or a procedural decision). To demonstrate, it is not apparent that the third decision regards the appellant’s court reversal and referral of the decision of the court of the first instance, that in the fourth decision, the court discontinued the execution of a judgment or that the fifth decision approves the agreement on guilt and punishment (although in this case the phrase “approve the agreement” has been selected). This is, however, understandable, as these are all legal categories that may not be immediately identifiable from the decisions’ text alone without previous legal input.
4.2. Summary and future work
As a proof of concept, this paper proposes and evaluates unsupervised keyword extraction methods, because we lack labeled data. We can conclude from the statement of the legal expert that the most relevant keyphrases are legal institutes and legal processes.
In our new project, we plan to cooperate with the Supreme Court of the Slovak Republic, which should allow us to work with manually extracted phrases from their court decisions. This cooperation will allow us to design and test supervised keyword extraction methods and compare them with the methods presented in this paper. In our future work, we want to include laws and regulations cited by court decisions as a source of names of legal institutes.
5. Conclusion
In this article, we studied the problem of revealing keyphrases in the court decisions of the Slovak Republic. We proposed two unsupervised algorithms and evaluated them on five arbitrary court decisions. We have compared the computed keyphrases with expert-written abstracts and manually extracted keyphrases. The results show that the methods extract keyphrases that are mainly included in the abstracts rather than among the manually extracted keyphrases. The best results were achieved by the AE algorithm, which combines TF-IDF with the reconstruction error of the autoencoder.
We believe that the results of the algorithms can be used as recommendations for manual annotation of court decisions with keyphrases if the found keyphrases are intersected with a dictionary of legal phrases. They can also be used to enrich search results and expand filtering options.
Acknowledgement
This work was supported by the Slovak Research and Development Agency under contract No. APVV-21-0336 Analysis of court decisions by methods of artificial intelligence. This work was supported by the Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic under contract VEGA 1/0177/21 Descriptive and computational complexity of automata and algorithms. This work was supported by the internal project at the Faculty of Science at Pavol Jozef Šafárik University in Košice vvgs-pf-2021-1789 Legal text analysis using computer linguistics.
References
[1] E. Papagiannopoulou, G. Tsoumakas, A review of keyphrase extraction, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2020) e1339.
[2] M. Medvedeva, M. Vols, M. Wieling, Using machine learning to predict decisions of the european court of human rights, Artificial Intelligence and Law 28 (2020) 237–266.
[3] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the european court of human rights: A natural language processing perspective, PeerJ Computer Science 2 (2016) e93.
[4] D. Alghazzawi, O. Bamasag, A. Albeshri, I. Sana, H. Ullah, M. Z. Asghar, Efficient prediction of court judgments using an lstm+ cnn neural network model with an optimal feature set, Mathematics 10 (2022) 683.
[5] D. Varga, Z. Szoplák, S. Krajci, P. Sokol, P. Gurskỳ, Analysis and prediction of legal judgements in the slovak criminal (2021).
[6] P. H. Luz de Araujo, T. E. de Campos, F. Ataides Braz, N. Correia da Silva, VICTOR: a dataset for Brazilian legal documents classification, in: Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 1449–1458. URL: https://www.aclweb.org/anthology/2020.lrec-1.181.
[7] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. M. Katz, N. Aletras, LexGLUE: A benchmark dataset for legal language understanding in english, arXiv preprint arXiv:2110.00976 (2021).
[8] The Free Access to Law Movement (FALM), 2022. URL: http://falm.info/.
[9] The Canadian Legal Information Institute (CanLII), 2022. URL: https://www.canlii.org/.
[10] A. Kanapala, S. Pal, R. Pamula, Text summarization from legal documents: a survey, Artificial Intelligence Review 51 (2019) 371–402.
[11] Z. S. Harris, Distributional structure, Word 10 (1954) 146–162.
[12] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of documentation (1972).
[13] S. R. El-Beltagy, A. Rafea, Kp-miner: A keyphrase extraction system for english and arabic documents, Information systems 34 (2009) 132–144.
[14] Z. Liu, P. Li, Y. Zheng, M. Sun, Clustering to find exemplar terms for keyphrase extraction, in: Proceedings of the 2009 conference on empirical methods in natural language processing, 2009, pp. 257–266.
[15] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, A. Jatowt, Yake! keyword extraction from single documents using multiple local features, Information Sciences 509 (2020) 257–289.
[16] R. Mihalcea, P. Tarau, Textrank: Bringing order into text, in: Proceedings of the 2004 conference on empirical methods in natural language processing, 2004, pp. 404–411.
[17] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Computer networks and ISDN systems 30 (1998) 107–117.
[18] S. Rose, D. Engel, N. Cramer, W. Cowley, Automatic keyword extraction from individual documents, Text mining: applications and theory 1 (2010) 10–1002.
[19] Slovak law thesaurus, Legislative and information portal, Ministry of Justice of the Slovak Republic (2022). URL: https://www.slov-lex.sk/zoznam-tezaurov.
[20] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: A survey, Knowl. Based Syst. 151 (2018) 78–94.
[21] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016).
[22] S. Horvát, S. Krajči, L. Antoni, Semantic representation of slovak words, CEUR Workshop Proceedings Vol-2718 (2020).
[23] R. Wang, Corpus-independent generic keyphrase extraction using word embedding vectors, Software engineering research conference Vol. 39 (2014).
[24] D. Bank, N. Koenigstein, R. Giryes, Autoencoders, CoRR abs/2003.05991 (2020). URL: https://arxiv.org/abs/2003.05991. arXiv:2003.05991.
Table 1
Abstracts from court decisions and keyphrases manually extracted by a legal expert, translated to English.
No. | Abstract | Manually extracted keyphrases
1 | The complainant (lector) demanded via judicial proceedings that the defendant pay the full price of the invoice for the services provided (realization of professional training). The defendant, who was the complainant's customer, paid the invoice only in part (liability for delay) because they considered the services provided by the complainant to be of poor quality (liability for defects). The defendant has also demanded a refund. | contract, liability for defects, liability, default, client, innominate contract, warranty, service, action
2 | The complainant demanded that the court cancel the duty to support and maintain towards the two defendants, who graduated from high school and are legal adults able to earn a living wage. The defendants agreed with the cancellation of the duty to support and maintain. | alimony, duty to support and maintain
3 | The complainant applied a bill of exchange against the defendant, which was rejected by the district court. The reason for the rejection was that the district court called for the complainant to fill in additional data in the proposal form, which the complainant did not do. The court of appeals ruled in favour of the complainant, affirming that he did not need to fill in his proposal with additional data. The first instance court arrived at the decision by applying incorrect legislation and an incorrect interpretation of the legislation and EU law. | bill of exchange, claim, commercial paper, appeal, referral, reversing decision
4 | The court rejected the proposal of granting authorization to a court distrainor and stopped all distraint proceedings. The court did not assign the distraint expenses to the court distrainor. | discontinue distraint, distraint proceedings, distraint, court distrainor
5 | The accused was negligently driving a motor vehicle, not paying attention to the traffic situation on the road, and did not give way to a crossing pedestrian. A collision occurred, in which the pedestrian suffered injuries consisting of multiple bone fractures and internal bleeding. The accused inflicted grievous bodily harm on the pedestrian due to negligence, for which the accused was charged with inflicting injury. The accused received a fine, had her driving license revoked for all types of motor vehicles, and entered a plea agreement. | bodily harm, agreement on guilt and punishment, negligence, punishment, criminal offence, punishment by disqualification
Table 2
Top 5 keyphrases translated to English language.
No. | TF-IDF | YAKE! | WPR | AE
1 | to train | according to the PRINCE methodology | to train | to train
  | customer | between the participants of the proceedings | lector | customer
  | lector | PRINCE methodology training | trainer | lector
  | project | according to the commercial law section | accreditation | email
  | studies | participants of the proceedings was | studies | refund
2 | studies | district court of Námestovo | loader | duty to support and maintain
  | duty to support and maintain | by the judgment of the district court | high school | court of Námestovo
  | support and maintain | on the basis of an employment contract | worker | cancel the duty to support and maintain
  | to work | to support according to the paragraph | to take care of | obligation towards
  | court of Námestovo | he finished high school studies | part-time job | contract of employment
3 | bill of exchange | low value of the dispute | assumption | bill of exchange
  | form | to apply the claim of the court | receiving | the first instance court
  | first instance court | to apply the claim | bill of exchange | to apply the claim
  | first instance | the first instance court | form | form of application
  | fill out | in connection to the court of appeals | stage | owner of the bill of exchange
4 | court distrainor | court of Dolný Kubín | Dolný Kubín | court distrainor
  | Dolný Kubín | first instance court | court distrainor | Dolný Kubín
  | Dolný | district court of Dolný Kubín | Dolný | to grant a warrant
  | to grant authorization | apartment Dolný Kubín | to apply | to instruct court court
  | to grant | Dolný Kubín case reference | case reference | expenses of distraint
5 | penalty | by paragraph paragraph | pedestrian | road traffic
  | paragraph | paragraph letter | pedestrian crossing | fracture
  | bone | health by paragraph | shovel | guilt bone
  | to charge | months by paragraphs | bone | penalty
  | fracture | Euro by paragraph | lane | approve the agreement
Table 3
Abstracts from court decisions and manually extracted keyphrases by legal expert in Slovak.
No. | Abstract | Manually extracted keyphrases
1 | Navrhovateľ (lektor) sa súdnym konaním domáhal, aby odporca uhradil faktúru za poskytnuté služby (realizácia odborných školení) v plnej výške. Odporca, ktorý bol zákazníkom navrhovateľa, uhradil faktúru iba čiastočne (zodpovednosť za omeškanie) kvôli tomu, že navrhovateľ podľa neho poskytol vadné služby (zodpovednosť za vady). Navrhovateľ taktiež podal reklamáciu. | zmluva, zodpovednosť za vady, zodpovednosť, omeškanie, objednávateľ, nepomenovaná zmluva, reklamácia, služba, žaloba
2 | Navrhovateľka žiadala, aby súd zrušil jej vyživovaciu povinnosť voči dvom odporcom, ktorí ukončili stredoškolské štúdium, sú plnoletí a zarábajú si sami na živobytie. Odporcovia súhlasili so zrušením vyživovacej povinnosti. | výživné, vyživovacia povinnosť
3 | Navrhovateľ si v návrhu uplatnil voči odporcovi pohľadávku, ktorú mu okresný súd zamietol. Dôvodom zamietnutia bol ten, že okresný súd vyzval navrhovateľa o doplnenie údajov prostredníctvom tlačiva na doplnenie návrhu, ktoré navrhovateľ nedoplnil. Odvolací súd dal navrhovateľovi za pravdu, teda že navrhovateľ nemusel dopĺňať svoj návrh o ďalšie údaje. Prvostupňový súd dospel k rozhodnutiu na základe aplikácie nesprávnych právnych predpisov a nesprávnej interpretácie príslušných právnych predpisov a práva EÚ. | zmenka, pohľadávka, cenné papiere, odvolanie, vrátenie veci, zrušujúce rozhodnutie
4 | Súd zamietol žiadosť o udelenie poverenia pre súdnu exekútorku a zastavil exekučné konanie. Súd exekútorke trovy exekúcie neprisúdil. | zastavenie exekúcie, exekučné konanie, exekúcia, exekútor
5 | Obvinená viedla motorové vozidlo a nevenovala plnú pozornosť vedeniu vozidla. Nesledovala situáciu v cestnej premávke a nedala prednosť chodcovi prechádzajúceho cez priechod pre chodcov. Došlo k zrážke, pričom chodec utrpel poranenia pozostávajúce zo zlomením viacerých kostí a vnútorných krvácaní. Z nedbanlivosti spôsobila ťažkú ujmu na zdraví chodcovi, čím spáchala prečin ublíženia na zdraví. Obvinená dostala peňažný trest a trest zákazu činnosti viesť všetky druhy motorových vozidiel, pričom uzavrela dohodu o vine a treste. | ujma na zdraví, dohoda o vine a treste, nedbanlivosť, trest, trestný čin, trest zákazu činnosti
Table 4
Top 5 keyphrases in Slovak language.
No. | TF-IDF | YAKE! | WPR | AE
1 | školiť | podľa metodiky PRINCE | školiť | školiť
  | zákazník | medzi účastníkmi konania | lektor | zákazník
  | lektor | školenia metodiky PRINCE | školiteľ | lektor
  | projekt | podľa ods obchodného | akreditácia | email
  | štúdium | účastníkmi konania bola | štúdia | reklamácia
2 | štúdium | okresného súdu námestovo | nakladač | vyživovacia povinnosť
  | vyživovacia povinnosť | rozsudkom okresného súdu | stredoškolský | súd námestovo
  | vyživovací | základe pracovnej zmluvy | robotník | zrušiť vyživovaciu povinnosť
  | pracovať | živiť podľa ods | opatrovať | povinnosť voči
  | súd Námestovo | ukončil stredoškolské štúdium | brigáda | pracovná zmluva
3 | zmenka | nízkou hodnotou sporu | dohad | zmenka
  | tlačivo | uplatnenie pohľadávky súdu | prijímací | prvostupňový súd
  | prvostupňový súd | uplatnenie pohľadávky | zmenka | uplatniť pohľadávku
  | prvostupňový | prvostupňový súd | tlačivo | tlačivo návrh
  | vyplniť | súvislosti odvolací súd | etapa | majiteľ zmenky
4 | súdna exekútorka | súd Dolný Kubín | Dolný Kubín | súdna exekútorka
  | Dolný Kubín | súd prvého stupňa | súdna exekútorka | dolný kubín
  | dolný | okresný súd dolný | Dolný | udelenie poverenia
  | udelenie poverenia | bytom Dolný Kubín | uplatniť | poučiť súd súdny
  | udelenie | dolný kubín spisová | spisová značka | trovy exekúcie
5 | trest | podľa ods ods | chodec | cestná premávka
  | ods | ods písm | priechod | zlomenina
  | kosť | zdraví podľa ods | lopata | vina kosť
  | obviniť | mesiacov podľa ods | kosť | trest
  | zlomenina | eur podľa ods | pruh | schváliť dohodu