

Adapting Plagiarism Detection Techniques for Citation Identification in Legal Texts

Oleksandr Fetkovych

Peter Gurský

Dávid Varga

Zoltán Szoplák

Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University in Košice, Jesenná 5, 040 01 Košice, Slovakia

2024


Our work aims to create a tool for identifying citations among legal documents. In this paper, we present a method for detecting citations of sections of laws in court decisions using an adapted plagiarism detection technique. Our method uses the full-text database Elasticsearch to select the candidates. Since we are looking for citations of laws within court decisions, which might change over time, we must also consider the laws' amendments. We evaluated our approach on a sample of manually annotated court decisions.

Keywords: legal texts, anti-plagiarism system, citations, court decisions

1. Introduction

In our project, we aim to develop a system for analyzing relationships among legal texts. Recognizing references to other legal documents is essential for understanding connections within legal documentation. We aim to build a system that identifies which texts refer to other texts and, vice versa, which are derived from others, along with the nature and significance of these relationships. Currently, no available system records such relationships. Establishing one could improve legal analysis and research, making the legal system more efficient.

At the project’s current stage, we focus on court decisions, which typically rely on the wording of legal paragraphs from laws and regulations, as well as previous decisions from other courts in similar or related matters. In their decisions, judges usually cite specific paragraph numbers of laws and case file numbers to which they refer. However, these references can sometimes be unreliable. Many cited laws have only a marginal connection to the text, and sometimes, sections of laws are cited without explicit reference to paragraph numbers. This paper introduces a method for detecting citations in the sense of a quoted text from another legal text. The cited legal texts typically have a stronger connection with the original text. Therefore, quotations can be an essential part of relationship analysis.

The basic idea of our approach was to utilize a method from the well-researched area of plagiarism detection. However, plagiarism detection methods pursue slightly different goals. There are several important differences between detecting plagiarism and citations of legal texts:

• While a plagiarist tries to conceal their plagiarism by changing sentence formulations, using synonyms or other methods, a lawyer aims to quote another legal text as accurately as possible. Therefore, in our case, we can omit algorithms that detect word matches based on semantic similarity. On the other hand, legal citations often contain typos, omissions of parts of sentences, or the insertion of phrases into the quoted texts.
• A standard text is usually considered plagiarized only when there is significant textual similarity over several sentences, typically spanning several paragraphs or even pages. On the other hand, legal text citations can be concise, sometimes limited to a single sentence from a law. Of course, this only holds true on certain occasions and cannot be applied as a general rule, especially in citations of lower court decisions in appellate decisions, which can involve large text sections.
• When citing laws in court decisions, it is essential to consider that laws are dynamic documents that change over time through amendments. A law cited in a decision typically refers to its most recent version relative to the decision’s release date, although this may not always be the rule. The cited law text definitely does not come from a version that became effective after the decision date.

In addition to the differences listed above, our approach also takes into account the speed of citation detection. This paper is organized as follows:

• Section 2 introduces those anti-plagiarism systems that inspired the design of our method the most.
• Section 3 briefly presents our dataset.
• Section 4 details our citation detection methods.
• Section 5 describes the results of testing the effectiveness of our methods.
• Section 6 summarizes the results and the applicability of our methods but also discusses potential future research directions in the field of citation detection.

2. Related work

When defining our approach, we have taken inspiration from other works about creating anti-plagiarism systems. We are specifically referring to the particular intricacies of detecting and extracting legal citations mentioned in the previous section.

The anti-plagiarism system Copyfind [1] employs a hashing function to encode all input documents, such that every word turns out to be a 32-bit hash code. This system makes pairwise comparisons between all documents, where cursors move over lists of their hash codes, searching for an identical pair. However, as in our case, working with many documents is very time-consuming. In addition to the aforementioned drawbacks, when working with long sentences or more structurally complicated texts, resolving which phrases were similar was problematic and did not fit our needs.

Stamatatos et al. [2] introduced a method for detecting plagiarism in document collections based on stop word n-grams. The authors assumed that stop word sequences reveal syntactic patterns in the document structure, which can be used to detect plagiarism. This makes it useful in cases where other methods based on context might fail to detect plagiarism. Additionally, the method can be executed quickly because it uses a low number of stop words, reducing processing time. However, such a method also has its disadvantages. It cannot detect multiple instances of plagiarism in a scrutinized document and particularly struggles with short matching fragments between the source and the inspected documents.

Abdi et al. [3] proposed a method based on syntactic and semantic features, improving accuracy and efficiency. The authors use the sentence approach for data preprocessing. They divide the texts into sentences and work with each sentence separately. However, this method may require more computing resources and processing time than other methods due to its reliance on syntactic and semantic features. Moreover, it might perform poorly when plagiarized fragments have similar syntactic structures, but different semantics.

The method proposed by Vani and Gupta [4] employs a vector space model (VSM) alongside syntactic feature extraction using shallow NLP techniques, such as Part-of-Speech (POS) tags, to represent documents as vectors. This approach enhances plagiarism detection by analyzing both syntactic and semantic properties of texts. The method classifies these features using algorithms such as Naïve Bayes, Support Vector Machine, and Decision Trees. However, since it focuses on whole documents rather than individual sentences, its applicability may be limited in certain scenarios.

The method of Altheneyan and Menai [5] includes several NLP features such as stop word removal, punctuation removal, and tokenization to prepare the text for comparison. The paragraph-level comparison step compares suspect and source documents at the paragraph level, while the sentence-level comparison step looks for common unigrams between sentences. The SVM classifier then checks detected instances of plagiarism, and consecutive sections are merged in a post-processing step. This method is, therefore, capable of detecting plagiarism in obfuscated text. The method relies heavily on the number of common unigrams between sentences and may not effectively reveal more complex forms of plagiarism that do not rely on word frequency.

Yalcin et al. [6] introduce an external plagiarism detection system using n-grams of POS tags and semantic vector representations of words. The preprocessing involves sentence segmentation, tokenization, and assigning one of 45 POS tags. The system generates POS n-grams (POSNG), indexed at the sentence level using the Lucene search engine¹, to reflect syntactic text properties. Using the full-text search engine for obtaining candidate documents was the main inspiration in our proposed method.

Matching POSNG tags between a suspicious document and the source indicates potential plagiarism. Searches for candidate sentences involve querying n-grams through Lucene, seeking the highest match scores. The method employs two decision techniques for identifying plagiarism: direct syntactic comparison (POSNGPD) and syntactic plus semantic analysis (POSNGPD+SSBS), where the latter assesses semantic similarities using the Word2Vec model [7].

The authors claim superior accuracy of their method, although acknowledging slower processing due to extensive data handling.

¹ Apache Lucene, a high-performance text search engine, available at https://lucene.apache.org/core/

3. Dataset

To check the correctness of our proposed methods, we used two data sets: a set of all laws in the Slovak Republic including their history and a randomly chosen subset of court decisions.

3.1. Law articles

The Slovak government publishes the collection of laws on the Slov-Lex portal [8] in HTML format both online and in a ZIP archive. Obtaining all laws with corresponding articles and historical changes was quite a complex process. First, we needed to distinguish between the original laws, their amendments and other kinds of documents in the archive. Original laws typically also contained amendments of other laws, so we needed to identify which parts were relevant. After extracting the necessary data, we converted them into the JSON format for structured storage and easier processing.
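To make the structured storage concrete, the following minimal Python sketch shows what one law-section record could look like and how the version in force on a given date might be selected. The attribute names `_id`, `versions`, `version`, and `text` are the ones used for law sections in this dataset; the `_id` format, the ISO date strings, the placeholder texts, and the helper `version_valid_at` are assumptions made purely for illustration.

```python
from datetime import date

# Hypothetical record for one law section; the "_id" format, the dates and
# the texts are illustrative assumptions, not values from the real dataset.
law_section = {
    "_id": "300/2005-113",
    "versions": [
        {"version": "2006-01-01", "text": "placeholder text, original wording"},
        {"version": "2011-09-01", "text": "placeholder text, first amendment"},
        {"version": "2019-07-01", "text": "placeholder text, second amendment"},
    ],
}


def version_valid_at(section, decision_date):
    """Return the latest version effective on or before decision_date, or None."""
    candidates = [
        v for v in section["versions"]
        if date.fromisoformat(v["version"]) <= decision_date
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: date.fromisoformat(v["version"]))


chosen = version_valid_at(law_section, date(2015, 3, 10))
print(chosen["version"])  # 2011-09-01
```

Picking the latest version not later than the decision date reflects the typical case only; a decision may occasionally quote an older wording, so a real pipeline might keep earlier versions as fallback candidates.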

Figure 1 below illustrates the JSON object representation for Section 113 of the Criminal Code. This object comprises several key attributes:

• _id: A unique identifier for a law section, which combines the article's section number, in our case 113, with an identifier from the Collection of Laws, e.g., 300/2005 for the Criminal Code.
• versions: Contains an array of objects representing all historical versions of the article.
• version: Indicates the effective date from which the law’s current version applies.
• text: Provides the actual text of the law’s paragraph for the pertinent version.
• headlines: These are headings within the structure of the entire law leading to the specific section. However, this attribute was not utilized for our analysis.

3.2. Court decisions

This work analyzes a subset of court decisions published on the Open Data website of the Ministry of Justice of the Slovak Republic [9].

The court decisions are structured as JSON objects that include details like the court type, the court name, the judge’s name, and the legal domain. Each object also features a document_fulltext attribute, which holds the anonymized text of the court decision on which we focused primarily.

Also, the attribute decision_issue_date signifies when the court decision was issued. This attribute is significant because it indicates the specific date when the judge wrote the decision. With this date, we can avoid reviewing citations of laws enacted after the judgment since the judge would not have been able to reference laws that did not exist then.

4. Methods

Our initial approach to detecting citations involved comparing the text from court decisions against all legal paragraphs to identify citations. Given the extensive dataset and the complexity of comparison algorithms, this process proved to be very time-consuming.

Consequently, we transitioned to a more sophisticated solution utilizing Elasticsearch [10]. Elasticsearch employs advanced techniques for rapid searching through extensive text datasets. Its key advantages include storing data as JSON objects, scalability, and full-text search capabilities. Additionally, Elasticsearch leverages an inverted index, enhancing the efficiency of search operations. Its flexible and customizable text analysis methods convert text into structured data optimized for effective storage and retrieval, thus significantly improving the system’s performance and responsiveness.

This new approach consists of several integral stages, each designed to optimize the processing and analysis of legal texts. Here are the main components of our new method, which we will discuss in detail in upcoming sections:

• Data Indexing: We initiate the process by indexing the data of legal paragraphs into an Elasticsearch index. This step organizes the data efficiently, setting the foundation for rapid retrieval and detailed analysis.
• Finding candidate documents: A specific query is crafted for Elasticsearch to sift through the indexed data and extract candidate legal paragraphs. Finding candidates for deeper inspection narrows down the scope significantly.
• Text matches: We employ a custom algorithm to search for common parts of the original and candidate document. This phase analyses the presence of citations within the court decisions and evaluates their relevance.
• Decision-Making: After verifying the citations, we initiate a decision-making process. This step determines whether the legal paragraphs were cited accurately in the texts under review.

We validated this methodology by conducting reviews and analyses of its performance on a real court decision dataset, as will be discussed in Section 5. The results confirm the efficacy and efficiency of our approach in handling complex legal texts.

4.1. The process of data indexing

Due to constant law amendments, most legal paragraphs have multiple versions over time, as we demonstrated in Section 3.1.

If we index each paragraph with multiple versions as a single document, then during the search for similar law paragraphs for a court decision, Elasticsearch will compare the text of the court decision with the texts of all versions of the paragraph, leading to incorrect results. The paragraph with more versions can be incorrectly returned because of many common texts, which are in fact the same texts. Furthermore, the JSON document returned by Elasticsearch can contain matching texts in irrelevant versions while there is no match in the relevant one. To avoid this, we used a simple script to split each law paragraph into separate documents, each containing only one unique version of the paragraph while retaining the original paragraph ID.

This approach allows us to filter relevant paragraph versions based on the creation date of court decisions.

After preparing the dataset for indexing, we created an appropriate mapping [11] for an efficient data indexing process. The mapping in Elasticsearch defines how each field in the document will be indexed, including text analysis rules, data types, and storage options. This mapping is crucial as it ensures that Elasticsearch can efficiently store and retrieve data, optimize search performance, and correctly handle our documents’ nested structures.

The mapping begins by establishing a custom analyzer that simplifies and standardizes text. This text is then uniformly processed to ensure consistency across all documents, e.g., lowercasing and ASCII folding. The mapping specifies different types of data fields, including identifiers for precise searches and versioning information to track document updates. Each document version is analyzed using the same method to maintain uniformity in data handling. The result of the indexing is an internal Elasticsearch structure.

4.2. Finding candidate documents

This step reduces the number of source documents (legal paragraphs) that are then compared in detail with the text of the court decision. Using the Elasticsearch index, we created a query based on the full text of the court decision. Elasticsearch searches for documents with similar text and returns an ordered list of source documents ranked by relevance.

Key features of our query include:

• The similarity query contains the whole text of the decision.
• The query will return the top 30 most similar documents to streamline further processing.
• Comparison is then performed based on the court decision text.
• Each of the 30 documents will have a unique law article ID and contain only one version, valid at the time of the court decision.

Since the query is extensive and complex, we omit its detailed description in this paper. As a result, we will get 30 candidate documents, which we will use to search for specific quotes. The number 30, a heuristic parameter, is chosen to provide us with sufficient results. It is important to note, however, that this parameter is not fixed and can be adjusted to suit the needs of our research.

4.3. Searching for common matching texts

Elasticsearch does not return positions of matching texts. Since it is important to us what texts match and where the matches are located, we need to find the matching places in both documents.

4.3.1. Finding common sequences using the Needleman-Wunsch algorithm

To find matching text sequences, we utilized the well-known Needleman-Wunsch algorithm (NW) of dynamic programming [12] for the longest common subsequence search. The core principle of the algorithm is as follows:

• Text preprocessing: Texts are stripped of punctuation and short words, simplifying subsequent analysis and reducing noise.
• Matrix initialization: A two-dimensional array (dp matrix) is created where each element (i, j) stores the length of the longest common subsequence between the first i words of text1 and the first j words of text2.
• Matrix filling: Using dynamic programming, the matrix is filled with values based on word comparisons. If the words match, the value increases by one compared to the previous words. Otherwise, the maximum value from adjacent cells is selected.
• Subsequence reconstruction: Starting from the bottom-right element of the matrix, the subsequence itself is reconstructed by following the path that led to the maximum length. This is achieved by comparing the values in the matrix and selecting the path with the maximum value.

Since the sequence assembly is performed from the end, the results (arrays of indices) are reversed to be presented in the correct order.

4.3.2. Finding common sequences using 3-grams of words

Although the algorithm in Section 4.3.1 is fast, we found over time that it had trouble identifying matches in case of typos. This algorithm cannot return multiple matches of the same text, resulting in overlooking potentially longer citations that have words added or removed somewhere in the middle. Note that the problem of adding or removing words inside the citations is covered in Section 4.4. Therefore, we created an alternative approach that would eliminate these shortcomings.

Figure 2:

Part of the text from the legal paragraph with word offset 10:
"...v jeho prospech odvolanie, poškodený, zúčastnená osoba, ako aj prokurátor sa môžu výslovným vyhlásením vzdať ... Osoba, ktorá je oprávnená podať ..."

Part of the text from the court decision with word offset 100:
"...v jeho prospech obvolanie, poškodený, ako aj prokurátor sa môžu výslovným vyhlásením vzdať ... osobe, ktorá je oprávnená podať ... ako prokurátor môže ..."

Law paragraph (word_position): jeho_10 prospech_11 odvolanie_12 poskodeny_13 ... ako_16 prokurator_17 mozu_18 vyslovnym_19 vyhlasenim_20 vzdat_21 ... osoba_35 ktora_36 opravnena_37 podat_38 ...

Court decision (word_position): jeho_100 prospech_101 obvolanie_102 poskodeny_103 ako_104 prokurator_105 mozu_106 vyslovnym_107 vyhlasenim_108 vzdat_109 ... osobe_120 ktora_121 opravnena_122 podat_123 ... ako_250 prokurator_251 moze_252 ...

Arrays of matching word 3-grams:
[10, 11, 16, 16, 17, 18, 19, 35, 36]
[100, 101, 104, 250, 105, 106, 107, 120, 121]

Arrays of matching sequences:
[[10, 11], [16, 17, 18, 19], [16], [35, 36]]
[[100, 101], [104, 105, 106, 107], [250], [120, 121]]

The approach involves systematically comparing the text of a judicial decision with candidate paragraphs of laws. The process begins by taking both decision and paragraph texts and processing them to remove punctuation marks, numbers, and words shorter than three characters. This step helps to ensure that only meaningful content is considered, as shorter words typically lack semantic significance and can easily be replaced by synonyms. Then, we divide each text into 3-grams of words and calculate the Levenshtein [13] similarity coefficient for each pair of compared word 3-grams.
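The preprocessing and 3-gram construction used by this approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `preprocess` and `word_trigrams` are hypothetical, diacritics are folded with the standard `unicodedata` module to mimic lowercasing and ASCII folding, and a 3-gram is identified simply by the position of its first word in the filtered word list.

```python
import re
import unicodedata


def preprocess(text, min_len=3):
    """Lowercase, fold diacritics, drop punctuation, numbers and short words."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    # Keeping only runs of letters discards punctuation marks and numbers.
    tokens = re.findall(r"[a-z]+", folded)
    # Words shorter than min_len characters are removed.
    return [t for t in tokens if len(t) >= min_len]


def word_trigrams(words):
    """Return all word 3-grams; the 3-gram at index i covers words[i:i + 3]."""
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


sample = "v jeho prospech odvolanie, poškodený, ako aj prokurátor"
words = preprocess(sample)
print(words)
# ['jeho', 'prospech', 'odvolanie', 'poskodeny', 'ako', 'prokurator']
print(word_trigrams(words)[0])  # jeho prospech odvolanie
```

Because short words are removed before positions are assigned, the indices of the remaining words are contiguous, which is what allows consecutive matching 3-grams to be chained into sequences later on.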

We chose the Levenshtein method for its efficiency in handling typos. Typos are often found in court decisions, but almost never in laws. By calculating the Levenshtein distance, we can quickly and accurately identify matches even in texts containing such errors.

The similarity coefficient (normalized insertion-deletion similarity) between two 3-grams of words is calculated using the Levenshtein method described in [14] according to the following formula:

$$\text{Similarity} = 1 - \frac{\text{Distance}}{\text{Length}_1 + \text{Length}_2}$$

where:

• $\text{Similarity}$: similarity coefficient.
• $\text{Distance}$: Levenshtein distance between two 3-grams.
• $\text{Length}_1$: length of the first 3-gram.
• $\text{Length}_2$: length of the second 3-gram.
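As an illustration of the formula above, the following Python sketch computes the normalized insertion-deletion similarity of two word 3-grams without external libraries. It assumes that Distance counts only character insertions and deletions (so a substitution costs 2), which is one reading of the "insertion-deletion" wording; the function names are hypothetical, and the paper itself refers to the library in [14] for this computation rather than to this code.

```python
def indel_distance(a, b):
    """Minimum number of character insertions and deletions turning a into b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    return len(a) + len(b) - 2 * lcs


def similarity(a, b):
    """Similarity = 1 - Distance / (Length1 + Length2)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - indel_distance(a, b) / total


law = "jeho prospech odvolanie"
decision = "jeho prospech obvolanie"  # typo taken from the running example
score = similarity(law, decision)
print(round(score, 4), score > 0.9)  # 0.9565 True
```

With a single mistyped character in a 23-character 3-gram, the distance is 2 and the similarity is 1 − 2/46, so the pair still passes a 0.9 threshold; two 3-grams with no characters in common score 0.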

[[10, 11], [16, 17, 18, 19], [35, 36], [16]]

This coefficient reflects the normalized similarity between two 3-grams of words on a scale from 0 to 1, where 0 means complete dissimilarity and 1 means complete match. If the similarity of the compared 3-grams exceeds a threshold of 0.9, we consider the 3-grams to be equal. The indices of their positions in the original texts are recorded in arrays: one for the 3-gram indices of the court decision and one for the law paragraph.

Next, the algorithm searches for increasing continuous sequences of indices, resulting in all common subsequences with at least 3 words between the two texts. The result is represented as an array of arrays of word 3-gram positions, as depicted in Figure 2.

4.4. Inserted and missed words in citations

The previous two methods are designed to find continuous sequences within analyzed texts. Sometimes, judges use extra words or miss some words from cited laws and thus do not form precisely continuous citations. This step takes into account these kinds of citations and merges them. However, we cannot merge similar 3-grams of words if they are too far apart, so we merge only those sequences that are at most ten words apart. Finally, the process returns two arrays that store sequences that are merged if possible, one associated with the law article and the other with the decision.

Below we provide an example, where we continue our example from Figure 2. Each input array contains sub-arrays with strictly increasing subsequences by 1. In Figure 3, the green color marks the last and first elements of adjacent arrays whose difference is less than ten positions in both texts; these will be merged into a single sub-array, as seen in Figure 4.

The red color marks the last and first elements of adjacent arrays whose difference is greater than ten words, indicating they will not be merged in the array for a court decision nor corresponding sub-arrays for the law article.

We can see that the pair of subarrays <[16],[250]> does not merge with <[10,11],[100,101]>, because the distance between 101 and 250 is too big.

4.5. Decision making

After the entire process, we obtain a set of arrays containing sequences of indices of words found in both texts. Next, we need to decide which matching texts are real citations. Many times, even if there are matched texts, it is just a coincidence, not a real citation.

Actually, it is difficult to correctly identify real citations. Judges often do not put quoted text in quotation marks. Sometimes, even a relatively short text is a real citation and at the same time a longer text does not have to be a citation. Therefore, the following two approaches should be considered heuristics rather than informed decisions.

The decision-making process results in whether or not the law article is cited in the judicial decision. We decide whether a citation is present in the judicial decision based on the longest sequence length in input arrays.

We have tested the following two conditions:

• The longest citation contains at least 7 words
• The longest citation covers at least 5% of the original law article

The first approach was chosen to minimize the risk of false positives arising from random or insignificant matches of short text segments. Such short citations are often too general to identify a specific legal provision uniquely and could lead to incorrect conclusions. This threshold allows us to increase the accuracy and reliability of the method.

The second approach prefers citations that cite a significant part of the law article. This approach suppresses the occurrence of false positives but, on the other hand, citations within large law articles can be skipped.

As a result of the last step, the identifiers _id and version of the legal articles, together with the cited positions, are returned.
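The merging step of Section 4.4 and the two decision rules of Section 4.5 can be sketched in Python as follows. This is a simplified reading, not the published implementation: the function names are hypothetical, only neighbouring sequences in the given order are considered for merging, a gap must be positive and at most ten positions in both texts, and the length of a citation is approximated by the number of matched positions in its sequence.

```python
def merge_sequences(law_seqs, decision_seqs, max_gap=10):
    """Merge neighbouring matched sequences that are close in both texts."""
    merged_law, merged_dec = [], []
    for law, dec in zip(law_seqs, decision_seqs):
        if merged_law:
            gap_law = law[0] - merged_law[-1][-1]
            gap_dec = dec[0] - merged_dec[-1][-1]
            if 0 < gap_law <= max_gap and 0 < gap_dec <= max_gap:
                merged_law[-1].extend(law)
                merged_dec[-1].extend(dec)
                continue
        # Copy the sequences so the caller's input lists are never modified.
        merged_law.append(list(law))
        merged_dec.append(list(dec))
    return merged_law, merged_dec


def is_cited_absolute(law_seqs, min_words=7):
    """Rule 1: the longest matched sequence has at least min_words positions."""
    return max((len(s) for s in law_seqs), default=0) >= min_words


def is_cited_percentage(law_seqs, law_word_count, min_share=0.05):
    """Rule 2: the longest sequence covers at least min_share of the article."""
    longest = max((len(s) for s in law_seqs), default=0)
    return law_word_count > 0 and longest / law_word_count >= min_share


law_in = [[10, 11], [16, 17, 18, 19], [16], [35, 36]]
dec_in = [[100, 101], [104, 105, 106, 107], [250], [120, 121]]
law_out, dec_out = merge_sequences(law_in, dec_in)
print(law_out)  # [[10, 11, 16, 17, 18, 19], [16], [35, 36]]
print(dec_out)  # [[100, 101, 104, 105, 106, 107], [250], [120, 121]]
print(is_cited_absolute(law_out), is_cited_percentage(law_out, 100))  # False True
```

In this example the first two sequences merge because their gaps (5 in the law, 3 in the decision) are within ten positions, while the pair <[16],[250]> stays separate. For a hypothetical law article of 100 words, the merged six-position sequence fails the absolute rule but passes the 5% rule, which illustrates how the two heuristics can disagree.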

5. Evaluation

We created two methods for citation search and two methods for decision-making. We used two decision-making methods for each citation search method, resulting in four distinct methods in total. The implementation can be found on our GitHub repository [11]. Due to the many versions of laws, we were unable to compare our approach with other plagiarism detection systems.

We tested our methods on 100 randomly chosen court decisions and all laws of the Slovak Republic and summarized the results in Table 1.

In our study, we used the F1 Score to evaluate our method for identifying law citations in court decisions. We chose this metric because it balances precision and recall, making it a reliable measure of our method’s accuracy. The F1 Score helps ensure that we correctly identify real citations while reducing mistakes, which is crucial for trustworthy legal analysis.

While achieving perfect recall, the NW, absolute length method suffers from low precision, resulting in numerous false positives and an F1 Score of 0.71.

The NW, percentage method offers perfect precision but low recall, leading to an F1 Score of 0.58. It identifies citations accurately but misses a large number of citations.

The 3-grams, absolute length method shows a balanced performance with both high precision and recall, yielding an F1 Score of 0.89, indicating effective citation detection.

The 3-grams, percentage method also performs well, with slightly lower precision and an F1 Score of 0.87, compared to its absolute length counterpart. Overall, the 3-gram methods outperform the NW methods in balancing precision and recall, suggesting they are more suitable for nuanced detection of legal citations in court decisions.

When we compare these results, we can see that the Needleman-Wunsch algorithm can only search for exact matches; it often fails to detect long citations and typically only detects smaller parts of citations. As a result, it often happens that the longest citation does not reach 5% of the content of the paragraph of the law. On the other hand, the NW method does not have the restriction of at least three consecutive matching words forming a 3-gram. The consequence is that the method for dealing with inserted and missing words (Section 4.4) may mistakenly evaluate as citations close matches of unigrams and bigrams up to a total length exceeding the threshold of 7 words. Therefore, the NW, absolute length method identifies many false positives.

6. Conclusion

In this article, we presented our methods for searching for citations in legal texts. Although at first glance it looks like a classic plagiarism detection task, citations in legal texts have their specifics, which we listed in Section 1. We focused on finding citations of laws in court decisions. We presented a total of 4 methods to find citations and compared them on a dataset of 100 random judicial decisions.

In the future, we would like to explore other approaches to the decision-making process, for example, involving the semantic proximity of the court decision and the cited law.

Another goal is to examine the search for citations among judicial decisions. We will experiment with replacing the Needleman-Wunsch algorithm with the Smith–Waterman [15] one as it should be more suitable for finding local alignments such as law citations within larger, potentially dissimilar texts. For the purposes of faster detection for real-time tasks, we also plan to test the suitability of the BLAST algorithm [16] applied to such a task with a natural language text instead of biological sequence identification.

Building on our previous research [17], which involved extracting references to laws, we aim to refine how relationships are weighted within legal texts. When a law is both referenced and cited within a ruling, it underscores its substantial influence. We currently use references to law paragraphs to extract keyphrases. In the future, we plan to explore assigning greater weight to phrases from highly valued law paragraphs to enhance our keyphrase extraction method. In our future work, we also aim to test whether removing stop words instead of short words improves our performance.

Judges sometimes omit specific letters, sections, or even laws’ names, yet may still cite the text directly. Identifying these citations helps accurately pinpoint the relevant law paragraph, further enhancing the precision of our law reference extraction.

The Slovak Research and Development Agency supported this work under contract No. APVV-21-0336 Analysis of Court Decisions by Methods of Artificial Intelligence. Pavol Jozef Šafárik University in Košice supported this work with the internal project at vvgs-2023-2547 Legal Text Analysis Using Computer Linguistics. This article was also supported by the Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic under contract VEGA 1/0645/22 entitled Proposal of Novel Methods in the Field of Formal Concept Analysis and Their Application.

[1] Bloomfield, WCopyfind, University of Virginia, 2016. Available at URL: http://plagiarism.bloomfieldmedia.com/wordpress/software/wcopyfind/.

[2] Stamatatos, Plagiarism detection using stopword n-grams, Journal of the American Society for Information Science and Technology 62 (2011) 2512–2527.

[3] Abdi, Idris, R. M. Alguliyev, R. M. Aliguliyev, Pdlk: Plagiarism detection using linguistic knowledge, Expert Systems with Applications 42 (2015) 8936–8946.

[4] Vani, Gupta, Text plagiarism classification using syntax based linguistic features, Expert Systems with Applications 88 (2017) 448–464.

[5] A. S. Altheneyan, M. E. B. Menai, Automatic plagiarism detection in obfuscated text, Pattern Analysis and Applications 23 (2020) 1627–1650.

[6] Yalcin, I. Cicekli, G. Ercan, An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding, Expert Systems with Applications 197 (2022) 116677.

[7] Mikolov, Chen, G. Corrado, Dean, Efficient Estimation of Word Representations in Vector Space, 2013. arXiv:1301.3781.

[8] Slov-Lex, 2022. URL: https://www.slov-lex.sk/vyhladavanie-pravnych-predpisov.

[9] Rozvoj elektronických služieb súdnictva (RESS), 2016. URL: https://obcan.justice.sk/.

[10] Elastic, Elasticsearch, 2024. URL: https://www.elastic.co/elasticsearch/.

[11] Fetkovych, Gurský, Revealing implicit legal phrases in court decisions, https://github.com/vargadavid304/legal_citations, 2024. GitHub repository.

[12] S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48 (1970) 443–453.

[13] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet Physics Doklady, volume 10, Soviet Union, 1966, pp. 707–710.

[14] M. Bachmann, python-levenshtein, 2021. URL: https://rapidfuzz.github.io/Levenshtein.

[15] R. Mott, Smith–Waterman Algorithm, 2005. doi:10.