<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>O. Fetkovych);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Adapting Plagiarism Detection Techniques for Citation Identification in Legal Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksandr Fetkovych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Gurský</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dávid Varga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoltán Szoplák</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University in Košice</institution>
          ,
          <addr-line>Jesenná 5, 040 01 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Our work aims to create a tool for identifying citations among legal documents. In this paper, we present a method for detecting citations of sections of laws in court decisions using an adapted plagiarism detection technique. Our method uses the full-text database Elasticsearch to select the candidates. Since we are looking for citations of laws within court decisions, which might change over time, we must also consider the laws' amendments. We evaluated our approach on a sample of manually annotated court decisions.</p>
      </abstract>
      <kwd-group>
        <kwd>legal texts</kwd>
        <kwd>anti-plagiarism system</kwd>
        <kwd>citations</kwd>
        <kwd>court decisions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In our project, we aim to develop a system for analyzing
relationships among legal texts. Recognizing references
to other legal documents is essential for understanding
connections within legal documentation. We aim to build
a system that identifies which texts refer to other texts
and, vice versa, which are derived from others, along with
the nature and significance of these relationships.
Currently, no available system records such relationships.
Establishing one could improve legal analysis and
research, making the legal system more efficient.</p>
      <p>At the project’s current stage, we focus on court
decisions, which typically rely on the wording of legal
paragraphs from laws and regulations, as well as previous
decisions from other courts in similar or related matters.
In their decisions, judges usually cite specific paragraph
numbers of laws and case file numbers to which they
refer. However, these references can sometimes be
unreliable. Many cited laws have only a marginal connection
to the text, and sometimes, sections of laws are cited
without explicit reference to paragraph numbers. This
paper introduces a method for detecting citations in the
sense of text quoted from another legal text. Quoted
legal texts typically have a stronger connection with the
citing text. Therefore, quotations can be an essential
part of relationship analysis.</p>
      <sec id="sec-1-1">
        <p>The basic idea of our approach was to utilize a method from the well-researched area of plagiarism detection. However, plagiarism detection methods pursue slightly different goals. There are several important differences between detecting plagiarism and detecting citations of legal texts:</p>
        <list list-type="bullet">
          <list-item><p>While a plagiarist tries to conceal plagiarism by changing sentence formulations, using synonyms, or other methods, a lawyer aims to quote another legal text as accurately as possible. Therefore, in our case, we can omit algorithms that detect word matches based on semantic similarity. On the other hand, legal citations often contain typos, omissions of parts of sentences, or phrases inserted into the quoted text.</p></list-item>
          <list-item><p>A standard text is usually considered plagiarized only when there is significant textual similarity over several sentences, typically spanning several paragraphs or even pages. Legal text citations, on the other hand, can be concise, sometimes limited to a single sentence from a law. This does not hold as a general rule, however: appellate decisions that cite lower court decisions, in particular, can quote large sections of text.</p></list-item>
          <list-item><p>When citing laws in court decisions, it is essential to consider that laws are dynamic documents that change over time through amendments. A law cited in a decision typically refers to its most recent version relative to the decision’s release date, although this may not always be the rule. The cited law text definitely does not come from a version that became effective after the decision date.</p></list-item>
        </list>
        <p>In addition to the differences listed above, our approach also takes into account the speed of citation detection. This paper is organized as follows:</p>
        <list list-type="bullet">
          <list-item><p>Section 2 introduces those anti-plagiarism systems that inspired the design of our method the most.</p></list-item>
          <list-item><p>Section 3 briefly presents our dataset.</p></list-item>
          <list-item><p>Section 4 details our citation detection methods.</p></list-item>
          <list-item><p>Section 5 describes the results of testing the effectiveness of our methods.</p></list-item>
          <list-item><p>Section 6 summarizes the results and the applicability of our methods and discusses potential future research directions in the field of citation detection.</p></list-item>
        </list>
        <sec>
          <title>2. Related work</title>
          <p>When defining our approach, we took inspiration from other works on anti-plagiarism systems. In doing so, we kept in mind the particular intricacies of detecting and extracting legal citations mentioned in the previous section.</p>
          <p>The anti-plagiarism system Copyfind [<xref ref-type="bibr" rid="ref1">1</xref>] employs a hashing function to encode all input documents such that every word is represented by a 32-bit hash code. The system makes pairwise comparisons between all documents, moving cursors over the lists of their hash codes and searching for identical pairs. However, with many documents, as in our case, this is very time-consuming. In addition, when working with long sentences or structurally more complicated texts, resolving which phrases were similar was problematic and did not fit our needs.</p>
          <p>Stamatatos et al. [<xref ref-type="bibr" rid="ref2">2</xref>] introduced a method for detecting plagiarism in document collections based on stop word n-grams. The authors assumed that stop word sequences reveal syntactic patterns in the document structure, which can be used to detect plagiarism. This makes the method useful in cases where other methods based on context might fail to detect plagiarism. Additionally, the method can be executed quickly because it uses a low number of stop words, reducing processing time. However, such a method also has its disadvantages. It cannot detect multiple instances of plagiarism in a scrutinized document and particularly struggles with short matching fragments between the source and the inspected documents.</p>
          <p>Abdi et al. [<xref ref-type="bibr" rid="ref3">3</xref>] proposed a method based on syntactic and semantic features, improving accuracy and efficiency. The authors use a sentence-based approach for data preprocessing: they divide the texts into sentences and work with each sentence separately. However, this method may require more computing resources and processing time than other methods due to its reliance on syntactic and semantic features. Moreover, it might perform poorly when plagiarized fragments have similar syntactic structures but different semantics.</p>
          <p>The method proposed by Vani and Gupta [<xref ref-type="bibr" rid="ref4">4</xref>] employs a vector space model (VSM) alongside syntactic feature extraction using shallow NLP techniques, such as Part-of-Speech (POS) tags, to represent documents as vectors. This approach enhances plagiarism detection by analyzing both syntactic and semantic properties of texts. The method classifies these features using algorithms such as Naïve Bayes, Support Vector Machine, and Decision Trees. However, since it focuses on whole documents rather than individual sentences, its applicability may be limited in certain scenarios.</p>
          <p>The method of Altheneyan and Menai [<xref ref-type="bibr" rid="ref5">5</xref>] includes several NLP features such as stop word removal, punctuation removal, and tokenization to prepare the text for comparison. The paragraph-level comparison step compares suspect and source documents at the paragraph level, while the sentence-level comparison step looks for common unigrams between sentences. The SVM classifier then checks detected instances of plagiarism, and consecutive sections are merged in a post-processing step. This method is, therefore, capable of detecting plagiarism in obfuscated text. However, it relies heavily on the number of common unigrams between sentences and may not effectively reveal more complex forms of plagiarism that do not rely on word frequency.</p>
          <p>Yalcin et al. [<xref ref-type="bibr" rid="ref6">6</xref>] introduce an external plagiarism detection system using n-grams of POS tags and semantic vector representations of words. The preprocessing involves sentence segmentation, tokenization, and assigning one of 45 POS tags. The system generates POS n-grams (POSNG), indexed at the sentence level using the Lucene search engine<sup>1</sup>, to reflect syntactic text properties. Using a full-text search engine to obtain candidate documents was the main inspiration for our proposed method.</p>
          <p>Matching POSNG tags between a suspicious document and the source indicates potential plagiarism. Searches for candidate sentences involve querying n-grams through Lucene, seeking the highest match scores. The method employs two decision techniques for identifying plagiarism: direct syntactic comparison (POSNGPD) and syntactic plus semantic analysis (POSNGPD+SSBS), where the latter assesses semantic similarities using the Word2Vec model [<xref ref-type="bibr" rid="ref7">7</xref>]. The authors claim superior accuracy of their method, although they acknowledge slower processing due to extensive data handling.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <p><sup>1</sup> Apache Lucene, a high-performance text search engine, available at https://lucene.apache.org/core/</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Dataset</title>
      <sec id="sec-2-1">
        <p>To check the correctness of our proposed methods, we used two datasets: the set of all laws of the Slovak Republic, including their historical versions, and a randomly chosen subset of court decisions.</p>
        <sec id="sec-2-1-1">
          <title>3.1. Law articles</title>
          <p>
            The Slovak government publishes the collection of laws
on the Slov-Lex portal [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] in HTML format both online
and in a ZIP archive. Obtaining all laws with their
articles and historical changes was a fairly complex
process. First, we needed to distinguish between
original laws, their amendments, and other kinds of
documents in the archive. Original laws typically also contain
amendments to other laws, so we needed to
identify which parts were relevant. After extracting the
necessary data, we converted it into JSON for
structured storage and easier processing.
          </p>
          <p>Figure 1 illustrates the JSON object representation of Section 113 of the Criminal Code. The object comprises several key attributes:</p>
          <list list-type="bullet">
            <list-item><p>_id: A unique identifier of a law section, which combines the section number of the article, in our case 113, with an identifier from the Collection of Laws, e.g., 300/2005 for the Criminal Code.</p></list-item>
            <list-item><p>versions: Contains an array of objects representing all historical versions of the article.</p></list-item>
            <list-item><p>version: Indicates the effective date from which this version of the law applies.</p></list-item>
            <list-item><p>text: Provides the actual text of the law’s paragraph for the pertinent version.</p></list-item>
            <list-item><p>headlines: The headings within the structure of the entire law that lead to the specific section. This attribute was not used in our analysis.</p></list-item>
          </list>
        </sec>
        <sec id="sec-3-1">
          <title>3.2. Court decisions</title>
          <sec id="sec-3-1-1">
            <p>This work analyzes a subset of court decisions published on the Open Data website of the Ministry of Justice of the Slovak Republic [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
            <p>The court decisions are structured as JSON objects that include details such as the court type, the court name, the judge’s name, and the legal domain. Each object also has a document_fulltext attribute, which holds the anonymized text of the court decision and on which we primarily focused.</p>
            <p>The attribute decision_issue_date indicates when the court decision was issued, that is, the date when the judge wrote the decision. With this date, we can avoid reviewing citations of laws enacted after the judgment, since the judge could not have referenced laws that did not exist at that time.</p>
          </sec>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methods</title>
      <p>Our initial approach to detecting citations compared the text of court decisions against all legal paragraphs. Given the extensive dataset and the complexity of the comparison algorithms, this process proved to be very time-consuming.</p>
      <p>Consequently, we transitioned to a more sophisticated solution utilizing Elasticsearch [<xref ref-type="bibr" rid="ref10">10</xref>]. Elasticsearch employs advanced techniques for rapid searching through extensive text datasets. Its key advantages include storing data as JSON objects, scalability, and full-text search capabilities. Additionally, Elasticsearch leverages an inverted index, enhancing the efficiency of search operations. Its flexible and customizable text analysis methods convert text into structured data optimized for effective storage and retrieval, thus significantly improving the system’s performance and responsiveness.</p>
      <p>This new approach consists of several stages, each designed to optimize the processing and analysis of legal texts. The main components of our method, which we discuss in detail in the following sections, are:</p>
      <list list-type="bullet">
        <list-item><p>Data indexing: We start by indexing the legal paragraphs into an Elasticsearch index. This step organizes the data efficiently, laying the foundation for rapid retrieval and detailed analysis.</p></list-item>
        <list-item><p>Finding candidate documents: A specific Elasticsearch query sifts through the indexed data and extracts candidate legal paragraphs. Finding candidates for deeper inspection narrows down the scope significantly.</p></list-item>
        <list-item><p>Text matches: We employ a custom algorithm to search for common parts of the original and candidate documents. This phase analyzes the presence of citations within the court decisions and evaluates their relevance.</p></list-item>
        <list-item><p>Decision-making: After verifying the citations, we initiate a decision-making process. This step determines whether the legal paragraphs were cited accurately in the texts under review.</p></list-item>
      </list>
      <p>We validated this methodology by evaluating its performance on a real court decision dataset, as discussed in Section 5. The results confirm the efficacy and efficiency of our approach in handling complex legal texts.</p>
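      <p>To make the data model from Section 3 concrete before the individual stages are described, the following is a minimal sketch of a law-section object and of how the version in force on a decision's issue date could be selected. Only the attribute names (_id, versions, version, text, headlines, decision_issue_date) come from the dataset description. The identifier format, the ISO date strings, the placeholder texts, and the helper names split_versions and version_in_force are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from datetime import date

# Hypothetical law-section object: the "_id" format, the dates, and the texts
# are placeholders; only the attribute names follow the dataset description.
law_section = {
    "_id": "113-300/2005",
    "versions": [
        {"version": "2006-01-01", "text": "text of the first version"},
        {"version": "2011-09-01", "text": "text of an amended version"},
        {"version": "2019-07-01", "text": "text of a later amendment"},
    ],
    "headlines": ["placeholder heading"],
}


def split_versions(section):
    """Turn one multi-version object into one document per version."""
    return [
        {"paragraph_id": section["_id"], "version": v["version"], "text": v["text"]}
        for v in section["versions"]
    ]


def version_in_force(section, decision_issue_date):
    """Return the latest version effective on or before the decision date."""
    candidates = [
        doc for doc in split_versions(section)
        if date.fromisoformat(doc["version"]) <= decision_issue_date
    ]
    if not candidates:
        return None  # the section did not exist yet when the decision was issued
    return max(candidates, key=lambda doc: date.fromisoformat(doc["version"]))


chosen = version_in_force(law_section, date(2015, 3, 12))
print(chosen["version"])  # 2011-09-01
```
]]></preformat>
      <p>Keeping one document per version, each carrying the original paragraph identifier, is what allows versions that became effective after the decision date to be excluded with a simple date comparison.</p>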
      <sec id="sec-3-2">
        <title>4.1. The process of data indexing</title>
        <p>Due to constant law amendments, most legal paragraphs have multiple versions over time, as we demonstrated in Section 3.1. If we index each paragraph with multiple versions as a single document, then during the search for similar law paragraphs for a court decision, Elasticsearch will compare the text of the court decision with the texts of all versions of the paragraph, leading to incorrect results. A paragraph with more versions can be returned incorrectly because of many common texts, which are in fact the same texts. Furthermore, the JSON document returned by Elasticsearch can contain matching texts in irrelevant versions while there is no match in the relevant one. To avoid this, we used a simple script to split each law paragraph into separate documents, each containing only one unique version of the paragraph while retaining the original paragraph ID.</p>
        <p>This approach allows us to filter relevant paragraph versions based on the creation date of court decisions.</p>
        <p>After preparing the dataset for indexing, we created an appropriate mapping [<xref ref-type="bibr" rid="ref11">11</xref>] for an efficient data indexing process. The mapping in Elasticsearch defines how each field in the document will be indexed, including text analysis rules, data types, and storage options. This mapping is crucial as it ensures that Elasticsearch can efficiently store and retrieve data, optimize search performance, and correctly handle our documents’ nested structures.</p>
        <p>The mapping begins by establishing a custom analyzer that simplifies and standardizes text. This text is then uniformly processed to ensure consistency across all documents, e.g., by lowercasing and ASCII folding. The mapping specifies different types of data fields, including identifiers for precise searches and versioning information to track document updates. Each document version is analyzed using the same method to maintain uniformity in data handling. The result of the indexing is an internal Elasticsearch structure.</p>
        <sec id="sec-3-2-1">
          <title>4.2. Finding candidate documents</title>
          <p>This step reduces the number of source documents (legal paragraphs) that are then compared in detail with the text of the court decision. Using the Elasticsearch index, we created a query based on the full text of the court decision. Elasticsearch searches for documents with similar text and returns an ordered list of source documents ranked by relevance.</p>
          <p>Key features of our query include:</p>
          <list list-type="bullet">
            <list-item><p>The similarity query contains the whole text of the decision.</p></list-item>
            <list-item><p>The query returns the 30 most similar documents to streamline further processing.</p></list-item>
            <list-item><p>Comparison is then performed based on the court decision text.</p></list-item>
            <list-item><p>Each of the 30 documents has a unique law article ID and contains only one version, valid at the time of the court decision.</p></list-item>
          </list>
          <p>Since the query is extensive and complex, we omit its detailed description in this paper. As a result, we get 30 candidate documents, which we use to search for specific quotes. The number 30 is a heuristic parameter chosen to provide us with sufficient results. It is not fixed, however, and can be adjusted to suit the needs of our research.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.3. Searching for common matching texts</title>
          <p>Elasticsearch does not return the positions of matching texts. Since it is important to us which texts match and where the matches are located, we need to find the matching places in both documents.</p>
          <sec>
            <title>4.3.1. Finding common sequences using the Needleman-Wunsch algorithm</title>
            <p>To find matching text sequences, we utilized the well-known Needleman-Wunsch (NW) dynamic programming algorithm [<xref ref-type="bibr" rid="ref12">12</xref>] for the longest common subsequence search. The core principle of the algorithm is as follows:</p>
            <list list-type="bullet">
              <list-item><p>Text preprocessing: Texts are stripped of punctuation and short words, simplifying subsequent analysis and reducing noise.</p></list-item>
              <list-item><p>Matrix initialization: A two-dimensional array dp (dp matrix) is created where each element dp(i, j) stores the length of the longest common subsequence between the first i words of text1 and the first j words of text2.</p></list-item>
              <list-item><p>Matrix filling: Using dynamic programming, the matrix is filled with values based on word comparisons. If the words match, the value increases by one compared to the previous words. Otherwise, the maximum value from adjacent cells is selected.</p></list-item>
              <list-item><p>Subsequence reconstruction: Starting from the bottom-right element of the matrix, the subsequence itself is reconstructed by following the path that led to the maximum length. This is achieved by comparing the values in the matrix and selecting the path with the maximum value.</p></list-item>
            </list>
            <p>Since the sequence assembly is performed from the end, the results (arrays of indices) are reversed to be presented in the correct order.</p>
          </sec>
        </sec>
        <sec id="sec-3-2-3">
          <title>4.3.2. Finding common sequences using 3-grams of words</title>
          <p>Although the algorithm in Section 4.3.1 is fast, we found over time that it had trouble identifying matches in the case of typos. This algorithm also cannot return multiple matches of the same texts, resulting in overlooking potentially longer citations that have words added or removed somewhere in the middle. Note that the problem of adding or removing words inside the citations is covered in Section 4.4. Therefore, we created an alternative approach that would eliminate these shortcomings.</p>
          <p>The approach involves systematically comparing the text of a judicial decision with candidate paragraphs of laws. The process begins by taking both the decision and the paragraph texts and processing them to remove punctuation marks, numbers, and words shorter than three characters. This step helps to ensure that only meaningful content is considered, as shorter words typically lack semantic significance and can easily be replaced by synonyms. Then, we divide each text into 3-grams of words and calculate the Levenshtein [13] similarity coefficient for each pair of compared word 3-grams.</p>
          <p>Figure 2. Part of the text from the legal paragraph with word offset 10: "...v jeho prospech odvolanie, poškodený, zúčastnená osoba, ako aj prokurátor sa môžu výslovným vyhlásením vzdať ... Osoba, ktorá je oprávnená podať ..."</p>
          <p>Part of the text from the court decision with word offset 100: "...v jeho prospech obvolanie, poškodený, ako aj prokurátor sa môžu výslovným vyhlásením vzdať ... osobe, ktorá je oprávnená podať ... ako prokurátor môže ..."</p>
          <p>Preprocessed words of the legal paragraph with their positions: jeho<sub>10</sub> prospech<sub>11</sub> odvolanie<sub>12</sub> poskodeny<sub>13</sub> ... ako<sub>16</sub> prokurator<sub>17</sub> mozu<sub>18</sub> vyslovnym<sub>19</sub> vyhlasenim<sub>20</sub> vzdat<sub>21</sub> ... osoba<sub>35</sub> ktora<sub>36</sub> opravnena<sub>37</sub> podat<sub>38</sub> ...</p>
          <p>Preprocessed words of the court decision with their positions: jeho<sub>100</sub> prospech<sub>101</sub> obvolanie<sub>102</sub> poskodeny<sub>103</sub> ako<sub>104</sub> prokurator<sub>105</sub> mozu<sub>106</sub> vyslovnym<sub>107</sub> vyhlasenim<sub>108</sub> vzdat<sub>109</sub> ... osobe<sub>120</sub> ktora<sub>121</sub> opravnena<sub>122</sub> podat<sub>123</sub> ... ako<sub>250</sub> prokurator<sub>251</sub> moze<sub>252</sub> ...</p>
          <p>Arrays of matching word 3-grams: [10, 11, 16, 16, 17, 18, 19, 35, 36] for the legal paragraph and [100, 101, 104, 250, 105, 106, 107, 120, 121] for the court decision.</p>
          <p>Arrays of matching sequences: [[10, 11], [16, 17, 18, 19], [16], [35, 36]] for the legal paragraph and [[100, 101], [104, 105, 106, 107], [250], [120, 121]] for the court decision.</p>
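          <p>The building blocks shared by both search methods can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the authors' implementation: preprocess lowercases and folds diacritics (assumed, in line with the ASCII folding mentioned for the index analyzer) and keeps only letter-only words of at least three characters, word_trigrams produces the word 3-grams used in this section, and lcs_word_indices is a plain word-level longest common subsequence search following the matrix filling and backtracking steps of Section 4.3.1. All function names are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re
import unicodedata


def preprocess(text):
    """Lowercase, fold diacritics (assumed), keep letter-only words of 3+ characters."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    words = re.findall(r"[^\W\d_]+", folded)  # letters only: drops punctuation and numbers
    return [w for w in words if len(w) >= 3]


def word_trigrams(words):
    """All overlapping 3-grams of words, each joined into one string."""
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def lcs_word_indices(a, b):
    """Indices of one longest common subsequence of words in a and in b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the first i words of a and the first j words of b
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack from the bottom-right cell, then reverse the collected indices.
    i, j = n, m
    idx_a, idx_b = [], []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            idx_a.append(i - 1)
            idx_b.append(j - 1)
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    idx_a.reverse()
    idx_b.reverse()
    return idx_a, idx_b


law = preprocess("v jeho prospech odvolanie, poškodený, zúčastnená osoba, ako aj prokurátor")
dec = preprocess("v jeho prospech obvolanie, poškodený, ako aj prokurátor")
print(lcs_word_indices(law, dec))  # ([0, 1, 3, 6, 7], [0, 1, 3, 4, 5])
print(word_trigrams(dec)[0])       # jeho prospech obvolanie
```
]]></preformat>
          <p>In this example the exact word comparison skips the misspelled word "obvolanie", which illustrates why a typo-tolerant comparison of 3-grams is used instead.</p>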
          <p>We chose the Levenshtein method for its efficiency in
handling typos. Typos are often found in court decisions,
but almost never in laws. By calculating the Levenshtein
distance, we can quickly and accurately identify matches
even in texts containing such errors.</p>
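        </sec>
        <sec id="sec-3-2-4">
          <p>The similarity coefficient (normalized insertion-deletion similarity) between two 3-grams of words is calculated using the Levenshtein method described in [14] according to the following formula:</p>
          <disp-formula><tex-math><![CDATA[\mathrm{Similarity} = 1 - \frac{\mathrm{Distance}}{\mathrm{Length}_1 + \mathrm{Length}_2}]]></tex-math></disp-formula>
          <p>where:</p>
          <list list-type="bullet">
            <list-item><p>Similarity: similarity coefficient.</p></list-item>
            <list-item><p>Distance: Levenshtein distance between the two 3-grams.</p></list-item>
            <list-item><p>Length<sub>1</sub>: length of the first 3-gram.</p></list-item>
            <list-item><p>Length<sub>2</sub>: length of the second 3-gram.</p></list-item>
          </list>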
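          <p>The formula can be checked on a pair of 3-grams from the running example. The sketch below assumes that Distance is the insertion-deletion (indel) distance, which matches the name "normalized insertion-deletion similarity" and the normalization by the sum of both lengths. With a classic Levenshtein distance that also allows substitutions, the values would differ. The function names are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def indel_distance(s, t):
    """Minimum number of character insertions and deletions turning s into t."""
    # The indel distance equals len(s) + len(t) - 2 * LCS(s, t); the LCS length
    # is computed with a rolling row to keep memory linear in len(t).
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, other in enumerate(t, 1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(s) + len(t) - 2 * prev[-1]


def similarity(s, t):
    """Similarity = 1 - Distance / (Length1 + Length2), in the range 0 to 1."""
    total = len(s) + len(t)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - indel_distance(s, t) / total


score = similarity("jeho prospech odvolanie", "jeho prospech obvolanie")
print(round(score, 4))  # 0.9565
```
]]></preformat>
          <p>Both 3-grams are 23 characters long and differ in a single character, so the distance is 2 and the similarity is 1 - 2/46, which is above the 0.9 threshold used in this section. The misspelled 3-gram from the court decision is therefore still matched with the 3-gram from the law.</p>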
          <p>This coefficient reflects the normalized similarity between two 3-grams of words on a scale from 0 to 1, where 0 means complete dissimilarity and 1 means a complete match. If the similarity of the compared 3-grams exceeds a threshold of 0.9, we consider the 3-grams to be equal. The indices of their positions in the original texts are recorded in arrays: one for the 3-gram indices of the court decision and one for the law paragraph.</p>
          <p>Next, the algorithm searches for increasing continuous sequences of indices, resulting in all common subsequences with at least 3 words between the two texts. The result is represented as an array of arrays of word 3-gram positions, as depicted in Figure 2.</p>
          <sec>
            <title>4.4. Inserted and missed words in citations</title>
            <p>The previous two methods are designed to find continuous sequences within the analyzed texts. Sometimes, judges add extra words or omit some words from cited laws, so the citations do not form precisely continuous sequences. This step takes these kinds of citations into account and merges them. However, we cannot merge similar 3-grams of words if they are too far apart, so we merge only those sequences that are at most ten words apart. Finally, the process returns two arrays that store the sequences, merged where possible: one associated with the law article and the other with the decision.</p>
            <p>Below we provide an example that continues our example from Figure 2. Each input array contains sub-arrays with subsequences strictly increasing by 1. In Figure 3, the green color marks the last and first elements of adjacent arrays whose difference is less than ten positions in both texts; these will be merged into a single sub-array, as seen in Figure 4.</p>
            <p>The red color marks the last and first elements of adjacent arrays whose difference is greater than ten words, indicating that they will not be merged in the array for the court decision nor in the corresponding sub-arrays for the law article. We can see that the pair of sub-arrays &lt;[16],[250]&gt; does not merge with &lt;[10,11],[100,101]&gt;, because the distance between 101 and 250 is too big.</p>
            <p>Array of law article positions from the example figure: [[10, 11], [16, 17, 18, 19], [35, 36], [16]]</p>
          </sec>
          <sec>
            <title>4.5. Decision making</title>
            <p>After the entire process, we obtain a set of arrays containing sequences of indices of words found in both texts. Next, we need to decide which matching texts are real citations. Often, even if there are matched texts, the match is just a coincidence, not a real citation.</p>
            <p>It is difficult to identify real citations correctly. Judges often do not put quoted text in quotation marks. Sometimes even a relatively short text is a real citation, while a longer text does not have to be one. Therefore, the following two approaches should be considered heuristics rather than informed decisions.</p>
            <p>The decision-making process determines whether or not the law article is cited in the judicial decision. We decide this based on the length of the longest sequence in the input arrays.</p>
            <p>We have tested the following two conditions:</p>
            <list list-type="bullet">
              <list-item><p>The longest citation contains at least 7 words.</p></list-item>
              <list-item><p>The longest citation covers at least 5% of the original law article.</p></list-item>
            </list>
            <p>The first approach was chosen to minimize the risk of false positives arising from random or insignificant matches of short text segments. Such short citations are often too general to identify a specific legal provision uniquely and could lead to incorrect conclusions. This threshold allows us to increase the accuracy and reliability of the method.</p>
            <p>The second approach prefers citations that cite a significant part of the law article. This approach suppresses the occurrence of false positives but, on the other hand, citations within large law articles can be skipped.</p>
            <p>As a result of the last step, the identifiers _id and version of the legal articles, together with the cited positions, are returned.</p>
          </sec>
          <sec>
            <title>5. Evaluation</title>
          </sec>
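          <p>Since the evaluated variants share the merging step of Section 4.4 and differ in the decision rule of Section 4.5, both steps are sketched here on the arrays from the running example. This is an illustration under assumptions rather than the authors' code. It assumes that matches are processed in the order in which they occur in the court decision, that two neighbouring sequences are merged when the gap between them is positive and at most ten positions in both texts, and that the length of a citation is measured in law-text words from the first 3-gram start to the last 3-gram end. The law lengths passed to the percentage rule are made up for the example.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def merge_sequences(law_seqs, dec_seqs, max_gap=10):
    """Merge neighbouring matches whose gap is at most max_gap in both texts."""
    # Assumption: matches are processed in the order they occur in the decision.
    pairs = sorted(zip(law_seqs, dec_seqs), key=lambda pair: pair[1][0])
    merged_law, merged_dec = [], []
    for law, dec in pairs:
        if merged_law:
            gap_law = law[0] - merged_law[-1][-1]
            gap_dec = dec[0] - merged_dec[-1][-1]
            if 0 < gap_law <= max_gap and 0 < gap_dec <= max_gap:
                merged_law[-1].extend(law)
                merged_dec[-1].extend(dec)
                continue
        merged_law.append(list(law))
        merged_dec.append(list(dec))
    return merged_law, merged_dec


def span_in_words(seq, ngram=3):
    """Words covered from the first 3-gram start to the last 3-gram end (assumed measure)."""
    return seq[-1] - seq[0] + ngram


def cited_by_absolute_length(merged_law, min_words=7):
    """First rule: the longest citation contains at least min_words words."""
    return max(map(span_in_words, merged_law), default=0) >= min_words


def cited_by_percentage(merged_law, law_length, min_share=0.05):
    """Second rule: the longest citation covers at least min_share of the law article."""
    longest = max(map(span_in_words, merged_law), default=0)
    return law_length > 0 and longest / law_length >= min_share


law_seqs = [[10, 11], [16, 17, 18, 19], [16], [35, 36]]
dec_seqs = [[100, 101], [104, 105, 106, 107], [250], [120, 121]]
merged_law, merged_dec = merge_sequences(law_seqs, dec_seqs)
print(merged_law)  # [[10, 11, 16, 17, 18, 19], [35, 36], [16]]
print(merged_dec)  # [[100, 101, 104, 105, 106, 107], [120, 121], [250]]
print(cited_by_absolute_length(merged_law))                # True, longest span is 12 words
print(cited_by_percentage(merged_law, law_length=400))     # False, 12 / 400 is below 5%
```
]]></preformat>
          <p>The last two lines show how the two rules can disagree on the same match: a 12-word citation passes the absolute length rule but fails the percentage rule for a long law article, which is the trade-off described for the second approach.</p>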
        </sec>
        <sec id="sec-3-2-5">
          <title>We created two methods for citation search and two methods for decision-making. We used two decision-making</title>
          <p>
            methods for each citation search method, resulting in identifies many false positives.
four distinct methods in total. The implementation can
be found on our GitHub repository [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Due to the
many versions of laws, we were unable to compare our 6. Conclusion
approach with other plagiarism detection systems.
          </p>
          <p>We tested our methods on 100 randomly chosen court decisions and all laws of the Slovak Republic and summarized the results in Table 1.</p>
          <p>In our study, we used the F1 Score to evaluate our method for identifying law citations in court decisions. We chose this metric because, as the harmonic mean of precision and recall, it balances the two and is therefore a reliable measure of our method's accuracy. A high F1 Score requires both that detected citations are real and that few real citations are missed, which is crucial for trustworthy legal analysis.</p>
          <p>While achieving perfect recall, the NW, absolute length method suffers from low precision, resulting in numerous false positives and an F1 Score of 0.71.</p>
          <p>The NW, percentage method offers perfect precision but low recall, leading to an F1 Score of 0.58. It identifies citations accurately but misses a large number of them.</p>
          <p>The 3-grams, absolute length method shows a balanced performance with both high precision and recall, yielding an F1 Score of 0.89, indicating effective citation detection.</p>
          <p>The 3-grams, percentage method also performs well, with slightly lower precision and an F1 Score of 0.87 compared to its absolute length counterpart. Overall, the 3-gram methods outperform the NW methods in balancing precision and recall, suggesting they are more suitable for nuanced detection of legal citations in court decisions.</p>
          <p>Comparing these results, we can see that the Needleman-Wunsch algorithm can only search for exact matches; it often fails to detect long citations and typically detects only smaller parts of them. As a result, the longest detected citation often does not reach 5% of the content of the paragraph of the law. On the other hand, the NW method does not have the restriction of at least three consecutive matching words forming a 3-gram. Consequently, the method for dealing with inserted and missing words (Section 4.4) may mistakenly evaluate close matches of unigrams and bigrams as citations when their total length exceeds the threshold of 7 words. Therefore the NW, absolute length method</p>
          <p>In this article, we presented our methods for searching for citations in legal texts. Although at first glance it looks like a classic plagiarism detection task, citations in legal texts have their specifics, which we listed in Section 1. We focused on finding citations of laws in court decisions. We presented a total of 4 methods to find citations and compared them on a dataset of 100 random judicial decisions.</p>
          <p>In the future, we would like to explore other approaches to the decision-making process, for example, involving the semantic proximity of the court decision and the cited law.</p>
          <p>Another goal is to examine the search for citations among judicial decisions. We will experiment with replacing the Needleman-Wunsch algorithm with the Smith–Waterman [15] one, as it should be more suitable for finding local alignments such as law citations within larger, potentially dissimilar texts. For the purposes of faster detection for real-time tasks, we also plan to test the suitability of the BLAST algorithm [16] applied to such a task with natural language text instead of biological sequence identification.</p>
          <p>Building on our previous research [17], which involved extracting references to laws, we aim to refine how relationships are weighted within legal texts. When a law is both referenced and cited within a ruling, it underscores its substantial influence. We currently use references to law paragraphs to extract keyphrases. In the future, we plan to explore assigning greater weight to phrases from highly valued law paragraphs to enhance our keyphrase extraction method. In our future work, we also aim to test whether removing stop words instead of short words improves our performance.</p>
          <p>Judges sometimes omit specific letters, sections, or even laws' names, yet may still cite the text directly. Identifying these citations helps accurately pinpoint the relevant law paragraph, further enhancing the precision of our law reference extraction.</p>
          <p>The Slovak Research and Development Agency supported this work under contract No. APVV-21-0336 Analysis of Court Decisions by Methods of Artificial Intelligence. Pavol Jozef Šafárik University in Košice supported this work with the internal project at vvgs-2023-2547 Legal Text Analysis Using Computer Linguistics. This article was also supported by the Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic under contract VEGA 1/0645/22 entitled Proposal of Novel Methods in the Field of Formal Concept Analysis and Their Application.</p>
          <p>[13] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet physics doklady, volume 10, Soviet Union, 1966, pp. 707–710.</p>
          <p>[14] M. Bachmann, python-levenshtein, 2021. URL: https://rapidfuzz.github.io/Levenshtein.</p>
          <p>[15] R. Mott, Smith–Waterman Algorithm, 2005. doi:10.</p>
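          <p>The plan stated in the conclusion, replacing Needleman-Wunsch with Smith–Waterman to find local alignments, can be illustrated with a small word-level sketch. This is not our implementation: the function name, the scoring parameters (match, mismatch, gap), and the example sentences are assumptions chosen only to show how a local alignment isolates a cited passage inside a longer text.</p>
          <preformat>
```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of two token lists.

    Returns (score, (a_start, a_end), (b_start, b_end)) with end-exclusive spans.
    The scoring values are illustrative assumptions, not tuned settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start anywhere (the local part).
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1      # token of a aligned to a gap
        else:
            j -= 1      # token of b aligned to a gap
    return best, (i, end_i), (j, end_j)


# Hypothetical example: a law fragment quoted inside a longer decision sentence.
decision = "the court notes that everyone has the right to life and the claim fails".split()
law = "everyone has the right to life".split()
best, decision_span, law_span = smith_waterman(decision, law)
print(best, decision[decision_span[0]:decision_span[1]])
```
          </preformat>
          <p>Unlike a global alignment, the unmatched words before and after the quoted passage do not lower the score, which is the property that motivates the planned experiment.</p>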
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloomfield</surname>
          </string-name>
          , WCopyfind, University of Virginia. Available at URL: http://plagiarism.bloomfieldmedia.com/wordpress/software/wcopyfind/. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <article-title>Plagiarism detection using stopword n-grams</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>62</volume>
          (
          <year>2011</year>
          )
          <fpage>2512</fpage>
          -
          <lpage>2527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Idris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Alguliyev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Aliguliyev</surname>
          </string-name>
          ,
          <article-title>PDLK: Plagiarism detection using linguistic knowledge</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>42</volume>
          (
          <year>2015</year>
          )
          <fpage>8936</fpage>
          -
          <lpage>8946</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Vani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Text plagiarism classification using syntax based linguistic features</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>88</volume>
          (
          <year>2017</year>
          )
          <fpage>448</fpage>
          -
          <lpage>464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Altheneyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E. B.</given-names>
            <surname>Menai</surname>
          </string-name>
          ,
          <article-title>Automatic plagiarism detection in obfuscated text</article-title>
          ,
          <source>Pattern Analysis and Applications</source>
          <volume>23</volume>
          (
          <year>2020</year>
          )
          <fpage>1627</fpage>
          -
          <lpage>1650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yalcin</surname>
          </string-name>
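          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cicekli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ercan</surname>
          </string-name>
          ,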
          <article-title>An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>197</volume>
          (
          <year>2022</year>
          )
          <fpage>116677</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <source>Efficient Estimation of Word Representations in Vector Space</source>
          ,
          <year>2013</year>
          . arXiv:1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Slov-Lex</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: https://www.slov-lex.sk/vyhladavanie-pravnych-predpisov.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <article-title>Rozvoj elektronických služieb súdnictva (RESS)</article-title>
          ,
          <year>2016</year>
          . URL: https://obcan.justice.sk/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Elastic</surname>
          </string-name>
          , Elasticsearch,
          <year>2024</year>
          . URL: https://www.elastic.co/elasticsearch/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Fetkovych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gurský</surname>
          </string-name>
          ,
          <article-title>Revealing implicit legal phrases in court decisions</article-title>
          , https://github.com/vargadavid304/legal_citations,
          <year>2024</year>
          . GitHub repository.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Needleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Wunsch</surname>
          </string-name>
          ,
          <article-title>A general method applicable to the search for similarities in the amino acid sequence of two proteins</article-title>
          ,
          <source>Journal of Molecular Biology</source>
          <volume>48</volume>
          (
          <year>1970</year>
          )
          <fpage>443</fpage>
          -
          <lpage>453</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>