<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn>1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Semantic Comparison of Examination Regulations: A Prototype for Cross-Institutional Paragraph Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Douglas Blank</string-name>
          <email>douglas.blank@hhu.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Conrad</string-name>
          <email>stefan.conrad@hhu.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Heinrich Heine University</institution>,
          <addr-line>Universitätsstraße 1, 40225 Düsseldorf</addr-line>,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Examination regulations define key formal rules in higher education but are difficult to compare across institutions due to inconsistent formatting and heterogeneous structure. This paper presents a prototype pipeline for extracting and semantically analyzing paragraphs from German bachelor-level examination regulations. We apply regular-expression-based segmentation to isolate legal-style paragraphs and compare their content using TF-IDF vectors and sentence embeddings. Our initial results show that while surface-level representations suffice for intra-institutional comparisons, semantic embeddings, especially when fine-tuned, are necessary for reliable cross-institutional similarity. The proposed method lays the groundwork for large-scale regulatory analyses, enabling structured comparisons of academic policies across universities.</p>
      </abstract>
      <kwd-group>
        <kwd>Examination Regulations</kwd>
        <kwd>Paragraph Extraction</kwd>
        <kwd>Semantic Text Similarity</kwd>
        <kwd>Sentence Embeddings</kwd>
        <kwd>Legal Document Processing</kwd>
        <kwd>Cross-Institutional Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Text processing of legal documents is a highly challenging task in natural language processing (NLP)
due to their complex structure and domain-specific terminology. Moreover, legal texts are often used
in high-stakes contexts, such as by courts or legal professionals, where misinterpretations or errors
can lead to serious consequences, including the misapplication of laws or financial harm. Even in less
critical settings, individuals may rely on automatically processed legal content to make personal or
educational decisions, which further underscores the importance of robust and accurate methods.</p>
      <p>Although the legal documents addressed in this work, German bachelor-level examination regulations,
arguably belong to a lower-risk category, they still demand careful processing.
These documents occupy a small and underexplored niche within the broader domain of legal texts.
To the best of our knowledge, there exists no prior work that specifically targets this document type,
particularly in the German language.</p>
      <p>
        In recent years, research on legal NLP has grown substantially. While much of the focus has
been on text classification, the next most commonly addressed tasks are information extraction and
information retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The rise of word embeddings and especially transformer-based models has
further accelerated progress in this domain. Among these, BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and its legal-domain adaptation
LEGAL-BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are widely used for various downstream tasks such as classification, summarization,
and question answering on legal corpora.
      </p>
      <p>
        More recently, the use of large language models (LLMs) in legal NLP has attracted increased attention
[
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. However, their capabilities in legal reasoning, factual correctness, and explainability still require
further systematic investigation.
      </p>
      <p>Despite the availability of such pre-trained legal models, none are directly applicable to our setting.
For example, LEGAL-BERT is trained exclusively on English legal corpora, such as EU and US legislation
or court rulings. Its structure, vocabulary, and training context differ significantly from German
examination regulations. Currently, there exists no analogous model for German legal texts that
supports semantic similarity tasks at the paragraph level.</p>
      <p>
        The most notable German-language contribution is the Legal Entity Recognition (LER) dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which provides high-quality annotations for named entity recognition in legal documents. Based
on this dataset, German BERT models have been fine-tuned for Legal NER [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], achieving strong
results. However, these models are tailored to entity recognition rather than capturing paragraph-level
semantics.
      </p>
      <p>Our work therefore addresses a gap in the intersection of German legal NLP and semantic
similarity modeling. In contrast to prior work, we focus on comparing semantically related paragraphs
across structurally inconsistent legal documents using contrastive learning and fine-tuned sentence
embeddings.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Data and Problem Setting</title>
      <p>This work originates from a larger research project that investigates examination regulations of German
bachelor programs in the context of identifying factors associated with high dropout rates. A
comprehensive analysis requires access to a large and diverse collection of such regulations from universities
across Germany.</p>
      <p>However, there is no centralized repository for these documents, and not all universities publish
them openly. While we plan to issue formal data access requests to all German universities as part of
the broader project, this step has not yet been taken. For this initial study, we therefore rely on a small
sample of examination regulations manually collected from publicly available university websites.</p>
      <p>
        These regulations are typically available as PDF files, which must first be processed to extract their
textual content. Tools such as PyMuPDF [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are suitable for this task and work well for most documents.
However, in some cases these tools fail, either due to unusual formatting or because the PDFs are
actually scanned images without embedded text. From our experience, this issue occurs particularly
with older documents. In such cases, we fall back on optical character recognition (OCR) using tools
like Tesseract [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to recover the textual content.
      </p>
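<p>The fallback strategy can be sketched as follows; the concrete extractors (e.g. one based on PyMuPDF and an OCR-based one using Tesseract) are passed in as callables, since their exact invocation is not detailed in this work:</p>

```python
def extract_text(pdf_path, extractors, min_chars=50):
    """Try each text extractor in order, falling back to the next one
    when extraction fails or yields (near-)empty output, as happens
    with scanned PDFs that contain no embedded text layer."""
    for extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            continue  # this extractor failed outright; try the next
        if text and len(text.strip()) >= min_chars:
            return text
    return ""  # nothing usable could be recovered
```

<p>Ordering the fast text-layer extractor before OCR keeps the common case cheap while still recovering older, image-only documents.</p>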
      <p>While there is no standardized format for how examination regulations are structured, certain
recurring elements can be observed. Typically, a document may begin with a cover page, followed by a
short introductory section that places the regulation in a legal context, indicating, for example, which
overarching laws it builds upon or refines. Some documents also include a table of contents listing all
defined paragraphs.</p>
      <p>The main body of the document consists of the actual legal paragraphs, which appear sequentially and
define the specific rules for the respective bachelor’s program. These may cover topics such as admission
to examinations, required and elective modules, grading procedures, or the roles and responsibilities of
examiners. At the end, some documents include additional content that is not legally binding, such as
general information about the university or acknowledgments of staff members.</p>
      <p>For illustration, Figure 1 shows a typical excerpt from a regulation paragraph, covering rules related
to module examination outcomes. Note the legal style, nested structure, and the mixture of definitions
and conditions, all of which pose challenges for both text extraction and semantic analysis.</p>
      <p>§ 13 Modulprüfungen: Bestehen und Nichtbestehen
(1) Eine Prüfungsleistung ist mit Erfolg erbracht und die Modulprüfung somit bestanden, wenn sie
mindestens mit „ausreichend“ (kleiner oder gleich 4,0) bewertet wurde.
(2) Eine Modulprüfung wird als nicht bestanden bewertet, wenn sie mit der Note „nicht ausreichend“ (5,0)
bewertet wurde.
(3) Die kumulative Modulprüfung zu einem Modul ist bestanden, wenn alle geforderten
Prüfungsleistungen mit „ausreichend“ oder besser bewertet und alle geforderten Studienleistungen erbracht wurden.
Andernfalls wird die kumulative Modulprüfung mit der Note „nicht ausreichend“ (5,0) bewertet.
(4) Mit dem Bestehen der Modulprüfung sind alle gemäß Anhang auf das betreffende Modul entfallenden
Leistungspunkte erworben.</p>
    </sec>
    <sec id="sec-4-1">
      <title>3.1. Paragraph Extraction</title>
      <p>Assuming that all documents share a common structural core, consisting of a sequence of legal
paragraphs formatted as shown in Figure 1, we can apply regular expressions to extract these units.
Specifically, we assume that while the beginning and end of each document may vary (e.g., cover pages,
metadata, or appendices), the central portion follows a consistent paragraph-based layout.</p>
      <p>To extract individual paragraphs, we use a regular expression that identifies lines starting with the
paragraph symbol (§) followed by a paragraph number, and captures all subsequent text until the
beginning of the next paragraph or the end of the document:</p>
      <p><monospace>(^§\s*\d+.*?)(?=^§|\Z)</monospace></p>
      <p>The regular expression is applied in multiline mode, so that "^" matches the beginning of any line
rather than only the beginning of the entire document. In addition, dot-all mode is enabled, allowing the
wildcard operator "." to match newline characters as well. This configuration ensures that multi-line
paragraphs are captured correctly.</p>
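<p>In Python, the extraction described above can be sketched as follows (the helper name extract_paragraphs is illustrative; the pattern and the multiline/dot-all flags are those stated in the text):</p>

```python
import re

# A paragraph starts at a line beginning with "§" and a number, and runs
# up to (but not including) the next such line or the end of the text.
PARAGRAPH_RE = re.compile(r"(^§\s*\d+.*?)(?=^§|\Z)", re.MULTILINE | re.DOTALL)

def extract_paragraphs(text):
    """Return the legal-style paragraphs of one regulation document."""
    return [m.strip() for m in PARAGRAPH_RE.findall(text)]

sample = (
    "Vorwort\n"
    "§ 1 Geltungsbereich\nDiese Ordnung gilt ...\n"
    "§ 2 Ziel des Studiums\nZiel ist ..."
)
paragraphs = extract_paragraphs(sample)
# The preamble before the first "§" line is not captured;
# each "§"-block becomes one multi-line match.
```

<p>Note that, as discussed below, the final match absorbs everything up to the end of the document, including any trailing appendices.</p>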
      <p>While this regular expression provides a powerful baseline for extracting paragraphs from
examination regulations, it is far from perfect. One notable issue is that the final matched paragraph includes
not only the actual paragraph content, but also all remaining text until the end of the document. In
some cases, this includes metadata or university-related appendices that are not relevant to our analysis.</p>
      <p>A second type of artifact results from the preprocessing step in which the PDF is converted into
plain text. Since all visible content is extracted, including headers, footers, and page numbers, these
elements can appear inside the extracted paragraph blocks. This introduces noise into the resulting text
segments and may affect downstream processing.</p>
      <p>We attempt to reduce such noise through basic text refactoring operations. These include removing
lines that consist only of isolated numbers (e.g., page numbers), collapsing multiple empty lines, merging
orphaned single-character lines with their successors, and restoring hyphenated words that were split
across line breaks. While these steps improve the structure and readability of the extracted content,
they do not yet address document-specific artifacts such as recurring footers or university metadata.</p>
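<p>Some of these refactoring operations can be sketched as follows (a simplified illustration; the merging of orphaned single-character lines is omitted here):</p>

```python
import re

def clean_paragraph(text):
    """Reduce PDF-extraction noise in one extracted paragraph block."""
    # Drop lines consisting only of an isolated number (e.g., page numbers).
    lines = [ln for ln in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)
    # Collapse runs of empty lines into a single blank line.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    # Restore words hyphenated across line breaks ("Prüfungs-\nordnung").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text
```
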
      <p>We acknowledge these limitations and plan to refine our extraction pipeline in the future, for example
by incorporating layout-aware filtering or manually curated rules. For the purposes of this study,
however, we proceed with the current method and assume that remaining artifacts do not significantly
impact our preliminary similarity analysis.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Approach</title>
      <p>
        Now that we have a method for extracting paragraphs from the source documents, the next step is to
transform these text segments into a representation suitable for analysis. As a first approach, we use
the term frequency-inverse document frequency (TF-IDF) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] representation, implemented via the
scikit-learn library [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>TF-IDF produces a sparse vector representation of a document based on the frequency of its words,
scaled by how often those words appear across the entire corpus. While this approach does not capture
word order or context, it offers a robust and interpretable baseline for comparing textual content on a
statistical level. In our case, each extracted paragraph is treated as an individual document, resulting in
one TF-IDF vector per paragraph.</p>
      <p>These vector representations can now be compared to one another, for example by computing
the cosine similarity between pairs of paragraphs. As a first experiment, we apply this technique to
paragraphs extracted from examination regulations of bachelor programs at our institution, Heinrich
Heine University Düsseldorf.</p>
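<p>A minimal sketch of this step with scikit-learn, using toy stand-ins for extracted paragraphs (the variable names are illustrative):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for extracted "§"-paragraphs; the real input would be
# the cleaned paragraphs of one or more examination regulations.
paragraphs = [
    "§ 13 Eine Modulprüfung ist bestanden, wenn sie mit ausreichend bewertet wurde.",
    "§ 9 Die Modulprüfung gilt als bestanden, wenn die Note ausreichend erreicht wurde.",
    "§ 21 Der Prüfungsausschuss besteht aus fünf Mitgliedern.",
]

vectors = TfidfVectorizer().fit_transform(paragraphs)  # one sparse vector per paragraph
sim = cosine_similarity(vectors)                       # pairwise similarity matrix

# The two grading paragraphs share vocabulary and therefore score higher
# with each other than either does with the unrelated committee paragraph.
```

<p>The matrix sim is the kind of object visualized as a heatmap in this section.</p>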
      <p>To explore the similarity structure visually, we compute pairwise cosine similarities between all
paragraph representations and display the resulting matrix as a heatmap. This allows us to identify clusters
of paragraphs with potentially overlapping regulatory content. In Figure 2, we present a similarity
matrix based on a sample of TF-IDF vectors derived from paragraphs taken from two different examination
regulations: Business Economics and Financial and Actuarial Mathematics. A brief description of the
content of the respective paragraphs is provided in Table 1.</p>
      <p>As shown in the heatmap, one paragraph from the first document exhibits a high similarity to a
paragraph in the second document that covers the same regulatory topic. Among all other paragraph
pairs, similarity scores remain comparatively low, with the possible exception of §4 and §15, which
show slightly elevated similarity. This can be explained by overlapping content. Several rules stated in
§4 are reiterated or extended in §15.</p>
      <p>Overall, the similarity values align well with the thematic relationships between paragraphs,
suggesting that TF-IDF representations may already provide a reasonable baseline for intra-institutional
comparisons. However, this no longer holds when comparing documents from different institutions.</p>
      <p>In Figure 3, we show similarity scores between paragraphs from the Business Administration regulation
and another regulation from a different university, anonymized here as Study Program X. While some
topic-related paragraph pairs exhibit slightly higher similarity than unrelated pairs, the absolute scores
remain low and ambiguous. In other words, even thematically related paragraphs may not be clearly
distinguishable from unrelated ones based on TF-IDF alone.</p>
      <p>A likely explanation is that institution-specific stylistic and structural conventions influence word
distributions. Documents from the same university are often written by the same authors or follow
consistent editorial guidelines, which benefits TF-IDF but fails to generalize. Since TF-IDF is based
purely on term frequency, it captures surface-level lexical similarity but ignores semantic meaning.</p>
      <p>
        To address this limitation, we explore sentence-level embeddings generated by pre-trained
SentenceTransformers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. These models map entire sentences or paragraphs into dense vector representations
that aim to reflect semantic content more robustly and contextually.
      </p>
      <p>
        In our experiments, we evaluated several pre-trained Sentence-Transformer models, including the
multilingual cross-en-de-roberta-sentence-transformer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, these models did not yield
satisfactory results for our task. As shown in Figure 4, which replicates the previous comparison from Figure 2,
most paragraphs appear overly similar to one another, making meaningful distinctions difficult.
      </p>
      <p>We hypothesize that this is due to the general-purpose nature of these models, which are trained on
diverse and broad datasets. As a result, they tend to encode high-level contextual similarities, such as
the fact that all input texts are part of examination regulations, while underestimating finer-grained
differences in content.</p>
      <p>We therefore propose training a specialized model on domain-specific data. To the best of our
knowledge, no sentence embedding model currently exists that has been trained on German examination
regulations. As a next step, we aim to construct such a model and assess whether it can better capture
paragraph-level distinctions relevant to our analysis.</p>
      <p>For this purpose, we are currently building a small, manually curated dataset of examination
regulations. As outlined earlier, we intend to submit formal data access requests to all German universities,
asking for bachelor-level examination regulations from the past ten years. While the eventual
response rates remain uncertain, we hope this effort will enable the creation of a sufficiently large and
representative corpus to support domain-specific model training.</p>
      <p>
        We train our model using a contrastive learning approach inspired by self-supervised methods [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], based on cross-document paragraph similarity. Let D = {d<sub>1</sub>, d<sub>2</sub>, …, d<sub>n</sub>} be the set of documents, where
each document d<sub>i</sub> consists of a set of paragraphs.
      </p>
      <p>For every paragraph p ∈ d<sub>i</sub>, we construct one positive pair per other document d<sub>j</sub> with j ≠ i.
Specifically, we define the paragraph p* ∈ d<sub>j</sub> that is most similar to p as its positive counterpart:</p>
      <p>(p, p*) is a positive pair ⟺ p* = arg max<sub>q ∈ d<sub>j</sub></sub> sim(p, q)</p>
      <p>In addition, all other paragraphs q ∈ d<sub>j</sub> ∖ {p*} and all paragraphs q ∈ d<sub>i</sub> ∖ {p} are used to generate
negative pairs:</p>
      <p>(p, q) is a negative pair ⟺ q ∈ (d<sub>j</sub> ∖ {p*}) ∪ (d<sub>i</sub> ∖ {p})</p>
      <p>This strategy results in n − 1 positive pairs and many more negative pairs per paragraph. We employ
a contrastive loss to bring semantically aligned content (positive pairs) closer together in the embedding
space, while pushing apart unrelated or less relevant paragraphs (negative pairs). The contrastive loss
is defined as</p>
      <p>ℒ = y · ‖u − v‖<sup>2</sup> + (1 − y) · max(0, m − ‖u − v‖)<sup>2</sup>,</p>
      <p>where u and v are the embeddings of two paragraphs, and y ∈ {0, 1} is a binary label indicating whether
the respective paragraphs form a positive pair (y = 1) or a negative pair (y = 0). The margin parameter
m &gt; 0 defines the minimum distance that negative pairs should be apart in the embedding space. If the
embeddings of a negative pair are closer than this threshold, the loss increases, encouraging the model
to push them further apart.</p>
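<p>The pair-construction scheme and contrastive loss can be sketched as follows (a simplified NumPy illustration; sim is assumed to be a precomputed similarity lookup over cross-document paragraph pairs):</p>

```python
import numpy as np

def build_pairs(doc_a, doc_b, sim):
    """Derive training pairs between two documents.

    doc_a and doc_b are lists of paragraph identifiers; sim maps a
    cross-document pair (p, q) to a similarity score (e.g. cosine
    similarity under a baseline model, or a manual judgement).
    """
    positives, negatives = [], []
    for p in doc_a:
        # The most similar paragraph of the other document is the positive partner.
        q_star = max(doc_b, key=lambda q: sim[(p, q)])
        positives.append((p, q_star))
        # All remaining paragraphs of either document yield negative pairs.
        negatives += [(p, q) for q in doc_b if q != q_star]
        negatives += [(p, q) for q in doc_a if q != p]
    return positives, negatives

def contrastive_loss(u, v, y, margin=0.5):
    """L = y * d**2 + (1 - y) * max(0, margin - d)**2, with d = ||u - v||."""
    d = np.linalg.norm(u - v)
    return y * d**2 + (1 - y) * max(0.0, margin - d) ** 2
```

<p>Positive pairs are pulled together (loss grows with their distance), while negative pairs only contribute loss when they fall inside the margin.</p>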
      <p>While we currently do not possess a dataset large enough to train a sentence embedding model from
scratch, we keep this option open and may revisit it once a sufficiently large collection of examination
regulations becomes available through our formal data access requests.</p>
      <p>In the meantime, we fine-tune the pre-trained cross-en-de-roberta-sentence-transformer on our
initial dataset. If this adapted model shows improved performance on our similarity task, we intend to
further refine and scale this approach as additional data becomes available.</p>
      <p>At the current stage of our work, we use a dataset consisting of only 18 examination regulation
documents, handpicked from publicly accessible university websites across several German federal
states. As mentioned before, documents originating from the same institution tend to follow similar
stylistic and structural patterns, either because they are authored by the same administrative offices or
because newer documents intentionally replicate earlier formats for the sake of consistency.</p>
      <p>To avoid overfitting to such layout- or style-specific patterns, we deliberately include documents
from a diverse range of institutions. This variety is intended to encourage the model to focus on the
semantic content of paragraphs rather than superficial formatting similarities.</p>
      <p>From the extracted paragraphs, we construct a training dataset consisting of 217 manually defined
positive pairs and 3,244 derived negative pairs, following the contrastive learning scheme described
earlier. At this stage, we rely on small supervised samples for the positive examples, as we have not yet
identified a suitable heuristic or learning signal to automate the matching of semantically equivalent
paragraph pairs.</p>
      <p>One approach we are considering for future iterations is to use the fine-tuned model itself as a weak
supervision signal. Specifically, we could identify, for each paragraph, its most similar counterpart in
other documents according to the current model, and use these matches as positive training pairs. A
model trained on this automatically generated data could then be used to produce the next generation
of labels, allowing for a process of iterative refinement.</p>
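<p>This mining step could be sketched as follows (an illustrative NumPy version, assuming row-wise paragraph embeddings for two documents):</p>

```python
import numpy as np

def mine_positive_pairs(emb_a, emb_b):
    """For each paragraph embedding in emb_a (one row per paragraph),
    return the index of its most similar counterpart in emb_b under
    cosine similarity. The resulting (i, j) matches could serve as the
    next generation of positive pairs in a self-labelling loop."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T  # cosine similarity matrix
    return [(i, int(np.argmax(sim[i]))) for i in range(len(a))]
```
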
      <p>However, if the generated pairs are of insufficient quality, the model may be guided in the wrong
direction, potentially reinforcing noise or drifting away from meaningful semantic distinctions. We
leave the exploration of appropriate strategies for such self-supervised label generation to future work.</p>
      <p>Fine-tuning is performed using the AdamW optimizer on an NVIDIA RTX 5080 GPU with a batch
size of 32 for 500 training steps. We use a learning rate of 2 × 10<sup>−5</sup> and a warmup ratio of 0.1. These
settings were selected as reasonable defaults and not subjected to extensive hyperparameter tuning, as
optimization of training dynamics is beyond the scope of this work.</p>
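<p>The warmup schedule can be made concrete as follows (a sketch assuming linear warmup followed by linear decay, a common default; the text above specifies only the warmup ratio, so the decay shape is an assumption):</p>

```python
def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Learning rate under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# With 500 training steps and a warmup ratio of 0.1, warmup covers the
# first 50 steps; the rate peaks at 2e-5 and then decays linearly to 0.
```
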
      <p>
        The training procedure is implemented using the SentenceTransformers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] framework, including its ContrastiveLoss module with a margin of m = 0.5.
      </p>
      <p>(Figure: similarity scores for the Financial and Actuarial Mathematics paragraphs §4, §5, §15, §16, and §18 of 0.00, 0.09, 0.21, 0.25, and 0.43, respectively.)</p>
      <p>The resulting model produces more consistent and interpretable similarity scores between paragraphs.
As illustrated in the comparison between Figure 4 and Figure 5, semantically related paragraphs are more
accurately identified after fine-tuning. Additionally, the fine-tuned model enables meaningful
comparisons between paragraphs from different documents (Figure 6), a task in which TF-IDF representations
had previously failed (cf. Figure 3).</p>
      <p>These results demonstrate the feasibility of our approach: we successfully conceptualized and
implemented a pipeline that extracts paragraphs from examination regulations, transforms them into
vector-based representations, and enables semantic comparison across documents. While this study
represents an early proof-of-concept, the observed improvements suggest that our methodology is
promising, particularly if the acknowledged limitations are addressed and the system is scaled to a
broader dataset.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and Outlook</title>
      <p>This paper presented an initial step toward a broader, long-term analysis of examination regulations in
German higher education. The work is part of an interdisciplinary project that investigates institutional
rules and responsibilities as potential factors contributing to high dropout rates or prolonged study
durations.</p>
      <p>To support this research, we plan to develop a system that allows for cross-university comparisons
of regulatory structures. One envisioned feature is the ability to identify and highlight differences in
specific regulatory aspects, such as exam admission criteria, between institutions with significantly
different dropout rates. Users would be able to select a given paragraph from one university’s regulation
and retrieve semantically similar paragraphs from others.</p>
      <p>With this larger goal in mind, the present paper first examined the structure of examination regulation
documents and proposed a simple method to extract individual paragraphs using regular expressions.
Although this method is effective as a baseline, it occasionally captures irrelevant text fragments, such
as page numbers or footers introduced during PDF preprocessing. We plan to refine this step in future
iterations to improve extraction quality and structural segmentation.</p>
      <p>In the second part, we explored how extracted paragraphs can be represented in ways that reflect
their underlying semantic meaning, enabling pairwise comparison. While TF-IDF vectors offer a simple
and interpretable representation, they are often insufficient when comparing documents from different
institutions, due to stylistic and structural inconsistencies. To address this, we fine-tuned a pre-trained
sentence transformer model using a contrastive learning approach and demonstrated that the resulting
embeddings yield more meaningful semantic similarity scores across institutional boundaries.</p>
      <p>Although we currently rely on a small set of manually defined positive pairs, the proposed training
method shows promising results even on limited data. As a next step, we aim to expand this training
process by developing a fully self-supervised method for generating high-quality positive pairs. We
believe that scaling the model with a larger and more diverse dataset, which is currently under construction,
will significantly improve its semantic understanding.</p>
      <p>In addition to the limitations of the training data, it is also important to consider structural assumptions
inherent in our current approach. Specifically, we assume that the examination regulations being
compared follow a structurally similar format, that is, each paragraph in one document has a clearly
corresponding counterpart in the other. However, this is not always the case. For example, a concept
that is expressed within a single large paragraph in one document may be distributed across multiple
smaller paragraphs in another. Conversely, a given paragraph may not have any meaningful counterpart
at all.</p>
      <p>While such discrepancies could potentially be mitigated by allowing flexible similarity thresholds
or by aggregating multiple paragraphs for comparison, another complication arises in this specific
data setting. Some universities define general examination regulations that apply across all programs,
and supplement these with program-specific rules. In such cases, a complete representation of the
regulation requires the combination of multiple documents, which further complicates the alignment
task.</p>
      <p>Beyond fine-tuning, we also consider training a model entirely from scratch in the future. In addition,
further models could be developed for related tasks such as named entity recognition and question
answering within the context of examination regulations, depending on future project requirements.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the Federal Ministry of Research, Technology and Space (BMFTR, formerly
BMBF) under grant number 16FG001B as part of the project "RegelWerk".
The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>D. M.</given-names> <surname>Katz</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Hartung</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Gerlach</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Jana</surname></string-name>,
          <string-name><given-names>M. J.</given-names> <surname>Bommarito II</surname></string-name>,
          <article-title>Natural language processing in the legal domain</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/2302.12039.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>,
          <year>2019</year>. URL: https://arxiv.org/abs/1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>I.</given-names> <surname>Chalkidis</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Fergadiotis</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Malakasiotis</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Aletras</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Androutsopoulos</surname></string-name>,
          <article-title>LEGAL-BERT: The muppets straight out of law school</article-title>,
          <year>2020</year>. URL: https://arxiv.org/abs/2010.02559.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Large language models in law: A survey</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2312.03718.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Anh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Minh</surname>
          </string-name>
          ,
          <article-title>The impact of large language modeling on natural language processing in legal texts: A comprehensive survey</article-title>
          ,
          <source>in: 2023 15th International Conference on Knowledge and Systems Engineering (KSE)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi: 10.1109/KSE59128.2023.10299488.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rehm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moreno-Schneider</surname>
          </string-name>
          ,
          <article-title>Fine-grained Named Entity Recognition in Legal Documents</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudré-Mauroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maleshkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sure-Vetter</surname>
          </string-name>
          (Eds.),
          <source>Semantic Systems. The Power of AI and Knowledge Graphs. Proceedings of the 15th International Conference (SEMANTiCS 2019)</source>
          ,
          <source>number 11702 in Lecture Notes in Computer Science</source>
          , Springer, Karlsruhe, Germany,
          <year>2019</year>
          , pp.
          <fpage>272</fpage>
          -
          <lpage>287</lpage>
          , 10/11 September 2019.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Darji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitrović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <article-title>German BERT model for legal named entity recognition</article-title>
          ,
          <source>in: Proceedings of the 15th International Conference on Agents and Artificial Intelligence, SCITEPRESS - Science and Technology Publications</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>723</fpage>
          -
          <lpage>728</lpage>
          . URL: http://dx.doi.org/10.5220/0011749400003393.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Artifex Software, Inc.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>McKie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>PyMuPDF: Python bindings for MuPDF</article-title>
          , https://pymupdf.readthedocs.io/,
          <year>2025</year>
          . Version 1.26.3.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>An overview of the Tesseract OCR engine</article-title>
          ,
          <source>in: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)</source>
          , volume
          <volume>2</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>629</fpage>
          -
          <lpage>633</lpage>
          . doi: 10.1109/ICDAR.2007.4376991.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <source>Introduction to Information Retrieval</source>
          , Cambridge University Press, Cambridge, UK,
          <year>2008</year>
          . URL: http://nlp.stanford.edu/IR-book/information-retrieval-book.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Buitinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Louppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Niculae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grobler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Layton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Holt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <article-title>API design for machine learning software: experiences from the scikit-learn project</article-title>
          ,
          <source>in: ECML PKDD Workshop: Languages for Data Mining and Machine Learning</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <article-title>cross-en-de-roberta-sentence-transformer</article-title>
          , https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer,
          <year>2020</year>
          . Licensed under the MIT License. Copyright (c) 2020 Philip May, T-Systems on site services GmbH.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Makedon</surname>
          </string-name>
          ,
          <article-title>A survey on contrastive self-supervised learning</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2011.00362.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>