<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring In-Context Learning Strategies for Temporal Ordering of Legal Events using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Cacioli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cagliero</string-name>
          <email>luca.cagliero@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Tarasconi</string-name>
          <email>francesco.tarasconi@staff.aruba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Legal AI, Temporal Reasoning, Large Language Models</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aruba AI Srl</institution>
          ,
          <addr-line>Corso Francia 2/bis, 10143 Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Corso Duca degli Abruzzi 24, 10129 Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>13971</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Large Language Models (LLMs) are increasingly adopted for legal document understanding by attorneys and legal consultants. Despite advances in adapting LLMs to legal terminology and domain-specific linguistic nuances, the LLMs' ability to reason about temporal relations in legal documents remains largely underexplored. In this work, we explore the capabilities of LLMs to verify the correctness of a legal temporal ordering clause and to classify the type of temporal relationship between two legal entities. The results achieved on a public English-written benchmark show that (1) instruction-based models generally perform better than the corresponding chat versions; (2) LLMs' reasoning capabilities are typically only marginally useful for the specific temporal reasoning tasks; (3) LLMs under a Few-Shot Learning (FSL) setting turn out to be the most effective, with Grok 4 surpassing the state of the art.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have demonstrated remarkable legal document understanding and
generation capabilities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Within the legal domain, the most established tasks encompass (1) content
search [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], (2) document review [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], and (3) prediction [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. The latter category of tasks also
includes the deep understanding of complex semantic relations in text, such as legal entailment types,
rhetorical roles, and temporal relations.
      </p>
      <p>
        Reconstructing temporal relationships is known to be particularly challenging for LLMs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Specifically, previous studies have shown that most LLMs fall short when they are asked to either update a
knowledge base or adapt their responses to time-evolving scenarios [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        So far, limited research efforts have been devoted to addressing temporal reasoning in the legal
domain. For example, in LexTime [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] the authors address a prediction task which entails predicting
whether a temporal ordering relationship between a pair of events mentioned in the document text
(e.g., event A precedes event B) is true or false.
      </p>
      <p>The main limitations of state-of-the-art works on temporal reasoning for legal document
understanding are enumerated below.</p>
      <p>
        • Lack of Deep Reasoning: They analyze classical textual LLMs belonging to the LLaMA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
GPT [12], and Mistral [13] families while ignoring the LLMs that have been specifically pretrained
with deep reasoning capabilities.
• Binary Verification: They analyze the zero-shot and few-shot LLM capabilities to verify whether
a given statement is correct or not [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], leaving open more challenging legal understanding tasks,
such as the automatic detection of the type of event ordering.
• Limited exploration of the models' efficiency: They do not delve into the analysis of
relevant technical aspects, such as context length and model inference costs.
      </p>
      <p>Published in the Proceedings of the Workshops of the EDBT/ICDT 2026 Joint Conference (March 24-27, 2026), Tampere, Finland</p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>This paper addresses the above-mentioned issues. Specifically, it not only studies the LLM capabilities
to verify the correctness of a legal temporal ordering clause, but also to classify the type of temporal
relationship between two legal entities. It also empirically compares chat- and instruct-based LLMs,
LLMs with and without deep reasoning, and models with different sizes, context lengths, and inference
costs.</p>
      <p>The results achieved by Grok 4<sup>1</sup> under a few-shot learning setting surpass the state of the art on
the binary verification task (accuracy: Grok 4 85.3% vs. GPT 4 80.8%), and the model achieves robust performance
on the multi-class event ordering classification task. Notably, the LLMs with deep reasoning capabilities
achieve only marginal improvements or none at all, likely because they incorporate a limited
background in the legal domain.</p>
      <p>The remainder of this paper is organized as follows. Section 2 formalizes the established and the new
temporal reasoning tasks. Section 3 presents the proposed methodology, while Section 4 summarizes the
main experiments. Finally, we draw conclusions and discuss future research developments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem statement</title>
      <p>
        Given a legal document, we extract from it a context paragraph mentioning a sequence of two legal
events ⟨a, b⟩. Events a and b are either one implicit and one explicit event or two explicit events [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In
compliance with [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], every event is defined by an occurrence or action triggered by a verb or noun
taking place at a specific moment.
      </p>
      <p>In the following we define the tasks addressed in this work.</p>
      <sec id="sec-2-1">
        <title>Legal Event Temporal Ordering Verification</title>
        <p>Given an ordered sequence ⟨a, b⟩ consisting of events a and b and an arbitrary temporal relationship r, this task, hereafter denoted by LETOV for the
sake of brevity, aims to verify whether the statement a r b (e.g., event a precedes event b) holds (target
response: yes) or not (target response: no).</p>
      </sec>
      <sec id="sec-2-2">
        <title>Legal Event Temporal Ordering Classification</title>
        <p>Given an ordered sequence of events ⟨a, b⟩ and
a predefined set of temporal relationships {r<sub>1</sub>, r<sub>2</sub>, ..., r<sub>n</sub>} (e.g., precedes, subsequent, contemporary),
this task, hereafter denoted by LETOC for the sake of brevity, has the goal of predicting the correct
temporal relationship between events a and b.</p>
        <p>
          With the goal of deepening the analysis of the LLMs' capabilities in legal temporal reasoning,
we introduce LETOC as a new task extending LETOV [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        To assess the LLMs' capabilities to address LETOV and LETOC we apply the following steps. Firstly, we
enrich the statements originally included in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with different prompting styles, including chat and
instruct versions as well as zero-shot and few-shot learning settings. Based on the results observed in
the preliminary experiments (see Section 4 for further details), we decided to employ only the instruct
style from the second setting onward, since the choice of prompting style has little impact on overall performance.
      </p>
      <p>Secondly, we design a testing framework that can uniquely identify a given prompt for a given model
and that stores the history of experiments’ outcomes.</p>
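      <p>As an illustration of such a framework, the following minimal Python sketch derives a deterministic identifier from a (model, prompt) pair and keeps a history of outcomes per identifier. The names run_id and History, the use of a truncated SHA-256 digest, and the in-memory storage are assumptions made for this example and do not necessarily reflect the implementation used in the experiments.</p>
      <preformat>
```python
import hashlib
import json


def run_id(model: str, prompt: str) -> str:
    """Return a deterministic identifier for a (model, prompt) pair."""
    # Serializing as JSON keeps the model/prompt boundary unambiguous.
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class History:
    """In-memory store of experiment outcomes, keyed by run identifier."""

    def __init__(self):
        self._runs = {}

    def record(self, model: str, prompt: str, outcome: str) -> str:
        """Append an outcome to the history of this (model, prompt) pair."""
        key = run_id(model, prompt)
        self._runs.setdefault(key, []).append(outcome)
        return key

    def outcomes(self, model: str, prompt: str) -> list:
        """Return a copy of all outcomes stored for this (model, prompt) pair."""
        return list(self._runs.get(run_id(model, prompt), []))
```
      </preformat>
      <p>Keying the history on a hash of both the model name and the full prompt text means that any change to either produces a new entry, while repeated runs of the same configuration accumulate under the same key.</p>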
      <p>Lastly, we collect the results of a grid search over multiple models, settings, and prompting strategies.
The grid search spans the models, the number of shots (0, 1, or 3), and the reasoning effort levels
(low, medium, or high).</p>
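      <p>The grid described above can be enumerated as in the following minimal Python sketch. The model identifiers are placeholders and the function name build_grid is hypothetical; only the shot counts and the effort levels come from the text.</p>
      <preformat>
```python
from itertools import product

# Placeholder model identifiers (assumptions for illustration only).
MODELS = ("model-a", "model-b")
SHOTS = (0, 1, 3)
EFFORTS = ("low", "medium", "high")


def build_grid(models=MODELS, shots=SHOTS, efforts=EFFORTS):
    """Return one configuration dict per (model, shots, effort) combination."""
    return [
        {"model": model, "shots": k, "effort": effort}
        for model, k, effort in product(models, shots, efforts)
    ]
```
      </preformat>
      <p>In practice, combinations that a model does not support (for example, an effort level for a model without reasoning) would be filtered out before running the experiments.</p>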
      <sec id="sec-3-1">
        <title>1 https://x.ai/news/grok-4 (last accessed: January 7, 2026)</title>
        <p>Chat vs. Instruct-based models We experiment with two main classes of prompts: chat and instruct.
The chat style is the most common way to prompt an LLM, as most interfaces are designed with this
principle in mind. Recent works [14] have inspired the creation of models that perform best when
dealing with instructions. Hence, we also experiment with the instruct style to compare the effect of the two styles on legal
temporal reasoning.</p>
        <p>The instruct prompts selected for LETOV use the following template:</p>
        <preformat>You are a legal expert that never makes mistakes and that
never hallucinates.
Give your unbiased opinion on the following events about
their temporal relationship.
Do not make mistakes.
Consider these examples:
# Example 1
Given this context: '$example_context1'
For the statement '$example1'
You should answer '$label1'
...(other examples or no examples at all)
In the context: $context
Verify the soundness of this statement: $question
Only answer with one word: if the statement is correct,
answer with the word "Entailment"; whereas if the
statement is wrong, answer with the word "Contradiction"</preformat>
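        <p>The $-prefixed placeholders of the template can be filled, for instance, with Python's string.Template. The sketch below is an assumption about how the prompt could be assembled: the function name build_letov_prompt and the per-example template (with a generic $index instead of numbered placeholders such as $example_context1) are introduced here for illustration only.</p>
        <preformat>
```python
from string import Template

INSTRUCTIONS = (
    "You are a legal expert that never makes mistakes and that never hallucinates.\n"
    "Give your unbiased opinion on the following events about their temporal relationship.\n"
    "Do not make mistakes."
)

# One block per in-context example (hypothetical generalization of "# Example 1").
EXAMPLE = Template(
    "# Example $index\n"
    "Given this context: '$example_context'\n"
    "For the statement '$example'\n"
    "You should answer '$label'"
)

QUERY = Template(
    "In the context: $context\n"
    "Verify the soundness of this statement: $question\n"
    "Only answer with one word: if the statement is correct, "
    'answer with the word "Entailment"; whereas if the statement is wrong, '
    'answer with the word "Contradiction"'
)


def build_letov_prompt(context, question, examples=()):
    """Assemble a zero-shot (no examples) or few-shot LETOV instruct prompt."""
    parts = [INSTRUCTIONS]
    if examples:
        parts.append("Consider these examples:")
        for index, (ex_context, ex_statement, ex_label) in enumerate(examples, start=1):
            parts.append(EXAMPLE.substitute(
                index=index,
                example_context=ex_context,
                example=ex_statement,
                label=ex_label,
            ))
    parts.append(QUERY.substitute(context=context, question=question))
    return "\n".join(parts)
```
        </preformat>
        <p>Passing an empty sequence of examples yields the zero-shot prompt, while one or three (context, statement, label) triples yield the one-shot and three-shot variants.</p>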
        <p>The selected LETOV chat prompt, instead, has the following structure.</p>
        <preformat>I am examining this paragraph from a legal context and
I want to extrapolate the temporal relations between
two events. I absolutely need these to be correct, no
mistakes allowed.
This is my context: $context
This is my statement: $question
I need a one word answer: if the statement is correct,
answer with the word "Entailment"; whereas if the statement
is wrong, answer with the word "Contradiction"</preformat>
        <p>For LETOC we focus on instruct prompts, using the following template:</p>
        <preformat>You are a legal expert that never makes mistakes
and that never hallucinates. Give your unbiased
opinion on the following events about their temporal
relationship. You must pick one of three temporal
relations from a set. Do not make mistakes.
Consider these examples:
# Example 1
Given this context: '$example_context1'
For the events:
Event A: '$example_a1'
Event B: '$example_b1'
You should answer '$label1'
Only answer with only one word representing the relation:
- If Event A follows event B, answer "follows"
- If Event A precedes event B, answer "precedes"
- If the two events happen at the same time,
answer "simultaneous"</preformat>
        <p>Hardware resources and services We ran our experiments using the LLM-as-a-Service OpenRouter
platform<sup>2</sup>. The experiments took around 50 hours, and the overall cost was $173.88.</p>
        <p>To prepare the inputs and postprocess the results, we used a machine equipped with 16 GB of RAM,
an AMD Ryzen AI 7 PRO 350 CPU, and a 512 GB SSD, running Windows 11 Pro.</p>
        <p>
          Dataset We adapt the LexTime open benchmark [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to address both the LETOV and LETOC tasks.
        </p>
        <p>LexTime is composed of legal contexts taken from U.S. federal complaints between 2020 and 2024.
Its authors randomly sampled complaints categorized under the Nature of Suit (NOS) codes beginning with
7, which correspond to labor-related cases. Alongside each context, it contains a statement in natural
language about two events. Each statement is associated with a binary label: "entailment" if the statement
is sound, "contradiction" otherwise. Each statement also has some metadata about the nature of the
pair of events: whether both are explicitly mentioned in the context, or whether one of them can only be
deduced by a legal expert, in which case it is marked as implicit. Our study disregards the effect of this metadata,
as it mainly focuses on temporal relations between legal entities.</p>
        <p>The dataset curation consisted of the following steps. Firstly, we selected only the statements that are
logically sound (i.e., labeled as entailment), as it is impossible to deduce the event relation from contradicting statements. Secondly,
we used a regular expression to extract each of the temporal relations that appear in LexTime.
Finally, we aggregated similar relations into three classes:
• precedes: for pairs of events where the first happens before the second
• follows: for pairs of events where the first happens after the second
• simultaneous: for pairs of events where the first and the second happen at the same time.</p>
        <p>Hereafter, we will refer to this smaller dataset as the multi-class dataset.</p>
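        <p>The curation steps can be sketched in Python as follows. The surface forms listed in RELATION_CLASSES are hypothetical (the actual wordings in LexTime and the regular expression used in this work may differ), and the function names are introduced for this example only.</p>
        <preformat>
```python
import re

# Hypothetical surface forms; the actual LexTime wordings may differ.
RELATION_CLASSES = {
    "precedes": ("before", "precedes", "prior to"),
    "follows": ("after", "follows", "subsequent to"),
    "simultaneous": ("at the same time as", "simultaneously with", "during"),
}

# Map each surface form back to its aggregated class.
PHRASE_TO_CLASS = {
    phrase: cls for cls, phrases in RELATION_CLASSES.items() for phrase in phrases
}

# Longer phrases first, so multi-word forms win over shorter alternatives.
_PHRASES = sorted(PHRASE_TO_CLASS, key=len, reverse=True)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in _PHRASES) + r")\b",
    re.IGNORECASE,
)


def relation_class(statement):
    """Return the aggregated class of the first relation found, or None."""
    match = PATTERN.search(statement)
    if match is None:
        return None
    return PHRASE_TO_CLASS[match.group(1).lower()]


def curate(rows):
    """Keep sound statements only and attach their aggregated relation class.

    Each row is a (context, statement, label) triple.
    """
    curated = []
    for context, statement, label in rows:
        if label != "entailment":
            continue  # contradicting statements do not reveal the true relation
        cls = relation_class(statement)
        if cls is not None:
            curated.append((context, statement, cls))
    return curated
```
        </preformat>
        <p>Statements in which no known relation is found are dropped in this sketch; a real pipeline might instead log them for manual inspection.</p>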
        <p>Models We benchmark the performance of the state-of-the-art LLMs reported in Table 1. For each
model we also report its reasoning availability and whether the reasoning effort specification is
supported, the cost expressed in $ per million output tokens, and whether it is an instruct
model or not. Open-source models are also indicated.</p>
        <p>In the experiments we explored the following dimensions of analysis:
• Model openness: We compared open-source and proprietary models. We focus on state-of-the-art
models, testing a selection of models all released after April 2025.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2 https://openrouter.ai/ (last accessed: January 10, 2026)</title>
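        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Tested LLMs and their reasoning availability.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Reasoning</th></tr>
            </thead>
            <tbody>
              <tr><td>Grok 4 [15]</td><td>Yes</td></tr>
              <tr><td>Claude Sonnet 4.5 [16]</td><td>Yes</td></tr>
              <tr><td>OpenAI GPT-5.2 [17]</td><td>Effort specifiable</td></tr>
              <tr><td>OpenAI o3 [18]</td><td>Effort specifiable</td></tr>
              <tr><td>Gemini 3 flash prev [19]</td><td>Yes</td></tr>
              <tr><td>DeepSeek V3.2 [20]</td><td>Yes</td></tr>
              <tr><td>OpenAI GPT OSS 120b [21]</td><td>Effort specifiable</td></tr>
              <tr><td>Mistral Devstral 2 2512 [22]</td><td>No</td></tr>
              <tr><td>Qwen3 Instruct 2507 [23]</td><td>No</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>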
• Model dimension and context length: We tested models with context sizes ranging from
131,072 to 1,048,576 tokens. Extending the preliminary work presented in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and other works [24]
that had already promoted the usefulness of large contexts in the legal domain, we aim to study the
impact of very large context lengths on models' performance.
• Effect of deep reasoning: To test the impact of the reasoning capabilities, we consider models
with and without this feature (see Section 4 for more details).
• Instruct vs. chat setting: We compare chat vs. instruct-based LLMs. Given the recent LLMs'
alignment to human preferences [14], we explore instruction tuning as an alternative to chat
models.
        </p>
        <p>Settings We test three different LETOV settings. The first one is aimed at discovering the impact of the
instruct style prompt, as well as the model's own preference towards a more friendly and conversational
prompt or a stricter, direct order.</p>
        <p>In the second setting we verify whether content adaptation strategies are beneficial to enhance legal
temporal reasoning performance. We also empirically verify whether the reasoning models are better at
generalizing from the examples and therefore at applying the reasoning to the question.</p>
        <p>In the last setting we change the number of tokens that the models can dedicate to reasoning
by specifying an effort parameter. The effort parameter can take one of several values. We experiment
with the values 'low', 'medium', and 'high'.</p>
        <p>For LETOC we test only the last two settings of the previous task with slight modifications. Firstly,
we test and compare zero-, one-, and three-shot learning. Lastly, we once again test how the reasoning
effort affects the performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental results</title>
      <p>
        We measure the LETOV and LETOC performance of different combinations of models and settings in
terms of classification accuracy (i.e., the percentage of correctly classified samples, similarly to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]).
Furthermore, we also evaluate the per-class performance in terms of precision, recall, and F1-score. For
LETOC we adopt the weighted versions of the metrics to reduce the impact of class imbalance.
      </p>
      <p>Results and discussion Table 2 reports the performance scores for every run in the
instruct style, together with a differential score Δ. Δ is defined as the gap between the score of the classifier
prompted with the instruct prompts and the same metric for the classifier prompted with the chat style. For
every metric m ∈ {Accuracy, Precision, Recall, F1},</p>
      <disp-formula><tex-math>\Delta_m = m_{\mathrm{instruct}} - m_{\mathrm{chat}}</tex-math></disp-formula>
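      <p>The differential score can be computed per metric as in the following minimal Python sketch. The scores used here are made-up values chosen only to illustrate the sign convention (they are not results from this paper), and the function name delta is hypothetical.</p>
      <preformat>
```python
METRICS = ("Accuracy", "Precision", "Recall", "F1")


def delta(instruct_scores, chat_scores):
    """Return, for each metric, the instruct score minus the chat score."""
    return {m: instruct_scores[m] - chat_scores[m] for m in METRICS}


# Made-up scores, used only to illustrate the sign convention.
instruct = {"Accuracy": 80.0, "Precision": 78.5, "Recall": 82.0, "F1": 80.25}
chat = {"Accuracy": 79.5, "Precision": 79.0, "Recall": 81.0, "F1": 80.0}
gaps = delta(instruct, chat)
```
      </preformat>
      <p>A positive value means that the instruct prompt scores higher than the chat prompt on that metric, whereas a negative value means the opposite.</p>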
      <p>
        Based on the reported Δ values, the prompting style appears to have a limited effect. In
addition, as shown by the F1-score results, most models marginally benefit from the instruction
prompting style. For this reason, we then further explore the instruction-based LLMs. Qwen 3 Instruct [23]
underperforms the large proprietary models, with Grok 4 [15] outperforming the other approaches, except
for Claude Sonnet 4.5 [16]. Devstral [22] instead achieves a very noticeable 92.37% recall, while getting
lower precision scores. The LETOV performance has improved compared to the state of
the art (80.8% accuracy) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Table 3 reports the LETOV results under few-shot learning. For the sake of
completeness and clarity, we also repeat the zero-shot instruct experiment from the previous setting.
Reasoning models are expected to perform better in this setting, as the reasoning is further helped by the
examples. Here, Grok 4 [15] proves once again to be able to outperform all the other models,
though to a limited extent. The performance achieved by Gemini 3 and Sonnet closely
follows that of Grok, though it consistently lags behind. Devstral confirms its tendency towards
high recall. Overall, the presence of one or a few examples helps the model's generalization
capabilities, as expected. Finally, the top accuracy score of 84.48 achieved by Grok 4 in the three-shot setting
further surpasses the one from the previous setting.</p>
      <p>For the final setting of the binary classification task, we report the results in Table 4. The effect of
reasoning seems to be limited. However, once more, Grok 4 surpasses its previous score, yielding
our best accuracy of 85.28. While the Grok 4 performance increased
steadily, the latency in the response generation is significant, sometimes exceeding one minute of
thinking and generation. This should be taken into account in the cost-benefit analysis.</p>
      <p>Tables 5 and 6 report the results of the LETOC task. As stated previously, this task is inherently
more complex, as the number of target classes is higher. Table 5 reports the difference between the
zero-shot setting and the settings where the model's context is enriched with examples taken from the
original dataset. The models generally solved the task well in most cases. However, in
this scenario there is no clearly superior model: only small differences can be noted, and the top
performance is either shared between two models or changes with the chosen metric. It is still clear
that the models to choose in an industrial environment, or wherever performance is of utmost
importance, are Sonnet 4.5 [16] and Grok 4 [15]. However, the small and open-source Devstral 2 LLM
[22] achieves fairly good accuracy, especially in the three-shot setting. Hence, it could be selected for
applications where low cost and fast inference are crucial.</p>
      <p>The last result we discuss is once again the variation of the model's reasoning effort. All
the measures are reported in Table 6. As in the previous reasoning variation setting, we do not see
a steep increase in performance, just a small fluctuation. Once again, Grok 4 achieves most of the
top performances, which are reported in bold. However, GPT 5.2 [17] seems to handle the medium
reasoning effort parameter better than the others.</p>
      <p>Finally, we analyze the relation between the accuracy of each model (averaged as detailed below)
and the model's cost and context length. Figure 1 visually represents how the cost of the model
and its context window length influence its accuracy. The accuracy reported as
the dependent variable is macro-aggregated as the mean of all accuracies in the settings of the binary
classification task, namely the instruct zero-shot accuracy, the one-shot accuracy, the three-shot
accuracy, and the three accuracies of the low, medium, and high reasoning effort experiment. We only
selected the accuracies of the first task because the number of examples changes only slightly (one to
three fewer runs if the prompt contains examples), so the mean aggregation is meaningful.
The independent variables are the cost and the context window length. If a
positive correlation exists between accuracy and one of these variables, we would expect the points to lie along the
main diagonal. While the visualization suggests this to be the case for both the cost and the
length variables, there are some notable exceptions, such as Grok 4 [15] and Gemini 3 [19]. The first
shows a correlation between cost and accuracy, but it appears to make better use of its short context
length than the other models. The second, instead, supports very long contexts, which presumably
translates into a higher power consumption, while still remaining quite inexpensive.</p>
      <p>[Figure 1: two scatter plots of model accuracy (y-axis) for the tested models, the first against cost ($ per million output tokens) and the second, titled "Context Length Accuracy Trade-off", against context length (millions of tokens).]</p>
      <p>Conclusions LLMs have proved to be effective in addressing temporal reasoning on legal documents,
particularly in understanding the temporal order between pairs of legal events. Among the tested
models, Grok 4 [15] performs best in both downstream tasks, even in the absence of deep reasoning. As
a drawback, Grok 4's inference time often exceeds one minute, making it unsuitable for real-time
applications. As an alternative, LLMs like Claude Sonnet 4.5 [16], Gemini 3 [19] and Devstral 2 [22]
offer fairly good performance with more limited cost and inference time.</p>
      <p>Future works We plan to extend the set of tested models and configuration settings, including models
that are fine-tuned on in-domain sources. We would also like to dig deeper into the reasons behind
models' failures by analyzing both the common mistakes and the questions that cause the most failures
using Explainable AI techniques. To explore the effect of deep reasoning, we also plan to analyze the
structure of the reasoning tokens. Finally, additional prompting techniques that are more specific to the
task can be tested as well. For example, we can explain the steps that the model should follow when
answering a time-related question.</p>
      <p>Limitations Due to the limited number of annotated samples, we mainly focus on zero- and few-shot
learning rather than supervised fine-tuning. We plan to extend the set of labeled data in future
work.</p>
      <p>Some of the LLMs might generate hallucinated content. For this reason, we cannot exclude the
generation of unpredictable answers at inference time.</p>
      <p>Grok 4 and GPT 5 have higher inference costs than all the other models. Due to budget limitations,
we focused on these two very large proprietary LLMs.</p>
    </sec>
    <sec id="sec-5">
      <title>Ethics statement</title>
      <p>We are not aware of the methods that the providers of the OpenRouter platform employ in terms of data
collection and model training. We disabled every option available in the settings panel
of the website to avoid model training on our queries and any other form of data collection, and we encourage
readers to do the same. We strongly suggest using only anonymous or open-source data when
repeating these experiments and, ideally, running the models on premise if possible.</p>
    </sec>
    <sec id="sec-6">
      <title>Data and code availability</title>
      <p>The code of the project is publicly available upon request to the authors.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Chat-GPT-5.2 for grammar and spelling
checks. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication's content.</p>
      <p>[19] Google DeepMind, Gemini 3 Flash: Fast and Efficient Multimodal Reasoning, Technical Report, Google DeepMind, 2024. URL: https://storage.googleapis.com/deepmind-media/gemini/gemini_3_flash_model_evaluation.pdf.</p>
      <p>[20] DeepSeek-AI, A. Liu, A. Mei, Z. Zhang, Z. Qu, Deepseek-v3.2: Pushing the frontier of open large language models, 2025. URL: https://arxiv.org/abs/2512.02556. arXiv:2512.02556.</p>
      <p>[21] OpenAI, S. Agarwal, L. Ahmad, J. Ai, S. Altman, E. Zhang, S. Zhao, gpt-oss-120b gpt-oss-20b model card, 2025. URL: https://arxiv.org/abs/2508.10925. arXiv:2508.10925.</p>
      <p>[22] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. arXiv:2310.06825.</p>
      <p>[23] A. Yang, A. Li, B. Yang, Beichen, Z. Qiu, Qwen3 technical report, 2025. URL: https://arxiv.org/abs/2505.09388. arXiv:2505.09388.</p>
      <p>[24] K. Wei, A. Gautam, R. Huang, Are llms good annotators for discourse-level event relation extraction?, 2025. URL: https://arxiv.org/abs/2407.19568. arXiv:2407.19568.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Siino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Falco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Exploring LLMs applications in law: A literature review on current legal NLP approaches</article-title>
          ,
          <source>IEEE Access</source>
          <volume>13</volume>
          (
          <year>2025</year>
          )
          <fpage>18253</fpage>
          -
          <lpage>18276</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ACCESS.2025.3533217</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lawrie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Holzenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blair-Stanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Durme</surname>
          </string-name>
          ,
          <article-title>CLERC: A dataset for U.S. legal case retrieval and retrieval-augmented analysis generation</article-title>
          , in: L. Chiruzzo, A. Ritter, L. Wang (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL 2025</source>
          , Association for Computational Linguistics, Albuquerque, New Mexico,
          <year>2025</year>
          , pp.
          <fpage>7898</fpage>
          -
          <lpage>7913</lpage>
          . URL: https://aclanthology.org/2025.findings-naacl.441/. doi:
          <pub-id pub-id-type="doi">10.18653/v1/2025.findings-naacl.441</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hindi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mohammed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Maaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alwarafy</surname>
          </string-name>
          ,
          <article-title>Enhancing the precision and interpretability of retrieval-augmented generation (rag) in legal technology: A survey</article-title>
          ,
          <source>IEEE Access</source>
          <volume>13</volume>
          (
          <year>2025</year>
          )
          <fpage>46171</fpage>
          -
          <lpage>46189</lpage>
          . doi:10.1109/ACCESS.2025.3550145.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaghaghian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jafarpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pogrebnyakov</surname>
          </string-name>
          ,
          <article-title>Customizing contextualized language models for legal document reviews</article-title>
          ,
          in:
          <source>2020 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2139</fpage>
          -
          <lpage>2148</lpage>
          . doi:10.1109/BigData50022.2020.9378201.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tarasconi</surname>
          </string-name>
          ,
          <article-title>Boosting court judgment prediction and explanation using legal entities</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          ,
          <year>2024</year>
          . URL: https://doi.org/10.1007/s10506-024-09397-8. doi:10.1007/s10506-024-09397-8.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <article-title>A survey on legal judgement prediction using machine learning</article-title>
          , in:
          <source>Security Intelligence in the Age of AI: Navigating Legal and Ethical Frameworks</source>
          , Emerald Publishing Limited,
          <year>2025</year>
          . URL: https://doi.org/10.1108/978-1-83608-156-220251002. doi:10.1108/978-1-83608-156-220251002.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanjay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hazarika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <article-title>Semantic segmentation of legal documents via rhetorical roles</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Goanță</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Preoțiuc-Pietro</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Natural Legal Language Processing Workshop 2022</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
          <year>2022</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>171</lpage>
          . URL: https://aclanthology.org/2022.nllp-1.13/. doi:10.18653/v1/2022.nllp-1.13.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sojitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dandapat</surname>
          </string-name>
          ,
          <article-title>Do language models have a common sense regarding time? Revisiting temporal commonsense reasoning in the era of large language models</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>6750</fpage>
          -
          <lpage>6774</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.418/. doi:10.18653/v1/2023.emnlp-main.418.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Updating large language models' memories with time constraints</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Al-Onaizan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-N.</given-names>
            <surname>Chen</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2024</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>13693</fpage>
          -
          <lpage>13702</lpage>
          . URL: https://aclanthology.org/2024.findings-emnlp.801/. doi:10.18653/v1/2024.findings-emnlp.801.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Barale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rovatsos</surname>
          </string-name>
          ,
          <article-title>Lextime: A benchmark for temporal ordering of legal events</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2506.04041. arXiv:2506.04041.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <article-title>Llama: Open and efficient</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>