1. Introduction

CoRR abs/

10.48550/ARXIV.2310.06825

Exploring In-Context Learning Strategies for Temporal Ordering of Legal Events using Large Language Models

Andrea Cacioli

0 1

Luca Cagliero

luca.cagliero@polito.it 1

Francesco Tarasconi

francesco.tarasconi@staff.aruba.it 0

Legal AI, Temporal Reasoning, Large Language Models

0 Aruba AI Srl , Corso Francia 2/bis, 10143 Turin , Italy 1 Politecnico di Torino , Corso Duca degli Abruzzi 24, 10129 Turin , Italy

2302

13971 0000 0002

Large Language Models (LLMs) are increasingly adopted for legal document understanding by attorneys and legal consultants. Despite advances in adapting LLMs to their legal terminology and domain-specific linguistic nuances, the LLMs' ability to reason about temporal relations in legal documents remains largely underexplored. In this work, we explore the capabilities of LLMs to verify the correctness of a legal temporal ordering clause and to classify the type of temporal relationships between two legal entities. The results achieved on a public Englishwritten benchmark show that (1) instruction-based models generally perform better than the corresponding chat versions; (2) LLMs reasoning capabilities are, typically, marginally useful to address the specific temporal reasoning tasks; (3) LLMs under a Few-Shot Learning (FSL) setting turn out to be the most efective, with Grok 4 surpassing the state of the art.

1. Introduction

Large Language Models (LLMs) have demonstrated remarkable legal document understanding and generation capabilities [ 1 ]. Within the legal domain, the most established tasks encompass (1) content search [ 2, 3 ], (2) document review [ 4, 5 ], and (3) prediction [ 6, 7 ]. The latter category of tasks also includes the deep understanding of complex semantic relations in text, such as legal entailment types, rhetorical roles, and temporal relations.

Reconstructing temporal relationships is known to be particularly challenging for LLMs [ 8 ]. Specifically, previous studies have shown that most LLMs fall short when they are asked to either update a knowledge base or adapt their responses to time-evolving scenarios [ 9 ].

So far, limited research eforts have been devoted to addressing temporal reasoning in the legal domain. For example, in LexTime [ 10 ] the authors address a prediction task which entails predicting whether a temporal ordering relationship between a pair of events mentioned in the document text (e.g., event A precedes event B) is true or false.

The main limitations of state-of-the-art works on temporal reasoning for legal document understanding are enumerated below.

• Lack of Deep Reasoning: They analyze classical textual LLMs belonging to the LlaMA [ 11 ], GPT [12], and Mistral [13] families while ignoring the LLMs that have been specifically pretrained with deep reasoning capabilities. • Binary Verification : They analyze the zero-shot and few-shot LLM capabilities to verify whether a given statement is correct or not [ 10 ], leaving open more challenging legal understanding tasks, such as the automatic detection of the type of event ordering. • Limited exploration of the models’ eficiency : They do not deepen into the analysis of relevant technical aspects, such as context length, and model inference costs.

Published in the Proceedings of the Workshops of the EDBT/ICDT 2026 Joint Conference (March 24-27, 2026), Tampere, Finland

CEUR Workshop

ISSN1613-0073

This paper addresses the above-mentioned issues. Specifically, it not only studies the LLM capabilities to verify the correctness of a legal temporal ordering clause, but also classifies the type of temporal relationships between two legal entities. It also empirically compares chat- and instruct-based LLMs, LLMs with deep reasoning and not, and models with diferent sizes, context lengths, and inference costs.

The results achieved by Grok 41 under a few-shot learning setting surpasses the state of the art on the binary verification task (accuracy: Grok 4 85.3% vs. GPT 4 80.8%) and achieves robust performance on the multi-class event ordering classification task. Notably, the LLMs with deep reasoning capabilities achieve just marginal improvements or no improvements, likely because they incorporate a limited background in the legal domain.

The remainder of this paper is organized as follows. Section 2 formalizes the established and new temporal reasoning task. Section 3 presents the proposed methodology, while Section 4 summarizes the main experiments. Finally, Section ?? draws conclusions and discusses the future research developments.

2. Problem statement

Given a legal document , we extract a context paragraph in mentioning a sequence of two legal events ⟨, ⟩ . Events and are either one implicit and one explicit event or two explicit events [ 10 ]. In compliance with [ 10 ], every event is defined by an occurrence or action triggered by a verb or noun taking place at a specific moment.

In the following we define the tasks addresses in this work.

Legal Event Temporal Ordering Verification Given an ordered sequence ⟨, ⟩ consisting of

events and and an arbitrary temporal relationship , this task, hereafter denoted by LETOV for the sake of brevity, aims to verify whether the statement (e.g. event a precedes event b) holds (target response: yes) or not (target response: no).

Legal Event Temporal Ordering Classfication Given an ordered sequence of events ⟨, ⟩ , and

a predefined set of temporal relationships { 1, 2, , } (e.g., precedes, subsequent, contemporary), this task, hereafter denoted by LETOC for the sake of brevity, has the goal of predicting the correct temporal relationship between events and .

With the goal of deepening the analysis of the LLMs’ capabilities in legal temporal temporal reasoning, we introduce LETOC as a new task extending LETOV [ 10 ].

3. Methodology

To assess the LLMs’ capabilities to address LETOV and LETOC we apply the following steps. Firstly, we enrich the statements originally included in [ 10 ] with diferent prompting styles, including chat- and instruct versions as well as zero-shot and few-shot learning settings. Based on the results observed in the preliminary experiments (see Section 4 for further details), we decided to employ only the instruct style from the second setting onward due to its little impact on overall performance.

Secondly, we design a testing framework that can uniquely identify a given prompt for a given model and that stores the history of experiments’ outcomes.

Lastly, we collect the results on a grid search over multiple models, settings and prompting strategies. The grid search spans across the models, the number of shots ∈ {0, 1, 3}, and the reasoning levels ∈ {low, medium, high}

1https://x.ai/news/grok-4 latest access: January 7, 2026

Chat vs. Instruct-based models We experiment with two main classes of prompts: chat and instruct. The chat style is the most common way to prompt a LLM, as most interfaces are designed with this principle in mind. Recent works [14] have inspired the creation of models that perform best when dealing with instructions. Hence, we also experimented with this to compare their efect of legal temporal reasoning.

The instruct prompts selected for LETOV follow the following template: You are a legal expert that never makes mistakes and that never hallucinates.

Give your unbiased opinion on the following events about their temporal relationship.

Do not make mistakes.

Consider these examples: # Example 1

Given this context: ’$example_context1’ For the statement ’$example1’

You should answer ’$label1’ ...(other examples or no examples at all) In the context: $context Verify the soundness of this statement: $question Only answer with one word: if the statement is correct, answer with the word ”Entailment”; whereas if the statement is wrong, answer with the word ”Contradiction”

The selected LETOV chat prompt, instead, has the following structure.

I am examining this paragraph from a legal context and I want to extrapolate the temporal relations between two events. I absolutely need these to be correct, no mistakes allowed.

This is my context: $context

This is my statement: $question I need a one word answer: if the statement is correct, answer with the word ”Entailment”; whereas if the statement is wrong, answer with the word ”Contradiction”

For LETOC we focus on instruct prompts, identifying the following template: You are a legal expert that never makes mistakes and that never hallucinates. Give your unbiased opinion on the following events about their temporal relationship. You must pick one of three temporal relations from a set. Do not make mistakes.

Consider these examples: # Example 1

Given this context: ’$example_context1’ For the events:

Event A: ’$example_a1’

Event B: ’$example_b1’

You should answer ’$label1’ Only answer with only one word representing the relation: - If Event A follows event B, answer ”follows” - If Event A precedes event B, answer ”precedes” - If the two events happen at the same time, answer ”simultaneous” Hardware resources and services We run our experiments using the LLM-As-A-Service OpenRouter platform2. The experiments took around 50 hours, and the overall cost was 173,88$.

To prepare the inputs and postprocess the results, we used a machine equipped with 16GB of RAM, an AMD Ryzen AI 7 PRO 350 CPU and 512 GB SSD and running Windows 11 Pro. Dataset We adapt the LexTime open benchmark [ 10 ] to address both the LETOV and LETOC tasks.

LexTime is composed of a legal context taken from U.S. federal complaints between 2020 and 2024. They randomly sampled complaints categorized under the Nature of Suit (NOS) codes beginning with 7, which correspond to labor-related cases. Alongside the context, it contains a statement in natural language about two events. For each statement corresponds a binary label: ”entailment” if the statement is sound, ”contradiction” otherwise. Each statement also has some metadata about the nature of the couple of events: whether they are explicitly mentioned in the context, or if one of them can only be deduced by a legal expert, eventually marking it as implicit. Our study disregards the efect of metadata as mainly focuses on temporal relations between legal entities.

The dataset curation consisted of the following steps: firstly, we only selected the statements that are logically sound, as it is impossible to deduce the event relation from contradicting statements. Secondly, we used a regular expression to extrapolate each of the temporal relations that compose LexTime. Finally, we aggregate similar ones into three classes: • precedes: for couples of events where the first happens before the second • follows: for couples of events where the first happens after the second • simultaneous: for couples of events where the first and the second happen at the same time.

Hereafter, we will refer to this smaller dataset as the multi-class dataset.

Models We benchmark the performance of the state-of-the-art LLMs reported in Table 1. For each model we also report its reasoning availability and whether or not the reasoning efort specification is supported, the cost expressed in $ per million of output inference tokens and finally if it is an instruct model or not. Opensource models are also reported.

In the experiments we explored the following dimensions of analysis: • Model openness: We compared opensource and proprietary models. We focus on state-of-the-art model, testing a selection of models all released after April 2025.

2https://openrouter.ai/ latest access: January 10, 2026

Grok 4 [15] Claude Sonnet 4.5 [16] OpenAI GPT-5.2 [17]

OpenAI o3 [18] Gemini 3 flash prev [19]

DeepSeek V3.2 [20] OpenAI GPT OSS 120b [21] Mistral Devstral 2 2512 [22] Qwen3 Instruct 2507 [23]

Yes

Yes efort specifiable efort specifiable

Yes

Yes efort specifiable

No No • Model dimension and context length: We tested models with context size ranging from 131.072 to 1.048.576. Extending the preliminary work presented in [ 10 ] and other works [24] that had already promoted the usefulness of large contexts in legal contexts, we aim to study the impact of very large context length on models’ performance. • Efect of deep reasoning : To test the impact of the reasoning capabilities, we consider models with and without this feature (see Section 4 for more details). • Instruct vs chat setting: we compare chat vs. instruct-based LLMs. Given the recent LLMs’ alignment to human preferences [14], we explore instruction tuning as an alternative to chat models.

Settings We test three diferent LEVOT settings. The first one is aimed at discovering the impact of the instruct style prompt, as well as the model’s own preference towards a more friendly and conversational prompt or a more strict direct order.

In the second setting we verify whether content adaptation strategies are beneficial to enhance legal temporal reasoning performance. We also empirically verify if the reasoning models are better at generalizing from the examples and therefore applying the reasoning to the question.

In the last setting we try to change the number of tokens that the models can dedicate to reasoning by specifying an efort parameter. The efort parameter can be one of several values. We experiment with values ’low’, ’medium’, ’high’.

For LEVOC we test only the last two settings of the previous task with slight modifications. Firstly, we test and compare zero-, one-, and three shot learning. Lastly, we once again test how the reasoning efort afects the performance.

4. Experimental results

We measure the LETOV and LETOC performance of diferent combinations of models and settings in terms of classification accuracy (i.e., the percentage of correctly classified samples, similarly to [ 10 ]). Furthermore, we also evaluate the per-class performance in terms of precision, recall and F1-score. For LETOC we adopt the weighted versions of the metric to reduce the impact of class imbalance. Results and discussion Table 2 reports the values of the performance scores for every run in the instruct style and a diferential score Δ. Δ is defined by the performance gap between the classifier prompted with the instruct prompts and the same metric for one prompted with the chat style. For every metric ∈ {Accuracy, Precision, Recall, F1},

Δ = instruct − chat

Based on the reported Δ values, the prompting technique appears to provide limited contributions. In addition, as shown by the F1-score results, most models marginally benefit from the instruction prompting style. For this reason, we then further explore the instruction-based LLMs. Qwen 3 Instruct [23] underperforms the large proprietary model, with Grok 4 [15] outperforming the other approach, except for Claude Sonnet 4.5 [16]. Devstral [22] instead achieves a very noticeable 92.37% recall, while getting lower precision scores. The LOVET performance on this task has improved compared to the state of the art(80.8 accuracy) [ 10 ].

Table 3 reports the results for LETOC, where we focus on few-shot learning. For the sake of completeness and clarity, we also repeat the zero-shot instruct experiment from the previous setting. Reasoning models are expected to perform better in this task as the reasoning is further helped by the examples. In this setting, Grok 4 [15] proves once again to able to outperform all the other models, though with a limited extent. The performance achieved by Gemini 3 and Sonnet seems to closely follow the one by Grok, though it consistently lags behind. Devstral confirms its tendency to have high recall measures. Overall, the presence of one or few examples helps the model’s generalization capabilities as expected. Finally, the top accuracy score of 84.48 achieved by Grok 4 in the three shots, further surpasses the one from the previous setting.

For the final binary classification task’s setting, we report the results in Table 4. The efect of the reasoning seems to be limited. However, once more, Grok 4 surpasses the previous score, and we find our best result for the accuracy metric of 85.28. However, while the Grok 4 performance increased steadily, the latency in the response generation is significant, i.e., sometimes exceeding one minute of thinking and generation. This should be taken into account in the cost-benefit analysis.

Table 5 and 6 report the results of the LETOC task. As stated previously, this task is inherently more complex as the number of target classes is higher. Table 5 reports the diference between the zero-shot setting and the settings where the model’s context is enriched with examples taken from the original dataset. Models facing this problem generally solved the task well in most cases. However, in this scenario we do not have a clear superior model: only small diferences can be noted and the top performance is either shared between two models or it changes with the metric chosen. It is still clear that models that could be chosen in an industrial environment or where performances are of utmost importance are Sonnet 4.5 [16] and Grok 4 [15]. However, the small and opensource Devstral 2 LLM [22] achieves fairly good accuracy, especially in the three-shot setting. Hence, it could be selected for applications where low cost and fast inference is crucial.

The last result we want to discuss is once again the variation of the model’s reasoning efort. All the measures are reported in table 6. Like in the previous reasoning variation setting, we do not see a steep increase in performance, just a small fluctuation. Once again Grok 4 achieves most of the top performances that we reported in bold. However, GPT 5.2 [17] seems to handle best the medium reasoning efort parameter compared to the others.

Finally, we analyze the relation between the accuracy of each model (averaged between the various tasks), and the model’s cost and context length. Figure 1 visually represents how the cost of the model and its context window length influences its accuracies in the various tasks. The accuracy reported as the dependent variable is macro aggregated using the mean of all accuracies in settings of the binary classification tasks. Those are the instruct zero shot accuracy, the one shot accuracy, the three shot accuracy and the three accuracies of the low-medium-high reasoning efort experiment. We only selected the accuracies of the first task because the number of examples only changes slightly (one to three less runs if the prompt contains examples) and so the mean aggregation method makes sense. As an independent variable, we show how cost and context window length afect the accuracy. If a positive correlation exists between the two variables, we would expect the points to be placed on the main diagonal. However, while this visualization suggests this to be the case for both the cost and the length variables, we can see some notable exceptions like Grok 4 [15] and Gemini 3 [19]. The first one shows a correlation between cost and accuracy but it seems to make the most of its short context length better than the rest. The second one, instead uses very high context lengths, which presumably translates to a higher power consumption, while still remaining quite inexpensive. OpenAI o3

OpenAI GPT-5.2 0.0

OpenAI GPT OSS 120b Mistral Devstral 2 2512 DeepSeek V3.2 Qwen3 Instruct 2507 2.5 5.0 7.5 10.0 12.5 15.0

Cost ($ per million output tokens) 84 82 y c rau80 c c a le78 d o M76 74

Context Length Accuracy Trade-off

Grok 4

OpenAI o3 OpenAI GPT-5.2 OpenAI GPT OSS 120b

Mistral Devstral 2 2512 DeepSeek V3.2 Qwen3 Instruct 2507 0.2 0.4 0.6 0.8

Context length (million of tokens) Conclusions LLMs have proved to be efective in addressing temporal reasoning on legal documents, particularly in the understanding the temporal order between pairs of legal events. Among the tested models, Grok 4 [15] performs best in both downstream tasks, even in the absence of deep reasoning. As a drawback, the Grok 4’s inference time often exceeds one minute, making it not applicable to real-time applications. As an alternative, LLMs like Claude Sonnet 4.5 [16], Gemini 3 [19] and Devstral 2 [22] ofer fairly good performance with a more limited cost and inference time.

Future works We plan to extend the set of tested models and configuration settings, including models that are fine-tuned on in-domain sources. We would like to also dig deeper into the reasons behind models’ failure by analyzing both the common mistakes and the questions that cause the most failures using Explainable AI techniques. To explore the efect of deep reasoning, we plan to also analyze the structure of the reasoning tokens. Finally, additional prompting techniques that are more specific to the task can be tested as well. For example, we can explain the steps that the model should follow when answering a time related question.

Limitations Due to the limited number of annotated samples, we mainly focus on zero- and few-shot learning rather than supervised fine-tuning. We plan to extend the set of labeled data in the future work.

Some of the LLMs might generate hallucinated content. For this reason, we cannot exclude the generation of unpredictable answers at inference time.

Grok 4 and GPT 5 have inference costs superior to all the other models. Due to budget limitations, we focused on this two very large proprietary LLMs.

Ethics statement

We are not aware of the methods that the providers of the OpenRouter platform employ in terms of data collection and model training. We made sure to disable every option that we could in the settings panel of the website to avoid model training on our queries and all sorts of data collections and we encourage the readers to do so as well. We strongly suggest to only use anonymous data or open source data when and if redoing these experiments and, ideally, we would advise running models on premise if possible.

Data and code availability

The code of the project is publicly available upon request to the authors.

Declaration on Generative AI

During the preparation of this work, the authors used Chat-GPT-5.2 in order to: Grammar and spelling check. After using this tool, the authors reviewed and edited the content as needed and takes full responsibility for the publication’s content. [19] Google DeepMind, Gemini 3 Flash: Fast and Eficient Multimodal Reasoning, Technical Report, Google DeepMind, 2024. URL: https://storage.googleapis.com/deepmind-media/gemini/gemini_3_ flash_model_evaluation.pdf. [20] DeepSeek-AI, A. Liu, A. Mei, Z. Zhang, Z. Qu, Deepseek-v3.2: Pushing the frontier of open large language models, 2025. URL: https://arxiv.org/abs/2512.02556. arXiv:2512.02556. [21] OpenAI, S. Agarwal, L. Ahmad, J. Ai, S. Altman, E. Zhang, S. Zhao, gpt-oss-120b gpt-oss-20b model card, 2025. URL: https://arxiv.org/abs/2508.10925. arXiv:2508.10925. [22] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. arXiv:2310.06825. [23] A. Yang, A. Li, B. Yang, Beichen, Z. Qiu, Qwen3 technical report, 2025. URL: https://arxiv.org/abs/ 2505.09388. arXiv:2505.09388. [24] K. Wei, A. Gautam, R. Huang, Are llms good annotators for discourse-level event relation extraction?, 2025. URL: https://arxiv.org/abs/2407.19568. arXiv:2407.19568.

[1]

Siino ,

Falco ,

Croce ,

Rosso , Exploring llms applications in law: A literature review on current legal nlp approaches , IEEE Access 13 ( 2025 ) 18253 - 18276 . doi: 10 .1109/ACCESS. 2025 . 3533217 .

[2]

A. B.

Hou ,

Weller ,

Qin ,

Yang ,

Lawrie ,

Holzenberger ,

Blair-Stanek , B. Van Durme , CLERC: A dataset for U. S. legal case retrieval and retrieval-augmented analysis generation , in: L. Chiruzzo , A. Ritter , L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025 , Association for Computational Linguistics , Albuquerque, New Mexico, 2025 , pp. 7898 - 7913 . URL: https://aclanthology.org/ 2025 .findings-naacl. 441 /. doi: 10 .18653/v1/ 2025 . findings-naacl. 441 .

[3]

Hindi ,

Mohammed ,

Maaz ,

Alwarafy , Enhancing the precision and interpretability of retrieval-augmented generation (rag) in legal technology: A survey , IEEE Access 13 ( 2025 ) 46171 - 46189 . doi: 10 .1109/ACCESS. 2025 . 3550145 .

[4]

Shaghaghian ,

L. Y.

Feng ,

Jafarpour ,

Pogrebnyakov , Customizing contextualized language models for legal document reviews , in: 2020 IEEE International Conference on Big Data (Big Data) , 2020 , pp. 2139 - 2148 . doi: 10 .1109/BigData50022. 2020 . 9378201 .

[5]

Benedetto ,

Koudounas ,

Vaiani ,

Pastor ,

Baralis ,

Cagliero ,

Tarasconi , Boosting court judgment prediction and explanation using legal entities , in: Artificial Intelligence and Law , 2024 . URL: https://doi.org/10.1007/s10506-024-09397-8. doi: 10 .18653/v1/ 2023 .semeval- 1 . 194 .

[6]

P. P.

Kumari ,

G. R.

Babu , A survey on legal judgement prediction using machine learning, in: Security Intelligence in the Age of AI: Navigating Legal and

Ethical

Frameworks , Emerald Publishing Limited, 2025 . URL: https://doi.org/10.1108/978-1- 83608 -156- 220251002 . doi: 10 .1108/ 978-1- 83608 -156-220251002.

[7]

Malik ,

Sanjay ,

S. K.

Guha ,

Hazarika ,

S. K.

Nigam ,

Bhattacharya ,

Modi , Semantic segmentation of legal documents via rhetorical roles , in: N. Aletras , I. Chalkidis ,

Barrett ,

Goanță , D. Preoțiuc-Pietro (Eds.), Proceedings of the Natural Legal Language Processing Workshop 2022 , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid) , 2022 , pp. 153 - 171 . URL: https://aclanthology.org/ 2022 .nllp- 1 .13/. doi: 10 .18653/v1/ 2022 .nllp- 1 . 13 .

[8]

Jain ,

Sojitra ,

Acharya ,

Saha ,

Jatowt ,

Dandapat , Do language models have a common sense regarding time? revisiting temporal commonsense reasoning in the era of large language models , in: H. Bouamor , J. Pino , K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , Association for Computational Linguistics, Singapore, 2023 , pp. 6750 - 6774 . URL: https://aclanthology.org/ 2023 .emnlp-main. 418 /. doi: 10 .18653/v1/ 2023 . emnlp-main. 418 .

[9]

Wu ,

Bu ,

Cai ,

Wang , Updating large language models' memories with time constraints , in: Y. Al-Onaizan , M.

Bansal , Y.-N.

Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024 , Association for Computational Linguistics , Miami, Florida, USA, 2024 , pp. 13693 - 13702 . URL: https://aclanthology.org/ 2024 .findings-emnlp. 801 /. doi: 10 .18653/v1/ 2024 . findings-emnlp. 801 .

[10]

Barale ,

Barrett ,

V. S.

Bajaj ,

Rovatsos , Lextime: A benchmark for temporal ordering of legal events , 2025 . URL: https://arxiv.org/abs/2506.04041. arXiv: 2506 . 04041 .

[11]

Touvron ,

Lavril ,

Izacard ,

Martinet , M. -

A. Lachaux , T.

Lacroix , B.

Rozière , N.

Goyal , E.

Hambro , F.

Azhar , A.

Rodriguez , A.

Joulin , E. Grave, G. Lample, Llama: Open and eficient