<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BABILong-ITA: a new benchmark for testing Large Language Models effective context length and a Context Extension Method</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <addr-line>via Zamboni, 32, 40126, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper introduces a new benchmark designed to evaluate the effective context length handled by Large Language Models (LLMs) in Italian. Following the structure of the five core tasks from the English BABILong dataset, we created an equivalent benchmark tailored for Italian. We used it to assess the context management capabilities of several prominent LLMs, both small and large, pretrained from scratch or fine-tuned specifically for Italian. Additionally, we tested a context extension technique called “SelfExtend” that does not require any training or fine-tuning phase, measuring its effectiveness using our proposed benchmark.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>context length evaluation</kwd>
        <kwd>new benchmark</kwd>
        <kwd>Italian</kwd>
        <kwd>context extension</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the capabilities of Large Language Models (LLMs) continue to advance, one of the most critical areas of improvement lies in their ability to process and retain information over extended sequences of text, a feature commonly referred to as context length. Traditional benchmarks for evaluating LLMs focus on accuracy, reasoning, and generation quality, but often overlook a systematic assessment of how well a model can operate when presented with extremely long input sequences.</p>
      <p>Long context is crucial for Retrieval-Augmented Generation (RAG) because it allows the model to process and reason over more retrieved information at once. In RAG systems, external documents or chunks of text are retrieved based on a query and then passed to the LLM to generate accurate and contextually relevant answers. A longer context window means the model can consider more documents or larger portions of documents simultaneously, reducing the need to truncate or summarise input data. This leads to better comprehension, improved factual accuracy, and more coherent responses, especially for complex or multi-part queries.</p>
      <p>Evaluating the context length capabilities of LLMs is crucial for understanding their practical utility in real-world applications requiring long-range reasoning, document understanding, and multi-turn conversations. Over the past years, several standardised benchmarks have been developed to assess and compare the performance of LLMs across varying context lengths.</p>
      <p>A widely cited benchmark framework is Kamradt’s ‘Needle-in-a-Haystack’ (https://github.com/gkamradt/LLMTest_NeedleInAHaystack.git), which probes a model’s ability to retrieve a small piece of relevant information embedded in a long, distractor-filled sequence. This test is considered a litmus test for whether models truly attend to long-range dependencies rather than relying on heuristics or recency biases.</p>
      <p>Another critical benchmark is ‘Passage Retrieval and Question Answering’ over long contexts, exemplified by datasets such as ‘NarrativeQA’ [1] and ‘HotpotQA’ [2]. These datasets require models to maintain coherence and extract pertinent information across several paragraphs or documents. The ‘BookSum’ benchmark [3] further extends this approach by evaluating abstractive summarisation over entire books, posing an extreme challenge to context handling.</p>
      <p>To assess performance on computationally efficient long-context processing, the ‘Long Range Arena’ provides a suite of tasks including image classification, text retrieval, and list sorting, adapted to sequence modelling tasks with sequences ranging from 1k to 16k tokens [4]. While not all tasks are purely devoted to natural language processing, they benchmark architectural innovations like sparse attention and memory-efficient transformers.</p>
      <p>‘LongBench’ [5] provides comprehensive testbeds across domains covering key long-text application areas, including single-doc QA, multi-doc QA, summarisation, few-shot learning, synthetic tasks, and code completion in both English and Chinese, evaluating both performance scaling and fidelity to far-positioned inputs. An et al. [6] present a new evaluation suite, ‘L-Eval’, containing 20 sub-tasks, 508 long documents, and more than 2,000 human-labelled query-response pairs including diverse task types, domains, and input lengths.</p>
      <p>Taken together, these benchmarks form a multi-faceted suite of tools that not only test LLMs for maximum supported context length but also probe their effective use of context. As models scale to handle millions of tokens, developing robust and generalisable long-context benchmarks remains an active area of research, especially for languages different from English.</p>
      <p>Regarding the techniques for increasing context ‘awareness’ in transformers, recent works have introduced scaling techniques specifically targeting context length extrapolation. For example, Press et al. [7] proposed in-context learning extrapolation to test model performance when context lengths at inference time far exceed those seen during training. Considering this, we refer to a recent interesting survey on techniques for extending transformer context by Wang et al. [8].</p>
      <p>Figure 1: BABILong schema for generating tasks: task facts are hidden into distractor text fragments extracted from PG19 (picture from [9]).</p>
      <p>Another English benchmark, relevant to this work, is ‘BABILong’ [9], a benchmark specifically designed to evaluate the maximum usable context length of large language models. BABILong provides a controlled and extensible framework for measuring how effectively LLMs can retrieve and use information embedded at various positions within long input contexts. The benchmark simulates real-world scenarios where crucial information may appear early in a document and must be recalled accurately much later, such as in code completion, document summarisation, and legal or scientific reasoning tasks. Each BABILong instance presents the model with a structured sequence containing query-relevant and distractor content spread over thousands to potentially millions of tokens. The model is then tasked with answering queries or completing sequences that require precise recollection of target information, making it possible to assess the degradation of performance as a function of input length.</p>
      <p>Unlike traditional evaluations, BABILong systematically varies the distance between the query and its corresponding reference information, enabling granular analysis of context window utilisation and scaling properties across different architectures. The benchmark supports plug-and-play integration with both decoder-only and encoder-decoder models, and it is agnostic to pretraining data, making it suitable for comparative studies across proprietary and open-source models.</p>
      <p>In summary, BABILong provides a scalable, interpretable, and model-agnostic benchmark for long-context reasoning and memory fidelity, and it is a very useful tool for researchers and practitioners seeking to push the boundaries of efficient long-sequence modelling in large-scale language systems. Moreover, it can be easily extended to other languages: the goal of this work regards the extension of BABILong to Italian, allowing for a careful testing and benchmarking of LLMs that natively handle the Italian language.</p>
      <sec id="sec-1-1">
        <title>Long by, first, translating English sentences belonging to</title>
        <p>BABILong tasks leveraging Google Translate and then
using the Project Gutemberg2 (PG) Italian free texts as
base corpus for extracting distractor fragments.</p>
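      <p>To illustrate the construction procedure, the following minimal sketch shows how a BABILong-style sample could be assembled from translated task facts and an Italian PG book; the function and parameter names are illustrative assumptions, not the actual BABILong or BABILong-ITA generation code.</p>
      <preformat>
import random

def build_sample(task_facts, question, distractor_sentences, tokenizer, target_tokens=4000):
    """Assemble a BABILong-style example: task facts are hidden inside
    distractor sentences taken, in their natural order, from a background book."""
    # Append background sentences until the requested context length is reached.
    context, n_tokens = [], 0
    for sent in distractor_sentences:
        if n_tokens >= target_tokens:
            break
        context.append(sent)
        n_tokens += len(tokenizer.encode(sent))
    # Scatter the task facts at random positions, preserving their relative order.
    positions = sorted(random.sample(range(len(context) + 1), len(task_facts)))
    for offset, (pos, fact) in enumerate(zip(positions, task_facts)):
        context.insert(pos + offset, fact)
    return {"input": " ".join(context), "question": question}
      </preformat>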
        <p>Given that all the major evaluations in the BABILong
paper [9] were performed considering only the first five
tasks, namely QA1-QA5, we decided to translate and
post-process only these five tasks and insert them into
BABILong-ITA.</p>
      <p>In order to build a reliable and effective Italian benchmark we had to manually revise and adapt the automatic translations, ensuring good adherence to common Italian usage and adjusting translation artifacts or wrong translations. In particular, we had to manage the following phenomena:</p>
      <p>• Proper Names Translation: Google Translate did not translate the English proper names of the people involved in the tasks, thus we had to replace them consistently with common Italian proper names, e.g. ‘John’-&gt;‘Giovanni’, ‘Mary’-&gt;‘Maria’, etc.</p>
      <p>• Object/Place Simplification: the automatic translation tended, in some cases, to translate single English words into Italian multi-word expressions, artificially increasing task difficulty. We simplified object/place translations like ‘bedroom’-&gt;‘camera da letto’-&gt;‘camera’ and ‘football’-&gt;‘pallone da calcio’-&gt;‘pallone’, etc.</p>
      <p>• Verb Tenses: for expressing past events English consistently uses the past tense, while in Italian, even if the equivalent past tense ‘passato remoto’ is grammatically correct, the ‘passato prossimo’ is much more common. We therefore adapted the translations replacing all these tenses, e.g. ‘andò’-&gt;‘è andato/a’, ‘posò’-&gt;‘ha posato’ and ‘si spostò’-&gt;‘si è spostato/a’, adapting the suffixes to the sentence subject and preserving the correct grammatical agreement.</p>
      <p>• Preposition Correction: sometimes Google Translate generates inappropriate translations from the point of view of the prepositions used; we corrected them, for example ‘John si recò al giardino’-&gt;‘Giovanni si è recato in giardino.’ or ‘Mary andò nel corridoio’-&gt;‘Maria è andata in corridoio’, ensuring better adherence to their most common use.</p>
      <p>• Translation Mistake Corrections: sometimes, especially when translating questions with implicit referents, Google Translate rendered incorrect Italian sentences that we had to carefully check and correct, also by leveraging regular expressions: for example ‘What is the kitchen west of?’-&gt;‘Qual è la cucina a ovest?’-&gt;‘La cucina è a ovest di che cosa?’.</p>
      <p>While we could have incorporated a broader range of state/position-changing predicates in the translations, we chose to adhere to the original selections, as the English benchmark did not include such variations.</p>
      <p>Table 1 shows one example for each BABILong-ITA task without the insertion of any distractor text (0k configuration).</p>
    </sec>
    <sec id="sec-2">
      <title>3. Benchmark evaluation</title>
      <p>In order to test the effectiveness of the new proposed benchmark and to get some idea about the performance of the most relevant models able to effectively handle the Italian language, we performed a set of experiments involving quite a large set of LLMs.</p>
      <p>First of all, we considered the new models presented in 2024 and trained from scratch on Italian: the first by the SapienzaNLP group (https://nlp.uniroma1.it/minerva/), namely sapienzanlp/Minerva-7B-base-v1.0 and sapienzanlp/Minerva-7B-instruct-v1.0, and, second, the largest model proposed by iGenius/CINECA, using the unofficial conversion sapienzanlp/modello-italia-9b-bf16 for simplicity. We also considered two fine-tuned models from DeepMount00, namely DeepMount00/Qwen2-1.5B-Ita and DeepMount00/Mistral-Ita-7b, a model from Microsoft, microsoft/Phi-4-mini-instruct, one from Meta, meta-llama/Llama-3.1-8B-Instruct, both in its original and quantised form relying on bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_S, and, finally, two models from Google, google/gemma-3-4b-it and the huge google/gemini-2.0-flash. All models were downloaded from the HuggingFace model repository (https://huggingface.co/) and used on a local server, except for gemini-2.0-flash, which was queried using the Google API.</p>
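      <p>As a concrete illustration (not the exact evaluation harness used in this work, whose prompt format and generation settings are not reported here), one of the locally run models can be queried on a benchmark item with the standard HuggingFace transformers API:</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sapienzanlp/Minerva-7B-base-v1.0"  # one of the evaluated checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical BABILong-ITA item: a (long) context followed by the question.
context = "Sandra si è diretta verso il corridoio. Sandra ha afferrato il pallone lì. ..."
question = "Dov'è il pallone?"
prompt = f"{context}\nDomanda: {question}\nRisposta:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
      </preformat>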
      <sec id="sec-2-1">
        <title>3.1. Experiments setting</title>
        <sec id="sec-2-1-1">
          <title>In BABILong, the authors consider performance satis</title>
          <p>factory if the accuracy of an answer exceeds 85% and
a complete failure if it is below 30%. Of course, as the
authors said, this definition of “satisfactory performance”
is not universal and should be adapted to the specific task
at hand.</p>
          <p>The comparison with the correct result follows the
• Translation Mistake Corrections: sometimes, original BABILong evaluation method: the LLM output
especially when translating questions with im- is lowercased, and the first valid target it names is
conplicit referents, Google Translate rendered incor- sidered as the LLM answer and compared with the gold
rect Italian sentences that we have to carefully target in order to compute model accuracy.</p>
        </sec>
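        <p>The following short sketch shows this scoring rule; the set of candidate targets (locations, objects, or person names, depending on the task) is an assumption here, since the exact answer vocabulary is defined per task in the benchmark code.</p>
        <preformat>
def score_answer(llm_output, gold, valid_targets):
    """BABILong-style scoring sketch: lowercase the model output and take the
    first valid target it mentions as the model's answer."""
    out = llm_output.lower()
    hits = [(out.find(t.lower()), t.lower()) for t in valid_targets if t.lower() in out]
    if not hits:
        return False                      # no valid target named at all
    first_named = min(hits)[1]            # earliest occurrence in the output
    return first_named == gold.lower()

# Accuracy over a list of (model_output, gold_answer) pairs:
# accuracy = sum(score_answer(o, g, targets) for o, g in pairs) / len(pairs)
        </preformat>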
        <sec id="sec-2-1-2">
          <title>2https://www.gutenberg.org/ 3https://nlp.uniroma1.it/minerva/ 4https://huggingface.co/</title>
          <p>QA2 two-supporting-facts
Context: Sandra si è diretta verso il corridoio. Giovanni si è diretto verso il bagno. Sandra ha afferrato il pallone lì. Daniele si
è recato in camera. Giovanni ha preso il latte lì. Giovanni ha lasciato cadere il latte. Sandra si è trasferita in giardino. Daniele
è tornato in corridoio. Sandra ha buttato via il pallone. Giovanni si è spostato in corridoio. Giovanni è tornato in giardino.
Sandra è andata in cucina. Daniele si è trasferito in camera. Sandra si è diretta verso il corridoio. Sandra si è trasferita in
cucina. Giovanni si è recato in ufficio. Sandra è andata in giardino. Sandra ha afferrato il pallone lì. Sandra ha
posato lì il pallone. Daniele è tornato in cucina.</p>
          <p>Question: Dov’è il pallone? Answer: giardino.</p>
          <p>QA3 three-supporting-facts
Context: Maria è andata in ufficio. Sandra si è spostata in corridoio. Sandra ha afferrato il pallone. Maria ha preso lì la
mela. Sandra si è recata in giardino. Daniele si è spostato in corridoio. Sandra ha posato il pallone. Daniele è andato in
camera. Sandra ha preso il pallone. Maria ha posato la mela. Maria è tornata in bagno. Giovanni si è spostato in bagno.
Giovanni è andato in corridoio. Sandra ha posato il pallone. Daniele si è diretto verso il corridoio. Sandra ha raccolto il
pallone. Sandra si è recata in ufficio. Daniele si è recato in bagno. Daniele è tornato in ufficio. Daniele si è recato in cucina.
Sandra ha raccolto la mela lì. Sandra ha buttato lì la mela. Sandra ha lasciato cadere il pallone. Giovanni si è recato in
giardino. Maria si è recata in giardino. Sandra ha afferrato il pallone lì. Sandra ha buttato lì il pallone. Sandra si è diretta
verso la cucina. Maria si è trasferita in camera. Maria è andata in corridoio. Sandra si è diretta verso il corridoio. Giovanni
è andato in cucina. Sandra si è recata in bagno. Daniele è tornato in bagno. Giovanni si è trasferito in ufficio. Giovanni
ha preso il latte. Giovanni si è diretto verso il bagno. Daniele è tornato in camera. Maria si è recata in camera. Daniele si
è diretto verso il corridoio. Giovanni si è trasferito in camera. Sandra si è recata in giardino. Daniele è tornato in cucina.
Giovanni ha lasciato il latte. Daniele si è recato in ufficio. Daniele ha preso il pallone. Maria è andata in corridoio. Daniele
ha afferrato la mela lì. Giovanni si è diretto verso il bagno. Giovanni si è diretto verso il corridoio. Giovanni è andato
in ufficio. Giovanni è tornato in cucina. Maria si è recata in ufficio. Daniele è tornato in giardino. Daniele è andato
in camera. Daniele si è spostato in bagno. Daniele è tornato in giardino. Sandra è tornata in bagno. Daniele è
andato in camera. Daniele ha lasciato la mela. Daniele ha lasciato il pallone. Daniele ha afferrato il pallone.
Question: Dov’era la mela prima di essere in camera? Answer: giardino.</p>
          <p>QA4 two-arg-relations
Context: Il giardino si trova a ovest della camera. L’ufficio si trova a est della camera.
          <p>Question: La camera è a est di che cosa? Answer: giardino.</p>
          <p>QA5 three-arg-relations
Context: Enrico ha preso il pallone lì. Enrico si è recato in giardino. Enrico ha passato il pallone a Giovanni. Maria è
andata in cucina. Giovanni ha passato il pallone a Enrico. Enrico ha consegnato il pallone a Giovanni. Maria ha
preso il latte lì. Giovanni si è diretto verso la cucina. Giovanni si è trasferito in giardino. Daniele si è recato in camera.</p>
          <p>Question: Chi ha ricevuto il pallone? Answer: Giovanni.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Results</title>
        <p>a declared maximum context length of 32k tokens, they
struggle significantly even at much shorter lengths.
Similar observations apply to Phi-4, which fails to achieve
satisfactory results even at just 1/16 of its maximum
declared context window.</p>
        <p>Google’s Gemma3 shows slightly better performance,
managing to handle contexts up to approximately 1/8
of its maximum declared length. Conversely,
Gemini-2.0-flash, with a nominal maximum context length of 1
million tokens, solves fewer than 50% of the tasks at 128k,
an underwhelming result given its scale.</p>
        <p>Among the tested models, LLaMA-3.1-8B stands out
as the most effective. Although we completely evaluated
only its quantised version, which performs slightly below
the full model, it successfully retrieves 35% of the
hidden information even at the maximum declared context
length. It appears to offer an excellent balance between
local deployment feasibility and performance, trailing
only slightly behind the much larger Gemini-2 model.</p>
        <p>Figure 3 presents the per-task performance of the two
best-performing LLMs tested, namely Gemini-2.0-flash
and the quantised version of LLaMA-3.1-8B. The QA2
and QA3 tasks are notably more complex than the
others, with both models struggling to retrieve the target
information in QA3, even within very short contexts.</p>
        <sec id="sec-2-2-1">
          <title>Given these results and the smooth transitions across diferent context lengths, we can conclude that BABILong-ITA appears to be a reliable benchmark for testing the efective context length of LLMs.</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Extending Large Language</title>
    </sec>
    <sec id="sec-4">
      <title>Models Context Length</title>
      <sec id="sec-4-1">
        <title>Extending the context length of LLMs is a key research</title>
        <p>direction aimed at improving their ability to reason over
long documents, maintain dialogue coherence, and
process extensive sequences of information.</p>
      <p>Several approaches have emerged to address the computational and architectural challenges associated with long-context modeling:</p>
      <p>• Sparse Attention and Efficient Transformers. One class of techniques involves modifying the attention mechanism to reduce its quadratic complexity with respect to sequence length. Models such as Longformer [12], BigBird [13], and Reformer [14] introduce sparse or locality-sensitive hashing attention patterns to enable efficient processing of longer sequences. These methods trade off some global attention capacity for linear or sub-quadratic scaling, allowing context lengths up to tens of thousands of tokens.</p>
      <p>• Position Encoding Innovations. Absolute positional encodings pose a limitation on extrapolation beyond trained sequence lengths. Relative positional encodings, as used in Transformer-XL [15], and Rotary Position Embeddings (RoPE), proposed by Su et al. [16], provide better generalisation to longer contexts. More recent methods such as YaRN [17] adjust RoPE scaling to maintain performance across significantly extended context lengths.</p>
      <p>• Training and Fine-Tuning on Long Contexts. Recent advancements show that increasing context length during pretraining can yield substantial improvements. Big models like Claude, Gemini and GPT-4 are examples of models trained or adapted for extended context windows up to 128k tokens or more. Techniques such as long-context fine-tuning, positional interpolation [18], and linear RoPE interpolation [7] have demonstrated effectiveness in scaling pretrained transformers to larger context windows without retraining from scratch.</p>
      <sec id="sec-4-1">
        <title>4.1. Using SelfExtend to increase LLMs context length</title>
        <p>SelfExtend builds the extended context by mixing two attention mechanisms:</p>
        <p>• Neighbour Attention focuses on dependencies among adjacent tokens within a specified range, reducing the standard self-attention window to the closest positions. If L is the context window of the pretrained model, the neighbour window parameter w_n &lt; L controls the dimension of the neighbour attention.</p>
        <p>• Grouped Attention captures dependencies among tokens that are far apart, averaging the contributions of the pretrained self-attention between G_s different positions, where G_s is the group size.</p>
      <sec id="sec-4-2">
        <title>The baseline model for our experiments is the</title>
        <p>The maximum length of the extended context in the largest model produced by the SapienzaNLP team:
ideal case can be computed as sapienzanlp/Minerva-7B-base-v1.0 is a Mistral-based
model configured with a 4096-tokens fixed context and
( − ) *  +  (1) without sliding window pretrained from scratch on
Italthus, for example, if we have  = 4096 and choose ian and English [20]. Building on this baseline, we
ex = 2048 and  = 16, the ideal maximum extended tended its context using SelfExtend with varying values
context would be 34 tokens. of  and , resulting in several variants referred to</p>
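        <p>A compact sketch of the idea (not the authors’ implementation) is shown below: relative positions within the neighbour window are kept unchanged, while larger distances are compressed by the group size so that they never exceed the positions seen during pretraining, which also reproduces Equation (1).</p>
        <preformat>
def selfextend_rel_position(rel, w_n, g_s):
    """Map a query-key relative distance to one the pretrained model has seen.
    Sketch of the SelfExtend idea (Jin et al. [19]), not their exact code."""
    if rel > w_n:
        # grouped attention: far distances are floor-divided by the group size
        return (rel - w_n) // g_s + w_n
    return rel  # neighbour attention: nearby distances stay exact

def max_extended_context(L, w_n, g_s):
    """Ideal maximum extended context of Equation (1): (L - w_n) * G_s + w_n."""
    return (L - w_n) * g_s + w_n

print(max_extended_context(4096, 2048, 16))      # 34816
print(selfextend_rel_position(30000, 2048, 16))  # stays within the 4096 window
        </preformat>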
        <p>Figure 4 shows a small example of attention construction by mixing Neighbour and Grouped Attentions. These two attention levels are computed based on the original model’s self-attention mechanism, allowing for the extension of the context window with only minor code modifications and no need for additional training.</p>
        <p>The authors argue that LLMs inherently possess the capability to handle long contexts, and the primary challenge lies in the out-of-distribution (O.O.D.) issues related to positional encoding. To mitigate this, SelfExtend maps unseen large relative positions to those observed during pretraining, effectively addressing the positional O.O.D. problem.</p>
        <p>Empirical evaluations in Jin et al. [19] demonstrate that SelfExtend substantially improves the long-context understanding ability of LLMs and, in some cases, even outperforms fine-tuning-based methods on tasks such as language modeling, synthetic long-context tasks, and real-world long-context tasks. This method has been successfully applied to various models, including LLaMA-2, Mistral, SOLAR, and Phi-2, showcasing its versatility and effectiveness in extending context windows without compromising performance. More details on SelfExtend can be found in the original paper [19].</p>
        <p>The baseline model for our experiments is the largest model produced by the SapienzaNLP team: sapienzanlp/Minerva-7B-base-v1.0 is a Mistral-based model configured with a fixed 4096-token context and no sliding window, pretrained from scratch on Italian and English [20]. Building on this baseline, we extended its context using SelfExtend with varying values of w_n and G_s, resulting in several variants referred to as “LongMinerva”. These extended models were then evaluated on the proposed BABILong-ITA benchmark.</p>
        <p>Figure 5 presents the results obtained by applying SelfExtend with seven different combinations of w_n and G_s. The method proves to be quite effective, enabling context extension for the original Minerva model while maintaining similar performance for contexts ≤ 4k. Notably, the LongMinerva variants with w_n = 512 or 1024 and G_s = 16 achieved satisfactory performance improvements, given the original performance at 0k. Considering that SelfExtend operates without requiring any additional training or fine-tuning, these results seem particularly promising.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion &amp; Conclusion</title>
      <p>This paper introduced a new benchmark for evaluating the effective context length of LLMs in Italian. Based on a similar resource originally developed for English, we translated and manually cleaned the data to construct a reliable and meaningful Italian benchmark.</p>
      <p>Our evaluation of several prominent LLMs capable of processing Italian validated the quality of the proposed benchmark and offered a clear picture of the actual context lengths these models can effectively handle.</p>
        <p>The conclusions align closely with those reported in
the original BABILong study by Kuratov et al. [9]: LLMs
tend to struggle with retrieving relevant information at
context lengths significantly shorter than their declared
maximum capacities.</p>
        <p>As an additional contribution, we applied the
technique proposed by Jin et al. [19] to extend LLM context
length without any training or fine-tuning, achieving
promising results also for Italian large language models.</p>
      <p>The benchmark data and all the code for reproducing the experiments are available on GitHub.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>I would like to thank the colleague S. Peroni for allowing me to use his GPU system to complete the experiments on extending LLM context length.</title>
      <p>[1] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, E. Grefenstette, The NarrativeQA reading comprehension challenge, Transactions of the Association for Computational Linguistics 6 (2018) 317–328.</p>
      <p>[2] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning, HotpotQA: A dataset for diverse, explainable multi-hop question answering, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 2369–2380.</p>
      <p>[3] W. Kryściński, N. Rajani, D. Agarwal, C. Xiong, D. Radev, BookSum: A collection of datasets for long-form narrative summarization (2021). arXiv:2105.08209.</p>
      <p>[4] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, D. Metzler, Long Range Arena: A benchmark for efficient transformers, in: International Conference on Learning Representations, 2021.</p>
      <p>[5] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, J. Li, LongBench: A bilingual, multitask benchmark for long context understanding, 2024. arXiv:2308.14508.</p>
      <p>[6] C. An, S. Gong, M. Zhong, X. Zhao, M. Li, J. Zhang, L. Kong, X. Qiu, L-Eval: Instituting standardized evaluation for long context language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 14388–14411.</p>
      <p>[7] O. Press, N. Smith, M. Lewis, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, in: International Conference on Learning Representations, 2022.</p>
      <p>[8] X. Wang, M. Salmani, P. Omidi, X. Ren, M. Rezagholizadeh, A. Eshaghi, Beyond the limits: a survey of techniques to extend the context length in large language models, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI ’24, 2024.</p>
      <p>[9] Y. Kuratov, A. Bulatov, P. Anokhin, I. Rodkin, D. Sorokin, A. Sorokin, M. Burtsev, BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack, in: A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang (Eds.), Advances in Neural Information Processing Systems, volume 37, Curran Associates, Inc., 2024, pp. 106519–106554.</p>
      <p>[10] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov, Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks, in: Proceedings of the International Conference on Learning Representations, 2016.</p>
      <p>[11] J. W. Rae, A. Potapenko, S. M. Jayakumar, T. P. Lillicrap, Compressive transformers for long-range sequence modelling, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.</p>
      <p>[12] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).</p>
      <p>[13] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297.</p>
      <p>[14] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.</p>
      <p>[15] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 2978–2988.</p>
      <p>[16] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu, RoFormer: Enhanced transformer with rotary position embedding (2021). arXiv:2104.09864.</p>
      <p>[17] B. Peng, J. Quesnelle, H. Fan, E. Shippole, YaRN: Efficient context window extension of large language models, in: The Twelfth International Conference on Learning Representations, 2024.</p>
      <p>[18] S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window of large language models via positional interpolation (2023). arXiv:2306.15595.</p>
      <p>[19] H. Jin, X. Han, J. Yang, Z. Jiang, Z. Liu, C.-Y. Chang, H. Chen, X. Hu, LLM Maybe LongLM: SelfExtend LLM context window without tuning, in: Proceedings of the 41st International Conference on Machine Learning, ICML’24, JMLR.org, 2024.</p>
      <p>[20] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 707–719.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>