<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BABILong-ITA: a new benchmark for testing Large Language Models effective context length and a Context Extension Method</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <addr-line>via Zamboni, 32, 40126, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper introduces a new benchmark designed to evaluate the effective context length handled by Large Language Models (LLMs) in Italian. Following the structure of the five core tasks from the English BABILong dataset, we created an equivalent benchmark tailored for Italian. We used it to assess the context management capabilities of several prominent LLMs, both small and large, pretrained from scratch or fine-tuned specifically for Italian. Additionally, we tested a context extension technique called “SelfExtend” that does not require any training or fine-tuning phase, measuring its effectiveness using our proposed benchmark.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>context length evaluation</kwd>
        <kwd>new benchmark</kwd>
        <kwd>Italian</kwd>
        <kwd>context extension</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the capabilities of Large Language Models (LLMs) continue to advance, one of the most critical areas of improvement lies in their ability to process and retain information over extended sequences of text, a feature commonly referred to as context length. Traditional benchmarks for evaluating LLMs focus on accuracy, reasoning, and generation quality, but often overlook a systematic assessment of how well a model can operate when presented with extremely long input sequences.</p>
      <p>Long context is crucial for Retrieval-Augmented Generation (RAG) because it allows the model to process and reason over more retrieved information at once. In RAG systems, external documents or chunks of text are retrieved based on a query and then passed to the LLM to generate accurate and contextually relevant answers. A longer context window means the model can consider more documents or larger portions of documents simultaneously, reducing the need to truncate or summarise input data. This leads to better comprehension, improved factual accuracy, and more coherent responses, especially for complex or multi-part queries.</p>
      <p>Evaluating the context length capabilities of LLMs is crucial for understanding their practical utility in real-world applications requiring long-range reasoning, document understanding, and multi-turn conversations. Over the past years, several standardised benchmarks have been developed to assess and compare the performance of LLMs across varying context lengths.</p>
      <p>A widely cited benchmark framework is Kamradt’s ‘Needle-in-a-Haystack’ (https://github.com/gkamradt/LLMTest_NeedleInAHaystack.git), which probes a model’s ability to retrieve a small piece of relevant information embedded in a long, distractor-filled sequence. This test is considered a litmus test for whether models truly attend to long-range dependencies rather than relying on heuristics or recency biases.</p>
      <p>Another critical benchmark is ‘Passage Retrieval and Question Answering’ over long contexts, exemplified by datasets such as ‘NarrativeQA’ [1] and ‘HotpotQA’ [2]. These datasets require models to maintain coherence and extract pertinent information across several paragraphs or documents. The ‘BookSum’ benchmark [3] further extends this approach by evaluating abstractive summarisation over entire books, posing an extreme challenge to context handling.</p>
      <p>To assess performance on computationally efficient long-context processing, the ‘Long Range Arena’ provides a suite of tasks including image classification, text retrieval, and list sorting, adapted to sequence modelling tasks with sequences ranging from 1k to 16k tokens [4]. While not all tasks are purely devoted to natural language processing, they benchmark architectural innovations like sparse attention and memory-efficient transformers.</p>
      <p>‘LongBench’ [5] provides comprehensive testbeds across domains covering key long-text application areas, including single-doc QA, multi-doc QA, summarisation, few-shot learning, synthetic tasks, and code completion in both English and Chinese, evaluating both performance scaling and fidelity to far-positioned inputs. An et al. [6] present a new evaluation suite, ‘L-Eval’, containing 20 sub-tasks, 508 long documents, and more than 2,000 human-labelled query-response pairs including diverse task types, domains, and input lengths.</p>
      <p>Taken together, these benchmarks form a multi-faceted suite of tools that not only test LLMs for maximum supported context length but also probe their effective use of context. As models scale to handle millions of tokens, developing robust and generalisable long-context benchmarks remains an active area of research, especially for languages different from English.</p>
      <p>Regarding the techniques for increasing context ‘awareness’ in transformers, recent works have introduced scaling techniques specifically targeting context length extrapolation. For example, Press et al. [7] proposed in-context learning extrapolation to test model performance when context lengths at inference time far exceed those seen during training. Considering this, we refer to a recent interesting survey on techniques for extending transformer context by Wang et al. [8].</p>
      <p>Figure 1: BABILong schema for generating tasks: task facts are hidden into distractor text fragments extracted from PG19 (picture from [9]).</p>
      <p>Another English benchmark, relevant to this work, is ‘BABILong’ [9], a benchmark specifically designed to evaluate the maximum usable context length of large language models. BABILong provides a controlled and extensible framework for measuring how effectively LLMs can retrieve and use information embedded at various positions within long input contexts. The benchmark simulates real-world scenarios where crucial information may appear early in a document and must be recalled accurately much later, such as in code completion, document summarisation, and legal or scientific reasoning tasks. Each BABILong instance presents the model with a structured sequence containing query-relevant and distractor content spread over thousands to potentially millions of tokens. The model is then tasked with answering queries or completing sequences that require precise recollection of target information, making it possible to assess the degradation of performance as a function of input length.</p>
      <p>Unlike traditional evaluations, BABILong systematically varies the distance between the query and its corresponding reference information, enabling granular analysis of context window utilisation and scaling properties across different architectures. The benchmark supports plug-and-play integration with both decoder-only and encoder-decoder models, and it is agnostic to pretraining data, making it suitable for comparative studies across proprietary and open-source models.</p>
      <p>In summary, BABILong provides a scalable, interpretable, and model-agnostic benchmark for long-context reasoning and memory fidelity, and it is a very useful tool for researchers and practitioners seeking to push the boundaries of efficient long-sequence modelling in large-scale language systems. Moreover, it can be easily extended to other languages: the goal of this work regards the extension of BABILong to Italian, allowing for a careful testing and benchmarking of LLMs that natively handle the Italian language.</p>
      <sec id="sec-1-1">
        <title>Long by, first, translating English sentences belonging to</title>
        <p>BABILong tasks leveraging Google Translate and then
using the Project Gutemberg2 (PG) Italian free texts as
base corpus for extracting distractor fragments.</p>
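      <p>To illustrate the construction procedure, the following minimal sketch shows how a BABILong-style sample could be assembled from translated task facts and an Italian PG book; the function and parameter names are illustrative assumptions, not the actual BABILong or BABILong-ITA generation code.</p>
      <preformat>
import random

def build_sample(task_facts, question, distractor_sentences, tokenizer, target_tokens=4000):
    """Assemble a BABILong-style example: task facts are hidden inside
    distractor sentences taken, in their natural order, from a background book."""
    # Append background sentences until the requested context length is reached.
    context, n_tokens = [], 0
    for sent in distractor_sentences:
        if n_tokens >= target_tokens:
            break
        context.append(sent)
        n_tokens += len(tokenizer.encode(sent))
    # Scatter the task facts at random positions, preserving their relative order.
    positions = sorted(random.sample(range(len(context) + 1), len(task_facts)))
    for offset, (pos, fact) in enumerate(zip(positions, task_facts)):
        context.insert(pos + offset, fact)
    return {"input": " ".join(context), "question": question}
      </preformat>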
        <p>Given that all the major evaluations in the BABILong
paper [9] were performed considering only the first five
tasks, namely QA1-QA5, we decided to translate and
post-process only these five tasks and insert them into
BABILong-ITA.</p>
      <p>In order to build a reliable and effective Italian benchmark we had to manually revise and adapt the automatic translations, ensuring good adherence to common Italian usage and adjusting translation artifacts or wrong translations. In particular, we had to manage the following phenomena:</p>
      <p>• Proper Names Translation: Google Translate did not translate the English proper names of the people involved in the tasks, thus we had to replace them consistently with common Italian proper names, e.g. ‘John’-&gt;‘Giovanni’, ‘Mary’-&gt;‘Maria’, etc.</p>
      <p>• Object/Place Simplification: the automatic translation tended, in some cases, to translate single English words into Italian multi-word expressions, artificially increasing task difficulty. We simplified object/place translations like ‘bedroom’-&gt;‘camera da letto’-&gt;‘camera’ and ‘football’-&gt;‘pallone da calcio’-&gt;‘pallone’, etc.</p>
      <p>• Verb Tenses: for expressing past events English consistently uses the past tense, while in Italian, even if the equivalent past tense ‘passato remoto’ is grammatically correct, the ‘passato prossimo’ is much more common. We therefore adapted the translations replacing all these tenses, e.g. ‘andò’-&gt;‘è andato/a’, ‘posò’-&gt;‘ha posato’ and ‘si spostò’-&gt;‘si è spostato/a’, adapting the suffixes to the sentence subject and preserving the correct grammatical agreement.</p>
      <p>• Preposition Correction: sometimes Google Translate generates inappropriate translations from the point of view of the prepositions used; we corrected them, for example ‘John si recò al giardino’-&gt;‘Giovanni si è recato in giardino.’ or ‘Mary andò nel corridoio’-&gt;‘Maria è andata in corridoio’, ensuring better adherence to their most common use.</p>
      <p>• Translation Mistake Corrections: sometimes, especially when translating questions with implicit referents, Google Translate rendered incorrect Italian sentences that we had to carefully check and correct, also by leveraging regular expressions: for example ‘What is the kitchen west of?’-&gt;‘Qual è la cucina a ovest?’-&gt;‘La cucina è a ovest di che cosa?’.</p>
      <p>While we could have incorporated a broader range of state/position-changing predicates in the translations, we chose to adhere to the original selections, as the English benchmark did not include such variations.</p>
      <p>Table 1 shows one example for each BABILong-ITA task without the insertion of any distractor text (0k configuration).</p>
    </sec>
    <sec id="sec-2">
      <title>3. Benchmark evaluation</title>
      <p>In order to test the effectiveness of the new proposed benchmark and to get some idea about the performance of the most relevant models able to effectively handle the Italian language, we performed a set of experiments involving quite a large set of LLMs.</p>
      <p>First of all, we considered the new models presented in 2024 and trained from scratch on Italian: the first by the SapienzaNLP group (https://nlp.uniroma1.it/minerva/), namely sapienzanlp/Minerva-7B-base-v1.0 and sapienzanlp/Minerva-7B-instruct-v1.0, and, second, the largest model proposed by iGenius/CINECA, using the unofficial conversion sapienzanlp/modello-italia-9b-bf16 for simplicity. We also considered two fine-tuned models from DeepMount00, namely DeepMount00/Qwen2-1.5B-Ita and DeepMount00/Mistral-Ita-7b, a model from Microsoft, microsoft/Phi-4-mini-instruct, one from Meta, meta-llama/Llama-3.1-8B-Instruct, both in its original and quantised form relying on bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_S, and, finally, two models from Google, google/gemma-3-4b-it and the huge google/gemini-2.0-flash. All models were downloaded from the HuggingFace model repository (https://huggingface.co/) and used on a local server, except for gemini-2.0-flash, which was queried using the Google API.</p>
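      <p>As a concrete illustration (not the exact evaluation harness used in this work, whose prompt format and generation settings are not reported here), one of the locally run models can be queried on a benchmark item with the standard HuggingFace transformers API:</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sapienzanlp/Minerva-7B-base-v1.0"  # one of the evaluated checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical BABILong-ITA item: a (long) context followed by the question.
context = "Sandra si è diretta verso il corridoio. Sandra ha afferrato il pallone lì. ..."
question = "Dov'è il pallone?"
prompt = f"{context}\nDomanda: {question}\nRisposta:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
      </preformat>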
      <sec id="sec-2-1">
        <title>3.1. Experiments setting</title>
        <sec id="sec-2-1-1">
          <title>In BABILong, the authors consider performance satis</title>
          <p>factory if the accuracy of an answer exceeds 85% and
a complete failure if it is below 30%. Of course, as the
authors said, this definition of “satisfactory performance”
is not universal and should be adapted to the specific task
at hand.</p>
          <p>The comparison with the correct result follows the
• Translation Mistake Corrections: sometimes, original BABILong evaluation method: the LLM output
especially when translating questions with im- is lowercased, and the first valid target it names is
conplicit referents, Google Translate rendered incor- sidered as the LLM answer and compared with the gold
rect Italian sentences that we have to carefully target in order to compute model accuracy.</p>
        </sec>
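        <p>The following short sketch shows this scoring rule; the set of candidate targets (locations, objects, or person names, depending on the task) is an assumption here, since the exact answer vocabulary is defined per task in the benchmark code.</p>
        <preformat>
def score_answer(llm_output, gold, valid_targets):
    """BABILong-style scoring sketch: lowercase the model output and take the
    first valid target it mentions as the model's answer."""
    out = llm_output.lower()
    hits = [(out.find(t.lower()), t.lower()) for t in valid_targets if t.lower() in out]
    if not hits:
        return False                      # no valid target named at all
    first_named = min(hits)[1]            # earliest occurrence in the output
    return first_named == gold.lower()

# Accuracy over a list of (model_output, gold_answer) pairs:
# accuracy = sum(score_answer(o, g, targets) for o, g in pairs) / len(pairs)
        </preformat>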
        <sec id="sec-2-1-2">
          <title>2https://www.gutenberg.org/ 3https://nlp.uniroma1.it/minerva/ 4https://huggingface.co/</title>
          <p>QA2 two-supporting-facts
Context: Sandra si è diretta verso il corridoio. Giovanni si è diretto verso il bagno. Sandra ha afferrato il pallone lì. Daniele si
è recato in camera. Giovanni ha preso il latte lì. Giovanni ha lasciato cadere il latte. Sandra si è trasferita in giardino. Daniele
è tornato in corridoio. Sandra ha buttato via il pallone. Giovanni si è spostato in corridoio. Giovanni è tornato in giardino.
Sandra è andata in cucina. Daniele si è trasferito in camera. Sandra si è diretta verso il corridoio. Sandra si è trasferita in
cucina. Giovanni si è recato in ufficio. Sandra è andata in giardino. Sandra ha afferrato il pallone lì. Sandra ha
posato lì il pallone. Daniele è tornato in cucina.</p>
          <p>Question: Dov’è il pallone? Answer: giardino.</p>
          <p>QA3 three-supporting-facts
Context: Maria è andata in ufficio. Sandra si è spostata in corridoio. Sandra ha afferrato il pallone. Maria ha preso lì la
mela. Sandra si è recata in giardino. Daniele si è spostato in corridoio. Sandra ha posato il pallone. Daniele è andato in
camera. Sandra ha preso il pallone. Maria ha posato la mela. Maria è tornata in bagno. Giovanni si è spostato in bagno.
Giovanni è andato in corridoio. Sandra ha posato il pallone. Daniele si è diretto verso il corridoio. Sandra ha raccolto il
pallone. Sandra si è recata in ufficio. Daniele si è recato in bagno. Daniele è tornato in ufficio. Daniele si è recato in cucina.
Sandra ha raccolto la mela lì. Sandra ha buttato lì la mela. Sandra ha lasciato cadere il pallone. Giovanni si è recato in
giardino. Maria si è recata in giardino. Sandra ha afferrato il pallone lì. Sandra ha buttato lì il pallone. Sandra si è diretta
verso la cucina. Maria si è trasferita in camera. Maria è andata in corridoio. Sandra si è diretta verso il corridoio. Giovanni
è andato in cucina. Sandra si è recata in bagno. Daniele è tornato in bagno. Giovanni si è trasferito in ufficio. Giovanni
ha preso il latte. Giovanni si è diretto verso il bagno. Daniele è tornato in camera. Maria si è recata in camera. Daniele si
è diretto verso il corridoio. Giovanni si è trasferito in camera. Sandra si è recata in giardino. Daniele è tornato in cucina.
Giovanni ha lasciato il latte. Daniele si è recato in ufficio. Daniele ha preso il pallone. Maria è andata in corridoio. Daniele
ha afferrato la mela lì. Giovanni si è diretto verso il bagno. Giovanni si è diretto verso il corridoio. Giovanni è andato
in ufficio. Giovanni è tornato in cucina. Maria si è recata in ufficio. Daniele è tornato in giardino. Daniele è andato
in camera. Daniele si è spostato in bagno. Daniele è tornato in giardino. Sandra è tornata in bagno. Daniele è
andato in camera. Daniele ha lasciato la mela. Daniele ha lasciato il pallone. Daniele ha afferrato il pallone.
Question: Dov’era la mela prima di essere in camera? Answer: giardino.</p>
          <p>QA4 two-arg-relations
Context: Il giardino si trova a ovest della camera. L’ufficio si trova a est della camera.
          <p>Question: La camera è a est di che cosa? Answer: giardino.</p>
          <p>QA5 three-arg-relations
Context: Enrico ha preso il pallone lì. Enrico si è recato in giardino. Enrico ha passato il pallone a Giovanni. Maria è
andata in cucina. Giovanni ha passato il pallone a Enrico. Enrico ha consegnato il pallone a Giovanni. Maria ha
preso il latte lì. Giovanni si è diretto verso la cucina. Giovanni si è trasferito in giardino. Daniele si è recato in camera.</p>
          <p>Question: Chi ha ricevuto il pallone? Answer: Giovanni.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Results</title>
        <p>a declared maximum context length of 32k tokens, they
struggle significantly even at much shorter lengths.
Similar observations apply to Phi-4, which fails to achieve
satisfactory results even at just 1/16 of its maximum
declared context window.</p>
        <p>Google’s Gemma3 shows slightly better performance,
managing to handle contexts up to approximately 1/8
of its maximum declared length. Conversely,
Gemini-2.0-flash, with a nominal maximum context length of 1
million tokens, solves fewer than 50% of the tasks at 128k,
an underwhelming result given its scale.</p>
        <p>Among the tested models, LLaMA-3.1-8B stands out
as the most effective. Although we completely evaluated
only its quantised version, which performs slightly below
the full model, it successfully retrieves 35% of the
hidden information even at the maximum declared context
length. It appears to offer an excellent balance between
local deployment feasibility and performance, trailing
only slightly behind the much larger Gemini-2 model.</p>
        <p>Figure 3 presents the per-task performance of the two
best-performing LLMs tested, namely Gemini-2.0-flash
and the quantised version of LLaMA-3.1-8B. The QA2
and QA3 tasks are notably more complex than the
others, with both models struggling to retrieve the target
information in QA3, even within very short contexts.</p>
        <sec id="sec-2-2-1">
          <title>Given these results and the smooth transitions across diferent context lengths, we can conclude that BABILong-ITA appears to be a reliable benchmark for testing the efective context length of LLMs.</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Extending Large Language</title>
    </sec>
    <sec id="sec-4">
      <title>Models Context Length</title>
      <sec id="sec-4-1">
        <title>Extending the context length of LLMs is a key research</title>
        <p>direction aimed at improving their ability to reason over
long documents, maintain dialogue coherence, and
process extensive sequences of information.</p>
      <p>Several approaches have emerged to address the computational and architectural challenges associated with long-context modeling:</p>
      <p>• Sparse Attention and Efficient Transformers. One class of techniques involves modifying the attention mechanism to reduce its quadratic complexity with respect to sequence length. Models such as Longformer [12], BigBird [13], and Reformer [14] introduce sparse or locality-sensitive hashing attention patterns to enable efficient processing of longer sequences. These methods trade off some global attention capacity for linear or sub-quadratic scaling, allowing context lengths up to tens of thousands of tokens.</p>
      <p>• Position Encoding Innovations. Absolute positional encodings pose a limitation on extrapolation beyond trained sequence lengths. Relative positional encodings, as used in Transformer-XL [15], and Rotary Position Embeddings (RoPE), proposed by Su et al. [16], provide better generalisation to longer contexts. More recent methods such as YaRN [17] adjust RoPE scaling to maintain performance across significantly extended context lengths.</p>
      <p>• Training and Fine-Tuning on Long Contexts. Recent advancements show that increasing context length during pretraining can yield substantial improvements. Big models like Claude, Gemini and GPT-4 are examples of models trained or adapted for extended context windows up to 128k tokens or more. Techniques such as long-context fine-tuning, positional interpolation [18], and linear RoPE interpolation [7] have demonstrated effectiveness in scaling pretrained transformers to larger context windows without retraining from scratch.</p>
      <sec id="sec-4-1">
        <title>4.1. Using SelfExtend to increase LLMs context length</title>
        <p>SelfExtend builds the extended context by mixing two attention mechanisms:</p>
        <p>• Neighbour Attention focuses on dependencies among adjacent tokens within a specified range, reducing the standard self-attention window to the closest positions. If L is the context window of the pretrained model, the neighbour window parameter w_n &lt; L controls the dimension of the neighbour attention.</p>
        <p>• Grouped Attention captures dependencies among tokens that are far apart, averaging the contributions of the pretrained self-attention between G_s different positions, where G_s is the group size.</p>
      <sec id="sec-4-2">
        <title>The baseline model for our experiments is the</title>
        <p>The maximum length of the extended context in the largest model produced by the SapienzaNLP team:
ideal case can be computed as sapienzanlp/Minerva-7B-base-v1.0 is a Mistral-based
model configured with a 4096-tokens fixed context and
( − ) *  +  (1) without sliding window pretrained from scratch on
Italthus, for example, if we have  = 4096 and choose ian and English [20]. Building on this baseline, we
ex = 2048 and  = 16, the ideal maximum extended tended its context using SelfExtend with varying values
context would be 34 tokens. of  and , resulting in several variants referred to</p>
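        <p>A compact sketch of the idea (not the authors’ implementation) is shown below: relative positions within the neighbour window are kept unchanged, while larger distances are compressed by the group size so that they never exceed the positions seen during pretraining, which also reproduces Equation (1).</p>
        <preformat>
def selfextend_rel_position(rel, w_n, g_s):
    """Map a query-key relative distance to one the pretrained model has seen.
    Sketch of the SelfExtend idea (Jin et al. [19]), not their exact code."""
    if rel > w_n:
        # grouped attention: far distances are floor-divided by the group size
        return (rel - w_n) // g_s + w_n
    return rel  # neighbour attention: nearby distances stay exact

def max_extended_context(L, w_n, g_s):
    """Ideal maximum extended context of Equation (1): (L - w_n) * G_s + w_n."""
    return (L - w_n) * g_s + w_n

print(max_extended_context(4096, 2048, 16))      # 34816
print(selfextend_rel_position(30000, 2048, 16))  # stays within the 4096 window
        </preformat>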
        <p>Figure 4 shows a small example of attention construction by mixing Neighbour and Grouped Attentions. These two attention levels are computed based on the original model’s self-attention mechanism, allowing for the extension of the context window with only minor code modifications and no need for additional training.</p>
        <p>The authors argue that LLMs inherently possess the capability to handle long contexts, and the primary challenge lies in the out-of-distribution (O.O.D.) issues related to positional encoding. To mitigate this, SelfExtend maps unseen large relative positions to those observed during pretraining, effectively addressing the positional O.O.D. problem.</p>
        <p>Empirical evaluations in Jin et al. [19] demonstrate that SelfExtend substantially improves the long-context understanding ability of LLMs and, in some cases, even outperforms fine-tuning-based methods on tasks such as language modeling, synthetic long-context tasks, and real-world long-context tasks. This method has been successfully applied to various models, including LLaMA-2, Mistral, SOLAR, and Phi-2, showcasing its versatility and effectiveness in extending context windows without compromising performance. More details on SelfExtend can be found in the original paper [19].</p>
        <p>The baseline model for our experiments is the largest model produced by the SapienzaNLP team: sapienzanlp/Minerva-7B-base-v1.0 is a Mistral-based model configured with a fixed 4096-token context and no sliding window, pretrained from scratch on Italian and English [20]. Building on this baseline, we extended its context using SelfExtend with varying values of w_n and G_s, resulting in several variants referred to as “LongMinerva”. These extended models were then evaluated on the proposed BABILong-ITA benchmark.</p>
        <p>Figure 5 presents the results obtained by applying SelfExtend with seven different combinations of w_n and G_s. The method proves to be quite effective, enabling context extension for the original Minerva model while maintaining similar performance for contexts ≤ 4k. Notably, the LongMinerva variants with w_n = 512 or 1024 and G_s = 16 achieved satisfactory performance improvements, given the original performance at 0k. Considering that SelfExtend operates without requiring any additional training or fine-tuning, these results seem particularly promising.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion &amp; Conclusion</title>
      <p>This paper introduced a new benchmark for evaluating the effective context length of LLMs in Italian. Based on a similar resource originally developed for English, we translated and manually cleaned the data to construct a reliable and meaningful Italian benchmark.</p>
      <p>Our evaluation of several prominent LLMs capable of processing Italian validated the quality of the proposed benchmark and offered a clear picture of the actual context lengths these models can effectively handle.</p>
        <p>The conclusions align closely with those reported in
the original BABILong study by Kuratov et al. [9]: LLMs
tend to struggle with retrieving relevant information at
context lengths significantly shorter than their declared
maximum capacities.</p>
        <p>As an additional contribution, we applied the
technique proposed by Jin et al. [19] to extend LLM context
length without any training or fine-tuning, achieving
promising results also for Italian large language models.</p>
      <p>The benchmark data and all the code for reproducing the experiments are available on GitHub.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>I would like to thank the colleague S. Peroni for allowing me to use his GPU system to complete the experiments on extending LLM context length.</title>
      <p>[1] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, E. Grefenstette, The NarrativeQA reading comprehension challenge, Transactions of the Association for Computational Linguistics 6 (2018) 317–328.</p>
      <p>[2] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning, HotpotQA: A dataset for diverse, explainable multi-hop question answering, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 2369–2380.</p>
      <p>[3] W. Kryściński, N. Rajani, D. Agarwal, C. Xiong, D. Radev, BookSum: A collection of datasets for long-form narrative summarization (2021). arXiv:2105.08209.</p>
      <p>[4] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, D. Metzler, Long Range Arena: A benchmark for efficient transformers, in: International Conference on Learning Representations, 2021.</p>
      <p>[5] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, J. Li, LongBench: A bilingual, multitask benchmark for long context understanding, 2024. arXiv:2308.14508.</p>
      <p>[6] C. An, S. Gong, M. Zhong, X. Zhao, M. Li, J. Zhang, L. Kong, X. Qiu, L-Eval: Instituting standardized evaluation for long context language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 14388–14411.</p>
      <p>[7] O. Press, N. Smith, M. Lewis, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, in: International Conference on Learning Representations, 2022.</p>
      <p>[8] X. Wang, M. Salmani, P. Omidi, X. Ren, M. Rezagholizadeh, A. Eshaghi, Beyond the limits: a survey of techniques to extend the context length in large language models, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI ’24, 2024.</p>
      <p>[9] Y. Kuratov, A. Bulatov, P. Anokhin, I. Rodkin, D. Sorokin, A. Sorokin, M. Burtsev, BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack, in: A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang (Eds.), Advances in Neural Information Processing Systems, volume 37, Curran Associates, Inc., 2024, pp. 106519–106554.</p>
      <p>[10] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov, Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks, in: Proceedings of the International Conference on Learning Representations, 2016.</p>
      <p>[11] J. W. Rae, A. Potapenko, S. M. Jayakumar, T. P. Lillicrap, Compressive transformers for long-range sequence modelling, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.</p>
      <p>[12] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).</p>
      <p>[13] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297.</p>
      <p>[14] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.</p>
      <p>[15] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 2978–2988.</p>
      <p>[16] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu, RoFormer: Enhanced transformer with rotary position embedding (2021). arXiv:2104.09864.</p>
      <p>[17] B. Peng, J. Quesnelle, H. Fan, E. Shippole, YaRN: Efficient context window extension of large language models, in: The Twelfth International Conference on Learning Representations, 2024.</p>
      <p>[18] S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window of large language models via positional interpolation (2023). arXiv:2306.15595.</p>
      <p>[19] H. Jin, X. Han, J. Yang, Z. Jiang, Z. Liu, C.-Y. Chang, H. Chen, X. Hu, LLM Maybe LongLM: SelfExtend LLM context window without tuning, in: Proceedings of the 41st International Conference on Machine Learning, ICML’24, JMLR.org, 2024.</p>
      <p>[20] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 707–719.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>