<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization, Adaptation and Applications of Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Ponce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fundación Vicomtech, Basque Research and Technology Alliance (BRTA)</institution>
          ,
          <addr-line>Mikeletegi 57, 20009 Donostia-San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
          ;
          <institution>University of the Basque Country (EHU), Faculty of Informatics</institution>
          ,
          <addr-line>Manuel Lardizabal pasealekua 1, 20018 Donostia-San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The exponential growth in the size of Large Language Models (LLMs) has led to a paradigm shift in Natural Language Processing (NLP), demonstrating unprecedented capabilities across diverse natural language understanding and generation tasks. Despite their remarkable performance, these models present substantial computational and environmental challenges due to their massive parameter counts, requiring significant resources for both training and deployment phases. This doctoral thesis aims to explore three complementary aspects of LLMs: (i) model compression and optimization; (ii) parameter-efficient adaptation for downstream tasks; and (iii) exploring the application of language models across a diverse spectrum of NLP tasks, with particular emphasis on leveraging efficient architectures such as Small Language Models (SLMs). This work aims to establish optimal trade-offs between model performance and computational efficiency, thereby contributing to the development of more accessible and environmentally sustainable language technologies without compromising task-specific efficacy.</p>
      </abstract>
      <kwd-group>
        <kwd>Efficient Large Language Models</kwd>
        <kwd>Small Language Models</kwd>
        <kwd>Parameter-Efficient Adaptation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Small Language Models aim to provide alternatives for deployment in resource-constrained environments, though
they typically demonstrate lower performance than their larger counterparts, particularly on complex
reasoning tasks.</p>
      <p>
        Various efficiency techniques have emerged to mitigate the computational demands of adapting
language models to specific tasks or domains. Parameter-efficient methods like Low-Rank Adaptation
(LoRA) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] enable model specialization with minimal computational overhead by introducing small
trainable components while keeping most parameters frozen. Quantization reduces memory requirements
by representing weights with fewer bits [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], while knowledge distillation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] transfers capabilities
from larger models to more compact architectures. Pruning eliminates redundant parameters, creating
sparser networks that maintain performance with reduced computational needs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Training
optimizations including mixed precision training [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and curriculum learning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] further reduce resource
requirements while potentially improving model quality. These approaches collectively offer practical
solutions for deploying advanced language technologies in resource-constrained environments without
compromising essential capabilities.
      </p>
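      <p>To make the low-rank update concrete, the following PyTorch sketch wraps a frozen linear layer with trainable low-rank factors. The rank, scaling factor, and layer dimensions are illustrative assumptions rather than the exact configuration of [9].</p>
      <preformat><![CDATA[
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update W + B A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank          # zero-initialized B keeps the start point unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus low-rank correction; only lora_a/lora_b receive gradients.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable vs. 590592 frozen parameters
]]></preformat>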
      <p>This doctoral thesis investigates optimization, adaptation, and application for language models with
the objective of enabling their effective deployment in computationally constrained environments. This
research aims to identify practical trade-offs between model performance and computational efficiency.
The work focuses on three complementary approaches: (i) optimization of training methodologies
through data curation and architectural refinements; (ii) parameter-efficient adaptation techniques
that minimize computational overhead during task specialization; (iii) exploring the application of
language models across a diverse spectrum of NLP tasks, with particular emphasis on leveraging efficient
architectures and compact models, which facilitate the practical deployment of these technologies in
industrial settings where computational resources are limited. These approaches address the need for
more accessible and sustainable language model technologies that maintain adequate performance
while reducing environmental impact.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Training Optimization</title>
        <p>The training phase represents the most computationally intensive component in the lifecycle of large
language models, with contemporary models requiring staggering amounts of computation. Substantial
research efforts have focused on reducing training duration without compromising model quality
through data-centric methods and computational eficiency techniques.</p>
        <p>
          Data curation has emerged as a critical determinant of model efficiency. Dodge et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
demonstrated that careful data selection yields models outperforming those trained on substantially larger but
less refined datasets. Gunasekar et al. [16] showed models trained on high-quality "textbook-like" data
achieve performance comparable to those trained on vastly larger web-scraped corpora. Recent
contributions from Allal et al. [17] demonstrated benefits in mixing textual data with code and mathematical
content when training SLMs, addressing their heightened sensitivity to data noise [
          <xref ref-type="bibr" rid="ref7">18, 7</xref>
          ]. Curriculum
learning [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] presents examples to models in a structured progression from simple to complex instances,
demonstrating improved convergence rates across multiple domains.
        </p>
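        <p>A minimal sketch of a curriculum schedule, assuming token count as the difficulty proxy and a linear pacing function; both are hypothetical choices for illustration, not the design of [14].</p>
        <preformat><![CDATA[
from torch.utils.data import DataLoader, Subset

def curriculum_loader(dataset, difficulty, epoch, total_epochs, batch_size=32):
    """Expose progressively harder examples: at epoch e, train on the easiest
    fraction (e + 1) / total_epochs of the data, ranked by a difficulty score."""
    order = sorted(range(len(dataset)), key=lambda i: difficulty(dataset[i]))
    cutoff = max(batch_size, int(len(dataset) * (epoch + 1) / total_epochs))
    return DataLoader(Subset(dataset, order[:cutoff]),
                      batch_size=batch_size, shuffle=True)

# Hypothetical difficulty proxy: token count of the example's text field.
difficulty = lambda example: len(example["text"].split())
]]></preformat>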
        <p>
          Mixed Precision Training [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] uses lower precision numerical formats for most computational
operations while selectively using higher precision for numerically critical operations, typically yielding
2-3x throughput improvements while maintaining model convergence through techniques like loss
scaling.
        </p>
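        <p>A minimal sketch of such a step with PyTorch automatic mixed precision, assuming a model that returns its loss in the Hugging Face style; the gradient scaler implements the loss scaling mentioned above.</p>
        <preformat><![CDATA[
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, batch, optimizer):
    """One mixed-precision step: most matmuls run in float16, while loss
    scaling protects small float16 gradients from underflowing to zero."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss     # forward pass in reduced precision
    scaler.scale(loss).backward()      # backpropagate the scaled loss
    scaler.step(optimizer)             # unscale; skip the step on inf/nan
    scaler.update()                    # adapt the scale factor over time
    return loss.item()
]]></preformat>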
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Inference Optimization</title>
        <p>Inference efficiency is critical for practical deployment scenarios, particularly for latency-sensitive
applications. Key approaches include structural modifications and representational optimizations.</p>
        <p>
          Model pruning eliminates redundant parameters according to various saliency criteria. In
transformer-based models, structured pruning has shown particular efficacy, demonstrating that up to 50% of attention
heads can be removed with minimal performance degradation [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Muralidharan et al. [19] advanced
this field with a structured approach involving cycles of pruning, knowledge transfer, and weight
adjustment. Layer Collapse [20] enables model size reduction by collapsing rear layers into prior ones
while preserving model structure, maintaining over 80% of task performance at 25-30% pruning ratios
and outperforming existing structured pruning methods.
        </p>
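        <p>As an illustration of structured head pruning, the sketch below ranks the heads of one attention block by a simple L1 saliency proxy and zeroes the weakest ones; published criteria are more elaborate and physically remove the pruned weights.</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

def prune_attention_heads(out_proj: nn.Linear, num_heads: int, keep: int):
    """Score each head by the L1 norm of its slice of the output projection,
    then zero the weakest heads (illustrative saliency criterion)."""
    head_dim = out_proj.in_features // num_heads
    w = out_proj.weight.data            # shape: (d_model, num_heads * head_dim)
    scores = torch.stack([w[:, h * head_dim:(h + 1) * head_dim].abs().sum()
                          for h in range(num_heads)])
    pruned = scores.argsort()[: num_heads - keep]   # weakest heads first
    for h in pruned:
        w[:, h * head_dim:(h + 1) * head_dim].zero_()
    return pruned.tolist()
]]></preformat>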
        <p>Low-rank factorization decomposes weight matrices into products of smaller matrices, exploiting
the inherent low-rank nature of neural network parameters. Wang et al. [21] demonstrated that
attention matrices can be effectively approximated through such decompositions, reducing computational
requirements and memory footprint.</p>
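        <p>A sketch of such a factorization via truncated SVD, replacing one linear layer with two smaller ones so that the parameter count drops from out*in to rank*(out+in):</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate W with two low-rank factors B (out x rank) and A (rank x in)."""
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = Vh[:rank, :] * S[:rank].sqrt().unsqueeze(1)
    B = U[:, :rank] * S[:rank].sqrt()
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)   # second(first(x)) ~= layer(x)
]]></preformat>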
        <p>
          Knowledge distillation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] transfers information from a large "teacher" model to a compact "student"
model. Gu et al. [22] showed this can preserve much of the in-context learning capabilities of LLMs in
smaller models. Recent implementations in Gemma-2 [23] and LaMini-GPT [24] have applied advanced
distillation variants for resource-constrained environments.
        </p>
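        <p>The classic distillation objective of [11] can be sketched as follows, blending temperature-softened teacher targets with the standard supervised loss; the temperature and mixing weight shown are illustrative defaults.</p>
        <preformat><![CDATA[
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """KL divergence between softened teacher and student distributions,
    combined with cross-entropy on the gold labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # T^2 rescales gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
]]></preformat>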
        <p>Quantization reduces parameter and activation precision from training standards to lower bit-width
formats. Recent developments have focused on mixed-precision approaches, enabling sub-8-bit
quantization with minimal accuracy impact. Models like Qwen [25] and StableLM [26] demonstrate
quantization’s effectiveness.</p>
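        <p>A minimal sketch of symmetric per-tensor int8 weight quantization; deployed schemes are considerably more refined (per-channel scales, activation calibration, sub-8-bit formats), but the storage-versus-precision trade-off is the same.</p>
        <preformat><![CDATA[
import torch

def quantize_int8(w: torch.Tensor):
    """Store int8 values plus one float scale; w is recovered as q * scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(768, 768)
q, scale = quantize_int8(w)
error = (w - q.to(torch.float32) * scale).abs().max()
print(error)  # small reconstruction error, 4x less memory than float32
]]></preformat>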
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Parameter-Efficient Adaptation Techniques</title>
        <p>Parameter-Efficient Fine-Tuning (PEFT) methodologies modify only a small subset of parameters while
maintaining comparable performance to full fine-tuning, addressing the prohibitive costs of conventional
approaches for LLMs.</p>
        <p>
          Adapter-based methods incorporate specialized modules that compress and expand internal
representations. During adaptation, only these modules are trained while the base model remains frozen,
reducing trainable parameters by 95-99%. Strategic placement of adapters has achieved near
full fine-tuning performance with as few as 0.1% of trainable parameters [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Prompt-tuning [27] modifies input
representation space rather than internal model parameters, prepending optimizable continuous vectors
to input embeddings. This approach shows scaling properties where performance approaches full
fine-tuning as model size increases. Prefix-tuning [28] generalizes prompt tuning by incorporating
optimizable vectors at each transformer layer, enabling more expressive adaptation while
maintaining parameter efficiency. Low-Rank Adaptation [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] approximates weight updates through low-rank
decompositions, significantly reducing memory requirements. Recent extensions include QLoRA
[29], combining quantization with LoRA, and AdaLoRA [30], optimizing rank allocation across model
components. LOMO [31] offers an alternative approach for full parameter fine-tuning with limited
resources.
        </p>
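        <p>For illustration, a bottleneck adapter of the kind described above can be sketched as follows; it is initialized as a near-identity so training starts from the frozen model's behaviour, and the bottleneck width is an arbitrary example value.</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, with a residual connection."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # near-identity at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

adapter = BottleneckAdapter(d_model=768)
print(sum(p.numel() for p in adapter.parameters()))  # ~0.1M per insertion point
]]></preformat>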
        <p>Additionally, Retrieval-Augmented Generation (RAG) offers a complementary approach that enhances
model capabilities by retrieving relevant information from external knowledge sources without
modifying model parameters, enabling domain adaptation through contextualization rather than fine-tuning
[32].</p>
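        <p>A minimal sketch of the retrieve-then-contextualize pattern, assuming precomputed document embeddings and a frozen generator; the embedding source and prompt template are hypothetical.</p>
        <preformat><![CDATA[
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Rank documents by cosine similarity and return the top-k passages."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    top = np.argsort(d @ q)[::-1][:k]
    return [docs[i] for i in top]

def rag_prompt(question, passages):
    """Contextualize a frozen generator with the retrieved evidence."""
    context = "\n".join("- " + p for p in passages)
    return ("Answer using the context below.\n"
            "Context:\n" + context + "\nQuestion: " + question)
]]></preformat>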
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Architecture Optimization and Small Language Models</title>
        <p>
          Studies have revealed substantial redundancy in pretrained transformers, showing approximately 80% of
attention heads can be removed with minimal performance impact [33]. Gromov et al. [34] demonstrated
the limited contribution of deeper layers in generative language models, exploiting this through
layer pruning and fine-tuning. Small Language Models challenge the "bigger is better" narrative. Schick
and Schütze [
          <xref ref-type="bibr" rid="ref16">35</xref>
          ] showed relatively small models (60-350M parameters) could achieve competitive
few-shot learning performance through carefully constructed prompting. Models like TinyLlama
[
          <xref ref-type="bibr" rid="ref17">36</xref>
          ], MobileLLaMA [
          <xref ref-type="bibr" rid="ref18">37</xref>
          ], and Phi-4 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] integrate various techniques optimizing neural networks while
limiting quality losses, demonstrating SLMs can handle specific tasks and exhibit emergent capabilities
similar to larger models.
        </p>
        <p>
          Alternative transformer formulations like Performer [
          <xref ref-type="bibr" rid="ref19">38</xref>
          ] and Linear Transformer [
          <xref ref-type="bibr" rid="ref20">39</xref>
          ] replace
quadratic-complexity attention with linear approximations. Other optimization advances include
Multi-Query Attention [
          <xref ref-type="bibr" rid="ref21">40</xref>
          ], Group-Query Attention [
          <xref ref-type="bibr" rid="ref22">41</xref>
          ], and FlashAttention [
          <xref ref-type="bibr" rid="ref23">42</xref>
          ], optimizing memory
and inference speed through improved data access. The integration of architectural innovations with
efficient adaptation techniques offers a promising direction, potentially delivering order-of-magnitude
efficiency improvements compared to conventional methodologies.
        </p>
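        <p>The kernel view behind such linear attention variants can be sketched as follows, using a positive feature map as in [39]; this non-causal simplification is meant to show the O(n) reordering of the computation, not to serve as a drop-in replacement.</p>
        <preformat><![CDATA[
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """With a feature map phi, softmax attention is approximated by
    phi(Q) (phi(K)^T V), costing O(n) instead of O(n^2) in sequence length."""
    phi = lambda x: F.elu(x) + 1.0               # positive feature map
    q, k = phi(q), phi(k)                        # shapes: (batch, n, d)
    kv = torch.einsum("bnd,bne->bde", k, v)      # sum over positions once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
]]></preformat>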
        <p>
          Parallel to transformer optimization efforts, researchers have investigated fundamentally
different architectural paradigms. Recurrent RWKV [
          <xref ref-type="bibr" rid="ref24">43</xref>
          ] combines RNN-style sequential processing with
transformer-like parallelization, achieving linear scaling with sequence length and constant memory
usage during inference. State Space Models such as Mamba [
          <xref ref-type="bibr" rid="ref25">44</xref>
          ] have demonstrated exceptional
performance on long-sequence tasks while maintaining linear computational complexity through selective
scanning mechanisms that efficiently capture long-range dependencies. Diffusion-based language
models, exemplified by LLaDa [
          <xref ref-type="bibr" rid="ref26">45</xref>
          ], represent another promising research direction that adapts iterative
denoising frameworks from computer vision to text generation, offering unique advantages in
controllable text synthesis and generation diversity. These architectural alternatives complement transformer
optimization approaches by addressing fundamental eficiency limitations through novel computational
paradigms rather than parameter reduction alone.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Description of the proposed research and hypotheses</title>
      <p>For the thesis described in this document, the following main objective has been defined: Research and
development of optimization, adaptation, and application strategies for Large Language Models that
establish optimal trade-offs between performance and computational efficiency.</p>
      <sec id="sec-3-1">
        <title>3.1. Objectives</title>
        <p>In the framework of this thesis, optimization techniques applied to LLMs will be investigated and
compared with the aim of reducing the hardware requirements and computational resources demanded
by these models during the training and inference phases. Therefore, the work will focus on efficient
training and adaptation methods, as well as optimization and compression techniques for large language
models, in order to deploy these models in a realistic production environment.</p>
        <p>To meet the objective, the following tasks will be undertaken:
• Optimization for the training and adaptation of large language models: Given the extensive
computational infrastructure resources necessary for the training and use of high-quality large
language models, lower-cost alternatives will be investigated at different levels: (i) optimization of
training data, based on data selection methods, Curriculum Learning, and preprocessing variants
(word segmentation, casing, etc.); and (ii) architecture optimization, based on variants of the
standard Transformer architecture in particular.
• Efficient adaptation methods: Different techniques will be explored, such as adaptation through
fine-tuning, which involves adjusting the weights of neural networks in language models, and
methods that only require minimal additional weight adjustments, without the need to adjust the
base network weights completely, based on methods like prefix-learning, LoRA, or adapter-tuning.
• Efficient deployment and inference of large language models: Compression techniques for
large language models will be investigated, such as pruning, distillation, or quantization, with
the aim of deploying these models in limited computational environments.
• Evaluation of language model capabilities across functional domains: This work will assess
how language models perform across varied NLP applications, examining their effectiveness in
different contexts such as conversational AI, machine translation, text simplification and other
specialized domain tasks while focusing on models optimized for computational efficiency.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hypothesis</title>
        <p>This thesis is built upon the following hypotheses:
• Efficiency-Performance Trade-off: Small Language Models (1-5B parameters) optimized
through selected compression and adaptation techniques can achieve performance comparable to
much larger models on specific NLP tasks while requiring fewer computational resources.
• Architecture Optimality: The redundancy in standard transformer architectures can be
systematically identified and eliminated, resulting in models with fewer parameters that may preserve
nearly all of the original performance across common NLP benchmarks.
• Data Leverage: Carefully curated training and fine-tuning datasets can compensate for reduced
model size, enabling more efficient models to match or exceed the performance of larger models
trained on noisy or unfiltered corpora.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology and Research Progress</title>
      <sec id="sec-4-1">
        <title>4.1. Research Methodology</title>
        <p>The research methodology adopts a multifaceted approach to investigate efficiency-performance
trade-offs in language model optimization and deployment:
• Continuous State-of-the-Art Literature Analysis: A systematic and ongoing review of
emerging research constitutes a foundational component of the methodology. This continuous
monitoring is crucial given the rapidly evolving developments in efficient language modeling.
• Cross-Domain Task Evaluation: Language models will be systematically evaluated across
diverse NLP tasks using established benchmarks for fair comparisons and reproducibility. This
approach enables the identification of domains where optimized smaller models demonstrate
competitive performance relative to their larger counterparts.
• Parameter-Efficient Adaptation Exploration: The comparative efficacy of adaptation
techniques will be assessed for domain specialization. Additionally, Retrieval-Augmented Generation
approaches will be investigated as complementary methods that enhance model capabilities
without parameter modification.
• Resource-Aware Compression Investigation: Model compression techniques will be explored
with consideration of available computational resources. Compression approaches will be
assessed through comparative experiments measuring both task performance preservation and
computational efficiency gains.
• Alternative Architectures: Investigation of alternative architectures to the Transformer
paradigm alongside automated architecture search methodologies to identify more efficient
structural configurations.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Research Progress and Preliminary Findings</title>
        <p>
          Progress in this research has resulted in several publications in conferences in the field:
• Unsupervised Subtitle Segmentation with Masked Language Models [
          <xref ref-type="bibr" rid="ref27">46</xref>
          ]: An unsupervised
approach to subtitle segmentation using pretrained masked language models, predicting line
endings and subtitle breaks based on punctuation likelihood. The method achieves competitive
segmentation accuracy while preserving original text and complying with length constraints,
validating that efficient masked language models could perform specialized text processing
tasks without requiring large-scale generative models or supervised fine-tuning (a sketch of
this punctuation-likelihood scoring follows this list). This work was presented in the
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
(ACL 2023).
• Split and Rephrase with Large Language Models [
          <xref ref-type="bibr" rid="ref28">47</xref>
          ]: An evaluation of large language
models on the task of splitting complex sentences into shorter grammatical ones while preserving
meaning. The study includes prompting variants, domain shift analysis, and comparison of
fine-tuned models with zero-shot and few-shot approaches, showing significant improvements
over previous state-of-the-art with relatively small models and training datasets, while revealing
that sentence splitting remains a challenging task even for large models. This research was
presented in the Proceedings of the 62nd Annual Meeting of the Association for Computational
Linguistics (ACL 2024).
• Vicomtech@WMT 2024: Shared Task on Translation into Low-Resource Languages
of Spain [
          <xref ref-type="bibr" rid="ref29">48</xref>
          ]: Participation in WMT 2024 Shared Task addressing translation into Aragonese,
Aranese, and Asturian. This study notably demonstrated that smaller, specialized models could
compete with LLMs on low-resource translation tasks, while also proposing the application of
LLMs for backtranslation generation to improve training data for smaller models. The results
were presented in the Proceedings of the Ninth Conference on Machine Translation (WMT 2024).
• Automating Easy Read Text Segmentation [49]: An investigation of methods for automating
Easy Read text segmentation, including masked and generative language models and constituent
parsing. The study includes automatic and human evaluations in three languages, analyzing
strengths and weaknesses of proposed alternatives under resource limitations. The study
demonstrated that smaller encoder-only models consistently surpassed the quality of generative
decoder-only models while significantly reducing the risk of hallucinations. This research was presented
in the Findings of the Association for Computational Linguistics: EMNLP 2024.
        </p>
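        <p>As referenced in the first item above, the punctuation-likelihood idea can be loosely sketched as follows; the model checkpoint and scoring details are illustrative assumptions and do not reproduce the exact procedure of [46].</p>
        <preformat><![CDATA[
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative multilingual checkpoint; [46] may use a different model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def break_score(words, position):
    """Score a candidate break after words[position] by the masked-LM
    probability that a mask inserted there is filled with a period."""
    text = (" ".join(words[: position + 1]) + " " + tokenizer.mask_token
            + " " + " ".join(words[position + 1:]))
    inputs = tokenizer(text, return_tensors="pt")
    mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]
    period_id = tokenizer.convert_tokens_to_ids(".")
    return torch.softmax(logits, dim=-1)[period_id].item()
]]></preformat>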
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Research Elements Proposed for Discussion</title>
      <p>The research presented in this thesis raises several questions that would benefit from scholarly discussion
and expert feedback at the symposium:
• Parameter Redundancy and Efficient Architecture Design: While empirical evidence
demonstrates the redundancy of parameters and layers through successful application of pruning and
other compression techniques, challenges remain in training efficient models from scratch
without this redundancy. This raises questions about whether the redundancy is necessary for the
training process itself, or if alternative architectural designs and training methodologies could
yield inherently more efficient models.
• Knowledge Preservation Metrics in Model Compression: Standard metrics for model
compression evaluation such as KL divergence of vocabulary distributions or cosine similarity of
hidden representations may not fully capture a model’s capabilities. Models compressed
following these metrics sometimes demonstrate unexpected performance divergences on
knowledge-intensive tasks. This suggests the need for more comprehensive evaluation metrics that accurately
measure preservation of different types of capabilities during the compression process.
• Fine-tuning Stability of Instruction-Tuned Models: Fine-tuning models tuned for instruction
following has shown to be particularly sensitive, sometimes leading to a decline in generalization
quality. This phenomenon raises questions about optimal adaptation strategies for these aligned
models, including the appropriate volume and diversity of fine-tuning data, suitable learning rates,
and mechanisms to preserve general capabilities while enhancing domain-specific performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This thesis aims to advance the field of Natural Language Processing by investigating optimization,
adaptation, and application strategies for Large Language Models that balance performance with
computational efficiency. By exploring compression techniques, parameter-efficient fine-tuning methods, and
architectural modifications, the research seeks to address growing concerns regarding computational
requirements and the environmental impact of increasingly large models.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author has not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Song,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , S. Ma,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          , et al.,
          <article-title>DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2501.12948</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Guerreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pombal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farajian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Faysse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klimaszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colombo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. G. de Souza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>Eurollm: Multilingual language models for europe</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>255</volume>
          (
          <year>2025</year>
          )
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1877050925006210. doi:10.1016/j.procs.2025.02.260. Proceedings of the Second EuroHPC user day.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pàmies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Llop</surname>
          </string-name>
          , I. Baucells,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Dalt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tamayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Saiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Espuña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aula-Blasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rubio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shvets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sallés</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Lacunza</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Pikabea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Falcão</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tormo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vasquez-Reina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marimon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruíz-Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <source>Salamandra technical report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.08489. arXiv:2502.08489.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Etxaniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Miguel</surname>
          </string-name>
          , I. Aldabe,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rigau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ormazabal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          ,
          <article-title>Latxa: An open language model and evaluation suite for Basque</article-title>
          , in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>14952</fpage>
          -
          <lpage>14972</lpage>
          . URL: https://aclanthology.org/2024.acl-long.799/. doi:10.18653/v1/2024.acl-long.799.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Strubell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Energy and policy considerations for deep learning in NLP</article-title>
          , in: A.
          <string-name>
            <surname>Korhonen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Traum</surname>
          </string-name>
          , L. Màrquez (Eds.),
          <article-title>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>3645</fpage>
          -
          <lpage>3650</lpage>
          . URL: https://aclanthology.org/P19-1355/. doi:10.18653/v1/P19-1355.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Awadalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Awadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Awan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bahree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtiari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          , et al.,
          <article-title>Phi-3 technical report: A highly capable language model locally on your phone</article-title>
          ,
          <source>arXiv preprint arXiv:2404.14219</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Allal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozhkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bakouch</surname>
          </string-name>
          , L. von
          <string-name>
            <surname>Werra</surname>
          </string-name>
          , T. Wolf,
          <source>SmolLM - blazingly fast and remarkably powerful</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>International Conference on Learning Representations (ICLR)</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhandare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sripathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karkada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Saletore</surname>
          </string-name>
          ,
          <article-title>Efficient 8-bit quantization of transformer neural machine language translation model</article-title>
          ,
          <source>arXiv preprint arXiv:1906.00532</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <source>CoRR abs/1503.02531</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1503.02531. arXiv:1503.02531.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sajjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Durrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>On the efect of dropping layers of pre-trained transformer models</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>77</volume>
          (
          <year>2023</year>
          )
          <fpage>101429</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Micikevicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Diamos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ginsburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Houston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kuchaiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venkatesh</surname>
          </string-name>
          , et al.,
          <article-title>Mixed precision training</article-title>
          ,
          <source>arXiv preprint arXiv:1710.03740</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Curriculum learning</article-title>
          ,
          <source>in: Proceedings of the 26th annual international conference on machine learning</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          , G. Ilharco,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping</article-title>
          ,
          <source>arXiv preprint arXiv:2002.06305</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="bib16">
        <mixed-citation>[16] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kaufmann, G. de Rosa, O. Saarikivi, et al., Textbooks are all you need, arXiv preprint arXiv:2306.11644 (2023).</mixed-citation>
      </ref>
      <ref id="bib17">
        <mixed-citation>[17] L. B. Allal, A. Lozhkov, E. Bakouch, G. M. Blázquez, G. Penedo, L. Tunstall, A. Marafioti, H. Kydlíček, A. P. Lajarín, V. Srivastav, et al., SmolLM2: When smol goes big – data-centric training of a small language model, arXiv preprint arXiv:2502.02737 (2025).</mixed-citation>
      </ref>
      <ref id="bib18">
        <mixed-citation>[18] D. Rolnick, A. Veit, S. Belongie, N. Shavit, Deep learning is robust to massive label noise, arXiv preprint arXiv:1705.10694 (2017).</mixed-citation>
      </ref>
      <ref id="bib19">
        <mixed-citation>[19] S. Muralidharan, S. Turuvekere Sreenivas, R. Joshi, M. Chochowski, M. Patwary, M. Shoeybi, B. Catanzaro, J. Kautz, P. Molchanov, Compact language models via pruning and knowledge distillation, Advances in Neural Information Processing Systems 37 (2024) 41076–41102.</mixed-citation>
      </ref>
      <ref id="bib20">
        <mixed-citation>[20] Y. Yang, Z. Cao, H. Zhao, LaCo: Large language model pruning via layer collapse, arXiv preprint arXiv:2402.11187 (2024).</mixed-citation>
      </ref>
      <ref id="bib21">
        <mixed-citation>[21] S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma, Linformer: Self-attention with linear complexity, arXiv preprint arXiv:2006.04768 (2020).</mixed-citation>
      </ref>
      <ref id="bib22">
        <mixed-citation>[22] Y. Gu, L. Dong, F. Wei, M. Huang, MiniLLM: Knowledge distillation of large language models, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=5h0qf7IBZZ.</mixed-citation>
      </ref>
      <ref id="bib23">
        <mixed-citation>[23] G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, et al., Gemma 2: Improving open language models at a practical size, arXiv preprint arXiv:2408.00118 (2024).</mixed-citation>
      </ref>
      <ref id="bib24">
        <mixed-citation>[24] M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, A. F. Aji, LaMini-LM: A diverse herd of distilled models from large-scale instructions, arXiv preprint arXiv:2304.14402 (2023).</mixed-citation>
      </ref>
      <ref id="bib25">
        <mixed-citation>[25] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al., Qwen2.5 technical report, arXiv preprint arXiv:2412.15115 (2024).</mixed-citation>
      </ref>
      <ref id="bib26">
        <mixed-citation>[26] M. Bellagente, J. Tow, D. Mahan, D. Phung, M. Zhuravinskyi, R. Adithyan, J. Baicoianu, B. Brooks, N. Cooper, A. Datta, et al., Stable LM 2 1.6B technical report, arXiv preprint arXiv:2402.17834 (2024).</mixed-citation>
      </ref>
      <ref id="bib27">
        <mixed-citation>[27] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter-efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021).</mixed-citation>
      </ref>
      <ref id="bib28">
        <mixed-citation>[28] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, arXiv preprint arXiv:2101.00190 (2021).</mixed-citation>
      </ref>
      <ref id="bib29">
        <mixed-citation>[29] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, Advances in Neural Information Processing Systems 36 (2023) 10088–10115.</mixed-citation>
      </ref>
      <ref id="bib30">
        <mixed-citation>[30] Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, T. Zhao, AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning, arXiv preprint arXiv:2303.10512 (2023).</mixed-citation>
      </ref>
      <ref id="bib31">
        <mixed-citation>[31] K. Lv, Y. Yang, T. Liu, Q. Guo, X. Qiu, Full parameter fine-tuning for large language models with limited resources, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 8187–8198. URL: https://aclanthology.org/2024.acl-long.445/. doi:10.18653/v1/2024.acl-long.445.</mixed-citation>
      </ref>
      <ref id="bib32">
        <mixed-citation>[32] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.</mixed-citation>
      </ref>
      <ref id="bib33">
        <mixed-citation>[33] F. Dalvi, N. Durrani, H. Sajjad, Y. Belinkov, A. Bau, J. Glass, What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 6309–6317.</mixed-citation>
      </ref>
      <ref id="bib34">
        <mixed-citation>[34] A. Gromov, K. Tirumala, H. Shapourian, P. Glorioso, D. Roberts, The unreasonable ineffectiveness of the deeper layers, in: The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=ngmEcEer8a.</mixed-citation>
      </ref>
      <ref id="bib49">
        <mixed-citation>[49] J. Calleja, T. Etchegoyhen, D. Ponce, Automating easy read text segmentation, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 11876–11894. URL: https://aclanthology.org/2024.findings-emnlp.694/. doi:10.18653/v1/2024.findings-emnlp.694.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>It's not just size that matters: Small language models are also few-shot learners</article-title>
          , in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hakkani-Tur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>2339</fpage>
          -
          <lpage>2352</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.185/. doi:10.18653/v1/2021.naacl-main.185.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [36]
          <string-name><given-names>P.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Zeng</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Lu</surname></string-name>,
          <article-title>TinyLlama: An open-source small language model</article-title>,
          <source>arXiv preprint arXiv:2401.02385</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          , et al.,
          <article-title>MobileVLM: A fast, strong and open vision language assistant for mobile devices</article-title>
          ,
          <source>arXiv preprint arXiv:2312.16886</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [38]
          <string-name><given-names>K. M.</given-names> <surname>Choromanski</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Likhosherstov</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Dohan</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Song</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gane</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Sarlos</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Hawkins</surname></string-name>,
          <string-name><given-names>J. Q.</given-names> <surname>Davis</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mohiuddin</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>D. B.</given-names> <surname>Belanger</surname></string-name>,
          <string-name><given-names>L. J.</given-names> <surname>Colwell</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Weller</surname></string-name>,
          <article-title>Rethinking attention with performers</article-title>,
          in: <source>International Conference on Learning Representations</source>,
          <year>2021</year>. URL: https://openreview.net/forum?id=Ua6zuk0WRH.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Katharopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pappas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fleuret</surname>
          </string-name>
          ,
          <article-title>Transformers are RNNs: Fast autoregressive transformers with linear attention</article-title>,
          in: <source>International Conference on Machine Learning</source>, PMLR
          ,
          <year>2020</year>
          , pp.
          <fpage>5156</fpage>
          -
          <lpage>5165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [40]
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <article-title>Fast transformer decoding: One write-head is all you need</article-title>,
          <source>arXiv preprint arXiv:1911.02150</source> (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [41]
          <string-name><given-names>J.</given-names> <surname>Ainslie</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lee-Thorp</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>de Jong</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zemlyanskiy</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Lebron</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Sanghai</surname></string-name>,
          <article-title>GQA: Training generalized multi-query transformer models from multi-head checkpoints</article-title>,
          in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>,
          Association for Computational Linguistics, Singapore,
          <year>2023</year>, pp. <fpage>4895</fpage>-<lpage>4901</lpage>.
          URL: https://aclanthology.org/2023.emnlp-main.298/. doi: 10.18653/v1/2023.emnlp-main.298.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [42]
          <string-name><given-names>T.</given-names> <surname>Dao</surname></string-name>,
          <string-name><given-names>D. Y.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Ermon</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rudra</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Re</surname></string-name>,
          <article-title>FlashAttention: Fast and memory-efficient exact attention with IO-awareness</article-title>,
          in: A. H. Oh, A. Agarwal, D. Belgrave, K. Cho (Eds.),
          <source>Advances in Neural Information Processing Systems</source>,
          <year>2022</year>. URL: https://openreview.net/forum?id=H4DqfPSibmx.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [43]
          <string-name><given-names>B.</given-names> <surname>Peng</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Alcaide</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Anthony</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Albalak</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Arcadinho</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Biderman</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Cao</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Cheng</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chung</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Derczynski</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Du</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Grella</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Gv</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hou</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Kazienko</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kocon</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kong</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Koptyra</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lau</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>K. S. I.</given-names> <surname>Mantri</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Mom</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Saito</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Song</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wind</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Woźniak</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>R.-J.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>RWKV: Reinventing RNNs for the transformer era</article-title>,
          in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>,
          Association for Computational Linguistics, Singapore,
          <year>2023</year>, pp. <fpage>14048</fpage>-<lpage>14077</lpage>.
          URL: https://aclanthology.org/2023.findings-emnlp.936/. doi: 10.18653/v1/2023.findings-emnlp.936.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [44]
          <string-name><given-names>A.</given-names> <surname>Gu</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Dao</surname></string-name>,
          <article-title>Mamba: Linear-time sequence modeling with selective state spaces</article-title>,
          in: <source>First Conference on Language Modeling</source>,
          <year>2024</year>. URL: https://openreview.net/forum?id=tEYskw1VY2.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Large language diffusion models</article-title>
          ,
          <source>arXiv preprint arXiv:2502.09992</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [46]
          <string-name><given-names>D.</given-names> <surname>Ponce</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Etchegoyhen</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Ruiz</surname></string-name>,
          <article-title>Unsupervised subtitle segmentation with masked language models</article-title>,
          in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>,
          Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>, pp. <fpage>771</fpage>-<lpage>781</lpage>.
          URL: https://aclanthology.org/2023.acl-short.67/. doi: 10.18653/v1/2023.acl-short.67.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [47]
          <string-name><given-names>D.</given-names> <surname>Ponce</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Etchegoyhen</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Calleja</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gete</surname></string-name>,
          <article-title>Split and rephrase with large language models</article-title>,
          in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>,
          Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>, pp. <fpage>11588</fpage>-<lpage>11607</lpage>.
          URL: https://aclanthology.org/2024.acl-long.622/. doi: 10.18653/v1/2024.acl-long.622.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [48]
          <string-name><given-names>D.</given-names> <surname>Ponce</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gete</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Etchegoyhen</surname></string-name>,
          <article-title>Vicomtech@WMT 2024: Shared task on translation into low-resource languages of Spain</article-title>,
          in: B. Haddow, T. Kocmi, P. Koehn, C. Monz (Eds.),
          <source>Proceedings of the Ninth Conference on Machine Translation</source>,
          <year>2024</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>