<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Tough Hoe to Row: Instruction Fine-Tuning LLaMA 3.2 for Multilingual Idiom Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debora Ciminari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Bologna</institution>
          ,
          <addr-line>Corso della Repubblica, 136, 47121, Forlì</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Idiomatic expressions (IEs) are a core part of language but exhibit considerable complexity and heterogeneity, posing significant challenges to natural language processing (NLP). Effective automatic idiom processing could enhance our understanding of language and could benefit downstream tasks such as machine translation. However, previous research fails to adopt a comprehensive approach and struggles to consider languages different from English and the rich variety of idiom types. We thus aim to develop a version of LLaMA 3.2 that is instruction fine-tuned on data in three languages - English, Italian, and Portuguese - and covering a wide range of IE types. Specifically, we build on already annotated corpora to create our instruction-formatted dataset, and we employ instruction fine-tuning on two tasks - sentence disambiguation and idiom identification. We then investigate the effectiveness of this approach and assess the impact of the instruction language on the model's performance. We release a multilingual instruction-formatted dataset for automatic idiom processing. Additionally, we show that fine-tuning might help the model disambiguate between literal and idiomatic sentences, while gains in idiom identification are limited and require further investigation. The F1-measure also suggests that the choice of the instruction language significantly affects the results.</p>
      </abstract>
      <kwd-group>
        <kwd>idiomatic expressions</kwd>
        <kwd>multilinguality</kwd>
        <kwd>sentence disambiguation</kwd>
        <kwd>idiom identification</kwd>
        <kwd>instruction fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Such complexity makes it challenging to deal with</title>
        <p>IEs in the field of natural language processing (NLP).</p>
        <p>
          Idiomatic expressions (IEs) are a prominent component Given the pervasive presence of IEs in language,
efecof language and constitute a broad and heterogenous tive idiom processing is needed to gain a deeper and
category. The canonical definition describes IEs as ex- more comprehensive understanding of language.
Idiompressions whose meaning cannot be derived from the aware NLP can benefit downstream tasks, such as text
meanings of their subparts [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ]. The typical example summarisation, sentiment analysis, question answering,
is to kick the bucket, whose meaning ‘to die’ can- and machine translation [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ].
not be inferred from ‘kick’, ‘the’, or ‘bucket’. However, Most NLP applications focus on English, leaving
mulsome cases do not fit this definition. For instance, the tilingual idiom processing largely unexplored. Recent
meaning of to pull the strings (‘to use influence studies adopt encoder-based models [
          <xref ref-type="bibr" rid="ref5 ref7">7, 8, 5</xref>
          ], while
studor connections’) does bear a sort of (metaphorical) rela- ies on decoder-based ones remain relatively sparse.
Antion to its components. Another category of IEs can be other issue related to previous research is the models’
identified, i.e. potentially idiomatic expressions or PIEs [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], lack of a robust generalisation and its poor performance
which are expressions that can have a literal or an id- on unseen idioms [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ].
iomatic meaning, depending on the context. That is the To fill these gaps, we develop an instruction fine-tuned
case of the first idiom presented as example, to kick version of LLaMA 3.2 1B in three languages, English,
Italthe bucket, which can also take a literal meaning, as ian, and Portuguese, and on two tasks, sentence
disamin She got frustrated and kicked the bucket biguation and idiom identification:
of paint across the garage.
        </p>
        <p>
          In light of this diversity, the traditional definition has Task 1: Sentence Disambiguation. Framed as a
bibeen challenged in favour of a more complex, multi- nary text classification task, it aims at
discrimifaceted view that emphasises the heterogeneous nature of nating idiomatic from literal sentences.
idiomaticity, conceived of as a continuum where
expressions can be placed depending on multiple factors [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Task 2: Idiom Identification. Framed as a span
labelling task, the model must identify the sequence
of characters that correspond to an IE.
specific span constituting the IE. Figure 1 shows some 2. Related Work
examples of idiomatic and literal sentences.</p>
        <p>Given this interdependence, our data is designed to The need to develop ad hoc techniques for the automatic
address both tasks simultaneously. For instance, in the processing of idioms is widely acknowledged to acquire
ifrst example, the model’s answer is expected to be kiss a better understanding of language [12, 13, 14]. Multiple
of death, showing that the model correctly identified natural language understanding (NLU) tasks face
chalthe sentence as idiomatic and proceeded to detect the lenges related to IEs, despite the use of state-of-the-art
span where the idiom occurs. (SOTA) solutions. Among these tasks are sentiment
anal</p>
      <p>Starting from annotated corpora, we design our instruction-formatted data, comprising an instruction (the task description), the input (the sentence), and the expected output [9]. Additionally, our dataset is multilingual in that it comprises inputs in all three languages. What differs is the instruction language, for which three subsets are created. We then fine-tune LLaMA 3.2 1B on a subset of our corpus and carry out evaluation based on the F1-measure. We thus examine the effectiveness of instruction fine-tuning. Besides, we investigate the impact of the instruction language in scenarios where the instruction language and the input language are the same and scenarios where they differ. To date, such an impact remains largely unexplored.1</p>
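      <p>For illustration, a single instruction-formatted record can be pictured as follows. This is a minimal sketch: the field names (input, instruction, output) mirror the components described above, and the values are taken from the examples in Table 1; the released dataset is the authoritative format.</p>
      <preformat>
# A minimal sketch of one instruction-formatted record (hypothetical field
# names; see github.com/TinfFoil/MultIdiomLlama for the released format).
record = {
    # The input sentence, here in English.
    "input": "Although the encounter was bathed in sunshine, "
             "the match failed to reach boiling point.",
    # The task description; three language-specific subsets exist (en/it/pt).
    "instruction": "Can you spot the idiomatic expressions lurking "
                   "within this sentence? They are:",
    # The expected model answer: the idiomatic span, or a 'none' marker.
    "output": "boiling point",
}
      </preformat>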
      <p>1 The dataset and the implementation are both available at https://github.com/TinfFoil/MultIdiomLlama</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>The need to develop ad hoc techniques for the automatic processing of idioms is widely acknowledged to acquire a better understanding of language [12, 13, 14]. Multiple natural language understanding (NLU) tasks face challenges related to IEs, despite the use of state-of-the-art (SOTA) solutions. Among these tasks are sentiment analysis [15], paraphrase generation [16], natural language inference [17], dialog models [18], and machine translation [19, 20].</p>
      <p>Recent approaches employ encoder-based models, like BERT [21], and leverage their contextual language embeddings. Studies have found that this type of model struggles with non-compositionality and has difficulty in disambiguating between literal and idiomatic meanings [<xref ref-type="bibr" rid="ref7">22, 7, 23</xref>]. Yu and Ettinger [<xref ref-type="bibr" rid="ref7">7</xref>] explore the ability of encoder-based models to handle semantic compositionality. In particular, they use five models, such as BERT and some of its variants, to examine to what extent these can represent words in isolation and in phrases. They reach the conclusion that these models grasp the meaning of individual words but struggle to capture composed meaning. Zeng and Bhat [8], instead, propose the iDentifier of Idiomatic expressions via Semantic Compatibility (DISC) to perform extraction and identification of PIEs. Their framework leverages BERT to harness both the semantic and the syntactic properties of PIEs, and extracts and identifies all the expressions from a corpus. Results show that their model is able to outperform SOTA baselines, even in zero-shot settings, but it exhibits poor cross-domain performance. In addition, while including a notable array of idiom types, it focuses on English data only.</p>
      <p>Some approaches take steps to include multilinguality. Tayyar Madabushi et al. [<xref ref-type="bibr" rid="ref5">5</xref>] release AStitchInLanguageModels, a dataset in English and Portuguese, and expand it with Galician data for the SemEval-2022 Task 2 [24]. Working on the idiomaticity detection task, they employ models like BERT and XLNet [25] and conclude that models do not benefit from the inclusion of the context and that the zero-shot setting still produces poor results. This corpus represents the first significant attempt to include multilinguality in automatic idiom processing and provides baselines for languages other than English. The dataset is, however, limited in that it only contains noun compounds, thus lacking diversity and failing to incorporate other types, such as verb and prepositional phrases. Another attempt at multilingual idiom processing is Tedeschi et al. [<xref ref-type="bibr" rid="ref6">6</xref>]’s ID10M. They develop a framework of systems and training and validation data for the idiom identification task in 10 languages. Their findings confirm the distinction between zero-shot and few-shot performance.</p>
      <p>Sentsova et al. [26] release the Multilingual Corpus of Potentially Idiomatic Expressions (MultiCoPIE) in Russian, Italian, and Catalan, which includes additional linguistic features, such as semantic compositionality, head part-of-speech, and English equivalents. By fine-tuning XLM-RoBERTa, they explore cross-lingual transfer, which might benefit lower-resourced languages. Moreover, the inclusion of idioms having an English equivalent in the training set has proved helpful in disambiguating between literal and idiomatic usages.</p>
      <p>Encoder-decoder models have also been used for the development of idiom-aware systems. Zeng and Bhat [27] opt for the BART [28] sequence-to-sequence (seq2seq) model. Their Generation of Idiom Embedding with Adapter (GIEA) model exhibits an improved ability at representing idiomaticity, but it is limited to English and does not show an enhanced generalisation capability.</p>
      <p>Other studies have examined the performance of large language models (LLMs) [29, 11], finding that they fail to handle idiomaticity and that they tend to be outperformed by other transformer-based models.</p>
      <p>Previous work thus falls short of capturing the complexity associated with IEs on multiple levels. On the one hand, studies have mostly focused on English, leaving other languages aside. On the other hand, they have failed to cover a wide enough variety of idiom types. Furthermore, studies agree on the limited ability of different models to handle and process unseen idioms.</p>
    </sec>
    <sec id="sec-data">
      <title>3. Instruction Data Creation</title>
      <p>Source Datasets. We start from three datasets to build our instruction-formatted data in English, Italian, and Portuguese: AStitchInLanguageModels [<xref ref-type="bibr" rid="ref5">5</xref>], ID10M [<xref ref-type="bibr" rid="ref6">6</xref>], and MultiCoPIE [26].</p>
      <p>AStitchInLanguageModels is a dataset of idiomatic multi-word expression (MWE) usage in English and Portuguese. It comprises examples containing PIEs in the form of noun compounds, annotated according to two different schemes. In the first one, sentences are labelled as having an idiomatic or a literal meaning. The second one is more fine-grained in that it provides a paraphrase of the MWE’s meaning and labels each example into one of five categories: literal, idiomatic, non-idiomatic, proper noun, or meta usage. We use data labelled with the first annotation scheme for the zero-shot scenario, with no overlap of PIEs between the training and the test sets.</p>
      <p>ID10M is a framework that introduces a multilingual Transformer-based architecture for sentence disambiguation and idiom identification and provides annotated datasets in multiple languages. It includes gold-standard data in English, German, Italian, and Spanish, and silver-standard data automatically annotated in 10 languages: Chinese, Dutch, English, French, German, Italian, Japanese, Polish, Portuguese, and Spanish. A list of MWEs is compiled from the Wiktionary,2 and sentences containing MWEs are collected from WikiMatrix [30],3 a multilingual corpus in 83 languages with parallel sentences retrieved from Wikipedia. The gold-standard data are curated by native professional annotators, while the silver-standard data are annotated based on the Wiktionary entry of the MWEs: when the MWE is marked as idiomatic, all occurrences of the MWE are labelled as idiomatic, and vice versa. Since these annotations do not necessarily reflect the actual MWE usage in context, Tedeschi et al. develop a dual-encoder architecture to refine the silver-standard data. They also incorporate a BIO tagging scheme [31] to identify the tokens belonging to the MWE, where B indicates the first token of a span, I signals the intermediate token(s), and O designates the tokens out of any span.</p>
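      <p>To make the scheme concrete, the sketch below shows how a sentence containing the PIE kick the bucket would be tagged under BIO; the token-tag pairs are illustrative, not drawn from the ID10M release.</p>
      <preformat>
# BIO tagging of an idiomatic span (illustrative example).
# B = first token of the span, I = inside the span, O = outside any span.
tagged = [
    ("She", "O"), ("kicked", "B"), ("the", "I"), ("bucket", "I"),
    ("last", "O"), ("year", "O"), (".", "O"),
]
      </preformat>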
      <p>MultiCoPIE is a dataset annotated for sentence disambiguation and idiom identification in Russian, Italian, and Catalan. To build this dataset, a list of PIEs is compiled for each language from online resources, such as the Dizionario italiano De Mauro,4 the Russian Wiktionary,5 and the Diccionari català-anglès/anglès-català de locucions i frases fetes.6 PIEs with varying characteristics are included, specifically, PIEs with different parts of speech as heads. For example, appeso a un filo (‘hung by a thread’) has the adjective appeso (‘hung’) as head, while con l’acqua alla gola (literally ‘with water up to the throat’, meaning ‘to be in serious difficulty’) is headed by the preposition con (‘with’). The dataset also covers PIEs with diverse degrees of semantic compositionality. PIEs with a higher level of compositionality comprise at least one cue to the meaning of the expression. An example is ammazzare il tempo (‘to kill time’), where the word tempo (‘time’) helps interpreting the expression as ‘to spend time trying not to get bored’. On the other hand, essere al settimo cielo (‘to be on cloud nine’) is more opaque since it does not comprise any hints about the meaning ‘to be at the peak of happiness’. After selecting the PIEs, sentences are automatically extracted from the Open Super-large Crawled Aggregated coRpus (OSCAR)7 [32], a multilingual corpus generated from Common Crawl,8 and refined through manual selection. The two surrounding sentences are included to provide context. Opening and closing tags are also employed to locate the lexicalised components of PIEs. The tags are used to identify all PIEs present in the target sentence and the preceding and following sentences.</p>
      <p>Table 1. Examples from the instruction dataset with the output produced given an instruction and an input in different language combinations. English glosses are given in parentheses; ellipses mark text that could not be recovered from the extraction.</p>
      <preformat>
Input (en):  Although the encounter was bathed in sunshine, the match
             failed to reach boiling point.
Instr. (en): Can you spot the idiomatic expressions lurking within this
             sentence? They are:
Output:      boiling point

Input (pt):  Nos últimos anos, muitas universidades têm mostrado …
             manobras aéreas.
             (In recent years, many universities have demonstrated …
             aerial manoeuvres.)
Instr. (it): Un’analisi della frase rivela la presenza delle seguenti
             costruzioni idiomatiche:
             (An analysis of the sentence reveals the presence of the
             following idiomatic constructions:)
Output:      Nessuna. (None.)

Input (en):  After the day I had today, I feel like I could walk on water.
Instr. (pt): A frase contém as seguintes expressões idiomáticas:
             (The sentence contains the following idiomatic expressions:)
Output:      walk on water
      </preformat>
      <p>Creation of the Instruction Templates. To create a dataset of instruction-formatted instances, we design instructions in English, Italian, and Portuguese. We first translate a seed instruction written in English into Italian and Portuguese using LLaMA 3.2 3B9 via ollama.10 With the same model, we generate three paraphrased versions of the instructions. We design the prompts to produce different writing styles and perspectives, ensuring a varied dataset and a high linguistic diversity. These instructions are then organised in empty templates. The starting point to construct such templates is the work by Taori et al. [33], who fine-tune LLaMA 7B on instruction-formatted demonstrations. They design a template in English to create the instruction-formatted examples and carry out the fine-tuning. We translate their template into Italian and Portuguese. The ‘prompt no input’ option is discarded since all our samples include an input sentence. Finally, we change the structure of the template. While the Alpaca template11 organises the instruction in “Instruction”, “Input”, and “Response”, we modify the order so that the input is presented first, followed by the instruction and the response, since this order better fits the language modeling underlying LLMs. It meshes well with the left-to-right autoregressive nature of LLaMA: as shown in Table 1, the instruction leaves an empty slot at the end, where the model’s response is expected. Finally, the ‘input’ and ‘output’ keys are left empty to be filled in the following step. A sketch of this reordered template is shown below.</p>
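      <p>The following is a minimal sketch of such a reordered template; the exact wording and markers of the released templates may differ, so the strings here are assumptions based on the description above.</p>
      <preformat>
# Hypothetical reordered Alpaca-style template: input first, then the
# instruction, then an empty response slot the model is expected to fill.
TEMPLATE = (
    "### Input:\n{input}\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_prompt(input_sentence: str, instruction: str) -> str:
    """Render one training/inference prompt from a record."""
    return TEMPLATE.format(input=input_sentence, instruction=instruction)
      </preformat>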
      <p>Creation of the Final Dataset. We then proceed with the creation of the final dataset. We extract IEs and examples from the aforementioned datasets. For English and Portuguese, we use ID10M and AStitchInLanguageModels, while, for Italian, we employ ID10M and MultiCoPIE. The processing of AStitchInLanguageModels mainly focuses on extracting the actual MWEs present in the sentences, since it includes the dictionary form. For ID10M, we process the data by reconstructing full sentences and identifying idiomatic spans. We then create a training and test split combining data from both ID10M and AStitchInLanguageModels, while ensuring that no PIEs in the test set overlap with those in the training set; a sketch of such a PIE-disjoint split follows.</p>
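      <p>A minimal sketch of a PIE-disjoint split; grouping records by the idiom’s dictionary form is an assumption about how the constraint can be enforced, not a description of the released code.</p>
      <preformat>
import random

def pie_disjoint_split(records, test_ratio=0.2, seed=13):
    """Split records so that no PIE occurs in both train and test.

    Each record is a dict with a 'pie' key holding the idiom's
    dictionary form (hypothetical schema).
    """
    pies = sorted({r["pie"] for r in records})
    random.Random(seed).shuffle(pies)
    test_pies = set(pies[: int(len(pies) * test_ratio)])
    train = [r for r in records if r["pie"] not in test_pies]
    test = [r for r in records if r["pie"] in test_pies]
    return train, test
      </preformat>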
      <p>2 https://pypi.org/project/wiktextract/
3 https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix
4 https://dizionario.internazionale.it/
5 https://ru.wiktionary.org/wiki/
6 https://visca.com/apac/dites/
7 https://huggingface.co/oscar-corpus
8 https://commoncrawl.org/
9 https://huggingface.co/meta-llama/Llama-3.2-3B
10 https://ollama.com/</p>
      </sec>
      <sec id="sec-1-3">
        <title>The label assignment can be represented as follows:</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Settings</title>
      <sec id="sec-2-1">
        <title>Evaluation Framework. To account for both tasks,</title>
        <p>we propose a two-fold evaluation methodology, which
allows for a comprehensive understanding of the model’s
ability to handle both the classification and the
identification challenges.</p>
        <p>We design an evaluation framework to assess the model’s performance on the sentence disambiguation and idiom identification tasks across various language combinations. For Task 1 we develop a labelling mechanism that considers multiple linguistic markers. Such markers are used for both ground truths and predictions to determine the label (0 or 1) to assign to each example. These keywords are language-specific and are:
• Portuguese: ‘nenhuma’, ‘não’, ‘ausente’;
• Italian: ‘nessuna’, ‘non’;
• English: ‘none’, ‘no idiom’, ‘not contain’, ‘not’.
The label assignment can be represented as follows, where $y$ is an answer (ground truth or prediction) and $K$ is the language-specific keyword set:
\[ \text{label}(y) = \begin{cases} 0, \text{ if } \exists\, k \in K : k \in y \\ 1, \text{ otherwise} \end{cases} \tag{1} \]</p>
        <p>For Task 2, precision and recall are computed at the character level over the predicted and gold idiom spans:
\[ P = \frac{1}{|D|} \sum_{x \in D} \; \sum_{s \in S_x,\, g \in G_x} \frac{|s \cap g|}{|s|} \tag{2} \]
\[ R = \frac{1}{|D|} \sum_{x \in D} \; \sum_{s \in S_x,\, g \in G_x} \frac{|s \cap g|}{|g|} \tag{3} \]
where $s$ is a predicted span, $g$ is a ground-truth span, $S_x$ is the set of predicted spans, $G_x$ is the set of gold-standard spans, $x$ is a sample, and $D$ represents the whole dataset. The F1-measure is then computed as the harmonic mean of precision and recall.</p>
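        <p>A minimal sketch of this two-fold evaluation, assuming plain-string model outputs and character-offset spans; function and variable names are ours, not those of the released implementation.</p>
        <preformat>
def task1_label(answer: str, keywords: list[str]) -> int:
    """Assign 0 (literal / no idiom) if a language-specific marker
    occurs in the answer, else 1 (idiomatic), as in Eq. (1)."""
    text = answer.lower()
    return 0 if any(k in text for k in keywords) else 1

def task2_f1(pred_spans, gold_spans, n_samples):
    """Character-level span P/R/F1 over a dataset, as in Eqs. (2)-(3).

    pred_spans / gold_spans: per-sample lists of (start, end) offsets.
    """
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    p = r = 0.0
    for preds, golds in zip(pred_spans, gold_spans):
        for s in preds:
            for g in golds:
                p += overlap(s, g) / (s[1] - s[0])
                r += overlap(s, g) / (g[1] - g[0])
    p, r = p / n_samples, r / n_samples
    return 2 * p * r / (p + r) if p + r else 0.0
        </preformat>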
      </sec>
      <sec id="sec-2-2">
        <title>Settings. The instruction fine-tuning is implemented</title>
        <p>on a subset of our dataset. This subset comprises 18,397
samples and retains the balance of the instruction dataset.</p>
        <p>To optimise the fine-tuning, QLoRA [ 35] is also employed
to reduce computational cost and memory usage.</p>
        <p>For the instruction fine-tuning, a set of default hyperparameters is configured to fine-tune the LLaMA 3.2 1B model on the sentence disambiguation and idiom identification tasks. The model is trained with a batch size of 32 across 2 epochs, using a cutoff length of 128 tokens for input sequences. For parameter-efficient fine-tuning, LoRA [<xref ref-type="bibr" rid="ref8">36</xref>] is employed with a rank (r) of 8, an alpha of 16, and a dropout rate of 0.05, specifically targeting the query and key projection matrices. LoRA makes it possible to update only 851,968 out of more than 1 billion parameters. The optimisation process uses 4-bit quantization in NF4 format to reduce memory requirements. The learning process is managed with a learning rate of 3e-4, a weight decay of 0.01, and a warmup ratio of 0.1, using the Paged AdamW 32-bit optimizer and a cosine learning rate schedule with restarts. Gradient accumulation is set to 2 steps with a maximum gradient norm of 1.0, and gradient checkpointing is enabled to optimise memory usage. The training uses mixed-precision computation (FP16) and employs early stopping. A configuration equivalent to these settings is sketched below.</p>
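        <p>The settings above map onto the Hugging Face transformers/peft/bitsandbytes stack roughly as follows. This is a minimal sketch under the assumption that the effective batch size of 32 is obtained as 16 per device times 2 accumulation steps; it is not the released training script.</p>
        <preformat>
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization (QLoRA) to cut memory usage.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)

# LoRA on the query/key projections: r=8, alpha=16, dropout=0.05.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj"], task_type="CAUSAL_LM",
))

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=2,
    per_device_train_batch_size=16,   # assumption: 16 x 2 accumulation = 32
    gradient_accumulation_steps=2,
    learning_rate=3e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    max_grad_norm=1.0,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine_with_restarts",
    gradient_checkpointing=True,
    fp16=True,
)
        </preformat>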
      </sec>
    </sec>
    <sec id="sec-results">
      <title>5. Results and Discussion</title>
      <p>Sentence Disambiguation Task. Table 3 shows the F1 scores for Task 1, averaged over 3 runs, for all combinations of instruction and input language, before and after the fine-tuning. When comparing our model against the baseline model without fine-tuning, we can see that the best results are achieved after the instruction fine-tuning: the performance gains more than 2 points across all combinations, with the Portuguese monolingual pair increasing by almost 3 points. These findings suggest that the approach we adopted consistently enhances the model’s performance, regardless of the instruction-input language combination. Turning to the impact of the instruction language, the baseline results indicate that English inputs tend to prefer English instructions. On the other hand, there seems to exist some sort of interplay between Italian and Portuguese: a slight improvement is produced when Italian data are associated with Portuguese instructions and vice versa. Conversely, our results show that the model yields better F1 scores when prompted with instructions in English across all language combinations. This suggests that the fine-tuning leads the model to prefer instructions written in English when disambiguating between literal and idiomatic sentences.</p>
      <p>Idiom Identification Task. Table 4 shows the F1 scores for Task 2, averaged over 3 runs, before and after instruction fine-tuning. We can see that, in general, the model exhibits poor performance and struggles to identify the idiom contained in the input sentence. In the idiom identification task, the improvements produced by the instruction fine-tuning are mostly lower or non-existent. The English inputs tend to benefit more from this approach, gaining 2 points with almost all languages. Conversely, the model seems to struggle on Italian data, and, when associated with Italian and English instructions, it suffers from the fine-tuning, losing 1 point. When dealing with Portuguese sentences, instead, the model produces slightly improved results. Instruction fine-tuning, therefore, does not significantly and consistently help the model in identifying idioms. However, we should consider that Task 2 is much more challenging in that it consists in the identification of the idiom contained in a given sentence, at the character level. As for the instruction language, unlike in Task 1, the instruction fine-tuning does not lead the model to favour English. Instead, Portuguese instructions seem to better help the model in detecting the idiom.</p>
      <p>Interactions between Instruction and Input Language. The results reported above provide insights into the interactions between instruction and input language. For Task 1, English instructions seem to aid the model in distinguishing between idiomatic and literal sentences. Sentence disambiguation represents a simpler task that requires a global understanding of the input sentence. English, on which the model is mostly pre-trained [<xref ref-type="bibr" rid="ref9">37</xref>], might better allow LLaMA 3.2 to comprehend the task to carry out. Idiom identification, instead, is a much more complicated task requiring the model to have a deeper and more precise comprehension, not only at the sentence level, but also at the phrase level. This entails a finer knowledge of the input language as well.</p>
      <p>Besides, when the instruction and the input language differ, the model is prompted in one language and asked to answer in another, which creates an additional layer of complexity. Different types of interactions between instruction and input language thus emerge, and future research is needed to investigate such interactions based on the languages involved and the task under study.</p>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusions</title>
      <p>In this paper, we developed a version of LLaMA 3.2 1B fine-tuned on two tasks: sentence disambiguation and idiom identification. We adopted a multilingual approach in that we considered three languages, English, Italian, and Portuguese, and we employed instruction fine-tuning. To carry out the fine-tuning, we first constructed a multilingual dataset consisting of instruction-formatted data designed for idiomatic expressions (IEs). We examined the two tasks in a multilingual setting involving the abovementioned languages, which were used as both instruction and input languages, covering all possible combinations. This fine-tuning provided some valuable insights.</p>
      <p>For the sentence disambiguation task, our instruction-based approach yielded better F1 scores, compared to the baseline results, which suggests that it aids the model in distinguishing between idiomatic and literal meanings. Nevertheless, after the fine-tuning, the models seemed to favour English instructions across all input languages. This might indicate that we can achieve satisfactory results prompting models with English instructions [10], and that we can limit instruction engineering to only one language [38]. On the other hand, this can be disadvantageous for other languages, potentially reducing model performance and usability in multilingual contexts.</p>
      <p>For the idiom identification task, the model struggled to correctly identify the idiom included in the sentence, both before and after the fine-tuning. Our instruction-based approach did not necessarily lead to a significantly improved performance, and, in some cases, it produced lower F1 scores. Unlike Task 1, Task 2 represents a far more challenging task consisting in detecting IEs at the character level, which might explain such a poor performance. Besides, the model did not exhibit a consistent preference for one language and produced mixed results.</p>
      <p>Instruction fine-tuning might be beneficial for Task 1
but not necessarily for Task 2, and the instruction
language plays a crucial role in the model’s performance.</p>
      <p>However, further research is needed. From a methodological perspective, we used a relatively small model, and experiments with larger ones can be conducted. Other LLMs beyond LLaMA could be fine-tuned as well, not only to assess their performance but also to compare encoder-based and encoder-decoder models on the same IE-related tasks. We did not implement hyperparameter tuning and limited the fine-tuning to a small subset. Future research could explore optimised hyperparameters to improve performance, as well as use a larger dataset. Our study was also limited to three languages, and the scope could be expanded to others, even from different families, to gain a deeper understanding of cross-linguistic interactions. Finally, a promising direction would be the creation of datasets annotating idiomaticity on a continuum rather than as a binary distinction, aligning with more recent linguistic theories.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fraser</surname>
          </string-name>
          ,
          <article-title>Idioms within a Transformational Grammar, Foundations of Language 6 (</article-title>
          <year>1970</year>
          )
          <fpage>22</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Chomsky</surname>
          </string-name>
          , Rules and Representations,
          <source>Behavioral and Brain Sciences</source>
          <volume>3</volume>
          (
          <year>1980</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . doi:
          <volume>10</volume>
          . 1017/s0140525x00001515.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Haagsma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bos</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Nissim, MAGPIE: A large corpus of potentially idiomatic expressions</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>287</lpage>
          . URL: https:// aclanthology.org/
          <year>2020</year>
          .lrec-
          <volume>1</volume>
          .35/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wulf</surname>
          </string-name>
          , Rethinking Idiomaticity:
          <article-title>A Usage-based Approach</article-title>
          , Research in Corpus and Discourse, Continuum, London and New York,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tayyar Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gow-Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio, AStitchInLanguageModels: Dataset and
          <article-title>Methods for the Exploration of Idiomaticity in Pre-Trained Language Models</article-title>
          , in: M.
          <article-title>-</article-title>
          <string-name>
            <surname>F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>3464</fpage>
          -
          <lpage>3477</lpage>
          . URL: https://aclanthology. org/
          <year>2021</year>
          .findings-emnlp.
          <volume>294</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .findings-emnlp.
          <volume>294</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <source>ID10M: Idiom Identification in 10 Languages</source>
          , in: M.
          <string-name>
            <surname>Carpuat</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-C. de Marnefe</surname>
            ,
            <given-names>I. V.</given-names>
          </string-name>
          <string-name>
            <surname>Meza Ruiz</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2022</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Seattle, United States,
          <year>2022</year>
          , pp.
          <fpage>2715</fpage>
          -
          <lpage>2726</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .findings-naacl.
          <volume>208</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          . findings-naacl.
          <volume>208</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ettinger</surname>
          </string-name>
          , Assessing Phrasal Representation and Composition in Transformers, in: LLMs,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2305.14314. A.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
          </string-name>
          , et al.,
          <source>The Llama 3 herd of models, arXiv:2305.14314</source>
          .
          <string-name>
            <surname>arXiv</surname>
          </string-name>
          (Cornell University) (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <year>arxiv</year>
          .
          <volume>2407</volume>
          .21783.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Chen, LoRA: Low-Rank [38]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          , Adaptation of Large Language Models, CoRR
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          , abs/2106.09685 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/ Y. Chen,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <volume>2106</volume>
          .09685. arXiv:
          <volume>2106</volume>
          .09685.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          , A survey of large
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            . Al- language
            <surname>models</surname>
          </string-name>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/ Dahle,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Yang,
          <volume>2303</volume>
          .18223. arXiv:
          <volume>2303</volume>
          .
          <fpage>18223</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>