<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The OuLiBench Benchmark: Formal Constraints as a Lens into LLM Linguistic Competence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvio Calderaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale "A. Zampolli" (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent progress in Large Language Models (LLMs) has led to impressive capabilities in Natural Language Generation (NLG). However, standard evaluation benchmarks often focus on surface-level performance and are predominantly English-centric, limiting insights into models' deeper linguistic competences, especially in other languages. In this paper, we introduce OuLiBench, a novel benchmark inspired by the literary movement OuLiPo, designed to evaluate LLMs' ability to generate Italian text under explicit linguistic constraints, ranging from morpho-syntactic requirements to creative and structural challenges. Our goal is to assess the extent to which LLMs can understand and manipulate language when guided by specific, sometimes artificial constraints. We evaluate a range of state-of-the-art models in both zero- and few-shot settings, comparing performance across constraint types and difficulty levels. Our results highlight significant variability across models and tasks, shedding light on the limits of controllable text generation and offering a new lens for probing LLMs' generative and linguistic competence beyond traditional benchmarks.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Benchmark</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Controllable Text Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <sec id="sec-1-1">
        <p>The recent and rapid advancements in Large Language
Model (LLM) development have profoundly reshaped
the landscape of Natural Language Processing (NLP) [1,
2, 3, 4]. These models exhibit remarkable proficiency
across a wide range of tasks, particularly excelling in
the generation of coherent and contextually appropriate
text. They demonstrate a sophisticated grasp of complex
linguistic structures with high accuracy. Such capabilities
have been extensively evaluated through a variety of
benchmarks, many of which are aggregated on platforms
like the Open LLM Leaderboard [5] to facilitate
cross-model comparisons.</p>
        <p>However, despite the value of these benchmarks as
reference frameworks, a significant gap remains in the
comprehensive assessment of LLMs' intrinsic linguistic
competencies, independently of specific task formulations
and with a cross-cutting perspective [6, 7]. Standard
evaluation metrics often emphasize surface-level
features (e.g., n-gram overlap using BLEU or ROUGE),
which may fail to capture deep semantic understanding
or robust syntactic flexibility.</p>
        <p>Another critical issue, often underestimated in current
evaluation methodologies, is the overwhelming
predominance of benchmarks developed and validated primarily
for the English language [8]. This bias significantly limits
the accurate assessment of multilingual systems or
models tailored for other languages, such as Italian. Moreover,
it impedes the identification and study of culturally
specific linguistic phenomena, which are inherently tied to
the socio-cultural characteristics of individual linguistic
communities.</p>
        <p>Concurrently, Controllable Text Generation (CTG) is
emerging as a pivotal research area within the LLM
domain [9, 10, 11, 12, 13]. CTG focuses on developing and
analyzing techniques that guide text generation to
conform to explicit constraints, such as style (e.g., formal vs.
informal), emotional tone, desired length, structural
complexity (e.g., number of subordinate clauses), and
predefined semantic content. By leveraging strategies such as
prompt conditioning, targeted fine-tuning on annotated
datasets, and the implementation of dedicated control
mechanisms, CTG research aims to produce generative
systems capable of generating outputs that precisely
satisfy specified criteria. Intrinsically, this field not only
provides methodologies for evaluations better aligned
with practical and real-world communicative needs but
also emphasizes the models' ability to manipulate
language in response to explicit conditions.</p>
        <p>This focus on controlled generation naturally raises the
question of how far such control can be extended,
particularly when constraints become highly specific or even
deliberately artificial, designed not merely to produce
fluent text but to probe the limits of linguistic
manipulation and computational creativity. In this
regard, there exists a compelling parallel with the
principles of the literary group OuLiPo (Ouvroir de Littérature
Potentielle), which has long explored the generative
potential of formal constraints. By imposing stringent rules
on literary creation, OuLiPo demonstrates how limitations
can paradoxically unlock new expressive forms
and reveal deeper structural properties of language. We
hypothesize that such intricate, often playful linguistic
challenges, when adapted as evaluation tasks, can yield
valuable insights into the degree of fine-grained control
an LLM can exert and its implicit understanding of
linguistic structure, moving beyond mere fluency to assess
true generative competence.</p>
        <p>Building on these insights, in this paper we introduce
OuLiBench, a novel benchmark, and present an extensive
evaluation of LLMs' ability to generate Italian text under
targeted linguistic constraints, ranging from
morpho-syntactic to stylistic-formal phenomena.</p>
        <p>By prompting language models to generate sentences
that adhere to specific linguistic constraints (e.g., "Generate
a sentence with exactly five words" or "Generate a
sentence without the letter 'e'") and, where applicable,
evaluating their ability to reflect on these constraints or
on properties of the generated text, we aim to address the
following research questions: i) To what extent can LLMs
produce text that satisfies explicit linguistic constraints
defined in OuLiBench, including quantitative, structural,
and creative constraints? ii) What differences emerge
among various LLMs in their ability to meet complex
linguistic constraints, and which types of constraints pose
the greatest challenges? iii) How does the nature of the
constraint (e.g., syntactic vs. creative) affect the quality
and coherence of the generated text?</p>
        <p>Contributions. Our main contributions are:
• We propose a framework, based on the OuLiBench
benchmark, for evaluating the linguistic abilities of
state-of-the-art Italian LLMs when generating text.
• We conduct extensive evaluations across different
open- and closed-source models and linguistic constraints.
• We evaluate models' abilities across several
configurations, testing their performance in zero- and
few-shot settings.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2. Our Approach</title>
        <p>We systematically evaluate the ability of several LLMs
to generate Italian sentences under a range of explicitly
defined linguistic constraints. These constraints are
formalized as a set of properties P = {p_1, p_2, ..., p_n}, where
each property p_i corresponds to a specific quantitative,
morpho-syntactic or creative linguistic phenomenon.
The goal is to assess to what extent models can control
these properties during text generation, and how robustly
they generalize across different types of constraints.</p>
        <p>For each property p, we define a corresponding set of
possible target values V_p = {v_1, v_2, ..., v_k}. We prompt
the models to generate a fixed number of sentences
conditioned on each value v using a consistent prompt
format. For example, for the property "number of words"
a representative prompt would be:</p>
        <p>Genera 50 frasi composte esattamente da 5
parole ciascuna, escludi dal conto la
punteggiatura e gli spazi. [transl. Generate 50
sentences consisting of exactly 5 words each,
excluding punctuation and spaces from the
count.]</p>
        <p>Considering the difficulty that LLMs show in
meeting strict numerical specifications, such as generating
sentences with an exact length in terms of words or
characters, we intentionally structured the evaluation around
increasing values of each property. This approach allows
us to examine whether the models are sensitive to the
relative ordering and magnitude of constraints, even when
exact conformity is difficult to achieve. The underlying
hypothesis is that although a model may not reliably
produce a sentence with exactly 5 words, it may still exhibit
a monotonic tendency, generating progressively longer
sentences as the required number increases.</p>
        <p>For syntactic constraints, such as those related to the
syntactic order of the elements (e.g. SVO, SOV, VSO), the
analysis focused on the model's ability to adapt the
syntactic structure of the sentence to predetermined patterns.
Here, the aim is to assess the structural flexibility of the
model and its ability to shape the output according to
specific grammatical configurations. Finally, concerning
OuLiPo-inspired linguistic constraints, such as lipograms
(texts that deliberately omit a particular letter) and
tautograms (texts in which all words start with the same letter),
the evaluation was structured around specific letters of
the alphabet, testing the model's ability to inhibit or
concentrate the use of certain letters within the generated
sentences. This allows us to examine the controllability
of the models in more creative and stylistic contexts,
where the constraints are not numerical but qualitative
and symbolic.</p>
        <p>The linguistic constraints span both formal properties
(e.g. sentence length in words or characters,
permutations of sentence elements in the context of linguistic
typology) and creative phenomena (e.g., lipograms,
tautograms, acrostics), enabling a comprehensive evaluation of
controllability across structural and stylistic dimensions.
In all cases, the evaluation assesses whether the
generated sentence not only satisfies the target constraint but
also maintains syntactic correctness, semantic coherence,
and linguistic appropriateness in Italian.</p>
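        <p>To make this evaluation protocol concrete, the following minimal Python sketch (our illustration, not the authors' released code) computes the two quantities used for the quantitative constraints, a Success Rate and a rank-based Spearman sensitivity, for the word-count property; the tokenization rule is an assumption that mirrors the prompt's instruction to ignore punctuation:

```python
# Sketch: Success Rate and Spearman sensitivity for the word-count
# constraint; generations are (requested_word_count, sentence) pairs.
from statistics import mean

def word_count(sentence):
    # Count only tokens containing letters, ignoring bare punctuation.
    return sum(1 for tok in sentence.split() if any(c.isalpha() for c in tok))

def success_rate(generations):
    # Fraction of outputs that hit the requested word count exactly.
    hits = [1 if word_count(s) == n else 0 for n, s in generations]
    return mean(hits)

def spearman(xs, ys):
    # Rank correlation (no tie handling, for brevity).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

gens = [(5, "Il gatto dorme sul divano ora"),
        (7, "Oggi il cielo sopra Pisa resta molto chiaro")]
targets = [n for n, _ in gens]
produced = [word_count(s) for _, s in gens]
rho = spearman(targets, produced)
```

Under this scheme a model with a monotonic tendency obtains a rank correlation near 1 even when no individual sentence hits its exact target, which is precisely the behaviour the Spearman metric is meant to capture.</p>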
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. OuLiBench</title>
      <p>To address the need for more granular evaluation tools
for the Italian language, we developed OuLiBench. This
novel benchmark is specifically designed to thoroughly
analyze the capability of LLMs to generate text while
adhering to a diverse and progressively complex set of
explicit linguistic constraints, thereby moving beyond
assessments based on mere surface-level fluency.</p>
      <sec id="sec-2-2">
        <title>3.1. Conceptual Framework and Task Taxonomy</title>
        <sec id="sec-2-2-1">
          <p>The conceptual foundation of OuLiBench integrates
principles from Controllable Text Generation (CTG) [10],
which focuses on guided generation according to
predefined attributes, with the creative, constraint-based
methodologies of the OuLiPo (Ouvroir de Littérature
Potentielle) literary group. Founded in 1960 by writer
Raymond Queneau and mathematician François Le
Lionnais, OuLiPo emerged as a revolutionary literary
movement that sought to explore the potential of literature
through the systematic application of formal constraints.
In their Premier Manifeste (First Manifesto) [14] of 1961,
Le Lionnais articulated the group's foundational
philosophy, defining littérature potentielle as "the search
for new structures and patterns that can be used" to
create literary works. The group used
the restrictions of literary forms to spark creativity,
developing techniques such as lipograms (texts excluding
specific letters), tautograms, anagrams and palindromes.
This approach demonstrated that systematic limitations
could paradoxically expand rather than restrict creative
possibilities, generating what the group termed
"potential literature". OuLiBench adapts these philosophies into
a suite of computationally evaluable tasks, entirely
formulated and contextualized for the Italian language.</p>
          <p>OuLiBench is organized according to a taxonomy that
reflects different levels and types of linguistic control:
1. Quantitative Constraints: This category assesses the
precision of dimensional control over the textual output.
Tasks require models to generate sentences adhering to
an exact word count or an exact character count (net of
punctuation and spaces). These constraints challenge
models to balance numerical restrictions with semantic
coherence and grammatical correctness.
2. Syntactic Constraints: These tasks evaluate the models'
competence in manipulating fundamental Italian
grammatical structures. They include verbal diathesis control
(requiring generation in active, passive, or
reflexive/medium voice) and constituent order permutations
(Subject-Verb-Object), testing flexibility in generating
canonical and non-canonical sentence structures.
3. Stylistic-Formal (OuLiPo-inspired) Constraints:
Representing the most elaborate challenges, this category
implements OuLiPian contraintes. It includes tasks such
as the Lipogram (omission of specific letters), Inverse
Lipogram (mandatory inclusion of specific letters),
Tautogram (all words starting with the same letter),
Anagram (at both word and phrasal levels), Palindrome
(symmetrical text), and Acrostic (initial letters of words
forming a target word). These tasks demand advanced
linguistic planning and sophisticated sub-lexical
and structural manipulation.</p>
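          <p>The OuLiPo-style constraints above admit simple rule-based checks. The following Python sketch (our illustration; the function names and the accent-stripping normalization are our own assumptions, not the benchmark's code) shows how each could be verified automatically:

```python
# Sketch of rule-based validators for the stylistic-formal constraints:
# lipogram, inverse lipogram, tautogram, palindrome, and acrostic.
import unicodedata

def normalize(text):
    # Lowercase, strip accents ("è" matches "e"), keep letters and spaces.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if c.isalpha() or c.isspace())

def is_lipogram(text, letter):
    # The forbidden letter must never appear.
    return letter not in normalize(text).replace(" ", "")

def is_inverse_lipogram(text, letter):
    # The required letter must appear at least once in every word.
    return all(letter in w for w in normalize(text).split())

def is_tautogram(text):
    # All words share the same initial letter.
    initials = [w[0] for w in normalize(text).split()]
    return len(set(initials)) == 1

def is_palindrome(text):
    # Letters read the same forwards and backwards, ignoring spacing.
    s = normalize(text).replace(" ", "")
    return s == s[::-1]

def is_acrostic(text, target):
    # Word initials spell the target word.
    initials = "".join(w[0] for w in normalize(text).split())
    return initials == normalize(target)
```

Applied to the example sentences in Table 1, these checks accept "Aceto nell'enoteca" as a palindrome and "Maria mangia mele morbide" as a tautogram, while still leaving grammaticality and coherence to the qualitative evaluation.</p>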
        </sec>
        <sec id="sec-2-2-2">
          <p>For each task, specific prompts were formulated in Italian. Table 1 provides a comprehensive overview of the tasks included in OuLiBench, as well as the prompts used for generating the sentences.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Setting</title>
      <sec id="sec-3-1">
        <p>We evaluate a pool of Italian LLMs by testing their
ability to follow the linguistic constraints defined in
OuLiBench. We conduct our experiments in both
zero-shot and few-shot settings. In the zero-shot condition,
the model receives only the instruction formulated in
natural language. In the few-shot configuration, the
prompt is augmented with five, ten, and fifteen exemplar
sentences corresponding to the same constraint. This
setup is intended to investigate whether LLMs improve
in constraint-following behaviour when exposed to
in-context demonstrations. In the following, we describe the
set of tested models and the evaluation strategy adopted
to assess the extent to which generated outputs satisfy
the defined constraints.</p>
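        <p>As an illustration of this setup, the sketch below assembles zero- and few-shot prompts for the word-count task; the instruction text follows the example prompt reported above, while the exemplar sentences merely stand in for those drawn from the Italian UD treebanks (the helper name and formatting are our own assumptions):

```python
# Illustrative zero-/few-shot prompt assembly for the word-count task.
INSTRUCTION = ("Genera 50 frasi composte esattamente da {n} parole ciascuna, "
               "escludi dal conto la punteggiatura e gli spazi.")

def build_prompt(n_words, exemplars=()):
    # exemplars: 0, 5, 10 or 15 sentences satisfying the same constraint.
    parts = [INSTRUCTION.format(n=n_words)]
    if exemplars:
        parts.append("Esempi:")
        parts.extend(f"- {s}" for s in exemplars)
    return "\n".join(parts)

zero_shot = build_prompt(5)
few_shot = build_prompt(5, ["Il gatto dorme sul divano.",
                            "Oggi piove molto a Pisa."])
```

The same instruction is kept fixed across settings, so that any performance difference can be attributed to the in-context demonstrations rather than to a change of wording.</p>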
        <sec id="sec-3-1-1">
          <title>4.1. Models</title>
          <p>The landscape of Italian large language models (LLMs)
is evolving rapidly, with notable differences in
development strategies. Some models have been pre-trained
from scratch with intrinsic emphasis on the Italian
language, while others have been fine-tuned for Italian
starting from well-established architectures. For this
study, we selected models with comparable parameter
scales: Minerva-7B-instruct-v1.0 (SapienzaNLP) [15],
Velvet-14B (Almawave) [16], Maestrale-chat-v0.4-beta
(mii-llm) [17], and LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
(SWAP-UNIBA) [18]. The first group includes the models
pre-trained from scratch.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>Overview of the OuLiBench tasks, the prompts used for generation, and example target sentences.</p></caption>
            <table>
              <thead>
                <tr><th>Category</th><th>Task</th><th>Prompt</th><th>Example Target Sentence (Italian)</th></tr>
              </thead>
              <tbody>
                <tr><td>Quantitative</td><td>Length by Words</td><td>Generate Italian sentences with an exact word count.</td><td>"Il gatto dorme sul divano." (5 words)</td></tr>
                <tr><td>Quantitative</td><td>Length by Characters</td><td>Generate Italian sentences with an exact character count (no punct/space).</td><td>"Mangio la pizza" (13 chars)</td></tr>
                <tr><td>Syntactic</td><td>Diathesis Control</td><td>Generate Italian sentences in a specified voice (active, passive, reflexive).</td><td>"La lettera è scritta da Marco." (passive)</td></tr>
                <tr><td>Syntactic</td><td>Word Order Permutations</td><td>Generate Italian sentences using specific SVO permutations (SOV, VSO, etc.).</td><td>"Mangia la mela Luca" (VOS)</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Lipogram</td><td>Generate Italian text excluding a specific letter.</td><td>"Oggi vado in montagna" (without 'e')</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Inverse Lipogram</td><td>Generate Italian sentences where a specific letter appears at least once in each word.</td><td>"Questo esercizio contiene molte esse." ('e')</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Tautogram</td><td>Generate Italian text where all words start with the same letter.</td><td>"Maria mangia mele morbide" ('m')</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Word Anagram</td><td>Generate a valid Italian anagram for a given Italian word.</td><td>"Noce" → "Ceno"</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Phrasal Anagram</td><td>Reorder sentence letters into a new meaningful Italian sentence.</td><td>"Amo Roma" → "Moro ama"</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Palindrome</td><td>Generate Italian text reading the same forwards and backwards.</td><td>"Aceto nell'enoteca"</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Acrostic</td><td>Generate Italian text where initial word letters form a target word.</td><td>"Viva V.E.R.D.I."</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Minerva-7B-instruct-v1.0 is a 7-billion-parameter
Transformer pre-trained on 2.5 trillion tokens, balancing
Italian, English, and code, and later refined through
supervised fine-tuning (SFT) and direct preference
optimization (DPO). Velvet-14B is a dense 14-billion-parameter
Transformer trained from scratch on the Leonardo HPC
system using 4 trillion multilingual tokens, approximately
23% of which are in Italian, achieving competitive scores
on Italian-language benchmarks. These models integrate
Italian language knowledge from the earliest stages of
training. The second group is based on existing
architectures. LLaMAntino-3-ANITA-8B-Inst-DPO-ITA is
derived from Meta-LLaMA-3-8B-Instruct and specializes in
Italian through super-fine-tuning (QLoRA SFT) on mixed
datasets and DPO optimization. Maestrale-chat-v0.4-beta,
based on Mistral-7B, underwent continued pre-training
on an Italian corpus and "Occiglot," followed by
conversational SFT and DPO alignment aimed at improving
factuality and mathematical reasoning. Although these
models build upon pre-trained foundations, they have
invested significantly in adapting and optimizing for the
specific characteristics of the Italian language. To achieve
a comprehensive and diversified evaluation of LLM
capabilities across the tasks proposed by the benchmark, it
was essential to extend the comparison to include larger
proprietary models that currently represent the state of
the art in the field. This strategic choice enabled
assessment of the selected Italian open-source models in
relation to the highest standards achieved by global research
and development. Specifically, the comparison included
Claude Sonnet 4 [19], DeepSeek [<xref ref-type="bibr" rid="ref22">20</xref>], Gemini 2.5 Flash,
and GPT-4o mini [2].</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.2. Prompting Optimization</title>
          <p>The effectiveness of text generation using advanced
Language Models is critically dependent on the calibration
and formulation of prompts. Our research has
systematically analyzed the interaction between prompt structure
and output quality for each model, defining optimized
strategies to maximize compliance with experimental
requirements. Generally, precision in criteria definition
was found to be critical: for text length control, making
explicit the exclusion of non-linguistic elements (such as
punctuation and spaces) significantly improved the
precision of some models (Maestrale and Anita). Similarly,
for the handling of verbal diathesis, in particular
middle (or reflexive) diathesis, explicit formulations reduced
interpretive ambiguities, increasing the adherence of
outputs. In the context of OuLiPo constraints, whenever
possible we avoided specific terminology in the prompt
(Lipograms, Inverse Lipograms, Tautograms, and
Palindromes), describing the task directly and using quotation
marks to highlight restricted letters.</p>
          <p>A crucial aspect of our methodology was the
implementation of few-shot learning, exploring its
configurations with 0, 5, 10 and 15 examples. The tasks that
employed few-shot were: quantitative constraints, diathesis,
Lipograms, Palindromes. The examples were collected
from the Italian Universal Dependencies dataset, a corpus
consisting of 34,383 sentences derived from the main
Italian treebanks included in the Universal Dependencies
project, including ISDT [21], VIT [22], ParTUT [23],
PoSTWITA [24] and TWITTIRÒ [25].</p>
          <p>During few-shot experimentation, it emerged that the
Minerva and Velvet models tended to slavishly
reproduce the examples provided in the prompt, generating
outputs identical or nearly identical to the initial
examples, regardless of the variation required by the task. This
behavior compromised the evaluability of the outputs, as
it did not allow verification of the model's ability to
generalize or adapt to the specific constraint. Consequently,
these models were excluded from the tables related to
few-shot configurations.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-0">
        <title>4.3. Evaluation Strategy</title>
        <p>The assessment of model performance within OuLiBench
employs an integrated approach, combining quantitative
metrics for formal adherence with qualitative analyses for
the more nuanced aspects of generation. The primary
quantitative metrics are:
• Success Rate (SR): calculated as the percentage of
generated outputs that perfectly satisfy the linguistic
constraint imposed by the specific task. This metric
provides a direct measure of the model's precision.
• Spearman's Rank Correlation Coefficient (ρ): used to
determine the models' sensitivity to incremental or
decremental variations in constraints (e.g., whether models
produce longer sentences when requested to increase
word count), even when exact adherence is not achieved.
This metric was only computed for the evaluation of the
quantitative constraints.</p>
        <p>To apply these metrics, particularly for SR on
constraints involving specific lexical or syntactic features,
model outputs were pre-processed and analyzed, partly
with the support of linguistic analysis tools. In particular,
we employed Profiling-UD [26], a tool that allows
the extraction of more than 130 properties representative
of the linguistic structure underlying a sentence and
derived from raw, morpho-syntactic and syntactic levels
of annotation based on the UD formalism. Profiling-UD
was specifically applied to the sentences generated by
the tested models to extract linguistic features used to
evaluate model performance (e.g. sentence length, in
terms of tokens or characters, diathesis control, etc.).</p>
        <p>The qualitative analysis was carried out manually on
the responses that had passed the automatic evaluation,
meaning those that met the formal constraints required
by the task. The aim was to examine more closely the
linguistic quality of the sentences produced, considering
three main aspects: grammatical correctness, semantic
coherence, and linguistic appropriateness. These criteria
were not applied according to a strict hierarchy, although
semantic coherence often played a central role, as it is
crucial for the comprehensibility and meaning of the
sentence. In the presence of particularly strong constraints,
such as in the case of tautograms or anagrams, the
evaluation was conducted with greater flexibility. The rigidity
of the structure required by these constraints can
compromise the naturalness of the sentences, making it
necessary to allow some tolerance in assessing the other
qualitative aspects.</p>
      </sec>
      <p>The results obtained from the application of the
OuLiBench benchmark highlight substantial differences
among the tested models, both in terms of absolute
capabilities and sensitivity to various types of linguistic
constraints. The analysis was conducted considering both
quantitative metrics (Success Rate and Spearman's rank
correlation) and qualitative evaluations of semantic
coherence and grammatical correctness.</p>
      <sec id="sec-4-1a">
        <title>5.1. Overall Performance</title>
        <p>Table 2 reports the results obtained by the Italian
open-source models, which highlight a significant variability
in models' linguistic control capabilities.
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA (Anita) stands out as the
best-performing Italian model, achieving an average SR of
53% in the zero-shot setting, clearly outperforming the
others. Velvet-14B reaches an average of 29%, while
Maestrale-chat-v0.4-beta and Minerva-7B-instruct-v1.0
show more limited performance, with 19% and 12%
respectively.</p>
        <p>To better contextualize these results, Table 3 reports
the performance of larger proprietary models, which can
be considered as an upper bound relative to the Italian
ones. Within this group, Gemini 2.5 Flash achieves
the highest performance with an overall average of 70%,
followed by GPT-4o mini (66%) and DeepSeek R1 (65%).
Claude Sonnet 4, while competitive across several tasks,
records an overall average of 61.5%.</p>
      </sec>
      <sec id="sec-4-1b">
        <title>5.2. Analysis by Constraint Categories</title>
        <sec id="sec-4-1c">
          <title>5.2.1. Quantitative Constraints</title>
          <p>Length control tasks proved to be the most challenging
for all tested models. In word-count control, Gemini
performed best (34%), followed by DeepSeek (30%)
and GPT-4o mini (17%), while Claude obtained the worst
performance (9%). Among open-source models, Anita
achieved 27% in zero-shot, significantly outperforming
Maestrale (9%), Velvet (5%), and Minerva (3%). Spearman
correlations were consistently high for proprietary
models (94%–100%), thus indicating strong ordinal
sensitivity despite difficulties in precise control.</p>
          <p>Character-count control was even more demanding:
Gemini led (14%), trailed by GPT-4o mini (13%) and
DeepSeek (5%), while Claude struggled severely (0.03%).
Anita remained competitive (15%) among open-source
models.</p>
        </sec>
      </sec>
      <p>[Table 2 and Table 3: per-task Success Rates (Word
Length, Char. Length, Diathesis, Permutations, Lipograms,
Inverse Lipograms, Word Anagrams, Sentence Anagrams,
Tautograms, Palindromes, Acrostics, and Model Avg) for
the open-source and proprietary models.]</p>
      <sec id="sec-4-1">
        <title>5.2.3. Stylistic-Formal Constraints</title>
        <p>This category showed the widest performance gaps. For
lipograms, GPT-4o mini achieved the best results (89%),
ahead of DeepSeek (79%), Claude (77%), and Gemini (73%).
Anita remained competitive (59%), while other
open-source models obtained significantly lower scores:
Velvet (47%), Maestrale (32%), and Minerva (28%).</p>
        <p>Tautograms revealed polarizing results: Claude led
(99%), followed by GPT-4o mini (98%), Gemini (94%),
and DeepSeek (91%). Among open-source models, Anita
(73%) vastly outperformed Maestrale (55%), with Velvet
(8%) and Minerva (0.07%) failing critically.</p>
        <p>Word anagrams exhibited extreme variability: GPT-4o
mini scored perfectly (100%), while Anita surprised
with 92%, surpassing DeepSeek (76%), Gemini (58%), and
Claude (54%). Other open-source models failed
completely: Maestrale (18%), Velvet (10%), and Minerva (0%).</p>
        <p>Palindromes were universally the hardest task. Claude
led (74%), with GPT-4o mini (26%), Gemini (20%), and
DeepSeek (18%) far behind. Anita achieved 54% in
zero-shot, while all other open-source models scored zero.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Diathesis control revealed in general a clear advan</title>
        <p>tage for proprietary models: DeepSeek and Anita 5.3. Efects of Few-Shot Learning
achieved near-perfect scores (100% and 99%,
respectively), followed by GPT-4o mini (97%) and Gemini (99%). The few-shot learning analysis reveals non-uniform
patClaude trailed slightly (89%), while Italian open-source terns across models and tasks. Anita shows a general
models—Minerva (72%), Velvet (67%), and Maestrale degradation of performance with an increase in examples
(59%)—struggled more. (from 53% in zero-shot to 26-40% in few-shot
configura</p>
        <p>Constituent order permutations highlighted a stark di- tions), particularly evident in quantitative tasks where
vide: GPT-4o mini excelled (99%), with DeepSeek (100%), word control decreases from 27% to 13% with 15 examples,
Gemini (95%), and Claude (86%) close behind. Open- and voice control degrades from 99% to 90%. This trend
source models performed uniformly worse: Anita and suggests possible contextual overfitting phenomena.
Velvet (both 16%), Maestrale (12%), and Minerva (8%), Maestrale, on the other hand, exhibits a pattern of
suggesting architectural limitations in complex syntactic gradual improvement (from 19% in zero-shot to 35% with
manipulation. 15 examples), with clear benefits in quantitative tasks:
character control improves from 0.006 to 0.03, and voice
control reaches perfection (1.0) with 5 and 15 examples. exhibit a robust implicit grasp of linguistic structure, they
A slight improvement from 0.32 to 0.37 with 5 examples struggle with fine-grained numerical control, a
limitais also observed in lipograms, indicating more robust tion likely rooted in the statistical nature of transformer
in-context learning capabilities. architectures.</p>
        <p>It is noteworthy that Minerva and Velvet systematically
tend to reproduce the few-shot examples almost verbatim,
particularly in quantitative tasks and lipograms.
This behavior made their outputs effectively unassessable
in few-shot settings. A plausible explanation is that the
high complexity of the tasks, combined with the explicit
presence of in-context examples, may lead these models
to default to copying strategies rather than genuine
generalization. This tendency ultimately compromises
output quality and originality, suggesting limitations in
their ability to adapt constraints creatively beyond
provided exemplars.</p>
        <p>Comparing open-source and closed-source models, the
latter generally outperform the former, particularly in
tasks involving stylistic-formal constraints. However,
this advantage is not consistent across all task types.
Notably, even closed-source models, despite their overall
superiority, struggle with specific tasks such as
palindromes, which require strict character-level control.
Similarly, tasks involving quantitative constraints pose
significant challenges for both model categories, as they
demand precise control over features like length or
repetition, capabilities that are difficult to enforce within
transformer-based architectures relying on statistical
patterns rather than explicit rule-based mechanisms. These
limitations further corroborate the value of OuLiBench
as a benchmark for evaluating LLMs' ability to generate
text while adhering to complex and diverse constraints.</p>
      </sec>
      <sec id="sec-6">
        <title>6. Discussion</title>
        <p>The OuLiBench results provide valuable insights into the
linguistic competence of Large Language Models (LLMs),
particularly in their ability to generate text under various
formal constraints. One of the most striking findings is
the performance gap between tasks involving quantitative
constraints and those requiring more structural or
stylistic control. This disparity suggests that while LLMs
exhibit a robust implicit grasp of linguistic structure, they
struggle with fine-grained numerical control, a limitation
likely rooted in the statistical nature of transformer
architectures.</p>
        <p>Finally, models from both categories perform well on
syntactic constraints, suggesting that such structural
aspects are relatively well captured by current architectures.
Focusing instead on smaller open-source models, we
noticed that their linguistic production frequently
suffered, primarily in stylistic-formal tasks, from an
inability to generate truly well-structured sentences in
Italian, often producing ungrammatical or semantically
incoherent outputs. This degradation of linguistic quality
under complex constraints highlights the trade-off between
adherence to the constraint and maintenance of basic
linguistic competence. A particularly notable pattern
emerged in the palindrome tasks: smaller models frequently
abandoned Italian and began generating sentences in English.</p>
        <p>This involuntary code-switching suggests a tendency
to revert to the predominant language in the training
data when the task deviates from standard generation
patterns.</p>
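        <p>The character-level strictness that makes palindromes hard for these models is easy to state as a verifier: after normalization, the letter sequence must equal its own reverse. A minimal sketch follows; the normalization choices (case folding, accent stripping, letters only) are our assumptions:</p>

```python
import unicodedata

def is_palindrome(text):
    """Palindrome check at the character level: lowercase, strip
    accents (NFD decomposition separates combining marks), keep
    only letters, then compare the sequence with its reverse."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    letters = [c for c in decomposed
               if c.isalpha() and not unicodedata.combining(c)]
    return letters == letters[::-1]

print(is_palindrome("I topi non avevano nipoti"))  # True: a classic Italian palindrome
print(is_palindrome("Il sole splende"))            # False
```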
        <p>From a more qualitative point of view, the generated
outputs of the models reveal systematic behavioral
patterns, particularly evident in smaller models but also
observable in larger ones. A recurring phenomenon is the
tendency for thematic and lexical repetition with
superficial word order variations across most tasks, suggesting
limitations in creative diversification under constraints.</p>
        <p>In the specific case of anagrammatic tasks, Anita and
Velvet showed a simplified resolution strategy, limiting
themselves to swapping word order within phrases rather
than performing true letter-level permutations (as shown
in the examples below). This behavior indicates a
superficial understanding of the anagrammatic constraint and
the adoption of simplified heuristics.</p>
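        <p>The difference between a genuine anagram and this word-reordering shortcut can be made precise with a character-level check. A minimal sketch, with helper names that are ours rather than part of the OuLiBench evaluation code:</p>

```python
from collections import Counter

def is_letter_anagram(original, candidate):
    """True anagram: the two strings contain exactly the same
    multiset of letters, ignoring case, spaces, and punctuation."""
    def letters(s):
        return Counter(c for c in s.lower() if c.isalpha())
    return letters(original) == letters(candidate)

def is_word_permutation(original, candidate):
    """The simplified heuristic observed in Anita and Velvet:
    the candidate merely reorders whole words."""
    def words(s):
        return sorted(w.strip('.,;:!?"').lower() for w in s.split())
    return words(original) == words(candidate)

# Every word permutation is trivially a letter anagram, so outputs that
# pass the second check satisfy the constraint only in a degenerate sense.
print(is_word_permutation("Il sole splende.", "Splende il sole."))  # True
print(is_letter_anagram("Roma", "amor"))                            # True
```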
        <p>Examples from Anita:</p>
        <p>Original: "Tre gatti in casa fanno rumore
strepito"
Anagram: "Strepito in casa fanno gatti tre
rumore"
English: "Three cats in the house make
noise and uproar" → "Uproar in the house
make cats three noise"</p>
        <p>Original: "Tre per cento in banca stanno"
Anagram: "Stanno in banca trecento per"
English: "Three percent are in the bank"
→ "Are in the bank three hundred percent"</p>
        <p>Examples from Velvet:</p>
        <p>Original: "Il sole splende."
Anagram: "Splende il sole."
English: "The sun shines." → "Shines the sun."</p>
        <p>Original: "La luna brilla."
Anagram: "Brilla la luna."
English: "The moon shines." → "Shines the moon."</p>
        <p>Original: "Il gatto mangia."
Anagram: "Mangia il gatto."
English: "The cat eats." → "Eats the cat."</p>
        <p>In summary, these results highlight the difficulty
models have in reflecting on and producing language
according to meta-linguistic principles, a fundamental
feature of human linguistic creativity, and thus the
limitations of multi-objective planning mechanisms with
respect to controllability and performance in complex
linguistic tasks.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Works</title>
      <p>In this study, we presented OuLiBench, a novel
benchmark designed to rigorously assess the linguistic
capabilities of Large Language Models (LLMs) through the
generation of Italian texts governed by explicit formal
constraints. Drawing inspiration from the Oulipo
literary tradition, our benchmark diverges from
conventional evaluation methodologies that typically emphasize
task performance on downstream applications. Instead,
OuLiBench centers its evaluation on the model's
proficiency in adhering to a diverse array of linguistic
constraints, encompassing structural, quantitative, syntactic,
and stylistic dimensions. This shift of focus allows for a
more nuanced understanding of a model's fine-grained
control over language generation processes. Our
empirical evaluation involved both open-source and commercial
LLMs tested in zero-shot and few-shot scenarios. The
results revealed substantial variability in their ability to
meet the prescribed constraints. Quantitative constraints,
such as specific letter counts or palindromic structures,
posed significant difficulties across the board,
underscoring persistent limitations in current architectures for
handling sub-lexical control. Conversely, syntactic and
stylistic constraints were more successfully navigated by
larger models, suggesting that model scale and complexity
contribute positively to managing higher-level linguistic
features. Notably, Italian-focused LLMs, including Anita,
demonstrated competitive performance, highlighting the
benefits of dedicated linguistic resources and targeted
training on specific languages, which can partially offset
the advantages conferred by sheer model size. These
findings emphasize the persistent challenges in controllable
text generation, especially under intersecting and
mutually interacting constraints that demand simultaneous
fulfillment without compromising linguistic naturalness
and coherence. The results indicate a pressing need for
innovative generation frameworks capable of embedding
meta-linguistic reasoning and constraint-aware planning
mechanisms throughout the text production pipeline.</p>
      <p>Looking forward, OuLiBench lays the groundwork for
several promising directions in computational
linguistics and AI research. Extending the benchmark to other
languages would facilitate cross-linguistic investigations
into the controllability of multilingual LLMs, while the
integration of multimodal or pragmatic constraints could
broaden the scope of evaluation beyond purely textual
parameters. Additionally, developing refined qualitative
and creativity-focused metrics will be critical to
advancing our understanding of deep linguistic competence,
ultimately guiding the design of next-generation models
with enhanced flexibility, expressiveness, and adherence
to formal language structures. Ultimately, OuLiBench
not only enriches the evaluation toolkit for Italian NLP
but also serves as a conceptual bridge between
computational linguistics and literary formalism, pushing the
boundaries of what LLMs can achieve under constraint.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been supported by the FAIR - Future AI
Research (PE00000013) project under the NRRP MUR
program funded by NextGenerationEU. Partial support
was also received from the project "Understanding and
Enhancing Preference Alignment in Large Language Models
Through Controlled Text Generation" (IsCc8_ALIGNLLM),
funded by CINECA under the ISCRA initiative, for the
availability of HPC resources and support.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) for grammar
and spelling checking. After using this tool/service, the author(s) reviewed and edited the content
as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
</article>