<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The OuLiBench Benchmark: Formal Constraints as a Lens into LLM Linguistic Competence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvio Calderaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale "A. Zampolli" (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent progress in Large Language Models (LLMs) has led to impressive capabilities in Natural Language Generation (NLG). However, standard evaluation benchmarks often focus on surface-level performance and are predominantly English-centric, limiting insights into models' deeper linguistic competences, especially in other languages. In this paper, we introduce OuLiBench, a novel benchmark inspired by the literary movement OuLiPo, designed to evaluate LLMs' ability to generate Italian text under explicit linguistic constraints, ranging from morpho-syntactic requirements to creative and structural challenges. Our goal is to assess the extent to which LLMs can understand and manipulate language when guided by specific, sometimes artificial constraints. We evaluate a range of state-of-the-art models in both zero- and few-shot settings, comparing performance across constraint types and difficulty levels. Our results highlight significant variability across models and tasks, shedding light on the limits of controllable text generation and offering a new lens for probing LLMs' generative and linguistic competence beyond traditional benchmarks.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Benchmark</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Controllable Text Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <sec id="sec-1-1">
        <p>The recent and rapid advancements in Large Language
Model (LLM) development have profoundly reshaped
the landscape of Natural Language Processing (NLP) [1,
2, 3, 4]. These models exhibit remarkable proficiency
across a wide range of tasks, particularly excelling in
the generation of coherent and contextually appropriate
text. They demonstrate a sophisticated grasp of complex
linguistic structures with high accuracy. Such capabilities
have been extensively evaluated through a variety of
benchmarks, many of which are aggregated on platforms
like the Open LLM Leaderboard [5] to facilitate
cross-model comparisons.</p>
        <p>However, despite the value of these benchmarks as
reference frameworks, a significant gap remains in the
comprehensive assessment of LLMs' intrinsic linguistic
competencies, independently of specific task formulations
and with a cross-cutting perspective [6, 7]. Standard
evaluation metrics often emphasize surface-level
features (e.g., n-gram overlap using BLEU or ROUGE),
which may fail to capture deep semantic understanding
or robust syntactic flexibility.</p>
        <p>Another critical issue, often underestimated in current
evaluation methodologies, is the overwhelming
predominance of benchmarks developed and validated primarily
for the English language [8]. This bias significantly limits
the accurate assessment of multilingual systems or
models tailored for other languages, such as Italian. Moreover,
it impedes the identification and study of culturally
specific linguistic phenomena, which are inherently tied to
the socio-cultural characteristics of individual linguistic
communities.</p>
        <p>Concurrently, Controllable Text Generation (CTG) is
emerging as a pivotal research area within the LLM
domain [9, 10, 11, 12, 13]. CTG focuses on developing and
analyzing techniques that guide text generation to
conform to explicit constraints, such as style (e.g., formal vs.
informal), emotional tone, desired length, structural
complexity (e.g., number of subordinate clauses), and
predefined semantic content. By leveraging strategies such as
prompt conditioning, targeted fine-tuning on annotated
datasets, and the implementation of dedicated control
mechanisms, CTG research aims to produce generative
systems capable of generating outputs that precisely
satisfy specified criteria. Intrinsically, this field not only
provides methodologies for evaluations better aligned
with practical and real-world communicative needs but
also emphasizes the models' ability to manipulate
language in response to explicit conditions.</p>
        <p>This focus on controlled generation naturally raises the
question of how far such control can be extended,
particularly when constraints become highly specific or even
deliberately artificial, designed not merely to produce
fluent text but to probe the limits of linguistic
manipulation and computational creativity. In this
regard, there exists a compelling parallel with the
principles of the literary group OuLiPo (Ouvroir de Littérature
Potentielle), which has long explored the generative
potential of formal constraints. By imposing stringent rules
on literary creation, OuLiPo demonstrates how limitations
can paradoxically unlock new expressive forms
and reveal deeper structural properties of language. We
hypothesize that such intricate, often playful linguistic
challenges, when adapted as evaluation tasks, can yield
valuable insights into the degree of fine-grained control
an LLM can exert and its implicit understanding of
linguistic structure, moving beyond mere fluency to assess
true generative competence.</p>
        <p>Building on these insights, in this paper we introduce
OuLiBench, a novel benchmark, and present an extensive
evaluation of LLMs' ability to generate Italian text under
targeted linguistic constraints, ranging from
morpho-syntactic to stylistic-formal phenomena.</p>
        <p>By prompting language models to generate sentences
that adhere to specific linguistic constraints (e.g., "Generate
a sentence with exactly five words" or "Generate a
sentence without the letter 'e'") and, where applicable,
evaluating their ability to reflect on these constraints or
on properties of the generated text, we aim to address the
following research questions: i) To what extent can LLMs
produce text that satisfies explicit linguistic constraints
defined in OuLiBench, including quantitative, structural,
and creative constraints? ii) What differences emerge
among various LLMs in their ability to meet complex
linguistic constraints, and which types of constraints pose
the greatest challenges? iii) How does the nature of the
constraint (e.g., syntactic vs. creative) affect the quality
and coherence of the generated text?</p>
        <p>Contributions. Our main contributions are:
• We propose a framework, based on the OuLiBench
benchmark, for evaluating the linguistic abilities of
state-of-the-art Italian LLMs when generating text.
• We conduct extensive evaluations across different
open- and closed-source models and linguistic constraints.
• We evaluate models' abilities across several
configurations, testing their performance in zero- and
few-shot settings.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2. Our Approach</title>
        <p>We systematically evaluate the ability of several LLMs
to generate Italian sentences under a range of explicitly
defined linguistic constraints. These constraints are
formalized as a set of properties P = {p_1, p_2, ..., p_n}, where
each property p_i corresponds to a specific quantitative,
morpho-syntactic or creative linguistic phenomenon.
The goal is to assess to what extent models can control
these properties during text generation, and how robustly
they generalize across different types of constraints.</p>
        <p>For each property p, we define a corresponding set of
possible target values V_p = {v_1, v_2, ..., v_k}. We prompt
the models to generate a fixed number of sentences
conditioned on each value v using a consistent prompt
format. For example, for the property "number of words"
a representative prompt would be:</p>
        <p>Genera 50 frasi composte esattamente da 5
parole ciascuna, escludi dal conto la
punteggiatura e gli spazi. [transl. Generate 50
sentences consisting of exactly 5 words each,
excluding punctuation and spaces from the
count.]</p>
        <p>Considering the difficulty that LLMs show in
meeting strict numerical specifications, such as generating
sentences with an exact length in terms of words or
characters, we intentionally structured the evaluation around
increasing values of each property. This approach allows
us to examine whether the models are sensitive to the
relative ordering and magnitude of constraints, even when
exact conformity is difficult to achieve. The underlying
hypothesis is that although a model may not reliably
produce a sentence with exactly 5 words, it may still exhibit
a monotonic tendency, generating progressively longer
sentences as the required number increases.</p>
        <p>For syntactic constraints, such as those related to the
syntactic order of the elements (e.g. SVO, SOV, VSO), the
analysis focused on the model's ability to adapt the
syntactic structure of the sentence to predetermined patterns.
Here, the aim is to assess the structural flexibility of the
model and its ability to shape the output according to
specific grammatical configurations. Finally, concerning
OuLiPo-inspired linguistic constraints, such as lipograms
(texts that deliberately omit a particular letter) and
tautograms (texts in which all words start with the same letter),
the evaluation was structured around specific letters of
the alphabet, testing the model's ability to inhibit or
concentrate the use of certain letters within the generated
sentences. This allows us to examine the controllability
of the models in more creative and stylistic contexts,
where the constraints are not numerical but qualitative
and symbolic.</p>
        <p>The linguistic constraints span both formal properties
(e.g. sentence length in words or characters,
permutations of sentence elements in the context of linguistic
typology) and creative phenomena (e.g., lipograms,
tautograms, acrostics), enabling a comprehensive evaluation of
controllability across structural and stylistic dimensions.
In all cases, the evaluation assesses whether the
generated sentence not only satisfies the target constraint but
also maintains syntactic correctness, semantic coherence,
and linguistic appropriateness in Italian.</p>
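        <p>To make this evaluation protocol concrete, the following minimal Python sketch (our illustration, not the authors' released code) computes the two quantities used for the quantitative constraints, a Success Rate and a rank-based Spearman sensitivity, for the word-count property; the tokenization rule is an assumption that mirrors the prompt's instruction to ignore punctuation:

```python
# Sketch: Success Rate and Spearman sensitivity for the word-count
# constraint; generations are (requested_word_count, sentence) pairs.
from statistics import mean

def word_count(sentence):
    # Count only tokens containing letters, ignoring bare punctuation.
    return sum(1 for tok in sentence.split() if any(c.isalpha() for c in tok))

def success_rate(generations):
    # Fraction of outputs that hit the requested word count exactly.
    hits = [1 if word_count(s) == n else 0 for n, s in generations]
    return mean(hits)

def spearman(xs, ys):
    # Rank correlation (no tie handling, for brevity).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

gens = [(5, "Il gatto dorme sul divano ora"),
        (7, "Oggi il cielo sopra Pisa resta molto chiaro")]
targets = [n for n, _ in gens]
produced = [word_count(s) for _, s in gens]
rho = spearman(targets, produced)
```

Under this scheme a model with a monotonic tendency obtains a rank correlation near 1 even when no individual sentence hits its exact target, which is precisely the behaviour the Spearman metric is meant to capture.</p>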
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. OuLiBench</title>
      <p>To address the need for more granular evaluation tools
for the Italian language, we developed OuLiBench. This
novel benchmark is specifically designed to thoroughly
analyze the capability of LLMs to generate text while
adhering to a diverse and progressively complex set of
explicit linguistic constraints, thereby moving beyond
assessments based on mere surface-level fluency.</p>
      <sec id="sec-2-2">
        <title>3.1. Conceptual Framework and Task Taxonomy</title>
        <sec id="sec-2-2-1">
          <p>The conceptual foundation of OuLiBench integrates
principles from Controllable Text Generation (CTG) [10],
which focuses on guided generation according to
predefined attributes, with the creative, constraint-based
methodologies of the OuLiPo (Ouvroir de Littérature
Potentielle) literary group. Founded in 1960 by writer
Raymond Queneau and mathematician François Le
Lionnais, OuLiPo emerged as a revolutionary literary
movement that sought to explore the potential of literature
through the systematic application of formal constraints.
In their Premier Manifeste (First Manifesto) [14] of 1961,
Le Lionnais articulated the group's foundational
philosophy, defining littérature potentielle as "the search
for new structures and patterns that can be used" to
create literary works. The group used
the restrictions of literary forms to spark creativity,
developing techniques such as lipograms (texts excluding
specific letters), tautograms, anagrams and palindromes.
This approach demonstrated that systematic limitations
could paradoxically expand rather than restrict creative
possibilities, generating what the group termed
"potential literature". OuLiBench adapts these philosophies into
a suite of computationally evaluable tasks, entirely
formulated and contextualized for the Italian language.</p>
          <p>OuLiBench is organized according to a taxonomy that
reflects different levels and types of linguistic control:
1. Quantitative Constraints: This category assesses the
precision of dimensional control over the textual output.
Tasks require models to generate sentences adhering to
an exact word count or an exact character count (net of
punctuation and spaces). These constraints challenge
models to balance numerical restrictions with semantic
coherence and grammatical correctness.
2. Syntactic Constraints: These tasks evaluate the models'
competence in manipulating fundamental Italian
grammatical structures. They include verbal diathesis control
(requiring generation in active, passive, or
reflexive/medium voice) and constituent order permutations
(Subject-Verb-Object), testing flexibility in generating
canonical and non-canonical sentence structures.
3. Stylistic-Formal (OuLiPo-inspired) Constraints:
Representing the most elaborate challenges, this category
implements OuLiPian contraintes. It includes tasks such
as the Lipogram (omission of specific letters), Inverse
Lipogram (mandatory inclusion of specific letters),
Tautogram (all words starting with the same letter),
Anagram (at both word and phrasal levels), Palindrome
(symmetrical text), and Acrostic (initial letters of words
forming a target word). These tasks demand advanced
linguistic planning and sophisticated sub-lexical
and structural manipulation.</p>
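          <p>The OuLiPo-style constraints above admit simple rule-based checks. The following Python sketch (our illustration; the function names and the accent-stripping normalization are our own assumptions, not the benchmark's code) shows how each could be verified automatically:

```python
# Sketch of rule-based validators for the stylistic-formal constraints:
# lipogram, inverse lipogram, tautogram, palindrome, and acrostic.
import unicodedata

def normalize(text):
    # Lowercase, strip accents ("è" matches "e"), keep letters and spaces.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if c.isalpha() or c.isspace())

def is_lipogram(text, letter):
    # The forbidden letter must never appear.
    return letter not in normalize(text).replace(" ", "")

def is_inverse_lipogram(text, letter):
    # The required letter must appear at least once in every word.
    return all(letter in w for w in normalize(text).split())

def is_tautogram(text):
    # All words share the same initial letter.
    initials = [w[0] for w in normalize(text).split()]
    return len(set(initials)) == 1

def is_palindrome(text):
    # Letters read the same forwards and backwards, ignoring spacing.
    s = normalize(text).replace(" ", "")
    return s == s[::-1]

def is_acrostic(text, target):
    # Word initials spell the target word.
    initials = "".join(w[0] for w in normalize(text).split())
    return initials == normalize(target)
```

Applied to the example sentences in Table 1, these checks accept "Aceto nell'enoteca" as a palindrome and "Maria mangia mele morbide" as a tautogram, while still leaving grammaticality and coherence to the qualitative evaluation.</p>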
        </sec>
        <sec id="sec-2-2-2">
          <p>For each task, specific prompts were formulated in Italian. Table 1 provides a comprehensive overview of the tasks included in OuLiBench, as well as the prompts used for generating the sentences.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Setting</title>
      <sec id="sec-3-1">
        <p>We evaluate a pool of Italian LLMs by testing their
ability to follow the linguistic constraints defined in
OuLiBench. We conduct our experiments in both
zero-shot and few-shot settings. In the zero-shot condition,
the model receives only the instruction formulated in
natural language. In the few-shot configuration, the
prompt is augmented with five, ten, and fifteen exemplar
sentences corresponding to the same constraint. This
setup is intended to investigate whether LLMs improve
in constraint-following behaviour when exposed to
in-context demonstrations. In the following, we describe the
set of tested models and the evaluation strategy adopted
to assess the extent to which generated outputs satisfy
the defined constraints.</p>
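        <p>As an illustration of this setup, the sketch below assembles zero- and few-shot prompts for the word-count task; the instruction text follows the example prompt reported above, while the exemplar sentences merely stand in for those drawn from the Italian UD treebanks (the helper name and formatting are our own assumptions):

```python
# Illustrative zero-/few-shot prompt assembly for the word-count task.
INSTRUCTION = ("Genera 50 frasi composte esattamente da {n} parole ciascuna, "
               "escludi dal conto la punteggiatura e gli spazi.")

def build_prompt(n_words, exemplars=()):
    # exemplars: 0, 5, 10 or 15 sentences satisfying the same constraint.
    parts = [INSTRUCTION.format(n=n_words)]
    if exemplars:
        parts.append("Esempi:")
        parts.extend(f"- {s}" for s in exemplars)
    return "\n".join(parts)

zero_shot = build_prompt(5)
few_shot = build_prompt(5, ["Il gatto dorme sul divano.",
                            "Oggi piove molto a Pisa."])
```

The same instruction is kept fixed across settings, so that any performance difference can be attributed to the in-context demonstrations rather than to a change of wording.</p>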
        <sec id="sec-3-1-1">
          <title>4.1. Models</title>
          <p>The landscape of Italian large language models (LLMs)
is evolving rapidly, with notable differences in
development strategies. Some models have been pre-trained
from scratch with intrinsic emphasis on the Italian
language, while others have been fine-tuned for Italian
starting from well-established architectures. For this
study, we selected models with comparable parameter
scales: Minerva-7B-instruct-v1.0 (SapienzaNLP) [15],
Velvet-14B (Almawave) [16], Maestrale-chat-v0.4-beta
(mii-llm) [17], and LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
(SWAP-UNIBA) [18]. The first group includes the models
pre-trained from scratch.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>Overview of the OuLiBench tasks, the prompts used for generation, and example target sentences.</p></caption>
            <table>
              <thead>
                <tr><th>Category</th><th>Task</th><th>Prompt</th><th>Example Target Sentence (Italian)</th></tr>
              </thead>
              <tbody>
                <tr><td>Quantitative</td><td>Length by Words</td><td>Generate Italian sentences with an exact word count.</td><td>"Il gatto dorme sul divano." (5 words)</td></tr>
                <tr><td>Quantitative</td><td>Length by Characters</td><td>Generate Italian sentences with an exact character count (no punct/space).</td><td>"Mangio la pizza" (13 chars)</td></tr>
                <tr><td>Syntactic</td><td>Diathesis Control</td><td>Generate Italian sentences in a specified voice (active, passive, reflexive).</td><td>"La lettera è scritta da Marco." (passive)</td></tr>
                <tr><td>Syntactic</td><td>Word Order Permutations</td><td>Generate Italian sentences using specific SVO permutations (SOV, VSO, etc.).</td><td>"Mangia la mela Luca" (VOS)</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Lipogram</td><td>Generate Italian text excluding a specific letter.</td><td>"Oggi vado in montagna" (without 'e')</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Inverse Lipogram</td><td>Generate Italian sentences where a specific letter appears at least once in each word.</td><td>"Questo esercizio contiene molte esse." ('e')</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Tautogram</td><td>Generate Italian text where all words start with the same letter.</td><td>"Maria mangia mele morbide" ('m')</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Word Anagram</td><td>Generate a valid Italian anagram for a given Italian word.</td><td>"Noce" → "Ceno"</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Phrasal Anagram</td><td>Reorder sentence letters into a new meaningful Italian sentence.</td><td>"Amo Roma" → "Moro ama"</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Palindrome</td><td>Generate Italian text reading the same forwards and backwards.</td><td>"Aceto nell'enoteca"</td></tr>
                <tr><td>Stylistic-Formal (OuLiPo-inspired)</td><td>Acrostic</td><td>Generate Italian text where initial word letters form a target word.</td><td>"Viva V.E.R.D.I."</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Minerva-7B-instruct-v1.0 is a 7-billion-parameter
Transformer pre-trained on 2.5 trillion tokens, balancing
Italian, English, and code, and later refined through
supervised fine-tuning (SFT) and direct preference
optimization (DPO). Velvet-14B is a dense 14-billion-parameter
Transformer trained from scratch on the Leonardo HPC
system using 4 trillion multilingual tokens, approximately
23% of which are in Italian, achieving competitive scores
on Italian-language benchmarks. These models integrate
Italian language knowledge from the earliest stages of
training. The second group is based on existing
architectures. LLaMAntino-3-ANITA-8B-Inst-DPO-ITA is
derived from Meta-LLaMA-3-8B-Instruct and specializes in
Italian through super-fine-tuning (QLoRA SFT) on mixed
datasets and DPO optimization. Maestrale-chat-v0.4-beta,
based on Mistral-7B, underwent continued pre-training
on an Italian corpus and "Occiglot," followed by
conversational SFT and DPO alignment aimed at improving
factuality and mathematical reasoning. Although these
models build upon pre-trained foundations, they have
invested significantly in adapting and optimizing for the
specific characteristics of the Italian language. To achieve
a comprehensive and diversified evaluation of LLM
capabilities across the tasks proposed by the benchmark, it
was essential to extend the comparison to include larger
proprietary models that currently represent the state of
the art in the field. This strategic choice enabled
assessment of the selected Italian open-source models in
relation to the highest standards achieved by global research
and development. Specifically, the comparison included
Claude Sonnet 4 [19], DeepSeek [<xref ref-type="bibr" rid="ref22">20</xref>], Gemini 2.5 Flash,
and GPT-4o mini [2].</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.2. Prompting Optimization</title>
          <p>The effectiveness of text generation using advanced
Language Models is critically dependent on the calibration
and formulation of prompts. Our research has
systematically analyzed the interaction between prompt structure
and output quality for each model, defining optimized
strategies to maximize compliance with experimental
requirements. Generally, precision in criteria definition
was found to be critical: for text length control, making
explicit the exclusion of non-linguistic elements (such as
punctuation and spaces) significantly improved the
precision of some models (Maestrale and Anita). Similarly,
for the handling of verbal diathesis, in particular
middle (or reflexive) diathesis, explicit formulations reduced
interpretive ambiguities, increasing the adherence of
outputs. In the context of OuLiPo constraints, whenever
possible we avoided specific terminology in the prompt
(Lipograms, Inverse Lipograms, Tautograms, and
Palindromes), describing the task directly and using quotation
marks to highlight restricted letters.</p>
          <p>A crucial aspect of our methodology was the
implementation of few-shot learning, exploring its
configurations with 0, 5, 10 and 15 examples. The tasks that
employed few-shot were: quantitative constraints, diathesis,
Lipograms, Palindromes. The examples were collected
from the Italian Universal Dependencies dataset, a corpus
consisting of 34,383 sentences derived from the main
Italian treebanks included in the Universal Dependencies
project, including ISDT [21], VIT [22], ParTUT [23],
PoSTWITA [24] and TWITTIRÒ [25].</p>
          <p>During few-shot experimentation, it emerged that the
Minerva and Velvet models tended to slavishly
reproduce the examples provided in the prompt, generating
outputs identical or nearly identical to the initial
examples, regardless of the variation required by the task. This
behavior compromised the evaluability of the outputs, as
it did not allow verification of the model's ability to
generalize or adapt to the specific constraint. Consequently,
these models were excluded from the tables related to
few-shot configurations.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-0">
        <title>4.3. Evaluation Strategy</title>
        <p>The assessment of model performance within OuLiBench
employs an integrated approach, combining quantitative
metrics for formal adherence with qualitative analyses for
the more nuanced aspects of generation. The primary
quantitative metrics are:
• Success Rate (SR): calculated as the percentage of
generated outputs that perfectly satisfy the linguistic
constraint imposed by the specific task. This metric
provides a direct measure of the model's precision.
• Spearman's Rank Correlation Coefficient (ρ): used to
determine the models' sensitivity to incremental or
decremental variations in constraints (e.g., whether models
produce longer sentences when requested to increase
word count), even when exact adherence is not achieved.
This metric was only computed for the evaluation of the
quantitative constraints.</p>
        <p>To apply these metrics, particularly for SR on
constraints involving specific lexical or syntactic features,
model outputs were pre-processed and analyzed, partly
with the support of linguistic analysis tools. In particular,
we employed Profiling-UD [26], a tool that allows
the extraction of more than 130 properties representative
of the linguistic structure underlying a sentence and
derived from raw, morpho-syntactic and syntactic levels
of annotation based on the UD formalism. Profiling-UD
was specifically applied to the sentences generated by
the tested models to extract linguistic features used to
evaluate model performance (e.g. sentence length, in
terms of tokens or characters, diathesis control, etc.).</p>
        <p>The qualitative analysis was carried out manually on
the responses that had passed the automatic evaluation,
meaning those that met the formal constraints required
by the task. The aim was to examine more closely the
linguistic quality of the sentences produced, considering
three main aspects: grammatical correctness, semantic
coherence, and linguistic appropriateness. These criteria
were not applied according to a strict hierarchy, although
semantic coherence often played a central role, as it is
crucial for the comprehensibility and meaning of the
sentence. In the presence of particularly strong constraints,
such as in the case of tautograms or anagrams, the
evaluation was conducted with greater flexibility. The rigidity
of the structure required by these constraints can
compromise the naturalness of the sentences, making it
necessary to allow some tolerance in assessing the other
qualitative aspects.</p>
      </sec>
      <p>The results obtained from the application of the
OuLiBench benchmark highlight substantial differences
among the tested models, both in terms of absolute
capabilities and sensitivity to various types of linguistic
constraints. The analysis was conducted considering both
quantitative metrics (Success Rate and Spearman's rank
correlation) and qualitative evaluations of semantic
coherence and grammatical correctness.</p>
      <sec id="sec-4-1a">
        <title>5.1. Overall Performance</title>
        <p>Table 2 reports the results obtained by the Italian
open-source models, which highlight a significant variability
in models' linguistic control capabilities.
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA (Anita) stands out as the
best-performing Italian model, achieving an average SR of
53% in the zero-shot setting, clearly outperforming the
others. Velvet-14B reaches an average of 29%, while
Maestrale-chat-v0.4-beta and Minerva-7B-instruct-v1.0
show more limited performance, with 19% and 12%
respectively.</p>
        <p>To better contextualize these results, Table 3 reports
the performance of larger proprietary models, which can
be considered as an upper bound relative to the Italian
ones. Within this group, Gemini 2.5 Flash achieves
the highest performance with an overall average of 70%,
followed by GPT-4o mini (66%) and DeepSeek R1 (65%).
Claude Sonnet 4, while competitive across several tasks,
records an overall average of 61.5%.</p>
      </sec>
      <sec id="sec-4-1b">
        <title>5.2. Analysis by Constraint Categories</title>
        <sec id="sec-4-1c">
          <title>5.2.1. Quantitative Constraints</title>
          <p>Length control tasks proved to be the most challenging
for all tested models. In word-count control, Gemini
performed best (34%), followed by DeepSeek (30%)
and GPT-4o mini (17%), while Claude obtained the worst
performance (9%). Among open-source models, Anita
achieved 27% in zero-shot, significantly outperforming
Maestrale (9%), Velvet (5%), and Minerva (3%). Spearman
correlations were consistently high for proprietary
models (94%–100%), thus indicating strong ordinal
sensitivity despite difficulties in precise control.</p>
          <p>Character-count control was even more demanding:
Gemini led (14%), trailed by GPT-4o mini (13%) and
DeepSeek (5%), while Claude struggled severely (0.03%).
Anita remained competitive (15%) among open-source
models.</p>
        </sec>
      </sec>
      <p>[Table 2 and Table 3: per-task Success Rates (Word
Length, Char. Length, Diathesis, Permutations, Lipograms,
Inverse Lipograms, Word Anagrams, Sentence Anagrams,
Tautograms, Palindromes, Acrostics, and Model Avg) for
the open-source and proprietary models.]</p>
      <sec id="sec-4-1">
        <title>5.2.3. Stylistic-Formal Constraints</title>
        <p>This category showed the widest performance gaps. For
lipograms, GPT-4o mini achieved the best results (89%),
ahead of DeepSeek (79%), Claude (77%), and Gemini (73%).
Anita remained competitive (59%), while other
open-source models obtained significantly lower scores:
Velvet (47%), Maestrale (32%), and Minerva (28%).</p>
        <p>Tautograms revealed polarizing results: Claude led
(99%), followed by GPT-4o mini (98%), Gemini (94%),
and DeepSeek (91%). Among open-source models, Anita
(73%) vastly outperformed Maestrale (55%), with Velvet
(8%) and Minerva (0.07%) failing critically.</p>
        <p>Word anagrams exhibited extreme variability: GPT-4o
mini scored perfectly (100%), while Anita surprised
with 92%, surpassing DeepSeek (76%), Gemini (58%), and
Claude (54%). Other open-source models failed
completely: Maestrale (18%), Velvet (10%), and Minerva (0%).</p>
        <p>Palindromes were universally the hardest task. Claude
led (74%), with GPT-4o mini (26%), Gemini (20%), and
DeepSeek (18%) far behind. Anita achieved 54% in
zero-shot, while all other open-source models scored zero.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Diathesis control revealed in general a clear advan</title>
        <p>tage for proprietary models: DeepSeek and Anita 5.3. Efects of Few-Shot Learning
achieved near-perfect scores (100% and 99%,
respectively), followed by GPT-4o mini (97%) and Gemini (99%). The few-shot learning analysis reveals non-uniform
patClaude trailed slightly (89%), while Italian open-source terns across models and tasks. Anita shows a general
models—Minerva (72%), Velvet (67%), and Maestrale degradation of performance with an increase in examples
(59%)—struggled more. (from 53% in zero-shot to 26-40% in few-shot
configura</p>
        <p>Constituent order permutations highlighted a stark di- tions), particularly evident in quantitative tasks where
vide: GPT-4o mini excelled (99%), with DeepSeek (100%), word control decreases from 27% to 13% with 15 examples,
Gemini (95%), and Claude (86%) close behind. Open- and voice control degrades from 99% to 90%. This trend
source models performed uniformly worse: Anita and suggests possible contextual overfitting phenomena.
Velvet (both 16%), Maestrale (12%), and Minerva (8%), Maestrale, on the other hand, exhibits a pattern of
suggesting architectural limitations in complex syntactic gradual improvement (from 19% in zero-shot to 35% with
manipulation. 15 examples), with clear benefits in quantitative tasks:
character control improves from 0.006 to 0.03, and voice
control reaches perfection (1.0) with 5 and 15 examples. exhibit a robust implicit grasp of linguistic structure, they
A slight improvement from 0.32 to 0.37 with 5 examples struggle with fine-grained numerical control, a
limitais also observed in lipograms, indicating more robust tion likely rooted in the statistical nature of transformer
in-context learning capabilities. architectures.</p>
        <p>It is noteworthy that Minerva and Velvet systematically
tend to reproduce the few-shot examples almost verbatim,
particularly in quantitative tasks and lipograms.
This behavior made their outputs effectively unassessable
in few-shot settings. A plausible explanation is that the
high complexity of the tasks, combined with the explicit
presence of in-context examples, may lead these models
to default to copying strategies rather than genuine
generalization. This tendency ultimately compromises
output quality and originality, suggesting limitations in
their ability to adapt constraints creatively beyond
provided exemplars.</p>
        <p>Comparing open-source and closed-source models, the
latter generally outperform the former, particularly in
tasks involving stylistic-formal constraints. However,
this advantage is not consistent across all task types.
Notably, even closed-source models, despite their overall
superiority, struggle with specific tasks such as
palindromes, which require strict character-level control.
Similarly, tasks involving quantitative constraints pose
significant challenges for both model categories, as they
demand precise control over features like length or
repetition, capabilities that are difficult to enforce within
transformer-based architectures relying on statistical
patterns rather than explicit rule-based mechanisms. These
limitations further corroborate the value of OuLiBench
as a benchmark for evaluating LLMs' ability to generate
text while adhering to complex and diverse constraints.</p>
      </sec>
      <sec id="sec-6">
        <title>6. Discussion</title>
        <p>The OuLiBench results provide valuable insights into the
linguistic competence of Large Language Models (LLMs),
particularly in their ability to generate text under various
formal constraints. One of the most striking findings is
the performance gap between tasks involving quantitative
constraints and those requiring more structural or
stylistic control. This disparity suggests that while LLMs
exhibit a robust implicit grasp of linguistic structure, they
struggle with fine-grained numerical control, a limitation
likely rooted in the statistical nature of transformer
architectures.</p>
        <p>Finally, models from both categories perform well on
syntactic constraints, suggesting that such structural
aspects are relatively well captured by current architectures.
Focusing instead on smaller open-source models, we
noticed that their linguistic production frequently
suffered, primarily in stylistic-formal tasks, from an
inability to generate truly well-structured sentences in
Italian, often producing ungrammatical or semantically
incoherent outputs. This degradation of linguistic quality
under complex constraints highlights the trade-off between
adherence to the constraint and maintenance of basic
linguistic competence. A particularly notable pattern
emerged in the palindrome tasks: smaller models frequently
abandoned Italian and began generating sentences in English.</p>
        <p>This involuntary code-switching suggests a tendency
to revert to the predominant language in the training
data when the task deviates from standard generation
patterns.</p>
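        <p>The character-level strictness that makes palindromes hard for these models is easy to state as a verifier: after normalization, the letter sequence must equal its own reverse. A minimal sketch follows; the normalization choices (case folding, accent stripping, letters only) are our assumptions:</p>

```python
import unicodedata

def is_palindrome(text):
    """Palindrome check at the character level: lowercase, strip
    accents (NFD decomposition separates combining marks), keep
    only letters, then compare the sequence with its reverse."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    letters = [c for c in decomposed
               if c.isalpha() and not unicodedata.combining(c)]
    return letters == letters[::-1]

print(is_palindrome("I topi non avevano nipoti"))  # True: a classic Italian palindrome
print(is_palindrome("Il sole splende"))            # False
```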
        <p>From a more qualitative point of view, the generated
outputs of the models reveal systematic behavioral
patterns, particularly evident in smaller models but also
observable in larger ones. A recurring phenomenon is the
tendency for thematic and lexical repetition with
superficial word order variations across most tasks, suggesting
limitations in creative diversification under constraints.</p>
        <p>In the specific case of anagrammatic tasks, Anita and
Velvet showed a simplified resolution strategy, limiting
themselves to swapping word order within phrases rather
than performing true letter-level permutations (as shown
in the examples below). This behavior indicates a
superficial understanding of the anagrammatic constraint and
the adoption of simplified heuristics.</p>
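        <p>The difference between a genuine anagram and this word-reordering shortcut can be made precise with a character-level check. A minimal sketch, with helper names that are ours rather than part of the OuLiBench evaluation code:</p>

```python
from collections import Counter

def is_letter_anagram(original, candidate):
    """True anagram: the two strings contain exactly the same
    multiset of letters, ignoring case, spaces, and punctuation."""
    def letters(s):
        return Counter(c for c in s.lower() if c.isalpha())
    return letters(original) == letters(candidate)

def is_word_permutation(original, candidate):
    """The simplified heuristic observed in Anita and Velvet:
    the candidate merely reorders whole words."""
    def words(s):
        return sorted(w.strip('.,;:!?"').lower() for w in s.split())
    return words(original) == words(candidate)

# Every word permutation is trivially a letter anagram, so outputs that
# pass the second check satisfy the constraint only in a degenerate sense.
print(is_word_permutation("Il sole splende.", "Splende il sole."))  # True
print(is_letter_anagram("Roma", "amor"))                            # True
```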
        <p>Examples from Anita:</p>
        <p>Original: "Tre gatti in casa fanno rumore
strepito"
Anagram: "Strepito in casa fanno gatti tre
rumore"
English: "Three cats in the house make
noise and uproar" → "Uproar in the house
make cats three noise"</p>
        <p>Original: "Tre per cento in banca stanno"
Anagram: "Stanno in banca trecento per"
English: "Three percent are in the bank"
→ "Are in the bank three hundred percent"</p>
        <p>Examples from Velvet:</p>
        <p>Original: "Il sole splende."
Anagram: "Splende il sole."
English: "The sun shines." → "Shines the sun."</p>
        <p>Original: "La luna brilla."
Anagram: "Brilla la luna."
English: "The moon shines." → "Shines the moon."</p>
        <p>Original: "Il gatto mangia."
Anagram: "Mangia il gatto."
English: "The cat eats." → "Eats the cat."</p>
        <p>In summary, these results highlight the difficulty
models have in reflecting on and producing language
according to meta-linguistic principles, a fundamental
feature of human linguistic creativity, and thus the
limitations of multi-objective planning mechanisms with
respect to controllability and performance in complex
linguistic tasks.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Works</title>
      <p>In this study, we presented OuLiBench, a novel
benchmark designed to rigorously assess the linguistic
capabilities of Large Language Models (LLMs) through the
generation of Italian texts governed by explicit formal
constraints. Drawing inspiration from the Oulipo
literary tradition, our benchmark diverges from
conventional evaluation methodologies that typically emphasize
task performance on downstream applications. Instead,
OuLiBench centers its evaluation on the model's
proficiency in adhering to a diverse array of linguistic
constraints, encompassing structural, quantitative, syntactic,
and stylistic dimensions. This shift of focus allows for a
more nuanced understanding of a model's fine-grained
control over language generation processes. Our
empirical evaluation involved both open-source and commercial
LLMs tested in zero-shot and few-shot scenarios. The
results revealed substantial variability in their ability to
meet the prescribed constraints. Quantitative constraints,
such as specific letter counts or palindromic structures,
posed significant difficulties across the board,
underscoring persistent limitations in current architectures for
handling sub-lexical control. Conversely, syntactic and
stylistic constraints were more successfully navigated by
larger models, suggesting that model scale and complexity
contribute positively to managing higher-level linguistic
features. Notably, Italian-focused LLMs, including Anita,
demonstrated competitive performance, highlighting the
benefits of dedicated linguistic resources and targeted
training on specific languages, which can partially offset
the advantages conferred by sheer model size. These
findings emphasize the persistent challenges in controllable
text generation, especially under intersecting and
mutually interacting constraints that demand simultaneous
fulfillment without compromising linguistic naturalness
and coherence. The results indicate a pressing need for
innovative generation frameworks capable of embedding
meta-linguistic reasoning and constraint-aware planning
mechanisms throughout the text production pipeline.</p>
      <p>Looking forward, OuLiBench lays the groundwork for
several promising directions in computational
linguistics and AI research. Extending the benchmark to other
languages would facilitate cross-linguistic investigations
into the controllability of multilingual LLMs, while the
integration of multimodal or pragmatic constraints could
broaden the scope of evaluation beyond purely textual
parameters. Additionally, developing refined qualitative
and creativity-focused metrics will be critical to
advancing our understanding of deep linguistic competence,
ultimately guiding the design of next-generation models
with enhanced flexibility, expressiveness, and adherence
to formal language structures. Ultimately, OuLiBench
not only enriches the evaluation toolkit for Italian NLP
but also serves as a conceptual bridge between
computational linguistics and literary formalism, pushing the
boundaries of what LLMs can achieve under constraint.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been supported by the FAIR - Future AI
Research (PE00000013) project under the NRRP MUR
program funded by NextGenerationEU. Partial support
was also received from the project "Understanding and
Enhancing Preference Alignment in Large Language Models
Through Controlled Text Generation" (IsCc8_ALIGNLLM),
funded by CINECA under the ISCRA initiative, for the
availability of HPC resources and support.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) for grammar
and spelling checking. After using this tool/service, the author(s) reviewed and edited the content
as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
</article>