<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Controllable Text Generation To Evaluate Linguistic Abilities of Italian LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristiano Ciaccio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “A. Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>State-of-the-art Large Language Models (LLMs) demonstrate exceptional proficiency across diverse tasks, yet systematic evaluations of their linguistic abilities remain limited. This paper addresses this gap by proposing a new evaluation framework leveraging the potentialities of Controllable Text Generation. Our approach evaluates the models' capacity to generate sentences that adhere to specific linguistic constraints and their ability to recognize the linguistic properties of their own generated sentences, also in terms of consistency with the specified constraints. We tested our approach on six Italian LLMs using various linguistic constraints.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Sentence Generation</kwd>
        <kwd>Controllable Text Generation</kwd>
        <kwd>Linguistic constraints</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction and Background</title>
      <p>Large-scale Language Models (LLMs) [<xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>] have exhibited extraordinary proficiency in a wide range of tasks, from text generation to complex problem-solving, by producing coherent and fluent texts [<xref ref-type="bibr" rid="ref4">4</xref>]. Their ability to understand context, generate human-like responses, and even engage in creative tasks underscores their potential in various applications. Such capabilities have been extensively evaluated against several benchmarks, as evidenced by the success of platforms such as the OpenLLM Leaderboard [<xref ref-type="bibr" rid="ref5">5</xref>] or the Italian LLM-Leaderboard [<xref ref-type="bibr" rid="ref6">6</xref>], specifically developed to evaluate Italian models. However, despite their impressive capabilities, the evaluation of LLMs’ linguistic abilities when generating sentences remains an understudied topic. In fact, while earlier works have demonstrated the implicit encoding of many linguistic phenomena within the representations of smaller models [7, 8, 9], or have prompted LLMs to assess their linguistic competence [10, 11, 12], there is no guarantee that generative LLMs can comply with such properties when generating texts.</p>
      <p>Studies on Controllable Text Generation (CTG) have indirectly assessed models’ capabilities by examining their adherence to linguistic constraints [13]. For instance, [14] studied the abilities of LLMs to adhere to lexical and morpho-syntactic constraints when generating personalized texts. Nevertheless, these works mainly focus on task-oriented scenarios (e.g. text simplification) and therefore do not provide systematic evaluations of the linguistic abilities of these models.</p>
      <p>From a complementary perspective, in recent years several works have proposed diverse approaches to assess the consistency of LLMs as an essential component of the models’ evaluation [15], where consistency can be defined as “the requirement that no two statements given by the system are contradictory” [16] or as “the invariance of its behaviour under meaning-preserving alternations in its input” [17]. Despite their differences, all these approaches aim to understand the reasoning processes that the models employ in various reasoning tasks [18, 19], while also measuring the predictability and coherence of the models’ generated responses under different conditioning inputs. Among these, [20] studied the consistency between generation (e.g. “what is 7+8?”) and validation (e.g. “7+8=15, True or False?”) of LLMs consid[…]transfer). [21], instead, employed several consistency checks to measure models’ faithfulness and to understand whether self-explanations truly reflect the model’s behaviour. Importantly, the training procedure of an LM does not explicitly target consistency [17], meaning this ability to produce non-contradictory statements eventually emerges as a byproduct of pre-training and fine-tuning. Therefore, studying models under such conditions serves as a valuable proxy for evaluating their capacity to handle different but complementary tasks, such as generation vs. validation.</p>
      <p>In this paper, we bring together the two perspectives and propose an evaluation approach to thoroughly test the linguistic abilities of several Italian LLMs. Specifically, by instructing a model to generate sentences that adhere to a set of targeted linguistic constraints (e.g. “Generate a sentence with 2 adjectives”) and then asking it to validate its own sentences (“How many adjectives does this sentence have: &lt;s&gt;?”), we seek to answer the following research questions: i) To what extent is an Italian LLM capable of generating sentences that adhere to specific linguistic constraints? ii) How consistent are LLMs’ responses to the validation questions w.r.t. the specified linguistic constraints? iii) How well can Italian LLMs recognize the linguistic features present in their own generated sentences?</p>
      <p>Contributions. Our main contributions are:</p>
      <p>• We propose a framework for evaluating the linguistic abilities of state-of-the-art Italian LLMs when generating text.</p>
      <p>• We assess models’ consistency with the requested constraints and their ability to validate their own generated content.</p>
      <p>• We conduct extensive evaluations across different models and linguistic constraints.</p>
      <sec id="sec-2a">
        <title>2. Approach</title>
        <p>For the purpose of this paper, we devised a two-step approach aimed at i) assessing LLMs’ ability to follow a set of linguistic constraints, and ii) validating their ability to recognize the presence of linguistic constraints in generated sentences.</p>
        <p>To achieve the first goal, we asked the models to generate sentences with targeted linguistic constraints corresponding to a set of morpho-syntactic and syntactic properties of a sentence, denoted as P = {p1, p2, ..., pn}. In particular, for each property, we prompted each LLM to produce a fixed number of sentences having a precise value vi, as drawn from a set of possible values V = {v1, v2, ..., vm}. For instance, a prompt asking the model to generate a sentence with two verbs will have the following structure:</p>
        <p>Genera una frase di senso compiuto che contenga 2 verbi. (trad. Generate a complete sentence containing 2 verbs.)</p>
        <p>Given the well-known difficulty of LLMs in producing texts with precise numerical constraints [13], we decided to constrain the models on increasing values of linguistic properties, to evaluate their ability also to generate sentences following incremental constraints. Our premise lies in the fact that while an LLM may struggle to precisely generate a sentence with an exact value of a particular linguistic property, it is likely to be sensitive to incremental values, i.e. it can generate a sentence characterized by either the absence or the frequent occurrence of a linguistic property.</p>
        <p>As a second step, we validate each model against its own samples:</p>
        <p>Quanti verbi ci sono nella seguente frase: &lt;s&gt;? (trad. How many verbs does this sentence have: &lt;s&gt;?)</p>
        <p>where &lt;s&gt; corresponds to the sentence that the same LLM generated in the previous step. This validation process was conducted by evaluating the models’ responses against the requested linguistic constraints’ values and the actual property values generated by the models. Here the goal is twofold: first, to measure the linguistic consistency of a model, that is whether the requested features in the generation step align with the ones found by the model in its own samples; secondly, to assess the models’ ability to correctly recognize the actual properties of their generated sentences.</p>
        <p>Due to some models struggling to produce reliable responses in a zero-shot scenario, we experimented with a few-shot scenario (see Appendix B.1 for details) to ensure more comparable results.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.1. Linguistic Constraints</title>
        <p>The linguistic properties P we employed as constraints in the generation process include raw, morpho-syntactic, and syntactic properties of a sentence. In particular, we tested the following ones: the length of the sentence in terms of tokens (n_tokens); a subset of Part-Of-Speech (POS) tags as defined by the Universal Dependencies (UD) project [22], i.e. noun (NOUN), verb (VERB), adjective (ADJ) and adverb (ADV); the number of subjects and objects in a sentence (subj and obj); and the number of subordinate clauses in a sentence (subord), again as defined by the UD framework. These properties have been shown to play a highly predictive role when leveraged by traditional learning models on various classification problems, and can also be effectively used to profile the knowledge encoded in the internal representations of a pre-trained Transformer-based model and to enhance their linguistic abilities [23, 24].</p>
        <p>(The set of property values is reported in Appendix B.2. Italia is available at https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1; see Appendix A for more information about the models.)</p>
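        <p>As a rough illustration (not the authors’ released code), the two-step generation/validation protocol can be sketched as follows; the ask_llm stub and the function names are our own placeholders, while the Italian prompt wording follows the examples given in the text:</p>

```python
# Sketch of the two-step protocol: (1) ask for sentences with a requested
# property value, (2) ask the same model to count the property in its own
# output. `ask_llm` is a placeholder for a real instruction-tuned Italian LLM.
GEN_TEMPLATE = "Genera una frase di senso compiuto che contenga {value} {property}."
VAL_TEMPLATE = "Quanti {property} ci sono nella seguente frase: {sentence}?"

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned Italian LLM."""
    return ""

def generation_step(prop_it: str, values: list[int], n_samples: int) -> dict[int, list[str]]:
    """Step 1: prompt the model for n_samples sentences per requested value."""
    return {v: [ask_llm(GEN_TEMPLATE.format(value=v, property=prop_it))
                for _ in range(n_samples)]
            for v in values}

def validation_prompt(prop_it: str, sentence: str) -> str:
    """Step 2: build the question asking the model to count the property."""
    return VAL_TEMPLATE.format(property=prop_it, sentence=sentence)

# The paper's example prompts for the VERB property ("verbi"):
print(GEN_TEMPLATE.format(value=2, property="verbi"))
print(validation_prompt("verbi", "<s>"))
```

        <p>In the paper, each of the five values per property is requested 100 times, yielding 500 sentences per property and model.</p>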
        <sec id="sec-2-1-2">
          <title/>
          <p>Constraints Selection. To ensure the selection of authentic property values, we relied on different sections of the Italian Universal Dependency Treebank (IUDT), version 2.5 [25], namely ParTUT [26], VIT [27], ISDT [28], PoSTWITA [29] and TWITTIRÒ [30]. To avoid dealing with excessively short or long sentences, possibly containing non-standard values, we filtered the treebanks to retain only sentences containing a minimum of 5 and a maximum of 40 tokens. The resulting dataset contains 26,744 sentences. Starting from this subset, we selected five increasing values for each linguistic property (see Appendix B.2). Specifically, we asked each model to generate 100 sentences for every value vi within the set of five values V, thus obtaining a total of 500 sentences per property.</p>
          <p>Moreover, since we performed our experiments in a few-shot scenario, we used 5 exemplar sentences for each linguistic property, extracted from IUDT.</p>
          <sec id="sec-2b">
            <title>2.2. Models</title>
            <p>We evaluated several Italian LLMs, with parameter counts ranging from 7 to 9 billion. We specifically leveraged the instruction-tuned variants of these models to assess their ability to adhere more closely to prompts containing detailed instructions. Importantly, we selected models that differ across several factors (architecture, the amount of pre-training and instruction-tuning data, the language adaptation strategy, etc.) in order to investigate how these characteristics impact performance. The models used in our experiments are: ANITA [31], Camoscio [32], Cerbero [33], DanteLLM [<xref ref-type="bibr" rid="ref6">6</xref>], Italia and LLaMAntino [34].</p>
            <p>Table 1: Details of the LLMs used in our experiments. The Pre-train column indicates if the model was pre-trained exclusively on Italian, the SFT/IT column shows whether the model underwent a supervised fine-tuning (SFT) or instruction-tuning (IT) phase for adaptation to the Italian language, and CPT (Continual Pre-training) indicates whether the model underwent a continual pre-training phase on the Italian language.</p>
          </sec>
          <sec id="sec-2c">
            <title>2.3. Evaluation</title>
            <p>Both steps of analysis were evaluated using two metrics. First, we computed the Success Rate (SR) for each model and linguistic property. Specifically, for the generation of sentences with linguistic constraints, we measured the SR as the fraction of times the model generated a sentence whose property value exactly matched the requested value. For the validation step, we computed the SR as the fraction of times the model’s response accurately matched i) the requested linguistic constraint (consistency) and ii) the property value of the generated sentence.</p>
            <p>As previously mentioned, given the difficulty LLMs have in following precise numerical constraints, we also relied on a metric that measures the models’ abilities to comply with increasing values rather than precise ones. For the evaluation of the generation step, we calculated the Spearman correlation coefficient (ρ) between the increasing property values we requested and those extracted from the generated sentences. This metric provides an overall picture of the models’ ability to follow constraints at a macro level, including increasing, decreasing, or removing a specific property when asked. For the validation step, the ρ correlation was computed between the responses produced by the model and i) the requested linguistic constraints, and ii) the property values of the generated sentences.</p>
            <p>Models’ generated sentences were linguistically annotated with Stanza [35] and further analyzed using Profiling-UD [36], a web-based application that captures multiple aspects of sentence structure. The tool extracts around 130 properties representative of the underlying linguistic structure of a sentence, derived from raw, morpho-syntactic, and syntactic levels of sentence annotation, all based on the Universal Dependencies (UD) formalism [37]. Thus, it allows computing the distribution of the set of constrained linguistic properties P and their values within generated sentences.</p>
          </sec>
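          <p>The two metrics can be made concrete with a small sketch (ours, not the paper’s implementation): Success Rate as an exact-match fraction between requested and observed property values, and Spearman’s ρ computed from average ranks:</p>

```python
# Minimal sketch of the two evaluation metrics: Success Rate (exact match
# between requested and observed property values) and Spearman's rho between
# requested values and values measured in the generated sentences.
# rho is implemented in pure Python (average ranks + Pearson on the ranks).

def success_rate(requested: list[int], observed: list[int]) -> float:
    """Fraction of generations whose measured value exactly matches the request."""
    return sum(r == o for r, o in zip(requested, observed)) / len(requested)

def _ranks(xs: list[float]) -> list[float]:
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy example: requested verb counts vs. counts measured in the outputs.
requested = [0, 1, 2, 3, 4]
observed = [0, 1, 1, 4, 5]
print(success_rate(requested, observed))          # 0.4 (two exact matches of five)
print(round(spearman(requested, observed), 2))    # 0.97
```

          <p>The contrast between the two numbers above mirrors the paper’s premise: a model may miss the exact value (low SR) while still tracking the increasing trend (high ρ).</p>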
          <sec id="sec-3res">
            <title>3. Results</title>
            <sec id="sec-3-1a">
              <title>3.1. Sentence Generation</title>
              <p>Table 2 reports the results in terms of Success Rate (SR) and Spearman correlation (ρ) obtained for each model and each linguistic property. When examining the average scores across all linguistic constraints (Avg column), we notice that the model rankings remain consistent across both evaluation metrics. Specifically, ANITA consistently outperforms the other models on average, while Italia (SR) and Camoscio (ρ) perform the worst. Interestingly, the scores do not correlate with the models’ parameter sizes; for example, the largest model, Italia, ranks poorly in terms of SR. However, a distinction is evident between architectures: models with more recent, higher-performing architectures like ANITA (based on LLaMA 3), DanteLLM, and Cerbero (both based on Mistral) tend to excel. Notably, ANITA stands out with its base model, LLaMA 3, being pre-trained on an impressive dataset of 15 trillion tokens and having already undergone an instruction-tuning and alignment phase using both Proximal Policy Optimization (PPO) [38] and Direct Preference Optimization (DPO) [39] in the English language. This suggests that the aforementioned strategy may enhance instruction-following abilities, since DanteLLM was also instruction-tuned on Italian starting from the English-instructed version of Mistral. On the contrary, Cerbero, which is based on the non-instruct version of Mistral, obtained lower performance compared to DanteLLM. Given the lack of insight into the models’ pre-training data and the importance of understanding this phenomenon, further study on the impact of instruction tuning before language adaptation is encouraged.</p>
              <p>Linguistic Properties. When we analyze which linguistic constraints the models followed the most, we observe notable differences between the two evaluation metrics, highlighting their complementarity and their ability to capture diverse aspects of the models’ constrained sentence generation capabilities. Specifically, the rankings of linguistic properties based on SR and Spearman correlation scores differ significantly. On average (Avg row), the top three linguistic characteristics with the highest SR are the use of subordination, subjects and objects (paired with adjectives). In contrast, the top three characteristics with the highest Spearman scores are the length of the generated sentences (n_tokens), the use of adjectives, and verbs. Interestingly, in terms of SR, on average the models struggle with generating sentences featuring a specific length in terms of the number of tokens. One possible explanation for this behaviour could be that, although sentence length can be considered a basic property, its wide range of variation makes it challenging for an LLM to generate sentences with an exact number of tokens compared to other properties. Conversely, n_tokens achieves the highest Spearman scores among all models, indicating that the models are still capable of following an increasing trend in token constraints.</p>
              <p>Figure 2 illustrates, for each model and each property, the SR scores obtained in the generation of sentences with a value vi, reported on the x-axis. This analysis enables us to identify linguistic control elements that models can adhere to more accurately, thereby indicating their proficiency in mastering specific property values within the spectrum of Italian language possibilities. Generally, models achieve lower scores for high property values, while scores tend to be higher when the property value is 0, indicating the absence of the given property. These contrasting trends suggest that models can differentiate between generating sentences with or without a specific property and face greater difficulty with higher property values, which may be less common in Italian. An interesting exception is the subj property, where SR scores increase as the property value rises from 0 to 1. This indicates that models are less accurate at generating sentences without a subject.</p>
              <p>(Figure 2: Success rate for each linguistic property and each model. Scores are reported for each group of feature values.)</p>
            </sec>
          </sec>
          <sec id="sec-3-2">
            <title>3.2. Sentence Validation</title>
            <p>As mentioned in Section 2, the validation step of our study is two-fold.</p>
            <p>Consistency. Table 3 presents the results of the validation of the consistency of the LLMs, evaluated against the requested linguistic constraints’ values. The results are reported for two sets of generated sentences: the entire set (Cons. in the table) and the subset including only the sentences generated by correctly following the constraints (Cons.+); note that for this subset, the number of sentences for each model and linguistic property varies, as detailed in Appendix C. A first observation concerns the fact that the scores, both in terms of SR and Spearman, are higher when we consider the Cons.+ set. This suggests that when the models generate sentences that precisely adhere to the requested values, they tend to answer the validation question more accurately, thus showing greater coherence with the requested constraints. However, we can notice some differences across LLMs, linguistic characteristics and evaluation metrics.</p>
            <p>By focusing on the ranking of the LLMs (Avg column), we find that ANITA is the most coherent model in terms of both SR and Spearman scores. This aligns with the results discussed in Section 3.1: the model that demonstrated the best controlled generation abilities is also the most capable of correctly answering the validation question and the most consistent with the requests. When we focus on the analysis of the linguistic constraints, we observe some differences between the two evaluation metrics considered. In terms of SR, both for Cons. and Cons.+, we notice that the constraints the models are better able to follow (see Table 2) are also those the models can better recognize in the generated sentences. Specifically, these are the three syntactic properties of the sentence we considered (subj, obj, subord). Two main exceptions are ANITA and Camoscio. ANITA, while being the best model in generating sentences with the exact number of requested tokens (n_tokens), is the least able to recognize the length of the generated sentences. On the contrary, for the same constraint, Camoscio, with only a 0.1 SR in sentence generation, is the model most capable of correctly answering the validation question.</p>
            <p>Such a direct relationship with the generation abilities is less observable for the evaluation in terms of Spearman correlation scores. Namely, the ranking of the Spearman scores in the Avg row in Table 3 does not align with the ranking in Table 2. For example, consider the subject constraint: while it is the constraint that models are, on average, least able to incrementally follow, it is the one with which they are most consistent in terms of the requested values.</p>
            <p>Recognizing linguistic properties. Table 4 reports the results of the second validation step. A general comparison between the Avg column here and the corresponding column in Table 2 reveals different trends, depending on the evaluation metric. This highlights that our approach effectively distinguishes the models’ varying abilities.</p>
            <p>Specifically, in terms of SR, most models, except ANITA, show a stronger ability to recognize the linguistic properties of their own generated sentences than to correctly generate sentences with the requested constraint. Conversely, when considering the Spearman evaluation, four out of the six models, i.e. ANITA, Camoscio, DanteLLM, and LLaMAntino, demonstrate greater proficiency in generating sentences following incremental constraints than in validating the linguistic properties of those sentences.</p>
            <p>A final remark concerns the ranking of the linguistic features (Avg row in the table). It generally aligns with the one discussed in Section 3.1 for both evaluation metrics. The main exception is the models’ ability to recognize the exact number of subjects in their own generated sentences. This linguistic characteristic is the best recognized on average across the models in terms of SR (0.44), which is notably higher compared to the average SR of the generation abilities (0.27).</p>
          </sec>
          <sec id="sec-ack">
            <title>Acknowledgments</title>
            <p>This work has been supported by: FAIR - Future AI Research (PE00000013) project under the NRRP MUR program funded by the NextGenerationEU; TEAMING-UP - Teaming up with Social Artificial Agents project under the PRIN grant no. 20177FX2A7 funded by the Italian Ministry of University and Research.</p>
          </sec>
        </sec>
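        <p>The Cons. / Cons.+ distinction can be illustrated with a minimal sketch (our own, with hypothetical record fields, not the authors’ code): the same validation Success Rate is computed once over all generations and once over the subset whose generation already satisfied the constraint:</p>

```python
# Sketch of the Cons. vs. Cons.+ comparison. Each record is an illustrative
# triple (requested value, value measured in the sentence, value the model
# answered in the validation step); the field layout is ours, not the paper's.

def consistency_sr(records: list[tuple[int, int, int]], only_correct: bool = False) -> float:
    """SR of validation answers w.r.t. the requested value.

    only_correct=True restricts the pool to the Cons.+ subset, i.e. sentences
    whose measured value already matched the request during generation."""
    pool = [r for r in records if not only_correct or r[0] == r[1]]
    return sum(req == ans for req, _, ans in pool) / len(pool)

# (requested, measured, answered) toy data
records = [(2, 2, 2), (2, 2, 2), (2, 1, 2), (2, 1, 1), (3, 3, 2)]
print(consistency_sr(records))                     # Cons.: 0.6
print(consistency_sr(records, only_correct=True))  # Cons.+: ~0.67
```

        <p>As in the paper’s Table 3, the toy Cons.+ score is higher than the Cons. score, since the subset excludes generations that already drifted from the request.</p>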
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion and Future Works</title>
      <sec id="sec-3-1">
        <p>In this paper, we presented the results of a new framework to extensively evaluate the linguistic abilities of Italian LLMs when generating sentences according to multiple linguistic constraints and, subsequently, when validating the linguistic properties of their own outputs.</p>
        <p>Results showed that models’ architectures and the dimensions of pre-training data have an impact on their ability to correctly follow the constraints, with ANITA being the best-performing model across all configurations. When validating each model against their own generated sentences, we noticed that i) LLMs tend to be more consistent with the requested constraints when they correctly followed them during the generation phase, and ii) the generation abilities do not always align with the ability of the models to recognize the linguistic properties of their generated sentences.</p>
        <p>Our findings also highlighted that the chosen evaluation metric can significantly affect the results, underscoring the complexity of evaluating LLMs and the necessity for further research in this direction.</p>
        <p>Considering that the evaluation of LLMs is an ongoing and multifaceted effort across all languages, we believe that this study opens the way for numerous further in-depth analyses focused on various aspects of evaluation. Among other aspects, we could evaluate the overall quality of the generated sentences, which we have not accounted for so far. Preliminary investigations revealed that the overall quality of the generations varies across Italian LLMs, with Italia appearing to be the most fluent. Thus, future research should also involve a more comprehensive evaluation that compares the linguistic abilities of LLMs with their fluency and grammaticality.</p>
        <p>References</p>
        <p>[6] … A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 4343–4355. URL: https://aclanthology.org/2024.lrec-main.388.</p>
        <p>[7] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651–3657. URL: https://aclanthology.org/P19-1356. doi:10.18653/v1/P19-1356.</p>
        <p>[8] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4593–4601. URL: https://aclanthology.org/P19-1452. doi:10.18653/v1/P19-1452.</p>
        <p>[9] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842–866. URL: https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.</p>
        <p>[10] J. Li, R. Cotterell, M. Sachan, Probing via prompting, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1144–1157. URL: https://aclanthology.org/2022.naacl-main.84. doi:10.18653/v1/2022.naacl-main.84.</p>
        <p>[11] T. Blevins, H. Gonen, L. Zettlemoyer, Prompting language models for linguistic structure, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 6649–6663. URL: https://aclanthology.org/2023.acl-long.367. doi:10.18653/v1/2023.acl-long.367.</p>
        <p>[12] M. Di Marco, K. Hämmerl, A. Fraser, A study on accessing linguistic information in pre-trained language models by using prompts, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 7328–7336. URL: https://aclanthology.org/2023.emnlp-main.454. doi:10.18653/v1/2023.emnlp-main.454.</p>
        <p>[13] J. Sun, Y. Tian, W. Zhou, N. Xu, Q. Hu, R. Gupta, J. Wieting, N. Peng, X. Ma, Evaluating large language models on controlled generation tasks, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3155–3168. URL: https://aclanthology.org/2023.emnlp-main.190. doi:10.18653/v1/2023.emnlp-main.190.</p>
        <p>[14] B. Alhafni, V. Kulkarni, D. Kumar, V. Raheja, Personalized text generation with fine-grained linguistic control, in: A. Deshpande, E. Hwang, V. Murahari, J. S. Park, D. Yang, A. Sabharwal, K. Narasimhan, A. Kalyan (Eds.), Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024), Association for Computational Linguistics, St. Julians, Malta, 2024, pp. 88–101. URL: https://aclanthology.org/2024.personalize-1.8.</p>
        <p>[15] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=1PL1NIMMrw.</p>
        <p>[16] A. Chen, J. Phang, A. Parrish, V. Padmakumar, C. Zhao, S. R. Bowman, K. Cho, Two failures of self-consistency in the multi-step reasoning of LLMs, Transactions on Machine Learning Research (2024).</p>
        <p>[17] Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze, Y. Goldberg, Measuring and improving consistency in pretrained language models, Transactions of the Association for Computational Linguistics 9 (2021) 1012–1031. URL: https://aclanthology.org/2021.tacl-1.60. doi:10.1162/tacl_a_00410.</p>
        <p>[18] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al., Language models (mostly) know what they know, arXiv preprint arXiv:2207.05221 (2022).</p>
        <p>[19] L. Parcalabescu, A. Frank, On measuring faithfulness of natural language explanations, arXiv preprint arXiv:2311.07466 (2023).</p>
        <p>[20] X. L. Li, V. Shrivastava, S. Li, T. Hashimoto, P. Liang, Benchmarking and improving generator-validator consistency of language models, in: The Twelfth International Conference on Learning Representations, 2023.</p>
        <p>[21] A. Madsen, S. Chandar, S. Reddy, Are self-explanations from large language models faithful?, ArXiv abs/2401.07927 (2024). URL: https://api.semanticscholar.org/CorpusID:266999774.</p>
        <p>[22] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational Linguistics 47 (2021) 255–308. URL: https://aclanthology.org/2021.cl-2.11. doi:10.1162/coli_a_00402.</p>
        <p>[23] A. Miaschi, D. Brunato, F. Dell’Orletta, G. Venturi, Linguistic profiling of a neural language model, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 745–756. URL: https://aclanthology.org/2020.coling-main.65. doi:10.18653/v1/2020.coling-main.65.</p>
        <p>[24] A. Miaschi, F. Dell’Orletta, G. Venturi, Linguistic knowledge can enhance encoder-decoder models (if you let it), in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 10539–10554. URL: https://aclanthology.org/2024.lrec-main.922.</p>
        <p>[25] D. Zeman, J. Nivre, M. Abrams, et al., Universal dependencies 2.5, in: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), 2019. URL: http://hdl.handle.net/11234/1-3105.</p>
        <p>[26] M. Sanguinetti, C. Bosco, PartTUT: The Turin University parallel treebank, in: R. B. et al. (Ed.), Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer, 2015, pp. 51–69. URL: https://link.springer.com/chapter/10.1007/978-3-319-14206-7_3.</p>
        <p>[27] R. Delmonte, A. Bristot, S. Tonelli, VIT - Venice …</p>
        <p>[31] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the italian language: Llamantino-3-anita, arXiv preprint arXiv:2405.07101 (2024).</p>
        <p>[32] A. Santilli, E. Rodolà, Camoscio: an italian instruction-tuned llama, in: Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR.org, 2023.</p>
        <p>[33] F. A. Galatolo, M. G. Cimino, Cerbero-7b: A leap forward in language-specific llms through enhanced chat corpus generation and evaluation, arXiv preprint arXiv:2311.15698 (2023).</p>
        <p>[34] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in italian language, arXiv preprint arXiv:2312.09993 (2023).</p>
        <p>[35] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A python natural language processing toolkit for many human languages, in: A. Celikyilmaz, T.-H. Wen (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 101–108. URL: https://aclanthology.org/2020.acl-demos.14. doi:10.18653/v1/2020.acl-demos.14.</p>
        <p>[36] D. Brunato, A. Cimino, F. Dell’Orletta, G. Venturi, S. Montemagni, Profiling-UD: a tool for linguistic profiling of texts, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources
AssociItalian Treebank: Syntactic and quantitative fea- ation, Marseille, France, 2020, pp. 7145–7151. URL:
tures, in: Proceedings of the Sixth International https://aclanthology.org/2020.lrec-1.883.
Workshop on Treebanks and Linguistic Theories, [37] M.-C. de Marnefe, C. D. Manning, J. Nivre, D.
Ze2007. man, Universal Dependencies, Computational
Lin[28] C. Bosco, S. Montemagni, M. Simi, Converting guistics 47 (2021) 255–308. URL: https://doi.org/10.
italian treebanks: Towards an italian stanford de- 1162/coli_a_00402. doi:10.1162/coli_a_00402.
pendency treebank, in: Proceedings of the ACL [38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford,
Linguistic Annotation Workshop &amp; Interoperabil- O. Klimov, Proximal policy optimization algorithms,
ity with Discourse, 2013. arXiv preprint arXiv:1707.06347 (2017).
[29] M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, [39] R. Rafailov, A. Sharma, E. Mitchell, C. D.
ManF. Tamburini, PoSTWITA-UD: an Italian Twit- ning, S. Ermon, C. Finn, Direct preference
ter Treebank in universal dependencies, in: Pro- optimization: Your language model is secretly
ceedings of the Eleventh Language Resources and a reward model, in: A. Oh, T. Naumann,
Evaluation Conference (LREC 2018), 2018. URL: A. Globerson, K. Saenko, M. Hardt, S. Levine
https://www.aclweb.org/anthology/L18-1279.pdf . (Eds.), Advances in Neural Information Processing
[30] A. T. Cignarella, C. Bosco, P. Rosso, Presenting Systems, volume 36, Curran Associates, Inc., 2023,
TWITTIRÒ-UD: An italian twitter treebank in uni- pp. 53728–53741. URL: https://proceedings.
versal dependencies, in: Proceedings of the Fifth neurips.cc/paper_files/paper/2023/file/
International Conference on Dependency Linguis- a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.
tics (Depling, SyntaxFest 2019), 2019. URL: https: pdf .
//www.aclweb.org/anthology/W19-7723.pdf . [40] T. Dettmers, A. Pagnoni, A. Holtzman, L.
Zettlemoyer, Qlora: Eficient finetuning of quantized
llms, in: A. Oh, T. Naumann, A. Globerson,
K. Saenko, M. Hardt, S. Levine (Eds.), Advances
in Neural Information Processing Systems,
volume 36, Curran Associates, Inc., 2023,
pp. 10088–10115. URL: https://proceedings.
neurips.cc/paper_files/paper/2023/file/
1feb87871436031bdc0f2beaa62a049b-Paper-Conference.</p>
        <p>pdf .
[41] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for</p>
        <p>Italian language understanding and generation, in:
N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti,
N. Xue (Eds.), Proceedings of the 2024 Joint
International Conference on Computational
Linguistics, Language Resources and Evaluation
(LRECCOLING 2024), ELRA and ICCL, Torino, Italia, 2024,
pp. 9422–9433. URL: https://aclanthology.org/2024.</p>
        <p>lrec-main.823.
[42] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,</p>
        <p>S. Wang, L. Wang, W. Chen, Lora: Low-rank
adaptation of large language models, arXiv preprint
arXiv:2106.09685 (2021).
[43] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,</p>
        <p>C. Guestrin, P. Liang, T. B. Hashimoto, Stanford
alpaca: An instruction-following llama model, https:
//github.com/tatsu-lab/stanford_alpaca, 2023.
[44] D. Croce, A. Zelenanska, R. Basili, Neural learning
for question answering in italian, in: C. Ghidini,
B. Magnini, A. Passerini, P. Traverso (Eds.), AI*IA
2018 – Advances in Artificial Intelligence, Springer</p>
        <p>International Publishing, Cham, 2018, pp. 389–402.
[45] P. Koehn, Europarl: A parallel corpus for statistical
machine translation, in: Proceedings of Machine
Translation Summit X: Papers, Phuket, Thailand,
2005, pp. 79–86. URL: https://aclanthology.org/2005.</p>
        <p>mtsummit-papers.11.
[46] C. Xu, D. Guo, N. Duan, J. McAuley, Baize: An
opensource chat model with parameter-eficient tuning
on self-chat data, arXiv preprint arXiv:2304.01196
(2023).
[47] A. Bacciu, G. Trappolini, A. Santilli, E. Rodolà, F.
Silvestri, Fauno: The italian large language model that
will leave you senza parole!, https://github.com/
andreabac3/Fauno-Italian-LLM, 2023.
[48] A. Holtzman, J. Buys, L. Du, M. Forbes, Y. Choi, The
curious case of neural text degeneration, in: 8th
International Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020, OpenReview.net, 2020. URL: https:
//openreview.net/forum?id=rygGQyrFvH.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Model and dataset links (footnotes 7–14)</title>
        <p>7 https://huggingface.co/swap-uniba/LLaMAntino-2-chat-7b-hf-UltraChat-ITA
8 https://huggingface.co/datasets/basilepp19/dolly-15k-it
9 https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
10 https://huggingface.co/datasets/Chat-Error/wizard_alpaca_dolly_orca
11 https://huggingface.co/sag-uniroma2/extremITA-Camoscio-7b
12 https://huggingface.co/rstless-research/DanteLLM-7B-Instruct-Italian-v0.1
13 https://huggingface.co/galatolo/cerbero-7b
14 https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1</p>
        <p>While in the validation step the model is prompted
to recognise the linguistic properties of its own sentence:
_ + Quante   ci sono nella seguente frase: ’’? Non fornire spiegazioni.
(transl. _ + How many   are there in the following sentence: ’’? Do not give an explanation.)</p>
      </sec>
      <sec id="sec-3-3">
        <title>For each model we used the author’s recommended chat template and the specified system prompt when available; otherwise we excluded it. All models are loaded</title>
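The per-model handling described above (system prompt included only when one is specified) can be sketched as follows; the `build_messages` helper and its fallback are our illustration of the setup, not the authors' code:

```python
def build_messages(user_prompt: str, system_prompt=None) -> list:
    """Build a chat-format message list: include a system turn only when
    the model card specifies a system prompt, otherwise omit it."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

A list in this shape is what, e.g., the Hugging Face `tokenizer.apply_chat_template` method consumes to render each model's recommended chat template.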
        <sec id="sec-3-3-1">
          <title>B.2. Constraint values</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>C. Further details about the results</title>
      <p>15https://huggingface.co/docs/transformers/index.</p>
      <p>A sample of the generated sentences, together with i) the requested linguistic constraints (Fig. 4) and ii) the property values of the generated sentences (Fig. 3). [Flattened table columns lost to extraction: Constraint (ADJ, NOUN, subord), requested value (3, 5), Model (Italia, ANITA); only the Sentence column is reproduced below.]</p>
      <p>Sentence
Il suo spirito indomito ha sempre trovato il modo di superare le
sfide, senza mai arrendersi, nonostante le avversità (His indomitable
spirit always found ways to overcome challenges, never giving up despite
adversity.).</p>
      <p>Il suo atteggiamento gentile e la sua presenza rassicurante hanno
aiutato tutti a sentirsi a proprio agio e ben accolti (His kind attitude
and reassuring presence helped everyone feel comfortable and welcome.).
La sua passione per la cucina italiana autentica si manifesta nei suoi
piatti ricchi di sapori tradizionali, che richiamano la genuinità della
cucina casalinga. (His passion for authentic Italian cuisine is evident
in his dishes rich in traditional flavors that evoke the authenticity of
home cooking.)
L’uomo va in ufficio (The man goes to the office.).</p>
      <p>Le case vecchie richiedono lavori costosi di restauro (Old houses
require expensive restoration work.).</p>
      <p>Il grande ed antico castello si erge sulla cima di un piccolo e ripido
colle (The large and ancient castle stands on the top of a small and steep
hill.).</p>
      <p>Dopo un’attenta analisi, l’esperto ha concluso che la situazione era
di una gravità estrema. (After careful analysis, the expert concluded
that the situation was of extreme gravity.)
La verità è emersa, ma ha lasciato un segno che non potrà mai essere
cancellato. (The truth has emerged, but it has left a mark that can never
be erased.)
La missione umanitaria è stata ostacolata da un’insolita condizione
meteorologica che ha causato ritardi nei soccorsi. (The humanitarian
mission was hampered by unusual weather conditions that caused delays
in relief efforts.)
Il cane corre dietro il gatto. (The dog runs after the cat. )
Il bambino ha in braccio il gatto. (The child is holding the cat.)
I clienti visitano il ristorante con la famiglia e i bambini.
(Customers visit the restaurant with their families and children.)
La pioggia porta sollievo alle piante assetate. (Rain brings relief to
thirsty plants.)
Il suo viaggio attraverso le montagne è stato reso più agevole dalla
presenza di un amico che lo ha accompagnato lungo il percorso.
(His journey through the mountains was made easier by the presence of
a friend who accompanied him along the way.)
L’impegno di tutti è fondamentale per garantire il successo del
progetto. (Everyone’s commitment is essential to ensure the success of
the project.)
Ho visitato la città in cui nacque Manzoni. (I visited the city where
Manzoni was born.)
Il concerto inizia solo dopo le nove. (The concert does not start until
after nine o’clock. )
L’uomo che aveva visto il film che era uscito l’anno prima , era
rimasto deluso. (The man who had seen the film that came out the year
before was disappointed.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <source>GPT-4 technical report, arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          :
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. d. l. Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <source>Mistral 7b, arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Harnessing the power of llms in practice: A survey on chatgpt and beyond</article-title>
          ,
          <source>ACM Trans. Knowl. Discov. Data</source>
          <volume>18</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3649506. doi:10.1145/3649506.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Beeching</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habib</surname>
          </string-name>
          , S. Han,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sanseviero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          , T. Wolf, Open llm leaderboard, https: //huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>DanteLLM: Let's push Italian LLM research forward!</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
          </string-name>
          ,
          <comment>6 A sample of the generated sentences can be found in Appendix C.</comment>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>