<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Controllable Text Generation To Evaluate Linguistic Abilities of Italian LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristiano Ciaccio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “A. Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>State-of-the-art Large Language Models (LLMs) demonstrate exceptional proficiency across diverse tasks, yet systematic evaluations of their linguistic abilities remain limited. This paper addresses this gap by proposing a new evaluation framework leveraging the potentialities of Controllable Text Generation. Our approach evaluates the models' capacity to generate sentences that adhere to specific linguistic constraints and their ability to recognize the linguistic properties of their own generated sentences, also in terms of consistency with the specified constraints. We tested our approach on six Italian LLMs using various linguistic constraints.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Sentence Generation</kwd>
        <kwd>Controllable Text Generation</kwd>
        <kwd>Linguistic constraints</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction and Background</title>
      <p>Large-scale Language Models (LLMs) [<xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>] have exhibited extraordinary proficiency in a wide range of tasks, from text generation to complex problem-solving, by producing coherent and fluent texts [<xref ref-type="bibr" rid="ref4">4</xref>]. Their ability to understand context, generate human-like responses, and even engage in creative tasks underscores their potential in various applications. Such capabilities have been extensively evaluated against several benchmarks, as evidenced by the success of platforms such as the OpenLLM Leaderboard [<xref ref-type="bibr" rid="ref5">5</xref>] or the Italian LLM-Leaderboard [<xref ref-type="bibr" rid="ref6">6</xref>], specifically developed to evaluate Italian models. However, despite their impressive capabilities, the evaluation of LLMs’ linguistic abilities when generating sentences remains an understudied topic. In fact, while earlier works have demonstrated the implicit encoding of many linguistic phenomena within the representations of smaller models [7, 8, 9], or have prompted LLMs to assess their linguistic competence [10, 11, 12], there is no guarantee that generative LLMs can comply with such properties when generating texts.</p>
      <p>Studies on Controllable Text Generation (CTG) have indirectly assessed models’ capabilities by examining their adherence to linguistic constraints [13]. For instance, [14] studied the abilities of LLMs to adhere to lexical and morpho-syntactic constraints when generating personalized texts. Nevertheless, these works mainly focus on task-oriented scenarios (e.g. text simplification) and therefore do not provide systematic evaluations of the linguistic abilities of these models.</p>
      <p>From a complementary perspective, in recent years several works have proposed diverse approaches to assess the consistency of LLMs as an essential component of the models’ evaluation [15], where consistency can be defined as “the requirement that no two statements given by the system are contradictory” [16] or as “the invariance of its behaviour under meaning-preserving alternations in its input” [17]. Despite their differences, all these approaches aim to understand the reasoning processes that the models employ in various reasoning tasks [18, 19], while also measuring the predictability and coherence of the models’ generated responses under different conditioning inputs. Among these, [20] studied the consistency between generation (e.g. “what is 7+8?”) and validation (e.g. “7+8=15, True or False?”) of LLMs consid[…]transfer). [21], instead, employed several consistency checks to measure models’ faithfulness and to understand whether self-explanations truly reflect the model’s behaviour. Importantly, the training procedure of an LM does not explicitly target consistency [17], meaning this ability to produce non-contradictory statements eventually emerges as a byproduct of pre-training and fine-tuning. Therefore, studying models under such conditions serves as a valuable proxy for evaluating their capacity to handle different but complementary tasks, such as generation vs. validation.</p>
      <p>In this paper, we bring together the two perspectives and propose an evaluation approach to thoroughly test the linguistic abilities of several Italian LLMs. Specifically, by instructing a model to generate sentences that adhere to a set of targeted linguistic constraints (e.g. “Generate a sentence with 2 adjectives”) and then asking it to validate its own sentences (“How many adjectives does this sentence have: &lt;s&gt;?”), we seek to answer the following research questions: i) To what extent is an Italian LLM capable of generating sentences that adhere to specific linguistic constraints? ii) How consistent are LLMs’ responses to the validation questions w.r.t. the specified linguistic constraints? iii) How well can Italian LLMs recognize the linguistic features present in their own generated sentences?</p>
      <p>Contributions. Our main contributions are:</p>
      <p>• We propose a framework for evaluating the linguistic abilities of state-of-the-art Italian LLMs when generating text.</p>
      <p>• We assess models’ consistency with the requested constraints and their ability to validate their own generated content.</p>
      <p>• We conduct extensive evaluations across different models and linguistic constraints.</p>
      <sec id="sec-2a">
        <title>2. Approach</title>
        <p>For the purpose of this paper, we devised a two-step approach aimed at i) assessing LLMs’ ability to follow a set of linguistic constraints, and ii) validating their ability to recognize the presence of linguistic constraints in generated sentences.</p>
        <p>To achieve the first goal, we asked the models to generate sentences with targeted linguistic constraints corresponding to a set of morpho-syntactic and syntactic properties of a sentence, denoted as P = {p1, p2, ..., pn}. In particular, for each property, we prompted each LLM to produce a fixed number of sentences having a precise value vi, as drawn from a set of possible values V = {v1, v2, ..., vm}. For instance, a prompt asking the model to generate a sentence with two verbs will have the following structure:</p>
        <p>Genera una frase di senso compiuto che contenga 2 verbi. (trad. Generate a complete sentence containing 2 verbs.)</p>
        <p>Given the well-known difficulty of LLMs in producing texts with precise numerical constraints [13], we decided to constrain the models on increasing values of linguistic properties, to evaluate their ability also to generate sentences following incremental constraints. Our premise lies in the fact that while an LLM may struggle to precisely generate a sentence with an exact value of a particular linguistic property, it is likely to be sensitive to incremental values, i.e. it can generate a sentence characterized by either the absence or the frequent occurrence of a linguistic property.</p>
        <p>As a second step, we validate each model against its own samples:</p>
        <p>Quanti verbi ci sono nella seguente frase: &lt;s&gt;? (trad. How many verbs does this sentence have: &lt;s&gt;?)</p>
        <p>where &lt;s&gt; corresponds to the sentence that the same LLM generated in the previous step. This validation process was conducted by evaluating the models’ responses against the requested linguistic constraints’ values and the actual property values generated by the models. Here the goal is twofold: first, to measure the linguistic consistency of a model, that is whether the requested features in the generation step align with the ones found by the model in its own samples; secondly, to assess the models’ ability to correctly recognize the actual properties of their generated sentences.</p>
        <p>Due to some models struggling to produce reliable responses in a zero-shot scenario, we experimented with a few-shot scenario (see Appendix B.1 for details) to ensure more comparable results.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.1. Linguistic Constraints</title>
        <p>The linguistic properties P we employed as constraints in the generation process include raw, morpho-syntactic, and syntactic properties of a sentence. In particular, we tested the following ones: the length of the sentence in terms of tokens (n_tokens); a subset of Part-Of-Speech (POS) tags as defined by the Universal Dependencies (UD) project [22], i.e. noun (NOUN), verb (VERB), adjective (ADJ) and adverb (ADV); the number of subjects and objects in a sentence (subj and obj); and the number of subordinate clauses in a sentence (subord), again as defined by the UD framework. These properties have been shown to play a highly predictive role when leveraged by traditional learning models on various classification problems, and can also be effectively used to profile the knowledge encoded in the internal representations of a pre-trained Transformer-based model and to enhance their linguistic abilities [23, 24].</p>
        <p>(The set of property values is reported in Appendix B.2. Italia is available at https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1; see Appendix A for more information about the models.)</p>
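        <p>As a rough illustration (not the authors’ released code), the two-step generation/validation protocol can be sketched as follows; the ask_llm stub and the function names are our own placeholders, while the Italian prompt wording follows the examples given in the text:</p>

```python
# Sketch of the two-step protocol: (1) ask for sentences with a requested
# property value, (2) ask the same model to count the property in its own
# output. `ask_llm` is a placeholder for a real instruction-tuned Italian LLM.
GEN_TEMPLATE = "Genera una frase di senso compiuto che contenga {value} {property}."
VAL_TEMPLATE = "Quanti {property} ci sono nella seguente frase: {sentence}?"

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned Italian LLM."""
    return ""

def generation_step(prop_it: str, values: list[int], n_samples: int) -> dict[int, list[str]]:
    """Step 1: prompt the model for n_samples sentences per requested value."""
    return {v: [ask_llm(GEN_TEMPLATE.format(value=v, property=prop_it))
                for _ in range(n_samples)]
            for v in values}

def validation_prompt(prop_it: str, sentence: str) -> str:
    """Step 2: build the question asking the model to count the property."""
    return VAL_TEMPLATE.format(property=prop_it, sentence=sentence)

# The paper's example prompts for the VERB property ("verbi"):
print(GEN_TEMPLATE.format(value=2, property="verbi"))
print(validation_prompt("verbi", "<s>"))
```

        <p>In the paper, each of the five values per property is requested 100 times, yielding 500 sentences per property and model.</p>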
        <sec id="sec-2-1-2">
          <title/>
          <p>Constraints Selection. To ensure the selection of authentic property values, we relied on different sections of the Italian Universal Dependency Treebank (IUDT), version 2.5 [25], namely ParTUT [26], VIT [27], ISDT [28], PoSTWITA [29] and TWITTIRÒ [30]. To avoid dealing with excessively short or long sentences, possibly containing non-standard values, we filtered the treebanks to retain only sentences containing a minimum of 5 and a maximum of 40 tokens. The resulting dataset contains 26,744 sentences. Starting from this subset, we selected five increasing values for each linguistic property (see Appendix B.2). Specifically, we asked each model to generate 100 sentences for every value vi within the set of five values V, thus obtaining a total of 500 sentences per property.</p>
          <p>Moreover, since we performed our experiments in a few-shot scenario, we used 5 exemplar sentences for each linguistic property, extracted from IUDT.</p>
          <sec id="sec-2b">
            <title>2.2. Models</title>
            <p>We evaluated several Italian LLMs, with parameter counts ranging from 7 to 9 billion. We specifically leveraged the instruction-tuned variants of these models to assess their ability to adhere more closely to prompts containing detailed instructions. Importantly, we selected models that differ across several factors (architecture, the amount of pre-training and instruction-tuning data, the language adaptation strategy, etc.) in order to investigate how these characteristics impact performance. The models used in our experiments are: ANITA [31], Camoscio [32], Cerbero [33], DanteLLM [<xref ref-type="bibr" rid="ref6">6</xref>], Italia and LLaMAntino [34].</p>
            <p>Table 1: Details of the LLMs used in our experiments. The Pre-train column indicates if the model was pre-trained exclusively on Italian, the SFT/IT column shows whether the model underwent a supervised fine-tuning (SFT) or instruction-tuning (IT) phase for adaptation to the Italian language, and CPT (Continual Pre-training) indicates whether the model underwent a continual pre-training phase on the Italian language.</p>
          </sec>
          <sec id="sec-2c">
            <title>2.3. Evaluation</title>
            <p>Both steps of analysis were evaluated using two metrics. First, we computed the Success Rate (SR) for each model and linguistic property. Specifically, for the generation of sentences with linguistic constraints, we measured the SR as the fraction of times the model generated a sentence whose property value exactly matched the requested value. For the validation step, we computed the SR as the fraction of times the model’s response accurately matched i) the requested linguistic constraint (consistency) and ii) the property value of the generated sentence.</p>
            <p>As previously mentioned, given the difficulty LLMs have in following precise numerical constraints, we also relied on a metric that measures the models’ abilities to comply with increasing values rather than precise ones. For the evaluation of the generation step, we calculated the Spearman correlation coefficient (ρ) between the increasing property values we requested and those extracted from the generated sentences. This metric provides an overall picture of the models’ ability to follow constraints at a macro level, including increasing, decreasing, or removing a specific property when asked. For the validation step, the ρ correlation was computed between the responses produced by the model and i) the requested linguistic constraints, and ii) the property values of the generated sentences.</p>
            <p>Models’ generated sentences were linguistically annotated with Stanza [35] and further analyzed using Profiling-UD [36], a web-based application that captures multiple aspects of sentence structure. The tool extracts around 130 properties representative of the underlying linguistic structure of a sentence, derived from raw, morpho-syntactic, and syntactic levels of sentence annotation, all based on the Universal Dependencies (UD) formalism [37]. Thus, it allows computing the distribution of the set of constrained linguistic properties P and their values within generated sentences.</p>
          </sec>
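          <p>The two metrics can be made concrete with a small sketch (ours, not the paper’s implementation): Success Rate as an exact-match fraction between requested and observed property values, and Spearman’s ρ computed from average ranks:</p>

```python
# Minimal sketch of the two evaluation metrics: Success Rate (exact match
# between requested and observed property values) and Spearman's rho between
# requested values and values measured in the generated sentences.
# rho is implemented in pure Python (average ranks + Pearson on the ranks).

def success_rate(requested: list[int], observed: list[int]) -> float:
    """Fraction of generations whose measured value exactly matches the request."""
    return sum(r == o for r, o in zip(requested, observed)) / len(requested)

def _ranks(xs: list[float]) -> list[float]:
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy example: requested verb counts vs. counts measured in the outputs.
requested = [0, 1, 2, 3, 4]
observed = [0, 1, 1, 4, 5]
print(success_rate(requested, observed))          # 0.4 (two exact matches of five)
print(round(spearman(requested, observed), 2))    # 0.97
```

          <p>The contrast between the two numbers above mirrors the paper’s premise: a model may miss the exact value (low SR) while still tracking the increasing trend (high ρ).</p>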
          <sec id="sec-3res">
            <title>3. Results</title>
            <sec id="sec-3-1a">
              <title>3.1. Sentence Generation</title>
              <p>Table 2 reports the results in terms of Success Rate (SR) and Spearman correlation (ρ) obtained for each model and each linguistic property. When examining the average scores across all linguistic constraints (Avg column), we notice that the model rankings remain consistent across both evaluation metrics. Specifically, ANITA consistently outperforms the other models on average, while Italia (SR) and Camoscio (ρ) perform the worst. Interestingly, the scores do not correlate with the models’ parameter sizes; for example, the largest model, Italia, ranks poorly in terms of SR. However, a distinction is evident between architectures: models with more recent, higher-performing architectures like ANITA (based on LLaMA 3), DanteLLM, and Cerbero (both based on Mistral) tend to excel. Notably, ANITA stands out with its base model, LLaMA 3, being pre-trained on an impressive dataset of 15 trillion tokens and having already undergone an instruction-tuning and alignment phase using both Proximal Policy Optimization (PPO) [38] and Direct Preference Optimization (DPO) [39] in the English language. This suggests that the aforementioned strategy may enhance instruction-following abilities, since DanteLLM was also instruction-tuned on Italian starting from the English-instructed version of Mistral. On the contrary, Cerbero, which is based on the non-instruct version of Mistral, obtained lower performance compared to DanteLLM. Given the lack of insight into the models’ pre-training data and the importance of understanding this phenomenon, further study on the impact of instruction tuning before language adaptation is encouraged.</p>
              <p>Linguistic Properties. When we analyze which linguistic constraints the models followed the most, we observe notable differences between the two evaluation metrics, highlighting their complementarity and their ability to capture diverse aspects of the models’ constrained sentence generation capabilities. Specifically, the rankings of linguistic properties based on SR and Spearman correlation scores differ significantly. On average (Avg row), the top three linguistic characteristics with the highest SR are the use of subordination, subjects and objects (paired with adjectives). In contrast, the top three characteristics with the highest Spearman scores are the length of the generated sentences (n_tokens), the use of adjectives, and verbs. Interestingly, in terms of SR, on average the models struggle with generating sentences featuring a specific length in terms of the number of tokens. One possible explanation for this behaviour could be that, although sentence length can be considered a basic property, its wide range of variation makes it challenging for an LLM to generate sentences with an exact number of tokens compared to other properties. Conversely, n_tokens achieves the highest Spearman scores among all models, indicating that the models are still capable of following an increasing trend in token constraints.</p>
              <p>Figure 2 illustrates, for each model and each property, the SR scores obtained in the generation of sentences with a value vi, reported on the x-axis. This analysis enables us to identify linguistic control elements that models can adhere to more accurately, thereby indicating their proficiency in mastering specific property values within the spectrum of Italian language possibilities. Generally, models achieve lower scores for high property values, while scores tend to be higher when the property value is 0, indicating the absence of the given property. These contrasting trends suggest that models can differentiate between generating sentences with or without a specific property and face greater difficulty with higher property values, which may be less common in Italian. An interesting exception is the subj property, where SR scores increase as the property value rises from 0 to 1. This indicates that models are less accurate at generating sentences without a subject.</p>
              <p>(Figure 2: Success rate for each linguistic property and each model. Scores are reported for each group of feature values.)</p>
            </sec>
          </sec>
          <sec id="sec-3-2">
            <title>3.2. Sentence Validation</title>
            <p>As mentioned in Section 2, the validation step of our study is two-fold.</p>
            <p>Consistency. Table 3 presents the results of the validation of the consistency of the LLMs, evaluated against the requested linguistic constraints’ values. The results are reported for two sets of generated sentences: the entire set (Cons. in the table) and the subset including only the sentences generated by correctly following the constraints (Cons.+); note that for this subset, the number of sentences for each model and linguistic property varies, as detailed in Appendix C. A first observation concerns the fact that the scores, both in terms of SR and Spearman, are higher when we consider the Cons.+ set. This suggests that when the models generate sentences that precisely adhere to the requested values, they tend to answer the validation question more accurately, thus showing greater coherence with the requested constraints. However, we can notice some differences across LLMs, linguistic characteristics and evaluation metrics.</p>
            <p>By focusing on the ranking of the LLMs (Avg column), we find that ANITA is the most coherent model in terms of both SR and Spearman scores. This aligns with the results discussed in Section 3.1: the model that demonstrated the best controlled generation abilities is also the most capable of correctly answering the validation question and the most consistent with the requests. When we focus on the analysis of the linguistic constraints, we observe some differences between the two evaluation metrics considered. In terms of SR, both for Cons. and Cons.+, we notice that the constraints the models are better able to follow (see Table 2) are also those the models can better recognize in the generated sentences. Specifically, these are the three syntactic properties of the sentence we considered (subj, obj, subord). Two main exceptions are ANITA and Camoscio. ANITA, while being the best model in generating sentences with the exact number of requested tokens (n_tokens), is the least able to recognize the length of the generated sentences. On the contrary, for the same constraint, Camoscio, with only a 0.1 SR in sentence generation, is the model most capable of correctly answering the validation question.</p>
            <p>Such a direct relationship with the generation abilities is less observable for the evaluation in terms of Spearman correlation scores. Namely, the ranking of the Spearman scores in the Avg row in Table 3 does not align with the ranking in Table 2. For example, consider the subject constraint: while it is the constraint that models are, on average, least able to incrementally follow, it is the one with which they are most consistent in terms of the requested values.</p>
            <p>Recognizing linguistic properties. Table 4 reports the results of the second validation step. A general comparison between the Avg column here and the corresponding column in Table 2 reveals different trends, depending on the evaluation metric. This highlights that our approach effectively distinguishes the models’ varying abilities.</p>
            <p>Specifically, in terms of SR, most models, except ANITA, show a stronger ability to recognize the linguistic properties of their own generated sentences than to correctly generate sentences with the requested constraint. Conversely, when considering the Spearman evaluation, four out of the six models, i.e. ANITA, Camoscio, DanteLLM, and LLaMAntino, demonstrate greater proficiency in generating sentences following incremental constraints than in validating the linguistic properties of those sentences.</p>
            <p>A final remark concerns the ranking of the linguistic features (Avg row in the table). It generally aligns with the one discussed in Section 3.1 for both evaluation metrics. The main exception is the models’ ability to recognize the exact number of subjects in their own generated sentences. This linguistic characteristic is the best recognized on average across the models in terms of SR (0.44), which is notably higher compared to the average SR of the generation abilities (0.27).</p>
          </sec>
          <sec id="sec-ack">
            <title>Acknowledgments</title>
            <p>This work has been supported by: FAIR - Future AI Research (PE00000013) project under the NRRP MUR program funded by the NextGenerationEU; TEAMING-UP - Teaming up with Social Artificial Agents project under the PRIN grant no. 20177FX2A7 funded by the Italian Ministry of University and Research.</p>
          </sec>
        </sec>
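        <p>The Cons. / Cons.+ distinction can be illustrated with a minimal sketch (our own, with hypothetical record fields, not the authors’ code): the same validation Success Rate is computed once over all generations and once over the subset whose generation already satisfied the constraint:</p>

```python
# Sketch of the Cons. vs. Cons.+ comparison. Each record is an illustrative
# triple (requested value, value measured in the sentence, value the model
# answered in the validation step); the field layout is ours, not the paper's.

def consistency_sr(records: list[tuple[int, int, int]], only_correct: bool = False) -> float:
    """SR of validation answers w.r.t. the requested value.

    only_correct=True restricts the pool to the Cons.+ subset, i.e. sentences
    whose measured value already matched the request during generation."""
    pool = [r for r in records if not only_correct or r[0] == r[1]]
    return sum(req == ans for req, _, ans in pool) / len(pool)

# (requested, measured, answered) toy data
records = [(2, 2, 2), (2, 2, 2), (2, 1, 2), (2, 1, 1), (3, 3, 2)]
print(consistency_sr(records))                     # Cons.: 0.6
print(consistency_sr(records, only_correct=True))  # Cons.+: ~0.67
```

        <p>As in the paper’s Table 3, the toy Cons.+ score is higher than the Cons. score, since the subset excludes generations that already drifted from the request.</p>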
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion and Future Works</title>
      <sec id="sec-3-1">
        <p>In this paper, we presented the results of a new framework to extensively evaluate the linguistic abilities of Italian LLMs when generating sentences according to multiple linguistic constraints and, subsequently, when validating the linguistic properties of their own outputs.</p>
        <p>Results showed that models’ architectures and the dimensions of pre-training data have an impact on their ability to correctly follow the constraints, with ANITA being the best-performing model across all configurations. When validating each model against their own generated sentences, we noticed that i) LLMs tend to be more consistent with the requested constraints when they correctly followed them during the generation phase, and ii) the generation abilities do not always align with the ability of the models to recognize the linguistic properties of their generated sentences.</p>
        <p>Our findings also highlighted that the chosen evaluation metric can significantly affect the results, underscoring the complexity of evaluating LLMs and the necessity for further research in this direction.</p>
        <p>Considering that the evaluation of LLMs is an ongoing and multifaceted effort across all languages, we believe that this study opens the way for numerous further in-depth analyses focused on various aspects of evaluation. Among other aspects, we could evaluate the overall quality of the generated sentences, which we have not accounted for so far. Preliminary investigations revealed that the overall quality of the generations varies across Italian LLMs, with Italia appearing to be the most fluent. Thus, future research should also involve a more comprehensive evaluation that compares the linguistic abilities of LLMs with their fluency and grammaticality.</p>
        <p>References</p>
        <p>[6] … A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 4343–4355. URL: https://aclanthology.org/2024.lrec-main.388.</p>
        <p>[7] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651–3657. URL: https://aclanthology.org/P19-1356. doi:10.18653/v1/P19-1356.</p>
        <p>[8] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4593–4601. URL: https://aclanthology.org/P19-1452. doi:10.18653/v1/P19-1452.</p>
        <p>[9] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842–866. URL: https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.</p>
        <p>[10] J. Li, R. Cotterell, M. Sachan, Probing via prompting, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1144–1157. URL: https://aclanthology.org/2022.naacl-main.84. doi:10.18653/v1/2022.naacl-main.84.</p>
        <p>[11] T. Blevins, H. Gonen, L. Zettlemoyer, Prompting language models for linguistic structure, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 6649–6663. URL: https://aclanthology.org/2023.acl-long.367. doi:10.18653/v1/2023.acl-long.367.</p>
        <p>[12] M. Di Marco, K. Hämmerl, A. Fraser, A study on accessing linguistic information in pre-trained language models by using prompts, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 7328–7336. URL: https://aclanthology.org/2023.emnlp-main.454. doi:10.18653/v1/2023.emnlp-main.454.</p>
        <p>[13] J. Sun, Y. Tian, W. Zhou, N. Xu, Q. Hu, R. Gupta, J. Wieting, N. Peng, X. Ma, Evaluating large language models on controlled generation tasks, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3155–3168. URL: https://aclanthology.org/2023.emnlp-main.190. doi:10.18653/v1/2023.emnlp-main.190.</p>
        <p>[14] B. Alhafni, V. Kulkarni, D. Kumar, V. Raheja, Personalized text generation with fine-grained linguistic control, in: A. Deshpande, E. Hwang, V. Murahari, J. S. Park, D. Yang, A. Sabharwal, K. Narasimhan, A. Kalyan (Eds.), Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024), Association for Computational Linguistics, St. Julians, Malta, 2024, pp. 88–101. URL: https://aclanthology.org/2024.personalize-1.8.</p>
        <p>[15] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=1PL1NIMMrw.</p>
        <p>[16] A. Chen, J. Phang, A. Parrish, V. Padmakumar, C. Zhao, S. R. Bowman, K. Cho, Two failures of self-consistency in the multi-step reasoning of LLMs, Transactions on Machine Learning Research (2024).</p>
        <p>[17] Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze, Y. Goldberg, Measuring and improving consistency in pretrained language models, Transactions of the Association for Computational Linguistics 9 (2021) 1012–1031. URL: https://aclanthology.org/2021.tacl-1.60. doi:10.1162/tacl_a_00410.</p>
        <p>[18] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al., Language models (mostly) know what they know, arXiv preprint arXiv:2207.05221 (2022).</p>
        <p>[19] L. Parcalabescu, A. Frank, On measuring faithfulness of natural language explanations, arXiv preprint arXiv:2311.07466 (2023).</p>
        <p>[20] X. L. Li, V. Shrivastava, S. Li, T. Hashimoto, P. Liang, Benchmarking and improving generator-validator consistency of language models, in: The Twelfth International Conference on Learning Representations, 2023.</p>
        <p>[21] A. Madsen, S. Chandar, S. Reddy, Are self-explanations from large language models faithful?, ArXiv abs/2401.07927 (2024). URL: https://api.semanticscholar.org/CorpusID:266999774.</p>
        <p>[22] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational Linguistics 47 (2021) 255–308. URL: https://aclanthology.org/2021.cl-2.11. doi:10.1162/coli_a_00402.</p>
        <p>[23] A. Miaschi, D. Brunato, F. Dell’Orletta, G. Venturi, Linguistic profiling of a neural language model, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 745–756. URL: https://aclanthology.org/2020.coling-main.65. doi:10.18653/v1/2020.coling-main.65.</p>
        <p>[24] A. Miaschi, F. Dell’Orletta, G. Venturi, Linguistic knowledge can enhance encoder-decoder models (if you let it), in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 10539–10554. URL: https://aclanthology.org/2024.lrec-main.922.</p>
        <p>[25] D. Zeman, J. Nivre, M. Abrams, et al., Universal dependencies 2.5, in: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), 2019. URL: http://hdl.handle.net/11234/1-3105.</p>
        <p>[26] M. Sanguinetti, C. Bosco, PartTUT: The Turin University parallel treebank, in: R. B. et al. (Ed.), Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer, 2015, pp. 51–69. URL: https://link.springer.com/chapter/10.1007/978-3-319-14206-7_3.</p>
        <p>[27] R. Delmonte, A. Bristot, S. Tonelli, VIT - Venice …</p>
        <p>[31] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the italian language: Llamantino-3-anita, arXiv preprint arXiv:2405.07101 (2024).</p>
        <p>[32] A. Santilli, E. Rodolà, Camoscio: an italian instruction-tuned llama, in: Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR.org, 2023.</p>
        <p>[33] F. A. Galatolo, M. G. Cimino, Cerbero-7b: A leap forward in language-specific llms through enhanced chat corpus generation and evaluation, arXiv preprint arXiv:2311.15698 (2023).</p>
        <p>[34] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in italian language, arXiv preprint arXiv:2312.09993 (2023).</p>
        <p>[35] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A python natural language processing toolkit for many human languages, in: A. Celikyilmaz, T.-H. Wen (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 101–108. URL: https://aclanthology.org/2020.acl-demos.14. doi:10.18653/v1/2020.acl-demos.14.</p>
        <p>[36] D. Brunato, A. Cimino, F. Dell’Orletta, G. Venturi, S. Montemagni, Profiling-UD: a tool for linguistic profiling of texts, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources
AssociItalian Treebank: Syntactic and quantitative fea- ation, Marseille, France, 2020, pp. 7145–7151. URL:
tures, in: Proceedings of the Sixth International https://aclanthology.org/2020.lrec-1.883.
Workshop on Treebanks and Linguistic Theories, [37] M.-C. de Marnefe, C. D. Manning, J. Nivre, D.
Ze2007. man, Universal Dependencies, Computational
Lin[28] C. Bosco, S. Montemagni, M. Simi, Converting guistics 47 (2021) 255–308. URL: https://doi.org/10.
italian treebanks: Towards an italian stanford de- 1162/coli_a_00402. doi:10.1162/coli_a_00402.
pendency treebank, in: Proceedings of the ACL [38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford,
Linguistic Annotation Workshop &amp; Interoperabil- O. Klimov, Proximal policy optimization algorithms,
ity with Discourse, 2013. arXiv preprint arXiv:1707.06347 (2017).
[29] M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, [39] R. Rafailov, A. Sharma, E. Mitchell, C. D.
ManF. Tamburini, PoSTWITA-UD: an Italian Twit- ning, S. Ermon, C. Finn, Direct preference
ter Treebank in universal dependencies, in: Pro- optimization: Your language model is secretly
ceedings of the Eleventh Language Resources and a reward model, in: A. Oh, T. Naumann,
Evaluation Conference (LREC 2018), 2018. URL: A. Globerson, K. Saenko, M. Hardt, S. Levine
https://www.aclweb.org/anthology/L18-1279.pdf . (Eds.), Advances in Neural Information Processing
[30] A. T. Cignarella, C. Bosco, P. Rosso, Presenting Systems, volume 36, Curran Associates, Inc., 2023,
TWITTIRÒ-UD: An italian twitter treebank in uni- pp. 53728–53741. URL: https://proceedings.
versal dependencies, in: Proceedings of the Fifth neurips.cc/paper_files/paper/2023/file/
International Conference on Dependency Linguis- a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.
tics (Depling, SyntaxFest 2019), 2019. URL: https: pdf .
//www.aclweb.org/anthology/W19-7723.pdf . [40] T. Dettmers, A. Pagnoni, A. Holtzman, L.
Zettlemoyer, Qlora: Eficient finetuning of quantized
llms, in: A. Oh, T. Naumann, A. Globerson,
K. Saenko, M. Hardt, S. Levine (Eds.), Advances
in Neural Information Processing Systems,
volume 36, Curran Associates, Inc., 2023,
pp. 10088–10115. URL: https://proceedings.
neurips.cc/paper_files/paper/2023/file/
1feb87871436031bdc0f2beaa62a049b-Paper-Conference.</p>
        <p>pdf .
[41] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for</p>
        <p>Italian language understanding and generation, in:
N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti,
N. Xue (Eds.), Proceedings of the 2024 Joint
International Conference on Computational
Linguistics, Language Resources and Evaluation
(LRECCOLING 2024), ELRA and ICCL, Torino, Italia, 2024,
pp. 9422–9433. URL: https://aclanthology.org/2024.</p>
        <p>lrec-main.823.
[42] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,</p>
        <p>S. Wang, L. Wang, W. Chen, Lora: Low-rank
adaptation of large language models, arXiv preprint
arXiv:2106.09685 (2021).
[43] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,</p>
        <p>C. Guestrin, P. Liang, T. B. Hashimoto, Stanford
alpaca: An instruction-following llama model, https:
//github.com/tatsu-lab/stanford_alpaca, 2023.
[44] D. Croce, A. Zelenanska, R. Basili, Neural learning
for question answering in italian, in: C. Ghidini,
B. Magnini, A. Passerini, P. Traverso (Eds.), AI*IA
2018 – Advances in Artificial Intelligence, Springer</p>
        <p>International Publishing, Cham, 2018, pp. 389–402.
[45] P. Koehn, Europarl: A parallel corpus for statistical
machine translation, in: Proceedings of Machine
Translation Summit X: Papers, Phuket, Thailand,
2005, pp. 79–86. URL: https://aclanthology.org/2005.</p>
        <p>mtsummit-papers.11.
[46] C. Xu, D. Guo, N. Duan, J. McAuley, Baize: An
opensource chat model with parameter-eficient tuning
on self-chat data, arXiv preprint arXiv:2304.01196
(2023).
[47] A. Bacciu, G. Trappolini, A. Santilli, E. Rodolà, F.
Silvestri, Fauno: The italian large language model that
will leave you senza parole!, https://github.com/
andreabac3/Fauno-Italian-LLM, 2023.
[48] A. Holtzman, J. Buys, L. Du, M. Forbes, Y. Choi, The
curious case of neural text degeneration, in: 8th
International Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020, OpenReview.net, 2020. URL: https:
//openreview.net/forum?id=rygGQyrFvH.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Model and dataset links (footnotes 7–14)</title>
        <p>7 https://huggingface.co/swap-uniba/LLaMAntino-2-chat-7b-hf-UltraChat-ITA
8 https://huggingface.co/datasets/basilepp19/dolly-15k-it
9 https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
10 https://huggingface.co/datasets/Chat-Error/wizard_alpaca_dolly_orca
11 https://huggingface.co/sag-uniroma2/extremITA-Camoscio-7b
12 https://huggingface.co/rstless-research/DanteLLM-7B-Instruct-Italian-v0.1
13 https://huggingface.co/galatolo/cerbero-7b
14 https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1</p>
        <p>While in the validation step the model is prompted
to recognise the linguistic properties of its own sentence:
_ + Quante   ci sono nella seguente frase: ’’? Non fornire spiegazioni.
(transl. _ + How many   are there in the following sentence: ’’? Do not give an explanation.)</p>
      </sec>
      <sec id="sec-3-3">
        <title>For each model we used the author’s recommended chat template and the specified system prompt when available; otherwise we excluded it. All models are loaded</title>
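The per-model handling described above (system prompt included only when one is specified) can be sketched as follows; the `build_messages` helper and its fallback are our illustration of the setup, not the authors' code:

```python
def build_messages(user_prompt: str, system_prompt=None) -> list:
    """Build a chat-format message list: include a system turn only when
    the model card specifies a system prompt, otherwise omit it."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

A list in this shape is what, e.g., the Hugging Face `tokenizer.apply_chat_template` method consumes to render each model's recommended chat template.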
        <sec id="sec-3-3-1">
          <title>B.2. Constraint values</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>C. Further details about the results</title>
      <p>15https://huggingface.co/docs/transformers/index.</p>
      <p>A sample of the generated sentences, together with i) the requested linguistic constraints (Fig. 4) and ii) the property values of the generated sentences (Fig. 3). [Flattened table columns lost to extraction: Constraint (ADJ, NOUN, subord), requested value (3, 5), Model (Italia, ANITA); only the Sentence column is reproduced below.]</p>
      <p>Sentence
Il suo spirito indomito ha sempre trovato il modo di superare le
sfide, senza mai arrendersi, nonostante le avversità (His indomitable
spirit always found ways to overcome challenges, never giving up despite
adversity.).</p>
      <p>Il suo atteggiamento gentile e la sua presenza rassicurante hanno
aiutato tutti a sentirsi a proprio agio e ben accolti (His kind attitude
and reassuring presence helped everyone feel comfortable and welcome.).
La sua passione per la cucina italiana autentica si manifesta nei suoi
piatti ricchi di sapori tradizionali, che richiamano la genuinità della
cucina casalinga. (His passion for authentic Italian cuisine is evident
in his dishes rich in traditional flavors that evoke the authenticity of
home cooking.)
L’uomo va in ufficio (The man goes to the office.).</p>
      <p>Le case vecchie richiedono lavori costosi di restauro (Old houses
require expensive restoration work.).</p>
      <p>Il grande ed antico castello si erge sulla cima di un piccolo e ripido
colle (The large and ancient castle stands on the top of a small and steep
hill.).</p>
      <p>Dopo un’attenta analisi, l’esperto ha concluso che la situazione era
di una gravità estrema. (After careful analysis, the expert concluded
that the situation was of extreme gravity.)
La verità è emersa, ma ha lasciato un segno che non potrà mai essere
cancellato. (The truth has emerged, but it has left a mark that can never
be erased.)
La missione umanitaria è stata ostacolata da un’insolita condizione
meteorologica che ha causato ritardi nei soccorsi. (The humanitarian
mission was hampered by unusual weather conditions that caused delays
in relief efforts.)
Il cane corre dietro il gatto. (The dog runs after the cat. )
Il bambino ha in braccio il gatto. (The child is holding the cat.)
I clienti visitano il ristorante con la famiglia e i bambini.
(Customers visit the restaurant with their families and children.)
La pioggia porta sollievo alle piante assetate. (Rain brings relief to
thirsty plants.)
Il suo viaggio attraverso le montagne è stato reso più agevole dalla
presenza di un amico che lo ha accompagnato lungo il percorso.
(His journey through the mountains was made easier by the presence of
a friend who accompanied him along the way.)
L’impegno di tutti è fondamentale per garantire il successo del
progetto. (Everyone’s commitment is essential to ensure the success of
the project.)
Ho visitato la città in cui nacque Manzoni. (I visited the city where
Manzoni was born.)
Il concerto inizia solo dopo le nove. (The concert does not start until
after nine o’clock. )
L’uomo che aveva visto il film che era uscito l’anno prima , era
rimasto deluso. (The man who had seen the film that came out the year
before was disappointed.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <source>GPT-4 technical report, arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          :
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. d. l. Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <source>Mistral 7b, arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Harnessing the power of llms in practice: A survey on chatgpt and beyond</article-title>
          ,
          <source>ACM Trans. Knowl. Discov. Data</source>
          <volume>18</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3649506. doi:10.1145/3649506.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Beeching</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habib</surname>
          </string-name>
          , S. Han,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sanseviero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          , T. Wolf, Open llm leaderboard, https: //huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>DanteLLM: Let's push Italian LLM research forward!</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
          </string-name>
          ,
          <comment>6 A sample of the generated sentences can be found in Appendix C.</comment>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>