<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>La non canonica l'hai studiata? Exploring LLMs and Sentence Canonicity in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudiu Daniel Hromei</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodolfo Delmonte</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ca' Foscari University</institution>
          ,
          <addr-line>Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper investigates the ability of Large Language Models (LLMs) to differentiate between canonical and non-canonical sentences in Italian, employing advanced neural architectures like LLaMA and its adaptations. Canonical sentences adhere to the standard Subject-Verb-Object (SVO) structure. We hypothesize that recent generative LLMs are heavily influenced by the English language, where non-canonical structures are very rare. Using the in-context learning technique, we probe these models and further fine-tune them for this specific task. Initial results indicate that these models continue to struggle with this task even after fine-tuning. Additionally, we introduce a test set comprising several hundred sentences from the poetry domain, which presents significant challenges for the canonical structure task.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Italian Sentence Structure</kwd>
        <kwd>Non-Canonical Structures</kwd>
        <kwd>In-Context Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In Italian, non-projectivity in written texts is 7%, based on 230,629 constituents. Compared to Latin, where the non-projectivity index is 6.65% in the Latin Dependency Treebank containing about 55,000 tokens, Italian and Latin are quite similar. In contrast, English tree projectivity in the Penn Treebank (PT), where the majority of the data corresponds to articles from the Wall Street Journal (WSJ), shows much lower numbers: with 720,086 constituents, the non-projectivity index is 0.01004%.</p>
      <p>Thus, Italian speakers have high expectancies for the presence of an NCS due to processing difficulties also raised by the number of unexpressed subjects: 61% of all Inflected Propositions lack a lexically expressed subject. This does not apply to English speakers, for whom NCS are infrequent and context-specific. In this view, Italian is considered unique for its use of many of the non-canonical structures found in contemporary poetry and examined in this experiment. The richness and freedom of the language give speakers the ability to produce such a diverse typology of non-canonical structures, which stems from its Latin heritage, with the Null Subject being one of the most well-known features. Like many other languages, including Spanish, Portuguese, and Catalan, as well as Chinese, Japanese, Slavic languages, Greek, and Hebrew, Italian is a Null Subject Language. However, this parameter alone does not fully explain the richness and complexity of syntactic structures seen in Italian poetry. While other Romance languages share similar syntactic traits, the specific linguistic legacy and poetic traditions of Italian give it a unique character in this regard.</p>
      <p>In this paper, we want to analyze the ability of recently proposed Large Language Models to detect non-canonical sentences in Italian. Our hypothesis is that, given the very large percentage of English training data (usually more than 90%) and the very low percentage of Italian training data (usually less than 1%), these models have a limited capacity to process such structures and rely mostly on English word-order patterns. On the other hand, the models that have been specifically adapted or fine-tuned on Italian data should show a better understanding of canonicity in Italian.</p>
      <p>In the rest of the paper, Section 2 describes the related work, Section 3 shows our approach to recognizing canonical structures, Section 4 presents and discusses the results, and Section 5 derives the conclusions.</p>
      <p>1 Elizabethan English was more similar to Italian in its variety of syntactic structures. 2 In English: “Always dear to me was this solitary hill and this hedge which from large side of the ultimate horizon the gaze excludes”.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Our approach has been previously adopted by other researchers, but with slightly different aims, as described below. Initial attempts at parsing Italian treebanks of constituent structures focused on two small treebanks: TUT [8, 9] and ISST [10], containing approximately 3,500 and 3,000 sentences, respectively. Illo tempore, these efforts yielded an F1 score of 82.96%, while comparable parsers (Stanford, Collins, and MaltParser) achieved about 92.10% on the WSJ treebank. The lower performance in Italian was primarily due to two factors: a higher number of non-canonical structures (i.e., word order variations) and the presence of pro-drop clauses, where the subject is lexically omitted, a challenge also documented for other similar languages [11].</p>
      <p>Significant improvements in parsing performance were noted in a paper on the EVALITA shared task on constituency parsing, where the best F1 score increased from 70% to 84% [12], attributed to the near doubling of training samples between 2007 and 2011. In [13], the authors presented a new dataset of Italian based on “marked” sentences to test the performance of the neural parser TINT. The result for LAS dependency structures was 77% accuracy, three points below the best result on the UD corpus of Italian, which was 80%. This outcome confirmed previous findings with a small dataset of strongly marked sentences, where accuracy was below 50%. The authors detailed seven types of marked structures in their treebank corpus: cleft, left-dislocated, right-dislocated, presentative “ci” (there in English), inverted subject, pseudo-clefts, and hanging topic, with cleft and left-dislocated sentences being the most common.</p>
      <p>In this context, it is interesting to explore the capabilities of state-of-the-art methods for addressing the problem of distinguishing between canonical and non-canonical sentences in Italian. This exploration is motivated by the complexity and richness of Italian syntax, which presents unique challenges for natural language processing models. Most current state-of-the-art models are based on the Transformer architecture [14]. This game-changing model comprises two main components, leading to different model families. The encoder, used in models like BERT [15], RoBERTa [16], and Sentence BERT [17], encodes input sequences using self-attention. In contrast, decoders, such as GPT [18], GPT-3 [19], and LLaMA [20], generate output sequences auto-regressively. Beyond these, encoder-decoder models like T5 [21] and BART [22] integrate both components, excelling in tasks such as translation, summarization, and question-answering.</p>
      <p>One notable Transformer-based architecture is the LLaMA foundational model [20]. LLaMA is a large model with billions of parameters that generates output sequences auto-regressively based on the input and previously generated tokens. It has recently been applied to a variety of linguistic tasks by instruction-tuning a monolithic architecture to solve them all [23]. This family of models is promising as they rely on auto-regressive generation methods and, thanks to their massive amount of training data and parameters, can solve a plethora of linguistic tasks. Additionally, [24] demonstrated the application of LLaMA-family models for syntactic parsing across multiple languages, highlighting the capability of the model to analyze and detect sentence structures. This work underscores the versatility of large language models in handling diverse syntactic frameworks, further probing their performance in cross-linguistic scenarios. Finally, architectures specifically adapted for Italian, such as Camoscio [25] and LLaMAntino [26], are tuned with instruction datasets for the Italian language, starting from the original LLaMA model and its second variant, LLaMA2-chat, respectively. They demonstrate a strong understanding of the language and an excellent ability to generate appropriate responses.</p>
      <p>In this paper, we aim to explore the ability of Large Language Models (LLMs) to distinguish between canonical and non-canonical sentences in Italian using neural architectures such as LLaMA and its various adaptations, as discussed in the next Section. It is interesting to note that, in the future, one might explore probing for syntax at the intermediate layers of the various models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Recognizing Canonical Structures through LLMs</title>
      <p>To address the capabilities of Large Language Models in recognizing the canonical structures, they can be utilized through In-Context Learning techniques [27] or by directly fine-tuning the model for specific downstream tasks. In-context learning relies on the model’s pre-existing knowledge acquired during pre-training and on instructions provided in natural language at inference time. This method does not involve additional training and can be categorized based on the number of examples provided: i) 0-shot Learning, where no examples are given and the model generates responses based solely on its pre-existing knowledge and the provided instructions; ii) 1-shot Learning, where one example per class (positive and negative in our case) is added to provide a more precise context; these examples help the model better understand the task by offering a concrete reference point; iii) Few-shot Learning, where more than one example per class is provided to give the model additional contextual information during decision-making. This approach is particularly effective when very few examples (such as 2 or 4) are given, but it can be extended up to the maximum input context length.</p>
      <p>For both one-shot and few-shot learning approaches, a key challenge is selecting the most informative examples to provide during inference. One effective strategy is to retrieve examples that are most similar to the current sequence to be classified, focusing on those with a similar structure or meaning. A commonly used method for this is to generate vector embeddings of sentences using a model like sBERT [17]. This model produces a contextualized vector that represents the information contained in a sentence. By applying Cosine Similarity, we can rank these vectors and select the training examples most similar to the input sequence. This process ensures that the model is supplied with the most relevant solved examples for a given input. It is important to note that these examples may not always capture the same explicit syntax representation as a Tree Kernel [28] function would, in which every word of the sentence is explicitly annotated with syntactic information and linked to the others. However, the crucial aspect is that the examples provided are sufficiently similar in meaning and context, and the sBERT architecture is very effective.</p>
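      <p>As an illustration, this retrieval step can be sketched in a few lines of Python with the sentence-transformers library. This is a minimal sketch under assumed settings: the encoder name and the two toy training pairs are placeholders, not the configuration used in the experiments.</p>
      <preformat># Minimal sketch of similarity-based example selection for 1-shot/few-shot
# prompting (assumed setup, not the exact configuration of the paper).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical solved training examples: (sentence, label) pairs.
train = [
    ("Il gatto dorme sul divano.", "sì"),
    ("La mela, Maria la mangia.", "no"),
]
train_emb = encoder.encode([s for s, _ in train], convert_to_tensor=True)

def top_k_examples(sentence, k=2):
    """Rank the training examples by cosine similarity to the input."""
    query = encoder.encode(sentence, convert_to_tensor=True)
    scores = util.cos_sim(query, train_emb)[0]
    ranked = sorted(range(len(train)), key=lambda i: float(scores[i]),
                    reverse=True)
    return [train[i] for i in ranked[:k]]</preformat>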
      <p>When the model’s pre-existing knowledge is insufficient, we can fine-tune it on the downstream task. Fine-tuning involves training the model in a traditional manner using input-output pairs (training data) to adjust its parameters. This process improves the model’s performance on specific tasks, allowing it to learn from a more extensive set of examples. As a result, the model becomes more adept at handling similar queries, with a focus on the specific task at hand. By leveraging these techniques, LLMs can recognize and respond to canonical structures with varying degrees of efficiency and accuracy.</p>
      <sec id="sec-3-1">
        <title>3.1. Training LLMs against non-Canonical Structures</title>
        <p>To interact with the models, we need a sufficiently detailed prompt, which includes a natural language description of the task (i.e., the rules to determine whether an Italian sentence follows the canonical structure) and specifies the type of answer we expect the LLM to produce: Sì (Yes in English) if the sentence is canonical and follows the rules, or No otherwise. For the training and the 0-shot strategy, we used the following prompt:</p>
        <p>“Dimmi se la seguente frase ha una struttura canonica o meno. Per Canonica si intende una frase che segue una struttura standard per ogni verbo presente. Più nello specifico, le frasi canoniche seguono queste regole: contengono SOLO sequenze del tipo nome o strutture nominali SEGUITE da struttura verbale a sua volta seguita (oppure no) da complementi OPPURE contengono SOLO sequenze composte da struttura verbale seguita da complementi, dove: STRUTTURE VERBALI sono sequenze composte da ausiliare o/e modale e verbo, e tra i due ci può essere un avverbio oppure strutture preposizionali; COMPLEMENTI sono strutture nominali oppure strutture preposizionali oppure strutture frasali oppure strutture infinitivali. Tutte le altre frasi sono da considerarsi come Non Canoniche. Riguardo il prossimo input, rispondi ’sì’ se è ’canonico’, ’no’ se è ’non canonico’.”</p>
        <p>In English: “Tell me whether the following sentence has a canonical structure or not. By Canonical we mean a sentence that follows a standard structure for each verb it contains. More specifically, canonical sentences follow these rules: they contain ONLY sequences of the type noun or nominal structures FOLLOWED by a verbal structure, in turn followed (or not) by complements, OR they contain ONLY sequences composed of a verbal structure followed by complements, where: VERBAL STRUCTURES are sequences composed of an auxiliary and/or modal and a verb, and between the two there can be an adverb or prepositional structures; COMPLEMENTS are nominal structures or prepositional structures or clausal structures or infinitival structures. All other sentences are to be considered Non-Canonical. Regarding the next input, answer ’sì’ if it is ’canonical’, ’no’ if it is ’non-canonical’.”</p>
        <p>For the 1-shot scenario, immediately after the above prompt, we append the following instruction, where the two provided examples are selected as the most relevant for the input example:</p>
        <p>“Ti faccio un paio di esempi: &lt;Positive_Example&gt; e devi rispondere sì. &lt;Negative_Example&gt; e devi rispondere no.”</p>
        <p>In English: “Let me give you a couple of examples: &lt;Positive_Example&gt; and you must answer sì. &lt;Negative_Example&gt; and you must answer no.”</p>
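        <p>Putting the pieces together, the following sketch shows how the 0-shot and 1-shot prompts can be assembled. The function names are illustrative, top_k_examples is the retrieval sketch above, TASK_DESCRIPTION abbreviates the Italian instruction just quoted, and the final “Input:” line is our own assumption about how the sentence is appended.</p>
        <preformat># Illustrative prompt assembly for the 0-shot and 1-shot settings.
# TASK_DESCRIPTION abbreviates the Italian instruction quoted above.
TASK_DESCRIPTION = "Dimmi se la seguente frase ha una struttura canonica o meno. [...]"

def build_prompt(sentence, examples=None):
    """0-shot if examples is None; 1-shot (or few-shot) otherwise."""
    parts = [TASK_DESCRIPTION]
    if examples:
        parts.append("Ti faccio un paio di esempi:")
        for text, label in examples:  # (sentence, "sì"/"no") pairs
            parts.append(f"{text} e devi rispondere {label}.")
    parts.append(f"Input: {sentence}")  # assumed input format
    return "\n".join(parts)

# 1-shot prompt with the most similar solved examples (via the sBERT sketch).
sentence = "Sempre caro mi fu quest'ermo colle."
prompt = build_prompt(sentence, examples=top_k_examples(sentence))</preformat>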
        <p>When fine-tuning a model, a highly detailed prompt might seem excessive, especially since traditional training involves repeating the prompt multiple times. However, our hypothesis is that clearly explaining the task to the model aids in faster convergence of the parameters and a more rapid reduction in loss during training. This is the reason why our prompt includes a comprehensive description of the canonical sentence structure. This description details the constraints each verb must adhere to, the types of sequences a sentence can contain, the verbal structures, and the order of complements. If a verb does not adhere to these constraints, the sentence should be classified as Non-Canonical.</p>
        <p>For training, we used the VIT Treebank [30], which contains approximately 320,000 words. Among other information, each sentence is categorized as canonical or not. The dataset was divided into a Training set and a Development set with a 90/10 ratio. The class distribution is shown in Figure 1, where it is evident that the vast majority of the sentences are canonical, reflecting the natural usage patterns of Italian speakers.</p>
        <p>We employed the LoRA [31] technique and the Peft package on a single Tesla T4 GPU to train the models for 3 epochs, with a learning rate of 3e-4 and a linear scheduler with 10% warmup. The LoRA r parameter was set to 8, α to 16, and all available layers were involved (for more details, refer to the original paper [31]). For computational efficiency, the floating-point precision of the parameters was set to 8 bits, allowing the use of a single GPU.</p>
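        <p>A configuration of this kind can be sketched with the Hugging Face transformers and peft packages as follows. Only the hyper-parameters quoted above (r = 8, α = 16, learning rate 3e-4, 3 epochs, linear scheduler with 10% warmup, 8-bit loading) come from the text; the base checkpoint name and the rest of the pipeline are placeholders.</p>
        <preformat># Sketch of LoRA fine-tuning via peft; hyper-parameters follow the values
# quoted in the text, while the base checkpoint is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, load_in_8bit=True)

lora = LoraConfig(r=8, lora_alpha=16, target_modules="all-linear",
                  task_type="CAUSAL_LM")  # involve all available layers
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="canonicity-lora",
    num_train_epochs=3,
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,  # 10% warmup
)</preformat>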
        <p>For the Test set, we used a collection of Italian poetry comprising 51 texts with a total of 303 sentences. For the same reason that people still regard Dante as the greatest Italian poet and students are required to learn his best poems by heart, we have chosen what is regarded as the best contemporary Italian poetry: a manually curated collection of excerpts from Italian poems from the late 19th and early 20th centuries. In particular, we used poems by the 1975 Nobel laureate Eugenio Montale, with about one hundred excerpts taken from the volume “Ossi di Seppia”. The class distribution of this test set is shown in Figure 1. Notably, the distribution of Yes (the sentence is canonical) and No (the sentence is non-canonical) is reversed compared to the Training and Development sets, due to poetic license and rhyming constraints. This reversal poses a significant challenge for the models we trained, but it presents an interesting test case. More details about this and a simple Error Analysis are presented in Appendix B.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. LLM Architectures for non-Canonical Structures</title>
        <p>Today, the landscape of Large Language Models (LLMs) is vast, making it challenging to choose the most suitable model. In this paper, we focus on several well-known models from the LLaMA family: LLaMA1 [20], the first in the series; LLaMA2 [29], which introduced minor improvements in the Transformer architecture; Camoscio [25], an instruction-tuned LLaMA model fine-tuned on Italian data; ExtremITA [23], an architecture designed for a wide range of Italian tasks; and LLaMAntino [26], an adaptation of the original LLaMA2 model for the Italian language.</p>
        <p>We expect the best-performing models to be those specifically adapted or fine-tuned on Italian data, such as Camoscio, ExtremITA, or LLaMAntino. One significant issue with the English models is that non-canonicity is very rare in English, as the language predominantly follows the Subject-Verb-Object structure, which is canonical, with very few (grammatically correct) non-canonical examples.</p>
        <p>In this context, it is important to note that the consideration of structures which, in Chomskyan transformational theory, were once viewed as surface-level realizations of deep canonical structures has not been a deliberate focus of this experiment. The first reason for excluding structures like passives, interrogatives, relative clauses, cleft sentences, tough constructions, and others, is their relative scarcity in poetry, though they are more frequent in prose. A second reason, closely tied to the first, is that these common structures do not add an element of surprise, given their frequency in everyday language use. That said, some of these common non-canonical structures can still be found in Italian literary prose, but not all are represented in the examples we studied. On the other hand, focus fronting (also referred to as object preposing, complement preposing, or full argument inversion, depending on the constituent being fronted) is prevalent in the examples included in the experiment. An exemplar list of such structures can be found in Appendix C.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Empirical Investigation</title>
      <p>In this setup, the models trained and those utilized in the k-shot scenario are required to answer Yes if the given text is canonical and follows the rules, or No otherwise.</p>
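      <p>Under these assumptions, the inference step can be sketched as follows. The generation call uses the standard Hugging Face API, build_prompt is the illustrative helper from Section 3.1, the model and tokenizer are those of the previous sketches, and the normalization of the generated answer to a binary label is our own simplification.</p>
      <preformat># Sketch: query the model and map its free-text answer to a binary label.
import torch

def classify(sentence, examples=None):
    prompt = build_prompt(sentence, examples)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=5)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]  # strip the prompt
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # "sì" means canonical (Yes); anything else counts as No here.
    return "yes" if answer.strip().lower().startswith("sì") else "no"</preformat>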
      <sec id="sec-4-1">
        <title>4.1. Results and Discussion</title>
        <p>The models used in this paper are those already anticipated in Section 3.2, available from Huggingface, using the prompt described in Section 3.1. The results are available in Table 1. Given the distribution of the sentences of the Training set, we report a simple but informed Yes-Baseline. This baseline cannot perform well on the inverted distribution of the Test set, as it always answers Yes. We first used the LLMs anticipated in Section 3.2 in a 0-shot manner: one can notice an overall good ability to detect the non-canonical sentences, reaching 73% Precision and 93% Recall for Camoscio, which however still struggles to identify the canonical ones. We hoped to heavily boost the performance of the models in the 1-shot scenario3, but it seemed to decrease. The same trend can be noted for all the other models. As a second comparison, we trained an Italian BERT model for 3 epochs, which starts showing some awareness of the task, reaching an overall 40% Micro-F1. Using our Development set, we selected only the best LLM to report here for space constraints, which is based on Camoscio [25]. Finally, the fine-tuned model reaches the best performance, with a very good Precision (98%) for the non-canonical sentences and a very good Recall (98%) for the canonical ones, with a final 60% for both Macro and Micro F1.</p>
        <p>3 We experimented with more than 1 example per class, increasing the number of samples up to a 16-shot scenario. Unfortunately, the performance did not increase but stalled around 60% Micro-F1. We did not report these results here for space constraints.</p>
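        <p>For reference, the Yes-Baseline and the reported metrics can be computed with scikit-learn as in this sketch; gold and pred are illustrative label lists, not the actual predictions.</p>
        <preformat># Sketch: Yes-Baseline and Precision/Recall/F1 with scikit-learn.
from sklearn.metrics import f1_score, precision_recall_fscore_support

gold = ["yes", "no", "no", "yes"]  # illustrative gold labels
pred = ["yes"] * len(gold)         # the informed Yes-Baseline always says Yes

macro_f1 = f1_score(gold, pred, average="macro")
micro_f1 = f1_score(gold, pred, average="micro")
per_class = precision_recall_fscore_support(gold, pred,
                                            labels=["yes", "no"],
                                            zero_division=0)</preformat>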
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Corpus Analysis</title>
        <p>For a better insight into the measured performance, we studied the role of the training material as representative of the adopted test dataset. We analyzed the test dataset in terms of average word frequencies, as observed on the ITWaC corpus4. This corpus provides pre-computed frequencies for each word: for comparative reasons, we normalized them in [0, 1] and measured them for each sentence in terms of the mean frequency, i.e., the average of the word frequencies over each sentence. By independently averaging the frequencies of canonical and non-canonical sentences, we obtained the following figures:</p>
        <list list-type="bullet">
          <list-item><p>Canonical Sentences, AVG frequency: 0.38</p></list-item>
          <list-item><p>Non-Canonical Sentences, AVG frequency: 0.24</p></list-item>
        </list>
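        <p>The per-sentence figure can be computed as in the following sketch; itwac_freq is a tiny placeholder for the ITWaC-derived frequency dictionary, already normalized in [0, 1], and treating out-of-vocabulary words as 0 is our own simplifying assumption.</p>
        <preformat># Sketch: mean normalized word frequency of a sentence.
# itwac_freq maps a word to its ITWaC frequency normalized in [0, 1];
# this tiny dictionary is only a placeholder.
itwac_freq = {"il": 0.98, "gatto": 0.41, "dorme": 0.27}

def mean_frequency(sentence):
    words = sentence.lower().split()
    # Out-of-vocabulary words contribute 0 (simplifying assumption).
    return sum(itwac_freq.get(w, 0.0) for w in words) / max(len(words), 1)</preformat>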
        <p>Intuitively, a value approaching 1 characterizes highly frequent words in ITWaC: this suggests that they are well represented in the original LLM. Conversely, values closer to 0 characterize less represented sentences. Notice that only canonical sentences (AVG 0.38) are represented, although in a limited manner, in standard Italian texts. This result sheds light on the specific relationship between word frequencies and training: LLMs, particularly Camoscio, are more “confident” with words they encountered during pre-training or fine-tuning. It is noticeable that almost 50% of our test set words (adjectives, verbs, nouns) do not even occur in ITWaC and, in fact, they are also absent from any canonical sentence of the training set. Another issue lies in the pre-training data of these LLMs. Since most of the data is in English (over 88%) and non-canonical sentences are extremely rare in English, models like LLaMA or Camoscio have rarely encountered such data, leading to suboptimal performance. Moreover, the length of the sentence could be a factor influencing the performance of LLMs, specifically in poetry, in the ability to detect canonical or non-canonical sentences.</p>
        <p>4 https://www.sketchengine.eu/itwac-italian-corpus/</p>
        <p>Therefore, to achieve a more balanced evaluation, we merged the Training, Development, and Testing sets into a single dataset to balance the classes and ensure that the model learns to recognize non-canonical sentences. We then performed an N-Fold Cross-Validation (N = 5). Only the trained model was re-evaluated, and the results are presented in Table 2. We maintained the simple and informed Yes-Baseline for comparison and recomputed its performance. In this setting, the class distribution aligns again with the Training set. The fine-tuned Camoscio model now shows very good performance in distinguishing canonical sentences, achieving a Macro-F1 of 88% and a Micro-F1 of 90%.</p>
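        <p>The merged-dataset protocol corresponds to a standard stratified 5-fold split, as in this scikit-learn sketch; the sentences and labels are illustrative placeholders.</p>
        <preformat># Sketch: 5-fold cross-validation over the merged, class-balanced dataset.
from sklearn.model_selection import StratifiedKFold

merged = [f"frase {i}" for i in range(10)]  # illustrative sentences
labels = ["yes", "no"] * 5                  # illustrative balanced labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(merged, labels)):
    # Fine-tune on the training folds and evaluate on the held-out fold.
    print(fold, len(train_idx), len(test_idx))</preformat>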
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this study, we have shown the potential of Large Language Models, particularly the LLaMA architecture and its Italian adaptations, in distinguishing between canonical and non-canonical sentences in Italian. Our experiments indicate that models instruction-tuned specifically for Italian, such as Camoscio and LLaMAntino, exhibit a strong grasp of Italian syntax and can effectively handle diverse sentence structures. However, performance on this task is still penalized by the large portion of English data these models ingest during pre-training.</p>
      <p>The findings underscore the importance of tailored language models for specific languages and the benefits of incorporating extensive syntactic variations into training datasets. Future work should focus on expanding the training datasets with more diverse syntactic structures and improving model architectures to better capture the nuances of non-canonical sentences.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>Claudiu Daniel Hromei is a Ph.D. student enrolled in the National Ph.D. in Artificial Intelligence, XXXVII cycle, course on Health and life sciences, organized by the Università Campus Bio-Medico di Roma. We acknowledge financial support from the PNRR MUR project PE0000013-FAIR and support from Project ECS 0000024 Rome Technopole, CUP B83C22002820006, NRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-app-a">
      <title>A. Data Distribution and Limitations</title>
      <p>In assessing the data distribution disparities between languages in the pre-training phase of the LLaMA family models, we provide an illustrative breakdown in Table 3, where English accounts for nearly 90% of the data, while Italian is present in less than 1%.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption><p>Data distribution of the languages in the LLaMA pre-training data.</p></caption>
        <table>
          <thead>
            <tr><th>Code</th><th>Language</th><th>Percentage</th></tr>
          </thead>
          <tbody>
            <tr><td>en</td><td>English</td><td>89.70%</td></tr>
            <tr><td>unk</td><td>unknown</td><td>8.38%</td></tr>
            <tr><td>de</td><td>German</td><td>0.17%</td></tr>
            <tr><td>fr</td><td>French</td><td>0.16%</td></tr>
            <tr><td>sv</td><td>Swedish</td><td>0.15%</td></tr>
            <tr><td>zh</td><td>Chinese</td><td>0.13%</td></tr>
            <tr><td>es</td><td>Spanish</td><td>0.13%</td></tr>
            <tr><td>ru</td><td>Russian</td><td>0.13%</td></tr>
            <tr><td>nl</td><td>Dutch</td><td>0.12%</td></tr>
            <tr><td>it</td><td>Italian</td><td>0.11%</td></tr>
            <tr><td>ja</td><td>Japanese</td><td>0.10%</td></tr>
            <tr><td>pl</td><td>Polish</td><td>0.09%</td></tr>
            <tr><td>pt</td><td>Portuguese</td><td>0.09%</td></tr>
            <tr><td>vi</td><td>Vietnamese</td><td>0.08%</td></tr>
            <tr><td>uk</td><td>Ukrainian</td><td>0.07%</td></tr>
            <tr><td>ko</td><td>Korean</td><td>0.06%</td></tr>
            <tr><td>ca</td><td>Catalan</td><td>0.04%</td></tr>
            <tr><td>sr</td><td>Serbian</td><td>0.04%</td></tr>
            <tr><td>id</td><td>Indonesian</td><td>0.03%</td></tr>
            <tr><td>cs</td><td>Czech</td><td>0.03%</td></tr>
            <tr><td>fi</td><td>Finnish</td><td>0.03%</td></tr>
            <tr><td>hu</td><td>Hungarian</td><td>0.03%</td></tr>
            <tr><td>no</td><td>Norwegian</td><td>0.03%</td></tr>
            <tr><td>ro</td><td>Romanian</td><td>0.03%</td></tr>
            <tr><td>bg</td><td>Bulgarian</td><td>0.02%</td></tr>
            <tr><td>da</td><td>Danish</td><td>0.02%</td></tr>
            <tr><td>hr</td><td>Croatian</td><td>0.01%</td></tr>
            <tr><td>sl</td><td>Slovenian</td><td>0.01%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Among the limitations of the proposed model, the computational costs associated with training a model like LLaMA are undoubtedly significant, requiring hundreds of hours on a GPU. We have implemented methods to streamline this process, but the computational expenditure for training on a 16GB GPU remains high. This becomes even more pronounced considering the model’s sentence processing time, which is slightly less than half a second per sentence. Given the computational power required to run the model, this duration is relatively long. Regarding the model’s application, since it heavily relies on an LLM, it might be susceptible to hallucination, generating non-existent sentences or fragments. However, during inference (few-shot or training), it seems to always answer in the requested format, very rarely (especially in 0-shot) adding some explanation for its decision after a Yes or No.</p>
    </sec>
    <sec id="sec-app-b">
      <title>B. Error Analysis</title>
      <p>In this section, we present a simple Error Analysis with two different cases: i) a sentence from the Development set, which should reflect the distribution of the training data for the models introduced in Section 3.2; ii) a sentence from the poetry domain that is radically different from the training data. We then report the answer of each model, specifying the modality (in-context learning or training) and, where applicable, the number of shots used for inference.</p>
      <p>As a first example, consider “Difficile tenersi in quel cammino”5, which is non-canonical as the main verb “è” is missing. The models answered as follows:</p>
      <list list-type="bullet">
        <list-item><p>LLaMA1 0s: canonical</p></list-item>
        <list-item><p>LLaMA1 1s: canonical</p></list-item>
        <list-item><p>LLaMA2 0s: canonical</p></list-item>
        <list-item><p>LLaMA2 1s: canonical</p></list-item>
        <list-item><p>ExtremITA 0s: canonical</p></list-item>
        <list-item><p>ExtremITA 1s: non-canonical</p></list-item>
        <list-item><p>LLaMAntino 0s: canonical</p></list-item>
        <list-item><p>LLaMAntino 1s: non-canonical</p></list-item>
        <list-item><p>Camoscio 0s: non-canonical</p></list-item>
        <list-item><p>Camoscio 1s: non-canonical</p></list-item>
        <list-item><p>BERT FT: non-canonical</p></list-item>
        <list-item><p>Camoscio FT: non-canonical</p></list-item>
      </list>
      <p>5 In English: “(It’s) Hard to keep in that path.”</p>
      <p>This example is interesting because all the Italian-adapted models answered correctly in some modality (1-shot or fine-tuned), thus recognizing that the sentence was missing the main verb, given the initial prompt. Notice that only Camoscio answered correctly in both the 0-shot and the 1-shot settings.</p>
      <p>As a second and more difficult example, consider the sentence “Zacinto mio che te specchi nell’onde del greco mar da cui vergine nacque Venere”6, taken from the poetry test set. This example is very hard to comprehend, as some words are very rare in spoken or written Italian (nell’onde), the uncommon te is used to express that the city is actively mirroring itself in the sea, and the order of the last words is reversed. In this case, all the models answered that the sentence is non-canonical, recognizing the strange structure of the sentence, except for BERT FT, which classified this sentence as canonical.</p>
      <p>6 In English: “My Zacinto that you mirror in the waves of the Greek sea where virgin was born Venus from.”</p>
      </sec>
    <sec id="sec-app-c">
      <title>C. Typical Non-Canonical Structures</title>
      <p>In this section, we report a list of typical non-canonical structures as an example of the complexity the models are dealing with.</p>
      <list list-type="order">
        <list-item><p>Inversion of the complete argument, where the complement is fronted and the subject follows the verb.</p></list-item>
        <list-item><p>Subject inversion, positioning the subject after the main verb.</p></list-item>
        <list-item><p>Fronting of the object, moving the object to the beginning of the sentence before the subject.</p></list-item>
        <list-item><p>Extraction of the object from an infinitival clause, placing it at the beginning of the sentence.</p></list-item>
        <list-item><p>Preposing of a prepositional adjunct from a participial clause, moving the prepositional complement of a past participle to a position before the verb.</p></list-item>
        <list-item><p>Leftward extraction of the lexical verb, where the untensed, non-finite main verb precedes the auxiliary or modal verb.</p></list-item>
        <list-item><p>Right dislocation of the subject, placing the subject after the complements of the sentence.</p></list-item>
        <list-item><p>Fronting of both the subject and the object, positioning them before the main verb, with the subject preceding the object.</p></list-item>
        <list-item><p>Fronting of a prepositional specification, often introduced by “of”, extracting it from the noun phrase and positioning it at the front.</p></list-item>
        <list-item><p>Right dislocation of the clitic, where a clitic pronoun attached to the main verb corefers to an object noun phrase positioned later in the sentence.</p></list-item>
        <list-item><p>Right dislocation of the object, placing the object after indirect objects, adjuncts, or an inverted subject.</p></list-item>
        <list-item><p>Insertion of parentheticals or adjuncts between the subject and the main verb.</p></list-item>
        <list-item><p>Rightward extraction of the adjective from the noun phrase, positioning it after any noun adjuncts.</p></list-item>
        <list-item><p>Right stranding of a prepositional specification, such as “of”, leaving it at the end of the sentence, separate from the noun phrase.</p></list-item>
        <list-item><p>Rightward extraction of the lexical verb, positioning the untensed, non-finite main verb after the complements of the sentence.</p></list-item>
        <list-item><p>Right stranding of the predicate’s head noun, leaving it after two adjuncts.</p></list-item>
      </list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] R. Delmonte, Syntax and semantics of Italian poetry in the first half of the 20th century, 2018. URL: https://arxiv.org/abs/1802.03712. arXiv:1802.03712.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] R. Delmonte, Cognitive Models of Poetry Reading, Springer International Publishing, Cham, 2021, pp. 1–39. URL: https://doi.org/10.1007/978-3-030-44982-7_19-4. doi:10.1007/978-3-030-44982-7_19-4.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] R. Delmonte, Recursion and Ambiguity: A Linguistic and Computational Perspective, 2015, pp. 257–284. doi:10.1007/978-3-319-08043-7_15.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] J. Bresnan, The Mental Representation of Grammatical Relations, The MIT Press, Cambridge, 1982.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] J. Bresnan, Lexical-Functional Syntax, Blackwell Publishing, Oxford, 2001.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] G. Ward, B. Birner, Information Structure and Non-canonical Syntax, 2008, pp. 152–174. doi:10.1002/9780470756959.ch7.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] R. Delmonte, N. Busetto, Measuring similarity by linguistic features rather than frequency, in: H. Bunt (Ed.), Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022, European Language Resources Association, Marseille, France, 2022, pp. 42–52. URL: https://aclanthology.org/2022.isa-1.6.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] C. Bosco, V. Lombardo, D. Vassallo, L. Lesmo, Building a treebank for Italian: a data-driven annotation schema, in: M. Gavrilidou, G. Carayannis, S. Markantonatou, S. Piperidis, G. Stainhauer (Eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00), European Language Resources Association (ELRA), Athens, Greece, 2000. URL: http://www.lrec-conf.org/proceedings/lrec2000/pdf/220.pdf.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] C. Bosco, A. Mazzei, V. Lombardo, G. Attardi, A. Corazza, A. Lavelli, L. Lesmo, G. Satta, M. Simi, Comparing Italian parsers on a common treebank: the EVALITA experience, in: N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, D. Tapias (Eds.), Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), European Language Resources Association (ELRA), Marrakech, Morocco, 2008. URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/528_paper.pdf.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] S. Montemagni, F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli, F. Fanciulli, M. Massetani, R. Rafaelli, R. Basili, M. T. Pazienza, D. Saracino, F. Zanzotto, N. Mana, F. Pianesi, R. Delmonte, Building the Italian Syntactic-Semantic Treebank, Springer Netherlands, Dordrecht, 2003, pp. 189–210. URL: https://doi.org/10.1007/978-94-010-0201-1_11. doi:10.1007/978-94-010-0201-1_11.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] T. Chung, M. Post, D. Gildea, Factors affecting the accuracy of Korean parsing, in: D. Seddah, S. Koebler, R. Tsarfaty (Eds.), Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, Association for Computational Linguistics, Los Angeles, CA, USA, 2010, pp. 49–57. URL: https://aclanthology.org/W10-1406.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] C. Bosco, A. Mazzei, A. Lavelli, Looking back to the EVALITA constituency parsing task: 2007-2011, in: B. Magnini, F. Cutugno, M. Falcone, E. Pianta (Eds.), Evaluation of Natural Language and Speech Tools for Italian, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 46–57.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] T. Paccosi, A. Palmero Aprosio, S. Tonelli, It is MarkIT that is new: An Italian treebank of marked constructions, in: CLiC-it 2021 - Italian Conference on Computational Linguistics, 2022.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of NAACL 2019, 2019, pp. 4171–4186.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019).</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: http://arxiv.org/abs/1908.10084.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training, 2018.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, CoRR abs/2005.14165 (2020).</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020) 140:1–140:67. URL: http://jmlr.org/papers/v21/20-074.html.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019).</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-Task Sustainable Scaling to Large Language Models at its Extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] C. D. Hromei, D. Croce, R. Basili, U-DepPLLaMA: Universal Dependency Parsing via Auto-regressive Large Language Models, IJCoL 10 (2024). URL: http://journals.openedition.org/ijcol/1352.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] A. Santilli, E. Rodolà, Camoscio: an Italian Instruction-tuned LLaMA, 2023. URL: https://arxiv.org/abs/2307.16456. arXiv:2307.16456.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language, 2023. URL: https://arxiv.org/abs/2312.09993. arXiv:2312.09993.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, X. Sun, L. Li, Z. Sui, A survey on in-context learning, 2024. URL: https://arxiv.org/abs/2301.00234. arXiv:2301.00234.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] D. Croce, A. Moschitti, R. Basili, Structured lexical similarity via convolution kernels on dependency trees, in: R. Barzilay, M. Johnson (Eds.), Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Edinburgh, Scotland, UK, 2011, pp. 1034–1046. URL: https://aclanthology.org/D11-1096.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] R. Delmonte, A. Bristot, S. Tonelli, VIT - Venice Italian Treebank: Syntactic and Quantitative Features, in: Proc. Sixth International Workshop on Treebanks and Linguistic Theories, volume 1, Nealt Proc. Series, 2007, pp. 43–54. URL: https://catalog.elra.info/en-us/repository/browse/ELRA-W0324/.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>