<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">BEEP -BEst DrivEr&apos;s License Performer: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Fabio</forename><surname>Mercorio</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Dept of Statistics and Quantitative Methods</orgName>
								<orgName type="institution">University of Milano Bicocca</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">CRISP Research Centre crispresearch.eu</orgName>
								<orgName type="institution">University of Milano Bicocca</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Daniele</forename><surname>Potertì</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Dept of Economics, Management and Statistics</orgName>
								<orgName type="institution">University of Milano Bicocca</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Antonio</forename><surname>Serino</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Dept of Economics, Management and Statistics</orgName>
								<orgName type="institution">University of Milano Bicocca</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Andrea</forename><surname>Seveso</surname></persName>
							<email>andrea.seveso@unimib.it</email>
							<affiliation key="aff0">
								<orgName type="department">Dept of Statistics and Quantitative Methods</orgName>
								<orgName type="institution">University of Milano Bicocca</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">CRISP Research Centre crispresearch.eu</orgName>
								<orgName type="institution">University of Milano Bicocca</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">BEEP -BEst DrivEr&apos;s License Performer: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C49B4B5D7729CC4EBBB4606EC5701BA4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Benchmarks</term>
					<term>CALAMITA</term>
					<term>CLiC-it</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present BEEP (BEst DrivEr's License Performer), a benchmark challenge to evaluate large language models in the context of a simulated Italian driver's license exam. This challenge tests the models' ability to understand and apply traffic laws, road safety regulations, and vehicle-related knowledge through a series of true/false questions. The dataset is derived from official ministerial materials used in the Italian licensing process, specifically targeting Category B licenses. We evaluate models such as LLaMA and Mixtral across multiple categories. In addition, we simulate a driving license test to assess the models' real-world applicability, where the pass rate is determined based on the number of errors allowed. While scaling up model size improved performance, even larger models struggled to pass the exam consistently. The challenge demonstrates the capabilities and limitations of LLMs in handling real-world, high-stakes scenarios, providing insights into their practical use and areas for further improvement.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Challenge: Introduction and Motivation</head><p>In recent years, Large Language Models (LLMs) have become a significant breakthrough in Natural Language Processing (NLP) and Artificial Intelligence (AI) <ref type="bibr" target="#b0">[1]</ref>. Assessing model performance is crucial yet challenging, involving multiple critical attributes: models must be precise, resilient, fair, and efficient, among other characteristics <ref type="bibr" target="#b1">[2]</ref>. Developing effective models in underrepresented languages such as Italian is a continuing challenge <ref type="bibr" target="#b2">[3]</ref>. This disparity arises from limited and lower-quality data <ref type="bibr" target="#b3">[4]</ref> and a development process often prioritising Anglocentric perspectives <ref type="bibr" target="#b4">[5]</ref>. Recently, there has been a surge in research aimed at making LLMs more culturally inclusive, moving beyond mere multilingualism to address deeper cultural contexts <ref type="bibr" target="#b5">[6]</ref>. For instance, a structured benchmark utilising the INVALSI tests-well-established assessments measuring educational competencies across Italy-represents one such effort to embed culturally relevant content in model evaluation <ref type="bibr" target="#b6">[7]</ref>.</p><p>This work is part of CALAMITA <ref type="bibr" target="#b7">[8]</ref> (Challenge the Abilities of LAnguage Models in ITAlian), an initiative launched by AILC, the Italian Association for Computational Linguistics. CALAMITA aims to develop a comprehensive and evolving benchmark for evaluating the capabilities of LLMs in Italian. The goal is to establish a shared platform with a suite of tasks and a live leaderboard, allowing for ongoing assessments of Italian and multilingual LLMs. CALAMITA seeks to build this benchmark through community-driven challenges, inviting researchers to propose tasks and datasets that evaluate specific aspects of LLMs' performance in Italian. This paper contributes to this collaborative effort by presenting a benchmark that assesses LLMs' ability to comprehend and apply Italian driving regulations, forming one of the initial tasks in this evolving benchmark.</p><p>This challenge evaluates LLM's ability to comprehend and apply knowledge in a practical, real-world scenario. While LLMs have shown remarkable capabilities in understanding and generating human language, their effectiveness in real-world decision-making scenarios remains underexplored, especially in languages such as Italian. This challenge tests whether these models can perform effectively in a linguistically demanding and contextually rich domain. Success in this challenge would demonstrate the model's ability to generalise language understanding to practical tasks, a crucial step towards their broader application in everyday life. simulated driver's license exam in Italian. This task requires a deep understanding of traffic laws and reasoning through driving situations.</p><p>In Italy, obtaining a driver's license is a structured process involving theoretical and practical assessments to ensure drivers are well-versed in road safety, traffic regulations, and practical driving skills. 
The Italian driver's license process is governed by strict rules set forth by the Ministero delle Infrastrutture e dei Trasporti (Ministry of Infrastructure and Transport), and the license is recognised across the European Union.</p><p>Italy offers several categories of driver's licenses, depending on the type of vehicle a person wishes to operate. We focus on Category B, which is required for cars (up to 3.5 tons) and vehicles with up to 8 seats.</p><p>The theoretical exam, required alongside the practical exam, is a crucial step in obtaining a driver's license in Italy. It assesses the applicant's knowledge of traffic laws, road signs, and driving regulations. It consists of multiple-choice questions and is typically administered electronically. The candidate must understand traffic regulations, road signs, driving behaviour, and vehicle maintenance. A Category B license test typically consists of 30 questions; a candidate passes with at most 3 errors.</p><p>The licensing process is not just about learning the rules; it requires candidates to internalise and apply them practically. BEEP reflects this focus on real-world application and safety. The Italian driving system also emphasises road etiquette and the ability to navigate complex traffic situations, particularly in high-density urban areas. Consequently, the challenge aims to mirror this complexity in evaluating LLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Origin of data</head><p>BEEP is derived from the publicly accessible PDF "Listato A e B", which includes all quiz questions related to Italian driver's license examinations provided by the official ministerial listing <ref type="foot" target="#foot_0">1</ref> . The quizzes consist of true or false questions for driving license categories A and B, with data updated as of 01/07/2020.</p><p>We extracted the data from the official PDF file. The text is segmented by identifying distinct patterns indicating the start of new questions and sections. These segments are classified into predefined categories and sub-categories. For each text segment, relevant metadata, question types (e.g., true/false) and related image numbers are extracted and compiled into a structured format. The final dataset is exported, offering a well-organised collection of questions for the evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data format</head><p>The dataset is formatted with the following columns: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Example of prompts used</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Question</head><p>The road can be divided into lanes. We exclusively employed the zero-shot setting in our evaluation process, where no prior examples were provided. An illustrative example of a prompt used in this setting is shown in Figure <ref type="figure" target="#fig_0">1</ref>, which demonstrates the structure and input format supplied to the model. The decision to have the language model answer with '[letter]' rather than simply 'letter' or 'True/False' is due to our use of pattern matching for response extraction. By enforcing a consistent answer format with brackets, we</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Options</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>An overview of the dataset categorised by major and minor traffic-related topics. The columns display the number of entries, the percentage of those entries containing figures, and the proportion of correct answers for each category.  can reliably parse responses, reducing ambiguity and ensuring that variations in phrasing or formatting do not interfere with accurate evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Category</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Detailed data statistics</head><p>The questions are organised into the categories described in Tab. 1. This table summarises statistics across various road safety and vehicle regulation categories, providing detailed insight into major and minor classifications. Each entry in the table is categorised into broad Major Categories such as "DOCUMENTS, " "Vehicle Equipment, " and "Road Signage, " which are further subdivided into more specific Minor Categories. For example, the major category "DOCUMENTS" includes the minor category "Mandatory Documents, Agents, and License Plates," highlighting different aspects of document requirements and administrative details.</p><p>We also include figures associated with specific questions, particularly those addressing traffic signals, road signs, and right-of-way scenarios. These visual elements provide additional context and enhance the comprehension of complex traffic situations. However, for the CALAMITA challenge, we opted not to include questions containing figures, focusing solely on text-based questions. This decision ensured that the evaluation of LLMs remains centred on their language comprehension, knowledge and reasoning abilities rather than visual processing capabilities. Including images would limit participation to multimodal models, excluding many language models that cannot process visual information. By using only text, we maintain a broader, more accessible benchmark.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>Since the dataset comprises questions that can only be answered with true and false, we involved the Overall Accuracy to evaluate the models' answers in our task. Overall accuracy is commonly used in classification tasks, particularly in true-false or binary decision evaluations <ref type="bibr" target="#b8">[9]</ref>. It measures the proportion of all correct predictions (true positives and negatives) out of the total number of predictions made. In other words, it quantifies how well a binary classification system performs by indicating the fraction of correctly classified instances (both positive and negative classes) relative to the total number of instances evaluated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Overall accuracy of selected models, ranging from LLaMA to Mixtral, demonstrating their performance on the dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model</head><p>Overall Accuracy llama-3-8b-instruct 56.27% llama-3-70b-instruct 77.23% mixtral-8x7b-instruct 77.19% mixtral-8x22b-instruct 83.29%</p><p>Table <ref type="table">3</ref> shows the Overall Accuracy obtained by LLAMA3 8B -Instruct<ref type="foot" target="#foot_1">2</ref> and others State of the Art models. We evaluate the metrics on the portion of our dataset that does not require image processing operations. The scaling laws hold as it is observed that performance increases with the number of parameters.</p><p>Table <ref type="table" target="#tab_2">2</ref> shows the Overall Accuracy stratified by Major Category for each tested model. Models perform better in the "SAFETY AND POLLUTION", "FIRST AID", and "AC-CIDENTS AND INSURANCE" categories. This may be possible given the generality of these major categories, as opposed to more niche categories such as 'DOCUMENTS' or 'VEHICLE EQUIPMENT', where the performance is worse.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Simulated Driving License Test</head><p>We also test the models by simulating a proper driving licence exam, following the appropriate official guidelines and creating a new indicator. We sampled 1000 samples of 30 questions from the dataset, ensuring each sample was unique. We then counted the correct and incorrect answers for each sample and each evaluated model. The guidelines state that the test is passed if the number of wrong answers is less than or equal to 3. Therefore, we built an indicator for each model that considered the percentage of driving licence exams passed, related to the number of examinations attempted. The results are shown in Tab. 4. As expected, smaller models made many mistakes on average (around 13), which was fatal as it never passed the test in any of the attempts. Even larger models like Mixtral-8x22b did not perform well in most cases. However, we believe more advanced models, such as GPT-4, might succeed more reliably. It is important to note that this simulated test is not integral to the CALAMITA benchmark. While it provides additional insights into the models' performance in a high-stakes, applied setting, the official evaluation metric focuses solely on overall accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>Considering state-of-the-art LLMs, it is possible that one's training sets are contaminated with examples from the U.S. driving licence test and that these may influence performance on our benchmark. Furthermore, although the benchmark allows the real driving licence test to be reproduced, it can only assess true-or-false binary answers and not dialogue or reasoning ability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>Although the models may demonstrate positive performance in this benchmark, it is crucial to recognise that such results do not equate to an actual ability to drive or navigate safely in real-world environments. The benchmark assesses the models' ability to process and understand driving-related questions, a far cry from the complex task of driving a vehicle, which requires perception, decision-making and real-time motor control.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Data license and copyright issues</head><p>The data are publicly available online and not subject to copyright restrictions.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An example question, with instructions and a correct answer highlighted.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Overall accuracy of different models across major dataset categories, allowing for comparison of their effectiveness within these distinct areas.</figDesc><table><row><cell>Category</cell><cell>llama-3-8b</cell><cell>llama-3-70b</cell><cell>mixtral-8x7b</cell><cell>mixtral-8x22b</cell></row><row><cell>DOCUMENTS</cell><cell>53.26%</cell><cell>66.28%</cell><cell>67.43%</cell><cell>79.69%</cell></row><row><cell>VEHICLE EQUIPMENT</cell><cell>51.97%</cell><cell>66.45%</cell><cell>71.71%</cell><cell>75.00%</cell></row><row><cell>VEHICLES</cell><cell>51.89%</cell><cell>77.36%</cell><cell>82.08%</cell><cell>84.91%</cell></row><row><cell>THE MOTOR VEHICLE</cell><cell>56.13%</cell><cell>82.61%</cell><cell>82.21%</cell><cell>86.56%</cell></row><row><cell>ACCIDENTS AND INSURANCE</cell><cell>59.22%</cell><cell>85.78%</cell><cell>85.49%</cell><cell>91.15%</cell></row><row><cell>THE ROAD</cell><cell>51.72%</cell><cell>70.94%</cell><cell>71.92%</cell><cell>81.77%</cell></row><row><cell>RULES OF CONDUCT</cell><cell>54.36%</cell><cell>71.11%</cell><cell>70.34%</cell><cell>76.85%</cell></row><row><cell>FIRST AID</cell><cell>61.46%</cell><cell>90.62%</cell><cell>86.46%</cell><cell>88.54%</cell></row><row><cell>ROAD SIGNAGE</cell><cell>37.50%</cell><cell>75.00%</cell><cell>100.00%</cell><cell>100.00%</cell></row><row><cell>SAFETY AND POLLUTION</cell><cell>65.31%</cell><cell>88.57%</cell><cell>85.71%</cell><cell>88.57%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Driving license Metrics of the Selected Models</figDesc><table><row><cell>Model</cell><cell>Total Tests Passed (%)</cell><cell>Avg Errors (Std.)</cell></row><row><cell>llama-3-8b-instruct</cell><cell>0/1000 (0%)</cell><cell>13.17 (±2.71)</cell></row><row><cell>llama-3-70b-instruct</cell><cell>64/1000 (6.4%)</cell><cell>6.88 (±2.65)</cell></row><row><cell>mixtral-8x7b-instruct</cell><cell>61/1000 (6.1%)</cell><cell>6.79 (±2.24)</cell></row><row><cell>mixtral-8x22b-instruct</cell><cell>258/1000 (25.8%)</cell><cell>5.01 (±2.09)</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Visit ListatoAB for more information at https://www.neca.it/assets/ pdf/ListatoAB.pdf.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank Thomas Passera for providing the initial code for the dataset's extraction. Evaluation of the opensource models was conducted on Leonardo supercomputer with the support of CINECA-Italian Super Computing Resource Allocation, class C project IsCb7_LLM-EVAL (HP10CIO7T9).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">A survey on evaluation of large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/2307.03109.arXiv:2307.03109" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2211.09110</idno>
		<title level="m">Holistic evaluation of language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Xtreme-r: Towards more challenging and nuanced multilingual evaluation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Botha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddhant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garrette</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Quality at a glance: An audit of web-crawled multilingual datasets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kreutzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Caswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wahab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Van Esch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ulzii-Orshikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tapo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Subramani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sokolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sikasote</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="50" to="72" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">You reap what you sow: On the challenges of bias evaluation under multilingual settings</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Talat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biderman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Clinciu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Longpre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Luccioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Masoud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Radev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BigScience Episode# 5-Workshop on Challenges &amp; Perspectives in Creating Large Language Models</title>
				<meeting>BigScience Episode# 5-Workshop on Challenges &amp; Perspectives in Creating Large Language Models</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="26" to="41" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Pawar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Myung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">G</forename><surname>Haznitrama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</author>
		<title level="m">Survey of cultural awareness in language models: Text and beyond</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Mercorio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mezzanzanica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Potertì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Serino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Seveso</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.17535</idno>
		<title level="m">Disce aut deficere: Evaluating llms proficiency on the invalsi italian benchmark</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4 -December 6, 2024. 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Pattern recognition and machine learning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Bishop</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Springer google schola</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="1122" to="1128" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
