<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BLiMP-IT: Harnessing Automatic Minimal Pair Generation for Italian Language Model Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matilde Barbini</string-name>
          <email>matilde.barbini@iusspavia.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Letizia Piccini Bianchessi</string-name>
          <email>letizia.piccinibianchessi@iusspavia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronica Bressan</string-name>
          <email>veronica.bressan@iusspavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achille Fusco</string-name>
          <email>achille.fusco@iusspavia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Neri</string-name>
          <email>sofia.neri@iusspavia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarah Rossi</string-name>
          <email>sarah.rossi@iusspavia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Sgrizzi</string-name>
          <email>tommaso.sgrizzi@iusspavia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <email>cristiano.chesi@iusspavia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Linguistics and Comparative Cultural Studies, Ca' Foscari University of Venice</institution>
          ,
          <addr-line>Fondamenta Tofetti 1075, 30123 Venice</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>EPFL Lausanne Doctoral Program Digital Humanities EDDH - Social Computing Group</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NeTS Lab, IUSS Pavia, Palazzo del Broletto. Piazza della Vittoria</institution>
          ,
          <addr-line>15 - 27100 Pavia</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University School for Advanced Studies IUSS Pavia, Palazzo del Broletto. Piazza della Vittoria</institution>
          ,
          <addr-line>15 - 27100 Pavia</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Florence</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In this work we introduce the automatically generated dataset in BLiMP-IT, a novel benchmark for evaluating Italian language models based on minimal pairs (i.e. sentence pairs that differ only in a critical morphosyntactic aspect). Drawing inspiration from the success of BLiMP for English, BLiMP-IT combines and adapts several existing resources, including COnVERSA, AcCompl-it, and BLiMP, to construct a high-quality evaluation dataset for Italian. We present an automatic methodology for generating the evaluation items by leveraging a large Italian corpus for lexicon extraction, POS tagging, and animacy annotation. Our approach not only ensures coverage of diverse morphosyntactic phenomena (e.g., agreement and inflection, verb class, non-local dependencies) but also scales the creation of minimal pairs to automatically expand the items of the evaluation benchmark. BLiMP-IT demonstrates that an automated pipeline for generating minimal pairs to evaluate LMs is both feasible and effective, ensuring comprehensive coverage of diverse morphosyntactic phenomena in Italian while reducing reliance on manual annotation.</p>
      </abstract>
      <kwd-group>
        <kwd>Computational Linguistics</kwd>
        <kwd>Automatic Sentence Generation</kwd>
        <kwd>Language Model Evaluation</kwd>
        <kwd>Linguistic Benchmarks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The development of benchmarks and datasets for the linguistic evaluation of Language Models (LMs) in a specific language is essential for a systematic assessment of their ability to handle its morphosyntactic structures. Given cross-linguistic variation in inflectional morphology, syntactic configurations, agreement mechanisms, and word order flexibility, language models often exhibit differential performance depending on the structural properties of the target language. A dedicated evaluation framework allows for rigorous analysis of morphosyntactic accuracy, including the handling of inflectional paradigms, syntactic dependencies, agreement constraints, and constituent ordering, providing a comprehensive assessment of a model’s grammatical competence. In linguistic theory, acceptability judgments have often been defined as the main empirical method used to access human linguistic competence and language acquisition [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. This methodology has also proven to be a classical and reliable tool for assessing the linguistic capabilities of LMs across various linguistic phenomena [<xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>]. A common methodology is the employment of minimal pairs: pairs of sentences differing minimally in their structure, with one being grammatically acceptable and the other unacceptable. An effective LM should assign higher probabilities to grammatically acceptable sentences than to their unacceptable counterparts. Alternatively, it can be evaluated by presenting a series of sentences, both grammatical and ungrammatical, and requiring the model to perform a binary acceptability classification. While benchmarks such as BLiMP have provided valuable insights for English, the lack of analogous resources for Italian poses a challenge for multilingual NLP and for an effective and comprehensive evaluation of these models. We address this gap by introducing BLiMP-IT 1, a benchmark specifically designed for Italian. Our contributions are twofold:</p>
      <p>• Resource Adaptation and Assembly: We construct BLiMP-IT by integrating and adapting existing Italian and English datasets and benchmarks for the linguistic evaluation of LMs, within a minimal pairs’ framework.</p>
      <p>• Automatic Minimal Pair Generation: We develop an automated pipeline for generating minimal pairs by extracting a detailed lexicon from a large Italian corpus, tagging it with linguistic information (e.g., POS, UPOS, animacy), and systematically mapping various linguistic phenomena to unique sequence tags, to produce both grammatical and ungrammatical sentence pairs (i.e. minimal pairs) 2.</p>
      <p>In this work, we focus on the automatic pipeline component of the BLiMP-IT resource, providing a comprehensive description of its operational workflow.</p>
      <p>1 Forthcoming in Proceedings of GLOW 47. The resources for BLiMP-IT can be found at https://nets-lab.github.io/blimpit/</p>
      <p>2 The automatically generated resources, as well as a flowchart describing the process, can be accessed at https://nets-lab.github.io/blimpit-generation/. Please note that these data are provisional and subject to ongoing generation and refinement.</p>
      <sec id="sec-1-1">
        <title>2. Related work</title>
        <p>Large Language Models (LLMs) have sparked an ongoing debate about whether they develop genuine linguistic competence or rely primarily on spurious statistical generalizations [<xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>]. This fundamental question is complicated by LLMs’ opacity in processing language patterns and their tendency to conflate world knowledge with morphosyntactic competence [<xref ref-type="bibr" rid="ref9">9</xref>]. While some interpret LLMs’ performance with complex grammatical configurations as evidence against the Poverty of Stimulus hypothesis [<xref ref-type="bibr" rid="ref10">10</xref>], critics note that such results depend on dramatically oversized training data compared to child language acquisition [<xref ref-type="bibr" rid="ref11">11</xref>]. Moreover, higher performance on increasingly specific tasks does not always correspond to genuine gains in linguistic understanding [<xref ref-type="bibr" rid="ref12">12</xref>], suggesting that standard performance metrics may inadequately capture linguistic competence [<xref ref-type="bibr" rid="ref1">1</xref>]. Within this context, developing linguistically informed benchmarks has become crucial for evaluating model performance and the nature of their competence [<xref ref-type="bibr" rid="ref13">13</xref>]. The evaluation of language models via acceptability judgments and minimal pairs has a long-standing history in theoretical and computational linguistics. Recent benchmarks such as BLiMP [<xref ref-type="bibr" rid="ref14">14</xref>] and CLiMP [<xref ref-type="bibr" rid="ref15">15</xref>] have demonstrated the value of this approach, while recent shared tasks have highlighted how small-sized training regimes (10-100M tokens) can achieve relatively good results on various linguistic benchmarks including BLiMP and CoLA [<xref ref-type="bibr" rid="ref14 ref16">16, 14</xref>]. However, even the most performant architectures show that improvement with additional training often yields diminishing returns in psycholinguistic terms [<xref ref-type="bibr" rid="ref17">17</xref>]. Recent work capitalizing on the BabyLM Challenge in English [<xref ref-type="bibr" rid="ref18">18</xref>] and similar tasks in Italian [<xref ref-type="bibr" rid="ref19">19</xref>] has stressed the importance of adopting appropriate linguistic benchmarks to meaningfully challenge the Poverty of Stimulus hypothesis. For Italian specifically, resources like Laccolith [<xref ref-type="bibr" rid="ref20">20</xref>] and AcCompl-it [<xref ref-type="bibr" rid="ref21">21</xref>] have targeted acceptability judgments through binary and rating-based methods. However, there remains a need for a comprehensive Italian benchmark that harnesses the minimal pairs framework, a gap that BLiMP-IT aims to fill.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. BLiMP-IT Dataset Construction</title>
        <sec id="sec-1-2-1">
          <title>3.1. Minimal Pairs Framework</title>
          <p>The minimal pairs framework adopted in BLiMP-IT centers on constructing sentence pairs that differ only in a critical grammatical feature. One sentence in the pair is grammatically acceptable, while the other violates a specific morphosyntactic rule. This approach builds on previous work in linguistic evaluation, notably the BLiMP benchmark for English (e.g., [<xref ref-type="bibr" rid="ref14">14</xref>]), and provides a fine-grained measure of a language model’s sensitivity to subtle grammatical contrasts. Minimal pairs serve as a fine-grained diagnostic tool: by presenting a model with two sentences that are identical except for one grammatical feature, researchers can assess whether the model is sensitive to the relevant linguistic distinction. For example, in the case of subject-verb agreement, a model should consistently assign higher probability or acceptability to the correct agreement form (e.g., "La ragazza mangia la mela" vs. "La ragazza mangiano la mela") 3. This controlled setup eliminates confounding variables and allows precise measurement of model performance on particular phenomena. To ensure interpretability and reproducibility, BLiMP-IT constructs minimal pairs based on abstract tag templates that encode both grammatical and ungrammatical structures. These templates are manually designed and systematically mapped to lexical entries drawn from a linguistically annotated corpus. The use of tag-based generation not only facilitates large-scale pair creation but also guarantees that the only difference between the sentences in a pair is the grammatical target under investigation. The minimal pairs are organized around four major categories of morphosyntactic phenomena: Agreement and Inflection, Verb Class and Argument Structure, Pronouns, and Non-local Dependencies. Each pair is associated with a specific sub-phenomenon (e.g., determiner-noun agreement, reflexive clitic placement, long-distance wh-dependencies), enabling detailed evaluation across diverse syntactic domains. In designing these pairs, particular attention was paid to structural symmetry, lexical consistency, and plausibility. Sentences were constructed to be semantically neutral where possible, to avoid introducing biases unrelated to the grammatical phenomenon. This was especially important for more complex structures, such as those involving coordination or wh-movement, where maintaining interpretability across grammatical and ungrammatical variants can be challenging. Finally, minimal pair evaluation supports both probabilistic scoring (e.g., comparing log-likelihoods assigned by a language model) and binary classification tasks, such as acceptability judgments. This flexibility allows BLiMP-IT to be used with a wide range of language models and evaluation metrics, aligning with the goals of interpretability and cross-model comparability.</p>
          <p>3 "The girl eats the apple" vs. *"The girl eat the apple"</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>3.2. BLiMP-IT: Integrated Resources</title>
          <p>BLiMP-IT encompasses 78 morphosyntactic phenomena, which are categorized into four main groups: Agreement and Inflection (including phenomena such as noun-determiner and subject-verb agreement), Verb Class and Argument Structure (addressing issues like auxiliary selection and θ-role assignment), Pronouns (focusing on clitics, reflexives, and person agreement), and Non-local Dependencies (encompassing long-distance dependencies and island effects).</p>
          <p>The dataset is constructed by integrating multiple existing Italian linguistic resources (and English resources in the case of BLiMP) while also incorporating newly created minimal pairs. Our sources include:</p>
          <p>• COnVERSA: a battery designed for assessing grammaticality through minimal pairs [<xref ref-type="bibr" rid="ref22">22</xref>].</p>
          <p>• AcCompl-it: an evaluation campaign component focused on acceptability and complexity judgments [<xref ref-type="bibr" rid="ref21">21</xref>].</p>
          <p>• BLiMP: a test set for evaluating the grammatical knowledge of English LLMs, featuring 67 minimal pair paradigms across 12 categories [<xref ref-type="bibr" rid="ref14">14</xref>].</p>
          <p>• New phenomena: a set of new linguistic phenomena such as ATB [<xref ref-type="bibr" rid="ref23">23</xref>] and parasitic gaps (inspired by [<xref ref-type="bibr" rid="ref24">24</xref>]).</p>
          <p>The adaptation process involved selecting phenomena that are central to Italian grammar (e.g., noun-determiner agreement, subject-verb agreement, verb argument structure, clitic usage, and non-local dependencies) and reformulating the examples to align with the minimal pairs methodology. For instance, items from English BLiMP, if compatible with and relevant for Italian morphosyntax, were carefully translated and restructured to account for Italian-specific syntactic and morphological features.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>4. BLiMP-IT: automated generation</title>
        <sec id="sec-1-3-1">
          <title>4.1. Corpus Creation for Lexicon Extraction</title>
          <p>A fundamental component of our automatic generation pipeline is the creation of a large, high-quality Italian dataset, initially developed to take part in the BabyLM challenge [<xref ref-type="bibr" rid="ref25">25</xref>], which consists of approximately 3 million tokens sourced from diverse resources and serves as the foundation for lexicon extraction. It is divided into five sections: child-directed speech (CHILDES Italian section), child movie subtitles (from OpenSubtitles), child songs (from the Zecchino D’Oro repository), telephone conversations (VoLIP corpus, [<xref ref-type="bibr" rid="ref26">26</xref>]), and fairy tales (from copyright-expired sources). After a cleaning process that removed metalinguistic annotations and children’s productions, the corpus contains 2,431,038 tokens with an overall Type-Token Ratio (TTR) of 0.03. The distribution of tokens across sections is as follows: CHILDES (346,155 tokens, TTR = 0.03), SUBTITLES (700,729 tokens, TTR = 0.05), CONVERSATIONS (58,039 tokens, TTR = 0.11), SONGS (222,572 tokens, TTR = 0.08), and FAIRY TALES (1,287,826 tokens, TTR = 0.05). Statistical analysis of the corpus ensures sufficient lexical diversity and coverage of the linguistic phenomena under investigation.</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>4.2. Lexicon Extraction and Linguistic Tagging</title>
          <p>We extract a lexicon from the corpus that captures key linguistic attributes for each word. First, we annotate words with both POS and UPOS tags using state-of-the-art taggers (spaCy). In addition, we manually labeled nouns with animacy information to address semantic nuances. This lexicon forms the basis for selecting appropriate words when generating minimal pairs.</p>
        </sec>
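        <p>The lexicon-building and corpus-statistics steps described above can be sketched as follows. This is an illustrative sketch only: the tagged triples and animacy labels below are invented stand-ins for the output of spaCy's Italian pipeline and for the manual annotation, and the data shapes are assumptions rather than the released BLiMP-IT code.</p>

```python
from collections import defaultdict

# Toy stand-in for tagger output: (form, UPOS, morphological features).
# In the actual pipeline these triples would come from spaCy's Italian
# models; the animacy labels are added by hand, as described in Sec. 4.2.
tagged = [
    ("ragazza", "NOUN", {"Gender": "Fem", "Number": "Sing"}),
    ("ragazza", "NOUN", {"Gender": "Fem", "Number": "Sing"}),
    ("mangia",  "VERB", {"Number": "Sing", "Person": "3"}),
    ("mela",    "NOUN", {"Gender": "Fem", "Number": "Sing"}),
]
manual_animacy = {"ragazza": "animate", "mela": "inanimate"}  # hand-labeled

def type_token_ratio(tokens):
    """TTR = distinct surface forms / total tokens, as reported per section."""
    forms = [form for form, _, _ in tokens]
    return len(set(forms)) / len(forms)

# Lexicon keyed by (UPOS, sorted feature pairs[, animacy for nouns]),
# so that word selection can later query exactly the features a tag needs.
lexicon = defaultdict(set)
for form, upos, feats in tagged:
    key = (upos, tuple(sorted(feats.items())))
    if upos == "NOUN":
        key += (manual_animacy.get(form, "unknown"),)
    lexicon[key].add(form)
```

        <p>Keying entries by feature bundles rather than by surface form is what lets a template tag such as "feminine singular animate noun" retrieve all and only the compatible lexical items.</p>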
        <sec id="sec-1-3-3">
          <title>4.3. The pipeline for minimal pairs generation</title>
          <p>Our automatic minimal pair generation process follows a structured and modular pipeline designed to produce large-scale, linguistically controlled sentence pairs. This section details each stage of the pipeline, emphasizing both the design rationale and the implementation steps.</p>
          <p>• Resource loading: The process begins with the loading of two key components: (i) a lexicon extracted from the Italian corpus, enriched with linguistic annotations such as part-of-speech (POS), universal POS (UPOS), animacy, and morphological features; and (ii) a set of tag sequences, each defining the structure of a sentence in terms of syntactic categories. These tag sequences are constructed in minimal pairs, where each pair consists of a grammatical and an ungrammatical variant. The ungrammatical variant introduces a targeted morphosyntactic violation (e.g., a mismatched subject-verb agreement or incorrect determiner-noun concord), ensuring that the only difference between the two sequences is the critical grammatical contrast under investigation. This design supports a controlled evaluation of model sensitivity to specific phenomena.</p>
          <p>• Tag Matching and Word Selection: Once the tag templates are loaded, the system proceeds to match each tag in a sequence with a suitable word from the lexicon. Word selection is guided by the required grammatical features encoded in the tag (e.g., number, gender, animacy, tense). To prevent repetition and encourage lexical diversity, a tracking mechanism records previously selected tokens and prioritizes less frequently used words when possible. Special handling is applied to verbs, which require agreement features to be matched precisely with their subject counterparts. The system identifies verb roots and selects appropriate inflected forms based on number and person. Additionally, animacy plays a role in selecting nouns and pronouns, especially in structures where semantic compatibility influences grammaticality (e.g., reflexive pronouns or clitic constructions). If a matching lexical item cannot be found for a given tag within the constraints, the system either retries with an alternative lexeme or skips the current sequence to maintain sentence well-formedness and overall dataset quality.</p>
          <p>• Sentence Construction: With the tag-to-word mappings established, the system constructs sentence pairs by linearizing the selected tokens according to their tag sequence order. Minimal surface normalization is performed at this stage, including the insertion of appropriate punctuation, handling of elisions and contractions, and capitalization of the sentence-initial token. Each sentence is generated in parallel with its minimal counterpart, ensuring that both share identical lexical items and structure, differing only in the targeted morphosyntactic element. This parallelism ensures the interpretability and diagnostic value of each pair.</p>
          <p>• Iterative Generation and Quality Control: To ensure dataset diversity and minimize redundancy, the pipeline includes a control mechanism to detect and filter out duplicate or near-duplicate sentence pairs. Duplicates are identified not only by surface form but also by underlying tag structure, preventing syntactically redundant examples from being overrepresented. The generation process is iterative: multiple passes are performed over the tag templates and lexicon, dynamically adjusting word choices based on availability and prior usage. When generation fails (e.g., no valid word found for a required combination), the system logs the instance and skips the pair to avoid compromising the grammatical precision of the dataset. Internally, each generated (good-sentence, bad-sentence) tuple is stored in a Python set and tested for membership in O(1) time: any exact surface-form repeat is skipped. To prevent an endless loop when unique pairs run out, the loop also caps the total number of attempts (e.g., 10× the target) and logs a warning if it cannot reach the requested count.</p>
          <p>• Quality check: We employ a human-in-the-loop strategy, where a team of linguistic experts meticulously reviews the generated minimal pairs to ensure grammatical accuracy and naturalness. Each pair is independently rated by at least two reviewers, and any doubts trigger a discussion session to reach consensus and to establish whether the pair must be removed. Experts also log error types and provide targeted feedback on problematic tag sequences or lexicon entries. Their expertise not only enhances the overall quality of our evaluation tool but also ensures inter-rater reliability, fostering consistency and objectivity in the assessment process.</p>
        </sec>
        <sec id="sec-1-3-4">
          <title>4.4. Methodological challenges</title>
          <p>Figure 1: The linguistic phenomena (with different levels of granularity) reflected in the automatically generated minimal pairs. A detailed description of the phenomena and the acronyms, with relevant references, can be found at https://nets-lab.github.io/blimpit-generation/</p>
          <p>While the automatic generation pipeline described above enables scalable creation of minimal pairs, its implementation also revealed several methodological challenges that required careful consideration. First, the process of animacy annotation introduced a bottleneck due to the need for manual labeling. Although part-of-speech (POS) and universal POS (UPOS) tags could be obtained using existing NLP tools such as spaCy, the classification of nouns and pronouns based on animacy required human intervention. This task is particularly sensitive in Italian, where animacy can influence grammaticality judgments, especially in constructions involving clitics, reflexives, or subject-verb agreement. Ensuring consistent annotation across the lexicon was essential to preserve the validity of minimal pairs involving semantically conditioned structures. Second, the construction of sequence tags (representing grammatical and ungrammatical syntactic structures) proved complex. Tag sequences must encode subtle contrasts in grammaticality while remaining compatible with the lexicon and word selection rules. Designing these templates required extensive linguistic knowledge and iterative refinement. In some cases, identifying minimal but meaningful structural contrasts demanded revisiting the theoretical underpinnings of the targeted phenomenon. Another critical challenge was matching lexical items to abstract tag templates. While the lexicon provides detailed linguistic annotations, finding appropriate word combinations that meet all morphological and syntactic constraints was nontrivial. This was especially true for verbs, where selecting appropriate inflected forms (e.g., singular/plural, tense, auxiliary selection) required tracking agreement features and root compatibility. Additionally, ensuring lexical diversity while avoiding repetitive or unnatural constructions added further complexity to word selection. The generation process also involved quality control mechanisms to filter out low-quality or duplicate pairs. Despite automated checks, certain errors, such as overly rigid or implausible sentences, could only be caught through manual review. This underscores the continued importance of human-in-the-loop validation, particularly for capturing edge cases that automatic systems may overlook. Finally, the reliance on a corpus of child-directed speech and simplified texts (developed for the BabyLM Challenge) had implications for lexical diversity. While the corpus offered controlled and well-annotated input data, its domain-specific nature may limit coverage of more formal or idiomatic constructions. Addressing this limitation requires expanding the source corpus in future iterations to include a broader range of registers and genres.
</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Results</title>
      <sec id="sec-2-1">
        <p>Our pipeline successfully generated 2,899 minimal pairs covering 18 phenomena (spanning agreement, non-local dependencies, and other key categories) from the 78 phenomena included in BLiMP-IT. We are actively working to expand this coverage to include all 78 phenomena. Following the methodology proposed for English in [<xref ref-type="bibr" rid="ref18">18</xref>], early findings from employing BLiMP-IT to assess models that replicate the constraints children face while learning language show that strong performance on standard evaluation metrics does not translate to equally strong results on minimal pair tests, and these models fail to capture the linguistic patterns typical of children [<xref ref-type="bibr" rid="ref19">19</xref>]. These initial findings indicate that children’s language learning follows expected linguistic principles, while large language models demonstrate inconsistent behavior. Specifically, preliminary results 4 reveal that although training different language models (GPT-2, BERT, ad hoc RNN) on approximately 10 million tokens increases overall accuracy (rising from 40% to 79%), their performance on certain BLiMP-IT components actually worsens (dropping from 61% to 52%). The models’ reliability in distinguishing correct from incorrect language forms decreases from 44% to 32%, falling short of human benchmarks (around 86% accuracy and 72% consistency observed in seven-year-old children). We are still in the process of testing and evaluating different models on our automatically generated minimal pairs.</p>
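        <p>The pair-level comparison underlying these evaluations (Section 3.1) can be sketched as follows. This is an illustrative sketch only: the toy unigram scorer and its counts are invented stand-ins for the log-likelihoods an actual language model would assign, and none of the names below come from the BLiMP-IT codebase.</p>

```python
import math

# Hypothetical unigram counts standing in for a real language model.
# A real evaluation would replace `sentence_logprob` with per-sentence
# log-likelihoods from the model under test.
UNIGRAM_COUNTS = {"la": 40, "ragazza": 5, "mangia": 8, "mangiano": 2, "mela": 6}
TOTAL = sum(UNIGRAM_COUNTS.values())

def sentence_logprob(sentence: str) -> float:
    """Sum of per-token log-probabilities under the toy unigram model."""
    return sum(
        math.log(UNIGRAM_COUNTS.get(tok, 1) / TOTAL)
        for tok in sentence.lower().split()
    )

def prefers_grammatical(good: str, bad: str) -> bool:
    """A model 'passes' a minimal pair if it scores the acceptable
    sentence strictly higher than its unacceptable counterpart."""
    return sentence_logprob(good) > sentence_logprob(bad)

pairs = [("La ragazza mangia la mela", "La ragazza mangiano la mela")]
accuracy = sum(prefers_grammatical(g, b) for g, b in pairs) / len(pairs)
```

        <p>Because the two sentences share every token except the critical one, the score difference reduces to the contrast on that single form, which is exactly what makes minimal pairs a controlled diagnostic.</p>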
      </sec>
      <p>4 Forthcoming in Proceedings of GLOW 47.</p>
    </sec>
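    <p>The duplicate-filtering and attempt-capping loop described in Section 4.3 can be sketched as follows. The function and variable names, and the fixed two-pair inventory standing in for template realization, are illustrative assumptions, not the released BLiMP-IT implementation; in the real pipeline the realizer consults the tagged lexicon.</p>

```python
import logging
from itertools import cycle

def generate_pairs(templates, realize, target, max_attempt_factor=10):
    """Iterative generation loop (sketch): exact (good, bad) repeats are
    filtered via O(1) set membership, and total attempts are capped at
    max_attempt_factor * target so that an exhausted inventory of unique
    pairs cannot cause an endless loop; a warning is logged on shortfall."""
    seen, pairs, attempts = set(), [], 0
    while len(pairs) < target and attempts < max_attempt_factor * target:
        attempts += 1
        pair = realize(templates[attempts % len(templates)])
        if pair is None or pair in seen:  # failed generation or duplicate
            continue
        seen.add(pair)
        pairs.append(pair)
    if len(pairs) < target:
        logging.warning("only %d of %d pairs generated", len(pairs), target)
    return pairs

# Hypothetical realizer: cycles through a fixed two-pair inventory.
inventory = [("La ragazza mangia", "La ragazza mangiano"),
             ("Il ragazzo dorme", "Il ragazzo dormono")]
source = cycle(inventory)
# Asking for 3 pairs when only 2 unique ones exist: the attempt cap
# (10 x 3 = 30 tries) terminates the loop and triggers the warning.
result = generate_pairs(["NP-V"], lambda t: next(source), target=3)
```

    <p>Storing pairs in a set keeps the duplicate check constant-time regardless of dataset size, which matters once generation scales to thousands of pairs per phenomenon.</p>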
    <sec id="sec-3">
      <title>6. Discussion</title>
      <p>BLiMP-IT represents a significant step forward in the evaluation of Italian language models by providing a benchmark that combines manually curated linguistic phenomena with an innovative pipeline for automatic minimal pair generation. Through the integration of diverse resources and a structured methodology, our approach ensures both linguistic relevance and scalability. One of the strengths of our approach lies in the combination of curated content and automation. While the manual adaptation of resources such as COnVERSA and AcCompl-it guarantees that the dataset reflects core aspects of Italian grammar, the automated generation pipeline makes it possible to scale the number of minimal pairs efficiently and consistently. This dual strategy enables us to address a broader range of morphosyntactic phenomena while maintaining control over the grammatical integrity of the examples. Moreover, by implementing a human-in-the-loop quality control process, we ensure that automatically generated sentence pairs remain grammatically accurate and linguistically natural. Linguistic experts systematically validate the outputs, which strengthens the internal consistency of the dataset and enhances its reliability for downstream evaluation tasks. This step is crucial given the complexity of Italian syntax and morphology, where subtle changes in word form or word order can significantly affect acceptability.</p>
      <p>Another key contribution of BLiMP-IT is its focus on minimal pairs as an evaluation methodology. This approach provides a fine-grained tool for testing specific grammatical contrasts, such as subject-verb agreement or clitic placement, that are often underrepresented in broader benchmarks. By isolating individual linguistic features, BLiMP-IT allows researchers to probe the syntactic sensitivity of language models in a controlled and interpretable way. The breadth of phenomena included in BLiMP-IT, spanning from local agreement patterns to long-distance dependencies, also makes it a valuable diagnostic resource. In particular, the inclusion of lesser-tested constructions such as parasitic gaps or ATB (Across-The-Board) movement contributes to a more comprehensive picture of a model’s grammatical competence. This is especially important in the context of evaluating transformer-based models, which may succeed in surface-level generalizations but struggle with deeper syntactic dependencies. Furthermore, the design of BLiMP-IT allows for ongoing extension and refinement. Since the core generation pipeline is modular, it can be expanded to incorporate additional phenomena as more linguistic data becomes available. The current focus on 18 phenomena, though already substantial, represents only a subset of the 78 phenomena identified in the full benchmark framework. Ongoing work is directed toward increasing this coverage while maintaining the same level of quality control. Finally, by grounding our dataset in a linguistically annotated corpus developed for the BabyLM Challenge, we ensure that our lexical and syntactic inputs are well-attested and systematically organized. Although this corpus primarily reflects child-directed language, it still provides sufficient lexical and morphosyntactic variety to generate a diverse and representative set of sentence pairs. The detailed analysis […]</p>
      <p>[…] linguistic resources with an automated pipeline for minimal pair generation. This hybrid methodology allows us to systematically and efficiently generate sentence pairs that test key morphosyntactic competencies (such as agreement, inflection, verb argument structure, and non-local dependencies) across 78 targeted phenomena. Our approach ensures scalability while maintaining high linguistic quality through expert validation. The contribution of BLiMP-IT is twofold: first, it addresses the significant gap in Italian-specific evaluation datasets for language models, and second, it proposes a generalizable, language-agnostic framework for benchmark construction. These features make BLiMP-IT a valuable tool not only for evaluating existing LMs, but also for supporting their training and fine-tuning, particularly in low-resource or developmentally plausible settings, such as those promoted by the BabyLM challenge. The automatic generation pipeline opens the door to large-scale, consistent, and reusable evaluation items, minimizing the reliance on manual crafting, which is both time-consuming and difficult to scale. This makes it feasible to evaluate a wide range of grammatical contrasts in a way that is both linguistically informed and computationally practical. Looking forward, we aim to expand coverage to all 78 phenomena, increase the lexical and syntactic diversity of the generated items, and incorporate more advanced linguistic annotations, such as semantic roles and animacy, using semi-supervised or model-assisted techniques. Additionally, we plan to develop a fully language-independent version of the pipeline, enabling researchers to create similar benchmarks for other morphologically rich languages. By combining linguistic depth with computational scalability, BLiMP-IT sets a new standard for targeted evaluation of linguistic competence in Italian language models and offers a blueprint for multilingual benchmarking in the future.</p>
      <p>8. Limitations</p>
of type-token ratios across subdomains (e.g., fairy tales,
songs, subtitles) confirms that the source material
supports the goals of minimal pair generation in a
linguistically meaningful way.</p>
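The minimal-pair methodology discussed above reduces, at evaluation time, to a simple comparison: a model "passes" an item when it assigns a higher probability to the grammatical sentence than to its ungrammatical twin. The sketch below illustrates only this comparison logic; the toy add-one-smoothed unigram model, its mini corpus, and the function names are invented for illustration (a unigram model cannot actually capture agreement; a real evaluation would use a neural LM's token log-probabilities).

```python
import math
from collections import Counter

# Toy training data standing in for a real corpus; frequencies are invented.
corpus = "il gatto dorme il cane dorme i gatti dormono le bambine leggono".split()
counts = Counter(corpus)
total = sum(counts.values())

def log_prob(sentence, alpha=1.0):
    """Add-one-smoothed unigram log-probability of a whitespace-tokenized sentence."""
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return sum(math.log((counts[w] + alpha) / (total + alpha * vocab))
               for w in sentence.split())

def prefers_grammatical(good, bad):
    """A model 'passes' a minimal pair if it scores the grammatical
    sentence strictly higher than its ungrammatical counterpart."""
    return log_prob(good) > log_prob(bad)

# Subject-verb agreement pair: 'il gatto dorme' vs. *'il gatto dormono'.
print(prefers_grammatical("il gatto dorme", "il gatto dormono"))  # → True
```

Accuracy on a phenomenon is then simply the fraction of its minimal pairs for which this comparison succeeds.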
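The type-token ratio analysis mentioned above is, computationally, a one-line measure per subdomain: the number of distinct word forms (types) divided by the number of running words (tokens). A minimal sketch, with invented toy texts standing in for the actual corpus subdomains:

```python
def type_token_ratio(tokens):
    """Distinct word forms (types) divided by total running words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented toy texts standing in for the corpus subdomains named in the text.
subdomains = {
    "fairy_tales": "c'era una volta una principessa in un castello".split(),
    "songs": "la la la canta canta la melodia".split(),
    "subtitles": "dove vai domani non lo so ancora".split(),
}

for name, tokens in subdomains.items():
    print(f"{name}: {type_token_ratio(tokens):.2f}")
```

On real data one would normalize for text length (e.g., a moving-average TTR), since raw TTR falls as texts grow.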
    </sec>
    <sec>
      <title>7. Conclusions</title>
      <p>We have presented BLiMP-IT, a novel evaluation benchmark for Italian language models that integrates curated linguistic resources with automatic minimal pair generation, covering core grammatical phenomena such as agreement, inflection, verb argument structure, and non-local dependencies—across 78 targeted phenomena. Our approach ensures scalability while maintaining high linguistic quality through expert validation. The contribution of BLiMP-IT is twofold: first, it addresses the significant gap in Italian-specific evaluation datasets for language models, and second, it proposes a generalizable, language-agnostic framework for benchmark construction. These features make BLiMP-IT a valuable tool not only for evaluating existing LMs, but also for supporting their training and fine-tuning—particularly in low-resource or developmentally plausible settings, such as those promoted by the BabyLM challenge. The automatic generation pipeline opens the door to large-scale, consistent, and reusable evaluation items, minimizing the reliance on manual crafting, which is both time-consuming and difficult to scale. This makes it feasible to evaluate a wide range of grammatical contrasts in a way that is both linguistically informed and computationally practical. Looking forward, we aim to expand coverage to all 78 phenomena, increase the lexical and syntactic diversity of the generated items, and incorporate more advanced linguistic annotations, such as semantic roles and animacy, using semi-supervised or model-assisted techniques. Additionally, we plan to develop a fully language-independent version of the pipeline, enabling researchers to create similar benchmarks for other morphologically rich languages. By combining linguistic depth with computational scalability, BLiMP-IT sets a new standard for targeted evaluation of linguistic competence in Italian language models and offers a blueprint for multilingual benchmarking in the future.</p>
    </sec>
    <sec id="sec-3-1">
      <title>8. Limitations</title>
      <p>As discussed in Section 4.4, several methodological challenges were encountered during the design of the automatic generation pipeline. In addition to those, our current setup faces broader limitations that affect the dataset's generalizability and scalability. Most notably, the underlying corpus was originally developed for the BabyLM Challenge and, as such, is largely composed of texts classified as 'child-directed speech'. This focus limits the diversity of the lexicon used for minimal pair creation and may not fully represent the broader spectrum of language registers. In future work, we plan to extend our dataset to incorporate a wider range of text sources, thereby enriching the lexicon and enhancing representativeness. Additionally, our current pipeline relies on manual processes for animacy annotation and the construction of sequence tags. This dependency on manual efforts introduces potential inconsistencies and limits scalability. We aim to transition to a fully automated approach in subsequent iterations, which will improve both the reliability and efficiency of our pipeline.</p>
      <p>Declaration on Generative AI</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chomsky</surname>
          </string-name>
          ,
          <source>Aspects of the Theory of Syntax</source>
          , 11, MIT press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <source>The empirical base of linguistics: Grammaticality judgments and linguistic methodology</source>
          , Language Science Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          , E. Dupoux,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Assessing the ability of LSTMs to learn syntax-sensitive dependencies</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <fpage>521</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Marvin</surname>
          </string-name>
          , T. Linzen,
          <article-title>Targeted syntactic evaluation of language models</article-title>
          ,
          <source>arXiv preprint arXiv:1808.09031</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Morita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Futrell</surname>
          </string-name>
          ,
          <article-title>What do rnn language models learn about filler-gap dependencies?</article-title>
          ,
          <source>arXiv preprint arXiv:1809.00042</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          , J. Gauthier,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qian</surname>
          </string-name>
          , E. Wilcox,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>A systematic assessment of syntactic generalization in neural language models</article-title>
          ,
          <source>arXiv preprint arXiv:2005.03692</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Futrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>Using computational models to test syntactic learnability</article-title>
          ,
          <source>Linguistic Inquiry</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>805</fpage>
          -
          <lpage>848</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:247235030.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <article-title>Climbing towards NLU: On meaning, form, and understanding in the age of data</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>5185</fpage>
          -
          <lpage>5198</lpage>
          . URL: https://aclanthology.org/2020.acl-main.463/. doi:10.18653/v1/2020.acl-main.463.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Piantadosi</surname>
          </string-name>
          ,
          <article-title>Modern language models refute Chomsky's approach to language</article-title>
          ,
          <source>From fieldwork to linguistic theory: A tribute to Dan Everett</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>353</fpage>
          -
          <lpage>414</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Katzir</surname>
          </string-name>
          ,
          <article-title>Why large language models are poor theories of human linguistic cognition. A reply to Piantadosi (2023)</article-title>
          , Manuscript, Tel Aviv University. URL: https://lingbuzz.net/lingbuzz/007190 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <article-title>Utility is in the eye of the user: A critique of nlp leaderboards</article-title>
          ,
          <source>arXiv preprint arXiv:2009.13888</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Coda-Forno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Binz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , E. Schulz,
          <article-title>CogBench: a large language model walks into a psychology lab</article-title>
          ,
          <source>arXiv preprint arXiv:2402.18225</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parrish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohananey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>BLiMP: The benchmark of linguistic minimal pairs for English</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kann</surname>
          </string-name>
          ,
          <article-title>CLiMP: A benchmark for Chinese language model evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2101.11131</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>Glue: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1804.07461</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Steuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mosbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          ,
          <article-title>Large GPTlike models are bad babies: A closer look at the relationship between linguistic competence and psycholinguistic measures</article-title>
          , in: A.
          <string-name>
            <surname>Warstadt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Choshen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Wilcox</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ciro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Mosquera</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Paranjabe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Linzen</surname>
          </string-name>
          , R. Cotterell (Eds.),
          <source>Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>157</lpage>
          . URL: https://aclanthology.org/2023.conll-babylm.12/. doi:10.18653/v1/2023.conll-babylm.12.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bressan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fusco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L. P.</given-names>
            <surname>Bianchessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          , T. Sgrizzi,
          <article-title>Different ways to forget: Linguistic gates in recurrent neural networks</article-title>
          , in: M. Y.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Linzen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Choshen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Warstadt</surname>
          </string-name>
          , E. G. Wilcox (Eds.),
          <source>The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning</source>
          , Association for Computational Linguistics, Miami, FL, USA,
          <year>2024</year>
          , pp.
          <fpage>106</fpage>
          -
          <lpage>117</lpage>
          . URL: https://aclanthology.org/2024.conll-babylm.9/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fusco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L. Piccini</given-names>
            <surname>Bianchessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bressan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sgrizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <article-title>Recurrent networks are (linguistically) better? an (ongoing) experiment on small-LM training on child-directed speech in Italian</article-title>
          , in: F.
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Montemagni</surname>
          </string-name>
          , R. Sprugnoli (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), CEUR Workshop Proceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>389</lpage>
          . URL: https://aclanthology.org/2024.clicit-1.46/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Trotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guarasci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Leonardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <article-title>Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus</article-title>
          , in:
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          Association for Computational Linguistics, Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>2929</fpage>
          -
          <lpage>2940</lpage>
          . URL: https://aclanthology.org/2021.findings-emnlp.250/. doi:10.18653/v1/2021.findings-emnlp.250.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , G. Venturi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zamparelli</surname>
          </string-name>
          , et al.,
          <article-title>AcCompl-it @ EVALITA2020: Overview of the Acceptability &amp; Complexity evaluation task for Italian</article-title>
          ,
          <source>in: CEUR WORKSHOP PROCEEDINGS, CEUR Workshop Proceedings (CEUR-WS. org)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghersi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Musella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Musola</surname>
          </string-name>
          , et al.,
          <article-title>Conversa: Test di comprensione delle opposizioni morfo-sintattiche verbali attraverso la scrittura (</article-title>
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <article-title>Constraints on variables in syntax</article-title>
          . (
          <year>1967</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lan</surname>
          </string-name>
          , E. Chemla,
          <string-name>
            <given-names>R.</given-names>
            <surname>Katzir</surname>
          </string-name>
          ,
          <article-title>Large language models and the argument from the poverty of the stimulus</article-title>
          ,
          <source>Linguistic Inquiry</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <article-title>[Call for papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus</article-title>
          ,
          <source>arXiv preprint arXiv:2404.06214</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>I.</given-names>
            <surname>Alfano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cutugno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>De Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Iacobini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Savy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Voghera</surname>
          </string-name>
          , et al.,
          <article-title>VoLIP: a corpus of spoken Italian and a virtuous example of reuse of linguistic resources</article-title>
          ,
          <source>in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>3897</fpage>
          -
          <lpage>3901</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>