<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Matrices for Italian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chunyang Jiang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Samo</string-name>
          <email>giuseppe.samo@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vivi Nastase</string-name>
          <email>vivi.a.nastase@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Merlo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Blackbird Language Matrices, Causative/inchoative alternation</institution>
          ,
          <addr-line>Object-drop alternation, subject-verb number agreement</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Geneva</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles to learn language-related problems and investigate deeper formal and semantic properties of language, through a process of paradigm understanding. A BLM matrix consists of a context set and an answer set. The context is a sequence of sentences that implicitly encode an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples produced following corrupted generating rules. We propose three subtasks, agreement concord (Agr), causative (Caus) and object-drop (Od) alternation detection, each in two variants of increasing lexical complexity. The datasets comprise a few prompts for few-shot learning and a large test set.</p>
      </abstract>
      <kwd-group>
        <kwd>Blackbird Language Matrices</kwd>
        <kwd>Causative/inchoative alternation</kwd>
        <kwd>Object-drop alternation</kwd>
        <kwd>subject-verb number agreement</kwd>
        <kwd>rule-based abstraction</kwd>
        <kwd>disentanglement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction and Motivation</title>
      <p>Current generative large language models (LLMs) translate across close languages, produce fluent and informative summaries, and answer questions promptly. And yet, they still fail in very non-human ways. As shown by their prohibitive needs in size of training data and expensive computational resources, large language models do not generalise nor abstract systematically. Humans, instead, are good at abstraction and generalisation.</p>
      <p>To reach systematic abilities in abstraction and generalisation in neural networks, we need to develop tasks and data that help us understand their current generalisation abilities (what exactly do LLMs understand of the language they produce and process so well?) and that help us train them towards more complex skills.</p>
      <p>In the CALAMITA challenge [1], we propose to find the solution to Blackbird Language Matrices (BLMs), linguistic puzzles developed in analogy to the visual Raven's Progressive Matrices tests [2]. Raven's Progressive Matrices (RPMs) consist of a sequence of images, called the context, connected in a logical sequence by underlying generative rules [3]. The task is to determine the missing element in this visual sequence, the answer, chosen among a set of closely or loosely similar alternatives, as illustrated in Figure 1.</p>
      <p>[Figure 1: Example of a Raven's Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the matrix top to bottom. The task consists in finding the tile in the answer set that correctly completes the sequence (indicated with a double border).]</p>
      <p>Unlike other attempts to create textual versions of RPMs, BLMs are not simplistic transcriptions of visual stimuli [4], a technique that, in practice, might give away parts of the solution to the problem, nor are they auxiliary abstractions of stimuli in the visual domain [5]. Instead, BLMs are matrices developed specifically to learn language-related problems and delve into deeper formal and semantic properties of language, through a process of paradigm understanding.</p>
      <p>Like RPMs, a BLM instance consists of a context set and an answer set. The context is a sequence of sentences that encode a linguistic rule. They encode, for example, the rule of grammatical number concord: subject and verb agree in their grammatical number, and they do so independently of how many noun phrases intervene between them. BLMs are presented as linguistic puzzles requiring the selection of the missing sentence. In order to examine the representations underlying the response, the answer sets include not only the correct answer, but also erroneous candidates constructed by corrupting the generating rules. An example template is illustrated in Figure 2.</p>
      <p>[Figure 2: BLM-AgrI template for verb-subject agreement, with one to two intervening phrases, built from three generative rules: (i) the subject matches the verb in number (singular or plural); (ii) material can intervene between them and is of unbounded length; (iii) singular and plural alternate in regular patterns. NP = Noun Phrase, PP = Prepositional Phrase, VP = Verb Phrase. Answer labels: Correct; Coord; WNA = wrong number of attractors; WN1 = wrong number for the first attractor noun (N1); WN2 = wrong number for the second attractor noun (N2); AEV = agreement error on the verb; AEN1 = agreement error on N1; AEN2 = agreement error on N2.]</p>
      <p>BLM datasets are richly structured and support many different types of investigation, at both the sentence and the matrix level. The context-answer set-up supports counterfactual investigations of possible types of errors: language errors, reasoning errors, and their interactions [6, 7, 8]. The regular syntactic forms and the systematic semantic properties support investigations of systematicity and compositionality in neural networks. The predictable syntactic structure of the individual sentences, and the structure within the sequence of a BLM context, also support investigations of sentence embeddings [9, 10]. BLMs exist for several tasks and different languages, enabling multi-task and multi-language comparative studies [11, 12]. Finally, each BLM problem is a linguistic paradigm and can be seen as a tool for the linguistic investigation of specific phenomena.</p>
    </sec>
    <sec id="sec-2-1">
      <title>2. The BLM-It Challenge</title>
      <p>The BLM-It challenge consists of six sub-tasks. All sub-tasks are instances of the general BLM task, but they differ along two dimensions: the linguistic problem (Agr, Caus, Od) and the lexical complexity of the data (II, III). The task names and lexical complexity levels are chosen to ease cross-reference and comparison with other papers published on BLMs. Our datasets are available at https://www.idiap.ch/en/scientific-research/data/blm-agri-gen, https://www.idiap.ch/en/scientific-research/data/blm-causi-gen, and https://www.idiap.ch/en/scientific-research/data/blm-odi-gen. While the agreement (Agr) task focuses on information about the formal grammatical property of agreement, the causative (Caus) and object-drop (Od) alternation tasks focus on lexical semantic properties of verbs: their ability to enter or not into a causative alternation, and their systematic alternation in the syntactic-semantic mapping of grammatical functions and semantic roles.</p>
      <p>BLM-AgrI. The BLM problem for subject-verb agreement [6] consists of a context set of seven sentences that share the subject-verb agreement phenomenon, but differ in other aspects, e.g. the number of intervening attractors between the subject and the verb, the grammatical number of these attractors, and the clause structure. The answer set comprises contrastive sentences that violate some of the generative rules. The BLM-AgrI template can be seen in Figure 2.</p>
      <p>BLM-CausI. The BLM-CausI matrix represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L'artista ha aperto la finestra / La finestra si è aperta, 'The artist opened the window' / 'The window opened'). The transitive form of the verb has a causative meaning [13].</p>
      <p>The BLM-CausI template is shown in Figure 4. The context set of the causative alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat) and the active (Akt) or passive (Pass) voice of the verb. The sentences are organised in a structured sequence: an alternation every two items between a prepositional phrase introduced by multifarious prepositions (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da-NP (e.g., dall'artista, da-Ag/da-Pat).</p>
      <p>The answer set is composed of one correct answer and contrastive erroneous answers, all formed by the same four elements: a verb, two nominal constituents and the presence (or absence) of a prepositional phrase.</p>
      <p>BLM-OdI. The BLM-OdI template is minimally different from BLM-CausI; the two also act as each other's controls. In contrast to Caus, the subject in Od bears the same semantic role (Agent) in both the transitive and the intransitive form (L'artista dipingeva la finestra / L'artista dipingeva, 'the artist painted the window' / 'the artist painted'), and the verb does not have a causative meaning [13].</p>
      <p>The BLM template for Od is the same as for Caus, but here the passive voice serves as a confounding element, and one of the contrastive answers for Caus is, in fact, the correct answer here. The template for BLM-OdI is in Figure 5. Due to the asymmetry between the Caus and Od BLM templates, the contexts of the BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with da-NP.</p>
      <p>[Figures 4 and 5: BLM-CausI and BLM-OdI templates. The answer labels include, among others: wrong mood and wrong subject semantic role; R-trans = wrong sequence reasoning (transitive sentence with the second NP not preceded by a preposition); IE-WrBy = ungrammatical sentence (NP following the preposition da).]</p>
      <p>Lexical variants. Each of the three BLM templates described above is developed in two lexical variants, with less (II) or more (III) lexical variation. In type II BLMs, only one word in each sentence changes for each matrix, compared to the other sentences, while in type III data all words can change. Instances of the two variations are shown in Figure 3.</p>
    </sec>
    <sec id="sec-2-2">
      <title>3. Data description</title>
      <p>The data is generated by the process described in Figure 6: (i) start from identifying a linguistic phenomenon of interest, its forms of expression and the factors influencing it within a context; (ii) produce a set of seed examples from natural or synthetic data; (iii) automatically augment the seeds using a fill-mask strategy; (iv) produce BLM instances following the designed templates and generative rules. Two instances of Od verb alternations are shown in Figure 3.</p>
      <sec id="sec-2-2-1">
        <title>3.1. Origin of data</title>
        <p>BLM-AgrI. To instantiate the templates, our starting point are the examples in Franck et al. [14, appendix 1]. They provide a set of subject NPs of various complexity, including prepositional phrases, themselves of various complexity. The sentences were produced based on these subject NPs by manually adding verb phrases, and by making the NPs more complex to increase the distance between the subject and the verb in the sentences [6]. Each of these sentences is used to produce a seed.</p>
        <p>BLM-CausI and BLM-OdI. Thirty verbs from each of the causative and object-drop classes in English in Levin [13] were selected and translated by a native speaker into Italian, where the translations maintain the same alternation structure.</p>
        <p>The seeds were augmented using masked modeling with bert-base-uncased [15]. The Italian data are built as native-speaker translations of the English data, with manual corrections to guarantee the acceptability and semantic plausibility of the sentences, and to ensure variability in gender and number.</p>
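        <p>The fill-mask augmentation in step (iii) can be sketched as follows. This is a minimal illustration with the Hugging Face transformers pipeline and an invented seed sentence, not the exact augmentation script used to build the datasets.</p>
        <preformat>
from transformers import pipeline

# bert-base-uncased is the masked language model named above.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment(seed, position, top_k=5):
    """Mask the token at `position` and return the top_k single-token rewrites."""
    tokens = seed.split()
    tokens[position] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
    return [out["sequence"] for out in fill_mask(" ".join(tokens), top_k=top_k)]

# Hypothetical seed: vary the subject noun of an English seed sentence.
variants = augment("the artist opened the window in a few seconds", position=1)
        </preformat>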
      </sec>
      <sec id="sec-2-2-2">
        <title>3.2. Data format</title>
        <p>The structured BLM data is provided in a json file, with each instance as one element with the specific fields described in Figure 7. A data instance is shown in Figure 10 in the appendix.</p>
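        <p>As an illustration, the fields of one instance (cf. Figure 10 in the appendix) can be read as follows; the file name is hypothetical, and the field names are those of the released json.</p>
        <preformat>
import json

# Load the json file holding the BLM instances
# (illustrative file name; assuming the file holds a list of instances).
with open("blm_odi_typeII.json", encoding="utf-8") as f:
    instances = json.load(f)

inst = instances[0]
context = inst["Context"]             # the seven context sentences
answers = inst["Answer_set"]          # the candidate answers
gold_letter = inst["Correct_option"]  # e.g. "E"
gold_answer = inst["Correct_answer"]

# Each annotation records which generating rule a candidate corrupts.
for ann in inst["Answer_set_annotation"]:
    print(ann["option"], ann["label"], ann["value"])
        </preformat>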
      </sec>
      <sec id="sec-2-2-3">
        <title>3.3. Few-shot and test splits</title>
        <p>For the BLM-AgrI datasets, for each of types II and III, we randomly sample 10 instances for few-shot learning from a dataset of 2010 instances; the rest is used for testing. For the BLM-CausI and BLM-OdI datasets, which are focused on specific verbs, we extract all instances for one verb (based on the correct answer in each instance) for few-shot training. From an initial dataset of 2160 instances for 27 verbs (80 instances per verb), we select the 80 instances for one verb for few-shot training, and the rest are left for testing.</p>
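        <p>A minimal sketch of the two split strategies described above (random sampling of few-shot instances for BLM-AgrI, and holding out all instances of one verb for BLM-CausI/BLM-OdI); the function names are illustrative.</p>
        <preformat>
import random

def split_agr(instances, n_shots=10, seed=0):
    """BLM-AgrI: sample a few instances for few-shot prompting, test on the rest."""
    rng = random.Random(seed)
    shot_idx = set(rng.sample(range(len(instances)), n_shots))
    shots = [x for i, x in enumerate(instances) if i in shot_idx]
    test = [x for i, x in enumerate(instances) if i not in shot_idx]
    return shots, test

def split_caus_od(instances, shot_verb):
    """BLM-CausI / BLM-OdI: use all 80 instances of one verb for few-shot prompting."""
    shots = [x for x in instances if x["Verb"] == shot_verb]
    test = [x for x in instances if x["Verb"] != shot_verb]
    return shots, test
        </preformat>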
      </sec>
      <sec id="sec-2-2-4">
        <title>3.4. Example of prompts</title>
        <p>We design prompts in English and Italian, in zero-shot and few-shot prediction settings, to test the impact of the language of the prompt on the task. These prompts test LLMs' ability to perform complex linguistic tasks with varying levels of context. Both types of prompts are structured to minimize ambiguity and to focus on the core task of selecting the best sentence to follow the given context.</p>
        <p>Zero-Shot Prompt Example in English. The prompt in Figure 8 is designed to create a clear zero-shot baseline for challenging linguistic tasks. We avoid complex prompting techniques, like chain-of-thought or step-by-step reasoning [16, 17]. This ensures that the model's performance reflects its intrinsic capabilities for linguistic understanding and reasoning, without prior in-context learning or guided reasoning steps.</p>
        <p>We format the prompt in Markdown and explicitly label sections for Context and Answer Set. The task is framed as a simple “puzzle” with the instruction to “choose […] the sentence that could […] follow the context”. This abstract formulation guides the model to focus on identifying the best sequential fit without introducing ambiguity. The prompt also aims to reduce noise and simplify the evaluation by fixing its output format.</p>
        <p>Few-Shot (One-Shot) Prompt Example in Italian. For the one-shot prediction setup (shown in Figure 9), we provide an example of the task in Italian before presenting the new instance to the model. The prompt serves to test the model's ability to use prior examples.</p>
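        <p>A sketch of how the one-shot prompt of Figure 9 can be filled from two instances (one demonstration and one query). The template below is abbreviated to the example and question blocks, and the values are taken from the Context_concatenated, Answer_concatenated and Correct_option fields of the json instances.</p>
        <preformat>
ONE_SHOT_TEMPLATE = (
    "# ESEMPIO 1\n**Contesto**\n{demo_context}\n**Risposte**\n{demo_answers}\n"
    "**Scelta corretta**\n{demo_gold}\n"
    "# DOMANDA\n**Contesto**\n{context}\n**Risposte**\n{answers}\n**La tua scelta**\n"
)

def build_one_shot_prompt(demo, query):
    """Fill the (abbreviated) Italian one-shot template with two BLM instances."""
    return ONE_SHOT_TEMPLATE.format(
        demo_context=demo["Context_concatenated"],
        demo_answers=demo["Answer_concatenated"],
        demo_gold=demo["Correct_option"],
        context=query["Context_concatenated"],
        answers=query["Answer_concatenated"],
    )
        </preformat>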
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Metrics</title>
      <p>[Figure 9: Few (One)-Shot Prompt in Italian. The prompt instructs the model to choose, from the **Risposte** (answers), the sentence that could follow the **Contesto** (context), and to output only the letter of the best option:]</p>
      <p># COMPITO: Ti chiedo di risolvere un quesito. La lingua di questo quesito e' l'italiano.</p>
      <p>Ti daro' una lista di frasi (numerate da 1 a 7) che chiameremo **Contesto**, e un insieme di frasi (identificate da una lettera) che chiameremo **Risposte**.</p>
      <p>Il tuo compito e' di scegliere fra le **Risposte** la frase che potrebbe essere la frase seguente del **Contesto**.</p>
      <p># FORMATO: Devi mettere **SOLO** la lettera che corrisponde alla risposta migliore. Non inserire altro testo, ne' prima ne' dopo.</p>
      <p># ESEMPIO 1
**Contesto**
{{Context_concatenated}}
**Risposte**
{{Answer_concatenated}}
**Scelta corretta**
{Correct_option}
# DOMANDA
**Contesto**
{{Context_concatenated}}
**Risposte**
{{Answer_concatenated}}
**La tua scelta**</p>
      <p>We perform zero-shot and one-shot evaluation on the BLM-AgrI, BLM-CausI and BLM-OdI tasks, using English and Italian prompts, with 100 samples each (batch size of one, evaluated instance by instance, over three independent runs), with Meta-Llama-3-8B-Instruct (ML-8), Meta-Llama-3-70B-Instruct (ML-70), Mistral-7B-Instruct-v0.3 (M-7), and Gemma-2-9b-It (G-2). We report averaged F1 scores over 3 runs in Table 2.</p>
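      <p>A minimal sketch of this evaluation loop; ask_model is a placeholder for whichever chat model is prompted, build_prompt stands for the prompting helpers of Section 3.4, and macro-averaged F1 over gold and predicted letters is one possible instantiation of the reported F1 score.</p>
      <preformat>
from statistics import mean, stdev
from sklearn.metrics import f1_score

def evaluate(instances, build_prompt, ask_model, n_runs=3):
    """Prompt the model instance by instance and average F1 over independent runs."""
    run_scores = []
    for _ in range(n_runs):
        gold, pred = [], []
        for inst in instances:
            reply = ask_model(build_prompt(inst))   # the model returns e.g. "E"
            pred.append(reply.strip()[:1].upper())  # keep only the answer letter
            gold.append(inst["Correct_option"])
        run_scores.append(f1_score(gold, pred, average="macro"))
    return mean(run_scores), stdev(run_scores)
      </preformat>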
      <sec id="sec-3-1">
        <title>Zero-Shot</title>
      </sec>
      <sec id="sec-3-2">
        <title>One-Shot</title>
      </sec>
      <sec id="sec-3-3">
        <title>Zero-Shot</title>
      </sec>
      <sec id="sec-3-4">
        <title>One-Shot</title>
      </sec>
      <sec id="sec-3-5">
        <title>Results</title>
      </sec>
      <sec id="sec-3-6">
        <title>BLM-AgrI type II</title>
        <p>ML-70 44.1 ± 0.46
ML-8 22.34 ± 0.33
M-7 25.54 ± 0.58
G-2 42.75 ± 1.01</p>
      </sec>
      <sec id="sec-3-7">
        <title>BLM-AgrI type III</title>
        <p>ML-70 45.64 ± 0.05
ML-8 26.65 ± 1.71
M-7 31.26 ± 1.60
G-2 38.48 ± 1.12</p>
      </sec>
      <sec id="sec-3-8">
        <title>BLM-CausI type II</title>
        <p>ML-70 19.97 ± 0.65
ML-8 5.85 ± 0.20
M-7 8.45 ± 0.44
G-2 18.06 ± 0.25</p>
      </sec>
      <sec id="sec-3-9">
        <title>BLM-CausI type III</title>
        <p>ML-70 26.49 ± 0.85
ML-8 18.03 ± 1.52
M-7 20.08 ± 0.76
G-2 29.12 ± 0.73</p>
      </sec>
      <sec id="sec-3-10">
        <title>BLM-OdI type II</title>
        <p>ML-70 18.28 ± 2.18
ML-8 8.55 ± 0.21
M-7 1.92 ± 0.27
G-2 14.07 ± 0.78</p>
      </sec>
      <sec id="sec-3-11">
        <title>BLM-OdI type III</title>
        <p>ML-70 17.70 ± 0.32
ML-8 9.50 ± 0.95
M-7 11.60 ± 0.64
G-2 14.74 ± 0.40
one-shot settingsG.emma-2-9b-it shows robust per- BLM-CausI tasks Meta-Llama-3-70B-Instruct
formance, especially with Italian prompts, performilnegads across both English and Italian prompts, with
similarly to the larger Meta-Llama model. In contriamsptr,ovement in one-shot English for
typeGIeI.mmasmaller models, such asMeta-Llama-3-8B-Instruct 2-9b-it shows comparable performance across both
andMistral-7B-Instruct-v0.3, perform more weakly, languages, in both zero-shot and one-shot settings.
especially with Italian prompts. Smaller models perform worse for this task, especially in
one-shot Italian prompts.
dataset
train:test
BLM-AgrI type II 2400:4121 0.881 (0.003) 0.784 (0.007)
BLM-AgrI type III 2400:4121 0.874 (0.006) 0.336 (0.005)
BLM-CausI type II 2160:240 0.486 (0.005) 0.903 (0.010)
BLM-CausI type III 2160:240 0.475 (0.010) 0.918 (0.010)
BLM-OdI type II 2160:240 0.596 (0.010) 0.983 (0.003)
BLM-OdI type III 2160:240 0.592 (0.024) 0.994 (0.004)</p>
        <p>While not directly comparable due to the diferent
training process and the diferent test data, using
pretrained transformer encoder architectures, like Electra,
significantly outperform the zero and one-shot
prompting baseline. The performance gap suggests that while
zero or one-shot prompting is flexible, it may not capture
the complex syntactic and semantic features required for
the BLM task in Italian.</p>
        <p>While the data is very rich and richly structured, it shares
all the limitations of artificial and synthetic data: stilted
sentence structure, limited variability, possibly sentences
BLM-OdI tasks OdI tasks show the lowest overaltl hat are too short. This artificiality, though, might reduce,
performance across models. This indicates that twhiethout eliminating, the risk of having sentences that
task is the most complex and challenging for the mowde-re directly seen in the training data of the pretrained
els. Meta-Llama-3-70B-Instruct performs best, partic-models that will be used, and that we use, for further
ularly in one-shot English and Italian prompts. Howeveexrp,eriments.</p>
        <p>Mistral-7B-Instruct-v0.3 struggles the most, partic- The initial seed sentences, although minimal, were
ularly in zero-shot settings, which reflects that the modcrealfted by experts. This approach is deliberate, like in the
has limited generalisation capabilities in complex linguAisR-C dataset, to guarantee that the data are not
algorithtic tasks. mically reproducible1[9]. This expert-based approach,
though, might not be easily scalable, especially given the
Key Observations Larger models, such asMeta- complexity of the data. Exploring methods to leverage
Llama-3-70B-Instruct and Gemma-2-9b-it, consis- existing datasets for seed generation could mitigate this
tently outperform smaller models, showing better gendeerp-endency.</p>
        <p>The current dataset comprises three main tasks. More
alisation and stability across tasks. English prompts
generally result in higher F1 scores, though Italian prompttassks and variants are needed to demonstrate the
robustsometimes achieve comparable performance, particulanrelyss and the wider appeal of the data.
withGemma-2-9b-it. One-shot prompting tends to
improve performance, though the degree of improveme6nt. Ethical issues
varies by model and task complexity. Smaller models,
such asMistral-7B-Instruct andMeta-Llama-3-8B- The data presented include an augmentation step that
Instruct, show substantial variance, especially in onues-es large language models (LLMs). LLMs are trained on
shot scenarios, indicating instability in complex linguisetxictensive text data, which may unintentionally
incorpotasks. rate biases present in the training corpus.</p>
      </sec>
      <sec id="sec-3-12">
        <title>Comparison with Multitask Learning Approaches</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Data license and copyright issues</title>
      <p>This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0). For uses outside of these terms, please contact the authors.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We gratefully acknowledge the support of this work by the Swiss National Science Foundation, through SNF Advanced grant TMAG-1_209426 to PM.</p>
"ID": 215,
"Context": [
"le pittrici possono disegnare delle forme in meno di due giorni",
"le artiste possono disegnare delle rappresentazioni artistiche da un mese",
"alcune coreografie sono disegnate dalle pittrici nel salone espositivo",
"delle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da un mese",
"alcune coreografie devono essere disegnate con pochi mezzi economici",
"le scenografie devono essere disegnate da pochi mesi",
"le pittrici devono disegnare nel salone espositivo"],
"Context_concatenated": "1\tle pittrici possono disegnare delle forme in meno di due giorni\n2\tle artiste possono
disegnare delle rappresentazioni artistiche da un mese\n3\talcune coreografie sono disegnate dalle pittrici nel
salone espositivo\n4\tdelle rappresentazioni artistiche devono poter essere disegnate da queste studentesse da
un mese\n5\talcune coreografie devono essere disegnate con pochi mezzi economici\n6\tle scenografie devono essere
disegnate da pochi mesi\n7\tle pittrici devono disegnare nel salone espositivo",
"Answer_set": [
"delle rappresentazioni artistiche devono poter disegnare le sue allieve",
"le scenografie devono essere disegnate dalle sue allieve",
"le sue allieve devono essere disegnate da delle rappresentazioni artistiche",
"le pittrici possono disegnare le scenografie",
"le pittrici possono disegnare da un anno circa",
"delle forme devono poter disegnare da pochi mesi",
"le artiste devono poter disegnare da alcune coreografie",
"delle rappresentazioni artistiche devono disegnare dalle artiste"],
"Answer_concatenated": "A\tdelle rappresentazioni artistiche devono poter disegnare le sue allieve\nB\tle scenografie
devono essere disegnate dalle sue allieve\nC\tle sue allieve devono essere disegnate da delle rappresentazioni
artistiche\nD\tle pittrici possono disegnare le scenografie\nE\tle pittrici possono disegnare da un anno circa\nF\tdelle
forme devono poter disegnare da pochi mesi\nG\tle artiste devono poter disegnare da alcune coreografie\nE\tdelle
rappresentazioni artistiche devono disegnare dalle artiste",
"Correct_option": "E",
"Correct_answer": "le pittrici possono disegnare da un anno circa",
"Answer_set_annotation": [
{ "label": "IR-trans",
"value": false,
"option": "A" },
{ "label": "IER-pass",
"value": false,
"option": "B" },
{ "label": "ER-pass",
"value": false,
"option": "C" },
{ "label": "R-trans",
"value": false,
"option": "D" },
{ "label": "Correct",
"value": true,
"option": "E" },
{ "label": "I-Int",
"value": false,
"option": "F" },
{ "label": "E-WrBy",
"value": false,
"option": "G" },
{ "label": "IE-WrBy",
"value": false,
"option": "H" }
},
....
]
],
"Verb": "disegnare"</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>