<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Italian sentence embeddings properties through multi-tasking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vivi Nastase</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Samo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunyang Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Merlo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Geneva</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale - several Blackbird Language Matrices (BLMs) problems in Italian - and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure - in terms of sequence of phrases/chunks - and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles do not seem to be present in the pretrained sentence embeddings.</p>
      </abstract>
      <kwd-group>
        <kwd>synthetic structured data</kwd>
        <kwd>multi-task</kwd>
        <kwd>diagnostic studies of deep learning models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Driven by increasing computational scale and progress in deep learning techniques, NLP models can rival human capabilities on established benchmarks. New benchmarks that capture deeper levels of language understanding must then be created and analysed [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>Blackbird's Language Matrices (BLM) [<xref ref-type="bibr" rid="ref2">2</xref>] is a recent task inspired by visual tests of analytic intelligence (Raven Progressive Matrices/RPMs, [<xref ref-type="bibr" rid="ref3">3</xref>]). The BLM tasks have cast light on whether the correct predictions in previously studied linguistic problems, e.g. number agreement or verb alternations, stem from sentence embeddings that encode deeper linguistic information, such as syntactic structure and semantic properties of phrases [<xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>]. We found that higher-level information – syntactic structure and argument structure – can be assembled from the information encoded in the sentence embeddings. This, however, may not be due to a deeper understanding of such information encoded by LLMs, but rather because of useful surface indicators [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
      <p>In this paper, we adopt BLMs to investigate whether current pretrained models encode abstract linguistic notions, such as constituents, and are able to do so in a manner that comprises both functional elements, such as pronouns and demonstratives, and lexical elements, such as nominal constituents.</p>
      <p>We concentrate on Italian, and study, in a multi-task setting, several grammatical problems whose solutions can theoretically help each other. We adopt a two-level architecture developed specifically to model what we know about how humans solve puzzles similar to BLMs [<xref ref-type="bibr" rid="ref8">8</xref>]. Level 1 aims to obtain compressed sentence representations that capture information about constituents and their properties; level 2 uses the compressed sentence representations to solve a BLM problem. This architecture provides a tool to study how LLMs encode different types of syntactic and semantic information.</p>
      <p>CLiC-it 2024: 10th Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy. Corresponding author: vivi.a.nastase@gmail.com (V. Nastase); giuseppe.samo@idiap.ch (G. Samo); chunyang.jiang42@gmail.com (C. Jiang); Paola.Merlo@unige.ch (P. Merlo). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>We make two contributions: (i) an initial core BLM dataset for Italian that covers linguistic problems of different natures; (ii) single and multi-task experiments that provide new insights into the information encoded by LLMs.</p>
      <p>The datasets are available at https://www.idiap.ch/dataset/(blm-agri|blm-causi|blm-odi) and the code at https://github.com/CLCL-Geneva/BLM-SNFDisentangling.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>[Figure 1 near here: the BLM agreement problem (BLM-AgrI) context template, with NP and PP chunks marked for singular/plural grammatical number.]</p>
      <p>Multi-task learning has been popular for improving NLP systems' performance by using knowledge shared across multiple tasks [9].</p>
      <p>Multi-task learning architectures include parallel, hierarchical, and modular designs [10]. Parallel architectures share intermediate layers across tasks, conducive to efficient knowledge transfer [11]. Hierarchical architectures capture task dependencies by layering task-specific modules on shared bases. Modular approaches selectively share components among tasks to balance between generalisation and task-specific optimisation [12]. These training strategies are not mutually exclusive and can be combined.</p>
      <p>Multi-task learning can be used efficiently in resource-constrained environments, to counter data scarcity and overfitting: aggregating training data and sharing parameters across related tasks acts as a form of data augmentation [13].</p>
      <p>Effective multi-task learning depends on the relatedness of the tasks involved. Tasks that are similar or have similar objectives tend to benefit more from shared representations. This observation has been exploited in various NLP tasks, including named entity recognition [14], text generation [15], and machine translation [16], among others. Selecting related tasks that contribute positively to the shared model's training is important and remains an active area of research [9].</p>
      <p>Pretrained large language models exhibit general-purpose abilities and knowledge, with high results with little or no fine-tuning on downstream tasks [17, 18]. We can then regard these language models as the result of "multi-task" learning, and our aim here is to test whether sentence embeddings obtained from these models encode syntactic and semantic information consistently, such that different BLM problems that rely on similar linguistic information draw on the same clues from these representations. In particular, we will use BLM tasks on subject-verb agreement – which relies on chunk structure and the chunks' grammatical number properties – and on verb alternations – which rely on chunk structure and the chunks' semantic role properties – to test whether chunk structure is encoded in a manner that allows for it to be shared by the two tasks.</p>
      <sec id="sec-2-1">
        <title>3. The BLM task and the BLM Italian datasets</title>
        <p>Raven's progressive matrices are multiple-choice completion IQ tests, whose solution requires discovering the underlying generative rules of a sequence of images [<xref ref-type="bibr" rid="ref3">3</xref>]. A similar task has been developed for linguistic problems, called Blackbird Language Matrices (BLMs) [<xref ref-type="bibr" rid="ref2">2</xref>], as given in Figure 1, which illustrates the template of a BLM agreement matrix. A BLM comprises a context and an answer set. The context is a sequence of sentences generated following the relevant rules of a given linguistic phenomenon under investigation, and in this way it implicitly illustrates these grammatical properties. This sequence also follows some extra-linguistic progression rules. Each context is paired with a set of candidate answers. The answer sets contain minimally contrastive examples built by corrupting some of the generating rules.</p>
        <p>Figure 1 (caption): BLM instances for verb-subject agreement, with two attractors. We build candidate answers displaying one of two types of errors: (i) sequence errors: WNA = wrong number of attractors; WN1 = wrong grammatical number for attractor noun 1 (N1); WN2 = wrong grammatical number for attractor noun 2 (N2); (ii) grammatical errors: AEV = agreement error on the verb; AEN1 = agreement error on N1; AEN2 = agreement error on N2.</p>
        <p>The BLM Italian datasets consist of BLMs focused on the property of subject-verb agreement and on two transitive-intransitive alternations: the change-of-state alternation and the object-drop alternation. For the two alternations, the template is given in Figure 2. Due to the asymmetry between the two classes of verbs, the contexts of the BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with a da-NP. Examples are shown in the Appendix.</p>
        <sec id="sec-2-1-1">
          <title>3.1. BLM-AgrI – subject-verb agreement in Italian</title>
          <p>The BLM-AgrI dataset is created by manually translating the seed French sentences [<xref ref-type="bibr" rid="ref4">4</xref>] into Italian by a native speaker, one of the authors, and then generating the full dataset following the same process of lexical augmentation and sentence shuffling among instances described in [<xref ref-type="bibr" rid="ref4">4</xref>]. The internal nominal structure in these languages is very similar, so translations are almost parallel. An illustrative, simplified example for Italian is provided in Figure 7, in the appendix. The dataset comprises three subsets of increasing lexical complexity (called Type I, Type II and Type III) to test the ability of the system to handle item novelty.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. BLM-CausI and BLM-OdI</title>
        <p>While BLM-AgrI tests information about a formal grammatical property, agreement, the Causative (Caus) and Object-drop (Od) alternation datasets test lexical semantic properties of verbs: their ability to enter, or not, a causative alternation. Caus represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L'artista ha aperto la finestra / La finestra si è aperta, 'The artist opened the window' / 'The window opened'). The transitive form of the verb has a causative meaning. In contrast, the subject in Od bears the same semantic role (Agent) in both the transitive and intransitive forms (L'artista dipingeva la finestra / L'artista dipingeva, 'the artist painted the window' / 'the artist painted'), and the verb does not have a causative meaning [19, 20].</p>
        <p>BLM-CausI context and answers: The context set of the verb alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat), and on the active (Akt) or passive (Pass) voice of the verb. The non-linguistic factor that structures the sequence is an alternation, every two items, between a prepositional phrase introduced by any preposition (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da (e.g., dall'artista, da-Ag/da-Pat).</p>
        <p>The answer set is composed of one correct answer and contrastive wrong answers, all formed by the same four elements: a verb, two nominal constituents and a prepositional phrase. Figure 2 shows the template.1</p>
        <p>BLM-OdI context and answers: The BLM for Od is the same as for Caus, but here the passive voice serves as a confounding element, and one of the contrastive answers for Caus is, in fact, the correct answer here.</p>
        <p>1: Following BLM formal specifications [<xref ref-type="bibr" rid="ref2">2</xref>], we build the errors representing violations of internal (I), external (E) and relational (R) rules of the BLM, and their combinations (e.g. IE, IER, etc.). This information is used in the first part of the error acronym. The second part of the error's label indicates the structure the sentence represents: intransitive (Int), passive (Pass), transitive (Trans) or, in some cases, the NP introduced by the da preposition (WrBy).</p>
        <p>Caus context Caus answers
1 Ag Akt Pat P-NP 1 Pat Akt da-NP Correct
2 Ag Akt Pat da-NP 2 Ag Akt da-NP I-Int
3 Pat Pass da-Ag P-NP 3 Pat Pass da-Ag ER-Pass
4 Pat Pass da-Ag da-NP 4 Ag Pass da-Pat IER-Pass
5 Pat Pass P-NP 5 Pat Akt Ag R-Trans
6 Pat Pass da-NP 6 Ag Akt Pat IR-Trans
7 Pat Akt P-NP 7 Pat Akt da-Ag E-WrBy
? ??? 8 Ag Akt da-Pat IE-WrBy</p>
        <p>Od context Od answers
1 Ag Akt Pat P-NP 1 Pat Akt da-NP I-Int
2 Ag Akt Pat da-NP 2 Ag Akt da-NP Correct
3 Pat Pass da-Ag P-NP 3 Pat Pass da-Ag IER-Pass
4 Pat Pass da-Ag da-NP 4 Ag Pass da-Pat ER-Pass
5 Pat Pass P-NP 5 Pat Akt Ag IR-Trans
6 Pat Pass da-NP 6 Ag Akt Pat R-Trans
7 Ag Akt P-NP 7 Pat Akt da-Ag IE-WrBy
? ??? 8 Ag Akt da-Pat E-WrBy</p>
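        <p>The two templates above can be transcribed into plain data structures to check the two properties stated in the text: the Caus and Od contexts differ only in sentence 7, and the correct Od answer appears among the Caus contrastive answers. The following is an illustrative Python sketch (variable and function names are ours, not from the released code):</p>
        <preformat>
```python
# Figure 2 templates as data: abstract sentence patterns and their labels.
CAUS_CONTEXT = [
    "Ag Akt Pat P-NP", "Ag Akt Pat da-NP",
    "Pat Pass da-Ag P-NP", "Pat Pass da-Ag da-NP",
    "Pat Pass P-NP", "Pat Pass da-NP",
    "Pat Akt P-NP",  # sentence 7: intransitive with a Patient subject
]
OD_CONTEXT = CAUS_CONTEXT[:6] + ["Ag Akt P-NP"]  # sentence 7: Agent subject

CAUS_ANSWERS = {
    "Pat Akt da-NP": "Correct",   "Ag Akt da-NP": "I-Int",
    "Pat Pass da-Ag": "ER-Pass",  "Ag Pass da-Pat": "IER-Pass",
    "Pat Akt Ag": "R-Trans",      "Ag Akt Pat": "IR-Trans",
    "Pat Akt da-Ag": "E-WrBy",    "Ag Akt da-Pat": "IE-WrBy",
}
OD_ANSWERS = {
    "Pat Akt da-NP": "I-Int",     "Ag Akt da-NP": "Correct",
    "Pat Pass da-Ag": "IER-Pass", "Ag Pass da-Pat": "ER-Pass",
    "Pat Akt Ag": "IR-Trans",     "Ag Akt Pat": "R-Trans",
    "Pat Akt da-Ag": "IE-WrBy",   "Ag Akt da-Pat": "E-WrBy",
}

def context_diff(a, b):
    """0-based positions where two context templates differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

def correct_answer(answers):
    """Return the pattern labelled Correct in an answer set."""
    return next(p for p, label in answers.items() if label == "Correct")
```
        </preformat>
        <p>Here context_diff(CAUS_CONTEXT, OD_CONTEXT) returns [6] (sentence 7), and correct_answer(OD_ANSWERS) is exactly the pattern labelled I-Int among the Caus answers.</p>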
      </sec>
      <sec id="sec-2-3">
        <p>Each subset is split 90:20:10 into train:dev:test subsets. The training and test data are disjoint (the agreement data is split based on the correct answer, the alternation data based on the verb). Agreement has 230 test instances for type I, and 4121 for types II and III. The verb alternations have 240 test instances for all subsets. We randomly sample a number of training instances, depending on the experimental set-up.</p>
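        <p>A by-verb disjoint split can be sketched as follows (our illustration, not the released dataset code; names are hypothetical, and the 90:20:10 ratios are normalized over the verbs):</p>
        <preformat>
```python
import random

def split_disjoint_by_verb(instances, ratios=(0.9, 0.2, 0.1), seed=0):
    """Split (verb, item) instances into train/dev/test subsets whose
    verbs are disjoint, so that test verbs are unseen in training."""
    verbs = sorted({verb for verb, _ in instances})
    random.Random(seed).shuffle(verbs)
    total = sum(ratios)
    n_train = round(len(verbs) * ratios[0] / total)
    n_dev = round(len(verbs) * ratios[1] / total)
    train_v = set(verbs[:n_train])
    dev_v = set(verbs[n_train:n_train + n_dev])
    test_v = set(verbs[n_train + n_dev:])

    def pick(verb_set):
        # Keep every instance whose verb falls in the given group.
        return [x for x in instances if x[0] in verb_set]

    return pick(train_v), pick(dev_v), pick(test_v)
```
        </preformat>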
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Multi-task representations</title>
      <p>
        Sentence embeddings encode much information from
the input sentence – lexical, syntactic, semantic, and
possibly other types of information. Previous experiments
have shown that sentence embeddings can be compressed
into very small representations (vectors of size 5) that
encode information about the structure of the sentence
in terms of chunks and their properties, such that they
contribute to finding the sequence patterns in BLMs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In this work, we investigate whether several BLM tasks
can share the same structural information from a
sentence embedding. Towards this end, we built a multi-task
version of a two-level system, illustrated in Figure 3. In
this system, one level processes individual sentences and
learns to compress them into small vectors that retain
information pertinent to a task and the other level uses
the compressed sentence representation to find patterns
across an input sequence to solve a BLM task. The
multitask variation consists in a single shared sentence-level
component, and multiple task components, one for each
of the BLM tasks.</p>
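      <p>A minimal sketch of this shared/split design (hypothetical names, not the authors' implementation, which is in the linked repository): one shared compression function, and one task-level solver per BLM task that scores the candidate answers from the compressed context sequence.</p>
      <preformat>
```python
class MultiTaskBLM:
    """Two-level multi-task system: shared sentence compressor + per-task solvers."""

    def __init__(self, compress, task_solvers):
        self.compress = compress          # shared sentence-level component
        self.task_solvers = task_solvers  # dict: BLM task name -> sequence solver

    def answer(self, task, context_embs, candidate_embs):
        # Compress every sentence embedding with the single shared component,
        # then let the task-specific solver score the candidate answers.
        seq = [self.compress(e) for e in context_embs]
        cands = [self.compress(c) for c in candidate_embs]
        scores = self.task_solvers[task](seq, cands)
        return max(range(len(scores)), key=scores.__getitem__)  # best candidate index

def toy_solver(seq, cands):
    """Stand-in solver: score each candidate by negative squared distance
    to the mean of the compressed context sequence."""
    dim = len(seq[0])
    mean = [sum(v[i] for v in seq) / len(seq) for i in range(dim)]
    return [-sum((m - c[i]) ** 2 for i, m in enumerate(mean)) for c in cands]

# Toy "compression": keep only the first two dimensions of each embedding.
model = MultiTaskBLM(lambda v: v[:2],
                     {"agr": toy_solver, "caus": toy_solver, "od": toy_solver})
best = model.answer("agr", [[1.0, 1.0, 9.9]] * 7,
                    [[1.0, 1.0, 0.0], [5.0, 5.0, 0.0]])
```
      </preformat>
      <p>In this toy run, the first candidate matches the compressed context and is selected; in the real system both the compressor and the solvers are learned.</p>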
      <p>The BLM problems encode a linguistic phenomenon through data that has structure on multiple levels – within sentences, and across the sequence of sentences.</p>
      <p>We can exploit this structure to develop an indirectly supervised approach to discover and use these different levels of structure, through a two-step process: (i) compress individual sentences into a representation that emphasizes the sentence structure relevant to the BLM problem (e.g. chunks and their grammatical number for the subject-verb agreement task); (ii) use the compressed representations to detect the sequence-level pattern and solve the BLM task. This two-step process has been shown to be used by people solving visual intelligence tests [21]. In our case, this setup allows us to investigate whether the sentence level can be guided to learn shared information, relevant to the different linguistic tasks described in section 3.</p>
      <p>We implement this approach in the two-level twin architecture illustrated in Figure 3, and described in detail elsewhere [<xref ref-type="bibr" rid="ref6">6</xref>]. The data is pre-encoded with Electra [18],2 and the sentence representation is provided by the embedding of the [CLS] token.3 We chose Electra because of its stronger sentence-level supervision signal. The loss at the sentence level combines a max-margin term – over a constructed sentence representation ŝ, the correct sentence s⁺ and a set of negative examples S⁻ – and a KL term that regularizes the latent vector z: loss(s) = maxmargin(ŝ, s⁺, S⁻) + KL(z ‖ N(0,1)), where maxmargin(ŝ, s⁺, S⁻) = max(0, 1 − cos(ŝ, s⁺) + Σ_{s⁻ ∈ S⁻} cos(ŝ, s⁻)).</p>
      <sec id="sec-3-1">
        <p>The loss at the task level, for an input sequence, is computed in a similar manner for the constructed answer â, but relative to the answer set A and the correct answer a_c of the task: loss(task) = maxmargin(â, a_c, A ∖ {a_c}) + KL(z_seq ‖ N(0,1)). The loss of the two-level system is the sum of the sentence-level losses over the input sequence and the task-level loss: loss = Σ_{s ∈ seq} loss(s) + loss(task).</p>
        <p>The input batches are shuffled, to alternate between tasks during training and to avoid getting stuck in a local optimum for one of the tasks.</p>
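        <p>Read concretely, the max-margin and KL terms can be sketched as follows (an illustrative re-implementation, not the released code; the diagonal-Gaussian form of the KL term is our assumption):</p>
        <preformat>
```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def maxmargin(s_hat, s_pos, s_negs):
    """max(0, 1 - cos(s_hat, s_pos) + sum over negatives of cos(s_hat, s_neg))."""
    return max(0.0, 1.0 - cos(s_hat, s_pos) + sum(cos(s_hat, n) for n in s_negs))

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(sigma^2)) || N(0, I)): the latent regularizer."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

# A perfectly reconstructed sentence with an orthogonal negative
# incurs no margin loss.
loss = maxmargin([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```
        </preformat>
        <p>The margin term rewards representations close to the correct sentence and far from the contrastive ones; the KL term keeps the compressed latent close to a standard normal.</p>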
      </sec>
      <sec id="sec-3-2">
        <p>2: Italian Electra (E-It) pretrained model: dbmdz/electra-base-italian-xxl-cased-discriminator. Multi-lingual Electra (E-M) model: google/electra-base-discriminator.</p>
        <p>3: To simplify the discussion of the method, we write "sentence" instead of "sentence embedding" when discussing the system.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Multi-task results</title>
      <p>
        Previous published work from our group and current
ongoing work has benchmarked the problems generated
by some of these datasets [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. This work has shown
that information about the syntactic phrases in a
sentence and their properties can be obtained from sentence
embeddings, and this information is helpful in solving
the BLM tasks. We had studied these tasks separately,
and investigate here whether such structure is encoded
in the sentence embeddings, or whether it is assembled
based on shallower patterns within the sentence
representations.
      </p>
      <sec id="sec-4-1">
        <p>The comparison of all the tasks suggests that some syntactic and semantic regularities – such as constituents, grammatical number and semantic roles – cannot be encoded together, as they compete with each other when the system learns to distil them from the pretrained sentence embeddings.</p>
        <p>Figure 4 (caption): Performance comparison across single-task and multi-task training paradigms for the three subtasks Agr, Caus and Od (single task in a darker shade of each colour, multi-task in a lighter shade), trained on type-I data, tested on the three types, and averaged over three independent runs. Results obtained using the Italian Electra pretrained model.</p>
        <p>Discussion: We expect that if the multi-task setup succeeds in sharing information across tasks, then the results on the individual test data will be at least as good as when learning the tasks individually, given that the multi-task setup uses a larger training set – the union of the training sets of the individual tasks. But, overall, this does not seem to be the case.</p>
        <p>As the results in Figure 4 show (and also the detailed results in Tables 1-2 for the Italian Electra pretrained model, and in Tables 3-4 for a multilingual Electra pretrained model), single-task training outperforms multi-tasking in the agreement and verb alternation subtasks. The drop suggests that the multi-task model is not able to learn shared properties for these tasks, and that forcing it to do so leads to a model that is not optimal for either of them. Both tasks require information about the syntactic structure (or sequence of phrases), while each requires different phrase properties – grammatical number for the agreement task, and semantic properties for the verb alternations. While the system is able to distil all this information from sentence embeddings in the single-task setting, it is not able to compress it into a shared representation when learning the tasks together.</p>
        <p>The Od single-task and multi-task settings have comparable performance, probably because the Od tasks involve a simpler alternation than the Caus task: the Od verbs do not have a causative meaning and do not require a change in the semantic role of the subjects.</p>
        <p>Error analysis: For the agreement task, errors on the grammatical number of the attractor nouns (WN1, WN2) are high under both paradigms. These are "sequence errors", indicating that the system was not able to detect the patterns in the input sequence, possibly because individual sentence structures were not properly detected. Previous experiments have shown, though, that in the single-task setting the sentence level does manage to compress the desired information [<xref ref-type="bibr" rid="ref6">6</xref>]. The fact that both these errors increase in the multi-task setting indicates that the information compression on the sentence level is less successful than in the single-task setting.</p>
        <p>For the alternation tasks, error patterns vary, although their distributions remain similar between single-task and multi-task environments. We observe an overall increase of error proportions in the multi-task environment. Specifically, mistakes of the type I-Int are frequent in type III data for the Caus task. These errors incorrectly map the thematic roles onto the syntax of the arguments (e.g. L'artista si è chiuso 'the artist closed' or La carbonara mangiava 'the carbonara was eating'). In the same dataset, we also note an increase of errors related to the last constituent in type I and type II data (errors of type E-WrBy, e.g. La finestra si chiuse dall'artista 'the window closed by the artist'). Finally, for the Od task, we remark that R-Trans errors – those resulting in standard transitive clauses (e.g., L'artista dipinse un paesaggio 'the artist painted a landscape') – are not the most prominent, and do not increase in multi-task environments, suggesting that the chosen answer is not derived from some form of transitive bias [22].</p>
        <p>An overall comparison shows that the error patterns vary across subtasks. This variety in error patterns confirms the influence of the different dimensions investigated (types of alternations, levels of lexicalisation, and single vs. multi-task learning).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>In this paper, we have presented curated synthetic datasets of Italian on two linguistic phenomena of a heterogeneous nature, agreement and the verbal transitive/intransitive alternation, embedded in the BLM task.</p>
      <p>The results on the performance and the error analysis of a tailored two-level architecture have shown that multi-task environments do not help, suggesting that abstract linguistic notions, such as constituents or thematic roles, do not seem to be present in the learning process.</p>
      <p>Current work is developing new analyses and architectures to probe further into the encoding of information in sentence embeddings, and creating new BLM problems across various languages and linguistic phenomena.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <p>We gratefully acknowledge the partial support of this work by the Swiss National Science Foundation, through SNF Advanced grant TMAG-1_209426 to PM.</p>
        <p>[8] P. A. Carpenter, M. A. Just, P. Shell, What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test., Psychological Review 97 (1990) 404–431. doi:10.1037/0033-295X.97.3.404.</p>
        <p>[9] Z. Zhang, W. Yu, M. Yu, Z. Guo, M. Jiang, A survey of multi-task learning in natural language processing: Regarding task relatedness and training methods, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 943–956. URL: https://aclanthology.org/2023.eacl-main.66. doi:10.18653/v1/2023.eacl-main.66.</p>
        <p>[10] S. Chen, Y. Zhang, Q. Yang, Multi-task learning in natural language processing: An overview, ACM Computing Surveys (2021).</p>
        <p>[11] S. Ruder, An overview of multi-task learning in deep neural networks, arXiv preprint arXiv:1706.05098 (2017).</p>
        <p>[12] J. Pfeiffer, S. Ruder, I. Vulić, E. M. Ponti, Modular deep learning, arXiv preprint arXiv:2302.11529 (2023).</p>
        <p>[13] T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, S. Savarese, Which tasks should be learned together in multi-task learning?, in: International Conference on Machine Learning, PMLR, 2020, pp. 9120–9132.</p>
        <p>[14] B. Zhou, X. Cai, Y. Zhang, X. Yuan, An end-to-end progressive multi-task learning framework for medical named entity recognition and normalization, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 6214–6224. URL: https://aclanthology.org/2021.acl-long.485. doi:10.18653/v1/2021.acl-long.485.</p>
        <p>[15] Z. Hu, H. P. Chan, L. Huang, MOCHA: A multi-task training approach for coherent text generation from cognitive perspective, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 10324–10334. URL: https://aclanthology.org/2022.emnlp-main.705. doi:10.18653/v1/2022.emnlp-main.705.</p>
        <p>[16] Y. Wang, C. Zhai, H. Hassan, Multi-task learning for multilingual neural machine translation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 1022–1034. URL: https://aclanthology.org/2020.emnlp-main.75. doi:10.18653/v1/2020.emnlp-main.75.</p>
        <p>[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
        <p>[18] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1–18.</p>
        <p>[19] B. Levin, English Verb Classes and Alternations: A Preliminary Investigation, University of Chicago Press, Chicago and London, 1993.</p>
        <p>[20] P. Merlo, S. Stevenson, Automatic verb classification based on statistical distributions of argument structure, Computational Linguistics 27 (2001) 373–408.</p>
        <p>[21] P. A. Carpenter, M. A. Just, P. Shell, What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices test., Psychological Review 97 (1990) 404.</p>
        <p>[22] K. Kann, A. Warstadt, A. Williams, S. R. Bowman, Verb argument structure alternations in word and sentence embeddings, in: Proceedings of the Society for Computation in Linguistics (SCiL) 2019, 2019, pp. 287–297. URL: https://aclanthology.org/W19-0129. doi:10.7275/q5js-4y86.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <p>A.1. An Italian example for the subject-verb agreement BLM</p>
      <p>A.2. Verb alternation examples</p>
      <p>Od, typeI - Context
La turista mangia una carbonara in un secondo
La turista mangia una carbonara da mezz’ora
Una carbonara è mangiata dalla turista in un secondo
Una carbonara è mangiata dalla turista da mezz’ora
Una carbonara è mangiata in un secondo
Una carbonara è mangiata da mezz’ora
La turista mangia in un secondo
???</p>
      <sec id="sec-7-1">
        <title>Od, typeII - Context</title>
        <p>La zia mangia una bistecca nella sala grande
La presidente può mangiare una bistecca da programma
La specialità della casa deve essere mangiata dalla turista
nella sala grande
Una bistecca fu mangiata dalla presidente da sola
La specialità della casa deve essere mangiata in un secondo
Una bistecca deve poter essere mangiata da sola
La turista deve mangiare con fame
???</p>
      </sec>
      <sec id="sec-7-2">
        <title>Od, typeIII - Context</title>
        <p>L’attore deve canticchiare un motivetto dopo il festival
L’amica di mia mamma deve cucire la tasca da qualche
giorno
L’inno nazionale può essere cantato dal vincitore del
festival con solo pianoforte
Una bistecca deve essere mangiata dalla turista da sola
Il manuale è insegnato nell’aula magna
Questi attrezzi devono essere intagliati da manuale
I due fratelli studiano con molta attenzione
???</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>B. Results</title>
      <p>B.1. Results with the Italian Electra pretrained model: dbmdz/electra-base-italian-xxl-cased-discriminator</p>
      <p>B.2. Results with the multilingual Electra pretrained model: google/electra-base-discriminator</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          , Challenges and Opportunities in NLP Benchmarking, http://www.ruder.io/nlp-benchmarking,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Merlo</surname>
          </string-name>
          ,
          <article-title>Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Motivations and formal specifications</article-title>
          ,
          <source>ArXiv cs.CL 2306.11444</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Raven</surname>
          </string-name>
          , Standardization of progressive matrices,
          <source>British Journal of Medical Psychology</source>
          <volume>19</volume>
          (
          <year>1938</year>
          )
          <fpage>137</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nastase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merlo</surname>
          </string-name>
          ,
          <article-title>BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks</article-title>
          ,
          <source>in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>1363</fpage>
          -
          <lpage>1374</lpage>
          . URL: https://aclanthology.org/2023.eacl-main.99.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Nastase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merlo</surname>
          </string-name>
          ,
          <article-title>Grammatical information in BERT sentence embeddings as two-dimensional arrays</article-title>
          , in:
          <string-name><given-names>B.</given-names> <surname>Can</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Mozes</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Cahyawijaya</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Saphra</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Kassner</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Ravfogel</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Ravichander</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Zhao</surname></string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Augenstein</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Rogers</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Grefenstette</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Voita</surname></string-name>
          (Eds.),
          <source>Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>39</lpage>
          . URL: https://aclanthology.org/2023.repl4nlp-1.3. doi:10.18653/v1/2023.repl4nlp-1.3.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Nastase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merlo</surname>
          </string-name>
          ,
          <article-title>Are there identifiable structural parts in the sentence embedding whole?</article-title>
          , in:
          <string-name><given-names>Y.</given-names> <surname>Belinkov</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Kim</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Jumelet</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Mohebbi</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Mueller</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Chen</surname></string-name>
          (Eds.),
          <source>Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP</source>
          , Association for Computational Linguistics, Miami, Florida, US,
          <year>2024</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>42</lpage>
          . URL: https://aclanthology.org/2024.blackboxnlp-1.3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>Understanding natural language understanding systems</article-title>
          ,
          <source>Sistemi intelligenti, Rivista quadrimestrale di scienze cognitive e di intelligenza artificiale</source>
          (
          <year>2023</year>
          )
          <fpage>277</fpage>
          -
          <lpage>302</lpage>
          . URL: https://www.rivisteweb.it/doi/10.1422/107438. doi:10.1422/107438.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Carpenter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shell</surname>
          </string-name>
          ,
          <article-title>What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matri-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>