Exploring Italian sentence embeddings properties through multi-tasking

Vivi Nastase1,*, Giuseppe Samo1, Chunyang Jiang1,2 and Paola Merlo1,2
1 Idiap Research Institute, Martigny, Switzerland
2 University of Geneva, Geneva, Switzerland

CLiC-it 2024: 10th Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author: vivi.a.nastase@gmail.com (V. Nastase); giuseppe.samo@idiap.ch (G. Samo); chunyang.jiang42@gmail.com (C. Jiang); Paola.Merlo@unige.ch (P. Merlo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale – several Blackbird Language Matrices (BLMs) problems in Italian – and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure – in terms of sequence of phrases/chunks – and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles do not seem to be present in the pretrained sentence embeddings.

Abstract (translated from Italian)
The goal of this work is to investigate to what extent current LLMs learn abstract linguistic representations in multi-task configurations. Using curated synthetic data on a large scale, from several BLM problems in Italian, we study how sentence representations built by pre-trained language models encode semantic and syntactic information. We used a two-level architecture to model separately, on one side, the compression of the input sentence embeddings into a representation that contains information relevant to the BLM tasks and, on the other, the BLM task itself. We then tested whether it is possible to obtain compressed sentence representations that encode syntactic and semantic information relevant to the different BLM tasks. Contrary to the prediction that sentence structure, in terms of sequences of phrases/chunks, and chunk properties could be shared across tasks, the results and the error analysis show that the clues for the different tasks are encoded differently in the sentence embeddings. This result suggests that abstract linguistic notions such as constituents or thematic roles do not seem to be present there.

Keywords
synthetic structured data, multi-task, diagnostic studies of deep learning models

1. Introduction

Driven by increasing computational scale and progress in deep learning techniques, NLP models can rival human capabilities on established benchmarks. New benchmarks, then, that capture deeper levels of language understanding must be created and analysed [1].

Blackbird's Language Matrices (BLM) [2] is a recent task inspired by visual tests of analytic intelligence (Raven Progressive Matrices/RPMs, [3]). The BLM tasks have cast light on whether the correct predictions in previously studied linguistic problems, e.g. number agreement or verb alternations, stem from sentence embeddings that encode deeper linguistic information, such as syntactic structure and semantic properties of phrases [4, 5, 6]. We found that higher-level information – syntactic structure and argument structure – can be assembled from the information encoded in the sentence embeddings. This, however, may not be due to a deeper understanding of such information encoded by LLMs, but rather because of useful surface indicators [7].

In this paper, we adopt BLMs to investigate whether current pretrained models encode abstract linguistic notions, such as constituents, and are able to do so in a manner that comprises both functional elements, such as pronouns and demonstratives, and lexical elements, such as nominal constituents.

We concentrate on Italian, and study several grammatical problems whose solutions can theoretically help each other, in a multi-task setting. We adopt a two-level architecture developed specifically to model what we know about how humans solve puzzles similar to BLMs [8]. Level 1 aims to obtain compressed sentence representations that capture information about constituents and their properties; level 2 uses the compressed sentence representations to solve a BLM problem. This architecture provides a tool to study how LLMs encode different types of syntactic and semantic information.

We make two contributions: (i) an initial core BLM dataset for Italian that covers linguistic problems of different kinds; (ii) single and multi-task experiments that provide new insights into the information encoded by LLMs. The datasets are available at https://www.idiap.ch/dataset/(blm-agri|blm-causi|blm-odi) and the code at https://github.com/CLCL-Geneva/BLM-SNFDisentangling.

BLM agreement problem (BLM-AgrI)

    Context template              Answer set
    1 NP-sg PP1-sg VP-sg          NP-pl PP1-pl PP2-sg VP-pl      Correct
    2 NP-pl PP1-sg VP-pl          NP-pl PP1-pl et PP2-sg VP-pl   Coord
    3 NP-sg PP1-pl VP-sg          NP-pl PP1-pl VP-pl             WNA
    4 NP-pl PP1-pl VP-pl          NP-pl PP1-sg PP1-sg VP-pl      WN1
    5 NP-sg PP1-sg PP2-sg VP-sg   NP-pl PP1-pl PP2-pl VP-pl      WN2
    6 NP-pl PP1-sg PP2-sg VP-pl   NP-pl PP1-pl PP2-pl VP-sg      AEV
    7 NP-sg PP1-pl PP2-sg VP-sg   NP-pl PP1-sg PP2-pl VP-sg      AEN1
    8 ???                         NP-pl PP1-pl PP2-sg VP-sg      AEN2

Figure 1: BLM instances for verb-subject agreement, with two attractors. We build candidate answers displaying one of two types of errors: (i) sequence errors: WNA = wrong nr. of attractors; WN1 = wrong gram. nr. for 1st attractor noun (N1); WN2 = wrong gram. nr. for 2nd attractor noun (N2); (ii) grammatical errors: AEV = agreement error on the verb; AEN1 = agreement error on N1; AEN2 = agreement error on N2.

2. Related Work

Multi-task learning has been popular in improving NLP systems' performance by using knowledge shared across multiple tasks [9].

Multi-task learning architectures include parallel, hierarchical, and modular designs [10]. Parallel architectures share intermediate layers across tasks, conducive to efficient knowledge transfer [11]. Hierarchical architectures capture task dependencies by layering task-specific modules on shared bases. Modular approaches selectively share components among tasks to balance between generalisation and task-specific optimisation [12]. These training strategies are not mutually exclusive and can be combined.

Multi-task learning can be used efficiently in resource-constrained environments, to counter data scarcity and overfitting: aggregating training data and sharing parameters across related tasks acts as a form of data augmentation [13].
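Of the designs above, the parallel one is the simplest to picture: a shared encoder feeding one output head per task. The following is a minimal illustrative sketch in NumPy, with invented class, dimension and task names; it is not the architecture used in this paper (described in Section 4), only the general parameter-sharing pattern:

```python
import numpy as np

class SharedEncoderMultiTask:
    """Parallel multi-task design: one shared intermediate layer,
    one linear head per task. Dimensions and initialisation are
    arbitrary; this only illustrates the parameter-sharing pattern."""

    def __init__(self, dim_in, dim_shared, task_dims, seed=0):
        rng = np.random.default_rng(seed)
        # parameters shared across all tasks
        self.w_shared = rng.normal(scale=0.1, size=(dim_in, dim_shared))
        # task-specific parameters, one head per task
        self.heads = {task: rng.normal(scale=0.1, size=(dim_shared, d))
                      for task, d in task_dims.items()}

    def forward(self, x, task):
        h = np.tanh(x @ self.w_shared)  # shared representation
        return h @ self.heads[task]     # task-specific prediction
```

For example, a model with heads for an agreement task and an alternation task would be built as `SharedEncoderMultiTask(768, 32, {"agr": 8, "caus": 8})`; gradients flowing back from either task would update the same shared weights, which is what makes the shared layer act as a form of cross-task transfer.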
Effective multi-task learning depends on the relatedness of the tasks involved. Tasks that are similar or have similar objectives tend to benefit more from shared representations. This observation has been used in various NLP tasks, including named entity recognition [14], text generation [15], and machine translation [16], among others. Selecting related tasks that contribute positively to the shared model's training is important and remains an active area of research [9].

Pretrained large language models exhibit general-purpose abilities and knowledge, with high results with little or no fine-tuning on downstream tasks [17, 18]. We can then regard these language models as the result of "multi-task" learning, and our aim here is to test whether sentence embeddings obtained from these models encode syntactic and semantic information consistently, such that different BLM problems that rely on similar linguistic information draw on the same clues from these representations. In particular, we will use BLM tasks on subject-verb agreement, which relies on chunk structure and the chunks' grammatical number properties, and on verb alternations, which rely on chunk structure and the chunks' semantic role properties, to test whether chunk structure is encoded in a manner that allows it to be shared by the two tasks.

3. The BLM task and the BLM Italian datasets

Raven's progressive matrices are multiple-choice completion IQ tests, whose solution requires discovering the underlying generative rules of a sequence of images [3]. A similar task has been developed for linguistic problems, called Blackbird Language Matrices (BLMs) [2], as given in Figure 1, which illustrates the template of a BLM agreement matrix. A BLM comprises a context and an answer set. The context is a sequence of sentences generated following the relevant rules of a given linguistic phenomenon under investigation, which in this way implicitly illustrates these grammatical properties. The sequence also follows some extra-linguistic progression rules. Each context is paired with a set of candidate answers. The answer sets contain minimally contrastive examples built by corrupting some of the generating rules.

The BLM Italian datasets consist of BLMs focused on the property of subject-verb agreement and on two transitive-intransitive alternations: the change-of-state alternation and the object-drop alternation.

3.1. BLM-AgrI – subject-verb agreement in Italian

The BLM-AgrI dataset was created by manually translating the seed French sentences [4] into Italian (by a native speaker, one of the authors), and then generating the full dataset following the same process of lexical augmentation and sentence shuffling among instances described in [4]. The internal nominal structure in these languages is very similar, so translations are almost parallel. An illustrative, simplified example for Italian is provided in Figure 7, in the appendix. The dataset comprises three subsets of increasing lexical complexity (called Type I, Type II and Type III) to test the ability of the system to handle item novelty.

3.2. BLM-CausI and BLM-OdI

While BLM-AgrI tests information about a formal grammatical property, agreement, the Causative (Caus) and Object-drop (Od) alternation datasets test lexical semantic properties of verbs: their ability to enter or not a causative alternation. Caus represents the causative/inchoative alternation, where the object of the transitive verb bears the same semantic role (Patient) as the subject of the intransitive verb (L'artista ha aperto la finestra / La finestra si è aperta, 'The artist opened the window' / 'The window opened'). The transitive form of the verb has a causative meaning. In contrast, the subject in Od bears the same semantic role (Agent) in both the transitive and intransitive forms (L'artista dipingeva la finestra / L'artista dipingeva, 'the artist painted the window' / 'the artist painted') and the verb does not have a causative meaning [19, 20].

BLM-CausI context and answers. The context set of the verb alternation varies depending on the presence of one or two arguments and their attributes (agents, Ag; patients, Pat) and the active (Akt) or passive (Pass) voice of the verb. The non-linguistic factor that structures the sequence is an alternation every two items between a prepositional phrase introduced by any preposition (e.g., in pochi secondi, P-NP) and a PP introduced by the agentive da-NP (e.g., dall'artista, da-Ag/da-Pat). The answer set is composed of one correct answer and contrastive wrong answers, all formed by the same four elements: a verb, two nominal constituents and a prepositional phrase. Figure 2 shows the template.¹ We illustrate the data in Figure 8 in the appendix with the Italian change-of-state verb chiudere 'close'.

¹ Following BLM formal specifications [2], we build the errors representing violations of internal (I), external (E) and relational (R) rules of the BLM, and their combinations (e.g. IE, IER, etc.). This information is used in the first part of the error acronym. The second part of the error's label indicates the structure the sentence represents: intransitive (Int), passive (Pass), transitive (Trans) or, in some cases, the NP introduced by the da preposition (WrBy).

BLM-OdI context and answers. The BLM for Od is the same as for Caus, but here the passive voice serves as a confounding element, and one of the contrastive answers for Caus is, in fact, the correct answer here. The template is also in Figure 2. Due to the asymmetry between the two classes of verbs, the contexts of the BLMs minimally differ in the intransitive followed by P-NP (sentence 7). The correct answer also varies across the two groups, although in both cases it is an intransitive form with a da-NP. Examples are shown in the Appendix.

    Caus context             Caus answers
    1 Ag Akt Pat P-NP        1 Pat Akt da-NP     Correct
    2 Ag Akt Pat da-NP       2 Ag Akt da-NP      I-Int
    3 Pat Pass da-Ag P-NP    3 Pat Pass da-Ag    ER-Pass
    4 Pat Pass da-Ag da-NP   4 Ag Pass da-Pat    IER-Pass
    5 Pat Pass P-NP          5 Pat Akt Ag        R-Trans
    6 Pat Pass da-NP         6 Ag Akt Pat        IR-Trans
    7 Pat Akt P-NP           7 Pat Akt da-Ag     E-WrBy
    8 ???                    8 Ag Akt da-Pat     IE-WrBy

    Od context               Od answers
    1 Ag Akt Pat P-NP        1 Pat Akt da-NP     I-Int
    2 Ag Akt Pat da-NP       2 Ag Akt da-NP      Correct
    3 Pat Pass da-Ag P-NP    3 Pat Pass da-Ag    IER-Pass
    4 Pat Pass da-Ag da-NP   4 Ag Pass da-Pat    ER-Pass
    5 Pat Pass P-NP          5 Pat Akt Ag        IR-Trans
    6 Pat Pass da-NP         6 Ag Akt Pat        R-Trans
    7 Ag Akt P-NP            7 Pat Akt da-Ag     IE-WrBy
    8 ???                    8 Ag Akt da-Pat     E-WrBy

Figure 2: BLM contexts, answers and their location of errors (see text) for the change-of-state group (Caus) and the object-drop (Od) class.

Lexicalisation. In line with previous work on BLMs, each dataset also contains a varying amount of lexicalisation. In type I the lexical material of the sentences within a single context does not change, in type II only the verb remains the same, and in type III data all words can change (Figure 9, in the appendix).

3.3. Dataset statistics

Each subset is split 90:20:10 into train:dev:test subsets. The training and testing data are disjoint (agreement data is split based on the correct answer, the alternation data based on the verb). Agreement has 230 test instances for type I, 4121 for types II and III. The verb alternations have 240 test instances for all subsets. We randomly sample a number of training instances, depending on the experimental set-up.

4. Multi-task representations

Sentence embeddings encode much information from the input sentence – lexical, syntactic, semantic, and possibly other types of information. Previous experiments have shown that sentence embeddings can be compressed into very small representations (vectors of size 5) that
encode information about the structure of the sentence in terms of chunks and their properties, such that they contribute to finding the sequence patterns in BLMs [6].

In this work, we investigate whether several BLM tasks can share the same structural information from a sentence embedding. Towards this end, we built a multi-task version of a two-level system, illustrated in Figure 3. In this system, one level processes individual sentences and learns to compress them into small vectors that retain information pertinent to a task, and the other level uses the compressed sentence representations to find patterns across an input sequence to solve a BLM task. The multi-task variation consists in a single shared sentence-level component and multiple task components, one for each of the BLM tasks.

Figure 3: A two-level VAE: the sentence level learns to compress a sentence into a representation useful to solve the BLM problem on the task level.

The BLM problems encode a linguistic phenomenon through data that has structure on multiple levels – within sentences, and across a sequence of sentences. We can exploit this structure to develop an indirectly supervised approach to discover and use these different levels of structure. We thus model the solving of a BLM task as a two-step process: (i) compress individual sentences into a representation that emphasizes the sentence structure relevant to the BLM problem (e.g. chunks and their grammatical number for the subject-verb agreement task); (ii) use the compressed representations to detect the sequence-level pattern and solve the BLM task. This two-step process has been shown to be used by people solving visual intelligence tests [21]. In our case, this setup allows us to investigate whether the sentence level can be guided to learn shared information, relevant to the different linguistic tasks described in section 3.

The two levels are learned together. The input is a BLM instance, which is processed on the fly to produce training instances for the sentence level for each sentence in_k in the input sequence S. The compressed sentence representations on the latent layer z_{in_k} are stacked and passed as input to the task level, which produces a sentence representation answ as output; this output is compared to the answer set A of the respective BLM instance.

The sentence level uses a variational encoder-decoder architecture to learn how to compress on the latent layer a representation that captures relevant structural information. We guide the system towards this representation by constructing a contrastive set of candidates for comparison with the reconstructed input. The correct output (out+) is the same as the input (in), and a selection of other sentences from the input sequence serves as the contrastive negative outputs (Out- = {out-_i, i = 1, ..., N_negs}, with N_negs = 7; note that an input sequence consists of sentences with patterns different from each other, see Figures 1 and 2).

We use a max-margin loss function to take advantage of the contrastive answers, where \hat{in} is the reconstructed input sentence from the sampled latent vector z_in:

    loss_sent(in) = maxM(\hat{in}, out^+, Out^-) + KL(z_in || N(0,1))

    maxM(\hat{in}, out^+, Out^-) = max(0, 1 - cos(\hat{in}, out^+) + (1/N_negs) * \sum_{out^-_i \in Out^-} cos(\hat{in}, out^-_i))

The loss at the task level for input sequence S is computed in a similar manner for the constructed answer answ, but relative to the answer set A and the correct answer a_c of the task:

    loss_task(S) = maxM(answ, a_c, A \ {a_c}) + KL_seq(z_S || N(0,1))

The loss of the two-level system is:

    loss(S) = \sum_{in_k \in S} loss_sent(in_k) + loss_task(S)

We implement this approach in the two-level intertwined architecture illustrated in Figure 3, and described in detail elsewhere [6].
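Numerically, the sentence-level objective can be sketched as follows (a minimal NumPy sketch; function names follow the formulas rather than the released code, and the vectors stand in for the decoder output, the candidate outputs and the latent parameters):

```python
import numpy as np

def cos(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin(recon, out_pos, out_negs):
    # maxM(în, out+, Out-): reward similarity to the correct output,
    # penalise average similarity to the N_negs contrastive outputs
    neg = sum(cos(recon, o) for o in out_negs) / len(out_negs)
    return max(0.0, 1.0 - cos(recon, out_pos) + neg)

def kl_to_standard_normal(mu, logvar):
    # KL(N(mu, sigma^2) || N(0, 1)) for a diagonal Gaussian latent
    return float(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar)))

def loss_sent(recon, out_pos, out_negs, mu, logvar):
    # loss_sent(in) = maxM(în, out+, Out-) + KL(z_in || N(0,1))
    return max_margin(recon, out_pos, out_negs) + kl_to_standard_normal(mu, logvar)
```

The task-level loss has the same shape, with the constructed answer compared against the correct answer and the remaining candidates of the answer set.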
The data is pre-encoded with Electra [18]; we use the Italian Electra (E-It) pretrained model dbmdz/electra-base-italian-xxl-cased-discriminator and the multilingual Electra (E-M) model google/electra-base-discriminator. The sentence representation is provided by the embedding of the [CLS] token. (To simplify the discussion of the method, we write "sentence" instead of "sentence embedding" when discussing the system.) We chose Electra because of its stronger sentence-level supervision signal, which leads to higher results when testing the encoding of structural information compared to BERT, RoBERTa, and models tuned by semantic similarity [6].

The input batches are shuffled, to alternate between tasks during training and avoid getting stuck in a local maximum for one of the tasks.

5. Multi-task results

Previous published work from our group and current ongoing work has benchmarked the problems generated by some of these datasets [4, 5]. This work has shown that information about the syntactic phrases in a sentence and their properties can be obtained from sentence embeddings, and that this information is helpful in solving the BLM tasks. We had studied these tasks separately, and investigate here whether such structure is encoded in the sentence embeddings, or whether it is assembled based on shallower patterns within the sentence representations.

Figure 5: Error analysis for agreement: multi- vs. single-task, training on type I data, testing on all. [Bar plot of error proportions for the error types WN1, WN2, Coord, WNA, AEN1 and AEV, for single-task and multi-task models on type I, II and III data.]

The comparison of all the tasks suggests that some syntactic and semantic regularities – such as constituents, grammatical number and semantic roles – cannot be encoded together, as they compete with each other when the system learns to distil them from the pretrained sentence embeddings.
type_I-SingleTask type_II-SingleTask type_III-SingleTask type_I-Multitask type_II-Multitask type_III-Multitask Figure 4: Performance comparison across single-task and Error Analysis For the agreement task, errors on the multi-task training paradigms for the three subtasks (single grammatical number of the attractor nouns (WN1, WN2) task darker shade of each colour, multi-task lighter shade), are high under both paradigms. These are "sequence er- trained on type-I data, tested on the three types, and aver- rors", indicating that the system was not able to detect aged over three independent runs. Results obtained using the the patterns in the input sequence, possibly because in- Italian Electra pretrained model. dividual sentence structures were not properly detected. Previous experiments have shown, though, that in the Discussion We expect that if the multi-task setup suc- single-task setting, the sentence level does manage to ceeds in sharing information across tasks, then the re- compress the desired information [6]. The fact that both sults on the individual test data will be at least as good these errors increase in the multi-task setting indicates as when learning tasks individually, given that the multi- that the information compression on the sentence level task setup uses a larger training set data – the union of is less successful than in the single-task setting. the training sets of the individual tasks. But, overall, this For the alternation tasks, error patterns vary, although does not seem to be the case. their distributions remain similar between single-task As the results in Figure 4 show (and also the detailed re- and multi-task environments. We observe an overall in- sults in Tables 1-2 for the Italian Electra pretrained model, crease of error proportions in the multi-task environment. 
and in Tables 3-4 for a multilingual Electra pretrained Specifically, mistakes of the type I-int are frequent in model), single-task training outperforms multi-tasking type III data for the Caus task. These errors incorrectly in the agreement and verb alternation subtasks. The map the thematic roles onto the syntax of the arguments drop suggests that the multi-task model is not able to (e.g. L’artista si è chiuso ‘the artist closed’ or La car- learn shared properties for these tasks, and forcing it to bonara mangiava ‘the carbonara was eating’). In the do so leads to a model that is not optimal for either of same dataset, we also note an increase of errors related them. Both tasks require information about the syntactic to the last constituent in type I and type II data (errors structure (or sequence of phrases), while each requires of type E-WrBy, e.g. La finestra si chiuse dall’artista ‘the different phrase properties – grammatical number for window closed by the artist’). Finally, for the Od task, the agreement task, and semantic properties for the verb we remark that R-trans errors are not the most promi- alternation. While the system is able to distil all this in- nent —these are the errors resulting in standard transi- formation from sentence embeddings in the single-task tive clauses (e.g., L’artista dipinse un paesaggio ‘the artist setting, it is not able to compress it into a shared repre- painted a landscape’)— and do not increase in multi-task sentation when learning the tasks together. environments, suggesting that the chosen answer is not The Od single-task and multi-task have comparable derived from some forms of transitive bias [22]. performance, probably because the Od tasks involve a An overall comparison shows that the error patterns simpler alternation than the Caus task. They do not have vary across subtasks. 
This variety in error patterns con- a causative meaning and do not require a change in the firms that the different dimensions (types of alternations, semantic role of the subjects. levels of lexicalisation and single and multi-task learning) 0.12 0.12 0.10 0.10 0.08 0.08 Error Proportion Error Proportion 0.06 0.06 0.04 0.04 0.02 0.02 0.00 0.00 I-Int E-WrBy R-trans I-Int E-WrBy R-trans Error Types Error Types (a) Caus task error analysis (b) Od task error analysis Figure 6: Error analysis between single and multi-task training paradigms trained on type-I data, tested on the three types, as averages over three runs (single task darker shade of each colour, multi-task lighter shade). For the Caus and Od tasks, we report only three representative error types of I, E and R. are separate uncorrelated dimensions. It also indicates [3] J. C. Raven, Standardization of progressive matrices, that the differences in the F1 results shown in Figure 4 British Journal of Medical Psychology 19 (1938) 137– are real, despite the more homogeneous trends exhibited 150. by these aggregated F1 numbers. [4] A. An, C. Jiang, M. A. Rodriguez, V. Nastase, P. Merlo, BLM-AgrF: A new French benchmark to investigate generalization of agreement in neu- 6. Conclusions ral networks, in: Proceedings of the 17th Confer- ence of the European Chapter of the Association for In this paper, we have presented curated synthetic Computational Linguistics, Association for Compu- datasets of Italian on two linguistic phenomena of an tational Linguistics, Dubrovnik, Croatia, 2023, pp. heterogeneous nature, such as agreement and verbal tran- 1363–1374. URL: https://aclanthology.org/2023.eacl sitive/intransitive alternation, embedded in the BLM task. -main.99. The results on the performance and the error analysis [5] V. Nastase, P. 
Merlo, Grammatical information in of a tailored two-level architecture have shown that multi- BERT sentence embeddings as two-dimensional task environments do not help, suggesting that abstract arrays, in: B. Can, M. Mozes, S. Cahyawijaya, linguistic notions, such as constituents or thematic roles N. Saphra, N. Kassner, S. Ravfogel, A. Ravichan- do not seem to be present in the learning process. der, C. Zhao, I. Augenstein, A. Rogers, K. Cho, Current work is developing new analyses and archi- E. Grefenstette, L. Voita (Eds.), Proceedings of the tectures to probe further in the encoding of information 8th Workshop on Representation Learning for NLP in sentence embeddings and creating new BLM problems (RepL4NLP 2023), Association for Computational across various languages and linguistic phenomena. Linguistics, Toronto, Canada, 2023, pp. 22–39. URL: https://aclanthology.org/2023.repl4nlp- 1.3. Acknowledgments doi:10.18653/v1/2023.repl4nlp-1.3. [6] V. Nastase, P. Merlo, Are there identifiable struc- We gratefully acknowledge the partial support of this tural parts in the sentence embedding whole?, work by the Swiss National Science Foundation, through in: Y. Belinkov, N. Kim, J. Jumelet, H. Mohebbi, grant SNF Advanced grant TMAG-1_209426 to PM. A. Mueller, H. Chen (Eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpret- ing Neural Networks for NLP, Association for Com- References putational Linguistics, Miami, Florida, US, 2024, pp. 23–42. URL: https://aclanthology.org/2024.blackb [1] S. Ruder, Challenges and Opportunities in NLP oxnlp-1.3. Benchmarking, http://www.ruder.io/nlp-bench [7] A. Lenci, Understanding natural language un- marking, 2021. derstanding systems, Sistemi intelligenti, Rivista [2] P. Merlo, Blackbird language matrices (BLM), a new quadrimestrale di scienze cognitive e di intelligenza task for rule-like generalization in neural networks: artificiale (2023) 277–302. 
URL: https://www.rivisteweb.it/doi/10.1422/107438. doi:10.1422/107438.
Motivations and formal specifications, ArXiv cs.CL 2306.11444 (2023). URL: https://doi.org/10.48550/arXiv.2306.11444. doi:10.48550/arXiv.2306.11444.
[8] P. A. Carpenter, M. A. Just, P. Shell, What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test, Psychological Review 97 (1990) 404–431. doi:10.1037/0033-295X.97.3.404.
[9] Z. Zhang, W. Yu, M. Yu, Z. Guo, M. Jiang, A survey of multi-task learning in natural language processing: Regarding task relatedness and training methods, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 943–956. URL: https://aclanthology.org/2023.eacl-main.66. doi:10.18653/v1/2023.eacl-main.66.
[10] S. Chen, Y. Zhang, Q. Yang, Multi-task learning in natural language processing: An overview, ACM Computing Surveys (2021).
[11] S. Ruder, An overview of multi-task learning in deep neural networks, arXiv preprint arXiv:1706.05098 (2017).
[12] J. Pfeiffer, S. Ruder, I. Vulić, E. M. Ponti, Modular deep learning, arXiv preprint arXiv:2302.11529 (2023).
[13] T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, S. Savarese, Which tasks should be learned together in multi-task learning?, in: International Conference on Machine Learning, PMLR, 2020, pp. 9120–9132.
[14] B. Zhou, X. Cai, Y. Zhang, X. Yuan, An end-to-end progressive multi-task learning framework for medical named entity recognition and normalization, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 6214–6224. URL: https://aclanthology.org/2021.acl-long.485. doi:10.18653/v1/2021.acl-long.485.
[15] Z. Hu, H. P. Chan, L. Huang, MOCHA: A multi-task training approach for coherent text generation from cognitive perspective, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 10324–10334. URL: https://aclanthology.org/2022.emnlp-main.705. doi:10.18653/v1/2022.emnlp-main.705.
[16] Y. Wang, C. Zhai, H. Hassan, Multi-task learning for multilingual neural machine translation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 1022–1034. URL: https://aclanthology.org/2020.emnlp-main.75. doi:10.18653/v1/2020.emnlp-main.75.
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[18] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020, pp. 1–18.
[19] B. Levin, English Verb Classes and Alternations: A Preliminary Investigation, University of Chicago Press, Chicago and London, 1993.
[20] P. Merlo, S. Stevenson, Automatic verb classification based on statistical distributions of argument structure, Computational Linguistics 27 (2001) 373–408.
[21] P. A. Carpenter, M. A. Just, P. Shell, What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test, Psychological Review 97 (1990) 404.
[22] K. Kann, A. Warstadt, A. Williams, S. R. Bowman, Verb argument structure alternations in word and sentence embeddings, in: Proceedings of the Society for Computation in Linguistics (SCiL) 2019, 2019, pp. 287–297. URL: https://aclanthology.org/W19-0129. doi:10.7275/q5js-4y86.

A. Appendix

A.1. An Italian example for the subject-verb agreement BLM

Context
1 Il vaso con il fiore si è rotto.
2 I vasi con il fiore si sono rotti.
3 Il vaso con i fiori si è rotto.
4 I vasi con i fiori si sono rotti.
5 Il vaso con il fiore del giardino si è rotto.
6 I vasi con il fiore del giardino si sono rotti.
7 Il vaso con i fiori del giardino si è rotto.
8 ???

Answer set
1 Il vaso con i fiori e il giardino si è rotto.      coord
2 I vasi con i fiori del giardino si sono rotti.     correct
3 Il vaso con il fiore si è rotto.                   WNA
4 I vasi con il fiore del giardino si sono rotti.    WN1
5 I vasi con i fiori dei giardini si sono rotti.     WN2
6 Il vaso con il fiore del giardino si sono rotti.   AEV
7 Il vaso con i fiori del giardino si sono rotti.    AEN1
8 Il vaso con il fiore dei giardini si sono rotti.   AEN2

Figure 7: An illustrative example of the BLM instances for subject-verb agreement, with 2 attractors (fiore 'flower', giardino 'garden') and the candidate answer set.

A.2.
Verb alternation examples

Caus - Context
1 Una stella del cinema chiuse la sua carriera con forza
2 Una stella del cinema chiuse la sua carriera da pochissimo tempo
3 La sua carriera fu chiusa da una stella del cinema con forza
4 La sua carriera fu chiusa da una stella del cinema da pochissimo tempo
5 La sua carriera fu chiusa con forza
6 La sua carriera fu chiusa da pochissimo tempo
7 La sua carriera si chiuse con forza
8 ???

Caus - Answers
1 La sua carriera si chiuse da pochissimo tempo
2 Una stella del cinema si chiuse da pochissimo tempo
3 La sua carriera fu chiusa da una stella del cinema
4 Una stella del cinema fu chiusa dalla sua carriera
5 La sua carriera chiuse una stella del cinema
6 Una stella del cinema chiuse la sua carriera
7 La sua carriera si chiuse da una stella del cinema
8 Una stella del cinema si chiuse dalla sua carriera

Figure 8: Examples for the Caus BLMs for the Italian verb chiudere 'close', belonging to the Caus class.

Od, type I - Context
1 La turista mangia una carbonara in un secondo
2 La turista mangia una carbonara da mezz'ora
3 Una carbonara è mangiata dalla turista in un secondo
4 Una carbonara è mangiata dalla turista da mezz'ora
5 Una carbonara è mangiata in un secondo
6 Una carbonara è mangiata da mezz'ora
7 La turista mangia in un secondo
8 ???

Od, type I - Answers
1 Una carbonara mangia da mezz'ora
2 La turista mangia da mezz'ora
3 Una carbonara è mangiata dalla turista
4 La turista è mangiata da una carbonara
5 Una carbonara mangia la turista
6 La turista mangia una carbonara
7 Una carbonara mangia dalla turista
8 La turista mangia da una carbonara

Od, type II - Context
1 La zia mangia una bistecca nella sala grande
2 La presidente può mangiare una bistecca da programma
3 Una bistecca è mangiata dalla turista nella sala grande
4 Una bistecca fu mangiata dalla presidente da sola
5 La specialità della casa deve essere mangiata in un secondo
6 Una bistecca deve poter essere mangiata da sola
7 La turista deve mangiare con fame
8 ???

Od, type II - Answers
1 La specialità della casa può mangiare da sola
2 La squadra di calcio deve mangiare da mezz'ora
3 La specialità della casa deve essere mangiata dalla turista
4 La squadra di calcio può essere mangiata da una carbonara
5 La pasta col pomodoro può mangiare la squadra di calcio
6 La squadra di calcio mangia una bistecca
7 La specialità della casa deve poter mangiare dalla turista
8 La presidente mangia da una bistecca

Od, type III - Context
1 L'attore deve canticchiare un motivetto dopo il festival
2 L'amica di mia mamma deve cucire la tasca da qualche giorno
3 L'inno nazionale può essere cantato dal vincitore del festival con solo pianoforte
4 Una bistecca deve essere mangiata dalla turista da sola
5 Il manuale è insegnato nell'aula magna
6 Questi attrezzi devono essere intagliati da manuale
7 I due fratelli studiano con molta attenzione
8 ???

Od, type III - Answers
1 La pasta frolla deve impastare da sola
2 L'autrice deve poter scrivere da qualche giorno
3 I libri di testo devono poter essere studiati dai candidati
4 Questi stilisti devono poter essere tessuti dai vestiti per la parata
5 Questi motivi greci possono tessere questi stilisti
6 L'idraulico saldò i cavi del lampadario
7 La stanza pulisce da una delle proprietarie dell'albergo
8 Le sommozzatrici pescarono da delle trote

Figure 9: Examples of Od BLMs for type I, type II and type III.

B. Results

B.1.
Results with the Italian Electra pretrained model: dbmdz/electra-base-italian-xxl-cased-discriminator

train on   test on    agreement       Caus            Od
type I     type I     0.772 (0.011)   0.910 (0.002)   0.996 (0.003)
           type II    0.660 (0.016)   0.849 (0.022)   0.938 (0.007)
           type III   0.483 (0.042)   0.870 (0.027)   0.893 (0.010)
type II    type I     0.504 (0.056)   0.917 (0.012)   0.993 (0.004)
           type II    0.519 (0.027)   0.872 (0.007)   0.981 (0.007)
           type III   0.406 (0.018)   0.907 (0.004)   0.950 (0.009)
type III   type I     0.274 (0.012)   0.946 (0.003)   0.994 (0.002)
           type II    0.330 (0.004)   0.929 (0.003)   0.983 (0.003)
           type III   0.325 (0.008)   0.889 (0.014)   0.967 (0.007)

Table 1: Multi-task learning results per task, as F1 averages over three runs (standard deviation in parentheses). Training with 3000 instances – 1000 from each task.

train on   test on    agreement       Caus            Od
type I     type I     0.909 (0.007)   0.919 (0.005)   1.000 (0.000)
           type II    0.760 (0.030)   0.906 (0.017)   0.971 (0.003)
           type III   0.707 (0.028)   0.926 (0.005)   0.940 (0.010)
type II    type I     0.881 (0.013)   0.932 (0.007)   1.000 (0.000)
           type II    0.784 (0.007)   0.903 (0.010)   0.983 (0.003)
           type III   0.714 (0.005)   0.956 (0.005)   0.975 (0.009)
type III   type I     0.296 (0.011)   0.960 (0.005)   0.998 (0.002)
           type II    0.345 (0.002)   0.950 (0.007)   0.993 (0.004)
           type III   0.336 (0.005)   0.918 (0.010)   0.994 (0.004)

Table 2: Single-task learning results per task, as F1 averages over three runs (standard deviation in parentheses). Training with 2160 instances for Caus and Od for all types; for agreement, 2052 instances for type I (the maximum available) and 3000 instances for types II and III.

B.2.
Results with the multilingual Electra pretrained model: google/electra-base-discriminator

train on   test on    agreement       Caus            Od
type I     type I     0.664 (0.053)   0.543 (0.011)   0.714 (0.012)
           type II    0.733 (0.018)   0.407 (0.023)   0.561 (0.002)
           type III   0.586 (0.022)   0.483 (0.016)   0.656 (0.016)
type II    type I     0.599 (0.025)   0.610 (0.035)   0.646 (0.010)
           type II    0.660 (0.019)   0.536 (0.004)   0.601 (0.004)
           type III   0.518 (0.025)   0.601 (0.011)   0.686 (0.019)
type III   type I     0.320 (0.047)   0.551 (0.014)   0.729 (0.015)
           type II    0.401 (0.058)   0.450 (0.021)   0.661 (0.020)
           type III   0.378 (0.052)   0.413 (0.012)   0.618 (0.005)

Table 3: Multi-task learning results per task, as F1 averages over three runs (standard deviation in parentheses). Training with 3000 instances – 1000 from each task.

train on   test on    agreement       Caus            Od
type I     type I     0.875 (0.031)   0.599 (0.040)   0.749 (0.030)
           type II    0.886 (0.005)   0.425 (0.019)   0.579 (0.037)
           type III   0.815 (0.016)   0.529 (0.020)   0.660 (0.014)
type II    type I     0.841 (0.024)   0.543 (0.027)   0.651 (0.007)
           type II    0.881 (0.003)   0.486 (0.005)   0.596 (0.010)
           type III   0.814 (0.008)   0.582 (0.026)   0.685 (0.013)
type III   type I     0.826 (0.022)   0.632 (0.023)   0.761 (0.023)
           type II    0.878 (0.005)   0.557 (0.013)   0.697 (0.009)
           type III   0.874 (0.006)   0.475 (0.010)   0.592 (0.024)

Table 4: Single-task learning results per task, as F1 averages over three runs (standard deviation in parentheses). Training with 2160 instances for Caus and Od for all types; for agreement, 2052 instances for type I (the maximum available) and 3000 instances for types II and III.
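Each cell in Tables 1–4 reports an F1 average over three runs with its standard deviation. As an illustration only (this is not the authors' code, and the paper does not state whether the sample or population estimator is used), such entries can be produced from per-run scores as follows:

```python
# Sketch (assumed, not the authors' code) of formatting per-run F1 scores
# as the "mean (standard deviation)" entries shown in Tables 1-4.
from statistics import mean, stdev  # stdev = sample standard deviation

def summarize(f1_runs):
    """Format a list of per-run F1 scores as 'mean (std)'."""
    return f"{mean(f1_runs):.3f} ({stdev(f1_runs):.3f})"
```

For example, three runs scoring 0.77, 0.78 and 0.76 would be reported as 0.770 (0.010) under the sample estimator; with the population estimator (`statistics.pstdev`) the second figure would differ slightly.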
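The agreement BLM in Figure 7 (Appendix A.1) is built from a small set of templates: the number of the subject and of the attractor nouns varies systematically across the matrix, and the verb always agrees with the subject. A minimal sketch of how such an instance could be assembled is given below; the lexical items mirror Figure 7, but the helper names `sentence` and `build_instance` are invented for illustration and are not part of the authors' generation pipeline.

```python
# Hypothetical sketch of assembling the Figure 7 subject-verb agreement
# BLM instance from templates. Lexical items mirror Figure 7; the code
# itself is illustrative, not the authors' data-generation code.

ITEMS = {
    "subj": {"sg": "il vaso", "pl": "i vasi"},            # subject NP
    "att1": {"sg": "con il fiore", "pl": "con i fiori"},  # first attractor
    "att2": {"sg": "del giardino", "pl": "dei giardini"}, # second attractor
    "verb": {"sg": "si è rotto", "pl": "si sono rotti"},  # agrees with subj
}

def sentence(subj, att1, att2=None):
    """Compose one sentence; the verb agrees with the subject's number."""
    parts = [ITEMS["subj"][subj], ITEMS["att1"][att1]]
    if att2 is not None:
        parts.append(ITEMS["att2"][att2])
    parts.append(ITEMS["verb"][subj])
    return (" ".join(parts) + ".").capitalize()

def build_instance():
    """Seven context sentences plus the correct (eighth) continuation."""
    nums = ["sg", "pl"]
    # rows 1-4: one attractor; rows 5-8: two attractors
    cells = [sentence(s, a1) for a1 in nums for s in nums]
    cells += [sentence(s, a1, "sg") for a1 in nums for s in nums]
    return cells[:7], cells[7]  # context, correct answer
```

The incorrect candidates of Figure 7 (coord, WNA, WN1/WN2, AEV, AEN1/AEN2) would be derived analogously, by perturbing one of the number features or breaking the subject-verb agreement.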