=Paper= {{Paper |id=Vol-2765/163 |storemode=property |title=AcCompl-it @ EVALITA2020: Overview of the Acceptability & Complexity Evaluation Task for Italian |pdfUrl=https://ceur-ws.org/Vol-2765/paper163.pdf |volume=Vol-2765 |authors=Dominique Brunato,Cristiano Chesi,Felice Dell'Orletta,Simonetta Montemagni,Giulia Venturi,Roberto Zamparelli |dblpUrl=https://dblp.org/rec/conf/evalita/BrunatoCDMVZ20 }} ==AcCompl-it @ EVALITA2020: Overview of the Acceptability & Complexity Evaluation Task for Italian== https://ceur-ws.org/Vol-2765/paper163.pdf
        AcCompl-it @ EVALITA2020: Overview of the Acceptability &
                   Complexity Evaluation Task for Italian
              Dominique Brunato1 , Cristiano Chesi2 , Felice Dell’Orletta1 ,
             Simonetta Montemagni1 , Giulia Venturi1 , Roberto Zamparelli3
                  1 ILC-CNR, Via G. Moruzzi 1, Pisa, Italy
                 2 NETS-IUSS, P.zza Vittoria 15, Pavia, Italy
            3 CIMeC-UNITRENTO, Corso Bettini 31, Rovereto, Italy
       [name.surname]@ilc.cnr.it, cristiano.chesi@iusspavia.it,
                         roberto.zamparelli@unitn.it
                         Abstract

    The Acceptability and Complexity evaluation task for Italian (AcCompl-it) was aimed at developing and evaluating methods to classify Italian sentences according to Acceptability and Complexity. It consists of two independent tasks asking participants to predict either the acceptability or the complexity rating (or both) of a given set of sentences previously scored by native speakers on a 1-to-7 point Likert scale. In this paper, we introduce the datasets distributed to the participants, describe the different approaches of the participating systems and provide a first analysis of the obtained results.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Motivation

The availability of annotated resources and systems aimed at predicting the level of grammatical acceptability or linguistic complexity of a sentence (see, among others, (Warstadt et al., 2018; Brunato et al., 2018)) is becoming increasingly relevant for different research communities that focus on the study of language. From the Natural Language Processing (NLP) perspective, the interest has recently been prompted by automatic generation systems (e.g. Machine Translation, Text Simplification, Summarization) mostly based on Deep Neural Network algorithms (Gatt and Krahmer, 2018). In this scenario, resources and methods able to assess the quality of automatically generated sentences, or devoted to investigating the ability of artificial neural networks to score linguistic phenomena on the acceptability and complexity scales, are of pivotal importance. From the theoretical linguistics perspective, controlled datasets containing acceptability judgments and analyzed with machine learning techniques can be useful to test the extent to which syntactic and semantic deviance can be induced from corpus data alone, especially for low-frequency phenomena (Chowdhury and Zamparelli, 2018; Gulordava et al., 2018; Wilcox et al., 2018), while the same data, seen from a psycholinguistic angle, can shed light on the relation between complexity and acceptability (Chesi and Canal, 2019), and on the extent to which measures of on-line perplexity in artificial language models can track human parsing preferences (Demberg and Keller, 2008; Hale, 2001).
   The Acceptability & Complexity evaluation task for Italian (AcCompl-it) at EVALITA 2020 (Basile et al., 2020) is in line with this emerging scenario. Specifically, it is aimed at developing and evaluating methods to classify Italian sentences according to Acceptability and Complexity, which can be viewed as two simple numeric measures associated with linguistic productions. Among the outcomes of the task, we also include the creation of a set of sentences annotated with acceptability and complexity human judgments that we are going to share with the linguistic community. While datasets annotated for acceptability exist for English, see in particular the CoLA dataset (Warstadt et al., 2018), to our knowledge the present dataset is a first for Italian, and is also the first one to combine judgments of acceptability and complexity.

2   Definition of the task

We conceived AcCompl-it as a prediction task where participants were asked to estimate the average acceptability and complexity score of a set of sentences previously rated by native speakers on a 1-7 Likert scale and, if possible, to predict the actual standard error (SE) among the annotations. SE gives an estimation of the actual agreement between human annotators: the higher the SE, the lower the agreement.
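As an illustration of the target quantities, the per-sentence mean score and SE can be derived from raw Likert annotations as sketched below. This is an illustrative reconstruction, not the organizers' code: it assumes SE is the standard error of the mean and uses invented ratings.

```python
import math

def judgment_stats(ratings):
    """Summarize one sentence's Likert ratings (1-7): mean, sample SD, and SE."""
    n = len(ratings)
    mean = sum(ratings) / n
    # sample standard deviation over the n annotators
    sd = math.sqrt(sum((r - mean) ** 2 for r in ratings) / (n - 1)) if n > 1 else 0.0
    se = sd / math.sqrt(n)  # low SE = annotators agree; high SE = they disagree
    return {"mean": round(mean, 2), "sd": round(sd, 2), "se": round(se, 2)}

# Hypothetical ratings for one sentence from 10 annotators
print(judgment_stats([7, 6, 7, 5, 6, 7, 6, 7, 6, 7]))
# → {'mean': 6.4, 'sd': 0.7, 'se': 0.22}
```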
The task is articulated in three subtasks, as follows:

   • the Acceptability prediction task (ACCEPT), where participants have to estimate the acceptability score of sentences (along with their standard error); in this case, 1 corresponds to the lowest degree of acceptability, while 7 corresponds to the highest level. The assignment of a score on a gradual scale is in line with the definition of perceived acceptability that we intend to empirically inspect. According to the literature, in fact, acceptability is a concept closely related to grammaticality but with some major differences (see, among others, (Sprouse, 2007; Sorace and Keller, 2005)). While the latter is a theoretical construct corresponding to syntactic wellformedness and is typically interpreted as a binary property (i.e., a sentence is either grammatical or ungrammatical), acceptability can depend on many factors: syntactic, semantic, pragmatic, and non-linguistic;

   • the Complexity prediction task (COMPL), where participants have to estimate the complexity score of sentences (along with their standard error); in this case, 1 corresponds to the lowest possible level of complexity, while 7 indicates the highest degree. As in the Acceptability prediction task, the use of the Likert scale as a tool to collect perceived values is motivated by the assumption that sentence complexity is a gradient, rather than binary, concept;

   • the Open task, where participants are requested to model linguistic phenomena correlated with the human ratings of sentence acceptability and/or complexity in the datasets provided.

   The three subtasks were independent and participants could decide to participate in any one of them, though we encouraged participation in multiple subtasks, since the complexity metrics might be influenced by the grammatical status of an expression and vice versa. In line with this intuition, we distributed a subset of sentences annotated with both acceptability and complexity scores in order to investigate whether and to what extent there is a correlation between the two phenomena.
   In all subtasks, participants were free to use external resources, and they were evaluated against a blind test set.

3   Dataset

3.1   Composition

Acceptability dataset: it contains 1,683 Italian sentences annotated with human judgments of acceptability on a 7-point Likert scale. The number of annotations per sentence ranges from 10 to 85, with an average of 16.38. The dataset is constructed by merging the data of four psycholinguistic studies on minimal variations of controlled linguistic oppositions with different levels of grammaticality with a subset of 672 sentences generated from templates.
   The first subset (128 sentences), taken from (Chesi and Canal, 2019), focuses on person feature oppositions in object cleft dependencies where Determiner Phrases (DPs) are introduced either by determiners or by pronouns used as determiners, as in (1).

(1) {Sono | siete} {gli | voi} architetti che {gli | voi} ingegneri {hanno | avete} consultato.
    {are3Ppl | are2Ppl} {the | you} architects that {the | you} engineers {have3Ppl | have2Ppl} consulted
    ‘it is {the|you} architects that {the|you} engineers have consulted’

   The second subset (515 sentences) is taken from the studies presented in (Greco et al., 2020) involving copular constructions (e.g. canonical (2a) vs. inverse (2b)) (Moro, 1997).

(2) a. Le foto del muro sono la causa della rivolta.
       the pictures of_the wall are the cause of_the riot

    b. La causa della rivolta sono le foto del muro.
       the cause of_the riot are the pictures of_the wall

   This subset also contains declarative and interrogative (yes/no) sentences with a minimal verbal structure (contrasting preverbal vs. postverbal subject position with unergative (3a), unaccusative (3b) and transitive predicates (3c)):

(3) a. I cani hanno abbaiato. | Hanno abbaiato i cani.
       the dogs have barked | have barked the dogs
    b. Gli autobus sono partiti. | Sono partiti gli autobus.
       the buses have left | have left the buses

    c. Le bambine hanno mangiato il dolce. | Hanno mangiato le bambine il dolce.
       the girls have eaten the dessert | have eaten the girls the dessert

   The third set (320 sentences) is based on a study in which number and person subject-verb agreement and unagreement cases are tested (Mancini et al., 2018):

(4) Qualcuno ha detto che io1Psg {scrivo1Psg | *scriviamo1Ppl} una lettera.
    somebody has said that I1Psg {write1Psg | *write1Ppl} a letter

   The fourth one (48 sentences) contains experimental items from (Villata et al., 2015) involving different types of wh-island violations.

(5) {Cosa | Quale edificio}i ti chiedi {chi | quale ingegnere} abbia costruito _i?
    {what | which building}i do you wonder {who | which engineer} has built _i?

   The last set of 672 sentences was generated by creating all the possible content-word combinations from various structural templates designed to test acceptability patterns due to: (i) extra or missing gaps in wh-extractions (6a) vs. topic constructions (6b);

(6) a. {Cosa | Quale problema}i lo studente dovrebbe descriver(e) {_i | -loi | questo problema}?
       {what | which problem}i the student should describe {_i | iti | this problem}

    b. Questo problemai, lo studente dovrebbe descriver(e) {_i | -loi | questo problema}
       this problemi, the student should describe {_i | iti | this problem}

(ii) wh- and relative clauses with gaps inside VP conjunctions (in all conjuncts, i.e. “Across the Board”, in only one conjunct, or not at all; see e.g. (7));

(7) Chii ... Maria vuole chiamar(e) {_i | -lo} e il dottore medicar(e) {_i | -lo}?
    whoi ... Mary wants callinf {_i | him} and the doctor cure {_i | him}?

(iii) embedded wh-clauses and the possibility of subextraction from them (similar to (5));

(8) Quale provvedimento Maria ha saputo {che | dove | perché | quando} il ministro prenderà?
    which measure M. has heard {that | where | why | when} the minister will_take

(iv) extractions from VPs in subject vs. object position (9) (cf. (2));

(9) Carlo conosceva bene il compagnoi di classe che {incontrare _i divertiva sempre Anna | Anna voleva sempre incontrare _i}
    Carlo knew well the classmatei that {meetinf _i amused always Anna | Anna wanted always meetinf _i}

(v) NEGPOLs (nessuno, alcunché, mai ‘any, anything, ever’) that are licensed by a higher negation, by a question, or not licensed, in simple or (deeply) embedded sentences (e.g. (10)).

(10) {Maria | Nessuno} si aspetta che qualcuno possa aver {già | mai} finito questo esercizio (?)
     {M. | no-one} self expects that someone could have {already | ever} completed this exercise (?)

The use of expanded templates was designed to minimize the potential effect of collocations or specific lexical choices.
   Whenever possible, each sentence was also manually annotated according to the linguistic-theoretic expectations for “grammaticality”, on a 4-point scale: * (ungrammatical, coded as 0), ?? (very marginal, coded as 0.33), ? (marginal, coded as 0.66) and OK (grammatical, coded as 1).

   Complexity dataset: it comprises 2,530 Italian sentences annotated with human judgments of perceived complexity on a 7-point Likert scale, as for the acceptability dataset. The number of annotations per sentence ranged from 11 to 20, with an average of 16.753. The corpus was internally subdivided into two subsets representative of two different typologies of data, i.e. 1,858 naturalistic sentences extracted from corpora and 672 artificially-generated sentences drawn from the Acceptability dataset, chosen to cover the range of linguistic phenomena represented in its templates. The first subset contains sentences taken from the Universal Dependency (UD) treebanks (Nivre et al., 2016) available for Italian, representative of different text genres and domains. In this regard, the largest portion contains 1,128 sentences taken from the newswire section of the Italian Stanford Dependency Tree-
bank (ISDT) (Bosco et al.), annotated with complexity judgments by Brunato et al. (2018). Besides these, we chose to include in this corpus smaller subsets of sentences representative of a non-standard language variety and of specific constructions, i.e. wh-questions and direct speech. Non-standard sentences (323 in total) are in the form of generic tweets and tweets labelled for irony, taken from two representative treebanks, i.e. PoSTWITA and TWITTIRÒ (Sanguinetti et al., 2018; Cignarella et al., 2019). Wh-questions (164 sentences) were extracted from a dedicated section (prefixed by the string ‘quest’) included in ISDT. Direct speech sentences (243) mainly include transcripts of European parliamentary debates (taken from the ‘europarl’ section of ISDT) and extracts from literary texts (mostly contained in the UD Italian VIT (Delmonte et al., 2007)). The choice of annotating a shared portion of data with both acceptability and complexity scores was explicitly motivated by the attempt to empirically investigate whether there is a correlation between the two sentence properties, and whether complexity is judged differently in the case of ill-formed constructions.
   For the purpose of the task, both datasets were split into training and validation samples with a proportion of 80% to 20%, respectively.

3.2   Annotation with Human Judgments

For the collection of judgments of sentence acceptability and complexity by Italian native speakers, we relied on crowdsourcing techniques using different platforms. More specifically, for Acceptability, the set of sentences drawn from the psycholinguistic studies described in Section 3.1 was annotated using an on-line platform based on jsPsych scripts (De Leeuw, 2015). For the Complexity dataset, the annotation of the subcorpus of sentences taken from (Brunato et al., 2018) was performed through the CrowdFlower platform1 (more details are reported in the reference paper), while the remaining sentences in this dataset were annotated using Prolific2. To make the annotation process comparable to the one followed by (Brunato et al., 2018), the whole process was split into different tasks, each one consisting of the annotation of about 200 sentences randomly mixed across the various typologies. For all tasks, workers were asked to read each sentence and answer the following question:

      “Quanto è complessa questa frase da 1 (semplicissima) a 7 (molto difficile)?”
      ‘How difficult is this sentence from 1 (very easy) to 7 (very difficult)?’

   Beyond complexity, the 672 artificially-generated sentences were also labelled for perceived acceptability according to the following question:

      “Quanto è accettabile questa frase da 1 (completamente agrammaticale) a 7 (perfettamente grammaticale)?”
      ‘How acceptable is this sentence from 1 (completely ungrammatical) to 7 (completely grammatical)?’

   After collecting all annotations, we excluded workers who performed the assigned task in less than 10 minutes, which we set as the minimum threshold to accurately complete the survey.

   1 Now known as Figure Eight, https://appen.com/
   2 www.prolific.co

3.3   Analysis of Judgments across Corpora

Table 1 shows the average value, standard deviation and minimum and maximum score of the complexity and acceptability labels for the whole dataset. As can be noticed, complexity values are on average lower and less scattered than the acceptability ones. For this corpus, the lowest value on the Likert scale (1) – which should have been used to label sentences perceived as very easy, in line with the task question – was given only twice, specifically to the following sentences:

(11) Dimmi il nome di una città finlandese.
     tell_me the name of a town Finnish
     ‘Tell me the name of a Finnish town’

(12) Quali uve si usano per produrre vino?
     which grapes PRT they_use to make wine?

Conversely, for the acceptability corpus, the highest value on the Likert scale (i.e. 7, meaning in this case completely acceptable) was attributed to 26 sentences. For space reasons, we report here only two examples:

(13) Le sorelle sono sopravvissute.
     the sisters are survived
     ‘the sisters have survived’

(14) I lupi hanno ululato.
     the wolves have howled
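The expert grammaticality codes of Section 3.1 can be compared against the crowdsourced ratings along the lines of the following sketch. The mapping of the 4-point scale to numeric codes follows the paper; the per-sentence labels and mean scores are invented for illustration, and the sketch assumes scipy's spearmanr.

```python
from scipy.stats import spearmanr

# 4-point expert scale and its numeric coding (Section 3.1)
GRAM_CODES = {"*": 0.0, "??": 0.33, "?": 0.66, "OK": 1.0}

# Hypothetical per-sentence data: expert label and mean crowd acceptability (1-7)
expert = ["*", "??", "?", "OK", "OK", "*"]
mean_acc = [1.8, 3.1, 4.6, 6.7, 6.2, 2.4]

# Rank correlation between theoretically-driven grammaticality and mean ratings
rho, p = spearmanr([GRAM_CODES[g] for g in expert], mean_acc)
print(round(rho, 2))
```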
With respect to the ‘worst’ values, the following two sample sentences were judged, respectively, as the most complex (i.e. 6.46 on the Likert scale) and (among) the least acceptable (1.55) in each dataset:

(15) Chi è che lui ha affermato che il professore aveva detto che lo studente avrebbe dovuto considerare questo candidato?
     who is that he has claimed that the professor had said that the student hadsubj must consider this candidate?

(16) Il falegname è arrivato mentre noi montavo la mensola.
     the carpenter has arrived while we were_assembling1Psg the shelf

                 COMPL             ACCEPT
              SCORE    SE       SCORE    SE
      µ        3.12   0.332      4.45   0.36
      σ        1.04   0.08       1.7    0.14
      min      1      0          1.13   0
      max      6.46   0.63       7      0.74

Table 1: Statistics collected for the two corpora of the AcCompl-it dataset.

If we consider the internal composition of the two datasets, we can see a more articulated picture depending on their various subparts (see Tables 2 and 3). For complexity, average scores are higher for sentences created to display specific acceptability patterns, thus showing that acceptability does affect the perception of complexity. Note that the most complex sentence (reported in (15)) is contained in this set, and is ungrammatical (no gap).
   Among the treebank sentences, those extracted from journalistic texts (ISDT_news) were judged on average as the most complex, questions as the easiest ones. Twitter and direct speech sentences obtained scores in between the highest and the lowest value, and very close to each other. This is in line with stylistic and linguistic analyses showing that the language of social media inherits many features from spoken language.
   For the whole acceptability dataset, the Spearman's rank correlation coefficient between theoretically-driven grammaticality and mean acceptability labels is very strong (r(656)=.83, p<.001). While this could be somewhat expected, when we focus only on the 672 sentences annotated for both complexity and acceptability, we still observe a significant but lower correlation between expected grammaticality and mean complexity (r(656)=.34, p<.001). Still considering this subset, an additional outcome is the moderate (and negative) correlation between the two metrics (r(672)=−.49, p<.001), further suggesting that the more a sentence is perceived as complex, the less acceptable it is.

                   SCORE     SE     MIN     MAX
    ISDT_news      3.28      0.33   1.25    5.7
    Twitter        2.59      0.31   1.13    4.69
    DirectSpeech   2.68      0.31   1.14    6
    Wh-Quest       1.61      0.22   1       2.94
    ArtifSent      3.63      0.37   1.42    6.47

Table 2: Average complexity score, standard error and minimum and maximum value across the different subsets of the Complexity dataset.

4   Evaluation measures

For both the ACCEPT and COMPL tasks, the evaluation metric was based on Spearman's rank correlation coefficient between the participants' scores and the test set scores. For each task, two different ranks were produced, according to the prediction of the relative scores and of the standard errors. In each task a different baseline was defined:

   • in the ACCEPT task, it corresponds to the score assigned by an SVM linear regression using word unigrams and bigrams as features;

   • in the COMPL task, it corresponds to the score assigned by an SVM linear regression using sentence length as its sole feature.
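The two baselines and the evaluation metric can be sketched roughly as follows. This is an illustrative reconstruction with toy, invented sentences and scores, not the organizers' actual code; it assumes scikit-learn's LinearSVR and CountVectorizer and scipy's spearmanr.

```python
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

# Toy data, invented for illustration (the official splits are not reproduced here)
train_sents = ["il cane dorme", "dorme il cane", "le foto sono la causa della rivolta"]
train_scores = [6.8, 3.9, 5.1]
test_sents = ["il cane dorme ancora", "la causa della rivolta sono le foto"]
gold_scores = [6.5, 4.8]

# ACCEPT baseline: SVM linear regression over word unigrams and bigrams
vec = CountVectorizer(ngram_range=(1, 2))
accept_model = LinearSVR(random_state=0).fit(vec.fit_transform(train_sents), train_scores)
accept_pred = accept_model.predict(vec.transform(test_sents))

# COMPL baseline: SVM linear regression with sentence length as its sole feature
def length_feats(sents):
    return [[len(s.split())] for s in sents]

compl_model = LinearSVR(random_state=0).fit(length_feats(train_sents), train_scores)
compl_pred = compl_model.predict(length_feats(test_sents))

# Official metric: Spearman's rank correlation between predicted and gold scores
rho, p = spearmanr(compl_pred, gold_scores)
```

The same correlation would be computed separately for the predicted scores and for the predicted standard errors, matching the two ranks described above.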
5   Participation and results

The AcCompl-it task received three submissions for each subtask from two different participants, for a total of 6 runs. Unfortunately, neither participant took part in the Open task. Results for the other subtasks are reported in Tables 4 and 5.
   The systems from the two participants follow very different approaches: one is based on deep learning and trained on raw texts (Sarti, 2020), the other relies on (heuristic) rules applied to semantic and syntactic features automatically extracted from sentences (Delmonte, 2020). In spite of their very different nature, the two approaches also present some commonalities, such as the reliance on external resources. In particular, both make use of additional sentences taken from existing Italian treebanks, either to enrich the original training sets with additional annotated examples (Sarti's case) or to check the frequency of a given construction and use this information among the features of the proposed system (ItVenses).

                          SCORE    SE      MIN     MAX
    clefts (1)            4.27     0.39    1.36    6
    copular               5.01     0.48    2.90    6.5
     canonical (2a)       5.47     0.44    3.58    6.5
     inverse (2b)         4.56     0.51    2.90    5.9
    unerg V (3a)          5.91     0.30    3.70    7
     SV                   6.64     0.21    5.94    7
     VS                   5.18     0.40    3.70    6.2
    unacc V (3b)          6.28     0.27    4.86    7
     SV                   6.61     0.20    5.82    7
     VS                   5.96     0.33    4.86    6.72
    trans V (3c)          4.91     0.34    2       7
     SV                   6.47     0.24    5.06    7
     VS                   3.34     0.43    2       4.52
    S V agree (4)         3.81     0.31    1.25    6.93
     match                5.87     0.30    3.14    6.92
     mismatch             1.74     0.32    1.25    2.72
    wh-island (arg) (5)   3.85     0.17    1.68    5.63
    filler-gap dep.       3.56     0.51    1.5     6.69
     doubly filled (6)    3.28     0.42    1.5     6.69
     coord (7)            4.25     0.47    2.5     6
    wh-island (adj) (8)   3.02     0.41    1.38    5.6
     no extraction        6.26     0.29    5.26    7
    subj/obj (9)          3.70     0.51    2.66    5.15
    NPIs (10)             4.75     0.45    2.27    6.6
     bad fillers          1.13     0.06    1.13    1.13
     good fillers         6.76     0.07    6.76    6.76
     medium fillers       4.07     0.18    4.07    4.07

Table 3: Average acceptability score, standard error and minimum and maximum value across the different linguistic phenomena of the Acceptability dataset. Numbers in (·) refer to examples in the text.

    PARTICIPANT                   SCORE     SE
    UmBERTO-MTSA (Sarti)          0.88**    0.52**
    ItVenses-run1 (Delmonte)      0.44**    0.25**
    ItVenses-run2 (Delmonte)      0.49**    0.41**
    Baseline                      0.30**    0.35**

Table 4: ACCEPT task results. **p value<0.001; *p value<0.05

    PARTICIPANT                   SCORE     SE
    UmBERTO-MTSA (Sarti)          0.83**    0.51**
    ItVenses-run1 (Delmonte)      0.31**    0.09*
    ItVenses-run2 (Delmonte)      0.31**    0.07
    Baseline                      0.50**    0.33**

Table 5: COMPL task results. **p value<0.001; *p value<0.05

   Sarti's systems obtained the best performance on both tasks using a similar multi-task learning (MTL) approach, which consists in leveraging the predictions of a state-of-the-art neural language model for Italian (i.e. UmBERTO3), fine-tuned on the two downstream tasks, to augment the original development sets with a large set of unlabeled examples extracted from available Italian treebanks. The bigger dataset was then split into different portions to train an ensemble of classifiers. The resulting MTL model was finally used to predict the complexity/acceptability labels on the original test sets.
   Delmonte's ItVenses system parses the sentences to obtain a sequence of constituents and a set of sentence-level semantic features (presence of agreement, negation markers, speech act and factivity). These features, along with constituent triples and their frequency in the training set and in the Venice Italian Treebank, are weighed with various heuristics and used to derive a prediction. Agreement mismatches were checked using morphological analysis of verb and subject, while the argument structure is inferred using a deep parser. The two versions of the system (run1 and run2) differ only in their use of features (run2 dispenses with propositional negation and certain verb agreement features).
   As can be seen, ItVenses's performance was considerably lower than Sarti's system (lower, in fact, than the baseline based on sentence length in the COMPL prediction task). However, as better explained in the following section, on the artificial data subset, which has complex but far less diverse structures, the gap with the winning system is reduced in the COMPL task (cf. Table 7) and, even more markedly, in the ACCEPT task (Table 6).

   3 https://github.com/musixmatchresearch/umberto

6   Discussion

The extremely good performance of the winning system in both tasks is not wholly unexpected in light of the impressive results obtained by current neural network models across a variety of NLP tasks. In this regard, it is worth noticing that, in his report, the author compared the performance of the best system based on multi-task learning to the one obtained by a simpler version of the UmBERTO-based model with standard fine-tuning on the two downstream tasks, achieving al-
ready very good results (.90 and .84 for accept-          PARTICIPANT                  SCORE             SE
ability and complexity predictions on the training                  Psycholinguistics related
corpus, respectively). Similarly, and especially          UmBERTO-MTSA (Sarti)           0.90**      0.55**
for the automatic assessment of sentence accept-          ItVenses-run1 (Delmonte)       0.42**      0.24**
ability, the scores obtained by the winning system        ItVenses-run2 (Delmonte)       0.50**      0.48**
(.88) are in line with those reported in (Linzen et                       Artificial data
al., 2016), who train a classifier to detect subject-     UmBERTO-MTSA (Sarti)           0.74**      0.33**
verb agreement mismatches from the hidden states          ItVenses-run1 (Delmonte)       0.50**       0.20*
of an LSTM, achieving a .83 score. Most other             ItVenses-run2 (Delmonte)       0.46**       0.25*
systems at work on the ability of neural models to
detect acceptability or grammaticality in a broader      Table 6: ACCEPT task results on different subsets
range of cases report much lower scores, but they        of the official test set. **p value<0.001; *p value
try to read (minimal pair) judgments from metrics        <0.05
associated to the performance of systems that have
not been expressly trained on giving judgments,
                                                         matical ones (r=.76, p<.001). Similarly, but less
reasoning that ‘judgment giving’ is not a task hu-
                                                         robustly, the same numerical asymmetry is ob-
mans have a life-long training for, but which is
                                                         served in both Delmonte’s runs: for grammatical
nonetheless feasible.
                                                         predictions, RUN1 r=.33, RUN2 r=.35; for un-
   To have a better understanding of the potential       grammatical ones RUN1 r=.32, RUN2 r=.34, all
impact of different types of data on the predic-         correlations being equally significant (p<.001).
tive capabilities of the two systems, we further in-
spected the final results by testing each system on       PARTICIPANT                  SCORE             SE
sentences representative of diverse linguistic phe-                    Treebank sentences
nomena and textual genres. To this end, we split          UmBERTO-MTSA (Sarti)           0.86**      0.61**
the whole test set into the distinct subsets defined      ItVenses-run1 (Delmonte)       0.25**       0.13*
in the corpus collection process (cfr. Section 3.1)       ItVenses-run2 (Delmonte)       0.24**       0.10*
and we assessed the correlation score between pre-                        Artificial data
dicted and real labels for each type: note that, for      UmBERTO-MTSA (Sarti)           0.70**         0.06
the ACCEPT predictions, this analysis was per-            ItVenses-run1 (Delmonte)       0.44**        -0.07
formed considering only two ‘macro–classes’, i.e.         ItVenses-run2 (Delmonte)       0.51**        -0.11
artificial vs psycholinguistics-related data, in or-
der to have a significant number of examples in          Table 7: COMPL task results on different subsets
the test set. Similarly, for COMPL, we distin-           of the official test set. **p value<0.001; *p value
guished the artificially–generated sentences from        <0.05
sentences drawn from all treebanks. Results of this
fine-grained analysis are shown in Tables 6 and 7.
   Interestingly, although the gap between the two       References
systems is still evident, we observed that artificial
data have an opposite effect on their performance.       Valerio Basile, Danilo Croce, Maria Di Maro, and Lu-
                                                           cia C. Passaro. 2020. Evalita 2020: Overview
In particular, as anticipated in the previous section,     of the 7th evaluation campaign of natural language
ItVenses is more accurate in predicting both the           processing and speech tools for italian. In Valerio
complexity and, especially, the acceptability level        Basile, Danilo Croce, Maria Di Maro, and Lucia C.
of this group of sentences. The opposite holds for         Passaro, editors, Proceedings of Seventh Evalua-
                                                           tion Campaign of Natural Language Processing and
Sarti’s system, which although still very good in          Speech Tools for Italian. Final Workshop (EVALITA
both tasks, achieves lower correlation scores when         2020), Online. CEUR.org.
tested against artificial data.
                                                         C. Bosco, S. Montemagni, and M. Simi. Converting
   Running an exploratory analysis based on ex-            Italian Treebanks: Towards an Italian Stanford De-
pected grammaticality, we observed that Sarti’s            pendency Treebank. In Proceedings of the ACL Lin-
system performs much better in predicting the ac-          guistic Annotation Workshop & Interoperability with
                                                           Discourse.
ceptability score on expected grammatical sen-
tences (r=.80, p<.001) than on expected ungram-          Dominique Brunato, Lorenzo De Mattei, Felice
  Dell'Orletta, Benedetta Iavarone, and Giulia Venturi. 2018. Is this sentence difficult? Do you agree? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2690–2699.

Cristiano Chesi and Paolo Canal. 2019. Person features and lexical restrictions in Italian clefts. Frontiers in Psychology, 10:2105.

Shammur Absar Chowdhury and Roberto Zamparelli. 2018. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 133–144.

Alessandra Teresa Cignarella, Cristina Bosco, and Paolo Rosso. 2019. Presenting TWITTIRÒ-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).

Joshua R. De Leeuw. 2015. jsPsych: A JavaScript library for creating behavioral experiments in a web browser. Behavior Research Methods, 47(1):1–12.

Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli. 2007. VIT - Venice Italian Treebank: Syntactic and quantitative features. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories.

Rodolfo Delmonte. 2020. Venses@AcCompl-it: Computing complexity vs acceptability with a constituent trigram model and semantics. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.

Matteo Greco, Paolo Lorusso, Cristiano Chesi, and Andrea Moro. 2020. Asymmetries in nominal copular sentences: Psycholinguistic evidence in favor of the raising analysis. Lingua, 245:102926.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. arXiv preprint arXiv:1803.11138.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics.

T. Linzen, E. Dupoux, and Y. Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Simona Mancini, Paolo Canal, and Cristiano Chesi. 2018. The acceptability of person and number agreement/disagreement in Italian: An experimental study. Lingbuzz preprint: https://ling.auf.net/lingbuzz/005514.

Andrea Moro. 1997. The raising of predicates: Predicative noun phrases and the theory of clause structure, volume 80. Cambridge University Press.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and Fabio Tamburini. 2018. PoSTWITA-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Gabriele Sarti. 2020. UmBERTo-MTSA @ AcCompl-it: Improving complexity and acceptability prediction with multi-task learning on self-supervised annotations. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Antonella Sorace and Frank Keller. 2005. Gradience in linguistic data. Lingua, 115:1497–1524.

Jon Sprouse. 2007. Continuous acceptability, categorical grammaticality, and experimental syntax. Biolinguistics, 1:123–134.

Sandra Villata, Paolo Canal, Julie Franck, Andrea Carlo Moro, and Cristiano Chesi. 2015. Intervention effects in wh-islands: An eye-tracking study. In Architectures and Mechanisms for Language Processing (AMLaP 2015), pages 195–195.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042.