=Paper= {{Paper |id=Vol-2765/163 |storemode=property |title=AcCompl-it @ EVALITA2020: Overview of the Acceptability & Complexity Evaluation Task for Italian |pdfUrl=https://ceur-ws.org/Vol-2765/paper163.pdf |volume=Vol-2765 |authors=Dominique Brunato,Cristiano Chesi,Felice Dell'Orletta,Simonetta Montemagni,Giulia Venturi,Roberto Zamparelli |dblpUrl=https://dblp.org/rec/conf/evalita/BrunatoCDMVZ20 }} ==AcCompl-it @ EVALITA2020: Overview of the Acceptability & Complexity Evaluation Task for Italian== https://ceur-ws.org/Vol-2765/paper163.pdf
        AcCompl-it @ EVALITA2020: Overview of the Acceptability &
                   Complexity Evaluation Task for Italian
              Dominique Brunato1 , Cristiano Chesi2 , Felice Dell’Orletta1 ,
             Simonetta Montemagni1 , Giulia Venturi1 , Roberto Zamparelli3
                  1 ILC-CNR, Via G. Moruzzi 1, Pisa, Italy
                 2 NETS-IUSS, P.zza Vittoria 15, Pavia, Italy
            3 CIMeC-UNITRENTO, Corso Bettini 31, Rovereto, Italy
       [name.surname]@ilc.cnr.it, cristiano.chesi@iusspavia.it,
                         roberto.zamparelli@unitn.it
                         Abstract

    The Acceptability and Complexity evaluation task for Italian (AcCompl-it) was aimed at developing and evaluating methods to classify Italian sentences according to Acceptability and Complexity. It consists of two independent tasks asking participants to predict either the acceptability or the complexity rating (or both) of a given set of sentences previously scored by native speakers on a 1-to-7 point Likert scale. In this paper, we introduce the datasets distributed to the participants, describe the different approaches of the participating systems and provide a first analysis of the obtained results.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Motivation

The availability of annotated resources and systems aimed at predicting the level of grammatical acceptability or linguistic complexity of a sentence (see, among others, (Warstadt et al., 2018; Brunato et al., 2018)) is becoming increasingly relevant for different research communities that focus on the study of language. From the Natural Language Processing (NLP) perspective, the interest has recently been prompted by automatic generation systems (e.g. Machine Translation, Text Simplification, Summarization) mostly based on Deep Neural Network algorithms (Gatt and Krahmer, 2018). In this scenario, resources and methods able to assess the quality of automatically generated sentences, or devoted to investigating the ability of artificial neural networks to score linguistic phenomena on the acceptability and complexity scales, are of pivotal importance. From the theoretical linguistics perspective, controlled datasets containing acceptability judgments and analyzed with machine learning techniques can be useful to test the extent to which syntactic and semantic deviance can be induced from corpus data alone, especially for low-frequency phenomena (Chowdhury and Zamparelli, 2018; Gulordava et al., 2018; Wilcox et al., 2018), while the same data, seen from a psycholinguistic angle, can shed light on the relation between complexity and acceptability (Chesi and Canal, 2019), and on the extent to which measures of on-line perplexity in artificial language models can track human parsing preferences (Demberg and Keller, 2008; Hale, 2001).
   The Acceptability & Complexity evaluation task for Italian (AcCompl-it) at EVALITA 2020 (Basile et al., 2020) is in line with this emerging scenario. Specifically, it is aimed at developing and evaluating methods to classify Italian sentences according to Acceptability and Complexity, which can be viewed as two simple numeric measures associated with linguistic productions. Among the outcomes of the task, we also include the creation of a set of sentences annotated with acceptability and complexity human judgments that we are going to share with the linguistic community. While datasets annotated for acceptability exist for English, see in particular the CoLA dataset (Warstadt et al., 2018), to our knowledge the present dataset is a first for Italian, and is also the first one to combine judgments of acceptability and complexity.

2   Definition of the task

We conceived AcCompl-it as a prediction task where participants were asked to estimate the average acceptability and complexity score of a set of sentences previously rated by native speakers on a 1-7 Likert scale and, if possible, to predict the actual standard error (SE) among the annotations. SE gives an estimation of the actual agreement between human annotators: the higher the SE, the lower the agreement.
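As an illustration of the target quantities, the per-sentence mean score and SE can be derived from raw Likert annotations as sketched below. This is an illustrative reconstruction, not the organizers' code: it assumes SE is the standard error of the mean and uses invented ratings.

```python
import math

def judgment_stats(ratings):
    """Summarize one sentence's Likert ratings (1-7): mean, sample SD, and SE."""
    n = len(ratings)
    mean = sum(ratings) / n
    # sample standard deviation over the n annotators
    sd = math.sqrt(sum((r - mean) ** 2 for r in ratings) / (n - 1)) if n > 1 else 0.0
    se = sd / math.sqrt(n)  # low SE = annotators agree; high SE = they disagree
    return {"mean": round(mean, 2), "sd": round(sd, 2), "se": round(se, 2)}

# Hypothetical ratings for one sentence from 10 annotators
print(judgment_stats([7, 6, 7, 5, 6, 7, 6, 7, 6, 7]))
# → {'mean': 6.4, 'sd': 0.7, 'se': 0.22}
```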
The task is articulated in three subtasks, as follows:

   • the Acceptability prediction task (ACCEPT), where participants have to estimate the acceptability score of sentences (along with their standard error); in this case, 1 corresponds to the lowest degree of acceptability, while 7 corresponds to the highest level. The assignment of a score on a gradual scale is in line with the definition of perceived acceptability that we intend to empirically inspect. According to the literature, in fact, acceptability is a concept closely related to grammaticality but with some major differences (see, among others, (Sprouse, 2007; Sorace and Keller, 2005)). While the latter is a theoretical construct corresponding to syntactic wellformedness and is typically interpreted as a binary property (i.e., a sentence is either grammatical or ungrammatical), acceptability can depend on many factors: syntactic, semantic, pragmatic, and non-linguistic;

   • the Complexity prediction task (COMPL), where participants have to estimate the complexity score of sentences (along with their standard error); in this case, 1 corresponds to the lowest possible level of complexity, while 7 indicates the highest degree. As in the Acceptability prediction task, the use of the Likert scale as a tool to collect perceived values is motivated by the assumption that sentence complexity is a gradient, rather than binary, concept;

   • the Open task, where participants are requested to model linguistic phenomena correlated with the human ratings of sentence acceptability and/or complexity in the datasets provided.

   The three subtasks were independent and participants could decide to participate in any one of them, though we encouraged participation in multiple subtasks, since the complexity metrics might be influenced by the grammatical status of an expression and vice versa. In line with this intuition, we distributed a subset of sentences annotated with both acceptability and complexity scores in order to investigate whether and to what extent there is a correlation between the two phenomena.
   In all subtasks, participants were free to use external resources, and they were evaluated against a blind test set.

3   Dataset

3.1   Composition

Acceptability dataset: it contains 1,683 Italian sentences annotated with human judgments of acceptability on a 7-point Likert scale. The number of annotations per sentence ranges from 10 to 85, with an average of 16.38. The dataset is constructed by merging the data of four psycholinguistic studies on minimal variations of controlled linguistic oppositions with different levels of grammaticality with a subset of 672 sentences generated from templates.
   The first subset (128 sentences), taken from (Chesi and Canal, 2019), focuses on person feature oppositions in object cleft dependencies where Determiner Phrases (DPs) are introduced either by determiners or by pronouns used as determiners, as in (1).

(1) {Sono | siete} {gli | voi} architetti che {gli | voi} ingegneri {hanno | avete} consultato.
    {are3Ppl | are2Ppl} {the | you} architects that {the | you} engineers {have3Ppl | have2Ppl} consulted
    ‘it is {the|you} architects that {the|you} engineers have consulted’

   The second subset (515 sentences) is taken from the studies presented in (Greco et al., 2020) involving copular constructions (e.g. canonical (2a) vs. inverse (2b)) (Moro, 1997).

(2) a. Le foto del muro sono la causa della rivolta.
       the pictures of_the wall are the cause of_the riot

    b. La causa della rivolta sono le foto del muro.
       the cause of_the riot are the pictures of_the wall

   This subset also contains declarative and interrogative (yes/no) sentences with a minimal verbal structure (contrasting preverbal vs. postverbal subject position with unergative (3a), unaccusative (3b) and transitive predicates (3c)):

(3) a. I cani hanno abbaiato. | Hanno abbaiato i cani.
       the dogs have barked | have barked the dogs
    b. Gli autobus sono partiti. | Sono partiti gli autobus.
       the buses have left | have left the buses

    c. Le bambine hanno mangiato il dolce. | Hanno mangiato le bambine il dolce.
       the girls have eaten the dessert | have eaten the girls the dessert

   The third set (320 sentences) is based on a study in which number and person subject-verb agreement and unagreement cases are tested (Mancini et al., 2018):

(4) Qualcuno ha detto che io1Psg {scrivo1Psg | *scriviamo1Ppl} una lettera.
    somebody has said that I1Psg {write1Psg | *write1Ppl} a letter

   The fourth one (48 sentences) contains experimental items from (Villata et al., 2015) involving different types of wh-island violations.

(5) {Cosa | Quale edificio}i ti chiedi {chi | quale ingegnere} abbia costruito _i?
    {what | which building}i do you wonder {who | which engineer} has built _i?

   The last set of 672 sentences was generated by creating all the possible content-word combinations from various structural templates designed to test acceptability patterns due to: (i) extra or missing gaps in wh-extractions (6a) vs. topic constructions (6b);

(6) a. {Cosa | Quale problema}i lo studente dovrebbe descriver(e) {_i | -loi | questo problema}?
       {what | which problem}i the student should describe {_i | iti | this problem}

    b. Questo problemai, lo studente dovrebbe descriver(e) {_i | -loi | questo problema}
       this problemi, the student should describe {_i | iti | this problem}

(ii) wh- and relative clauses with gaps inside VP conjunctions (in all conjuncts, i.e. “Across the Board”, in only one conjunct, or not at all; see e.g. (7));

(7) Chii ... Maria vuole chiamar(e) {_i | -lo} e il dottore medicar(e) {_i | -lo}?
    whoi ... Mary wants callinf {_i | him} and the doctor cure {_i | him}?

(iii) embedded wh-clauses and the possibility of subextraction from them (similar to (5));

(8) Quale provvedimento Maria ha saputo {che | dove | perché | quando} il ministro prenderà?
    which measure M. has heard {that | where | why | when} the minister will_take

(iv) extractions from VPs in subject vs. object position (9) (cf. (2));

(9) Carlo conosceva bene il compagnoi di classe che {incontrare _i divertiva sempre Anna | Anna voleva sempre incontrare _i}
    Carlo knew well the classmatei that {meetinf _i amused always Anna | Anna wanted always meetinf _i}

(v) NEGPOLs (nessuno, alcunché, mai ‘any, anything, ever’) that are licensed by a higher negation, by a question, or not licensed, in simple or (deeply) embedded sentences (e.g. (10)).

(10) {Maria | Nessuno} si aspetta che qualcuno possa aver {già | mai} finito questo esercizio (?)
     {M. | no-one} self expects that someone could have {already | ever} completed this exercise (?)

The use of expanded templates was designed to minimize the potential effect of collocations or specific lexical choices.
   Whenever possible, each sentence was also manually annotated according to the linguistic-theoretic expectations for “grammaticality”, on a 4-point scale: * (ungrammatical, coded as 0), ?? (very marginal, coded as 0.33), ? (marginal, coded as 0.66) and OK (grammatical, coded as 1).

   Complexity dataset: it comprises 2,530 Italian sentences annotated with human judgments of perceived complexity on a 7-point Likert scale, as for the acceptability dataset. The number of annotations per sentence ranged from 11 to 20, with an average of 16.753. The corpus was internally subdivided into two subsets representative of two different typologies of data, i.e. 1,858 naturalistic sentences extracted from corpora and 672 artificially-generated sentences drawn from the Acceptability dataset, chosen to cover the range of linguistic phenomena represented in its templates. The first subset contains sentences taken from the Universal Dependency (UD) treebanks (Nivre et al., 2016) available for Italian, representative of different text genres and domains. In this regard, the largest portion contains 1,128 sentences taken from the newswire section of the Italian Stanford Dependency Tree-
bank (ISDT) (Bosco et al.), annotated with complexity judgments by Brunato et al. (2018). Besides these, we chose to include in this corpus smaller subsets of sentences representative of a non-standard language variety and of specific constructions, i.e. wh-questions and direct speech. Non-standard sentences (323 in total) are in the form of generic tweets and tweets labelled for irony, taken from two representative treebanks, i.e. PoSTWITA and TWITTIRÒ (Sanguinetti et al., 2018; Cignarella et al., 2019). Wh-questions (164 sentences) were extracted from a dedicated section (prefixed by the string ‘quest’) included in ISDT. Direct speech sentences (243) mainly include transcripts of European parliamentary debates (taken from the ‘europarl’ section of ISDT) and extracts from literary texts (mostly contained in the UD Italian VIT (Delmonte et al., 2007)). The choice of annotating a shared portion of data with both acceptability and complexity scores was explicitly motivated by the attempt to empirically investigate whether there is a correlation between the two sentence properties, and whether complexity is judged differently in the case of ill-formed constructions.
   For the purpose of the task, both datasets were split into training and validation samples with a proportion of 80% to 20%, respectively.

3.2   Annotation with Human Judgments

For the collection of judgments of sentence acceptability and complexity by Italian native speakers, we relied on crowdsourcing techniques using different platforms. More specifically, for Acceptability, the set of sentences drawn from the psycholinguistic studies described in Section 3.1 was annotated using an on-line platform based on jsPsych scripts (De Leeuw, 2015). For the Complexity dataset, the annotation of the subcorpus of sentences taken from (Brunato et al., 2018) was performed through the CrowdFlower platform1 (more details are reported in the reference paper), while the remaining sentences in this dataset were annotated using Prolific2. To make the annotation process comparable to the one followed by (Brunato et al., 2018), the whole process was split into different tasks, each one consisting of the annotation of about 200 sentences randomly mixed across the various typologies. For all tasks, workers were asked to read each sentence and answer the following question:

      “Quanto è complessa questa frase da 1 (semplicissima) a 7 (molto difficile)?”
      ‘How difficult is this sentence from 1 (very easy) to 7 (very difficult)?’

   Beyond complexity, the 672 artificially-generated sentences were also labelled for perceived acceptability according to the following question:

      “Quanto è accettabile questa frase da 1 (completamente agrammaticale) a 7 (perfettamente grammaticale)?”
      ‘How acceptable is this sentence from 1 (completely ungrammatical) to 7 (completely grammatical)?’

   After collecting all annotations, we excluded workers who performed the assigned task in less than 10 minutes, which we set as the minimum threshold to accurately complete the survey.

   1 Now known as Figure Eight, https://appen.com/
   2 www.prolific.co

3.3   Analysis of Judgments across Corpora

Table 1 shows the average value, standard deviation and minimum and maximum score of the complexity and acceptability labels for the whole dataset. As can be noticed, complexity values are on average lower and less scattered than the acceptability ones. For this corpus, the lowest value on the Likert scale (1) – which should have been used to label sentences perceived as very easy, in line with the task question – was given only twice, specifically to the following sentences:

(11) Dimmi il nome di una città finlandese.
     tell_me the name of a town Finnish
     ‘Tell me the name of a Finnish town’

(12) Quali uve si usano per produrre vino?
     which grapes PRT they_use to make wine?

Conversely, for the acceptability corpus, the highest value on the Likert scale (i.e. 7, meaning in this case completely acceptable) was attributed to 26 sentences. For space reasons, we report here only two examples:

(13) Le sorelle sono sopravvissute.
     the sisters are survived
     ‘the sisters have survived’

(14) I lupi hanno ululato.
     the wolves have howled
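The expert grammaticality codes of Section 3.1 can be compared against the crowdsourced ratings along the lines of the following sketch. The mapping of the 4-point scale to numeric codes follows the paper; the per-sentence labels and mean scores are invented for illustration, and the sketch assumes scipy's spearmanr.

```python
from scipy.stats import spearmanr

# 4-point expert scale and its numeric coding (Section 3.1)
GRAM_CODES = {"*": 0.0, "??": 0.33, "?": 0.66, "OK": 1.0}

# Hypothetical per-sentence data: expert label and mean crowd acceptability (1-7)
expert = ["*", "??", "?", "OK", "OK", "*"]
mean_acc = [1.8, 3.1, 4.6, 6.7, 6.2, 2.4]

# Rank correlation between theoretically-driven grammaticality and mean ratings
rho, p = spearmanr([GRAM_CODES[g] for g in expert], mean_acc)
print(round(rho, 2))
```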
With respect to the ‘worst’ values, the following two sample sentences were judged, respectively, as the most complex (i.e. 6.46 on the Likert scale) and (among) the least acceptable (1.55) in each dataset:

(15) Chi è che lui ha affermato che il professore aveva detto che lo studente avrebbe dovuto considerare questo candidato?
     who is that he has claimed that the professor had said that the student hadsubj must consider this candidate?

(16) Il falegname è arrivato mentre noi montavo la mensola.
     the carpenter has arrived while we were_assembling1Psg the shelf

                 COMPL             ACCEPT
              SCORE    SE       SCORE    SE
      µ        3.12   0.332      4.45   0.36
      σ        1.04   0.08       1.7    0.14
      min      1      0          1.13   0
      max      6.46   0.63       7      0.74

Table 1: Statistics collected for the two corpora of the AcCompl-it dataset.

If we consider the internal composition of the two datasets, we can see a more articulated picture depending on their various subparts (see Tables 2 and 3). For complexity, average scores are higher for sentences created to display specific acceptability patterns, thus showing that acceptability does affect the perception of complexity. Note that the most complex sentence (reported in (15)) is contained in this set, and is ungrammatical (no gap).
   Among the treebank sentences, those extracted from journalistic texts (ISDT_news) were judged on average as the most complex, questions as the easiest ones. Twitter and direct speech sentences obtained scores in between the highest and the lowest value, and very close to each other. This is in line with stylistic and linguistic analyses showing that the language of social media inherits many features from spoken language.
   For the whole acceptability dataset, the Spearman's rank correlation coefficient between theoretically-driven grammaticality and mean acceptability labels is very strong (r(656)=.83, p<.001). While this could be somewhat expected, when we focus only on the 672 sentences annotated for both complexity and acceptability, we still observe a significant but lower correlation between expected grammaticality and mean complexity (r(656)=.34, p<.001). Still considering this subset, an additional outcome is the moderate (and negative) correlation between the two metrics (r(672)=−.49, p<.001), further suggesting that the more a sentence is perceived as complex, the less acceptable it is.

                   SCORE     SE     MIN     MAX
    ISDT_news      3.28      0.33   1.25    5.7
    Twitter        2.59      0.31   1.13    4.69
    DirectSpeech   2.68      0.31   1.14    6
    Wh-Quest       1.61      0.22   1       2.94
    ArtifSent      3.63      0.37   1.42    6.47

Table 2: Average complexity score, standard error and minimum and maximum value across the different subsets of the Complexity dataset.

4   Evaluation measures

For both the ACCEPT and COMPL tasks, the evaluation metric was based on Spearman's rank correlation coefficient between the participants' scores and the test set scores. For each task, two different ranks were produced, according to the prediction of the relative scores and of the standard errors. In each task a different baseline was defined:

   • in the ACCEPT task, it corresponds to the score assigned by an SVM linear regression using word unigrams and bigrams as features;

   • in the COMPL task, it corresponds to the score assigned by an SVM linear regression using sentence length as its sole feature.
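The two baselines and the evaluation metric can be sketched roughly as follows. This is an illustrative reconstruction with toy, invented sentences and scores, not the organizers' actual code; it assumes scikit-learn's LinearSVR and CountVectorizer and scipy's spearmanr.

```python
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

# Toy data, invented for illustration (the official splits are not reproduced here)
train_sents = ["il cane dorme", "dorme il cane", "le foto sono la causa della rivolta"]
train_scores = [6.8, 3.9, 5.1]
test_sents = ["il cane dorme ancora", "la causa della rivolta sono le foto"]
gold_scores = [6.5, 4.8]

# ACCEPT baseline: SVM linear regression over word unigrams and bigrams
vec = CountVectorizer(ngram_range=(1, 2))
accept_model = LinearSVR(random_state=0).fit(vec.fit_transform(train_sents), train_scores)
accept_pred = accept_model.predict(vec.transform(test_sents))

# COMPL baseline: SVM linear regression with sentence length as its sole feature
def length_feats(sents):
    return [[len(s.split())] for s in sents]

compl_model = LinearSVR(random_state=0).fit(length_feats(train_sents), train_scores)
compl_pred = compl_model.predict(length_feats(test_sents))

# Official metric: Spearman's rank correlation between predicted and gold scores
rho, p = spearmanr(compl_pred, gold_scores)
```

The same correlation would be computed separately for the predicted scores and for the predicted standard errors, matching the two ranks described above.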
5   Participation and results

The AcCompl-it task received three submissions for each subtask from two different participants, for a total of 6 runs. Unfortunately, neither participant took part in the Open task. Results for the other subtasks are reported in Tables 4 and 5.
   The systems from the two participants follow very different approaches: one is based on deep learning and trained on raw texts (Sarti, 2020), the other relies on (heuristic) rules applied to semantic and syntactic features automatically extracted from sentences (Delmonte, 2020). In spite of their very different nature, the two approaches also present some commonalities, such as the reliance on external resources. In particular, both make use of additional sentences taken from existing Italian treebanks, either to enrich the original training sets with additional annotated examples (Sarti's case) or to check the frequency of a given construction and use this information among the features of the proposed system (ItVenses).

                          SCORE    SE      MIN     MAX
    clefts (1)            4.27     0.39    1.36    6
    copular               5.01     0.48    2.90    6.5
     canonical (2a)       5.47     0.44    3.58    6.5
     inverse (2b)         4.56     0.51    2.90    5.9
    unerg V (3a)          5.91     0.30    3.70    7
     SV                   6.64     0.21    5.94    7
     VS                   5.18     0.40    3.70    6.2
    unacc V (3b)          6.28     0.27    4.86    7
     SV                   6.61     0.20    5.82    7
     VS                   5.96     0.33    4.86    6.72
    trans V (3c)          4.91     0.34    2       7
     SV                   6.47     0.24    5.06    7
     VS                   3.34     0.43    2       4.52
    S V agree (4)         3.81     0.31    1.25    6.93
     match                5.87     0.30    3.14    6.92
     mismatch             1.74     0.32    1.25    2.72
    wh-island (arg) (5)   3.85     0.17    1.68    5.63
    filler-gap dep.       3.56     0.51    1.5     6.69
     doubly filled (6)    3.28     0.42    1.5     6.69
     coord (7)            4.25     0.47    2.5     6
    wh-island (adj) (8)   3.02     0.41    1.38    5.6
     no extraction        6.26     0.29    5.26    7
    subj/obj (9)          3.70     0.51    2.66    5.15
    NPIs (10)             4.75     0.45    2.27    6.6
     bad fillers          1.13     0.06    1.13    1.13
     good fillers         6.76     0.07    6.76    6.76
     medium fillers       4.07     0.18    4.07    4.07

Table 3: Average acceptability score, standard error and minimum and maximum value across the different linguistic phenomena of the Acceptability dataset. Numbers in (·) refer to examples in the text.

    PARTICIPANT                   SCORE     SE
    UmBERTO-MTSA (Sarti)          0.88**    0.52**
    ItVenses-run1 (Delmonte)      0.44**    0.25**
    ItVenses-run2 (Delmonte)      0.49**    0.41**
    Baseline                      0.30**    0.35**

Table 4: ACCEPT task results. **p value<0.001; *p value<0.05

    PARTICIPANT                   SCORE     SE
    UmBERTO-MTSA (Sarti)          0.83**    0.51**
    ItVenses-run1 (Delmonte)      0.31**    0.09*
    ItVenses-run2 (Delmonte)      0.31**    0.07
    Baseline                      0.50**    0.33**

Table 5: COMPL task results. **p value<0.001; *p value<0.05

   Sarti's systems obtained the best performance on both tasks using a similar multi-task learning (MTL) approach, which consists in leveraging the predictions of a state-of-the-art neural language model for Italian (i.e. UmBERTO3), fine-tuned on the two downstream tasks, to augment the original development sets with a large set of unlabeled examples extracted from available Italian treebanks. The bigger dataset was then split into different portions to train an ensemble of classifiers. The resulting MTL model was finally used to predict the complexity/acceptability labels on the original test sets.
   Delmonte's ItVenses system parses the sentences to obtain a sequence of constituents and a set of sentence-level semantic features (presence of agreement, negation markers, speech act and factivity). These features, along with constituent triples and their frequency in the training set and in the Venice Italian Treebank, are weighed with various heuristics and used to derive a prediction. Agreement mismatches were checked using morphological analysis of verb and subject, while the argument structure is inferred using a deep parser. The two versions of the system (run1 and run2) differ only in their use of features (run2 dispenses with propositional negation and certain verb agreement features).
   As can be seen, ItVenses's performance was considerably lower than Sarti's system (lower, in fact, than the baseline based on sentence length in the COMPL prediction task). However, as better explained in the following section, on the artificial data subset, which has complex but far less diverse structures, the gap with the winning system is reduced in the COMPL task (cf. Table 7) and, even more markedly, in the ACCEPT task (Table 6).

   3 https://github.com/musixmatchresearch/umberto

6   Discussion

The extremely good performance of the winning system in both tasks is not wholly unexpected in light of the impressive results obtained by current neural network models across a variety of NLP tasks. In this regard, it is worth noticing that, in his report, the author compared the performance of the best system based on multi-task learning to the one obtained by a simpler version of the UmBERTO-based model with standard fine-tuning on the two downstream tasks, achieving al-
ready very good results (.90 and .84 for accept-          PARTICIPANT                  SCORE             SE
ability and complexity predictions on the training                  Psycholinguistics related
corpus, respectively). Similarly, and especially          UmBERTO-MTSA (Sarti)           0.90**      0.55**
for the automatic assessment of sentence accept-          ItVenses-run1 (Delmonte)       0.42**      0.24**
ability, the scores obtained by the winning system        ItVenses-run2 (Delmonte)       0.50**      0.48**
(.88) are in line with those reported in (Linzen et                       Artificial data
al., 2016), who train a classifier to detect subject-     UmBERTO-MTSA (Sarti)           0.74**      0.33**
verb agreement mismatches from the hidden states          ItVenses-run1 (Delmonte)       0.50**       0.20*
of an LSTM, achieving a .83 score. Most other             ItVenses-run2 (Delmonte)       0.46**       0.25*
systems at work on the ability of neural models to
detect acceptability or grammaticality in a broader      Table 6: ACCEPT task results on different subsets
range of cases report much lower scores, but they        of the official test set. **p value<0.001; *p value
try to read (minimal pair) judgments from metrics        <0.05
associated to the performance of systems that have
not been expressly trained on giving judgments,
                                                         matical ones (r=.76, p<.001). Similarly, but less
reasoning that ‘judgment giving’ is not a task hu-
                                                         robustly, the same numerical asymmetry is ob-
mans have a life-long training for, but which is
                                                         served in both Delmonte’s runs: for grammatical
nonetheless feasible.
                                                         predictions, RUN1 r=.33, RUN2 r=.35; for un-
   To have a better understanding of the potential       grammatical ones RUN1 r=.32, RUN2 r=.34, all
impact of different types of data on the predic-         correlations being equally significant (p<.001).
tive capabilities of the two systems, we further in-
spected the final results by testing each system on       PARTICIPANT                  SCORE             SE
sentences representative of diverse linguistic phe-                    Treebank sentences
nomena and textual genres. To this end, we split          UmBERTO-MTSA (Sarti)           0.86**      0.61**
the whole test set into the distinct subsets defined      ItVenses-run1 (Delmonte)       0.25**       0.13*
in the corpus collection process (cfr. Section 3.1)       ItVenses-run2 (Delmonte)       0.24**       0.10*
and we assessed the correlation score between pre-                        Artificial data
dicted and real labels for each type: note that, for      UmBERTO-MTSA (Sarti)           0.70**         0.06
the ACCEPT predictions, this analysis was per-            ItVenses-run1 (Delmonte)       0.44**        -0.07
formed considering only two ‘macro–classes’, i.e.         ItVenses-run2 (Delmonte)       0.51**        -0.11
artificial vs psycholinguistics-related data, in or-
der to have a significant number of examples in          Table 7: COMPL task results on different subsets
the test set. Similarly, for COMPL, we distin-           of the official test set. **p value<0.001; *p value
guished the artificially–generated sentences from        <0.05
sentences drawn from all treebanks. Results of this
fine-grained analysis are shown in Tables 6 and 7.
   Interestingly, although the gap between the two       References
systems is still evident, we observed that artificial
data have an opposite effect on their performance.       Valerio Basile, Danilo Croce, Maria Di Maro, and Lu-
                                                           cia C. Passaro. 2020. Evalita 2020: Overview
In particular, as anticipated in the previous section,     of the 7th evaluation campaign of natural language
ItVenses is more accurate in predicting both the           processing and speech tools for italian. In Valerio
complexity and, especially, the acceptability level        Basile, Danilo Croce, Maria Di Maro, and Lucia C.
of this group of sentences. The opposite holds for         Passaro, editors, Proceedings of Seventh Evalua-
                                                           tion Campaign of Natural Language Processing and
Sarti’s system, which although still very good in          Speech Tools for Italian. Final Workshop (EVALITA
both tasks, achieves lower correlation scores when         2020), Online. CEUR.org.
tested against artificial data.
                                                         C. Bosco, S. Montemagni, and M. Simi. Converting
   Running an exploratory analysis based on ex-            Italian Treebanks: Towards an Italian Stanford De-
pected grammaticality, we observed that Sarti’s            pendency Treebank. In Proceedings of the ACL Lin-
system performs much better in predicting the ac-          guistic Annotation Workshop & Interoperability with
                                                           Discourse.
ceptability score on expected grammatical sen-
tences (r=.80, p<.001) than on expected ungram-          Dominique Brunato, Lorenzo De Mattei, Felice
  Dell'Orletta, Benedetta Iavarone, and Giulia Venturi. 2018. Is this sentence difficult? Do you agree? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2690–2699.

Cristiano Chesi and Paolo Canal. 2019. Person features and lexical restrictions in Italian clefts. Frontiers in Psychology, 10:2105.

Shammur Absar Chowdhury and Roberto Zamparelli. 2018. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 133–144.

Alessandra Teresa Cignarella, Cristina Bosco, and Paolo Rosso. 2019. Presenting TWITTIRÒ-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).

Joshua R. De Leeuw. 2015. jsPsych: A JavaScript library for creating behavioral experiments in a web browser. Behavior Research Methods, 47(1):1–12.

Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli. 2007. VIT - Venice Italian Treebank: Syntactic and quantitative features. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories.

Rodolfo Delmonte. 2020. Venses@AcCompl-it: Computing complexity vs acceptability with a constituent trigram model and semantics. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.

Matteo Greco, Paolo Lorusso, Cristiano Chesi, and Andrea Moro. 2020. Asymmetries in nominal copular sentences: Psycholinguistic evidence in favor of the raising analysis. Lingua, 245:102926.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. arXiv preprint arXiv:1803.11138.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics.

T. Linzen, E. Dupoux, and Y. Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Simona Mancini, Paolo Canal, and Cristiano Chesi. 2018. The acceptability of person and number agreement/disagreement in Italian: An experimental study. Lingbuzz preprint: https://ling.auf.net/lingbuzz/005514.

Andrea Moro. 1997. The raising of predicates: Predicative noun phrases and the theory of clause structure, volume 80. Cambridge University Press.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and Fabio Tamburini. 2018. PoSTWITA-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Gabriele Sarti. 2020. UmBERTo-MTSA @ AcCompl-it: Improving complexity and acceptability prediction with multi-task learning on self-supervised annotations. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Antonella Sorace and Frank Keller. 2005. Gradience in linguistic data. Lingua, 115:1497–1524.

Jon Sprouse. 2007. Continuous acceptability, categorical grammaticality, and experimental syntax. Biolinguistics, 1:123–134.

Sandra Villata, Paolo Canal, Julie Franck, Andrea Carlo Moro, and Cristiano Chesi. 2015. Intervention effects in wh-islands: An eye-tracking study. In Architectures and Mechanisms for Language Processing (AMLaP 2015), pages 195–195.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042.