=Paper=
{{Paper
|id=Vol-2765/163
|storemode=property
|title=AcCompl-it @ EVALITA2020: Overview of the Acceptability & Complexity Evaluation Task for Italian
|pdfUrl=https://ceur-ws.org/Vol-2765/paper163.pdf
|volume=Vol-2765
|authors=Dominique Brunato,Cristiano Chesi,Felice Dell'Orletta,Simonetta Montemagni,Giulia Venturi,Roberto Zamparelli
|dblpUrl=https://dblp.org/rec/conf/evalita/BrunatoCDMVZ20
}}
==AcCompl-it @ EVALITA2020: Overview of the Acceptability & Complexity Evaluation Task for Italian==
Dominique Brunato¹, Cristiano Chesi², Felice Dell'Orletta¹, Simonetta Montemagni¹, Giulia Venturi¹, Roberto Zamparelli³

¹ ILC-CNR, Via G. Moruzzi 1, Pisa, Italy
² NETS-IUSS, P.zza Vittoria 15, Pavia, Italy
³ CIMeC-UNITRENTO, Corso Bettini 31, Rovereto, Italy

[name.surname]@ilc.cnr.it, cristiano.chesi@iusspavia.it, roberto.zamparelli@unitn.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===

The Acceptability and Complexity evaluation task for Italian (AcCompl-it) was aimed at developing and evaluating methods to classify Italian sentences according to Acceptability and Complexity. It consists of two independent tasks asking participants to predict either the acceptability or the complexity rating (or both) of a given set of sentences previously scored by native speakers on a 1-to-7 point Likert scale. In this paper, we introduce the datasets distributed to the participants, describe the different approaches of the participating systems, and provide a first analysis of the obtained results.

===1 Motivation===

The availability of annotated resources and systems aimed at predicting the level of grammatical acceptability or linguistic complexity of a sentence (see, among others, Warstadt et al. (2018) and Brunato et al. (2018)) is becoming increasingly relevant for the different research communities that focus on the study of language. From the Natural Language Processing (NLP) perspective, interest has recently been prompted by automatic generation systems (e.g. Machine Translation, Text Simplification, Summarization), mostly based on Deep Neural Network algorithms (Gatt and Krahmer, 2018). In this scenario, resources and methods able to assess the quality of automatically generated sentences, or devoted to investigating the ability of artificial neural networks to score linguistic phenomena on the acceptability and complexity scales, are of pivotal importance. From the theoretical linguistics perspective, controlled datasets containing acceptability judgments and analyzed with machine learning techniques can be useful to test the extent to which syntactic and semantic deviance can be induced from corpus data alone, especially for low-frequency phenomena (Chowdhury and Zamparelli, 2018; Gulordava et al., 2018; Wilcox et al., 2018), while the same data, seen from a psycholinguistic angle, can shed light on the relation between complexity and acceptability (Chesi and Canal, 2019), and on the extent to which measures of on-line perplexity in artificial language models can track human parsing preferences (Demberg and Keller, 2008; Hale, 2001).

The Acceptability & Complexity evaluation task for Italian (AcCompl-it) at EVALITA 2020 (Basile et al., 2020) is in line with this emerging scenario. Specifically, it is aimed at developing and evaluating methods to classify Italian sentences according to Acceptability and Complexity, which can be viewed as two simple numeric measures associated with linguistic productions. Among the outcomes of the task, we also include the creation of a set of sentences annotated with acceptability and complexity human judgments that we are going to share with the linguistic community. While datasets annotated for acceptability exist for English, see in particular the CoLA dataset (Warstadt et al., 2018), to our knowledge the present dataset is a first for Italian, and it is also the first to combine judgments of acceptability and complexity.
===2 Definition of the task===

We conceived AcCompl-it as a prediction task where participants were asked to estimate the average acceptability and complexity scores of a set of sentences previously rated by native speakers on a 1-7 Likert scale and, if possible, to predict the actual standard error (SE) among the annotations. The SE gives an estimation of the actual agreement between human annotators: the higher the SE, the lower the agreement. The task is articulated in three subtasks, as follows:

• the Acceptability prediction task (ACCEPT), where participants have to estimate the acceptability score of sentences (along with its standard error); in this case, 1 corresponds to the lowest degree of acceptability, while 7 corresponds to the highest. The assignment of a score on a gradual scale is in line with the definition of perceived acceptability that we intend to inspect empirically. According to the literature, in fact, acceptability is a concept closely related to grammaticality, but with some major differences (see, among others, Sprouse (2007) and Sorace and Keller (2005)). While the latter is a theoretical construct corresponding to syntactic well-formedness and is typically interpreted as a binary property (i.e., a sentence is either grammatical or ungrammatical), acceptability can depend on many factors: syntactic, semantic, pragmatic, and non-linguistic;

• the Complexity prediction task (COMPL), where participants have to estimate the complexity score of sentences (along with its standard error); in this case, 1 corresponds to the lowest possible level of complexity, while 7 indicates the highest degree. As in the Acceptability prediction task, the use of the Likert scale as a tool to collect perceived values is motivated by the assumption that sentence complexity is a gradient, rather than binary, concept;

• the Open task, where participants are requested to model linguistic phenomena correlated with the human ratings of sentence acceptability and/or complexity in the datasets provided.

The three subtasks were independent, and participants could decide to take part in any one of them, though we encouraged participation in multiple subtasks, since the complexity metrics might be influenced by the grammatical status of an expression and vice versa. In line with this intuition, we distributed a subset of sentences annotated with both acceptability and complexity scores, in order to investigate whether and to what extent there is a correlation between the two phenomena. In all subtasks, participants were free to use external resources, and they were evaluated against a blind test set. The two targets to predict for each sentence (mean rating and SE) can be derived from the raw annotations as sketched below.
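As an illustration of how the prediction targets relate to the raw annotations, here is a minimal Python sketch (our own, with hypothetical column names; not part of the official task material) that derives each sentence's mean score and standard error from its Likert ratings:

```python
import pandas as pd

# Hypothetical raw annotations: one row per (sentence, annotator) pair,
# with ratings on the 1-7 Likert scale.
raw = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 2],
    "rating":      [6, 7, 5, 2, 1, 3],
})

# Mean rating and standard error of the mean per sentence:
# the two values participants had to predict for each sentence.
targets = raw.groupby("sentence_id")["rating"].agg(
    score="mean",   # average Likert rating
    se="sem",       # standard error: std / sqrt(number of annotators)
)
print(targets)
```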
===3 Dataset===

====3.1 Composition====

Acceptability dataset: it contains 1,683 Italian sentences annotated with human judgments of acceptability on a 7-point Likert scale. The number of annotations per sentence ranges from 10 to 85, with an average of 16.38. The dataset was constructed by merging the data of four psycholinguistic studies on minimal variations of controlled linguistic oppositions with different levels of grammaticality with a subset of 672 sentences generated from templates.

The first subset (128 sentences), taken from Chesi and Canal (2019), focuses on person feature oppositions in object cleft dependencies, where Determiner Phrases (DPs) are introduced either by determiners or by pronouns used as determiners, as in (1).

(1) {Sono | siete} {gli | voi} architetti che {gli | voi} ingegneri {hanno | avete} consultato.
    {are-3Ppl | are-2Ppl} {the | you} architects that {the | you} engineers {have-3Ppl | have-2Ppl} consulted
    'it is {the | you} architects that {the | you} engineers have consulted'

The second subset (515 sentences) is taken from the studies presented in Greco et al. (2020), involving copular constructions, e.g. canonical (2a) vs. inverse (2b) (Moro, 1997).

(2) a. Le foto del muro sono la causa della rivolta.
       the pictures of_the wall are the cause of_the riot
    b. La causa della rivolta sono le foto del muro.
       the cause of_the riot are the pictures of_the wall

This subset also contains declarative and interrogative (yes/no) sentences with a minimal verbal structure, contrasting preverbal vs. postverbal subject position in unergatives (3a), unaccusatives (3b) and transitive predicates (3c).

(3) a. I cani hanno abbaiato | Hanno abbaiato i cani.
       the dogs have barked | have barked the dogs
    b. Gli autobus sono partiti | Sono partiti gli autobus.
       the buses have left | have left the buses
    c. Le bambine hanno mangiato il dolce | Hanno mangiato le bambine il dolce.
       the girls have eaten the dessert | have eaten the girls the dessert

The third set (320 sentences) is based on a study in which number and person subject-verb agreement and unagreement cases are tested (Mancini et al., 2018):

(4) Qualcuno ha detto che io {scrivo-1Psg | *scriviamo-1Ppl} una lettera.
    somebody has said that I-1Psg {write-1Psg | *write-1Ppl} a letter

The fourth one (48 sentences) contains experimental items from Villata et al. (2015), involving different types of wh-island violations.

(5) {Cosa | Quale edificio}_i ti chiedi {chi | quale ingegnere} abbia costruito __i?
    {what | which building}_i do you wonder {who | which engineer} has built __i?

The last set of 672 sentences was generated by creating all the possible content word combinations from various structural templates (a minimal sketch of this kind of template expansion is given after this list) designed to test acceptability patterns due to:

(i) extra or missing gaps in wh-extractions (6a) vs. topic constructions (6b):

(6) a. {Cosa | Quale problema}_i lo studente dovrebbe descriver(e) {__i | -lo_i | questo problema}?
       {what | which problem}_i the student should describe {__i | it_i | this problem}?
    b. Questo problema_i, lo studente dovrebbe descriver(e) {__i | -lo_i | questo problema}.
       this problem_i, the student should describe {__i | it_i | this problem}

(ii) wh- and relative clauses with gaps inside VP conjunctions (in all conjuncts, i.e. "Across the Board", in only one conjunct, or not at all, see e.g. (7)):

(7) Chi_i ... Maria vuole chiamar(e) {__i | -lo} e il dottore medicar(e) {__i | -lo}?
    who_i ... Mary wants call-inf {__i | him} and the doctor cure-inf {__i | him}?

(iii) embedded wh-clauses and the possibility of subextractions from them (similar to (5)):

(8) Quale provvedimento Maria ha saputo {che | dove | perché | quando} il ministro prenderà?
    which measure M. has heard {that | where | why | when} the minister will_take?

(iv) extractions from VPs in subject vs. object positions (9) (cf. (2)):

(9) Carlo conosceva bene il compagno_i di classe che {incontrare __i divertiva sempre Anna | Anna voleva sempre incontrare __i}.
    Carlo knew well the classmate_i that {meet-inf __i amused always Anna | Anna wanted always meet-inf __i}

(v) NEGPOLs (nessuno, alcunché, mai 'any, anything, ever') that are licensed by a higher negation, by a question, or not licensed, in simple or (deeply) embedded sentences (e.g. (10)):

(10) {Maria | Nessuno} si aspetta che qualcuno possa aver {già | mai} finito questo esercizio(?)
     {M. | no-one} self expects that someone could have {already | ever} completed this exercise(?)

The use of expanded templates was designed to minimize the potential effect of collocations or specific lexical choices.
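As a rough illustration of this generation strategy (a minimal sketch of template expansion in general, with invented templates and lexical slots rather than the actual AcCompl-it material), all slot combinations can be enumerated with itertools:

```python
from itertools import product

# Hypothetical template and lexical slots (invented for illustration;
# not the actual AcCompl-it templates).
template = "{subj} dovrebbe descrivere {obj}."
slots = {
    "subj": ["Lo studente", "Il professore"],
    "obj":  ["questo problema", "questa teoria"],
}

# Generate every content-word combination for the template (2 x 2 = 4).
names = list(slots)
sentences = [
    template.format(**dict(zip(names, values)))
    for values in product(*(slots[n] for n in names))
]
for s in sentences:
    print(s)
```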
Whenever possible, each sentence was also manually annotated according to the linguistic-theoretic expectations for "grammaticality", on a 4-point scale: * (ungrammatical, coded as 0), ?? (very marginal, coded as 0.33), ? (marginal, coded as 0.66) and OK (grammatical, coded as 1).

Complexity dataset: it comprises 2,530 Italian sentences annotated with human judgments of perceived complexity on a 7-point Likert scale, as for the acceptability dataset. The number of annotations per sentence ranged from 11 to 20, with an average of 16.753. The corpus was internally subdivided into two subsets representative of two different typologies of data, i.e. 1,858 naturalistic sentences extracted from corpora and the 672 artificially-generated sentences drawn from the Acceptability dataset, chosen to cover the range of linguistic phenomena represented in its templates. The first subset contains sentences taken from the Universal Dependencies (UD) treebanks (Nivre et al., 2016) available for Italian, representative of different text genres and domains. The largest portion contains 1,128 sentences taken from the newswire section of the Italian Stanford Dependency Treebank (ISDT) (Bosco et al.), annotated with complexity judgments by Brunato et al. (2018). Besides these, we chose to include in this corpus smaller subsets of sentences representative of a non-standard language variety and of specific constructions, i.e. wh-questions and direct speech. Non-standard sentences (323 in total) are in the form of generic tweets and tweets labelled for irony, taken from two representative treebanks, i.e. PoSTWITA and TWITTIRÒ (Sanguinetti et al., 2018; Cignarella et al., 2019). Wh-questions (164 sentences) were extracted from a dedicated section (prefixed by the string 'quest') included in ISDT. Direct speech sentences (243) mainly include transcripts of European parliamentary debates (taken from the 'europarl' section of ISDT) and extracts from literary texts (mostly contained in the UD Italian VIT treebank (Delmonte et al., 2007)).

The choice of annotating a shared portion of data with both acceptability and complexity scores was explicitly motivated by the attempt to empirically investigate whether there is a correlation between the two sentence properties, and whether complexity is judged differently in the case of ill-formed constructions. For the purpose of the task, both datasets were split into training and validation samples with a proportion of 80% to 20%, respectively.
====3.2 Annotation with Human Judgments====

For the collection of judgments of sentence acceptability and complexity by Italian native speakers, we relied on crowdsourcing techniques using different platforms. More specifically, for Acceptability, the set of sentences drawn from the psycholinguistic studies described in Section 3.1 was annotated using an on-line platform based on jsPsych scripts (De Leeuw, 2015). For the Complexity dataset, the annotation of the subcorpus of sentences taken from Brunato et al. (2018) was performed through the CrowdFlower platform (now known as Figure Eight, https://appen.com/; more details are reported in the reference paper), while the remaining sentences in this dataset were annotated using Prolific (www.prolific.co). To make the annotation process comparable to the one followed by Brunato et al. (2018), the whole process was split into different tasks, each one consisting in the annotation of about 200 sentences randomly mixed across the various typologies. For all tasks, workers were asked to read each sentence and answer the following question:

"Quanto è complessa questa frase da 1 (semplicissima) a 7 (molto difficile)?"
'How complex is this sentence, from 1 (very simple) to 7 (very difficult)?'

Beyond complexity, the 672 artificially-generated sentences were also labelled for perceived acceptability according to the following question:

"Quanto è accettabile questa frase da 1 (completamente agrammaticale) a 7 (perfettamente grammaticale)?"
'How acceptable is this sentence, from 1 (completely ungrammatical) to 7 (completely grammatical)?'

After collecting all annotations, we excluded workers who performed the assigned task in less than 10 minutes, which we set as the minimum threshold to accurately complete the survey.

====3.3 Analysis of Judgments across Corpora====

Table 1 shows the average value, standard deviation, and minimum and maximum score of the complexity and acceptability labels for the whole dataset (a toy sketch of how such statistics can be computed follows the examples below).

              COMPL            ACCEPT
         SCORE    SE       SCORE    SE
  µ       3.12   0.332      4.45   0.36
  σ       1.04   0.08       1.7    0.14
  min     1      0          1.13   0
  max     6.46   0.63       7      0.74

Table 1: Statistics collected for the two corpora of the AcCompl-it dataset.

As can be noticed, complexity values are on average lower and less scattered than the acceptability ones. For this corpus, the lowest value on the Likert scale (1), which should have been used to label sentences perceived as very simple, in line with the task question, was given only twice, specifically to the following sentences:

(11) Dimmi il nome di una città finlandese.
     tell_me the name of a town Finnish
     'Tell me the name of a Finnish town'

(12) Quali uve si usano per produrre vino?
     which grapes PRT they_use to make wine
     'Which grapes are used to make wine?'

Conversely, for the acceptability corpus, the highest value on the Likert scale (i.e. 7, meaning in this case completely acceptable) was attributed to 26 sentences. For space reasons, we report only two examples here:

(13) Le sorelle sono sopravvissute.
     the sisters are survived
     'The sisters have survived'

(14) I lupi hanno ululato.
     the wolves have howled
     'The wolves have howled'
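As a toy illustration of how the Table 1 statistics can be derived (our own sketch with hypothetical column names, not the organizers' analysis code):

```python
import pandas as pd

# Hypothetical per-sentence mean ratings for the two corpora.
scores = pd.DataFrame({
    "corpus": ["COMPL", "COMPL", "COMPL", "ACCEPT", "ACCEPT", "ACCEPT"],
    "rating": [3.1, 1.0, 6.46, 4.5, 1.13, 7.0],
})

# Mean, standard deviation, minimum and maximum per corpus,
# i.e. the quantities reported in Table 1.
summary = scores.groupby("corpus")["rating"].agg(["mean", "std", "min", "max"])
print(summary)
```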
With respect to the 'worst' values, two sample sentences, judged respectively as the most complex (6.46 on the Likert scale) and (among) the least acceptable (1.55) in each dataset, are the following:

(15) Chi è che lui ha affermato che il professore aveva detto che lo studente avrebbe dovuto considerare questo candidato?
     who is that he has claimed that the professor had said that the student had-subj must consider this candidate?

(16) Il falegname è arrivato mentre noi montavo la mensola.
     the carpenter has arrived while we were_assembling-1Psg the shelf

                 SCORE    SE    MIN    MAX
  ISDT_news       3.28   0.33   1.25   5.7
  Twitter         2.59   0.31   1.13   4.69
  DirectSpeech    2.68   0.31   1.14   6
  Wh-Quest        1.61   0.22   1      2.94
  ArtifSent       3.63   0.37   1.42   6.47

Table 2: Average complexity score, standard error and minimum and maximum value across the different subsets of the Complexity dataset.

If we consider the internal composition of the two datasets, we can see a more articulated picture depending on their various subparts (see Tables 2 and 3). For complexity, average scores are higher for sentences created to display specific acceptability patterns, thus showing that acceptability does affect the perception of complexity. Note that the most complex sentence (reported in (15)) is contained in this set, and is ungrammatical (no gap). Among the treebank sentences, those extracted from journalistic texts (ISDT_news) were judged on average as the most complex, and questions as the easiest. Twitter and direct speech sentences obtained scores in between the highest and the lowest values, and very close to each other. This is in line with stylistic and linguistic analyses showing that the language of social media inherits many features from spoken language.

                         SCORE    SE    MIN    MAX
  clefts (1)              4.27   0.39   1.36   6
  copular                 5.01   0.48   2.90   6.5
    canonical (2a)        5.47   0.44   3.58   6.5
    inverse (2b)          4.56   0.51   2.90   5.9
  unerg V (3a)            5.91   0.30   3.70   7
    SV                    6.64   0.21   5.94   7
    VS                    5.18   0.40   3.70   6.2
  unacc V (3b)            6.28   0.27   4.86   7
    SV                    6.61   0.20   5.82   7
    VS                    5.96   0.33   4.86   6.72
  trans V (3c)            4.91   0.34   2      7
    SV                    6.47   0.24   5.06   7
    VS                    3.34   0.43   2      4.52
  S V agree (4)           3.81   0.31   1.25   6.93
    match                 5.87   0.30   3.14   6.92
    mismatch              1.74   0.32   1.25   2.72
  wh-island (arg) (5)     3.85   0.17   1.68   5.63
  filler-gap dep.         3.56   0.51   1.5    6.69
    doubly filled (6)     3.28   0.42   1.5    6.69
    coord (7)             4.25   0.47   2.5    6
  wh-island (adj) (8)     3.02   0.41   1.38   5.6
    no extraction         6.26   0.29   5.26   7
  subj/obj (9)            3.70   0.51   2.66   5.15
  NPIs (10)               4.75   0.45   2.27   6.6
    bad fillers           1.13   0.06   1.13   1.13
    good fillers          6.76   0.07   6.76   6.76
    medium fillers        4.07   0.18   4.07   4.07

Table 3: Average acceptability score, standard error and minimum and maximum value across the different linguistic phenomena of the Acceptability dataset. Numbers in (·) refer to examples in the text.

For the whole acceptability dataset, the Spearman's rank correlation coefficient between theoretically-driven grammaticality and mean acceptability labels is very strong (r(656)=.83, p<.001). While this could be somewhat expected, when we focus only on the 672 sentences annotated for both complexity and acceptability, we still observe a significant but lower correlation between expected grammaticality and mean complexity (r(656)=.34, p<.001). Still considering this subset, an additional outcome is the moderate (and negative) correlation between the two metrics (r(672)=−.49, p<.001), further suggesting that the more a sentence is perceived as complex, the less acceptable it is.

===4 Evaluation measures===

For both the ACCEPT and COMPL tasks, the evaluation metric was based on Spearman's rank correlation coefficient between the participants' scores and the test set scores. For each task, two different rankings were produced, one for the predicted scores and one for the predicted standard errors. In each task a different baseline was defined (both baselines and the metric are sketched below):

• in the ACCEPT task, it corresponds to the score assigned by an SVM linear regression using word unigrams and bigrams as features;

• in the COMPL task, it corresponds to the score assigned by an SVM linear regression using sentence length as its sole feature.
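The following sketch illustrates both baselines and the official metric under stated assumptions: it is our own toy reconstruction using scikit-learn's LinearSVR and scipy's spearmanr with invented data, not the organizers' evaluation script.

```python
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

# Hypothetical toy data standing in for the official train/test splits.
train_sents  = ["I cani hanno abbaiato", "Hanno mangiato le bambine il dolce",
                "Le sorelle sono sopravvissute", "La causa della rivolta sono le foto"]
train_scores = [6.6, 3.3, 7.0, 4.6]
test_sents   = ["Gli autobus sono partiti", "Sono partiti gli autobus",
                "I lupi hanno ululato"]
gold_scores  = [6.6, 6.0, 7.0]

# ACCEPT baseline: linear SVM regression over word uni- and bigrams.
vec = CountVectorizer(ngram_range=(1, 2))
accept = LinearSVR().fit(vec.fit_transform(train_sents), train_scores)
accept_pred = accept.predict(vec.transform(test_sents))

# COMPL baseline: linear SVM regression with sentence length as sole feature.
tolen = lambda sents: [[len(s.split())] for s in sents]
compl = LinearSVR().fit(tolen(train_sents), train_scores)
compl_pred = compl.predict(tolen(test_sents))

# Official metric: Spearman's rank correlation between predictions and gold.
rho, p = spearmanr(accept_pred, gold_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```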
===5 Participation and results===

The AcCompl-it task received three submissions for each subtask from two different participants, for a total of 6 runs. Unfortunately, neither participant took part in the Open task. Results for the other subtasks are reported in Tables 4 and 5.

  PARTICIPANT                SCORE    SE
  UmBERTO-MTSA (Sarti)       0.88**   0.52**
  ItVenses-run1 (Delmonte)   0.44**   0.25**
  ItVenses-run2 (Delmonte)   0.49**   0.41**
  Baseline                   0.30**   0.35**

Table 4: ACCEPT task results. **p value<0.001; *p value<0.05.

  PARTICIPANT                SCORE    SE
  UmBERTO-MTSA (Sarti)       0.83**   0.51**
  ItVenses-run1 (Delmonte)   0.31**   0.09*
  ItVenses-run2 (Delmonte)   0.31**   0.07
  Baseline                   0.50**   0.33**

Table 5: COMPL task results. **p value<0.001; *p value<0.05.

The systems of the two participants follow very different approaches: one is based on deep learning and trained on raw texts (Sarti, 2020), while the other relies on (heuristic) rules applied to semantic and syntactic features automatically extracted from sentences (Delmonte, 2020). In spite of their very different nature, the two approaches also present some commonalities, such as the reliance on external resources. In particular, both make use of additional sentences taken from existing Italian treebanks, either to enrich the original training sets with additional annotated examples (Sarti's case) or to check the frequency of a given construction and use this information among the features of the proposed system (ItVenses).

Sarti's systems obtained the best performance on both tasks using a similar multi-task learning (MTL) approach, which consists in leveraging the predictions of a state-of-the-art neural language model for Italian (i.e. UmBERTo, https://github.com/musixmatchresearch/umberto), fine-tuned on the two downstream tasks, to augment the original development sets with a large set of unlabeled examples extracted from available Italian treebanks. The bigger dataset was then split into different portions to train an ensemble of classifiers. The resulting MTL model was finally used to predict the complexity/acceptability labels on the original test sets (a schematic sketch of this augment-and-ensemble pipeline is given at the end of this section).

Delmonte's ItVenses system parses the sentences to obtain a sequence of constituents and a set of sentence-level semantic features (presence of agreement, negation markers, speech act and factivity). These features, along with constituent triples and their frequency in the training set and in the Venice Italian Treebank, are weighed with various heuristics and used to derive a prediction. Agreement mismatches were checked using morphological analysis of verb and subject, while the argument structure is inferred using a deep parser. The two versions of the system (run1 and run2) differ only in their use of features (run2 dispenses with propositional negation and certain verb agreement features).

As can be seen, ItVenses's performance was considerably lower than that of Sarti's system (lower, in fact, than the baseline based on sentence length in the COMPL prediction task). However, as better explained in the following section, on the artificial data subset, which has complex but far less diverse structures, the gap with the winning system is reduced in the COMPL task (cf. Table 7) and, even more robustly, in the ACCEPT task (Table 6).
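To make the data flow concrete, here is a deliberately schematic sketch of the augment-and-ensemble idea described above. The featurize function and the Ridge regressors are stand-ins for the fine-tuned UmBERTo model actually used by Sarti (2020); only the self-labeling and ensembling steps are illustrated, and all names and data are invented:

```python
import numpy as np
from sklearn.linear_model import Ridge

def featurize(sentences):
    # Toy features (sentence length, mean token length) standing in for
    # transformer representations.
    return np.array([[len(s.split()),
                      np.mean([len(t) for t in s.split()])] for s in sentences])

def self_training_ensemble(lab_sents, lab_scores, unlab_sents, n_parts=3):
    # 1. Train a first model on the labeled development set.
    base = Ridge().fit(featurize(lab_sents), lab_scores)
    # 2. Use its predictions as silver labels for unlabeled treebank sentences.
    silver = base.predict(featurize(unlab_sents))
    X = np.vstack([featurize(lab_sents), featurize(unlab_sents)])
    y = np.concatenate([lab_scores, silver])
    # 3. Split the augmented dataset into portions and train an ensemble.
    members = [Ridge().fit(X[i::n_parts], y[i::n_parts]) for i in range(n_parts)]
    # 4. Final prediction: average over the ensemble members.
    return lambda sents: np.mean(
        [m.predict(featurize(sents)) for m in members], axis=0)

predict = self_training_ensemble(
    ["I cani hanno abbaiato", "Hanno mangiato le bambine il dolce"], [6.6, 3.3],
    ["Gli autobus sono partiti", "I lupi hanno ululato", "Sono partiti gli autobus"])
print(predict(["Le sorelle sono sopravvissute"]))
```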
===6 Discussion===

The extremely good performance of the winning system in both tasks is not wholly unexpected, in light of the impressive results obtained by current neural network models across a variety of NLP tasks. In this regard, it is worth noticing that, in his report, the author compared the performance of the best system, based on multi-task learning, with that obtained by a simpler version of the UmBERTo-based model with standard fine-tuning on the two downstream tasks, which already achieved very good results (.90 and .84 for acceptability and complexity predictions on the training corpus, respectively). Similarly, and especially for the automatic assessment of sentence acceptability, the scores obtained by the winning system (.88) are in line with those reported by Linzen et al. (2016), who train a classifier to detect subject-verb agreement mismatches from the hidden states of an LSTM, achieving a .83 score. Most other systems investigating the ability of neural models to detect acceptability or grammaticality in a broader range of cases report much lower scores, but they try to read (minimal pair) judgments from metrics associated with the performance of systems that have not been expressly trained on giving judgments, reasoning that 'judgment giving' is not a task humans have life-long training for, but which is nonetheless feasible.

To gain a better understanding of the potential impact of different types of data on the predictive capabilities of the two systems, we further inspected the final results by testing each system on sentences representative of diverse linguistic phenomena and textual genres. To this end, we split the whole test set into the distinct subsets defined in the corpus collection process (cf. Section 3.1) and assessed the correlation score between predicted and real labels for each type (see the sketch at the end of this section). Note that, for the ACCEPT predictions, this analysis was performed considering only two 'macro-classes', i.e. artificial vs. psycholinguistics-related data, in order to have a significant number of examples in the test set. Similarly, for COMPL, we distinguished the artificially-generated sentences from sentences drawn from all treebanks. Results of this fine-grained analysis are shown in Tables 6 and 7.

  PARTICIPANT                SCORE    SE
  Psycholinguistics-related data
  UmBERTO-MTSA (Sarti)       0.90**   0.55**
  ItVenses-run1 (Delmonte)   0.42**   0.24**
  ItVenses-run2 (Delmonte)   0.50**   0.48**
  Artificial data
  UmBERTO-MTSA (Sarti)       0.74**   0.33**
  ItVenses-run1 (Delmonte)   0.50**   0.20*
  ItVenses-run2 (Delmonte)   0.46**   0.25*

Table 6: ACCEPT task results on different subsets of the official test set. **p value<0.001; *p value<0.05.

  PARTICIPANT                SCORE    SE
  Treebank sentences
  UmBERTO-MTSA (Sarti)       0.86**   0.61**
  ItVenses-run1 (Delmonte)   0.25**   0.13*
  ItVenses-run2 (Delmonte)   0.24**   0.10*
  Artificial data
  UmBERTO-MTSA (Sarti)       0.70**   0.06
  ItVenses-run1 (Delmonte)   0.44**   -0.07
  ItVenses-run2 (Delmonte)   0.51**   -0.11

Table 7: COMPL task results on different subsets of the official test set. **p value<0.001; *p value<0.05.

Interestingly, although the gap between the two systems is still evident, we observed that artificial data have an opposite effect on their performance. In particular, as anticipated in the previous section, ItVenses is more accurate in predicting both the complexity and, especially, the acceptability level of this group of sentences. The opposite holds for Sarti's system, which, although still very good in both tasks, achieves lower correlation scores when tested against artificial data.

Running an exploratory analysis based on expected grammaticality, we observed that Sarti's system performs much better in predicting the acceptability score on expectedly grammatical sentences (r=.80, p<.001) than on expectedly ungrammatical ones (r=.76, p<.001). Similarly, but less robustly, the same numerical asymmetry is observed in both of Delmonte's runs: for grammatical predictions, run1 r=.33 and run2 r=.35; for ungrammatical ones, run1 r=.32 and run2 r=.34, all correlations being equally significant (p<.001).
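The per-subset analysis above amounts to grouping test sentences by origin and computing Spearman's correlation within each group. A minimal sketch with an invented results table (column names are our own):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical results frame: gold and predicted scores plus the subset
# each test sentence comes from.
results = pd.DataFrame({
    "subset": ["artificial", "artificial", "artificial",
               "treebank", "treebank", "treebank"],
    "gold":   [1.7, 5.9, 3.3, 2.6, 3.3, 1.6],
    "pred":   [2.0, 5.5, 3.8, 2.4, 3.6, 2.1],
})

# Spearman correlation between predicted and gold labels, per subset,
# mirroring the fine-grained analysis behind Tables 6 and 7.
for name, grp in results.groupby("subset"):
    rho, p = spearmanr(grp["gold"], grp["pred"])
    print(f"{name}: rho = {rho:.2f} (p = {p:.3f})")
```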
===References===

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

C. Bosco, S. Montemagni, and M. Simi. Converting Italian treebanks: Towards an Italian Stanford Dependency Treebank. In Proceedings of the ACL Linguistic Annotation Workshop & Interoperability with Discourse.

Dominique Brunato, Lorenzo De Mattei, Felice Dell'Orletta, Benedetta Iavarone, and Giulia Venturi. 2018. Is this sentence difficult? Do you agree? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2690–2699.

Cristiano Chesi and Paolo Canal. 2019. Person features and lexical restrictions in Italian clefts. Frontiers in Psychology, 10:2105.

Shammur Absar Chowdhury and Roberto Zamparelli. 2018. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 133–144.

Alessandra Teresa Cignarella, Cristina Bosco, and Paolo Rosso. 2019. Presenting TWITTIRÒ-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).

Joshua R. De Leeuw. 2015. jsPsych: A JavaScript library for creating behavioral experiments in a web browser. Behavior Research Methods, 47(1):1–12.

Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli. 2007. VIT - Venice Italian Treebank: Syntactic and quantitative features. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories.

Rodolfo Delmonte. 2020. Venses@AcCompl-it: Computing complexity vs acceptability with a constituent trigram model and semantics. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.

Matteo Greco, Paolo Lorusso, Cristiano Chesi, and Andrea Moro. 2020. Asymmetries in nominal copular sentences: Psycholinguistic evidence in favor of the raising analysis. Lingua, 245:102926.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. arXiv preprint arXiv:1803.11138.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics.
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Simona Mancini, Paolo Canal, and Cristiano Chesi. 2018. The acceptability of person and number agreement/disagreement in Italian: An experimental study. Lingbuzz preprint: https://ling.auf.net/lingbuzz/005514.

Andrea Moro. 1997. The Raising of Predicates: Predicative Noun Phrases and the Theory of Clause Structure, volume 80. Cambridge University Press.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and Fabio Tamburini. 2018. PoSTWITA-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Gabriele Sarti. 2020. UmBERTo-MTSA @ AcCompl-it: Improving complexity and acceptability prediction with multi-task learning on self-supervised annotations. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Antonella Sorace and Frank Keller. 2005. Gradience in linguistic data. Lingua, 115:1497–1524.

Jon Sprouse. 2007. Continuous acceptability, categorical grammaticality, and experimental syntax. Biolinguistics, 1:123–134.

Sandra Villata, Paolo Canal, Julie Franck, Andrea Carlo Moro, and Cristiano Chesi. 2015. Intervention effects in wh-islands: An eye-tracking study. In Architectures and Mechanisms for Language Processing (AMLaP 2015), page 195.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042.