Auxiliary selection in Italian intransitive verbs: a computational investigation based on annotated corpora

Ilaria Ghezzi
Dipartimento di Lingue e Letterature Straniere e Culture Moderne
Università degli Studi di Torino
ghezzi.ila@gmail.com

Cristina Bosco, Alessandro Mazzei
Dipartimento di Informatica
Università degli Studi di Torino
{bosco,mazzei}@di.unito.it

Abstract

English. The purpose of this paper is to analyse auxiliary selection in Italian intransitive verbs. The methodology consists in comparing linguistic theory with data extracted from two annotated corpora: UD-IT and PoSTWITA-UD. The analyzed verbs have been classified into different semantic categories according to the linguistic theory. The results confirm the theoretical assumptions and can serve as a starting point for applied tasks such as Natural Language Generation.

Italiano. Obiettivo di questo lavoro è l'analisi della selezione dell'ausiliare dei verbi intransitivi in italiano. La metodologia applicata consiste nel confrontare la teoria linguistica con dati estratti da due corpora annotati: UD-IT e PoSTWITA-UD. I verbi analizzati sono stati classificati nelle categorie semantiche individuate partendo dalla letteratura teorica. I risultati confermano con buona approssimazione gli assunti teorici e possono quindi essere il punto di partenza per l'implementazione di strumenti come sistemi di Natural Language Generation.

1 Introduction

In this work we apply a corpus-based approach to investigating how Italian intransitive verbs select their auxiliary verb. We considered two corpora, namely UD-IT¹ and PoSTWITA-UD (Sanguinetti et al., 2018), both annotated following the Universal Dependencies standards. UD-IT and PoSTWITA-UD are treebanks (morphologically and syntactically annotated corpora) for the Italian language. UD-IT is made up of texts from various sources, namely the Italian Constitution, the Italian Civil Code, newspaper articles and Wikipedia. It is a balanced corpus and can therefore be considered representative of standard Italian. PoSTWITA-UD, on the other hand, contains tweets from the social media platform Twitter, and can therefore be considered representative of the Italian used in social media (non-standard Italian). This difference allows us to investigate the behaviour of verbs in both standard and non-standard Italian.

¹ http://universaldependencies.org/it/overview/introduction.html

Intransitive verbs have been extensively studied in both traditional grammar and linguistics, since they do not always follow a standardized rule for auxiliary selection (see the examples in Section 2). This could be the reason why their status is not yet sufficiently formalized in NLP, as far as Italian is concerned. Among the most recent investigations that use a corpus linguistics methodology for Italian, we find (Amore, 2017).

Our analysis starts from traditional Italian grammars and then moves to the Auxiliary Selection Hierarchy by (Sorace, 2000), a syntactic and semantic perspective on the behaviour of intransitive verbs and auxiliary selection in Romance languages. This perspective can be useful for formalizing the phenomenon and thus for providing Natural Language Generation systems with the information they need about auxiliary selection, which is our final goal. Another contribution for the same systems, concerning adjectives, has been published in (Conte et al., 2017).

2 Auxiliary Selection in Italian

As in several other languages, in Italian one of two auxiliary verbs can be used together with the past participle to form periphrastic (compound) tenses: avere (to have) and essere (to be), henceforth indicated as A and E respectively. When the verb is transitive, auxiliary selection follows standard rules that depend on the diathesis: transitive verbs in the active diathesis select A (e.g. Luca ha mangiato la mela – Luca ate the apple), while transitive verbs in the passive diathesis select E (e.g. La mela è mangiata da Luca – The apple is eaten by Luca).

Problems in auxiliary selection arise instead when the verb is intransitive. Since the behaviour of intransitive verbs depends on both semantic and syntactic factors (Van Valin, 1990), a general rule for their auxiliary selection cannot always be formulated² (Patota, 2003). Some intransitive verbs can select either A or E depending on the semantics of the sentence, while others admit only E or only A. See the examples³ below:

1. Maria ha corso alle olimpiadi / Maria è corsa a casa
   (Maria has run at the Olympics / Maria is run home)

2. Ieri ho camminato al parco / *Ieri sono camminato al parco⁴
   (I walked in the park yesterday)

² Flexibility in auxiliary selection can be accounted for in a large number of cases if context is taken into account.
³ The English translations of the examples do not always follow English grammar. When this happens the auxiliary is underlined.
⁴ Sentences marked with * are ungrammatical.

Even though all the verbs involved describe a form of movement and are semantically similar, in the first pair of examples the intransitive verb correre (to run) allows the selection of both E and A, while in the second pair the intransitive verb camminare (to walk) allows only A, and the sentence generated by selecting E is ungrammatical.

Traditional and normative Italian grammars do not provide an analysis of intransitive verbs and auxiliary selection that could be formalized and therefore usefully exploited in NLP. They only provide lists of verbs that select A or E as auxiliary; see e.g. (Moretti and Orvieto, 1979), (Patota, 2003), (Renzi et al., 1991), (Serianni, 1988), (Dardano and Trifone, 1997). For this reason, we decided to consider other theories too, starting from the Unaccusative Hypothesis discussed in (Perlmutter, 1978) and moving to the Auxiliary Selection Hierarchy proposed in (Sorace, 2000).

Moreover, we adopted a corpus-based approach, on the assumption that corpora represent the way Italian native speakers use A or E together with intransitive verbs. We hypothesized that this kind of probabilistic perspective can allow a reliable description of the phenomenon. When standard grammar rules are lacking, certain linguistic aspects can be determined by extracting data from corpora. In this way, we can compensate for the lack of standard grammar rules with probabilistic and statistical data.

2.1 The theoretical status of intransitive verbs

To account for the behavior of intransitive verbs, Perlmutter (1978) formulated the Unaccusative Hypothesis, which splits intransitive verbs into two subcategories: unaccusative verbs and unergative verbs. Perlmutter suggested that unaccusative verbs are intransitive verbs whose grammatical subject is not an agent (e.g. La nave è affondata – The ship is sunk), while unergative verbs are intransitive verbs whose grammatical subject is an agent (e.g. Giulia ha camminato – Giulia has walked).

More recently, other linguists have analysed the topic along two major lines: Rosen proposed a syntactic-only approach (Rosen, 1984), while Van Valin and Dowty proposed a semantic-only approach (Van Valin, 1990; Dowty, 1979).

A development of Perlmutter's hypothesis, supported by experimental and psycholinguistic results, can be found in Sorace (2000), who proposed a model of the behaviour of intransitive verbs with respect to auxiliary selection that also covers Italian. This theory is the main inspiration for our current work.

2.2 A hierarchy for auxiliary selection

According to the theory proposed by Sorace, intransitive verbs can be hierarchically organized according to their degree of telicity and agentivity. The more telic a verb is, the more systematically it selects the auxiliary E; the more agentive it is, the more systematically it selects A.

This hierarchy of intransitive verbs, also known as the Auxiliary Selection Hierarchy (ASH), includes categories defined on the basis of thematic and aspectual features. At one end of the ASH we find intransitive verbs that categorically select E as auxiliary, while at the other end we find intransitive verbs that always select A. The verbs between the two poles of the ASH can alternate in their auxiliary selection.

ASH category                                             Examples               Auxiliary selection
Change of location (maximum telicity)                    to go, to arrive       selects E
Change of state                                          to appear, to happen
Continuation of pre-existing state                       to stay, to last
Existence of state                                       to exist, to seem
Uncontrolled process                                     to sleep, to rain
Controlled process - motional                            to walk, to run
Controlled process - non motional (maximum agentivity)   to act, to play        selects A

Table 1: Examples of verbs organized in the ASH: at the poles, verbs that always select E or always select A; between them, verbs that can select either.

In our work the ASH has been used to classify Italian intransitive verbs according to its categories, which are reported and exemplified in Table 1. This classification may seem wrong for verbs like "to go" (andare), which are both agentive and unaccusative, but, as Sorace (2000:863) points out, the verbs that express a change of location have the highest degree of dynamicity and telicity, and they always select E as auxiliary.

3 Intransitive verbs in the fundamental Italian vocabulary

3.1 Verbs selection

In order to focus our study on the intransitive verbs that are most commonly and competently used by Italian speakers, we decided to extract the verbs to be studied from the Nuovo vocabolario di base della lingua italiana (Chiari and De Mauro, 2016), a well known reference resource for Italian lexicography. Its lexical entries are organized in three basic vocabulary ranges according to their frequency of use and ease of recall for speakers: fundamental vocabulary (FO), high usage (AU) and high availability (AD).

For the present work, we considered only the verbs of the FO vocabulary, for a total of 51 intransitive verbs. Some of these verbs, however, have more than one meaning and could therefore be included in different categories of Sorace's ASH. To disambiguate them, we used Babelnet⁵, a multilingual lexicalized semantic network and ontology. After the disambiguation process, with each sense counted separately, the total number of verbs is 67.

⁵ https://babelnet.org/

We decided not to take intransitive pronominal verbs (e.g. rompersi, "to break") into consideration, since they always select the auxiliary E in compound tenses (e.g. Gli occhiali si sono rotti – The glasses broke). The choice to limit our research to the FO vocabulary is due to the fact that an artificial speaker, too, should be expected to use the verbs of this class expertly.

3.2 Verbs classification

After having selected the verbs, we proceeded to their classification, following the theory proposed by (Sorace, 2000). The intransitive verbs belonging to the FO Italian vocabulary have therefore been assigned to different categories, depending on both semantics and syntax.

Table 2 shows some examples of Italian intransitive verbs belonging to the FO class, classified according to the ASH by Sorace (2000).

ASH                            FO verbs
Change of location             andare (to go)
Change of state                apparire (to appear)
Contin. pre-existing state     rimanere (to stay)
Existence of state             esistere (to exist)
Uncontrolled process           dormire (to sleep)
Control. proc. (motion)        camminare (to walk)
Control. proc. (nonmotion)     agire (to act)

Table 2: Examples of intransitive verbs belonging to FO and classified according to ASH.

4 Reference corpora

As mentioned above, the reference corpora for this work are the treebanks UD-IT and PoSTWITA-UD, both annotated according to the Universal Dependencies (UD) format as far as morphology and syntax are concerned. Since UD is currently a de facto standard, using this format allows us to apply the same methodology to other resources or languages.

The use of both data sets is motivated by the need to extend our research to the largest available amount of data, and by the fact that UD-IT is representative of the standard Italian language, while PoSTWITA-UD represents the Italian language used in social media. This allows us to obtain a comprehensive set of results.

4.1 Data extraction

To extract the data concerning auxiliary selection from UD-IT and PoSTWITA-UD we used the Sets Treebank Search provided by the University of Turku, freely available at http://bionlp-www.utu.fi/dep_search/.

We formulated an expression that allowed us to extract only data related to intransitive verbs that appear in the reference corpora in the past participle form together with an auxiliary verb (A or E). We then compared the data from the corpora against the classification based on the linguistic theory.

5 Results

After the data extraction from UD-IT and PoSTWITA-UD, a first consideration concerns the percentages of intransitive verbs that select A or E in the two corpora.

As Figure 1 shows, in UD-IT the auxiliary A is selected by 10% of the verbs and the auxiliary E by 69%. In PoSTWITA-UD (see Figure 2), 49% of the verbs select E and 9% select A. The remaining percentages (in grey) consist of verbs that do not appear in compound tenses in the corpus and therefore did not provide useful results for our study; they must be studied in larger corpora.

Figure 1: The percentage of intransitive verbs selecting E (in blue), A (in orange) or not detected (in grey) in UD-IT.

Figure 2: The distribution of verbs selecting E (in blue) and A (in orange) in PoSTWITA-UD.

The overall results confirm the linguistic theory as far as the distribution across the semantic classes of Sorace's hierarchy is concerned. As stated in (Sorace, 2000), the auxiliary E is selected by intransitive verbs belonging to the categories of Change of location, Change of state, Continuation of pre-existing state and Existence of state, as shown in Figures 3 and 4 for our two reference corpora. Figure 5 shows an example with the verb "to go" taken from UD-IT.

Figure 3: The distribution of verbs selecting E (in blue) and A (in orange) across Sorace's verbal classes in PoSTWITA-UD.

Figure 4: The distribution of verbs selecting E (in blue) and A (in orange) across Sorace's verbal classes in UD-IT.

Figure 5: Example taken from UD-IT. In English: "He has gone away only half an hour before the end".

On the other hand, the auxiliary A is selected by verbs belonging to the categories of Uncontrolled process, Controlled motional process and Controlled nonmotional process. The following example, taken from UD-IT, involves the verb "to act", agire in Italian: Se, a richiesta del mittente, il vettore emette la lettera di trasporto aereo, si considera, sino a prova contraria, che egli abbia agito in nome del mittente⁶.

⁶ English translation: If, under request of the sender, the carrier issues the airway bill, it is considered, if not proven otherwise, that he has acted in the name of the sender.

As Figure 4 shows, the results for the category of Controlled nonmotional process indicate that both auxiliaries, A and E, can be admitted. This fact is also mentioned by (Sorace, 2000), who notes that some Italian native speakers may accept the auxiliary E for this category of verbs (e.g. Il cibo dell'ONU ha / è funzionato solo come palliativo).

6 Conclusion and future work

The paper presents a study of auxiliary selection in Italian intransitive verbs. Given that the qualitative description offered by traditional grammars does not allow the definition of a formal model for auxiliary selection, we considered a study (Sorace, 2000) that classifies intransitive verbs taking into account both semantic and syntactic features and behaviors. The long-term goal of this study is to contribute to the development of a natural language generation system for Italian (Mazzei et al., 2016; Mazzei, 2016; Conte et al., 2017). In particular, fluent automatic selection of the auxiliary can also be an important feature in contexts where the realizer module of the system is used to extract suggestions for non-native speakers learning Italian as L2.

In this study we adopted a corpus-based perspective and tested our assumption on two treebanks for Italian, representing standard and social media language respectively. The results confirm and validate the theory and could be used to develop a formal model that can be exploited in a computational context.

References

M. Amore. 2017. I verbi neologici nell'italiano del web: Comportamento sintattico e selezione dell'ausiliare. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

I. Chiari and T. De Mauro. 2016. Nuovo vocabolario di base della lingua italiana.

G. Conte, C. Bosco, and A. Mazzei. 2017. Dealing with Italian adjectives in noun phrase: a study oriented to natural language generation. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy.

M. Dardano and P. Trifone. 1997. La nuova grammatica della lingua italiana. Zanichelli, Bologna.

D. Dowty. 1979. Word Meaning and Montague Grammar. D. Reidel, Dordrecht.

A. Mazzei, C. Battaglino, and C. Bosco. 2016. SimpleNLG-IT: adapting SimpleNLG to Italian. In Proceedings of the 9th International Natural Language Generation conference, pages 184–192, Edinburgh, UK, September 5-8. Association for Computational Linguistics.

A. Mazzei. 2016. Building a computational lexicon by using SQL. In Pierpaolo Basile, Anna Corazza, Francesco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016, volume 1749, pages 1–5. CEUR-WS.org, December.

G. B. Moretti and G. R. Orvieto. 1979. Grammatica italiana. Benucci, Perugia.

G. Patota. 2003. Grammatica di riferimento della lingua italiana per stranieri. Le Monnier, Firenze.

D. M. Perlmutter. 1978. Impersonal passives and the unaccusative hypothesis. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society 38. Linguistic Society of America.

L. Renzi, G. Salvi, and A. Cardinaletti. 1991. Grande grammatica italiana di consultazione. Il Mulino, Bologna.

C. Rosen. 1984. The interface between semantic roles and initial grammatical relations. In D. M. Perlmutter and C. Rosen, editors, Studies in Relational Grammar 2, pages 38–77. University of Chicago Press.

M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, and F. Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of 11th International Conference on Language Resources and Evaluation - LREC 2018, Miyazaki, Japan, 7-12 May.

L. Serianni. 1988. Grammatica italiana. Italiano comune e lingua letteraria. Suoni, forme e costrutti. UTET, Torino.

A. Sorace. 2000. Gradients in auxiliary selection with intransitive verbs. Language, 76(4):859–890.

R. D. Van Valin. 1990. Semantic parameters of split intransitivity. Language, 66(2):221–260.
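
The extraction step described in Section 4.1 (intransitive verbs in the past participle form together with an auxiliary A or E) could be approximated offline on the CoNLL-U release of a treebank. The paper does not report the query expression used with the Turku search tool, so the following Python sketch is only an assumption about how such a count might be obtained. It relies on common UD conventions (the aux relation for tense auxiliaries, aux:pass for passive auxiliaries, and the features VerbForm=Part and Tense=Past); the function names, the helper row and the toy sentences are hypothetical and introduced purely for illustration.

```python
from collections import Counter, defaultdict

# Auxiliary lemmas mapped to the paper's labels: A = avere, E = essere.
AUX_LABEL = {"avere": "A", "essere": "E"}


def read_sentences(conllu_text):
    """Yield each sentence of a CoNLL-U string as a list of token dicts."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comments
            continue
        cols = line.split("\t")
        # Keep plain word lines only; skip ranges such as "3-4" and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append({
            "id": int(cols[0]),
            "lemma": cols[2],
            "upos": cols[3],
            "feats": set(cols[5].split("|")),
            "head": int(cols[6]) if cols[6].isdigit() else -1,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def count_auxiliaries(conllu_text, target_lemmas):
    """Count, per verb lemma, how often its past participle takes A or E."""
    counts = defaultdict(Counter)
    for sentence in read_sentences(conllu_text):
        # Past participles of the verbs under study, indexed by token id.
        participles = {
            tok["id"]: tok["lemma"]
            for tok in sentence
            if tok["upos"] == "VERB"
            and tok["lemma"] in target_lemmas
            and {"VerbForm=Part", "Tense=Past"} <= tok["feats"]
        }
        for tok in sentence:
            label = AUX_LABEL.get(tok["lemma"])
            # Only the plain "aux" relation counts; "aux:pass" (passive) is ignored.
            if label and tok["deprel"] == "aux" and tok["head"] in participles:
                counts[participles[tok["head"]]][label] += 1
    return dict(counts)


def row(idx, form, lemma, upos, feats, head, deprel):
    """Build one 10-column CoNLL-U line (hypothetical helper for the toy data)."""
    return "\t".join(
        [str(idx), form, lemma, upos, "_", feats, str(head), deprel, "_", "_"]
    )


PART_F = "Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part"
PART_M = "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part"

# Toy sentences modelled on the paper's examples; not taken from the treebanks.
SAMPLE = "\n\n".join("\n".join(rows) for rows in [
    [row(1, "Maria", "Maria", "PROPN", "_", 3, "nsubj"),
     row(2, "è", "essere", "AUX", "_", 3, "aux"),
     row(3, "corsa", "correre", "VERB", PART_F, 0, "root"),
     row(4, "a", "a", "ADP", "_", 5, "case"),
     row(5, "casa", "casa", "NOUN", "_", 3, "obl")],
    [row(1, "Maria", "Maria", "PROPN", "_", 3, "nsubj"),
     row(2, "ha", "avere", "AUX", "_", 3, "aux"),
     row(3, "corso", "correre", "VERB", PART_M, 0, "root")],
    [row(1, "La", "il", "DET", "_", 2, "det"),
     row(2, "mela", "mela", "NOUN", "_", 4, "nsubj:pass"),
     row(3, "è", "essere", "AUX", "_", 4, "aux:pass"),
     row(4, "mangiata", "mangiare", "VERB", PART_F, 0, "root")],
]) + "\n"

if __name__ == "__main__":
    print(count_auxiliaries(SAMPLE, {"correre"}))
```

On a real treebank file, the same function would be called on the file contents with the list of FO verb lemmas as target_lemmas. The sketch does not filter out pronominal uses (which the study excludes) and does not disambiguate verb senses, so its counts would still need the manual checks described in Section 3.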