Effective Communication without Verbs? Sure! Identification of Nominal Utterances in Italian Social Media Texts

Gloria Comandini
Università di Trento
Trento, Italy
gloria.comandini@unitn.it

Manuela Speranza, Bernardo Magnini
Fondazione Bruno Kessler
Trento, Italy
{manspera,magnini}@fbk.eu

Abstract

English. Nominal utterances are very frequent, especially in social media texts, and play a crucial role as they are very dense from a semantic point of view. In spite of this, their automatic identification has received little to no attention. We have thus developed a framework for the annotation of nominal utterances and created the manually annotated corpus COSMIANU (Corpus Of Social Media Italian Annotated with Nominal Utterances), which could be used to train automatic systems.

Italiano. Gli enunciati nominali sono un fenomeno linguistico molto frequente, specialmente nello scritto dei social media, e di cruciale importanza, data la loro alta densità semantica. Tuttavia, ben poca attenzione è stata dedicata al loro riconoscimento automatico. In quest’ottica, questo lavoro illustra le guidelines per l’annotazione manuale degli enunciati nominali da noi sviluppate e presenta il corpus dell’italiano dei social media da noi annotato con gli enunciati nominali (COSMIANU), utilizzabile per addestrare sistemi automatici.

1 Introduction

Syntactic declarative constructions built around a non-verbal head (as in, for example, “What a nice movie!”) are very common linguistic phenomena in many Indo-European, Slavic and Semitic languages (such as Latin, Hebrew, Arabic, Russian, English, Spanish, and Italian), as well as in Finno-Ugric and Bantu languages (Benveniste, 1990; Simone, 2013). Not all of these nominal constructions can be unanimously considered sentences, although they can surely be considered utterances, defined as concrete units of actually produced text, devoid of any pre-determined syntactic or semantic form (Sabatini and Coletti, 1997; Adger, 2003; Graffi, 2012; Ferrari, 2014).

It has been clearly shown that nominal utterances (NUs) occur with relatively high frequency not only in spoken language (Cresti, 1998; Landolfi et al., 2010; Garcia-Marchena, 2016) but also in written texts. Literary and journalistic prose certainly offer some fine examples of NUs (Mortara Garavelli, 1971; Dardano and Trifone, 2001), but texts produced with computer mediated communication (CMC) or, more generally, within social media, are also a fertile ground for this phenomenon. In fact, NUs are extremely important from the semantic point of view as they allow speakers or writers to provide a lot of information using only a few words (high semantic density), often without any explicit hierarchical relationship (Sornicola, 1981; Ferrari, 2011a), which is a typical feature of CMC (Ferrari, 2011b).

Yet NUs pose significant challenges for both their automatic processing, because of the absence of a verbal head, and their identification, because they can have diverse syntactic structures, containing, for example, dependent clauses with finite verbs.

So far, little or no attention has been paid to the identification and processing of NUs in NLP areas such as information extraction/retrieval, sentiment analysis, and opinion mining. However, in order to address newly emerging challenges, these research fields could greatly benefit from tackling NUs specifically. This is the case, for instance, with aspect-based sentiment analysis, which aims to identify the main (e.g., the most frequently discussed) aspects (e.g., food, service) of given target entities (e.g., restaurants) and the sentiment expressed towards each aspect, instead of detecting the overall polarity of a text span (as sentiment analysis usually does).
Similarly, argumentation mining, which goes one step beyond opinion mining by extracting not only information about people’s attitudes and opinions, but also about the arguments they give in favor of and against their target entities (e.g., products, institutions, politicians, celebrities, etc.), could dramatically improve by focusing on NUs, which are often used, just like slogans, as the most emphatic part of the argumentation.

As a first step towards enabling automatic systems to process NUs, we have developed a complete framework for their annotation, and have created the Corpus Of Social Media Italian Annotated with Nominal Utterances (COSMIANU), which will be freely distributed with a Creative Commons (CC-BY) licence and can therefore be used to train automatic systems.

In this paper, we first summarize the main criteria adopted for the annotation of NUs (Section 3); in Section 4 we describe the annotated corpus; in Section 5 we present the results of some preliminary experiments on automatic identification of NUs, and finally, in Section 6, we draw some conclusions.

2 Related work

The first corpus-based study of NUs was part of the C-ORAL-ROM project, a multilingual (Italian, French, Portuguese and Spanish) corpus composed of 1,200,000 words of spontaneous speech, created in order to describe the prosodic and syntactic structures of Romance languages (Cresti et al., 2004).

Relatively similar is the study conducted on the AN.ANA.S Multilingual Treebank, consisting of 21,300 words of spontaneous speech and task-oriented dialogues in Italian, English and Spanish, manually annotated in order to identify verbless clauses (Landolfi et al., 2010).

In more recent work, Garcia-Marchena (2016) uses the Spanish open-source corpus CORLEC¹ to manually identify and classify over 7,000 verbless utterances in a detailed taxonomy.

While the above-mentioned studies all address verbless sentences and clauses, the phenomenon in which we are interested is wider and includes more complex syntactic structures, partly because we address nominal utterances, which are a wider set than verbless utterances (in our perspective, in fact, the main clause of an NU can govern dependent clauses with finite verbs). For this reason we devised a complete annotation framework. Moreover, to the best of our knowledge, our work is the first attempt towards a corpus-based study of NUs on written texts (Cresti et al. (2004), Landolfi et al. (2010), and Garcia-Marchena (2016) address spoken language).

¹ CORLEC, Corpus Oral de Referencia de la Lengua Española Contemporánea, available from: http://www.lllf.uam.es/ING/Corlec.html

3 Annotation Framework

In the following, we provide a brief summary of the annotation framework we devised for the manual annotation of NUs, which is based on the literature on NUs in Italian (Mortara Garavelli, 1971; Ferrari, 2011a; Ferrari, 2011b). For a thorough description (and plenty of annotated examples), see the document “Linee guida per l’annotazione degli enunciati nominali” (in Italian)².

² This document is available for consultation from http://tiny.cc/auhvvy

3.1 NU Identification

According to the annotation schema we propose, every utterance whose main clause is non-verbal, i.e. does not contain a finite verb (see (1)), is marked as a Nominal Utterance (NU); note, however, that a non-verbal main clause can contain non-finite verbs, such as infinitive and/or participial forms and gerunds (see (2), (3), and (4)).

(1) Felicissima per il suo ritorno!
    [Very happy about his return!]

(2) Ma impegnarsi di più?
    [Why not put more effort into it?]

(3) Spariti i negozi, l’edicola, il posteggio.
    [Shops, news stand, and car park, all gone.]

(4) Facendo due conti.
    [Doing the math.]

3.2 Coordination of main clauses

When the main clause of an utterance bears a coordination relation to another clause, the NU is annotated as follows:

• If both are non-verbal, the extent of the NU includes them both (see (5));

• If one is verbal and the other one is non-verbal, the extent of the NU includes only the non-verbal one (see (6)).

(5) Acqua a dirotto e tutti a casa!
    [Too much rain and everyone home!]

(6) I lavori prima, e poi si cena.
    [Chores first, and then we’ll eat dinner.]

Due to their peculiar syntactic structure, NUs with coordination are further marked with the attribute “verbal-coordinate” (coordination of verbal and non-verbal clauses) or “non-verbal-coordinate” (coordination of non-verbal clauses).

3.3 NUs with subordinate clauses

Non-verbal subordinate clauses are included in the extent of an NU, as in (7), whereas verbal subordinate clauses are not, as in (8) and (9).

(7) Che bello partire tutti quanti!
    [Great to leave all together!]

(8) Felice che ti sia piaciuta.
    [Glad you liked it.]

(9) Siccome piove, tutti a casa.
    [As it is raining, everyone home.]

NUs with verbal subordinate clauses are marked with a specific attribute, i.e., “verbal-subordinate”.

3.4 Ellipses

As explained above, NUs are utterances whose main clause is non-verbal, i.e. does not contain a finite verb. Unlike in other NUs, in ellipses it is always possible to infer the omitted verb (Mortara Garavelli, 1971; Ferrari, 2010), since the omitted verb is exactly the same as the one in the preceding utterance.

Ellipses are marked, using the specific attribute “ellipsis”, both when the preceding utterance is written by a different user, as in (10), and when it is written by the same user, as in (11).

(10) Cosa vorresti per cena? [What would you like for dinner?]
     Una pizza! [A pizza!]

(11) Cosa voglio??? [What do I want???]
     Del rispetto! [Some respect!]

4 Annotations in COSMIANU

COSMIANU contains texts taken from the Web2Corpus IT (Chiari and Canzonetti, 2014), a balanced Italian corpus of 1,050,000 words consisting of social media texts of five types, i.e., blogs, forums, newsgroups, chats, and social networks. In particular, we focused on semi-synchronous forms of CMC, i.e. blogs, forums, newsgroups, and social networks (Pistolesi, 2004), and randomly chose 24 files (six from each of the four selected categories), for a total of 54,039 words.

These texts consist of discussions between users across a large number of themes (from politics to popular singers). Thus in most cases, users interact with each other, creating a dialogic environment rich in verbal crossfires and quotes. This kind of interaction is a particularly fertile ground for ellipses and NUs in the form of greetings, which are usually very frequent in spoken language.

Automatic pre-processing of the corpus, for which we used the TextPro suite of NLP tools (Pianta et al., 2008), consisted of tokenization and sentence-splitting and resulted in 4,961 sentences and 66,011 tokens (see Table 1 for more detailed data).

                 #sentences    #words   #tokens
Blogs                 1,178    16,054    18,874
Forums                1,331    15,168    18,105
Newsgroups            1,395    15,045    19,109
Soc. networks         1,057     7,770     9,923
Total                 4,961    54,039    66,011

Table 1: Data about COSMIANU.

The manual annotation was then performed by an expert annotator using the Content Annotation Tool (CAT) (Bartalesi Lenzi et al., 2012). The annotation effort, for an expert annotator, consisted of two weeks of work.

In order to evaluate the inter-annotator agreement, a subpart of the corpus consisting of 5,193 tokens was annotated by a second annotator. Each annotator identified 127 NUs, 111 of which are common to both (evaluation based on exact match). The resulting Dice coefficient is 87.40% (2 × 111 / (127 + 127)).

Table 2 reports, for both the whole corpus and for each subcategory, the total number of NUs and the number of NUs marked with each specific attribute, i.e. “verbal-coordinate”, “non-verbal-coordinate”, “verbal-subordinate”, and “ellipsis” (NUs that are not marked with any attribute, such as (1), (2), (3), and (4), are referred to as “simple NUs”).³

³ Notice that a single NU can be marked with more than one attribute.

                    NUs   Verbal coord.   Non-verb. coord.   Verbal subord.   Ellipsis   Simple NUs
Blogs               261              30                 15               32         37          194
Forums              263              36                 13               23         34          190
Newsgroups          196              33                 21               17         35          122
Social networks     304              41                  9               19         31          231
Total             1,024             140                 58               91        137          737

Table 2: Distribution of NUs in the four social media categories.

                     Verbal coord.   Non-verb. coord.   Verbal subord.   Ellipsis
Verbal coord.                    -                  7               13         38
Non-verb. coord.                 7                  -               11         10
Verbal subord.                  13                 11                -         26
Ellipsis                        38                 10               26          -
no other attribute              82                 30               41         63
Total                          140                 58               91        137

Table 3: Attribute co-occurrence.

In the whole corpus we annotated 1,024 NUs, which means that 20.6% of the sentences contain an NU. This percentage is lower than those reported by Cresti et al. (2004) (38.1%) and Landolfi et al. (2010) (28%). This can be explained by the fact that the above-mentioned studies focus on spoken language, where interrupted strings, brachylogies and turn-taking cues are more frequent than in written language. Still, this percentage shows that the nominal style is well represented in written informal Italian, most likely due to its linguistic economy and to its high semantic density, which are particularly useful for expressing emphasis (see (12)).

(12) Dichiarazione da Mr. Hyde!
     [A statement worthy of Mr. Hyde!]

In addition, the large number of NUs marked as coordinate, either “verbal” (140 NUs) or “non-verbal” (58 NUs), shows that parataxis is constant throughout these texts. In fact, NUs appear to be extremely well suited to the parataxis typical of CMC; furthermore, they are often isolated, i.e., free from hierarchical syntactic bonds. This also explains why NUs can be composed of a series of denotative elements simply listed without any explicit hierarchical bond, as in (13), in a way that resembles a list of keywords.

(13) Buon senso, etica, vincere tanto per vincere.
     [Common sense, ethics, winning for winning’s sake.]

Looking at the distribution of NUs in the four subcategories, we see that social networks have the highest number of NUs (304), despite having a significantly lower number of tokens than blogs, forums and newsgroups. This probably depends on the high perceived communicative economy typical of social networks (Cosenza, 2014), which leads writers to produce short, almost telegraphic, texts.

In Table 3 we report the co-occurrence of NU attributes by pairs⁴ in order to show how diverse the syntactic structures of NUs can be. Particularly interesting is the presence of 38 NUs containing ellipses coordinated with a verbal clause; in fact, the ellipsis usually follows the verbal clause, whose verb is implied in a contrastive context. Additionally, ellipses can support a verbal subordinate clause (in our corpus we have 26 cases), which usually adds further information in favor of the contrastive utterance (see (14)).

⁴ Although we have cases where NUs have been marked with up to four attributes, we only focus on co-occurrence by attribute pairs.

(14) Non è un edificio specifico, ma una tipologia architettonica che caratterizza l’URSS.
     [It is not a specific building, but an architectural typology that characterizes the USSR.]

5 Automatic Identification of NUs

We used COSMIANU to train an open source SVM classifier, YamCha⁵, and performed some preliminary experiments on NU identification. As training data, we selected 44,170 tokens (i.e. about 2/3 of the corpus) while maintaining the same proportion of blogs, forums, newsgroups, and social networks as in the whole corpus. We used the remaining part of the corpus (21,841 tokens) as a test set. In these preliminary experiments we also included the NUs that appear in the text as metadata, which are annotated and marked with the specific tag “metadata” in COSMIANU, as shown in Example (15)⁶. The training set and the test set thus contain 1,775 and 1,058 NUs, respectively.

⁵ Yet Another Multipurpose CHunk Annotator. Website: http://chasen.org/~taku/software/yamcha/

⁶ Metadata usually refer to when and where a certain message has been written; although “metadata” NUs are very frequent in the corpus (more than 60% of the total), they are not particularly interesting from a linguistic point of view and we did not include them in the counts of Section 4.

(15) Data: 27/09/2010.
     [Date: 09/27/2010.]

We pre-processed the data using the TextPro suite (Pianta et al., 2008) and performed a number of experiments combining the following basic features: two-word window context (W2), three-word window context (W3), token (Tok), lemma (Lem), and Part-of-Speech (Pos).

Configuration       Prec.    Rec.     F1
Baseline            33.80   27.13  30.10
W2+Tok+Lem+Pos      79.80   67.96  73.40

Table 4: Results on NU identification.

Table 4 reports, in terms of Precision, Recall, and F1, the results we obtained with the baseline configuration (the system identifies only the NUs in the test set that also appear in the training set) and those we obtained with the best configuration, i.e. using all the features and a two-word window context. With the latter, the classifier identified 901 NUs, of which 719 are correct (exact match), thus reaching an F1 of 73.40% and outperforming the baseline by over 43 points.

6 Conclusion and Future Work

This work shows how common NUs are in written informal language, as well as how important they are in conveying semantically dense concepts in emphatic informative peaks, which could be useful for many NLP fields (e.g., argumentation mining and aspect-based sentiment analysis).

By creating COSMIANU, an Italian corpus annotated with NUs, and making it freely available to the research community, we have taken a first step towards the development of automatic tools for the identification and classification of NUs. In our preliminary experiments on NU identification (performed using an SVM classifier), with our best configuration, we obtained a performance of 73.40% in terms of F1 on all NUs (i.e. including metadata).

In the future, we intend to further expand COSMIANU, both in terms of its size and in terms of the annotations it includes, hoping that this will encourage more research on this extremely common, and yet almost neglected, linguistic phenomenon. We also plan to work on the analysis and automatic recognition of NUs, especially when they are used to convey hate speech, in the form of racist, sexist, homo/transphobic or classist slogans and insults.

Acknowledgments

We would like to thank Isabella Chiari for providing us with the Web2Corpus IT, from which we selected the raw texts to build COSMIANU. We also thank our colleagues Roberto Zanoli and Rachele Sprugnoli for their valuable advice and contributions in performing the experiments and defining the annotation guidelines.

References

David Adger. 2003. Core Syntax: A Minimalist Approach. Oxford University Press.

Valentina Bartalesi Lenzi, Giovanni Moretti, and Rachele Sprugnoli. 2012. CAT: the CELCT Annotation Tool. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12), pages 333–338, Istanbul, Turkey, May. European Language Resources Association (ELRA).

Émile Benveniste. 1990. Problemi di linguistica generale. Mondadori, Milano, Italia.

Isabella Chiari and Alessio Canzonetti. 2014. Le forme della comunicazione mediata dal computer: generi, tipi e standard di annotazione. In E. Garavelli and E. Suomela-Härmä, editors, Dal manoscritto al web: canali e modalità di trasmissione dell’italiano, pages 595–606. Franco Cesati Editore, Firenze, Italia.

Giovanna Cosenza. 2014. Introduzione alla semiotica dei nuovi media. Laterza, Bari, Italia.

Emanuela Cresti, Fernanda Bacelar do Nascimento, Antonio Moreno-Sandoval, Jean Véronis, Philippe Martin, and Khalid Choukri. 2004. The C-ORAL-ROM CORPUS. A Multilingual Resource of Spontaneous Speech for Romance Languages. In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, and Raquel Silva, editors, Proceedings of the 4th LREC Conference, pages 575–578, Paris, France. European Language Resources Association (ELRA).

Emanuela Cresti. 1998. Gli enunciati nominali. In M. T. Navarro, editor, Atti del IV convegno internazionale SILFI (Madrid 27-29 giugno 1996), pages 171–191, Pisa. Franco Cesati Editore.

Maurizio Dardano and Pietro Trifone. 2001. La nuova grammatica della lingua italiana. Zanichelli, Milano, Italia.

Angela Ferrari. 2010. Enunciati ellittici. Enciclopedia dell’Italiano. http://www.treccani.it/enciclopedia/enunciati-ellittici_(Enciclopedia-dell’Italiano)/.

Angela Ferrari. 2011a. Enunciati nominali. Enciclopedia dell’Italiano. http://www.treccani.it/enciclopedia/enunciati-nominali_(Enciclopedia_dell’Italiano)/.

Angela Ferrari. 2011b. Stile nominale. Enciclopedia dell’Italiano. http://www.treccani.it/enciclopedia/stile-nominale_(Enciclopedia-dell’Italiano)/.

Angela Ferrari. 2014. Linguistica del testo. Principi, fenomeni, strutture. Carocci, Roma, Italia.

Oscar Garcia-Marchena. 2016. Spanish Verbless Clauses and Fragments. A corpus analysis. In Antonio Moreno Ortiz and Chantal Pérez-Hernández, editors, CILC 2016. 8th International Conference on Corpus Linguistics, volume 1 of EPiC Series in Language and Linguistics, pages 130–143. EasyChair.

Giorgio Graffi. 2012. La frase: l’analisi logica. Carocci, Roma, Italia.

Annamaria Landolfi, Carmela Sammarco, and Miriam Voghera. 2010. Verbless clauses in Italian, Spanish and English: a Treebank annotation. In S. Bolasco, I. Chiari, and L. Giuliano, editors, Statistical Analysis of Textual Data. Proceedings of the 10th International Conference on Statistical Analysis of Textual Data (JADT 2010), pages 450–459. Roma, Italia, June 9-11.

Bice Mortara Garavelli. 1971. Fra norma e invenzione: lo stile nominale. In Accademia della Crusca, editor, Studi di grammatica italiana, volume 1, pages 271–315. G. C. Sansoni Editore, Firenze, Italia.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In Proceedings of LREC, 6th edition of the Language Resources and Evaluation Conference, Marrakech, Morocco, May 28-30.

Elena Pistolesi. 2004. Il parlar spedito. L’italiano di chat, e-mail e sms. Esedra, Padova, Italia.

Francesco Sabatini and Vittorio Coletti. 1997. Dizionario Italiano Sabatini-Coletti. Giunti, Firenze, Italia.

Raffaele Simone. 2013. Nuovi fondamenti di linguistica. McGraw-Hill, Milano, Italia.

Rosanna Sornicola. 1981. Sul parlato. Il Mulino, Bologna, Italia.
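The agreement and accuracy figures reported in Sections 4 and 5 can be re-derived from the raw counts given there. The following is a minimal sketch in Python, not the evaluation code used for the experiments; the function names `dice` and `precision_recall_f1` are illustrative assumptions and are not part of YamCha, TextPro, or CAT. It assumes exact-match counting, as stated in the paper.

```python
def dice(common: int, n_a: int, n_b: int) -> float:
    """Dice coefficient between two annotators' sets of NUs.

    common: NUs marked identically by both annotators (exact match)
    n_a, n_b: NUs marked by annotator A and annotator B, respectively
    """
    return 2 * common / (n_a + n_b)


def precision_recall_f1(correct: int, predicted: int, gold: int):
    """Precision, recall and F1 from exact-match counts.

    correct: predicted NUs that exactly match a gold NU
    predicted: NUs returned by the classifier
    gold: NUs in the manually annotated test set
    """
    precision = correct / predicted
    recall = correct / gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts from Section 4: 127 NUs per annotator, 111 in common.
    print(f"Dice: {100 * dice(111, 127, 127):.2f}")

    # Counts from Section 5: 901 NUs identified, 719 correct, 1,058 in the test set.
    p, r, f1 = precision_recall_f1(correct=719, predicted=901, gold=1058)
    print(f"P: {100 * p:.2f}  R: {100 * r:.2f}  F1: {100 * f1:.2f}")
```

With the counts above, the sketch reproduces the Dice coefficient of 87.40 and the precision, recall, and F1 values in the best-configuration row of Table 4.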