Effective Communication without Verbs? Sure!
Identification of Nominal Utterances in Italian Social Media Texts
Gloria Comandini
Università di Trento
Trento, Italy
gloria.comandini@unitn.it

Manuela Speranza, Bernardo Magnini
Fondazione Bruno Kessler
Trento, Italy
{manspera,magnini}@fbk.eu

Abstract

English. Nominal utterances are very frequent, especially in social media texts, and play a crucial role as they are very dense from a semantic point of view. In spite of this, their automatic identification has received little to no attention. We have thus developed a framework for the annotation of nominal utterances and created the manually annotated corpus COSMIANU (Corpus Of Social Media Italian Annotated with Nominal Utterances), which could be used to train automatic systems.

Italiano. Gli enunciati nominali sono un fenomeno linguistico molto frequente, specialmente nello scritto dei social media, e di cruciale importanza, data la loro alta densità semantica. Tuttavia, ben poca attenzione è stata dedicata al loro riconoscimento automatico. In quest’ottica, questo lavoro illustra le guidelines per l’annotazione manuale degli enunciati nominali da noi sviluppate e presenta il corpus dell’italiano dei social media da noi annotato con gli enunciati nominali (COSMIANU), utilizzabile per addestrare sistemi automatici.

1 Introduction

Syntactic declarative constructions built around a non-verbal head (as in, for example, “What a nice movie!”) are very common linguistic phenomena in many Indo-European, Slavic and Semitic languages (such as Latin, Hebrew, Arabic, Russian, English, Spanish, and Italian), as well as in Finno-Ugric and Bantu languages (Benveniste, 1990; Simone, 2013). Not all of these nominal constructions can be unanimously considered sentences, although they can surely be considered utterances, defined as concrete units of actually produced text, devoid of any pre-determined syntactic or semantic form (Sabatini and Coletti, 1997; Adger, 2003; Graffi, 2012; Ferrari, 2014).

It has been clearly shown that nominal utterances (NUs) occur with relatively high frequency not only in spoken language (Cresti, 1998; Landolfi et al., 2010; Garcia-Marchena, 2016) but also in written texts. Literary and journalistic prose certainly offer some fine examples of NUs (Mortara Garavelli, 1971; Dardano and Trifone, 2001), but texts produced with computer-mediated communication (CMC) or, more generally, within social media, are also a fertile ground for this phenomenon. In fact, NUs are extremely important from the semantic point of view as they allow speakers or writers to provide a lot of information using only a few words (high semantic density), often without any explicit hierarchical relationship (Sornicola, 1981; Ferrari, 2011a), which is a typical feature of CMC (Ferrari, 2011b).

Yet NUs pose significant challenges when it comes to both their automatic processing, because of the absence of a verbal head, and their identification, because they can have diverse syntactic structures, containing, for example, dependent clauses with finite verbs.

So far, little or no attention has been paid to the identification and processing of NUs in NLP areas such as information extraction/retrieval, sentiment analysis, and opinion mining. However, in order to address newly emerging challenges, these research fields could greatly benefit from tackling NUs specifically. This is the case, for instance, with aspect-based sentiment analysis, which aims to identify the main (e.g., the most frequently discussed) aspects (e.g., food, service) of given target entities (e.g., restaurants) and the sentiment expressed towards each aspect, instead of detecting the overall polarity of a text span (as sentiment analysis usually does). Similarly, argumentation mining, which takes one step forward with respect to opinion mining by extracting not only information about people’s attitudes and opinions, but also about the arguments they give in favor of and against their target entities (e.g., products, institutions, politicians, celebrities, etc.), could dramatically improve by focusing on NUs, which are often used, just like slogans, as the most emphatic part of the argumentation.

As a first step towards enabling automatic systems to process NUs, we have developed a complete framework for their annotation, and have created the Corpus Of Social Media Italian Annotated with Nominal Utterances (COSMIANU), which will be freely distributed with a Creative Commons (CC-BY) licence and can therefore be used to train automatic systems.

In this paper, we first summarize the main criteria adopted for the annotation of NUs (Section 3); in Section 4 we describe the annotated corpus; in Section 5 we present the results of some preliminary experiments on automatic identification of NUs, and finally, in Section 6, we draw some conclusions.

2 Related work

The first corpus-based study of NUs was part of the C-ORAL-ROM project, a multilingual (Italian, French, Portuguese and Spanish) corpus composed of 1,200,000 words of spontaneous speech, created in order to describe the prosodic and syntactic structures of Romance languages (Cresti et al., 2004).

Relatively similar is the study conducted on the AN.ANA.S Multilingual Treebank, consisting of 21,300 words of spontaneous speech and task-oriented dialogues in Italian, English and Spanish, manually annotated in order to identify verbless clauses (Landolfi et al., 2010).

In more recent work, Garcia-Marchena (2016) uses the Spanish open-source corpus CORLEC¹ to manually identify and classify over 7,000 verbless utterances in a detailed taxonomy.

¹ CORLEC, Corpus Oral de Referencia de la Lengua Española Contemporánea, available from: http://www.lllf.uam.es/ING/Corlec.html

While the above-mentioned studies all address verbless sentences and clauses, the phenomenon in which we are interested is wider and includes more complex syntactic structures, partly because we address nominal utterances, which form a wider set than verbless utterances (in our perspective, in fact, the main clause of an NU can govern dependent clauses with finite verbs). For this reason we devised a complete annotation framework. Moreover, to the best of our knowledge, our work is the first attempt towards a corpus-based study of NUs on written texts (Cresti (2004), Landolfi et al. (2010), and Garcia-Marchena (2016) address spoken language).

3 Annotation Framework

In the following, we provide a brief summary of the annotation framework we devised for the manual annotation of NUs, which is based on the literature on NUs in Italian (Mortara Garavelli, 1971; Ferrari, 2011a; Ferrari, 2011b). For a thorough description (and plenty of annotated examples), see the document “Linee guida per l’annotazione degli enunciati nominali” (in Italian)².

² This document is available for consultation from http://tiny.cc/auhvvy

3.1 NU Identification

According to the annotation schema we propose, every utterance whose main clause is non-verbal, i.e. it does not contain a finite verb (see (1)), is marked as a Nominal Utterance (NU); note, however, that a non-verbal main clause can contain non-finite verbs, such as infinitive and/or participial forms and gerunds (see (2), (3), and (4)).

(1) Felicissima per il suo ritorno!
    [Very happy about his return!]

(2) Ma impegnarsi di più?
    [Why not put more effort into it?]

(3) Spariti i negozi, l’edicola, il posteggio.
    [Shops, news stand, and car park, all gone.]

(4) Facendo due conti.
    [Doing the math.]

3.2 Coordination of main clauses

When the main clause of an utterance bears a coordination relation to another clause, the NU is annotated as follows:

• If both are non-verbal, the extent of the NU includes them both (see (5));
• If one is verbal and the other one is non-verbal, the extent of the NU includes only the non-verbal one (see (6)).

(5) Acqua a dirotto e tutti a casa!
    [Too much rain and everyone home!]

(6) I lavori prima, e poi si cena.
    [Chores first, and then we’ll eat dinner.]

Due to their peculiar syntactic structure, NUs with coordination are further marked with the attribute “verbal-coordinate” (coordination of verbal and non-verbal clauses) or “non-verbal-coordinate” (coordination of non-verbal clauses).

3.3 NUs with subordinate clauses

Non-verbal subordinate clauses are included in the extent of an NU, as in (7), whereas verbal subordinate clauses are not, as in (8) and (9).

(7) Che bello partire tutti quanti!
    [Great to leave all together!]

(8) Felice che ti sia piaciuta.
    [Glad you liked it.]

(9) Siccome piove, tutti a casa.
    [As it is raining, everyone home.]

NUs with verbal subordinate clauses are marked with a specific attribute, i.e., “verbal-subordinate”.

3.4 Ellipses

As explained above, NUs are utterances whose main clause is non-verbal, i.e. it does not contain a finite verb. Unlike in other NUs, in ellipses it is always possible to infer the omitted verb (Mortara Garavelli, 1971; Ferrari, 2010), since the omitted verb is exactly the same as the one in the preceding utterance.

Ellipses are marked, using the specific attribute “ellipsis”, both when the preceding utterance is written by a different user, as in (10), and when it is written by the same user, as in (11).

(10) Cosa vorresti per cena? [What would you like for dinner?]
     Una pizza! [A pizza!]

(11) Cosa voglio??? [What do I want???]
     Del rispetto! [Some respect!]

4 Annotations in COSMIANU

COSMIANU contains texts taken from the Web2Corpus IT (Chiari and Canzonetti, 2014), a balanced Italian corpus of 1,050,000 words consisting of social media texts of five types, i.e., blogs, forums, newsgroups, chats, and social networks. In particular, we focused on semi-synchronous forms of CMC, i.e. blogs, forums, newsgroups, and social networks (Pistolesi, 2004), and randomly chose 24 files (six from each of the four selected categories), for a total of 54,039 words.

These texts consist of discussions between users across a large number of themes (from politics to popular singers). Thus in most cases, users interact with each other, creating a dialogic environment rich in verbal crossfires and quotes. This kind of interaction is a particularly fertile ground for ellipses and NUs in the form of greetings, which are usually very frequent in spoken language.

Automatic pre-processing of the corpus, for which we used the TextPro suite of NLP tools (Pianta et al., 2008), consisted of tokenization and sentence-splitting and resulted in 4,961 sentences and 66,011 tokens (see Table 1 for more detailed data).

                 #sentences   #words   #tokens
Blogs                 1,178   16,054    18,874
Forums                1,331   15,168    18,105
Newsgroups            1,395   15,045    19,109
Soc. networks         1,057    7,770     9,923
Total                 4,961   54,039    66,011

Table 1: Data about COSMIANU.

The manual annotation was then performed by an expert annotator using the Content Annotation Tool (CAT) (Bartalesi Lenzi et al., 2012). The annotation effort, for an expert annotator, consisted of two weeks of work.

In order to evaluate the inter-annotator agreement, a subpart of the corpus consisting of 5,193 tokens was annotated by a second annotator. The resulting Dice coefficient is 87.40%. Each annotator identified 127 NUs, 111 of which are common to both (evaluation based on exact match).

Table 2 reports, for both the whole corpus and for each subcategory, the total number of NUs and the number of NUs marked with each specific attribute, i.e. “verbal-coordinate”, “non-verbal-coordinate”, “verbal-subordinate”, and “ellipsis” (NUs that are not marked with any attribute, such as (1), (2), (3), and (4), are referred to as “simple NUs”).³

³ Notice that a single NU can be marked with more than one attribute.

                    NUs   Verbal coord.   Non-verb. coord.   Verbal subord.   Ellipsis   Simple NUs
Blogs               261              30                 15               32         37          194
Forums              263              36                 13               23         34          190
Newsgroups          196              33                 21               17         35          122
Social networks     304              41                  9               19         31          231
Total             1,024             140                 58               91        137          737

Table 2: Distribution of NUs in the four social media categories.

                     Verbal coord.   Non-verb. coord.   Verbal subord.   Ellipsis
Verbal coord.                    -                  7               13         38
Non-verb. coord.                 7                  -               11         10
Verbal subord.                  13                 11                -         26
Ellipsis                        38                 10               26          -
no other attribute              82                 30               41         63
Total                          140                 58               91        137

Table 3: Attribute co-occurrence.

In the whole corpus we annotated 1,024 NUs, which means that 20.6% of the sentences contain an NU. This percentage is lower than those reported by Cresti (2004) (38.1%) and Landolfi et al. (2010) (28%). This can be explained by the fact that the above-mentioned studies focus on spoken language, where interrupted strings, brachylogies and turn-taking cues are more frequent than in written language. Still, this percentage shows that the nominal style is well represented in written informal Italian, most likely due to its linguistic economy and to its high semantic density, which are particularly useful for expressing emphasis (see (12)).

(12) Dichiarazione da Mr. Hyde!
     [A statement worthy of Mr. Hyde!]

In addition, the large number of NUs marked as coordinate, either “verbal” (140 NUs) or “non-verbal” (58 NUs), shows that parataxis is constant throughout these texts. In fact, NUs appear to be extremely suitable to the parataxis typical of CMC; furthermore, they are often isolated, i.e., free from hierarchical syntactic bonds. This also explains why NUs can be composed of a series of denotative elements simply listed without any explicit hierarchical bond, as in (13), in a way that reminds one of a list of keywords.

(13) Buon senso, etica, vincere tanto per vincere.
     [Common sense, ethics, winning for winning’s sake.]

Looking at the distribution of NUs in the four subcategories, we see that social networks have the highest number of NUs (304), despite having a significantly lower number of tokens than blogs, forums and newsgroups. This probably depends on the high perceived communicative economy typical of social networks (Cosenza, 2014), which leads writers to produce short, almost telegraphic, texts.

In Table 3 we report the co-occurrence of NU attributes by pairs⁴ in order to show how diverse the syntactic structures of NUs can be. Particularly interesting is the presence of 38 NUs containing ellipses coordinated with a verbal clause; in fact, the ellipsis usually follows the verbal clause, whose verb is implied in a contrastive context. Additionally, ellipses can support a verbal subordinate clause (in our corpus we have 26 cases), which usually adds further information in favor of the contrastive utterance (see (14)).

⁴ Although we have cases where NUs have been marked with up to four attributes, we only focus on co-occurrence by attribute pairs.
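
The agreement score and the NU rate reported in Section 4 follow directly from the counts given in the text. As a minimal sketch (the function and variable names are illustrative and not part of the original work; only the counts come from the paper), the Dice coefficient over exact-match NUs and the ratio of annotated NUs to sentences can be recomputed as follows:

```python
def dice(common: int, n_first: int, n_second: int) -> float:
    """Dice coefficient between two annotation sets, given the size of their overlap."""
    return 2 * common / (n_first + n_second)


# Counts reported for the doubly annotated sub-corpus:
# each annotator marked 127 NUs, 111 of which match exactly.
agreement = dice(common=111, n_first=127, n_second=127)

# Counts reported for the whole corpus: 1,024 NUs over 4,961 sentences.
nu_rate = 1024 / 4961

print(f"Dice: {agreement:.2%}")    # Dice: 87.40%
print(f"NU rate: {nu_rate:.1%}")   # NU rate: 20.6%
```

Note that the 20.6% figure is obtained by dividing the number of NUs by the number of sentences, so it reads as "sentences containing an NU" only under the assumption that no sentence holds more than one NU.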
(14) Non è un edificio specifico, ma una tipologia architettonica che caratterizza l’URSS.
     [It is not a specific building, but an architectural typology that characterizes the USSR.]

5 Automatic Identification of NUs

We used COSMIANU to train an open source SVM classifier, YamCha⁵, and performed some preliminary experiments on NU identification. As training data, we selected 44,170 tokens (i.e. about 2/3 of the corpus) while maintaining the same proportion of blogs, forums, newsgroups, and social networks as in the whole corpus. We used the remaining part of the corpus (21,841 tokens) as a test set. In these preliminary experiments we also included the NUs that appear in the text as metadata, which are annotated and marked with the specific tag “metadata” in COSMIANU, as shown in Example (15)⁶. The training set and the test set thus contain 1,775 and 1,058 NUs respectively.

⁵ Yet Another Multipurpose CHunk Annotator. Website: http://chasen.org/~taku/software/yamcha/

⁶ Metadata usually refer to when and where a certain message has been written; although “metadata” NUs are very frequent in the corpus (more than 60% of the total), they are not particularly interesting from a linguistic point of view and we did not include them in the counts of Section 4.

(15) Data: 27/09/2010.
     [Date: 09/27/2010.]

We pre-processed the data using the TextPro suite (Pianta et al., 2008) and performed a number of experiments combining the following basic features: two-word window context (W2), three-word window context (W3), token (Tok), lemma (Lem), and Part-of-Speech (Pos).

Configuration       Prec.    Rec.     F1
Baseline            33.80   27.13   30.10
W2+Tok+Lem+Pos      79.80   67.96   73.40

Table 4: Results on NU identification.

Table 4 reports, in terms of Precision, Recall, and F1, the results we obtained with the baseline configuration (the system identifies only the NUs in the test set that also appear in the training set) and those we obtained with the best configuration, i.e. using all the features and a two-word window context. With the latter, the classifier identified 901 NUs, of which 719 are correct (exact match), thus reaching an F1 of 73.40% and outperforming the baseline by over 43 points.

6 Conclusion and Future Work

This work shows how common NUs are in written informal language, as well as how important they are in conveying semantically dense concepts in emphatic informative peaks, which could be useful for many NLP fields (e.g., argumentation mining and aspect-based sentiment analysis).

By creating COSMIANU, an Italian corpus annotated with NUs, and making it freely available to the research community, we made a first step towards the development of automatic tools for the identification and classification of NUs. In our preliminary experiments on NU identification (performed using an SVM classifier), with our best configuration, we obtained a performance of 73.40% in terms of F1 on all NUs (i.e. including metadata).

In the future, we intend to further expand COSMIANU, both in terms of its size and in terms of the annotations it includes, hoping that this will encourage more research on this extremely common, and yet almost neglected, linguistic phenomenon. We also plan to work on the analysis and automatic recognition of NUs, especially when they are used to convey hate speech, in the form of racist, sexist, homo/transphobic or classist slogans and insults.

Acknowledgments

We would like to thank Isabella Chiari for providing us with the Web2Corpus IT, from which we selected the raw texts to build COSMIANU. We also thank our colleagues Roberto Zanoli and Rachele Sprugnoli for their valuable advice and contributions in performing the experiments and defining the annotation guidelines.

References

David Adger. 2003. Core Syntax: A Minimalist Approach. Oxford University Press.

Valentina Bartalesi Lenzi, Giovanni Moretti, and Rachele Sprugnoli. 2012. CAT: the CELCT Annotation Tool. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12), pages 333–338, Istanbul, Turkey, May. European Language Resources Association (ELRA).

Émile Benveniste. 1990. Problemi di linguistica generale. Mondadori, Milano, Italia.

Isabella Chiari and Alessio Canzonetti. 2014. Le forme della comunicazione mediata dal computer: generi, tipi e standard di annotazione. In E. Garavelli and E. Suomela-Härmä, editors, Dal manoscritto al web: canali e modalità di trasmissione dell’italiano, pages 595–606. Franco Cesati Editore, Firenze, Italia.

Giovanna Cosenza. 2014. Introduzione alla semiotica dei nuovi media. Laterza, Bari, Italia.

Emanuela Cresti, Fernanda Bacelar do Nascimento, Antonio Moreno-Sandoval, Jean Véronis, Philippe Martin, and Khalid Choukri. 2004. The C-ORAL-ROM CORPUS. A Multilingual Resource of Spontaneous Speech for Romance Languages. In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, and Raquel Silva, editors, Proceedings of the 4th LREC Conference, pages 575–578, Paris, France. European Language Resources Association (ELRA).

Emanuela Cresti. 1998. Gli enunciati nominali. In M. T. Navarro, editor, Atti del IV convegno internazionale SILFI (Madrid 27-29 giugno 1996), pages 171–191, Pisa. Franco Cesati Editore.

Maurizio Dardano and Pietro Trifone. 2001. La nuova grammatica della lingua italiana. Zanichelli, Milano, Italia.

Angela Ferrari. 2010. Enunciati ellittici. Enciclopedia dell’Italiano. http://www.treccani.it/enciclopedia/enunciati-ellittici_(Enciclopedia-dell’Italiano)/.

Angela Ferrari. 2011a. Enunciati nominali. Enciclopedia dell’Italiano. http://www.treccani.it/enciclopedia/enunciati-nominali_(Enciclopedia_dell’Italiano)/.

Angela Ferrari. 2011b. Stile nominale. Enciclopedia dell’Italiano. http://www.treccani.it/enciclopedia/stile-nominale_(Enciclopedia-dell’Italiano)/.

Angela Ferrari. 2014. Linguistica del testo. Principi, fenomeni, strutture. Carocci, Roma, Italia.

Oscar Garcia-Marchena. 2016. Spanish Verbless Clauses and Fragments. A corpus analysis. In Antonio Moreno Ortiz and Chantal Pérez-Hernández, editors, CILC 2016. 8th International Conference on Corpus Linguistics, volume 1 of EPiC Series in Language and Linguistics, pages 130–143. EasyChair.

Giorgio Graffi. 2012. La frase: l’analisi logica. Carocci, Roma, Italia.

Annamaria Landolfi, Carmela Sammarco, and Miriam Voghera. 2010. Verbless clauses in Italian, Spanish and English: a Treebank annotation. In S. Bolasco, I. Chiari, and L. Giuliano, editors, Statistical Analysis of Textual Data. Proceedings of the 10th International Conference on Statistical Analysis of Textual Data (JADT 2010), pages 450–459. Roma, Italia, June 9-11.

Bice Mortara Garavelli. 1971. Fra norma e invenzione: lo stile nominale. In Accademia della Crusca, editor, Studi di grammatica italiana, volume 1, pages 271–315. G. C. Sansoni Editore, Firenze, Italia.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In Proceedings of LREC, 6th edition of the Language Resources and Evaluation Conference, Marrakech, Morocco, May 28-30.

Elena Pistolesi. 2004. Il parlar spedito. L’italiano di chat, e-mail e sms. Esedra, Padova, Italia.

Francesco Sabatini and Vittorio Coletti. 1997. Dizionario Italiano Sabatini-Coletti. Giunti, Firenze, Italia.

Raffaele Simone. 2013. Nuovi fondamenti di linguistica. McGraw-Hill, Milano, Italia.

Rosanna Sornicola. 1981. Sul parlato. Il Mulino, Bologna, Italia.
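
As a supplementary note to the evaluation in Section 5, the scores of the best configuration in Table 4 can be re-derived from the counts given in the text: 901 NUs identified by the classifier, 719 of them correct under exact match, and 1,058 NUs in the test set. The sketch below is only an illustration of that arithmetic; the function and variable names are assumptions of ours and are not taken from the original experiments or from YamCha.

```python
def precision_recall_f1(predicted: int, correct: int, gold: int):
    """Exact-match precision, recall and F1 from raw counts."""
    precision = correct / predicted   # share of predicted NUs that are correct
    recall = correct / gold           # share of gold NUs that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported in Section 5 for the best configuration (W2+Tok+Lem+Pos).
p, r, f1 = precision_recall_f1(predicted=901, correct=719, gold=1058)

print(f"P={p:.2%} R={r:.2%} F1={f1:.2%}")   # P=79.80% R=67.96% F1=73.40%

# Gap to the baseline F1 of 30.10 reported in Table 4, in percentage points.
print(f"Gain over baseline: {f1 * 100 - 30.10:.2f} points")
```

The same function applied to the baseline would need its raw counts, which the paper does not report, so only the best configuration is recomputed here.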