Effective Communication without Verbs? Sure!
      Identification of Nominal Utterances in Italian Social Media Texts

               Gloria Comandini      Manuela Speranza, Bernardo Magnini
               Università di Trento      Fondazione Bruno Kessler
                  Trento, Italy                 Trento, Italy
         gloria.comandini@unitn.it    {manspera,magnini}@fbk.eu


                                   Abstract

    English. Nominal utterances are very frequent, especially in social media texts, and
    play a crucial role as they are very dense from a semantic point of view. In spite of
    this, their automatic identification has received little to no attention. We have thus
    developed a framework for the annotation of nominal utterances and created the manually
    annotated corpus COSMIANU (Corpus Of Social Media Italian Annotated with Nominal
    Utterances), which could be used to train automatic systems.

    Italiano. Gli enunciati nominali sono un fenomeno linguistico molto frequente,
    specialmente nello scritto dei social media, e di cruciale importanza, data la loro alta
    densità semantica. Tuttavia, ben poca attenzione è stata dedicata al loro riconoscimento
    automatico. In quest’ottica, questo lavoro illustra le guidelines per l’annotazione
    manuale degli enunciati nominali da noi sviluppate e presenta il corpus dell’italiano dei
    social media da noi annotato con gli enunciati nominali (COSMIANU), utilizzabile per
    addestrare sistemi automatici.

1   Introduction

Syntactic declarative constructions built around a non-verbal head (as in, for example,
“What a nice movie!”) are very common linguistic phenomena in many Indo-European and
Semitic languages (such as Latin, Hebrew, Arabic, Russian, English, Spanish, and Italian),
as well as in Finno-Ugric and Bantu languages (Benveniste, 1990; Simone, 2013). Not all of
these nominal constructions can be unanimously considered sentences, although they can
surely be considered utterances, defined as concrete units of actually produced text, devoid
of any pre-determined syntactic or semantic form (Sabatini and Coletti, 1997; Adger, 2003;
Graffi, 2012; Ferrari, 2014).

   It has been clearly shown that nominal utterances (NUs) occur with relatively high
frequency not only in spoken language (Cresti, 1998; Landolfi et al., 2010; Garcia-Marchena,
2016) but also in written texts. Literary and journalistic prose certainly offer some fine
examples of NUs (Mortara Garavelli, 1971; Dardano and Trifone, 2001), but texts produced
through computer-mediated communication (CMC) or, more generally, within social media are
also fertile ground for this phenomenon. In fact, NUs are extremely important from the
semantic point of view, as they allow speakers or writers to provide a lot of information
using only a few words (high semantic density), often without any explicit hierarchical
relationship (Sornicola, 1981; Ferrari, 2011a), which is a typical feature of CMC (Ferrari,
2011b).

   Yet NUs pose significant challenges for both automatic processing, because they lack a
verbal head, and identification, because they can have diverse syntactic structures
containing, for example, dependent clauses with finite verbs.

   So far, little or no attention has been paid to the identification and processing of NUs
in NLP areas such as information extraction/retrieval, sentiment analysis, and opinion
mining. However, in order to address newly emerging challenges, these research fields could
greatly benefit from tackling NUs specifically. This is the case, for instance, with
aspect-based sentiment analysis, which aims to identify the main (e.g., the most frequently
discussed) aspects (e.g., food, service) of given target entities (e.g., restaurants) and the
sentiment expressed towards each aspect, instead of detecting the overall polarity of a text
span (as sentiment analysis usually does). Similarly, argumentation mining, which takes one
step forward with respect to opinion mining by extracting not only information about
people’s attitudes and opinions, but also about the arguments they give in favor of and
against their target entities (e.g., products, institutions, politicians, celebrities), could
dramatically improve by focusing on NUs, which are often used, just like slogans, as the most
emphatic part of the argumentation.

   As a first step towards enabling automatic systems to process NUs, we have developed a
complete framework for their annotation, and have created the Corpus Of Social Media Italian
Annotated with Nominal Utterances (COSMIANU), which will be freely distributed with a
Creative Commons (CC-BY) licence and can therefore be used to train automatic systems.

   In this paper, we first review related work (Section 2) and summarize the main criteria
adopted for the annotation of NUs (Section 3); in Section 4 we describe the annotated corpus;
in Section 5 we present the results of some preliminary experiments on automatic
identification of NUs, and finally, in Section 6, we draw some conclusions.

2   Related work

The first corpus-based study of NUs was part of the C-ORAL-ROM project, a multilingual
(Italian, French, Portuguese and Spanish) corpus composed of 1,200,000 words of spontaneous
speech, created in order to describe the prosodic and syntactic structures of Romance
languages (Cresti et al., 2004).

   Relatively similar is the study conducted on the AN.ANA.S Multilingual Treebank,
consisting of 21,300 words of spontaneous speech and task-oriented dialogues in Italian,
English and Spanish, manually annotated in order to identify verbless clauses (Landolfi et
al., 2010).

   In more recent work, Garcia-Marchena (2016) uses the Spanish open-source corpus CORLEC¹
to manually identify and classify over 7,000 verbless utterances in a detailed taxonomy.

   ¹ CORLEC, Corpus Oral de Referencia de la Lengua Española Contemporánea, available from:
     http://www.lllf.uam.es/ING/Corlec.html

   While the above-mentioned studies all address verbless sentences and clauses, the
phenomenon in which we are interested is wider and includes more complex syntactic
structures, partly because we address nominal utterances, which form a wider set than
verbless utterances (in our perspective, in fact, the main clause of an NU can govern
dependent clauses with finite verbs). For this reason we devised a complete annotation
framework. Moreover, to the best of our knowledge, our work is the first attempt towards a
corpus-based study of NUs in written texts (Cresti et al. (2004), Landolfi et al. (2010), and
Garcia-Marchena (2016) address spoken language).

3   Annotation Framework

In the following, we provide a brief summary of the annotation framework we devised for the
manual annotation of NUs, which is based on the literature on NUs in Italian (Mortara
Garavelli, 1971; Ferrari, 2011a; Ferrari, 2011b). For a thorough description (and plenty of
annotated examples), see the document “Linee guida per l’annotazione degli enunciati
nominali” (in Italian)².

   ² This document is available for consultation from http://tiny.cc/auhvvy

3.1   NU Identification

According to the annotation schema we propose, every utterance whose main clause is
non-verbal, i.e. it does not contain a finite verb (see (1)), is marked as a Nominal
Utterance (NU); note, however, that a non-verbal main clause can contain non-finite verbs,
such as infinitive and/or participial forms and gerunds (see (2), (3), and (4)).

 (1) <NU>Felicissima per il suo ritorno!</NU>
     [Very happy about his return!]

 (2) <NU>Ma impegnarsi di più?</NU>
     [Why not put more effort into it?]

 (3) <NU>Spariti i negozi, l’edicola, il posteggio.</NU>
     [Shops, news stand, and car park, all gone.]

 (4) <NU>Facendo due conti.</NU>
     [Doing the math.]

3.2   Coordination of main clauses

When the main clause of an utterance bears a coordination relation to another clause, the NU
is annotated as follows:

  • If both are non-verbal, the extent of the NU includes them both (see (5));
  • If one is verbal and the other one is non-verbal, the extent of the NU includes only the
    non-verbal one (see (6)).

 (5) <NU>Acqua a dirotto e tutti a casa!</NU>
     [Too much rain and everyone home!]

 (6) <NU>I lavori prima,</NU> e poi si cena.
     [Chores first, and then we’ll eat dinner.]

   Due to their peculiar syntactic structure, NUs with coordination are further marked with
the attribute “verbal-coordinate” (coordination of verbal and non-verbal clauses) or
“non-verbal-coordinate” (coordination of non-verbal clauses).

3.3   NUs with subordinate clauses

Non-verbal subordinate clauses are included in the extent of an NU, as in (7), whereas verbal
subordinate clauses are not, as in (8) and (9).

 (7) <NU>Che bello partire tutti quanti!</NU>
     [Great to leave all together!]

 (8) <NU>Felice</NU> che ti sia piaciuta.
     [Glad you liked it.]

 (9) Siccome piove, <NU>tutti a casa.</NU>
     [As it is raining, everyone home.]

   NUs with verbal subordinate clauses are marked with a specific attribute, i.e.,
“verbal-subordinate”.

3.4   Ellipses

As explained above, NUs are utterances whose main clause is non-verbal, i.e. it does not
contain a finite verb. Unlike in other NUs, in ellipses it is always possible to infer the
omitted verb (Mortara Garavelli, 1971; Ferrari, 2010), since the omitted verb is exactly the
same as the one in the preceding utterance.

   Ellipses are marked, using the specific attribute “ellipsis”, both when the preceding
utterance is written by a different user, as in (10), and when it is written by the same
user, as in (11).

(10) Cosa vorresti per cena? [What would you like for dinner?]
     <NU>Una pizza!</NU> [A pizza!]

(11) Cosa voglio??? [What do I want???]
     <NU>Del rispetto!</NU> [Some respect!]

4   Annotations in COSMIANU

COSMIANU contains texts taken from the Web2Corpus IT (Chiari and Canzonetti, 2014), a
balanced Italian corpus of 1,050,000 words consisting of social media texts of five types,
i.e., blogs, forums, newsgroups, chats, and social networks. In particular, we focused on
semi-synchronous forms of CMC, i.e. blogs, forums, newsgroups, and social networks
(Pistolesi, 2004), and randomly chose 24 files (six from each of the four selected
categories), for a total of 54,039 words.

   These texts consist of discussions between users across a large number of themes (from
politics to popular singers). Thus in most cases, users interact with each other, creating a
dialogic environment rich in verbal crossfires and quotes. This kind of interaction is a
particularly fertile ground for ellipses and NUs in the form of greetings, which are usually
very frequent in spoken language.

   Automatic pre-processing of the corpus, for which we used the TextPro suite of NLP tools
(Pianta et al., 2008), consisted of tokenization and sentence-splitting and resulted in 4,961
sentences and 66,011 tokens (see Table 1 for more detailed data).

                         #sentences     #words     #tokens
 Blogs                        1,178     16,054      18,874
 Forums                       1,331     15,168      18,105
 Newsgroups                   1,395     15,045      19,109
 Social networks              1,057      7,770       9,923

 Total                        4,961     54,039      66,011

                 Table 1: Data about COSMIANU.

   The manual annotation was then performed by an expert annotator using the Content
Annotation Tool (CAT) (Bartalesi Lenzi et al., 2012). The annotation effort, for an expert
annotator, consisted of two weeks of work.

   In order to evaluate the inter-annotator agreement, a subpart of the corpus consisting of
5,193 tokens was annotated by a second annotator. Each annotator identified 127 NUs, 111 of
which are common to both (evaluation based on exact match). The resulting Dice coefficient
is 87.40%, i.e. 2 × 111 / (127 + 127).

   Table 2 reports, for both the whole corpus and for each subcategory, the total number of
NUs and the number of NUs marked with each specific attribute, i.e. “verbal-coordinate”,
“non-verbal-
                       NUs     Verbal coord.      Non-verb. coord.      Verbal subord.       Ellipsis    Simple NUs
 Blogs                  261              30                    15                  32             37            194
 Forums                 263              36                    13                  23             34            190
 Newsgroups             196              33                    21                  17             35            122
 Social networks        304              41                      9                 19             31            231
 Total                1,024             140                    58                  91            137            737

                      Table 2: Distribution of NUs in the four social media categories.
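
   The raw counts in Tables 1 and 2 can be combined to normalize NU frequency by the size of
each category. The following is a minimal Python sketch of that arithmetic; the names COUNTS,
total, nus_per_1000_tokens and nus_per_100_sentences are illustrative assumptions and are not
part of COSMIANU or of any tool used in this work.

```python
# Per-category counts copied from Table 1 (sentences, tokens) and Table 2 (NUs).
COUNTS = {
    "Blogs":           {"sentences": 1178, "tokens": 18874, "nus": 261},
    "Forums":          {"sentences": 1331, "tokens": 18105, "nus": 263},
    "Newsgroups":      {"sentences": 1395, "tokens": 19109, "nus": 196},
    "Social networks": {"sentences": 1057, "tokens": 9923,  "nus": 304},
}


def total(field):
    """Sum one field (e.g. "nus") over the four categories."""
    return sum(row[field] for row in COUNTS.values())


def nus_per_1000_tokens(nus, tokens):
    """Normalize an NU count by corpus size (rate per 1,000 tokens)."""
    return 1000 * nus / tokens


def nus_per_100_sentences(nus, sentences):
    """NU count relative to the number of sentences, as a percentage."""
    return 100 * nus / sentences


if __name__ == "__main__":
    for name, row in COUNTS.items():
        rate = nus_per_1000_tokens(row["nus"], row["tokens"])
        print(f"{name:<16} {rate:5.1f} NUs per 1,000 tokens")
    overall = nus_per_100_sentences(total("nus"), total("sentences"))
    print(f"Whole corpus: {overall:.1f} NUs per 100 sentences")
```

   On these figures, social networks show more than twice the NU rate per 1,000 tokens of any
other category, which makes the contrast in raw counts easier to compare across categories of
different sizes.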

                                   Verbal coord.     Non-verb. coord.      Verbal subord.       Ellipsis
            Verbal coord.                      -                    7                 13             38
            Non-verb. coord.                   7                    -                 11             10
            Verbal subord.                   13                   11                    -            26
            Ellipsis                         38                   10                  26               -
            no other attribute               82                   30                  41             63
            Total                           140                   58                  91            137

                                       Table 3: Attribute co-occurrence.
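
   The figures printed in Table 3 can also be read as proportions. The sketch below, again in
Python and with purely illustrative names (ATTRIBUTES, PAIRS, ALONE, TOTAL and the three
helper functions are assumptions made for this example), checks that the printed matrix is
symmetric, that each column adds up to its "Total" row, and computes for each attribute the
share of NUs that carry at least one other attribute.

```python
ATTRIBUTES = ["verbal-coordinate", "non-verbal-coordinate",
              "verbal-subordinate", "ellipsis"]

# Pairwise co-occurrence counts from Table 3 (None marks the empty diagonal).
PAIRS = [
    [None, 7, 13, 38],
    [7, None, 11, 10],
    [13, 11, None, 26],
    [38, 10, 26, None],
]
# "no other attribute" and "Total" rows from Table 3, in the same column order.
ALONE = [82, 30, 41, 63]
TOTAL = [140, 58, 91, 137]


def is_symmetric(matrix):
    """True if entry (i, j) equals entry (j, i) for every pair of attributes."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def column_sum(j):
    """Sum of the printed entries in column j, including 'no other attribute'."""
    return sum(row[j] for row in PAIRS if row[j] is not None) + ALONE[j]


def combined_share(j):
    """Fraction of NUs with attribute j that carry at least one other attribute."""
    return 1 - ALONE[j] / TOTAL[j]


if __name__ == "__main__":
    for j, name in enumerate(ATTRIBUTES):
        print(f"{name:<22} {100 * combined_share(j):4.1f}% combined with another attribute")
```

   Read this way, about 54% of the NUs marked as ellipses carry at least one further
attribute, against about 41% of those marked as verbal-coordinate.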

coordinate”, “verbal-subordinate”, and “ellipsis” (NUs that are not marked with any
attribute, such as (1), (2), (3), and (4), are referred to as “simple NUs”).³

   ³ Notice that a single NU can be marked with more than one attribute.

   In the whole corpus we annotated 1,024 NUs, which means that 20.6% of the sentences
contain an NU. This percentage is lower than those reported by Cresti et al. (2004) (38.1%)
and Landolfi et al. (2010) (28%). This can be explained by the fact that the above-mentioned
studies focus on spoken language, where interrupted strings, brachylogies and turn-taking
cues are more frequent than in written language. Still, this percentage shows that the
nominal style is well represented in written informal Italian, most likely due to its
linguistic economy and to its high semantic density, which are particularly useful for
expressing emphasis (see (12)).

(12) <NU>Dichiarazione da Mr. Hyde!</NU>
     [A statement worthy of Mr. Hyde!]

   In addition, the large number of NUs marked as coordinate, either “verbal” (140 NUs) or
“non-verbal” (58 NUs), shows that parataxis is constant throughout these texts. In fact, NUs
appear to be extremely suitable to the parataxis typical of CMC; furthermore, they are often
isolated, i.e., free from hierarchical syntactic bonds. This also explains why NUs can be
composed of a series of denotative elements simply listed without any explicit hierarchical
bond, as in (13), in a way that reminds one of a list of keywords.

(13) <NU>Buon senso, etica, vincere tanto per vincere.</NU>
     [Common sense, ethics, winning for winning’s sake.]

   Looking at the distribution of NUs in the four subcategories, we see that social networks
have the highest number of NUs (304), despite having a significantly lower number of tokens
than blogs, forums and newsgroups. This probably depends on the high perceived communicative
economy typical of social networks (Cosenza, 2014), which leads writers to produce short,
almost telegraphic, texts.

   In Table 3 we report the co-occurrence of NU attributes by pairs⁴ in order to show how
diverse the syntactic structures of NUs can be. Particularly interesting is the presence of
38 NUs containing ellipses coordinated with a verbal clause; in fact, the ellipsis usually
follows the verbal clause, whose verb is implied in a contrastive context. Additionally,
ellipses can support a verbal subordinate clause (in our corpus we have 26 cases), which
usually adds further information in favor of the contrastive utterance (see (14)).

   ⁴ Although we have cases where NUs have been marked with up to four attributes, we only
     focus on co-occurrence by attribute pairs.
(14) Non è un edificio specifico, <NU> ma una tipologia architettonica </NU> che
     caratterizza l’URSS.
     [It is not a specific building, but an architectural typology that characterizes the
     USSR.]

5   Automatic Identification of NUs

We used COSMIANU to train an open source SVM classifier, YamCha⁵, and performed some
preliminary experiments on NU identification. As training data, we selected 44,170 tokens
(i.e. about 2/3 of the corpus) while maintaining the same proportion of blogs, forums,
newsgroups, and social networks as in the whole corpus. We used the remaining part of the
corpus (21,841 tokens) as a test set. In these preliminary experiments we also included the
NUs that appear in the text as metadata, which are annotated and marked with the specific tag
“metadata” in COSMIANU, as shown in Example (15)⁶. The training set and the test set thus
contain 1,775 and 1,058 NUs respectively.

(15) <NU> Data: 27/09/2010. </NU>
     [Date: 09/27/2010.]

   We pre-processed the data using the TextPro suite (Pianta et al., 2008) and performed a
number of experiments combining the following basic features: two-word window context (W2),
three-word window context (W3), token (Tok), lemma (Lem), and Part-of-Speech (Pos).

    Configuration              Prec.       Rec.        F1
    Baseline                   33.80      27.13     30.10
    W2+Tok+Lem+Pos             79.80      67.96     73.40

         Table 4: Results on NU identification.

   Table 4 reports, in terms of Precision, Recall, and F1, the results we obtained with the
baseline configuration (the system identifies only the NUs in the test set that also appear
in the training set) and those we obtained with the best configuration, i.e. using all the
features and a two-word window context. With the latter, the classifier identified 901 NUs,
of which 719 are correct (exact match). This corresponds to a precision of 719/901 = 79.80%
and a recall of 719/1,058 = 67.96%, thus reaching an F1 of 73.40% and outperforming the
baseline by over 43 points.

   ⁵ Yet Another Multipurpose CHunk Annotator. Website:
     http://chasen.org/~taku/software/yamcha/
   ⁶ Metadata usually refer to when and where a certain message has been written; although
     “metadata” NUs are very frequent in the corpus (more than 60% of the total), they are
     not particularly interesting from a linguistic point of view and we did not include them
     in the counts of Section 4.

6   Conclusion and Future Work

This work shows how common NUs are in written informal language, as well as how important
they are in conveying semantically dense concepts in emphatic informative peaks, which could
be useful for many NLP fields (e.g., argumentation mining and aspect-based sentiment
analysis).

   By creating COSMIANU, an Italian corpus annotated with NUs, and making it freely
available to the research community, we made a first step towards the development of
automatic tools for the identification and classification of NUs. In our preliminary
experiments on NU identification (performed using an SVM classifier), with our best
configuration, we obtained a performance of 73.40% in terms of F1 on all NUs (i.e. including
metadata).

   In the future, we intend to further expand COSMIANU, both in terms of its size and in
terms of the annotations it includes, hoping that this will encourage more research on this
extremely common, and yet almost neglected, linguistic phenomenon. We also plan to work on
the analysis and automatic recognition of NUs, especially when they are used to convey hate
speech, in the form of racist, sexist, homo/transphobic or classist slogans and insults.

Acknowledgments

We would like to thank Isabella Chiari for providing us with the Web2Corpus IT, from which
we selected the raw texts to build COSMIANU. We also thank our colleagues Roberto Zanoli and
Rachele Sprugnoli for their valuable advice and contributions in performing the experiments
and defining the annotation guidelines.

References

David Adger. 2003. Core Syntax: A Minimalist Approach. Oxford University Press.

Valentina Bartalesi Lenzi, Giovanni Moretti, and Rachele Sprugnoli. 2012. CAT: the CELCT
  Annotation Tool. In Proceedings of the 8th International Conference on Language Resources
  and Evaluation (LREC’12), pages 333–338, Istanbul, Turkey, May. European Language
  Resources Association (ELRA).

Émile Benveniste. 1990. Problemi di linguistica generale. Mondadori, Milano, Italia.

Isabella Chiari and Alessio Canzonetti. 2014. Le forme della comunicazione mediata dal
  computer: generi, tipi e standard di annotazione. In E. Garavelli and E. Suomela-Härmä,
  editors, Dal manoscritto al web: canali e modalità di trasmissione dell’italiano, pages
  595–606. Franco Cesati Editore, Firenze, Italia.

Giovanna Cosenza. 2014. Introduzione alla semiotica dei nuovi media. Laterza, Bari, Italia.

Emanuela Cresti, Fernanda Bacelar do Nascimento, Antonio Moreno-Sandoval, Jean Véronis,
  Philippe Martin, and Khalid Choukri. 2004. The C-ORAL-ROM CORPUS. A Multilingual Resource
  of Spontaneous Speech for Romance Languages. In Maria Teresa Lino, Maria Francisca Xavier,
  Fátima Ferreira, Rute Costa, and Raquel Silva, editors, Proceedings of the 4th LREC
  Conference, pages 575–578, Paris, France. European Language Resources Association (ELRA).

Emanuela Cresti. 1998. Gli enunciati nominali. In M. T. Navarro, editor, Atti del IV
  convegno internazionale SILFI (Madrid 27-29 giugno 1996), pages 171–191, Pisa. Franco
  Cesati Editore.

Maurizio Dardano and Pietro Trifone. 2001. La nuova grammatica della lingua italiana.
  Zanichelli, Milano, Italia.

Angela Ferrari. 2010. Enunciati ellittici. Enciclopedia dell’Italiano.
  http://www.treccani.it/enciclopedia/enunciati-ellittici_(Enciclopedia-dell’Italiano)/.

Angela Ferrari. 2011a. Enunciati nominali. Enciclopedia dell’Italiano.
  http://www.treccani.it/enciclopedia/enunciati-nominali_(Enciclopedia_dell’Italiano)/.

Angela Ferrari. 2011b. Stile nominale. Enciclopedia dell’Italiano.
  http://www.treccani.it/enciclopedia/stile-nominale_(Enciclopedia-dell’Italiano)/.

Angela Ferrari. 2014. Linguistica del testo. Principi, fenomeni, strutture. Carocci, Roma,
  Italia.

Oscar Garcia-Marchena. 2016. Spanish Verbless Clauses and Fragments. A corpus analysis. In
  Antonio Moreno Ortiz and Chantal Pérez-Hernández, editors, CILC 2016. 8th International
  Conference on Corpus Linguistics, volume 1 of EPiC Series in Language and Linguistics,
  pages 130–143. EasyChair.

Giorgio Graffi. 2012. La frase: l’analisi logica. Carocci, Roma, Italia.

Annamaria Landolfi, Carmela Sammarco, and Miriam Voghera. 2010. Verbless clauses in Italian,
  Spanish and English: a Treebank annotation. In S. Bolasco, I. Chiari, and L. Giuliano,
  editors, Statistical Analysis of Textual Data. Proceedings of the 10th International
  Conference on Statistical Analysis of Textual Data (JADT 2010), pages 450–459. Roma,
  Italia, June 9-11.

Bice Mortara Garavelli. 1971. Fra norma e invenzione: lo stile nominale. In Accademia della
  Crusca, editor, Studi di grammatica italiana, volume 1, pages 271–315. G. C. Sansoni
  Editore, Firenze, Italia.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In
  Proceedings of LREC, 6th edition of the Language Resources and Evaluation Conference,
  Marrakech, Morocco, May 28-30.

Elena Pistolesi. 2004. Il parlar spedito. L’italiano di chat, e-mail e sms. Esedra, Padova,
  Italia.

Francesco Sabatini and Vittorio Coletti. 1997. Dizionario Italiano Sabatini-Coletti. Giunti,
  Firenze, Italia.

Raffaele Simone. 2013. Nuovi fondamenti di linguistica. McGraw-Hill, Milano, Italia.

Rosanna Sornicola. 1981. Sul parlato. Il Mulino, Bologna, Italia.