LICO: A Lexicon of Italian Connectives

Anna Feltracco
Fondazione Bruno Kessler
University of Pavia, Italy
University of Bergamo, Italy
feltracco@fbk.eu

Elisabetta Jezek
University of Pavia
Pavia, Italy
jezek@unipv.it

Bernardo Magnini
Fondazione Bruno Kessler
Povo-Trento, Italy
magnini@fbk.eu

Manfred Stede
University of Potsdam
Potsdam, Germany
stede@uni-potsdam.de

Abstract

English. This paper presents the first release of LICO, a Lexicon for Italian COnnectives. LICO includes about 170 discourse connectives used in Italian, together with their orthographical variants, part(s) of speech, semantic relation(s) (according to the Penn Discourse Treebank relation catalogue), and a number of usage examples.

Italiano. Questo contributo presenta la prima versione di LICO, un lessico di connettivi per l’italiano. LICO comprende circa 170 connettivi del discorso usati in italiano, di cui abbiamo raccolto varianti ortografiche, le parti del discorso, le relazioni semantiche (ricavate dal catalogo del Penn Discourse Treebank) espresse dal connettivo, e alcuni esempi d’uso.

1 Introduction

Discourse connectives are explicit lexical markers that are used to express functional relations between parts of the discourse. As an example, the Italian word “quando” in the sentence “Quando si preme sul bottone, la porta si apre da sola” (When you press the button, the door opens by itself) expresses a conditional relation between two parts of the sentence (from now on, arguments).

Work on discourse connectives in Computational Linguistics was initially part of Rhetorical Structure Theory (Mann and Thompson, 1988), where the focus is on discourse relations, which are at the basis of the notion of textual coherence. In Computational Linguistics, being able to identify connectives is a central task in “shallow discourse parsing”, which has become very popular in recent years (e.g., Lin et al., 2014) and constituted the shared task of the CoNLL conference in 2015 and 2016¹. Downstream applications that can benefit from shallow discourse structure are, inter alia, sentiment analysis (e.g., Bhatia et al., 2015) and argumentation mining (e.g., Peldszus and Stede, 2013).

¹ http://www.cs.brandeis.edu/~clp/conll16st/

Our work on connectives is mainly motivated by the fact that, to the best of our knowledge, there is still no high-coverage resource of discourse connectives available for Italian. LICO, the Lexicon for Italian COnnectives, aims at filling this gap, providing a repository of Italian connectives aligned with recent developments in discourse relations (i.e. the latest version (3.0) of the Penn Discourse Treebank (PDTB)).

In addition, the LICO lexicon takes advantage of DimLex, a similar repository for German (Scheffler and Stede, 2016; Stede and Umbach, 1998); in fact, DimLex served as the main inspiration for creating LICO (see Section 4). DimLex is an XML-encoded resource that can be used for NLP; the public version provides information on orthographical variants, syntactic behavior, semantic relations (in terms of PDTB), and usage examples. It is used for automatic discourse parsing, and also for semi-automatic text annotation using the ConAno tool (Stede and Heintze, 2004). Another relevant resource for connectives is LEXCONN, for French (Roze et al., 2012), which contains about 300 connectives with their syntactic category and coherence relations from Segmented Discourse Representation Theory (Asher and Lascarides, 2003) (and to some extent Rhetorical Structure Theory (Mann and Thompson, 1988)).

LICO is freely distributed under a CC-BY licence.

2 Discourse Connectives

The definition of discourse connective is controversial both in traditional grammar and in the linguistic literature. Our definition is based on the encyclopedia entry on connectives by Ferrari (2010), included in the reference work for the Italian language recently published by Treccani. In this entry, connectives are defined as “each of the invariable forms [...], that introduce relations that structure ‘logically’ the meanings of the sentence and of the text”². The definition provided in Ferrari (2010) is restrictive, as it does not include variable forms, i.e. those forms which are subject to morphological modifications, such as ne consegue/conseguiva che ‘it follows/followed that’, nor does it include pragmatic uses of connectives (also known as discourse markers) such as causal perché ‘because’ in “Che ore sono? Perché ho dimenticato l’orologio” (‘What time is it? Because I forgot my watch’). On the other hand, it assumes that logical relations marked by connectives hold between events or assertions, and therefore includes as arguments for the relation nominal expressions such as “dopo il pressante invito ...” ‘after the pressing invitation ...’, i.e. expressions that contain an event nominal, although the event is, in this case, referred to instead of predicated.

² “Il termine connettivo indica in linguistica ciascuna delle forme invariabili [...], che indicano relazioni che strutturano ‘logicamente’ i significati della frase e del testo”.

In our work, we partly drop the invariability criterion: we do not include forms which exhibit morphological inflection or conjugation, but we do include connectives which show a certain degree of lexical variability, that is, multi-word expressions which are not totally rigid from a lexical point of view (ad esempio/per esempio ‘for example’; see Section 3).

3 The Structure of the Lexicon

Each entry in the LICO lexicon corresponds to a connective (including its variants). Currently, for each entry LICO specifies:

• whether the connective (or its variants) is composed of a single token (“part = single”, e.g. perché) or of more than one token (“part = phrasal”, e.g. di conseguenza);

• whether the connective is composed of correlating parts (“orth = discont”) or not (“orth = cont”), and the specification of the two correlating parts, e.g. “orth = discont”: da una parte (“part = phrasal”), dall’altra (“part = phrasal”); “orth = cont”: perché (“part = single”);

• possible orthographic variants: e.g. ciò nonostante (“part = phrasal”) and ciononostante (“part = single”);

• possible lexical variants: e.g. dopo di ché and dopo di ciò. Notice that in some cases these lexical variants determine a different syntactic environment, such as in modo da and in modo che, the first being followed by an infinitive form, the second by a subjunctive form;

• POS category: adverb, preposition, subordinating or coordinating conjunction;

• the semantic relation(s) that the connective indicates, according to the PDTB 3.0 schema (see Section 3.1);

• examples of the connective for each semantic relation;

• possible alignments with lexicons of connectives in other languages.

Table 1 shows the entry for quando, which presents more than one semantic relation, and the entry for ciononostante, ciò nonostante, nonostante ciò, as an example of a connective with orthographic variants in LICO.

entry-id 146
  quando
    orth     cont
    part     single
  POS        subordinating
  sem relation  TEMPORAL:Synchronous
    ex.: Quando lasciò l’appartamento, arrivò la chiamata
    rel. to German id: 5
  sem relation  CONTINGENCY:Condition
    ex.: Quando si preme sul bottone, la porta si apre da sola.
    ex.: Quando me lo chiedi, lo lascerò stare.
    rel. to German id: 116

entry-id 30
  ciononostante
    orth     cont
    part     single
    variant  orthographic
  ciò nonostante
    orth     cont
    part     phrasal
    variant  orthographic
  nonostante ciò
    orth     cont
    part     phrasal
    variant  orthographic
  POS        coordinating
  sem relation  COMPARISON:Concession:Arg2-as-denier
    ex.: La procura ha ordinato la restituzione dell’esemplare confiscato. Ciononostante l’istruttoria prosegue.
    rel. to German id: 74

Table 1: The connectives quando and ciononostante, ciò nonostante, nonostante ciò in LICO.

3.1 Semantic relations

For the annotation of the semantic relation we used the PDTB 3.0 schema of relations (Webber et al., 2016; Rehbein et al., 2016) as proposed in the DimLex resource (Scheffler and Stede, 2016), which is our main reference resource.

The schema is a more recent version of the PDTB 2.0 schema (Prasad et al., 2008; Prasad et al., 2007) and includes semantic relations structured in a hierarchy composed of three levels. In the first level, the class level, the relations are grouped in four major classes: TEMPORAL, CONTINGENCY, COMPARISON and EXPANSION. The second level, the type level, further specifies the semantics of the class level. For example, the TEMPORAL:Synchronous tag is used for connectives that indicate that the two arguments are simultaneous, while the TEMPORAL:Asynchronous tag is used for connectives that indicate a before-after relation between the arguments. The third level (subtype level)³ varies according to the role of the two arguments involved in the relation. For example, CONTINGENCY:Cause:Reason is used if the argument introduced by the connective (Arg2) is the reason for the situation in the other argument (Arg1) (e.g. I stayed at home, because it was raining), while CONTINGENCY:Cause:Result is used if Arg2 represents the result/effect of Arg1 (e.g. It was raining, therefore I stayed at home). Not every type has a further subtype. In the LICO structure, each connective is assigned one or more tags from this three-level hierarchy.

³ The names of the levels are taken from Prasad et al. (2007).

4 The Current Resource

In this section, we present the current resource and its construction. In particular, we focus on describing how the list of entries has been identified so far and how we proceeded to acquire the semantic information for each entry.

List of connectives. Currently, LICO is composed of 173 entries, each one corresponding to a connective and its orthographical or lexical variants. In order to compile this list we used a number of grammatical and lexical resources for Italian and for other languages.

First, we retrieved the list of connectives mentioned by Ferrari (2010) in the Enciclopedia Treccani for the entry connettivi⁴, for a total of 33 connectives. Then, we retrieved the list of connectives tagged as congiunzione testuale in Sabatini Coletti 2006 (Sabatini-Coletti, 2005), discarding the ones of literary use, for a total of 70 entries. Finally, we benefited from the DimLex resource for German, as we enriched our list by identifying the equivalent Italian terms of the German connectives⁵. This process was facilitated by the presence of examples in the German resource in which the connective is displayed in context: only the Italian candidates that maintain the sense of the German connectives were added to LICO. We keep track of these “German-Italian” links and we will use this information to further enrich the characteristics of the entry in LICO (e.g. aber → ma). A total of 127 entries were collected with this method. Figure 1 shows the overlap between the three resources and Table 2 shows a sample of the connectives in LICO and the respective sources.

⁴ http://www.treccani.it/enciclopedia/connettivi_(Enciclopedia-dell’Italiano)/, last access July 21st 2016.
⁵ https://github.com/discourse-lab/dimlex

Figure 1: Overlap between the resources.

LICO Entries              Resources (Ferrari Treccani / Sabatini Coletti / DimLex (equivalent))
dopo                      dopo, dopo, dopo
dopo di che, dopodiché    dopo di che, dopodiché
dopotutto                 dopotutto
dunque                    dunque, dunque, dunque
e                         e, e
ebbene                    ebbene
eccetto                   eccetto
eppure                    eppure, eppure

Table 2: Sample of connectives in different resources.

Semantic relations in LICO. In LICO, connectives are tagged with the semantic relations that the connective can indicate in a text, selecting the most appropriate ones in the PDTB 3.0 schema. In this process we took advantage of the information which was already present in the resources we used for building the list. In fact, the DimLex resource provides this information for the German connectives, and both the Italian resources previously mentioned provide useful information about the semantic relation triggered by the connective.⁶ A total of 23 different PDTB relations have been used to describe LICO entries. In order to validate the tagging of semantic relations, we conducted a study by observing examples of the use of the connectives in corpora, i.e. we wanted to verify whether the relation that a connective introduces in a portion of text is one of the relations already tagged for that same connective in the first step. In particular, we searched for 20 connectives in the itWaC corpus (Baroni et al., 2009) and we retrieved occurrences with 400 characters on both sides of the connective. We limited our observation to 5 retrieved segments of text in which the connective is actually playing such a role. We finally tagged each connective in each portion of text with the semantic relation it indicates.

⁶ In particular, in the online version of Sabatini Coletti (http://dizionari.corriere.it/dizionarioitaliano/D/dizionario.shtml, last access July 21st 2016) the semantic relations the connectives can trigger are described in the definition of the connective itself, e.g. “quindi, cong. testuale: Con valore deduttivo-conclusivo, perciò, di conseguenza, per questo motivo, dunque”. Ferrari (2010) in the Enciclopedia Treccani proposes a non-hierarchical classification which includes the following relations: “temporal relation”, “causal relation”, “consequence relation”, “condition relation”, “opposition relations”.

To further confirm the corpus-driven evidence for the semantic relations, we asked two annotators (one being an expert annotator, the other not) to perform the same tagging task. We then calculated the inter-annotator agreement between the two annotators adopting Dice’s coefficient (Rijsbergen, 1997)⁷ for three configurations, one for each level of the relation schema: class agreement, type agreement, subtype agreement. We considered that there was agreement if both annotators identified exactly the same class, type, subtype respectively. The Dice values are 0.78 for class agreement and 0.71 for both type agreement and subtype agreement.

⁷ Dice’s coefficient measures how similar two sets are by dividing twice the number of shared elements by the total number of elements in the two sets. This produces a value from 1, if both sets share all elements, to 0, if they have no element in common.

Observing cases of disagreement, we can make the following preliminary considerations. The main cases of disagreement regard the COMPARISON:Contrast relation (on one hand) and the COMPARISON:Concession and EXPANSION:Substitution relations (on the other hand). These relations in fact appear to be the ones that connect arguments that are in contrast. As an example, the connective anziché ‘rather than’ in Example (1) has been annotated as COMPARISON:Contrast by annotator1 and as EXPANSION:Substitution:Arg1-as-subst by annotator2: the first highlights the contrast between “emissione attraverso il Tesoro” and “usare il tradizionale sistema”, the second emphasises that Arg2 represents the alternative to Arg1.

(1) [..] chiedeva l’emissione di dollari in banconote statunitensi attraverso il Tesoro anziché usando il tradizionale sistema della Federal Reserve.

Another interesting case concerns the disagreement between the relations TEMPORAL:Asynchronous:precedence (in which Arg2 follows Arg1) and CONTINGENCY:Cause:Result (in which Arg2 is the result of Arg1), the two being strictly connected (i.e. in a cause-effect relation, the effect follows the cause). As an example, in (2) one annotator marks the connective as an indicator of the temporal sequence of Arg1 and Arg2, while the other prefers to mark it as an indicator of the cause-effect relation.

(2) [..] Il bello è che i tipi hanno pure accennato a prendersela con me, al che io gli ho abbaiato contro una sequela di insulti [..]

In general, the relations that were initially assigned to these connectives were confirmed by the corpus-based exercise (i.e. at least one annotator assigns the tag in at least one portion of text); vice versa, in some cases one of the two annotators assigned a relation that was not initially identified.⁸

⁸ For the moment, the “new” relations are not included in LICO.

5 Conclusion and Further Work

In this paper we have presented LICO, a new resource for the Italian language describing lexical properties of discourse connectives. While LICO fills a gap with respect to similar resources existing for other languages, it is still under construction in several respects. Our short-term plans include the completion of the lexical entries with corpus-derived examples and the observation of the connectives in Italian corpora, in order to acquire more information about the semantic relations that each connective can indicate and thus extend the annotation of the semantic relations in LICO.

Acknowledgment

We acknowledge Denise Pangrazzi for her contribution in identifying the Italian equivalents of the German connectives.

References

Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September. Association for Computational Linguistics.

Angela Ferrari. 2010. Connettivi. In Enciclopedia dell’Italiano. diretta da Raffaele Simone, con la collaborazione di Gaetano Berruto e Paolo D’Achille, Roma, Istituto della Enciclopedia Italiana.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2014. A PDTB-styled end-to-end discourse parser. Natural Language Engineering, 20:151–184.

William C Mann and Sandra A Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.

Rashmi Prasad, Eleni Miltsakaki, Nikhil Dinesh, Alan Lee, Aravind Joshi, Livio Robaldo, and Bonnie L Webber. 2007. The Penn Discourse Treebank 2.0 Annotation Manual.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and Bonnie L Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May.

Ines Rehbein, Merel Scholman, and Vera Demberg. 2016. Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May.

CJ van Rijsbergen. 1997. Information retrieval. 1979.

Charlotte Roze, Laurence Danlos, and Philippe Muller. 2012. LEXCONN: a French lexicon of discourse connectives. Discours. Revue de linguistique, psycholinguistique et informatique, (10).

Il Sabatini-Coletti. 2005. Dizionario della lingua italiana 2006, con CD-ROM. Milano, Rizzoli Larousse.

Tatjana Scheffler and Manfred Stede. 2016. Adding Semantic Relations to a Large-Coverage Connective Lexicon of German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May.

Manfred Stede and Silvan Heintze. 2004. Machine-assisted rhetorical structure annotation. In Proceedings of the 20th International Conference on Computational Linguistics, pages 425–431, Geneva.

Manfred Stede and Carla Umbach. 1998. DiMLex: A lexicon of discourse markers for text generation and understanding. In Proceedings of the 17th International Conference on Computational Linguistics, Volume 2, pages 1238–1242. Association for Computational Linguistics.

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2016. A Discourse-Annotated Corpus of Conjoined VPs. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 22–31. Association for Computational Linguistics.
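
Illustrative sketch. The level-wise agreement computation described in Section 4 (Dice’s coefficient at class, type and subtype level) can be illustrated with a short Python sketch. This is not the authors’ code: the paper does not say how the annotations were encoded as sets, so the sketch assumes that each annotator contributes one (segment, tag) pair per text segment and that a tag is cut to its first one, two or three colon-separated components before comparison. The function names truncate_tag, dice and level_agreement, as well as the toy annotations, are hypothetical.

```python
from typing import List, Set, Tuple


def truncate_tag(tag: str, level: int) -> str:
    """Keep the first `level` components of a colon-separated PDTB-style tag.

    A tag with fewer components than `level` (e.g. CONTINGENCY:Condition
    at subtype level) is returned unchanged.
    """
    parts = [p.strip() for p in tag.split(":")]
    return ":".join(parts[:level])


def dice(a: Set, b: Set) -> float:
    """Dice's coefficient: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def level_agreement(ann1: List[str], ann2: List[str], level: int) -> float:
    """Agreement at level 1 (class), 2 (type) or 3 (subtype).

    Assumption: both lists are aligned, one tag per text segment.
    """
    if len(ann1) != len(ann2):
        raise ValueError("annotation lists must be aligned")
    s1: Set[Tuple[int, str]] = {(i, truncate_tag(t, level)) for i, t in enumerate(ann1)}
    s2: Set[Tuple[int, str]] = {(i, truncate_tag(t, level)) for i, t in enumerate(ann2)}
    return dice(s1, s2)


# Toy annotations (invented for illustration, not taken from LICO).
annotator1 = [
    "COMPARISON:Contrast",
    "CONTINGENCY:Cause:Reason",
    "TEMPORAL:Synchronous",
    "CONTINGENCY:Condition",
]
annotator2 = [
    "EXPANSION:Substitution:Arg1-as-subst",
    "CONTINGENCY:Cause:Result",
    "TEMPORAL:Asynchronous:Precedence",
    "CONTINGENCY:Condition",
]

for name, level in (("class", 1), ("type", 2), ("subtype", 3)):
    print(name, level_agreement(annotator1, annotator2, level))
```

With exactly one tag per segment for each annotator, this set-based Dice value reduces to the proportion of segments on which the two annotators agree; a different encoding (for example, several tags per segment) would give different figures.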