Toward a bilingual lexical database on connectives: Exploiting a German/Italian parallel corpus

Peter Bourgonje, Yulia Grishina, Manfred Stede
Applied Computational Linguistics
University of Potsdam / Germany
{bourgonje,grishina,stede}@uni-potsdam.de

Abstract

English. We report on experiments to validate and extend two language-specific connective databases (German and Italian) using a word-aligned corpus. This is a first step toward constructing a bilingual lexicon of connectives that are linked via their discourse senses.

Italiano. Presentiamo una serie di esperimenti per validare ed estendere due database dei connettivi, che sono specifici per la lingua italiana e per quella tedesca. Abbiamo utilizzato un corpus parallelo allineato a livello della parola. Si tratta di un primo passo verso la costruzione di un lessico bilingue dei connettivi che sono collegati attraverso i loro sensi del discorso.

1 Introduction

An important part of discourse processing deals with uncovering coherence relations that hold between individual, “elementary” units of a text. The lexical items that can signal such a relation are referred to as discourse connectives, and examples of these relations, also called the connectives’ senses, are contrast (e.g., ‘but’), elaboration (e.g., ‘in particular’), or cause (e.g., ‘therefore’). Notice, however, that relations need not always be signalled in the text if the context or world knowledge is sufficient for the reader to infer them, as (1)-(4) demonstrate:

(1) We should hurry, because it’s late.
(2) We should hurry. It’s late.
(3) The red pen costs $2, while the blue one is $2.50.
(4) The red pen costs $2; the blue one is $2.50.

On the other hand, example (6) is a perfectly grammatical sentence, but its meaning differs from that of (5), so for this case of a Concession relation, the connective is in fact indispensable.

(5) Although it is late, we don’t need to hurry.
(6) It is late; we don’t need to hurry.

Recognizing these relations, which can hold within a sentence, between two sentences, or between larger spans of text, is a central task for uncovering the structure of a text, as it has been studied in theories like Rhetorical Structure Theory (Mann and Thompson, 1988) or Segmented Discourse Representation Theory (Asher and Lascarides, 2003). While the usage of connectives can sometimes be optional, the set of connectives that a language offers is generally taken as important (if not exhaustive) evidence for the set of coherence relations that should be assumed.

1.1 Background: Connectives

From a syntactic viewpoint, ‘connective’ is not a homogeneous class, as it contains conjunctions, different kinds of adverbials, as well as certain prepositions. Our underlying definition of discourse connectives is based on Pasch et al. (2003, p. 331):

(7) Def.: A discourse connective is a lexical item x that exhibits each of the following five properties:
(M1) x cannot be inflected.
(M2) x does not assign case features to its syntactic environment.
(M3) The meaning of x is a two-place relation.
(M4) The arguments of the relation (the meaning of x) are propositional structures.
(M5) The expressions of the arguments of the relation can be sentential structures.

Following Stede (2002), we drop M2 because our lexicon deliberately includes several prepositions that can be used as connectives (in the sense of M1, M3-M5), e.g., trotz (‘despite’) or wegen (‘due to’).

1.2 Motivation and contribution

Connectives can pose interesting challenges to translation and for language learners, as the differences in meaning between similar connectives can be quite subtle. For these reasons, we are interested here specifically in a bilingual Italian–German lexical resource, to be built on top of two existing single-language lexicons.
As a case study, we focus on the subgroup of contrastive/concessive connectives, which we determined to comprise (in the existing lexicons) 31 German connectives and 12 Italian connectives; see Tables 1 and 2.

The main contributions of this paper are (1) suggestions for improving the existing language-specific resources used in this study through the technique of cross-lingual projection in a parallel corpus, which reveals correspondences between connectives and can point to gaps in either of the resources; and (2) an overview of the distribution of connectives and their senses, to be used in a bilingual database. Section 2 explains the two monolingual lexicons we work with, and Section 3 describes the corpus, our method and the results. Section 4 reviews related work in this area. Section 5 elaborates the idea of bilingual connective databases, and Section 6 summarises our findings.

2 Lexicons: DiMLex and LICo

We extracted the German contrastive connectives from DiMLex (Scheffler and Stede, 2016), a connective lexicon with several different fields describing orthographical variants, syntactic type, discourse sense, and usage examples. It contains 275 entries. The sense annotations are based on the Penn Discourse Treebank (PDTB) senses (Miltsakaki et al., 2008) in its latest version 3. The lexicon is publicly available (https://github.com/discourse-lab/dimlex) and aims to exhaustively describe the set of connectives for German, thus providing a basis for our case study.

The set of Italian contrastive connectives comes from LICo (Feltracco et al., 2016), a similar lexicon for Italian containing 170 entries (https://hlt-nlp.fbk.eu/technologies/lico). LICo was inspired by DiMLex, contains annotations on the same attributes, and uses essentially the same structure (i.e., the same PDTB senses, orthographic variants, usage examples, etc.). An example entry of LICo is shown in Figure 1. We refer the reader to Feltracco et al. (2016) for details.

Figure 1: al contrario entry in LICo

3 Exploiting a parallel corpus

For the parallel German/Italian corpus we used Europarl (Koehn, 2005), as it still appears to be the biggest resource of this kind, and it is, conveniently, already sentence-aligned. From the 1,832,053 sentences in the German-Italian part of the corpus we extracted the word alignments using MGIZA++ (Gao and Vogel, 2008). In the following, we sketch our method for obtaining the correspondence information on connectives based on these word alignments, and then present the results.

3.1 Method: Iterative lookup

We approach the problem from two sides: First we look up every German connective (31 in total) to get Italian alignments. 30 of them appeared in our Europarl corpus (with dementgegen missing). Then we look up every Italian connective to get German alignments (all 12 connectives are present in the corpus). We end up with a list of target language words or phrases (or empty elements, since a source language connective can also be covert in the target language) that are candidate contrastive connectives.

Note that the lookup procedure does not differ structurally between words and phrases. In both cases, single words (stand-alone or in a phrase) can correspond to zero, one or more target words. The target representation is collected in a key-value structure, where the key is the position in the sentence and the value is the word. This structure is then sorted by position to return the target word or phrase (which is potentially discontinuous). Because the word alignment is not guaranteed to be correct, we filter out unlikely translations by considering only the 3 most frequent alignments for every connective. We expect to find at least a subset of the already known (contrastive) connectives (from DiMLex or LICo), potentially complemented by a set of words or phrases that can help fill gaps in either of the lexicons.

This procedure produces at least some incorrect results for the following two reasons: 1) discourse connectives can often appear in a text with a connective reading or with a non-connective reading; and 2) connectives can have multiple senses, so that a connective may not have the contrastive reading in the particular sentence. The candidates produced hence have to be evaluated manually. Resulting candidates that have a connective reading are added to the seed list, in order to repeat the step back from the target language to the source language. (Ideally, one goes back and forth until a stable and exhaustive set of candidates is found. For this study, we only did the first step, and then projected the Italian connectives found back to German.)

3.2 Results

3.2.1 German–Italian

The results of the first step of the iteration using the 31 German seed connectives are displayed in Table 1, where an underscore indicates an empty string (meaning that the connective was not aligned to a particular word or phrase in the target language) and the number after the underscore represents the (normalised) frequency of the alignment.

German connective (frequency)   Top 3 Italian alignments
aber (105413)                   ma//_ (0.24)//tuttavia
alldieweil (3)                  finché//perché
allein (6973)                   _ (0.30)//solo//soltanto
allerdings (16232)              tuttavia//_ (0.22)//ma
andererseits (6354)             _ (0.30)//dall’ altro//d’ altro canto
bloß dass (117)                 _ (0.10)//solo che//che solo
dafür (36895)                   _ (0.70)//per//per aver
dafür // dass (42)              che//_ (0.19)//per
dagegen (5423)                  _ (0.34)//contro//contrario
dahingegen (24)                 _ (0.17)//invece//al contrario
dementgegen (0)
demgegenüber (121)              _ (0.25)//invece//contro
doch (37423)                    _ (0.47)//ma//tuttavia
einerseits (4221)               da un lato//_ (0.31)//da una parte
freilich (159)                  _ (0.30)//naturalmente//certo
gleichzeitig (13293)            _ (0.35)//al contempo//allo stesso tempo
hingegen (1909)                 invece//_ (0.26)//tuttavia
immerhin (1360)                 _ (0.44)//comunque//dopo tutto
indessen (280)                  invece//_ (0.19)//tuttavia
jedoch (47667)                  tuttavia//_ (0.27)//ma
nur dass (21617)                che//solo che
sosehr (14)                     malgrado tutto
unterdessen (193)               nel frattempo//_ (0.21)//intanto
wiederum (2450)                 _ (0.55)//a sua volta//ancora una volta
wogegen (111)                   mentre//_ (0.19)//contro cosa
wohingegen (218)                mentre//_ (0.14)//ma
während (20388)                 _ (0.28)//mentre//durante
währenddessen (78)              nel frattempo//_ (0.17)//mentre
zugleich (3576)                 _ (0.41)//al contempo//allo stesso tempo
zum anderen (4299)              _ (0.09)//altri//altre
zum einen (8848)                un//_ (0.10)//una

Table 1: German connectives and their Italian alignments

For the evaluation, we asked a native speaker of Italian with expert knowledge in linguistics to validate the resulting top 3 bilingual mappings. Firstly, we identified several possible connective candidates that were aligned to German contrastive connectives, but were not present in LICo, such as al contempo, solo che, dopo tutto. Secondly, we observed several possible orthographic variants of the already existing Italian connectives: contro or contrario (as possible variants of al contrario), and d’altro canto (as a variant of the discontinuous connective da un canto...dall’altro). Finally, we found that several Italian connectives only had the Concession sense, while the corresponding German connectives also had the Contrast sense; one example is comunque, for which we found the German alignments aber, allerdings and doch.

As an example of the visualisation (for a single connective) on which the above analysis is based, consider Figure 2, showing the most frequent alignments of jedoch, which always has a connective reading, thus nullifying the first problem mentioned in 3.1.

Figure 2: Most frequent alignments of jedoch

3.2.2 Italian–German

The results of the first step of the iteration using the 12 Italian seed connectives are displayed in Table 2. For 11 of the 12 contrastive connectives from LICo, the top 3 alignments yielded an existing DiMLex entry. The only connective without a DiMLex entry in the top 3 was al contrario, for which a possible new German connective candidate im Gegenteil was found through alignment.

Italian connective (frequency)  Top 3 German alignments
al contrario (3641)             im gegenteil//_ (0.10)//im gegenteil
bensì (7107)                    sondern//_ (0.12)//sondern vielmehr
contrariamente a (661)          _ (0.08)//entgegen//im gegensatz zu
da un canto (352)               einerseits//_ (0.11)//andererseits
da un lato (4612)               einerseits//_ (0.08)//einerseits die
da una parte (10194)            _ (0.07)//und//eine
invece (18778)                  _ (0.48)//anstatt//stattdessen
ma (135218)                     aber//sondern//_ (0.15)
mentre (15773)                  während//_ (0.19)//und
per contro (13468)              gegen//und//_ (0.06)
però (22687)                    aber//jedoch//_ (0.24)
viceversa (522)                 umgekehrt//_ (0.19)//hingegen

Table 2: Italian connectives and their German alignments

Upon further investigation of the lower-ranked alignments (not included in Table 2), we were able to identify several other gaps in the German lexicon. Firstly, we observed that the Italian connective invece is frequently aligned to the German word anstelle, which is not in DiMLex (but anstelle dessen is). After examining the corresponding examples, we conclude that anstelle should be added to DiMLex as a separate entry (similarly to the already existing aufgrund vs. aufgrund dessen). Also, we found that DiMLex lacks statt dessen as an orthographic variant of the more canonical stattdessen.

Finally, we identified two interesting cases that are DiMLex candidates: umgekehrt and (ganz) im Gegenteil, which we found aligned to the Italian viceversa and al contrario, respectively, but more corpus evidence is required to decide whether they can indeed serve as connectives in the German language.

As an example visualisation, consider Figure 3, showing the most frequent alignments of invece, which always has a connective reading.

Figure 3: Most frequent alignments of invece

For Italian–German, we repeated the steps above with the candidates found using the German seed list (projecting the resulting Italian list back to German) to see if any additional connectives or orthographic variants would be found. We again found im Gegenteil through alignment of al contrario and a few alternative lexicalisations for DiMLex connectives (not listed here for reasons of space), but no new candidates.

4 Related work

Parallel corpora have been successfully exploited before in order to automatically generate or induce connective lexicons in different languages. In particular, Versley (2010) projected discourse connectives across an English–German parallel corpus to train a discourse parser capable of disambiguating connective and non-connective readings. Similarly, Zhou et al. (2012) used an English–Chinese parallel corpus in order to build a Chinese connective lexicon via cross-lingual projection, and Hajlaoui and Popescu-Belis (2013) relied on parallel data to automatically retrieve Arabic counterparts for a subset of English connectives.

Since our goal was not to build a connective lexicon from scratch, but to extend the connective lists and refine the inventory of senses for the already existing lexicons, the closest approach to ours is the one adopted by Laali and Kosseim (2014), who aimed at automatically inducing a French connective lexicon via English–French parallel corpora using additional filtering rules. Like ours, their results show that using parallel translations can improve the coverage of the connective lists in both languages; however, since their lexicons used different sets of discourse relations, they were not able to extend their connective database with respect to senses, as opposed to our work.

5 Toward a bilingual connective database

Our study is meant as a step toward moving from single-language connective lexicons to a bilingual one that provides information about the relationships between the language-specific entries. Both monolingual lexicons are already publicly available on GitHub, and in addition an interface allowing bilingual search has been made public in a related project (http://connective-lex.info/). Below we sketch additional plans for providing this information on the levels of connective tokens and senses (coherence relations).

5.1 Connective mappings

One central purpose of a bilingual database is to assist translators (human or machine) or (human) language learners. For most connectives, there is a complicated m:n mapping between languages, which standard dictionaries do not cover, and the relevant features for making choices are not systematically known yet. A corpus-based inventory of mappings, ideally supplemented by pointers to the corpus instances and their context, can be a very useful resource for undertaking contrastive lexical investigations.

5.2 From connectives to phrases

The PDTB (Prasad et al., 2008) makes a distinction between connectives (a closed set) and “alternative lexicalizations” (AltLex), which are a non-demarcated set of phrases used to express a coherence relation. Such phrases are so far part of neither DiMLex nor LICo. Obviously, they are much harder to detect: Corpus annotation (as done in PDTB) is one way, and we regard our cross-lingual projection method as another promising way. Quite often, connectives in language A have been translated to an AltLex in language B. We plan to study this more systematically by a closer inspection of the alignments and their contexts, in order to extract AltLex candidates as a supplement to the connective lexicons.

5.3 Senses and their distributions

A bilingual connective database can shed light on the distribution of senses over different languages and the degree of ambiguity that individual connectives exhibit. While we consider such conclusions premature for the current stage of the language-specific resources, we include Figure 4, which shows groups of connectives that share the same sense (or group of senses for ambiguous connectives) and their alignment to similar groups on the target side. The 12 Italian connectives (on the left), when grouped together based on their sense(s), form 4 sets, whereas for German (right side), fewer connectives (11 that were found in DiMLex among the top 3 alignments of the 12 source connectives) group into more sets (10). This suggests more ambiguity in Italian connectives, with fewer distinct senses represented by a larger set of connectives.

Figure 4: Mapping of connective senses from Italian to German

In addition, we observed that Italian connectives with a sense Contrast or Concession are frequently aligned to German counterparts with a sense Substitution, such as anstelle-invece. Having examined the parallel examples more closely, we conclude that assigning both senses would be valid for both German and Italian, although they are placed far apart in the PDTB hierarchy of senses. These findings are confirmed by Feltracco et al. (2016), who acknowledge that the distinction between the two senses was one of the main cases of inter-annotator disagreement. We conclude that both lexicons could benefit from adding senses gained via comparing parallel translations.

6 Summary

We present, to the best of our knowledge, the first Italian–German investigation of discourse connective lexicons. For the subclass of Contrast (in a wide sense), we were able to identify several missing entries in both lexicons, and provided a start on identifying AltLex items for the two languages (future work). Once the information is organized in a complete bilingual database, it can assist translation, and conclusions can be drawn regarding connective distribution, sense distribution and ambiguity in the different languages.

As prominent steps for future work, we note the disambiguation of connective and non-connective readings, the implementation of more sophisticated filtering strategies to retrieve more reliable connective candidates, and repeating this study for different language pairs.

Acknowledgments

We are grateful to the Deutsche Forschungsgemeinschaft (DFG) for funding this work in the project ‘Anaphoricity in Connectives’. We would like to thank the anonymous reviewers for comments on an earlier version of this manuscript. We also thank Flavia Adani for her help with translation and interpretation of the Italian results, and Tatjana Scheffler for the recent work on DiMLex.

References

Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press, Cambridge.

Anna Feltracco, Elisabetta Jezek, Bernardo Magnini, and Manfred Stede. 2016. LICo: A lexicon of Italian connectives. In Proceedings of the 3rd Italian Conference on Computational Linguistics (CLiC-it), Napoli, Italy.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP ’08, pages 49–57, Stroudsburg, PA, USA. Association for Computational Linguistics.

Najeh Hajlaoui and Andrei Popescu-Belis. 2013. Assessing the accuracy of discourse connective translations: Validation of an automatic metric. In University of the Aegean-14th International Conference on Intelligent Text Processing and Computational Linguistics. Springer.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Majid Laali and Leila Kosseim. 2014. Inducing discourse connectives from parallel texts. In COLING, pages 610–619.

William Mann and Sandra Thompson. 1988. Rhetorical structure theory: Towards a functional theory of text organization. TEXT, 8:243–281.

Eleni Miltsakaki, Livio Robaldo, Alan Lee, and Aravind Joshi. 2008. Sense annotation in the Penn Discourse Treebank, pages 275–286. Springer Berlin Heidelberg, Berlin, Heidelberg.

Renate Pasch, Ursula Brauße, Eva Breindl, and Ulrich Hermann Waßner. 2003. Handbuch der deutschen Konnektoren. Walter de Gruyter, Berlin/New York.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of LREC.

Tatjana Scheffler and Manfred Stede. 2016. Adding semantic relations to a large-coverage connective lexicon of German. In Nicoletta Calzolari et al., editor, Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May.

Manfred Stede. 2002. DiMLex: A lexical approach to discourse markers. In Exploring the Lexicon - Theory and Computation. Edizioni dell’Orso, Alessandria.

Yannick Versley. 2010. Discovery of ambiguous and unambiguous discourse connectives via annotation projection. In Proceedings of Workshop on Annotation and Exploitation of Parallel Corpora (AEPC). Northern European Association for Language Technology (NEALT).

Lanjun Zhou, Wei Gao, Bin Li, Zhong Wei, and Kam-Fai Wong. 2012. Cross-lingual identification of ambiguous discourse connectives for resource-poor language. In Proceedings of COLING.
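
The lookup procedure described in Section 3.1 (collect the target words aligned to a connective in a position-keyed structure, sort by position, and keep the 3 most frequent alignments) can be illustrated with a minimal Python sketch. This is not the implementation used in the study: the function names, the toy sentences and the alignment format (a list of (source index, target index) pairs per sentence pair, rather than MGIZA++ output) are assumptions made for illustration, and the sketch only matches contiguous source phrases.

```python
from collections import Counter

EMPTY = "_"  # marks a connective with no aligned target word (cf. Table 1)


def find_spans(tokens, phrase):
    """Yield lists of positions where `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            yield list(range(start, start + n))


def aligned_target(src_positions, tgt_tokens, alignment):
    """Return the target word or phrase aligned to the given source positions."""
    by_position = {}  # key: position in the target sentence, value: the word
    for src_idx, tgt_idx in alignment:
        if src_idx in src_positions:
            by_position[tgt_idx] = tgt_tokens[tgt_idx]
    if not by_position:
        return EMPTY
    # Sorting by position restores word order; the result may be discontinuous.
    return " ".join(by_position[pos] for pos in sorted(by_position))


def top_alignments(connective, corpus, k=3):
    """Return the k most frequent alignments with their normalised frequency."""
    phrase = connective.lower().split()
    counts = Counter()
    for src_tokens, tgt_tokens, alignment in corpus:
        src = [t.lower() for t in src_tokens]
        tgt = [t.lower() for t in tgt_tokens]
        for span in find_spans(src, phrase):
            counts[aligned_target(set(span), tgt, alignment)] += 1
    total = sum(counts.values())
    return [(target, count / total) for target, count in counts.most_common(k)]


# Hypothetical toy corpus: (German tokens, Italian tokens, word alignment).
toy_corpus = [
    (["Das", "ist", "jedoch", "falsch"], ["Tuttavia", "è", "sbagliato"],
     [(1, 1), (2, 0), (3, 2)]),
    (["Jedoch", "fehlt", "das", "Geld"], ["Tuttavia", "mancano", "i", "soldi"],
     [(0, 0), (1, 1), (2, 2), (3, 3)]),
    (["Wir", "kommen", "jedoch"], ["Ma", "veniamo"],
     [(0, 1), (1, 1), (2, 0)]),
    (["Es", "ist", "jedoch", "spät"], ["È", "tardi"],
     [(1, 0), (3, 1)]),  # jedoch is left unaligned here
]

if __name__ == "__main__":
    for target, freq in top_alignments("jedoch", toy_corpus):
        print(target, round(freq, 2))
```

Keying the aligned words by target position, as the paper describes, means a connective aligned to non-adjacent target words is still returned in sentence order. The manual validation step and the back-projection from target to source language are outside the scope of this sketch.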