=Paper= {{Paper |id=Vol-2006/paper006 |storemode=property |title=Toward a Bilingual Lexical Database on Connectives: Exploiting a German/Italian Parallel Corpus |pdfUrl=https://ceur-ws.org/Vol-2006/paper006.pdf |volume=Vol-2006 |authors=Peter Bourgonje,Yulia Grishina,Manfred Stede |dblpUrl=https://dblp.org/rec/conf/clic-it/BourgonjeGS17 }} ==Toward a Bilingual Lexical Database on Connectives: Exploiting a German/Italian Parallel Corpus== https://ceur-ws.org/Vol-2006/paper006.pdf
                 Toward a bilingual lexical database on connectives:
                    Exploiting a German/Italian parallel corpus

                       Peter Bourgonje, Yulia Grishina, Manfred Stede
                              Applied Computational Linguistics
                               University of Potsdam / Germany
                    {bourgonje,grishina,stede}@uni-potsdam.de


                     Abstract

    English. We report on experiments to validate and extend two language-specific connective databases (German and Italian) using a word-aligned corpus. This is a first step toward constructing a bilingual lexicon on connectives that are connected via their discourse senses.

    Italiano. Presentiamo una serie di esperimenti per validare ed estendere due database dei connettivi, che sono specifici per la lingua italiana e per quella tedesca. Abbiamo utilizzato un corpus parallelo allineato a livello della parola. Si tratta di un primo passo verso la costruzione di un lessico bilingue dei connettivi che sono collegati attraverso i loro sensi del discorso.

1   Introduction
An important part of discourse processing deals with uncovering coherence relations that hold between individual, “elementary” units of a text. The lexical items that can signal such a relation are referred to as discourse connectives, and examples of these relations, also called the connectives’ senses, are contrast (e.g., ‘but’), elaboration (e.g., ‘in particular’), or cause (e.g., ‘therefore’). Notice, however, that relations need not always be signalled in text, if the context or world knowledge is sufficient for the reader to infer it, as (1)-(4) demonstrate:

(1) We should hurry, because it’s late.
(2) We should hurry. It’s late.
(3) The red pen costs $2, while the blue one is $2.50.
(4) The red pen costs $2; the blue one is $2.50.

On the other hand, example (6) is a perfectly grammatical sentence but the meaning is different from (5), so for this case of a Concession relation, the connective is in fact indispensable.

(5) Although it is late, we don’t need to hurry.
(6) It is late; we don’t need to hurry.

Recognizing these relations, which can hold within a sentence, between two sentences, or between larger spans of text, is a central task for uncovering the structure of a text, as it has been studied in theories like Rhetorical Structure Theory (Mann and Thompson, 1988) or Segmented Discourse Representation Theory (Asher and Lascarides, 2003). While the usage of connectives can sometimes be optional, the set of connectives that a language offers is generally taken as important (if not exhaustive) evidence for the set of coherence relations that should be assumed.

1.1   Background: Connectives
From a syntactic viewpoint, ‘connective’ is not a homogeneous class, as it contains conjunctions, different kinds of adverbials, as well as certain prepositions. Our underlying definition of discourse connectives is based on (Pasch et al., 2003, p. 331):

(7) Def.: A discourse connective is a lexical item x that exhibits each of the following five properties:
    (M1) x cannot be inflected.
    (M2) x does not assign case features to its syntactic environment.
    (M3) The meaning of x is a two-place relation.
    (M4) The arguments of the relation (the meaning of x) are propositional structures.
    (M5) The expressions of the arguments of the relation can be sentential structures.
Following (Stede, 2002), we drop M2 because our
lexicon deliberately includes several prepositions
that can be used as connectives (in the sense of
M1, M3-M5), e.g., trotz (‘despite’) or wegen (‘due
to’).
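
The working definition above (M1 and M3-M5 required, M2 dropped) can be read as a simple predicate over lexical items. The following is a minimal sketch of that reading; the class `LexicalItem`, its field names, and the hand-assigned property values for trotz are illustrative assumptions and are not taken from DiMLex or LICo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalItem:
    """Hypothetical record of the five properties from definition (7)."""
    form: str
    inflectable: bool            # True would violate M1
    assigns_case: bool           # True would violate M2
    two_place_relation: bool     # M3
    propositional_args: bool     # M4
    sentential_args_possible: bool  # M5


def is_connective(item: LexicalItem, require_m2: bool = False) -> bool:
    """Check M1, M3, M4, M5; check M2 only if explicitly requested."""
    checks = [
        not item.inflectable,
        item.two_place_relation,
        item.propositional_args,
        item.sentential_args_possible,
    ]
    if require_m2:
        checks.append(not item.assigns_case)
    return all(checks)


# Hand-assigned values for illustration: a preposition assigns case,
# so it fails M2 but passes the relaxed definition.
trotz = LexicalItem("trotz", False, True, True, True, True)
relaxed = is_connective(trotz)                    # M2 dropped
strict = is_connective(trotz, require_m2=True)    # original five properties
```

Under these assumed property values, trotz counts as a connective only when M2 is dropped, which is the effect the relaxed definition is meant to have for prepositions.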

1.2       Motivation and contribution
Connectives can pose interesting challenges to
translation and for language learners, as the dif-
ferences in meaning between similar connectives
can be quite subtle. For these reasons, we are
interested here specifically in a bilingual Italian–
German lexical resource, to be built on top of
two existing single-language lexicons. As a
case study, we focus on the subgroup of con-
trastive/concessive connectives, which we deter-
mined to comprise (in the existing lexicons) 31
German connectives and 12 Italian connectives;
see Tables 1 and 2.
   The main contributions of this paper are (1) suggestions for improving the existing language-specific resources used in this study through the technique of cross-lingual projection in a parallel corpus, which reveals correspondences between connectives and can point to gaps in either of the resources; and (2) an overview of the distribution of connectives and their senses, to be used in a bilingual database. Section 2 explains the two monolingual lexicons we work with, and Section 3 describes the corpus. Section 4 reviews related work in this area. Section 5 elaborates the idea of bilingual connective databases, and Section 6 summarises our findings.

2       Lexicons: DiMLex and LICo
We extracted the German contrastive connectives from DiMLex (Scheffler and Stede, 2016), a connective lexicon with several different fields describing orthographical variants, syntactic type, discourse sense, and usage examples. It contains 275 entries. The sense annotations are based on the Penn Discourse Treebank (PDTB) senses (Miltsakaki et al., 2008) in its latest version 3. The lexicon is publicly available¹ and aims to exhaustively describe the set of connectives for German, thus providing a basis for our case study.
   The set of Italian contrastive connectives comes from LICo (Feltracco et al., 2016), a similar lexicon for Italian containing 170 entries.² LICo was inspired by DiMLex and contains annotations on the same attributes and uses essentially the same structure (i.e., the same PDTB senses, orthographic variants, usage examples, etc.). An example entry of LICo is shown in Figure 1. We refer the reader to Feltracco et al. (2016) for details.

Figure 1: al contrario entry in LICo

    ¹ https://github.com/discourse-lab/dimlex
    ² https://hlt-nlp.fbk.eu/technologies/lico

3     Exploiting a parallel corpus
For the parallel German/Italian corpus we used Europarl (Koehn, 2005), as it still appears to be the biggest resource of this kind, and it is, conveniently, already sentence-aligned. From the 1,832,053 sentences in the German-Italian part of the corpus we extracted the word alignments using MGIZA++ (Gao and Vogel, 2008). In the following, we sketch our method for obtaining the correspondence information on connectives based on these word alignments, and then present the results.

3.1    Method: Iterative lookup
We approach the problem from two sides: First we look up every German connective (31 in total) to get Italian alignments. 30 of them appeared in our Europarl corpus (with dementgegen missing). Then we look up every Italian connective to get German alignments (all 12 connectives present in
the corpus). We end up with a list of target language words or phrases (or empty elements, since a source language connective can also be covert in the target language) that are candidate contrastive connectives. Note that the lookup procedure does not differ structurally between words and phrases. In both cases, single words (stand-alone or in a phrase) can correspond to zero, one or more target words. The target representation is collected in a key-value structure, where the key is the position in the sentence and the value the word. This list is then sorted by position to return the target word or phrase (which is potentially discontinuous). Because the word alignment is not guaranteed to be correct, to filter out unlikely translations we focus on only the 3 most frequent alignments for every connective. We expect to find at least a subset of the already known (contrastive) connectives (from DiMLex or LICo), potentially complemented by a set of words or phrases that can help fill gaps in either of the lexicons.
   This procedure produces at least some incorrect results for the following two reasons: 1) discourse connectives often can appear in a text with a connective reading or with a non-connective reading; and 2) connectives can have multiple senses, so that a connective may not have the contrastive reading in the particular sentence. The candidates produced hence have to be evaluated manually. Resulting candidates that have a connective reading are added to the seed list, in order to repeat the step back from the target language to the source language³.

   ³ Ideally going back and forth until a stable and exhaustive set of candidates is found. For this study, we only did the first step, and then projected the found Italian connectives back to German.

3.2    Results
3.2.1 German–Italian
The results of the first step of the iteration using the 31 German seed connectives are displayed in Table 1, where an underscore indicates an empty string (meaning that the connective was not aligned to a particular word or phrase in the target language) and the number after the underscore represents the (normalised) frequency of the alignment.
   For the evaluation, we asked a native speaker of Italian with expert knowledge in linguistics to validate the resulting top 3 bilingual mappings. Firstly, we identified several possible connective candidates that were aligned to German contrastive connectives, but were not present in LICo, such as al contempo, solo che, dopo tutto. Secondly, we observed several possible orthographic variants of the already existing Italian connectives: contro or contrario (as possible variants of al contrario), and d’altro canto (as a variant of a discontinuous connective da un canto...dall’altro). Finally, we found that several Italian connectives only had the Concession sense, while the corresponding German connectives also had the Contrast sense; one example is comunque, for which we found the German alignments aber, allerdings and doch.
   As an example of the visualisations (one per connective) on which the above analysis is based, consider Figure 2, showing the most frequent alignments of jedoch, which always has a connective reading, thus nullifying the first problem mentioned in 3.1.

Figure 2: Most frequent alignments of jedoch

3.2.2 Italian–German
The results of the first step of the iteration using the 12 Italian seed connectives are displayed in Table 2. For 11 of the 12 contrastive connectives from LICo, the top 3 alignments yielded an existing DiMLex entry. The only connective without a DiMLex entry in the top 3 was al contrario, for which a possible new German connective candidate im Gegenteil was found through alignment.
   Upon further investigation of the lower-ranked alignments (not included in Table 2), we were able to identify several other gaps in the German lexicon. Firstly, we observed that the Italian connective invece is frequently aligned to the German word anstelle, which is not in DiMLex (but anstelle dessen is). After examining the corresponding examples, we conclude that anstelle should be added to DiMLex as a separate entry (similarly to the already existing aufgrund vs. aufgrund dessen). Also, we found that DiMLex lacks
statt dessen as an orthographic variant of the more canonical stattdessen.
   Finally, we identified two interesting cases that are DiMLex candidates: umgekehrt and (ganz) im Gegenteil, which we found aligned to the Italian viceversa and al contrario, respectively, but more corpus evidence is required to decide whether they can indeed serve as connectives in the German language.
   As an example visualisation, consider Figure 3, showing the most frequent alignments of invece, which always has a connective reading.
   For Italian–German, we repeated the steps above with the candidates found using the German seed list (projecting the resulting Italian list back to German) to see if any additional connectives or orthographic variants would be found. We again found im Gegenteil through alignment of al contrario and a few alternative lexicalisations for DiMLex connectives⁴, but no new candidates.

   ⁴ Not listed here for reasons of space.

Figure 3: Most frequent alignments of invece

Figure 4: Mapping of connective senses from Italian to German

German connective (frequency)    Top 3 Italian alignments
aber (105413)                    ma//_ (0.24)//tuttavia
alldieweil (3)                   finché//perché
allein (6973)                    _ (0.30)//solo//soltanto
allerdings (16232)               tuttavia//_ (0.22)//ma
andererseits (6354)              _ (0.30)//dall’ altro//d’ altro canto
bloß dass (117)                  _ (0.10)//solo che//che solo
dafür (36895)                    _ (0.70)//per//per aver
dafür // dass (42)               che//_ (0.19)//per
dagegen (5423)                   _ (0.34)//contro//contrario
dahingegen (24)                  _ (0.17)//invece//al contrario
dementgegen (0)
demgegenüber (121)               _ (0.25)//invece//contro
doch (37423)                     _ (0.47)//ma//tuttavia
einerseits (4221)                da un lato//_ (0.31)//da una parte
freilich (159)                   _ (0.30)//naturalmente//certo
gleichzeitig (13293)             _ (0.35)//al contempo//allo stesso tempo
hingegen (1909)                  invece//_ (0.26)//tuttavia
immerhin (1360)                  _ (0.44)//comunque//dopo tutto
indessen (280)                   invece//_ (0.19)//tuttavia
jedoch (47667)                   tuttavia//_ (0.27)//ma
nur dass (21617)                 che//solo che
sosehr (14)                      malgrado tutto
unterdessen (193)                nel frattempo//_ (0.21)//intanto
wiederum (2450)                  _ (0.55)//a sua volta//ancora una volta
wogegen (111)                    mentre//_ (0.19)//contro cosa
wohingegen (218)                 mentre//_ (0.14)//ma
während (20388)                  _ (0.28)//mentre//durante
währenddessen (78)               nel frattempo//_ (0.17)//mentre
zugleich (3576)                  _ (0.41)//al contempo//allo stesso tempo
zum anderen (4299)               _ (0.09)//altri//altre
zum einen (8848)                 un//_ (0.10)//una

Table 1: German connectives and their Italian alignments

Italian connective (frequency)   Top 3 German alignments
al contrario (3641)              im gegenteil//_ (0.10)//im gegenteil
bensì (7107)                     sondern//_ (0.12)//sondern vielmehr
contrariamente a (661)           _ (0.08)//entgegen//im gegensatz zu
da un canto (352)                einerseits//_ (0.11)//andererseits
da un lato (4612)                einerseits//_ (0.08)//einerseits die
da una parte (10194)             _ (0.07)//und//eine
invece (18778)                   _ (0.48)//anstatt//stattdessen
ma (135218)                      aber//sondern//_ (0.15)
mentre (15773)                   während//_ (0.19)//und
per contro (13468)               gegen//und//_ (0.06)
però (22687)                     aber//jedoch//_ (0.24)
viceversa (522)                  umgekehrt//_ (0.19)//hingegen

Table 2: Italian connectives and their German alignments

4   Related work
Parallel corpora have been successfully exploited before in order to automatically generate or induce connective lexicons in different languages. In particular, Versley (2010) projected discourse connectives across an English–German parallel corpus to train a discourse parser capable of disambiguating connective and non-connective readings. Similarly, Zhou et al. (2012) used an English–Chinese parallel corpus in order to build a Chinese connective lexicon via cross-lingual projection, and Hajlaoui and Popescu-Belis (2013) relied on parallel data to automatically retrieve Arabic counterparts for a subset of English connectives.
   Since our goal was not to build a connective lexicon from scratch, but to extend the connective lists and refine the inventory of senses for the already existing lexicons, the closest approach to ours is the one adopted by Laali and Kosseim (2014), who aimed at automatically inducing a French connective lexicon via English–French parallel corpora using additional filtering rules. Similar to ours, their results have shown that using parallel translations can improve the coverage of the connective lists in both languages; however, since their lexicons used different sets of discourse relations, they were not able to extend their connective database with respect to senses, as opposed to our work.

5       Toward a bilingual connective database
Our study is meant as a step toward moving from single-language connective lexicons to a bilingual one that provides information about the relationships between the language-specific entries. Both monolingual lexicons are already publicly available on GitHub and in addition an interface allowing bilingual search has been made public in a related project⁵. Below we sketch additional plans for providing this information on the levels of connective tokens, and senses (coherence relations).

    ⁵ http://connective-lex.info/

5.1      Connective mappings
One central purpose of a bilingual database is to assist translators (human or machine) or (human) language learners. For most connectives, there is a complicated m:n mapping between languages, which standard dictionaries do not cover, and the relevant features for making choices are not systematically known yet. A corpus-based inventory of mappings – ideally supplemented by pointers to the corpus instances and their context – can be a very useful resource for undertaking contrastive lexical investigations.

5.2      From connectives to phrases
The PDTB (Prasad et al., 2008) makes a distinction between connectives (a closed set) and “alternative lexicalizations” (AltLex), which are a non-demarcated set of phrases used to express a coherence relation. Such phrases are so far part of neither DiMLex nor LICo. Obviously, they are much harder to detect: Corpus annotation (as done in PDTB) is one way, and we regard our cross-lingual projection method as another promising way. Quite often, connectives in language A have been translated to an AltLex in language B. We plan to study this more systematically by a closer inspection of the alignments and their contexts, in order to extract AltLex candidates as a supplement to the connective lexicons.

5.3   Senses and their distributions
A bilingual connective database can shed light on the distribution of senses over different languages and the degree of ambiguity that individual connectives exhibit. While we consider such conclusions premature for the current stage of the language-specific resources, we include Figure 4, which shows groups of connectives that share the same sense (or group of senses for ambiguous connectives) and their alignment to similar groups on the target side. The 12 Italian connectives (on the left), when grouped together based on their sense(s), form 4 sets, whereas for German (right side), fewer connectives (11 that were found in DiMLex among the top 3 alignments of the 12 source connectives) group into more sets (10). This suggests more ambiguity in Italian connectives, with fewer different senses represented by a larger set of connectives.
   In addition, we observed that Italian connectives with a sense Contrast or Concession are frequently aligned to their German counterparts with a sense Substitution, such as anstelle-invece. Having examined the parallel examples more closely, we conclude that assigning both senses would be valid for both German and Italian, although they are placed distantly in the PDTB hierarchy of senses. These findings are confirmed by Feltracco et al. (2016), who acknowledge that the distinction between the two senses was one of the main sources of inter-annotator disagreement. We conclude that both lexicons could benefit from adding additional senses gained via comparing parallel translations.

6     Summary
We present, to the best of our knowledge, the first Italian–German investigation of discourse connective lexicons. For the subclass of Contrast (in
a wide sense), we were able to identify several missing entries in both lexicons, and provided a start on identifying AltLex items for the two languages (future work). Once the information is organized in a complete bilingual database, it can assist translation and conclusions can be drawn regarding connective distribution, sense distribution and ambiguity in the different languages.
   As prominent steps for future work, we note the disambiguation of connective and non-connective readings, the implementation of more sophisticated filtering strategies to retrieve more reliable connective candidates and repeating this study for different language pairs.

Acknowledgments
We are grateful to the Deutsche Forschungsgemeinschaft (DFG) for funding this work in the project ‘Anaphoricity in Connectives’. We would like to thank the anonymous reviewers for comments on an earlier version of this manuscript. We also thank Flavia Adani for her help with translation and interpretation of the Italian results, and Tatjana Scheffler for the recent work on DiMLex.

References
Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press, Cambridge.
Anna Feltracco, Elisabetta Jezek, Bernardo Magnini, and Manfred Stede. 2016. LICo: A lexicon of Italian connectives. In Proceedings of the 3rd Italian Conference on Computational Linguistics (CLiC-it), Napoli, Italy.
Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP ’08, pages 49–57, Stroudsburg, PA, USA. Association for Computational Linguistics.
Najeh Hajlaoui and Andrei Popescu-Belis. 2013. Assessing the accuracy of discourse connective translations: Validation of an automatic metric. In University of the Aegean-14th International Conference on Intelligent Text Processing and Computational Linguistics. Springer.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.
Majid Laali and Leila Kosseim. 2014. Inducing discourse connectives from parallel texts. In COLING, pages 610–619.
William Mann and Sandra Thompson. 1988. Rhetorical structure theory: Towards a functional theory of text organization. TEXT, 8:243–281.
Eleni Miltsakaki, Livio Robaldo, Alan Lee, and Aravind Joshi, 2008. Sense annotation in the Penn Discourse Treebank, pages 275–286. Springer Berlin Heidelberg, Berlin, Heidelberg.
Renate Pasch, Ursula Brauße, Eva Breindl, and Ulrich Hermann Waßner. 2003. Handbuch der deutschen Konnektoren. Walter de Gruyter, Berlin/New York.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of LREC.
Tatjana Scheffler and Manfred Stede. 2016. Adding semantic relations to a large-coverage connective lexicon of German. In Nicoletta Calzolari et al., editor, Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May.
Manfred Stede. 2002. DiMLex: A lexical approach to discourse markers. In Exploring the Lexicon - Theory and Computation. Edizioni dell’Orso, Alessandria.
Yannick Versley. 2010. Discovery of ambiguous and unambiguous discourse connectives via annotation projection. In Proceedings of Workshop on Annotation and Exploitation of Parallel Corpora (AEPC). Northern European Association for Language Technology (NEALT).
Lanjun Zhou, Wei Gao, Bin Li, Zhong Wei, and Kam-Fai Wong. 2012. Cross-lingual identification of ambiguous discourse connectives for resource-poor language. In Proceedings of COLING.
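
As a supplementary illustration, the lookup step described in Section 3.1 (collecting the aligned target words in a key-value structure keyed by sentence position, sorting by position, and keeping the 3 most frequent alignments per connective) might be sketched as follows. This is a sketch under stated assumptions rather than the implementation used in the study: the alignment representation (a dictionary from source token index to target token indices), the function names `aligned_phrase` and `top_alignments`, the reading of "normalised frequency" as a relative frequency, and the toy sentences are all hypothetical.

```python
from collections import Counter


def aligned_phrase(source_positions, alignment, target_tokens):
    """Return the target word or phrase aligned to a source connective.

    Target words are collected keyed by their position in the target
    sentence and emitted in position order, so a discontinuous phrase
    comes out in target word order. An unaligned connective yields "".
    """
    by_position = {}
    for src in source_positions:
        for tgt in alignment.get(src, ()):
            by_position[tgt] = target_tokens[tgt]
    return " ".join(word for _, word in sorted(by_position.items()))


def top_alignments(instances, k=3):
    """Keep the k most frequent alignments with their relative frequency.

    The empty alignment is written as an underscore, as in Tables 1 and 2.
    """
    counts = Counter(aligned_phrase(*instance) for instance in instances)
    total = sum(counts.values())
    return [(phrase or "_", round(n / total, 2))
            for phrase, n in counts.most_common(k)]


# Toy instances: (source connective positions, alignment, target tokens).
instances = [
    ([0], {0: [0]}, ["tuttavia", "è", "tardi"]),
    ([0], {0: [0]}, ["tuttavia", "non", "basta"]),
    ([0], {}, ["è", "tardi"]),                     # connective left unaligned
    ([0, 1], {0: [3], 1: [1]}, ["so", "che", "arriva", "solo"]),
]
top = top_alignments(instances)
```

Candidates returned this way would still need the manual evaluation described in Section 3.1, since neither non-connective readings nor non-contrastive senses are filtered out by frequency alone.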