=Paper=
{{Paper
|id=Vol-2006/paper006
|storemode=property
|title=Toward a Bilingual Lexical Database on Connectives: Exploiting a German/Italian Parallel Corpus
|pdfUrl=https://ceur-ws.org/Vol-2006/paper006.pdf
|volume=Vol-2006
|authors=Peter Bourgonje,Yulia Grishina,Manfred Stede
|dblpUrl=https://dblp.org/rec/conf/clic-it/BourgonjeGS17
}}
==Toward a Bilingual Lexical Database on Connectives: Exploiting a German/Italian Parallel Corpus==
Peter Bourgonje, Yulia Grishina, Manfred Stede
Applied Computational Linguistics
University of Potsdam / Germany
{bourgonje,grishina,stede}@uni-potsdam.de
Abstract
English. We report on experiments to validate and extend two language-specific connective databases (German and Italian) using a word-aligned corpus. This is a first step toward constructing a bilingual lexicon on connectives that are connected via their discourse senses.
Italiano. Presentiamo una serie di esperimenti per validare ed estendere due database dei connettivi, che sono specifici per la lingua italiana e per quella tedesca. Abbiamo utilizzato un corpus parallelo allineato a livello della parola. Si tratta di un primo passo verso la costruzione di un lessico bilingue dei connettivi che sono collegati attraverso i loro sensi del discorso.
1 Introduction
An important part of discourse processing deals with uncovering coherence relations that hold between individual, “elementary” units of a text. The lexical items that can signal such a relation are referred to as discourse connectives, and examples of these relations, also called the connectives’ senses, are contrast (e.g., ‘but’), elaboration (e.g., ‘in particular’), or cause (e.g., ‘therefore’). Notice, however, that relations need not always be signalled in text, if the context or world knowledge is sufficient for the reader to infer it, as (1)-(4) demonstrate:
(1) We should hurry, because it’s late.
(2) We should hurry. It’s late.
(3) The red pen costs $2, while the blue one is $2.50.
(4) The red pen costs $2; the blue one is $2.50.
On the other hand, example (6) is a perfectly grammatical sentence but the meaning is different from (5), so for this case of a Concession relation, the connective is in fact indispensable.
(5) Although it is late, we don’t need to hurry.
(6) It is late; we don’t need to hurry.
Recognizing these relations, which can hold within a sentence, between two sentences, or between larger spans of text, is a central task for uncovering the structure of a text, as it has been studied in theories like Rhetorical Structure Theory (Mann and Thompson, 1988) or Segmented Discourse Representation Theory (Asher and Lascarides, 2003). While the usage of connectives can sometimes be optional, the set of connectives that a language offers is generally taken as important (if not exhaustive) evidence for the set of coherence relations that should be assumed.
1.1 Background: Connectives
From a syntactic viewpoint, ‘connective’ is not a homogeneous class, as it contains conjunctions, different kinds of adverbials, as well as certain prepositions. Our underlying definition of discourse connectives is based on (Pasch et al., 2003, p. 331):
(7) Def.: A discourse connective is a lexical item x that exhibits each of the following five properties:
(M1) x cannot be inflected.
(M2) x does not assign case features to its syntactic environment.
(M3) The meaning of x is a two-place relation.
(M4) The arguments of the relation (the meaning of x) are propositional structures.
(M5) The expressions of the arguments of the relation can be sentential structures.
Following (Stede, 2002), we drop M2 because our
lexicon deliberately includes several prepositions
that can be used as connectives (in the sense of
M1, M3-M5), e.g., trotz (‘despite’) or wegen (‘due
to’).
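The effect of dropping M2 can be illustrated with a small sketch. The class, field names, and the hand-assigned feature values for trotz below are illustrative assumptions, not part of DiMLex or of the definition by Pasch et al.; the sketch only shows that a case-assigning preposition fails the full five-property test but passes once M2 is no longer enforced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalItem:
    """Hypothetical feature bundle for one lexical item (illustration only)."""
    form: str
    inflectable: bool                    # M1 requires this to be False
    assigns_case: bool                   # M2 requires this to be False
    two_place_relation: bool             # M3
    propositional_arguments: bool        # M4
    sentential_arguments_possible: bool  # M5


def is_connective(item: LexicalItem, enforce_m2: bool = False) -> bool:
    """Check M1 and M3-M5; check M2 only when enforce_m2 is True."""
    checks = [
        not item.inflectable,
        item.two_place_relation,
        item.propositional_arguments,
        item.sentential_arguments_possible,
    ]
    if enforce_m2:
        checks.append(not item.assigns_case)
    return all(checks)


# Feature values assigned by hand for illustration: a preposition assigns case.
trotz = LexicalItem("trotz", inflectable=False, assigns_case=True,
                    two_place_relation=True, propositional_arguments=True,
                    sentential_arguments_possible=True)

print(is_connective(trotz, enforce_m2=True))
print(is_connective(trotz))
```

With M2 enforced the preposition is rejected; with the relaxed definition used here it is accepted.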
1.2 Motivation and contribution
Connectives can pose interesting challenges to translation and for language learners, as the differences in meaning between similar connectives can be quite subtle. For these reasons, we are interested here specifically in a bilingual Italian–German lexical resource, to be built on top of two existing single-language lexicons. As a case study, we focus on the subgroup of contrastive/concessive connectives, which we determined to comprise (in the existing lexicons) 31 German connectives and 12 Italian connectives; see Tables 1 and 2.
The main contributions of this paper are (1) suggestions for improving the existing language-specific resources used in this study through the technique of cross-lingual projection in a parallel corpus, which reveals correspondences between connectives and can point to gaps in either of the resources; and (2) an overview of the distribution of connectives and their senses, to be used in a bilingual database. Section 2 explains the two monolingual lexicons we work with, and Section 3 describes the corpus. Section 4 reviews related work in this area. Section 5 elaborates the idea of bilingual connective databases, and Section 6 summarises our findings.
2 Lexicons: DiMLex and LICo
We extracted the German contrastive connectives from DiMLex (Scheffler and Stede, 2016), a connective lexicon with several different fields describing orthographical variants, syntactic type, discourse sense, and usage examples. It contains 275 entries. The sense annotations are based on the Penn Discourse Treebank (PDTB) senses (Miltsakaki et al., 2008) in its latest version 3. The lexicon is publicly available [1] and aims to exhaustively describe the set of connectives for German, thus providing a basis for our case study.
The set of Italian contrastive connectives comes from LICo (Feltracco et al., 2016), a similar lexicon for Italian containing 170 entries [2]. LICo was inspired by DiMLex and contains annotations on the same attributes and uses essentially the same structure (i.e., the same PDTB senses, orthographic variants, usage examples, etc.). An example entry of LICo is shown in Figure 1. We refer the reader to Feltracco et al. (2016) for details.
Figure 1: al contrario entry in LICo
[1] https://github.com/discourse-lab/dimlex
[2] https://hlt-nlp.fbk.eu/technologies/lico
3 Exploiting a parallel corpus
For the parallel German/Italian corpus we used Europarl (Koehn, 2005), as it still appears to be the biggest resource of this kind, and it is, conveniently, already sentence-aligned. From the 1,832,053 sentences in the German-Italian part of the corpus we extracted the word alignments using MGIZA++ (Gao and Vogel, 2008). In the following, we sketch our method for obtaining the correspondence information on connectives based on these word alignments, and then present the results.
3.1 Method: Iterative lookup
We approach the problem from two sides: First we look up every German connective (31 in total) to get Italian alignments. 30 of them appeared in our Europarl corpus (with dementgegen missing). Then we look up every Italian connective to get German alignments (all 12 connectives present in the corpus). We end up with a list of target language words or phrases (or empty elements, since a source language connective can also be covert in the target language) that are candidate contrastive connectives. Note that the lookup procedure does not differ structurally between words and phrases.
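A minimal sketch of one lookup pass, under stated assumptions: the corpus is represented as triples of source tokens, target tokens, and (source index, target index) alignment pairs, which is a simplification and not the MGIZA++ output format; all function names are hypothetical. Aligned target words are gathered in a dictionary keyed by target position, read out in position order, and the most frequent target phrases are returned with normalised frequencies, with "_" standing for an empty alignment. Discontinuous source connectives are not handled here.

```python
from collections import Counter


def aligned_target_phrase(target_tokens, alignment, source_positions):
    """Return the target words aligned to the given source positions.

    Words are stored under their target position and emitted in position
    order, so the result may be a discontinuous phrase. "_" marks the case
    where nothing is aligned.
    """
    wanted = set(source_positions)
    by_position = {}
    for src, tgt in alignment:
        if src in wanted:
            by_position[tgt] = target_tokens[tgt].lower()
    if not by_position:
        return "_"
    return " ".join(by_position[pos] for pos in sorted(by_position))


def find_occurrences(tokens, phrase):
    """Return the index lists of all contiguous occurrences of phrase."""
    n = len(phrase)
    return [list(range(i, i + n))
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]


def top_alignments(corpus, connective, k=3):
    """Return the k most frequent target phrases with normalised frequency."""
    source_phrase = connective.lower().split()
    counts = Counter()
    for source_tokens, target_tokens, alignment in corpus:
        lowered = [tok.lower() for tok in source_tokens]
        for positions in find_occurrences(lowered, source_phrase):
            counts[aligned_target_phrase(target_tokens, alignment, positions)] += 1
    total = sum(counts.values())
    if total == 0:  # connective not attested in the corpus
        return []
    return [(target, count / total) for target, count in counts.most_common(k)]


# Invented toy data, only to exercise the functions.
toy_corpus = [
    (["das", "ist", "jedoch", "falsch"], ["tuttavia", "è", "sbagliato"],
     [(0, 1), (1, 1), (2, 0), (3, 2)]),
    (["jedoch", "nicht"], ["ma", "non"], [(0, 0), (1, 1)]),
    (["es", "ist", "jedoch", "so"], ["è", "così"], [(1, 0), (3, 1)]),
    (["jedoch", "ja"], ["tuttavia", "sì"], [(0, 0), (1, 1)]),
]
print(top_alignments(toy_corpus, "jedoch"))
```

Restricting the output to the top k entries is the only filter applied in this sketch; whether a candidate really has a connective reading still has to be judged by hand.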
For words and phrases alike, a single source word (stand-alone or part of a phrase) can correspond to zero, one or more target words. The target representation is collected in a key-value structure, where the key is the position in the sentence and the value the word. This list is then sorted by position to return the target word or phrase (which is potentially discontinuous). Because the word alignment is not guaranteed to be correct, to filter out unlikely translations we focus on only the 3 most frequent alignments for every connective. We expect to find at least a subset of the already known (contrastive) connectives (from DiMLex or LICo), potentially complemented by a set of words or phrases that can help fill gaps in either of the lexicons.
This procedure produces at least some incorrect results for the following two reasons: 1) discourse connectives often can appear in a text with a connective reading or with a non-connective reading; and 2) connectives can have multiple senses, so that a connective may not have the contrastive reading in the particular sentence. The candidates produced hence have to be evaluated manually. Resulting candidates that have a connective reading are added to the seed list, in order to repeat the step back from the target language to the source language [3].
[3] Ideally going back and forth until a stable and exhaustive set of candidates is found. For this study, we only did the first step, and then projected the found Italian connectives back to German.
3.2 Results
3.2.1 German–Italian
The results of the first step of the iteration using the 31 German seed connectives are displayed in Table 1, where an underscore indicates an empty string (meaning that the connective was not aligned to a particular word or phrase in the target language) and the number after the underscore represents the (normalised) frequency of the alignment.
For the evaluation, we asked a native speaker of Italian with expert knowledge in linguistics to validate the resulting top 3 bilingual mappings. Firstly, we identified several possible connective candidates that were aligned to German contrastive connectives, but were not present in LICo, such as al contempo, solo che, dopo tutto. Secondly, we observed several possible orthographic variants of the already existing Italian connectives: contro or contrario (as possible variants of al contrario), and d’altro canto (as a variant of a discontinuous connective da un canto...dall’altro). Finally, we found that several Italian connectives only had the Concession sense, while the corresponding German connectives also had the Contrast sense, such as comunque, for which we found the German alignments aber, allerdings and doch, for example.
As an example of a visualisation (for a single connective) the above analysis is based on, consider Figure 2, showing the most frequent alignments of jedoch, which always has a connective reading, thus nullifying the first problem mentioned in 3.1.
Figure 2: Most frequent alignments of jedoch
3.2.2 Italian–German
The results of the first step of the iteration using the 12 Italian seed connectives are displayed in Table 2. For 11 of the 12 contrastive connectives from LICo, the top 3 alignments yielded an existing DiMLex entry. The only connective without a DiMLex entry in the top 3 was al contrario, for which a possible new German connective candidate im Gegenteil was found through alignment.
Upon further investigation of the lower-ranked alignments (not included in Table 2), we were able to identify several other gaps in the German lexicon. Firstly, we observed that the Italian connective invece is frequently aligned to the German word anstelle, which is not in DiMLex (but anstelle dessen is). After examining the corresponding examples, we conclude that anstelle should be added to DiMLex as a separate entry (similarly to the already existing aufgrund vs. aufgrund dessen). Also, we found that DiMLex lacks
statt dessen as an orthographic variant of the more canonical stattdessen.
Finally, we identified two interesting cases that are DiMLex candidates: umgekehrt and (ganz) im Gegenteil, which we found aligned to the Italian viceversa and al contrario, respectively, but more corpus evidence is required to decide whether they can indeed serve as connectives in the German language.
As an example visualisation, consider Figure 3, showing the most frequent alignments of invece, which always has a connective reading.
Figure 3: Most frequent alignments of invece
For Italian–German, we repeated the steps above with the candidates found using the German seed list (projecting the resulting Italian list back to German) to see if any additional connectives or orthographic variants would be found. We again found im Gegenteil through alignment of al contrario and a few alternative lexicalisations for DiMLex connectives [4], but no new candidates.
[4] Not listed here for reasons of space.
German connective (frequency): top 3 Italian alignments
aber (105413): ma // _ (0.24) // tuttavia
alldieweil (3): finché // perché
allein (6973): _ (0.30) // solo // soltanto
allerdings (16232): tuttavia // _ (0.22) // ma
andererseits (6354): _ (0.30) // dall’ altro // d’ altro canto
bloß dass (117): _ (0.10) // solo che // che solo
dafür (36895): _ (0.70) // per // per aver
dafür // dass (42): che // _ (0.19) // per
dagegen (5423): _ (0.34) // contro // contrario
dahingegen (24): _ (0.17) // invece // al contrario
dementgegen (0):
demgegenüber (121): _ (0.25) // invece // contro
doch (37423): _ (0.47) // ma // tuttavia
einerseits (4221): da un lato // _ (0.31) // da una parte
freilich (159): _ (0.30) // naturalmente // certo
gleichzeitig (13293): _ (0.35) // al contempo // allo stesso tempo
hingegen (1909): invece // _ (0.26) // tuttavia
immerhin (1360): _ (0.44) // comunque // dopo tutto
indessen (280): invece // _ (0.19) // tuttavia
jedoch (47667): tuttavia // _ (0.27) // ma
nur dass (21617): che // solo che
sosehr (14): malgrado tutto
unterdessen (193): nel frattempo // _ (0.21) // intanto
wiederum (2450): _ (0.55) // a sua volta // ancora una volta
wogegen (111): mentre // _ (0.19) // contro cosa
wohingegen (218): mentre // _ (0.14) // ma
während (20388): _ (0.28) // mentre // durante
währenddessen (78): nel frattempo // _ (0.17) // mentre
zugleich (3576): _ (0.41) // al contempo // allo stesso tempo
zum anderen (4299): _ (0.09) // altri // altre
zum einen (8848): un // _ (0.10) // una
Table 1: German connectives and their Italian alignments
Italian connective (frequency): top 3 German alignments
al contrario (3641): im gegenteil // _ (0.10) // im gegenteil
bensì (7107): sondern // _ (0.12) // sondern vielmehr
contrariamente a (661): _ (0.08) // entgegen // im gegensatz zu
da un canto (352): einerseits // _ (0.11) // andererseits
da un lato (4612): einerseits // _ (0.08) // einerseits die
da una parte (10194): _ (0.07) // und // eine
invece (18778): _ (0.48) // anstatt // stattdessen
ma (135218): aber // sondern // _ (0.15)
mentre (15773): während // _ (0.19) // und
per contro (13468): gegen // und // _ (0.06)
però (22687): aber // jedoch // _ (0.24)
viceversa (522): umgekehrt // _ (0.19) // hingegen
Table 2: Italian connectives and their German alignments
4 Related work
Parallel corpora have been successfully exploited before in order to automatically generate or induce connective lexicons in different languages. In particular, Versley (2010) projected discourse connectives across an English–German parallel corpus to train a discourse parser capable of disambiguating connective and non-connective readings. Similarly, Zhou et al. (2012) used an English–Chinese parallel corpus in order to build a Chinese connective lexicon via cross-lingual projection, and Hajlaoui and Popescu-Belis (2013) relied on parallel data to automatically retrieve Arabic counterparts for a subset of English connectives.
Since our goal was not to build a connective lexicon from scratch, but to extend the connective lists and refine the inventory of senses for the already existing lexicons, the closest approach to ours is the one adopted by Laali and Kosseim (2014), who aimed at automatically inducing a French connective lexicon via English–French parallel corpora using additional filtering rules. Similar to ours, their results have shown that using parallel translations can improve the coverage of the connective lists in both languages; however, since their lexicons used different sets of discourse relations, they were not able to extend their connective database with respect to senses, as opposed to our work.
5 Toward a bilingual connective database
Our study is meant as a step toward moving from single-language connective lexicons to a bilingual one that provides information about the relationships between the language-specific entries. Both monolingual lexicons are already publicly available on GitHub and in addition an interface allowing bilingual search has been made public in a related project [5]. Below we sketch additional plans for providing this information on the levels of connective tokens, and senses (coherence relations).
[5] http://connective-lex.info/
5.1 Connective mappings
One central purpose of a bilingual database is to assist translators (human or machine) or (human) language learners. For most connectives, there is a complicated m:n mapping between languages, which standard dictionaries do not cover, and the relevant features for making choices are not systematically known yet. A corpus-based inventory of mappings – ideally supplemented by pointers to the corpus instances and their context – can be a very useful resource for undertaking contrastive lexical investigations.
5.2 From connectives to phrases
The PDTB (Prasad et al., 2008) makes a distinction between connectives (a closed set) and “alternative lexicalizations” (AltLex), which are a non-demarcated set of phrases used to express a coherence relation. Such phrases are so far part of neither DiMLex nor LICo. Obviously, they are much harder to detect: Corpus annotation (as done in PDTB) is one way, and we regard our cross-lingual projection method as another promising way. Quite often, connectives in language A have been translated to an AltLex in language B. We plan to study this more systematically by a closer inspection of the alignments and their contexts, in order to extract AltLex candidates as a supplement to the connective lexicons.
5.3 Senses and their distributions
A bilingual connective database can shed light on the distribution of senses over different languages and the degree of ambiguity that individual connectives exhibit. While we consider such conclusions premature for the current stage of the language-specific resources, we include Figure 4, which shows groups of connectives that share the same sense (or group of senses for ambiguous connectives) and their alignment to similar groups on the target side. The 12 Italian connectives (on the left), when grouped together based on their sense(s), form 4 sets, whereas for German (right side), fewer connectives (11 that were found in DiMLex among the top 3 alignments of the 12 source connectives) group into more sets (10). This suggests more ambiguity in Italian connectives, with fewer distinct senses represented by a larger set of connectives.
Figure 4: Mapping of connective senses from Italian to German
In addition, we observed that Italian connectives with a sense Contrast or Concession are frequently aligned to their German counterparts with a sense Substitution, such as anstelle-invece. Having examined the parallel examples more closely, we conclude that assigning both senses would be valid for both German and Italian, although they are placed distantly in the PDTB hierarchy of senses. These findings are confirmed by Feltracco et al. (2016), who acknowledge that the distinction between the two senses was one of the main cases of inter-annotator disagreement. We conclude that both lexicons could benefit from adding additional senses gained via comparing parallel translations.
6 Summary
We present, to the best of our knowledge, the first Italian–German investigation of discourse connective lexicons. For the subclass of Contrast (in a wide sense), we were able to identify several missing entries in both lexicons, and provided a start on identifying AltLex items for the two languages (future work). Once the information is organized in a complete bilingual database, it can assist translation and conclusions can be drawn regarding connective distribution, sense distribution and ambiguity in the different languages.
As prominent steps for future work, we note the disambiguation of connective and non-connective readings, the implementation of more sophisticated filtering strategies to retrieve more reliable connective candidates and repeating this study for different language pairs.
Acknowledgments
We are grateful to the Deutsche Forschungsgemeinschaft (DFG) for funding this work in the project ‘Anaphoricity in Connectives’. We would like to thank the anonymous reviewers for comments on an earlier version of this manuscript. We also thank Flavia Adani for her help with translation and interpretation of the Italian results, and Tatjana Scheffler for the recent work on DiMLex.
References
Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press, Cambridge.
Anna Feltracco, Elisabetta Jezek, Bernardo Magnini, and Manfred Stede. 2016. LICo: A lexicon of Italian connectives. In Proceedings of the 3rd Italian Conference on Computational Linguistics (CLiC-it), Napoli, Italy.
Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP ’08, pages 49–57, Stroudsburg, PA, USA. Association for Computational Linguistics.
Najeh Hajlaoui and Andrei Popescu-Belis. 2013. Assessing the accuracy of discourse connective translations: Validation of an automatic metric. In University of the Aegean-14th International Conference on Intelligent Text Processing and Computational Linguistics. Springer.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.
Majid Laali and Leila Kosseim. 2014. Inducing discourse connectives from parallel texts. In COLING, pages 610–619.
William Mann and Sandra Thompson. 1988. Rhetorical structure theory: Towards a functional theory of text organization. TEXT, 8:243–281.
Eleni Miltsakaki, Livio Robaldo, Alan Lee, and Aravind Joshi. 2008. Sense annotation in the Penn Discourse Treebank, pages 275–286. Springer Berlin Heidelberg, Berlin, Heidelberg.
Renate Pasch, Ursula Brauße, Eva Breindl, and Ulrich Herrmann Waßner. 2003. Handbuch der deutschen Konnektoren. Walter de Gruyter, Berlin/New York.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of LREC.
Tatjana Scheffler and Manfred Stede. 2016. Adding semantic relations to a large-coverage connective lexicon of German. In Nicoletta Calzolari et al., editor, Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May.
Manfred Stede. 2002. DiMLex: A lexical approach to discourse markers. In Exploring the Lexicon - Theory and Computation. Edizioni dell Orso, Alessandria.
Yannick Versley. 2010. Discovery of ambiguous and unambiguous discourse connectives via annotation projection. In Proceedings of Workshop on Annotation and Exploitation of Parallel Corpora (AEPC). Northern European Association for Language Technology (NEALT).
Lanjun Zhou, Wei Gao, Bin Li, Zhong Wei, and Kam-Fai Wong. 2012. Cross-lingual identification of ambiguous discourse connectives for resource-poor language. In Proceedings of COLING.