<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Translation inference across dictionaries via a combination of graph-based methods and co-occurrence statistics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Proisl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Heinrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Evert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Besim Kabashi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Corpus Linguistics Group, Friedrich-Alexander-Universität Erlangen-Nürnberg</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This system description explains how to use several bilingual dictionaries and aligned corpora in order to create translation candidates for novel language pairs. It proposes (1) a graph-based approach which does not depend on cyclical translations and (2) a combination of this method with a collocation-based model using the multilingually aligned Europarl corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Translation of lexical items is a fundamental problem in computational linguistics
which plays an important role not only in machine translation, but also in
various more specific tasks such as the mapping of queries, tags, denotators, and
the like across different languages. With ever more bilingual lexicons becoming
electronically available for some language pairs, the question arises of how to use
them to create new bilingual dictionaries.</p>
      <p>The organizers of the shared task on Translation Inference Across Dictionaries
(TIAD) provided partial bilingual dictionaries along the following four language
chains over the eight languages German (de), English (en), Portuguese (pt),
Japanese (ja), Spanish (es), Dutch (nl), Danish (da), and French (fr):</p>
      <list list-type="order">
        <list-item><p>German–English–Portuguese</p></list-item>
        <list-item><p>German–Japanese–Spanish–Portuguese</p></list-item>
        <list-item><p>German–Danish–French–Spanish–Portuguese</p></list-item>
        <list-item><p>German–Dutch–Spanish–Danish–French–Portuguese</p></list-item>
      </list>
      <p>The resulting language graph is visualized in Figure 1. In addition, the four chains
also include Portuguese–German dictionaries for “closing the loop” (dashed edge).
According to the task guidelines, use of the Portuguese–German dictionaries is
limited to validation purposes. The objective of the task is to create three new
dictionaries (dotted edges): German–Portuguese, Danish–Spanish, and Dutch–French.</p>
      <p>A naïve approach to that problem would be to recursively collect all translation
candidates: For each source word, take all translations of that word from the
source–pivot1 dictionary; then, for each translation, take all translations from
the pivot1–pivot2 dictionary and so on until the target language is reached.</p>
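      <p>As a minimal illustration of this naive expansion (all dictionary entries below are invented toy data), the following sketch simply follows every translation through a chain of dictionaries and collects every reachable target word:</p>

```python
# Sketch of the naive approach: follow all translations through a chain of
# bilingual dictionaries, collecting every candidate that can be reached.
def naive_candidates(word, dictionaries):
    """dictionaries: list of dicts mapping a word to its translations,
    ordered source->pivot1, pivot1->pivot2, ..., pivotN->target."""
    frontier = {word}
    for dictionary in dictionaries:
        frontier = {t for w in frontier for t in dictionary.get(w, [])}
    return frontier

# Toy de->en->pt example: the ambiguity of "Bank" (financial institution
# vs. bench) already lets extra candidates leak into the result.
de_en = {"Bank": ["bank", "bench"]}
en_pt = {"bank": ["banco"], "bench": ["banco", "assento"]}
candidates = naive_candidates("Bank", [de_en, en_pt])
```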
      <p>[Figure 1: Language graph over the eight languages. Solid edges represent the dictionaries provided along the four chains (edge labels give the chain numbers), the dashed edge represents the Portuguese–German dictionaries for closing the loop, and the dotted edges represent the three dictionaries to be created.]</p>
      <p>The problem with this approach is that it results in very noisy and divergent
dictionaries.</p>
      <p>A common solution to that problem is to make use of cycles (cf. Section 2), in
this case by utilizing the Portuguese–German dictionaries. We opted for a novel
approach: instead of relying on cycles, we apply a weighting scheme. We also
experiment with combining the translation candidates found via this graph-based
approach with candidates extracted from parallel corpora. (We speak of translation
candidates because automatic translation techniques yield n-best lists of terms;
both the evaluation function that ranks the candidate terms and the precise value
of n are at the very core of lexical translation research.)</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>The automatic creation of multilingual dictionaries, especially the macro-structure
of their entries and annotation interfaces (Kernerman, 2011), as well as the
exploitation of resources such as aligned corpora and existing bilingual dictionaries,
has attracted commercial and academic research projects for obvious reasons.</p>
      <p>Tanaka and Umemura (1994), for example, construct a bilingual dictionary
using a third language as a pivot by exploiting the structure of dictionaries
and their lexical entries (nouns). They measure the nearness of meaning of the
lexical entries to distinguish between true translation equivalents and spurious
ones introduced as a result of ambiguity in the pivot language. Similarly, Kaji et al.
(2008) construct a Japanese–Chinese dictionary using English as the intermediate
language. They use monolingual corpora of the first and second language to
eliminate the spurious translations caused by the ambiguity of the third language.
The wide-coverage monolingual corpora provide the basis for extracting word
associations in one language and translation candidates in the target language.
This method enables generating domain-specific translation candidates.</p>
      <p>Villegas et al. (2016) infer new translations for the languages in a graph of as
many as 22 bilingual dictionaries. They consider translation candidates up to three
languages away and assign a confidence score to those candidates, which is based
on the density of cycles containing the potential target. A cycle is a translation
chain which starts and ends at the same lexical item (for a formal definition
of translation chains, see section 3.1). Similarly, Mausam et al. (2009) rely on
cycles (“translation circuits” in their terms) to match senses probabilistically,
and Saralegi et al. (2011) improve precision in the pivot-based automatic creation
of bilingual dictionaries by inverse consultation, i. e. by looking up all the
candidates found in the target language back in the source language. This, however,
only works if dictionaries in both directions are available.</p>
      <p>Haghighi et al. (2008), on the other hand, do not use a third language at
all: they learn bilingual dictionaries using only monolingual corpora and word
features in each language. Last but not least, Shezaf and Rappoport (2010)
present a method for generating higher-quality dictionaries from noisy input:
their method takes two (noisy) bilingual dictionaries (from the source language
to the target language and vice versa) and two comparable monolingual corpora
(one in the source language and one in the target language) and calculates
similarity scores for translation candidates based on the number of words
co-occurring with the source word that can be translated into words co-occurring
with the target word.</p>
      <p>The collocation-based approach described in the present paper, on the other
hand, employs an idea similar to that of Kovář et al. (2016), who use a
transformation of the Dice coefficient for extracting translation candidates from
sentence-aligned parallel corpora.</p>
    </sec>
    <sec id="sec-3">
      <title>System description</title>
      <p>Graph-based approach. As mentioned in Section 1, we opted for a novel
graph-based approach that does not rely on cycles; instead, we use a weighting
scheme. As an additional, self-imposed constraint, we do not make use of the
Portuguese–German dictionaries at all.</p>
      <p>For our weighting scheme, we use not only the four paths provided by
the task organizers but all available simple chains from a source language to
a target language. Simple chains are paths that ignore the orientation of the
edges and in which no vertex occurs twice. We distinguish between language
chains, i. e. chains from one language to another, as illustrated in Figure 1, and
translation chains, i. e. chains from one word to another, via the languages in a
given language chain.</p>
      <p>Formally, let L_{s,t} denote the set of language chains from source language s to
target language t. Each language chain ℓ ∈ L_{s,t} is assigned a weight</p>
      <p>w_ℓ = 1 / (|ℓ| + |r_ℓ|),   (1)</p>
      <p>where |ℓ| is the length of the chain and |r_ℓ| is the number of edges in ℓ that are
traversed in reverse. The weights are normalized such that</p>
      <p>∑_{ℓ ∈ L_{s,t}} w_ℓ = 1.   (2)</p>
      <p>The intuition behind these weights is that the more intervening languages we
have and the more dictionaries we use in reverse, the more the quality suffers.
Therefore, short chains should get a higher weight than long ones, and using a
dictionary in reverse should be penalized.</p>
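      <p>The chain weighting, i. e. weights 1/(|ℓ| + |r_ℓ|) scaled to sum to one, can be sketched as follows (the chain identifiers below are only illustrative):</p>

```python
# Minimal sketch of the language-chain weighting: each chain is described
# by its length and its number of reversed edges; weights are normalized.
def chain_weights(chains):
    """chains: dict mapping a chain id to (length, reversed_edges).
    Returns weights 1 / (length + reversed_edges), scaled to sum to 1."""
    raw = {name: 1.0 / (length + reverse)
           for name, (length, reverse) in chains.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# de->pt via en (2 edges, none reversed) vs. via ja and es (3 edges, 1 reversed):
w = chain_weights({"de-en-pt": (2, 0), "de-ja-es-pt": (3, 1)})
# The shorter, forward-only chain receives the larger weight.
```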
      <p>Let R_{w,ℓ} denote the set of translation chains from word w in the source
language of a language chain ℓ to words in the target language of that language
chain. Each translation chain r ∈ R_{w,ℓ} connects w to a potential translation
equivalent e = τ(r). Each translation equivalent e in the set of translation
equivalents</p>
      <p>E_{w,ℓ} = {τ(r) | r ∈ R_{w,ℓ}}   (3)</p>
      <p>is assigned a weight</p>
      <p>w_{e,ℓ} = |{r ∈ R_{w,ℓ} | τ(r) = e}| / |R_{w,ℓ}|.   (4)</p>
      <p>This weight corresponds to the relative frequency of translation chains from w to
e via the languages in language chain ℓ.</p>
      <p>Now that we have weights for all language chains and for all translations
along a language chain, we can obtain all translation equivalents in the target
language t for word w from the source language s, i. e. E_w = ∪_{ℓ ∈ L_{s,t}} E_{w,ℓ}. Each
translation equivalent e ∈ E_w is assigned a weight</p>
      <p>w_e = ∑_{ℓ ∈ L_{s,t}} w_ℓ · w_{e,ℓ}.   (5)</p>
      <p>The weights are normalized such that ∑_{e ∈ E_w} w_e = 1.</p>
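      <p>Combining the per-chain relative frequencies with the chain weights can be sketched as follows (chain ids, weights, and word lists are invented for illustration):</p>

```python
# Sketch: per-chain relative frequencies of translation equivalents,
# combined across language chains with the chain weights, then normalized.
from collections import Counter

def equivalent_weights(targets_per_chain, chain_weight):
    """targets_per_chain: dict chain -> list of chain endpoints tau(r),
    one entry per translation chain; chain_weight: dict chain -> weight.
    Returns the combined, normalized weight for each equivalent e."""
    combined = Counter()
    for chain, targets in targets_per_chain.items():
        counts = Counter(targets)      # number of chains ending in e
        n_chains = len(targets)        # total number of chains via this path
        for e, c in counts.items():
            combined[e] += chain_weight[chain] * c / n_chains
    total = sum(combined.values())
    return {e: weight / total for e, weight in combined.items()}

w_e = equivalent_weights(
    {"de-en-pt": ["banco", "banco", "assento"], "de-ja-es-pt": ["banco"]},
    {"de-en-pt": 2 / 3, "de-ja-es-pt": 1 / 3},
)
```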
      <p>Now we can simply select the n translation equivalents with the highest
weights. But what is a suitable value for n, i. e. how can we determine the best
number of translation equivalents for a given word? Let R_w = ∪_{ℓ ∈ L_{s,t}} R_{w,ℓ} be
the set of all translation chains from word w in the source language s to words in
the target language t. Then, we set</p>
      <p>n = ⌈|E_w|^{1/c}⌉,   (6)</p>
      <p>where ⌈x⌉ is the ceiling function and c = ∑_{r ∈ R_w} |r| / |R_w| is the average length
of the translation chains. This means we approximate n by the average number of
translations for each word along the translation chains for word w.</p>
      <p>Collocation-based approach. We make use of the Europarl corpus (Koehn, 2005;
release v7) in its preprocessed and sentence-aligned form (Tiedemann, 2012;
http://opus.lingfil.uu.se/Europarl.php). As a further preprocessing step, all
monolingual corpora except for the Portuguese one are lemmatized with
off-the-shelf algorithms. Unfortunately, we did not lemmatize the Portuguese
corpus in time; for the language pair de–pt, our procedure thus yields lexical
surface forms as translation candidates (see below).</p>
      <p>We retrieve translation candidates by analyzing first-order (syntagmatic)
collocations. The procedure is implemented with the R package wordspace (Evert,
2014; http://wordspace.r-forge.r-project.org/). For each language pair, lemmata
(or, in the case of Portuguese, surface types) are extracted together with their
alignment beads from the corpus in order to create lemma-sentence matrices
whose columns are the intersection of the alignment beads. As an example, the
French corpus contains 28,100 lemmata, the Dutch one 36,048, and there is an
intersection of 2,003,463 alignment beads.</p>
      <p>These matrices are then transformed into one term-term co-occurrence matrix
for each language pair. The nl–fr co-occurrence matrix from the example above
thus has 36,048 rows and 28,100 columns. Subsequently, the Dice score is calculated
for each lemma of the source language (if it occurs in the corpus) and each
target term. The Dice score is a de-facto standard for the determination of
translation candidates (Smadja et al., 1996) and represents the harmonic mean
of the conditional probabilities P(source|target) and P(target|source). Let O11
denote the co-occurrence frequency of source and target term, R1 the marginal
frequency of the target term, and C1 the marginal frequency of the source term
(notation and formula taken from Evert, 2008); then the Dice score is</p>
      <p>dice(O11, R1, C1) = 2·O11 / (R1 + C1).   (7)</p>
      <p>The higher its value, the stronger the association between source and target
term. Thus, for every source term, the target terms with the highest Dice scores
serve as translation candidates. Note that in this step we ignore all candidates
which solely consist of punctuation marks and/or digits in order to improve
translation quality.</p>
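      <p>The Dice computation itself is elementary; the following sketch uses invented counts rather than the actual Europarl frequencies:</p>

```python
# Dice score from co-occurrence counts: the harmonic mean of the two
# conditional probabilities P(source|target) and P(target|source).
def dice(o11, r1, c1):
    """o11: co-occurrence frequency of source and target term;
    r1, c1: marginal frequencies of the two terms."""
    return 2 * o11 / (r1 + c1)

# Invented example: a lemma pair co-occurs in 80 alignment beads; one term
# occurs in 90 beads overall, the other in 100.
score = dice(80, 90, 100)   # 160 / 190
```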
      <p>Combination of collocation-based and graph-based approaches. Without an
evaluation measure determining the trade-off between precision and recall of the
translation candidates, we opted for a very simple combination of the two
approaches above: the final list of candidates is obtained as the union of the
graph-based candidates and the four highest-scoring collocation-based candidates.
(The graph-based method yields between two and three candidates on average,
depending on the language pair. Assuming an overlap of one or two candidates
between both methods, this heuristic guarantees that the collocation-based
approach delivers approximately two additional candidates.)</p>
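      <p>This union-based combination can be sketched as follows (candidate words and scores are invented):</p>

```python
# Sketch of the simple combination: union of all graph-based candidates
# with the k highest-scoring collocation-based candidates (k = 4 here).
def combine(graph_candidates, collocation_scores, k=4):
    """graph_candidates: iterable of terms; collocation_scores: dict
    term -> Dice score. Returns the merged candidate set."""
    top_k = sorted(collocation_scores,
                   key=collocation_scores.get, reverse=True)[:k]
    return set(graph_candidates) | set(top_k)

cands = combine(["banco"],
                {"banco": 0.8, "margem": 0.3, "assento": 0.2,
                 "casa": 0.1, "rio": 0.05})
```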
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>The evaluation procedure was announced after submission of the translation
candidates and takes only precision (and no recall) into account. (Note that
recall is not well-defined in the case of lexical translation: while human experts
may easily agree on some unambiguous translations, thus making it feasible to
create a gold standard for calculating precision, they may quickly disagree on
particular or unusual translations, thus making it impossible to create a gold
standard for measuring recall.) For each language pair and system, 100
source-target candidate pairs were sampled. Subsequently, each translation pair
was reviewed manually according to whether the target term was a correct
(possible) translation of the source term.</p>
      <p>Two scalar performance measures are given (see Table 1): Precision is the
percentage of (manually determined) correct translations among the proposed
candidates. Additionally, “gold-precision” only labels those candidates as true
positives which can be found in the organizers’ (undisclosed) gold standard of
translations.</p>
    </sec>
    <sec id="sec-5">
      <title>Results and discussion</title>
      <p>Results for the graph-based approach (“graph”) and for the combination of
collocation-based and graph-based approaches (“combined”) can be found in
Table 1. Two findings seem noteworthy: Firstly, the purely graph-based method
consistently outperforms the combined approach on both evaluation measures.
Secondly, in the nl–fr setting both systems are drastically beaten by the baseline,
whereas in the other two settings both systems outperform the baseline.</p>
      <p>Obviously, our strategy of providing multiple translation candidates proved
to be suboptimal for the official task evaluation, which only focused on precision.
Note, however, that our system is easily adaptable in case a reasonable evaluation
measure is given a priori: both the graph-based and the collocation-based method
yield n-best lists of candidates with a scalar score function, enabling a more
sophisticated selection of actual candidates. (That is to say, if the system is to
focus on precision, a very small number of candidates should be given, and their
selection should be based on the distribution of the score functions of both the
graph-based and the collocation-based candidate lists.)</p>
      <p>[Table 1: Evaluation results. Precision of the graph-based system: da–es 0.93,
nl–fr 0.44, de–pt 0.85.]</p>
      <p>Advantages of our proposed graph-based system are twofold: Firstly, it does
not require cycles, i. e. it can be applied in a greater variety of settings. Secondly,
the weighting scheme takes into account the number of dictionaries involved and
the directionality in which they are used on the one hand, and, on the other, the
relative frequency of translation chains leading to a translation equivalent; thus,
the system automatically determines a suitable number of translation equivalents.</p>
      <p>The proposal of further candidates retrieved from the Europarl corpus has
turned out to be counterproductive for the reasons elaborated above. Nevertheless,
given more realistic settings in which recall of all (or most) possible translations
is important, retrieving candidates that are not contained in any of the bilingual
dictionaries (or those reached via atypical translation paths) seems desirable.
Future work will thus use more sophisticated methods for combining graph-based
and collocation-based candidates, e. g. the Borda count or the Schulze method.</p>
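      <p>As a sketch of what such a combination could look like, a Borda count over the two ranked candidate lists might be implemented as follows (this is a future-work illustration with invented candidates, not part of the submitted system):</p>

```python
# Future-work sketch: merge ranked candidate lists by Borda count.
# Each list awards len(list) - rank points to each of its candidates.
def borda(*rankings):
    scores = {}
    for ranking in rankings:
        for rank, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0) + len(ranking) - rank
    return sorted(scores, key=scores.get, reverse=True)

# Invented graph-based and collocation-based rankings for one source word:
merged = borda(["banco", "assento", "margem"], ["banco", "margem", "casa"])
# "banco" tops both input lists and therefore the merged ranking.
```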
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Daphna Shezaf and Ari Rappoport. Bilingual lexicon generation using non-aligned signatures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 98-107. Association for Computational Linguistics, 2010. URL http://aclweb.org/anthology/P10-1011.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1-38, 1996. URL http://aclweb.org/anthology/J96-1001.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Kumiko Tanaka and Kyoji Umemura. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, pages 297-303. Association for Computational Linguistics, 1994. URL http://aclweb.org/anthology/C94-1048.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214-2218, 2012. URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Marta Villegas, Maite Melero, N. Bel, and J. Gracia. Leveraging RDF graphs for crossing multiple bilingual dictionaries. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 868-876, 2016. URL http://www.lrec-conf.org/proceedings/lrec2016/pdf/613_Paper.pdf.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>