Translation inference across dictionaries via a combination of graph-based methods and co-occurrence statistics Thomas Proisl, Philipp Heinrich, Stefan Evert, Besim Kabashi Corpus Linguistics Group, Friedrich-Alexander-Universität Erlangen-Nürnberg Abstract. This system description explains how to use several bilingual dictionaries and aligned corpora in order to create translation candidates for novel language pairs. It proposes (1) a graph-based approach which does not depend on cyclical translations and (2) a combination of this method with a collocation-based model using the multilingually aligned Europarl corpus. 1 Introduction Translation of lexical items is a fundamental problem in computational linguistics which plays an important role not only in machine translation, but also in various more specific tasks such as mapping of queries, tags, denotators, and alike across different languages. With ever more bilingual lexicons being electronically available for some language pairs, the problem arises of how to use them to create new bilingual dictionaries. The organizers of the shared task on Translation Inference Across Dictionaries (TIAD) provided partial bilingual dictionaries for the following four language chains for the eight languages German (de), English (en), Portuguese (pt), Japanese (ja), Spanish (es), Dutch (nl), Danish (da), and French (fr): 1. German–English–Portuguese 2. German–Japanese–Spanish–Portuguese 3. German–Danish–French–Spanish–Portuguese 4. German–Dutch–Spanish–Danish–French–Portuguese The resulting language graph is visualized in Figure 1. In addition, the four chains also include Portuguese–German dictionaries for “closing the loop” (dashed edge). According to the task guidelines, use of the Portuguese–German dictionaries is limited to validation purposes. The objective of the task is to create three new dictionaries (dotted edges): German–Portuguese, Danish–Spanish and Dutch– French. A naïve approach to that problem would be to recursively collect all translation candidates: For each source word, take all translations of that word from the source–pivot1 dictionary; then, for each translation, take all translations from the pivot1 –pivot2 dictionary and so on until the target language is reached. 2 German 1 4 2 3 English Dutch Japanese Danish 4 4 2 3,4 1 3 Spanish French 2,3 4 Portuguese Fig. 1. The language graph. Numbers on the edges show which language chains in the above enumeration are using the respective edge. Dotted edges indicate the desired new direct translation paths. The problem with this approach is that it results in very noisy and divergent dictionaries. A common solution to that problem is to make use of cycles (cf. Section 2), in this case by utilizing the Portuguese–German dictionaries. We opted for a novel approach: Instead of relying on cycles, we apply a weighting scheme. We also experiment with combining the translation candidates found via this graph-based approach with candidates extracted from parallel corpora.1 2 Related work The automatic creation of multilingual dictionaries, especially the macro-structure of their entries and annotation interfaces (Kernerman, 2011) as well as the exploitation of resources such as aligned corpora and existing bilingual dictionaries, have attracted commercial and academic research projects for obvious reasons. Tanaka and Umemura (1994), for example, construct a bilingual dictionary using a third language as a pivot language by utilizing the structure of dictionaries and the lexical entries (nouns). They measure the nearness of the meaning of the 1 We are talking about candidates, since automatic translation techniques yield n-best lists of terms. Both the evaluation function which ranks the candidate terms as well as the precise value of n are at the very core of lexical translation research. 3 lexical entries to distinguish between true translation equivalents and spurious ones introduced as a result of ambiguity in the pivot language. Similarly, Kaji et al. (2008) construct a Japanese–Chinese dictionary using English as intermediate language. They use monolingual corpora of the first and second language to eliminate the spurious translations caused by the ambiguity of the third language. The wide-coverage monolingual corpora provide the basis for extracting word associations in one language and translation candidates in the target language. This method enables generating domain-specific translation candidates. Villegas et al. (2016) infer new translations for the languages in a graph of as many as 22 bilingual dictionaries. They consider translation candidates up to three languages away and assign a confidence score to those candidates, which is based on the density of cycles containing the potential target. A cycle is a translation chain which starts and ends at the same lexical item (for a formal definition of translation chains, see section 3.1). Similarly, Mausam et al. (2009) rely on cycles (“translation circuits” in their terms) to match senses probabilistically, and Saralegi et al. (2011) improve precision in pivot-based automatic creation of bilingual dictionaries by inverse consultation, i. e. by looking up translation candidates for all the possible candidates in the target language in the source language. This, however, only works if dictionaries in both directions are at disposal. Haghighi et al. (2008), on the other hand, do not use a third language at all: They learn bilingual dictionaries only using monolingual corpora and word features in each language. Last but not least, using noisy dictionaries as input, Shezaf and Rappoport (2010) present a method for generating higher-quality dictionaries: their method requires two (noisy) bilingual dictionaries (from the source language to the target language and vice versa) and two comparable monolingual corpora (one in the source language and one in the target language) as input and calculate similarity scores for translation candidates based on the number of words co-occurring with the source word that can be translated into words co-occurring with the target word. The collocation-based approach described in the present paper, on the other hand, employs a similar idea as can be found in Kovář et al. (2016), who use a transformation of the Dice coefficient for extracting translation candidates from parallel corpora with sentence alignment. 3 System description 3.1 Graph-based approach As mentioned in Section 1, we opted for a novel graph-based approach that does not rely on cycles. Instead, we use a weighting scheme. As an additional, self- imposed constraint, we do not make use of the Portuguese–German dictionaries at all. For our weighting scheme, we do not only use the four paths provided by the task organizers but all available simple chains from a source language to 4 a target language. Simple chains are paths that ignore the orientation of the edges and where no vertex can occur twice. We distinguish between language chains, i. e. chains from one language to another, as illustrated in Figure 1, and translation chains, i. e. chains from one word to another, via the languages in a given language chain. Formally, let Ls,t denote the set of language chains from source language s to target language t. Each language chain ` ∈ Ls,t is assigned a weight 1 w` = , (1) (|`| + |r` |) where |`| is the length of the chain and |r` | is the number of edges in ` that are traversed in reverse. The weights are normalized such that w` = 1. (2) X `∈Ls,t The intuition behind these weights is that the more intervening languages we have and the more dictionaries we use in reverse, the more the quality suffers. Therefore, short chains should get a higher weight than long ones and using a dictionary in reverse should be penalized. Let Rw,` denote the set of translation chains from word w in the source language of a language chain ` to words in the target language of that language chain. Each translation chain r ∈ Rw,` connects w to a potential translation equivalent e = τ (r). Each translation equivalent e in the set of translation equivalents Ew,` = {τ (r)|r ∈ Rw,` } (3) is assigned a weight |{r ∈ Rw,` |τ (r) = e}| we,` = . (4) |Rw,` | This weight corresponds to the relative frequency of translation chains from w to e via the languages in language chain `. Now that we have weights for all language chains and for all translations along a language chain, we can obtain all translation equivalents in the target language t for word w from the source language s, i. e. Ew = `∈Ls,t Ew,` . Each S translation equivalent e ∈ Ew is assigned a weight we = (5) X w` we,` . `∈Ls,t The weights are normalized such that e∈Ew we = 1. P Now we can simply select the n translation equivalents with the highest weights. But what is a suitable value for n, i. e. how can we determine the best number of translation equivalents for a given word? Let Rw = `∈Ls,t Rw,` be S 5 the set of all chains from word w in the source language s to words in the target language t. Then, we set l m n = |Ew | c , (6) 1 where dxe is the ceiling function and c = r∈Rw |r|/|Rw |. This means we approxi- P mate n by the average number of translations for each word along the translation chains for word w. 3.2 Collocation-based approach We make use of the Europarl corpus (see Koehn, 2005: release v7) in its pre- processed and sentence-aligned form (Tiedemann, 2012)2 . As a further pre- processing step, all monolingual corpora except for the Portuguese one are lemmatized with off-the-shelf algorithms. Unfortunately, we did not lemmatize the Portuguese corpus in time. For the language pair de–pt, our procedure thus yields lexical surface realizations as translation candidates (see below). We retrieve translation candidates by analyzing first-order (syntagmatic) collocations. The procedure is implemented via the R-package wordspace (Evert, 2014)3 . For each language pair, lemmata (or, in the case of Portuguese, types) are extracted together with their alignment beads from the corpus in order to create lemma-sentence matrices with the intersection of alignment beads as columns. As an example, the French corpus contains 28,100 lemmata, the Dutch one 36,048, and there is an intersection of 2,003,463 alignment beads. These matrices are then transformed into one term-term co-occurrence matrix for each language pair. The nl–fr co-occurrence matrix from the example above has thus 36,048 rows and 28,100 columns. Subsequently, the Dice score is calculated for each lemma of the source language (if it occurs in the corpus) and each target term. The Dice score is a de-facto standard for the determination of translation candidates (Smadja et al., 1996) and represents the harmonic mean of the conditional probabilities P {source|target} and P {target|source}. Let O11 denote the co-occurrence frequency of source and target term, R1 the marginal frequency of the target term and C1 the marginal frequency of the source term (notation and formula taken from Evert, 2008), then the Dice score can be calculated by means of 2O11 dice (O11 , R1 , C1 ) = . (7) R1 + C 1 The higher its value, the higher the association between source and target term. Thus, for every source term, the target terms with the highest Dice scores serve as translation candidates. Note that in this step we ignore all candidates which solely consist of punctuation marks and/or digits in order to improve translation quality. 2 http://opus.lingfil.uu.se/Europarl.php 3 http://wordspace.r-forge.r-project.org/ 6 3.3 Combination of collocation-based and graph-based approaches Without having an evaluation measure which determines the trade-off between precision and recall of the translation candidates, we opted for a very simple combination of the two approaches above: the final list of candidates is gained by union of the graph-based candidates and four collocation-based candidates.4 4 Evaluation The evaluation procedure was announced after submission of the translation candidates and solely takes precision (and no recall)5 into account. For each lan- guage pair and system, 100 source-target-candidates were sampled. Subsequently, each translation pair was reviewed manually according to whether the target term was a correct (possible) translation of the source term. Two scalar performance measures are given, see Table 1: Precision is the percentage of (manually determined) correct translations among the proposed candidates. Additionally, “gold-precision” only labels those candidates as “true positives” which can be found in the organizers’ (undisclosed) gold-standard of translations. 5 Results and discussion Results for the graph-based approached (“graph”) and the combination of collocation-based and graph-based approaches (“combined”) can be found in Table 1. Two findings seem noteworthy: Firstly, the solely graph-based method consistently outperforms the combined approach for both evaluation measures. Secondly, in the nl–fr language-pair setting, both systems are drastically beaten by the baseline, whereas in the other two settings both systems outperform the baseline. Obviously, our strategy of providing multiple translation candidates proved to be suboptimal for the official task evaluation, which only focused on precision. Note however that our system is easily adaptable in case a reasonable evaluation measure is given a priori: both graph-based and collocation-based methods yield n- best lists of candidates with a scalar score-function enabling a more sophisticated selection of actual candidates.6 4 The graph-based method yields between two and three candidates on average de- pending on the language pair. Assuming an overlap of one or two candidates between both methods, this heuristic guarantees that the collocation-based approach delivers approximately two additional candidates. 5 Note that recall is not well-defined in the case of lexical translation: while human experts may easily agree on some unambiguous translations (thus making it feasible to create a gold-standard for calculating precision), they might disagree quickly on particular or unusual translations (thus making it impossible to create a gold-standard for measuring recall). 6 That is to say: if the system is to focus on precision, a very small number of candidates should be given, and their selection should be based on the distribution of the score functions of both the graph-based and the collocation-based candidate lists. 7 precision gold-precision pair graph combined graph combined baseline da–es 0.93 0.73 0.35 0.25 0.59 nl–fr 0.44 0.37 0.28 0.25 0.52 de–pt 0.85 0.72 0.19 0.16 0.62 Table 1. Evaluation measures for all language pairs for both submitted systems (based on samples of size 100). Precision is the percentage of correct translations among the sampled candidates, gold-precision is the percentage of correct translations that were also part in the organizers’ gold standard. The baseline figures were provided by the task organizers and are based on a depth-first search for cycles of translations which include the desired source and target languages. Advantages of our proposed graph-based system are twofold: Firstly, it does not require cycles, i. e. it can be applied in greater variety of settings. Secondly, the weighting scheme takes into account the number of dictionaries involved and the directionality in which they are used on the one hand, and, on the other, the relative frequency of translation chains leading to a translation equivalent; thus, the system automatically determines a suitable number of translation equivalents. The proposal of further candidates retrieved from the Europarl corpus has turned out to be counterproductive for the reasons elaborated above. Nevertheless, given more realistic settings in which recall of all (or most) possible translations is important, retrieval of candidates not comprised in any of the bilingual corpora (or of those with atypical translation paths) seems desirable. Future work will thus use more sophisticated methods for combining graph-based and collocation-based candidates, e. g. by using the Borda count or the Schulz method. Bibliography Stefan Evert. Corpora and collocations. Corpus Linguistics. An Interna- tional Handbook, 2:1212–1248, 2008. URL http://citeseerx.ist.psu.edu/ viewdoc/download?doi=10.1.1.159.6220&rep=rep1&type=pdf. Stefan Evert. Distributional Semantics in R with the wordspace Package. In Proceedings of COLING 2014, the 25th International Conference on Computa- tional Linguistics: System Demonstrations, 2014. URL http://aclweb.org/ anthology/C14-2024. Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. Learning Bilingual Lexicons from Monolingual Corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 2008, pages 771–779, 2008. URL http://www. aclweb.org/anthology/P08-1#page=815. Hiroyuki Kaji, Shin’ichi Tamamura, and Dashtseren Erdenebat. Automatic Construction of a Japanese-Chinese Dictionary via English. In Proceedings of the Sixth International Conference on Language Resources and Evalua- tion (LREC’08), pages 699–706, 2008. URL http://www.lrec-conf.org/ proceedings/lrec2008/pdf/175_paper.pdf. Ilan Kernerman. From dictionary to database: Creating a global multi-language series. Electronic Lexicography in the 21st Century: New Applications for New Users. Proceedings of eLex 2011, Bled, 10–12 November 2011, pages 113–21, 2011. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1. 1.698.6117&rep=rep1&type=pdf. Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79–86, 2005. URL http://citeseerx.ist. psu.edu/viewdoc/download?doi=10.1.1.459.5497&rep=rep1&type=pdf. Vojtěch Kovář, Vít Baisa, and Miloš Jakubíček. Sketch Engine for Bilingual Lexicography. International Journal of Lexicography, 29(3):339–352, September 2016. ISSN 0950-3846, 1477-4577. doi: 10.1093/ijl/ecw029. URL https: //doi.org/10.1093/ijl/ecw029. Mausam, Stephen Soderland, Oren Etzioni, Daniel S. Weld, Michael Skinner, and Jeff Bilmes. Compiling a massive, multilingual dictionary via probabilistic infer- ence. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Process- ing of the AFNLP: Volume 1, pages 262–270. Association for Computational Linguistics, 2009. URL http://aclweb.org/anthology/P09-1030. Xabier Saralegi, Iker Manterola, and Inaki San Vicente. Analyzing methods for improving precision of pivot based bilingual dictionaries. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 846–856. Association for Computational Linguistics, 2011. URL http: //aclweb.org/anthology/D11-1078. Daphna Shezaf and Ari Rappoport. Bilingual lexicon generation using non-aligned signatures. In Proceedings of the 48th Annual Meeting of the Association 9 for Computational Linguistics, pages 98–107. Association for Computational Linguistics, 2010. URL http://aclweb.org/anthology/P10-1011. Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. Translating collocations for bilingual lexicons: A statistical approach. Computational Lin- guistics, 22(1):1–38, 1996. URL http://aclweb.org/anthology/J96-1001. Kumiko Tanaka and Kyoji Umemura. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th conference on Computational linguistics – Volume 1, pages 297–303. Association for Compu- tational Linguistics, 1994. URL http://aclweb.org/anthology/C94-1048. Jörg Tiedemann. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evalua- tion (LREC’12), pages 2214–2218, 2012. URL http://www.lrec-conf.org/ proceedings/lrec2012/pdf/463_Paper.pdf. Marta Villegas, Maite Melero, N. Bel, and J. Gracia. Leveraging RDF graphs for crossing multiple bilingual dictionaries. In Proceedings of the Tenth Interna- tional Conference on Language Resources and Evaluation (LREC’16), pages 868–876, 2016. URL http://www.lrec-conf.org/proceedings/lrec2016/ pdf/613_Paper.pdf.