=Paper=
{{Paper
|id=Vol-2493/summary
|storemode=property
|title=Results of the Translation Inference Across Dictionaries 2019 Shared Task
|pdfUrl=https://ceur-ws.org/Vol-2493/summary.pdf
|volume=Vol-2493
|authors=Jorge Gracia,Besim Kabashi,Ilan Kernerman,Marta Lanau-Coronas,Dorielle Lonke
|dblpUrl=https://dblp.org/rec/conf/ldk/GraciaKKLL19
}}
==Results of the Translation Inference Across Dictionaries 2019 Shared Task==
Results of the Translation Inference Across Dictionaries 2019 Shared Task Jorge Gracia1 , Besim Kabashi2,3 , Ilan Kernerman4 , Marta Lanau-Coronas1 , and Dorielle Lonke4 1 Aragon Institute of Engineering Research (I3A), University of Zaragoza, Spain {jogracia,mlanau}@unizar.es 2 Ludwig-Maximilian University of Munich, Germany 3 Friedrich-Alexander University of Erlangen-Nuremberg, Germany besim.kabashi@fau.de 4 K Dictionaries, Tel Aviv, Israel {ilan,dorielle}@kdictionaries.com Abstract. The objective of the Translation Inference Across Dictionar- ies (TIAD) shared task is to explore and compare methods and tech- niques that infer translations indirectly between language pairs, based on other bilingual/multilingual lexicographic resources. In its second, 2019, edition the participating systems were asked to generate new transla- tions automatically among three languages - English, French, Portuguese - based on known indirect translations contained in the Apertium RDF graph. The evaluation of the results was carried out by the organisers against manually compiled language pairs of K Dictionaries. This paper gives an overall description of the shard task, the evaluation data and methodology, and the systems’ results. Keywords: TIAD · Apertium RDF · translation inference · lexicographic data 1 Introduction A number of methods and techniques have been explored in the past aimed at automatically generating new bilingual and multilingual dictionaries based on existing ones. For instance, given a bilingual dictionary containing translations from one language L1 to another language L2, and another dictionary with trans- lations from L2 to L3, a new set of translations from L1 to L3 is produced. The intermediate language (L2 in this example) is called pivot language, and it is pos- sible to use multiple pivots for this purpose. When using intermediate languages, it is necessary to discriminate wrong inferred translations caused by translation ambiguities. The method proposed by Tanaka and Umemura [13] in 1994, called One Time Inverse Consultation (OTIC), identified incorrect translations when constructing bilingual dictionaries intermediated by a third language. This was a pioneering work in this field and it still constitutes a baseline that is hard to beat, as we will see in this paper. The OTIC method has been further adapted Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 J. Gracia et al. and evolved in the literature, for instance by Lim et al. [6], who grounded on it for their method for multilingual lexicons creation. From a different perspective, other works were proposed that relied on cycles and graph exploration to vali- date indirectly inferred translations, such as the SenseUniformPaths algorithm by Mousam et al. [7], the CQC algorithm by Flati et al. [2] or the exploration based on cycle density by Villegas et al. [15]. However, previous work on the topic of automatic bilingual/multilingual dic- tionary generation was usually conducted on different types of datasets and evaluated in different ways, applying various algorithms that are often not com- parable. In this context, the objective of the Translation Inference Across Dictio- naries (TIAD) shared task is to support a coherent experiment framework that enables reliable validation of results and solid comparison of the processes used. This initiative also aims to enhance further research on the topic of inferring translations across languages. In this paper, we give an overall description of the shard task, the evaluation data and methodology, and the systems results of TIAD 2019. The remainder of this paper is organised as follows. In Section 2, an overall description of the shared task is given. Section 3 describes the evaluation data and Section 4 explains the evaluation process. In Section 5 the systems results are reported, and conclusions are summarised in Section 6. 2 Shared task description The objective of TIAD shared task was to explore and compare methods and techniques that infer translations indirectly between language pairs, based on other bilingual resources. Such techniques would help in auto-generating new bilingual and multilingual dictionaries based on existing ones. In this second edition, the participating systems were asked to generate new translations automatically among three languages: English, French, and Por- tuguese, based on known translations contained in the Apertium RDF graph5 . As these languages (EN, FR, PT) are not directly connected in this graph, no translations can be obtained directly among them there. Based on the available RDF data, the participants had to apply their methodologies to derive transla- tions, mediated by any other language in the graph, between the pairs EN/FR, FR/PT and PT/EN. Participants could also make use of other freely available sources of back- ground knowledge (e.g. lexical linked open data and parallel corpora) to improve performance, as long as no direct translation among the studied language pairs were available. Beyond performance, participants were encouraged to consider also the following issues in particular: 1. The role of the language family with respect to the newly generated pairs 2. The asymmetry of pairs, and how translation direction affects the results 3. The behavior of different parts of speech among different languages 5 http://linguistic.linkeddata.es/apertium/ Results of the Translation Inference Across Dictionaries 2019 Shared Task 3 4. The role that the number of pivots plays in the process The evaluation of the results was carried out by the organisers against man- ually compiled pairs of K Dictionaries, extracted from its Global Series6 , which were not accessible to the participants. 3 Evaluation data In this section we briefly describe the input data source that has been proposed in the shared task as a source of known translations, i.e., Apertium RDF, as well as the data used as golden standard, from K Dictionaries. 3.1 Source data As mentioned above, the shared task relies on known translations contained in Apertium RDF, which were used to infer new ones. Apertium RDF is the linked data counterpart of the Apertium dictionary data. Apertium [3] is a free open-source machine translation platform. The system was initially created by Universitat d’Alacant and it is released under the terms of the GNU General Public License. In its core, Apertium relies on a set of bilingual dictionaries, de- veloped by a community of contributors, which covers more than 40 languages pairs. Apertium RDF [5] is the result of publishing 22 Apertium bilingual dictio- naries as linked data on the Web. The result groups the data of the (originally disparate) Apertium bilingual dictionaries in the same graph, interconnected through the common lexical entries of the monolingual lexicons that they share. In its fist version, Apertium RDF was modelled using the lemon model [8] jointly with its translation module [12]. Each Apertium bilingual dictionary was converted into three different objects in RDF: source lexicon, target lexicon, and translation set. As a result, two independent monolingual lexicons were published as linked data on the Web per dictionary, along with a set of translations that connects them. Notice that the naming rule used to build the identifiers (URIs) of the lexical entries allows to reuse the same URI per lexical entry across all the dictionaries, thus explicitly connecting them. For instance the same URI is used for the English word bench as a noun: http://linguistic.linkeddata.es/ id/apertium/lexiconEN/bench-n-en throughout the Apertium RDF graph, no matter if it comes from, e.g., the EN-ES dictionary or the CA-EN. More details about the generation of Apertium RDF based on the Apertium data can be found at [5]. Figure 1 illustrates the Apertium RDF unified graph. The nodes in the figure are the languages and the edges are the translation sets between them. All the generated information is accessible on the Web both for humans (via a Web in- 6 https://www.lexicala.com/ 4 J. Gracia et al. Fig. 1. The Apertium RDF graph. The nodes in the figure represent the monolingual lexicons and the edges are the translation sets between them. The darker the colour, the more connections a node has. We have highlighted the three languages of this evaluation campaign: PT, FR, and EN. terface7 ) and software agents (with SPARQL8 ). All the datasets are documented in Datahub9 . There were several ways in which the evaluation data was available to the participants: though the data dumps available in Datahub, through the SPARQL endpoint10 , and in a ZIP file in tab separated values (TSV) format11 . More details on how to access the data are available in the TIAD 2019 website12 . 3.2 Gold standard The evaluation of the results was carried out by the organisers against manu- ally compiled language pairs of K Dictionaries, extracted from its Global series, particularly the following pairs: BR-EN, EN-BR, FR-EN, EN-FR, FR-PT, PT- FR. The translation pairs extracted from these dictionaries served as a golden 7 http://linguistic.linkeddata.es/apertium/ 8 http://linguistic.linkeddata.es/apertium/sparql-editor/ 9 https://datahub.ckan.io/dataset?q=apertium+rdf 10 See an example query at https://tiad2019.unizar.es/docs/ApertiumRDF_ ExampleQuery_10.txt 11 https://tiad2019.unizar.es/data/TranslationSetsApertiumRDF.zip 12 See the “how to get the data source” section at https://tiad2019.unizar.es/task. html Results of the Translation Inference Across Dictionaries 2019 Shared Task 5 standard and remained blind to the participants. Notice that the Brazilian Por- tuguese variant was used for the translations to/from English (whereas the Eu- ropean Portuguese variant was used with French), which might introduce a bias; however its influence should be equivalent to every participant system thus still allowing for a valid comparison. Given the fact that the coverage of KD is not the same as Apertium, we took the subset of KD that is covered by Apertium to build the gold standard and allow comparisons, i.e., those KD translations for which the source and target terms are present in both Apertium RDF source and target lexicons. This is shown graphically in Figure 2 for the FR-PT pair. Fig. 2. Gold standard construction for the FR-PT pair. The translations in the dashed area in the middle of the figure constitute the gold standard, selected amongst all the KD translations (for FR-PT) for which both source and target lexical entries are present in their respective Apertium RDF lexicons. Table 1 shows the size (in number of translations) of the different language pairs in the gold standard. Table 1. Number of translations per language pair in the gold standard. Lang. pair Size EN-FR 14,512 EN-PT 12,811 FR-EN 20,800 FR-PT 10,791 PT-EN 17,498 PT-FR 10,808 6 J. Gracia et al. 4 Evaluation methodology The participants run their systems locally, using the Apertium RDF data as known translations, to infer new translations among the three studied languages: FR, EN, PT. Once the output data (inferred translations) were obtained, they loaded the results into a file per language pair in TSV format, containing the following information per row (tab separated): “source written representation” “target written representation” “part of speech” “confidence score” The confidence score takes float values between 0 and 1 and is a measure of the confidence that the translation holds between the source and target written representations. If a system does not compute confidence scores, this value had to be put to 1. 4.1 Evaluation process The organisers compared the obtained results with the gold standard automat- ically. This process was followed for each system results file and per language pair: 1. Remove duplicated translations (some systems produced duplicated rows, i.e., identical source and target words, POS and confidence degree). 2. Filter out translations for which the source entry is not present in the golden standard (otherwise we cannot assess whether the translation is correct or not). We call systemGS the subset of translations that passed this filter, and GS the whole set of gold standard translations, in the given language pair. 3. Translations with confidence degree under a given threshold were removed from systemGS. In principle, the used threshold is the one reported by participants as the optimal one during the training/preparation phase. 4. Compute the coverage of the system with respect to the gold standard, i.e., how many gold standard entries in the source language were effectively translated by the system (no matter if they were correct or wrong ones). 5. Compute precision as P =(#correct translations in systemGS) / systemGS 6. Compute recall as R =(#correct translations in systemGS) / GS 7. Compute F-measure as F = 2 ∗ P ∗ R/(P + R) 4.2 Baselines We have run the above evaluation process with results obtained with two base- lines, to be compared with the participating systems results: Results of the Translation Inference Across Dictionaries 2019 Shared Task 7 Baseline 1 - Word2Vec. The method uses Word2Vec [11] to transform the graph into a vector space. A graph edge is interpreted as a sentence and the nodes are word forms with their POS tag. Word2Vec iterates multiple times over the graph and learns multilingual embeddings (without additional data). We used the Gensim13 Word2Vec implementation. For a given input word, we calculated a distance based on the cosine similarity of a word to every other word with the target-POS tag in the target language. The square of the distance from source to target word is interpreted as the confidence degree. For the first word the minimum distance is 0.62 , for the others it is 0.82 . Therefore multiple results are only in the output if the confidence is not extremely weak. In our evaluation, we applied an arbitrary threshold of 0.5 to the confidence degree. Baseline 2 - OTIC. In short, the idea of the One Time Inverse Consulta- tion (OTIC) method [13] is to explore, for a given word, the possible candidate translations that can be obtained through intermediate translations in the pivot language. Then, a score is assigned to each candidate translation based on the degree of overlap between the pivot translations shared by both the source and target words. In our evaluation, we have applied the OTIC method using Spanish as pivot language, and using an arbitrary threshold of 0.5. 5 Results In this section we review the participating systems in TIAD 2019 and their evaluation results. 5.1 Participating systems Four teams participated in the shared task. Unlike the first TIAD edition [10], all of them were able to complete the evaluation. The participants contributed with eleven system results. One team (Frankfurt) submitted the results of a single system, while the other three run the experiment on several systems or variations of the same system. Table 2 lists the participant teams and systems. The first team, Garcı́a et al. from Universidade da Coruña, developed four systems [4]: three transitive systems differing only in the pivot language used, and a fourth system based on a different approach which only needs monolin- gual corpora in both the source and target languages. All four methods make use of cross-lingual word embeddings trained on monolingual corpora, and then mapped into a shared vector space. The second team, Torregrosa et al. from National University of Ireland Galway, presented three methods [14] based on graph analysis and neural machine that did not make use of parallel data. The third contribution, by John P. McCrae, also from National University of Ireland Galway [9] applied explicit topic modelling over comparable corpora to the task 13 https://radimrehurek.com/gensim/ 8 J. Gracia et al. Table 2. Participant systems. Team System Comment Using the third language of the shared task as pivot LyS (e.g., PT is pivot in an EN- Garcı́a et al. (Univ. da Coruña) [4] FR translation) LyS EN English as pivot language LyS CA Catalan as pivot language LyS DT No pivot language UNLP-4CYCLE Cycle based approach Torregrosa et al. (National UNLP-GRAPH Graph based approach University of Ireland Galway) [14] Neural Machine Translation UNLP-NMT-3PATH and Path based approach UNLP-NMT- Neural Machine Translation 4CYCLE and Cycle based approach McCrae (National University of ONETA-ES Spanish as pivot language Ireland Galway) [9] ONETA-CA Catalan as pivot language Donandt and Chiarcos (Goethe Multilingual word embed- FRANKFURT Universität at Frankfurt) [1] dings of inferring translation candidates. In particular, he used the Orthonormal Ex- plicit Topic Analysis (ONETA) model. Finally, the fourth team, Donandt and Chiarcos from Goethe-Universität at Frankfurt, constructed a multi-lingual word embedding space by projecting new languages in the feature space of a language for which a pre-trained embedding model exists [1]. They used the similarity of the word embeddings to predict candidate translations. 5.2 Evaluation results The complete evaluation results per system and per language pair are accessible in the TIAD 2019 website14 . In order to give an overview of the results, we include here Table 3, which shows the averaged results, evaluated by using the confidence threshold that every participant reported as optimal according to their internal tests. In addition, we evaluated the systems results with other thresholds in the range [0,1]. The results are plotted in Figure 3. 5.3 Discussion As can be seen in Table 3, the two baselines obtained better results than the par- ticipating systems in terms of F-measure, which gives an idea of the difficulty of the task. Strictly speaking, these are not baselines as they are conceived in other shared tasks, meaning naive approaches with a straightforward implementation, but state-of-the-art methods to solve the task. 14 See https://tiad2019.unizar.es/results.html under the section “Evaluation re- sults”. Results of the Translation Inference Across Dictionaries 2019 Shared Task 9 Table 3. Averaged system results, ordered by F-measure in descending order. System Precision Recall F-measure Coverage BASELINE(OTIC) 0.64 0.26 0.37 0.45 BASELINE(Word2Vec) 0.66 0.24 0.35 0.51 FRANKFURT 0.64 0.22 0.32 0.43 LyS-DT 0.36 0.31 0.32 0.64 LyS-ES 0.33 0.3 0.31 0.64 LyS-CA 0.31 0.29 0.29 0.64 LyS 0.32 0.28 0.29 0.64 UNLP-NMT-3PATH 0.66 0.13 0.21 0.25 UNLP-GRAPH 0.76 0.1 0.18 0.2 UNLP-NMT-4CYCLE 0.58 0.11 0.18 0.25 ONETA-ES 0.81 0.1 0.17 0.17 ONETA-CA 0.83 0.08 0.14 0.13 UNLP-4CYCLE 0.75 0.07 0.11 0.13 Some of the participating systems kept a good balance between precision and recall (FRANKFURT, LyS-DT) while some promoted precision at the cost of recall (ONETA, UNLP), and others obtained very good recall and coverage at the cost of precision (LyS, LyS-ES, LyS-CA). Interestingly, the OTIC method, based on purely graph exploration and dated back to 1994, outperformed more contemporary methods based on word embeddings and distributional seman- tics. We argue, however, that OTIC is not upper bound and that there is still much room for improvement for such recent methods, that could benefit from a different selection of training data and dictionary-related features. Notice that the precision values shown in Table 3 are conservative since there is a small but undefined number of false negatives (correct translations that are not present in the gold standard) that can be found in the results. Some exam- ples, from the EN→FR set of translations: “wizard”→“sorcier” noun 0.81 [BASELINE Word2Vec] “abandon”→“quitter” verb 0.99 [FRANKFURT] “dump”→“vider” verb 0.71 [LyS-CA] “ban”→“prohibition” noun 0.31 [ONETA-CA] “portion”→“ration” noun 0.4 [UNLP-GRAPH] 6 Conclusions In this paper we have given an overview of the 2nd Translation Inference Across Dictionaries (TIAD) shared task, and a description of the results obtained by the 11 participating systems and two baselines. In this edition, the participating systems were asked to generate new translations automatically among English, French, Portuguese, based on known indirect translations contained in the Aper- tium RDF graph. The evaluation of the results was carried out by the organisers against manually compiled pairs of K Dictionaries. 10 J. Gracia et al. Fig. 3. Averaged system results (F-measure) with variable threshold. The results are promising and illustrate the difficulty of the tasks, show- ing that there is still much room for research and improvement in the area of translation inference across dictionaries. 7 Acknowledgements We would like to thank Michael Ruppert (University of Erlangen-Nuremberg) for his assistance with the Word2Vec baseline. This work has been supported by the European Union’s Horizon 2020 research and innovation programme through the projects Lynx (grant agreement No 780602), Elexis (grant agreement No 731015) and Prêt-à-LLOD (grant agreement No 825182). It has been also partially sup- ported by the Spanish National projects TIN2016-78011-C4-3-R (AEI/ FEDER, UE) and DGA/FEDER. References 1. Donandt, K., Chiarcos, C.: Translation inference through multi-lingual word em- bedding similarity. In: Proc. of TIAD-2019 Shared Task Translation Inference Across Dictionaries, at 2nd Language Data and Knowledge (LDK) conference. CEUR-WS (May 2019) Results of the Translation Inference Across Dictionaries 2019 Shared Task 11 2. Flati, T., Navigli, R.: The CQC Algorithm: Cycling in Graphs to Semantically Enrich and Enhance a Bilingual Dictionary (Extended Abstract). In: Proce. of the 23th International Joint Conference on Artificial Intelligence. pp. 3151–3155. IJCAI ’13, AAAI Press (2013) 3. Forcada, M.L., Ginestı́-Rosell, M., Nordfalk, J., O’Regan, J., Ortiz-Rojas, S., Pérez- Ortiz, J.A., Sánchez-Martı́nez, F., Ramı́rez-Sánchez, G., Tyers, F.: Apertium: a free/open-source platform for rule-based machine translation. Machine Translation 25(2), 127–144 (2011) 4. Garcı́a, M., Garcı́a-Salido, M., Alonso, M.A.: Exploring cross-lingual word embed- dings for the inference of bilingual dictionaries. In: Proc. of TIAD-2019 Shared Task Translation Inference Across Dictionaries, at 2nd Language Data and Knowledge (LDK) conference. CEUR-WS (May 2019) 5. Gracia, J., Villegas, M., Gómez-Pérez, A., Bel, N.: The apertium bilingual dictio- naries on the web of data. Semantic Web 9(2), 231–240 (2018) 6. Lim, L.T., Ranaivo-Malançon, B., Tang, E.K.: Low Cost Construction of a Multi- lingual Lexicon from Bilingual Lists. Polibits 43, 45–51 (2011) 7. Mausam, Soderland, S., Etzioni, O., Weld, D.S., Skinner, M., Bilmes, J.: Compiling a Massive, Multilingual Dictionary via Probabilistic Inference. In: Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1. pp. 262–270. ACL ’09, Association for Computational Linguistics, Stroudsburg, PA, USA (2009) 8. McCrae, J., Aguado-de Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez- Pérez, A., Gracia, J., Hollink, L., Montiel-Ponsoda, E., Spohr, D., Wunner, T.: Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation 46, 701–719 (2012) 9. McCrae, J.P.: Tiad shared task 2019: Orthonormal explicit topic analysis for trans- lation inference across dictionaries. In: Proc. of TIAD-2019 Shared Task Transla- tion Inference Across Dictionaries, at 2nd Language Data and Knowledge (LDK) conference. CEUR-WS (May 2019) 10. McCrae, J.P., Bond, F., Buitelaar, P., Cimiano, P., Declerck, T., Gracia, J., Kerner- man, I., Montiel-Ponsoda, E., Ordan, N., Piasecki, M. (eds.): Proc.of the LDK 2017 Workshops: 1st Workshop on the OntoLex Model (OntoLex-2017), Shared Task on Translation Inference Across Dictionaries & Challenges for Wordnets co-located with 1st Conference on Language, Data and Knowledge (LDK 2017). CEUR Press, Galway (Ireland) (2017) 11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Rep- resentations in Vector Space. In: Proc. of International Conference on Learning Representations (ICLR) (2013) 12. Montiel-Ponsoda, E., Gracia, J., Aguado-De-Cea, G., Gómez-Pérez, A.: Represent- ing translations on the semantic Web. In: Proc. of the 2nd International Workshop on the Multilingual Semantic Web (MSW) at ISWC ’11. vol. 775. CEUR Press (2011) 13. Tanaka, K., Umemura, K.: Construction of a Bilingual Dictionary Intermediated by a Third Language. In: COLING. pp. 297–303 (1994) 14. Torregrosa, D., Arcan, M., Ahmadi, S., McCrae, J.P.: Tiad 2019 shared task: Lever- aging knowledge graphs with neural machine translation for automatic multilingual dictionary generation. In: Proc. of TIAD-2019 Shared Task Translation Inference Across Dictionaries, at 2nd Language Data and Knowledge (LDK) conference. CEUR-WS (May 2019) 12 J. Gracia et al. 15. Villegas, M., Melero, M., Bel, N., Gracia, J., Bel, N.: Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries. In: Proc. of 10th Language Resources and Evaluation Conference (LREC’16) Portorož (Slovenia). pp. 868–876. European Language Resources Association (ELRA), Paris, France (may 2016)