Benchmarking Azerbaijani Neural Machine Translation

Chih-Chen Chen, William Chen
University of Central Florida

Abstract
Little research has been done on Neural Machine Translation (NMT) for Azerbaijani. In this paper, we benchmark the performance of Azerbaijani-English NMT systems on a range of techniques and datasets. We evaluate which segmentation techniques work best for Azerbaijani translation and benchmark the performance of Azerbaijani NMT models across several domains of text. Our results show that while Unigram segmentation improves NMT performance and Azerbaijani translation models scale better with dataset quality than quantity, cross-domain generalization remains a challenge.

1. Introduction
With the recent growth in online resources, robust NLP systems have become increasingly available for many of the world's languages. However, this growth has not been enjoyed equally, and technologies for many languages are still under-developed, especially relative to the size of their speaker populations. This remains the case for morphologically-complex languages, which have been considered a challenge for NLP systems due to the frequency of rare/unknown words. One such example is Azerbaijani, a Turkic language with a highly agglutinative and complex morphology. It has two major varieties: the Northern variant is spoken in the Republic of Azerbaijan, while Southern Azerbaijani is spoken in regions of Iran. Our experiments focus on Northern Azerbaijani, which is written in Latin script and has considerably more online resources able to support the development of NMT systems.

Little work has been done on NLP systems for Azerbaijani, and even less on machine translation and other generative Seq2Seq tasks. Specifically, there is a lack of benchmarks on the performance of Azerbaijani NMT and the methods that could be used to improve it. Existing studies either use private datasets with unpublished training, testing, and validation splits [1] or solely evaluate on very low-resource scenarios with transfer learning techniques [2]. We build on the approach developed by Guntara et al. [3], who sought to develop benchmarks for Indonesian NMT, and extend it to include the evaluation of different pre-processing techniques for Azerbaijani NMT. Our goal is to help address these problems by investigating the following research questions regarding Azerbaijani translation:

1. What segmentation methods work best for Azerbaijani NMT?
2. How important is data cleanliness versus training corpus size for Azerbaijani NMT?
3. How do Azerbaijani translation systems perform across different language domains?

To answer these questions, we set up the following experiments:

1. We evaluate the performance of different segmentation algorithms to see which perform best for Azerbaijani.
2. We evaluate the effectiveness of scaling to larger training corpora at the cost of alignment quality in Azerbaijani NMT.
3. We categorize open-source Azerbaijani corpora into different domains and evaluate the effectiveness of NMT models trained on individual and multiple domains.
Our results showed that both the choice of evaluation metric and the choice of segmentation algorithm have a large impact on which models come out as the best performing, underscoring the importance of evaluating across multiple metrics. We also found that sentence alignment quality was a large factor in model performance; adding large but noisy or out-of-domain training datasets did not necessarily translate to improved performance.

2. Related Work
Studies on morphologically-complex languages tend to focus on higher-resource Turkish or extremely low-resource languages like Inuktitut or Quechua. However, many experiments have used Azerbaijani to demonstrate the effects of transfer learning and multilinguality due to its relationship with Turkish. Early MT systems for Azerbaijani were built by Fatullayev et al. [4]. Their models were a hybrid between rule-based and statistical machine translation, and could translate to/from English and Turkish. Qi et al. [2] experimented with Azerbaijani in a low-resource setting, improving NMT by aligning pre-trained word embeddings. They showed that including Turkish with Azerbaijani in multilingual NMT significantly improved BLEU score. Neubig and Hu [5] explored training paradigms for multilingual NMT that also leverage Turkish to improve Azerbaijani translation. Kim et al. [6] showed the effectiveness of cross-lingual word embeddings in improving low-resource Azerbaijani NMT. The most recent work on bilingual Azerbaijani NMT was by Maimaiti et al. [1], who used Azerbaijani-to-Chinese and Uzbek-to-Chinese translation as case studies for transfer learning with pre-trained lexicon embeddings.

Many studies have examined the effect of subword segmentation algorithms on downstream NMT. Sennrich et al. [7] and Kudo [8] show that such algorithms improve the performance of NMT models using Byte-Pair Encoding (BPE) and Unigram segmentation, respectively. While BPE has generally been the standard, recent works show that the Unigram algorithm performs better on agglutinative languages [9, 10, 11]. Mager et al. [12] compared the performance of BPE to morphological segmentation algorithms for indigenous American languages and found that state-of-the-art morphological segmentation methods did not translate to improved performance on NMT. Results in a similar study by Sälevä and Lignos [13], comparing BPE with LMVR [14] and MORSEL [15] on Nepali, Sinhala, and Kazakh, were inconclusive: the best-performing segmentation algorithm was language-dependent and the results were statistically indistinguishable. Pre-processing techniques have also been a feature of interest in low-resource translation shared tasks. Chen and Fazio [16] found that Unigram segmentation [8] performed the best for Marathi-English translation at LoResMT 2021 [17]. Vázquez et al. [18] leveraged data cleaning and normalization techniques to overcome differences in orthographic conventions for multilingual models at AmericasNLP 2021 [19].

3. Experimental Setup
For all of our experiments we use the OpenNMT-py [20] implementation of the Transformer [21]. We use the setup from Chen and Fazio [9], which has been shown to perform well with agglutinative languages. The architecture consists of 6 encoder/decoder layers, 8 attention heads, 256-dimensional word vectors, and a feed-forward dimension of 2048. The models were trained for 50,000 steps with a batch size of 32.
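To make this setup concrete, the following is a minimal sketch of an OpenNMT-py-style training configuration using the hyperparameters listed above. It is not our released configuration: option names follow OpenNMT-py 2.x conventions and may differ between versions, and all file paths and the vocabulary size of the sketch are illustrative assumptions.

```python
# A minimal sketch (not the authors' released config) of an OpenNMT-py-style
# YAML training configuration; option names follow OpenNMT-py 2.x and may
# differ between versions. All paths are illustrative assumptions.
import yaml

config = {
    "data": {  # assumed corpus layout
        "corpus_1": {"path_src": "data/train.az", "path_tgt": "data/train.en"},
        "valid": {"path_src": "data/valid.az", "path_tgt": "data/valid.en"},
    },
    "src_vocab": "data/vocab.az",
    "tgt_vocab": "data/vocab.en",
    "save_model": "models/az-en",
    # Transformer architecture described above
    "encoder_type": "transformer",
    "decoder_type": "transformer",
    "enc_layers": 6,
    "dec_layers": 6,
    "heads": 8,
    "word_vec_size": 256,
    "rnn_size": 256,        # model dimension (renamed hidden_size in newer versions)
    "transformer_ff": 2048,
    # training schedule described above
    "train_steps": 50000,
    "batch_size": 32,
    "batch_type": "sents",
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)
# training is then launched with: onmt_train -config config.yaml
```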
Translation quality is evaluated using COMET [22] and the sacreBLEU [23] implementations of BLEU [24] and chrF [25]. Kocmi et al. [26] recommended the use of COMET and chrF, which they found to be the metrics that best correspond to human judgement. We also report BLEU scores due to their standard use in machine translation. Each model was independently trained 10 times, and the scores presented below are the averages across all trials.

3.1. Q1: Segmentation Algorithms for Azerbaijani
A common pre-processing technique to improve the performance of NLP systems is subword segmentation: separating words into smaller units to decrease vocabulary size and help the model generalize to unknown vocabulary. The goal of our first set of experiments is to identify which subword segmentation algorithms work best for Azerbaijani. We use the Azerbaijani-English portion of WikiMatrix [27], which consists of 276k parallel sentences. The WikiMatrix dataset provides the LASER [28] score of each sentence pair, which measures the likelihood of a sentence pair being mutual translations. Filtering out sentence pairs with a score less than 1.04 (the recommended LASER threshold) reduces the dataset size to 70,725. The cleaned dataset is then split into 47,385 training sentences, 11,670 validation sentences, and 11,670 test sentences.

Models are trained on text segmented by different techniques: Byte-Pair Encoding (BPE) [7], BPE-Guided [29], Unigram [8], and PRPE [30]. BPE and Unigram segmentation are the two most popular segmentation algorithms used in state-of-the-art NMT systems due to their flexibility and ease of use. BPE-Guided [29] and PRPE [30] are morphologically-motivated algorithms that have been shown to perform well on NMT for agglutinative languages [29, 9]. Prior to subword segmentation, the text is first tokenized with the Moses tokenizer [31].

BPE first splits the corpus into a character-level representation. The most frequently occurring pair of tokens is then merged, a process that is repeated until a pre-defined number of merge operations has been reached. BPE-Guided is an extension of the BPE algorithm that incorporates morphological information through a list of known affixes. BPE-Guided creates a glossary of words that do not contain any known affixes, which the main BPE algorithm then uses as a list of words to leave unsegmented.

Unigram segmentation is a probabilistic segmentation algorithm based on a unigram language model [8]. A vocabulary of a pre-defined size is first built by iteratively discarding the subwords whose removal least reduces the corpus likelihood, with subword occurrence probabilities estimated via the expectation-maximization algorithm. The output segmentation of a word is then obtained by choosing the most probable segmentation candidate found with the Viterbi algorithm [32].

Prefix-Root-Postfix-Encoding (PRPE) segments a word into three main parts: a prefix, a root, and a postfix. The algorithm first learns a subword vocabulary of prefixes and postfixes with the help of a language-specific heuristic. PRPE then uses any detected instances of those affixes in a word to extract potential roots and obtain the most probable segmentation of the word.
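As a concrete reference point, the sketch below shows how the two general-purpose algorithms, BPE and Unigram, can be trained and applied with the SentencePiece library; BPE-Guided and PRPE rely on their authors' implementations and are not covered here. The corpus path, vocabulary size, and example sentence are illustrative assumptions, not the paper's exact settings.

```python
# A hedged sketch of subword model training with SentencePiece, which
# implements both BPE and Unigram. Paths, vocabulary size, and the example
# sentence are illustrative assumptions, not the paper's exact settings.
import sentencepiece as spm

# Train one model per algorithm on the Moses-tokenized training corpus.
for algo in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="data/train.az",   # assumed path to the source-side training text
        model_prefix=f"az_{algo}",
        vocab_size=8000,         # illustrative; the paper does not report this value
        model_type=algo,
    )

# Segment a sentence with the trained Unigram model.
sp = spm.SentencePieceProcessor(model_file="az_unigram.model")
print(sp.encode("Bakı Azərbaycanın paytaxtıdır.", out_type=str))
# e.g. ['▁Bakı', '▁Azərbaycan', 'ın', '▁paytaxt', 'ıdır', '.'] (output will vary)
```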
Segmentation Algorithm | BLEU  | chrF   | COMET  | p-value
None                   | 1.596 | 13.136 | -1.205 |
BPE                    | 1.567 | 13.710 | -1.207 | 0.0240
BPE-Guided             | 1.517 | 12.010 | -1.234 | 0.0006
PRPE                   | 1.625 | 13.615 | -1.195 | 0.0099
Unigram                | 1.730 | 14.150 | -1.188 | 0.0013

Table 1: A comparison of different segmentation algorithms on Northern Azerbaijani to English NMT. Higher scores indicate better performance. p-values are calculated using the average COMET score of the given algorithm compared to that of no segmentation.

The BLEU, chrF, and COMET scores are found in Table 1; p-values calculated with a paired Student's t-test between a chosen segmentation algorithm's COMET score and the no-segmentation baseline are also included. Most segmentation methods obtained higher chrF scores than the no-segmentation baseline. Unigram segmentation performed the best, achieving the highest scores in all three evaluation metrics. PRPE was the second-best performing algorithm in BLEU and COMET, but scored lower than BPE in terms of chrF. Interestingly, these two algorithms were also the only ones that performed better than the baseline in terms of COMET score. These results show that both the metric and the segmentation algorithm used can have a significant impact on which models are designated as "the best performing", and further encourage reporting results across multiple evaluation metrics in future work.

3.2. Q2: Dataset Size vs Cleanliness
We conducted a second set of experiments to examine the tradeoff between dataset cleanliness and dataset size with regard to NMT performance, using the LASER [28] alignment scores provided with the WikiMatrix dataset [27] as a measurement of cleanliness. To do so, we created additional training datasets from the WikiMatrix sentence pairs left unused in Section 3.1. We combine these remaining sentences with the clean 47k-sentence training set to form a noisy 252k-sentence training dataset. As a middle ground, we also create a third training dataset of 120k sentences by keeping only the sentence pairs from the large noisy dataset with a score of at least 1.03. The validation and test sets are reused from Section 3.1. The text was not pre-processed with any subword segmentation algorithm, so that any change in the performance metrics can be attributed to the change in training data.

Training Dataset        | # Sentences | BLEU  | chrF   | COMET
Clean (T=1.04)          | 47,385      | 1.596 | 13.136 | -1.205
Slightly Noisy (T=1.03) | 119,725     | 2.276 | 12.614 | -1.292
Noisy (T=0)             | 252,255     | 2.488 | 11.460 | -1.399

Table 2: A comparison of the tradeoff between dataset size and cleanliness. T is the LASER score threshold used to filter sentence pairs; the score measures the likelihood that two sentences are mutual translations.

The results (Table 2) provide an interesting reflection of how the evaluation metrics are calculated. BLEU [24] scores increased as the training dataset size grew, but chrF [25] and COMET [22] scores decreased. We hypothesize that this is because the additional training data increased the vocabulary size of the model and thus allowed it to recognize otherwise unknown words in the test set. Our results corroborate the findings of Kocmi et al. [26] and show the inaccuracy of BLEU compared to other metrics: evaluating only with BLEU would indicate that training on the smaller dataset was worse, despite the opposite holding true.
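A minimal sketch of the threshold-based filtering used to construct these three training sets is shown below. It assumes a WikiMatrix-style tab-separated file in which each line holds a LASER score followed by the source and target sentences; the exact release format may differ, and the file name is a placeholder.

```python
# A minimal sketch (assumed file format) of LASER-threshold filtering over a
# WikiMatrix-style TSV file where each line is: score <TAB> source <TAB> target.
def filter_by_laser(path, threshold):
    """Keep sentence pairs whose LASER score is at least `threshold`."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            score, src, tgt = line.rstrip("\n").split("\t", 2)
            if float(score) >= threshold:
                pairs.append((src, tgt))
    return pairs

# e.g. the three thresholds used above (file name is a placeholder)
clean = filter_by_laser("WikiMatrix.az-en.tsv", 1.04)
slightly_noisy = filter_by_laser("WikiMatrix.az-en.tsv", 1.03)
noisy = filter_by_laser("WikiMatrix.az-en.tsv", 0.0)
```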
3.3. Q3: Domain Benchmarks
Our final experiment evaluates the performance of Azerbaijani NMT models across several domains of text. We first obtained all Azerbaijani-English (az-en) data from OPUS [33], which consists of the following parallel corpora: WikiMatrix [27], CCMatrix [34], Tatoeba, the ELRC public corpora, Tanzil, GNOME [35], QED [36], TED2020 [37], and XLEnt [38]. The corpora were categorized by domain (Table 3); the domains with little data (lecture, news, and tech) were aggregated into a larger "Mixed" domain dataset. We thus evaluate models on four different datasets: General (1,325,660 lines), Religious (269,445 lines), Entities (298,236 lines), and Mixed (68,256 lines). Each dataset was then split into 66.7% training sentences, 16.6% validation sentences, and 16.6% test sentences. All text is pre-processed with the Moses tokenizer [31] and segmented with a Unigram segmentation model [8].

Dataset             | # Sentences | Domain
CCMatrix            | 1,251,255   | General
WikiMatrix (T=1.04) | 70,725      | General
Tatoeba             | 3,680       | General
ELRC                | 129         | News
Tanzil              | 269,445     | Religious
GNOME               | 40,075      | Tech
QED                 | 16,442      | Lecture
TED2020             | 11,610      | Lecture
XLEnt               | 298,236     | Entities
Total               | 1,961,597   |

Table 3: Dataset statistics.

We independently train models on each dataset. To evaluate the systems' ability to generalize across domains, we train another model on the data combined across all four datasets. The models are trained for 300,000 steps and are evaluated using the best-performing checkpoint on the validation set. The four domain-specific models are evaluated on the test set of their domain, and the model trained on combined data is evaluated on each domain.

          | Trained on Domain Only     | Trained on Combined Data
Test Set  | BLEU   | chrF   | COMET    | BLEU   | chrF   | COMET
General   | 5.55   | 16.999 | -1.069   | 3.981  | 14.795 | -1.1658
Religious | 23.199 | 44.535 | -0.818   | 17.285 | 34.285 | -0.6010
Entities  | 7.607  | 19.845 | -0.929   | 1.279  | 11.428 | -1.1751
Mixed     | 22.725 | 35.648 | -0.136   | 4.555  | 15.293 | -1.0216

Table 4: A comparison of the BLEU, chrF, and COMET scores between models trained on a specific data domain and a model trained on data across all domains.

Most of the domain-specific models performed better than the model trained on combined data (Table 4). An exception was the Religious dataset: while the Religious model performed better than the combined-data model in terms of BLEU and chrF, the combined-data model achieved a better COMET score. This indicates that training on a more general dataset allowed the model to output words that were closer to the reference translation in the embedding space (higher COMET score) but differed in terms of the subwords/characters used (lower BLEU and chrF scores). These results also corroborate those of Section 3.2, again showing the importance of data cleanliness. Models trained on the smaller and cleaner Religious and Mixed datasets performed better than those trained on the larger General, Entities, and Combined datasets. The result is particularly noticeable with the Mixed dataset model, which achieved a COMET score of -0.136 despite having only about 45,500 training sentences.
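For reference, the sketch below illustrates the evaluation protocol used throughout: corpus-level BLEU and chrF computed with sacreBLEU, and a paired Student's t-test over per-trial scores of the kind reported in Section 3.1 (COMET scores are computed analogously with the unbabel-comet package). All inputs are placeholder values, not our actual system outputs.

```python
# A hedged sketch of the evaluation protocol; all inputs are placeholders.
import sacrebleu
from scipy.stats import ttest_rel

hyps = ["the cat sat on the mat"]   # system translations (placeholder)
refs = [["the cat is on the mat"]]  # one reference stream, parallel to hyps

bleu = sacrebleu.corpus_bleu(hyps, refs)  # corpus-level BLEU
chrf = sacrebleu.corpus_chrf(hyps, refs)  # corpus-level chrF
print(f"BLEU={bleu.score:.3f} chrF={chrf.score:.3f}")

# Paired t-test between per-trial COMET scores of a segmentation model and
# the no-segmentation baseline (10 trials in the paper; placeholder values here).
seg_trials = [-1.190, -1.185, -1.192]
base_trials = [-1.205, -1.201, -1.208]
print(ttest_rel(seg_trials, base_trials).pvalue)
```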
4. Conclusion
We trained several Azerbaijani NMT models on text segmented by different algorithms and showed that using Unigram segmentation can noticeably improve translation quality. We also demonstrated that properly cleaning data can lead to significant gains in performance, even when it shrinks the training corpus. Finally, we evaluated the performance of Azerbaijani-English NMT models across multiple domains. Our results demonstrate that while generalizing across domains remains a challenge for Azerbaijani NMT, specialized models are still able to achieve competitive performance.

5. Future Work
Our experiments focused only on Northern Azerbaijani due to the scarcity of data for the Southern variant. One route for developing NMT systems for the latter is to compare the effectiveness of lower-resource cross-dialectal transfer from Northern Azerbaijani against higher-resource cross-lingual transfer from Turkish. Developing NMT systems for Southern Azerbaijani is particularly challenging since it is written in Arabic script, introducing the need for transliteration to properly take advantage of transfer learning paradigms. Further evaluation could also be done on the transfer learning and multilingual techniques introduced in previous works to improve Azerbaijani translation. While those studies show that such techniques can improve translation quality over a simple baseline, there are few to no comparisons of their effectiveness relative to each other.

References
[1] M. Maimaiti, Y. Liu, H. Luan, M. Sun, Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural machine translation, Tsinghua Science and Technology 27 (2022) 150–163.
[2] Y. Qi, D. Sachan, M. Felix, S. Padmanabhan, G. Neubig, When and why are pre-trained word embeddings useful for neural machine translation?, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 529–535.
[3] T. W. Guntara, A. F. Aji, R. E. Prasojo, Benchmarking multidomain English-Indonesian machine translation, in: Proceedings of the 13th Workshop on Building and Using Comparable Corpora, European Language Resources Association, 2020, pp. 35–43.
[4] R. Fatullayev, A. Abbasov, A. Fatullayev, Dilmanc is the 1st MT system for Azerbaijani (2008) 63–64.
[5] G. Neubig, J. Hu, Rapid adaptation of neural machine translation to new languages, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2018, pp. 875–880.
[6] Y. Kim, Y. Gao, H. Ney, Effective cross-lingual transfer of neural machine translation models without shared vocabularies, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 1246–1257.
[7] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2016, pp. 1715–1725.
[8] T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2018, pp. 66–75.
[9] W. Chen, B. Fazio, Morphologically-guided segmentation for translation of agglutinative low-resource languages, in: Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), Association for Machine Translation in the Americas, 2021, pp. 20–31.
[10] A. Richburg, R. Eskander, S. Muresan, M. Carpuat, An evaluation of subword segmentation strategies for neural machine translation of morphologically rich languages, in: Proceedings of the Fourth Widening Natural Language Processing Workshop, Association for Computational Linguistics, 2020, pp. 151–155.
[11] K. Bostrom, G. Durrett, Byte pair encoding is suboptimal for language model pretraining, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, 2020, pp. 4617–4624.
[12] M. Mager, A. Oncevay, E. Mager, K. Kann, N. T. Vu, BPE vs. morphological segmentation: A case study on machine translation of four polysynthetic languages, arXiv preprint arXiv:2203.08954 (2022).
[13] J. Sälevä, C. Lignos, The effectiveness of morphology-aware segmentation in low-resource neural machine translation, arXiv preprint arXiv:2103.11189 (2021).
[14] D. Ataman, M. Negri, M. Turchi, M. Federico, Linguistically motivated vocabulary reduction for neural machine translation from Turkish to English (2017).
[15] C. Lignos, Learning from unseen data, in: Proceedings of the Morpho Challenge 2010 Workshop, 2010, pp. 35–38.
[16] W. Chen, B. Fazio, The UCF systems for the LoResMT 2021 machine translation shared task, in: Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), Association for Machine Translation in the Americas, Virtual, 2021, pp. 129–133.
[17] A. K. Ojha, C.-H. Liu, K. Kann, J. Ortega, S. Shatam, T. Fransen, Findings of the LoResMT 2021 shared task on COVID and sign language for low-resource languages, in: Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), Association for Machine Translation in the Americas, Virtual, 2021, pp. 114–123.
[18] R. Vázquez, Y. Scherrer, S. Virpioja, J. Tiedemann, The Helsinki submission to the AmericasNLP shared task, in: Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, Association for Computational Linguistics, Online, 2021, pp. 255–264.
[19] M. Mager, A. Oncevay, A. Ebrahimi, J. Ortega, A. Rios, A. Fan, X. Gutierrez-Vasques, L. Chiruzzo, G. Giménez-Lugo, R. Ramos, I. V. Meza Ruiz, R. Coto-Solano, A. Palmer, E. Mager-Hois, V. Chaudhary, G. Neubig, N. T. Vu, K. Kann, Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas, in: Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, Association for Computational Linguistics, Online, 2021, pp. 202–217.
[20] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. Rush, OpenNMT: Open-source toolkit for neural machine translation, in: Proceedings of ACL 2017, System Demonstrations, Association for Computational Linguistics, 2017, pp. 67–72.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[22] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, COMET: A neural framework for MT evaluation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2020, pp. 2685–2702.
[23] M. Post, A call for clarity in reporting BLEU scores, in: Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics, 2018, pp. 186–191.
[24] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[25] M. Popović, chrF: character n-gram F-score for automatic MT evaluation, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2015, pp. 392–395.
[26] T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys-Dowmunt, H. Matsushita, A. Menezes, To ship or not to ship: An extensive evaluation of automatic metrics for machine translation, in: Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, 2021, pp. 483–499.
[27] H. Schwenk, V. Chaudhary, S. Sun, H. Gong, F. Guzmán, WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, 2021, pp. 1351–1361.
[28] M. Artetxe, H. Schwenk, Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond, Transactions of the Association for Computational Linguistics 7 (2019) 597–610.
[29] J. Ortega, R. Castro Mamani, K. Cho, Neural machine translation with a polysynthetic low resource language, Machine Translation (2021).
[30] J. Zuters, G. Strazds, K. Immers, Semi-automatic quasi-morphological word segmentation for neural machine translation, in: A. Lupeikiene, O. Vasilecas, G. Dzemyda (Eds.), Databases and Information Systems, Springer International Publishing, 2018, pp. 289–301.
[31] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Association for Computational Linguistics, 2007, pp. 177–180.
[32] A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory 13 (1967) 260–269.
[33] J. Tiedemann, L. Nygaard, The OPUS corpus – parallel and free: http://logos.uio.no/opus, in: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), 2004.
[34] H. Schwenk, G. Wenzek, S. Edunov, E. Grave, A. Joulin, A. Fan, CCMatrix: Mining billions of high-quality parallel sentences on the web, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, 2021, pp. 6490–6500. URL: https://aclanthology.org/2021.acl-long.507.
[35] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), 2012.
[36] A. Abdelali, F. Guzman, H. Sajjad, S. Vogel, The AMARA corpus: Building parallel language resources for the educational domain, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 2014.
[37] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2020.
[38] A. El-Kishky, A. Renduchintala, J. Cross, F. Guzmán, P. Koehn, XLEnt: Mining cross-lingual entities with lexical-semantic-phonetic word alignment, preprint, 2021.