A Framework for German-English Machine Translation with GRU RNN

Levi Corallo1, Guanghui Li2, Kenna Reagan3, Abhishek Saxena4, Aparna S. Varde5 and Brandon Wilde6

1 Computational Linguistics, Montclair State University, Montclair, NJ, United States
2 Computational Linguistics, Montclair State University, Montclair, NJ, United States
3 Computational Linguistics, Montclair State University, Montclair, NJ, United States
4 Data Science, Montclair State University, Montclair, NJ, United States
5 Computer Science, Montclair State University, Montclair, NJ, United States
6 Computational Linguistics, Montclair State University, Montclair, NJ, United States

Abstract

Machine translation (MT) using Gated Recurrent Units (GRUs) is a popular model used in industry-level web translators because of the efficiency with which it handles sequential data compared to Long Short-Term Memory (LSTM) in language modeling with smaller datasets. Motivated by this, a deep learning GRU-based Recurrent Neural Network (RNN) is modeled as a framework in this paper, utilizing WMT2021's English-German dataset that originally contains 400,000 strings from German news with parallel English translations. Our framework serves as a pilot approach in translating strings from German news media into English sentences, to build applications and pave the way for further work in the area. In real-life scenarios, this framework can be useful in developing mobile applications (apps) for quick translation where efficiency is crucial. Furthermore, our work makes broader impacts on a UN SDG (United Nations Sustainable Development Goal) of Quality Education, since offering education remotely by leveraging technology, as well as seeking equitable solutions and universal access, are significant objectives there. Our framework for German-English translation in this paper can be adapted to other similar language translation tasks.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK
corallo1@montclair.edu (L. Corallo); lig1@montclair.edu (G. Li); reagank1@montclair.edu (K. Reagan); saxenaa1@montclair.edu (A. Saxena); vardea@montclair.edu (A. S. Varde); wildeb11@montclair.edu (B. Wilde)
Authors are in alphabetical order with equal contributions.
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Motivation and Goal: The open task created by EMNLP provides datasets of sentences from news articles in multiple language pairs with parallel translated data [1]. The work generated by the task seeks to advance current machine translation (MT) research by using the latest performance scores as a comparison for future research, to investigate the applicability of current varying methods of MT, to examine challenges in word translation for specific language pairs, and to elicit more research on low-resource, morphologically rich languages. This provides the motivation for our research. Our goal is to investigate a specific machine translation problem in a morphologically rich language and model a framework to provide a feasible solution. In this context, we address the issue of German-English news translation. While there is much work on translation, there are gaps in existing tools, e.g. Google Translate has a limit on characters (see Fig. 1) with translation from a German news source [2]. In order to make news and other such text accessible globally, it is important to address large-scale translation, for which issues such as efficiency are significant. We present the following.

Figure 1: Google Translate attempt from Frankfurter Allgemeine Zeitung (German newspaper) with limitations [2]
Models and Methods: We address the issue of translation via a framework modeled by GRU RNN deep learning methods on a parallel German-English translated news corpus. The RNN (Recurrent Neural Network), originally conceptualized by Rumelhart et al. [3], with the concept of GRU (Gated Recurrent Unit) proposed by Cho et al. [4], is selected to model this framework based on its current performance in machine translation due to its efficiency as compared to the Long Short-Term Memory (LSTM) model. In order to create a reasonable training time, we experiment with our framework on batches of 64 sentence pairs at a time. We choose to work with Keras, an open-source Python library, to implement this framework [5]. We obtain interesting results that set the stage for building applications and conducting further research for enhancement. Our framework in this paper for German-English translation is usable for translation in other morphologically rich languages.

Applications: From a real-life perspective, translation of news is important for ensuring accessibility of current events for readers across the world who read in different languages, and even fighting censorship of news media by bridging information divides to countries without freedom of press. To that end, our paper broadly impacts UN SDG 4: Quality Education, since its facets include the following. (1) "Help countries in mobilizing resources and implementing innovative and context-appropriate solutions to provide education remotely, leveraging hi-tech, low-tech and no-tech approaches"; (2) "Seek equitable solutions and universal access" [6]. In the aftermath of COVID, some of these goals have been negatively impacted (see Fig. 2 from the United Nations source [6]), including language-related issues. This makes it even more important for us to address such concerns in order to enhance education. In addition, the framework in this paper has the real-life standpoint of being useful in mobile application (app) development due to its efficiency. Application of machine learning in mobile apps is broached in a variety of works, e.g. as summarized in a survey paper [7]. During online news translation in a mobile application, it is important to obtain fast results that capture the crux of the material presented in the news. Our framework is useful in such tasks.

Figure 2: UN SDG on Education and its recent concerns

2. Related Work

Avramidis et al. presented work at WMT2020 utilizing the German-English news corpus provided by EMNLP for that year's open task [8]. The paper details the development of a test suite containing multiple different linguistic phenomena relevant to the German-to-English translation process. The most difficult concepts highlighted in the test suite when using MT to translate German into English include ambiguous sentences, multi-word expressions, verb valency, and "false friends", which refers to words in two languages that appear similar in composition and are often mistaken as sharing the same meaning, but do not. The example their paper provides is the German word "Novelle" commonly having its target translation mistaken for "novel," into which it does not translate and which it does not semantically represent; the correct translation is "novella" or "short story." The paper points out the surprising fact that MT models are prone to false friends when making mistakes in translating, because this is an observed human error. This was insightful when analyzing the validity and accuracy of our translated sentences, where we were able to understand phenomena that could be influencing the margin of error.

There is research that points to Rumelhart et al. for the early conceptualization of recurrent nets that were able to evolve into modern RNN programming [3]. This early work is a predecessor introducing vital concepts in neural machine translation using an RNN, such as the hidden layer between input and output units, sigma-pi units, and so on. More recent work by Chung et al. [9] gives empirical comparisons to LSTM in RNNs.

The original concept for GRUs was introduced by Cho et al. [4], who proposed a novel neural network model called the RNN Encoder-Decoder, which uses two different neural networks as encoder and decoder respectively. The encoder is used to read the source sentence and map it into a vector of fixed length, while the decoder reads the vector and maps it back to a corresponding target sentence. Along with the new architecture, they also proposed an improved version of the standard RNN called a Gated Recurrent Unit (GRU), which uses a reset gate and an update gate to decide how much information should be passed to the output sequence. These gates can be trained to keep information from long ago if the information is critical to the prediction, or to forget information if it is irrelevant to the prediction. They experimented with this model on a task of translating English to French, found that the overall translation performance was improved in terms of BLEU (BiLingual Evaluation Understudy) scores [10], and showed that linguistic regularities at both word level and phrase level were captured. After their work, this model has become a mainstream model framework.
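To make the gating mechanism described above concrete, the following is one standard formulation of the GRU update, broadly following Cho et al. [4] and Chung et al. [9]; the notation (input x_t, hidden state h_t, sigmoid sigma, element-wise product) and the omission of bias terms are our presentational choices rather than anything prescribed by the cited papers:

    r_t = \sigma(W_r x_t + U_r h_{t-1})                      % reset gate
    z_t = \sigma(W_z x_t + U_z h_{t-1})                      % update gate
    \tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))       % candidate state
    h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    % new hidden state

Informally, the update gate z_t controls how much of the previous state is carried forward, while the reset gate r_t controls how much of it is used when forming the candidate state; this is what lets the unit either retain or discard long-range information.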
Zhang et al. [11] proposed an alternative to the widely-used bidirectional encoder with the merits of incorporating future and history contexts into the source representation. This novel encoder is called a context-aware recurrent encoder (CAEncoder) and consists of two levels. The bottom level summarizes the history information and the upper level assembles this information together with future context into the source representation. Through their experiments on translation tasks with two different language pairs, they found this novel encoder to be as efficient as the bidirectional encoder and to demonstrate better performance.

Previous work has been done on multilingual neural machine translation (NMT) that demonstrates the difficulties in translating between languages of the same language family and languages in different language families. The study determined that it is difficult for one model to handle every language to be considered for translation. The reasoning for this, in part, is that the model could be negatively impacted during training when considering language pairs such as Chinese to English and German to English. For this reason, the study explores language clustering, where languages that are closely related are clustered together, to boost the model during training. They determine that language embeddings, which consider both genealogy and typology in clustering, outperform random family clustering, which only considers genealogy [12]. Handwritten Chinese character recognition by distance metric learning is approached in [13], which cites work pertinent to pictorial scripts, considering OCR and machine translation.

Efforts in improving machine translation quality between typologically similar languages have long been witnessed in the field. For very close language pairs, a direct word-for-word translation method was tested and received promising results [14]. More advanced multilingual neural machine translation systems have been created to address one-to-many or many-to-many translations within language groups which share inherently similar structures. Azpiazu and Pera [15] put forward a novel encoder-decoder machine translation framework called HNMT that specifically exploits the hierarchical nature of a typological language family tree. The natural connection among languages enables effective knowledge transfer, while avoiding negative effects caused by incorporating very distant languages. Recent work by Oncevay et al. [16] tried to embed typological features in a language vector space for multilingual machine translation tasks and reported achieving competitive translation accuracy.

Recent work by Popović [17] details and compares language-structure related issues that arise in NMT specifically between German and English. The author's work finds that the key structural differences between German and English causing ambiguities and inconsistent target translations are the handling of prepositions, the translation of ambiguous English (source) words, and the generation of English (target) continuous tenses. English and German both follow SVO (Subject-Verb-Object) sentence structure, so the obstacles found in Popović's work highlighting prepositional phrasing, ambiguity, and tense account for inaccuracies.
Other work in this general area entails addressing article errors and collocation errors in written text translation from a source language into English [18, 19, 20, 21], by addressing issues of ESL (English as a Second Language) learners. Preposition prediction and idiom detection are addressed in some works [22, 23, 24]. Problems on knowledge discovery from big data, including those on machine translation, are discussed in [25]. Deep learning techniques are used widely in machine translation via paradigms such as LSTM (Long Short-Term Memory) [26], BERT (Bidirectional Encoder Representations from Transformers) [27], GPT (Generative Pre-trained Transformer) [28] and T5 (Text-To-Text Transfer Transformer) [29]. Depending on the task, one of these paradigms would be selected and adapted within solution approaches. There are studies that emphasize commonsense knowledge in the realm of machine intelligence, addressing translation among several tasks [30, 31, 32]. A comparison is presented in [33] between symbolic knowledge graphs (KGs) and deep learning with neural models, explaining their pros versus cons, and how they can potentially complement each other. Our work in this paper fits in the broad spectrum of such exhaustive research. Its main contribution is the framework modeled to conduct German-English news translation with the efficiency needed in real-life applications.

3. Models and Methods

The deep learning paradigm is one of the most widely used facets for machine translation. We model a framework for morphologically rich language translation deploying a GRU-based RNN, given its success in real-life scenarios such as industry-level web translators, and adapt it specifically to our problem of German-English news translation in this paper. Our framework is implemented within the Python Keras platform [5] to perform translations from German to English. The methodology for the model discussed in this paper involves text preprocessing, model design and model training. This is discussed next with reference to our data in this work.
3.1. Dataset and Text Preprocessing

The data used to train our model is sourced from the News Commentary dataset, obtained from the EMNLP 2021 website for the machine translation conference WMT21 [1]. The data, provided specifically for the task of machine translation, is an aligned corpus of German and English news stories. The collection comprises approximately 400,000 German-English sentence pairs sourced from news articles.

The text preprocessing phase entails data cleaning, tokenization and sentence padding. First, the dataset is passed through data cleaning filters. Since all sentences would be padded to the same final length, extremely long sentences are removed. This includes sentence pairs for which either the German or English sentence is more than 50 words long. Errors in the creation of the dataset can also occasionally incorrectly map one German sentence to two English sentences or vice versa. This is partially corrected by passing the data through two filters. The first removes all sentence pairs in which one sentence has more than twice as many words as its counterpart and is at least 25 words long. The second filter removes all sentence pairs in which one sentence has more than four times as many words as its counterpart. The combined filters reduce the dataset to approximately 378K German-English sentence pairs.

Tokenization is then performed with the Keras Tokenizer function, dividing sentences into their component words and assigning each unique word an integer for further processing. Each sentence is thus converted into a list of integers. Dummy tokens are then added at the end of each tokenized sentence, so that each sentence conforms to the same length and can be processed by the neural machine translation model.
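As an illustration of the preprocessing just described, the sketch below applies the length and length-ratio filters and then tokenizes and pads the sentences with Keras. It is a minimal sketch under stated assumptions, not the project's released code; in particular, the variable corpus (a list of (German, English) string pairs) and the function names are assumed for illustration.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def keep_pair(de, en, max_len=50, hard_ratio=4, soft_ratio=2, soft_min=25):
        # Length and length-ratio filters from Section 3.1 / Algorithm 1
        n_de, n_en = len(de.split()), len(en.split())
        if n_de == 0 or n_en == 0 or n_de > max_len or n_en > max_len:
            return False
        if n_de / n_en >= hard_ratio or (n_de / n_en >= soft_ratio and n_de >= soft_min):
            return False
        if n_en / n_de >= hard_ratio or (n_en / n_de >= soft_ratio and n_en >= soft_min):
            return False
        return True

    # 'corpus' is assumed to be a list of (German sentence, English sentence) strings
    pairs = [(de, en) for de, en in corpus if keep_pair(de, en)]

    def tokenize_and_pad(sentences):
        # Assign each unique word an integer ID, then append dummy (zero) tokens to a common length
        tok = Tokenizer()
        tok.fit_on_texts(sentences)
        seqs = tok.texts_to_sequences(sentences)
        return pad_sequences(seqs, padding='post'), tok

    de_seqs, de_tok = tokenize_and_pad([de for de, _ in pairs])
    en_seqs, en_tok = tokenize_and_pad([en for _, en in pairs])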
3.2. Translation Model Design

We predetermined to approach this machine translation task with an RNN model as justified earlier. After reviewing the literature and assessing approaches by others, we resolved to build a GRU-based RNN. Our framework for translation is illustrated in Fig. 3.

Figure 3: Framework for Machine Translation

The model is built within the Keras platform and is composed of two principal components: a GRU and a dense layer. Input data are entered into the GRU and processed in matrices with a configurable dimensionality referred to here as GRU units (not to be confused with the number of GRUs, which is only one). The GRU is designed for sequential processing and so maintains dependencies between different parts of a given entered sentence. The GRU output is then fed into a time-distributed dense neural layer, which produces a series of logit vectors for each sentence. Each logit effectively represents the probability of a given English word occurring in that position within the sentence, so the output is decoded by selecting the English word which corresponds to the position of the largest logit in each vector.

Several model and training parameters are left as variables, to ensure the easy reconfiguration of components. The parameter list and our chosen parameter configurations are included in Tables 1 and 2. We choose to remain relatively constant with some of the configurable model functions that are well-established: Softmax is used as the activation function, sparse categorical cross-entropy is used as the loss function, and Adam is used as the optimization function [5].

Table 1
Training Parameters for Simple RNN Model I
No.  Parameter              Value
1    Learning Rate          0.01
2    GRU Units              128
3    Activation Function    Softmax
4    Loss Function          Categorical Cross Entropy
5    Validation Percentage  0.2
6    Epochs                 10
7    Batch Size             64

Table 2
Training Parameters for Simple RNN Model II
No.  Parameter              Value
1    Learning Rate          0.05
2    GRU Units              512
3    Activation Function    Softmax
4    Loss Function          Categorical Cross Entropy
5    Validation Percentage  0.2
6    Epochs                 10
7    Batch Size             64

3.3. Model Training

After preparing the dataset and RNN model, the data is divided into two parts. Using Python Sklearn's train-test-split method, the dataset is shuffled and split: 80% of the data for training and 20% for testing, to add robustness to the framework.

Model training then commences with a configuring of model parameters and subsequent passing of the training data into the Keras Model.fit() method. Accuracy and loss are used as standard metrics [5] to monitor model performance during training and provide a basis for modifications to the model's hyper-parameters. A new validation set is created at the beginning of each batch, on which the training data of the batch is tested following the completion of batch training. This provides a way to obtain more reliable metrics than simple training statistics. The methodology in our work, including the text preprocessing and actual machine translation, is summarized in Algorithm 1.

Algorithm 1: Text Preprocessing and Translation
INPUT: English-German corpus
DEFINE: L(S) as Length of Sentence S
FOREACH sentence-pair (Sx, Sy) in corpus:
    IF L(Sx) > 50 OR L(Sy) > 50
        REMOVE (Sx, Sy)
    ELSEIF L(Sx)/L(Sy) >= 4 OR (L(Sx)/L(Sy) >= 2 AND L(Sx) >= 25)
        REMOVE (Sx, Sy)
    ELSEIF L(Sy)/L(Sx) >= 4 OR (L(Sy)/L(Sx) >= 2 AND L(Sy) >= 25)
        REMOVE (Sx, Sy)
    ELSE
        TOKENIZE (Sx, Sy)
MAP each unique token to an integer (token ID)
PAD encoded token IDs to max length
DEFINE model hyper-parameters
DEFINE model architecture via GRU RNN
INSTANTIATE model with architecture and hyper-parameters
FOREACH epoch:
    FOREACH encoded (Sx, Sy) batch in training data:
        FIT model to encoded (Sx, Sy)
    EVALUATE model on validation data
FOREACH encoded (Sx, Sy):
    MODEL-PREDICT encoded output
    DECODE encoded output to text
OUTPUT: Translated sentences

Table 3
Total Training and Testing Times Combined (For All Experiments Conducted)
Model     Train    Test
Model I   5 hours  10 minutes
Model II  7 hours  12 minutes
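The following is a minimal Keras sketch of the model design and training procedure of Sections 3.2 and 3.3, using the Model I settings from Table 1. It is illustrative only: the reuse of de_seqs, en_seqs and en_tok from the preprocessing sketch above, the decision to feed the ordinal word IDs directly to the GRU (with no embedding layer, matching the simple word-integer representation discussed in Section 4.2), and the decode helper are our assumptions, not the authors' exact released implementation.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import GRU, TimeDistributed, Dense
    from tensorflow.keras.optimizers import Adam

    en_vocab = len(en_tok.word_index) + 1          # +1 for the padding token (ID 0)
    X = de_seqs[..., np.newaxis].astype('float32')  # ordinal word IDs fed directly, no embedding
    y = en_seqs[..., np.newaxis]

    # 80/20 shuffled train/test split (Section 3.3)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

    # One GRU followed by a time-distributed dense layer with softmax over the English vocabulary
    model = Sequential([
        GRU(128, return_sequences=True, input_shape=X.shape[1:]),   # 128 GRU units as in Table 1
        TimeDistributed(Dense(en_vocab, activation='softmax')),
    ])
    model.compile(optimizer=Adam(learning_rate=0.01),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    history = model.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.2)

    def decode(pred, tokenizer):
        # Pick the word at the position of the largest logit in each vector (Section 3.2), skipping padding
        id_to_word = {i: w for w, i in tokenizer.word_index.items()}
        ids = np.argmax(pred, axis=-1)
        return ' '.join(id_to_word[int(i)] for i in ids if int(i) in id_to_word)

    print(decode(model.predict(X_test[:1])[0], en_tok))

Feeding raw integer IDs keeps the pipeline simple and fast, which is the design goal stated above, at the cost of the ordinal rather than categorical word representation acknowledged as a limitation later in the paper.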
4. Experiments and Discussion

Initial experimentation is conducted with abbreviated datasets (5k - 50k sentence pairs) to reduce test time and allow for the testing of more hyper-parameter combinations. This provides a precursory glimpse of the fully trained model. Two parameter configurations are then selected for training on the full dataset, creating what would be named Simple RNN Model I (Table 1) and Simple RNN Model II (Table 2) in our overall framework.

The learning rate is a configurable hyper-parameter that controls how quickly the model is adapted to the problem, often in the range between 0.001 and 0.05. Our experiments are set up with learning rates of 0.01 and 0.05 respectively. The number of GRU units is set to 128 and 512 for Model I and Model II respectively. We save the history of the model throughout the training process and subsequently plot the changes in loss and accuracy (see Figs. 4-7). We conduct experiments with these two setups for the running of the RNN model. Comparing the two setups, the principal differences lie in the learning rate and GRU parameters.
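Curves such as those in Figs. 4-7 can be produced directly from the history object returned by Keras Model.fit(). The short sketch below is one way to do this; it assumes the history variable from the training sketch above and the default metric names Keras records when metrics=['accuracy'] and a validation split are used.

    import matplotlib.pyplot as plt

    for metric in ['loss', 'accuracy']:
        plt.figure()
        plt.plot(history.history[metric], label='training')
        plt.plot(history.history['val_' + metric], label='validation')
        plt.xlabel('epoch')
        plt.ylabel(metric)
        plt.legend()
        plt.show()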
4.1. Experimental Results

The total training and testing times for all the executions combined in our experiments with Models I and II are synopsized in Table 3. In Fig. 4, we can observe that both the training and validation loss decreased overall for Model I. Despite the occasional spikes in loss, this is what we expect to see while training the model. Meanwhile, both the training and validation accuracy increase for Model I, as seen in Fig. 5.

Figure 4: Model I Loss
Figure 5: Model I Accuracy

However, we also notice that the validation accuracy drops at the end of the run, demonstrating that the model weights and biases have not yet reached a stable optimum. Interestingly, the validation loss does not increase during the same period, as would be expected. Such occurrences might indicate that the model weights and biases are more precisely able to replicate several of the previously correct results, while losing ground on some of the less certain results. In other terms, the model is becoming more confident producing target sentence words that are easy to predict, while simultaneously losing confidence in words that are more difficult to predict.

For Model II, we change to a larger learning rate of 0.05 and a larger number of GRU units, 512. The results differ drastically from Model I. The loss for both the training and validation sets shows an overall increase across all 10 epochs, as can be seen in Fig. 6. In the second, third, fifth, eighth, and ninth epochs, the training loss decreases. In the second, fifth, ninth, and tenth epochs, the validation loss decreases. In all other epochs, the training and validation losses both increase. The accuracy for both the training and validation sets shows a fluctuating change across all 10 epochs, as can be seen in Fig. 7, indicating robustness. The overall accuracy decreases, as is expected with rising loss.

Figure 6: Model II Loss
Figure 7: Model II Accuracy

4.2. Discussion on Experiments

We observe in all our experimentation that Model I somewhat outperforms Model II in both loss and accuracy. Model I has a final training accuracy of 0.655 and a final validation accuracy of 0.653. Model II has a final training accuracy of 0.645 and a final validation accuracy of 0.649. Model I has a final training loss of 2.78 and a final validation loss of 2.85. Model II has a final training loss of 4.66 and a final validation loss of 5.55.

Model I depicts a consistent decrease in loss and a consistent increase in validation accuracy. Model II portrays a consistent increase in loss, while the accuracy increases and decreases throughout the training process without any consistency. Despite the markedly different behavior, the two models both finish with a difference in translation accuracy of less than 1%. Overall, it seems as though the lower learning rate of 0.01 in Model I produces better results than the learning rate of 0.05 in Model II. The accuracy of Model I is higher than that of Model II and the loss of Model I is lower than that of Model II. The learning rate is a significant factor in how well the models perform. In our previous attempts to find the best parameters to train on, we find that 512 GRU units provide the best preliminary results. However, despite the fact that Model II uses 512 GRU units, Model I still outperforms Model II on the whole. It is clear that the higher learning rate hinders Model II much more than the use of 512 GRU units is able to help it. It is likely that with the high learning rate, Model II over-corrects and is not able to narrow in on optimal results. This is reflected in Figs. 6 and 7, where we can see that the loss increases and the accuracy is inconsistent. The accuracy of both of our models hovers around the 65% range. Though Google Translate gives an accuracy in the 80% range, it faces the issue of a maximum character limit, which is not feasible for translating news articles. Similar critiques can be applied to other tools and methods in the literature. Hence, our work, though at an early stage, can address such issues and pave the way for building efficient, larger-scale, and easily accessible mobile apps for news translation in morphologically rich languages. This would complement other state-of-the-art apps.

One limitation on our model's performance may have been the technique used to transform our data into feature sets. A simple word-integer assignment method is used here that may have been a detriment due to its representation of words in an ordinal system as opposed to a categorical one. Alternate approaches could include one-hot encoding [34] or word vector generation with word2vec [35]. We could also implement an alternative architecture such as a bidirectional RNN [36] into the framework to explore whether it enhances model performance. We chose to work with a simple approach first, in line with the logic of preferring simpler theories over complex ones as per Occam's razor principle [37], and also given the fact that we need reduced complexity and high efficiency for translation tasks in this context. While our simple approach allows the model to observe general context patterns, it does not offer a semantic representation of the words to be translated. Furthermore, for better understanding the performance of the translation model, we could consider adopting BLEU scores in the evaluation of our future model, since this is widely recognized as a reliable evaluation criterion in the machine translation field [10]. On the whole, our current framework creates a good baseline for translating German news to English, capturing reference to context.
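Should BLEU be adopted in future evaluations as suggested, one readily available option is the corpus-level implementation in NLTK. The sketch below is illustrative only and is not part of the current framework; the test_pairs variable and the use of simple whitespace tokenization are our assumptions.

    from nltk.translate.bleu_score import corpus_bleu

    # test_pairs assumed: list of (reference English sentence, model-translated sentence) strings
    references = [[ref.split()] for ref, _ in test_pairs]   # each hypothesis may have several references; here just one
    hypotheses = [hyp.split() for _, hyp in test_pairs]
    print('Corpus BLEU:', corpus_bleu(references, hypotheses))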
5. Conclusions

In this paper, we model a framework using a GRU-based RNN to perform German-English news translation, depicting a method of efficient translation of text pieces in morphologically rich languages. We notice high efficiency in training and testing the model. While the accuracy levels obtained here seem good for a starting point, there is scope for further improvement.

In future work, apart from considering approaches such as word2vec and bidirectional RNNs, as well as tuning some hyper-parameters, we could recommend using more training epochs. Selecting an appropriate learning rate and number of GRU units, as well as securing sufficient training time, are challenges for training deep learning MT models. During our attempts to tune the model, we observe that smaller learning rates require more training epochs, given the smaller changes made to the weights at each update, whereas larger learning rates result in rapid changes and require fewer training epochs. Later, this work might benefit from using a learning rate that decreases with each epoch, allowing the initial training to advance quickly while letting the fine-tuning take the time it needs. These are some recommendations based on our study in this paper. Furthermore, we could potentially incorporate commonsense knowledge into the learning process. As depicted in recent works, deep learning based models and commonsense based models can complement each other for enhanced performance.

Files for this project are available on GitHub and can be provided to interested users upon request. On the whole, this work provides the ground for developing mobile apps for news translation orthogonal to existing work in the area. It caters broadly to the United Nations Sustainable Development Goal of Quality Education.

6. Acknowledgments

Authors are in alphabetical order. A. Saxena has been funded by a GA from the Comp. Sc. dept. at MSU. A. Varde has NSF grants 2018575 and 2117308. She is a visiting researcher at the Max Planck Institute for Informatics, Germany.

References

[1] E. WMT, Translation Task - German-English corpus, 2021. URL: https://www.statmt.org/wmt21/translation-task.html.
[2] Frankfurter Allgemeine Zeitung, 2021. URL: https://www.faz.net/aktuell.
[3] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation, Technical Report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).
[5] F. Chollet, Deep learning with Python, Simon and Schuster, 2021.
[6] UN, SDG website, 2021. URL: www.un.org/sustainabledevelopment/sustainable-development-goals/.
[7] P. Basavaraju, A. S. Varde, Supervised learning techniques in mobile device apps for Androids, ACM SIGKDD Explorations 18 (2017) 18-29.
[8] E. Avramidis, V. Macketanz, U. Strohriegel, A. Burchardt, S. Möller, Fine-grained linguistic evaluation for state-of-the-art machine translation, arXiv preprint arXiv:2010.06359 (2020).
[9] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
[10] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318.
[11] B. Zhang, D. Xiong, J. Su, H. Duan, A context-aware recurrent encoder for neural machine translation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 2424-2432.
[12] X. Tan, J. Chen, D. He, Y. Xia, T. Qin, T.-Y. Liu, Multilingual neural machine translation with language clustering, arXiv preprint arXiv:1908.09324 (2019).
[13] B. Dong, A. S. Varde, D. Stevanovic, J. Wang, L. Zhao, Interpretable distance metric learning for handwritten Chinese character recognition, CoRR abs/2103.09714 (2021). arXiv:2103.09714.
[14] J. Hajic, Machine translation of very close languages, in: Sixth Applied Natural Language Processing Conference, 2000, pp. 7-12.
[15] I. M. Azpiazu, M. S. Pera, A framework for hierarchical multilingual machine translation, arXiv preprint arXiv:2005.05507 (2020).
[16] A. Oncevay, B. Haddow, A. Birch, Bridging linguistic typology and multilingual machine translation with multi-view language representations, arXiv preprint arXiv:2004.14923 (2020).
[17] M. Popović, Comparing language related issues for NMT and PBMT between German and English, The Prague Bulletin of Mathematical Linguistics 108 (2017) 209.
[18] D. Dahlmeier, H. T. Ng, Correcting semantic collocation errors with L1-induced paraphrases, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 107-117.
[19] N.-R. Han, M. Chodorow, C. Leacock, Detecting errors in English article usage by non-native speakers, Natural Language Engineering 12 (2006) 115-129.
[20] A. M. Pradhan, A. S. Varde, J. Peng, E. M. Fitzpatrick, Automatic classification of article errors in L2 written English, in: Twenty-Third International FLAIRS Conference, 2010.
[21] A. Varghese, A. S. Varde, J. Peng, E. Fitzpatrick, A framework for collocation error correction in web pages and text documents, ACM SIGKDD Explorations 17 (2015) 14-23.
[22] P. Bhagat, A. S. Varde, A. Feldman, WordPrep: Word-based preposition prediction tool, in: 2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019, pp. 2169-2176.
[23] J. Briskilal, C. Subalalitha, An ensemble model for classifying idioms and literal texts using BERT and RoBERTa, Information Processing & Management 59 (2022) 102756.
[24] A. Elghafari, D. Meurers, H. Wunsch, Exploring the data-driven prediction of prepositions in English, in: Coling 2010: Posters, 2010, pp. 267-275.
[25] G. De Melo, A. S. Varde, Scalable learning technologies for big data mining, in: 20th International Conference on Database Systems for Advanced Applications, DASFAA 2015, Springer Verlag, 2015.
[26] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems 28 (2016) 2222-2232.
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[28] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training (2018).
[29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.
[30] N. Tandon, A. S. Varde, G. de Melo, Commonsense knowledge in machine intelligence, ACM SIGMOD Record 46 (2017) 49-52.
[31] C. Matuszek, M. Witbrock, R. C. Kahlert, J. Cabral, D. Schneider, P. Shah, D. Lenat, Searching for common sense: Populating Cyc from the web, UMBC Computer Science and Electrical Engineering Department Collection (2005).
[32] E. Onyeka, A. S. Varde, V. Anu, N. Tandon, O. Daramola, Using commonsense knowledge and text mining for implicit requirements localization, in: 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2020, pp. 935-940.
[33] S. Razniewski, N. Tandon, A. S. Varde, Information to wisdom: commonsense knowledge extraction and compilation, in: ACM International Conference on Web Search and Data Mining (WSDM), 2021, pp. 1143-1146.
[34] J. Liang, J. Chen, X. Zhang, Y. Zhou, J. Lin, One-hot encoding and convolutional neural network based anomaly detection, Journal of Tsinghua University (Science and Technology) 59 (2019) 523-529.
[35] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[36] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673-2681.
[37] T. Mitchell, Machine Learning, McGraw-Hill, 1997.