A Framework for German-English Machine Translation with GRU RNN

Levi Corallo1, Guanghui Li2, Kenna Reagan3, Abhishek Saxena4, Aparna S. Varde5 and Brandon Wilde6

1 Computational Linguistics, Montclair State University, Montclair, NJ, United States
2 Computational Linguistics, Montclair State University, Montclair, NJ, United States
3 Computational Linguistics, Montclair State University, Montclair, NJ, United States
4 Data Science, Montclair State University, Montclair, NJ, United States
5 Computer Science, Montclair State University, Montclair, NJ, United States
6 Computational Linguistics, Montclair State University, Montclair, NJ, United States

Abstract

Machine translation (MT) using Gated Recurrent Units (GRUs) is a popular model used in industry-level web translators because of the efficiency with which it handles sequential data compared to Long Short-Term Memory (LSTM) in language modeling with smaller datasets. Motivated by this, a deep learning GRU-based Recurrent Neural Network (RNN) is modeled as a framework in this paper, utilizing WMT2021's English-German dataset that originally contains 400,000 strings from German news with parallel English translations. Our framework serves as a pilot approach in translating strings from German news media into English sentences, to build applications and pave the way for further work in the area. In real-life scenarios, this framework can be useful in developing mobile applications (apps) for quick translation where efficiency is crucial. Furthermore, our work makes broader impacts on a UN SDG (United Nations Sustainable Development Goal) of Quality Education, since offering education remotely by leveraging technology, as well as seeking equitable solutions and universal access, are significant objectives there. Our framework for German-English translation in this paper can be adapted to other similar language translation tasks.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK
corallo1@montclair.edu (L. Corallo); lig1@montclair.edu (G. Li); reagank1@montclair.edu (K. Reagan); saxenaa1@montclair.edu (A. Saxena); vardea@montclair.edu (A. S. Varde); wildeb11@montclair.edu (B. Wilde)
Authors are in alphabetical order with equal contributions.
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Motivation and Goal: The open task created by EMNLP provides datasets of sentences from news articles in multiple language pairs with parallel translated data [1]. The work generated by the task seeks to advance current machine translation (MT) research by using the latest performance scores as a comparison for future research, to investigate the applicability of current varying methods of MT, to examine challenges in word translation for specific language pairs, and to elicit more research on low-resource, morphologically rich languages. This provides the motivation for our research. Our goal is to investigate a specific machine translation problem in a morphologically rich language and model a framework to provide a feasible solution. In this context, we address the issue of German-English news translation. While there is much work on translation, there are gaps in existing tools, e.g. Google Translate has a limit on characters (see Fig. 1) with translation from a German news source [2]. In order to make news and other such text accessible globally, it is important to address large-scale translation, for which issues such as efficiency are significant. We present the following.

Figure 1: Google Translate attempt from Frankfurter Allgemeine Zeitung (German newspaper) with limitations [2]
Models and Methods: We address the issue of translation via a framework modeled by GRU RNN deep learning methods on a parallel German-English translated news corpus. The RNN (Recurrent Neural Network), originally conceptualized by Rumelhart et al. [3], with the concept of GRU (Gated Recurrent Unit) proposed by Cho et al. [4], is selected to model this framework based on its current performance in machine translation due to its efficiency as compared to the Long Short-Term Memory (LSTM) model. In order to create a reasonable training time, we experiment with our framework on batches of 64 sentence pairs at a time. We choose to work with Keras, an open-source Python library, to implement this framework [5]. We obtain interesting results that set the stage for building applications and conducting further research for enhancement. Our framework in this paper for German-English translation is usable for translation in other morphologically rich languages.

Applications: From a real-life perspective, translation of news is important for ensuring accessibility of current events for readers across the world who read in different languages, and even fighting censorship of news media by bridging information divides to countries without freedom of press. To that end, our paper broadly impacts UN SDG 4: Quality Education, since its facets include the following. (1) "Help countries in mobilizing resources and implementing innovative and context-appropriate solutions to provide education remotely, leveraging hi-tech, low-tech and no-tech approaches"; (2) "Seek equitable solutions and universal access" [6]. In the aftermath of COVID, some of these goals have been negatively impacted (see Fig. 2 from the United Nations source [6]), including language-related issues. This makes it even more important for us to address such concerns in order to enhance education. In addition, the framework in this paper has the real-life standpoint of being useful in mobile application (app) development due to its efficiency. Application of machine learning in mobile apps is broached in a variety of works, e.g. as summarized in a survey paper [7]. During online news translation in a mobile application, it is important to obtain fast results that capture the crux of the material presented in the news. Our framework is useful in such tasks.

Figure 2: UN SDG on Education and its recent concerns

2. Related Work

Avramidis et al. presented work at WMT2020 utilizing the German-English news corpus provided by EMNLP for that year's open task [8]. The paper details the development of a test suite containing multiple different linguistic phenomena relevant to the German-to-English translation process. The most difficult concepts highlighted in the test suite when using MT to translate German into English include ambiguous sentences, multi-word expressions, verb valency, and "false friends", which refers to words in two languages that appear similar in composition and are often mistaken as sharing the same meaning, but do not. The example their paper provides is the German word "Novelle" commonly having its target translation mistaken for "novel," into which it does not translate and which it does not semantically represent; the correct translation is "novella" or "short story." The paper points out the surprising fact that MT models are prone to false friends when making mistakes in translating, because this is an observed human error. This was insightful when analyzing the validity and accuracy of our translated sentences, where we were able to understand phenomena that could be influencing the margin of error.

There is research that points to Rumelhart et al. for the early conceptualization of recurrent nets that were able to evolve into modern RNN programming [3]. This early work is a predecessor introducing vital concepts in neural machine translation using an RNN, such as the hidden layer between input and output units, sigma-pi units, and so on. More recent work by Chung et al. [9] gives empirical comparisons to LSTM in RNNs.

The original concept for GRUs was introduced by Cho et al. [4], who proposed a novel neural network model called the RNN Encoder-Decoder, which uses two different neural networks as encoder and decoder respectively. The encoder is used to read the source sentence and map it into a vector of fixed length, while the decoder reads the vector and maps it back to a corresponding target sentence. Along with the new architecture, they also proposed an improved version of the standard RNN called a Gated Recurrent Unit (GRU), which uses a reset gate and an update gate to decide how much information should be passed to the output sequence. These gates can be trained to keep information from long ago if the information is critical to the prediction, or to forget information if it is irrelevant to the prediction. They experimented with this model on a task of translating English to French, found that the overall translation performance was improved in terms of BLEU (BiLingual Evaluation Understudy) scores [10], and showed that linguistic regularities at both word level and phrase level were captured. After their work, this model has become a mainstream model framework.
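To make the gating mechanism described above concrete, the following is one standard formulation of the GRU update, broadly following Cho et al. [4] and Chung et al. [9]; the notation (input x_t, hidden state h_t, sigmoid sigma, element-wise product) and the omission of bias terms are our presentational choices rather than anything prescribed by the cited papers:

    r_t = \sigma(W_r x_t + U_r h_{t-1})                      % reset gate
    z_t = \sigma(W_z x_t + U_z h_{t-1})                      % update gate
    \tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))       % candidate state
    h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    % new hidden state

Informally, the update gate z_t controls how much of the previous state is carried forward, while the reset gate r_t controls how much of it is used when forming the candidate state; this is what lets the unit either retain or discard long-range information.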
Zhang et al. [11] proposed an alternative to the widely-used bidirectional encoder with the merits of incorporating future and history contexts into the source representation. This novel encoder is called a context-aware recurrent encoder (CAEncoder) and consists of two levels. The bottom level summarizes the history information and the upper level assembles this information together with future context into the source representation. Through their experiments on translation tasks with two different language pairs, they found this novel encoder to be as efficient as the bidirectional encoder and to demonstrate better performance.

Previous work has been done on multilingual neural machine translation (NMT) that demonstrates the difficulties in translating between languages of the same language family and languages in different language families. The study determined that it is difficult for one model to handle every language to be considered for translation. The reasoning for this, in part, is that the model could be negatively impacted during training when considering language pairs such as Chinese to English and German to English. For this reason, the study explores language clustering, where languages that are closely related are clustered together, to boost the model during training. They determine that language embeddings, which consider both genealogy and typology in clustering, outperform random family clustering, which only considers genealogy [12]. Handwritten Chinese character recognition by distance metric learning is approached in [13], which cites work pertinent to pictorial scripts, considering OCR and machine translation.

Efforts in improving machine translation quality between typologically similar languages have long been witnessed in the field. For very close language pairs, a direct word-for-word translation method was tested and received promising results [14]. More advanced multilingual neural machine translation systems have been created to address one-to-many or many-to-many translations within language groups which share inherently similar structures. Azpiazu and Pera [15] put forward a novel encoder-decoder machine translation framework called HNMT that specifically exploits the hierarchical nature of a typological language family tree. The natural connection among languages enables effective knowledge transfer, while avoiding negative effects caused by incorporating very distant languages. Recent work by Oncevay et al. [16] tried to embed typological features in a language vector space for multilingual machine translation tasks and reported achieving competitive translation accuracy.

Recent work by Popović [17] details and compares language-structure related issues that arise in NMT specifically between German and English. The author's work finds that the key structural differences between German and English causing ambiguities and inconsistent target translations are the handling of prepositions, the translation of ambiguous English (source) words, and the generation of English (target) continuous tenses. English and German both follow SVO (Subject-Verb-Object) sentence structure, so the obstacles found in Popović's work highlighting prepositional phrasing, ambiguity, and tense account for inaccuracies.
Other work in this general area entails addressing article errors and collocation errors in written text translation from a source language into English [18, 19, 20, 21], by addressing issues of ESL (English as a Second Language) learners. Preposition prediction and idiom detection are addressed in some works [22, 23, 24]. Problems on knowledge discovery from big data, including those on machine translation, are discussed in [25]. Deep learning techniques are used widely in machine translation via paradigms such as LSTM (Long Short-Term Memory) [26], BERT (Bidirectional Encoder Representations from Transformers) [27], GPT (Generative Pre-trained Transformer) [28] and T5 (Text-To-Text Transfer Transformer) [29]. Depending on the task, one of these paradigms would be selected and adapted within solution approaches. There are studies that emphasize commonsense knowledge in the realm of machine intelligence, addressing translation among several tasks [30, 31, 32]. A comparison is presented in [33] between symbolic knowledge graphs (KGs) and deep learning with neural models, explaining their pros versus cons, and how they can potentially complement each other. Our work in this paper fits in the broad spectrum of such exhaustive research. Its main contribution is the framework modeled to conduct German-English news translation with the efficiency needed in real-life applications.

3. Models and Methods

The deep learning paradigm is one of the most widely used facets for machine translation. We model a framework for morphologically rich language translation deploying a GRU-based RNN, given its success in real-life scenarios such as industry-level web translators, and adapt it specifically to our problem of German-English news translation in this paper. Our framework is implemented within the Python Keras platform [5] to perform translations from German to English. The methodology for the model discussed in this paper involves text preprocessing, model design and model training. This is discussed next with reference to our data in this work.
3.1. Dataset and Text Preprocessing

The data used to train our model is sourced from the News Commentary dataset, obtained from the EMNLP 2021 website for the machine translation conference WMT21 [1]. The data, provided specifically for the task of machine translation, is an aligned corpus of German and English news stories. The collection comprises approximately 400,000 German-English sentence pairs sourced from news articles.

The text preprocessing phase entails data cleaning, tokenization and sentence padding. First, the dataset is passed through data cleaning filters. Since all sentences would be padded to the same final length, extremely long sentences are removed. This includes sentence pairs for which either the German or English sentence is more than 50 words long. Errors in the creation of the dataset can also occasionally incorrectly map one German sentence to two English sentences or vice versa. This is partially corrected by passing the data through two filters. The first removes all sentence pairs in which one sentence has more than twice as many words as its counterpart and is at least 25 words long. The second filter removes all sentence pairs in which one sentence has more than four times as many words as its counterpart. The combined filters reduce the dataset to approximately 378K German-English sentence pairs.

Tokenization is then performed with the Keras Tokenizer function, dividing sentences into their component words and assigning each unique word an integer for further processing. Each sentence is thus converted into a list of integers. Dummy tokens are then added at the end of each tokenized sentence, so that each sentence conforms to the same length and can be processed by the neural machine translation model.
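As an illustration of the preprocessing just described, the sketch below applies the length and length-ratio filters and then tokenizes and pads the sentences with Keras. It is a minimal sketch under stated assumptions, not the project's released code; in particular, the variable corpus (a list of (German, English) string pairs) and the function names are assumed for illustration.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def keep_pair(de, en, max_len=50, hard_ratio=4, soft_ratio=2, soft_min=25):
        # Length and length-ratio filters from Section 3.1 / Algorithm 1
        n_de, n_en = len(de.split()), len(en.split())
        if n_de == 0 or n_en == 0 or n_de > max_len or n_en > max_len:
            return False
        if n_de / n_en >= hard_ratio or (n_de / n_en >= soft_ratio and n_de >= soft_min):
            return False
        if n_en / n_de >= hard_ratio or (n_en / n_de >= soft_ratio and n_en >= soft_min):
            return False
        return True

    # 'corpus' is assumed to be a list of (German sentence, English sentence) strings
    pairs = [(de, en) for de, en in corpus if keep_pair(de, en)]

    def tokenize_and_pad(sentences):
        # Assign each unique word an integer ID, then append dummy (zero) tokens to a common length
        tok = Tokenizer()
        tok.fit_on_texts(sentences)
        seqs = tok.texts_to_sequences(sentences)
        return pad_sequences(seqs, padding='post'), tok

    de_seqs, de_tok = tokenize_and_pad([de for de, _ in pairs])
    en_seqs, en_tok = tokenize_and_pad([en for _, en in pairs])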
3.2. Translation Model Design

We predetermined to approach this machine translation task with an RNN model as justified earlier. After reviewing the literature and assessing approaches by others, we resolved to build a GRU-based RNN. Our framework for translation is illustrated in Fig. 3.

Figure 3: Framework for Machine Translation

The model is built within the Keras platform and is composed of two principal components: a GRU and a dense layer. Input data are entered into the GRU and processed in matrices with a configurable dimensionality referred to here as GRU units (not to be confused with the number of GRUs, which is only one). The GRU is designed for sequential processing and so maintains dependencies between different parts of a given entered sentence. The GRU output is then fed into a time-distributed dense neural layer, which produces a series of logit vectors for each sentence. Each logit effectively represents the probability of a given English word occurring in that position within the sentence, so the output is decoded by selecting the English word which corresponds to the position of the largest logit in each vector.

Several model and training parameters are left as variables, to ensure the easy reconfiguration of components. The parameter list and our chosen parameter configurations are included in Tables 1 and 2. We choose to remain relatively constant with some of the configurable model functions that are well-established: Softmax is used as the activation function, sparse categorical cross-entropy is used as the loss function, and Adam is used as the optimization function [5].

Table 1
Training Parameters for Simple RNN Model I
No.  Parameter              Value
1    Learning Rate          0.01
2    GRU Units              128
3    Activation Function    Softmax
4    Loss Function          Categorical Cross Entropy
5    Validation Percentage  0.2
6    Epochs                 10
7    Batch Size             64

Table 2
Training Parameters for Simple RNN Model II
No.  Parameter              Value
1    Learning Rate          0.05
2    GRU Units              512
3    Activation Function    Softmax
4    Loss Function          Categorical Cross Entropy
5    Validation Percentage  0.2
6    Epochs                 10
7    Batch Size             64

3.3. Model Training

After preparing the dataset and RNN model, the data is divided into two parts. Using Python Sklearn's train-test-split method, the dataset is shuffled and split: 80% of the data for training and 20% for testing, to add robustness to the framework.

Model training then commences with a configuring of model parameters and subsequent passing of the training data into the Keras Model.fit() method. Accuracy and loss are used as standard metrics [5] to monitor model performance during training and provide a basis for modifications to the model's hyper-parameters. A new validation set is created at the beginning of each batch, on which the training data of the batch is tested following the completion of batch training. This provides a way to obtain more reliable metrics than simple training statistics. The methodology in our work, including the text preprocessing and actual machine translation, is summarized in Algorithm 1.

Algorithm 1: Text Preprocessing and Translation
INPUT: English-German corpus
DEFINE: L(S) as Length of Sentence S
FOREACH sentence-pair (Sx, Sy) in corpus:
    IF L(Sx) > 50 OR L(Sy) > 50
        REMOVE (Sx, Sy)
    ELSEIF L(Sx)/L(Sy) >= 4 OR (L(Sx)/L(Sy) >= 2 AND L(Sx) >= 25)
        REMOVE (Sx, Sy)
    ELSEIF L(Sy)/L(Sx) >= 4 OR (L(Sy)/L(Sx) >= 2 AND L(Sy) >= 25)
        REMOVE (Sx, Sy)
    ELSE
        TOKENIZE (Sx, Sy)
MAP each unique token to an integer (token ID)
PAD encoded token IDs to max length
DEFINE model hyper-parameters
DEFINE model architecture via GRU RNN
INSTANTIATE model with architecture and hyper-parameters
FOREACH epoch:
    FOREACH encoded (Sx, Sy) batch in training data:
        FIT model to encoded (Sx, Sy)
    EVALUATE model on validation data
FOREACH encoded (Sx, Sy):
    MODEL-PREDICT encoded output
    DECODE encoded output to text
OUTPUT: Translated sentences

Table 3
Total Training and Testing Times Combined (For All Experiments Conducted)
Model     Train    Test
Model I   5 hours  10 minutes
Model II  7 hours  12 minutes
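The following is a minimal Keras sketch of the model design and training procedure of Sections 3.2 and 3.3, using the Model I settings from Table 1. It is illustrative only: the reuse of de_seqs, en_seqs and en_tok from the preprocessing sketch above, the decision to feed the ordinal word IDs directly to the GRU (with no embedding layer, matching the simple word-integer representation discussed in Section 4.2), and the decode helper are our assumptions, not the authors' exact released implementation.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import GRU, TimeDistributed, Dense
    from tensorflow.keras.optimizers import Adam

    en_vocab = len(en_tok.word_index) + 1          # +1 for the padding token (ID 0)
    X = de_seqs[..., np.newaxis].astype('float32')  # ordinal word IDs fed directly, no embedding
    y = en_seqs[..., np.newaxis]

    # 80/20 shuffled train/test split (Section 3.3)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

    # One GRU followed by a time-distributed dense layer with softmax over the English vocabulary
    model = Sequential([
        GRU(128, return_sequences=True, input_shape=X.shape[1:]),   # 128 GRU units as in Table 1
        TimeDistributed(Dense(en_vocab, activation='softmax')),
    ])
    model.compile(optimizer=Adam(learning_rate=0.01),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    history = model.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.2)

    def decode(pred, tokenizer):
        # Pick the word at the position of the largest logit in each vector (Section 3.2), skipping padding
        id_to_word = {i: w for w, i in tokenizer.word_index.items()}
        ids = np.argmax(pred, axis=-1)
        return ' '.join(id_to_word[int(i)] for i in ids if int(i) in id_to_word)

    print(decode(model.predict(X_test[:1])[0], en_tok))

Feeding raw integer IDs keeps the pipeline simple and fast, which is the design goal stated above, at the cost of the ordinal rather than categorical word representation acknowledged as a limitation later in the paper.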
4. Experiments and Discussion

Initial experimentation is conducted with abbreviated datasets (5k - 50k sentence pairs) to reduce test time and allow for the testing of more hyper-parameter combinations. This provides a precursory glimpse of the fully trained model. Two parameter configurations are then selected for training on the full dataset, creating what would be named Simple RNN Model I (Table 1) and Simple RNN Model II (Table 2) in our overall framework.

The learning rate is a configurable hyper-parameter that controls how quickly the model is adapted to the problem, often in the range between 0.001 and 0.05. Our experiments are set up with learning rates of 0.01 and 0.05 respectively. The number of GRU units is set to 128 and 512 for Model I and Model II respectively. We save the history of the model throughout the training process and subsequently plot the changes in loss and accuracy (see Figs. 4-7). We conduct experiments with these two setups for the running of the RNN model. Comparing the two setups, the principal differences lie in the learning rate and GRU parameters.
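Curves such as those in Figs. 4-7 can be produced directly from the history object returned by Keras Model.fit(). The short sketch below is one way to do this; it assumes the history variable from the training sketch above and the default metric names Keras records when metrics=['accuracy'] and a validation split are used.

    import matplotlib.pyplot as plt

    for metric in ['loss', 'accuracy']:
        plt.figure()
        plt.plot(history.history[metric], label='training')
        plt.plot(history.history['val_' + metric], label='validation')
        plt.xlabel('epoch')
        plt.ylabel(metric)
        plt.legend()
        plt.show()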
4.1. Experimental Results

The total training and testing times for all the executions combined in our experiments with Models I and II are synopsized in Table 3. In Fig. 4, we can observe that both the training and validation loss decreased overall for Model I. Despite the occasional spikes in loss, this is what we expect to see while training the model. Meanwhile, both the training and validation accuracy increase for Model I, as seen in Fig. 5.

Figure 4: Model I Loss
Figure 5: Model I Accuracy

However, we also notice that the validation accuracy drops at the end of the run, demonstrating that the model weights and biases have not yet reached a stable optimum. Interestingly, the validation loss does not increase during the same period, as would be expected. Such occurrences might indicate that the model weights and biases are more precisely able to replicate several of the previously correct results, while losing ground on some of the less certain results. In other terms, the model is becoming more confident producing target sentence words that are easy to predict, while simultaneously losing confidence in words that are more difficult to predict.

For Model II, we change to a larger learning rate of 0.05 and a larger number of GRU units, 512. The results differ drastically from Model I. The loss for both the training and validation sets shows an overall increase across all 10 epochs, as can be seen in Fig. 6. In the second, third, fifth, eighth, and ninth epochs, the training loss decreases. In the second, fifth, ninth, and tenth epochs, the validation loss decreases. In all other epochs, the training and validation losses both increase. The accuracy for both the training and validation sets shows a fluctuating change across all 10 epochs, as can be seen in Fig. 7, indicating robustness. The overall accuracy decreases, as is expected with rising loss.

Figure 6: Model II Loss
Figure 7: Model II Accuracy

4.2. Discussion on Experiments

We observe in all our experimentation that Model I somewhat outperforms Model II in both loss and accuracy. Model I has a final training accuracy of 0.655 and a final validation accuracy of 0.653. Model II has a final training accuracy of 0.645 and a final validation accuracy of 0.649. Model I has a final training loss of 2.78 and a final validation loss of 2.85. Model II has a final training loss of 4.66 and a final validation loss of 5.55.

Model I depicts a consistent decrease in loss and a consistent increase in validation accuracy. Model II portrays a consistent increase in loss, while the accuracy increases and decreases throughout the training process without any consistency. Despite the markedly different behavior, the two models both finish with a difference in translation accuracy of less than 1%. Overall, it seems as though the lower learning rate of 0.01 in Model I produces better results than the learning rate of 0.05 in Model II. The accuracy of Model I is higher than that of Model II and the loss of Model I is lower than that of Model II. The learning rate is a significant factor in how well the models perform. In our previous attempts to find the best parameters to train on, we find that 512 GRU units provide the best preliminary results. However, despite the fact that Model II uses 512 GRU units, Model I still outperforms Model II on the whole. It is clear that the higher learning rate hinders Model II much more than the use of 512 GRU units is able to help it. It is likely that with the high learning rate, Model II over-corrects and is not able to narrow in on optimal results. This is reflected in Figs. 6 and 7, where we can see that the loss increases and the accuracy is inconsistent. The accuracy of both of our models hovers around the 65% range. Though Google Translate gives an accuracy in the 80% range, it faces the issue of a maximum character limit, which is not feasible for translating news articles. Similar critiques can be applied to other tools and methods in the literature. Hence, our work, though at an early stage, can address such issues and pave the way for building efficient, larger-scale, and easily accessible mobile apps for news translation in morphologically rich languages. This would complement other state-of-the-art apps.

One limitation on our model's performance may have been the technique used to transform our data into feature sets. A simple word-integer assignment method is used here that may have been a detriment due to its representation of words in an ordinal system as opposed to a categorical one. Alternate approaches could include one-hot encoding [34] or word vector generation with word2vec [35]. We could also implement an alternative architecture such as a bidirectional RNN [36] into the framework to explore whether it enhances model performance. We chose to work with a simple approach first, in line with the logic of preferring simpler theories over complex ones as per Occam's razor principle [37], and also given the fact that we need reduced complexity and high efficiency for translation tasks in this context. While our simple approach allows the model to observe general context patterns, it does not offer a semantic representation of the words to be translated. Furthermore, for better understanding the performance of the translation model, we could consider adopting BLEU scores in the evaluation of our future model, since this is widely recognized as a reliable evaluation criterion in the machine translation field [10]. On the whole, our current framework creates a good baseline for translating German news to English, capturing reference to context.
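Should BLEU be adopted in future evaluations as suggested, one readily available option is the corpus-level implementation in NLTK. The sketch below is illustrative only and is not part of the current framework; the test_pairs variable and the use of simple whitespace tokenization are our assumptions.

    from nltk.translate.bleu_score import corpus_bleu

    # test_pairs assumed: list of (reference English sentence, model-translated sentence) strings
    references = [[ref.split()] for ref, _ in test_pairs]   # each hypothesis may have several references; here just one
    hypotheses = [hyp.split() for _, hyp in test_pairs]
    print('Corpus BLEU:', corpus_bleu(references, hypotheses))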
5. Conclusions

In this paper, we model a framework using a GRU-based RNN to perform German-English news translation, depicting a method of efficient translation of text pieces in morphologically rich languages. We notice high efficiency in training and testing the model. While the accuracy levels obtained here seem good for a starting point, there is scope for further improvement.

In future work, apart from considering approaches such as word2vec and bidirectional RNNs, as well as tuning some hyper-parameters, we could recommend using more training epochs. Selecting an appropriate learning rate and number of GRU units, as well as securing sufficient training time, are challenges for training deep learning MT models. During our attempts to tune the model, we observe that smaller learning rates require more training epochs, given the smaller changes made to the weights at each update, whereas larger learning rates result in rapid changes and require fewer training epochs. Later, this work might benefit from using a learning rate that decreases with each epoch, allowing the initial training to advance quickly while letting the fine-tuning take the time it needs. These are some recommendations based on our study in this paper. Furthermore, we could potentially incorporate commonsense knowledge into the learning process. As depicted in recent works, deep learning based models and commonsense based models can complement each other for enhanced performance.

Files for this project are available on GitHub and can be provided to interested users upon request. On the whole, this work provides the ground for developing mobile apps for news translation orthogonal to existing work in the area. It caters broadly to the United Nations Sustainable Development Goal of Quality Education.

6. Acknowledgments

Authors are in alphabetical order. A. Saxena has been funded by a GA from the Comp. Sc. dept. at MSU. A. Varde has NSF grants 2018575 and 2117308. She is a visiting researcher at the Max Planck Institute for Informatics, Germany.

References

[1] E. WMT, Translation Task - German-English corpus, 2021. URL: https://www.statmt.org/wmt21/translation-task.html.
[2] Frankfurter Allgemeine Zeitung, 2021. URL: https://www.faz.net/aktuell.
[3] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation, Technical Report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).
[5] F. Chollet, Deep learning with Python, Simon and Schuster, 2021.
[6] UN, SDG website, 2021. URL: www.un.org/sustainabledevelopment/sustainable-development-goals/.
[7] P. Basavaraju, A. S. Varde, Supervised learning techniques in mobile device apps for Androids, ACM SIGKDD Explorations 18 (2017) 18-29.
[8] E. Avramidis, V. Macketanz, U. Strohriegel, A. Burchardt, S. Möller, Fine-grained linguistic evaluation for state-of-the-art machine translation, arXiv preprint arXiv:2010.06359 (2020).
[9] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
[10] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318.
[11] B. Zhang, D. Xiong, J. Su, H. Duan, A context-aware recurrent encoder for neural machine translation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 2424-2432.
[12] X. Tan, J. Chen, D. He, Y. Xia, T. Qin, T.-Y. Liu, Multilingual neural machine translation with language clustering, arXiv preprint arXiv:1908.09324 (2019).
[13] B. Dong, A. S. Varde, D. Stevanovic, J. Wang, L. Zhao, Interpretable distance metric learning for handwritten Chinese character recognition, CoRR abs/2103.09714 (2021). arXiv:2103.09714.
[14] J. Hajic, Machine translation of very close languages, in: Sixth Applied Natural Language Processing Conference, 2000, pp. 7-12.
[15] I. M. Azpiazu, M. S. Pera, A framework for hierarchical multilingual machine translation, arXiv preprint arXiv:2005.05507 (2020).
[16] A. Oncevay, B. Haddow, A. Birch, Bridging linguistic typology and multilingual machine translation with multi-view language representations, arXiv preprint arXiv:2004.14923 (2020).
[17] M. Popović, Comparing language related issues for NMT and PBMT between German and English, The Prague Bulletin of Mathematical Linguistics 108 (2017) 209.
[18] D. Dahlmeier, H. T. Ng, Correcting semantic collocation errors with L1-induced paraphrases, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 107-117.
[19] N.-R. Han, M. Chodorow, C. Leacock, Detecting errors in English article usage by non-native speakers, Natural Language Engineering 12 (2006) 115-129.
[20] A. M. Pradhan, A. S. Varde, J. Peng, E. M. Fitzpatrick, Automatic classification of article errors in L2 written English, in: Twenty-Third International FLAIRS Conference, 2010.
[21] A. Varghese, A. S. Varde, J. Peng, E. Fitzpatrick, A framework for collocation error correction in web pages and text documents, ACM SIGKDD Explorations 17 (2015) 14-23.
[22] P. Bhagat, A. S. Varde, A. Feldman, WordPrep: Word-based preposition prediction tool, in: 2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019, pp. 2169-2176.
[23] J. Briskilal, C. Subalalitha, An ensemble model for classifying idioms and literal texts using BERT and RoBERTa, Information Processing & Management 59 (2022) 102756.
[24] A. Elghafari, D. Meurers, H. Wunsch, Exploring the data-driven prediction of prepositions in English, in: Coling 2010: Posters, 2010, pp. 267-275.
[25] G. De Melo, A. S. Varde, Scalable learning technologies for big data mining, in: 20th International Conference on Database Systems for Advanced Applications, DASFAA 2015, Springer Verlag, 2015.
[26] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems 28 (2016) 2222-2232.
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[28] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training (2018).
[29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.
[30] N. Tandon, A. S. Varde, G. de Melo, Commonsense knowledge in machine intelligence, ACM SIGMOD Record 46 (2017) 49-52.
[31] C. Matuszek, M. Witbrock, R. C. Kahlert, J. Cabral, D. Schneider, P. Shah, D. Lenat, Searching for common sense: Populating Cyc from the web, UMBC Computer Science and Electrical Engineering Department Collection (2005).
[32] E. Onyeka, A. S. Varde, V. Anu, N. Tandon, O. Daramola, Using commonsense knowledge and text mining for implicit requirements localization, in: 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2020, pp. 935-940.
[33] S. Razniewski, N. Tandon, A. S. Varde, Information to wisdom: commonsense knowledge extraction and compilation, in: ACM International Conference on Web Search and Data Mining (WSDM), 2021, pp. 1143-1146.
[34] J. Liang, J. Chen, X. Zhang, Y. Zhou, J. Lin, One-hot encoding and convolutional neural network based anomaly detection, Journal of Tsinghua University (Science and Technology) 59 (2019) 523-529.
[35] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[36] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673-2681.
[37] T. Mitchell, Machine Learning, McGraw-Hill, 1997.