<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Framework for German-English Machine Translation with GRU RNN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Levi Corallo</string-name>
          <email>corallo1@montclair.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guanghui Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenna Reagan</string-name>
          <email>reagank1@montclair.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhishek Saxena</string-name>
          <email>saxenaa1@montclair.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aparna S. Varde</string-name>
          <email>vardea@montclair.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brandon Wilde</string-name>
          <email>wildeb11@montclair.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Linguistics, Montclair State University</institution>
          ,
          <addr-line>Montclair, NJ</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science, Montclair State University</institution>
          ,
          <addr-line>Montclair, NJ</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Data Science, Montclair State University</institution>
          ,
          <addr-line>Montclair, NJ</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Machine translation (MT) using Gated Recurrent Units (GRUs) is a popular model used in industry-level web translators because of the eficiency with which it handles sequential data compared to Long Short-Term Memory (LSTM) in language modeling with smaller datasets. Motivated by this, a deep learning GRU based Recurrent Neural Network (RNN) is modeled as a framework in this paper, utilizing WMT2021's English-German data-set that originally contains 400,000 strings from German news with parallel English translations. Our framework serves as a pilot approach in translating strings from German news media into English sentences, to build applications and pave the way for further work in the area. In real-life scenarios, this framework can be useful in developing mobile applications (apps) for quick translation where eficiency is crucial. Furthermore, our work makes broader impacts on a UN SDG (United Nations Sustainable Development Goal) of Quality Education, since ofering education remotely by leveraging technology, as well as seeking equitable solutions and universal access are significant objectives there. Our framework for German-English translation in this paper can be adapted to other similar language translation tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Figure 1</kwd>
        <kwd>Google Translate attempt from Frankfurter Allgemeine Zeitung (German newspaper) with limitations [2]</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>1 Motivation and Goal: The open task created by
EMNLP provides datasets of sentences from news
articles in multiple language pairs with parallel translated
data [1]. The work generated by the task seeks to
advance current machine translation (MT) research by
using the latest performance scores as a comparison for
future research, to investigate the applicability of
current varying methods of MT, to examine challenges in
word translation for specific language pairs, and to elicit
more research on low-resource, morphologically rich
languages. This provides the motivation for our research.
Our goal is to investigate a specific machine translation
problem in a morphologically rich language and model a
framework to provide a feasible solution. In this context,
we address the issue of German-English news
translation. While there is much work on translation, there are
gaps in existing tools, e.g. Google Translate has a limit
on characters (see Fig. 1) with translation from a
German News source [2]. In order to make news and other
such text accessible globally, it is important to address
large-scale translation, for which issues such eficiency
are significant. We present the following.</p>
      <p>Models and Methods: We address the issue of
translapacted (see Fig 2 from the United Nations source [6]),
including language-related issues. This makes it even
more important for us to address such concerns in
order to enhance education. In addition, the framework
in this paper has the real-life standpoint of being useful
in mobile application (app) development due to its
eficiency. Application of machine learning in mobile apps
is broached in a variety of works, e.g. as summarized in
a survey paper [7]. During online news translation in a
mobile application, it is important to obtain fast results
that capture the crux of the material presented in the
news. Our framework is useful in such tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>Avramis et al. presented work at WMT2020 utilizing</title>
        <p>the German-English news corpus provided by EMNLP
for that year’s open task [8]. The paper details the
development of a test suite, containing multiple diferent
linguistic phenomena relevant to the German to English
Figure 2: UN SDG on Education and its recent concerns translation process. The most dificult concepts
highlighted in the test suite when using MT to translate
German into English include ambiguous sentences,
multiword expressions, verb valency, and “false friends” which
tion via a framework modeled by GRU RNN deep learning refers to words in two languages that appear similar in
methods on a parallel German-English translated news composition and are often mistaken as sharing the same
corpus. The RNN (Recurrent Neural Network), originally meaning, but do not. The example their paper provides is
conceptualized by Rumelhart et al. [3], with the con- the German word “Novella” commonly having its target
cept of GRU (Gated Recurrent Unit) proposed by Cho et translation mistaken for “novel,” which it does not
transal. [4], is selected to model this framework based on its late into or semantically represent, but instead “novella,”
current performance in machine translation due to its or “short story.” The paper points out that it is a
sureficiency as compared to the Long Short-Term Memory prising fact that MT models are prone to false friends
(LSTM) model. In order to create a reasonable training when making mistakes in translating because this is an
time, we experiment with our framework on batches of observed human error. This was insightful when
analyz64 sentence pairs at a time. We choose to work with ing the validity and accuracy of our translated sentences,
Keras, an open-source Python library to implement this where we were able to understand phenomena that could
framework [5]. We obtain interesting results that set the be influencing the margin of error.
stage for building applications and conducting further There is research that points to Rumelhart et al. for
research for enhancement. Our framework in this paper the early conceptualization of Recurrent Nets that were
for German-English translation is usable for translation able to evolve into modern RNN programming [3]. This
in other morphologically rich languages. early work is a predecessor introducing vital concepts</p>
        <p>Applications: From a real-life perspective, translation in neural machine translation using an RNN such as the
of news is important for ensuring accessibility of current hidden layer between input and output units, sigma-pi
events for readers across the world who read in diferent units, and so on. More recent work by Chung et al. [9]
languages, and even fighting censorship of news-media is able to give empirical comparisons to LSTM in RNNs.
by bridging information divides to countries without free- The original concept for GRUs was introduced by Cho
dom of press. To that end, our paper broadly impacts UN et al. [4] who proposed a novel neural network model
SDG 4: Quality Education since its facets include the fol- called RNN Encoder-Decoder which uses two diferent
lowing. (1) “Help countries in mobilizing resources and neural networks as encoder and decoder respectively.
implementing innovative and context-appropriate solu- The encoder is used to read the source sentence and map
tions to provide education remotely, leveraging hi-tech, it into a vector of fixed length, while the decoder reads
low-tech and no-tech approaches”; (2) “Seek equitable the vector and maps it back to a corresponding target
solutions and universal access” [6]. In the aftermath of sentence. Along with the new architecture, they also
COVID, some of these goals have been negatively im- proposed an improved version of standard RNN called a
Gated Recurrent Unit (GRU) which uses a reset gate and among languages enables efective knowledge transfer,
an update gate to decide how much information should while avoiding negative efects caused by incorporating
be passed to the output sequence. They can be trained very distant languages. Recent work done by Oncevay et
to keep information from long ago if the information al. [16] tried to embed typological features in language
is critical to the prediction or forget information if it is vector space for multilingual machine translation tasks
irrelevant to the prediction. They experimented with this and reported to achieve competitive translation accuracy.
model on a task of translating English to French, found Recent work by Popović [17] details and compares
that the overall translation performance was improved in language-structure related issues that arise in NMT
terms of BLEU (BiLingual Evaluation Understudy) scores specifically between German and English. The author’s
[10] and linguistic regularities at both word level and work finds that key structural diferences between
Gerphrase level were captured. After their work, this model man and English causing ambiguities and inconsistent
has become a mainstream model framework. target translations are the handling of prepositions, the</p>
        <p>Zhang et al. [11] proposed an alternative to the widely- translation of ambiguous English (source) words, and
used bidirectional encoder with the merits of incorpo- generation of English (target) continuous tenses. English
rating future and history contexts into the source repre- and German both follow SVO (Subject-Verb-Object)
sensentation. This novel encoder is called a context-aware tence structure, so the obstacles found in Popović’s work
recurrent encoder (CAEncoder) which consists of two highlighting prepositional phrasing, ambiguity, and tense
levels. The bottom level summarizes the history infor- account for inaccuracies.
mation and the upper level assembles this information Other work in this general area entails addressing
artogether with future context into the source represen- ticle errors and collocation errors in written text
translatation. Through their experiment on translation tasks tion from a source language into English [18, 19, 20, 21],
with two diferent language pairs, they found this novel by addressing issues of ESL (English as a Second
Lanencoder to be as eficient as the bidirectional encoder and guage) learners. Preposition prediction and idiom
deto demonstrate better performance. tection are addressed in some works [22, 23, 24].
Prob</p>
        <p>Previous work has been done on multilingual neural lems on knowledge discovery from big data including
machine translation (NMT) that demonstrates the dif- those on machine translation are discussed in [25]. Deep
ifculties in translating between languages of the same learning techniques are used widely in machine
translanguage family and languages in diferent language fam- lation via paradigms such as LSTM (Long Short-Term
ilies. The study determined that it is dificult for one Memory) [26], BERT (Bidirectional Encoder
Represenmodel to handle every language to be considered for tations from Transformers) [27], GPT (Generative
Pretranslation. The reasoning for this, in part, is because trained Transformer) [28] and T5 (Text-To-Text
Transthe model could be negatively impacted during train- fer Transformer) [29]. Depending on the task, one of
ing when considering language pairs, such as Chinese these paradigms would be selected and adapted within
to English and German to English. For this reason, the solution approaches. There are studies that emphasize
study explores language clustering, where languages that commonsense knowledge in the realm of machine
inare closely related are clustered together, to boost the telligence, addressing translation among several tasks
model during training. They determine that language [30, 31, 32]. Comparison is presented in [33] between
embeddings, which considers genealogy and typology in symbolic knowledge graphs (KGs) and deep learning with
clustering, outperforms random family, which only con- neural models, explaining their pros versus cons, and how
siders genealogy [12]. Handwritten Chinese character they can potentially complement each other. Our work in
recognition by distance metric learning is approached in this paper fits in the broad spectrum of such exhaustive
[13] that cites work pertinent to pictorial scripts, consid- research. Its main contribution is the framework
modering OCR and machine translation. eled to conduct German-English news translation with</p>
        <p>Eforts in improving machine translation quality be- eficiency as needed in real-life applications.
tween typologically similar languages have long been
witnessed in the field. For those very close language pairs,
a direct word-for-word translation method was tested and 3. Models and Methods
received promising results [14]. More advanced
multilingual neural machine translation system has been created The deep learning paradigm is one of the most widely
to address one to many or many to many translations used facets for Machine Translation. We model a
framewithin language groups which share inherent similar work for morphologically rich language translation
destructures. Azpiazu and Pera [15] put forward a novel ploying a GRU based RNN, given its success with
realencoder-decoder machine translation framework called life scenarios such as industry level web-translators, and
HNMT specifically exploited the hierarchical nature of a adapt it specifically to our problem of German-English
typological language family tree. The natural connection news translation in this paper. Our framework is
implemented within the Python Keras platform [5] to perform
translations from German to English. The methodology
for the model discussed in this paper involves text
preprocessing, model design and model training. This is
discussed next with reference to our data in this work.</p>
        <sec id="sec-2-1-1">
          <title>3.1. Dataset and Text Preprocessing</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>The data used to train our model is sourced from the</title>
        <p>News Commentary dataset, obtained on the EMNLP 2021
website for the machine translation conference WMT21
[1]. The data, provided specifically for the task of
machine translation, is an aligned corpus of German and
English news stories. The collection comprises
approximately 400,000 German-English sentence pairs sourced
from news articles.</p>
        <p>The text preprocessing phase entails data cleaning,
tokenization and sentence padding. First, the dataset is
passed through data cleaning filters. Since all sentences
would be padded to the same final length, extremely long Figure 3: Framework for Machine Translation
sentences are removed. This includes sentence pairs for
which either the German or English sentence is more
than 50 words long. Errors in the creation of the dataset
can also occasionally incorrectly map one German sen- GRU is designed for sequential processing and so
maintence to two English sentences or vice versa. This is tains dependencies between diferent parts of a given
partially corrected by passing the data through two fil- entered sentence. The GRU output is then fed into a
timeters. The first removes all sentence pairs in which one distributed dense neural layer, which produces a series
sentence has more than twice as many words as its coun- of logit vectors for each sentence. Each logit efectively
terpart and a minimum length of 25 words. The second represents the probability of a given English word
occuriflter removes all sentence pairs in which one sentence ring in that position within the sentence, so the output is
has more than four times as many words as its counter- decoded by calling the English word which corresponds
part. The combined filters reduce the dataset to a size of to the position of the largest logit in each vector.
approximately 378K German-English sentence pairs. Several model and training parameters are left as
vari</p>
        <p>Tokenization is then performed with the Keras Tok- ables, to ensure the easy reconfiguration of components.
enizer function, dividing sentences into their component The parameter list and our chosen parameter
configurawords, and assigning each unique word an integer for tions are included in Tables 1 and 2. We choose to remain
further processing. Each sentence is thus converted into relatively constant with some of the configurable model
a list of integers. Dummy &lt;PAD&gt; tokens are then added functions that are well-established: Softmax is used as
at the end of each tokenized sentence, so that each sen- the activation function, sparse categorical cross-entropy
tence conforms to the same length and can be processed is used as the loss function, and Adam was used as the
by the neural machine translation model. optimization function [5].</p>
        <sec id="sec-2-2-1">
          <title>3.2. Translation Model Design</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>We predetermined to approach this machine translation</title>
        <p>task with an RNN model as justified earlier. After
reviewing the literature and assessing approaches by others, we
resolved to build a GRU-based RNN. Our framework for
translation is illustrated in Fig. 3.</p>
        <p>The model is built within the Keras platform and is
composed of two principal components: a GRU and a
dense layer. Input data are entered into the GRU, and
processed in matrices with a configurable
dimensionality referred to here as GRU units (not to be confused
with the number of GRUs, which was only one). The</p>
        <sec id="sec-2-3-1">
          <title>3.3. Model Training</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>After preparing the dataset and RNN model, the data is</title>
        <p>divided into two parts. Using Python Sklearn’s
train-testsplit method, the dataset is shufled and split: 80% of the
data for training and 20% for testing, to add robustness
to the framework.</p>
        <p>Model training then commences with a configuring
of model parameters and subsequent passing of the
training data into the Keras Model.fit() method. Accuracy
and loss are used as standard metrics [5] to monitor
model performance during training and provide a basis
for modifications to the model’s hyper-parameters.</p>
        <p>Parameters</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments and Discussion</title>
      <p>Initial experimentation is conducted with abbreviated
datasets (5k - 50k sentence pairs) to reduce test time and
allow for the testing of more hyper-parameter combina- 4.1. Experimental Results
tions. This provides a precursory glimpse of the fully
trained model. Two parameter configurations are then The total training and testing times for all the executions
selected for training on the full dataset, creating what combined in our experiments with Models I and II are
would be named Simple RNN Model I (Table 1) and Sim- synopsized in Table 3. In Fig. 4, we can observe that
ple RNN Model II (Table 2) in our overall framework. both the training and validation loss decreased overall for</p>
      <p>The learning rate is a configurable hyper-parameter Model I. Despite the occasional spikes in loss, this is what
that controls how quickly the model is adapted to the we expect to see while training the model. Meanwhile,
problem, often in the range between 0.001 and 0.05. Our
experiments are set up with learning rates of 0.01 and 0.05
correspondingly. The number of GRU units are set to 128
and 512 for Model I and Model II respectively. We save
the history of the model throughout the training process
and subsequently plot the changes in loss and accuracy
(see Figs. 4 – 7). We conduct experiments with two setups
for the running of the RNN model. Comparing these two
setups, the principal diferences lie in the learning rate
and GRU parameters.
both the training and validation accuracy increase for training and validation sets shows a fluctuating change
Model I, seen in Fig. 5. across all 10 epochs, as can be seen in Fig. 7, indicating</p>
      <p>However, we also notice that the validation accuracy robustness. The overall accuracy decreases as is expected
drops at the end of running, demonstrating that the model with rising loss.
weights and biases have not yet reached a stable optimum.</p>
      <p>Interestingly, the validation loss does not increase during
the same period, as would be expected. Such occurrences 4.2. Discussion on Experiments
might indicate when the model weights and biases are We observe in all our experimentation that Model I
somemore precisely able to replicate several of the previously what outperforms Model II in both loss and accuracy.
correct results, while losing ground on some of the less Model I has a final training accuracy of 0.655 and final
certain results. In other terms, the model is becoming validation accuracy of 0.653. Model II had a final
validamore confident producing target sentence words easy to tion accuracy of 0.645 and a final validation accuracy of
predict, while simultaneously losing confidence in words 0.649. Model I has a final training loss of 2.78 and the final
that are more dificult to predict. validation loss was 2.85. Model II has a final validation</p>
      <p>For Model II, we change to a larger learning rate of loss of 4.66 and a final validation loss of 5.55.
0.05 and a larger number of GRU units, 512. The results Model I depicts a consistent decrease in loss and a
difer drastically from Model I. The loss for both the consistent increase in validation. Model II portrays a
contraining and validation set have an overall increase across sistent increase in loss while the accuracy increases and
all 10 epochs, as can be seen in Fig. 6. In the second, decreases throughout the training process without any
third, fifth, eighth, and ninth epochs, the training loss consistency. Despite the markedly diferent behavior, the
decreases. In the second, fifth, ninth, and tenth epoch, the two models both finish with a diference in translation
validation loss decreases. In all other epochs, the training accuracy less than 1%. Overall, it seems as though the
and validation losses both increase. The accuracy for both</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>lower learning rate of 0.01 in Model I produces better
results than the learning rate of 0.05 in Model II. The
accuracy of Model I is higher than in Model II and the loss In this paper, we model a framework using a GRU-based
in Model I is lower than in Model II. The learning rate RNN to perform German-English news translation,
deis a significant factor in how well the models perform. picting a method of eficient translation of text pieces
In our previous attempts to find the best parameters to in morphologically rich languages. We notice high
eftrain on, we find that 512 GRU units provide the best ifciency in training and testing the model. While the
preliminary results. However, despite the fact that Model accuracy levels obtained here seem good for a starting
II uses 512 GRU units, Model I still outperforms Model point, there is scope for further improvement.
II on the whole. It is clear that the higher learning rate In future work, apart from considering approaches
hinders Model II much more than the use of 512 GRU such as word2vec and bidirectional RNN, as well as
tununits is able to help it. It is likely that with the high ing some hyper-parameters, we could recommend using
learning rate, Model II over-corrects and is not able to more training epochs. Selecting an appropriate
learnnarrow in on optimal results. This is reflected in Figs. 6 ing rate and number of GRU units, as well as securing
and 7 where we can see that the loss increases and the suficient training time are challenges for training deep
accuracy is inconsistent. The learning rate of both of our learning MT models. During our attempts to tune the
models hovers around the 65% range. Though Google model, we observe that smaller learning rates require
Translate gives an accuracy in the 80% range, it faces the more training epochs, given the smaller changes made to
issue of a maximum character limit which is not feasible the weights each update, whereas larger learning rates
for translating news articles. Similar critiques can be ap- result in rapid changes and require fewer training epochs.
plied to other tools and methods in the literature. Hence, Later, this work might benefit from using a learning rate
our work, though at an early stage, can address such is- that decreases with each epoch, allowing the initial
trainsues and pave the way for building eficient, larger scale, ing to advance quickly while letting the fine-tuning take
and easily accessible mobile apps in news translation for the time it needs. These are some recommendations
morphologically rich languages. This would complement based on our study in this paper. Furthermore, we could
other state-of-the-art apps. potentially incorporate commonsense knowledge into</p>
      <p>One limitation on our model’s performance may have the learning process. As depicted in recent works, deep
been the technique used to transform our data into fea- learning based models and commonsense based models
ture sets. A simple word-integer assignment method is can complement each other for enhanced performance.
used here that may have been a detriment due to its rep- Files for this project are available on GitHub and can
resentation of words in an ordinal system as opposed to a be provided to interested users upon request. On the
categorical one. Alternate approaches could include one- whole, this work provides the ground for developing
hot encoding [34] or word vector generation word2vec mobile apps for news translation orthogonal to existing
[35]. We could also implement an alternative architec- work in the area. It caters broadly to the United Nations
ture such as a bidirectional RNN [36] into the framework Sustainable Development Goal of Quality Education.
to explore if it enhances model performance We chose
to work with a simple approach first in line with the 6. Acknowledgments
logic of preferring simpler theories over complex ones
as per Occam’s razor principle [37], and also given the Authors are in alphabetical order. A. Saxena has been funded
fact that we need reduced complexity and high eficiency by a GA from the Comp. Sc. dept. at MSU. A. Varde has NSF
for translation tasks in this context. While our simple grants 2018575 and 2117308. She is a visiting researcher at the
approach allows the model to observe general context Max Planck Institute for Informatics, Germany.
patterns, it does not ofer semantic representation of the
words to be translated. Furthermore, for better under- References
standing the performance of the translation model, we
could consider adopting BLEU scores in the evaluation
of our future model, since this is widely recognized as
a reliable evaluation criteria in the machine translation
ifeld [10]. On the whole, our current framework creates
a good baseline for translating German news to English,
capturing reference to context.
[1] E. WMT, Translation Task— German-English
corpus, 2021. URL: https://www.statmt.org/wmt21/
translation-task.html.
[2] Zeitung, Frankfurter Allegemeine, 2021. URL: https://</p>
      <p>www.faz.net/aktuell.
[3] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning
internal representations by error propagation,
Technical Report, California Univ San Diego La Jolla Inst for
Cognitive Science, 1985.
[4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, work for collocation error correction in web pages and
F. Bougares, H. Schwenk, Y. Bengio, Learning phrase text documents, ACM SIGKDD Explorations 17 (2015)
representations using rnn encoder-decoder for statisti- 14–23.
cal machine translation, arXiv preprint arXiv:1406.1078 [22] P. Bhagat, A. S. Varde, A. Feldman, Wordprep:
Word(2014). based preposition prediction tool, in: 2019 IEEE
Interna[5] F. Chollet, Deep learning with Python, Simon and Schus- tional Conference on Big Data (Big Data), IEEE, 2019, pp.</p>
      <p>ter, 2021. 2169–2176.
[6] UN, SDG website, 2021. URL: [23] J. Briskilal, C. Subalalitha, An ensemble model for
claswww.un.org/sustainabledevelopment/ sifying idioms and literal texts using bert and roberta,
sustainabledevelopment-goals/. Information Processing &amp; Management 59 (2022) 102756.
[7] P. Basavaraju, A. S. Varde, Supervised learning tech- [24] A. Elghafari, D. Meurers, H. Wunsch, Exploring the
dataniques in mobile device apps for androids, ACM SIGKDD driven prediction of prepositions in english, in: Coling
Explorations 18 (2017) 18–29. 2010: Posters, 2010, pp. 267–275.
[8] E. Avramidis, V. Macketanz, U. Strohriegel, A. Bur- [25] G. De Melo, A. S. Varde, Scalable learning technologies
chardt, S. Möller, Fine-grained linguistic evaluation for big data mining, in: 20th International Conference on
for state-of-the-art machine translation, arXiv preprint Database Systems for Advanced Applications, DASFAA
arXiv:2010.06359 (2020). 2015, Springer Verlag, 2015.
[9] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical eval- [26] K. Gref, R. K. Srivastava, J. Koutník, B. R. Steunebrink,
uation of gated recurrent neural networks on sequence J. Schmidhuber, Lstm: A search space odyssey, IEEE
modeling, arXiv preprint arXiv:1412.3555 (2014). transactions on neural networks and learning systems
[10] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a 28 (2016) 2222–2232.</p>
      <p>method for automatic evaluation of machine translation, [27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert:
Prein: Proceedings of the 40th annual meeting of the Asso- training of deep bidirectional transformers for language
ciation for Computational Linguistics, 2002, pp. 311–318. understanding, 2019. arXiv:1810.04805.
[11] B. Zhang, D. Xiong, J. Su, H. Duan, A context- [28] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever,
aware recurrent encoder for neural machine transla- Improving language understanding by generative
pretion, IEEE/ACM Transactions on Audio, Speech, and training (2018).</p>
      <p>Language Processing 25 (2017) 2424–2432. [29] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
[12] X. Tan, J. Chen, D. He, Y. Xia, T. Qin, T.-Y. Liu, Multilin- M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of
gual neural machine translation with language cluster- transfer learning with a unified text-to-text transformer,
ing, arXiv preprint arXiv:1908.09324 (2019). 2020. arXiv:1910.10683.
[13] B. Dong, A. S. Varde, D. Stevanovic, J. Wang, L. Zhao, In- [30] N. Tandon, A. S. Varde, G. de Melo, Commonsense
knowlterpretable distance metric learning for handwritten chi- edge in machine intelligence, ACM SIGMOD Record 46
nese character recognition, CoRR abs/2103.09714 (2021). (2017) 49–52.</p>
      <p>arXiv:2103.09714. [31] C. Matuszek, M. Witbrock, R. C. Kahlert, J. Cabral,
[14] J. Hajic, Machine translation of very close languages, in: D. Schneider, P. Shah, D. Lenat, Searching for common
Sixth Applied Natural Language Processing Conference, sense: Populating cyc from the web, UMBC Computer
2000, pp. 7–12. Science and Electrical Engineering Department
Collec[15] I. M. Azpiazu, M. S. Pera, A framework for hierarchi- tion (2005).</p>
      <p>cal multilingual machine translation, arXiv preprint [32] E. Onyeka, A. S. Varde, V. Anu, N. Tandon, O. Daramola,
arXiv:2005.05507 (2020). Using commonsense knowledge and text mining for
im[16] A. Oncevay, B. Haddow, A. Birch, Bridging linguis- plicit requirements localization, in: 2020 IEEE 32nd
Intertic typology and multilingual machine translation with national Conference on Tools with Artificial Intelligence
multi-view language representations, arXiv preprint (ICTAI), IEEE, 2020, pp. 935–940.</p>
      <p>arXiv:2004.14923 (2020). [33] S. Razniewski, N. Tandon, A. S. Varde, Information to
[17] M. Popović, Comparing language related issues for nmt wisdom: commonsense knowledge extraction and
compiand pbmt between german and english, The Prague lation, in: ACM International Conference on Web Search
Bulletin of Mathematical Linguistics 108 (2017) 209. and Data Mining (WSDM), 2021, pp. 1143–1146.
[18] D. Dahlmeier, H. T. Ng, Correcting semantic collocation [34] J. Liang, J. Chen, X. Zhang, Y. Zhou, J. Lin, One-hot
enerrors with l1-induced paraphrases, in: Proceedings of coding and convolutional neural network based anomaly
the 2011 conference on empirical methods in natural detection, Journal of Tsinghua University (Science and
language processing, 2011, pp. 107–117. Technology) 59 (2019) 523–529.
[19] N.-R. Han, M. Chodorow, C. Leacock, Detecting errors [35] T. Mikolov, K. Chen, G. Corrado, J. Dean, Eficient
estiin english article usage by non-native speakers, Natural mation of word representations in vector space, arXiv
Language Engineering 12 (2006) 115–129. preprint arXiv:1301.3781 (2013).
[20] A. M. Pradhan, A. S. Varde, J. Peng, E. M. Fitzpatrick, [36] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural
Automatic classification of article errors in l2 written networks, IEEE transactions on Signal Processing 45
english, in: Twenty-Third International FLAIRS Confer- (1997) 2673–2681.</p>
      <p>ence, 2010. [37] T. Mitchell, M. L. McGraw-Hill, Edition, 1997.
[21] A. Varghese, A. S. Varde, J. Peng, E. Fitzpatrick, A
frame</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>