1. Introduction

A Framework for German-English Machine Translation with GRU RNN

Levi Corallo

corallo1@montclair.edu 0

Guanghui Li

Kenna Reagan

reagank1@montclair.edu 0

Abhishek Saxena

saxenaa1@montclair.edu 2

Aparna S. Varde

vardea@montclair.edu 1

Brandon Wilde

wildeb11@montclair.edu 0 0 Computational Linguistics, Montclair State University , Montclair, NJ , United States 1 Computer Science, Montclair State University , Montclair, NJ , United States 2 Data Science, Montclair State University , Montclair, NJ , United States

Machine translation (MT) using Gated Recurrent Units (GRUs) is a popular model used in industry-level web translators because of the eficiency with which it handles sequential data compared to Long Short-Term Memory (LSTM) in language modeling with smaller datasets. Motivated by this, a deep learning GRU based Recurrent Neural Network (RNN) is modeled as a framework in this paper, utilizing WMT2021's English-German data-set that originally contains 400,000 strings from German news with parallel English translations. Our framework serves as a pilot approach in translating strings from German news media into English sentences, to build applications and pave the way for further work in the area. In real-life scenarios, this framework can be useful in developing mobile applications (apps) for quick translation where eficiency is crucial. Furthermore, our work makes broader impacts on a UN SDG (United Nations Sustainable Development Goal) of Quality Education, since ofering education remotely by leveraging technology, as well as seeking equitable solutions and universal access are significant objectives there. Our framework for German-English translation in this paper can be adapted to other similar language translation tasks.

Figure 1 Google Translate attempt from Frankfurter Allgemeine Zeitung (German newspaper) with limitations [2]

1. Introduction

1 Motivation and Goal: The open task created by EMNLP provides datasets of sentences from news articles in multiple language pairs with parallel translated data [1]. The work generated by the task seeks to advance current machine translation (MT) research by using the latest performance scores as a comparison for future research, to investigate the applicability of current varying methods of MT, to examine challenges in word translation for specific language pairs, and to elicit more research on low-resource, morphologically rich languages. This provides the motivation for our research. Our goal is to investigate a specific machine translation problem in a morphologically rich language and model a framework to provide a feasible solution. In this context, we address the issue of German-English news translation. While there is much work on translation, there are gaps in existing tools, e.g. Google Translate has a limit on characters (see Fig. 1) with translation from a German News source [2]. In order to make news and other such text accessible globally, it is important to address large-scale translation, for which issues such eficiency are significant. We present the following.

Models and Methods: We address the issue of translapacted (see Fig 2 from the United Nations source [6]), including language-related issues. This makes it even more important for us to address such concerns in order to enhance education. In addition, the framework in this paper has the real-life standpoint of being useful in mobile application (app) development due to its eficiency. Application of machine learning in mobile apps is broached in a variety of works, e.g. as summarized in a survey paper [7]. During online news translation in a mobile application, it is important to obtain fast results that capture the crux of the material presented in the news. Our framework is useful in such tasks.

2. Related Work Avramis et al. presented work at WMT2020 utilizing

the German-English news corpus provided by EMNLP for that year’s open task [8]. The paper details the development of a test suite, containing multiple diferent linguistic phenomena relevant to the German to English Figure 2: UN SDG on Education and its recent concerns translation process. The most dificult concepts highlighted in the test suite when using MT to translate German into English include ambiguous sentences, multiword expressions, verb valency, and “false friends” which tion via a framework modeled by GRU RNN deep learning refers to words in two languages that appear similar in methods on a parallel German-English translated news composition and are often mistaken as sharing the same corpus. The RNN (Recurrent Neural Network), originally meaning, but do not. The example their paper provides is conceptualized by Rumelhart et al. [3], with the con- the German word “Novella” commonly having its target cept of GRU (Gated Recurrent Unit) proposed by Cho et translation mistaken for “novel,” which it does not transal. [4], is selected to model this framework based on its late into or semantically represent, but instead “novella,” current performance in machine translation due to its or “short story.” The paper points out that it is a sureficiency as compared to the Long Short-Term Memory prising fact that MT models are prone to false friends (LSTM) model. In order to create a reasonable training when making mistakes in translating because this is an time, we experiment with our framework on batches of observed human error. This was insightful when analyz64 sentence pairs at a time. We choose to work with ing the validity and accuracy of our translated sentences, Keras, an open-source Python library to implement this where we were able to understand phenomena that could framework [5]. We obtain interesting results that set the be influencing the margin of error. stage for building applications and conducting further There is research that points to Rumelhart et al. for research for enhancement. Our framework in this paper the early conceptualization of Recurrent Nets that were for German-English translation is usable for translation able to evolve into modern RNN programming [3]. This in other morphologically rich languages. early work is a predecessor introducing vital concepts

Applications: From a real-life perspective, translation in neural machine translation using an RNN such as the of news is important for ensuring accessibility of current hidden layer between input and output units, sigma-pi events for readers across the world who read in diferent units, and so on. More recent work by Chung et al. [9] languages, and even fighting censorship of news-media is able to give empirical comparisons to LSTM in RNNs. by bridging information divides to countries without free- The original concept for GRUs was introduced by Cho dom of press. To that end, our paper broadly impacts UN et al. [4] who proposed a novel neural network model SDG 4: Quality Education since its facets include the fol- called RNN Encoder-Decoder which uses two diferent lowing. (1) “Help countries in mobilizing resources and neural networks as encoder and decoder respectively. implementing innovative and context-appropriate solu- The encoder is used to read the source sentence and map tions to provide education remotely, leveraging hi-tech, it into a vector of fixed length, while the decoder reads low-tech and no-tech approaches”; (2) “Seek equitable the vector and maps it back to a corresponding target solutions and universal access” [6]. In the aftermath of sentence. Along with the new architecture, they also COVID, some of these goals have been negatively im- proposed an improved version of standard RNN called a Gated Recurrent Unit (GRU) which uses a reset gate and among languages enables efective knowledge transfer, an update gate to decide how much information should while avoiding negative efects caused by incorporating be passed to the output sequence. They can be trained very distant languages. Recent work done by Oncevay et to keep information from long ago if the information al. [16] tried to embed typological features in language is critical to the prediction or forget information if it is vector space for multilingual machine translation tasks irrelevant to the prediction. They experimented with this and reported to achieve competitive translation accuracy. model on a task of translating English to French, found Recent work by Popović [17] details and compares that the overall translation performance was improved in language-structure related issues that arise in NMT terms of BLEU (BiLingual Evaluation Understudy) scores specifically between German and English. The author’s [10] and linguistic regularities at both word level and work finds that key structural diferences between Gerphrase level were captured. After their work, this model man and English causing ambiguities and inconsistent has become a mainstream model framework. target translations are the handling of prepositions, the

Zhang et al. [11] proposed an alternative to the widely- translation of ambiguous English (source) words, and used bidirectional encoder with the merits of incorpo- generation of English (target) continuous tenses. English rating future and history contexts into the source repre- and German both follow SVO (Subject-Verb-Object) sensentation. This novel encoder is called a context-aware tence structure, so the obstacles found in Popović’s work recurrent encoder (CAEncoder) which consists of two highlighting prepositional phrasing, ambiguity, and tense levels. The bottom level summarizes the history infor- account for inaccuracies. mation and the upper level assembles this information Other work in this general area entails addressing artogether with future context into the source represen- ticle errors and collocation errors in written text translatation. Through their experiment on translation tasks tion from a source language into English [18, 19, 20, 21], with two diferent language pairs, they found this novel by addressing issues of ESL (English as a Second Lanencoder to be as eficient as the bidirectional encoder and guage) learners. Preposition prediction and idiom deto demonstrate better performance. tection are addressed in some works [22, 23, 24]. Prob

Previous work has been done on multilingual neural lems on knowledge discovery from big data including machine translation (NMT) that demonstrates the dif- those on machine translation are discussed in [25]. Deep ifculties in translating between languages of the same learning techniques are used widely in machine translanguage family and languages in diferent language fam- lation via paradigms such as LSTM (Long Short-Term ilies. The study determined that it is dificult for one Memory) [26], BERT (Bidirectional Encoder Represenmodel to handle every language to be considered for tations from Transformers) [27], GPT (Generative Pretranslation. The reasoning for this, in part, is because trained Transformer) [28] and T5 (Text-To-Text Transthe model could be negatively impacted during train- fer Transformer) [29]. Depending on the task, one of ing when considering language pairs, such as Chinese these paradigms would be selected and adapted within to English and German to English. For this reason, the solution approaches. There are studies that emphasize study explores language clustering, where languages that commonsense knowledge in the realm of machine inare closely related are clustered together, to boost the telligence, addressing translation among several tasks model during training. They determine that language [30, 31, 32]. Comparison is presented in [33] between embeddings, which considers genealogy and typology in symbolic knowledge graphs (KGs) and deep learning with clustering, outperforms random family, which only con- neural models, explaining their pros versus cons, and how siders genealogy [12]. Handwritten Chinese character they can potentially complement each other. Our work in recognition by distance metric learning is approached in this paper fits in the broad spectrum of such exhaustive [13] that cites work pertinent to pictorial scripts, consid- research. Its main contribution is the framework modering OCR and machine translation. eled to conduct German-English news translation with

Eforts in improving machine translation quality be- eficiency as needed in real-life applications. tween typologically similar languages have long been witnessed in the field. For those very close language pairs, a direct word-for-word translation method was tested and 3. Models and Methods received promising results [14]. More advanced multilingual neural machine translation system has been created The deep learning paradigm is one of the most widely to address one to many or many to many translations used facets for Machine Translation. We model a framewithin language groups which share inherent similar work for morphologically rich language translation destructures. Azpiazu and Pera [15] put forward a novel ploying a GRU based RNN, given its success with realencoder-decoder machine translation framework called life scenarios such as industry level web-translators, and HNMT specifically exploited the hierarchical nature of a adapt it specifically to our problem of German-English typological language family tree. The natural connection news translation in this paper. Our framework is implemented within the Python Keras platform [5] to perform translations from German to English. The methodology for the model discussed in this paper involves text preprocessing, model design and model training. This is discussed next with reference to our data in this work.

3.1. Dataset and Text Preprocessing The data used to train our model is sourced from the

News Commentary dataset, obtained on the EMNLP 2021 website for the machine translation conference WMT21 [1]. The data, provided specifically for the task of machine translation, is an aligned corpus of German and English news stories. The collection comprises approximately 400,000 German-English sentence pairs sourced from news articles.

The text preprocessing phase entails data cleaning, tokenization and sentence padding. First, the dataset is passed through data cleaning filters. Since all sentences would be padded to the same final length, extremely long Figure 3: Framework for Machine Translation sentences are removed. This includes sentence pairs for which either the German or English sentence is more than 50 words long. Errors in the creation of the dataset can also occasionally incorrectly map one German sen- GRU is designed for sequential processing and so maintence to two English sentences or vice versa. This is tains dependencies between diferent parts of a given partially corrected by passing the data through two fil- entered sentence. The GRU output is then fed into a timeters. The first removes all sentence pairs in which one distributed dense neural layer, which produces a series sentence has more than twice as many words as its coun- of logit vectors for each sentence. Each logit efectively terpart and a minimum length of 25 words. The second represents the probability of a given English word occuriflter removes all sentence pairs in which one sentence ring in that position within the sentence, so the output is has more than four times as many words as its counter- decoded by calling the English word which corresponds part. The combined filters reduce the dataset to a size of to the position of the largest logit in each vector. approximately 378K German-English sentence pairs. Several model and training parameters are left as vari

Tokenization is then performed with the Keras Tok- ables, to ensure the easy reconfiguration of components. enizer function, dividing sentences into their component The parameter list and our chosen parameter configurawords, and assigning each unique word an integer for tions are included in Tables 1 and 2. We choose to remain further processing. Each sentence is thus converted into relatively constant with some of the configurable model a list of integers. Dummy <PAD> tokens are then added functions that are well-established: Softmax is used as at the end of each tokenized sentence, so that each sen- the activation function, sparse categorical cross-entropy tence conforms to the same length and can be processed is used as the loss function, and Adam was used as the by the neural machine translation model. optimization function [5].

3.2. Translation Model Design We predetermined to approach this machine translation

task with an RNN model as justified earlier. After reviewing the literature and assessing approaches by others, we resolved to build a GRU-based RNN. Our framework for translation is illustrated in Fig. 3.

The model is built within the Keras platform and is composed of two principal components: a GRU and a dense layer. Input data are entered into the GRU, and processed in matrices with a configurable dimensionality referred to here as GRU units (not to be confused with the number of GRUs, which was only one). The

3.3. Model Training After preparing the dataset and RNN model, the data is

divided into two parts. Using Python Sklearn’s train-testsplit method, the dataset is shufled and split: 80% of the data for training and 20% for testing, to add robustness to the framework.

Model training then commences with a configuring of model parameters and subsequent passing of the training data into the Keras Model.fit() method. Accuracy and loss are used as standard metrics [5] to monitor model performance during training and provide a basis for modifications to the model’s hyper-parameters.

Parameters

4. Experiments and Discussion

Initial experimentation is conducted with abbreviated datasets (5k - 50k sentence pairs) to reduce test time and allow for the testing of more hyper-parameter combina- 4.1. Experimental Results tions. This provides a precursory glimpse of the fully trained model. Two parameter configurations are then The total training and testing times for all the executions selected for training on the full dataset, creating what combined in our experiments with Models I and II are would be named Simple RNN Model I (Table 1) and Sim- synopsized in Table 3. In Fig. 4, we can observe that ple RNN Model II (Table 2) in our overall framework. both the training and validation loss decreased overall for

The learning rate is a configurable hyper-parameter Model I. Despite the occasional spikes in loss, this is what that controls how quickly the model is adapted to the we expect to see while training the model. Meanwhile, problem, often in the range between 0.001 and 0.05. Our experiments are set up with learning rates of 0.01 and 0.05 correspondingly. The number of GRU units are set to 128 and 512 for Model I and Model II respectively. We save the history of the model throughout the training process and subsequently plot the changes in loss and accuracy (see Figs. 4 – 7). We conduct experiments with two setups for the running of the RNN model. Comparing these two setups, the principal diferences lie in the learning rate and GRU parameters. both the training and validation accuracy increase for training and validation sets shows a fluctuating change Model I, seen in Fig. 5. across all 10 epochs, as can be seen in Fig. 7, indicating

However, we also notice that the validation accuracy robustness. The overall accuracy decreases as is expected drops at the end of running, demonstrating that the model with rising loss. weights and biases have not yet reached a stable optimum.

Interestingly, the validation loss does not increase during the same period, as would be expected. Such occurrences 4.2. Discussion on Experiments might indicate when the model weights and biases are We observe in all our experimentation that Model I somemore precisely able to replicate several of the previously what outperforms Model II in both loss and accuracy. correct results, while losing ground on some of the less Model I has a final training accuracy of 0.655 and final certain results. In other terms, the model is becoming validation accuracy of 0.653. Model II had a final validamore confident producing target sentence words easy to tion accuracy of 0.645 and a final validation accuracy of predict, while simultaneously losing confidence in words 0.649. Model I has a final training loss of 2.78 and the final that are more dificult to predict. validation loss was 2.85. Model II has a final validation

For Model II, we change to a larger learning rate of loss of 4.66 and a final validation loss of 5.55. 0.05 and a larger number of GRU units, 512. The results Model I depicts a consistent decrease in loss and a difer drastically from Model I. The loss for both the consistent increase in validation. Model II portrays a contraining and validation set have an overall increase across sistent increase in loss while the accuracy increases and all 10 epochs, as can be seen in Fig. 6. In the second, decreases throughout the training process without any third, fifth, eighth, and ninth epochs, the training loss consistency. Despite the markedly diferent behavior, the decreases. In the second, fifth, ninth, and tenth epoch, the two models both finish with a diference in translation validation loss decreases. In all other epochs, the training accuracy less than 1%. Overall, it seems as though the and validation losses both increase. The accuracy for both

5. Conclusions

lower learning rate of 0.01 in Model I produces better results than the learning rate of 0.05 in Model II. The accuracy of Model I is higher than in Model II and the loss In this paper, we model a framework using a GRU-based in Model I is lower than in Model II. The learning rate RNN to perform German-English news translation, deis a significant factor in how well the models perform. picting a method of eficient translation of text pieces In our previous attempts to find the best parameters to in morphologically rich languages. We notice high eftrain on, we find that 512 GRU units provide the best ifciency in training and testing the model. While the preliminary results. However, despite the fact that Model accuracy levels obtained here seem good for a starting II uses 512 GRU units, Model I still outperforms Model point, there is scope for further improvement. II on the whole. It is clear that the higher learning rate In future work, apart from considering approaches hinders Model II much more than the use of 512 GRU such as word2vec and bidirectional RNN, as well as tununits is able to help it. It is likely that with the high ing some hyper-parameters, we could recommend using learning rate, Model II over-corrects and is not able to more training epochs. Selecting an appropriate learnnarrow in on optimal results. This is reflected in Figs. 6 ing rate and number of GRU units, as well as securing and 7 where we can see that the loss increases and the suficient training time are challenges for training deep accuracy is inconsistent. The learning rate of both of our learning MT models. During our attempts to tune the models hovers around the 65% range. Though Google model, we observe that smaller learning rates require Translate gives an accuracy in the 80% range, it faces the more training epochs, given the smaller changes made to issue of a maximum character limit which is not feasible the weights each update, whereas larger learning rates for translating news articles. Similar critiques can be ap- result in rapid changes and require fewer training epochs. plied to other tools and methods in the literature. Hence, Later, this work might benefit from using a learning rate our work, though at an early stage, can address such is- that decreases with each epoch, allowing the initial trainsues and pave the way for building eficient, larger scale, ing to advance quickly while letting the fine-tuning take and easily accessible mobile apps in news translation for the time it needs. These are some recommendations morphologically rich languages. This would complement based on our study in this paper. Furthermore, we could other state-of-the-art apps. potentially incorporate commonsense knowledge into

One limitation on our model’s performance may have the learning process. As depicted in recent works, deep been the technique used to transform our data into fea- learning based models and commonsense based models ture sets. A simple word-integer assignment method is can complement each other for enhanced performance. used here that may have been a detriment due to its rep- Files for this project are available on GitHub and can resentation of words in an ordinal system as opposed to a be provided to interested users upon request. On the categorical one. Alternate approaches could include one- whole, this work provides the ground for developing hot encoding [34] or word vector generation word2vec mobile apps for news translation orthogonal to existing [35]. We could also implement an alternative architec- work in the area. It caters broadly to the United Nations ture such as a bidirectional RNN [36] into the framework Sustainable Development Goal of Quality Education. to explore if it enhances model performance We chose to work with a simple approach first in line with the 6. Acknowledgments logic of preferring simpler theories over complex ones as per Occam’s razor principle [37], and also given the Authors are in alphabetical order. A. Saxena has been funded fact that we need reduced complexity and high eficiency by a GA from the Comp. Sc. dept. at MSU. A. Varde has NSF for translation tasks in this context. While our simple grants 2018575 and 2117308. She is a visiting researcher at the approach allows the model to observe general context Max Planck Institute for Informatics, Germany. patterns, it does not ofer semantic representation of the words to be translated. Furthermore, for better under- References standing the performance of the translation model, we could consider adopting BLEU scores in the evaluation of our future model, since this is widely recognized as a reliable evaluation criteria in the machine translation ifeld [10]. On the whole, our current framework creates a good baseline for translating German news to English, capturing reference to context. [1] E. WMT, Translation Task— German-English corpus, 2021. URL: https://www.statmt.org/wmt21/ translation-task.html. [2] Zeitung, Frankfurter Allegemeine, 2021. URL: https://

www.faz.net/aktuell. [3] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation, Technical Report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985. [4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, work for collocation error correction in web pages and F. Bougares, H. Schwenk, Y. Bengio, Learning phrase text documents, ACM SIGKDD Explorations 17 (2015) representations using rnn encoder-decoder for statisti- 14–23. cal machine translation, arXiv preprint arXiv:1406.1078 [22] P. Bhagat, A. S. Varde, A. Feldman, Wordprep: Word(2014). based preposition prediction tool, in: 2019 IEEE Interna[5] F. Chollet, Deep learning with Python, Simon and Schus- tional Conference on Big Data (Big Data), IEEE, 2019, pp.

ter, 2021. 2169–2176. [6] UN, SDG website, 2021. URL: [23] J. Briskilal, C. Subalalitha, An ensemble model for claswww.un.org/sustainabledevelopment/ sifying idioms and literal texts using bert and roberta, sustainabledevelopment-goals/. Information Processing & Management 59 (2022) 102756. [7] P. Basavaraju, A. S. Varde, Supervised learning tech- [24] A. Elghafari, D. Meurers, H. Wunsch, Exploring the dataniques in mobile device apps for androids, ACM SIGKDD driven prediction of prepositions in english, in: Coling Explorations 18 (2017) 18–29. 2010: Posters, 2010, pp. 267–275. [8] E. Avramidis, V. Macketanz, U. Strohriegel, A. Bur- [25] G. De Melo, A. S. Varde, Scalable learning technologies chardt, S. Möller, Fine-grained linguistic evaluation for big data mining, in: 20th International Conference on for state-of-the-art machine translation, arXiv preprint Database Systems for Advanced Applications, DASFAA arXiv:2010.06359 (2020). 2015, Springer Verlag, 2015. [9] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical eval- [26] K. Gref, R. K. Srivastava, J. Koutník, B. R. Steunebrink, uation of gated recurrent neural networks on sequence J. Schmidhuber, Lstm: A search space odyssey, IEEE modeling, arXiv preprint arXiv:1412.3555 (2014). transactions on neural networks and learning systems [10] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a 28 (2016) 2222–2232.

method for automatic evaluation of machine translation, [27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Prein: Proceedings of the 40th annual meeting of the Asso- training of deep bidirectional transformers for language ciation for Computational Linguistics, 2002, pp. 311–318. understanding, 2019. arXiv:1810.04805. [11] B. Zhang, D. Xiong, J. Su, H. Duan, A context- [28] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, aware recurrent encoder for neural machine transla- Improving language understanding by generative pretion, IEEE/ACM Transactions on Audio, Speech, and training (2018).

Language Processing 25 (2017) 2424–2432. [29] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang, [12] X. Tan, J. Chen, D. He, Y. Xia, T. Qin, T.-Y. Liu, Multilin- M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of gual neural machine translation with language cluster- transfer learning with a unified text-to-text transformer, ing, arXiv preprint arXiv:1908.09324 (2019). 2020. arXiv:1910.10683. [13] B. Dong, A. S. Varde, D. Stevanovic, J. Wang, L. Zhao, In- [30] N. Tandon, A. S. Varde, G. de Melo, Commonsense knowlterpretable distance metric learning for handwritten chi- edge in machine intelligence, ACM SIGMOD Record 46 nese character recognition, CoRR abs/2103.09714 (2021). (2017) 49–52.

arXiv:2103.09714. [31] C. Matuszek, M. Witbrock, R. C. Kahlert, J. Cabral, [14] J. Hajic, Machine translation of very close languages, in: D. Schneider, P. Shah, D. Lenat, Searching for common Sixth Applied Natural Language Processing Conference, sense: Populating cyc from the web, UMBC Computer 2000, pp. 7–12. Science and Electrical Engineering Department Collec[15] I. M. Azpiazu, M. S. Pera, A framework for hierarchi- tion (2005).

cal multilingual machine translation, arXiv preprint [32] E. Onyeka, A. S. Varde, V. Anu, N. Tandon, O. Daramola, arXiv:2005.05507 (2020). Using commonsense knowledge and text mining for im[16] A. Oncevay, B. Haddow, A. Birch, Bridging linguis- plicit requirements localization, in: 2020 IEEE 32nd Intertic typology and multilingual machine translation with national Conference on Tools with Artificial Intelligence multi-view language representations, arXiv preprint (ICTAI), IEEE, 2020, pp. 935–940.

arXiv:2004.14923 (2020). [33] S. Razniewski, N. Tandon, A. S. Varde, Information to [17] M. Popović, Comparing language related issues for nmt wisdom: commonsense knowledge extraction and compiand pbmt between german and english, The Prague lation, in: ACM International Conference on Web Search Bulletin of Mathematical Linguistics 108 (2017) 209. and Data Mining (WSDM), 2021, pp. 1143–1146. [18] D. Dahlmeier, H. T. Ng, Correcting semantic collocation [34] J. Liang, J. Chen, X. Zhang, Y. Zhou, J. Lin, One-hot enerrors with l1-induced paraphrases, in: Proceedings of coding and convolutional neural network based anomaly the 2011 conference on empirical methods in natural detection, Journal of Tsinghua University (Science and language processing, 2011, pp. 107–117. Technology) 59 (2019) 523–529. [19] N.-R. Han, M. Chodorow, C. Leacock, Detecting errors [35] T. Mikolov, K. Chen, G. Corrado, J. Dean, Eficient estiin english article usage by non-native speakers, Natural mation of word representations in vector space, arXiv Language Engineering 12 (2006) 115–129. preprint arXiv:1301.3781 (2013). [20] A. M. Pradhan, A. S. Varde, J. Peng, E. M. Fitzpatrick, [36] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural Automatic classification of article errors in l2 written networks, IEEE transactions on Signal Processing 45 english, in: Twenty-Third International FLAIRS Confer- (1997) 2673–2681.

ence, 2010. [37] T. Mitchell, M. L. McGraw-Hill, Edition, 1997. [21] A. Varghese, A. S. Varde, J. Peng, E. Fitzpatrick, A frame