Generalizable Architecture for Robust Word Vectors Tested by Noisy Paraphrases

Valentin Malykh
Laboratory of Neural Systems and Deep Learning, Moscow Institute of Physics and Technology, Moscow, Russia
valentin.malykh@phystech.edu

Abstract. This paper presents a language-independent architecture for robust word vectors. Robustness to typos is a pressing demand in industry today, given the prevalence of social networks, instant messaging, etc. The architecture is designed to be insensitive to typos such as switched, extra, and missing letters. Experiments on paraphrase corpora for three different languages demonstrate the applicability of the proposed approach in noisy environments.

Keywords: word vectors, noise-resilient, char-aware, neural nets

1 Introduction

The problem of noisy user input is widely known: typos, errors, wrong word usage, etc. A number of tools exist to handle this issue, such as spell-checkers embedded in websites' input forms, but the severity of the problem keeps growing, especially with the wide use of mobile devices with small and/or simplified keyboards.

On the other hand, word vectors have become very popular in recent years in a variety of tasks, such as text classification, paraphrase detection, and sentiment analysis. But to use these word vectors, the user input should first be cleared of noise. To address this issue we present a novel word vector architecture that is robust to a specific type of noise: missing or surplus letters in words.

To demonstrate the generality of the proposed approach, languages of different types were chosen: English as an (almost) analytic language, Russian as a synthetic fusional one, and Turkish as a synthetic agglutinative one. The test task was chosen to be paraphrase identification, since this task provides a natural metric: paraphrase or not, for every pair of sentences. Adding noise to these pairs enables us to compare different architectures with respect to noise robustness.

The contributions of the paper are:
– introduction of a typo-robust word vector architecture;
– results of testing on the Russian Paraphrase Corpus;
– results of testing on the Microsoft Research Paraphrase Corpus;
– results of testing on the Turkish Paraphrase Corpus.

2 Related work

Some of the results presented in this work were previously presented in a workshop paper [Malykh16]. This paper broadens the previously presented results with an additional language (Turkish) and additional experiments on the other languages.

The presented architecture builds on the works of Tomas Mikolov [Mikolov13], [Joulin16]. Both of these works lack support for out-of-vocabulary (OOV) words, which is an issue with noisy input. The same group also proposed another approach [Bojanowski16], where the OOV issue is resolved by composing a word vector as the sum of vectors for the n-grams that make up the word's letter representation. Our approach differs in that our model creates word embeddings on the fly, based only on their letter representation, so there is no explicit vocabulary in the model.

A closely related idea was presented for the corrupted word reconstruction task in English in [Sakaguchi16], where the authors demonstrate stable recognition of vocabulary words. Our approach uses a related initial word representation, BME.
2.1 BME

The Begin-Middle-End (BME) representation is related to the Begin-Intermediate-End (BIE) representation from [Sakaguchi16]. BME broadens BIE by keeping three initial and three ending characters instead of one. This is more suitable for languages with rich morphology, such as Russian or Turkish: for example, in Russian an affix has an average length of 2.54 characters [Polikarpov07].

2.2 LSTM

Our approach is based on Long Short-Term Memory cells as described in the original paper [Hochreiter97]:

g_u = \sigma(W_u h_{t-1} + I_u x_t)
g_f = \sigma(W_f h_{t-1} + I_f x_t)
g_o = \sigma(W_o h_{t-1} + I_o x_t)
g_c = \tanh(W_c h_{t-1} + I_c x_t)          (1)
m_t = g_f \odot m_{t-1} + g_u \odot g_c
h_t = \tanh(g_o \odot m_t)

Here \sigma is the logistic sigmoid function, W_u, W_f, W_o, W_c are recurrent weight matrices, and I_u, I_f, I_o, I_c are projection matrices. The indices u, f, and o denote the update, forget, and output gates of the LSTM cell, respectively, and c denotes the memory cell variables.

The main idea behind using recurrent neural networks is to exploit their ability to memorize context, since we assume that a word's meaning is highly correlated with the meanings of the surrounding words, following the so-called distributional hypothesis, which, to the best of the author's knowledge, first appeared in [Rubenstein65]. This relates our model to the word2vec approach, which also relies heavily on context when creating word vectors.

2.3 Corpora

A lot of previous work has been published on the Microsoft Research Paraphrase Corpus. A few results should be mentioned because of their use of word vectors: additive composition of vectors with cosine distance achieved 0.73 accuracy in 2014 [Milajevs14], and recursive neural nets using syntax-aware multi-sense word embeddings achieved 0.78 accuracy in 2015 [Cheng15]. For a relatively full list of works on this corpus we refer the reader to the ACL wiki (https://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)).

For the Russian Paraphrase Corpus there are two available works: [Loukashevich17] and [Pronoza16]. The latter paper is devoted to the construction of the corpus and presents no baseline method for the task of paraphrase identification itself. The former work uses SVM methods for paraphrase detection, with a result of 0.81 F1 measure in the comparable (two-class, non-standard) track.

Surprisingly, there are no previously published works on the Turkish Paraphrase Corpus, despite it having been partially available for quite some time.

It should also be explicitly stated that the proposed model does not pretend to compete on the paraphrase detection task as such; the task is used only to demonstrate the noise robustness of the presented word vectors.

3 Architecture

A few words should be said about the BME representation. The B part of the representation is the concatenation of one-hot encodings of the first three letters, the E part is the concatenation of one-hot encodings of the last three letters, and the M part is the sum of the one-hot vectors of all letters in the word. This representation is used as the initial input to our model. A graphical illustration of the BME representation is given at the bottom of Fig. 2, where the whole architecture is also shown. A minimal sketch of this encoding is given below.
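The BME construction can be written down directly. The following is a minimal sketch, assuming a fixed lowercase Latin alphabet, an extra padding slot for words shorter than three letters, and illustrative helper names (one_hot, bme_vector) that are not part of the original implementation.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"          # assumed alphabet; extend per language
CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}
PAD = len(ALPHABET)                               # extra index for padding and unknown characters


def one_hot(char: str) -> np.ndarray:
    """One-hot vector over the alphabet plus a padding slot."""
    vec = np.zeros(len(ALPHABET) + 1)
    vec[CHAR2IDX.get(char, PAD)] = 1.0
    return vec


def bme_vector(word: str) -> np.ndarray:
    """Begin-Middle-End encoding of a single word.

    B: concatenated one-hots of the first three letters,
    M: sum of one-hots of all letters,
    E: concatenated one-hots of the last three letters.
    """
    padded = list(word) + [""] * max(0, 3 - len(word))   # pad short words (an assumption of this sketch)
    begin = np.concatenate([one_hot(c) for c in padded[:3]])
    end = np.concatenate([one_hot(c) for c in padded[-3:]])
    middle = (np.sum([one_hot(c) for c in word], axis=0)
              if word else np.zeros(len(ALPHABET) + 1))
    return np.concatenate([begin, middle, end])


print(bme_vector("robust").shape)   # (189,) = 7 blocks (3 + 1 + 3) of size 27
```

In the actual model this vector is the per-word input; the alphabet, the padding scheme, and the handling of unknown characters are assumptions of this sketch rather than details fixed by the paper.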
The model itself consists of two fully-connected (FC) layers and three LSTM layers. The first FC layer maps the input BME representation to a fixed-size vector. This vector is fed to the LSTM layers, which are responsible for handling the context during training. The top layer of the model is again an FC layer, which produces the final fixed-size vector used as the embedding of the target word.

Fig. 2. Model graphical representation

Training follows the continuous bag-of-words (CBOW) procedure proposed in [Mikolov13]: the model is fed a window of surrounding words and is supposed to produce a vector for the target word. The target word vector is then compared to the context word vectors and to some distant word vectors by means of cosine similarity. A visual representation of the training process is given in Fig. 1 (the figure is adapted from Tensorflow.org).

Fig. 1. Negative Sampling

3.1 Negative Sampling

The negative sampling technique is used in CBOW training for speedup, and we use it as well. Negative sampling originally comes from [Smith05], but we use the definition from [Mikolov13]:

L(x) = \log \sum_{i \in C} e^{-s(x, w_i)} + \log \sum_{j \notin C} e^{s(x, w_j)}          (2)

where C is the set of indices of the words in the context of word x; the context is defined as the words in a predefined window surrounding the given one. s(x, w) is a similarity scoring function for two words; in our model it is the cosine similarity of the word vectors produced by the output layer of the network.

For our evaluation we chose a window size of 8, following [Pennington14]. The three-layer architecture showed the best results in our experiments, which is typical for language modelling tasks, where the number of layers varies from one [Sundermeyer12] to four [Sutskever14]. The model was implemented in the TensorFlow framework [Abadi16].

4 Experiment Setup

The conducted experiments are meant to demonstrate the noise robustness of the proposed architecture in contrast to the standard approach of [Mikolov13]. To achieve this goal the following setup was created:
– we measure the ROC AUC of paraphrase prediction (the true class consists of true paraphrases);
– prediction is based directly on the cosine similarity between phrase vectors;
– phrase vectors are computed by averaging the vectors of the (known) words of the phrase;
– we compare the standard word vectors for the particular language and the proposed architecture trained on the same corpus;
– we add noise to the input data and compare sensitivity to the noise level.

We also provide a random baseline, which for the chosen measure is always close to 0.5.

The noise emulation in this experiment setup consists of two components:
– the probability of inserting a letter after the current one; the letters are drawn uniformly from the alphabet;
– the probability of a letter disappearing.

Both types of noise are applied at the same time, and the noise level mentioned below always means that both probabilities are set to the specified value (a sketch of the procedure is given at the end of this section). This noise setup was chosen to demonstrate robustness against random errors, i.e. an unintentionally added extra letter or a missed letter in a word, rather than the typical typo (letter shuffling), since robustness to shuffling was already demonstrated in [Sakaguchi16].

We compare the solutions in the range [0.0, 0.30]. The value 0.30 was chosen arbitrarily, with the additional consideration that a 0.30 noise level is unrealistically high for practical use. For the random baseline and for every noise level (except the zero level) the experiment was run 10 times; the standard error does not exceed 0.003.
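The following is a minimal sketch of the noise emulation and scoring described above: random letter insertion and deletion at a given probability, phrase vectors obtained by averaging word vectors, cosine similarity as the score, and ROC AUC over the labelled pairs. The function names and the embed callback are illustrative assumptions, not the original implementation.

```python
import random
import numpy as np
from sklearn.metrics import roc_auc_score

ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # assumed; use the alphabet of the target language


def add_noise(word, p):
    """Drop each letter with prob. p and insert a uniformly drawn letter after it with prob. p."""
    out = []
    for ch in word:
        if random.random() >= p:           # the letter survives deletion
            out.append(ch)
        if random.random() < p:            # an extra letter is inserted after the current position
            out.append(random.choice(ALPHABET))
    return "".join(out)


def phrase_vector(phrase, embed, p):
    """Average the vectors of the (noised) words of a phrase; `embed` maps a word to a vector or None."""
    vectors = [embed(add_noise(w, p)) for w in phrase.split()]
    vectors = [v for v in vectors if v is not None]    # words unknown to the baseline are ignored
    return np.mean(vectors, axis=0) if vectors else None


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def evaluate(pairs, labels, embed, p):
    """ROC AUC of cosine similarity between phrase vectors at noise level p."""
    scores = []
    for s1, s2 in pairs:
        v1, v2 = phrase_vector(s1, embed, p), phrase_vector(s2, embed, p)
        # if a phrase has no known words at all, fall back to a neutral score
        scores.append(cosine(v1, v2) if v1 is not None and v2 is not None else 0.0)
    return roc_auc_score(labels, scores)
```

In the experiments each non-zero noise level is evaluated 10 times and the resulting scores are averaged.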
5 Experiment on Russian language

5.1 Corpus

The corpus is described in [Pronoza16]. It consists of news headlines from different news agencies that are supposed (by means of an automatic grading system) to be close in semantic meaning; additionally, they are all checked to be close in creation time. The corpus contains about 6000 pairs of phrases labelled as -1 (not a paraphrase), 0 (weak paraphrase), or 1 (strong paraphrase). For our evaluation we took only the -1 and 1 classes, i.e. non-paraphrases and strong paraphrases; there are 4470 such pairs in the corpus.

5.2 Random baseline

The random baseline simply reports a random number in the [0, 1] interval.

5.3 Word2Vec baseline

For the standard word2vec baseline we take the model provided by the RusVectores project (http://rusvectores.org/) [Kutuzov15]. The word2vec model we used was trained on the Russian National Corpus (RNC, http://www.ruscorpora.ru/), first described in [Andryuschenko89]. For this solution we also used the Mystem lemmatization engine (https://tech.yandex.ru/mystem/) described in [Segalovich03]. We average the vectors of all and only the lemmatized words known to the model, i.e. unknown words are ignored.

5.4 Our solution

For our solution we also take the mean vector over all words (since in our setup there is no such thing as an OOV word) and the cosine similarity between the resulting vectors. By design, our solution does not require any lemmatization or stemming. We also trained our model on the RNC.

5.5 Results on Russian language

The results are presented in Fig. 3. We can see that the noise level is an important characteristic of the input. The word2vec solution is highly sensitive to it, and from the level of 0.14 it produces virtually random results (due to the distribution of the test results, some of them are even worse than random). In contrast, our architecture demonstrates robustness to noise up to the level of 0.30. Importantly, the proposed architecture performs better from the level of 0.06 on, and its quality decreases steadily.

Fig. 3. The results of the experiment on Russian language

6 Experiment on English language

6.1 Corpus

The corpus is the Microsoft Research Paraphrase Corpus (available at https://www.microsoft.com/en-us/download/details.aspx?id=52398). It consists of 5800 pairs of sentences extracted from news sources on the web, provided with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.

6.2 Random baseline

The random baseline again simply reports a random number in the [0, 1] interval.

6.3 Word2Vec baseline

For the standard word2vec baseline here we take a model trained on the Reuters corpus [Lewis04] with the gensim software package (https://radimrehurek.com/gensim/models/word2vec.html). For this solution we used the Snowball stemmer described in [Porter01]. The model was trained for 500 iterations with the min_count value set to 2. At test time we average the vectors of all stemmed words known to the model. A sketch of such a baseline is given below.
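A minimal sketch of how such a word2vec baseline could be trained, assuming the gensim 3.x API (size/iter; in gensim 4 these became vector_size/epochs), NLTK's Snowball stemmer, and an assumed vector size of 300; the placeholder corpus and the embed helper are illustrative, not the original setup.

```python
from gensim.models import Word2Vec
from nltk.stem.snowball import SnowballStemmer

# Assumed NLTK stemmer; the Russian experiment uses Mystem lemmatization instead.
stemmer = SnowballStemmer("english")

# Placeholder corpus; in the paper this is the tokenized Reuters (RCV1) corpus.
raw_sentences = [
    ["markets", "rallied", "on", "monday"],
    ["markets", "fell", "on", "tuesday"],
]
stemmed_sentences = [[stemmer.stem(tok) for tok in sent] for sent in raw_sentences]

# Hyperparameters from the paper: 500 iterations, min_count = 2; vector size is assumed.
model = Word2Vec(stemmed_sentences, size=300, min_count=2, iter=500)


def embed(word):
    """Lookup compatible with the evaluation sketch in Section 4; unknown words are dropped."""
    stem = stemmer.stem(word)
    return model.wv[stem] if stem in model.wv else None
```

The embed function matches the scoring sketch above: words whose stems are not in the model vocabulary are simply excluded from the phrase average.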
6.4 Word2Vec baseline 2

For reference we also provide the result of testing another word2vec model in this setup: the Google News word vectors model, available online (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). To the best of our knowledge it is the largest available model for the English language. It was trained on a corpus of 3 billion words and contains 3 million tokens. Unfortunately, the corpus it was trained on is not publicly available, so we could not compare to it directly. For this model we do not use lemmatization, since it contains the word forms of the majority of the words.

6.5 Our solution

For our solution we again take the mean vector over all words and the cosine similarity between the resulting vectors. We also trained our model on the Reuters corpus.

6.6 Results on English language

The results are presented in Fig. 4.

Fig. 4. The results of the experiment on English language

Here we can also see that the noise level is an important characteristic of the input, but for the English language the effects are "postponed" to higher noise levels. The word2vec solutions for English are not as sensitive to the noise level: for the Reuters-trained model the random level (the level of noise at which the produced results become indistinguishable from the random baseline) is reached only at 0.27, and for the Google News model the random level lies to the right of the 0.30 border of our plot. The proposed architecture performs better starting from the 0.14 level against the Reuters-trained model and from 0.16 against the Google News model.

It is also important to mention that our model is not only more robust than the Google News one, but also contains far fewer parameters: the Google News model has 3 million tokens with vector length 300, i.e. about 1 billion parameters, while the proposed architecture by design has only on the order of the squared layer width per layer, which for a width of 1024 and three layers gives about 3 million parameters for the whole model.

7 Experiment on Turkish language

7.1 Corpus

The corpus is the Turkish Paraphrase Corpus (TPC) described in [Demir12]. As of today, only the news part of this corpus is available (from https://osf.io/wp83a/). It contains 846 pairs of sentences from news sources on the web, provided with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.

7.2 Random baseline

The random baseline, as in the previous experiments, reports a random number in the [0, 1] interval.

7.3 Word2Vec baseline

For the standard word2vec baseline here we take a model trained on the "42 bin haber" (42 thousand news) corpus described in [Yildirim03], again using the gensim software package. For this solution we used the Snowball stemmer for the Turkish language described in [Eryigit04]. The model was trained for 500 iterations with the min_count value set to 2. At test time we average the vectors of all stemmed words known to the model.

7.4 Our solution

For our solution we again take the mean vector over all words and the cosine similarity between the resulting vectors. We also trained our model on the "42 bin haber" corpus.

7.5 Results on Turkish language

The results are presented in Fig. 5. As we can see, the results on the TPC are not very impressive, but the main feature, noise robustness, can be noticed nevertheless.

Fig. 5. The results of the experiment on Turkish language

8 Conclusion

The robust word vector model has demonstrated the ability to be indifferent to certain levels of noise. It is better than the standard, widely used word2vec model at noise levels from 0.06 for Russian and from 0.14 for English up to at least 0.30. This seems to be a practical range, but in future work we should try to improve our model so that it produces better results with less noise or without noise at all.
The difference in this crossover level (0.06 for Russian versus 0.14 for English) can be explained by the fact that Russian is a fusional language with rich morphology, which, on the one hand, is stable for most words and easy for the model to learn, and, on the other hand, means that a disruption of the inflection can leave the lemmatizer, and consequently the standard word2vec model, unable to produce a vector for the word. English, in contrast, is a language with a strong analytic tendency, so its morphology is poor and the letters of a word remain meaningful even at its end.

For the Turkish language the critical noise level is as low as 0.05, which seems unreasonably low. A possible explanation is that for agglutinative languages the whole structure of the word is important, but more likely the available corpora are simply not large enough for the proposed approaches to demonstrate reasonable quality.

For future work we are considering improving the architecture to achieve higher scores at small noise levels and conducting more experiments on different architecture variations.

References

Kutuzov15. Kutuzov, A. and Andreev, I., 2015. Texts in, meaning out: neural language models in semantic similarity task for Russian. Proceedings of the Dialog 2015 Conference, Moscow, Russia.
Mikolov12. Mikolov, T., Sutskever, I., Deoras, A., Le, H.S., Kombrink, S. and Cernocky, J., 2012. Subword language modeling with neural networks. Preprint, http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf.
Mikolov13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J., 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
Joulin16. Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T., 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.
Sakaguchi16. Sakaguchi, K., Duh, K., Post, M. and Van Durme, B., 2016. Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network. arXiv preprint arXiv:1608.02214.
Hochreiter97. Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
Andryuschenko89. Andryuschenko, V.M., 1989. Konzepziya i arhitectura Mashinnogo fonda russkogo jazyka (The concept and design of the Computer Fund of Russian Language). Moskva: Nauka (in Russian).
Segalovich03. Segalovich, I., 2003. A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine. In MLMTA (pp. 273-280).
Smith05. Smith, N.A. and Eisner, J., 2005. Contrastive estimation: Training log-linear models on unlabeled data. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.
Lewis04. Lewis, D.D., Yang, Y., Rose, T. and Li, F., 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397.
Porter01. Porter, M.F., 2001. Snowball: A language for stemming algorithms.
Milajevs14. Milajevs, D., Kartsaklis, D., Sadrzadeh, M. and Purver, M., 2014. Evaluating Neural Word Representations in Tensor-Based Compositional Settings. Proceedings of EMNLP 2014, Doha, Qatar, pp. 708-719.
Cheng15. Cheng, J. and Kartsaklis, D., 2015. Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning. Proceedings of EMNLP 2015, Lisbon, Portugal, pp. 1531-1542.
Demir12. Demir, S., El-Kahlout, I.D., Unal, E. and Kaya, H., 2012. Turkish Paraphrase Corpus. In LREC (pp. 4087-4091).
Yildirim03. Yildirim, O., Atik, F. and Amasyali, M.F., 2003. 42 Bin Haber Veri Kumesi. Yildiz Teknik Universitesi, Bilgisayar Muh. Bolumu.
Eryigit04. Eryigit, G. and Adali, E., 2004. An Affix Stripping Morphological Analyzer for Turkish. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications 2004, Innsbruck, Austria.
Pronoza16. Pronoza, E., Yagunova, E. and Pronoza, A., 2016. Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In Information Retrieval (pp. 146-157). Springer International Publishing.
Malykh16. Malykh, V., 2016. Robust Word Vectors for Russian Language. In Proceedings of the AINL FRUCT Conference 2016.
Bojanowski16. Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T., 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.
Rubenstein65. Rubenstein, H. and Goodenough, J., 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.
Loukashevich17. Loukachevitch, N.V., Shevelev, A.S. and Mozharova, V.A., 2017. Testing Features and Methods in Russian Paraphrasing Task. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2017".
Polikarpov07. Polikarpov, A.A., 2007. Towards the Foundations of Menzerath's Law. In: Grzybek, P. (ed.) Contributions to the Science of Text and Language. Text, Speech and Language Technology, vol. 31. Springer, Dordrecht.
Pennington14. Pennington, J., Socher, R. and Manning, C.D., 2014. GloVe: Global vectors for word representation. In EMNLP (Vol. 14, pp. 1532-1543).
Sutskever14. Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).
Sundermeyer12. Sundermeyer, M., Schlueter, R. and Ney, H., 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
Abadi16. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., et al., 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.