<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generalizable Architecture for Robust Word Vectors Tested by Noisy Paraphrases</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>Laboratory of Neural Systems and Deep Learning, Moscow Institute of Physics and Technology, Moscow</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a language-independent architecture for robust word vectors. Robustness to typos is a pressing demand of the current industry, given social networks, instant messaging, etc. The architecture is designed to be indifferent to typos such as letter transpositions, extra letters, and missing letters. Experiments on paraphrase corpora for three different languages demonstrate the applicability of the proposed approach in noisy environments.</p>
      </abstract>
      <kwd-group>
        <kwd>word vectors</kwd>
        <kwd>noise-resilient</kwd>
        <kwd>char-aware</kwd>
        <kwd>neural nets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The problem of user input data is widely known: typos, errors, wrong word
usage, etc. To handle this issue there exists a bunch of different tools, such as spellcheckers
embedded in websites’ input forms, but the severity of this problem keeps growing,
especially with the wide usage of mobile devices with small and/or simplified keyboards.</p>
<p>On the other hand, word vectors have become very popular in past years in a variety of
tasks, like text classification, paraphrase detection, sentiment analysis, etc. But to use
these word vectors the user input should be cleared of noise. To address this issue
we present a novel architecture of word vectors robust to a specific type of noise:
missing or surplus letters in words.</p>
<p>To demonstrate the generality of the proposed approach, languages of
different types were chosen: English as (almost) analytic, Russian as synthetic flective, and
Turkish as synthetic agglutinative. Paraphrase identification was chosen as the task to test
the approach against, since this task has a natural metric: paraphrase or not
for every pair of sentences. Adding noise to these pairs enables us to
compare different architectures on noise robustness.</p>
<p>The formal contribution of the paper is:
– Introduction of a typo-robust word vector architecture.
– Results of testing on the Russian Paraphrase Corpus.
– Results of testing on the Microsoft Research Paraphrase Corpus.
– Results of testing on the Turkish Paraphrase Corpus.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
<p>Some of the results presented in this work have previously been presented as a workshop
paper [Malykh16]. This paper broadens the previously presented results with an additional
language (Turkish) and additional experiments on the other languages.</p>
<p>The presented architecture is based on previous works of Tomas Mikolov [Mikolov13],
[Joulin16]. Both of the mentioned works lack support for out-of-vocabulary
(OOV) words, which can be an issue with noisy input. The group to which the authors
of the previously mentioned works belong also proposed another approach [Bojanowski16],
where the issue of OOV words is resolved by composing a word vector as the
sum of vectors for the n-grams of which the word’s letter representation consists. Our
approach differs in that our model creates embeddings of words
on-the-fly, based only on their letter representation, so we have no explicit vocabulary in our
model.</p>
<p>A close idea was presented for the corrupted word reconstruction task for the English language
in [Sakaguchi16], where the authors demonstrate stable recognition of
vocabulary words. Our approach uses a related initial word representation, BME.</p>
<sec id="sec-2-0">
        <title>BME</title>
        <p>The Begin-Middle-End (BME) representation is related to the Begin-Intermediate-End (BIE)
representation from [Sakaguchi16]. The BME representation broadens the BIE
representation by taking three instead of one initial and ending characters. This is
more suitable for languages with rich morphology, like Russian or Turkish: for example,
in the Russian language an affix has an average length of 2.54 [Polikarpov07].</p>
      </sec>
      <sec id="sec-2-1">
        <title>LSTM</title>
<p>Our approach is based on Long Short-Term Memory cells as described in the original paper
[Hochreiter97].</p>
<p>g_u = σ(W_u h_{t−1} + I_u x_t)
g_f = σ(W_f h_{t−1} + I_f x_t)
g_o = σ(W_o h_{t−1} + I_o x_t)
g_c = tanh(W_c h_{t−1} + I_c x_t)
m_t = g_f ⊙ m_{t−1} + g_u ⊙ g_c
h_t = tanh(g_o ⊙ m_t)
(1)
Here σ is the logistic sigmoid function and ⊙ denotes element-wise multiplication; W_u, W_f, W_o, W_c are recurrent weight
matrices and I_u, I_f, I_o, I_c are projection matrices. The u, f, and o denote the update,
forget, and output gates of the LSTM cell respectively, and c denotes memory cell related
variables.</p>
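<p>The gate equations above can be sketched as a single recurrent step in plain NumPy. This is a minimal illustration of equation (1), not the paper’s TensorFlow implementation; the weight shapes and toy dimensions are assumptions of the sketch.</p>
        <preformat>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, m_prev, params):
    """One step of equation (1): update/forget/output gates and a
    candidate memory, all computed from h_{t-1} and the input x_t."""
    Wu, Wf, Wo, Wc, Iu, Ifg, Io, Ic = params
    gu = sigmoid(Wu @ h_prev + Iu @ x)   # update gate
    gf = sigmoid(Wf @ h_prev + Ifg @ x)  # forget gate
    go = sigmoid(Wo @ h_prev + Io @ x)   # output gate
    gc = np.tanh(Wc @ h_prev + Ic @ x)   # candidate memory
    m = gf * m_prev + gu * gc            # new memory cell m_t
    h = np.tanh(go * m)                  # new hidden state h_t
    return h, m

# toy sizes (assumed): hidden size 4, input size 3
rng = np.random.default_rng(0)
H, D = 4, 3
params = [rng.normal(size=(H, H)) for _ in range(4)] + [
    rng.normal(size=(H, D)) for _ in range(4)]
h, m = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), params)
```
</preformat>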
<p>The main idea behind using recurrent neural nets is to exploit their ability
to memorise the context, since we suppose that a word’s meaning is highly correlated
with the meanings of the surrounding words, following the so-called distributional
hypothesis, which to the best of the authors’ knowledge first appeared in [Rubenstein65]. This
relates our model to the word2vec approach, which also heavily relies on context in
the creation of word vectors.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Corpora</title>
<p>For the Microsoft Research Paraphrase Corpus a lot of previous work has been published.
A few works should be mentioned due to their word vector usage: additive
composition of vectors with cosine distance achieved 0.73 accuracy in 2014 [Milajevs14], and
recursive neural nets using syntax-aware multi-sense word embeddings achieved 0.78
accuracy in 2015 [Cheng15]. For a relatively full list of works on this corpus we
refer the reader to the ACL website1.</p>
<p>For the Russian Paraphrase Corpus there are two available works: [Loukashevich17]
and [Pronoza16]. The latter paper is devoted to the construction of the corpus and presents no
baseline method for the task of paraphrase identification itself. The former
work uses SVM methods for the task of paraphrase detection, with a result in the
comparable (two-class, non-standard) track of 0.81 F1 measure.</p>
<p>Surprisingly, there are no previously published works on the Turkish Paraphrase
Corpus, despite the fact that it has been partially available for quite some time.</p>
      <p>It should also be explicitly stated that the proposed model does not pretend to compete
in the paraphrase detection task; this task is used to demonstrate the noise
robustness of the presented word vectors.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Architecture</title>
<p>A few words should be said about the BME representation. The B part of the
representation consists of one-hot encodings of the first three letters, the E part consists of one-hot
encodings of the last three letters, and the M part is the sum of one-hot encoded vectors for all
the letters in the word. This representation is used as the initial input for our model.
The graphical depiction of the BME representation is presented at the bottom of figure
2, where the whole architecture is also presented.</p>
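<p>As a concrete illustration, the BME construction just described can be sketched as follows. The padding of words shorter than three letters is an assumption of this sketch (the paper does not specify it), and the alphabet is likewise illustrative.</p>
      <preformat>
```python
import numpy as np

def one_hot(ch, alphabet):
    v = np.zeros(len(alphabet))
    v[alphabet.index(ch)] = 1.0
    return v

def bme(word, alphabet):
    """B: one-hot vectors of the first three letters; E: one-hot
    vectors of the last three letters; M: the sum of one-hot vectors
    over all letters of the word."""
    padded = word if len(word) >= 3 else word + "_" * (3 - len(word))
    b = np.concatenate([one_hot(c, alphabet) for c in padded[:3]])
    e = np.concatenate([one_hot(c, alphabet) for c in padded[-3:]])
    m = np.sum([one_hot(c, alphabet) for c in word], axis=0)
    return np.concatenate([b, m, e])  # length 7 * len(alphabet)

alphabet = "abcdefghijklmnopqrstuvwxyz_"
v = bme("robust", alphabet)
```
</preformat>
      <p>Note that the M part discards letter order, which is what makes the representation insensitive to reorderings of the middle letters of a word.</p>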
<p>The model itself consists of two fully-connected (FC) layers and three LSTM
layers. The first FC layer maps the input BME representation to a fixed-size
vector. This vector is fed to the LSTM layers, which are responsible for handling the context
during training. The top layer of the model is also FC; it produces the final fixed-size
vector used as the embedding of the target word.</p>
<p>The training follows the procedure for continuous bag-of-words (CBOW)
proposed in [Mikolov13]. I.e., the model is fed a window of surrounding words and
is supposed to produce a vector for the target word. The target word vector is then
compared to the context word vectors and to some distant word vectors by means of cosine
similarity. A visual representation of the training process is given in figure 1.2</p>
      <p>The Negative Sampling technique is used in the CBOW training process for speedup,
and we also use it. Originally, negative sampling comes from [Smith05], but we
use the definition of negative sampling according to [Mikolov13]:
1 https://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)
2 This figure is taken from Tensorflow.org.</p>
<p>L(x) = log(Σ_{i∈C} e^{s(x, w_i)}) + log(Σ_{j∉C} e^{−s(x, w_j)})
(2)
where C is the set of indices of the words in the context of word x. The context is defined as
the words in a predefined window surrounding the given one. s(x, w) is a similarity scoring
function for two words; in our model it is the cosine similarity of the word vectors produced
by the output layer of the network.</p>
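<p>Under the reading of equation (2) in which context words enter with a positive score and distant (negative) words with a negated one, the objective can be sketched as below; the sign convention and the function names are assumptions of this sketch.</p>
      <preformat>
```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def neg_sampling_objective(x_vec, context_vecs, negative_vecs):
    """Equation (2): the target word vector should score high against
    its context words and low against sampled distant words; s(x, w)
    is cosine similarity, as in the model's output layer."""
    pos = np.log(sum(np.exp(cosine(x_vec, c)) for c in context_vecs))
    neg = np.log(sum(np.exp(-cosine(x_vec, n)) for n in negative_vecs))
    return pos + neg
```
</preformat>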
<p>For our evaluation we chose a window size of 8, following [Pennington14]. The
three-layer architecture showed the best results in our experiments, which is typical for language
modelling tasks, where the number of layers varies from one [Sundermeyer12] to four
[Sutskever14]. The model was implemented with the TensorFlow framework [Abadi16].</p>
    </sec>
    <sec id="sec-4">
      <title>Experiment Setup</title>
<p>The conducted experiments are supposed to demonstrate the noise robustness of the
proposed architecture in contrast to the standard approach presented in [Mikolov13]. To
achieve this goal the following setup was created:
– we measure the ROC AUC of paraphrase prediction (the true class consists of true
paraphrases);
– prediction is based directly on the cosine similarity between the vectors for the phrases;
– the vectors for phrases are computed by averaging the vectors for the (known) words of
the phrase;
– we compare the standard word vectors for a particular language and the proposed
architecture trained on the same corpus;
– we add noise to the input data and compare the sensitivity to the noise
level.</p>
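<p>The evaluation pipeline in the list above reduces to a few lines: average the word vectors of each phrase, then use the cosine similarity of the two averages directly as the paraphrase score. A minimal sketch (the function names are illustrative):</p>
      <preformat>
```python
import numpy as np

def phrase_vector(word_vectors):
    """Phrase vector = plain average of the (known) word vectors."""
    return np.mean(word_vectors, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def paraphrase_scores(pairs):
    """pairs: list of (word_vectors_a, word_vectors_b); the score for
    each pair is fed directly into ROC AUC as the prediction."""
    return [cosine(phrase_vector(a), phrase_vector(b)) for a, b in pairs]
```
</preformat>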
<p>We also provide a random baseline, which for the chosen measure is always close to 0.5.
The noise emulation in this experiment setup consists of two components:
– the probability of inserting a letter after the current one (the inserted letters are drawn
uniformly from the alphabet);</p>
      <p>– the probability of a letter disappearing.</p>
<p>Both types of noise emulation are applied at the same time. The noise level
mentioned below always means that both probabilities are set to the specified value.</p>
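<p>The two noise components can be emulated in a single pass over the word, as sketched below; the function name and the use of Python’s standard random generator are assumptions of this sketch.</p>
      <preformat>
```python
import random

def add_noise(word, alphabet, p_insert, p_delete, rng=random):
    """Apply both noise types at once: each letter disappears with
    probability p_delete, and after each position a uniformly drawn
    letter is inserted with probability p_insert."""
    out = []
    for ch in word:
        if rng.random() >= p_delete:   # letter survives deletion
            out.append(ch)
        if p_insert > rng.random():    # insert a random extra letter
            out.append(rng.choice(alphabet))
    return "".join(out)

random.seed(0)
noisy = add_noise("paraphrase", "abcdefghijklmnopqrstuvwxyz", 0.1, 0.1)
```
</preformat>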
<p>This noise setup was chosen to demonstrate robustness against random
errors, i.e. an unintentionally added extra letter or a missed letter in a word, not the typical
typo (letter shuffle), since robustness against shuffling was demonstrated in
[Sakaguchi16].</p>
<p>We compare the solutions in the range [0.0, 0.30]. The value 0.30 was chosen arbitrarily,
with the additional consideration that a 0.30 noise level is unrealistic and too high for
practical use. For the random baseline and for every noise level (except the zero level) the
experiment was conducted 10 times. The standard error does not exceed 0.003.</p>
    </sec>
    <sec id="sec-5">
      <title>Experiment on Russian language</title>
<p>The corpus is described in [Pronoza16]. It consists of news headings from
different news agencies which are supposed (by means of an automatic grading
system) to be close in semantic meaning. Additionally, they are all tested to be close in
creation time. The corpus contains about 6000 pairs of phrases, labeled as
-1 (not a paraphrase), 0 (weak paraphrase), or 1 (strong paraphrase). For our evaluation
we take only the -1 &amp; 1 classes, i.e. non-paraphrase and strong paraphrase. There are
4470 such pairs in the corpus.</p>
      <sec id="sec-5-1">
        <title>Random baseline</title>
<p>The random baseline simply reports a random number in the [0, 1] interval.</p>
        <p>For the standard word2vec baseline we take the model adopted from the RusVectores
project3 [Kutuzov15]. The word2vec model we used was trained on the Russian
National Corpus (RNC)4, first described in [Andryuschenko89]. For this solution
we also used the Mystem lemmatization engine5 described in [Segalovich03]. We
average the vectors of all (and only) the lemmatized words known to the model, i.e. unknown
words are ignored.</p>
        <p>For our solution we also take the mean vector over all the words (since in our setup there is
no such thing as OOV) and the cosine similarity between the resulting vectors. By design our
solution does not demand any lemmatization or stemming. We also trained our model
on the RNC.</p>
        <sec id="sec-5-1-1">
<title>Results on Russian language</title>
          <p>The results are presented in figure 3. We can see that the level of noise is an important characteristic of the input. The
word2vec solution is highly sensitive to the noise level, and from the level of 0.14
it generates virtually random results (due to the distribution of the test results, some
of them are worse than random). In contrast, our architecture demonstrates
robustness to noise up to the level of 0.30. It is important that the proposed architecture
performs better from the level of 0.06 and its quality decreases steadily.
3 http://rusvectores.org/
4 http://www.ruscorpora.ru/
5 https://tech.yandex.ru/mystem/</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
<title>Experiment on English language</title>
      <sec id="sec-6-1">
<title>Corpus</title>
        <p>The corpus is Microsoft Research Paraphrase Corpus6. This corpus consists of 5800
pairs of sentences which have been extracted from news sources on the web and
provided with human annotations indicating whether each pair captures a paraphrase/semantic
equivalence relationship.</p>
      </sec>
      <sec id="sec-6-2">
<title>Random baseline</title>
<p>The random baseline again simply reports a random number in the [0, 1] interval.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Word2Vec baseline</title>
<p>For the standard word2vec baseline here we take a model trained on the Reuters
corpus [Lewis04] with the gensim software package7. For this solution we used the
Snowball stemmer described in [Porter01]. The model was trained for 500 iterations
with the min count value set to 2. At the testing stage we average the vectors of
all stemmed words known to the model.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Word2Vec baseline 2</title>
<p>For reference we also provide the result of testing the other word2vec model
in this setup: the Google News word vectors model, available online8. To the best
of our knowledge it is the largest available model for the English language. It was trained
on a corpus of 3 billion words and has 3 million tokens. Unfortunately, the corpus on
which it was trained is not publicly available, so we could not compare to it directly.
For this model we do not use lemmatization, since it contains the word forms of the
majority of the words.
6 It is available from here: https://www.microsoft.com/en-us/download/details.aspx?id=52398
7 https://radimrehurek.com/gensim/models/word2vec.html
8 https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing</p>
      </sec>
      <sec id="sec-6-5">
        <title>Our solution</title>
<p>For our solution we also take the mean vector over all the words and the cosine similarity
between the resulting vectors. We also trained our model on the Reuters corpus.</p>
      </sec>
      <sec id="sec-6-6">
        <title>Results on English language</title>
        <sec id="sec-6-6-1">
<title>The results are presented in figure 4.</title>
<p>Here we can also see that the level of noise is an important characteristic of the input.
But for the English language the effects are "postponed" to higher noise levels. The
word2vec solutions for the English language are not as sensitive to the noise level: for
the Reuters-trained model the random level9 comes only from 0.27, and for the Google
News model the random level is to the right of the 0.30 border of our plot.</p>
<p>The proposed architecture performs better starting from the 0.14 level for the Reuters-trained
model and 0.16 for the Google News model. It is also important to mention that
our model is not only more robust than the Google News one, but it also contains far fewer
parameters: for Google News we have 3 million tokens by a vector length of 300, i.e. about 1 billion
parameters, while for the proposed architecture it is by design only the squared layer width,
which is 1024 for each of the three layers in our experiments. That gives us 3 million parameters
for the whole model.
9 The level of noise in the input data at which the produced results are indistinguishable from the random
baseline results.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Experiment on Turkish language</title>
<p>The corpus is the Turkish Paraphrase Corpus (TPC), described in [Demir12]. As of this
day only the news part of this corpus is available10. It contains 846 pairs of sentences
from news sources on the web, provided with human annotations indicating whether
each pair captures a paraphrase/semantic equivalence relationship.</p>
      <sec id="sec-7-1">
        <title>Random baseline</title>
<p>The random baseline, as in the previous experiments, reports a random number in
the [0, 1] interval.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Word2Vec baseline</title>
<p>For the standard word2vec baseline here we take a model trained on the "42 bin
haber" (42 thousand news) corpus described in [Yildirim03]. We again used the gensim
software package. For this solution we used the Snowball stemmer for the Turkish
language described in [Eryigit04]. The model was trained for 500 iterations with the min count
value set to 2. At the testing stage we average the vectors of all stemmed words
known to the model.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Our solution</title>
<p>For our solution we also take the mean vector over all the words and the cosine similarity
between the resulting vectors. We also trained our model on the "42 bin haber" corpus.</p>
      </sec>
      <sec id="sec-7-4">
        <title>Results on Turkish language</title>
        <sec id="sec-7-4-1">
<title>The results are presented in figure 5.</title>
          <p>As we can see, the results on the TPC are not very impressive, but the main feature of noise robustness can be noticed nevertheless.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
<p>The robust word vector model has demonstrated the ability to be indifferent to some levels
of noise. It performs better than the standard, widely used word2vec model at noise levels
from 0.06 for the Russian language and 0.14 for the English language up to at least 0.30. This
seems to be a practical level, but for future work we should try to improve our model
to produce better results with less noise or without noise at all. The difference in the
crossover level could be explained by the fact that Russian is a flective language with rich
morphology, which on the one hand is stable for most words and easy for the model to learn,
and on the other hand a disruption in the flexion could leave the lemmatizer, and
consequently the standard word2vec model, unable to produce a vector for the word.
English is a language with a strong analytic tendency, so its morphology is poor,
and the letters of a word are meaningful even at the end of the word. For the Turkish
language the critical noise level is as low as 0.05, which seems unreasonably low. A
possible explanation could be that for agglutinative languages the whole structure of
the word is important, but more likely the available corpus is not large enough for the proposed
approaches to demonstrate reasonable quality.
10 It is available from here: https://osf.io/wp83a/</p>
<p>For future work we are considering improving the architecture to achieve higher
scores at small noise levels and conducting more experiments on different architecture
variations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Kutuzov15. Kutuzov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Andreev</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>Texts in, meaning out: neural language models in semantic similarity task for Russian</article-title>
          .
          <source>Proceedings of the Dialog 2015 Conference</source>
          , Moscow, Russia.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Mikolov12. Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deoras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kombrink</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cernocky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Subword language modeling with neural networks. preprint (http://www</article-title>
          . fit. vutbr. cz/imikolov/rnnlm/char. pdf).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Mikolov13. Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>Advances in neural information processing systems.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Joulin16. Joulin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Armand</surname>
          </string-name>
          , et al.,
          <year>2016</year>
          .
          <article-title>Bag of Tricks for Efficient Text Classification</article-title>
          .
          <source>arXiv preprint arXiv:1607</source>
          .
          <fpage>01759</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Sakaguchi16. Sakaguchi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Post</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Van Durme</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network</article-title>
          .
          <source>arXiv preprint arXiv:1608</source>
          .
          <fpage>02214</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Hochreiter97. Hochreiter</surname>
          </string-name>
          , Sepp and Schmidhuber, Jurgen,
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>17351780</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Andryuschenko89. Andryuschenko</surname>
            ,
            <given-names>V.M.</given-names>
          </string-name>
          ,
          <year>1989</year>
          .
          <article-title>Konzepziya i arhitectura Mashinnogo fonda russkogo jazyka (The concept and design of the Computer Fund of Russian Language), Moskva: Nauka (in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Segalovich03. Segalovich</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <year>2003</year>
, June.
          <article-title>A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine</article-title>
          . In MLMTA (pp.
          <fpage>273</fpage>
          -
          <lpage>280</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Smith05. Smith</surname>
            ,
            <given-names>Noah A.</given-names>
          </string-name>
          , and Jason Eisner,
          <year>2005</year>
          .
          <article-title>Contrastive estimation: Training log-linear models on unlabeled data</article-title>
          .
          <source>Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics</source>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Lewis04. Lewis</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <year>2004</year>
          .
          <article-title>RCV1: A New Benchmark Collection for Text Categorization Research</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>5</volume>
          :
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Porter01. Porter</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <year>2001</year>
          .
          <article-title>Snowball: A language for stemming algorithms</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Milajevs14. Milajevs</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kartsaklis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sadrzadeh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Purver</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Evaluating Neural Word Representations in Tensor-Based Compositional Settings</article-title>
          ,
          <source>Proceedings of EMNLP</source>
          <year>2014</year>
          , Doha, Qatar, pp.
          <fpage>708</fpage>
          -
          <lpage>719</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          Cheng15. Cheng, J. and
          <string-name>
            <surname>Kartsaklis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning</article-title>
          ,
          <source>Proceedings of EMNLP</source>
          <year>2015</year>
          , Lisbon, Portugal, pp.
          <fpage>1531</fpage>
          -
          <lpage>1542</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Demir12. Demir</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Kahlout</surname>
            ,
            <given-names>I.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unal</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kaya</surname>
          </string-name>
          , H.,
          <year>2012</year>
          .
          <article-title>Turkish Paraphrase Corpus</article-title>
          . In LREC (pp.
          <fpage>4087</fpage>
          -
          <lpage>4091</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Yildirim03. Yildirim</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atik</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amasyali</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          ,
          <year>2003</year>
.
          <source>42 Bin Haber Veri Kumesi</source>
          , Yildiz Teknik Universitesi, Bilgisayar Muh. Bolumu.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Eryigit</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Adali</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <year>2004</year>
          .
          <article-title>An Affix Stripping Morphological Analyzer for Turkish</article-title>
          .
          <source>Proceedings of the IASTED International Conference on Artificial Intelligence and Applications</source>
          <year>2004</year>
          , Innsbruck, Austria.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Pronoza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yagunova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pronoza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction</article-title>
          .
          <source>In Information Retrieval</source>
          (pp.
          <fpage>146</fpage>
          -
          <lpage>157</lpage>
          ). Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Malykh</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Robust Word Vectors for Russian Language</article-title>
          .
          <source>In Proceedings of AINL FRUCT Conference</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Bojanowski</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          . arXiv:
          <volume>1607</volume>
          .
          <fpage>04606</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Rubenstein</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Goodenough</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1965</year>
          ).
          <article-title>Contextual correlates of synonymy</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>8</volume>
          (
          <issue>10</issue>
          ),
          <fpage>627</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Loukachevitch</surname>
            ,
            <given-names>N. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shevelev</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mozharova</surname>
            ,
            <given-names>V. A.</given-names>
          </string-name>
          ,
          <year>2017</year>
          .
          <article-title>Testing Features and Methods in Russian Paraphrasing Task</article-title>
          .
          <source>Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue"</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Polikarpov</surname>
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
          <article-title>Towards the Foundations of Menzerath's Law</article-title>
          . In: Grzybek P. (eds)
          <source>Contributions to the Science of Text and Language</source>
          .
          <source>Text, Speech and Language Technology</source>
          , vol.
          <volume>31</volume>
          . Springer, Dordrecht.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <year>2014</year>
          , October.
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          (Vol.
          <volume>14</volume>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          (pp.
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Sundermeyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlueter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ney</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>LSTM neural networks for language modeling</article-title>
          .
          <source>In Thirteenth Annual Conference of the International Speech Communication Association.</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brevdo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Citro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>TensorFlow: Large-scale machine learning on heterogeneous distributed systems</article-title>
          .
          <source>arXiv preprint arXiv:1603</source>
          .
          <fpage>04467</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>