<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word embedding in form of symmetric and skew-symmetric operator</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Koshchenko Ekaterina</string-name>
          <email>catherine.pths@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kuralenok Igor</string-name>
          <email>ikuralenok@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>JetBrains Research</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Research University, Higher School of Economics in Saint-Petersburg</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Existing word embedding models represent each word with two real-valued vectors: a central vector and a context vector. This is a consequence of the asymmetric nature of word relations, and it requires more time and data for training. We introduce a new approach to asymmetric relations that builds on the advantages of the Global Vectors (GloVe) model. By reducing the impact of asymmetric information on the resulting word representations, our model converges faster and outperforms existing models on word analogy tasks.</p>
        <p>Index Terms: SSDE, word embedding, matrix decomposition</p>
        <p>Understanding word relations in the context of natural language is an easy task for a human but not for a computer. We need to teach computers how words are related and what meanings they have, depending on the context. For a machine to process words, they have to be presented in a digitized format. This leads to the idea of real-valued vector representations, known as word embeddings.</p>
        <p>Most works on word embeddings focus on preserving two properties of words in their representations. The first property is that word relations and similarities can be described using distances and angles between word vectors. An example is the closer-further feature: “yellow” is closer to “red” than to “smart”. In vector form it can be presented as
          <disp-formula><tex-math>\lVert \mathrm{yellow} - \mathrm{red} \rVert &lt; \lVert \mathrm{yellow} - \mathrm{smart} \rVert .</tex-math></disp-formula>
        </p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. Introduction</title>
      <p>This property is widely used for synonym search. The second
property is word analogies. The corresponding feature was introduced by
Mikolov et al. [1] and is designed to learn word similarities. For example,
“Paris” and “France” have the same connection as “Budapest” and “Hungary”.
In vector form it can be presented as</p>
    </sec>
    <sec id="sec-2">
      <p><disp-formula><tex-math>\mathrm{France} - \mathrm{Paris} = \mathrm{Hungary} - \mathrm{Budapest}.</tex-math></disp-formula></p>
    </sec>
    <sec id="sec-3">
      <p>The analogy property benefits models that build meaning-based
word vectors, while the closer-further feature is more practical and
can be applied to clustering and classification tasks.</p>
      <p>Word embeddings were originally created for use in Natural Language
Processing tasks. For example, one of the feature extraction techniques used
for document indexing is latent semantic indexing [2]. Latent semantic
indexing is a precursor of word embeddings and embodies the same principles
and ideas. Another task is sentiment analysis. One solution to this problem is
the SentProp framework [3], which combines a label propagation method with
word embeddings to learn sentiment lexicons on domain-specific corpora.
Language models are another way to solve some Natural Language Processing
tasks. The current state-of-the-art solutions for language modeling are
ELMo [4] and BERT [5]. Each of these methods uses prebuilt word embeddings as
input data and can benefit from better embedding models. Therefore, creating
better embedding models is still a relevant task.</p>
      <p>Three word embedding models are the most popular and widely used.
Word2Vec is a local window-based method presented by Mikolov et al. [6]. It
preserves the word analogy feature by bringing closer the vectors of words
that appear in similar contexts. Another approach is GloVe [7], which is
trained on word-word co-occurrence counts. Its authors noticed that the
relation between two words can be understood by examining the ratio of their
co-occurrence probabilities with various probe words, which also yields the
word analogy feature. The third model, FastText [8], is focused on the
distances/angles property. FastText uses character n-grams to enrich word
vectors with subword information. This approach makes it possible to use
morphological information, and therefore to choose better vectors for rare
words and to learn something for out-of-vocabulary words.</p>
      <p>Word relations are often asymmetric. For example, “New York” is a
common combination of words that names a city in the USA, whereas “York New”
is a rare combination that does not mean anything specific. In all the models
mentioned above, word interaction is expressed in terms of the dot product of
word vectors, which leads to two vectors being generated for each word:
central and context. For that reason, twice as many parameters have to be
computed and, consequently, more time is required for learning. To solve this
problem, asymmetric relations between word representations can be used
instead of the dot product of central and context vectors.</p>
      <p>In this work, we propose a model based on Symmetric Skew-symmetric
Decomposition. We demonstrate that our method outperforms the GloVe approach
on the word analogy metrics used to evaluate GloVe.</p>
      <sec id="sec-3-1">
        <title>II. Related work</title>
        <p>Many word embedding models are known from the literature, but most
of them are based on three principal approaches: Word2Vec [6], GloVe [7] and
FastText [8]. All three models are widely used in language models and Natural
Language Processing applications.</p>
        <sec id="sec-3-1-1">
          <title>A. Word2Vec</title>
          <p>Word2Vec is an approach introduced by Mikolov et al. [6] that
preserves the word analogy property. It suggests two language models:
Skip-gram and CBOW. Both methods represent word relationships with the dot
product of word vectors. As described in the introduction, relations can be
asymmetric, which leads to the use of two vectors per word: central and
context. Skip-gram and CBOW scan the corpus with a sliding window. All words
inside the window are considered to be in the same context, i.e. connected to
each other. In both models all words inside one window get the same
co-occurrence weight. We call this type of window a “constant window”.</p>
          <p>Continuous Bag of Words (CBOW) is a model trained on the task
“predict the middle word given the surrounding context”. The method chooses
the central and context vectors of words so that the probability of
predicting the word in the middle of the sliding window from the rest of the
window is high. The second model, Skip-gram, is trained on the inverse
problem: predict the context from the single word in the center of the
sliding window.</p>
          <p>At each training step, for each word, both methods have to compute
the probability of the middle word of the window occurring in context with
every other word of the vocabulary. This makes the computational complexity
too high. In a later article [9] this problem was solved for the Skip-gram
model with Negative Sampling. Negative Sampling computes the probability of
the middle word being in the same context only for a constant number of
positive and negative samples. Positive samples are words that often appear
in one window with the middle word; they can be found before the training
process. Negative samples are words that are unlikely to appear in context
with the middle word. Mikolov et al. suggest drawing negative samples from
the unigram distribution raised to the 3/4 power. This approach accelerates
the Skip-gram computations while preserving the quality.</p>
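          <p>The smoothed sampling distribution can be illustrated with a minimal
sketch. It assumes that the unigram distribution is estimated from raw token
counts; the function names and the toy corpus are hypothetical and are not taken
from the Word2Vec implementation.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import Counter


def negative_sampling_distribution(tokens, power=0.75):
    """Return {word: probability} proportional to unigram count ** power."""
    counts = Counter(tokens)
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def draw_negative_samples(distribution, k, rng):
    """Draw k negative samples (with replacement) from the smoothed distribution."""
    words = list(distribution)
    probabilities = [distribution[word] for word in words]
    return rng.choices(words, weights=probabilities, k=k)


# Toy corpus (an assumption made for illustration only).
corpus = "the cat sat on the mat the end".split()
dist = negative_sampling_distribution(corpus)
samples = draw_negative_samples(dist, k=5, rng=random.Random(0))
```
]]></preformat>
          <p>Raising the counts to the 3/4 power lowers the share of very frequent
words: in the toy corpus “the” has a raw frequency of 3/8, and its smoothed
probability is smaller than that.</p>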
          <p>Experiments have shown that the Skip-gram method performs better on
semantic tasks, while the results of the two methods on syntactic tasks are
very similar. Since Skip-gram is easier to train than CBOW and gives the same
or even better results, later models use Skip-gram.</p>
          <p>Skip-gram and CBOW models have several drawbacks.
First, training time depends on the corpus size. Second,
there are two vectors generated for each word, which
requires more time and input data for training.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>B. GloVe</title>
          <p>The GloVe (Global Vectors) model, suggested by Pennington et al. [7],
also aims to preserve word analogies. The relationship between two words can
be learned by examining their relations with other words. In this approach
word relationships are represented with a co-occurrence matrix
<inline-formula><tex-math>X</tex-math></inline-formula>, where
<inline-formula><tex-math>x_{ij}</tex-math></inline-formula> is the number of
times word <inline-formula><tex-math>w_i</tex-math></inline-formula> occurred in
the context of word <inline-formula><tex-math>w_j</tex-math></inline-formula>.
This matrix is constructed before the training process with one scan of the
corpus. On each learning step we iterate through the co-occurrence matrix and,
for each non-zero co-occurrence
<inline-formula><tex-math>x_{ij}</tex-math></inline-formula>, update the central
and context vectors of the corresponding words according to the value and
direction of the gradient of the target function.</p>
          <p>In GloVe each word is represented with two vectors, as in Word2Vec.
A sliding window is also used to scan the corpus when the co-occurrence matrix
is constructed. Unlike the “constant” window of Word2Vec, GloVe uses a
“shrinking” window: the weight of a co-occurrence decreases as the distance
between the two words in the window increases. The authors did not explore how
the window type affects the results of experiments and did not explain this
choice.</p>
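          <p>A minimal sketch of co-occurrence counting with a shrinking window is
given below. It assumes the weighting described in the GloVe article [7], where a
pair of words that are d positions apart contributes 1/d to the count, and a
symmetric window. The function name and the toy sentence are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def cooccurrence_counts(tokens, window_size):
    """Accumulate symmetric co-occurrence weights; a pair d tokens apart adds 1/d."""
    counts = defaultdict(float)
    for position, word in enumerate(tokens):
        start = max(0, position - window_size)
        # Look only to the left and add both directions, so each pair is seen once.
        for context_position in range(start, position):
            distance = position - context_position
            context = tokens[context_position]
            weight = 1.0 / distance
            counts[(word, context)] += weight
            counts[(context, word)] += weight
    return counts


tokens = "new york is a big city".split()
X = cooccurrence_counts(tokens, window_size=2)
```
]]></preformat>
          <p>Replacing <monospace>1.0 / distance</monospace> with the constant
<monospace>1.0</monospace> gives the “constant” window, in which every pair
inside the window receives the same weight.</p>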
        </sec>
        <sec id="sec-3-1-3">
          <title>C. FastText</title>
          <p>The FastText model, in contrast to Word2Vec and GloVe, was built to
preserve the property that word relations are represented by distances and
angles between word vectors. This allows the model to perform better on text
classification tasks. Like the two previous methods, FastText generates
central and context vectors for each word and uses a sliding window to scan
the corpus.</p>
          <p>The main idea of this approach is to use character n-grams to build
central vectors. During vocabulary construction, each word is saved together
with its n-grams. For example, for the word “pencil” we also store the 3-grams
“&lt;pe”, “pen”, “enc”, “nci”, “cil” and “il&gt;” in addition to the whole word
sequence “&lt;pencil&gt;”, where “&lt;” and “&gt;” mark the beginning and the end
of the word. The 3-gram “pen” that comes from the word “pencil” is therefore
different from the word “&lt;pen&gt;”. During the training process each
sequence gets its own vector, and the resulting central vector is the sum of
all n-gram vectors and the whole-word vector.</p>
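          <p>The n-gram extraction can be sketched as follows. The sketch assumes
the boundary markers “&lt;” and “&gt;” used in the FastText article [8] and a
single n-gram length; the function names and the toy vectors are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def character_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def subword_units(word, n=3):
    """N-grams plus the whole wrapped word, which gets its own vector."""
    return character_ngrams(word, n) + ["<" + word + ">"]


def central_vector(word, vectors, n=3):
    """Sum the vectors of all subword units of the word that are present in `vectors`."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for unit in subword_units(word, n):
        if unit in vectors:
            total = [t + v for t, v in zip(total, vectors[unit])]
    return total


units = subword_units("pencil")
# Toy vectors for three of the units (values are assumptions for illustration).
toy_vectors = {"<pe": [1.0, 0.0], "il>": [0.0, 2.0], "<pencil>": [1.0, 1.0]}
pencil_vector = central_vector("pencil", toy_vectors)
```
]]></preformat>
          <p>Because of the boundary markers, the whole-word unit of “pen” is
“&lt;pen&gt;”, which never coincides with the inner 3-gram “pen” of “pencil”.</p>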
          <p>As mentioned above, FastText achieves strong results on text
classification tasks, but Word2Vec and GloVe outperform it on word analogy
tasks.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>III. The SSDE Model</title>
        <p>Word relations are asymmetric in nature, and for that reason all
three approaches above generate two vectors for each word. The question is how
to use these central and context vectors. In GloVe, for example, there are
several modes that determine what is used as the resulting vector. The default
mode is the sum of the central and context vectors. No intuition was given for
this choice, although our experiments have shown that the default mode indeed
performs best. It is possible that Word2Vec, GloVe and FastText use more
parameters than they really need, which means that more time and input data
are required for training. The subject of our research was to find out whether
the asymmetric information about words really needs to be included in the
resulting vector. To do that we introduce the Symmetric Skew-symmetric
Decomposition Embedding (SSDE). It is based on the GloVe model, mainly because
GloVe is faster than the other existing models and performs better on word
analogy metrics.</p>
        <sec id="sec-3-2-1">
          <title>A. GloVe model analysis</title>
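          <p>The main idea of the GloVe model is that the relation between words
<inline-formula><tex-math>w_i</tex-math></inline-formula> and
<inline-formula><tex-math>w_j</tex-math></inline-formula> can be found by
studying the ratio of their co-occurrence probabilities with various probe
words, <inline-formula><tex-math>P(w_i, w_k) / P(w_j, w_k)</tex-math></inline-formula>,
where <inline-formula><tex-math>w_k</tex-math></inline-formula> is a probe word.
The general model can therefore be written as</p>
          <disp-formula><label>(1)</label><tex-math>F\left((u_i - u_j)^T v_k\right) = \frac{P(w_i, w_k)}{P(w_j, w_k)}.</tex-math></disp-formula>
          <p>The authors argue that, because words and context words are
exchangeable, the function <inline-formula><tex-math>F</tex-math></inline-formula>
should be a homomorphism:</p>
          <disp-formula><label>(2)</label><tex-math>F\left((u_i - u_j)^T v_k\right) = \frac{F(u_i^T v_k)}{F(u_j^T v_k)}.</tex-math></disp-formula>
          <p>This formula suggests that
<inline-formula><tex-math>F</tex-math></inline-formula> is exponential, which in
combination with Eqn. (1) leads to</p>
          <disp-formula><label>(3)</label><tex-math>u_i^T v_k = \log P_{ik} = \log X_{ik} - \log X_i.</tex-math></disp-formula>
          <p>GloVe then introduces biases into the formula. The term
<inline-formula><tex-math>\log X_i</tex-math></inline-formula> does not depend on
the probe word <inline-formula><tex-math>k</tex-math></inline-formula> and is
replaced with a bias <inline-formula><tex-math>b_i^u</tex-math></inline-formula>.
To keep the symmetry under the exchange of word and context, a context bias
<inline-formula><tex-math>b_k^v</tex-math></inline-formula> is also included:</p>
          <disp-formula><label>(4)</label><tex-math>u_i^T v_k + b_i^u + b_k^v = \log X_{ik}.</tex-math></disp-formula>
          <p>In this equation, the right-hand side is the information that the
model has to learn, and the left-hand side is how GloVe preserves it. The
equation is fitted with a weighted least squares regression model. As a
result, the GloVe target function is</p>
          <disp-formula><label>(5)</label><tex-math>J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left(u_i^T v_j + b_i^u + b_j^v - \log X_{ij}\right)^2,</tex-math></disp-formula>
          <p>where <inline-formula><tex-math>X</tex-math></inline-formula> is the
co-occurrence matrix, <inline-formula><tex-math>|V|</tex-math></inline-formula>
is the vocabulary size, <inline-formula><tex-math>f</tex-math></inline-formula>
is the weighting function,
<inline-formula><tex-math>u_i</tex-math></inline-formula> and
<inline-formula><tex-math>b_i^u</tex-math></inline-formula> are the central
vector and bias for word <inline-formula><tex-math>w_i</tex-math></inline-formula>,
and <inline-formula><tex-math>v_j</tex-math></inline-formula> and
<inline-formula><tex-math>b_j^v</tex-math></inline-formula> are the context
vector and bias for word <inline-formula><tex-math>w_j</tex-math></inline-formula>.</p>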
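          <p>The GloVe target function can be evaluated with a short sketch. The
weighting function below, with a cutoff of 100 and an exponent of 3/4, follows
the form given in the GloVe article [7] and is not specified in this text, so it
should be read as an assumption. The sparse dictionary layout and all names are
hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: grows as (x / x_max) ** alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def dot(a, b):
    """Dot product of two equally long lists of floats."""
    return sum(p * q for p, q in zip(a, b))


def glove_loss(X, U, V, b_u, b_v):
    """Weighted least-squares objective over the non-zero entries of X.

    X maps (i, j) to a positive co-occurrence count; U and V hold the central
    and context vectors, b_u and b_v the corresponding biases.
    """
    total = 0.0
    for (i, j), x_ij in X.items():
        residual = dot(U[i], V[j]) + b_u[i] + b_v[j] - math.log(x_ij)
        total += weight(x_ij) * residual ** 2
    return total


# Toy data (assumptions for illustration only).
X = {(0, 1): 4.0, (1, 0): 2.0}
U = [[0.1, 0.2], [0.3, -0.1]]
V = [[0.0, 0.5], [0.2, 0.2]]
loss = glove_loss(X, U, V, b_u=[0.0, 0.0], b_v=[0.0, 0.0])
```
]]></preformat>
          <p>Only non-zero co-occurrences enter the sum, which is why the matrix is
stored sparsely in this sketch.</p>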
          <p>The introduction of the encoding and decoding biases is not justified
mathematically in the GloVe article, but our experiments have shown that the
model does not work without them. We explain this by the similarity of the
target function to the mutual information formula:</p>
          <disp-formula><label>(6)</label><tex-math>D_{KL} = \sum_{i,j} p(w_i, w_j) \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}.</tex-math></disp-formula>
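          <p>As an illustration, the mutual information formula can be computed
directly from co-occurrence counts. The sketch assumes that the joint
probabilities are the normalized counts and that the word probabilities are
their row and column marginals, which the text does not state explicitly. The
function name is hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def mutual_information(X):
    """Mutual information of the joint distribution given by co-occurrence counts X.

    X maps (i, j) to a positive count; zero entries are simply absent.
    """
    total = sum(X.values())
    p_row = {}
    p_col = {}
    for (i, j), count in X.items():
        p_row[i] = p_row.get(i, 0.0) + count / total
        p_col[j] = p_col.get(j, 0.0) + count / total
    result = 0.0
    for (i, j), count in X.items():
        p_ij = count / total
        result += p_ij * math.log(p_ij / (p_row[i] * p_col[j]))
    return result


independent = mutual_information({(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1})
dependent = mutual_information({(0, 0): 1, (1, 1): 1})
```
]]></preformat>
          <p>When every pair is equally likely the ratio inside the logarithm is 1
and the sum vanishes; when each word co-occurs with only one other word the
value for this two-word example is log 2.</p>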
          <p>We are actually looking for an embedding that preserves the ratio in
the logarithmic part of Eqn. (6). The ratio represents how much more often the
combination of words <inline-formula><tex-math>w_i</tex-math></inline-formula>
and <inline-formula><tex-math>w_j</tex-math></inline-formula> occurs in the
corpus than would be expected if each of them occurred independently. The
information that the model encodes is expressed through probabilities
conditioned on the model <inline-formula><tex-math>F</tex-math></inline-formula>:</p>
          <disp-formula><label>(7)</label><tex-math>I = \sum_{i,j} p(w_i, w_j) \log \frac{p(w_i, w_j \mid F)}{p(w_i \mid F)\, p(w_j \mid F)}.</tex-math></disp-formula>
          <p>Rewriting Eqn. (6) and Eqn. (7) in GloVe notation gives</p>
          <disp-formula><label>(8)</label><tex-math>\frac{p(w_i, w_j \mid F)}{p(w_i \mid F)\, p(w_j \mid F)} \Rightarrow e^{u_i^T v_j}, \qquad \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)} \Rightarrow \log X_{ij} - b_i^u - b_j^v,</tex-math></disp-formula>
          <p>and combining the result with the weighted least squares method
yields a function that is very similar to the GloVe target function:</p>
          <disp-formula><label>(9)</label><tex-math>J = \sum_{i,j} p(w_i, w_j) \left(u_i^T v_j + \log p(w_i) + \log p(w_j) - \log p(w_i, w_j)\right)^2.</tex-math></disp-formula>
          <p>The joint probability of words
<inline-formula><tex-math>w_i</tex-math></inline-formula> and
<inline-formula><tex-math>w_j</tex-math></inline-formula> corresponds to what the
GloVe model represents with the co-occurrence matrix entry
<inline-formula><tex-math>X_{ij}</tex-math></inline-formula>, and the prior
probabilities of words correspond to the biases
<inline-formula><tex-math>b_i^u</tex-math></inline-formula> and
<inline-formula><tex-math>b_j^v</tex-math></inline-formula>. In our experiments
we tried both variants and obtained similar results with biases and with
probabilities. For that reason, we use prior probabilities in SSDE to decrease
computational complexity.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>B. Our model</title>
          <p>From Eqn. (5) we see that GloVe represents word relations with the
dot product of their central and context vectors,
<inline-formula><tex-math>u^T v</tex-math></inline-formula>. Two separate vectors
are used to account for the asymmetry property that we want to remove. The dot
product of a central and a context vector is equal to the product of the
central and context matrices multiplied on both sides by the corresponding
one-hot encoding vectors. The product of the central and context matrices can
be considered a linear operator, and the matrix of any linear operator can be
decomposed into a sum of a symmetric and a skew-symmetric matrix [10]:</p>
          <disp-formula><tex-math>u_i^T v_j = h_i^T U^T V h_j,</tex-math></disp-formula>
          <disp-formula><tex-math>L = U^T V = S + K.</tex-math></disp-formula>
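          <p>The decomposition itself is the standard identity S = (L + L^T)/2,
K = (L - L^T)/2. A minimal sketch on a small dense matrix is shown below; the
function name and the example matrix are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def symmetric_skew_decomposition(L):
    """Split a square matrix L into S (symmetric) and K (skew-symmetric), L = S + K."""
    n = len(L)
    S = [[(L[i][j] + L[j][i]) / 2 for j in range(n)] for i in range(n)]
    K = [[(L[i][j] - L[j][i]) / 2 for j in range(n)] for i in range(n)]
    return S, K


L = [[1.0, 4.0], [2.0, 3.0]]
S, K = symmetric_skew_decomposition(L)
# S is [[1.0, 3.0], [3.0, 3.0]] and K is [[0.0, 1.0], [-1.0, 0.0]]
```
]]></preformat>
          <p>The symmetric part keeps what is shared by the pairs (i, j) and
(j, i), while the skew-symmetric part keeps only the difference between the two
directions.</p>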
          <p>
            After that symmetric matrix S (according to the
property of symmetric matrices) can be written as a product
of some low-rank matrix and its transpose. The same
transformation can be used for the skew-symmetric matrix
K with multiplying lower-diagonal part to 1.
lij = sij + kij = aiT aj + ij ciT cj ;
(
            <xref ref-type="bibr" rid="ref10">10</xref>
            )
          </p>
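          <p>The matrix <inline-formula><tex-math>A</tex-math></inline-formula> has
size <inline-formula><tex-math>|V| \times l</tex-math></inline-formula>, where
<inline-formula><tex-math>|V|</tex-math></inline-formula> is the vocabulary size
and <inline-formula><tex-math>l</tex-math></inline-formula> is the size of the
symmetric word representation. The matrix
<inline-formula><tex-math>C</tex-math></inline-formula> has size
<inline-formula><tex-math>|V| \times m</tex-math></inline-formula>, where
<inline-formula><tex-math>m</tex-math></inline-formula> is the size of the
asymmetric word representation. By balancing the symmetric and skew-symmetric
sizes, we control how the information is distributed. For example, to reduce
the influence of asymmetric information on the resulting word representation,
we make the constant <inline-formula><tex-math>m</tex-math></inline-formula> much
smaller than <inline-formula><tex-math>l</tex-math></inline-formula>.</p>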
          <p>In total, after rewriting the GloVe target function (5) with
Eqn. (10) and using the prior probabilities instead of biases, we get the
target function of the SSDE model:</p>
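          <disp-formula><label>(11)</label><tex-math>Q = \sum_{i,j=1}^{|V|} f(p_{ij}) \left(a_i^T a_j + \sigma_{ij}\, c_i^T c_j + \log p_i + \log p_j - \log p_{ij}\right)^2,</tex-math></disp-formula>
          <p>where <inline-formula><tex-math>p_{ij} = p(w_i, w_j)</tex-math></inline-formula>
and <inline-formula><tex-math>p_i = p(w_i)</tex-math></inline-formula> are
computed from the input corpus before the training process, and
<inline-formula><tex-math>\sigma_{ij} = -1</tex-math></inline-formula> if
<inline-formula><tex-math>i &gt; j</tex-math></inline-formula>, otherwise
<inline-formula><tex-math>\sigma_{ij} = 1</tex-math></inline-formula>.</p>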
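          <p>The SSDE target function can be sketched in the same way as the GloVe
one. The sketch assumes that the sign factor is -1 below the diagonal (i > j) and
+1 otherwise, and it uses a constant weighting function as a placeholder because
the form of f is not given here. All names and toy values are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def dot(a, b):
    """Dot product of two equally long lists of floats."""
    return sum(p * q for p, q in zip(a, b))


def sign(i, j):
    """Sign factor of the skew-symmetric term: -1 if i > j, +1 otherwise."""
    return -1.0 if i > j else 1.0


def ssde_score(A, C, i, j):
    """Model value a_i . a_j + sign(i, j) * c_i . c_j for the word pair (i, j)."""
    return dot(A[i], A[j]) + sign(i, j) * dot(C[i], C[j])


def ssde_loss(P, p, A, C, f=lambda x: 1.0):
    """Objective Q over the non-zero joint probabilities P[(i, j)].

    p[i] is the prior probability of word i; f is a placeholder weighting function.
    """
    total = 0.0
    for (i, j), p_ij in P.items():
        residual = (ssde_score(A, C, i, j)
                    + math.log(p[i]) + math.log(p[j]) - math.log(p_ij))
        total += f(p_ij) * residual ** 2
    return total


# Toy representations: symmetric size l = 2, asymmetric size m = 1.
A = [[1.0, 0.0], [0.5, 0.5]]
C = [[1.0], [2.0]]
forward = ssde_score(A, C, 0, 1)   # 0.5 + 2.0
backward = ssde_score(A, C, 1, 0)  # 0.5 - 2.0
```
]]></preformat>
          <p>The two scores differ only in the sign of the skew-symmetric term,
which is how a single pair of vectors per word can still distinguish the order
of the two words.</p>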
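          <p>On each training step we iterate through the word-word
co-occurrence matrix <inline-formula><tex-math>X</tex-math></inline-formula>.
Each co-occurrence <inline-formula><tex-math>x_{ij}</tex-math></inline-formula>
shows how many times word <inline-formula><tex-math>w_i</tex-math></inline-formula>
occurred in the context of word
<inline-formula><tex-math>w_j</tex-math></inline-formula>. We compute the
gradients for the symmetric and skew-symmetric vectors and update the vectors
according to these gradients.</p>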
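          <p>One update for a single pair of different words can be sketched as
follows. For brevity the sketch uses a plain stochastic gradient step with a
fixed learning rate rather than an adaptive method, and a constant weight; both
are simplifying assumptions, and all names are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def dot(a, b):
    """Dot product of two equally long lists of floats."""
    return sum(p * q for p, q in zip(a, b))


def sgd_step(A, C, i, j, p_ij, p_i, p_j, weight=1.0, lr=0.05):
    """One plain SGD update of a_i, a_j, c_i, c_j for a pair with i != j.

    Returns the residual computed before the update.
    """
    sigma = -1.0 if i > j else 1.0
    residual = (dot(A[i], A[j]) + sigma * dot(C[i], C[j])
                + math.log(p_i) + math.log(p_j) - math.log(p_ij))
    scale = 2.0 * weight * residual
    # All gradients are taken at the old parameter values before any update.
    grad_a_i = [scale * x for x in A[j]]
    grad_a_j = [scale * x for x in A[i]]
    grad_c_i = [scale * sigma * x for x in C[j]]
    grad_c_j = [scale * sigma * x for x in C[i]]
    A[i] = [x - lr * g for x, g in zip(A[i], grad_a_i)]
    A[j] = [x - lr * g for x, g in zip(A[j], grad_a_j)]
    C[i] = [x - lr * g for x, g in zip(C[i], grad_c_i)]
    C[j] = [x - lr * g for x, g in zip(C[j], grad_c_j)]
    return residual


# Toy parameters and probabilities (assumptions for illustration only).
A = [[0.5, 0.1], [0.2, 0.4]]
C = [[0.3], [0.1]]
first = sgd_step(A, C, 0, 1, p_ij=0.2, p_i=0.5, p_j=0.5)
second = sgd_step(A, C, 0, 1, p_ij=0.2, p_i=0.5, p_j=0.5)
```
]]></preformat>
          <p>Repeating the step on the same pair shrinks the residual, which is the
behaviour expected from a descent step with a sufficiently small learning
rate.</p>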
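          <p>The resulting word embeddings are the vectors
<inline-formula><tex-math>a_i</tex-math></inline-formula> of the matrix
<inline-formula><tex-math>A</tex-math></inline-formula>. Since we wanted to
remove the influence of asymmetric information on the resulting word
representations, the vectors
<inline-formula><tex-math>c_i</tex-math></inline-formula> are used only for
training. However, their properties are worth further study.</p>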
          <p>There are two ways to optimize function (11): 1) gradient descent
and 2) stochastic gradient descent. The advantage of gradient descent is that
it eventually converges to better results. Stochastic gradient descent,
however, has several variants that achieve reasonable results much faster than
gradient descent. Since we wanted to reduce training time, we followed GloVe
and used adaptive gradient descent. The GloVe authors also noticed that the
values change only slightly on each stochastic gradient iteration, which means
that the computations can be done in parallel.</p>
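          <p>The GloVe model shuffles the whole co-occurrence matrix on each step
of stochastic gradient descent:</p>
          <disp-formula><label>(12)</label><tex-math>\sum_{i=1}^{|V|} \sum_{j=1}^{|V|} f(X_{ij}) \left(u_i^T v_j + b_i^u + b_j^v - \log X_{ij}\right)^2 = \mathbb{E}_{i,j \sim U(X)}\, f(X_{ij}) \left(u_i^T v_j + b_i^u + b_j^v - \log X_{ij}\right)^2 .</tex-math></disp-formula>
          <p>In the SSDE model we shuffle only the rows of the co-occurrence
matrix:</p>
          <disp-formula><label>(13)</label><tex-math>\sum_{i=1}^{|V|} \sum_{j=1}^{|V|} f(p_{ij}) \left(a_i^T a_j + \sigma_{ij}\, c_i^T c_j + \log p_i + \log p_j - \log p_{ij}\right)^2 = \mathbb{E}_{i} \sum_{j=1}^{|V|} f(p_{ij}) \left(a_i^T a_j + \sigma_{ij}\, c_i^T c_j + \log p_i + \log p_j - \log p_{ij}\right)^2 .</tex-math></disp-formula>
          <p>Shuffling rows without shuffling columns makes the computations
cache-friendly by reducing the cache-miss rate. This change allowed us to
improve the performance of the model while the quality remained the same.</p>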
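          <p>The row-wise iteration order can be sketched as follows. The sketch
assumes that the matrix is stored row by row as lists of non-zero entries; this
storage layout and the names are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random


def row_shuffled_order(rows, rng):
    """Yield (i, j, value) with rows visited in random order, columns in stored order.

    rows[i] is a list of (j, value) pairs for the non-zero entries of row i.
    """
    row_indices = list(range(len(rows)))
    rng.shuffle(row_indices)
    for i in row_indices:
        # Entries of one row are read consecutively, so they stay close in memory.
        for j, value in rows[i]:
            yield i, j, value


rows = [[(1, 0.2), (2, 0.1)], [(0, 0.2)], [(0, 0.1), (1, 0.3)]]
visited = list(row_shuffled_order(rows, random.Random(0)))
```
]]></preformat>
          <p>Every non-zero entry is still visited exactly once per pass; only the
order of the rows is randomized, so the vector of word i is reused for the whole
row before another row is loaded.</p>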
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>IV. Experiments</title>
        <sec id="sec-3-3-1">
          <title>A. Evaluation</title>
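          <p>To compare SSDE with GloVe we used the metrics suggested in the
GloVe article [7]. All the metrics are based on the word analogy property. Each
test consists of four words
<inline-formula><tex-math>w_1, w_2, w_3, w_4</tex-math></inline-formula> that
belong to one topic and can be described as
“<inline-formula><tex-math>w_1</tex-math></inline-formula> is related to
<inline-formula><tex-math>w_2</tex-math></inline-formula> the same way
<inline-formula><tex-math>w_3</tex-math></inline-formula> is related to
<inline-formula><tex-math>w_4</tex-math></inline-formula>”. In terms of vectors
this can be presented as</p>
          <disp-formula><tex-math>w_2 - w_1 = w_4 - w_3 .</tex-math></disp-formula>
          <p>By the rules of arithmetic this can be rewritten as</p>
          <disp-formula><label>(*)</label><tex-math>w_2 - w_1 + w_3 = w_4 .</tex-math></disp-formula>
          <p>The testing algorithm is as follows: 1) take the first three input
words and compute the left-hand side of (*); 2) among all vectors of the
vocabulary, find the vector
<inline-formula><tex-math>v</tex-math></inline-formula> closest to the result of
the previous step (using cosine similarity); 3) if the word corresponding to
<inline-formula><tex-math>v</tex-math></inline-formula> is equal to
<inline-formula><tex-math>w_4</tex-math></inline-formula>, the test is
successful, otherwise it fails.</p>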
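          <p>The three steps of the testing algorithm can be sketched as follows.
The sketch skips the three query words when searching for the closest vector,
which is a common convention that is not stated here and is therefore an
assumption. The function names and the toy vectors are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def solve_analogy(vectors, w1, w2, w3):
    """Return the word whose vector is closest to w2 - w1 + w3.

    The three query words are excluded from the candidates (an assumption).
    """
    target = [b - a + c for a, b, c in zip(vectors[w1], vectors[w2], vectors[w3])]
    candidates = [w for w in vectors if w not in (w1, w2, w3)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (w1, w2, w3, w4) questions answered with exactly w4."""
    correct = sum(solve_analogy(vectors, w1, w2, w3) == w4
                  for w1, w2, w3, w4 in questions)
    return correct / len(questions)


# Toy three-dimensional vectors (assumptions for illustration only).
vectors = {
    "man": [1.0, 0.0, 0.1],
    "king": [1.0, 1.0, 0.1],
    "woman": [0.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, -1.0, 0.2],
}
answer = solve_analogy(vectors, "man", "king", "woman")
accuracy = analogy_accuracy(vectors, [("man", "king", "woman", "queen")])
```
]]></preformat>
          <p>The reported score of a metric is then the accuracy over all questions
of the corresponding analogy set.</p>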
          <p>We do not provide a comparison with the CBOW or Skip-gram models but,
as shown in the article [7], GloVe performs better than these baselines.</p>
          <p>Tab. I shows all metrics that were used to evaluate both the GloVe
and SSDE models. Five of these metrics are semantic in nature, for example,</p>
          <disp-formula><tex-math>\text{King} - \text{Man} + \text{Woman} = \text{Queen},</tex-math></disp-formula>
          <p>while the other nine are syntactic, for example,</p>
          <disp-formula><tex-math>\text{Dangerous} - \text{Danger} + \text{Beauty} = \text{Beautiful}.</tex-math></disp-formula>
        </sec>
        <sec id="sec-3-3-2">
          <title>B. Results</title>
          <p>We compared the GloVe and SSDE models on a corpus composed of 100 MB
of articles from English Wikipedia. For corpus scanning we used a symmetric
shrinking window of size 30. All models were trained until convergence. The
study of constant and asymmetric windows is left for future work.</p>
          <p>Tab. II shows the performance of the GloVe and SSDE models trained
with an equal number of parameters. Our approach significantly improves the
scores on both semantic and syntactic tasks.</p>
          <p>Tab. III shows the results of the GloVe and SSDE models with equal
sizes of word embedding vectors. As mentioned above, the GloVe model uses the
sum of the central and context vectors as the resulting representation, while
the SSDE model uses only the symmetric vector. With the same representation
size as GloVe, the SSDE model obtains similar or even higher scores with
almost half the number of parameters.</p>
          <p>All results were obtained on an Intel Core i7 processor with 8 GB of
DDR4 memory.</p>
          <p>We demonstrated that our approach outperforms the GloVe model on
word analogy metrics while computing half as many parameters. This supports
our initial assumption that the influence of asymmetric information on word
embeddings can be significantly reduced, which decreases the time required to
train the model.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>V. Conclusion</title>
        <sec id="sec-3-4-1">
          <title>A. Achievements</title>
          <p>In this paper, we studied whether asymmetric information about word
relationships is necessary for word embeddings. We showed that it is possible
to train high-quality word vectors using only a little information on the
asymmetry of relations, compared with GloVe, the popular word embedding model
with the highest scores on word analogy tasks. Since our approach computes
half as many parameters, it requires less time to train the model.</p>
          <p>We analyzed the GloVe model and introduced a new model, SSDE, that
combines the advantages of GloVe with our ideas on asymmetric relations. The
comparison of SSDE with GloVe has shown that our model outperforms GloVe on
word analogy metrics, while GloVe, according to the article [7], outperforms
the CBOW and Skip-gram models.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>B. Future work</title>
          <p>The SSDE model, like GloVe and Word2Vec, uses a sliding window to
scan the corpus. We assume that, depending on the type of window used, results
may differ for metrics of different types. Constant windows might perform
better on synonym search tasks, while the shrinking window could be a good
choice for word analogy tasks. In future work, we will examine the influence
of the window type on different types of metrics.</p>
          <p>Currently, we use only the vectors with symmetric information as the
resulting word embeddings. However, there might be some interesting
information encoded in the asymmetric vectors. For example, L1-regularization
turns most of the skew-symmetric vectors to zero. There might be some
connection between the words whose skew-symmetric vectors are not zero. In
future work, we will study the asymmetric component of SSDE and analyze
whether it contains any pattern that might increase performance on some
tasks.</p>
          <p>The influence of window size and symmetry on model performance is
another aspect that was not examined. The importance of asymmetric information
might increase for highly asymmetric windows.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , W.-t. Yih, and G. Zweig, “
          <article-title>Linguistic regularities in continuous space word representations</article-title>
          .”
          <source>in HLT-NAACL</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          , “
          <article-title>Machine learning in automated text categorization</article-title>
          ,”
          <source>ACM Computing Surveys</source>
          , vol.
          <volume>34</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          ,
          <year>2002</year>
          . [Online]. Available: http://nmis.isti.cnr.it/ sebastiani/Publications/ACMCS02.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          , “
          <article-title>Inducing domain-specific sentiment lexicons from unlabeled corpora</article-title>
          ,”
          <source>in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>595</fpage>
          -
          <lpage>605</lpage>
          . [Online]. Available: http://aclweb.org/anthology/D16-1057
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , “
          <article-title>Deep contextualized word representations</article-title>
          ,”
          <source>in Proc. of NAACL</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , “
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,” arXiv preprint arXiv:1810.04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          , “
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,”
          <source>CoRR</source>
          , vol. abs/1301.3781
          ,
          <year>2013</year>
          . [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , “
          <article-title>GloVe: Global vectors for word representation</article-title>
          ,”
          <source>in Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . [Online]. Available: http://www.aclweb.org/anthology/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , “
          <article-title>Enriching word vectors with subword information</article-title>
          ,”
          <source>Transactions of the Association for Computational Linguistics</source>
          , vol.
          <volume>5</volume>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          , “
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,”
          <source>in Advances in Neural Information Processing Systems 26</source>
          , C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger
          , Eds. Curran Associates, Inc.,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gantmacher</surname>
          </string-name>
          ,
          <source>The Theory of Matrices</source>
          , vol. 1. Chelsea Pub. Co.,
          <year>1960</year>
          . [Online]. Available: https://books.google.ru/books?id=GOdQAAAAMAAJ
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>