<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context and Embeddings in Language Modelling - an Exploration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Nitsche</surname>
          </string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>
            <given-names>Marina</given-names>
            <surname>Tropmann-Frick</surname>
          </string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hamburg University of Applied Sciences, Department of Computer Science</institution>
          ,
          <addr-line>Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>131</fpage>
      <lpage>138</lpage>
      <abstract>
        <p>Embeddings are a natural way to map text to a latent space, commonly consumed in downstream language tasks, e.g., question-answering, named-entity recognition or neural machine translation. Embeddings typically capture syntactic relations between parts of a sequence and solve semantic problems connected with word-sense disambiguation (WSD) well. Because of WSD, the curse of dimensionality, out-of-vocabulary words, overfitting to a domain and missing real-world knowledge, inferring meaning without context is hard. Thus we require two things. First, we need techniques to actively overcome syntactic problems dealing with WSD and semantically correlating words/sentences. Second, we require context to reconstruct the intentions and settings of a given text such that it can be understood. This work explores different embedding models, data augmentation techniques and context selection strategies (subsampling on the input space) for real world language problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Many NLP applications start with preprocessing
pipelines involving stemming, stopword removal, special
character extraction and tokenization. When the
morphological treatment of text is done, the most important
step is the representation of text: its projection. Language is
a high dimensional and multi-sense problem domain
dealing with polysemy, synonymy, antonymy and
hyponymy. Therefore, we often need to reduce the
dimensions of the problem domain, projecting it to a latent
space. Classical models project words using WordNet
mapping each word to a relation, employ methods from
linear algebra like Singular Value Decomposition (SVD)
and most famously Latent Semantic Indexing (LSI) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
More complicated statistical models involve expectation
maximization procedures for which Latent Dirichlet
Allocation (LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is the standard. Word and
subword-level embeddings try to overcome some of the
limitations of the former methods using neural networks
posing language models as an optimization problem.
Word2Vec by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was the first successful model that
superseded the quality of preceding methods.
Embeddings map words, sentences, characters or parts of
words to a non-linear latent space in ℝ^d, where d stands
for the number of dimensions the embedding has.
Projects like fastText, spaCy, StarSpace, GloVe and the
Word2Vec Google News embeddings offer pre-trained
language models built on vast amounts of data. There are
multiple ways to choose a context for embeddings: by a
window of size c around a center word, by the dependency
tree around a word or by representing words as
probability distributions and discarding unlikely words. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
generalize context embeddings to models of the
exponential family (ef-emb). [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] enhance ef-emb by creating a
very complex selection procedure based on an
amortization network and variational inference to drop
unimportant items from the context with an indicator
vector. In theory, context selection works with two
functions. The first is a function for selecting what a viable
context is, e.g., c = f(w, V), where w is the target item/
center word and V the set of all items/the vocabulary. The second
subsamples on the target and context, q(w, c). The origins
of neural language models go back to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], proposing a
shallow single-layer neural network with a softmax layer.
The neural language model computes a conditional
probability distribution over words – producing
embeddings based on the n preceding words, represented as
vectors of dimension d, shared across the entire
network in the respective context vectors. The most
basic language model computes the conditional
probability of a word w_t given the preceding words
using the chain rule. When the vocabulary grows large,
the normalization term in the denominator of the
softmax becomes more difficult to handle. The model in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
is intractable and could not be successfully built. The
first model that successfully beat state-of-the-art
language models was Word2Vec by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Later we will review
word and subword-level embeddings.
      </p>
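      <p>To make the two functions above concrete, the following minimal Python sketch (the corpus, the window size and the subsampling threshold are illustrative assumptions, not the authors' code) selects a window context around a center word and subsamples frequent targets:</p>
      <preformat>
import random
from collections import Counter

def window_context(tokens, i, window=2):
    """Context selection c = f(w, V): the words inside a symmetric window around position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def keep_target(word, counts, total, t=0.5):
    """Subsampling q(w, c): a simplified rule that keeps frequent targets with lower probability."""
    freq = counts[word] / total
    return random.random() &lt; min(1.0, (t / freq) ** 0.5)

tokens = "the quick brown fox jumps over the lazy dog".split()
counts, total = Counter(tokens), len(tokens)
# a large threshold t is used here so the tiny toy corpus keeps most words
pairs = [(w, window_context(tokens, i))
         for i, w in enumerate(tokens) if keep_target(w, counts, total)]
print(pairs[:3])
</preformat>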
    </sec>
    <sec id="sec-2">
      <title>2 Word Embeddings</title>
      <p>At first we briefly review word-level embeddings.
Corpora typically consist of words that are part of sentences
in documents. Before training, each sentence is
tokenized and morphologically altered with stemming or
lemmatization. Classical models use the bag of words
model, so words are represented as a co-occurrence
feature matrix. We start with Word2Vec, since almost
every model leveraging embeddings in language takes it
as a point of reference.</p>
      <sec id="sec-2-1">
        <title>2.1 Word2Vec</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] improved on several aspects of Bengio's model by
using the skip-gram window function (an alternative
would be CBOW) and a tractable approximation of the
softmax called negative sampling/hierarchical softmax.
Word2Vec has become the de facto standard in a lot of
language downstream tasks. Google shipped pre-trained
Word2Vec skip-gram models on Google News articles
for everybody to use. The corpus is large (up to a billion
words) and the dimensionality of the latent space is large,
d = 300. Training would take weeks up to months
on just a few state-of-the-art GPUs, so the release saves each
researcher the time to train such models themselves. We will see
a great influx of pre-trained language models in the
future, because OOV words are a real issue and
generalization on small, sparse domains is highly
problematic. While most of the premises of pre-trained
models are great, they also introduce biases. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] have
shown that this particular dataset employs gender biases.
Skip-gram predicts the context of a center word w_t over
a window c such that w_{t-c}, …, w_t, …, w_{t+c} is covered. The
objective is to maximize the average log probability
        </p>
        <p>(1/T) ∑_{t=1}^{T} ∑_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t).</p>
        <sec id="sec-2-1-1">
          <title>CBOW does the opposite, given a word context most likely.</title>
          <p>− , … ,   , … ,  +</p>
          <p>Negative
predict the center word   that is
sampling
speeds
up
the
performance by using the positive samples of the context
words 2 ∗  and uses only a few negative samples that
are not in its context. The respective objective cost
function is
where  is the sigmoid function, a binary function,
drawing  samples from the negative or noise
distribution   ( ), to distinguish the negative draws   
the target word   drawn from the context of  
.</p>
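        <p>A minimal numpy sketch of this negative-sampling loss for one (target, context) pair; the vocabulary size, vector dimensions and the uniform noise distribution are stand-ins, not the original training setup:</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                       # vocabulary size, dimensions, negative samples
W_in = rng.normal(scale=0.1, size=(V, d))   # target (input) vectors
W_out = rng.normal(scale=0.1, size=(V, d))  # context (output) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target, context, noise_dist):
    """-log σ(v'_c · v_t) - Σ_k log σ(-v'_n · v_t), negatives drawn from the noise distribution."""
    v_t = W_in[target]
    pos = np.log(sigmoid(W_out[context] @ v_t))
    negatives = rng.choice(V, size=k, p=noise_dist)
    neg = np.log(sigmoid(-(W_out[negatives] @ v_t))).sum()
    return -(pos + neg)

unigram = np.full(V, 1.0 / V)               # placeholder for the smoothed unigram distribution
print(negative_sampling_loss(target=3, context=17, noise_dist=unigram))
</preformat>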
          <p>
            The objective of negative sampling is to learn high
quality word embeddings by comparing noise (out of
context) to words from the context. Another language
model building upon Word2Vec is Global Vectors for
Word Representation (GloVe) by [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], which is trained
on an aggregated global word co-occurrence matrix from
a corpus. The difference is that global
statistics are taken into account, contrary to Word2Vec,
which works on local context windows alone. GloVe
typically performs better than Word2Vec skip-gram,
especially when the vocabulary is large. GloVe is also
available pre-trained on different corpora such as Twitter,
Common Crawl or Wikipedia.
          </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Bag of Tricks - fastText</title>
        <p>
          Another interesting and popular word embedding model
is fastText by [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. It is based on a similar idea as Word2Vec:
instead of negative sampling it uses the hierarchical
softmax, and instead of words it uses n-gram features.
N-grams build on bag of words, commonly known as a
co-occurrence matrix D × V where the documents D are the
rows and the whole vocabulary V forms the features, assuming
i.i.d. word order. Given a sequence of words [w_1, …, w_T],
n-grams take slices of length n, e.g., [[w_1, …, w_n], …,
[w_{T-n+1}, …, w_T]]. fastText comes in two flavours:
character-level and word-level n-grams. We will review
the character-level n-grams later.
        </p>
        <p>
          The unsupervised learning task is the hierarchical
softmax with CBOW. The corresponding cost function has
the following form:
−(1/N) ∑_{n=1}^{N} y_n log(f(B A x_n)),
where f is the hierarchical softmax function, x_n is a
document with bag-of-n-gram feature vectors, A and B
are weight matrices and y_n is the label given a
classification task. The label y in this unsupervised case is the
word. As can be seen, instead of finding the surrounding
context of a word, we try to find the most probable word given
the context. What is novel about this approach is using
n-gram features instead of windows, speeding up
training while still matching state-of-the-art results.
fastText training time on a sentiment analysis task was 10
seconds, compared to 2-3 hours for the shortest-running
competing model and up to several days for others. As we
will see later, this model can be largely improved with the
character n-grams proposed in
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
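        <p>A small sketch of word-level n-gram features hashed into a bag-of-features vector in the spirit of the fastText classifier; the hash bucket size and the toy document are illustrative assumptions:</p>
        <preformat>
import numpy as np

def word_ngrams(tokens, n=2):
    """Slices of n consecutive words, e.g. bigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_features(tokens, buckets=2**10):
    """Unigrams plus bigrams hashed into a fixed-size count vector x_n."""
    x = np.zeros(buckets)
    for feat in tokens + word_ngrams(tokens, 2):
        x[hash(feat) % buckets] += 1.0
    return x

doc = "the movie was surprisingly good".split()
x = bag_of_features(doc)
print(x.sum(), np.flatnonzero(x)[:5])
</preformat>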
      </sec>
      <sec id="sec-2-2a">
        <title>2.3 CoVe</title>
        <p>
          So far we have investigated shallow neural networks
with single layers and therefore only one non-linearity.
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] have found that training an attentional
sequence-to-sequence model normally used for neural
machine translation helps at enriching word vectors beyond
the word-level hierarchy. By training a two-layer,
bidirectional long short-term memory [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], on a source
language (English) to a target language (German), they achieve
state-of-the-art performance. All sequences of words w_x
are pre-initialized with GloVe(w_x), so that words become
sequences of vectors, where w_x is a sentence in the source
language and w_z a sentence in the target language, maximizing
the likelihood of an encoder MT-LSTM h and a decoder LSTM h^dec.
        </p>
        <p>The softmax attention α over the decoder states
represents the relevance of each step from the encoder h.
The decoder hidden state is then formed by concatenating the
attention-weighted encoder summary with the decoder state,
possibly to attend to the relevant parts
while not forgetting what was learned during
decoding. Intuitively, we are training a machine
translation model where the only interesting part is the
learned context vectors for sequences of the MT-LSTM.
It was shown that the model performs better, when
concatenating GloVe and CoVe into one single vector.
The idea behind this is that we can transfer the higher
level features learned in sequence-to-sequence tasks to
standard downstream tasks like classification. By first
using GloVe on the word-level and then the MT-LSTM,
we are creating layers of abstractions. Essentially this is
a first step towards transfer learning, which is standard
practice in computer vision tasks with pre-trained CNNs.
The top achiever is a model called Char + CoVe-L,
a large CoVe model concatenated with an n-gram
character features model.</p>
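        <p>Conceptually, the downstream input is just the per-token concatenation of GloVe and CoVe vectors; a minimal sketch with stand-in vectors, where the encoder call is a placeholder rather than the authors' MT-LSTM:</p>
        <preformat>
import numpy as np

def cove_inputs(glove_vectors, encoder):
    """Concatenate pre-trained word vectors with context vectors from a seq2seq encoder."""
    context_vectors = encoder(glove_vectors)          # shape (T, d_enc)
    return np.concatenate([glove_vectors, context_vectors], axis=1)

T, d = 6, 300
glove = np.random.randn(T, d)
fake_encoder = lambda x: np.tanh(x @ np.random.randn(d, 2 * d))  # stand-in for the MT-LSTM
print(cove_inputs(glove, fake_encoder).shape)         # (6, 900)
</preformat>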
      </sec>
      <sec id="sec-2-3">
        <title>2.4 Bias and Critique</title>
        <p>The present is a time producing a lot of different models
based on experimentation and educated guesses. It is
usually left to the reader to find explanations of what
embeddings capture in language. What does a word-level
embedding like Word2Vec actually represent? While
there is still a lot of ground to cover, recent papers focus
a little more on the whys instead of the hows. Before
going into details about subword embeddings and
selection procedures, let us discuss some of the problems,
challenges and critiques, gaining a little more insight into
why embeddings actually work. Most of the
state-of-the-art models evaluate word embeddings with intrinsic
evaluations. Intrinsic evaluation is usually qualitative:
given a set of semantic word analogy pairs, test whether the
model connects them correctly, e.g., king⃗ − man⃗ ≈
queen⃗ − woman⃗. The woman/queen vs. man/king analogy is the
most famous of all examples. One could deduce that,
given a large number of such analogy word pairs, testing
the presence of synonymy, polysemy and word
positioning is sufficient. Intrinsic evaluation shows
exactly what works, not what does not work or even what
works but should not. Extrinsically it is not possible to
use labels testing the precision and recall of our system.
And it is easy to see why: what should a
general approximation of a word look like?
Should it be able to learn every possible dimension and
therefore interpretation of what we perceive of it? If so,
how should it learn to distinguish different domains with
a different context? The context of a domain is never
explained or given to the models.</p>
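        <p>The analogy test itself is plain vector arithmetic followed by a nearest-neighbour lookup under cosine similarity; a minimal sketch over a hypothetical (randomly initialized) embedding dictionary:</p>
        <preformat>
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def analogy(emb, a, b, c):
    """Return the word whose vector is closest to emb[b] - emb[a] + emb[c]."""
    query = emb[b] - emb[a] + emb[c]
    candidates = [(w, cosine(query, v)) for w, v in emb.items() if w not in (a, b, c)]
    return max(candidates, key=lambda t: t[1])[0]

rng = np.random.default_rng(1)
emb = {w: rng.normal(size=50) for w in ["man", "king", "woman", "queen", "apple"]}
print(analogy(emb, "man", "king", "woman"))  # with real embeddings this should be "queen"
</preformat>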
        <p>
          Given a reasonable amount of test cases, quality can be
ensured to some extent. How good or bad they actually
perform is usually tested in downstream language tasks.
If the embeddings perform better on that specific task
compared to a preceding model, it is declared
state-of-the-art. Interestingly, [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] show that even state-of-the-art
embeddings display a large amount of bias towards
certain topics: man⃗ − computer programmer⃗ ≈ woman⃗ −
homemaker⃗.
        </p>
        <p>
          Training real language models on real data yields real
bias. The world and its written words are not fair and they
incorporate really narrow views and concepts. Gender
inequality and racism are two of the most challenging
societal problems in the 21st century. Learning
embeddings always yields a representation of the input.
The bias is statistically significant. The problem is more
obvious when considering that the standard Word2Vec
model trained on the Google News corpus is applied to
thousands of downstream language tasks. These kinds of
biases are not unique to language modelling and can be
found in computer vision as well. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] hint that there are
three forms of bias: occupational stereotypes, analogies
with stereotypes and indirect gender bias. They also
acknowledge that not everything we perceive as bias
should be seen as such, e.g. football and footballer being
male dominated may have other reasons than just bias. To debias
embeddings the answer is quite clear: we need additional
knowledge in the form of gender-specific word lists. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
suggest creating a reference set of word vectors
for gender-biased words.
        </p>
        <p>
          While this works for direct bias, it is much harder for
indirect bias, which spreads across different latent dimensions.
Therefore, a debiasing algorithm is suggested with two
steps: 1.) identify the gender subspace and 2.) equalize
(factor out gender) or soften (reduce magnitude). What
do these models learn? [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] have found that Word2Vec
with skip-gram and negative sampling implicitly factorizes a (shifted) PMI matrix.
A (P)PMI matrix (extra P for keeping only positive
entries) is a high dimensional and sparse context matrix,
where each row is a word  from the vocabulary  and
each column represents a context  , where it occurs.
PPMI matrices are theoretically well known and provide
a guiding hand for what Word2Vec actually learns. The
problem of PPMI matrices is actually that you need to
carefully consider each context for each occurring word,
which does not scale up to billions of tokens. The results
actually show that Word2Vec skip-gram with negative
sampling is still the better choice from a view of
precision and scalability. For further exploration of the
theoretical aspects of word embeddings, in particular an
explanation of the additivity of vectors, see [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]; for a
geometric interpretation of Word2Vec skip-gram with
negative sampling see [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
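        <p>A compact sketch of building a PPMI matrix from a word-by-context co-occurrence count matrix; the tiny counts are purely illustrative:</p>
        <preformat>
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information over a word-by-context count matrix."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0        # zero out cells with no co-occurrence
    return np.maximum(pmi, 0.0)         # keep only positive entries

counts = np.array([[4., 1., 0.],
                   [1., 3., 1.],
                   [0., 1., 5.]])
print(ppmi(counts).round(2))
</preformat>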
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Subword Embeddings</title>
      <p>
        Subword embeddings deal with words by slicing them
into smaller proportions. This is advantageous due to the
fact that single words and their corresponding vectors
only match by symbolic comparison. Thus, there are
advantages of representing words as vectors of sub-level
symbolic representations, which first largely appeared in
neural machine translation. The representations range
from character CNNs/LSTMs [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to character n-grams
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. These models typically handle
out-of-vocabulary words much better than their corresponding word
embeddings. While subword-level embeddings deal
better with OOV and relatedness than words, there are
dedicated strategies for OOV handling beyond subword
embeddings.
      </p>
      <sec id="sec-3-1">
        <title>3.1 Out-of-vocabulary words</title>
        <p>
          Out-of-vocabulary (OOV) words are a problem in two
circumstances. The first is when the amount of OOV
words is large; the second is when the dataset is small and deals
with niche words, where every word carries substantial weight.
Words that do not match any given word vector are
mapped to the UNK token. There are several strategies
on dealing with OOV words ranging from using the
context words around OOV words [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], using pre-trained
language models to assign their vector to OOV words
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] or retrain character-level language models on
pretrained models [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] found a few tricks to improve
on Word2Vec with their proposed model Nonce2Vec.
They use pre-trained word embeddings from Word2Vec
and treat OOV words as the sum of their context words.
They show that this is applicable on smaller datasets as
well. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] found it effective to use vectors of pre-trained
language models, where a word was OOV in their
domain. Using the pre-trained vector of a different
domain helped them in improving the initialization of
their OOV words in comparison to assign a global UNK
token to their data points. They improved models on
reading comprehension considerably especially with
OOV words. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] have shown that generating OOV word
embeddings by training a character-level model on a
pretrained dataset. The goal is to re-create the vectors by
leveraging character information. With a character-level
vector word representation OOV words can be handled
based on the sum of character vectors. They have found
that this is much better in cases where the dataset is small
and pre-trained embeddings are available.
        </p>
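        <p>In the spirit of the context-sum strategy of [<xref ref-type="bibr" rid="ref20">20</xref>], a minimal sketch that initializes an OOV vector from the pre-trained vectors of its surrounding words; the toy sentence and embeddings are assumptions for illustration:</p>
        <preformat>
import numpy as np

def oov_from_context(oov_word, sentence, embeddings, window=5):
    """Initialize an OOV vector as the mean of the pre-trained vectors around it."""
    i = sentence.index(oov_word)
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    vectors = [embeddings[w] for w in context if w in embeddings]
    dim = next(iter(embeddings.values())).shape
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

rng = np.random.default_rng(2)
embeddings = {w: rng.normal(size=50) for w in ["the", "plays", "a", "melody"]}
sentence = ["the", "theremin", "plays", "a", "melody"]
embeddings["theremin"] = oov_from_context("theremin", sentence, embeddings)
print(embeddings["theremin"].shape)
</preformat>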
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Character-level</title>
        <p>
          Character-level embedding models typically build on
pre-trained word embeddings. Additionally, character-based
representations of words are themselves either vectors for each
character of a word or vector representations of the
n-grams of a word. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] explore different architectures for
language modelling and compare three different models
with differing inputs to language models. The three
setups, see Figure 1, use an LSTM for the language
model and either words as
input and softmax as output, single characters with a
CNN as input and output, or a character CNN as input
with a softmax output. In the following we will explore
different character-level models. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] presents a model
with a character-level convolutional neural network
(CNN) with a highway network over characters.
Characters are used as an input to a single layer CNN with
maxpooling, using a highway network, introduced in [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
similarly to a RNN with a carry mechanism, before
applying a LSTM with a softmax for the most likely next
word representation. Most interesting in this work is the
application of the CNN with the highway network. A few
things to note: with the vocabulary C over characters and d as
usual the embedding size, we deal with an ℝ^{d×|C|} matrix of
character embeddings. A word k ∈ V is decomposed into
a sequence of characters [c_1, …, c_l], where l = |k|; the
matrix representation then is C^k ∈ ℝ^{d×l}. The columns
are character vectors, the rows the d character dimensions.
The character-level CNN computes the feature map
f^k[i] = tanh(⟨C^k[:, i:i+w−1], H⟩ + b),
where H is a filter of width w creating the feature map f^k,
indexed over the columns i … i+w−1 of C^k, and
⟨…⟩ is the (Frobenius) inner product. The convolution or kernel can
be seen as a generator for character n-grams. This is then
fed to y^k = max_i f^k[i], which takes the maximum of the feature map,
e.g., applies a max-pooling transformation. After this, y^k
is used as input to a highway network, which is
essentially an RNN/LSTM-style network with different gating
mechanisms.
        </p>
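          <p>A numpy sketch of a single character-level filter: the convolution over the character matrix C^k followed by max pooling over time; the word, embedding size and filter width are illustrative:</p>
          <preformat>
import numpy as np

rng = np.random.default_rng(3)
word = "absurdity"
d, w = 15, 3                                   # character embedding size, filter width
char_emb = {c: rng.normal(size=d) for c in set(word)}

C_k = np.stack([char_emb[c] for c in word], axis=1)   # d x l character matrix
H = rng.normal(size=(d, w))                           # one filter of width w
b = 0.0

# feature map: f_k[i] = tanh(&lt;C_k[:, i:i+w], H&gt; + b), then max pooling over i
f_k = np.array([np.tanh(np.sum(C_k[:, i:i + w] * H) + b)
                for i in range(C_k.shape[1] - w + 1)])
y_k = f_k.max()                                       # the filter's response for this word
print(f_k.shape, round(float(y_k), 3))
</preformat>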
        <p>
          One highway layer computes z = t ⊙ g(W_H y + b_H) + (1 − t) ⊙ y.
The transform gate t maps the input into a different latent
space, and (1 − t) is the carry gate, deciding what
information is carried over. g(W_H y + b_H) is a
typical affine transformation with a non-linearity
applied, and ⊙ is the entry-wise or Hadamard
product. Stacking several layers of highway networks
allows carrying parts of the input to the output while
combining them in a recurrent fashion. At last, the output
z is fed into an LSTM with a softmax to obtain
distributions over the next word. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] manages to reduce
the parameter size by 60% while achieving state-of-the-art
language modelling results. Furthermore, they find that
their models learn semantic and orthographic relations
from characters, questioning whether word-level embeddings are
even necessary. They also successfully deal with OOV
words, assigning them to the correct in-vocabulary words
that word-level models failed to learn.
        </p>
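          <p>The highway transformation described above, sketched as a single layer in numpy; the weights and the negative transform-gate bias are random stand-ins, not trained parameters:</p>
          <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_T, b_T, W_H, b_H):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with transform gate t and carry gate (1 - t)."""
    t = sigmoid(W_T @ y + b_T)        # transform gate
    g = np.tanh(W_H @ y + b_H)        # non-linear transformation of the input
    return t * g + (1.0 - t) * y      # carry part of the input straight through

rng = np.random.default_rng(4)
d = 8
y = rng.normal(size=d)
z = highway(y, rng.normal(size=(d, d)), np.full(d, -2.0),
            rng.normal(size=(d, d)), np.zeros(d))
print(z.shape)
</preformat>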
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Character n-grams</title>
        <p>
          While character-level models work on par with
word-level models, recent works focus on character n-grams.
Charagram by [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is an approach to learn character-level
compositions, not the statistics of single characters.
A textual word or sentence is treated as a sequence of
characters x = ⟨c_1, c_2, …, c_m⟩ with subsequences
x_{i:j} = ⟨c_i, c_{i+1}, …, c_j⟩. Charagram produces a
character n-gram count vector, where each character n-gram g
has its own vector W_g if g is part of the model's set of
n-grams; an indicator function contributes 1 if the n-gram is in
that set and 0 otherwise. The representation h is a single
non-linearity applied over the sum of all n-gram character
vectors of x, where m is the maximum length of any character
n-gram in the model; W can be initialized by different choices
as a model parameter. They achieve state-of-the-art results,
beating LSTM and CNN based models, using Spearman's ρ
correlation as the scoring function. The size of the n-grams
matters, and they suggest n &gt; 2, or larger n for languages
like German with many noun compounds.
        </p>
        <p>
          Word2Vec takes two vectors w_t and c_i in ℝ^d,
where d is the dimensionality, w_t is the target word vector
and c_i are the corresponding context vectors, with the score
s(w_t, c_i) = w_t ⋅ c_i. We would like to represent a word as a
character representation through n-grams, e.g., the word
"where" becomes ⟨wh, whe, her, ere, re⟩ for n = 3. The above
Word2Vec objective can then be rewritten to represent each word
as a bag of character n-grams:
s(w, c) = ∑_{g ∈ G_w} z_g ⋅ v_c,
where z_g is the vector representation of a single n-gram
from a global set G with all character n-grams and
G_w ⊂ {1, …, |G|} indexes the n-grams of w. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] successfully improve on the analogy
task over previous models and deal with OOV words
even where the morphemes do not match up.
        </p>
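        <p>A small sketch of extracting boundary-marked character n-grams for a word and scoring a (word, context) pair as the sum of its n-gram vectors, in the spirit of [<xref ref-type="bibr" rid="ref12">12</xref>]; the hashing of n-grams into a table and all sizes are simplifying assumptions:</p>
        <preformat>
import numpy as np

def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word with boundary symbols, e.g. 'where' yields ⟨wh, whe, her, ..."""
    marked = "⟨" + word + "⟩"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

rng = np.random.default_rng(5)
d, buckets = 50, 2**12
Z = rng.normal(scale=0.1, size=(buckets, d))    # n-gram vectors z_g
V_ctx = rng.normal(scale=0.1, size=(1000, d))   # context vectors v_c

def score(word, context_id):
    """s(w, c) = sum over the word's n-grams g of z_g · v_c (the subword skip-gram score)."""
    idx = [hash(g) % buckets for g in char_ngrams(word)]
    word_vec = Z[idx].sum(axis=0)
    return float(word_vec @ V_ctx[context_id])

print(char_ngrams("where")[:5])
print(score("where", 42))
</preformat>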
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Context Selection</title>
      <p>Context selection is about choosing a suitable function
over a domain that maps a given center item and its context
to a latent space in ℝ^d, where d is the dimension of the
latent column space. In text, context selection is narrowly
understood as the surrounding words c_i of a target
word w_t within a window of size c, which is generally
known as the skip-gram objective. CBOW on the other
hand is the reverse operation: given a context c, what is
its center word. It turns out that context is a much larger
topic than just language modelling. We will first
review a couple of concepts applied to general problems
of count and real valued data, using exponential family
distributions proposed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].</p>
      <sec id="sec-3-4">
        <title>4.1 Generalization of Context selection</title>
        <p>
          Context embeddings are not only useful to textual data,
but to sequential data of different shapes and forms as
well. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] presents a general procedure modelling on count
and real valued data, using an expectation-maximization
(EM) algorithm to approximate exponential family
embeddings. The exponential family distributions are
distributions
with a special form
given the natural
parameters and sufficient statistics giving rise to the
possibility of fitting
        </p>
        <p>
          different kinds of probability
distributions to the same problem set. The most famous
distributions are Gaussian, Poisson or categorical. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
propose two example models for Gaussian (real valued)
and Poisson (count based) distributions. The general
form of exponential families is as follows:
x_i | c_i ∼ ExpFam(η_i(c_i), t(x_i)),
where x_i is any data point for which we would like to learn the
distribution and c_i is the context of each data point x_i. η_i(c_i)
lies in the natural parameter space, which is always convex, e.g.,
within the bounds of the applicable finite integral of the
function, and t(x_i) is the sufficient statistic, a function that
fully summarizes the data x_i such that there exists no other
statistic that provides additional information. The natural
parameter has the general form
η_i(c_i) = f_i(ρ[i] ⋅ ∑_{j ∈ c_i} α[j] x_j),
where ρ[i] are the embedding parameters for a respective
target, α[j] are the context parameters, a probability
distribution over context elements, and f_i is the link
function that must be defined for each individual
problem, connecting context with a data point. The
objective cost function is the sum of log conditional
probabilities of each data point, which is then optimized
using stochastic gradient descent. If the probability
distribution is categorical, the objective is almost
equivalent to Word2Vec with CBOW.
        </p>
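        <p>A toy sketch of the exponential family embedding idea for Gaussian (real valued) data: the natural parameter of item i combines its context items through the embedding and context parameters. All sizes, the identity link and the random values are assumptions for illustration, not the authors' model code:</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(6)
n_items, k = 20, 8
rho = rng.normal(size=(n_items, k))     # embedding parameters ρ[i]
alpha = rng.normal(size=(n_items, k))   # context parameters α[j]
x = rng.normal(size=n_items)            # observed real-valued data points

def natural_parameter(i, context):
    """η_i(c_i) = ρ[i] · Σ_{j in c_i} α[j] x_j (identity link assumed for the Gaussian case)."""
    return float(rho[i] @ (alpha[context] * x[context, None]).sum(axis=0))

def gaussian_log_likelihood(i, context, sigma=1.0):
    """log p(x_i | c_i) with the mean given by the natural parameter."""
    mu = natural_parameter(i, context)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x[i] - mu) ** 2 / (2 * sigma**2)

context_of_3 = np.array([1, 2, 4, 5])
print(gaussian_log_likelihood(3, context_of_3))
</preformat>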
        <p>
          Given this framework one can construct all kinds of
contexts and link functions to solve embeddings for a
specific domain. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] propose an advancement on the
ef-emb by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], by considering only a subset of elements in
the context, instead of using all of them, naming their
model context selection for exponential family
embeddings (CS-EFE). Additionally, CS-EFE depends
on three parameters, the embeddings for a target, the
context of the target and a hidden binary vector that
indicates what the target depends on. The authors
leverage amortized variational inference (VI). We will
try to describe the work in three steps: Why VI? Why
black-box VI? Why amortized VI?
ef-emb could be easily optimized with gradient descent
given the cost function. What has changed is that
CS-EFE deploys an additional set of coefficients b that
indicate whether an element of a context is relevant for the target
word or not. To do this we need to marginalize out this
binary vector b. Therefore, we use VI, positing a variational
distribution over the exponential family model to approximate
the best solution possible.
point, this objective is still intractable and we need to find
ways to approximate this even further by the variational
lower bound or ELBO and share parameters across the
contexts, which VI alone is not able to do. This reduces
the runtime and storage complexities considerably and
introduces a lower bound that guarantees errors
lower than the bound, but not errors close to it. The first problem is
that the original VI has no parameter sharing of the
context  , which in this case is absolutely needed.
Context is shared, that is why an amortization network
for parameter sharing is needed, e.g., amortized VI. In
Figure 3 we see the amortization network, where the
target is, in language modelling, the word w; a score is
computed over the context vector and the target
embedding; prior probability parameters are attached to
the binary indicators; and Gaussian kernels receive the score of
each target word except the k-th. The second problem is that we cannot fit the
variational distribution q(b; ν) to each target
individually and hence use black-box VI, approximating the
expectation by Monte Carlo sampling, obtaining noisy
gradients of the ELBO. To simplify: Select the correct
context from a window using a binary vector as indicator,
which cannot be computed, using VI. VI cannot share
parameters, which there are plenty of and cannot, even
with sharing, estimate the correct gradients given the
KLD. Using an indicator vector to select appropriate
elements from the context results in variable length
context vectors, for which we need a fixed size
representation. Instead of this we use Gaussian real valued
kernels to estimate mean and variance for each binary
vector and assign it. We use Monte Carlo sampling,
because we would otherwise need to compute every possible
setting between the binary vector and the context vector;
the sampling yields “tainted” or “noisy” gradients of the
evidence lower bound.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2 Context selection</title>
        <p>
          Context selection in language models is at this point a
well studied task. Word2Vec by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] uses a context
window of surrounding words. While this sounds
intuitive, there are a lot of suggestions on improving this.
Originally, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] suggested to use sub-sampling to remove
frequently co-occurring words and use context
distribution smoothing reducing bias towards rare words. This is
very much in conjunction with count based methods that
clip off the top/bottom percent of a vocabulary. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] have
found that using dependency based word embeddings
has an impact on the quality and quantity of functional
similarity tasks. However, it is to be noted
that on topical similarity tasks the suggested model
performs worse. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] note that mostly a linear context,
e.g., windows, is used. Given a corpus and a target word
w, with a corresponding sentence (e.g., context) and
modifiers of that sentence m_1, …, m_k with head h, a
dependency tree is created, see Figure 4, with the
Stanford Dependency parser.
The contexts are (m_1, lbl_1), …, (m_k, lbl_k), (h, lbl_h^{-1}), where lbl is the
dependency relation between head and modifier (e.g.,
nsubj, dobj, prep_with, amod). While lbl is the forward
or outgoing relation from the head – the target
word – lbl^{-1} is the in-going or inverse relation.
Given a Word2Vec model with a small window size of
k = 2 and a larger window size k = 5, the dependency-based
model learns different word relations and
minimizes two effects. We can see in Figure 4 that coincidental
filtering takes place, because “Australian” is obviously
not part of “science” in general, which Word2Vec would
take as a context in either model. Secondly, if the
window size is small, out-of-reach words like “discover”
and “telescope” would have been filtered out. Longer
more complex sentences could have several head words,
where the context is out-of-reach in larger Word2Vec
models as well. In comparison with Word2Vec, the
dependency-based model has a higher precision and recall
on functional similarity tests.
        </p>
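        <p>A minimal sketch of turning an already-parsed sentence into dependency-based (word, context) pairs, including the inverse relations, in the spirit of [<xref ref-type="bibr" rid="ref25">25</xref>]; the example parse is hand-written (not produced by a parser) and preposition collapsing is omitted:</p>
        <preformat>
# Each token: (index, word, head_index, relation); head_index -1 marks the root.
parse = [
    (0, "australian", 1, "amod"),
    (1, "scientist", 2, "nsubj"),
    (2, "discovers", -1, "root"),
    (3, "star", 2, "dobj"),
    (4, "with", 2, "prep"),
    (5, "telescope", 4, "pobj"),
]

def dependency_contexts(parse):
    """Contexts (modifier, relation) for each head, plus inverse relations for each modifier."""
    contexts = {word: [] for _, word, _, _ in parse}
    for idx, word, head, rel in parse:
        if head == -1:
            continue
        head_word = parse[head][1]
        contexts[head_word].append((word, rel))            # outgoing relation from the head
        contexts[word].append((head_word, rel + "-1"))     # inverse relation for the modifier
    return contexts

for target, ctx in dependency_contexts(parse).items():
    print(target, ctx)
</preformat>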
        <p>
          Another strategy is to incorporate additional information from
external data sources, augmenting word vectors. [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] improve on the Word2Vec model by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] using dictionaries. Dictionaries
are records with a word mapping to a definition, e.g.:
        </p>
        <p>Guitar - a stringed musical instrument, with a fretted
fingerboard, typically incurved sides, and six or twelve
strings, played by plucking or strumming with the fingers
or a plectrum.</p>
        <p>
          The key concept presented in [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] is that each word can
be weakly and strongly linked to each other given the
definition. For instance, the Guitar and Violin share the
words stringed musical instrument, that should strongly
tie them together. In the definition of the Violin there is
no plucking or strumming and thus is considered a weak
pair. Moreover, weak pairs are promoted to strong pairs
when they are within the k closest neighbouring words
calculated with a cosine distance. The skip-gram
objective with negative sampling can be rephrased given the
definitions to positively and negatively couple words. The
positive sampling cost function is
J_pos(w_t) = β_S ⋅ ∑_{w_i ∈ S(w_t)} ℓ(v_t ⋅ v_i) + β_W ⋅ ∑_{w_j ∈ W(w_t)} ℓ(v_t ⋅ v_j),
where ℓ is the logistic loss function, w_t is each target word of
the corpus with its corresponding vector v_t, S(w_t) are the
strong pairs, W(w_t) are the weak pairs and v_i/v_j are the
corresponding strong and weak pair vectors. The
hyperparameters β_S and β_W are chosen to best fit the learning of
strong and weak pairs. Set to zero, the model behaves
exactly like Word2Vec. The corresponding negative sampling
cost function J_neg(w_t) draws words w_n randomly from
the vocabulary such that w_n ≠ w_t and w_n is neither part of
the strong pairs, w_n ∉ S(w_t), nor the weak pairs, w_n ∉ W(w_t).
This results in the cost function for a target
J(w_t, w_c) = ℓ(v_t ⋅ v_c) + J_pos(w_t) + J_neg(w_t).
        </p>
        <p>The results show an improvement over state-of-the-art
models on word similarity and text classification. They
parsed and trained on a large corpus from Wikipedia,
comparing a pre-trained Word2Vec model
augmented with dictionaries, a retrofitted model using
WordNet and a single model on a raw corpus. Dict2Vec
showed superior results on the raw corpus and improved
the other models by up to 13%.</p>
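        <p>A toy sketch of building strong and weak pairs from dictionary definitions under one simplified reading: headwords that mention each other mutually form strong pairs, one-directional mentions form weak pairs. The tiny definitions are illustrative, and the promotion of weak pairs via cosine distance is omitted:</p>
        <preformat>
def dictionary_pairs(definitions):
    """Strong pairs: each headword appears in the other's definition; weak: only one direction."""
    strong, weak = set(), set()
    words = list(definitions)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            a_in_b = a in definitions[b]
            b_in_a = b in definitions[a]
            if a_in_b and b_in_a:
                strong.add((a, b))
            elif a_in_b or b_in_a:
                weak.add((a, b))
    return strong, weak

definitions = {  # toy definitions, tokenized and lower-cased
    "guitar": "stringed musical instrument played by plucking or strumming".split(),
    "violin": "stringed musical instrument played with a bow related to the guitar".split(),
    "bow": "a rod used to play the violin".split(),
}
strong, weak = dictionary_pairs(definitions)
print("strong:", strong)
print("weak:", weak)
</preformat>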
      </sec>
      <sec id="sec-3-6">
        <title>4.4 Comparison</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] have found that different downstream and language
modelling tasks need different types of context applied.
They compare window-based, substitution-based,
dependency-based, concatenation and SVD on sub-sampled
context for word embeddings. In Figure 5 we can see
three kinds of datasets: WordSim-353-R for topical
coherence, WordSim-353-S for
functional similarity and TOEFL for evenly balanced
parts of topical coherence and functional similarity. First
to note: substitution-based word embeddings performed
worse overall in all domains. The idea is to substitute
words in sentences, e.g., “I love my job” becomes [I, ?, my, job],
and substituting for “love” yields a probability distribution over
candidate replacement words learned by a language model. What we
can immediately see is that typical word embeddings like
Word2Vec with windows 1, 5 and 10 outperform the
other models on topical coherence (WordSim-353-R) and
are on par with dependency-based models on SimLex-999
and TOEFL. Further, dependency-based models perform
much better on functional similarity tasks like
WordSim-353-S. Their results also suggest that concatenating
different word embeddings yields the highest results on
downstream language tasks such as parsing, NER or
sentiment. Unfortunately, Dict2Vec is not in the list of
compared models as it is new and still being evaluated.
        </p>
        <p>
          In this paper we explored a wide variety of concepts dealing
with word-level and subword-level embeddings as well as
context selection procedures. All of the suggested methods
have assets and drawbacks. However, strategies using
pretrained character n-grams on large datasets with negative
sampling/hierarchical softmax on the skip-gram
and
CBOW objective performs best. That is, they bring all the
features of pre-trained word embeddings, while dealing
with OOV words and faster training. It would be interesting
to see if character-level embeddings could be enhanced with
procedures that leverage
external sources and incorporate global statistics as well.
Word2Vec is the basic work-unit behind all current text
representation learning tasks. Besides what is covered here,
there are multiple research directions open. E.g., statistical
models that treat words as a distribution, see [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
They treat words as probability mass functions (pmfs) and
can express uncertainty in different dimensions as well as
deal with all kinds of WSD problems and entailment. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
goes even further by representing words as hierarchical
pmfs. Instead of changing how the representation is created,
they alter the representation to fit certain conditions and
features. Other open issues are domain adaptation and transfer
learning techniques. In the future they will help in dealing
with the asymmetry of data. Given a dataset of a domain
that is well known, generalize it to a target domain with
fewer samples. This will be particularly helpful in smaller
domains and help transpose different ideas beyond the
current context. At last, there is a desperate need for further
theoretical understanding. It is hard to compare every
model and even harder when the evaluation is largely
intrinsic and effects can only be indirectly tested in
downstream language tasks. Here we will also work on
further improvements.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>CoRR, abs/1310.4546</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Exponential family embeddings</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Athey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Context selection for embedding models</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ducharme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Janvin</surname>
          </string-name>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>March 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolukbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Saligrama</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalai</surname>
          </string-name>
          .
          <article-title>Man is to computer programmer as woman is to homemaker? debiasing word embeddings</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          .
          <article-title>Notes on noise contrastive estimation and negative sampling</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Rong</surname>
          </string-name>
          .
          <article-title>word2vec parameter learning explained</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wieting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Livescu</surname>
          </string-name>
          . Charagram:
          <article-title>Embedding words and sentences via character n-grams</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          .
          <article-title>Learned in translation: Contextualized word vectors</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Framewise phoneme classification with bidirectional lstm and other neural network architectures</article-title>
          .
          <source>Neural Networks</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <article-title>Neural word embedding as implicit matrix factorization</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>27</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gittens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Achlioptas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.W.</given-names>
            <surname>Mahoney</surname>
          </string-name>
          .
          <article-title>Skip-gram - zipf + uniform = vector additivity</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL
          <year>2017</year>
          , Canada
          , Volume
          <volume>1</volume>
          , pages
          <fpage>69</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          and
          <string-name>
            <surname>L. Thompson.</surname>
          </string-name>
          <article-title>The strange geometry of skip-gram with negative sampling</article-title>
          .
          <source>In 2017 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>September 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sontag</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.M.</given-names>
            <surname>Rush</surname>
          </string-name>
          .
          <article-title>Character-aware neural language models</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbelot</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          .
          <article-title>High-risk learning: acquiring new word vectors from tiny data</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>A comparative study of word embeddings for reading comprehension</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pinter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guthrie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          .
          <article-title>Mimicking word embeddings using subword rnns</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Józefowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Exploring the limits of language modeling</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.K.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Greff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Highway networks</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <article-title>Dependency-based word embeddings</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</source>
          , Baltimore, USA, Volume
          <volume>2</volume>
          , pages
          <fpage>302</fpage>
          -
          <lpage>308</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tissier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Habrard</surname>
          </string-name>
          .
          <article-title>Dict2vec: Learning word embeddings using lexical dictionaries</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , Copenhagen, Denmark, September 9-11,
          <year>2017</year>
          , pages
          <fpage>254</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>O.</given-names>
            <surname>Melamud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McClosky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patwardhan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          .
          <article-title>The role of context types and dimensionality in learning word embeddings</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vilnis</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Word representations via gaussian embedding</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Athiwaratkun</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.G.</given-names>
            <surname>Wilson</surname>
          </string-name>
          .
          <article-title>Multimodal word distributions</article-title>
          .
          <source>In Conference of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          .
          <article-title>Poincaré embeddings for learning hierarchical representations</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>