<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context and Embeddings in Language Modelling - an Exploration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Nitsche</surname>
          </string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>
            <given-names>Marina</given-names>
            <surname>Tropmann-Frick</surname>
          </string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hamburg University of Applied Sciences, Department of Computer Science</institution>
          ,
          <addr-line>Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>131</fpage>
      <lpage>138</lpage>
      <abstract>
        <p>Embeddings are a natural way to map text to a latent space, commonly consumed in downstream language tasks, e.g., question-answering, named-entity recognition or neural machine translation. Embeddings typically capture syntactic relations between parts of a sequence and solve semantic problems connected with word-sense disambiguation (WSD) well. Because of WSD, the curse of dimensionality, out-of-vocabulary words, overfitting to a domain and missing real-world knowledge, inferring meaning without context is hard. Thus we require two things. First, we need techniques to actively overcome syntactic problems dealing with WSD and semantically correlating words/sentences. Second, we require context to reconstruct the intentions and settings of a given text such that it can be understood. This work explores different embedding models, data augmentation techniques and context selection strategies (subsampling on the input space) for real world language problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Many NLP applications start with preprocessing
pipelines involving stemming, stopword removal, special
character extraction and tokenization. When the
morphological treatment of text is done, the most important
step is the representation of text: its projection. Language is
a high dimensional and multi-sense problem domain
dealing with polysemy, synonymy, antonymy and
hyponymy. Therefore, we often need to reduce the
dimensions of the problem domain, projecting it to a latent
space. Classical models project words using WordNet
mapping each word to a relation, employ methods from
linear algebra like Singular Value Decomposition (SVD)
and most famously Latent Semantic Indexing (LSI) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
More complicated statistical models involve expectation
maximization procedures for which Latent Dirichlet
Allocation (LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is the standard. Word and
subword-level embeddings try to overcome some of the
limitations of the former methods using neural networks
posing language models as an optimization problem.
Word2Vec by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was the first successful model that
superseded the quality of preceding methods.
Embeddings map words, sentences, characters or parts of
words to a non-linear latent space in ℝ^d, where d stands
for the number of dimensions the embedding has.
Projects like fastText, spaCy, StarSpace, GloVe and the
Word2Vec Google News embeddings offer pre-trained
language models built on vast amounts of data. There are
multiple ways to choose a context for embeddings: by a
window of size c around a center word, by the dependency
tree around a word or by representing words as
probability distributions and discarding unlikely words. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
generalize context embeddings to models of the
exponential family (ef-emb). [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] enhance ef-emb by creating a
very complex selection procedure based on an
amortization network and variational inference to drop
unimportant items from the context with an indicator
vector. In theory, context selection works with two
functions. The first is a function for selecting what a viable
context is, e.g., c = f(w, V), where w is the target item/
center word and V the set of all items/the vocabulary. The second
subsamples on the target and context, q(w, c). The origins
of neural language models go back to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], proposing a
shallow single-layer neural network with a softmax layer.
The neural language model computes a conditional
probability distribution over words – producing
embeddings based on the n preceding words, represented as
vectors of dimension d, shared across the entire
network in the respective context vectors. The most
basic language model computes the conditional
probability of a word w_t given the preceding words
using the chain rule. When the vocabulary grows large,
the normalization term in the denominator of the
softmax becomes more difficult to handle. The model in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
is intractable and could not be successfully built. The
first model that successfully beat state-of-the-art
language models was Word2Vec by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Later we will review
word and subword-level embeddings.
      </p>
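      <p>To make the two functions above concrete, the following minimal Python sketch (the corpus, the window size and the subsampling threshold are illustrative assumptions, not the authors' code) selects a window context around a center word and subsamples frequent targets:</p>
      <preformat>
import random
from collections import Counter

def window_context(tokens, i, window=2):
    """Context selection c = f(w, V): the words inside a symmetric window around position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def keep_target(word, counts, total, t=0.5):
    """Subsampling q(w, c): a simplified rule that keeps frequent targets with lower probability."""
    freq = counts[word] / total
    return random.random() &lt; min(1.0, (t / freq) ** 0.5)

tokens = "the quick brown fox jumps over the lazy dog".split()
counts, total = Counter(tokens), len(tokens)
# a large threshold t is used here so the tiny toy corpus keeps most words
pairs = [(w, window_context(tokens, i))
         for i, w in enumerate(tokens) if keep_target(w, counts, total)]
print(pairs[:3])
</preformat>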
    </sec>
    <sec id="sec-2">
      <title>2 Word Embeddings</title>
      <p>At first we briefly review word-level embeddings.
Corpora typically consist of words that are part of sentences
in documents. Before training, each sentence is
tokenized and morphologically altered with stemming or
lemmatization. Classical models use the bag of words
model, so words are represented as a co-occurrence
feature matrix. We start with Word2Vec, since almost
every model leveraging embeddings in language takes it
as a point of reference.</p>
      <sec id="sec-2-1">
        <title>2.1 Word2Vec</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] improved on several aspects of Bengio's model by
using the skip-gram window function (an alternative
would be CBOW) and a tractable approximation of the
softmax called negative sampling/hierarchical softmax.
Word2Vec has become the de facto standard in a lot of
language downstream tasks. Google shipped pre-trained
Word2Vec skip-gram models on Google News articles
for everybody to use. The corpus is large (up to a billion
words) and the dimensionality of the latent space is large,
d = 300. Training would take weeks up to months
on just a few state-of-the-art GPUs, so the release saves each
researcher the time to train such models themselves. We will see
a great influx of pre-trained language models in the
future, because OOV words are a real issue and
generalization on small, sparse domains is highly
problematic. While most of the premises of pre-trained
models are great, they also introduce biases. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] have
shown that this particular dataset employs gender biases.
Skip-gram predicts the context of a center word w_t over
a window c such that w_{t-c}, …, w_t, …, w_{t+c} is covered. The
objective is to maximize the average log probability
        </p>
        <p>(1/T) ∑_{t=1}^{T} ∑_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t).</p>
        <sec id="sec-2-1-1">
          <title>CBOW does the opposite, given a word context most likely.</title>
          <p>− , … ,   , … ,  +</p>
          <p>Negative
predict the center word   that is
sampling
speeds
up
the
performance by using the positive samples of the context
words 2 ∗  and uses only a few negative samples that
are not in its context. The respective objective cost
function is
where  is the sigmoid function, a binary function,
drawing  samples from the negative or noise
distribution   ( ), to distinguish the negative draws   
the target word   drawn from the context of  
.</p>
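        <p>A minimal numpy sketch of this negative-sampling loss for one (target, context) pair; the vocabulary size, vector dimensions and the uniform noise distribution are stand-ins, not the original training setup:</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                       # vocabulary size, dimensions, negative samples
W_in = rng.normal(scale=0.1, size=(V, d))   # target (input) vectors
W_out = rng.normal(scale=0.1, size=(V, d))  # context (output) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target, context, noise_dist):
    """-log σ(v'_c · v_t) - Σ_k log σ(-v'_n · v_t), negatives drawn from the noise distribution."""
    v_t = W_in[target]
    pos = np.log(sigmoid(W_out[context] @ v_t))
    negatives = rng.choice(V, size=k, p=noise_dist)
    neg = np.log(sigmoid(-(W_out[negatives] @ v_t))).sum()
    return -(pos + neg)

unigram = np.full(V, 1.0 / V)               # placeholder for the smoothed unigram distribution
print(negative_sampling_loss(target=3, context=17, noise_dist=unigram))
</preformat>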
          <p>
            The objective of negative sampling is to learn high
quality word embeddings by comparing noise (out of
context) to words from the context. Another language
model building upon Word2Vec is Global Vectors for
Word Representation (GloVe) by [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], which is trained
on an aggregated global word co-occurrence matrix from
a corpus. The difference is that global
statistics are taken into account, contrary to Word2Vec,
which works on local context windows alone. GloVe
typically performs better than Word2Vec skip-gram,
especially when the vocabulary is large. GloVe is also
available pre-trained on different corpora such as Twitter,
Common Crawl or Wikipedia.
          </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Bag of Tricks - fastText</title>
        <p>
          Another interesting and popular word embedding model
is fastText by [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. It is based on a similar idea as Word2Vec:
instead of negative sampling it uses the hierarchical
softmax, and instead of words it uses n-gram features.
N-grams build on bag of words, commonly known as a
co-occurrence matrix D × V where the documents D are the
rows and the whole vocabulary V forms the features, assuming
i.i.d. word order. Given a sequence of words [w_1, …, w_T],
n-grams take slices of length n, e.g., [[w_1, …, w_n], …,
[w_{T-n+1}, …, w_T]]. fastText comes in two flavours:
character-level and word-level n-grams. We will review
the character-level n-grams later.
        </p>
        <p>
          The unsupervised learning task is the hierarchical
softmax with CBOW. The corresponding cost function has
the following form:
−(1/N) ∑_{n=1}^{N} y_n log(f(B A x_n)),
where f is the hierarchical softmax function, x_n is a
document with bag-of-n-gram feature vectors, A and B
are weight matrices and y_n is the label given a
classification task. The label y in this unsupervised case is the
word. As can be seen, instead of finding the surrounding
context of a word, we try to find the most probable word given
the context. What is novel about this approach is using
n-gram features instead of windows, speeding up
training while still matching state-of-the-art results.
fastText training time on a sentiment analysis task was 10
seconds, compared to 2-3 hours for the shortest-running
competing model and up to several days for others. As we
will see later, this model can be largely improved with the
character n-grams proposed in
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
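        <p>A small sketch of word-level n-gram features hashed into a bag-of-features vector in the spirit of the fastText classifier; the hash bucket size and the toy document are illustrative assumptions:</p>
        <preformat>
import numpy as np

def word_ngrams(tokens, n=2):
    """Slices of n consecutive words, e.g. bigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_features(tokens, buckets=2**10):
    """Unigrams plus bigrams hashed into a fixed-size count vector x_n."""
    x = np.zeros(buckets)
    for feat in tokens + word_ngrams(tokens, 2):
        x[hash(feat) % buckets] += 1.0
    return x

doc = "the movie was surprisingly good".split()
x = bag_of_features(doc)
print(x.sum(), np.flatnonzero(x)[:5])
</preformat>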
      </sec>
      <sec id="sec-2-2a">
        <title>2.3 CoVe</title>
        <p>
          So far we have investigated shallow neural networks
with single layers and therefore only one non-linearity.
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] have found that training an attentional
sequence-to-sequence model normally used for neural
machine translation helps at enriching word vectors beyond
the word-level hierarchy. By training a two-layer,
bidirectional long short-term memory [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], on a source
language (English) to a target language (German), they achieve
state-of-the-art performance. All sequences of words w_x
are pre-initialized with GloVe(w_x), so that words become
sequences of vectors, where w_x is a sentence in the source
language and w_z a sentence in the target language, maximizing
the likelihood of an encoder MT-LSTM h and a decoder LSTM h^dec.
        </p>
        <p>The softmax attention α over the decoder states
represents the relevance of each step from the encoder h.
The decoder hidden state is then formed by concatenating the
attention-weighted encoder summary with the decoder state,
possibly to attend to the relevant parts
while not forgetting what was learned during
decoding. Intuitively, we are training a machine
translation model where the only interesting part is the
learned context vectors for sequences of the MT-LSTM.
It was shown that the model performs better, when
concatenating GloVe and CoVe into one single vector.
The idea behind this is that we can transfer the higher
level features learned in sequence-to-sequence tasks to
standard downstream tasks like classification. By first
using GloVe on the word-level and then the MT-LSTM,
we are creating layers of abstractions. Essentially this is
a first step towards transfer learning, which is standard
practice in computer vision tasks with pre-trained CNNs.
The top achiever is a model called Char + CoVe-L,
a large CoVe model concatenated with an n-gram
character features model.</p>
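        <p>Conceptually, the downstream input is just the per-token concatenation of GloVe and CoVe vectors; a minimal sketch with stand-in vectors, where the encoder call is a placeholder rather than the authors' MT-LSTM:</p>
        <preformat>
import numpy as np

def cove_inputs(glove_vectors, encoder):
    """Concatenate pre-trained word vectors with context vectors from a seq2seq encoder."""
    context_vectors = encoder(glove_vectors)          # shape (T, d_enc)
    return np.concatenate([glove_vectors, context_vectors], axis=1)

T, d = 6, 300
glove = np.random.randn(T, d)
fake_encoder = lambda x: np.tanh(x @ np.random.randn(d, 2 * d))  # stand-in for the MT-LSTM
print(cove_inputs(glove, fake_encoder).shape)         # (6, 900)
</preformat>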
      </sec>
      <sec id="sec-2-3">
        <title>2.4 Bias and Critique</title>
        <p>The present is a time producing a lot of different models
based on experimentation and educated guesses. It is
usually left to the reader to find explanations of what
embeddings capture in language. What does a word-level
embedding like Word2Vec actually represent? While
there is still a lot of ground to cover, recent papers focus
a little more on the whys instead of the hows. Before
going into details about subword embeddings and
selection procedures, let us discuss some of the problems,
challenges and critiques, gaining a little more insight into
why embeddings actually work. Most of the
state-of-the-art models evaluate word embeddings with intrinsic
evaluations. Intrinsic evaluation is usually qualitative:
given a set of semantic word analogy pairs, test whether the
model connects them correctly, e.g., king⃗ − man⃗ ≈
queen⃗ − woman⃗. The woman/queen vs. man/king analogy is the
most famous of all examples. One could deduce that,
given a large number of such analogy word pairs, testing
the presence of synonymy, polysemy and word
positioning is sufficient. Intrinsic evaluation shows
exactly what works, not what does not work or even what
works but should not. Extrinsically it is not possible to
use labels testing the precision and recall of our system.
And it is easy to see why: what should a
general approximation of a word look like?
Should it be able to learn every possible dimension and
therefore interpretation of what we perceive of it? If so,
how should it learn to distinguish different domains with
a different context? The context of a domain is never
explained or given to the models.</p>
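        <p>The analogy test itself is plain vector arithmetic followed by a nearest-neighbour lookup under cosine similarity; a minimal sketch over a hypothetical (randomly initialized) embedding dictionary:</p>
        <preformat>
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def analogy(emb, a, b, c):
    """Return the word whose vector is closest to emb[b] - emb[a] + emb[c]."""
    query = emb[b] - emb[a] + emb[c]
    candidates = [(w, cosine(query, v)) for w, v in emb.items() if w not in (a, b, c)]
    return max(candidates, key=lambda t: t[1])[0]

rng = np.random.default_rng(1)
emb = {w: rng.normal(size=50) for w in ["man", "king", "woman", "queen", "apple"]}
print(analogy(emb, "man", "king", "woman"))  # with real embeddings this should be "queen"
</preformat>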
        <p>
          Given a reasonable amount of test cases, quality can be
ensured to some extent. How good or bad they actually
perform is usually tested in downstream language tasks.
If the embeddings perform better on that specific task
compared to a preceding model, it is declared
state-of-the-art. Interestingly, [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] show that even state-of-the-art
embeddings display a large amount of bias towards
certain topics: man⃗ − computer programmer⃗ ≈ woman⃗ −
homemaker⃗.
        </p>
        <p>
          Training real language models on real data yields real
bias. The world and its written words are not fair and they
incorporate really narrow views and concepts. Gender
inequality and racism are two of the most challenging
societal problems in the 21st century. Learning
embeddings always yields a representation of the input.
The bias is statistically significant. The problem is more
obvious when considering that the standard Word2Vec
model trained on the Google News corpus is applied to
thousands of downstream language tasks. These kinds of
biases are not unique to language modelling and can be
found in computer vision as well. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] hint that there are
three forms of bias: occupational stereotypes, analogies
with stereotypes and indirect gender bias. They also
acknowledge that not everything we perceive as bias
should be seen as such, e.g. football and footballer being
male dominated may have other reasons than just bias. To debias
embeddings the answer is quite clear: we need additional
knowledge in the form of gender-specific word lists. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
suggest creating a reference set of word vectors
for gender-biased words.
        </p>
        <p>
          While this works for direct bias, it is much harder for
indirect bias, which spreads across different latent dimensions.
Therefore, a debiasing algorithm is suggested with two
steps: 1.) identify the gender subspace and 2.) equalize
(factor out gender) or soften (reduce magnitude). What
do these models learn? [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] have found that Word2Vec
with skip-gram and negative sampling implicitly factorizes a (shifted) PMI matrix.
A (P)PMI matrix (extra P for keeping only positive
entries) is a high dimensional and sparse context matrix,
where each row is a word  from the vocabulary  and
each column represents a context  , where it occurs.
PPMI matrices are theoretically well known and provide
a guiding hand for what Word2Vec actually learns. The
problem of PPMI matrices is actually that you need to
carefully consider each context for each occurring word,
which does not scale up to billions of tokens. The results
actually show that Word2Vec skip-gram with negative
sampling is still the better choice from a view of
precision and scalability. For further exploration of the
theoretical aspects of word embeddings, in particular an
explanation of the additivity of vectors, see [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]; for a
geometric interpretation of Word2Vec skip-gram with
negative sampling see [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
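        <p>A compact sketch of building a PPMI matrix from a word-by-context co-occurrence count matrix; the tiny counts are purely illustrative:</p>
        <preformat>
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information over a word-by-context count matrix."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0        # zero out cells with no co-occurrence
    return np.maximum(pmi, 0.0)         # keep only positive entries

counts = np.array([[4., 1., 0.],
                   [1., 3., 1.],
                   [0., 1., 5.]])
print(ppmi(counts).round(2))
</preformat>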
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Subword Embeddings</title>
      <p>
        Subword embeddings deal with words by slicing them
into smaller proportions. This is advantageous due to the
fact that single words and their corresponding vectors
only match by symbolic comparison. Thus, there are
advantages of representing words as vectors of sub-level
symbolic representations, which first largely appeared in
neural machine translation. The representations range
from character CNNs/LSTMs [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to character n-grams
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. These models typically handle
out-of-vocabulary words much better than their corresponding word
embeddings. While subword-level embeddings deal
better with OOV and relatedness than words, there are
dedicated strategies for OOV handling beyond subword
embeddings.
      </p>
      <sec id="sec-3-1">
        <title>3.1 Out-of-vocabulary words</title>
        <p>
          Out-of-vocabulary (OOV) words are a problem in two
circumstances. The first is when the amount of OOV
words is large; the second is when the dataset is small and deals
with niche words, where every word carries substantial weight.
Words that do not match any given word vector are
mapped to the UNK token. There are several strategies
on dealing with OOV words ranging from using the
context words around OOV words [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], using pre-trained
language models to assign their vector to OOV words
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] or retrain character-level language models on
pretrained models [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] found a few tricks to improve
on Word2Vec with their proposed model Nonce2Vec.
They use pre-trained word embeddings from Word2Vec
and treat OOV words as the sum of their context words.
They show that this is applicable on smaller datasets as
well. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] found it effective to use vectors of pre-trained
language models, where a word was OOV in their
domain. Using the pre-trained vector of a different
domain helped them in improving the initialization of
their OOV words in comparison to assign a global UNK
token to their data points. They improved models on
reading comprehension considerably especially with
OOV words. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] have shown that generating OOV word
embeddings by training a character-level model on a
pretrained dataset. The goal is to re-create the vectors by
leveraging character information. With a character-level
vector word representation OOV words can be handled
based on the sum of character vectors. They have found
that this is much better in cases where the dataset is small
and pre-trained embeddings are available.
        </p>
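        <p>In the spirit of the context-sum strategy of [<xref ref-type="bibr" rid="ref20">20</xref>], a minimal sketch that initializes an OOV vector from the pre-trained vectors of its surrounding words; the toy sentence and embeddings are assumptions for illustration:</p>
        <preformat>
import numpy as np

def oov_from_context(oov_word, sentence, embeddings, window=5):
    """Initialize an OOV vector as the mean of the pre-trained vectors around it."""
    i = sentence.index(oov_word)
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    vectors = [embeddings[w] for w in context if w in embeddings]
    dim = next(iter(embeddings.values())).shape
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

rng = np.random.default_rng(2)
embeddings = {w: rng.normal(size=50) for w in ["the", "plays", "a", "melody"]}
sentence = ["the", "theremin", "plays", "a", "melody"]
embeddings["theremin"] = oov_from_context("theremin", sentence, embeddings)
print(embeddings["theremin"].shape)
</preformat>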
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Character-level</title>
        <p>
          Character-level embedding models typically build on
pre-trained word embeddings. Additionally, character-based
representations of words are themselves either vectors for each
character of a word or vector representations of the
n-grams of a word. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] explore different architectures for
language modelling and compare three different models
with differing inputs to language models. The three
setups, see Figure 1, use an LSTM for the language
model and either words as
input and softmax as output, single characters with a
CNN as input and output, or a character CNN as input
with a softmax output. In the following we will explore
different character-level models. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] presents a model
with a character-level convolutional neural network
(CNN) with a highway network over characters.
Characters are used as an input to a single layer CNN with
maxpooling, using a highway network, introduced in [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
similarly to a RNN with a carry mechanism, before
applying a LSTM with a softmax for the most likely next
word representation. Most interesting in this work is the
application of the CNN with the highway network. A few
things to note: with the vocabulary C over characters and d as
usual the embedding size, we deal with an ℝ^{d×|C|} matrix of
character embeddings. A word k ∈ V is decomposed into
a sequence of characters [c_1, …, c_l], where l = |k|; the
matrix representation then is C^k ∈ ℝ^{d×l}. The columns
are character vectors, the rows the d character dimensions.
The character-level CNN computes the feature map
f^k[i] = tanh(⟨C^k[:, i:i+w−1], H⟩ + b),
where H is a filter of width w creating the feature map f^k,
indexed over the columns i … i+w−1 of C^k, and
⟨…⟩ is the (Frobenius) inner product. The convolution or kernel can
be seen as a generator for character n-grams. This is then
fed to y^k = max_i f^k[i], which takes the maximum of the feature map,
e.g., applies a max-pooling transformation. After this, y^k
is used as input to a highway network, which is
essentially an RNN/LSTM-style network with different gating
mechanisms.
        </p>
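          <p>A numpy sketch of a single character-level filter: the convolution over the character matrix C^k followed by max pooling over time; the word, embedding size and filter width are illustrative:</p>
          <preformat>
import numpy as np

rng = np.random.default_rng(3)
word = "absurdity"
d, w = 15, 3                                   # character embedding size, filter width
char_emb = {c: rng.normal(size=d) for c in set(word)}

C_k = np.stack([char_emb[c] for c in word], axis=1)   # d x l character matrix
H = rng.normal(size=(d, w))                           # one filter of width w
b = 0.0

# feature map: f_k[i] = tanh(&lt;C_k[:, i:i+w], H&gt; + b), then max pooling over i
f_k = np.array([np.tanh(np.sum(C_k[:, i:i + w] * H) + b)
                for i in range(C_k.shape[1] - w + 1)])
y_k = f_k.max()                                       # the filter's response for this word
print(f_k.shape, round(float(y_k), 3))
</preformat>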
        <p>
          One highway layer computes z = t ⊙ g(W_H y + b_H) + (1 − t) ⊙ y.
The transform gate t maps the input into a different latent
space, and (1 − t) is the carry gate, deciding what
information is carried over. g(W_H y + b_H) is a
typical affine transformation with a non-linearity
applied, and ⊙ is the entry-wise or Hadamard
product. Stacking several layers of highway networks
allows carrying parts of the input to the output while
combining them in a recurrent fashion. At last, the output
z is fed into an LSTM with a softmax to obtain
distributions over the next word. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] manages to reduce
the parameter size by 60% while achieving state-of-the-art
language modelling results. Furthermore, they find that
their models learn semantic and orthographic relations
from characters, questioning whether word-level embeddings are
even necessary. They also successfully deal with OOV
words, assigning them to the correct in-vocabulary words
that word-level models failed to learn.
        </p>
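          <p>The highway transformation described above, sketched as a single layer in numpy; the weights and the negative transform-gate bias are random stand-ins, not trained parameters:</p>
          <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_T, b_T, W_H, b_H):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with transform gate t and carry gate (1 - t)."""
    t = sigmoid(W_T @ y + b_T)        # transform gate
    g = np.tanh(W_H @ y + b_H)        # non-linear transformation of the input
    return t * g + (1.0 - t) * y      # carry part of the input straight through

rng = np.random.default_rng(4)
d = 8
y = rng.normal(size=d)
z = highway(y, rng.normal(size=(d, d)), np.full(d, -2.0),
            rng.normal(size=(d, d)), np.zeros(d))
print(z.shape)
</preformat>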
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Character n-grams</title>
        <p>
          While character-level models work on par with
word-level models, recent works focus on character n-grams.
Charagram by [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is an approach to learn character-level
compositions, not the statistics of single characters.
A textual word or sentence is treated as a sequence of
characters x = ⟨c_1, c_2, …, c_m⟩ with subsequences
x_{i:j} = ⟨c_i, c_{i+1}, …, c_j⟩. Charagram produces a
character n-gram count vector, where each character n-gram g
has its own vector W_g if g is part of the model's set of
n-grams; an indicator function contributes 1 if the n-gram is in
that set and 0 otherwise. The representation h is a single
non-linearity applied over the sum of all n-gram character
vectors of x, where m is the maximum length of any character
n-gram in the model; W can be initialized by different choices
as a model parameter. They achieve state-of-the-art results,
beating LSTM and CNN based models, using Spearman's ρ
correlation as the scoring function. The size of the n-grams
matters, and they suggest n &gt; 2, or larger n for languages
like German with many noun compounds.
        </p>
        <p>
          Word2Vec takes two vectors w_t and c_i in ℝ^d,
where d is the dimensionality, w_t is the target word vector
and c_i are the corresponding context vectors, with the score
s(w_t, c_i) = w_t ⋅ c_i. We would like to represent a word as a
character representation through n-grams, e.g., the word
"where" becomes ⟨wh, whe, her, ere, re⟩ for n = 3. The above
Word2Vec objective can then be rewritten to represent each word
as a bag of character n-grams:
s(w, c) = ∑_{g ∈ G_w} z_g ⋅ v_c,
where z_g is the vector representation of a single n-gram
from a global set G with all character n-grams and
G_w ⊂ {1, …, |G|} indexes the n-grams of w. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] successfully improve on the analogy
task over previous models and deal with OOV words
even where the morphemes do not match up.
        </p>
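        <p>A small sketch of extracting boundary-marked character n-grams for a word and scoring a (word, context) pair as the sum of its n-gram vectors, in the spirit of [<xref ref-type="bibr" rid="ref12">12</xref>]; the hashing of n-grams into a table and all sizes are simplifying assumptions:</p>
        <preformat>
import numpy as np

def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word with boundary symbols, e.g. 'where' yields ⟨wh, whe, her, ..."""
    marked = "⟨" + word + "⟩"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

rng = np.random.default_rng(5)
d, buckets = 50, 2**12
Z = rng.normal(scale=0.1, size=(buckets, d))    # n-gram vectors z_g
V_ctx = rng.normal(scale=0.1, size=(1000, d))   # context vectors v_c

def score(word, context_id):
    """s(w, c) = sum over the word's n-grams g of z_g · v_c (the subword skip-gram score)."""
    idx = [hash(g) % buckets for g in char_ngrams(word)]
    word_vec = Z[idx].sum(axis=0)
    return float(word_vec @ V_ctx[context_id])

print(char_ngrams("where")[:5])
print(score("where", 42))
</preformat>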
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Context Selection</title>
      <p>Context selection is about choosing a suitable function
over a domain that maps a given center item and its context
to a latent space in ℝ^d, where d is the dimension of the
latent column space. In text, context selection is narrowly
understood as the surrounding words c_i of a target
word w_t within a window of size c, which is generally
known as the skip-gram objective. CBOW on the other
hand is the reverse operation: given a context c, what is
its center word. It turns out that context is a much larger
topic than just language modelling. We will first
review a couple of concepts applied to general problems
of count and real valued data, using exponential family
distributions proposed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].</p>
      <sec id="sec-3-4">
        <title>4.1 Generalization of Context selection</title>
        <p>
          Context embeddings are not only useful to textual data,
but to sequential data of different shapes and forms as
well. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] presents a general procedure modelling on count
and real valued data, using an expectation-maximization
(EM) algorithm to approximate exponential family
embeddings. The exponential family distributions are
distributions
with a special form
given the natural
parameters and sufficient statistics giving rise to the
possibility of fitting
        </p>
        <p>
          different kinds of probability
distributions to the same problem set. The most famous
distributions are Gaussian, Poisson or categorical. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
propose two example models for Gaussian (real valued)
and Poisson (count based) distributions. The general
form of exponential families is as follows:
x_i | c_i ∼ ExpFam(η_i(c_i), t(x_i)),
where x_i is any data point for which we would like to learn the
distribution and c_i is the context of each data point x_i. η_i(c_i)
lies in the natural parameter space, which is always convex, e.g.,
within the bounds of the applicable finite integral of the
function, and t(x_i) is the sufficient statistic, a function that
fully summarizes the data x_i such that there exists no other
statistic that provides additional information. The natural
parameter has the general form
η_i(c_i) = f_i(ρ[i] ⋅ ∑_{j ∈ c_i} α[j] x_j),
where ρ[i] are the embedding parameters for a respective
target, α[j] are the context parameters, a probability
distribution over context elements, and f_i is the link
function that must be defined for each individual
problem, connecting context with a data point. The
objective cost function is the sum of log conditional
probabilities of each data point, which is then optimized
using stochastic gradient descent. If the probability
distribution is categorical, the objective is almost
equivalent to Word2Vec with CBOW.
        </p>
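        <p>A toy sketch of the exponential family embedding idea for Gaussian (real valued) data: the natural parameter of item i combines its context items through the embedding and context parameters. All sizes, the identity link and the random values are assumptions for illustration, not the authors' model code:</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(6)
n_items, k = 20, 8
rho = rng.normal(size=(n_items, k))     # embedding parameters ρ[i]
alpha = rng.normal(size=(n_items, k))   # context parameters α[j]
x = rng.normal(size=n_items)            # observed real-valued data points

def natural_parameter(i, context):
    """η_i(c_i) = ρ[i] · Σ_{j in c_i} α[j] x_j (identity link assumed for the Gaussian case)."""
    return float(rho[i] @ (alpha[context] * x[context, None]).sum(axis=0))

def gaussian_log_likelihood(i, context, sigma=1.0):
    """log p(x_i | c_i) with the mean given by the natural parameter."""
    mu = natural_parameter(i, context)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x[i] - mu) ** 2 / (2 * sigma**2)

context_of_3 = np.array([1, 2, 4, 5])
print(gaussian_log_likelihood(3, context_of_3))
</preformat>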
        <p>
          Given this framework one can construct all kinds of
contexts and link functions to solve embeddings for a
specific domain. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] propose an advancement on the
ef-emb by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], by considering only a subset of elements in
the context, instead of using all of them, naming their
model context selection for exponential family
embeddings (CS-EFE). Additionally, CS-EFE depends
on three parameters, the embeddings for a target, the
context of the target and a hidden binary vector that
indicates what the target depends on. The authors
leverage amortized variational inference (VI). We will
try to describe the work in three steps: Why VI? Why
black-box VI? Why amortized VI?
ef-emb could be easily optimized with gradient descent
given the cost function. What has changed is that
CS-EFE deploys an additional set of coefficients b that
indicate whether an element of a context is relevant for the target
word or not. To do this we need to marginalize out this
binary vector b. Therefore, we use VI, positing a variational
distribution over the exponential family model to approximate
the best solution possible.
point, this objective is still intractable and we need to find
ways to approximate this even further by the variational
lower bound or ELBO and share parameters across the
contexts, which VI alone is not able to do. This reduces
the runtime and storage complexities considerably and
introduces a lower bound that guarantees errors
lower than the bound, but not errors close to it. The first problem is
that the original VI has no parameter sharing of the
context  , which in this case is absolutely needed.
Context is shared, that is why an amortization network
for parameter sharing is needed, e.g., amortized VI. In
Figure 3 we see the amortization network, where the
target is, in language modelling, the word w; a score is
computed over the context vector and the target
embedding; prior probability parameters are attached to
the binary indicators; and Gaussian kernels receive the score of
each target word except the k-th. The second problem is that we cannot fit the
variational distribution q(b; ν) to each target
individually and hence use black-box VI, approximating the
expectation by Monte Carlo sampling, obtaining noisy
gradients of the ELBO. To simplify: Select the correct
context from a window using a binary vector as indicator,
which cannot be computed, using VI. VI cannot share
parameters, which there are plenty of and cannot, even
with sharing, estimate the correct gradients given the
KLD. Using an indicator vector to select appropriate
elements from the context results in variable length
context vectors, for which we need a fixed size
representation. Instead of this we use Gaussian real valued
kernels to estimate mean and variance for each binary
vector and assign it. We use Monte Carlo sampling,
because we would otherwise need to compute every possible
setting between the binary vector and the context vector;
the sampling yields “tainted” or “noisy” gradients of the
evidence lower bound.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2 Context selection</title>
        <p>
          Context selection in language models is at this point a
well studied task. Word2Vec by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] uses a context
window of surrounding words. While this sounds
intuitive, there are a lot of suggestions on improving this.
Originally, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] suggested to use sub-sampling to remove
frequently co-occurring words and use context
distribution smoothing reducing bias towards rare words. This is
very much in conjunction with count based methods that
clip off the top/bottom percent of a vocabulary. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] have
found that using dependency based word embeddings
has an impact on the quality and quantity of functional
similarity tasks. However, it is to be noted
that on topical similarity tasks the suggested model
performs worse. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] note that mostly a linear context,
e.g., windows, is used. Given a corpus and a target word
w, with a corresponding sentence (e.g., context) and
modifiers of that sentence m_1, …, m_k with head h, a
dependency tree is created, see Figure 4, with the
Stanford Dependency parser.
The contexts are (m_1, lbl_1), …, (m_k, lbl_k), (h, lbl_h^{-1}), where lbl is the
dependency relation between head and modifier (e.g.,
nsubj, dobj, prep_with, amod). While lbl is the forward
or outgoing relation from the head – the target
word – lbl^{-1} is the in-going or inverse relation.
Given a Word2Vec model with a small window size of
k = 2 and a larger window size k = 5, the dependency-based
model learns different word relations and
minimizes two effects. We can see in Figure 4 that coincidental
filtering takes place, because “Australian” is obviously
not part of “science” in general, which Word2Vec would
take as a context in either model. Secondly, if the
window size is small, out-of-reach words like “discover”
and “telescope” would have been filtered out. Longer
more complex sentences could have several head words,
where the context is out-of-reach in larger Word2Vec
models as well. In comparison with Word2Vec, the
dependency-based model has a higher precision and recall
on functional similarity tests.
        </p>
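        <p>A minimal sketch of turning an already-parsed sentence into dependency-based (word, context) pairs, including the inverse relations, in the spirit of [<xref ref-type="bibr" rid="ref25">25</xref>]; the example parse is hand-written (not produced by a parser) and preposition collapsing is omitted:</p>
        <preformat>
# Each token: (index, word, head_index, relation); head_index -1 marks the root.
parse = [
    (0, "australian", 1, "amod"),
    (1, "scientist", 2, "nsubj"),
    (2, "discovers", -1, "root"),
    (3, "star", 2, "dobj"),
    (4, "with", 2, "prep"),
    (5, "telescope", 4, "pobj"),
]

def dependency_contexts(parse):
    """Contexts (modifier, relation) for each head, plus inverse relations for each modifier."""
    contexts = {word: [] for _, word, _, _ in parse}
    for idx, word, head, rel in parse:
        if head == -1:
            continue
        head_word = parse[head][1]
        contexts[head_word].append((word, rel))            # outgoing relation from the head
        contexts[word].append((head_word, rel + "-1"))     # inverse relation for the modifier
    return contexts

for target, ctx in dependency_contexts(parse).items():
    print(target, ctx)
</preformat>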
        <p>
          Another strategy is to incorporate additional information from
external data sources, augmenting word vectors. [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] improve on the Word2Vec model by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] using dictionaries. Dictionaries
are records with a word mapping to a definition, e.g.:
        </p>
        <p>Guitar - a stringed musical instrument, with a fretted
fingerboard, typically incurved sides, and six or twelve
strings, played by plucking or strumming with the fingers
or a plectrum.</p>
        <p>
          The key concept presented in [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] is that each word can
be weakly and strongly linked to each other given the
definition. For instance, the Guitar and Violin share the
words stringed musical instrument, that should strongly
tie them together. In the definition of the Violin there is
no plucking or strumming and thus is considered a weak
pair. Moreover, weak pairs are promoted to strong pairs
when they are within the k closest neighbouring words
calculated with a cosine distance. The skip-gram
objective with negative sampling can be rephrased given the
definitions to positively and negatively couple words. The
positive sampling cost function is
J_pos(w_t) = β_S ⋅ ∑_{w_i ∈ S(w_t)} ℓ(v_t ⋅ v_i) + β_W ⋅ ∑_{w_j ∈ W(w_t)} ℓ(v_t ⋅ v_j),
where ℓ is the logistic loss function, w_t is each target word of
the corpus with its corresponding vector v_t, S(w_t) are the
strong pairs, W(w_t) are the weak pairs and v_i/v_j are the
corresponding strong and weak pair vectors. The
hyperparameters β_S and β_W are chosen to best fit the learning of
strong and weak pairs. Set to zero, the model behaves
exactly like Word2Vec. The corresponding negative sampling
cost function J_neg(w_t) draws words w_n randomly from
the vocabulary such that w_n ≠ w_t and w_n is neither part of
the strong pairs, w_n ∉ S(w_t), nor the weak pairs, w_n ∉ W(w_t).
This results in the cost function for a target
J(w_t, w_c) = ℓ(v_t ⋅ v_c) + J_pos(w_t) + J_neg(w_t).
        </p>
        <p>The results show an improvement over state-of-the-art
models on word similarity and text classification. They
parsed and trained on a large corpus from Wikipedia,
comparing a pre-trained Word2Vec model
augmented with dictionaries, a retrofitted model using
WordNet and a single model on a raw corpus. Dict2Vec
showed superior results on the raw corpus and improved
the other models by up to 13%.</p>
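        <p>A toy sketch of building strong and weak pairs from dictionary definitions under one simplified reading: headwords that mention each other mutually form strong pairs, one-directional mentions form weak pairs. The tiny definitions are illustrative, and the promotion of weak pairs via cosine distance is omitted:</p>
        <preformat>
def dictionary_pairs(definitions):
    """Strong pairs: each headword appears in the other's definition; weak: only one direction."""
    strong, weak = set(), set()
    words = list(definitions)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            a_in_b = a in definitions[b]
            b_in_a = b in definitions[a]
            if a_in_b and b_in_a:
                strong.add((a, b))
            elif a_in_b or b_in_a:
                weak.add((a, b))
    return strong, weak

definitions = {  # toy definitions, tokenized and lower-cased
    "guitar": "stringed musical instrument played by plucking or strumming".split(),
    "violin": "stringed musical instrument played with a bow related to the guitar".split(),
    "bow": "a rod used to play the violin".split(),
}
strong, weak = dictionary_pairs(definitions)
print("strong:", strong)
print("weak:", weak)
</preformat>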
      </sec>
      <sec id="sec-3-6">
        <title>4.4 Comparison</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] have found that different downstream and language
modelling tasks need different types of context applied.
They compare window-based, substitution-based,
dependency-based, concatenation and SVD on sub-sampled
context for word embeddings. In Figure 5 we can see
three kinds of datasets: WordSim-353-R for topical
coherence, WordSim-353-S for
functional similarity and TOEFL for evenly balanced
parts of topical coherence and functional similarity. First
to note: substitution-based word embeddings performed
worse overall in all domains. The idea is to substitute
words in sentences, e.g., “I love my job” becomes [I, ?, my, job],
and substituting for “love” yields a probability distribution over
candidate replacement words learned by a language model. What we
can immediately see is that typical word embeddings like
Word2Vec with windows 1, 5 and 10 outperform the
other models on topical coherence (WordSim-353-R) and
are on par with dependency-based models on SimLex-999
and TOEFL. Further, dependency-based models perform
much better on functional similarity tasks like
WordSim-353-S. Their results also suggest that concatenating
different word embeddings yields the highest results on
downstream language tasks such as parsing, NER or
sentiment. Unfortunately, Dict2Vec is not in the list of
compared models as it is new and still being evaluated.
        </p>
        <p>
          In this paper we explored a wide variety of concepts dealing
with word-level and subword-level embeddings as well as
context selection procedures. All of the suggested methods
have assets and drawbacks. However, strategies using
pretrained character n-grams on large datasets with negative
sampling/hierarchical softmax on the skip-gram
and
CBOW objective performs best. That is, they bring all the
features of pre-trained word embeddings, while dealing
with OOV words and faster training. It would be interesting
to see if character-level embeddings could be enhanced with
procedures that leverage
external sources and incorporate global statistics as well.
Word2Vec is the basic work-unit behind all current text
representation learning tasks. Besides what is covered here,
there are multiple research directions open. E.g., statistical
models that treat words as a distribution, see [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
They treat words as probability mass functions (pmfs) and
can express uncertainty in different dimensions as well as
deal with all kinds of WSD problems and entailment. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
goes even further by representing words as hierarchical
pmfs. Instead of changing how the representation is created,
they alter the representation to fit certain conditions and
features. Other open issues are domain adaptation and transfer
learning techniques. In the future they will help in dealing
with the asymmetry of data. Given a dataset of a domain
that is well known, generalize it to a target domain with
fewer samples. This will be particularly helpful in smaller
domains and help transpose different ideas beyond the
current context. At last, there is a desperate need for further
theoretical understanding. It is hard to compare every
model and even harder when the evaluation is largely
intrinsic and effects can only be indirectly tested in
downstream language tasks. Here we will also work on
further improvements.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>CoRR, abs/1310.4546</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Exponential family embeddings</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Athey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Context selection for embedding models</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ducharme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Janvin</surname>
          </string-name>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>March 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolukbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Saligrama</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalai</surname>
          </string-name>
          .
          <article-title>Man is to computer programmer as woman is to homemaker? debiasing word embeddings</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          .
          <article-title>Notes on noise contrastive estimation and negative sampling</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Rong</surname>
          </string-name>
          .
          <article-title>word2vec parameter learning explained</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wieting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Livescu</surname>
          </string-name>
          . Charagram:
          <article-title>Embedding words and sentences via character n-grams</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          .
          <article-title>Learned in translation: Contextualized word vectors</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Framewise phoneme classification with bidirectional lstm and other neural network architectures</article-title>
          .
          <source>Neural Networks</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <article-title>Neural word embedding as implicit matrix factorization</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>27</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gittens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Achlioptas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.W.</given-names>
            <surname>Mahoney</surname>
          </string-name>
          .
          <article-title>Skip-gram - zipf + uniform = vector additivity</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL
          <year>2017</year>
          , Canada
          , Volume
          <volume>1</volume>
          , pages
          <fpage>69</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          and
          <string-name>
            <surname>L. Thompson.</surname>
          </string-name>
          <article-title>The strange geometry of skip-gram with negative sampling</article-title>
          .
          <source>In 2017 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>September 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sontag</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.M.</given-names>
            <surname>Rush</surname>
          </string-name>
          .
          <article-title>Character-aware neural language models</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbelot</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          .
          <article-title>High-risk learning: acquiring new word vectors from tiny data</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>A comparative study of word embeddings for reading comprehension</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pinter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guthrie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          .
          <article-title>Mimicking word embeddings using subword rnns</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Józefowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Exploring the limits of language modeling</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.K.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Greff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Highway networks</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <article-title>Dependency-based word embeddings</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</source>
          , Baltimore, USA, Volume
          <volume>2</volume>
          , pages
          <fpage>302</fpage>
          -
          <lpage>308</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tissier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Habrard</surname>
          </string-name>
          .
          <article-title>Dict2vec: Learning word embeddings using lexical dictionaries</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , Copenhagen, Denmark, September 9-11,
          <year>2017</year>
          , pages
          <fpage>254</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>O.</given-names>
            <surname>Melamud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McClosky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patwardhan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          .
          <article-title>The role of context types and dimensionality in learning word embeddings</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vilnis</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Word representations via gaussian embedding</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Athiwaratkun</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.G.</given-names>
            <surname>Wilson</surname>
          </string-name>
          .
          <article-title>Multimodal word distributions</article-title>
          .
          <source>In Conference of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          .
          <article-title>Poincaré embeddings for learning hierarchical representations</article-title>
          .
          <source>CoRR.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>