<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distributional Representation of Words for Small Corpora using Pre-training Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author"><string-name>Pierpaolo Basile</string-name></contrib>
        <contrib contrib-type="author"><string-name>Lucia Siciliani</string-name></contrib>
        <contrib contrib-type="author"><string-name>Gaetano Rossiello</string-name></contrib>
        <contrib contrib-type="author"><string-name>Pasquale Lops</string-name></contrib>
        <aff id="aff0">
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via E. Orabona 4, 70125 - Bari</addr-line>
          ,
          <country country="IT">ITALY</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>Distributional Semantic Models and word embedding approaches have proved their effectiveness in representing words as mathematical points in a geometric space. Relying on this representation allows computing the relatedness between words according to their distance in the space. This ability is useful for several natural language processing tasks. However, when we have a collection containing only a few documents, it is not possible to build an accurate representation of words, because we do not have enough information about the co-occurrences of terms. In this paper, we deal with this issue by proposing an approach which relies on Random Indexing and a pre-trained model built on a large and balanced corpus. We perform an evaluation by investigating a real-world application scenario in which this approach has been adopted.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Document collection models;
Information extraction; Digital libraries and archives; Dictionaries; •
Computing methodologies → Lexical semantics; Information
extraction; • Applied computing → Document analysis.</p>
    </sec>
    <sec id="sec-2">
      <title>1 BACKGROUND AND MOTIVATION</title>
      <p>
        Distributional Semantic Models (DSMs) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and more recent word embedding approaches [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] have proved that the distributed representation of words is effective in several natural language processing (NLP) tasks. In particular, by representing words as mathematical points in a geometric space, it is possible to compute word relatedness as the distance in that space: two words are similar if they are close in the geometric space. Both DSMs and word embedding approaches have their roots in the distributional hypothesis [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]: two words are similar if they share similar linguistic contexts. Generally, these approaches exploit word co-occurrences as linguistic contexts. Since word co-occurrences strongly depend on the statistical distribution of words in the corpus, these approaches can be affected by the dimension of the corpus. The domain of the corpus can also affect the semantics captured by the DSM or the embeddings: if the target corpus is very specific and focused on a single domain (e.g. sport or politics), the model captures only semantic aspects belonging to that domain.
      </p>
      <p>
        Moreover, there are other aspects that can affect these approaches, such as the initialisation of the embeddings. Since embeddings are randomly initialised, different results could be obtained by applying the same approach several times to the same corpus. Some DSMs can be affected by the method used to count (weight) the co-occurrences, or by the parameters used to reduce the co-occurrence matrix (e.g. the number of dimensions in the LSA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] approach). More details about pitfalls in both DSMs and word embeddings are discussed in [
        <xref ref-type="bibr" rid="ref1 ref12 ref9">1, 9, 12</xref>
        ].
      </p>
      <p>In this paper, we focus our attention on the corpus dimension issue. In some contexts, we have corpora with few documents, and even in those cases we aim at obtaining a distributed representation of words able to effectively capture their semantics. In such a case, it might be useful to pre-train a distributed representation of words on a large balanced corpus and then exploit that representation as a starting point for building word vectors on the corpus containing few documents.</p>
      <p>
        Recently, contextual word embeddings, such as ELMo [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], ULMFiT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], have been shown to be effective transfer learning techniques for NLP. The main idea is to leverage an unsupervised neural language model trained on a large corpus as a pre-training stage. Then, the resulting pre-trained word embeddings are used to train deep neural networks for supervised NLP tasks [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Even though these neural language models can mitigate the problem of out-of-vocabulary words, i.e. words not seen during the language modelling stage, they require an enormous amount of data and high computational capabilities. For these reasons, dealing with new words in small collections of documents for specific domains still remains an open challenge.
      </p>
      <p>
        In an attempt to address this limitation, we analyse a specific DSM approach called Random Indexing (RI) [
        <xref ref-type="bibr" rid="ref15 ref18">15, 18</xref>
        ], an incremental method that makes it simple to add new documents to an already existing model. The incremental property of RI has already been exploited for discovering implicit connections between terms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and for analysing the evolution of language over time [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We chose RI because other approaches based on word embeddings are not inherently incremental. It is possible to initialise embeddings with embeddings built on another corpus, but it is not simple to tackle the issue of out-of-vocabulary words, i.e. words that occur in the domain corpus but do not occur in the embeddings used for the initialisation.
      </p>
      <p>The general idea behind our approach is to build a model M<sub>pre</sub> on a large balanced corpus and then, given a new small collection of documents C<sub>s</sub>, build a new model M<sub>s</sub> relying on the word vectors in M<sub>pre</sub>. The goal is to deal with the issue of the small dimension of the corpus C<sub>s</sub> by relying on the information captured during the definition of the model M<sub>pre</sub>.</p>
      <p>Our research question is whether, in the case of a small collection of documents, the approach based on pre-training is able to provide better performance with respect to a word representation built without pre-training. We provide an evaluation by exploiting a real-world application scenario in which, given a collection of documents and a set of seed words, we want to discover related concepts by exploiting the relatedness computed in the semantic space.</p>
      <p>The paper is structured as follows: Section 2 describes the proposed methodology for pre-training word vectors using RI, while Section 3 provides details about the evaluation and reports the results. Finally, Section 4 closes the paper by providing final remarks and future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2 METHODOLOGY</title>
      <p>
        Our approach is based on RI. The mathematical insight behind RI is the projection of a high-dimensional space onto a lower-dimensional one using a random matrix; this kind of projection does not compromise distance metrics (only L2 norm-based distances are preserved) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Formally, given an n × m matrix A and an m × k matrix R, which contains random vectors, we define a new n × k matrix B as follows:</p>
      <p>A<sub>n,m</sub> · R<sub>m,k</sub> = B<sub>n,k</sub>, with k ≪ m (1)</p>
      <p>The new matrix B has the property of preserving the distances between points.</p>
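      <p>The distance-preservation property of Equation (1) can be checked empirically. The following is a minimal sketch (the dimensions and the sparse ternary projection scheme are our illustrative assumptions, not the paper's exact setup): it projects random points from 10,000 down to 300 dimensions and compares a pairwise Euclidean distance before and after the projection.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

n, m, k = 100, 10_000, 300      # points, original dimension, reduced dimension
A = rng.normal(size=(n, m))     # n points in the high-dimensional space

# Sparse ternary random matrix R with entries in {-1, 0, +1}; the sqrt(3/k)
# scaling makes the projection preserve Euclidean distances in expectation.
R = rng.choice([-1.0, 0.0, 1.0], size=(m, k), p=[1 / 6, 2 / 3, 1 / 6])
B = (A @ R) * np.sqrt(3.0 / k)  # A(n,m) . R(m,k) = B(n,k), k much smaller than m

d_orig = float(np.linalg.norm(A[0] - A[1]))
d_proj = float(np.linalg.norm(B[0] - B[1]))
ratio = d_proj / d_orig         # close to 1.0 for sufficiently large k
```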
      <p>Specifically, RI creates the DSM in two steps:
(1) A random vector is assigned to each word. This vector is sparse, high-dimensional and ternary, which means that its elements can take values in {-1, 0, 1}. A random vector contains a small number of randomly distributed non-zero elements, and the structure of this vector follows the hypothesis behind the concept of Random Projection;
(2) Random vectors are accumulated by analysing co-occurring words. In particular, the semantic vector for any word is computed as the sum of the random vectors of the words that co-occur with the analysed word. When computing the sum, we apply a weight to each random vector. In our case, to reduce the impact of very frequent terms, we use the following weight: h<sub>i</sub> = √(th × C / #t<sub>i</sub>), where C is the total number of occurrences in the corpus and #t<sub>i</sub> is the number of occurrences of the term t<sub>i</sub>. The idea behind this weighting scheme is to penalise the most frequent words. The parameter th is generally set to 0.001.</p>
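      <p>The two ingredients above can be sketched in a few lines of code. This is an illustration only: the vector dimension, the number of non-zero elements, and the reading of the weight as h<sub>i</sub> = √(th × C / #t<sub>i</sub>) are our assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def random_vector(dim=300, nonzero=10):
    """Sparse ternary random vector: `nonzero` randomly placed entries in {-1, +1}."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def weight(count_ti, total_c, th=0.001):
    """Down-weight frequent terms: h_i = sqrt(th * C / #t_i)."""
    return float(np.sqrt(th * total_c / count_ti))

rv = random_vector()
h_frequent = weight(count_ti=100_000, total_c=1_000_000)  # very frequent term
h_rare = weight(count_ti=10, total_c=1_000_000)           # rare term
```

      <p>With these toy counts the frequent term receives a weight of 0.1 while the rare one receives 10, so frequent words contribute far less to the accumulated semantic vectors.</p>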
      <p>Formally, given a corpus D of n documents and a vocabulary V of m words extracted from D, we perform two steps: i) we assign a random vector r to each word w in V; ii) we compute a semantic vector sv<sub>i</sub> for each word w<sub>i</sub> as the sum of all the random vectors assigned to the words co-occurring with w<sub>i</sub>. The context is the set of c words that precede and follow w<sub>i</sub>; in our experiments we set c to 5. The second step is defined by the following equation:</p>
      <p>sv<sub>i</sub> = Σ<sub>d∈D</sub> Σ<sub>i−c&lt;j&lt;i+c, j≠i</sub> h<sub>j</sub> · r<sub>j</sub> (2)</p>
      <sec id="sec-3-1">
        <p>where h<sub>j</sub> is the weight applied to the context word, as previously explained. After these two steps, we obtain a semantic vector for each word in V; this set of vectors represents our DSM.</p>
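        <p>Equation (2) amounts to a single pass over the corpus with a sliding context window. A minimal sketch of the accumulation step, on a hypothetical toy corpus (window size, dimension and weighting follow the setup described above):</p>

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
DIM, NONZERO, C_WIN, TH = 300, 10, 5, 0.001

# Hypothetical toy corpus standing in for the document collection D.
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"]]

counts = Counter(w for d in docs for w in d)
total = sum(counts.values())        # C: total occurrences in the corpus

def ternary():
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=NONZERO, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

random_vec = {w: ternary() for w in counts}        # step i)
semantic_vec = {w: np.zeros(DIM) for w in counts}  # step ii), initially zero

for doc in docs:
    for i, w in enumerate(doc):
        # context: the c words that precede and follow position i
        for j in range(max(0, i - C_WIN), min(len(doc), i + C_WIN + 1)):
            if j == i:
                continue
            ctx = doc[j]
            h = np.sqrt(TH * total / counts[ctx])  # weight h_j
            semantic_vec[w] += h * random_vec[ctx]
```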
        <p>We apply the classical RI approach to the large balanced corpus as described above. We obtain two spaces: i) the set of random vectors assigned to each word w in V and ii) the set of semantic vectors SV built by accumulating random vectors. The set SV contains a semantic vector for each word in V.</p>
        <p>Given a small collection of documents S, we want to apply RI by relying on the vectors built on the large corpus. Since RI is an incremental approach, we can reuse the vectors built on the large collection. In particular:
(1) we extract the vocabulary V<sub>s</sub> from S. V<sub>s</sub> can contain words that already occur in V. For these words w<sub>j</sub> ∈ V ∩ V<sub>s</sub> we reuse both the random vector and the semantic vector assigned to w<sub>j</sub>. For words w<sub>k</sub> ∈ V<sub>s</sub> \ V we build new random vectors and initialise the semantic vector coordinates to zero;
(2) we perform the accumulation of random vectors by analysing word co-occurrences as described above for the classical RI approach.</p>
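        <p>The vocabulary-reuse step can be sketched as follows. The pre-trained vectors here are random stand-ins for illustration; a real run would take them from a model actually trained on the large balanced corpus.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, NONZERO = 300, 10

def ternary():
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=NONZERO, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

# Pre-trained model M_pre over vocabulary V (stand-in values for illustration).
random_pre = {"cat": ternary(), "dog": ternary()}
semantic_pre = {w: rng.normal(size=DIM) for w in random_pre}

# Vocabulary V_s of the small collection: "cat" is shared, "lynx" is new.
vocab_s = ["cat", "lynx"]

random_s, semantic_s = {}, {}
for w in vocab_s:
    if w in random_pre:                         # w in the intersection of V and V_s
        random_s[w] = random_pre[w]             # reuse the random vector
        semantic_s[w] = semantic_pre[w].copy()  # reuse the semantic vector
    else:                                       # w occurs only in V_s
        random_s[w] = ternary()                 # fresh random vector
        semantic_s[w] = np.zeros(DIM)           # semantic vector starts at zero
# ...then run the usual RI accumulation over the small collection S.
```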
        <p>The output of this process is composed of two new sets of random vectors and semantic vectors, as reported in Figure 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 EVALUATION</title>
      <p>
        The goal of the evaluation is to prove that, in the case of a small collection of documents, the approach based on RI with pre-training is able to provide better performance with respect to a word representation based on RI without pre-training. However, the datasets usually used to evaluate word similarity are based on common concepts or common entities. This kind of dataset is not suitable for evaluating specific domain collections, as in our case. For that reason, we design an in-vivo evaluation by integrating our approach into an already existing system called Semantic Framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The Semantic Framework provides a set of tools and services for analysing, indexing and searching a collection of documents for the Public Administration. Moreover, the framework also provides services for discovering related words and concepts starting from both a collection of documents and the descriptions of two concepts. In particular, given the descriptions of two concepts, each provided as a set of words, the tool is able to provide a ranked list of other words that are somehow related to both the initial concepts. This is the specific scenario we have chosen for the empirical evaluation.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Extraction of Related Words</title>
      <p>Given two concepts c<sub>1</sub> and c<sub>2</sub>, along with their descriptions given as sets of words d<sub>1</sub> and d<sub>2</sub>, and given a collection of documents, the goal is to extract a ranked list of words related to both c<sub>1</sub> and c<sub>2</sub>. The method relies on both the distributed representation of words and the similarity between words in the geometric space. In particular, given a collection of documents, we build a DSM where each word is represented as a vector. For each concept, e.g. c<sub>1</sub> and c<sub>2</sub>, we build a vector representation by computing the centroid of the vectors of the keywords occurring in the concept description (e.g. d<sub>1</sub> and d<sub>2</sub>).</p>
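      <p>The centroid computation can be sketched as follows (toy two-dimensional vectors for readability; a real model would use the RI semantic vectors, and the words here are hypothetical):</p>

```python
import numpy as np

def concept_vector(description_words, dsm):
    """Centroid of the vectors of the description keywords found in the DSM."""
    vecs = [dsm[w] for w in description_words if w in dsm]
    return np.mean(vecs, axis=0)

# Hypothetical DSM; description words missing from the space are simply skipped.
dsm = {"tax": np.array([1.0, 0.0]), "income": np.array([0.0, 1.0])}
c1 = concept_vector(["tax", "income", "unseen"], dsm)
```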
      <p>The second step is to compute the list of related words: given the two vectors describing the concepts c<sub>1</sub> and c<sub>2</sub>, we retrieve the neighbourhood of each concept by using cosine similarity. In the last step, we normalise each list using z-norm normalisation and create the final list by averaging the scores of the words occurring in both lists (intersection). The final list is ranked and the top-N words are returned to the user. The whole process is sketched in Figure 2.</p>
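      <p>The steps above (cosine neighbourhoods, z-norm normalisation, averaging over the intersection) can be sketched as follows, with a hypothetical toy space standing in for the DSM:</p>

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def neighbours(concept_vec, dsm):
    """Score every word in the space against a concept vector."""
    return {w: cosine(v, concept_vec) for w, v in dsm.items()}

def znorm(scores):
    """z-norm normalisation of a score list."""
    vals = np.array(list(scores.values()))
    mu, sigma = vals.mean(), vals.std()
    return {w: (s - mu) / sigma for w, s in scores.items()}

def related_words(c1_vec, c2_vec, dsm, top_n=10):
    s1 = znorm(neighbours(c1_vec, dsm))
    s2 = znorm(neighbours(c2_vec, dsm))
    common = set(s1) & set(s2)                    # words occurring in both lists
    merged = {w: (s1[w] + s2[w]) / 2 for w in common}
    return sorted(merged, key=merged.get, reverse=True)[:top_n]

# Hypothetical toy space and concept vectors.
dsm = {"budget": np.array([1.0, 0.2]),
       "sport": np.array([0.1, 1.0]),
       "funding": np.array([0.9, 0.4])}
ranked = related_words(np.array([1.0, 0.0]), np.array([0.8, 0.6]), dsm, top_n=2)
```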
      <p>
        The GUI provided by the Semantic Framework for building the list of related words is shown in Figure 3. The tool allows: i) building a matrix with a specific number of rows and columns; ii) defining each concept on the rows and columns by associating a description, which is then adopted to build the corresponding vector representation. It is worth noticing that the cells in the matrix report multi-word expressions, instead of single words, since the collection of documents has been indexed by exploiting the Semantic Framework services, which are able to automatically extract phrases from documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.2 Evaluation Setup</title>
      <p>We evaluate our method on several collections of documents in both English and Italian. Four English collections are taken from the TALIA European project and two Italian collections are provided by the Apulia Region. In particular, the English collections contain deliverables of European projects related to the Interreg-Mediterranean program, while the two Italian collections contain project proposals of two research programs funded by the Apulia Region. Table 1 shows the statistics about the collections.</p>
      <sec id="sec-6-1">
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Statistics about the collections used during the evaluation.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Collection</th><th>Language</th><th>#documents</th><th>#occurrences</th></tr>
            </thead>
            <tbody>
              <tr><td>T1</td><td>EN</td><td>200</td><td>1,460,713</td></tr>
              <tr><td>T2</td><td>EN</td><td>221</td><td>1,802,770</td></tr>
              <tr><td>T3</td><td>EN</td><td>158</td><td>1,360,393</td></tr>
              <tr><td>T4</td><td>EN</td><td>579</td><td>4,623,876</td></tr>
              <tr><td>A1</td><td>IT</td><td>55</td><td>623,575</td></tr>
              <tr><td>A2</td><td>IT</td><td>87</td><td>876,654</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As the corpus for the pre-training step, we adopt the British National Corpus (BNC, http://www.natcorp.ox.ac.uk/) for the English language and the Paisà corpus (https://www.corpusitaliano.it/) for the Italian one.</p>
        <p>BNC is a 100 million word collection of samples of written and
spoken language from several sources, designed to represent a wide
cross-section of British English both spoken and written.</p>
        <p>The Paisà corpus is a large collection of Italian web texts in which documents were selected in two different ways. A part of the corpus was constructed by querying the Web with 50,000 word pairs combining terms from an Italian basic vocabulary list. The remaining documents come from the Italian versions of various Wikimedia Foundation projects.</p>
      </sec>
      <sec id="sec-6-2">
        <p>These projects include Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity and Wikivoyage. The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words.</p>
        <p>We pre-train vectors by using Random Indexing with a vector dimension equal to 300 and 10 non-zero elements in each random vector. We limit the vocabulary dimension to 100,000 by taking into account the most frequent words.</p>
        <p>The code for building Random Indexing with pre-training is freely available on GitHub (we will release the URL in case of acceptance).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.3 Results</title>
      <p>One expert from the TALIA project and one expert from the Apulia Region provided a set of concept pairs for which the list of related words was extracted as described in Section 3.1. In particular, the experts provided 24 pairs for the English collections and 12 pairs for the Italian ones. Each list has been evaluated by the two experts. In particular, two lists are provided to each expert, one built by using pre-trained RI and another one built using only RI. The expert does not know which method was used to build each list. Given the pair of concepts (with their descriptions) and the two lists of related words, the expert must judge which list provides more significant words. Finally, we compute the percentage of times that both experts prefer the list built through pre-trained RI.</p>
      <p>Analysing the experts' judgements, we observe that they agree on the higher significance of the list created with pre-trained RI 84% of the time for English, while for Italian the agreement is 75%. This first in-vivo evaluation provides encouraging results and suggests that pre-training is fundamental when the collection contains few documents. We plan to design an in-vitro evaluation by developing a specific dataset.</p>
    </sec>
    <sec id="sec-8">
      <title>4 CONCLUSIONS</title>
      <p>In this paper, we propose a pre-training strategy for building a distributional semantic model when a small collection of documents is involved. In particular, we extend an existing DSM approach called Random Indexing by introducing a pre-training step that relies on a large and balanced corpus. We have integrated our method into a tool for the semantic analysis of documents and we have designed an in-vivo evaluation that involves two languages (English and Italian) and six collections of documents.</p>
      <p>The results show that the approach based on pre-training provides better results. This suggests that, in the case of a small collection of documents, the additional information provided by a large corpus might help to improve the quality of the distributional model.</p>
      <p>As future work, we plan to develop a pre-training strategy for
approaches based on word embeddings and design an in-vitro
evaluation by building a specific dataset.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is partially funded by the “TALIA - Territorial
Appropriation of Leading-edge Innovation Action” project,
Interreg-Mediterranean program, priority axis 1: Promoting Mediterranean
innovation capacities to develop smart and sustainable growth, Programme
specific objective 1.1 to increase transnational activity of innovative
clusters and networks of key sectors of the MED area (2018-2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Maria</given-names>
            <surname>Antoniak</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Mimno</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Evaluating the stability of embeddingbased word similarities</article-title>
          .
          <source>Transactions of the Association of Computational Linguistics</source>
          <volume>6</volume>
          (
          <year>2018</year>
          ),
          <fpage>107</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Pierpaolo</given-names>
            <surname>Basile</surname>
          </string-name>
          , Annalina Caputo, Marco Di Ciano, Gaetano Grasso, Gaetano Rossiello, and
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Semeraro</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>SEPIR: a semantic and personalised information retrieval tool for the public administration based on distributional semantics</article-title>
          .
          <source>International Journal of Electronic Governance</source>
          <volume>9</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          (
          <year>2017</year>
          ),
          <fpage>132</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Pierpaolo</given-names>
            <surname>Basile</surname>
          </string-name>
          , Annalina Caputo, and
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Semeraro</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Analysing word meaning over time by exploiting temporal random indexing</article-title>
          .
          <source>In First Italian Conference on Computational Linguistics CLiC-it.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Cohen</surname>
          </string-name>
          , Roger Schvaneveldt, and
          <string-name>
            <given-names>Dominic</given-names>
            <surname>Widdows</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>43</volume>
          ,
          <issue>2</issue>
          (
          <year>2010</year>
          ),
          <fpage>240</fpage>
          -
          <lpage>256</lpage>
          . https://doi.org/10.1016/j.jbi.2009.09.003
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sanjoy</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anupam</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>An elementary proof of the Johnson-Lindenstrauss lemma</article-title>
          .
          <source>Technical Report. Technical Report TR-99-006</source>
          , International Computer Science Institute, Berkeley, California, USA.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In NAACL-HLT (1)</source>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>John Rupert</given-names>
            <surname>Firth</surname>
          </string-name>
          .
          <year>1957</year>
          .
          <article-title>A synopsis of linguistic theory. Studies in linguistic analysis</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Zellig S.</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <year>1968</year>
          .
          <article-title>Mathematical Structures of Language</article-title>
          . New York: Interscience.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Hellrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Udo</given-names>
            <surname>Hahn</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Bad company - neighborhoods in neural embedding spaces considered harmful</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          .
          <fpage>2785</fpage>
          -
          <lpage>2796</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Universal Language Model Finetuning for Text Classification</article-title>
          .
          <source>In ACL (1)</source>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge</article-title>
          .
          <source>Psychological review 104</source>
          ,
          <issue>2</issue>
          (
          <year>1997</year>
          ),
          <fpage>211</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , Yoav Goldberg, and
          <string-name>
            <given-names>Ido</given-names>
            <surname>Dagan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Improving distributional similarity with lessons learned from word embeddings</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>3</volume>
          (
          <year>2015</year>
          ),
          <fpage>211</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR abs/1301</source>
          .3781 (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Matthew E.</given-names>
            <surname>Peters</surname>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep Contextualized Word Representations</article-title>
          .
          <source>In NAACL-HLT. Association for Computational Linguistics</source>
          ,
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>An Introduction to Random Indexing</article-title>
          .
          <source>In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE</source>
          , Vol.
          <volume>5</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . Stockholm: Stockholm University, Faculty of Humanities, Department of Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amanpreet</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding</article-title>
          .
          <source>In BlackboxNLP@EMNLP. Association for Computational Linguistics</source>
          ,
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Dominic</given-names>
            <surname>Widdows</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kathleen</given-names>
            <surname>Ferraro</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application</article-title>
          .
          <source>In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC</source>
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>