<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the evaluation of retrofitting for supervised short-text classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kaoutar GHAZI</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andon TCHECHMEDJIEV</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sébastien HARISPE</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas SUTTON-CHARANI</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gildas TAGNY NGOMPÉ</string-name>
        </contrib>
      </contrib-group>
      <aff>EuroMov Digital Health in Motion, Univ Montpellier, IMT Mines Alès, Alès, France</aff>
      <aff>Lavérune, France</aff>
      <abstract>
        <p>Current NLP systems heavily rely on embedding techniques that are used to automatically encode relevant information about linguistic entities of interest (e.g., words, sentences) into latent spaces. These embeddings are currently the cornerstone of the best machine learning systems used in a large variety of problems such as text classification. Interestingly, state-of-the-art embeddings are commonly computed from large corpora only, and generally do not use additional knowledge expressed in established knowledge resources (e.g. WordNet). In this paper, we empirically study whether retrofitting, a class of techniques used to update word vectors in a way that takes into account knowledge expressed in knowledge resources, is beneficial for short-text classification. To this aim, we compared the performance of several state-of-the-art classification techniques with and without retrofitting on a selection of benchmarks. Our results show that the retrofitting approach is beneficial for some classifier settings, and only for datasets that share a domain similar to that of the semantic lexicon used for the retrofitting.</p>
      </abstract>
      <kwd-group>
        <kwd>text classification</kwd>
        <kwd>word embeddings</kwd>
        <kwd>retrofitting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Embedding techniques are the cornerstone of numerous state-of-the-art NLP systems;
they make it possible to automatically encode relevant information about linguistic entities of
interest (e.g., words, sentences, documents) into latent spaces, yielding high-quality
representations that are subsequently used to solve complex tasks. Such techniques have
proven critical for designing efficient systems in text classification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], question
answering [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or information extraction [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to mention a few.
      </p>
      <p>
        Neural network architectures, particularly recurrent neural networks (RNNs) and
Transformers, are now the de facto approach to computing embeddings, as illustrated by
the broad variety of language models of increasing complexity and efficiency that have
been published in recent years (e.g. RoBERTa [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], GPT-3 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). These approaches rely
on the surface analysis of large corpora composed of billions of words, and do not use
additional knowledge expressed in established knowledge resources (e.g. WordNet).
Despite recent successes, there is only so much that can be learned from a surface
analysis of text, and embedding models capture very superficial knowledge about meaning [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
One way of integrating structured a priori knowledge is to apply retrofitting, a class
of techniques used to update word vectors in a way that takes into account knowledge
expressed in knowledge resources. Despite the promising results obtained by retrofitting
techniques, the study of hybrid embedding approaches mixing corpora and knowledge
representations is still relatively marginal, especially in the context of specific tasks. This
paper aims at investigating the relevance of retrofitted word embeddings in the context
of supervised short-text classification, especially when compared to state-of-the-art
contextualized language models. We compare the performance of several pre-trained word
embedding models with and without retrofitting for short-text classification. We explore
several retrofitting approaches and use word vectors as features with both a classical
machine-learning pipeline and a more recent bi-LSTM encoder. We further
compare against two transformer baselines, where the transformers are directly used for
classification.
      </p>
      <p>The paper is organized as follows: Section 1 briefly presents the two most common
retrofitting models; Section 2 presents the protocol used in our experimental setting as
well as the obtained results. Section 3 discusses those results and offers additional
observations that question the benefit of current retrofitting approaches for the studied task.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Retrofitting embeddings in NLP</title>
      <p>
        State-of-the-art word embedding techniques based solely on corpora analysis rely
on the distributional hypothesis, which states that words
occurring in similar contexts tend to be semantically close [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This hypothesis, made popular
through Firth’s idea (1957) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: “You shall know a word by the company it keeps”, is one
of the main tenets of statistical semantics. By definition, such approaches cannot capture
lexical or conceptual relationships that may be important for accurately characterizing
the semantics of words; e.g., some approaches represent synonyms and
antonyms similarly [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To address this limitation, a class of approaches denoted retrofitting aims
at incorporating a priori knowledge from external resources (e.g. lexicons, ontologies,
or domain-specific datasets expressing semantic knowledge) in order to refine word
embeddings.
      </p>
      <p>
        Incorporating external data or knowledge generally requires retraining the model used
to compute the embeddings (which are a subset of the model's parameters). Retrofitting,
in contrast, can be seen as a post-processing step that updates pre-trained
word embeddings in order to induce a refined vector space with desired properties
encoded in the external resource. Indeed, in addition to observed word contexts (i.e.
surroundings), resources such as semantic lexicons (such as FrameNet, PPDB and
WordNet) that label lexical entries with semantic relations (e.g. hypernymy, hyponymy) can
be used. In the literature, the prevailing approach is to define a specific objective
function that learns the distribution of words and their lexical (resp. conceptual) relationships
either jointly [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10,11,12,13</xref>
        ], or separately by updating pre-trained embeddings [
        <xref ref-type="bibr" rid="ref14 ref15">14,15</xref>
        ].
When embeddings are updated using lexical ontologies, the objective function depends
on which semantic relations we seek to highlight: synonymy, hypernymy and hyponymy
(the "retrofitting" technique) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or synonymy and antonymy (the "counterfitting"
technique) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. These approaches are particularly interesting as they can be applied to any
word embeddings, independently of the embedding technique initially used to
generate them.
      </p>
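      <p>For concreteness, the following is a minimal sketch (in Python) of the iterative
update of the "retrofitting" technique [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], assuming uniform weights (alpha = 1 and beta = 1/degree, as in the original
paper); the vector and lexicon dictionaries are hypothetical placeholders standing in
for, e.g., pre-trained embeddings and WordNet synonym lists:</p>
      <preformat>
import numpy as np

def retrofit(vectors, lexicon, n_iters=10):
    """Sketch of the retrofitting update of Faruqui et al. [14].

    vectors: dict word -> np.ndarray (pre-trained embeddings)
    lexicon: dict word -> list of related words (e.g. WordNet synonyms)
    Assumes alpha_i = 1 and beta_ij = 1/degree(i).
    """
    new_vectors = {w: v.copy() for w, v in vectors.items()}
    # Only words present in both the lexicon and the vocabulary are updated.
    shared = [w for w in lexicon if w in vectors]
    for _ in range(n_iters):
        for word in shared:
            neighbours = [n for n in lexicon[word] if n in vectors]
            if not neighbours:
                continue
            beta = 1.0 / len(neighbours)
            # Pull each vector towards its lexicon neighbours while keeping
            # it close to its original (distributional) estimate.
            new_vectors[word] = (
                beta * sum(new_vectors[n] for n in neighbours) + vectors[word]
            ) / 2.0
    return new_vectors
      </preformat>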
      <p>
        Another strategy is to learn independent representations from corpora and from
knowledge resources, and to combine them later. For instance, Goikoetxea et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] learned word
representations from WordNet, and combined them with embeddings computed from
text. Several contributions have been proposed to refine these general strategies, e.g.
Vulić et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] proposed an approach based on context analysis that makes it possible to retrofit
words that do not occur in the lexicon, by exploiting words co-occurring in similar
dependency-based contexts; Yih et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed to use a thesaurus to distinguish
synonyms from antonyms in word embeddings.
      </p>
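      <p>As a sketch of this combination strategy, one of the simple methods explored by
Goikoetxea et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] is concatenation of the two independently learned vectors (the input vectors here
are hypothetical placeholders for a text-based and a WordNet-based representation):</p>
      <preformat>
import numpy as np

def combine(text_vec: np.ndarray, wordnet_vec: np.ndarray) -> np.ndarray:
    """Combine two independently learned representations of the same word
    by concatenation (averaging is another simple option)."""
    return np.concatenate([text_vec, wordnet_vec])
      </preformat>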
      <p>
        These approaches have traditionally been proposed for static word representations,
i.e. a single representation is associated with each word (token). Recently, contextualized
text-embedding models have been proposed to deal with issues induced by polysemy [
        <xref ref-type="bibr" rid="ref18 ref19">18,19</xref>
        ].
In this case, a context-specific representation of a word is obtained depending on its
meaning in each sentence. Recent retrofitting techniques are designed for these
contextualized embeddings, e.g. Shi et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] proposed to consider prior knowledge about
paraphrases to improve context-specific representations.
      </p>
      <p>
        Several studies have stressed the benefits of retrofitting for NLP tasks such as
sentiment analysis, relation classification and text classification [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref14 ref15 ref17 ref21 ref9">14,15,21,9,17,10,11,12</xref>
        ]. These
studies, however, only contain limited comparisons with state-of-the-art language models,
in particular for short-text classification.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Evaluation protocol</title>
      <p>
        This section presents the datasets and protocol used to evaluate the benefit of retrofitting
approaches for short-text classification. We focus our study on the following well-known
and representative retrofitting techniques: the "retrofitting" of Faruqui et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and
the "counterfitting" of Mrkšić et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>2.1. Word Embeddings</title>
        <p>
          We considered the following 300-dimensional word vectors: (i) Paragram [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], learned from the
text content of the paraphrase database PPDB, (ii) GloVe [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] learned from Wikipedia
and Common Crawl data, (iii) MUSE, a fastText embedding learned from Wikipedia
(https://github.com/facebookresearch/MUSE), as
well as (iv) two contextualized word embeddings models: Flair embeddings [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] trained
on the JW300 corpus, and RoBERTa [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] embeddings trained on five English corpora:
BookCorpus [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]; Wikipedia; CC-NEWS [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]; Open Web Text [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and Stories [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
        <p>For each word embedding model, except for the contextualized embedding
baselines, we consider three settings: original embeddings (baseline), retrofitted and
counterfitted.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Evaluation benchmarks</title>
        <p>The evaluations were performed on two benchmarks:</p>
        <sec id="sec-3-2-1">
          <title>2https://github.com/facebookresearch/MUSE</title>
          <p>
            HuffPost Headlines (https://www.kaggle.com) [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ]: 200,849 headlines published in HuffPost from 2012 to
2018. Each headline belongs to one of 41 possible classes.
          </p>
          <p>Product Listing on Amazon India (https://data.world/promptcloud/product-listing-on-amazon-india): 27,375 product titles from Amazon India for
2019. We kept only the products belonging to one of 9 classes, and dropped redundant
records.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Evaluation Process</title>
        <p>
          We consider two different evaluation settings: (i) shallow machine learning where we
compute a single document vector by pooling individual word embeddings, which we
use as a bag of features for several classifiers; (ii) deep machine learning, where we
use a bi-LSTM encoder [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] to learn document representations from word embeddings
during the training of a final feed-forward layer. Pre-transformer literature suggests that
the ability of LSTMs to capture dependencies between words makes them a robust choice
for text classification applications.
        </p>
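        <p>As an illustration of setting (i), the minimal sketch below pools toy word vectors
into document vectors and feeds them to a linear classifier; the vocabulary, texts and
labels are hypothetical placeholders, not taken from the benchmarks:</p>
        <preformat>
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Toy vocabulary of 300-d vectors standing in for (retrofitted) embeddings.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=300)
         for w in ["markets", "rally", "fed", "phone", "cases", "sale"]}

def doc_vector(text):
    """Setting (i): mean-pool the word vectors of a short text."""
    vecs = [vocab[w] for w in text.lower().split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

texts = ["markets rally after fed decision", "phone cases on sale"]
labels = ["BUSINESS", "SHOPPING"]
X = np.vstack([doc_vector(t) for t in texts])
clf = RidgeClassifier().fit(X, labels)
        </preformat>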
        <p>
          In the first setting, three models are compared: XGBoost, and the ridge classifier
and random forest from scikit-learn [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. In the second setting we use the Flair library [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]
with its RNN document embedding implementation, initialised with bi-LSTM cells, for
each model. We apply a grid search on held-out training data to find the best
hyperparameter values, and then run a 10-fold cross-validation using the optimal
hyperparameters for each model. Word embeddings (baseline, retrofitted or
counterfitted) are given as input to each model. In the shallow setting we compute pooled
document vectors with Flair's document pool embedding implementation (mean pooling with
a linear smoothing); in the deep setting, unpooled word embeddings are given as input to
the LSTM encoder. In addition, we also present baselines using embeddings from Flair's
RoBERTa in both settings, as well as a direct classification with the Transformer model
with a classification head (using RoBERTa).
        </p>
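        <p>For reference, a sketch of how the two settings can be assembled with the Flair
library is given below; the embedding name and hidden size are illustrative placeholders,
not the exact configurations selected by the grid search:</p>
        <preformat>
from flair.data import Sentence
from flair.embeddings import (WordEmbeddings, DocumentPoolEmbeddings,
                              DocumentRNNEmbeddings)

# 'glove' stands in for any of the baseline/retrofitted embeddings tested.
word_emb = WordEmbeddings('glove')

# Shallow setting: mean pooling with a linear fine-tuning (smoothing) layer.
pool_emb = DocumentPoolEmbeddings([word_emb], pooling='mean',
                                  fine_tune_mode='linear')

# Deep setting: a trainable bi-LSTM encoder over the unpooled word vectors,
# trained jointly with a final feed-forward classification layer (e.g. via
# Flair's TextClassifier and ModelTrainer).
rnn_emb = DocumentRNNEmbeddings([word_emb], hidden_size=256,
                                rnn_type='LSTM', bidirectional=True)

sentence = Sentence("a short headline to classify")
pool_emb.embed(sentence)
features = sentence.get_embedding()  # input features for the ridge model
        </preformat>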
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results and discussion</title>
      <p>Table 1 reports the average accuracies over the 10 cross-validation folds for all models
on the two benchmarks. Results are grouped depending on the embedding used in the
tested approach. It is important to highlight the impact of retrofitting on the performance
relative to the corresponding baseline approach, i.e. the use of the
original embedding without retrofitting (note that the embedding models considered, such as
Paragram, MUSE and GloVe, have not all been trained on the same corpora). We also report
the accuracy delta compared to the corresponding baseline accuracy; e.g., for the HuffPost
dataset, Paragram embeddings retrofitted with PPDB lead to an accuracy of 42.80% with the
ridge classifier, a 0.03% accuracy improvement over the Paragram baseline (42.77%).
In the shallow setting, we only report the best
performing classifier (always the ridge classifier). For the LSTM-RNN approach, the
standard deviations of the averaged accuracies obtained during cross-validation are
generally around 1-5% for the Huffington Post dataset and 5-12% for the Amazon India
dataset. For the ridge approach, the standard deviations are always under 1%.</p>
      <sec id="sec-4-1">
        <title>3https://www.kaggle.com</title>
        <p>4https://data.world/promptcloud/product-listing-on-amazon-india
5We draw the reader’s attention to the fact that the embedding models considered (Paragram, Muse,
Glove. . . ) have not all been trained on the same corpora.</p>
        <p>6e.g.For the HuffPost dataset, Paragram embbedings retrofitted with PPDB leads to an accuracy of 42:80%
with the ridge classifier, which corresponds to a 0:03% accuracy improvement compared to the Paragram
baseline (42:77%).</p>
      <p>The best overall results were obtained by approaches using embeddings refined by
retrofitting. In the shallow machine learning setting, we can hardly observe
any improvement with retrofitting (variations too small to be significant), which can be
attributed to the hypothesis that a linear classifier cannot meaningfully capture the
additional information. The impact of retrofitting is clearer on LSTM-RNN, although we
observe large variations of the average accuracy and a higher overall variability across
folds. Since the LSTM-RNN encoder is trained alongside the classification layer, we
effectively learn non-linear supervised document representations that can both capture some
dependencies and map the original feature space in a meaningful way. The improvements
mainly concern the HuffPost Headlines dataset (news domain). Given that some of the
embeddings are trained on news corpora and that the lexicons used for retrofitting mostly
(except PPDB) cover the general domain, it is reasonable to assume that the retrofitting
mostly benefits data in the same domain. For example, retrofitting Paragram embeddings
with PPDB leads to a +2.91% average accuracy improvement using an LSTM-RNN
classifier on HuffPost Headlines; the same approach applied to Product Amazon India
leads to a 9.25% decrease of the average accuracy. Generally, the results underline the
difficulty of formulating recommendations for one particular approach. However, we can
observe that MUSE benefits from retrofitting more often than not. Compared to the other
word embeddings considered, MUSE embeddings are learned from the smallest and
most general corpus (Wikipedia).</p>
      <p>The impact of the corpus used for computing word vectors is also emphasized
by the evaluation of contextualized embeddings. In fact, we also evaluated Flair and
RoBERTa embeddings as features with the ridge classifier for both datasets (equivalent
to the Transformer with a classification head and frozen weights),
obtaining an accuracy of 53.48% (resp. 39.24%) with RoBERTa, and only 41.81%
(resp. 34.69%) with Flair embeddings, for HuffPost Headlines (resp. Product Amazon India): better
initial representations lead to a better classification result even with an unsophisticated
classifier. With less meaningful input representations, it is beneficial to have some form of
task-specific representation learning to help the classifier exploit all meaningful
information in the features. We also tested Flair and RoBERTa with an LSTM-RNN head; however,
the significantly larger number of parameters did not allow the models to converge under
computational constraints similar to those of the other models (with retrofitted embeddings
and the ridge classifier, training and evaluation were almost instantaneous; with retrofitted
embeddings and the LSTM-RNN over 20 epochs we processed approx. 1,000 samples/s,
versus 29 samples/s for RoBERTa).</p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>By definition, embedding techniques based only on corpora analysis are not designed
to capture lexical or conceptual relationships that may be important for accurately
characterizing the semantics of words. To address this limitation, a class of approaches
denoted retrofitting has been proposed in the literature to incorporate a priori knowledge
expressed in knowledge resources, e.g. lexical ontologies. Questioning the benefit of
such approaches requires extensive task-specific empirical evaluations.</p>
      <p>In this context, this paper presents an evaluation of the impact of state-of-the-art
retrofitting approaches for short-text classification using shallow and deep learning
models on two datasets: HuffPost Headlines and Product Amazon India. Two retrofitting
techniques, of interest because they enable the refinement of existing embeddings, have been
tested using several external resources. The baseline retrofitting used a single resource (e.g.
PPDB) that captures similar words, while the counterfitting technique used two external
resources that capture similar and dissimilar words, respectively. We applied these
techniques to several pre-trained word embeddings. We compared retrofitted and
counterfitted embeddings with contextualized ones. Based on the results obtained in our
evaluation, we conclude that current retrofitting techniques generally fail to systematically
and significantly improve classification performance. Indeed, despite interesting gains
using some configurations (retrofitting technique, resource and classification method),
no general tendency or recommendation can be expressed. The tested shallow machine
learning models do not seem to benefit from retrofitting; deep learning approaches such as
LSTM-RNN do in some settings: interesting gains have been observed using Paragram
embeddings with PPDB, or MUSE embeddings with PPDB, FrameNet or WordNet+, for
the HuffPost Headlines dataset (same domain); for the Amazon India dataset (different
domain) we saw little benefit from retrofitting.</p>
      <p>
        In future work, we plan to explore retrofitting approaches for contextualized word
embeddings, as proposed by Shi et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We could also use all semantic lexicons together
to retrofit each embedding, or use domain-specific lexical ontologies or terminologies
for the retrofitting.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Peng</given-names>
            <surname>Jin</surname>
          </string-name>
          , Yue Zhang, Xingyuan Chen, and
          <string-name>
            <given-names>Yunqing</given-names>
            <surname>Xia</surname>
          </string-name>
          .
          <article-title>Bag-of-embeddings for text classification</article-title>
          .
          <source>In IJCAI</source>
          , volume
          <volume>16</volume>
          , pages
          <fpage>2824</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Guangyou</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Tingting He, Jun Zhao, and
          <string-name>
            <given-names>Po</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <article-title>Learning continuous word embedding with metadata for question retrieval in community question answering</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , pages
          <fpage>250</fpage>
          -
          <lpage>259</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Ye</surname>
          </string-name>
          , Hui Shen, Xiao Ma, Razvan Bunescu, and Chang Liu.
          <article-title>From word embeddings to document similarities for improved information retrieval in software engineering</article-title>
          .
          <source>In Proceedings of the 38th international conference on software engineering</source>
          , pages
          <fpage>404</fpage>
          -
          <lpage>415</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <surname>Omer Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          .
          <source>arXiv preprint arXiv:1907.11692</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Tom</surname>
            <given-names>B Brown</given-names>
          </string-name>
          , Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
          <string-name>
            <given-names>Amanda</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.
          <article-title>Language models are few-shot learners</article-title>
          .
          <source>arXiv preprint arXiv:2005.14165</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Emily</surname>
            <given-names>M Bender</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Koller</surname>
          </string-name>
          .
          <article-title>Climbing towards nlu: On meaning, form, and understanding in the age of data</article-title>
          .
          <source>In Proc. of ACL</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Zellig</surname>
            <given-names>S Harris.</given-names>
          </string-name>
          <article-title>Distributional structure</article-title>
          .
          <source>Word</source>
          ,
          <volume>10</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>146</fpage>
          -
          <lpage>162</lpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.R.</given-names>
            <surname>Firth</surname>
          </string-name>
          .
          <article-title>Studies in Linguistic Analysis: Special Volume of the Philological Society</article-title>
          . Special Volume of the Philological Society. Blackwell,
          <year>1957</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Wen-tau Yih</surname>
          </string-name>
          , Geoffrey Zweig, and John C Platt.
          <article-title>Polarity inducing latent semantic analysis</article-title>
          .
          <source>In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</source>
          , pages
          <fpage>1212</fpage>
          -
          <lpage>1222</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Mo</given-names>
            <surname>Yu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dredze</surname>
          </string-name>
          .
          <article-title>Improving lexical embeddings with semantic knowledge</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , pages
          <fpage>545</fpage>
          -
          <lpage>550</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Jiang</surname>
            <given-names>Bian</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Bin</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <surname>Tie-Yan Liu</surname>
          </string-name>
          .
          <article-title>Knowledge-powered deep learning for word embedding</article-title>
          .
          <source>In Joint European conference on machine learning and knowledge discovery in databases</source>
          , pages
          <fpage>132</fpage>
          -
          <lpage>148</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Chang</surname>
            <given-names>Xu</given-names>
          </string-name>
          , Yalong Bai, Jiang Bian,
          <string-name>
            <given-names>Bin</given-names>
            <surname>Gao</surname>
          </string-name>
          , Gang Wang, Xiaoguang Liu, and
          <string-name>
            <surname>Tie-Yan Liu</surname>
          </string-name>
          .
          <article-title>Rc-net: A general framework for incorporating knowledge into word representations</article-title>
          .
          <source>In Proceedings of the 23rd ACM international conference on conference on information and knowledge management</source>
          , pages
          <fpage>1219</fpage>
          -
          <lpage>1228</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Fried</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Duh</surname>
          </string-name>
          .
          <article-title>Incorporating both distributional and relational semantics in word representations</article-title>
          .
          <source>arXiv preprint arXiv:1412.4369</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Manaal</given-names>
            <surname>Faruqui</surname>
          </string-name>
          , Jesse Dodge, Sujay K Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith.
          <article-title>Retrofitting word vectors to semantic lexicons</article-title>
          .
          <source>arXiv preprint arXiv:1411.4166</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Nikola</given-names>
            <surname>Mrkšić</surname>
          </string-name>
          , Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Counter-fitting word vectors to linguistic constraints</article-title>
          .
          <source>arXiv preprint arXiv:1603.00892</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Josu</surname>
            <given-names>Goikoetxea</given-names>
          </string-name>
          , Eneko Agirre, and
          <string-name>
            <given-names>Aitor</given-names>
            <surname>Soroa</surname>
          </string-name>
          .
          <article-title>Single or multiple? combining word representations independently learned from text and wordnet</article-title>
          .
          <source>In AAAI</source>
          , pages
          <fpage>2608</fpage>
          -
          <lpage>2614</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roy</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , Ari Rappoport, Roi Reichart, and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Korhonen</surname>
          </string-name>
          .
          <article-title>Automatic selection of context configurations for improved class-specific word representations</article-title>
          .
          <source>arXiv preprint arXiv:1608.05528</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Matthew</surname>
            <given-names>E Peters</given-names>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Jacob</surname>
            <given-names>Devlin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Weijia</surname>
            <given-names>Shi</given-names>
          </string-name>
          , Muhao Chen,
          <string-name>
            <surname>Pei Zhou</surname>
          </string-name>
          , and
          <string-name>
            <surname>Kai-Wei Chang</surname>
          </string-name>
          .
          <article-title>Retrofitting contextualized word embeddings with paraphrases</article-title>
          .
          <source>arXiv preprint arXiv:1909.09700</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Billy</surname>
            <given-names>Chiu</given-names>
          </string-name>
          , Simon Baker, Martha Palmer, and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Korhonen</surname>
          </string-name>
          .
          <article-title>Enhancing biomedical word embeddings by retrofitting to verb clusters</article-title>
          .
          <source>In Proceedings of the 18th BioNLP Workshop and Shared Task</source>
          , pages
          <fpage>125</fpage>
          -
          <lpage>134</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Jeffrey</surname>
            <given-names>Pennington</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Alan</surname>
            <given-names>Akbik</given-names>
          </string-name>
          , Duncan Blythe, and
          <string-name>
            <given-names>Roland</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          .
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Yukun</surname>
            <given-names>Zhu</given-names>
          </string-name>
          , Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and
          <string-name>
            <given-names>Sanja</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Nagel</surname>
          </string-name>
          .
          <source>Common crawl news corpus</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Gokaslan</surname>
          </string-name>
          and Vanya Cohen.
          <article-title>Openwebtext corpus</article-title>
          . http://Skylion007.github.io/OpenWebTextCorpus,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Trieu H</given-names>
            <surname>Trinh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc V</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>A simple method for commonsense reasoning</article-title>
          .
          <source>arXiv preprint arXiv:1806.02847</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Rishabh</given-names>
            <surname>Misra</surname>
          </string-name>
          .
          <source>News category dataset</source>
          ,
          <year>06 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Fabian</surname>
            <given-names>Pedregosa</given-names>
          </string-name>
          , Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss,
          <string-name>
            <surname>Vincent Dubourg</surname>
          </string-name>
          , et al.
          <article-title>Scikit-learn: Machine learning in python</article-title>
          .
          <source>the Journal of machine Learning research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>