<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ingredients for Happiness: Modeling constructs via semi-supervised content driven inductive transfer learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bakhtiyar Syed*</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vijayasaradhi Indurthi*</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kulin Shah</string-name>
          <email>kulin.shah@students.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manish Gupta</string-name>
          <email>gmanish@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudeva Varma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIIT Hyderabad</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modeling affect by understanding the social constructs behind it is an important task in devising robust and accurate systems for socially relevant scenarios. In the CL-Aff Shared Task (part of the Affective Content Analysis workshop at AAAI 2019), the organizers released a dataset of 'happy' moments, called the HappyDB corpus. The task is to detect two social constructs: agency (i.e., whether the author is in control of the happy moment) and the social characteristic (i.e., whether anyone other than the author was also involved in the happy moment). We employ an inductive transfer learning technique in which we take a pre-trained language model and fine-tune it on the target task for both binary classification tasks. First, we use a language model pre-trained on the large WikiText-103 corpus; this step uses an AWD-LSTM with three hidden layers to train the language model. Second, we fine-tune the pre-trained language model on both the labeled and unlabeled instances from the HappyDB dataset. Finally, we train a classifier on top of the language model for each of the identification tasks. Our experiments using 10-fold cross validation on the corpus show that we achieve a high accuracy of 93% for detection of the social characteristic and 87% for agency of the author, showing significant gains over other baselines. We also show that using the unlabeled dataset for fine-tuning the language model in the second step improves our accuracy by 1-2% across detection of both constructs.</p>
      </abstract>
      <kwd-group>
        <kwd>Happy Moments</kwd>
        <kwd>Inductive transfer learning</kwd>
        <kwd>Language model</kwd>
        <kwd>Fine-tuning</kwd>
        <kwd>Agency Prediction</kwd>
        <kwd>Social Characteristic Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In our quest to better model and characterize happy moments, it is
important to understand which entities were involved in them, as well as
the psychology and behaviours which make people happy. Once the reasons and
behaviours which trigger happiness are identified, techniques can be
developed to steer people towards such behaviours and thereby increase their
happiness levels. It is therefore useful to answer questions like (1) whether the author was
in control of the happy moment (referred to as agency in this paper), and (2)
whether multiple people contributed to the happy moment (referred to as social
characteristic in this paper). The CL-Aff shared task at AffCon 2019<sup>3</sup> focuses
on answering these two research questions. Asai et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] developed a database
of 100K happy moments, HappyDB, using crowdsourcing and made it publicly
available. We use this dataset to build models for answering the two questions.
      </p>
      <p>* The authors contributed equally.</p>
      <p>
        Recently, there has been significant progress in the area of inductive transfer
learning for natural language processing (NLP). Training deep learning
models from scratch requires enormous amounts of labeled data to achieve high
accuracy. Recent advances, however, give
better performance on tasks like text classification from only a few labeled
instances [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In this work, we show that inductive transfer learning is greatly beneficial
in identifying the agency and social characteristics of happy moments in the
dataset. We also employ a variant wherein we utilize the 'unlabeled' happy
moments to increase the system performance. Our experiments using
10-fold cross validation on the corpus show that we achieve a high accuracy of
93% for detection of the social characteristic and 87% for agency of the
author, showing significant gains over other baselines.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Definition</title>
      <p>We specifically attempt to solve Task 1 of the CL-Aff shared task, i.e., detecting
the agency and social labels of the given happy moment. Formally, we model the
task as follows.</p>
      <sec id="sec-2-1">
        <title>Agency and Social Characteristic Detection</title>
        <p>Given a happy moment H,
we intend to learn the agency label C1 and the social label C2. C1 indicates
whether the author is in control of the happy moment being described. C2,
on the other hand, indicates whether anyone other than the author is involved, i.e.,
whether multiple entities take part in the happy moment being described.
We model both tasks as binary classification problems. Thus, if the author is in
control of the happy moment, C1=1; otherwise, C1=0. Similarly, if anyone
other than the author is involved in the happy moment, C2=1; otherwise, C2=0.</p>
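        <p>The label convention above can be made concrete with a small sketch. The names <monospace>HappyMoment</monospace>, <monospace>encode_label</monospace> and <monospace>make_moment</monospace> are hypothetical, and the assumption that raw annotations arrive as "yes"/"no" strings is ours, not something stated in this paper.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HappyMoment:
    """One happy moment H with its two binary labels."""
    text: str
    c1: int  # agency label: 1 if the author is in control, else 0
    c2: int  # social label: 1 if someone other than the author is involved, else 0


def encode_label(raw: str) -> int:
    """Map an assumed 'yes'/'no' annotation to the binary label defined above."""
    value = raw.strip().lower()
    if value not in ("yes", "no"):
        raise ValueError(f"unexpected label: {raw!r}")
    return 1 if value == "yes" else 0


def make_moment(text: str, agency: str, social: str) -> HappyMoment:
    """Build a labeled happy moment from raw (assumed) string annotations."""
    return HappyMoment(text=text, c1=encode_label(agency), c2=encode_label(social))


# Illustrative example: the author cooks (in control) for other people.
example = make_moment("I cooked dinner for my family.", agency="yes", social="yes")
```
        </preformat>
        <p>Keeping C1 and C2 as two independent binary fields mirrors the choice to treat agency and social detection as two separate classification problems.</p>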
        <p>
          To solve these problems, we propose a semi-supervised inductive transfer
learning approach. Our approach is inspired by the ULMFiT architecture [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
and AWD-LSTM [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] which we discuss in brief below.
        </p>
        <p><sup>3</sup> https://sites.google.com/view/affcon2019/cl-aff-shared-task
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries</title>
      <p>In this section, we discuss the ULMFiT architecture and the AWD-LSTM model
in brief.</p>
      <sec id="sec-3-1">
        <title>The ULMFiT Architecture</title>
        <p>
          Previous research has proposed multiple models for exploiting inductive
transfer for Natural Language Processing (NLP) applications [
          <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
          ]. In this work,
we adapt a recently proposed architecture called ULMFiT (Universal Language
Model Fine-tuning) for inductive transfer learning. The ULMFiT architecture
proposed by Howard and Ruder [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] uses multiple heuristics for fine-tuning
language models (LMs) to avoid overfitting when training neural models on small
labeled datasets. The ULMFiT architecture not only reduces LM overfitting
but also prevents the catastrophic forgetting of information to which earlier models
built on LMs were susceptible. We adapt the ULMFiT model for our
inductive transfer learning approach with a variant and show that inductive transfer
learning is greatly beneficial for identifying agency and social characteristics of
the happy moments in the given corpus. Besides exploiting the labeled data,
our variant also utilizes the unlabeled corpus for fine-tuning the language model,
which further improves the classification performance across both constructs.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>The AWD-LSTM Model</title>
        <p>
          Our inductive transfer learning mechanism also makes use of the
Averaged-SGD Weight-Dropped Long Short-Term Memory (AWD-LSTM) network [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
The AWD-LSTM uses DropConnect and a variant of averaged SGD (NT-ASGD)
along with several other well-known regularization strategies. We use
AWD-LSTMs because they have been shown to be very effective in learning low-perplexity
language models.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Approach: Inductive Transfer Learning</title>
      <p>In this section, we describe the three phases of the proposed inductive transfer
learning approach. Figure 1 illustrates the overall system architecture.</p>
      <p>
        The proposed inductive transfer learning framework for identification of the
'agency' and 'social' characteristics makes use of the following three phases in
order.
      </p>
      <p>
        1. General Domain Pre-training: The first phase pre-trains the
AWD-LSTM-based language model on a large text corpus. In our case, we use
the pre-trained language model trained on the WikiText-103 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] dataset, which
consists of 103 million words from 28,595 pre-processed Wikipedia
articles. General domain pre-training helps the model learn basic
characteristics of the language in question. It is essential that the LM be pre-trained
on a large corpus so that these general-domain characteristics are learned
well.
      </p>
      <p>
        2. Language Model Fine-tuning for the Target Task: After
pre-training the language model on a large corpus,
we fine-tune it using both the labeled and the unlabeled parts of the happy
moments corpus. In this stage, we use task-specific data to fine-tune our
language model in an unsupervised manner. As proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], our
fine-tuning involves discriminative fine-tuning and slanted triangular learning
rates to combat the catastrophic forgetting exhibited by language models
in previous works [
        <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
        ]. In discriminative fine-tuning, instead of
keeping the same learning rate for all the layers of the AWD-LSTM, a
different learning rate is used for tuning each of the three layers. The intuition
is that since each layer represents a different kind of information [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
the layers must be fine-tuned to different extents. Moreover, keeping the learning
rate constant throughout training is not the best way to enable the model to converge to a suitable
region of the parameter space. Thus we use the slanted triangular learning
rate [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which first increases the learning rate and then linearly decays it as
the number of training samples increases.
      </p>
      <sec id="sec-4-1">
        <title>3. Classifier Fine-tuning for the Target Task</title>
        <p>
          The weights that we obtain
from the second phase are fine-tuned by extending the upstream architecture
with two fully connected layers with softmax activation for the classification.
In this phase, we adapt the gradual unfreezing heuristic [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for our task. In
gradual unfreezing (GU), the layers are not all fine-tuned at the same time;
instead, the model is gradually unfrozen starting from the last layer, as it
contains the least general knowledge [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The last layer is first unfrozen
and fine-tuned for one epoch. Subsequently, the next frozen layer is unfrozen
and all unfrozen layers are fine-tuned. This is repeated until all layers are
unfrozen, after which they are fine-tuned until convergence.
        </p>
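        <p>The slanted triangular learning rate of phase 2 and the gradual unfreezing order of phase 3 can be sketched as follows. This is a minimal illustration, not the implementation used in our experiments: the function names and layer names are hypothetical, and the defaults <monospace>cut_frac=0.1</monospace>, <monospace>ratio=32</monospace> and <monospace>lr_max=0.01</monospace> are the values suggested by Howard and Ruder [<xref ref-type="bibr" rid="ref6">6</xref>], which are assumed here rather than taken from this paper.</p>
        <preformat>
```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t under a slanted triangular schedule.

    The rate rises linearly for the first cut_frac of the steps and then
    decays linearly; ratio controls how much smaller the lowest rate is
    than lr_max. Defaults are assumed from Howard and Ruder.
    """
    cut = max(1, math.floor(total_steps * cut_frac))
    if cut > t:
        p = t / cut  # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


def gradual_unfreezing_schedule(layer_names):
    """List, for each unfreezing stage, the layers that are trainable.

    layer_names is ordered from the first (most general) layer to the last.
    Stage 1 trains only the last layer; each later stage unfreezes one more.
    """
    schedule = []
    for n_unfrozen in range(1, len(layer_names) + 1):
        schedule.append(layer_names[-n_unfrozen:])
    return schedule


# Hypothetical layer names for a three-layer LM with a classifier head.
stages = gradual_unfreezing_schedule(["lstm_1", "lstm_2", "lstm_3", "classifier"])
```
        </preformat>
        <p>In this sketch the first stage contains only the classifier head and the final stage contains every layer, matching the last-to-first order described above.</p>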
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>In this section, we describe the baselines and present comparisons between the
baselines and our proposed approach.</p>
      <sec id="sec-5-1">
        <title>Baselines</title>
        <p>
          Word embedding is a technique in NLP which maps words of a language into
dense vectors of real numbers in a continuous embedding space. Traditional
representations such as BoW (Bag of Words) and TF-IDF (Term
Frequency-Inverse Document Frequency) are mainly syntactic and cannot
capture the semantic relationships between words. Word embedding techniques
have been gaining popularity in a range of NLP tasks like Sentiment analysis [
          <xref ref-type="bibr" rid="ref10 ref20">10,
20</xref>
          ], Named Entity Recognition [
          <xref ref-type="bibr" rid="ref18 ref8">8, 18</xref>
          ], Question Answering [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], etc.
        </p>
        <p>As baselines, we use word embeddings together with demographic features of the
author of the happy moment, such as age, country, gender, marital status, parenthood, and
happiness duration. We train multiple classifiers using these sets of features.
Specifically, for the baselines, we use the following pre-trained word and
sentence embedding models: GloVe, Concatenated Power Mean, Google Universal
Sentence Encoder, fastText, Lexical Vectors and InferSent embeddings.</p>
        <p>For word-based embeddings, the embedding of a sentence is computed by
tokenizing the sentence into words and averaging the
embeddings of those words. We formulate the problem of identifying the
social and agency attributes as text classification tasks. Hence, we use multiple
supervised learning algorithms, namely Logistic Regression (LR), Support Vector
Machines (SVM), Random Forests (RF), Neural Networks (with two hidden
layers), and boosting (XGB), to train the models.</p>
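        <p>The averaging step for word-based embeddings can be sketched as below. The function name, the toy three-dimensional vectors, whitespace tokenization, and the choice to skip out-of-vocabulary words are all assumptions made for illustration; the paper does not specify these details.</p>
        <preformat>
```python
def average_embedding(sentence, word_vectors, dim):
    """Sentence embedding as the element-wise mean of its word embeddings.

    Words missing from word_vectors are skipped (an assumption); a sentence
    with no known words maps to the zero vector.
    """
    tokens = sentence.lower().split()
    known = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy vectors standing in for a pre-trained embedding table.
toy_vectors = {
    "happy": [1.0, 0.0, 2.0],
    "family": [3.0, 2.0, 0.0],
}
sentence_vector = average_embedding("Happy family dinner", toy_vectors, dim=3)
# sentence_vector is [2.0, 1.0, 1.0]; "dinner" is out of vocabulary and ignored
```
        </preformat>
        <p>The resulting fixed-length vector is what a downstream classifier such as LR or SVM would receive as input.</p>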
        <p>In the following, we describe the word/sentence embeddings which we use as
baselines.</p>
        <p>
          (1) fastText [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]: a skip-gram-based word embedding method in which each
word is represented as a bag of character n-grams. A vector representation is
associated with each character n-gram, and a word is represented as the sum of
these representations.
        </p>
        <p>
          (2) GloVe [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is an unsupervised learning algorithm for distributed word
representation. Training is performed on aggregated global word-word co-occurrence
statistics from a corpus, and the resulting representations showcase interesting
linear substructures of the word vector space. We use the standard 300-dimensional
GloVe embeddings (GloVe1) trained on 840B word tokens. As another
baseline, we also use 200-dimensional GloVe embeddings trained on a Twitter
corpus (GloVe2) containing 27B word tokens.
        </p>
        <p>
          (3) InferSent [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is another set of embeddings trained by Facebook. InferSent
is trained using the task of language inference. Given two sentences the model
is trained to infer whether they are a contradiction, a neutral pairing, or an
entailment. The output is an embedding of 4096 dimensions.
        </p>
        <p>
          (4) Concatenated Power Mean Word Embedding [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] generalizes the concept
of average word embeddings to power mean word embeddings. The concatenation
of different types of power mean word embeddings considerably closes the gap
to state-of-the-art methods monolingually and substantially outperforms many
complex techniques cross-lingually.
        </p>
        <p>
          (5) Lexical Vectors [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is another word embedding, similar to fastText but with a
slightly modified objective. fastText [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a word embedding model which
incorporates character n-grams into the skip-gram model of Word2Vec and thus
considers subword information.
        </p>
        <p>
          (6) The Universal Sentence Encoder [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] encodes text into high dimensional
vectors. The model is trained and optimized for greater-than-word length text,
such as sentences, phrases or short paragraphs. It is trained on a variety of
data sources and a variety of tasks with the aim of dynamically accommodating
a wide variety of natural language understanding tasks. The input is variable
length English text and the output is a 512 dimensional vector.
        </p>
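        <p>To illustrate the idea behind the Concatenated Power Mean embedding in item (4), the sketch below concatenates three power means of the word vectors: the arithmetic mean (p = 1), the maximum (p = +infinity) and the minimum (p = -infinity). The function names are hypothetical and this particular set of powers is an assumption; the configuration of the pre-trained model used as a baseline is not stated in this paper.</p>
        <preformat>
```python
import math


def power_mean(values, p):
    """Power mean of a list of numbers for p = 1, +inf, or -inf."""
    if p == 1:
        return sum(values) / len(values)  # arithmetic mean
    if p == math.inf:
        return max(values)  # limit of the power mean as p grows
    if p == -math.inf:
        return min(values)  # limit of the power mean as p decreases
    raise ValueError("this sketch only supports p in {1, +inf, -inf}")


def concatenated_power_mean(word_vecs, powers=(1, math.inf, -math.inf)):
    """Concatenate one power-mean sentence vector per value of p."""
    dim = len(word_vecs[0])
    sentence_vec = []
    for p in powers:
        for i in range(dim):
            sentence_vec.append(power_mean([vec[i] for vec in word_vecs], p))
    return sentence_vec


# Two toy 2-dimensional word vectors.
toy = [[1.0, -2.0], [3.0, 4.0]]
result = concatenated_power_mean(toy)
# result is [2.0, 1.0, 3.0, 4.0, 1.0, -2.0]: mean, then max, then min
```
        </preformat>
        <p>With three powers, the sentence vector is three times as long as a single word vector, which is how concatenation adds information beyond the plain average.</p>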
        <p>For each of the embeddings in the above list, we train models using different
supervised learning algorithms. We use the scikit-learn implementations of these
algorithms with the standard default parameters without any hyper-parameter
tuning.</p>
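        <p>The 10-fold cross validation protocol used to score each embedding and classifier pair can be sketched in plain Python. This is an illustrative stand-in and not the scikit-learn code used for the baselines: the function names, the shuffling seed, and the <monospace>fit</monospace>/<monospace>predict</monospace> callables are hypothetical.</p>
        <preformat>
```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross validation.

    Assumes n_samples is at least k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_accuracy(features, labels, fit, predict, k=10):
    """Mean accuracy over k folds for a classifier given as two callables.

    fit(train_features, train_labels) returns a model;
    predict(model, test_features) returns one predicted label per sample.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        preds = predict(model, [features[i] for i in test_idx])
        correct = sum(int(p == labels[i]) for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```
        </preformat>
        <p>Each sample is used for testing exactly once, so the reported accuracy is an average over ten held-out folds rather than a single train/test split.</p>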
      </sec>
      <sec id="sec-5-2">
        <title>Hyper-parameter Settings</title>
        <p>
          As suggested in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we use the AWD-LSTM language model with three layers,
1150 hidden activations per layer and an embedding size of 400. The hidden layer
of the classifier is of size 50. A batch size of 30 is used to train the model. The
LM and classifier fine-tuning are done with base learning rates of 0.004 and 0.01,
respectively. We build separate models for the 'agency' and 'social' classification
tasks.
        </p>
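        <p>Combined with discriminative fine-tuning, a base learning rate translates into one rate per layer. The sketch below assumes the rule suggested by Howard and Ruder [<xref ref-type="bibr" rid="ref6">6</xref>], in which the last layer uses the base rate and each earlier layer uses the rate of the layer above it divided by 2.6. The function name is hypothetical, and this paper does not state whether the same factor was used.</p>
        <preformat>
```python
def discriminative_learning_rates(base_lr, n_layers, decay=2.6):
    """Per-layer learning rates, ordered from the first layer to the last.

    The last layer gets base_lr; each earlier layer gets the rate of the
    layer above it divided by decay (2.6 is an assumed default).
    """
    return [base_lr / decay ** (n_layers - 1 - layer) for layer in range(n_layers)]


# Rates for the three LM layers with the base LM learning rate of 0.004.
lm_rates = discriminative_learning_rates(base_lr=0.004, n_layers=3)
```
        </preformat>
        <p>Earlier layers, which hold more general knowledge, therefore change more slowly than the last layer during fine-tuning.</p>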
      </sec>
      <sec id="sec-5-3">
        <title>Results and Analysis</title>
        <p>In this work, we showed that inductive transfer learning by
fine-tuning language models gives robust performance across detection
of agency and social characteristics. We also showed that the use of unlabeled
data for LM fine-tuning in our second stage helped in improving performance
across 10-fold cross validation evaluation measures for both tasks.</p>
        <p>
          We plan to perform the given text classification using other pre-trained
embeddings like ELMo (Embeddings from Language Models) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], Skip-Thought
Vectors [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Quick-Thoughts [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and Multi-task learning based sentence
representations [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], and investigate whether those embeddings can improve the
classification accuracy. We would also like to experiment with other semi-supervised
techniques to improve the classification accuracy.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evensen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golshan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suhara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>HappyDB: A corpus of 100,000 crowdsourced happy moments</article-title>
          .
          <source>In: Proceedings of LREC 2018</source>
          .
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>
          , Miyazaki, Japan (May
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kong</surname>
            ,
            <given-names>S.y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hua</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Limtiaco</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>John</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constant</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guajardo-Cespedes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>Universal sentence encoder</article-title>
          .
          <source>arXiv preprint arXiv:1803.11175</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrault</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Supervised learning of universal sentence representations from natural language inference data</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>670</fpage>
          -
          <lpage>680</lpage>
          . Association for Computational Linguistics, Copenhagen, Denmark (September
          <year>2017</year>
          ), https://www.aclweb.org/anthology/D17-1070
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised sequence learning</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3079</fpage>
          -
          <lpage>3087</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . In:
          <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . vol. 1, pp.
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Skip-thought vectors</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3294</fpage>
          -
          <lpage>3302</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>arXiv preprint arXiv:1603.01360</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Logeswaran</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>An efficient framework for learning sentence representations</article-title>
          .
          <source>arXiv preprint arXiv:1803.02893</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Maas</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daly</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>P.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potts</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Learning word vectors for sentiment analysis</article-title>
          . In:
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>
          . pp.
          <fpage>142</fpage>
          -
          <lpage>150</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Merity</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keskar</surname>
            ,
            <given-names>N.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Regularizing and optimizing LSTM language models</article-title>
          .
          <source>arXiv preprint arXiv:1708.02182</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>How transferable are neural networks in NLP applications?</article-title>
          <source>arXiv preprint arXiv:1603.06111</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          . In:
          <source>Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Exploring models and data for image question answering</article-title>
          . In:
          <source>Advances in neural information processing systems</source>
          . pp.
          <fpage>2953</fpage>
          –
          <lpage>2961</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. Ruckle,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Eger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Peyrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          :
          <article-title>Concatenated p-mean word embeddings as universal cross-lingual sentence representations</article-title>
          .
          <source>arXiv preprint arXiv:1803.01400</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Salle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villavicencio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Incorporating subword information into matrix factorization word embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1805.03710</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>C.N.d.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guimaraes</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Boosting named entity recognition with neural character embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1505.05008</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trischler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Learning general purpose distributed sentence representations via large scale multi-task learning</article-title>
          .
          <source>arXiv preprint arXiv:1804.00079</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Learning sentiment-specific word embedding for Twitter sentiment classification</article-title>
          . In:
          <source>Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>1555</fpage>
          –
          <lpage>1565</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yosinski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clune</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>How transferable are features in deep neural networks?</article-title>
          In:
          <source>Advances in neural information processing systems</source>
          . pp.
          <fpage>3320</fpage>
          –
          <lpage>3328</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>