<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word Pair Convolutional Model for Happy Moment Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Saxon*</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samarth Bhandari*</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lewis Ruskin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabrielle Honda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Integrative Sciences and Arts Arizona State University</institution>
          ,
          <addr-line>Tempe, AZ 85281</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing</institution>
          ,
          <addr-line>Informatics, and Decision Systems Engineering</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Electrical</institution>
          ,
          <addr-line>Computer, and Energy Engineering</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The Luminosity Lab, Office of Knowledge Enterprise Development</institution>
        </aff>
      </contrib-group>
      <author-notes>
        <fn id="fn1">
          <p>* Equal contribution</p>
        </fn>
      </author-notes>
      <abstract>
        <p>We propose the Word Pair Convolutional Model (WoPCoM) for the CL-Aff 19 shared task at the AAAI-19 Workshop on Affective Content Analysis. The challenge is the classification of speaker-described happy moments as social events and/or activities in which the speaker has agency. WoPCoM leverages the regular structure of language on an architectural level in a way that recurrent models cannot easily match, by learning convolutional word pair features that capture the important intra- and inter-phrasal relationships in a sentence. It performs with an average accuracy of 91.45% predicting the social label and 86.49% predicting the agency label, assessed on a 10-fold cross validation. This represents a performance improvement of 0.92% for predicting the social label, but only 0.04% for predicting the agency label, over a simpler deep LSTM baseline. In spite of similar performance on these metrics, WoPCoM demonstrates desirable results when other metrics such as training time, model-intermediate class separation, and overfitting propensity are considered, warranting further study.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>dilated convolutional neural network</kwd>
        <kwd>social interaction</kwd>
        <kwd>word vectors</kwd>
        <kwd>word embeddings</kwd>
        <kwd>semantic-syntactic features</kwd>
        <kwd>sentiment analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1.1</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <sec id="sec-2-1">
        <title>Shared Task</title>
        <p>
          This paper will address the challenge put forth in the CL-Aff shared task [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] at
the AAAI-19 Workshop on Affective Content Analysis. The task involves rating
a happy moment for speaker agency and social interaction given only a sentence
describing the moment.
        </p>
        <p>
          HappyDB is a corpus of 100,000 crowd-sourced happy moments. Workers
on Amazon Mechanical Turk were tasked with recalling and writing sentences
describing happy moments from three "recollection periods": the past 24 hours,
the past week, and the past month [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. From this
corpus a training set of 10,000 sentences, each labelled with speaker agency
and social attributes, was produced. The social label is assessed as "yes" if the
moment in question directly involved people other than the speaker, and "no"
otherwise. The agency label is rated "yes" if the speaker had direct control
over the action described in the moment, and "no" otherwise. For example, the
moment "My boyfriend bought me flowers" would be rated "yes" for social
and "no" for agency, whereas "I took my dog for a walk" would be "no" for
social and "yes" for agency.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Related Work</title>
        <p>As this is a new shared task, previous approaches to this specific problem do
not exist. Because we have selected a deep learning-based approach to the task,
we consider neural networks for sentence-level semantic classification, and the
word embedding approaches that underpin most methods in that domain, to be
related work.</p>
        <p>
          Word embeddings are a powerful tool for NLP tasks due to their capacity to
capture both semantic and syntactic data without a need for feature engineering
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Despite shortcomings like susceptibility to common unigrams [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and the lack of a
mechanism for representing syntactic relationships, a bag-of-words approach to
creating word vectors establishes a good theoretical baseline for identifying and
addressing various weaknesses in word encoding models.
        </p>
        <p>
          Embeddings from Language Models (ELMo), 1024-dimensional word vectors
assessed using a bidirectional LSTM applied across the characters in a sentence,
are the current state of the art in latent word vector representation [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. ELMo
embeddings have been demonstrated to bring peak performance on a variety of
downstream tasks against other universal word embedders such as Word2Vec
and GloVe [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We selected pre-trained ELMo embeddings for our approach
because of this past record of performance on downstream tasks, and their two
most useful properties: the learning of meaningful sub-word units [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] by virtue
of their character-based processing, and their capture of richer semantic
context at the word level from their use of the full sentence context in generating
constituent word vectors. While Google's Universal Sentence Encoder has been
demonstrated to perform better than ELMo on semantic relatedness and
textual similarity tasks [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we require the granularity provided by word-level rather
than sentence-level embeddings for our approach to the task.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Design Rationale</title>
        <p>Our neural network design arose by first considering how we would implement a
feature engineering approach, and then determining how a neural network could
learn features similar to the ones we would have hand-engineered.</p>
        <p>The "agency-ness" and "social-ness" of the vast majority of sentences we
considered at this stage hinged on a few critical features. For example, the "I
walked" in "I walked my dog to the park" is critical for understanding the
agency of the speaker. Perhaps this pair of words could be captured by a
two-word filter evaluating personal pronouns to the left of active verbs. Through
the consideration of these toy examples our hypothetical feature engineering
approach began to take shape.</p>
        <p>By approximating the distributions of certain semantic-syntactic concepts, such
as personal nouns, social verbs, and action verbs, in the embedding space, the
probability that a word belongs to such classes can be estimated. The word count
distance between two words in a sentence can be roughly correlated with their
syntactic relationship. Coupled with semantic-syntactic information about the
individual words themselves, features correlated with sentence meaning would
arise.</p>
        <p>
          Rather than hand-engineer these word probability distributions, we would
design the initial layers of the neural network to learn them. Rather than
hand-engineer the two-word filters, we would use convolutional layers to learn them.
Inspired by the dilated convolution approach employed by WaveNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to
generate audio samples by capturing various levels of sample stride, we decided
to create a filterbank of size-two convolutional filters of varying dilation factor to
capture various inter- and intra-phrasal relationships. We were confident in this
approach because it would allow us to constrain the solution space and build in
insights from the structure of language that are not present in the far more
general sequence-learning recurrent methodologies such as LSTM-based [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] models.
        </p>
        <p>
          English is a strongly head-initial language with a subject-verb-object word
order. The word determining the syntactic category of an English phrase
generally precedes its complements, leading to the formation of right-branching
grammatical structures. Nouns generally form constituent noun phrases of the
verbs they follow [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Taken together, this information about the regular
structure of English sentences offers an opportunity to design network architectures
that capture the important interplay between key pairs of words in a sentence.
In other words, because we can rely on the direction between pairs of words to
matter in many of these examples, a lter-based conceptualization of how to
process sentences for semantics makes sense.
        </p>
        <p>These observations justified the design of the Word Pair Convolutional Model
(WoPCoM), hinging on the importance of pairs of words and the flexibility of
neural networks to form a prediction.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>Almost all of the classes we considered boiled down to a prototypical feature
f(x(n), k) = P(x(n) ∈ S1) · P(x(n + k) ∈ S2) for input sentence x of words
x(n), word count separation k, and word classes S1 and S2. For example, one
possible feature for testing speaker agency in a sentence might consider S1 to
be the set of personal nouns and S2 to be the set of active verbs for k = 1,
which would capture subclauses common to active sentences such as "I took" in
the example sentence above, or "we went," "I made," etc. Similar features for
labelling social interactions can be envisioned. Through the convolution of these
features across the sentences, the labels could be ascertained. While modelling
the probability distributions of the various word sets and considering all the
important features by hand is not feasible, we hypothesized that a convolutional
neural network (CNN) could solve both of these problems at once, implicitly
learning the distributions of word classes and choosing which classes to select
for simultaneously during the learning process. What follows are the insights
from our model design process through the models we considered.</p>
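      <p>As an illustrative sketch (not drawn from our released code), this prototypical feature can be computed directly when per-word class membership probabilities are supplied; in WoPCoM these probabilities are learned rather than hand-built:</p>
      <preformat>
# A minimal sketch of the prototypical word-pair feature
# f(x(n), k) = P(x(n) in S1) * P(x(n + k) in S2).
# p_s1 and p_s2 are hypothetical per-word class-membership
# probabilities; in WoPCoM these are learned, not supplied.
from typing import List

def word_pair_feature(p_s1: List[float], p_s2: List[float], k: int) -> List[float]:
    """Slide a two-word filter with separation k across the sentence."""
    n_words = len(p_s1)
    return [p_s1[n] * p_s2[n + k] for n in range(n_words - k)]

# Example: "I took my dog for a walk" with S1 = personal nouns,
# S2 = active verbs; high response at position 0 ("I" + "took").
p_s1 = [0.9, 0.1, 0.2, 0.1, 0.0, 0.0, 0.1]  # P(word in S1)
p_s2 = [0.0, 0.8, 0.0, 0.1, 0.0, 0.0, 0.3]  # P(word in S2)
print(word_pair_feature(p_s1, p_s2, k=1))
</preformat>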
      <p>
        We chose to use ELMo vectors as the input feature for all models. All
models were implemented using PyTorch [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and AllenNLP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; all of our code is
available at https://github.com/luminositylab/CL-AFF-ST.
      </p>
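      <p>For reference, a minimal sketch of obtaining per-word ELMo vectors through AllenNLP's ElmoEmbedder interface follows; the exact import path and defaults reflect the library as we used it and may differ across versions:</p>
      <preformat>
# Minimal sketch: per-word ELMo embeddings via AllenNLP.
# ElmoEmbedder and embed_sentence() come from AllenNLP's
# original ELMo interface; exact paths may vary by version.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads pretrained weights on first use
tokens = ["I", "took", "my", "dog", "for", "a", "walk"]
layers = elmo.embed_sentence(tokens)  # shape: (3, len(tokens), 1024)
word_vectors = layers[-1]             # top-layer 1024-d word vectors
print(word_vectors.shape)             # (7, 1024)
</preformat>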
      <sec id="sec-3-1">
        <title>Baseline Recurrent Model</title>
        <p>We first considered a naïve model employing a long short-term memory (LSTM)
layer with variable hidden size taking the sentence set of ELMo embeddings
as input. The final hidden state of the LSTM was processed through a single
fully-connected layer with output size 2, and then a sigmoid function to map all
possible outputs to the (0, 1) probability range, representing the probability of
membership to the two classes.</p>
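        <p>A minimal PyTorch sketch of this baseline follows; the layer sizes match the hyperparameters reported below, while variable names are our own:</p>
        <preformat>
# Sketch of the baseline recurrent model: LSTM over ELMo
# vectors, final hidden state, linear layer, then sigmoid.
import torch
import torch.nn as nn

class BaselineLSTM(nn.Module):
    def __init__(self, hidden_size: int = 25):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1024, hidden_size=hidden_size,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, 2)

    def forward(self, x):  # x: (batch, seq_len, 1024) ELMo vectors
        _, (h_n, _) = self.lstm(x)               # final hidden state
        return torch.sigmoid(self.fc(h_n[-1]))   # (batch, 2)

model = BaselineLSTM()
probs = model(torch.randn(50, 12, 1024))  # batch of 50 sentences
print(probs.shape)  # torch.Size([50, 2])
</preformat>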
      </sec>
      <sec id="sec-3-2">
        <title>Word Pair Convolutional Model</title>
        <p>In our prototypical feature engineering formulation of the task we considered a
feature extractor convolving across a vector of class membership probabilities
for each word in the sentence to extract various intermediate word pair class
attributes that could be used in higher-level classification. However, due to the
feasibility issues discussed above and the lack of a ground truth objective to train
word class probabilities with, we opted for a looser type of pre-CNN processing: a
set of linear layers separated by ReLU operations with no strict class probability
constraint at the output. We do this by taking the ELMo-evaluated sentence set
of word vectors through a 1024 × N linear layer, with N corresponding to the
number of word attributes we want the WoPCoM to learn.</p>
        <p>Two-vector convolutional filters of varying dilation are then applied across the
sentence set of size-N word attribute vectors to evaluate word pair semantic-syntactic
features. Through the varying dilation of the filters, different types of
syntactic relationships can be captured, allowing for intra- and inter-phrasal
relationships to be assessed. We implemented our feature extractors as a set of
five 1-dimensional 2 × N × NC convolutional layers with dilation factors varying
from 1 to 5. The output dimension NC (number of classes) is configurable to
allow for evaluating different quantities of word pair relationships, and can be
considered analogous to the hidden state size in the LSTM implementation.</p>
        <p>To "pool" the five sentence-length convolution outputs, we feed each through
an LSTM with hidden size NC and take the final NC × 1 hidden state as the
"pooled" output of that filter set. Those five vectors are then concatenated into
one vector, and fed through a linear 5NC × 2 layer to map these feature outputs
to the two classes. Finally, these feature outputs are evaluated as probabilities
through the sigmoid function. Figure 1 depicts the WoPCoM model by showing
selected layer outputs in the processing of a single sentence batch.</p>
        <p>[Figure 1: The WoPCoM architecture. A sentence is passed through the ELMo embedder (1024-dimensional word vectors), a linear 1024 × N layer with ReLU (plus further N × N linear and ReLU layers), five 1-d CNNs with dilation D = 1 through 5, LSTMs for pooling, concatenation of all five pooled outputs, and a linear (NC × 5) × 2 layer with sigmoid outputs P(Agency) and P(Social).]</p>
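        <p>A condensed PyTorch sketch of WoPCoM as described follows; the exact depth of the N × N linear stack and other minor details are assumptions, not a definitive implementation:</p>
        <preformat>
# Sketch of WoPCoM per the description above: linear word-attribute
# layers, five dilated two-word conv filters, LSTM pooling per
# filter set, concatenation, linear layer, and sigmoid outputs.
import torch
import torch.nn as nn

class WoPCoM(nn.Module):
    def __init__(self, n_attr: int = 100, nc: int = 50):
        super().__init__()
        # 1024 x N linear layer plus an assumed extra N x N layer
        self.attr = nn.Sequential(nn.Linear(1024, n_attr), nn.ReLU(),
                                  nn.Linear(n_attr, n_attr), nn.ReLU())
        # five size-2 1-d convolutions with dilations 1..5
        self.convs = nn.ModuleList([
            nn.Conv1d(n_attr, nc, kernel_size=2, dilation=d)
            for d in range(1, 6)])
        self.pools = nn.ModuleList([
            nn.LSTM(nc, nc, batch_first=True) for _ in range(5)])
        self.out = nn.Linear(5 * nc, 2)

    def forward(self, x):  # x: (batch, seq_len, 1024)
        a = self.attr(x).transpose(1, 2)      # (batch, n_attr, seq)
        pooled = []
        for conv, lstm in zip(self.convs, self.pools):
            c = conv(a).transpose(1, 2)       # (batch, seq', nc)
            _, (h_n, _) = lstm(c)             # LSTM "pooling"
            pooled.append(h_n[-1])            # (batch, nc)
        return torch.sigmoid(self.out(torch.cat(pooled, dim=1)))

model = WoPCoM()
print(model(torch.randn(50, 12, 1024)).shape)  # torch.Size([50, 2])
</preformat>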
      </sec>
      <sec id="sec-3-3">
        <title>Hyperparameters</title>
        <p>Baseline: For the baseline recurrent model, a hidden representation size of 25
was used.</p>
        <p>WoPCoM: For the final implementation, the word class count N was set to 100,
and the convolutional filter set output size NC was set to 50.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Training</title>
        <p>
          The Adam optimizer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] was used with a learning rate of 0.0001.
        </p>
        <p>The models were trained with random batches of 50 same-length sentences
to minimize necessary input padding. A patience factor of 10 was used to allow
variable-length training: once 10 epochs pass without a drop in validation loss,
training is halted. Mean-squared error was used as the loss metric.</p>
        <p>Class labels were tokenized as 0 for "yes" and 1 for "no." This means that
the model is really learning negative probabilities (the probability that the sentence
is not social/agency), but this has no bearing on final accuracy. To compute
accuracy, F1, and AUC, the output probabilities are rounded to 0 or 1.</p>
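        <p>A sketch of this training setup follows; the batch generators are placeholders for the real data pipeline, and the model is taken from the WoPCoM sketch above:</p>
        <preformat>
# Sketch of the training setup described here: Adam (lr=0.0001),
# MSE loss, and early stopping with a patience of 10 epochs.
import torch
import torch.nn as nn

def make_batches(n=20):
    # placeholder pipeline: batches of 50 same-length sentences of
    # ELMo vectors, with {0, 1} social/agency targets
    return [(torch.randn(50, 12, 1024),
             torch.randint(0, 2, (50, 2)).float()) for _ in range(n)]

model = WoPCoM()  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.MSELoss()
train_data, val_data = make_batches(), make_batches(5)

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(1000):
    model.train()
    for x, y in train_data:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_data)
    if val_loss >= best_val:        # no improvement this epoch
        bad_epochs += 1
        if bad_epochs >= patience:  # halt after 10 stagnant epochs
            break
    else:
        best_val, bad_epochs = val_loss, 0
</preformat>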
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>We evaluated WoPCoM and our baseline model using 10-fold cross validation,
achieving the results in Table 1. Results that have over a 0.5% improvement
from the baseline are in bold.</p>
      <p>
        To illustrate the general separation of classes performed by WoPCoM t-SNE
projection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was employed to generate Figure 2. Note that the "Social Only"
and "Agency Only" classes both overlap significantly with the most heavily
"Social &amp; Agency" region, while overlapping significantly less with each other's
regions.
      </p>
      <p>[Figure 2: t-SNE projection of WoPCoM-intermediate representations, with points labelled "Social &amp; Agency," "Social Only," "Agency Only," and "None."]</p>
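      <p>For reference, a sketch of producing such a projection with scikit-learn's TSNE follows; the feature matrix here is a random placeholder standing in for the model-intermediate vectors:</p>
      <preformat>
# Sketch: t-SNE projection of model-intermediate vectors.
# `features` stands in for the concatenated pooled outputs
# collected from WoPCoM for each validation sentence.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.randn(500, 250)        # placeholder (5 * NC = 250)
labels = np.random.randint(0, 4, size=500)  # placeholder class ids
names = ["Social &amp; Agency", "Social Only", "Agency Only", "None"]

xy = TSNE(n_components=2).fit_transform(features)
for c, name in enumerate(names):
    pts = xy[labels == c]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
plt.legend()
plt.show()
</preformat>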
      <p>WoPCoM shows a modest improvement over the LSTM baseline on the social
classification task, but the difference between its performance and the LSTM's
performance on the agency classification task is meager. This disparity could have
meaningful implications about the underlying data, warranting further study.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>Through the process of training and testing these models many times, we have
come up with some informal observations in addition to the concrete results. Some
confirm basic concepts in machine learning, such as the efficacy of deeper
architectures at fitting to the complex underlying distributions at the cost of overfitting.
We began paying attention to which models quickly fit to high accuracy across
relatively few epochs, and which models maintained a relatively narrow gulf
between the training and validation loss across many epochs.</p>
      <p>One potential advantage we found of WoPCoM over the LSTM is
that it tends to resist overfitting. The training and test loss both saturate around
the same epoch, and there does not come a point where the overfitting pattern
of simultaneously decreasing training loss and increasing validation loss takes
place. This might be a result of the constraints on the solution space structure
described briefly above. We are interested in applying a more detailed analysis
to this phenomenon.</p>
      <p>In future work we plan to validate more rigorously those parameters of our
approach that were chosen almost arbitrarily, such as the choice
of two-word convolution filters and dilation factors up to five, against similar
approaches leveraging three- or four-word filters and larger dilation factors, as
well as to pit WoPCoM, or models designed with a similar philosophy, against more
radically different architectures on more established tasks.</p>
      <p>Currently we are using pretrained ELMo embeddings; we suspect that for
task/dataset-specific problems such as this shared task, word embeddings learned
directly from the unlabeled data could improve performance. An unsupervised
approach employing an autoencoder that shares the sentence-feature
compression architecture of WoPCoM could potentially be adapted to improve
performance as well.</p>
      <p>We would also like to investigate the utility of the WoPCoM architecture
we've developed in applications with scarce training data, and to build on this
architecture to fit such problems better.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We would like to thank Dr. Hemanth Venkateswara for his guidance, particularly
in shooting down our worst ideas early, so we never had to discover how bad they
were ourselves.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evensen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golshan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suhara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>HappyDB: A corpus of 100,000 crowdsourced happy moments</article-title>
          .
          <source>In: Proceedings of LREC 2018</source>
          .
          European Language Resources Association (ELRA), Miyazaki, Japan (May
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grus</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tafjord</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dasigi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmitz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.S.:</given-names>
          </string-name>
          <article-title>AllenNLP: A deep semantic natural language processing platform</article-title>
          . arXiv preprint arXiv:1803.07640
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hacohen-Kerner</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenfeld</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabag</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzidkani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Topic-based classification through unigram unmasking</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>126</volume>
          ,
          <fpage>69</fpage>
          –
          <lpage>76</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
          (
          <year>1997</year>
          ). https://doi.org/10.1162/neco.1997.9.8.1735
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jaidka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mumick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chhaya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The CL-Aff Happiness Shared Task: Results and Key Insights</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Affective Content Analysis @ AAAI (AffCon2019)</source>
          . Honolulu, Hawaii
          (
          <year>January 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Visualizing high-dimensional data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          ,
          <fpage>2579</fpage>
          –
          <lpage>2605</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          . In:
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          . Curran Associates, Inc. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. van den Oord, A.,
          <string-name>
            <surname>Dieleman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Wavenet: A generative model for raw audio</article-title>
          .
          <source>arXiv preprint arXiv:1609.03499</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Paszke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeVito</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Desmaison</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antiga</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lerer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatic differentiation in PyTorch</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Perone</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silveira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paula</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          :
          <article-title>Evaluation of sentence embeddings in downstream and linguistic probing tasks</article-title>
          . arXiv preprint arXiv:1806.06259
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers), pp.
          <fpage>2227</fpage>
          –
          <lpage>2237</lpage>
          . Association for Computational Linguistics (
          <year>2018</year>
          ). https://doi.org/10.18653/v1/N18-1202, http://aclweb.org/anthology/N18-1202
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>English Syntax: An Introduction</source>
          . Cambridge University Press (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>