<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GTH-UPM at TASS 2019: Sentiment Analysis of Tweets for Spanish Variants</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ignacio Gonzalez Godino</string-name>
          <email>ignacio.ggodino@alumnos.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Fernando D'Haro</string-name>
          <email>luisfernando.dharo@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Grupo de Tecnología del Habla, ETSI de Telecomunicación, Universidad Politécnica de Madrid, Avenida Complutense 30</institution>
          ,
          <addr-line>28040, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>579</fpage>
      <lpage>588</lpage>
      <abstract>
<p>This article describes the system developed by the Grupo de Tecnología del Habla at Universidad Politécnica de Madrid, Spain (GTH-UPM) for the TASS 2019 competition on sentiment analysis in tweets. The developed system consisted of three classifiers: a) a system based on feature vectors extracted from the tweets, b) a neural-based classifier using FastText, and c) a deep neural network classifier using contextual vector embeddings created with BERT. Finally, the probabilities of the three classifiers were averaged to obtain the final score. The final system obtained an averaged F1 of 48.0% and 48.4% on the dev set for the mono and cross tasks respectively, and 46.0% and 45.0% for the mono and cross tasks on the test set.</p>
      </abstract>
      <kwd-group>
        <kwd>TASS</kwd>
<kwd>Multiclassifiers</kwd>
<kwd>Natural Language Processing (NLP)</kwd>
        <kwd>Twitter</kwd>
        <kwd>Sentiment Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
Sentiment Analysis (a.k.a. opinion mining) is a branch of the Natural Language Processing field whose goal is to automatically determine whether a piece of text can be considered positive, negative, neutral or none, deriving in this way the opinion or attitude of the person writing the text [
        <xref ref-type="bibr" rid="ref13">13</xref>
]. Sentiment analysis has recently attracted a lot of attention, since it allows companies to understand customers' feelings towards their products [
        <xref ref-type="bibr" rid="ref18">18</xref>
], politicians to poll reactions to their statements and actions (even to predict the results of an election [
        <xref ref-type="bibr" rid="ref2">2</xref>
]), and it can also be used to monitor and analyze social phenomena and the general mood.
      </p>
      <p>
For the TASS 2019 competition, the organizers proposed research on sentiment analysis with a special interest in evaluating polarity classification of tweets written in Spanish variants (i.e., the Spanish language spoken in Costa Rica, Spain, Peru, Uruguay and Mexico). The main challenges the systems had to face were the lack of context (tweets are short, up to 240 characters), the presence of informal language such as misspellings, onomatopoeias, emojis, hashtags, usernames, etc., the similarities between variants, class imbalance, and especially the restrictions imposed by the organizers on the data used for training [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
<p>The proposed challenge consisted of two sub-tasks: a) Monolingual, where participants must train and test their systems using only the dataset for the corresponding variant, and b) Cross-lingual, where participants must train their systems on a selection of the complementary datasets while using the corresponding variant for testing; the goal here was to test the dependency of systems on learning specific characteristics of the text for a given variant. For both tasks, the organizers required that any supervised or semi-supervised system be trained only with the provided training data, the use of other training sets being totally forbidden. However, linguistic resources like lexicons, word embedding vectors or knowledge bases could be used as long as they were clearly indicated. The goal was to allow a fair comparison between systems, but also to foster creativity by restricting systems to the same set of training data.</p>
<p>The paper is organized as follows. In Section 2 we provide detailed information about the datasets given by the organizers; afterwards, in Section 3, we describe in detail the classifiers and features used in our system; then, in Section 4, we present our results in the monolingual and cross-lingual settings. Finally, in Section 5 we present our conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Corpus description</title>
<p>The organizers provided participants with a corpus including five sets of data, one for each of five different countries where Spanish is spoken: Costa Rica (CR), Spain (ES), Mexico (MX), Peru (PE), and Uruguay (UY). For each variant, training, development and test sets were provided. The data consisted of tweets together with their ID, user, date, variant and, only in the training and development sets, the sentiment class, which could be 'P' (positive), 'N' (negative), 'NEU' (neutral) or 'NONE' (no sentiment).</p>
<p>The label distribution for each variant in the training and development sets is shown in Table 1. Table 2 shows the label distribution for the test set. As we can see, the distributions of labels differ among variants, showing that systems must deal with class imbalance; however, the label distributions across the training, dev and test sets for the same variant are quite similar (except for Peru, where there are more N, NEU and NONE labels in the test set than in the training and dev sets). This posed the challenge of creating a robust system, but it also explains the difference in performance for this variant, as shown in Section 4.</p>
<p>The task was divided into two sub-tasks: monolingual and cross-lingual analysis. In the first sub-task, the systems used tweets from the same variant for both training and testing. In the cross-lingual setting, in order to test the dependency of systems on a variant, they could be trained on a selection of any variants except the one used for testing. In our case, we simply combined the other variants into a single file and evaluated on the corresponding dev or test set for the given variant.</p>
<p>After reading the data, each tweet was pre-processed as follows:
- Leading and trailing spaces were removed.
- Words starting with the symbol "#" were replaced by the word itself, removing the "#". If camel case was found, the word was separated. For instance, "#thisBeautifulDay" was replaced by "this Beautiful Day".
- URL references ('http://...') were replaced by the word 'http'.
- User references ('@username') were replaced by the word USER NAME.
- Sequences of three or more equal characters were replaced by a single occurrence of that character. For instance, "siiii" was replaced by "si".
- References to human laughter, such as "jajja" or "jajajajaj", were replaced by "jajaja".
- Numbers were removed.
- Other punctuation symbols were removed.</p>
<p>Then, we performed lemmatization and tokenization using the large Spanish model included in SpaCy (https://spacy.io/). Finally, tweets were converted to lowercase.</p>
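<p>The pre-processing steps above can be sketched in a few lines of Python. This is a minimal re-implementation written for this description, not the authors' code: the USER_NAME placeholder spelling, the camel-case split heuristic, the laughter regex and the final whitespace normalization are our assumptions, and the SpaCy lemmatization step is omitted.</p>

```python
import re

def preprocess_tweet(text: str) -> str:
    """Minimal sketch of the pre-processing steps listed above."""
    text = text.strip()                                    # leading/trailing spaces
    # '#camelCaseTag' -> '#' removed, camel case split into words
    text = re.sub(r'#(\w+)',
                  lambda m: re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', m.group(1)),
                  text)
    text = re.sub(r'https?://\S+', 'http', text)           # URL references -> 'http'
    text = re.sub(r'@\w+', 'USER_NAME', text)              # user references
    # laughter variants ('jajja', 'jajajajaj', ...) -> 'jajaja'
    text = re.sub(r'\bj+(?:aj+)+a*\b', 'jajaja', text, flags=re.IGNORECASE)
    text = re.sub(r'(.)\1{2,}', r'\1', text)               # 'siiii' -> 'si'
    text = re.sub(r'\d+', '', text)                        # numbers
    text = re.sub(r'[^\w\s]', '', text)                    # other punctuation
    return re.sub(r'\s+', ' ', text).strip().lower()       # tidy whitespace, lowercase
```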
<p>During this pre-processing phase, we analyzed the tweets and discovered a remarkably high percentage of Out-Of-Vocabulary words (OOVs) when comparing the vocabularies of the training and development sets, with values ranging between 53-55%. We managed to reduce it to 48-51% by using lemmatization and character vector embeddings (see Section 3.2), but it was still a surprisingly high value, which reduced the performance of our final system; finding additional solutions to this problem is left as future work.</p>
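<p>The OOV analysis can be reproduced with a small helper. This is a sketch under our assumptions: the token lists stand in for the pre-processed training and development tweets, and we measure the type-level rate, i.e., the share of the development vocabulary never seen in training.</p>

```python
def oov_rate(train_tweets, dev_tweets):
    """Type-level OOV rate: percentage of the dev vocabulary unseen in training."""
    train_vocab = {tok for tweet in train_tweets for tok in tweet.split()}
    dev_vocab = {tok for tweet in dev_tweets for tok in tweet.split()}
    return 100.0 * len(dev_vocab - train_vocab) / len(dev_vocab)
```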
<p>The resulting pre-processed tweets were then used as input for the three classifiers described in the following section.</p>
    </sec>
    <sec id="sec-3">
<title>Classifiers</title>
<p>The final system was based on three different and independent classifiers followed by an ensembling method, which are explained below.</p>
      <sec id="sec-3-1">
<title>Feature-based classifier</title>
<p>The first classifier was based on features extracted from the training tweets. Then, we concatenated them and trained a classifier for each variant. The extracted features were:
- Number of words in the tweet.
- Number of words with all characters in upper case.
- Number of "hashtags" found in the tweet (i.e., words starting with the symbol "#").
- Whether the tweet has an exclamation mark.
- Whether the tweet has a question mark.
- Presence or absence of words with one character repeated more than two times, as in "holaaa".</p>
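<p>As an illustration, the six surface features above can be computed as follows. This is a hypothetical re-implementation; the exact definitions and ordering used by the system are not given in the text.</p>

```python
import re

def surface_features(tweet: str) -> list:
    """Sketch of the six surface features described above."""
    words = tweet.split()
    return [
        len(words),                                    # number of words
        sum(1 for w in words if w.isupper()),          # all-upper-case words
        sum(1 for w in words if w.startswith('#')),    # hashtags
        int('!' in tweet),                             # has exclamation mark
        int('?' in tweet),                             # has question mark
        int(bool(re.search(r'(\w)\1{2,}', tweet))),    # 'holaaa'-style repeats
    ]
```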
        <p>
          These features were selected based on features commonly used in sentiment
analysis [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] or [
          <xref ref-type="bibr" rid="ref3">3</xref>
], and on our own intuition. For instance, we can intuitively expect that longer tweets tend to be more negative, since the author elaborates on the upsetting situation, or that upper-case words usually carry more weight and tend to be used in highly emotional tweets. On the other hand, when analyzing the corpus, we noticed that tweets containing exclamation marks, hashtags or words like "holaaa" are more likely to be positive, while tweets containing question marks tend to be non-emotional; therefore, many of the proposed features were chosen based on this initial analysis and our intuition.
        </p>
        <p>
In addition, a vocabulary of negative and positive words was automatically created from the training data by extracting the 25 most discriminating words between the classes 'P' and 'N' using the algorithm proposed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
]. Four features were extracted from this vocabulary (checking whether these words appeared in the tweet or not) and normalized by the number of words in the tweet:
- Number of negative words.
- Number of positive words.
- Number of positive words minus the number of negative words.
- Total count of both negative and positive words.
</p>
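<p>A minimal sketch of selecting discriminating words is shown below. Note that this uses a simple smoothed log-odds score as a stand-in, not the actual keyword-discovery algorithm of [11].</p>

```python
from collections import Counter
import math

def discriminating_words(pos_tweets, neg_tweets, k=25):
    """Return the k words most associated with the positive and negative classes.

    Simplified stand-in (smoothed log-odds) for the algorithm used in the paper."""
    pos_counts = Counter(tok for t in pos_tweets for tok in t.split())
    neg_counts = Counter(tok for t in neg_tweets for tok in t.split())
    vocab = set(pos_counts) | set(neg_counts)
    n_pos, n_neg = sum(pos_counts.values()), sum(neg_counts.values())

    def log_odds(w):  # add-one smoothing avoids log(0) for one-class words
        return (math.log((pos_counts[w] + 1) / (n_pos + len(vocab)))
                - math.log((neg_counts[w] + 1) / (n_neg + len(vocab))))

    ranked = sorted(vocab, key=log_odds)
    return ranked[-k:][::-1], ranked[:k]   # (positive words, negative words)
```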
<p>This set of ten features was used to train eight different classifiers: Logistic Regression, Multinomial Naïve Bayes, Decision Tree, Support Vector Machines, Random Forest, Extra Trees, AdaBoost and Gradient Boosting. Each one was trained following three different strategies: Normal, One-vs-Rest and One-vs-One. All these classifiers were implemented with the scikit-learn (https://scikit-learn.org) tools for Python, and we finally kept the one that obtained the best performance for the corresponding variant on the development set. Note that some features were extracted before applying the pre-processing methods.</p>
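<p>The model-selection loop can be sketched with scikit-learn as follows. This is a toy version with random data and only two of the eight algorithms for brevity; the real system compared all eight algorithms under the three strategies and kept the best per variant.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.metrics import f1_score

# Toy stand-in for the ten extracted features and the four sentiment labels
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 4, 200)
X_dev, y_dev = rng.normal(size=(50, 10)), rng.integers(0, 4, 50)

candidates = {}
for name, base in [('LogReg', LogisticRegression(max_iter=1000)),
                   ('AdaBoost', AdaBoostClassifier())]:
    for strategy, wrap in [('Normal', lambda c: c),
                           ('OneVsRest', OneVsRestClassifier),
                           ('OneVsOne', OneVsOneClassifier)]:
        clf = wrap(base).fit(X_train, y_train)
        score = f1_score(y_dev, clf.predict(X_dev), average='macro')
        candidates[f'{strategy} {name}'] = (score, clf)

# Keep the strategy/algorithm combination that is best on the dev set
best = max(candidates, key=lambda k: candidates[k][0])
```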
      </sec>
      <sec id="sec-3-2">
<title>FastText classifier</title>
        <p>
          FastText [
          <xref ref-type="bibr" rid="ref10">10</xref>
] is an efficient library created by Facebook's AI Research (FAIR) lab (https://fasttext.cc/) that allows learning n-gram word and sub-word representations (i.e., vector embeddings) using a supervised or unsupervised learning algorithm on a standard multi-core CPU. The library also allows training a multi-class sentence classifier using a simple linear model (multinomial logistic regression) with a rank constraint.
        </p>
        <p>
          In more detail, for the sentence classi er, the library implements a shallow
neural network that uses as input features the averaged vector embeddings of
the input sentence and a Softmax layer to obtain a probability distribution over
the pre-de ned classes. Several tricks are implemented such as Hierarchical
Softmax [
          <xref ref-type="bibr" rid="ref7">7</xref>
] and a Huffman coding tree [
          <xref ref-type="bibr" rid="ref12">12</xref>
] to reduce the computational complexity when the number of labels is large; bags of n-grams and the hashing trick [
          <xref ref-type="bibr" rid="ref19">19</xref>
] are also used to maintain a fast and memory-efficient mapping of the learned n-grams. Finally, FastText also deals with the problem of Out-of-Vocabulary (OOV) words by training bags of character n-grams; this capability was one of the main motivations for using FastText, as our data analysis revealed a huge proportion of OOVs between the training and dev data (see Section 2).
        </p>
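<p>The core of FastText's sentence classifier, averaged input embeddings followed by a linear layer and a Softmax, can be sketched in plain numpy. This is a toy with random weights: real FastText learns these parameters and additionally uses character and word n-grams, the hashing trick and, optionally, hierarchical Softmax.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {'muy': 0, 'bueno': 1, 'malo': 2, 'dia': 3}       # toy vocabulary
E = rng.normal(scale=0.1, size=(len(vocab), 100))          # word embeddings (dim 100)
W = rng.normal(scale=0.1, size=(100, 4))                   # linear layer to 4 classes

def predict_proba(tweet):
    """Average the word embeddings, apply the linear layer and a Softmax."""
    ids = [vocab[t] for t in tweet.split() if t in vocab]
    if not ids:                          # all tokens OOV in this toy: uniform guess
        return np.full(4, 0.25)
    h = E[ids].mean(axis=0)              # averaged sentence representation
    z = h @ W
    z -= z.max()                         # numerically stable Softmax
    p = np.exp(z)
    return p / p.sum()
```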
        <p>
For our classifier, we first created a set of vector embeddings with dimension 100 using a supervised method trained with the labeled data from the previous TASS challenges [
<xref ref-type="bibr" rid="ref6">6</xref>
]. The idea was to use those pre-trained vector embeddings as a linguistic resource for the following steps. We initially tried using the available pre-trained vectors [
<xref ref-type="bibr" rid="ref8">8</xref>
] released by FAIR for Spanish (https://fasttext.cc/docs/en/crawl-vectors.html), but our results were worse, probably due to differences in pre-processing, the nature of the text (tweets vs. formal text), and the reduced amount of training data available to correctly adapt the 300-dimensional pre-trained vector embeddings. The hyper-parameters used for this pre-training phase were: learning rate 1.0, 5 epochs, word n-grams of size 2, and dimension 100. The rest of the parameters were the FastText defaults.
        </p>
<p>Next, we trained five independent supervised models, one for each variant in the competition, using the pre-trained vector embeddings as input and the corresponding training data for that variant, complying with the rules established by the organizers. Finally, we fine-tuned the model hyper-parameters (learning rate, number of iterations, and size of word n-grams) using the corresponding development set. In practice, the only parameter we fine-tuned was the number of epochs, ranging from 5 to 10 depending on the amount of training data for each variant.</p>
      </sec>
      <sec id="sec-3-3">
<title>BERT classifier</title>
<p>
          BERT stands for Bidirectional Encoder Representations from Transformers. Proposed by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], it is a method of pre-training contextual vector representations that obtains state-of-the-art results on several Natural Language Processing (NLP) tasks such as text classification, question answering, sequence labeling and language model prediction.
        </p>
        <sec id="sec-3-3-1">
          <p>
            One of the main advantages of BERT is that their creators have publicly
released pre-trained English and Multilingual models, which have been trained
on massive corpora of unlabeled data with a new pre-training objective: the
\masked language mode" (MLM), inspired by the Cloze task [
            <xref ref-type="bibr" rid="ref17">17</xref>
]. This change allowed the authors to use bidirectional networks instead of the left-to-right networks used in the earlier OpenAI GPT model [
<xref ref-type="bibr" rid="ref15">15</xref>
]. Finally, another advantage of BERT is that the pre-trained models are ready to be fine-tuned for downstream tasks with a limited amount of data by using transfer learning approaches [
<xref ref-type="bibr" rid="ref16">16</xref>
]. This is done by fine-tuning BERT's final layers while taking advantage of the rich representations of language learned during pre-training.
          </p>
<p>For our classifier, we used the BERT-Base pre-trained Multilingual Cased model, which was trained on 104 languages and consists of 12 layers (Transformer blocks), 768 hidden units and 12 attention heads, summing up to 110M parameters. Then, we created five different models, one for each variant, by fine-tuning the model using only the training data for the corresponding variant and checking the progress on the development set along up to 10 iterations, with a batch size of 32 samples.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Averaging Ensemble</title>
<p>We used the three former classifiers to obtain a probability distribution over the classes (i.e., multi-label classification) for each tweet. Then, we implemented a soft-voting ensemble by averaging the three distributions and assigning the tweet to the most likely class. As the feature-based classifier was trained with several different algorithms and strategies, we had 24 different results for each variant; therefore, we selected the classifier that obtained the best performance on the dev set, as shown in Tables 3 and 4. The selected feature-based classifier for each variant was: CR: Normal AdaBoost; ES: Normal Naïve Bayes; MX: One-vs-Rest AdaBoost; PE: Normal AdaBoost; UY: One-vs-Rest AdaBoost.</p>
        <p>From the ensemble explained above, we decided to submit, for each classifier, the configuration with the best performance on the development set, and we followed the same approach for the test set. Our results for each classifier are shown in Tables 5 (feature-based), 6 (FastText) and 7 (BERT); the results for the final ensemble are presented in Table 8.</p>
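<p>The soft-voting step itself reduces to averaging the three class distributions and taking the arg-max (a sketch; the class ordering used here is an assumption):</p>

```python
import numpy as np

def soft_vote(p_feat, p_fasttext, p_bert, classes=('P', 'N', 'NEU', 'NONE')):
    """Average the three classifiers' distributions; return the most likely class."""
    avg = (np.asarray(p_feat) + np.asarray(p_fasttext) + np.asarray(p_bert)) / 3.0
    return classes[int(avg.argmax())]
```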
<p>It is important to mention that, when evaluating on the test set, we used the best hyper-parameters found on the development set and then combined the training and dev sets to train the final classification models; we then evaluated the resulting models on the test set. For the cross-lingual setting, when combining the training and development data, we took care to exclude the development set of the variant to be tested.</p>
<p>As we can see in the results, the ensemble outperformed the individual classifiers most of the time on the test set for both the cross- and mono-lingual settings. We also found that the feature-based classifier performed the worst in most cases (except for the Peruvian variant, where BERT did); however, we think it provides complementary information, as we found when performing the optimizations on the development set.</p>
        <p>In this paper we have described the first participation of GTH-UPM in the "Sentiment Analysis at SEPLN" (TASS 2019) task at tweet level. Our final system consisted of an ensemble using average voting over three different multi-label text classifiers: a) a feature-based classifier using scikit-learn, b) a shallow neural network using FastText, and c) a transfer-learning approach using BERT. Our final system obtained an averaged F1 score of 46.0% and 45.0% on the test set for the mono- and cross-lingual settings, respectively. Our system performed very well when compared with the other submitted systems across the two settings, showing that our proposal was robust enough.</p>
        <p>
As future work, we are planning to perform a more exhaustive analysis of the results given by the three classifiers, fine-tune the models' hyper-parameters, and test new features such as the ones proposed in [
          <xref ref-type="bibr" rid="ref3">3</xref>
] and new pre-trained vector embeddings such as ELMo [
          <xref ref-type="bibr" rid="ref14">14</xref>
] or ULMFiT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
<p>The work leading to these results has been supported by the following projects: AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-84593-C2-1-R). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vovsha</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rambow</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passonneau</surname>
          </string-name>
          , R.:
          <article-title>Sentiment analysis of twitter data</article-title>
          .
          <source>In: Proceedings of the Workshop on Language in Social Media (LSM</source>
          <year>2011</year>
          ). pp.
30-38
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bermingham</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeaton</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On using twitter to monitor political sentiment and predict election results</article-title>
          .
          <source>In: Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology</source>
          . pp.
2-10
          . SAAIP 2011 (
          <year>April 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>RETUYT-InCo at TASS 2018: Sentiment analysis in Spanish variants using neural networks and SVM</article-title>
          .
          <source>Proceedings of TASS 2172</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
<article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
<string-name>
            <surname>Díaz-Galiano</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          , et al.:
          <article-title>Overview of TASS 2019</article-title>
          . CEUR-WS, Bilbao, Spain (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
6. García Cumbreras, M.A., Martínez Cámara, E., Villena Román, J., García Morera, J.:
          <article-title>TASS 2015: the evolution of the Spanish opinion mining systems</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goodman</surname>
          </string-name>
          , J.:
          <article-title>Classes for fast maximum entropy training</article-title>
          .
          <source>arXiv preprint cs/0108006</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning word vectors for 157 languages</article-title>
          .
          <source>In: Proceedings of the International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          )
          <article-title>(</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>Universal language model fine-tuning for text classification</article-title>
          . arXiv preprint arXiv:1801.06146 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
<article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          . pp. 427-431. Association for Computational Linguistics (
          <year>April 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>King</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lam</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          :
          <article-title>Computer-assisted keyword and document set discovery from unstructured text</article-title>
          .
          <source>American Journal of Political Science</source>
          <volume>61</volume>
          (
          <issue>4</issue>
          ),
971-988
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
<article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>Opinion mining and sentiment analysis</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>1</issue>
-2), 1-135
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narasimhan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salimans</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Improving language understanding with unsupervised learning</article-title>
          .
          <source>Technical report, OpenAI</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Neural Transfer Learning for Natural Language Processing</article-title>
          .
          <source>Ph.D. thesis</source>
          , National University of Ireland, Galway (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>W.L.</given-names>
          </string-name>
          :
          <article-title>Cloze procedure: a new tool for measuring readability</article-title>
          .
          <source>Journalism Bulletin</source>
          <volume>30</volume>
          (
          <issue>4</issue>
          ),
          <fpage>415</fpage>
          –
          <lpage>433</lpage>
          (
          <year>1953</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Vinodhini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandrasekaran</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          :
          <article-title>Sentiment analysis and opinion mining: a survey</article-title>
          .
          <source>International Journal</source>
          <volume>2</volume>
          (
          <issue>6</issue>
          ),
          <fpage>282</fpage>
          –
          <lpage>292</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dasgupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Attenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Feature hashing for large scale multitask learning</article-title>
          .
          <source>arXiv preprint arXiv:0902.2206</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>