=Paper= {{Paper |id=Vol-1619/paper3 |storemode=property |title=SMILES: Twitter Emotion Classification using Domain |pdfUrl=https://ceur-ws.org/Vol-1619/paper3.pdf |volume=Vol-1619 |authors=Bo Wang,Maria Liakata,Arkaitz Zubiaga,Rob Procter,Eric Jensen |dblpUrl=https://dblp.org/rec/conf/ijcai/WangLZPJ16 }} ==SMILES: Twitter Emotion Classification using Domain== https://ceur-ws.org/Vol-1619/paper3.pdf
             SMILE: Twitter Emotion Classification using Domain Adaptation
         Bo Wang           Maria Liakata      Arkaitz Zubiaga     Rob Procter                                Eric Jensen
                                      Department of Computer Science
                                           University of Warwick
                                                Coventry, UK
                                {bo.wang, m.liakata, e.jensen}@warwick.ac.uk

Abstract

Despite the widespread research interest in social media sentiment analysis, sentiment and emotion classification across different domains and on Twitter data remains a challenging task. Here we set out to find an effective approach for tackling a cross-domain emotion classification task on a set of Twitter data involving social media discourse around arts and cultural experiences, in the context of museums. While most existing work in domain adaptation has focused on feature-based and/or instance-based adaptation methods, in this work we study a model-based adaptive SVM approach as we believe its flexibility and efficiency are more suitable for the task at hand. We conduct a series of experiments and compare our system with a set of baseline methods. Our results not only show a superior performance in terms of accuracy and computational efficiency compared to the baselines, but also shed light on how different ratios of labelled target-domain data used for adaptation can affect classification performance.

1 Introduction

With the advent and growth of social media as a ubiquitous platform, people increasingly discuss and express opinions and emotions towards all kinds of topics and targets. One of the topics that has been relatively unexplored in the scientific community is that of emotions expressed towards arts and cultural experiences. A survey conducted in 2012 by the British TATE Art Galleries found that 26 percent of the respondents had posted some kind of content online, such as blog posts, tweets or photos, about their experience in the art galleries during or after their visit [Villaespesa, 2013]. When cultural tourists share information about their experience in social media, this real-time communication and spontaneous engagement with art and culture not only broadens its target audience but also provides a new space where valuable insight shared by its customers can be garnered. As a result museums, galleries and other cultural venues have embraced social media such as Twitter, and actively used it to promote their exhibitions, organise participatory projects and/or create initiatives to engage with visitors, collecting valuable opinions and feedback (e.g. museum tweetups). This gold mine of user opinions has sparked an increasing research interest in the interdisciplinary field of social media and museum study [Fletcher and Lee, 2012; Villaespesa, 2013; Drotner and Schrøder, 2014].

We have also seen a surge of research in sentiment analysis with over 7,000 articles written on the topic [Feldman, 2013], for applications ranging from analyses of movie reviews [Pang and Lee, 2008] and stock market trends [Bollen et al., 2011] to forecasting election results [Tumasjan et al., 2010]. Supervised learning algorithms that require labelled training data have been successfully used for in-domain sentiment classification. However, cross-domain sentiment analysis has been explored to a much lesser extent. For instance, the phrase “light-weight” carries positive sentiment when describing a laptop but quite the opposite when it is used to refer to politicians. In such cases, a classifier trained on one domain may not work well on other domains. A widely adopted solution to this problem is domain adaptation, which allows building models from a fixed set of source domains and deploying them into a different target domain. Recent developments in sentiment analysis using domain adaptation are mostly based on feature-representation adaptation [Blitzer et al., 2007; Pan et al., 2010; Bollegala et al., 2011], instance-weight adaptation [Jiang and Zhai, 2007; Xia et al., 2014; Tsakalidis et al., 2014] or combinations of both [Xia et al., 2013; Liu et al., 2013]. Despite its recent increase in popularity, the use of domain adaptation for sentiment and emotion classification across topics on Twitter is still largely unexplored [Liu et al., 2013; Tsakalidis et al., 2014; Townsend et al., 2014].

In this work we set out to find an effective approach for tackling the cross-domain emotion classification task on Twitter, while also furthering research in the interdisciplinary study of social media discourse around arts and cultural experiences¹. We investigate a model-based adaptive-SVM approach that was previously used for video concept detection [Yang et al., 2007] and compare with a set of domain-dependent and domain-independent strategies. Such a model-based approach allows us to directly adapt existing models to the new target-domain data without having to generate domain-dependent features or adjust weights for each of

¹ SMILE project: http://www.culturesmile.org/



Proceedings of the 4th Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2016), IJCAI 2016, pages 15-21,
                                          New York City, USA, July 10, 2016.
the training instances. We conduct a series of experiments and evaluate the proposed system² on a set of Twitter data about museums, annotated by three annotators from the social sciences. The aim is to maximise the use of the base classifiers that were trained from a general-domain corpus, and through domain adaptation minimise the classification error rate across 5 emotion categories: anger, disgust, happiness, surprise and sadness. Our results show that adapted SVM classifiers achieve significantly better performance than out-of-domain classifiers and also suggest a competitive performance compared to in-domain classifiers. To the best of our knowledge this is the first attempt at cross-domain emotion classification for Twitter data.

2 Related Work

Most existing approaches can be classified into two categories: feature-based adaptation and instance-based adaptation. The former seeks to construct new adaptive feature representations that reduce the difference between domains, while the latter aims to sample and re-weight source domain training data for use in classification within the target domain.

With respect to feature domain adaptation, [Blitzer et al., 2007] applied the structural correspondence learning (SCL) algorithm for cross-domain sentiment classification. SCL chooses a set of pivot features with highest mutual information to the domain labels, and uses these pivot features to align other features by training N linear predictors. Finally it computes singular value decomposition (SVD) to construct low-dimensional features to improve its classification performance. A small amount of target domain labelled data is used to learn to deal with misaligned features from SCL. [Townsend et al., 2014] found that SCL did not work well for cross-domain adaptation of sentiment on Twitter due to the lack of mutual information across the Twitter domains and uses subjective proportions as a backoff adaptation approach. [Pan et al., 2010] proposed to construct a bipartite graph from a co-occurrence matrix between domain-independent and domain-specific features to reduce the gap between different domains and use spectral clustering for feature alignment. The resulting clusters are used to represent data examples and train sentiment classifiers. They used mutual information between features and domains to classify domain-independent and domain-specific features, but in practice this also introduces mis-classification errors. [Bollegala et al., 2011] describes a cross-domain sentiment classification approach using an automatically created sentiment sensitive thesaurus. Such a thesaurus is constructed by computing the point-wise mutual information between a lexical element u and a feature as well as relatedness between two lexical elements. The problem with these feature adaptation approaches is that they try to connect domain-dependent features to known or common features under the assumption that parallel sentiment words exist in different domains, which is not necessarily applicable to various topics in tweets [Liu et al., 2013]. [Glorot et al., 2011] proposes a deep learning system to extract features that are highly beneficial for the domain adaptation of sentiment classifiers, under the intuition that deep learning algorithms learn intermediate concepts (between raw input and target) and these intermediate concepts could yield better transfer across domains.

When it comes to instance adaptation, [Jiang and Zhai, 2007] proposes an instance weighting framework that prunes “misleading” instances and approximates the distribution of instances in the target domain. Their experiments show that adding some labelled target domain instances and assigning higher weights to them performs better than either removing “misleading” source domain instances using a small number of labelled target domain data or bootstrapping unlabelled target instances. [Xia et al., 2014] adapts the source domain training data to the target domain based on a logistic approximation. [Tsakalidis et al., 2014] learns different classifiers on different sets of features and combines them in an ensemble model. Such an ensemble model is then applied to part of the target domain test data to create new training data (i.e. documents for which different classifiers had the same predictions). We include this ensemble method as one of our baseline approaches for evaluation and comparison.

In contrast with most cross-domain sentiment classification works, we use a model-based approach proposed in [Yang et al., 2007], which directly adapts existing classifiers trained on general-domain corpora. We believe this is more efficient and flexible [Yang and Hauptmann, 2008] for our task. We evaluate on a set of manually annotated tweets about cultural experiences in museums and conduct a finer-grained classification of emotions conveyed (i.e. anger, disgust, happiness, surprise and sadness).

3 Datasets

We use two datasets, a source-domain dataset and a target-domain dataset, which enables us to experiment on domain adaptation. The source-domain dataset we adopted is the general-domain Twitter corpus created by [Purver and Battersby, 2012], which was generated through distant supervision using hashtags and emoticons associated with 6 emotions: anger, disgust, fear, happiness, surprise and sadness.

Our target-domain dataset that allows us to perform experiments on emotions associated with cultural experiences consists of a set of tweets pertaining to museums. A collection of tweets mentioning one of the following Twitter handles associated with British museums was gathered between May 2013 and June 2015: @camunivmuseums, @fitzmuseum_uk, @kettlesyard, @maacambridge, @iciabath, @thelmahulbert, @rammuseum, @plymouthmuseum, @tateliverpool, @tate_stives, @nationalgallery, @britishmuseum, @_thewhitechapel. These are all museums associated with the SMILE project. A subset of 3,759 tweets was sampled from this collection for manual annotation. We developed a tool for manual annotation of the emotion expressed in each of these tweets. The options for the annotation of each tweet included 6 different emotions; the six Ekman emotions as in [Purver and Battersby, 2012], with the exception of ‘fear’ as it never featured in the context of tweets about museums. Two extra annotation options were included to indicate that a tweet should have no code, indicating that a tweet was

² The code can be found at http://bit.ly/1WHup4b



not conveying any emotions, and not relevant when it did not refer to any aspects related to the museum in question. The annotator could choose more than one emotion for a tweet, except when no code or not relevant were selected, in which case no additional options could be picked. The annotation of all the tweets was performed independently by three sociology PhD students. Out of the 3,759 tweets that were released for annotation, at least 2 of the annotators agreed in 3,085 cases (82.1%). We use the collection resulting from these 3,085 tweets as our target-domain dataset for classifier adaptation and evaluation. Note that tweets labelled as no code or not relevant are included in our dataset to reflect a more realistic data distribution on Twitter, while our source-domain data doesn’t have any no code or not relevant tweets.

Emotion                  No. of tweets   % of tweets
no code                  1572            41.8%
happy                    1137            30.2%
not relevant             214             5.7%
anger                    57              1.5%
surprise                 35              0.9%
sad                      32              0.9%
happy & surprise         11              0.3%
happy & sad              9               0.2%
disgust & anger          7               0.2%
disgust                  6               0.2%
sad & anger              2               0.1%
sad & disgust            2               0.1%
sad & disgust & anger    1               <0.1%

Table 2: Target data emotion distribution
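The agreement filter described above (keeping a tweet when at least 2 of the 3 annotators agree) can be sketched as follows. This is only an illustration under a stated assumption: the paper does not say how agreement was computed for multi-label annotations, so the sketch assumes two annotators agree when they chose exactly the same set of labels. The function name `agreed_labels` and the example annotations are hypothetical.

```python
from collections import Counter

def agreed_labels(annotations, min_agree=2):
    """Return the label set chosen by at least `min_agree` annotators, else None.

    `annotations` holds one set of labels per annotator. Assumption (not from
    the paper): agreement means identical label sets.
    """
    counts = Counter(frozenset(labels) for labels in annotations)
    label_set, votes = counts.most_common(1)[0]
    return label_set if votes >= min_agree else None

# Hypothetical annotations from three annotators for three tweets.
tweet_a = [{"happy"}, {"happy"}, {"no code"}]
tweet_b = [{"happy", "surprise"}, {"surprise", "happy"}, {"sad"}]
tweet_c = [{"happy"}, {"sad"}, {"no code"}]

kept = [t for t in (tweet_a, tweet_b, tweet_c) if agreed_labels(t) is not None]
print(len(kept))  # tweets that would enter the target-domain dataset
```

Treating a combination such as "happy & surprise" as a single unit mirrors how Table 2 reports combined labels, but a per-label majority vote would be an equally plausible reading.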
The distribution of emotion annotations in Table 2 shows a remarkable class imbalance, where happy accounts for 30.2% of the tweets, while the other emotions are seldom observed in the museum dataset. There is also a large number of tweets with no emotion associated (41.8%). One intuitive explanation is that Twitter users tend to express positive and appreciative emotions regarding their museum experiences and shy away from making negative comments. This can also be demonstrated by comparing the museum data emotion distribution to our general-domain source data as seen in Figure 1, where the sample ratio of positive instances is shown for each emotion category.

[Figure 1: Source and target data distribution comparison. Bars for source-domain data and target-domain data over anger, disgust, happy, surprise and sad; vertical axis from 0% to 100%.]

To quantify the difference between two text datasets, Kullback-Leibler (KL) divergence has been commonly used before [Dai et al., 2007]. Here we use the KL-divergence method proposed by [Bigi, 2003], as it suggests a back-off smoothing method that deals with the data sparseness problem. Such a back-off method keeps the probability distributions summing to 1 and allows operating on the entire vocabulary, by introducing a normalisation coefficient and a very small threshold probability for all the terms that are not in the given vocabulary. Since our source-domain data contains many more tweets than the target-domain data, we have randomly sub-sampled the former and made sure the two data sets have similar vocabulary size in order to avoid biases. We removed stop words, user mentions, URL links and re-tweet symbols prior to computing the KL-divergence. Finally we randomly split each data set into 10 folds and compute the in-domain and cross-domain symmetric KL-divergence (KLD) value between every pair of folds. Table 1 shows the computed KL-divergence averages. It can be seen that the KL-divergence between the two data sets (i.e. KLD($D_{src}$ || $D_{tar}$)) is roughly twice as large as the in-domain KL-divergence values. This suggests a significant difference between data distributions in the two domains and thus justifies our need for domain adaptation.

Data domain                    Averaged KLD value
KLD($D_{src}$ || $D_{src}$)    2.391
KLD($D_{tar}$ || $D_{tar}$)    2.165
KLD($D_{src}$ || $D_{tar}$)    4.818

Table 1: In-domain and cross-domain KL-divergence values

4 Methodology

Given the source-domain $D_{src}$ and target-domain $D_{tar}$, we have one or k sets of labelled source-domain data denoted as $\{(x^k_i, y^k_i)\}_{i=1}^{N_{src}}$ in $D_{src}$, where $x^k_i$ is the ith feature vector with each element as the value of the corresponding feature and $y^k_i$ are the emotion categories that the ith instance belongs to. Suppose we have some classifiers $f^k_{src}(x)$ that have been trained on the source-domain data (named as the auxiliary classifiers in [Yang et al., 2007]) and a small set of labelled target-domain data as $D^l_{tar}$ where $D_{tar} = D^l_{tar} \cup D^u_{tar}$, our goal is to adapt $f^k_{src}(x)$ to a new classifier $f_{tar}(x)$ based on the small set of labelled examples in $D^l_{tar}$, so it can be used to accurately predict the emotion class of unseen data from $D^u_{tar}$.

4.1 Base Classifiers

Our base classifiers are the classifiers that have been trained on the source-domain data $\{(x_i, y_i)\}_{i=1}^{N_{src}}$, where $y_i \in \{1, ..., K\}$ with K referring to the number of emotion categories. In our work, we use Support Vector Machines (SVMs) in a “one-versus-all” setting, which trains K binary classifiers, each separating one class from the rest. We chose this as a better way of dealing with class imbalance in a multi-class scenario.
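The one-versus-all setting described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: a small linear classifier fitted by sub-gradient descent on the regularised hinge loss stands in for a full SVM solver, and the function names (`train_binary`, `train_one_vs_all`, `predict`), the hyperparameters and the 2-dimensional toy features are all assumptions made for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_binary(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit a linear classifier on labels in {-1, +1} by sub-gradient descent
    on the L2-regularised hinge loss (a stand-in for an SVM solver)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            violated = t * (dot(w, x) + b) < 1    # margin check before the step
            w = [wi * (1 - lr * lam) for wi in w]  # shrink from the L2 penalty
            if violated:
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def train_one_vs_all(X, labels):
    """Train K binary classifiers, each separating one class from the rest."""
    models = {}
    for c in sorted(set(labels)):
        y = [1 if label == c else -1 for label in labels]
        models[c] = train_binary(X, y)
    return models

def predict(models, x):
    """Pick the class whose binary classifier gives the highest score."""
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])

# Hypothetical toy data: three well-separated clusters in two dimensions.
X = [[0.0, 3.0], [0.5, 3.5], [-0.5, 3.2],
     [-3.0, -2.0], [-3.5, -1.5], [-2.8, -2.4],
     [3.0, -2.0], [3.4, -1.6], [2.7, -2.3]]
labels = ["happy"] * 3 + ["sad"] * 3 + ["anger"] * 3

models = train_one_vs_all(X, labels)
print([predict(models, x) for x in X])
```

Each of the K binary problems sees every training instance, so a rare class still gets its own decision boundary instead of being absorbed into a single multi-class objective.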



Features

The base classifiers are trained on 3 sets of features generated from the source-domain data: (i) n-grams, (ii) lexicon features, (iii) word embedding features.

N-gram models have long been used in NLP for various tasks. We used unigrams, bigrams and trigrams (1-2-3 grams), after filtering out all the stop words, as our n-gram features. We construct 32 lexicon features from 9 Twitter-specific and general-purpose lexica. Each lexicon provides either a numeric sentiment score, or categories where a category could correspond to a particular emotion or a strong/weak positive/negative sentiment.

The use of word embedding features to represent the context of words and concepts has been shown to be very effective in boosting the performance of sentiment classification. In this work we use a set of word embeddings learnt using a sentiment-specific method in [Tang et al., 2014] and another set of general word embeddings trained with 5 million tweets by [Vo and Zhang, 2015]. An additional set of word embeddings that we trained ourselves on 3 million tweets did not increase performance. Pooling functions are essential and particularly effective for feature selection from dense embedding feature vectors. [Tang et al., 2014] applied the max, min and mean pooling functions and found them to be highly useful. We tested and evaluated six pooling functions, namely sum, max, min, mean, std (i.e. standard deviation) and product, and selected sum, max and mean as they led to the best performance.

4.2 Classifier Adaptation

[Yang et al., 2007] proposes a many-to-one SVM adaptation model, which directly modifies the decision function of an ensemble of existing classifiers $f^k_{src}(x)$, trained with one or k sets of labelled source-domain data in $D_{src}$, and thus creates a new adapted classifier $f_{tar}(x)$ for the target-domain $D_{tar}$. The adapted classifier has the following form:

$$f_{tar}(x) = \sum_{k=1}^{M} \tau^k f^k_{src}(x) + \Delta f(x) \qquad (1)$$

where $\tau^k \in (0, 1)$ is the weight of each base classifier $f^k_{src}(x)$. $\Delta f(x)$ is the perturbation function that is learnt from a small set of labelled target-domain data in $D^l_{tar}$. As shown in [Yang et al., 2007] it has the form:

$$\Delta f(x) = w^T \phi(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) \qquad (2)$$

where $w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i)$ are the model parameters to be estimated from the labelled examples in $D^l_{tar}$ and $\alpha_i$ is the feature coefficient of the ith labelled target-domain instance. Furthermore $K(\cdot, \cdot) \equiv \phi(\cdot)^T \phi(\cdot)$ is the kernel function induced from the nonlinear feature mapping. $\Delta f(x)$ is learnt in a framework that aims to minimise the regularised empirical risk [Yang, 2009]. The adapted classifier $f_{tar}(x)$ learnt under this framework tries to minimise the classification error on the labelled target-domain examples and the distance from the base classifiers $f^k_{src}(x)$, to achieve a better bias-variance trade-off.

which allows the weight controls $\{\tau^k\}_{k=1}^{M}$ of the base classifiers $f^k_{src}(x)$ to be learnt automatically based on their classification performance on the small set of labelled target-domain examples. To achieve this, [Yang and Hauptmann, 2008] adds another regulariser to the regularised loss minimisation framework, with the objective function of training the adaptive classifier now written as:

$$\min_{w,\tau,\xi} \; \frac{1}{2} w^T w + \frac{1}{2} B (\tau)^T \tau + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t.} \quad y_i \sum_{k=1}^{M} \tau^k f^k_{src}(x_i) + y_i w^T \phi(x_i) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall (x_i, y_i) \in D^l_{tar} \qquad (3)$$

where $\frac{1}{2} (\tau)^T \tau$ measures the overall contribution of base classifiers. Thus this objective function seeks to avoid over-reliance on the base classifiers and also an over-complex $\Delta f(\cdot)$. The two goals are balanced by the parameter B. By rewriting this objective function as a minimisation problem of a Lagrange (primal) function and setting its derivatives with respect to $w$, $\tau$ and $\xi$ to zero, we have:

$$w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i), \qquad \tau^k = \frac{1}{B} \sum_{i=1}^{N} \alpha_i y_i f^k_{src}(x_i) \qquad (4)$$

where $\tau^k$ is a weighted sum of $y_i f^k_{src}(x_i)$ and it indicates the classification performance of $f^k_{src}$ on the target-domain. Therefore base classifiers are assigned a larger weight if they classify the labelled target-domain data well. Now given (1), (2) and (4), the new decision function can be formulated as:

$$f_{tar}(x) = \frac{1}{B} \sum_{k=1}^{M} \sum_{i=1}^{N} \alpha_i y_i f^k_{src}(x_i) f^k_{src}(x) + \Delta f(x)$$
$$= \sum_{i=1}^{N} \alpha_i y_i \Big( K(x_i, x) + \frac{1}{B} \sum_{k=1}^{M} f^k_{src}(x_i) f^k_{src}(x) \Big) \qquad (5)$$

Comparing (5) with a standard SVM model $f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x)$, this multi-classifier adaptation model can be interpreted as a way of adding the predicted labels of base classifiers on the target-domain as additional features. Under this interpretation the scalar B balances the contribution of the original features and additional features.

4.3 Data Preprocessing

The preprocessing techniques applied include substituting URL links with the string “URL” and user mentions with “@USERID”, removing the hashtag symbol “#”, and normalising emoticons and abbreviations³.

5 Results and Evaluation

In this section we present the experimental results and compare our proposed adaptation system with a set of domain-dependent and domain-independent strategies. We also investigate the effect of different sizes of the labelled target-domain data on the classification performance.
   In this work we use the extended multi-classifier adapta-
tion framework proposed by [Yang and Hauptmann, 2008],                     3
                                                                               http://bit.ly/1U7fiQR
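
The interpretation of Eq. (5) given in Section 4.2 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a linear kernel (the experiments use an RBF kernel), and the names `linear_kernel`, `combined_kernel`, `augment`, the two toy base classifiers and the value of `B` are all hypothetical. Under these assumptions, the bracketed term of Eq. (5) equals an ordinary dot product between feature vectors extended with the base-classifier outputs scaled by 1/sqrt(B).

```python
import math


def linear_kernel(u, v):
    """Plain dot product, standing in for K(x_i, x)."""
    return sum(a * b for a, b in zip(u, v))


def combined_kernel(x_i, x, base_classifiers, B, kernel=linear_kernel):
    """Bracketed term of Eq. (5): K(x_i, x) + (1/B) * sum_k f_k(x_i) * f_k(x)."""
    base_term = sum(f(x_i) * f(x) for f in base_classifiers)
    return kernel(x_i, x) + base_term / B


def augment(x, base_classifiers, B):
    """Append each base-classifier output, scaled by 1/sqrt(B), to the features."""
    scale = 1.0 / math.sqrt(B)
    return list(x) + [scale * f(x) for f in base_classifiers]


# Hypothetical stand-ins for source-domain decision functions f_src^k.
base_classifiers = [
    lambda x: 0.8 * x[0] - 0.3 * x[1],
    lambda x: -0.5 * x[0] + 1.2 * x[1] + 0.1,
]

x_i = [1.0, 2.0]
x = [0.5, -1.5]
B = 4.0  # illustrative value only

lhs = combined_kernel(x_i, x, base_classifiers, B)
rhs = linear_kernel(augment(x_i, base_classifiers, B),
                    augment(x, base_classifiers, B))
print(abs(lhs - rhs) < 1e-9)
```

As B grows, the second term vanishes and the combined kernel approaches the original one, which is the sense in which B balances the original and the additional features.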



5.1   Adaptation Baselines
The baseline methods and our proposed system are the following:

   • BASE: the base classifiers use either one set of features or all three feature sets (i.e. BASE-all). As an example, the BASE-embedding classifier is trained and tuned with all source-domain data using only word-embedding features, then tested on 30% of our target-domain data. We use the LIBSVM implementation [Chang and Lin, 2011] of SVM for building the base classifiers.
   • TARG: trained and tuned with 70% labelled target-domain data. Since this model is entirely trained from the target domain, it can be considered as the performance upper-bound that is very hard to beat.
   • AGGR: an aggregate model trained from all source-domain data and 70% labelled target-domain data.
   • ENSEMBLE: combines the base classifiers in an ensemble model. It then performs classification on 30% of the target-domain data to generate new training data, as described in Section 2.
   • ADAPT: our domain adapted models using either one base classifier trained with all feature sets (i.e. ADAPT-1-model) or an ensemble of three standalone base classifiers, each trained with one set of features (i.e. ADAPT-3-model). We use 30% of the labelled target-domain data for classifier adaptation and parameter tuning described in Section 4.2.

The above methods are all tested on the same 30% labelled target-domain data in order to make their results comparable. In addition we perform in-domain cross-validation and evaluation only on our source-domain data using all feature sets; this model is named SRC-all. We use an RBF kernel function with the default setting of the gamma parameter in all the methods, as it outperforms the linear kernel; the polynomial kernel gives similar performance but requires more parameter tuning. For the cost factor C and the class weight parameter (except in the SRC-all model) we conduct a cross-validated grid-search over the same set of parameter values for all the methods. This makes sure our ADAPT models are comparable with BASE, TARG, ENSEMBLE and AGGR. For ADAPT-3-model we also optimise the base classifier weight parameters, denoted as $\tau^k$ in Eq. (1), as described in Section 4.2.

5.2   Experimental Results
We report the experimental results in Table 3, with three categories of models: 1) in-domain no adaptation methods, i.e. BASE and TARG models, TARG being the upper-bound for performance evaluation; 2) the domain adaptation baselines, i.e. AGGR and ENSEMBLE; and 3) our adaptation systems (ADAPT models). As can be seen, the classification performances reported for emotions other than “happy” are below 50 in terms of F1 score, with some results being as low as 0.00. This is caused by the class imbalance issue within these emotions as shown in Table 2 and Figure 1, especially for the emotion “disgust” which has only 16 tweets. We tried to balance this issue using a class weight parameter, but it is still very challenging to overcome without acquiring more labelled data than we currently have. It especially affects our domain adaptation, as the parameters in Eq. (3) cannot be properly optimised.
   Since there are very few tweets annotated as “disgust”, we do not consider the “disgust” emotion as part of our experiment evaluation here. As seen in Table 3, BASE models are outperformed significantly by all other methods (except ENSEMBLE, which performs only slightly better than the BASE models), indicating the importance of domain adaptation. With the exception of the ADAPT-3-model for “Anger”, our ADAPT models consistently outperform AGGR-all and ENSEMBLE while showing competitive performance compared to the upper-bound baseline, TARG-all. We also observe that the aggregation model AGGR-all is outperformed by TARG-all, indicating such domain knowledge cannot be transferred effectively to a different domain by simply modelling from aggregated data from both domains. In comparison, our ADAPT models are able to leverage the large and balanced source-domain data (as base classifiers) unlike TARG, while adjusting the contribution of each base classifier unlike AGGR.
   When comparing our ADAPT models, we find that in most cases models adapted from multiple base classifiers beat the ones adapted from one single base classifier, even though the same features are used in both scenarios. This shows the benefit of the multi-classifier adaptation approach, which aims to maximise the utility of each base classifier. Two additional models, namely ADAPT-1-modelx and ADAPT-3-modelx, replicate ADAPT-1/3-model except that they also use 40% of the target-domain data for tuning the model parameters. On average their results are only slightly better than those of ADAPT-1/3-model, which use 30% of the target-domain data for both training and parameter optimisation. This is especially prominent with “happiness” where we have sufficient target-domain instances and less of a class imbalance issue. This shows our ADAPT models are able to transfer knowledge effectively across different domains with a small amount of labelled target-domain data. More analysis on the impact of adaptation sample ratios is given in Section 5.3.
   We can also evaluate the performance of each model by comparing its efficiency in terms of computation time. Here we report the total computation time taken for all the above methods except BASE, for the emotion “happiness”. This computation consists of adaptation training, grid-search over the same set of parameter values and final testing. As seen in Table 4, compared to other out-of-domain strategies the proposed ADAPT models are more efficient to train, especially in comparison with AGGR, which is an order of magnitude more costly due to the inclusion of source-domain data. Within the ADAPT models, ADAPT-1-model requires less time to train since it only has one base classifier for adaptation.

5.3   Effect of Adaptation Training Sample Ratios
Here we evaluate the effect of different ratios of the labelled target-domain data on the overall classification performance for the emotion “happiness”. Figure 2 shows the normalised F1 scores and computation time of each ADAPT



                          Anger                 Disgust                 Happy                 Surprise                 Sad
Model                   P      R     F1        P      R     F1        P      R     F1        P      R     F1        P      R     F1
BASE-ngrams          5.77  40.91  10.11     0.49  100.0   0.97    37.62  100.0  54.67     1.46  100.0   2.87     1.50  100.0   2.96
BASE-lexicon         2.59  90.91   5.03     0.55  100.0   1.10    38.43  98.96  55.36     0.00   0.00   0.00     2.54  93.33   4.94
BASE-embedding       2.06  72.73   4.02     0.00   0.00   0.00    39.18  96.11  55.66     2.00  60.00   3.88     1.49  80.00   2.92
BASE-all             2.01  59.09   3.88     5.00  20.00   8.00    38.75  98.19  55.57     1.69  66.67   3.29     1.58  86.67   3.11
TARG-all            36.00  40.91  38.30     0.00   0.00   0.00    78.04  84.72  81.24    20.83  33.33  25.64    18.75  20.00  19.35
AGGR-all            10.71  27.27  15.38    33.33  20.00  25.00    64.79  86.27  74.00     5.88  11.11   7.69     4.17  20.00   6.90
ENSEMBLE             2.11  100.0   4.13     0.49  100.0   0.97    45.20  83.55  58.66     2.70  11.11   4.35     1.46  100.0   2.88
ADAPT-1-model       16.28  31.82  21.54     0.59  80.00   1.18    79.34  80.57  79.95    11.11  13.33  12.12    100.0   6.67  12.50
ADAPT-3-model       20.00   9.09  12.50     0.00   0.00   0.00    82.11  80.83  81.46     8.14  46.67  13.86     8.77  33.33  13.89
ADAPT-1-modelx      21.43  13.64  16.67    100.0  20.00  33.33    80.53  79.27  79.90    12.50  26.67  17.02    16.67  13.33  14.81
ADAPT-3-modelx      20.00  22.73  21.28     1.82  20.00   3.33    80.30  83.42  81.83    12.50  26.67  17.02    10.20  33.33  15.63
SRC-all             93.57  93.37  93.46    99.05  98.73  98.89    81.87  85.91  83.83    96.25  98.03  97.13    91.04  92.51  91.76

                                             Table 3: Model performance comparison
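
As a reading aid for Table 3, the F1 column is the harmonic mean of precision (P) and recall (R). The short sketch below recomputes it for a few cells copied from the table. The helper name `f1_score` and the choice of rows are illustrative, and the convention of returning 0.0 when P and R are both zero is an assumption that happens to match the 0.00 entries.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 3.
rows = {
    "TARG-all / Happy": (78.04, 84.72, 81.24),
    "ADAPT-3-model / Happy": (82.11, 80.83, 81.46),
    "AGGR-all / Anger": (10.71, 27.27, 15.38),
    "BASE-lexicon / Surprise": (0.00, 0.00, 0.00),
}

recomputed = {name: round(f1_score(p, r), 2) for name, (p, r, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {reported:.2f}")
```

Because P and R in the table are themselves rounded, recomputed values can differ from the reported F1 in the last digit.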




                (a) C = 1                                    (b) C = 3                                        (c) C = 10

                    Figure 2: Performance of each ADAPT model with C = 1,3,10 vs. its computation time
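
The sampling protocol behind Figure 2 (Section 5.3) keeps one 30% test set fixed while the adaptation set grows from 10% to 70% of the target-domain data. The sketch below is one possible way to generate such splits, not the authors' code. The function name `adaptation_splits`, the particular ratios, the random seed and the nesting of smaller subsets inside larger ones are assumptions, and class stratification is omitted for brevity.

```python
import random


def adaptation_splits(n_items, ratios=(0.1, 0.3, 0.5, 0.7), test_ratio=0.3, seed=0):
    """Return (test_indices, {ratio: adaptation_indices}) with one fixed test set.

    Adaptation subsets are drawn only from items outside the test set, and each
    smaller subset is a prefix of the larger ones, so raising the ratio only
    adds data.
    """
    if max(ratios) + test_ratio > 1.0 + 1e-9:
        raise ValueError("adaptation and test portions must not overlap")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = round(test_ratio * n_items)
    test_idx = indices[:n_test]
    pool = indices[n_test:]
    subsets = {r: pool[:round(r * n_items)] for r in ratios}
    return test_idx, subsets


test_idx, subsets = adaptation_splits(n_items=1000)
for ratio, idx in subsets.items():
    print(f"adaptation ratio {ratio:.0%}: {len(idx)} items, test set: {len(test_idx)} items")
```

Each subset would then be used to adapt a model for every value of C, recording the F1 score on the fixed test set together with the wall-clock training time.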

model across different adaptation training sample sizes ranging from 10% to 70% of the total target-domain data (with the same 30% held out as test data) and with the cost factor C = 1, 3 and 10 (the same choices of C are used in [Yang et al., 2007] for conducting their experiment). We observe a logarithmic growth for the F1 scores obtained from every model, against a linear growth of computation time cost. Thus even though there is a reasonable increase in classification performance when increasing the adaptation sample size from 50% to 70%, it becomes much less efficient to train such models and we require more data, which may not be available. Since we have a trade-off between model effectiveness and efficiency here, it is appropriate to use 30% of our labelled target-domain data for classifier adaptation, as we have done in ADAPT-1-model and ADAPT-3-model. One should select the adaptation training sample size based on the test data at hand, but empirically we think 1,000 labelled target-domain tweets would be enough for an effective adaptation to classify 3,000-4,000 test tweets.

        Model           Total computation time in minutes
       TARG-all                        7.72
      ENSEMBLE                       209.72
       AGGR-all                     1238.24
     ADAPT-1-model                    26.30
     ADAPT-3-model                   118.41

      Table 4: Total computation time for each method

6 Conclusion
In this work we study a model-based multi-class adaptive-SVM approach to cross-domain emotion recognition and compare against a set of domain-dependent and domain-independent strategies. We conduct a series of experiments and evaluate our proposed system on a set of newly annotated Twitter data about museums. We find that our adapted SVM model outperforms the out-of-domain base models and domain adaptation baselines while also showing competitive performance against the in-domain upper-bound model. Moreover, in comparison to other adaptation strategies our approach is computationally more efficient, especially compared to the classifier trained on aggregated source and target data. Finally, we shed light on how different ratios of labelled target-domain data used for adaptation can affect classification performance. We show there is a trade-off between model effectiveness and efficiency when selecting adaptation sample size. Our code and data^4 are publicly available, enabling further research and comparison with our approach.
   In the future we would like to investigate a feature-based deep learning approach for cross-topic emotion classification on Twitter while examining the possibility of making it as efficient and flexible as the model adaptation based approaches. Another future direction is to study how to best resolve the pronounced class imbalance issue in social media emotion analysis when some emotions are rarely expressed.

^4 http://bit.ly/1SddvIw



Acknowledgments
This work has been funded by the AHRC SMILES project. We would like to thank Liz Walker, Matt Jeffryes and Michael Clapham for their contribution to earlier versions of the emotion classifiers.

References
[Bigi, 2003] Brigitte Bigi. Using Kullback-Leibler distance for text categorization. Springer, 2003.
[Blitzer et al., 2007] John Blitzer, Mark Dredze, Fernando Pereira, et al. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447, 2007.
[Bollegala et al., 2011] Danushka Bollegala, David Weir, and John Carroll. Using multiple sources to construct a sentiment sensitive thesaurus for cross-domain sentiment classification. In NAACL HLT, pages 132–141. Association for Computational Linguistics, 2011.
[Bollen et al., 2011] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.
[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[Dai et al., 2007] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Co-clustering based classification for out-of-domain documents. In SIGKDD, pages 210–219. ACM, 2007.
[Drotner and Schrøder, 2014] Kirsten Drotner and Kim Christian Schrøder. Museum communication and social media: The connected museum. Routledge, 2014.
[Feldman, 2013] Ronen Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82–89, 2013.
[Fletcher and Lee, 2012] Adrienne Fletcher and Moon J Lee. Current social media uses and evaluations in American museums. Museum Management and Curatorship, 27(5):505–521, 2012.
[Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pages 513–520, 2011.
[Jiang and Zhai, 2007] Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In ACL, pages 264–271. Association for Computational Linguistics, June 2007.
[Liu et al., 2013] Shenghua Liu, Fuxin Li, Fangtao Li, Xueqi Cheng, and Huawei Shen. Adaptive co-training SVM for sentiment classification on tweets. In CIKM, pages 2079–2088. ACM, 2013.
[Pan et al., 2010] Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. Cross-domain sentiment classification via spectral feature alignment. In WWW, pages 751–760. ACM, 2010.
[Pang and Lee, 2008] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.
[Purver and Battersby, 2012] Matthew Purver and Stuart Battersby. Experimenting with distant supervision for emotion classification. In EACL, pages 482–491. Association for Computational Linguistics, 2012.
[Tang et al., 2014] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for Twitter sentiment classification. In ACL, volume 1, pages 1555–1565, 2014.
[Townsend et al., 2014] Richard Townsend, Aaron Kalair, Ojas Kulkarni, Rob Procter, and Maria Liakata. University of Warwick: Sentiadaptron-a domain adaptable sentiment analyser for tweets-meets semeval. SemEval 2014, page 768, 2014.
[Tsakalidis et al., 2014] Adam Tsakalidis, Symeon Papadopoulos, and Ioannis Kompatsiaris. An ensemble model for cross-domain polarity classification on Twitter. In WISE, pages 168–177. Springer, 2014.
[Tumasjan et al., 2010] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G Sandner, and Isabell M Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10:178–185, 2010.
[Villaespesa, 2013] Elena Villaespesa. Diving into the museums social media stream: Analysis of the visitor experience in 140 characters. In Museums and the Web, 2013.
[Vo and Zhang, 2015] Duy-Tin Vo and Yue Zhang. Target-dependent Twitter sentiment classification with rich automatic features. In IJCAI, pages 1347–1353, 2015.
[Xia et al., 2013] Rui Xia, Chengqing Zong, Xuelei Hu, and Erik Cambria. Feature ensemble plus sample selection: domain adaptation for sentiment classification. Intelligent Systems, IEEE, 28(3):10–18, 2013.
[Xia et al., 2014] Rui Xia, Jianfei Yu, Feng Xu, and Shumei Wang. Instance-based domain adaptation in NLP via in-target-domain logistic approximation. In AAAI, 2014.
[Yang and Hauptmann, 2008] Jun Yang and Alexander G Hauptmann. A framework for classifier adaptation and its applications in concept detection. In MIR, pages 467–474. ACM, 2008.
[Yang et al., 2007] Jun Yang, Rong Yan, and Alexander G Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th international conference on Multimedia, pages 188–197. ACM, 2007.
[Yang, 2009] Jun Yang. A general framework for classifier adaptation and its applications in multimedia. PhD thesis, Columbia University, 2009.


