=Paper=
{{Paper
|id=Vol-1619/paper3
|storemode=property
|title=SMILE: Twitter Emotion Classification using Domain Adaptation
|pdfUrl=https://ceur-ws.org/Vol-1619/paper3.pdf
|volume=Vol-1619
|authors=Bo Wang,Maria Liakata,Arkaitz Zubiaga,Rob Procter,Eric Jensen
|dblpUrl=https://dblp.org/rec/conf/ijcai/WangLZPJ16
}}
==SMILE: Twitter Emotion Classification using Domain Adaptation==
SMILE: Twitter Emotion Classification using Domain Adaptation

Bo Wang, Maria Liakata, Arkaitz Zubiaga, Rob Procter, Eric Jensen
Department of Computer Science, University of Warwick, Coventry, UK
{bo.wang, m.liakata, e.jensen}@warwick.ac.uk

Proceedings of the 4th Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2016), IJCAI 2016, pages 15-21, New York City, USA, July 10, 2016.

Abstract

Despite the widespread research interest in social media sentiment analysis, sentiment and emotion classification across different domains and on Twitter data remains a challenging task. Here we set out to find an effective approach for tackling a cross-domain emotion classification task on a set of Twitter data involving social media discourse around arts and cultural experiences, in the context of museums. While most existing work in domain adaptation has focused on feature-based and/or instance-based adaptation methods, in this work we study a model-based adaptive SVM approach, as we believe its flexibility and efficiency make it more suitable for the task at hand. We conduct a series of experiments and compare our system with a set of baseline methods. Our results not only show superior performance in terms of accuracy and computational efficiency compared to the baselines, but also shed light on how different ratios of labelled target-domain data used for adaptation can affect classification performance.

1 Introduction

With the advent and growth of social media as a ubiquitous platform, people increasingly discuss and express opinions and emotions towards all kinds of topics and targets. One of the topics that has been relatively unexplored in the scientific community is that of emotions expressed towards arts and cultural experiences. A survey conducted in 2012 by the British TATE Art Galleries found that 26 percent of the respondents had posted some kind of content online, such as blog posts, tweets or photos, about their experience in the art galleries during or after their visit [Villaespesa, 2013]. When cultural tourists share information about their experience on social media, this real-time communication and spontaneous engagement with art and culture not only broadens its target audience but also provides a new space where valuable insight shared by its customers can be garnered. As a result, museums, galleries and other cultural venues have embraced social media such as Twitter, and actively used it to promote their exhibitions, organise participatory projects and/or create initiatives to engage with visitors, collecting valuable opinions and feedback (e.g. museum tweetups). This gold mine of user opinions has sparked an increasing research interest in the interdisciplinary field of social media and museum study [Fletcher and Lee, 2012; Villaespesa, 2013; Drotner and Schrøder, 2014].

We have also seen a surge of research in sentiment analysis, with over 7,000 articles written on the topic [Feldman, 2013], for applications ranging from analyses of movie reviews [Pang and Lee, 2008] and stock market trends [Bollen et al., 2011] to forecasting election results [Tumasjan et al., 2010]. Supervised learning algorithms that require labelled training data have been successfully used for in-domain sentiment classification. However, cross-domain sentiment analysis has been explored to a much lesser extent. For instance, the phrase "light-weight" carries positive sentiment when describing a laptop but quite the opposite when it is used to refer to politicians. In such cases, a classifier trained on one domain may not work well on other domains. A widely adopted solution to this problem is domain adaptation, which allows building models from a fixed set of source domains and deploying them into a different target domain. Recent developments in sentiment analysis using domain adaptation are mostly based on feature-representation adaptation [Blitzer et al., 2007; Pan et al., 2010; Bollegala et al., 2011], instance-weight adaptation [Jiang and Zhai, 2007; Xia et al., 2014; Tsakalidis et al., 2014] or combinations of both [Xia et al., 2013; Liu et al., 2013]. Despite its recent increase in popularity, the use of domain adaptation for sentiment and emotion classification across topics on Twitter remains largely unexplored [Liu et al., 2013; Tsakalidis et al., 2014; Townsend et al., 2014].

In this work we set out to find an effective approach for tackling the cross-domain emotion classification task on Twitter, while also furthering research in the interdisciplinary study of social media discourse around arts and cultural experiences (the SMILE project: http://www.culturesmile.org/). We investigate a model-based adaptive-SVM approach that was previously used for video concept detection [Yang et al., 2007] and compare it with a set of domain-dependent and domain-independent strategies. Such a model-based approach allows us to directly adapt existing models to the new target-domain data without having to generate domain-dependent features or adjust weights for each of the training instances. We conduct a series of experiments and evaluate the proposed system (code at http://bit.ly/1WHup4b) on a set of Twitter data about museums, annotated by three annotators from the social sciences. The aim is to maximise the use of the base classifiers that were trained on a general-domain corpus, and through domain adaptation minimise the classification error rate across 5 emotion categories: anger, disgust, happiness, surprise and sadness. Our results show that adapted SVM classifiers achieve significantly better performance than out-of-domain classifiers and also suggest a competitive performance compared to in-domain classifiers. To the best of our knowledge this is the first attempt at cross-domain emotion classification for Twitter data.
2 Related Work

Most existing approaches can be classified into two categories: feature-based adaptation and instance-based adaptation. The former seeks to construct new adaptive feature representations that reduce the difference between domains, while the latter aims to sample and re-weight source-domain training data for use in classification within the target domain.

With respect to feature adaptation, [Blitzer et al., 2007] applied the structural correspondence learning (SCL) algorithm to cross-domain sentiment classification. SCL chooses a set of pivot features with the highest mutual information with the domain labels, and uses these pivot features to align other features by training N linear predictors. Finally it computes a singular value decomposition (SVD) to construct low-dimensional features that improve classification performance. A small amount of labelled target-domain data is used to learn to deal with misaligned features from SCL. [Townsend et al., 2014] found that SCL did not work well for cross-domain adaptation of sentiment on Twitter, due to the lack of mutual information across the Twitter domains, and used subjective proportions as a backoff adaptation approach. [Pan et al., 2010] proposed to construct a bipartite graph from a co-occurrence matrix between domain-independent and domain-specific features to reduce the gap between different domains, and to use spectral clustering for feature alignment. The resulting clusters are used to represent data examples and train sentiment classifiers. They used mutual information between features and domains to classify domain-independent and domain-specific features, but in practice this also introduces misclassification errors. [Bollegala et al., 2011] describes a cross-domain sentiment classification approach using an automatically created sentiment-sensitive thesaurus. Such a thesaurus is constructed by computing the point-wise mutual information between a lexical element and a feature, as well as the relatedness between two lexical elements. The problem with these feature adaptation approaches is that they try to connect domain-dependent features to known or common features under the assumption that parallel sentiment words exist in different domains, which is not necessarily applicable to the variety of topics in tweets [Liu et al., 2013]. [Glorot et al., 2011] proposes a deep learning system to extract features that are highly beneficial for the domain adaptation of sentiment classifiers, under the intuition that deep learning algorithms learn intermediate concepts (between raw input and target) and these intermediate concepts could yield better transfer across domains.

When it comes to instance adaptation, [Jiang and Zhai, 2007] propose an instance weighting framework that prunes "misleading" instances and approximates the distribution of instances in the target domain. Their experiments show that adding some labelled target-domain instances and assigning higher weights to them performs better than either removing "misleading" source-domain instances using a small number of labelled target-domain data or bootstrapping unlabelled target instances. [Xia et al., 2014] adapt the source-domain training data to the target domain based on a logistic approximation. [Tsakalidis et al., 2014] learn different classifiers on different sets of features and combine them in an ensemble model. Such an ensemble model is then applied to part of the target-domain test data to create new training data (i.e. documents for which the different classifiers made the same predictions). We include this ensemble method as one of our baseline approaches for evaluation and comparison.

In contrast with most cross-domain sentiment classification work, we use a model-based approach proposed in [Yang et al., 2007], which directly adapts existing classifiers trained on general-domain corpora. We believe this is more efficient and flexible [Yang and Hauptmann, 2008] for our task. We evaluate on a set of manually annotated tweets about cultural experiences in museums and conduct a finer-grained classification of the emotions conveyed (i.e. anger, disgust, happiness, surprise and sadness).
3 Datasets

We use two datasets, a source-domain dataset and a target-domain dataset, which enables us to experiment on domain adaptation. The source-domain dataset we adopted is the general-domain Twitter corpus created by [Purver and Battersby, 2012], which was generated through distant supervision using hashtags and emoticons associated with 6 emotions: anger, disgust, fear, happiness, surprise and sadness.

Our target-domain dataset, which allows us to perform experiments on emotions associated with cultural experiences, consists of a set of tweets pertaining to museums. A collection of tweets mentioning one of the following Twitter handles associated with British museums was gathered between May 2013 and June 2015: @camunivmuseums, @fitzmuseum_uk, @kettlesyard, @maacambridge, @iciabath, @thelmahulbert, @rammuseum, @plymouthmuseum, @tateliverpool, @tate_stives, @nationalgallery, @britishmuseum, @thewhitechapel. These are all museums associated with the SMILE project. A subset of 3,759 tweets was sampled from this collection for manual annotation, and we developed a tool for the manual annotation of the emotion expressed in each of these tweets. The annotation options for each tweet included five of the six Ekman emotions used by [Purver and Battersby, 2012]; 'fear' was excluded as it never featured in the context of tweets about museums. Two extra annotation options were included: no code, indicating that a tweet was not conveying any emotion, and not relevant, when the tweet did not refer to any aspect related to the museum in question. The annotator could choose more than one emotion for a tweet, except when no code or not relevant were selected, in which case no additional options could be picked. The annotation of all the tweets was performed independently by three sociology PhD students. Out of the 3,759 tweets that were released for annotation, at least 2 of the annotators agreed in 3,085 cases (82.1%). We use the collection resulting from these 3,085 tweets as our target-domain dataset for classifier adaptation and evaluation. Note that tweets labelled as no code or not relevant are included in our dataset to reflect a more realistic data distribution on Twitter, while our source-domain data doesn't have any no code or not relevant tweets.

Emotion                  No. of tweets   % of tweets
no code                  1572            41.8%
happy                    1137            30.2%
not relevant             214             5.7%
anger                    57              1.5%
surprise                 35              0.9%
sad                      32              0.9%
happy & surprise         11              0.3%
happy & sad              9               0.2%
disgust & anger          7               0.2%
disgust                  6               0.2%
sad & anger              2               0.1%
sad & disgust            2               0.1%
sad & disgust & anger    1               <0.1%

Table 2: Target data emotion distribution

The distribution of emotion annotations in Table 2 shows a remarkable class imbalance: happy accounts for 30.2% of the tweets, while the other emotions are seldom observed in the museum dataset. There is also a large number of tweets with no emotion associated (41.8%). One intuitive explanation is that Twitter users tend to express positive and appreciative emotions regarding their museum experiences and shy away from making negative comments. This can also be demonstrated by comparing the museum data emotion distribution to our general-domain source data, as seen in Figure 1, where the sample ratio of positive instances is shown for each emotion category.

[Figure 1: Source and target data distribution comparison. Bar chart of the proportion of positive instances per emotion category (happy, surprise, sad, disgust, anger) in the source-domain and target-domain data.]
To quantify the difference between two text datasets, Kullback-Leibler (KL) divergence has commonly been used [Dai et al., 2007]. Here we use the KL-divergence method proposed by [Bigi, 2003], as it suggests a back-off smoothing method that deals with the data sparseness problem. Such a back-off method keeps the probability distributions summing to 1 and allows operating on the entire vocabulary, by introducing a normalisation coefficient and a very small threshold probability for all the terms that are not in the given vocabulary. Since our source-domain data contains many more tweets than the target-domain data, we randomly sub-sampled the former and made sure the two datasets have similar vocabulary sizes in order to avoid biases. We removed stop words, user mentions, URL links and re-tweet symbols prior to computing the KL-divergence. Finally, we randomly split each dataset into 10 folds and computed the in-domain and cross-domain symmetric KL-divergence (KLD) value between every pair of folds. Table 1 shows the averaged KL-divergence values. It can be seen that the KL-divergence between the two datasets (i.e. KLD(D_src || D_tar)) is twice as large as the in-domain KL-divergence values. This suggests a significant difference between the data distributions in the two domains and thus justifies our need for domain adaptation.

Data domain              Averaged KLD value
KLD(D_src || D_src)      2.391
KLD(D_tar || D_tar)      2.165
KLD(D_src || D_tar)      4.818

Table 1: In-domain and cross-domain KL-divergence values
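To make the divergence computation concrete, here is a minimal sketch, not the authors' released code, of a symmetric KL-divergence between two tokenised folds with a Bigi-style back-off for unseen terms; the smoothing constant `epsilon` and the toy folds are illustrative assumptions, and [Bigi, 2003] derives its normalisation coefficient differently.

```python
import math
from collections import Counter

def backoff_distribution(tokens, vocab, epsilon=1e-6):
    """Unigram distribution over `vocab`: unseen terms get a small
    back-off probability `epsilon`, and observed terms are rescaled so
    the whole distribution still sums to 1 (in the spirit of Bigi, 2003)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    unseen = sum(1 for w in vocab if w not in counts)
    seen_mass = 1.0 - epsilon * unseen  # probability mass left for observed terms
    return {w: seen_mass * counts[w] / total if w in counts else epsilon
            for w in vocab}

def kld(p, q):
    """KL-divergence D(p || q); q is strictly positive thanks to the back-off."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())

def symmetric_kld(tokens_a, tokens_b):
    """Symmetric divergence KLD(A||B) + KLD(B||A) over the joint vocabulary."""
    vocab = set(tokens_a) | set(tokens_b)
    p = backoff_distribution(tokens_a, vocab)
    q = backoff_distribution(tokens_b, vocab)
    return kld(p, q) + kld(q, p)

# Toy example with two tiny "folds" of preprocessed tweets
fold_src = "great exhibition loved the gallery".split()
fold_tar = "museum visit great paintings".split()
print(symmetric_kld(fold_src, fold_tar))
```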
4 Methodology

Given the source domain $D_{src}$ and target domain $D_{tar}$, we have one or $k$ sets of labelled source-domain data, denoted $\{(x_i^k, y_i^k)\}_{i=1}^{N_{src}}$ in $D_{src}$, where $x_i^k$ is the $i$-th feature vector, with each element being the value of the corresponding feature, and $y_i^k$ are the emotion categories to which the $i$-th instance belongs. Suppose we have classifiers $f_{src}^{k}(x)$ that have been trained on the source-domain data (named the auxiliary classifiers in [Yang et al., 2007]) and a small set of labelled target-domain data $D_{tar}^{l}$, where $D_{tar} = D_{tar}^{l} \cup D_{tar}^{u}$. Our goal is to adapt $f_{src}^{k}(x)$ to a new classifier $f_{tar}(x)$ based on the small set of labelled examples in $D_{tar}^{l}$, so that it can accurately predict the emotion class of unseen data from $D_{tar}^{u}$.

4.1 Base Classifiers

Our base classifiers are the classifiers trained on the source-domain data $\{(x_i, y_i)\}_{i=1}^{N_{src}}$, where $y_i \in \{1, \ldots, K\}$, with $K$ the number of emotion categories. In our work, we use Support Vector Machines (SVMs) in a "one-versus-all" setting, which trains $K$ binary classifiers, each separating one class from the rest. We chose this as a better way of dealing with class imbalance in a multi-class scenario.
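As an illustration of the one-versus-all setting, the following sketch trains $K$ binary RBF-kernel SVMs, one per emotion, and predicts by the largest margin. It assumes scikit-learn as a stand-in for the LIBSVM implementation referenced in Section 5.1, and uses random toy features.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, y, num_classes, **svm_params):
    """Train K binary SVMs, each separating one emotion class from the rest."""
    return [SVC(kernel="rbf", **svm_params).fit(X, (y == k).astype(int))
            for k in range(num_classes)]

def predict_one_vs_all(classifiers, X):
    """Assign each instance to the class whose SVM gives the highest margin."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)

# Toy usage with random features and K = 5 emotion classes
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 12)), rng.integers(0, 5, size=300)
models = train_one_vs_all(X, y, num_classes=5, C=1.0)
print(predict_one_vs_all(models, X[:10]))
```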
Features

The base classifiers are trained on 3 sets of features generated from the source-domain data: (i) n-grams, (ii) lexicon features, (iii) word embedding features.

N-gram models have long been used in NLP for various tasks. We used 1-, 2- and 3-grams, after filtering out all the stop words, as our n-gram features. We construct 32 lexicon features from 9 Twitter-specific and general-purpose lexica. Each lexicon provides either a numeric sentiment score, or categories where a category could correspond to a particular emotion or a strong/weak positive/negative sentiment.

The use of word embedding features to represent the context of words and concepts has been shown to be very effective in boosting the performance of sentiment classification. In this work we use a set of word embeddings learnt with the sentiment-specific method of [Tang et al., 2014] and another set of general word embeddings trained on 5 million tweets by [Vo and Zhang, 2015]. Embeddings we trained ourselves on an additional set of 3 million tweets did not increase performance. Pooling functions are essential and particularly effective for feature selection from dense embedding feature vectors. [Tang et al., 2014] applied the max, min and mean pooling functions and found them to be highly useful. We tested and evaluated six pooling functions, namely sum, max, min, mean, std (i.e. standard deviation) and product, and selected sum, max and mean as they led to the best performance.
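The pooling step can be illustrated as follows; this is our own sketch, with toy 4-dimensional embeddings as an assumption, concatenating the sum, max and mean pooling functions selected above.

```python
import numpy as np

def pool_embeddings(token_vectors):
    """Concatenate sum, max and mean pooling over a tweet's token
    embeddings to obtain a fixed-length feature vector (the three
    pooling functions that performed best in our evaluation)."""
    E = np.vstack(token_vectors)            # shape: (n_tokens, dim)
    return np.concatenate([E.sum(axis=0),   # sum pooling
                           E.max(axis=0),   # max pooling
                           E.mean(axis=0)]) # mean pooling

# Toy example: random 4-dimensional "embeddings" for a 3-token tweet
rng = np.random.default_rng(0)
tweet = [rng.normal(size=4) for _ in range(3)]
features = pool_embeddings(tweet)
print(features.shape)  # (12,) = 3 pooling functions x 4 dimensions
```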
“@USERID”, removing the hashtag symbol “#”, normalis- Furthermore K(·, ·) ⌘ (·)T (·) is the kernel function in- ing emoticons and abbreviations3 . duced from the nonlinear feature mapping. f (x) is learnt in a framework that aims to minimise the regularised empir- 5 Results and Evaluation ical risk [Yang, 2009]. The adapted classifier ftar (x) learnt In this section we present the experimental results and com- under this framework tries to minimise the classification error pare our proposed adaptation system with a set of domain- on the labelled target-domain examples and the distance from dependent and domain-independent strategies. We also in- the base classifiers fsrc k (x), to achieve a better bias-variance vestigate the effect of different sizes of the labelled target- trade-off. domain data in the classification performance. In this work we use the extended multi-classifier adapta- tion framework proposed by [Yang and Hauptmann, 2008], 3 http://bit.ly/1U7fiQR 18 5.1 Adaptation Baselines is very challenging to overcome without acquiring more la- The baseline methods and our proposed system are the fol- belled data than we currently have. It especially effects our lowing: domain adaptation as all the parameters in Eq.(3) cannot be properly optimised. • BASE: the base classifiers use either one set of features Since there are very few tweets annotated as “disgust”, we or all three feature sets (i.e. BASE-all). As an example, decide not to consider the “disgust” emotion as part of our the BASE-embedding classifier is trained and tuned with experiment evaluation here. As seen in Table 3, BASE mod- all source-domain data using only word-embedding fea- els are outperformed significantly by all other methods (ex- tures, then tested on 30% of our target-domain data. We cept ENSEMBLE, which performs only slightly better than use the LIBSVM implementation [Chang and Lin, 2011] the BASE models) positing the importance of domain adapta- of SVM for building the base classifiers. tion. With the exception of the ADAPT-3-model for “Anger”, • TARG: trained and tuned with 70% labelled target- our ADAPT models consistently outperform AGGR-all and domain data. Since this model is entirely trained from ENSEMBLE while showing competitive performance com- the target domain, it can be considered as the perfor- pared to the upper-bound baseline, TARG-all. We also ob- mance upper-bound that is very hard to beat. serve that the aggregation model AGGR-all is outperformed • AGGR: an aggregate model trained from all source- by TARG-all, indicating such domain knowledge cannot be domain data and 70% labelled target-domain data. transferred effectively to a different domain by simply mod- elling from aggregated data from both domains. In com- • ENSEMBLE: combines the base classifiers in an en- parison, our ADAPT models are able to leverage the large semble model. Then perform classification on 30% of and balanced source-domain data (as base classifiers) unlike the target-domain data to generate new training data, as TARG, while adjusting the contribution of each base classi- described in Section 2. fier unlike AGGR. • ADAPT: our domain adapted models using either one When comparing our ADAPT models, we find that in most base classifier trained with all feature sets (i.e. ADAPT- cases models adapted from multiple base classifiers beat the 1-model) or an ensemble of three standalone base clas- ones adapted from one single base classifier, even though the sifiers with each trained with one set of features (i.e. 
4.3 Data Preprocessing

The preprocessing techniques applied include substituting URL links with the string "URL" and user mentions with "@USERID", removing the hashtag symbol "#", and normalising emoticons and abbreviations (normalisation lists: http://bit.ly/1U7fiQR).
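A minimal sketch of this preprocessing pipeline follows; the regular expressions and the tiny emoticon/abbreviation maps are illustrative assumptions rather than the exact normalisation lists linked above.

```python
import re

# Illustrative normalisation maps; the paper uses fuller lists (see link above)
EMOTICONS = {":)": "smile", ":(": "sad", ":D": "laugh"}
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks"}

def preprocess(tweet):
    """Apply the substitutions described in Section 4.3."""
    tweet = re.sub(r"https?://\S+", "URL", tweet)  # URLs -> "URL"
    tweet = re.sub(r"@\w+", "@USERID", tweet)      # mentions -> "@USERID"
    tweet = tweet.replace("#", "")                 # drop the hashtag symbol
    for emo, word in EMOTICONS.items():            # normalise emoticons
        tweet = tweet.replace(emo, word)
    return " ".join(ABBREVIATIONS.get(tok, tok)    # normalise abbreviations
                    for tok in tweet.split())

print(preprocess("Loved the #exhibition @britishmuseum :) u should go https://t.co/x"))
```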
Figure 2 shows the balance this issue using a class weight parameter, but it still normalised F1 scores and computation time of each ADAPT 19 Anger Disgust Happy Surprise Sad Model P R F1 P R F1 P R F1 P R F1 P R F1 BASE-ngrams 5.77 40.91 10.11 0.49 100.0 0.97 37.62 100.0 54.67 1.46 100.0 2.87 1.50 100.0 2.96 BASE-lexicon 2.59 90.91 5.03 0.55 100.0 1.10 38.43 98.96 55.36 0.00 0.00 0.00 2.54 93.33 4.94 BASE-embedding 2.06 72.73 4.02 0.00 0.00 0.00 39.18 96.11 55.66 2.00 60.00 3.88 1.49 80.00 2.92 BASE-all 2.01 59.09 3.88 5.00 20.00 8.00 38.75 98.19 55.57 1.69 66.67 3.29 1.58 86.67 3.11 TARG-all 36.00 40.91 38.30 0.00 0.00 0.00 78.04 84.72 81.24 20.83 33.33 25.64 18.75 20.00 19.35 AGGR-all 10.71 27.27 15.38 33.33 20.00 25.00 64.79 86.27 74.00 5.88 11.11 7.69 4.17 20.00 6.90 ENSEMBLE 2.11 100.0 4.13 0.49 100.0 0.97 45.20 83.55 58.66 2.70 11.11 4.35 1.46 100.0 2.88 ADAPT-1-model 16.28 31.82 21.54 0.59 80.00 1.18 79.34 80.57 79.95 11.11 13.33 12.12 100.0 6.67 12.50 ADAPT-3-model 20.00 9.09 12.50 0.00 0.00 0.00 82.11 80.83 81.46 8.14 46.67 13.86 8.77 33.33 13.89 ADAPT-1-modelx 21.43 13.64 16.67 100.0 20.00 33.33 80.53 79.27 79.90 12.50 26.67 17.02 16.67 13.33 14.81 ADAPT-3-modelx 20.00 22.73 21.28 1.82 20.00 3.33 80.30 83.42 81.83 12.50 26.67 17.02 10.20 33.33 15.63 SRC-all 93.57 93.37 93.46 99.05 98.73 98.89 81.87 85.91 83.83 96.25 98.03 97.13 91.04 92.51 91.76 Table 3: Model performance comparison (a) C = 1 (b) C = 3 (c) C = 10 Figure 2: Performance of each ADAPT model with C = 1,3,10 vs. its computation time Model Total computation time in minutes 6 Conclusion TARG-all 7.72 ENSEMBLE 209.72 In this work we study a model-based multi-class adaptive- AGGR-all 1238.24 SVM approach to cross-domain emotion recognition and ADAPT-1-model 26.30 compare against a set of domain-dependent and domain- ADAPT-3-model 118.41 independent strategies. We conduct a series of experiments and evaluate our proposed system on a set of newly anno- Table 4: Total computation time for each method tated Twitter data about museums. We find that our adapted SVM model outperforms the out-of-domain base models and domain adaptation baselines while also showing competi- model across different adaptation training sample sizes rang- tive performance against the in-domain upper-bound model. ing from 10% to 70% of the total target-domain data (with the Moreover, in comparison to other adaptation strategies our same 30% held out as test data) and with the cost factor C = approach is computationally more efficient especially com- 1, 3 and 10 (as the same choices of C are used in [Yang et al., pared to the classifier trained on aggregated source and tar- 2007] for conducting their experiment). We observe a loga- get data. Finally, we shed light on how different ratios of la- rithmic growth for the F1 scores obtained from every model, belled target-domain data used for adaptation can effect clas- against a linear growth of computation time cost. Thus even sification performance. We show there is a trade-off between though there is a reasonable increase in classification perfor- model effectiveness and efficiency when selecting adaptation mance when increasing the adaptation sample size from 50% sample size. Our code and data4 are publicly available, en- to 70%, it becomes much less efficient to train such mod- abling further research and comparison with our approach. els and we require more data, which may not be available. 
5.2 Experimental Results

We report the experimental results in Table 3, with three categories of models: 1) in-domain, no-adaptation methods, i.e. the BASE and TARG models, TARG being the upper bound for performance evaluation; 2) the domain adaptation baselines, i.e. AGGR and ENSEMBLE; and 3) our adaptation systems (the ADAPT models). As can be seen, the classification performances reported for emotions other than "happy" are below 50 in terms of F1 score, with some results as low as 0.00. This is caused by the class imbalance within these emotions shown in Table 2 and Figure 1, especially for the emotion "disgust", which has only 16 tweets. We tried to counter this using a class weight parameter, but it is very challenging to overcome without acquiring more labelled data than we currently have. It especially affects our domain adaptation, as the parameters in Eq. (3) cannot be properly optimised.

Since there are very few tweets annotated as "disgust", we decided not to consider the "disgust" emotion as part of our experimental evaluation here. As seen in Table 3, the BASE models are significantly outperformed by all other methods (except ENSEMBLE, which performs only slightly better than the BASE models), underlining the importance of domain adaptation. With the exception of the ADAPT-3-model for "anger", our ADAPT models consistently outperform AGGR-all and ENSEMBLE while showing competitive performance compared to the upper-bound baseline, TARG-all. We also observe that the aggregation model AGGR-all is outperformed by TARG-all, indicating that domain knowledge cannot be transferred effectively to a different domain by simply modelling aggregated data from both domains. In comparison, our ADAPT models are able to leverage the large and balanced source-domain data (as base classifiers), unlike TARG, while adjusting the contribution of each base classifier, unlike AGGR.

When comparing our ADAPT models, we find that in most cases models adapted from multiple base classifiers beat the ones adapted from one single base classifier, even though the same features are used in both scenarios. This shows the benefit of the multi-classifier adaptation approach, which aims to maximise the utility of each base classifier. Two additional models, ADAPT-1-modelx and ADAPT-3-modelx, replicate ADAPT-1/3-model except that they also use 40% of the target-domain data for tuning the model parameters. On average their results are only slightly better than those of ADAPT-1/3-model, which use 30% of the target-domain data for both training and parameter optimisation. This is especially prominent for "happiness", where we have sufficient target-domain instances and less of a class imbalance issue. This shows that our ADAPT models are able to transfer knowledge effectively across domains with a small amount of labelled target-domain data. More analysis of the impact of adaptation sample ratios is given in Section 5.3.

                   Anger                  Disgust                Happy                  Surprise               Sad
Model              P      R      F1       P      R      F1      P      R      F1       P      R      F1      P      R      F1
BASE-ngrams        5.77   40.91  10.11    0.49   100.0  0.97    37.62  100.0  54.67    1.46   100.0  2.87    1.50   100.0  2.96
BASE-lexicon       2.59   90.91  5.03     0.55   100.0  1.10    38.43  98.96  55.36    0.00   0.00   0.00    2.54   93.33  4.94
BASE-embedding     2.06   72.73  4.02     0.00   0.00   0.00    39.18  96.11  55.66    2.00   60.00  3.88    1.49   80.00  2.92
BASE-all           2.01   59.09  3.88     5.00   20.00  8.00    38.75  98.19  55.57    1.69   66.67  3.29    1.58   86.67  3.11
TARG-all           36.00  40.91  38.30    0.00   0.00   0.00    78.04  84.72  81.24    20.83  33.33  25.64   18.75  20.00  19.35
AGGR-all           10.71  27.27  15.38    33.33  20.00  25.00   64.79  86.27  74.00    5.88   11.11  7.69    4.17   20.00  6.90
ENSEMBLE           2.11   100.0  4.13     0.49   100.0  0.97    45.20  83.55  58.66    2.70   11.11  4.35    1.46   100.0  2.88
ADAPT-1-model      16.28  31.82  21.54    0.59   80.00  1.18    79.34  80.57  79.95    11.11  13.33  12.12   100.0  6.67   12.50
ADAPT-3-model      20.00  9.09   12.50    0.00   0.00   0.00    82.11  80.83  81.46    8.14   46.67  13.86   8.77   33.33  13.89
ADAPT-1-modelx     21.43  13.64  16.67    100.0  20.00  33.33   80.53  79.27  79.90    12.50  26.67  17.02   16.67  13.33  14.81
ADAPT-3-modelx     20.00  22.73  21.28    1.82   20.00  3.33    80.30  83.42  81.83    12.50  26.67  17.02   10.20  33.33  15.63
SRC-all            93.57  93.37  93.46    99.05  98.73  98.89   81.87  85.91  83.83    96.25  98.03  97.13   91.04  92.51  91.76

Table 3: Model performance comparison (precision, recall and F1, in %)

We can also evaluate each model by comparing its efficiency in terms of computation time. Here we report the total computation time taken by all the above methods except BASE, for the emotion "happiness". This computation consists of adaptation training, grid search over the same set of parameter values, and final testing. As seen in Table 4, compared to the other out-of-domain strategies the proposed ADAPT models are more efficient to train, especially in comparison with AGGR, which is an order of magnitude more costly due to the inclusion of the source-domain data. Within the ADAPT models, ADAPT-1-model requires less time to train since it only has one base classifier for adaptation.

Model            Total computation time in minutes
TARG-all         7.72
ENSEMBLE         209.72
AGGR-all         1238.24
ADAPT-1-model    26.30
ADAPT-3-model    118.41

Table 4: Total computation time for each method

5.3 Effect of Adaptation Training Sample Ratios

Here we evaluate the effect of different ratios of labelled target-domain data on the overall classification performance for the emotion "happiness". Figure 2 shows the normalised F1 scores and computation time of each ADAPT model across adaptation training sample sizes ranging from 10% to 70% of the total target-domain data (with the same 30% held out as test data) and with the cost factor C = 1, 3 and 10 (the same choices of C as used in [Yang et al., 2007] for their experiments). We observe a logarithmic growth of the F1 scores obtained by every model, against a linear growth of the computation time cost. Thus, even though there is a reasonable increase in classification performance when increasing the adaptation sample size from 50% to 70%, it becomes much less efficient to train such models, and more data may not be available. Given this trade-off between model effectiveness and efficiency, it is appropriate to use 30% of our labelled target-domain data for classifier adaptation, as we have done in ADAPT-1-model and ADAPT-3-model. One should select the adaptation training sample size according to the test data at hand, but empirically we estimate that 1,000 labelled target-domain tweets would be enough for an effective adaptation to classify 3,000-4,000 test tweets.

[Figure 2: Performance of each ADAPT model with C = 1, 3, 10 vs. its computation time; panels (a) C = 1, (b) C = 3, (c) C = 10.]

6 Conclusion

In this work we study a model-based multi-class adaptive-SVM approach to cross-domain emotion recognition and compare it against a set of domain-dependent and domain-independent strategies. We conduct a series of experiments and evaluate our proposed system on a set of newly annotated Twitter data about museums. We find that our adapted SVM model outperforms the out-of-domain base models and the domain adaptation baselines, while also showing competitive performance against the in-domain upper-bound model. Moreover, in comparison to other adaptation strategies our approach is computationally more efficient, especially compared to the classifier trained on aggregated source and target data. Finally, we shed light on how different ratios of labelled target-domain data used for adaptation can affect classification performance, and show there is a trade-off between model effectiveness and efficiency when selecting the adaptation sample size. Our code and data are publicly available (http://bit.ly/1SddvIw), enabling further research and comparison with our approach.

In the future we would like to investigate a feature-based deep learning approach for cross-topic emotion classification on Twitter, while examining the possibility of making it as efficient and flexible as the model-adaptation-based approaches. Another future direction is to study how best to resolve the remarkable class imbalance issue in social media emotion analysis, where some emotions are rarely expressed.

Acknowledgments

This work has been funded by the AHRC SMILE project. We would like to thank Liz Walker, Matt Jeffryes and Michael Clapham for their contribution to earlier versions of the emotion classifiers.

References

[Bigi, 2003] Brigitte Bigi. Using Kullback-Leibler distance for text categorization. Springer, 2003.

[Blitzer et al., 2007] John Blitzer, Mark Dredze, Fernando Pereira, et al. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440-447, 2007.

[Bollegala et al., 2011] Danushka Bollegala, David Weir, and John Carroll. Using multiple sources to construct a sentiment sensitive thesaurus for cross-domain sentiment classification. In NAACL HLT, pages 132-141. Association for Computational Linguistics, 2011.

[Bollen et al., 2011] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1-8, 2011.

[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

[Dai et al., 2007] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Co-clustering based classification for out-of-domain documents. In SIGKDD, pages 210-219. ACM, 2007.

[Drotner and Schrøder, 2014] Kirsten Drotner and Kim Christian Schrøder. Museum communication and social media: The connected museum. Routledge, 2014.

[Feldman, 2013] Ronen Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82-89, 2013.

[Fletcher and Lee, 2012] Adrienne Fletcher and Moon J Lee. Current social media uses and evaluations in American museums. Museum Management and Curatorship, 27(5):505-521, 2012.

[Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pages 513-520, 2011.

[Jiang and Zhai, 2007] Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In ACL, pages 264-271. Association for Computational Linguistics, June 2007.

[Liu et al., 2013] Shenghua Liu, Fuxin Li, Fangtao Li, Xueqi Cheng, and Huawei Shen. Adaptive co-training SVM for sentiment classification on tweets. In CIKM, pages 2079-2088. ACM, 2013.
[Pan et al., 2010] Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. Cross-domain sentiment classification via spectral feature alignment. In WWW, pages 751-760. ACM, 2010.

[Pang and Lee, 2008] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.

[Purver and Battersby, 2012] Matthew Purver and Stuart Battersby. Experimenting with distant supervision for emotion classification. In EACL, pages 482-491. Association for Computational Linguistics, 2012.

[Tang et al., 2014] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for Twitter sentiment classification. In ACL, volume 1, pages 1555-1565, 2014.

[Townsend et al., 2014] Richard Townsend, Aaron Kalair, Ojas Kulkarni, Rob Procter, and Maria Liakata. University of Warwick: SENTIADAPTRON, a domain adaptable sentiment analyser for tweets, meets SemEval. In SemEval 2014, page 768, 2014.

[Tsakalidis et al., 2014] Adam Tsakalidis, Symeon Papadopoulos, and Ioannis Kompatsiaris. An ensemble model for cross-domain polarity classification on Twitter. In WISE, pages 168-177. Springer, 2014.

[Tumasjan et al., 2010] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G Sandner, and Isabell M Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10:178-185, 2010.

[Villaespesa, 2013] Elena Villaespesa. Diving into the museum's social media stream: Analysis of the visitor experience in 140 characters. In Museums and the Web, 2013.

[Vo and Zhang, 2015] Duy-Tin Vo and Yue Zhang. Target-dependent Twitter sentiment classification with rich automatic features. In IJCAI, pages 1347-1353, 2015.

[Xia et al., 2013] Rui Xia, Chengqing Zong, Xuelei Hu, and Erik Cambria. Feature ensemble plus sample selection: Domain adaptation for sentiment classification. IEEE Intelligent Systems, 28(3):10-18, 2013.

[Xia et al., 2014] Rui Xia, Jianfei Yu, Feng Xu, and Shumei Wang. Instance-based domain adaptation in NLP via in-target-domain logistic approximation. In AAAI, 2014.

[Yang and Hauptmann, 2008] Jun Yang and Alexander G Hauptmann. A framework for classifier adaptation and its applications in concept detection. In MIR, pages 467-474. ACM, 2008.

[Yang et al., 2007] Jun Yang, Rong Yan, and Alexander G Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th International Conference on Multimedia, pages 188-197. ACM, 2007.

[Yang, 2009] Jun Yang. A general framework for classifier adaptation and its applications in multimedia. PhD thesis, Columbia University, 2009.