=Paper= {{Paper |id=Vol-2554/paper10 |storemode=property |title=Leveraging Emotion Features in News Recommendations |pdfUrl=https://ceur-ws.org/Vol-2554/paper_10.pdf |volume=Vol-2554 |authors=Nastaran Babanejad,Ameeta Agrawal,Heidar Davoudi,Aijun An,Manos Papagelis |dblpUrl=https://dblp.org/rec/conf/recsys/BabanejadADAP19 }} ==Leveraging Emotion Features in News Recommendations== https://ceur-ws.org/Vol-2554/paper_10.pdf
         Leveraging Emotion Features in News Recommendations
             Nastaran Babanejad                                          Ameeta Agrawal                                Heidar Davoudi
                 York University                                          York University                           Ontario Tech University
                 Toronto, Canada                                          Toronto, Canada                              Oshawa, Canada
               nasba@eecs.yorku.ca                                      ameeta@eecs.yorku.ca                        heidar.davoudi@uoit.ca

                                                   Aijun An                                    Manos Papagelis
                                               York University                                  York University
                                              Toronto, Canada                                  Toronto, Canada
                                             aan@eecs.yorku.ca                              papaggel@eecs.yorku.ca

ABSTRACT
Online news reading has become very popular as the web provides
access to news articles from millions of sources around the world.
As a specific application domain, news recommender systems aim
to give the most relevant news article recommendations to users
according to their personal interests and preferences. Recently, a
family of models has emerged that aims to improve recommenda-
tions by adapting to the contextual situation of users. These models
provide the premise of being more accurate as they are tailored to
satisfy the continuously changing needs of users. However, little
attention has been paid to the emotional context and its potential
for improving the accuracy of news recommendations. The main
objective of this paper is to investigate whether, how and to what
extent emotion features can improve recommendations. Towards
that end, we derive a large number of emotion features that can be
attributed to both items and users in the domain of news. Then, we
devise state-of-the-art emotion-aware recommendation models by
systematically leveraging these features. We conducted a thorough
experimental evaluation on a real dataset from the news domain.
Our results demonstrate that the proposed models outperform
state-of-the-art non-emotion-based recommendation models. Our
study provides evidence of the usefulness of emotion features
at large, as well as the feasibility of our approach of incorporating
them into existing models to improve recommendations.

Figure 1: Illustrative example of emotions expressed in articles read by two different users, U1 (top) and U2 (bottom), over a three-month period. Can we leverage the emotional context to improve recommendations?
CCS CONCEPTS
• Information systems → Recommender systems; Sentiment analysis; • Computing methodologies → Neural networks.

KEYWORDS
news recommender systems, contextual information, emotion features

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1     INTRODUCTION
Recommender systems (RS) have been widely and successfully employed in domains as diverse as news and media, entertainment, e-commerce and financial services, to name a few. The main utility of such systems is their ability to suggest items to users that they might like or find useful. Traditionally, research on recommendation algorithms has focused on improving the accuracy of predictive models based on a combination of descriptive features of the items and users themselves (e.g., user behavior, interests and preferences) and the history of a user's interactions with the items through ratings, reviews, clicks and more [20, 33, 34]. However, little attention has been paid to the emotional context and its relation to recommendations.
   While emotions can be manifested in various ways, we focus on emotions expressed in textual information that is associated with items or users in the system. For example, the content of a news article, the content of an online review or the lyrics of a song are good examples of textual information directly associated with an item's emotional context. On the other hand, the emotional profile of a user can be determined through explicit or implicit feedback of users to items. Explicit feedback, such as providing ratings and/or submitting reviews for items, can be an accurate reflection of a user's opinion about an item, but it is considered an intrusive process that disrupts the user-system interaction and negatively impacts user experience [32]. In addition, while it might be available in certain domains (e.g., product recommendations [8], movie recommendations [29], etc.), it is not easily obtainable in domains such as news, where users typically interact with items at a fast
pace and are less inclined to provide feedback. In the absence, sparsity or high cost of acquisition of explicit feedback, incorporating implicit feedback, which is generally abundant and non-intrusive, might be beneficial. Therefore, we focus on indirectly capturing the emotional context of users' activity by monitoring their interactions with items over time. For instance, one can monitor the tone of the stories in the news articles users are reading. Effectively, this information can be used to model a user's historical or temporal emotional profile.
   To further motivate this, consider Figure 1, which illustrates the emotional profiles of two users, U1 and U2, based on eight basic emotions expressed in the articles they read over a period of three months. One can notice that the emotions of sadness and fear are mostly expressed in the articles read by U1, while other emotions, such as joy, are less expressed. In addition, one can observe trends such as the expression of anger increasing over time. On the other hand, for U2, the emotions of joy and trust are mostly expressed, and other emotions, such as disgust, are less expressed. Moreover, emotions of fear and anticipation are increasingly expressed in the articles read by this user. Although the emotional tone derived from the news articles a user reads cannot fully determine the personality and state of mind of the user, it can be regarded as a taste or preference of the user, indicating the type of articles they are more interested in. Inspired by these observations, recent advancements in methods for emotion detection, and the success of emotion-aware recommendation algorithms, the main motivation of our research is to investigate whether, how and to what extent emotion features can improve the accuracy of recommendations.
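Profiles such as those in Figure 1 can be thought of as per-month aggregates of the emotion scores of the articles a user read. The following is an illustrative sketch only, not the paper's implementation; the function name and the toy scores are our own, and we assume each article already carries per-emotion scores:

```python
from collections import defaultdict

# Plutchik's eight basic emotions, as used for the profiles in Figure 1.
PLUTCHIK = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]

def emotion_profile(read_articles):
    """Aggregate per-article emotion scores into a per-month profile.

    `read_articles` is a list of (month, scores) pairs, where `scores`
    maps each Plutchik emotion to a score for one article the user read.
    Returns {month: {emotion: mean score over that month's articles}}.
    """
    buckets = defaultdict(list)
    for month, scores in read_articles:
        buckets[month].append(scores)
    profile = {}
    for month, score_list in buckets.items():
        profile[month] = {
            e: sum(s.get(e, 0.0) for s in score_list) / len(score_list)
            for e in PLUTCHIK
        }
    return profile

# Toy example: a user whose reading skews toward sadness and fear.
reads = [
    (1, {"sadness": 0.8, "fear": 0.6, "joy": 0.1}),
    (1, {"sadness": 0.6, "fear": 0.7, "joy": 0.2}),
    (2, {"sadness": 0.9, "fear": 0.8, "anger": 0.3}),
]
profile = emotion_profile(reads)
```

Trends such as "anger increasing over time" then correspond to comparing an emotion's value across consecutive months of the profile.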
The Problem. More formally, the recommendation task can be described as follows. Let U = {𝑢1, 𝑢2, ..., 𝑢𝑚} be a set of 𝑚 users and I = {𝑖1, 𝑖2, ..., 𝑖𝑛} a set of 𝑛 items. Let us also assume that each user 𝑢𝑖 has already interacted with a set of items I𝑢𝑖 ⊆ I (e.g., consumed news articles). Then, the problem is to accurately predict the probability 𝑝𝑢𝑎,𝑖𝑗 with which a user 𝑢𝑎 ∈ U will like an item 𝑖𝑗 ∈ I \ I𝑢𝑎. The task can also take the form of recommending a set I𝑘 ⊆ I \ I𝑢𝑎 of 𝑘 items that the user will find most interesting (top-𝑘 recommendations). For example, in the news domain, the task is that of recommending an unread article.

Challenges & Approach. In order to evaluate the importance of the emotional context to recommendations, we had to incorporate emotion features [2, 36, 45] into state-of-the-art recommendation algorithms and evaluate their accuracy. Figure 2 shows a schematic diagram of the emotion-aware recommendation process we designed, which consists of three main stages: i) feature engineering, ii) model training, and iii) blending & ensemble learning. Each of these stages defines a number of challenges that need to be addressed. During feature engineering, we had to generate a number of features attributed to both users and items. Emphasis was placed on capturing the most important non-emotional and emotional features for the prediction task. Once features are extracted, off-the-shelf feature selection methods are employed to select the subset of them that is most relevant for use in model construction. During model training, we experiment with a number of state-of-the-art models for generating recommendations. During blending & ensemble, we combine alternative models to obtain better predictive models than any of the constituent models alone.

Figure 2: Overview of an emotion-aware recommendation system and the focus of the main contributions of the paper.

Contributions. The major contributions of this paper are as follows:
   • We systematically identify, extract and select the most relevant emotion-based features for use in news recommendation models. These features are associated with both items (e.g., news articles) and users (e.g., readers).
   • We devise a number of state-of-the-art models for generating recommendations that incorporate the additional emotion features. These models include variations of gradient boosting decision trees, deep matrix factorization methods and deep neural network architectures. In addition, we use ensembling methods to increase predictive performance by blending or combining the predictions of multiple constituent models.
   • We propose EmoRec, an emotion-aware recommendation model that demonstrates the best accuracy in the news recommendation task. EmoRec itself is an ensemble model.
   • We conduct a thorough experimental evaluation on a real dataset from the news domain. Our results demonstrate that the emotion-aware recommendation models consistently outperform state-of-the-art non-emotion-based recommendation models. Our study provides evidence of the usefulness of emotion features at large, as well as the feasibility of our approach of incorporating them into existing models to improve recommendations.
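The prediction task formalized above (scoring unseen items and returning the top-𝑘) can be sketched as follows; the item ids and probabilities are illustrative only:

```python
def top_k_recommendations(predicted, read_items, k):
    """Rank unseen items by predicted like-probability p_{u,i}.

    `predicted` maps item id -> probability that the user likes it;
    items already consumed (I_u) are excluded, per the problem statement,
    and the k highest-scoring remaining items are returned.
    """
    unseen = {i: p for i, p in predicted.items() if i not in read_items}
    return sorted(unseen, key=unseen.get, reverse=True)[:k]

# Toy scores for four articles; "a1" was already read and is filtered out.
probs = {"a1": 0.91, "a2": 0.35, "a3": 0.77, "a4": 0.62}
recs = top_k_recommendations(probs, read_items={"a1"}, k=2)
```

In the full system, the probabilities come from the trained (emotion-aware) models of Stage 2; only the ranking step is shown here.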
2     RELATED WORK
Prior research has found a range of features to be useful in the context of news recommender systems, such as user location [15], time of the day [26], demographic information [21], or an article's social media profile [50]. However, emotion, which is one of the important elements of human nature and has a significant impact on our behavior and choices [49], has received little attention. A number of studies in the areas of psychology, neurology, and behavioral sciences have shown that individuals' choices are related to their feelings and mental moods [24].
   In the context of recommender systems, one of the earliest works [17] pointed out that emotions are crucial for users' decision making and that users transmit their decisions together with emotions. Tkalcic et al. [42] introduced a unifying framework for using emotions in user interactions with a recommender system, and suggested that while an implicit approach to user feedback may be less accurate, it is well suited for user interaction purposes since the user is not aware of it [41].
   While emotions as features have been studied in movie recommendations [28, 29], music recommendations [18] and restaurant recommendations [44], to name a few, much less work has explored the role of emotion features in news recommender systems.
   Emotion in news articles has been studied for categorizing news stories into eight emotion categories [3]. Specifically for recommender systems, Parizi and Kazemifard [35] introduced a model for Persian news utilizing both the emotion of the news and the user's preference. More recently, Mizgajski and Morzy [23] introduced a recommender system for recommending news items by leveraging a multi-dimensional model of emotions, where emotion is derived through users' self-assessed reactions (i.e., explicit feedback), which can be considered an intrusive form of collection. In contrast to previous studies, our work focuses on studying the role of emotion features in news recommender systems using implicit user feedback.

3     FEATURES FOR RECOMMENDATION
This section describes the feature extraction procedure utilized in our proposed framework. The features are grouped into two main categories: (i) emotion-based features for items and users, and (ii) non-emotion-based features for items and users.

3.1    Emotion-based Features
The main objective of this paper is to improve the performance of recommender systems by leveraging user/item emotion features.
   Figure 3 shows an example of the textual content of an item (i.e., an article) in the news domain. As can be observed, several words, such as win and gratifying, express the emotion of happiness. Moreover, interjections such as yay and oh can be indicators of different emotions [16]. In this section, we describe how we extract such features to improve the effectiveness of the recommendation system. In order to maintain consistency, each news article is preprocessed by tokenizing it into words, removing stopwords and POS-tagging to extract nouns, verbs, adverbs and adjectives. In particular, we focus on two approaches for computing emotion features: sentiment analysis, which classifies text into neutral, positive and negative sentiments, and emotion analysis, which categorizes text into emotions such as happiness, sadness, anger, disgust, fear and so on. Note that we extract emotion features for both users and items.

Figure 3: Example emotions expressed in textual content

Table 1: Emotion Resources

    Resources              Size              Emotion Taxonomy
    WordNet-Affect [39]    4787 words        Several
    ISEAR [46]             7600 sentences    ISEAR
    NRC [25]               14,182 words      Plutchik
    SentiWordNet 3.0 [4]   11,000+ synsets   Sentiments

3.1.1 Item Emotion-based Features.
   Number of Emotion Words: This feature represents the number of words in an emotion lexicon (i.e., WordNet-Affect, see Table 1) that occur in the item (i.e., news article) more than once.
   Ekman's Emotion Label: We count the number of emotion words occurring in the text document for each emotion type (Ekman's six emotion categories [13]), and the text is then assigned the emotion label with the highest number of emotion words appearing in the text. If more than one emotion category has the highest count, 0 is assigned to this feature, leaving the next feature to indicate mixed emotions. A combination of different lexicons (WordNet-Affect and NRC, see Table 1) is used to find the emotion labels. We use multiple resources to obtain a larger set of emotion words for each emotion.
   Mixed Emotions: This feature indicates whether an item has more than one document-level emotion label based on Ekman's emotion model (i.e., if two or more emotions have the highest score, this feature is valued at 1, otherwise 0). Since the initial annotation effort (previous feature) illustrated that in many cases a sentence can exhibit more than one emotion, we have an additional category called mixed emotion to account for all such instances.
   Sentiment Feature: The text is classified into three categories: positive, negative and neutral. We utilize the approach introduced in [30] and use SentiWordNet [4].
   Interjections: This feature counts the number of interjections in a document. Interjections are short sounds, words or phrases spoken suddenly to express an emotion, e.g., oh, look out!, ah1. Our preliminary analysis found that interjections were common in quotes in news articles, which can be detected for potential emotions.

1 List of interjections derived from: i) https://surveyanyplace.com/the-ultimate-interjection-list, ii) https://7esl.com/interjections-exclamations, and iii) https://www.thoughtco.com/interjections-in-english-1692798

   Capitalized Words: This feature counts the number of words in a document with all uppercase characters. People use capital
words to express an emotion [43] and to make it stand out to readers (e.g., I said I am FINE).
   Punctuation: Two features are included to model the occurrence of question marks and exclamation marks repeated more than two times in a document. Punctuation can clarify emotional content of texts that is otherwise easy to miss [43].
   Grammatical Markers and Extended Words: This feature counts the number of words with a character repeated more than two times (e.g., haaappy or oh yeah!!????) [7], as the excessive use of letters in a word (i.e., repetition) is one way to emphasize feelings.
   Plutchik Emotion Scores: First, we measure the semantic relatedness score between a word W_i in the text and an emotion category C_j in the NRC lexicon (see Table 1) as follows [1]:

    PMI(W_i, C_j) = \sqrt[n]{\prod_{k=1}^{n} PMI(W_i, C_j^k)}    (1)

where C_j^k (k = 1 ... n) is the k-th word of emotion category C_j. PMI is the Pointwise Mutual Information, calculated as follows:

    PMI(W_i, C_j^k) = \log \frac{P(W_i, C_j^k)}{P(W_i)\,P(C_j^k)}    (2)

where P(W_i) and P(C_j^k) are the probabilities that W_i and C_j^k occur in a text corpus, respectively, and P(W_i, C_j^k) is the probability that W_i and C_j^k co-occur within a sliding window in the corpus. Finally, we calculate the average, maximum and minimum of these scores over all words in the text for each emotion category, and consider each as an individual feature.

3.1.2 User Emotion-based Features.
   As we do not have access to users' explicit emotions towards items, we develop users' implicit emotional profiles based on their historical interactions with items. By computing the emotion profile of the items with which a user interacts, we derive the emotional taste of the user over that period of time over the set of items.
   User Emotions Across Items: We determine the emotion scores (i.e., Plutchik's emotion scores) for the last item accessed before subscription as well as for the last 20 items accessed by the user. Then, we pick the top 3 most frequent emotions.
   User Emotions Across Categories: We determine the emotion of the categories of items (e.g., sports in the news domain) accessed by a user by counting the number of items assigned to an emotion in a specific category, with the most frequent emotion considered the emotion of the category. The feature is calculated over the whole history of the user.

3.2    Non-Emotion-based Features
Non-emotion-based features can also be classified into item-based and user-based features.

3.2.1 Item Non-Emotion-based Features.
   Item Topic: We extract topics from the article using Latent Dirichlet Allocation (LDA) [6]. In LDA, each topic is a distribution over words, and each document is a mixture of topics. The number of topics for the news articles is 112, chosen empirically to minimize the perplexity score of the LDA result. Thus, the item topic is represented by a vector of length 112.

Table 2: List of Emotion/Non-emotion Feature Importance

    Emotion Features                          Gain Score
    Plutchik emotion scores                   3200.86
    User emotions across items                1985.36
    User emotions across categories           1850.33
    Ekman's emotion label                     1101.38
    Punctuation                               910.55
    Grammatical markers and extended words    860.13
    Interjections                             773.12
    Capitalized words                         640.21
    Mixed emotions                            526.97
    Sentiment features                        360.68

    Non-emotion Features                      Gain Score
    User latent vector                        3640.87
    Potential to trigger subscription         2974.46
    User interest in subcategory              1530.28
    Topic labeling                            1421.19
    User spent time                           1110.57
    Visit count                               920.53
    Item topic                                867.12
    Coherence                                 685.23
    TF-IDF                                    410.29

   Topic Label: We use lda2vec [27] to generate and label the topics in an item (i.e., document), where each generated topic is labeled by the one of its top k words that is most semantically similar to the other words in the top-k word list. We then label the item (i.e., document) with the label of the most coherent topic among the top m topics of the document. The word vector of this label word is used as the value of this feature.
   TF-IDF: This feature represents items as n-grams (unigrams, bigrams, trigrams) with the TF-IDF weighting approach [22].
   Coherence: We first calculate the cosine similarity scores between all pairs of words in an item using word2vec pre-trained word vectors2, and then record four features: the average of the similarity scores, the standard deviation of the similarity scores, the lowest score that is higher than the standard deviation, and the highest score that is lower than the standard deviation.

2 https://code.google.com/archive/p/word2vec/

   Potential to Trigger Subscription: This feature represents the total number of times the item was requested right before a paywall was presented to a user who subsequently subscribed [10, 11]. In a subscription-based item delivery model, a paywall is the page asking for a subscription before allowing an unsubscribed user to continue accessing items.

3.2.2 User Non-emotion-based Features.
   Visit Count: We calculate the average number of items (articles) accessed by a user per visit. A visit is terminated if a user is inactive for more than 30 minutes.
   User Spent Time: Two features are computed: the average time the user spent per item, and the average time the user spent per visit.
   User Interest in Subcategory: This feature represents the empirical probability of subcategory 𝑠 given a user 𝑢 and a category 𝑐, denoted as 𝑃 (𝑠 |𝑢, 𝑐).
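A minimal sketch of computing this empirical probability from a user's reading history; the helper name and the toy events are our own, not the paper's code:

```python
from collections import Counter

def subcategory_interest(history, user, category, subcategory):
    """Empirical probability P(s | u, c): the fraction of user u's reads
    within category c that fall in subcategory s.

    `history` is a list of (user, category, subcategory) read events.
    """
    counts = Counter((u, c, s) for u, c, s in history)
    in_sub = counts[(user, category, subcategory)]
    in_cat = sum(n for (u, c, _), n in counts.items()
                 if u == user and c == category)
    return in_sub / in_cat if in_cat else 0.0

# Toy reading history for one user.
history = [
    ("u1", "politics", "election"),
    ("u1", "politics", "election"),
    ("u1", "politics", "economy"),
    ("u1", "sports", "soccer"),
]
p = subcategory_interest(history, "u1", "politics", "election")
```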
For example, 𝑃 (election|𝑢, politics) can be determined by the total number of articles the user read on the election over the total number of articles that the user read on politics. In our experiments, the categories and subcategories were provided with the dataset, and we consider only the top 50 most frequently visited subcategories for this feature.
   User Latent Vector: We calculate the latent vector for each user based on the matrix factorization introduced in [40]. This feature is chosen so that we can compare our method with the Deep Matrix Factorization model in [47], a state-of-the-art recommendation method, which uses this feature as input to a deep neural network.

3.3    Feature Selection
One of the critical steps after feature extraction is to select the important features for recommendation. Table 2 reports the most important features according to the gain importance score for the news dataset. We evaluate feature importance by averaging over 10 training runs of a gradient boosting machine learning model, XGBoost [9], to reduce variance3. Also, the model is trained using early stopping with a validation set to prevent over-fitting to the training data. By using the zero importance function, we find features that have zero importance according to XGBoost.

4     RECOMMENDATION MODEL
In this section, we introduce the tailored structure of an Emotion-aware Recommender System Model (EmoRec) for personalized recommendation. Our final model is an ensemble of three models leveraging both emotion- and non-emotion-based features. We describe the structure of the proposed model and the training methods next.

4.1    Model Training
Model 1 (Boost Model): Gradient Boosting Decision Tree (GBDT) methods are among the most powerful machine learning approaches

combination of all base models' outcomes:

    \sum_{i=1}^{6} \alpha_i p_i    (3)

where 𝑝𝑖 is the probability that the user is interested in the item according to base model 𝑖, and 𝛼𝑖 is the weight of base model 𝑖 learned by XGBoost/CatBoost.

Model 2 (Deep Neural Network (DNN)): Figure 4 shows our proposed Deep Neural Network architecture for leveraging the emotion features (and other commonly available features) for recommendation. The input is divided into four groups [5]: i) user non-emotion-based features, ii) item non-emotion-based features, iii) user emotion-based features, and iv) item emotion-based features. For the categorical inputs, we utilize one-hot encoding (the second layer consists of look-up embeddings mapping each categorical feature to a fixed-length embedding vector). In the architecture, a "Dense Layer" can be formalized as Dense(𝑥) = 𝑓(𝑊𝑥 + bias), where 𝑊 and bias are parameters, 𝑥 is the layer input and 𝑓 is the activation function (for a linear layer, 𝑓 is the identity function). We use 𝐿2 regularization to prevent over-fitting in the embedding layer and use back-propagation to learn the parameters.

Model 3 (Deep Matrix Factorization (Deep MF)): Inspired by the models proposed in [19, 47], we built our Deep MF (Figure 5) to leverage extra user/item features (i.e., emotion and non-emotion features) in the recommendation prediction task. In [47], the authors construct a user-item matrix with explicit ratings and implicit preference feedback; then, with this matrix as the input, they present a deep neural architecture to learn a low-dimensional space for the representation of both users and items. In [19], by replacing the inner product with a neural architecture, they learn an arbitrary function to capture the interactions between user and item latent
which have been effectively used in many domains [14] including                          vectors. Different from their work, we focused on modeling the
recommendation [48]. The basic idea in GBDT approaches is to                             user/item with rich extra features, such as non-emotion and emo-
learn a set of base/weak learners (i.e., decision trees) sequentially by                 tion based features, as well as using embedding vectors learned in
using different training splits. More precisely, at each step, we learn                  our DNN model. The input of our proposed model is the same as
a new base model by fitting it to the error residuals (i.e., difference                  the DNN model where the categorical features are encoded using
between the current model predictions and the actual target values)                      one hot vectors. The second layer is the look-up embedding. In
at that step. The new model outcome is the previous model outcome                        this layer, we have both MF embedding vectors, which we estimate
plus the (weighted) new base learner outcome. Eventually, the final                      through the learning process, and DNN embedding vectors, which
model outcome is the weighted average of all base learners outcome,                      are concatenation of embedding vectors (for each similar input
where the weights are learned jointly with the base learners. We                         group) learned from DNN model (they are fixed in this model).
train two state-of-the-art GBDT models, namely, XGBoost [9] and                          Generalized Matrix Factorization (GMF) layer combines two em-
Catboost [12], on our training datasets with the features selected                       beddings using dot product and applies some non-linearity. Similar
in Section 3.3 as the input.                                                             to DNN model, the output of the model is the probability that a
    XGBoost uses pre-sorted/histogram-based algorithm to compute                         user is interested in an item.
the best split while CatBoost uses ordered boosting, a permeation
based algorithm, to learn the weak learners effectively. Moreover,                       Ensemble/Blending Model: The final model EmoRec was the
XGBoost uses one-hot encoding before supplying categorical data,                         weighted average of the three models’ predictions. We use Nelder-
but CatBoost handles categorical features directly. We train both                        Mead Method [31] to find the optimum weights of each models.
models individually (three base models for each). The final model
output (i.e., probability that a user is interested in an item) is the                   5   EXPERIMENTS
3 Variance refers to the sensitivity of the learning algorithm to the specifics of the   In this section, we introduce the data, evaluation protocols and the
training data (e.g., the noise and specific observations).                               specific configurations used in our experiments.
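The residual-fitting loop described for Model 1 in Section 4.1 can be sketched in a few lines. This is only a toy illustration under simplifying assumptions (depth-1 regression stumps on a single feature, a fixed shrinkage rate, squared-error residuals), not the XGBoost/CatBoost implementation used in our experiments; the function names and data are illustrative.

```python
# Toy GBDT sketch: each new base learner (a depth-1 "stump") is fit to the
# residuals of the current ensemble, and the final prediction is the sum of
# the shrunken base-learner outputs, as described in Section 4.1 (Model 1).

def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # degenerate split
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def fit_gbdt(xs, ys, n_rounds=20, lr=0.5):
    """Sequentially fit stumps to the current residuals; return a predictor."""
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # current errors
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Toy data: a step function the additive ensemble should recover.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gbdt(xs, ys)
```

Each round shrinks the remaining residual geometrically, so after a few iterations the ensemble's prediction for the right half of the toy data approaches 1 while the left half stays at 0.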
Figure 4: The Structure of Our DNN Model

Figure 5: The Structure of Our Deep MF Model

5.1 Data
Our experiments are conducted on a real-world news dataset. The Globe and Mail is one of the major newspapers⁴ in Canada. We use the data spanning from January to July 2014 (a 6-month period) in our experiments, where the data in the first four months were used for training, and the last two months for testing. The dataset contains information for 359,145 articles and 88,648 users in total, out of which 17,009 became subscribers during this period, and 71,639 were non-subscribers. Every time a user reads an article, watches a video or generally takes an action in the news portal, the interaction is recorded as a hit. Typically, a hit contains information like date, time, user id, visited article, and special events of interest like subscription, sign-in, and so on.

⁴ https://www.theglobeandmail.com/

5.2 Evaluation Metrics
We use F-score to measure the predictive performance of a recommender system. For each user in the test data set, we use the original set of read articles in the test period as the ground truth, denoted as 𝑇𝑔. Assuming the set of recommended news articles for the user is 𝑇𝑟, precision, recall, and F-measure are defined as follows:

    Precision = |𝑇𝑔 ∩ 𝑇𝑟| / |𝑇𝑟|,   Recall = |𝑇𝑔 ∩ 𝑇𝑟| / |𝑇𝑔|

    𝐹 = 2 × (Precision × Recall) / (Precision + Recall)

The F-score on a test data set is the average over all the users in the test data set.

5.3 Comparing Recommendation Models with and without Emotion Features
Our main objective is to see whether the use of emotion features boosts the performance of recommendation models. For this purpose, we run the three state-of-the-art recommendation models described in the last section, as well as several ensembles formed by these models, with and without emotion features. The models used in our evaluation are as follows:
   • Single Boost Model: We run XGBoost and CatBoost separately to make predictions and collect the average of their F-scores.
   • Boost Blend: This is the 6-model ensemble described in Model 1 in Section 4.1.
   • Deep MF: This is the deep matrix factorization model described in Section 4.1.
   • Single DNN Model: We run the DNN model 5 times with the same hyperparameters but different random seeds and collect the average result over the 5 runs.
   • DNN Ensemble: An ensemble of 5 DNN models with different hyperparameters (e.g., different learning rates) is run 5 times, each with a different random seed. The average result over the 5 runs is collected.
   • Boost Blend + Deep MF: This is an ensemble consisting of Boost Blend and Deep MF.
   • Boost Blend + DNN Ensemble: This is an ensemble consisting of Boost Blend and DNN Ensemble.
   • Deep MF + DNN Ensemble: This is an ensemble consisting of Deep MF and DNN Ensemble.
   • Boost Blend + Deep MF + DNN Ensemble: This is an ensemble consisting of Boost Blend, Deep MF and DNN Ensemble.

Table 3: Results of our Models on News Dataset (F-score)

    Model                                           Non-Emo    All
    Single Boost Model                               70.19    70.86
    Boost Blend                                      70.69    71.50
    Deep MF                                          72.93    73.29
    Single DNN Model                                 70.88    73.00
    DNN Ensemble                                     73.62    74.30
    Boost Blend + Deep MF                            73.07    74.98
    Boost Blend + DNN Ensemble                       74.00    74.23
    Deep MF + DNN Ensemble                           74.61    75.10
    EmoRec (Boost Blend + Deep MF + DNN Ensemble)    78.20    80.30

We train each of the above models using the training data of our data set and use the trained model to make recommendations by predicting a user's interest in an item in the test data. Table 3 shows the results (in F-score) of using these recommendation methods with and without emotion features on the news data set, where the whole set of emotion features described in Section 3.3 is used in the results for "All", while none of the emotion features is used in the results for "Non-Emo". As can be seen, adding emotion features
improves the predictive performance for all the recommendation methods. Among the single recommendation models (i.e., Single Boost Model, Deep MF and Single DNN Model), Deep MF performs the best. The results also show that the ensemble methods perform better than the single/component models. The best performance is produced by the largest ensemble (i.e., Boost Blend + Deep MF + DNN Ensemble). We refer to this best-performing model as our EmoRec model.

5.4 Comparison with Other Baselines
We also compare our EmoRec model with the following three state-of-the-art recommendation methods with well-tuned parameters (that is, the parameters are optimally tuned to ensure a fair comparison). The objective is to investigate whether emotion features can smarten up these recommender systems. A brief description of the three models is as follows:
   Basic MF: This is the simple matrix factorization model used for discovering latent features between two entities (i.e., users and articles) [40]. Both user preferences and item characteristics are mapped to latent factor vectors. Each element of the item-specific factor vector measures the extent to which the item possesses one feature. Accordingly, each element of the user-specific factor vector measures the extent of the user's preference for that feature.
   FDEN and GBDT: an ensemble of different models, including Field-aware Deep Embedding Networks and Gradient Boosting Decision Trees [5]. The predictions of the FDENs come from a bagging ensemble using the arithmetic mean of many networks, each of which has slight differences in hyper-parameters, including the forms of the activation.
   Truncated SVD-based Feature Engineering: a gradient boosted decision trees model with truncated SVD-based embedding features [37]. To overcome the cold-start problem, truncated SVD-based embedding features were created using the embedding features together with four different statistics-based features (users, items, artists and time); the final model was the weighted average of the five models' predictions.

Table 4: Comparison of EmoRec with State-of-the-art Baselines on News Dataset (F-score)

    Model                                      Non-Emo    All
    Basic MF                                    69.10    71.23
    FDEN and GBDT                               72.02    73.28
    Truncated SVD-based Feature Engineering     73.12    74.01
    EmoRec                                      78.20    80.30

The results are illustrated in Table 4, which shows that emotion features can also improve the recommendation performance of these three state-of-the-art baselines. In addition, our EmoRec model performs significantly better than these three baselines both when using emotion features and when not using them.

5.5 Effect of Individual Emotion Features
Table 5 presents the results of a feature ablation study conducted to further understand the effect of the individual emotion features used in EmoRec. In each run of this study, we keep all the features except one type of emotion feature. The results indicate that removing the Plutchik emotion scores (an item feature), User emotions across categories and User emotions across items (user features) leads to a considerable decline in performance. This also shows that our model is able to capture useful implicit user emotions effectively.

Table 5: Effect of Individual Emotion Features (F-score)

    Emotion Features                             News
    ALL emotion features                         80.30
    - Sentiment features                         78.15
    - Mixed emotions                             76.90
    - Capitalized words                          76.21
    - Interjections                              75.84
    - Grammatical markers and extended words     75.23
    - Ekman's emotion label                      74.98
    - Punctuation                                75.17
    - User emotions across categories            74.15
    - User emotions across items                 73.23
    - Plutchik emotion scores                    72.10

To further validate the effectiveness of the top emotion features as learned from our experiments, we run a further experiment incorporating only the top three emotion features (i.e., Plutchik emotions, User emotions across categories, and User emotions across items) in six state-of-the-art recommendation models. As the results in Table 6 show, using only these three emotion features can also improve the recommender systems, with Basic MF showing the most gain.

Table 6: Effect of Top Three Emotion Features (Plutchik emotions, User emotions across categories, and User emotions across items) on State-of-the-art Models

    Model                   No Emotion    Top Three Emotion
    Basic MF                   69.10            70.38
    Boost Blend                70.69            71.00
    FDEN and GBDT              72.02            72.77
    Deep MF                    72.93            73.01
    Truncated SVD-based        73.12            73.60
    DNN Ensemble               73.62            73.98

6 CONCLUSIONS
Motivated by the recent developments in emotion detection methods (in textual information), we considered the problem of leveraging emotion features to improve recommendations. Towards that end, we derived a large number of emotion features that can be attributed to both items and users in the news domain and can provide an emotional context. Then, we devised state-of-the-art non-emotion and emotion-aware recommendation models to investigate whether, how and to what extent emotion features can improve recommendations. To the best of our knowledge, this is the first attempt to systematically and broadly evaluate the utility of a number of emotion features for the recommendation task. Our results indicate that emotion-aware recommendation models consistently outperform state-of-the-art non-emotion-based recommendation models.
Furthermore, our study provided evidence of the usefulness of the emotion features at large, as well as the feasibility of our approach of incorporating them into existing models to improve recommendations.
   As a more tangible outcome of the study, we proposed EmoRec, an emotion-aware recommendation model, which demonstrates the best predictive performance in the news recommendation task. EmoRec itself is an ensemble model combining three models (Boost Blend + Deep MF + DNN Ensemble). It significantly outperforms the other state-of-the-art recommendation methods evaluated in our experiments. We also evaluated the proposed emotion features individually. Among the emotion features examined, the Plutchik emotion scores of items (obtained by computing PMI scores between words) and the user emotion profiles (based on the emotion scores of the items that the user accessed) are the most important.
   Employing emotional context in recommendations appears to be a promising direction of research. While the scope of our current study is limited to emotions extracted from textual information, there is evidence that emotions can be extracted through other means of communication, such as audio and video, or other cues [38].

7 ACKNOWLEDGEMENTS
This work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), The Globe and Mail, and the Big Data Research, Analytics, and Information Network (BRAIN) Alliance established by the Ontario Research Fund Research Excellence Program (ORF-RE). We would like to thank The Globe and Mail for providing the dataset used in this research. In particular, we thank Gordon Edall and the Data Science team of The Globe and Mail for their insights and collaboration in our joint project.

REFERENCES
[1] Ameeta Agrawal and Aijun An. 2012. Unsupervised Emotion Detection from Text Using Semantic and Syntactic Relations. In 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Vol. 1. IEEE, Macau, China, 346–353. https://doi.org/10.1109/WI-IAT.2012.170
[2] Ameeta Agrawal, Aijun An, and Manos Papagelis. 2018. Learning Emotion-enriched Word Representations. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 950–961. https://www.aclweb.org/anthology/C18-1081
[3] Mostafa Al Masum Shaikh, Helmut Prendinger, and Mitsuru Ishizuka. 2010. Emotion Sensitive News Agent (ESNA): A system for user centric emotion sensing from the news. Web Intelligence and Agent Systems 8, 4 (2010), 377–396.
[4] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, Vol. 10. 2200–2204.
[5] Bing Bai and Yushun Fan. 2017. Incorporating Field-aware Deep Embedding Networks and Gradient Boosting Decision Trees for Music Recommendation. In The 11th ACM International Conference on Web Search and Data Mining (WSDM). ACM, London, England, 7.
[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (March 2003), 993–1022.
[7] Mondher Bouazizi and Tomoaki Otsuki. 2016. A Pattern-Based Approach for Sarcasm Detection on Twitter. IEEE Access 4 (2016), 5477–5488. https://doi.org/10.1109/ACCESS.2016.2594194
[8] Li Chen, Guanliang Chen, and Feng Wang. 2015. Recommender Systems Based on User Reviews: The State of the Art. User Modeling and User-Adapted Interaction 25, 2 (June 2015), 99–154. https://doi.org/10.1007/s11257-015-9155-5
[9] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
[10] Heidar Davoudi, Aijun An, Morteza Zihayat, and Gordon Edall. 2018. Adaptive Paywall Mechanism for Digital News Media. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 205–214. https://doi.org/10.1145/3219819.3219892
[11] H. Davoudi, M. Zihayat, and A. An. 2017. Time-Aware Subscription Prediction Model for User Acquisition in Digital News Media. In Proceedings of the 2017 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, Houston, Texas, USA, 135–143. https://doi.org/10.1137/1.9781611974973.16
[12] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. (Oct. 2018).
[13] Paul Ekman. 1984. Expression and the nature of emotion. Approaches to Emotion 3 (1984), 19–344.
[14] Ji Feng, Yang Yu, and Zhi-Hua Zhou. 2018. Multi-Layered Gradient Boosting Decision Trees. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3551–3561.
[15] Blaž Fortuna, Carolina Fortuna, and Dunja Mladenić. 2010. Real-time news recommender system. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 583–586.
[16] Cliff Goddard. 2014. Interjections and Emotion (with Special Reference to "Surprise" and "Disgust"). Emotion Review 6, 1 (Jan. 2014), 53–63. https://doi.org/10.1177/1754073913491843
[17] Gustavo Gonzalez, Josep Lluis de la Rosa, Miquel Montaner, and Sonia Delfin. 2007. Embedding Emotional Context in Recommender Systems. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop (ICDEW '07). IEEE Computer Society, Washington, DC, USA, 845–852.
[18] Byeong-Jun Han, Seungmin Rho, Sanghoon Jun, and Eenjun Hwang. 2010. Music emotion classification and context-based music recommendation. Multimedia Tools and Applications 47, 3 (2010), 433–460.
[19] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). ACM Press, Perth, Australia, 173–182. https://doi.org/10.1145/3038912.3052569
[20] Dhruv Khattar, Vaibhav Kumar, Manish Gupta, and Vasudeva Varma. 2018. Neural Content-Collaborative Filtering for News Recommendation. In NewsIR'18 Workshop. NewsIR@ECIR, Grenoble, France, 1395–1399.
[21] Hong Joo Lee and Sung Joo Park. 2007. MONERS: A news recommender for the mobile web. Expert Systems with Applications 32, 1 (2007), 143–150.
[22] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. (2009), 569.
[23] Jan Mizgajski and Mikołaj Morzy. [n.d.]. Affective recommender systems in online news industry: how emotions influence reading choices. User Modeling and User-Adapted Interaction ([n.d.]), 1–35.
[24] Saif M. Mohammad and Felipe Bravo-Marquez. 2017. WASSA-2017 Shared Task on Emotion Intensity. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA). Association for Computational Linguistics, Copenhagen, Denmark, 34–49. https://arxiv.org/abs/1708.03700
[25] Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a Word–Emotion Association Lexicon. Computational Intelligence 29, 3 (Aug. 2013), 436–465. https://doi.org/10.1111/j.1467-8640.2012.00460.x
[26] Alejandro Montes-García, Jose María Álvarez-Rodríguez, Jose Emilio Labra-Gayo, and Marcos Martínez-Merino. 2013. Towards a journalist-based news recommendation system: The Wesomender approach. Expert Systems with Applications 40, 17 (2013), 6735–6741.
[27] Christopher E. Moody. 2016. Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. arXiv:1605.02019 [cs] (May 2016). http://arxiv.org/abs/1605.02019
[28] Yashar Moshfeghi, Benjamin Piwowarski, and Joemon M. Jose. 2011. Handling data sparsity in collaborative filtering using emotion and semantic based features. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 625–634.
[29] Ante Odić, Marko Tkalčič, Jurij F. Tasič, and Andrej Košir. 2013. Predicting and detecting the relevant contextual information in a movie-recommender system. Interacting with Computers 25, 1 (2013), 74–90.
[30] Sylvester Olubolu Orimaye, Saadat M. Alhashmi, and Siew Eu-gene. 2012. Sentiment Analysis Amidst Ambiguities in Youtube Comments on Yoruba Language (Nollywood) Movies. In Proceedings of the 21st International Conference on World Wide Web (WWW '12 Companion). ACM, New York, NY, USA, 583–584. https://doi.org/10.1145/2187980.2188138
[31] Yoshihiko Ozaki, Masaki Yano, and Masaki Onishi. 2017. Effective hyperparameter optimization using Nelder-Mead method in deep learning. IPSJ Transactions on Computer Vision and Applications 9, 1 (Nov. 2017), 20. https://doi.org/10.1186/s41074-017-0030-7
[32] Maja Pantic and Alessandro Vinciarelli. 2009. Implicit human-centered tagging [Social Sciences]. IEEE Signal Processing Magazine 26, 6 (2009), 173–180.
[33] Manos Papagelis and Dimitris Plexousakis. 2005. Qualitative analysis of user-based and item-based prediction algorithms for recommendation agents. Engineering Applications of Artificial Intelligence 18, 7 (2005), 781–789.
[34] Manos Papagelis, Dimitris Plexousakis, and Themistoklis Kutsuras. 2005. Alleviating the sparsity problem of collaborative filtering using trust inferences. In International Conference on Trust Management. Springer, 224–239.
[35] Ali Hakimi Parizi and Mohammad Kazemifard. 2015. Emotional news recommender system. In 2015 Sixth International Conference of Cognitive Science (ICCS). IEEE, 37–41.
[36] Mikhail Rumiantcev. 2017. Music adviser: emotion-driven music recommendation ecosystem. Ph.D. Dissertation. Department of Mathematical Information Technology, Oleksiy Khriyenko. https://jyx.jyu.fi/handle/123456789/53196
[37] Nima Shahbazi, Mohamed Chahhou, and Jarek Gryz. 2017. Truncated SVD-based Feature Engineering for Music Recommendation. In The 11th ACM International
[…] Reviews. In Fourth International AAAI Conference on Weblogs and Social Media. Association for Computational Linguistics, Washington, DC, 107–116.
[44] Blanca Vargas-Govea, Gabriel González-Serna, and Rafael Ponce-Medellín. 2011. Effects of relevant contextual features in the performance of a restaurant recommender system. ACM RecSys 11, 592 (2011), 56.
[45] Karzan Wakil, Rebwar Bakhtyar, Karwan Ali, and Kozhin Alaadin. 2015. Improving Web Movie Recommender System Based on Emotions. International Journal of Advanced Computer Science and Applications 6, 2 (2015), 9. https://doi.org/10.14569/IJACSA.2015.060232
[46] H. G. Wallbott and K. R. Scherer. 1986. How universal and specific is emotional experience? Evidence from 27 countries on five continents. Social Science Infor-
     Conference on Web Search and Data Mining(WSDM). ACM, London, England, 7.                  mation 25, 4 (Dec. 1986), 763–795. https://doi.org/10.1177/053901886025004001
[38] Mohammad Soleymani, Sadjad Asghari-Esfeden, Yun Fu, and Maja Pantic. 2016.           [47] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017.
     Analysis of EEG signals and facial expressions for continuous emotion detection.          Deep Matrix Factorization Models for Recommender Systems. In Proceedings of
     IEEE Transactions on Affective Computing 7, 1 (2016), 17–28.                              the Twenty-Sixth International Joint Conference on Artificial Intelligence. Inter-
[39] Carlo Strapparava, Alessandro Valitutti, and others. 2004. WordNet Affect: an             national Joint Conferences on Artificial Intelligence Organization, Melbourne,
     Affective Extension of WordNet.. In LREC, Vol. 4. European Language Resources             Australia, 3203–3209. https://doi.org/10.24963/ijcai.2017/447
     Association (ELRA), Lisbon, Portugal, 1083–1086.                                     [48] Qian Zhao, Yue Shi, and Liangjie Hong. 2017. GB-CENT: Gradient Boosted
[40] Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk. 2008. Inves-             Categorical Embedding and Numerical Trees. In Proceedings of the 26th Interna-
     tigation of various matrix factorization methods for large recommender systems.           tional Conference on World Wide Web (WWW ’17). International World Wide Web
     In 2008 IEEE International Conference on Data Mining Workshops. IEEE, 553–562.            Conferences Steering Committee, Republic and Canton of Geneva, Switzerland.
[41] Marko Tkalčič, Urban Burnik, Ante Odić, Andrej Košir, and Jurij Tasič. 2012.         [49] Yong Zheng, Robin Burke, and Bamshad Mobasher. 2013. The Role of Emotions
     Emotion-aware recommender systems–a framework and a case study. In Interna-               in Context-aware Recommendation. In RecSys workshop in conjunction with the
     tional Conference on ICT Innovations. Springer, 141–150.                                  7th ACM conference on Recommender Systems. RecSys workshop in conjunction
[42] Marko Tkalcic, Andrej Kosir, Jurij Tasivc, and Matevž Kunaver. 2011. Affective            with the 7th ACM conference on Recommender Systems, Hong Kong, China., 8.
     recommender systems: the role of emotions in recommender systems. 9–13.              [50] Morteza Zihayat, Anteneh Ayanso, Xing Zhao, Heidar Davoudi, and Aijun An.
[43] Oren Tsur, Dmitry Davidov, and Ari Rappoport. 2010. ICWSM-A Great Catchy                  2019. A utility-based news recommendation system. Decision Support Systems
     Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product                117 (2019), 14–27.