Diversifying Sentiments in News Recommendation

Mete Sertkan, Sophia Althammer, Sebastian Hofstätter and Julia Neidhardt
Christian Doppler Laboratory for Recommender Systems, TU Wien, Vienna, Austria
Contact: mete.sertkan@tuwien.ac.at (M. Sertkan)

Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2022), September 22nd, 2022, co-located with the 16th ACM Conference on Recommender Systems, Seattle, WA, USA.

Abstract

Personalized news recommender systems are widely deployed to filter the information overload caused by the sheer amount of news produced daily. Recommended news articles usually have a sentiment similar to the sentiment orientation of the previously consumed news, creating a self-reinforcing cycle of sentiment chambers around people. Wu et al. introduced SentiRec, a sentiment diversity-aware neural news recommendation model, to counter this lack of diversity. In this work, we reproduce SentiRec without access to the original source code and data sample. We re-implement SentiRec from scratch and use the Microsoft MIND dataset (same source but a different subset than in the original work) for our experiments. We evaluate and discuss our reproduction from different perspectives. While the original paper mainly takes a user-centric perspective on sentiment diversity by comparing the recommendation list to the user's interaction history, we also analyze the intra-list sentiment diversity of the recommendation list. Additionally, we study the effect of sentiment diversification on topical diversity. Our results suggest that SentiRec does not generalize well to other data, since the compared baselines already perform well, opposing the original work's findings. While the original SentiRec utilizes a rule-based sentiment analyzer, we also study a pre-trained neural sentiment analyzer. However, we observe no improvements in either effectiveness or sentiment diversity. To foster reproducibility, we make our source code publicly available.

1. Introduction

Content-based recommenders usually recommend items similar to those a user has liked in the past [1]. Recent well-performing neural news recommendation methods also follow this principle. They model users based on their previously browsed news articles and, in turn, rank candidate news articles based on a relevance score derived from the user model [2]. However, such approaches are prone to a lack of diversity. Especially since news with negative sentiment is clicked more often than news with positive sentiment, diversifying the sentiment is essential in news recommendation [3]. Taking all this into account, Wu et al. [3] introduced SentiRec, a sentiment diversity-aware neural news recommendation method. They learn sentiment-aware news representations by considering the content of the news and jointly training the recommendation model with an auxiliary sentiment prediction task. Users are modeled by their previously clicked and non-clicked (i.e., seen but not clicked) news articles. The SentiRec approach regularizes and thus increases sentiment diversity by penalizing candidate news with sentiment similar to the user's overall sentiment orientation.
In both the sentiment regularization and sentiment prediction tasks, VADER [4], a rule-based sentiment analyzer, is utilized to determine the sentiment polarity score used as the label.

In this work, we reproduce SentiRec without having access to the original source code or dataset. Our request for access to the original source code and dataset has not been answered yet. Thus, we re-implement SentiRec from scratch and use the Microsoft MIND [2] dataset (same data source but a different subset than in the original work) for our experiments. We evaluate our reproduction from different perspectives, namely i) effectiveness, ii) user-centric sentiment diversity, iii) intra-list sentiment diversity, and iv) topical diversity.

In our first evaluation perspective, we aim to compare effectiveness trends from the original paper with our implementation and study:

RQ1 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning effectiveness?

In contrast to the original work, our reproduction does not significantly outperform the baselines, which might be due to the dataset differences, highlighting the shortcomings of SentiRec regarding generalizability. We also employed a pre-trained neural sentiment analyzer (BERT-SA¹) in addition to the rule-based one (VADER-SA [4]). When using BERT-SA, we observe no gains in recommendation performance or sentiment diversity compared to the VADER-SA setting.

Our next evaluation perspective is user-centric sentiment diversity, as defined in the original paper; thus, we investigate:

RQ2 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning user-centric sentiment diversity?

In contrast to the original paper, we could not achieve the best user-centric sentiment diversity results by outperforming the random model while maintaining the best effectiveness. Moreover, we demonstrate that some baselines already reach sufficient user-centric sentiment diversity and significantly outperform SentiRec, (again) highlighting the lack of generalizability.

While the original paper focuses on user-centric sentiment diversity by comparing the recommended list of news to the user's interaction history, our third perspective focuses on sentiment diversity between news articles within a recommendation list, i.e., intra-list sentiment diversity. Thus, we investigate:

RQ3 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning intra-list sentiment diversity?

In contrast to the user-centric evaluation, and although penalized for user-centric sentiment similarity during training, our reproduction significantly outperforms most baselines when intra-list sentiment diversity is considered. This calls for further investigation and a discussion of whether to employ user-centric or intra-list diversification.

While the original paper only considers sentiment diversity, we also analyze topical diversity, and thus in our final evaluation perspective, we study:

RQ4 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning user-centric and intra-list topical diversity?

The user-centric topical diversity compares the user's interaction history to the recommendation list. We demonstrate that the baselines already reach significantly better user-centric topical diversity than our SentiRec reproduction, highlighting the trade-off between different objectives.
¹ https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

In intra-list topical diversity, our reproduction reaches results comparable to the baselines (setting aside the random model).

The contributions of this work are as follows:

• We reproduce SentiRec [3] without having access to the original source code and dataset. Instead, we re-implement SentiRec from scratch and use the MIND [2] dataset. Although our implementation shows similar trends, we fail to reproduce the original findings, which might be due to dataset differences. In particular, the baselines in our experiments already show decent recommendation and sentiment diversity performance.

• We propose extending the experiment by using a pre-trained neural sentiment analyzer instead of a rule-based sentiment analyzer. However, we observe no gains in either effectiveness or sentiment diversity.

• We propose extending the experiment by considering user-centric topical diversity and intra-list topical and sentiment diversity. While the baselines outperform our reproduction when user-centric and intra-list topical diversity is considered, our reproduction significantly outperforms the baselines in intra-list sentiment diversity.

• We publish the first open implementation of SentiRec for the community at: https://github.com/MeteSertkan/newsrec

2. Background

The way items are presented often influences the decision behavior of users [5]. Thus, when users interact with news articles, the textual style also plays an essential role besides semantic or syntactic properties [3, 6, 7]. However, such features are hard to engineer by hand. Recently, deep learning architectures have been increasingly used in recommendation scenarios [8]. These architectures have proven highly beneficial for capturing various patterns (e.g., user sessions, structure in pictures or language) and for dealing with high complexity (e.g., multi-modal data, very dynamic settings). They usually follow an end-to-end feature extraction paradigm, where the recommendation model and the representation model (i.e., item and user encoder) are trained simultaneously; thus, handcrafted heuristics are avoided [9]. This trend has also reached the news recommendation domain. For example, NAML [10] uses attention networks to incorporate different views of a news article (e.g., title, abstract, category) into the news representation; LSTUR [11] captures the short-term interest of users by applying a GRU to recently clicked items and their long-term interest by considering a user's whole history; and NRMS [12] uses multi-head self-attention in combination with additive attention to model news articles and, in turn, users. However, by only considering the content of the users' previous interactions, these models are prone to recommend in a "more of the same" way and consequently might lack diversity. Therefore, we study news diversification and, in particular, sentiment diversification. In this work, we re-implement, extend, and analyze SentiRec [3]. SentiRec learns sentiment-aware news representations using an auxiliary sentiment prediction task and introduces a sentiment regularization method to obtain sentiment-diverse recommendations. While sentiment-aware recommendations have been studied in the tourism domain [13, 14], the movie domain [15, 16], and e-commerce [17, 18], to name a few, little attention has been paid to sentiment-aware recommendation in the news domain, and even less to sentiment diversification.
Figure 1: Overview of SentiRec [3], comprising the following major components: (1) News Encoder, which learns to encode news by their content and simultaneously to predict a sentiment score based on the learned encoding; (2) Sentiment Analyzer, which assigns a sentiment score to each news article based on its content; (3) User Encoder, which models users based on their previous news interactions; (4) Click Predictor, which determines a score for a given user and candidate news pair; and (5) Sentiment Monitor, which monitors and regularizes the sentiment diversity.

3. Methods

3.1. SentiRec

SentiRec aims to optimize recommendation accuracy and sentiment diversity, which naturally leads to a trade-off between accuracy and diversity. The overall task is to rank candidate items based on a user's history of previous items. Given, for a user $u$, a history set $H$ of $N$ previously browsed news articles $[D_1, \dots, D_N]$ with sentiment polarity scores $[s_1, \dots, s_N]$, the aim is to rank a set $C$ of $p$ candidate news articles $[D^c_1, \dots, D^c_p]$ (with sentiment polarity scores $[s^c_1, \dots, s^c_p]$) by assigning each article a score, i.e., $[\hat{y}_1, \dots, \hat{y}_p]$. In particular, SentiRec seeks sentiment diversity in the recommendation list. Higher diversity is achieved if top-ranked news articles have sentiment polarity scores different from the overall sentiment orientation $\bar{s} = \mathrm{mean}([s_1, \dots, s_N])$ of the user's previously browsed news. In the following, we describe the different SentiRec components as shown in Figure 1.

(1) News Encoder. The task of the news encoder is to find a representation $r^c$ of a candidate news article $D^c$ as well as representations $[r_1, \dots, r_N]$ of the browsed news $[D_1, \dots, D_N]$ by taking their titles as input. It consists of an embedding layer followed by a transformer layer to obtain a representation $r$ out of a sequence of terms. Since no details about the transformer layer were given, we follow the architecture of the closely related NRMS [12] model. Thus, we use multi-head self-attention for contextualization and additive attention to obtain a unified embedding out of the contextualized word embeddings. The news encoder is jointly trained with an auxiliary sentiment prediction task in order to infuse sentiment awareness into the news representation. The sentiment score $\hat{s}$ is predicted using a linear layer, i.e., $\hat{s} = V_s r + v_s$, where $V_s$ and $v_s$ are learnable parameters and $r$ is the news representation. As loss function, the mean absolute error between the predicted sentiment scores $\hat{s}_i$ and the sentiment scores $s_i$ determined by the sentiment analyzer is used, where $S$ is the training set:

$$\mathcal{L}_{senti} = \frac{1}{|S|} \sum_{i \in S} |\hat{s}_i - s_i| \quad (1)$$

(2) Sentiment Analyzer. Given the title of a news article, the sentiment analyzer determines the sentiment polarity score in the range [-1, 1], which is taken as the sentiment label of the respective news article. The original paper uses VADER [4] (a rule-based method) as sentiment analyzer (VADER-SA). In addition, we also study a pre-trained neural sentiment analyzer² (BERT-SA).

² https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
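To make the labeling step concrete, below is a minimal sketch of scoring a title with both analyzers. VADER's compound score is natively in [-1, 1]; mapping BERT-SA's class probability to a signed polarity is our assumption, as the paper only requires labels in that range.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

vader = SentimentIntensityAnalyzer()
bert_sa = pipeline("sentiment-analysis",
                   model="distilbert-base-uncased-finetuned-sst-2-english")

def vader_score(title: str) -> float:
    # VADER's "compound" score is already a polarity in [-1, 1].
    return vader.polarity_scores(title)["compound"]

def bert_score(title: str) -> float:
    # Assumption: fold the class probability into a signed polarity,
    # negative for the NEGATIVE class and positive otherwise.
    result = bert_sa(title)[0]
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

print(vader_score("Stocks plunge as recession fears grow"))  # negative value
print(bert_score("Stocks plunge as recession fears grow"))
```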
(3) User Encoder. The user encoder takes the sentiment-aware representations of the previously browsed news, i.e., $[r_1, \dots, r_N]$, as input and uses a transformer layer (i.e., multi-head self-attention followed by additive attention, following NRMS [12]) to obtain a representation $u$ of the user.

(4) Click Predictor. The click predictor uses the dot product between the user and candidate embeddings, i.e., $\hat{y} = u \cdot r^c$, to determine a click score $\hat{y}$.

(5) Sentiment Monitor. The sentiment monitor observes to what extent the sentiment polarity score $s^c$ of a candidate news article (obtained by the sentiment analyzer) diverges from the user's overall sentiment orientation $\bar{s} = \mathrm{mean}([s_1, \dots, s_N])$ (i.e., the mean sentiment polarity score of the user's browsing history). This diversity in sentiment is measured by $p = \max(0, \bar{s} \cdot s^c \cdot \hat{y})$, where larger values of $p$ indicate less sentiment diversity. The sentiment diversity score $p$ is further used to regularize and steer the model into a more sentiment-diverse direction. The following loss function is used for this purpose:

$$\mathcal{L}_{div} = \frac{1}{|S|} \sum_{i \in S} p_i \quad (2)$$

where $S$ is the training set and $p_i$ the sentiment diversity score of the $i$-th sample.

Negative sampling is used to create a labeled dataset for the recommendation task. For each clicked news article in a user impression, $K$ non-clicked samples from the same impression are randomly selected. The recommendation loss is the negative log-likelihood of the clicked samples and is defined as follows:

$$\mathcal{L}_{rec} = - \sum_{i \in S} \log \left( \frac{\exp(\hat{y}^+_i)}{\exp(\hat{y}^+_i) + \sum_{j=1}^{K} \exp(\hat{y}^-_{i,j})} \right) \quad (3)$$

where $\hat{y}^+_i$ is the click score of the $i$-th clicked news, $\hat{y}^-_{i,j}$ the click score of the $j$-th of the corresponding $K$ negative samples, and $S$ is the training set.

The final loss function brings all three losses, i.e., the recommendation loss, the sentiment prediction loss, and the sentiment diversity loss, together as follows:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{senti} + \mu \mathcal{L}_{div} \quad (4)$$

where $\lambda$ and $\mu$ are hyperparameters controlling the influence of the sentiment prediction loss and the sentiment diversity loss, respectively.
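For illustration, the following is a minimal PyTorch sketch of the combined objective in Equations (1)-(4); the tensor shapes, batching, and variable names are our own choices under stated assumptions, not the original implementation.

```python
import torch
import torch.nn.functional as F

def sentirec_loss(y_pos, y_neg, s_hat, s, s_user, s_cand, y_cand,
                  lam=0.4, mu=10.0):
    # y_pos: (B,) click scores of the clicked news; y_neg: (B, K) click scores
    # of the K negatives sampled from the same impression.
    logits = torch.cat([y_pos.unsqueeze(1), y_neg], dim=1)       # (B, K+1)
    target = torch.zeros(logits.size(0), dtype=torch.long)      # clicked news at index 0
    l_rec = F.cross_entropy(logits, target)                     # Eq. (3), as a batch mean
    # Eq. (1): MAE between predicted and analyzer-provided sentiment labels.
    l_senti = (s_hat - s).abs().mean()
    # Eq. (2): p = max(0, s_bar * s_c * y_hat) penalizes candidates whose
    # sentiment aligns with the user's overall orientation, scaled by the click score.
    l_div = torch.clamp(s_user * s_cand * y_cand, min=0).mean()
    return l_rec + lam * l_senti + mu * l_div                   # Eq. (4)
```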
3.2. Evaluation Perspectives

We evaluate our reproduction from five different perspectives: effectiveness, user-centric sentiment diversity, intra-list sentiment diversity, user-centric topical diversity, and intra-list topical diversity. Note that, in contrast to the intra-list diversity measures, the user-centric measures assess diversity in relation to the user's previous news consumption. We compare the results of our reproduction against all baselines and our extensions using a paired t-test with Bonferroni correction [19, 20].

Effectiveness. We evaluate effectiveness using $AUC$, $MRR$, $nDCG@5$, and $nDCG@10$.

User-Centric Sentiment Diversity. We evaluate user-centric sentiment diversity using the sentiment alignment metrics $S_{MRR}$ and $S@K$ introduced by Wu et al. [3], which are defined as follows:

$$S_{MRR} = \max\left(0, \bar{s} \sum_{i=1}^{C} \frac{s^c_i}{i}\right), \qquad S@K = \max\left(0, \bar{s} \sum_{i=1}^{K} s^c_i\right) \quad (5)$$

where $C$ is the length of the recommendation list (i.e., the number of candidate items), $s^c_i$ is the sentiment polarity score of the news article ranked at position $i$ in this list, and $\bar{s}$ is the overall sentiment orientation of the corresponding user. Hence, the closer the top-ranked candidates' sentiment is to the user's overall sentiment orientation, the higher the sentiment alignment metrics. Thus, lower sentiment alignment scores indicate more sentiment-diverse recommendations.

Intra-List Sentiment Diversity (not included in the original paper). As the sentiment polarity score $s_i$ of a news article is a single scalar, we compute the intra-list sentiment diversity by averaging the absolute differences of the sentiment polarity scores $s_i$ and $s_j$ between each pair of news articles in the top-K list of recommended candidates:

$$ILS_S@K = \frac{2}{K(K-1)} \sum_{s_i, s_j \in C@K} |s_i - s_j| \quad (6)$$

The intra-list sentiment diversity score lies between 0 and 1, with 0 being maximally diverse.

User-Centric Topical Diversity (not included in the original paper). We consider the news articles' categories (e.g., sports) and subcategories (e.g., soccer) to compute topical diversity. We represent a (sub)category of a news article with a one-hot encoding. We compute the user's category representation $c_u$ by summing up the category representations of all browsed news. Similarly, we compute the recommendation list's category representation $c_{C@K}$ by summing up the category representations of the recommended top-K candidate news articles. We then measure diversity $T@K$ by taking the cosine similarity between $c_u$ and $c_{C@K}$. This yields a measure between 0 and 1, with 0 being maximally diverse. Similarly, we measure $T_{MRR}$, with the difference that we compute a weighted average of all candidates' category representations to obtain a representation $c_{MRR}$ of the recommendation list, where each candidate is weighted according to its rank (reciprocally, analogous to $S_{MRR}$):

$$T_{MRR} = cos_{sim}(c_{MRR}, c_u), \qquad T@K = cos_{sim}(c_{C@K}, c_u) \quad (7)$$

Intra-List Topical Diversity (not included in the original paper). We again represent a (sub)category of a news article with a one-hot encoding. We measure the intra-list topical diversity of the recommendation list by computing the average pairwise cosine similarity between the one-hot-encoded category representations $c$ of the recommended top-K news articles. This yields a measure between 0 and 1, with 0 being maximally diverse:

$$ILS_T@K = \frac{2}{K(K-1)} \sum_{c_i, c_j \in C@K} cos_{sim}(c_i, c_j) \quad (8)$$
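The following is a small NumPy sketch of these measures as printed in Equations (5)-(8), together with the significance test described above; the array layouts, function names, and Bonferroni handling are our own choices.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def s_mrr(s_user, s_cand):
    # Eq. (5), left: rank-discounted sentiment alignment over the whole list.
    ranks = np.arange(1, len(s_cand) + 1)
    return max(0.0, s_user * np.sum(np.asarray(s_cand) / ranks))

def s_at_k(s_user, s_cand, k):
    # Eq. (5), right: sentiment alignment of the top-K candidates.
    return max(0.0, s_user * np.sum(s_cand[:k]))

def ils_s_at_k(s_cand, k):
    # Eq. (6): mean pairwise absolute sentiment difference in the top-K list.
    return np.mean([abs(a - b) for a, b in combinations(s_cand[:k], 2)])

def t_at_k(c_user, c_cand, k):
    # Eq. (7), right: cosine between user and summed top-K category vectors.
    return cos_sim(np.sum(c_cand[:k], axis=0), c_user)

def ils_t_at_k(c_cand, k):
    # Eq. (8): mean pairwise cosine similarity of one-hot category vectors.
    return np.mean([cos_sim(c_cand[i], c_cand[j])
                    for i, j in combinations(range(k), 2)])

def significantly_different(scores_a, scores_b, n_tests, alpha=0.05):
    # Paired t-test over per-impression scores [19, 20]; Bonferroni
    # correction applied by testing against alpha / n_tests.
    _, p = stats.ttest_rel(scores_a, scores_b)
    return p < alpha / n_tests
```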
4. Experimental Setting

Dataset. The dataset of the original paper was constructed from MSN News³ logs collected from October 31, 2018, to January 29, 2019, but has not been open-sourced, and our access request has not been answered yet. Thus, we use the MIND [2] dataset, specifically the MIND-small⁴ version, in our experiments, as it stems from the same source. It was randomly sampled from 50K users (with at least five clicks) over six weeks, from October 12 to November 22, 2019, where the first five weeks are used for training and the last week for testing. One sample is composed of a timestamp, the user ID, a list of chronologically ordered news IDs representing the user's click history, and a list of shuffled candidate news IDs with corresponding labels (i.e., 1 for clicked and 0 for seen but not clicked); a parsing sketch follows at the end of this section. Detailed statistics of the datasets are summarized in Table 1. MIND-small has five times more users, with about two times fewer impressions and, on average, about seven times fewer positive interactions per user (seven clicks vs. 49) than the SentiRec dataset.

Table 1
SentiRec dataset (as reported) and MIND-small dataset statistics.

Dataset      #Users   #News    #Impressions   #Clicks   #Non-Clicks
SentiRec     10,000   42,255   445,230        489,644   6,651,940
MIND-small   50,000   65,238   230,117        347,727   8,236,715

Training. All models are trained on 90% of the training data. The remaining 10% is used to tune the hyperparameters by optimizing AUC. We use early stopping with a minimum delta of 0.0001 AUC and a patience of 5. Note that we use 300-dimensional GloVe embeddings [21] in all models to initialize the word embedding layer and the NLTK [22] word tokenizer for tokenization. Further, we limit the number of browsed news in each impression to 50 and the title length to 20 terms (shorter sequences are zero-padded).

Parameter Settings. We set the negative sampling ratio K to 4. We apply 20% dropout to the word embeddings. We use multi-head self-attention with 15 attention heads followed by an additive-attention layer with a 200-dimensional query vector. We use the ADAM [23] optimizer with a learning rate of 0.0001 and a batch size of 128. For the VADER-SA-based model ($SentiRec_V$) we set $\lambda = 0.4$ and $\mu = 10$, and for the BERT-SA-based model ($SentiRec_B$) we set $\lambda = 0.4$ and $\mu = 1$.

Baselines. We compare the reproduced and adapted models against the following baselines suggested by the dataset providers [2]:

LSTUR [11] (not included in the original paper) – Neural news recommender capturing users' long- and short-term interests. We initialize the GRU network with the user embedding. We set the masking probability of the users' long-term interests to 50%. We apply 20% dropout to the word embeddings. The negative sampling ratio K is set to 4. For the CNN, we set the number of filters to 300 and the window size to 3. We use a 200-dimensional query vector for the additive-attention layer. We use the ADAM [23] optimizer with a learning rate of 0.0001 and a batch size of 256.

NAML [10] (not included in the original paper) & $NAML_T$ (adaptation of NAML as in the original paper) – Neural news recommender incorporating multiple views (i.e., title, category, and abstract) into the news representation. We limit the abstract length to 50 terms. We apply 20% dropout to the word embeddings. We set the category embedding dimension to 100. The number of CNN filters is set to 400 and the window size to 3. We use 200-dimensional query vectors in the additive-attention layers. The negative sampling ratio K is set to 4. We use the ADAM [23] optimizer with a learning rate of 0.0001 and a batch size of 256. We also trained $NAML_T$, a "title-only" version as used in the original paper [3]. It uses the same parameters as NAML, without the category embeddings.

NRMS [12] – Neural news recommender which utilizes multi-head self-attention within both the news encoder and the user encoder. We use multi-head self-attention with 15 attention heads followed by an additive-attention layer with a 200-dimensional query vector. We apply 20% dropout to the word embeddings. We set the negative sampling ratio K to 4. We use the ADAM [23] optimizer with a learning rate of 0.0001 and a batch size of 128.

³ https://www.msn.com/en-us/news
⁴ https://msnews.github.io/index.html
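As referenced in the dataset description, the sketch below reads one impression from MIND's behaviors log and draws the K = 4 negatives per clicked news used in Equation (3). The field layout follows the public MIND format documentation; the helper names are ours.

```python
import random

def parse_behaviors_line(line: str):
    # MIND behaviors.tsv fields: impression id, user id, timestamp,
    # space-separated click history, and candidates annotated as
    # "<news_id>-<label>" with label 1 = clicked, 0 = seen but not clicked.
    imp_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked_history = history.split() if history else []
    candidates = [c.rsplit("-", 1) for c in impressions.split()]
    positives = [nid for nid, label in candidates if label == "1"]
    negatives = [nid for nid, label in candidates if label == "0"]
    return user_id, clicked_history, positives, negatives

def training_samples(positives, negatives, k=4):
    # One training sample per clicked news: the click plus K non-clicked news
    # from the same impression (sampled with replacement if fewer than K exist).
    for pos in positives:
        if len(negatives) >= k:
            yield pos, random.sample(negatives, k)
        else:
            yield pos, random.choices(negatives, k=k)
```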
5. Results and Analysis

In this section, we present and analyze our results and answer our previously stated research questions. We investigate whether the reproduced models perform as described in the original paper and study:

RQ1 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning effectiveness?

We compare the recommendation performance (i.e., $AUC$, $MRR$, $nDCG@5$, and $nDCG@10$) of the reproduced model ($SentiRec_V$) against the baselines ($LSTUR$ [11], $NAML$ & $NAML_T$ [10], $NRMS$ [12], and $Random$), summarized in rows 1-6 of Table 2. In contrast to the original work, our SentiRec reproduction does not significantly outperform all baselines concerning recommendation effectiveness. Moreover, it performs similarly to the closely related $NRMS$ baseline. Furthermore, utilizing a pre-trained neural sentiment analyzer instead of the rule-based one does not yield performance gains (compare rows 6 and 7 in Table 2).

Table 2
Comparing effectiveness (i.e., AUC, MRR, nDCG@5, and nDCG@10). Higher effectiveness scores indicate better performance. Subscripts V (VADER-SA) and B (BERT-SA) indicate the used sentiment analyzer. Note, † indicates a statistically significant difference to $SentiRec_V$ at alpha 0.05.

  Model        AUC     MRR     nDCG@5  nDCG@10
1 Random       .4994†  .2190†  .2236†  .2863†
2 NAML_T       .6194   .2982   .3190   .3804
3 NAML         .6206   .2913†  .3185   .3782†
4 LSTUR        .6210†  .2840†  .3101†  .3721†
5 NRMS         .6228   .2946   .3191   .3817
6 SentiRec_V   .6224   .2952   .3211   .3818
7 SentiRec_B   .6219   .2942   .3203   .3820

RQ2 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning user-centric sentiment diversity?

We investigate sentiment diversity by comparing the sentiment alignment scores (i.e., $S_{MRR}$, $S@5$, and $S@10$ – lower scores indicate higher sentiment diversity) of our reproduced model, i.e., $SentiRec_V$, and the baselines (see rows 1-6 in Table 3). In the original work [3], SentiRec outperforms all baselines in sentiment diversity – even the Random model – while maintaining the highest recommendation performance scores. We cannot confirm these findings. Moreover, our results suggest that the baselines already perform well in all aspects, i.e., recommendation performance and sentiment diversity. In particular, we do not observe margins in sentiment diversity as large as in the original paper.

Table 3
Comparing user-centric sentiment and topic alignment (i.e., $S_{MRR}$, $S@5$, $S@10$, $T_{MRR}$, $T@5$, $T@10$). Lower alignment scores indicate better diversity. Subscripts V (VADER-SA) and B (BERT-SA) indicate the used sentiment analyzer. Note, † indicates a statistically significant difference to $SentiRec_V$ at alpha 0.05.

               VADER-SA Labels           BERT-SA Labels
  Model        S_MRR   S@5     S@10     S_MRR   S@5     S@10     T_MRR   T@5     T@10
1 Random       .0086†  .0150†  .0188†   .1095†  .1748†  .2638†   .4315†  .3680†  .4428†
2 NAML_T       .0157†  .0276†  .0382    .1741†  .2623†  .3933†   .5091†  .4570†  .5047†
3 NAML         .0131†  .0210†  .0248†   .1132†  .1749†  .2936†   .4504†  .3744†  .4270†
4 LSTUR        .0158†  .0281†  .0412†   .1655†  .2637†  .4297†   .4735†  .4220†  .4867†
5 NRMS         .0149†  .0282   .0390    .1317†  .2317†  .3869†   .4883   .4353   .4926†
6 SentiRec_V   .0161   .0284   .0386    .1300   .2153   .3651    .4872   .4328   .4891
7 SentiRec_B   .0174†  .0325†  .0449†   .1560†  .2675†  .4330†   .4905†  .4414†  .4942†

While the original paper studies sentiment diversity with a user-centric focus, it is also essential to investigate sentiment diversity within a recommended list of news articles; thus, we ask:

RQ3 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning intra-list sentiment diversity?

We compute the intra-list sentiment similarity at cutoff K, i.e., $ILS_S@K$, by considering the pairwise differences of news articles within a top-K recommendation list. Table 4 (rows 1-7) summarizes our outcomes. A lower intra-list similarity score indicates better diversity.

Table 4
Comparing sentiment- and topic-based intra-list similarity (i.e., $ILS_S@5$, $ILS_S@10$, $ILS_T@5$, $ILS_T@10$). Lower intra-list similarity scores indicate better diversity. Subscripts V (VADER-SA) and B (BERT-SA) indicate the used sentiment analyzer. Note, † indicates a statistically significant difference to $SentiRec_V$ at alpha 0.05.
               VADER-SA Labels      BERT-SA Labels
  Model        ILS_S@5  ILS_S@10    ILS_S@5  ILS_S@10    ILS_T@5  ILS_T@10
1 Random       .2393†   .2394†      .5047†   .5045†      .0774†   .0775†
2 NAML_T       .2336†   .2377†      .4770†   .4863†      .1396†   .1089†
3 NAML         .2600†   .2480†      .5221†   .5049†      .3377†   .1886†
4 LSTUR        .2313    .2347       .4826†   .4826       .1223†   .1026
5 NRMS         .2376†   .2393†      .4700    .4819       .1290    .1016
6 SentiRec_V   .2310    .2337       .4682    .4812       .1289    .1013
7 SentiRec_B   .2423†   .2404†      .4444†   .4648†      .1429†   .1063†

In contrast to our user-centric diversity findings, where the baselines already exhibit decent performance, we observe that our reproduced model, i.e., $SentiRec_V$, significantly outperforms most baselines concerning intra-list sentiment diversity. In comparison, the $NAML$ baseline shows the worst performance, suggesting that additional modalities might foster user-centric sentiment diversity (see Table 3) but hurt intra-list sentiment diversity by yielding top-K news articles with rather high sentiment similarity.

Effectiveness and sentiment diversity are the obvious perspectives for evaluating SentiRec; in addition to those, we also focus on topical diversity and investigate:

RQ4 How does our reproduced SentiRec implementation compare to the MIND [2] baselines concerning user-centric and intra-list topical diversity?

We adapt the user-centric sentiment alignment metrics and introduce user-centric topical alignment metrics, i.e., $T_{MRR}$ and $T@K$, by considering the categorical membership of the news articles. Lower $T_{MRR}$ / $T@K$ values indicate higher diversity. The last three columns of Table 3 summarize our analysis. The $Random$ model recommends the news articles that are most topically diverse relative to the users' previously browsed news, except when the top 10 recommendations are considered, where the $NAML$ model excels. The $NAML$ and $LSTUR$ baselines reach significantly better user-centric topical diversity than our reproduced $SentiRec$ models while maintaining reasonable recommendation performance, demonstrating the competitiveness of the baseline models. If we consider intra-list topical diversity $ILS_T@K$ (see Table 4, last two columns), which is defined by the pairwise categorical differences within the recommendation list, the $Random$ model recommends the most diverse news articles. Our reproduction, $SentiRec_V$, outperforms the $NAML$ models and is on par with the $LSTUR$ and $NRMS$ baselines.

6. Discussion

Overall, we cannot confirm the findings of the original work, where SentiRec outperformed all baselines in effectiveness and user-centric sentiment diversity. We argue that the effectiveness and diversity discrepancies between the original SentiRec and our reproduction are due to dataset differences, highlighting the shortcomings of $SentiRec$ concerning generalizability. Our dataset contains five times more users and about 23K more news articles than that of the original paper; however, it contains relatively little positive feedback (i.e., clicks) and spans only six weeks (compared to the roughly 13-week collection period reported for the original data). Thus, the dataset we use inherently contains more diverse behavior than the one in the original paper. One might argue that the sentiment diversity issue in our sample is not as prevalent as in the sample of the original work. However, we demonstrate that the $NAML$ baseline significantly outperforms our reproduction and gets close to the $Random$ model's performance. This highlights that there is room for improvement which is not exploited by $SentiRec$'s diversification approach.
As mentioned, the $NAML$ [10] model outperforms all other models (except the $Random$ model) regarding user-centric sentiment diversity while maintaining recommendation performance comparable to our $SentiRec$ reproductions. Besides the title of a news article, it also considers the category, subcategory, and abstract. Thus, we reason that considering different modalities supports the diversification task. Note that in the original paper $NAML$ is fed with only one modality (i.e., the title) – in this work denoted as $NAML_T$.

Besides the user-centric view of sentiment diversity, we also analyze a more generic perspective, i.e., intra-list sentiment diversity. We demonstrate that our reproduction achieves an outstanding intra-list sentiment diversity, although it is optimized for user-centric sentiment diversity. Setting both perspectives side by side raises the following question, which we will tackle in future work: Which view of sentiment diversity should we optimize while maintaining user satisfaction? Optimizing for the user-centric perspective is more conservative: it ranks news articles with a sentiment orthogonal to the overall sentiment of the user's news consumption higher. Such an approach has strong nudging power but might reduce user satisfaction by recommending more of the "unusual". On the other hand, optimizing for the intra-list perspective is more relaxed, suggesting news articles with different sentiments. However, it bears the risk that users might still follow their previous behavior and consume, for example, only negative news articles.

Our final evaluation perspective, which the original work does not consider, is topical diversity. In particular, we consider the categorical differences between the recommended news articles and the user's browsed news, i.e., user-centric topical diversity, and the categorical differences within the news articles of the recommendation list, i.e., intra-list topical diversity. In both measures, the $Random$ model achieves the most topically diverse recommendations. Setting aside the $Random$ model: while in the user-centric perspective our reproduction $SentiRec_V$ is outperformed by most baselines, in the intra-list perspective it is on par with or better than the baselines. Since sentiment distributions differ across news categories, we plan to analyze in future work whether topical diversification already yields sentiment diversification and higher user satisfaction.

7. Conclusion

This work aims to reproduce SentiRec [3] – a sentiment diversity-aware neural news recommendation model – without having access to the original source code and dataset. We re-implement SentiRec from scratch and make it publicly available. We use the MIND [2] dataset, which has the same source as the original paper, albeit a different time period. Overall, we cannot confirm the significant findings of the SentiRec paper. The reproduced model does not outperform the random model in (user-centric) sentiment diversity while maintaining the best recommendation performance compared to the baselines, as in the original work. Moreover, our results suggest that the baselines already perform well. In particular, the NAML [10] model delivers the most sentiment-diverse recommendations (w.r.t. the users' overall consumption behavior) apart from the random model, while holding a recommendation performance comparable to all other baselines. We conclude that these discrepancies are due to dataset differences, highlighting the shortcomings of SentiRec concerning generalizability.
In addition to the original paper, we also consider the topical diversity of the recommended list compared to the user's previous history. As before, we show that the baselines, particularly $NAML$, yield significantly better topical diversity than our reproduced $SentiRec$ model. In addition to the rule-based sentiment analyzer used by Wu et al. [3], we conducted our experiments with a pre-trained neural sentiment analyzer to study whether a neural model leads to better sentiment labels and thus to improved overall training performance. However, we do not observe improvements in recommendation performance or sentiment diversity. While the original paper only focuses on sentiment diversity by comparing the user's overall history with the recommendation list (i.e., user-centric diversity), we also investigate the sentiment and topical diversity between news articles within the recommendation list (intra-list diversity). In contrast to the user-centric evaluation, the intra-list evaluation shows that our $SentiRec$ reproduction significantly outperforms most baselines, while the strong $NAML$ baseline performs poorly. We discuss our different evaluation perspectives (i.e., user-centric/intra-list sentiment and topical diversity) and plan to conduct offline and online experiments to compare and combine them in future work. Furthermore, we plan to include other auxiliary information in the end-to-end recommendation model, such as emotion awareness and diversity. Ultimately, we want to create recommendation models that optimize for a broad range of goals and benefit society through more responsible recommendations.

Acknowledgments

This research is supported by the Christian Doppler Research Association (CDG) and has received funding from the EU's H2020 research and innovation program (Grant No. 822670).

References

[1] F. Ricci, L. Rokach, B. Shapira, Recommender Systems: Introduction and Challenges, Springer US, Boston, MA, 2015, pp. 1-34. URL: https://doi.org/10.1007/978-1-4899-7637-6_1. doi:10.1007/978-1-4899-7637-6_1.

[2] F. Wu, Y. Qiao, J.-H. Chen, C. Wu, T. Qi, J. Lian, D. Liu, X. Xie, J. Gao, W. Wu, M. Zhou, MIND: A large-scale dataset for news recommendation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 3597-3606. URL: https://www.aclweb.org/anthology/2020.acl-main.331. doi:10.18653/v1/2020.acl-main.331.

[3] C. Wu, F. Wu, T. Qi, Y. Huang, SentiRec: Sentiment diversity-aware neural news recommendation, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China, 2020, pp. 44-53. URL: https://www.aclweb.org/anthology/2020.aacl-main.6.

[4] C. Hutto, E. Gilbert, VADER: A parsimonious rule-based model for sentiment analysis of social media text, Proceedings of the International AAAI Conference on Web and Social Media 8 (2014). URL: https://ojs.aaai.org/index.php/ICWSM/article/view/14550.

[5] D. Jannach, M. Zanker, A. Felfernig, G. Friedrich, Online consumer decision making, Cambridge University Press, 2010, pp. 234-252. doi:10.1017/CBO9780511763113.012.

[6] R. El Baff, H. Wachsmuth, K. Al Khatib, B. Stein, Analyzing the persuasive effect of style in news editorial argumentation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 3154-3160. URL: https://www.aclweb.org/anthology/2020.acl-main.287. doi:10.18653/v1/2020.acl-main.287.

[7] M. Sertkan, J. Neidhardt, H. Werthner, Documents, topics, and authors: Text mining of online news, in: 2019 IEEE 21st Conference on Business Informatics (CBI), volume 01, 2019, pp. 405-413. doi:10.1109/CBI.2019.00053.

[8] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: A survey and new perspectives, ACM Comput. Surv. 52 (2019). URL: https://doi.org/10.1145/3285029. doi:10.1145/3285029.

[9] Y. Deldjoo, M. Schedl, P. Cremonesi, G. Pasi, Recommender systems leveraging multimedia content, ACM Comput. Surv. 53 (2020). URL: https://doi.org/10.1145/3407190. doi:10.1145/3407190.

[10] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, Neural news recommendation with attentive multi-view learning, arXiv preprint arXiv:1907.05576 (2019).

[11] M. An, F. Wu, C. Wu, K. Zhang, Z. Liu, X. Xie, Neural news recommendation with long- and short-term user representations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 336-345. URL: https://www.aclweb.org/anthology/P19-1033. doi:10.18653/v1/P19-1033.

[12] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, X. Xie, Neural news recommendation with multi-head self-attention, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6389-6394. URL: https://www.aclweb.org/anthology/D19-1671. doi:10.18653/v1/D19-1671.

[13] H. Wang, Y. Fu, Q. Wang, H. Yin, C. Du, H. Xiong, A location-sentiment-aware recommender system for both home-town and out-of-town users, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 1135-1143. URL: https://doi.org/10.1145/3097983.3098122. doi:10.1145/3097983.3098122.

[14] P. Padia, K. H. Lim, J. Cha, A. Harwood, Sentiment-aware and personalized tour recommendation, in: 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 900-909. doi:10.1109/BigData47090.2019.9006442.

[15] C. Orellana-Rodriguez, E. Diaz-Aviles, W. Nejdl, Mining affective context in short films for emotion-aware recommendation, in: Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 185-194. URL: https://doi.org/10.1145/2700171.2791042. doi:10.1145/2700171.2791042.

[16] C. Musto, G. Rossiello, M. de Gemmis, P. Lops, G. Semeraro, Combining text summarization and aspect-based sentiment analysis of users' reviews to justify recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 383-387. URL: https://doi.org/10.1145/3298689.3347024. doi:10.1145/3298689.3347024.

[17] D. Hyun, C. Park, M.-C. Yang, I. Song, J.-T. Lee, H. Yu, Review sentiment-guided scalable deep recommender system, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 965-968. URL: https://doi.org/10.1145/3209978.3210111. doi:10.1145/3209978.3210111.

[18] A. Da'u, N. Salim, Sentiment-aware deep recommender system with neural attention networks, IEEE Access 7 (2019) 45472-45484. doi:10.1109/ACCESS.2019.2907729.

[19] J. Urbano, H. Lima, A. Hanjalic, Statistical significance testing in information retrieval: An empirical analysis of type I, type II and type III errors, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 505-514. URL: https://doi.org/10.1145/3331184.3331259. doi:10.1145/3331184.3331259.

[20] M. D. Smucker, J. Allan, B. Carterette, A comparison of statistical significance tests for information retrieval evaluation, in: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 623-632. URL: https://doi.org/10.1145/1321440.1321528. doi:10.1145/1321440.1321528.

[21] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543. URL: http://www.aclweb.org/anthology/D14-1162.

[22] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc., 2009.

[23] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).