=Paper=
{{Paper
|id=Vol-3143/paper1
|storemode=property
|title=Predicting Feature-based Similarity in the News Domain Using Human Judgments
|pdfUrl=https://ceur-ws.org/Vol-3143/paper1.pdf
|volume=Vol-3143
|authors=Alain D. Starke,Sebastian Øverhaug,Christoph Trattner
|dblpUrl=https://dblp.org/rec/conf/recsys/StarkeOT21
}}
==Predicting Feature-based Similarity in the News Domain Using Human Judgments==
Alain D. Starke 1,2, Sebastian Øverhaug 2 and Christoph Trattner 2

1 Wageningen University & Research, Droevendaalsesteeg 4, 6708 PB Wageningen, The Netherlands
2 University of Bergen, P.O. Box 7800, 5020 Bergen, Norway

Abstract: When reading an online news article, users are typically presented with 'more like this' recommendations by news websites. In this study, we assessed different similarity functions for news item retrieval by comparing them to human judgments of similarity. We asked 401 participants to assess the overall similarity of ten pairs of political news articles, which were compared to feature-specific similarity functions (e.g., based on body text or images). We found that users reported relying mostly on text-based features (e.g., title) for their similarity judgments, with body text similarity being most representative of their judgments. Moreover, we modeled similarity judgments using different regression techniques. Using data from another study, we contrasted our results across retrieval domains, revealing that similarity functions in news are less representative of user judgments than those in movies and recipes.

Keywords: news, similarity, similar-item retrieval, recommender systems, user study, human judgment

INRA'21: 9th International Workshop on News Recommendation and Analytics, September 25, 2021, Amsterdam, Netherlands
Contact: alain.starke@wur.nl (A. D. Starke); overhaug15@gmail.com (S. Øverhaug); christoph.trattner@uib.no (C. Trattner)
Web: https://github.com/Overhaug/HuJuRecSys (S. Øverhaug); http://christophtrattner.com/ (C. Trattner)
ORCID: 0000-0002-9873-8016 (A. D. Starke); 0000-0002-1193-0508 (C. Trattner)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Similarity functions are central to recommender systems and information retrieval systems [1]. They assess the similarity between a reference article and a set of possible recommendations [2]. Using a dataset of political news articles, this paper employs a semantic similarity approach to assess the utility of different feature-based similarity functions in the news domain, grounding them in human judgments of similarity.

1.1. Problem Outline

News retrieval faces several domain-specific challenges. Compared to leisure domains (e.g., movies), news articles are volatile, in the sense that they become obsolete quickly or may be updated later [3]. Consequently, user preferences may strongly depend on contextual factors, such as a user's time of day or location [4, 5].

News websites typically present content-based recommendations [1]. A common setup is to present a list of articles that are similar to the story the user is currently reading, such as depicted in Figure 1.

Figure 1: Different features in a news article (section, title, main image, author, date of publication, lead paragraph, body text), which may be used by a news recommender system to recommend items to a user.

These recommendations are often labeled 'More on this Story' (e.g., at BBC News), showcasing articles that are similar in terms of their publication time or specific keywords. Whether two news articles are alike can be computed using similarity functions [1, 5].
Features (e.g., title) considered by such functions should, to a large extent, reflect a user's similarity assessment [6], while not being too similar to what a user is currently reading, as excessive similarity may lead to redundancy [2]. However, research on feature-based similarity is limited and rather domain-dependent. For example, users browsing recipe websites tend to use titles and header photos to assess similarity between recipes, while users of movie recommenders use plot descriptions and genre [7]. As a result, there is no consensus on which news article features best represent a user's similarity judgment. This may be problematic, as similarity functions in recommender systems may be more effective if they reflect user perceptions.

Hence, the current study assesses a set of similarity functions for news article retrieval, particularly for the task of similar-item recommendation. We ask users of an online news system to judge the similarity between pairs of news articles, which we use to develop a model to predict news similarity. Subsequently, we perform cross-domain comparisons, examining which features are used for human similarity judgments in news, movies, and recipes, using data from [7]. We posit the following research questions:

• RQ1: Which news article features are used by humans to judge similarity, and to what extent are different feature-specific similarity functions related to human similarity judgments?
• RQ2: Which combination of news article features is best suited to predict user similarity judgments?
• RQ3: How does the use of news features and their similarity functions compare to those used in the recipe and movie domains?

1.2. Contributions

This paper makes the following contributions:

• We advance the understanding of how readers perceive similarity between news articles, in terms of (i) which article cues or features are reported as important, (ii) how features correlate with similarity ratings provided by users, and (iii) how user-reported feature importance is not always consistent with the computed correlations.
• We show which news information features can predict a user's similarity judgment.
• We juxtapose our news study with findings from the movie and recipe domains, using data from [7], showing that feature-specific similarity functions in the news domain are less representative of human judgment than functions in the movie and recipe domains.
• We present a reproducible data processing pipeline, available on GitHub (https://github.com/Overhaug/HuJuRecSys), and add a benchmarking dataset for the publicly available Washington Post Corpus news article database.

2. Related Work

We highlight work from the domains of similar-item retrieval and semantic similarity to craft similarity functions. Moreover, we discuss specific challenges in news recommendation, and explain how similarity functions are assessed by using human similarity judgments as ground truth.

2.1. Similar Item Retrieval

Similar-item retrieval seeks to identify unseen or novel items that are similar to what a user has elicited preferences for [1]. In the recommender domain, this is referred to as a similar-item recommendation problem. A fundamental question is how to compute similarity between concepts [8, 9], which is examined in studies on semantic similarity [10], a field of research that usually not only captures the similarity between two concepts, but also how different they are [11].
Such measures can be based on ontological relations grounded in human knowledge, or on co-occurrence metrics that stem from a hierarchical or annotated corpus of words [2, 12]. For example, latent semantic analysis derives meaning and similarity from the text context itself, by examining how and how often words are used [2].

A traditional method is to compute similarity between items by deriving vectors from text. Although Term Frequency-Inverse Document Frequency (TF-IDF) has been outperformed by other metrics, such as BM25 [13], it remains one of the most commonly used IR methods to create similarity vectors [14]. It uses the term frequency per document and the inverse appearance frequency across all documents [15], while similarity between the vectors of liked and unseen items can be computed using cosine similarity [16].

A much simpler approach is to derive a set of keywords from each item [15]. For example, a book recommender could compute the similarity between $book_1 = \{fantasy, epic, bloody\}$ and $book_2 = \{fantasy, young, dragons\}$ through the Jaccard coefficient:

$$J(A, B) = \frac{\lvert book_1 \cap book_2 \rvert}{\lvert book_1 \cup book_2 \rvert}$$

There are various other similarity metrics available, such as the Levenshtein distance (i.e., "edit distance") and LDA (Latent Dirichlet Allocation).
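To make these two baseline measures concrete, here is a minimal sketch in Python; scikit-learn is our choice for illustration (not necessarily what any particular system uses), and the keyword sets and article texts are invented examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Keyword-based similarity, as in the book example above.
book1 = {"fantasy", "epic", "bloody"}
book2 = {"fantasy", "young", "dragons"}
print(jaccard(book1, book2))  # 1 shared keyword / 5 total = 0.2

# TF-IDF vectors compared with cosine similarity (hypothetical articles).
articles = [
    "Senate passes budget bill after lengthy debate",
    "Budget bill clears Senate following late-night debate",
]
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(articles)
print(cosine_similarity(vectors[0], vectors[1])[0, 0])  # high for near-duplicates
```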
2.2. Similarity Representations in the News Domain

News recommender systems primarily focus on textual representations of news articles [1]. Most approaches utilize the main text or title, ignoring most other textual features, such as the author [14]. A straightforward, but more uncommon approach in academic studies [17], is to retrieve articles based on date-time, such as those published on the same day as the article currently being inspected. Other approaches include the use of (sub)categories, while image-based similarity is more common in other domains [18], such as food [7].

2.2.1. Text-based Approaches

Most similarity functions relevant to news retrieval are text-based. TF-IDF is traditionally combined with cosine similarity and used as a news recommendation benchmark [19]. In some cases, its effectiveness can be improved by constraining it to a maximum number of words [20]. TF-IDF can also be combined with a k-nearest-neighbor algorithm to recommend news articles that match short-term interests [21].

Besides the aforementioned methods, a common approach is to derive latent topics from texts. Although recent work uses Word2Vec and BERT [22, 23], this work considers Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Indexing (PLSI) [24]. LDA and PLSI can cluster topically similar news articles based on tags and named entities; news recommendations can then be refined based on recency scores.

A final interesting text-based method builds on sentiment analysis, which mines a text's opinions in terms of the underlying attitude, judgments, and beliefs. It has been suggested that negativity in news has a large impact, triggering more vivid recall of news story details among users [25].

2.2.2. Other News Features

A news article's date-time feature is also leveraged in the context of similar-item news recommendation, either through pre-filtering, recency modeling, or post-filtering [1]. Pre-filtering involves omitting outdated news articles before computation starts, while the more uncommon post-filtering removes all non-recent articles from a Top-N set. Recency modeling, the most common approach, incorporates recency as one of the factors in an algorithm's similarity computation (e.g., by giving it a higher weight). Pon et al. [26] describe an approach that targets users with multiple interests, by considering recency in conjunction with a 'multiple topic tracking' technique.

2.3. Assessing Similarity Functions Using Human Judgments

Similar-item retrieval approaches, as also used in similar-item recommender systems, are typically validated using human judgments [12]. An important question is to what extent similarity functions reflect a user's similarity assessment of item pairs: problems arise if a user ignores or overvalues certain item features, compared to what is being computed [9]. This has been studied in the movie and recipe domains: Trattner and Jannach [7] contrast user similarity assessments with a set of similarity functions, pointing out that specific features (e.g., a recipe's title or a movie's genre) strongly correlate with user similarity judgments. In a similar vein, Yao and Harper [27] assess to what extent different algorithms for related-item recommendations in music are consistent with user similarity judgments.

However, assessing similarity between news articles might be harder than between movies. Whereas similarity between movie pairs is usually attributed to the annotated metadata (e.g., genre), two news articles could be similar because they are recent, address a common topic, or because a person appears in both stories. Although a few studies let humans assess the overall similarity between news headlines [2, 28], none have done so across multiple features. For example, users in the work of Tintarev and Masthoff [2] successfully judged the similarity between news articles, but only based on their headlines.

2.4. Key Differences with Previous Work

Novel to our approach is the use of feature-specific similarity representations and functions in news, as well as grounding them in human similarity judgments. Most relevant to our approach are the works of Trattner and Jannach [7], and Yao and Harper [27], for they explore how computational functions for similarity compare to users' perception of similarity. In particular, Trattner and Jannach [7] serve as an example for our approach, for they also present an online study on similarity perceptions. However, these studies concerned retrieval in music, movies, and recipes. Since the merit of feature-specific similarity functions from other domains is unknown for news, the goal of the current study is to assess their performance in news.

3. Method

We assess the utility of different feature-specific similarity functions by collecting human judgments of similarity for pairs of news articles. In this section, we describe (1) the dataset and its specific features, (2) the engineered similarity functions, and (3) the design of our user study to determine the effectiveness of these functions.

3.1. Dataset and Feature Engineering

3.1.1. News Database

We employed a publicly available news article database. We focused on a scenario of a single news source, as the use of multiple news websites could lead to 'duplicate' articles on the same news event. To ensure reproducibility, we obtained news articles from the open Washington Post Corpus [29]. The news items in the dataset comprised title, author (including a bio), date of publication, section headers, and the main body text.
In addition, we retrieved the images associated with the news articles, 655,533 in total. After removing duplicates from the original source, our remaining dataset contained 238,082 articles, originally published between January 2012 and August 2018.

For our user study, we selected news articles categorized under 'Politics', as these covered (inter)nationally relevant topics. Other categories were excluded because they focused more on local events, which may be unfamiliar to users and could thereby bias similarity estimates. We sampled a total of 2,400 'Politics' news articles, 400 from each year between 2012 and 2017; descriptive statistics are reported in Table 1.

Table 1: Descriptive statistics and contents of the dataset employed for the user study.

| Feature | Mean | Median | Min | Max |
|---|---|---|---|---|
| Number of words in title | 9.78 | 10 | 2 | 25 |
| Number of characters in title | 60.16 | 61 | 11 | 195 |
| Article image brightness | 0.37 | 0.35 | 0.04 | 0.98 |
| Article image sharpness | 0.24 | 0.2 | 0.03 | 1.27 |
| Article image contrast | 0.18 | 0.18 | 0.01 | 0.64 |
| Article image colorfulness | 0.17 | 0.16 | 0 | 0.73 |
| Article image entropy | 7.05 | 7.33 | 0.75 | 7.95 |
| Number of words in article body text | 768.44 | 637 | 6 | 10640 |
| Number of characters in article body text | 4676.99 | 3895.5 | 38 | 65641 |
| Article body text sentiment | 0.54 | 0.54 | 0.05 | 0.89 |
| Date of publication | 2015-01-04 | 2014-12-31 | 2012-01-10 | 2017-08-22 |
| Number of words in author biographies | 21.63 | 17 | 4 | 306 |
| Number of characters in author biographies | 140.32 | 115 | 33 | 1989 |
| Number of authors | 1.05 | 1 | 1 | 8 |

3.2. Modeling Similarity with Feature-Based Similarity Functions

To model the similarity between two news articles, we used twenty similarity functions and representations across seven dataset features. We designed functions in line with the field's current state of the art, exploiting specific cues that people may use to assess similarity between two items, based on findings from the movie and recipe domains [7]. Table 2 describes the developed similarity functions.

For each pair of news articles, we computed similarity scores based on seven main features: subcategory, title, the presented image, author, author bio, publication date, and body text (first 50 words and full text). For text-based features, the similarity functions were based on either word mappings or distance methods, while similarity based on subcategories and authors was computed using a Jaccard coefficient. Moreover, we computed date-time similarity (i.e., recency modeling) through a linear function of how many days apart two articles were published.

3.2.1. Title

Title-based similarity was computed using four string similarity functions and a topic-based one. The string-based functions were based on distance metrics: the Levenshtein distance (LV) [30], the Jaro-Winkler method (JW) [31], the longest common subsequence (LCS), and the bi-gram distance method (BI) [32]. Similar to Trattner and Jannach [7], Latent Dirichlet Allocation (LDA) topic modeling was set to 100 topics.
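As an illustration of the distance-based title functions, the sketch below converts a length-normalized Levenshtein edit distance into a similarity score of the form sim = 1 − dist, matching the Title:LV row of Table 2. This is a plain dynamic-programming version written for clarity; the study's pipeline may well have used a library implementation, and the lowercasing step is our assumption:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def title_sim_lv(t1: str, t2: str) -> float:
    """Title:LV-style similarity: 1 - normalized Levenshtein distance."""
    if not t1 and not t2:
        return 1.0
    dist = levenshtein(t1.lower(), t2.lower()) / max(len(t1), len(t2))
    return 1.0 - dist

# Hypothetical headline pair:
print(title_sim_lv("Senate passes budget bill", "Senate approves budget bill"))
```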
Table 2: Similarity functions employed in the current study, each comprised of a feature and a metric.

| Name | Metric | Explanation |
|---|---|---|
| Subcat:JACC | $sim(n_i, n_j) = \frac{\lvert subcat(n_i) \cap subcat(n_j)\rvert}{\lvert subcat(n_i) \cup subcat(n_j)\rvert}$ | Subcategory Jaccard-based similarity |
| Title:LV | $sim(n_i, n_j) = 1 - \lvert dist_{LV}(n_i, n_j)\rvert$ | Title Levenshtein distance-based similarity |
| Title:JW | $sim(n_i, n_j) = 1 - \lvert dist_{JW}(n_i, n_j)\rvert$ | Title Jaro-Winkler distance-based similarity |
| Title:LCS | $sim(n_i, n_j) = 1 - \lvert dist_{LCS}(n_i, n_j)\rvert$ | Title longest common subsequence distance-based similarity |
| Title:BI | $sim(n_i, n_j) = 1 - \lvert dist_{BI}(n_i, n_j)\rvert$ | Title bi-gram distance-based similarity |
| Title:LDA | $sim(n_i, n_j) = \frac{LDA(Title(n_i)) \cdot LDA(Title(n_j))}{\lVert LDA(Title(n_i))\rVert \, \lVert LDA(Title(n_j))\rVert}$ | Title LDA cosine-based similarity |
| Image:BR | $sim(n_i, n_j) = 1 - \lvert BR(n_i) - BR(n_j)\rvert$ | Image brightness distance-based similarity |
| Image:SH | $sim(n_i, n_j) = 1 - \lvert SH(n_i) - SH(n_j)\rvert$ | Image sharpness distance-based similarity |
| Image:CO | $sim(n_i, n_j) = 1 - \lvert CO(n_i) - CO(n_j)\rvert$ | Image contrast distance-based similarity |
| Image:COL | $sim(n_i, n_j) = 1 - \lvert COL(n_i) - COL(n_j)\rvert$ | Image colorfulness distance-based similarity |
| Image:EN | $sim(n_i, n_j) = 1 - \lvert EN(n_i) - EN(n_j)\rvert$ | Image entropy distance-based similarity |
| Image:EMB | $sim(n_i, n_j) = \frac{EMB(n_i) \cdot EMB(n_j)}{\lVert EMB(n_i)\rVert \, \lVert EMB(n_j)\rVert}$ | Image embedding cosine-based similarity |
| Author:JACC | $sim(n_i, n_j) = \frac{\lvert author(n_i) \cap author(n_j)\rvert}{\lvert author(n_i) \cup author(n_j)\rvert}$ | Author Jaccard-based similarity |
| Date:ND | $sim(n_i, n_j) = 1 - \lvert dist_{days}(n_i, n_j)\rvert$ | Date published distance-based similarity (unit = days) |
| BodyText:TFIDF | $sim(n_i, n_j) = \frac{TFIDF(Text(n_i)) \cdot TFIDF(Text(n_j))}{\lVert TFIDF(Text(n_i))\rVert \, \lVert TFIDF(Text(n_j))\rVert}$ | All article body text cosine-based similarity |
| BodyText:50TFIDF | $sim(n_i, n_j) = \frac{TFIDF(Text_{50}(n_i)) \cdot TFIDF(Text_{50}(n_j))}{\lVert TFIDF(Text_{50}(n_i))\rVert \, \lVert TFIDF(Text_{50}(n_j))\rVert}$ | First 50 words in article body text cosine-based similarity |
| BodyText:LDA | $sim(n_i, n_j) = \frac{LDA(Text(n_i)) \cdot LDA(Text(n_j))}{\lVert LDA(Text(n_i))\rVert \, \lVert LDA(Text(n_j))\rVert}$ | All article body text LDA cosine-based similarity |
| BodyText:Senti | $sim(n_i, n_j) = 1 - \lvert SENTI(n_i) - SENTI(n_j)\rvert$ | Article body text sentiment distance-based similarity |
| AuthorBio:TFIDF | $sim(n_i, n_j) = \frac{TFIDF(Bio(n_i)) \cdot TFIDF(Bio(n_j))}{\lVert TFIDF(Bio(n_i))\rVert \, \lVert TFIDF(Bio(n_j))\rVert}$ | Author bio cosine-based similarity |
| AuthorBio:LDA | $sim(n_i, n_j) = \frac{LDA(Bio(n_i)) \cdot LDA(Bio(n_j))}{\lVert LDA(Bio(n_i))\rVert \, \lVert LDA(Bio(n_j))\rVert}$ | Author bio LDA cosine-based similarity |

3.2.2. Image Features

In line with the current state of the art [7], we computed image-based similarity using six different functions. These were an image's brightness, sharpness (i.e., based on pixel intensity), contrast, colorfulness (i.e., based on the sRGB color space), entropy (i.e., amount of information captured per pixel), and image embeddings. Mathematical details are available in our GitHub repository.

3.2.3. Body Text

Body text similarity was computed using two string-based functions (i.e., TF-IDF variants), a topic-based function (i.e., LDA), and a sentiment-based metric (based on [25]). TF-IDF encodings were paired with cosine similarity; we distinguished between similarity based on an article's first 50 words (i.e., roughly an article's first paragraph, comparable to the average movie plot length in [7]) and similarity based on the entire body text.
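Several of the functions in Table 2 share the same 1 − |Δ| form over a scalar feature. A minimal sketch, assuming the scalar features (e.g., image brightness, sentiment) have already been extracted and scaled to [0, 1]; the normalization constant for Date:ND is our assumption, as the paper does not state one:

```python
from datetime import date

def scalar_sim(x: float, y: float) -> float:
    """sim = 1 - |x - y| for features already scaled to [0, 1],
    e.g. image brightness (Image:BR) or sentiment (BodyText:Senti)."""
    return 1.0 - abs(x - y)

def date_sim(d1: date, d2: date, max_days: int = 2191) -> float:
    """Date:ND-style similarity: linear decay with the number of days
    between publication dates. max_days (~6 years, the sampling window)
    is an assumption; the paper does not state the constant."""
    gap = abs((d1 - d2).days)
    return max(0.0, 1.0 - gap / max_days)

print(scalar_sim(0.37, 0.42))                          # brightness-based similarity
print(date_sim(date(2015, 1, 4), date(2014, 12, 31)))  # near-simultaneous articles
```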
3.3. User Study

The similarity functions in Table 2 were assessed by computing similarity scores per news article pair and comparing them to human judgments. Below, we explain our sampling strategy and how we collected human judgments of similarity.

3.3.1. Sampling News Article Pairs on Similarity

We compiled a set of news article pairs that were either strongly similar, dissimilar, or in-between. To ensure a good distribution, we employed a stratified sampling strategy in line with previous work [7]. We computed the pairwise similarity across all 2,400 news articles, averaging the similarity values of all functions in Table 2. Pairs were ordered by their similarity levels and divided into ten deciles, groups D1-D10 of equal size. We sampled a total of 6,000 news article pairs: 2,000 dissimilar pairs from decile D1, 2,000 pairs from deciles D2-D9, and 2,000 similar pairs from decile D10.
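This decile-based stratification can be expressed compactly with pandas. A sketch under the assumption that `pairs` is a DataFrame with one row per article pair and a `mean_sim` column holding the average of all Table 2 similarity scores (the column names are ours):

```python
import pandas as pd

def sample_pairs(pairs: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Stratified sample: 2,000 pairs from decile D1, 2,000 from D2-D9,
    and 2,000 from D10, based on the mean similarity of all functions."""
    pairs = pairs.copy()
    # Split all pairs into ten equal-sized similarity deciles D1..D10.
    pairs["decile"] = pd.qcut(pairs["mean_sim"], q=10,
                              labels=[f"D{i}" for i in range(1, 11)])
    dissimilar = pairs[pairs["decile"] == "D1"].sample(2000, random_state=seed)
    middle = pairs[pairs["decile"].isin([f"D{i}" for i in range(2, 10)])] \
        .sample(2000, random_state=seed)
    similar = pairs[pairs["decile"] == "D10"].sample(2000, random_state=seed)
    return pd.concat([dissimilar, middle, similar])
```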
3.3.2. Procedure and Measures

The resulting 6,000 news article pairs were used to collect human judgments of similarity. Figure 2 depicts a mock-up of the main application, showing different news article features from top to bottom (an author bio could also be inspected). Users could read the full text by clicking 'read more'.

Figure 2: Mock-up of a pair-wise similarity assessment in our web application. Users were asked to assess the similarity of two presented news articles, as well as how familiar they were with the articles and the confidence level of their judgment.

Users were presented with ten news article pairs, of which one was an attention check (for this pair, users were instructed to answer '5' on all answer scales). Much like in the study by Tintarev and Masthoff [2], users were asked to assess the similarity of each news article pair on a 5-point scale (cf. Figure 2). As an extension to other studies, users also indicated their familiarity with each article and the level of confidence in their assessment (both 5-point scales). Moreover, we asked users to what extent they employed different features in their similarity judgments (5-point scales). Finally, we inquired about a user's frequency of news consumption and their demographics.

3.3.3. Participants

Participants were recruited from Amazon MTurk. Since we used a database of news articles that concerned American politics, we only recruited U.S.-based participants, each with a HIT acceptance rate of at least 98% and at least 500 completed HITs. A total of 401 participants completed our study, with a median completion time of 6 minutes and 35 seconds; they were compensated with 0.5 USD. Only 241 participants (60.1%) passed our attention check, which was slightly higher than in [7]. This resulted in 2,169 usable similarity judgments; only 21 pairs were presented twice, to different users. The final sample (53% male) mostly consisted of the age groups 25-34 (33.2%) and 35-44 (30.3%); 66% reported visiting news websites at least once a week (24.9% did so daily), while 50 participants rarely read online news.

4. Results

For our analyses, we first examined the use of different news features, assessing different similarity functions through human judgments (RQ1). Furthermore, we predicted human similarity judgments using model-based approaches (RQ2). In addition, we compared our results for RQ1-RQ2 with the movie and recipe domains (RQ3).

4.1. News Feature Usage

We examined to what extent participants used different features to assess similarity between news articles (RQ1). Figure 3A summarizes the results for participants who passed the attention check. On average, an article's title (M=4.2) and body text (M=4.4) were considered most often, while sentiment (M=3.7) and an article's subcategory (M=3.2) saw above-average use. In contrast, author features, the publication date, and an article's image were rarely used to assess similarity. Figure 3B shows that all differences between features were significant (all: p < 0.01), based on a one-way ANOVA on feature usage and a Tukey's HSD post-hoc analysis.

With regard to RQ3, most findings were compatible with the movie and recipe domains. The use of title and body text was also observed for recipes (i.e., ingredients and directions), while plot and genre features were used in movies [7]. The genre cue in movies was also used more frequently than a news article's subcategory.

Figure 3: A: Mean reported cue usage for news articles, scaled 1-5; B: Tukey's HSD post-hoc tests (means and S.E.) that examine differences in cue usage across all pairs of information cues.

4.2. Grounding Similarity Functions in Human Similarity Judgments

4.2.1. Descriptive Statistics

To address RQ1, we compared feature-specific similarity scores of presented news article pairs to similarity ratings given by users. Figure 4 contrasts the similarity scores, averaged across all similarity functions, with the users' similarity judgments, averaged per user. There was a discrepancy between the similarity inferred by the similarity functions, which was distributed around a mean value of 0.39 (SD = 0.085), and the similarity judgments of users, which were lower (M = 0.18, SD = 0.24). This suggests that users were less likely than our similarity functions to judge two news articles as similar.

Figure 4: Frequency of similarity scores (scaled 0-1). Similarity functions depict the average score per news article pair; user judgments show the mean given similarity judgment per user.

4.2.2. Feature-specific Comparison in News

Table 3 outlines the Spearman correlations between similarity functions and the similarity judgments given by users. It differentiates between the results of our own user study (i.e., 'News Articles') and those of [7] for the movie and recipe domains, allowing for cross-domain comparisons (discussed later). We first discuss the results for the news domain, focusing on users who passed the attention check.

Table 3 shows that most correlations were modest (all ρ < 0.3), suggesting that the news similarity functions did not fully reflect a user's judgment. Among all features, we found that full body text similarity (BodyText:TFIDF) correlated most strongly with user judgments (ρ = 0.29, p < 0.001); body text is also the most commonly used feature in earlier news recommendation scenarios [1]. Although some users might have inspected only an article's first 50 words (cf. the text visible in Figure 2; on average 15% of the full body text), the BodyText:50TFIDF metric had a much lower correlation: ρ = 0.14, p < 0.001.

Among all image similarity metrics, embeddings (Image:EMB) had the highest correlation with user judgments (ρ = 0.17***), which was nonetheless modest. This function, along with BodyText:TFIDF, Author:Jacc, AuthorBio:TFIDF, and Subcat:Jacc, seemed to best represent user similarity judgments in news.
Table 3 also highlights functions that did not represent a user's similarity judgment in news, such as sentiment (BodyText:Sent): ρ = −0.02. Surprisingly, although most users reported considering titles when assessing similarity, their judgments were hardly related to any distance-based title similarity function (all ρ < 0.1). Note that Title:LDA and BodyText:LDA might have suffered from insufficient latent topic information, as their correlations were close to zero.

Finally, because similarity ratings correlated positively with familiarity scores (ρ = 0.27***), we tested whether including only judgments for familiar news article pairs (i.e., with familiarity scores of 4 or higher) affected the results in Table 3. Doing so increased correlations by 1 to 4 percentage points for most features, and most of these changes were statistically significant (e.g., BodyText:TFIDF would increase from 0.29 to 0.33).

Table 3: Spearman correlations between similarity functions and human similarity judgments, for news (current study), and recipes and movies (obtained from [7]). ρ_pass denotes correlations with users who passed the attention check, ρ_all denotes those with all users. *p < 0.05; **p < 0.01; ***p < 0.001.

| News Articles | ρ_pass | ρ_all | Recipes | ρ_pass | ρ_all | Movies | ρ_pass | ρ_all |
|---|---|---|---|---|---|---|---|---|
| Subcat:Jacc | 0.14*** | 0.11*** | | | | Genre:Jacc | 0.56*** | 0.53*** |
| Title:LV | 0.06** | 0.04* | Title:LV | 0.48*** | 0.38*** | Title:LV | 0.19*** | 0.18*** |
| Title:JW | 0.05* | 0.03 | Title:JW | 0.46*** | 0.35*** | Title:JW | 0.16*** | 0.16*** |
| Title:LCS | 0.07*** | 0.05** | Title:LCS | 0.50*** | 0.40*** | Title:LCS | 0.20*** | 0.19*** |
| Title:BI | 0.08*** | 0.07*** | Title:BI | 0.48*** | 0.38*** | Title:BI | 0.17*** | 0.17*** |
| Title:LDA | 0.02 | 0.00 | Title:LDA | 0.22*** | 0.19*** | Title:LDA | 0.01 | 0.01 |
| Image:BR | 0.10*** | 0.07*** | Image:BR | 0.18** | 0.14* | Image:BR | 0.22*** | 0.20*** |
| Image:SH | 0.06** | 0.03 | Image:SH | 0.16* | 0.11* | Image:SH | 0.10*** | 0.08*** |
| Image:CO | 0.05* | 0.05** | Image:CO | 0.29*** | 0.20*** | Image:CO | 0.03 | 0.03 |
| Image:COL | 0.05* | 0.03* | Image:COL | 0.09* | 0.07* | Image:COL | 0.15*** | 0.14*** |
| Image:EN | 0.07** | 0.05** | Image:EN | 0.34*** | 0.28*** | Image:EN | 0.15*** | 0.09*** |
| Image:EMB | 0.17*** | 0.13*** | Image:EMB | 0.44*** | 0.34*** | Image:EMB | 0.18*** | 0.16*** |
| Author:Jacc | 0.13*** | 0.10*** | | | | Dir:Jacc | 0.10*** | 0.07*** |
| Date:ND | 0.09*** | 0.08*** | | | | Date:MD | 0.37*** | 0.35*** |
| BodyText:TFIDF | 0.29*** | 0.23*** | | | | | | |
| BodyText:50TFIDF | 0.14*** | 0.12*** | Dir:TFIDF | 0.50*** | 0.40*** | Plot:TFIDF | 0.25*** | 0.20*** |
| BodyText:LDA | 0.03 | 0.01 | Dir:LDA | 0.54*** | 0.43*** | Plot:LDA | 0.37*** | 0.34*** |
| BodyText:Sent | -0.02 | -0.02 | | | | | | |
| AuthorBio:TFIDF | 0.15*** | 0.12*** | | | | | | |
| AuthorBio:LDA | 0.11*** | 0.09*** | | | | | | |

4.2.3. Cross-domain Comparison

Using data from [7], we compared the results in Table 3 across the news, recipe, and movie domains. Correlations between human judgments and similarity functions in the news domain were much weaker than in the recipe domain and, to a lesser extent, the movie domain. This applied to most features, including title, image, and body text.

Two notable differences lie in the title and image-based functions. Whereas the correlations for title features were weak in news (ρ < 0.1), the distance-based title metrics showed strong correlations with user judgments for recipes (ρ ≈ 0.5). With regard to image-specific similarity, functions in news were only weakly correlated with human judgments (ρ_max = 0.17), while they were more representative for recipes (ρ_max = 0.44) and movies (ρ_max = 0.22).
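The per-function correlations in Table 3 can be computed with a few lines of SciPy. A sketch assuming a DataFrame with one row per judged pair, a `judgment` column holding the 5-point user rating, and one column per similarity function (the column names are ours):

```python
import pandas as pd
from scipy.stats import spearmanr

def correlate_functions(df: pd.DataFrame, functions: list[str]) -> pd.DataFrame:
    """Spearman rho between each similarity function and human judgments."""
    rows = []
    for fn in functions:
        rho, p = spearmanr(df[fn], df["judgment"])
        rows.append({"function": fn, "rho": rho, "p": p})
    return pd.DataFrame(rows).sort_values("rho", ascending=False)

# e.g. correlate_functions(df, ["BodyText:TFIDF", "Image:EMB", "Title:LV"])
```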
4.3. Predicting Human Similarity Judgments

Going beyond simple correlation analyses, we also sought to predict similarity judgments from these functions using state-of-the-art machine learning methods (RQ2), as used in recommender systems research. This helped us to understand each feature's importance, beyond the feature-specific correlations in Table 3.

4.3.1. Model Evaluation and Cross-Domain Comparison

To determine model performance, we used the standard metrics Root Mean Square Error (RMSE), R², and Mean Absolute Error (MAE), with five-fold cross-validation as the evaluation protocol. Furthermore, the optimal hyper-parameters for each model were found by applying grid search on a validation set drawn from the training data.

The performance of the models on news articles is described in Table 4. In part (i), a Wilcoxon Rank-Sum test on RMSE pointed out that all models except GB performed significantly better than a random baseline (all p < 0.05). Table 4 (i) also compares our results to findings from the recipe and movie domains (RQ3), adapted from [7]. Most notably, we found that Lasso was the best-performing model for news, while Ridge outperformed the other models in the recipe and movie domains. Moreover, the news model (R² = 0.33) was less accurate than the recipe model (R² = 0.51), while its accuracy was comparable to that of the movie model (R² = 0.36). This suggests that the similarity functions adapted from [7] are less representative of user similarity judgments in the news domain.

4.3.2. Feature-specific Models and User Characteristics

To further explore RQ2, Table 4 (ii) describes the performance of feature-specific models. To compare our findings to other domains, Ridge regression was used to combine multiple similarity functions per feature, while linear regression was used for features with a single function.

Table 4: Model accuracy of different learning approaches, predicting a user's similarity judgment in the news domain. We (i) compare models averaged across all features in the news (N = 2,169), recipe (N = 1,539), and movie (N = 1,395) domains (using data from [7]), (ii) describe the accuracy of feature-specific models in news, and (iii) include user characteristics. The best-performing models per domain are denoted in bold.

(i) Model performance (all features)

| Method | News RMSE | News R² | News MAE | Recipes RMSE | Recipes R² | Recipes MAE | Movies RMSE | Movies R² | Movies MAE |
|---|---|---|---|---|---|---|---|---|---|
| All (Random Forest (RF)) | 0.9219 | 0.2982 | 0.7643 | 0.8958 | 0.4734 | 0.6787 | 0.8807 | 0.3543 | 0.7007 |
| All (Gradient Boosting (GB)) | 0.9177 | 0.3123 | 0.7520 | 0.8805 | 0.4921 | 0.6672 | 0.8844 | 0.3489 | 0.7029 |
| All (Ridge Regression) | 0.9141 | 0.3257 | 0.7459 | **0.8654** | 0.5063 | 0.6651 | **0.8745** | 0.3628 | 0.6926 |
| All (Linear Regression) | 0.9120 | 0.3289 | 0.7453 | 0.8700 | 0.5022 | 0.6668 | 0.8752 | 0.3616 | 0.6929 |
| All (Lasso Regression) | **0.9101** | 0.3339 | 0.7480 | 0.8873 | 0.3574 | 0.7286 | 0.8873 | 0.3574 | 0.7286 |
| Mean | 0.9652 | 0.0000 | 0.8122 | 1.2292 | 0.4995 | 1.0433 | 1.0942 | 0.5001 | 0.9140 |
| Random | 0.9659 | -0.0226 | 0.8125 | 1.2290 | 0.0010 | 1.0435 | 1.0948 | 0.0061 | 0.9140 |

(ii) Regression model per news article feature

| Feature (model) | RMSE | R² | MAE |
|---|---|---|---|
| Subcat (Linear) | 0.9554 | 0.1406 | 0.7943 |
| Title (Ridge) | 0.9618 | 0.0889 | 0.8071 |
| Image (Ridge) | 0.9548 | 0.1495 | 0.7913 |
| Author (Linear) | 0.9568 | 0.1333 | 0.7991 |
| Date (Linear) | 0.9616 | 0.0911 | 0.8070 |
| BodyText (Ridge) | 0.9141 | 0.3244 | 0.7514 |
| AuthorBio (Ridge) | 0.9561 | 0.1414 | 0.7991 |

(iii) All (Ridge) + additional user characteristics

| Added characteristic | RMSE | R² | MAE |
|---|---|---|---|
| News website visits | 0.9164 | 0.3207 | 0.7463 |
| Num. days reads news | 0.9186 | 0.3215 | 0.7476 |
| Gender | 0.9125 | 0.3314 | 0.7456 |
| Age | 0.9081 | 0.3435 | 0.7338 |
| All additional features | 0.9099 | 0.3412 | 0.7358 |
Although the representativeness of the different BodyText similarity functions varied (cf., Table 3), it was the best predicting feature, even outperforming the All features model. Finally, we included user characteristics and demographics in our Ridge model. We tested the impact of each additional feature separately, as well as simultaneously. Table 4 (iii) outlines that the addition of user characteristics (e.g., news consumption frequency) hardly affected the model’s predictive quality. A model that included the user’s age reported the lowest RMSE, but this decrease (from 0.9141 in (i) to 0.9081 in (iii)) was not statistically significant different according to a Wilcoxon Rank-Sum test. 5. Discussion This work contributes to the literature on similarity estimates, which is a central theme in the recommender systems literature, with a particular focus on the news domain. It is among the first to study news similarity representations in detail, making the following contributions: 1. Determining which features are considered by users when judging similarity between news articles. 2. Assessing how feature-specific similarity functions relate to similarity judgments. 3. Predicting similarity judgments of users through machine learning models. 4. Comparing our results to findings from the movie and recipe domains. We have taken a first step towards designing representative feature-specific similarity functions for news, going beyond other studies that focused on overall similarity or just a single feature [28, 2]. 5.1. Feature-specific Similarity We have assessed the value of feature-specific similarity functions in the news domain, adapted from recommender literature in the news, movie, and recipe domains [7]. We find that most feature-specific similarity functions only partially reflect a user’s similarity judgment, yielding modest correlations. To best reflect user perceptions, we suggest that content-based news recommender systems should exploit the body text, supported by image embeddings, article categories, and the author. The representativeness of body text is grounded in the reported feature use, as well as consistent with previous studies on news retrieval [1]. In contrast, although users used a news article’s title in their similarity judgments, we have found title-based similarity functions to be hardly representative for these judgments. The weak correlations could be attributed to the relatively ‘wordy’ titles of news articles (cf., Table 1), compared to the other domains in scope. At the similarity function level, it is possible that the string-based functions do not capture more subtle similarities between news articles, for example if two headlines describe an identical news event, but from a different news angle. Moreover, the insignificant correlation between Title:LDA and a user’s similarity judgment suggests that word-based similarity is unrelated to how users perceive a pair of news articles. In terms of predicting similarity judgment, we have used machine learning to determine model accuracy and feature importance, and to examine the predictive value of additional user characteristics. We find that the addition of user characteristics and demographics in our models does not significantly improve the accuracy indicators, indicating there is little variance across users. 
5. Discussion

This work contributes to the literature on similarity estimates, a central theme in recommender systems research, with a particular focus on the news domain. It is among the first to study news similarity representations in detail, making the following contributions:

1. Determining which features are considered by users when judging similarity between news articles.
2. Assessing how feature-specific similarity functions relate to similarity judgments.
3. Predicting similarity judgments of users through machine learning models.
4. Comparing our results to findings from the movie and recipe domains.

We have taken a first step towards designing representative feature-specific similarity functions for news, going beyond other studies that focused on overall similarity or just a single feature [28, 2].

5.1. Feature-specific Similarity

We have assessed the value of feature-specific similarity functions in the news domain, adapted from recommender literature in the news, movie, and recipe domains [7]. We find that most feature-specific similarity functions only partially reflect a user's similarity judgment, yielding modest correlations. To best reflect user perceptions, we suggest that content-based news recommender systems exploit the body text, supported by image embeddings, article categories, and the author. The representativeness of body text is grounded in the reported feature use, and is consistent with previous studies on news retrieval [1].

In contrast, although users reported using a news article's title in their similarity judgments, we have found title-based similarity functions to be hardly representative of these judgments. The weak correlations could be attributed to the relatively 'wordy' titles of news articles (cf. Table 1), compared to the other domains in scope. At the similarity function level, it is possible that the string-based functions do not capture more subtle similarities between news articles, for example when two headlines describe an identical news event from different angles. Moreover, the insignificant correlation between Title:LDA and a user's similarity judgment suggests that word-based similarity is unrelated to how users perceive a pair of news articles.

In terms of predicting similarity judgments, we have used machine learning to determine model accuracy and feature importance, and to examine the predictive value of additional user characteristics. We find that adding user characteristics and demographics to our models does not significantly improve the accuracy indicators, indicating that there is little variance across users. In terms of similarity modeling, these findings suggest that the main focus should be on leveraging a news article's body text, while other features should only be used once their similarity functions are better attuned to the news domain.

5.2. Cross-domain Comparisons

We have also explored cross-domain differences. In line with [7], we have found further evidence that different domains call for different similarity functions. For one, the ridge regression model for news is somewhat less accurate than those for movies and recipes, although an R² of 0.33 is reasonable. However, the MAE of 0.75 on a measure scaled from 1 to 5 suggests that there is room for improvement, which could be attributed to the low similarity scores given by users.

It seems that text-based similarity (i.e., movie plot, recipe directions, news body text) is useful in most domains in scope, given an appropriate similarity function: BodyText features are listed among the strongest correlations, as well as among the strongest predictors. In contrast, the title and image features are less representative of similarity judgments in news and movies, compared to the recipe domain. Whereas only image embeddings seem to be somewhat representative of news similarity assessments, image features are more useful in determining recipe similarity.

We have observed that the model accuracy reported in Table 4 is comparable to findings from the movie domain (cf. [7]). This is despite the differences in given similarity scores across domains (which are much lower for news; see Figure 4), and the weaker correlations reported in Table 3. All in all, the news domain seems to require similarity functions that are less 'taste-related' than movies or recipes, but further research is needed to develop more accurate ones, possibly by also drawing on psychological theories of similarity [9].

5.3. Limitations & Future Work

A notable limitation of our approach is the use of a single dataset, which only comprises political articles. The relation between similarity judgments and feature-specific similarity functions might change when additional main categories are employed. For example, 'name-dropping' sports teams in a news article title might result in a higher feature importance for titles than we observed for political articles. Furthermore, the news articles shown to users were a few years old, which might have reduced familiarity levels and, in turn, decreased similarity ratings.

Another shortcoming is that it is not entirely clear on what grounds users made their similarity judgments. We asked them a single question on similarity, while some other studies have used multiple questionnaire items [2]. However, our inquiry into participants' reported feature use (RQ1) reveals part of the underlying cognitive process, and suggests which features are good to optimize for. In fact, this is also a new finding.

For future studies, we suggest developing and assessing feature-specific similarity functions that unambiguously apply to the news domain. For example, similarity functions that leverage named entities (e.g., 'Donald Trump' or 'France') could help to manage user expectations about inter-article similarity. Furthermore, it would be most useful to test our assertions in an online study where news article recommendations are evaluated, much like the work of [7] and [27]. Above all, we would like to emphasize that the current study serves as a first step.
Based on these findings, future studies can further develop feature-specific similarity functions for the news domain, as this paper provides insight into which types of functions and features are successful, and which ones are not.

Acknowledgments

This work was supported by industry partners and the Research Council of Norway with funding to MediaFutures: Research Centre for Responsible Media Technology and Innovation, through the Centres for Research-based Innovation scheme, project number 309339.

References

[1] M. Karimi, D. Jannach, M. Jugovac, News recommender systems – survey and roads ahead, Information Processing & Management 54 (2018) 1203–1227.
[2] N. Tintarev, J. Masthoff, Similarity for news recommender systems, in: Proceedings of the AH'06 Workshop on Recommender Systems and Intelligent User Interfaces, 2006.
[3] A. S. Das, M. Datar, A. Garg, S. Rajaram, Google news personalization: scalable online collaborative filtering, in: Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 271–280.
[4] B. Fortuna, C. Fortuna, D. Mladenić, Real-time news recommender system, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2010, pp. 583–586.
[5] T. De Pessemier, C. Courtois, K. Vanhecke, K. Van Damme, L. Martens, L. De Marez, A user-centric evaluation of context-aware recommendations for a mobile news service, Multimedia Tools and Applications 75 (2016) 3323–3351.
[6] A. Elbadrawy, G. Karypis, User-specific feature-based similarity models for top-n recommendation of new items, ACM Transactions on Intelligent Systems and Technology (TIST) 6 (2015) 1–20.
[7] C. Trattner, D. Jannach, Learning to recommend similar items from human judgments, User Modeling and User-Adapted Interaction 30 (2020) 1–49.
[8] Ö. Özgöbek, J. A. Gulla, R. C. Erdur, A survey on challenges and methods in news recommendation, in: International Conference on Web Information Systems and Technologies, volume 2, SCITEPRESS, 2014, pp. 278–285.
[9] A. A. Winecoff, F. Brasoveanu, B. Casavant, P. Washabaugh, M. Graham, Users in the loop: a psychologically-informed approach to similar item retrieval, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 52–59.
[10] R. Richardson, A. Smeaton, J. Murphy, Using WordNet as a knowledge base for measuring semantic similarity between words, Technical Report Working Paper CA-1294, 1994.
[11] D. Lin, An information-theoretic definition of similarity, in: ICML, volume 98, 1998, pp. 296–304.
[12] S. A. Takale, S. S. Nandgaonkar, Measuring semantic similarity between words using web documents, International Journal of Advanced Computer Science and Applications (IJACSA) 1 (2010).
[13] Y. Lv, T. Moon, P. Kolari, Z. Zheng, X. Wang, Y. Chang, Learning to model relatedness for news recommendation, in: Proceedings of the 20th International Conference on World Wide Web, 2011, pp. 57–66.
[14] D. Billsus, M. J. Pazzani, User modeling for adaptive news access, User Modeling and User-Adapted Interaction (2000).
[15] D. Jannach, M. Zanker, A. Felfernig, G. Friedrich, Recommender systems: an introduction, Cambridge University Press, 2010.
[16] I. Cantador, P. Castells, Semantic contextualisation in a news recommender system, in: Workshop on Context-Aware Recommender Systems at RecSys 2009: ACM Conference on Recommender Systems, ACM, New York, 2009.
[17] A. Lommatzsch, B. Kille, F. Hopfgartner, L. Ramming, NewsREEL multimedia at MediaEval 2018: News recommendation with image and text content, in: CEUR Workshop Proceedings, 2018.
[18] M. Rorvig, Images of similarity: A visual exploration of optimal similarity metrics and scaling properties of TREC topic-document sets, Journal of the American Society for Information Science 50 (1999) 639–651.
[19] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, U. Kaymak, News personalization using the CF-IDF semantic recommender, in: ACM International Conference Proceeding Series, 2011.
[20] T. Bogers, A. Van Den Bosch, Comparing and evaluating information retrieval algorithms for news recommendation, in: RecSys'07: Proceedings of the 2007 ACM Conference on Recommender Systems, 2007.
[21] D. Billsus, M. J. Pazzani, A personal news agent that talks, learns and explains, in: Proceedings of the International Conference on Autonomous Agents, 1999.
[22] B. P. Chamberlain, E. Rossi, D. Shiebler, S. Sedhain, M. M. Bronstein, Tuning word2vec for large scale recommendation systems, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 732–737.
[23] J. Liu, C. Xia, X. Li, H. Yan, T. Liu, A BERT-based ensemble model for Chinese news topic prediction, in: Proceedings of the 2020 2nd International Conference on Big Data Engineering, 2020, pp. 18–23.
[24] L. Li, D. Wang, T. Li, D. Knox, B. Padmanabhan, SCENE: A scalable two-stage personalized news recommendation system, in: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference, 2011.
[25] S. Soroka, L. Young, M. Balmas, Bad news or mad news? Sentiment scoring of negativity, fear, and anger in news content, Annals of the American Academy of Political and Social Science (2015).
[26] R. K. Pon, A. F. Cardenas, D. Buttler, T. Critchlow, Tracking multiple topics for finding interesting articles, in: Proceedings of the ACM SIGKDD International Conference, 2007.
[27] Y. Yao, F. M. Harper, Judging similarity: a user-centric study of related item recommendations, in: Proceedings of the 12th ACM Conference on Recommender Systems, 2018, pp. 288–296.
[28] C. Watters, H. Wang, Rating news documents for similarity, Journal of the American Society for Information Science 51 (2000) 793–804.
[29] NIST, TREC Washington Post Corpus, 2019. Data retrieved from https://trec.nist.gov/data/wapost/.
[30] L. Yujian, L. Bo, A normalized Levenshtein distance metric, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007).
[31] M. A. Jaro, Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida, Journal of the American Statistical Association (1989).
[32] G. Kondrak, N-gram similarity and distance, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2005.