<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Workshop on News Recommendation and Analytics, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Predicting Feature-based Similarity in the News Domain Using Human Judgments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alain D. Starke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Øverhaug</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Trattner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bergen</institution>
          ,
          <addr-line>P.O. Box 7800, 5020 Bergen</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Wageningen University &amp; Research</institution>
          ,
          <addr-line>Droevendaalsesteeg 4, 6708 PB Wageningen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>25</volume>
      <issue>2021</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>When reading an online news article, users are typically presented with 'more like this' recommendations by news websites. In this study, we assessed different similarity functions for news item retrieval by comparing them to human judgments of similarity. We asked 401 participants to assess the overall similarity of ten pairs of political news articles, which were compared to feature-specific similarity functions (e.g., based on body text or images). We found that users reported mostly using text-based features (e.g., title) for their similarity judgments, and that body text similarity was the most representative of their judgments. Moreover, we modeled similarity judgments using different regression techniques. Using data from another study, we contrasted our results across retrieval domains, revealing that similarity functions in news are less representative of user judgments than those in movies and recipes.</p>
      </abstract>
      <kwd-group>
        <kwd>news</kwd>
        <kwd>similarity</kwd>
        <kwd>similar-item retrieval</kwd>
        <kwd>recommender systems</kwd>
        <kwd>user study</kwd>
        <kwd>human judgment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Problem Outline</title>
        <p>
          News retrieval faces several domain-specific challenges. Compared to leisure domains (e.g.,
movies), news articles are volatile, in the sense that they become obsolete quickly or may be
updated later [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Consequently, user preferences may strongly depend on contextual factors,
such as a user’s time of day or location [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ].
        </p>
        <p>
          News websites typically present content-based recommendations [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. A common setup is to
present a list of articles that are similar to the story the user is currently reading, such as depicted
in Figure 1. These are often labeled ‘More on this Story’ (e.g., at BBC News), showcasing similar
articles in terms of their publication time or specific keywords.
        </p>
        <p>[Figure 1: Mock-up of a news article page with its features annotated: title, main image, author, date of publication, lead paragraph, body text, and item recommendations.]</p>
        <p>
          Whether two news articles are alike can be computed using similarity functions [
          <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
          ].
Features (e.g., title) considered by such functions should to a large extent reflect a user’s
similarity assessment [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], while not being too similar to what a user is currently reading, for
it may lead to redundancy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, research on feature-based similarity is limited and
rather domain-dependent. For example, users browsing on recipe websites tend to use titles
and header photos to assess similarity between recipes, while users of movie recommenders
use plot descriptions and genre [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. As a result, there is no consensus on which news article
features best represent a user’s similarity judgment. This may be problematic, as similarity
functions in recommender systems may be more effective if they reflect user perceptions.
        </p>
        <p>
          Hence, the current study assesses a set of similarity functions for news article retrieval,
particularly for the task of similar-item recommendation. We ask users of an online news
system to judge the similarity between pairs of news articles, which is used to develop a model
to predict news similarity. Subsequently, we perform cross-domain comparisons, comparing
which features are used for human similarity judgments in news, movies, and recipes, using
data from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We posit the following research questions:
• RQ1: Which news article features are used by humans to judge similarity and to what
extent are different feature-specific similarity functions related to human similarity
judgments?
• RQ2: Which combination of news article features is best suited to predict user similarity
judgments?
• RQ3: How does the use of news features and their similarity functions compare to those
used in the recipe and movie domains?
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Contributions</title>
        <p>
          This paper makes the following contributions:
• We advance the understanding of how readers perceive similarity between news articles,
in terms of (i) which article cues or features are reported as important, (ii) how
features correlate with similarity ratings provided by users, and (iii) how user-reported
feature importance is not always consistent with the computed correlations.
• We show which news information features can predict a user’s similarity judgment.
• We juxtapose our news study with findings from the movie and recipe domains, using
data from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], showing that feature-specific similarity functions in the news domain are
less representative of human judgment than functions in the movie and recipe domains.
• We present a reproducible data processing pipeline, available on Github1, and add a
benchmarking dataset for the publicly available Washington Post Corpus news article
database.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>We highlight work from the domains of Similar-item Retrieval and Semantic Similarity to craft
similarity functions. Moreover, we discuss specific challenges in news recommendation, and
explain how similarity functions are assessed by using human similarity judgments as ground
truth.</p>
      <sec id="sec-2-1">
        <title>2.1. Similar Item Retrieval</title>
        <p>
          Similar item retrieval seeks to identify unseen or novel items that are similar to what a user has
elicited preferences for [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In the recommender domain, this is referred to as a similar-item
recommendation problem. A fundamental question is how to compute similarity between
concepts [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ], which is examined in studies on semantic similarity [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], a field of research
that usually not only captures the similarity between two concepts, but also how different
they are [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This can be based on ontological relations, based on human knowledge, or on
co-occurrence metrics that stem from a hierarchical or annotated corpus of words [
          <xref ref-type="bibr" rid="ref12 ref2">2, 12</xref>
          ]. For
example, latent semantic analysis derives meaning and similarity from the text context itself,
by examining how and how often words are used [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          A traditional method is to compute similarity between items by deriving vectors from text
items. Although Term Frequency-Inverse Document Frequency (TF-IDF) has been outperformed
by other metrics, such as BM25 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], it remains one of the most commonly used IR methods
to create similarity vectors [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. It uses the term frequency per document and the inverse
appearance frequency across all documents [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], while similarity between the vectors of liked
and unseen items can be computed using cosine similarity [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
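        <p>
          As an illustration of this approach, the following self-contained Python sketch builds TF-IDF vectors and compares them with cosine similarity. It is a toy implementation with invented example sentences, not the code used in this study.
        </p>

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))               # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}  # standard log(N / df) weighting
    vectors = []
    for doc in docs:
        tf = Counter(doc)                 # raw term frequency
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented mini-corpus for illustration.
docs = [
    "senate passes budget bill".split(),
    "house debates budget bill".split(),
    "local team wins final".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shared terms -> positive similarity
print(cosine(vecs[0], vecs[2]))  # no shared terms -> 0.0
```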
        <p>
          A much simpler approach is to derive a set of keywords from each item [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. For example, a
book recommender could compute the similarity between two books represented as keyword
sets B1 and B2 through the Jaccard coefficient: J(B1, B2) = |B1 ∩ B2| / |B1 ∪ B2|.
        </p>
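        <p>
          A minimal Python sketch of this keyword-based approach (the keyword sets are invented for illustration):
        </p>

```python
def jaccard(a, b):
    """Jaccard coefficient between two keyword sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two invented keyword sets sharing 2 of 4 distinct keywords.
b1 = {"election", "senate", "budget"}
b2 = {"election", "budget", "tax"}
print(jaccard(b1, b2))  # -> 0.5
```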
        <sec id="sec-2-1-1">
          <title>1https://github.com/Overhaug/HuJuRecSys</title>
          <p>There are various other similarity metrics available, such as the Levenshtein distance (i.e., “edit
distance”), and LDA (Latent Dirichlet Allocation).</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Similarity Representations in the News Domain</title>
        <p>
          News recommender systems primarily focus on textual representations of news articles [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Most approaches utilize the main text or title, ignoring most other textual features, such as the
author [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. A straightforward but less common approach in academic studies [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is to
retrieve articles based on date-time, such as those that are published on the same day as the
article that is currently inspected. Other approaches include the use of (sub)categories, while
image-based similarity is more common in other domains [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], such as food [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
2.2.1. Text-based approaches
Most similarity functions relevant in news retrieval are text-based. TF-IDF is traditionally
combined with Cosine similarity and used as a news recommendation benchmark [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. In some
cases, its effectiveness can be improved by constraining it to a maximum number of words [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
TF-IDF can also be combined with a K-Nearest Neighbor algorithm to recommend short-term
interest news articles [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
        <p>
          Besides the aforementioned methods, a common approach is to derive latent topics from texts.
Although recent work uses Word2Vec and BERT [
          <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
          ], this work considers Latent Dirichlet
Allocation (LDA) and Probabilistic Latent Semantic Indexing (PLSI) [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. LDA and PLSI can
cluster topically-similar news articles based on tags and named entities. News recommendations
can be refined afterwards based on recency scores.
        </p>
        <p>
          A final interesting text-based method is based on sentiment analysis. Sentiment analysis
mines a text’s opinions in terms of the underlying attitude, judgments, and beliefs. It has been
suggested that negativity in news has a large impact, triggering more vivid recall of news story
details among users [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
2.2.2. Other News Features
A news article’s date-time feature is also leveraged in the context of similar-item news
recommendation, either through pre-filtering, recency modeling, or post-filtering [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Pre-filtering
involves omitting outdated news articles before computation starts, while the more uncommon
post-filtering removes all non-recent articles from a Top-N set. Recency modeling is the most
common, which incorporates recency as one of the factors in an algorithm’s similarity
computation (e.g., by giving it a higher weight). Pon et al. [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] describe an approach that targets users
with multiple interests, by considering recency in conjunction with a ‘multiple topic tracking’
technique.
        </p>
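        <p>
          The three date-time strategies can be sketched in Python as follows. The field names, the exponential half-life weight, and the cutoff values are illustrative assumptions, not parameters taken from the cited systems.
        </p>

```python
from datetime import date, timedelta

def recency_weight(pub_date, today, half_life_days=7):
    """Exponential-decay recency score in (0, 1]: halves every week (assumed)."""
    age_days = (today - pub_date).days
    return 0.5 ** (age_days / half_life_days)

def recommend(candidates, today, max_age_days=30, top_n=3):
    # Pre-filtering: drop outdated articles before scoring starts.
    fresh = [c for c in candidates if (today - c["date"]).days <= max_age_days]
    # Recency modeling: blend content similarity with the recency weight.
    scored = sorted(((c["sim"] * recency_weight(c["date"], today), c["id"])
                     for c in fresh), reverse=True)
    return [cid for _, cid in scored[:top_n]]

# Invented candidates with precomputed content similarity scores.
today = date(2021, 9, 1)
candidates = [
    {"id": "a", "sim": 0.9, "date": today - timedelta(days=60)},  # pre-filtered out
    {"id": "b", "sim": 0.6, "date": today - timedelta(days=1)},
    {"id": "c", "sim": 0.8, "date": today - timedelta(days=21)},
]
print(recommend(candidates, today))  # -> ['b', 'c']
```

Post-filtering would instead apply the age cutoff to the Top-N list after scoring.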
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Assessing Similarity Functions Using Human Judgments</title>
        <p>
          Similar-item retrieval approaches, as also used in similar-item recommender systems, are
typically validated using human judgments [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. An important question is to what extent
similarity functions reflect a user’s similarity assessment of item pairs. This could lead to
problems if a user either ignores or overvalues different item features, compared to what is
being computed [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This has been studied in the movie and recipe domains: Trattner and
Jannach [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] contrast user similarity assessments to a set of similarity functions, pointing out that
specific features (e.g., a recipe’s title or a movie’s genre) strongly correlate with user similarity
judgments. In a similar vein, Yao and Harper [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] assess to what extent different algorithms for
related item recommendations in music are consistent with user similarity judgments.
        </p>
        <p>
          However, assessing similarity between news articles might be harder than between movies.
Whereas similarity between movie pairs is usually attributed to the annotated metadata (e.g.,
genre), two news articles could be similar because they are recent, address a common topic,
or because a person appears in both stories. Although a few studies let humans assess the
overall similarity between news headlines [
          <xref ref-type="bibr" rid="ref2 ref28">2, 28</xref>
          ], none have done so across multiple features.
For example, users in the work of Tintarev and Masthoff [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] successfully judged the similarity
between news articles, but only based on their headlines.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Key Differences with Previous Work</title>
        <p>
          Novel to our approach is the use of feature-specific similarity representations and functions in
news, as well as grounding them in human similarity judgments. Most relevant to our approach
are the works of Trattner and Jannach [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and Yao and Harper [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], for they explore how
computational functions for similarity compare to users’ perception of similarity. In particular,
Trattner and Jannach [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] serve as an example for our approach, for they also present an online
study on similarity perceptions. However, these studies concerned retrieval in music, movies,
and recipes. Since the merit of such feature-specific similarity functions is unknown
for news, the goal of the current study is to assess their performance in that domain.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>We assess the utility of different feature-specific similarity functions by collecting human
judgments of similarity for pairs of news articles. In this section, we describe (1) the dataset
and its specific features, (2) the engineered similarity functions, and (3) the design of our user
study to determine the effectiveness of these functions.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Feature Engineering</title>
        <p>3.1.1. News Database
We employed a publicly available news article database. We focused on a scenario of a single
news source, as the use of multiple news websites could lead to ‘duplicate’ articles on the same
news event. To ensure reproducibility, we obtained news articles from the open Washington
Post Corpus [29]. The news items in the dataset comprised title, author (including a bio),
date of publication, section headers, and the main body text. In addition, we retrieved the
images associated with the news articles, 655,533 in total. After removing duplicates from
the original source, our remaining dataset contained 238,082 articles, which were originally
published between Jan’12 and Aug’18.
For our user study, we selected news articles categorized under ‘Politics’, as they covered
(inter)nationally relevant topics. Other categories were excluded because they focused more on
local events, which may not be familiar to users and could therefore bias similarity estimates.
We sampled a total of 2,400 ‘Politics’ news articles, 400 from each year between 2012
and 2017; descriptive statistics are reported in Table 1.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Modeling Similarity with Feature-Based Similarity Functions</title>
        <p>
          To model the similarity between two news articles, we used twenty similarity functions and
representations across seven dataset features. We designed functions in line with the field’s
current state-of-the-art, by exploiting specific cues that people may use to assess similarity
between two items – based on findings from the movie and recipe domains [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Table 2 describes the developed similarity functions. For each pair of news articles, we
computed similarity scores based on seven main features: subcategory, title, presented images,
author, author bio, publication date, and body text (first 50 words and full text). For
text-based features, the similarity functions were either based on word mappings or distance
methods, while similarity based on subcategories and authors was computed using a Jaccard
coefficient. Moreover, we computed date-time similarity (i.e., recency modeling) through a linear
function of how many days apart two articles were published.
3.2.1. Title
Title-based similarity was computed using four string similarity functions and a topic-based
one. The string-based functions were based on distance metrics: the Levenshtein distance (LV)
[30], the Jaro-Winkler method (JW) [31], the longest common subsequence, and the bi-gram
distance method (BI) [32]. Similar to Trattner and Jannach [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Latent Dirichlet Allocation (LDA)
topic-modeling was set to 100 topics.
3.2.2. Image Features
In line with the current state-of-the-art [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we computed image-based similarity using six
different functions. These were an image’s brightness, sharpness (i.e., based on a pixel’s
intensity), contrast, colorfulness (i.e., based on the sRGB color space), entropy (i.e., amount of
information captured per image dot), and image embeddings. Mathematical details are available
in our Github repository.
3.2.3. Body Text
Body text similarity was computed using two string-based functions (i.e., TF-IDF variants), a topic-based
function (i.e., LDA), and a text sentiment-based metric (based on the research of [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]). TF-IDF
encodings were paired with cosine similarity, for which we discerned between similarity based
on an article’s first 50 words (i.e., an article’s first paragraph), which could be compared to the
average movie plot length in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and similarity based on the entire body text.
        </p>
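        <p>
          As an illustration of the distance-based title functions and the linear date-time function described above, consider the following Python sketch. The normalization by title length and the 365-day window are illustrative choices, not the exact parameters of our functions.
        </p>

```python
def levenshtein(s, t):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def title_similarity(a, b):
    """Normalize the edit distance to a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def date_similarity(days_apart, max_days=365):
    """Linear decay: same-day pairs score 1.0, pairs a year or more apart score 0.0."""
    return max(0.0, 1.0 - days_apart / max_days)

print(levenshtein("kitten", "sitting"))  # -> 3
```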
      </sec>
      <sec id="sec-3-3">
        <title>3.3. User Study</title>
        <p>
          The similarity functions in Table 2 were assessed by computing similarity scores per news
article pair and comparing them to human judgments. We explain our sampling strategy and
how we collected human judgments of similarity.
3.3.1. Sampling News Article Pairs on Similarity
We compiled a set of news article pairs that were either strongly similar, dissimilar or in-between.
To ensure a good distribution, we employed a stratified sampling strategy that was in line with
previous work [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We computed the pairwise similarity across all 2400 news articles, averaging
the similarity values of all functions in Table 2. Pairs were ordered on their similarity levels and
divided into ten deciles, groups D1-D10 of equal size. We sampled a total of 6,000 news article
pairs: 2,000 dissimilar pairs from decile D1, 2,000 pairs from deciles D2-D9, and 2,000 similar
pairs from decile D10.
3.3.2. Procedure and Measures
The resulting 6,000 news article pairs were used to collect human judgments of similarity. Figure
2 depicts a mock-up of the main application, showing from top to bottom different news article
features (note: an author bio could also be inspected). Users could read the full text by clicking
‘read more’.
        </p>
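        <p>
          The stratified decile sampling step can be sketched in Python as follows (a simplified illustration; `pairs` holds precomputed average similarities, and the sample sizes are parameters):
        </p>

```python
import random

def sample_pairs_by_decile(pairs, n_low, n_mid, n_high, seed=42):
    """pairs: list of (pair_id, avg_similarity) tuples.

    Orders pairs by similarity, splits them into ten equal deciles
    D1-D10, then samples dissimilar pairs from D1, in-between pairs
    from D2-D9, and similar pairs from D10."""
    rng = random.Random(seed)
    ordered = sorted(pairs, key=lambda p: p[1])
    k = len(ordered) // 10
    deciles = [ordered[i * k:(i + 1) * k] for i in range(10)]
    middle = [p for d in deciles[1:9] for p in d]
    return (rng.sample(deciles[0], min(n_low, len(deciles[0])))
            + rng.sample(middle, min(n_mid, len(middle)))
            + rng.sample(deciles[9], min(n_high, len(deciles[9]))))

# Invented example: 100 pairs with similarities spread over [0, 1).
pairs = [(i, i / 100) for i in range(100)]
sample = sample_pairs_by_decile(pairs, 2, 4, 2)
print(len(sample))  # -> 8
```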
        <p>
          Users were presented ten news article pairs, of which one was an attention check.2 Much
like in the study by Tintarev and Masthoff [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], users were asked to assess the similarity of each
news article pair on a 5-point scale (cf., Figure 2). As an extension to other studies, users also
indicated their familiarity with each article and the level of confidence in their assessment (all
5-point scales). Moreover, we asked users to what extent they employed different features in
their similarity judgments (5-point scales). Finally, we inquired about a user’s frequency of news
consumption and their demographics.
3.3.3. Participants
Participants were recruited from Amazon MTurk. Since we used a database of news articles
that concerned American politics, we only recruited U.S.-based participants. They had a HIT
acceptance rate of at least 98% and at least 500 completed HITs. A total of 401 participants
completed our study, with a median completion time of 6 minutes and 35 seconds; they were
compensated with 0.5 USD.
        </p>
        <p>
          Only 241 participants (60.1%) passed our attention check, which was slightly higher than in
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This resulted in 2,169 usable similarity judgments; only 21 pairs were presented twice, to
different users. This final sample (53% male) mostly consisted of age groups 25-34 (33.2%) and
35-44 (30.3%); 66% of participants reported visiting news websites at least once a week (24.9% did so
daily), while 50 participants rarely read online news.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>For our analyses, we first examined the use of different news features, assessing different
similarity functions through human judgments (RQ1). Furthermore, we predicted human
similarity judgments using model-based approaches (RQ2). In addition, we compared our results
for RQ1-RQ2 with the movie and recipe domains (RQ3).</p>
      <sec id="sec-4-1">
        <title>4.1. News Features Usage</title>
        <p>We examined to what extent participants used different features to assess similarity between
news articles (RQ1). Figure 3A summarizes the results for participants who passed the attention
check. On average, an article’s title (M=4.2) and body text (M=4.4) were considered most often,
while sentiment (M=3.7) and an article’s subcategory (M=3.2) saw above-average use. In contrast,
author features, publication date, and an article’s image were rarely used to assess similarity. Figure
3B shows that all differences between features were significant (all p &lt; 0.01), based on a one-way
ANOVA on feature usage and Tukey’s HSD post-hoc analysis.</p>
        <p>
          With regard to [RQ3], most findings were compatible with the movie and recipe domains.
The use of title and body text was also observed for recipes (i.e., ingredients and directions),
while plot and genre features were used in movies [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The use of the genre cue in movies was
also more frequent than the use of a news article’s subcategory.
        </p>
        <sec id="sec-4-1-1">
          <title>2Users were asked for this pair to only answer ‘5’ on all answer scales.</title>
          <p>[Figure 3: (A) reported usage of each information cue (e.g., title, body text, author, image); (B) pairwise comparisons of cue usage.]</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Grounding Similarity Functions in Human Similarity Judgments</title>
        <p>
          4.2.1. Descriptive Statistics
To address [RQ1], we compared feature-specific similarity scores of presented news article pairs
to similarity ratings given by users. Figure 4 contrasts the similarity scores, averaged across
all similarity functions, with the users’ similarity judgments, averaged per user. As shown,
there was a discrepancy between the similarity inferred by the similarity functions, which was
distributed around a mean value of 0.39 (SD = 0.085), and the similarity judgments of users,
which were lower (M = 0.18, SD = 0.24). This suggests that users were less likely to judge
two news articles to be similar, compared to our similarity functions.
4.2.2. Feature-specific Comparison in News
Table 3 outlines the Spearman correlations between similarity functions and the similarity
judgments given by users. It differentiates between the results of our own user study (i.e.,
‘News Articles’), and that of [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for the movie and recipe domains, allowing for cross-domain
comparisons (discussed later).
        </p>
        <p>
          We first discuss the results for the news domain and focus on users who passed the attention
check. Table 3 shows that most correlations were modest (all r &lt; 0.3), suggesting that the news
similarity functions did not fully reflect a user’s judgment. Among all features, we found that
full body text similarity (BodyText:TFIDF) correlated most strongly with user judgments (r = 0.29,
p &lt; 0.001); body text was also the most commonly used feature in earlier news recommendation
scenarios [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Although some users might have only inspected an article’s first 50 words (cf.,
the text visible in Figure 2; on average 15% of the full body text), the BodyText:50TFIDF metric
had a much lower correlation (r = 0.14, p &lt; 0.001).
        </p>
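        <p>
          Spearman correlations such as those reported in Table 3 can be computed from raw scores as follows. This is a plain-Python sketch of Spearman's rank correlation (Pearson correlation of average ranks, with ties shared), shown for illustration only.
        </p>

```python
def rank(values):
    """Assign 1-based average ranks, sharing ranks within tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

print(spearman([1, 2, 3, 4], [1, 2, 4, 3]))  # one swapped pair -> rho ≈ 0.8
```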
        <p>Among all image similarity metrics, embeddings (Image:EMB) had the highest correlation
with user judgments (r = 0.17*), which was nonetheless modest. This function, along with
BodyText:TFIDF, Author:Jacc, AuthorBio:TFIDF, and Subcat:Jacc, seemed to best represent user
similarity judgments in news.</p>
        <p>Table 3 highlights that other functions did not represent a user’s similarity judgment in news,
such as sentiment (BodyText:Sent, r = −0.02). Surprisingly, although most users considered
titles when assessing similarity, their judgments hardly correlated with any distance-based title
similarity function (all r &lt; 0.1). Note that Title:LDA and BodyText:LDA might have suffered
from insufficient latent topic information, as their correlations were close to zero.</p>
        <p>
          Finally, because similarity ratings correlated positively with familiarity scores (r = 0.27*),
we tested whether only including judgments for familiar news article pairs (i.e., with scores
of 4 or higher) affected the results in Table 3. Doing so increased correlations by 1 to 4
percentage points for most features, and most changes were statistically significant (e.g.,
BodyText:TFIDF would increase from 0.29 to 0.33).
4.2.3. Cross-domain Comparison
Using data from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we compared the results in Table 3 across the news, recipe, and movie
domains. Correlations between human judgments and similarity functions in the news domain
were shown to be much weaker than in the recipe domain and, to a lesser extent, the movie
domain. This applied to most features, including title, image, and body text.
        </p>
        <p>
          Two notable differences lie in title and image-based functions. Whereas the reported
correlations for title features were weak in news (r &lt; 0.1), the distance-based title metrics showed
strong correlations with user judgments for recipes (r ≈ 0.5). With regard to image-specific
similarity, functions in news were only weakly correlated with human judgments (r = 0.17),
while they were more representative for recipes (r = 0.44) and movies (r = 0.22).
        </p>
        <p>[Table 3: Spearman correlations between feature-specific similarity functions and human similarity judgments, for news articles (this study), and for recipes and movies (obtained from
          <xref ref-type="bibr" rid="ref7">7</xref>
          ). Correlations are reported separately for users who passed the attention check and for all users. *p &lt; 0.05; **p &lt; 0.01; ***p &lt; 0.001.]</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Predicting Human Similarity Judgments</title>
        <p>Going beyond simple correlation analyses, we also sought to predict similarities with these
functions using state-of-the-art machine learning methods (RQ2), as used in recommender
systems research. This helped us to understand each feature’s importance, beyond the
feature-specific correlations in Table 3.
4.3.1. Model Evaluation and Cross-Domain Comparison
To determine model performance, standard metrics such as Root Mean Square Error (RMSE), R²,
and Mean Absolute Error (MAE) were used. Five-fold cross-validation was used as an evaluation
protocol. Furthermore, by applying grid search on a validation set from the training data, the
optimal hyper-parameters for each model were found.</p>
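The three evaluation metrics can be made concrete with a short sketch (pure Python, hypothetical numbers; the authors' actual pipeline is not reproduced here):

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical similarity judgments (1-5 scale) and model predictions
y_true = [1.0, 2.0, 2.0, 3.0, 4.0]
y_pred = [1.4, 1.8, 2.5, 2.9, 3.6]
```

In five-fold cross-validation these metrics are computed on each held-out fold and averaged; grid search selects the hyper-parameters that minimize the validation error.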
        <p>
          The models performed significantly better than a random baseline (p &lt; 0.05). Table 4 (i) also compares our results to findings from
the recipe and movie domains (RQ3), adapted from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Most notably, we found that Lasso was
the best performing model for news, while Ridge outperformed other models in the recipe and movie
domains. Moreover, the news model (R² = 0.33) was less accurate than the recipe model
(R² = 0.51), while its accuracy was comparable to that of the movie model (R² = 0.36).
This suggested that the similarity functions adapted from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] were less representative for user
similarity judgments in the news domain.
4.3.2. Feature-specific Models and User Characteristics
To further explore (RQ2), Table 4 (ii) describes the performance of feature-specific models.
To compare our findings to other domains, Ridge regression was used to combine multiple
similarity functions per feature, while linear regression was used for features with a single
function. Although the representativeness of the different BodyText similarity functions varied
(cf., Table 3), BodyText was the best-predicting feature, even outperforming the All features model.
        </p>
        <p>Finally, we included user characteristics and demographics in our Ridge model. We tested
the impact of each additional feature separately, as well as simultaneously. Table 4 (iii) outlines
that the addition of user characteristics (e.g., news consumption frequency) hardly affected the
model’s predictive quality. A model that included the user’s age reported the lowest RMSE,
but this decrease (from 0.9141 in (i) to 0.9081 in (iii)) was not statistically significant
according to a Wilcoxon Rank-Sum test.</p>
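The Wilcoxon rank-sum comparison used above rests on the Mann-Whitney U statistic. A minimal sketch follows, assuming two samples of per-fold errors (illustrative only; a real test would still compare U against its null distribution to obtain a p-value):

```python
def rank_sum_u(a, b):
    """Mann-Whitney U statistic for sample a vs. b (ties get average ranks)."""
    combined = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    # Rank sum of sample a (its items occupy indices 0 .. len(a)-1)
    r_a = sum(ranks[i] for i in range(len(a)))
    return r_a - len(a) * (len(a) + 1) / 2
```

U ranges from 0 to len(a) * len(b); values near the middle of that range indicate the two error samples are statistically indistinguishable, as was the case for the age-augmented model here.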
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>This work contributes to the literature on similarity estimates, a central theme in
recommender systems research, with a particular focus on the news domain. It is among the
first to study news similarity representations in detail, making the following contributions:
1. Determining which features are considered by users when judging similarity between
news articles.
2. Assessing how feature-specific similarity functions relate to similarity judgments.
3. Predicting similarity judgments of users through machine learning models.
4. Comparing our results to findings from the movie and recipe domains.</p>
      <p>
        We have taken a first step towards designing representative feature-specific similarity functions
for news, going beyond other studies that focused on overall similarity or just a single feature
[
        <xref ref-type="bibr" rid="ref2 ref28">28, 2</xref>
        ].
      </p>
      <sec id="sec-5-1">
        <title>5.1. Feature-specific Similarity</title>
        <p>
          We have assessed the value of feature-specific similarity functions in the news domain, adapted
from recommender literature in the news, movie, and recipe domains [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We find that most
feature-specific similarity functions only partially reflect a user’s similarity judgment, yielding
modest correlations. To best reflect user perceptions, we suggest that content-based news
recommender systems should exploit the body text, supported by image embeddings, article
categories, and the author. The representativeness of body text is grounded in the reported
feature use, as well as consistent with previous studies on news retrieval [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In contrast,
although users used a news article’s title in their similarity judgments, we have found title-based
similarity functions to be hardly representative for these judgments. The weak correlations
could be attributed to the relatively ‘wordy’ titles of news articles (cf., Table 1), compared to
the other domains in scope. At the similarity function level, it is possible that the string-based
functions do not capture more subtle similarities between news articles, for example if two
headlines describe an identical news event, but from a different news angle. Moreover, the
insignificant correlation between Title:LDA and a user’s similarity judgment suggests that
word-based similarity is unrelated to how users perceive a pair of news articles.
        </p>
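The string-based title functions discussed above (e.g., Title:LV) are typically variants of a normalized edit distance. A minimal sketch of such a metric, assuming plain Levenshtein distance (the paper's exact normalization may differ):

```python
def levenshtein(s, t):
    """Edit distance between strings s and t via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def title_similarity(a, b):
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical headlines about the same event, phrased differently:
# character-level edit distance stays high, so the score stays low,
# illustrating the weakness of string-based title metrics noted above.
low = title_similarity("Senate passes budget bill", "Budget deal clears Senate")
```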
        <p>In terms of predicting similarity judgment, we have used machine learning to determine
model accuracy and feature importance, and to examine the predictive value of additional user
characteristics. We find that the addition of user characteristics and demographics in our models
does not significantly improve the accuracy indicators, indicating there is little variance across
users. In terms of similarity modeling, these findings suggest that the main focus should be on
leveraging a news article’s BodyText, while other features should only be used if their similarity
functions are better tailored to the news domain.</p>
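A BodyText similarity of the TF-IDF kind (cf. BodyText:TFIDF) can be sketched as follows; a minimal pure-Python version, assuming whitespace-tokenized article bodies (the authors' exact weighting scheme may differ):

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """TF-IDF weight dict per tokenized document (raw tf x log(N/df))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical tokenized article bodies
docs = [["tax", "reform", "vote"],
        ["tax", "reform", "debate"],
        ["football", "match"]]
vecs = tfidf_vectors(docs)
```

Pairs of articles sharing rare terms score high; articles with disjoint vocabularies score zero, which is what makes body text a strong signal for perceived similarity.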
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Cross-domain Comparisons</title>
        <p>
          We have also explored cross-domain differences. In line with [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we have found further evidence
that different domains call for different similarity functions. For one, the Ridge regression model
for news is found to be somewhat less accurate than those for movies and recipes, although an R² of 0.33
is reasonable. However, the MAE of 0.75 for a measure that is scaled from 1 to 5 suggests that
there is room for improvement, which could be attributed to the generally low similarity scores given by users.
        </p>
        <p>It seems that text-based similarity (i.e., movie plot, recipe directions, news’ body text) is useful
in most domains in scope, given an appropriate similarity function. BodyText features are listed
among the strongest correlations, as well as among the strongest predictors. In contrast, the title
and image features are less representative of similarity judgments in news and movies, compared
to the recipe domain. Whereas only image embeddings seem to be somewhat representative of
news similarity assessments, image features are more useful in determining recipe similarity.</p>
        <p>
          We have observed that the model accuracy reported in Table 4 is comparable to findings
from the movie domain (cf., [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]). This is despite the differences in the similarity scores given across
domains (which are much lower for news; see Figure 4), and the weaker correlations reported in
Table 3. All in all, the news domain seems to require similarity functions that are less
‘taste-related’ than those for movies or recipes, but further research is needed to develop more accurate ones,
possibly by also using psychological theories on similarity [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Limitations &amp; Future Work</title>
        <p>A notable limitation of our approach is the use of a single dataset, which only comprises
political articles. It is possible that the relation between similarity judgments and
feature-specific similarity functions would be affected when employing additional main categories. For
example, ‘name-dropping’ sports teams in a news article title might result in a higher feature
importance for news article titles, compared to political articles. Furthermore, the news
articles shown to users were a few years old, which might have reduced familiarity levels and,
in turn, decreased similarity ratings.</p>
        <p>
          Another shortcoming is that it is not entirely clear on what grounds users have made their
similarity judgments. We have asked them a single question on similarity, while some other
studies have also used multiple questionnaire items [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, our inquiry into participants’ reported
feature use (RQ1) reveals part of the underlying cognitive process and suggests
which features are worth optimizing for; to our knowledge, this is a new finding in itself.
        </p>
        <p>
          For future studies, we suggest developing and assessing feature-specific similarity functions
that unambiguously apply to the news domain. For example, similarity functions that leverage
named entities (e.g., ‘Donald Trump’ or ‘France’) could help to manage user expectations about
inter-article similarity. Furthermore, it would be most useful to test our assertions in an online
study where news article recommendations are evaluated, much like the work of [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
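The named-entity suggestion above could, for instance, take the shape of a Jaccard similarity over extracted entity sets. A minimal sketch with hypothetical entity sets (the entity extraction step itself is assumed):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical named-entity sets extracted from two article bodies
entities_a = {"Donald Trump", "France", "NATO"}
entities_b = {"France", "NATO", "Emmanuel Macron"}
sim = jaccard(entities_a, entities_b)  # 2 shared / 4 total = 0.5
```

Because named entities anchor articles to concrete actors and places, such a function could help manage user expectations about inter-article similarity more transparently than bag-of-words overlap.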
        <p>Above all, we would like to emphasize that the current study serves as a first step. Based on these
findings, future studies can further develop feature-specific similarity functions for the news
domain, as this paper provides insight into which types of functions and features are successful,
and which ones are not.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by industry partners and the Research Council of Norway with
funding to MediaFutures: Research Centre for Responsible Media Technology and Innovation,
through the Centres for Research-based Innovation scheme, project number 309339.</p>
      <p>[28] (cont.) Society for Information Science 51 (2000) 793–804.
[29] NIST, TREC Washington Post Corpus, 2019. Data retrieved from https://trec.nist.gov/data/wapost/.
[30] L. Yujian, L. Bo, A normalized Levenshtein distance metric, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007).
[31] M. A. Jaro, Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida, Journal of the American Statistical Association (1989).
[32] G. Kondrak, N-gram similarity and distance, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2005.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jugovac</surname>
          </string-name>
          ,
          <article-title>News recommender systems-survey and roads ahead</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>54</volume>
          (
          <year>2018</year>
          )
          <fpage>1203</fpage>
          -
          <lpage>1227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tintarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthof</surname>
          </string-name>
          ,
          <article-title>Similarity for news recommender systems</article-title>
          ,
          <source>in: In Proceedings of the AH'06 Workshop on Recommender Systems and Intelligent User Interfaces, Citeseer</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>A. S. Das</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Datar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Rajaram</surname>
          </string-name>
          ,
          <article-title>Google news personalization: scalable online collaborative filtering</article-title>
          ,
          <source>in: Proceedings of the 16th international conference on World Wide Web</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mladenić</surname>
          </string-name>
          ,
          <article-title>Real-time news recommender system</article-title>
          ,
          <source>in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , Springer,
          <year>2010</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>T. De Pessemier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Courtois</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Vanhecke</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Van Damme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Martens</surname>
          </string-name>
          , L. De Marez,
          <article-title>A user-centric evaluation of context-aware recommendations for a mobile news service</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>75</volume>
          (
          <year>2016</year>
          )
          <fpage>3323</fpage>
          -
          <lpage>3351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Elbadrawy</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Karypis, User-specific feature-based similarity models for top-n recommendation of new items</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology (TIST) 6</source>
          (
          <issue>2015</issue>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Trattner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Learning to recommend similar items from human judgments, User Modeling and User-Adapted Interaction 30 (</article-title>
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Özgöbek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Gulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Erdur</surname>
          </string-name>
          ,
          <article-title>A survey on challenges and methods in news recommendation</article-title>
          ,
          <source>in: International Conference on Web Information Systems and Technologies</source>
          , volume
          <volume>2</volume>
          , SCITEPRESS,
          <year>2014</year>
          , pp.
          <fpage>278</fpage>
          -
          <lpage>285</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Winecof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brasoveanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Casavant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Washabaugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <article-title>Users in the loop: a psychologically-informed approach to similar item retrieval</article-title>
          ,
          <source>in: Proceedings of the 13th ACM Conference on Recommender Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <article-title>Using WordNet as a knowledge base for measuring semantic similarity between words</article-title>
          ,
          <source>Technical Report Working Paper CA-1294</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>An information-theoretic definition of similarity</article-title>
          , in: ICML, volume
          <volume>98</volume>
          ,
          <year>1998</year>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Takale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Nandgaonkar</surname>
          </string-name>
          ,
          <article-title>Measuring semantic similarity between words using web documents</article-title>
          ,
          <source>International Journal of Advanced Computer Science and Applications (IJACSA) 1</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          , T. Moon,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Learning to model relatedness for news recommendation</article-title>
          ,
          <source>in: Proceedings of the 20th International Conference on World Wide Web</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Billsus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Pazzani</surname>
          </string-name>
          ,
          <article-title>User modeling for adaptive news access, User Modelling</article-title>
          and
          <string-name>
            <surname>User-Adapted Interaction</surname>
          </string-name>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zanker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Felfernig</surname>
          </string-name>
          , G. Friedrich,
          <source>Recommender systems: an introduction</source>
          , Cambridge University Press,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Cantador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <article-title>Semantic contextualisation in a news recommender system</article-title>
          ,
          <source>in: Workshop on Context-Aware Recommender Systems at the RecSys 2009: ACM Conference on Recommender Systems</source>
          , ACM, New York,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          , L. Ramming, NewsREEL multimedia at MediaEval 2018:
          <article-title>News recommendation with image and text content</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rorvig</surname>
          </string-name>
          ,
          <article-title>Images of similarity: A visual exploration of optimal similarity metrics and scaling properties of trec topic-document sets</article-title>
          ,
          <source>Journal of the American Society for Information Science</source>
          <volume>50</volume>
          (
          <year>1999</year>
          )
          <fpage>639</fpage>
          -
          <lpage>651</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Goossen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ijntema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frasincar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hogenboom</surname>
          </string-name>
          , U. Kaymak,
          <article-title>News personalization using the CF-IDF semantic recommender</article-title>
          , in: ACM International Conference Proceeding Series,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bogers</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Van Den Bosch</surname>
          </string-name>
          ,
          <article-title>Comparing and evaluating information retrieval algorithms for news recommendation</article-title>
          ,
          <source>in: RecSys'07: Proceedings of the 2007 ACM Conference on Recommender Systems</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Billsus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Pazzani</surname>
          </string-name>
          ,
          <article-title>Personal news agent that talks, learns and explains</article-title>
          ,
          <source>in: Proceedings of the International Conference on Autonomous Agents</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shiebler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sedhain</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Bronstein</surname>
          </string-name>
          ,
          <article-title>Tuning word2vec for large scale recommendation systems</article-title>
          ,
          <source>in: Fourteenth ACM Conference on Recommender Systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>732</fpage>
          -
          <lpage>737</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          , T. Liu,
          <article-title>A bert-based ensemble model for chinese news topic prediction</article-title>
          ,
          <source>in: Proceedings of the 2020 2nd International Conference on Big Data Engineering</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Knox</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Padmanabhan,</surname>
          </string-name>
          <article-title>SCENE: A scalable two-stage personalized news recommendation system</article-title>
          ,
          <source>in: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Soroka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balmas</surname>
          </string-name>
          ,
          <article-title>Bad News or Mad News? Sentiment Scoring of Negativity, Fear, and Anger in News Content, Annals of the American Academy of Political and Social Science (</article-title>
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Pon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Cardenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buttler</surname>
          </string-name>
          , T. Critchlow,
          <article-title>Tracking multiple topics for finding interesting articles</article-title>
          ,
          <source>in: Proceedings of the ACM SIGKDD International Conference</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Harper</surname>
          </string-name>
          ,
          <article-title>Judging similarity: a user-centric study of related item recommendations</article-title>
          ,
          <source>in: Proceedings of the 12th ACM Conference on Recommender Systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>288</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Watters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Rating news documents for similarity</article-title>
          ,
          <source>Journal of the American</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>