Human-centric evaluation of similarity spaces of news articles

Clara Higuera Cabañes, Michel Schammel, Shirley Ka Kei Yu, Ben Fields
[first name].[last name]@bbc.co.uk
The British Broadcasting Corporation
New Broadcasting House, Portland Place, London, W1A 1AA, United Kingdom

Abstract

In this paper we present a practical approach to evaluating similarity spaces of news articles, guided by human perception. This is motivated by applications that modern news audiences expect, most notably recommender systems. Our approach is laid out and contextualised with a brief background in human similarity measurement and perception. This is complemented with a discussion of computational methods for measuring similarity between news articles. We then go through a prototypical use of the evaluation in a practical setting before we point to future work enabled by this framework.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25 July 2019, published at http://ceur-ws.org

1 Introduction and Motivation

In a modern news organisation, there are a number of functions that depend on computational understanding of produced media. For text-based news articles this typically takes the form of lower-dimensional content similarity. But how do we know that these similarities are reliable? On what basis can we take these computational similarity spaces to be a proxy for human judgement? In this paper we address this question as follows:

• How can we assess human cognition of the similarity of news articles?

• Analogously, what are efficient and effective means of computing similarity between news articles?

• By what means can we use the human cognition of article similarity to select parameters or otherwise tune a computed similarity space?

A typical application that benefits from this sort of human-calibrated similarity space for news articles is an article recommender system. While a classic collaborative filtering approach has been tried within the news domain [LDP10], typical user behaviour makes this approach difficult in practice. In particular, the lifespan of individual articles tends to be short and the item preferences of users are light.

This leads to a situation where in practice a collaborative filtering approach is hampered by the cold-start problem, where lack of preference data negatively impacts the predictive power of the system. To get around this issue, a variety of more domain-specific approaches have been tried [GDF13, TASJ14, KKGV18]. However, these all demand significant levels of analytical effort or otherwise present challenges when scaling to a large global news organisation. A simple way to get around these constraints while still meeting the functional requirements¹ of a recommender system is to generate a similarity space across recently published articles and surface the content most similar to the current article. This assumes that most readers predominantly prefer reading similar content, which is a pragmatic assumption.

¹ Here that means: present a reader of an article with other articles that they have a high likelihood of reading.

In order for this approach of article similarity to be an effective means of recommendation to readers, the similarity space needs to be well aligned with the human perception of similarity across these articles.
To that end, this paper will lay out a methodology for assessing the perception of similarity between news articles (Section 2), methods for computing similarity between news articles (Section 3), and an example case where findings from the first part are used to aid model selection in the second (Section 4). We also briefly discuss how such a content similarity recommender system works in practice before we conclude the paper by considering next steps implied by this work.

2 Human Similarity

Given that our motivation for having a similarity space among news articles is to produce articles that readers perceive as similar, it is critical that we have a means of assessing the similarity of news articles as perceived by people. While it would be convenient to assume that news articles are perceived by people as having objective similarities, there are a number of reasons to work from the assumption that this is not the case. Broadly, human perception of item similarity does not obey the requirements of a well-formed metric space, most notably symmetry [AM99] and the triangle inequality [YBDS+17].

Therefore we look to other domains for useful analogues to our problem of assessing the perceptual difference between objects and a mapping of that into a similarity metric. In particular, we look at assessment methods from two domains: psychophysics and sensory perception.

2.1 Psychophysics

The field of psychophysics is concerned with understanding the interaction between physical phenomena and human cognition of these phenomena, most typically auditory and visual stimuli. One of the most widely known applications from psychophysics is lossy compression, where digital audio or video is reduced in size by discarding portions that are not likely to be perceived by a general audience [Pan95, Wal92]. As a result of these well-established areas of research, this field has mature techniques for measuring human-perceivable difference across transformations or deterioration of an anchor stimulus. The standard practice in auditory settings is called Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) [15301]. This testing framework allows for precise measurement of changes which are or are not generally noticeable while calibrating for individual testers' differences in perception and cognition, though this comes at the expense of a test which can be lengthy and require larger populations of testers than less complicated tests.

2.2 Sensory Perception

A common means of measuring the human ability to differentiate between stimuli that are similar is described in terms of the Just Noticeable Difference (JND). That is, the JND is a unit such that if two stimuli are measurably closer than this JND, the average person will not be able to notice the difference between them. This has been used effectively to understand human perception of a wide variety of things, from speech [BRN99] and colour [CL95] to the handling characteristics of cars [HJ68]. In a news article context the JND is the amount of measurable change between articles before an average reader would consider them different articles.

Serving as a complement to the idea of the JND is the sensory triangle test. In this test three stimuli are presented to an evaluator, with two of them being identical. The evaluator is then asked to identify which of the three stimuli is different from the other two. This process is repeated across a population of evaluators, and if a statistically significant² portion of the population correctly identifies the different stimulus, the difference is taken as perceivable and therefore larger than the JND [OO85].

² Typically a chi-squared test is used, c.f. https://www.sensorysociety.org/knowledge/sspwiki/pages/triangle%20test.aspx

2.3 A Proposed Test

Given the above, we propose the following means of assessing article similarity:

1. Gather a collection of anchor articles from your corpus.

2. For each anchor, select two additional articles for comparison.

3. Present each of these triplets in turn to a human evaluator, asking the evaluator to decide which of the two articles is more similar to the anchor.

Beyond the evaluation process, there is the mechanism for selecting both the anchors and the comparison articles. For these issues much depends on the particulars of the assessment, and to that end we will go through our use of this assessment in Section 4. However, there are some guiding principles to consider in general. Keeping in mind that the goal of the assessment is a human understanding of the similarity space, rather than the analytical configuration of the space, we should seek to select anchors that maximise coverage across the corpus, and we should seek to select comparison articles that we believe to lie at a variety of different levels of similarity from the anchor articles. A straightforward way to bootstrap these selection criteria is to use a best-effort computed similarity and then select items across the space.

By adhering to these principles we should be able to improve our results, though as with many assessments of this type, the larger the number of participants becomes, the stronger the conclusions will be.
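To make the decision rule behind the triangle test concrete, the significance check can be sketched as below. Footnote 2 mentions the chi-squared test conventionally used; for a single odd-one-out question an exact binomial check is an equivalent, simpler alternative. The function names, sample counts, and the 0.05 threshold here are illustrative, not part of the protocol above.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, chance: float) -> float:
    """One-sided probability of seeing at least `successes` correct
    answers out of `trials` if every evaluator were guessing at `chance`."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def difference_is_perceivable(correct: int, total: int,
                              chance: float = 1 / 3,
                              alpha: float = 0.05) -> bool:
    """Triangle test: evaluators pick the odd one out of three stimuli,
    so guessing succeeds with probability 1/3. The difference is taken
    as perceivable (larger than the JND) when the guessing hypothesis
    is rejected at level alpha."""
    return binomial_p_value(correct, total, chance) < alpha
```

With 30 evaluators guessing at a one-in-three rate, around 20 correct identifications comfortably rejects guessing, while 10 (the chance expectation) does not.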
3 Computed Similarity

In order to compute a similarity measure between articles, we first need to derive a computer-readable representation for each document and, second, choose an adequate metric to evaluate the distance between them. There are several algorithms that can be used to construct similarity spaces and perform topic modelling.

3.1 Doc2vec

Word2vec [MCCD13] and its extension Doc2vec [LM14] are embedding algorithms (usually formed of shallow, two-layer neural networks) that construct vector spaces of words based on their frequencies and co-occurrences in the training corpus. The learned mathematical representation can then be used to establish similarities between words using vector algebra. Doc2vec works in a similar way but trains on individual documents rather than words and is thus able to establish similarities between documents rather than just words.

3.2 FastText

Another popular natural language processing library is fastText. Based on a shallow neural network with an embedding layer, fastText can be used in two applications: learning embeddings from a corpus [BGJM17] or document classification [JGBM17]. In the former application, [GBG+18] used the fastText algorithm to generate language models for 157 different languages from Wikipedia data. These pre-trained models can be used to transform documents into vector representations and enable similarity calculations in the same manner as in the Doc2vec case.

3.3 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) [BNJ03] is a generative probabilistic model that represents documents as a mixture of topics expressed as probabilities, with each topic represented by a probability distribution over words. Section 3.4 describes how the similarity between documents can be assessed with this method.

For our use case, we found LDA has a number of advantages:

• The algorithm delivers inspectable topics; as every topic is a probability distribution over words, it is straightforward to determine the most important words contributing to each topic, thus allowing interpretation of the topics.

• Building on the word distributions, the topics associated with a document can easily be traced back to the most salient words in the document. This is a strong step towards explainability, a key requirement under recital 71 of the GDPR [RP16] and a strong tool for recommender monitoring.
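Regardless of which embedding model produces the document vectors, the comparison step is the same vector arithmetic. The sketch below substitutes toy bag-of-words count vectors for learned Doc2vec/fastText embeddings purely so it is self-contained; the vocabulary and documents are invented for illustration.

```python
from collections import Counter
from math import sqrt

def vectorise(text: str, vocabulary: list[str]) -> list[float]:
    """Toy stand-in for a learned embedding: bag-of-words counts
    over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two document vectors: 1.0 for
    identical direction, 0.0 for orthogonal (no shared terms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["election", "vote", "party", "match", "goal"]
a = vectorise("election vote vote party", vocab)
b = vectorise("election party vote", vocab)
c = vectorise("match goal goal", vocab)
# a and b share all their terms; c shares none with a
assert cosine_similarity(a, b) > cosine_similarity(a, c)
```

With real Doc2vec or fastText vectors, only `vectorise` would change; the cosine comparison stays the same.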
3.4 Similarity Measures

In order to compute similarity between documents, one requires a metric, which, in the case of vector spaces, is usually Euclidean distance or cosine similarity. However, in the case of probability distributions, a similarity metric needs to measure concepts other than physical distance. In the context of similarity of texts, the appropriate approach is to measure the relative information gain between documents. Having read document A, how much more information can a reader get from reading document B?

A logical choice to measure this information gain is the Kullback-Leibler (KL) divergence, which measures the difference between statistical distributions and is related to the Shannon and Wiener information theorems [KL51]. The more similar two documents and their probability distributions are, the less information is gained from one with respect to the other. Another option would be the Jensen-Shannon divergence [Lin91], which also measures the similarity between two probability distributions. However, as the KL divergence is the metric used during training of the particular implementation [HBB10] used in this work, we keep it as the measure of similarity between documents.

The KL divergence as a metric comes with two caveats.

First, the metric is not finite. The ratio of two probability distributions may incur a division by zero. This can be remedied by adding a small amount ε to each component in order to prevent any division by zero. The value of ε then governs the upper numerical limit of the metric.

Second, the KL divergence is an asymmetric measure, which is problematic when referring to true metric spaces as they assume the property of symmetry [Fré06]. However, the symmetry assumption is not universal in other domains, especially when looking at the application of human judgement to similarity [Tve77] and when a sense of hierarchy is subconsciously imposed by humans, as in the example of saying "an ellipse is like a circle" rather than "a circle is like an ellipse". The direction of asymmetry in our similarity space of news articles behaves in a similar way. If we have two articles about climate change, for example, where one is a very detailed piece and the other is more of an overview, the information gained differs depending on the sequence in which the articles are read. Therefore we judge the KL divergence to deliver an adequate measurement of similarity between documents and, specifically, news articles.

To further evaluate the alignment of computed similarity with perceived similarity, we proceed with presenting a prototypical case of human-centric testing.

4 A Prototypical Case

In Section 2 we discussed perception and the subjectivity of interpreting similarity by humans, as well as how machines can compute similarity via different approaches with metrics like the KL divergence (Section 3). In this section we describe a case following the method proposed in Section 2.3 to evaluate the alignment of similarity between humans and machines, which helped us select the optimal model for the purpose of building content similarity recommenders for BBC News articles.

Once the articles have been translated into distributions of topic probabilities, the KL divergence can be used to rank articles by similarity. However, because LDA is an unsupervised algorithm, it is difficult to measure the impact of adjusting the hyperparameters, in contrast to supervised learning algorithms where loss and error provide a helpful constraint. Finding the optimal number of topics is particularly challenging when solely assessing the output topics and the similarity space the model spans.

Again, this is where perceived similarity and human-centric tests show their strength. By comparing the similarity ranking of the model to the ranking performed by people through a variation on triangle tests, we provide a clear means to see which model conforms best to human judgement. This provides a way to deal with the key challenge in using LDA (or similar unsupervised learning methods): how to quantify the impact of tuning the hyperparameter responsible for the number of topics.

4.1 Triangle Tests

We trained three LDA models with 30, 50 and 75 topics respectively, using 70 000 articles from BBC News Online published in 2017. From the set we selected a reference article a1 and computed the KL divergence between the reference and all other articles in the set for one model. We then ordered the results from similar (small KL) to less similar in order to pick a diverse set of articles for testing. Figure 1 displays the distribution of articles ordered by KL divergence between article a1 and the rest of the articles in the corpus using the 30-topic model. Thus, we can select a set of articles (a1–a5) to carry out the triangle tests.

[Figure 1: KL distribution of reference article a1 against the rest of the articles in the corpus]

The next step is to use the selected articles to create a questionnaire with sixteen questions. Each question contains three articles from the set: an anchor article and two comparative articles (A and B) that are located in different positions of the similarity space. The name of the test is drawn from the fact that three articles are always presented, as mentioned in Section 2.2. We asked ten journalists to read each anchor article alongside the two comparative articles. They then indicated which one, in their opinion, was more similar to the anchor article. The questions and the order of the comparative articles were shuffled between participants.

The purpose of the test was to compare the responses of the journalists with the responses of the different LDA models. Each model outputs a different KL value between articles depending on the hyperparameters (principally the number of topics) used. Therefore we expect different LDA models to have differing alignment with human judgement.

In order to evaluate the performance of the different models, we calculated how many answers per participant agreed with the answers given by the model, and therefore which model is best aligned with human interpretation. The results of this evaluation with the 30, 50 and 70 topic models are displayed in Figure 2. When comparing the three models, the 50-topic model shows the best average alignment (70 percent) and the least variance across the different testers. In general, all models show good alignment with human perception and certainly perform better than random selection of the correct answer, which would be right half the time. Additionally, this provides validation that human perception is highly aligned with our chosen similarity metric.

[Figure 2: Percentage of answers aligned between the 30, 50 and 70 topic models and the respondents of the test. The x-axis represents participant number, the y-axis the percentage of responses aligned with each model]

This gives confidence in the results obtained and allows us to proceed with the 50-topic model for a content similarity recommender in production.
5 Towards content similarity recommendations

With the best model selected, we can build an automatic topic scoring pipeline that, for every article published, transforms the article into a topic probability distribution. These distributions are persisted in a database and made available to the recommendation system. Using the KL divergence as the similarity metric, the recommendation system can calculate the similarity between each article pair and thus find the N most similar articles for a given article and serve them as recommendations. The recommended articles may be further ranked and filtered according to business rules.

6 Conclusions and Future Work

The prototypical test shows the potential of this methodology for capturing alignment between human and machine perception of similarity. Additionally, it facilitates the selection of parameters for the LDA model. It has helped us discriminate between the three models and suggests the 50-topic model as the most appropriate. For pragmatism, we selected a limited number of articles and testers; however, we believe these findings validate this type of testing for general use, and we consider this guidance for extracting stronger conclusions given a bigger sample.

In this contribution we have stated the need for measuring content similarity in a news organisation, with the motivation of building content similarity recommenders. We have reviewed methods to measure human and machine perception of similarity and presented a prototype of a human-centric test to evaluate the alignment between computed and human similarity, with the purpose of assisting in the selection of parameters of the topic modelling algorithm LDA. The findings obtained show the strong potential of these types of tests. In the future we plan to apply the LDA model to build more sophisticated recommenders that take into account the reading profiles of users or sequential modelling.

References

[15301] ITU-R Recommendation BS.1534-1. Method for the subjective assessment of intermediate quality level of coding systems, 2001.

[AM99] Cynthia M Aguilar and Douglas L Medin. Asymmetries of comparison. Psychonomic Bulletin & Review, 6(2):328–337, 1999.

[BGJM17] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[BRN99] John S Bradley, R Reich, and SG Norcross. A just noticeable difference in C50 for speech. Applied Acoustics, 58(2):99–108, 1999.

[CL95] Chun-Hsien Chou and Yun-Chin Li. A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile. IEEE Transactions on Circuits and Systems for Video Technology, 5(6):467–476, 1995.

[Fré06] M Maurice Fréchet. Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo (1884–1940), 22(1):1–72, 1906.

[GBG+18] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan, May 2018. European Language Resources Association.

[GDF13] Florent Garcin, Christos Dimitrakakis, and Boi Faltings. Personalized news recommendation with context trees. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 105–112. ACM, 2013.

[HBB10] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.

[HJ68] Errol R Hoffmann and Peter N Joubert. Just noticeable differences in some vehicle handling variables. Human Factors, 10(3):263–272, 1968.

[JGBM17] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April 2017. Association for Computational Linguistics.

[KKGV18] Dhruv Khattar, Vaibhav Kumar, Manish Gupta, and Vasudeva Varma. Neural content-collaborative filtering for news recommendation. NewsIR@ECIR, 2079:45–50, 2018.

[KL51] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

[LDP10] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces, IUI '10, pages 31–40, New York, NY, USA, 2010. ACM.

[Lin91] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.

[LM14] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.

[MCCD13] Tomas Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. ICLR Workshop Track, arXiv:1301.3781, 2013.

[OO85] M O'Mahony and N Odbert. A comparison of sensory difference testing procedures: Sequential sensitivity analysis and aspects of taste adaptation. Journal of Food Science, 50(4):1055–1058, 1985.

[Pan95] Davis Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):60–74, 1995.

[RP16] European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council. REGULATION (EU), 679, 2016.

[TASJ14] Michele Trevisiol, Luca Maria Aiello, Rossano Schifanella, and Alejandro Jaimes. Cold-start news recommendation with domain-dependent browse graph. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 81–88. ACM, 2014.

[Tve77] Amos Tversky. Features of similarity. Psychological Review, 84(4):327, 1977.

[Wal92] Gregory K Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.

[YBDS+17] JM Yearsley, A Barque-Duran, E Scerrati, JA Hampton, and EM Pothos. The triangle inequality constraint in similarity judgments. Progress in Biophysics and Molecular Biology, 2017.