Human-centric evaluation of similarity spaces of news articles

Clara Higuera Cabañes, Michel Schammel, Shirley Ka Kei Yu, Ben Fields
[first name].[last name]@bbc.co.uk
The British Broadcasting Corporation
New Broadcasting House, Portland Place, London, W1A 1AA, United Kingdom

Abstract

In this paper we present a practical approach to evaluating similarity spaces of news articles, guided by human perception. This is motivated by applications that modern news audiences expect, most notably recommender systems. Our approach is laid out and contextualised with a brief background in human similarity measurement and perception. This is complemented with a discussion of computational methods for measuring similarity between news articles. We then go through a prototypical use of the evaluation in a practical setting before we point to future work enabled by this framework.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25 July 2019, published at http://ceur-ws.org

1 Introduction and Motivation

In a modern news organisation, there are a number of functions that depend on computational understanding of produced media. For text-based news articles this typically takes the form of lower-dimensional content similarity. But how do we know that these similarities are reliable? On what basis can we take these computational similarity spaces to be a proxy for human judgement? In this paper we address this question as follows:

• How can we assess human cognition of the similarity of news articles?

• Analogously, what are efficient and effective means of computing similarity between news articles?

• By what means can we use the human cognition of article similarity to select parameters or otherwise tune a computed similarity space?

A typical application that benefits from this sort of human-calibrated similarity space for news articles is an article recommender system. While a classic collaborative filtering approach has been tried within the news domain [LDP10], typical user behaviour makes this approach difficult in practice. In particular, the lifespan of individual articles tends to be short and the item preferences of users are light.

This leads to a situation where in practice a collaborative filtering approach is hampered by the cold-start problem, where lack of preference data negatively impacts the predictive power of the system. To get around this issue, a variety of more domain-specific approaches have been tried [GDF13, TASJ14, KKGV18]. However, these all demand significant levels of analytical effort or otherwise present challenges when scaling to a large global news organisation. A simple way to get around these constraints while still meeting the functional requirements¹ of a recommender system is to generate a similarity space across recently published articles and surface the content most similar to the current article. This assumes that most readers predominantly prefer reading similar content, which is a pragmatic assumption.

¹ Here that means: present a reader of an article with other articles that they have a high likelihood of reading.

In order for this approach of article similarity to be an effective means of recommendation to readers, the similarity space needs to be well aligned with the human perception of similarity across these articles.
To that end, this paper will lay out a methodology for assessing the perception of similarity between news articles (Section 2), methods for computing similarity between news articles (Section 3), and an example case where findings from the first part are used to aid model selection in the second (Section 4). We also briefly discuss how such a content similarity recommender system works in practice before we conclude the paper by considering next steps implied by this work.

2 Human Similarity

Given that our motivation for having a similarity space among news articles is to produce articles that readers perceive as similar, it is critical that we have a means of assessing the similarity of news articles as perceived by people. While it would be convenient to assume that news articles are perceived by people as having objective similarities, there are a number of reasons to work from the assumption that this is not the case. Broadly, human perception of item similarity does not obey the requirements of a well-formed metric space, most notably symmetry [AM99] and the triangle inequality [YBDS+17].

Therefore we look to other domains for useful analogues to our problem of assessing the perceptual difference between objects and a mapping of that into a similarity metric. In particular, we look at assessment methods from two domains: psychophysics and sensory perception.

2.1 Psychophysics

The field of psychophysics is concerned with understanding the interaction between physical phenomena and human cognition of these phenomena, most typically auditory and visual stimuli. One of the most widely known applications from psychophysics is lossy compression, where digital audio or video is reduced in size by discarding portions that are not likely to be perceived by a general audience [Pan95, Wal92]. As a result of these well-established areas of research, this field has mature techniques for measuring human-perceivable difference across transformations or deterioration of an anchor stimulus. The standard practice in auditory settings is called Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) [15301]. This testing framework allows for precise measurement of changes which are or are not generally noticeable while calibrating for individual testers' differences in perception and cognition, though this comes at the expense of a test which can be lengthy and require larger populations of testers than less complicated tests.

2.2 Sensory Perception

A common means of measuring the human ability to differentiate between stimuli that are similar is described in terms of the Just Noticeable Difference (JND). That is, the JND is a unit such that if two stimuli are measurably closer than this JND, the average person will not be able to notice the difference between them. This has been used effectively to understand human perception of a wide variety of things, from speech [BRN99] and colour [CL95] to the handling characteristics of cars [HJ68]. In a news article context the JND is the amount of measurable change between articles before an average reader would consider them different articles.

Serving as a complement to the idea of the JND is the sensory triangle test. In this test three stimuli are presented to an evaluator, with two of them being identical. The evaluator is then asked to identify which of the three stimuli is different from the other two. This process is repeated across a population of evaluators, and if a statistically significant² portion of the population correctly identifies the different stimulus, the difference is taken as perceivable and therefore larger than the JND [OO85].

² Typically a chi-squared test is used, c.f. https://www.sensorysociety.org/knowledge/sspwiki/pages/triangle%20test.aspx

2.3 A Proposed Test

Given the above, we propose the following means of assessing article similarity:

1. Gather a collection of anchor articles from your corpus.

2. For each anchor, select two additional articles for comparison.

3. Present each of these triplets in turn to a human evaluator, asking the evaluator to decide which of the two articles is more similar to the anchor.

Beyond the evaluation process, there is the mechanism for selecting both the anchors and the comparison articles. For these issues much depends on the particulars of the assessment, and to that end we will go through our use of this assessment in Section 4. However, there are some guiding principles to consider in general. Keeping in mind that the goal of the assessment is a human understanding of the similarity space, rather than the analytical configuration of the space, we should seek to select anchors that maximise coverage across the corpus, and we should seek to select comparison articles that we believe to lie at a variety of different levels of similarity from the anchor articles. A straightforward way to bootstrap these selection criteria is to use a best-effort computed similarity and then select items across the space.

By adhering to these principles we should be able to improve our results, though as with many assessments of this type, the larger the number of participants becomes, the stronger the conclusions will be.
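To make the decision rule behind the triangle test concrete, the significance check can be sketched as below. Footnote 2 mentions the chi-squared test conventionally used; for a single odd-one-out question an exact binomial check is an equivalent, simpler alternative. The function names, sample counts, and the 0.05 threshold here are illustrative, not part of the protocol above.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, chance: float) -> float:
    """One-sided probability of seeing at least `successes` correct
    answers out of `trials` if every evaluator were guessing at `chance`."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def difference_is_perceivable(correct: int, total: int,
                              chance: float = 1 / 3,
                              alpha: float = 0.05) -> bool:
    """Triangle test: evaluators pick the odd one out of three stimuli,
    so guessing succeeds with probability 1/3. The difference is taken
    as perceivable (larger than the JND) when the guessing hypothesis
    is rejected at level alpha."""
    return binomial_p_value(correct, total, chance) < alpha
```

With 30 evaluators guessing at a one-in-three rate, around 20 correct identifications comfortably rejects guessing, while 10 (the chance expectation) does not.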
3 Computed Similarity

In order to compute a similarity measure between articles, we first need to derive a computer-readable representation for each document and, second, choose an adequate metric to evaluate the distance between them. There are several algorithms that can be used to construct similarity spaces and perform topic modelling.

3.1 Doc2vec

Word2vec [MCCD13] and its extension Doc2vec [LM14] are embedding algorithms (usually formed of shallow, two-layer neural networks) that construct vector spaces of words based on their frequencies and co-occurrences in the training corpus. The learned mathematical representation can then be used to establish similarities between words using vector algebra. Doc2vec works in a similar way but trains on individual documents rather than words and is thus able to establish similarities between documents rather than just words.

3.2 FastText

Another popular natural language processing library is fastText. Based on a shallow neural network with an embedding layer, fastText can be used in two applications: learning embeddings from a corpus [BGJM17] or document classification [JGBM17]. In the former application, [GBG+18] used the fastText algorithm to generate language models for 157 different languages from Wikipedia data. These pre-trained models can be used to transform documents into vector representations and enable similarity calculations in the same manner as in the Doc2vec case.

3.3 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) [BNJ03] is a generative probabilistic model that represents documents as a mixture of topics expressed as probabilities, with each topic represented by a probability distribution over words. Section 3.4 describes how the similarity between documents can be assessed with this method.

For our use case, we found LDA has a number of advantages:

• The algorithm delivers inspectable topics; as every topic is a probability distribution over words, it is straightforward to determine the most important words contributing to each topic, thus allowing interpretation of the topics.

• Building on the word distributions, the topics associated with a document can easily be traced back to the most salient words in the document. This is a strong step towards explainability, a key requirement under recital 71 of the GDPR [RP16] and a strong tool for recommender monitoring.
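Regardless of which embedding model produces the document vectors, the comparison step is the same vector arithmetic. The sketch below substitutes toy bag-of-words count vectors for learned Doc2vec/fastText embeddings purely so it is self-contained; the vocabulary and documents are invented for illustration.

```python
from collections import Counter
from math import sqrt

def vectorise(text: str, vocabulary: list[str]) -> list[float]:
    """Toy stand-in for a learned embedding: bag-of-words counts
    over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two document vectors: 1.0 for
    identical direction, 0.0 for orthogonal (no shared terms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["election", "vote", "party", "match", "goal"]
a = vectorise("election vote vote party", vocab)
b = vectorise("election party vote", vocab)
c = vectorise("match goal goal", vocab)
# a and b share all their terms; c shares none with a
assert cosine_similarity(a, b) > cosine_similarity(a, c)
```

With real Doc2vec or fastText vectors, only `vectorise` would change; the cosine comparison stays the same.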
3.4 Similarity Measures

In order to compute similarity between documents, one requires a metric, which, in the case of vector spaces, is usually Euclidean distance or cosine similarity. However, in the case of probability distributions, a similarity metric needs to measure concepts other than physical distance. In the context of similarity of texts, the appropriate approach is to measure the relative information gain between documents. Having read document A, how much more information can a reader get from reading document B?

A logical choice to measure this information gain is the Kullback-Leibler (KL) divergence, which measures the difference between statistical distributions and is related to the Shannon and Wiener information theorems [KL51]. The more similar two documents and their probability distributions are, the less information is gained from one with respect to the other. Another option would be the Jensen-Shannon divergence [Lin91], which also measures the similarity between two probability distributions. However, as the KL divergence is the metric used during training of the particular implementation [HBB10] used in this work, we keep it as the measure of similarity between documents.

The KL divergence as a metric comes with two caveats.

First, the metric is not finite. The ratio of two probability distributions may incur a division by zero. This can be remedied by adding a small amount ε to each component in order to prevent any division by zero. The value of ε then governs the upper numerical limit of the metric.

Second, the KL divergence is an asymmetric measure, which is problematic when referring to true metric spaces as they assume the property of symmetry [Fré06]. However, the symmetry assumption is not universal in other domains, especially when looking at the application of human judgement to similarity [Tve77] and when a sense of hierarchy is subconsciously imposed by humans, as in the example of saying "an ellipse is like a circle" rather than "a circle is like an ellipse". The direction of asymmetry in our similarity space of news articles behaves in a similar way. If we have two articles about climate change, for example, where one is a very detailed piece and the other is more of an overview, the information gained differs depending on the sequence in which the articles are read. Therefore we judge the KL divergence to deliver an adequate measurement of similarity between documents and, specifically, news articles.

To further evaluate the alignment of computed similarity with perceived similarity, we proceed with presenting a prototypical case of human-centric testing.

4 A Prototypical Case

In Section 2 we discussed perception and the subjectivity of interpreting similarity by humans, as well as how machines can compute similarity via different approaches with metrics like the KL divergence (Section 3). In this section we describe a case following the method proposed in Section 2.3 to evaluate the alignment of similarity between humans and machines, which helped us select the optimal model for the purpose of building content similarity recommenders for BBC News articles.

Once the articles have been translated into distributions of topic probabilities, the KL divergence can be used to rank articles by similarity. However, because LDA is an unsupervised algorithm, it is difficult to measure the impact of adjusting the hyperparameters, in contrast to supervised learning algorithms where loss and error provide a helpful constraint. Finding the optimal number of topics is particularly challenging when solely assessing the output topics and the similarity space the model spans.

Again, this is where perceived similarity and human-centric tests show their strength. By comparing the similarity ranking of the model to the ranking performed by people through a variation on triangle tests, we provide a clear means to see which model conforms best to human judgement. This provides a way to deal with the key challenge in using LDA (or similar unsupervised learning methods): how to quantify the impact of tuning the hyperparameter responsible for the number of topics.

4.1 Triangle Tests

We trained three LDA models with 30, 50 and 75 topics respectively, using 70 000 articles from BBC News Online published in 2017. From the set we selected a reference article a1 and computed the KL divergence between the reference and all other articles in the set for one model. We then ordered the results from similar (small KL) to less similar in order to pick a diverse set of articles for testing. Figure 1 displays the distribution of articles ordered by KL divergence between article a1 and the rest of the articles in the corpus using the 30-topic model. Thus, we can select a set of articles (a1–a5) to carry out the triangle tests.

[Figure 1: KL distribution of reference article a1 against the rest of the articles in the corpus]

The next step is to use the selected articles to create a questionnaire with sixteen questions. Each question contains three articles from the set: an anchor article and two comparative articles (A and B) that are located in different positions of the similarity space. The name of the test is drawn from the fact that three articles are always presented, as mentioned in Section 2.2. We asked ten journalists to read each anchor article alongside the two comparative articles. They then indicated which one, in their opinion, was more similar to the anchor article. The questions and the order of the comparative articles were shuffled between participants.

The purpose of the test was to compare the responses of the journalists with the responses of the different LDA models. Each model outputs a different KL value between articles depending on the hyperparameters (principally the number of topics) used. Therefore we expect different LDA models to have differing alignment with human judgement.

In order to evaluate the performance of the different models, we calculated how many answers per participant agreed with the answers given by the model, and therefore which model is best aligned with human interpretation. The results of this evaluation with the 30, 50 and 70 topic models are displayed in Figure 2. When comparing the three models, the 50-topic model shows the best average alignment (70 percent) and the least variance across the different testers. In general, all models show good alignment with human perception and certainly perform better than random selection of the correct answer, which would be right half the time. Additionally, this provides validation that human perception is highly aligned with our chosen similarity metric.

[Figure 2: Percentage of answers aligned between the 30, 50 and 70 topic models and the respondents of the test. The x-axis represents participant number, the y-axis the percentage of responses aligned with each model]

This gives confidence in the results obtained and allows us to proceed with the 50-topic model for a content similarity recommender in production.
5 Towards content similarity recommendations

With the best model selected, we can build an automatic topic scoring pipeline that, for every article published, transforms the article into a topic probability distribution. These distributions are persisted in a database and made available to the recommendation system. Using the KL divergence as the similarity metric, the recommendation system can calculate the similarity between each article pair and thus find the N most similar articles for a given article and serve them as recommendations. The recommended articles may be further ranked and filtered according to business rules.

6 Conclusions and Future Work

The prototypical test shows the potential of this methodology for capturing alignment between human and machine perception of similarity. Additionally, it facilitates the selection of parameters for the LDA model. It has helped us discriminate between the three models and suggests the 50-topic model as the most appropriate. For pragmatism, we selected a limited number of articles and testers; however, we believe these findings validate this type of testing for general use, and we consider this guidance for extracting stronger conclusions given a bigger sample.

In this contribution we have stated the need for measuring content similarity in a news organisation, with the motivation of building content similarity recommenders. We have reviewed methods to measure human and machine perception of similarity and presented a prototype of a human-centric test to evaluate the alignment between computed and human similarity, with the purpose of assisting in the selection of parameters of the topic modelling algorithm LDA. The findings obtained show the strong potential of these types of tests. In the future we plan to apply the LDA model to build more sophisticated recommenders that take into account the reading profiles of users or sequential modelling.

References

[15301] ITU-R Recommendation BS.1534-1. Method for the subjective assessment of intermediate quality level of coding systems, 2001.

[AM99] Cynthia M Aguilar and Douglas L Medin. Asymmetries of comparison. Psychonomic Bulletin & Review, 6(2):328–337, 1999.

[BGJM17] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[BRN99] John S Bradley, R Reich, and SG Norcross. A just noticeable difference in C50 for speech. Applied Acoustics, 58(2):99–108, 1999.

[CL95] Chun-Hsien Chou and Yun-Chin Li. A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile. IEEE Transactions on Circuits and Systems for Video Technology, 5(6):467–476, 1995.

[Fré06] M Maurice Fréchet. Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo (1884–1940), 22(1):1–72, 1906.

[GBG+18] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan, May 2018. European Language Resources Association.

[GDF13] Florent Garcin, Christos Dimitrakakis, and Boi Faltings. Personalized news recommendation with context trees. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 105–112. ACM, 2013.

[HBB10] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.

[HJ68] Errol R Hoffmann and Peter N Joubert. Just noticeable differences in some vehicle handling variables. Human Factors, 10(3):263–272, 1968.

[JGBM17] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April 2017. Association for Computational Linguistics.

[KKGV18] Dhruv Khattar, Vaibhav Kumar, Manish Gupta, and Vasudeva Varma. Neural content-collaborative filtering for news recommendation. NewsIR@ECIR, 2079:45–50, 2018.

[KL51] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

[LDP10] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces, IUI '10, pages 31–40, New York, NY, USA, 2010. ACM.

[Lin91] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.

[LM14] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.

[MCCD13] Tomas Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. ICLR Workshop Track, arXiv:1301.3781, 2013.

[OO85] M O'Mahony and N Odbert. A comparison of sensory difference testing procedures: Sequential sensitivity analysis and aspects of taste adaptation. Journal of Food Science, 50(4):1055–1058, 1985.

[Pan95] Davis Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):60–74, 1995.

[RP16] European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council. REGULATION (EU), 679, 2016.

[TASJ14] Michele Trevisiol, Luca Maria Aiello, Rossano Schifanella, and Alejandro Jaimes. Cold-start news recommendation with domain-dependent browse graph. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 81–88. ACM, 2014.

[Tve77] Amos Tversky. Features of similarity. Psychological Review, 84(4):327, 1977.

[Wal92] Gregory K Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.

[YBDS+17] JM Yearsley, A Barque-Duran, E Scerrati, JA Hampton, and EM Pothos. The triangle inequality constraint in similarity judgments. Progress in Biophysics and Molecular Biology, 2017.