Towards a Cross-article Narrative Comparison of News Martino Mensio Harith Alani Alistair Willis martino.mensio@open.ac.uk h.alani@open.ac.uk alistair.willis@open.ac.uk The Open University, United Kingdom Abstract In the world of public misinformation, there are many cases where the information is not false or fabricated, but rather has been manipulated using more subtle techniques such as word replacements, selection of details, omissions and argument distortion. These techniques can have the e↵ect of influencing the reader’s frame of mind towards the events reported. We currently lack the necessary tools to uncover such manip- ulations automatically. In this position paper, we propose an integrated analysis framework and pipeline to identify various narrative signals in news articles; such as structural roles, framing, and subjectivity. By comparing these at the document level and sentence level, it will be possible to highlight di↵erences of narrative techniques used to report the same news events. 1 Introduction Narrative analysis refers to the processing of a piece of text to understand and characterise its structure [Rie93]. Such an analysis could help to distinguish between event reports based on their narrative structure. These are usually reflected through linguistic signals that can be more or less explicit, such as emphasising certain aspects, changing the order in which certain information is presented, or using specific terminology to impose or stress a certain opinion. In the specific case of news articles, their narrative structure usually follows a complex non-chronological sequence, which tends to di↵er from other kinds of narrative that proceed more linearly [ZZBBN19]. It is a choice that is made to “get a good story” [Bel05], and can be exploited to emphasise or introduce non-objective statements or causality relationships between events [Dah10]. To avoid being manipulated, one solution suggested in the literature is to gather information from multiple sources [ABS14, GAR97], and to cross-compare them in order to get a broader view of the event. The same information, for example, may be presented by some sources and omitted by others, or the sequence of events be presented di↵erently to emphasise di↵erent aspects. Therefore, we believe that readers should be made more aware of the narrative and framing embedded in the piece of news they are consuming, and how they compare with those in other articles reporting the same event. Currently, there are hardly any automated tools that o↵er such functionality: the best readers can do is to use news aggregators that show articles grouped by events, but they have to do such comparison on their own. In this position paper, we suggest a framework to automatically highlight the di↵erences in how the same story is presented by di↵erent articles, by cross-comparing their narratives. To this end, the contributions of this paper are: i) integrating several signals characterising the narrative of news; ii) presenting a processing pipeline to link together similar articles at the document and sentence level, integrating the signals identified; and iii) introducing a set of cross-article signals that aim to highlight the di↵erence of narrative techniques applied. Copyright c by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia (eds.): Proceedings of the Text2Story’20 Workshop, Lisbon, Portugal, 14-April-2020, published at http://ceur-ws.org 89 2 Related work In this section, we provide an overview of previous studies in two areas of research. First, the investigation on relationships between news articles which aims to find documents that cover the same information. Second, the detection of narrative linguistic signals, which investigates and characterises several aspects of structure, framing, and subjectivity. For both of them, we gather a set of techniques that enable our approach described in the next Section 3. 2.1 Relationships between news articles There are di↵erent possible types of relationships between news articles, such as similarity (covering the same information), referencing (one is citing another one), and temporal proximity. They can be performed at the document level (e.g., the whole article is similar to another one) or at the sentence level (e.g., the same sentence is corroborated by a sentence in another article [BMTH18]) or even at the paragraph level. Since we are interested in finding articles discussing the same information, we focus on similarity relationships. Other relationships could add interesting features, such as the order of publication which would help to identify which of the articles might have taken inspiration from the other. For the time being, we focus on studying and understanding the role of similarity. At the article-level, there is a wide variety of work that investigates article clustering, and the methods mostly used are Latent Dirichlet Allocation (LDA) or document embedding. LDA [BNJ03] is the most used technique for topic modelling, as it allows the discovery of topics and to group articles accordingly using word distributions. Another technique for grouping articles together is to compute a similarity measure (e.g., cosine similarity) between numeric representations of the documents (TF-IDF [Jon72] or Language Models [DCLT18, CYyK+ 18, YDY+ 19]). We plan to study these models in order to select the one that can efficiently discriminate articles that talk about the same events, even if they use di↵erent linguistics, from articles that may use the same subset of words but talk about di↵erent events. Furthermore, there are works that not only link the articles at a document level, but also investigate in more detail the connections between sentences. In one recent work [BMTH18], groups of similar articles are found, then broken down to pieces of information and analysed to find if these details are corroborated (occurring in multiple documents) or omitted (occurring in other documents of the same group, but not the current one). We aim to use this idea of applying similarity to both article-level and sentence-level, extending it even to the word-level. By doing so, not only we might be able to recognise which sentences appear in multiple documents (with di↵erent degrees of similarity) but also we would be able to identify the specific words that have been changed. However, this set of approaches are limited to bringing to the attention of the reader the linked information pieces with a measure of similarity, without characterising the di↵erences. The reader would then need to evaluate the di↵erences in the role of the sentence, the framing that it implies and how it compares with other sentences in terms of subjectivity. Di↵erent documents may express the same set of details, but give them a di↵erent role (reporting an action, commenting, contextualising, doing a digression, identifying causes and consequences) and use di↵erent words that are semantically similar but may imply a di↵erent framing perspective. For this reason, the next subsection presents a set of narrative linguistic signals that could provide us with the missing features. 2.2 Narrative linguistic signals There is much research on exposing the narrative using linguistic signals [ZZBBN19], with specific words that indicate the structural role, framing and subjectivity of the part of text they belong to. One limitation is that most of such works are applied to single articles, with little comparison between them. On one hand, some research considers the structural role of a sentence in the document (e.g., is it providing some background, the main event, an evaluation). Di↵erent structural roles have been defined in the literature, such as news schema [Bel91], which identifies hierarchical categories (e.g., action, reaction, consequence, context, history), narrative structure [Bel05] (e.g., abstract, orientation, evaluation, complication, resolution), or linguistic signals [ZZBBN19, Mar00]. Such signals could be used to identify the di↵erences between similar sentences with regards to their structural roles in the articles. On the other hand, there is much literature on framing, defined as how a certain story is presented to shape mass opinion [Gof74], the addition to the underlying facts that reflects the sociocultural context and acts as an underlying force to persuade the reader. The work by [GM89] describes a set of framing packages, made 90 of framing devices (e.g., word choice, metaphors, catchphrases, use of contrast, quantification) and reasoning devices (e.g., problem definition, cause, consequence, solution, action). Additionally, the Frame Semantics Theory [Fil06] can be used to recognise lexical units of known frames. By extracting these linguistic signals, we could represent the framing behind a certain piece of text, and there exist di↵erent approaches to extract the listed features [MGB+ 17, GCCZ18, Asg16, STDS17]. In addition to these two characterisations, we can add other signals derived from studies on subjectivity. As found by recent research, in contemporary journalism the line between opinion and facts is blurring more and more [JWJ+ 19]. For this reason, having signals of subjectivity on the document and paragraph-level would be very useful [Liu10]. In this way, each article and each paragraph can be characterised with an indication of subjectivity. All these features have been used in previous research, but as mentioned above, they are mainly applied to single-article analysis. Extending this kind of analysis by taking into consideration the relationships both at the article level and the sentence level would bring a big contribution by providing contrastive signals that would not come up otherwise. 3 Cross-article comparison framework In this section, we propose a description of our comparison framework. We plan to use methods coming from both the research areas identified (document linking and linguistic signals) as a starting point. In order to do so, we propose the following processing pipeline: • preprocessing: documents are retrieved, cleaned up and fragmented into paragraphs and sentences; • narrative features are attached to each document, paragraph and sentence belonging to three main types: structural role using and highlighting the linguistic devices provided by [ZZBBN19]; framing features are extracted (framing and reasoning devices) finding some linguistic representatives from [GM89, Fil06]; subjectivity is computed, and strong word choices are highlighted [Liu10]; • linking: similar articles are found by using document-level similarity measures: in this way it would be possible to find groups of documents that describe the same events; similar sentences and paragraphs are found by sentence-level similarity measures, inside each group of documents: corroborated and omitted sentences are identified [BMTH18]. Figure 1 shows the result of such processing over two articles, where we have several features attached to the sentences, with similar paragraphs across the two articles linked together using a similarity measure [CYyK+ 18]. c a c a TITLE: fa i g bjec i i TITLE: fa i g bjec i i e e BRITAIN ON EDGE Corona irus: NHS told risk("on edge", e pectation, strong, China corona irus: 'Increased likelihood' of objecti e, ho to handle infectious bodies amid "gra e"), uncertaint negati e cases in the UK neutral gra e concerns bug ill hit UK ithin da s certaint (" ill") ("likelihood") ARTICLE: main focus ARTICLE: The 11-page guide as prepared b Public Globall , there are more than 500 con rmed main focus objecti e, Health England and has ad ised GPs to detaining("keep them cases of the irus, hich has killed 18 danger("killed") Background objecti e Action neutral a oid e amining suspected corona irus in closed rooms") people in China. ictims and keep them in closed rooms. But there are no kno n cases in the UK, Mr uncertaint ("no All 31 people across England, Wales, Hancock said, hich as " ell prepared" to fear("fears"), kno n"), objecti e, Scotland and Northern Ireland that had been deal ith an outbreak. Fourteen people in Background risk("deadl "), similarit = 0.57 con dence(" ell neutral tested for the deadl u-like irus ere objecti e Background the UK ha e no been tested for the irus, imminence(" ithin prepared") negati e, but there a fears the UK could Public Health England said. da s") ha e its rst positi e test " ithin da s". Passengers are recei ing ad ice on hat to The Border Force is hunting do n 2,000 do if the fall ill, [...] as it can take da s after e pertise("ad ice", emphasised similarit = 0.64 subjecti e, Action + people ho ha e recentl tra elled from hunt("hunting do n") Action infection before a patient de elops "most important"), language neutral Commentar Wuhan - the epicentre of irus outbreak. s mptoms, so ph sical checks ere justi cation("so") Holida makers ho touched do n at similarit = 0.39 considered less useful. London Heathro from Wuhan ere Action + Mr Hancock said that it as a "rapidl surprise("shocked") subjecti e shocked as the ere simpl handed a Commentar de eloping situation and the number of e pectation("is likel lea et and told to call NHS 111 if the felt ill. subjecti e, deaths and the number of cases is likel to to", "I e pect"), Action The irus has killed 56 people and infected negati e be higher than those that ha e been reporting("said") more than 2,000 globall after it is belie ed con rmed so far [...]". risk("killed", Background / to ha e originated from a meat market in objecti e "infected") Closing "The NHS is read to respond appropriatel Wuhan - a cit ith 11million people in the con dence("is read ") subjecti e Conclusion Hubei pro ince. to an cases that emerge." Figure 1: An example of analysis between two news articles that both talk about the risk of coronavirus spread in the UK. The first one (from The Sun) emphasises the risks from the virus, while the second article (from BBC) is more focused on presenting the UK as ready to face the problem. Each paragraph is characterised with framing, subjectivity and structural signals, and the links between the articles represent the most similar pairs of sentences. 91 This is is the starting point to identify the di↵erences, with a contrastive analysis. We propose here a set of cross-article comparative signals that can bring the narrative analysis a step further: • The main focus of the compared articles is on a di↵erent part or detail of the story: this means that while they are both describing the same broad event, they are trying to emphasise or prioritise two di↵erent aspects. This signal can be computed by looking at the most similar sentence to the article title (proxy of the emphasis), and seeing how it is represented in other documents. • Ordering: the compared articles present the same details, but in a di↵erent order. Re-ordering events tends to be an efficient way of creating implicit cause-e↵ect relationships. To do this comparison, it is sufficient to find the crossovers in the sentence-level connections. • Selection of details: One article is omitting certain details that have been reported by other articles, or is describing events that are corroborated by other sources, or has unique parts that do not occur in other articles [BMTH18]. In addition to seeing which parts are selected or omitted, the narrative analysis can help us to find some insights about them (e.g., the article is omitting subjective statements reported by others, or is describing a background event that others did not include). • The articles are framing the narrative in di↵erent ways from each other. This manifests through comparing linked sentences to observe the di↵erences in terms of framing features: the considered articles are describing the same events but with di↵erent framing and reasoning. One concrete example is the usage of causality: one article may contain causality signposting between a pair of sentences that is absent elsewhere. Or as another example, the usage of specific words can reveal a specific framing: talking about the same detail or entity, the usage of verbs or adjectives may change. For detecting such peculiarities, features as Named Entities and subjectivity may be combined. • The comparison can be also done on the subjectivity of the article: both at the document level (saying that this is an opinion piece, while a similar one is more factual) or at the sentence level, by interweaving this signal with the ones proposed before. From the signals in Figure 1, we can see that the first article pushes the narrative towards risk and other negative frames, to sustain the idea presented in the title “Britain on Edge”. The second article, even though it has a lot of information in common with the first one, is more confident on the preparedness of the National Health Service to face the virus (e.g., confidence, expertise). The extraction of these cross-article signals is the first step to finding possible cases of manipulation. 4 Evaluation The evaluation of this framework needs to be performed at di↵erent levels. Firstly, we need to find a similarity model that performs well both at the article and sentence levels, going beyond the linguistic surface and being able to relate pieces of text that may use di↵erent terms for describing the same events. The evaluation of the similarity measure will be done at the article level using data coming from tools that aggregate articles talking about the same events, such as Google News Headlines1 and AllSides2 as well as research datasets such as NewsAggregator.3 Instead, for the sentence-level similarity, user feedback will be needed to understand when and why a sentence is considered to describe the same detail while we are dealing with manipulations that can be significant. Following that, we would also need to evaluate the whole framework with user studies to understand the relevance, quality and usefulness of the indicators proposed. Currently, and to the best of our knowledge, there are no similar approaches to the task we are addressing in this paper, and hence we are unable to establish comparisons with other baseline approaches from the literature. 5 Discussion Much research exists that address the problem of misinformation. However, the vast majority of such research focuses on distinguishing what is true from what is false, and hence mainly applies to a small subsection of the 1 https://news.google.com/ 2 https://www.allsides.com/story/admin 3 http://archive.ics.uci.edu/ml/datasets/News+Aggregator 92 misinformation ecosystem.4 There is a lack of research on identifying misleading content, false connection and false context. To this end, there is an immediate need for technological solutions to address such cases, where the information is manipulated in a subtle fashion, and thus cannot be easily dismissed as false. We want to reveal the di↵erences in reporting, without declaring that one article contains true or false information, but rather to provide a tool that exposes such diversities. In this paper we proposed a comparative approach that aims at bringing into light the di↵erences in the narratives of news articles, using a set of cross-article narrative signals. These signals only exist when multiple documents are compared, in contrast to single-article ones that already exist. With this method, we aim to reveal the framing intentions of the writers, and making them more evident and comparable. This analysis may be useful for empowering users to form a critical view of pieces of news they are consuming, to find missing pieces that have been omitted and to see the same information presented with a di↵erent framing by di↵erent articles and sources. Acknowledgements This work is partially supported by EU H2020 Project Co-Inform (grant no. 770302). References [ABS14] Øistein Anmarkrud, Ivar Bråten, and Helge I Strømsø. Multiple-documents literacy: Strategic pro- cessing, source awareness, and argumentation when reading multiple conflicting documents. Learning and Individual Di↵erences, 30:64–76, 2014. [Asg16] Nabiha Asghar. Automatic extraction of causal relations from natural language texts: a compre- hensive survey. arXiv preprint arXiv:1605.07895, 2016. [Bel91] Allan Bell. The language of news media. Blackwell Oxford, 1991. [Bel05] Allan Bell. News stories as narratives. The language of time: a reader, page 397, 2005. [BMTH18] Dimitrios Bountouridis, Mónica Marrero, Nava Tintarev, and Claudia Hau↵. Explaining credibility in news articles using cross-referencing. In SIGIR workshop on ExplainAble Recommendation and Search (EARS), 2018. [BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003. [CYyK+ 18] Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Con- stant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018. [Dah10] Michael F Dahlstrom. The role of causality in information acceptance in narratives: An example from science communication. Communication Research, 37(6):857–875, 2010. [DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [Fil06] Charles J Fillmore. Frame semantics. Cognitive linguistics: Basic readings, 34:373–400, 2006. [GAR97] John T Guthrie, Solomon Alao, and Jennifer M Rinehart. Literacy issues in focus: Engagement in reading for young adolescents. Journal of Adolescent & Adult Literacy, 40(6):438–446, 1997. [GCCZ18] Ge Gao, Eunsol Choi, Yejin Choi, and Luke Zettlemoyer. Neural metaphor detection in context. arXiv preprint arXiv:1808.09653, 2018. [GM89] William A Gamson and Andre Modigliani. Media discourse and public opinion on nuclear power: A constructionist approach. American journal of sociology, 95(1):1–37, 1989. 4 https://firstdraftnews.org/latest/fake-news-complicated/ 93 [Gof74] Erving Go↵man. Frame analysis: An essay on the organization of experience. Harvard University Press, 1974. [Jon72] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 1972. [JWJ+ 19] Kavanagh Jennifer, Marcellino William, Blake Jonathan, Smith Shawn, Davenport Steven, and Tebeka Mahlet Gizaw. News in a Digital Age: Comparing the Presentation of News Information over Time and Across Media Platforms. Rand Corporation, 2019. [Liu10] Bing Liu. Sentiment analysis and subjectivity. In Handbook of Natural Language Processing, Second Edition, pages 627–666. CRC Press Book, 2010. [Mar00] Daniel Marcu. The theory and practice of discourse parsing and summarization. MIT press, 2000. [MGB+ 17] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Saptarshi Ghosh. Overview of the fire 2017 irled track: Information retrieval from legal documents. In FIRE (Working Notes), pages 63–68, 2017. [Rie93] Catherine Kohler Riessman. Narrative analysis, volume 30. Sage, 1993. [STDS17] Swabha Swayamdipta, Sam Thomson, Chris Dyer, and Noah A. Smith. Frame-Semantic Parsing with Softmax-Margin Segmental RNNs and a Syntactic Sca↵old. arXiv preprint arXiv:1706.09528, 2017. [YDY+ 19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764, 2019. [ZZBBN19] Iqra Zahid, Hao Zhang, Frank Boons, and Riza Batista-Navarro. Towards the automatic analysis of the structure of news stories. In Text2Story@ ECIR, pages 71–79, 2019. 94