Data Challenges in Disinformation Diffusion Analysis (Abstract)

Paolo Papotti
papotti@eurecom.fr
EURECOM

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 THE NEED FOR BETTER DIFFUSION NETWORKS

Social media enable fast and widespread dissemination of information that can be exploited to effectively spread disinformation by bad actors [1]. We refer to disinformation as the malicious and coordinated spread of inaccurate content for the manipulation of narratives¹. It has been shown that social media disinformation has effectively reached millions of people in state-sponsored campaigns². Several computational solutions have been proposed for the identification of coordinated campaigns on a single platform [12]. They study how content is disseminated across a network of inter-connected users. However, two main practical challenges limit the impact of such approaches.

First, existing approaches focus on a single source, such as Twitter or Reddit. Unfortunately, misinformation campaigns span multiple platforms, and there is a recognized need to jointly analyze the diffusion of content across different sources, such as social networks, online forums (e.g., Reddit), and traditional news outlets (e.g., comments on reputable sites).

Figure 1: Coordinated campaign for the same content across three social platforms. Nodes with the same color denote the same user; dashed arrows denote manipulated content.

Second, the content diffusion graphs that are currently generated from social network APIs are limited in quality. For example, only the information about content re-posting (e.g., re-tweets) by a user is directly provided. But information is also disseminated by manipulating the original content to add bias, "evidence", or propaganda material. Moreover, the fine granularity of the re-posting is not available, with the recognized problem of the star effect for re-tweets that can heavily degrade the quality of the network model [13].

Consider the example in Figure 1, which shows the coordinated sharing of the same initial piece of content (say, a textual news article) by three users over different platforms. With the current infrastructure and APIs, a journalist or a fact-checker willing to study the diffusion network would look at each network in isolation. S/he would be able to follow content across users (nodes) only when they re-post explicitly (full edges across nodes). In this example, the information would not be enough for the early identification of the coordinated campaign started by the three users. Looking at only one source with limited information does not enable the analytics, neither in terms of scope nor evidence, that we need to identify and understand how false and biased content is used in online campaigns [12].

To overcome this limitation, recent approaches explore evidence across users and platforms, such as coordinated link sharing [7]. While this signal has proven to be useful, we believe it is just one example of the richer kind of metadata that is needed for better diffusion networks.

In fact, to tackle the first challenge, diffusion networks should be heterogeneous, covering multiple platforms, with the ability to recognize the same content and the same users across services. In Figure 1, users that refer to the same real-world person are annotated with nodes having the same color. Also, to handle the second challenge, the edges should be typed with fine-grained metadata that models different actions in the spread of the content. We believe that data here plays a role as important as the algorithms used for the analysis, and therefore more attention is deserved by the problem of creating such richer diffusion networks. Their creation can lead to better identification of coordinated efforts [11] and ultimately allow the analysis of disinformation campaigns in terms of actors, space, and time.

The goal is therefore to develop methods for modeling and creating rich diffusion networks from the existing platform APIs. The resulting networks can be exploited to assist users, such as fact-checkers and journalists, in

(1) monitoring sources at scale and recognizing misleading information (in terms of false or biased textual content) on social networks and forum websites;
(2) tracking the spread and diffusion of the content in terms of time and actors;
(3) generating visualizations that support the fight against misinformation and related literacy efforts.

This network generation is indeed challenging, as the desired metadata is not available and is hard to profile automatically in an accurate way. We discuss next two research directions that we identify as critical to tackle these challenges.

2 RESEARCH DIRECTIONS

The goal is to develop methods for the automatic modeling of content manipulation and diffusion across time and different media sources, such as social networks, forums, and news outlets. Not only do we want the diffusion graph for a given content to span sources and be very well described in terms of information, we also want it (i) to preserve precisely the provenance of the data (who created and shared it, how and when) and (ii) to be as automatic as possible in its creation, both to handle the Web scale and to not put additional burden on the users. There are therefore several challenges that we need to overcome:

• different sources do not contain any readily available information to connect users/content across networks, and the automatic matching is a difficult task in both cases;
• labeling the content in terms of being false, manipulated, or biased requires deep understanding of the language and of the reference and background information;
• Web scale implies massive ingestion from heterogeneous data sources, but we would prefer tools that can be used by end users on their machines for confidentiality;
• support for different languages is needed, as we aim at helping users across different countries.

Given the challenges above, a natural first line of work is to conduct data integration research to generate a unified representation from heterogeneous, non-aligned sources. A second line of work is to deploy natural language processing (NLP) techniques to profile the content and enrich the graph with typed nodes and edges.

Data integration. In the first line of research, the aim is the online creation of a dissemination network for a given textual content. Given a textual article, for example, the first task is the identification of its citations and appearances across sources (online articles, boards in forums, social posts) and time. This is not trivial, as one requirement is to go beyond the identification of content by links, which act as unique identifiers. For this goal, one promising direction is to exploit the text-matching literature [14] to identify also manipulated texts that express the original input content. The goal is to have one diffusion network, as in Figure 1, for every given content to analyze, such as a web page, a social message post, or a generic textual claim. The linking and merging of actors across sources, i.e., nodes in the graphs, is also important. This can be modeled as an entity resolution problem from a data integration perspective, for example by using deep learning techniques [4, 15]. However, the task is especially challenging in real settings, where we drop assumptions about trusted information about the user accounts.

Example 1. Consider a textual article A about a new vaccine that circulates on social platform 1. Existing APIs allow the tracing of the diffusion of the specific content A on platform 1 across users u_1^1, ..., u_n^1, but the same content may be circulating in a different form (A') on a different platform 2 across users u_1^2, ..., u_m^2. We aim at identifying that the two posts refer to the same content (A = A') and at matching the subset of users that are sharing the article across the two networks (e.g., u_3^1 = u_6^2).

Metadata from text. Existing NLP tools should be extended and integrated to characterize the nature of the interactions across actors w.r.t. the specific content. This can lead to labelling the edges in the graph with information (metadata) about the interaction between the nodes (actors). Possible metadata for such edges include the nature of the connection between two users, whether it is based on friendship or topic affinity [3], whether a node is likely to be a bot [6], and whether the content has been manipulated by inserting false claims or bias in the language [5, 10]. In this line of research, it seems promising to exploit both linguistic analysis of the text and external knowledge. The latter could be modeled as reference information in relational datasets [10], knowledge graphs [2], or fact-checking corpora [8, 14]. Recent results show that transformer-based language models and query generation techniques can automatically detect text containing false claims³ and therefore provide valuable metadata to enrich the network.

Example 2. Consider again the article from Example 1. When it is shared across users, some of them introduce incorrect statistics about its impact ("it works only for young people"), or facts that are not supported by any source ("it will cost 100$ per dose"). We aim to enrich the network edges by recognizing how the content goes from its form A to a new form A* when it is shared by a certain user u.

We believe that an effective solution to the problem of creating diffusion networks for textual content across heterogeneous sources would enable better detection of disinformation campaigns. The resulting graph with typed nodes (persons, organizations) and typed relationships (copy or manipulation in terms of content or form) can then be analyzed with existing methods, such as clustering and geometric deep learning [9, 12], or with novel methods that take full advantage of the new information and better identify emerging coordinated campaigns.

¹ The observations in this work also apply to misinformation, where actors spread incorrect content unintentionally.
² E.g., https://transparency.twitter.com/en/reports/information-operations.html
³ E.g., https://coronacheck.eurecom.fr

REFERENCES
[1] 2020. Cross-platform disinformation campaigns: lessons learned and next steps. The Harvard Kennedy School (HKS) Misinformation Review (2020). https://doi.org/10.37016/mr-2020-002
[2] Naser Ahmadi, Joohyung Lee, Paolo Papotti, and Mohammed Saeed. 2019. Explainable Fact Checking with Probabilistic Answer Set Programming. In TTO.
[3] Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco. 2014. Who to follow and why: link prediction with explanations. In SIGKDD. ACM, 1266–1275.
[4] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In SIGMOD. ACM, 1335–1349.
[5] Minje Choi, Luca Maria Aiello, Krisztián Zsolt Varga, and Daniele Quercia. 2020. Ten Social Dimensions of Conversations and Relationships. In WWW. ACM / IW3C2, 1514–1525.
[6] Emilio Ferrara, Onur Varol, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96–104.
[7] Fabio Giglietto, Nicola Righetti, Luca Rossi, and Giada Marino. 2020. Coordinated Link Sharing Behavior as a Signal to Surface Sources of Problematic Information on Facebook. ACM.
[8] Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. 2017. Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by ClaimBuster. In KDD.
[9] Sameera Horawalavithana, Kin Wai Ng, and Adriana Iamnitchi. 2020. Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets. In Social, Cultural, and Behavioral Modeling. 235–244.
[10] Georgios Karagiannis, Mohammed Saeed, Paolo Papotti, and Immanuel Trummer. 2020. Scrutinizer: A Mixed-Initiative Approach to Large-Scale, Data-Driven Claim Verification. Proc. VLDB Endow. 13, 11 (2020), 2508–2521.
[11] Franziska B. Keller, David Schoch, Sebastian Stier, and JungHwan Yang. 2020. Political Astroturfing on Twitter: How to Coordinate a Disinformation Campaign. Political Communication 37, 2 (2020), 256–280. https://doi.org/10.1080/10584609.2019.1661888
[12] Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, and Michael M. Bronstein. 2019. Fake News Detection on Social Media using Geometric Deep Learning. CoRR abs/1902.06673 (2019).
[13] Francesco Pierri, Carlo Piccardi, and Stefano Ceri. 2020. Topology comparison of Twitter diffusion networks reliably reveals disinformation news. Sci. Rep. 10 (2020). https://doi.org/10.1038/s41598-020-58166-5
[14] Shaden Shaar, Nikolay Babulkov, Giovanni Da San Martino, and Preslav Nakov. 2020. That is a Known Lie: Detecting Previously Fact-Checked Claims. In ACL. 3607–3618.
[15] Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan. 2020. Data Curation with Deep Learning. In EDBT. 277–286.
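The content-matching task of Example 1, deciding whether a post A' on a second platform is a reworded form of an article A, can be illustrated with a minimal sketch. The bag-of-words representation, the cosine-similarity heuristic, the 0.6 threshold, and all function names below are illustrative assumptions; the approaches the paper points to (text matching [14], transformer-based models) would use learned representations instead.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase word counts as a crude bag-of-words representation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def same_content(post_a: str, post_b: str, threshold: float = 0.6) -> bool:
    """Heuristic test that two posts spread the same underlying content (A = A')."""
    return cosine_similarity(tokenize(post_a), tokenize(post_b)) >= threshold

a = "The new vaccine is 90% effective according to the trial results."
a_prime = "Trial results: the new vaccine is 90% effective, but it will cost 100$ per dose."
print(same_content(a, a_prime))  # → True: high lexical overlap despite the added claim
```

This sketch shows why link identifiers are insufficient: the two posts share no URL, yet their lexical overlap already signals a likely match, which a learned matcher would confirm or reject more robustly.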
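Example 2 asks to recognize how content goes from its form A to a manipulated form A*. A first rough signal, sketched here under simplifying assumptions (sentence-level exact matching; real systems would rely on claim detection [8] and verification [10]), is the set of sentences that the sharing user added:

```python
import re

def added_segments(original: str, shared: str) -> list:
    """Sentences in the shared version that do not appear in the original:
    candidate manipulations to verify against reference sources."""
    split = lambda t: {s.strip() for s in re.split(r"[.!?]+", t) if s.strip()}
    return sorted(split(shared) - split(original))

a = "The new vaccine is 90% effective. Distribution starts in May."
a_star = ("The new vaccine is 90% effective. "
          "It works only for young people. Distribution starts in May.")
print(added_segments(a, a_star))  # → ['It works only for young people']
```

The returned segments would then be passed to a fact-checking component, and the verdict attached as metadata on the edge from the original poster to the manipulating user.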
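The heterogeneous diffusion network advocated in Section 1 — nodes for users resolved across platforms, edges typed with fine-grained actions — could be represented as in the following sketch. All class and field names are assumptions chosen for illustration, not a proposed schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Actor:
    """A resolved real-world user; `accounts` lists (platform, account_id) pairs,
    the output of the entity resolution step (e.g., u_3^1 = u_6^2 in Example 1)."""
    actor_id: str
    accounts: tuple

@dataclass
class DiffusionEdge:
    """A typed sharing action: who re-shared whose content, where, when, and how."""
    source: str            # actor_id of the original poster
    target: str            # actor_id of the re-poster
    platform: str
    timestamp: str         # ISO-8601
    action: str            # "repost" (full arrow) | "manipulation" (dashed arrow)
    manipulation: str = "" # e.g., an unsupported claim added to the text

@dataclass
class DiffusionNetwork:
    actors: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_actor(self, actor: Actor) -> None:
        self.actors[actor.actor_id] = actor

    def manipulated_shares(self) -> list:
        """Edges where the content was altered (dashed arrows in Figure 1)."""
        return [e for e in self.edges if e.action == "manipulation"]

net = DiffusionNetwork()
net.add_actor(Actor("a1", (("platform1", "u3"), ("platform2", "u6"))))
net.add_actor(Actor("a2", (("platform2", "u1"),)))
net.edges.append(DiffusionEdge("a1", "a2", "platform2", "2021-03-23T10:00:00",
                               "manipulation", "it will cost 100$ per dose"))
print(len(net.manipulated_shares()))  # → 1
```

A graph in this shape, with typed nodes and typed edges, is what the clustering and geometric deep learning methods cited in the text [9, 12] would then consume.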