Data Challenges in Disinformation Diffusion Analysis (Abstract)

Paolo Papotti
papotti@eurecom.fr
EURECOM

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 THE NEED FOR BETTER DIFFUSION NETWORKS

Social media enable fast and widespread dissemination of information that can be exploited to effectively spread disinformation by bad actors [1]. We refer to disinformation as the malicious and coordinated spread of inaccurate content for the manipulation of narratives¹. It has been shown that social media disinformation has effectively reached millions of people in state-sponsored campaigns². Several computational solutions have been proposed for the identification of coordinated campaigns on a single platform [12]. They study how content is disseminated across a network of inter-connected users. However, two main practical challenges limit the impact of such approaches.

First, existing approaches focus on a single source, such as Twitter or Reddit. Unfortunately, misinformation campaigns span multiple platforms, and there is a recognized need to jointly analyze the diffusion of content across different sources, such as social networks, online forums (e.g., Reddit), and traditional news outlets (e.g., comments on reputable sites).

Figure 1: Coordinated campaign for the same content across three social platforms. Nodes with the same color denote the same user; dashed arrows denote manipulated content.

Second, the content diffusion graphs that are currently generated from social network APIs are limited in quality. For example, only the information about content re-posting (e.g., re-tweets) by a user is directly provided. But information is also disseminated by manipulating the original content to add bias, "evidence", or propaganda material. Moreover, the fine granularity of the re-posting is not available, with the recognized problem of the star effect for re-tweets that can heavily degrade the quality of the network model [13].

Consider the example in Figure 1, which shows the coordinated sharing of the same initial piece of content (say, a textual news article) by three users over different platforms. With the current infrastructure and APIs, a journalist or a fact-checker willing to study the diffusion network would look at each network in isolation. S/he would be able to follow content across users (nodes) only when they re-post explicitly (full edges across nodes). In this example, the information would not be enough for the early identification of the coordinated campaign started by the three users. Looking at only one source with limited information does not enable the analytics, neither in terms of scope nor evidence, that we need to identify and understand how false and biased content is used in online campaigns [12].

To overcome this limitation, recent approaches explore evidence across users and platforms, such as coordinated link sharing [7]. While this signal has proven to be useful, we believe it is just one example of the richer kind of metadata that is needed for better diffusion networks.

In fact, to tackle the first challenge, diffusion networks should be heterogeneous, covering multiple platforms, with the ability to recognize the same content and the same users across services. In Figure 1, users that refer to the same real-world person are annotated with nodes having the same color. Also, to handle the second challenge, the edges should be typed with fine-grained metadata that models different actions in the spread of the content. We believe that data here plays a role as important as the algorithms used for the analysis, and therefore more attention is deserved by the problem of creating such richer diffusion networks. Their creation can lead to better identification of coordinated efforts [11] and ultimately allow the analysis of disinformation campaigns in terms of actors, space, and time.

The goal is therefore to develop methods for modeling and creating rich diffusion networks from the existing platform APIs. The resulting networks can be exploited to assist users, such as fact-checkers and journalists, in

(1) monitoring sources at scale and recognizing misleading information (in terms of false or biased textual content) on social networks and forum websites;
(2) tracking the spread and diffusion of the content in terms of time and actors;
(3) generating visualizations that support the fight against misinformation and related literacy efforts.

This network generation is indeed challenging, as the desired metadata is not available and is hard to profile automatically in an accurate way. We discuss next two research directions that we identify as critical to tackle these challenges.

2 RESEARCH DIRECTIONS

The goal is to develop methods for the automatic modeling of content manipulation and diffusion across time and different media sources, such as social networks, forums, and news outlets. Not only do we want the diffusion graph for a given content to span sources and be very well described in terms of information, we also want it (i) to preserve precisely the provenance of the data (who created and shared it, how and when) and (ii) to be as automatic as possible in its creation, both to handle the Web scale and to not put additional burden on the users. There are therefore several challenges that we need to overcome:

• different sources do not contain any readily available information to connect users/content across networks, and the automatic matching is a difficult task in both cases;
• labeling the content in terms of being false, manipulated, or biased requires deep understanding of the language and of the reference and background information;
• Web scale implies massive ingestion from heterogeneous data sources, but we would prefer tools that can be used by end users on their machines for confidentiality;
• support for different languages is needed, as we aim at helping users across different countries.

Given the challenges above, a natural first line of work is to conduct data integration research to generate a unified representation from heterogeneous, non-aligned sources. A second line of work is to deploy natural language processing (NLP) techniques to profile the content and enrich the graph with typed nodes and edges.

Data integration. In the first line of research, the aim is the online creation of a dissemination network for a given textual content. Given a textual article, for example, the first task is the identification of its citations and appearances across sources (online articles, boards in forums, social posts) and time. This is not trivial, as one requirement is to go beyond the identification of content by links, which act as unique identifiers. For this goal, one promising direction is to exploit the text-matching literature [14] to identify also manipulated texts that express the original input content. The goal is to have one diffusion network, as in Figure 1, for every given content to analyze, such as a web page, a social message post, or a generic textual claim. The linking and merging of actors across sources, i.e., nodes in the graphs, is also important. This can be modeled as an entity resolution problem from a data integration perspective, for example by using deep learning techniques [4, 15]. However, the task is especially challenging in real settings, where we drop assumptions about trusted information about the user accounts.

Example 1. Consider a textual article A about a new vaccine that circulates on social platform 1. Existing APIs allow the tracing of the diffusion of the specific content A on platform 1 across users u_1^1, ..., u_n^1, but the same content may be circulating in a different form (A') on a different platform 2 across users u_1^2, ..., u_m^2. We aim at identifying that the two posts refer to the same content (A = A') and at matching the subset of users that are sharing the article across the two networks (e.g., u_3^1 = u_6^2).

Metadata from text. Existing NLP tools should be extended and integrated to characterize the nature of the interactions across actors w.r.t. the specific content. This can lead to labelling the edges in the graph with information (metadata) about the interaction between the nodes (actors). Possible metadata for such edges include the nature of the connection between two users, whether it is based on friendship or topic affinity [3], whether a node is likely to be a bot [6], and whether the content has been manipulated by inserting false claims or bias in the language [5, 10]. In this line of research, it seems promising to exploit both linguistic analysis of the text and external knowledge. The latter could be modeled as reference information in relational datasets [10], knowledge graphs [2], or fact-checking corpora [8, 14]. Recent results show that transformer-based language models and query generation techniques can automatically detect text containing false claims³ and therefore provide valuable metadata to enrich the network.

Example 2. Consider again the article from Example 1. When it is shared across users, some of them introduce incorrect statistics about its impact ("it works only for young people"), or facts that are not supported by any source ("it will cost 100$ per dose"). We aim to enrich the network edges by recognizing how the content goes from its form A to a new form A* when it is shared by a certain user u.

We believe that an effective solution to the problem of creating diffusion networks for textual content across heterogeneous sources would enable better detection of disinformation campaigns. The resulting graph with typed nodes (persons, organizations) and typed relationships (copy or manipulation in terms of content or form) can then be analyzed with existing methods, such as clustering and geometric deep learning [9, 12], or with novel methods that take full advantage of the new information and better identify emerging coordinated campaigns.

¹ The observations in this work also apply to misinformation, where actors spread incorrect content unintentionally.
² E.g., https://transparency.twitter.com/en/reports/information-operations.html
³ E.g., https://coronacheck.eurecom.fr

REFERENCES
[1] 2020. Cross-platform disinformation campaigns: lessons learned and next steps. The Harvard Kennedy School (HKS) Misinformation Review (2020). https://doi.org/10.37016/mr-2020-002
[2] Naser Ahmadi, Joohyung Lee, Paolo Papotti, and Mohammed Saeed. 2019. Explainable Fact Checking with Probabilistic Answer Set Programming. In TTO.
[3] Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco. 2014. Who to follow and why: link prediction with explanations. In SIGKDD. ACM, 1266–1275.
[4] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In SIGMOD. ACM, 1335–1349.
[5] Minje Choi, Luca Maria Aiello, Krisztián Zsolt Varga, and Daniele Quercia. 2020. Ten Social Dimensions of Conversations and Relationships. In WWW. ACM / IW3C2, 1514–1525.
[6] Emilio Ferrara, Onur Varol, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96–104.
[7] Fabio Giglietto, Nicola Righetti, Luca Rossi, and Giada Marino. 2020. Coordinated Link Sharing Behavior as a Signal to Surface Sources of Problematic Information on Facebook. ACM.
[8] Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. 2017. Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by ClaimBuster. In KDD.
[9] Sameera Horawalavithana, Kin Wai Ng, and Adriana Iamnitchi. 2020. Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets. In Social, Cultural, and Behavioral Modeling. 235–244.
[10] Georgios Karagiannis, Mohammed Saeed, Paolo Papotti, and Immanuel Trummer. 2020. Scrutinizer: A Mixed-Initiative Approach to Large-Scale, Data-Driven Claim Verification. Proc. VLDB Endow. 13, 11 (2020), 2508–2521.
[11] Franziska B. Keller, David Schoch, Sebastian Stier, and JungHwan Yang. 2020. Political Astroturfing on Twitter: How to Coordinate a Disinformation Campaign. Political Communication 37, 2 (2020), 256–280. https://doi.org/10.1080/10584609.2019.1661888
[12] Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, and Michael M. Bronstein. 2019. Fake News Detection on Social Media using Geometric Deep Learning. CoRR abs/1902.06673 (2019).
[13] Francesco Pierri, Carlo Piccardi, and Stefano Ceri. 2020. Topology comparison of Twitter diffusion networks reliably reveals disinformation news. Sci. Rep. 10 (2020). https://doi.org/10.1038/s41598-020-58166-5
[14] Shaden Shaar, Nikolay Babulkov, Giovanni Da San Martino, and Preslav Nakov. 2020. That is a Known Lie: Detecting Previously Fact-Checked Claims. In ACL. 3607–3618.
[15] Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan. 2020. Data Curation with Deep Learning. In EDBT. 277–286.
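The content-matching task of Example 1, deciding whether a post A' on a second platform is a reworded form of an article A, can be illustrated with a minimal sketch. The bag-of-words representation, the cosine-similarity heuristic, the 0.6 threshold, and all function names below are illustrative assumptions; the approaches the paper points to (text matching [14], transformer-based models) would use learned representations instead.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase word counts as a crude bag-of-words representation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def same_content(post_a: str, post_b: str, threshold: float = 0.6) -> bool:
    """Heuristic test that two posts spread the same underlying content (A = A')."""
    return cosine_similarity(tokenize(post_a), tokenize(post_b)) >= threshold

a = "The new vaccine is 90% effective according to the trial results."
a_prime = "Trial results: the new vaccine is 90% effective, but it will cost 100$ per dose."
print(same_content(a, a_prime))  # → True: high lexical overlap despite the added claim
```

This sketch shows why link identifiers are insufficient: the two posts share no URL, yet their lexical overlap already signals a likely match, which a learned matcher would confirm or reject more robustly.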
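Example 2 asks to recognize how content goes from its form A to a manipulated form A*. A first rough signal, sketched here under simplifying assumptions (sentence-level exact matching; real systems would rely on claim detection [8] and verification [10]), is the set of sentences that the sharing user added:

```python
import re

def added_segments(original: str, shared: str) -> list:
    """Sentences in the shared version that do not appear in the original:
    candidate manipulations to verify against reference sources."""
    split = lambda t: {s.strip() for s in re.split(r"[.!?]+", t) if s.strip()}
    return sorted(split(shared) - split(original))

a = "The new vaccine is 90% effective. Distribution starts in May."
a_star = ("The new vaccine is 90% effective. "
          "It works only for young people. Distribution starts in May.")
print(added_segments(a, a_star))  # → ['It works only for young people']
```

The returned segments would then be passed to a fact-checking component, and the verdict attached as metadata on the edge from the original poster to the manipulating user.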
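The heterogeneous diffusion network advocated in Section 1 — nodes for users resolved across platforms, edges typed with fine-grained actions — could be represented as in the following sketch. All class and field names are assumptions chosen for illustration, not a proposed schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Actor:
    """A resolved real-world user; `accounts` lists (platform, account_id) pairs,
    the output of the entity resolution step (e.g., u_3^1 = u_6^2 in Example 1)."""
    actor_id: str
    accounts: tuple

@dataclass
class DiffusionEdge:
    """A typed sharing action: who re-shared whose content, where, when, and how."""
    source: str            # actor_id of the original poster
    target: str            # actor_id of the re-poster
    platform: str
    timestamp: str         # ISO-8601
    action: str            # "repost" (full arrow) | "manipulation" (dashed arrow)
    manipulation: str = "" # e.g., an unsupported claim added to the text

@dataclass
class DiffusionNetwork:
    actors: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_actor(self, actor: Actor) -> None:
        self.actors[actor.actor_id] = actor

    def manipulated_shares(self) -> list:
        """Edges where the content was altered (dashed arrows in Figure 1)."""
        return [e for e in self.edges if e.action == "manipulation"]

net = DiffusionNetwork()
net.add_actor(Actor("a1", (("platform1", "u3"), ("platform2", "u6"))))
net.add_actor(Actor("a2", (("platform2", "u1"),)))
net.edges.append(DiffusionEdge("a1", "a2", "platform2", "2021-03-23T10:00:00",
                               "manipulation", "it will cost 100$ per dose"))
print(len(net.manipulated_shares()))  # → 1
```

A graph in this shape, with typed nodes and typed edges, is what the clustering and geometric deep learning methods cited in the text [9, 12] would then consume.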