Life and Death of Fakes: on Data Persistence for Manipulative Social Media Content

Olga Uryupina
Department of Information Engineering and Computer Science, University of Trento
uryupina@gmail.com (O. Uryupina)

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy

Abstract
This work presents an in-depth investigation of data decay for publicly fact-checked online content. We monitor compromised posts on major social media platforms (Facebook, Instagram, Twitter, TikTok) for one year, tracking the changes in their visibility and availability. We show that data persistence is an important issue for manipulative content, on a larger scale than previously reported for online content in general. Our findings also suggest a (much) higher data decay rate for the platforms suffering most from online disinformation, indicating an important area for data collection/preservation.

Keywords
fact checking, replicability

1. Introduction

Manipulative online content is rapidly becoming a more and more pervasive issue for modern society: by deliberately biasing our information flow, unscrupulous content writers can and do affect our emotional state, beliefs, reasoning and both our online and offline behaviour. It is therefore not surprising that this has become a central issue for various stakeholders, from journalists and fact-checkers to NLP researchers both in academia and in industry. Given the current rapid growth in data-driven studies of manipulative content, it is essential to have a reliable overview of data persistence issues in this specific domain: compromised content is often very dynamic and changes or becomes unavailable over time, raising reproducibility concerns.

From the readers' perspective, the visibility of compromised content over time directly affects its impact: a removed or strongly downgraded document is unlikely to be read or recovered and cannot be used to promote or support other fakes. From the research and development perspective, data persistence is crucial for benchmarking, ensuring fair comparison between models, as well as simply providing them with high-quality real-life training and testing examples.

Starting already a decade ago, NLP benchmarking campaign studies [1] report data persistence issues for online content as used in various shared tasks, with around 10% of entries missing compared to the original dataset (gold standard). These shared tasks, however, are based almost exclusively on Twitter and do not focus specifically on compromised content. We believe that a large proportion of manipulative content is created on purpose by professional copywriters who might have different goals and motivations to keep their texts online (e.g., for click-bait purposes) or remove them (e.g., to reduce the reputation loss from being exposed as unreliable).

Our work focuses specifically on the lifespan of fact-checked compromised content. We go beyond the naive binary present vs. removed view, studying more nuanced cases as well. In particular, we track compromised online posts over time for the appearance of explicit platform-specific reliability labels (e.g. "out of context"), obfuscation (the common situation when the online content is – fully or partially – rendered either very blurred or as a black/white box, with a message raising awareness of its limited reliability; this content, however, is still accessible to the user upon an extra click), and author-generated edits, as well as complete content removal.

More specifically, we address the following research questions:

RQ1: How persistent is the compromised content? How does its visibility and availability change over time?
RQ2: What is the typical timeline for the interaction between content generators and fact-checkers? How – if at all – do content writers alter their posts after being exposed as problematic by fact-checkers?
RQ3: Are the trends different across platforms?

To this end, we analyze two datasets (in English)1 of social media documents, fact-checked by PolitiFact.

1 PolitiFact (https://www.politifact.com/) is an independent journalistic agency and one of the most experienced fact-checking organizations, providing detailed analytics for non-transparent online content since 2007.

2. Related Work

Multiple studies report on data persistence issues for online content. These works, however, mostly focus on Twitter datasets, as used for various challenges and shared tasks.

Zubiaga [2] provides an exhaustive report on data persistence for multiple Twitter datasets, showing an average data decay of around 20% over 4 years. Küpfer [3], again for Twitter, argues that data persistence is not random, becoming drastically more of an issue for emotionally charged or controversial content. Indeed, both Bastos [4] and Duan et al. [5] report much higher tweet decay rates for #Brexit and #BlackLivesMatter content, respectively.

To our knowledge, there have been no studies explicitly assessing data persistence issues for fakes.

Table 1
Assessing the time required for professional fact-checking (fc): statistics for the 2-month dataset, in days.

source      total docs   min fc time   max fc time   median fc time
all         192          0             56            4
fb          86           1             56            4
twitter     16           1             30            4
tiktok      17           1             30            6
instagram   72           0             44            4
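Per-platform lag statistics of the kind reported in Table 1 reduce to day-count arithmetic over pairs of dates. A minimal sketch of such a computation follows; the function names and toy records are ours, purely illustrative, and not the paper's actual pipeline:

```python
from datetime import date
from statistics import median

def fc_lag_days(published: date, fact_checked: date) -> int:
    """Days between a post's publication and its fact-checking report."""
    return (fact_checked - published).days

def lag_summary(lags):
    """min / max / median fact-checking time in days, as in Table 1."""
    return {"min": min(lags), "max": max(lags), "median": median(lags)}

# Toy records: (publication date, date of the fact-checking report).
records = [
    (date(2023, 5, 14), date(2023, 5, 15)),  # reported the next day
    (date(2023, 5, 10), date(2023, 5, 14)),  # a typical 4-day turnaround
    (date(2023, 5, 1),  date(2023, 6, 26)),  # a slow, investigation-heavy case
]
lags = [fc_lag_days(published, checked) for published, checked in records]
print(lag_summary(lags))  # {'min': 1, 'max': 56, 'median': 4}
```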
For some datasets, the creators provide estimations of content decay. For example, Bianchi et al. [6] estimate that around 25% of the tweets in their corpus on harmful speech online were no longer available at publication time. It is, however, not specified how this estimation was obtained.

We hope to bring new insights to our understanding of data persistence issues for compromised content by addressing the following novel angles: (i) we aim at a targeted analysis of manipulative content (fake news), (ii) we provide a more nuanced approach, tracking subtler changes in data availability for users and machines (e.g., obfuscation) and (iii) we go beyond Twitter, targeting all the major social media platforms.

3. Data

For our study, we use two datasets of real-life suspicious online posts, analyzed by PolitiFact. A 2-month dataset (PolitiFact reports from 15 May – 15 July 2023, around 200 entries) has been thoroughly monitored for data visibility and persistence up till now. A larger and older 8-month dataset (PolitiFact reports from January – September 2022, around 800 entries) has been analyzed twice to assess longer-term trends.

The two datasets include all the posts in English from the major social media platforms as reported by PolitiFact during the above-mentioned periods (i.e., the original publications slightly predate May 15, 2023 and Jan 1, 2022, respectively).

The analysis involves the following dimensions:

• visibility: visible (possibly with a warning), obfuscated, removed;
• persistence: original, edited, removed;
• extra labelling: any platform-specific add-ons, e.g. "missing context".

While some of these aspects are crucial for algorithmic NLP (e.g., data persistence is important for benchmarking and – in critical cases – even for training ML models), others are more relevant for understanding the impact of manipulative content on human readers (e.g., obfuscation is an unambiguous warning the platform sends to the reader about the low reliability of the information).

The 2-month dataset has been analysed every two days for the first two months and then on a weekly basis for the following year. The 8-month dataset has been analyzed in May and October 2024, when the documents were 1.5–2 and 2–2.5 years old, respectively.

4. Compromised content: timeline

4.1. From publication to fact-checking

For this project, we start monitoring the content the day it appears on PolitiFact. Obviously, this does not happen the very moment the content gets published by its creators: it takes some time for the content to reach PolitiFact and then an extra period to perform the fact-checking. This lag may depend on numerous factors: for example, some fakes are simple and repetitive, thus requiring less investigative effort, whereas others lead PolitiFact journalists to request third-party expert analytics, involving time-consuming communications with various public figures and organizations.

Table 1 shows time lag statistics (in days) between the content publication date (as reported by the platforms) and the appearance of the corresponding fact-checking report. It suggests that PolitiFact is doing an outstanding job at reacting to online misinformation in a timely manner: an average suspicious post is analyzed in 4 days, with a large bulk of reports appearing on the next day already. We observe no platform-based difference in PolitiFact reaction times, thus confirming their neutrality in this respect.

PolitiFact stays in active collaboration with major social media platforms.2 As a result, in most cases the content is marked by the platform as somewhat spurious (e.g. "false" or "out of context") shortly after or even before the publication on the PolitiFact website. This marking, as we will see below, often leads to immediate content modification or withdrawal.

2 For example, https://www.facebook.com/help/1952307158131536?helpref=related and https://www.tiktok.com/safety/en/safety-partners/
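The three analysis dimensions above (visibility, persistence, extra labelling) map naturally onto a small per-observation record. The following Python sketch is our own illustration of such a data model; the class and field names are assumptions, not the annotation tooling actually used in the study:

```python
from dataclasses import dataclass
from typing import Optional

VISIBILITY = {"visible", "obfuscated", "removed"}
PERSISTENCE = {"original", "edited", "removed"}

@dataclass
class Observation:
    """One check of a fact-checked post, `day` days after the fact-checking report."""
    day: int
    visibility: str               # visible (possibly with a warning) / obfuscated / removed
    persistence: str              # original / edited / removed
    label: Optional[str] = None   # platform-specific add-on, e.g. "missing context"

    def __post_init__(self):
        if self.visibility not in VISIBILITY or self.persistence not in PERSISTENCE:
            raise ValueError("unknown visibility/persistence value")

    @property
    def available(self) -> bool:
        # Available content is reachable by a human or a machine,
        # possibly with some extra effort (e.g., an extra click).
        return self.visibility in ("visible", "obfuscated")

print(Observation(7, "obfuscated", "original", "missing context").available)  # True
```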
Table 2
Statistics for the 2-month dataset: data availability at fact-checking day and one week, one, three and twelve months afterwards: % of available (visible or obfuscated) documents.

            % day0    % day7    % day30   % day100   % day365   total
all         88.02%    80.72%    75.52%    69.27%     61.97%     192
fb          83.72%    80.23%    75.58%    70.93%     63.95%     86
twitter     93.75%    93.75%    87.5%     93.75%     93.75%     16
tiktok      94.11%    82.35%    76.47%    64.7%      58.82%     17
instagram   90.27%    77.77%    72.22%    63.88%     54.16%     72

Table 3
Statistics for the 2-month dataset: data visibility at fact-checking day and one week, one, three and twelve months afterwards: % of visible documents.

            % day0    % day7    % day30   % day100   % day365   total
all         48.43%    46.87%    43.22%    40.1%      36.97%     192
fb          41.86%    39.53%    36.04%    32.55%     27.9%      86
twitter     93.75%    93.75%    87.5%     93.75%     93.75%     16
tiktok      94.11%    82.35%    76.47%    64.7%      58.82%     17
instagram   34.72%    36.11%    33.33%    31.94%     30.55%     72

4.2. Content availability after fact-checking

Tables 2 and 3 illustrate data availability over time for the 2-month set. We distinguish between two categories: visible and available. Available content can be accessed by either a human or a machine, possibly with some effort (e.g., an extra click). Visible content can be accessed as-is. In other words, non-visible accessible content comprises fully or partially obfuscated posts.

We see several important trends here. First of all, already at the fact-checking date, around 12% of documents are no longer available. This number grows rapidly: after one year, the unavailable content comprises 38% of the datapoints in our 2-month set. This figure is much more pessimistic than common estimations of online data persistence [2]. It raises an important and very urgent issue: as a community, we should invest a more focused and consistent effort into saving samples of compromised documents in a timely manner for ongoing and future research and benchmarking. From the human reader perspective, only one third of the posts are clearly visible after one year (and even in such cases, they might contain explicit markings, such as "partially false").

We also observe a striking difference across platforms: while most tweets remain online, almost half of the compromised Instagram posts are no longer available after 12 months. This is truly problematic: while the NLP community focuses mainly on Twitter data, fakes on other platforms are more prevalent – and keep appearing and disappearing at an alarming rate, leaving us virtually no opportunity to model the underlying trends.

4.3. Content adjustment

As we have seen above, once a document has been fact-checked and deemed false, the most typical reaction is its – rather fast – removal. This is a rather natural reaction: most creators do not enjoy having their content (and their name) marked as unreliable. In some cases, however, the users3 prefer to keep the compromised content online. Such content – proven to be problematic by a publicly available fact-checking report – can trigger a reaction from (a) the hosting social media platform, (b) the community and (c) the authors themselves. The observed reactions for visible documents are summarized in Table 4.

3 We do not have any reliable estimations of content removal by the major online platforms themselves. In this study, we assume, albeit unrealistically, that the content gets removed by the users.

Table 4
Reactions to fact-checking by social media platforms, community and users.

                        % day0   % day7   % day30   % day100   % day365   at some point
Platform labels
  missing context       11.5%    10.9%    12.0%     10.4%      8.9%       13.5%
  partly false          8.9%     8.9%     9.4%      9.4%       8.9%       11.5%
Community labels
  reader's context      0.5%     1.0%     2.1%      3.1%       3.1%       3.1%
Authors' intervention
  editing               1.6%     2.6%     2.1%      1.6%       1.6%       2.6%

Table 5
Statistics for the 8-month dataset: data persistence across platforms, assessed in May and October 2024 (1.5–2 and 2–2.5 years after publication, respectively).

            visible                   obfuscated                removed                   total
            May 2024     Oct 2024     May 2024     Oct 2024     May 2024     Oct 2024
all         363 44.21%   346 42.14%   128 15.59%   107 13.03%   330 40.19%   368 44.82%   821
fb          170 33.53%   164 32.35%   106 20.9%    90 17.75%    231 45.56%   253 49.90%   507
twitter     156 81.25%   157 81.77%   3 1.56%      2 1.04%      33 17.18%    33 17.18%    192
tiktok      3 25%        1 8.33%      0 0%         0 0%         9 75%        11 91.67%    12
instagram   29 28.15%    23 22.33%    19 18.44%    15 14.56%    55 53.39%    65 63.11%    103
youtube     5 83.33%     5 83.33%     0 0%         0 0%         1 16.66%     1 16.66%     6

Facebook and Instagram adopt their own labels to mark questionable content, distinguishing between "false", "out-of-context" and "partly false" documents.4 Although PolitiFact stays in active collaboration with both platforms, there is no direct correspondence between the labels. The labels get assigned rather quickly and stay unchanged (almost all of the observed label change is due to the complete removal of the document).

4 The exact labels vary across platforms (e.g. "out of context" vs. "missing context").

Twitter relies on its own community to highlight problematic content. This measure was introduced after the start of our project, and therefore we cannot assess directly how quickly the posts become marked as potentially problematic.

Finally, the users themselves might react verbally to fact-checking reports or to consequent actions by social media platforms, editing their original posts.
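The availability and visibility percentages in Tables 2, 3 and 5 are plain share computations over per-post status records at each checkpoint. A minimal sketch under an assumed record layout (ours, for illustration only):

```python
# Status of each monitored post at one checkpoint:
# "visible", "obfuscated" or "removed".
def share(statuses, accepted):
    """Percentage of posts whose status falls into `accepted`, rounded to 2 decimals."""
    return round(100 * sum(s in accepted for s in statuses) / len(statuses), 2)

# Toy checkpoint: 16 posts, 15 visible and 1 removed
# (cf. the Twitter row of Tables 2 and 3).
statuses = ["visible"] * 15 + ["removed"]
available = share(statuses, {"visible", "obfuscated"})  # Table 2-style figure
visible = share(statuses, {"visible"})                  # Table 3-style figure
print(available, visible)  # 93.75 93.75
```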
The modifications might range from acknowledging the fact-checking findings and posting clear and unambiguous updates all the way to claiming the post was ironic, or actively attacking the fact-checkers and arguing against their findings. We have also observed a higher percentage of edits from non-anonymous accounts.

4.4. Longer-term trends

Table 5 shows similar statistics for our 8-month dataset, covering PolitiFact reports published from January to September 2022. We have computed them in May and October 2024, when most posts were almost 2 and 2.5 years old, respectively.

These numbers support our initial findings: almost half (44.8%) of the compromised documents are no longer available after 2 years. The decay is more pronounced for TikTok and Instagram.

A considerably larger percentage of Facebook posts remains visible (non-obfuscated) in our 8-month dataset: this might be attributed to a rendering policy change.

Finally, the 2022 (8-month) dataset contains a larger share of tweets. The decay rate for Twitter is at 17% after 2 years (compared to just 6% after 1 year for the 2-month 2023 dataset). We believe that the considerable changes in the platform guidance over the past two years have affected the way content writers use Twitter (both publishing and removing). A larger-scale study is needed to provide more reliable Twitter-specific estimates under the new policies.

5. Conclusion

This paper provides an in-depth analysis of data persistence for publicly fact-checked online content. After one year of thoroughly monitoring online posts fact-checked by PolitiFact, we have observed the following. First, data persistence is a crucial and underrated issue for compromised content, with considerable decay rates. Second, the decay trends differ across platforms, with Facebook, TikTok and Instagram showing much less data persistence. Third, the decay starts immediately, with 12% of the compromised posts getting deleted at (or before) the publication of the PolitiFact report and 20% becoming unavailable within a week. This suggests an urgent need for a concentrated effort on collecting real-life fakes in a timely manner if we want to go beyond synthetic or simplistic datasets and train impactful fact-checking models.

In the future, we want to analyze further aspects of the decay issues for compromised content. First, we plan to add more fact-checking outlets beyond PolitiFact to see if there are any effects due to the report itself. Second, we plan to study in more detail the differences in online behaviour (content removal) between anonymous users, non-anonymous users and public figures. Finally, we plan to expand our research on the interaction between content writers and fact-checkers ("editing").

Acknowledgments

We thank the Autonomous Province of Trento for the financial support of our project via the AI@TN initiative.

References

[1] I. Alegria, N. Aranberri, P. Comas, V. Fernández, P. Gamallo, L. Padró, I. San Vicente, J. Turmo, A. Zubiaga, TweetNorm: a benchmark for lexical normalization of Spanish tweets, Language Resources and Evaluation 49 (2015) 1–23. doi:10.1007/s10579-015-9315-6.
[2] A. Zubiaga, A longitudinal assessment of the persistence of Twitter datasets, Journal of the Association for Information Science and Technology 69 (2018). doi:10.1002/asi.24026.
[3] A. Küpfer, Nonrandom tweet mortality and data access restrictions: Compromising the replication of sensitive Twitter studies, Political Analysis (2024) 1–14. doi:10.1017/pan.2024.7.
[4] M. Bastos, This account doesn't exist: Tweet decay and the politics of deletion in the Brexit debate, American Behavioral Scientist 65 (2021). doi:10.1177/0002764221989772.
[5] Y. Duan, J. Hemsley, A. O. Smith, "This tweet is unavailable": #BlackLivesMatter tweets decay, AoIR Selected Papers of Internet Research (2023). URL: https://spir.aoir.org/ojs/index.php/spir/article/view/13414. doi:10.5210/spir.v2023i0.13414.
[6] F. Bianchi, S. Hills, P. Rossini, D. Hovy, R. Tromble, N. Tintarev, "It's not just hate": A multi-dimensional perspective on detecting harmful speech online, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 8093–8099. URL: https://aclanthology.org/2022.emnlp-main.553. doi:10.18653/v1/2022.emnlp-main.553.