<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Life and Death of Fakes: on Data Persistence for Manipulative Social Media Content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olga Uryupina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering and Computer Science, University of Trento</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>This work presents an in-depth investigation of data decay for publicly fact-checked online content. We monitor compromised posts on major social media platforms (Facebook, Instagram, Twitter, TikTok) for one year, tracking the changes in their visibility and availability. We show that data persistence is an important issue for manipulative content, on a larger scale than previously reported for online content in general. Our findings also suggest a (much) higher data decay rate for the platforms suffering most from online disinformation, indicating an important area for data collection/preservation.</p>
      </abstract>
      <kwd-group>
<kwd>fact checking</kwd>
        <kwd>replicability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Manipulative online content is rapidly becoming a more and more pervasive issue for modern society: by deliberately biasing our information flow, unscrupulous content writers can and do affect our emotional state, beliefs, reasoning, and both online and offline behaviour. It is therefore not surprising that this has become a central issue for various stakeholders, from journalists and fact-checkers to NLP researchers both in academia and in industry. Given the current rapid growth in data-driven studies of manipulative content, it is essential to have a reliable overview of data persistence issues in this specific domain: compromised content is often very dynamic and changes or becomes unavailable over time, raising reproducibility concerns.</p>
      <p>From the readers' perspective, the visibility of compromised content over time directly affects its impact: a removed or strongly downgraded document is unlikely to be read or recovered and cannot be used to promote or support other fakes. From the research and development perspective, data persistence is crucial for benchmarking, ensuring fair comparison between models, as well as simply providing them with high-quality real-life training and testing examples.</p>
      <p>
        Starting already a decade ago, NLP benchmarking campaign studies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] report data persistence issues for online content used in various shared tasks, with around 10% of entries missing compared to the original dataset (gold standard). These shared tasks, however, are based almost exclusively on Twitter and do not focus specifically on compromised content. We believe that a large proportion of manipulative content is created on purpose by professional copywriters who may have different goals and motivations to keep their texts online (e.g., for click-bait purposes) or remove them (e.g., to reduce the reputation loss from being exposed as unreliable).
      </p>
      <p>Our work focuses specifically on the lifespan of fact-checked compromised content. We go beyond the naive binary present vs. removed view, studying more nuanced cases as well. In particular, we track compromised online posts over time for the appearance of explicit platform-specific reliability labels (e.g., "out of context"), obfuscation (the common situation when the online content is – fully or partially – rendered either very blurred or as a black/white box, with a message raising awareness of its limited reliability; this content, however, is still accessible to the user upon an extra click), and author-generated edits, as well as complete content removal.</p>
      <p>More specifically, we address the following research questions:</p>
      <p>RQ1: How persistent is the compromised content? How does its visibility and availability change over time?</p>
      <p>RQ2: What is the typical timeline for interaction between the content generators and fact-checkers? How – if at all – do content writers alter their posts after being exposed as problematic by fact-checkers?</p>
      <p>RQ3: Are the trends different across platforms?</p>
      <p>To this end, we analyze two datasets (in English) of social media documents, fact-checked by PolitiFact (https://www.politifact.com/).</p>
<p>[Table 1 (time required for professional fact-checking (fc), in days, for the 2-month dataset): the extracted values show a minimum fc time of 0–1 days and a median fc time of 4–6 days per source; the per-source row labels were not recoverable.]</p>
</sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <p>Multiple studies report on data persistence issues for online content. These works, however, mostly focus on Twitter datasets, as used for various challenges and shared tasks.</p>
      <p>
        Zubiaga [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provides an exhaustive report on data persistence for multiple Twitter datasets, showing an average data decay of around 20% over 4 years.
      </p>
      <p>
        Küpfer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] argues, again for Twitter, that data persistence is not random, becoming drastically more of an issue for emotionally charged or controversial content.
      </p>
      <p>
        Indeed, both Bastos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Duan et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] report much higher tweet decay rates for #Brexit and #BlackLivesMatter content, respectively.
      </p>
      <p>
        To our knowledge, there have been no studies explicitly assessing data persistence issues for fakes. For some datasets, the creators provide estimations of content decay. For example, Bianchi et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] estimate that around 25% of the tweets in their corpus on harmful speech online were no longer available at the paper publication time. It is, however, unspecified how this estimation was obtained.
      </p>
      <p>We hope to bring new insights to our understanding of the data persistence issues for compromised content by addressing the following novel angles: (i) we aim at a targeted analysis of manipulative content (fakes), (ii) we provide a more nuanced approach, tracking subtler changes in data availability for users and machines (e.g., obfuscation), and (iii) we go beyond Twitter, targeting all the major social media platforms.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Data</title>
<p>For our study, we use two datasets of real-life suspicious online posts, analyzed by PolitiFact. A 2-month dataset (PolitiFact reports from 15 May – 15 July 2023, around 200 entries) has been thoroughly monitored for data visibility and persistence up till now. A larger and older dataset (PolitiFact reports from January – September 2022, around 800 entries) has been analyzed twice to assess longer-term trends.</p>
      <p>The two datasets include all the posts in English from the major social media platforms as reported by PolitiFact during the above-mentioned periods (i.e., the original publications slightly predate May 15, 2023 and January 1, 2022, respectively).</p>
      <p>The analysis involves the following dimensions:</p>
      <p>• visibility: visible (possibly with a warning), obfuscated, removed;</p>
      <p>• persistence: original, edited, removed;</p>
      <p>• extra labelling: any platform-specific add-ons, e.g. "missing context".</p>
      <p>While some of these aspects are crucial for algorithmic NLP (e.g., data persistence is important for benchmarking and – in critical cases – even for training ML models), others are more relevant for understanding the impact of manipulative content on human readers (e.g., obfuscation is an unambiguous warning the platform sends to the reader about the low reliability of the information).</p>
      <p>The 2-month dataset has been analysed every two days for the first two months and then on a weekly basis for the following year. The 8-month dataset has been analyzed in May and October 2024, when the documents were 1.5-2 and 2-2.5 years old, respectively.</p>
      <sec id="sec-2-0">
        <title>4. Compromised content: timeline</title>
        <sec id="sec-2-0-1">
          <title>4.1. From publication to fact-checking</title>
          <p>For this project, we start monitoring the content the day it appears on PolitiFact. Obviously, this does not happen the very moment the content gets published by its creators: it takes some time for the content to reach PolitiFact and then an extra period to perform the fact-checking. This lag may depend on numerous factors: for example, some fakes are simple and repetitive, thus requiring less investigative effort, whereas others lead PolitiFact journalists to request third-party expert analytics, involving time-consuming communications with various public figures and organizations.</p>
          <p>Table 1 shows time lag statistics (in days) between the content publication date (as reported by the platforms) and the appearance of the corresponding fact-checking report. It suggests that PolitiFact is doing an outstanding job at reacting to online misinformation in a timely manner: an average suspicious post is analyzed in 4 days, with a large bulk of reports appearing already on the next day. We observe no platform-based difference in PolitiFact reaction times, thus confirming their neutrality in this respect.</p>
          <p>PolitiFact stays in active collaboration with major social media platforms (see, for example, https://www.facebook.com/help/1952307158131536?helpref=related and https://www.tiktok.com/safety/en/safety-partners/). As a result, in most cases the content is marked by the platform as somewhat spurious.</p>
        </sec>
      </sec>
<sec id="sec-2-2">
        <title>4.2. Content availability after fact-checking</title>
        <p>Tables 2 and 3 illustrate data availability over time for the 2-month set. We distinguish between two categories: visible and available. Available content can be accessed by either a human or a machine, possibly with some effort (e.g., an extra click). Visible content can be accessed as-is. In other words, non-visible accessible content includes fully or partially obfuscated posts.</p>
        <p>[Tables 2 and 3 (availability and visibility over time, 2-month set): only one extracted column survives, "% d0" = 88.02%, 83.72%, 93.75%, 94.11%, 90.27%; the row labels were not recoverable.]</p>
        <p>
          We see several important trends here. First of all, already at the fact-checking date, around 12% of the documents are no longer available. This number grows rapidly: after one year, the unavailable content comprises 38% of the datapoints in our 2-month set. This number is much more pessimistic than common estimations of online data persistence [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This raises an important and very urgent issue: as a community, we should invest a more focused and consistent effort in saving samples of compromised documents in a timely manner for ongoing and future research and benchmarking. From the human reader perspective, only one third of the posts are clearly visible after one year (and even in such cases, they might contain explicit markings, such as "partially false").
        </p>
        <p>We also observe a striking difference across platforms: while most tweets remain online, almost half of the compromised Instagram posts are no longer available after 12 months. This is truly problematic: while the NLP community focuses mainly on Twitter data, fakes on other platforms are more prevalent, and they keep appearing and disappearing at an alarming rate, leaving us virtually no opportunity to model the underlying trends.</p>
        <sec id="sec-2-1">
          <title>4.3. Content adjustment</title>
          <p>As we have seen above, once a document has been fact-checked and deemed false, the most typical reaction is its – rather fast – removal. This is a rather natural reaction: most creators do not enjoy having their content (and their name) marked as unreliable. In some cases, however, the users prefer keeping the compromised content online (we do not have any reliable estimations of content removal by the major online platforms themselves; in this study, we assume, albeit unrealistically, that the content gets removed by the users). Such content – proven to be problematic by a publicly available fact-checking report – would trigger a reaction from (a) the hosting social media platform, (b) the community and (c) the authors themselves. The observed reactions for visible documents are summarized in Table 4.</p>
          <p>Facebook and Instagram adopt their own labels to mark questionable content, distinguishing between "false", "out-of-context" and "partly false" documents (the exact labels vary across platforms, e.g. "out of context" vs. "missing context"). Although PolitiFact stays in an active collaboration with both platforms, there is no direct correspondence between the labels. The labels get assigned rather quickly and stay unchanged (almost all of the observed label change is due to the complete removal of the document).</p>
          <p>Twitter relies on its own community to highlight problematic content. This measure was introduced after the start of our project and therefore we cannot assess di[...]</p>
          <p>[Table 4 (observed reactions for visible documents): only fragments survived extraction, including "missing context", "partly false", "reader's context", "editing", "at some point".]</p>
        </sec>
<sec id="sec-2-2-4">
          <title>4.4. Longer-term trends</title>
          <p>Table 5 shows similar statistics for our 8-month dataset, covering PolitiFact reports published from January to September 2022. We have computed them in May and October 2024, when most posts were almost 2 and 2.5 years old, respectively.</p>
          <p>These numbers support our initial findings: almost half (44.8%) of the compromised documents are no longer available after 2 years. The decay is more pronounced for TikTok and Instagram.</p>
          <p>A considerably larger percentage of Facebook posts remains visible (non-obfuscated) in our 8-month dataset: this might be attributed to a rendering policy change.</p>
          <p>Finally, the 2022 (8-month) dataset contains a larger share of tweets. The decay rate for Twitter is at 17% after 2 years (compared to just 6% after 1 year for the 2-month 2023 dataset). We believe that the considerable change in the platform guidance over the past two years has affected the way content writers use Twitter (both publishing [...]).</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>5. Conclusion</title>
          <p>This paper aims at an in-depth analysis of data persistence for publicly fact-checked online content. After one year of thoroughly monitoring online posts fact-checked by PolitiFact, we have observed the following findings.</p>
          <p>First, data persistence is a crucial and underrated issue for compromised content, with considerable decay rates. Second, the decay trends differ across platforms, with Facebook, TikTok and Instagram showing much less data persistence. Third, the decay starts immediately, with 12% of the compromised posts getting deleted at (or before) the publication of the PolitiFact report and 20% becoming unavailable within a week. This suggests an urgent need for a concentrated effort on collecting real-life fakes in a timely manner if we want to go beyond synthetic or simplistic datasets and train impactful fact-checking models.</p>
          <p>In the future, we want to analyze further aspects of the decay issues for compromised content. First, we plan to add more fact-checking outlets beyond PolitiFact to see if there are any effects due to the report itself. Second, we plan to study in more detail the difference in online behaviour (content removal) between anonymous users, non-anonymous users and public figures. Finally, we plan to expand our research on the interaction between content writers and fact-checkers ("editing").</p>
        </sec>
<sec id="sec-2-2-3">
          <title>Acknowledgments</title>
          <p>We thank the Autonomous Province of Trento for the financial support of our project via the AI@TN initiative.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] I. Alegria, N. Aranberri, P. Comas, V. Fernández, P. Gamallo, L. Padró, I. San Vicente, J. Turmo, A. Zubiaga, Tweetnorm: a benchmark for lexical normalization of Spanish tweets, Language Resources and Evaluation 49 (2015) 1-23. doi:10.1007/s10579-015-9315-6.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] A. Zubiaga, A longitudinal assessment of the persistence of twitter datasets, Journal of the Association for Information Science and Technology 69 (2018). doi:10.1002/asi.24026.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[3] A. Küpfer, Nonrandom tweet mortality and data access restrictions: Compromising the replication of sensitive twitter studies, Political Analysis (2024) 1-14. doi:10.1017/pan.2024.7.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[4] M. Bastos, This account doesn't exist: Tweet decay and the politics of deletion in the brexit debate, American Behavioral Scientist 65 (2021). doi:10.1177/0002764221989772.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[5] Y. Duan, J. Hemsley, A. O. Smith, "This tweet is unavailable": #blacklivesmatter tweets decay, AoIR Selected Papers of Internet Research (2023). URL: https://spir.aoir.org/ojs/index.php/spir/article/view/13414. doi:10.5210/spir.v2023i0.13414.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[6] F. Bianchi, S. Hills, P. Rossini, D. Hovy, R. Tromble, N. Tintarev, "It's not just hate": A multi-dimensional perspective on detecting harmful speech online, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 8093-8099. URL: https://aclanthology.org/2022.emnlp-main.553. doi:10.18653/v1/2022.emnlp-main.553.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>