<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PROVENANCE: An Intermediary-Free Solution for Digital Content Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bilal Yousuf</string-name>
          <email>bilal.yousuf@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Atif Qureshi</string-name>
          <email>muhammad.qureshi@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brendan Spillane</string-name>
          <email>brendan.spillane@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gary Munnelly</string-name>
          <email>gary.munnelly@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oisin Carroll</string-name>
          <email>oisin.carroll@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Runswick</string-name>
          <email>matthew.runswick@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kirsty Park</string-name>
          <email>kirsty.park@dcu.ie</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eileen Culloty</string-name>
          <email>eileen.culloty@dcu.ie</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Owen Conlan</string-name>
          <email>owen.conlan@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jane Suiter</string-name>
          <email>jane.suiter@dcu.ie</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre, Technological University Dublin</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ADAPT Centre, Trinity College Dublin</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Future Media, Democracy and Society, Dublin City University</institution>
        </aff>
      </contrib-group>
      <fpage>9</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The threat posed by misinformation and disinformation is one of the defining challenges of the 21st century. Provenance is designed to help combat this threat by warning users when the content they are looking at may be misinformation or disinformation. It is also designed to improve media literacy among its users and, ultimately, to reduce susceptibility to the threat among vulnerable groups within society. The Provenance browser plugin checks the content that users see on the Internet and social media and provides warnings in their browser or social media feed. Unlike similar plugins, which require human experts to provide evaluations and can only provide simple binary warnings, Provenance's state-of-the-art technology requires no human input: it analyses seven aspects of the content users see and provides warnings where necessary.</p>
      </abstract>
      <kwd-group>
        <kwd>Misinformation</kwd>
        <kwd>Disinformation</kwd>
        <kwd>Fake News</kwd>
        <kwd>Social Media</kwd>
        <kwd>Plugin</kwd>
        <kwd>Browser Extension</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>a system that can differentiate between anger and fear in disinformation, and anger and fear in opinion news articles. There is also some difficulty in differentiating between news articles from alternative and independent agencies and news articles from disinformation sources, due to their often lower-quality writing, more emotive content, and reuse of images and videos.</p>
      <p>This paper provides an update on the ongoing progress of developing Provenance. The remainder of this paper is organised as follows. Section 2 Motivation and Background delves into the impetus for this project and situates it within other recent EU disinformation projects. Section 3 Related Work provides a detailed overview of similar browser plugins and describes how Provenance advances the state of the art. Section 4 Architecture Overview contains system architecture diagrams and descriptions of each component in the Provenance platform. Section 5 Provenance in Action provides a detailed explanation of how the Provenance browser plugin provides warnings to the user. Section 6 Use Cases presents two use cases for the Provenance plugin to show in what scenarios we envision it being used. Section 7 Evaluation briefly describes plans to evaluate the tool. Finally, Section 8 Conclusions completes the paper with closing remarks.</p>
      <p>organisations have also identified misinformation and disinformation as a threat and have increased efforts to combat it. These include the United Nations through its Verified platform [15] and the World Health Organisation [16]. More can be read about these initiatives in the Poynter Institute's guide to national and international efforts to combat misinformation and disinformation around the world [17].</p>
      <p>Provenance is a H2020 project; however, it differs from many of the above as it is a user-orientated, intermediary-free solution to help consumers identify misinformation and disinformation as they browse the Internet and social media. It is also designed to improve media literacy skills by equipping consumers with the tools, knowledge and know-how to face this challenge now and into the future.</p>
      <sec id="sec-1-1">
        <title>2. Motivation and Background</title>
        <p>The proliferation of misinformation and disinformation on social media has been described as a strategic threat to democracy and society in the European Union (EU) [2, 3]. A recent EU study on the issue found that the common narratives of society "are being splintered by filter bubbles, and further ruined by micro-targeting." [4]. The report points out that, like a virus, misinformation and disinformation spread throughout society via social media and other platforms, in open and closed groups, to the detriment of democratic systems. This occurs when "Susceptible users become weaponized as instruments for disseminating disinformation and propaganda" [4].</p>
        <p>The Presidents of the European Council, Commission and Parliament have all made increasingly public calls for concerted efforts to do more to combat the scourge of fake news to protect democracy. The President of the European Parliament has been the most forthright in this with a recent announcement that: "We must nurture our democracy &amp; defend our institutions against the corrosive power of hate speech, disinformation, fake news &amp; incitement to violence." [5]. As a result, the EU have funded a range of FP7, H2020 and other projects to combat misinformation and disinformation, including WeVerify [6, 7], SocialTruth [8], PHEME [9, 10], EUNOMIA [11], Fandango [12, 13] and the European Digital Media Observatory (EDMO) [14]. Many other international</p>
      </sec>
      <sec id="sec-1-1b">
        <title>3. Related Work</title>
        <p>This review of related work will focus on comparable browser plugins which are designed to provide users with warning notifications about disinformation or other problematic content and which are currently active or maintained. The purpose of this review is to establish how Provenance advances the state of the art.</p>
        <p>NewsGuard [18] provides 'nutrition' labels for news websites based on nine journalistic criteria. What differentiates it from many of the other fake news and bias detection browser plugins is that it does not use automated algorithms to assess news websites but rather relies on a team of journalists to conduct reviews. It comes as standard with Microsoft Edge, but a subscription is needed for other Internet browsers. Its notification icons appear as a browser extension in the upper right corner and within third-party search engines and social media platforms. Clicking on its browser icon opens a nutrition label pane where users can quickly see whether the news website passes or fails any of the nine criteria. A link is also available for users to see a more detailed report. Visually, NewsGuard employs simple but effective white ✓ on a green shield and red x iconography to denote when a website has passed or failed. NewsGuard's transparent methodology has resulted in their datasets being used for research [19]. While expert-led analysis has its merits, it also has issues with scalability, personal biases, and response times. Aker also maintains that much of the credibility and transparency scoring provided by NewsGuard could be automated [20].</p>
        <p>Décodex [21], created by Le Monde, originally started as an online search facility for users to check URLs against a list of known websites which spread misinformation and disinformation. They have since released a Facebook bot for users to directly chat to, and a browser plugin that provides red, orange or blue notifications to denote whether a website regularly disseminates false information, whether its reliability is doubtful, or whether it is a parody website. When installed, the Décodex icon becomes active when the website being viewed is listed in their database. It also produces a colour-coded popup with one of three standard warnings. Users cannot access detailed information about warnings, nor does it appear to be integrated with well-known search engines, social media platforms or discussion boards. Décodex's allow/deny list approach means that scalability is difficult, and the warnings it provides are based on the historical publication record of the website, not the content currently being viewed. Transparency is also limited. While still available, its development appears to be in stasis.</p>
        <p>Media Bias Fact Check (MBFC)2 [22] is an extensive media bias resource curated by a small team of journalists and lay researchers who have undertaken detailed assessments of over 4000 media outlets. A transparent assessment methodology means that their datasets have been used for several research projects [23, 20]. Their team of researchers undertake in-depth analyses of news organisations and assess them using a standardised methodology, with some subjective judgement, to calculate a left/right bias score using their published formula. They also calculate scores for factual reporting and credibility. These reports are published on their website and updated from time to time. Each news website in their database is categorised as: left bias, left-centre bias, least biased, right-centre bias, right bias, pro-science, conspiracy-pseudoscience, fake news, or satire. While their browser extension conveys limited details, further information about each news source is available on their website. It draws on this dataset to inform users, when they click on the notification icon, as to which of these nine categories the news website they are viewing belongs to, including a brief explanation of the category. It also provides a link to the detailed MBFC report. The browser extension also provides Facebook and Twitter support by displaying a visual left/right bias scale on news articles that appear in users' feeds, with links to the detailed MBFC report and Factual Search3 so that the user can investigate the topic further. While a valuable resource with considerable detail, MBFC's expert evaluations are based on the historical publication record of the news website and not an evaluation of the content the user is looking at. It is also a labour-intensive and time-consuming process.</p>
        <p>Stopaganda Plus4 [24] is a browser extension that adds accuracy and bias decals to Facebook, Twitter, Reddit, DuckDuckGo and Google. These visual indicators extend the functionality of MBFC (who determine the scores) to these common information portals so that users may more easily choose high-quality information resources. It should be noted that this extension is not designed to provide users with detailed warning notifications when viewing a news website and thus is not directly comparable to the other systems or Provenance. It is included here due to its use of MBFC, the fact that it conveys limited visual information/warnings before the user visits an information source, and for plenitude.</p>
      </sec>
      <sec id="sec-1-1c">
        <title>3.1. No Longer Active</title>
        <p>Many other projects and services related to this work, which have been reviewed in the literature, cf. [25, 26, 27, 11, 28, 29, 30], now no longer appear to be active or working. This is concerning because, despite the fact that misinformation and disinformation have been recognised as a threat to democracy and social cohesion, and the fact that browser plugins are one of the few citizen-orientated direct interventions which can help solve the problem at source while increasing long-term media literacy, very few of the proposed solutions have been actively promoted or maintained. The main reason for this appears to be that many of these plugins were developed by individuals or small teams, or even as part of a hackathon, and thus lacked the resources to be actively maintained or updated to deal with changing technology, such as browser updates, or the rapidly evolving threats posed by misinformation and disinformation. The following presents those related projects found in the literature which now no longer appear to be actively maintained, though some are still available to install. URLs have been included for posterity where possible, as many do not have peer-reviewed publications.</p>
        <p>B.S Detector5 relied on matching the URLs of content in the news feed to a known allow/deny list of sources of fake news and misinformation.</p>
        <p>AreYouFakeNews.com6 utilised Natural Language Processing (NLP) and deep learning to identify patterns of bias on websites.</p>
        <p>Fake News Detector AI7 claimed to use a neural network to detect similarity between submitted URLs and known fake news websites.</p>
        <p>Fake News Detector8 was designed to learn from webpages flagged by users to detect other similar fake news webpages.</p>
        <p>Trusted News9 is a browser plugin that was designed to assess the objectivity of news articles. Its functionality was limited to 'long form' news articles and it does not work with social media content.</p>
        <p>2https://mediabiasfactcheck.com/
3https://factualsearch.news
4https://browserextension.dev/blog/stopagandaplus-helpsunderstanding-media-biases/</p>
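        <p>The allow/deny-list matching used by several of the plugins above (e.g., B.S Detector, Décodex) can be sketched as follows. This is a minimal illustration only: the domains and warning labels are invented placeholders, not the curated databases those tools shipped.</p>

```python
from urllib.parse import urlparse

# Illustrative deny list mapping domains to warning labels; the real
# plugins relied on curated databases of known misinformation sources.
DENY_LIST = {
    "fake-news-example.com": "regularly disseminates false information",
    "parody-example.net": "parody website",
}

def check_url(url):
    """Return a warning label if the URL's host is on the deny list."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # normalise the common www. prefix
    return DENY_LIST.get(host)

print(check_url("https://www.fake-news-example.com/story"))
```

        <p>The limitation noted above is visible in the sketch: the judgement depends entirely on the website's historical reputation, not on the content currently being viewed, and every new domain must be added by hand.</p>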
      </sec>
      <sec id="sec-1-2">
        <title>3.1. No Longer Active (continued)</title>
        <p>Fake News Guard10 claimed to combine linguistic and network analysis techniques to identify fake news; however, this can no longer be verified.</p>
        <p>FiB11 is a browser extension built in a hackathon which was reviewed several times in the literature as a comparable system [31].</p>
        <p>TrustedNews12 used AI to help users evaluate news articles by scoring their objectivity [32]. However, it does not work on social media and has issues with analysing webpages that require scrolling.</p>
        <p>Trusty Tweet [26] was designed to help users deal with fake news tweets and to increase media literacy. Its transparent approach is designed to prevent reactance and increase trust. Early user evaluations showed promise.</p>
        <p>Check-It [33] was designed to analyse a range of signals to identify fake news. It was focused on user privacy, with computation undertaken locally. Its approach used a combination of linguistic models, fact checking, and website and social media user allow/deny lists.</p>
        <p>5https://www.producthunt.com/posts/b-s-detector
6https://github.com/N2ITN/are-you-fake-news
7https://www.fakenewsai.com/
8https://fakenewsdetector.org/
9https://trusted-news.com/</p>
      </sec>
      <sec id="sec-1-3">
        <title>3.2. Out of Scope Approaches</title>
        <p>Some misinformation and disinformation detection tools which have been reviewed in other papers have not been included in this literature review. This is because they are not browser plugins or are paid-for B2B services (Fakebox [34]; AreYouFakeNews [35]); they are focused on an aligned but separate issue, e.g., the detection of bias or the detection of reused and/or manipulated images (Ground.News [36]; SurfSafe [37]); they are specifically for fact checking (BRENDA [38]; CredEye [39]); they have pivoted into a B2B platform (FightHoax [40]); they are not user orientated (Credible News [41, 42]); or they are research systems which have not been made available to the public [30, 43]. While relevant to combating disinformation, these are not directly comparable to Provenance.</p>
      </sec>
      <sec id="sec-1-4">
        <title>3.3. Advancing the State of the Art</title>
        <p>This review demonstrates that browser plugins are a common user-orientated approach to combating misinformation and disinformation. However, Provenance adopts a significantly more advanced and granular methodology than current or previous efforts in the domain. The warnings provided by earlier plugins are often based on the news website's history of publishing misinformation and disinformation. Thus, they are limited to providing a coarse-grained retrospective analysis of the news website's publication history. In contrast, Provenance's fine-grained approach is designed to analyse the content of the news webpage or users' social media feeds and, where necessary, provide an easy-to-understand warning to the user when the content they are viewing may be problematic or symptomatic of disinformation. In the cases where linguistic analysis or other machine learning approaches have been utilised, the results are not presented to the user in an explainable or transparent way. Some of these methods have also proven susceptible to adversarial attacks, whereby text may be augmented slightly to fool pretrained models [44, 45].</p>
        <p>Two factors differentiating Provenance from the plugins described above are their limited reach and scalability. Many of the above plugins do not provide any information for some heavily trafficked news websites such as the LA Times, Al Jazeera, and the Independent.co.uk. This is likely due to the limiting factors of time and labour involved in including humans in the disinformation judgement process. While no one doubts the benefits of highly trained expert judgement, the size and nature of the rapidly evolving media landscape, especially in regard to misinformation and disinformation, in which publishers are prone to rapid growth, failure and re-branding, means that providing human ratings is a never-ending game of whack-a-mole. Current solutions are only partially succeeding in providing judgements of some news agencies. None have attempted to analyse the millions of pieces of content they publish daily. Unlike each of the plugins described above, Provenance does not require a human-in-the-loop, nor does it need to be backed by human-generated allow/deny lists. Its architecture supports fully automated and intermediary-free analysis of news content.</p>
        <p>The ability to evaluate news articles against seven criteria and provide users with visual notifications and deeper explanations is also a significant advancement on the state of the art and a direct benefit to users in three ways. First, and most importantly, users will be made aware of individual issues with the content they are consuming and can thus decide whether they will continue viewing it or look for alternative sources. Second, it will help develop users' media literacy skills by making them aware of the different caution-worthy indicators and how to check them, making them less susceptible to misinformation and disinformation in the future. Third, the closed nature of the systems described above means that they cannot be properly examined. In contrast, a full description of Provenance's system architecture is provided below. It is also currently undergoing evaluation and testing, and the results will be published in time.</p>
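        <p>The adversarial attacks mentioned above [44, 45] typically make small, human-imperceptible edits to text so that a pretrained model sees different tokens. A minimal sketch of one such perturbation, substituting visually similar Cyrillic homoglyphs for Latin letters (the mapping here is illustrative, not taken from the cited attacks):</p>

```python
# Map some Latin letters to visually similar Cyrillic homoglyphs. A model
# tokenising on exact characters now sees different tokens, while a human
# reader perceives (almost) the same sentence.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def perturb(text):
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "breaking news"
attacked = perturb(original)
print(original == attacked)  # False: the strings differ at the byte level
```

        <p>Defences against this class of attack typically normalise the input (e.g., mapping homoglyphs back to a canonical alphabet) before any downstream classification.</p>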
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Architecture Overview</title>
      <p>The system architecture for Provenance is shown in
Figure 1. The components and services use REST APIs
serving JSON for easy, reliable, and fast data exchanges across
internal subsystems.
10http://fakenewsguard.com/
11https://projectfib.azurewebsites.net/
12https://trusted-news.com/</p>
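      <p>As an illustration of the REST/JSON exchanges described above, an analytical component might serialise its result for the next subsystem as shown below. The field names are assumptions for the sketch; the paper does not publish Provenance's actual schema.</p>

```python
import json

# Hypothetical analysis result for one asset; field names are illustrative.
result = {
    "asset_id": "a-1234",
    "component": "text_tone_detector",
    "scores": {"anger": 0.72, "fear": 0.41, "joy": 0.05},
    "warning": True,
}

payload = json.dumps(result)    # body of the REST response
restored = json.loads(payload)  # as parsed by the receiving service
print(restored["component"])
```

      <p>Because every internal service speaks the same JSON-over-REST convention, any component can be swapped out or added without changing how the others exchange data.</p>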
      <p>Data in the form of webpages or social media content is ingested by Provenance either through the Social Network Monitor or by a Trusted Content Analyst (e.g., a journalist or fact checker). The Social Network Monitor service discovers content using NewsWhip's13 social network monitoring platform. The introduced asset is enriched with social engagement data (e.g., likes and shares) and is forwarded to the Asset Workflow Handler service.</p>
      <p>The Asset Workflow Handler separates the incoming data (e.g., a news webpage) into individual assets such as images, video, and text. These assets are registered with the Asset Fingerprinter before being disseminated to the analytical components (Video/Image Reverse Searcher, Video/Image Manipulation Detector, Text Similarity Detector, Text Tone Detector, and Writing Quality Detector) to determine if they exhibit any features which normally characterise misleading, questionable, or unsubstantiated information. The output of each analytical service, and the initial data passed from the Social Network Monitor, are combined and sent to the Knowledge Graph, where they are stored.</p>
      <p>The Knowledge Graph may be queried by the Provenance Query Service to retrieve the results of analysis for a given webpage. The Provenance plugin, installed in the user's browser, leverages this query service to retrieve information about webpages that a user is currently viewing. If the webpage has been analysed by Provenance and exhibits questionable features, the plugin will issue a warning to the user, indicating that they may want to further investigate the claims made in the article's content. The Personalised Companion Service is used to determine how this information should be presented for an individual user.</p>
      <sec id="sec-2-0">
        <title>4.1. Key Components</title>
      </sec>
      <sec id="sec-2-0-1">
        <title>4.1.1. Social Network Monitor</title>
        <p>The Social Network Monitor communicates with NewsWhip's Social Network API to identify assets which should be ingested by Provenance. Finding assets involves querying NewsWhip's API with a parameterized search request. The call to NewsWhip's Social Network API is invoked automatically and periodically to maintain an updated record of trending news articles and social media posts. Assets detected by NewsWhip are enriched through social scoring. The URL, titles, summaries, images and videos (if any), along with the enrichment data, are extracted from the article and provided to Provenance. Assets composed only of text, for example, are registered in fragments consisting of the news feed/article title, the summary, and user engagement data.</p>
      </sec>
      <sec id="sec-2-0-2">
        <title>4.1.2. Asset Registration</title>
        <p>A dedicated Asset Registration web interface also allows Trusted Content Analysts to add assets into the Asset Workflow Handler. Trusted Content Analysts are stakeholders such as journalists and other representatives of news agencies and wire services, fact checkers, debunkers, and original content creators who may want to register their multimedia content assets. In future, this facility will be made more widely available to allow the general public to send content directly to Provenance. It may also be integrated with news publication platforms and content management systems so that content is automatically added. The primary task of this component is to enable third parties to register assets that have not been discovered by the Social Network Monitor.</p>
      </sec>
      <sec id="sec-2-0-3">
        <title>4.1.3. Asset Workflow Handler</title>
        <p>The Asset Workflow Handler is the component of the Provenance Verification Layer that is responsible for orchestrating the components and data within the layer. This component's primary task is to distribute assets to different components for further processing. It invokes the service interfaces and handles the data flow between the services. By utilising the Asset Workflow Handler, components are loosely coupled, thus mitigating direct component-to-component communications. This enables Provenance to work with the variety of APIs exposed by the existing tools/components. Moreover, the APIs can be adjusted to meet Provenance's specific needs. Due to this modular design, new components (e.g., detection of bias [46], tabloidization [47], and hate speech [48]) can be easily added to the Provenance Verification Layer and connected to the Asset Workflow Handler.</p>
      </sec>
      <sec id="sec-2-0-4">
        <title>4.1.5. Video/Image Manipulation Detector</title>
        <p>The Provenance Video/Image Manipulation Detector identifies if an image or video has been manipulated in comparison to its source. This work is based on the PIZZARO14 project. It utilises recent developments achieved by deep learning-based methods to enable instant detection of manipulations in visual content. In addition, use of the latest technologies based on Convolutional Networks will lead to tangible enhancements in integrity verification of visual content. The Video/Image Manipulation Detector increases trust and improves governance. The solution is designed to build a web-based system to assess visual content in a real-world setting. The Video/Image Manipulation Detector will further support the development of users' skills in detecting false visual information themselves by providing world-class image forensic technology. It has a special focus on developing a solution that will be intuitive and easy to understand and interpret for end-users, thereby increasing its uptake by the public and its impact on the information system. This component's primary task is to detect if an image or video has been manipulated by comparing it with previously registered images and videos in the system.</p>
      </sec>
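      <p>The loose coupling attributed to the Asset Workflow Handler above can be sketched as a simple dispatcher that fans each asset out to whatever analytical components have been registered and collects their outputs. The interfaces are assumptions for illustration; the component names follow the paper.</p>

```python
from typing import Callable, Dict

class AssetWorkflowHandler:
    """Orchestrates analytical components without coupling them directly."""

    def __init__(self):
        self.components: Dict[str, Callable] = {}

    def register(self, name, component):
        # New detectors (e.g., bias or hate-speech detection) can be
        # plugged in here without touching existing components.
        self.components[name] = component

    def process(self, asset):
        # Fan the asset out to every component and collect the outputs,
        # which would then be combined and sent to the Knowledge Graph.
        return {name: comp(asset) for name, comp in self.components.items()}

handler = AssetWorkflowHandler()
handler.register("text_tone", lambda asset: {"anger": 0.1})
handler.register("writing_quality", lambda asset: {"wqs": 0.9})
print(handler.process({"text": "example article"}))
```

      <p>Because components are addressed only through registered callables, swapping one analytical service for another does not affect its neighbours, which is the property the modular design above relies on.</p>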
      <sec id="sec-2-1">
        <title>4.1.4. Video/Image Reverse Searcher</title>
        <p>The Video/Image Reverse Searcher is a key component for creating a large-scale annotated dataset for detecting manipulated visual content. The dataset consists of three distinct parts. The first part includes 45,000 images, each captured by a unique device (i.e., 45,000 different cameras have been used). Half of these images are real, and the other half have been digitally manipulated by applying a random image processing operation to a local area of the image. Since the sensor pattern noise present in images is unique to each sensor (i.e., camera), this dataset introduces large diversity, such as noise. The second part of the dataset uses the imaging software in cameras to introduce a large diversity of artefacts in images. Commonly available camera brands and models were identified and used to collect a dataset of 50,000 images. Half of these images were digitally manipulated using an advanced image editing method based on Generative Adversarial Networks (GANs) [49]. Finally, the third part of the dataset consists of 2,000 images downloaded from the Internet representing "real-life" (uncontrolled) manipulated images created by random people. For all of the manipulated samples collected for the third part of the data, the matching unmanipulated image was also collected. This component's primary task is to enable the search operation for videos and images.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.1.6. Asset Fingerprinter and Asset Registry</title>
        <p>The Asset Fingerprinter and Asset Registry provide traceability of registered content. They are based on Blockchain technology, making content immutable and enabling the verification of the sources of, and alterations to, the content. Registered assets are handed to the Asset Fingerprinter via the Asset Workflow Handler. Due to the General Data Protection Regulation (GDPR) and the size of some assets, the hash of the data is stored on the Blockchain. Azure Storage is used as the Blockchain, and the assets themselves, including large files, are stored using an off-line storage service available to store multimedia files. Blockchain is used due to its innate data integrity, which is important for proving the traceability of registered content if the tool were ever targeted as part of a combined disinformation and hacking campaign. This component's primary task is the traceability of registered content via Blockchain.</p>
        <p>which had characteristics symptomatic of disinformation, was annotated in a crowdsourced study to identify terms and phrases indicative of low-quality writing. A WQS for each piece of content was then derived using a standard formula. This was subject to testing and expert evaluation to ensure the WQS the formula produced accurately reflected each piece of content. Models were then trained on the dataset, which showed that the WQS could be automatically generated with a high degree of accuracy. These models and the overall process are currently undergoing formal evaluation.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.1.7. Text Similarity Detector</title>
        <p>News is regularly republished nationally and locally from international wire services such as Reuters, Agence France-Presse (AFP) and Associated Press (AP). In a bid to lower costs, many news agencies who are not in competition negotiate deals to republish each other's content. Similarly, less trustworthy news outlets often put 'spins' on existing articles, where correct articles are modified to contain false information.</p>
        <p>To combat this, the Text Similarity Detector in Provenance attempts to verify the textual content of an article by comparing it to similar articles published elsewhere. A backlog of trustworthy articles is stored in an Elasticsearch database with a BM25 similarity index [50]. As BM25 under-performs with very long documents [51], only the title and first 10 sentences are used in the index. Once similar articles have been found, the component searches for the facts given in the query article in the similar ones. Facts in an article are found by taking sentences with a low subjectivity from TextBlob's sentiment analysis model [52]. The similarity of two facts is the cosine similarity of the vector embeddings of both, which are provided by Google's multilingual text model [53]. If enough of the article's factual content cannot be verified, the plugin displays a warning.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.1.10. Knowledge Graph and Knowledge Graph Builder</title>
        <sec id="sec-2-4-1">
          <title>4.1.8. Text Tone Detector</title>
          <p>Intuitively, one would expect that impartial news sources would use impartial, unemotive language to convey the facts of a story. Recent research has shown that emotions such as fear, anger, sadness and doubt, and the absence of joy and happiness, are indicative of misinformation and disinformation [54, 55, 56]. Provenance's Text Tone Detector is designed to identify emotions in text which may indicate that the news source is unreliable. Threshold values are used to determine whether caution should be shown, and the degree of caution is determined by how far the calculated value deviates from the threshold value.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>4.1.9. Writing Quality Detector</title>
          <p>Provenance's Writing Quality Detector computes a writing quality score (WQS) for the textual content the user is viewing and provides a warning when it falls below a threshold value. Writing quality is closely related to cohesion and coherence [57]. Within the context of news, high-quality writing is indicative of paid professional journalism from mainstream, independent and, to a lesser degree, alternative news agencies, whereas low-quality writing is indicative of amateur or unprofessional news production processes [58]. This high/low quality differentiation is also apparent in other domains such as academia, publishing, commercial, and blogs and information websites. While NLP techniques exist to derive writing quality [59], and others have called for it to be used to identify misinformation and disinformation [60, 61], only two examples of systems could be found in the literature which actually calculate writing quality [62, 63].</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>4.1.10. Knowledge Graph and Knowledge Graph Builder</title>
          <p>The Provenance Knowledge Graph stores a record of all the articles introduced to Provenance via the Social Network Monitor service or via Asset Registration from a Trusted Content Analyst. It is also a record of all analysis performed on those assets.</p>
          <p>The content is organised according to concepts, categories and topics. For example, a news article discussing politics can be categorised according to the left/right political spectrum, followed by the topics discussed, as shown in Figure 2. Each node at the article level is split according to text, image and video.</p>
          <p>The output of the Video/Image Reverse Searcher includes the N most similar images/videos, distance measures and geometric validation results. The data from the Video/Image Manipulation Detector includes the probability of manipulations and the area of polygons. These are sent as JSON objects to the Knowledge Graph, where they are stored as entities in a triplestore.</p>
          <p>Modelling of Provenance data is achieved using a combination of the RDF Data Cube vocabulary [64], to store statistical information such as the outputs from the various analytical components, and the Dublin Core/BIBO vocabularies [65], to model bibliographic information about the assets themselves. Some use is also made of the FOAF vocabulary (http://xmlns.com/foaf/spec/) to model information such as content publishers, which are naturally represented as foaf:Agent entities.</p>
          <p>The Knowledge Graph Builder is responsible for exposing a REST API which the Asset Workflow Handler may use to upload assets as JSON, and for transforming the JSON into triples which are stored in a triplestore. In Provenance, this is achieved using JOPA [66], a Java library which can be used to map POJOs to triples. Using Spring Boot (https://spring.io/projects/spring-boot), a REST API accepting JSON is exposed. The uploaded JSON is deserialized into POJOs using Spring Boot's built-in version of Jackson. JOPA is then used to serialize the triples out to an RDF4J (https://rdf4j.org/) instance.</p>
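          <p>The JSON-to-triples step performed by the Knowledge Graph Builder can be illustrated with a small sketch. The real pipeline uses JOPA, Spring Boot and RDF4J in Java; the Python below, with an invented example namespace and field names, only shows the shape of the transformation from an uploaded analysis result to subject-predicate-object triples.</p>
          <preformat>
```python
import json

# Illustrative namespace; the real vocabularies are RDF Data Cube,
# Dublin Core/BIBO and FOAF, as described above.
PROV = "http://example.org/provenance#"


def asset_to_triples(asset_json):
    """Flatten one uploaded asset (a JSON object) into (s, p, o) triples.

    The asset is expected to carry an 'id' plus scalar analysis outputs,
    e.g. a manipulation probability from the Video/Image Manipulation
    Detector. The field names here are assumptions for illustration.
    """
    asset = json.loads(asset_json)
    subject = PROV + asset["id"]
    triples = []
    for key, value in asset.items():
        if key == "id":
            continue
        triples.append((subject, PROV + key, value))
    return triples
```
          </preformat>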
          <p>To calculate WQSs for Provenance, a dataset of news articles, blog posts and other website content, much of […]</p>
          <p>The same serialization process works in reverse, allowing the Provenance Query Service to expose both a JSON REST endpoint, which can produce JSON objects from the results of a canned SPARQL query exposed via a Spring Boot REST endpoint, and a much lower-level raw SPARQL endpoint from the triplestore, for those who want a high level of control over their queries.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>4.1.11. Provenance Query Service</title>
        <p>The Provenance Query Service is the interface to the Verification Layer and offers external trusted services the means to request verification information about a webpage or article. It also allows trusted services to identify the relatedness of content (through similarity and the Knowledge Graph) and to determine whether content has been modified. As the results of all analysis are stored in the Knowledge Graph, the Provenance Query Service is effectively a proxy between the user-facing front-end and the query interface to whatever storage medium is used to implement the Knowledge Graph.</p>
        <p>As mentioned in Section 4.1.10, the Provenance Query Service exposes both a raw SPARQL endpoint and a REST API which provides endpoints for a number of canned SPARQL queries that return JSON objects. It is envisioned that the vast majority of use cases will be covered by the REST API, making it easier for developers to access data that is helpful to users. However, it is worthwhile to allow lower-level access to the KG's contents in the event of unforeseen requirements being placed on the KG.</p>
      </sec>
      <sec id="sec-2-6">
        <title>4.1.12. Personalised Companion Service</title>
        <p>The Personalised Companion Service manages the Provenance verification indicator, the minimal user model, and user scrutability and control. The verification indicator is implemented as a Chrome Extension and works on the Facebook and Twitter platforms and with articles published by news agencies. The Personalised Companion Service uses the user's interests, domain knowledge, digital literacy and the warning preferences stored in the Minimal User Model to determine whether to highlight caution or to show the verification indicator without caution. It uses the data provided by the Asset Fingerprinter, the Video/Image Reverse Searcher and Video/Image Manipulation Detector, and the Text Similarity, Tone and Writing Quality Detector components to create the set of icons presented to users, who can explore the levels of verification presented through the visual iconography.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Provenance in Action</title>
      <p>The Provenance browser plugin is designed to provide users with easy-to-understand, granular and cautionary warnings about the content they are consuming. These warnings are provided via an in-browser icon beside the address bar when the user is browsing the Internet, or within their Facebook and Twitter social media feeds beside the content they are viewing. Figures 3–6 show how Provenance and its visual warnings appear to a user who has the Provenance plugin installed, within their Facebook social media feed. The Provenance icon appears as a small blue square with a white P above each content item that it has checked. When the icon background turns red (with a small exclamation mark), it indicates to the user that the content item is worthy of a cautionary warning. The following presents the four main states of Provenance which a user will see.</p>
      <p>Figure 3 shows the Facebook feed of a user who has the Provenance browser plugin installed. The Provenance icon is visible at the top of each news article in the user's feed. In this image, the icon is blue, which indicates that there are no warnings for this particular news item.</p>
      <p>In Figure 4, the background of the Provenance icon within the user's news feed has turned red to indicate that this news item is worthy of one or more cautionary warnings. A small black exclamation mark has been added to the top right of the icon for colour-blind users.</p>
      <p>In Figure 5, the user has clicked on the red Provenance icon. A window has appeared beneath the Provenance icon showing which of the seven verification criteria Provenance has detected an issue with. In this example, the red background and exclamation mark beneath the Writing Quality icon indicate that this aspect of the news article is worthy of caution. The user may click on the downward arrow beneath each icon for further information. In this example, the Tone icon is greyed out, indicating that it could not be assessed by Provenance in this instance.</p>
    </sec>
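    <p>The icon states described above follow a simple rule: the in-feed icon turns red as soon as any of the seven criteria warrants caution, while individual criteria in the notification pane are green (pass), red (caution) or grey (could not be assessed). A sketch of that logic, with assumed state names and criterion labels:</p>
    <preformat>
```python
CRITERIA = (
    "fingerprint", "reverse_search", "manipulation",
    "similarity", "tone", "writing_quality", "registration",
)  # seven criteria; these names are illustrative, not the plugin's


def badge(result):
    """Colour of one criterion's badge in the notification pane."""
    if result == "fail":
        return "red"      # exclamation mark shown beneath
    if result == "unavailable":
        return "grey"     # criterion could not be assessed
    return "green"        # white check mark


def icon_state(results):
    """Overall icon beside the content item: red if any criterion failed."""
    return "red" if "fail" in results.values() else "blue"
```
    </preformat>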
    <sec id="sec-4">
      <title>6. Use Cases: Provenance Plugin</title>
      <sec id="sec-4-1">
        <title>6.1. Social Media Timeline</title>
        <p>On the recommendation of a friend, Mary installed the Provenance browser plugin due to increased concerns about the spread of misinformation and disinformation. The instructional video on the Provenance Chrome Extension webpage explained that Provenance uses seven criteria to verify digital content on the Internet and in social media feeds. After installing the Provenance plugin, she notices that the news items in her Facebook timeline now display the Provenance icon beside the publisher's name.</p>
        <p>For most of the news stories, the Provenance icon shows a white P inside a white circle on a blue background. When she clicks on the blue Provenance icon, it opens a notification pane showing the seven verification criteria, all of which display a green background with a white ✓.</p>
        <p>She is able to click on each of the seven verification icons to read a detailed explanation of each criterion, why failing the criterion is an indication that the webpage or social media post may be misinformation or disinformation, and how the warning is derived. As all of the icons are green, she is reassured about the origin, veracity and overall quality of the news article. For some news items displayed on her timeline, she notices that the blue background of the Provenance icon has turned red. When she clicks on it, the same information pane displaying the same verification criteria appears, except that one or more of the seven verification criteria now display a red background with an exclamation mark beneath. When she clicks on these, an additional detailed explanation pane appears underneath to explain why the criterion has failed. Reading through each warning, including its detailed description, she gains a better understanding of how to identify misinformation and disinformation.</p>
        <p>In both instances, Mary has become more aware of the need to critically check the news she consumes and more aware of good media literacy habits in general.</p>
      </sec>
      <sec id="sec-4-2">
        <title>6.2. News Websites</title>
        <p>Mary regularly visits news websites to inform herself of current affairs. Usually, the Provenance icon, which is visible to the right of her browser's address bar, displays a white P inside a white circle on a blue background. However, recently, when she was visiting news websites to read more about a story relating to Covid-19 vaccination, she noticed that the background of the Provenance icon would sometimes turn red. When she clicked on the icon, the verification criteria information pane showed that Provenance had detected a problem with the image used in the news article she was reading. Clicking on the arrow to open the drop-down explanation pane, she reads that Provenance has detected that the image has been used before in another article. The image in question shows a picture taken at a conference of the World Health Organisation. Looking closely, she sees a credit to the Associated Press (AP). She knows that AP is an international news wire service, and that local and national news agencies republish their articles, including the images. As this is just an image of a press conference, she is confident that its use by multiple news agencies is not an issue.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Evaluation</title>
      <p>Provenance is under development and will shortly be undergoing human evaluation. Currently, five of the seven news analysis functions have been implemented and integrated with the platform. These are undergoing technical evaluation while the final two analysis tools are being completed. When the tool is fully completed, a series of technical tests and human evaluation tests will be undertaken to evaluate basic functionality and to ensure that it provides the right warnings at the appropriate time. Following this, a series of experiments will be undertaken to evaluate its effect on user behaviour, including the likelihood of reading and sharing news articles that have cautionary warnings beside them. We will also analyse unintended effects of the tool. Finally, a series of long-term studies is planned to evaluate its effect on users' media literacy.</p>
    </sec>
    <sec id="sec-6">
      <title>8. Conclusions</title>
      <p>Misinformation and disinformation are significant issues that have negatively affected public discourse, politics and social cohesion. The Internet, and especially social media, is the primary conduit for their growth and spread.</p>
      <p>Existing user-orientated browser plugins have limited capabilities and only provide users with a historical rating of a website's propensity to publish misinformation and disinformation. They are also not capable of detailed analysis of the content of news webpages or social media feeds. The Provenance browser plugin significantly improves upon existing user-orientated solutions by providing intermediary-free analysis of webpage and social media content using seven criteria and, where necessary, providing cautionary warnings to users. The user can then check the detailed explanatory warning notifications to make their own judgement. This will improve users' media literacy and reduce susceptibility to misinformation and disinformation in the long term.</p>
    </sec>
    <sec id="sec-7">
      <title>9. Acknowledgements</title>
      <p>The work has been supported by the PROVENANCE project, which has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 825227, and with the financial support of Science Foundation Ireland under Grant Agreement No. 13/RC/2106_P2 at the ADAPT SFI Research Centre.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref25"><label>25</label><mixed-citation>… automatic detection of fake news, in: Proceedings of the 6th International Workshop on Socio-Technical Perspective in IS Development (STPIS 2020), CEUR-WS, 2020, pp. 168–179. URL: http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-19356.</mixed-citation></ref>
      <ref id="ref26"><label>26</label><mixed-citation>K. Hartwig, C. Reuter, TrustyTweet: An indicator-based browser plugin to assist users in dealing with fake news on Twitter, 2019.</mixed-citation></ref>
      <ref id="ref27"><label>27</label><mixed-citation>A. Giełczyk, R. Wawrzyniak, M. Choraś, Evaluation of the existing tools for fake news detection, in: K. Saeed, R. Chaki, V. Janev (Eds.), Computer Information Systems and Industrial Management, Lecture Notes in Computer Science, Springer International Publishing, 2019, pp. 144–151. doi:10.1007/978-3-030-28957-7_13.</mixed-citation></ref>
      <ref id="ref28"><label>28</label><mixed-citation>A. Školkay, J. Filin, A comparison of fake news detecting and fact-checking AI based solutions, Studia Medioznawcze 20 (2019) 365–383.</mixed-citation></ref>
      <ref id="ref29"><label>29</label><mixed-citation>K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining perspective, ACM SIGKDD Explorations Newsletter 19 (2017) 22–36. doi:10.1145/3137597.3137600.</mixed-citation></ref>
      <ref id="ref30"><label>30</label><mixed-citation>A. Hanselowski, A. PVS, B. Schiller, F. Caspelherr, D. Chaudhuri, C. M. Meyer, I. Gurevych, A retrospective analysis of the fake news challenge stance-detection task, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, 2018, pp. 1859–1874. URL: https://www.aclweb.org/anthology/C18-1158.</mixed-citation></ref>
      <ref id="ref31"><label>31</label><mixed-citation>A. Goel, ProjectFib - GitHub Repo, 2016. URL: https://github.com/anantdgoel/ProjectFib.</mixed-citation></ref>
      <ref id="ref32"><label>32</label><mixed-citation>Eyeo, Trusted News, 2020. URL: https://chrome.google.com/webstore/detail/trusted-news/nkkghpncidknplmlkgemdoekpckjmlok?hl=en.</mixed-citation></ref>
      <ref id="ref33"><label>33</label><mixed-citation>D. Paschalides, C. Christodoulou, R. Andreou, G. Pallis, M. D. Dikaiakos, A. Kornilakis, E. Markatos, Check-It: A plugin for detecting and reducing the spread of fake news and misinformation on the web, in: 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2019, pp. 298–302.</mixed-citation></ref>
      <ref id="ref34"><label>34</label><mixed-citation>V. Inc, Fakebox, 2021. URL: https://machinebox.io/.</mixed-citation></ref>
      <ref id="ref35"><label>35</label><mixed-citation>Z. A. Estela, N2ITN/are-you-fake-news, 2021. URL: https://github.com/N2ITN/are-you-fake-news.</mixed-citation></ref>
      <ref id="ref36"><label>36</label><mixed-citation>Ground News, 2021. URL: https://ground.news/.</mixed-citation></ref>
      <ref id="ref37"><label>37</label><mixed-citation>A. Bhat, SurfSafe, 2021. URL: https://chrome.google.com/webstore/detail/surfsafe-join-the-fight-a/hbpagabeiphkfhbboacggckhkkipgdmh?hl=en.</mixed-citation></ref>
      <ref id="ref38"><label>38</label><mixed-citation>B. Botnevik, E. Sakariassen, V. Setty, BRENDA: Browser extension for fake news detection, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, 2020, pp. 2117–2120. URL: https://doi.org/10.1145/3397271.3401396.</mixed-citation></ref>
      <ref id="ref39"><label>39</label><mixed-citation>K. Popat, S. Mukherjee, J. Strötgen, G. Weikum, CredEye: A credibility lens for analyzing and explaining misinformation, in: Companion Proceedings of the The Web Conference 2018, WWW '18, International World Wide Web Conferences Steering Committee, 2018, pp. 155–158. doi:10.1145/3184558.3186967.</mixed-citation></ref>
      <ref id="ref40"><label>40</label><mixed-citation>FightHoax, FightHoax - unlock your programmatic advertising, 2021. URL: http://34.253.212.69/.</mixed-citation></ref>
      <ref id="ref41"><label>41</label><mixed-citation>M. Hardalov, I. Koychev, P. Nakov, In search of credible news, in: C. Dichev, G. Agre (Eds.), Artificial Intelligence: Methodology, Systems, and Applications, Lecture Notes in Computer Science, 2016. doi:10.1007/978-3-319-44748-3_17.</mixed-citation></ref>
      <ref id="ref42"><label>42</label><mixed-citation>M. Hardalov, mhardalov/news-credibility, 2019. URL: https://github.com/mhardalov/news-credibility.</mixed-citation></ref>
      <ref id="ref43"><label>43</label><mixed-citation>X. Zhou, A. Jain, V. V. Phoha, R. Zafarani, Fake news early detection: A theory-driven model, Digital Threats: Research and Practice 1 (2020) 12:1–12:25. doi:10.1145/3377478.</mixed-citation></ref>
      <ref id="ref44"><label>44</label><mixed-citation>W. E. Zhang, Q. Z. Sheng, A. Alhazmi, C. Li, Adversarial attacks on deep learning models in natural language processing: A survey, arXiv:1901.06796 [cs] (2019). URL: http://arxiv.org/abs/1901.06796.</mixed-citation></ref>
      <ref id="ref45"><label>45</label><mixed-citation>Z. Zhou, H. Guan, M. M. Bhat, J. Hsu, Fake news detection via NLP is vulnerable to adversarial attacks, in: Proceedings of the 11th International Conference on Agents and Artificial Intelligence, 2019, pp. 794–800. doi:10.5220/0007566307940800.</mixed-citation></ref>
      <ref id="ref46"><label>46</label><mixed-citation>B. Spillane, S. Lawless, V. Wade, The impact of increasing and decreasing the professionalism of news webpage aesthetics on the perception of bias in news articles, in: Proceedings of the 22nd International Conference on Human-Computer Interaction, Lecture Notes in Computer Science, Springer, 2020. doi:10.1007/978-3-030-49059-1_50.</mixed-citation></ref>
      <ref id="ref47"><label>47</label><mixed-citation>B. Spillane, I. Hoe, M. Brady, V. Wade, S. Lawless, Tabloidization versus credibility: Short term gain for long term pain, in: CHI '20: The ACM Conference on Human Factors in Computing Systems, ACM, 2020. doi:10.1145/3313831.3376388.</mixed-citation></ref>
      <ref id="ref48"><label>48</label><mixed-citation>A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics, 2017, pp. 1–10. doi:10.18653/v1/W17-1101.</mixed-citation></ref>
      <ref id="ref49"><label>49</label><mixed-citation>I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html.</mixed-citation></ref>
      <ref id="ref50"><label>50</label><mixed-citation>S. E. Robertson, S. Walker, Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, in: SIGIR '94, Springer, 1994, pp. 232–241.</mixed-citation></ref>
      <ref id="ref51"><label>51</label><mixed-citation>Y. Lv, C. Zhai, When documents are very long, BM25 fails!, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, Association for Computing Machinery, 2011, pp. 1103–1104. doi:10.1145/2009916.2010070.</mixed-citation></ref>
      <ref id="ref52"><label>52</label><mixed-citation>S. Loria, TextBlob documentation, release 0.16.0, 2020. URL: https://buildmedia.readthedocs.org/media/pdf/textblob/latest/textblob.pdf.</mixed-citation></ref>
      <ref id="ref53"><label>53</label><mixed-citation>Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, R. Kurzweil, Multilingual universal sentence encoder for semantic retrieval, 2019. arXiv:1907.04307.</mixed-citation></ref>
      <ref id="ref54"><label>54</label><mixed-citation>S. B. Parikh, V. Patil, P. K. Atrey, On the origin, proliferation and tone of fake news, in: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), IEEE, 2019, pp. 135–140. doi:10.1109/MIPR.2019.00031.</mixed-citation></ref>
      <ref id="ref55"><label>55</label><mixed-citation>J. Paschen, Investigating the emotional appeal of fake news using artificial intelligence and human contributions, Journal of Product &amp; Brand Management 29 (2019) 223–233. doi:10.1108/JPBM-12-2018-2179.</mixed-citation></ref>
      <ref id="ref56"><label>56</label><mixed-citation>X. Zhang, J. Cao, X. Li, Q. Sheng, L. Zhong, K. Shu, Mining dual emotion for fake news detection, in: Proceedings of the Web Conference 2021, 2021, pp. 3465–3476. doi:10.1145/3442381.3450004.</mixed-citation></ref>
      <ref id="ref57"><label>57</label><mixed-citation>I. Singh, D. P., A. K., On the coherence of fake news articles, in: I. Koprinska, M. Kamp, A. Appice, C. Loglisci, L. Antonie, A. Zimmermann, R. Guidotti, O. Özgöbek, R. P. Ribeiro, R. Gavaldà, et al. (Eds.), ECML PKDD 2020 Workshops, Communications in Computer and Information Science, Springer International Publishing, 2020, pp. 591–607. doi:10.1007/978-3-030-65965-3_42.</mixed-citation></ref>
      <ref id="ref58"><label>58</label><mixed-citation>M. Chung, N. Kim, When I learn the news is false: How fact-checking information stems the spread of fake news via third-person perception, Human Communication Research 47 (2021) 1–24. doi:10.1093/hcr/hqaa010.</mixed-citation></ref>
      <ref id="ref59"><label>59</label><mixed-citation>V. Klyuev, Fake news filtering: Semantic approaches, in: 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2018, pp. 9–15. doi:10.1109/ICRITO.2018.8748506.</mixed-citation></ref>
      <ref id="ref60"><label>60</label><mixed-citation>M. Spradling, J. Straub, J. Strong, Protection from 'fake news': The need for descriptive factual labeling for online content, Future Internet 13 (2021) 142. doi:10.3390/fi13060142.</mixed-citation></ref>
      <ref id="ref61"><label>61</label><mixed-citation>N. Fuhr, A. Giachanou, G. Grefenstette, I. Gurevych, A. Hanselowski, K. Jarvelin, R. Jones, Y. Liu, J. Mothe, W. Nejdl, et al., An information nutritional label for online documents, ACM SIGIR Forum 51 (2018) 46–66. doi:10.1145/3190580.3190588.</mixed-citation></ref>
      <ref id="ref62"><label>62</label><mixed-citation>C. Fan, Classifying fake news, 2017. URL: https://www.conniefan.com/wp-content/uploads/2017/03/classifying-fake-news.pdf.</mixed-citation></ref>
      <ref id="ref63"><label>63</label><mixed-citation>E. S. Jo, A. Muhamed, S. Nuthakki, A. Singhania, DeepNews: Detecting quality in news, 2018.</mixed-citation></ref>
      <ref id="ref64"><label>64</label><mixed-citation>W. W. W. Consortium, et al., The RDF data cube vocabulary (2014).</mixed-citation></ref>
      <ref id="ref65"><label>65</label><mixed-citation>D. C. M. Initiative, et al., Dublin core metadata element set, version 1.1 (2012).</mixed-citation></ref>
      <ref id="ref66"><label>66</label><mixed-citation>M. Ledvinka, P. Kremen, JOPA: Accessing ontologies in an object-oriented way, in: ICEIS (2), 2015, pp. 212–221.</mixed-citation></ref>
    </ref-list>
  </back>
</article>