<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PROVENANCE: An Intermediary-Free Solution for Digital Content Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bilal Yousuf</string-name>
          <email>bilal.yousuf@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Atif Qureshi</string-name>
          <email>muhammad.qureshi@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brendan Spillane</string-name>
          <email>brendan.spillane@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gary Munnelly</string-name>
          <email>gary.munnelly@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oisin Carroll</string-name>
          <email>oisin.carroll@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Runswick</string-name>
          <email>matthew.runswick@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kirsty Park</string-name>
          <email>kirsty.park@dcu.ie</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eileen Culloty</string-name>
          <email>eileen.culloty@dcu.ie</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Owen Conlan</string-name>
          <email>owen.conlan@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jane Suiter</string-name>
          <email>jane.suiter@dcu.ie</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre, Technological University Dublin</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ADAPT Centre, Trinity College Dublin</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Future Media, Democracy and Society, Dublin City University</institution>
        </aff>
      </contrib-group>
      <fpage>9</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The threat posed by misinformation and disinformation is one of the defining challenges of the 21st century. Provenance is designed to help combat this threat by warning users when the content they are looking at may be misinformation or disinformation. It is also designed to improve media literacy among its users and, ultimately, to reduce susceptibility to the threat among vulnerable groups within society. The Provenance browser plugin checks the content that users see on the Internet and social media and provides warnings in their browser or social media feed. Unlike similar plugins, which require human experts to provide evaluations and can only provide simple binary warnings, Provenance's state-of-the-art technology requires no human input: it analyses seven aspects of the content users see and provides warnings where necessary.</p>
      </abstract>
      <kwd-group>
        <kwd>Misinformation</kwd>
        <kwd>Disinformation</kwd>
        <kwd>Fake News</kwd>
        <kwd>Social Media</kwd>
        <kwd>Plugin</kwd>
        <kwd>Browser Extension</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>a system that can differentiate between anger and fear in disinformation, and anger and fear in opinion news articles. There is also some difficulty in differentiating between news articles from alternative and independent agencies and news articles from disinformation sources, due to their often lower-quality writing, more emotive content, and reuse of images and videos.</p>
      <p>This paper provides an update on the ongoing progress of developing Provenance. The remainder of this paper is organised as follows. Section 2 Motivation and Background delves into the impetus for this project and situates it within other recent EU disinformation projects. Section 3 Related Work provides a detailed overview of similar browser plugins and describes how Provenance advances the state of the art. Section 4 Architecture Overview contains system architecture diagrams and descriptions of each component in the Provenance platform. Section 5 Provenance in Action provides a detailed explanation of how the Provenance browser plugin provides warnings to the user. Section 6 Use Cases presents two use cases for the Provenance plugin to show in what scenarios we envision it being used. Section 7 Evaluation briefly describes plans to evaluate the tool. Finally, Section 8 Conclusions completes the paper with closing remarks.</p>
      <p>organisations have also identified misinformation and disinformation as a threat and have increased efforts to combat it. These include the United Nations through its Verified platform [15] and the World Health Organisation [16]. More can be read about these initiatives in the Poynter Institute's guide to national and international efforts to combat misinformation and disinformation around the world [17].</p>
      <p>Provenance is a H2020 project; however, it differs from many of the above as it is a user-orientated, intermediary-free solution to help consumers identify misinformation and disinformation as they browse the Internet and social media. It is also designed to improve media literacy skills by equipping consumers with the tools, knowledge and know-how to face this challenge now and into the future.</p>
      <sec id="sec-1-1">
        <title>2. Motivation and Background</title>
        <p>The proliferation of misinformation and disinformation on social media has been described as a strategic threat to democracy and society in the European Union (EU) [2, 3]. A recent EU study on the issue found that the common narratives of society "are being splintered by filter bubbles, and further ruined by micro-targeting." [4]. The report points out that, like a virus, misinformation and disinformation spread throughout society via social media and other platforms, in open and closed groups, to the detriment of democratic systems. This occurs when "Susceptible users become weaponized as instruments for disseminating disinformation and propaganda" [4].</p>
        <p>The Presidents of the European Council, Commission and Parliament have all made increasingly public calls for concerted efforts to do more to combat the scourge of fake news to protect democracy. The President of the European Parliament has been the most forthright in this with a recent announcement that: "We must nurture our democracy &amp; defend our institutions against the corrosive power of hate speech, disinformation, fake news &amp; incitement to violence." [5]. As a result, the EU have funded a range of FP7, H2020 and other projects to combat misinformation and disinformation, including WeVerify [6, 7], SocialTruth [8], PHEME [9, 10], EUNOMIA [11], Fandango [12, 13] and the European Digital Media Observatory (EDMO) [14]. Many other international</p>
      </sec>
      <sec id="sec-1-1b">
        <title>3. Related Work</title>
        <p>This review of related work will focus on comparable browser plugins which are designed to provide users with warning notifications about disinformation or other problematic content and which are currently active or maintained. The purpose of this review is to establish how Provenance advances the state of the art.</p>
        <p>NewsGuard [18] provides 'nutrition' labels for news websites based on nine journalistic criteria. What differentiates it from many of the other fake news and bias detection browser plugins is that it does not use automated algorithms to assess news websites but rather relies on a team of journalists to conduct reviews. It comes as standard with Microsoft Edge, but a subscription is needed for other Internet browsers. Its notification icons appear as a browser extension in the upper right corner and within third-party search engines and social media platforms. Clicking on its browser icon opens a nutrition label pane where users can quickly see whether the news website passes or fails any of the nine criteria. A link is also available for users to see a more detailed report. Visually, NewsGuard employs simple but effective white ✓ on a green shield and red x iconography to denote when a website has passed or failed. NewsGuard's transparent methodology has resulted in their datasets being used for research [19]. While expert-led analysis has its merits, it also has issues with scalability, personal biases, and response times. Aker also maintains that much of the credibility and transparency scoring provided by NewsGuard could be automated [20].</p>
        <p>Décodex [21], created by Le Monde, originally started as an online search facility for users to check URLs against a list of known websites which spread misinformation and disinformation. They have since released a Facebook bot for users to directly chat to, and a browser plugin that provides red, orange or blue notifications to denote whether a website regularly disseminates false information, whether its reliability is doubtful, or whether it is a parody website. When installed, the Décodex icon becomes active when the website being viewed is listed in their database. It also produces a colour-coded popup with one of three standard warnings. Users cannot access detailed information about warnings, nor does it appear to be integrated with well-known search engines, social media platforms or discussion boards. Décodex's allow/deny list approach means that scalability is difficult, and the warnings it provides are based on the historical publication record of the website, not the content currently being viewed. Transparency is also limited. While still available, its development appears to be in stasis.</p>
        <p>Media Bias Fact Check (MBFC)2 [22] is an extensive media bias resource curated by a small team of journalists and lay researchers who have undertaken detailed assessments of over 4000 media outlets. A transparent assessment methodology means that their datasets have been used for several research projects [23, 20]. Their team of researchers undertake in-depth analyses of news organisations and assess them using a standardised methodology, with some subjective judgement, to calculate a left/right bias score using their published formula. They also calculate scores for factual reporting and credibility. These reports are published on their website and updated from time to time. Each news website in their database is categorised as: left bias, left-centre bias, least biased, right-centre bias, right bias, pro-science, conspiracy-pseudoscience, fake news, or satire. While their browser extension conveys limited details, further information about each news source is available on their website. It draws on this dataset to inform users, when they click on the notification icon, as to which of these nine categories the news website they are viewing belongs to, including a brief explanation of the category. It also provides a link to the detailed MBFC report. The browser extension also provides Facebook and Twitter support by displaying a visual left/right bias scale on news articles that appear in users' feeds, with links to the detailed MBFC report and Factual Search3 so that the user can investigate the topic further. While a valuable resource with considerable detail, MBFC's expert evaluations are based on the historical publication record of the news website and not an evaluation of the content the user is looking at. It is also a labour-intensive and time-consuming process.</p>
        <p>Stopaganda Plus4 [24] is a browser extension that adds accuracy and bias decals to Facebook, Twitter, Reddit, DuckDuckGo and Google. These visual indicators extend the functionality of MBFC (who determine the scores) to these common information portals so that users may more easily choose high-quality information resources. It should be noted that this extension is not designed to provide users with detailed warning notifications when viewing a news website and thus is not directly comparable to the other systems or Provenance. It is included here due to its use of MBFC, the fact that it conveys limited visual information/warnings before the user visits an information source, and for plenitude.</p>
      </sec>
      <sec id="sec-1-1c">
        <title>3.1. No Longer Active</title>
        <p>Many other projects and services related to this work, which have been reviewed in the literature, cf. [25, 26, 27, 11, 28, 29, 30], now no longer appear to be active or working. This is concerning because, despite the fact that misinformation and disinformation have been recognised as a threat to democracy and social cohesion, and the fact that browser plugins are one of the few citizen-orientated direct interventions which can help solve the problem at source while increasing long-term media literacy, very few of the proposed solutions have been actively promoted or maintained. The main reason for this appears to be that many of these plugins were developed by individuals or small teams, or even as part of a hackathon, and thus lacked the resources to be actively maintained or updated to deal with changing technology, such as browser updates, or the rapidly evolving threats posed by misinformation and disinformation. The following presents those related projects found in the literature which now no longer appear to be actively maintained, though some are still available to install. URLs have been included for posterity where possible, as many do not have peer-reviewed publications.</p>
        <p>B.S Detector5 relied on matching the URLs of content in the news feed to a known allow/deny list of sources of fake news and misinformation.</p>
        <p>AreYouFakeNews.com6 utilised Natural Language Processing (NLP) and deep learning to identify patterns of bias on websites.</p>
        <p>Fake News Detector AI7 claimed to use a neural network to detect similarity between submitted URLs and known fake news websites.</p>
        <p>Fake News Detector8 was designed to learn from webpages flagged by users to detect other similar fake news webpages.</p>
        <p>Trusted News9 is a browser plugin that was designed to assess the objectivity of news articles. Its functionality was limited to 'long form' news articles and it does not work with social media content.</p>
        <p>2https://mediabiasfactcheck.com/
3https://factualsearch.news
4https://browserextension.dev/blog/stopagandaplus-helpsunderstanding-media-biases/</p>
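        <p>The allow/deny-list matching used by several of the plugins above (e.g., B.S Detector, Décodex) can be sketched as follows. This is a minimal illustration only: the domains and warning labels are invented placeholders, not the curated databases those tools shipped.</p>

```python
from urllib.parse import urlparse

# Illustrative deny list mapping domains to warning labels; the real
# plugins relied on curated databases of known misinformation sources.
DENY_LIST = {
    "fake-news-example.com": "regularly disseminates false information",
    "parody-example.net": "parody website",
}

def check_url(url):
    """Return a warning label if the URL's host is on the deny list."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # normalise the common www. prefix
    return DENY_LIST.get(host)

print(check_url("https://www.fake-news-example.com/story"))
```

        <p>The limitation noted above is visible in the sketch: the judgement depends entirely on the website's historical reputation, not on the content currently being viewed, and every new domain must be added by hand.</p>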
      </sec>
      <sec id="sec-1-2">
        <title>3.1. No Longer Active (continued)</title>
        <p>Fake News Guard10 claimed to combine linguistic and network analysis techniques to identify fake news; however, this can no longer be verified.</p>
        <p>FiB11 is a browser extension built in a hackathon which was reviewed several times in the literature as a comparable system [31].</p>
        <p>TrustedNews12 used AI to help users evaluate news articles by scoring their objectivity [32]. However, it does not work on social media and has issues with analysing webpages that require scrolling.</p>
        <p>Trusty Tweet [26] was designed to help users deal with fake news tweets and to increase media literacy. Its transparent approach is designed to prevent reactance and increase trust. Early user evaluations showed promise.</p>
        <p>Check-It [33] was designed to analyse a range of signals to identify fake news. It was focused on user privacy, with computation undertaken locally. Its approach used a combination of linguistic models, fact checking, and website and social media user allow/deny lists.</p>
        <p>5https://www.producthunt.com/posts/b-s-detector
6https://github.com/N2ITN/are-you-fake-news
7https://www.fakenewsai.com/
8https://fakenewsdetector.org/
9https://trusted-news.com/</p>
      </sec>
      <sec id="sec-1-3">
        <title>3.2. Out of Scope Approaches</title>
        <p>Some misinformation and disinformation detection tools which have been reviewed in other papers have not been included in this literature review. This is because they are not browser plugins or are paid-for B2B services (Fakebox [34]; AreYouFakeNews [35]); they are focused on an aligned but separate issue, e.g., the detection of bias or the detection of reused and/or manipulated images (Ground.News [36]; SurfSafe [37]); they are specifically for fact checking (BRENDA [38]; CredEye [39]); they have pivoted into a B2B platform (FightHoax [40]); they are not user orientated (Credible News [41, 42]); or they are research systems which have not been made available to the public [30, 43]. While relevant to combating disinformation, these are not directly comparable to Provenance.</p>
      </sec>
      <sec id="sec-1-4">
        <title>3.3. Advancing the State of the Art</title>
        <p>This review demonstrates that browser plugins are a common user-orientated approach to combating misinformation and disinformation. However, Provenance adopts a significantly more advanced and granular methodology than current or previous efforts in the domain. The warnings provided by earlier plugins are often based on the news website's history of publishing misinformation and disinformation. Thus, they are limited to providing a coarse-grained retrospective analysis of the news website's publication history. In contrast, Provenance's fine-grained approach is designed to analyse the content of the news webpage or users' social media feeds and, where necessary, provide an easy-to-understand warning to the user when the content they are viewing may be problematic or symptomatic of disinformation. In the cases where linguistic analysis or other machine learning approaches have been utilised, the results are not presented to the user in an explainable or transparent way. Some of these methods have also proven susceptible to adversarial attacks, whereby text may be augmented slightly to fool pretrained models [44, 45].</p>
        <p>Two factors differentiating Provenance from the plugins described above are their limited reach and scalability. Many of the above plugins do not provide any information for some heavily trafficked news websites such as the LA Times, Al Jazeera, and the Independent.co.uk. This is likely due to the limiting factors of time and labour involved in including humans in the disinformation judgement process. While no one doubts the benefits of highly trained expert judgement, the size and nature of the rapidly evolving media landscape, especially in regard to misinformation and disinformation, in which publishers are prone to rapid growth, failure and re-branding, means that providing human ratings is a never-ending game of whack-a-mole. Current solutions are only partially succeeding in providing judgements of some news agencies. None have attempted to analyse the millions of pieces of content they publish daily. Unlike each of the plugins described above, Provenance does not require a human-in-the-loop, nor does it need to be backed by human-generated allow/deny lists. Its architecture supports fully automated and intermediary-free analysis of news content.</p>
        <p>The ability to evaluate news articles against seven criteria and provide users with visual notifications and deeper explanations is also a significant advancement on the state of the art and a direct benefit to users in three ways. First, and most importantly, users will be made aware of individual issues with the content they are consuming and can thus decide whether they will continue viewing it or look for alternative sources. Second, it will help develop users' media literacy skills by making them aware of the different caution-worthy indicators and how to check them, making them less susceptible to misinformation and disinformation in the future. Third, the closed nature of the systems described above means that they cannot be properly examined. In contrast, a full description of Provenance's system architecture is provided below. It is also currently undergoing evaluation and testing, and the results will be published in time.</p>
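        <p>The adversarial attacks mentioned above [44, 45] typically make small, human-imperceptible edits to text so that a pretrained model sees different tokens. A minimal sketch of one such perturbation, substituting visually similar Cyrillic homoglyphs for Latin letters (the mapping here is illustrative, not taken from the cited attacks):</p>

```python
# Map some Latin letters to visually similar Cyrillic homoglyphs. A model
# tokenising on exact characters now sees different tokens, while a human
# reader perceives (almost) the same sentence.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def perturb(text):
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "breaking news"
attacked = perturb(original)
print(original == attacked)  # False: the strings differ at the byte level
```

        <p>Defences against this class of attack typically normalise the input (e.g., mapping homoglyphs back to a canonical alphabet) before any downstream classification.</p>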
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Architecture Overview</title>
      <p>The system architecture for Provenance is shown in
Figure 1. The components and services use REST APIs
serving JSON for easy, reliable, and fast data exchanges across
internal subsystems.
10http://fakenewsguard.com/
11https://projectfib.azurewebsites.net/
12https://trusted-news.com/</p>
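      <p>As an illustration of the REST/JSON exchanges described above, an analytical component might serialise its result for the next subsystem as shown below. The field names are assumptions for the sketch; the paper does not publish Provenance's actual schema.</p>

```python
import json

# Hypothetical analysis result for one asset; field names are illustrative.
result = {
    "asset_id": "a-1234",
    "component": "text_tone_detector",
    "scores": {"anger": 0.72, "fear": 0.41, "joy": 0.05},
    "warning": True,
}

payload = json.dumps(result)    # body of the REST response
restored = json.loads(payload)  # as parsed by the receiving service
print(restored["component"])
```

      <p>Because every internal service speaks the same JSON-over-REST convention, any component can be swapped out or added without changing how the others exchange data.</p>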
      <p>Data in the form of webpages or social media content is ingested by Provenance either through the Social Network Monitor or by a Trusted Content Analyst (e.g., a journalist or fact checker). The Social Network Monitor service discovers content using NewsWhip's13 social network monitoring platform. The introduced asset is enriched with social engagement data (e.g., likes and shares) and is forwarded to the Asset Workflow Handler service.</p>
      <p>The Asset Workflow Handler separates the incoming data (e.g., a news webpage) into individual assets such as images, video, and text. These assets are registered with the Asset Fingerprinter before being disseminated to the analytical components (Video/Image Reverse Searcher, Video/Image Manipulation Detector, Text Similarity Detector, Text Tone Detector, and Writing Quality Detector) to determine if they exhibit any features which normally characterise misleading, questionable, or unsubstantiated information. The output of each analytical service, and the initial data passed from the Social Network Monitor, are combined and sent to the Knowledge Graph, where they are stored.</p>
      <p>The Knowledge Graph may be queried by the Provenance Query Service to retrieve the results of analysis for a given webpage. The Provenance plugin, installed in the user's browser, leverages this query service to retrieve information about webpages that a user is currently viewing. If the webpage has been analysed by Provenance and exhibits questionable features, the plugin will issue a warning to the user, indicating that they may want to further investigate the claims made in the article's content. The Personalised Companion Service is used to determine how this information should be presented for an individual user.</p>
      <sec id="sec-2-0">
        <title>4.1. Key Components</title>
      </sec>
      <sec id="sec-2-0-1">
        <title>4.1.1. Social Network Monitor</title>
        <p>The Social Network Monitor communicates with NewsWhip's Social Network API to identify assets which should be ingested by Provenance. Finding assets involves querying NewsWhip's API with a parameterized search request. The call to NewsWhip's Social Network API is invoked automatically and periodically to maintain an updated record of trending news articles and social media posts. Assets detected by NewsWhip are enriched through social scoring. The URL, titles, summaries, images and videos (if any), along with the enrichment data, are extracted from the article and provided to Provenance. Assets composed only of text, for example, are registered in fragments consisting of the news feed/article title, the summary, and user engagement data.</p>
      </sec>
      <sec id="sec-2-0-2">
        <title>4.1.2. Asset Registration</title>
        <p>A dedicated Asset Registration web interface also allows Trusted Content Analysts to add assets into the Asset Workflow Handler. Trusted Content Analysts are stakeholders such as journalists and other representatives of news agencies and wire services, fact checkers, debunkers, and original content creators who may want to register their multimedia content assets. In future, this facility will be made more widely available to allow the general public to send content directly to Provenance. It may also be integrated with news publication platforms and content management systems so that content is automatically added. The primary task of this component is to enable third parties to register assets that have not been discovered by the Social Network Monitor.</p>
      </sec>
      <sec id="sec-2-0-3">
        <title>4.1.3. Asset Workflow Handler</title>
        <p>The Asset Workflow Handler is the component of the Provenance Verification Layer that is responsible for orchestrating the components and data within the layer. This component's primary task is to distribute assets to different components for further processing. It invokes the service interfaces and handles the data flow between the services. By utilising the Asset Workflow Handler, components are loosely coupled, thus mitigating direct component-to-component communications. This enables Provenance to work with the variety of APIs exposed by the existing tools/components. Moreover, the APIs can be adjusted to meet Provenance's specific needs. Due to this modular design, new components (e.g., detection of bias [46], tabloidization [47], and hate speech [48]) can be easily added to the Provenance Verification Layer and connected to the Asset Workflow Handler.</p>
      </sec>
      <sec id="sec-2-0-4">
        <title>4.1.5. Video/Image Manipulation Detector</title>
        <p>The Provenance Video/Image Manipulation Detector identifies if an image or video has been manipulated in comparison to its source. This work is based on the PIZZARO14 project. It utilises recent developments achieved by deep learning-based methods to enable instant detection of manipulations in visual content. In addition, use of the latest technologies based on Convolutional Networks will lead to tangible enhancements in integrity verification of visual content. The Video/Image Manipulation Detector increases trust and improves governance. The solution is designed to build a web-based system to assess visual content in a real-world setting. The Video/Image Manipulation Detector will further support the development of users' skills in detecting false visual information themselves by providing world-class image forensic technology. It has a special focus on developing a solution that will be intuitive and easy to understand and interpret for end-users, thereby increasing its uptake by the public and its impact on the information system. This component's primary task is to detect if an image or video has been manipulated by comparing it with previously registered images and videos in the system.</p>
      </sec>
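      <p>The loose coupling attributed to the Asset Workflow Handler above can be sketched as a simple dispatcher that fans each asset out to whatever analytical components have been registered and collects their outputs. The interfaces are assumptions for illustration; the component names follow the paper.</p>

```python
from typing import Callable, Dict

class AssetWorkflowHandler:
    """Orchestrates analytical components without coupling them directly."""

    def __init__(self):
        self.components: Dict[str, Callable] = {}

    def register(self, name, component):
        # New detectors (e.g., bias or hate-speech detection) can be
        # plugged in here without touching existing components.
        self.components[name] = component

    def process(self, asset):
        # Fan the asset out to every component and collect the outputs,
        # which would then be combined and sent to the Knowledge Graph.
        return {name: comp(asset) for name, comp in self.components.items()}

handler = AssetWorkflowHandler()
handler.register("text_tone", lambda asset: {"anger": 0.1})
handler.register("writing_quality", lambda asset: {"wqs": 0.9})
print(handler.process({"text": "example article"}))
```

      <p>Because components are addressed only through registered callables, swapping one analytical service for another does not affect its neighbours, which is the property the modular design above relies on.</p>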
      <sec id="sec-2-1">
        <title>4.1.4. Video/Image Reverse Searcher</title>
        <p>The Video/Image Reverse Searcher is a key component for creating a large-scale annotated dataset for detecting manipulated visual content. The dataset consists of three distinct parts. The first part includes 45,000 images, each captured by a unique device (i.e., 45,000 different cameras have been used). Half of these images are real, and the other half have been digitally manipulated by applying a random image processing operation to a local area of the image. Since the sensor pattern noise present in images is unique to each sensor (i.e., camera), this dataset introduces large diversity, such as noise. The second part of the dataset uses the imaging software in cameras to introduce a large diversity of artefacts in images. Commonly available camera brands and models were identified and used to collect a dataset of 50,000 images. Half of these images were digitally manipulated using an advanced image editing method based on Generative Adversarial Networks (GANs) [49]. Finally, the third part of the dataset consists of 2,000 images downloaded from the Internet representing "real-life" (uncontrolled) manipulated images created by random people. For all of the manipulated samples collected for the third part of the data, the matching unmanipulated image was also collected. This component's primary task is to enable the search operation for videos and images.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.1.6. Asset Fingerprinter and Asset Registry</title>
        <p>The Asset Fingerprinter and Asset Registry provide traceability of registered content. They are based on Blockchain technology, making content immutable and enabling the verification of the sources of, and alterations to, the content. Registered assets are handed to the Asset Fingerprinter via the Asset Workflow Handler. Due to the General Data Protection Regulation (GDPR) and the size of some assets, the hash of the data is stored on the Blockchain. Azure Storage is used as the Blockchain, and the assets themselves, including large files, are stored using an off-line storage service available to store multimedia files. Blockchain is used due to its innate data integrity, which is important for proving the traceability of registered content if the tool were ever targeted as part of a combined disinformation and hacking campaign. This component's primary task is the traceability of registered content via Blockchain.</p>
        <p>which had characteristics symptomatic of disinformation, was annotated in a crowdsourced study to identify terms and phrases indicative of low-quality writing. A WQS for each piece of content was then derived using a standard formula. This was subject to testing and expert evaluation to ensure the WQS the formula produced accurately reflected each piece of content. Models were then trained on the dataset, which showed that the WQS could be automatically generated with a high degree of accuracy. These models and the overall process are currently undergoing formal evaluation.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.1.7. Text Similarity Detector</title>
        <p>News is regularly republished nationally and locally from international wire services such as Reuters, Agence France-Presse (AFP) and Associated Press (AP). In a bid to lower costs, many news agencies who are not in competition negotiate deals to republish each other's content. Similarly, less trustworthy news outlets often put 'spins' on existing articles, where correct articles are modified to contain false information.</p>
        <p>To combat this, the Text Similarity Detector in Provenance attempts to verify the textual content of an article by comparing it to similar articles published elsewhere. A backlog of trustworthy articles is stored in an Elasticsearch database with a BM25 similarity index [50]. As BM25 under-performs with very long documents [51], only the title and first 10 sentences are used in the index. Once similar articles have been found, the component searches for the facts given in the query article in the similar ones. Facts in an article are found by taking sentences with a low subjectivity from TextBlob's sentiment analysis model [52]. The similarity of two facts is the cosine similarity of the vector embeddings of both, which are provided by Google's multilingual text model [53]. If enough of the article's factual content cannot be verified, the plugin displays a warning.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.1.10. Knowledge Graph and Knowledge Graph Builder</title>
        <sec id="sec-2-4-1">
          <title>4.1.8. Text Tone Detector</title>
          <p>Intuitively, one would expect that impartial news sources would use impartial, unemotive language to convey the facts of a story. Recent research has shown that emotions such as fear, anger, sadness and doubt, and the absence of joy and happiness, are indicative of misinformation and disinformation [54, 55, 56]. Provenance's Text Tone Detector is designed to identify emotions in text which may indicate that the news source is unreliable. Threshold values are used to determine whether caution should be shown, and the degree of caution is determined by how far the calculated value deviates from the threshold value.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>4.1.9. Writing Quality Detector</title>
          <p>Provenance's Writing Quality Detector computes a writing quality score (WQS) for the textual content the user is viewing and provides a warning when it falls below a threshold value. Writing quality is closely related to cohesion and coherence [57]. Within the context of news, high-quality writing is indicative of paid professional journalism from mainstream, independent and, to a lesser degree, alternative news agencies, whereas low-quality writing is indicative of amateur or unprofessional news production processes [58]. This high/low quality differentiation is also apparent in other domains such as academia, publishing, commercial, and blogs and information websites. While NLP techniques exist to derive writing quality [59], and others have called for it to be used to identify misinformation and disinformation [60, 61], only two examples of systems could be found in the literature which actually calculate writing quality [62, 63].</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>4.1.10. Knowledge Graph and Knowledge Graph Builder</title>
          <p>The Provenance Knowledge Graph stores a record of all the articles introduced to Provenance via the Social Network Monitor service or via Asset Registration from a Trusted Content Analyst. It is also a record of all analysis performed on those assets.</p>
          <p>The content is organised according to concepts, categories and topics. For example, a news article discussing politics can be categorised according to the left/right political spectrum, followed by the topics discussed, as shown in Figure 2. Each node at the article level is split according to text, image and video.</p>
          <p>The output of the Video/Image Reverse Searcher includes the N most similar images/videos, distance measures and geometric validation results. The data from the Video/Image Manipulation Detector includes the probability of manipulations and the area of polygons. These are sent as JSON objects to the Knowledge Graph, where they are stored as entities in a triplestore.</p>
          <p>Modelling of Provenance data is achieved using a combination of the RDF Data Cube vocabulary [64], to store statistical information such as the outputs from the various analytical components, and the Dublin Core/BIBO vocabularies [65], to model bibliographic information about the assets themselves. Some use is also made of the FOAF vocabulary (http://xmlns.com/foaf/spec/) to model information such as content publishers, which are naturally represented as foaf:Agent entities.</p>
          <p>The Knowledge Graph Builder is responsible for exposing a REST API which the Asset Workflow Handler may use to upload assets as JSON, and for transforming the JSON into triples which are stored in a triplestore. In Provenance, this is achieved using JOPA [66], a Java library which can be used to map POJOs to triples. Using Spring Boot (https://spring.io/projects/spring-boot), a REST API accepting JSON is exposed. The uploaded JSON is deserialized into POJOs using Spring Boot's built-in version of Jackson. JOPA is then used to serialize the triples out to an RDF4J (https://rdf4j.org/) instance.</p>
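          <p>The JSON-to-triples step performed by the Knowledge Graph Builder can be illustrated with a small sketch. The real pipeline uses JOPA, Spring Boot and RDF4J in Java; the Python below, with an invented example namespace and field names, only shows the shape of the transformation from an uploaded analysis result to subject-predicate-object triples.</p>
          <preformat>
```python
import json

# Illustrative namespace; the real vocabularies are RDF Data Cube,
# Dublin Core/BIBO and FOAF, as described above.
PROV = "http://example.org/provenance#"


def asset_to_triples(asset_json):
    """Flatten one uploaded asset (a JSON object) into (s, p, o) triples.

    The asset is expected to carry an 'id' plus scalar analysis outputs,
    e.g. a manipulation probability from the Video/Image Manipulation
    Detector. The field names here are assumptions for illustration.
    """
    asset = json.loads(asset_json)
    subject = PROV + asset["id"]
    triples = []
    for key, value in asset.items():
        if key == "id":
            continue
        triples.append((subject, PROV + key, value))
    return triples
```
          </preformat>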
          <p>To calculate WQSs for Provenance, a dataset of news articles, blog posts and other website content, much of […]</p>
          <p>The same serialization process works in reverse, allowing the Provenance Query Service to expose both a JSON REST endpoint, which can produce JSON objects from the results of a canned SPARQL query exposed via a Spring Boot REST endpoint, and a much lower-level raw SPARQL endpoint from the triplestore, for those who want a high level of control over their queries.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>4.1.11. Provenance Query Service</title>
        <p>The Provenance Query Service is the interface to the Verification Layer and offers external trusted services the means to request verification information about a webpage or article. It also allows trusted services to identify the relatedness of content (through similarity and the Knowledge Graph) and to determine whether content has been modified. As the results of all analysis are stored in the Knowledge Graph, the Provenance Query Service is effectively a proxy between the user-facing front-end and the query interface to whatever storage medium is used to implement the Knowledge Graph.</p>
        <p>As mentioned in Section 4.1.10, the Provenance Query Service exposes both a raw SPARQL endpoint and a REST API which provides endpoints for a number of canned SPARQL queries that return JSON objects. It is envisioned that the vast majority of use cases will be covered by the REST API, making it easier for developers to access data that is helpful to users. However, it is worthwhile to allow lower-level access to the KG's contents in the event of unforeseen requirements being placed on the KG.</p>
      </sec>
      <sec id="sec-2-6">
        <title>4.1.12. Personalised Companion Service</title>
        <p>The Personalised Companion Service manages the Provenance verification indicator, the minimal user model, and user scrutability and control. The verification indicator is implemented as a Chrome Extension and works on the Facebook and Twitter platforms and with articles published by news agencies. The Personalised Companion Service uses the user's interests, domain knowledge, digital literacy and the warning preferences stored in the Minimal User Model to determine whether to highlight caution or to show the verification indicator without caution. It uses the data provided by the Asset Fingerprinter, the Video/Image Reverse Searcher and Video/Image Manipulation Detector, and the Text Similarity, Tone and Writing Quality Detector components to create the set of icons presented to users, who can explore the levels of verification presented through the visual iconography.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Provenance in Action</title>
      <p>The Provenance browser plugin is designed to provide users with easy-to-understand, granular and cautionary warnings about the content they are consuming. These warnings are provided via an in-browser icon beside the address bar when the user is browsing the Internet, or within their Facebook and Twitter social media feeds beside the content they are viewing. Figures 3–6 show how Provenance and its visual warnings appear to a user who has the Provenance plugin installed, within their Facebook social media feed. The Provenance icon appears as a small blue square with a white P above each content item that it has checked. When the icon background turns red (with a small exclamation mark), it indicates to the user that the content item is worthy of a cautionary warning. The following presents the four main states of Provenance which a user will see.</p>
      <p>Figure 3 shows the Facebook feed of a user who has the Provenance browser plugin installed. The Provenance icon is visible at the top of each news article in the user's feed. In this image, the icon is blue, which indicates that there are no warnings for this particular news item.</p>
      <p>In Figure 4, the background of the Provenance icon within the user's news feed has turned red to indicate that this news item is worthy of one or more cautionary warnings. A small black exclamation mark has been added to the top right of the icon for colour-blind users.</p>
      <p>In Figure 5, the user has clicked on the red Provenance icon. A window has appeared beneath the Provenance icon showing which of the seven verification criteria Provenance has detected an issue with. In this example, the red background and exclamation mark beneath the Writing Quality icon indicate that this aspect of the news article is worthy of caution. The user may click on the downward arrow beneath each icon for further information. In this example, the Tone icon is greyed out, indicating that it could not be assessed by Provenance in this instance.</p>
    </sec>
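    <p>The icon states described above follow a simple rule: the in-feed icon turns red as soon as any of the seven criteria warrants caution, while individual criteria in the notification pane are green (pass), red (caution) or grey (could not be assessed). A sketch of that logic, with assumed state names and criterion labels:</p>
    <preformat>
```python
CRITERIA = (
    "fingerprint", "reverse_search", "manipulation",
    "similarity", "tone", "writing_quality", "registration",
)  # seven criteria; these names are illustrative, not the plugin's


def badge(result):
    """Colour of one criterion's badge in the notification pane."""
    if result == "fail":
        return "red"      # exclamation mark shown beneath
    if result == "unavailable":
        return "grey"     # criterion could not be assessed
    return "green"        # white check mark


def icon_state(results):
    """Overall icon beside the content item: red if any criterion failed."""
    return "red" if "fail" in results.values() else "blue"
```
    </preformat>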
    <sec id="sec-4">
      <title>6. Use Cases: Provenance Plugin</title>
      <sec id="sec-4-1">
        <title>6.1. Social Media Timeline</title>
        <p>On the recommendation of a friend, Mary installed the Provenance browser plugin due to increased concerns about the spread of misinformation and disinformation. The instructional video on the Provenance Chrome Extension webpage explained that Provenance uses seven criteria to verify digital content on the Internet and in social media feeds. After installing the Provenance plugin, she notices that the news items in her Facebook timeline now display the Provenance icon beside the publisher's name.</p>
        <p>For most of the news stories, the Provenance icon shows a white P inside a white circle on a blue background. When she clicks on the blue Provenance icon, it opens a notification pane showing the seven verification criteria, all of which display a green background with a white ✓.</p>
        <p>She is able to click on each of the seven verification icons to read a detailed explanation of each criterion, why failing the criterion is an indication that the webpage or social media post may be misinformation or disinformation, and how the warning is derived. As all of the icons are green, she is reassured about the origin, veracity and overall quality of the news article. For some news items displayed on her timeline, she notices that the blue background of the Provenance icon has turned red. When she clicks on it, the same information pane displaying the same verification criteria appears, except that one or more of the seven verification criteria now display a red background with an exclamation mark beneath. When she clicks on these, an additional detailed explanation pane appears underneath to explain why the criterion has failed. Reading through each warning, including its detailed description, she gains a better understanding of how to identify misinformation and disinformation.</p>
        <p>In both instances, Mary has become more aware of the need to critically check the news she consumes and more aware of good media literacy habits in general.</p>
      </sec>
      <sec id="sec-4-2">
        <title>6.2. News Websites</title>
        <p>Mary regularly visits news websites to inform herself of current affairs. Usually, the Provenance icon, which is visible to the right of her browser's address bar, displays a white P inside a white circle on a blue background. However, recently, when she was visiting news websites to read more about a story relating to Covid-19 vaccination, she noticed that the background of the Provenance icon would sometimes turn red. When she clicked on the icon, the verification criteria information pane showed that Provenance had detected a problem with the image used in the news article she was reading. Clicking on the arrow to open the drop-down explanation pane, she reads that Provenance has detected that the image has been used before in another article. The image in question shows a picture taken at a conference of the World Health Organisation. Looking closely, she sees a credit to the Associated Press (AP). She knows that AP is an international news wire service, and that local and national news agencies republish their articles, including the images. As this is just an image of a press conference, she is confident that its use by multiple news agencies is not an issue.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Evaluation</title>
      <p>Provenance is under development and will shortly be undergoing human evaluation. Currently, five of the seven news analysis functions have been implemented and integrated with the platform. These are undergoing technical evaluation while the final two analysis tools are being completed. When the tool is fully completed, a series of technical tests and human evaluation tests will be undertaken to evaluate basic functionality and to ensure that it provides the right warnings at the appropriate time. Following this, a series of experiments will be undertaken to evaluate its effect on user behaviour, including the likelihood of reading and sharing news articles that have cautionary warnings beside them. We will also analyse unintended effects of the tool. Finally, a series of long-term studies is planned to evaluate its effect on users' media literacy.</p>
    </sec>
    <sec id="sec-6">
      <title>8. Conclusions</title>
      <p>Misinformation and disinformation are significant issues that have negatively affected public discourse, politics and social cohesion. The Internet, and especially social media, is the primary conduit for their growth and spread.</p>
      <p>Existing user-orientated browser plugins have limited capabilities and only provide users with a historical rating of a website's propensity to publish misinformation and disinformation. They are also not capable of detailed analysis of the content of news webpages or social media feeds. The Provenance browser plugin significantly improves upon existing user-orientated solutions by providing intermediary-free analysis of webpage and social media content using seven criteria and, where necessary, providing cautionary warnings to users. The user can then check the detailed explanatory warning notifications to make their own judgement. This will improve users' media literacy and reduce susceptibility to misinformation and disinformation in the long term.</p>
    </sec>
    <sec id="sec-7">
      <title>9. Acknowledgements</title>
      <p>The work has been supported by the PROVENANCE project, which has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 825227, and with the financial support of Science Foundation Ireland under Grant Agreement No. 13/RC/2106_P2 at the ADAPT SFI Research Centre.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref25"><label>25</label><mixed-citation>… automatic detection of fake news, in: Proceedings of the 6th International Workshop on Socio-Technical Perspective in IS Development (STPIS 2020), CEUR-WS, 2020, pp. 168–179. URL: http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-19356.</mixed-citation></ref>
      <ref id="ref26"><label>26</label><mixed-citation>K. Hartwig, C. Reuter, TrustyTweet: An indicator-based browser plugin to assist users in dealing with fake news on Twitter, 2019.</mixed-citation></ref>
      <ref id="ref27"><label>27</label><mixed-citation>A. Giełczyk, R. Wawrzyniak, M. Choraś, Evaluation of the existing tools for fake news detection, in: K. Saeed, R. Chaki, V. Janev (Eds.), Computer Information Systems and Industrial Management, Lecture Notes in Computer Science, Springer International Publishing, 2019, pp. 144–151. doi:10.1007/978-3-030-28957-7_13.</mixed-citation></ref>
      <ref id="ref28"><label>28</label><mixed-citation>A. Školkay, J. Filin, A comparison of fake news detecting and fact-checking AI based solutions, Studia Medioznawcze 20 (2019) 365–383.</mixed-citation></ref>
      <ref id="ref29"><label>29</label><mixed-citation>K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining perspective, ACM SIGKDD Explorations Newsletter 19 (2017) 22–36. doi:10.1145/3137597.3137600.</mixed-citation></ref>
      <ref id="ref30"><label>30</label><mixed-citation>A. Hanselowski, A. PVS, B. Schiller, F. Caspelherr, D. Chaudhuri, C. M. Meyer, I. Gurevych, A retrospective analysis of the fake news challenge stance-detection task, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, 2018, pp. 1859–1874. URL: https://www.aclweb.org/anthology/C18-1158.</mixed-citation></ref>
      <ref id="ref31"><label>31</label><mixed-citation>A. Goel, ProjectFib - GitHub Repo, 2016. URL: https://github.com/anantdgoel/ProjectFib.</mixed-citation></ref>
      <ref id="ref32"><label>32</label><mixed-citation>Eyeo, Trusted News, 2020. URL: https://chrome.google.com/webstore/detail/trusted-news/nkkghpncidknplmlkgemdoekpckjmlok?hl=en.</mixed-citation></ref>
      <ref id="ref33"><label>33</label><mixed-citation>D. Paschalides, C. Christodoulou, R. Andreou, G. Pallis, M. D. Dikaiakos, A. Kornilakis, E. Markatos, Check-It: A plugin for detecting and reducing the spread of fake news and misinformation on the web, in: 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2019, pp. 298–302.</mixed-citation></ref>
      <ref id="ref34"><label>34</label><mixed-citation>V. Inc, Fakebox, 2021. URL: https://machinebox.io/.</mixed-citation></ref>
      <ref id="ref35"><label>35</label><mixed-citation>Z. A. Estela, N2ITN/are-you-fake-news, 2021. URL: https://github.com/N2ITN/are-you-fake-news.</mixed-citation></ref>
      <ref id="ref36"><label>36</label><mixed-citation>Ground News, 2021. URL: https://ground.news/.</mixed-citation></ref>
      <ref id="ref37"><label>37</label><mixed-citation>A. Bhat, SurfSafe, 2021. URL: https://chrome.google.com/webstore/detail/surfsafe-join-the-fight-a/hbpagabeiphkfhbboacggckhkkipgdmh?hl=en.</mixed-citation></ref>
      <ref id="ref38"><label>38</label><mixed-citation>B. Botnevik, E. Sakariassen, V. Setty, BRENDA: Browser extension for fake news detection, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, 2020, pp. 2117–2120. URL: https://doi.org/10.1145/3397271.3401396.</mixed-citation></ref>
      <ref id="ref39"><label>39</label><mixed-citation>K. Popat, S. Mukherjee, J. Strötgen, G. Weikum, CredEye: A credibility lens for analyzing and explaining misinformation, in: Companion Proceedings of the The Web Conference 2018, WWW '18, International World Wide Web Conferences Steering Committee, 2018, pp. 155–158. doi:10.1145/3184558.3186967.</mixed-citation></ref>
      <ref id="ref40"><label>40</label><mixed-citation>FightHoax, FightHoax - unlock your programmatic advertising, 2021. URL: http://34.253.212.69/.</mixed-citation></ref>
      <ref id="ref41"><label>41</label><mixed-citation>M. Hardalov, I. Koychev, P. Nakov, In search of credible news, in: C. Dichev, G. Agre (Eds.), Artificial Intelligence: Methodology, Systems, and Applications, Lecture Notes in Computer Science, 2016. doi:10.1007/978-3-319-44748-3_17.</mixed-citation></ref>
      <ref id="ref42"><label>42</label><mixed-citation>M. Hardalov, mhardalov/news-credibility, 2019. URL: https://github.com/mhardalov/news-credibility.</mixed-citation></ref>
      <ref id="ref43"><label>43</label><mixed-citation>X. Zhou, A. Jain, V. V. Phoha, R. Zafarani, Fake news early detection: A theory-driven model, Digital Threats: Research and Practice 1 (2020) 12:1–12:25. doi:10.1145/3377478.</mixed-citation></ref>
      <ref id="ref44"><label>44</label><mixed-citation>W. E. Zhang, Q. Z. Sheng, A. Alhazmi, C. Li, Adversarial attacks on deep learning models in natural language processing: A survey, arXiv:1901.06796 [cs] (2019). URL: http://arxiv.org/abs/1901.06796.</mixed-citation></ref>
      <ref id="ref45"><label>45</label><mixed-citation>Z. Zhou, H. Guan, M. M. Bhat, J. Hsu, Fake news detection via NLP is vulnerable to adversarial attacks, in: Proceedings of the 11th International Conference on Agents and Artificial Intelligence, 2019, pp. 794–800. doi:10.5220/0007566307940800.</mixed-citation></ref>
      <ref id="ref46"><label>46</label><mixed-citation>B. Spillane, S. Lawless, V. Wade, The impact of increasing and decreasing the professionalism of news webpage aesthetics on the perception of bias in news articles, in: Proceedings of the 22nd International Conference on Human-Computer Interaction, Lecture Notes in Computer Science, Springer, 2020. doi:10.1007/978-3-030-49059-1_50.</mixed-citation></ref>
      <ref id="ref47"><label>47</label><mixed-citation>B. Spillane, I. Hoe, M. Brady, V. Wade, S. Lawless, Tabloidization versus credibility: Short term gain for long term pain, in: CHI '20: The ACM Conference on Human Factors in Computing Systems, ACM, 2020. doi:10.1145/3313831.3376388.</mixed-citation></ref>
      <ref id="ref48"><label>48</label><mixed-citation>A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics, 2017, pp. 1–10. doi:10.18653/v1/W17-1101.</mixed-citation></ref>
      <ref id="ref49"><label>49</label><mixed-citation>I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html.</mixed-citation></ref>
      <ref id="ref50"><label>50</label><mixed-citation>S. E. Robertson, S. Walker, Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, in: SIGIR '94, Springer, 1994, pp. 232–241.</mixed-citation></ref>
      <ref id="ref51"><label>51</label><mixed-citation>Y. Lv, C. Zhai, When documents are very long, BM25 fails!, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, Association for Computing Machinery, 2011, pp. 1103–1104. doi:10.1145/2009916.2010070.</mixed-citation></ref>
      <ref id="ref52"><label>52</label><mixed-citation>S. Loria, TextBlob documentation, release 0.16.0, 2020. URL: https://buildmedia.readthedocs.org/media/pdf/textblob/latest/textblob.pdf.</mixed-citation></ref>
      <ref id="ref53"><label>53</label><mixed-citation>Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, R. Kurzweil, Multilingual universal sentence encoder for semantic retrieval, 2019. arXiv:1907.04307.</mixed-citation></ref>
      <ref id="ref54"><label>54</label><mixed-citation>S. B. Parikh, V. Patil, P. K. Atrey, On the origin, proliferation and tone of fake news, in: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), IEEE, 2019, pp. 135–140. doi:10.1109/MIPR.2019.00031.</mixed-citation></ref>
      <ref id="ref55"><label>55</label><mixed-citation>J. Paschen, Investigating the emotional appeal of fake news using artificial intelligence and human contributions, Journal of Product &amp; Brand Management 29 (2019) 223–233. doi:10.1108/JPBM-12-2018-2179.</mixed-citation></ref>
      <ref id="ref56"><label>56</label><mixed-citation>X. Zhang, J. Cao, X. Li, Q. Sheng, L. Zhong, K. Shu, Mining dual emotion for fake news detection, in: Proceedings of the Web Conference 2021, 2021, pp. 3465–3476. doi:10.1145/3442381.3450004.</mixed-citation></ref>
      <ref id="ref57"><label>57</label><mixed-citation>I. Singh, D. P., A. K., On the coherence of fake news articles, in: I. Koprinska, M. Kamp, A. Appice, C. Loglisci, L. Antonie, A. Zimmermann, R. Guidotti, O. Özgöbek, R. P. Ribeiro, R. Gavaldà, et al. (Eds.), ECML PKDD 2020 Workshops, Communications in Computer and Information Science, Springer International Publishing, 2020, pp. 591–607. doi:10.1007/978-3-030-65965-3_42.</mixed-citation></ref>
      <ref id="ref58"><label>58</label><mixed-citation>M. Chung, N. Kim, When I learn the news is false: How fact-checking information stems the spread of fake news via third-person perception, Human Communication Research 47 (2021) 1–24. doi:10.1093/hcr/hqaa010.</mixed-citation></ref>
      <ref id="ref59"><label>59</label><mixed-citation>V. Klyuev, Fake news filtering: Semantic approaches, in: 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2018, pp. 9–15. doi:10.1109/ICRITO.2018.8748506.</mixed-citation></ref>
      <ref id="ref60"><label>60</label><mixed-citation>M. Spradling, J. Straub, J. Strong, Protection from 'fake news': The need for descriptive factual labeling for online content, Future Internet 13 (2021) 142. doi:10.3390/fi13060142.</mixed-citation></ref>
      <ref id="ref61"><label>61</label><mixed-citation>N. Fuhr, A. Giachanou, G. Grefenstette, I. Gurevych, A. Hanselowski, K. Jarvelin, R. Jones, Y. Liu, J. Mothe, W. Nejdl, et al., An information nutritional label for online documents, ACM SIGIR Forum 51 (2018) 46–66. doi:10.1145/3190580.3190588.</mixed-citation></ref>
      <ref id="ref62"><label>62</label><mixed-citation>C. Fan, Classifying fake news, 2017. URL: https://www.conniefan.com/wp-content/uploads/2017/03/classifying-fake-news.pdf.</mixed-citation></ref>
      <ref id="ref63"><label>63</label><mixed-citation>E. S. Jo, A. Muhamed, S. Nuthakki, A. Singhania, DeepNews: Detecting quality in news, 2018.</mixed-citation></ref>
      <ref id="ref64"><label>64</label><mixed-citation>W. W. W. Consortium, et al., The RDF data cube vocabulary (2014).</mixed-citation></ref>
      <ref id="ref65"><label>65</label><mixed-citation>D. C. M. Initiative, et al., Dublin core metadata element set, version 1.1 (2012).</mixed-citation></ref>
      <ref id="ref66"><label>66</label><mixed-citation>M. Ledvinka, P. Kremen, JOPA: Accessing ontologies in an object-oriented way, in: ICEIS (2), 2015, pp. 212–221.</mixed-citation></ref>
    </ref-list>
  </back>
</article>