<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Checking fake news on web browsers: an approach using collaborative datasets</article-title>
      </title-group>
      <abstract>
        <p>Rumors are a constant reality related to information sharing on social networks. The increase of interactions, encouraged by social media, facilitates the dissemination of non-vali dated content. Sometimes, promoting misinformation and causing irreparable damages. The dynamism of on line activities transforms the process of evaluating the accuracy of a message in a lonely task, user-dependent and often tricky. The lack of reliable and centralized data sources that can be used as a reference for content verification, as well as the lack of tools to support this process, makes it harder to verify facts quickly. This article presents the creation of a col laborative dataset of fake Brazilian news, an API to enable the validation of contents and how this environment was used in the development of an extension for the Google Chrome browser, giving rise to a solution that allows checking a text selected by the user.</p>
      </abstract>
      <kwd-group>
        <kwd>rumor</kwd>
        <kwd>information sharing</kwd>
        <kwd>facts</kwd>
        <kwd>fake news</kwd>
        <kwd>data</kwd>
        <kwd>datasets</kwd>
        <kwd>browser</kwd>
        <kwd>fake</kwd>
        <kwd>validation</kwd>
        <kwd>disinformation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, the growth of social media has intensified human interaction on the
internet, allowing fast information dissemination, which reaches a diverse and
broader public. Despite the notorious social contribution brought by social
technologies and tools, the difficulty in dealing with information overload contributed
to the growing emergence of unverified content (rumors).</p>
      <p>Contents are transmitted continuously without a veracity or provenance
validation. The speed of communication turns ineffective any process of veracity
assessment. Also, conflicts resulting from socio-political divergences drive the
practice of producing doubtful content envisioning attention focus and the discord
promotion. For example, during the presidential election in the United States, fake
news in Facebook has surpassed the content of the mainstream media [1],
demonstrating its influence on democratic issues and how important it is to fight
disinformation. Moreover, producing fake news has become a profitable business [2].
Rumor identification is a critical challenge [3]. A rumor can be defined as "an
unverified claim that starts at one source (or more) and begins to spread over time to several
nodes in the network" [4]. Rumors can be classified into three types according to their
purpose [5]. First, the news that comes from tabloids, which produce news about
scandals, famous people or crimes, with the objective audience. In this scenario,
many news stories are created, fake or exaggerated [6].</p>
      <p>The second type is hoaxes, which are rumors created in social media, using real
facts to be improperly validated, generating more confidence about its message.
Unlike a simple joke or material to generate profit through entertainment, hoaxes can
cause real damages to someone else [7]. The latter type is the humor news, which
relies on deadpan humor and irony to imitate genuine news sources, and has wide
reach. Regardless of the type of rumor, content verification is a task that confronts the
dynamism and speed of social media. Despite the existence of websites and blogs
dedicated to checking facts, finding explanations about a rumor is still a task that
requires the use of search engines, which do not consider the impartiality and reliability
of a data source when presenting its results, which may increase the misinformation.</p>
      <p>The purpose of this article is three-fold, describing: a) the creation of a
collaborative dataset about fake news, b) an extension (plug-in) for Google Chrome web
browser, which allows the fast verification of content selected by the user, c) an API
that can be reused by other applications. This approach used as referential the main
Brazilian sites, but the process and the code source can be applied in any language
and domain.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Currently, several techniques have emerged to detect rumors on the internet. In
[8], authors used supervised machine learning techniques to classify a topic on
Twitter as “news” or “personal talk”. Those classified as “news”, they were
reclassified as “credulous” or “non-credulous”. The classification is a supervised
process, having a phase of manual annotation (done by human beings)and a
learning algorithm extracts the patterns. This work concludes that “news” usually
has links to the source and a larger spreading tree. “Credulous” news is propagated
by authors with a high quantity of posts, which is originated from one or a few
users on the network and has large amounts of retweets. On the other hand, the
“non-credulous” news is propagated by ones who use a lot of emoticons, have few
friends and have little time of life on Twitter.</p>
      <p>In [9],the authors developed a platform called Hoaxy to collect news - from
different social media and news sites - through crawlers and APIs. After the
collection, the platform only tracks updates through RSS. As future work, the
authors indicate the development of an interactive web interface for news analysis.</p>
      <p>
        The Truthy system [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] is a service designed to track political memes and
detect false information in this context. Its framework is responsible for collecting,
analyzing, and tracking memes. Other systems have been proposed such as
RumorLens [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ] and Twitter-Trails[12], both solutions allow users to explore the rumor
spread with an interactive interface, where the user inputs a rumor for searching.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11">13</xref>
        ], who after analyzing language and machine learning approaches, proposes
a hybrid model combining these two approaches, applying this method to network
analysis to create a possible false news detection system. In their experiments, the
authors demonstrate that linguistic processing must be constructed in multiple layers
and that such tools should be designed to aid in human judgment, but never to replace
it.
      </p>
      <p>Author of [14] evaluates the growing number of keywords in times of crisis,
relating the location of a given event with the emergence of messages about it and, using
machine learning, analyzed the proposed technique in social media during sporting
events and adapted to situations of such as the occurrence of earthquakes. Their
experiences have shown that, given to social crisis scenario or any type of event that is
beyond normality, identifying important words and verifying their origins can help in
assessing the credibility of information.</p>
      <p>[15] used as source data Wikipedia articles to identify false or unconfirmed
content. First, researchers looked at the impact of fake articles by measuring how long
they survive before they are eliminated and how many referrals they receive from ex
ternal links. In addition, using classification tasks, they have found characteristics
capable of identifying textual structures and differences between terms that help in the
identification of false content. Finally, the authors carry out experimentation
involving human beings and conclude that the developed classifier surpasses them, by a
great margin, in the task of classifying a content as false or not.</p>
      <p>The authors of [16] address the issue of the dissemination of rumors in an
epidemiological way, in which users can be infected if they are exposed to this type of
information. This methodology has strong relevance when studying the rise and
spread of rumors in social media. Epidemiological models can serve as a basis for
studies on the relationship between credulity and diffusion in the face of
unconfirmed news.</p>
      <p>Automatic veracity identification remains a difficult task for machines because it
requires a high level of abstraction and creativity. We believe that solutions that
involve collaborative knowledge creation - such as crowdsourcing strategies and access
to open data sources - are useful to this type of challenge. Our proposal uses these
collaborative data as a premise.</p>
    </sec>
    <sec id="sec-3">
      <title>Fakepedia: the proposal</title>
      <p>In our approach, we collect and store in a unique dataset, rumors published in various
portals specializing in content verification, so that they are centralized and organized.
We have also developed an API to make it easier to interact with the dataset and
created an extension for the Google Chrome browser that demonstrates the use of the en
vironment.</p>
      <p>As can be seen in Figure 1, the integration between these modules consists of the
FakePedia architecture and can be summarized as follows: crawlers download news
published by specialized sites, create .xml files and execute the script responsible for
Vector Model . Vector Model uses existing .xml files to create .csv files that will
serve as a query base, which are indexed (index.py) and stored in the dataset. When
a user performs a query in their browser, the extension calls the API that executes
the search.py script. This script performs word processing operations on the query
and searches the terms in the dataset, based on similarity and tf-idf calculations.</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>We created a dataset with the main fake news. For this, we developed (in Python)
crawlers capable of extracting information about rumors already known in different
websites and blogs specialized in the subject, in Brazil: e-farsas[17], boatos.org[18]
and "is or is not"[19]. The crawlers extract data from sites using the beautiful soup
library and write them to an XML file whose structure is shown in Figure 2:
After this step, we run another script for the creation of a structured CSV file with
well-defined fields such as "short description of the rumor," and "it is rumor"
(boolean field). The content saved in these files is then analyzed, so irrelevant terms such
as articles and pronouns can be removed. The duplicated news is removed, too.</p>
      <p>API
To facilitate the use of the data set and to encourage the development of applications
that need to identify the rumors, we have developed an API (Application Programming
Interface), whose accessibility functions allow interaction with data set records,
through routines and patterns that allow the exchange of HTTP request and response
messages in JSON format.</p>
      <p>The API returns a list of all the records stored by the crawlers. In addition, it is pos
sible to use filters by subject or by the Boolean field to identify whether an item has
been classified as rumor or not. The developed API calls are intuitive and indicate what
can be returned, as in Table 1.</p>
      <p>URL
/api/news
/apinews/fake
/api/news/true
/api/news/find/[term]</p>
      <p>One of API’s functionality is the search for a term or expression using the Vector
Search and Information Retrieval Model. In this process, the Vector Model uses the
search expression sent to the API to retrieve relevant documents in the fakenews
dataset, resulting in a list of references sorted by the degree of similarity (relevance).
The similarity is calculated by weighting the terms of the query and the terms of the
dataset documents.
6</p>
    </sec>
    <sec id="sec-5">
      <title>Information Search and Retrieval</title>
      <p>In the Vector Model [16], a document is represented by a set of indexed terms and as
sociated with a normalized value that indicates its degree of relevance to the document.
Similarly, the search expression is represented by a numeric vector whose elements
represent the degree of relevance of the term to the expression. For the calculation of
these weights and creation of the vector model of the rumor dataset, we have
implemented a script that creates an inverted list from the .xml files generated by the
crawlers, containing each term and the set of news related to it. The lists help in the
calculation of the tf-idf for each term, where tf ("term frequency") is the number of
times a given term appears in the text of a document and idf ("inverse document
frequency") the frequency that a term occurs in any set of documents.</p>
      <p>The calculation of idf characterizes the term considering the entire corpus,
decreasing the weight of terms that occur in more documents and increasing the weight of
those that occur rarely. In this way, the tf-idf measure is used to assign weight to each
element of the vectors representing the corpus documents. The most heavily indexed
terms are those that occur very frequently in very few documents. The calculation of
tf-idfis given by equation 1:
The Figure 3 illustrates the representation of the eBUSCA1 search expression (0.2,
0.35, 0.1), along with the documents DOC1 (0.3, 0.0, 0.5) and DOC2 (0.5, 0.4, 0.3) in
a three-dimensional vector space formed by the terms t1, t2 and t3:</p>
      <p>The partial comparison between document representation and the search expression
allows the model to calculate the degree of similarity and create a ranking of documents
closer to the query performed. The similarity calculation is given by the equation 2:
where wi,x is the weight of the ith term of document x and wi,y is the weight of the
ith term of the search expression y. Using the representation of the search
presented in Figure 2 and applying the calculation of the degree of similarity, we have for
sim(DOC1, eBUSCA1) = 0.45 and sim (DOC2, eBUSCA1) = 0.92, concluding that
DOC2 has greater similarity with the search expression and therefore is displayed
first in the list.</p>
      <p>When searching for a term through the API, the search script checks the database for
the corresponding results, as well as the similarity of all results with the desired term.</p>
      <p>The output is the five most similar results.
7</p>
    </sec>
    <sec id="sec-6">
      <title>Google Chrome Extension</title>
      <p>Envisioning a better interface with the internet user, we developed an extension for
the Google Chrome browser. So, the user can select a piece of text directly on
the page that (s)he is visiting/reading and to search by the veracity of it. When
selecting the text, the extension queries the dataset through the API searching and
returning the relevant results.</p>
      <p>The use of the proposed solution based on the extension developed for Google
Chrome can be represented by four distinct moments. In the first moment, as
shown in Figure 4 , the user is confronted with the news that has an uncertain
origin and seems suspicions. The translation of the news’ title is: Donald Trump
mentions Bolsonaro in his speech. PS: Bolsonaro is a Brazilian deputy, running
for the presidential elections in 2018.</p>
      <p>With the extension installed in the browser, the user will have the option to select
the suspect content (or part of it) and send it to the analysis tool, which will search
for the text in the fakenews dataset. As we can see in Figure 5 , this process is
facilitated by having the user activate it by clicking the right mouse button.</p>
      <p>Fig. 5. Searching for rumors
The result of this verification is a list of links where the user may find relevant
information about the subject (Figure 5). This list is ordered by the degree of
similarity among the documents in the dataset and the text selected and searched.</p>
      <p>Finally, by clicking on one link, the user accesses the site that verified the
news. Then, the user can get more information about a topic, and acknowledge
about a rumor. At this point, the verification cycle ends.
8</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>Social web shatters the barrier to communicate in anytime and anywhere for
people from all walks of life. With a fast and many- to-many model of information
dissemination, a new challenge is the recognition of an unreliable content.
Additionally, false information can be propagated through social media, resulting in
embarrassment, loss or irreversible damages. Consequently, we need methods and
techniques to identify rumors, helping in the clarification of doubts and confirmation of
facts.</p>
      <p>In this article, we present an approach for the identification of fake news.
This proposal is composed by: a) a collaborative dataset about fake news, b) an
extension (plug-in) for Chrome web browser, which allows the fast verification of
content selected by the user, c) an API that can be reused by other applications.
Also, we illustrated its use. Through this approach, it was possible to provide a
functional solution that allows users to search by rumors in verified sources,
dynamically, directly in the browser.</p>
      <p>This approach used as referential the main Brazilian sites about fake news,
but the process and the code source can be applied in any language and domain. We
decided to instantiate this approach for the Brazilian scenario due to the existence
offew public solutions. FakePedia, at this moment, is applied to the veracity
verification of news related to Zika epidemics and will be used to follow the news
during the 2018 election year.</p>
      <p>As future works, we will improve the search process through the adoption of
paraphrase techniques capable of identifying variations of a rumor, besides
documenting the use of the API and implementing mechanisms capable of encouraging
society to collaborate with the creation of the public dataset.
9</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Silverman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>This analysis shows how viral fake election news stories outperformed real new on Facebook</article-title>
          .
          <source>BuzzFeed (Nov. 16</source>
          ,
          <year>2016</year>
          ); Gu,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Kropotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Yarochkin</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>The fake news machine, how propagandists abuse the internet and manipulate the public</article-title>
          .
          <source>In 1st ed. Trend Micro</source>
          , p.
          <volume>81</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Conroy</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>V. L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Automatic deception detection: Methods for finding fake news</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>Proceedings of the Association for Information Science and Technology</source>
          ,
          <volume>52</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Vosoughi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Rumor Gauge: Predicting the Veracity of Rumors on Twitter</article-title>
          .
          <source>ACM Transactions on Knowledge Discovery from Data (TKDD)</source>
          .
          <volume>11</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>V. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Conroy</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Deception Detection for News: Three Types of Fake News</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Proceedings of the Association for Information Science and Technology</source>
          ,
          <volume>52</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          - Sherman, A. '
          <article-title>Florida Democrats just voted to impose Sharia law on women,' bloggers say</article-title>
          .
          <source>Politifact (May. 08</source>
          ,
          <year>2014</year>
          ) Kang, C. In Washington Pizzeria Attack,
          <source>Fake News Brought Real Guns. New York Times (May. 12</source>
          ,
          <year>2016</year>
          ) https://www.nytimes.com/
          <year>2016</year>
          /12/05/business/media/cometping-pong
          <article-title>-pizza-shooting-fake-news-consequences</article-title>
          .html Castillo,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Mendoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            , &amp;
            <surname>Poblete</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          (
          <year>2011</year>
          , March).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>Information credibility on twitter</article-title>
          .
          <source>In Proceedings of the 20th international conference on World Wide Web</source>
          (pp.
          <fpage>675</fpage>
          -
          <lpage>684</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Shao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciampaglia</surname>
            ,
            <given-names>G. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flammini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Menczer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Hoaxy: A Platform for Tracking Online Misinformation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ratkiewicz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conover</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meiss</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patil</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flammini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Menczer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Truthy: Mapping the spread of astroturf in microblog streams</article-title>
          .
          <source>Proceedings of the 20th International Conference Companion on World Wide Web (WWW '11)</source>
          ,
          <fpage>249</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          11.
          <string-name>
            <surname>Resnick</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zeffer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Rumorlens: A system for analyzing the impact of rumors and corrections in social media</article-title>
          .
          <source>In Proc. Computational Journalism Conference</source>
          <volume>12</volume>
          .
          <string-name>
            <surname>Metaxas</surname>
          </string-name>
          , P. T.,
          <string-name>
            <surname>Finn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mustafaraj</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Using TwitterTrails. com to Investigate Rumor Propagation</article-title>
          .
          <source>Proceedings of the 18th ACM Conference. Companion on Computer Supported Cooperative Work &amp; Social Computing</source>
          ,
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          13. E-Farsas.http://www.e-farsas.
          <source>com 14</source>
          . Boatos.org.http://www.boatos.
          <source>org 15</source>
          .
          <article-title>É ou não é</article-title>
          . https://g1.globo.com/e-ou
          <string-name>
            <surname>-</surname>
            nao-e 16. G. Salton,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Wong</surname>
            , and
            <given-names>C.S.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>A vector space model for information retrieval</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          , Nov. 1975
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>