<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Similarity Using Word Embeddings to Classify Misinformation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Caio Sacramento de Britto Almeida</string-name>
          <email>caio@meedan.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debora Abdalla Santos</string-name>
          <email>abdallag@dcc.ufba.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Federal University of Bahia</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Meedan</institution>
          ,
<addr-line>San Francisco, USA</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Fake news has been a growing problem in recent years, especially during elections. It is hard work to identify what is true and what is false among all the user-generated content that circulates every day. Technology can help with that work and optimize the fact-checking process. In this work, we address the challenge of finding similar content in order to suggest to a fact-checker articles that may have been verified before, and thus avoid the same information being verified more than once. This is especially important in collaborative approaches to fact-checking, where members of large teams will not know what content others have already fact-checked.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News</kwd>
        <kwd>Similarity</kwd>
        <kwd>Misinformation</kwd>
        <kwd>Word Embeddings</kwd>
<kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        Fake news has always existed, but it has become a growing and more evident problem in recent years with the popularization of the Internet as a source of news, a role that used to be played by traditional media such as television, radio, magazines and newspapers. From a theoretical point of view [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], fake news must meet a few criteria: it is published and shared on the Internet; it is created with false content and without supporting evidence; and it is used to manipulate.
      </p>
<p>
        Disinformation (that is, purposely false information) has played a fundamental role in recent democratic electoral processes. Social media have the evident merit of enabling debate and amplifying voices in a space of great reach. Many studies show [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] how Twitter, Facebook and other platforms became important instruments of democracy, as they allow exchanges and stimulate discussion. However, just as happens in public debate outside the virtual world, social media are also used to disseminate false information. Automated accounts that make it easy to send a massive number of messages have become a potent tool for manipulating debates on social networks, especially at moments of political relevance.
      </p>
<p>
        In this way, online platforms enable old strategies of defamation and manipulation of public debate, now at a larger scale [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Both bots and humans play an important role in creating and spreading fake news in electoral contexts, sometimes on purpose, other times by mistake. Due to the nature of human psychology, people tend to believe things that support their existing beliefs, a process called "confirmation bias", defined as the tendency to remember, interpret or search for information in a way that confirms an initial belief or hypothesis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
<p>
        Attention begets more attention on social media [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Fake news on Twitter, for example, is 70% more likely to be accessed than true information [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], often due to shocking titles or sensationalism.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Automatic recognition of fake news</title>
<p>Identifying fake news automatically is not a trivial task. One approach is based on linguistics and tries to identify text properties, such as writing style and content, that can help differentiate false articles from true ones. An assumption of this approach is that linguistic behaviors, such as punctuation, word choice and emotional charge, are unconscious and thus outside the author's control, and could therefore reveal important insights about the nature of the text.</p>
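<p>As a hypothetical illustration of such linguistic cues, the sketch below extracts a few surface features that a style-based classifier might use as input. The feature set and the example texts are illustrative assumptions, not the method of the cited studies.</p>

```python
import string

def linguistic_features(text):
    """Extract simple surface cues a style-based classifier might use."""
    words = text.split()
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return {
        # Emotional charge: density of exclamation marks.
        "exclamation_density": text.count("!") / n_chars,
        # Overall punctuation density.
        "punctuation_density": sum(c in string.punctuation for c in text) / n_chars,
        # Shouting: share of fully upper-cased words.
        "all_caps_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        # Lexical cue: average word length.
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }

sensational = linguistic_features("SHOCKING!!! You WON'T believe what they found!")
sober = linguistic_features("Researchers published their findings in a peer-reviewed journal.")
print(sensational["exclamation_density"] > sober["exclamation_density"])  # True
```

In a real classifier these features would be fed, together with many others, to a supervised model trained on labeled true and false articles.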
<p>Studies based on that approach reach an accuracy near 76% when compared to human performance [9]. Although promising, future effort should not be limited to this and could also include other information, for example the number of inbound and outbound links, the number of comments, and visual analysis of the page where the content appears, among other computational techniques for fact verification [10].</p>
<p>Even so, since this is work of high responsibility, a completely automated classification of news can be very risky, and the reputation of a media organization can be affected by it. Therefore, this work proposes a hybrid approach that creates a human-in-the-loop system.</p>
    </sec>
    <sec id="sec-3">
      <title>Purpose</title>
<p>The purpose of this work is to identify articles that are similar to articles previously classified by a fact-checking agent as true or false, and in this way optimize the verification process. The idea is to implement a plug-in for Check [11], an open-source software for collaborative fact-checking already used in projects around the world. Check has been used, for example, during the US elections in 2016, the France elections in 2017, the Mexico elections in 2018 and the Indian elections in 2019, as well as in other verification projects not related to elections. Such a plug-in can optimize the fact-checking process by suggesting similar articles that were already classified in the past, avoiding that the same, or substantially similar, content is fact-checked more than once. Technically, the idea is to use artificial neural networks to identify similar articles based on their vector representations. The plug-in was implemented as a Check Bot, so every time a new piece of content is created on Check, the plug-in looks for similar items that have already been fact-checked.</p>
    </sec>
    <sec id="sec-4">
<title>Artificial Neural Network</title>
<p>Word2Vec [14] is the neural network used in this work. It is a two-layer network that processes text: its input is a text corpus and its output is a set of feature vectors for the words in that corpus. Its applications go beyond text analysis: it can be applied to genes, code, music playlists, social network graphs, and other verbal or symbolic series in which patterns can be recognized.</p>
<p>An example given by the authors of Word2Vec is: the vector that represents the token "Madrid", subtracted by the vector that represents "Spain" and summed with the vector for "France", will be very close to the vector obtained for "Paris". Or, as an equation:
vec(Madrid) - vec(Spain) + vec(France) ≈ vec(Paris) (1)</p>
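<p>This regularity can be sketched with toy vectors. The snippet below uses hand-made 2-D vectors, assumed purely for illustration (real Word2Vec embeddings have 300 dimensions and are learned from text), in which each capital equals its country plus a shared offset, so that the analogy arithmetic recovers "Paris" as the nearest neighbour by cosine similarity.</p>

```python
import numpy as np

# Toy 2-D vectors standing in for real 300-dimensional Word2Vec embeddings:
# each capital is its country plus a shared country->capital offset,
# mimicking the linear regularity that Word2Vec learns from text.
offset = np.array([0.5, 0.5])
vectors = {
    "Spain":   np.array([1.0, 0.0]),
    "France":  np.array([0.0, 1.0]),
    "Germany": np.array([1.0, 1.0]),
}
for country, capital in [("Spain", "Madrid"), ("France", "Paris"), ("Germany", "Berlin")]:
    vectors[capital] = vectors[country] + offset

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The analogy: vec(Madrid) - vec(Spain) + vec(France)
query = vectors["Madrid"] - vectors["Spain"] + vectors["France"]

# The nearest neighbour among the remaining tokens is "Paris".
candidates = {w: v for w, v in vectors.items() if w not in {"Madrid", "Spain", "France"}}
best = max(candidates, key=lambda w: cosine_similarity(query, candidates[w]))
print(best)  # Paris
```

With a real pretrained model the result is only approximately "Paris" (the closest vector in the vocabulary), whereas in this constructed example the match is exact.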
    </sec>
    <sec id="sec-5">
      <title>Results</title>
<p>Given the great results that Word2Vec can achieve in identifying text patterns [13], it was the choice for this work. We use the pretrained Word2Vec model, which contains 3 million vectors of 300 dimensions [15], trained on over 100 billion words from Google News. There are several promising avenues for extending this work to Portuguese (and other languages), but the datasets currently available in Portuguese are not as extensive as Google News or as well tested; so, in this paper the plug-in is used to identify the similarity between an input article in English and pre-classified articles in English.</p>
<p>The flow works as follows: when new information to be verified is inserted in Check's database, the plug-in developed in this work takes action. It calculates the vectors of the input text using Word2Vec and stores them in an ElasticSearch database, a distributed, open-source, scalable search service with good search performance [16]. Another useful feature is that ElasticSearch can be extended through plug-ins, including plug-ins for search criteria. For this work, it was necessary to implement a search plug-in for ElasticSearch that calculates the similarity between the input text (represented as vectors) and each stored text (also represented as vectors) using cosine distance. The interface between Check and ElasticSearch is provided by Alegre, an API that is part of the Check suite and is responsible for text and image processing, for example similarity, classification, glossary and language identification.</p>
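<p>The flow above can be sketched end-to-end. The snippet below is a minimal stand-in, assuming one-hot toy word vectors in place of the pretrained 300-dimensional model and a Python dict in place of ElasticSearch; it shows how averaged word vectors are stored per item and how a query returns the stored items with the smallest cosine distance.</p>

```python
import numpy as np

# One-hot toy word vectors standing in for dense pretrained Word2Vec
# embeddings; unlike Word2Vec they only capture exact word overlap,
# but that is enough to illustrate the indexing and search flow.
vocab = "vaccine causes autism cures cancer election was rigged fair".split()
word_vectors = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def doc_vector(text):
    """Average the vectors of the tokens found in the model."""
    vecs = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In-memory dict standing in for the ElasticSearch index of fact-checked items.
index = {}

def store(doc_id, text):
    """Index a fact-checked item by its document vector."""
    index[doc_id] = doc_vector(text)

def search(text, k=2):
    """Return the k stored items with the smallest cosine distance to the query."""
    q = doc_vector(text)
    return sorted(index, key=lambda d: cosine_distance(q, index[d]))[:k]

store("claim-1", "vaccine causes autism")
store("claim-2", "election was rigged")
print(search("vaccine causes cancer", k=1))  # ['claim-1']
```

In the real system the ranking happens inside the custom ElasticSearch search plug-in, so the vectors never leave the index at query time.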
<p>Therefore, searching for similar texts returns the vectors with the smallest cosine distances to the input vectors. The plug-in then uses Check's API to suggest to journalists using Check articles that were classified before and that are similar to the article the user is currently verifying. This way, the user can decide whether or not to relate them, and thus avoid fact-checking the same content multiple times. This workflow is represented in Fig. 1.</p>
      <p>The plugin was integrated into the software Check and is available on GitHub
[17].</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
<p>In this paper we developed an open-source architecture that uses ElasticSearch to efficiently search large volumes of text items by their vector representations in near real-time, and integrated it into an open-source fact-checking software tool. We used Word2Vec trained on English-language text, but the architecture we developed is not specific to these choices and can easily be adapted to other vector-representation models and languages.</p>
<p>First, we would like to use a corpus with Portuguese text [19], so this solution could be more useful for media organizations in Brazil. Second, the notion of "similar" is too broad: texts can be similar but completely contradictory. In this sense, it would be very useful to determine whether a given text supports or refutes another text. An approach for that could be stance detection [18], but it requires more research. Moreover, the choice of Word2Vec led to promising results, but other recent advances in this domain suggest that transformer models such as BERT, SBERT and XLM-RoBERTa produce representations able to improve the performance of NLP tasks, so those options should also be evaluated [20-22]. Although suggesting similar texts can optimize fact-checking work, much more could be done if more specific input corpora were built. Finally, we should evaluate how much this approach helps the verification work, for example by counting how many content items did not need to be verified again because similar, previously-verified items were correctly suggested by this tool.
</p>
      <p>8. DERAKHSHAN, H.; WARDLE, C. Information Disorder: Definitions. In: Understanding and Addressing the Disinformation Ecosystem, p. 5-12, 2017.</p>
      <p>9. PEREZ-ROSAS, V. et al. Automatic detection of fake news. arXiv preprint arXiv:1708.07104, 2017.</p>
      <p>10. THORNE, J. et al. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.</p>
      <p>11. MEEDAN. Check. Available at https://meedan.com/en/check/. 2016.</p>
      <p>12. KOVACS, Z. L. Redes neurais artificiais. Editora Livraria da Fisica, 2002.</p>
      <p>13. GOLDBERG, Y.; LEVY, O. Word2Vec Explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.</p>
      <p>14. MIKOLOV, T. et al. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, p. 3111-3119, 2013.</p>
      <p>15. MIHALTZ, M. Word2Vec GoogleNews Vectors. Available at https://github.com/mmihaltz/word2vec-GoogleNews-vectors. 2019.</p>
      <p>16. BANON, S. Elasticsearch. 2013.</p>
      <p>17. MEEDAN. Check Source Code. Available at https://github.com/meedan/check. 2018.</p>
      <p>18. GHANEM, B.; ROSSO, P.; RANGEL, F. Stance detection in fake news: a combined feature representation. In: Proceedings of the First Workshop on Fact Extraction and Verification (FEVER), p. 66-71, 2018.</p>
      <p>19. MONTEIRO, R. A.; SANTOS, R. L. S.; PARDO, T. A. S.; DE ALMEIDA, T. A.; RUIZ, E. E. S.; VALE, O. A. Contributions to the Study of Fake News in Portuguese: New Corpus and Automatic Detection Results. In: Villavicencio, A. et al. (eds) Computational Processing of the Portuguese Language (PROPOR 2018). Lecture Notes in Computer Science, vol. 11122. Springer, Cham, 2018.</p>
      <p>20. DEVLIN, J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.</p>
      <p>21. REIMERS, N.; GUREVYCH, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.</p>
      <p>22. CONNEAU, A. et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>1. <string-name><surname>TANDOC JR.</surname>, <given-names>E. C.</given-names></string-name>; <string-name><surname>LIM</surname>, <given-names>Z. W.</given-names></string-name>; <string-name><surname>LING</surname>, <given-names>R.</given-names></string-name> <article-title>Defining "fake news": a typology of scholarly definitions</article-title>. <source>Digital Journalism</source>, Taylor &amp; Francis, v. <volume>6</volume>, n. <issue>2</issue>, p. <fpage>137</fpage>-<lpage>153</lpage>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>2. <string-name><surname>ENLI</surname>, <given-names>Gunn Sara</given-names></string-name>; <string-name><surname>SKOGERBØ</surname>, <given-names>Eli</given-names></string-name>. <article-title>Personalized campaigns in party-centred politics: Twitter and Facebook as arenas for political communication</article-title>. <source>Information, Communication &amp; Society</source>, v. <volume>16</volume>, n. <issue>5</issue>, p. <fpage>757</fpage>-<lpage>774</lpage>, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>3. <string-name><surname>ADORNO</surname>, <given-names>G.</given-names></string-name>; <string-name><surname>SILVEIRA</surname>, <given-names>J.</given-names></string-name> <article-title>Pós-Verdade e Fake News: Equívocos do Político na Materialidade Digital</article-title>. <source>Anais do SEAD</source> <volume>8</volume>: <fpage>1</fpage>-<lpage>6</lpage>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>4. <string-name><surname>RUEDIGER</surname>, <given-names>M. A.</given-names></string-name> et al. <article-title>Robôs, redes sociais e política no Brasil: estudo sobre interferências ilegítimas no debate público na web, riscos à democracia e processo eleitoral de 2018</article-title>. <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>PLOUS</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>The psychology of judgment and decision making</article-title>
          .
<source>McGraw-Hill Book Company</source>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>6. <string-name><surname>HALE</surname>, <given-names>S. A.</given-names></string-name> et al. <article-title>How digital design shapes political participation: A natural experiment with social information</article-title>. <source>PLOS ONE</source>, v. <volume>13</volume>, p. <fpage>1</fpage>-<lpage>20</lpage>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>7. <string-name><surname>LANGIN</surname>, <given-names>K.</given-names></string-name> <article-title>Fake News Spreads Faster than True News on Twitter, Thanks to People, Not Bots</article-title>. <source>Science</source>, <year>2018</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>