Issues of Fact-Based Information Analysis

Natalia Sharonova, Anastasiia Doroshenko, Olga Cherednichenko

National Technical University "Kharkiv Polytechnic Institute", 2, Kyrpychova str., 61002 Kharkiv, Ukraine
nvsharonova@ukr.net, olha.cherednichenko@gmail.com

Abstract. With the recent growth of the Internet, mobile networks and social networks, the spread of fake news and click-baits has increased drastically. Today, the fact retrieval system is one of the most effective tools for identifying information for decision-making. We propose an approach based on the systematization of factual information. Different interpretations of the same phenomenon, as well as inconsistency, inaccuracy or mismatch in information coming from different sources, lead to the task of factual information extraction. In this work, we explore how natural language processing methods can help to check contradictions and mismatches in facts automatically. A reference model of the fact-based analytical system is proposed. It consists of such basic components as a Document Search component, a Fact Retrieval component, a Fact Analysis component, a Visualization component, and a Control component.

Keywords: Fact, Natural Language Processing, Information Extraction, Comparator Identification, Predicate, Reference Model

1 Introduction

Access to the Internet, as well as to social networks, has been simplified over the last decade. This has led to a growth of the information flow. Social networking sites give users the opportunity to share content freely. As a consequence, fake news, hoaxes, and click-baits are spread, circulated, consumed and shared without critical thinking or fact checking. Regardless of the form, the reverberations of inaccurate or misleading information can lead to major risks for society.

Misinformation can be spread both intentionally and accidentally. Among the stakeholders interested in fake information are politicians, marketing managers, sellers, and users with unclear purposes. In the media, politicians use alternative facts and post-truths in order to manipulate their audience's opinions, thus creating long-term sustainable mindsets. Alternative facts are information with no basis in reality, while post-truth techniques are defined as information beyond the truth or irrelevant to it [1].

As for social networking sites, there are several sources of fake information. Users' profiles contain a lot of misinformation. In many instances it is difficult to match the pages of the same user across different social networks because of clashes in personal information. Likewise, it is impossible to assess the reliability of posts presented in the newsfeed.

E-commerce is another sphere where fake information is common. Sellers at marketplaces provide the name and the description of products. The same product can be described in different ways by different sellers. Apart from that, in order to have their products appear in as many search requests as possible, sellers use clashing information.

Thus, the problem of identifying and verifying contradictory or ambiguous information is crucial. The key idea of our work is to develop an approach for checking contradictions and mismatches in facts automatically.

2 Related Works

In recent years, the reliability of information on the Internet has decreased significantly. This is particularly noticeable on social networks, where distorted, inaccurate or false information reaches and affects millions of users within minutes.
Therefore, the problem of fake information detection has become a popular research area. Automatic detection of fake news poses several challenges: determining whether the facts in a news article are correct; analyzing the relation between the article headline and the article body; estimating the inherent bias of a written text, etc. Factual analysis of text is intended to make possible the intellectual analysis of data extracted from the text flow. Solving this task should lead to a synergistic effect and to the possibility of building on existing information technologies.

The paper [1] presents an analysis of hoax medical news in social media. Stance classification is applied to hoax analysis, particularly with media content. An interesting framework has been developed to cross-check claims against fact-checks. One important concept for checking news in social networks emerges in the paper [2]: in each post in the newsfeed, the "fact" should be identified. However, in many situations it is impossible to identify whether some piece of information is a fact or not. Each fact is composed of something that has happened at some time, somewhere, possibly to someone.

The work [3] proposes an infrastructure for addressing phenomena of modern online media production, circulation, and manipulation by establishing a distributed architecture for automatic processing and human feedback. A hybrid technology infrastructure is proposed that provides user- and machine-generated annotations on top of the whole World Wide Web. The ultimate goal of the approach is to enable Internet users to handle fake news and other online media phenomena by providing automatic assessments of content and by including alternative opinions in the process of media consumption. The paper [4] discusses the role of computational social scientists in the fight against digital misinformation: clarifying the fundamental mechanisms that make us vulnerable to misinformation online, as well as devising effective strategies to counteract it.

There is a growing interest in automating the various activities that revolve around fact-checking, including newsgathering, verification and the delivery of corrections. The need for automatic hoax detection systems makes this a vital task. The paper [5] develops an approach that classifies posts in a social network with high accuracy as hoaxes or non-hoaxes on the basis of the users who "liked" them. Two classification techniques are presented: one based on logistic regression, and the other based on a novel adaptation of Boolean crowdsourcing algorithms. Both techniques were shown to be robust: they work even when the considered users are limited to those who like both hoax and non-hoax posts. These results suggest that mapping the diffusion pattern of information can be a useful component of automatic hoax detection systems.

The goal of the paper [6] is to describe the model employed by UCL Machine Reading for fake news detection. The presented stance detection model is a single, end-to-end system consisting of lexical and similarity features fed through a multi-layer perceptron with one hidden layer. Despite being relatively simple in nature, the model performs on par with the more elaborate, ensemble-based systems of other teams.
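To make this kind of baseline concrete, the following minimal Python sketch feeds bag-of-words features for headline and body, plus their cosine similarity, into a multi-layer perceptron with one hidden layer. It is not the implementation from [6]; the vectorizer settings, hidden-layer size and toy data are our illustrative assumptions.

```python
# A minimal sketch of a lexical-features + MLP stance baseline,
# in the spirit of [6]; all parameter choices here are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neural_network import MLPClassifier

headlines = ["Miracle cure found", "Miracle cure found"]
bodies = ["Doctors confirm the new treatment works.",
          "Experts say there is no evidence for this claim."]
stances = ["agree", "disagree"]  # toy labels

# Lexical features: TF-IDF vectors of headline and body.
vec = TfidfVectorizer(max_features=5000)
vec.fit(headlines + bodies)
H, B = vec.transform(headlines), vec.transform(bodies)

# Similarity feature: cosine similarity between headline and body.
sim = np.asarray([cosine_similarity(H[i], B[i])[0, 0]
                  for i in range(len(headlines))]).reshape(-1, 1)

X = np.hstack([H.toarray(), B.toarray(), sim])

# A single hidden layer, as in the end-to-end system described in [6].
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
clf.fit(X, stances)
print(clf.predict(X))
```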
The paper [7] analyzes the link between the article headline and the article body in order to detect whether the presented news is fake or not. Several neural network architectures were explored for stance detection in news articles. Attention-based models, in particular a variation of the Attentive Reader Model (ARM2), work well for this task. The given model evaluates each prediction in a two-step process. The first step compares the headline and the body and classifies the pair as related or unrelated. The second step classifies related headline-body combinations as agrees, disagrees, or discusses.

The research [8] is oriented toward fake news detection, where "fake news detection" is defined as the task of categorizing news. The paper discusses a typology of several varieties of veracity assessment methods emerging from two major categories: linguistic cue approaches and network analysis approaches. The paper [9] aims to enable the identification of deliberately deceptive information in text-based online news. The proposed system can alert users to deceptive news in the incoming news stream and prompt them to further fact-check suspicious instances. It is supported by an information system that applies a vector space model to cluster the news.

There are a number of approaches to information extraction from natural language texts. We can highlight the lack of automated semantic understanding and the low consistency of extracted facts. Despite existing data extraction solutions, the task of extracting facts is still not solved.

3 Methodology

Today, the fact retrieval system is one of the most effective tools for identifying information for decision-making. When you refer to something as a fact, you mean that you think it is true or correct. Factual information is information based on facts or relating to facts. The reliability of automatically extracted facts is the main problem of processing factual information. It is especially important because of the increasing density of the text information flow in mass media and various social networks, forums and blogs. Different interpretations of the same phenomenon, as well as inconsistency, inaccuracy or mismatch in information coming from different sources, lead to the task of factual information extraction.

We can consider facts as structured objects. Such a record describes a real-world entity with its attributes mentioned in the text: usually, who did what to whom, where and when. So, a fact can be extracted from textual information and can determine the attributes of an object or the relations between objects. The task of Information Extraction is to identify instances, relations, events and their relevant properties in natural language texts. We consider two types of facts, both of which can be described as triplets. The first kind of fact is a triplet "Subject -> Relation -> Object", where the subject is who acts and the relation defines the action on the object. The second kind of fact is a triplet "Object -> Attribute -> Value", where the object is the entity about which the fact is stated, and the attribute is a predetermined characteristic that identifies the object with certain values. Both kinds can be represented directly as data structures, as illustrated below.
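As a simple illustration of the two kinds of facts, the following Python sketch represents them as typed records; the field names and example values are our own illustrative assumptions, not part of the proposed model.

```python
# A minimal sketch of the two kinds of fact triplets described above;
# field names and example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationFact:
    """First kind: Subject -> Relation -> Object."""
    subject: str   # who acts
    relation: str  # the action
    obj: str       # what is acted upon

@dataclass(frozen=True)
class AttributeFact:
    """Second kind: Object -> Attribute -> Value."""
    obj: str        # the entity the fact is about
    attribute: str  # a predetermined characteristic
    value: str      # the concrete value of that characteristic

# Example facts of both kinds:
f1 = RelationFact("The company", "acquired", "a startup")
f2 = AttributeFact("Smartphone X", "screen size", "6.1 inches")

# A mismatch check could then compare values of the same attribute
# for the same object coming from different sources:
f3 = AttributeFact("Smartphone X", "screen size", "6.5 inches")
print(f2.obj == f3.obj and f2.attribute == f3.attribute
      and f2.value != f3.value)  # True -> potential contradiction
```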
The extraction of facts from weakly structured textual information includes the following steps:

1) Entity Extraction – extracting words or phrases that are important for describing the meaning of the text (lists of terms of the subject domain, personalities, organizations, geographical names, etc.);

2) Feature Association Extraction – searching for links between the extracted entities;

3) Event and Fact Extraction – extraction of entities and recognition of facts and actions.

To implement entity extraction, a standard linguistic processor is used. The remaining issue is the extraction of information about the relations between entities. For this purpose we need to define a template that reflects the semantic links in the sentence. Based on the fact definition, it is possible to define the minimal semantic unit of factual search, which is a triad: agent-predicate-value. That is, a record of factual information must include a pointer to the agent of the fact search, the attribute or predicate of this object, and a specific value of this attribute. Such a definition makes it possible to extract concepts from weakly structured text sources and to represent the relations between them in a structured way; a minimal sketch of such a pipeline is given at the end of this section. The resulting structures are facts, both in the form of fairly simple concepts (keywords, personalities, organizations, geographical names) and in a more complex form, for example, the name of a person together with her job and occupation.

The algebra of finite predicates is used as a mathematical tool for describing discrete, determinate and finite objects or processes of the real world [10, 11]. We use this mathematical scheme to represent the knowledge extracted from natural language texts:

- text information objects;
- the entities of the subject domain;
- grammatical and semantic characteristics of the text units.

Our analysis showed that the most natural and convenient tool for modeling natural language relations is the algebra of finite predicates operating on letter variables [10, 12]. This tool meets all the requirements for linguistic formalisms. In this setting, all kinds of morphological processing lead to the solution of algebraic equations with different initial data (fully or partially specified). Given an algorithm for solving these equations, the formalization of various word processing tasks can be greatly simplified. The approach allows equivalent transformations and minimization of the morphological model. Moreover, the commonality of expressive means makes it convenient to analyze different fragments of the model.

The mathematical relation is the basic concept of logical mathematics. A logical network is a processor that performs various operations on relations [10]. Relations express the attributes of objects and the connections between them; they are a universal means of describing any objects. Human language, as a means of communication, is only a means for expressing relations. Speaking to other people, we convey to them the meaning of a sentence, which is a relation. The exchange of thoughts between people is carried out only through the transfer of relations; each thought represents some relation. Perceiving objects and events of the external world, we obtain information about them in the form of relations. Any relation can be interpreted meaningfully as knowledge about a fact expressed by some utterance. A fact is an exhaustive description of the actual state of all places of interest to us.
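To make the three extraction steps and the agent-predicate-value triad concrete, here is a minimal Python sketch that uses spaCy as the "standard linguistic processor". The dependency-based pattern below is an illustrative assumption, not the template language of the proposed method, and it assumes spaCy's small English model is installed.

```python
# A minimal sketch of the entity -> relation -> fact pipeline described
# above; the dependency pattern is an illustrative assumption.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def extract_facts(text):
    """Step 1: entities; step 2: links between them; step 3: facts."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]  # step 1
    facts = []
    for token in doc:
        # Template: a verb with a nominal subject and a direct object
        # approximates the "Subject -> Relation -> Object" triplet.
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            for s in subjects:
                for o in objects:
                    facts.append((s.text, token.lemma_, o.text))  # steps 2-3
    return entities, facts

entities, facts = extract_facts("Apple acquired the startup in London.")
print(entities)  # e.g. [('Apple', 'ORG'), ('London', 'GPE')]
print(facts)     # e.g. [('Apple', 'acquire', 'startup')]
```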
Returning to relations and facts: knowledge of a fact only constrains the set of possible states of the places of interest. A statement about a fact can be true or false: it is true if the characterizing relation contains the actual combination of place states, and false otherwise.

4 Our approach

The model of fact extraction from natural language text can be presented in the following way. A fact is considered as a triplet "Subject -> Predicate -> Object": the predicate defines a relation, and the subject and object define two entities. In the developed model we introduce a set of grammatical characteristics of the words of the sentence. To represent the fact triplet we use the approach proposed in [11, 12].

We suggest a model of fact extraction based on the method of comparator identification [10]. It allows matching the data against a template and is based on the relations between words and the placement of these words in the text. This method represents the extraction process as a human intellectual activity, since a human looking through a text can easily determine whether it corresponds to a template or not and can catch the attributes of a fact.

We observe that descriptions of the same commodities on trading platforms can be presented in different ways, and that such a description can be represented as a triplet "Object -> Attribute -> Value", i.e., as the second kind of fact. The description of a commodity is represented as a number of words (usually not a sentence) and a table with some characteristics of the commodity.

Let $E$ be the set of structural elements of a web page and $W$ the set of words. Then $R_{SEARCH} \subseteq E \times W$ is the binary relation "is used for search". Let $E_q \subseteq E$ be the set of elements of the web page selected for estimation and $W_q \subseteq W$ a set of words that match the topic of the search. The binary relation

$R_{SEARCH} = \{(e_{qi}, w_{qj}) \mid e_{qi} \in E_q,\ w_{qj} \in W_q\}$

defines "element-word" pairs in which the words belong to the set of words corresponding to the topic and the elements belong to the set of selected elements.

Let $W_p \subseteq W$ be the set of words extracted from the web page, $w_{pj} \in W_p$. Then the predicate that evaluates an "element-word" pair is

$P_w(e_{qi}, w_{pj}) = \begin{cases} 1, & \text{if } (e_{qi}, w_{pj}) \in R_{SEARCH}, \\ 0, & \text{if } (e_{qi}, w_{pj}) \notin R_{SEARCH}. \end{cases}$

The predicate that defines the presence of control words in a particular element is

$P_e(e_{qi}) = P_w(e_{qi}, w_{p1}) \vee P_w(e_{qi}, w_{p2}) \vee \ldots \vee P_w(e_{qi}, w_{pn})$.

The web page estimation combines the estimates for the individual elements and is determined by the predicate

$P_q = P_e(e_{q1}) \wedge P_e(e_{q2}) \wedge \ldots \wedge P_e(e_{qs})$.

The page estimation is based on a data source model: the presence of different combinations of words in different combinations of elements of the web page is estimated. Let $R_{SOURCE} \subseteq E \times W$ be the binary relation "is used for source selection", given as

$R_{SOURCE} = \{(e_i, w_j) \mid e_i \in E_s,\ w_j \in W_{qi},\ W_{qi} \subseteq W_q\}$,

where $W_{qi}$ is the set of words corresponding to the element $e_i$. The predicate that estimates an "element-word" pair is defined as

$P(e_i, w_j) = \begin{cases} 1, & \text{if } (e_i, w_j) \in R_{SOURCE}, \\ 0, & \text{if } (e_i, w_j) \notin R_{SOURCE}. \end{cases}$

The predicate that estimates an element using different word combinations is

$P(e_i) = (P(e_i, w_{p1}) \wedge P(e_i, w_{p2}) \wedge \ldots \wedge P(e_i, w_{pj})) \vee (P(e_i, w_{p,j+1}) \wedge \ldots) \vee \ldots$

Web page estimation for various combinations of elements is given by the predicate

$P_S = (P(e_1) \wedge P(e_2) \wedge \ldots \wedge P(e_s)) \vee (P(e_j) \wedge \ldots) \vee \ldots$
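The following minimal Python sketch shows how the predicates $P_w$, $P_e$ and $P_q$ can be evaluated directly over the relation $R_{SEARCH}$; the example elements, words and relation contents are our illustrative assumptions.

```python
# A minimal sketch of the predicates P_w, P_e, P_q over R_SEARCH;
# the example elements, words and relation contents are illustrative.
from itertools import product

E_q = {"title", "description"}   # elements selected for estimation
W_q = {"smartphone", "camera"}   # words matching the search topic
W_p = {"smartphone", "battery"}  # words extracted from the page

# R_SEARCH: "element-word" pairs used for search.
R_SEARCH = set(product(E_q, W_q))

def P_w(e, w):
    """1 if the (element, word) pair belongs to R_SEARCH, else 0."""
    return 1 if (e, w) in R_SEARCH else 0

def P_e(e):
    """Presence of at least one control word in element e (disjunction)."""
    return max((P_w(e, w) for w in W_p), default=0)

def P_q():
    """Page estimate: every selected element contains control words."""
    return min(P_e(e) for e in E_q)

print([P_e(e) for e in sorted(E_q)], P_q())
```

Here disjunction and conjunction over {0, 1} values are computed as max and min respectively, which keeps the sketch faithful to the finite-predicate formulation above.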
The binary relation "elements and the corresponding words extracted from the source page", used to represent the template model, is $R_{PAGE} \subseteq E \times W$, $R_{PAGE} = \{(e_1, w_1), \ldots, (e_s, w_j)\}$. The function transforming word combinations into a template value from the set of "standards" $C = \{c_1, \ldots, c_m\}$ is given as

$F(e_i) = \begin{cases} c_1, & \text{if } (w_{i1} \wedge w_{i2} \wedge \ldots) \vee (w_{j1} \wedge w_{j2} \wedge \ldots) \vee \ldots, \\ \ldots \\ c_m, & \text{if } (w_{im} \wedge \ldots) \vee (w_{jm} \wedge \ldots) \vee \ldots, \end{cases}$

where $(e_i, w_j) \in R_{PAGE}$.

The set of elements of a web page that contain some standard from the set $C = \{c_1, \ldots, c_m\}$ is given as $E_p = \{e_j \in E \mid c = F(e_j),\ c \in C\}$. Let $R_{PATTERN} \subseteq E \times C$ be the binary relation "elements contain standards", where $R_{PATTERN} = \{(e_i, c_j) \mid e_i \in E_p,\ c_j \in C\}$.

The template predicate has the form

$P_{pattern} = \begin{cases} 1, & \text{if } \exists e_1 \exists e_2 \exists e_3\, (E(e_1, e_F) \wedge E(e_2, e_I) \wedge E(e_3, e_O)) = 1, \\ 0, & \text{otherwise}, \end{cases}$

where
$E(e_1, e_F) = \begin{cases} 1, & e_1 \in E_F, \\ 0, & e_1 \notin E_F; \end{cases}$ $\quad E(e_2, e_I) = \begin{cases} 1, & e_2 \in E_I, \\ 0, & e_2 \notin E_I; \end{cases}$ $\quad E(e_3, e_O) = \begin{cases} 1, & e_3 \in E_O, \\ 0, & e_3 \notin E_O. \end{cases}$

Let $K = \{k_j\}$ be a set of indicators relevant to the given signs; then the predicate $M(k, k_j)$ determines whether the summary template has the indicator $k_j$ from this set. Let $I = \{i_j\}$ be the set of indicators corresponding to the given one; then the predicate $M(i, i_j)$ determines whether the generic template contains the data for the indicator $i_j$.

We can propose a reference model for factual information retrieval and analysis (Fig. 1). The main concepts are facts, which are some knowledge about real-world objects; web pages, which contain text; indicators, which represent attributes; and the values of those attributes. Appropriate models must formalize the processing of factual data.

Figure 1. Reference model

In order to realize the proposed reference model, software should be developed. We suggest the basic components presented in Fig. 2. The software consists of such basic components as a Document Search component, a Fact Retrieval component, a Fact Analysis component, a Visualization component, and a Control component. Our future work is to implement and experiment with the proposed model.

Figure 2. Basic components

5 Conclusions and Future Works

Summarizing, we can say that factual analysis is a rather complex system with great potential and functionality. The tasks for which such systems are built are designed to facilitate the work of analysts and to carry out the filtering and structuring of huge volumes of information, which is nowadays one of the main tasks a person faces.

As a result, we can underline that the task of identifying instances, relations, events and their relevant properties in natural language texts is still a live issue. In general, we consider two kinds of facts. Despite existing data extraction solutions, the task of extracting facts is still not solved. We propose to use predicate algebra and the method of comparator identification to create a model of searching for and extracting factual data. Future work will be devoted to researching the similarity of facts and mismatch identification. We intend to develop a mathematical tool based on the tolerance relation to draw conclusions about similarity or mismatch in the set of extracted facts.

6 References

1. Purnomo, M.H., et al.: Biomedical Engineering Research in the Social Network Analysis Era: Stance Classification for Analysis of Hoax Medical News in Social Media. Procedia Computer Science 116, 3–9 (2017).
2. Figueira, Á., Oliveira, L.: The Current State of Fake News: Challenges and Opportunities. Procedia Computer Science 121, 817–825 (2017).
3. Rehm, G.: An Infrastructure for Empowering Internet Users to Handle Fake News and Other Online Media Phenomena, pp. 216–231 (2017).
4. Ciampaglia, G.L.: Fighting Fake News: A Role for Computational Social Science in the Fight Against Digital Misinformation. Journal of Computational Social Science 1, 147–153 (2018). https://doi.org/10.1007/s42001-017-0005-6
5. Tacchini, E., Ballarin, G., Della Vedova, M.L., Moret, S., de Alfaro, L.: Some Like it Hoax: Automated Fake News Detection in Social Networks. Cornell University, New York, USA (2017).
6. Riedel, B., Augenstein, I., Spithourakis, G.P., Riedel, S.: A Simple but Tough-to-Beat Baseline for the Fake News Challenge Stance Detection Task. Cornell University, New York, USA (2017).
7. Rakholia, N., Bhargava, S.: "Is it True?" – Deep Learning for Stance Detection in News. Stanford University, California, USA (2016).
8. Conroy, N.J., Rubin, V.L., Chen, Y.: Automatic Deception Detection: Methods for Finding Fake News. In: Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, Missouri, USA (2015).
9. Rubin, V., Conroy, N., Chen, Y.: Towards News Verification: Deception Detection Methods for News Discourse (2015).
10. Bondarenko, M.F., Shabanov-Kushnarenko, U.P.: Theory of Intelligence: A Handbook. SMIT Company, Kharkiv (2006).
11. Khairova, N., Sharonova, N.: Use of Predicate Categories for Modelling of Operation of the Semantic Analyzer of the Linguistic Processor. In: Proceedings of the IEEE East-West Design & Test Symposium (EWDTS'09) (2009).
12. Khairova, N.F., Petrasova, S., Gautam, A.P.S.: The Logical-Linguistic Model of Fact Extraction from English Texts. In: Dregvaite, G., Damasevicius, R. (eds.) ICIST 2016. CCIS, vol. 639, pp. 625–635. Springer, Cham (2016).
13. Khairova, N., Lewoniewski, W., Węcel, K.: Estimating the Quality of Articles in Russian Wikipedia Using the Logical-Linguistic Model of Fact Extraction. In: Abramowicz, W. (ed.) Business Information Systems. BIS 2017. LNBIP, vol. 288. Springer, Cham (2017).
14. Cherednichenko, O., Yanholenko, O.: Information Technology of Web-Monitoring and Measurement of Outcomes in Higher Education Establishment. In: EuroSymposium on Systems Analysis and Design, pp. 103–116. Springer International Publishing (2015). http://dx.doi.org/10.1007/978-3-319-24366-5_8