<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Algebraic Bayesian networks: consistent fusion of partially intersected knowledge systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A Tulupyev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>N Kharitonov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A Zolotin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TICS Lab, SPIIRAS</institution>
          ,
          <addr-line>39 14</addr-line>
        </aff>
      </contrib-group>
      <fpage>109</fpage>
      <lpage>115</lpage>
      <abstract>
        <p>This paper presents approaches, based on the theory of algebraic Bayesian networks, to synthesizing a consistent system of probability estimates of propositional formulas from two different, partially overlapping data sets. Application areas for the results are described, and an example of such a synthesis for a particular algebraic Bayesian network is given.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the expansion and the observed acceleration of the digitization of all spheres of the economy
(it would perhaps be more accurate to speak of the digitalization of the national economy and, even
more broadly, of its informatization and of the information society), data sets (including large
data sets, or "big data") are becoming more accessible. They are obtained not through specially planned,
expensive experiments or field research, but simply as a result of "regular" day-to-day economic,
organizational, administrative, or other activities. Moreover, the accumulation
of data is more likely to be a "side effect" of such activity than something planned with
the subsequent needs in mind: analyzing the data, searching for patterns in them, and extracting
knowledge from them in order to improve or optimize processes or to
prepare and make decisions.</p>
      <p>This state of affairs, in addition to the positive "availability" of data sets, entails two negative
effects. First, the data sets turn out to be fragmentary, and it is often impossible to form a single sample:
data from one source may cover one part of the parameters of the observed object or process, while data
from another source cover another part. These parts may intersect and still prove unsuitable for forming a
"combined" sample element built from two or more data
sources. Such a situation can arise for a variety of reasons, for example, when the values of the
parameters in different sources were recorded at different frequencies or, more generally, in different
observation periods. Second, the data sets obtained may be "inadequate" for preparing and
making decisions simply because some parameters were never planned to be registered; the need for
their values arose only when the data sets accumulated in information systems were
recognized as a useful resource and analytical effort began to be applied to them.</p>
      <p>
        The aim of this paper is to propose an approach to the synthesis of a consistent system of
probability estimates of propositional formulas for two different partially overlapping data sets, based
on the theory of algebraic Bayesian networks [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Examples of applications</title>
      <p>
        This situation arises in a wide range of areas: in epidemiology (in the study of risky behavior and the
search for ways to develop preventive programs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), in information security (in particular, in analyzing
how well information system personnel are protected against social engineering attacks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), in advisory (recommender)
systems used in online and offline trade [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in the support of physical education and sports activities
(including the training process in elite sport), in assessing the reliability of
systems, in psychology [6], pedagogy and sociology (including the analysis of social networks [7]),
and in ecology [8].
      </p>
      <p>In particular, when automating the analysis and evaluation of the degree to which
information system personnel are protected from social engineering attacks, the
graph of social connections is built by fetching highly uncoordinated data from a social network (possibly even
from several). However, such data still contain no parameters that could be used to estimate the
effectiveness of preventive measures aimed at personnel (for example, education or training), since,
as noted, the need to solve such a problem was not taken into account when the social networks were created,
and the corresponding capability was not built in. At the same time, the situation is not as hopeless as
it might seem, since it is possible to draw on knowledge obtained from other sources,
including expert psychologists and experts in the field of information security. This knowledge can
help "build bridges" between what has been extracted from the "big data" contained in social networks
and the parameters whose estimated values make it possible to prepare and
make decisions. It is precisely this workable "conjugation" of knowledge from heterogeneous sources that
becomes a key fundamental scientific problem, and approaches to its solution must be found,
described and automated.</p>
      <p>To emphasize the importance of the problem for a broad class of scientific research and areas of the
digital economy, let us give one more example. Advisory (recommender) systems are widely used that,
given the basket of goods in an online store, tell the user what other goods were bought (or
viewed) by other users with a similar basket. Such systems rely,
explicitly or implicitly, on machine learning of probabilistic graphical models, with the trained
model acting as the core of the system. However, such a
system shows the same kind of drawbacks as in the previous example:
        <list list-type="order">
          <list-item><p>it cannot include a new product in its recommendations, since there are simply
no sales data for that product;</p></list-item>
          <list-item><p>it cannot respond to scenario-based queries, in which the customer states, for example, that he
wants to renovate an apartment or to prepare a light
dinner for 10 people, and expects a list of suitable goods.</p></list-item>
        </list>
      </p>
      <p>In this case, the raw data accumulated by the system provide no basis for
decision making. However, as in the first example, one way to cope with this class of tasks might be to
include expert knowledge.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Use of probabilistic graphical models</title>
      <p>To make explicit the common problems shared by the above tasks from the digital economy
and science, we express them in terms of the theory of Bayesian networks (including machine learning of Bayesian
networks) and related models of complex knowledge systems with uncertainty, adhering to the
probabilistic, logical-probabilistic, relational-probabilistic and probabilistic-algebraic
approaches.</p>
      <p>Bayesian networks (Bayesian belief networks, algebraic Bayesian networks, other related models)
can be constructed automatically ("machine-trained") from the data of each individual data source
obtained from various automated information systems. (Of course, open theoretical, algorithmic and
technological issues remain even at this stage, but we will nevertheless assume that a satisfactory
result of machine learning is available to us.) As noted, these information sources cannot
be directly and consistently combined into a single source, so the Bayesian networks are constructed
separately. The resulting networks will nevertheless intersect over a set of vertices,
because the information sources intersect in terms of parameters (variables, attributes). At the same time,
both the structure over the vertices in the intersection and the probability estimates at those
vertices will most likely not coincide. The problem of "merging" intersecting but non-coincident
networks has been neither posed nor solved in the theory of Bayesian belief networks, in
the theory of algebraic Bayesian networks, or in the theories of related models, although the need to
solve this fundamental problem in the context described above, dictated by the digitalization of the
economy, is obvious. (More precisely, we are dealing with a series of fundamental problems:
finding consistent estimates of probabilities, both conditional and marginal;
finding structures that connect the network vertices that fall into the intersection;
propagating the effect of reconciling probabilities and structures in the intersection to those parts of the
networks that lie outside it; and so on.)</p>
      <p>However, the problem is not confined to the merging of existing networks. As noted, experts can
also be considered an information source. Their contribution results in the extension or completion of
Bayesian networks (or related models) with new vertices and a new structure that forms connections
both within the new set of vertices and between vertices of this set and vertices of the previously
"machine-trained" Bayesian network (or networks). As a rule, in this case, on the one hand, experts will
provide incomplete, inaccurate, non-numeric information both about the structure "inside" and
"outside" and about the probabilities that characterize the vertices and these connections "inside" and
"outside". From such data (for example, only partial orders on probability
estimates may be available, rather than numerical estimates of the probabilities themselves), it will be
necessary to "machine learn" the expert part of the Bayesian network (or related model), as well as the
part responsible for the links between the "expert component" and the component learned from the data of
automated information systems. On the other hand, it will be necessary to organize a dialogue
between the expert and the system that performs the
"completion" or additional training of the Bayesian network (or related model), so as to approach a
satisfactory final result iteratively. This also raises a number of
fundamental and technological issues, ranging from the visualization of structures and
parameter values to resolving collisions, modifying (eliminating)
inadmissible structural elements, and minimizing the required operations, including operations that modify
networks in their intersection, in their conjugation, and in the parts that lie outside the
intersection but are modified "secondarily" as a consequence of the modifications already listed.</p>
      <p>Thus, a "bottleneck" has been revealed that hinders a technological breakthrough in the
application of approaches, methods, models, algorithms, technologies and systems of machine learning and other
methods of data mining to problems from a number of spheres of the digital economy and
science. The bottleneck is the following: neither the known theoretical
developments nor the existing software systems make it possible to synthesize new Bayesian
networks (and related models) from overlapping but non-coincident Bayesian networks (and
related models), nor do they make it possible to complete or train these networks and
models on the basis of inaccurate, incomplete and non-numeric information coming from experts.
Eliminating this bottleneck by developing appropriate theories and suites of algorithms,
and then, on their basis, data mining systems, will open up within the developing digital economy a
broad market for machine learning and data mining systems based on
Bayesian networks (and related models), since all other prerequisites for this already exist, including
the data sets accumulated, and still accumulating, in information sources.</p>
    </sec>
    <sec id="sec-3b">
      <title>4. Combining two algebraic Bayesian networks</title>
      <p>This section gives the simplest example of possible approaches to combining algebraic Bayesian
networks.</p>
      <p>Let us consider two algebraic Bayesian networks, each consisting of a single knowledge pattern with two atoms
(figure 1).
In this case, the conjuncts have the following interval estimates:</p>
      <p>There are two possible ways to combine these algebraic Bayesian networks; they differ in
the complexity of the resulting integrated network and in the completeness of the information
it provides. Both are presented in the next two subsections.</p>
      <sec id="sec-3-1">
        <title>4.1. The first method</title>
        <p>Atoms and are replaced by in the first case.
which is equal to intersection of probabilities and
has the interval estimate of probability of truth
:
The resulting algebraic Bayesian network is shown in figure 2.</p>
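        <p>As a minimal sketch of the interval computation in the first method, the following Python fragment intersects two interval estimates of the probability of truth of the same atom. The function name <monospace>intersect_intervals</monospace>, the tuple representation of an interval and the numerical values are assumptions made for illustration and are not taken from the example in this section.</p>
        <preformat><![CDATA[
```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound), both in [0, 1]


def intersect_intervals(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the intersection of two closed probability intervals.

    Returns None when the intervals do not overlap, i.e. when the two
    sources give incompatible estimates for the shared atom.
    """
    lower = max(a[0], b[0])
    upper = min(a[1], b[1])
    if lower > upper:
        return None
    return (lower, upper)


if __name__ == "__main__":
    # Hypothetical estimates of the same atom taken from two networks.
    print(intersect_intervals((0.2, 0.6), (0.4, 0.8)))
    print(intersect_intervals((0.1, 0.3), (0.5, 0.9)))
```
]]></preformat>
        <p>In this sketch an empty intersection is reported as <monospace>None</monospace>; such a result would indicate that the estimates from the two sources have to be reconciled before the atoms can be merged.</p>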
        <p>The probabilities of conjuncts will satisfy the following inequalities:</p>
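        <p>As a general illustration, and not a reconstruction of the inequalities of this particular example, the probability of a conjunction of two atoms x and y is always bounded by the Fréchet inequalities max(0, p(x) + p(y) - 1) ≤ p(xy) ≤ min(p(x), p(y)). The sketch below computes the widest interval for p(xy) that is compatible with interval estimates of the two atoms; the function name <monospace>conjunct_bounds</monospace> and the interval representation are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def conjunct_bounds(px: Interval, py: Interval) -> Interval:
    """Bounds on p(xy) implied by interval estimates of p(x) and p(y).

    Uses the Frechet inequalities
        max(0, p(x) + p(y) - 1) <= p(xy) <= min(p(x), p(y)).
    The lower bound is smallest at the lower ends of both intervals,
    and the upper bound is largest at their upper ends.
    """
    lower = max(0.0, px[0] + py[0] - 1.0)
    upper = min(px[1], py[1])
    return (lower, upper)


if __name__ == "__main__":
    # Hypothetical interval estimates for two atoms.
    print(conjunct_bounds((0.6, 0.8), (0.7, 0.9)))
```
]]></preformat>
        <p>Any interval estimate assigned to the conjunct can then be intersected with these bounds; an empty result would mean that the estimates of the atoms and of the conjunct cannot hold simultaneously.</p>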
      </sec>
      <sec id="sec-3-2">
        <title>4.2. The second method</title>
        <p>and are both added to intersection in the second case. This complicates the structure of the resulting
algebraic Bayesian network (figure 3), but makes it possible to see the effect of combining and (conjuncts
and ), as well as the effect of these parameters separately (conjuncts and , and
).</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Comparison of methods</title>
        <p>Thus, the first method requires simpler computations but is less informative, whereas constructing the more
complex algebraic Bayesian network of the second method makes it possible to reveal dependencies between different
measurements of the same value. In addition, the second method either guarantees that the existing
constraints correspond to a non-empty set of probability distributions (possibly with only one
element), or allows one to conclude that the available data set is inconsistent and that additional effort is
required to harmonize the information obtained from different sources.</p>
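        <p>The kind of consistency check referred to here can be illustrated, for the simplest case of a single knowledge pattern over two atoms x and y with scalar (point) estimates, by the following sketch. It assumes the standard requirement that the probabilities of all four quanta (x y, x ¬y, ¬x y, ¬x ¬y) be non-negative. The function names, the tolerance parameter and the numbers are hypothetical, and interval estimates would typically be handled with linear programming rather than with this direct check.</p>
        <preformat><![CDATA[
```python
from typing import Dict


def quanta(p_x: float, p_y: float, p_xy: float) -> Dict[str, float]:
    """Probabilities of the four quanta over atoms x and y."""
    return {
        "x y": p_xy,
        "x ~y": p_x - p_xy,
        "~x y": p_y - p_xy,
        "~x ~y": 1.0 - p_x - p_y + p_xy,
    }


def is_consistent(p_x: float, p_y: float, p_xy: float,
                  tol: float = 1e-12) -> bool:
    """True if the scalar estimates define a probability distribution.

    The quanta always sum to 1 by construction, so it is enough to
    check that none of them is negative (up to a small tolerance).
    """
    return all(q >= -tol for q in quanta(p_x, p_y, p_xy).values())


if __name__ == "__main__":
    # Hypothetical estimates: the first set is consistent, the second
    # is not, because p(xy) exceeds p(x).
    print(is_consistent(0.5, 0.4, 0.2))
    print(is_consistent(0.5, 0.4, 0.6))
```
]]></preformat>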
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>
        A method is proposed for synthesizing a consistent system of probability estimates of propositional
formulas for two different partially overlapping sets of data. It is based on constructing two
algebraic Bayesian networks, which overlap partially because the original data sets also intersect, on
joining them over the coincident vertices, and on subsequently applying algorithms that maintain the
consistency of algebraic Bayesian networks. It should be noted that if the resulting network is acyclic
or can be successfully converted to an acyclic one, then according to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] maintaining its
consistency will be cheaper than if it had to be embedded in an encompassing knowledge pattern.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially supported by RFBR according to research project No. 18-01-00626,
as well as by the SPIIRAS state task No. 0073-2018-0001.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Tulupyev</surname>
            <given-names>A L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolenko</surname>
            <given-names>S I</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sirotkin</surname>
            <given-names>A V</given-names>
          </string-name>
          <year>2006</year>
          “
          <article-title>Bayesian belief networks: probabilistic-logic approach</article-title>
          ,” SPb.: Nauka p
          <volume>607</volume>
          (In Russian)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Levenets</surname>
            <given-names>D G</given-names>
          </string-name>
          et al.
          <article-title>Decremental and incremental reshaping of algebraic Bayesian networks global structures</article-title>
          .
          <source>// Proceedings of the First International Scientific Conference “Intelligent Information Technologies for Industry” (IITI'16)</source>
          2016 pp
          <fpage>57</fpage>
          -
          <lpage>67</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kugbey</surname>
            <given-names>N</given-names>
          </string-name>
          et al.
          <article-title>International note: Analysis of risk and protective factors for risky sexual behaviours among school-aged adolescents</article-title>
          . <source>// Journal of Adolescence</source>. Volume <volume>68</volume>.
          <year>2018</year>
          pp
          <fpage>66</fpage>
          -
          <lpage>69</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Abramov</surname>
            <given-names>M V</given-names>
          </string-name>
          and
          <string-name>
            <surname>Azarov</surname>
            <given-names>A A</given-names>
          </string-name>
          <article-title>Identifying user's of social networks psychological features on the basis of their musical preferences</article-title>
          .
          <source>// Soft Computing and Measurements (SCM), 2017 XX IEEE International Conference on. IEEE</source>
          ,
          <year>2017</year>
          pp
          <fpage>90</fpage>
          -
          <lpage>92</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Post</surname>
            <given-names>Y L</given-names>
          </string-name>
          et al.
          <article-title>Using probabilistic neural networks to analyze First Nations' drinking water advisory data</article-title>
          .
          <source>// Journal of Water Resources Planning and Management</source>
          . Volume
          <volume>144</volume>
          .
          <year>2018</year>
          . Issue 11,
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Davey</surname>
            <given-names>C G</given-names>
          </string-name>
          et al.
          <article-title>A brain model of disturbed self-appraisal in depression</article-title>
          .
          <source>// American Journal of Psychiatry</source>
          . Volume
          <volume>174</volume>
          , Issue 9.
          <year>2017</year>
          pp
          <fpage>895</fpage>
          -
          <lpage>903</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Bagretsov</surname>
            <given-names>G I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shindarev</surname>
            <given-names>N A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abramov</surname>
            <given-names>M V</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tulupyeva</surname>
            <given-names>T V</given-names>
          </string-name>
          <article-title>Approaches to development of models for text analysis of information in social network profiles in order to evaluate user's vulnerabilities profile</article-title>
          .
          <source>// Soft Computing and Measurements (SCM), 2017 XX IEEE International Conference on. IEEE</source>
          ,
          <year>2017</year>
          pp
          <fpage>93</fpage>
          -
          <lpage>95</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Santos</surname>
            <given-names>R A L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mota-Ferreira</surname>
            <given-names>M</given-names>
          </string-name>
          and
          <string-name>
            <surname>Aguiar</surname>
            <given-names>L M S</given-names>
          </string-name>
          <article-title>Predicting wildlife road-crossing probability from roadkill data using occupancy-detection models</article-title>
          .
          <source>// Science of the Total Environment</source>
          . Volume
          <volume>642</volume>
          .
          <year>2018</year>
          pp
          <fpage>629</fpage>
          -
          <lpage>637</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>