 Formal Concept Analysis Techniques Can Help
   in Intelligent Control, Deep Learning, etc.?

                       Vladik Kreinovich[0000−0002−1244−1650]

           Department of Computer Science, University of Texas at El Paso
            500 W. University, El Paso, TX 79968, USA, vladik@utep.edu



        Abstract. In this paper, we show that formal concept analysis is a
        particular case of a more general problem that includes deriving rules
        for intelligent control, finding appropriate properties for deep learning
        algorithms, etc. Because of this, we believe that formal concept analysis
        techniques can be (and need to be) extended to these application areas
        as well. To show that such an extension is possible, we explain how these
        techniques can be applied to intelligent control.


1     How Formal Concept Analysis Fits into the General
      Scheme of Being

General need for compression. The current world is filled with data. At any
given moment of time, numerous sensors produce humongous amounts of numerical measurement results, images, videos, etc.
     This data is usually useful – otherwise it would not have been produced, so
it is desirable to store and process this information.
However, the problem is that our ability to produce information far exceeds
our ability to store and process it. As a result, we cannot physically store every
bit produced by the sensors; we need to compress this data.
Compression is closely related to interpolation and extrapolation. Of
course, if all the bits were independent – if each were informative, containing
information that cannot be extracted from other bits – there would be no way
to drastically compress this information without losing a significant portion of
it. And this would make producing this immediately deleted information useless.
    The good news is that bits are not independent: there is usually a strong correla-
tion between them, a correlation that allows us to drastically decrease the number
of stored bits without losing much information. For example, a photo can be easily compressed from several Megabytes to dozens of Kilobytes – several orders
of magnitude – and we can still easily recognize all the features of a person on
the webpage where this compressed photo is posted.
?
    This work was supported in part by the National Science Foundation grants 1623190
    (A Model of Change for Preparing a New Generation for Professional Practice in
    Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).


Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). Published in Francisco
J. Valverde-Albacete, Martin Trnecka (Eds.): Proceedings of the 15th International
Conference on Concept Lattices and Their Applications, CLA 2020, pp. 9–17, 2020.


     This correlation also enables us to effectively interpolate and extrapolate,
i.e., to adequately reconstruct missing information. For example, based on many
readings of temperature, wind speed, and other meteorological characteristics
at several locations and heights, we can reasonably accurately reconstruct the
values of these characteristics at other locations and heights.
Functions of one, two, etc. inputs. In many cases, we are interested in
characteristics q that depend only on one (possibly multi-dimensional) input x:
q = f (x). For example:

 – we may be interested in the temperature q(x) at different locations and
   different moments of time (i.e., at different points x in space-time),
 – we may be interested in the income q(x) of different people x at different
   moments of time, etc.

   However, in many other cases, we are interested in characteristics q(x, y, . . .)
that depend on two (or even more) different inputs x, y, . . . For example:

 – we may be interested in the degree q(x, y) to which a given person x would
   like or dislike a certain movie y (or a certain book, or a certain research
   paper if x is a researcher),
 – we may be interested in knowing the degree q(x, y) to which, in a given
   situation x, different controls y will lead to good results, etc.

In such situations, we need to compress, interpolate, and extrapolate the desired
dependence q(x, y, . . .).
How can we compress, interpolate, and extrapolate multi-input depen-
dencies: general description. For simplicity, let us consider the case when the
desired quantity depends only on two inputs: q = q(x, y). In this case, in the
beginning,

 – we have information about x and information about y, and
 – we need to perform some processing of this information.

A natural way to speed up data processing is to perform some operations in
parallel – just like for us humans, a natural way for a person to perform a task
faster is to have several helpers working at the same time on the same task.
So, if there are some computational steps where we can process x separately
and process y separately, these steps need to be performed in parallel before
everything else. Thus, in general, processing such data consists of the following
two major stages:

 – first, we perform an appropriate processing on x, resulting in some values
   a(x), and at the same time, we perform an appropriate processing on y,
   producing b(y);
 – after that, we perform some processing on the results a(x) and b(y) of the first
   stage, producing F (a(x), b(y)), where F denotes the algorithm performed at
   this second stage.

At the end, we approximate the original dependence q(x, y) with the simpler-to-
store and simpler-to-process dependence F (a(x), b(y)) ≈ q(x, y).
    Let us describe possible situations, from the simplest to the most complicated.


Linear case. The simplest case is when q(x, y) can be well approximated as a
linear function of the values a(x) = (a1 (x), . . . , ak (x)), i.e., when

                q(x, y) ≈ b0 (y) + b1 (y) · a1 (x) + . . . + bk (y) · ak (x)

for some coefficients bi (y) depending on y. By adding a0 (x) = 1, we can make
this formula more uniform

           q(x, y) = b0 (y) · a0 (x) + b1 (y) · a1 (x) + . . . + bk (y) · ak (x).

In matrix notation, with qxy := q(x, y), axi := ai (x), and byi := bi (y), this
formula takes the form

                             qxy = ∑_{i=0}^{k} axi · byi .                          (1)

This is a known idea of matrix decomposition, actively used in Principal Com-
ponent Analysis (see, e.g., [15]), in predicting people’s reaction to movies, etc.
(see, e.g., [1, 17]).
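To make formula (1) concrete, here is a small numpy sketch (our own illustration, not taken from [1, 15, 17]): a truncated singular value decomposition produces exactly such a sum of products axi · byi, and the "ratings" matrix below is made-up toy data.

```python
import numpy as np

# Made-up toy data: ratings q[x, y] of 3 movies y by 4 people x.
q = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

# Truncated SVD gives the best rank-k approximation of the form (1):
# q[x, y] ≈ sum over i of a[x, i] * b[y, i].
k = 2
U, s, Vt = np.linalg.svd(q, full_matrices=False)
a = U[:, :k] * s[:k]   # a[x, i] plays the role of a_i(x)
b = Vt[:k, :].T        # b[y, i] plays the role of b_i(y)

q_approx = a @ b.T
print(np.abs(q - q_approx).max())  # small residual: two factors almost suffice
```

By the Eckart–Young theorem, the Frobenius norm of the residual equals the first discarded singular value, so one can read off directly how much is lost by keeping only k factors.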
A natural generalization of the linear case, to operations generalizing addition and multiplication. In the above case (1):

 – we use multiplication to process individual components axi and byi of the
   results a(x) and b(y) of processing x and y, and
 – we use addition to combine these results.

Instead of multiplication and addition, we can use more general combination
functions.
    For example, we can have expert control rules of the type “if x satisfies the
property ai (e.g., if x > 0.1), then the control y should satisfy the property bi
(e.g., y ∈ [0, 1])”. We can then combine these rules into an equivalent formula,
according to which y is a reasonable control for the situation x if:

 – either the first rule is applicable, i.e., x satisfies the property a1 and y satisfies
   the property b1 ,
 – or the second rule is applicable, i.e., x satisfies the property a2 and y satisfies
   the property b2 , etc.

If we:

 – denote the truth value of the statement “x satisfies the property ai ” by ai (x)
   and
 – denote the truth value of the statement “y satisfies the property bi ” by bi (y),

then the truth value q(x, y) of the statement “y is a reasonable control for x”
takes the form

                 q(x, y) = (a1 (x) & b1 (y)) ∨ (a2 (x) & b2 (y)) ∨ . . .

This is the usual example of formal concept analysis; see, e.g., [3, 6].
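In the 2-valued case, this OR-of-ANDs decomposition is easy to verify mechanically. Below is a minimal Python sketch with a made-up 3×3 table and two hand-picked factors (our illustration, not an algorithm from [3, 6]):

```python
# Made-up Boolean table q[x][y]: does control y work in situation x?
q = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]

# Two hand-picked factors (rules): a_i(x) and b_i(y) as 0/1 vectors.
a = [[1, 1, 0],   # a_1(x): rule 1 applies to situations 0 and 1
     [0, 1, 1]]   # a_2(x): rule 2 applies to situations 1 and 2
b = [[1, 1, 0],   # b_1(y): rule 1 allows controls 0 and 1
     [0, 0, 1]]   # b_2(y): rule 2 allows control 2

def reconstruct(a, b, nx, ny):
    """q(x, y) = (a_1(x) & b_1(y)) OR (a_2(x) & b_2(y)) OR ..."""
    return [[int(any(a[i][x] and b[i][y] for i in range(len(a))))
             for y in range(ny)] for x in range(nx)]

print(reconstruct(a, b, 3, 3) == q)  # prints: True
```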
    This example can be extended to the case when experts use imprecise
(“fuzzy”) words from natural language to describe their rules; see, e.g., [2, 8,
10, 13, 14, 16]. In this case, the expert’s control rules have a similar
form “if x is ai (e.g., small), then the control y should be bi (e.g., moderate)”.
We can similarly translate these rules into an equivalent formula, according to
which y is a reasonable control for the situation x if:

 – either the first rule is applicable, i.e., x satisfies the property a1 and y satisfies
   the property b1 ,
 – or the second rule is applicable, i.e., x satisfies the property a2 and y satisfies
   the property b2 , etc.

We can then ask the expert to estimate, on a scale from 0 to 1, the degrees ai (x)
to which different values x satisfy the imprecise (“fuzzy”) property ai and the
degrees bi (y) to which different values y satisfy the property bi .
    Since it is usually not practically possible to ask the expert to provide es-
timates for the combined statement “x satisfies the property ai and y satisfies
the property bi ” for all the pairs (x, y) – there are just too many possible pairs
– we have to estimate the degrees to which such statements are true based on
whatever information is available – namely, the degrees ai (x) and bi (y). For this
estimation, we can use a general algorithm f& (a, b) for estimating our degree of
confidence in a composite statement A & B based on our degrees of confidence
a and b in the statements A and B.
    This algorithm has to satisfy certain properties: e.g., since A & B means the
same as B & A, this operation must be commutative; since A & (B & C) is equiv-
alent to (A & B) & C, this operation must be associative, etc. Such operations
are known as “and”-operations, or, for historical reasons, t-norms.
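These properties can be spot-checked numerically (a sanity check, not a proof). The sketch below verifies commutativity and associativity for two standard t-norms, the minimum and the Łukasiewicz operation max(a + b − 1, 0), on a grid of exactly representable degrees:

```python
import itertools

# A numerical spot check (not a proof) of the "and"-operation properties.
t_norms = [
    min,                                  # minimum (Godel) t-norm
    lambda a, b: max(a + b - 1.0, 0.0),   # Lukasiewicz t-norm
]
grid = [i / 4 for i in range(5)]          # exact binary fractions 0, 0.25, ..., 1

for f in t_norms:
    for a, b, c in itertools.product(grid, repeat=3):
        assert f(a, b) == f(b, a)              # A & B  means the same as  B & A
        assert f(a, f(b, c)) == f(f(a, b), c)  # A & (B & C)  equiv  (A & B) & C
print("commutativity and associativity hold on the sample grid")
```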
    Similarly, we can use a similarly motivated “or”-operation (also known as
t-conorm) f∨ (a, b) to estimate our degree of confidence in A ∨ B based on our
degrees of confidence a and b in the statements A and B. In these terms, the
desired degree of confidence q(x, y) can be described as follows:

               q(x, y) = f∨ (f& (a1 (x), b1 (y)), f& (a2 (x), b2 (y)), . . .).       (2)
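Evaluating formula (2) is then straightforward. In the Python sketch below, we take f& = min and f∨ = max (one standard choice); the membership functions "small", "moderate", etc. are made-up placeholders, not rules from any real controller:

```python
# A sketch of formula (2), with the minimum t-norm and maximum t-conorm.
def q_degree(x, y, rules, f_and=min, f_or=max):
    """Degree to which control y is reasonable in situation x,
    given rules as pairs (a_i, b_i) of membership functions."""
    out = 0.0
    for a_i, b_i in rules:
        out = f_or(out, f_and(a_i(x), b_i(y)))
    return out

# Hypothetical rules: "if x is small, y should be moderate";
# "if x is large, y should be high".  All membership functions are made up.
small    = lambda x: max(0.0, 1.0 - abs(x) / 2.0)
moderate = lambda y: max(0.0, 1.0 - abs(y - 1.0))
large    = lambda x: min(1.0, max(0.0, x / 4.0))
high     = lambda y: min(1.0, max(0.0, (y - 1.0) / 2.0))

rules = [(small, moderate), (large, high)]
print(q_degree(0.5, 1.0, rules))  # 0.75: the first rule fires strongly
```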



Case when some transformations are linear, while others are not. So
far, we have considered the case when the function F is linear – or similar to
linear, with more general operations instead of addition and multiplication.
    In some cases, a linear transformation is followed by a non-linear one. For
example, in a traditional 3-layer neural network (see, e.g., [5]), the result q of

processing the inputs x1 , . . . , xn has the form
                q = ∑_{k=1}^{K} Wk · sk (∑_{i=1}^{n} wki · xi − wk0 ) − W0 ,                  (3)

for some non-linear functions sk (z).
    In other words, we first compute linear combinations ak (x) = ∑_{i=1}^{n} wki · xi − wk0 ,
and then perform a non-linear transformation q = ∑_{k=1}^{K} Wk · sk (ak ) − W0 .
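The two stages of formula (3) can be sketched as follows; the weights are random placeholders rather than trained values, and we use the sigmoid for all the non-linear functions sk (our simplifying assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 3, 4                    # n inputs, K hidden neurons
w  = rng.normal(size=(K, n))   # weights w_ki
w0 = rng.normal(size=K)        # biases  w_k0
W  = rng.normal(size=K)        # output weights W_k
W0 = rng.normal()              # output bias W_0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    a = w @ x - w0               # first stage: linear combinations a_k(x)
    return W @ sigmoid(a) - W0   # second stage: non-linear transformation

x = np.array([1.0, -0.5, 2.0])
print(forward(x))
```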

General case, when everything is possibly non-linear. In general, we
may have non-linear transformations a(x) and b(y), followed by a nonlinear
transformation F (a, b).
    A typical example of such a representation is deep learning (see, e.g., [7]),
where the dimension of the signal decreases as we go from the multi-D input
through processing layers, and thus, the original multi-dimensional signal is
compressed – and, in general, compressed non-linearly. Interestingly, in many
experiments, the intermediate results have intuitive meaning – so that the same
intermediate values can be used for other problems y.


2     What Is the Remaining Problem and How Formal
      Concept Analysis Techniques Can Help
General problem. If we have rules, and these rules are perfect, there is no
problem. However, this is rarely the case. In most practical situations, we have
some information about q(x, y), and we need to come up with the appropriate
decomposition into a(x), b(y), and F (a, b).
 – In the linear case of matrix decomposition, we have examples of people’s
   attitude to different movies, and we need to come up with the most adequate
   values axi and byi .
 – In the case of formal concept analysis, we have a table (often only partially
   filled) of truth values q(x, y), and we need to find appropriate predicates
   ai (x) and bi (y).
 – In the case of intelligent control, we have degrees q(x, y), and we need to
   come up with appropriate rules – e.g., with the most appropriate functions
   ai (x) and bi (y).
 – In the case of deep learning, while there are spectacular successes – like
   beating a world champion in Go, there are also spectacular failures, when
   the system classifies a picture that is clearly a cat as a dog and vice versa. This means
   that even in this case, the problem of finding the appropriate values ai (x)
   and bi (y) is far from being solved.

How can formal concept analysis techniques help. Of course, most of
the above problems are NP-hard (see, e.g., [11, 12]), so we cannot expect to

find a feasible algorithm that always finds a solution. However, many efficient
techniques have been developed in formal concept analysis, and it is desirable to
extend them to other cases as well.
    Such an extension is clearly possible – e.g., the paper [4] provided an efficient
greedy algorithm for deriving fuzzy values when f& (a, b) = max(a + b − 1, 0) and
f∨ (a, b) = min(a + b, 1), and the paper [9] showed that the same algorithm can
work for other “and”- and “or”-operations as well. The unpublished result from
[9] is briefly described in the Appendix.
Conclusion. Our conclusion is simple and straightforward: let us think big, let
us extend what we have to other cases.


Acknowledgments
The author is greatly thankful to Radim Belohlavek, Marketa Krmelova, and
Martin Trnecka for their help and encouragement.


References
 1. G. Acosta, M. Hernandez, N. Villanueva-Rosales, E. Smith, and V. Kreinovich,
    “Why matrix factorization works well in recommender systems: a systems-based
    explanation”, Journal of Uncertain Systems, 2019, Vol. 13, No. 3, pp. 164–167.
 2. R. Belohlavek, J. W. Dauben, and G. J. Klir, Fuzzy Logic and Mathematics: A His-
    torical Perspective, Oxford University Press, New York, 2017.
 3. R. Belohlavek and V. Vychodil, “Discovery of optimal factors in binary data via a
    novel method of matrix decomposition”, Journal of Computer and System Sciences,
    2010, Vol. 76, No. 1, pp. 3–20.
 4. R. Belohlavek and V. Vychodil, “Factor analysis of incidence data via novel decom-
    position of matrices”, Proceedings of the 7th International Conference on Formal
    Concept Analysis ICFCA’2009, Darmstadt, Germany, May 21–24, 2009, Springer
    Lecture Notes in Artificial Intelligence, Vol. 5548, pp. 83–97.
 5. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York,
    2006.
 6. B. Ganter and R. Wille, Formal Concept Analysis. Mathematical Foundations,
    Springer, Berlin, 1999.
 7. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge,
    Massachusetts, 2016.
 8. G. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, Upper Saddle
    River, New Jersey, 1995.
 9. M. Krmelova, R. Belohlavek, and V. Kreinovich, Fuzzy Formal Concept Analysis
    Can Help Extract Rules from Experts, unpublished paper, 2013.
10. J. M. Mendel, Uncertain Rule-Based Fuzzy Systems: Introduction and New Direc-
    tions, Springer, Cham, Switzerland, 2017.
11. D. S. Nau, Specificity Covering, Duke University, Department of Computer Science,
    Technical Report CS-1976-7, 1976.
12. D. S. Nau, G. Markowsky, M. A. Woodbury, and D. B. Amos, “A mathematical
    analysis of human leukocyte antigen serology”, Mathematical Biosciences, 1978,
    Vol. 40, pp. 243–270.

13. H. T. Nguyen, C. L. Walker, and E. A. Walker, A First Course in Fuzzy Logic,
    Chapman and Hall/CRC, Boca Raton, Florida, 2019.
14. V. Novák, I. Perfilieva, and J. Močkoř, Mathematical Principles of Fuzzy Logic,
    Kluwer, Boston, Dordrecht, 1999.
15. D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures,
    Chapman and Hall/CRC, Boca Raton, Florida, 2011.
16. L. A. Zadeh, “Fuzzy sets”, Information and Control, 1965, Vol. 8, pp. 338–353.
17. C. Zhao, S. Sun, L. Han, and Q. Peng, “Hybrid matrix factorization for recommender
    systems in social networks”, International Journal on Neural and Mass-Parallel
    Computing and Information Systems, 2016, Vol. 26, No. 6, pp. 559–569.


A      How Formal Concept Analysis Can Help Extract
       Intelligent Control Rules

We have a finite set of pairs (x, y) for which we know q(x, y). Let us denote this
set by P . Based on this information, how can we find appropriate functions ai (x)
and bi (y)?
    In this appendix, we show that in this general intelligent control case, we can
use a greedy algorithm that was originally proposed in [4] for a specific case of
“and”- and “or”-operations.
    By definition of a greedy algorithm:

 – we start by finding the functions a1 (x) and b1 (y),
 – then we fix the selected functions a1 (x) and b1 (y) and find the functions
   a2 (x) and b2 (y), etc., and,
 – in general, we fix the already selected functions a1 (x), b1 (y), . . . , ak−1 (x),
   bk−1 (y), and select a pair of functions ak (x) and bk (y).

How do we select these functions ak (x) and bk (y)?
   From the equation (2) and from the fact that p ≤ f∨ (p, q) for all p and q, we
can conclude that for each k, we should have

    f∨ (f& (a1 (x), b1 (y)), . . . , f& (ak−1 (x), bk−1 (y)), f& (ak (x), bk (y))) ≤ q(x, y),

i.e., equivalently, that

                        f∨ (qk−1 (x, y), f& (ak (x), bk (y))) ≤ q(x, y),                        (4)

where we denoted

            qk−1 (x, y) := f∨ (f& (a1 (x), b1 (y)), . . . , f& (ak−1 (x), bk−1 (y)))

for k − 1 ≥ 1 and q0 (x, y) := 0.
    A natural idea is to select the functions ak (x) and bk (y) that would cover as
many pairs (x, y) as possible, i.e., for which the value
      Nk (ak , bk ) := #{(x, y) ∈ P : f∨ (qk−1 (x, y), f& (ak (x), bk (y))) = q(x, y)}

is the largest possible, where #S denotes the number of elements in the set S.
    From this viewpoint, once we selected bk (y), it is reasonable to select a func-
tion ak (x) which leads to the largest possible coverage, i.e., to select
      (bk )↓k (x) := sup{a : ∀y ∈ Yx (f∨ (qk−1 (x, y), f& (a, bk (y))) ≤ q(x, y))},

where Yx := {y : (x, y) ∈ P }. Similarly, if we have selected the function ak (x),
then it is reasonable to select a function bk (y) which leads to the largest possible
coverage, i.e., to select
      (ak )↑k (y) := sup{b : ∀x ∈ Xy (f∨ (qk−1 (x, y), f& (ak (x), b)) ≤ q(x, y))},

where Xy := {x : (x, y) ∈ P }.
Comment. These notations, similar to the usual notations of formal concept
analysis, are motivated by the fact that in the usual 2-valued logic:

 – there are only two truth values 0 and 1;
 – when s′ ≤ s, then s′ ∨ t ≤ s if and only if t ≤ s; and
 – a & b ≤ s if and only if a ≤ (b → s).

By using these properties, one can check that in the 2-valued logic, the above
formulas can be represented in the following simplified equivalent form (not
depending on qk−1 (x, y)):

                         (ak )↑k (y) = inf_{x∈Xy} (ak (x) → q(x, y));

                         (bk )↓k (x) = inf_{y∈Yx} (bk (y) → q(x, y)),

which are exactly the usual notions (ak )↑ and (bk )↓ in formal concept analysis.
    Let us now describe an iterative procedure for finding bk (y) and ak (x). In
the beginning, the only information that we know about bk (y) and ak (x) is
that bk (y) ≥ 0 and ak (x) ≥ 0. Thus, as the starting approximations to the
desired functions bk (y) and ak (x), we take bk^(0) (y) = 0 and ak^(0) (x) = 0. For these
functions, the quality Nk (ak^(0) , bk^(0) ) is simply equal to the previous value Nk−1 .
    Let us now start improving this selection step by step. In general, let us
assume that we have already found approximations bk^(i−1) (y) and ak^(i−1) (x), for
which the approximation quality is equal to Nk (ak^(i−1) , bk^(i−1) ).
    If some pairs (x, y) are still not covered by this selection, we should try to
increase one of the functions bk^(i−1) (y) and ak^(i−1) (x). Let us start with bk^(i−1) (y).
The simplest idea is to increase the value bk^(i−1) (y) for one of the values y0 to the
largest possible value

      by0 (y0 ) := sup{b : ∀x ∈ Xy0 (f∨ (qk−1 (x, y0 ), f& (ak^(i−1) (x), b)) ≤ q(x, y0 ))},

while keeping all other values bk^(i−1) (y) unchanged: by0 (y) = bk^(i−1) (y) for all
y ≠ y0 .
     For each y0 , we form the resulting function by0 (y), and take ak,y0 = (by0 )↓k
and bk,y0 = (ak,y0 )↑k = ((by0 )↓k )↑k . For each y0 , we find the value Nk (ak,y0 , bk,y0 )
of the objective function, and select ymax for which this value is the largest:

                     Nk (ak,ymax , bk,ymax ) = max Nk (ak,y0 , bk,y0 ).
                                                  y0

    The corresponding functions bk,ymax (y) and ak,ymax (x) are then taken as the
next iteration bk^(i) (y) and ak^(i) (x): bk^(i) = bk,ymax and ak^(i) = ak,ymax . Itera-
tions continue while the value Nk continues to grow, i.e., while

                      Nk (ak^(i) , bk^(i) ) > Nk (ak^(i−1) , bk^(i−1) ).

Once it stops growing, i.e., once we have

                      Nk (ak^(i) , bk^(i) ) = Nk (ak^(i−1) , bk^(i−1) ),

the iterations stop, and the corresponding functions bk (y) := bk^(i) (y) and
ak (x) := ak^(i) (x) are added to the list of pairs

                        (a1 (x), b1 (y)), . . . , (ak−1 (x), bk−1 (y)).

    If after this addition, some pairs (x, y) ∈ P are still not covered, we similarly
find and add the next pair of functions ak+1 (x) and bk+1 (y), etc., until all the
pairs (x, y) ∈ P are covered.
    Our preliminary experiments show that this algorithm leads to reasonable
rules.
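For concreteness, here is a small Python sketch of one greedy step of the procedure above, with the Łukasiewicz operations f& (a, b) = max(a + b − 1, 0) and f∨ (a, b) = min(a + b, 1) from [4]. This is our own illustration, not the implementation behind the reported experiments: degrees are restricted to a 0.1-spaced grid so that the suprema can be found by brute-force search, and the toy table is generated by a single rule a1 (x) = x, b1 (y) = 1, so that one factor recovers it exactly.

```python
# Degrees live on a coarse grid so that suprema reduce to a finite search.
GRID = [i / 10 for i in range(11)]
f_and = lambda a, b: max(a + b - 1.0, 0.0)   # Lukasiewicz t-norm
f_or  = lambda a, b: min(a + b, 1.0)         # Lukasiewicz t-conorm

def down(b_k, q_prev, q):
    """For each x, the largest a_k(x) with f_or(q_prev, f_and(a, b_k(y))) <= q(x, y)."""
    return {x: max(a for a in GRID
                   if all(f_or(q_prev[(xx, y)], f_and(a, b_k[y])) <= q[(xx, y)] + 1e-9
                          for (xx, y) in q if xx == x))
            for x in {x for (x, _) in q}}

def up(a_k, q_prev, q):
    """For each y, the largest b_k(y) satisfying the same inequality."""
    return {y: max(b for b in GRID
                   if all(f_or(q_prev[(x, yy)], f_and(a_k[x], b)) <= q[(x, yy)] + 1e-9
                          for (x, yy) in q if yy == y))
            for y in {y for (_, y) in q}}

def coverage(a_k, b_k, q_prev, q):
    """N_k: the number of pairs (x, y) covered exactly."""
    return sum(abs(f_or(q_prev[p], f_and(a_k[p[0]], b_k[p[1]])) - q[p]) < 1e-9
               for p in q)

# Toy table generated by the single rule a_1(x) = x, b_1(y) = 1.
q = {(x, y): f_and(x, 1.0) for x in GRID for y in GRID}
q_prev = {p: 0.0 for p in q}       # q_0(x, y) = 0

b_k = {y: 1.0 for y in GRID}       # start from the largest possible b_k,
a_k = down(b_k, q_prev, q)         # apply the "down" operator ...
b_k = up(a_k, q_prev, q)           # ... and then the "up" operator
print(coverage(a_k, b_k, q_prev, q), "of", len(q), "pairs covered")
```

On this deliberately simple table, the single extracted factor covers all 121 pairs; on realistic data, one would iterate the increase-and-recompute step as described above until the coverage Nk stops growing.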
Comment. A similar greedy algorithm can be used when, instead of the above-
described methodology, we use a more logical way to convey the expert’s if-then
rules, i.e., when we take

               q(x, y) = f& (f→ (a1 (x), b1 (y)), f→ (a2 (x), b2 (y)), . . .),

where f→ (a, b) is an implication operation, i.e., an estimate for degree to which
the statement A → B is true given the degrees of confidence a and b in the
statements A and B.