Formalization of "Natural" Classification and "Natural" Concepts by Probabilistic Generalization of Formal Concept Analysis

Evgenii E. Vityaev (a), Vladislav Degtiarev (b), Bayar Pak (b) and Yuri Meister (b)
(a) Sobolev Institute of Mathematics, Novosibirsk, Russia
(b) Novosibirsk State University, Novosibirsk, Russia

Abstract
In previous works, a probabilistic generalization of Formal Concept Analysis was developed. This generalization is motivated by the problem of determining formal concepts under noise, when the lattice of formal concepts grows exponentially. In this paper, probabilistic formal concepts with negation are defined, as well as a statistical method for detecting these probabilistic formal concepts. The purpose of this paper is to show that probabilistic formal concepts have a deeper meaning. It is argued that probabilistic formal concepts formalize the "natural" concepts described in the cognitive sciences by "causal models", which are characterized by a highly correlated structure of attributes. The same structure is specific to the "natural" classification of objects of the external world. The definition of "natural" classification given by J. Stuart Mill is fairly accurately formalized by probabilistic formal concepts.

Russian Advances in Artificial Intelligence: selected contributions to the Russian Conference on Artificial Intelligence (RCAI 2020), October 10-16, 2020, Moscow, Russia
vityaev@math.nac.ru (E.E. Vityaev)

1. Introduction

In previous works [Vityaev et al. 2012], [Vityaev et al. 2015], [Martynovich et al. 2016], a probabilistic generalization of Formal Concept Analysis (FCA) was developed. This generalization addresses the problem of discovering formal concepts in noisy data, where noise causes an exponential growth of the lattice of formal concepts. This issue is discussed in more detail in section 4.

The probabilistic generalization of formal concepts was obtained by generalizing the definition of formal concepts as fixed points of implications. For this purpose, a probabilistic measure on the objects of the context and maximally specific probabilistic implications were defined. The consistency of fixed points based on maximally specific probabilistic implications was proved, and probabilistic formal concepts were then defined as fixed points of predictions based on maximally specific probabilistic implications. Figuratively speaking, the resulting probabilistic formal concepts do not display the data in full detail, as a photograph would, but give an "idealized" representation of the data.

In this work, probabilistic formal concepts with negation are defined, as well as a statistical method for detecting these probabilistic formal concepts based on the works [Vityaev et al. 2014], [Vityaev et al. 2015], [Martynovich et al. 2016].

The purpose of this work is to show that probabilistic formal concepts have a deeper meaning than their recovery under noise. In [Vityaev et al. 2014], it was shown that the fixed points of anticipation underlying probabilistic formal concepts model the process of perception in exact accordance with the psychology of perception described in [Smirnov 1985], [Naiser 1981].
In addition, this paper shows that probabilistic formal concepts model the "natural" concepts studied in the cognitive sciences. "Natural" concepts are concepts of directly perceived objects, not of abstract notions such as production, plant, or object. The following section, based on research in the cognitive sciences, presents the principles of categorization and the description of "natural" concepts. It follows from this research that "natural" concepts reflect a highly correlated structure of features of objects of the external world, which can be represented by "causal models". It is argued that causal models are rather precisely modeled by probabilistic formal concepts.

The highly correlated structure of features of objects of the external world is also manifested in "natural" classification (see section 3). J. Stuart Mill observed that "natural" classes of animals or plants are described by a potentially infinite number of properties [Mill 2011]. The naturalists who built "natural" classifications also noted that the more significant features the compared objects share, the more likely is their similarity in other features [Rutkovsky 1884], and that the construction of a "natural" classification consists in "indication": from the infinite number of features one has to move to a limited number of features that would replace all the others [Smirnov 1938]. This means that in "natural" classes the features are strongly correlated. For example, if there are 128 classes and the features are binary, only 7 independent "indicator" features are needed, because 2^7 = 128; the remaining features, of which there may be tens or hundreds (denote their total number by N), can be predicted from these 7 features, which means the presence of N - 7 regularities. Since the "indicator" features may be any 7 features out of N, and the number of such choices equals C(N,7), the number of regularities that predict non-indicator features from indicator features is not much less than (N - 7) · C(N,7).
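As a rough numerical illustration of this estimate (the value N = 50 below is a hypothetical number of features, not a figure from the paper), the quantities involved can be computed directly:

```python
from math import comb

N = 50                       # hypothetical total number of binary features
k = 7                        # number of "indicator" features: 2**7 = 128 classes
print(2 ** k)                # 128 classes distinguishable by 7 binary features
print(N - k)                 # regularities predicting the non-indicator features
print(comb(N, k))            # number of ways to choose the 7 indicator features
print((N - k) * comb(N, k))  # rough estimate of the number of such regularities
```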
The highly correlated structure of features is well detected by probabilistic formal concepts; moreover, the following statement by J. S. Mill defines the meaning of probabilistic formal concepts almost precisely: "Natural groups ... are defined by features ... However, at the same time not only features which are certainly common for all the objects included into the group but the whole set of those features is taken into account, of which all are found in the majority of these objects and the majority - in all" [Mill 2011].

2. "Natural" concepts

In the works of Eleanor Rosch [Rosch 1973, 1978], [Rosch et al. 1975, 1976, 1978], the principles of categorization of "natural" categories are formulated on the basis of experiments. Directly perceived objects (basic objects) are information-rich bundles of observed and functional properties that form a natural partition underlying categorization. These bundles form "prototypes" of object classes: "Categories can be viewed in terms of their clear cases if the perceiver places emphasis on the correlational structure of perceived attributes ... By prototypes of categories we have generally meant the clearest cases of category membership" [Rosch et al. 1978], [Rosch 1978]. Rosch and Mervis have shown that "the more prototypical of a category a member is rated, the more attributes it has in common with other members of the category and the fewer attributes in common with members of the contrasting categories" [Rosch et al. 1975, 1976].

Later on, the theory of "natural" concepts of Eleanor Rosch came to be called the "prototype" theory. Its main features are described as follows: "The prototype view (or probabilistic view) keeps the attractive assumption that there is some underlying common set of features for category members but relaxes the requirement that every member have all the features" [Ross et al. 2008].

Further research has found that models based on features, similarities and prototypes are not sufficient to describe classes. Theoretical, causal and ontological knowledge relating to class objects must be taken into account. For example, people not only know that birds have wings, can fly and nest in trees, but also that birds nest in trees because they can fly, and fly because they have wings.

Based on this research, Bob Rehder, in his work "Categorization as causal reasoning", formulated a causal-model theory: "people's intuitive theories about categories of objects consist of a model of the category in which both a category's features and the causal mechanisms among those features are explicitly represented" [Rehder 2003a]. In the theory of causal models, the relationship of an object to a category is no longer based on a set of features and feature-based proximity, but on the similarity of the generating causal mechanism: "Specifically, a to-be-classified object is considered a category member to the extent that its features were likely to have been generated by the category's causal laws, such that combinations of features that are likely to be produced by a category's causal mechanisms are viewed as good category members and those unlikely to be produced by those mechanisms are viewed as poor category members" [Rehder 2003a].

Some researchers have used Bayesian networks or causal models to represent causal knowledge [Cheng 1997], [Gopnik et al. 2004], [Griffiths et al. 2009]. However, these models cannot capture cyclic causal relationships, because Bayesian networks do not support cycles. In [Rehder et al. 2011], Bob Rehder proposed a model of causal cycles based on the "unfolding" of causal graphical models (CGMs). The "unfolding" is done by constructing a Bayesian network that unrolls over time. Such a network is already quite well suited for modeling "natural" concepts reflecting the highly correlated structure of the outside world; however, it does not explicitly formalize cycles of causal links. The model we propose below is directly based on cyclic causal links represented by probabilistic formal concepts.

3. "Natural" classification

The first sufficiently detailed philosophical analysis of "natural" classification belongs to J. St. Mill [Mill 2011]. This analysis revealed all the main properties of "natural" classifications, which were later confirmed by the naturalists who built them.

According to J. St. Mill [Mill 2011], "artificial" classifications differ from "natural" ones in that they can be based on any one or several attributes, so that different classes differ only in that they include objects with different values of these attributes. But if we consider the classes of "animals" or "plants", they differ in such a large (potentially infinite) number of properties that these properties cannot be enumerated [Mill 2011].
J. St. Mill gives the following definition of "natural" classification: it is a classification that unites objects into groups about which the greatest number of general propositions can be stated, and it is based on such properties as serve as reasons for many others or at least constitute their true attributes. Mill also defines the notion of the "image" of a class, a forerunner of "natural" concepts: "our concept of class - the image by which this class is represented in our mind - is the concept of a certain specimen that possesses all the features of this class ... to the highest degree".

The properties of "natural" classes according to J. St. Mill were confirmed by naturalists. For example, W. Whewell: "The more general statements about objects a classification allows one to make, the more natural it is" [Mayen et al. 1976]. L. Rutkovsky writes about the inexhaustible number of common properties of "natural" classes [Rutkovsky 1884]: "The greater the number of essential features in which the compared objects are similar, the more likely is their similarity in other respects". A similar statement is made by E.S. Smirnov: "The taxonomic problem lies in 'indication': from an infinite number of features we need to move to a limited number of them, which would replace all other features" [Smirnov 1938].

As a result, the problem of defining "natural" classifications has been formulated and is still discussed in the literature [Zabrodin 1981], [The Nature of Classification 2013]. However, from our point of view, the existing formalizations of "natural" classification are not yet sufficiently adequate.

4. The problem of formal concepts discovery in noise conditions

Let us consider a known problem of formal concept analysis [Ganter 2003]: standard methods of concept lattice construction turn out to be unproductive on noisy data, because they are designed for exact data, and in the presence of noise (data distortions) the concept lattice becomes filled with noisy clones of the initial concept classes [Klimushkin et al. 2010], [Frequent Itemset Mining 2004]. Attempts to solve this problem head-on do not look promising [Klimushkin et al. 2010], [Kuznetsov 2007]. The main idea of these works is to take the full lattice of concepts and get rid of noisy, "side" concepts by filtering. A widespread evaluation of such "side" concepts is the stability index [Kuznetsov 2007], [Buzmakov 2014]. The stability index and other approaches based on filtering the lattice have a significant disadvantage: the filtered concept lattices often have exponential size [Emilion et al. 2009], which is unacceptable for large contexts.

5. Basics of formal concept analysis

Let us start with the main definitions of formal concept analysis. Most of the definitions here come from the classical works on FCA [Ganter 2003], [Ganter et al. 1999]; others are taken from [Ganter et al. 2004].

Definition 1. A formal context is a triple (G, M, I), where G and M are arbitrary sets of objects and attributes, and I ⊆ G × M is a binary relation expressing that an object possesses an attribute.

In a formal context, the derivation operators play a key role.

Definition 2. Let A ⊆ G, B ⊆ M. Then:
1. A↑ = {m ∈ M | ∀g ∈ A, (g, m) ∈ I};
2. B↓ = {g ∈ G | ∀m ∈ B, (g, m) ∈ I}.

The derivation operators link subsets of objects and attributes of a context. In what follows we will also denote both operators by the single symbol ′.
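As a minimal illustration (a sketch, not part of the paper), the two derivation operators of Definition 2 can be written directly as set comprehensions; the relation I is represented as a set of (object, attribute) pairs, and the toy data below are arbitrary:

```python
# Sketch of the derivation operators of Definition 2.
def up(A, M, I):
    """A↑ : attributes shared by all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def down(B, G, I):
    """B↓ : objects possessing all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

# Arbitrary two-object context, for illustration only.
G = {"g1", "g2"}
M = {"m1", "m2"}
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m2")}
print(up({"g1", "g2"}, M, I))   # {'m2'}
print(down({"m2"}, G, I))       # {'g1', 'g2'}
```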
Definition 3. A pair (A, B) is a formal concept if A↑ = B and B↓ = A.

Definition 4. An implication is a pair of attribute sets (B, C), B, C ⊆ M, written as B → C. An implication B → C is true on the context K = (G, M, I) if ∀g ∈ G (B ⊈ g′ ∨ C ⊆ g′). The set of all implications true on the context K will be denoted by Imp(K).

Definition 5. For any set L of implications we define a direct inference operator f_L that adds to the original set of attributes X all possible conclusions of the implications:
f_L(X) = X ∪ {C | B ⊆ X, B → C ∈ L}.

Theorem 1. [Ganter et al. 1999] For any subset B ⊆ M, f_Imp(K)(B) = B ⇔ B′′ = B.

Let us illustrate Theorem 1 on the following simple context.

Table 1
Very simple formal context K0 [Ganter et al. 1999]

    m1  m2  m3
     1   1   0
     1   1   1
     0   1   1
     0   0   0

We can reformulate the task of constructing a concept lattice in terms of implications and the inference operator. Looking at Table 1, it is not difficult to notice that the attributes m1 and m3 completely define the object class. Indeed, m1 implies m2, and so does m3; hence m1 → m2 and m3 → m2. The sets {m1, m2}, {m2, m3} and {m1, m2, m3} will be the intents of formal concepts. According to the theorem, they will also be fixed points of the direct inference operator. For example, f_Imp(K)({m1}) = {m1, m2} and f_Imp(K)({m1, m2}) = {m1, m2}; therefore {m1, m2} is both a fixed point and the intent of a formal concept.

Now let us add some noise to the original context K0 from Table 1. To keep the noise level fairly low, we also expand the context by adding duplicates of the objects:

Table 2
K0 with duplicates

    m1  m2  m3
     1   1   0
     1   1   1
     0   1   1
     1   1   0
     1   1   1
     0   1   1
     1   1   0
     1   1   1
     0   1   1
     0   0   0

Table 3
K_noise: a little noise added

    m1  m2  m3
     0   1   0
     1   1   1
     0   1   1
     1   1   0
     1   0   1
     0   1   1
     1   1   0
     1   1   1
     0   1   1
     0   0   0

Every small change made to the context has a dramatic impact on the concept lattice. The first context, in Table 2, is equivalent to the original context K0 and generates the same concept lattice. However, the lattice of the noisy context in Table 3 will contain fictitious concepts {m2} and {m1, m3} generated by the noise, which will also be formal concepts. The number of fictitious concepts grows with the noise level, and the qualitative estimate of this growth is exponential [Emilion et al. 2009].
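The fixed-point computation above can be sketched as follows (a minimal illustration under the assumptions of Table 1; only the two implications named in the text are used):

```python
# Sketch of the direct inference operator f_L (Definition 5) on the context K0 of Table 1.
def f(implications, X):
    """One application of f_L: add the conclusions of all applicable implications."""
    X = set(X)
    for premise, conclusion in implications:
        if premise <= X:
            X |= conclusion
    return X

def closure(implications, X):
    """Iterate f_L until a fixed point is reached."""
    while True:
        Y = f(implications, X)
        if Y == X:
            return Y
        X = Y

# Implications true on K0: m1 -> m2 and m3 -> m2.
imps = [({"m1"}, {"m2"}), ({"m3"}, {"m2"})]
print(closure(imps, {"m1"}))        # {'m1', 'm2'}: a fixed point, i.e. the intent of a formal concept
print(closure(imps, {"m2", "m3"}))  # {'m2', 'm3'}: already a fixed point
```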
6. Probabilistic logic in a formal context

To avoid the exponential growth of the lattice, a stability index was introduced [Klimushkin et al. 2010], [Kuznetsov 2007], [Buzmakov 2014] to assess the degree of stability of a concept and to select only stable concepts. We present a fundamentally different way of constructing the lattice, based on probabilistic implications. For this purpose, we redefine the context within the framework of logic. In what follows we consider only finite contexts.

Definition 6. For a finite context K = (G, M, I) we define a signature Ω_K that contains a predicate symbol m(x) for each m ∈ M. For the signature Ω_K and the context K, considered as a model, we define the interpretation of the predicate symbols as follows: K ⊧ m(x) ⇔ (x, m) ∈ I.

Definition 7. For the signature Ω_K let us define the following variant of first-order logic:
1. X_K: the set of variables;
2. At_K: the set of atomic formulas (atoms) m(x), where m ∈ Ω_K and x ∈ X_K;
3. L_K: the set of literals, including all atoms m(t) and their negations ¬m(t);
4. Φ_K: the set of formulas defined inductively: every atom is a formula, and for any Φ, Ψ ∈ Φ_K the syntactic constructions Φ ∧ Ψ, Φ ∨ Ψ, Φ → Ψ, ¬Φ are formulas as well.

For a set of literals L ⊆ L_K we define its conjunction as ∧L = ⋀_{P∈L} P. In the same way, ¬L = {¬P | P ∈ L}.

Definition 8. A single-element set {g}, together with the predicates of the signature Ω_K, forms a model K_g of this object. The truth of a formula φ on the object model K_g is written as g ⊧ φ ⇔ K_g ⊧ φ.

Definition 9. Let us define a probabilistic measure μ on the set G, in the Kolmogorov sense. Then we can define a probabilistic measure on the set of formulas as ν : Φ_K → [0, 1], ν(φ) = μ({g | g ⊧ φ}). We assume that the context has no statistically insignificant objects, i.e. no objects such that μ({g}) = 0, g ∈ G.

Definition 10. A set of literals M will be called ν-joint if ν(∧M) > 0.

In what follows, the jointness of a set of literals is considered with respect to the probabilistic measure of the context, and when there is no ambiguity the symbol of the measure is omitted.

Let us consider some subset of atoms L ⊆ At(K). The formula m1 ∧ m2 ∧ … ∧ mk → m = ∧{mi} → m defines the implication ({mi}, {m}) on the context. The presence of a probabilistic measure allows us to define probabilistic implications. Let us define rules on the context.

Definition 11. Let {H1, H2, …, Hk, C} ⊆ L_K, C ∉ {H1, H2, …, Hk}, k ≥ 0.
1. The rule R = (H1, H2, …, Hk → C) corresponds to the implication (H1 ∧ H2 ∧ … ∧ Hk → C);
2. The premise R← of the rule R is the set of literals {H1, H2, …, Hk};
3. The conclusion of the rule is R→ = C;
4. The length of the rule is the cardinality of its premise, |R←|;
5. If R1← = R2← and R1→ = R2→, then R1 = R2.

Definition 12. The probability η of the rule R is the value
η(R) = ν(R→ | R←) = ν(∧R← ∧ R→) / ν(∧R←).
If the denominator ν(∧R←) is equal to 0, the probability is not defined.

7. Rule classes and consistency of predictions

The following sections combine the works [Ganter et al. 1999], [Vityaev 1992], [Vityaev 2006], [Vityaev et al. 2012], [Vityaev et al. 2015], [Martynovich et al. 2016].

Definition 13. Let us define the prediction operator by a set of rules ℛ as
Π_ℛ(L) = L ∪ {C | ∃R ∈ ℛ : R← ⊆ L, R→ = C}.

Definition 14. The closure L̄ of a set of literals L is the smallest fixed point of the prediction operator containing L:
L̄ = Π_ℛ(L̄) = Π_ℛ^∞(L) = ⋃_{k∈ℕ} Π_ℛ^k(L).

Definition 15. A rule R1 is a subrule of a rule R2 if R1→ = R2→ and R1← ⊂ R2←. We denote this fact by R1 ⊏ R2.

Definition 16. A rule R1 specifies a rule R2, written R1 > R2, if R2 ⊏ R1 and η(R1) > η(R2).

Example 2. The rule m1 → m2 on the context K_noise (Table 3) has only one, unconditional subrule, (∅ → m2) ⊏ (m1 → m2), but it is not true that the first rule specifies the second, since η(∅ → m2) = 8/10 = 4/5 = η(m1 → m2). However, (m3 → m2) > (∅ → m2), since η(∅ → m2) = 8/10 < 5/6 = η(m3 → m2).

To achieve consistency and resistance to noise, some restrictions on the classes of rules used are required [Vityaev 1992], [Speransky 2013]. The class M1 defined below contains only those rules whose conditional probability is strictly greater than the probability of the unconditional rule.

Definition 17. R ∈ M1(C) ⇔ η(R) > ν(R→), R→ = C, R← ≠ ∅.

The next class M2 requires the property of maximal specificity: the impossibility of increasing the conditional probability by refining the rule [Vityaev 2006].

Definition 18. R ∈ M2(C) ⇔ (R ∈ M1(C)) & (R ⊏ R̃ ⇒ η(R̃) ≤ η(R)).

We have proved that the maximally specific rules of M2 solve the problem of statistical ambiguity [Vityaev 2006], [Vityaev et al. 2015].

Example 3. The rule (m3 → m2) can be specified to the rule (¬m1 ∧ m3 → m2), because η(m3 → m2) = 5/6 < 1 = 3/3 = η(¬m1 ∧ m3 → m2). The latter rule meets all the requirements of the class M2, and therefore (¬m1 ∧ m3 → m2) ∈ M2(m2).
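A minimal sketch of how the rule probability of Definition 12 and the comparisons of Examples 2-3 can be computed on the context of Table 3 (the data encoding below is an assumption of this sketch):

```python
# Sketch: rule probability η(R) (Definition 12) on the noisy context of Table 3.
# A literal is a pair (attribute, truth value); a rule is (premise literals, conclusion literal).
K_noise = [
    (0, 1, 0), (1, 1, 1), (0, 1, 1), (1, 1, 0), (1, 0, 1),
    (0, 1, 1), (1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 0),
]
ATTR = {"m1": 0, "m2": 1, "m3": 2}

def holds(obj, literals):
    return all(obj[ATTR[a]] == int(positive) for a, positive in literals)

def eta(context, premise, conclusion):
    """η(R) = ν(∧R← ∧ R→) / ν(∧R←) with the uniform measure on objects."""
    sat = [g for g in context if holds(g, premise)]
    return sum(holds(g, [conclusion]) for g in sat) / len(sat)

print(eta(K_noise, [], ("m2", True)))                             # 0.8   = 8/10
print(eta(K_noise, [("m1", True)], ("m2", True)))                 # 0.8   = 4/5 (Example 2)
print(eta(K_noise, [("m3", True)], ("m2", True)))                 # 0.833 = 5/6
print(eta(K_noise, [("m1", False), ("m3", True)], ("m2", True)))  # 1.0 (refinement of Example 3)
```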
The next class, Imp, contains all exact implications and corresponds to the set Imp(K) of Definition 4.

Definition 19. R ∈ Imp(C) ⇔ (η(R) = 1) & (R→ = C).

Definition 20. Define the classes M1, M2, Imp by taking the union over all literals:
M1 = ⋃_{C∈L_K} M1(C); M2 = ⋃_{C∈L_K} M2(C); Imp = ⋃_{C∈L_K} Imp(C).

Definition 21. A set of rules ℛ will be called exact if Imp ⊂ ℛ.

Definition 22. Any subset ℛ ⊆ M2 will be called a system of rules. A set of literals L is consistent if it does not simultaneously contain some atom C and its negation ¬C.

Theorem 2 (Joint predictions [Vityaev et al. 2015]). If L is joint, then Π_ℛ(L) is also joint and consistent for any system of rules ℛ ⊆ M2.

8. Probabilistic formal concepts

Fixed points of the prediction operator are obvious candidates for the role of intents of probabilistic concepts. To determine the extents, let us refer to [Kuznetsov 2007], [Buzmakov 2014]. The idea is to take all preimages of the closure operator, i.e. all sets C such that Π_ℛ(C) = B, and to combine their derivatives into a single extent A. This makes it possible to restore the initial object-concept correspondence by means of the prediction operator and also to include all objects of the same class into a single extent of the concept.

Definition 23. A probabilistic formal concept on the context K is a pair (A, B) that satisfies the conditions:
Π_ℛ(B) = B, A = ⋃_{Π_ℛ(C)=B} C′, ℛ ⊆ M2 (Def. 22).

Example 4. Let us consider a context K_squares consisting of two non-intersecting squares that give two independent formal concepts. We change the values of some cells to make the task a bit more complicated.

Table 4
Context K_squares with 2 concepts and superimposed noise

 G    m1  m2  m3  m4  m5  m6  m7  m8
 g1    1   1   1   1   1   0   0   0
 g2    1   1   1   1   0   0   0   0
 g3    1   1   1   1   0   0   0   0
 g4    1   1   1   1   0   0   0   0
 g5    0   0   0   0   1   1   1   1
 g6    0   0   0   0   1   1   0   1
 g7    0   0   0   0   1   1   1   1
 g8    0   0   0   0   1   1   1   1

Note that the maximally specific rules of M2 will include the rules (m_i → ¬m_j), i = 1…4, j = 5…8, but not the rules m5 → ¬m_i, i = 1…4, for which the following refinement can be found: η(m5 ∧ m6 → ¬m1) = 1 > 4/5 = η(m5 → ¬m1). This example shows how noise is processed and absorbed into the probabilities in the course of rule refinement.

To compute the concept containing the feature m1, we first compute its closure Π_ℛ({m1}) = {m1, m2, m3, m4, ¬m5, ¬m6, ¬m7, ¬m8}. Then all objects whose closure turns out to be exactly the same are gathered into a single extent of the concept: {g2, g3, g4}.

To distinguish probabilistic formal concepts from classical formal concepts on the context K, we will call the latter simply formal concepts. The definition of the set A is also motivated by the following theorem linking probabilistic and ordinary formal concepts on the context K.

Theorem 3. Let K be a formal context.
1. If (A, B) is a formal concept in K, then there is a probabilistic formal concept (N, M) in the same context such that A ⊆ N.
2. If (N, M) is a probabilistic formal concept on K, then there is a family ℱ of formal concepts in the same context such that ∀(A, B) ∈ ℱ (Π_ℛ(B) = M) and N = ⋃_{(A,B)∈ℱ} A.
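A sketch of the closure computation of Example 4. The explicit rule list below is a hand-written stand-in for the maximally specific rules of the K_squares context (it is an assumption of this illustration, not output of the method):

```python
# Sketch: the prediction operator Π (Definitions 13-14) iterated to a fixed point.
# A literal is (attribute, truth value); a rule is (set of premise literals, conclusion literal).
rules = [({("m%d" % i, True)}, ("m%d" % j, False))        # square 1 excludes square 2
         for i in range(1, 5) for j in range(5, 9)]
rules += [({("m%d" % i, True)}, ("m%d" % j, True))        # attributes of the same square
          for i in range(1, 5) for j in range(1, 5) if i != j]

def closure(rules, L):
    """Smallest fixed point of the prediction operator containing L."""
    L = set(L)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= L and conclusion not in L:
                L.add(conclusion)
                changed = True
    return L

print(sorted(closure(rules, {("m1", True)})))
# -> m1, m2, m3, m4 positive and m5, m6, m7, m8 negated, as stated in the text
```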
9. Statistical method for detection of probabilistic formal concepts

The procedure for obtaining all M2 rules by exhaustive search is exponential, because O(2^|M|) candidate rules (each atom or its negation may be included in a rule) have to be evaluated on |G| objects. Of course, this estimate is rather rough and the procedure can be improved. Instead, we define a more radical simplification of the problem: a statistical approximation of the M2 rules.

Definition 24. R is a probabilistic law if (R̃ ⊏ R) ⇒ (R̃ < R).

Definition 25. A rule R̃ is semantically derived from a rule R, denoted R ⊳ R̃, if R and R̃ are probabilistic laws and R̃ > R.

Definition 26. A probabilistic law R will be called strongest if there is no probabilistic law R̃ such that R̃ > R.

Definition 27. A Semantic Probabilistic Inference (SPI) of some strongest probabilistic law R_m is a sequence of probabilistic laws R0 ⊳ R1 ⊳ R2 ⊳ … ⊳ R_m such that R0← = ∅ and R_m is a strongest probabilistic law. Let us denote by ℒ the set of all strongest probabilistic laws obtained by the various semantic probabilistic inferences.

Checking the specification R1 > R2 requires checking the inequality η(R1) > η(R2). In general, from a practical point of view, we cannot assume that a probability measure is given as in Definition 9, so we consider it unknown. The probabilistic inequalities used in the SPI definition are verified by some statistical criterion, e.g. Fisher's exact independence test with a confidence level α. The resulting set of rules ℒ_α may already make contradictory predictions. Therefore, to approximate the prediction operator Π_ℛ(L), it is necessary to introduce an additional criterion of mutual consistency of the predictions made by the rules of ℒ_α on the set L.

Definition 29. A rule R ∈ ℒ_α is confirmed on a set of literals L if R← ⊂ L and R→ ∈ L. In this case we write R ∈ Sat(L) ⊆ ℒ_α.

Definition 30. A rule R ∈ ℒ_α is refuted on a set of literals L if R← ⊂ L and R→ ∈ ¬L. In this case we write R ∈ Fal(L) ⊆ ℒ_α.

The difference between the quantity and quality of the rules confirmed and refuted on the concept-hypothesis L determines whether literals are added to or deleted from the intent of the concept.

Definition 31. The consistency criterion for the set L is the value
Int(L) = Σ_{R∈Sat(L)} γ(R) − Σ_{R∈Fal(L)} γ(R).
The choice of the rule evaluation γ may depend on additional specifics of the task. In the experiments below, we are guided by Shannon-style considerations and take γ(R) = −log(1 + ε − η(R)).

Definition 32. Define a consistent prediction operator Υ(L) that changes a set of literals L by one element so as to strictly increase the consistency criterion:
1. For every G ∈ L_K∖L calculate the effect of adding G: Δ+ = Int(L ∪ {G}) − Int(L);
2. For every G ∈ L calculate the effect of removing G: Δ− = Int(L∖{G}) − Int(L);
3. The operator Υ(L) adds a literal G to the set L if Δ+ > 0 and Δ+ > Δ−; the operator deletes a literal G from L if Δ− > 0 and Δ− > Δ+; if Δ− = Δ+ and Δ− > 0, the operator deletes the literal G;
4. If Δ+ ≤ 0 and Δ− ≤ 0, the operator returns L and we have reached a fixed point.

Definition 33. A probabilistic formal concept discovered by the statistical method is a probabilistic formal concept in which ℒ_α is the set of all strongest probabilistic laws (Def. 26) obtained by the various semantic probabilistic inferences with confidence level α on the context K.
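One possible reading of Definitions 31-32 as code (a sketch; the rule representation, the choice of the ε parameter, and the selection of the single literal with the largest effect are assumptions of this illustration, with ties broken toward deletion as in item 3):

```python
import math

EPS = 1e-3  # assumed small ε for γ(R) = -log(1 + ε - η(R))

def gamma(eta):
    return -math.log(1 + EPS - eta)

def Int(L, rules):
    """Definition 31: weights of confirmed rules minus weights of refuted rules on L."""
    total = 0.0
    for premise, conclusion, eta in rules:      # a rule is (premise set, conclusion literal, η)
        negation = (conclusion[0], not conclusion[1])
        if premise <= L:
            if conclusion in L:
                total += gamma(eta)             # R ∈ Sat(L)
            elif negation in L:
                total -= gamma(eta)             # R ∈ Fal(L)
    return total

def upsilon(L, rules, all_literals):
    """Definition 32: change L by one literal so as to strictly increase Int(L)."""
    base = Int(L, rules)
    d_add, g_add = max(((Int(L | {G}, rules) - base, G) for G in all_literals - L),
                       default=(0.0, None))
    d_del, g_del = max(((Int(L - {G}, rules) - base, G) for G in L),
                       default=(0.0, None))
    if d_del > 0 and d_del >= d_add:
        return L - {g_del}                      # deletion preferred on ties
    if d_add > 0:
        return L | {g_add}
    return L                                    # fixed point: Υ(L) = L
```

Iterating Υ until it returns its argument unchanged yields the fixed points that serve as candidate intents in Definition 33.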
10. Experiments in the detection of probabilistic formal concepts by the statistical method

Let us illustrate the detection of probabilistic formal concepts by the statistical method on the example of detecting classes (concepts) of digits. We encode the digits with 24 features, located as shown in Fig. 1. Each feature has 7 values (see Fig. 2).

Figure 1: Digits encoding
Figure 2: Digits encoding

Some lines of the digits are encoded twice; for example, the vertical line of the digits 4 and 1 is encoded both by the vertical right line (feature 3) in cells 3, 7, 11, 15, 19, 23 and by the vertical left line (feature 5) in cells 4, 8, 12, 16, 20, 24.

Two experiments were conducted on two different contexts. The first contained 30 copies of each digit (360 images in total), randomly mixed together. The program does not know where each digit is and treats them as different images. The second context contained 10 copies of each of the 10 digits (without the duplicate variants of six and nine; a single variant was taken) and 300 random images in which each of the 24 features took a random value out of the seven available. All digits and images were also randomly mixed. In each context, a corresponding set ℒ_α of strongest probabilistic laws with α = 0.05 for the Fisher criterion was found. In the first experiment, 66160 regularities with conditional probability greater than 0.9 and α = 0.05 were found. In the second experiment, 629494 such regularities were found.

In the first context, the method found all 12 digits as classes (they were fixed points), as well as two fixed points for 6 and 9 which did not contain, respectively, features 13 and 12, by which the two copies of these digits differed. In the second context, the method identified the 10 digits as fixed points.

In both experiments, the stability of the discovered probabilistic formal concepts with respect to missing information about some feature of a digit was tested. It was checked that if, for a digit described by a set of literals L, one of the literals m_i or ¬m_i is removed, i.e. one of the feature values is removed, the digit is still assigned to the same class and the unknown feature value is correctly identified and added back to L, i.e. Π_ℒα(L∖{m_i}) = L. This means that the removed feature value is correctly predicted by the regularities from ℒ_α without contradictions arising from the logical relationships of this feature with the other features. In the first experiment, for all 360 digits, a feature was removed at random and then restored by the prediction operator. In the second experiment, the removed features were restored in 96% of cases.

Acknowledgments

The publication of this article was supported by the project "Investigation, analysis and complex independent expertise of projects of the National Technological Initiative, including the accompanying of projects of the NeuroNet 'road map'", carried out within the framework of state assignment No. 28.12487.2018/12.1 of the Ministry of Science and Higher Education of the Russian Federation. The study was also carried out within the framework of the state contract of the Sobolev Institute of Mathematics (project no. 0314-2019-0002) as regards the theoretical results of Sections 6-9.

References

[Vityaev 1992] Vityaev E.E. Semantic approach to knowledge base creation. Semantic probabilistic inference of the best prediction PROLOG programs by a probabilistic data model // Logic and Semantic Programming (Computing Systems, vol. 146), Novosibirsk, 1992, pp. 19-49.
[Vityaev et al. 2014] Vityaev E.E., Neupokoev N.V. Formal model of perception and image as a fixed point of anticipation // Approaches to the Modeling of Thinking (V.G. Redko, ed.). Editorial URSS, Moscow, 2014, pp. 155-172.
[Vityaev et al. 2015] Vityaev E.E., Martynovich V.V. Formalization of "natural" classification and systematics through fixed points of predictions // Siberian Electronic Mathematical Reports, Vol. 12, Sobolev Institute of Mathematics, Siberian Branch of the Russian Academy of Sciences, 2015, pp. 1006-1031.
[Vityaev et al. 2012] Vityaev E.E., Demin A.V., Ponomariov D.K. Probabilistic generalization of formal concepts // Programming, Vol. 38, No. 5, 2012, pp. 18-34.
[Smirnov 1985] Smirnov S.D. Psychology of the image. Moscow: Moscow State University, 1985, 231 p.
[Naiser 1981] Naiser U. (Neisser U.) Cognition and Reality. Moscow: Progress, 1981, 229 p.
[Mill 2011] Mill J.S. A System of Logic, Ratiocinative and Inductive. Preface and appendix by V.K. Finn. Moscow: Lenand, 2011, 832 p.
[Rutkovsky 1884] Rutkovsky L. Elementary textbook of logic. St. Petersburg, 1884.
[Mayen et al. 1976] Meyen S.V., Schreider S.A. Methodological aspects of classification theory // Problems of Philosophy (Voprosy Filosofii), 1976, No. 12.
[Smirnov 1938] Smirnov E.S. The construction of the species from a taxonomic point of view // Zoologichesky Zhurnal, Vol. 17, No. 3, 1938, pp. 387-418.
[Zabrodin 1981] Zabrodin V.Yu. On criteria of natural classification // NTI, ser. 2, 1981, No. 8.
[Cheng 1997] Cheng P. From covariation to causation: A causal power theory // Psychological Review, 104, 1997, pp. 367-405.
[Gopnik et al. 2004] Gopnik A., Glymour C., Sobel D.M., Schulz L.E., Kushnir T. A theory of causal learning in children: Causal maps and Bayes nets // Psychological Review, 111, 2004, pp. 3-23.
[Griffiths et al. 2009] Griffiths T.L., Tenenbaum J.B. Theory-based causal induction // Psychological Review, 116, 2009, p. 56.
[Rehder et al. 2001] Rehder B., Hastie R. Causal knowledge and categories: The effects of causal beliefs on categorization, induction, and similarity // Journal of Experimental Psychology: General, 130, 2001, pp. 323-360.
[Rehder 2003a] Rehder B. Categorization as causal reasoning // Cognitive Science, 27, 2003, pp. 709-748.
[Rehder 2003b] Rehder B. A causal-model theory of conceptual representation and categorization // Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 2003, pp. 1141-1159.
[Rehder et al. 2011] Rehder B., Martin J.B. Towards a generative model of causal cycles // 33rd Annual Meeting of the Cognitive Science Society (CogSci 2011), Boston, Massachusetts, USA, 20-23 July 2011, Vol. 1, pp. 2944-2949.
[Rosch 1973] Rosch E.H. Natural categories // Cognitive Psychology, 4, 1973, pp. 328-350.
[Rosch et al. 1975] Rosch E., Mervis C.B. Family resemblances: Studies in the internal structure of categories // Cognitive Psychology, 7, 1975, pp. 573-605.
[Rosch et al. 1976] Rosch E., Mervis C.B., Gray W.D., Johnson D.M., Boyes-Braem P. Basic objects in natural categories // Cognitive Psychology, 8, 1976, pp. 382-439.
[Rosch et al. 1978] Rosch E., Lloyd B.B. (eds.) Cognition and Categorization. Hillsdale, NJ: Lawrence Erlbaum, 1978, pp. 27-48.
[Rosch 1978] Rosch E. Principles of categorization // Rosch E., Lloyd B.B. (eds.), Cognition and Categorization, Lawrence Erlbaum Associates, Hillsdale, 1978, pp. 27-48.
[Ross et al. 2008] Ross B.H., Taylor E.G., Middleton E.L., Nokes T.J. Concept and category learning in humans // H.L. Roediger, III (ed.), Cognitive Psychology of Memory. Vol. 2 of Learning and Memory: A Comprehensive Reference, 4 vols. (J. Byrne, ed.), Oxford: Elsevier, 2008, pp. 535-556.
[The Nature of Classification 2013] The Nature of Classification. Relationships and Kinds in the Natural Sciences. Palgrave Macmillan, 2013, 208 p.
[Ganter 2003] Ganter B. Formal Concept Analysis: Methods, and Applications in Computer Science. TU Dresden, 2003.
[Ganter et al. 1999] Ganter B., Wille R. Formal Concept Analysis: Mathematical Foundations. Berlin-Heidelberg-New York: Springer, 1999.
[Ganter et al. 2004] Ganter B., Obiedkov S. Implications in Triadic Formal Contexts. TU Dresden, Springer, 2004.
[Klimushkin et al. 2010] Klimushkin M., Obiedkov S., Roth C. Approaches to the selection of relevant concepts in the case of noisy data // ICFCA 2010, LNAI 5986, 2010, pp. 255-266.
[Frequent Itemset Mining 2004] Bayardo Jr. R., Goethals B., Zaki M. (eds.) Proc. of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 2004). CEUR-WS.org, 2004.
[Kuznetsov 2007] Kuznetsov S.O. On stability of a formal concept // Annals of Mathematics and Artificial Intelligence, 49, 2007, pp. 101-115.
[Buzmakov 2014] Buzmakov A., Kuznetsov S., Napoli A. Concept stability as a tool for pattern selection // CEUR Workshop Proceedings, vol. 1257, ECAI 2014, pp. 51-58.
[Emilion et al. 2009] Emilion R., Levy G. Size of random Galois lattices // Discrete Applied Mathematics, 157, 2009, pp. 2945-2957.
[Speransky 2013] Speransky S.O. Logic of probability and probability logic. PhD thesis, Novosibirsk State University, Novosibirsk, 2013.
[Vityaev 2006] Vityaev E.E. The logic of prediction // Proceedings of the 9th Asian Logic Conference "Mathematical Logic in Asia" (Novosibirsk, Russia, August 16-19, 2005). World Scientific, Singapore, 2006, pp. 263-276.
[Vityaev et al. 2015] Vityaev E.E., Martynovich V.V. Probabilistic formal concepts with negation // Perspectives of System Informatics, A. Voronkov, I. Virbitskaite (eds.), LNCS vol. 8974, 2015, pp. 385-399.
[Martynovich et al. 2016] Martynovich V.V., Vityaev E.E. Recovering noisy contexts with probabilistic formal concepts // Proceedings of the 2nd International Workshop on Soft Computing Applications and Knowledge Discovery (SCAKD 2016), Moscow, Russia, July 18, 2016. CEUR Workshop Proceedings, vol. 1687, pp. 24-35.