=Paper=
{{Paper
|id=Vol-2098/paper23
|storemode=property
|title=Optimization Models for Detection of Patterns in Data
|pdfUrl=https://ceur-ws.org/Vol-2098/paper23.pdf
|volume=Vol-2098
|authors=Igor Masich,Lev Kazakovtsev,Alena Stupina
}}
==Optimization Models for Detection of Patterns in Data==
Optimization Models for Detection of Patterns in Data*

Igor Masich, Lev Kazakovtsev, and Alena Stupina

Siberian State University of Science and Technology, Krasnoyarsk, Russia
is.masich@gmail.com

Abstract. The paper reviews the detection of hidden rules (or patterns) in data sets and their use to support decision making in recognition. The problem of finding patterns is considered as a problem of conditional optimization of monotone pseudo-Boolean functions. For comparing patterns, three criteria are used: simplicity, selectivity and evidence, as well as their possible combinations. We consider the types of patterns obtained in accordance with these criteria that are of the greatest interest for supporting decision making in recognition. The problem of searching for informative patterns is studied by formalizing this search as a conditional pseudo-Boolean optimization problem. The properties of the optimization model are analyzed, and a new alternative optimization model is proposed for finding strong spanned patterns.

Keywords: Pseudo-Boolean optimization · Recognition · Rule-based algorithms

* Results were obtained in the framework of the state task 2.5527.2017/8.9 of the Ministry of Education and Science of the Russian Federation.

Copyright (c) by the paper's authors. Copying permitted for private and academic purposes. In: S. Belim et al. (eds.): OPTA-SCL 2018, Omsk, Russia, published at http://ceur-ws.org

1 Introduction

At present, when solving classification problems, in addition to the requirement of high accuracy, the need often arises for interpretability and validity of the obtained solutions. Interpretability and validity are especially important in those practical problems in which the losses from a wrong decision can be large. Therefore, a decision support system used for such problems should justify possible solutions and interpret the result. To create such a system, data classification algorithms are required that, in addition to the solution itself, provide an explicit rule, that is, they reveal knowledge contained in the available data. This is true for logical classification algorithms, whose working principle is to identify patterns in data and formalize them as a set of rules, i.e. a set of patterns described by a simple logical formula.

The process of forming logical rules is accompanied by solving problems of choosing the best alternatives in accordance with some criteria. In the studied method, the formation of logical rules is formalized as a number of combinatorial optimization problems, which yields a flexible and efficient algorithm for finding patterns in the primary data. By combining a number of patterns into a composition, the obtained classifier solves the recognition problem.

However, at present there are a number of problems associated with applying the method of logical data analysis to practical classification problems. One of them is the construction of optimization models for the formation of informative patterns. When considering this problem, it is first of all necessary to determine the criteria and constraints on which these optimization models are founded.
The main objective is to increase the interpretability of the classifier and the quality of recognition of new observations, that is, to improve the classifier's generalizing ability.

The main object of the research conducted by the authors is the method of logical analysis of data, which originates in the theory of combinatorial optimization [3]. This method belongs to the rule-based classification algorithms and is based on identifying logical patterns in the data sample. This paper analyses the existing optimization model for searching for patterns and offers an alternative optimization model for constructing strong spanned patterns.

2 The Concept of Pattern

The creation and use of logical classification algorithms is based on identifying in the primary data the patterns from which a decision function is formed. The search for patterns can be considered as a combinatorial optimization problem. To obtain a more efficient solution, the optimization algorithm should be chosen according to the characteristic properties of the optimization problem under consideration. In this paper we consider some properties of the optimization problems that are solved in the course of searching for logical patterns in the data.

Let us consider the problem of recognizing objects described by binary variables and divided into two classes:

$$K = K^+ \cup K^- \subset \mathbb{B}_2^n,$$

where $\mathbb{B}_2^n = \mathbb{B}_2 \times \mathbb{B}_2 \times \dots \times \mathbb{B}_2$, $\mathbb{B}_2 = \{0, 1\}$. The classes do not intersect:

$$K^+ \cap K^- = \emptyset.$$

An observation $X \in K$ is described by a binary vector $X = (x_1, x_2, \dots, x_n)$ and can be represented as a point in the hypercube of the space of binary variables $\mathbb{B}_2^n$. Observations of the class $K^+$ will be called positive points of the sample $K$, and observations of the class $K^-$ will be called negative points of the sample.

Let us consider a subset of points of $\mathbb{B}_2^n$ for which some variables are fixed and identical, and the rest take arbitrary values:

$$T = \{x \in \mathbb{B}_2^n \mid x_i = 1 \ \forall i \in A \text{ and } x_j = 0 \ \forall j \in B\}$$

for some subsets $A, B \subseteq \{1, 2, \dots, n\}$, $A \cap B = \emptyset$. This set can also be defined through a Boolean function that takes a true value exactly on the elements of the set:

$$t(x) = \left(\bigwedge_{i \in A} x_i\right) \wedge \left(\bigwedge_{j \in B} \bar{x}_j\right).$$

The set of points $x$ for which $t(x) = 1$ is denoted $S(t)$. $S(t)$ is a subcube of the Boolean hypercube $\mathbb{B}_2^n$; the number of points of the subcube is $2^{n - |A| - |B|}$.

The binary variable $x_i$ or its negation $\bar{x}_i$ in a term is called a literal. The notation $x_i^\alpha$ stands for $x_i$ if $\alpha = 1$ and for $\bar{x}_i$ if $\alpha = 0$. Thus, a term is a conjunction of distinct literals that does not simultaneously contain a variable and its negation. The set of literals of the term $t$ is denoted $Lit(t)$.

We say that the term $t$ covers the point $a \in \mathbb{B}_2^n$ if $t(a) = 1$, that is, if the point belongs to the corresponding subcube. A pattern $P$ (or rule) is a term that covers at least one observation of a certain class and does not cover any observation of the other class [1]. That is, a pattern corresponds to a subcube having a non-empty intersection with one of the sets ($K^+$ or $K^-$) and an empty intersection with the other set ($K^-$ or $K^+$, respectively). A consistent pattern $P$ that does not intersect $K^-$ will be called positive, and a consistent pattern $P'$ that does not intersect $K^+$ will be called negative. Below, for definiteness, only positive patterns are considered. The set of observations covered by the pattern $P$ is denoted $Cov(P)$.
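To make these definitions concrete, the following minimal Python sketch (with an invented toy sample, not data from the paper) builds a term from the index sets $A$ and $B$ and computes its coverage:

```python
# Toy illustration of terms, patterns and coverage (invented data, not from the paper).

def make_term(A, B):
    """Term t(x) = (AND_{i in A} x_i) AND (AND_{j in B} NOT x_j), 0-based indices."""
    def t(x):
        return all(x[i] == 1 for i in A) and all(x[j] == 0 for j in B)
    return t

# Toy sample in B_2^4: positive and negative observations.
K_pos = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1)]
K_neg = [(0, 0, 1, 0), (0, 1, 0, 1)]

t = make_term(A={0}, B={1})               # t(x) = x_1 AND NOT x_2

cov = [b for b in K_pos if t(b)]          # Cov(P): covered positive points
impure = [c for c in K_neg if t(c)]       # covered negative points (must be empty)

print("Cov(P) =", cov)                    # two of the three positive points
print("positive pattern:", len(cov) >= 1 and not impure)   # True
```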
Patterns are the elementary blocks for constructing logical recognition algorithms. The most useful for this purpose are patterns with the largest coverage (maximum patterns), that is, those for which $|Cov(P)|$ is maximal. One way to construct a set of patterns for the recognition algorithm is to search for patterns built around the attribute values of specific objects.

3 Criteria for Selection and Optimality of Patterns

There is no single unambiguous criterion for comparing logical patterns with each other. When analyzing different data, different requirements can be imposed on the quality and features of the formed patterns. In accordance with [5], three criteria are used to evaluate the quality of pure patterns (homogeneous ones, covering no observations of other classes) — simplicity, selectivity and evidence — as well as their possible combinations.

The criterion of simplicity (or compactness) is often used to compare patterns, including patterns obtained by different learning algorithms. A pattern $P_1$ is preferable to $P_2$ by the criterion of simplicity (denoted $P_1 \succeq_\sigma P_2$) if $Lit(P_1) \subseteq Lit(P_2)$.

Following [2], a pattern $P$ is prime if removing any literal from $Lit(P)$ produces a term that is no longer a (pure) pattern (that is, it covers observations of the other class). Clearly, optimality of a pattern by the criterion of simplicity is equivalent to the pattern being prime.

The search for simpler patterns has well-founded premises. First, such patterns are better interpreted and more understandable for humans when used in decision making. Second, it is often believed that simpler patterns have better generalizing ability, and that their use leads to better recognition accuracy. However, this assertion is debatable; moreover, it has been shown in [6] that a decrease in simplicity can lead to greater accuracy. The use of simple and therefore short patterns reduces the number of incorrectly recognized positive observations (false negatives), but at the same time it can increase the number of incorrectly recognized negative observations (false positives).

The natural way to reduce the number of false positives is to form more selective patterns, which is achieved by reducing the size of the subcube that determines the pattern. A pattern $P_1$ is preferable to $P_2$ by the criterion of selectivity (denoted $P_1 \succeq_\Sigma P_2$) if $S(P_1) \subseteq S(P_2)$.

It should be noted that the two criteria considered above are opposite to each other, that is, $Lit(P_1) \subseteq Lit(P_2) \Leftrightarrow S(P_1) \supseteq S(P_2)$. A pattern that is optimal by the selectivity criterion is a minterm, that is, a pattern covering a single positive observation. Using this criterion by itself is, of course, not effective, since the resulting patterns (minterms) possess no generalizing ability. But the selectivity criterion is extremely useful in conjunction with other criteria, which will be discussed later.
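As an illustration, here is a small sketch of the simplicity and selectivity preorders, under the assumption that a pattern is stored simply as its literal set $Lit(P)$ (a dictionary mapping a variable index to the required value); the representation and the toy patterns are invented for illustration:

```python
# Sketch of the simplicity and selectivity preorders on patterns.
# A pattern is represented by its literal set Lit(P): {variable index: required value}.

def pref_simplicity(P1, P2):
    # P1 is preferable by simplicity  iff  Lit(P1) is a subset of Lit(P2)
    return all(i in P2 and P2[i] == v for i, v in P1.items())

def pref_selectivity(P1, P2):
    # P1 is preferable by selectivity  iff  S(P1) is a subset of S(P2);
    # for terms this is equivalent to Lit(P2) being a subset of Lit(P1).
    return pref_simplicity(P2, P1)

P1 = {0: 1}            # x_1
P2 = {0: 1, 1: 0}      # x_1 AND NOT x_2

print(pref_simplicity(P1, P2))   # True: P1 is simpler (fewer literals)
print(pref_selectivity(P2, P1))  # True: P2 is more selective (smaller subcube)
# The two criteria are opposite to each other, as noted above.
```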
Another useful criterion is based on coverage, that is, on the number of positive observations of the training sample that satisfy the conditions of the pattern. Patterns with greater coverage undoubtedly have greater generalizing ability, and the observations of the training sample covered by the pattern serve as a kind of evidence of the applicability of the pattern to decision making. A pattern $P_1$ is preferable to $P_2$ by the criterion of evidence (denoted $P_1 \succeq_\varepsilon P_2$) if $Cov(P_1) \supseteq Cov(P_2)$.

Patterns that are optimal by the criterion of evidence are called strong. That is, a pattern $P$ is strong if there is no pattern $P'$ such that $Cov(P') \supset Cov(P)$.

It is important to note that the considered criteria are not completely independent. Thus, the criteria of simplicity and selectivity are opposite to each other. Moreover, the following dependencies hold:

$$P_1 \succeq_\sigma P_2 \Rightarrow P_1 \succeq_\varepsilon P_2,$$
$$P_1 \succeq_\Sigma P_2 \Rightarrow P_2 \succeq_\varepsilon P_1.$$

Two methods are used to combine the considered criteria. For given criteria $\pi$ and $\rho$, a pattern $P_1$ is preferable to $P_2$ by the intersection $\pi \wedge \rho$ (denoted $P_1 \succeq_{\pi \wedge \rho} P_2$) if and only if $P_1 \succeq_\pi P_2$ and $P_1 \succeq_\rho P_2$. For given criteria $\pi$ and $\rho$, a pattern $P_1$ is preferable to $P_2$ by the lexicographic refinement $\pi|\rho$ (denoted $P_1 \succeq_{\pi|\rho} P_2$) if and only if $P_1 \succ_\pi P_2$ or ($P_1 \approx_\pi P_2$ and $P_1 \succeq_\rho P_2$).

In [2] it is shown that only three of all possible combinations of criteria deserve attention. Patterns that are optimal by the criterion $\Sigma \wedge \varepsilon$ are called spanned. Patterns that are optimal by the criterion $\varepsilon|\sigma$ are called strong prime. Patterns that are optimal by the criterion $\varepsilon|\Sigma$ are called strong spanned.

Among all the types of patterns obtained in accordance with the above criteria and their combinations, prime, strong prime and strong spanned patterns are the most useful for identifying informative patterns and using them to support decision making in recognition.

4 Search for Optimal Patterns

Let us fix some observation $a \in K^+$ and define a pattern $P^a$ covering the observation $a$. The variables that are fixed in $P^a$ are equal to the corresponding attribute values of the object $a$. To define the pattern $P^a$, binary variables $Y = (y_1, y_2, \dots, y_n)$ are introduced:

$$y_i = \begin{cases} 1, & \text{if the } i\text{th attribute is fixed in } P^a, \\ 0, & \text{otherwise.} \end{cases}$$

A point $b \in K^+$ is covered by the pattern $P^a$ only if $y_i = 0$ for all $i$ for which $b_i \neq a_i$. On the other hand, a point $c \in K^-$ is not covered by the pattern $P^a$ if $y_i = 1$ for at least one variable $i$ for which $c_i \neq a_i$. The condition that a positive pattern must not contain any point of $K^-$ requires that for each observation $c \in K^-$ the variable $y_i$ take the value 1 for at least one $i$ for which $c_i \neq a_i$ [2]:

$$\sum_{i:\, c_i \neq a_i} y_i \geq 1.$$

The constraint can be strengthened, to increase resistance to errors, by replacing the number 1 on the right-hand side of the inequality by a positive integer $d$.

On the other hand, a positive observation $b \in K^+$ belongs to the considered subcube when the variable $y_i$ takes the value 0 for all indices $i$ for which $b_i \neq a_i$. Thus, the number of positive observations covered by the $a$-pattern can be calculated as

$$\sum_{b \in K^+} \prod_{i:\, b_i \neq a_i} (1 - y_i).$$

Thus, the problem of finding the maximum pattern can be written as the problem of finding values $Y = (y_1, y_2, \dots, y_n)$ for which the obtained pattern $P^a$ covers as many points $b \in K^+$ as possible and does not cover any point $c \in K^-$ [4]:

$$\sum_{b \in K^+} \prod_{i:\, b_i \neq a_i} (1 - y_i) \to \max_Y, \tag{1}$$

$$\sum_{i:\, c_i \neq a_i} y_i \geq 1 \quad \text{for all } c \in K^-. \tag{2}$$

This is a problem of conditional pseudo-Boolean optimization, that is, an optimization problem in which the objective function and the functions on the left-hand side of the constraints are pseudo-Boolean functions (real-valued functions of Boolean variables).
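For a toy instance, problem (1)–(2) can be solved by exhaustive enumeration, as in the following sketch. This enumeration is exponential and serves only to illustrate the model; the paper relies on dedicated conditional pseudo-Boolean optimization algorithms [1].

```python
# Brute-force solution of problem (1)-(2) on a toy instance (illustration only).
from itertools import product

K_pos = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1)]
K_neg = [(0, 0, 1, 0), (0, 1, 0, 1)]
a = K_pos[0]                       # basic observation
n = len(a)

def coverage(Y):                   # objective (1): number of covered positives
    return sum(all(Y[i] == 0 for i in range(n) if b[i] != a[i]) for b in K_pos)

def feasible(Y):                   # constraints (2): every negative point is separated
    return all(any(Y[i] == 1 for i in range(n) if c[i] != a[i]) for c in K_neg)

best = max((Y for Y in product((0, 1), repeat=n) if feasible(Y)), key=coverage)
pattern = {i: a[i] for i in range(n) if best[i] == 1}   # fixed literals of P^a
print(best, coverage(best), pattern)   # here: Y=(1,0,0,0), coverage 3, pattern x_1
```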
The problem of finding maximum negative patterns is formulated similarly.

It should be noted that any point $Y = (y_1, y_2, \dots, y_n)$ corresponds to a subcube in the space of binary variables $X = (x_1, x_2, \dots, x_n)$ that includes the basic observation $a$. When $Y \in O_k(Y^1)$ (in other words, $Y$ differs from $Y^1 = (1, 1, \dots, 1)$ in the values of $k$ coordinates), the number of points of this subcube is $2^k$.

Each found pattern is characterized by its coverage (the number of covered observations of the given class) and its degree (the number of fixed variables that determine the pattern). Under the optimization model (1)–(2), the obtained patterns cover no observations of the other class (from the training sample). The most valuable patterns are those with the greatest coverage: the larger the coverage, the better the pattern reflects the image of the class.

5 Optimization Problem Properties

Let us consider the properties of the optimization problem (1)–(2). First of all, we recall the basic concepts [1].

– Points $X^1, X^2 \in \mathbb{B}_2^n$ are called $k$-neighboring if they differ in the values of $k$ coordinates, $k = 1, \dots, n$.
– The set $O_k(X)$, $k = 0, \dots, n$, of all points of $\mathbb{B}_2^n$ that are $k$-neighboring to the point $X$ is called the $k$-th level of the point $X$.
– A point $X \in \mathbb{B}_2^n$ is called $k$-neighboring to the set $A \subset \mathbb{B}_2^n$ if $A \cap O_k(X) \neq \emptyset$ and $A \cap O_l(X) = \emptyset$ for all $l = 0, \dots, k - 1$.
– A point $X^* \in \mathbb{B}_2^n$ for which $f(X^*) < f(X)$ for all $X \in O_1(X^*)$ is called a local minimum of a pseudo-Boolean function $f$.
– A pseudo-Boolean function having only one local minimum is called unimodal on $\mathbb{B}_2^n$.
– A unimodal function $f$ is called monotone on $\mathbb{B}_2^n$ if for all $X^k \in O_k(X^*)$, $k = 1, \dots, n$: $f(X^{k-1}) \leq f(X^k)$ for all $X^{k-1} \in O_{k-1}(X^*) \cap O_1(X^k)$, and strictly monotone if this condition is satisfied with the strict inequality sign.
– A set of points $W(X^0, X^l) = \{X^0, X^1, \dots, X^l\} \subset \mathbb{B}_2^n$ is called a path between the points $X^0$ and $X^l$ if for all $i = 1, \dots, l$ the point $X^i$ is neighboring to $X^{i-1}$.
– A path $W(X^0, X^l) \subset \mathbb{B}_2^n$ between $k$-neighboring points $X^0$ and $X^l$ is called shortest if $l = k$.
– For any $X, Y \in \mathbb{B}_2^n$, the union of all shortest paths $W(X, Y)$ is called a subcube of $\mathbb{B}_2^n$ and is denoted $K(X, Y)$.

Let us consider the basic properties of the set of feasible solutions of the conditional pseudo-Boolean optimization problem. Consider a problem of the following form:

$$C(X) \to \max_{X \in S \subset \mathbb{B}_2^n}, \tag{3}$$

where $C(X)$ is a pseudo-Boolean function that increases monotonically from $X^0$, and $S \subset \mathbb{B}_2^n$ is some feasible subspace of Boolean variables defined by a given system of constraints, for example $A_j(X) \leq H_j$, $j = 1, \dots, m$.

Let us introduce a number of concepts for a subset of points of the space of Boolean variables [1].

– A point $Y \in S$ is a boundary point of the set $S$ if there is $X \in O_1(Y)$ for which $X \notin S$.
– A point $Y \in O_i(X^0) \cap S$ is called a limiting point of the set $S$ with base point $X^0 \in S$ if for all $X \in O_1(Y) \cap O_{i+1}(X^0)$ the condition $X \notin S$ is fulfilled.
– A constraint defining a subarea of the space of Boolean variables is called active if the optimal solution of the conditional optimization problem (3) does not coincide with the optimal solution of the corresponding optimization problem without this constraint.
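The following small sketch, with an invented feasible set, illustrates the difference between boundary and limiting points:

```python
# Enumerating boundary and limiting points of a feasible set S in B_2^n (toy constraint).
from itertools import product

n = 3
X0 = (1, 1, 1)                                              # base point (feasible)
S = {X for X in product((0, 1), repeat=n) if sum(X) >= 2}   # toy feasible set

def neighbors(X):       # O_1(X): points differing from X in one coordinate
    return [tuple(x ^ (i == j) for j, x in enumerate(X)) for i in range(n)]

def level(X):           # k such that X is in O_k(X0)
    return sum(x != x0 for x, x0 in zip(X, X0))

boundary = {X for X in S if any(Y not in S for Y in neighbors(X))}
limiting = {X for X in S
            if all(Y not in S for Y in neighbors(X) if level(Y) == level(X) + 1)}

print("boundary:", sorted(boundary))   # the three weight-2 points
print("limiting:", sorted(limiting))   # here they coincide; (1,1,1) is in neither
```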
One of the properties of the set of feasible solutions is as follows.

Property 1. If the objective function is a strictly monotone unimodal function and the constraint is active, then the optimal solution of problem (3) is a point belonging to the subset of limiting points of the set of feasible solutions $S$ with the base point $X^0$ at which the objective function takes its minimum value: $C(X^0) = \min_{X \in \mathbb{B}_2^n} C(X)$.

Let us consider an individual constraint of the optimization problem (1)–(2):

$$A_j(Y) \geq 1 \quad \text{for some } c^j \in K^-,\ j \in \{1, 2, \dots, |K^-|\}, \quad \text{where } A_j(Y) = \sum_{i:\, c^j_i \neq a_i} y_i.$$

Introduce the notation

$$\delta_i^j = \begin{cases} 1, & \text{if } c^j_i \neq a_i; \\ 0, & \text{if } c^j_i = a_i. \end{cases}$$

Then

$$A_j(Y) = \sum_{i=1}^n \delta_i^j y_i.$$

The function $A_j(Y)$ decreases monotonically from the point $Y^1 = (1, 1, \dots, 1)$. The limiting points of the feasible domain are the points $Y_k \in O_{n-1}(Y^1)$ (or, which is the same, $Y_k \in O_1(Y^0)$) such that $Y_k$ differs from $Y^0 = (0, 0, \dots, 0)$ in the value of the $k$-th component, for which $\delta_k^j = 1$. The set of feasible solutions is the union of the subcubes formed by the limiting points of the feasible space and the point $Y^1$:

$$\bigcup_{k:\, \delta_k^j = 1} K(Y_k, Y^1).$$

Now let us move on to the entire system of constraints $A_j(Y) \geq 1$ for all $j = 1, 2, \dots, |K^-|$. This system is satisfied by the points of the set

$$\bigcap_{j=1}^{|K^-|} \ \bigcup_{k:\, \delta_k^j = 1} K(Y_k^j, Y^1),$$

which, in the final analysis, is the union of a finite number of subcubes. The limiting points of the feasible space can lie at quite different levels of the point $Y^1$, and their number can in the worst case reach $C_n^{\lfloor n/2 \rfloor}$, the cardinality of the middle level.

Next, let us consider the objective function

$$C(Y) = \sum_{b \in K^+} \prod_{i:\, b_i \neq a_i} (1 - y_i)$$

for some "basic" observation $a \in K^+$. It can also be written as

$$C(Y) = \sum_{j=1}^{|K^+|} \prod_{i=1}^n (1 - \Delta_i^j y_i), \quad \text{where } \Delta_i^j = \begin{cases} 1, & \text{if } b^j_i \neq a_i; \\ 0, & \text{if } b^j_i = a_i. \end{cases}$$

The function $C(Y)$ increases monotonically from the point $Y^1 = (1, 1, \dots, 1)$, where it takes the value 1, which corresponds to covering only the "basic" observation $a$. Its largest value, equal to $|K^+|$, the function $C(Y)$ takes at the point $Y^0 = (0, 0, \dots, 0)$.

Let us suppose that the observation $b \in K^+$ closest to the observation $a$ differs from it in the values of $s$ components:

$$s = \min_{b \in K^+ \setminus \{a\}} \sum_{i=1}^n |a_i - b_i|.$$

At all points $Y \in O_k(Y^1)$, $k = 0, 1, \dots, s - 1$, the values of the objective function are the same and equal to 1. The presence of such a constancy set complicates the work of optimization algorithms that start the search from the feasible point $Y^1$ and move along neighboring points, since computing the objective function in a neighborhood consisting of neighboring points gives no information about the best direction of the search. In practical problems of large dimension this constancy set can be such that most points of the feasible space belong to it. This complicates, or makes impossible, the operation of algorithms such as genetic algorithms or multistart local search.

Another consequence of the presence of constancy sets of the objective function is that the optimal solution is not only a point belonging to the subset of limiting points of the feasible space; it can be a whole set of points representing a constancy set of the objective function.

The following statement holds.

Proposition 1. The limiting points of the feasible space of problem (1)–(2) correspond to the prime patterns.

The proof follows from the definitions of the limiting point and the prime pattern.
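Proposition 1 suggests a direct check: a feasible point $Y$ corresponds to a prime pattern exactly when dropping any fixed literal breaks feasibility. A minimal sketch on the same toy data as above (invented for illustration):

```python
# Checking that a feasible point Y of problem (1)-(2) corresponds to a prime pattern.
K_pos = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1)]
K_neg = [(0, 0, 1, 0), (0, 1, 0, 1)]
a = K_pos[0]
n = len(a)

def feasible(Y):                   # constraints (2)
    return all(any(Y[i] == 1 for i in range(n) if c[i] != a[i]) for c in K_neg)

def is_prime(Y):
    # Dropping any fixed literal (y_k: 1 -> 0) must break feasibility,
    # i.e. the shortened term must cover some negative observation.
    assert feasible(Y)
    drops = [tuple(0 if i == k else y for i, y in enumerate(Y))
             for k in range(n) if Y[k] == 1]
    return all(not feasible(Z) for Z in drops)

print(is_prime((1, 0, 0, 0)))      # True: x_1 alone is a prime pattern here
print(is_prime((1, 0, 0, 1)))      # False: the literal x_4 can be removed
```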
It should be noted that a found prime pattern is not necessarily a strong pattern. This property is guaranteed only for the optimal solution of the problem.

Proposition 2. The optimal solution of problem (1)–(2) corresponds to a strong prime pattern.

Proof. From the previous proposition it follows that the optimal solution corresponds to a prime pattern. The value of the objective function of the problem is the coverage of the pattern. If the solution is optimal, then there is no pattern $P^a$ (covering the basic observation $a$) that is better by the criterion of evidence, and hence this solution corresponds to a strong pattern.

Thus, when approximate optimization algorithms are applied, it can be asserted that the found pattern is prime, but it is not necessarily strong. If an exact optimization algorithm is used, the found pattern is a strong prime pattern.

6 Search for Strong Spanned Patterns

An optimization model is determined, first of all, by the way in which the variables defining the alternatives of the optimization problem are introduced. In the optimization model discussed above, the alternatives, that is, the patterns, are determined by including or not including in the pattern the conditions that are satisfied in some basic observation.

Let us consider an alternative way of specifying the pattern. Let us introduce the binary variables

$$z_b = \begin{cases} 1, & \text{if observation } b \in K^+ \text{ is covered by } P^a, \\ 0, & \text{otherwise.} \end{cases}$$

To ensure that the resulting pattern is spanned, let us include in it all the literals shared by all positive observations that the pattern covers, that is, the literals for which the following condition is true:

$$\prod_{b \in K^+:\, z_b = 1} (1 - |b_i - a_i|) = 1.$$

Now let us formulate the problem of finding the maximum pattern as the maximization of the coverage of positive observations under the condition that no negative observation is covered:

$$\sum_{b \in K^+} z_b \to \max_Z, \tag{4}$$

$$\sum_{i:\, c_i \neq a_i} \ \prod_{b \in K^+:\, z_b = 1} (1 - |b_i - a_i|) \geq 1 \quad \text{for all } c \in K^-. \tag{5}$$

Having solved this optimization problem, we obtain the set of positive observations covered by the desired pattern. The pattern itself can be recovered through the characteristic variables $Y = (y_1, y_2, \dots, y_n)$:

$$y_i = \prod_{b \in K^+:\, z_b = 1} (1 - |b_i - a_i|), \quad i = 1, \dots, n.$$

In this formulation of the optimization problem, the objective function is strictly monotone, and there are no constancy sets of the objective function.

Proposition 3. In problem (4)–(5), the limiting points of the feasible space, and only they, correspond to strong spanned patterns.

Proof. Let us take a point $Z_k \in O_k(Z^0)$, where $Z^0 = (0, 0, \dots, 0)$ and $Z = (z_1, \dots, z_{m^+})$, $m^+ = |K^+|$. For any point $Z_{k+1} \in O_1(Z_k) \cap O_{k+1}(Z^0)$, the inclusion $Cov(T_k) \subset Cov(T_{k+1})$ is fulfilled, where $T_k$ and $T_{k+1}$ are the terms corresponding to the points $Z_k$ and $Z_{k+1}$ respectively. If $Z_k$ is a limiting point of the feasible set, then every point $Z_{k+1} \in O_1(Z_k) \cap O_{k+1}(Z^0)$ is infeasible and the term $T_{k+1}$ is not a pattern; consequently, the term $T_k$ is a strong pattern (by the definition of a strong pattern). In turn, if the point $Z_k$ is not a limiting point, then there is a feasible point $Z_{k+1} \in O_1(Z_k) \cap O_{k+1}(Z^0)$ and the corresponding term $T_{k+1}$, which is a pattern with larger coverage $Cov(T_{k+1}) \supset Cov(T_k)$, so the term $T_k$ is not a strong pattern.

7 Experimental Research

Experimental studies were conducted on the problem of predicting complications of myocardial infarction; the data set is described in [4].
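The alternative model (4)–(5) can likewise be illustrated by brute force on toy data: enumerate the vectors $Z$, derive the spanned pattern through the characteristic variables $y_i$, and keep a feasible $Z$ with maximum coverage. Again, the enumeration is exponential and purely illustrative; the paper solves the problem with the algorithms of [1].

```python
# Brute-force sketch of the alternative model (4)-(5) on toy data (illustration only).
from itertools import product

K_pos = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1)]
K_neg = [(0, 0, 1, 0), (0, 1, 0, 1)]
a = K_pos[0]
n, m = len(a), len(K_pos)

def spanned_y(Z):
    # y_i = 1 iff all covered positive observations agree with a in coordinate i
    covered = [b for b, z in zip(K_pos, Z) if z == 1]
    return tuple(int(all(b[i] == a[i] for b in covered)) for i in range(n))

def feasible(Z):                    # constraint (5): no negative point is covered
    Y = spanned_y(Z)
    return all(any(Y[i] == 1 for i in range(n) if c[i] != a[i]) for c in K_neg)

# Require the basic observation a (index 0) itself to be covered.
candidates = [Z for Z in product((0, 1), repeat=m) if Z[0] == 1 and feasible(Z)]
best = max(candidates, key=sum)     # objective (4): number of covered positives
print(best, spanned_y(best))        # here: Z=(1,1,1), spanned pattern x_1 AND x_4
```

On this toy instance the resulting spanned pattern has a higher degree (two literals) than the prime pattern found by model 1 (one literal), which mirrors the comparison of the two models in the experiments below.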
The sample consists of 1700 observations (patients), each described by 112 attributes of different types (symptoms). Based on the input data, it is required to predict the onset of four complications of the disease: ventricular fibrillation (VF), auricular fibrillation (AF), pulmonary edema (PE) and cardiac rupture (CR). 85% of the sample observations were used for training, the rest for control.

The search for patterns was implemented with the help of two optimization models: the search for maximum prime patterns (model 1) and the search for strong spanned patterns (model 2). The optimization problems were solved using the algorithms of conditional pseudo-Boolean optimization described in [1].

Table 1 presents comparative results on the average degree of the obtained patterns (the number of literals in a term) and the accuracy of recognition of the control observations. As can be seen from the results, the use of strong spanned patterns makes it possible to obtain classifiers with better generalizing ability.

Table 1. Results of experimental research

| Forecasting problem | Class | Model 1 degree | Model 1 precision | Model 2 degree | Model 2 precision |
|---------------------|-------|----------------|-------------------|----------------|-------------------|
| VF                  | neg.  | 5              | 0.90              | 6              | 0.90              |
| VF                  | pos.  | 3              | 0.78              | 4              | 0.83              |
| AF                  | neg.  | 7              | 0.80              | 9              | 0.80              |
| AF                  | pos.  | 6              | 0.68              | 6              | 0.68              |
| PE                  | neg.  | 7              | 0.68              | 9              | 0.68              |
| PE                  | pos.  | 4              | 0.89              | 5              | 0.89              |
| CR                  | neg.  | 5              | 1.00              | 6              | 1.00              |
| CR                  | pos.  | 3              | 0.79              | 4              | 0.93              |

8 Conclusions

The paper examines the problem of searching for informative patterns by formalizing this search as a conditional pseudo-Boolean optimization problem. The properties of the constructed optimization model are analyzed, and a new alternative optimization model is proposed for finding strong spanned patterns.

References

1. Antamoshkin, A.N., Masich, I.S.: Search algorithms for conditional pseudo-Boolean optimization. Control, Communication and Security Systems 1, 103–145 (2016)
2. Bonates, T.O., Hammer, P.L., Kogan, A.: Maximum patterns in datasets. Discrete Applied Mathematics 156(6), 846–861 (2008)
3. Crama, Y., Hammer, P.L., Ibaraki, T.: Cause-effect relationships and partially defined Boolean functions. Annals of Operations Research 16, 299–325 (1988)
4. Golovenkin, S.E., Gorban, A.N., Shulman, V.A.: Complications of myocardial infarction: a database for approbation of recognition and prognosis systems. Preprint 6, Computing Center of the SB RAS, Krasnoyarsk (1997)
5. Hammer, P.L., Kogan, A., Simeone, B., Szedmak, S.: Pareto-optimal patterns in Logical Analysis of Data. Discrete Applied Mathematics 144(1–2), 79–102 (2004)
6. Holte, R.C., Acker, L., Porter, B.: Concept learning and the problem of small disjuncts. In: Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), pp. 813–818. Morgan Kaufmann, Detroit, MI (1989)